L. Error estimation in the Monte Carlo optimization

The aim of this appendix is to clarify the sources of error in the estimates of the NNBP values obtained with the MC optimization. There are three kinds of error, arising at different levels, which we denote $ \sigma_1$, $ \sigma_2$ and $ \sigma_3$:

  1. The first error, $ \sigma_1$, comes from the fitting algorithm. The uncertainties of the estimated NNBP energies ($ \sigma _{\epsilon _i}$) indicate how much the error function $ E(\epsilon_1,\dots,\epsilon_{10},\epsilon_\mathrm{loop})$ (see Eq. 5.2) changes when the fitting parameters $ \epsilon _i$ are varied around the minimum. For instance, a variation of the AA/TT motif ($ \delta \epsilon_1$) around the minimum (see Fig. L.1) produces a larger change in the error function than the same variation of the TA/AT motif ($ \delta \epsilon_{10}$). This indicates that the uncertainty of AA/TT is lower than that of TA/AT. The curvature of the error function at the minimum along each direction $ \epsilon _i$ determines the uncertainty: the larger the curvature, the smaller the uncertainty. There is a different set of uncertainties $ \sigma _{\epsilon _i}$ for each fit (i.e., each molecule). A quantitative evaluation of the uncertainty of the NNBP parameters requires the $ \chi^2$ function of each FDC (i.e., each fit), which is given by:

    $\displaystyle \chi^2(\vec{\epsilon})=\sum_{i=1}^{N}\left( \frac{f_i-f(x_i;\vec{\epsilon})}{\sigma_y} \right)^2$ (L.1)

    where $ N$ is the number of experimental points of the FDC; $ x_i$ and $ f_i$ are the position and the force measurements, respectively; $ \vec{\epsilon}$ is the vector of fitting parameters $ \{\epsilon_i\}$, $ i = 1,\dots,10,\mathrm{loop}$; $ f(x_i;\vec{\epsilon})$ is the theoretically predicted FDC according to the model (see Sec. 3.4.1); and $ \sigma_y$ is the experimental error of the force measurements performed with the optical tweezers, taken to be the resolution of the instrument, $ \sigma_y = 0.1$ pN. The uncertainty of the fit parameters is given by the following expression [162]:

    $\displaystyle \sigma_{\epsilon_i}=\sqrt{C_{ii}}$ (L.2)

    where $ C_{ii}$ are the diagonal elements of the variance-covariance matrix $ C_{ij}$. In a non-linear least-squares fit, this matrix can be obtained from $ C_{ij}=2\cdot (H^{-1})_{ij}$, where $ H^{-1}$ is the inverse of the Hessian matrix

    $\displaystyle H_{ij}=\frac{\partial^2 \, \chi^2(\vec{\epsilon}_m) }{\partial \epsilon_i \, \partial \epsilon_j}$ (L.3)

    of $ \chi^2(\vec{\epsilon})$ evaluated at the point $ \vec{\epsilon}_m$ that minimizes the error. Note that the error function and the $ \chi^2$ function are related by a constant factor, $ \chi^2(\vec{\epsilon})=(N/\sigma_y^2)\cdot E(\vec{\epsilon})$, so their Hessians are related by the same constant factor. The calculation of $ \sigma _{\epsilon _i}$ is straightforward and gives values between $ 0.003$ and $ 0.015$ kcal$ \cdot $mol$ ^{-1}$. These values represent the first type of error, which we call $ \sigma_1$. Note that the Hessian matrix evaluated at the minima found with the heat-quench algorithm is very similar to the Hessian matrix evaluated at the minimum $ \vec{\epsilon}_m$, which means that the curvature is almost the same in all heat-quench minima. Therefore the error of the fit ($ \sigma_1$) takes essentially the same value within a region of $ \pm0.1$ kcal$ \cdot $mol$ ^{-1}$ around the minimum.
  2. The second error, $ \sigma_2$, comes from the dispersion of the heat-quench minima. As we saw previously, the algorithm finds several minima for the same molecule, each corresponding to a different possible solution (a set of 10 NNBP energies). The values of the NNBP energies corresponding to the different solutions are Gaussian distributed (see Fig. 5.8c), with an average standard deviation of about $ 0.05$ kcal$ \cdot $mol$ ^{-1}$. This gives a typical second error $ \sigma_2 = 0.05$ kcal$ \cdot $mol$ ^{-1}$.
  3. Finally, the third error, $ \sigma_3$, corresponds to the molecular heterogeneity intrinsic to single-molecule experiments. Such heterogeneity results in a variability of solutions among different molecules. Indeed, the FDCs of the molecules are never identical, and this variability leads to differences in the values of the NNBP energies. This variability is the major source of error in our results. The error bars in Figs. 5.12c,d and 5.13 indicate the standard error of the mean, which is around 0.1 kcal$ \cdot $mol$ ^{-1}$ on average. This is what finally determines the statistical error of our analysis, $ \sigma_3 = 0.1$ kcal$ \cdot $mol$ ^{-1}$.
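
The evaluation of $ \sigma_1$ described in item 1 (Eqs. L.1 to L.3) can be sketched numerically as follows. This is only an illustrative sketch, not the code used for the analysis: the finite-difference Hessian, the step `h`, and the two-parameter `toy_model` standing in for the theoretical FDC of Sec. 3.4.1 are all assumptions introduced here.

```python
import numpy as np

def chi2(eps, x, f_obs, model, sigma_y=0.1):
    """Eq. L.1: squared residuals in units of the force error sigma_y (pN)."""
    r = (f_obs - model(x, eps)) / sigma_y
    return float(np.sum(r**2))

def hessian(fun, eps_m, h=1e-3):
    """Central finite-difference Hessian of fun at eps_m (Eq. L.3).
    The step h is an assumed value, not taken from the text."""
    eps_m = np.asarray(eps_m, dtype=float)
    n = eps_m.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (fun(eps_m + ei + ej) - fun(eps_m + ei - ej)
                       - fun(eps_m - ei + ej) + fun(eps_m - ei - ej)) / (4 * h * h)
            H[j, i] = H[i, j]
    return H

def fit_uncertainties(fun, eps_m, h=1e-3):
    """Eq. L.2 with C = 2 H^{-1}: one uncertainty per fitting parameter."""
    C = 2.0 * np.linalg.inv(hessian(fun, eps_m, h))
    return np.sqrt(np.diag(C))

# Hypothetical stand-in for the theoretical FDC: a straight line.
def toy_model(x, eps):
    return eps[0] + eps[1] * x

x = np.linspace(0.0, 1.0, 50)
eps_true = np.array([15.0, -1.5])
f_obs = toy_model(x, eps_true)  # noiseless data, so eps_true is the minimum
sigma = fit_uncertainties(lambda e: chi2(e, x, f_obs, toy_model), eps_true)
```

For a model that is linear in the parameters, as in this toy case, the result can be compared with the closed-form least-squares covariance, which is a convenient sanity check before applying the same procedure to the full 11-parameter problem.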

Since the major source of error is the variability of the results from molecule to molecule, we simply report this last error in the manuscript. Because $ \sigma_3 > \sigma_2 > \sigma_1$, we can safely conclude that propagating the errors of the heat-quench algorithm will not appreciably increase the final value of the error bar.
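
As a rough check of this statement, assume that the three contributions are independent and add in quadrature (an assumption introduced here for illustration), and take the upper value $ \sigma_1 = 0.015$ kcal$ \cdot $mol$ ^{-1}$:

    $\displaystyle \sigma_\mathrm{tot}=\sqrt{\sigma_1^2+\sigma_2^2+\sigma_3^2}=\sqrt{0.015^2+0.05^2+0.1^2}\approx 0.11$ kcal$ \cdot $mol$ ^{-1}$

Under this assumption the combined error remains close to $ \sigma_3 = 0.1$ kcal$ \cdot $mol$ ^{-1}$, and the contribution of $ \sigma_1$ is negligible.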

Figure L.1: Error function (see Eq. 5.2) around the minimum for small variations of some NNBP energies. Blue dots show the error function evaluated at different values of $ \epsilon _i$. Orange curves show the quadratic fits around the minimum according to $ E(\epsilon )=c/2\cdot (\epsilon -\epsilon _0)^2+E_0$, where $ c$, $ \epsilon _0$ and $ E_0$ are fitting parameters. Red crosses show the solutions found with the MC algorithm, which differ from the minimum by less than 0.05 kcal$ \cdot $mol$ ^{-1}$. Note that the curvature ($ c$) of the error function is different for each NNBP parameter. The curvature allows us to estimate the error of the fitting parameters. We have checked that the curvature of the quadratic fit (i.e., the value of the parameter $ c$) for each NNBP parameter coincides with the corresponding diagonal element of the Hessian matrix of the error function (Eq. L.3, up to the constant factor relating $ E$ and $ \chi^2$), which gives the curvature of the error function in the 11-dimensional space.
\includegraphics[width=\textwidth]{figs/appendix5/errorfunction.eps}
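
The quadratic fits of Fig. L.1 can be reproduced along the following lines. This is a minimal sketch under stated assumptions: the scan values are synthetic, `curvature_from_scan` and `sigma_from_curvature` are hypothetical helper names, and the conversion to an uncertainty treats each parameter on its own, i.e., it neglects the off-diagonal elements of the Hessian.

```python
import numpy as np

def curvature_from_scan(eps, E):
    """Fit E(eps) = c/2 * (eps - eps0)**2 + E0 and return (c, eps0, E0)."""
    a2, a1, a0 = np.polyfit(eps, E, 2)   # E = a2*eps^2 + a1*eps + a0
    c = 2.0 * a2
    eps0 = -a1 / (2.0 * a2)
    E0 = a0 - a1**2 / (4.0 * a2)
    return c, eps0, E0

def sigma_from_curvature(c, n_points, sigma_y=0.1):
    """One-parameter estimate of the fit uncertainty.
    chi2 = (N / sigma_y**2) * E, so H = (N / sigma_y**2) * c and
    sigma = sqrt(2 / H). Off-diagonal Hessian terms are ignored (assumption)."""
    return sigma_y * np.sqrt(2.0 / (n_points * c))

# Synthetic scan around a hypothetical minimum at eps0 = -1.25 kcal/mol
eps_scan = np.linspace(-1.35, -1.15, 21)
E_scan = 0.5 * 4.0 * (eps_scan + 1.25)**2 + 0.02
c, eps0, E0 = curvature_from_scan(eps_scan, E_scan)
```

Because the diagonal elements of the inverse of a positive-definite Hessian are never smaller than the reciprocals of its diagonal elements, this one-parameter estimate is a lower bound on the uncertainty given by Eq. L.2.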

JM Huguet 2014-02-12