Tutorial 5: Nonlinear Regression

© 2017 Griffin Chure. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For curve-fitting through least-squares minimization
import scipy.optimize

# For reading data from a csv file.
import pandas as pd

# For displaying plots in the notebook
%matplotlib inline

In this tutorial, we will learn how to perform nonlinear regression using the $\chi^2$ statistic by examining the dissociation constant of molecular oxygen to human myoglobin.

What is probability?

One of the most powerful skills a scientist can possess is a knowledge of probability and statistics. I can nearly guarantee that regardless of discipline, you will have to perform a nonlinear regression and report the value of a parameter. Before we get into the details, let's briefly touch on what we mean we report the "value of a parameter".

What we seek to identify through parameter estimation is the value of a given parameter which best describes the data under a given model. Estimation of a parameter is intimately tied to our interpretation of probability of which there are two major schools of thought (Frequentism and Bayesian). A detailed discussion of these two interpretations is outside the scope of this tutorial, but see this wonderful 25 minute video on Bayesian vs Frequentist probability by Jake VanderPlas. Also see BE/Bi 103 - Data Analysis in the Biological Sciences by Justin Bois, which is one of the best courses I have ever taken.

How do we estimate a parameter?

Above, we defined a parameter value as that which best describes our data given a specific model, but what do we mean by "best describes"? Imagine we have some set of data ($x$ and $y$), which are linearly correlated, as is shown in the figure below.

We can generate a model for this process of the form

$$ y = \alpha x \tag{1} $$

where $\alpha$ is the parameter we are interested in estimating. Our gut instinct is to say that the best estimate of $\alpha$ is the value in which the difference between each data point and the line resulting from our model is the smallest. This value is called the residual.

We can assume that the noise in our data is Gaussian distributed about this model, meaning that we are assuming there are no outliers. Under this assumption, we can compute the sum of the squares of the residuals of all the data points for a range of values of $\alpha$. This is called the Chi-squared statistic,

$$ \chi^2 = \sum\limits_{i} (y_i - y(\alpha))^2 \tag{2} $$

where $y_i$ is the $i^\text{th}$ data point and $y(\alpha)$ is the predicted value of $y$ using a given value of $\alpha$. The value of $\alpha$ which minimizes $\chi^2$ is deemed as the best-fit parameter value.

In the following section, we'll use the Chi squared statistic to determine the binding energy of Oxygen to a protein we all depend on $-$ Myoglobin

An Example: Transporting $\mathrm{O}_2$ in Muscle

One could argue that the study of the mechanisms of oxygen transport has resulted in the discovery of priciples as vital to biology as the structure of DNA. In 1903, Christian Bohr (father of Niels Bohr) described the relationship between the concentration of CO2 and the ability of hemoglobin to bind gaseous oxygen. This discovery suggested that different copies of hemoglobin interacted with each other (now known to be between subunits) and influenced their ability to bind to oxygen. This observation sparked a vibrant community of research. In 1965, Jacques Monod, Jeffries Wyman, and Jean-Pierre Changeux described a physical description for this affect coined "allostery" which Monod uniquely described as "the second secret of life."

Now, we know that interactions between the different subunits in a functional hemoglobin molecule facilitate the allosteric effects described by Monod, Wyman, and Changeux. However, there exists another oxygen carrying molecule found in muscle tissue called Myoglobin which exists as a monomer. With a single protein and binding site, the binding of oxygen to Myoglobin is driven alone by the affinity of Oxygen to the heme group in the protein.

Like Hemoglobin, Myoglobin has recieved an extensive degree of research. In 1958, Alessandro Ross-Fanelli and Eduardo Antonini performed an extensive study on the binding affinity of oxygen to human Myoglobin (purified from cadavers!) under a variety of temperatures, pH, and oxygen pressures.

In class, we used the tools of statistical mechanics to craft a equation that described the regulation of gene expression as a function of repressor copy number. We can treat Oxygen binding to myoglobin using the same set of principles. As is routine in the statistical mechanical protocol, we can enumerate the states of the system and compute the statistical weights, as is shown below.

In the above diagram, we've defined $c$ as the pressure of Oxygen, $c_0$ as the reference pressure of oxygen (which we will use as 1 atmosphere), $\Delta\varepsilon$ as the difference in energy between pulling an Oxygen molecule out of solution and coordinating with the Myoglobin, and $k_BT$ as the thermal energy of the system.

Using these states and statistical weights, we can write the probability of an Oxygen molecule binding Myoglobin as

$$ P_\text{bound} = {{c \over c_0}e^{-\Delta\varepsilon / k_BT} \over 1 + {c \over c_0}e^{-\beta\Delta\varepsilon / k_BT}}. \tag{3} $$

In their paper, Ross-Fanelli and Antonini generated a model using the Hill Equation in which they measured the dissociation constant ($K$) of Oxygen as a function of Oxygen pressure and temperature (in °C) of the system as is shown in Figure 2.

Today, we will determine the binding energy of Oxygen to Myoglobin ourselves using the curve-fitting algorithm described in the first section of this tutorial. As these data are not available in digital form (the paper was published in 1958 after all!), we have extracted the data from the image above using the wonderful tool DigitizeIt which can extract data from a screenshot or photo. You should download the data and place it in the data/ folder in your current working directory. If you open the data file in Excel or any text-file reader, it should look like this:

# These data correspond to the Oxygen dissociation 
# constant determination for human myglobin as presented
# in figure 2 of A. Ross-Fanelli and E. Antonini 1958. 
# Briefly, purified hemoglobin protein was incubated with 
# gaseous oxygen at a range of pressures over a 
# range of concentrations. The fraction of the myoglobin
# that was bound with molecular oxygen was measured and 
# tabulated. These data were collected from that figure
# through the DigitizeIt software. The temperature is
# given in units of degrees celsius and the 
# pressure of oxygen is given in units of mmHg.

Note, at the top of this file is a series of comments prepended with a hash symbol (#) which indicates a comment. When loading the data, we'll have to tell our computers to ignore these lines.

To begin, let's load up the data. To read in the data, we'll use the powerful data management tool in Python called Pandas. While we will only touch the bare bones of the package, it may be worth your time to see what is available.

In [4]:
# Load the data using the pandas `read_csv` function.
data = pd.read_csv('data/ross_fanelli_1958.csv', comment='#')

# Look at the beginning of the data file. 
temp pO2 frac
0 10 0.396 0.527
1 10 0.397 0.560
2 10 0.766 0.661
3 10 0.967 0.768
4 10 1.506 0.881

Loading the CSV file using pandas creates something called a DataFrame. In the code cell above, we've assigned this DataFrame to a variable called data. By typing data.head(), we see nicely formatted preview of the first 6 lines of the data.

We see that these data are organized in the tidy data format where each individual data point has a temperature, the O2 pressure at which the experiment was performed in mmHg, and the fractional saturation ($P_\text{bound}$) of Oxygen to Myoglobin. To begin our analysis, let's plot all of the data and see if it agrees with the figure shown above from the paper.

To do this, we will want to go through our DataFrame and extract all of the data points from a given temperature. We can do this by indexing the DataFrame. For example, we can extract the column titled 'pO2' by indexing as follows:

In [5]:
# Extract only the pO2 column.
0     0.396
1     0.397
2     0.766
3     0.967
4     1.506
5     1.901
6     0.162
7     0.303
8     0.391
9     0.505
10    0.606
11    0.976
12    0.989
13    1.402
14    1.502
15    2.520
16    2.901
17    3.001
18    4.526
19    4.907
20    5.006
21    0.487
22    0.999
23    1.496
24    2.006
25    2.501
26    4.509
27    6.019
28    0.499
29    0.996
30    2.017
31    3.008
32    4.506
33    7.020
34    0.994
35    3.019
Name: pO2, dtype: float64

We can also grab only the data points which were measured at Oxygen pressures greater than 5 mmHg.

In [6]:
# Get only the data where pO2 is greater than 5.0 mmHg
data[data['pO2'] > 5.00]
temp pO2 frac
20 20 5.006 0.899
27 30 6.019 0.798
33 35 7.020 0.775

Now that we know how to slice Pandas DataFrames, let's plot all of the data points from the paper.

In [7]:
# Set up the temperatures we want to look at.
temps = [10, 20, 30, 35, 40]

# Go through the data frame and extract the desired data
for temp in temps:

    # Separate the temperature of interest
    temp_data = data[data['temp']==temp]
    # Plot the data
    plt.plot(temp_data['pO2'], temp_data['frac'], 'o', label='%s° C' %temp)

# Add axes labels and a legend
plt.xlabel('pO$_2$ (mm Hg)')
plt.ylabel('fraction bound')
plt.xlim([0, 8])
<matplotlib.legend.Legend at 0x10f86a5f8>

Notice that when we plotted the points, we were able to extract only the values of the pressure by indexing the data column titled pO2 as data['pO2'].

We see the same range of diversity in the fractional saturation as Ross-Fanelli and Antonini did in their figure 2. Let's begin our analysis by fitting the binding energy to only one temperature, 20° C.

In [8]:
# Take a look at just the 20 degree data
data_20 = data[data['temp']==20]

# Plot it
plt.plot(data_20['pO2'], data_20['frac'], 'o')
plt.ylabel('fraction bound')
<matplotlib.text.Text at 0x113ea3780>

As described in the first section, to determine the value of $\Delta\varepsilon$, we want to minimize the sum of the squares of the residuals from data and the value of the model using that parameter value. To begin, we will write a function that will compute $P_\text{bound}$ (Equation 3) and the $\chi^2$ statistic (Equation 2).

In [74]:
def p_bound(c, ep, c_0=760):
    Computes the probability of Oxygen being bound to Myoglobin at 
    a given pressure and binding energy.
    c : 1d-array or float
        The pressure of Oxygen used in the experiment. This should be
        in units of mmHg.
    ep : float
        The binding energy of Oxygen to Myoglobin. This should be in
        units of kBT.
    c_0 : float
        The reference pressure for Oxygen. The default value is 
        760 mmHg (1 atm).
    prob : 1d-array or float
        The probability of oxygen being bound given the parameters. This
        is the same shape as c.
    numer =  (c / c_0) * np.exp(-ep)
    denom = 1 + numer
    prob = numer / denom
    return prob

def chi_sq(prob, c, ep, c_0=760):
    Computes the sum of the squares of the residuals from a given 
    probability of Oxygen being bound to Myoglobin.
    prob : 1d-array or float
        The experimentally measured probability of Oxygen being
    c : 1d-array or float
        The pressure of Oxygen used in the experiment. This should
        be in units of mmHg.
    ep : float
        The binding energy of Oxygen to Myoglobin. This should be in 
        units of kBT.
    c_0 : float
        The reference pressure for Oxygen. The default value is 760 mmHg.
    sum_squares : 1d-array
        The  sum of the squares of the residuals from the predicted
        probability of being bound and the expeimentally measured value.
    theo = p_bound(c, ep, c_0)
    sum_squares = np.sum((prob - theo)**2)
    return sum_squares

With these functions defined, let's evaluate $\chi^2$ over a range of possible values for the binding energy. We know that Oxygen does bind to Myoglobin, this means that the binding energy cannot be greater than 0 $k_BT$. What is the strongest that this interaction could be? One of the strongest known non-covalent interactions is the strepavidin-biotin binding, which has a binding energy of $\approx -30\, k_BT$. Let's look at the $\chi^2$ between these two extremes.

In [75]:
# Try for a range of ep values using 300 points. 
ep_range = np.linspace(-30, 0, 300)

# Set up an empty vector where we will keep track of the chi sq. value. 
sum_squares =  np.zeros_like(ep_range)
for i in range(len(ep_range)):
    sum_squares[i] = chi_sq(data_20['frac'],  data_20['pO2'], ep_range[i])

# Plot the sums quares of the residuals as a function of our ep of interest. 
plt.plot(ep_range, sum_squares, '.')
plt.xlabel('$\Delta\epsilon$ $(k_BT)$')
<matplotlib.text.Text at 0x117cd11d0>

Ah-ha! We see a minimum between $-10$ and $-5\, k_BT$. Let's plot this again except only looking between this range.

In [76]:
# Restrict the limits of the x axis.
plt.plot(ep_range, sum_squares, '.')
plt.xlabel('$\Delta\epsilon$ $(k_BT)$')

# Set the limits so we can see the minimum by eye
plt.xlim([-10, -5])
plt.ylim([-0.1, 3])
(-0.1, 3)

From this plot, we can see that a value of $\Delta\varepsilon$ around $-7\, k_BT$ seems to be the minimum of this curve. We can confirm this by taking the minimum of our sum_squares variable.

In [77]:
# Find the minimum value of the sum_squares variable. 
min_val = np.min(sum_squares)

# Find the index of this minimum value within the k_range array. The `np.where`
# function will tell us the index at which a given condition is met. 
ind = np.where(sum_squares == min_val)

# Use this index to get the value of ep which gives the lowest chi sq.
min_ep = ep_range[ind][0]
print("Our 'by-eye' best parameter value is %.2f kBT" %min_ep)
Our 'by-eye' best parameter value is -7.02 kBT

We were pretty close with our by-eye guess! We can see if we are on the right track here by plotting our data with our fit parameter value for $\Delta\varepsilon$, along with some values we think will be a "bad" fit.

In [78]:
# Set up a range of pO2 over which to calculate our fit curve.
pO2_range = np.linspace(0, 8, 500)

# Calculate our fit curve. 
fit_eye = p_bound(pO2_range, min_ep)

# Choose two parameters that would lead to a "bad" fit. 
large_neg_ep = -10
small_neg_ep = -5
large_neg_fit = p_bound(pO2_range, large_neg_ep)
small_neg_fit = p_bound(pO2_range, small_neg_ep)

large_ep = p_bound(pO2_range, -1)

# Plot our data and fits.
plt.plot(data_20['pO2'], data_20['frac'], 'o', label='20° C data')
plt.plot(pO2_range, fit_eye, '-',
         label='By-eye fit, $\Delta\epsilon = %.2f$ $k_BT$' %min_ep)
plt.plot(pO2_range, small_neg_fit, '-',
         label='Bad fit, $\Delta\epsilon = %.2f$ $k_BT$' %small_neg_ep)
plt.plot(pO2_range, large_neg_fit, '-',
         label='Bad fit, $\Delta\epsilon = %.2f$ $k_BT$' %large_neg_ep)
# Note that the %.2f formats the string as a float to the second decimal point.

# Add the labels and legends.
plt.xlabel('pO2 (mmHg)')
plt.ylabel('fraction bound')
<matplotlib.legend.Legend at 0x119ce0fd0>

Our bye-eye fit looks pretty good, while the choices with larger values of $\chi^2$ look pretty bad. However, we really only got the ball park of what the best-fit value of the fit is. The value we found is entirely dependent on how finely we spread our values within the ep_range variable. Remember, we only chose 300 points! If we chose 1,000,000 points, we would get a different answer.

Determining the best-fit parameter values for a given model is a subject that has received an enormous amount of research and attention. There are many ways in which we can statistically infer a parameter value.

Using the built-in functions for curve fitting

Now that we know what is happening under the hood, we can perform this same fitting using the professional algorithms packaged in the scipy.optimize module. From this package, we will use the curve_fit module which will perform the least-squares fitting we've described above. Rather than drawing a very finely spaced array of possible values, it uses the Levenberg-Marquardt algorithm. This algorithm identifies a local minimum in the $\chi^2$ given an initial guess value. It will then go through another round of this local minimum detection until a stopping criterion is satisfied. If you're interested in the details of this optimization procedure, see this set of notes from Henry P. Gavin at Duke University.

The scipy.optimize.curve_fit function returns two parameters $-$ the optimal value found by the algorithm and the approximate covariance matrix of the parameter estimate. Manipulation of this matrix extends beyond the context of this class, so we will simply call it junk.

Let's give this algorithm a spin and see how accurate our by-eye guess is.

In [79]:
# Set the initial guess
p0 = min_ep

# Find the best-fit parameter using the curve_fit function
popt, junk = scipy.optimize.curve_fit(p_bound, data_20['pO2'], 
                                      data_20['frac'], p0=p0)

# Extract the value of the best-fit parameter.
best_ep = popt[0]
print("The best-fit parameter value is %.2f k_BT" %best_ep)
The best-fit parameter value is -7.00 k_BT

While our by-eye fit was very close, it was quite the same as was determined using the Levenberg-Marquardt algortithm. Let's plot both of our fit curves along with the data and see to what degree they agree.

In [80]:
# Compute our best-fit curve. 
best_fit = p_bound(pO2_range, best_ep)

# Plot our data, our by-eye fit, and our LM fit.
plt.plot(data_20['pO2'], data_20['frac'], 'o', label='20° C data')
plt.plot(pO2_range, fit_eye, '-',
         label='By eye fit, $\Delta\epsilon = %.2f$ $k_BT$' %min_ep)
plt.plot(pO2_range, best_fit, ':',
         label='Best fit, $\Delta\epsilon = %.2f$ $k_BT$' %best_ep)

# Add axes labels and a legend
plt.xlabel('pO2 (mmHg)')
<matplotlib.legend.Legend at 0x116c02198>

Both fits seem pretty good. Let's now repeat this procedure across all of the other temperatures measured by Ross-Fanelli and Antonini and plot them all together.

In [81]:
# Set up a range of colors so the points and the line will be the same colors.
colors = sns.color_palette('deep')

# Set up an empty array to which we'll add the best-fit parameter values.
best_fit_vals = np.zeros(len(temps))

# Set the guess value. We can use the same one for each temperature. 
p0 = -5  # in units of kBT

# Start the loop through all of the temperatures.
for i in range(len(temps)):
    # Get the data of interest. 
    data_temp = data[data['temp']==temps[i]]
    # Fit the value for ep and extract.
    popt, junk = scipy.optimize.curve_fit(p_bound, data_temp['pO2'],
                                          data_temp['frac'], p0=-7)
    fit_ep = popt[0]
    # Add it to our array. 
    best_fit_vals[i] = fit_ep
    # Plot the data and the fit
    plt.plot(data_temp['pO2'], data_temp['frac'], 'o', color=colors[i],
             label='%s° C, $\Delta\epsilon=%.2f$ $k_BT$' %(temps[i], fit_ep))
    fit_curve = p_bound(pO2_range, fit_ep)
    plt.plot(pO2_range, fit_curve, '-', color=colors[i])

# Add labels and the legend
plt.xlabel('pO2 (mmHg)')
plt.legend(loc='lower right')
<matplotlib.legend.Legend at 0x11a5d44e0>