In [3]:

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For curve-fitting through least-squares minimization
import scipy.optimize
# For reading data from a csv file.
import pandas as pd
# For displaying plots in the notebook
%matplotlib inline
```

One of the most powerful skills a scientist can possess is a knowledge of probability and statistics. I can nearly guarantee that, regardless of discipline, you will have to perform a nonlinear regression and report the value of a parameter. Before we get into the details, let's briefly touch on what we mean when we report the "value of a parameter".

What we seek to identify through parameter estimation is the value of a given parameter which **best describes the data under a given model**. Estimation of a parameter is intimately tied to our interpretation of probability, of which there are two major schools of thought (frequentist and Bayesian). A detailed discussion of these two interpretations is outside the scope of this tutorial, but see this wonderful 25 minute video on Bayesian vs Frequentist probability by Jake VanderPlas. Also see BE/Bi 103 - Data Analysis in the Biological Sciences by Justin Bois, which is one of the best courses I have ever taken.

Above, we defined a parameter value as that which best describes our data given a specific model, but what do we mean by "best describes"? Imagine we have some set of data ($x$ and $y$), which are linearly correlated, as is shown in the figure below.

We can generate a model for this process of the form

$$ y = \alpha x \tag{1} $$

where $\alpha$ is the parameter we are interested in estimating. Our gut instinct is to say that the best estimate of $\alpha$ is the value for which the difference between each data point and the line resulting from our model is smallest. This difference is called the **residual**.

We can assume that the noise in our data is Gaussian distributed about this model, meaning that we are assuming there are no outliers. Under this assumption, we can compute the sum of the squares of the residuals of all the data points for a range of values of $\alpha$. This is called the Chi-squared statistic,

$$ \chi^2 = \sum\limits_{i} \left(y_i - y(x_i; \alpha)\right)^2 \tag{2} $$

where $y_i$ is the $i^\text{th}$ data point and $y(x_i; \alpha)$ is the value of $y$ predicted at $x_i$ for a given value of $\alpha$. The value of $\alpha$ which minimizes $\chi^2$ is deemed the best-fit parameter value.
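To make Equation 2 concrete, here is a minimal sketch (the data and the `chi_squared` helper are made up for illustration, not part of the original analysis) that evaluates $\chi^2$ over a grid of $\alpha$ values for the model $y = \alpha x$:

```python
import numpy as np

# Toy data roughly following y = 2x, with a little noise added by hand.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def chi_squared(alpha, x, y):
    """Sum of squared residuals for the model y = alpha * x (Equation 2)."""
    return np.sum((y - alpha * x) ** 2)

# Evaluate chi^2 on a grid of candidate alpha values and pick the minimizer.
alphas = np.linspace(0, 4, 401)
chi2 = np.array([chi_squared(a, x, y) for a in alphas])
best_alpha = alphas[np.argmin(chi2)]
```

With these made-up points, the grid minimum lands near $\alpha \approx 2$, as expected from how the data were constructed.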

In the following section, we'll use the chi-squared statistic to determine the binding energy of Oxygen to a protein we all depend on $-$ Myoglobin.

One could argue that the study of the mechanisms of oxygen transport has resulted in the discovery of principles as vital to biology as the structure of DNA. In 1903, Christian Bohr (father of Niels Bohr) described the relationship between the concentration of CO$_2$ and the ability of hemoglobin to bind gaseous oxygen. This discovery suggested that different copies of hemoglobin interacted with each other (an interaction now known to occur between subunits) and influenced their ability to bind oxygen. This observation sparked a vibrant community of research. In 1965, Jacques Monod, Jeffries Wyman, and Jean-Pierre Changeux proposed a physical description of this effect, coined "allostery", which Monod memorably described as "the second secret of life."

Now, we know that interactions between the different subunits in a functional hemoglobin molecule facilitate the allosteric effects described by Monod, Wyman, and Changeux. However, there exists another oxygen-carrying molecule found in muscle tissue called Myoglobin, which exists as a monomer. With a single protein and a single binding site, the binding of oxygen to Myoglobin is driven solely by the affinity of Oxygen for the heme group in the protein.

Like Hemoglobin, Myoglobin has received an extensive degree of research. In 1958, Alessandro Ross-Fanelli and Eduardo Antonini performed an extensive study on the binding affinity of oxygen to human Myoglobin (purified from cadavers!) across a variety of temperatures, pH values, and oxygen pressures.

In class, we used the tools of statistical mechanics to craft an equation that described the regulation of gene expression as a function of repressor copy number. We can treat Oxygen binding to Myoglobin using the same set of principles. As is routine in the statistical mechanical protocol, we can enumerate the states of the system and compute the statistical weights, as is shown below.

In the above diagram, we've defined $c$ as the pressure of Oxygen, $c_0$ as the reference pressure of oxygen (which we will use as 1 atmosphere), $\Delta\varepsilon$ as the difference in energy between pulling an Oxygen molecule out of solution and coordinating with the Myoglobin, and $k_BT$ as the thermal energy of the system.

Using these states and statistical weights, we can write the probability of an Oxygen molecule binding Myoglobin as

$$ P_\text{bound} = {{c \over c_0}e^{-\Delta\varepsilon / k_BT} \over 1 + {c \over c_0}e^{-\Delta\varepsilon / k_BT}}. \tag{3} $$

In their paper, Ross-Fanelli and Antonini generated a model using the Hill equation in which they measured the dissociation constant ($K$) of Oxygen as a function of Oxygen pressure and temperature (in °C) of the system, as is shown in Figure 2.
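As a quick numerical sanity check of Equation 3 (a sketch, not part of the original notebook; the small `p_bound` helper here simply mirrors the equation with $\Delta\varepsilon$ in units of $k_BT$), the functional form should give zero binding at zero pressure and saturate toward one as the pressure grows:

```python
import numpy as np

def p_bound(c, ep, c_0=760):
    # Equation 3: c and c_0 in mmHg (760 mmHg = 1 atm), ep in units of kBT.
    w = (c / c_0) * np.exp(-ep)
    return w / (1 + w)

# Limiting behavior: no oxygen -> nothing bound; huge pressure -> saturation.
low = p_bound(0.0, -7.0)
high = p_bound(1e9, -7.0)
```

A more negative (stronger) binding energy should also give a higher occupancy at any fixed pressure, which is easy to verify with a couple of calls.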

Today, we will determine the binding energy of Oxygen to Myoglobin ourselves using the curve-fitting algorithm described in the first section of this tutorial. As these data are not available in digital form (the paper was published in 1958, after all!), we have extracted the data from the image above using the wonderful tool DigitizeIt, which can extract data from a screenshot or photo. You should download the data and place it in the `data/` folder in your current working directory. If you open the data file in Excel or any text-file reader, it should look like this:

```
# These data correspond to the Oxygen dissociation
# constant determination for human myoglobin as presented
# in figure 2 of A. Ross-Fanelli and E. Antonini 1958.
# Briefly, purified myoglobin protein was incubated with
# gaseous oxygen at a range of pressures over a
# range of concentrations. The fraction of the myoglobin
# that was bound with molecular oxygen was measured and
# tabulated. These data were collected from that figure
# through the DigitizeIt software. The temperature is
# given in units of degrees celsius and the
# pressure of oxygen is given in units of mmHg.
temp,pO2,frac
10,0.396,0.527
10,0.397,0.560
...
```

Note that at the top of this file is a series of lines prepended with a hash symbol (`#`), which indicates a comment. When loading the data, we'll have to tell our computers to ignore these lines.

To begin, let's load up the data. To read in the data, we'll use the powerful data management tool in Python called Pandas. While we will only touch the bare bones of the package, it may be worth your time to see what is available.

In [4]:

```
# Load the data using the pandas `read_csv` function.
data = pd.read_csv('data/ross_fanelli_1958.csv', comment='#')
# Look at the beginning of the data file.
data.head()
```

Out[4]:

Loading the CSV file using Pandas creates something called a DataFrame. In the code cell above, we've assigned this DataFrame to a variable called `data`. By typing `data.head()`, we see a nicely formatted preview of the first five rows of the data.

We see that these data are organized in the tidy data format, where each individual data point has a temperature, the O$_2$ pressure at which the experiment was performed (in mmHg), and the fractional saturation ($P_\text{bound}$) of Oxygen bound to Myoglobin. To begin our analysis, let's plot all of the data and see if it agrees with the figure shown above from the paper.

To do this, we will want to go through our DataFrame and extract all of the data points from a given temperature. We can do this by indexing the DataFrame. For example, we can extract the column titled 'pO2' by indexing as follows:

In [5]:

```
# Extract only the pO2 column.
data['pO2']
```

Out[5]:

We can also grab only the data points which were measured at Oxygen pressures greater than 5 mmHg.

In [6]:

```
# Get only the data where pO2 is greater than 5.0 mmHg
data[data['pO2'] > 5.00]
```

Out[6]:
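As an aside (not shown in the original cells), the comparison above produces a boolean mask, and masks can be combined to filter on several conditions at once, with each condition wrapped in its own parentheses. A minimal sketch on a hypothetical miniature DataFrame whose columns match the real data file but whose values are made up:

```python
import pandas as pd

# A tiny stand-in DataFrame with the same columns as the real data file.
df = pd.DataFrame({'temp': [10, 10, 20],
                   'pO2':  [0.4, 6.2, 7.5],
                   'frac': [0.5, 0.95, 0.9]})

# Combine two boolean masks with & (and); | gives the logical or.
subset = df[(df['temp'] == 10) & (df['pO2'] > 5.0)]
```

Here `subset` keeps only the single row measured at 10° C with a pressure above 5 mmHg.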

Now that we know how to slice Pandas DataFrames, let's plot all of the data points from the paper.

In [7]:

```
# Set up the temperatures we want to look at.
temps = [10, 20, 30, 35, 40]

# Go through the DataFrame and extract the desired data.
for temp in temps:
    # Separate the temperature of interest.
    temp_data = data[data['temp'] == temp]

    # Plot the data.
    plt.plot(temp_data['pO2'], temp_data['frac'], 'o', label='%s° C' % temp)

# Add axes labels and a legend.
plt.xlabel('pO$_2$ (mm Hg)')
plt.ylabel('fraction bound')
plt.ylim([0, 1.1])
plt.xlim([0, 8])
plt.legend()
```

Out[7]:

Notice that when we plotted the points, we were able to extract only the values of the pressure by indexing the column titled `pO2` as `data['pO2']`.

We see the same range of diversity in the fractional saturation as Ross-Fanelli and Antonini did in their figure 2. Let's begin our analysis by fitting the binding energy to only one temperature, 20° C.

In [8]:

```
# Take a look at just the 20 degree data
data_20 = data[data['temp']==20]
# Plot it
plt.plot(data_20['pO2'], data_20['frac'], 'o')
plt.xlabel('pO2')
plt.ylabel('fraction bound')
```

Out[8]:

In [74]:

```
def p_bound(c, ep, c_0=760):
    """
    Computes the probability of Oxygen being bound to Myoglobin at
    a given pressure and binding energy.

    Parameters
    ----------
    c : 1d-array or float
        The pressure of Oxygen used in the experiment. This should be
        in units of mmHg.
    ep : float
        The binding energy of Oxygen to Myoglobin. This should be in
        units of kBT.
    c_0 : float
        The reference pressure for Oxygen. The default value is
        760 mmHg (1 atm).

    Returns
    -------
    prob : 1d-array or float
        The probability of Oxygen being bound given the parameters. This
        is the same shape as c.
    """
    numer = (c / c_0) * np.exp(-ep)
    denom = 1 + numer
    prob = numer / denom
    return prob


def chi_sq(prob, c, ep, c_0=760):
    """
    Computes the sum of the squares of the residuals from a given
    probability of Oxygen being bound to Myoglobin.

    Parameters
    ----------
    prob : 1d-array or float
        The experimentally measured probability of Oxygen being
        bound.
    c : 1d-array or float
        The pressure of Oxygen used in the experiment. This should
        be in units of mmHg.
    ep : float
        The binding energy of Oxygen to Myoglobin. This should be in
        units of kBT.
    c_0 : float
        The reference pressure for Oxygen. The default value is 760 mmHg.

    Returns
    -------
    sum_squares : float
        The sum of the squares of the residuals between the predicted
        probability of being bound and the experimentally measured values.
    """
    theo = p_bound(c, ep, c_0)
    sum_squares = np.sum((prob - theo)**2)
    return sum_squares
```

Since we know that Oxygen *does* bind to Myoglobin, the binding energy cannot be greater than 0 $k_BT$. What is the strongest that this interaction could be? One of the strongest known non-covalent interactions is the streptavidin-biotin bond, which has a binding energy of $\approx -30\, k_BT$. Let's look at the $\chi^2$ between these two extremes.

In [75]:

```
# Try a range of ep values using 300 points.
ep_range = np.linspace(-30, 0, 300)

# Set up an empty vector where we will keep track of the chi sq. value.
sum_squares = np.zeros_like(ep_range)
for i in range(len(ep_range)):
    sum_squares[i] = chi_sq(data_20['frac'], data_20['pO2'], ep_range[i])

# Plot the sum of the squares of the residuals as a function of ep.
plt.plot(ep_range, sum_squares, '.')
plt.xlabel('$\Delta\epsilon$ $(k_BT)$')
plt.ylabel('$\chi^2$')
```

Out[75]:

In [76]:

```
# Restrict the limits of the x axis.
plt.plot(ep_range, sum_squares, '.')
plt.xlabel('$\Delta\epsilon$ $(k_BT)$')
plt.ylabel('$\chi^2$')
# Set the limits so we can see the minimum by eye
plt.xlim([-10, -5])
plt.ylim([-0.1, 3])
```

Out[76]:

To identify the best-fit value from this sweep, we can find the smallest value stored in the `sum_squares` variable.

In [77]:

```
# Find the minimum value of the sum_squares variable.
min_val = np.min(sum_squares)
# Find the index of this minimum value within the k_range array. The `np.where`
# function will tell us the index at which a given condition is met.
ind = np.where(sum_squares == min_val)
# Use this index to get the value of ep which gives the lowest chi sq.
min_ep = ep_range[ind][0]
print("Our 'by-eye' best parameter value is %.2f kBT" %min_ep)
```
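As a side note, NumPy's `np.argmin` returns the index of the smallest entry in a single call, which avoids the separate `np.min`/`np.where` steps above. A small sketch with a stand-in for the sweep (the parabolic `sum_squares` below is made up, with its minimum placed near $-7\, k_BT$):

```python
import numpy as np

# Stand-in for the sweep above: a made-up chi^2 curve whose
# minimum sits near -7 kBT (purely illustrative values).
ep_range = np.linspace(-30, 0, 300)
sum_squares = (ep_range + 7.0) ** 2

# np.argmin gives the index of the smallest entry directly.
min_ep = ep_range[np.argmin(sum_squares)]
```

Both approaches find the same grid point; `np.argmin` is simply more direct.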

In [78]:

```
# Set up a range of pO2 over which to calculate our fit curve.
pO2_range = np.linspace(0, 8, 500)

# Calculate our fit curve.
fit_eye = p_bound(pO2_range, min_ep)

# Choose two parameters that would lead to a "bad" fit.
large_neg_ep = -10
small_neg_ep = -5
large_neg_fit = p_bound(pO2_range, large_neg_ep)
small_neg_fit = p_bound(pO2_range, small_neg_ep)

# Plot our data and fits. Note that the %.2f formats the string
# as a float to the second decimal point.
plt.plot(data_20['pO2'], data_20['frac'], 'o', label='20° C data')
plt.plot(pO2_range, fit_eye, '-',
         label='By-eye fit, $\Delta\epsilon = %.2f$ $k_BT$' % min_ep)
plt.plot(pO2_range, small_neg_fit, '-',
         label='Bad fit, $\Delta\epsilon = %.2f$ $k_BT$' % small_neg_ep)
plt.plot(pO2_range, large_neg_fit, '-',
         label='Bad fit, $\Delta\epsilon = %.2f$ $k_BT$' % large_neg_ep)

# Add the labels and legends.
plt.xlabel('pO2 (mmHg)')
plt.ylabel('fraction bound')
plt.legend()
```

Out[78]:

Our by-eye fit looks pretty good, while the choices with larger values of $\chi^2$ look pretty bad. However, we really only got in the ballpark of the best-fit value. The value we found is entirely dependent on how finely we spaced the values within the `ep_range` variable. Remember, we only chose 300 points! If we chose 1,000,000 points, we would get a different answer.
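This grid-spacing dependence can be demonstrated with a toy $\chi^2$ (a made-up parabola, not the real myoglobin data) minimized over both a coarse and a fine grid:

```python
import numpy as np

def chi_sq_toy(ep):
    # A stand-in chi^2 (purely illustrative) with its true minimum at -7.05 kBT.
    return (ep + 7.05) ** 2

# Minimize the same function on a coarse and a fine grid.
estimates = {}
for n in (300, 1_000_000):
    grid = np.linspace(-30, 0, n)
    estimates[n] = grid[np.argmin(chi_sq_toy(grid))]
```

The coarse grid can only land on the nearest of its 300 points, while the fine grid resolves the minimum far more precisely; the two "answers" differ even though the underlying function is identical.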

Determining the best-fit parameter values for a given model is a subject that has received an enormous amount of research and attention. There are many ways in which we can statistically infer a parameter value.

Now that we know what is happening under the hood, we can perform this same fitting using the professional algorithms packaged in the `scipy.optimize` module. From this package, we will use the `curve_fit` function, which performs the least-squares fitting we've described above. Rather than evaluating a very finely spaced array of possible values, it uses the Levenberg-Marquardt algorithm. This algorithm identifies a local minimum in the $\chi^2$ given an initial guess value. It then goes through repeated rounds of this local minimum detection until a stopping criterion is satisfied. If you're interested in the details of this optimization procedure, see this set of notes from Henry P. Gavin at Duke University.

The `scipy.optimize.curve_fit` function returns two values $-$ the optimal parameter values found by the algorithm and the approximate covariance matrix of the parameter estimate. Manipulation of this matrix extends beyond the context of this class, so we will simply call it `junk`.
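If you don't have the data file at hand, here is a self-contained sketch of the same `curve_fit` workflow on synthetic data (the one-parameter `model` below is hypothetical, chosen only to illustrate the call signature, not the myoglobin model):

```python
import numpy as np
import scipy.optimize

def model(x, a):
    # A hypothetical one-parameter saturating curve: a*x / (1 + a*x).
    return a * x / (1 + a * x)

# Synthetic "measurements" generated from a = 0.5 (no noise, for clarity).
x = np.linspace(0.1, 8, 20)
y = model(x, 0.5)

# curve_fit(model, xdata, ydata, p0) returns the optimal parameters
# and the approximate covariance matrix of the estimate.
popt, pcov = scipy.optimize.curve_fit(model, x, y, p0=1.0)
```

Because the synthetic data were generated from the model with no noise, the algorithm should recover the true parameter essentially exactly, even from an initial guess that is off by a factor of two.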

Let's give this algorithm a spin and see how accurate our by-eye guess is.

In [79]:

```
# Set the initial guess.
p0 = min_ep

# Find the best-fit parameter using the curve_fit function.
popt, junk = scipy.optimize.curve_fit(p_bound, data_20['pO2'],
                                      data_20['frac'], p0=p0)

# Extract the value of the best-fit parameter.
best_ep = popt[0]
print("The best-fit parameter value is %.2f k_BT" % best_ep)
```

In [80]:

```
# Compute our best-fit curve.
best_fit = p_bound(pO2_range, best_ep)

# Plot our data, our by-eye fit, and our LM fit.
plt.plot(data_20['pO2'], data_20['frac'], 'o', label='20° C data')
plt.plot(pO2_range, fit_eye, '-',
         label='By-eye fit, $\Delta\epsilon = %.2f$ $k_BT$' % min_ep)
plt.plot(pO2_range, best_fit, ':',
         label='Best fit, $\Delta\epsilon = %.2f$ $k_BT$' % best_ep)

# Add axes labels and a legend.
plt.xlabel('pO2 (mmHg)')
plt.ylabel('$P_{bound}$')
plt.legend()
```

Out[80]:

Finally, let's repeat this fit for each temperature in the data set and overlay the fitted curves on the data.

In [81]:

```
# Set up a range of colors so the points and the line will be the same colors.
colors = sns.color_palette('deep')

# Set up an empty array to which we'll add the best-fit parameter values.
best_fit_vals = np.zeros(len(temps))

# Set the guess value. We can use the same one for each temperature.
p0 = -7  # in units of kBT

# Loop through all of the temperatures.
for i in range(len(temps)):
    # Get the data of interest.
    data_temp = data[data['temp'] == temps[i]]

    # Fit the value for ep and extract it.
    popt, junk = scipy.optimize.curve_fit(p_bound, data_temp['pO2'],
                                          data_temp['frac'], p0=p0)
    fit_ep = popt[0]

    # Add it to our array.
    best_fit_vals[i] = fit_ep

    # Plot the data and the fit.
    plt.plot(data_temp['pO2'], data_temp['frac'], 'o', color=colors[i],
             label='%s° C, $\Delta\epsilon=%.2f$ $k_BT$' % (temps[i], fit_ep))
    fit_curve = p_bound(pO2_range, fit_ep)
    plt.plot(pO2_range, fit_curve, '-', color=colors[i])

# Add labels and the legend.
plt.xlabel('pO2 (mmHg)')
plt.ylabel('$P_{bound}$')
plt.legend(loc='lower right')
```

Out[81]: