© 2017 Griffin Chure. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license
# Our standard modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Utilities for image processing
import skimage.io
import skimage.exposure
import skimage.measure
# To allow for inline rendering of plots.
%matplotlib inline
In this tutorial, we will learn how to load, manipulate, and extract quantitative data from microscopy images. The data set used in this experiment can be downloaded from here.
Much like sequencing has revolutionized how we think of the origin and diversification of life, microscopy has revolutionized how we think about cellular function. Over the past decade and a half, we have pushed our ability to image past the diffraction limit of light, cracking open a world teeming with scientific questions we couldn't have come up with otherwise.
Of course, microscopy is not a "new" method like DNA sequencing. In the late seventeenth century, Antonie van Leeuwenhoek developed the first microscope. While the ball-lens design allowed for only a small region of focus at a modest magnification, he was able to see the myriad life forms around him that hid from observation by the naked eye. While I will leave the details to your imagination, he looked at a large array of body fluids and samples from his surroudings. He was not able to capture the images he saw but he paid close attention and sketched his observations. The image below shows the design of his microscope (left) and some of his sketches (right).
The current top-of-the-line microscopes permit observation of life at impressively high resolution and speed. The lattice light-sheet microscope is arguably the most advanced microscope in use today. Through some clever tricks of illumination, this microscope allows us to make single-molecule measurements in living cells in real time. Please see these videos on cell division, cell movement through collagen networks, and cell-cell interaction for examples of images captured on a lattice light sheet micrsocope. This microscope is certainly more complicated than van Leeuwenhoek's first design (image below, left side), but the resulting images are truly astounding (image below, right side).
The development of the electron microscope was another thrust forward in our ability to resolve the stuff of life. Although this method cannot make measurements of living cells, it can resolve the details of life down to the atomic level. In addition to tomographic imaging, electron microscopy can be used to resolve the structure of large proteins and protein complexes down to a few angstroms of resolution. For some beautiful examples of electron microscopy images, take a look at the lab of Caltech's own Grant Jensen.
With this technical revolution comes a problem $-$ we are drowning in data. It's now relatively easy to generate 1 TB of image data in a single experiment. This means there is now a bottleneck in our ability to properly analyze and extract the quantitative information buried within the images. In the coming sections, we will cover how one can make such measurements. The images themselves are not exceedingly complex, however this tutorial will hopefully demonstrate how one can analyze images with the same rigor and purpose as any other type of quantitative data.
In class (and in homework 6), we showed how we can use some priciples of statistical mechanics to generate a model for the regulation of a gene by a repressor protein. We began by enumerating the states and weights of the promoter and when the dust settled, we were left with an equation describing the fold-change in gene expression,
$$ \text{fold-change} = \left(1 + {R \over N_{NS}}e^{-\Delta\varepsilon_r / k_BT}\right)^{-1}, \tag{1} $$where $R$ is the number of repressor proteins per cell, $N_{NS}$ is the number of non-specific binding sites available to the repressor, $\Delta\varepsilon_r$ is the binding energy of the repressor to the DNA, and $k_BT$ is the thermal energy of the system (with $k_B$ as the Boltzmann constant and $T$ as the temperature).
We are able to directly manipulate many of these "biophysical knobs". For example, we can change the repressor copy number by manipulating the ribosomal binding site of the repressor gene, making it stronger or weaker to produce many or few repressor proteins. We can also change the strength at which the repressor binds the DNA by manipulating the DNA sequence of the repressor binding site, known as the operator. While we can change these parameters (and determine their values through independent methods), we need to engineer some kind of reporter system that will allow us to actually measure the gene expression. The development of fluorescent proteins and epifluorescence microscopy has allowed us to make such measurements by observing just how brightly the bacterial cells glow. A schematic of our strains with the various biophysical knobs can be seen below. In this construct, we have directly placed these manipulated genes into the bacterial chromosome, indicated by the black circles.
While we can make predictions about the behavior of these systems by plugging values into Equation 1, we would like to verify them by experimentally measuring the fold-change in gene expression. We recall that the fold-change is simply the ratio of the probability of an RNA polymerase molecule being bound to the promoter ($p_\text{bound}$) in the presence of repressor protein to that in the absence of repressor,
$$ \text{fold-change} = {p_\text{bound}(R > 0) \over p_\text{bound} (R = 0)}. \tag{2} $$Since we have built some bacterial strains that express a fluorescent protein, we can experimentally measure the fold-change by taking the ratio of the brightness of a cell in the presence of repressor (dim) to that in the absence (bright),
which can be written mathematically as,
$$ \text{fold-change} = {\langle I_\text{R > 0} \rangle - \langle I_\text{background}\rangle \over \langle I_\text{R = 0}\rangle - \langle I_\text{background}\rangle}, \tag{3} $$where $\langle I_\text{R > 0} \rangle$ and $\langle I_\text{R = 0}\rangle$ is the average single cell intensity in the presence and absence of repressor, respectively and $\langle I_\text{background}\rangle$ is the average background fluorescence of the cell. While you may not expect it at first pass, cells are slightly fluorescent even when they are not expressing any fluorescent proteins. Since we are making the assumption that gene expression is proportional to the probability of a polymerase being bound to the promoter, we must correct for the baseline fluorescence (autofluorescence for short) of the cells.
In the coming sections, you will learn how to read in images from your computer, identify the regions which are bacteria, and make some measurements of their intensity. We will be working with some images of E. coli taken under a contrast generation method called phase contrast for segmentation and a YFP illumination image for the fluorescence channel. This data set can be downloaded here.
Digital images are nothing but a matrix in which each index is assigned a number reprenting the value at that point. These indices, more commonly called pixels, are a measure of how many photons hit that particular position on the camera sensor. Note that in the case of analog film, there are no such things as pixels and the information is contained in the pigmentation of a film negative. Virtually all images you will deal with in your scientific career (and probably your life in general) will be digital, so we will use the term "image" and "digitial image" interchangeably.
As images are nothing but matrices, we can do many of the same mathematical operations on images as we do on matrices. However, we don't want to manually generate every image by populating a matrix. This means that we will have to have some manner of interacting with image files on our computer programmatically. Although it is beyond the scope of this tutorial, you should be familiar with the different file formats of images and their benefits and drawbacks. For all scientific images, you should save them in a file format that minimizes the data lost due to compression. The most common file format you will run into is the Tagged Image File Format (.tiff
or .tif
) which is a lossless compression format. Unfortunately, you will sometimes run into images saved as .jpg
. Never save data you care about as jpegs! This file format uses a lossy compression algorithm, meaning that you will be throwing away some of your data.
There are several image processing libraries available in Python, such as Sci-kit Image, Scipy.ndimage, Python Imaging Library, and Open CV. While each have their own features, they all have the same (or at least very similar) utilities. For our purposes, we will be using the Sci-kit Image library (hereafter referred to as skimage
) as it has everything we need packaged together in some very easy to use functions.
The skimage
module has a large number of submodules. Each submodule has an array of functions assoicated with one kind of image manipulation. For example, the filters
submodule contains a variety of functions that will perform different filtering operations. The submodule measure
is composed of functions useful for making measurements of segmented object. The submodule io
allows for file input/output. While skimage
is a very powerful module, we'll only use a small portion of its capability for the purposes of this tutorial.
To begin, let's read the image into memory using the skimage.io.imread
function.
# Load the phase contrast image by providing the file path
phase_im = skimage.io.imread('data/ecoli_images/ecoli_phase_01.tif')
In this cell, we've assigned the loaded image to the variable phase_im
. As we have mentioned before, an image is nothing but a matrix of values. We can prove this to ourselves by printing the values of the array.
# Look at the phase_im array
phase_im
We can see that our image is a very large matrix with some value at each entry (called a 'pixel'). At the end of this array, we can see that the data type (dtype
) of this image is uint16
. This means that the image is composed of unsigned integers (can only be 0 or positive) with a bit depth of 16. A 16-bit image allows for values ranging from 0 to $2^{16} - 1 = 65535$. The larger the bit depth, the more information your image will contain. For comparison, nearly all commercial cameras for photography (everything from your iPhone camera to a Nikon D5) have a bit depth of 8 for each channel in an RGB image. For qualitative consumption, an 8-bit image is sufficient as the human eye can only distinguish around 200 different hues. However, in science we are often more interested in making quantitative measurements from our images. For this reason, the majority of microscopy images are either 12, 14, or 16 bit images.
While an image is nothing but a matrix, it's not very useful for us just to look at the numbers printed out on our screen. Rather, we will plot the image, assigning each value a color.
# Set the colormap for data display.
gray = plt.cm.Greys_r
# Show the image
plt.imshow(phase_im, cmap=gray)
# Add a colorbar
plt.colorbar()
In this image, we've used a gray colormap where the lowest pixel values are shown in black and the brightest in white. This colormap is perceptually flat, meaning that the transition from black to white is linear across all pixel values. It appears that all of the bacteria are dark, meaning that their pixel values are lower than the values of the background pixels.
We can see in the above image that all of the bacteria are significantly darker compared to the background of the image. If we wanted to identify the regions of the image that correspond only to the bacteria, we could very easily draw a threshold at this value. We could then go through each individual pixel, determine whether its value is above or below this threshold, and mark it as either a bacterium (1.0
) or background (0.0
).
But how do we determine this threshold? We could try guessing a bunch of different values and seeing which works or we could look at the histogram of the image. We can predict that the histogram would show two peaks $-$ one for the background pixels and one for the bacteria. We'll use the skimage.exposure
submodule to generate the histogram of the image and plot it. The function skimage.exposure.histogram
will return the frequency values and the bin centers rather than just generating the plot.
# Generate the histogram of the image. `skimage.exposure.histogram` will return
# the values of the histogram as well as the centers of the bins.
hist, bins = skimage.exposure.histogram(phase_im)
# Plot the histogram values versus the bin centers.
plt.plot(bins, hist, linewidth=1)
plt.xlabel('pixel value (a.u.)')
plt.ylabel('counts')
We can see that the largest peak in the histogram peters off around a pixel count of 5000. Since our bacteria are dark, we can use this value as our threshold and mark anything below this value as belonging to bacteria.
# Threshold the image showing pixels only below 5000 counts
thresh_val = 5000
thresh_im = phase_im < thresh_val
# Plot the image.
plt.imshow(thresh_im, cmap=gray)
That looks pretty good! While we do pick up some background pixels, we seem to get all of the bacteria without an issue. Since we will want to apply this procedure to a large number of images, we should try this same approach on another image just to make sure that it's robust.
# Load another phase contrast image.
phase_im2 = skimage.io.imread('data/ecoli_images/ecoli_phase_02.tif')
# Apply the threshold value of 5000 counts.
thresh_im2 = phase_im2 < thresh_val
# Show the image.
plt.imshow(thresh_im2, cmap=gray)