Tutorial 1: Basic DNA Sequence Analysis

© 2017 Griffin Chure and Soichi Hirokawa. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license


In this tutorial, we'll cover some basics of using Python to read data from a file into memory and how to examine the dispersal patterns of a famous reptile by looking at the sequence of its mitochondrial DNA.

The curious case of Mabuya atlantica

Approximately 200 miles off the coast of Brazil exists a small (and quite beautiful) archipelago named Fernando de Noronha. The twenty one islands and islets of this archipelago are volcanic in origin and have many endemic plant and animal species. One curious inhabitant of the archipelago is Mabuya atlantica (also known as the Noronha skink), which is phylogenetically placed along with the African species of Trachylepis. Because the archipelago was formed from volcanoes, there was no way that these lizards traveled to the island by land. How did they arrive at the island? In 1888, Alfred Russel Wallace proposed that perhaps the skink arrived on the archipelago by floating on a raft of vegetation across the open ocean, although this trip was estimated to take around 150 days using the knowledge of the oceanic currents from Africa to South America, as is shown in the figure below.

While the exact trajectory of this 'improbable voyage' is still up for debate, the phylogenetic placement of this species is fairly certain. This placement was not performed through morphology but through DNA sequence analysis. In this tutorial, we will use the data from S. Carranza and E. N. Arnold who sequenced a component of the mitochondrial DNA from a large variety of skinks across the world and from Fernando de Noronha. We will learn how to read in the DNA sequences and score them by the similarity of their sequence to test this hypothesis.

Loading in sequences

Before we can do any sequence comparisons, we must first load the sequences from the two text files into our computer to do the analysis. To become familiar with the structure of these documents, let's take a look at them. We can use the shell command head to take a look at a small section of the head of the text file. To have access to various shell commands in the Jupyter notebook, we must first type an exclamation point (!).

In [ ]:
# Look at the sequences from Fernando de Noronha
! head data/mabuya_atlantica/noronha_mabuya.txt

This line should give the following output:

>Mabuya atlantica isolate 1, Fernando de Noronha ttgtcttctaaataaagaccagtatgaacggctaaatgaggacaaacctgtctcttttaa ctaatcagtgaaactgatctcccagtacaaaagctgggatacaaacataagacgagaaga ccccgtggagcttaaaacaaaaccaccaatcca--c------gccctgg-accctggtgg actgttttgagttggggcgacttcggaataaaaagaaacttccgagcacagaaccac-aa attctaaccaaggccaacaagcctaagcata---taa-ctgacccggccc--acgccgat caacgaaccaagttaccccggggataacagcgctatcttcttcaagagtccctatcaaca agaaggtttacgacctcgatgttggatcaggacacccaaatggtgcagccgctattaaag gttcgtttgttcaacgatt

We can see that the file begins with a carrot '>' followed by some information about that specific sample. This is a standard sequence format known as FASTA format. There are many sequence storage types used in modern sequence analysis and Biopython is capable of reading many of them. We can also see from the description that the first sample is the first isolate from the island Fernando de Noronha. Below this line is the actual DNA sequence. This sequence is composed of atgc and '-'. The letters correspond to the base identity at that position while the dash '-' indicates an insertion or deletion when the sequences were aligned to each other (see this wonderful blog post for a primer on DNA alignment). Rather than copying and pasting these sequences into our notebooks to actually do the comparison, we'll use the popular bioinformatics package Biopython to load in these sequences.


Note that Biopython is not installed with the standard Anaconda distribution. To install it directly from the Jupyter notebook, type the following command into a code cell and hit enter. You should only have to do this once.
In [ ]:
! conda install biopython --yes

Once you have Biopython installed, we can import only the submodule we will use for this tutorial, Bio.SeqIO. Note that in future tutorials, we will import all of the modules we will use at the beginning of the Jupyter notebook.

In [2]:
# Import Biopython for file I/O.
import Bio.SeqIO

Biopython can be used to do a wide variety of operations on DNA sequences. We will be using it for one rather simple purpose $-$ to load the sequences into memory. To do so, we will need to open the file, tell Biopython to read the contents, and then close the file. We'll first load the sequence from Fernando de Noronha.

In [3]:
# Open the file with sequences from Fernando de Noronha
with open('data/mabuya_atlantica/noronha_mabuya.txt', 'r') as seq:
    noronha_file = Bio.SeqIO.parse(seq, 'fasta')
    noronha_sequences = [record for record in noronha_file]

Let's take a second to cover the syntax line-by-line in the above code cell.

  1. We used a with statement to open our file for reading only ('r') and assigned it to the variable seq.

  2. Within the with block, we used Biopython's SeqIO.parse function to read the file. We specified the format as 'fasta'. This function generates something called a SeqRecord iterator which contains the sequence and description for each entry in the file.

  3. We separated each record in the SeqRecord iterator into a list by using a single line for loop. By putting this loop in brackets [], we convert the output to a list.

Now, each entry in the noronha_sequences list has two attributes; the sequence (.seq) and the description (.description). Let's take a look at the sequence and description of the Noronha skink.

In [4]:
# Look at the sequence and description of the first isolate.
noronha_seq = noronha_sequences[0].seq
noronha_desc = noronha_sequences[0].description
print(noronha_seq)
print(noronha_desc)
ttgtcttctaaataaagaccagtatgaacggctaaatgaggacaaacctgtctcttttaactaatcagtgaaactgatctcccagtacaaaagctgggatacaaacataagacgagaagaccccgtggagcttaaaacaaaaccaccaatcca--c------gccctgg-accctggtggactgttttgagttggggcgacttcggaataaaaagaaacttccgagcacagaaccac-aaattctaaccaaggccaacaagcctaagcata---taa-ctgacccggccc--acgccgatcaacgaaccaagttaccccggggataacagcgctatcttcttcaagagtccctatcaacaagaaggtttacgacctcgatgttggatcaggacacccaaatggtgcagccgctattaaaggttcgtttgttcaacgatt
Mabuya atlantica isolate 1, Fernando de Noronha

Let's do this again with the skink sequences from other regions of the world.

In [5]:
# Open the file with sequences from the rest of the world.
with open('data/mabuya_atlantica/world_mabuya.txt', 'r') as seq:
    world_file = Bio.SeqIO.parse(seq, 'fasta')
    world_sequences = [record for record in world_file]

Again, let's take a look at the sequence and description.

In [6]:
# Look at the sequence and description of a skink in Brazil
brazil_seq = world_sequences[0].seq
brazil_desc = world_sequences[0].description
print(brazil_seq)
print(brazil_desc)
TTGTCTTTTAAATAAAGACCTGTATGAATGGCTGAATGAGGATAAACCTGTCTCTTATAACTAATCAGTGAAACTGATCTCCCAGTACAAAAGCTGGAATATACACATAAGACGAGAAGACCCTGTGGAGCTTAAAAC-AAACCACTAAACAA-----GT--ATACCACTACCTTAGTGTAC-GTTTTCAGTTGGGGCGACTTCGGAATAAAATGAAACTTCCGAGCACAGAGGCAC-TTCCTCTAACTAAGGCCAACAAGCCAAAGACCC---ATAAACGACCCGGCC---TTGCCGATCAACGAACCAAGTTACCCCAGGGATAACAGCGCAATCTTCTTCGAGAGCCCTTATCAACAAGAAGGTTTACGACCTCGATGTTGGATCAGGACACCCAAATGGTGCAGCCGCTATTAAAGGTTCGTTTGTTCAACGATT
Mabuya agilis, SW Brazil; Mato Grosso do Sul

We see again that we have the DNA sequence (although this one is upper case) and its description (Mabuya agilis, SW Brazil; Mato Grosso do Sul). Now that we know how to load and store the sequence information from a file, let's take a look at how to actually perform sequence comparison.

Computing a similarity score

We'll compare the sequences from our skink on Fernando de Noronha to those in other regions of the world. To do so, we'll simplify our analysis to only compare the number of mismatches at a given position in the sequence. This means that we will have to ignore any gaps in the sequence (denoted by '-' signifying an insertion or deletion in the reference genome). How do we define what is 'similar'? There are myriad ways of scoring sequence similarity and is a subject that people have thought very deeply about. To whet our appetite for sequence analysis, we will simply compute what percentage of the base pairs we compared are identical to each other. By this definition, organisms with a larger percent similarity are more similar to each other than those with a lower percent similarity.

Programmatically, how do we compute this? Remember from the Python syntax tutorial we can index letters in a string much like we can numbers in an array or items of a list. To perform the comparison, we'll simply iterate over each position in a sequence and ask two questions:

  1. Is this position in either sequence a gap '-'? If so, we'll skip it and move to the next position.

  2. Is this position in sequence 1 the same in sequence 2? If so, we'll count that as a similarity and move on. Note that by this definition, the two bases must be the same letter and in the same case. This means we will have to convert our brazil_seq to lowercase before we perform the comparison.

Before we begin the loop, we will need to make sure that the two sequences are the same length.

In [7]:
# Determine the length of the two sequences and make sure they are the same.
noronha_length = len(noronha_seq)
brazil_length = len(brazil_seq)
compared_length = (noronha_length == brazil_length)  # Should be True
print('Our sequences are the same length: ' + str(compared_length))
Our sequences are the same length: True

Great! Now, we can go ahead and set up a for loop to compare our two example sequences.

In [8]:
# Convert the Brazil sequence to lowercase.
brazil_seq = brazil_seq.lower()

# Set up a counter to keep track of how many bases are identical.
num_sim = 0

# Set up a counter to keep track of how many positions we've compared.
comp_length = 0

# Iterate through each position in the sequences and compare.
for base in range(brazil_length):
    
    # Make sure we are not comparing a gap.
    if (noronha_seq[base] != '-') and (brazil_seq[base] != '-'):
        
        # Add one position to our counter of the comparison length.
        comp_length += 1   # Note this is same as comp_length = comp_length + 1
       
        # Compare the position and each sequence.
        if noronha_seq[base] == brazil_seq[base]:
            
            # If they are the same, add that to our counter.
            num_sim += 1
            
# Now compute the percent similarity and print it.
score = num_sim / comp_length
print(score)
0.8878281622911695

Our analysis tells us that our sequences are approximately 89% similar. This makes some intuitive sense considering that Fernando de Noronha is only about 200 miles away from the coast of mainland Brazil. However, we have a few other sequences to analyze before we can come to a conclusion regarding their dispersal. Let's take a look at the two other localities (South Africa and Turkey) in our world_mabuya.txt file. We could type everything again by hand, but this would be a wonderful opportunity to write a function.

Our function will do the following operations with two provided sequences, seq_1 and seq_2.

  1. Ensure that seq_1 and seq_2 are the same length. If not, it will produce an error. For this, we'll include a raise statement with the ValueError exception.

  2. Convert both sequences to lowercase and iterate through each sequence and compare every position that is not a gap in either sequence.

  3. Complete and return the percent similarity between the two sequences.

Let's give it a shot.

In [9]:
def compute_similarity(seq_1, seq_2):
    """
    Computes the percent similarity between two sequences ignoring gaps. 
    
    Parameters
    ----------
    seq_1, seq_2 : strings
        DNA sequences to compare. These must be the same length.
        
    Returns
    -------
    score : float
        The percent similarity between the two sequences. 
    """
    # Make sure they are the same length. 
    if len(seq_1) != len(seq_2):
        raise ValueError('Sequences must be the same length!')
        
    # Make both sequences lowercase.
    seq_1 = seq_1.lower()
    seq_2 = seq_2.lower()
        
    # Set up counters of length and similarity.
    comp_length = 0
    num_sim = 0
    
    # Iterate through each position in the sequences.
    for base in range(len(seq_1)):
        
        # Ensure we are not comparing gaps.
        if (seq_1[base] != '-') and (seq_2[base] != '-'):
            
            # Increase the counter for compared length.
            comp_length += 1
            
            # Compare the two positions.
            if seq_1[base] == seq_2[base]:
                
                # Increase the similarity counter.
                num_sim += 1
                
    # Compute and return the percent similarity.
    score = num_sim  / comp_length
    return score
    

To check our function, let's feed it the two sequences we've worked with for this entire tutorial. It should return a similarity of 0.8878.

In [10]:
# Test the function on our example sequences
function_score = compute_similarity(noronha_seq, brazil_seq)
print(function_score)
0.8878281622911695

Now, let's check that it fails when we give the function different sequences.

In [11]:
# Generate sequences of dissimilar length.
seq_1 = 'aTtAcg-a'
seq_2 = 'ttac'

# Run these through the function. This should give an error.
failed_score = compute_similarity(seq_1, seq_2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-89abb3ac9b92> in <module>()
      4 
      5 # Run these through the function. This should give an error.
----> 6 failed_score = compute_similarity(seq_1, seq_2)

<ipython-input-9-f7c56fd3ddfe> in compute_similarity(seq_1, seq_2)
     15     # Make sure they are the same length.
     16     if len(seq_1) != len(seq_2):
---> 17         raise ValueError('Sequences must be the same length!')
     18 
     19     # Make both sequences lowercase.

ValueError: Sequences must be the same length!

Great! Now that we are confident that our function works as advertised, let's compare the three other localities to Fernando de Noronha. Rather than doing this over three separate times, we can throw it in a for loop.

In [12]:
# Loop through the three other localities and print the result to the screen.
for locale in world_sequences:
    
    # Compute the similarity.
    score = compute_similarity(noronha_seq, locale.seq)
    
    # Print the locality and the score.
    print(locale.description, ': Score = ' + str(score))
Mabuya agilis, SW Brazil; Mato Grosso do Sul : Score = 0.8878281622911695
Mabuya capensis, South Africa; Kuga Mts. E. Cape : Score = 0.9304556354916067
Mabuya vittata, Turkey; Osmandere : Score = 0.9049881235154394

It looks like the skink from Brazil is actually the least similar of the three locales we examined. These data are suggestive that the skinks of Fernando de Noronha arrived from somewhere on mainland Africa!

What we've learned

In this tutorial, we've learned some basic principles of scripting in Python and using an open source module (Biopython) to read in sequence from a file and perform some rudimentary analysis. Furthermore, we've learned that through the use of DNA sequencing, we can assess some bold claims about the long distance travel of reptiles. In this week's homework, you will apply what we've learned here to a similar dispersal case study: the amphibians on São Tomé off of the west coast of Africa. In that problem, you will compare the Ptychadena newtoni on São Tomé to more than three amphibian species on mainland Africa.