How do we Sequence DNA?
DNA Sequencing is at the center of the Human Genome Project, which promises to revolutionize
the Biomedical Sciences and the treatment of human diseases. This page is designed to help you
understand how DNA is sequenced.
If you are looking for information on our DNA sequencing service facility, our home page is here:
The University of Michigan DNA Sequencing Core
First you need to know a few key terms:
As you go through the subsequent discussion, you may need to jump back here to
refresh your memory on various definitions.
- We assume you've read through the
description of DNA structure,
an earlier link in this thread ... right? You hopefully also read the link that describes
DNA Denaturation, Annealing and Replication,
since the following page builds on those basics.
- A 'plasmid' is a small, circular piece of DNA that is often found in bacteria.
This innocuous molecule might help the bacteria survive in the presence of
an antibiotic, for example, due to the genes it carries. To scientists, however,
plasmids are important because (i) we can isolate them in large quantities,
(ii) we can cut and splice them, adding whatever DNA we choose, (iii) we can
put them back into bacteria, where they'll replicate along with the bacteria's
own DNA, and (iv) we can isolate them again - getting billions of copies of
whatever DNA we inserted into the plasmid! Plasmids are limited to sizes of
2.5-20 kilobases (kb), in general.
- The term 'BAC' is an acronym for 'Bacterial Artificial Chromosome', and in principle,
it is used like a plasmid. We construct BACs that carry DNA from humans or mice or
wherever, and we insert the BAC into a host bacterium. As with the plasmid, when we
grow that bacterium, we replicate the BAC as well. Huge pieces of DNA can be easily
replicated using BACs - usually on the order of 100-400 kilobases (kb). Using BACs,
scientists have cloned (replicated) major chunks of human DNA. This, as you will see
later, is critical to the Human Genome Project.
- The 'vector' is generally the basic type of DNA molecule used to replicate your DNA,
like a plasmid or a BAC.
- The 'insert' is a piece of DNA we've purposely put into another (a 'vector') so that
we can replicate it. Consequently, the 'insert' is usually the interesting part. In
the case of the Human Genome Project or other sequencing projects, the insert is
the part we want to sequence - the part we don't know. Usually we know the complete
DNA sequence of the vector.
- Shotgun Sequencing
- Shotgun sequencing is a method for determining the sequence of a very large piece of
DNA. The basic DNA sequencing reaction can only get the sequence of a few hundred
nucleotides. For larger pieces (like BAC DNA), we usually fragment the DNA and insert
the resultant pieces into a convenient vector (a plasmid, usually) to replicate them.
After we sequence the fragments, we try to deduce from them the sequence of the
original BAC DNA.
- For more definitions ...
- See our Molecular Biology Glossary.
Now for the details on how DNA Sequencing works:
DNA sequencing reactions are just like the PCR reactions for replicating DNA
(refer to the previous page DNA Denaturation, Annealing and Replication).
The reaction mix includes the template DNA, free nucleotides,
an enzyme (usually a variant of Taq polymerase) and a 'primer' - a small piece
of single-stranded DNA about 20-30 nt long that can hybridize to one strand
of the template DNA.
The reaction is initiated by heating until the two strands of DNA separate, then
the primer sticks to its intended location and DNA polymerase starts elongating
the primer. If allowed to go to completion, a new strand of DNA would be the
result. If we start with a billion identical pieces of template DNA, we'll get
a billion new copies of one of its strands.
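If it helps to see the priming step spelled out, here is a minimal Python sketch of it. It is only an illustration: the sequences, the 8-nt primer (real primers are about 20-30 nt, as noted above) and the function names are all made up for this example, and it assumes a perfect match between primer and template.

```python
# Toy model of primer annealing and extension (all sequences are hypothetical).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def extend_primer(template, primer):
    """Return the completed new strand (primer + extension), written 5'->3'.

    template: the strand being copied, written 5'->3'.
    primer:   short single-stranded DNA, written 5'->3'.
    Returns None if the primer has no perfectly matching site.
    """
    site = reverse_complement(primer)   # the stretch of template the primer pairs with
    start = template.find(site)
    if start == -1:
        return None
    # The polymerase adds bases to the primer's 3' end, which means it reads
    # the template from the binding site toward the template's 5' end.
    copied_region = template[: start + len(site)]
    return reverse_complement(copied_region)

template = "TGGACGCAGATTACAGCCTT"   # made-up template strand
primer = "CTGTAATC"                 # made-up primer, complementary to GATTACAG
new_strand = extend_primer(template, primer)
print(new_strand)  # CTGTAATCTGCGTCCA
```

The new strand begins with the primer itself, followed by the bases the polymerase added. Every copy made from the same template and primer starts at the same position, which is what makes the next step work.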
Dideoxynucleotides: We run the reactions, however, in the presence of a dideoxyribonucleotide. This
is just like a regular nucleotide, except it has no 3' hydroxyl group - once it's added
to the end of a DNA strand, there's no way to continue elongating it.
Now the key to this is that MOST of the nucleotides are regular ones, and just a fraction
of them are dideoxy nucleotides....
Replicating a DNA strand in the presence of dideoxy-T
MOST of the time when a 'T' is required to make the new strand, the enzyme will
get a good one and there's no problem. MOST of the time after adding a T, the
enzyme will go ahead and add more nucleotides. However, 5% of the time, the enzyme will
get a dideoxy-T, and that strand can never again be elongated. It eventually breaks away
from the enzyme, a dead end product.
Sooner or later ALL of the copies will get terminated by a T, but each time the
enzyme makes a new strand, the place it gets stopped will be random. In millions
of starts, there will be strands stopping at every possible T along the way.
ALL of the strands we make started at one exact position. ALL of them end with
a T. There are billions of them ... many millions at each possible T position.
To find out where all the T's are in our newly synthesized strand, all we have
to do is find out the sizes of all the terminated products!
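The random-stopping idea can be checked with a small simulation. The Python sketch below is a toy model, not a description of real reaction kinetics: the strand is a made-up example, the 5% figure is the one quoted above, and the function name and the number of copies are arbitrary choices.

```python
import random

def terminated_lengths(new_strand, dd_base="T", dd_fraction=0.05,
                       copies=10_000, seed=0):
    """Simulate many copies of one strand being made in the presence of a
    dideoxy nucleotide; return the set of lengths of terminated products."""
    rng = random.Random(seed)
    lengths = set()
    for _ in range(copies):
        for position, base in enumerate(new_strand, start=1):
            # Each time this base is needed, the enzyme has a small chance
            # of picking up the dideoxy version, which ends the strand here.
            if base == dd_base and rng.random() < dd_fraction:
                lengths.add(position)
                break
        # On this short toy strand, many copies reach the end without a
        # dideoxy base; they are simply not recorded.
    return lengths

new_strand = "TGCGTCCATTGACT"   # hypothetical newly synthesized strand
t_positions = {i for i, base in enumerate(new_strand, start=1) if base == "T"}
observed = terminated_lengths(new_strand)
print(sorted(observed))
print(observed == t_positions)
```

With enough copies, the set of product lengths should match the set of T positions exactly, which is the whole trick: measure the sizes and you know where the T's are.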
Here's how we find out those fragment sizes.
Gel electrophoresis can be used to separate the fragments by size and measure
them. In the cartoon at left, we depict the results of a sequencing reaction
run in the presence of dideoxy-Cytidine (ddC).
First, let's add one fact: the dideoxy nucleotides in my lab have been chemically
modified to fluoresce under UV light. The dideoxy-C, for example, glows blue. Now
put the reaction products onto an 'electrophoresis gel' (you may need to refer to 'Gel Electrophoresis' in
the Molecular Biology Glossary), and you'll see something like depicted
at left. Smallest fragments are at the bottom, largest at the top. The positions
and spacing show the relative sizes. At the bottom is the smallest fragment that's
been terminated by ddC; that's probably the C closest to the end of the primer (which
is omitted from the sequence shown). Simply by scanning up the gel, we can see that
we skip two, and then there are two more C's in a row. Skip another, and there's
yet another C. And so on, all the way up. We can see where all the C's are.
Putting all four dideoxynucleotides into the picture:
Well, OK, it's not so easy reading just C's, as you perhaps saw in the last figure.
The spacing between the bands isn't all that easy to figure out. Imagine,
though, that we ran the reaction with *all four* of the dideoxy nucleotides
(A, G, C and T) present, and with *different* fluorescent colors on each. NOW
look at the gel we'd get (at left). The sequence of the DNA is rather obvious
if you know the color codes ... just read the colors from bottom to top: TGCGTCCA-(etc).
(Forgive me for using black - it shows up better than yellow).
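Reading the four-color gel amounts to sorting the bands by size and translating each color. Here is a minimal Python sketch of that. Only "blue means C" and the use of black come from the text above; the rest of the color table, the band list and the function name are assumptions made up for the illustration.

```python
# Assumed color code: blue = C is from the text, the other three are guesses.
COLOR_TO_BASE = {"red": "T", "black": "G", "blue": "C", "green": "A"}

def read_gel(bands):
    """bands: (fragment_length, color) pairs in any order.
    Returns the sequence read from the smallest fragment to the largest,
    i.e. from the bottom of the gel to the top."""
    ordered = sorted(bands)  # tuples sort by fragment length first
    return "".join(COLOR_TO_BASE[color] for _, color in ordered)

# Hypothetical bands; lengths are counted from the end of the primer.
bands = [(3, "blue"), (1, "red"), (2, "black"), (4, "black"),
         (5, "red"), (6, "blue"), (7, "blue"), (8, "green")]
sequence = read_gel(bands)
print(sequence)  # TGCGTCCA
```

Because every length from 1 upward is represented by exactly one band, the order of the colors is the order of the bases.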
An Automated sequencing gel:
That's exactly what we do to sequence DNA, then - we run DNA replication reactions
in a test tube, but in the presence of trace amounts of all four of the
dideoxy terminator nucleotides. Electrophoresis is used to separate the resulting
fragments by size and we can 'read' the sequence from it, as the colors march past.
In a large-scale sequencing lab, we use a machine to run the electrophoresis step
and to monitor the different colors as they come out. Since about 2001, these
machines - not surprisingly called automated DNA sequencers - have used 'capillary
electrophoresis', where the fragments are piped through a tiny glass-fiber capillary
during the electrophoresis step, and they come out the far end in size-order.
There's an ultraviolet laser built into the machine that shoots through the liquid
emerging from the end of the capillaries, checking for pulses of fluorescent
colors to emerge. There might be as many as 96 samples moving through as many
capillaries ('lanes') in the most common type of sequencer.
At left is a screen shot of a real fragment of sequencing gel (this one from an
older model of sequencer, but the concepts are identical). The four colors red,
green, blue and yellow each represent one of the four nucleotides.
The actual gel image, if you could get a monitor large enough to see it all at
this magnification, would be perhaps 3 or 4 meters long and 30 or 40 cm wide.
A 'Scan' of one gel lane:
We don't even have to 'read' the sequence from the gel - the computer does that for us!
Below is an example of what the sequencer's computer shows us for one sample. This is a plot of
the colors detected in one 'lane' of a gel (one sample), scanned from smallest fragments
to largest. The computer even interprets the colors by printing the nucleotide
sequence across the top of the plot. This is just a fragment of the entire file,
which would span around 900 nucleotides of accurate sequence.
The sequencer also gives the operator a text file containing just the nucleotide sequence,
without the color traces.
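How the computer turns colors into letters can be sketched very roughly as "pick the strongest color at each peak". The Python below is a simplified illustration only, not the software that ships with any sequencer: the intensity numbers are invented, and the rule of calling 'N' when the top color is less than twice the runner-up is an arbitrary assumption.

```python
def call_bases(peaks, min_ratio=2.0):
    """peaks: one dict per peak, mapping base -> fluorescence intensity.
    Calls the strongest base at each peak, or 'N' when the strongest signal
    is not at least min_ratio times the second strongest (ambiguous peak)."""
    calls = []
    for intensities in peaks:
        ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
        (best_base, top), (_, second) = ranked[0], ranked[1]
        calls.append(best_base if top >= min_ratio * second else "N")
    return "".join(calls)

# Invented intensities for four consecutive peaks in one lane.
peaks = [
    {"A": 5,   "C": 8,   "G": 10,  "T": 950},
    {"A": 12,  "C": 20,  "G": 870, "T": 30},
    {"A": 9,   "C": 910, "G": 15,  "T": 11},
    {"A": 400, "C": 10,  "G": 380, "T": 12},   # two colors nearly tied
]
called = call_bases(peaks)
print(called)  # TGCN
```

The last peak shows why the color traces are worth keeping alongside the text file: where two colors overlap, a person can look at the plot and decide whether to trust the call.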
As you have seen, we can get the sequence of a fragment of DNA as long as
900 or so nucleotides. Great! But what about longer pieces? The human genome is
3 *billion* bases long, arranged on 23 pairs of chromosomes. Our sequencing
machine reads just a drop in the bucket compared to what we really need!
To do it, we break the entire genome up into manageable
pieces and sequence them. There are two approaches currently in use:
- The Publicly Funded Human Genome Project:
The National Institutes of Health and the National Science Foundation
have funded the creation of 'libraries' of BAC clones. Each BAC carries
a large piece of human genomic DNA on the order of 100-300 kb. All of
these BACs overlap randomly, so that any one gene is probably on several
different overlapping BACs. We can replicate those BACs as many times as necessary, so
there's a virtually endless supply of the large human DNA fragment.
In the publicly funded project, the BACs are subjected to shotgun sequencing
(see below) to figure out their sequence. By sequencing all the BACs, we
know enough of the sequence in overlapping segments to reconstruct how the original
chromosome sequence looks.
- A Privately-Funded Sequencing Project: Celera Genomics
An innovative approach to sequencing the human genome has been pioneered by
Celera Genomics. The founders of this company realized that it might be
possible to skip the entire step of making libraries of BAC clones. Instead,
they blast apart the entire human genome into fragments of 2-10 kb and
sequence those. Now the challenge is to assemble those fragments of sequence
into the whole genome sequence.
Imagine, for example, that you have hundreds of 500-piece puzzles, each being
assembled by a team of puzzle experts using puzzle-solving computers. Those
puzzles are like BACs - smaller puzzles that make a big genome manageable.
Now imagine that Celera throws all those puzzles together into one room and
scrambles the pieces. They, however, have scanners that scan all the puzzle
pieces and huge computers that figure out where they all go.
It is still controversial whether the Celera approach will succeed on
a puzzle as large as the human genome. Whether it does or not, they have
certainly stirred up the intellectual pot a bit.
Shotgun sequencing: assembly of random sequence fragments
To sequence a BAC, we take millions of copies of it and chop them all up randomly.
We then insert those into plasmids and for each one we get, we grow lots
of it in bacteria and sequence the insert. If we do this to enough fragments,
eventually we'll be able to reconstruct the sequence of the original BAC
based on the overlapping fragments we've sequenced!
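To give a feel for the reconstruction step, here is a toy Python sketch that repeatedly merges the two fragments with the longest overlap. It is a simplified greedy approach chosen for illustration, not the assembly software used by any genome project: it assumes error-free reads, all from the same strand, with no repeated regions, and the sequences and function names are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b
    (at least min_len bases), or 0 if there is none."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, best overlap first; return the remaining contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that lie entirely inside another read.
    reads = [r for r in reads if not any(r != other and r in other for other in reads)]
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

original = "ATGGCGTACCTTGAACGTTAGC"   # hypothetical stretch of BAC sequence
reads = [original[12:22], original[0:10], original[6:16]]  # overlapping fragments, out of order
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACCTTGAACGTTAGC']
```

Real genomes are full of repeated sequences, and real reads contain errors, so production assemblers need far more redundancy and far more careful bookkeeping than this sketch suggests.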