--- chapter_07 [2024/08/31 16:10] – [Identifying a gene based on DNA sequence data] mike
+++ chapter_07 [2025/03/14 07:29] (current) – [Polymerase Chain Reaction (PCR)] mike
@@ Line 1: / Line 1: @@
+<-chapter_06|Chapter 06^table_of_contents|Table of Contents^chapter_08|Chapter 08->
+<typo fs:x-large>Chapter 07. Analysis of %%gene sequences%%</typo>
 Although eukaryotic genes may be generally more interesting to most students, it is useful to first consider bacterial genes. Most eukaryotic molecular biologists use bacteria as tools for various things (e.g., molecular cloning; see [[chapter_09|Chapter 09]]), so it’s useful to understand how bacteria work from a practical perspective. Also, although bacterial genes have some pretty important differences compared to eukaryotic genes, many basic principles are the same.
@@ Line 46: / Line 50: @@
 </figure>
-An experimental way to identify eukaryotic genes physically is by examining mRNA instead of DNA. If an mRNA exists in a cell, this means that it was most likely transcribed from a gene. mRNAs can be purified from cells biochemically, and an enzyme called reverse transcriptase (usually isolated from various types of retroviruses) can be used to convert mRNAs into DNAs called complimentary DNAs (cDNAs). cDNA does not exist in nature - it is created by scientists in the lab. We can sequence (discussed below) these cDNAs and compare the sequences to genomic DNA sequences to identify and locate genes. Randomly sequenced cDNAs derived from mRNAs isolated from cells or tissues are often called expressed sequence tags (ESTs). If an EST sequence matches small stretches of genomic DNA sequence, that would suggest that a gene is there even if we don't know what the gene or gene product (protein) does. There are now more modern technologies, such as RNAseq (see below), that are much faster than traditional cDNA-based approaches - the details are not important for now, but RNAseq is very similar to the NGS technologies introduced below. This book is more focused on how to conceptually analyze gene function rather than details of molecular and genomic approaches to identifying genes.
+An experimental way to identify eukaryotic genes physically is by examining mRNA instead of DNA. If an mRNA exists in a cell, this means that it was most likely transcribed from a gene. mRNAs can be purified from cells biochemically, and an enzyme called reverse transcriptase (usually isolated from various types of retroviruses) can be used to convert mRNAs into DNAs called complementary DNAs (cDNAs). cDNA does not exist in nature - it is created by scientists in the lab. We can sequence (discussed below) these cDNAs and compare the sequences to genomic DNA sequences to identify and locate genes. Randomly sequenced cDNAs derived from mRNAs isolated from cells or tissues are often called expressed sequence tags (ESTs). If an EST sequence matches small stretches of genomic DNA sequence, that would suggest that a gene is there even if we don't know what the gene or gene product (protein) does. There are now more modern technologies, such as RNAseq (see below), that are much faster than traditional cDNA-based approaches - the details are not important for now, but RNAseq is very similar to the NGS technologies introduced below. This book is more focused on how to conceptually analyze gene function rather than details of molecular and genomic approaches to identifying genes.
 ===== How to sequence DNA: background information =====
@@ Line 79: / Line 83: @@
 {{ :dna_replication_substrate.jpg?400 |}}
 <caption>
-The substrate for DNA polymerase includes a template strand, a primer with a free 3' hydroxyl group, and dNTPs (not shown). The primer can be made from single-stranded DNA (ssDNA) or RNA, but in this diagram we are using represents ssDNA.  Credit: M. Chao.
+The substrate for DNA polymerase includes a template strand, a primer with a free 3' hydroxyl group, and dNTPs (not shown). The primer can be made from single-stranded DNA (ssDNA) or RNA, but in this diagram the primer is ssDNA.  Credit: M. Chao.
 </caption>
 </figure>
@@ Line 94: / Line 98: @@
 ===== DNA Sequencing: the details =====
-We first discuss an older but still relevant type of DNA sequencing technology called Sanger sequencing. The basic method was invented in 1977 and is named after its inventor Frederick Sanger. Consider a segment of dsDNA that is about 1000 base pairs long that we wish to sequence. To sequence this DNA, we first need to have a source of DNA material, which we consider [[chapter_07#Polyermase_Chain_Reaction_(PCR)|further below]]. Assuming we have several μg (microgram; 1 μg = 10<sup>-6</sup> g) of DNA template to sequence, we first separate the two DNA strands by heating the DNA to about 100 °C to melt the hydrogen bonds that hold the ssDNAs together through base pairing.
+We first discuss an older but still relevant type of DNA sequencing technology called Sanger sequencing. The basic method was invented in 1977 and is named after its inventor [[wp>Frederick_Sanger|Frederick Sanger]]. Consider a segment of dsDNA that is about 1000 base pairs long that we wish to sequence. To sequence this DNA, we first need to have a source of DNA material, which we consider [[chapter_07#Polyermase_Chain_Reaction_(PCR)|further below]]. Assuming we have several μg (microgram; 1 μg = 10<sup>-6</sup> g) of DNA template to sequence, we first separate the two DNA strands by heating the DNA to about 100 °C to melt the hydrogen bonds that hold the ssDNAs together through base pairing.
-Next, a short single-stranded primer((In DNA sequencing and similar applications, this primer is also sometimes called an oligonucleotide (sometimes abbreviated as oligo)- you can use these the terms “primer” and “oligonucleotide” interchangeably in most cases.)) (about 18-20 bases long) designed to be complimentary to the end of one of the strands is allowed to anneal to the single stranded DNA. These primers are designed with the help of a computer and synthesized through commercially available services. The primer is added at a huge molar excess compared to the DNA you are trying to sequence – so most ssDNAs will pair with primer DNA rather than their original complementary partner ssDNA. The resulting DNA hybrid looks much like the general DNA polymerase substrate shown in Fig. {{ref>Fig5}}.
+Next, a short single-stranded primer (about 18-20 bases long) designed to be complementary to the end of one of the strands is allowed to anneal to the single-stranded DNA. These primers are designed with the help of a computer and synthesized through commercially available services. The primer is added at a huge molar excess compared to the DNA you are trying to sequence – so most ssDNAs will pair with primer DNA rather than their original complementary partner ssDNA. The resulting DNA hybrid looks much like the general DNA polymerase substrate shown in Fig. {{ref>Fig5}}.
 DNA polymerase is then added along with the four dNTP nucleotide precursors (dATP, dGTP, dCTP, and dTTP). A small quantity of a slightly different nucleotide precursor called a dideoxyribonucleotide triphosphate is also added. Dideoxy nucleotide precursors are abbreviated ddATP, ddGTP, ddCTP, and ddTTP (or ddNTPs collectively). The ddNTPs have each been chemically labeled with a unique fluorophore that emits a different color of light after stimulation with a laser – for instance, green for ddATP, cyan for ddCTP, yellow for ddGTP, and red for ddTTP. These molecules are identical to the normal dNTPs in all respects except that they lack a hydroxyl group at their 3’ position (3’ OH) and that their nucleotide bases are chemically labeled with a fluorophore (Fig. {{ref>Fig7}}; fluorophore not shown). ddNTPs are also called chain terminators.
@@ Line 116: / Line 120: @@
 </figure>
-Dideoxynuclotides can be incorporated into DNA, but once a dideoxynucleotide has been incorporated, further elongation stops because the resulting DNA will no longer have a free 3’ OH end.  Each of the ddNTPs is added at about 1% the concentration of the normal nucleotide precursors. Thus, using ddATP as an example, about 1% of the elongated chains will randomly terminate at the position of an A in the sequence; the same will be true for the other ddNTPs. Once all the elongating chains have been terminated, there will be a population of newly synthesized and fluorescently labeled ssDNA strands that have terminated at the position of the sequence. The right-hand side of Fig. {{ref>Fig8}} (the part with colorful horizontal bands) represents a gel where DNA fragments of different lengths, each ending with a chain terminator, are separated using electrophoresis on a high-resolution gel. The DNA fragments migrate toward the positively charged cathode (because the phosphate groups on the backbone of DNA are negatively charged), and shorter fragments migrate faster than longer fragments; this technique is called electrophoresis. A laser and light detector coupled to a computer then automatically reads the different colored bands in order and determines the DNA sequence (Fig. {{ref>Fig8}}).
+ddNTPs can be incorporated into DNA, but once a ddNTP has been incorporated, further elongation stops because the resulting DNA will no longer have a free 3’ OH end. Each of the ddNTPs is added at about 1% the concentration of the normal nucleotide precursors. Thus, using ddATP as an example, about 1% of the elongating chains will randomly terminate at the position of each A in the sequence; the same will be true for the other ddNTPs. Once all the elongating chains have been terminated, there will be a population of newly synthesized and fluorescently labeled ssDNA strands that have terminated at every position of the sequence. The right-hand side of Fig. {{ref>Fig8}} (the part with colorful horizontal bands) represents a high-resolution gel where DNA fragments of different lengths, each ending with a chain terminator, are separated using electrophoresis. The DNA fragments migrate toward the positively charged anode (because the phosphate groups on the backbone of DNA are negatively charged), and shorter fragments migrate faster than longer fragments. A laser and light detector coupled to a computer then automatically read the different colored bands in order and determine the DNA sequence (Fig. {{ref>Fig8}}).
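To make the logic of chain termination concrete, here is a minimal sketch in Python. It is not how a real sequencing instrument works; it simply assumes that every possible terminated fragment is present, ignores the primer, and uses a made-up template. The dye colors follow the example given above (green for ddATP, cyan for ddCTP, yellow for ddGTP, red for ddTTP); all function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Dye colors from the example in the text; real dye sets may differ.
DYE_COLOR = {"A": "green", "C": "cyan", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def synthesize(template_3_to_5):
    """Return the new strand (5'->3') that DNA polymerase would build."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def terminated_fragments(new_strand):
    """One fragment per position; each ends in a dye-labeled ddNTP."""
    return [new_strand[:length] for length in range(1, len(new_strand) + 1)]


def read_gel(fragments):
    """Order fragments by length (shortest runs fastest), then read the end dyes."""
    bands = sorted(fragments, key=len)
    colors = [DYE_COLOR[fragment[-1]] for fragment in bands]
    return "".join(COLOR_TO_BASE[color] for color in colors)


template = "TACGGATC"  # hypothetical template, written 3'->5'
new_strand = synthesize(template)
fragments = terminated_fragments(new_strand)
random.Random(0).shuffle(fragments)  # the reaction tube is an unordered mixture
print(read_gel(fragments))
```

The key point is that the gel does the sorting: once fragments are ordered by length, the color of each terminal ddNTP spells out the newly synthesized strand one base at a time.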
 ===== Polymerase Chain Reaction (PCR) =====
@@ Line 123: / Line 127: @@
 Now let’s consider how to physically obtain DNA for sequencing. A relatively large amount of DNA (approx. 1 μg worth for a piece of DNA several kbp long) is needed for the Sanger chemistry to work. As a student, you might not have a feel for how much DNA this is, but it's a substantial amount! To give you an idea of the scale of the problem, see [[chapter_07#Questions_and_exercises|Exercise 1 below]]. In the early days of molecular biology research, DNA for sequencing was obtained from cloned DNA segments, which can be difficult to create, but once created can easily be produced to give the quantity of DNA needed (molecular cloning is still used today for various purposes). We will discuss some methods for cloning new genes in [[chapter_09|Chapter 09]].
-If we want to quickly find the sequence of a new mutant allele of a known gene, we need an easy way to obtain a relatively large quantity of this DNA without needing to go through molecular cloning. The easiest and most common way to do this is to use an //in vitro// method known as the polymerase chain reaction (PCR) that was developed by Kary Mullis in the mid-1980s (Fig. {{ref>Fig7}}). The steps in PCR are as follows:
+If we want to quickly find the sequence of a new mutant allele of a known gene, we need an easy way to obtain a relatively large quantity of this DNA without needing to go through molecular cloning. The easiest and most common way to do this is to use an //in vitro// method known as the polymerase chain reaction (PCR) that was developed by [[wp>Kary_Mullis|Kary Mullis]] in the mid-1980s (Fig. {{ref>Fig7}}). The steps in PCR are as follows:
   - A crude preparation of chromosomal DNA is extracted from the tissue source of interest (there is usually not enough DNA for sequencing from this step).
   - Two short primers (each about 18-20 bases long) are added to the DNA at an enormous molar excess. The primers are designed from the known genomic sequence to be complementary to opposite strands of DNA and to flank the chromosomal segment of interest.
-  - The double stranded DNA is melted by heating to around 100 ˚C (in practice we usually use 95 oC) and then the mixture is cooled to allow the primers to anneal to the template DNA. Since there is a huge molar excess of primer vs. template, most of the template will anneal with primer rather than reanneal with its original partner strand.
+  - The double-stranded DNA is melted by heating to around 100 °C (in practice we usually use 95 °C) and then the mixture is cooled (usually to around 50 °C) to allow the primers to anneal to the template DNA. Since there is a huge molar excess of primer vs. template, most of the template will anneal with primer rather than re-anneal with its original partner strand.
-  - DNA polymerase and the four nucleotide precursors are added, and the reaction is incubated at 37 ˚C for a period of time to allow a copy of the segment to be synthesized.
+  - DNA polymerase and the four nucleotide precursors are added, and the reaction is incubated at around 72 °C for a period of time to allow a copy of the segment to be synthesized. We use 72 °C instead of the 37 °C used for most enzymatic reactions because PCR uses a special heat-stable enzyme called Taq DNA polymerase instead of standard DNA polymerase.
-  - Repeat steps 3 and 4 multiple times (up to 30-35 cycles). To avoid the inconvenience of having to add new DNA polymerase in each cycle, a special DNA polymerase called Taq polymerase that can withstand heating to 100 ˚C is used.
+  - Repeat steps 3 and 4 multiple times (up to 30-35 cycles). To avoid the inconvenience of having to add new DNA polymerase in each cycle (because the heating step eliminates the activity of standard DNA polymerase), a special DNA polymerase called Taq polymerase that can withstand heating to 100 °C is used.
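The idea of two primers that are complementary to opposite strands and flank the segment of interest can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the default primer length of 18 is taken from the range given in step 2, and real primer design software also checks things like melting temperature that are ignored here.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def flanking_primers(top_strand, primer_length=18):
    """Pick one primer at each end of a segment given as its top strand (5'->3').

    The forward primer matches the start of the top strand, so it anneals to
    the bottom strand. The reverse primer is the reverse complement of the end
    of the top strand, so it anneals to the top strand. Both are written 5'->3'.
    """
    if len(top_strand) < 2 * primer_length:
        raise ValueError("segment is too short for two non-overlapping primers")
    forward = top_strand[:primer_length]
    reverse = reverse_complement(top_strand[-primer_length:])
    return forward, reverse


# Tiny hypothetical segment with 3-base primers so the result is easy to check by eye.
print(flanking_primers("ATGCCGTA", primer_length=3))
```

Because each primer points "inward" toward the other, DNA polymerase extending from both primers copies exactly the segment that lies between them.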
 The idea behind PCR is that in each cycle of melting, annealing, and DNA synthesis, the amount of the DNA segment is doubled. This gives an exponential increase((If you think about it, the idea behind PCR is actually quite simple. It is mimicking how DNA replication occurs in dividing cells. Dividing cells with unlimited resources also replicate exponentially.)) in the amount of the specific DNA bounded by the primers on either side as the cycles proceed. After 10 cycles the DNA is amplified 2<sup>10</sup>-fold (2<sup>10</sup> = 1024; in other words, about 1000-fold) and after 20 cycles the DNA will be amplified 2<sup>20</sup>-fold (or approximately 10<sup>6</sup>-fold). Amplification usually continues until all of the nucleotide precursors are incorporated into synthesized DNA. The resulting PCR product can be used for DNA sequencing.
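The doubling arithmetic above is easy to check with a short Python sketch. It assumes perfect doubling in every cycle, which is an idealization (real reactions are less efficient and eventually run out of primers and nucleotide precursors); the function names are hypothetical.

```python
import math


def fold_amplification(cycles):
    """Fold increase in DNA after a number of cycles, assuming perfect doubling."""
    return 2 ** cycles


def cycles_for_fold(target_fold):
    """Smallest number of cycles that gives at least the target fold increase."""
    return math.ceil(math.log2(target_fold))


print(fold_amplification(10))      # about 1000-fold
print(fold_amplification(20))      # about 10^6-fold
print(cycles_for_fold(1_000_000))  # cycles needed for a million-fold increase
```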
@@ Line 140: / Line 144: @@
 </figure>
-Let's do some quick back of the envelope map to see if the numbers add up. Let's assume we start with a single molecule of dsDNA 1000 bp long that we want to sequence. Our goal is to get 1 μg of this DNA to use in Sanger sequencing. How many cycles of PCR do we need to do to get this amount of DNA?
+Let's do some quick back-of-the-envelope math to see if the numbers add up. Let's assume we start with a single molecule of dsDNA 1000 bp long that we want to sequence. Our goal is to get 1 μg of this DNA to use in Sanger sequencing. How many cycles of PCR do we need to do to get this amount of DNA?
 The first thing we need to do is to figure out how many molecules are in 1 μg of a 1000 bp fragment of dsDNA. From a quick Google search, we learn that the average molecular weight of a nucleotide is approximately 330 Da (g/mol). Since DNA is double-stranded, this means the average molecular weight of a base pair is 660 g/mol, and the approximate molecular weight of a 1000 bp dsDNA fragment is therefore:
@@ Line 167: / Line 171: @@
-The Sanger sequencing method [[chapter_07#DNA_Sequencing:the_details|described above]] was used to complete the sequencing of the genomes of many organisms we have talked about (or will talk about) in this class, including the bacterium //Eschericha coli//, yeast, Drosophila, //C. elegans//, and even humans. This technology is still used today for certain purposes. However, next generation sequencing (NGS) technologies developed over the last two decades or so has made DNA sequencing significantly faster and cheaper. For instance, the original Human Genome Project, which started in 1990, took 13 years and thousands of researchers, and cost \$2.7 billion dollars to complete. With current NGS technologies, that cost has dropped to \$600 per genome as of this writing in 2023 and uses a fully automated machine that one technician can easily operate. A complete human genome can be sequenced now in just a few days.
+The Sanger sequencing method [[chapter_07#DNA_Sequencing:the_details|described above]] was used to complete the sequencing of the genomes of many organisms we have talked about (or will talk about) in this class, including the bacterium //Escherichia coli//, yeast, //Drosophila//, //C. elegans//, and even humans. This technology is still used today for certain purposes. However, next generation sequencing (NGS) technologies developed over the last two decades or so have made DNA sequencing significantly faster and cheaper. For instance, the original Human %%Genome%% Project, which started in 1990, took 13 years and thousands of researchers, and cost \$2.7 billion to complete. With current NGS technologies, that cost has dropped to \$600 per genome as of this writing in 2023, and the sequencing is done on a fully automated machine that one technician can easily operate. A complete human genome can now be sequenced in just a few days.
 ==== Illumina sequencing ====
-There are several different types of NGS technology. The most common type of NGS is called Illumina sequencing (Figs {{ref>Fig10}}-{{ref>Fig11}}, and the sequencing chemistry in this method is called sequencing by synthesis. Despite the fancy name it is not conceptually different than Sanger sequencing in that it still depends on DNA polymerase. The key difference is in how many templates are sequenced at once. Instead of sequencing one fragment of DNA at a time, Illumina sequencing takes an entire genome (for instance, from a cancer patient tumor biopsy sample) and fragments it into millions of small pieces, each around 300-600 bp long. These fragments are affixed to a flow cell - a device that is roughly the size of a microscope slide – and amplified in situ by a process called solid-phase bridge PCR. Each fragment, once affixed and amplified, has a unique position on the slide. In essence, you are forming a "colony" of DNA clones at different positions on the flow cell. This process is called cluster generation (Fig. {{ref>Fig10}}).
+There are several different types of NGS technology. The most common type of NGS is called Illumina sequencing (Figs {{ref>Fig10}}-{{ref>Fig11}}), and the sequencing chemistry in this method is called sequencing by synthesis. Despite the fancy name, it is not conceptually different than Sanger sequencing in that it still depends on DNA polymerase. The key difference is in how many templates are sequenced at once. Instead of sequencing one fragment of DNA at a time, Illumina sequencing takes an entire genome (for instance, from a cancer patient tumor biopsy sample) and fragments it into millions of small pieces, each around 300-600 bp long. These fragments are affixed to a flow cell - a device that is roughly the size of a microscope slide - and amplified //in situ// by a process called solid-phase bridge PCR. Each fragment, once affixed and amplified, has a unique position on the slide. In essence, you are forming a "colony" of DNA clones at different positions on the flow cell. This process is called cluster generation (Fig. {{ref>Fig10}}).
-The flow cell is then exposed to sequencing reagents similar to Sanger sequencing, except that instead of ddNTP chain terminators, fluorescently labeled dNTPs with a blocking group are used that pauses the elongation reaction each time a new nucleotide is added (Fig. {{ref>Fig11}}). Just like with Sanger sequencing, each added nucleotide is coupled to a fluorophore that emits a different color. A camera takes a picture of the entire slide, and a computer keeps track of the fluorescent signals at each unique position within the flow cell. The blocking groups and fluorophores are then chemically removed, the flow cell washed, and new sequencing reagents are added to "sequence" the next nucleotide. Each individual template is sequenced very slowly compared to Sanger sequencing. For instance, the speed of DNA polymerase is roughly 100 bp/sec - so that is how fast Sanger sequencing can go((Actually, the slowest step in Sanger sequencing is the separating of synthesized DNA fragments through a gel. The actual chemistry takes just 10 minutes, but preparing and running the gel can take several hours.)). In Illumina sequencing, adding one nucleotide, taking a picture, removing the fluorophores and blocking agents, and re-starting the sequencing reaction is slow - this step can take minutes. The trick is that Illumina sequencing is analyzing millions of fragments of DNA at the same time (it is analyzing multiple DNA fragments in parallel). Thus, on a strand-by-strand basis it is slow, but in terms of number of bases sequenced per unit time, it is orders of magnitude faster than Sanger sequencing.
+The flow cell is then exposed to sequencing reagents similar to Sanger sequencing, except that instead of ddNTP chain terminators, fluorescently labeled dNTPs with a removable blocking group are used; the blocking group pauses the elongation reaction each time a new nucleotide is added (Fig. {{ref>Fig11}}). Just like with Sanger sequencing, each added nucleotide is coupled to a fluorophore that emits a different color. A camera takes a picture of the entire slide, and a computer keeps track of the fluorescent signals at each unique position within the flow cell. The blocking groups and fluorophores are then chemically removed, the flow cell is washed, and new sequencing reagents are added to "sequence" the next nucleotide. Each individual template is sequenced very slowly compared to Sanger sequencing. For instance, the speed of DNA polymerase is roughly 100 bp/sec - so that is how fast Sanger sequencing can go((Actually, the slowest step in Sanger sequencing is the separating of synthesized DNA fragments through a gel. The actual chemistry takes just 10 minutes, but preparing and running the gel can take several hours.)). In Illumina sequencing, adding one nucleotide, taking a picture, removing the fluorophores and blocking groups, and re-starting the sequencing reaction is slow - this step can take minutes. The trick is that Illumina sequencing is analyzing millions of fragments of DNA at the same time (that is, in parallel). Thus, on a strand-by-strand basis it is slow, but in terms of number of bases sequenced per unit time, it is orders of magnitude faster than Sanger sequencing.
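A rough calculation shows why parallel sequencing wins even though each strand is read slowly. The numbers in this Python sketch are illustrative assumptions, not measurements: a 1000 bp Sanger read that takes about 3 hours including the gel, and an Illumina run with one million clusters where each one-nucleotide cycle takes about 5 minutes. The function names are hypothetical.

```python
def sanger_bases_per_hour(read_length_bp=1000, run_hours=3.0):
    """One template at a time; the whole run (chemistry plus gel) yields one read."""
    return read_length_bp / run_hours


def illumina_bases_per_hour(clusters=1_000_000, minutes_per_cycle=5.0):
    """Each cycle adds one base to every cluster on the flow cell at once."""
    cycles_per_hour = 60.0 / minutes_per_cycle
    return clusters * cycles_per_hour


sanger = sanger_bases_per_hour()
illumina = illumina_bases_per_hour()
print(f"Sanger:   {sanger:,.0f} bases/hour")
print(f"Illumina: {illumina:,.0f} bases/hour")
print(f"Ratio:    {illumina / sanger:,.0f}x")
```

Under these assumptions each Illumina cluster gains only 12 bases per hour, yet the total throughput is tens of thousands of times higher, which is what "orders of magnitude faster" means in practice.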
 <figure Fig10>
@@ Line 197: / Line 201: @@
 Technologies such as Illumina sequencing are now the preferred method for most types of large-scale DNA sequencing, and it has been adapted for related technologies such as RNA sequencing (RNAseq). In RNA sequencing, RNA is first converted to complementary DNA (cDNA) using an enzyme called reverse transcriptase; the cDNA is then sequenced using standard Illumina sequencing. The number of reads of a particular RNA sequence gives information as to how much a particular gene is expressed via transcription. Unlike DNA (which is present at 2 copies per cell), mRNAs can be present in hundreds of thousands of copies per cell; this means that Illumina sequencing is sensitive enough to sequence mRNA from single cells (single cell RNAseq, or scRNAseq).
-Illumina sequencing can also be adapted for other applications. For instance, proteins interact with DNA //in vivo// to form a dynamic structure called chromatin. Let's say you are interested in a DNA protein called X. To find out what DNA sequences X binds to, you can extract and purify chromatin from cells, then use enzymes to gently cleave the DNA into small fragments under conditions in which X still binds to DNA. You can then purify X using antibodies and use Illumina sequencing to sequence the DNA fragments that co-purify with X. This procedure is called chromatin immunoprecipitation sequencing, or ChIPseq. There are many other similar applications too many to list and discuss here in detail.
+Illumina sequencing can also be adapted for other applications. For instance, proteins interact with DNA //in vivo// to form a dynamic structure called chromatin. Let's say you are interested in a DNA-binding protein called X. To find out what DNA sequences X binds to, you can extract and purify chromatin from cells, then use enzymes to gently cleave the DNA into small fragments under conditions in which X still binds to DNA. You can then purify X using antibodies and use Illumina sequencing to sequence the DNA fragments that co-purify with X. This procedure is called chromatin immunoprecipitation sequencing, or ChIPseq. There are many other similar applications, too many to list and discuss here in detail.
 For small-scale DNA sequencing, Sanger sequencing described above is still a commonly used method, although the costs of various NGS technologies have dropped so much that they are also starting to replace Sanger sequencing for small-scale sequencing experiments. For instance, Nanopore sequencing (Fig. {{ref>Fig12}}) does not use DNA polymerase and can be used to replace Sanger sequencing in some routine applications. Nanopore sequencing also has the advantage of being able to detect DNA that has been chemically modified, such as methylated bases. This is relevant for studying things like epigenetic inheritance, and it is something that neither Sanger nor Illumina sequencing can easily do.