Differences

This shows you the differences between two versions of the page.

--- chapter_21 [2024/09/15 20:24] – [Polymorphisms and mapping] mike
+++ chapter_21 [2024/09/18 19:47] (current) – mike
@@ Line 1: / Line 1: @@
+<- chapter_20|Chapter 20^table_of_contents|Table of Contents^chapter_22|Chapter 22 ->
 <typo fs:x-large> %%Polymorphisms in human DNA sequences%%</typo>
@@ Line 22: / Line 24: @@
 </table>
-Of course, to many people humans are the organism in which they are most interested, either for their intrinsic interest in themselves or for biomedical applications. Humans are also more difficult to study compared to the other model organisms we have discussed in this book. The human genome is much larger and less gene dense than invertebrates such as Drosophila or //C. elegans//. More importantly, we cannot study human genetics by designing experiments using true-breeding mutants that can generate large quantities of data to test hypotheses prospectively the way we can with model organisms; for ethical reasons we can only study human genetics retrospectively, using whatever meager quantity of data natural human populations provide. Human geneticists therefore rely on non-invasive DNA tests, genetic mapping, and statistical analyses to make connections between genotype and phenotype. There are, whoever,  some advantages to studying human genetics. First, humans are self-screening; they will self-report interesting phenotypes (usually these are genetic diseases). Second, subtle phenotypes may be more recognizable, as affected individuals may be able to communicate directly with researchers.
+Of course, to many people humans are the organism in which they are most interested, either for their intrinsic interest in themselves or for biomedical applications. Humans are also more difficult to study than the other model organisms we have discussed in this book. The human genome is much larger and less gene dense than those of invertebrates such as Drosophila or //C. elegans//. More importantly, we cannot study human genetics by designing experiments using true-breeding mutants that can generate large quantities of data to test hypotheses prospectively the way we can with model organisms; for ethical reasons we can only study human genetics retrospectively, using whatever meager quantity of data natural human populations provide. Human geneticists therefore rely on non-invasive DNA tests, genetic mapping, and statistical analyses to make connections between genotype and phenotype. There are, however, some advantages to studying human genetics. First, humans are self-screening; they will self-report interesting phenotypes (usually these are genetic diseases). Second, subtle phenotypes may be more recognizable, as affected individuals may be able to talk directly with researchers.
 ===== Polymorphisms and mapping =====
@@ Line 31: / Line 33: @@
   - You can describe the physical location of a mutation using coordinates on DNA (i.e., "individuals that are homozygous for a recessive disease allele have a different DNA sequence at position 3563562 on Chromosome III").
-In human genetics, we don't have visible marker mutations such as "white" and "yellow" in Drosophila, and even if we did we could not force humans to do crosses (and even if we could, we would have to wait 20 years to get an F1 generation!). Instead, we reply on DNA polymorphisms as markers. A polymorphism is simply a difference in DNA sequence at a particular location in the genome between individuals in a population. These differences can be inside a gene, or in between genes. DNA polymorphisms include substitutions, duplications, deletions, etc. Since polymorphisms are defined as differences in DNA sequence at a specific location, we can use the term locus to describe or refer to a polymorphism even if it is not part of a gene. We can also use allele to describe different versions of that locus. A locus is said to be polymorphic is two or more alleles are each present at a frequency of at least 1% in a population.
+In human genetics, we don't have visible marker mutations such as $white$ and $yellow$ in Drosophila, and even if we did we could not force humans to do crosses (and even if we could, we would have to wait 20 years to get an F1 generation!). Instead, we rely on DNA polymorphisms as markers. A polymorphism is simply a difference in DNA sequence at a particular location in the genome between individuals in a population. These differences can be inside a gene, or in between genes. DNA polymorphisms include substitutions, duplications, deletions, etc. Since polymorphisms are defined as differences in DNA sequence at a specific location, we can use the term locus to describe or refer to a polymorphism even if it is not part of a gene. We can also use allele to describe different versions of that locus. A locus is said to be polymorphic if two or more alleles are each present at a frequency of at least 1% in a population.
+When human individuals with interesting (usually disease-related) phenotypes are found, we can examine their family tree (i.e., their pedigree), see if there are other affected individuals, and determine which polymorphisms segregate with the phenotype at a greater frequency than random. In other words, we are trying to identify which polymorphisms are linked with the disease phenotype. Since we know the DNA coordinates of the polymorphisms, it follows that the mutation causing the disease phenotype must be nearby. The strength of the linkage must be assessed using statistical analysis. Thus, human genetics is a collaboration between classical Mendelian genetics, molecular biology, and biostatistics.
+Two types of DNA polymorphisms are of particular importance in human genetics: single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs).
+==== Single nucleotide polymorphisms (SNPs) ====
+A single nucleotide polymorphism, or SNP (pronounced "snip"), simply means a polymorphism that consists of a single nucleotide. For instance, the DNA sequence at position 3563562 on chromosome III might be a G-C pair for 80% of all humans, but an A-T pair for the remaining 20%. This difference is a SNP.
+How frequently are SNPs found? All humans are 99.9% identical at the DNA level. This means that on average, at a randomly selected locus, two randomly selected human alleles will differ at a frequency of 0.001. This implies that your maternal genome (the haploid genome that you inherited from your mother) differs from your paternal genome (inherited from your father) at about 1 bp per 1000. Since the human genome is 3x10<sup>9</sup> bp long, this means that two randomly selected individuals from a human population will differ at several million loci.
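+As a rough check on the arithmetic above, the expected number of differing sites is just the genome length multiplied by the per-site difference rate. The sketch below is illustrative only: the function and constant names are our own, and the values are the round figures quoted in the text, not precise measurements.

```python
# Round figures quoted in the text (approximations, not measurements).
GENOME_LENGTH_BP = 3_000_000_000   # haploid human genome, about 3 x 10^9 bp
PER_SITE_DIFFERENCE = 0.001        # two alleles differ at about 1 bp per 1000


def expected_differences(genome_length_bp: int, per_site_difference: float) -> float:
    """Expected number of sites at which two haploid genomes differ."""
    return genome_length_bp * per_site_difference


if __name__ == "__main__":
    n = expected_differences(GENOME_LENGTH_BP, PER_SITE_DIFFERENCE)
    print(f"Expected differing sites: about {n:,.0f}")
```

+With these inputs the estimate comes out at roughly three million sites, in line with the "several million loci" stated above.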
+The vast majority (probably 99%) of SNPs are selectively “neutral” changes of little or no functional consequence. This is mostly because they lie outside coding or gene regulatory regions (>97% of the human genome). They can also be silent substitutions in coding sequences, or amino acid substitutions that do not affect protein stability or function. A small minority of SNPs are of functional consequence and are selectively advantageous or disadvantageous; this can affect the allele frequency of these SNPs.
+SNPs can be detected in a variety of ways. Conceptually, all you need to detect a SNP is knowledge of the DNA sequence surrounding the SNP. Armed with this information, you could use PCR to amplify a DNA fragment that contains the SNP from individuals and sequence the DNA ([[chapter_07|chap. 7]]). In practice, you would likely use a device called a DNA microarray that is specifically designed to detect SNPs. Or, you could simply use next-generation sequencing (NGS) technology to sequence the genome of the individual to identify SNPs.
+==== Simple sequence repeats (SSRs) ====
+Simple sequence repeats (SSRs) also go by a variety of other names: microsatellites, short tandem repeats (STRs), simple sequence length polymorphisms (SSLPs), or variable number of tandem repeats (VNTRs). These are loci where typically a dinucleotide sequence, such as CA or CG, is repeated with different repeat numbers in different alleles. In mammals, the most common type of SSR is CA repeats (MAKE A FIGURE). For example, a particular SSR locus might have six different alleles, each with a different number of CA repeats. The human genome contains on the order of 50,000 - 100,000 dinucleotide SSRs. While SSRs are about 10x less dense than SNPs, they are usually much easier to detect.
+SSRs in noncoding regions typically do not affect gene function, and therefore are usually not under any kind of selection. This means that SSR loci can accumulate mutations, which lead to different alleles in a population. The different alleles will vary in length, and this difference in length can be detected by PCR ([[chapter_07|Chapter 07]]). To distinguish between different SSR alleles, a researcher would use primers that flank the SSR; the length of the PCR amplification product can then be determined by electrophoresis, which then defines a specific allele.
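+The relationship between repeat number and PCR product length can be sketched in a few lines of Python. This is a minimal illustration, not a genotyping tool: the function names, the toy sequence, and the 100 bp of flanking sequence (primers plus any non-repeat DNA between them) are all hypothetical values chosen for the example.

```python
import re


def count_ca_repeats(sequence: str) -> int:
    """Number of repeat units in the longest uninterrupted run of CA."""
    runs = re.findall(r"(?:CA)+", sequence.upper())
    return max((len(run) // 2 for run in runs), default=0)


def amplicon_length(flank_bp: int, repeat_count: int, unit_bp: int = 2) -> int:
    """Predicted PCR product length: non-repeat flanking DNA plus the repeat tract."""
    return flank_bp + repeat_count * unit_bp


def repeats_from_amplicon(amplicon_bp: int, flank_bp: int, unit_bp: int = 2) -> int:
    """Infer the repeat count (the allele) from a measured product length."""
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0 or repeat_bp % unit_bp != 0:
        raise ValueError("product length is not consistent with a whole number of repeats")
    return repeat_bp // unit_bp


if __name__ == "__main__":
    FLANK_BP = 100  # hypothetical: total amplified sequence outside the repeat
    for n in (12, 15, 19):  # three hypothetical alleles
        print(f"(CA){n} allele -> predicted product of {amplicon_length(FLANK_BP, n)} bp")
```

+Because each additional CA unit adds 2 bp, alleles that differ by one repeat give products that differ by 2 bp, which is why the electrophoresis step needs enough resolution to separate fragments of nearly equal size.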
+<figure>
+{{ :codis.jpg?400 |}}
+<caption>
+CODIS (placeholder). Source: [[https://strbase-archive.nist.gov/fbicore.htm|National Institute of Standards and Technology]]. Licensing: public domain.
+</caption>
+</figure>
+SSRs can be used as markers for any kind of mapping of human genes, but they are commonly used in forensics. The Federal Bureau of Investigation (FBI) maintains a DNA database called the Combined DNA Index System (CODIS) that contains data on a core set of SSR alleles from convicted offenders or arrestees of various crimes. Prior to 2017, 13 STR loci were used in CODIS entries; since 2017, an additional 7 STR loci have been added. These loci are chosen such that they are unlinked from each other. This maximizes their utility in identifying unique individuals. This technique of using STR allele combinations to identify individuals is called DNA fingerprinting.
+A consequence of SSR loci being neutral and not under selection is that these loci are usually in Hardy-Weinberg equilibrium ([[chapter_18|Chapter 18]]). This allows forensic scientists to use the principles of population genetics to calculate expected genotype frequencies from allele frequencies. When the DNA of a suspect matches forensic evidence at a crime scene, allele frequency (together with information on SSR loci mutation rate) allows forensic scientists to calculate the likelihood that the match between the SSR alleles found in the evidence and those of the suspect is due to random chance. For instance, let's say there are 11 different alleles at an SSR locus. Let's say that allele 1 has a frequency of 0.5 and allele 2 has a frequency of 0.2; the remaining alleles are rarer and make up the remaining 0.3. Under Hardy-Weinberg equilibrium, the likelihood of a random individual in the population being heterozygous at this locus for alleles 1 and 2 is $2 \times 0.5 \times 0.2 = 0.2$, or 20%. For simplicity, let's just guesstimate that the likelihood of a random match at a typical SSR locus is about 10% (or 0.1). If a forensic investigator compares 13 different unlinked loci and gets a perfect match to a suspect, the likelihood that this match is due to random chance is $0.1^{13}=10^{-13}$, or one in ten trillion! This is why DNA fingerprinting is such a powerful method for law enforcement.
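+The random-match calculation can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each locus is in Hardy-Weinberg equilibrium (so a heterozygote for alleles with frequencies p and q is expected at frequency 2pq, and a homozygote at p squared), the loci are unlinked so that per-locus probabilities can be multiplied, and mutation is ignored. The function names are our own, and the 0.1 per-locus figure is the rough guess used in the text, not real CODIS data.

```python
from math import prod
from typing import Iterable


def heterozygote_frequency(p: float, q: float) -> float:
    """Expected Hardy-Weinberg frequency of a heterozygote: 2pq."""
    return 2 * p * q


def homozygote_frequency(p: float) -> float:
    """Expected Hardy-Weinberg frequency of a homozygote: p squared."""
    return p * p


def random_match_probability(per_locus: Iterable[float]) -> float:
    """Product of per-locus genotype frequencies (assumes unlinked, independent loci)."""
    return prod(per_locus)


if __name__ == "__main__":
    # Single-locus example from the text: allele 1 at 0.5, allele 2 at 0.2.
    print("Heterozygote 1/2 frequency:", heterozygote_frequency(0.5, 0.2))

    # Rough guess from the text: about 0.1 per locus, across 13 loci.
    rmp = random_match_probability([0.1] * 13)
    print(f"Random match probability over 13 loci: {rmp:.1e}")
```

+Multiplying across loci is only justified because the loci are unlinked; a real forensic calculation would use the measured genotype frequency at each locus rather than a single rounded value.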
+===== Example of using polymorphisms to map a human mutation: hypolactasia =====
-Two types of DNA polymorphisms are of particular importance in human genetics: single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs; also called microsatellites).
+The digestion of lactose requires the enzyme lactase-phlorizin hydrolase (LPH), which is produced by cells in the intestine and is encoded by the $LCT$ gene. When the intestines lack LPH, undigested lactose will ferment in the gut, leading to diarrhea. Adult-type hypolactasia is a genetic condition where individuals are able to digest lactose as children but lose the ability as they get older. This condition is also known as lactase non-persistence, or lactose intolerance. It is estimated that 68% of the world population is lactose intolerant. There are medical tests that can be done to determine if you are lactose intolerant, but they are inaccurate. Knowing the underlying genetic reasons for why some people have adult-type hypolactasia might give doctors the ability to administer a more accurate genetic test to determine whether a patient is lactose intolerant.
+In 2002, a team of Finnish scientists set out to use human genetics methods to identify mutations that are associated with hypolactasia. They reasoned that mutations that affect individuals are probably not in the protein-coding region of $LCT$, since these individuals could digest lactose as children (they also knew from other studies that there were no mutations in the $LCT$ gene of individuals that had hypolactasia). Instead, they believed that there may be mutations in nearby cis-acting regulatory sequences (see [[chapter_13|Chap. 13]]) that control the expression of $LCT$, such that it is no longer expressed in adults (they also had some other evidence to support this idea).
+The scientists examined the pedigrees of nine Finnish families with a history of hypolactasia (Fig xxx: NOTE: WAITING FOR PERMISSION FROM HHMI TO USE THE FIGURES). From the pedigrees, you can see that the inheritance pattern is consistent with hypolactasia being caused by an autosomal recessive mutation. The scientists then collected DNA samples from volunteers in these families and analyzed various polymorphisms. They utilized seven SSRs that flanked the $LCT$ gene on either side. They found strong statistical evidence (see [[chapter_22|Chap. 22]]) for linkage of hypolactasia to an SSR upstream of the $LCT$ gene, consistent with it being a regulatory mutant instead of a coding mutant. Since SNPs are much denser than SSRs, they then used SNPs to further narrow down the genetic interval for the mutations to a 47 kb region of DNA upstream of $LCT$. In this region, they found many DNA polymorphisms among the family members, but only two SNPs that showed complete co-segregation with the hypolactasia trait. Subsequent reverse genetic studies in mice and cultured human cells suggest that these two SNPs may be mutations that affect the ability of a transcription factor called OCT-1 to bind, thereby affecting the expression of $LCT$ in adults.
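+To make "complete co-segregation" concrete, here is a minimal Python sketch for a recessive trait. It is not the method used in the study: it simply asks whether some allele of a SNP is homozygous in every affected person and in no unaffected person. The function name, the individual labels, and the genotypes are all invented for illustration.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # the two alleles carried at one SNP


def cosegregating_alleles(
    genotypes: Dict[str, Genotype], affected: Dict[str, bool]
) -> List[str]:
    """Alleles that are homozygous in all affected and in no unaffected individuals."""
    if not any(affected[person] for person in genotypes):
        return []  # without affected individuals, co-segregation is undefined
    alleles = sorted({allele for genotype in genotypes.values() for allele in genotype})
    return [
        allele
        for allele in alleles
        if all(
            (genotypes[person] == (allele, allele)) == affected[person]
            for person in genotypes
        )
    ]


if __name__ == "__main__":
    # Hypothetical family: two unaffected carrier parents and three children.
    affected = {"I-1": False, "I-2": False, "II-1": True, "II-2": False, "II-3": False}
    snp_a = {"I-1": ("C", "T"), "I-2": ("C", "T"), "II-1": ("C", "C"),
             "II-2": ("C", "T"), "II-3": ("T", "T")}
    snp_b = {"I-1": ("G", "G"), "I-2": ("A", "G"), "II-1": ("G", "G"),
             "II-2": ("A", "G"), "II-3": ("G", "G")}
    print("SNP A:", cosegregating_alleles(snp_a, affected))
    print("SNP B:", cosegregating_alleles(snp_b, affected))
```

+In this toy family only SNP A passes the test, because at SNP B some unaffected relatives share the affected child's homozygous genotype. A real analysis would also weigh the statistical strength of the linkage across many families rather than rely on a yes/no check.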