%%Chapter 22. Using statistics to evaluate linkage%%
Earlier we made the case for the importance of statistics in human genetics, but statistics are important for all genetics research. We therefore begin our discussion with a brief return to model organisms.
In earlier chapters, we skipped over statistics when evaluating linkage. For instance, in [[chapter_03#the_test_cross|Chapter 03]], we gave the example of a test cross between $shi$ and $vg$:
A test cross between $shi$ and $vg$ mutants. This is a reproduction of Chapter 03, Figure 9.
The possible phenotypes from this cross are:
* not paralyzed, normal wings
* not paralyzed, vestigial wings
* paralyzed, normal wings
* paralyzed, vestigial wings
We said in [[chapter_03#the_test_cross|Chapter 03]] that if the phenotypic ratio of the offspring is 1:1:1:1, then $shi$ and $vg$ are unlinked. Let's hypothetically say that you looked at 10,000 F2 offspring from a test cross and obtained exactly 2,500 offspring of each phenotype combination. In this case, the conclusion is pretty clear: the ratio is 1:1:1:1. However, what if you looked at just 100 F2 offspring and obtained a ratio of 20:23:30:27? That's kind of close to 1:1:1:1, but just how close is it?
In statistics, we assume that there is a "true" value for whatever you are attempting to measure, but because taking measurements in the real world is an imprecise process, there is sampling error that makes your data deviate from the "true" value. Statistical tests let you estimate the probability that your measurement deviates from the "true" value because of sampling error. We first discuss the $\chi^2$ test commonly used in genetics, using the test cross between $shi$ and $vg$ as our example.
===== The $\chi^2$ test =====
The $\chi^2$ test((pronounced "kai-squared test"; χ is a Greek letter that is usually Romanized as "chi" but is pronounced "kai")) is a statistical test for categorical data. Examples of categorical data are biological sex (male or female), political affiliation (Democrat or Republican), species (cat or dog), etc. It is different from numerical data (e.g., the heights of people in a population, weights of fruits from a tree, speed of cars on the freeway). Phenotypes can often be considered categorical data. For instance, in the case of $shi$, we can say a fruit fly is either paralyzed or not paralyzed; in the case of $vg$, we can say a fruit fly either has vestigial wings or normal wings.
Using the $\chi^2$ test, we can ask the question: what is the probability that the deviation of the 20:23:30:27 ratio we observed in the $shi$ $vg$ test cross from 1:1:1:1 is due to sampling error? To perform the $\chi^2$ test, we first organize all the relevant information into a table called a contingency table:
In the contingency table (Table {{ref>Tab1}}), the column "observed number of offspring ($O$)" is simply the data you observed. The column "expected number of offspring if unlinked ($E$)" is the total sample size of your experiment (in this case, $n=100$) multiplied by the expected fraction of each class. The expectation is that, if $shi$ and $vg$ are unlinked, we should see a 1:1:1:1 ratio of the four possible F2 phenotypes; therefore, we should expect $\frac{1}{4} \times 100$ offspring of each type. The next step is calculating the difference between $O$ and $E$. Since we don't care whether $O$ is larger or smaller than $E$ (we just want to know the magnitude of the difference), we square the difference to "get rid" of any potential negative values. We then divide by $E$ to normalize the differences based on the expected number in each class. Finally, we add up the normalized values for all classes - this sum is called the $\chi^2$ statistic. The $\chi^2$ statistic lets us calculate a very useful number called the //p// value.
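Here is a minimal sketch in Python (our own illustration, not part of the original experiment) that carries out the same calculation for the observed 20:23:30:27 counts:

<code python>
# Chi-squared statistic for the shi/vg test cross described above
observed = [20, 23, 30, 27]            # offspring counts for the four phenotype classes
n = sum(observed)                      # total sample size (100)
expected = [n / 4] * 4                 # 1:1:1:1 expectation if shi and vg are unlinked

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2_stat)                       # 2.32
</code>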
The //p// value is the probability that sampling error alone would produce data that deviate from the "perfect" expectation at least as much as our observed data do. The larger the //p// value, the more easily the deviation from a "perfect" 1:1:1:1 ratio can be explained by sampling error; conversely, a small //p// value means the deviation is unlikely to be due to sampling error alone - in other words, a small //p// value is statistical evidence against the hypothesis that $shi$ and $vg$ are unlinked (that is, evidence for linkage). //p// values can be determined using a probability distribution called the $\chi^2$ distribution. Most of the time you are not asked to calculate the //p// value from scratch (this requires more advanced statistics and calculus); instead, you can look up the //p// value based on your $\chi^2$ statistic and another variable called degrees of freedom, which is generally defined as "the number of categories of data minus 1". However, in our case there is actually only one degree of freedom, because (based on our understanding of meiosis) determining the frequency of one phenotype in principle determines the frequencies of the other three phenotypes, regardless of whether the two mutations are linked or not. We next consult a $\chi^2$ table (Table {{ref>Tab2}}):
We look at the row for degrees of freedom = 1 and see that our $\chi^2$ statistic from Table {{ref>Tab1}}, 2.32, falls between the columns for the //p// values of 0.900 and 0.100. Although we cannot determine an exact //p// value from the table, we know that 0.1<//p//<0.9. Generally speaking, most scientists have settled on //p//<0.05 as the standard for statistical significance (this is an arbitrary choice and in some cases can be problematic; see below). Therefore, the statistics suggest that the deviation of our observed data from the "perfect" 1:1:1:1 ratio can readily be explained by sampling error. Another, more intuitive way to say this is that there is no statistical evidence (at least within our experiment) that $shi$ and $vg$ are linked.
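If you prefer software to a printed table, the same lookup can be done with SciPy (assuming it is installed); this is just a sketch of the idea, using one degree of freedom as argued above:

<code python>
from scipy.stats import chi2

df = 1
print(chi2.sf(2.32, df))            # exact p value, ~0.13 (the table only brackets it between 0.9 and 0.1)

# critical values corresponding to the columns of a chi-squared table
for p in (0.9, 0.1, 0.05, 0.01, 0.001):
    print(p, chi2.ppf(1 - p, df))   # e.g. p = 0.05 -> 3.84
</code>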
===== The multiple testing problem =====
Let's now apply the idea of the $\chi^2$ test to a more human-like model, the mouse. One limitation of working with mice (just as with humans) is sample size - it is not as easy to breed large numbers of mice in genetic crosses as it is for fruit flies (although it can be done with enough resources).
Let's say we have a mutant mouse carrying a recessive mutation $m$; mutant mice have phenotype m, and wild-type mice are +. How could you map $m$? One thing you could do is randomly pick a genetic marker somewhere in the genome and ask the question, "is $m$ linked to this marker?" Let's call this marker $s$, and let's say $s$ is a single nucleotide polymorphism (SNP) found in the $m$ mouse strain. The cross to map $m$ might look something like this:
$$ P: \frac{m}{m}\cdot \frac{s}{s} \times \frac{+}{+}\cdot \frac{+}{+} $$
This generates heterozygotes, which we then backcross to the parent (equivalent to a test cross):
$$ F1: \frac{m}{+} \cdot \frac{s}{+} \times \frac{m}{m}\cdot \frac{s}{s} $$
Among the F2 offspring, we would get the following phenotypes/genotypes:
Note that in Table {{ref>Tab3}}, the "expected" numbers represent the null hypothesis that $m$ and $s$ are unlinked; that the SNP genotypes are determined by PCR (and sequencing); and that the total sample size (20 offspring) is much smaller than in the //Drosophila// experiment above. As in earlier chapters, the dot in the genotype notation separates the two loci because we are not assuming they are on the same chromosome.
We can calculate the $\chi^2$ statistic as follows:
$$ \begin{aligned} \chi^2 &= \sum{ \frac{(O-E)^2}{E}} \\
&= \frac{(10-5)^2}{5} + \frac{(7-5)^2}{5} + \frac{(2-5)^2}{5} + \frac{(1-5)^2}{5} \\
&= 10.8
\end{aligned}$$
Using the $\chi^2$ table shown in Table {{ref>Tab2}}, we can see that $p$<0.01 for $\chi^2$=10.8. In fact, a $\chi^2$ table with more columns to the right would show that 0.001<$p$<0.002. Since the typical cutoff for statistical significance is $p$<0.05, based on the $\chi^2$ test we can conclude that $m$ is linked to $s$. But is this conclusive evidence, especially given the relatively small sample size?
To model this, let's imagine an infinitely large bag containing 4 different kinds of balls, each present at equal frequency, labeled A, B, C, and D. If we were to reach into the bag and pull out 20 balls, the expected result would be 5 A, 5 B, 5 C, and 5 D balls. However, it's unlikely we would get exactly 5 of each - there would be some sampling error. For instance, drawing 4 A, 4 B, 6 C, and 6 D balls deviates from the expected result of 5 each, but that deviation is probably just due to sampling error. The $p$ value we calculated from the data in Table {{ref>Tab3}} means that the likelihood of drawing a result as lopsided as 10 A, 7 B, 2 C, and 1 D balls is around 0.001, or 0.1%. Those sound like pretty good odds for NOT drawing 10 A, 7 B, 2 C, and 1 D, until you start repeating the experiment.
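To get a feel for how much sampling error to expect, here is a small simulation sketch (our own illustration): it draws 20 balls from the bag many times and asks how often the result comes out exactly 5:5:5:5.

<code python>
import random

trials = 100_000
exact = 0
for _ in range(trials):
    counts = [0, 0, 0, 0]
    for _ in range(20):
        counts[random.randrange(4)] += 1   # draw one ball; A, B, C, and D are equally likely
    if counts == [5, 5, 5, 5]:
        exact += 1

print(exact / trials)   # ~0.011: only about 1% of draws give exactly 5 of each
</code>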
Let's put this into some numerical perspective. If the probability that any given draw of 20 balls gives a false positive (that is, a result that deviates as far from the expected values as in the example) is 0.001, then the probability that a draw gives a result reasonably close to the expected values is 0.999. If we draw 20 balls $n$ times (each draw is independent, since the bag of balls is infinite in size), the likelihood that all $n$ draws will give results close to expected is $0.999^n$. This means that in $n$ draws, the likelihood that at least one draw will give a false positive is $1-0.999^n$. If we set this probability to 10%, or 0.1, we can solve for $n$:
$$1-0.999^n=0.1 \\
0.999^n=0.9\\
n\log{0.999}=\log{0.9}\\
n=\frac{\log{0.9}}{\log{0.999}}=105.3 $$
In other words, if we were to repeat the 20-ball draw around a hundred times, there is roughly a 10% chance that at least one draw would deviate from the expected results enough to fool us.
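A quick numerical check of this calculation (a sketch; the 0.001 false-positive rate per draw is the value from the example above):

<code python>
import math

false_pos_per_draw = 0.001                      # per-draw false positive rate from the example
n = math.log(0.9) / math.log(1 - false_pos_per_draw)
print(n)                                        # ~105.3 draws
print(1 - (1 - false_pos_per_draw) ** 105)      # ~0.10: chance of at least one false positive in 105 draws
</code>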
Now let's return to mapping $m$. The SNP $s$ that we used was randomly chosen. If the SNP density between two mouse strains is roughly 1 SNP per 1,000 bp of DNA, and the mouse genome is ~$10^9$ bp in total, then if we were to look at all SNPs between the $m$ mouse and wild type, we might be looking at a million SNPs! For mapping a mouse mutant, we wouldn't want to limit our mapping experiment to a single SNP like $s$; we would probably want to look at many (or possibly even all) SNPs throughout the genome. But each time we test linkage between $m$ and a SNP, we are essentially doing the same thing as reaching into our bag and pulling out 20 balls, and the likelihood we'll get a false positive is ~0.1%. This means that the more SNPs we test for linkage, the greater the chance that we will get data that "shows linkage" to a SNP that is really unlinked, purely because of sampling error.
This conundrum is known as the multiple testing problem. We can partially alleviate it by increasing the sample size of the experiment, which yields lower $p$ values for SNPs that are truly linked. In addition, the $p$ value threshold for statistical significance must be substantially lower than the "standard" 0.05 if we are doing a genome-wide search for linkage (which is a multiple testing experiment by definition) - the general consensus among geneticists is that the cutoff should be at least $p$<0.0001 instead.
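To see how quickly the problem grows, here is a small sketch (assuming, for simplicity, that the tests are independent) showing the chance of at least one false positive at a per-test cutoff of 0.05, and the stricter per-test cutoff (a Bonferroni-style correction, 0.05/$n$) that keeps the overall false positive rate near 0.05. In a real linkage scan, nearby markers are inherited together, so the effective number of independent tests is smaller than the raw number of SNPs.

<code python>
alpha = 0.05
for n_tests in (1, 10, 100, 1_000, 10_000):
    fwer = 1 - (1 - alpha) ** n_tests        # probability of >= 1 false positive among n_tests tests
    per_test_cutoff = alpha / n_tests        # Bonferroni-style corrected cutoff
    print(f"{n_tests:>6} tests: P(>=1 false positive) = {fwer:.3f}, corrected cutoff = {per_test_cutoff:g}")
</code>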
===== Mapping human mutations using LOD scores =====
Mapping human mutations is especially challenging, because (1) we cannot control human crosses (at least not in any ethical way), and (2) sample sizes are usually small. We use a tool called the logarithm of odds (LOD) score to help us. To understand LOD scores and how to map human genes, we first need to look at several concepts. By way of example, let's imagine an autosomal dominant human disease caused by an allele $D$ of a gene whose molecular identity is unknown. The normal (non-disease) allele is $d$. Let's say an affected male has an affected male child with an unaffected female. The pedigree would look like this:
{{ :phase_ex1.jpg?400 |}}
Pedigree of the example family: an affected father and an unaffected mother have an affected son.
==== Concept 1: phase ====
Let's say we have identified a SNP marker $s$ that is linked to $D$ and has two alleles in the population: $s$ and $S$. Can we deduce the child's genotype at both the disease locus and the SNP locus? First of all, we can easily determine the genotypes at the SNP locus in both parents and the child by using PCR and DNA sequencing. We discover that the father is homozygous $\frac{S}{S}$, and both the mother and child are heterozygous $\frac{S}{s}$.
Next, since autosomal dominant disease alleles are rare, we can assume that the affected father's genotype at the disease locus is $\frac{D}{d}$. Since we know the father's genotype at the $s$ SNP locus is $\frac{S}{S}$, his combined genotype must be $\frac{S \; D}{S \; d}$. Since the mother is not affected by the disease and the disease allele is dominant, her genotype must be $\frac{S \; d}{s \; d}$. Given that the child is affected, their genotype must therefore be $\frac{S \; D}{s \; d}$; given what we know about the parents, the child's genotype cannot be $\frac{S \; d}{s \; D}$. In this case, the disease allele $D$ must be on the same chromosome as the $S$ allele of the marker in the child, and the normal allele $d$ must be on the same chromosome as the $s$ allele; we say that the phase is known for both chromosomes in the child.
==== Concept 2: informativeness ====
Now let us consider a different scenario to illustrate our second concept of informativeness. Let's say that we know the father's genotype is $\frac{S \; D}{s \; d}$ and the mother's genotype is $\frac{S \; d}{s \; d}$. We discover that the affected child is homozygous $\frac{S}{S}$. In this case, we can definitively conclude that the child inherited an $S \; D$ chromosome from their father, and that there must not have been recombination between the $S$ and $D$ loci. We say that the chromosome inherited from their father is informative, because we know for certain whether it was a product of recombination or not. By comparison, the child must have inherited an $S \; d$ chromosome from their mother. This chromosome is uninformative, because we can't tell whether it is a product of recombination between $S$ and $d$ or not. When we look at individuals in pedigrees, they might have 0, 1, or 2 informative chromosomes.
Let's revisit the example in Table {{ref>Tab3}}. The question we want to ask is: are $m$ and $s$ linked? Let's assume that they are (this is our hypothesis) - in such a case, we can determine which chromosomes are informative or not, and which are recombinant:
^ %%genotype%% ^ $\frac{+ \; +}{m \; s}$ ^ $\frac{m \; s}{m \; s}$ ^ $\frac{+ \; s}{m \; s}$ ^ $\frac{m \; +}{m \; s}$ ^
^ expected | 5 | 5 | 5 | 5 |
^ observed | 10 | 7 | 2 | 1 |
^ "top" %%chromosome%% | I, NR | I, NR | I, R | I, R |
^ "bottom" %%chromosome%% | U | U | U | U |
Abbreviations in the table above: I, informative; U, uninformative; R, recombinant; NR, non-recombinant. The "top" chromosome in each genotype is the one inherited from the heterozygous F1 parent; the "bottom" chromosome is the $m \; s$ chromosome inherited from the homozygous parent, and it is always uninformative.
==== LOD score formula for known phase ====
We can now derive the LOD score formula for alleles with known phase. We first define the following terms:
* θ = recombination fraction (equivalent to map distance)
* 1-θ = nonrecombinant fraction
* R = number of recombinant chromosomes
* NR = number of non-recombinant chromosomes
The logarithm of odds (LOD) score is given as:
$$ \begin{aligned}\text{LOD score} &= \log{\frac{\text{probability of observed pedigree data given } 0<\theta<\frac{1}{2}}{\text{probability of observed pedigree data given } \theta=\frac{1}{2}}} \\
&= \log{\frac{\text{probability}(\theta)}{\text{probability}(\frac{1}{2})}}\end{aligned}
$$
The denominator in this formula reflects our null hypothesis: that two loci are unlinked. When two loci are unlinked, the probability that recombination occurs between them is $\frac{1}{2}$. By extension, the probability that recombination does not occur between them is $1-\frac{1}{2}=\frac{1}{2}$. In order for the data to be consistent with the null hypothesis, every chromosome for which the phase is known would have to occur with probability $\frac{1}{2}$. Therefore:
$$ \text{probability}(\frac{1}{2})=(\frac{1}{2})^R \cdot (\frac{1}{2})^{NR}=(\frac{1}{2})^{R+NR} $$
The numerator in this formula reflects our experimental hypothesis: that the two loci are linked. When two loci are linked, the probability that recombination occurs between them is θ. By extension, the probability that recombination does not occur between them is (1-θ). In order for the data to be consistent with our experimental hypothesis, every recombinant chromosome for which the phase is known would have to occur with probability θ, and every non-recombinant chromosome for which the phase is known would have to occur with probability (1-θ). Therefore:
$$ \text{probability}(\theta)=\theta^R \cdot (1-\theta)^{NR} $$
We can thus re-write the LOD score formula as:
$$ \begin{aligned} \text{LOD} &= \log{\frac{\theta^R \cdot (1-\theta)^{NR}}{(\frac{1}{2})^{R+NR}}} \\
&= R\log{\theta} + NR\log{(1-\theta)} + (R+NR)\log{2} \end{aligned} $$
Let's use the mouse data in Table {{ref>Tab4}} to illustrate the use of this formula. Here we have isolated 20 F2 offspring, 17 of which are non-recombinant (NR=17) and 3 of which are recombinant (R=3). We can use different values of θ to calculate LOD:
Table {{ref>Tab5}} lists LOD scores calculated at different values of θ, and Figure {{ref>Fig3}} plots LOD as a function of θ.
We can see from Table {{ref>Tab5}} and Figure {{ref>Fig3}} that the LOD score is highest around θ = 0.15, at which LOD = 2.35 (you can calculate a more precise maximum using some calculus). Experience tells us that LOD > 3.3 is usually the cutoff at which linkage is likely to be real (it corresponds to a 0.05 genome-wide false positive rate); conversely, LOD < -2 usually means we can rule out linkage. Therefore, these data do not support linkage between $m$ and $s$, but they do not rule it out either.
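For example, plugging R=3, NR=17, and θ=0.15 into the formula above (logarithms are base 10):
$$ \begin{aligned} \text{LOD} &= 3\log{(0.15)} + 17\log{(0.85)} + 20\log{2} \\
&= -2.47 - 1.20 + 6.02 \approx 2.35 \end{aligned} $$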
What if our sample size were bigger? Let's say that instead of 20 offspring we looked at 40, and obtained R=6 and NR=34. At θ=0.15, we can calculate that LOD≈4.7. This puts us over the threshold of 3.3. Based on these data, we can conclude that $m$ and $s$ are linked, and the best-supported estimate is that they are about 15 map units apart.
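Here is a short sketch (our own illustration of the formula above, not a standard analysis package) that scans values of θ and evaluates the known-phase LOD score for both data sets:

<code python>
import math

def lod_known_phase(R, NR, theta):
    """Known-phase LOD: R*log10(theta) + NR*log10(1 - theta) + (R + NR)*log10(2)."""
    return (R * math.log10(theta)
            + NR * math.log10(1 - theta)
            + (R + NR) * math.log10(2))

for R, NR in [(3, 17), (6, 34)]:
    thetas = [t / 100 for t in range(1, 50)]                   # theta = 0.01 ... 0.49
    best_theta = max(thetas, key=lambda t: lod_known_phase(R, NR, t))
    print(R, NR, best_theta, round(lod_known_phase(R, NR, best_theta), 2))

# Expected output (rounded):
#   3 17 0.15 2.35   <- the 20-offspring example
#   6 34 0.15 4.7    <- the 40-offspring example, above the 3.3 cutoff
</code>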
==== LOD score formula for unknown phase ====
In mouse experiments, the phase is usually known for all offspring. Now let's return to a human example, where it is more common that the phase is unknown:
{{ :pedigree_ex2.jpg?400 |}}
Pedigree of a family segregating an autosomal dominant disease allele $D$: the father (individual 1) is affected, the mother (individual 2) is unaffected, and they have five children (individuals 3-7).
In this example, a family with 5 children is affected by an autosomal dominant mutation $D$. We are testing linkage with a marker that has two alleles, $A$ and $a$. The marker is known to be on the same chromosome as $D$, but it is unknown how closely (if at all) the two loci are linked. The genotype of the mother (individual 2) is known to be $\frac{a \; d}{a \; d}$. The genotype of the father (individual 1) is known, but the phase is not, so there are two possible phases: his genotype is either $\frac{A \; D}{a \; d}$ (phase 1) or $\frac{A \; d}{a \; D}$ (phase 2). Based on this information, we can categorize the genotypes and chromosomes of the children:
^ individual ^ 3 ^ 4 ^ 5 ^ 6 ^ 7 ^
^ inferred %%genotype%% | $\frac{A \; D}{a \; d}$ | $\frac{a \; D}{a \; d}$ | $\frac{A \; D}{a \; d}$ | $\frac{a \; d}{a \; d}$ | $\frac{a \; d}{a \; d}$ |
^ paternal %%chromosome%% | $A \; D$ | $a \; D$ | $A \; D$ | $a \; d$ | $a \; d$ |
^ if phase 1 | NR | R | NR | NR | NR |
^ if phase 2 | R | NR | R | R | R |
^ maternal %%chromosome%% | not informative |||||
Note that in this table, the genotype at the marker locus ($A$/$a$) is determined directly by PCR, whereas the genotype at the disease locus ($D$/$d$) must be inferred from each individual's phenotype.
We first define some terms:
* T = total number of informative chromosomes (note that R1+NR1 = R2+NR2 = T)
* R1 = number of recombinant chromosomes in phase 1
* NR1 = number of non-recombinant chromosomes in phase 1
* R2 = number of recombinant chromosomes in phase 2
* NR2 = number of non-recombinant chromosomes in phase 2
* θ = recombination fraction (i.e., map distance) between $m$ and $D$
As before, we define the LOD score as:
$$ \begin{aligned}\text{LOD score} &= \log{\frac{\text{probability of observed pedigree data given } 0<\theta<\frac{1}{2}}{\text{probability of observed pedigree data given } \theta=\frac{1}{2}}} \\
&= \log{\frac{\text{probability}(\theta)}{\text{probability}(\frac{1}{2})}}\end{aligned}
$$
Since we don't know what the phase is, we must calculate $\text{probability}(\theta)$ for both phases:
$$\text{probability}(\theta)_1=\theta^{R1} \cdot (1-\theta)^{NR1} \\
\text{probability}(\theta)_2=\theta^{R2} \cdot (1-\theta)^{NR2} $$
And since phase 1 and phase 2 are equally likely, we take the average of both for the LOD score:
$$ \begin{aligned} \text{LOD} &=\log{\frac{\frac{\theta^{R1} \cdot (1-\theta)^{NR1}+\theta^{R2} \cdot (1-\theta)^{NR2}}{2}}{(\frac{1}{2})^{T}}} \\
&=(T-1)\log{2}+\log{\left(\theta^{R1} \cdot (1-\theta)^{NR1}+\theta^{R2} \cdot (1-\theta)^{NR2}\right)}\end{aligned}$$
In this example, there are a total of 5 informative chromosomes (R1=1, NR1=4; R2=4, NR2=1). Using the formula, we can calculate that at θ=0.25 the LOD score is only about 0.12 (see the worked calculation below) - this does not come close to the threshold of LOD=3.3 and therefore is not evidence of linkage. When the phase is unknown, LOD scores will be substantially lower than if the phase were known. If we knew the phase in our example to be phase 1, we could calculate that at θ=0.25, LOD=0.403. That's still not significant, but it's higher.
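Working through the arithmetic at θ=0.25 (logarithms base 10):
$$ \begin{aligned} \text{LOD} &= (5-1)\log{2} + \log{\left(0.25^1 \cdot 0.75^4 + 0.25^4 \cdot 0.75^1\right)} \\
&= 1.204 + \log{(0.0791 + 0.0029)} \\
&= 1.204 - 1.086 \approx 0.12 \end{aligned} $$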
One last important point about LOD scores is that they are additive across families. This is because the odds from independent families are multiplicative, and taking the logarithm of the odds turns multiplication into addition. In our example above, we calculated the LOD score for linkage between $D$ and the marker in this family as LOD≈0.12 at θ=0.25. If we obtained data from another family that gave LOD=0.29 at θ=0.25, we could combine the two families as LOD=0.12+0.29=0.41. This makes analyzing the data much easier and more intuitive: the more data you have, the stronger your case for linkage.