Differences

This shows you the differences between two versions of the page.

--- chapter_22 [2024/11/25 15:38] – [Concept 2: informativeness] mike
+++ chapter_22 [2024/11/25 18:34] (current) – [LOD score formula for unknown phase] mike
@@ Line 94: / Line 94: @@
 <columns 100% *100%*>
-^  phenotype/genotype  ^  +, $\frac{s}{+}$  ^  m, $\frac{s}{s}$  ^  +, $\frac{s}{s}$  ^  m, $\frac{s}{+}$  ^
+^  genotype  ^  $\frac{+}{m}\cdot\frac{+}{s}$  ^  $\frac{m}{m}\cdot\frac{s}{s}$  ^  $\frac{+}{m}\cdot\frac{s}{s}$  ^  $\frac{m}{m}\cdot\frac{+}{s}$  ^
 ^  expected  |  5  |  5  |  5  |  5  |
 ^  observed  |  10  |  7  |  2  |  1  |
 </columns>
 <caption>
-placeholder. note here that expected is the same as null hypothesis. also note that SNPs are determined by PCR. note the smaller sample size compared to the drosophila exp.
+placeholder. note here that expected is the same as null hypothesis. also note that SNPs are determined by PCR. note the smaller sample size compared to the drosophila exp. remind what the dot symbol means.
 </caption>
 </table>
@@ Line 148: / Line 148: @@
 Now let us consider a different scenario to illustrate our second concept of informativeness. Let's say that we know the father's genotype is $\frac{S \; D}{s \; d}$ and the mother's genotype is $\frac{S \; d}{s \; d}$. We discover that the affected child is homozygous $\frac{S}{S}$. In this case, we can definitely conclude that the child inherited an $S \; D$ chromosome from their father, and that there must not have been recombination between the $S$ and $D$ loci. We say that the chromosome inherited from their father is informative, because we know for certain whether it was a product of recombination or not. By comparison, the child must have inherited an $S \; d$ chromosome from their mother. This chromosome is uninformative, because we can't tell if it is a result of recombination between the $S$ and $D$ loci or not. When we look at individuals in pedigrees, they might have 0, 1, or 2 informative chromosomes.
-Let's revisit the example in Table {{ref>Tab3}}. The question we want to ask is: are $m$ and $s$ linked? Let's assume that they are - in such a case, we can determine which chromosomes are informative or not, and which are recombinant:
+Let's revisit the example in Table {{ref>Tab3}}. The question we want to ask is: are $m$ and $s$ linked? Let's assume that they are (this is our hypothesis) - in such a case, we can determine which chromosomes are informative or not, and which are recombinant:
-^  phenotype/genotype  ^  +, $\frac{s}{+}$  ^  m, $\frac{s}{s}$  ^  +, $\frac{s}{s}$  ^  m, $\frac{s}{+}$  ^
+<table Tab4>
+<columns 100% *100%*>
+^  %%genotype%%  ^  $\frac{+ \; +}{m \; s}$  ^  $\frac{m \; s}{m \; s}$  ^  $\frac{+ \; s}{m \; s}$  ^  $\frac{m \; +}{m \; s}$  ^
 ^  expected  |  5  |  5  |  5  |  5  |
 ^  observed  |  10  |  7  |  2  |  1  |
-^  paternal chromosome genotype |  $m \;
+^  "top" %%chromosome%%  |  I, NR  |  I, NR  |  I, R  |  I, R  |
+^  "bottom" %%chromosome%%  |  U  |  U  |  U  |  U  |
+</columns>
+<caption>
+placeholder, note abbreviations
+</caption>
+</table>
+==== LOD score formula for known phase ====
+We can now derive the LOD score formula for alleles with known phase. We first define the following terms:
+  * θ = recombination fraction (equivalent to map distance)
+  * 1-θ = non-recombinant fraction
+  * R = number of recombinant chromosomes
+  * NR = number of non-recombinant chromosomes
+The logarithm of odds (LOD) score, where the logarithm is taken to base 10, is given as:
+$$ \begin{aligned}\text{LOD score} &= \log{\frac{\text{probability of observed pedigree data given } 0<\theta<\frac{1}{2}}{\text{probability of observed pedigree data given } \theta=\frac{1}{2}}} \\
+&= \log{\frac{\text{probability}(\theta)}{\text{probability}(\frac{1}{2})}}\end{aligned}
+$$
+The denominator in this formula reflects our null hypothesis: that two loci are unlinked. When two loci are unlinked, the probability that recombination occurs between them is $\frac{1}{2}$. By extension, the probability that recombination does not occur between them is $1-\frac{1}{2}=\frac{1}{2}$. In order for the data to be consistent with the null hypothesis, every chromosome for which the phase is known would have to occur with probability $\frac{1}{2}$. Therefore:
+$$ \text{probability}(\frac{1}{2})=(\frac{1}{2})^R \cdot (\frac{1}{2})^{NR}=(\frac{1}{2})^{R+NR} $$
+The numerator in this formula reflects our experimental hypothesis: that the two loci are linked. When two loci are linked, the probability that recombination occurs between them is θ. By extension, the probability that recombination does not occur between them is (1-θ). In order for the data to be consistent with our experimental hypothesis, every recombinant chromosome for which the phase is known would have to occur with probability θ, and every non-recombinant chromosome for which the phase is known would have to occur with probability (1-θ). Therefore:
+$$ \text{probability}(\theta)=\theta^R \cdot (1-\theta)^{NR} $$
+We can thus re-write the LOD score formula as:
+$$ \begin{aligned} \text{LOD} &= \log{\frac{\theta^R \cdot (1-\theta)^{NR}}{(\frac{1}{2})^{R+NR}}} \\
+&= R\log{\theta} + NR\log{(1-\theta)} + (R+NR)\log{2} \end{aligned} $$
+Let's use the mouse data in Table {{ref>Tab4}} to illustrate the use of this formula. Here we have isolated 20 F2 offspring, 17 of which are non-recombinant (NR=17) and 3 of which are recombinant (R=3). We can use different values of θ to calculate LOD:
+<table Tab5>
+<columns 100% *100%*>
+^  θ value  ^  LOD score  ^
+|  0  |  undefined  |
+|  0.05  |  1.74  |
+|  0.1  |  2.24  |
+|  0.15  |  2.35  |
+|  0.3  |  1.82  |
+|  0.5  |  0 (by definition)  |
+</columns>
+<caption>
+placeholder for LOD score table
+</caption>
+</table>
+<figure Fig3>
+{{ :phase_known_graph.png?400 |}}
+<caption>
+placeholder. graph plotting LOD as a function of theta.
+</caption>
+</figure>
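The values in Table {{ref>Tab5}} can be reproduced with a few lines of Python. This is only a minimal sketch: the function name ''lod_known_phase'' and its argument names are chosen here for illustration, and it assumes base-10 logarithms, which is what the tabulated values imply.

```python
import math

def lod_known_phase(theta, r, nr):
    """LOD for phase-known data: log10 of theta^R * (1-theta)^NR / (1/2)^(R+NR)."""
    if not 0 < theta < 1:
        # At theta = 0 the score is undefined whenever R > 0 (log of zero).
        raise ValueError("theta must lie strictly between 0 and 1")
    return (r * math.log10(theta)
            + nr * math.log10(1 - theta)
            + (r + nr) * math.log10(2))

# Mouse data from the text: R = 3 recombinant, NR = 17 non-recombinant chromosomes.
thetas = [0.05, 0.1, 0.15, 0.3]
table = {t: round(lod_known_phase(t, r=3, nr=17), 2) for t in thetas}

for t, lod in table.items():
    print(f"theta = {t:<4}  LOD = {lod:.2f}")
```

The same function can be called with any other counts of R and NR to see how the curve changes with sample size.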
+We can see from Table {{ref>Tab5}} and Figure {{ref>Fig3}} that the LOD score is highest around θ = 0.15, at which LOD = 2.35 (you can calculate a more precise maximal value of LOD using some calculus). Experience tells us that LOD > 3.3 is usually the cutoff point at which the linkage is likely to be real (it corresponds to a 0.05 false positive rate); conversely, LOD < -2 usually means we can rule out linkage. Therefore, we can see that the data fall short of the threshold for declaring linkage between $m$ and $s$, but do not rule it out either.
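The calculus mentioned above can be sketched as follows. The symbol $\hat{\theta}$ (the value of θ that maximizes LOD) is notation introduced here for illustration. The $(R+NR)\log{2}$ term does not depend on θ, so only the first two terms of the phase-known formula contribute to the derivative:

$$ \begin{aligned} \frac{d\,\text{LOD}}{d\theta} &= \frac{1}{\ln{10}}\left(\frac{R}{\theta} - \frac{NR}{1-\theta}\right) = 0 \\
\Rightarrow \hat{\theta} &= \frac{R}{R+NR} = \frac{3}{20} = 0.15 \end{aligned} $$

For phase-known data, the LOD score therefore peaks at the observed fraction of recombinant chromosomes.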
+What if our sample size were bigger? Let's say that instead of 20 offspring, we looked at 40 offspring and obtained R=6 and NR=34. At θ=0.15, we can calculate that LOD=4.7. This puts us over the threshold of 3.3. Based on these data, we can conclude that $m$ and $s$ are linked, and the best-supported estimate is that they are 15 map units apart.
+==== LOD score formula for unknown phase ====
+In mouse experiments, the phase of all offspring is usually known. Now, let's return to a human example, where it is more common for the phase to be unknown:
+<figure Fig4>
+{{ :pedigree_ex2.jpg?400 |}}
+<caption>
+placeholder
+</caption>
+</figure>
+In this example, a family with 5 children is affected by an autosomal dominant mutation $D$. We are testing linkage with marker $m$, which has two alleles, $A$ and $a$. $m$ is known to be on the same chromosome as $D$, but it is unknown if they are linked. The genotype of the mother (individual 2) is known to be $\frac{a \; d}{a \; d}$. The genotype of the father (individual 1) is known, but its phase is not, so there are two possible phases: his genotype is either $\frac{A \; D}{a \; d}$ (phase 1) or $\frac{A \; d}{a \; D}$ (phase 2). Based on this information, we can categorize the genotypes and chromosomes of the children:
+<table Tab6>
+<columns 100% *100%*>
+^  individual  ^  3  ^  4  ^  5  ^  6  ^  7  ^
+^  inferred %%genotype%%  |  $\frac{A \; D}{a \; d}$  |  $\frac{a \; D}{a \; d}$  |  $\frac{A \; D}{a \; d}$  |  $\frac{a \; d}{a \; d}$  |  $\frac{a \; d}{a \; d}$  |
+^  paternal %%chromosome%%  |  $A \; D$  |  $a \; D$  |  $A \; D$  |  $a \; d$  |  $a \; d$  |
+^  if phase 1  |  NR  |  R  |  NR  |  NR  |  NR  |
+^  if phase 2  |  R  |  NR  |  R  |  R  |  R  |
+^  maternal %%chromosome%%  |  not informative  |||||
+</columns>
+<caption>
+placeholder. note that the genotype of the $m$ locus can be determined by PCR, whereas the genotype of the $D$ locus is inferred by phenotype.
+</caption>
+</table>
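The R/NR rows of Table {{ref>Tab6}} can be derived mechanically. The sketch below is illustrative only: writing each chromosome as a two-character string (marker allele, then disease allele) and the helper name ''count_recombinants'' are assumptions made here, not part of the original analysis.

```python
# Paternal chromosome inherited by each child (individuals 3-7),
# written as marker allele followed by disease allele.
paternal = {3: "AD", 4: "aD", 5: "AD", 6: "ad", 7: "ad"}

# The father's two parental (non-recombinant) haplotypes under each candidate phase.
PHASES = {1: {"AD", "ad"}, 2: {"Ad", "aD"}}

def count_recombinants(chromosomes, parental):
    """Return (R, NR): chromosomes that do not / do match a parental haplotype."""
    chromosomes = list(chromosomes)
    nr = sum(1 for hap in chromosomes if hap in parental)
    return len(chromosomes) - nr, nr

counts = {phase: count_recombinants(paternal.values(), haps)
          for phase, haps in PHASES.items()}

for phase, (r, nr) in counts.items():
    print(f"phase {phase}: R = {r}, NR = {nr}")
```

A chromosome that counts as recombinant under one phase is non-recombinant under the other, so the two (R, NR) pairs are always mirror images of each other.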
+We first define some terms:
+  * T = total number of informative chromosomes; T = R+NR
+  * R1 = number of recombinant chromosomes in phase 1
+  * NR1 = number of non-recombinant chromosomes in phase 1
+  * R2 = number of recombinant chromosomes in phase 2
+  * NR2 = number of non-recombinant chromosomes in phase 2
+  * θ = recombination fraction (i.e., map distance) between $m$ and $D$
+As before, we define the LOD score as:
+$$ \begin{aligned}\text{LOD score} &= \log{\frac{\text{probability of observed pedigree data given } 0<\theta<\frac{1}{2}}{\text{probability of observed pedigree data given } \theta=\frac{1}{2}}} \\
+&= \log{\frac{\text{probability}(\theta)}{\text{probability}(\frac{1}{2})}}\end{aligned}
+$$
+Since we don't know what the phase is, we must calculate $\text{probability}(\theta)$ for both phases:
+$$ \begin{aligned} \text{probability}(\theta)_1 &= \theta^{R1} \cdot (1-\theta)^{NR1} \\
+\text{probability}(\theta)_2 &= \theta^{R2} \cdot (1-\theta)^{NR2} \end{aligned} $$
+And since phase 1 and phase 2 are equally likely, we take the average of both for the LOD score:
+$$ \begin{aligned} \text{LOD} &=\log{\frac{\frac{\theta^{R1} \cdot (1-\theta)^{NR1}+\theta^{R2} \cdot (1-\theta)^{NR2}}{2}}{(\frac{1}{2})^{T}}} \\
+&=(T-1)\log{2}+\log{\left(\theta^{R1} \cdot (1-\theta)^{NR1}+\theta^{R2} \cdot (1-\theta)^{NR2}\right)}\end{aligned}$$
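The phase-unknown formula can be evaluated numerically. The following is a hedged sketch rather than a definitive implementation: the function names are invented here, base-10 logarithms are assumed, and the maximum is located with a simple grid search over θ instead of calculus.

```python
import math

def lod_unknown_phase(theta, r1, nr1, r2, nr2):
    """LOD when both phases are equally likely: average the two phase likelihoods."""
    total = r1 + nr1
    if r2 + nr2 != total:
        raise ValueError("both phases must classify the same number of chromosomes")
    likelihood_sum = (theta**r1 * (1 - theta)**nr1
                      + theta**r2 * (1 - theta)**nr2)
    return (total - 1) * math.log10(2) + math.log10(likelihood_sum)

def lod_known_phase(theta, r, nr):
    """LOD when the phase is known, for comparison."""
    return (r * math.log10(theta) + nr * math.log10(1 - theta)
            + (r + nr) * math.log10(2))

# Counts from the pedigree example: R1=1, NR1=4 under phase 1; R2=4, NR2=1 under phase 2.
counts = dict(r1=1, nr1=4, r2=4, nr2=1)

# Grid search over 0 < theta < 0.5 for the best-supported recombination fraction.
grid = [i / 1000 for i in range(1, 500)]
best_theta = max(grid, key=lambda th: lod_unknown_phase(th, **counts))

print(f"unknown phase, theta=0.25: {lod_unknown_phase(0.25, **counts):.3f}")
print(f"known phase 1, theta=0.25: {lod_known_phase(0.25, r=1, nr=4):.3f}")
print(f"best theta on grid: {best_theta:.3f}")
```

Comparing the two printed scores at the same θ shows how much information is lost when the phase has to be averaged over.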
+In this example, there are a total of 5 informative chromosomes (R1=1, NR1=4; R2=4, NR2=1). Using the formula, we can calculate that at θ=0.25 we get LOD≈0.12, close to the maximum of about 0.125 near θ≈0.21. This does not cross the threshold of LOD=3.3 and therefore is not evidence of linkage. When the phase is unknown, LOD scores will be substantially lower than if the phase were known. If we knew the phase in our example to be phase 1, we could calculate that at θ=0.25, LOD=0.403. That's still not significant, but it's higher.
+One last important point on LOD scores is that they are additive. This is because the odds from independent families multiply, and the logarithm of a product is the sum of the logarithms. In our example above, we calculated the LOD score for linkage between $D$ and $m$ for this family as LOD=0.12 at θ=0.25. If we obtained data from another family that resulted in LOD=0.29 at θ=0.25, we could then combine the data for the two families as LOD=0.12+0.29=0.41. This makes analyzing the data much easier and more intuitive; the more data you have, the stronger the case you can make for linkage.
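The additivity of LOD scores follows directly from the logarithm of a product, which a short check makes concrete. The odds ratios below are hypothetical numbers picked purely for illustration, and ''combine_lods'' is a helper name invented for this sketch.

```python
import math

def combine_lods(lods):
    """Combine per-family LOD scores evaluated at the same theta by summing them."""
    return sum(lods)

# Hypothetical odds ratios (linked vs. unlinked) for two independent families,
# both evaluated at the same value of theta.
odds_family_1 = 1.3
odds_family_2 = 1.9

lod_1 = math.log10(odds_family_1)
lod_2 = math.log10(odds_family_2)
combined = combine_lods([lod_1, lod_2])

# Adding LOD scores is the same as multiplying the odds and then taking the log.
product_lod = math.log10(odds_family_1 * odds_family_2)
print(f"sum of LODs = {combined:.4f}, log10 of product = {product_lod:.4f}")
```

Note that the scores being added must all be computed at the same θ; summing LOD scores taken at different θ values does not correspond to a single hypothesis.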