Latest revision as of 21:05, 11 October 2013
- BIOL 203 Lab 4. Bioinformatics Exercises
Research in modern molecular genetics increasingly relies on genomic information and computation. The following exercises will expose you to the field of bioinformatics, including the use of online databases and statistical analysis of genetic data.
Introduction
DNA and its organization into genes make up an organism's genotype. The expression and presentation of those genes in the organism's development, physiology, and physical appearance (physical traits) make up the phenotype of the organism. Phenotypic variations among individuals of a species (e.g., humans) are caused by genotype variations, environmental factors, and interactions between genetic and environmental factors. In other words, phenotypic variations among individuals often have complex, unclear mechanisms and are not necessarily due entirely to genetic differences.
In this lab section, we will explore the concepts of phenotype and genotype by looking at the variations in the TAS2R38 gene, which is responsible for part of the sensation of taste. The taste receptor protein TAS2R38 (taste receptor 2, member 38) has been associated with the ability to taste the bitter compound phenylthiocarbamide (PTC) (Kim et al. 2003; http://www.ncbi.nlm.nih.gov/pubmed/12595690). Although most people can taste PTC ("tasters"), a certain percentage of people cannot ("nontasters"). In this experiment, you will test your Taster phenotype as well as determine your Taster genotype. Subsequently, your results and those of your classmates will be combined to test statistically whether there is an association between the Taster phenotype and TAS2R38 genotypes.
Learning goals and outcomes
- Understand phenotype, genotype, and their association
- Be able to use the NCBI online databases
- Be able to compare genes among species using phylogeny
- Be able to predict genotype frequencies based on Hardy-Weinberg equilibrium
- Be able to use the contingency-table test of genotype-phenotype associations
Web Exercise 1. Search for gene information using NCBI online databases
- Point your browser to the NCBI Human Genome Resource page (http://www.ncbi.nlm.nih.gov/genome/guide/human/)
- Type "TAS2R38" in the "Find A Gene" search box and select "Homo sapiens" from the pull-down menu. Click "Go"
- Select the first link, which leads to an NCBI Gene Card page. Use the Gene Card to identify the following information on the TAS2R38 gene:
  - NCBI GeneID
  - Chromosome location
  - Click on "GenBank" and identify its gene structure, including the lengths of the primary transcript, coding sequence, 5'-UTR, and 3'-UTR. Does it have any introns?
  - Zoom out of the Sequence View to find its neighboring genes. Zoom in to read DNA sequences.
- Click the link to OMIM (under Phenotype) and find phenotypes associated with the TAS2R38 gene
  - What does OMIM stand for?
  - What are the expected "taster" and "nontaster" frequencies within human populations?
  - If the ability to taste bitterness is evolutionarily advantageous, how are alleles contributing to the "nontaster" phenotype maintained in populations?
  - Is the correlation between TAS2R38 gene variations and the PTC phenotype variations 100%? If not, what could be the other causes?
Web Exercise 2. Cross-species comparisons with HomoloGene
- From the NCBI "TAS2R38" Gene page, click the "HomoloGene" link under "Related Information" (right-side navigation panel)
- You should see a page listing TAS2R38 orthologs (i.e., the same gene in different species) from 7 mammalian species: human (Homo sapiens), chimpanzee (Pan troglodytes), macaque (Macaca mulatta), dog (Canis lupus familiaris), cow (Bos taurus), rat (Rattus norvegicus), and mouse (Mus musculus).
- Write down your expectations for the following species relationships:
  - Is chimpanzee more closely related to macaque or to human?
  - Is dog more closely related to mouse or to cow?
  - Are rat and mouse more closely related to each other than human and chimpanzee are?
- Click on the link "Show Pairwise Alignment Scores" under "Protein Alignments" and fill in the following table when the page loads. Do these sequence-comparison results change your expectations above? Explain.
Species pair | % Protein Sequence Identity | % DNA Seq Identity |
---|---|---|
Chimp-Human | ? | ? |
Chimp-Macaque | ? | ? |
Dog-Cow | ? | ? |
Dog-Mouse | ? | ? |
Rat-Mouse | ? | ? |
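To make the "% identity" columns concrete, here is a minimal Python sketch of how percent identity can be computed from two already-aligned sequences. The function name and the two toy sequences are hypothetical, and the choice to skip gapped columns is an assumption; HomoloGene and BLAST may count gaps differently, so small discrepancies with the website are expected.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are excluded from the comparison
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned protein fragments (hypothetical, not real TAS2R38 sequences)
seq_x = "MLTLTR-IRTV"
seq_y = "MLTLTPAIRSV"
print(percent_identity(seq_x, seq_y))
```

In this toy pair, 8 of the 10 ungapped columns match, so the function reports 80% identity.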
You can find exact differences by clicking on "Blast" for each pairwise comparison. Lastly, obtain a phylogenetic tree of TAS2R38 protein sequences from these 7 species using the phylogeny.fr website (http://www.phylogeny.fr):
- Click "Show Multiple Alignment"
- Click "Download" and, when the page loads, click "download" again
- Go to the phylogeny.fr website and select "Phylogenetic Analysis" and then "One Click" analysis
- Copy and paste your downloaded sequences into the text box and click on "Submit"
- When the analysis is finished, you should see a phylogenetic tree. Answer the following questions:
  - Define "orthologous genes"
  - What do tree nodes represent?
  - What do tree branches and branch lengths represent?
  - How do you determine species relatedness based on a phylogenetic tree?
  - Do you think this gene tree reflects species relationships? How would you improve the inference of the species tree (e.g., more genes, DNA instead of protein sequences)?
  - Do you think differences in protein sequences are associated with differences in taste perception among these species? How would you test this?
Web Exercise 3. Predict results of PCR and restriction analysis
On a printout of the DNA sequence of the TAS2R38 gene (from the GenBank link, see above),
- Identify the 5'-UTR, 3'-UTR, start codon, and stop codon.
- Identify the regions your PCR primers should bind using the Primer3 web server
  - Point your browser to the Primer3 Web Server (http://primer3.ut.ee/)
  - Select "check_primer" in the top box, and "HUMAN" in the 2nd box
  - Paste the raw gene sequence from the GenBank page (http://www.ncbi.nlm.nih.gov/nuccore/NC_000007.13?report=fasta&from=141672259&to=141673743&strand=true) into the 3rd box
  - Paste the two primer sequences (use only the sequences within {}) into the 4th and 6th boxes:
    (p2283) ttttggatccAACTGGCAGAa{TAAAGATCTCAATTTAT}; (p2285) ttttggatcc{AACACAAACCATCACCCCTATTTT}
  - Click "Pick Primers"
- Identify the base location that contains the 785 C/T SNP
- Copy and paste the expected 303-bp section and locate the Fnu4HI site using the NEBcutter website (http://tools.neb.com/NEBcutter2/)
- What are the expected fragment lengths for the C/C, C/T, and T/T genotypes?
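The restriction-analysis logic in the last two steps can be sketched in Python. This is only an illustration: it assumes the Fnu4HI recognition pattern GCNGC with the cut after the second base (worth confirming on NEBcutter), and the two 30-bp "alleles" are made-up toy sequences, not the TAS2R38 amplicon. The function names are hypothetical.

```python
import re

# Assumed Fnu4HI pattern: GCNGC (N = any base), cutting after the second base (GC^NGC).
# The pattern is its own reverse complement, so scanning one strand is enough.
# The lookahead lets overlapping sites be found as well.
SITE = re.compile(r"(?=GC[ACGT]GC)")
CUT_OFFSET = 2


def digest(seq):
    """Return fragment lengths (5' to 3') after cutting seq at every site."""
    seq = seq.upper()
    cuts = [m.start() + CUT_OFFSET for m in SITE.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def gel_bands(allele_1, allele_2):
    """Distinct band sizes expected on a gel for a diploid genotype."""
    return sorted(set(digest(allele_1) + digest(allele_2)), reverse=True)


# Toy 30-bp amplicons (hypothetical): one SNP base creates or destroys the site
cut_allele = "ATATATATATATATATATAT" + "GCAGC" + "TTTTT"
uncut_allele = "ATATATATATATATATATAT" + "GTAGC" + "TTTTT"

print(gel_bands(cut_allele, cut_allele))      # homozygous for the cut allele
print(gel_bands(cut_allele, uncut_allele))    # heterozygote shows all bands
print(gel_bands(uncut_allele, uncut_allele))  # homozygous for the uncut allele
```

The same pattern applies to the real 303-bp product: a homozygote for the allele carrying the site gives two bands, a homozygote for the other allele gives one uncut band, and a heterozygote gives all three.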
Statistical Exercise 1. Test Hardy-Weinberg Equilibrium
The Hardy-Weinberg Equilibrium (HWE) predicts the genotype frequencies at a genetic locus in a random-mating population. If two alleles (e.g., C and T) are segregating in a diploid population with respective frequencies $p$ and $q$, mating is random, and there is no fitness difference between the two alleles, then the genotype frequencies (expected from random Mendelian segregation) are stably maintained throughout the generations with the following values: C/C ($p^2$), C/T ($2pq$), and T/T ($q^2$).
1. After PCR/sequencing experiments, collect genotype frequencies in your class as a group using the following table:
Table 1. Observed Genotype Frequencies
Genotype | C/C | C/T | T/T | Total |
---|---|---|---|---|
Count | $N_{CC}$ | $N_{CT}$ | $N_{TT}$ | $N$ |
Frequency | $f_{CC}$ | $f_{CT}$ | $f_{TT}$ | 1 |
2. Calculate SNP allele frequencies using the following formula:
$f_C = (2N_{CC} + N_{CT}) / (2N)$
$f_T = (2N_{TT} + N_{CT}) / (2N)$
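As a quick check of the allele-frequency formulas above, the following Python sketch applies them to a made-up class of 25 students. The function name and the counts are hypothetical and only illustrate the arithmetic.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (f_C, f_T) from genotype counts at a two-allele locus."""
    n = n_cc + n_ct + n_tt
    if n == 0:
        raise ValueError("no individuals genotyped")
    # Each individual carries two alleles, so the denominator is 2N
    f_c = (2 * n_cc + n_ct) / (2 * n)
    f_t = (2 * n_tt + n_ct) / (2 * n)
    return f_c, f_t


# Hypothetical class data: 12 C/C, 10 C/T, 3 T/T
f_c, f_t = allele_frequencies(12, 10, 3)
print(f_c, f_t)
```

With these counts, C makes up 34 of the 50 alleles (0.68) and T the remaining 16 (0.32); the two frequencies should always sum to 1.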
3. Predict expected genotype frequencies using HWE:
Table 2. Predicted Genotype Frequencies
Genotype | C/C | C/T | T/T | Total |
---|---|---|---|---|
Frequency | $f_C \times f_C$ | $2 \times f_C \times f_T$ | $f_T \times f_T$ | 1 |
Count | (multiply the above by $N$) | (multiply the above by $N$) | (multiply the above by $N$) | $N$ |
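Table 2 can be filled in with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and the inputs (an allele frequency of 0.68 for C in a class of 25) are made-up values rather than real class data.

```python
def hwe_expected_counts(f_c, n):
    """Expected genotype counts under HWE for allele frequency f_c and sample size n."""
    f_t = 1.0 - f_c  # two alleles only, so frequencies sum to 1
    return {
        "C/C": f_c * f_c * n,
        "C/T": 2 * f_c * f_t * n,
        "T/T": f_t * f_t * n,
    }


# Hypothetical inputs: f_C = 0.68 in a class of N = 25
expected = hwe_expected_counts(0.68, 25)
for genotype, count in expected.items():
    print(genotype, round(count, 2))
```

Expected counts are usually not whole numbers; keep the decimals when they are used in the goodness-of-fit calculation.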
4. Test the goodness-of-fit between the observed (Table 1) and expected (Table 2) genotype frequencies using the Chi-Squared Test:
$\chi^2 = \sum \frac{(N_{\text{observed}} - N_{\text{expected}})^2}{N_{\text{expected}}}$ (sum over all three genotypes)
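The chi-squared sum above is easy to compute by hand, but a short Python sketch makes the term-by-term structure explicit. The observed and expected counts here are hypothetical (a made-up class of 25), and the function name is an illustration rather than part of any library.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in the order C/C, C/T, T/T
observed = [12, 10, 3]
expected = [11.56, 10.88, 2.56]
chi2 = chi_squared(observed, expected)
print(round(chi2, 3))
```

For these toy numbers the statistic is small (about 0.16), meaning the observed counts sit close to the HWE expectation. Note that the chi-squared approximation becomes unreliable when expected counts are very small, which can happen in a small class.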
5. Exit Questions:
  - Compare your $\chi^2$ value with the critical value of $\chi^2$ = 3.84 (1 degree of freedom: three genotype classes, minus one, minus one more for the allele frequency estimated from the data). If your value is greater than 3.84, it is considered statistically significant at the level of p = 0.05. This means that there is less than a 1-in-20 chance of getting a value equal to or greater than yours if the population is in HWE. A significant chi-squared value in this case would suggest a deviation from HWE.
  - What are the possible causes of deviation from HWE? Biological? Statistical?
  - However, if your result is not significant (i.e., < 3.84), the interpretation could be either (1) support for HWE or, more rigorously, (2) no evidence for deviation from HWE based on the sample of your class.
Statistical Exercise 2. Test phenotype-genotype association
A genome-wide association study (GWAS) is a method for mapping phenotypes to genotypes. In a typical GWAS, frequencies of alleles (e.g., C or T at position 785) are determined in a sample of affected individuals (the "cases") as well as in a sample of unaffected individuals (the "controls"). For example, the following table shows results of a hypothetical case-control study at a locus segregating with two alleles (C and T):
Table 3. Sample Genotype Frequencies
Genotype | T/T | T/C | C/C | Total |
---|---|---|---|---|
Case | 0 | 24 | 127 | ? |
Control | 9 | 68 | 114 | ? |
Total | ? | ? | ? | ? |
Association between genotype and phenotype can be assessed with a contingency table analysis (http://en.wikipedia.org/wiki/Contingency_table), which also uses the chi-squared statistic, as in the preceding exercise. For Table 3, $\chi^2$ = 26.4 with 2 degrees of freedom (p ≈ 2 × 10⁻⁶), suggesting a significant association between genotype and disease. (Here, the result suggests that the C/C genotype is over-represented among disease cases.)
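The chi-squared value quoted for Table 3 can be reproduced with a short Python sketch. The function below is a hypothetical helper, not a library call: it derives each expected count as (row total × column total) / grand total and sums the usual (O − E)²/E terms over all cells.

```python
def contingency_chi_squared(table):
    """Return (chi-squared, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Table 3 genotype counts in the order T/T, T/C, C/C
table3 = [
    [0, 24, 127],   # cases
    [9, 68, 114],   # controls
]
chi2, df = contingency_chi_squared(table3)
print(round(chi2, 1), df)  # 26.4 2
```

The same function accepts a 2-by-2 table of allele counts, in which case it reports 1 degree of freedom.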
1. Perform an online contingency table analysis (http://www.physics.csbsju.edu/stats/contingency.html) using the hypothetical data in Table 3.
2. Using Table 3, fill in the following table with allele counts. Then perform a 2-by-2 contingency table analysis using the link above. Is there a statistically significant association between alleles and disease phenotype? Which allele (C or T) is over-represented in (i.e., statistically associated with) disease cases?
Table 4. Sample Allele Frequencies
Allele | T | C | Total |
---|---|---|---|
Case | ? | ? | ? |
Control | ? | ? | ? |
Total | ? | ? | ? |
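Converting genotype counts into allele counts follows one rule: each homozygote contributes two copies of its allele and each heterozygote contributes one copy of each. The Python sketch below shows that rule on made-up counts (deliberately not the Table 3 numbers, so the exercise is left to you); the function name is hypothetical.

```python
def allele_counts(n_tt, n_tc, n_cc):
    """Return (T count, C count) from genotype counts in one group."""
    t_count = 2 * n_tt + n_tc  # two T alleles per T/T, one per T/C
    c_count = 2 * n_cc + n_tc  # two C alleles per C/C, one per T/C
    return t_count, c_count


# Hypothetical group of 25 individuals: 3 T/T, 10 T/C, 12 C/C
t_count, c_count = allele_counts(3, 10, 12)
print(t_count, c_count, t_count + c_count)
```

A useful sanity check is that the row total in the allele table is always twice the number of individuals in that group (here 50 alleles for 25 people).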
3. Following the above two examples, perform both the genotype and allele association tests using the class data.
  - Design a 2-by-3 contingency table for three genotypes
  - Design a 2-by-2 contingency table for the C/T alleles
4. Exit Questions:
  - Is there a statistically significant association between the genotypes and the Taster phenotype?
  - Is there a statistically significant association between the alleles and the Taster phenotype?
  - Which allele (C or T) is over-represented in the Nontasters?
  - Is the association 100% (i.e., are there exceptions)?
  - What could be other causes if there are exceptions?