QuBi/modules/biol303: Difference between revisions

From QiuLab
imported>Weigang
imported>Lab
 
(52 intermediate revisions by 2 users not shown)
Line 3: Line 3:


==Objectives==
==Objectives==
# Understand the RNA-SEQ technology and its use in genomewide identification of gene functions.
# Understand the RNA-SEQ technology and its use in genome-wide identification of gene functions.
# Be able to identify co-expressed and co-repressed genes based on time-course gene expression data.
# Be able to identify co-expressed and co-repressed genes based on time-course gene expression data.
----
----


==Lab Report Grading Policy==
==Lab Report Grading Policy==
; '''Introduction''' ''(3 pts)'' : Define transcriptome. List key steps in RNA-SEQ technology. Describe advantages of high-throughput technologies in comparison with traditional gene-by-gene approaches of studying gene function. Your statements are not to be copied from the Lab Manual.
; '''Introduction''' ''(1 pt)'' : Define transcriptome. List key steps in RNA-SEQ technology. Describe advantages of high-throughput technologies in comparison with traditional gene-by-gene approaches of studying gene function. Your statements are not to be copied from the Lab Manual.


; '''Materials and Methods''' ''(4 pts)'': Describe experimental procedures of the study that have produced these gene expression data by reading [http://genomebiology.com/2010/11/3/R35 this paper] and [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17637 this experimental report]. Answer the following questions:
; '''Materials and Methods''' ''(1 pt)'': Describe the experimental procedures of the study that produced these gene expression data by reading [http://genomebiology.com/2010/11/3/R35 this paper] and [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17637 this experimental report]. Answer the following questions:
# Name of the two species used in experiments
# Name of the two species used in experiments
# How many genes were measured for their expression levels?
# How many genes were measured for their expression levels?
# How many time points, developmental stages, and cell types have been tested?
# How many time points, developmental stages, and cell types have been tested?


; '''Results''' ''(12 pts)'' : Answers to questions A through L. Reproduce answers from your worksheets. For the questions that are TABLES, copy the table but you can omit some rows as indicated on some of the table captions.
; '''Results''' ''(5 pts)'':
# Table 2 (annotation for 15 genes)
# Expression profiles (screen capture for 15 genes)
# Table 4 (correlation coefficients)
# Heat map of 15 genes (screen capture)


; '''Discussion''' ''(9 pts)'' : Answer three discussion questions.
; '''Discussion''' ''(2 pts)'' : Answer the four discussion questions.


; '''Summary/Conclusion''' ''(1 pt)'' : A sentence or two will suffice.
; '''Summary/Conclusion''' ''(1 pt)'' : A sentence or two will suffice.
Line 28: Line 32:
[[File:Bio_202_fig_4.jpg|thumb|frameless|right|'''Figure 2. ''' '''a)''' Results from four individual Northern blots examining four different genes and measuring mRNA production over time, as indicated. b) Results from a series of microarrays for the same four genes of interest. Note the color scale on the bottom of '''b)''', where bright green indicates a 20-fold repression and bright red indicates a 20-fold induction. Black indicates no change in transcription. (Source: Campbell & Heyer. (2003). Discovering Genomics, Proteomics, & Bioinformatics. Pearson Education, Inc.)]]
[[File:Bio_202_fig_4.jpg|thumb|frameless|right|'''Figure 2. ''' '''a)''' Results from four individual Northern blots examining four different genes and measuring mRNA production over time, as indicated. b) Results from a series of microarrays for the same four genes of interest. Note the color scale on the bottom of '''b)''', where bright green indicates a 20-fold repression and bright red indicates a 20-fold induction. Black indicates no change in transcription. (Source: Campbell & Heyer. (2003). Discovering Genomics, Proteomics, & Bioinformatics. Pearson Education, Inc.)]]


[[File:RNA-SEQ-1.png|thumb|frameless|left|200px|'''Figure 3.'''  A typical RNA-Seq experiment. Briefly, long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation. Sequencing adaptors (blue) are subsequently added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome, and classified as three types: exonic reads, junction reads and poly(A) end-reads. These three types are used to generate a base-resolution expression profile for each gene, as illustrated at the bottom; a yeast ORF with one intron is shown. (Source: [http://www.ncbi.nlm.nih.gov/pubmed/19015660 Wang, Gerstein, and Snyder (2009)]). The expression level of a gene is measured by its '''FPKM''', which stands for ''fragments per kilobase of gene length per million mapped fragments''. In essence, FPKM is the amount of short reads mapped to the gene normalized by gene length and total number of mapped reads from an experiment. The normalization by gene length and total reads makes it possible to compare gene expression levels across different genes as well as among different experiments.]]
[[File:RNA-SEQ-1.png|thumb|frameless|left|200px|'''Figure 3.'''  A typical RNA-Seq experiment. Briefly, long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation. Sequencing adaptors (blue) are subsequently added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome, and classified as three types: exonic reads, junction reads and poly(A) end-reads. These three types are used to generate a base-resolution expression profile for each gene, as illustrated at the bottom; a yeast ORF with one intron is shown. (Source: [http://www.ncbi.nlm.nih.gov/pubmed/19015660 Wang, Gerstein, and Snyder (2009)]). The expression level of a gene is measured by its '''FPKM''', which stands for ''fragments per kilobase of total gene length per million mapped reads''. In essence, FPKM is the amount of short reads mapped to a gene normalized by the gene length and the total number of reads generated from an experiment. The normalization by gene length and total reads makes it possible to compare expression levels across genes as well as among experiments.]]
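The FPKM normalization described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration: the function name <code>fpkm</code> and the example numbers (500 fragments, a 2,000 bp gene, 10 million mapped fragments) are hypothetical and do not come from the experiment.

```python
def fpkm(fragments_mapped_to_gene, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene length per million mapped fragments."""
    length_kb = gene_length_bp / 1_000                    # gene length in kilobases
    total_millions = total_mapped_fragments / 1_000_000   # sequencing depth in millions
    return fragments_mapped_to_gene / length_kb / total_millions


# Hypothetical gene: 500 fragments, 2,000 bp long, 10 million mapped fragments in total.
example_fpkm = fpkm(500, 2_000, 10_000_000)
print(example_fpkm)  # 25.0
```

Doubling the gene length or doubling the sequencing depth halves the FPKM for the same raw fragment count, which is why the normalized value can be compared across genes and across experiments.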


'''Gene expression''' is the transcription of a DNA template into RNA molecules, some of which are eventually translated into proteins. In a multicellular organism, the subset of genes that are expressed defines and gives rise to a specific tissue or cell type. In this laboratory exercise, we will use bioinformatics techniques to identify genes up- and down-regulated in ''Dictyostelium'' during its development from a unicellular stage to a multi-cellular stage.
'''Gene expression''' is the transcription of a DNA template into RNA molecules, some of which are eventually translated into proteins. In a multicellular organism, the subset of genes that are expressed defines and gives rise to a specific tissue or cell type. In this laboratory exercise, we will use bioinformatics techniques to identify genes up- and down-regulated in ''Dictyostelium'' during its development from a unicellular stage to a multi-cellular stage.
Line 36: Line 40:
Traditionally, gene expression is studied one gene at a time using blotting techniques. For example, in a '''Northern Blot''' experiment ('''Figure 2a'''), the whole messenger RNA (mRNA) content of a cell is extracted and loaded on a solid gel slab. Different mRNA molecules are then separated using electrophoresis and transferred to a nitrocellulose sheet. To determine whether a gene is expressed, a radioactively (or fluorescently) labeled oligonucleotide probe that is specific to the gene sequence is applied to the sheet. If the gene is expressed, the probe will hybridize with a specific mRNA molecule and a black band will appear on an X-ray film. Other blotting techniques for detecting gene expression include '''Southern Blot''', in which mRNAs in a cell are reverse transcribed to their complementary DNA (cDNA) before being hybridized with gene-specific oligonucleotide probes. In a Western Blot experiment, the protein product (instead of the mRNA intermediate) of a gene is probed using antibodies (instead of the oligonucleotide probes).
Traditionally, gene expression is studied one gene at a time using blotting techniques. For example, in a '''Northern Blot''' experiment ('''Figure 2a'''), the whole messenger RNA (mRNA) content of a cell is extracted and loaded on a solid gel slab. Different mRNA molecules are then separated using electrophoresis and transferred to a nitrocellulose sheet. To determine whether a gene is expressed, a radioactively (or fluorescently) labeled oligonucleotide probe that is specific to the gene sequence is applied to the sheet. If the gene is expressed, the probe will hybridize with a specific mRNA molecule and a black band will appear on an X-ray film. Other blotting techniques for detecting gene expression include '''Southern Blot''', in which mRNAs in a cell are reverse transcribed to their complementary DNA (cDNA) before being hybridized with gene-specific oligonucleotide probes. In a Western Blot experiment, the protein product (instead of the mRNA intermediate) of a gene is probed using antibodies (instead of the oligonucleotide probes).


After the genomic revolution since 1990s, it became possible to study the expression of all genes in a cell at once using ''high-throughput'' techniques. Detecting the expression profiles of a whole genome was made possible by the availability of the whole genome sequences of bacteria, yeasts, and humans. The '''DNA microarray''' ('''Figure 2b''') is one such high throughput technique. In contrast to the Northern Blot technique in which the mRNA sample is fixed on a nylon sheet, nucleotide probes for all genes are fixed on a glass slide, creating a “gene chip”. The cellular mRNAs are reverse transcribed into cDNAs labeled with fluorescent dyes, which are then hybridized with the gene chips. After the unattached cDNAs are washed away, the fluorescent intensity remains at each probe location is measured as an indication of the amount of mRNA transcribed from each gene in a genome. The entire cellular RNA content transcribed from a genome is called a '''transcriptome'''. Each DNA microarray reading is therefore essentially a snap shot of the whole genome expression profile of a cell at a particular physiological stage. It is no longer necessary to know or decide beforehand candidate genes to be targets of exploration, as in the traditional blotting techniques.
Since the genomic revolution of the 1990s, it has become possible to study the expression of all genes in a cell at once using ''high-throughput'' techniques. Detecting the expression profiles of a whole genome was made possible by the availability of the whole genome sequences of bacteria, yeasts, and humans. The '''DNA microarray''' ('''Figure 2b''') is one such high-throughput technique. In contrast to the Northern Blot technique in which the mRNA sample is fixed on a nylon sheet, nucleotide probes for all genes are fixed on a glass slide, creating a “gene chip”. The cellular mRNAs are reverse transcribed into cDNAs labeled with fluorescent dyes, which are then hybridized with the gene chips. After the unattached cDNAs are washed away, the fluorescent intensity remaining at each probe location is measured as an indication of the amount of mRNA transcribed from each gene in a genome. The entire cellular RNA content transcribed from a genome is called a '''transcriptome'''. Each DNA microarray reading is therefore essentially a snapshot of the whole genome expression profile of a cell at a particular physiological stage. It is no longer necessary to know or decide beforehand which candidate genes are to be targets of exploration, as in the traditional blotting techniques.


Most recently, direct sequencing of the whole mRNA content of a cell using the so-called '''RNA-SEQ technology''' ('''Figure 3''') provides an alternative and even more accurate way of obtaining the transcriptome of a cell. Unlike the microarray technology, the RNA-SEQ technology allows ''de novo'' discovery of transcribed genes since it does not rely on pre-defined DNA probes. Another major advantage of the RNA-SEQ technology is its ability to detect splice variants, which are transcripts of the same gene with differentially spliced exons.
Most recently, direct sequencing of the whole mRNA content of a cell using the so-called '''RNA-SEQ technology''' ('''Figure 3''') provides an alternative and even more accurate way of obtaining the transcriptome of a cell. Unlike the microarray technology, the RNA-SEQ technology allows ''de novo'' discovery of transcribed genes since it does not rely on pre-defined DNA probes. Another major advantage of the RNA-SEQ technology is its ability to detect splice variants, which are transcripts of the same gene with differentially spliced exons.
Line 45: Line 49:


==Procedures==
==Procedures==
===Search gene information on [http://dictybase.org/ DictyBase]===
'''HINT''': Start a WORD or PowerPoint file as your personal lab notebook. Using this file, you could copy and paste gathered information as well as write notes to yourself. 
===Understand the design of an RNA-SEQ experiment using [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17637 NCBI GEO database]===
# Name and describe the two species tested in experiments
# How many genes were measured for their expression levels for each species?
# How many time points, developmental stages, and cell types have been tested for expression differences?
# How many replicates for each developmental stage?
----


'''Question A:''' Consider a microarray with only two spots, one for Gene A and one for Gene B. A researcher cultures two yeast strains in Glucose solution, extracts mRNA from the cells, creates cDNA with either red dye (for the experimental strain) or green dye (for the control strain). The dye containing cDNA is then allowed to combine with a microarray containing spots with complementary DNA for Gene A and Gene B. In each situation below, predict what color the two spots will show. The first answer is filled in for you as an example.
===Search for gene information using [http://dictybase.org/ DictyBase]===
# Select at least five genes from each of 3 gene groups in Table 1
# For each selected gene, look up its annotation in [http://dictybase.org/ DictyBase] by pasting the ID into the search box (top right) and clicking "Search All"
# Collect the gene information and make a table by following the example in Table 2
{| class="wikitable"
|+Table 1. Gene lists
|-
! Gene Group !! DictyBase IDs
|-
| Group A || DDB_G0267376 DDB_G0276887 DDB_G0286385 DDB_G0278077 DDB_G0285425 DDB_G0283385 DDB_G0274569 DDB_G0269108 DDB_G0291372 DDB_G0269124 DDB_G0284331 DDB_G0280047 DDB_G0283907 DDB_G0292436 DDB_G0289329 DDB_G0289075 DDB_G0288677 DDB_G0277215 DDB_G0275687 DDB_G0280961 DDB_G0281381 DDB_G0287291 DDB_G0286121 DDB_G0288041 DDB_G0292266 DDB_G0281387
|-
| Group B || DDB_G0277823 DDB_G0292460 DDB_G0271976 DDB_G0278539 DDB_G0288273 DDB_G0281677 DDB_G0285277 DDB_G0286117 DDB_G0291526 DDB_G0290141 DDB_G0271668 DDB_G0283597 DDB_G0283741 DDB_G0272893 DDB_G0268302 DDB_G0289593 DDB_G0284093 DDB_G0285759 DDB_G0281469 DDB_G0267604 DDB_G0293700 DDB_G0281565 DDB_G0273191 DDB_G0285881 DDB_G0276871 DDB_G0286399 DDB_G0275881 DDB_G0286075 DDB_G0283275 DDB_G0292388 DDB_G0293742
|-
| Group C || DDB_G0275703 DDB_G0282247 DDB_G0269624 DDB_G0278867 DDB_G0280049 DDB_G0290439 DDB_G0269298 DDB_G0293184 DDB_G0293124 DDB_G0274211 DDB_G0269424 DDB_G0282943 DDB_G0286773 DDB_G0282381 DDB_G0269222 DDB_G0293396 DDB_G0271806
|}


[[File:Bio_202_3.jpg|thumb|frameless|right|90px|'''Question B''']]
{| class="wikitable"
[[File:Bio_202_2.jpg|thumb|frameless|left|100px|'''Figure 5'''. (from Campbell, p. 46)]]
|+Table 2. Gene annotations
|-
! DictyBase ID !! Gene Name !! Gene Product !! Description !! GO-Molecular Function (MF) (pick one) !! GO-Biological Process (BP) (pick one) !! GO-Cellular Component (CC) (pick one) !! Curator Notes (brief quote)
|-
| DDB_G0267376 || ''acrA'' || adenylate cyclase || contains a cyclase domain, 7 transmembrane helices, a histidine kinase domain, and two receiver domains || adenylate cyclase activity  || sporulation resulting in formation of a cellular spore || integral component of membrane || The ''acrA'' gene encodes the late developmental stage adenylate cyclase which is essential for spore encapsulation.
|}
----


'''Question B:''' Now consider a culture where one yeast population (the experimental group) is grown in a solution where the amount of glucose decreases over time to zero, while a second population (the control group) is grown in a flask where the glucose concentration remains nearly constant. The researcher then extracts mRNA from the yeast cells to create red-dyed cDNA for the experimental strain and green-dyed cDNA for the control strain. The dye-containing cDNA is then allowed to combine with a microarray containing spots with complementary DNA for a cytochrome C gene and a tRNA synthetase gene. The expression levels in the control group are assumed to remain constant for both genes but the expression levels for the experimental group change according to the graph in '''Figure 5'''. In each situation, predict what color the two spots will show.
===Explore expression profiles of individual genes using [http://www.dictyExpress.org dictyExpress]===
 
# Click on the website for dictyExpress and "Run dictyExpress (RNA-seq)."
[[File:Bio_202_4.jpg|thumb|frameless|90px|'''Table 1 (Question C)''']]
# A tutorial may start if this is your first time using the website. Feel free to do the tutorial. If you do not want to do the tutorial, close the tutorial box.
[[File:Bio_202_fig_2.jpg|thumb|frameless|'''Figure 2''' <sub>The red and green cDNAs are mixed, placed on the same chip, covered by a glass coverslip, and incubated overnight with the DNA microarray. During this time, fluorescent cDNA will adhere to the appropriate spots on the glass slide. Later, under the bright light of a scanner, the spots will glow red or green, depending on which cDNA is bound to them and will glow yellow if similar amounts of both cDNA are bound to them. This glow can be quantified.</sub>.]]
# In the "Experiment and Gene Selection" panel, select "1. D. discoideum vs D. purpurem, Parikh A et.al., D. discoideum." This will select the experiment you read about. Make sure it is highlighted before you do the next steps.
[[File:Bio_202_5.jpg|thumb|frameless|60px|'''Table 2 (Question D)''']]
# In the "Experiment and Gene Selection" panel, type in a Gene Name from Group A (e.g., acrA) in the area under “Genes.”
 
# Click "Update Selection." A plot should generate in the "Expression Time Course" panel. Screen shot this plot in your notebook/Word Document/PowerPoint file (Hints: Check the "Legend" box to show the gene’s name on the plot. Click the lower right arrow to expand the plot to full screen as needed. You can move around the windows as well by dragging and dropping the window near the title of the window. For example, you can drag and drop the “Expression Time Courses” window to the center of the screen if you wish).
===Determine expression profiles of individual genes===
# Is this gene up- or down-regulated during development?
# Understand fold change
# Repeat the above for the other genes (15 in total).
[[File:Bio_202_fig_3.jpg|thumb|frameless|'''Figure 3''' <sub>The first ten rows, 800 genes altogether, from one microarray used in the yeast starvation experiment we will discuss today. The spots appear either red for cDNA that was over-expressed in the glucose deprived experimental cells, green if that gene was under-expressed, yellow if the gene was equally expressed in both cells, or colorless if the gene was not expressed in either</sub>.]]
# Group genes by their similarity in expression profiles. How many such groups can you identify? For each group, describe the pattern and speculate on their functions.
 
----
'''Question C:''' For this question you will examine the top three rows from a microarray. Open a browser and access the image at http://cmgm.stanford.edu/pbrown/explore/M6.jpg. This is the magnified scan of an actual microarray from glucose deprived yeast. If you cannot access the image you may use an image provided by your lab instructor or '''Color Plate Figure 3'''.
 
===Identify co-regulated genes using cluster analysis===
# Understand correlation
In microarray analysis we need to precisely quantify the level of gene expression. The machine-measured brightness of the red and green dyes is usually put into a table and expressed as a fold number. Fold numbers are a measure of doubling and are the base-2 logarithm of the color signal intensity ratios.
 
The first line of the chart below is not from the microarray; it is simply a measurement of colony growth taken by shining light through the yeast colony as the cells multiplied. The second line is a measure of the ever-dropping nutrient concentration in the medium.
 
The third and fourth lines quantify the color of a single microarray spot (in this case, the gene NUP120) over time. For each time point, calculate the ratio of color intensity (red / green) and the base-two-logarithm of that ratio [ log<sub>2</sub>(red/green) ] . The first has been done for you as an example.
 
'''Question D:''' On your lab report only copy the portions of '''Table 2''' which you filled in, i.e. the bottom two rows. Your answers may differ slightly from your labmates, depending on your perception of color and quality of the image you are viewing. '''Table 1''' is only meant to be an estimation, not an exact number.
 
'''Question E:''' What can you tell about the yeast cell density by looking at the O.D. line of the chart? How was this measured?
 
'''Question F:''' What molecule must bind to the microarray in order to create a strong green signal?
What makes the green glow? Which type of cells, the control or the experimental sample, is the
green signal associated with and which type of cells is the red signal associated with?
 
:'''''Quick review of logarithms:'''''
:*<small>Base two logarithms are a measure of how many times a quantity has doubled. log2(1) = 0, log2(8) = 3, log2(0.25) = -2 <sub>[note the negative sign]</sub>. If your calculator doesn't have a log2 function, you may use log10 and multiply your result by 3.3219.</small>
 
 
'''Explore the Expression Profile using EXCEL'''
 
[[File:Bio_202_6.jpg|thumb|frameless|left|'''Table 3''']]
 
The data in '''Table 3''' ''(Red-to-Green Intensity Ratios)'' were measured in an experiment where two strains of yeast were grown under conditions of decreasing glucose concentration. A whole-genome microarray measured the expression levels of 6,400 genes for the two strains as the glucose level in the medium decreased. Data for 4 of those genes are shown below; the ratio of red fluorescence intensity to green fluorescence intensity for 4 genes (COX14, NUP145, NUP133 and NUP170) are shown.
 
 
[[File:Bio_202_7.jpg|thumb|frameless|'''Figure 6''']]
*Just by looking at '''Figure 6''', make your best guess: which two genes seem to have the most similar expression levels? ___________ and _______.
*Guessing is not the best way to describe gene clusters. Statistics has a variety of ways to measure correlation. We will use Pearson's Correlation Coefficient, '''''r'''''.
 
*Excel easily measures correlation and Pearson's '''''r'''''. Like most calculations in Excel, you simply click on an empty cell, type "=", write a formula, and indicate what range of cells you wish to perform the calculation on.
 
[[File:Bio_202_8.jpg|thumb|framed|100px|center]]
The above example was created by clicking on Cell J3 and entering the formula:
'''=CORREL(B1:H1,B2:H2 )'''
which caused the spreadsheet to compare the values in Row 1 and Row 2 and print a correlation value (Pearson's r) in Cell J3 .
 
[[File:Bio_202_9.jpg|thumb|frameless|'''Table 4 (Question G)''']]
You will do this for six different pairs (one correlation has to be measured for every possible twosome,
i.e. first and second row, first and third row, etc.). Write your six pairwise correlation results into the
following table. The correlation of any gene with itself is, of course, perfect, and hence "1". A double
dash has been placed in half of the spaces to save you the trouble of writing a result twice. This may
take some time, especially if you are unfamiliar with Excel, spreadsheets, or Statistics. Work together
and ask someone who can help.
 
[[File:Bio_202_10.jpg|thumb|frameless|left|'''Table 5 (Question H)''']]
Now, since your data is from a much larger experiment (DeRisi et al. measured all 6,000-plus genes in the yeast genome), transfer your numbers into a bigger table. The table below describes the pairwise
correlation for 13 yeast genes undergoing glucose deprivation. Add your correlation data calculated in
the previous question to the 6 blank spaces in '''Table 5'''.
 
'''Question I:''' Name two pairs of genes which act in tandem, i.e. rise together and fall together as
the yeast cell experiences glucose deprivation.
 
'''Question J:''' Name two pairs of genes where, if one gene is over-expressed, the other gene will
be suppressed.
 
 
[[File:Bio_202_11.jpg|thumb|frameless|90px|'''Table 6 (Question K)''']]
<big>'''5. Database Search'''</big>
 
Your lab instructor will assign a few of the genes below to you or to your group. For whichever genes your lab instructor has assigned to you, do a fold calculation using the same math you employed in Step 2, where you calculated the ratio of gene expression (level in the experimental group divided by level in the control group). Recall that you then found the base two log (log<sub>2</sub>) of this ratio. A starving yeast cell may be expected to alter the enzymes in its metabolic pathways.
 
Get your data by searching for your gene name at http://cmgm.stanford.edu/pbrown/explore/diauxsearch.html. Use the buttons that allow you to put in a Description keyword. Note that, although the website data is labeled as a fold it is actually a simple ratio of the two colors, so you will need to take that fold and take the base 2 log of it. A line of correct answers for TPS2 has been written in for you as an example.
: ''*For example, the DeRisi experiment found that the fold increase of TPS2 was 1.11, 1.15, 1.19, 2.04, 1.96, 4, 2.27. From this one can calculate the log<sub>2</sub> values filled into the first row of '''Table 6'''.''
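The conversion from a simple ratio to a log<sub>2</sub> fold value can also be checked outside of a calculator. The following Python sketch applies it to the TPS2 ratios quoted above; the variable names and the rounding to two decimals are choices made here for illustration, not part of the lab manual.

```python
import math

# Red/green ratios for TPS2 as quoted in the text above.
tps2_ratios = [1.11, 1.15, 1.19, 2.04, 1.96, 4, 2.27]

# Base-2 logarithm of each ratio (rounded to two decimals for readability).
tps2_log2 = [round(math.log2(ratio), 2) for ratio in tps2_ratios]
print(tps2_log2)


def log2_via_log10(x):
    """Calculator shortcut from the logarithm review: log2(x) is about log10(x) * 3.3219."""
    return math.log10(x) * 3.3219


print(round(log2_via_log10(8), 2))  # 3.0
```

A ratio of 1 gives 0 (no change), ratios above 1 give positive values (induction), and ratios below 1 give negative values (repression).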
 
[[File:Bio_202_13.jpg|right|thumb|frameless|90px|'''Figure 7''' <sub>This dendrogram takes the numerical distance values from Table 7 and displays them as visual distances.</sub>]]
 
 
'''6. Distance Matrix and Clustering Dendrogram'''
[[File:Bio_202_12.jpg|thumb|frameless|90px|'''Table 7''' <sub>This distance matrix was generated from the correlation data in '''''Table 5'''''.</sub>]]
 
'''Question L:''' Most of the previous questions in this lab were just preliminary work leading up to one
big question: ''Which genes form interactive networks?'' Let us now attempt to answer that! '''Table 5'''
showed the behavior of yeast genes under glucose deprivation. The correlation data (Pearson's correlation, '''''r''''') are interpreted in '''Table 7''' as distances. The '''Table 7''' data show large values for disparate genes and approach 0 for the ideal case where two genes are expressed at exactly the same levels under varying conditions. This can then be shown graphically as a dendrogram (tree) where similar expression gives close tree-distance. Three genes, NUP145, COX4, and MLP1, still need to be placed on the dendrogram. Using the pairwise distances from '''Table 7''', place these three genes in the proper location within the tree.
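One simple way to turn a correlation into a distance is <code>d = 1 - r</code>, which gives 0 for perfectly correlated genes and larger values for dissimilar ones. Whether Table 7 was generated with exactly this rule is not stated here, so treat it as an assumption. The gene names and correlation values in the sketch below are hypothetical placeholders, not data from the experiment.

```python
def correlation_to_distance(r):
    """Map Pearson's r in [-1, 1] to a distance in [0, 2] (assumed rule: d = 1 - r)."""
    return 1.0 - r


# Hypothetical correlations between a gene to be placed and genes already on the tree.
correlations_with_new_gene = {"geneA": 0.92, "geneB": 0.10, "geneC": -0.75}

distances = {
    gene: correlation_to_distance(r)
    for gene, r in correlations_with_new_gene.items()
}

# The gene with the smallest distance is the closest neighbour on the dendrogram.
nearest_gene = min(distances, key=distances.get)
print(distances)
print("closest neighbour:", nearest_gene)
```

When placing a gene on the dendrogram by hand, the same idea applies: find the gene (or cluster) with the smallest distance in the matrix and attach the new gene next to it.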


===Identify co-regulated genes using correlational distances and cluster analysis===
[[File:Bio_202_7.jpg|thumb|frameless|'''Figure 4.''' Pearson's ''r'']]
[[File:Bio_202_8.jpg|thumb|frameless|'''Figure 5.''' Calculate ''r'' using the Excel function CORREL()]]
# Understand correlation coefficient: Pearson's Correlation Coefficient (''r'') is a measure of how tightly linked two variables are ('''Figure 4'''). Here we will use it to measure the similarity in gene expression profile between two genes. Note that correlation does not necessarily imply causation.
# Calculate correlation coefficients using EXCEL
## Excel easily measures correlation and Pearson's '''''r'''''. Like most calculations in Excel, you simply click on an empty cell, type "=", write a formula, and indicate what range of cells you wish to perform the calculation on. The example in '''Figure 5''' was created by clicking on Cell J3 and entering the formula: '''=CORREL(B1:H1,B2:H2 )''' , which caused the spreadsheet to compare the values in Row 1 and Row 2 and print a correlation value (Pearson's r) in Cell J3 .
## You will do this for all 15 pairs of the six genes in Table 3 (one correlation has to be measured for every possible pair, i.e. first and second row, first and third row, etc.). Write your 15 pairwise correlation results into Table 4. The correlation of any gene with itself is, of course, perfect, and hence "1". A dash has been placed in half of the spaces to save you the trouble of writing a result twice. This may take some time, especially if you are unfamiliar with Excel, spreadsheets, or statistics. Work together and ask someone who can help.
# Understand cluster analysis: [http://media.hhmi.org/biointeractive/click/microarray_analyzing/12.html Watch an HHMI slide presentation] on how to group genes and samples by their overall similarity in gene expression levels
# In the "Hierarchical Clustering" Panel, choose the "Pearson Correlation" for "Distance Function"
# Choose "Average Linkage" for "Linkage" and your choice of color gradient.
# Screen-capture the heatmap and record your answers to the following questions
## What is represented by each row?
## What is represented by each column?
## Compare the cluster diagram with the groups you have identified visually. Do they agree with each other?
## Do three groups of genes (Group A, B, and C; Table 1) form clusters by themselves?
{| class="wikitable"
|+Table 3. Read counts of six genes
|-
! DictyBase ID !! Gene Name !! Hour 00 !! Hour 04 !! Hour 08 !! Hour 12 !! Hour 16 !! Hour 20 !! Hour 24
|-
| DDB_G0267376||''acrA''||47.84401093||220.1386||335.2265||288.9046||333.8650||244.5453||201.6707
|-
| DDB_G0267604||''mserS''||0.563943||3.0999||49.9572||419.2713||2147.4096||1804.1120||4527.1415
|-
|DDB_G0268302||''rpl38''||67.19935935||1.6417||19.8430||5.8377||2.5870||1.5032||4.0678
|-
| DDB_G0269108||''catB''||8600.321169||4904.7794||1429.3452||1503.6344||905.1408||654.8504||995.3257
|-
| DDB_G0269222||''gefB''||3.530691422||285.1393||385.5468||261.5676||251.0168||171.1454||180.7623
|-
| DDB_G0269298||''gefX''||83.30521325||437.0953||227.8182||254.1875||124.2749||70.7291||136.7658
|}


{| class="wikitable"
|+ Table 4. Obtain the correlation coefficients using the EXCEL function CORREL
|-
!  !! ''acrA'' !! ''mserS'' !! ''rpl38'' !! ''catB'' !! ''gefB'' !! ''gefX''
|-
| ''acrA'' || 1 || ? || ? || ? || ? || ?
|-
| ''mserS'' || - || 1 || ? || ? || ? || ?
|-
| ''rpl38'' || - || - || 1 || ? || ? || ?
|-
| ''catB'' || - || - || - || 1 || ? || ?
|-
| ''gefB'' || - || - || - || - || 1 || ?
|-
| ''gefX'' || - || - || - || - || - || 1
|}
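If you want to cross-check your Excel results for Table 4, the same Pearson's ''r'' can be computed with a short Python script. This is a sketch, not part of the required procedure: the function name <code>pearson_r</code> and the dictionary layout are choices made here, and the expression values are copied from Table 3.

```python
from itertools import combinations
from math import sqrt

# Expression values from Table 3 (hours 00, 04, 08, 12, 16, 20, 24).
expression = {
    "acrA":  [47.84401093, 220.1386, 335.2265, 288.9046, 333.8650, 244.5453, 201.6707],
    "mserS": [0.563943, 3.0999, 49.9572, 419.2713, 2147.4096, 1804.1120, 4527.1415],
    "rpl38": [67.19935935, 1.6417, 19.8430, 5.8377, 2.5870, 1.5032, 4.0678],
    "catB":  [8600.321169, 4904.7794, 1429.3452, 1503.6344, 905.1408, 654.8504, 995.3257],
    "gefB":  [3.530691422, 285.1393, 385.5468, 261.5676, 251.0168, 171.1454, 180.7623],
    "gefX":  [83.30521325, 437.0953, 227.8182, 254.1875, 124.2749, 70.7291, 136.7658],
}


def pearson_r(x, y):
    """Pearson's correlation coefficient, the quantity Excel's CORREL() returns."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# One coefficient per unordered gene pair: the upper triangle of Table 4.
pairwise = {
    (gene_a, gene_b): pearson_r(expression[gene_a], expression[gene_b])
    for gene_a, gene_b in combinations(expression, 2)
}

for (gene_a, gene_b), r in pairwise.items():
    print(f"{gene_a} vs {gene_b}: r = {r:.3f}")
```

Six genes give 15 unordered pairs, which matches the 15 cells marked "?" in Table 4.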
----
----


==Discussion Questions==
==Discussion Questions==
'''Question M:''' What differences and similarities does a DNA chip have with a Northern blot?
[[File:Parikh-fig.png|thumb|frameless|450px|'''Figure 6.''' The heat maps represent the patterns of change in standardized mRNA abundance for all the genes in the ''D. discoideum'' and the ''D. purpureum'' genomes. Each row represents an average of 85 genes and each column represents a developmental time point (hours). The colors represent relative mRNA abundances (see scale). The genes are ordered according to their regulation pattern in each species. The black lines divide the transcripts, from top to bottom, into: down-regulated, intermediate regulation and up-regulated. The dendrograms represent the differences between the transcriptomes at each time point. Source:  [http://genomebiology.com/2010/11/3/R35  Parikh et al (2010)] ]]
:
# What differences and similarities does an RNA-Seq experiment have with a Northern blot?
'''Question N:''' It is stated in the DeRisi paper that “[k]nowing when and where a gene is expressed often provides a strong clue as to its biological role”. Explain how a time-course experiment using microarray could be used for discovering genes in a metabolic network (e.g., glucose utilization), or in a subcellular structure (e.g., the nuclear pore complex).
# Explain how a time-course experiment like this one could be used for discovering gene networks during development.
:
# The [http://genomebiology.com/2010/11/3/R35  Parikh et al (2010) paper] concludes that developmental pathway genes are conserved in their expression profiles between the two ''Dictyostelium'' species. Based on the two heat maps ('''Figure 6'''), identify similarities and differences in expression profiles between the two species.
'''Question O:''' In the next statement, the DeRisi paper writes that “Conversely, the pattern of genes expressed in a cell can provide detailed information about its state”. Describe how the whole-genome expression profiles could be used for, e.g., early diagnosis of cancer and drug discovery.
# Similar genes (homologs) are found in humans. Do you expect the expression of these genes to be similar during human development? What about during cancer development?
----


==References & Resources==
# [http://dictybase.org/ DictyBase: a database of Dictyostelium genes]
#[https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1706-9 DictyExpress: a web-based platform for sequence data management and analytics in Dictyostelium and beyond]
# [http://dictyexpress.biolab.si/index.htm DictyExpress: a web application to analyze gene expressions in Dictyostelium species]
#[http://dictybase.org/ DictyBase: a database of Dictyostelium genes]
# [http://www.dictyExpress.org DictyExpress: a web application to analyze gene expressions in Dictyostelium species]: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1706-9 Paper]
# [http://genomebiology.com/2010/11/3/R35  Parikh et al (2010). Conserved developmental transcriptomes in evolutionarily divergent species. ''Genome Biology'', 11:R35]
# [http://www.ncbi.nlm.nih.gov/pubmed/19015660 Wang, Gerstein, and Snyder (2009). RNA-Seq: a revolutionary tool for transcriptomics. ''Nature Reviews Genetics'']
Line 159: Line 162:


----
# Read [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17637 this experimental report] and extract the following information:
## Name of the two species used in experiments
## How many genes were measured for their expression (i.e., mRNA) levels?
## Describe a biological question that can be answered by this experiment (e.g., which genes are expressed at a particular developmental stage)
# [http://dictyexpress.biolab.si/index.htm Go to dictyExpress] and explore the time course of a set of genes
## Choose the 2nd Box: "Run dictyExpress (RNA-seq)"
## In the "Gene Selection" Box, type the following gene names one at a time (DON'T copy and paste; when the gene is found, highlight it and press enter): acrA, catB, dcsA, acgA, abcG18
## Click "Update" and answer the question based on the plot in the "Expression Profile" panel: Are these genes up- or down-regulated during development?
# Do the same for the 2nd set of genes: mserS, rpl38, rpsA, rpl35a, gfm1
# Do the same for the 3rd set of genes: gefB, gefX, gxcB, mgp3, gefN
# Combine all 3 sets of genes and produce a heatmap
## In the "Hierarchical Clustering" Panel, choose the "Pearson Correlation" for "Distance Function"
## Choose "Average Linkage" for "Linkage" and your choice of color gradient
## What is represented by each row?
## What is represented by each column?
## Do these 3 sets of genes form clusters by themselves?
## [http://media.hhmi.org/biointeractive/click/microarray_analyzing/12.html HHMI slides: A technical description] of how to group genes and samples by their overall similarity in gene expression levels

Latest revision as of 21:39, 10 March 2021

Bioinformatics Lab: Exploration of Gene Expression in Dictyostelium species

Figure 1. Development of Dictyostelium (M. Grimson, R. Blanton, Texas Tech University)

Objectives

  1. Understand the RNA-SEQ technology and its use in genome-wide identification of gene functions.
  2. Be able to identify co-expressed and co-repressed genes based on time-course gene expression data.

Lab Report Grading Policy

Introduction (1 pt)
Define transcriptome. List key steps in RNA-SEQ technology. Describe advantages of high-throughput technologies in comparison with traditional gene-by-gene approaches of studying gene function. Your statements are not to be copied from the Lab Manual.
Materials and Methods (1 pt)
Describe experimental procedures of the study that have produced these gene expression data by reading this paper and this experimental report. Answer the following questions:
  1. Name of the two species used in experiments
  2. How many genes were measured for their expression levels?
  3. How many time points, developmental stages, and cell types have been tested?
Results (5 pts)
  1. Table 2 (annotation for 15 genes)
  2. Expression profiles (screen capture for 15 genes)
  3. Table 4 (correlation coefficients)
  4. Heat map of 15 genes (screen capture)
Discussion (2 pts)
Answer the four discussion questions.
Summary/Conclusion (1 pt)
A sentence or two will suffice.
References (1 pt)
Credit is given for pertinent references obtained from sources other than the Lab Manual.

Introduction

Figure 2. a) Results from four individual Northern blots examining four different genes and measuring mRNA production over time, as indicated. b) Results from a series of microarrays for the same four genes of interest. Note the color scale on the bottom of b), where bright green indicates a 20-fold repression and bright red indicates a 20-fold induction. Black indicates no change in transcription. (Source: Campbell & Heyer. (2003). Discovering Genomics, Proteomics, & Bioinformatics. Pearson Education, Inc.)
Figure 3. A typical RNA-Seq experiment. Briefly, long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation. Sequencing adaptors (blue) are subsequently added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome, and classified as three types: exonic reads, junction reads and poly(A) end-reads. These three types are used to generate a base-resolution expression profile for each gene, as illustrated at the bottom; a yeast ORF with one intron is shown. (Source: Wang, Gerstein, and Snyder (2009)). The expression level of a gene is measured by its FPKM, which stands for fragments per kilobase of gene length per million mapped reads. In essence, FPKM is the number of short reads mapped to a gene, normalized by the gene length and by the total number of reads generated in an experiment. Normalizing by gene length makes expression levels comparable across genes, and normalizing by total reads makes them comparable among experiments.
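The FPKM definition in the Figure 3 caption can be written out as a short calculation. The sketch below is only an illustration of the arithmetic: the function name fpkm and the example numbers are hypothetical and are not taken from the Parikh et al. data set.

```python
def fpkm(fragments_mapped, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene length per million mapped fragments."""
    length_kb = gene_length_bp / 1_000                   # gene length in kilobases
    total_millions = total_mapped_fragments / 1_000_000  # sequencing depth in millions
    return fragments_mapped / (length_kb * total_millions)


# Hypothetical example: 500 fragments map to a 2,000-bp gene
# in an experiment that produced 10 million mapped fragments.
example_value = fpkm(500, 2_000, 10_000_000)
print(example_value)
```

With the same number of mapped fragments, a longer gene or a deeper sequencing run gives a lower FPKM, which is the point of the two normalizations.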

Gene expression is the transcription of a DNA template into RNA molecules, some of which are eventually translated into proteins. In a multicellular organism, the subset of genes that are expressed defines and gives rise to a specific tissue or cell type. In this laboratory exercise, we will use bioinformatics techniques to identify genes up- and down-regulated in Dictyostelium during its development from a unicellular stage to a multicellular stage.

Due to its unique mode of development (Figure 1), Dictyostelium is an important model organism for the study of how multicellular organisms evolved from unicellular ones. It is also a key disease model for understanding cancer, especially regarding the mechanism of cell migration, chemotaxis, and metastasis.

Traditionally, gene expression has been studied one gene at a time using blotting techniques. For example, in a Northern Blot experiment (Figure 2a), the whole messenger RNA (mRNA) content of a cell is extracted and loaded on a gel slab. Different mRNA molecules are then separated using electrophoresis and transferred to a nitrocellulose sheet. To determine whether a gene is expressed, a radioactively (or fluorescently) labeled oligonucleotide probe that is specific to the gene sequence is applied to the sheet. If the gene is expressed, the probe will hybridize with a specific mRNA molecule and a black band will appear on an X-ray film. Other blotting techniques include the Southern Blot, which probes DNA rather than RNA; to detect gene expression with it, mRNAs in a cell are reverse transcribed to their complementary DNA (cDNA) before being hybridized with gene-specific oligonucleotide probes. In a Western Blot experiment, the protein product (instead of the mRNA intermediate) of a gene is probed using antibodies (instead of the oligonucleotide probes).

Following the genomic revolution that began in the 1990s, it became possible to study the expression of all genes in a cell at once using high-throughput techniques. Detecting the expression profiles of a whole genome was made possible by the availability of the whole genome sequences of bacteria, yeasts, and humans. The DNA microarray (Figure 2b) is one such high-throughput technique. In contrast to the Northern Blot technique, in which the mRNA sample is fixed on a membrane sheet, nucleotide probes for all genes are fixed on a glass slide, creating a “gene chip”. The cellular mRNAs are reverse transcribed into cDNAs labeled with fluorescent dyes, which are then hybridized with the gene chip. After the unattached cDNAs are washed away, the fluorescence intensity remaining at each probe location is measured as an indication of the amount of mRNA transcribed from each gene in a genome. The entire cellular RNA content transcribed from a genome is called a transcriptome. Each DNA microarray reading is therefore essentially a snapshot of the whole-genome expression profile of a cell at a particular physiological stage. It is no longer necessary to decide beforehand which candidate genes to explore, as it is with the traditional blotting techniques.

Most recently, direct sequencing of the whole mRNA content of a cell using the so-called RNA-SEQ technology (Figure 3) provides an alternative and even more accurate way of obtaining the transcriptome of a cell. Unlike the microarray technology, the RNA-SEQ technology allows de novo discovery of transcribed genes since it does not rely on pre-defined DNA probes. Another major advantage of the RNA-SEQ technology is its ability to detect splice variants, which are alternative transcripts produced by differential splicing of the exons of the same gene.

These high-throughput technologies, however, create new technical challenges of their own. The main challenge is the analysis of the huge amount of data resulting from each microarray or sequencing experiment. First, data from high-throughput experiments need computer-assisted data processing and analysis. Second, statistical analysis and testing become essential tools for the discovery and exploration of gene functions, e.g., finding co-expressed genes.


Procedures

HINT: Start a WORD or PowerPoint file as your personal lab notebook. Using this file, you could copy and paste gathered information as well as write notes to yourself.

Understand the design of an RNA-SEQ experiment using NCBI GEO database

  1. Name and describe the two species tested in experiments
  2. How many genes were measured for their expression levels for each species?
  3. How many time points, developmental stages, and cell types have been tested for expression differences?
  4. How many replicates for each developmental stage?

Search for gene information using DictyBase

  1. Select at least five genes from each of the 3 gene groups in Table 1
  2. For each of the selected genes, search its annotation in DictyBase by copying & pasting the ID into the search box (top right) and clicking "Search All"
  3. Collect the gene information and make a table by following the example in Table 2
Table 1. Gene lists
Gene Group DictyBase IDs
Group A DDB_G0267376 DDB_G0276887 DDB_G0286385 DDB_G0278077 DDB_G0285425 DDB_G0283385 DDB_G0274569 DDB_G0269108 DDB_G0291372 DDB_G0269124 DDB_G0284331 DDB_G0280047 DDB_G0283907 DDB_G0292436 DDB_G0289329 DDB_G0289075 DDB_G0288677 DDB_G0277215 DDB_G0275687 DDB_G0280961 DDB_G0281381 DDB_G0287291 DDB_G0286121 DDB_G0288041 DDB_G0292266 DDB_G0281387
Group B DDB_G0277823 DDB_G0292460 DDB_G0271976 DDB_G0278539 DDB_G0288273 DDB_G0281677 DDB_G0285277 DDB_G0286117 DDB_G0291526 DDB_G0290141 DDB_G0271668 DDB_G0283597 DDB_G0283741 DDB_G0272893 DDB_G0268302 DDB_G0289593 DDB_G0284093 DDB_G0285759 DDB_G0281469 DDB_G0267604 DDB_G0293700 DDB_G0281565 DDB_G0273191 DDB_G0285881 DDB_G0276871 DDB_G0286399 DDB_G0275881 DDB_G0286075 DDB_G0283275 DDB_G0292388 DDB_G0293742
Group C DDB_G0275703 DDB_G0282247 DDB_G0269624 DDB_G0278867 DDB_G0280049 DDB_G0290439 DDB_G0269298 DDB_G0293184 DDB_G0293124 DDB_G0274211 DDB_G0269424 DDB_G0282943 DDB_G0286773 DDB_G0282381 DDB_G0269222 DDB_G0293396 DDB_G0271806
Table 2. Gene annotations
DictyBase ID Gene Name Gene Product Description GO-Molecular Function (MF) (pick one) GO-Biological Process (BP) (pick one) GO-Cellular Component (CC) (pick one) Curator Notes (brief quote)
DDB_G0267376 acrA adenylate cyclase contains a cyclase domain, 7 transmembrane helices, a histidine kinase domain, and two receiver domains adenylate cyclase activity sporulation resulting in formation of a cellular spore integral component of membrane The acrA gene encodes the late developmental stage adenylate cyclase which is essential for spore encapsulation.

Explore expression profiles of individual genes using dictyExpress

  1. Click on the website for dictyExpress and "Run dictyExpress (RNA-seq)."
  2. A tutorial may start if this is your first time using the website. Feel free to do the tutorial. If you do not want to do the tutorial, close the tutorial box.
  3. In the "Experiment and Gene Selection" panel, select "1. D. discoideum vs D. purpurem, Parikh A et.al., D. discoideum." This will select the experiment you read about. Make sure it is highlighted before you do the next steps.
  4. In the "Experiment and Gene Selection" panel, type in a Gene Name from Group A (e.g., acrA) in the area under “Genes.”
  5. Click "Update Selection." A plot should appear in the "Expression Time Course" panel. Take a screenshot of this plot and paste it into your notebook (Word or PowerPoint file). Hints: Check the "Legend" box to show the gene’s name on the plot. Click the lower-right arrow to expand the plot to full screen as needed. You can also rearrange the panels by dragging a panel by its title bar; for example, you can drag the “Expression Time Courses” panel to the center of the screen.
  6. Is this gene up- or down-regulated during development?
  7. Repeat the above for the other genes (15 in total).
  8. Group genes by their similarity in expression profiles. How many such groups can you identify? For each group, describe the pattern and speculate on their functions.

Identify co-regulated genes using correlational distances and cluster analysis

Figure 4. Pearson's r
Figure 5. Calculate r using the Excel function CORREL()
  1. Understand correlation coefficient: Pearson's Correlation Coefficient (r) is a measure of how tightly linked two variables are (Figure 4). Here we will use it to measure the similarity in gene expression profile between two genes. Note that correlation does not necessarily imply causation.
  2. Calculate correlation coefficients using EXCEL
    1. Excel calculates Pearson's r directly. As with most calculations in Excel, click on an empty cell, type "=", write a formula, and indicate the range of cells on which to perform the calculation. The example in Figure 5 was created by clicking on Cell J3 and entering the formula =CORREL(B1:H1,B2:H2), which compares the values in Row 1 and Row 2 and prints the correlation value (Pearson's r) in Cell J3.
    2. Do this for every pair among the six genes in Table 3 (one correlation for each possible pair, i.e., first and second row, first and third row, etc.), 15 pairs in total, and write the pairwise correlation results into Table 4. The correlation of any gene with itself is, of course, perfect, and hence "1". A dash has been placed in half of the cells to save you the trouble of writing a result twice. This may take some time, especially if you are unfamiliar with Excel, spreadsheets, or statistics. Work together and ask someone who can help.
  3. Understand cluster analysis: Watch an HHMI slide presentation on how to group genes and samples by their overall similarity in gene expression levels
  4. In the "Hierarchical Clustering" Panel, choose the "Pearson Correlation" for "Distance Function"
  5. Choose "Average Linkage" for "Linkage" and your choice of color gradient.
  6. Screen-capture the heatmap and record your answers to the following questions
    1. What is represented by each row?
    2. What is represented by each column?
    3. Compare the cluster diagram with the groups you have identified visually. Do they agree with each other?
    4. Do the three groups of genes (Group A, B, and C; Table 1) form clusters by themselves?
Table 3. Read counts of six genes
DictyBase ID Gene Name Hour 00 Hour 04 Hour 08 Hour 12 Hour 16 Hour 20 Hour 24
DDB_G0267376 acrA 47.84401093 220.1386 335.2265 288.9046 333.8650 244.5453 201.6707
DDB_G0267604 mserS 0.563943 3.0999 49.9572 419.2713 2147.4096 1804.1120 4527.1415
DDB_G0268302 rpl38 67.19935935 1.6417 19.8430 5.8377 2.5870 1.5032 4.0678
DDB_G0269108 catB 8600.321169 4904.7794 1429.3452 1503.6344 905.1408 654.8504 995.3257
DDB_G0269222 gefB 3.530691422 285.1393 385.5468 261.5676 251.0168 171.1454 180.7623
DDB_G0269298 gefX 83.30521325 437.0953 227.8182 254.1875 124.2749 70.7291 136.7658
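For readers who want to cross-check their Excel results, the CORREL calculation can also be sketched in Python. This is an optional illustration, not part of the lab procedure: the names TABLE3 and pearson_r are made up for this sketch, and the values are copied from Table 3.

```python
from math import sqrt

# Expression values from Table 3 (Hours 00, 04, 08, 12, 16, 20, 24).
TABLE3 = {
    "acrA":  [47.84401093, 220.1386, 335.2265, 288.9046, 333.8650, 244.5453, 201.6707],
    "mserS": [0.563943, 3.0999, 49.9572, 419.2713, 2147.4096, 1804.1120, 4527.1415],
    "rpl38": [67.19935935, 1.6417, 19.8430, 5.8377, 2.5870, 1.5032, 4.0678],
    "catB":  [8600.321169, 4904.7794, 1429.3452, 1503.6344, 905.1408, 654.8504, 995.3257],
    "gefB":  [3.530691422, 285.1393, 385.5468, 261.5676, 251.0168, 171.1454, 180.7623],
    "gefX":  [83.30521325, 437.0953, 227.8182, 254.1875, 124.2749, 70.7291, 136.7658],
}


def pearson_r(x, y):
    """Pearson's correlation coefficient, the quantity Excel's CORREL returns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


# Every unordered pair of genes: the upper triangle of Table 4.
genes = list(TABLE3)
pairs = [(g1, g2) for i, g1 in enumerate(genes) for g2 in genes[i + 1:]]

for g1, g2 in pairs:
    print(f"{g1}\t{g2}\t{pearson_r(TABLE3[g1], TABLE3[g2]):.3f}")
```

The loop prints one line per gene pair, matching the cells marked "?" in Table 4.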
Table 4. Obtain the correlation coefficients using the EXCEL function CORREL
acrA mserS rpl38 catB gefB gefX
acrA 1 ? ? ? ? ?
mserS - 1 ? ? ? ?
rpl38 - - 1 ? ? ?
catB - - - 1 ? ?
gefB - - - - 1 ?
gefX - - - - - 1
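The "Pearson Correlation" and "Average Linkage" settings chosen in the Hierarchical Clustering panel can be illustrated with a small sketch. It assumes the distance between two genes is 1 - r, which is a common convention but may not be exactly what dictyExpress computes; the names PROFILES, correlation_distance, and average_linkage are hypothetical. The values are those of Table 3.

```python
from math import sqrt

# Expression values from Table 3 (Hours 00 to 24).
PROFILES = {
    "acrA":  [47.84401093, 220.1386, 335.2265, 288.9046, 333.8650, 244.5453, 201.6707],
    "mserS": [0.563943, 3.0999, 49.9572, 419.2713, 2147.4096, 1804.1120, 4527.1415],
    "rpl38": [67.19935935, 1.6417, 19.8430, 5.8377, 2.5870, 1.5032, 4.0678],
    "catB":  [8600.321169, 4904.7794, 1429.3452, 1503.6344, 905.1408, 654.8504, 995.3257],
    "gefB":  [3.530691422, 285.1393, 385.5468, 261.5676, 251.0168, 171.1454, 180.7623],
    "gefX":  [83.30521325, 437.0953, 227.8182, 254.1875, 124.2749, 70.7291, 136.7658],
}


def pearson_r(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


def correlation_distance(x, y):
    """Assumed distance: 0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    return 1.0 - pearson_r(x, y)


def average_linkage(profiles):
    """Agglomerative clustering; returns the merge steps as (left, right, distance)."""
    clusters = [(name,) for name in profiles]  # start with one cluster per gene
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster gene pairs.
                dists = [correlation_distance(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                mean_dist = sum(dists) / len(dists)
                if best is None or mean_dist < best[0]:
                    best = (mean_dist, i, j)
        mean_dist, i, j = best
        merges.append((clusters[i], clusters[j], mean_dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = average_linkage(PROFILES)
for left, right, dist in merges:
    print(f"{'+'.join(left)} | {'+'.join(right)} | {dist:.3f}")
```

Each printed line is one branch point of the dendrogram, listed from the most similar pair of clusters to the final merge that joins all six genes.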

Discussion Questions

Figure 6. The heat maps represent the patterns of change in standardized mRNA abundance for all the genes in the D. discoideum and the D. purpureum genomes. Each row represents an average of 85 genes and each column represents a developmental time point (hours). The colors represent relative mRNA abundances (see scale). The genes are ordered according to their regulation pattern in each species. The black lines divide the transcripts, from top to bottom, into: down-regulated, intermediate regulation and up-regulated. The dendrograms represent the differences between the transcriptomes at each time point. Source: Parikh et al (2010)
  1. What differences and similarities does an RNA-Seq experiment have with a Northern blot?
  2. Explain how a time-course experiment like this one could be used for discovering gene networks during development.
  3. The Parikh et al (2010) paper concludes that developmental pathway genes are conserved in their expression profiles between the two Dictyostelium species. Based on the two heat maps (Figure 6), identify similarities and differences in expression profiles between the two species.
  4. Similar genes (homologs) are found in humans. Do you expect the expression of these genes to be similar during human development? What about during cancer development?

References & Resources

  1. DictyExpress: a web-based platform for sequence data management and analytics in Dictyostelium and beyond
  2. DictyBase: a database of Dictyostelium genes
  3. DictyExpress: a web application to analyze gene expressions in Dictyostelium species: Paper
  4. Parikh et al (2010). Conserved developmental transcriptomes in evolutionarily divergent species. Genome Biology, 11:R35
  5. Wang, Gerstein, and Snyder (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics
  6. Description of the experiment and data from the NCBI GEO database