Southwest-University: Difference between revisions

From QiuLab
Jump to navigation Jump to search
imported>Weigang
imported>Weigang
 
(73 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<center>'''Biomedical Genomics'''</center>
<center>'''Biomedical Genomics'''</center>
<center>July 8-19, 2019</center>
<center>July 8-19, 2019</center>
<center>'''Instructor:''' Weigang Qiu, Ph.D.<br>Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center<br>Adjunct Faculty, Department of Physiology and Biophysics,
<center>'''Guest Instructor:''' Weigang Qiu, Ph.D.<br>Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center<br>Adjunct Faculty, Department of Physiology and Biophysics,
Institute for Computational Biomedicine, Weil Cornell Medical College</center>
Institute for Computational Biomedicine, Weil Cornell Medical College</center>
<center>'''Office:''' B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA</center>
<center>'''Office:''' B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA</center>
Line 36: Line 36:
* Attendance: 50 pts
* Attendance: 50 pts
* Assignments: 5 x 10 = 50 pts  
* Assignments: 5 x 10 = 50 pts  
* Quizzes: 2 x 25 pts = 50 pts
* Open-book Quizzes: 2 x 25 pts = 50 pts
* Mid-term: 50 pts
* Take-home Mid-term: 50 pts
* Final presentation: 50 pts
* Final presentation: 50 pts
Total: 250 pts
Total: 250 pts


==Course Schedule==
==Course Schedule==
{| class="wikitable"
===Session 1. Introduction & R Tutorial I===
|-
* Date & Hours: July 8 (Mon), 8:40-12:10
! Date & Hour !! Tutorials !! Assignment !! Quiz & Exam
* Lecture slides: [[File:R-part-1-small.pdf|thumbnail|Lecture slides]]
|-
* Assignment #1 (create a WORD document including scripts & graphs (i.e., compile your work into a lab report, due tomorrow)
| July 8 (Mon), 8:40-12:10 || Introduction; R Tutorial I; 
** Install R/R studio and the "tidyverse" package on your own computer
[[File:R-part-1-small.pdf|thumbnail|Lecture slides]]
** Recreate Script 1 & Mini-Practical
||
** Show help page for function "seq"
Assignment #1 (create a WORD document including scripts & graphs (i.e., compile your work into a lab report, due tomorrow)
** Download dataset
* Install R/R studio and the "tidyverse" package on your own computer
*** Create a new folder (e.g., Desktop/rtutor)
* Recreate Script 1 & Mini-Practical
*** Create a sub-folder (e.g., Desktop/rtutor/data/)  
* Show help page for function "seq"
*** Download from http://www.r4all.org/the-book/datasets
* Download dataset
*** Save to the sub-folder
** Create a new folder (e.g., Desktop/rtutor)
*** Unzip the file
** Create a sub-folder (e.g., Desktop/rtutor/data/)  
 
** Download from http://www.r4all.org/the-book/datasets
===Session 2. R Tutorial II===
** Save to the sub-folder
* Date & Hours: July 9 (Tu), 8:40-12:10
** Unzip the file
* Lecture slides: [[File:R-part-2.pdf|thumbnail|Lecture slides]]
  ||
* Assignment #2.
|-
The following is a portion of the dataset of Mycobacterium growth (kindly shared by Aswad from Dr Xie's lab). It shows OD (optical density) values. Transform this table ("wide" format) into the "tall/tidy" format (use paper & pen, no need to use R studio or any computer program):
| July 9 (Tu), 8:40-12:10 || R Tutorials II && III,
[[File:R-part-2.pdf|thumbnail|Lecture slides]]
||
Assignment #2
* The following is a portion of the dataset of Mycobacterium growth (kindly shared by Aswad from Dr Xie's lab). It shows OD (optical density) values. Transform this table ("wide" format) into the "tall/tidy" format (use paper & pen, no need to use R studio or any computer program):
{| class="wikitable"
{| class="wikitable"
|-
|-
Line 87: Line 82:
** Perform the following data transformation using the chaining operator (i.e., "%>%"): Select rows from the 3rd to the 20th, then filter by colour of "red", and then show head
** Perform the following data transformation using the chaining operator (i.e., "%>%"): Select rows from the 3rd to the 20th, then filter by colour of "red", and then show head
** Obtain the mean number of visit for each colour as a group (Hint: use "group_by" & "summarise")
** Obtain the mean number of visit for each colour as a group (Hint: use "group_by" & "summarise")
||
 
|-
===Session 3. R Tutorial III & Quiz I===
| July 10 (Wed), 8:40-12:10 || R Tutorial IV
* Date & Hours: July 10 (Wed), 8:40-12:10
[[File:R-part-3.pdf|thumbnail|Lecture slides]]
* Lecture slides: [[File:R-part-3.pdf|thumbnail|Lecture slides]]
||
* Assignment #3
Assignment #3
{| class="wikitable" style="width: 60%;"
{| class="wikitable"
|-
|-
! Task!! Graph
! Task!! Graph
|-
|-
| Use the "iris" dataset to make the plot shown at right (Hint: load data with <code>data(iris)</code>) ||  
| Use the "iris" dataset to reproduce the plot shown at right (Hint: load data with <code>data(iris)</code>) ||  
[[File:Iris-1.png|200px|thumbnail]]
[[File:Iris-1.png|200px|thumbnail]]
|-
|-
| Use the "flower" dataset (see Assignment #2 on how to load data) to make the plot shown at right ||  
| Use the "flower" dataset (see Assignment #2 on how to load data) to reproduce the plot shown at right ||  
[[File:Flower-1.png|200px|thumbnail]]
[[File:Flower-1.png|200px|thumbnail]]
|}
|}


  || Quiz I
===Session 4. Intro to NGS & R Tutorial IV===
* Date & Hours: July 11 (Thur), 8:40-12:10
* Slides for NGS: [[File:Intro-and-NGS.pdf|thumbnail]]
* Slides for R Tutorial IV; [[File:R-part-4.pdf|thumbnail|Lecture slides]]
* Take-home mid-term (50 pts)
 
===Weekend break (No class; July 12 - 14, Fri, Sat & Sun)===
===Session 5. Case Study I (Trout microbiome) & R Tutorial V===
* Date & Hours: July 15 (Mon), 8:30-12:10
* Dataset: [[File:Trout.txt|thumbnail]]
* Slides for R Tutorial Part V:  [[File:R-part-5.pdf|thumbnail|Lecture slides]]
* Slide for trout microbiome: [[File:Case-1-Trout-Microbiome.pdf|thumbnail|Lecture slides]]
R code for today's lecture
<syntaxhighlight lang='bash'>
# Case Study 1. Trout microbiome
# Date: Monday, July 15, 2019
# Author: Weigang Qiu
 
library(tidyverse)
# Load data
setwd("C:/Users/lai/Dropbox/Courses/ChongQing-2019/")
 
trout <- read_csv("Trout.txt")
glimpse(trout)
 
# Exercise 1. Transform into a long table
trout.long <- gather(trout, 2:29, key = "sample", value = "read.cts")
 
# Exercise 2. filter out phyla < 1%
trout.ph.cts <- trout.long %>% group_by(phylum) %>% summarise(phy.sum = sum(read.cts)) # counts in each phylum
trout.ph.perc <- trout.ph.cts %>% mutate(ph.perc = phy.sum/sum(phy.sum) * 100) # get percentages
trout.ph.hi <- trout.ph.perc %>% filter(ph.perc > 1) # select phyla > 1%
 
# The above could be combined using pipes
# trout.ph.hi <- trout.long %>% group_by(phylum) %>% summarise(cts = sum(read.cts)) %>% mutate(perc = cts/sum(cts) * 100) %>% filter(perc >= 1)
 
# Exercise 3. get phylum counts in each sample
trout.ph <- trout.long %>% filter(phylum %in% trout.ph.hi$phylum) # select only the hi-frequency phyla
trout.ph <- trout.ph %>% group_by(sample, phylum) %>% summarise(total.cts = sum(read.cts)) # counts in each sample
trout.ph <- trout.ph %>% mutate(per.cts = total.cts/sum(total.cts) * 100) # add perc
trout.ph %>% group_by(sample) %>% summarise(sum(per.cts)) # check
 
# Exercise 4. plot by sample
ggplot(data = trout.ph, aes(x=sample, y=per.cts, fill=phylum)) + geom_bar(stat = 'identity')
 
# Exercise 5. group by diet
trout.ph <- trout.ph %>% mutate(diet = str_remove(sample, "_[1234]")) # add a new column "diet" use regular expression
trout.diet <- trout.ph %>% group_by(diet, phylum) %>% summarise(cts.by.diet = sum(total.cts)) %>% mutate(per.by.diet = cts.by.diet/sum(cts.by.diet)*100)
trout.diet %>% group_by(diet) %>% summarise(sum(per.by.diet)) # check
ggplot(data = trout.diet, aes(x=diet, y=per.by.diet, fill=phylum)) + geom_bar(stat = 'identity') # per, stacked
</syntaxhighlight>
Assignment #4 (finalized at 8:25pm)
{| class="wikitable" style="width: 85%;"
|-
| 1. Define microbiome ||  
|-
|-
| July 11 (Thur), 8:40-12:10 || Intro to NGS; R Tutorial V ||
| 2. Explain how 16S ribosomal RNA read counts are used to quantify bacterial composition
Assignment #4
* List pros & cons of Sanger vs NGS
* Compare accuracy, read length, and error rate between Illumina and PacBio
* Describe sequence information captured with each of the following file formats: FASTA, FASTQ, SAM, VCF
* Run t-test
  ||  
  ||  
|-
|-
| July 12 - 14 (Fri, Sat & Sun) || (Weekend Break) || ||
| 3. Run the above code (or your own code) & compose a figure legend explaining the final graph (explain x-axis, y-axis, and what colors represent). Which diet is most similar to the control diet ("E")?
|| [[File:Phyla.png|400px|thumbnail]]
|}
 
===Session 6. Case Study 1. Trout microbiome (continued)===
* Date & Hours: July 16 (Tu), 8:30-12:10
* Dataset: [[File:Cancer-array.txt|thumbnail]]
* Today's code:
<syntaxhighlight lang='bash'>
# Calculate alpha diversity (at species level)
trout <- read_csv("Trout.txt")
trout.long <- gather(trout, 2:29, key = "sample", value = "read.cts") # tranform into a long table
trout.long <- trout.long %>% mutate(diet = str_remove(sample, "_[1234]")) # add diet group variable
trout.sp <- trout.long %>% group_by(diet, species) %>% summarise(total.cts = sum(read.cts)) # count species reads in each diet
trout.sp2 <- trout.sp %>% filter(species != 's__') %>% filter(total.cts > 0) # remove species with un-specified & zeros reads
trout.sp2 %>% group_by(diet) %>% count() # count num. species per diet
trout.sp2 <- trout.sp2 %>% group_by(diet) %>% mutate(frq = total.cts/(sum(total.cts))) %>% mutate(log.frq = log2(frq)) # add columns for frequency and its log2()
trout.sp2 %>% group_by(diet) %>% summarise(-sum(frq * log.frq)) # shannon = -sum(p*log2(p)) # Shannon diversity for each diet
 
# ANOVA to compare two phyla: Fusobacteria and Bacteroidetes (beta-diversity)
trout <- read_csv("Trout.txt")
trout.long <- gather(trout, 2:29, key = "sample", value = "read.cts") # tranform into a long table
trout.ph <- trout.long %>% group_by(sample, phylum) %>% summarise(cts = sum(read.cts)) # count reads for each phylum in each sample
trout.ph <- trout.ph %>% group_by(sample) %>% mutate(perc = cts/sum(cts) *100) # calculate percentages
trout.fuso <- trout.ph %>% filter(phylum == 'p__Fusobacteria') # select rows for a phylum
trout.fuso <- trout.fuso %>% mutate(diet = str_remove(sample, "_[1234]")) # add diet
ggplot(data = trout.fuso, aes(x=diet, y=perc, color=diet)) +  geom_point(size=3, alpha=0.5) + theme_bw() # plot
lm.fuso <- lm(data = trout.fuso, perc ~ diet) # run anova model
summary(lm.fuso) # show difference with reference ("dietA") and p values
anova(lm.fuso) # show overall signficance
 
mean.fuso <- trout.fuso %>% group_by(diet) %>% summarise(mean.prc = mean(perc)) # calculate mean percentages for each diet
ggplot(data = trout.fuso, aes(x=diet, y=perc, color=diet)) +  geom_point(size=3, alpha=0.5) + geom_point(data = mean.fuso, aes(x=diet, y=mean.prc), shape = 10, size=6) + theme_bw() # add mean values
 
# Make diet E the reference:
trout.fuso <- trout.fuso %>% mutate(diet.mod = ifelse(diet=='E', str_c("control", diet), str_c("treat", diet)))
lm.fuso.mod <- lm(data = trout.fuso, perc ~ diet.mod)
summary(lm.fuso.mod)
</syntaxhighlight>
* Assignment #5 (finalized @5:00pm)
# Run ANOVA to test if the percentages for the phylum "Bacteroidetes" are significantly different among the 7 diets.
# Show graph with mean percentages values
 
===Session 7. Case Study 2. Cancer microarray===
* Date & Hours: July 17 (Wed), 8:30-12:10
* Dataset: [[File:Cancer-array.txt|thumbnail]] (provided by Prof Zhu)
* Quiz II.
* Figures (created by Dr Di)
{| class="wikitable"
|-
|-
| July 15 (Mon), 8:00-12:10 || Case Study 1. Fish microbiome ||
! MA plot !! Volcano plot !! Heat map
|| Quiz II
|-
|-
| July 16 (Tu), 8:00-12:10 || Case Study 2. Transcriptome || Assignment #5 
| [[File:GeneExp1.jpeg|200px|thumbnail| fold change (y-axis) vs. total expression levels (x-axis)]] ||  
||  
[[File:GeneExp2.jpeg|200px|thumbnail| p-value (y-axis) vs. fold change (x-axis)]]
|-
||  
| July 17 (Wed), 8:00-12:10 || Case Study 3. Lyme Disease  || ||
[[File:GeneExp3.jpeg|200px|thumbnail| genes significantly down or up-regulated (at p<1e-4)]]
|-
| July 18 (Thur), 8:00-12:10|| || || Presentations
|}
|}
* Code for MA plot
<syntaxhighlight>
# Cancer microarray dataset (Data provided by Prof. Zhu)
# July 10, 2019
# Authors: Weigang Qiu & Lia Di
library(tidyverse) # load library
setwd("C:/Users/lai/Dropbox/Courses/ChongQing-2019/")
# 1. Load data & make long table
arr <- read_csv(file = "Cancer-array.txt")
arr.long <- arr %>% gather(3:8, key = "control.treat", value = "gene.expression")
arr.long <- arr.long %>% mutate(group = str_remove(control.treat, "_[123]")) # create a treat/control variable
ggplot(data = arr.long, aes(x = control.treat, y=gene.expression, fill = group)) + geom_violin() # show distribution & density; normalized & already in log2()
# 2. calculate fold change
arr.fc <- arr.long %>% group_by(aff.id, group) %>% summarise(mean.exp = mean(gene.expression))
arr.fc <- arr.fc %>% spread(group, mean.exp)
arr.fc <- arr.fc %>% mutate(fc = treat-control)
# 3. MA plot, colored by fold change
ggplot(arr.fc, aes(x=control+treat, y=fc, color=fc)) + theme_bw() + geom_point(size=0.5) + scale_color_gradient2(midpoint=0, low="blue", mid="gray", high="red") + xlab("Total Expression") + ylab("Fold Change") + geom_hline(yintercept = c(-1, 0, 1), linetype="dashed")
</syntaxhighlight>
* Code for t-test
<syntaxhighlight>
# 4. run t-tests for each gene
library(broom)
t.model <- arr.long %>% group_by(aff.id) %>% do(tidy(t.test(gene.expression ~ group, data = .))) # run t-test for each gene
ggplot(t.model, aes(x=estimate, y=p.value)) + geom_point(size=0.2) + scale_y_log10() + geom_hline(yintercept = 5e-5, color="blue", linetype="dashed") + geom_vline(xintercept = c(-1,1), color="blue", linetype="dashed") + xlab("Fold Change") + theme_bw() # volcano plot
</syntaxhighlight>
* Code for heatmap
<syntaxhighlight>
# 5. select & plot significant genes
mod.select <- t.model %>% filter(p.value < 1e-4) # select significant genes
arr.sig <- arr %>% filter(aff.id %in% mod.select$aff.id)
arr.mat <- as.matrix(arr.sig[,3:8])
rownames(arr.mat) <- arr.sig$gene.symbol
library(gplots)
heatmap(arr.mat, cexRow = 0.6, cexCol = 0.9, scale="none", col=colorpanel(360,"white","blue"), Colv = NA)
</syntaxhighlight>
===Session 8. Final presentations===
* Date & hours: July 18 (Thur), 8:30-12:10
* Hints to prepare the final presentation:
** Use the trout microbiome dataset
** Show distributions of class/order/family/genus for each diet (we did it for the phylum in class)
** Calculate Shannon indices for phylum/class/order/family/genus in each diet (we did it for the species in class)
** Select a particular phylum/class/order/family/genus/species & test significance in percentages among diets using ANOVA (we did it for the phylum Fusobacteria in class
* Rubric for tinal presentations (4 slides, 5 minute)
** Slide 1. (1 min; 5 pts). Introduction: background, question, & signficance
** Slide 2. (1 min; 10 pts). Material & Methods: sample size, replicates, control, sequencing technology, software tools, statistical analysis
** Slide 3. (2 min; 15 pts). Results: a graph: title, legends, caption, main R commands
** Slide 4. (1 min; 5 pts). Discussion, conclusion & questions
** Slide style (10 pts). Use more figures, less words; show bullets & outlines, not complete sentences
** Presentation style (5 pts). Speak loudly, slowly, and clearly. Do not read from slides.


==Papers & Datasets==
==Papers & Datasets==
Line 132: Line 277:
! Omics Application !! Paper link !! Data set !! NGS Technology
! Omics Application !! Paper link !! Data set !! NGS Technology
|-
|-
| Microbiome || [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0193652 Rimoldi_etal_2018_PlosOne] || [https://doi.org/10.1371/journal.pone.0193652.s004 S1 Dataset] || 16S rDNA amplicon sequencing
| Microbiome || [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0193652 Rimoldi_etal_2018_PlosOne] ||  
* [https://doi.org/10.1371/journal.pone.0193652.s004 S1 Dataset]  
* [[File:Trout.txt|thumbnail]]
|| 16S rDNA amplicon sequencing
|-
|-
| Transcriptome || [https://science.sciencemag.org/content/350/6264/1096 Wang_etal_2015_Science] || Tables S2 & S4 || RNA-Seq
| Transcriptome || [https://science.sciencemag.org/content/350/6264/1096 Wang_etal_2015_Science] || Tables S2 & S4 || RNA-Seq
Line 145: Line 293:
|-
|-
| TB surveillance || [https://jcm.asm.org/content/53/7/2230 Brow_etal_2015]  || [https://www.ebi.ac.uk/ena/data/view/PRJEB9206 Sequence Archives]|| Whole-genome sequencing (WGS)
| TB surveillance || [https://jcm.asm.org/content/53/7/2230 Brow_etal_2015]  || [https://www.ebi.ac.uk/ena/data/view/PRJEB9206 Sequence Archives]|| Whole-genome sequencing (WGS)
|-
| Example || Example || Example || Example
|-
| Example || Example || Example || Example
|-
| Example || Example || Example || Example
|}
|}

Latest revision as of 16:11, 30 July 2019

Biomedical Genomics
July 8-19, 2019
Guest Instructor: Weigang Qiu, Ph.D.
Professor, Department of Biological Sciences, City University of New York, Hunter College & Graduate Center
Adjunct Faculty, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weil Cornell Medical College
Office: B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA
Email: weigang@genectr.hunter.cuny.edu
Lab Website: http://diverge.hunter.cuny.edu/labwiki/


Host: Shunqin Zhu (祝顺琴), Ph.D.
Associate Professor, School of Life Science, South West University

Figure 1. Gains & losses of host-defense genes among Lyme pathogen genomes (Qiu & Martin 2014)

Course Overview

Welcome to BioMedical Genomics, a computer workshop for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and next-generation DNA -sequencing technologies, biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field.

Genome information is revolutionizing virtually all aspects of life sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as data analysis.

This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises, using published studies.

The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.

Learning goals

By the end of this course successful students will be able to:

  • Describe next-generation sequencing (NGS) technologies & contrast it with traditional Sanger sequencing
  • Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
  • Visualize and explore genomics data using RStudio
  • Replicate key results using a raw data set produced by a primary research paper

Web Links

Quizzes and Exams

Student performance will be evaluated by attendance, three (4) quizzes and a final report:

  • Attendance: 50 pts
  • Assignments: 5 x 10 = 50 pts
  • Open-book Quizzes: 2 x 25 pts = 50 pts
  • Take-home Mid-term: 50 pts
  • Final presentation: 50 pts

Total: 250 pts

Course Schedule

Session 1. Introduction & R Tutorial I

  • Date & Hours: July 8 (Mon), 8:40-12:10
  • Lecture slides:
  • Assignment #1 (create a WORD document including scripts & graphs (i.e., compile your work into a lab report, due tomorrow)
    • Install R/R studio and the "tidyverse" package on your own computer
    • Recreate Script 1 & Mini-Practical
    • Show help page for function "seq"
    • Download dataset

Session 2. R Tutorial II

  • Date & Hours: July 9 (Tu), 8:40-12:10
  • Lecture slides:
    File:R-part-2.pdf
    Lecture slides
  • Assignment #2.

The following is a portion of the dataset of Mycobacterium growth (kindly shared by Aswad from Dr Xie's lab). It shows OD (optical density) values. Transform this table ("wide" format) into the "tall/tidy" format (use paper & pen, no need to use R studio or any computer program):

Hour Control Gene Control.with.Arg Gene.with.Arg
0 0.06 0.022 0.031 0.01
4 0.087 0.102 0.082 0.081
8 0.113 0.185 0.086 0.135
  • In R studio, read the dataset from the file "FlowerColourVisits.csv" and save it into an object named as "flower"
    • Show head, tail, dimension of the data frame "flower"
    • Show data summary with "summary" & "glimpse" commands. Which column is a categorical data type?
    • Select the column named "colour"
    • Select rows from the 3rd to the 20th
    • Select the 3rd, 10th, and 20th rows
    • Select only the rows that have the colour of "red" (hint use colour=="red"
    • Create a new column, named "logVisit", that is log(1+number.of.visit)
    • Sort the "flower" data by the column "number.of.visit"
    • Perform the following data transformation using the chaining operator (i.e., "%>%"): Select rows from the 3rd to the 20th, then filter by colour of "red", and then show head
    • Obtain the mean number of visit for each colour as a group (Hint: use "group_by" & "summarise")

Session 3. R Tutorial III & Quiz I

  • Date & Hours: July 10 (Wed), 8:40-12:10
  • Lecture slides:
    File:R-part-3.pdf
    Lecture slides
  • Assignment #3
Task Graph
Use the "iris" dataset to reproduce the plot shown at right (Hint: load data with data(iris))
Iris-1.png
Use the "flower" dataset (see Assignment #2 on how to load data) to reproduce the plot shown at right
Flower-1.png

Session 4. Intro to NGS & R Tutorial IV

Weekend break (No class; July 12 - 14, Fri, Sat & Sun)

Session 5. Case Study I (Trout microbiome) & R Tutorial V

R code for today's lecture

# Case Study 1. Trout microbiome
# Date: Monday, July 15, 2019
# Author: Weigang Qiu

library(tidyverse)
# Load data
setwd("C:/Users/lai/Dropbox/Courses/ChongQing-2019/")

trout <- read_csv("Trout.txt")
glimpse(trout)

# Exercise 1. Transform into a long table
trout.long <- gather(trout, 2:29, key = "sample", value = "read.cts") 

# Exercise 2. filter out phyla < 1%
trout.ph.cts <- trout.long %>% group_by(phylum) %>% summarise(phy.sum = sum(read.cts)) # counts in each phylum
trout.ph.perc <- trout.ph.cts %>% mutate(ph.perc = phy.sum/sum(phy.sum) * 100) # get percentages
trout.ph.hi <- trout.ph.perc %>% filter(ph.perc > 1) # select phyla > 1%

# The above could be combined using pipes
# trout.ph.hi <- trout.long %>% group_by(phylum) %>% summarise(cts = sum(read.cts)) %>% mutate(perc = cts/sum(cts) * 100) %>% filter(perc >= 1)

# Exercise 3. get phylum counts in each sample
trout.ph <- trout.long %>% filter(phylum %in% trout.ph.hi$phylum) # select only the hi-frequency phyla
trout.ph <- trout.ph %>% group_by(sample, phylum) %>% summarise(total.cts = sum(read.cts)) # counts in each sample
trout.ph <- trout.ph %>% mutate(per.cts = total.cts/sum(total.cts) * 100) # add perc
trout.ph %>% group_by(sample) %>% summarise(sum(per.cts)) # check

# Exercise 4. plot by sample
ggplot(data = trout.ph, aes(x=sample, y=per.cts, fill=phylum)) + geom_bar(stat = 'identity')

# Exercise 5. group by diet
trout.ph <- trout.ph %>% mutate(diet = str_remove(sample, "_[1234]")) # add a new column "diet" use regular expression
trout.diet <- trout.ph %>% group_by(diet, phylum) %>% summarise(cts.by.diet = sum(total.cts)) %>% mutate(per.by.diet = cts.by.diet/sum(cts.by.diet)*100)
trout.diet %>% group_by(diet) %>% summarise(sum(per.by.diet)) # check
ggplot(data = trout.diet, aes(x=diet, y=per.by.diet, fill=phylum)) + geom_bar(stat = 'identity') # per, stacked

Assignment #4 (finalized at 8:25pm)

1. Define microbiome
2. Explain how 16S ribosomal RNA read counts are used to quantify bacterial composition
3. Run the above code (or your own code) & compose a figure legend explaining the final graph (explain x-axis, y-axis, and what colors represent). Which diet is most similar to the control diet ("E")?
Phyla.png

Session 6. Case Study 1. Trout microbiome (continued)

# Calculate alpha diversity (at species level)
trout <- read_csv("Trout.txt")
trout.long <- gather(trout, 2:29, key = "sample", value = "read.cts") # tranform into a long table
trout.long <- trout.long %>% mutate(diet = str_remove(sample, "_[1234]")) # add diet group variable
trout.sp <- trout.long %>% group_by(diet, species) %>% summarise(total.cts = sum(read.cts)) # count species reads in each diet
trout.sp2 <- trout.sp %>% filter(species != 's__') %>% filter(total.cts > 0) # remove species with un-specified & zeros reads
trout.sp2 %>% group_by(diet) %>% count() # count num. species per diet
trout.sp2 <- trout.sp2 %>% group_by(diet) %>% mutate(frq = total.cts/(sum(total.cts))) %>% mutate(log.frq = log2(frq)) # add columns for frequency and its log2()
trout.sp2 %>% group_by(diet) %>% summarise(-sum(frq * log.frq)) # shannon = -sum(p*log2(p)) # Shannon diversity for each diet

# ANOVA to compare two phyla: Fusobacteria and Bacteroidetes (beta-diversity)
trout <- read_csv("Trout.txt")
trout.long <- gather(trout, 2:29, key = "sample", value = "read.cts") # tranform into a long table
trout.ph <- trout.long %>% group_by(sample, phylum) %>% summarise(cts = sum(read.cts)) # count reads for each phylum in each sample
trout.ph <- trout.ph %>% group_by(sample) %>% mutate(perc = cts/sum(cts) *100) # calculate percentages
trout.fuso <- trout.ph %>% filter(phylum == 'p__Fusobacteria') # select rows for a phylum
trout.fuso <- trout.fuso %>% mutate(diet = str_remove(sample, "_[1234]")) # add diet 
ggplot(data = trout.fuso, aes(x=diet, y=perc, color=diet)) +  geom_point(size=3, alpha=0.5) + theme_bw() # plot
lm.fuso <- lm(data = trout.fuso, perc ~ diet) # run anova model
summary(lm.fuso) # show difference with reference ("dietA") and p values
anova(lm.fuso) # show overall signficance

mean.fuso <- trout.fuso %>% group_by(diet) %>% summarise(mean.prc = mean(perc)) # calculate mean percentages for each diet
ggplot(data = trout.fuso, aes(x=diet, y=perc, color=diet)) +  geom_point(size=3, alpha=0.5) + geom_point(data = mean.fuso, aes(x=diet, y=mean.prc), shape = 10, size=6) + theme_bw() # add mean values

# Make diet E the reference:
trout.fuso <- trout.fuso %>% mutate(diet.mod = ifelse(diet=='E', str_c("control", diet), str_c("treat", diet)))
lm.fuso.mod <- lm(data = trout.fuso, perc ~ diet.mod)
summary(lm.fuso.mod)
  • Assignment #5 (finalized @5:00pm)
  1. Run ANOVA to test if the percentages for the phylum "Bacteroidetes" are significantly different among the 7 diets.
  2. Show graph with mean percentages values

Session 7. Case Study 2. Cancer microarray

  • Date & Hours: July 17 (Wed), 8:30-12:10
  • Dataset: (provided by Prof Zhu)
  • Quiz II.
  • Figures (created by Dr Di)
MA plot Volcano plot Heat map
fold change (y-axis) vs. total expression levels (x-axis)
p-value (y-axis) vs. fold change (x-axis)
genes significantly down or up-regulated (at p<1e-4)
  • Code for MA plot
# Cancer microarray dataset (Data provided by Prof. Zhu)
# July 10, 2019
# Authors: Weigang Qiu & Lia Di
library(tidyverse) # load library
setwd("C:/Users/lai/Dropbox/Courses/ChongQing-2019/")

# 1. Load data & make long table
arr <- read_csv(file = "Cancer-array.txt")
arr.long <- arr %>% gather(3:8, key = "control.treat", value = "gene.expression")
arr.long <- arr.long %>% mutate(group = str_remove(control.treat, "_[123]")) # create a treat/control variable
ggplot(data = arr.long, aes(x = control.treat, y=gene.expression, fill = group)) + geom_violin() # show distribution & density; normalized & already in log2()

# 2. calculate fold change
arr.fc <- arr.long %>% group_by(aff.id, group) %>% summarise(mean.exp = mean(gene.expression))
arr.fc <- arr.fc %>% spread(group, mean.exp)
arr.fc <- arr.fc %>% mutate(fc = treat-control)

# 3. MA plot, colored by fold change
ggplot(arr.fc, aes(x=control+treat, y=fc, color=fc)) + theme_bw() + geom_point(size=0.5) + scale_color_gradient2(midpoint=0, low="blue", mid="gray", high="red") + xlab("Total Expression") + ylab("Fold Change") + geom_hline(yintercept = c(-1, 0, 1), linetype="dashed")
  • Code for t-test
# 4. run t-tests for each gene
library(broom)
t.model <- arr.long %>% group_by(aff.id) %>% do(tidy(t.test(gene.expression ~ group, data = .))) # run t-test for each gene

ggplot(t.model, aes(x=estimate, y=p.value)) + geom_point(size=0.2) + scale_y_log10() + geom_hline(yintercept = 5e-5, color="blue", linetype="dashed") + geom_vline(xintercept = c(-1,1), color="blue", linetype="dashed") + xlab("Fold Change") + theme_bw() # volcano plot
  • Code for heatmap
# 5. select & plot significant genes
mod.select <- t.model %>% filter(p.value < 1e-4) # select significant genes
arr.sig <- arr %>% filter(aff.id %in% mod.select$aff.id)
arr.mat <- as.matrix(arr.sig[,3:8])
rownames(arr.mat) <- arr.sig$gene.symbol
library(gplots)
heatmap(arr.mat, cexRow = 0.6, cexCol = 0.9, scale="none", col=colorpanel(360,"white","blue"), Colv = NA)

Session 8. Final presentations

  • Date & hours: July 18 (Thur), 8:30-12:10
  • Hints to prepare the final presentation:
    • Use the trout microbiome dataset
    • Show distributions of class/order/family/genus for each diet (we did it for the phylum in class)
    • Calculate Shannon indices for phylum/class/order/family/genus in each diet (we did it for the species in class)
    • Select a particular phylum/class/order/family/genus/species & test significance in percentages among diets using ANOVA (we did it for the phylum Fusobacteria in class
  • Rubric for tinal presentations (4 slides, 5 minute)
    • Slide 1. (1 min; 5 pts). Introduction: background, question, & signficance
    • Slide 2. (1 min; 10 pts). Material & Methods: sample size, replicates, control, sequencing technology, software tools, statistical analysis
    • Slide 3. (2 min; 15 pts). Results: a graph: title, legends, caption, main R commands
    • Slide 4. (1 min; 5 pts). Discussion, conclusion & questions
    • Slide style (10 pts). Use more figures, less words; show bullets & outlines, not complete sentences
    • Presentation style (5 pts). Speak loudly, slowly, and clearly. Do not read from slides.

Papers & Datasets

Omics Application Paper link Data set NGS Technology
Microbiome Rimoldi_etal_2018_PlosOne 16S rDNA amplicon sequencing
Transcriptome Wang_etal_2015_Science Tables S2 & S4 RNA-Seq
Transcriptome & Regulome Nava_etal_2019_BMCGenomics Tables S2 & S3 RNA-Seq & CHIP-Seq
Proteome Qiu_etal_2017_NPJ (to be posted) SILAC
Population genomics (Lyme) Di_etal_2018_JCM Data & R codes Amplicon sequencing (antigen locus)
Population genomics/GWAS (Human) Simonti_etal_2016_Science Table S2 whole-genome sequencing (WGS); 1000 Genome Project (IGSR)
TB surveillance Brow_etal_2015 Sequence Archives Whole-genome sequencing (WGS)