BioMed-R-2021: Difference between revisions

From QiuLab
imported>Weigang
imported>Weigang
 
(29 intermediate revisions by the same user not shown)
Line 30: Line 30:
* Visualize and explore genomics data using R & RStudio
* Visualize and explore genomics data using R & RStudio
* Replicate key results using a raw data set produced by a primary research paper
* Replicate key results using a raw data set produced by a primary research paper
==A sample of original NGS papers with data sets==
* [https://doi.org/10.1128/mSystems.00939-20 de la Cuesta-Zuluaga J, Spector TD, Youngblut ND, Ley RE. 2021. Genomic insights into adaptations of trimethylamine-utilizing methanogens to diverse habitats, including the human gut. mSystems 6:e00939-20.]
* [https://doi.org/10.1128/mSystems.00033-21 Reiter et al. 2021. Transcriptomics provides a genetic signature of vineyard site and offers insight into vintage-independent inoculated fermentation outcomes. mSystems 6:e00033-21]
* [https://doi.org/10.1128/mSystems.01350-20 Renninger et al (2021). Indoor dust as a matrix for surveillance of COVID-19. mSystems 6:e01350-20.]. [https://datadryad.org/stash/dataset/doi:10.5061/dryad.rn8pk0p8t Data available at Dryad]
* [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1133-7 Litzenburger  et al. Single-cell epigenomic variability reveals functional cancer heterogeneity. Genome Biol 18, 15 (2017).]
* [https://science.sciencemag.org/content/361/6409/1380 Cao et al. Single-cell chromatin and RNA analysis. Science (2018)]


==Web Links==
==Web Links==
* Install R base: https://cloud.r-project.org
* Install R base: https://cloud.r-project.org
* Install R Studio (Desktop version): http://www.rstudio.com/download
* Install R Studio (Desktop version): http://www.rstudio.com/download
* R Markdown Template: [http://diverge.hunter.cuny.edu/~weigang/Rmarkdown-template.Rmd R markdown template (by Hector)]
* Textbook: [http://r4all.org/#about Introduction to R for Biologists]
* Textbook: [http://r4all.org/#about Introduction to R for Biologists]
* Download: [http://www.r4all.org/books/datasets R datasets]
* Download: [http://www.r4all.org/books/datasets R datasets]
Line 73: Line 81:
===Jan 30, 2021===
===Jan 30, 2021===
* Introduction
* Introduction
* R Tutorial 1: Use interface, basic operations, load data. Slides: [[File:R-part-1.pdf|thumbnail]]
* R Tutorial 1: Use interface, basic operations, load data. (slides available on Blackboard)
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
! Assignment 1 (15 pts; Due next class, Submit using Blackboard)
! In-class Exercise & Assignment 1 (15 pts)
|-
|-
|  
|  
* (5 pts) Transform the following "untidy/wide" table into a "tidy/tall" table (print a hard copy)
* (5 pts, Due in-class) Transform the following "untidy/wide" table into a "tidy/tall" table (print a hard copy)
<pre>
<pre>
PropertyName,Density_250m,Density_500m,Density_1000m
PropertyName,Density_250m,Density_500m,Density_1000m
Line 87: Line 95:
VanCortlandtPark,0.000550151,0.000979312,0.001372675
VanCortlandtPark,0.000550151,0.000979312,0.001372675
</pre>
</pre>
* (4 pts) Make a single slide of a primary research paper using next-generation sequencing (NGS) technologies, show the following
* (10 pts, Due 2/7) Make a single slide of a primary research paper using next-generation sequencing (NGS) technologies, show the following
** proper citation (authors, title, year, journal, URL)
** proper citation (authors, title, year, journal, URL)
** NGS method (Illumina, PacBio, or NanoPore)
** NGS method (Illumina, PacBio, or NanoPore)
Line 94: Line 102:
** raw data table (show first few columns and first few rows)
** raw data table (show first few columns and first few rows)
** for example, a student has worked on tissue regeneration, the search in PubMed with key words "regeneration zebra fish transcriptome" found the following primary paper as the best because of the high quality of journal and the availability of raw data: https://www.ncbi.nlm.nih.gov/pubmed/28096348
** for example, a student has worked on tissue regeneration, the search in PubMed with key words "regeneration zebra fish transcriptome" found the following primary paper as the best because of the high quality of journal and the availability of raw data: https://www.ncbi.nlm.nih.gov/pubmed/28096348
** Papers to AVOID:
*** Reviews (not original research paper): e.g.,[https://pubmed.ncbi.nlm.nih.gov/30089861/ a review paper (usually appear first on PubMed; SKIP)]
*** Original research, but no computer readable raw data: e.g., [https://www-ncbi-nlm-nih-gov.proxy.wexler.hunter.cuny.edu/pmc/articles/PMC7737787/ Raw data not readable (PDF is not computer-readable)]
|}
|}


===Feb 6, 2021===
===Feb 6, 2021===
* Introduction to NGS: [[File:Intro-NGS.pdf|thumbnail]]
* Introduction to NGS: (slides available on Blackboard)
* 1-slide presentations on Next-Generation Sequencing Technologies (Group I)
* 1-slide presentations on Next-Generation Sequencing Technologies (Group I)
* R Tutorial, Part 2. Data manipulation with dplyr. Slides: [[File:R-tutorials-2.pdf|thumbnail]]
* R Tutorial, Part 2. Data manipulation with dplyr. Slides: [[File:R-tutorials-2.pdf|thumbnail]]
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
! Assignment 2 (10 pts; Due next session)
! In-class Exercise 2 (10 pts; Due in class)
|-
|-
|  
|  
* (3 pts) Print a copy of your 2nd R script, with proper annotations
* Show R commands for the following operations
* (4 pts) Show following commands with the chaining operator ("%>%") for the "iris" data set (4 individual commands; not a single one)
** load the "tidyverse" library
** load the "iris" data
** Select columns "Sepal.Length" & "Species"
** Select columns "Sepal.Length" & "Species"
** Filter rows 2 through 10
** Filter rows 2 through 10
** Add a column "logSepalLength" by taking the logarithm of the said column
** Add a column "logSepalLength" by taking the logarithm of the said column
** Calculate mean and standard deviation of Petal.Length in each species
** Calculate mean and standard deviation of Petal.Length in each species
* (3 pts) Transform the "iris" data table into a "tidy/tall" table (manually, show first 10 rows, print a hard copy)
** Save all commands in a script "in-class-ex-2.R"
|}
|}
* Assignment 1 Due next day


===Feb 13, 2021===
===Feb 13, 2021===
* NGS presentations (Group II)
* NGS presentations
* R Tutorial. Part 3. Data visualization with ggplot2. Slides: [[File:R-tutorials-3.pdf|thumbnail]]
* R Tutorial. Part 3. Data visualization with ggplot2. Slides: [[File:R-tutorials-3.pdf|thumbnail]]
* No assignment (go over slides and 3 tutorial scripts to prepare for Quiz next week)
* Assignment 2: Submit R-Markdown with scatter plot & boxplot


===Feb 20, 2020===
===Feb 20, 2020===
* Quiz 1 (Open Book)
* R Tutorial: Part 4. BioStat (chi-square & t-test) Lecture slides: [[File:R-tutorial-4.pdf|thumbnail]]
* R Tutorial: Part 4. BioStat (chi-square & t-test) Lecture slides: [[File:R-tutorial-4.pdf|thumbnail]]
{| class="wikitable sortable mw-collapsible"
* Presentation
! Assignment 3 (10 pts). In-class workshop. Evaluation of papers according to the following rubrics (submit by email)
* Assignment 3 (10 pts). R Markdown upload (by Wed 8pm)
|-
|
* Citation & PubMed Link
* Main research question
* Samples, sample sizes, & controls
* Omics technologies (e.g., genomics, metagenomics, microbiome, transcriptome, proteome, methylome, RNA-seq, 16S amplicon sequencing)
* Sequencing platform (e.g., Illumina, PacBio, nanopore)
* Main computational tools (e.g., R, RStudio, QIIME)
* Main graphics (e.g., scatterplot, boxplot, heatmap, volcano plot)
* Main statistical analysis (e.g., t-test, chi-square, regression analysis)
* Data set: a short description & links
|}


===Feb 27, 2020===
===Feb 27, 2020===
* Paper evaluation & selection
* Review
* R Tutorial: Part 4. BioStat (regression & ANOVA) [[File:R-tutorial-5.pdf|thumbnail]]
* Quiz (30 pts)
** NGS slides & R tutorial Slides
** Excluding the four statistical tests
** Openbook but due in class (on Blackboard)


===March 6, 2020===
===March 6, 2020===
* Self Study 1 (no class): RNA-Seq analysis. '''Assignment 4 (10 pts; due 3/14/2020)''': [http://diverge.hunter.cuny.edu/~weigang/self-study-1.html Self Study 1]
* R Tutorial: Part 4. BioStat (regression & ANOVA) [[File:R-tutorial-5.pdf|thumbnail]]
* Review for mid-term exam: 6 PDF presentations (intro to NGS & 5 R-tutorials)
* Paper review
* Assignment 4 (10 pts). Submit R markdown on Blackboard


===March 13, 2020===
===March 13, 2020===
* Mid-term exam (50 pts). Open Book
* Paper review
* Review for mid-term exam: 6 PDF presentations (intro to NGS & 5 R-tutorials)


===March 20, 2020===
===March 20, 2020===
* Live Session using Blackboard Collaborator
* Mid-term
* [http://cov.genometracker.org Covid-19 Genome Tracker] (developed by the Qiu Lab)
** Focus on four statistical tests (with visualization)
* [https://wwwnc.cdc.gov/eid/article/26/6/20-0357_article Analysis of a Covid-19 symptom onset timing]
** Submit sectioned R markdown PDF: [http://diverge.hunter.cuny.edu/~weigang/Rmarkdown-template.Rmd R markdown template (by Hector)]
* R Markdown Tutorial: [http://diverge.hunter.cuny.edu/~weigang/Rmarkdown-template.Rmd R markdown template (by Hector)]
* [http://diverge.hunter.cuny.edu/~weigang/self-study-2.html In-class Exercises]
* Assignment 5 (10 pts; due next session): see above link


===March 27, 2020===
===March 27, 2020===
Line 159: Line 161:


===April 3, 2020===  
===April 3, 2020===  
* In-class workshop: [http://diverge.hunter.cuny.edu/~weigang/self-study-3.html Self-study-3: Covid-19 cases]
* No class: Holiday break


===April 10, 2020===
===April 10, 2020===
* Quiz II (25 pts; Open Book; R markdown-generated WORD/PDF file as submission)
* Paper assignments & break into project groups
* In-class workshop on identifying genes/proteins/metabolites associated with tissue regeneration
* For final presentation (10 pts)
** [https://www.pnas.org/content/114/5/E717 Article link] (submission by Jenifer)
** Each member will present his or her part
** Tutorial: [http://diverge.hunter.cuny.edu/~weigang/Case-study-for-final.html Tutorial for case study]
* For final report (90), you are required to:
** Will be used for final presentation & R markdown report
** Read the paper and identify a dataset to replicate
** Create an R markdown file to record your work
** Produce a final WORD or PDF file as final report
* Case study 1: Restriction/Modification system in Lyme pathogen: [https://jb.asm.org/content/200/24/e00395-18 Casselli et al. (2018)]


===April 17, 2020===
===April 17, 2020===
* Reference & data sets for final project have been posted & assignments have been made: [[BioMed-R-2020#Final_project_assignment|Final_project_assignment]]
* Group presentation #1. Summary slides
* Before class: read paper, download assigned Excel workbook, save data set as TSV (tab-separated file); Read into R-studio.
* During class: present the data set, including:
** Biological question
** Experimental design: samples, sample sizes, controls
** Experimental techniques/measurements
** Data set description, column by column
** Visualization to be made
** Statistical tests to be performed


===April 24, 2020===
===April 24, 2020===
* [[BioMed-R-2020#Final_project_assignment|Final_project_assignment]]
* Group presentation #2. R Markdown & Data import
* Tutorial: [http://diverge.hunter.cuny.edu/~weigang/Case-study-for-final.html Tutorial for case study (updated)]
* I will hear group presentation Round #2. I will grade individual performance by the following rubric:
* Presentations of draft figures
** Biologist: Be prepared to answer my questions regarding the study background, question, and significance
** Writer: Be prepared to show R Markdown (in R Studio) that included ALL sections from the last week's slide
** Data Scientist: Be prepared to show data table that has been read into R Studio & show a preliminary graph & visualization
** Statistician: Be prepared to answer questions on what is the statistical null hypothesis and what test to perform


===May 1, 2020===
===May 1, 2020===
* Self study (no live session)
* Group presentation #3. R Markdown & Data analysis
* Tutorial: [http://diverge.hunter.cuny.edu/~weigang/Case-study-for-final.html Tutorial for case study]
* For final report, you are required to:
** Read the paper and identify a dataset to replicate
** Create an R markdown file to record your work
** Produce a final WORD or PDF file as final report
* '''Your final report (100 pts) should include the following required components''':
** (10 pts) Section 1. Background & Objectives. Describe (a) the overall goal of the study; (b) the specific question to be addressed by your dataset
** (20 pts) Section 2. Material & Methods. Describe experimental design, i.e., how your assigned data set was generated, including the nature of the biological samples, sample size, number of replicates (biological & technical), controls (if any), sequencing technologies. Hint: Fig S1
** (40 pts) Section 3. R codes & graphs. Show R codes with comments for individual commands. Graphics should be as close to the published figure as possible (e.g., with proper axis labels)
** (10 pts) Section 4. Statistical analysis. Show null hypothesis and p-value. Draw statistical conclusion
** (10 pts) Section 5. Conclusion. Draw biological conclusions of your analysis
** (10 pts) Section 6. Citations/source/URL to paper, your dataset, and methods


===May 8, 2020===
===May 8, 2020===
* Consultation (no mandatory participation)
* Consultation by appointment (no live session)


===May 15, 2020===
===May 15, 2020===
* Consultation (no mandatory participation)
* Consultation by appointment (no live session)
* '''Submit your Teacher's Evaluation''', using either:
** Personal computer at [http://www.hunter.cuny.edu/te www.hunter.cuny.edu/te]; or,
** Smartphone at [http://www.hunter.cuny.edu/mobilete www.hunter.cuny.edu/mobilete]


===May 22, 2020===
===May 22, 2020===
* Friday, 5pm: Final report Due (Blackboard submission)
* Friday, 5pm: Final report Due (Blackboard submission)
* '''Your final report (100 pts) should include the following required components''':
** (10 pts) Section 1. Background & Objectives. Describe (a) the overall goal of the study; (b) the specific question to be addressed by your dataset
** (20 pts) Section 2. Material & Methods. Describe experimental design, i.e., how your assigned data set was generated, including the nature of the biological samples, sample size, number of replicates (biological & technical), controls (if any), sequencing technologies. Hint: Fig S1
** (40 pts) Section 3. R codes & graphs. Show R codes with comments for individual commands. Graphics should be as close to the published figure as possible (e.g., with proper axis labels)
** (10 pts) Section 4. Statistical analysis. Show null hypothesis and p-value. Draw statistical conclusion
** (5 pts) Section 5. Conclusion. Draw biological conclusions of your analysis
** (5 pts) Section 6. Citations/source/URL to paper, your dataset, and methods

Latest revision as of 02:12, 2 May 2021

BIOL47120 Biomedical Genomics II
Spring 2021, Saturdays 9-12 noon
Synchronous Zoom Session, with Meeting ID: 762 490 6348
Instructor: Weigang Qiu, Ph.D., Professor, Department of Biological Sciences, Hunter College, CUNY; Email: weigang@genectr.hunter.cuny.edu
Office: B402 Belfer Research Building, 413 East 69th Street, New York, NY 10021, USA; Office hour: Friday 12-2pm
MA plot: fold change (y-axis) vs. total expression levels (x-axis)
Volcano plot: p-value (y-axis) vs. fold change (x-axis)
Heat map: genes significantly down- or up-regulated (at p<1e-4)

Course Overview

Welcome to Introductory BioMedical Genomics, a seminar course for advanced undergraduates and graduate students. A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and rapid DNA- and RNA-sequencing technologies, the biomedical sciences are undergoing a rapid and irreversible transformation into a highly data-intensive field that requires familiarity with concepts in biology, computer science, and data science.

Genome information is revolutionizing virtually all aspects of life sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as statistics.

This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises. Students are expected to be able to replicate key results of data analysis from published studies.

The pre-requisites of the course are college-level courses in molecular biology, cell biology, and genetics. Introductory courses in computer programming and statistics are preferred but not strictly required.

Learning goals

By the end of this course successful students will be able to:

  • Describe next-generation sequencing (NGS) technologies & contrast them with traditional Sanger sequencing
  • Explain applications of NGS technology including pathogen genomics, cancer genomics, human genomic variation, transcriptomics, meta-genomics, epi-genomics, and microbiome.
  • Visualize and explore genomics data using R & RStudio
  • Replicate key results using a raw data set produced by a primary research paper

A sample of original NGS papers with data sets

Web Links

Quizzes and Exams

Student performance will be evaluated by attendance, weekly assignments, quizzes, and a final report in R Markdown:

  • Attendance & In-class participation: 100 pts
  • Assignments: 5 x 10 = 50 pts
  • Quizzes: 2 x 25 pts = 50 pts
  • Mid-term: 50 pts
  • Final presentation & report: 50 pts

Total: 300 pts

Tips for Success

To maximize your experience, we strongly recommend the following strategies:

  • Follow the directions for efficiently finding high-impact papers, reading scientific research papers, and preparing presentations.
  • Read the papers, watch required videos and do the exercises regularly, long before you attend class.
  • Attend all classes, as required. Late arrival results in loss of points.
  • Keep up with online exercises. Don’t wait until the due date to start tasks.
  • Take notes or annotate slides while attending the lectures.
  • Listen actively and participate in class and in online discussions.
  • Review and summarize material within 24 hrs after class.
  • Observe the deadlines for submitting your work. Late submissions incur penalties.
  • Put away cell phones; do not text, email, or play computer games in class.

Hunter/CUNY Policies

  • Policy on Academic Integrity

Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on homework, online exercises or examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity, and we will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures. Students will be asked to read this statement before exams.

  • ADA Policy

In compliance with the Americans with Disabilities Act of 1990 (ADA) and with Section 504 of the Rehabilitation Act of 1973, Hunter College is committed to ensuring educational parity and accommodations for all students with documented disabilities and/or medical conditions. It is recommended that all students with documented disabilities (Emotional, Medical, Physical, and/or Learning) consult the Office of AccessABILITY, located in Room E1214B, to secure necessary academic accommodations. For further information and assistance, please call: (212) 772-4857 or (212) 650-3230.

  • Syllabus Policy

Except for changes that substantially affect implementation of the evaluation (grading) statement, this syllabus is a guide for the course and is subject to change with advance notice, announced in class or posted on Blackboard.

Course Schedule

Jan 30, 2021

  • Introduction
  • R Tutorial 1: Use interface, basic operations, load data. (slides available on Blackboard)
In-class Exercise & Assignment 1 (15 pts)
  • (5 pts, Due in-class) Transform the following "untidy/wide" table into a "tidy/tall" table (print a hard copy)
PropertyName,Density_250m,Density_500m,Density_1000m
HighbridgePark,0.006561319,0.009462031,0.010578611
BronxRiverParkway,0.001318749,0.001978858,0.002652118
CrotonaPark,0.009412087,0.01164712,0.01202321
ClaremontPark,0.016391948,0.019972485,0.020350481
VanCortlandtPark,0.000550151,0.000979312,0.001372675
  • (10 pts, Due 2/7) Make a single slide on a primary research paper that uses next-generation sequencing (NGS) technologies, showing the following:
    • proper citation (authors, title, year, journal, URL)
    • NGS method (Illumina, PacBio, or NanoPore)
    • NGS application (genomics, cancer, transcriptome, microbiome, proteome, metagenomics, human variation, etc)
    • a key figure, with a caption explaining x-axis, y-axis, samples, experiments
    • raw data table (show first few columns and first few rows)
    • For example, for a student who has worked on tissue regeneration, a PubMed search with the keywords "regeneration zebra fish transcriptome" identified the following primary paper as the best choice because of the high quality of the journal and the availability of raw data: https://www.ncbi.nlm.nih.gov/pubmed/28096348
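    • Papers to AVOID:
      • Reviews (not original research papers); these usually appear first on PubMed, so skip them
      • Original research with no computer-readable raw data (a PDF is not computer-readable)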
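
As a hedged illustration of the wide-to-tall transformation in the first exercise, the sketch below uses pivot_longer() from the tidyr package. The column names Distance and Density are illustrative choices, not names prescribed by the assignment, and other approaches are equally valid.

```r
library(tidyr)

# Wide table copied from the exercise: one column per buffer distance
wide <- read.csv(text = "PropertyName,Density_250m,Density_500m,Density_1000m
HighbridgePark,0.006561319,0.009462031,0.010578611
BronxRiverParkway,0.001318749,0.001978858,0.002652118
CrotonaPark,0.009412087,0.01164712,0.01202321
ClaremontPark,0.016391948,0.019972485,0.020350481
VanCortlandtPark,0.000550151,0.000979312,0.001372675")

# Tall table: one row per (park, distance) pair.
# "Distance" and "Density" are hypothetical column names chosen for this sketch.
tall <- pivot_longer(
  wide,
  cols = starts_with("Density_"),  # the three measurement columns
  names_to = "Distance",           # former column names go here
  names_prefix = "Density_",       # strip the shared prefix, leaving e.g. "250m"
  values_to = "Density"            # former cell values go here
)

print(tall)
```

In the tall form each observation occupies its own row, which is the layout that dplyr and ggplot2 expect.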

Feb 6, 2021

  • Introduction to NGS: (slides available on Blackboard)
  • 1-slide presentations on Next-Generation Sequencing Technologies (Group I)
  • R Tutorial, Part 2. Data manipulation with dplyr. Slides:
In-class Exercise 2 (10 pts; Due in class)
  • Show R commands for the following operations
    • load the "tidyverse" library
    • load the "iris" data
    • Select columns "Sepal.Length" & "Species"
    • Filter rows 2 through 10
    • Add a column "logSepalLength" by taking the logarithm of Sepal.Length
    • Calculate mean and standard deviation of Petal.Length in each species
    • Save all commands in a script "in-class-ex-2.R"
  • Assignment 1 Due next day
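
One possible way to express the operations of In-class Exercise 2 is sketched below. It assumes that "rows 2 through 10" means selection by row position and that the natural logarithm is intended; the result names (sel, rows, logged, stats) are hypothetical and used only so that each step can be inspected.

```r
library(tidyverse)

# Load the built-in iris data set
data(iris)

# Select two columns
sel <- iris %>% select(Sepal.Length, Species)

# Keep rows 2 through 10 (slice() picks rows by position)
rows <- iris %>% slice(2:10)

# Add a log-transformed column (log() is the natural logarithm)
logged <- iris %>% mutate(logSepalLength = log(Sepal.Length))

# Mean and standard deviation of Petal.Length within each species
stats <- iris %>%
  group_by(Species) %>%
  summarise(mean_PL = mean(Petal.Length), sd_PL = sd(Petal.Length))

print(stats)
```

Each command is written as its own pipeline, so any one of them can be run and checked independently.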

Feb 13, 2021

  • NGS presentations
  • R Tutorial. Part 3. Data visualization with ggplot2. Slides:
  • Assignment 2: Submit R-Markdown with scatter plot & boxplot
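
A minimal sketch of the two plot types named in Assignment 2 follows. The assignment does not specify a data set, so the built-in iris data and the chosen variables are assumptions made only for illustration; inside an R Markdown file this code would go in an R chunk.

```r
library(ggplot2)

# Scatter plot: two numeric variables, colored by group
scatter <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)", y = "Petal length (cm)")

# Boxplot: distribution of one numeric variable within each group
box <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot() +
  labs(x = "Species", y = "Petal length (cm)")

print(scatter)
print(box)
```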

Feb 20, 2021

  • R Tutorial: Part 4. BioStat (chi-square & t-test) Lecture slides:
  • Presentation
  • Assignment 3 (10 pts). R Markdown upload (by Wed 8pm)
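
The two tests covered in this session can be sketched in base R as follows. The iris data set, the choice of species, and the median split are assumptions for illustration only and are not taken from the tutorial slides.

```r
# Two-sample t-test: does mean petal length differ between two species?
two_species <- subset(iris, Species %in% c("setosa", "versicolor"))
two_species$Species <- droplevels(two_species$Species)  # the formula needs exactly 2 levels
tt <- t.test(Petal.Length ~ Species, data = two_species)
print(tt)

# Chi-square test of independence on a 2 x 3 contingency table:
# is having an above-median sepal length associated with species?
long_sepal <- iris$Sepal.Length > median(iris$Sepal.Length)
tab <- table(long_sepal, iris$Species)
chi <- chisq.test(tab)
print(chi)
```

In both cases the null hypothesis is "no difference" or "no association", and a small p-value is evidence against it.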

Feb 27, 2021

  • Review
  • Quiz (30 pts)
    • NGS slides & R tutorial Slides
    • Excluding the four statistical tests
    • Open book, but due in class (on Blackboard)

March 6, 2021

  • R Tutorial: Part 5. BioStat (regression & ANOVA)
  • Paper review
  • Assignment 4 (10 pts). Submit R markdown on Blackboard
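
As a rough sketch of the regression and ANOVA topics in this session, the base R code below fits a simple linear regression and a one-way ANOVA. The iris data set and the variables used are assumptions chosen for illustration, not the examples from the tutorial slides.

```r
# Simple linear regression: petal length as a function of sepal length
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
print(summary(fit))

# One-way ANOVA: does mean petal length differ among the three species?
aov_fit <- aov(Petal.Length ~ Species, data = iris)
aov_table <- summary(aov_fit)[[1]]  # ANOVA table as a data frame
print(aov_table)
```

The regression summary reports the slope and its p-value; the ANOVA table reports the F statistic and the p-value for the Species factor.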

March 13, 2021

  • Paper review
  • Review for mid-term exam: 6 PDF presentations (intro to NGS & 5 R-tutorials)

March 20, 2021

March 27, 2021

  • No class: Spring Break

April 3, 2021

  • No class: Holiday break

April 10, 2021

  • Paper assignments & break into project groups
  • For final presentation (10 pts)
    • Each member will present his or her part
  • For final report (90 pts), you are required to:
    • Read the paper and identify a dataset to replicate
    • Create an R markdown file to record your work
    • Produce a final WORD or PDF file as final report
  • Case study 1: Restriction/Modification system in Lyme pathogen: Casselli et al. (2018)

April 17, 2021

  • Group presentation #1. Summary slides

April 24, 2021

  • Group presentation #2. R Markdown & Data import
  • I will hear group presentation Round #2 and grade individual performance by the following rubric:
    • Biologist: Be prepared to answer my questions regarding the study background, question, and significance
    • Writer: Be prepared to show an R Markdown file (in RStudio) that includes ALL sections from last week's slide
    • Data Scientist: Be prepared to show the data table that has been read into RStudio and a preliminary graph
    • Statistician: Be prepared to explain what the statistical null hypothesis is and which test to perform

May 1, 2021

  • Group presentation #3. R Markdown & Data analysis

May 8, 2021

  • Consultation by appointment (no live session)

May 15, 2021

May 22, 2021

  • Friday, 5pm: Final report Due (Blackboard submission)
  • Your final report (90 pts) should include the following required components:
    • (10 pts) Section 1. Background & Objectives. Describe (a) the overall goal of the study; (b) the specific question to be addressed by your dataset
    • (20 pts) Section 2. Material & Methods. Describe experimental design, i.e., how your assigned data set was generated, including the nature of the biological samples, sample size, number of replicates (biological & technical), controls (if any), sequencing technologies. Hint: Fig S1
    • (40 pts) Section 3. R code & graphs. Show R code with comments for individual commands. Graphics should be as close to the published figure as possible (e.g., with proper axis labels)
    • (10 pts) Section 4. Statistical analysis. State the null hypothesis and p-value. Draw a statistical conclusion
    • (5 pts) Section 5. Conclusion. Draw biological conclusions of your analysis
    • (5 pts) Section 6. Citations/source/URL to paper, your dataset, and methods