NYRaMP-Informatics-2025: Difference between revisions
| Line 265: | Line 265: | ||
==Week 3. Aug 19== | ==Week 3. Aug 19== | ||
* R Tutorial 3. Data visualization. Lecture slides: [[File: | * R Tutorial 3. Data visualization. Lecture slides: [[File:NYRamP_bioinformatics_3_slides.pdf|thumb]] | ||
* Session 3 R code: | * Session 3 R code: | ||
<syntaxhighlight lang='bash'> | <syntaxhighlight lang='bash'> | ||
Revision as of 15:29, 20 August 2025
| MA plot | Volcano plot | Heat map |
|---|---|---|
Overview
A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and rapid DNA and RNA-sequencing technologies, biomedical sciences are undergoing a rapid & irreversible transformation into a highly data-intensive field, that requires familiarity with concepts in both biological, computational, and statistical sciences.
Genome information is revolutionizing virtually all aspects of life Sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as statistics.
This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises.
Learning goals
By the end of this workshop students will be able to:
- Manipulate data with R & Rstudio
- Visualize data using R & RStudio
- Analyze microbiome data
Web Links
- Cloud R account (free): https://posit.cloud/; Join the shared work space "NYRaMP-Informatics"
- For your own computer, download the desktop version: https://posit.co/download/rstudio-desktop/
- Textbook: Introduction to R for Biologists
- Download: R datasets
- A reference book: R for Data Science (Wickharm & Grolemund)
Week 1. Aug 5
- Pre-test: visualization, interpretation, and stats. Download file: File:Pre-test.pdf
- Computer/Cloud setup & software download/installation
- R Tutorial 1. Getting started: Basics: interface, packages, variables, objects, functions. Download slides: File:NYRamP bioinformatics 1 slides.pdf
- Session 1 R code: Basic R syntax, working with vectors, and using functions
##### Practice 1 - Together #####
# TASK 1: define a variable that is your name
MyName <- 'Brandon'
print(MyName)
paste('My name is',MyName, sep = ' ')
# TASK 2: output the 3rd and 4th letters in your name using substr function
substr(MyName, start = 3, stop = 4)
# TASK 3: create a vector of the names of all of your Ramp cohort members
roster <- c('Amalya', 'Danny', 'Lorelei', 'Dylan', 'Hadley', 'Brynn', 'Elliot', 'Theo')
# Task 4: check if any of the names have the letters "ic" in them
grepl('an', roster, ignore.case = F)
# Task 5: randomly select 3 names from the roster
sample(roster, size = 3, replace = FALSE)
# TASK 6: combine tasks 4-5
grepl('an', sample(roster, size = 3, replace = F), ignore.case = F)
##### Practice 2 - Independent #####
library(stringr)
# TASK 1: create a character vector for the DNA nucleotides
Nucleotides <- c('A', 'T', 'C', 'G')
# TASK 2: use the "sample" function on your nuc vector to create a DNA sequence of length 200
DNAseq <- sample(Nucleotides, 200, replace = TRUE)
# collapse the vector into a single string
DNAseq2 <- paste(DNAseq, collapse = '')
# TASK 3: Find out if your sequence contains start codons using the "grepl" function; if output = FALSE, start over
grepl('ATG', DNAseq2, ignore.case = F)
# TASK 4: Find all locations of start codons within your sequence using "str_locate_all" function
str_locate_all(string = DNAseq2, pattern = 'ATG')
# TASK 5: use "substring" function to confirm coordinates are actually "ATG"
'ATG' == substr(DNAseq2, start = 96, stop = 98)
# TASK 6: calculate nucleotide % composition using "str_count" function
str_count(string = DNAseq2, pattern = 'A') / 200 * 100
### alternate more advanced way for tasks 5 and 6 ###
coords <- str_locate_all(string = DNAseq2, pattern = 'ATG')
for (i in 1:nrow(coords[[1]])) {
start <- coords[[1]][i, 1]
print('ATG' == substr(DNAseq2, start = start, stop = start+2))
}
for (nuc in Nucleotides) {
print(paste(nuc,' = ', str_count(string = DNAseq2, pattern = nuc)/nchar(DNAseq2)*100, sep = ''))
}
Week 2. Aug 12
- R Tutorial 2. Data manipulation. Download slides: File:NYRamP bioinformatics 2 slides.pdf
- Session 2 R code:
library(tidyverse)
library(broom)
##### loading a data table #####
df <- read.table('http://wiki.genometracker.org/~weigang/datasets-master/compensation.csv', sep = ',', header = T)
# this loads the data
##### quick look at data #####
dim(df)
nrow(df)
ncol(df)
names(df)
head(df)
glimpse(df)
##### use slice to grab specific rows #####
slice(df, 11:15)
slice_head(df, n=5)
slice_tail(df, n=8)
slice_sample(df, n=20)
slice_max(df, Root, n=3)
slice_min(df, Root, n=3)
slice_min(df, Grazing)
##### arrange function #####
arrange(df, Fruit)
arrange(df, desc(Fruit))
arrange(df, Grazing)
arrange(df, desc(Grazing))
##### filter #####
filter(df, Grazing == 'Grazed')
filter(df, Root < 6.5)
filter(df, Grazing == 'Ungrazed' & Fruit > 50)
filter(df, Fruit > 50 | Root >8)
filter(df, Grazing == 'Ungrazed' & ( Fruit > 50 | Root >8))
##### piping #####
df %>% #start by calling my dataframe
select(-Root) %>% # removing the Root data
filter(Fruit > 50) %>% # keep only fruit with value greater than 50
arrange(Grazing, Fruit) # arrange first by grazed vs ungrazed, then by fruit size least to greatest
##### computations #####
# operations on an entire column
mean(df$Fruit)
median(df$Root)
sum(df$Fruit)
# operations on individual values in the column
log2(df$Fruit)
df$Fruit / 2
# make a new column by defining and assigning
new_df <- df
new_df$test <- df$Root + df$Fruit
# making a new column based on a computation of data in the df using mutate
new_df <- df %>%
mutate(log_fruit = log2(Fruit)) %>%
mutate(combined_mass = Fruit + Root)
##### summary statistics #####
summary(df)
# pipe to get summary by specific variables
df %>%
filter(Grazing == 'Grazed') %>%
summary()
# get the stats you want, with grouping
df %>%
group_by(Grazing) %>%
summarise(
mean_fruit = mean(Fruit),
median_fruit = median(Fruit),
sd_fruit = sd(Fruit),
mean_root = mean(Root),
median_root = median(Root),
sd_root = sd(Root)
)
##### rotating a dataframe #####
# add an id so values are in unique rows
df$id <- as.character(1:length(df$Grazing))
# convert to long format
df_long <- df %>% pivot_longer(cols = c(Fruit, Root), values_to = 'value', names_to = 'type')
# you can still group and summarize like you can with the wide format
df_long %>%
group_by(type, Grazing) %>%
summarise(
mean = mean(value)
)
# convert a long format df to wide
df_wide <- df_long %>% pivot_wider(names_from = type, values_from = value)
##### IRIS DATASET PRACTICE #####
data("iris")
# 1 - Find the dimensions of the dataset
dim(iris)
paste('The iris dataset has',dim(iris)[1],'rows and',dim(iris)[2],'columns.', sep = ' ')
# 2 - List the variables and their data types
glimpse(iris)
# 3 - Summarize the variables
summary(iris)
# 4 - Get the last 10 observations of the dataset
tail(iris, n = 10)
slice_tail(iris, n = 10)
# 5 - Select only the first four columns (remove the “species” column)
select(iris, 1:4)
iris %>% select(1:4)
select(iris, -Species)
# 6 - Filter rows by species, retain only rows from one species (e.g., “setosa”)
filter(iris, Species == 'setosa')
iris %>% filter(Species == 'setosa')
# 7 - Filter rows by a cutoff value (e.g., “Sepal.Length >= 4”)
iris %>% filter(Sepal.Length >= 5)
# 8 - Add a column by taking the log10 of “Sepal.Lengh”
iris %>% mutate(log10_sl = log10(Sepal.Length))
# 9 - What are the medians of the variable “Sepal.Length” for each species?
iris %>%
group_by(Species) %>%
summarise(
median_sl = median(Sepal.Length)
)
# 10 - Count how many samples for each species
count(iris, Species)
iris %>% count(Species)
# 11 - Convert the table to long format
iris %>%
pivot_longer(cols = c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width),
names_to = 'structure',
values_to = 'measurement')
Week 3. Aug 19
- R Tutorial 3. Data visualization. Lecture slides: File:NYRamP bioinformatics 3 slides.pdf
- Session 3 R code:
library(tidyverse)
library(broom)
##### compensation dataset viz #####
df <- read.table('http://wiki.genometracker.org/~weigang/datasets-master/compensation.csv', sep = ',', header = T)
# boxplot for root or fruit measurements
df %>%
ggplot(aes(x = Grazing, y = Root, fill = Grazing)) +
geom_boxplot() +
#geom_point()
geom_jitter(color = 'blue')
# scatter plot with regression line
df %>%
ggplot(aes(x = Root, y = Fruit, color = Grazing)) +
geom_point() +
geom_smooth(method = 'lm')
# bar plot for root or fruit measurement
df %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) +
geom_bar(stat = 'identity')
# we can do more with a long data format
df_long <- df %>% pivot_longer(cols = c(Fruit, Root), values_to = 'value', names_to = 'type')
# box plot showing root and fruit together
df_long %>%
ggplot(aes(x = type, y = value, fill = Grazing)) +
geom_boxplot() +
geom_jitter() +
facet_wrap(~type, scales = 'free')
data(iris)
##### compensation dataset hypothesis testing #####
# is there a correlation between root and fruit variables?
cor(df$Root, df$Fruit)
# lets group by to get a better idea
df %>%
group_by(Grazing) %>%
summarise(pearson = cor(Root, Fruit))
# is the mean Fruit size the same for grazed and ungrazed?
# t test
t.test(df$Fruit ~ df$Grazing)
# tidy for results in a neat dataframe
res <- tidy(t.test(df$Fruit ~ df$Grazing))
# another way of doing t.test + tidy
t_test <- df %>%
t.test(Fruit ~ Grazing, data = .) %>%
tidy()
# linear model - can these variables predict each other?
# can root size predict fruit mass?
lm_simple <- lm(Fruit ~ Root, data = df)
lm_interact <- lm(Fruit ~ Root * Grazing, data = df)
tidy(lm_simple)
tidy(lm_interact)
summary(lm(Fruit ~ Root, data = df))
summary(lm(Fruit ~ Root * Grazing, data = df))
##### iris dataset Viz #####
# simple 1 variable boxplot
iris %>%
ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
#geom_boxplot() +
geom_violin() +
#geom_point()
geom_jitter()
#geom_bar(stat = 'identity')
# scatter plot for 2 numerical variables
iris %>%
ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = T)
# lets get more information at once!
# convert to long format
df_long <- iris %>%
pivot_longer(cols = c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width),
names_to = 'structure',
values_to = 'measurement')
df_long %>%
#filter(structure == 'Sepal.Length') %>%
ggplot(aes(x = Species, y = measurement, fill = Species)) +
geom_boxplot() +
geom_jitter(alpha = 0.5) +
facet_wrap(~structure, scales = 'free_y')
df_long %>%
ggplot(aes(x = Species, y = measurement, fill = Species)) +
geom_bar(stat = 'identity') +
facet_wrap(~structure)
##### iris dataset hypothesis testing #####
# 1 - is there a correlation between sepal length and sepal width?
cor(iris$Sepal.Length, iris$Sepal.Width)
cor(iris$Sepal.Length, iris$Petal.Length)
# 2 - is there a difference in mean sepal length among the different species?
# anova to compare all three
anova_res <- aov(Sepal.Length ~ Species, data = iris)
# look at results
summary(anova_res)
# post hoc test to see which groups differ
TukeyHSD(anova_res)
# we can do the similar things with t-tests
df <- iris %>%
filter(Species == 'setosa' | Species == 'virginica')
t.test(Sepal.Length ~ Species, df)$p.value
t_res <- t.test(Sepal.Length ~ Species, data = df)


