NYRaMP-Informatics-2025: Difference between revisions

Revision as of 15:29, 20 August 2025

NYRaMP Informatics Workshop August 2025, Tuesdays 9:30-11:30, DNA Learning Center Instructors: Brandon Ely (CUNY Graduate Center, bely@gradcenter.cuny.edu)

MA plot	Volcano plot	Heat map
fold change (y-axis) vs. total expression levels (x-axis)	p-value (y-axis) vs. fold change (x-axis)	genes significantly down or up-regulated (at p<1e-4)

Overview

A genome is the total genetic content of an organism. Driven by breakthroughs such as the decoding of the first human genome and rapid DNA and RNA-sequencing technologies, biomedical sciences are undergoing a rapid & irreversible transformation into a highly data-intensive field, that requires familiarity with concepts in both biological, computational, and statistical sciences.

Genome information is revolutionizing virtually all aspects of life Sciences including basic research, medicine, and agriculture. Meanwhile, use of genomic data requires life scientists to be familiar with concepts and skills in biology, computer science, as well as statistics.

This workshop is designed to introduce computational analysis of genomic data through hands-on computational exercises.

Learning goals

By the end of this workshop students will be able to:

Manipulate data with R & Rstudio
Visualize data using R & RStudio
Analyze microbiome data

Web Links

Cloud R account (free): https://posit.cloud/; Join the shared work space "NYRaMP-Informatics"
For your own computer, download the desktop version: https://posit.co/download/rstudio-desktop/
Textbook: Introduction to R for Biologists
Download: R datasets
A reference book: R for Data Science (Wickharm & Grolemund)

Week 1. Aug 5

Pre-test: visualization, interpretation, and stats. Download file: File:Pre-test.pdf
Computer/Cloud setup & software download/installation
R Tutorial 1. Getting started: Basics: interface, packages, variables, objects, functions. Download slides: File:NYRamP bioinformatics 1 slides.pdf
Session 1 R code: Basic R syntax, working with vectors, and using functions

##### Practice 1 - Together #####

# TASK 1: define a variable that is your name 
MyName <- 'Brandon'
print(MyName)
paste('My name is',MyName, sep = ' ')

# TASK 2: output the 3rd and 4th letters in your name using substr function
substr(MyName, start = 3, stop = 4)

# TASK 3: create a vector of the names of all of your Ramp cohort members 
roster <- c('Amalya', 'Danny', 'Lorelei', 'Dylan', 'Hadley', 'Brynn', 'Elliot', 'Theo')

# Task 4: check if any of the names have the letters "ic" in them
grepl('an', roster, ignore.case = F)

# Task 5: randomly select 3 names from the roster
sample(roster, size = 3, replace = FALSE)

# TASK 6: combine tasks 4-5
grepl('an', sample(roster, size = 3, replace = F), ignore.case = F)



##### Practice 2 - Independent  #####

library(stringr)

# TASK 1: create a character vector for the DNA nucleotides 
Nucleotides <- c('A', 'T', 'C', 'G')

# TASK 2: use the "sample" function on your nuc vector to create a DNA sequence of length 200
DNAseq <- sample(Nucleotides, 200, replace = TRUE)

# collapse the vector into a single string 
DNAseq2 <- paste(DNAseq, collapse = '')

# TASK 3: Find out if your sequence contains start codons using the "grepl" function; if output = FALSE, start over
grepl('ATG', DNAseq2, ignore.case = F)

# TASK 4: Find all locations of start codons within your sequence using "str_locate_all" function
str_locate_all(string = DNAseq2, pattern = 'ATG')

# TASK 5: use "substring" function to confirm coordinates are actually "ATG"
'ATG' == substr(DNAseq2, start = 96, stop = 98)

# TASK 6: calculate nucleotide % composition using "str_count" function
str_count(string = DNAseq2, pattern = 'A') / 200 * 100

### alternate more advanced way for tasks 5 and 6 ###

coords <- str_locate_all(string = DNAseq2, pattern = 'ATG')

for (i in 1:nrow(coords[[1]])) {
  start <- coords[[1]][i, 1]
  print('ATG' == substr(DNAseq2, start = start, stop = start+2))
}


for (nuc in Nucleotides) {
  print(paste(nuc,' = ', str_count(string = DNAseq2, pattern = nuc)/nchar(DNAseq2)*100, sep = ''))
}

Week 2. Aug 12

R Tutorial 2. Data manipulation. Download slides: File:NYRamP bioinformatics 2 slides.pdf
Session 2 R code:

library(tidyverse)
library(broom)

##### loading a data table #####
df <- read.table('http://wiki.genometracker.org/~weigang/datasets-master/compensation.csv', sep = ',', header = T)
# this loads the data

##### quick look at data #####
dim(df)
nrow(df)
ncol(df)
names(df)
head(df)
glimpse(df)


##### use slice to grab specific rows #####
slice(df, 11:15)
slice_head(df, n=5)
slice_tail(df, n=8)
slice_sample(df, n=20)
slice_max(df, Root, n=3)
slice_min(df, Root, n=3)
slice_min(df, Grazing)

##### arrange function #####
arrange(df, Fruit)
arrange(df, desc(Fruit))
arrange(df, Grazing)
arrange(df, desc(Grazing))


##### filter #####
filter(df, Grazing == 'Grazed')
filter(df, Root < 6.5)
filter(df, Grazing == 'Ungrazed' & Fruit > 50)
filter(df, Fruit > 50 | Root >8)
filter(df,  Grazing == 'Ungrazed' & ( Fruit > 50 | Root >8))

##### piping #####
df %>% #start by calling my dataframe
  select(-Root) %>% # removing the Root data
  filter(Fruit > 50) %>% # keep only fruit with value greater than 50
  arrange(Grazing, Fruit) # arrange first by grazed vs ungrazed, then by fruit size least to greatest

##### computations #####

# operations on an entire column
mean(df$Fruit)
median(df$Root)
sum(df$Fruit)

# operations on individual values in the column 
log2(df$Fruit)
df$Fruit / 2

# make a new column by defining and assigning
new_df <- df
new_df$test <- df$Root + df$Fruit

# making a new column based on a computation of data in the df using mutate
new_df <- df %>%
  mutate(log_fruit = log2(Fruit)) %>%
  mutate(combined_mass = Fruit + Root)

##### summary statistics #####
summary(df)

# pipe to get summary by specific variables
df %>% 
  filter(Grazing == 'Grazed') %>%
  summary()

# get the stats you want, with grouping 
df %>% 
  group_by(Grazing) %>%
  summarise(
    mean_fruit = mean(Fruit),
    median_fruit = median(Fruit),
    sd_fruit = sd(Fruit),
    mean_root = mean(Root),
    median_root = median(Root),
    sd_root = sd(Root)
  )


##### rotating a dataframe #####
# add an id so values are in unique rows
df$id <- as.character(1:length(df$Grazing)) 

# convert to long format
df_long <- df %>% pivot_longer(cols = c(Fruit, Root), values_to = 'value', names_to = 'type')

# you can still group and summarize like you can with the wide format
df_long %>% 
  group_by(type, Grazing) %>%
  summarise(
    mean = mean(value)
  )

# convert a long format df to wide
df_wide <- df_long %>% pivot_wider(names_from = type, values_from = value)


##### IRIS DATASET PRACTICE #####
data("iris")

# 1 - Find the dimensions of the dataset
dim(iris)

paste('The iris dataset has',dim(iris)[1],'rows and',dim(iris)[2],'columns.', sep = ' ')

# 2 - List the variables and their data types
glimpse(iris)

# 3 - Summarize the variables
summary(iris)

# 4 - Get the last 10 observations of the dataset
tail(iris, n = 10)
slice_tail(iris, n = 10)

# 5 - Select only the first four columns (remove the “species” column)
select(iris, 1:4)
iris %>% select(1:4)
select(iris, -Species)

# 6 - Filter rows by species, retain only rows from one species (e.g., “setosa”)
filter(iris, Species == 'setosa')
iris %>% filter(Species == 'setosa')

# 7 - Filter rows by a cutoff value (e.g., “Sepal.Length >= 4”)
iris %>% filter(Sepal.Length >= 5)

# 8 - Add a column by taking the log10 of “Sepal.Lengh”
iris %>% mutate(log10_sl = log10(Sepal.Length))

# 9 - What are the medians of the variable “Sepal.Length” for each species?
iris %>%
  group_by(Species) %>%
  summarise(
    median_sl = median(Sepal.Length)
  )

# 10 - Count how many samples for each species
count(iris, Species)
iris %>% count(Species)

# 11 - Convert the table to long format
iris %>%
  pivot_longer(cols = c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width), 
               names_to = 'structure',
               values_to = 'measurement')

Week 3. Aug 19

R Tutorial 3. Data visualization. Lecture slides: File:NYRamP bioinformatics 3 slides.pdf
Session 3 R code:

library(tidyverse)
library(broom)
##### compensation dataset viz #####

df <- read.table('http://wiki.genometracker.org/~weigang/datasets-master/compensation.csv', sep = ',', header = T)

# boxplot for root or fruit measurements
df %>%
  ggplot(aes(x = Grazing, y = Root, fill = Grazing)) +
  geom_boxplot() +
  #geom_point() 
  geom_jitter(color = 'blue')

# scatter plot with regression line
df %>%
  ggplot(aes(x = Root, y = Fruit, color = Grazing)) +
  geom_point() +
  geom_smooth(method = 'lm')

# bar plot for root or fruit measurement 
df %>% 
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) +
  geom_bar(stat = 'identity')

# we can do more with a long data format
df_long <- df %>% pivot_longer(cols = c(Fruit, Root), values_to = 'value', names_to = 'type')

# box plot showing root and fruit together
df_long %>%
  ggplot(aes(x = type, y = value, fill = Grazing)) +
  geom_boxplot() +
  geom_jitter() +
  facet_wrap(~type, scales = 'free')


data(iris)
##### compensation dataset hypothesis testing #####
# is there a correlation between root and fruit variables?
cor(df$Root, df$Fruit)

# lets group by to get a better idea
df %>% 
  group_by(Grazing) %>%
  summarise(pearson = cor(Root, Fruit))

# is the mean Fruit size the same for grazed and ungrazed? 
# t test
t.test(df$Fruit ~ df$Grazing)

# tidy for results in a neat dataframe
res <- tidy(t.test(df$Fruit ~ df$Grazing))

# another way of doing t.test + tidy
t_test <- df %>%
  t.test(Fruit ~ Grazing, data = .) %>%
  tidy()

# linear model - can these variables predict each other?
# can root size predict fruit mass? 
lm_simple <- lm(Fruit ~ Root, data = df)
lm_interact <- lm(Fruit ~ Root * Grazing, data = df)

tidy(lm_simple)
tidy(lm_interact)

summary(lm(Fruit ~ Root, data = df))
summary(lm(Fruit ~ Root * Grazing, data = df))
##### iris dataset Viz #####

# simple 1 variable boxplot
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
  #geom_boxplot() +
  geom_violin() +
  #geom_point()
  geom_jitter()  
#geom_bar(stat = 'identity') 

# scatter plot for 2 numerical variables
iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = T)


# lets get more information at once!
# convert to long format
df_long <- iris %>%
  pivot_longer(cols = c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width), 
               names_to = 'structure',
               values_to = 'measurement')

df_long %>%
  #filter(structure == 'Sepal.Length') %>%
  ggplot(aes(x = Species, y = measurement, fill = Species)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.5) +
  facet_wrap(~structure, scales = 'free_y')

df_long %>%
  ggplot(aes(x = Species, y = measurement, fill = Species)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~structure)



##### iris dataset hypothesis testing #####
# 1 - is there a correlation between sepal length and sepal width?
cor(iris$Sepal.Length, iris$Sepal.Width)
cor(iris$Sepal.Length, iris$Petal.Length)

# 2 - is there a difference in mean sepal length among the different species?
# anova to compare all three
anova_res <- aov(Sepal.Length ~ Species, data = iris)
# look at results
summary(anova_res)
# post hoc test to see which groups differ
TukeyHSD(anova_res)

# we can do the similar things with t-tests 
df <- iris %>% 
  filter(Species == 'setosa' | Species == 'virginica')

t.test(Sepal.Length ~ Species, df)$p.value

t_res <- t.test(Sepal.Length ~ Species, data = df)

@@ Line 265: / Line 265: @@
 ==Week 3. Aug 19==
-* R Tutorial 3. Data visualization. Lecture slides: [[File:NYRamP_bioinformatics_1_slides.pdf|thumb]]
+* R Tutorial 3. Data visualization. Lecture slides: [[File:NYRamP_bioinformatics_3_slides.pdf|thumb]]
 * Session 3 R code:
 <syntaxhighlight lang='bash'>

NYRaMP-Informatics-2025: Difference between revisions

Revision as of 15:29, 20 August 2025

Contents

Overview

Learning goals

Web Links

Week 1. Aug 5

Week 2. Aug 12

Week 3. Aug 19

Demo I. Microbiome data analysis

Navigation menu

NYRaMP-Informatics-2025: Difference between revisions

Revision as of 15:29, 20 August 2025

Overview

Learning goals

Web Links

Week 1. Aug 5

Week 2. Aug 12

Week 3. Aug 19

Demo I. Microbiome data analysis

Navigation menu

Search