Biol20N02 2017
Revision as of 18:16, 9 March 2017
Course Description
With the rapid accumulation of genome sequences and digital health data, biomedicine is becoming an information science. This course is a hands-on, computer-based workshop on how to visualize and analyze biological data. The course introduces R, a modern statistical computing language and platform. In the first half, students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data visualizations. In the second half, the course will review and apply statistical hypothesis tests, including significance testing of means, association tests, and correlation analysis. Throughout the course, students will apply these methods to the analysis of large biological data sets, such as the human genome, transcriptomes (RNA-Seq), and human genome variations.
This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.
Learning Goals
- Be able to use R as a plotting tool to visualize large-scale biological data sets
- Be able to use R as a statistical tool to summarize data and make biological inferences
- Be able to use R as a programming language to automate data analysis
Textbooks
- Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition).
Exams & Grading
- Attendance (or a note in case of absence) is required
- In-Class Exercises (50 pts).
- Assignments (~100 pts total). All assignments should be handed in as hard copies only; email submissions will not be accepted. Late submissions will receive a 10% deduction (of the total grade) per day.
- Three Mid-term Exams (3 X 30 pts each = 90 pts)
- Comprehensive Final Exam (50 pts)
- Bonus for active participation in classroom discussions
Course Outline
Feb 2. Introduction & R Demo
- Course overview
- Tutorial 1: R Demo
- Create a new project by navigating: File | New Project | New Directory. Name the project "human_genes"
- Import the human genes data set: File | Import DataSet | CSV, copy & paste this address: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
- Click "update". Rename the data set if you wish (use short but informative names, e.g., hg or human.genes). Do not use spaces; use a dot or an underscore as the name delimiter (e.g., "human.genes" or "human_genes", but never "human genes"). The same rule applies to column and row names
dim(hg) # show dimension
head(hg) # show top rows
tail(hg) # show bottom rows
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # create a vector of gene lengths (Tab for auto-completion)
hg.len[1] # show first length
hg.len[10] # show the 10th item
hg.len[1:10] # show items 1 through 10
hg.len[c(1,10)] # show items 1 and 10
hg[1,1] # show item in 1st row, 1st column
hg[1,] # show all values for 1st row
hg[,1] # show all values for 1st column
hg[1:10,] # show rows 1 through 10
hg[,1:7] # show columns 1 through 7
hg[1:10, 1:7] # a subset
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
hist(log10(hg.len), br = 200) # log transformation for a non-normally distributed variable
gene.cts <- table(hg$Chromosome) # count number of genes, separated by chromosomes
barplot(gene.cts) # show distribution by a barplot
mean(hg.len) # not representative, super-long genes carry too much weight in the average length
median(hg.len) # More representative. Use median for a variable not normally distributed
summary(hg.len) # Show all quartiles
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # create a new column named "Gene.Length"
boxplot(log10(Gene.Length) ~ Chromosome, data = hg) # show gene length by chromosomes
write.csv(hg, "hg.csv", row.names = FALSE) # save into a file
hg <- read.csv("hg.csv") # read back into R
- Export a PDF or image
- Open a new R script, name it as "hg.R"
- Select commands and save to script
- Retrieve and edit a command by pressing "up" or "down" arrows
- Retrieve commands by using the search box on the "History" tab
- Type q() to quit. Answer "y" to save workspace
- To reload and restore workspace, go to "C:/Users/instructor/Documents/human.genes" and double click on the file "human.gene"
Assignment #1. Due 2/9, Thursday (Finalized)
Feb 9. (Class cancelled due to snow storm)
Feb 16. R Data Structure & Variable Types
- Vector
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponential function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements with indices
x[1]
x[1:3]
x[-2]
x[c(1,3)]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
- Matrix
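age <- c(50, 62, 45, 38, 71) # a vector of ages (illustrative values; the original definition of "age" is missing from this page)
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index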
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol =4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol =4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
- Dataframe
class(hg) # show object class
hg[1,] # show first row
hg[,1] # show first column
hg[1:3,] # show rows 1 through 3
hg[,1:3] # show columns 1 through 3
hg$Gene.Name # show column "Gene.Name"
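The data-frame commands above need the imported "hg" data set. As a minimal sketch that runs without it, the vectors from the Vector and Matrix sections can be combined into a small data frame. The "age" values are made up for illustration, and the name "subjects" is an assumption of this sketch.

```r
# BMI, bp, and gender are the vectors used earlier on this page;
# the "age" values are illustrative only
age    <- c(50, 62, 45, 38, 71)
BMI    <- c(28, 32, 21, 27, 35)
bp     <- c(124, 145, 127, 133, 140)
gender <- c("male", "female", "female", "male", "female")

# Unlike a matrix, a data frame can mix numeric and character columns
subjects <- data.frame(age, BMI, bp, gender, stringsAsFactors = FALSE)

class(subjects)                           # object class: "data.frame"
dim(subjects)                             # 5 rows, 4 columns
str(subjects)                             # column names and types
subjects$bp                               # access a column by name
subjects[2, ]                             # access row 2
subjects[subjects$gender == "female", ]   # rows for female subjects only
mean(subjects$BMI)                        # mean of a numeric column
```

With cbind() the same four vectors would all be coerced to character strings, which is why a data frame is the better container for mixed variable types.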
Assignment #2. Due 2/23 (Finalized)
Feb 23. Data Visualization
- Vector functions returning indices
# The which() function returns the indices of TRUE elements
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # add a length column
hg.long.idx <- which(hg$Gene.Length > 1e6) # returns indices
hg.long <- hg[hg.long.idx,] # genes longer than 1 million bases
hg.mt.idx <- which(hg$Chromosome == "MT")
hg.mt <- hg[hg.mt.idx,] # mitochondrial genes
# The grep() function returns the indices of elements matching a pattern
p53.idx <- grep("P53", hg$Gene.Name)
hg.p53 <- hg[p53.idx,]
# The order() function returns the indices of sorted elements
idx.sorted <- order(hg$Gene.Length)
hg.sorted <- hg[idx.sorted,]
- Textbook: Chapter 1. Statistics and Samples; Chapter 2. Displaying Data
- Lecture slides:
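The index-returning functions which(), grep(), and order() can also be tried on a toy data frame, without the full "hg" data set. This is only a sketch: the data frame "toy" and all of its gene names and lengths are invented for illustration.

```r
# A toy stand-in for "hg"; names and lengths are invented
toy <- data.frame(
  Gene.Name   = c("GENE_A", "TP53", "GENE_B", "TP53BP1", "GENE_C"),
  Chromosome  = c("1", "17", "MT", "15", "MT"),
  Gene.Length = c(1200, 19000, 900, 103000, 1500),
  stringsAsFactors = FALSE
)

# which(): indices of elements for which a condition is TRUE
mt.idx <- which(toy$Chromosome == "MT")     # rows 3 and 5
toy[mt.idx, ]

# grep(): indices of elements matching a pattern
p53.idx <- grep("P53", toy$Gene.Name)       # rows 2 and 4
toy[p53.idx, ]

# order(): indices that would sort the vector (ascending by default)
idx.sorted <- order(toy$Gene.Length)        # 3 1 5 2 4
toy[idx.sorted, ]

# Use decreasing = TRUE to list the longest genes first
toy[order(toy$Gene.Length, decreasing = TRUE), ]
```

In each case the function returns row numbers, not the rows themselves; the rows are extracted in a second step by placing the indices before the comma in the square brackets.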
Assignment #3. Due 3/2 (Finalized)
March 2. Exam 1
March 9. Describing data
- Textbook: Chapter 3
- Population and sample
x <- rnorm(1000)
x.sample <- sample(x, size = 100)
n.genes <- nrow(hg) # number of rows
sampled.rows <- sample(1:n.genes, size = 100)
hg.sample <- hg[sampled.rows,] # a random sample of 100 genes
- Explore variable distributions
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for continuous variable
hist(hg$Gene.Length, br = 100)
hist(log10(hg$Gene.Length), br = 100)
gene.cts <- table(hg$Chromosome) # distribution for a categorical vector
barplot(gene.cts)
- In-Class exercise: A study of human gene length
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
mean(hg.len) # not representative, super-long genes carry too much weight to the average length
median(hg.len) # More representative. Use median for a variable not normally distributed
summary(hg.len) # Show all quartiles
IQR(hg.len) # 3rd Quartile - 1st Quartile, the range covering the middle 50% of data points, useful even for a skewed distribution
log.len <- log10(hg.len); hist(log.len, br=200) # Log of gene length is more normally distributed
mean(log.len); median(log.len) # They should be similar, since log.len is approximately normal
# The next block is intended to show that the "mean length" of samples is normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 lengths
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
mean.sample.100 <- sapply(1:1000, function(x) mean(sample(hg.len, size = 100)))
hist(mean.sample.100, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
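The same idea can be sketched without the gene data, using a simulated right-skewed population. The exponential distribution, its rate, and the names "skewed" and "sample.means" are assumptions of this sketch, not part of the course data.

```r
set.seed(1)                                # make the random draws reproducible
skewed <- rexp(20000, rate = 1 / 50000)    # simulated right-skewed "population"; not real gene lengths

# Means of 1000 random samples of size n
sample.means <- function(n) {
  sapply(1:1000, function(i) mean(sample(skewed, size = n)))
}

means.n20  <- sample.means(20)
means.n100 <- sample.means(100)
means.n500 <- sample.means(500)

# The spread of the sample means shrinks as the sample size grows
sd(means.n20); sd(means.n100); sd(means.n500)

# Compare with the expected standard error, sd / sqrt(n)
sd(skewed) / sqrt(c(20, 100, 500))

# The histograms become narrower and more bell-shaped with larger n
par(mfrow = c(3, 1))
hist(means.n20,  br = 50, main = "sample size 20",  xlab = "sample mean")
hist(means.n100, br = 50, main = "sample size 100", xlab = "sample mean")
hist(means.n500, br = 50, main = "sample size 500", xlab = "sample mean")
par(mfrow = c(1, 1))
```

Although the population itself is strongly skewed, the distribution of sample means approaches a normal shape, and its standard deviation is close to the population standard deviation divided by the square root of the sample size.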
Assignment #4. Due 3/16. (Tentative)
# Combine
sample.combined <- cbind(mean.100, mean.500, mean.5000)
colnames(sample.combined) <- c("samp.100", "samp.500", "samp.5000")
# plot in a single frame
par(mfrow=c(3,1))
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 5000", xlab = "mean gene length")
par(mfrow =c(1,1))
March 16. Standard Error of Mean & Hypothesis Testing
- Chapters 4 & 6
- In-Class exercise 1. Descriptive statistics
- Make a vector of the following blood pressure measurements (in mmHg): 112, 128, 108, 129, 125, 153, 155, 132, 137. Calculate sample size, sum, mean, variance, coefficient of variation (CV), and median
- Take a sample of 100 human gene lengths. Calculate median, IQR, 1.5*IQR; Make a boxplot
- The following are measurements of body mass (in grams) of three species of finches in Africa. Calculate the mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
- Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7, 7, 7, 8, 8, 8, 7, 7
- Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
- Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
- In-Class exercise 2. standard error & confidence interval
- Blood pressure: What is the standard deviation of the above blood pressure?
- What is the sample size? Calculate standard error of the mean.
- Use the 2SE rule of thumb, calculate 95% confidence interval.
- Plot standard error & standard deviation
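One possible way to work through the standard-error steps, using the blood pressure measurements from exercise 1. This is a sketch rather than the official solution; the variable names are arbitrary, and the plot is just one simple way to show the data with the interval.

```r
bp.mmHg <- c(112, 128, 108, 129, 125, 153, 155, 132, 137)  # measurements from exercise 1

n       <- length(bp.mmHg)       # sample size
bp.mean <- mean(bp.mmHg)         # sample mean
bp.sd   <- sd(bp.mmHg)           # sample standard deviation
bp.se   <- bp.sd / sqrt(n)       # standard error of the mean

# 2SE rule of thumb: approximate 95% confidence interval for the mean
ci.95 <- c(lower = bp.mean - 2 * bp.se, upper = bp.mean + 2 * bp.se)

n; bp.mean; bp.sd; bp.se; ci.95

# Show the raw data with the mean (filled dot) and the 2SE interval (bar)
stripchart(bp.mmHg, vertical = TRUE, method = "jitter", pch = 1,
           ylab = "blood pressure (mmHg)")
points(1, bp.mean, pch = 19)
arrows(1, ci.95["lower"], 1, ci.95["upper"], angle = 90, code = 3, length = 0.1)
```

The standard deviation describes the spread of individual measurements, while the standard error describes the precision of the estimated mean, so the interval is much narrower than the range of the data.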
- In-Class exercise 3. Hypothesis testing through simulation
# coin-flipping experiments
runif(1) # take a random sample from 0-1, uniformly distributed
rbinom(n = 1, size =100, prob = 0.5) # flip 100 (size) fair (prob = 0.5) coins, one (n = 1) time
rbinom(n = 1000, size =100, prob = 0.5) # repeat above 1000 times
num.success <- rbinom(n = 1000, size =100, prob = 0.5) # save
barplot(table(num.success)) # distribution of number of successes
length(which(num.success<=40))/1000 # proportion of the 1000 experiments with 40 or fewer successes (estimated probability)
# test if toads are right-handed: observation 14/18 are right-handed
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
barplot(table(right.handed.toads.by.chance)) # plot
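length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal to or higher than 14
# If the observation is 10/18 (two-tailed: outcomes at least as far from the expected 9 as the observation, in either direction)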
right.handed.toads.by.chance <- rbinom(n = 1e6, size = 18, prob = 0.5)
length(which(right.handed.toads.by.chance <= 8 | right.handed.toads.by.chance>=10))/1e6
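The simulated probabilities above can be checked against exact binomial probabilities. The sketch below uses base-R functions dbinom() and binom.test(); the variable names and the number of simulated experiments are arbitrary choices for illustration.

```r
set.seed(1)                                              # reproducible simulation
null.counts <- rbinom(n = 1e5, size = 18, prob = 0.5)    # null distribution: 18 toads, no handedness

# One-tailed probability of 14 or more right-handed toads, by simulation
p.sim <- mean(null.counts >= 14)

# The same probability computed exactly from the binomial distribution
p.exact <- sum(dbinom(14:18, size = 18, prob = 0.5))

# Two-tailed exact test; with prob = 0.5 the null distribution is symmetric,
# so the two-tailed P-value is twice the one-tailed probability
p.two <- binom.test(x = 14, n = 18, p = 0.5)$p.value

p.sim; p.exact; p.two
```

The simulated value fluctuates from run to run (unless a seed is set) and gets closer to the exact value as the number of simulated experiments increases.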
Assignment #6. Due 3/29 (Finalized)
A study of expression levels of human genes
March 23. Hypothesis Testing
- In-Class exercise 3. Hypothesis testing through simulation
- Lecture Slides: to be posted
Assignment #7. Due 4/5 (Finalized)
March 30. Exam 2
April 6. Analyzing Proportions
Assignment #8. Due 4/19 (Finalized)
April 13. No Class (Spring Break)
April 20. No Class (Monday Schedule)
April 27. Contingency Analysis
Lecture Slides:
Assignment #9. Due 5/3 (Finalized)
The following table shows results of genotype counts in "Taster" and "non-Taster" individuals.
May 4. Exam 3
- Review lecture slides (part 3)
- Review 2 previous exams
May 11. One-sample t-test
- Motivating example: weights of ticks
- Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/tick.tsv
- How to visualize data? What kinds of plots to make?
- Is there sexual dimorphism?
- No homework assignment (to be combined with the next one)
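A minimal sketch of a one-sample t-test in R. It uses simulated weights rather than the tick data set, whose column names are not shown on this page; the sample size, the simulated mean and standard deviation, and the null-hypothesis mean "mu0" are all hypothetical.

```r
set.seed(1)
weights <- rnorm(20, mean = 5.2, sd = 1)   # simulated weights (arbitrary units); not the tick data
mu0 <- 5                                   # hypothetical null-hypothesis mean

# t statistic by hand: (sample mean - null mean) / standard error
se     <- sd(weights) / sqrt(length(weights))
t.stat <- (mean(weights) - mu0) / se
p.hand <- 2 * pt(-abs(t.stat), df = length(weights) - 1)   # two-tailed P-value

# The same test with the built-in function
res <- t.test(weights, mu = mu0)
res$statistic    # t statistic, matches t.stat
res$parameter    # degrees of freedom, n - 1
res$p.value      # two-tailed P-value, matches p.hand
res$conf.int     # 95% confidence interval for the mean

# Visualize the sample against the null mean
boxplot(weights, ylab = "weight")
abline(h = mu0, lty = 2)                   # dashed line marks the null-hypothesis mean
```

Computing the statistic by hand alongside t.test() makes explicit that the test compares the distance between the sample mean and the null mean with the standard error of the mean.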
May 18. Paired & Two sample t-tests
- Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/ahus.csv
- PLEASE FILL IN TEACHER EVALUATIONS: Teacher's evaluation
- All lecture slides
- Part 1.
- Part 2.
- Part 3.
- Part 4.
Assignment #10. Due 5/24 (Finalized)