Biol20N02 2017: Difference between revisions
imported>Weigang |
imported>Weigang |
||
Line 497: | Line 497: | ||
* Calculate confidence interval of the population slope | * Calculate confidence interval of the population slope | ||
* Obtain R-squared value and explain its meaning and significance | * Obtain R-squared value and explain its meaning and significance | ||
* Add confidence band | * Add confidence band line | ||
* Add confidence line | * Add confidence interval line | ||
|} | |} | ||
Latest revision as of 15:59, 18 May 2017
Course Description
With rapid accumulation of genome sequences and digital health data, biomedicine is becoming an information science. This course is a hands-on, computer-based workshop on how to visualize and analyze biological data. The course introduces R, a modern statistical computing language and platform. In the first half, students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data-visualization techniques. In the second half, the course will review & apply statistical hypothesis tests including significance testing of means, association tests, and correlation analysis. Throughout the course, students will apply these methods to the analysis of large biological data sets, such as the human genome, transcriptomes (RNA-SEQ), and human genome variations.
This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.
Learning Goals
- Be able to use R as a plotting tool to visualize large-scale biological data sets
- Be able to use R as a statistical tool to summarize data and make biological inferences
- Be able to use R as a programming language to automate data analysis
Textbooks
- Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition). Amazon link
Exams & Grading
- Attendance (or a note in case of absence) is required
- In-Class Exercises (50 pts).
- Assignments. All assignments should be handed in as hard copies only. Email submission will not be accepted. Late submissions will receive 10% deduction (of the total grade) per day (~100 pts total).
- Three Mid-term Exams (3 X 30 pts each = 90 pts)
- Comprehensive Final Exam (50 pts)
- Bonus for active participation in classroom discussions
Course Outline
Feb 2. Introduction & R Demo
- Course overview
- Tutorial 1: R Demo
- Create a new project by navigating: File | New Project | New Directory. Name it project file "human_genes"
- Import the human genes data set: File | Import DataSet | CSV, copy & paste this address: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
- Click "update". Rename the data set if you wish (short but informative names, e.g., hg, or human.genes). Do not use spaces, use dot or underscore as name delimiters (e.g., "human.genes" or "human_genes", but never "human genes") Same rule for column or row names
dim(hg) # show dimension
head(hg) # show top rows
tail(hg) # show bottom rows
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # create a vector of gene lengths (Tab for auto-completion)
hg.len[1] # show first length
hg.len[10] # show the 10th item
hg.len[1:10] # show items 1 through 10
hg.len[c(1,10)] # show items 1 and 10
hg[1,1] # show item in 1st row, 1 column
hg[1,] # show all values for 1st row
hg[,1] # show all values for 1st column
hg[1:10,] # show rows 1 through 10
hg[,1:7] # show columns 1 through 10
hg[1:10, 1:7] # a subset
hist(hg$Length, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
hist(log10(hg$Length), br = 200) # log transformation for non-normally distributed variable
gene.cts <- table(hg$Chromosome) # count number of genes, separated by chromosomes
barplot(gene.cts) # show distribution by a barplot
mean(hg$Length) # not representative, super-long genes carry too much weight to the average length
median(hg$Length) # More representative. Use median for a variable not normally distributed
summary(hg$Length) # Show all quartiles
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # create a new column named "Gene.Length"
boxplot(log10(Length) ~ Chromosome, data = hg) # show gene length by chromosomes
write.csv(hg, "hg.csv", row.names = FALSE) # save into a file
hg <- read.csv("hg.csv") # read back into R
- Export a PDF or image
- Open a new R script, name it as "hg.R"
- Select commands and save to script
- Retrieve and edit a command by pressing "up" or "down" arrows
- Retrieve commands by using the search box on the "History" table
- Type q() to quit. Answer "y" to save workspace
- To reload and restore workspace, go to "C:/Users/instructor/Documents/human.genes" and double click on the file "human.gene"
CollapseAssignment #1. Due 2/9, Thursday (Finalized) |
---|
|
Feb 9. (Class cancelled due to snow storm)
Feb 16. R Data Structure & Variable Types
- Vector
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponent function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements with indices
x[1]
x[1:3]
x[-2]
x[c(1,3)]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
- Matrix
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol =4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol =4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
- Dataframe
class(hg) # show object class
hg[1,] # how first row
hg[,1] # show first column
hg[1:3,] # show rows 1 through 3
hg[,1:3] # show columns 1 through 3
hg$Gene.Name # show column "Gene.Name"
CollapseAssignment #2. Due 2/23 (Finalized) |
---|
|
Feb 23. Data Visualization
- Vector functions returning indices
# The which() function returns the indices of TRUE elements
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # add a length column
hg.long.idx <- which(hg$Gene.Length > 1e6) # returns indices
hg.long <- hg[hg.long.idx,] # genes longer than 1 million bases
hg.mt.idx <- which(hg$Chromosome == "MT")
hg.mt <- hg[hg.mt.idx,] # mitochondrial genes
# The grep() function returns the indices of matching a pattern
p53.idx <- grep("P53", hg$Gene.Name)
hg.p53 <- hg[p53.idx,]
# The order() function returns the indices of sorted elements
idx.sorted <- order(hg$Gene.Length)
hg.sorted <- hg[idx.sorted,]
- Textbook: Chapter 1. Statistics and Sample; Chapter 2. Displaying Data
- Lecture slides:
CollapseAssignment #3. Due 3/2 (Finalized) |
---|
|
March 2. Exam 1
March 9. Describing data & Standard Error
- Textbook: Chapters 3 & 4
- Population and sample
x <- rnorm(1000)
x.sample <- sample(x, size = 100)
n.genes <- nrow(hg) # number of rows
sampled.rows <- sample(1:n.genes, size = 100)
hg.sample <- hg[sampled.rows,] # a random sample of 100 genes
- Explore variable distributions
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for continuous variable
hist(hg$Gene.Length, br = 100)
hist(log10(hg$Gene.Length), br = 100)
gene.cts <- table(hg$Chromosomes) # distribution for a categorical vector
barplot(gene.cts)
- In-Class exercise: A study of human gene length
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
mean(hg.len) # not representative, super-long genes carry too much weight to the average length
median(hg.len) # More representative. Use median for a variable not normally distributed
summary(hg.len) # Show all quartiles
IQR(hg.len) # 3rd Quartile - 1st Quartile, the range of majority data points, even for skewed distribution
log.len <- log10(hg.len); hist(log.len, br=200) # Log of gene length is more normally distributed
mean(log.len); median(log.len) # They should be similar, since log.len is normal
# The next block is intend to show that the "mean length" of samples is normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 length
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
mean.sample.100 <- sapply(1:1000, function(x) mean(sample(hg$Gene.Length, size = 100)))
hist(mean.sample.100, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
CollapseAssignment #4. Due 3/16. (Finalized) |
---|
# Combine
sample.combined <- cbind(mean.100, mean.500, mean.5000)
colnames(sample.combined) <- c("samp.100", "samp.500", "samp.5000")
# plot in a single frame
par(mfrow=c(3,1))
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main="sample size 20", xlab = "mean gene length")
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
par(mfrow =c(1,1))
|
March 16. Hypothesis Testing
- Chapter 6
- In-Class exercise 1. The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
- Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
- Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
- Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
- Use the 2SE rule of thumb, calculate 95% confidence interval.
- Plot standard error & standard deviation
df1 <- data.frame(bm = c(8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7), species = rep("sp1", 16))
df2 <- data.frame(bm = c(16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16), species = rep("sp2", 12))
df3 <- data.frame(bm = c(40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41), species = rep("sp3", 16))
df.bm <- rbind(df1, df2, df3)
boxplot(bm ~ species, data= df.bm)
sd(df1$bm)
mean.1 <-mean(df1$bm)
se.1 <- sd(df1$bm)/sqrt(nrow(df1))
ci.1 <- c(mean(df1$bm) - 2 * se.1, mean(df1$bm) + 2*se.1)
bm.aov <- aov(bm ~ species, data = df.bm)
- In-Class exercise 2. Hypothesis testing through simulation
# coin-flipping experiments
runif(1) # take a random sample from 0-1, uniformly distributed
rbinom(n = 1, size =100, prob = 0.5) # flipping 100 (size) fair (prob) coin, one (n=1) time
rbinom(n = 1000, size =100, prob = 0.5) # repeat above 1000 times
num.success <- rbinom(n = 1000, size =100, prob = 0.5) # save
barplot(table(num.success)) # distribution of number of successes
length(which(num.success<=40))/1000 # probability of success less than or equal to 40
# test if toads are right-handed: observation 14/18 are right-handed
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
barplot(table(right.handed.toads.by.chance)) # plot
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14
# Binomial test
binom.test(x=14, size=18, p =0.5)
CollapseAssignment #5. Due 3/23 (Finalized) |
---|
|
March 23. One-sampled & Two-sampled t-test
- Chapters 10-12
- Lecture Slides:
CollapseAssignment #6. Due 3/30 (Finalized) | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
March 30. Exam 2
April 6. Analyzing Proportions
- Chapter 7 & 8
CollapseAssignment #7. Due 4/19 (Finalized) |
---|
|
April 13. No Class (Spring Break)
April 20. No Class (Monday Schedule)
April 27. Contingency Analysis
- Chapter 9.
- Lecture Slides:
CollapseAssignment #8. Due 5/4 (Finalized) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
The following table shows results of genotype counts in "Taster" and "non-Taster" individuals.
|
May 4. Exam 3
- Review lecture slides (part 3)
- Review two assignments
May 11. Correlation
- Chapter 16. R Data & Code on Publisher's Website
- Lecture slides:
- In-class Exercise 1. Aggression behavior in birds. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels
Variable | Values |
---|---|
Number of visits by non-parent adults | 1,7,15,4,11,14,23,14,9,5,4,10,13,13,14,12,13,9,8,18,22,22,23,31 |
Future aggressive behavior | -0.8,-0.92,-0.8,-0.46,-0.47,-0.46,-0.23,-0.16,-0.23,-0.23,-0.16,-0.10,-0.10,0.04,0.13,0.19,0.25,0.23,0.15,0.23,0.31,0.18,0.17,0.39 |
- In-class Exercise 2. Inbreeding in wolfs. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Calculate correlation coefficient and test its significance by using the
cor.test()
function
Variable | Values |
---|---|
Inbreeding coefficient between mating pairs | 0,0,0.13,0.13,0.13,0.19,0.19,0.19,0.22,0.24,0.24,0.24,0.24,0.24,0.24,0.25,0.27,0.30,0.30,0.30,0.30,0.36,0.37,0.40 |
Number of pups surviving first winter | 6,6,7,5,4,8,7,4,3,3,3,3,2,2,6,3,5,3,2,1,3,2,3 |
- In-class Exercise 3. Miracles of memory. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Run a rank correlation using
cor.test(method="spearman")
function
Variable | Values |
---|---|
Years elapsed | 2,5,5,4,17,17,31,20,22,25,28,29,34,43,44,46,34,28,39,50,50 |
Impressive score | 1,1,1,2,2,2,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5 |
CollapseAssignment #9. Due 5/18 (Finalized) | ||||||||
---|---|---|---|---|---|---|---|---|
(Note: Lecture slides has been posted. For help, refer to the slides and R Data & Code on Publisher's Website) Question 1 (6 pts). In their study of hyena laughter (or "giggle"), Mathevon et al (2010) asked whether sound spectral properties of the giggles are associated with age. Perform correlation analysis using data in the following table:
Question 2(4 pts). A mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al (2011) examined the frequency of the resistant allele in different towns in Europe and Asia and compared it to how long humans had been settled in the site ("duration of settlement"):
|
May 18. Regression
- Chapter 16. R Data & Code on Publisher's Website
- In-class Exercise 1. Lion noses. Linear regression with regression line & confidence intervals
- In-class Exercise 2. Prairie Home Companion. Linear regression with test of non-zero slope
- In-class Exercise 3. Iris pollination. Square-root transformation
- In-class Exercise 4. Iron and phytoplankton growth. Non-linear regression
- In-class Exercise 5. Guppy cold death. Logistic regression
- Teacher's Evaluation
- Students can complete them now in 3 simple steps:
- Visit www.hunter.cuny.edu/te OR www.hunter.cuny.edu/mobilete (for smartphones)
- Sign in with your net ID and net ID password (forgot your password? Click here: https://netid.hunter.cuny.edu/verify-identity)
- Complete the evaluation for your instructor(s)
- Notes to students:
- Your responses are completely anonymous, and that instructors can only see results after grades are released.
- Teacher evaluations serve a number of important functions such as: improving classes by providing instructors feedback of their teaching; and, serving as supporting documentation in a faculty’s reappointment, tenure and promotion.
- Teacher evaluations also help the student make decisions about what courses and instructors are right for them. Please let the students know that the teacher evaluation results are readily accessible to them at www.hunter.cuny.edu/myprof.
- Students can complete them now in 3 simple steps:
CollapseAssignment #10. Due In-Class or 5/25 |
---|
(10 pts) Using the "Prairie Home Companion" data set to answer the following questions
|