Biol20N02 2017: Difference between revisions
imported>Weigang
imported>Weigang m (→Course Outline)
Line 149:
# (2 pts) Construct a matrix of 10 rows by combining the previous two vectors using the <code>cbind</code> function. Name the matrix as "mat". Assign row names as "ind1" .. "ind10". Show row values for ind1, column values for ran.2; transpose the matrix and save it as "mat.t".
# (2 pts) Construct a character vector of 10 US States (Hint: use the c() function). Name it "us.states". Use full, case-sensitive names and "_" in place of spaces. Show the first and the fifth states with one command.
# (2 pts) Understand variable types. (From Whitlock & Schluter) Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug while the other group received standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
## List the two variables and state whether each is categorical (if so, whether it is nominal or ordinal) or numerical (if so, whether it is discrete or continuous)
## State & explain which variable is the explanatory (i.e., predictive) and which is the response variable.
|}
===Feb 23. Data Visualization===
* Vector functions returning indices
Line 201:
## Import the BodyTemperature as a data frame & pick a numerical variable and plot its frequency distribution with the <code>hist()</code> function. Make a customized title (with the "main=" argument)
## Pick a character variable and show its frequency distribution with the <code>table()</code> function. Plot the data using the <code>barplot()</code> function
|}
Revision as of 04:01, 19 February 2017
Course Description
With rapid accumulation of genome sequences and digital health data, biomedicine is becoming an information science. This course is a hands-on, computer-based workshop on how to visualize and analyze biological data. The course introduces R, a modern statistical computing language and platform. In the first half, students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data-visualization techniques. In the second half, the course will review & apply statistical hypothesis tests including significance testing of means, association tests, and correlation analysis. Throughout the course, students will apply these methods to the analysis of large biological data sets, such as the human genome, transcriptomes (RNA-SEQ), and human genome variations.
This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.
Learning Goals
- Be able to use R as a plotting tool to visualize large-scale biological data sets
- Be able to use R as a statistical tool to summarize data and make biological inferences
- Be able to use R as a programming language to automate data analysis
Textbooks
- Whitlock & Schluter (2015). The Analysis of Biological Data (2nd edition).
Exams & Grading
- Attendance (or a note in case of absence) is required
- In-Class Exercises (50 pts).
- Assignments (~100 pts total). All assignments must be handed in as hard copies; email submissions will not be accepted. Late submissions receive a 10% deduction (of the total grade) per day.
- Three Mid-term Exams (3 X 30 pts each = 90 pts)
- Comprehensive Final Exam (50 pts)
- Bonus for active participation in classroom discussions
Course Outline
Feb 2. Introduction & R Demo
- Course overview
- Tutorial 1: R Demo
- Create a new project by navigating: File | New Project | New Directory. Name the project "human_genes"
- Import the human genes data set: File | Import DataSet | CSV, copy & paste this address: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
- Click "update". Rename the data set if you wish (short but informative names, e.g., hg, or human.genes). Do not use spaces; use a dot or underscore as the name delimiter (e.g., "human.genes" or "human_genes", but never "human genes"). The same rule applies to column and row names
dim(hg) # show dimension
head(hg) # show top rows
tail(hg) # show bottom rows
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # create a vector of gene lengths (Tab for auto-completion)
hg.len[1] # show first length
hg.len[10] # show the 10th item
hg.len[1:10] # show items 1 through 10
hg.len[c(1,10)] # show items 1 and 10
hg[1,1] # show item in 1st row, 1st column
hg[1,] # show all values for 1st row
hg[,1] # show all values for 1st column
hg[1:10,] # show rows 1 through 10
hg[,1:7] # show columns 1 through 7
hg[1:10, 1:7] # a subset
hist(hg.len, breaks = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
hist(log10(hg.len), breaks = 200) # log transformation for a non-normally distributed variable
gene.cts <- table(hg$Chromosome) # count the number of genes on each chromosome
barplot(gene.cts) # show the distribution with a barplot
mean(hg.len) # not representative: super-long genes carry too much weight in the average length
median(hg.len) # more representative. Use the median for a variable that is not normally distributed
summary(hg.len) # show minimum, quartiles, mean, and maximum
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # create a new column named "Gene.Length"
boxplot(log10(Gene.Length) ~ Chromosome, data = hg) # show gene length by chromosome
write.csv(hg, "hg.csv", row.names = FALSE) # save into a file
hg <- read.csv("hg.csv") # read back into R
- Export a PDF or image
- Open a new R script, name it as "hg.R"
- Select commands and save to script
- Retrieve and edit a command by pressing "up" or "down" arrows
- Retrieve commands by using the search box on the "History" tab
- Type q() to quit. Answer "y" to save workspace
- To reload and restore workspace, go to "C:/Users/instructor/Documents/human.genes" and double click on the file "human.gene"
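The "Export a PDF or image" step can also be done from a script rather than the menu. The following is a minimal sketch, assuming simulated gene lengths in place of the hg data set; the vector sim.len and the output file name are illustrative and not part of the course materials.

```r
# Simulated, right-skewed "gene lengths" (an illustrative stand-in for the hg data)
set.seed(1)
sim.len <- round(rlnorm(1000, meanlog = 10, sdlog = 1.5))

out.file <- file.path(tempdir(), "gene_length_hist.pdf")  # hypothetical output file name
pdf(out.file, width = 6, height = 4)    # open a PDF graphics device
hist(log10(sim.len), breaks = 50,
     main = "Simulated gene lengths", xlab = "log10(length)")
dev.off()                               # close the device so the file is written
file.exists(out.file)                   # TRUE if the PDF was created
```

Plot commands issued between pdf() and dev.off() are drawn into the file instead of the plot pane, so the same script reproduces the figure every time it is run.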
Assignment #1. Due 2/9, Thursday (Finalized)
Feb 9. (Class cancelled due to snow storm)
Feb 16. R Data Structure & Variable Types
- Vector
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponential function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector that repeats the same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; NA is written without quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements with indices
x[1]
x[1:3]
x[-2]
x[c(1,3)]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
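Because the last numeric example above fills x with NA, functions such as sum() and mean() return NA for it. A short sketch of how missing values behave in vector functions, using a made-up vector:

```r
x <- c(4, 8, NA, 15)       # a numeric vector with one missing value
length(x)                  # counts every element, including the NA
sum(x)                     # NA: one unknown value makes the sum unknown
mean(x, na.rm = TRUE)      # drop the NA first, then average the remaining three
is.na(x)                   # logical vector marking the missing elements
x[!is.na(x)]               # keep only the known values
```

The na.rm = TRUE argument is accepted by sum(), mean(), range(), and many other summary functions.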
- Matrix
age <- c(45, 52, 38, 61, 49) # a vector of ages (illustrative values, assumed here)
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol = 4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
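To check the fill order and the relationship between cbind(), rbind(), and t(), a small sketch with made-up vectors (a, b, and the matrix names are illustrative):

```r
m.col <- matrix(1:6, nrow = 2, ncol = 3)                # filled column by column
m.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled row by row
m.col[1, 2]   # 3: the second column starts with the third value
m.row[1, 2]   # 2: the first row holds 1, 2, 3

a <- c(1, 2, 3)
b <- c(10, 20, 30)
by.col <- cbind(a, b)      # 3 rows x 2 columns
by.row <- rbind(a, b)      # 2 rows x 3 columns
dim(by.col)
dim(by.row)
all(t(by.col) == by.row)   # TRUE: transposing one gives the other
```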
- Dataframe
class(hg) # show object class
hg[1,] # show first row
hg[,1] # show first column
hg[1:3,] # show rows 1 through 3
hg[,1:3] # show columns 1 through 3
hg$Gene.Name # show column "Gene.Name"
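A data frame differs from a matrix in that each column keeps its own type. The sketch below builds a tiny data frame from made-up vectors (the gene names and coordinates are hypothetical, chosen only to mirror the column names used for hg):

```r
# Hypothetical toy data (values made up for illustration only)
gene  <- c("geneA", "geneB", "geneC")
chrom <- c("1", "X", "MT")
start <- c(100, 5000, 20)
end   <- c(900, 5200, 1500)

m <- cbind(gene, start)         # a matrix holds ONE type: numbers become text
class(m[, "start"])             # "character"

df <- data.frame(Gene.Name = gene, Chromosome = chrom,
                 Gene.Start = start, Gene.End = end)
class(df$Gene.Start)            # "numeric": each column keeps its own type
df$Gene.Length <- df$Gene.End - df$Gene.Start + 1   # add a computed column
df[df$Gene.Length > 500, ]      # rows for genes longer than 500 bases
str(df)                         # compact summary of columns and types
```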
Assignment #2. Due 2/23 (Tentative)
Show commands and outputs for the following exercises:
Feb 23. Data Visualization
# The which() function returns the indices of TRUE elements
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # add a length column
hg.long.idx <- which(hg$Gene.Length > 1e6) # returns indices
hg.long <- hg[hg.long.idx,] # genes longer than 1 million bases
hg.mt.idx <- which(hg$Chromosome == "MT")
hg.mt <- hg[hg.mt.idx,] # mitochondrial genes
# The grep() function returns the indices of elements matching a pattern
p53.idx <- grep("P53", hg$Gene.Name)
hg.p53 <- hg[p53.idx,]
# The order() function returns the indices of sorted elements
idx.sorted <- order(hg$Gene.Length)
hg.sorted <- hg[idx.sorted,]
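The three index-returning functions can be tried without the full data set. This is a minimal sketch on a toy data frame; the lengths are rounded, illustrative values rather than real annotations.

```r
# Toy data frame; lengths are illustrative, not real annotations
toy <- data.frame(
  Gene.Name   = c("TP53", "BRCA1", "TP53BP1", "MT-CO1"),
  Chromosome  = c("17", "17", "15", "MT"),
  Gene.Length = c(19000, 81000, 110000, 1500)
)

is.long <- toy$Gene.Length > 50000   # logical vector: FALSE TRUE TRUE FALSE
which(is.long)                       # 2 3: positions of the TRUE values
grep("P53", toy$Gene.Name)           # 1 3: names containing "P53"
order(toy$Gene.Length)               # 4 1 2 3: row positions from shortest to longest
toy[order(toy$Gene.Length, decreasing = TRUE), ]   # longest gene first
toy[toy$Chromosome == "MT", ]        # a logical vector can also index rows directly
```

In each case the function returns positions, and those positions are then used inside the square brackets to pull out the matching rows.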
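x <- rnorm(1000) # 1000 random values from a standard normal distribution
x.sample <- sample(x, size = 100) # a random sample of 100 values, drawn without replacement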
n.genes <- nrow(hg) # number of rows
sampled.rows <- sample(1:n.genes, size = 100)
hg.sample <- hg[sampled.rows,] # a random sample of 100 genes
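Random samples change on every run unless the random-number seed is fixed. A short sketch, assuming an arbitrary seed value of 42 and a hypothetical table of 1000 rows:

```r
set.seed(42)                         # fix the random-number seed (42 is arbitrary)
s1 <- sample(1:1000, size = 100)     # 100 distinct row numbers (no replacement)
set.seed(42)
s2 <- sample(1:1000, size = 100)     # same seed, same sample
identical(s1, s2)                    # TRUE
anyDuplicated(s1)                    # 0: no row is picked twice
s3 <- sample(1:10, size = 100, replace = TRUE)   # with replacement, repeats are allowed
```

Setting a seed before sampling makes an assignment or analysis reproducible by someone else.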
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for continuous variable
hist(hg$Gene.Length, br = 100)
hist(log10(hg$Gene.Length), br = 100)
gene.cts <- table(hg$Chromosome) # distribution for a categorical vector
barplot(gene.cts)
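Both plotting functions accept a title through the "main" argument, and hist() also returns the bin counts it draws. The sketch below uses simulated lengths and a made-up chromosome vector; pdf(NULL) is used only so the example runs without opening a plot window.

```r
pdf(NULL)                                   # discard plots (no file is written)
set.seed(7)
len <- rlnorm(500, meanlog = 9, sdlog = 1)  # simulated right-skewed lengths (illustrative)
h <- hist(log10(len), breaks = 20,
          main = "Simulated gene lengths (log10)", xlab = "log10(length)")
h$breaks                                    # bin boundaries chosen by hist()
sum(h$counts)                               # bin counts add up to the number of values

chrom <- c("1", "2", "2", "X", "1", "2")    # a small categorical vector (made up)
cts <- table(chrom)                         # counts per category
cts.sorted <- sort(cts, decreasing = TRUE)  # most frequent category first
barplot(cts.sorted, main = "Genes per chromosome", ylab = "Count")
dev.off()
```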
March 2. Exam 1
March 9. Describing data
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, breaks = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
mean(hg.len) # not representative: super-long genes carry too much weight in the average length
median(hg.len) # more representative. Use the median for a variable that is not normally distributed
summary(hg.len) # show minimum, quartiles, mean, and maximum
IQR(hg.len) # 3rd quartile minus 1st quartile: the spread of the middle 50% of data points, informative even for a skewed distribution
log.len <- log10(hg.len); hist(log.len, breaks = 200) # log of gene length is closer to normally distributed
mean(log.len); median(log.len) # they should be similar, since log.len is approximately normal
# The next block is intended to show that the "mean length" of samples is approximately normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 lengths
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we can study the distribution of "mean length" (not "length" itself)
mean.len <- rep(NA, 1000) # prepare an empty vector to store the "mean lengths"
for (i in 1:1000) { # i takes the values 1 to 1000, one at a time
  samp.len <- sample(hg.len, 100) # draw a new random sample of 100 lengths
  mean.len[i] <- mean(samp.len) # store its mean in position i
}
hist(mean.len, breaks = 100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
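The same resampling exercise can be written more compactly with replicate(), and its result compared with the standard error of the mean. This is a sketch under stated assumptions: the vector pop is simulated from a log-normal distribution as a stand-in for hg.len, and the seed and sample size are arbitrary.

```r
set.seed(123)                                    # arbitrary seed for reproducibility
pop <- rlnorm(20000, meanlog = 10, sdlog = 1)    # simulated skewed "gene lengths" (illustrative)
n <- 100                                         # size of each sample

# replicate() repeats an expression and collects the results in a vector
mean.len <- replicate(1000, mean(sample(pop, n)))

sd(mean.len)          # observed spread of the 1000 sample means
sd(pop) / sqrt(n)     # standard error of the mean predicted from the population
```

The two numbers should be close: the spread of sample means shrinks with the square root of the sample size, which is why larger samples give more precise estimates of the mean.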
March 16. Sampling & Standard Error of Mean
March 23. Hypothesis Testing
# coin-flipping experiments
runif(1) # draw one random number between 0 and 1, uniformly distributed
rbinom(n = 1, size = 100, prob = 0.5) # flip 100 (size) fair (prob = 0.5) coins, one (n = 1) time
rbinom(n = 1000, size = 100, prob = 0.5) # repeat the above 1000 times
num.success <- rbinom(n = 1000, size = 100, prob = 0.5) # save the results
barplot(table(num.success)) # distribution of the number of successes
length(which(num.success <= 40))/1000 # estimated probability of 40 or fewer successes
# test if toads are right-handed: observation 14/18 are right-handed
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
barplot(table(right.handed.toads.by.chance)) # plot
length(which(right.handed.toads.by.chance >= 14))/1000 # estimated probability of a value of 14 or higher
# If the observation is 10/18: count outcomes at least as far from the expected 9 as 10 is, in either direction
right.handed.toads.by.chance <- rbinom(n = 1e6, size = 18, prob = 0.5)
length(which(right.handed.toads.by.chance <= 8 | right.handed.toads.by.chance >= 10))/1e6
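The simulated probabilities above can be checked against exact binomial calculations. The following sketch uses the base R functions dbinom(), pbinom(), and binom.test() for the 14/18 toad observation; the seed and the number of simulated experiments are arbitrary choices.

```r
# One-sided: probability of 14 or more right-handed toads out of 18 under the null
p.one <- sum(dbinom(14:18, size = 18, prob = 0.5))
p.one                                   # about 0.0154
1 - pbinom(13, size = 18, prob = 0.5)   # same value via the cumulative distribution

# Two-sided exact test
bt <- binom.test(x = 14, n = 18, p = 0.5)
bt$p.value                              # about 0.0309 (twice the one-sided value here)

# Simulation estimate for comparison, as in the class example
set.seed(2017)
sim <- rbinom(n = 1e5, size = 18, prob = 0.5)
mean(sim >= 14)                         # close to p.one
```

mean() of a logical vector gives the proportion of TRUE values, so mean(sim >= 14) is a shorter equivalent of length(which(sim >= 14)) divided by the number of simulations.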
March 30. Exam 2
April 6. Analyzing Proportions
April 13. No Class (Spring Break)
April 20. No Class (Monday Schedule)
April 27. Contingency Analysis
Lecture Slides:
May 4. Exam 3
May 11. One-sample t-test
May 18. Paired & Two sample t-tests
May 25. Final Exam (Comprehensive)
May 31. Grades submitted to Registrar Office