Biol20N02 2017: Difference between revisions

From QiuLab
Jump to navigation Jump to search
imported>Weigang
imported>Weigang
 
(90 intermediate revisions by the same user not shown)
Line 47: Line 47:
hg[,1] # show all values for 1st column
hg[,1] # show all values for 1st column
hg[1:10,] # show rows 1 through 10
hg[1:10,] # show rows 1 through 10
hg[,1:10] # show columns 1 through 10
hg[,1:7] # show columns 1 through 10
hg[1:10, 1:10] # a subset
hg[1:10, 1:7] # a subset
hist(hg$Length, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
hist(hg$Length, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
hist(log10(hg$Length), br = 200) # log transformation for non-normally distributed variable
hist(log10(hg$Length), br = 200) # log transformation for non-normally distributed variable
Line 83: Line 83:
|}
|}


===Feb 9. R Tutorial: Data Structure===
===Feb 9. (<font color="red">Class cancelled due to snow storm</font>)===
 
===Feb 16. R Data Structure & Variable Types===
* Vector
* Vector
<syntaxhighlight lang=R">
<syntaxhighlight lang=R">
Line 107: Line 109:
gender <- c("male", "female", "female", "male", "female")
gender <- c("male", "female", "female", "male", "female")
gender[3]
gender[3]
# Logical vectors
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE) # Use no quotes
is.male <- (gender == "male") # obtain a logic vector by a test
age <- c(60, 43, 72, 35, 47)
is.60 <- (age == 60)
less.60 <- (age <= 43)
is.female <- !is.male # use the logical negate operator (!)
# The which() function returns the indices of TRUE elements
ind.male <- which(is.male)
ind.young <- which(age < 45)
age[ind.young] # obtain ages of young individuals
# The order() function returns the indices of sorted elements
ind.sorted <- order(age)
age.sorted <- age[ind.sorted]
</syntaxhighlight>
</syntaxhighlight>
* Matrix
* Matrix
Line 153: Line 141:
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #2. Due 2/23, Tuesday (Tentative)
! Assignment #2. Due 2/23 (Finalized)
|- style="background-color:powderblue;"
|
Show commands and outputs for the following exercises:
# (2 pts) Construct a numeric vector of 10 random numbers from the uniform distribution between 0 and 1 (Hint: use the function <code>runif()</code>). Name the resulting vector as "rand.1". Show length, range, mean, and variance.
# (2 pts) Construct a numeric vector of 10 random numbers from a normal distribution with mean of 0 and variance of 1 (Hint: use the function <code>rnorm()</code>). Name the resulting vector as "rand.2". Show length, range, mean, and variance. Compare the variance with the previous one: which is large?
# (2 pts) Construct a matrix of 10 rows by combining the previous two vectors using the <code>cbind</code> function. Name the matrix as "mat". Assign row names as "ind1" .. "ind10". Show row values for ind1, column values for rand.2; transpose the matrix and save it as "mat.t".
# (2 pts) Construct a character vector of 10 US States. Name it "us.states". Use full, case-sensitive names and "_" in place of spaces. Show the first and the fifth states with one command.
# (1 pt) Construct a logical vector (named "less.half") of 10 from rand.1 by testing if the elements are less than 0.5.
# (1 pt) Use the <code>which()</code> function to find the indices of elements in rand.1 of values that are less than 0.5. Show values.
|}
 
===Feb 16. Statistics & Samples===
* Population and sample
<syntaxhighlight lang=R">
x <- rnorm(1000)
x.sample.1 <- sample(x, 100)
</syntaxhighlight>
* Explore variable distributions
<syntaxhighlight lang=R">
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for all nubmers
hist(BodyTemperature[,2], main = "Age frequency distribution", xlab = "Age", ylab = "Counts") # age distribution, with customized title and axis labels
stem(BodyTemperature[,2]) # stem-leaf plot
table(BodyTemperature[,1]) # distribution for a categorical vector
</syntaxhighlight>
* Types of variables
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Assignment #3. Due 3/1, Tu (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (6 pts) Understand R functions. Each R function consists of arguments, which specify inputs (which are required) and options (which are optional). When successfully run, functions return an output. To use a function properly, it is <font color = "red">necessary</font> to identify the data type (i.e., whether numeric or character, whether vector, matrix, or a data frame) of the inputs, options, and outputs. In R studio,
# (2 pts) Construct a numeric vector of 10 random numbers from the uniform distribution between 0 and 1 (Hint: use the function <code>runif()</code>). Name the resulting vector as "ran.1". Show length, range, mean, and variance.
## Show help page for the "sample()" function by typing <code>?sample</code>. Identify the data type of the input <code>x</code>.
# (2 pts) Construct a numeric vector of 10 random numbers from a normal distribution with mean of 0 and variance of 1 (Hint: use the function <code>rnorm()</code>). Name the resulting vector as "ran.2". Show length, range, mean, and variance. Compare the variance with the previous one: which is large?
## Explain the following options: <code>size</code> and <code>replace</code>; show their default values (i.e., the values they take when they are not specified).
# (2 pts) Construct a matrix of 10 rows by combining the previous two vectors using the <code>cbind</code> function. Name the matrix as "mat". Assign row names as "ind1" .. "ind10". Show row values for ind1, column values for ran.2; transpose the matrix and save it as "mat.t".
## Identify what is the expected data type of the output. Use the function for the following tasks:
# (2 pts) Construct a character vector of 10 US States (Hint: use the c() function). Name it "us.states". Use full, case-sensitive names and "_" in place of spaces. Show the first and the fifth states with one command.
## Run examples of the function by typing <code>example(sample)</code>. Explain the 1st and 2nd examples by listing the input, options, and output.  
## Create a vector of numbers from 1 to 100; obtain a permuted sample of the same vector (with the same length)
## Create a vector of two elements consisting of "Male" and "Female"; obtain a vector of 100 elements randomly drawn from the original vector [Hint: by using the <code>replace = TRUE</code> argument]
# (2 pts) Explore frequency distribution of variables
## Import the BodyTemperature as a data frame & pick a numerical variable and plot its frequency distribution with the <code>hist()</code> function. Make a customized title (with the "main=" argument)
## Pick a character variable and show its frequency distribution with the <code>table()</code> function. Plot the data using the <code>barplot()</code> function
# (2 pts) Understand variable types. (From Whitlock & Schluter) Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug while the other group received standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
# (2 pts) Understand variable types. (From Whitlock & Schluter) Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug while the other group received standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
## List the two variables and state whether each is categorical (if so, whether it is nominal or ordinal) or numerical (if so, whether it is discrete or continuous)
## List the two variables and state whether each is categorical (if so, whether it is nominal or ordinal) or numerical (if so, whether it is discrete or continuous)
Line 200: Line 153:
|}
|}


===Feb 23. Displaying data===
===Feb 23. Data Visualization===
Slides for part 1: [[File:Biostat-part-1.pdf|thumbnail]]
* Vector functions returning indices
<syntaxhighlight lang="bash">
# The which() function returns the indices of TRUE elements
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # add a length column
hg.long.idx <- which(hg$Gene.Length > 1e6) # returns indices
hg.long <- hg[hg.long.idx,] # genes longer than 1 million bases
hg.mt.idx <- which(hg$Chromosome == "MT")
hg.mt <- hg[hg.mt.idx,] # mitochondrial genes
# The grep() function returns the indices of matching a pattern
p53.idx <- grep("P53", hg$Gene.Name)
hg.p53 <- hg[p53.idx,]
# The order() function returns the indices of sorted elements
idx.sorted <- order(hg$Gene.Length)
hg.sorted <- hg[idx.sorted,]
</syntaxhighlight>
* Textbook: Chapter 1. Statistics and Sample; Chapter 2. Displaying Data
* Lecture slides: [[File:Lecture-slides-part-1.pdf|thumbnail]]
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #4. Due 3/8, Finalized
! Assignment #3. Due 3/2 (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
Line 210: Line 179:
# (1 pt) Identity a character variable and obtain frequency counts using the "table()" function
# (1 pt) Identity a character variable and obtain frequency counts using the "table()" function
# (1 pt) Identity a numerical variable and obtain frequency distribution by a histogram. Use customized x-axis label
# (1 pt) Identity a numerical variable and obtain frequency distribution by a histogram. Use customized x-axis label
# (2 pts) Make a boxplot of distribution of each numerical variable with respect to species.  
# (2 pts) Make a boxplot of distribution of each numerical variable with respect to species. For example, <code>boxplot(Sepal.Length ~ Species, data=iris)</code>. Label axes and title.
# (2 pt) Make a strip chart of distribution of each numerical variable with respect to species. Customize it to be vertical, open circle symbol (pch = 1), and using the method of "jitter".  
# (2 pt) Make a strip chart of distribution of a numerical variable with respect to species using the <code>stripchart()</code> function. Customize it with axes labels, open circle symbol (pch = 1), the method of "jitter", and being vertical ("vertical=T").
# (1 pt) Make a scatter plot to show relations between two numerical variables
# (1 pt) Make a scatter plot to show relations between two numerical variables. For example, <code>plot(iris$Sepal.Length, iris$Sepal.Width)</code>
# (2 pts) Among graduate school applicants to a university department, 512 males were admitted, 313 males were rejected, 89 females were admitted, and 19 females were rejected. Explore if there is gender bias in admission by
# (2 pts) Among graduate school applicants to a university department, 512 males were admitted, 313 males were rejected, 89 females were admitted, and 19 females were rejected. Explore if there is gender bias in admission by
## Identify the explanatory and response variables, as well as whether the variables are character or numerical
## Identify the explanatory and response variables, as well as whether the variables are character or numerical
Line 219: Line 188:
## Plot the contingency table using the mosaicplot() function. Based on the plot, explain if there is evidence for gender bias. [Hint: try matrix transposition]
## Plot the contingency table using the mosaicplot() function. Based on the plot, explain if there is evidence for gender bias. [Hint: try matrix transposition]
|}
|}
===March 2. Exam 1===
===March 2. Exam 1===


===March 9. Describing data===
===March 9. Describing data & Standard Error===
# In-Class exercise: A study of human gene length
* Textbook: Chapters 3 & 4
# Import human gene data set from http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
* Population and sample
<syntaxhighlight lang="bash">
x <- rnorm(1000)
x.sample <- sample(x, size = 100)
n.genes <- nrow(hg) # number of rows
sampled.rows <- sample(1:n.genes, size = 100)
hg.sample <- hg[sampled.rows,] # a random sample of 100 genes
</syntaxhighlight>
* Explore variable distributions
<syntaxhighlight lang=R">
<syntaxhighlight lang=R">
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for continuous variable
hist(hg$Gene.Length, br = 100)
hist(log10(hg$Gene.Length), br = 100)
gene.cts <- table(hg$Chromosomes) # distribution for a categorical vector
barplot(gene.cts)
</syntaxhighlight>
* In-Class exercise: A study of human gene length
<syntaxhighlight lang="bash">
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
Line 237: Line 224:
mean(samp.len) # a sample mean
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
mean.len <- rep(NA, 1000) # prepare an empty vector to store the "mean lengths"
mean.sample.100 <- sapply(1:1000, function(x) mean(sample(hg$Gene.Length, size = 100)))
for (i in 1:1000) { # i takes the value from 1 to 1000, one at a time
hist(mean.sample.100, br=100) # you should see a more normally distributed histogram
  samp.len <- sample(hg.len, 100);
  mean.len[i] <- mean(samp.len);
}
hist(mean.len, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
# The above exercise is a demonstration of the "Central Limit Theorem"
</syntaxhighlight>
</syntaxhighlight>
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #5. Due 3/22. (Finalized)
! Assignment #4. Due 3/16. (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (2 pts) Repeat the above sampling experiment (including the "for" loop, sample size N =100). Save results to a vector called "mean.100" (which is a vector of means, with a length of 1000 elements). Show histogram. Show mean. Show standard deviation
# (2 pts) Repeat the above sampling experiment (sample size N =100). Save results to a vector called "mean.100" (which is a vector of means, with a length of 1000 elements). Show histogram. Show mean. Show standard deviation.
# (2 pts) Repeat the above sampling experiment (sample size N =20). Save results to a vector called "mean.20". Show histogram. Show mean. Show standard deviation
# (2 pts) Repeat the above sampling experiment (sample size N =500). Save results to a vector called "mean.500". Show histogram. Show mean. Show standard deviation
# (2 pts) Repeat the above sampling experiment (sample size N =500). Save results to a vector called "mean.500". Show histogram. Show mean. Show standard deviation
# (2 pts) Explain why mean is not a good description of a "typical" human gene length
# (2 pts) Repeat the above sampling experiment (sample size N =5000). Save results to a vector called "mean.5000". Show histogram. Show mean. Show standard deviation
# (2 pts) Watch this Khan Academy video and describe the [https://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem Central Limit Theorem]. Explain what is the "sampling distribution of mean".
# (1 pt) Explain why mean is not a good description of a "typical" human gene length and which statistic is a better?
<syntaxhighlight lang=R">
# (0.5 pt) Sort the human genes by length (using the <code>order()</code>) & created a sorted data.frame ("hg.sorted") of the original data.frame ("hg").
# Sample size 100
# (0.5 pt) Obtain human genes longer than 1 million bases by using the <code>which()</code> function & create a new data.frame of long genes ("hg.long")  
samp.size.100.means <- rep(NA, 1000)
# (2 pts) Combine the three simulated-mean vectors into one (see below). Define "Standard error" & "confidence interval (CI)". Does CI quantify accuracy or precision?
for (i in 1:1000) {
<syntaxhighlight lang="bash">
  samp <- sample(hg.len, 100)
  samp.size.100.means[i] <- mean(samp)
}
hist(samp.size.100.means, br=100)
# Sample size 20
samp.size.20.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 20)
  samp.size.20.means[i] <- mean(samp)
}
hist(samp.size.20.means, br=100)
# Sample size 500
samp.size.500.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 500)
  samp.size.500.means[i] <- mean(samp)
}
hist(samp.size.500.means, br=100)
# Combine
# Combine
sample.combined <- cbind(samp.size.20.means, samp.size.100.means, samp.size.500.means)
sample.combined <- cbind(mean.100, mean.500, mean.5000)
colnames(sample.combined) <- c("samp.20", "samp.100", "samp.500")
colnames(sample.combined) <- c("samp.100", "samp.500", "samp.5000")
# plot in a single frame
# plot in a single frame
par(mfrow=c(3,1))
par(mfrow=c(3,1))
Line 289: Line 253:
|}
|}


===March 16. Sampling & Standard Error of Mean===
===March 16. Hypothesis Testing===
* In-Class exercise 1. Descriptive statistics
* Chapter 6
# Make a vector of the following blood pressure measurements (in mmHg): 112, 128, 108, 129, 125, 153, 155, 132, 137. Calculate sample size, sum, mean, variance, coefficient of variation (CV), and median
* In-Class exercise 1. The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
# Take a sample of 100 human gene lengths. Calculate median, IQR, 1.5*IQR; Make a boxplot
#Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
# The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
#Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
## Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
#Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
## Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
## Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
* In-Class exercise 2. standard error & confidence interval
# Blood pressure: What is the standard deviation of the above blood pressure?
# What is the sample size? Calculate standard error of the mean.
# Use the 2SE rule of thumb, calculate 95% confidence interval.  
# Use the 2SE rule of thumb, calculate 95% confidence interval.  
# Plot standard error & standard deviation
# Plot standard error & standard deviation
{| class="wikitable sortable mw-collapsible"
<syntaxhighlight lang=R">
|- style="background-color:lightsteelblue;"
df1 <- data.frame(bm = c(8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7), species = rep("sp1", 16))
! Assignment #6. Due 3/29 (Finalized)
df2 <- data.frame(bm = c(16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16), species = rep("sp2", 12))
|- style="background-color:powderblue;"
df3 <- data.frame(bm = c(40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41), species = rep("sp3", 16))
|
df.bm <- rbind(df1, df2, df3)
A study of expression levels of human genes
boxplot(bm ~ species, data= df.bm)
# (2 pts) Make a single data frame containing the gene expression values for 3 genes (Hint: name two columns as "expression" and "gene")
sd(df1$bm)
## "MED1": 12.38918, 9.084664, 9.48416, 9.363928, 8.194495, 8.694836, 8.771101, 9.998151, 12.66877, 8.684064, 8.944236, 11.8491, 8.40968, 8.990329, 9.782376, 8.58243, 12.00455, 8.580401, 9.161046, 9.047977, 8.672018, 8.811856, 8.354933, 8.763175
mean.1 <-mean(df1$bm)
## "ZBTB42": 8.377784, 7.832712, 8.65289, 4.59474, 5.598869, 4.912963, 5.24125, 7.688584, 7.36693, 4.463853, 5.646581, 6.830076, 4.485883, 6.741698, 6.967342, 5.307032, 6.80991, 7.612475, 5.795508, 5.033554, 5.032286, 4.979937, 8.315718, 5.801263, 7.136532, 4.722164, 5.416593, 4.456056, 6.253954, 5.684245, 8.255962, 8.629676, 8.348159, 8.114049, 6.786746, 7.893434, 7.836647, 4.733391, 6.895385, 7.123281, 4.75207
se.1 <- sd(df1$bm)/sqrt(nrow(df1))
## "TCF24": 3.427531, 3.383114, 3.574041, 3.449132, 3.784881, 3.686278, 3.545466, 3.624868, 3.377987, 3.256293, 3.381218, 3.79938, 3.419852, 3.292284
ci.1 <- c(mean(df1$bm) - 2 * se.1, mean(df1$bm) + 2*se.1)
# (2 pts) Make a boxplot of expression values separated by genes. Explain which gene has the highest level of expression & which has the lowest
bm.aov <- aov(bm ~ species, data = df.bm)
# (2 pts) Make a histogram of all expression values. Are they normally distributed? Test normality by qqnorm() and qqline() plots
</syntaxhighlight>
# (2 pts) Obtain the mean, median, and standard deviation of expression separated by genes (Use the tapply() function to get full credits)
* In-Class exercise 2. Hypothesis testing through simulation
# (2 pts) Explain what is "standard error of mean" and how is it different from standard deviation (Hint: explain sampling]. In taking a sample from a population, what can you do to make the standard error of the mean smaller?
|}
 
===March 23. Hypothesis Testing===
* In-Class exercise 3. Hypothesis testing through simulation
<syntaxhighlight lang=R">
<syntaxhighlight lang=R">
# coin-flipping experiments
# coin-flipping experiments
Line 332: Line 286:
barplot(table(right.handed.toads.by.chance)) # plot
barplot(table(right.handed.toads.by.chance)) # plot
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14
# If the observation is 10/18
# Binomial test
right.handed.toads.by.chance <- rbinom(n = 1e6, size = 18, prob = 0.5)
binom.test(x=14, size=18, p =0.5)
length(which(right.handed.toads.by.chance <= 8 | right.handed.toads.by.chance>=10))/1e6
</syntaxhighlight>
</syntaxhighlight>
* Lecture Slides [[File:Part-2-distribution.pptx|thumbnail]]
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #7. Due 4/5 (Finalized)
! Assignment #5. Due 3/23 (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
# (2 pts) Identify whether each of the following statements is more appropriate as the null or alternative hypothesis. State appropriate H0 and H1 for each question.
# (1 pt) Make a single data frame containing the gene expression values for two genes (Hint: create two dataframes and name two columns as "expression" and "gene", then use <code>rbind()</code> to combine two dataframes into one)
## "MED1": 12.38918, 9.084664, 9.48416, 9.363928, 8.194495, 8.694836, 8.771101, 9.998151, 12.66877, 8.684064, 8.944236, 11.8491, 8.40968, 8.990329, 9.782376, 8.58243, 12.00455, 8.580401, 9.161046, 9.047977, 8.672018, 8.811856, 8.354933, 8.763175
## "ZBTB42": 8.377784, 7.832712, 8.65289, 4.59474, 5.598869, 4.912963, 5.24125, 7.688584, 7.36693, 4.463853, 5.646581, 6.830076, 4.485883, 6.741698, 6.967342, 5.307032, 6.80991, 7.612475, 5.795508, 5.033554, 5.032286, 4.979937, 8.315718, 5.801263, 7.136532, 4.722164, 5.416593, 4.456056, 6.253954, 5.684245, 8.255962, 8.629676, 8.348159, 8.114049, 6.786746, 7.893434, 7.836647, 4.733391, 6.895385, 7.123281, 4.75207
# (1 pt) Make a stripchart of expression values separated by genes.
# (1 pt) Obtain the mean, median, standard deviation, standard error, and 95% confidence intervals of expression for each gene
# (1 pt) Run two-sampled t test <code>t.test</code> to test if two genes differ significantly in expression levels. Specify what is the null hypothesis and what is the alternative hypothesis
# (2 pts) Identify whether each of the following statements is more appropriate as the null (H0) or alternative (H1) hypothesis. State appropriate H0 and H1 for each question.
## Most genetic mutations are deleterious
## Most genetic mutations are deleterious
## A diet has no effect on liver function
## A diet has no effect on liver function
# (4 pts) The following table lists results of a study on the impact of tourism to species density.
# (4 pts) Can parents distinguish their own children by smell alone? In a study, eight of nine mothers identified their children correctly based on the smell of T-shirts children wore for three consecutive nights.
## Explain what the p-value means
## State the null & alternative hypotheses.
## Which species show a statistically significant reduction in mean density near heavily visited ruins?
## Simulate the null distribution using the "rbinom()" function (with n = 1 million). Plot the distribution using "barplot"
## Which species show a statistically significant increase in mean density near heavily visited ruins?
## Verify your result with the "binom.test()" function.  
## Which species provide no significant evidence in reduction or in increase?
## Conclude by explain the p-value.
|}
 
===March 23. One-sampled & Two-sampled t-test===
* Chapters 10-12
* Lecture Slides: [[File:Lecture-slides-part-2.pdf|thumbnail]]
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Assignment #6. Due 3/30 (Finalized)
|- style="background-color:powderblue;"
|
* (5 pts) Ostriches live in hot environments. To test if ostriches could reduce brain temperature relative to body temperature, the mean body and brain temperature (in Celsius) of six ostriches was recorded in the following table.
# State the null and alternative hypotheses
# Make a data frame and visualize with a boxplot combined with a stripchart
# Test if there is significant difference between the body and brain temperature. [Hint: Choose the most appropriate t-test among the one-sample, paired, or two-sample t-test]
{| class="wikitable"
{| class="wikitable"
|-
|-
! Species !! Mean density near ruins !! Mean density far from ruins !! P-value
! Ostrich !! Body Temperature !! Brain Temperature
|-
|-
| Agouti || 160.2 || 14.5 || 0.03
| 1 || 38.51 || 39.32
|-
|-
| Coatimundi || 99.4 || 1.0 || 0.01
| 2 || 38.45 || 39.21
|-
|-
| Collared precany || 4.6 || 1.8 || 0.79
| 3 || 38.27 || 39.2
|-
|-
| Deppes squirrel || 32.3 || 2.2 || 0.54
| 4 || 38.52 || 38.68
|-
|-
| Howler monkey || 7.3 || 1.9 || 0.03
| 5 || 38.62 || 39.09
|-
|-
| Spider monkey || 170.8 || 15.0 || 0.88
| 6 || 38.18 || 38.94
|-
| Crested guan || 0 || 49.4 || 0.001
|-
| Great curassow || 10.8 || 72.0 || 0.048
|-
| Ocellated turkey || 47.0 || 0 || 0.02
|-
| Tinamos || 0 || 4.9 || 0.049
|}
|}
* (4 pts) Can parents distinguish their own children by smell alone? In a study, eight of nine mothers identified their children correctly based on the smell of T-shirts children wore for three consecutive nights.
* (5 pts) Could mosquitoes detect odor of what people drink? The following data points are the relative activation time (the smaller the quicker) of mosquitoes flying toward volunteers, divided into beer- and water-drinking groups:
# State the null & alternative hypotheses.
** Beer group: 0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21
# Simulate the null distribution using the "rbinom()" function (with n = 1 million). Plot the distribution using "barplot"
** Water group: 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28
# Calculate from the null distribution the probability of obtaining 8 or 1 correct out of 9 by chance, which would be the p-value of this experiment. Is the result significant?
# State the null and alternative hypotheses  
# Verify your result with the "binom.test()" function.  
# Make a data frame and visualize with a boxplot combined with a stripchart
# Test normality of data points by showing qqnorm() and qqline() plots (separate for two column)  
# Test if there is significant difference between the two groups using a t-test.
|}
|}


Line 384: Line 350:


===April 6. Analyzing Proportions===
===April 6. Analyzing Proportions===
* Chapter 7 & 8
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #8. Due 4/19 (Finalized)
! Assignment #7. Due 4/19 (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
Line 405: Line 372:
===April 20. No Class (Monday Schedule)===
===April 20. No Class (Monday Schedule)===
===April 27. Contingency Analysis===
===April 27. Contingency Analysis===
Lecture Slides: [[File:Part-3-frequency.pptx|thumbnail]]
* Chapter 9.
* Lecture Slides: [[File:Part-3-frequency-2017.pdf|thumbnail]]
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #9. Due 5/3 (Finalized)
! Assignment #8. Due 5/4 (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
Line 417: Line 385:
!  !! CC !! CT !! TT
!  !! CC !! CT !! TT
|-
|-
| Taster || 79 || 82 || 9
| Taster || 44 || 20 || 40
|-
|-
| Non-Taster || 4 || 6 || 35
| Non-Taster || 19 || 10 || 39
|}
|}
</center>
</center>
# (1 pt) State the explanatory and response variables
# Hardy-Weinberg Analysis
# (1 pt) State the null & alternative hypotheses
## (1 pt) Calculate overall genotype counts (regardless of phenotype)
# (2 pts) Create a matrix called "taster.genotypes" and display the data as a mosaic plot. Make sure that the explanatory variable is on the ''X''-axis
## (1 pt) Test the observed genotype counts against Hardy-Weinberg expectations (consult lecture slides)
# (1 pt) Test the genotype-phenotype association with a chi-square test, with simulated p value. Conclude if there is significant association
## (2 pt) Perform a chi-square test using observed counts and expected proportions. Is the result significant? State for each genotype if it is under- or over-presented. What assumptions of Hardy-Weinberg equilibrium are not met, if your test result is significant?
# (1 pt) Save the above test result as "taste.chisq". Show observed & expected counts
# Genotype-phenotype association analysis
# (1 pt)  Identify over- and under-represented genotypes
## (1 pt) State the null & alternative hypotheses
# (3 pts) Test if there is association between the ''alleles'' (i.e., "C" or "T") and the Taster/non-Taster phenotypes
## (1 pt) Create a matrix called "taster.genotypes" and display the data as a mosaic plot. Make sure that the explanatory variable is on the ''X''-axis
## Calculate allele counts in each phenotypic group
## (2 pts) Test the genotype-phenotype association with a chi-square test, with simulated p value. Conclude if there is significant association between the genotypes and phenotype
## State null and alternative hypotheses; Create a matrix called "taster.alleles". Make a mosaic plot
## (2 pts) Save the above test result as "taste.chisq". Show observed & expected counts and identify over- and under-represented for each genotype/phenotype combination
## Perform chi-square test, draw your conclusions, and identify which allele ("C" or "T") is associated with the "Taster" phenotype
|}
|}


===May 4. Exam 3===
===May 4. Exam 3===
* Review lecture slides (part 3)
* Review lecture slides (part 3)
* Review 2 previous exams
* Review two assignments


===May 11. One-sample t-test===
===May 11. Correlation===
* Motivating example: weights of ticks
* Chapter 16. [http://whitlockschluter.zoology.ubc.ca/r-code/rcode16 R Data & Code on Publisher's Website]
# Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/tick.tsv
* Lecture slides: [[File:Part-4-correlation-spring-2017.pdf|thumbnail]]
# How to visualize data? What kinds of plots to make?
* In-class Exercise 1. Aggression behavior in birds. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels
# Is there sexual dimorphism?
{| class="wikitable"
* No homework assignment (to be combined with the next one)
|-
 
! Variable !! Values
===May 18. Paired & Two sample t-tests===
|-
* Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/ahus.csv
| Number of visits by non-parent adults || 1,7,15,4,11,14,23,14,9,5,4,10,13,13,14,12,13,9,8,18,22,22,23,31
* PLEASE FILL IN TEACHER EVALUATIONS: [http://www.hunter.cuny.edu/te '''Teacher's evaluation''']
|-
* All lecture slides
| Future aggressive behavior || -0.8,-0.92,-0.8,-0.46,-0.47,-0.46,-0.23,-0.16,-0.23,-0.23,-0.16,-0.10,-0.10,0.04,0.13,0.19,0.25,0.23,0.15,0.23,0.31,0.18,0.17,0.39
** Part 1. [[File:Biostat-part-1.pdf|thumbnail]]
|}
** Part 2. [[File:Part-2-distribution.pptx|thumbnail]]
* In-class Exercise 2. Inbreeding in wolfs. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Calculate correlation coefficient and test its significance by using the <code>cor.test()</code> function
** Part 3. [[File:Part-3-frequency.pptx|thumbnail]]
{| class="wikitable"
** Part 4. [[File:Part-4-t-tests.pptx|thumbnail]]
|-
! Variable !! Values
|-
| Inbreeding coefficient between mating pairs || 0,0,0.13,0.13,0.13,0.19,0.19,0.19,0.22,0.24,0.24,0.24,0.24,0.24,0.24,0.25,0.27,0.30,0.30,0.30,0.30,0.36,0.37,0.40
|-
| Number of pups surviving first winter || 6,6,7,5,4,8,7,4,3,3,3,3,2,2,6,3,5,3,2,1,3,2,3
|}
* In-class Exercise 3. Miracles of memory. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Run a rank correlation using <code>cor.test(method="spearman")</code> function
{| class="wikitable"
|-
! Variable !! Values
|-
| Years elapsed || 2,5,5,4,17,17,31,20,22,25,28,29,34,43,44,46,34,28,39,50,50
|-
|Impressive score || 1,1,1,2,2,2,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5
|}
{| class="wikitable sortable mw-collapsible"
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #10. Due 5/24 (Finalized)
! Assignment #9. Due 5/18 (Finalized)
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
|  
|  
* (5 pts) Ostriches live in hot environments. To test if ostriches could reduce brain temperature relative to body temperature, the mean body and brain temperature (in Celsius) of six ostriches was recorded in the following table.  
(Note: Lecture slides has been posted. For help, refer to the slides and [http://whitlockschluter.zoology.ubc.ca/r-code/rcode16 R Data & Code on Publisher's Website])
# State the null and alternative hypotheses <code>H0: there is no difference in body and brain temperature; H1: the brain temperature is different from the body temperature</code>
 
# Make a data frame and visualize with a boxplot combined with a stripchart <code>x <- data.frame(id = 1:6, body = c(), brain = c())</code>
Question 1 (6 pts). In their study of hyena laughter (or "giggle"), Mathevon et al (2010) asked whether sound spectral properties of the giggles are associated with age. Perform correlation analysis using data in the following table:
# Test normality of data points by showing qqnorm() and qqline() plots (separate for two column) <code>qqnorm(x$body); qqline(x$body)</code>
# Test if there is significant difference between the body and brain temperature <code>t.test(x$body, x$brain, paired = T)</code>
{| class="wikitable"
{| class="wikitable"
|-
|-
! Ostrich !! Body Temperature !! Brain Temperature
| Age (years) || 2,2,2,6,9,10,13,10,14,14,12,7,11,11,14,20
|-
|-
| 1 || 38.51 || 39.32
| Frequency (Hz) || 840,670,580,470,540,660,510,520,500,480,400,650,460,500,580,500
|}
* Identify explanatory and response variables and their data types.
* Make a data frame and test normality of both variables
* Make a scatterplot including axes labels and a main title
* Calculate correlation coefficient between the two variables and show test results
* State null and alternative hypotheses of the correlation test
* Conclude if there is a correlation between the age and "giggle", if it is statistically significant, and whether positive or negative if the relationship is significant.
 
Question 2(4 pts). A mutation in the <i>SLC11A1</i> gene in humans causes resistance to tuberculosis. Barnes et al (2011) examined the frequency of the resistant allele in different towns in Europe and Asia and compared it to how long humans had been settled in the site ("duration of settlement"):
{| class="wikitable"
|-
|-
| 2 || 38.45 || 39.21
| Duration of settlement (years) || 8010,5260,4735,4010,3710,2810,2730,2310,2110,1955,1910,1300,378,194,130,110,91
|-
|-
| 3 || 38.27 || 39.2
| Allele frequency || 0.99,1.0,0.948,0.885,0.947,0.854,1.0,0.769,0.9956,0.979,0.865,0.922,0.821,0.842,0.734,0.766,0.772
|-
|}
| 4 || 38.52 || 38.68
* Make a data frame and test variable normality.
|-
* Examine data with a scatterplot (with axes labels).
| 5 || 38.62 || 39.09
* Since the relationship appears curvilinear, use Spearman's rank correlation to test the association
|-
* Show test results and state your conclusions (if there is a correlation, if it is significant, and whether the relation is positive or negative if significant).
| 6 || 38.18 || 38.94
|}
|}
* (5 pts) Could mosquitoes detect odor of what people drink? The following data points are the relative activation time (the smaller the quicker) of mosquitoes flying toward volunteers, divided into beer- and water-drinking groups:
 
** Beer group: 0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21
===May 18. Regression===
** Water group: 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28
* Chapter 16. [http://whitlockschluter.zoology.ubc.ca/r-code/rcode17 R Data & Code on Publisher's Website]
# State the null and alternative hypotheses <code>H0: there is no difference in activation time between the two groups; H1: reaction time differs between the two groups</code>
* In-class Exercise 1. Lion noses. Linear regression with regression line & confidence intervals
# Make a data frame and visualize with a boxplot combined with a stripchart<code>x <- data.frame(time = c(0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21, 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28),  group = c(rep("beer", 25), rep("water",18)))</code>
* In-class Exercise 2. Prairie Home Companion. Linear regression with test of non-zero slope
# Test normality of data points by showing qqnorm() and qqline() plots (separate for two column) <code>qqnorm(x$time[1:25]); qqline(x$time[1:25])</code>
* In-class Exercise 3. Iris pollination. Square-root transformation
# Test if there is significant difference between the two groups<code>t.test(time ~ group, data = x)</code>
* In-class Exercise 4. Iron and phytoplankton growth. Non-linear regression
* In-class Exercise 5. Guppy cold death. Logistic regression
* Teacher's Evaluation
** Students can complete them now in 3 simple steps:
*** Visit [http://www.hunter.cuny.edu/te www.hunter.cuny.edu/te] OR [http://www.hunter.cuny.edu/mobilete www.hunter.cuny.edu/mobilete] (for smartphones)
*** Sign in with your net ID and net ID password (forgot your password? Click here: https://netid.hunter.cuny.edu/verify-identity)
*** Complete the evaluation for your instructor(s)
** Notes to students:
*** Your responses are completely anonymous, and that instructors can only see results after grades are released.
*** Teacher evaluations serve a number of important functions such as: improving classes by providing instructors feedback of their teaching; and, serving as supporting documentation in a faculty’s reappointment, tenure and promotion.
*** Teacher evaluations also help the student make decisions about what courses and instructors are right for them. Please let the students know that the teacher evaluation results are readily accessible to them at www.hunter.cuny.edu/myprof.
{| class="wikitable sortable mw-collapsible"
|- style="background-color:lightsteelblue;"
! Assignment #10. Due In-Class or 5/25
|- style="background-color:powderblue;"
| (10 pts) Using the "Prairie Home Companion" data set to answer the following questions
* Test the normality of response variable using <code>qqnorm</code> and <code>qqline</code>
* Make scatter plot with axes labels
* Log-transform the response variable and make a new scatter plot
* Perform regression analysis using <code>lm</code>
* What is the regression formula (intercept and slope? significance of each?)
* Calculate confidence interval of the population slope
* Obtain R-squared value and explain its meaning and significance
* Add confidence band line
* Add confidence interval line
|}
|}


===May 25. Final Exam (Comprehensive)===
===May 25. Final Exam (11:10-1:40; Comprehensive)===
 
===May 31. Grades submitted to Registrar Office===
===May 31. Grades submitted to Registrar Office===

Latest revision as of 15:59, 18 May 2017

Analysis of Biological Data (BIOL 20N02, Spring 2017)
Instructor: Dr Weigang Qiu, Associate Professor, Department of Biological Sciences
Room: 1001B HN (North Building, 10th Floor, Mac Computer Lab)
Hours: Thursdays 11:10-1:40
Office Hours: Belfer Research Building (Google Map) BB-402; Tuesdays 5-7 pm or by appointment
Course Website: http://diverge.hunter.cuny.edu/labwiki/Biol20N2_2017

Course Description

With rapid accumulation of genome sequences and digital health data, biomedicine is becoming an information science. This course is a hands-on, computer-based workshop on how to visualize and analyze biological data. The course introduces R, a modern statistical computing language and platform. In the first half, students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data-visualization techniques. In the second half, the course will review & apply statistical hypothesis tests including significance testing of means, association tests, and correlation analysis. Throughout the course, students will apply these methods to the analysis of large biological data sets, such as the human genome, transcriptomes (RNA-SEQ), and human genome variations.

This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.

Learning Goals

  • Be able to use R as a plotting tool to visualize large-scale biological data sets
  • Be able to use R as a statistical tool to summarize data and make biological inferences
  • Be able to use R as a programming language to automate data analysis

Textbooks

  • Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition). Amazon link

Exams & Grading

  • Attendance (or a note in case of absence) is required
  • In-Class Exercises (50 pts).
  • Assignments. All assignments should be handed in as hard copies only. Email submission will not be accepted. Late submissions will receive 10% deduction (of the total grade) per day (~100 pts total).
  • Three Mid-term Exams (3 X 30 pts each = 90 pts)
  • Comprehensive Final Exam (50 pts)
  • Bonus for active participation in classroom discussions

Course Outline

Feb 2. Introduction & R Demo

  1. Course overview
  2. Tutorial 1: R Demo
  • Create a new project by navigating: File | New Project | New Directory. Name it project file "human_genes"
  • Import the human genes data set: File | Import DataSet | CSV, copy & paste this address: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
  • Click "update". Rename the data set if you wish (short but informative names, e.g., hg, or human.genes). Do not use spaces, use dot or underscore as name delimiters (e.g., "human.genes" or "human_genes", but never "human genes") Same rule for column or row names
dim(hg) # show dimension
head(hg) # show top rows
tail(hg) # show bottom rows
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # create a vector of gene lengths (Tab for auto-completion)
hg.len[1] # show first length
hg.len[10] # show the 10th item
hg.len[1:10] # show items 1 through 10
hg.len[c(1,10)] # show items 1 and 10
hg[1,1] # show item in 1st row, 1 column
hg[1,] # show all values for 1st row
hg[,1] # show all values for 1st column
hg[1:10,] # show rows 1 through 10
hg[,1:7] # show columns 1 through 10
hg[1:10, 1:7] # a subset
hist(hg$Length, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
hist(log10(hg$Length), br = 200) # log transformation for non-normally distributed variable
gene.cts <- table(hg$Chromosome) # count number of genes, separated by chromosomes
barplot(gene.cts) # show distribution by a barplot
mean(hg$Length) # not representative, super-long genes carry too much weight to the average length
median(hg$Length) # More representative. Use median for a variable not normally distributed
summary(hg$Length) # Show all quartiles
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # create a new column named "Gene.Length"
boxplot(log10(Length) ~ Chromosome, data = hg)  # show gene length by chromosomes
write.csv(hg, "hg.csv", row.names = FALSE) # save into a file
hg <- read.csv("hg.csv") # read back into R
  • Export a PDF or image
  • Open a new R script, name it as "hg.R"
  • Select commands and save to script
  • Retrieve and edit a command by pressing "up" or "down" arrows
  • Retrieve commands by using the search box on the "History" table
  • Type q() to quit. Answer "y" to save workspace
  • To reload and restore workspace, go to "C:/Users/instructor/Documents/human.genes" and double click on the file "human.gene"
Assignment #1. Due 2/9, Thursday (Finalized)
  1. (2 pts) Install R & R Studio on your own computer:
    1. First, download & install R from this website: https://mirrors.nics.utk.edu/cran/
    2. Next, download & install R studio from this website: https://www.rstudio.com/
  2. (4 pts) Reproduce the "human.gene" project (follow steps in Demo). Save and print the histogram for gene length & the barplot for number of genes on each chromosome. Label x and y axes. Show all commands. [Hint: combine the commands and figures into a single WORD document]
  3. (4 pts) Vector operations.
    1. Create a new vector (named as "gene.length") of gene length
    2. Show commands for extracting the first item, first 10 items, items 20 through 30, the 1st, 2nd, and 5th items
    3. Apply the following functions: range(), min(), max(), mean(), var(). [Hint: use help(var), help(min) for help]

Feb 9. (Class cancelled due to snow storm)

Feb 16. R Data Structure & Variable Types

  • Vector
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponent function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements with indices
x[1]
x[1:3]
x[-2]
x[c(1,3)]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
  • Matrix
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol =4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol =4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
  • Dataframe
class(hg) # show object class
hg[1,] # how first row
hg[,1] # show first column
hg[1:3,] # show rows 1 through 3
hg[,1:3] # show columns 1 through 3
hg$Gene.Name # show column "Gene.Name"
Assignment #2. Due 2/23 (Finalized)
  1. (2 pts) Construct a numeric vector of 10 random numbers from the uniform distribution between 0 and 1 (Hint: use the function runif()). Name the resulting vector as "ran.1". Show length, range, mean, and variance.
  2. (2 pts) Construct a numeric vector of 10 random numbers from a normal distribution with mean of 0 and variance of 1 (Hint: use the function rnorm()). Name the resulting vector as "ran.2". Show length, range, mean, and variance. Compare the variance with the previous one: which is large?
  3. (2 pts) Construct a matrix of 10 rows by combining the previous two vectors using the cbind function. Name the matrix as "mat". Assign row names as "ind1" .. "ind10". Show row values for ind1, column values for ran.2; transpose the matrix and save it as "mat.t".
  4. (2 pts) Construct a character vector of 10 US States (Hint: use the c() function). Name it "us.states". Use full, case-sensitive names and "_" in place of spaces. Show the first and the fifth states with one command.
  5. (2 pts) Understand variable types. (From Whitlock & Schluter) Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug while the other group received standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
    1. List the two variables and state whether each is categorical (if so, whether it is nominal or ordinal) or numerical (if so, whether it is discrete or continuous)
    2. State & explain which variable is the explanatory (i.e., predictive) and which is the response variable.

Feb 23. Data Visualization

  • Vector functions returning indices
# The which() function returns the indices of TRUE elements
hg$Gene.Length <- hg$Gene.End - hg$Gene.Start + 1 # add a length column
hg.long.idx <- which(hg$Gene.Length > 1e6) # returns indices
hg.long <- hg[hg.long.idx,] # genes longer than 1 million bases
hg.mt.idx <- which(hg$Chromosome == "MT")
hg.mt <- hg[hg.mt.idx,] # mitochondrial genes
# The grep() function returns the indices of matching a pattern
p53.idx <- grep("P53", hg$Gene.Name)
hg.p53 <- hg[p53.idx,]
# The order() function returns the indices of sorted elements
idx.sorted <- order(hg$Gene.Length)
hg.sorted <- hg[idx.sorted,]
Assignment #3. Due 3/2 (Finalized)
  1. (1 pt) Load the iris data set by typing data(iris)
  2. (1 pt) Identity a character variable and obtain frequency counts using the "table()" function
  3. (1 pt) Identity a numerical variable and obtain frequency distribution by a histogram. Use customized x-axis label
  4. (2 pts) Make a boxplot of distribution of each numerical variable with respect to species. For example, boxplot(Sepal.Length ~ Species, data=iris). Label axes and title.
  5. (2 pt) Make a strip chart of distribution of a numerical variable with respect to species using the stripchart() function. Customize it with axes labels, open circle symbol (pch = 1), the method of "jitter", and being vertical ("vertical=T").
  6. (1 pt) Make a scatter plot to show relations between two numerical variables. For example, plot(iris$Sepal.Length, iris$Sepal.Width)
  7. (2 pts) Among graduate school applicants to a university department, 512 males were admitted, 313 males were rejected, 89 females were admitted, and 19 females were rejected. Explore if there is gender bias in admission by
    1. Identify the explanatory and response variables, as well as whether the variables are character or numerical
    2. Make a contingency table using the matrix() function. Add labels to columns and rows using colnames() and rownames() functions
    3. Plot the contingency table using grouped bar plot with the "barplot()" function, and the "beside = T" option.
    4. Plot the contingency table using the mosaicplot() function. Based on the plot, explain if there is evidence for gender bias. [Hint: try matrix transposition]

March 2. Exam 1

March 9. Describing data & Standard Error

  • Textbook: Chapters 3 & 4
  • Population and sample
x <- rnorm(1000)
x.sample <- sample(x, size = 100)
n.genes <- nrow(hg) # number of rows
sampled.rows <- sample(1:n.genes, size = 100)
hg.sample <- hg[sampled.rows,] # a random sample of 100 genes
  • Explore variable distributions
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for continuous variable
hist(hg$Gene.Length, br = 100)
hist(log10(hg$Gene.Length), br = 100)
gene.cts <- table(hg$Chromosomes) # distribution for a categorical vector
barplot(gene.cts)
  • In-Class exercise: A study of human gene length
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
mean(hg.len) # not representative, super-long genes carry too much weight to the average length
median(hg.len) # More representative. Use median for a variable not normally distributed
summary(hg.len) # Show all quartiles
IQR(hg.len) # 3rd Quartile - 1st Quartile, the range of majority data points, even for skewed distribution
log.len <- log10(hg.len); hist(log.len, br=200) # Log of gene length is more normally distributed
mean(log.len); median(log.len) # They should be similar, since log.len is normal
# The next block is intend to show that the "mean length" of samples is normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 length
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
mean.sample.100 <- sapply(1:1000, function(x) mean(sample(hg$Gene.Length, size = 100)))
hist(mean.sample.100, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
Assignment #4. Due 3/16. (Finalized)
  1. (2 pts) Repeat the above sampling experiment (sample size N =100). Save results to a vector called "mean.100" (which is a vector of means, with a length of 1000 elements). Show histogram. Show mean. Show standard deviation.
  2. (2 pts) Repeat the above sampling experiment (sample size N =500). Save results to a vector called "mean.500". Show histogram. Show mean. Show standard deviation
  3. (2 pts) Repeat the above sampling experiment (sample size N =5000). Save results to a vector called "mean.5000". Show histogram. Show mean. Show standard deviation
  4. (1 pt) Explain why mean is not a good description of a "typical" human gene length and which statistic is a better?
  5. (0.5 pt) Sort the human genes by length (using the order()) & created a sorted data.frame ("hg.sorted") of the original data.frame ("hg").
  6. (0.5 pt) Obtain human genes longer than 1 million bases by using the which() function & create a new data.frame of long genes ("hg.long")
  7. (2 pts) Combine the three simulated-mean vectors into one (see below). Define "Standard error" & "confidence interval (CI)". Does CI quantify accuracy or precision?
# Combine
sample.combined <- cbind(mean.100, mean.500, mean.5000)
colnames(sample.combined) <- c("samp.100", "samp.500", "samp.5000")
# plot in a single frame
par(mfrow=c(3,1))
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main="sample size 20", xlab = "mean gene length")
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
par(mfrow =c(1,1))

March 16. Hypothesis Testing

  • Chapter 6
  • In-Class exercise 1. The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
  1. Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
  2. Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
  3. Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
  4. Use the 2SE rule of thumb, calculate 95% confidence interval.
  5. Plot standard error & standard deviation
df1 <- data.frame(bm = c(8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7), species = rep("sp1", 16))
df2 <- data.frame(bm = c(16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16), species = rep("sp2", 12))
df3 <- data.frame(bm = c(40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41), species = rep("sp3", 16))
df.bm <- rbind(df1, df2, df3)
boxplot(bm ~ species, data= df.bm)
sd(df1$bm)
mean.1 <-mean(df1$bm)
se.1 <- sd(df1$bm)/sqrt(nrow(df1))
ci.1 <- c(mean(df1$bm) - 2 * se.1, mean(df1$bm) + 2*se.1)
bm.aov <- aov(bm ~ species, data = df.bm)
  • In-Class exercise 2. Hypothesis testing through simulation
# coin-flipping experiments
runif(1) # take a random sample from 0-1, uniformly distributed
rbinom(n = 1, size =100, prob = 0.5) # flipping 100 (size) fair (prob) coin, one (n=1) time
rbinom(n = 1000, size =100, prob = 0.5) # repeat above 1000 times
num.success <- rbinom(n = 1000, size =100, prob = 0.5) # save
barplot(table(num.success)) # distribution of number of successes
length(which(num.success<=40))/1000 # probability of success less than or equal to 40
# test if toads are right-handed: observation 14/18 are right-handed
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
barplot(table(right.handed.toads.by.chance)) # plot
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14
# Binomial test
binom.test(x=14, size=18, p =0.5)
Assignment #5. Due 3/23 (Finalized)
  1. (1 pt) Make a single data frame containing the gene expression values for two genes (Hint: create two dataframes and name two columns as "expression" and "gene", then use rbind() to combine two dataframes into one)
    1. "MED1": 12.38918, 9.084664, 9.48416, 9.363928, 8.194495, 8.694836, 8.771101, 9.998151, 12.66877, 8.684064, 8.944236, 11.8491, 8.40968, 8.990329, 9.782376, 8.58243, 12.00455, 8.580401, 9.161046, 9.047977, 8.672018, 8.811856, 8.354933, 8.763175
    2. "ZBTB42": 8.377784, 7.832712, 8.65289, 4.59474, 5.598869, 4.912963, 5.24125, 7.688584, 7.36693, 4.463853, 5.646581, 6.830076, 4.485883, 6.741698, 6.967342, 5.307032, 6.80991, 7.612475, 5.795508, 5.033554, 5.032286, 4.979937, 8.315718, 5.801263, 7.136532, 4.722164, 5.416593, 4.456056, 6.253954, 5.684245, 8.255962, 8.629676, 8.348159, 8.114049, 6.786746, 7.893434, 7.836647, 4.733391, 6.895385, 7.123281, 4.75207
  2. (1 pt) Make a stripchart of expression values separated by genes.
  3. (1 pt) Obtain the mean, median, standard deviation, standard error, and 95% confidence intervals of expression for each gene
  4. (1 pt) Run two-sampled t test t.test to test if two genes differ significantly in expression levels. Specify what is the null hypothesis and what is the alternative hypothesis
  5. (2 pts) Identify whether each of the following statements is more appropriate as the null (H0) or alternative (H1) hypothesis. State appropriate H0 and H1 for each question.
    1. Most genetic mutations are deleterious
    2. A diet has no effect on liver function
  6. (4 pts) Can parents distinguish their own children by smell alone? In a study, eight of nine mothers identified their children correctly based on the smell of T-shirts children wore for three consecutive nights.
    1. State the null & alternative hypotheses.
    2. Simulate the null distribution using the "rbinom()" function (with n = 1 million). Plot the distribution using "barplot"
    3. Verify your result with the "binom.test()" function.
    4. Conclude by explain the p-value.

March 23. One-sampled & Two-sampled t-test

Assignment #6. Due 3/30 (Finalized)
  • (5 pts) Ostriches live in hot environments. To test if ostriches could reduce brain temperature relative to body temperature, the mean body and brain temperature (in Celsius) of six ostriches was recorded in the following table.
  1. State the null and alternative hypotheses
  2. Make a data frame and visualize with a boxplot combined with a stripchart
  3. Test if there is significant difference between the body and brain temperature. [Hint: Choose the most appropriate t-test among the one-sample, paired, or two-sample t-test]
Ostrich Body Temperature Brain Temperature
1 38.51 39.32
2 38.45 39.21
3 38.27 39.2
4 38.52 38.68
5 38.62 39.09
6 38.18 38.94
  • (5 pts) Could mosquitoes detect odor of what people drink? The following data points are the relative activation time (the smaller the quicker) of mosquitoes flying toward volunteers, divided into beer- and water-drinking groups:
    • Beer group: 0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21
    • Water group: 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28
  1. State the null and alternative hypotheses
  2. Make a data frame and visualize with a boxplot combined with a stripchart
  3. Test normality of data points by showing qqnorm() and qqline() plots (separate for two column)
  4. Test if there is significant difference between the two groups using a t-test.

March 30. Exam 2

April 6. Analyzing Proportions

  • Chapter 7 & 8
Assignment #7. Due 4/19 (Finalized)
  1. (Analysis of frequency data. 5 pts) A Siena College Poll has been conducted between April 6-11, 2016 by telephone calls. Based on a sample of 538 likely Democratic primary voters in New York State (which votes next Tuesday on 4/19/2016), the poll concludes that "Vermont Senator Bernie Sanders has narrowed the lead against former New York Senator Hillary Clinton among likely Democratic voters, however, she still has a double digit lead 52-42 percent".
    1. What are the estimates of proportion of voters supporting Clinton and Sanders, respectively? Name the two variables p.clinton & p.sanders
    2. What are the standard errors (i.e., "margins of error" of the pool) of these two estimates? Use the formula SE[p]=sqrt(p*(1-p)/n)
    3. What are the 95% confidence intervals for these two estimates? Use the approximate formula p-2*SE[p], p+2*SE[p]
    4. Assuming that the number of Sanders supporters is 226 out of 538 in this pool, use the binom.test() function to test the null hypothesis that the proportion of Sanders supporter is p=0.5. Could the null hypothesis be rejected at p value of 0.05?
    5. Would you predict based on this pool that Clinton would win the New York Primary?
  2. (Test of goodness-of-fit. 5 pts) Variation in flower color of a plant species is determined by a single gene, with RR individuals being red, Rr individuals pink, and rr individuals white. The expected color ratio of crossing F1 individuals is 1:2:1.
    1. In one cross experiment, the results were 10 red, 21 pink, and 9 white-flowered offspring. Do these results differ significantly from the expected frequencies (at 5% level of significance)? Create a data frame with observed and expected counts as the two columns. Calculate the chi-square value and degree of freedom. Obtain p value by using the pchisq() function
    2. Validate your above test using the chisq.test(x=c(observed counts), p=c(0.25, 0.5, 0.25)) function. Save test results and show expected counts
    3. In another, larger experiment, 1000 red, 2100 pink, and 900 white flowers were observed. Do these results differ significantly from the expected ratio?
    4. How would you explain the difference between the two results?

April 13. No Class (Spring Break)

April 20. No Class (Monday Schedule)

April 27. Contingency Analysis

Assignment #8. Due 5/4 (Finalized)

The following table shows results of genotype counts in "Taster" and "non-Taster" individuals.

CC CT TT
Taster 44 20 40
Non-Taster 19 10 39
  1. Hardy-Weinberg Analysis
    1. (1 pt) Calculate overall genotype counts (regardless of phenotype)
    2. (1 pt) Test the observed genotype counts against Hardy-Weinberg expectations (consult lecture slides)
    3. (2 pt) Perform a chi-square test using observed counts and expected proportions. Is the result significant? State for each genotype if it is under- or over-presented. What assumptions of Hardy-Weinberg equilibrium are not met, if your test result is significant?
  2. Genotype-phenotype association analysis
    1. (1 pt) State the null & alternative hypotheses
    2. (1 pt) Create a matrix called "taster.genotypes" and display the data as a mosaic plot. Make sure that the explanatory variable is on the X-axis
    3. (2 pts) Test the genotype-phenotype association with a chi-square test, with simulated p value. Conclude if there is significant association between the genotypes and phenotype
    4. (2 pts) Save the above test result as "taste.chisq". Show observed & expected counts and identify over- and under-represented for each genotype/phenotype combination

May 4. Exam 3

  • Review lecture slides (part 3)
  • Review two assignments

May 11. Correlation

Variable Values
Number of visits by non-parent adults 1,7,15,4,11,14,23,14,9,5,4,10,13,13,14,12,13,9,8,18,22,22,23,31
Future aggressive behavior -0.8,-0.92,-0.8,-0.46,-0.47,-0.46,-0.23,-0.16,-0.23,-0.23,-0.16,-0.10,-0.10,0.04,0.13,0.19,0.25,0.23,0.15,0.23,0.31,0.18,0.17,0.39
  • In-class Exercise 2. Inbreeding in wolfs. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Calculate correlation coefficient and test its significance by using the cor.test() function
Variable Values
Inbreeding coefficient between mating pairs 0,0,0.13,0.13,0.13,0.19,0.19,0.19,0.22,0.24,0.24,0.24,0.24,0.24,0.24,0.25,0.27,0.30,0.30,0.30,0.30,0.36,0.37,0.40
Number of pups surviving first winter 6,6,7,5,4,8,7,4,3,3,3,3,2,2,6,3,5,3,2,1,3,2,3
  • In-class Exercise 3. Miracles of memory. (1) Make a data frame with two columns; (2) Make a scatter plot with title and axes labels; (3) Run a rank correlation using cor.test(method="spearman") function
Variable Values
Years elapsed 2,5,5,4,17,17,31,20,22,25,28,29,34,43,44,46,34,28,39,50,50
Impressive score 1,1,1,2,2,2,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5
Assignment #9. Due 5/18 (Finalized)

(Note: Lecture slides has been posted. For help, refer to the slides and R Data & Code on Publisher's Website)

Question 1 (6 pts). In their study of hyena laughter (or "giggle"), Mathevon et al (2010) asked whether sound spectral properties of the giggles are associated with age. Perform correlation analysis using data in the following table:

Age (years) 2,2,2,6,9,10,13,10,14,14,12,7,11,11,14,20
Frequency (Hz) 840,670,580,470,540,660,510,520,500,480,400,650,460,500,580,500
  • Identify explanatory and response variables and their data types.
  • Make a data frame and test normality of both variables
  • Make a scatterplot including axes labels and a main title
  • Calculate correlation coefficient between the two variables and show test results
  • State null and alternative hypotheses of the correlation test
  • Conclude if there is a correlation between the age and "giggle", if it is statistically significant, and whether positive or negative if the relationship is significant.

Question 2(4 pts). A mutation in the SLC11A1 gene in humans causes resistance to tuberculosis. Barnes et al (2011) examined the frequency of the resistant allele in different towns in Europe and Asia and compared it to how long humans had been settled in the site ("duration of settlement"):

Duration of settlement (years) 8010,5260,4735,4010,3710,2810,2730,2310,2110,1955,1910,1300,378,194,130,110,91
Allele frequency 0.99,1.0,0.948,0.885,0.947,0.854,1.0,0.769,0.9956,0.979,0.865,0.922,0.821,0.842,0.734,0.766,0.772
  • Make a data frame and test variable normality.
  • Examine data with a scatterplot (with axes labels).
  • Since the relationship appears curvilinear, use Spearman's rank correlation to test the association
  • Show test results and state your conclusions (if there is a correlation, if it is significant, and whether the relation is positive or negative if significant).

May 18. Regression

  • Chapter 16. R Data & Code on Publisher's Website
  • In-class Exercise 1. Lion noses. Linear regression with regression line & confidence intervals
  • In-class Exercise 2. Prairie Home Companion. Linear regression with test of non-zero slope
  • In-class Exercise 3. Iris pollination. Square-root transformation
  • In-class Exercise 4. Iron and phytoplankton growth. Non-linear regression
  • In-class Exercise 5. Guppy cold death. Logistic regression
  • Teacher's Evaluation
    • Students can complete them now in 3 simple steps:
    • Notes to students:
      • Your responses are completely anonymous, and that instructors can only see results after grades are released.
      • Teacher evaluations serve a number of important functions such as: improving classes by providing instructors feedback of their teaching; and, serving as supporting documentation in a faculty’s reappointment, tenure and promotion.
      • Teacher evaluations also help the student make decisions about what courses and instructors are right for them. Please let the students know that the teacher evaluation results are readily accessible to them at www.hunter.cuny.edu/myprof.
Assignment #10. Due In-Class or 5/25
(10 pts) Using the "Prairie Home Companion" data set to answer the following questions
  • Test the normality of response variable using qqnorm and qqline
  • Make scatter plot with axes labels
  • Log-transform the response variable and make a new scatter plot
  • Perform regression analysis using lm
  • What is the regression formula (intercept and slope? significance of each?)
  • Calculate confidence interval of the population slope
  • Obtain R-squared value and explain its meaning and significance
  • Add confidence band line
  • Add confidence interval line

May 25. Final Exam (11:10-1:40; Comprehensive)

May 31. Grades submitted to Registrar Office