Biol20N02 2016

Analysis of Biological Data (BIOL 20N02, Spring 2016)
Instructor: Dr Weigang Qiu, Associate Professor, Department of Biological Sciences
Room: 1001B HN (North Building, 10th Floor, Mac Computer Lab)
Hours: Tuesdays 10-1
Office Hours: Belfer Research Building (Google Map) BB-402; Wed 5-7 pm or by appointment
Course Website: http://diverge.hunter.cuny.edu/labwiki/Biol20N2_2016

Course Description

With the rapid accumulation of genome sequences and digitized health data, biomedicine is becoming a data-intensive science. This course is a hands-on, computer-based workshop on how to visualize and analyze large quantities of biological data. The course introduces R, a modern statistical computing language and platform. Students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data visualizations. The course will review statistical methods including hypothesis testing, analysis of frequencies, and correlation analysis. Students will apply these methods to the analysis of genomic and health data such as whole-genome gene expression and SNP (single-nucleotide polymorphism) frequencies.

This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.

Learning Goals

Be able to use R as a plotting tool to visualize large-scale biological data sets
Be able to use R as a statistical tool to summarize data and make biological inferences
Be able to use R as a programming language to automate data analysis

Textbooks

Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition). Amazon link
RStudio (Recommended): Learning RStudio for R Statistical Computing
Digital textbook (Recommended): Data Analysis for the Life Sciences

Exams & Grading

Attendance (or a note in case of absence) is required
In-Class Exercises (50 pts).
Assignments (~100 pts total). All assignments should be handed in as hard copies only; email submissions will not be accepted. Late submissions will receive a 10% deduction (of the total grade) per day.
Three Mid-term Exams (3 X 30 pts each = 90 pts)
Comprehensive Final Exam (50 pts)
Bonus for active participation in classroom discussions

Course Outline

Feb 2. Introduction & tutorials for R/R studio

Course overview
Install R & RStudio on your home computers (Chapter 1. pg. 9)
Tutorial 1: First R Session (pg. 12)
1. Create a new project by navigating: File | New Project | New Directory. Name the project "Abalone"
2. Import abalone data set: Tools | Import DataSet | From Web URL, copy & paste this address: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
3. Assign column names: colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_Weight", "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")
4. Save data into a file: write.csv(abalone, "abalone.csv", row.names = FALSE)
5. Create a new R script: File | New | R script. Type the following commands: abalone <- read.csv("abalone.csv"); boxplot(Length ~ Sex, data = abalone)
6. Save as "abalone.R" using File | Save
7. Execute R script: source("abalone.R")
8. Install the notebook package: install.packages("knitr")
9. Compile a Notebook: File | Compile Notebook | HTML | Open in Browser
Tutorial 2. Writing R Scripts (Chapter 2. pg. 21)
Tutorial 3. Vector
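Steps 3 through 7 of Tutorial 1 can be collected into a single script. The following is only a minimal sketch, not the official tutorial solution. To run without a network connection it reads a few stand-in rows typed inline (illustrative values in the same nine-column, header-less layout as the abalone file), and it writes to a temporary folder rather than the project folder; both choices are assumptions made for this sketch.

```r
# Stand-in for the downloaded file: a few illustrative rows in the same
# nine-column, header-less layout (assumed values, not the full data set).
raw <- "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7"
abalone <- read.csv(text = raw, header = FALSE)

# Step 3: assign column names
colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_Weight",
                       "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")

# Step 4: save the data to a CSV file (temporary location for this sketch)
csv.file <- file.path(tempdir(), "abalone.csv")
write.csv(abalone, csv.file, row.names = FALSE)

# Step 5: read the file back and plot shell length by sex
abalone <- read.csv(csv.file)
boxplot(Length ~ Sex, data = abalone)
```

With a network connection, the `read.csv(text = raw, ...)` call could be replaced by `read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)`, which should give the same column layout with the full data.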

Assignment #1. Due 2/16, Tuesday (Finalized)
(2 pts) Install R & RStudio on your own computer.
(4 pts) Reproduce the "abalone" project (follow the steps in Tutorial 1). Save & print your notebook (consisting of commands and a boxplot). If "Compile Notebook" doesn't work, print a copy of your commands and export the boxplot as a 4X6 PDF (print & submit).
(4 pts) Vector operations. Create a new vector object of abalone height: `ht <- abalone$Height`
Show commands for extracting the first item, the first 10 items, items 20 through 30, and the 1st, 2nd, and 5th items.
First, obtain the indices for items less than 0.5 using the which() function. Save them as a new vector called "ht.idx". Then, obtain the actual items by combining the "ht" and "ht.idx" vectors.
Apply the following functions: range(), min(), max(), mean(), var(). [Hint: use help(var), help(min) for help]

Feb 9. No class (Friday Schedule)

Feb 16. Introduction & tutorials for R/R studio

Start a new project called "Session-02-individual"
Tutorial 3: Vector (Continued)

x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponential function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
x <- 1:5 # restore a numeric vector, so that the functions below return numbers rather than NA
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements
x[1]
x[1:3]
x[-2]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
# Logical vectors
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE) # Use no quotes
is.male <- (gender == "male") # obtain a logical vector by a test
age <- c(60, 43, 72, 35, 47)
is.60 <- (age == 60)
less.60 <- (age < 60) # TRUE for ages below 60
is.female <- !is.male # use the logical negate operator (!)
# The which() function returns the indices of TRUE elements
ind.male <- which(is.male)
ind.young <- which(age < 45)
age[ind.young] # obtain ages of young individuals
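A logical vector can be used directly as an index, so which() is optional when only the values are needed. The sketch below repeats the tutorial vectors so that it runs on its own; the names `is.young` and `young.female` are introduced here for illustration only.

```r
age    <- c(60, 43, 72, 35, 47)
gender <- c("male", "female", "female", "male", "female")

is.young  <- age < 45          # logical vector, one TRUE/FALSE per individual
ind.young <- which(is.young)   # integer positions of the TRUE elements

age[is.young]                  # index by logical vector
age[ind.young]                 # index by position; gives the same values

# Combine two tests with the element-wise AND operator (&)
young.female <- (age < 45) & (gender == "female")
age[young.female]              # ages of young females
sum(young.female)              # TRUE counts as 1, so sum() counts the matches
```

Indexing by logical vector is usually the shorter form; which() is more useful when the positions themselves are of interest.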

Tutorial 4: Matrix

BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol = 4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
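The relation between cbind(), rbind(), and t() can be checked directly. This is a small sketch that re-creates the tutorial vectors so it is self-contained; the object names `by.col` and `by.row`, and the use of colMeans(), are additions for illustration.

```r
age <- c(60, 43, 72, 35, 47)
BMI <- c(28, 32, 21, 27, 35)
bp  <- c(124, 145, 127, 133, 140)

by.col <- cbind(age, BMI, bp)   # individuals in rows, variables in columns
by.row <- rbind(age, BMI, bp)   # variables in rows, individuals in columns

dim(by.col)                     # 5 rows, 3 columns
dim(by.row)                     # 3 rows, 5 columns
all(t(by.col) == by.row)        # transposing one layout gives the other

colMeans(by.col)                # mean of each variable (column)

# Rows can also be selected with a logical test on one column
older <- by.col[by.col[, "age"] > 50, ]
older
```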

Assignment #2. Due 2/23, Tuesday (Finalized on 2/18, Thursday 10am)
Show commands and outputs for the following exercises:
(2 pts) Construct a numeric vector of 10 random numbers from the uniform distribution between 0 and 1 (Hint: use the function `runif()`). Name the resulting vector "rand.1". Show its length, range, mean, and variance.
(2 pts) Construct a numeric vector of 10 random numbers from a normal distribution with a mean of 0 and a variance of 1 (Hint: use the function `rnorm()`). Name the resulting vector "rand.2". Show its length, range, mean, and variance. Compare the variance with the previous one: which is larger?
(2 pts) Construct a matrix of 10 rows by combining the previous two vectors using the `cbind()` function. Name the matrix "mat". Assign row names "ind1" .. "ind10". Show the row values for ind1 and the column values for rand.2; transpose the matrix and save it as "mat.t".
(2 pts) Construct a character vector of 10 US states. Name it "us.states". Use full, case-sensitive names and "_" in place of spaces. Show the first and the fifth states with one command.
(1 pt) Construct a logical vector of length 10 (named "less.half") from rand.1 by testing whether the elements are less than 0.5.
(1 pt) Use the `which()` function to find the indices of the elements in rand.1 that are less than 0.5. Show the values.

Feb 23. Statistics & samples

Tutorial 5. Data Frame: a table to store mixed data types

data.df <- data.frame(age, gender, BMI, is.healthy) # combine the vectors from Tutorials 3 & 4 into one table
data.df
class(data.df) # check object type
factor(gender) # categories (called "levels") of a character vector
data.df[3,4] # access row 3, column 4
data.df[, "age"] # a vector of all ages
data.df$age # an alternative way, using the $ notation
data.df$BMI[4]
data.df$gender[2]
# Create a data frame from text files:
# Download and save the file: http://extras.springer.com/2012/978-1-4614-1301-1/BodyTemperature.txt
BodyTemperature <- read.table(file = "BodyTemperature.txt", header = TRUE, sep = " ")
head(BodyTemperature) # show the first 6 lines (the default)
names(BodyTemperature) # show column headings
BodyTemperature[1:3, 2:4] # show a slice of data
BodyTemperature$Age[1:3] # show the first three ages
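Rows of a data frame can be selected with a logical test, in the same way as vector elements. The sketch below rebuilds a small data frame from the tutorial vectors so that it runs on its own; the names `people`, `healthy`, and `older.women` are made up for this example.

```r
age        <- c(60, 43, 72, 35, 47)
gender     <- c("male", "female", "female", "male", "female")
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

people <- data.frame(age, gender, is.healthy)
str(people)                              # column names and data types

# Keep the rows where the test is TRUE; the empty slot after the comma means "all columns"
healthy     <- people[people$is.healthy, ]
older.women <- people[people$gender == "female" & people$age > 45, ]

nrow(healthy)                            # number of healthy individuals
older.women$age                          # ages of women older than 45
```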

Population and sample

x <- rnorm(1000)
x.sample.1 <- sample(x, 100)
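A sample is a subset of the population, and each sample gives a slightly different estimate of the population mean. A minimal sketch of this idea follows; the seed value and the second sample `x.sample.2` are arbitrary additions so that the output is repeatable and two estimates can be compared.

```r
set.seed(1)                       # arbitrary seed, only to make the sketch repeatable

x <- rnorm(1000)                  # the "population": 1000 normally distributed numbers
x.sample.1 <- sample(x, 100)      # a random sample of 100, drawn without replacement
x.sample.2 <- sample(x, 100)      # a second, independent sample

# The sample means are close to, but not equal to, the population mean
c(population = mean(x),
  sample.1   = mean(x.sample.1),
  sample.2   = mean(x.sample.2))

all(x.sample.1 %in% x)            # every sampled value comes from the population
```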

Explore variable distributions

x <- rnorm(1000)
hist(x, breaks = 100) # distribution for all numbers
hist(BodyTemperature[,2], main = "Age frequency distribution", xlab = "Age", ylab = "Counts") # age distribution, with customized title and axis labels
stem(BodyTemperature[,2]) # stem-leaf plot
table(BodyTemperature[,1]) # distribution for a categorical vector

Assignment #3. Due 3/1, Tu (Finalized)
(6 pts) Understand R functions. Each R function takes arguments, which specify inputs (which are required) and options (which are optional). When run successfully, a function returns an output. To use a function properly, it is necessary to identify the data type (i.e., whether numeric or character; whether a vector, a matrix, or a data frame) of the inputs, options, and outputs.
In RStudio, show the help page for the "sample()" function by typing `?sample`.
Identify the data type of the input `x`.
Explain the following options: `size` and `replace`; show their default values (i.e., the values they take when they are not specified).
Identify the expected data type of the output.
Use the function for the following tasks:
Run examples of the function by typing `example(sample)`. Explain the 1st and 2nd examples by listing the input, options, and output.
Create a vector of numbers from 1 to 100; obtain a permuted sample of the same vector (with the same length).
Create a vector of two elements consisting of "Male" and "Female"; obtain a vector of 100 elements randomly drawn from the original vector [Hint: use the `replace = TRUE` argument].
(2 pts) Explore frequency distributions of variables.
Import the BodyTemperature data as a data frame, pick a numerical variable, and plot its frequency distribution with the `hist()` function. Make a customized title (with the "main=" argument).
Pick a character variable and show its frequency distribution with the `table()` function. Plot the data using the `barplot()` function.
(2 pts) Understand variable types. (From Whitlock & Schluter) Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug, while the other group receives standard treatment without the new drug. The researchers compare the rate of insulin release in the two groups.
List the two variables and state whether each is categorical (if so, whether it is nominal or ordinal) or numerical (if so, whether it is discrete or continuous).
State & explain which variable is the explanatory (i.e., predictive) variable and which is the response variable.
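As a warm-up for reading a help page, the arguments of sample() can be listed with args() and tried on a tiny vector. This is a sketch on made-up inputs (the numbers 1 to 5 and a two-sided coin), not the assignment answer; the seed is arbitrary.

```r
args(sample)     # lists the arguments; those with "=" have default values

set.seed(2)      # arbitrary seed, only to make the sketch repeatable

perm <- sample(1:5)             # size not given: returns a permutation of all elements
sub  <- sample(1:5, size = 3)   # three values drawn without replacement (the default)

# Drawing more items than the input holds requires replace = TRUE
coin <- sample(c("H", "T"), size = 10, replace = TRUE)

perm
sub
table(coin)      # counts of heads and tails
```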

March 1. Displaying data

Slides for part 1:

File:Biostat-part-1.pdf

Assignment #4. Due 3/8, Finalized
(1 pt) Load the iris data set by typing `data(iris)`
(1 pt) Identify a character variable and obtain frequency counts using the "table()" function
(1 pt) Identify a numerical variable and show its frequency distribution with a histogram. Use a customized x-axis label
(2 pts) Make a boxplot of the distribution of each numerical variable with respect to species.
(2 pts) Make a strip chart of the distribution of each numerical variable with respect to species. Customize it to be vertical, with open-circle symbols (pch = 1), and using the "jitter" method.
(1 pt) Make a scatter plot to show the relation between two numerical variables
(2 pts) Among graduate school applicants to a university department, 512 males were admitted, 313 males were rejected, 89 females were admitted, and 19 females were rejected. Explore whether there is gender bias in admission by doing the following:
Identify the explanatory and response variables, as well as whether the variables are character or numerical
Make a contingency table using the matrix() function. Add labels to columns and rows using the colnames() and rownames() functions
Plot the contingency table as a grouped bar plot with the "barplot()" function and the "beside = T" option.
Plot the contingency table using the mosaicplot() function. Based on the plot, explain whether there is evidence for gender bias. [Hint: try matrix transposition]
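The contingency-table workflow named in the last exercise can be sketched on a different, purely hypothetical data set. The counts below (a treated group and a control group, each scored as improved or not) are invented for illustration and are not course data.

```r
# Hypothetical counts, for illustration only
counts <- matrix(c(30, 10,
                   20, 40), nrow = 2, byrow = TRUE)
rownames(counts) <- c("treated", "control")        # explanatory variable in rows
colnames(counts) <- c("improved", "not.improved")  # response variable in columns
counts

# Proportions within each row make the two groups directly comparable
row.prop <- prop.table(counts, margin = 1)
row.prop

# Grouped bar plot: barplot() draws one group of bars per column,
# so the table is transposed to get one group per treatment
barplot(t(counts), beside = TRUE, legend.text = TRUE)

# Mosaic plot: tile areas are proportional to the counts
mosaicplot(counts, main = "Hypothetical treatment outcome")
```

Whether to pass the table or its transpose to a plotting function depends on which variable should define the groups, so it is worth trying both.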

March 8. Exam 1 (Open-Book)

March 15. Describing data

In-Class exercise: A study of human gene length
Import human gene data set from http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2

hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, breaks = 200) # plot gene-length distribution. Not normal: most genes are short, a few are very long
mean(hg.len) # not representative: super-long genes carry too much weight in the average length
median(hg.len) # more representative. Use the median for a variable that is not normally distributed
summary(hg.len) # show the minimum, quartiles, mean, and maximum
IQR(hg.len) # 3rd quartile - 1st quartile: the range covering the middle half of the data points, useful even for skewed distributions
log.len <- log10(hg.len); hist(log.len, breaks = 200) # log of gene length is more normally distributed
mean(log.len); median(log.len) # they should be similar, since log.len is approximately normal
# The next block is intended to show that the "mean length" of samples is normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 lengths
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we can study the distribution of the "mean length" (not the "length" itself)
mean.len <- rep(NA, 1000) # prepare an empty vector to store the "mean lengths"
for (i in 1:1000) { # i takes the values from 1 to 1000, one at a time
  samp.len <- sample(hg.len, 100)
  mean.len[i] <- mean(samp.len)
}
hist(mean.len, breaks = 100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
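The same sampling experiment can be written more compactly with replicate(), which repeats an expression and collects the results. The sketch below does not use the human gene data: `fake.len` is a simulated, right-skewed stand-in (log-normal random numbers with arbitrarily chosen parameters) so that the code runs without the data file.

```r
set.seed(3)      # arbitrary seed, only to make the sketch repeatable

# Simulated stand-in for gene lengths: many short values, a few very long ones
fake.len <- rlnorm(20000, meanlog = 10, sdlog = 1)

mean(fake.len)   # pulled upward by the long right tail
median(fake.len) # smaller than the mean for a right-skewed variable

# Take 1000 samples of size 100 and keep the mean of each one
mean.len <- replicate(1000, mean(sample(fake.len, 100)))

# Compare the skewed raw values with the more symmetric sample means
par(mfrow = c(2, 1))
hist(fake.len, breaks = 100, main = "simulated lengths")
hist(mean.len, breaks = 100, main = "means of samples of size 100")
par(mfrow = c(1, 1))

sd(fake.len)     # spread of individual values
sd(mean.len)     # spread of sample means, which is much narrower
```

replicate() avoids pre-allocating a vector and writing the loop counter, at the cost of hiding the loop; the explicit for loop may be easier to follow when first learning.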

Assignment #5. Due 3/22. (Finalized)

(2 pts) Repeat the above sampling experiment (including the "for" loop, sample size N =100). Save results to a vector called "mean.100" (which is a vector of means, with a length of 1000 elements). Show histogram. Show mean. Show standard deviation
(2 pts) Repeat the above sampling experiment (sample size N =20). Save results to a vector called "mean.20". Show histogram. Show mean. Show standard deviation
(2 pts) Repeat the above sampling experiment (sample size N =500). Save results to a vector called "mean.500". Show histogram. Show mean. Show standard deviation
(2 pts) Explain why mean is not a good description of a "typical" human gene length
(2 pts) Watch this Khan Academy video and describe the Central Limit Theorem. Explain what the "sampling distribution of the mean" is.

# Sample size 100
samp.size.100.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 100)
  samp.size.100.means[i] <- mean(samp)
}
hist(samp.size.100.means, br=100)
# Sample size 20
samp.size.20.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 20)
  samp.size.20.means[i] <- mean(samp)
}
hist(samp.size.20.means, br=100)
# Sample size 500
samp.size.500.means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(hg.len, 500)
  samp.size.500.means[i] <- mean(samp)
}
hist(samp.size.500.means, br=100)
# Combine
sample.combined <- cbind(samp.size.20.means, samp.size.100.means, samp.size.500.means)
colnames(sample.combined) <- c("samp.20", "samp.100", "samp.500")
# plot in a single frame
par(mfrow=c(3,1))
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main="sample size 20", xlab = "mean gene length")
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
par(mfrow =c(1,1))

March 22. Sampling & Standard Error of Mean

In-Class exercise 1. Descriptive statistics

Make a vector of the following blood pressure measurements (in mmHg): 112, 128, 108, 129, 125, 153, 155, 132, 137. Calculate sample size, sum, mean, variance, coefficient of variation (CV), and median
Take a sample of 100 human gene lengths. Calculate median, IQR, 1.5*IQR; Make a boxplot
The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
1. Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7, 7, 7, 8, 8, 8, 7, 7
2. Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
3. Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41

In-Class exercise 2. standard error & confidence interval

Blood pressure: What is the standard deviation of the above blood pressure?
What is the sample size? Calculate standard error of the mean.
Use the 2SE rule of thumb, calculate 95% confidence interval.
Plot standard error & standard deviation
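The standard error and the 2SE rule can be sketched with the blood pressure values from exercise 1. The object names (`bp.se`, `ci.95`, and so on) are chosen here for illustration; the 2SE interval is the rule-of-thumb approximation named above, not an exact t-based interval.

```r
bp <- c(112, 128, 108, 129, 125, 153, 155, 132, 137)  # blood pressure (mmHg)

n       <- length(bp)          # sample size
bp.mean <- mean(bp)            # sample mean
bp.sd   <- sd(bp)              # sample standard deviation
bp.cv   <- 100 * bp.sd / bp.mean   # coefficient of variation, in percent

bp.se <- bp.sd / sqrt(n)       # standard error of the mean

# Rule of thumb: the 95% confidence interval is roughly mean +/- 2 SE
ci.95 <- c(lower = bp.mean - 2 * bp.se,
           upper = bp.mean + 2 * bp.se)

bp.mean
bp.se
ci.95
```

The standard deviation describes the spread of individual measurements, while the standard error describes the uncertainty of the mean, which is why the standard error shrinks as the sample size grows.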

Assignment #6. Due 3/29.
A study of expression levels of human genes
(2 pts) Make a single data frame containing the gene expression values for 3 genes (Hint: name the two columns "expression" and "gene")
"MED1": 12.38918, 9.084664, 9.48416, 9.363928, 8.194495, 8.694836, 8.771101, 9.998151, 12.66877, 8.684064, 8.944236, 11.8491, 8.40968, 8.990329, 9.782376, 8.58243, 12.00455, 8.580401, 9.161046, 9.047977, 8.672018, 8.811856, 8.354933, 8.763175
"ZBTB42": 8.377784, 7.832712, 8.65289, 4.59474, 5.598869, 4.912963, 5.24125, 7.688584, 7.36693, 4.463853, 5.646581, 6.830076, 4.485883, 6.741698, 6.967342, 5.307032, 6.80991, 7.612475, 5.795508, 5.033554, 5.032286, 4.979937, 8.315718, 5.801263, 7.136532, 4.722164, 5.416593, 4.456056, 6.253954, 5.684245, 8.255962, 8.629676, 8.348159, 8.114049, 6.786746, 7.893434, 7.836647, 4.733391, 6.895385, 7.123281, 4.75207
"TCF24": 3.427531, 3.383114, 3.574041, 3.449132, 3.784881, 3.686278, 3.545466, 3.624868, 3.377987, 3.256293, 3.381218, 3.79938, 3.419852, 3.292284
(2 pts) Make a boxplot of expression values separated by gene. Explain which gene has the highest level of expression & which has the lowest
(2 pts) Make a histogram of all expression values. Are they normally distributed? Test normality with qqnorm() and qqline() plots
(2 pts) Obtain the mean, median, and standard deviation of expression separately for each gene (Use the tapply() function to get full credit)
(2 pts)
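The tapply() function named in the exercise applies a function to a numeric vector separately for each group. The sketch below uses only the first three values listed for each gene, to show the long "expression"/"gene" layout and the call pattern; it is not the full data set and not the assignment answer.

```r
# First three listed values per gene only (a shortened illustration)
expr.df <- data.frame(
  expression = c(12.38918, 9.084664, 9.48416,    # MED1
                 8.377784, 7.832712, 8.65289,    # ZBTB42
                 3.427531, 3.383114, 3.574041),  # TCF24
  gene = rep(c("MED1", "ZBTB42", "TCF24"), each = 3)
)

# tapply(values, groups, function): one result per group
gene.mean   <- tapply(expr.df$expression, expr.df$gene, mean)
gene.median <- tapply(expr.df$expression, expr.df$gene, median)
gene.sd     <- tapply(expr.df$expression, expr.df$gene, sd)

gene.mean
gene.median
gene.sd
```

The results are named vectors, with the groups in alphabetical order, so a single gene can be picked out by name, for example `gene.median["MED1"]`.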

March 29. Hypothesis Testing via Statistical Expectations

In-Class exercise 3. Hypothesis testing

Identify whether each of the following statements should be a null or alternative hypothesis
1. Watching TV affects how preschoolers behave
2. Most mutations are harmful
3. A diet has no effect on health
4. Smoking increases cancer risk
5. Age is a factor in voting for Sanders

April 5. Exam 2

April 12. Normal distribution and controls

April 19. Comparing two means

April 26. No Class (Spring break)

May 3. Exam 3

May 10. Designing experiments & Comparing more than two groups

May 17. Correlation analysis

May 24. Final Exam (Comprehensive)

May 31. Grades submitted to Registrar Office
