Biol20N02 2016
Revision as of 18:59, 12 March 2016
Course Description
With the rapid accumulation of genome sequences and digitized health data, biomedicine is becoming a data-intensive science. This course is a hands-on, computer-based workshop on how to visualize and analyze large quantities of biological data. The course introduces R, a modern statistical computing language and platform. Students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data visualizations. The course will review statistical methods including hypothesis testing, analysis of frequencies, and correlation analysis. Students will apply these methods to the analysis of genomic and health data such as whole-genome gene expression and SNP (single-nucleotide polymorphism) frequencies.
This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter prerequisites are BIOL100, BIOL102 and STAT113.
Learning Goals
- Be able to use R as a plotting tool to visualize large-scale biological data sets
- Be able to use R as a statistical tool to summarize data and make biological inferences
- Be able to use R as a programming language to automate data analysis
Textbooks
- Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition). Amazon link
- R Studio (Recommended): Learning RStudio for R Statistical Computing
- Digital textbook (Recommended): Data Analysis for the Life Sciences
Exams & Grading
- Attendance (or a note in case of absence) is required
- In-Class Exercises (50 pts).
- Assignments (~100 pts total). All assignments should be handed in as hard copies only. Email submissions will not be accepted. Late submissions will receive a 10% deduction (of the total grade) per day.
- Three Mid-term Exams (3 X 30 pts each = 90 pts)
- Comprehensive Final Exam (50 pts)
- Bonus for active participation in classroom discussions
Course Outline
Feb 2. Introduction & tutorials for R/R studio
- Course overview
- Install R & RStudio on your home computers (Chapter 1. pg. 9)
- Tutorial 1: First R Session (pg. 12)
- Create a new project by navigating: File | New Project | New Directory. Name the project "Abalone"
- Import abalone data set: Tools | Import DataSet | From Web URL, copy & paste this address: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
- Assign column names:
colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_Weight", "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")
- Save data into a file:
write.csv(abalone, "abalone.csv", row.names = FALSE)
- Create a new R script: File | New | R script. Type the following commands:
abalone <- read.csv("abalone.csv"); boxplot(Length ~ Sex, data = abalone)
- Save as "abalone.R" using File | Save
- Execute R script:
source("abalone.R")
- Install the notebook package:
install.packages("knitr")
- Compile a Notebook: File | Compile Notebook | HTML | Open in Browser
- Tutorial 2. Writing R Scripts (Chapter 2. pg. 21)
- Tutorial 3. Vector
Assignment #1. Due 2/16, Tuesday (Finalized)
Feb 9. No class (Friday Schedule)
Feb 16. Introduction & tutorials for R/R studio
- Start a new project called "Session-02-individual"
- Tutorial 3: Vector (Continued)
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponential function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements
x[1]
x[1:3]
x[-2]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
# Logical vectors
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE) # Use no quotes
is.male <- (gender == "male") # obtain a logical vector by a test
age <- c(60, 43, 72, 35, 47)
is.60 <- (age == 60)
less.60 <- (age < 60) # TRUE for individuals younger than 60
is.female <- !is.male # use the logical negate operator (!)
# The which() function returns the indices of TRUE elements
ind.male <- which(is.male)
ind.young <- which(age < 45)
age[ind.young] # obtain ages of young individuals
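The logical vectors above can also be used directly as indices, without going through which(). A minimal sketch, reusing the age and gender values from this tutorial; the names mean.age.female and n.young are illustrative and not part of the course material:

```r
# Values repeated from the tutorial so that this block runs on its own
age <- c(60, 43, 72, 35, 47)
gender <- c("male", "female", "female", "male", "female")

is.male <- (gender == "male")   # logical vector: TRUE where gender is "male"
age[is.male]                    # logical indexing: keeps elements where the index is TRUE
age[!is.male & age < 45]        # combine tests with & ("and"); use | for "or"

# Summaries of a subset
mean.age.female <- mean(age[gender == "female"])  # mean age of females
n.young <- sum(age < 45)        # TRUE counts as 1, so sum() counts the matches
```

Logical indexing and which() select the same elements here; which() is mainly useful when the positions themselves are needed.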
- Tutorial 4: Matrix
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol = 4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol = 4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
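Once the measurements are in a matrix, whole rows or columns can be summarized at once. The following is a sketch only, rebuilding data.1 from the tutorial vectors; the names col.means, col.max and high.bmi are illustrative:

```r
# Vectors repeated from Tutorials 3 and 4 so that this block runs on its own
age <- c(60, 43, 72, 35, 47)
BMI <- c(28, 32, 21, 27, 35)
bp  <- c(124, 145, 127, 133, 140)
data.1 <- cbind(age, BMI, bp)            # individuals in rows, variables in columns

col.means <- colMeans(data.1)            # mean of each column (one per variable)
col.max <- apply(data.1, 2, max)         # apply max() to each column (MARGIN = 2)
high.bmi <- data.1[data.1[, "BMI"] > 30, ]  # keep rows (individuals) with BMI above 30

col.means
col.max
high.bmi
```

Using MARGIN = 1 in apply() would summarize each row (individual) instead of each column.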
Assignment #2. Due 2/23, Tuesday (Finalized on 2/18, Thursday 10am)
Show commands and outputs for the following exercises:
Feb 23. Statistics & samples
- Tutorial 5. Data Frame: a table to store mixed data types
data.df <- data.frame(age, gender, is.healthy, BMI) # combine vectors from Tutorials 3 & 4 (BMI is needed for column 4 and $BMI below)
data.df
class(data.df) # check object type
factor(gender) # categories (called "levels") of a character vector
data.df[3,4] # access row 3, column 4
data.df[, "age"] # a vector of all ages
data.df$age # an alternative way, using the $ notation
data.df$BMI[4]
data.df$gender[2]
# Create a data frame from text files:
# Download and save the file: http://extras.springer.com/2012/978-1-4614-1301-1/BodyTemperature.txt
BodyTemperature <- read.table(file = "BodyTemperature.txt", header = TRUE, sep = " ")
head(BodyTemperature) # show first 6 lines (the default)
names(BodyTemperature) # show column headings
BodyTemperature[1:3, 2:4] # show a slice of data
BodyTemperature$Age[1:3] # show the first three ages
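Rows of a data frame can be selected with the same logical tests used for vectors. A minimal sketch, built from the small tutorial vectors rather than the BodyTemperature file; the names healthy.df, older.women and is.senior are illustrative:

```r
# Vectors repeated from Tutorial 3 so that this block runs on its own
age <- c(60, 43, 72, 35, 47)
gender <- c("male", "female", "female", "male", "female")
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
data.df <- data.frame(age, gender, is.healthy)

str(data.df)                                  # column names and data types
healthy.df <- data.df[data.df$is.healthy, ]   # rows where is.healthy is TRUE, all columns
older.women <- subset(data.df, gender == "female" & age > 45)  # same idea, using subset()
data.df$is.senior <- data.df$age >= 60        # add a new column computed from an existing one
data.df
```

The comma in data.df[rows, ] matters: leaving the column position empty keeps every column for the selected rows.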
- Population and sample
x <- rnorm(1000)
x.sample.1 <- sample(x, 100)
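To see how a sample relates to its population, compare their summary statistics. The sketch below treats the 1000 random numbers as the "population"; the seed value and the second sample x.sample.2 are assumptions added for illustration:

```r
set.seed(1)                     # assumed seed, used only to make the run repeatable
x <- rnorm(1000)                # "population": 1000 values from a standard normal
x.sample.1 <- sample(x, 100)    # a sample of 100 values, drawn without replacement
x.sample.2 <- sample(x, 100)    # a second, independent sample of the same size

mean(x); sd(x)                  # population mean and standard deviation
mean(x.sample.1); sd(x.sample.1)  # sample estimates: close to, but not equal to, the above
mean(x.sample.2); sd(x.sample.2)  # a different sample gives different estimates
```

The two sample means differ from each other and from the population mean; this sample-to-sample variation is what the standard error measures.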
- Explore variable distributions
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for all numbers
hist(BodyTemperature[,2], main = "Age frequency distribution", xlab = "Age", ylab = "Counts") # age distribution, with customized title and axis labels
stem(BodyTemperature[,2]) # stem-leaf plot
table(BodyTemperature[,1]) # distribution for a categorical vector
Assignment #3. Due 3/1, Tuesday (Finalized)
March 1. Displaying data
Slides for part 1:
Assignment #4. Due 3/8 (Finalized)
March 8. Exam 1 (Open-Book)
March 15. Describing data & hypothesis testing
- Import human gene data set from http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv
- Plot length distribution. Is it normal?
- Calculate mean, variance, standard deviation
- Calculate median and IQR. Which is a better description of the "typical length" of a human gene, mean or median?
- Perform a log10 transformation and obtain the same set of statistics. Explain why log transformation is an effective method to describe human gene length distribution
- Sample 100 genes and calculate mean
- Repeat 1000 times and plot distribution of 1000 means
- Fit a normal curve
- Test normality with qqnorm() and qqline()
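The steps above can be sketched as follows. Because the column names in hg.tsv are not given here, the sketch uses simulated, right-skewed values as a stand-in for gene lengths; the vector gene.length, the log-normal parameters, and the seed are all assumptions, not the real data set:

```r
set.seed(42)                                   # assumed seed, for a repeatable run
# Stand-in for gene lengths: log-normal values (NOT the real hg.tsv data)
gene.length <- 10^rnorm(20000, mean = 4, sd = 0.7)

hist(gene.length, breaks = 100)                # strongly right-skewed on the raw scale
mean(gene.length); var(gene.length); sd(gene.length)
median(gene.length); IQR(gene.length)          # median is less affected by the long right tail

log.length <- log10(gene.length)               # log10 transformation
hist(log.length, breaks = 100)                 # much closer to symmetric
mean(log.length); sd(log.length); median(log.length); IQR(log.length)

# Sample 100 genes, calculate the mean, and repeat 1000 times
sample.means <- replicate(1000, mean(sample(gene.length, 100)))

# Distribution of the 1000 means, with a fitted normal curve
hist(sample.means, breaks = 30, freq = FALSE, main = "Means of 100 sampled genes")
curve(dnorm(x, mean = mean(sample.means), sd = sd(sample.means)), add = TRUE)

# Normal quantile-quantile plot: points near the line suggest normality
qqnorm(sample.means)
qqline(sample.means)
```

With the real data, replace the simulated gene.length with the length column read from hg.tsv (for example via read.table() with sep = "\t", assuming the file is tab-delimited as its extension suggests).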
Assignment #5. Due 3/22. To be posted