Biol20N02 2016: Difference between revisions
imported>Weigang |
imported>Weigang |
||
(54 intermediate revisions by the same user not shown) | |||
Line 286: | Line 286: | ||
{| class="wikitable sortable mw-collapsible" | {| class="wikitable sortable mw-collapsible" | ||
|- style="background-color:lightsteelblue;" | |- style="background-color:lightsteelblue;" | ||
! Assignment # | ! Assignment #6. Due 3/29 (Finalized) | ||
|- style="background-color:powderblue;" | |- style="background-color:powderblue;" | ||
| | | | ||
# to | A study of expression levels of human genes | ||
# (2 pts) Make a single data frame containing the gene expression values for 3 genes (Hint: name two columns as "expression" and "gene") | |||
## "MED1": 12.38918, 9.084664, 9.48416, 9.363928, 8.194495, 8.694836, 8.771101, 9.998151, 12.66877, 8.684064, 8.944236, 11.8491, 8.40968, 8.990329, 9.782376, 8.58243, 12.00455, 8.580401, 9.161046, 9.047977, 8.672018, 8.811856, 8.354933, 8.763175 | |||
## "ZBTB42": 8.377784, 7.832712, 8.65289, 4.59474, 5.598869, 4.912963, 5.24125, 7.688584, 7.36693, 4.463853, 5.646581, 6.830076, 4.485883, 6.741698, 6.967342, 5.307032, 6.80991, 7.612475, 5.795508, 5.033554, 5.032286, 4.979937, 8.315718, 5.801263, 7.136532, 4.722164, 5.416593, 4.456056, 6.253954, 5.684245, 8.255962, 8.629676, 8.348159, 8.114049, 6.786746, 7.893434, 7.836647, 4.733391, 6.895385, 7.123281, 4.75207 | |||
## "TCF24": 3.427531, 3.383114, 3.574041, 3.449132, 3.784881, 3.686278, 3.545466, 3.624868, 3.377987, 3.256293, 3.381218, 3.79938, 3.419852, 3.292284 | |||
# (2 pts) Make a boxplot of expression values separated by genes. Explain which gene has the highest level of expression & which has the lowest | |||
# (2 pts) Make a histogram of all expression values. Are they normally distributed? Test normality by qqnorm() and qqline() plots | |||
# (2 pts) Obtain the mean, median, and standard deviation of expression separated by genes (Use the tapply() function to get full credits) | |||
# (2 pts) Explain what is "standard error of mean" and how is it different from standard deviation (Hint: explain sampling]. In taking a sample from a population, what can you do to make the standard error of the mean smaller? | |||
|} | |} | ||
===March 29. Hypothesis Testing | ===March 29. Hypothesis Testing=== | ||
* In-Class exercise 3. Hypothesis testing | * In-Class exercise 3. Hypothesis testing through simulation | ||
# Identify whether each of the following statements | <syntaxhighlight lang=R"> | ||
# coin-flipping experiments | |||
## Most mutations are | runif(1) # take a random sample from 0-1, uniformly distributed | ||
## A diet has no effect to | rbinom(n = 1, size =100, prob = 0.5) # flipping 100 (size) fair (prob) coin, one (n=1) time | ||
## | rbinom(n = 1000, size =100, prob = 0.5) # repeat above 1000 times | ||
## | num.success <- rbinom(n = 1000, size =100, prob = 0.5) # save | ||
barplot(table(num.success)) # distribution of number of successes | |||
length(which(num.success<=40))/1000 # probability of success less than or equal to 40 | |||
# test if toads are right-handed: observation 14/18 are right-handed | |||
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times | |||
barplot(table(right.handed.toads.by.chance)) # plot | |||
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14 | |||
# If the observation is 10/18 | |||
right.handed.toads.by.chance <- rbinom(n = 1e6, size = 18, prob = 0.5) | |||
length(which(right.handed.toads.by.chance <= 8 | right.handed.toads.by.chance>=10))/1e6 | |||
</syntaxhighlight> | |||
* Lecture Slides [[File:Part-2-distribution.pptx|thumbnail]] | |||
{| class="wikitable sortable mw-collapsible" | |||
|- style="background-color:lightsteelblue;" | |||
! Assignment #7. Due 4/5 (Finalized) | |||
|- style="background-color:powderblue;" | |||
| | |||
# (2 pts) Identify whether each of the following statements is more appropriate as the null or alternative hypothesis. State appropriate H0 and H1 for each question. | |||
## Most genetic mutations are deleterious | |||
## A diet has no effect on liver function | |||
# (4 pts) The following table lists results of a study on the impact of tourism to species density. | |||
## Explain what the p-value means | |||
## Which species show a statistically significant reduction in mean density near heavily visited ruins? | |||
## Which species show a statistically significant increase in mean density near heavily visited ruins? | |||
## Which species provide no significant evidence in reduction or in increase? | |||
{| class="wikitable" | |||
|- | |||
! Species !! Mean density near ruins !! Mean density far from ruins !! P-value | |||
|- | |||
| Agouti || 160.2 || 14.5 || 0.03 | |||
|- | |||
| Coatimundi || 99.4 || 1.0 || 0.01 | |||
|- | |||
| Collared precany || 4.6 || 1.8 || 0.79 | |||
|- | |||
| Deppes squirrel || 32.3 || 2.2 || 0.54 | |||
|- | |||
| Howler monkey || 7.3 || 1.9 || 0.03 | |||
|- | |||
| Spider monkey || 170.8 || 15.0 || 0.88 | |||
|- | |||
| Crested guan || 0 || 49.4 || 0.001 | |||
|- | |||
| Great curassow || 10.8 || 72.0 || 0.048 | |||
|- | |||
| Ocellated turkey || 47.0 || 0 || 0.02 | |||
|- | |||
| Tinamos || 0 || 4.9 || 0.049 | |||
|} | |||
* (4 pts) Can parents distinguish their own children by smell alone? In a study, eight of nine mothers identified their children correctly based on the smell of T-shirts children wore for three consecutive nights. | |||
# State the null & alternative hypotheses. | |||
# Simulate the null distribution using the "rbinom()" function (with n = 1 million). Plot the distribution using "barplot" | |||
# Calculate from the null distribution the probability of obtaining 8 or 1 correct out of 9 by chance, which would be the p-value of this experiment. Is the result significant? | |||
# Verify your result with the "binom.test()" function. | |||
|} | |||
===April 5. Exam 2 (Open Book)=== | |||
===April 12. Analyzing Proportions=== | |||
{| class="wikitable sortable mw-collapsible" | |||
|- style="background-color:lightsteelblue;" | |||
! Assignment #8. Due 4/19 (Finalized) | |||
|- style="background-color:powderblue;" | |||
| | |||
# (Analysis of frequency data. 5 pts) A Siena College Poll has been conducted between April 6-11, 2016 by telephone calls. Based on a sample of 538 likely Democratic primary voters in New York State (which votes next Tuesday on 4/19/2016), the poll concludes that "Vermont Senator Bernie Sanders has narrowed the lead against former New York Senator Hillary Clinton among likely Democratic voters, however, she still has a double digit lead 52-42 percent". | |||
## What are the estimates of proportion of voters supporting Clinton and Sanders, respectively? Name the two variables <code>p.clinton</code> & <code>p.sanders</code> | |||
## What are the standard errors (i.e., "margins of error" of the pool) of these two estimates? Use the formula <code>SE[p]=sqrt(p*(1-p)/n)</code> | |||
## What are the 95% confidence intervals for these two estimates? Use the approximate formula <code>p-2*SE[p], p+2*SE[p]</code> | |||
## Assuming that the number of Sanders supporters is 226 out of 538 in this pool, use the <code>binom.test()</code> function to test the null hypothesis that the proportion of Sanders supporter is <code>p=0.5</code>. Could the null hypothesis be rejected at p value of 0.05? | |||
## Would you predict based on this pool that Clinton would win the New York Primary? | |||
# (Test of goodness-of-fit. 5 pts) Variation in flower color of a plant species is determined by a single gene, with ''RR'' individuals being red, ''Rr'' individuals pink, and ''rr'' individuals white. The expected color ratio of crossing F1 individuals is 1:2:1. | |||
## In one cross experiment, the results were 10 red, 21 pink, and 9 white-flowered offspring. Do these results differ significantly from the expected frequencies (at 5% level of significance)? Create a data frame with observed and expected counts as the two columns. Calculate the chi-square value and degree of freedom. Obtain p value by using the <code>pchisq()</code> function | |||
## Validate your above test using the <code>chisq.test(x=c(observed counts), p=c(0.25, 0.5, 0.25))</code> function. Save test results and show expected counts | |||
## In another, larger experiment, 1000 red, 2100 pink, and 900 white flowers were observed. Do these results differ significantly from the expected ratio? | |||
## How would you explain the difference between the two results? | |||
|} | |||
===April 19. Contingency Analysis=== | |||
Lecture Slides: [[File:Part-3-frequency.pptx|thumbnail]] | |||
{| class="wikitable sortable mw-collapsible" | |||
|- style="background-color:lightsteelblue;" | |||
! Assignment #9. Due 5/3 (Finalized) | |||
|- style="background-color:powderblue;" | |||
| | |||
The following table shows results of genotype counts in "Taster" and "non-Taster" individuals. | |||
<center> | |||
{| class="wikitable" | |||
|- | |||
! !! CC !! CT !! TT | |||
|- | |||
| Taster || 79 || 82 || 9 | |||
|- | |||
| Non-Taster || 4 || 6 || 35 | |||
|} | |||
</center> | |||
# (1 pt) State the explanatory and response variables | |||
# (1 pt) State the null & alternative hypotheses | |||
# (2 pts) Create a matrix called "taster.genotypes" and display the data as a mosaic plot. Make sure that the explanatory variable is on the ''X''-axis | |||
# (1 pt) Test the genotype-phenotype association with a chi-square test, with simulated p value. Conclude if there is significant association | |||
# (1 pt) Save the above test result as "taste.chisq". Show observed & expected counts | |||
# (1 pt) Identify over- and under-represented genotypes | |||
# (3 pts) Test if there is association between the ''alleles'' (i.e., "C" or "T") and the Taster/non-Taster phenotypes | |||
## Calculate allele counts in each phenotypic group | |||
## State null and alternative hypotheses; Create a matrix called "taster.alleles". Make a mosaic plot | |||
## Perform chi-square test, draw your conclusions, and identify which allele ("C" or "T") is associated with the "Taster" phenotype | |||
|} | |||
===April 26. No Class (Spring break)=== | ===April 26. No Class (Spring break)=== | ||
===May 3. Exam 3=== | ===May 3. Exam 3 (Open Book)=== | ||
===May 10. | * Review lecture slides (part 3) | ||
=== | * Review 2 previous exams | ||
===May 10. One-sample t-test=== | |||
* Motivating example: weights of ticks | |||
# Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/tick.tsv | |||
# How to visualize data? What kinds of plots to make? | |||
# Is there sexual dimorphism? | |||
* No homework assignment (to be combined with the next one) | |||
===May 17. Paired & Two sample t-tests=== | |||
* Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/ahus.csv | |||
* PLEASE FILL IN TEACHER EVALUATIONS: [http://www.hunter.cuny.edu/te '''Teacher's evaluation'''] | |||
* All lecture slides | |||
** Part 1. [[File:Biostat-part-1.pdf|thumbnail]] | |||
** Part 2. [[File:Part-2-distribution.pptx|thumbnail]] | |||
** Part 3. [[File:Part-3-frequency.pptx|thumbnail]] | |||
** Part 4. [[File:Part-4-t-tests.pptx|thumbnail]] | |||
{| class="wikitable sortable mw-collapsible" | |||
|- style="background-color:lightsteelblue;" | |||
! Assignment #10. Due 5/24 (Finalized) | |||
|- style="background-color:powderblue;" | |||
| | |||
* (5 pts) Ostriches live in hot environments. To test if ostriches could reduce brain temperature relative to body temperature, the mean body and brain temperature (in Celsius) of six ostriches was recorded in the following table. | |||
# State the null and alternative hypotheses <code>H0: there is no difference in body and brain temperature; H1: the brain temperature is different from the body temperature</code> | |||
# Make a data frame and visualize with a boxplot combined with a stripchart <code>x <- data.frame(id = 1:6, body = c(), brain = c())</code> | |||
# Test normality of data points by showing qqnorm() and qqline() plots (separate for two column) <code>qqnorm(x$body); qqline(x$body)</code> | |||
# Test if there is significant difference between the body and brain temperature <code>t.test(x$body, x$brain, paired = T)</code> | |||
{| class="wikitable" | |||
|- | |||
! Ostrich !! Body Temperature !! Brain Temperature | |||
|- | |||
| 1 || 38.51 || 39.32 | |||
|- | |||
| 2 || 38.45 || 39.21 | |||
|- | |||
| 3 || 38.27 || 39.2 | |||
|- | |||
| 4 || 38.52 || 38.68 | |||
|- | |||
| 5 || 38.62 || 39.09 | |||
|- | |||
| 6 || 38.18 || 38.94 | |||
|} | |||
* (5 pts) Could mosquitoes detect odor of what people drink? The following data points are the relative activation time (the smaller the quicker) of mosquitoes flying toward volunteers, divided into beer- and water-drinking groups: | |||
** Beer group: 0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21 | |||
** Water group: 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28 | |||
# State the null and alternative hypotheses <code>H0: there is no difference in activation time between the two groups; H1: reaction time differs between the two groups</code> | |||
# Make a data frame and visualize with a boxplot combined with a stripchart<code>x <- data.frame(time = c(0.36, 0.46, 0.06, 0.18, 0.25, 0.18, -0.06, -0.14, 0.12, 0.39, 0.17, -0.16, -0.05, 0.19, 0.25, 0.31, 0.17, -0.03, 0.23, -0.03, 0.26, 0.30, 0.11, 0.13, 0.21, 0.04, 0, -0.08, -0.12, 0.201, -0.039, 0.10, 0.041, 0.02, 0.236, 0.05, 0.097, 0.122, -0.019, 0.021, -0.08, -0.165, -0.28), group = c(rep("beer", 25), rep("water",18)))</code> | |||
# Test normality of data points by showing qqnorm() and qqline() plots (separate for two column) <code>qqnorm(x$time[1:25]); qqline(x$time[1:25])</code> | |||
# Test if there is significant difference between the two groups<code>t.test(time ~ group, data = x)</code> | |||
|} | |||
===May 24. Final Exam (Comprehensive)=== | ===May 24. Final Exam (Comprehensive)=== | ||
===May 31. Grades submitted to Registrar Office=== | ===May 31. Grades submitted to Registrar Office=== |
Latest revision as of 12:12, 24 May 2016
Course Description
With rapid accumulation of genome sequences and digitalized health data, biomedicine is becoming a data-intensive science. This course is a hands-on, computer-based workshop on how to visualize and analyze large quantities of biological data. The course introduces R, a modern statistical computing language and platform. Students will learn to use R to make scatter plots, bar plots, box plots, and other commonly used data-visualization techniques. The course will review statistical methods including hypothesis testing, analysis of frequencies, and correlation analysis. Student will apply these methods to the analysis of genomic and health data such as whole-genome gene expressions and SNP (single-nucleotide polymorphism) frequencies.
This 3-credit experimental course fulfills elective requirements for Biology Major I. Hunter pre-requisites are BIOL100, BIOL102 and STAT113.
Learning Goals
- Be able to use R as a plotting tool to visualize large-scale biological data sets
- Be able to use R as a statistical tool to summarize data and make biological inferences
- Be able to use R as a programming language to automate data analysis
Textbooks
- Whitlock & Schluter (2015). Analysis of Biological Data. (2nd edition). Amazon link
- R Studio (Recommended): Learning RStudio for R Statistical Computing
- Digital textbook (Recommended): Data Analysis for the Life Sciences
Exams & Grading
- Attendance (or a note in case of absence) is required
- In-Class Exercises (50 pts).
- Assignments. All assignments should be handed in as hard copies only. Email submission will not be accepted. Late submissions will receive 10% deduction (of the total grade) per day (~100 pts total).
- Three Mid-term Exams (3 X 30 pts each = 90 pts)
- Comprehensive Final Exam (50 pts)
- Bonus for active participation in classroom discussions
Course Outline
Feb 2. Introduction & tutorials for R/R studio
- Course overview
- Install R & RStudio on your home computers (Chapter 1. pg. 9)
- Tutorial 1: First R Session (pg. 12)
- Create a new project by navigating: File | New Project | New Directory. Name it project file "Abalone"
- Import abalone data set: Tools | Import DataSet | From Web URL, copy & paste this address: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
- Assign column names:
colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_Weight", "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")
- Save data into a file:
write.csv(abalone, "abalone.csv", row.names = FALSE)
- Create a new R script: File | New | R script. Type the following commands:
abalone <- read.csv("abalone.csv"); boxplot(Length ~ Sex, data = abalone)
- Save as "abalone.R" using File | Save
- Execute R script:
source("abalone.R")
- Install the notebook package:
install.packages("knitr")
- Compile a Notebook: File | Compile Notebook | HTML | Open in Browser
- Tutorial 2. Writing R Scripts (Chapter 2. pg. 21)
- Tutorial 3. Vector
Assignment #1. Due 2/16, Tuesday (Finalized) |
---|
|
Feb 9. No class (Friday Schedule)
Feb 16. Introduction & tutorials for R/R studio
- Start a new project called "Session-02-individual"
- Tutorial 3: Vector (Continued)
x <- c(1,2,3,4,5) # construct a vector using the c() function
x # show x
2 * x + 1 # arithmetic operations, applied to each element
exp(x) # exponent function (base e)
x <- 1:5 # alternative way to construct a vector, if consecutive
x <- seq(from = -1, to = 14, by = 2) # use the seq() function to create a numeric series
x <- rep(5, times = 10) # use the rep() function to create a vector of same element
x <- rep(NA, times = 10) # pre-define a vector with unknown elements; Use no quotes
# Apply vector functions
length(x)
sum(x)
mean(x)
range(x)
# Access vector elements
x[1]
x[1:3]
x[-2]
# Character vectors
gender <- c("male", "female", "female", "male", "female")
gender[3]
# Logical vectors
is.healthy <- c(TRUE, TRUE, FALSE, TRUE, FALSE) # Use no quotes
is.male <- (gender == "male") # obtain a logic vector by a test
age <- c(60, 43, 72, 35, 47)
is.60 <- (age == 60)
less.60 <- (age <= 43)
is.female <- !is.male # use the logical negate operator (!)
# The which() function returns the indices of TRUE elements
ind.male <- which(is.male)
ind.young <- which(age < 45)
age[ind.young] # obtain ages of young individuals
- Tutorial 4: Matrix
BMI <- c(28, 32, 21, 27, 35) # a vector of body-mass index
bp <- c(124, 145, 127, 133, 140) # a vector of blood pressure
data.1 <- cbind(age, BMI, bp) # create a matrix using column bind function cbind(), individuals in rows
data.1
data.2 <- rbind(age, BMI, bp) # create a matrix using row bind function rbind()
t(data.1) # transpose a matrix: columns to rows & rows to columns
dim(data.1) # dimension of the matrix
colnames(data.1)
rownames(data.1) <- c("subject1", "subject2", "subject3", "subject4", "subject5")
data.1
data.1[3,1] # access the element in row 3, column 1
data.1[2,] # access all elements in row 2
data.1[,2] # access all elements in column 2
matrix(data = 1:12, nrow = 3, ncol =4) # create a matrix with three rows and four columns; filled by column
matrix(data = 1:12, nrow = 3, ncol =4, byrow = TRUE) # filled by row
mat <- matrix(data = NA, nrow = 2, ncol = 3) # create an empty matrix
mat[1,3] <- 5 # assign a value to a matrix element
Assignment #2. Due 2/23, Tuesday (Finalized on 2/18, Thursday 10am) |
---|
Show commands and outputs for the following exercises:
|
Feb 23. Statistics & samples
- Tutorial 5. Data Frame: a table to store mixed data types
data.df <- data.frame(age, gender, is.healthy)
data.df
class(data.df) # check object type
factor(gender) # categories (called "levels") of a character vector
data.df[3,4] # access row 3, column 4
data.df[, "age"] # a vector of all ages
data.df$age # an alternative way, using the $ notation
data.df$BMI[4]
data.df$gender[2]
# Create a data frame from text files:
# Download and save the file: http://extras.springer.com/2012/978-1-4614-1301-1/BodyTemperature.txt
BodyTemperature <- read.table(file = "BodyTemperature.txt", header = TRUE, sep = " ")
head(BodyTemperature) # show first 10 lines
names(BodyTemperature) # show column headings
BodyTemperature[1:3, 2:4] # show a slice of data
BodyTemperature$Age[1:3] # show 1-3 ages
- Population and sample
x <- rnorm(1000)
x.sample.1 <- sample(x, 100)
- Explore variable distributions
x <- rnorm(1000)
hist(x, breaks = 100) # distribution for all nubmers
hist(BodyTemperature[,2], main = "Age frequency distribution", xlab = "Age", ylab = "Counts") # age distribution, with customized title and axis labels
stem(BodyTemperature[,2]) # stem-leaf plot
table(BodyTemperature[,1]) # distribution for a categorical vector
Assignment #3. Due 3/1, Tu (Finalized) |
---|
|
March 1. Displaying data
Slides for part 1:
Assignment #4. Due 3/8, Finalized |
---|
|
March 8. Exam 1 (Open-Book)
March 15. Describing data
- In-Class exercise: A study of human gene length
- Import human gene data set from http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/hg.tsv2
hg.len <- hg$Gene.End - hg$Gene.Start + 1 # calculate gene length
hist(hg.len, br = 200) # plot gene-length distribution. Not normal: mostly genes are short, few very long
mean(hg.len) # not representative, super-long genes carry too much weight to the average length
median(hg.len) # More representative. Use median for a variable not normally distributed
summary(hg.len) # Show all quartiles
IQR(hg.len) # 3rd Quartile - 1st Quartile, the range of majority data points, even for skewed distribution
log.len <- log10(hg.len); hist(log.len, br=200) # Log of gene length is more normally distributed
mean(log.len); median(log.len) # They should be similar, since log.len is normal
# The next block is intend to show that the "mean length" of samples is normally distributed, although the length itself is not
samp.len <- sample(hg.len, 100) # take a random sample of 100 length
mean(samp.len) # a sample mean
# Repeat the above 1000 times, so we could study the distribution of "mean length" (not "length" itself)
mean.len <- rep(NA, 1000) # prepare an empty vector to store the "mean lengths"
for (i in 1:1000) { # i takes the value from 1 to 1000, one at a time
samp.len <- sample(hg.len, 100);
mean.len[i] <- mean(samp.len);
}
hist(mean.len, br=100) # you should see a more normally distributed histogram
# The above exercise is a demonstration of the "Central Limit Theorem"
Assignment #5. Due 3/22. (Finalized) |
---|
# Sample size 100
samp.size.100.means <- rep(NA, 1000)
for (i in 1:1000) {
samp <- sample(hg.len, 100)
samp.size.100.means[i] <- mean(samp)
}
hist(samp.size.100.means, br=100)
# Sample size 20
samp.size.20.means <- rep(NA, 1000)
for (i in 1:1000) {
samp <- sample(hg.len, 20)
samp.size.20.means[i] <- mean(samp)
}
hist(samp.size.20.means, br=100)
# Sample size 500
samp.size.500.means <- rep(NA, 1000)
for (i in 1:1000) {
samp <- sample(hg.len, 500)
samp.size.500.means[i] <- mean(samp)
}
hist(samp.size.500.means, br=100)
# Combine
sample.combined <- cbind(samp.size.20.means, samp.size.100.means, samp.size.500.means)
colnames(sample.combined) <- c("samp.20", "samp.100", "samp.500")
# plot in a single frame
par(mfrow=c(3,1))
hist(sample.combined[,1], br=100, xlim=c(1e4, 2e5), main="sample size 20", xlab = "mean gene length")
hist(sample.combined[,2], br=100, xlim=c(1e4, 2e5), main = "sample size 100", xlab = "mean gene length")
hist(sample.combined[,3], br=100, xlim=c(1e4, 2e5), main = "sample size 500", xlab = "mean gene length")
par(mfrow =c(1,1)) |
March 22. Sampling & Standard Error of Mean
- In-Class exercise 1. Descriptive statistics
- Make a vector of the following blood pressure measurements (in mmHg): 112, 128, 108, 129, 125, 153, 155, 132, 137. Calculate sample size, sum, mean, variance, coefficient of variation (CV), and median
- Take a sample of 100 human gene lengths. Calculate median, IQR, 1.5*IQR; Make a boxplot
- The following are measurements of body mass (in grams) of three species of finches in Africa, calculate mean, standard deviation, and CV of each species. Make a boxplot and a strip chart separated by species
- Species 1: 8, 8, 8, 8, 8, 8, 8, 6, 7 ,7, 7, 8, 8, 8, 7, 7
- Species 2: 16, 16, 16, 12, 16, 15, 15, 17, 15, 16, 15, 16
- Species 3: 40, 43, 37, 38, 43, 33, 35, 37, 36, 42, 36, 36, 39, 37, 34, 41
- In-Class exercise 2. standard error & confidence interval
- Blood pressure: What is the standard deviation of the above blood pressure?
- What is the sample size? Calculate standard error of the mean.
- Use the 2SE rule of thumb, calculate 95% confidence interval.
- Plot standard error & standard deviation
Assignment #6. Due 3/29 (Finalized) |
---|
A study of expression levels of human genes
|
March 29. Hypothesis Testing
- In-Class exercise 3. Hypothesis testing through simulation
# coin-flipping experiments
runif(1) # take a random sample from 0-1, uniformly distributed
rbinom(n = 1, size =100, prob = 0.5) # flipping 100 (size) fair (prob) coin, one (n=1) time
rbinom(n = 1000, size =100, prob = 0.5) # repeat above 1000 times
num.success <- rbinom(n = 1000, size =100, prob = 0.5) # save
barplot(table(num.success)) # distribution of number of successes
length(which(num.success<=40))/1000 # probability of success less than or equal to 40
# test if toads are right-handed: observation 14/18 are right-handed
right.handed.toads.by.chance <- rbinom(n = 1000, size = 18, prob = 0.5) # null distribution, 1000 times
barplot(table(right.handed.toads.by.chance)) # plot
length(which(right.handed.toads.by.chance >= 14))/1000 # probability of getting a value equal or higher than 14
# If the observation is 10/18
right.handed.toads.by.chance <- rbinom(n = 1e6, size = 18, prob = 0.5)
length(which(right.handed.toads.by.chance <= 8 | right.handed.toads.by.chance>=10))/1e6
- Lecture Slides
Assignment #7. Due 4/5 (Finalized) | ||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
April 5. Exam 2 (Open Book)
April 12. Analyzing Proportions
Assignment #8. Due 4/19 (Finalized) |
---|
|
April 19. Contingency Analysis
Lecture Slides:
Assignment #9. Due 5/3 (Finalized) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
The following table shows results of genotype counts in "Taster" and "non-Taster" individuals.
|
April 26. No Class (Spring break)
May 3. Exam 3 (Open Book)
- Review lecture slides (part 3)
- Review 2 previous exams
May 10. One-sample t-test
- Motivating example: weights of ticks
- Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/tick.tsv
- How to visualize data? What kinds of plots to make?
- Is there sexual dimorphism?
- No homework assignment (to be combined with the next one)
May 17. Paired & Two sample t-tests
- Import data set: http://diverge.hunter.cuny.edu/~weigang/data-sets-for-biostat/ahus.csv
- PLEASE FILL IN TEACHER EVALUATIONS: Teacher's evaluation
- All lecture slides
- Part 1.
- Part 2.
- Part 3.
- Part 4.
Assignment #10. Due 5/24 (Finalized) | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|