Monte Carlo Club

From QiuLab
Jump to navigation Jump to search

Season VII. Spring & Summer 2024

Summer Projects & Readings

Spring schedule

  • Week 2 (Feb 1)
    • Presentation: Wayne (HIV tropism). Data: gp120 seqs. Code: from Arezoo
    • Presentation: Aura & Michelle (Single-cell sequencing on endometrial carcinoma
    • Presentation: Susmita & Lily (Borrelia antigens). Data: variability analysis of chosen antigens
  • Week 1 (Jan 25)
    • Presentation: Esther (GBS by GEMMA)
    • Project update: Borrelia transcriptomics visualization & Shiny App
    • Presentation: Liann (Borrelia replication & polyploidy)
    • Project update: Arezoo (dummy data set II, phylogenetic random)
    • An R package to read Prism data: The pzfx package

Projects & Readings

Season VI. Language Models, Automata, and Evolutionary Games (Spring, Summer & Fall 2023)

Week 2. Lab meeting (June 8, 2023)

Week 1. Summer kickoff at Rockefeller University (June 1, 2023)

Projects & Papers

Season V. Genes, Memes, and Machines (Spring, Summer & fall 2022)

  • A journal club to continue the exploration of the link between evolution & learning.

Week 1

Week 2

Week 3

  • Code submissions by Niemah
Presidents Shuffle bar plot.png
p <- read_tsv("Presidents.txt", col_names = FALSE )
colnames(p) <- c("order", "name", "year.elected")
true.order <- p$name
output <- vector("double", length = 10000)
for(i in 1:10000) {
  random.order <- sample(p$name)
  output[[i]] <- sum(true.order == random.order)
x <- table(output)
df <- tibble(run = 1:1e4, matches = output)
df %>% ggplot(aes(x=matches)) + geom_bar(fill = "lightgreen") + theme_minimal()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# read file
file = r'/content/Presidents.txt'
dp = pd.read_table('Presidents.txt', header=None, names=['order', 'name', 'year.elected'])
names_list = list(

# simulate by 10000 permutations
output = [] # initialize an empty vector to store the match numbers
for x in range(10000):
  names_permuted = np.random.permutation(
  num_match = sum(names_list == names_permuted)

# tabulate counts
table_match = {} # initialize a dict to store counts
for num in output:
  if num in table_match:
    table_match[num] += 1
  else: # initialize if the number is first seen
    table_match[num] = 0

# plot
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = []
y = []
for match, cts in table_match.items():
ax.set_title("Number of President Matches")

Week 4

Entropy plots (Winston & Anh) Fitness plots

Week 6

Week 7

Fitness landscapes of binary strings
three distributions of fitness of mutations (by Anh & Winston)

Reading list


Season IV. Classification using Machine Learning (Summer 2021)

  • Textbook: Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition. Amazon link
  • Set up Python work environment with mini-conda
  • We will be using Jupyter-notebook (part of mini-codon installation) to share codes

Week 1. MNIST dataset (Chapter 3)

  1. Dataset loading and display (pg 85-87): Jackie, Niemah, and Hannah
  2. Binary classifier
    1. Cross-validation (pg 89-90): Roman, etc
    2. Confusion matrix (pg 90-92):
    3. Precision and recall (pg 92-97)
    4. ROC curve (pg. 97-100)
  3. Multiclass classification (pg 100-108): Brian, etc

Week 2. K-means clustering (Chapter 9)

  1. 2D simulated dataset
  2. Image recognition
  3. MNIST dataset

Week 3. Exercises (pg.275-276, Chapter 9)

  1. Exercise 10: Facial recognition (Olivetti faces dataset, with k-means)
  2. Exercise 11. Facial recognition (semi-supervised learning

Notes on Origin of Life (Spring 2020)

  1. Attie et al (2019). Genetic Code optimized as a traveling salesman problem
  2. GC Origin of Life Seminar (2/19/2021)
  3. Pressman et al (2015). Review: RNA World (ribozyme as origin of genetic code)
  4. A perspective: Miikkulainen, R., Forrest, S. A biological perspective on evolutionary computation. Nat Mach Intell 3, 9–15 (2021)

Season III. Summer 2018 (Theme: Evolutionary Computing)

Week 1. Introduction & Motivating Examples

  1. Genetic Arts
    1. Go to picbreeker website & evolve using "branch" method
    2. Evolve 3D art: Endless Forms
    3. CPPN-NEAT Algorithm: Compositional Pattern Producing Networks (CPPNs)-NeuroEvolution of Augmenting Topologies (NEAT); PicBreeder paper; or a PDF version
  2. NeuroEvolution (by DeepMind team)
  3. Robotic snake (and other soft robots)
    1. A controller built with physics laws difficult (too many parameters)
    2. Simulation with evolutionary computing:
      1. (Genotype) A list of 13 commands (one for each segment; each being a neural net, with 25 inputs and one output of joint angles)
      2. (Phenotype) Fitness function: total displacement
      3. Algorithm: MAP-Elites; Nature paper
      4. Approach: training with simulated data
  4. Robotic Knightfish
    1. Youtube demo
    2. Algorithm: CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
      1. Genotype: 15 variables (sinusoidal wave function or Fourier Series)
      2. Phenotype/Fitness: speed
    3. Implementation: DEAP
  5. Compositional Protein design (CPD)
    1. Genotype: side-chain configuration determined by rotamer library
    2. Phenotype/Fitness: Rosetta energy function & functional (e.g., binding affinity)
    3. Fitness landscape: each node is a protein structure, each edge represent connections/relatedness
  6. Simulation-based optimization: Problem Sets
  7. Self-evolving software(Genetic Programming)

Week 2. Toy Problem: OneMax Optimization

  1. Problem: Create a list of L random bits (0 or 1). Evolve the list until the fitness reaches the maximum value (i.e., contains only 1's)
  2. Neutral Evolution:
    1. Create a vector of L random bits (e.g, L=20). Hint for creating a random vector of 0's and 1's in R: ind<-sample(c(0,1), prob=c(, replace=T, size=20). Biologically, this vector represents a single haploid genome with 20 loci, each with two possible alleles (0 or 1).
    2. Create a population of N=100 such individuals. Hint: creating a list of vectors in R: pop <- lapply(1:100, function(x){<insert sample() function above>})
    3. For each generation, each individual reproduces 10 gametes with mutation (with the probability for bit flip: mu = 1/L = 0.1). Hint: write a mutation function that takes an individual as input and outputs a mutated gamete. Use the "broken stick" algorithm to implement mutation rate: cutoff <- runif(1); ifelse(cutoff <= mu, flip-bits, no-flip)
    4. Take a random sample of N=100 gametes into the next generation & repeat above
    5. Plot mean, minimum, and maximum fitness over generations
    6. Plot diversity statistics over generation (including average allelic heterozygosity per locus as well as haplotype heterozygosity). Hint: write two functions for these heterozygosity
  3. Add natural selection (proportional scheme)
    1. Individuals reproduce with fitness proportional to the total number of 1's.
    2. Iterate until a population contains at least one individual with all 1's
    3. Plot mean, minimum, and maximum fitness over generations
    4. Plot diversity (expectation: decreasing)
  4. Add natural selection (tournament scheme)
    1. Randomly selecting n=5 individuals and allow the fittest one to make gametes
    2. Plot mean, minimum, and maximum fitness over generations (Expectation: faster optimization)
    3. Plot diversity (expectation: decreasing fasters)
  5. Add crossover
    1. Hint: write a crossover function
    2. Does it reach optimization faster?
  6. Code submissions
    1. Python Notebook by John
    2. R code by Weigang
    3. R code by Panlasigui
    4. Python Notebook by Yinheng
    5. RPub by Desiree

Week 3. Python & R Packages for evolutionary computation

Week 4. Multiplex Problem

Week 5. Genetic Programming

Season II. Summer 2017 (Theme: Machine Learning)

Week 1. Introduction & the backprop algorithm

  1. Problem: Classification/Clustering/Predication of flower species (a total of 3 possible species in the sample data set) based on four phenotypic traits/measurements
  2. The "iris" data set: exploratory data analysis with visualization & descriptive statistics: data("iris"); plot(), summary(), qqnorm(); qqline(); hist(); normalization with scale() (Roy)
  3. Mathematics of backpropagating errors to neural connections (Oliver)
    1. Objective/optimization function measuring difference between target (t, expected) and neural activity y: G=Sum{t*log(y)+(1-t)log(1-y)} (this is known as the "cross-entropy" error function; the other alternative is "minimal squared error (MSE)"), which has the gradient in the simple form of g=-(t-y)x, where x is the input. The objective function is minimized when weights are updated by the gradient. Error-minimization by MSE has similar effects but harder to calculate.
    2. Learning algorithm is presented

Week 2. Traditional approaches to multivariate clustering/classification

  1. Dimension reduction with Multidimensional Scaling cmdscale() Mei's rNoteBook.
  2. Dimension reduction with Principal Component Analysisprincomp(). Sipa's rNoteBook
  3. Multivariate clustering with Hierarchical Clustering hclust() (Saymon)
  4. Multivariate clustering with k-means kmeans() Roy's rNoteBook
  5. Classification based on logistic regression glm(), linear discriminatory analysis lda(), & k-nearest neighbor library(class); knn() (John)
  6. Modern, non-linear classifiers: Decision Trees (DT), Support Vector Machines (SVM), and Artificial Neural Networks (ANN) (Brian)

Week 3. Single-neuron classifier

  1. Algorithm:
    1. Read input data with N=150 flowers. Reduce to a matrix x with two traits (use trait 1 & 3) and two species (use rows 51-150, the last two species, skip the first species [easy to separate]) for simplicity. Create a target vector t <- c(rep(0,50),rep(1,50)) indicating two species
    2. Initialize the neuron with two random weights w<-runif(2, 1e-3, 1e-2) and one random bias b<-runif(1)
    3. Neuron activation with k=2 connection weights: a=sum(x[k] * w[k])
    4. Neuron activity/output: y=1/(1+exp(-a-b)) (This logistic function ensures output values are between zero and one)
    5. Learning rules: learning rate eta=0.1, backpropagate error ("e") to get two updated weights (for individual i, feature k): e[i]=t[i]-y[i]; g[k,i]= -e[i] * x[k,i]; g.bias[i] = -e[i] for bias; Batch update weights & bias: w[k]=w[k] - eta * sum(g[k,i]); b = b - eta * sum(g.bias[i]) (same rule for the bias parameter b); Repeat/update for L=1000 epochs (or generations)
    6. Output: weights and errors for each epoch
  2. Evaluation:
    1. Plot changes of weights over epoch
    2. Use the last weights and bias to predict species
    3. Compare prediction with target to get accuracy
    4. Plot scatter plot (x2 vs x1) and add a line using the weights & bias at epoch=1,50,100,200,500, 1000: a=w1*x1 + w2*x2 + b; a=0. The lines should show increasing separation of the two species.
  3. Code submissions
    1. Roy: rPubs notebook
    2. John: Python code on github
    3. Mei: rPubs notebook
    4. Brian: rPubs notebook
    5. Weigang: rPubs notebook
  4. Questions for future exploration
    1. How to avoid over-fitting by regularization (a way to penalize model complexity by adding a weight decay parameter alpha)
    2. Bayesian confidence interval of predictions (with Monte Carlo simulation)
    3. Limitations: equivalent to PCA (linear combination of features); adding a hidden layer generalize the neural net to be a non-linear classifier

Week 4. Single-layer, multiple-neuron learner

  1. Predict all three species (with "one-hot" coding) using all four features: e.g., 100 for species 1, 010 for species 2, and 001 for species 3. Create 3 neurons, each one outputting one digit.
  2. Use three neurons, each accepts 4 inputs and output 1 activity
  3. Use softmax to normalize the final three output activities
  4. Code submissions
    1. John:
      1. reference this webpage
      2. Python code
      3. TensorFlow code
    2. Roy: rpubs notebook
    3. Mei: rPubs Notebook
    4. Weigang: rPubs notebook
  5. Challenge: Implementation with TensorFlow, an Open Source python machine learning library by Google Deep Mind team. Follow this example of softmax regression
  6. A biological application for the summer: identify SNPs associated with biofilm/swarming behavior in Pseudomonas

Week 5. Multi-layer ("Deep") Neuron Network

An investigation of under-fitting (at N=1) & over-fitting (at N=3 nodes). N=0 implies linear fitting.
  1. Input layer: 4 nodes (one for each feature)
  2. Output layer: 3 nodes (one for each species)
  3. Hidden layer: 2 hidden nodes
  4. Advantage: allows non-linear classification; using hidden layers ("convoluted neural net") is able to capture high-order patterns
  5. Implementation: I'm not going to hard-code the algorithm from scratch. The figure was produced using the R package with the following code: library(nnet); library(neuralnet); targets.nn <- class.ind(c(rep("setosa",50), rep("versicolor",50), rep("virginica",50))) # 1-of-N encoding; <- neuralnet(formula = setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = training.iris, hidden = 2, threshold = 0.01, linear.output = T); plot(, rep="best"). It achieved 98.0% accuracy.
  6. Deep neural net allows non-linear, better fitting, but we don't want over-fitting by adding more hidden layers (or more neurons in the hidden layer). Identify under- and over-fitting with the following procedure:
    1. Randomly sample 100 as training set and the remaining as target. Repeat 100 times
    2. For each sample, plot accuracy for the training set, as well as accuracy for the target
    3. Find the point with the right balance of under- and over-fitting

Week 6. Unsupervised Neural Learner: Hopfield Networks

  1. Hebbian model of memory formation (MacKay, Chapter 42). Preparatory work:
    1. Construct four memories, each for a letter ("D", "J", "C", and "M") using a 5-by-5 grid, with "-1" indicating blank space, and "1" indicating a pixel. Flatten the 5-by-5 matrix to a one-dimensional 1-by-25 vector (tensor).
    2. Write two functions, one to show a letter show.letter(letter.vector), which draws a pixel art of letters (print a blank if if -1, a "x" if 1), another to mutate the letter mutate(letter.vector, number.pixel.flips)
  2. Hopfield Network
    1. Store the four memories into a weight matrix, which consists of symmetric weights between neurons i and neuron j, i.e., w[i,j] = w[j,i]. We will use a total of 25 neurons, one for each pixel.
    2. First, combine the four vectors (one for each letter) into a matrix: x <- matrix(c(letter.d, letter.j, letter.c, letter.m), nrow = 4, byrow = T)
    3. Second, calculate weight matrix by obtaining the outer product of a matrix multiplication: w <- t(x) %*% x. Set diagonal values to be all 0's (i.e., to remove all self-connections): for (i in 1:25) { w[i,i] = 0 }. This implements Hebb learning, which translates correlations into strength of connections quantified by weights: large positive weights indicate mutual stimulation (e.g., 1 * 1 = 1 [to wire/strengthen the connection of co-firing neurons, and ...], -1 * -1 = 1 [to wire/strengthen the connection of co-inhibitory neurons as well]), large negative weights indicate mutual inhibition (e.g., 1 * -1 = -1 [to unwire/disconnect oppositely activated neurons]), and small weights indicate a weak connection (e.g., 0 * 1 = 0 [to weaken connections between neurons with uncorrelated activities, but do not unwire them]). (0, 1, and -1 being values of neuron activities)
    4. Third, implement the learning rule by updating activity for each neuron, sequentially (asynchronously): a[i] <- sum(w[i,j] * x[j])
    5. Iterate the previous step 1-5 times, your code should be able to (magically) restore the correct letter image even when the letter is mutated by 1-5 mutations in pixels. This exercise simulates the error-correction ability of memory (e.g., self-correcting encoding in CDs, a neon-light sign missing a stroke, or our ability to read/understand/reconstruct texts with typos, e.g., you have no problem reading/understanding this sentence: "It deosn’t mttaer in what order the ltteers in a word are, the olny iprmoatnt thing is taht the frist and lsat ltteer be in the rghit pclae.").
    6. Expected capacity of a Hopfield Network: number_of_memories = 0.138 * number_of_neurons. In our case, 25 (neurons) *0.138 = 3.45 memorized letters. So the network/memory fails if we squeeze in one additional letter (try it!).
  3. Code submissions
    1. Roy: rPubs Notebook
    2. John: Python Code on Github
    3. Weigang: rPubs Notebook
  4. Potential biological applications: genetic code optimization; gene family identification

Week 7. Biological Applications

A conceptual map for choosing ML algorithms:

Conceptual map from SciKit-Learn website
  • The Python SciKit Learn framework may be tool of choice (instead of R). See these nice examples
  • Bayesian Network (using e.g., R Package bnlearn) to identify cell-signaling networks. Sache et al (2005). Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data
  • Identify proteins associated with learning in mice (Classification & Clustering)UCI Mice Protein Expression Dataset
  • Predict HIV-1 protease cleavage sites (Classification): UCI HIV-1 Protease Dataset
  • Lab project 1. Identify SNPs in 50 c-di-GMP pathway genes that are associated with c-di-GMP levels, swarming ability, and biofilm ability in 30 clinical isolates of Pseudomonas aeruginosa. Approach: supervised neural network (with regularization)
  • Lab project 2. Identify genetic changes contributing to antibiotic resistance in 3 cancer patients. Approach: whole-genome sequencing followed by statistical analysis
  • Lab project 3. Simulated evolution of genetic code. Approach: Multinomial optimization with unsupervised neural network

Summer Project 1. Systems evolution of biofilm/swarming pathway (with Dr Joao Xavier of MSKCC)

Fig.1 .Simulated CDG-correlated SNPs. t-test results: (1) cor=0.8 (strong), t=-5.1142, df=95.596, p=1.619e-06; (2) cor=0.5 (medium), t=-4.7543, df=85.796, p=7.953e-06; (3) cor=0.2 (weak), t=-0.94585, df=79.28, p= 0.3471.
Fig.2. Simulated tree-based SNPs. Generated with the APE function: replicate(10, rTraitDisc(tr, states = c(0,1), rate = 100, model = "ER"))
  1. Acknowledgement: NSF Award 1517002
  2. Explanatory variable (genotypes): Whole-genome data (of ~30 clinical isolates & many experimentally evolved strains) as independent variables
  3. Explanatory variable (genotypes): ~50 genes related to cyclic-di-GMP synthesis and regulation (~8000 SNPs, ~1700 unique, ~5-8 major groups)
  4. Response variables (phenotypes): biofilm size, swarming size, antibiotic sensitivity profile, metabolomics measurements, c-di-GMP levels
  5. Sub-project 1: Database updates (Usmaan, Edgar, Christopher; continuing the work of Rayees and Raymond in previous years)
    1. c-di-GMP pathway SNPs in the new table "cdg_snp"
    2. c-di-GMP levels in "phenotype" table
    3. capture fig orthologs (in progress)
    4. capture matebolomics data with a new table (in process)
      1. a heatmap of deviation from mean (among strains of the same metabolite)
      2. a volcano plot (p values based on t-test from group mean among strains of the same metabolite)
  6. Sub-project 2. Supervised learning for predicting genes, SNPs associated with phenotypes
    1. Advantages over traditional regression analysis: ability to discover non-linear correction structure
    2. Challenges:
      1. over-fitting (we have only ~30 observations while thousands of SNPs, "curse of dimensionality")
      2. the "effective" sample size is further discounted/reduced by phylogenetic relatedness among the strains.
    3. Single neuron, two-levels; by SNP groups (using hclust()); by PCA (linear combination of SNPs); or by linkage groups (looking for homoplasy) (Mei), or by t-SNE (as suggested by Rayees)
    4. Three neurons, three-levels; by gene (Roy)july-14_nnet_cidigmp.R
    5. To Do: (1) predict continuous-value targets; (2) add hidden layers; (3) correct for phylogenetic auto-correlation
  7. Sub-project 3. Generate simulated data (with Choleski Decomposition) (Weigang; continue the work in previous year by Ishmere, Zwar, & Rayees)

Fig.1. Simulated SNPs associated with cdg levels (w/o tree) inspired by this algorithm

# simulate SNP-cdg correlation
r <- 0.2 # desired correlation coefficient
sigma <- matrix(c(1,r,r,1), ncol=2) # var-covariance matrix
s <- chol(sigma) # choleski decomposition
n <- 100 # number of random deviates (data points)
z <- s %*% matrix(rnorm(n*2), nrow=2) # 100 correlated normally distributed deviates with cor(x,y)=r
u <- pnorm(z) # get probabilities for each deviates
snp.states <- qbinom(u[1,], 1, 0.5) # discretize the 1st vector of probabilities into 0/1 with Bernoulli trial
idx.0 <- which(snp.states == 0); # indices for "0"
idx.1 <- which(snp.states == 1); # indices for "1"
# boxplots with stripcharts
boxplot(u[2,] ~ snp.states, main="cor=0.2", xlab="SNP states", ylab="CDG level")
stripchart(u[2,] ~ snp.states, vertical=T, pch=1, method="jitter", col=2,  add=T)

Fig.2. Simulated SNPs associated with strain phylogeny

# Simulate SNPs on a tree
tr <- read.tree("cdg-tree-mid.dnd") # mid-point rooted tree
X <- replicate(10, rTraitDisc(tr, states = c(0,1), rate = 100, model = "ER"))
id <- read.table("cdg.strains.txt3", row.names = 1, sep="\t") <- plot(tr, no.margin = T, x.lim = 0.03, show.tip.label = F)
text(rep(0.013,30), 1:30, id[tr$tip.label,1], pos = 4, cex=0.75)
text(rep(0.02,30), 1:30, snps[tr$tip.label], pos=4, cex=0.75)
abline(h=1:30, col="gray", lty=2)

To Do/Challenges: (1) simulate multiple SNPs (linear correlation); (2) simulate epistasis (non-linear correlation); (3) simulate phylogenetic correlation

  • Simulation & xgboost code contributed by Mei & Yinheng (January 2018)
#To create a simulated matrix w/ correlated xy variables; NOTE: ONLY 1 correlated x variable
get1Simulated <- function(corco, nstrains, snps){
  r <- corco/10 # desired correlation coefficient
  sigma <- matrix(c(1,r,r,1), ncol=2) # var-covariance matrix
  s <- chol(sigma) #cholesky decomposition
  n <- nstrains
  z <- s %*% matrix(rnorm(n*2), nrow=2)
  u <- pnorm(z[2,])
  snp.states <- qbinom(u, 1, 0.5)

  known <- t(rbind(z[1,],snp.states))
  rand <- pnorm(matrix(rnorm(nstrains*(snps-1)),nrow = nstrains))
  x <- cbind(known, snp.rand)

  colnames(x)[1] <- "target"
  colnames(x)[-1] <- paste("feature", seq_len(ncol(x)-1), sep = ".")


#Use this function to artificially create 2 x variables that have relationship w/ Y.
getSimulated.chol <- function(snp1, snp2, nstrains, snps){
  n <- nstrains

  btwn <- snp1 * snp2 #this is a suggested value for inter-SNP correlation. the covariance matrix has to be semi-positive infinite in order to carry out the decomposition

  covar.m <- matrix(c(1.0, snp1, snp2, snp1, 1.0, btwn, snp2, btwn, 1.0), nrow = 3) #cor-matrix

  s <- chol(covar.m)
  z <- s %*% matrix(rnorm(n*3), nrow = 3)

  #decretize the 2 variables
  u <- t(pnorm(z[2:nrow(z),]))
  decretize.u <-qbinom(u,1,0.5)

  #generate some random discrete variables and combine w/ the 2 correlated variables
  rand <- matrix(rnorm(nstrains*(snps-2)),nrow = nstrains)
  rand.u <- pnorm(rand)
  snp.rand <- qbinom(rand.u,1,0.5)
  x <- cbind(z[1,], decretize.u, snp.rand)

  dimnames(x) <- list(c(), c("target", paste("feature", seq_len(ncol(x)-1), sep = ".")))


#This utilizes xgboost function w/o cross-validating the strains and w/ tuned parameters (we think it's the optimal set) tested by Yinheng's python [ getBestParameters] function. Please make sure to have xgboost installed. 
runXg <- function(x){

  bst <- xgboost(data = x[,2:ncol(x)], label = x[,1], max.depth = 2, eta = .05, gamma = 0.3, nthread = 2, nround = 10, verbose = 0, eval_metric = "rmse")

  importance_matrix <- xgb.importance(model = bst, feature_names = colnames(x[,2:ncol(x)]))


#Same as above but w/ cross validation. <- function(x){

  ind <- createDataPartition(x[,1], p = 2/3, list = FALSE )

  trainDf <- as.matrix(x[ind,])
  testDf <- as.matrix(x[-ind,])

  dtrain <- xgb.DMatrix(data = trainDf[,2:ncol(trainDf)], label = trainDf[,1])
  dtest <- xgb.DMatrix(data = testDf[,2:ncol(testDf)], label = testDf[,1])

  watchlist <- list(train=dtrain, test=dtest)

  bst <- xgb.train(data=dtrain, nthread = 2, nround=10, watchlist=watchlist, eval.metric = "rmse", verbose = 0)

  importance_matrix <- xgb.importance(model = bst, feature_names = colnames(x[,2:ncol(x)]))



#This function returns a master table with different iterated values that finds the best number of predictor variables (x).
getTable <- function(x, importance_matrix, cor){
  # v.names <- c("feature", "SNPs", "strains", "cor_given", "cor_actual", "gain", "cover", "rank")
  v.names <- c("feature", "SNPs", "strains", "cor_given", "cor_actual", "p.value", "gain", "cover", "rank")
  for (i in 1:length(v.names)){
    assign(v.names[i], numeric())
  featNum <- NULL

  n <- 0
  new_cor <- rep(cor, length(x)/length(cor))
  for (num in 1:length(x)) {
    featNum <- NULL
    imp_matrix <- importance_matrix[[num]]
    x.iter <- x[[num]]
    x.pv <- getPV(x[[num]])

    for (i in 1:2){
      featNum <- c(featNum, grep(paste("feature.",i,"$", sep = ""), imp_matrix$Feature, perl = TRUE))

    n <-  n + 1

    for(i in 1:2){
      f <- featNum[i]
      feature<-c(feature, imp_matrix$Feature[f])
      SNPs <-c(SNPs, 100)
      strains <- c(strains, nrow(x.iter))
      cor_given <- c(cor_given, new_cor[n])
      cor_actual <-c(cor_actual, cor(x.iter[,1],x.iter[,grep(paste(imp_matrix$Feature[f], "$", sep = ""), colnames(x.iter), perl = TRUE)]))
      p.value <- c(p.value, x.pv[,imp_matrix$Feature[f]])
      gain <- c(gain, imp_matrix$Gain[f])
      cover <- c(cover, imp_matrix$Cover[f])
      rank <- c(rank, f)

  x.master <- data.frame(feature, SNPs, strains, cor_given, cor_actual, p.value, gain, cover, rank)


Summer Project 2. Whole-genome variants associated with multidrug resistance in clinical Pseudomonas & E.coli isolates (with Dr Yi-Wei Tang of MSKCC)

  1. Acknowledgement: Hunter CTBR Pilot Award
  2. Stage 1: Select patients & strains for whole-genome sequencing by MiSeq, based on drug sensitivities (16 isolates from 5 patients were isolated, tested, and selected by April, 2017)
  3. Stage 2: Genome sequencing (FASTQ files generated, 3 replicates for each isolates; done by June 2017)
  4. Stage 3: Variant call
    1. Reference strains identified using Kraken (Roy)
    2. VCF generation using cortex_var (Michele and Hanna, led by John)
  5. Stage 4: Variant annotation
  6. Stage 5: Statistical analysis
  7. Stage 6: Web report

Summer Project 3. Origin of Genetic Code

Tools for testing SGC & evolved codes

  • Shuffle code:
    -f <sgc|code-file> (required)
    -s <1|2|3|4|5> (shuffle by 1st, 2nd, 3rd, aa blocks, and all random)
    -p(olarity; default)
# Input: 'sgc' (built-in) or an evolved code file consisting of 64 rows of "codon"-"position"
# Output: a code file consisting of 64 rows of "codon" - "aa" (to be fed into ./
# Usage examples:
./ -f 'sgc' -s 5 # randomly permute SGC
./ -f 'evolved-code.txt' -p # evolved code, AA assigned according to polarity (default)
./ -f 'evolved-code.txt' -s 5 -h # evolved code, AA assigned according to hydrophobicity, shuffled randomly
  • Code statistics
        [-h]    (help)
        [-f 'sgc (default)|code_file']  (code)
        [-s 'pair (default)|fit|path']  (stats)
        [-p 'pol|hydro|vol|iso', default 'grantham']    (aa prop)
        [-i (ti/tv, default 5)]
        [-b: begin codon ('TTT') -q: panelty for 1st (50); -w: panelty for 2nd (100)]   (options for path)
# Input: 'sgc' (built-in) or a code file consisting of 64 rows of "codon" - "aa"
# Output: codon or code statistics
# Usage examples:
./ # all defaults: print single-mutation codon pairs, grantham distance, for SGC
./ -s 'fit' -p 'pol' # print code fitness according to polarity, for SGC (default)
./ -s 'path' # print tour length for SGC
  1. Participants: Oliver, John, Brian
  2. A representation mimicking TSP (see figure at right)
  3. Code to calculate total (mutational, ti/tv) path of SGC?
  4. Code to calculate energy of SGC?
  5. Randomize amino acid assignment and obtain distributions of path & energy
  6. Code to minimize total path length and energy?
  7. How to evolve a robust SGC? mutation bias + polarity/hydropathy + usage

Season I. Spring 2017 (Themes: Dueling Idiots/Digital Dice/Wright Fisher Process)

"Coalescence" (Backward simulation of Wright-Fisher process) (Due May 12, 2017)


We will conclude Spring 2017 season with the coalescence simulation (in summer, we will start experimenting with simulation of systems evolution)

Previously, we simulated Wright-Fisher process of genetic drift (constant pop size, no selection) starting from a founder population and end with the present population. This is not efficient because the program has to track each individual in each generation, although the majority of them do not contribute to the present population.

Coalescence simulation takes the opposite approach of starting from the present sample of k individuals and trace backward in time to their most recent common ancestor (MRCA). Due to the nature of Poisson process, the waiting time from one coalescence event (at time T) to the next one (at time T-1) is exponentially distributed with a mean of (k choose 2) generations. This process is iterated until the last coalescent event.

The classic text on coalescence is Richard Hudson's chapter, which includes (in Appendix) C codes for simulating tree and mutations. Hudson is also the author of ms, the widely used coalescence simulator.

Like the ms 10 1 -T command, your code should output a tree of 10 individuals. [A lot tougher than I thought; can't figure it out in R; so I did it in Perl]. Similarly, you could use R, where the ape package has a function called rcoal() that generate a random coalescence tree. Try this command: plot(rcoal(10)) (first load library by running library(ape)).

Submitted Codes
  • By Weigang (First draft)
#!/usr/bin/env perl
# Basic coalescence
# First, simulate coalescent events with recursion
# Second, use Bio::Tree to export a newick tree
use strict;
use warnings;
use Data::Dumper;
use Algorithm::Numerical::Sample  qw /sample/;
use Math::Random qw(random_exponential);
use Bio::Tree::Tree;
use Bio::Tree::Node;

# Initialize
die "Usage: $0 <num-of-samples>\n" unless @ARGV == 1;
my $nsamp = shift @ARGV;
my @samples;
for (my $i=1; $i<=$nsamp; $i++) { push @samples, {id=>$i, parent=>undef, br=>0} }
my $ctr = $nsamp;
my $time = 0;
my @all_nodes;

# Simulate events with exponential time intervals between two successive events
&coal(\@samples, \$ctr, \$time);
#print Dumper(\@all_nodes);
# Reconstitute into tree using Bio::Tree
my %seen_node;
my $max_id=0;
my @nodes;
foreach (@all_nodes) {
    $max_id = ($_->{parent} > $max_id) ? $_->{parent} : $max_id;
    if ($seen_node{$_->{id}}) { # child exist, previously as a parent, no branch length
        $seen_node{$_->{id}}->branch_length(sprintf "%.6f", $_->{br}); # add branch length
        if ($seen_node{$_->{parent}}) { # parent exist
        } else { # parent new
            my $pa = Bio::Tree::Node->new(-id=>$_->{parent});
            $seen_node{$_->{parent}} = $pa;
    } else { # child new
        my $new = Bio::Tree::Node->new(-id=>$_->{id}, -branch_length => sprintf "%.6f", $_->{br});
        $seen_node{$_->{id}} = $new;
        if ($seen_node{$_->{parent}}) { # parent exist
        } else { # parent new
            my $pa = Bio::Tree::Node->new(-id=>$_->{parent});
            $seen_node{$_->{parent}} = $pa;

#my $root = $seen_node{$max_id};
my $tree=Bio::Tree::Tree->new(-id=>'coal-sim', -node=>$seen_node{$max_id}, -nodelete=>1);
print $tree->as_text("newick"), "\n";


sub coal {
    my $ref_nodes = shift;
    my $ref_ct = shift;
    my $ref_time = shift;
    my $ct = $$ref_ct;
    my @current_nodes = @$ref_nodes;
    my $k = scalar @current_nodes;
    my @new_nodes;
    return unless $k > 1;
    my @pair = sample(-set => $ref_nodes, -sample_size => 2);
#    print $ct, "\t", $$ref_time, "\t", $pair[0]->{id}, "\t", $pair[1]->{id}, "\n";
    $$ref_time += random_exponential(1, 2/$k/($k-1));
    my $new_nd = {id=>$ct+1, parent=>undef, br=>$$ref_time};
    map {$_->{parent} = $ct+1} @pair;
    map {$_->{br} = $$ref_time - $_->{br}} @pair;
    push @all_nodes, $_ for @pair;
    foreach (@current_nodes) {
        push @new_nodes, $_ unless $_->{id} == $pair[0]->{id} || $_->{id} == $pair[1]->{id};
    push @new_nodes, $new_nd;
    &coal(\@new_nodes, $ref_ct, $ref_time);
  • By John
from Bio import Phylo
from io import StringIO
from random import sample
from scipy.misc import comb
from itertools import combinations
from numpy.random import exponential as exp

# Generate Random Tree with Nodes & Waiting Time
individuals = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
node_dict = {}
reference = {}
wait_time_ref = {}
node = 0
while len(individuals) != 1:
    node_dict[node] = {"child_nodes": []}
    sample_events = tuple(sample(individuals, 2))
    reference[sample_events] = node
    wait_time = exp(1/comb(len(individuals), 2))
    wait_time_ref[node] = wait_time
    for sample_event in sample_events:
        if type(sample_event) == str:
            child_node = reference[sample_event]
    node += 1
# Calculate Tree Branch Lengths
tree_dict = {}
cumulative_br_length = 0
cumulative_br_len_dict = {}
for i in range(len(node_dict)):
    tree_dict[i] = {}
    cumulative_br_length += wait_time_ref[i]
    cumulative_br_len_dict[i] = cumulative_br_length
    for node in node_dict[i]['child_nodes']:
        if type(node) == str:
            tree_dict[i][node] = cumulative_br_length
            tree_dict[i][node] = cumulative_br_len_dict[i] - cumulative_br_len_dict[node]
# Parse the Tree into a String
for i in range(len(tree_dict)):
    for node in tree_dict[i].keys():
        if type(node) != str:
            temp = str(tree_dict[node])
            tree_dict[i][temp] = tree_dict[i].pop(node)
tree_str = str(tree_dict[i]).replace('\'', '').replace('\"', '').replace('\\', '').replace('{', '(').replace('}', ')').replace(' ', '')
tree_str += ";"            

# Visualize Tree
handle = StringIO(tree_str)
tree =, 'newick')
tree.ladderize()   # Flip branches so deeper clades are displayed at top

April 17, 2017 "Mutation Meltdown" (Due May 5, 2017)


Our last exploration showed that population will reach a steady-state level of DNA sequence polymorphism under the opposing forces of genetic drift and mutations. The steady-state level is expected to be ~ N * mu: the larger the population size and the higher the mutation rate, the higher per-site DNA polymorphism. (Also note that, given a long enough sequence and a low enough mutation rate, we don't expect to see two or more mutations hitting a single position. Each mutation is essentially a new one and only two-state SNPs are expected in a sample of DNA sequences).

However, the steady-state expectation is based on the assumption that all mutations are neutral (i.e., no beneficial or harmful fitness effects). In reality, mutations are predominantly either neutral or harmful and few are beneficial. Natural selection (negative or positive, except those on the immune-defense loci) tends to drive down genetic variation.

Sex to the rescue. Without sex or recombination, the fate of genetic variations across a genome are bundled together, rising or sinking in unison with the fate of a single beneficial or harmful mutation. With recombination, genetic variations at different loci become less tightly linked and are more likely maintained, speeding up adaptation.

This week, we will use simulation to recreate the so-called "Muller's Ratchet", which predicts that asexual populations are evolutionary dead ends. It shows that the combined effect of deleterious mutation and genetic drift will lead to population extinction due to steady accumulation of deleterious mutations, a phenomenon called "mutation meltdown".

The simulation protocol would be similar to that of the last problem involving only mutation and drift, but I suggest you rewrite from scratch to be more efficient by using a simpler data structure (no sequence or bases needed). The main difference is that, instead of calculating average pairwise sequence differences in each generation, you will track the frequency of mutation-free sequences (n0) and find the time until it goes to zero. That is when the Ratchet makes a "click". The next click is when the population losses the one-mutation sequences (n1), and so on. Each click drives the population fitness down by one mutation-level. [ Numerically, if each mutation causes a fitness loss of s ("selection coefficient"), the fitness of a sequence with k mutations is given by w = (1-s)k. Strictly speaking, the selection coefficient should be included to affect gamete size, but let's ignore that for now].

If there is recombination, the mutation-free sequences could be recovered (simulation next time?). Without recombination, the only direction for the population to evolve is a steady loss of best-fit individuals (by genetic drift).


  1. Does the ratchet occur faster in small or large populations? Find by simulating N=100 and N=1000
  2. Does the ratchet occur faster for a short or long genome? Find by simulating L=1e3 and L=1e4
  3. Which parts of the human genomes are asexual (therefore subject to Muller's Ratchet and becoming increasingly small)?
Submitted Codes
  • By John
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import poisson
from numpy.random import choice

def simulator(seq_length, pop, repro_rate, generations):
    mutation_rate = 0.00001
    lam = seq_length * mutation_rate
    individuals = np.array([1 for i in range(pop)])
    non_mutated_pop_rate = []
    for generation in range(generations):
        gametes = []
        for individual in individuals:
            off_sprs = poisson(lam, repro_rate)
            mutation_ix = np.array(np.where(off_sprs != 0))
            indiv_gametes = np.array([individual for j in range(repro_rate)])
            indiv_gametes[mutation_ix] = 0
            gametes += list(indiv_gametes)
        individuals = choice(gametes, size=pop)
        non_mutated_frequency = float(np.count_nonzero(individuals)) / pop
    return non_mutated_pop_rate

genome_length_1000 = 1000
genome_length_10000 = 10000
results_1000 = simulator(genome_length_1000, 1000, 100, 500)
results_10000 = simulator(genome_length_10000, 1000, 100, 500)

genome_length_1000 = 1000
genome_length_10000 = 10000
results2_1000 = simulator(genome_length_1000, 10000, 100, 500)
results2_10000 = simulator(genome_length_10000, 10000, 100, 500)

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
line_1_1000 = ax1.plot(results_1000, "r")
line_1_10000 = ax1.plot(results_10000, "b")
line_2_1000 = ax2.plot(results2_1000, "r")
line_2_10000 = ax2.plot(results2_10000, "b")

ax1.set_title("Population = 1000")
ax2.set_title("Population = 10000")

plt.title("Mutation Meltdown")
  • By Weigang
add.mutation <- function(genome, genome.length, mutation.rate) {
  mu.exp <- genome.length * mutation.rate;
  mu.num <- rpois(1, lambda = mu.exp);

ratchet <- function(pop.size=100, gamete.size=100, mu=1e-4, genome.length=1e4) {
  out <- data.frame(generation=numeric(), n0=numeric(), n1=numeric(), pop=numeric(), length=numeric());
  pop <- rep(0, pop.size) # initial all mutation-free
  g <- 1; # generation
  freq0 <- 1; # freq of zero-class
  while(freq0 > 0) {
    cat("at generation", g, "\n");
    freq0 <- length(which(pop == 0));
    out <- rbind(out, data.frame(generation=g, n0=freq0/pop.size, pop=pop.size, length=genome.length));
    gametes <- sapply(1:gamete.size, function(x) {lapply(pop, function(x) add.mutation(x, genome.length, mu))});
    pop <- sample(gametes, pop.size);
    g <- g+1;

out.long.genome.df <- ratchet(genome.length=1e4);
out.short.genome.df <- ratchet(genome.length=1e3);

March 31, 2017 "Drift-Mutation Balance" (4/14/2017)


We will explore genetic drift of a DNA fragment with mutation under Wright-Fisher model. From last week's exercise, we conclude that a population will sooner or later lose genetic diversity (becoming fixed after ~2N generations), if no new alleles are generated (by e.g., mutation or migration).

Mutation, in contrast, increases genetic diversity over time. Under neutrality (no natural selection against or for any mutation, e.g., on an intron sequence), the population will reach an equilibrium point when the loss of genetic diversity by drift is cancelled out by increase of genetic diversity by mutation.

You job is to find this equilibrium point by simulation, given a population size (N) and a mutation rate (mu). The expected answer is pi=2N*mu, where pi is a measure of genetic diversity using DNA sequences, which is the average pairwise sequence differences within a population.

Note: the following algorithm seems to be too complex to build at once. Also, too slow to run in R. Nonetheless, please try to write two R functions: (1)mutate.seq(seq, mut), with "seq" as a vector of bases and "mut" is the mutation rate, and (2) avg.seq.diff(pop), with "pop" as a list of DNA sequences

A suggested algorithm is:

  1. Start with a homogeneous population with N=100 identical DNA sequences (e.g., with a length of L=1e4 bases, or about 5 genes) [R hint: dna <- sample(c("a", "t", "c", "g"), size=1e5, replace=T, prob = rep(0.25,4))]
  2. Write a mutation function, which will mutate the DNA based on Poisson process (since mutation is a rare event). For example, if mu=1e-4 per generation per base per individual (too high for real, but faster for results to converge), then each generation the expected number of mutations would be L * mu = 1 per individual per generation for our DNA segment. You would then simulate the random number of mutations by using the R function num.mutations <- rpois(1,lambda=1).
  3. Apply the mutation function for each individual DNA copy (a total of N=100) during gamete production (100 gametes for each individual) at each generation (for a total of G=1000 generations).
  4. Write another function to calculate, for each generation, instead of counting allele frequencies (as last week's problem), to calculate & output average pairwise differences among the 100 individuals.
  5. Finally, you would graph pi over generation.
Submitted Codes
  • By John
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
from random import sample
from numpy.random import choice
from numpy.random import poisson

# global variables
nuc = np.array(["A", "T", "C", "G"])
expected_mutation_rate = 0.00001 # there are 2 sequences per individual
seq_length = 10000
individual_pop = 100
gamete_rate = 100
generation = 1000

def simulator(population, repro_rate, generation):
    observed_mutation_collection = []
    # Original Individuals
    original = choice(nuc, seq_length, 0.25)
    individual_total = [original for i in range(population)]
    for i in range(generation): # iterate over 1000 generations
        # Produce Gametes
        gametes = []
        for individual in individual_total:
            gametes += [individual for i in range(repro_rate)]
        gametes = np.array(gametes)
        gametes_pop = gametes.shape[0] # number of gametes: 100 * 100
        # Mutation
        mutation_number_arr = poisson(lam=seq_length * expected_mutation_rate, size=gametes_pop) # derive number of mutations per individual
        mutation_index_arr = [sample(range(seq_length), mutations) for mutations in mutation_number_arr] # get the index of mutation
        # Mutation: Replace with Mututated Base Pairs
        for i in range(gametes_pop): # iterate over all gametes
            for ix in mutation_index_arr[i]: # iterate over mutated base pair for each gamete
                gametes[i, ix] = choice(nuc[nuc != gametes[i, ix]], 1)[0] # locate mutation and alter the nuc with mutated one
        # Next generation of individuals
        individual_total = gametes[choice(range(10000), 100, replace=False)]
        # Calculate observed mutation rate
        num_combinations = 0
        total_diff_bases = 0
        for pair in combinations(range(len(individual_total)), 2): # aggregate mutations for all possible pairs of individuals
            total_diff_bases += len(np.where((individual_total[pair[0]] == individual_total[pair[1]]) == False)[0])
            num_combinations += 1
        observed_mutation_rate = float(total_diff_bases) / float(num_combinations) / seq_length

# Simulation
result = simulator(100, 100, 1000)

# Visualize results
plt.plot(result, 'b')
plt.ylabel("Mutation Rate")
plt.title("Drift-Mutation Balance")
  • By Weigang
library(Biostrings) # a more compressed way to store and manipulate sequences
bases <- DNA_ALPHABET[1:4];
dna <- sample(bases, size = 1e5, replace = T);
library(ape) # to use the DNAbin methods
# function to mutate (Poisson process)
mutate.seq <- function(seq, mutation.rate) {
  mu.exp <- length(seq) * mutation.rate;
  mu.num <- rpois(1, lambda = mu.exp);
  if (mu.num > 0) {
    pos <- sample(1:length(seq), size = mu.num);
    for (j in 1:length(pos)) {
      current.base <- seq[pos[j]];
      mutated.base <-sample(bases[bases !=current.base)], size = 1);
      seq[pos[j]] <- mutated.base;

# main routine
drift.mutation <- function(pop.size=100, generation.time=500, gamete.size=10, mu=1e-5) {
  out <- data.frame(generation=numeric(), pi=numeric()); # for storing outputs
  pop <- lapply(1:pop.size, function(x) dna)  # create the initial population
  for(i in 1:generation.time) { # for each generation
    cat("at generation", i, "\n"); # print progress
    gametes <- sapply(1:gamete.size, function(x) {lapply(pop, function(x) mutate.seq(x, mu))}); 
    pop <- sample(gametes, pop.size); 
    pop.bin <- as.DNAbin(pop); # change into DNAbin class to advantage of its dist.dna() function
    out <- rbind(out, data.frame(generation=i, pi=mean(dist.dna(pop.bin))));
out.df <- drift.mutation(); # run function and save results into a data frame

March 18, 2017 "Genetic Drift" (Due 3/31/2017)


This is our first biological simulation. Mandatory assignment for all lab members (from interns to doctoral students). An expected result is shown in the graph.

Task: Simulate the Wright-Fisher model of genetic drift as follows:

  1. Begin with an allele frequency p=0.5, and a pop of N=100 haploid individuals [Hint: pop <- c(rep("A", 50), rep("a", 50))]
  2. Each individual produces 100 gametes, giving a total of 10,000 gametes [Hint: use a for loop with rep() function]
  3. Sample from the gamete pool another 100 to give rise to a new generation of individuals [Hint: use sample() function]
  4. Calculate allele frequency [Hint: use table() function]
  5. Repeat the above in succession for a total generation of g=1000 generations [Hint: create a function with three arguments, e.g., wright.fisher(pop.size, gamete.size, generation.time)]
  6. Plot allele frequency changes over generation time
  7. Be prepared to answer these questions:
    1. Why allele frequency fluctuate even without natural selection?
    2. What's the final fate of population, one allele left, or two alleles coexist indefinitely?
    3. Which population can maintain genetic polymorphism (with two alleles) longer?
    4. Which population gets fixed (only one allele remains) quicker?
Submitted Codes
  • By Lili (May, 2019)
#Simple sampling, haploid, as a function 
wright_fisher <- function(pop_size, gam_size, alle_frq, n_gen) {
  pop <- c(rep("A", pop_size*frq), rep("a", pop_size*frq))
  prob <- numeric(n_gen)
  for(time in 1:n_gen){
    gamt <- rep(pop, gam_size)
    pop <- sample(gamt, pop_size)
    prob[time] <- table(pop)[1]/pop_size
  wf_df <- data.frame(generation=1:n_gen, probability=prob)
  plot(drift_df, type= 'l', main = "Wright-Fisher Process (Genetic Drift)", xlab = "generation", ylab = "allele frequency")

wright_fisher(1000, 100, 0.5, 1000)

#Version 2. Using binomial distribution with replicates, diploid

pop_size <- c(50, 100, 1000, 5000)
alle_frq <- c(0.01, 0.1, 0.5, 0.8)
n_gen <- 100
n_reps <- 50
genetic_drift <- data.frame()

for(N in pop_size){
  for(p in alle_frq){
    p0 <- p
    for(j in 1:n_gen){
      X <- rbinom(n_reps, 2*N, p)
      p <- X/(2*N)
      rows <- data.frame(replicate= 1:n_reps, pop=rep(N, n_reps), gen=rep(j, n_reps), frq=rep(p0, n_reps), prob=p )
      genetic_drift <- rbind(genetic_drift, rows)

ggplot(genetic_drift, aes(x=gen, y=prob, group=replicate)) + geom_path(alpha= .5) + facet_grid(pop ~ frq) + guides(colour=FALSE)

# 3rd version: trace ancestry (instead of allele frequency)
  • By John
import numpy as np
import sys
from random import sample
import matplotlib.pyplot as plt
def simulator(gametes_rate, next_individuals, generations):
    individuals = [1 for i in range(50)] + [0 for j in range(50)]
    frequency = []
    for generation in range(generations):
        gametes = []
        for individual in individuals:
            gametes += [individual for i in range(gametes_rate)]
        individuals = sample(gametes, next_individuals)
        frequency.append(np.count_nonzero(individuals) / len(individuals))
N_100 = simulator(100, 100, 1000)
N_1000 = simulator(100, 1000, 1000)
# Create plots with pre-defined labels.
# Alternatively,pass labels explicitly when calling `legend`.
fig, ax = plt.subplots()
ax.plot(N_100, 'r', label='N=100')
ax.plot(N_1000, 'b', label='N=1000')
# Add x, y labels and title
plt.ylim(-0.1, 1.1)
plt.title("Wright-Fisher Model")
# Now add the legend with some customizations.
legend = ax.legend(loc='upper right', shadow=True)
# The frame is matplotlib.patches.Rectangle instance surrounding the legend.
frame = legend.get_frame()
# Set the fontsize
for label in legend.get_texts():
for label in legend.get_lines():
    label.set_linewidth(1.5)  # the legend line width
import scala.util.Random
import scala.collection.mutable.ListBuffer
val gamete_rate = 100
val offspr_rate = 100
val generations = 1000
var frequency: List[Double] = List()
var individuals = List.fill(50)(0) ++ List.fill(50)(1)
for(generation <- 1 to generations){
  var gametes = => List.fill(gamete_rate)(x)).flatten
  individuals = Random.shuffle(gametes).take(offspr_rate)
  frequency = frequency :+ individuals.count(_ == 1).toDouble / offspr_rate
val gamete_rate = 100
val offspr_rate = 100
val generations = 300
var frequency: List[Double] = List()
var individuals = sc.parallelize(List.fill(50)("0") ++ List.fill(50)("1"))
for(generation <- 1 to generations){
  val gametes = individuals.flatMap(x => (x * gamete_rate).split("").tail)
  individuals = sc.parallelize(gametes.takeSample(false, offspr_rate))
  val count = individuals.countByValue
  frequency = frequency :+ count("1").toDouble / offspr_rate
  • By Brian
Wright<-function(pop.size,gam.size,generation) {
 if( pop.size%%2==0) {
  for (i in 1:generation) {
    if (cases[1]<100 && cases[2]<100 ) 
  plot(geneticDrift$frequency~geneticDrift$time,type="b", main="Genetic Drift N=1000", xlab="time",ylab="Proportion Pop A",col="red", pch=18 )
 if(pop.size%%2==1 ) {
   print("Initial Population should be even. Try again")
  • By Jamila
Genetic_code = function(t,R){ 
  N<- 100
  p<- 0.5
  for (i in 1:t){
  plot(frequency, type="1",ylim=c(0,1),col=3,xlab="t",ylab=expression(p(A[1])))
  • By Sharon
pop <- c(rep("A", 50), rep("a", 50)) #create a population of 2 alleles
data<-data.frame(g,p) #a set of variables of the same # of rows 
for (i in 1:1000) { #generation 
  gam_pl <-rep(pop, 100) #gamete pool
  gam_sam <-sample(gam_pl, 100)  #sample from gamete pool
  tab_all<-table(gam_sam) #table ps sample
  all_freq<-tab_all[1]/100 #get allele frequency
  data<-rbind(data, c(i,all_freq))
  • By Sipa
pop <- c(rep("A", 50), rep("a", 50))
wright_fisher <- function(N,gen_time,gam) {
  gen_time = gen_time
  x = numeric(gen_time) 
  x[1] = gam
  for (i in 2:1000) {
  x[i]=sample(0:N, 1, prob=prob)

plot(x[1:gen_time], type="l", pch=10)

pool <- rep(pop, times = 100)
s_pool <- sample(pool, size = 100, replace = F)
wright_fisher2 <- function(pop_size,al_freq,gen_time) {
  pop <- c(rep("A", pop_size/2), rep("a", pop_size/2))
  pool <-rep(pop, times=100)
  s_pool <- sample(pool, size = 100, replace = T)
  for(i in 1:gen_time)
    s_pool <- sample(sample(pool, size = pop_size, replace = F))
  a.f <- table(s_pool)[1]/100
  • By Nicolette
Genetic_drift=array(0, dim=c(g,sim))
for(i in 1:sim) {
  for(j in 2:g){
matplot(1:1000, (X/N), type="l",ylab="allele_frequency",xlab="generations")
  • By Weigang
# output frequency:
wright.fisher <- function(pop.size, generation.time=1000, gametes.per.ind=100) {
  out.df <- data.frame(gen=numeric(), freq=numeric(), pop.size=numeric());
  pop <- c(rep("A", pop.size/2), rep("a", pop.size/2)); # initial pop
  for(i in 1:generation.time) {
    gametes <- unlist(lapply(pop, function(x) rep(x, gametes.per.ind)));
    pop <- sample(gametes, pop.size); # sample gamete pool without replacement
    freq.a <- table(pop)[1]/pop.size; # frequency of "a"
    out.df <- rbind(out.df, c(gen=i, freq=freq.a, pop=pop.size, rep=rep));

# Make Figure 1 above
plot(out.1e2[,1], out.1e2[,2], type = "l", las=1, ylim=c(0,1), xlab="generation", ylab="allele frequency", main="Wright-Fisher Process (Genetic Drift)", col=3, lwd=2)
lines(out.1e3[,1], out.1e3[,2], type = "l", col=2, lwd=2)
legend(400, 0.2, c("N=100", "N=1,000"), lty=1, col=3:2, cex=0.75)
abline(h=0.5, col="gray", lty=2)

# Second function to track founders (not frequency)
wright.fisher.2 <- function(pop.size, generation.time=1000, gametes.per.ind=100) {
  pop <- 1:100 # label founders
  out.list <- list(pop);
  for(i in 1:generation.time) {
    gametes <- unlist(lapply(pop, function(x) rep(x, gametes.per.ind)));
    pop <- sample(gametes, pop.size); # sample gamete pool without replacement
    out.list[[i+1]] <- pop;

# plot loss of genetic diversity by tracking surviving founder lineages
plot(x=1:100, y=rep(1, 100), ylim=c(0,100), xlim=c(1,600), type="n", las=1, main="Drift: Suvival of founder lineages\n(N=100)", cex.main=0.75, xlab="generation", ylab="founders")
for (i in 1:600) { points(rep(i,100), 1:100, col=colors()[out.list[[i]]], cex=0.5) }
for (i in 1:600) {points(rep(i,100), out.list[[i]], cex=0.2, col="yellow") }

March 11, 2017 "Stoplights" (Due 3/18/2017)

  • Source: Paul Nahin (2008). "Digital Dice", Problem 18.
  • Challenge: How many red lights, on average, will you have to wait for on your journey from a city block m streets and n avenues away from Belfer [with coordinates (1,1)]? (assuming equal probability for red and green lights)
  • Note that one has only wait for green light when walking along either the north side of 69 Street or east side of 1st Avenue. On all other intersections, one can walk non-stop without waiting for green light by crossing in the other direction if a red light is on.
  • Formulate your solution with the following steps:
  1. Start from (m+1,n+1) corner and end at (1,1) corner
  2. The average number of red lights for m=n=0 is zero
  3. Find the average number of red lights for m=n=1 by simulating the walk 100 times
  4. Increment m & n by 1 (but keep m=n), until m=n=1000
  5. Plot average number of red lights by m (or n).
Submitted Codes
  • By John
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
from random import sample
from itertools import combinations, permutations
from numpy import count_nonzero, array
def red_light(av, st):
    traffic_light = ["red_st", "red_av"]
    total_cross = av + st - 1
    while av != 0 and st != 0:
        if sample(traffic_light, 1)[0] == "red_st":
            av -= 1
            st -= 1
    rest_cross = av if av != 0 else st
    return(count_nonzero([sample([0, 1], 1) for corss in range(rest_cross)]))
df = DataFrame(0, index=np.arange(10), columns=np.arange(10))
simulation = 1000
for av in df.index:
    for st in df.index:
        df.loc[av, st] = sum(array([[red_light(av, st)] for n in range(simulation)])) / simulation
plt.imshow(df, cmap='hot', interpolation='nearest')
plt.xticks(np.arange(1, 10))
plt.yticks(np.arange(1, 10))
plt.title("Average Numbers Waiting for Red Light")
plt.xlabel("Number of Av")
plt.ylabel("Number of St")
# Recursive Function for Each Trial

from sys import stdout
av = 30
st = 30
n_space = [0]
traffic_light = ["red_st", "red_av"]
def route(av, st):
    if not av == st == 0:
        if sample(traffic_light, 1)[0] == "red_st":
            if av == 0:
                n_space[0] += 1
                route(av, st - 1)
                stdout.write("\n" + " " * n_space[0] + "|")
                route(av - 1, st)
            if st == 0:
                stdout.write("\n" + " " * n_space[0] + "w" + "\n" + " " * n_space[0] + "|")
                route(av - 1, st)
                n_space[0] += 1
                route(av, st - 1)
route(av, st)
  • By Weigang
#!/usr/bin/env perl
use strict;
use warnings;

# Use recursion
my ($distx, $disty) = @ARGV;
my $num_red = 0; # keep red-light counts
print "-" x $distx, "\n"; # print a starting line
&walk($distx, $disty, \$num_red); # pass reference not value

sub walk {
    my ($x, $y, $ref) = @_;
    my $ct = $$ref;
    my $prob_red;
    if ($x == 1 && $y == 1) { # Reached destination
	print "*\n";
	print "-" x $distx, "\n"; # print the ending line
	print "reached Belfer after waiting for ", $ct, " red lights\n";

    if ($x == 1 && $y > 1) { # Reached right-side end
	$prob_red = rand();
	if ($prob_red < 0.5) { # red light, wait
	    print "w\n", " " x ($distx-1);
	} else {
	    print "|\n", " " x ($distx-1);
	&walk($x, $y-1, \$ct);

    if ($x > 1 && $y == 1) { # Reached bottom
	$prob_red = rand();
	if ($prob_red < 0.5) { # red light, wait
	    print "w";
	} else { print "-"}
	&walk($x-1, $y, \$ct);

    if ($x > 1 && $y > 1) {
	my $prob_across = rand(); # prob of walking right with green light
	if ($prob_across >= 0.5) { # move one block right
	    print "-";
	    &walk($x-1, $y, \$ct);
	} else { # red light, move one block down
	    print "|\n", " " x ($distx-$x);
	    &walk($x, $y-1, \$ct);
  • By Jeff
# In[26]:
def walk (walker):
    import random
    def just_walk (x):                                  #[1]
        if random.choice(["red_m_direction", "green_m"])=="green_m": x["loca"]["m"] -=1
        else: x["loca"]["n"] -=1
        return x    
    def may_have_to_wait (dist,waited):                 #[2]
        if random.choice(["green","red"])=="red": waited += 1
        else: dist -= 1
        return (dist,waited)    

    while (0 not in walker["loca"].values()):         #start walking
        walker = just_walk(walker) 
    while (walker["loca"]["m"] !=0):       # if n =0 and m != 0
        (walker["loca"]["m"], walker["waited"]) = may_have_to_wait(walker["loca"]["m"],walker["waited"]) 
    while (walker["loca"]["n"] !=0):       # if m =0 and n != 0
        (walker["loca"]["n"], walker["waited"]) = may_have_to_wait(walker["loca"]["n"],walker["waited"]) 

    return walker["waited"]

def given_distance(m,n,rep): 
    waited = list()
    for x in range(rep):
        walker={"loca":{"m":m,"n":n}, "waited":0}     #walker variable
    return sum(waited)/len(waited)
# main
for d in range(1000):                                  # set walking distance here
        result.append(given_distance (d,d,100))      # set rep here
import matplotlib.pyplot as plt                     # ploting 

#[1] no reason to wait for ANY red before one of the distances (m or n) is exhausted. Assuming when light is green for m, it must be red for n, vise versa. 
#[2] when one of the directions (m or n = 0), waiting cannot be avoided, start counting.

March 3, 2017 "PiE" (Due 3/10/2017)

  • Source: Paul Nahin (2008). "Dueling Idiots", Problem 5.
  • Challenge: obtain numerical values of Pi and E by simulations
  • Simulate pi by
  1. Randomly generate 10,000 pairs of uniformly distributed numbers from 0 and 1 (simulating throwing darts onto the unit square shown at right)
  2. Count the number of points enclosed within the quarter-circle
  3. Calculate the pi value from this proportion
  • Simulate e by
  1. Generate N random numbers from 0 to 1
  2. Divide into N equal-width bins between 0 and 1
  3. Count the number of bins Z that receive none of the random numbers
  4. Obtain e ~ N/Z (based on binomial sampling formula)
  5. Simulate N=1e2, 1e3, and 1e4
Submitted Codes
  • By Weigang
# simulate pi:
darts <- sapply(1:1000, function(x) { coords<- runif(2); return(ifelse(coords[1]^2+coords[2]^2 <= 1, 1,0)) })
pi <- 4*mean(darts)

# simulate e
N <- 1e3;
n <- runif(N);
cts <- numeric();
for(i in 1:(N-1)) {
  left <- i/N;
  right <- (i+1)/N;
  cts[i] <- length(which(n>=left & n < right))
e <- N/length(which(cts==0))
  • By Nicolette
#Simulate pi
random <- 0; square <- 0;
for (i in 1:1000){
  random[[i]]<- runif(1,0,1)
square[[i]]<- sqrt(1-(random[i])^2)
plot(random, square)
areaofqc <- (pi/4)
ranleqc <- length(which(random <=areaofqc))
squleqc <- length(which(square<=areaofqc))

#Simulate e
e <- 0;
for(N in 1:1000) {
  z.empt<- N - length(table(non.empty))
  e<-c(e, N/z.empt)
  • By Brian
#When you throw a dart at the a unit square dartboard the Probability of hitting the portion of the circle radius with center=(0,0) inside the unit square is  pi/4.  We can say that hitting the 1/4 circle within the unit square with a dart is a bernoulli random variable where p=pi/4. Further, we can say E(bernoulli=hit)=pi/4. 
# Imagine you are scoring the game of darts as follows- 1 point if you throw in the 1/4 circle and 0 points if you miss this space. 
# If you throw 10000 darts you just did 10000 iid bernoulli trials. By the law of large numbers if we count the number of hits to the 1/4 circle of radius 1 and divide by number of darts thrown we will get something pretty close to E(beroulli) . Multiply that number by 4 and you have an estimate of Pi.  

## Generate 10,000 pairs of points in the unit square with uniform distribution

x<-c(runif(1000000, min = 0, max = 1))
y<-c(runif(1000000, min = 0, max = 1))




## simulating exp
#It's a well that a binomial with large n and tiny p is a good estimate of the poisson random For this estimate lambda=np.  In this case n=10,000 and p=the probability of falling into a particular unit=1/10000....... 
 simexp<-function (n) {

for (i in 1:n) {
  c<-subset(a,(i-1)/n<a & a<=i/n)
  • By John
#python code
# Simulate pi
simulation = 10000
distances = np.array([pow(uniform(0, 1), 2) + pow(uniform(0, 1), 2) for i in range(simulation)])
pi = count_nonzero(distances < 1) / simulation * 4

# Simulate e
total = 10000
ranges = [[x/total, (x+1)/total] for x in range(total)]
sample_space = [uniform(0, 1) for x in range(total)]
index = []
for i in range(total):
    if any(ranges[i][0] <= point <= ranges[i][1] for point in sample_space):
e = total / (total - len(set(index)))

Feb 24, 2017 "Idiots" (Due 3/3/2017)

  • Source: Paul Nahin (2000), "Dueling Idiots". Problem #2: "When Idiots Duel"
  • Game: Idiots A and B decide to duel with a gun in the following way: They will insert a single bullet into the gun's cylinder. The cylinder has a total of 6 slots. Idiot A will spin the cylinder and shoot at B. If the gun doesn't fire, then A will give the gun to B, who will spin the cylinder and then shoot at A. This back-and-forth duel will continue until one fool shoots (and kills) the other.
  • Questions: (1) what is the probability that A will win (and B dies); (2) What is the average number of trigger pulls before someone dies?
Submitted Codes
  • By Roy
gun <-c(1,0,0,0,0,0)
gun2 <- c(1,0,0,0,0,0)
deadman <-0 ;idiot1win <-0; idiot2win <- 0; nooned <- 0; total <- 0
 while(deadman < 1000){
  idiot1shot <-sample(gun)
  if (length(which(idiot1shot[1] == 1))){
    deadman <- deadman + 1
    idiot1win <- idiot1win + 1
      idiot2shot <-sample(gun2)
      if (length(which(idiot2shot[1] == 1))){
        deadman <- deadman +1
        idiot2win <- idiot2win +1
      } else {nooned <- nooned + 1}
  total <- total + 1


p <- idiot1win/1000
takes2kill <- deadman/total
  • By Weigang
rounds <- sapply(1:1000, function(x) {
  alive <- 1;
  round <- 0;
  while(alive == 1){
    spin <- sample(c(0,0,0,0,0,1));
    round <- round + 1;
    if(spin[1] == 1) { alive <- 0 }
}) <- length(which(rounds %% 2 == 0))
barplot(table(rounds)/1000, xlab="num rounds", ylab="Prob", main = "Sudden Death with a 6-slot gun (sim=1000)", las=1)

Feb 17, 2017 "Birthday" (Due 2/24/2017)

  • Problem: What is the probability NONE of the N people in a room sharing a birthday?
  1. Randomly select N individuals and record their B-days
  2. Count the B-days NOT shared by ANY two individuals
  3. Repeat (for each N) 1000 times, and obtain probability by averaging the previous counts (i.e., divided by 1000)
  4. Vary N from 10 to 100, increment by 10
  5. Plot probability of no-shared B-Day (Y-axis) versus N (x-axis), with either a stripchart or boxplot, or both
Submitted Codes
  • By Roy
N <- 0
samples <- 0;
prob.nodub <-0;
for(j in 10:100){ <- 0;
  test <- for(i in 1:1000){
  bdays <- sample(seq(as.Date('1990/01/01'), as.Date('1990/12/31'), by="day"), N, replace=T)
  dups <- duplicated(bdays, incomparables = FALSE)
  ch <-length(which(dups == TRUE))
    if(ch==0){ <- +1
    fine <- (
  N <- N + 1
  samples[[j]] <- N
  prob.nodub[[j]] <- fine
plot(samples,prob.nodub, main="Birthday Simulation", xlab = "Sample Size", ylab = "Probability of no Duplicates", las =1)
  • By weigang
days <- 1:365;

find.overlap <- function(x) { return(length(which(table(x)>1))) }

output <- sapply(1:100, function(x) { # num of people in the room <- 0;
  for (k in 1:100) {
    bdays <- sample(days, x, replace = T);
    ct <- find.overlap(bdays);
    if (!ct) { <- + 1}
plot(1:100, output/100, xlab="Group size", ylab = "Prob (no shared b-day)", las=1, main="B-day (sim=100 times)", type="l")
abline(h=0.5, lwd=2, col=2, lty=2)
abline(v=seq(0, 100, 5), col="gray")

Feb 10, 2017 "Dating" (Valentine's Day Special; Due 2/17/2017)

  • Source: Paul Nahin (2008), "Digital Dice". Problem #20: "An Optimal Stopping Problem"
  • Problem: What is the optimal time point when one should stop dating more people and settle on a mate choice (and live with the decision)
  • Your best strategy is to date an initial sample of N individuals, rejecting all, and marry the next one ranked higher than any of your N individuals. The question is what is the optimal number for N.
  1. The problem could be investigated by simulating a pool of 10 individuals, ranked from 1-10 (most desirable being 1) and then take a sample of N
  2. You may only date one individual at a time
  3. You cannot go back to reach previously rejected candidates
  4. Simulate N from 0 to 9 (zero means marrying the first date, a sample size of zero)
  5. For each N, obtain the probability of finding the perfect mate (i.e., ranked 1st) by running simulation 1000 times
  6. Plot barplot of probability versus sample size N.
  7. Expected answer: N=4
Submitted Codes
  • By Weigang
pick.candidate <- function(min, array) {
  for (i in 1:length(array)) {
    if (array[i] < min) {
    } else {next}
  return(0) # No 1. has been sampled and rejected
candidates <- 1:10;
output <- sapply(0:9, function(x) {
  ct <- 0;
  for(k in 1:1000) {
    if (x==0) { # no sample, marry the 1st guy 
      sampled <- sample(candidates, 1);
      if (sampled == 1) {ct <- ct+1}
    } else {
      sampled <- sample(candidates, x);
      not.sampled <- candidates[-sampled];
      not.sampled <- sample(not.sampled);
      if (pick.candidate(min =  min(sampled), array = not.sampled) == 1) {ct <- ct+1}
barplot(output/1e3, names.arg = 0:9, xlab = "number of sampled dates", las=1, main = "Optimal stopping for dating (N=10 candidates)", ylab = "Prob(marrying No.1)")

Feb 3, 2017 "US Presidents" (Due 2/10/2017)

  • Download : 1st column is the order, 2nd column is the name, the 3rd column is the year of inauguration; tab-separated
  • Your job is to create an R, Perl, or Python script called “us-presidents”, which will
  1. Read the table
  2. Store the original/correct order
  3. Shuffle/permute the rows and record the new order
  4. Count the number of matching orders
  5. Repeat Steps 3-4 for a 1000 times
  6. Plot histogram or barplot (better) to show distribution of matching counts
  7. Hint: For R, use the sample() function. For Perl, use the rand() function.
Submitted Codes
  • By Mei
pres.list <- lapply(1:1000, function(x) pres[sample(nrow(pres)),])
cts <- sapply(pres.list, function(x) {
  ct.match <- 0;
  for (i in 1:45){
    if (pres$order[i] == x[i,1]){
      #cat(as.character(x[i,2]), x[i,1], "\n")  #as.character to avoid the factor info
      ct.match <- ct.match + 1;
  ct.match #return
barplot(table(cts),xlab = "Number of matches per shuffle", border = "hotpink", col = "pink", ylab = paste("Frequency total of:",length(pres.list)), main = "US Presidents")
  • By John
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt

df = pd.read_table("presidents.txt", names=["num", "name", "presidency"])

name_list = list(

# create a list to store matches after each shuffle
shuffle_record = []

# create a function in which the first argument is the original dataset; second argument is number of shuffles
def shuffler2(original, n):
    record = []
    for i in range(n):
        num = 0
        each_shuffle = {}
        temp = original.reindex(np.random.permutation(original.index)) # do shuffling for each
        compare = original.num == temp.num
        matched_df = original.ix[compare[compare == True].index]
        for i in matched_df.index:
            each_shuffle[i] =[i]
            num = compare.value_counts()[1]
result2 = shuffler2(df, 1000)

plt.hist(result2, color="yellow")
plt.title("Histgram for President Data")
  • By Weigang
p <- read.table("Presidents.txt", sep="\t", header=F)
colnames(p) <- c("order", "name", "inaug.year")
# Use "sapply" or "lapply" for loops: no need to pre-define a vector to store results
p.sim <- sapply(1:10000, function (x) {
  length(which(sample(p$order) == p$order))
barplot(table(p.sim)/1e4, las=1)
p.exp <- rpois(1e4, 1) # draw 10000 Poisson random deviates
mp <- barplot(table(p.exp)/1e4, las=1, xlab = "Num of matching presidents") # mid-point on x-axis
lines(mp[1:7], table(p.sim)/1e4, type="b", col=2) # add a line (Poisson-expected) to the barplot (simulated)
legend("topright", c("Simulated", "Poisson expectation"), col=1:2, lty=1)