Summer 2018

Revision as of 22:00, 29 June 2018

Rules of Conduct

No eating, drinking, or loud talking in the lab. Socialize in the lobby only.
Be respectful to each other, regardless of level of study
Be on time & responsible. Communicate in advance with the PI if late or absent

Participants

Dr Oliver Attie, Research Associate
Brian Sulkow, Research Associate
Saymon Akther, CUNY Graduate Center, EEB Program
Lily Li, CUNY Graduate Center, EEB Program
Mei Wu, Bioinformatics Research Assistant
Yinheng Li, Informatics Research Assistant
Christopher Panlasigui, Hunter Biology
Dr Lia Di, Senior Scientist
Dr Weigang Qiu, Principal Investigator
Summer Interns: Muhammad, Pavan, Roman, Benjamin, Andrew, Michelle, Hannah

Journal Club

A Unix & Perl tutorial
A short introduction to molecular phylogenetics: http://www.ncbi.nlm.nih.gov/pubmed/12801728
A review on Borrelia genomics: https://www.ncbi.nlm.nih.gov/pubmed/24704760
ospC epitope mapping: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0067445

Projects

Borrelia genome evolution (Led by Saymon)

Goal 1. Estimate time of cross-Atlantic dispersal using core-genome sequences
Goal 2. Investigate codon biases with respect to levels of gene expression. Data file:
File:B31-cp26.txt

Andrew's BioPython code to calculate CAI

#Imports the codon usage module.
from Bio.SeqUtils import CodonUsage as cu

#Opens the fasta file and reads its contents into a string (myStr).
with open("B31-cp26.txt", "r") as myFile:
    myStr = myFile.read()

#Splits myStr into a list of FASTA records (FastaList); the text before the first ">" is dropped.
FastaList = myStr.split(">")[1:]
IDList = []
SeqList = []
##Separates each record into a sequence ID (first word of the header line) and a sequence with line breaks removed.
for record in FastaList:
    header, _, body = record.partition("\n")
    IDList.append(header.split()[0])
    SeqList.append(body.replace("\n", "").replace("\r", ""))

#Calculates and prints the CAI value for each sequence, using the same file as the reference set for the index.
myObject = cu.CodonAdaptationIndex()
myObject.generate_index("B31-cp26.txt")
for seqID, seq in zip(IDList, SeqList):
    print(seqID, 'CAI=', myObject.cai_for_gene(seq))

Output for cp26:
BB_B01 CAI= 0.7190039074113422
BB_B02 CAI= 0.678404951527374
BB_B03 CAI= 0.6893076488255271
BB_B04 CAI= 0.7250154635421513
BB_B05 CAI= 0.6971190458423587
BB_B06 CAI= 0.67042305582205
BB_B07 CAI= 0.6971020959083346
BB_B09 CAI= 0.6786931743972611
BB_B10 CAI= 0.7224886929887183
BB_B12 CAI= 0.6997502136447451
BB_B13 CAI= 0.7592966148479222
BB_B14 CAI= 0.6959525612884284
BB_B16 CAI= 0.6835709626613392
BB_B17 CAI= 0.6974779110749645
BB_B18 CAI= 0.7052250722958308
BB_B19 CAI= 0.7049049245887261
BB_B22 CAI= 0.6860641572293008
BB_B23 CAI= 0.6915165725213809
BB_B24 CAI= 0.7025276490965267
BB_B25 CAI= 0.7439914547011712
BB_B26 CAI= 0.7255623088410704
BB_B27 CAI= 0.7161378416520467
BB_B28 CAI= 0.7316661839512337
BB_B29 CAI= 0.6919705705489939
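
For readers new to CAI: the index is usually defined as the geometric mean of per-codon "relative adaptiveness" values, where each codon is scored against the most frequently used synonymous codon in a reference set. The following is a minimal pure-Python sketch of that definition, not a description of Biopython's internals. The function names, the two-amino-acid codon table, and the toy sequences are all made up for illustration.

```python
import math
from collections import Counter

# Toy synonymous-codon table (assumption: two amino acids only, for illustration).
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}

def codons(seq):
    """Split a coding sequence into complete codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_seqs):
    """w(codon) = count(codon) / count(most used synonymous codon) in the reference set."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    w = {}
    for family in SYNONYMS.values():
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            if counts[c] > 0:  # unseen codons get no weight (log(0) is undefined)
                w[c] = counts[c] / top
    return w

def cai(seq, w):
    """Geometric mean of w over the codons of seq that have a weight."""
    logs = [math.log(w[c]) for c in codons(seq) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# Hypothetical reference gene: AAA x2, AAG x1, TTT x2, TTC x1
reference = ["AAAAAAAAGTTTTTTTTC"]
weights = relative_adaptiveness(reference)
print(weights)
print(cai("AAATTT", weights), cai("AAGTTC", weights))
```

A gene built only from the preferred codons scores 1, and a gene built only from the codons used half as often scores about 0.5.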

Identification of host species from ticks (Led by Lily [after first-level])

Goal 1. Protocol optimization for PCR amplification of host DNA from ticks
Goal 2. Protocol development: library construction for MiSeq
Goal 3. Development of bioinformatics protocols and sequence database

Pseudomonas Genome-wide Association Studies (GWAS) (Led by Mei & Yinheng, in collaboration with Dr Xavier of MSKCC)

Goal 1. Association of genes/SNPs with biofilm formation and c-di-GMP levels: Manuscript preparation
Goal 2. Association of genome diversity with metabolic diversity

(Christopher) This script parses the Excel peak-area file (saved as tab-delimited text) into database input (-d) or R input (-r)

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Getopt::Std;

my %opts;
my $line_ct = 0;
my (@colnames, @areas, %seen_cmps, %seen_gids);

getopts('dr', \%opts);
while(<>) {
    chomp;
    $line_ct++;
    next unless $line_ct >=4;
    if ($line_ct == 4) {
        @colnames = split "\t", $_;
        for (my $i=5; $i<=$#colnames; $i++) { $seen_gids{$colnames[$i]}++ } # get uniq gids
        next;
    }
    my @data = split "\t", $_;
    $seen_cmps{$data[1]}++; # get unique compound formula
    for (my $i=5; $i<=$#data; $i++) {
        my $area = { 'compound' => $data[1], 'gid' =>$colnames[$i],  'peak_area' => $data[$i]};
        push @areas, $area;
    }
}

if ($opts{d}) { # for database output
    foreach my $cmp (sort keys %seen_cmps) {
        foreach my $gid (sort keys %seen_gids) {
            my @peaks = grep { $_->{compound} eq $cmp && $_->{gid} eq $gid } @areas;
            my $peak_str = join ",", map {$_->{peak_area} || "NULL"} @peaks;
            print join "\t", ($gid, $cmp, "{" . $peak_str . "}");
            print "\n";
        }
    }
}

if ($opts{r}) { # for R output
    foreach my $cmp (sort keys %seen_cmps) {
        foreach my $gid (sort keys %seen_gids) {
            my @peaks = grep { $_->{compound} eq $cmp && $_->{gid} eq $gid } @areas;
            foreach my $peak (@peaks) {
                next unless $peak->{peak_area};
                print join "\t", ($peak->{peak_area}, $gid, $cmp);          
                print "\n";
            }
        }
    }
}
exit;
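
The reshaping step in the script above (wide table in, one record per compound and genome out) can be sketched in Python as follows. The column layout is taken from the Perl script: header on line 4, compound formula in the second column, genome IDs from the sixth column onward. The function name, the column names other than the genome IDs, and the toy rows are hypothetical.

```python
def parse_peak_areas(lines, header_line=4, compound_col=1, first_sample_col=5):
    """Turn a wide tab-delimited peak-area table into long-format records."""
    colnames = []
    records = []
    for line_no, line in enumerate(lines, start=1):
        if line_no < header_line:
            continue  # skip preamble lines above the header
        fields = line.rstrip("\n").split("\t")
        if line_no == header_line:
            colnames = fields  # genome IDs sit in the sample columns
            continue
        for i in range(first_sample_col, min(len(fields), len(colnames))):
            records.append({
                "compound": fields[compound_col],
                "gid": colnames[i],
                "peak_area": fields[i],
            })
    return records

# Made-up input: three preamble lines, a header, then two compounds x two genomes.
toy = [
    "title",
    "",
    "units",
    "id\tformula\tmz\trt\tnote\t101\t102",
    "1\tC6H12O6\t0\t0\t\t1500\t",
    "2\tC3H6O3\t0\t0\t\t\t320",
]
records = parse_peak_areas(toy)

# R-style rows (peak_area, gid, compound), skipping empty cells as the -r option does.
r_rows = [(r["peak_area"], r["gid"], r["compound"]) for r in records if r["peak_area"]]
for row in r_rows:
    print("\t".join(row))
```

Keeping the long-format records in one list means the database and R outputs differ only in how the same records are filtered and printed.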

Compound amount in each genome

Machine learning approaches to evolution (Led by Oliver & Brian)

OspC structural alignment, converted from S2 from Baum et al (2013)

Goal 1. Implement Hopfield network for optimization of protein structure
Goal 2. Neural-net models of OspC. Structural alignment (S2 from Baum et al 2013):

Goal 3. K-mer-based pipeline for genome classification
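
As a starting point for Goal 3, the core idea of k-mer-based classification can be sketched as: count the k-mers in each sequence, then assign a query to the reference with the most similar k-mer profile. The sketch below uses cosine similarity, which is one common choice and an assumption here, since the page does not specify the pipeline's method. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def kmer_profile(seq, k):
    """Count overlapping k-mers, ignoring any that contain non-ACGT characters."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(m for m in kmers if set(m) <= set("ACGT"))

def cosine(p, q):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(p[m] * q[m] for m in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def classify(query, references, k=3):
    """Return the name of the reference whose k-mer profile is closest to the query."""
    query_profile = kmer_profile(query, k)
    return max(
        references,
        key=lambda name: cosine(query_profile, kmer_profile(references[name], k)),
    )

# Made-up reference sequences standing in for genomes.
references = {
    "AT_rich": "ATATATTTAAATATAT",
    "GC_rich": "GCGCGGCCGCGCGGGC",
}
print(classify("TTATATAA", references))
```

Real genomes would need a larger k and precomputed reference profiles; the structure of the comparison stays the same.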

Weekly Schedule

Summer kickoff (June 1, 2018, Friday): Introduction & schedules
Week 1 (June 4-8):
- Monday: the Unix & Perl Tutorial, Part 1
- Tuesday: Unix Part 2. Explore the "iris" data set using R, by following the Monte Carlo Club Week 1 (1 & 2) and Week 2. Read McKay (2003), Chapters 38 & 39
- Thursday: 1st field day (Caumsett State Park); Participants: John, Muhammad, Pavan, Andrew, Dr Sun, Weigang, with 3 members of Moses team from Suffolk County Vector Control. Got ~110 deer tick nymphs
- Friday: meeting with MSKCC group at 11am; BBQ afterwards
Week 2 (June 11-15):
- Monday: Lab meeting, projects assigned
- Tuesday: neural net tutorial (by Brian)
- Thursday: 2nd field day (Fire Island National Seashore). Participants: John, Brian, Mei, Muhammad, Pavan, Benjamin, and Weigang. Got ~100 lone-star ticks and 4 deer tick nymphs
Week 3 (June 18-22):
- Monday: Lab meeting, 1st project reports
  - Codon Bias: Theory, Coding, and Data (Andrew, Pavan, Saymon)
  - OspC epitope identification: Serum correlation, sequence correlation, immunity-sequence correction (Muhammad, Roman, Brian)
  - Pseudomonas metabolomics: parsing intensity file; theory & parsing SBML file (Chris & Benjamin)
- Tuesday: working groups
- Wednesday: working groups
- Thursday: Big Data Workshop
- Friday: working groups
Week 4 (June 25-29):
- Monday: Lab meeting

Lab notes for Summer HS Interns

NCI Cloud: Seven Bridges Cloud Platform. Create a user account
Read documentation & tutorials: Documentation

Notes & Scripts

(Weigang) A sample R script to parse Table S2 from Baum et al 2013, sera-antigen reactivity measurements

# preliminaries: save as TSV; substitute "\r" if necessary; 
# substitute "N/A" to "NA"; remove extra columns
setwd("Downloads/")
# stringsAsFactors=TRUE keeps levels() below working in R >= 4.0
x <- read.table("table-s2.txt", sep="\t", header=T, stringsAsFactors=TRUE)
View(x)
colnames(x)
which(x[,8]=="A") # rows for ospC allele type A
x[which(x[,8]=="A"),12]
x[which(x[,8]=="A3"),12]
x.cor <- cor.test(x[which(x[,8]=="A3"),12], x[which(x[,8]=="A"),12], method = "pearson")
x.cor$estimate
a <- levels(x[,8]) # obtain ospC allele types; to be looped through in pairwise
for (i in 1:(length(a)-1)) {
  for (j in (i+1):length(a)) {
    print(cor.test(x[which(x[,8]==a[i]),12], x[which(x[,8]==a[j]),12], method = "pearson")$estimate)
  }
}

(Muhammad) Generates a data frame of pairwise correlation coefficients and p-values for the 23 OspC allele types

setwd("C:/R_OspC")
x <- read.table("Table-S2.txt", sep="\t", header=T, stringsAsFactors=TRUE) # factors are needed for levels() in R >= 4.0
a<-levels(x[,8])
output = data.frame(i=character(), j=character(), cor = numeric(), p = numeric());
#k <-0;
for(i in 1:22) {
  allele.i <- a[i];
  vect.i <- x[which(x[,8]==allele.i),12];
  
  for(j in (i+1):23) {
    allele.j <- a[j];
    vect.j <-x[which(x[,8]==allele.j),12];
    cor <- cor.test(vect.i,vect.j, method = "pearson");
    output <- rbind (output, data.frame(i=allele.i, j=allele.j, cor=cor$estimate, p=cor$p.value)); 
  }
}
 write.table(output, "immune-output.txt", quote = F, sep = "\t")

(Muhammad) Creates a plot for the correlation values of the lab's data and the author's data

#reads in the authors' cross-reactivity correlation matrix
cr<- read.csv("C:/ospc/matricesospc.csv", header=F, sep = ",")
#puts the upper-triangle values of cr (row by row, i < j) into corvect
corvect<-c()
for (i in 1:(nrow(cr)-1)) {
  for (j in (i+1):ncol(cr)) {
    corvect[length(corvect)+1]<- cr[i,j]
  }
}
#combines the lab's pairwise correlations (output) with the authors' values;
#assumes the matrix rows and columns are in the same allele order as output
df<- data.frame(output, corvect)
#plots the lab's correlations against the authors' correlations
plot(output[,3], corvect, main="Cross Reactivity Correlation Comparison", ylab = "Author's Output", xlab="Lab Output")
#fits the linear model relating our data to the authors' data
b<-lm(corvect~output[,3])
#draws the fitted regression line on the plot
abline(b, col=2)
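
The comparison above relies on two orderings agreeing: the (i, j) allele pairs produced by the nested loop in the pairwise-correlation script, and the row-by-row upper triangle read from the authors' matrix. A small Python sketch of that check is given below. The labels and matrix values are made up, and the check only covers loop order; it cannot confirm that the authors' matrix lists alleles in the same order as the lab's table.

```python
from itertools import combinations

def pair_order(n):
    """Index pairs (i, j) with i < j, in the order a nested i/j loop visits them."""
    return [(i, j) for i in range(n - 1) for j in range(i + 1, n)]

def upper_triangle(matrix):
    """Values above the diagonal of a square matrix, read row by row."""
    return [matrix[i][j] for i, j in pair_order(len(matrix))]

# Hypothetical allele labels and a made-up symmetric correlation matrix.
labels = ["A", "B", "C", "D"]
matrix = [
    [1.0, 0.1, 0.2, 0.3],
    [0.1, 1.0, 0.4, 0.5],
    [0.2, 0.4, 1.0, 0.6],
    [0.3, 0.5, 0.6, 1.0],
]

pairs = [(labels[i], labels[j]) for i, j in pair_order(len(labels))]
values = upper_triangle(matrix)
for (a, b), v in zip(pairs, values):
    print(a, b, v)
```

Printing the pair labels next to the values makes a row or column mismatch visible before the two data sets are plotted against each other.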
