Undergrad Research Experience
Latest revision as of 18:10, 14 March 2021
Spring 2021
Participants
- Eamen Ho: BIOL48002
- Afsana Rahman: BIOL48002
- Roman Shimonov: BIOL48002
- Zaheen Hossain: BIOL48002
- Ariel Cebelinski: BIOL48002
- Jean Ady: BIOL48001 & BIOL48002
- Mohamed Elgallad: Volunteer
- Anh Pham: Volunteer
- Jannatul Ashpia: Volunteer
Reading List
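- Design of broadly active anti-microbial vaccines:
  - Burton & Topol (2021). A commentary (https://www.nature.com/articles/d41586-021-00340-4)
  - Spensley et al (2019). Reverse Immunology. Scientific Reports (https://www.nature.com/articles/s41598-018-37288-x)
  - Cohen et al (2021). Cocktail approach (https://science.sciencemag.org/content/early/2021/01/11/science.abf6840)
  - An ASM review (https://asm.org/Articles/2021/February/SARS-CoV-2-Variants-vs-Vaccines)
- Group B Streptococcus (GBS) CC17 evolution: Almeida et al (2017) (https://msystems.asm.org/content/2/5/e00074-17)
- CoV mutation analysis: Alouane et al (2020) (https://www.biorxiv.org/content/10.1101/2020.06.20.163188v2)
- HIV compartmentalized evolution: Evering et al (2014) (https://retrovirology.biomedcentral.com/articles/10.1186/s12977-014-0065-0)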
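Project 1. Covid mutation analysis
- NCBI Covid genome files (https://www.ncbi.nlm.nih.gov/sars-cov-2/)
- R Markdown file, by Jannatul (http://diverge.hunter.cuny.edu/~weigang/Covid_Mutation.html)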
# Shared by Afsana
# Parse an NCBI SARS-CoV-2 GenBank file and print, for each gene and mature peptide,
# tab-separated columns: gene/mat_peptide name, length, starting coordinate, ending coordinate.
# Biopython locations are 0-based and end-exclusive, so the length is len(location)
# (equivalently end - start, with no "+ 1").
from Bio import SeqIO
import sys

inputFile = sys.argv[1]  # path to a GenBank file holding a single genome record
genome_record = SeqIO.read(inputFile, "genbank")  # read the one record in the file
seenProduct = set()  # mat_peptide products already printed

for feature in genome_record.features:
    if feature.type == "gene":  # "gene" is the feature type; the gene name is stored in the "gene" qualifier
        name = feature.qualifiers["gene"][0]
        loc = feature.location
        print(name + '\t' + str(len(loc)) + '\t' + str(int(loc.start)) + '\t' + str(int(loc.end)))
    elif feature.type == "mat_peptide":  # mature peptides; the name is stored in the "product" qualifier
        name = feature.qualifiers["product"][0]
        if name in seenProduct:
            continue  # skip a product that has already been printed
        seenProduct.add(name)
        loc = feature.location
        print(name + '\t' + str(len(loc)) + '\t' + str(int(loc.start)) + '\t' + str(int(loc.end)))
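A note on the coordinates printed by the script above: Biopython stores feature locations as 0-based, end-exclusive intervals, which is why the length needs no "+ 1". The sketch below illustrates the convention with plain integers, so it runs without Biopython. The helper names and the example coordinates are hypothetical, chosen only for illustration.

```python
def feature_length(start, end):
    """Length of a 0-based, end-exclusive interval [start, end)."""
    return end - start


def to_one_based(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, inclusive coordinates."""
    return start + 1, end


# Illustrative numbers only: a feature written as 266..805 in a GenBank file
# (1-based, inclusive) is held in memory as start=265, end=805.
start, end = 265, 805
print(feature_length(start, end))  # 540
print(to_one_based(start, end))    # (266, 805)
```

Printing the converted coordinates instead of the raw start would make the table match the numbers shown in the GenBank file.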
# Shared by Afsana
# Unix commands for the COVID-19 genome mutation set: count synonymous and missense (non-synonymous) mutations for each gene
grep synonymous cov-snps.tsv | cut -f3 | sort | uniq -c    # preview: synonymous counts per gene (column 3)
grep synonymous cov-snps.tsv | cut -f3 | sort | uniq -c > count_synonymous_covid
grep missense cov-snps.tsv | cut -f3 | sort | uniq -c > count_missense_covid
# Combine the two count files side by side, squeeze repeated spaces, drop the leading space,
# and keep fields 1-3 (missense count, gene, synonymous count).
# Note: paste joins line by line, so both files must list the same genes in the same order.
# The output is space-delimited even though the file name ends in .tsv.
paste -d ' ' count_missense_covid count_synonymous_covid | tr -s ' ' | sed "s/ //" | cut -f 1-3 -d ' '
paste -d ' ' count_missense_covid count_synonymous_covid | tr -s ' ' | sed "s/ //" | cut -f 1-3 -d ' ' > count_mutations_covid.tsv
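Because paste lines the two count files up by position, a gene that appears in only one of them shifts every row after it. A minimal Python sketch of the same counting step, joined by gene name instead, is given here. The function names and the sample rows are made up for illustration; the only thing taken from the commands above is that the gene name sits in column 3 of a tab-delimited file.

```python
from collections import Counter


def count_by_gene(rows, keyword):
    """Count rows containing `keyword`, keyed by the third tab-separated column."""
    counts = Counter()
    for row in rows:
        if keyword in row:  # same idea as `grep keyword`
            fields = row.rstrip("\n").split("\t")
            counts[fields[2]] += 1  # same column as `cut -f3`
    return counts


def merge_counts(missense, synonymous):
    """Join the two tallies on gene name, filling missing genes with 0."""
    genes = sorted(set(missense) | set(synonymous))
    return [(gene, missense.get(gene, 0), synonymous.get(gene, 0)) for gene in genes]


# Made-up rows standing in for lines of cov-snps.tsv
rows = [
    "100\tC\tORF1ab\tmissense",
    "200\tA\tORF1ab\tsynonymous",
    "300\tG\tS\tmissense",
    "400\tT\tS\tmissense",
    "500\tC\tN\tsynonymous",
]
merged = merge_counts(count_by_gene(rows, "missense"), count_by_gene(rows, "synonymous"))
for gene, n_missense, n_synonymous in merged:
    print(f"{gene}\t{n_missense}\t{n_synonymous}")
```

In the sample rows, N has no missense mutation and S has no synonymous one, which is exactly the case where a positional paste would pair the wrong genes.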
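# Shared by Roman
# R code: compare synonymous vs. non-synonymous (missense) mutation counts per gene
# Assumes a data frame `covid` with columns Gene, Missense, and Synonymous
library(tidyverse)
library(ggrepel)
ggplot(data = covid, aes(x = Missense, y = Synonymous, label = Gene)) +
  geom_point() +
  geom_text_repel() +
  geom_smooth(method = "lm")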
# Shared by Eamen: Python code for COVID mutation counts
# Input: whitespace-delimited file; column 3 is the gene name and column 8 holds the
# comma-separated mutation type(s).
# Output (tab-separated): gene, missense count, synonymous count, total
import sys

gene_name = {}  # gene -> {'total', 'missense', 'synonymous'} counts
with open(sys.argv[1]) as file:
    for line in file:
        lines = line.split()
        if len(lines) < 8 or lines[7].startswith("noncoding"):
            continue  # skip short lines and noncoding mutations
        counts = gene_name.setdefault(lines[2], {'total': 0, 'missense': 0, 'synonymous': 0})
        for item in lines[7].split(','):
            if item in ('missense', 'synonymous'):  # ignore any other mutation type
                counts['total'] += 1
                counts[item] += 1

for gene in gene_name:
    print(gene + "\t" + str(gene_name[gene]['missense']) + "\t" + str(gene_name[gene]['synonymous'])
          + "\t" + str(gene_name[gene]['total']))
Project 2. Group B Streptococcus genomics
- GenBank submission of assemblies
Project 3. HIV compartmentalized evolution
Project 4. DNABERT classification
Fall 2020
Participants
- Eamen Ho: Volunteer research assistant
- Ramandeep Singh: BIOL 48002
- Desiree Pante: BIOL 48001
- Afsana Rahman: Volunteer research assistant
- Roman Shimonov: BIOL 48002
- Justin Hiraldo: BIOL 48002
- Zaheen Hossain: Volunteer research assistant
- Jerry Sebastian: Volunteer research assistant
- Ariel Cebelinski: Volunteer research assistant
Schedule
- Tuesdays at 12 noon - 2pm by Zoom
- Sept 1, 2020. Week 1. Meet & Greet; Intro to projects
- Sept 8, 2020. Week 2. Presentations (background, data, and methods), based on assigned readings
Project 1. Structure & evolution of multipartite genome of Lyme disease bacteria
- Participants: Desiree & Ramon (Summer 2020), Jerry
- Readings
  - Review: deCenzo & Finan (2017) (https://mmbr.asm.org/content/81/3/e00019-17.long)
- Data set: lp54 & cp26 plasmids
- TO DO:
  - Week 1. 9/8/2020, 12 noon: 5-slide presentation on multipartite bacterial genome evolution (based on the paper above)
  - Week 2. 9/15, 12 noon: Use the program codonO to calculate codon bias (SCUO) for replicons (n=23) of the Borrelia burgdorferi B31 genome
  - Week 3. 9/22, 12 noon: codonO paper (https://academic.oup.com/nar/article/35/suppl_2/W132/2923883) presentation (Jerry)
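To make the Week 2 task more concrete, here is a rough sketch of an entropy-based codon bias score in the spirit of SCUO. It is not the codonO program and is not taken from it. The formula reflects one reading of the entropy-based definition (per amino acid, order = 1 - H/Hmax, weighted by that amino acid's share of codons) and should be checked against the codonO paper before any real use. The choice to leave Met, Trp, and stop codons out of the total is an assumption, and all function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code in NCBI order (bases T, C, A, G at each codon position)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Number of synonymous codons available for each amino acid
DEGENERACY = Counter(CODON_TABLE.values())


def scuo(cds):
    """Entropy-based codon usage order for one coding sequence (0 = no bias, 1 = maximal)."""
    cds = cds.upper()
    usage = defaultdict(Counter)  # amino acid -> codon -> count
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[pos:pos + 3])
        if aa is None or aa == "*" or DEGENERACY[aa] < 2:
            continue  # skip ambiguous codons, stops, and single-codon amino acids
        usage[aa][cds[pos:pos + 3]] += 1
    total = sum(sum(counts.values()) for counts in usage.values())
    if total == 0:
        return 0.0
    score = 0.0
    for aa, counts in usage.items():
        n_aa = sum(counts.values())
        entropy = -sum((c / n_aa) * math.log(c / n_aa) for c in counts.values())
        order = 1.0 - entropy / math.log(DEGENERACY[aa])  # (Hmax - H) / Hmax
        score += (n_aa / total) * order  # weight by the amino acid's share of codons
    return score


# Toy sequences: one codon per amino acid (maximal bias) vs. all four Ala codons used equally
print(round(scuo("GCTGCTGCTAAAAAA"), 6))
print(round(scuo("GCTGCCGCAGCG"), 6))
```

A per-replicon value could then be obtained by averaging the score over the coding sequences of each replicon, but how codonO aggregates genes is not stated here and would need to be confirmed.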
Project 2. OspC Cross-reactivity analysis
- Participants: Justin, Roman
- Readings: Ivanova et al (2009)
- Tool: ImageJ
- Data set (to be sent)
- To Do
  - Week 1. 9/8/2020 12 noon: 5-slide presentation on background, material & methods, and data capture using ImageJ
  - Week 2. 9/15: Create an Excel sheet to capture immunoblot intensities for C3H mice & P. leucopus. Capture the background for each serum. Get ready to make plots in R/RStudio
Project 3. Clostridium transcriptome analysis
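- Participants: Eamen, Zaheen
- Readings
  - Fletcher et al (2018) (https://msphere.asm.org/content/3/2/e00089-18)
  - Gu et al (2018) (https://www.sciencedirect.com/science/article/pii/S0944501318301988?via%3Dihub)
- Data set: posted on "genometracker.org"
  - Wild type transcriptome at 12 hours, paired-end read files:
  - /home/azureuser/18134XR-29-01_S0_L001_R1_001.fastq.gz
  - /home/azureuser/18134XR-29-01_S0_L001_R2_001.fastq.gz
- To Do
  - Week 1. 9/8/2020 12 noon:
    - A short presentation on C. diff transcriptome (one of the 2 papers above)
    - Demo on read quality using FastQC and mapping reads to reference genomes with bowtie
  - Week 2. Use HTSeq to quantify RNA abundance for C. diff genes.
    - HTSeq installed
    - Try this protocol first (https://htseq.readthedocs.io/en/master/count.html)
  - Commands
According to: reference (https://mmg434.readthedocs.io/en/latest/daythreemod.html); Bowtie website (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example)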
bioseq -i'genbank' R20291.gb > ref.fa # make FASTA file
bowtie2-build ref.fa index # build index
# -S: write SAM output to the named file (otherwise SAM goes to standard output)
bowtie2 -x index -S 18134XR-29-01.sam -1 ../18134XR-29-01_S0_L001_R1_001.fastq.gz -2 ../18134XR-29-01_S0_L001_R2_001.fastq.gz
# ref.gff3: need to run sed "s/Chromosome/FN545816/" on it first (replaces the sequence name "Chromosome" with "FN545816")
# need to set the ID attribute with "-i" / "--idattr"; default is "gene_id"
conda activate qiulab # change environment to access htseq
htseq-count -m union --stranded=yes --idattr=Parent 18134XR-29-01.sam ~/xingmin-cdiff/ref.gff3 > 18134XR-29-01.counts
samtools view -b 18134XR-29-01.sam -o 18134XR-29-01.bam # compress sam file into bam file
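The .counts file written by htseq-count has one "feature, tab, count" line per feature, followed by summary lines whose names start with two underscores (for example __no_feature and __ambiguous). As a possible next step, the sketch below separates the two and scales gene counts to counts per million. The function names, the gene IDs, and the numbers are made up for illustration, and counts per million is only one simple normalization, not necessarily the one used in this project.

```python
def parse_htseq_counts(lines):
    """Split htseq-count output into per-feature counts and "__" summary counters."""
    gene_counts, summary = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        feature, count = line.split("\t")
        target = summary if feature.startswith("__") else gene_counts
        target[feature] = int(count)
    return gene_counts, summary


def counts_per_million(gene_counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(gene_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in gene_counts}
    return {gene: count * 1e6 / total for gene, count in gene_counts.items()}


# Made-up lines standing in for the contents of 18134XR-29-01.counts
example = ["geneA\t30", "geneB\t10", "__no_feature\t5", "__ambiguous\t1"]
gene_counts, summary = parse_htseq_counts(example)
cpm = counts_per_million(gene_counts)
print(cpm)      # {'geneA': 750000.0, 'geneB': 250000.0}
print(summary)  # {'__no_feature': 5, '__ambiguous': 1}
```

For a real run, the lines would come from the file itself, e.g. `with open("18134XR-29-01.counts") as fh: gene_counts, summary = parse_htseq_counts(fh)`.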
Project 4. Protein classification using natural language processing
- Participants: Afsana & Ariel
- Goal: Classify protein sequences
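- Week 1. 9/8/2020 Readings:
  - Rives et al (2019) (https://www.biorxiv.org/search/Rives%252BBiological%252B622803)
  - Lan et al (2019) (https://arxiv.org/abs/1909.11942)
- Week 2. Find/Explore ALBERT resources & tutorials
- Code from Hansaim Lim (https://github.com/hansaimlim/thesis-works)
- Transformer: Pretrained models in natural language processing (https://huggingface.co/transformers/index.html)
- DNAbert paper (https://www.biorxiv.org/content/10.1101/2020.09.17.301879v1.full.pdf); DNAbert: github code (https://github.com/jerryji1993/DNABERT)
- Including Albert (https://huggingface.co/transformers/model_doc/albert.html)
- Google albert library: github (https://github.com/google-research/albert)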
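Language models such as DNABERT treat a sequence as a sentence of overlapping k-mer "words". The sketch below shows only that tokenization step in plain Python. The function name is hypothetical and the code is not taken from the DNABERT repository; whether k-mers are also the right tokens for the protein sequences in this project is left open.

```python
def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1); empty list if len(seq) < k."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# A six-letter toy sequence yields four overlapping 3-mers
print(" ".join(kmer_tokens("ATGGCT", k=3)))  # ATG TGG GGC GCT
```

The space-joined string is the form a text tokenizer would expect as input, one k-mer per word.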
Sample BioPython script:
#!/usr/bin/env python
import sys
import json
from Bio import SeqIO

alnFile = sys.argv[1]  # read file as the first argument
seqList = []  # initialize a list
for record in SeqIO.parse(alnFile, "fasta"):
    seqList.append({
        "id": record.id,
        "seq": str(record[0:3].seq)  # residues 1-3; str() converts the Seq object to a string
    })
print(json.dumps(seqList))  # print in JSON format