Undergrad Research Experience: Difference between revisions

From QiuLab
imported>Weigang
=Spring 2021=
==Participants==
* Eamen Ho: BIOL48002
* Afsana Rahman: BIOL48002
* Roman Shimonov: BIOL48002
* Zaheen Hossain: BIOL48002
* Ariel Cebelinski: BIOL48002
* Jean Ady: BIOL48001 & BIOL48002
* Mohamed Elgallad: Volunteer
* Anh Pham: Volunteer
* Jannatul Ashpia: Volunteer
==Reading List==
* Design of broadly active anti-microbial vaccines:
** [https://www.nature.com/articles/d41586-021-00340-4 Burton & Topol (2021). A commentary]
** [https://www.nature.com/articles/s41598-018-37288-x Spensley et al (2019). Reverse Immunology. Scientific Reports]
**[https://science.sciencemag.org/content/early/2021/01/11/science.abf6840 Cohen et al (2021). Cocktail approach]
** [https://asm.org/Articles/2021/February/SARS-CoV-2-Variants-vs-Vaccines an ASM review]
* Group B Streptococcus (GBS) CC17 Evolution: [https://msystems.asm.org/content/2/5/e00074-17 Almeida et al (2017)]
* CoV mutation analysis: [https://www.biorxiv.org/content/10.1101/2020.06.20.163188v2 Alouane et al (2020)]
* HIV compartmentalized evolution: [https://retrovirology.biomedcentral.com/articles/10.1186/s12977-014-0065-0 Evering et al (2014)]
==Project 1. Covid mutation analysis==
* [https://www.ncbi.nlm.nih.gov/sars-cov-2/ NCBI Covid genome files]
* [http://diverge.hunter.cuny.edu/~weigang/Covid_Mutation.html R Markdown file (by Jannatul)]
<syntaxhighlight lang='python'>
# Shared by Afsana
# Parse an NCBI SARS-CoV-2 GenBank file and report the length of each gene and mature peptide.
# Output columns (tab-separated): gene/mat_peptide name, length, start coordinate, end coordinate
from Bio import SeqIO
import sys

inputFile = sys.argv[1]  # GenBank file given as the first argument
genome_record = SeqIO.read(inputFile, "genbank")  # SeqIO.read() expects exactly one record in the file
seenProduct = set()  # mat_peptide products that have already been printed
for feature in genome_record.features:
    # "gene" and "mat_peptide" are feature types; their names are stored as qualifiers
    if feature.type == "gene":
        name = feature.qualifiers["gene"][0]
    elif feature.type == "mat_peptide":
        name = feature.qualifiers["product"][0]
        if name in seenProduct:
            continue  # skip products that were already printed
        seenProduct.add(name)
    else:
        continue  # ignore all other feature types
    loc = feature.location
    # Biopython positions are 0-based and end-exclusive, so the length is end - start (no "+ 1")
    length = len(loc)
    print(name + '\t' + str(length) + '\t' + str(int(loc.start)) + '\t' + str(int(loc.end)))
</syntaxhighlight>
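A note on coordinates: GenBank flat files list 1-based, inclusive positions, whereas Biopython stores 0-based, end-exclusive positions. The following is a minimal sketch in plain Python (no Biopython needed) of how the two conventions relate. The function names and the example span are illustrative only and are not part of the script above.

<syntaxhighlight lang="python">
def genbank_to_python(start_1based, end_1based):
    """Convert a 1-based inclusive span to a 0-based, end-exclusive span."""
    return start_1based - 1, end_1based


def span_length(start_0based, end_exclusive):
    """Length of a 0-based, end-exclusive span (no "+ 1" needed)."""
    return end_exclusive - start_0based


# Illustrative span written GenBank-style as 266..21555
start, end = genbank_to_python(266, 21555)
print(start, end, span_length(start, end))  # prints: 265 21555 21290
</syntaxhighlight>

The same length is obtained from the GenBank numbers as 21555 - 266 + 1, which is why adding 1 to Biopython's end - start overcounts by one base.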
<syntaxhighlight lang="bash">
# Shared by Afsana
# Unix commands for the COVID-19 genome mutation set: count synonymous & missense (non-synonymous) mutations for each gene
# Step 1. Count mutations per gene (column 3 holds the gene name)
grep synonymous cov-snps.tsv | cut -f3 | sort | uniq -c   # preview on screen
grep synonymous cov-snps.tsv | cut -f3 | sort | uniq -c > count_synoymous_covid
grep missense  cov-snps.tsv | cut -f3 | sort | uniq -c > count_missense_covid
# Step 2. Put the two count files side by side (assumes both list the same genes in the same order)
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' '   # squeeze repeated spaces
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed "s/ //"   # drop the leading space
# Step 3. Keep fields 1-3 (missense count, gene, synonymous count) and save; fields are space-separated despite the .tsv name
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed "s/ //" | cut -f 1-3 -d ' ' > count_mutations_covid.tsv
</syntaxhighlight>
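The paste step lines the two count files up by row, so it only works when both files list the same genes in the same order. As a hedged alternative, the sketch below merges the two "uniq -c" outputs by gene name and fills in 0 where a gene is missing from one file. The function names and the sample lines are made up for illustration.

<syntaxhighlight lang="python">
def parse_uniq_counts(lines):
    """Parse 'uniq -c' output lines such as '     12 E' into {gene: count}."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            counts[parts[1]] = int(parts[0])
    return counts


def merge_counts(missense_lines, synonymous_lines):
    """Return (gene, missense, synonymous) tuples, joined by gene name."""
    missense = parse_uniq_counts(missense_lines)
    synonymous = parse_uniq_counts(synonymous_lines)
    genes = sorted(set(missense) | set(synonymous))
    return [(g, missense.get(g, 0), synonymous.get(g, 0)) for g in genes]


# Made-up example lines (not real counts)
missense_lines = ["     12 E", "    340 S"]
synonymous_lines = ["    150 S"]
for gene, mis, syn in merge_counts(missense_lines, synonymous_lines):
    print(gene, mis, syn, sep="\t")
</syntaxhighlight>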
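<syntaxhighlight lang="r">
# Shared by Roman
# Compare synonymous vs. non-synonymous (missense) mutation counts per gene
library(tidyverse)
library(ggrepel)
# Assumes a data frame "covid" with columns Gene, Missense, and Synonymous (one row per gene)
ggplot(data=covid, aes(x=Missense, y=Synonymous, label=Gene)) + geom_point() + geom_text_repel() + geom_smooth(method="lm")
</syntaxhighlight>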
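The geom_smooth(method="lm") layer draws an ordinary least-squares line through the points. As a rough sketch of that fit, the plain-Python function below computes the slope and intercept directly. The function name is hypothetical and the counts are made up for illustration; they are not the lab's data.

<syntaxhighlight lang="python">
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up per-gene counts (illustration only)
missense = [10, 20, 30, 40]
synonymous = [6, 11, 16, 21]
slope, intercept = least_squares(missense, synonymous)
print(round(slope, 3), round(intercept, 3))  # prints: 0.5 1.0
</syntaxhighlight>

Genes that fall far from the fitted line have an unusual ratio of missense to synonymous changes relative to the other genes.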
<syntaxhighlight lang='python'>
# Shared by Eamen: Python code for COVID mutation counts
# Output columns (tab-separated): gene, missense count, synonymous count, total
import sys

gene_name = {}  # gene -> counts of missense, synonymous, and total mutations
with open(sys.argv[1]) as file:
    for line in file:
        fields = line.split()
        if len(fields) < 8 or fields[7].startswith("noncoding"):
            continue  # skip short lines and noncoding sites
        # column 3 holds the gene name; column 8 holds comma-separated mutation types
        counts = gene_name.setdefault(fields[2], {'total': 0, 'missense': 0, 'synonymous': 0})
        for item in fields[7].split(','):
            if item in ('missense', 'synonymous'):
                counts['total'] += 1
                counts[item] += 1

for gene in gene_name:
    print(gene + "\t" + str(gene_name[gene]['missense']) + "\t" + str(gene_name[gene]['synonymous'])
          + "\t" + str(gene_name[gene]['total']))
</syntaxhighlight>
==Project 2. Group B Streptococcus genomics==
* GenBank submission of assemblies
==Project 3. HIV compartmentalized evolution==
==Project 4. DNABERT classification==
=Fall 2020=
==Participants==
* Zaheen Hossain: Volunteer research assistant
* Jerry Sebastian: Volunteer research assistant
* Ariel Cebelinski: Volunteer research assistant


==Schedule==
bioseq -i'genbank' R20291.gb > ref.fa # make FASTA file
bowtie2-build ref.fa index # build index
# -S: file to write SAM alignments to (default: stdout)
bowtie2 -x index -S 18134XR.sam -1 ../18134XR-29-01_S0_L001_R1_001.fastq.gz -2 ../18134XR-29-01_S0_L001_R2_001.fastq.gz
 
# ref.gff3: the sequence ID must match the reference FASTA; rename it with sed "s/Chromosome/FN545816/"
<syntaxhighlight>
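# the GFF3 ID attribute must be set with -i/--idattr; the default is "gene_id"
# note: the bowtie2 step above writes 18134XR.sam; use the same file name in both steps
conda activate qiulab # change environment to access htseq
htseq-count -m union --stranded=yes --idattr=Parent 18134XR-29-01.sam ~/xingmin-cdiff/ref.gff3 > 18134XR-29-01.counts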
samtools view -b 18134XR-29-01.sam -o 18134XR-29-01.bam # compress sam file into bam file
</syntaxhighlight>
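The htseq-count output is a two-column, tab-separated table (feature ID, read count) followed by summary rows whose names start with two underscores, such as __no_feature and __ambiguous. The sketch below separates the two kinds of rows before any downstream analysis. The function name and the feature IDs in the example are hypothetical.

<syntaxhighlight lang="python">
def split_htseq_counts(lines):
    """Split htseq-count output into per-feature counts and '__' summary counters."""
    features, summary = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        name, count = fields[0], int(fields[-1])
        if name.startswith("__"):
            summary[name] = count
        else:
            features[name] = count
    return features, summary


# Made-up example rows (feature IDs are placeholders)
example = ["gene0\t25\n", "gene1\t0\n", "__no_feature\t310\n", "__ambiguous\t4\n"]
features, summary = split_htseq_counts(example)
print(features)
print(summary)
</syntaxhighlight>

To apply it to a real file, pass an open file handle, for example split_htseq_counts(open("18134XR-29-01.counts")).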


==Project 4. Protein classification using natural language processing==
* Participants: Afsana & Ariel
* Goal: Classify protein sequences
* Week 1. 9/8/2020 Readings:
** [https://arxiv.org/abs/1909.11942 Lan et al (2019)]
* Week 2. Find/Explore ALBERT resources & Tutorials
* [https://github.com/hansaimlim/thesis-works Code from Hansaim Lim]
* [https://huggingface.co/transformers/index.html Transformer: Pretrained models in natural language processing]
* [https://www.biorxiv.org/content/10.1101/2020.09.17.301879v1.full.pdf DNAbert paper]
* [https://github.com/jerryji1993/DNABERT DNAbert: github code]
* Including [https://huggingface.co/transformers/model_doc/albert.html Albert]
* Google albert library: [https://github.com/google-research/albert github]
Sample BioPython script:
<syntaxhighlight lang="python">
#!/usr/bin/env python
import sys
import json
from Bio import SeqIO

alnFile = sys.argv[1] # FASTA file given as the first argument
seqList = [] # initialize a list
for record in SeqIO.parse(alnFile, "fasta"):
    seqList.append({"id": record.id,
                    "seq": str(record[0:3].seq) # residues 1-3; str() converts the Seq object to a string
                })
print(json.dumps(seqList)) # print in JSON format
</syntaxhighlight>
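Language models of the DNABERT type take a sequence as a "sentence" of overlapping k-mer tokens rather than as single characters. The sketch below shows one way to produce such a sentence in plain Python. It is an illustration under that assumption, not code from the DNABERT repository, and the function name is hypothetical.

<syntaxhighlight lang="python">
def seq_to_kmers(seq, k=6):
    """Return overlapping k-mers of seq, joined by spaces (empty string if seq is shorter than k)."""
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


print(seq_to_kmers("ATGGCGT", k=3))  # prints: ATG TGG GGC GCG CGT
</syntaxhighlight>

The same function works on protein sequences, since it only slices the string.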

Latest revision as of 18:10, 14 March 2021

Fall 2020

Participants

  • Eamen Ho: Volunteer research assistant
  • Ramandeep Singh: BIOL 48002
  • Desiree Pante: BIOL 48001
  • Afsana Rahman: Volunteer research assistant
  • Roman Shimonov: BIOL 48002
  • Justin Hiraldo: BIOL 48002
  • Zaheen Hossain: Volunteer research assistant
  • Jerry Sebastian: Volunteer research assistant
  • Ariel Cebelinski: Volunteer research assistant

Schedule

  • Tuesdays at 12 noon - 2pm by Zoom
  • Sept 1, 2020. Week 1. Meet & Greet; Intro to projects
  • Sept 8, 2020. Week 2. Presentations (background, data, and methods), based on assigned readings

Project 1. Structure & evolution of multipartite genome of Lyme disease bacteria

  • Participants: Desiree & Ramon (Summer 2020), Jerry
  • Readings
  • Data set: lp54 & cp26 plasmids
  • TO DO:
    • Week 1. 9/8/2020, 12 noon: 5-slide presentation on multipartite bacterial genome evolution (based on the paper above)
    • Week 2. 9/15, 12 noon: Use the program codonO to calculate codon bias (SCUO) for the replicons (n=23) of the Borrelia burgdorferi B31 genome
    • Week 3. 9/22, 12 noon: codonO paper presentation (Jerry)

Project 2. OspC Cross-reactivity analysis

  • Participants: Justin, Roman
  • Readings: Ivanova et al (2009)
  • Tool: ImageJ
  • Data set (to be sent)
  • To Do
    • Week 1. 9/8/2020 12 noon: 5-slide presentation on background, material & methods, and data capture using ImageJ
    • Week 2. 9/15: Create an Excel sheet to capture immunoblot intensities for C3H mice & P. leucopus. Capture the background for each serum. Get ready to make plots in R/RStudio

Project 3. Clostridium transcriptome analysis

  • Participants: Eamen, Zaheen
  • Readings
  • Data set: posted on "genometracker.org"
    • Wild type transcriptome at 12 hour, paired-end read files:
    • /home/azureuser/18134XR-29-01_S0_L001_R1_001.fastq.gz
    • /home/azureuser/18134XR-29-01_S0_L001_R2_001.fastq.gz
  • To Do
    • Week 1. 9/8/2020 12 noon:
      • A short presentation on C. diff transcriptome (one of the 2 papers above)
      • Demo on read quality using FastQC and mapping reads to reference genomes with bowtie
    • Week 2. Use HTSeq (htseq-count) to quantify RNA abundance for C. diff genes.
    • Commands

According to: reference; Bowtie website

bioseq -i'genbank' R20291.gb > ref.fa # make FASTA file
bowtie2-build ref.fa index # build index
# -S: file to write SAM alignments to (default: stdout)
bowtie2 -x index -S 18134XR.sam -1 ../18134XR-29-01_S0_L001_R1_001.fastq.gz -2 ../18134XR-29-01_S0_L001_R2_001.fastq.gz
# ref.gff3: the sequence ID must match the reference FASTA; rename it with sed "s/Chromosome/FN545816/"
# the GFF3 ID attribute must be set with -i/--idattr; the default is "gene_id"
# note: the bowtie2 step above writes 18134XR.sam; use the same file name in both steps
conda activate qiulab # change environment to access htseq
htseq-count -m union --stranded=yes --idattr=Parent 18134XR-29-01.sam ~/xingmin-cdiff/ref.gff3 > 18134XR-29-01.counts
samtools view -b 18134XR-29-01.sam -o 18134XR-29-01.bam # compress sam file into bam file
