Spring 2021

Participants

Eamen Ho: BIOL48002
Afsana Rahman: BIOL48002
Roman Shimonov: BIOL48002
Zaheen Hossain: BIOL48002
Ariel Cebelinski: BIOL48002
Jean Ady: BIOL48001 & BIOL48002
Mohamed Elgallad: Volunteer
Anh Pham: Volunteer
Jannatul Ashpia: Volunteer

Reading List

Design of broadly active anti-microbial vaccines:
Group B Streptococcus (GBS) CC17 Evolution: Almeida et al (2017)
CoV mutation analysis: Alouane et al (2020)
HIV compartmentalized evolution: Evering et al (2014)

Project 1. Covid mutation analysis

# Shared by Afsana
# Parse an NCBI SARS-CoV-2 GenBank file and report the length of each gene and mature peptide.
# Output columns (tab-separated): gene/mat_peptide name, length, start coordinate, end coordinate
from Bio import SeqIO
import sys

inputFile = sys.argv[1]  # GenBank file given as the first argument
genome_record = SeqIO.read(inputFile, "genbank")  # expects exactly one record in the file
seenProduct = set()  # mat_peptide products that have already been printed
for feature in genome_record.features:
    # "gene" is the feature type; the gene name is stored in its "gene" qualifier
    if feature.type == "gene":
        name = feature.qualifiers["gene"][0]
        loc = feature.location
        # Biopython locations are 0-based and end-exclusive, so len(loc) is the length
        # in bases and start + 1 is the 1-based start coordinate used by GenBank
        length = len(loc)
        loc_start = str(int(loc.start) + 1)  # starting coordinate of gene (1-based)
        loc_end = str(int(loc.end))  # ending coordinate (inclusive)
        print(name + '\t' + str(length) + '\t' + loc_start + '\t' + loc_end)
    # mature peptides are named by their "product" qualifier
    elif feature.type == "mat_peptide":
        name = feature.qualifiers["product"][0]
        if name in seenProduct:
            continue  # skip repeated annotations of a product already printed
        seenProduct.add(name)
        loc = feature.location
        length = len(loc)  # sums all parts of a joined location
        loc_start = str(int(loc.start) + 1)
        loc_end = str(int(loc.end))
        print(name + '\t' + str(length) + '\t' + loc_start + '\t' + loc_end)

# Shared by Afsana
# UNIX COMMANDS W COVID-19 GENOME MUTATION SET (CREATE COUNT OF SYNONYMOUS & NON-SYNONYMOUS MUTATIONS FOR EACH GENE)
grep synonymous cov-snps.tsv | cut -f3 | sort | uniq -c
grep synonymous cov-snps.tsv | cut -f3 | sort | uniq -c > count_synoymous_covid
grep missense  cov-snps.tsv | cut -f3 | sort | uniq -c > count_missense_covid
paste count_missense_covid count_synoymous_covid 
paste count_missense_covid count_synoymous_covid | tr -s ' ' 
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' '
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed ' '
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed "s/ //"
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed "s/ //" | cut -f 1-3 -d ' '
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed "s/ //" | cut -f 1-3 -d ' ' > count_mutations_covid.tsv
paste -d ' '  count_missense_covid count_synoymous_covid | tr -s ' ' | sed "s/ //" | cut -f 1-3 -d ' '
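
The paste commands above join the two count files row by row, which only works when both files list the same genes in the same order. A minimal Python sketch of the same merge, keyed on gene name instead, is given below. The function names and the example counts are hypothetical and not part of the shared commands; the input is assumed to be `uniq -c` output (a count followed by a gene name on each line).

```python
def parse_uniq_counts(lines):
    """Parse `uniq -c` output lines such as '   12 S' into {gene: count}."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, gene = fields
        counts[gene] = int(count)
    return counts


def merge_counts(missense, synonymous):
    """Return sorted (gene, missense, synonymous) rows; a missing gene counts as 0."""
    genes = sorted(set(missense) | set(synonymous))
    return [(g, missense.get(g, 0), synonymous.get(g, 0)) for g in genes]


if __name__ == "__main__":
    # Made-up example lines, for illustration only
    mis = parse_uniq_counts(["   3 S", "   1 N"])
    syn = parse_uniq_counts(["   2 S", "   4 ORF8"])
    for gene, n_mis, n_syn in merge_counts(mis, syn):
        print(gene, n_mis, n_syn, sep="\t")
```

Keying on the gene name means a gene with missense but no synonymous mutations (or the reverse) still gets a row, rather than shifting every later row out of alignment.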

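# Shared by Roman
# Compare synonymous vs. nonsynonymous (missense) counts per gene
# Assumes `covid` is a data frame with columns Gene, Missense, and Synonymous
library(tidyverse)
library(ggrepel)
ggplot(data=covid, aes(x=Missense,y=Synonymous, label=Gene)) + geom_point() + geom_text_repel() + geom_smooth(method="lm")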

# Shared by Eamen: Python code for COVID mutation counts
import sys

gene_name = {}  # gene -> counts of missense, synonymous, and their total
with open(sys.argv[1]) as file:
    for line in file:
        fields = line.split()
        if fields[7].startswith("noncoding"):
            continue  # skip noncoding sites
        # create the counters the first time a gene is seen
        counts = gene_name.setdefault(
            fields[2], {'total': 0, 'missense': 0, 'synonymous': 0})
        # column 8 may list several comma-separated mutation types
        for item in fields[7].split(','):
            if item in ('missense', 'synonymous'):
                counts['total'] += 1
                counts[item] += 1

# print gene, missense count, synonymous count, and total (tab-separated)
for gene, counts in gene_name.items():
    print(gene + "\t" + str(counts['missense']) + "\t" + str(counts['synonymous'])
          + "\t" + str(counts['total']))

Project 2. Group B Streptococcus genomics

GenBank submission of assemblies

Project 3. HIV compartmentalized evolution

Project 4. DNABERT classification

Fall 2020

Participants

Eamen Ho: Volunteer research assistant
Ramandeep Singh: BIOL 48002
Desiree Pante: BIOL 48001
Afsana Rahman: Volunteer research assistant
Roman Shimonov: BIOL 48002
Justin Hiraldo: BIOL 48002
Zaheen Hossain: Volunteer research assistant
Jerry Sebastian: Volunteer research assistant
Ariel Cebelinski: Volunteer research assistant

Schedule

Tuesdays at 12 noon - 2pm by Zoom
Sept 1, 2020. Week 1. Meet & Greet; Intro to projects
Sept 8, 2020. Week 2. Presentations (background, data, and methods), based on assigned readings

Project 1. Structure & evolution of multipartite genome of Lyme disease bacteria

Participants: Desiree & Ramon (Summer 2020), Jerry
Readings
- Review: deCenzo & Finan (2017).
Data set: lp54 & cp26 plasmids
TO DO:
- Week 1. 9/8/2020, 12 noon: 5-slide presentation on multipartite bacterial genome evolution (based on the paper above)
- Week 2. 9/15, 12 noon: Use the program codonO to calculate codon bias (SCUO) for the replicons (n=23) of the Borrelia burgdorferi B31 genome
- Week 3. 9/22, 12 noon: codonO paper presentation (Jerry)
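
For orientation before running codonO, here is a rough Python sketch of an entropy-based SCUO calculation for a single coding sequence. It is an illustration under assumptions, not codonO's implementation: the function name `scuo` is hypothetical, the standard codon table is used (bacterial translation table 11 assigns the same amino acids), stop codons and single-codon amino acids (Met, Trp) are excluded, and each amino acid is weighted by its share of the counted codons.

```python
import itertools
import math
from collections import Counter, defaultdict

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def scuo(cds):
    """Entropy-based codon usage bias for one coding sequence (0 = none, 1 = maximal)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codon_counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group observed counts by amino acid, one entry per synonymous codon
    by_aa = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa[aa].append(codon_counts.get(codon, 0))

    weighted_bias = 0.0
    total = 0
    for counts in by_aa.values():
        n, k = sum(counts), len(counts)
        if k < 2 or n == 0:
            continue  # skip Met/Trp and amino acids absent from the sequence
        entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
        max_entropy = math.log2(k)  # entropy under uniform codon usage
        weighted_bias += n * (max_entropy - entropy) / max_entropy
        total += n
    return weighted_bias / total if total else 0.0


if __name__ == "__main__":
    print(scuo("GCTGCTGCT"))     # alanine encoded by a single codon only
    print(scuo("GCTGCCGCAGCG"))  # all four alanine codons used equally
```

A per-replicon value could then be summarized from the per-gene values, but how codonO aggregates genes should be taken from its paper rather than from this sketch.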

Project 2. OspC Cross-reactivity analysis

Participants: Justin, Roman
Readings: Ivanova et al (2009)
Tool: ImageJ
Data set (to be sent)
To Do
- Week 1. 9/8/2020 12 noon: 5-slide presentation on background, material & methods, and data capture using ImageJ
- Week 2. 9/15: Create an Excel sheet to capture immunoblot intensities for C3H mice & P. leucopus. Capture the background for each serum. Get ready to make plots in R/RStudio
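
Once the band intensities and per-serum backgrounds are captured, the background correction can be sketched as below. This is only an illustration under assumptions: the function name, the nested-dictionary layout, the serum and antigen labels, and the choice to floor negative values at zero are all hypothetical, and the numbers are made up.

```python
def subtract_background(intensities, background):
    """Subtract each serum's background from its band intensities, flooring at zero.

    intensities: {serum: {antigen: raw intensity}}
    background:  {serum: background intensity measured for that serum}
    """
    corrected = {}
    for serum, bands in intensities.items():
        bg = background[serum]
        corrected[serum] = {
            antigen: max(value - bg, 0.0) for antigen, value in bands.items()
        }
    return corrected


if __name__ == "__main__":
    # Made-up values, for illustration only
    raw = {"serum1": {"OspC-A": 120.0, "OspC-B": 30.0}}
    bg = {"serum1": 40.0}
    print(subtract_background(raw, bg))
```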

Project 3. Clostridium transcriptome analysis

Participants: Eamen, Zaheen
Readings
Data set: posted on "genometracker.org"
- Wild type transcriptome at 12 hour, paired-end read files:
- /home/azureuser/18134XR-29-01_S0_L001_R1_001.fastq.gz
- /home/azureuser/18134XR-29-01_S0_L001_R2_001.fastq.gz
To Do
- Week 1. 9/8/2020 12 noon:
  - A short presentation on C. diff transcriptome (one of the 2 papers above)
  - Demo on read quality using FastQC and mapping reads to reference genomes with bowtie
- Week 2. Use HT-Seq to quantify RNA abundance for C. diff genes.
  - HTSeq installed
  - Try this protocol first
- Commands

According to: reference; Bowtie website

bioseq -i'genbank' R20291.gb > ref.fa # make FASTA file
bowtie2-build ref.fa index # build index
# -S: file to write SAM output to (default is stdout)
bowtie2 -x index -S 18134XR-29-01.sam -1 ../18134XR-29-01_S0_L001_R1_001.fastq.gz -2 ../18134XR-29-01_S0_L001_R2_001.fastq.gz
# ref.gff3: need to run sed "s/Chromosome/FN545816/"
# need to use "-i"; default is "gene_id"
conda activate qiulab # change environment to access htseq
htseq-count -m union --stranded=yes --idattr=Parent 18134XR-29-01.sam ~/xingmin-cdiff/ref.gff3 > 18134XR-29-01.counts
samtools view -b 18134XR-29-01.sam -o 18134XR-29-01.bam # compress sam file into bam file
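
As a possible follow-up to the commands above, the sketch below reads htseq-count output and converts raw counts to counts per million (CPM). It assumes the usual tab-separated layout with the feature ID first and the count last, and it skips the summary rows that htseq-count prefixes with `__` (for example `__no_feature`). The function names and the CPM normalization are illustrative choices, not part of the protocol referenced here.

```python
def read_htseq_counts(lines):
    """Parse htseq-count output lines into {feature_id: count}, skipping summary rows."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("__"):
            continue  # blank line or a summary row such as __no_feature
        fields = line.split("\t")
        counts[fields[0]] = int(fields[-1])
    return counts


def counts_per_million(counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: c * 1e6 / total for feature, c in counts.items()}


if __name__ == "__main__":
    # Made-up lines, for illustration only
    example = ["geneA\t30\n", "geneB\t10\n", "__no_feature\t5\n"]
    print(counts_per_million(read_htseq_counts(example)))
```

With a real file, the lines could come from `open("18134XR-29-01.counts")`.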

Project 4. Protein classification using natural language processing

Participants: Afsana & Ariel
Goal: Classify protein sequences
Week 1. 9/8/2020 Readings:
- Rives et al (2019)
- Lan et al (2019)
Week 2. Find/Explore ALBERT resources & Tutorials
Code from Hansaim Lim
Transformer: Pretrained models in natural language processing
DNABERT paper
DNABERT: GitHub code
Including ALBERT
Google ALBERT library: GitHub

Sample BioPython script:

#!/usr/bin/env python

import sys
import json
from Bio import SeqIO

alnFile = sys.argv[1] # FASTA file given as the first argument
seqList = [] # initialize a list
for record in SeqIO.parse(alnFile, "fasta"):
    seqList.append({"id": record.id,
                    "seq": str(record.seq[0:3]) # residues 1-3; str() converts the Seq object to a string
                })

print(json.dumps(seqList)) # print in JSON format

Undergrad Research Experience
