Summer 2021: Difference between revisions

From QiuLab
imported>Weigang
* Tools & Reading list
** [https://github.com/facebookresearch/esm Facebook ESM: pre-trained language models (for feature extraction)]
*** Step 1. Implementation with Colab
*** Step 2. Fine-tuning with OspC sequences; extract embeddings
*** Step 3. Applications: classify (OspC vs VlsE), contact map (native vs synthetics), solubility
** Strodthoff et al (2020). Bioinformatics. [https://academic.oup.com/bioinformatics/article/36/8/2401/5698270 UDSMProt: universal deep sequence models for protein classification]. [https://github.com/nstrodt/UDSMProt Source code on Github]
** [https://www.pnas.org/content/118/15/e2016239118 Rives et al (2021). PNAS. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.] [https://github.com/facebookresearch/esm Github repository]
** Transformer: [https://arxiv.org/pdf/1706.03762.pdf Vaswani et al (2017). Attention is All You Need] [https://github.com/huggingface/transformers GitHub repository for Hugging Face Transformers]

Revision as of 20:17, 8 June 2021

Group meeting schedule

  • June 3, 2021 (Thursday). Summer research kickoff
  • June 8, 2021 (Tuesday). NLP models of protein structure (Eamen, Roman, Edgar)
  • June 10, 2021 (Thursday).

Project 1. Borrelia genomics

Project 2. Design algorithms for vaccines

  • Participants: Dr. Saad Mneimneh (CS Department), Brian
  • Questions & Goals:
    • Generalized algorithms for antigen with arbitrary tree shape
      • Data set 1. Neutral evolution (with exponentially distributed branch lengths). Binary strings (L=100 bits) evolved from a coalescent tree of 20 leaves. Simulated with rcoal(20); rTraitDisc; simSeq(). code from previous work
      • Data set 2. Two major clades. HA sequences from fluB
      • Data set 3. Four major clades. Dengue
      • Data set 4. Star-shaped tree, driven by recombination. OspC
      • Data set 5. Multiple major clades. vls cassette in Lyme species
    • Combination algorithms
    • Naive Bayes models to integrate immunogenicity data
    • Natural language models to improve structural stability (see Project 4 below)
  • Reading list
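
Data set 1 above (binary strings evolved on a 20-leaf coalescent tree) can be sketched in plain Python as follows. This is only an illustration, not the R code from previous work: the function names, the symmetric two-state flip model, and the rate of 0.1 are assumptions made for the example.

```python
import math
import random

def simulate_coalescent(n_leaves, rng):
    """Kingman coalescent: return (parent, branch, root).

    Leaves are nodes 0..n_leaves-1; internal nodes are numbered upward.
    Waiting times between mergers are exponentially distributed.
    """
    active = list(range(n_leaves))
    height = {i: 0.0 for i in active}
    parent, branch = {}, {}
    now, next_id = 0.0, n_leaves
    while len(active) > 1:
        k = len(active)
        now += rng.expovariate(k * (k - 1) / 2)  # rate = number of lineage pairs
        a, b = rng.sample(active, 2)             # pick the pair that merges
        for child in (a, b):
            parent[child] = next_id
            branch[child] = now - height[child]
            active.remove(child)
        height[next_id] = now
        active.append(next_id)
        next_id += 1
    return parent, branch, active[0]

def evolve_bits(parent, branch, root, n_bits, rate, rng):
    """Evolve a random root bit string down the tree (assumed symmetric model)."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    seqs = {root: [rng.randint(0, 1) for _ in range(n_bits)]}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            # Probability that a bit differs after a branch of this length
            p_flip = 0.5 * (1.0 - math.exp(-2.0 * rate * branch[child]))
            seqs[child] = [b ^ (rng.random() < p_flip) for b in seqs[node]]
            stack.append(child)
    return seqs

rng = random.Random(2021)
parent, branch, root = simulate_coalescent(20, rng)
seqs = evolve_bits(parent, branch, root, n_bits=100, rate=0.1, rng=rng)
leaves = ["".join(map(str, seqs[i])) for i in range(20)]
```

The list leaves then holds 20 strings of 100 bits each, one per tip of the simulated tree.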

Project 3. HIV compartmentalized evolution

  • Participants: Lily
  • Questions and goals
    • Does HIV evolve cell-type tropism within the host? Specifically, neural (N) tropism vs T-cell (T) tropism?
    • Build a classifier of N-tropism HIV subtypes
    • A presentation for an HIV conference in October
  • Reading list
  • Data sets
    • ~500 sequences of env genes from 15 patients
    • 2nd time point single-cell genome sequences for some of the patients
    • Experimentally verified N-tropism subtypes
  • Approach
    • Evolutionary mechanisms: mutation, recombination, and adaptive selection
    • Homoplasy index (HI) as a measure of compartmentalization? Use randomization to obtain p-values for HI.
    • Evolutionary rates & signature (BEAST)
    • Tests of natural selection (PAML site models, branch-site models & MK analysis)
    • Phylogenetic analysis: tree per individual; supertree; haplotype networks (per individual)
    • Simulated compartmentalization
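
The homoplasy-index idea in the approach above can be sketched as follows, under one possible reading: the compartment (N or T) is treated as a single character on a rooted binary tree, steps are counted by Fitch parsimony, HI is taken as 1 minus the consistency index, and the p-value comes from shuffling the tip labels. The function names, the nested-tuple tree encoding, and the toy tree are assumptions for illustration only.

```python
import random

def fitch_steps(tree, labels):
    """Minimum number of state changes (Fitch parsimony) on a rooted binary tree.

    tree: nested 2-tuples with tip names (str) at the leaves.
    labels: dict mapping tip name -> compartment label.
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):
            return {labels[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # no shared state: one change is required here
        return left | right

    state_set(tree)
    return steps

def homoplasy_index(tree, labels):
    """HI = 1 - (minimum possible steps / observed steps)."""
    n_states = len(set(labels.values()))
    observed = fitch_steps(tree, labels)
    return 1.0 - (n_states - 1) / observed if observed else 0.0

def permutation_p_value(tree, labels, n_perm=999, seed=0):
    """Fraction of label shuffles with HI at least as low as the observed HI."""
    rng = random.Random(seed)
    tips = sorted(labels)
    states = [labels[t] for t in tips]
    observed = homoplasy_index(tree, labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(states)
        if homoplasy_index(tree, dict(zip(tips, states))) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy example: two clades that match the two compartments perfectly
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
labels = {t: "N" for t in "abcd"}
labels.update({t: "T" for t in "efgh"})
hi = homoplasy_index(tree, labels)
p = permutation_p_value(tree, labels)
```

A low HI with a small p-value would suggest that sequences cluster by compartment more than expected by chance; for real data the tree would come from the phylogenetic analysis rather than being written by hand.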

Project 4. Natural Language models of proteins

  • Participants: Eamen, Roman, and Edgar
  • Questions & Goals
  1. Learn, implement, and compare the existing tools
  2. Fine-tuning for OspC, to be integrated with the centroid algorithm
  3. 2nd-generation centroid design: k-means algorithm (with applications to vls, Dengue, flu B)
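
One way to picture goal 3 is a k-means-style loop on aligned sequences, sketched below. It is not the lab's centroid algorithm: it swaps in Hamming distance and a per-column majority consensus as the "mean" (a k-modes variant), and the function names and toy sequences are made up for the example.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def consensus(seqs):
    """Per-column majority residue; stands in for the cluster mean."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def kmodes_centroids(seqs, k, n_iter=20, seed=0):
    """Alternate between assigning sequences and recomputing consensus centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(seqs, k)  # start from k randomly chosen sequences
    assignment = None
    for _ in range(n_iter):
        new_assignment = [
            min(range(k), key=lambda j: hamming(s, centroids[j])) for s in seqs
        ]
        if new_assignment == assignment:
            break  # converged: no sequence changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [s for s, a in zip(seqs, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = consensus(members)
    return centroids, assignment

# Toy alignment with two obvious groups
toy = ["AAAAAAAA", "AAAAAAAT", "AAAAAATA", "CCCCCCCC", "CCCCCCCG", "CCCCCCGC"]
centroids, assignment = kmodes_centroids(toy, k=2)
```

Each returned centroid is a synthetic consensus sequence for one cluster, which is the sense in which a single centroid design would generalize to antigens with several major clades (vls, Dengue, flu B).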