Bioinformatics Workshop 2014: Difference between revisions

From QiuLab
imported>Ntino
(email address update)
imported>Levy
(Corrected Powerpoint PDF link)
 
(102 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<center>'''Summer Bioinformatics Workshop''' (BIOL 470.83/790.86, Summer II 2014)</center>
<center>'''Summer Bioinformatics Workshop''' (BIOL 470.83/790.86, Summer II 2014)</center>
<center>'''Instructors:''' Dr Konstantinos Krampis & Levy Vargas</center>
<center>'''Instructors:''' Drs Konstantinos Krampis & Weigang Qiu, Levy Vargas</center>
<center>'''Room:'''1001B HN (10th Floor, North Building)</center>
<center>'''Room:'''1001B HN (10th Floor, North Building)</center>
<center>'''Hours:''' Tues & Thur 11:30 am-15:00</center>
<center>'''Hours:''' Tues & Thur 11:30 am-15:00</center>
Line 30: Line 30:
===Textbook===
===Textbook===


St. Clair & Visick (2010). ''Exploring Bioinformatics: A Project-Based Approach''. Jones and Bartlett Publishers, Sudbury, Massachusetts. (ISBN 978-0-7637-5829-5)
No textbook is required; handouts will be provided in class.
 
This book should be available through several popular retailers and resellers online.


===Grading & Academic Honesty===
===Grading & Academic Honesty===
Line 41: Line 39:


The grading scheme for the course is as follows ('''Subject to some change. You will be notified with sufficient time'''):
The grading scheme for the course is as follows ('''Subject to some change. You will be notified with sufficient time'''):
*Assignments (60%): 7 exercises (10 points each).
* In-Class Assignments: 8 exercises, 20 points each. (Attendance is mandatory)
*Final exam (30%)
* Weekly assignment: 4 exercises, 10 points each
**Bioinformatics terminology and concepts
* Mid-term: 50 points, on July 24
**Use of web-based Bioinformatics databases (e.g., NCBI) and tools (e.g., BLAST, CLUSTALW, PHYLIP, ORF-Finder)
* Final exam: 50 points, on August 14
**Ability to interpret an algorithm and its Perl implementations
*Classroom Q & A (5%): Read the chapters before lecture.
*Attendance (5%): 1-2 absences = -2.5%. More than 2 = -5%.
*Email help: Include course code ''("BIOL470", or "BIOL790")'' in the subject line


===Programming Assignment Expectations===
===Programming Assignment Expectations===
Line 60: Line 54:


==Course Schedule (Tuesdays and Thursdays)==
==Course Schedule (Tuesdays and Thursdays)==
<span style="color:red;font-weight:bold;font-size:large;">Dates and assignments below are subject to some change</span>


'''"Lecture slides" links will be available either during or before each lecture, in PDF.'''


'''Homework assignments are due the class *after* the date under which they appear.'''
===July 15. Course Overview & Lab Setup===


===July 15===
*'''Course Overview'''
*'''Course Overview'''
*'''Scope of Bioinformatics''' (Chapter 1)([[Media:Scope.pdf|Lecture Slides-Che]])
*'''LECTURE SLIDES''' ([[Media:Scope-2014.pdf|Bioinformatics]])
*'''WORKSHOP SLIDES''':([[Media:BioTeach1.pdf|Lecture Slides-Slav]])
*'''WORKSHOP SLIDES''':([[Media:BIOL_470.83_790.86_WS1.pdf|Workshop1]])
*'''Workshop 1''': NCBI/OMIM Database
*'''Workshop on Linux proficiency''':  
*'''Workshop 2''': UNIX Operating System
**Terminal & the bash shell
**Terminal & Home Directory
**Text editing
**The vi Editor
**First program
**first basic program
{| class="collapsible collapsed wikitable"
{| class="collapsible collapsed wikitable"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #1
! Assignment #1 DUE: July 22
|- style="background-color:powder blue;"
| '''Read''' ([[Media:whatisbioinf1.pdf|What is Bioinformatics]], public via arXiv [[http://arxiv.org/abs/0911.4230v1]])
|- style="background-color:powder blue;"
| '''Required reading''' ([[Media:exgen.pdf|Expression of Genetic Information]], public via NCBI bookshelf [[http://www.ncbi.nlm.nih.gov/books/NBK9842/]])
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
| '''Linux Proficiency'''<br />
| '''Bioinformatics questions'''<br />
#Install ActivePerl (if you use Windows; Not necessary if you have Mac OS X)
# (5 pts) You are given a long DNA string. Describe one or two steps an algorithm should take in order to find genes on the DNA string.
#Install vim (if you use Windows; Not necessary if you have Mac OS X)
# (5 pts) What is the difference between a "Computer Algorithm" and a "Computer Program"? Can a Program include an Algorithm?
# (5 pts) OMIM Question
# (5 pts) Analyze each line of your "Opposite strand code" written in class and describe what it does. (please write your answer)
|-style="background-color:powderblue;"
|-style="background-color:powderblue;"
| '''Read''' Chapter 2
| '''Linux proficiency tests'''<br />
|}
: (5 pts) Install and determine the version of bash and version of Perl on your computer.
::Windows: Install bash (choose one)
:::[https://cygwin.com/install.html Cygwin]
:::[http://git-scm.com/downloads Git bash]
::Windows: Install Perl (choose one)
:::[http://strawberryperl.com Strawberry Perl]
:::[http://www.activestate.com/activeperl/downloads ActivePerl]
 
::Mac: Perl should already be installed. Use the Terminal to access bash.
::Linux & others: These are acceptable; however, no installation instructions will be provided.
 
::Print your version of both Perl and bash installed on your system with the following two commands:
 
:::<b>bash --version</b>
:::<b>perl -V</b>
 
::NOTE: Your output must conform to the code standards in the syllabus above.
 
: (5 pts) Some cases of Alzheimer's Disease have been associated with mutations in the PSEN1 gene. One study indicated that a single G to T mutation resulted in deletion of exon 9. As a consequence, amino acids 290-319 were no longer translated. Using any online database (OMIM, NCBI), answer the following questions. Include the full URL of your source.
 
:# On which chromosome is the PSEN1 gene located?
:# Which Alzheimer's Disease type(s) is associated with PSEN1?
:# Copy the whole protein sequence and indicate where amino acids 290-319 are located. The protein sequence of PSEN1 can be accessed from NCBI here: [http://www.ncbi.nlm.nih.gov/protein/15079861?report=fasta PSEN1] <nowiki>http://www.ncbi.nlm.nih.gov/protein/15079861?report=fasta</nowiki>
:# What is the overall length of the normal protein? What is the length of the deletion in AAs?
:# What percentage of the protein is lost when the amino acids were deleted?


===July 17===
*'''Chapter 2.''' Central Dogma & Molecular Biology terms (Chapter 2)([[Media:chap2.pdf|Lecture Slides-Che]])
*'''Workshop 2''': ([[Media:Bioteach2.pdf|Lecture Slides-Slav]])
**Linux tutorial
**Basic Perl (Appendix B1 & B2, pg.310-318)
**'''Algorithm 2''': Transcription
{| class="collapsible collapsed wikitable"
|- style="background-color:lightsteelblue;"
! Assignment #2
|- style="background-color:powderblue;"
| '''Linux Proficiency'''<br />
#"Web Exploration" (pg. 25-27, 7 questions)
#"Running the Program" (pg.33). Show source code, input, and output
#Using the code we have written in class and your newfound understanding of Perl, write a program that prompts the user to enter a DNA sequence and then prints the translation. Assume that the user will provide a sequence that consists of only upper-case A, T, G, and C AND that the sequence will have a length that is a multiple of three. In addition to using the hash of amino acids and their one-letter codes, your program should incorporate some or all of the following:
*length($string): (return a number equal to the length of the variable specified inside the parentheses).
*while (CONDITION) { LINES OF CODE } : (repeatedly execute the instructions within the curly brackets as long as the conditions inside the parentheses are met).
*if (CONDITION) { LINES OF CODE } : (instructions within the curly brackets are executed only if the condition in the parentheses is met).
#
|-style="background-color:powderblue;"
|-style="background-color:powderblue;"
| '''Review''' Chapter 2
|}
|}
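The percentage calculation asked for in the PSEN1 question of Assignment #1 can be sketched on the command line. This is a minimal illustration only: the function name `percent_lost` is hypothetical, and the protein length in the example is a made-up number, not the PSEN1 value, which should be read from the NCBI record.

```shell
#!/usr/bin/env bash
# percent_lost FIRST LAST TOTAL
# Print the percentage of a protein removed when residues FIRST..LAST
# (inclusive) are deleted from a protein of TOTAL residues.
# The function name and argument order are illustrative assumptions.
percent_lost() {
    awk -v first="$1" -v last="$2" -v total="$3" \
        'BEGIN { printf "%.2f\n", (last - first + 1) / total * 100 }'
}

# Example with made-up numbers: residues 101-150 of a 500-residue protein.
percent_lost 101 150 500
```

Because the range is inclusive, the deletion length is LAST minus FIRST plus one; forgetting the "plus one" is a common off-by-one mistake.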


===July 22===
===July 17. The Central Dogma of Molecular Biology===
*'''Chapter 2.''' Central Dogma & Molecular Biology (continued) [Lecture Slides Ch.2]
*'''LECTURE SLIDES'''  ([[Media:chap2.pdf|Genes, Proteins, Mutations]])
*'''Workshop 3''': ([[Media:BioTeach3.pdf|Lecture Slides-Slav]])
*'''WORKSHOP SLIDES''':([[Media:BIOL_470.83_WS2a.pdf|Workshop 2]]) ''with corrections July 22''
**'''Perl''' (Appendix B3 & B4, pg. 318-322)
*'''Workshop on Linux proficiency:'''
**'''Algorithm 3''': Translation
**Managing files with bash commands
**Editing with vi
**Writing programs in Perl
 
===July 22. Sequence alignment & homology searching with BLAST===
*'''LECTURE SLIDES''':([[Media:Chap3.pdf|Lecture 3: Gene Alignments and Homology]])
*'''READING MATERIAL''':([[Media:Exploring_Bioinformatics_-_Chp4.pdf|Reading material on sequence alignments and BLAST algorithm]])
*'''WORKSHOP SLIDES''':([[Media:BIOL_470.83_WS3.pdf|Workshop 3]])
*'''Workshop on Linux proficiency:'''
**Input/Output with bash
**Perl Data
**Perl Input/Output
{| class="collapsible collapsed wikitable"
{| class="collapsible collapsed wikitable"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #3
! Assignment #2 DUE: July 29
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
| '''Linux Proficiency'''<br />
| '''Bioinformatics questions'''<br />
#"Running the Program" (pg.37). Input your own sequences. Show input and output, but do NOT print the source code.
#"Putting Your Skills into Practice" Q6 & Q7 (pg.37-38). Show source code (Q7 only), input (Q6 & Q7), and outputs (Q6 & Q7).
#Explain when you would use the following UNIX commands. Your answer should indicate whether the command requires any arguments: cd, pwd, man, cp, cat, mkdir, rm, grep, wc.
#Choose three commands from the list above and describe two options/arguments which modify the way in which the command functions.
#Describe what the following commands do in your own words:
  cat Sickle_Protein_FASTA | wc
  cat Sickle_Protein_FASTA > wc
  cat Sickle_Protein_FASTA >> wc
  ls -lh /User/Desktop/FASTA_FILES
|-style="background-color:powderblue;"
|-style="background-color:powderblue;"
| '''Read''' Chapter 3
| '''Linux proficiency tests'''<br />
|}
Some diseases have animal models which are useful for studying the disease of interest. Use NCBI's '''BLAST''' tool to help choose between two common laboratory animals for PSEN1.
# Go to the PSEN1 gene in NCBI's GenBank http://www.ncbi.nlm.nih.gov/protein/15079861
# Select "Run BLAST" from the "Analyze this sequence" section
# In the "Choose a Search Set" section, enter "Mus musculus" in the Organism field
# Add another Organism by clicking the + button next to the first field, and enter "Rattus rattus"
# Go down to the bottom and push the BLAST button and wait for the results


===July 24===
Answer the following:
*'''Chapter 3.''' NCBI Databases/Tools; Gene alignments: ([[Media:chap3.pdf|Lecture Slides-Ch3]])
# Which animal would be ideal for a model? (1 point)
*'''Workshop 4''':
# In a sentence or two, how do the BLAST results suggest this? (2 points)
**'''Web Exploration''' (pg.60-66)
# Which hit(s) make(s) the strongest case? Describe the relevant value(s). (2 points)
**'''Algorithm 3''': Translation
{| class="collapsible collapsed wikitable"
|- style="background-color:lightsteelblue;"
! Assignment #4
|- style="background-color:powderblue;"
| '''Linux Proficiency'''<br />
#Create a new vi file with the provided code, grant appropriate file permissions and run the script using a FASTA file as an argument. In a single sentence, describe what the code does. 
#Add comments to every line of code explaining what it does.
<pre>
#!/usr/bin/perl


use strict;
Download a table of hit results by selecting the "Download" widget near the top of the browser window. Select "Hit Table (csv)" and save it. To make this file readable, change the commas into tabs, by using the '''tr''' command. Research how to use the tr command to reformat the file and save the output to a new file. Finally, use '''cat''' with the '''-n''' option on the new file in order to number the lines and name it '''hit_results.txt'''.
use warnings;


die "Usage: $0 <Fasta_File>\n" unless @ARGV >0;
Answer or show the following:
my $filename = shift(@ARGV);
# Show the contents of the file hit_results.txt. (1 point)
# Show how tr and cat were used on the command line (omit output). (1 point)
# How many total hits are in hit_results.txt? (1 point)
# How many mouse hits are in hit_results.txt? (1 point)
# How many rat hits are in hit_results.txt? (1 point)
|-style="background-color:powderblue;"
|}
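The reformatting step described in the Linux proficiency test above (commas to tabs with '''tr''', then line numbering with '''cat -n''') can be sketched as follows. This is a minimal illustration, assuming a comma-separated input file; the function name `csv_to_numbered_tsv` and the input file name `hits.csv` are hypothetical.

```shell
#!/usr/bin/env bash
# csv_to_numbered_tsv INPUT OUTPUT
# Replace every comma with a tab (tr), then number the lines (cat -n).
# The function name and file names are illustrative assumptions.
csv_to_numbered_tsv() {
    input="$1"
    output="$2"
    # tr reads standard input, so the file is supplied with "<".
    tr ',' '\t' < "$input" | cat -n > "$output"
}

# Example (hypothetical input file name):
#   csv_to_numbered_tsv hits.csv hit_results.txt
```

The pipe combines the two steps into one command; writing the tr output to an intermediate file and then running cat -n on that file gives the same result.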
* July 23 ANNOUNCEMENT: Current participants are invited to join the Google Group for help, questions, and discussion. Send an email to [mailto:biowork2014+subscribe@googlegroups.com biowork2014+subscribe@googlegroups.com] or visit https://groups.google.com/d/forum/biowork2014. IT IS VERY IMPORTANT that you GET MEMBERSHIP in this group. We will use it for answering questions over the weekend for the midterm.


my $dna_string = '';
===July 24. Alignment and Phylogenetics===
*'''LECTURE SLIDES''':([[Media:Chap8.pdf|Lecture 4:  Alignments, Homology, Molecular Phylogenetics]])
*'''REQUIRED READING''':([[Media:Faint.pdf|(Not) for the faint of the heart: Molecular Phylogenetics]])
*'''WORKSHOP SLIDES''':([[Media:BIOL_470.83_WS4.pdf|Workshop 4]])
*'''Workshop on Linux proficiency:'''
**Remote access
**Accounts and passwords
**File transfers


open (FILE, $filename) or die "Cannot open $filename: $!\n";
===July 29. Structure of human genome & genes===
* Lecture & Workshop Slides: [[Media:July-29-genome-gene-structure.pptx|Gene Genome Structure Powerpoint]] or [[Media:July-29-genome-gene-structure-allframes.pptx.pdf|PDF]] ''PDF updated with animation slides''
* Perl Workshop Slides: [[Media:BIOL_470.83_WS5.pdf|Workshop 5]]
* Tree-thinking Quizzes: [[Media:Baum etal05 Quiz.pdf|Baum etal 2005 Quiz PDF]]


while ( <FILE> ) {
===July 31. Macro-evolution: Cross-species comparisons===
        my $line = $_;
* Learning Goal: Cross-species comparisons
        chomp $line;
* Perl & bash Workshop Slides: [[Media:BIOL_470.83_WS6.pdf|Workshop 6]]
        if ($line =~ /^>/) {
*Web Exercise 1. Cross-species comparisons with HomoloGene
                print $line, "COMPLEMENT\n";
# From the NCBI "TAS2R38" Gene page, click "HomoloGene" link under the "Related Information" (right-side navigation panel)
                next;
# You should see a page listing TAS2R38 orthologous (i.e., same gene in different species) genes from 7 mammalian species, including human (''Homo sapiens''), chimpanzee (''Pan troglodytes''), macaque (''Macaca mulatta''), dog (''Canis lupus familiaris''), cow (''Bos taurus''), rat (''Rattus norvegicus''), and mouse (''Mus musculus'').
        }
# Write down your expectations for the following species relationships:
        else {
## Is chimpanzee more closely related to macaque or to human?
                $dna_string .= $line;
## Is dog more closely related to mouse or to cow?
                next;
## Are rat and mouse more closely related than human and chimpanzee?
}
# Click on the link "Show Pairwise Alignment Scores" under "Protein Alignments" and fill in the following table when the page loads. Do these sequence-comparison results change your expectations above? Explain.
}
<center>
 
{| class="wikitable"
for (my $i=0; $i<length($dna_string); $i++) {
|-
        my $nucleo = substr($dna_string,$i,1);
! Species pair !! % Protein Sequence Differences !! % DNA Seq Differences
        if ( $nucleo eq "A" ) { print "T"; }
|-
        elsif ( $nucleo eq "C" ) { print "G"; }
| Chimp-Human || ? || ?
        elsif ( $nucleo eq "G" ) { print "C"; }
|-
        else { print "A"; }
| Chimp-Macaque || ? || ?
}
|-
 
| Dog-Cow || ? || ?
close FILE;
|-
</pre>
| Dog-Mouse || ? || ?
|-style="background-color:powderblue;"
|-
| '''Read''' Chapter 6
| Rat-Mouse || ? || ?
|}
|}
 
</center>
===July 29===
You can find exact differences by clicking on "Blast" for each pairwise comparison.
*'''Chapter 6.''' Gene Prediction ([[Media:Chap6.pdf|Lecture Slides-Ch6]])
* Movie Break: [http://media.hhmi.org/biointeractive/films/OriginSpecies-Lizards.html Origin of Species: Lizards in an Evolutionary Tree]
*'''Workshop 5''':
# What are the two hypotheses explaining the origin of different ecomorphs of lizards on Caribbean Islands?
**'''Web Exploration''' (pg.168-174)
# What is the expected phylogeny under each hypothesis?
**'''Algorithm 3''': TBD
# Which hypothesis is supported by the phylogeny of actual DNA sequences?
* Web Exercise 2. Derive your own tree of Anole lizards
[[File:Anole-tree.png|thumbnail|Tree of Anole Lizards]]
# Copy and paste [http://media.hhmi.org/biointeractive/activities/lizard/Anolis-DNA-sequences.txt the lizard DNA sequences] into a text editor. Using underscores, attach geographic location and ecomorph of each species according to [http://media.hhmi.org/biointeractive/activities/lizard/Lizard-Cards-Color.pdf the lizard card]
# Go to the [http://www.phylogeny.fr the phylogeny.fr web] and select "Phylogenetic Analysis" and then "One Click" analysis
# Copy and paste your edited sequences into the text box and click on "Submit"
# When analysis is finished, you should see a phylogenetic tree. Answer the following two questions:
## Are species grouped by geography or by habitats?
## Which hypothesis is supported by your phylogenetic tree?
{| class="collapsible collapsed wikitable"
{| class="collapsible collapsed wikitable"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #5
! Assignment #3 DUE: Aug 5
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
| '''Linux Proficiency'''<br />
| '''Bioinformatics questions'''<br />
#Manually translate the following sequence in all 6 reading frames (use one-letter amino acid code): 5'-GTTCCCTCTCGGGT-3'. '''Show your work'''
* Make a printout of [http://www.ncbi.nlm.nih.gov/nuccore/V00488 this GenBank file with the Accession "V00488"] (print ONLY the part with the DNA sequence). Indicate (with highlights or texts) the following gene elements: (a) all introns, (b) all exons, (c) 5'-UTR and 3'-UTR, (d) start and stop codons.
#Modify your translation script (one-letter code version) so that it translates a DNA sequence in all six reading frames.  Use your script to find the correct reading frame of the given sequence. '''Show your code, input and output''' (partial credits will be considered).
[[File:Q3-answerkey.png|thumbnail|Answer key]]
|-style="background-color:powderblue;"
|-style="background-color:powderblue;"
| '''Read''' Chapter 8
| '''Linux proficiency tests'''<br />
|}
Scenario: Your research lab is interested in the 5' and 3' UTR regions of the human alpha-globin gene. You have been asked to take the same sequence file from above and prepare it for analysis. You have a Perl script that another lab member had used for data quality checks on another gene. Your script will need to transform the UTR bases from uppercase to lowercase. Another member prepared a control sequence in a FASTA file.


===July 31===
#Log into the Linux server (refer to your notes)
*'''Chapter 6.''' Gene Prediction  [continued]
#Copy each of these files from '''/home/levy/biowork/''' into your home directory:
*'''Workshop 6''':
#* alphaglobin.fasta
**'''Web Exploration''' (pg.168-169)
#* control-utrs.seq
**'''Algorithm 4''': TBD
#* id-utrs.pl ''corrected name''
{| class="collapsible collapsed wikitable"
#Modify the Perl script to transform the alpha-globin file:
|- style="background-color:lightsteelblue;"
## Edit the script and change parameters where the script has '''FIX''' in the comments (Hint: There are 4 sections that need edits.)
! Assignment #6
## Run the script: '''cat alphaglobin.fasta | perl id-utrs.pl > new.seq'''
|- style="background-color:powderblue;"
## Check your results with the NCBI website and confirm the UTRs correspond with the GenBank record
| '''Using ONLINE tools '''<br />
#Run BLAST using the control file control-utrs.seq:
# Using the supplied accession number [YP_063283]:  
##Review your results first: '''blastn -query new.seq -subject control-utrs.seq | less'''
# Find the top 6 orthologs using an online tool we covered in class.
##If the results correlate with NCBI, then save the results: '''blastn -query new.seq -subject control-utrs.seq -outfmt 7 > results.txt'''
# Align the 7 sequences (6 identified orthologs plus given sequence) using another online tool covered in class.
#Use grep to filter the results for lines that match "gi" and save the output as filtered.txt
# Your results should include: Printed alignment and tabulated results showing name, scores, and e-values of the significant orthologs.
#Print the following for submission: modified Perl script, new.seq, and filtered.txt
#*'''Note''': You must follow the syllabus guidelines
|-style="background-color:powderblue;"
|-style="background-color:powderblue;"
| '''Review''' Chapter 8
|}
|}
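The grep filtering step in the Linux proficiency test above can be sketched as follows. This is a minimal illustration, assuming that the hit lines in the tabular BLAST output contain the string "gi" while the comment lines (which begin with "#" in output format 7) do not; the function name `filter_gi_lines` is hypothetical.

```shell
#!/usr/bin/env bash
# filter_gi_lines RESULTS FILTERED
# Copy every line of RESULTS that contains the string "gi" into FILTERED.
# The function name is an illustrative assumption.
filter_gi_lines() {
    results="$1"
    filtered="$2"
    grep -e 'gi' "$results" > "$filtered"
}

# Example, using the file names from the assignment:
#   filter_gi_lines results.txt filtered.txt
```

If a comment line also happens to contain "gi", anchoring the pattern to the start of the line (for example `grep -e '^gi'`) is one way to keep only the hit lines, assuming the first column starts with "gi".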


===Aug 5===
===Aug 5. Micro-evolution: Human genetic variations===
*'''Chapter 8.''' Molecular Phylogenetics([[Media:chap8.pdf|Lecture Slides-Ch8]])
* Lecture Slides: Genetic Variation ([[Media:Session-3-genetic-variation.pptx|Powerpoint]]) '''Slides updated'''
**'''Web Exploration''' (pg.244-248)
* Workshop Slides: Algorithm Thinking ([[Media:Workshop7.pptx.pdf|Workshop 7]])
*'''Begin Review'''
* [http://media.hhmi.org/fittest/human_selection.html Short Film: Natural Selection in Humans]
**''' Begin Algorithm Review'''
* Worksheets
 
{| class="collapsible collapsed wikitable"
{| class="collapsible collapsed wikitable"
|- style="background-color:lightsteelblue;"
|- style="background-color:lightsteelblue;"
! Assignment #7
! Assignment #4 DUE: Aug 12
|- style="background-color:powderblue;"
|- style="background-color:powderblue;"
| '''Tree Thinking'''<br />
| '''Bioinformatics questions'''<br />[[File:Snp.png|thumbnail]]
#"Tree-Thinking" Puzzles ([[Media:TreeV1.pdf|Download]]): Briefly explain your choices. (Partial credits if you simply mark the choices).
The image at left shows an alignment of 20 codons from 38 strains of a bacterial species. Answer the following questions:
# How many SNP sites are there? What is the SNP density (i.e., number of SNP sites divided by the total number of aligned bases, which is 20 x 3 = 60 bases)?
# For each SNP site, identify whether (a) it is a transition or transversion, and (b) it causes synonymous or nonsynonymous change (by consulting [http://scienceblogs.com/digitalbio/wp-content/blogs.dir/460/files/2012/04/i-39185d84268023fb77b43bbf9dba06c7-standard%20genetic%20code.png a genetic code table like this one])
|-style="background-color:powderblue;"
| '''Linux proficiency tests'''<br />
# Log in to the server
# Copy the primer sequence in /home/levy/biowork named '''primer.seq''' into your home directory
# Write a bash script that meets the following conditions:
## Accept a FASTA file as the first argument
## The FASTA file must have the ID in the first line
## The FASTA file must have only one sequence, and the sequence must be in the second line, in either upper or lower case
## The output should be the reverse complement of the sequence in FASTA format, with the original ID and the same letter case
## The output should be a file named '''reverse_com.fasta'''
## The script should produce the file in the current directory
# Modify '''primer.seq''' to be used as valid input with your script
# Test your script with '''primer.seq'''
# Create a directory called '''hw4''' in your home directory
# Copy your script, your modified '''primer.seq''', and your script's output to '''hw4/'''
# Print the 3 files above, following the syllabus guidelines, for submission in class
|-style="background-color:powderblue;"
|-style="background-color:powderblue;"
| '''Review''' Chapter 8
|}
|}
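The reverse-complement task in the Linux proficiency test above can be sketched with '''rev''' and '''tr'''. This is one possible approach rather than the expected answer; it assumes a two-line FASTA file (ID on line 1, the whole sequence on line 2) containing only A, C, G, and T in either case, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# reverse_complement SEQ
# Print the reverse complement of SEQ, preserving upper/lower case.
reverse_complement() {
    # rev reverses the characters; tr swaps each base for its complement.
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# revcomp_fasta FILE
# Write reverse_com.fasta in the current directory, keeping the original ID.
# Assumes line 1 is the ">ID" header and line 2 holds the entire sequence.
revcomp_fasta() {
    fasta="$1"
    header=$(sed -n '1p' "$fasta")
    sequence=$(sed -n '2p' "$fasta")
    {
        printf '%s\n' "$header"
        reverse_complement "$sequence"
    } > reverse_com.fasta
}

# When saved as a script, the first command-line argument would be passed on:
#   revcomp_fasta "$1"
```

Listing both upper- and lower-case bases in the tr sets is what keeps the output in the same case as the input.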


===Aug 7===
===Aug 7. Genome function: Transcriptome analysis ===
*'''Review Web Exploration, Databases, and Gene Prediction'''
* Lecture Slides: Gene Expression ([[Media:Session-4-gene-expression.pptx|Powerpoint]] or [[Media:Session-4-gene-expression.pdf|PDF]])
* In-Class Exercise
# Read [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17637 this experimental report] and extract the following information:
## Name of the two species used in experiments
## How many genes were measured for their expression (i.e., mRNA) levels?
## Describe a biological question that can be answered by this experiment (e.g., which genes are expressed at a particular developmental stage)
# [http://dictyexpress.biolab.si/index.htm Go to dictyExpress] and explore the time course of a set of genes
## Choose the 2nd Box: "Run dictyExpress (RNA-seq)"
## In the "Gene Selection" Box, type the following gene names one at a time (DON'T copy and paste; when the gene is found, highlight it and press enter): acrA, catB, dcsA, acgA, abcG18
## Click "Update" and answer the question based on the plot in the "Expression Profile" panel: Are these genes up- or down-regulated during development?
# Do the same for the 2nd set of genes: mserS, rpl38, rpsA, rpl35a, gfm1
# Do the same for the 3rd set of genes: gefB, gefX, gxcB, mgp3, gefN
# Combine all 3 sets of genes and produce a heatmap
## In the "Hierarchical Clustering" Panel, choose the "Pearson Correlation" for "Distance Function"
## Choose "Average Linkage" for "Linkage" and your choice of color gradient
## What is represented by each row?
## What is represented by each column?
## Do these 3 sets of genes form clusters by themselves?
## [http://media.hhmi.org/biointeractive/click/microarray_analyzing/12.html HHMI slides: A technical description] of how to group genes and samples by their overall similarity in gene expression levels
* Workshop Slides: [[Media:BIOL_470.83_WS8.pdf|Workshop 8 PDF]]
 
===Aug 12. Review===
* Submit teacher's evaluation:
** Computer: [http://www.hunter.cuny.edu/te Submit using computer]
** Smartphone: [http://www.hunter.cuny.edu/mobilete Submit using mobile]
* Review Slides: Final Review ([[Media:Final-review-summer-shop.pptx|Powerpoint]] or [[Media:Final-review-summer-shop.pdf|PDF]]) and Informatics ([[Media:BIOL_470.83_Review.pdf|PDF]])


===Aug 12===
===Aug 14. In-class Final Exam & Practicum===
*'''Review Code Structure and syntax, as well as common coding errors. Also begin review of phylogeny.'''


===Aug 14===
==Class Links==
* Final Exam
* [http://www.ee.surrey.ac.uk/Teaching/Unix/ A Unix Tutorial]
* [http://ryanstutorials.net/linuxtutorial/ A Linux bash Tutorial]
* [http://www.openvim.com/tutorial.html An Interactive VIM Tutorial]
* [http://cygwin.com/install.html Cygwin bash Shell & Unix Tools for Windows]
* [http://git-scm.com/downloads Git with bash Shell for Windows]
* [http://strawberryperl.com Strawberry Perl for Windows]
* [http://www.activestate.com/activeperl/downloads ActivePerl for Windows]

Latest revision as of 15:28, 13 August 2014

Summer Bioinformatics Workshop (BIOL 470.83/790.86, Summer II 2014)
Instructors: Drs Konstantinos Krampis & Weigang Qiu, Levy Vargas
Room:1001B HN (10th Floor, North Building)
Hours: Tues & Thur 11:30 am-15:00
Office Hours: Room 830 HN; Tuesday 3-5pm or by appointment
Contacts: Konstantinos Krampis <python4bio at gmail.com>; Levy Vargas <levy.vargas at gmail.com>

Course Description

Background

Biomedical research is becoming a high-throughput science. As a result, information technology plays an increasingly important role in biomedical discovery. Bioinformatics is a new interdisciplinary field formed by the merging of molecular biology and computer science techniques. Today’s biology students must therefore learn not only in vivo and in vitro research skills but also in silico ones. Quantitative/computational biologists are expected to be in increasing demand in the 21st century.

However, the technical barrier to entering the field and performing basic research projects in a bioinformatics lab is daunting for most undergraduate students. This is mainly due to the multidisciplinary nature of quantitative biology, which requires understanding and skills in chemistry, biology, computer programming, and statistics. The Hunter Summer Bioinformatics Workshop aims to introduce bioinformatics to motivated undergraduate and high school students by lowering the barrier and dispensing with the usual prerequisites in advanced biology/chemistry courses as well as entry-level programming/statistics courses. The Workshop does not assume prior programming experience.

The workshop DOES NOT

  • Replace existing advanced bioinformatics courses such as BIOL425 and STAT 319
  • Teach advanced bioinformatics programming skills (e.g., advanced data structure, object-oriented Perl, BioPerl, or relational database with SQL), which are the contents of BIOL425
  • Teach in-depth statistics or the popular R statistical package, although probabilistic thinking (e.g., distributions of a random variable, stochastic processes, likelihood, clustering analysis) is at the core of all bioinformatics analysis (STAT 319 teaches these topics)

To learn these advanced bioinformatics topics and skills, motivated students are encouraged to enroll in one of the Five Bioinformatics Concentrations at Hunter. The QuBi program prepares the students for bioinformatics positions in a research lab or a biotechnology company.

Contents

This course will introduce both bioinformatics theories and practices. Topics include: database searching, sequence alignment, and basic molecular phylogenetics. The course is held in a UNIX-based instructional lab specifically configured for bioinformatics applications. Each session consists of a first-half instruction on bioinformatics theories and a second-half session of hands-on exercises.

Learning Goals

Students are expected to be able to:

  • Retrieve and analyze DNA and protein sequences using online databases
  • Write simple computer programs to manipulate DNA sequences

Textbook

No textbook is required; handouts will be provided in class.

Grading & Academic Honesty

Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.

Student performance will be evaluated by weekly assignments and projects. While these are take-home projects and students are allowed to work in groups, students are expected to compose the final short answers, computer commands, and code independently. There is a virtually unlimited number of ways to solve a computational problem, just as there are many ways and personal styles to implement an algorithm. Writings and blocks of code that are virtually exact copies between individual students will be investigated as possible cases of plagiarism (e.g., copies from the Internet, a textbook, or each other). In such a case, the instructor will hold closed-door exams for the individuals involved. Zero credit will be given to ALL involved individuals if the instructor considers there is enough evidence for plagiarism. To avoid being investigated for plagiarism, Do Not Copy from Others & Do Not Let Others Copy Your Work.

The grading scheme for the course is as follows (Subject to some change. You will be notified with sufficient time):

  • In-Class Assignments: 8 exercises, 20 points each. (Attendance is mandatory)
  • Weekly assignment: 4 exercises, 10 points each
  • Mid-term: 50 points, on July 24
  • Final exam: 50 points, on August 14

Programming Assignment Expectations

All code must begin with the lines in the Perl slides, without exception. For each assignment, unless otherwise stated, I would like the full text of the source code. Since you cannot print using the text editor in the lab (even if you are connected from home), you must copy and paste the code into a word processor or a local text editor. If you are using a word processor, change the font to a fixed-width/monospace font. On Windows, this is usually Courier.

Code indentation is a matter of personal taste, so long as it is consistent and readable. Use comments wherever you think the code is unclear, or simply as a guideline for yourself. Well-commented code improves readability, but be careful not to overdo it.

Unless otherwise stated, both the input and the output of the program must be submitted as well. These should also be in a fixed-width font and clearly labeled as the program's input and output. This is so that I know that you've run the program, what data you have used, and what the program produced.

If you are working from the lab, one option is to email the code to yourself, change the font, and then print it somewhere else as there is no printer in the lab.

Course Schedule (Tuesdays and Thursdays)

July 15. Course Overview & Lab Setup

  • Course Overview
  • LECTURE SLIDES (Bioinformatics)
  • WORKSHOP SLIDES: (Workshop 1)
  • Workshop on Linux proficiency:
    • Terminal & the bash shell
    • Text editing
    • First program

July 17. The Central Dogma of Molecular Biology

  • LECTURE SLIDES (Genes, Proteins, Mutations)
  • WORKSHOP SLIDES: (Workshop 2), with corrections (July 22)
  • Workshop on Linux proficiency:
    • Managing files with bash commands
    • Editing with vi
    • Writing programs in Perl

July 22. Sequence alignment & homology searching with BLAST

July 24. Alignment and Phylogenetics

July 29. Structure of human genome & genes

July 31. Macro-evolution: Cross-species comparisons

  • Learning Goal: Cross-species comparisons
  • Perl & bash Workshop Slides: Workshop 6
  • Web Exercise 1. Cross-species comparisons with HomoloGene
  1. From the NCBI "TAS2R38" Gene page, click the "HomoloGene" link under "Related Information" (right-side navigation panel)
  2. You should see a page listing TAS2R38 orthologs (i.e., the same gene in different species) from 7 mammalian species: human (Homo sapiens), chimpanzee (Pan troglodytes), macaque (Macaca mulatta), dog (Canis lupus familiaris), cow (Bos taurus), rat (Rattus norvegicus), and mouse (Mus musculus).
  3. Write down your expectations for the following species relationships:
    1. Is chimpanzee more closely related to macaque or to human?
    2. Is dog more closely related to mouse or to cow?
    3. Are rat and mouse more closely related to each other than human and chimpanzee are?
  4. Click on the link "Show Pairwise Alignment Scores" under "Protein Alignments" and fill in the following table when the page loads. Do these sequence-comparison results change your expectations above? Explain.
Species pair     % Protein Sequence Differences    % DNA Sequence Differences
Chimp-Human      ?                                 ?
Chimp-Macaque    ?                                 ?
Dog-Cow          ?                                 ?
Dog-Mouse        ?                                 ?
Rat-Mouse        ?                                 ?

You can find the exact differences by clicking on "Blast" for each pairwise comparison.
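The percentages in the table can also be checked by hand from a pairwise alignment. The following is a minimal Python sketch, assuming the two sequences are already aligned to the same length and that gapped columns are skipped; the function name and the toy sequences are made up for illustration and are not real TAS2R38 data. If a page reports percent identity instead, the percent difference is 100 minus that value.

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Percent of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * mismatches / compared


# Toy aligned fragments (hypothetical, for illustration only)
toy_a = "ATGCT-GACC"
toy_b = "ATGTTAGACG"
print(round(percent_difference(toy_a, toy_b), 1))
```

Skipping gapped columns is only one possible convention; counting each gap as a difference would give a larger percentage.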

  1. What are the two hypotheses explaining the origin of different ecomorphs of lizards on Caribbean Islands?
  2. What is the expected phylogeny under each hypothesis?
  3. Which hypothesis is supported by the phylogeny of actual DNA sequences?
  • Web Exercise 2. Derive your own tree of Anole lizards
[Figure: Tree of Anole Lizards]
  1. Copy and paste the lizard DNA sequences into a text editor. Using underscores, append the geographic location and ecomorph of each species to its sequence name, according to the lizard card
  2. Go to the phylogeny.fr website and select "Phylogenetic Analysis" and then "One Click" analysis
  3. Copy and paste your edited sequences into the text box and click on "Submit"
  4. When the analysis is finished, you should see a phylogenetic tree. Answer the following two questions:
    1. Are species grouped by geography or by habitats?
    2. Which hypothesis is supported by your phylogenetic tree?
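The renaming in step 1 of Web Exercise 2 can be done by hand in a text editor, but it can also be scripted. Below is a minimal Python sketch, assuming the sequences are in FASTA format (header lines starting with ">"). The function name, species names, locations, ecomorph assignments, and sequences are all hypothetical placeholders; take the real values from the lizard card.

```python
def relabel_fasta(fasta_text, labels):
    """Append '_location_ecomorph' to each FASTA header found in `labels`."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if name in labels:
                location, ecomorph = labels[name]
                # Replace spaces so the whole label stays a single token
                parts = [name, location, ecomorph]
                name = "_".join(p.replace(" ", "_") for p in parts)
            out.append(">" + name)
        else:
            out.append(line)  # sequence lines are left untouched
    return "\n".join(out)


# Hypothetical names, labels, and sequences, for illustration only
toy_fasta = ">species_A\nATGCCGTA\n>species_B\nATGCTGTA"
toy_labels = {
    "species_A": ("Island 1", "trunk-ground"),
    "species_B": ("Island 2", "twig"),
}
print(relabel_fasta(toy_fasta, toy_labels))
```

With the location and ecomorph embedded in each name, the tip labels of the resulting tree show directly whether species group by geography or by habitat.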

Aug 5. Micro-evolution: Human genetic variations

Aug 7. Genome function: Transcriptome analysis

  • Lecture Slides: Gene Expression (Powerpoint or PDF)
  • In-Class Exercise
  1. Read this experimental report and extract the following information:
    1. Names of the two species used in experiments
    2. How many genes were measured for their expression (i.e., mRNA) levels?
    3. Describe a biological question that can be answered by this experiment (e.g., which genes are expressed at a particular developmental stage)
  2. Go to dictyExpress and explore the time course of a set of genes
    1. Choose the 2nd Box: "Run dictyExpress (RNA-seq)"
    2. In the "Gene Selection" Box, type the following gene names one at a time (DON'T copy and paste; when the gene is found, highlight it and press enter): acrA, catB, dcsA, acgA, abcG18
    3. Click "Update" and answer the question based on the plot in the "Expression Profile" panel: Are these genes up- or down-regulated during development?
  3. Do the same for the 2nd set of genes: mserS, rpl38, rpsA, rpl35a, gfm1
  4. Do the same for the 3rd set of genes: gefB, gefX, gxcB, mgp3, gefN
  5. Combine all 3 sets of genes and produce a heatmap
    1. In the "Hierarchical Clustering" Panel, choose the "Pearson Correlation" for "Distance Function"
    2. Choose "Average Linkage" for "Linkage" and your choice of color gradient
    3. What is represented by each row?
    4. What is represented by each column?
    5. Do these 3 sets of genes form clusters by themselves?
    6. HHMI slides: A technical description of how to group genes and samples by their overall similarity in gene expression levels
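To make the heatmap settings concrete, here is a minimal Python sketch of hierarchical clustering with a Pearson correlation distance and average linkage. It assumes the distance is defined as 1 minus the Pearson correlation, which is one common convention and not necessarily the exact formula dictyExpress uses. The gene names and expression values are invented toy data, and profiles with no variation (all values equal) are not handled.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster1, cluster2, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between cluster members
                dists = [
                    pearson_distance(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy time-course profiles (hypothetical values, one number per time point)
toy = {
    "gene_up1": [1, 2, 4, 8],
    "gene_up2": [2, 3, 6, 12],
    "gene_down1": [9, 6, 3, 1],
    "gene_down2": [8, 5, 2, 1],
}
for left, right, dist in average_linkage(toy):
    print(left, right, round(dist, 3))
```

In this toy data the two rising genes merge with each other first, and the two falling genes merge with each other before either pair joins the other. Step 5 asks whether the three real gene sets show the same kind of grouping.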

Aug 12. Review

Aug 14. In-class Final Exam & Practicum

Class Links