Bioinformatics Workshop 2013
Revision as of 18:06, 25 June 2013
Course Description
Background
Biomedical research is becoming a high-throughput science. As a result, information technology plays an increasingly important role in biomedical discovery. Bioinformatics is a new interdisciplinary field formed by the merging of molecular biology and computer science techniques. Today’s biology students must therefore learn not only in vivo and in vitro research skills, but also in silico ones. Quantitative/computational biologists are expected to be in increasing demand in the 21st century.
However, the technical barrier to entering the field and performing basic research projects in a bioinformatics lab is daunting for most undergraduate students. This is mainly due to the multidisciplinary nature of quantitative biology, which requires understanding of and skills in chemistry, biology, computer programming, and statistics. The Hunter Summer Bioinformatics Workshop aims to introduce bioinformatics to motivated undergraduate and high school students by lowering the barrier and dispensing with the usual prerequisites in advanced biology/chemistry courses as well as entry-level programming/statistics courses. The Workshop does not assume prior programming experience.
The workshop DOES NOT
- Replace existing advanced bioinformatics courses such as BIOL425 and STAT 319
- Teach advanced bioinformatics programming skills (e.g., advanced data structures, object-oriented Perl, BioPerl, or relational databases with SQL), which are the contents of BIOL425
- Teach in-depth statistics or the popular R statistical package, although probabilistic thinking (e.g., distributions of a random variable, stochastic processes, likelihood, clustering analysis) is at the core of all bioinformatics analysis (STAT 319 teaches these topics)
To learn these advanced bioinformatics topics and skills, motivated students are encouraged to enroll in one of the five bioinformatics concentrations at Hunter. The QuBi program prepares students for bioinformatics positions in a research lab or a biotechnology company.
Contents
This course will introduce both bioinformatics theory and practice. Topics include database searching, sequence alignment, and basic molecular phylogenetics. The course is held in a UNIX-based instructional lab specifically configured for bioinformatics applications. Each session consists of a first half of instruction on bioinformatics theory and a second half of hands-on exercises.
Learning Goals
Students are expected to be able to:
- Approach biological questions evolutionarily ("Tree-thinking")
- Design efficient procedures to solve problems ("Algorithm-thinking")
- Manipulate high-volume textual data using UNIX tools, Perl, and relational databases ("Data Visualization")
Textbook
St. Clair & Visick (2010). Exploring Bioinformatics: A Project-Based Approach. Jones and Bartlett Publishers, Inc., Sudbury, Massachusetts. (ISBN 978-0-7637-5829-5)
This book should be available through several popular retailers and resellers online.
Grading & Academic Honesty
Hunter College regards acts of academic dishonesty (e.g., plagiarism, cheating on examinations, obtaining unfair advantage, and falsification of records and official documents) as serious offenses against the values of intellectual honesty. The College is committed to enforcing the CUNY Policy on Academic Integrity and will pursue cases of academic dishonesty according to the Hunter College Academic Integrity Procedures.
Student performance will be evaluated by weekly assignments and projects. While these are take-home projects and students are allowed to work in groups, students are expected to compose the final short answers, computer commands, and code independently. There is a virtually unlimited number of ways to solve a computational problem, and just as many personal styles for implementing an algorithm. Writings and blocks of code that are virtually exact copies between individual students will be investigated as possible cases of plagiarism (e.g., copies from the Internet, the textbook, or each other). In such a case, the instructor will hold closed-door exams for the individuals involved. Zero credit will be given to ALL involved individuals if the instructor considers there to be enough evidence of plagiarism. To avoid being investigated for plagiarism, Do Not Copy from Others & Do Not Let Others Copy Your Work.
The grading scheme for the course is as follows (subject to some change; you will be notified in advance):
- Assignments (50%): 6 exercises (10 points each).
- Final exam (40%)
  - Bioinformatics terminology and concepts (10 pts)
  - Use of web-based Bioinformatics databases (e.g., NCBI) and tools (e.g., BLAST, CLUSTALW, PHYLIP, ORF-Finder) (15 pts)
  - Ability to interpret an algorithm and its Perl implementations (15 pts)
- Classroom Q & A (5%): Read the chapters before lecture.
- Attendance (5%): 1-2 absences = -2.5%. More than 2 = -5%.
- Email help: Include the course code ("BIOL470" or "BIOL790") in the subject line
Programming Assignment Expectations
All code must begin with the lines in the Perl slides, without exception. For each assignment, unless otherwise stated, I would like the full text of the source code. Since you cannot print using the text editor in the lab (even if you are connected from home), you must copy and paste the code into a word processor or a local text editor. If you are using a word processor, change the font to a fixed-width/monospace font. On Windows, this is usually Courier.
Code indentation is a matter of personal taste, so long as it is consistent and readable. Use comments wherever you think the code is unclear, or simply as a guideline for yourself. Well-commented code improves readability, but be careful not to overdo it.
Also, unless otherwise stated, both the input and the output of the program must be submitted as well. These should also be in a fixed-width font, and you should label them so that I know which is the program's input and which is its output. This is so that I know that you've run the program, what data you have used, and what the program produced.
If you are working from the lab, one option is to email the code to yourself, change the font, and then print it somewhere else as there is no printer in the lab.
Course Schedule (Tuesdays and Thursdays)
Dates and assignments below are subject to some change
"Lecture slides" links will be available either during or before each lecture, in PDF.
Homework assignments are due the class *after* the date under which they appear; i.e., an assignment posted under June 4 is due at the following lecture, on June 6.
June 4
- Course Overview
- Scope of Bioinformatics (Chapter 1) (Lecture Slides-Che)
- WORKSHOP SLIDES: (Lecture Slides-Slav)
- Workshop 1: NCBI/OMIM Database
- Workshop 2: UNIX Operating System
- Terminal & Home Directory
- The vi Editor
- first basic program
Assignment #1 |
---|
Linux Proficiency
|
Read Chapter 2 |
June 6
- Chapter 2. Central Dogma & Molecular Biology terms (Chapter 2)(Lecture Slides-Che)
- Workshop 2: (Lecture Slides-Slav)
- Linux tutorial
- Basic Perl (Appendix B1 & B2, pg.310-318)
- Algorithm 2: Transcription
Assignment #2 |
---|
Linux Proficiency
|
Review Chapter 2 |
June 11
- Chapter 2. Central Dogma & Molecular Biology (continued) [Lecture Slides Ch.2]
- Workshop 3: (Lecture Slides-Slav)
- Perl (Appendix B3 & B4, pg. 318-322)
- Algorithm 3: Translation
Assignment #3 |
---|
Linux Proficiency
cat Sickle_Protein_FASTA | wc
cat Sickle_Protein_FASTA > wc
cat Sickle_Protein_FASTA >> wc
ls -lh /User/Desktop/FASTA_FILES |
Read Chapter 3 |
June 13
- Chapter 3. NCBI Databases/Tools; Gene alignments: (Lecture Slides-Ch3)
- Workshop 4:
- Web Exploration (pg.60-66)
- Algorithm 3: Translation
Assignment #4 |
---|
Linux Proficiency
#!/usr/bin/perl
use strict;
use warnings;
die "Usage: $0 <Fasta_File>\n" unless @ARGV >0;
my $filename = shift(@ARGV);
my $dna_string = '';
open (FILE, $filename);
while ( <FILE> ) {
    my $line = $_;
    chomp $line;
    if ($line =~ /^>/) {
        print $line, "COMPLEMENT\n";
        next;
    }
    else {
        $dna_string .= $line;
        next;
    }
}
for (my $i=0; $i<length($dna_string); $i++) {
    my $nucleo = substr($dna_string,$i,1);
    if ( $nucleo eq "A" ) { print "T"; }
    elsif ( $nucleo eq "C" ) { print "G"; }
    elsif ( $nucleo eq "G" ) { print "C"; }
    else { print "A"; }
}
close FILE
Here are VERY GOOD answers to this assignment courtesy of one of your classmates:
1) What the program does:
The program will open a FASTA file and create a complement strand for the sequence found in the file.
2) Comments on each line:
#!/usr/bin/perl
#using the language perl
use strict;
#restrict unsafe constructs and require variables to be declared using my
use warnings;
#give warnings
die "Usage: $0 <Fasta_File>\n" unless @ARGV >0;
#stop the program and print a usage message if no file name was given on the command line
my $filename = shift(@ARGV);
#declare the variable $filename
#it is the left most value from the array (file entered by user)
my $dna_string = '';
#declare an empty variable known as my dna_string
open (FILE, $filename);
#open the user entered file
while ( <FILE> ) {
#create a loop that runs while lines can still be read from the filehandle FILE; each line is stored in $_
    my $line = $_;
    #line is a variable that the program is currently dealing with, includes new line character
    chomp $line;
    #remove new line character
    if ($line =~ /^>/) {
    #if the line begins with > (a FASTA header line)
        print $line, "COMPLEMENT\n";
        #print the line and the quoted text
        next;
        #move to the next line
    }
    #end of this condition
    else {
    #else condition, only if the IF condition is not satisfied
        $dna_string .= $line;
        #combination using the concatenation, $dna_string variable grows in length
        next;
        #move to the next line
    }
    #end of else condition
}
#end of while loop
for (my $i=0; $i<length($dna_string); $i++) {
#start with variable i at zero, loop while i is less than the length of the DNA string, increment i by one
    my $nucleo = substr($dna_string,$i,1);
    #create the variable nucleo, which is a substring of the dna_string, read dna_string one by one
    if ( $nucleo eq "A" ) { print "T"; }
    #if A-> print T
    elsif ( $nucleo eq "C" ) { print "G"; }
    #if C-> print G
    elsif ( $nucleo eq "G" ) { print "C"; }
    #if G-> print C
    else { print "A"; }
    #else print A
}
close FILE
#close the FASTA file |
Read Chapter 6 |
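For comparison with the Assignment #4 program, here is a minimal sketch of the same complement task written with Perl's built-in tr/// operator and a checked, three-argument open. It is one possible variation, not the course's reference solution. The helper names complement and read_fasta are hypothetical (they do not appear in the assignment), and two behaviors differ on purpose and are assumptions of this sketch: input is upper-cased first, and any character other than A, C, G, or T (for example N) is left unchanged, whereas the final else in the assignment program prints A for every character that is not A, C, or G.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment): return the complement
# of a DNA string. Assumption: the input is upper-cased first, and any
# character other than A, C, G, T (e.g. N) is passed through unchanged.
sub complement {
    my ($dna) = @_;
    (my $comp = uc $dna) =~ tr/ACGT/TGCA/;
    return $comp;
}

# Hypothetical helper: read a FASTA file holding a single record and
# return its header line and its sequence joined into one string.
sub read_fasta {
    my ($filename) = @_;
    # Three-argument open with an error check, unlike the bare open above
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!\n";
    my ($header, $seq) = ('', '');
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) { $header = $line; }   # FASTA header line
        else               { $seq .= $line; }     # sequence line
    }
    close $fh;
    return ($header, $seq);
}

# Process a file only when a file name is given on the command line
if (@ARGV) {
    my ($header, $seq) = read_fasta(shift @ARGV);
    print $header, "COMPLEMENT\n";
    print complement($seq), "\n";
}
else {
    warn "Usage: $0 <Fasta_File>\n";
}
```

Saved under a name of your choice, the script is run the same way as the assignment program, with the FASTA file as its only argument. Because tr/// translates the whole string at once, the character-by-character for loop and the if/elsif chain are not needed.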
June 18
- Chapter 6. Gene Prediction (Lecture Slides-Ch6)
- Workshop 5:
- Web Exploration (pg.168-174)
- Algorithm 3: TBD
Assignment #5 |
---|
Linux Proficiency
|
Read Chapter 8 |
June 20
- Chapter 6. Gene Prediction [continued]
- Workshop 6:
- Web Exploration (pg.168-169)
- Algorithm 4: TBD
Assignment #6 |
---|
Using ONLINE tools
|
Review Chapter 8 |
June 25
- Chapter 8. Molecular Phylogenetics (Lecture Slides-Ch8)
- Web Exploration (pg.244-248)
- Begin Review
- Begin Algorithm Review
Assignment #7 |
---|
Tree Thinking
|
Review Chapter 8 |
June 27
- Review Web Exploration, Databases, and Gene Prediction
July 2
- Review Code Structure and syntax, as well as common coding errors. Also begin review of phylogeny.
July 9
- Student Q&A session
July 11
- Final