Revision as of 03:47, 9 March 2014

Project Goals

A Borrelia Phylogeny

Annotate and add newly sequenced Borrelia genomes to BorreliaBase
Build an informatics pipeline for gene prediction, ortholog calls, databasing, and synteny analysis

Download genome sequences from GenBank

Genome_id	Strain	Species	Group	Genome Sequences	Notes
100	B31	B. burgdorferi (reference genome)	Lyme Disease	main chromosome, cp26 plasmid, lp54 plasmid	Reference. Already downloaded as "ref.pep"
114	CA382	B. burgdorferi (California)	Lyme Disease	main chromosome	Accession: CP005925; Assigned to: HA
115	CA8	B. burgdorferi (California)	Lyme Disease	7 unassembled contigs	Accession: ADMY01000001; Assigned to: AA | Accession: ADMY01000002; Assigned to: TAA | Accession: ADMY01000003; Assigned to: KD | Accession: ADMY01000004; Assigned to: JG | Accession: ADMY01000005; Assigned to: KPG | Accession: ADMY01000006; Assigned to: GG | Accession: ADMY01000007; Assigned to: TDH
304	BgVir	B. garinii (Russia)	Lyme Disease	main chromosome, cp26, lp54	Accession: CP003151; Assigned to: LH | Accession: CP003201; Assigned to: SK | Accession: CP003202; Assigned to: BK
305	NMJW1	B. garinii (China)	Lyme Disease	main chromosome	Accession: CP003866; Assigned to: AL
402	HLJ01	B. afzelii (China)	Lyme Disease	main chromosome	Accession: CP003882; Assigned to: RL
1003	Ly	B. duttonii (Tanzania)	Relapsing Fever	main chromosome, lp23 (homolog of cp26 in LD genomes), many other plasmids (not to include)	Accession: CP000976; Assigned to: HL | Accession: CP000980; Assigned to: NM
1001	A1	B. recurrentis (Ethiopia)	Relapsing Fever	main chromosome, lp124 (homologous to lp54 in LD genomes), lp23	Accession: CP000993; Assigned to: JP | Accession: CP000994; Assigned to: DP | Accession: CP000995; Assigned to: GAR
1100	DAH	B. hermsii (Washington State)	Relapsing Fever	main chromosome	Accession: CP000048; Assigned to: KR
1200	91E135	B. turicatae (Texas)	Relapsing Fever	main chromosome	Accession: CP000049; Assigned to: MDR
1002	Achema	B. crocidurae (Mauritania)	Relapsing Fever	main chromosome, many unassembled plasmids (not to include)	Accession: CP003426; Assigned to: VS
1400	HR1	B. parkeri (??)	Relapsing Fever	main chromosome	Accession: CP007022; Assigned to: AV
1300	LB-2001	B. miyamotoi (Northeast US)	Relapsing Fever	main chromosome	Accession: CP006647; Assigned to: LLW
107	94a	B. burgdorferi (Northeast US)	Lyme Disease	main chromosome	Accession: ABGK02000008; Assigned to: QZ

Protocol

Dependencies

BASH (default shell of Linux OS and Apple OS X)
Perl and BioPerl
DNATweezer
NCBI Standalone BLAST+

Fetch genome sequences and extract protein sequences

Commands:

# These scripts are in "../../bio425/annotate-a-genome-pipeline". 
# You may either make a copy to your home directory (recommended) or run directly from that directory
./fetch-genome.pl <your_assigned_accession> # Expected output: "accession.gb"
./gb2fas.pl <accession.gb> # Expected output: "accession.pep"
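
Before running BLAST, it can help to confirm that the protein file actually contains sequences. The following is a minimal sketch, assuming the output is a standard FASTA file. The file name "demo.pep" and its two records are invented for illustration; substitute your own "accession.pep".

```shell
# Build a tiny stand-in for "accession.pep" (hypothetical records, illustration only)
workdir=$(mktemp -d)
cat > "$workdir/demo.pep" <<'EOF'
>DEMO_0001
MKKLLIA
>DEMO_0002
MSTNPKPQR
EOF

# Count FASTA records: every record starts with a ">" header line
n_seqs=$(grep -c '^>' "$workdir/demo.pep")
echo "Number of protein sequences: $n_seqs"

# List the first few sequence IDs (header line minus the leading ">")
first_ids=$(grep '^>' "$workdir/demo.pep" | head -n 3 | sed 's/^>//')
echo "$first_ids"
```

A count of zero would suggest that the GenBank file has no annotated CDS features or that the extraction step failed.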

Predict orthologs with reciprocal BLAST

Commands:

makeblastdb -in b31.pep -dbtype prot -parse_seqids # Prepare the reference DB (protein)
makeblastdb -in new.pep -dbtype prot -parse_seqids # Prepare the new genome DB (protein)
blastp -query new.pep -db b31.pep -outfmt 6 -evalue 1e-3 -out forward_blast.out # Forward BLAST
blastp -query b31.pep -db new.pep -outfmt 6 -evalue 1e-3 -out reverse_blast.out # Reverse BLAST
./check-reciprocal.pl forward_blast.out reverse_blast.out > new.orthologs 2> new.not-orthologs # Identify orthologs
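
The "-outfmt 6" option writes one tab-separated line per hit, with the default columns qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. BLAST lists the hits of each query best-first, so the first line seen for a query is its top hit. The sketch below inspects such a file with standard tools; the file name "forward_demo.out" and all IDs and scores in it are invented for illustration.

```shell
workdir=$(mktemp -d)
demo="$workdir/forward_demo.out"

# Toy -outfmt 6 rows (12 tab-separated columns); values are made up
printf 'NEW_0001\tBB_0001\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n' >  "$demo"
printf 'NEW_0001\tBB_0107\t40.0\t90\t54\t0\t1\t90\t1\t90\t1e-05\t45\n'    >> "$demo"
printf 'NEW_0002\tBB_0002\t88.0\t120\t14\t0\t1\t120\t1\t120\t1e-60\t230\n' >> "$demo"

# Number of distinct queries that have at least one hit
n_queries=$(cut -f1 "$demo" | sort -u | wc -l | tr -d ' ')
echo "Queries with hits: $n_queries"

# Keep only the first (top) hit per query: query, subject, e-value, bit score
top_hits=$(awk -F'\t' '!seen[$1]++ { print $1 "\t" $2 "\t" $11 "\t" $12 }' "$demo")
echo "$top_hits"
```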

Code for "check-reciprocal.pl":

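#!/usr/bin/env perl
# Identify reciprocal top BLAST hits between two genomes
# Input: forward and reverse BLAST outputs in tabular format (-outfmt 6)
# Output: reciprocal pairs to STDOUT; counts and non-reciprocal pairs to STDERR
use strict;
use warnings;

die "Usage: $0 <forward_blast_output> <reverse_blast_output>\n" unless @ARGV == 2;
my ($fwd, $rev) = @ARGV;
my (%fwd_top_hits, %rev_top_hits, @query);

# Forward BLAST: keep the first (top) hit of each query, in input order
open my $fwd_fh, "<", $fwd or die "Cannot open $fwd: $!\n";
while (<$fwd_fh>) {
    chomp;
    my @data = split;
    next unless @data >= 2; # skip blank or malformed lines
    next if exists $fwd_top_hits{$data[0]};
    $fwd_top_hits{$data[0]} = $data[1];
    push @query, $data[0];
}
close $fwd_fh;
warn "Total queries having hits: " . scalar(@query) . "\n";

# Reverse BLAST: keep the first (top) hit of each query
open my $rev_fh, "<", $rev or die "Cannot open $rev: $!\n";
while (<$rev_fh>) {
    chomp;
    my @data = split;
    next unless @data >= 2;
    next if exists $rev_top_hits{$data[0]};
    $rev_top_hits{$data[0]} = $data[1];
}
close $rev_fh;

foreach my $q (@query) { # e.g., BafPKo_0002
    my $top = $fwd_top_hits{$q}; # e.g., BB_0002
    # The forward top hit may have no hit at all in the reverse BLAST
    my $back = exists $rev_top_hits{$top} ? $rev_top_hits{$top} : "no_hit";
    if ($q eq $back) {
        print "Found reciprocal top hits:\t", $q, "\t", $top, "\n";
    } else {
        warn "Not reciprocal top hits:\t", $q, "\t", $top, "\t", $back, "\n";
    }
}

exit;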
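
The reciprocal-top-hit rule can also be tried on a toy input without the Perl script. The following is a sketch of the same rule in awk, not a replacement for "check-reciprocal.pl". The file names and IDs are invented, and the toy rows carry only the first two columns (query, subject) of a tabular BLAST output, since those are the only ones the rule uses.

```shell
workdir=$(mktemp -d)

# Forward BLAST (query = new genome, subject = reference); first line per query = top hit
printf 'NEW_0001\tBB_0001\nNEW_0001\tBB_0107\nNEW_0002\tBB_0002\nNEW_0003\tBB_0002\n' \
    > "$workdir/fwd_demo.out"
# Reverse BLAST (query = reference, subject = new genome)
printf 'BB_0001\tNEW_0001\nBB_0002\tNEW_0002\n' > "$workdir/rev_demo.out"

# Read the reverse file first and store each query's top hit, then scan the
# forward file and print a pair only when the two top hits point at each other
reciprocal=$(awk -F'\t' '
    NR == FNR    { if (!($1 in rev)) rev[$1] = $2; next }
    !($1 in fwd) { fwd[$1] = $2
                   if (($2 in rev) && rev[$2] == $1) print $1 "\t" $2 }
' "$workdir/rev_demo.out" "$workdir/fwd_demo.out")
echo "$reciprocal"
```

In this toy input NEW_0003 is not reported: its top hit BB_0002 points back to NEW_0002, so the pair is not reciprocal.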

Verify with synteny browser

./gb2fas.pl -t new.gb > new-to-orf-table.txt
# load into database with SQL
# Visualize synteny
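
Before loading the ORF table into a database, a quick format check can catch truncated or malformed rows. The sketch below assumes that the table has nine tab-separated columns (genome ID, contig ID, ORF ID, a flag, start, end, strand, locus tag, product), as printed by the "to_db" routine of the earlier "gb2fas.pl" revision shown in the diff below; the current script may differ. The file name "demo-orf-table.txt" and the rows are invented for illustration.

```shell
workdir=$(mktemp -d)
table="$workdir/demo-orf-table.txt"

# Two toy rows in the assumed nine-column layout (values are made up)
printf '401\t111114823\tORF0001\tf\t1\t1000\tt\tDEMO_0001\thypothetical protein\n' >  "$table"
printf '401\t111114823\tORF0002\tf\t1200\t2100\tf\tDEMO_0002\tDNA polymerase\n'    >> "$table"

# Count all rows and the rows that do not have exactly nine tab-separated fields
n_rows=$(wc -l < "$table" | tr -d ' ')
bad_rows=$(awk -F'\t' 'NF != 9 { bad++ } END { print bad + 0 }' "$table")
echo "Rows: $n_rows, malformed rows: $bad_rows"
```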

@@ Line 107: / Line 107: @@
 * [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download NCBI Standalone BLAST+]
-==Fetch genome sequences==
+==Fetch genome sequences and extract protein sequences==
 * Commands:
 <syntaxhighlight lang="bash" line enclose="div">
-./bioseq -z 'b31_accession' -o 'genbank' > b31.gb # Reference genome for ortholog identification. Choose main, cp26, or lp54
+# These scripts are in "../../bio425/annotate-a-genome-pipeline".
-./bioseq -z 'gb_accession' -o 'genbank' > new.gb
+# You may either make a copy to your home directory (recommended) or run directly from that directory
-./gb2fas -n b31.gb > b31.nuc # Extract CDS
+./fetch-genome.pl <your_assigned_accession> # Expected output: "accession.gb"
-./gb2fas -n new.gb > new.nuc
+./gb2fas.pl <accession.gb> # Expected output: "accession.pep"
-./bioseq -t b31.nuc > b31.pep # Translate (and remove those with internal stop codons)
-./bioseq -t new.nuc > new.pep
-</syntaxhighlight>
-* Perl code for "gb2fas.pl":
-<syntaxhighlight lang="perl" line enclose="div">
-#!/usr/bin/env perl
-# Extract sequences from a GenBank file
-# Input: a GenBank file
-# Output: -n: CDS sequences in FASTA; -t: CDS information in Tab-delimited
-use strict;
-use Bio::SeqIO;
-use Getopt::Std;
-use Data::Dumper;
-use 5.10.0;
-my %opts;
-getopts('tn',\%opts);
-die "$0 [-nt] <genbank_file>\n" unless @ARGV == 1;
-my $gb_file = shift @ARGV;
-my $in = Bio::SeqIO->new(-file=>$gb_file, -format=>'genbank');
-my $cds_ct=0;
-while (my $seqobj = $in->next_seq() ) {
-    my @features = $seqobj->get_SeqFeatures(); # just top level
-    foreach my $feat ( @features ) {
-	next unless $feat->primary_tag eq "CDS";
-	$cds_ct++;
-	&to_db($feat, $cds_ct) if $opts{t};
-	&to_nt($feat, $seqobj) if $opts{n};
-    }
-}
-exit;
-sub to_nt {
-    my $ft = shift;
-    my $seq = shift;
-    say ">", $ft->get_tag_values("locus_tag");
-    my $subseq = $seq->trunc($ft->start, $ft->end);
-    if ($ft->strand > 0) {
-	say $subseq->seq();
-    } else {
-	say $subseq->revcom()->seq();
-    }
-}
-sub to_db {
-    my $ft = shift;
-    my $ct = shift;
-    my $orf_id = sprintf "ORF%04d", $ct;
-    my $gid = 401; # this is bad and needs improvement: nothing should be hard-coded
-    my $con_id = 111114823; # the same problem
-    my $locus =  sprintf "%s", $ft->get_tag_values('locus_tag');
-    my $prod = sprintf "%s", $ft->get_tag_values('product');
-    $prod =~ tr/\'/_/;
-    my $strand = ($ft->strand > 0) ? 't' : 'f';
-    say join "\t", ($gid, $con_id, $orf_id, 'f', $ft->start, $ft->end, $strand, $locus, $prod);
-}
 </syntaxhighlight>

Annotate-a-genome: Difference between revisions
