Revision as of 22:28, 13 February 2014

Project Goals

Annotate and add newly sequenced Borrelia genomes to BorreliaBase
Build an informatics pipeline for gene prediction, ortholog calls, databasing, and synteny analysis

Download genome sequences from GenBank

Strain	Species	Group	Genome Sequences	Notes
B31	B. burgdorferi (reference genome)	Lyme Disease	main chromosome cp26 plasmid lp54 plasmid	Example
CA382	B. burgdorferi (California)	Lyme Disease	main chromosome	Example
CA8	B. burgdorferi (California)	Lyme Disease	8 unassembled contigs	Example
BgVir	B. garinii (Russia)	Lyme Disease	main chromosome cp26 lp54	Example
NMJW1	B. garinii (China)	Lyme Disease	main chromosome	Example
HLJ01	B. afzelii (China)	Lyme Disease	main chromosome	Example
Ly	B. duttonii (Tanzania)	Relapsing Fever	main chromosome lp23 (homolog of cp26 in LD genomes) Many other plasmids (not to include)	Example
A1	B. recurrentis (Ethiopia)	Relapsing Fever	main chromosome lp124 (homologous to lp54 in LD genomes) lp23	Example
DAH	B. hermsii (Washington State)	Relapsing Fever	main chromosome	Example
91E135	B. turicatae (Texas)	Relapsing Fever	main chromosome	Example
Achema	B. crocidurae (Mauritania)	Relapsing Fever	main chromosome Many unassembled plasmids (not to include)	Example
HR1	B. parkeri (??)	Relapsing Fever	main chromosome	Example
LB-2001	B. miyamotoi (Northeast US)	Relapsing Fever	main chromosome	Example

Protocol

Fetch genome sequences

Commands:

./bioseq -z 'b31_accession' -o 'genbank' > b31.gb # Reference genome for ortholog identification. Choose main, cp26, or lp54
./bioseq -z 'gb_accession' -o 'genbank' > new.gb
./gb2fas -n b31.gb > b31.nuc # Extract CDS
./gb2fas -n new.gb > new.nuc
./bioseq -t b31.nuc > b31.pep # Translate (and remove those with internal stop codons)
./bioseq -t new.nuc > new.pep

Perl code for "gb2fas.pl":

#!/usr/bin/env perl
# Extract sequences from a GenBank file
# Input: a GenBank file
# Output: -n: CDS sequences in FASTA; -t: CDS information in Tab-delimited
use strict;
use Bio::SeqIO;
use Getopt::Std;
use Data::Dumper;
use 5.10.0;

my %opts;
getopts('tn',\%opts);

die "$0 [-nt] <genbank_file>\n" unless @ARGV == 1;
my $gb_file = shift @ARGV;
my $in = Bio::SeqIO->new(-file=>$gb_file, -format=>'genbank');
my $cds_ct=0;
while (my $seqobj = $in->next_seq() ) {
    my @features = $seqobj->get_SeqFeatures(); # just top level
    foreach my $feat ( @features ) {
	next unless $feat->primary_tag eq "CDS";
	$cds_ct++;
	&to_db($feat, $cds_ct) if $opts{t};
	&to_nt($feat, $seqobj) if $opts{n};
    }
}

exit;

sub to_nt {
    my $ft = shift;
    my $seq = shift;
    say ">", $ft->get_tag_values("locus_tag");
    my $subseq = $seq->trunc($ft->start, $ft->end);
    if ($ft->strand > 0) {
	say $subseq->seq();
    } else {
	say $subseq->revcom()->seq();
    }
}

sub to_db {
    my $ft = shift;
    my $ct = shift;
    my $orf_id = sprintf "ORF%04d", $ct;
    my $gid = 401; # this is bad and needs improvement: nothing should be hard-coded
    my $con_id = 111114823; # the same problem
    my $locus =  sprintf "%s", $ft->get_tag_values('locus_tag');
    my $prod = sprintf "%s", $ft->get_tag_values('product');
    $prod =~ tr/\'/_/;
    my $strand = ($ft->strand > 0) ? 't' : 'f';
    say join "\t", ($gid, $con_id, $orf_id, 'f', $ft->start, $ft->end, $strand, $locus, $prod);
}

Predict orthologs with reciprocal BLAST

Commands:

makeblastdb -in b31.pep -parse_seqids # Prepare the reference DB
makeblastdb -in new.pep -parse_seqids # Prepare the new genome DB
blastp -query new.pep -db b31.pep -outfmt 6 -evalue 1e-3 -out forward_blast.out # Forward BLAST
blastp -query b31.pep -db new.pep -outfmt 6 -evalue 1e-3 -out reverse_blast.out # Reverse BLAST
./check-reciprocal.pl forward_blast.out reverse_blast.out > new.orthlogs 2> new.not-orthologs # Identify orthologs

Code for "check-reciprocal.pl":

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

die "$0 <forward_blast_output> <reverse_blast_output>\n" unless @ARGV == 2;
my ($fwd, $rev) = @ARGV;
my (@fwd_top_hits, @rev_top_hits);
open FWD, "<" . $fwd;
my (%fwd_top_hits, %rev_top_hits, @query);
my $query_ct=0;
while (<FWD>) {
    chomp;
    my @data = split;
    next if $fwd_top_hits{$data[0]};
    $fwd_top_hits{$data[0]} = $data[1];
    push @query, $data[0];
    $query_ct++;
}
close FWD;
warn "Total query having hits:" . $query_ct . "\n";

open REV, "<" . $rev;
while (<REV>) {
    chomp;
    my @data = split;
    next if $rev_top_hits{$data[0]};
    $rev_top_hits{$data[0]} = $data[1];
}
close REV;

foreach my $q (@query) { # e.g., BafPKo_0002
    my $top = $fwd_top_hits{$q}; # e.g. BB_0002
    if ( $q eq $rev_top_hits{$top}) {
	print "Found reciprocol top hits:\t", $q, "\t", $top, "\n";
    } else {
	warn "Not reciprocol top hits:\t", $q, "\t", $top, "\t", $rev_top_hits{$top}, "\n";
    }
}

exit;

Verify with synteny broswer

./gb2fas -t new.gb > new-to-orf-table.txt

Annotate-a-genome: Difference between revisions

Revision as of 22:28, 13 February 2014

Contents

Project Goals

Download genome sequences from GenBank

Protocol

Fetch genome sequences

Predict orthologs with reciprocal BLAST

Verify with synteny broswer

Navigation menu

Annotate-a-genome: Difference between revisions

Revision as of 22:28, 13 February 2014

Project Goals

Download genome sequences from GenBank

Protocol

Fetch genome sequences

Predict orthologs with reciprocal BLAST

Verify with synteny broswer

Navigation menu

Search