Annotate-a-genome

From QiuLab
Revision as of 21:33, 4 March 2014 by imported>Weigang (→‎Download genome sequences from GenBank)
Jump to navigation Jump to search

Project Goals

A Borrelia Phylogeny
  • Annotate and add newly sequenced Borrelia genomes to BorreliaBase
  • Build an informatics pipeline for gene prediction, ortholog calls, databasing, and synteny analysis

Download genome sequences from GenBank

Genome_id Strain Species Group Genome Sequences Notes
100 B31 B. burgdorferi (reference genome) Lyme Disease Example
114 CA382 B. burgdorferi (California) Lyme Disease Example
115 CA8 B. burgdorferi (California) Lyme Disease Example
304 BgVir B. garinii (Russia) Lyme Disease Example
305 NMJW1 B. garinii (China) Lyme Disease Example
402 HLJ01 B. afzelii (China) Lyme Disease Example
1003 Ly B. duttonii (Tanzania) Relapsing Fever Example
1001 A1 B. recurrentis (Ethiopia) Relapsing Fever Example
1100 DAH B. hermsii (Washington State) Relapsing Fever Example
1200 91E135 B. turicatae (Texas) Relapsing Fever Example
1002 Achema B. crocidurae (Mauritania) Relapsing Fever Example
1400 HR1 B. parkeri (??) Relapsing Fever Example
1300 LB-2001 B. miyamotoi (Northeast US) Relapsing Fever Example

Protocol

Dependencies

Fetch genome sequences

  • Commands:
./bioseq -z 'b31_accession' -o 'genbank' > b31.gb # Reference genome for ortholog identification. Choose main, cp26, or lp54
./bioseq -z 'gb_accession' -o 'genbank' > new.gb
./gb2fas -n b31.gb > b31.nuc # Extract CDS
./gb2fas -n new.gb > new.nuc
./bioseq -t b31.nuc > b31.pep # Translate (and remove those with internal stop codons)
./bioseq -t new.nuc > new.pep
  • Perl code for "gb2fas.pl":
#!/usr/bin/env perl
# Extract sequences from a GenBank file
# Input: a GenBank file
# Output: -n: CDS sequences in FASTA; -t: CDS information in Tab-delimited
use strict;
use Bio::SeqIO;
use Getopt::Std;
use Data::Dumper;
use 5.10.0;

my %opts;
getopts('tn',\%opts);

die "$0 [-nt] <genbank_file>\n" unless @ARGV == 1;
my $gb_file = shift @ARGV;
my $in = Bio::SeqIO->new(-file=>$gb_file, -format=>'genbank');
my $cds_ct=0;
while (my $seqobj = $in->next_seq() ) {
    my @features = $seqobj->get_SeqFeatures(); # just top level
    foreach my $feat ( @features ) {
	next unless $feat->primary_tag eq "CDS";
	$cds_ct++;
	&to_db($feat, $cds_ct) if $opts{t};
	&to_nt($feat, $seqobj) if $opts{n};
    }
}

exit;

sub to_nt {
    my $ft = shift;
    my $seq = shift;
    say ">", $ft->get_tag_values("locus_tag");
    my $subseq = $seq->trunc($ft->start, $ft->end);
    if ($ft->strand > 0) {
	say $subseq->seq();
    } else {
	say $subseq->revcom()->seq();
    }
}

sub to_db {
    my $ft = shift;
    my $ct = shift;
    my $orf_id = sprintf "ORF%04d", $ct;
    my $gid = 401; # this is bad and needs improvement: nothing should be hard-coded
    my $con_id = 111114823; # the same problem
    my $locus =  sprintf "%s", $ft->get_tag_values('locus_tag');
    my $prod = sprintf "%s", $ft->get_tag_values('product');
    $prod =~ tr/\'/_/;
    my $strand = ($ft->strand > 0) ? 't' : 'f';
    say join "\t", ($gid, $con_id, $orf_id, 'f', $ft->start, $ft->end, $strand, $locus, $prod);
}

Predict orthologs with reciprocal BLAST

  • Commands:
makeblastdb -in b31.pep -parse_seqids # Prepare the reference DB
makeblastdb -in new.pep -parse_seqids # Prepare the new genome DB
blastp -query new.pep -db b31.pep -outfmt 6 -evalue 1e-3 -out forward_blast.out # Forward BLAST
blastp -query b31.pep -db new.pep -outfmt 6 -evalue 1e-3 -out reverse_blast.out # Reverse BLAST
./check-reciprocal.pl forward_blast.out reverse_blast.out > new.orthlogs 2> new.not-orthologs # Identify orthologs
  • Code for "check-reciprocal.pl":
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

die "$0 <forward_blast_output> <reverse_blast_output>\n" unless @ARGV == 2;
my ($fwd, $rev) = @ARGV;
my (@fwd_top_hits, @rev_top_hits);
open FWD, "<" . $fwd;
my (%fwd_top_hits, %rev_top_hits, @query);
my $query_ct=0;
while (<FWD>) {
    chomp;
    my @data = split;
    next if $fwd_top_hits{$data[0]};
    $fwd_top_hits{$data[0]} = $data[1];
    push @query, $data[0];
    $query_ct++;
}
close FWD;
warn "Total query having hits:" . $query_ct . "\n";

open REV, "<" . $rev;
while (<REV>) {
    chomp;
    my @data = split;
    next if $rev_top_hits{$data[0]};
    $rev_top_hits{$data[0]} = $data[1];
}
close REV;

foreach my $q (@query) { # e.g., BafPKo_0002
    my $top = $fwd_top_hits{$q}; # e.g. BB_0002
    if ( $q eq $rev_top_hits{$top}) {
	print "Found reciprocol top hits:\t", $q, "\t", $top, "\n";
    } else {
	warn "Not reciprocol top hits:\t", $q, "\t", $top, "\t", $rev_top_hits{$top}, "\n";
    }
}

exit;

Verify with synteny broswer

./gb2fas -t new.gb > new-to-orf-table.txt
# load into database with SQL
# Visualize synteny