Annotate-a-genome: Difference between revisions

From QiuLab
Jump to navigation Jump to search
imported>Weigang
imported>Weigang
 
(28 intermediate revisions by the same user not shown)
Line 4: Line 4:
* Build an informatics pipeline for gene prediction, ortholog calls, databasing, and synteny analysis
* Build an informatics pipeline for gene prediction, ortholog calls, databasing, and synteny analysis


=Download genome sequences from GenBank=
=Claim your assigned genome=
{| class="wikitable sortable"
{| class="wikitable sortable"
|-
|-
Line 18: Line 18:
| 114 || CA382 || B. burgdorferi (California) || Lyme Disease ||
| 114 || CA382 || B. burgdorferi (California) || Lyme Disease ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP005925.1 main chromosome]  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP005925.1 main chromosome]  
|| Accession: CP005925; Assigned to: HA
||  
* Accession: CP005925; Assigned to: HA
|-
|-
| 115 || CA8 || B. burgdorferi (California) || Lyme Disease ||  
| 115 || CA8 || B. burgdorferi (California) || Lyme Disease ||  
Line 42: Line 43:
| 305 ||  NMJW1 || B. garinii (China) || Lyme Disease ||  
| 305 ||  NMJW1 || B. garinii (China) || Lyme Disease ||  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP003866.1 main chromosome]
* [http://www.ncbi.nlm.nih.gov/nuccore/CP003866.1 main chromosome]
|| Accession: CP003866.1; Assigned to: AL
||  
* Accession: CP003866; Assigned to: AL
|-
|-
| 402 || HLJ01 || B. afzelii (China) || Lyme Disease ||  
| 402 || HLJ01 || B. afzelii (China) || Lyme Disease ||  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP003882.1 main chromosome]
* [http://www.ncbi.nlm.nih.gov/nuccore/CP003882.1 main chromosome]
|| Accession: CP003882; Assigned to: RL
||  
*Accession: CP003882; Assigned to: RL
|-
|-
| 1003|| Ly || B. duttonii (Tanzania) || Relapsing Fever ||
| 1003|| Ly || B. duttonii (Tanzania) || Relapsing Fever ||
Line 67: Line 70:
| 1100 || DAH || B. hermsii (Washington State) || Relapsing Fever ||  
| 1100 || DAH || B. hermsii (Washington State) || Relapsing Fever ||  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP000048.1 main chromosome]  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP000048.1 main chromosome]  
|| Accession: CP000048; Assigned to: KR
||
* Accession: CP000048; Assigned to: KR
|-
| 1101 || MTW || B. hermsii (?) || Relapsing Fever ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP005680.1 main chromosome]
||
* Accession: CP005680; Assigned to: ?
|-
| 1102 || YBT || B. hermsii (?) || Relapsing Fever ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP005706.1 main chromosome]
||
* Accession: CP005706; Assigned to: ?
|-
| ?? || SLO || B. parkeri (?) || Relapsing Fever ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP005851.1 main chromosome]
||
* Accession: CP005706; Assigned to: ?
|-
| 1103 || YOR || B. hermsii (?) || Relapsing Fever ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP004146.1 main chromosome]
||
* Accession: CP004146; Assigned to: ?
|-
|-
| 1200 || 91E135 || B. turicatae (Texas) || Relapsing Fever ||
| 1200 || 91E135 || B. turicatae (Texas) || Relapsing Fever ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP000049.1 main chromosome]
* [http://www.ncbi.nlm.nih.gov/nuccore/CP000049.1 main chromosome]
|| Accession: CP000049; Assigned to: MDR
||  
* Accession: CP000049; Assigned to: MDR
|-
|-
| 1002 || Achema || B. crocidurae (Mauritania) || Relapsing Fever ||
| 1002 || Achema || B. crocidurae (Mauritania) || Relapsing Fever ||
* [http://www.ncbi.nlm.nih.gov/nuccore/CP003426.1 main chromosome]
* [http://www.ncbi.nlm.nih.gov/nuccore/CP003426.1 main chromosome]
* Many unassembled plasmids (not to include)  
* Many unassembled plasmids (not to include)  
|| Accession: CP003426; Assigned to: VS
||  
* Accession: CP003426; Assigned to: VS
|-
|-
|1400 || HR1 || B. parkeri (??) || Relapsing Fever ||  
|1400 || HR1 || B. parkeri (??) || Relapsing Fever ||  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP007022.1 main chromosome]
* [http://www.ncbi.nlm.nih.gov/nuccore/CP007022.1 main chromosome]
|| Accession: CP0007022; Assigned to: AV
||  
* Accession: CP0007022; Assigned to: AV
|-
|-
| 1300 || LB-2001 || B. miyamotoi (Northeast US) || Relapsing Fever ||  
| 1300 || LB-2001 || B. miyamotoi (Northeast US) || Relapsing Fever ||  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP006647.1 main chromosome]  
* [http://www.ncbi.nlm.nih.gov/nuccore/CP006647.1 main chromosome]  
|| Accession: CP006647; Assigned to: LLW
||  
* Accession: CP006647; Assigned to: LLW
|-
|-
| 107 || 94a || B. burgdorferi (Northeast US) || Lyme Disease ||  
| 107 || 94a || B. burgdorferi (Northeast US) || Lyme Disease ||  
* [http://www.ncbi.nlm.nih.gov/nuccore/ABGK02000008.1 main chromosome]  
* [http://www.ncbi.nlm.nih.gov/nuccore/ABGK02000008.1 main chromosome]  
|| Accession: ABGK02000008; Assigned to: QZ
||  
* Accession: ABGK02000008; Assigned to: QZ
|}
|}


Line 98: Line 127:
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download NCBI Standalone BLAST+]
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download NCBI Standalone BLAST+]


==Fetch genome sequences==
==Part 1. Fetch genome sequences and extract protein sequences==
* '''Note''': These scripts are in "../../bio425/annotate-a-genome-pipeline".  You may either make a copy to your home directory (recommended) or run directly from that directory by including the path
* Commands:
* Commands:
<syntaxhighlight lang="bash" line enclose="div">
<syntaxhighlight lang="bash" line enclose="div">
./bioseq -z 'b31_accession' -o 'genbank' > b31.gb # Reference genome for ortholog identification. Choose main, cp26, or lp54
./fetch-genome.pl <your_assigned_accession> # Expected output: "accession.gb"
./bioseq -z 'gb_accession' -o 'genbank' > new.gb
./gb2pep.pl <accession.gb> # Expected output: "accession.pep"
./gb2fas -n b31.gb > b31.nuc # Extract CDS
./gb2fas -n new.gb > new.nuc
./bioseq -t b31.nuc > b31.pep # Translate (and remove those with internal stop codons)
./bioseq -t new.nuc > new.pep
</syntaxhighlight>
</syntaxhighlight>
* Perl code for "gb2fas.pl":
<syntaxhighlight lang="perl" line enclose="div">
#!/usr/bin/env perl
# Extract sequences from a GenBank file
# Input: a GenBank file
# Output: -n: CDS sequences in FASTA; -t: CDS information in Tab-delimited
use strict;
use Bio::SeqIO;
use Getopt::Std;
use Data::Dumper;
use 5.10.0;


my %opts;
==Part 2. Predict orthologs with reciprocal BLAST==
getopts('tn',\%opts);
* '''Note 1: Replace the "accession" in the following commands with your assigned accession number.'''
 
* '''Note 2: You may have to lower the identity (-d, default 90) and length coverage (-l, default 90) for replapsing fever genomes'''
die "$0 [-nt] <genbank_file>\n" unless @ARGV == 1;
my $gb_file = shift @ARGV;
my $in = Bio::SeqIO->new(-file=>$gb_file, -format=>'genbank');
my $cds_ct=0;
while (my $seqobj = $in->next_seq() ) {
    my @features = $seqobj->get_SeqFeatures(); # just top level
    foreach my $feat ( @features ) {
next unless $feat->primary_tag eq "CDS";
$cds_ct++;
&to_db($feat, $cds_ct) if $opts{t};
&to_nt($feat, $seqobj) if $opts{n};
    }
}
 
exit;
 
sub to_nt {
    my $ft = shift;
    my $seq = shift;
    say ">", $ft->get_tag_values("locus_tag");
    my $subseq = $seq->trunc($ft->start, $ft->end);
    if ($ft->strand > 0) {
say $subseq->seq();
    } else {
say $subseq->revcom()->seq();
    }
}
 
sub to_db {
    my $ft = shift;
    my $ct = shift;
    my $orf_id = sprintf "ORF%04d", $ct;
    my $gid = 401; # this is bad and needs improvement: nothing should be hard-coded
    my $con_id = 111114823; # the same problem
    my $locus =  sprintf "%s", $ft->get_tag_values('locus_tag');
    my $prod = sprintf "%s", $ft->get_tag_values('product');
    $prod =~ tr/\'/_/;
    my $strand = ($ft->strand > 0) ? 't' : 'f';
    say join "\t", ($gid, $con_id, $orf_id, 'f', $ft->start, $ft->end, $strand, $locus, $prod);
}
</syntaxhighlight>
 
==Predict orthologs with reciprocal BLAST==
* Commands:
* Commands:
<syntaxhighlight lang="bash" line start="7" enclose="div">
<syntaxhighlight lang="bash" line start="3" enclose="div">
makeblastdb -in b31.pep -parse_seqids # Prepare the reference DB
cp ../../bio425/annotate-genome-pipeline/b31.pep . # get reference genome
makeblastdb -in new.pep -parse_seqids # Prepare the new genome DB
makeblastdb -in b31.pep -dbtype 'prot' -parse_seqids -out ref # Prepare the ref genome DB
blastp -query new.pep -db b31.pep -outfmt 6 -evalue 1e-3 -out forward_blast.out # Forward BLAST
makeblastdb -in accession.pep -dbtype 'prot' -parse_seqids -out mygenome # Prepare the new genome DB
blastp -query b31.pep -db new.pep -outfmt 6 -evalue 1e-3 -out reverse_blast.out # Reverse BLAST
blastp -query accession.pep -db ref -outfmt '6 qseqid sseqid pident length qlen evalue' -evalue 1e-3 -out accession.fwd # Forward BLAST # customized outfmt 6
./check-reciprocal.pl forward_blast.out reverse_blast.out > new.orthlogs 2> new.not-orthologs # Identify orthologs
blastp -query b31.pep -db mygenome -outfmt '6 qseqid sseqid pident length qlen evalue' -evalue 1e-3 -out accession.rev # Reverse BLAST
</syntaxhighlight>
./find-reciprocal.pl <accession.fwd> <accession.rev> > accession.orthlogs 2> accession.not-orthologs # Identify orthologs.  
* Code for "check-reciprocal.pl":
# check results. If orthologs less than 90% of total ORFs in genbank file, run the previous command with more relaxed stringency (use "-d 80", 80% identify cutoff)
<syntaxhighlight lang="perl" line enclose="div">
wc accession.orthlogs
#!/usr/bin/env perl
wc accession.not-orthologs
use strict;
use warnings;
use Data::Dumper;
 
die "$0 <forward_blast_output> <reverse_blast_output>\n" unless @ARGV == 2;
my ($fwd, $rev) = @ARGV;
my (@fwd_top_hits, @rev_top_hits);
open FWD, "<" . $fwd;
my (%fwd_top_hits, %rev_top_hits, @query);
my $query_ct=0;
while (<FWD>) {
    chomp;
    my @data = split;
    next if $fwd_top_hits{$data[0]};
    $fwd_top_hits{$data[0]} = $data[1];
    push @query, $data[0];
    $query_ct++;
}
close FWD;
warn "Total query having hits:" . $query_ct . "\n";
 
open REV, "<" . $rev;
while (<REV>) {
    chomp;
    my @data = split;
    next if $rev_top_hits{$data[0]};
    $rev_top_hits{$data[0]} = $data[1];
}
close REV;
 
foreach my $q (@query) { # e.g., BafPKo_0002
    my $top = $fwd_top_hits{$q}; # e.g. BB_0002
    if ( $q eq $rev_top_hits{$top}) {
print "Found reciprocol top hits:\t", $q, "\t", $top, "\n";
    } else {
warn "Not reciprocol top hits:\t", $q, "\t", $top, "\t", $rev_top_hits{$top}, "\n";
    }
}
 
exit;
</syntaxhighlight>
</syntaxhighlight>


==Verify with synteny broswer==
==Part 3. Generate database tables & deposit your results==
* '''Note: Use your assigned genome_id in the table above as the argument for the "-g" option'''
<syntaxhighlight lang="bash" line start="12" enclose="div">
<syntaxhighlight lang="bash" line start="12" enclose="div">
./gb2fas -t new.gb > new-to-orf-table.txt
./gb2table.pl -g <genome_id> -c <accession.gb> # Expected output: "accession.contig.txt"
# load into database with SQL
./gb2table.pl -g <genome_id> -f <accession.gb> # Expected output: "accession.orf.txt"
# Visualize synteny
# Check your outputs
wc <accession.contig.txt>
head <accession.contig.txt>
tail <accession.contig.txt>
wc <accession.orf.txt>
head <accession.orf.txt>
tail <accession.orf.txt>
cp accession.contig.txt ../../bio425/annotate-a-genome-results/your_initial.accession.contig.txt
cp accession.orf.txt ../../bio425/annotate-a-genome-results/your_initial.accession.orf.txt
cp accession.orthologs ../../bio425/annotate-a-genome-results/your_initial.accession.orth
cp accession.not-orthologs ../../bio425/annotate-a-genome-results/your_initial.accession.not-orth
</syntaxhighlight>
</syntaxhighlight>

Latest revision as of 05:24, 23 May 2014

Project Goals

A Borrelia Phylogeny
  • Annotate and add newly sequenced Borrelia genomes to BorreliaBase
  • Build an informatics pipeline for gene prediction, ortholog calls, databasing, and synteny analysis

Claim your assigned genome

Genome_id Strain Species Group Genome Sequences Notes
100 B31 B. burgdorferi (reference genome) Lyme Disease Reference. Already downloaded as "ref.pep"
114 CA382 B. burgdorferi (California) Lyme Disease
  • Accession: CP005925; Assigned to: HA
115 CA8 B. burgdorferi (California) Lyme Disease
  • Accession: ADMY01000001; Assigned to: AA
  • Accession: ADMY01000002; Assigned to: TAA
  • Accession: ADMY01000003; Assigned to: KD
  • Accession: ADMY01000004; Assigned to: JG
  • Accession: ADMY01000005; Assigned to: KPG
  • Accession: ADMY01000006; Assigned to: GG
  • Accession: ADMY01000007; Assigned to: TDH
304 BgVir B. garinii (Russia) Lyme Disease
  • Accession: CP003151; Assigned to: LH
  • Accession: CP003201; Assigned to: SK
  • Accession: CP003202; Assigned to: BK
305 NMJW1 B. garinii (China) Lyme Disease
  • Accession: CP003866; Assigned to: AL
402 HLJ01 B. afzelii (China) Lyme Disease
  • Accession: CP003882; Assigned to: RL
1003 Ly B. duttonii (Tanzania) Relapsing Fever
  • Accession: CP000976; Assigned to: HL
  • Accession: CP000980; Assigned to: NM
1001 A1 B. recurrentis (Ethiopia) Relapsing Fever
  • Accession: CP000993; Assigned to: JP
  • Accession: CP000994; Assigned to: DP
  • Accession: CP000995; Assigned to: GAR
1100 DAH B. hermsii (Washington State) Relapsing Fever
  • Accession: CP000048; Assigned to: KR
1101 MTW B. hermsii (?) Relapsing Fever
  • Accession: CP005680; Assigned to: ?
1102 YBT B. hermsii (?) Relapsing Fever
  • Accession: CP005706; Assigned to: ?
?? SLO B. parkeri (?) Relapsing Fever
  • Accession: CP005706; Assigned to: ?
1103 YOR B. hermsii (?) Relapsing Fever
  • Accession: CP004146; Assigned to: ?
1200 91E135 B. turicatae (Texas) Relapsing Fever
  • Accession: CP000049; Assigned to: MDR
1002 Achema B. crocidurae (Mauritania) Relapsing Fever
  • Accession: CP003426; Assigned to: VS
1400 HR1 B. parkeri (??) Relapsing Fever
  • Accession: CP0007022; Assigned to: AV
1300 LB-2001 B. miyamotoi (Northeast US) Relapsing Fever
  • Accession: CP006647; Assigned to: LLW
107 94a B. burgdorferi (Northeast US) Lyme Disease
  • Accession: ABGK02000008; Assigned to: QZ

Protocol

Dependencies

Part 1. Fetch genome sequences and extract protein sequences

  • Note: These scripts are in "../../bio425/annotate-a-genome-pipeline". You may either make a copy to your home directory (recommended) or run directly from that directory by including the path
  • Commands:
./fetch-genome.pl <your_assigned_accession> # Expected output: "accession.gb"
./gb2pep.pl <accession.gb> # Expected output: "accession.pep"

Part 2. Predict orthologs with reciprocal BLAST

  • Note 1: Replace the "accession" in the following commands with your assigned accession number.
  • Note 2: You may have to lower the identity (-d, default 90) and length coverage (-l, default 90) for replapsing fever genomes
  • Commands:
cp ../../bio425/annotate-genome-pipeline/b31.pep . # get reference genome
makeblastdb -in b31.pep -dbtype 'prot' -parse_seqids -out ref # Prepare the ref genome DB
makeblastdb -in accession.pep -dbtype 'prot' -parse_seqids -out mygenome # Prepare the new genome DB
blastp -query accession.pep -db ref -outfmt '6 qseqid sseqid pident length qlen evalue' -evalue 1e-3 -out accession.fwd # Forward BLAST # customized outfmt 6
blastp -query b31.pep -db mygenome -outfmt '6 qseqid sseqid pident length qlen evalue' -evalue 1e-3 -out accession.rev # Reverse BLAST
./find-reciprocal.pl <accession.fwd> <accession.rev> > accession.orthlogs 2> accession.not-orthologs # Identify orthologs. 
# check results. If orthologs less than 90% of total ORFs in genbank file, run the previous command with more relaxed stringency (use "-d 80", 80% identify cutoff)
wc accession.orthlogs
wc accession.not-orthologs

Part 3. Generate database tables & deposit your results

  • Note: Use your assigned genome_id in the table above as the argument for the "-g" option
./gb2table.pl -g <genome_id> -c <accession.gb> # Expected output: "accession.contig.txt"
./gb2table.pl -g <genome_id> -f <accession.gb> # Expected output: "accession.orf.txt"
# Check your outputs
wc <accession.contig.txt>
head <accession.contig.txt>
tail <accession.contig.txt>
wc <accession.orf.txt>
head <accession.orf.txt>
tail <accession.orf.txt>
cp accession.contig.txt ../../bio425/annotate-a-genome-results/your_initial.accession.contig.txt
cp accession.orf.txt ../../bio425/annotate-a-genome-results/your_initial.accession.orf.txt
cp accession.orthologs ../../bio425/annotate-a-genome-results/your_initial.accession.orth
cp accession.not-orthologs ../../bio425/annotate-a-genome-results/your_initial.accession.not-orth