Zawar summer 2016

From QiuLab

Using Kraken

After logging in to the cluster, cd into the lustre/projects/qiulab directory. It contains a file, node-22.fa, that is linked from a different directory. The '.fa' extension indicates that the file is in FASTA format. Kraken accepts several input formats, but FASTA is one of the most common.

To run Kraken on a FASTA file, use the following template:

kraken --db Database_Name --fasta-input fasta_file.fa > fasta_file.kraken

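where Database_Name is the name of the Kraken database and fasta_file.fa is the name of the input FASTA file. Kraken writes its results to standard output, so '>' redirects them into fasta_file.kraken; '.kraken' is the extension used here for the output file.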
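As a small sketch of how the template gets filled in, the helper below builds the command line from a database name and a FASTA file name, and only prints it rather than running Kraken. The function name kraken_cmd and the file name my_sample.fa are made up for illustration, and the helper assumes the input ends in '.fa'.

```shell
# Hypothetical helper: build (but do not run) the Kraken command line
# for a FASTA file, naming the output after the input.
kraken_cmd() {
    db=$1
    fasta=$2
    out="${fasta%.fa}.kraken"   # strip a trailing .fa, then add .kraken
    printf 'kraken --db %s --fasta-input %s > %s\n' "$db" "$fasta" "$out"
}

# my_sample.fa is a placeholder file name, not a file on the cluster.
kraken_cmd Database_Name my_sample.fa
```

Printing the command first makes it easy to check the database and output names before starting a long run.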

In this case we used the following command:

kraken --db minikraken_20141208 --fasta-input node-22.fa > node-22.kraken

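This may take a while to run. Once it's done, each line of the output file describes one sequence, with the following tab-separated fields in order:

1. C/U: C if the sequence was classified, U if not
2. SeqID: the sequence ID from the FASTA header
3. TxID: the taxonomy ID assigned by Kraken
4. Length: the length of the sequence in bp
5. A list indicating the LCA (lowest common ancestor) mapping of each k-mer in the sequence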
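Since the first field is just a C or U flag, a quick tally of how many sequences were classified can be sketched with awk. This is a minimal example, assuming tab-separated Kraken output; the function name kraken_tally and the two sample lines are invented for illustration and are not real results.

```shell
# Count classified (C) and unclassified (U) sequences in Kraken output.
# Reads the file(s) named as arguments, or standard input if none are given.
kraken_tally() {
    awk -F '\t' '
        { n[$1]++ }    # field 1 is the C/U flag
        END { printf "classified=%d unclassified=%d\n", n["C"], n["U"] }
    ' "$@"
}

# Made-up two-line example (not real results):
printf 'C\tseqA\t208435\t100\t0:30 211110:1\nU\tseqB\t0\t80\t0:50\n' | kraken_tally
```

On a real run, the same function could be called as kraken_tally node-22.kraken.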

Running the kraken command on node-22.fa produces a node-22.kraken file that starts as follows:

C NODE_22_length_2028197_cov_222.200806 208435 2028267 0:30 211110:1 0:12 ...

This means that the sequence was classified, its sequence ID is NODE_22_length_2028197_cov_222.200806, its taxonomy ID is 208435, and its length is 2028267 bp. The last part says that the first 30 k-mers mapped to taxonomy ID 0, which means they were not found in the database (0:30); the next k-mer mapped to taxonomy ID 211110 (211110:1); and the following 12 k-mers were again not found in the database (0:12).
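To make the last field easier to read, the k-mer counts can be summed per taxonomy ID. The sketch below assumes tab-separated fields with space-separated ID:count pairs in field 5; the function name kmer_hits is hypothetical, and the sample line contains only the three pairs shown above, not the full (truncated) list.

```shell
# Sum the k-mer counts per taxonomy ID from field 5 of Kraken output.
# Reads the file(s) named as arguments, or standard input if none are given.
kmer_hits() {
    awk -F '\t' '
        {
            n = split($5, pairs, " ")       # e.g. "0:30", "211110:1", "0:12"
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, ":")    # kv[1] = taxonomy ID, kv[2] = k-mer count
                hits[kv[1]] += kv[2]
            }
        }
        END { for (t in hits) print t "\t" hits[t] }
    ' "$@" | sort -t "$(printf '\t')" -k2,2nr -k1,1   # largest count first
}

# Only the three pairs quoted in the text; the real line is much longer.
printf 'C\tNODE_22_length_2028197_cov_222.200806\t208435\t2028267\t0:30 211110:1 0:12\n' | kmer_hits
```

For these three pairs, the result is 42 k-mers for taxonomy ID 0 (30 + 12) and 1 k-mer for taxonomy ID 211110.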

Once you have the .kraken file, you can use kraken-translate to get the full taxonomic lineage of each sequence, from the root of the taxonomy down to the most specific name Kraken assigned. The template for running kraken-translate is:

kraken-translate --db Database_Name seq.kraken > seq.labels

Use the same database that you used when you ran kraken on the original .fa file. In this case:

kraken-translate --db minikraken_20141208 node-22.kraken > node-22.labels

If you look at the output file, you'll see:

NODE_22_length_2028197_cov_222.200806 root;cellular organisms;Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus agalactiae;Streptococcus agalactiae serogroup V;Streptococcus agalactiae 2603V/R

The first part is the sequence ID and the second is the taxonomic lineage, which runs through the taxonomic ranks from the root down to the most specific name, here the strain Streptococcus agalactiae 2603V/R.
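If only the most specific name is needed, it can be pulled out of the .labels file by splitting the lineage on semicolons. This is a minimal sketch, assuming the sequence ID and the lineage are separated by a tab; the function name last_taxon is made up for illustration.

```shell
# Print the sequence ID and the last (most specific) name of each lineage
# in kraken-translate output.
# Reads the file(s) named as arguments, or standard input if none are given.
last_taxon() {
    awk -F '\t' '{ n = split($2, names, ";"); print $1 "\t" names[n] }' "$@"
}

# The line from node-22.labels shown above:
printf 'NODE_22_length_2028197_cov_222.200806\troot;cellular organisms;Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus agalactiae;Streptococcus agalactiae serogroup V;Streptococcus agalactiae 2603V/R\n' | last_taxon
```

For the line above, this prints the sequence ID followed by Streptococcus agalactiae 2603V/R, which is the name to search for in the next step.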

Search NCBI for the strain name to find the record associated with the strain (here, accession NC_004116.1).

Then download that record in GenBank format using bioseq:

bioseq -f "NC_004116.1" -o 'genbank' > NC_004116.1.gb