Zawar summer 2016: Difference between revisions
imported>Zawarahmed (basic steps to using kraken) |
imported>Zawarahmed No edit summary |
Latest revision as of 17:07, 9 June 2016
Using Kraken
After logging in to the cluster, cd into the lustre/projects/qiulab directory. It contains a file, node-22.fa, that has been linked from a different directory. The '.fa' extension shows that the file is in FASTA format. Kraken accepts several input file formats, but FASTA is one of the most common.
To run kraken using this file format, use the following template:
kraken --db Database_Name --fasta-input fasta_file.fa > fasta_file.kraken
where Database_Name is the name of the Kraken database and fasta_file.fa is the name of the FASTA file. The output is redirected to a file with the '.kraken' extension.
In this case we used the following code:
kraken --db minikraken_20141208 --fasta-input node-22.fa > node-22.kraken
This may take a while to run, but once it's done, the output file will be in the format:
C/U (C if classified, U if not), SeqID (ID from the FASTA header), TxID (taxonomy ID assigned by Kraken), length of the sequence (in bp), and a list that indicates the LCA mapping for each k-mer in the sequence
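As a quick check on an output file in this format, the first field can be tallied to see how many sequences were classified. This is only a minimal sketch: it assumes the fields are tab-separated, and count_status is a made-up helper name, not part of Kraken.

```shell
# Hypothetical helper (not part of Kraken): tally the C/U flag in the
# first field of a .kraken file. Assumes the fields are tab-separated.
count_status() {
    awk -F'\t' '
        { n[$1]++ }
        END { printf "classified=%d unclassified=%d\n", n["C"], n["U"] }
    ' "$1"
}
```

It would be called as `count_status node-22.kraken`.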
When the above code is run, the node-22.kraken file produced starts as follows:
C NODE_22_length_2028197_cov_222.200806 208435 2028267 0:30 211110:1 0:12 ...
This means that the sequence was classified, its sequence ID is NODE_22_length_2028197_cov_222.200806, it was assigned taxonomy ID 208435, and its length is 2028267 bp. The last field says that the first 30 k-mers mapped to taxonomy ID 0, which means they were not found in the database (0:30), the next k-mer mapped to taxonomy ID 211110 (211110:1), and the following 12 k-mers were again not found in the database (0:12).
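Reading the k-mer list by eye gets tedious for a 2 Mb sequence, so it can help to add up the k-mer counts for each taxonomy ID. The following is a rough sketch, assuming tab-separated fields and space-separated taxid:count pairs in the fifth field; kmer_tally is a made-up helper name.

```shell
# Hypothetical helper (not part of Kraken): sum the k-mer counts per
# taxonomy ID over the fifth field of every line in a .kraken file.
# Assumes tab-separated fields and space-separated taxid:count pairs.
kmer_tally() {
    awk -F'\t' '
        {
            n = split($5, pairs, " ")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, ":")
                total[kv[1]] += kv[2]
            }
        }
        END { for (t in total) print t "\t" total[t] }
    ' "$1" | sort -k2,2nr -k1,1
}
```

Calling `kmer_tally node-22.kraken` prints one taxonomy ID per line with its total k-mer count, largest first. For the three pairs shown above (0:30, 211110:1, 0:12) the totals would be 42 k-mers for ID 0 and 1 k-mer for ID 211110.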
Once you have the .kraken file, you can use kraken-translate to get the full taxonomic name of each sequence, from the superkingdom down to the species name. The command used to run kraken-translate is:
kraken-translate --db Database_Name seq.kraken > seq.labels
Use the same database that you used when you ran kraken on the original .fa file. In this case:
kraken-translate --db minikraken_20141208 node-22.kraken > node-22.labels
If you look at the output file you'll see:
NODE_22_length_2028197_cov_222.200806 root;cellular organisms;Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus agalactiae;Streptococcus agalactiae serogroup V;Streptococcus agalactiae 2603V/R
The first part is the sequence ID. The second part is the taxonomic name, which runs through the taxonomic ranks from the root down, ending in this case with the strain name (Streptococcus agalactiae 2603V/R).
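If only the most specific name is needed, the last entry of the lineage can be pulled out of the .labels file. This is a minimal sketch, assuming a tab between the sequence ID and the lineage and ';' between the names; deepest_label is a made-up helper name.

```shell
# Hypothetical helper (not part of Kraken): print each sequence ID with
# the last (most specific) name in its lineage from a .labels file.
# Assumes a tab between ID and lineage, and ';' between lineage names.
deepest_label() {
    awk -F'\t' '
        {
            n = split($2, names, ";")
            print $1 "\t" names[n]
        }
    ' "$1"
}
```

Run as `deepest_label node-22.labels`, this would print the sequence ID followed by Streptococcus agalactiae 2603V/R for the line shown above.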
Find the NCBI record for the strain by searching NCBI for the species name.
Download the record associated with the strain in GenBank format using bioseq:
bioseq -f "NC_004116.1" -o 'genbank' > NC_004116.1.gb