Revision as of 18:02, 24 November 2015

Bp-utils: sequence, alignment & tree utilities by Qiu Lab

bioseq: sequence/FASTA manipulations

Use accession "CP002316.1" to retrieve the GenBank file from NCBI. Save the output (in GenBank format) to a file named "cp002316.gb".

bioseq -f "CP002316.1" -o'genbank' > cp002316.gb

Using the above file as input, extract the FASTA sequence of each gene and save the output to a new file called "cp002316.fas". Use this file for the following questions.

bioseq -i "genbank" -F cp002316.gb > cp002316.fas

Count the number of sequences.

bioseq -n cp002316.fas
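As a quick sanity check on the count, the header lines of the FASTA file can be counted directly in the shell. This is a minimal sketch that does not use bioseq: count_fasta_records is a hypothetical helper name, and it assumes every record starts with a line beginning with ">".

```shell
# Count FASTA records by counting header lines (lines starting with ">").
# count_fasta_records is a hypothetical helper, not part of bp-utils.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage: count_fasta_records cp002316.fas
```

The number printed should agree with the output of bioseq -n on the same file.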

In a single command, pick the first 10 sequences and find their lengths.

bioseq -p "order:1-10" cp002316.fas | bioseq -l
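The reported lengths can be cross-checked with plain awk. The sketch below is an assumption-laden illustration, not the bioseq implementation: fasta_lengths is a hypothetical helper that treats the first word after ">" as the sequence id and sums the residue characters on the following lines.

```shell
# Print "id<TAB>length" for every record in a FASTA file.
# fasta_lengths is a hypothetical helper, not part of bp-utils.
fasta_lengths() {
    awk '
        /^>/ { if (seen) print id "\t" len      # flush the previous record
               id = substr($1, 2); len = 0; seen = 1; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore whitespace
        END  { if (seen) print id "\t" len }
    ' "$1"
}

# Usage: fasta_lengths cp002316.fas | head -n 10
```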

In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is the correct one for both? Specify.

bioseq -p "order:3,7" cp002316.fas | bioseq -t3

Find the base composition of the last two sequences

bioseq -p "order:25-26" cp002316.fas | bioseq -c

Pick the sequence with id "BbuJD1_B11|8784|9302|1" and count the number of codons present in this sequence.

bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq -C

Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc

bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc
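The range "order:17-26" is hard-coded for a file with 26 sequences. A small shell sketch can build the selector from the actual record count instead. last_n_selector is a hypothetical helper name, and the sketch assumes one ">" header line per record.

```shell
# Print an "order:START-END" selector covering the last N records of a FASTA file.
# last_n_selector is a hypothetical helper, not part of bp-utils.
last_n_selector() {
    total=$(grep -c '^>' "$1")          # number of records in the file
    start=$(( total - $2 + 1 ))         # first record of the last N
    if [ "$start" -lt 1 ]; then start=1; fi
    printf 'order:%d-%d\n' "$start" "$total"
}

# Usage:
# bioseq -d "$(last_n_selector cp002316.fas 10)" cp002316.fas > cp002316-v2.nuc
```

For a file with 26 records and N = 10 the helper prints order:17-26, the same selector used above.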

In a single command, pick the first sequence, extract nucleotides 50-110, and make the reverse complement of the sub-sequence.

bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq -r
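To see what the sub-sequence and reverse-complement steps do, the same operations can be sketched with standard tools on a raw, single-line sequence (no FASTA header). The helper names subseq and revcomp are hypothetical, and only the bases A, C, G and T are handled.

```shell
# Extract positions START..END (1-based, inclusive) from a one-line sequence.
subseq() {
    cut -c "$1-$2"
}

# Reverse the sequence, then swap each base for its complement.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Usage: positions 2-5 of AAACCCGGGT are AACC, whose reverse complement is GGTT.
printf 'AAACCCGGGT\n' | subseq 2 5 | revcomp
```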

In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.

bioseq -s "1,100" cp002316.fas | bioseq -t1

bioaln: alignment/CLUSTALW manipulations

Go to /home/shared/LabMeetingReadings/Test-Data and find the sequence alignment file “bioaln_tutorial.aln”. Name the format of the alignment file. Use it to answer all the questions below.

CLUSTALW

Find the length of the alignment.

bioaln -l bioaln_tutorial.aln

Count the number of the sequences present in the alignment.

bioaln -n bioaln_tutorial.aln

How do you convert this alignment into PHYLIP format? Save the output.

bioaln -o "phylip" bioaln_tutorial.aln > test.phy

Pick “seq2, seq5, seq7, seq10” from the alignment and calculate their average percent identity.

bioaln -p "seq2, seq5, seq7, seq10" bioaln_tutorial.aln | bioaln -a

Get an alignment slice from “50-140” and find the average identities of the slice for sliding windows of 25.

bioaln -s "50, 140" bioaln_tutorial.aln | bioaln -w "25"

Extract conserved blocks from the alignment.

bioaln -B bioaln_tutorial.aln

Find the unique sequences and list their ids.

bioaln -u bioaln_tutorial.aln | bioaln -L

Extract third sites from the alignment and show only variable sites in match view.

bioaln -T bioaln_tutorial.aln | bioaln -v | bioaln -m

Remove the gaps and show the final alignment in codon view for an alignment slice “1-100”.

bioaln -s "1, 100" bioaln_tutorial.aln | bioaln -g | bioaln -c

Add a 90% consensus sequence and then show the final alignment in match plus codon view for an alignment slice “20-80”. (Hint: match view followed by codon view)

bioaln -s "20, 80" bioaln_tutorial.aln | bioaln -C "90" | bioaln -m | bioaln -c

biotree: tree/NEWICK manipulations

biopop: SNP statistics

Homology searching and clustering

BLAST+: search("google") for homologs/pairwise alignment

hmmer

cdhit

cdhit -i all.pep -o all.cdhit -c 0.5 -n 3

Options:

-i: input file
-o: output file
-c: sequence identity threshold (sequences below this identity are placed in different families)
-n: word length
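After the run, the cluster file can be summarized in the shell. This sketch assumes cd-hit wrote its usual cluster listing next to the -o output (here all.cdhit.clstr), in which each cluster begins with a ">Cluster N" line followed by one line per member. cluster_sizes is a hypothetical helper name.

```shell
# Print "cluster_number<TAB>member_count" for each cluster in a .clstr file.
# cluster_sizes is a hypothetical helper, not part of cd-hit.
cluster_sizes() {
    awk '
        /^>Cluster/ { if (seen) print id "\t" n   # flush the previous cluster
                      id = $2; n = 0; seen = 1; next }
        NF          { n++ }                       # count non-empty member lines
        END         { if (seen) print id "\t" n }
    ' "$1"
}

# Usage: cluster_sizes all.cdhit.clstr | sort -k2,2nr | head
```

Sorting by the second column lists the largest families first.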

interproscan

../../software/interproscan/interproscan-5.13-52.0/interproscan.sh -i trep-cdhit.representatives.pep2 -o  trep-representatives.tsv -t p -goterms -pa -f tsv

Documentation page: How to run

Programs for producing multiple alignments

MUSCLE

CLUSTALW

MAFFT

TCOFFEE

Programs for producing phylogeny & phylogenetic analysis

FastTree

PHYLIP

MrBayes

RAxML

PhyloNet

R packages for phylogenetics

APE

phangorn

phytools

Population genetics

ms: coalescence simulation

SFS: forward simulation

PAML: testing selection with Ka/Ks

Microbial genome databases & pipelines in Qiu Lab

borreliabase

pa2

spiro_genomes/treponema

Genome annotation pipeline

de novo variant call with cortex_var

Create a binary file from the FASTA genome file.

Run cortex_var_31_c1 (cutoff 1 used for 1 genome)

--se_list is the option that reads the list you want to target (e.g. list-genome.txt)
--kmer_size is the k-mer size; it has to be an odd integer
--mem_width always choose 17
--mem_height always choose 100
--dump_binary names the output binary file (e.g. Genome.ctx)

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 --se_list list-Evo.txt --kmer_size 31 --mem_width 17 --mem_height 100 --dump_binary Evo.ctx > Evo.log

Read each binary file (.ctx) into its own individual color list (ls Evo.ctx > Evo.colorlist). Then save these lists into their own collective colorlist.txt (ls *.ctx > colorlist.txt).

Reveal genetic variation using the Bubble Caller from cortex_var.

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --se_list colorlist.txt --kmer_size 31 --mem_width 17 --mem_height 100 --dump_binary all-colors.ctx > all-colors.log

Bubble caller will detect differences between each genome by assigning distinct colors to each genome (note that the UK spelling of color is used: colour)

--multicolour_bin holds your all-colors.ctx binary from the Bubble Caller
--detect_bubbles1 i/j Detects variation between genomes i and j, where i and j are the positions at which the genomes are listed in the colorlist.txt file. If the genome is fourth on the colorlist.txt, for example, its corresponding variable is 4
--output_bubbles1 Output variant reads in FASTA format (e.g. Evo-RefHG.var for bubble detection between the Evolved genome and the Reference HG genome)

--print_colour_coverages necessary for output

/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --kmer_size 31 --mem_height 17 --mem_width 100 --multicolour_bin all-colors.ctx --detect_bubbles1 0/1 --output_bubbles1 Evo-RefHG.var --print_colour_coverages > Evo-RefHG.log
