Bioutils
BioPerl-based Sequence Utilities
What is bputils?
bputils is a suite of Perl scripts that provide convenient command-line access to popular BioPerl methods. Designed as UNIX utilities, these tools aim to circumvent a constant need (and urge) to compose one-off BioPerl scripts for routine manipulations of sequences, alignments and trees.
The initial release of bputils consists of four inter-connected utilities (Figure 1):
- bioseq: a wrapper of BioPerl class Bio::Seq (with additional methods)
- bioaln: a wrapper of Bio::SimpleAlign (which inherits Bio::Seq; with additional methods)
- biopop: a wrapper of Bio::PopGen (which can be converted from Bio::SimpleAlign; with additional methods)
- biotree: a wrapper of Bio::Tree (with additional methods)
These utilities have been in development since 2002 in the lab of Dr. Weigang Qiu at Hunter College of the City University of New York. They are the main code base of the Qiu Lab, which specializes in microbial evolutionary genomics. They have proved to be convenient, efficient, and popular among students and researchers passing through the lab. By releasing bputils as an Open Source tool (perhaps as part of the BioPerl distribution), we hope to (1) share our tools and (2) invite other developers to join the effort of making BioPerl more accessible.
Other Unix utilities for genomics
- BioPerl scripts (This is probably where these utilities will eventually be housed)
- EMBOSS: Official site; a third-party manual
- NEWICK Utilities: Official website; A publication
Live Demos
Basic Usage
- bioseq
bioseq -l foo.fasta # print seq names and lengths from FASTA (default format) file
bioseq -r foo.fasta # reverse complement
bioseq -t1 foo.fasta # translate in the +1 frame
bioseq -t3 foo.fasta # translate in +1, +2, and +3 frames
bioseq -t6 foo.fasta # translate in all 6 frames
bioseq -p'id:seq_1' foo.fasta # pick a sequence by ID
bioseq -p'order:3' foo.fasta # pick the 3rd sequence
bioseq -p're:Human' foo.fasta # pick all sequences labeled as "Human" (by regular expression)
bioseq -g foo.fasta # remove gaps
bioseq -z'CP003201' -o'genbank' # retrieve a GenBank file with accession
bioseq -z'CP003201' -o'fasta' # same file in FASTA
- bioaln
bioaln -i'fasta' -o'phylip' foo.fasta # convert FASTA alignment to PHYLIP
bioaln -l foo.aln # print alignment length of a CLUSTALW (default format) file
bioaln -s'100, 200' foo.aln # obtain an alignment slice
bioaln -m foo.aln # show only variable sites
bioaln -r'seq_2' foo.aln # set "seq_2" as reference (first) sequence
bioaln -g foo.aln # remove gapped sites
bioaln -p'seq_1,seq_3,seq_6' foo.aln # pick a subset of sequences
bioaln -d'seq_1,seq_3,seq_6' foo.aln # delete a subset of sequences
- biotree
biotree -l foo.newick # total tree length
biotree -e 'outgroup' foo.newick # re-root on an "outgroup"
biotree -z foo.newick # print pair-wise tree distances between leaves
biotree -p 'node1' -p 'node2' foo.newick # remove "node1" and "node2"
biotree -s 'node1' -s 'node2' -s 'node3' -s'node4' foo.newick # subtree consisting ONLY of these nodes
- biopop
biopop --stats 'pi,theta' foo.aln # print pi and theta of population (default: CLUSTALW alignment)
biopop -m foo.aln # print mismatch distribution
biopop -d foo.aln # print distance matrix (default: Jukes-Cantor model; should be moved back into "bioaln")
biopop -s foo.aln # obtain number of segregating sites (i.e., SNPs)
biopop -a foo.aln # get heterozygosity for each SNP site
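To try the commands above without real data, a small input file helps. The sketch below writes a three-record FASTA file with made-up sequences and cross-checks the sequence lengths with plain awk. The file name sample.fasta and the helper names make_sample_fasta and fasta_lengths are hypothetical, and the awk code is an independent sanity check, not how bioseq works internally.

```shell
#!/usr/bin/env bash
# Write a tiny three-record FASTA file (made-up sequences, for illustration only).
make_sample_fasta() {
    cat > "$1" <<'EOF'
>seq_1
ATGGCCATTGTAATGGGCCGC
>seq_2
ATGAAACGCATTAGCACC
>seq_3
ATGCCCGGGTTTAAA
EOF
}

# Print "ID<TAB>length" for every record, using awk only.
# Handles sequences wrapped over several lines.
fasta_lengths() {
    awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (id != "") print id "\t" len }' "$1"
}

make_sample_fasta sample.fasta
fasta_lengths sample.fasta
```

The resulting sample.fasta can stand in for foo.fasta in the bioseq examples, and the awk output gives a reference to compare against the lengths reported by bioseq -l.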
Power usage (with pipes)
bioseq -p'order:5' foo.fasta | bioseq -s'100,200' | bioseq -r | bioseq -t1 # pick, subseq, revcom, and translate
bioaln -o'fasta' foo.aln | bioseq -g # remove gaps within individual sequences
bioaln -o'fasta' foo.aln | bioseq -t1 | bioaln -i'fasta' # turn a nucleotide alignment into a peptide alignment
biotree -s'otu1' -s'otu2' -s'otu3' foo.newick | biotree -l # subset a large tree and get total tree length
Creative usage (with BASH)
echo -ne ">lookup\nATG\n" | bioseq -t1 # look up the amino acid encoded by a codon
len=$(bioaln -l foo.aln); len_degap=$(bioaln -g foo.aln | bioaln -l); echo "$len-$len_degap" | bc -l # count alignment gaps
for i in {1..100}; do bioaln --boot foo.aln >> foo.boot; done # bootstrap an alignment (with replacement)
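The gap-count one-liner above can also be wrapped in a small function that uses shell arithmetic instead of bc. This is a minimal sketch: the function name gap_columns is hypothetical, and it assumes that bioaln -l prints a single integer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of alignment columns removed by degapping.
# $1 = full alignment length, $2 = length after removing gapped sites.
gap_columns() {
    if [ "$2" -gt "$1" ]; then
        echo "error: degapped length exceeds full length" >&2
        return 1
    fi
    echo $(( $1 - $2 ))
}

# Intended use (assumes `bioaln -l` prints one integer):
#   gap_columns "$(bioaln -l foo.aln)" "$(bioaln -g foo.aln | bioaln -l)"

# Stand-alone demonstration with made-up lengths:
gap_columns 120 100
```

The sanity check on the two lengths catches the case where the arguments are passed in the wrong order.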
Even more powerful usage (with other applications)
# Permutation Tail Probability (PTP) test for tree-ness:
for i in {1..100}; do
bioaln -P -i'fasta' -o'fasta' nt.fas > nt.permuted.fas;
FastTree -nt nt.permuted.fas 2> /dev/null | biotree -l >> tree-length.txt # tree length of permuted alignment
done;
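The loop above only collects tree lengths of permuted alignments. To finish the test, the tree length of the original alignment is compared against that null distribution. The sketch below shows one common way to do so, a one-sided permutation p-value computed as (k + 1) / (n + 1), where k is the number of permuted lengths no greater than the observed one. The function name ptp_pvalue is hypothetical, and the sketch assumes that biotree -l prints a single number and that tree-length.txt holds one value per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one-sided permutation p-value.
# $1 = tree length of the original (unpermuted) alignment
# $2 = file with one permuted tree length per line
ptp_pvalue() {
    awk -v obs="$1" '
        NF { n++; if ($1 + 0 <= obs + 0) k++ }   # count permuted lengths <= observed
        END {
            if (n == 0) { print "no permuted values found" > "/dev/stderr"; exit 1 }
            printf "%.4f\n", (k + 1) / (n + 1)
        }' "$2"
}

# Intended use (assumes `biotree -l` prints one number):
#   obs=$(FastTree -nt nt.fas 2> /dev/null | biotree -l)
#   ptp_pvalue "$obs" tree-length.txt

# Stand-alone demonstration with made-up numbers:
demo=$(mktemp)
printf '%s\n' 2.31 2.45 2.52 2.40 2.60 > "$demo"
ptp_pvalue 1.80 "$demo"
rm -f "$demo"
```

A small p-value indicates that the original alignment yields a shorter tree than most permuted alignments, which is the pattern expected when the data carry tree-like signal. Adding 1 to both numerator and denominator keeps the estimate from ever being exactly zero with a finite number of permutations.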
Full documentation
POD document for bioseq
POD document for bioaln
POD document for biopop
POD document for biotree
Release 1.0 Notes
- Installation
- Dependency
Contributors
- Yozen Hernandez
- Levy Vargas
- Pedro Pagan
- Che Martin
- James Haven
- Girish Ramrattan
- Raymond Liang
- Saymon Akther
- Daniel Packer
- Weigang Qiu