BioPerl-based Sequence Utilities

Figure 1. Design and Methods of bioutils

What is bioutils?

bioutils is a suite of Perl scripts that provides convenient command-line access to popular BioPerl methods. Designed as UNIX utilities, these tools aim to remove the constant need (and urge) to compose one-off BioPerl scripts for routine manipulations of sequences, alignments, and trees.

The initial release of bioutils consists of four utilities (Figure 1):

bioseq: a wrapper of BioPerl class Bio::Seq (with additional methods)
bioaln: a wrapper of Bio::SimpleAlign (which inherits Bio::Seq; with additional methods)
biopop: a wrapper of Bio::PopGen (which can be converted from Bio::SimpleAlign; with additional methods)
biotree: a wrapper of Bio::Tree (with additional methods)

These utilities have been in development since 2002 in the lab of Dr Weigang Qiu at Hunter College of the City University of New York. They are the main code base of the Qiu Lab, which specializes in microbial evolutionary genomics. They have proved to be convenient, efficient, and popular among students and researchers passing through the lab. By releasing bioutils as an Open Source tool (perhaps as part of the BioPerl distribution), we hope to (1) share our experience and (2) invite other developers to join the effort of making BioPerl more accessible.

Live Demos

Basic Usage

bioseq

bioseq -l foo.fasta # print seq names and lengths from FASTA (default format) file
bioseq -r foo.fasta # reverse complement
bioseq -t1 foo.fasta # translate in the +1 frame
bioseq -t3 foo.fasta # translate in +1, +2, and +3 frames
bioseq -t6 foo.fasta # translate in all 6 frames
bioseq -p'id:seq_1' foo.fasta # pick a sequence by ID
bioseq -p'order:3' foo.fasta # pick the 3rd sequence
bioseq -p're:Human' foo.fasta # pick all sequences labeled as "Human" (by regular expression)
bioseq -g foo.fasta # remove all gaps
bioseq -z'CP003201' -o'genbank' # retrieve a GenBank file with accession
bioseq -z'CP003201' -o'fasta' # same file in FASTA

bioaln

bioaln -i'fasta' -o'phylip' foo.fasta # convert a FASTA alignment to PHYLIP
bioaln -l foo.aln # print alignment length of a CLUSTALW (default format) file
bioaln -s'100, 200' foo.aln # obtain an alignment slice
bioaln -m foo.aln # show only variable sites
bioaln -r'seq_2' foo.aln # use "seq_2" as reference (first) sequence
bioaln -g foo.aln # remove gapped sites
bioaln -p'seq_1,seq_3,seq_6' foo.aln # pick a subset of sequences
bioaln -d'seq_1,seq_3,seq_6' foo.aln # delete a subset of sequences

biotree

biopop

Power usage (with pipes)

# Pipe with the same utility
bioseq -p'order:5' foo.fasta | bioseq -s'100,200' | bioseq -r | bioseq -t1 # pick, subseq, revcom, and translate
# Pipe among utilities
bioaln -o'fasta' foo.aln | bioseq -g # remove gaps within individual sequences

Creative usage (with BASH utils)

echo -ne ">lookup\nATG\n" | bioseq -t1 # Lookup a codon product
len=$(bioaln -l foo.aln); len_degap=$(bioaln -g foo.aln | bioaln -l); echo "$len-$len_degap" | bc -l # count alignment gaps

Full documentation

PerlDoc for bioseq

NAME

bioseq - FASTA sequence editing module based on BioPerl.

SYNOPSIS

bioseq [options] [sequence file]

DESCRIPTION

bioseq reads a sequence file and can perform the following operations on it: reformat the input (FASTA by default) to GenBank or EMBL format; delete specified sequences; generate overlapping subsequences with a specified window size; generate the reverse complement (for nucleic acid sequences only); take an input list of sequences apart into individual sequence files; extract a specified subset of the sequence; linearize the sequence; remove gaps; find the longest open reading frame (ORF); remove stop codons; give the percentage composition of specified amino acids or nucleic acid bases; split the sequences as specified by the user; translate a specific frame of the input sequence; or extract a specific gene ID from a multi-sequence file. By default, bioseq assumes that both the input and the output are in FASTA format.

OPTIONS

--help, -h

Print a brief help message and exit.

<code>        Usage: bioseq -h <keyword></code>

--man, -m

Print the manual page and exit.

<code>        Usage: bioseq -m <keyword> </code>

--input, -i 'format'

Input file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'.

<code>        Usage: bioseq -i 'genbank' input_file</code>

--output, -o 'format'

Output file format. By default, this is 'fasta'. For Genbank format, use 'genbank'. For EMBL format, use 'embl'.

<code>        Usage: bioseq -o 'embl' input_file</code>

--noseq, -n

Print the number of sequences in the input file.

<code>        Usage: bioseq -n input_file</code>

--sub, -s 'min,max'

Select a substring (of the 1st sequence).

<code>        Usage: bioseq -s '<beginning index>, <ending index>' input_file
        Example:  bioseq -s'20,80' input_file (or --sub'20,80' or -s='20,80' or --sub='20,80')</code>

--lengths, -l

Print all sequence lengths.

<code>        Usage: bioseq -l input_file</code>

--leadgaps, -y

Count and return the number of leading gaps in each sequence.

<code>        Usage: bioseq -y input_file</code>

--pick, -p 'tag:value'

Select a single sequence or a comma-separated list of sequences, e.g.,

--pick 'id:foo' by id
--pick 'order:2' by order
--pick 're:REGEX' using a regular expression (only one regex is expected)

Specifically for a list of sequences,

--pick 'id:foo,bar' list by id
--pick 'order:2,3' list by order
--pick 'order:2-10' list by range

<code>        Usage: bioseq -p 'id:<number>' input_file</code>
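
The 'order:2-10' form selects records by their position in the file. A minimal sketch of that idea in plain awk follows; it is not how bioseq implements --pick, and the function name pick_by_order is hypothetical. It assumes a well-formed FASTA input in which every record starts with a '>' header line.

```shell
# Hypothetical helper (not part of bioseq): keep FASTA records whose
# 1-based position in the input lies between FIRST and LAST, inclusive.
# Usage: pick_by_order FIRST LAST [file ...]   (reads STDIN if no file)
pick_by_order() (
  first=$1
  last=$2
  shift 2
  # n counts header lines seen so far, so it is the index of the
  # record the current line belongs to.
  awk -v first="$first" -v last="$last" \
    '/^>/ { n++ } n >= first && n <= last' "$@"
)
```

For example, `pick_by_order 2 3 foo.fasta` would print the second and third records, headers included.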

--del, -d 'tag:value'

<code> Delete a sequence or a comma-separated list of sequences, e.g.,
 --del 'id:foo'         by id
 --del 'order:2'        by order
 --del 'length:n'       by min length, where 'n' is length
 --del 'ambig:x'        by min % ambiguous base/aa, where 'x' is the %
 --del 'id:foo,bar'     list by id
 --del 're:REGEX'       using a regular expression (only one regex is expected)

        Usage: bioseq --del 'id:<number>' input_file</code>

--revcom, -r

Reverse complement

<code>        Usage: bioseq -r input_file</code>
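
For readers unfamiliar with the operation, reverse complementing means reversing the sequence and then swapping each base for its partner. The sketch below shows this with awk and tr only; the function name revcom is hypothetical, it handles only the unambiguous letters A, C, G, and T, and it is an illustration rather than the BioPerl method that bioseq wraps.

```shell
# Hypothetical helper (not part of bioseq): reverse complement one
# DNA string given as the first argument.
revcom() {
  printf '%s\n' "$1" |
    # Step 1: reverse the string character by character.
    awk '{ out = ""
           for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
           print out }' |
    # Step 2: swap A<->T and C<->G, preserving case.
    tr 'ACGTacgt' 'TGCAtgca'
}
```

Calling `revcom ATGC` reverses the input to CGTA and then complements it to GCAT.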

--slidingwindow, -k 'window_size'

Generate overlapping subsequences with a specified window size (default 1)

<code>        Usage: bioseq -k '<window size>' input_file</code>
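
Overlapping subsequences of a fixed window size can be pictured as a window sliding along the sequence. The sketch below illustrates this with awk; the function name sliding_windows is hypothetical, and the step of one position between windows is an assumption, since the option description only specifies the window size.

```shell
# Hypothetical helper (not part of bioseq): print every window of
# WIDTH characters from SEQ, advancing STEP positions each time.
# Usage: sliding_windows SEQ WIDTH [STEP]   (STEP defaults to 1, assumed)
sliding_windows() {
  awk -v s="$1" -v w="$2" -v step="${3:-1}" 'BEGIN {
    # Stop once a full window no longer fits in the sequence.
    for (i = 1; i + w - 1 <= length(s); i += step)
      print substr(s, i, w)
  }'
}
```

For the sequence ATGCA and a window of 3, this prints ATG, TGC, and GCA on separate lines.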

--translate, -t [1|3|6]

Translate in 1, 3, or 6 frames, e.g., -t1, -t3, or -t6.

<code>        Usage: bioseq -t3 input_file</code>

--shred, -c

Shred into individual sequences

<code>        Usage: bioseq -c input_file</code>

--linearize, -L

Linearize FASTA.

<code>        Usage: bioseq -L fasta_file</code>

--extract, -e

Extract in-frame sequences.

<code>        Usage: bioseq -e input_file</code>

--nogaps, -g

Remove gaps

<code>        Usage: bioseq -g input_file</code>
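
As an illustration of what gap removal does to each record, the awk sketch below strips gap characters from sequence lines and leaves header lines untouched. The function name degap_fasta is hypothetical, and the sketch assumes '-' is the only gap character; bioseq itself may recognize others.

```shell
# Hypothetical helper (not part of bioseq): delete '-' characters from
# sequence lines of a FASTA stream, leaving '>' header lines as they are.
# Usage: degap_fasta [file ...]   (reads STDIN if no file)
degap_fasta() {
  awk '/^>/ { print; next }        # headers pass through unchanged
       { gsub(/-/, ""); print }    # sequence lines lose their gaps
      ' "$@"
}
```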

--longest-orf, -C

Output the frame that gives the longest ORF.

<code>        Usage: bioseq -C input_file</code>

--removestop, -x

Remove stop codons (e.g., for PAML input)

<code>        Usage: bioseq -x input_file</code>

--anonymize, -a 'n'

Replace sequence IDs with serial IDs 'n' characters long, including a leading 'S' (e.g., -a'5' gives S0001). Produces a sed script file with a '.sed' suffix that may be used with sed's '-f' argument. If the filename is '-', the sed file is named STDOUT.sed instead. The sed filename is specified on STDERR.

<code>        Usage: bioseq -a <number> input_file</code>

--prefix 'PREFIX'

Used in conjunction with --anonymize. This lets you specify a custom prefix for the anonymized sequence IDs given when using the --anonymize option. If this is given, then the whole prefix will count toward the total ID length. For example: suppose the prefix chosen is SEQ, and that for --anonymize you supplied 5. Then the maximum id length is 5, so there is room for only two more digits, e.g.,

SEQ01 atg...
SEQ02 atg...
SEQ03 atc...

If there are enough sequences that the length of the prefix plus the length of the digit portion exceeds the length given to --anonymize, a warning will be given:

aln-manipulations.pl -a 4 --prefix=SEQ
# output
SEQ1 atg...
...
SEQ10 atc...

<code> # more output
 WARNING: Anonymized ID length exceeded requested length: try a different length or prefix.</code>
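
The length arithmetic described for --anonymize and --prefix can be sketched as follows. This is an illustration under stated assumptions, not bioseq's code: the function name anon_id and its argument order are hypothetical, and the sketch only reproduces the rule that the prefix counts toward the total ID length and that a warning is issued when the ID overflows it.

```shell
# Hypothetical helper (not part of bioseq): build one anonymized ID.
# Usage: anon_id TOTAL_LENGTH SERIAL [PREFIX]   (PREFIX defaults to 'S')
anon_id() (
  total=$1
  serial=$2
  prefix=${3:-S}
  # Digits get whatever room the prefix leaves, but at least one column.
  width=$(( total - ${#prefix} ))
  if [ "$width" -lt 1 ]; then width=1; fi
  id=$(printf "%s%0${width}d" "$prefix" "$serial")
  # Warn on STDERR when the serial number no longer fits.
  if [ "${#id}" -gt "$total" ]; then
    echo "WARNING: Anonymized ID length exceeded requested length: try a different length or prefix." >&2
  fi
  printf '%s\n' "$id"
)
```

With these rules, `anon_id 5 1` gives S0001, `anon_id 5 2 SEQ` gives SEQ02, and `anon_id 4 10 SEQ` gives SEQ10 together with the warning.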

--composition, -w

Base or AA composition.

<code>        Usage: bioseq -w input_file</code>

--split, -S 'split_at'

Split the input sequence file into several smaller files with 'split_at' sequences per file. For instance, to split a file into several smaller files with 100000 sequences each, you would run:

<code> bioseq -S 100000 seq.fasta</code>

The output files will be named with the label "split_N" according to the input file (or with the STDIN prefix if the file is read via standard in), where N denotes the "part" or "split" number.
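
The chunking logic behind such a split can be sketched in awk: start a new output file every 'split_at' headers. The function name split_fasta and the output naming scheme PREFIX_split_N.fasta are assumptions made for this illustration; bioseq's actual file names may differ. The input is assumed to begin with a '>' header line.

```shell
# Hypothetical helper (not part of bioseq): write PER_FILE records per
# output file, named PREFIX_split_1.fasta, PREFIX_split_2.fasta, ...
# Usage: split_fasta PER_FILE input.fasta PREFIX
split_fasta() {
  awk -v n="$1" -v prefix="$3" '
    /^>/ {
      # Every n-th header opens the next part and closes the previous one.
      if (count % n == 0) {
        if (out) close(out)
        part++
        out = prefix "_split_" part ".fasta"
      }
      count++
    }
    { print > out }
  ' "$2"
}
```

For instance, `split_fasta 100000 seq.fasta seq` would write seq_split_1.fasta, seq_split_2.fasta, and so on, each holding at most 100000 records.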

--retrieveseq, -z 'sequence retriever using GenBank accession'

<code>    Usage: bioseq -z [Accession]
    Retrieves the sequence from GenBank using the provided GenBank accession. Prints out text in a fasta file.

  EXAMPLE:

    bioseq -z X83553

  OUTPUT:

    >X83553 B.garinii (PHei strain) opsC gene.
    ATGAAAAAGAATACATTAAGTGCGATATTAATGACTTTATTTTTATTTATATCTTGTAAT
    AATTCAGGTGGGGATACTGCATCTACTAATCCTGATGAGTCTGCGAAAGGACCTAATCTT
    ATAGAAATAAGCAAAAAAATTACAGATTCTAATGCATTTGTACTGGCTGTGAAAGAAGTT
    GAGGCTTTGATCTCATCTATAGATGAACTTGCTAATAAAGCTATTGGTAAAAAAATAAAT
    CAAAATGGTTTAGATGCTGATGCTAATCACAACGGATCATTGTTAGCAGGAGCCCATGCA
    ATATCAACTCTAATAAAACAAAAAACAGATGGATTGAAAGATCTAGAAGGGTTAAGTAAA
    GAAATTGCAAAGGTGAAGGAATGTTCCGATAAATTTACTAAAAAGCTAACAGATAGTCAT
    GCACAGCTTGGAGCAGTTGGTGGTGCTATTAATGATGATCGTGCAAAAGAAGCTATTTTA
    AAAACACATGGGACTAACGATAAGGGTGCTAAAGAACTTAAAGAGTTATCTGAATCAGTA
    GAAAGCTTGGCAAAAGCAGCTCAAGCAGCATTAGCTAATTCAGTTAAAGAGCTTACAAGT
    CCTGTTGTGGCAGAAAGTCCAAAAAAACCTTAA</code>

--dotplot, -D 'draw_dotplot'

Extract two sequences from input file and generate a dotplot.

<code>    Usage: bioseq -D 'id1,id2,window_size,slider' fasta_file
    </code>

Id1 and Id2 are extracted with their corresponding sequences. Be sure to use the entire sequence identifier, as this is a whole-string match. Window_size is the number of characters to compare (default 10). Slider is the number of windows to compare (default 10). The sequence corresponding to ID1 will appear on the X axis (row) and ID2 on the Y axis (column). This method works on both DNA and amino acid sequences.

<code>  Example:
    Sample Input:
    >id1
    ATACGA
    >id2
    ATACGA

    Command: bioseq -D 'id1,id2,3'
    Output:

        A T A C G A
    A   *
    T     *
    A       *
    C
    G
    A
    A
    C
    A
    A
    T
    G</code>

--rename, -R 'rename_id'

<code>    Usage: bioseq -R [file_with_new_names] file_to_be_changed.fasta

    Ex: bioseq -R list.txt test-bioseq.nuc

  Input:

    list.txt:

    VS116:7:310:IGS:11      VS116
    B31:1:100:IGS:11        B31

    *Left column is the pattern to be replaced by the right column

    file_to_be_changed.fasta:

    >VS116:7:310:IGS:11
    AATTTCAAAAATATAATATAAAAACAGCTAATCCAATAGAAAAATTTGAAATTTTTCTAT
    TGGATAAATTCTATACAAGAAGGTAAATA
    >B31:1:100:IGS:11
    AATTTTTAAAATATAATATAAAAACAGCTAATCCAATAGAAAAATTTTAAAACTTTTCTA
    TTGGATAGATTTTATACAAAGAAGGTAATA

  Output:
    >VS116
    AATTTCAAAAATATAATATAAAAACAGCTAATCCAATAGAAAAATTTGAAATTTTTCTAT
    TGGATAAATTCTATACAAGAAGGTAAATA
    >B31
    AATTTTTAAAATATAATATAAAAACAGCTAATCCAATAGAAAAATTTTAAAACTTTTCTA
    TTGGATAGATTTTATACAAAGAAGGTAATA</code>
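
The two-column lookup shown above can be sketched with awk. This is an illustration, not bioseq's implementation: the function name rename_fasta is hypothetical, and it assumes the left column must match the whole sequence ID exactly, whereas bioseq describes the left column as a pattern and may match more loosely.

```shell
# Hypothetical helper (not part of bioseq): replace FASTA IDs using a
# whitespace-separated two-column file (old ID, new ID).
# Usage: rename_fasta list.txt file_to_be_changed.fasta
rename_fasta() {
  awk '
    # First file: load the old -> new mapping.
    NR == FNR { map[$1] = $2; next }
    # Second file: rewrite headers whose ID is in the mapping.
    /^>/ {
      id = substr($1, 2)
      if (id in map) { print ">" map[id]; next }
    }
    { print }
  ' "$1" "$2"
}
```

Headers without an entry in the list are printed unchanged, as are all sequence lines.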

EXAMPLES

Section under construction...

REQUIRES

Perl 5.010, BioPerl

AUTHORS

<code> Weigang Qiu at genectr.hunter.cuny.edu
 Yözen Hernández yzhernand at gmail dot com
 Levy Vargas levy dot vargas at gmail dot com</code>

POD document for bioaln

POD document for biopop

POD document for biotree

Release 1.0 Notes

Installation
Dependency

Main contributors

Yozen Hernandez
Levy Vargas
Pedro Pagan
Che Martin
James Haven
Girish Ramrattan
Raymond Liang
Saymon Akther
Daniel Packer
Weigang Qiu
