Mini-Tutorals
Jump to navigation
Jump to search
Bp-utils: sequence, alignment & tree utilities by Qiu Lab
bioseq: sequence/FASTA manipulations
- Use accession "CP002316.1" to retrieve the Genbank file from NCBI. Save the output (in genbank format) to a file named as "cp002316.gb".
bioseq -f "CP002316.1" -o'genbank' > cp002316.gb
- Use the above file as input, extract FASTA sequences for each genes and save the output to a new file called "cp002316.nuc". Use this file for the following questions.
bioseq -i "genbank" -F cp002316.gb > cp002316.fas
- Count the number of sequences.
bioseq -n cp002316.fas
- In a single command, pick the first 10 sequences and find their length
bioseq -p "order:1-10" cp002316.fas | bioseq –l
- In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is the correct on both? Specify
bioseq -p "order:3,7" cp002316.fas | bioseq -t3
- Find the base composition of the last two sequences
bioseq -p "order:25-26" cp002316.fas| bioseq –c
- Pick the sequence with id "Bbu|D1_B11|8784|9302|1" and count the number of codons present in this sequence
bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq –C
- Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc
bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc
- In a single command, pick the first sequence, then get the 50-110 nucleotides and make reverse complement of the sub-sequences
bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq –r
- In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.
bioseq -s "1,100" cp002316.fas | bioseq -t1
bioaln: alignment/CLUSTALW manipulations
- Go to /home/shared/LabMeetingReadings/Test-Data and find the sequence alignment file “bioaln_tutorial.aln”. Name the format of the alignment file. Use it to answer all the questions below.
CLUSTALW
- Find the length of the alignment.
bioaln -l bioaln_tutorial.aln
- Count the number of the sequences present in the alignment.
bioaln -n bioaln_tutorial.aln
- How do you convert this alignment in phylip format? Save the output.
bioaln -o "phylip" bioaln_tutorial.aln > test.phy
- Pick “seq2, seq5, seq7, seq10” from the alignment and calculate their average percent identity.
bioaln -p "seq2, seq5, seq7, seq10" bioaln_tutorial.aln | bioaln -a
- Get an alignment slice from “50-140” and find the average identities of the slice for sliding windows of 25.
bioaln -s "50, 140" bioaln_tutorial.aln | bioaln -w "25"
- Extract conserved blocks from the alignment.
bioaln -B bioaln_tutorial.aln
- Find the unique sequences and list their ids.
bioaln -u bioaln_tutorial.aln | bioaln -L
- Extract third sites from the alignment and show only variable sites in match view.
bioaln -T bioaln_tutorial.aln | bioaln -v | bioaln -m
- Remove the gaps and show the final alignment in codon view for an alignment slice “1-100”.
bioaln -s "1, 100" bioaln_tutorial.aln | bioaln -g | bioaln -c
- Add a 90% consensus sequence and then show the final alignment in match plus codon view for an alignment slice “20-80”. (Hint: match view followed by codon view)
bioaln -s "20, 80" bioaln_tutorial.aln | bioaln -C "90" | bioaln -m | bioaln -c
biotree: tree/NEWICK manipulations
biopop: SNP statistics
Homology searching and clustering
BLAST+: search("google") for homologs/pariwise alignment
hmmer
cdhit
cdhit -i all.pep -o all.cdhit -c 0.5 -n 3
Options:
- -i: input file
- -o: output file
- -c: percent identity (below which it is considered different families)
- -n: word length
interproscan
../../software/interproscan/interproscan-5.13-52.0/interproscan.sh -i trep-cdhit.representatives.pep2 -o trep-representatives.tsv -t p -goterms -pa -f tsv
Documentation page: How to run
Programs for producing multiple alignments
MUSCLE
CLUSTALW
MAFT
TCOFFEE
Programs for producing phylogeny & phylogenetic analysis
FastTree
PHYLIP
MrBayes
RaXML
PhyloNet
R packages for phylogenetics
APE
phengorn
phytools
Population genetics
ms: coalescence simulation
SFS: forward simulation
PAML: testing selection with Ka/Ks
Microbial genome databases & pipelines in Qiu Lab
borreliabase
pa2
spiro_genomes/treponema
Genome annotation pipeline
de novo variant call with cortex_var
Create binary file of fasta genome file.
Run contex_var_31_c1 (cutoff 1 used for 1 genome)
- --se_list is the command the reads the list you want to target (ie: list-genome.txt)
- --kmer_size is the middle size, has to be an odd integer
- --mem_width always choose 17
- --mem_height always choose 100
- --dump_binary Name your file name (ie: Genome.ctx)
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1 --se_list list-Evo.txt --kmer_size 31 --mem_width 17 --mem_height 100 dump_binary Evo.ctx > Evo.log
Read each binary file (.ctx) into its own individual color list (ls Evo.ctx > Evo.colorlist) Then save these lists into their own collective colorlist.txt (ls *.ctx > colorlist.txt)
Reveal genetic variation using the Bubble Caller from cortex_var.
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --se_list colorlist.txt --kmer_size 31 --mem_width 17 --mem_height 100 dump_binary all-colors.ctx > all-colors.log
Bubble caller will detect differences between each genome by assigning distinct colors to each genome (note that the UK spelling of color is used: colour)
- --multicolour_bin holds your all-colors.ctx binary from the Bubble Caller
- --detect_bubbles1 i/i Detects 1 variation between genomes i and i. i indicates the position number the genome is listed on the colorlist.txt file. If the genome is fourth on the colorlist.txt, for example, its corresponding i variable is 4
- --output_bubbles1 Output variant reads in fasta format (ie: Evo-RefHG.var for bubble detection between
Evolved genome and Reference HG genome)
- --print_colour_coverages necessary for output
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5 --kmer_size 31 --mem_height 17 --mem_width 100 --multicolour_bin all-colors.ctx --detect_bubbles1 0/1 --output_bubbles1 Evo-RefHG.var --print colour coverages > Evo-RefHG.log