Mini-Tutorals
Jump to navigation
Jump to search
Running a Screen Session
use byobu
- start a screen session by typing "byobu"
- run commands
- Detach by pressing "F6"
- Reattach by typing "byobu"
- Terminate by typing "exit"
use screen
- Start a screen session
screen
- Detach the running mission
ctrl + A + D
- Show the list of running process
screen -ls
- Reattach a running process
screen -r ProcessID
- Terminate a process
ctrl + A
:quit
Bp-utils: sequence, alignment & tree utilities by Qiu Lab
bioseq: sequence/FASTA manipulations
- Use accession "CP002316.1" to retrieve the Genbank file from NCBI. Save the output (in genbank format) to a file named as "cp002316.gb".
bioseq -f "CP002316.1" -o'genbank' > cp002316.gb
- Use the above file as input, extract FASTA sequences for each genes and save the output to a new file called "cp002316.nuc". Use this file for the following questions.
bioseq -i "genbank" -F cp002316.gb > cp002316.fas
- Count the number of sequences.
bioseq -n cp002316.fas
- In a single command, pick the first 10 sequences and find their length
bioseq -p "order:1-10" cp002316.fas | bioseq –l
- In a single command, pick the third and seventh sequences from the file and do the 3-frame translation. Which reading frame is the correct on both? Specify
bioseq -p "order:3,7" cp002316.fas | bioseq -t3
- Find the base composition of the last two sequences
bioseq -p "order:25-26" cp002316.fas| bioseq –c
- Pick the sequence with id "Bbu|D1_B11|8784|9302|1" and count the number of codons present in this sequence
bioseq -p "id:BbuJD1_B11|8784|9302|1" cp002316.fas | bioseq –C
- Delete the last 10 sequences from the file and save the output to cp002316-v2.nuc
bioseq -d "order:17-26" cp002316.fas > cp002316-v2.nuc
- In a single command, pick the first sequence, then get the 50-110 nucleotides and make reverse complement of the sub-sequences
bioseq -p "order:1" cp002316.fas | bioseq -s "50,110" | bioseq –r
- In a single command, get the first 100 nucleotides of all the sequences present in the file and do 1-frame translation of all sub-sequences.
bioseq -s "1,100" cp002316.fas | bioseq -t1
bioaln: alignment/CLUSTALW manipulations
- Go to /home/shared/LabMeetingReadings/Test-Data and find the sequence alignment file “bioaln_tutorial.aln”. Name the format of the alignment file. Use it to answer all the questions below.
CLUSTALW
- Find the length of the alignment.
bioaln -l bioaln_tutorial.aln
- Count the number of the sequences present in the alignment.
bioaln -n bioaln_tutorial.aln
- How do you convert this alignment in phylip format? Save the output.
bioaln -o "phylip" bioaln_tutorial.aln > test.phy
- Pick “seq2, seq5, seq7, seq10” from the alignment and calculate their average percent identity.
bioaln -p "seq2, seq5, seq7, seq10" bioaln_tutorial.aln | bioaln -a
- Get an alignment slice from “50-140” and find the average identities of the slice for sliding windows of 25.
bioaln -s "50, 140" bioaln_tutorial.aln | bioaln -w "25"
- Extract conserved blocks from the alignment.
bioaln -B bioaln_tutorial.aln
- Find the unique sequences and list their ids.
bioaln -u bioaln_tutorial.aln | bioaln -L
- Extract third sites from the alignment and show only variable sites in match view.
bioaln -T bioaln_tutorial.aln | bioaln -v | bioaln -m
- Remove the gaps and show the final alignment in codon view for an alignment slice “1-100”.
bioaln -s "1, 100" bioaln_tutorial.aln | bioaln -g | bioaln -c
- Add a 90% consensus sequence and then show the final alignment in match plus codon view for an alignment slice “20-80”. (Hint: match view followed by codon view)
bioaln -s "20, 80" bioaln_tutorial.aln | bioaln -C "90" | bioaln -m | bioaln -c
biotree: tree/NEWICK manipulations
biopop: SNP statistics
Homology searching and clustering
BLAST+: search("google") for homologs/pariwise alignment
hmmer
cdhit
cdhit -i all.pep -o all.cdhit -c 0.5 -n 3
Options:
- -i: input file
- -o: output file
- -c: percent identity (below which it is considered different families)
- -n: word length
interproscan
../../software/interproscan/interproscan-5.13-52.0/interproscan.sh -i trep-cdhit.representatives.pep2 -o trep-representatives.tsv -t p -goterms -pa -f tsv
Documentation page: How to run
Programs for producing multiple alignments
MUSCLE
MUGSY
- MUGSY bash
#!/bin/sh
export MUGSY_INSTALL=/home/weigang/mugsy_x86-64-v1r2.2
export PATH=$PATH:$MUGSY_INSTALL:$MUGSY_INSTALL/mapping
export PERL5LIB=$MUGSY_INSTALL/perllibs
#For testing TBA
#export PATH=$PATH:$MUGSY_INSTALL/../../multiz-tba/trunk/
- source the bash file
source mugsyenv.sh
- run mugsy
mugsy --directory /home/chongdi/Streptococcus/mugsy-output -prefix mugsy_aln mugsy-input/*.fa
CLUSTALW
MAFT
TCOFFEE
Programs for producing phylogeny & phylogenetic analysis
FastTree
PHYLIP
MrBayes
RaXML
PhyloNet
R packages for phylogenetics
APE
phengorn
phytools
Population genetics
ms: coalescence simulation
SFS: forward simulation
PAML: testing selection with Ka/Ks
Microbial genome databases & pipelines in Qiu Lab
borreliabase
pa2
spiro_genomes/treponema
Blast a set of genes against a bacteria genome
- download genome
- extract gene sequences & translate
- Make blast database
- Run blastp
de novo variant call with cortex_var
Create binary file of fasta genome file.
Run contex_var_31_c1 (cutoff 1 used for 1 genome)
- --se_list is the command the reads the list you want to target (ie: list-genome.txt)
- --kmer_size is the middle size, has to be an odd integer
- --mem_width always choose 17
- --mem_height always choose 100
- --dump_binary Name your file name (ie: Genome.ctx)
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1
--se_list list-Evo.txt
--kmer_size 31
--mem_width 17
--mem_height 100
--dump_binary Evo.ctx
> Evo.log save log file
Read each binary file (.ctx) into its own individual color list (ls Evo.ctx > Evo.colorlist) Then save these lists into their own collective colorlist.txt (ls *.ctx > colorlist.txt)
Reveal genetic variation using the Bubble Caller from cortex_var.
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5
--se_list colorlist.txt
--kmer_size 31
--mem_width 17
--mem_height 100
--dump_binary all-colors.ctx
> all-colors.log save log file
Bubble caller will detect differences between each genome by assigning distinct colors to each genome (note that the UK spelling of color is used: colour)
- --multicolour_bin holds your all-colors.ctx binary from the Bubble Caller
- --detect_bubbles1 i/i Detects 1 variation between genomes i and i. i indicates the position number the genome is listed on the colorlist.txt file. If the genome is fourth on the colorlist.txt, for example, its corresponding i variable is 4
- --output_bubbles1 Output variant reads in fasta format (ie: Evo-RefHG.var for bubble detection between
Evolved genome and Reference HG genome)
- --print_colour_coverages necessary for output
/home/weigang/CORTEX_release_v1.0.5.21/bin/cortex_var_31_c5
--kmer_size 31
--mem_height 17
--mem_width 100
--multicolour_bin all-colors.ctx
--detect_bubbles1 0/1
--output_bubbles1 Evo-RefHG.var
--print_colour_coverages
> Evo-RefHG.log save log file
Variant call with cortex_var
Example
- Files
MS00018_S5_L001_R1_001.fastq
MS00018_S5_L001_R2_001.fastq
KU-090401_S3_L001_R1_001.fastq
KU-090401_S3_L001_R2_001.fastq
...
- File content
@M03268:52:000000000-AJFAY:1:1101:16970:1555 1:N:0:7
CCCATGAACGGCACGTTCACGATGCAGAAGGTGGTGACCAGGCCGGTGTCGGCGACCTCGACGTAATCGTCGGTGGGGGA
GCCGTCGCGCGGGTCCGCGCCGCGCGGCGGGAAGTACACCTTGCCGTCCATGCGGGCGCGGGCGCCCAGCAGACGACCCT
CCATGAGCGCATTCAGGAATTCGGCCTCCTGCGGCGCGGCGGTGTGCTTGATATTGAAGTCGACCGGGGTCACGATGCCG
GTGACCGGTT
+
11>1>111@11>1A1FGF1FC0F0A111010GB0ECBFGF0/AFECGGFHECE??EEGHE/EEEHEFHEHHGCEECE??/
<CFCGGCCGCCCCCCHCGGCCCG@G??@@?@-@BFFFFFFFFFFF;B@FBFBFF<?@;@@-=?@??@-EFFFFF@;@@@F
FFBF/BBF;@@FFFFFFFFFFFFF@FFFFFFF<@@@@@@@@@@@FBFFFFFFFFFFFFFFFFF@@@?@?@FFFFFFFBF@
?@@EFF@@<B
@M03268:52:000000000-AJFAY:1:1101:16136:1618 1:N:0:7
TCCTGGCCCGTGAAACCGCTTGCCCGGTACAGGTTCTGGACTACCGCCTGGCACCCGAGCATCCGTTCCCGGCGGCGCTC
GACGACGCCGAGGCGGCGTTCCTGGAACTGGTGGCCGCCGGCTACCGGCCCGAACGCATTGCGGTCGCGGGTGATTCGGCCGGTGGCGGGCTCTCGCTCGCGCTGGCCGAACGGCTGCGCGACCGGCACGGGCTGGTTCCGGCCGCGCTCGGGCTGATCGCGCCCTGGGC
+
11>>11C1A@A?11BDF?EEGAFGFG?ECCHBHHGFG1EGHFHHHCCGGC/GCGGGCEECEFGHGGFFGHGGGGCECCC<CCC@CCCC@CGCCCGGC?:@EBFBF/CFFFF0/CFG?=@=-9>AFF@=@@?@;@@@F--9:BF@A@@?@--;9-BFFFF@A-@99B?-9@?=EFFBAAE;9>?;@@BF@9-@-;@=-E@@@=@@;@?>@9@<?@?BBBFFF;?>;;-@@?@<?@?9-FBFF@-99-9E-9
...
Step 1
- Create matched FASTQ files (python script)
#!/usr/bin/python
from sys import argv
script, File1, File2 = argv
# Create a dictionary listing the sequences in the first file for reference
file1 = open(File1)
dict1 = {}
for line in file1:
if '@M03268' in line:
tag1 = line.rstrip()[:-9]
tail = line.rstrip()[-9:]
dict1[tag1] = []
else:
dict1[tag1].append(line.rstrip())
file1.close()
# Create two output files
f1 = open(File1.replace('.fastq', '_mat.fq'), 'w')
f2 = open(File2.replace('.fastq', '_mat.fq'), 'w')
# Match the sequence
file2 = open(File2)
for line in file2:
if dict1.has_key(line.rstrip()[:-9]): # The has_key method
tag1 = line.rstrip()[:-9]
f1.write(tag1 + tail + '\n')
for j in range(3):
f1.write(dict1[tag1][j] + '\n')
del dict1[tag1]
dict2 = {} # Create a temporary dictionary for sequence in the file2
tag2 = line.rstrip()
dict2[tag2] = []
else:
dict2[tag2].append(line.rstrip())
if len(dict2[tag2]) == 3:
f2.write(tag2 + '\n')
for j in range(3):
f2.write(dict2[tag2][j] + '\n')
file2.close()
f1.close()
f2.close()
Step 2. Clean Fastq Files & Run Single-color Graph & Error Cleaning
- Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list
bbduk.sh -Xmx1g in1=fastq_file1 in2=fastq_file2 -out1=clean1.fq -out2=clean2.fq qtrim=rl trimq=20
- Create a file list showing all outcome files whose extensions need to be changed from _mat.fq to .list
for f in *_mat.fq;
do
title=$(echo $f | cut -d'_' -f2);
id=$(echo $f | cut -d'_' -f1);
echo $f > ${id}.list${title};
done
- Single color graph for sample
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1
--pe_list MS00018_S5.list1,MS00018_S5.list2
--kmer_size 31
--mem_height 17
--mem_width 100
--dump_binary MS00018_S5.ctx
--sample_id MS00018_S5
--remove_pcr_duplicates
--quality_score_threshold 20 > MS00018_S5.log
- Single color graph for reference
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1
--se_list ref.filelist
--kmer_size 31
--mem_height 17
--mem_width 100
--dump_binary ref.ctx
--sample_id ref 20 > ref.log
- Error Cleaning
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c1
--mem_height 18
--mem_width 100
--kmer_size 31
--multicolour_bin N18_S15.ctx
--remove_low_coverage_supernodes 10
--dump_binary N18_S15.cleaned.ctx
Step 3
- Pull the name of each .ctx file to a .list file, then create a .filelist file for all .list files.
ls file1.ctx > file1.list
ls file2.ctx > file2.list
ls ref.ctx > ref.list
ls -1 *.list > ref-sample.filelist
- Multicolour Graph
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c3
--mem_height 20
--mem_width 100
--kmer_size 31
--colour_list ref-sample.filelist
--dump_binary ref-sample.ctx > ref-sample.log
Step 4
- Variation Discovery Using The Bubble Caller
../../CORTEX_release_v1.0.5.21/bin/cortex_var_31_c3
--mem_height 20
--mem_width 100
--kmer_size 31
--multicolour_bin ref-sample.ctx
--detect_bubbles1 -1/-1
--ref_colour 2
--output_bubbles1 bubbles-in-sample.out
--print_colour_coverages
--experiment_type EachColourAHaploidSampleExceptTheRefColour
--genome_size 8000000 > bubbles-in-sample.log
Step 5
- Reference Genome
(When in Cluster execute "module load stampy": doesn't work; path problem) (Run the following on wallace:)
stampy.py -G ref ref.fa
stampy.py -g ref -H ref
- Turn Into VCF with reference
perl ~/CORTEX_release_v1.0.5.21/scripts/analyse_variants/process_calls-wq.pl
--callfile bubbles-in-samples.out2
--callfile_log bubbles-in-samples.log
--outvcf bubbles-in-samples
--outdir vcfout
--samplename_list sample.list
--num_cols 17
--stampy_bin ~/stampy-1.0.28/stampy.py
--stampy_hash ref
--refcol 2
--vcftools_dir /usr/local/bin
--caller BC
--kmer 31
--ploidy 1
hmmer
Annotate proteins with TIGRFAM
hmmsearch --tblout foo.hmmout # table output for all sequences
--domtblout foo.dmout # table output for all domains
-E 0.01 # level of sequence significance
--domE 0.01 # level of domain significance
-o /dev/null # don't show STDOUT
../../TIGRFAMs-Release-15-Sep-17-2014/TIGRFAMs_15.0_HMM.LIB # HMM profile library for tiger fams
GCA_000583715.2_ASM58371v2_protein.faa & # input/query file in FASTA
PopGenome
library(PopGenome)
g = readVCF("pvt1.recode.vcf.gz", 1000, "8", 127890000, 128102000)
pops = split(sample[,1], sample[,2]) # create a list of populations
g = set.populations(g, pops, diploid = T) # set population names
# by windows
slide = sliding.window.transform(g, width = 100, jump = 20) # nsnps, not actual length
slide = F_ST.stats(slide, mode = "nucleotide")
snp.pos = slide@region.data@biallelic.sites # SNP positions
win.num = length(slide@region.names)
win.start = numeric()
for (i in 1:win.num) {win.start[i] = snp.pos[[i]][1]}
fst = slide@nuc.F_ST.vs.all
pop.names = names(slide@populations) # population names
plot(win.start, fst[,1], type ="n", las = 1, ylab = expression(F[st]), xlab = "SNP Position", ylim = c(0,0.4))
for (i in 1:length(slide@populations)) {
lines(win.start, fst[,i], type = "l", col = pop.group[pop.names[i],4])
}
arch.coords=c(127982050, 127992931)
abline(v = arch.coords, col = "orange")
#rect(xleft = arch.coords[1], ybottom = -1, xright = arch.coords[2], ytop = 0.5, border = "transparent", col = 2)
Velvet
Cleaning
../../bbmap/bbduk.sh -Xmx1g
in1=WGC067462_hunhewenku_509_combined_R1.fastq # gz file works as well
in2=WGC067462_hunhewenku_509_combined_R2.fastq # gz file works as well
-out1=clean1.fq
-out2=clean2.fq
qtrim=rl
trimq=20
Running Velvet Optimizer
srun ../../VelvetOptimiser-2.2.5/VelvetOptimiser.pl
--t 32
--s 31 --e 31 --x 6 # kmer sizes
-f '-shortPaired -fastq clean1.fq -shortPaired2 -fastq clean2.fq'
-t 4
--optFuncKmer ‘n50’
-p prefix