A Primer on the Cluster System at Hunter

=What is a Cluster System?=
[[File:Ccs-overview.png|thumb|'''Figure 1''' <sub>The general idea of a cluster system. There is a login node that is directly connected to the other nodes of the cluster.</sub>]]


A '''Cluster System''' is a set of connected computers that work together to perform tasks. Many of the servers at Hunter require you to log in first to a head node and then to a working node; for example, to use the Qiu lab servers you must first log in to Darwin.hunter.cuny.edu and then to a compute node such as Wallace.hunter.cuny.edu. On a cluster system, by contrast, all of the connected nodes can be viewed as a single system: you log in to the head node, and from there you can run your programs. However, unlike the non-cluster servers at Hunter, to fully utilize the cluster you must submit your job to a job scheduler that controls which node the job runs on.
 
==Basics of Using a Cluster System==
 
 
*What programs are currently installed on the cluster system?
The normal way of running programs on a Linux machine is to type the name of the program into the terminal, add any options to the command, and press enter. For example:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> blastp -query test.fas -db nr -remote -outfmt 6
</syntaxhighlight>
</div>
However, this will not work on a cluster.
In order to run programs on a cluster, the cluster must first have those programs installed (or you may have to compile the program and run it locally), and then you must load the program into your environment.
To check the available programs on the Hunter College cluster, use this command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module avail
</syntaxhighlight>
</div>
This command shows which modules are currently available to use. To save the list of available programs to a file, use this command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module avail 2> ~/available_apps.txt
</syntaxhighlight>
</div>
This command stores the current list of programs installed on the cluster in a file called "available_apps.txt" in your home directory. (Note that module avail prints its listing to standard error, which is why we redirect with 2> rather than >.)
'''This file only shows the programs installed at the time you run the command; you will need to regenerate it in the future to see newly installed applications!'''
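Later, rather than re-running module avail, you can simply search the saved file; for example, a quick case-insensitive grep for blast modules (assuming the file created above):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
grep -i blast ~/available_apps.txt   # list any module whose name mentions blast
</syntaxhighlight>
</div>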
*How do we load a program on the cluster?
Now that we know which applications are available on the cluster, we need to load one into our environment. To do so, we use the command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module load <Program Name>
</syntaxhighlight>
</div>
For example, to load R version 3.2.2 on the cluster, we run the command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module load R/3.2.2
</syntaxhighlight>
</div>
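To confirm that the load took effect, you can check which R is now on your PATH; a quick sanity check (the exact install path will vary by cluster):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
which R       # should point at the module's install location, not /usr/bin/R
R --version   # should report version 3.2.2
</syntaxhighlight>
</div>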
Notice how we have a version number when we run the module load command?
This means that the cluster can offer different versions of the same program! This is useful when an updated version of a program breaks compatibility with some code you've written or some modules you've downloaded, or when you simply prefer one version over another. (Looking at you, Python 3.5, and your terrible syntax changes!)
In order to use a different version of a program, we first unload the currently loaded version and then load the one we want:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module unload R/3.2.2
module load R/3.1.0
</syntaxhighlight>
</div>
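If you lose track of what you have loaded, the module system also provides a list subcommand that shows every module currently active in your session:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module list   # show all currently loaded modules
</syntaxhighlight>
</div>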
 
*How do we run a program on the cluster?
Now let's go back to our earlier blast example. Let's first load the module:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module load blast/2.2.31
</syntaxhighlight>
</div>
Notice that now, when you type "bl" on the command line and press Tab, all of the programs associated with blast come up, such as blastn, blastp, etc.
Now you may be tempted to run blast, or any other program, the normal way, that is:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
blastp -query test.fas -db nr -remote -outfmt 6
</syntaxhighlight>
</div>
'''Don't do this!!!''' When you run the command like this, you are running it on the head node!!
You aren't using the power of the cluster to do the work, and you may be clogging up what little computational power the head node has (relative to the compute nodes), making everyone else's experience on the cluster laggy and unstable!
To run a job on the cluster you must run it through a scheduler. In our case the cluster uses SLURM, but the same principles apply to other schedulers such as LSF (with its bsub command).
 
'''The correct way of submitting a job to the cluster from the terminal is to use this command''':
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
srun <command>
</syntaxhighlight>
</div>
This will submit the command to a compute node and run the job interactively; that is, all output will be redirected to the terminal you are using. For example:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
module load cdhit/4.64
srun cd-hit -h
</syntaxhighlight>
</div>
This will output the help text directly to your terminal. While this is fine for quick and dirty programs like cd-hit, when you run blast this way you'll get no output until the search is complete. Try it for yourself:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
srun blastp -query test.fas -db nr -remote -outfmt 6
</syntaxhighlight>
</div>
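While that runs, note that srun also accepts resource options on the command line, mirroring the #SBATCH directives covered in the next section. A quick sketch (the option values are illustrative, not recommendations):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
# request 1 task with 4 CPUs and 100 MB of RAM per CPU for a single interactive command
srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=100 blastp -query test.fas -db nr -remote -outfmt 6
</syntaxhighlight>
</div>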
If you close the terminal, or lose your internet connection, while an interactive job like this is running, all progress will be lost. So how do we submit jobs to the cluster and just let them run without worrying about the job dying? And can we allocate specific resources to our programs, like the number of CPUs they need, or the amount of RAM? To do this we need to write some scripts, so open up a text editor and head to the next section, yo.
 
 
==Cluster Scripting==
In order to submit a job to the cluster and have it run without keeping a terminal open '''we need to write a script'''.
You can use this template script as a basis for any future cluster scripts you may need to write:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#!/bin/bash
#
#SBATCH --job-name=<job name>
#SBATCH --output=<out name>
#
#SBATCH --ntasks=<number of tasks>
#SBATCH --time=<time needed to run task>
#SBATCH --mem-per-cpu=<memory per cpu needed>
#SBATCH --cpus-per-task=<number of cpus>

<Bash scripting goes here>
 
</syntaxhighlight>
</div>
 
This is basically all you need to submit a job to the cluster system we have at Hunter. Now let's go through it, shall we?
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#SBATCH --job-name=<job_name>
</syntaxhighlight>
</div>
This line allows you to specify the name of your job. For example, if I want my job name to be "pandas_are_awesome", then I would use the following line:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#SBATCH --job-name=pandas_are_awesome
</syntaxhighlight>
</div>
The next line lets you specify the file into which all output (both stdout and stderr) that would normally go to the terminal gets dumped. '''Remember, we are not in interactive mode here, so we won't see any terminal output'''. So if we want to dump the output to pandas_are_not_awesome.log, we would use the following line:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#SBATCH --output=pandas_are_not_awesome.log
</syntaxhighlight>
</div>
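If you would rather keep normal output and errors in separate files, SLURM also provides an --error option alongside --output; a small sketch (the file names here are just examples):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#SBATCH --output=pandas_stdout.log   # stdout only
#SBATCH --error=pandas_stderr.log    # stderr only
</syntaxhighlight>
</div>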
Hopefully the rest of the #SBATCH lines are self-explanatory.
After we put in what is essentially metadata about our job, we can then write a plain bash script to run the job. So if I wanted to run blast, I would use the following code:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#!/bin/bash
#
#SBATCH --job-name=blasttest
#SBATCH --output=blasttest.log
#
#SBATCH --ntasks=1
#SBATCH --time=10:00 #10 minutes
#SBATCH --mem-per-cpu=100 #megabytes
#SBATCH --cpus-per-task=4
 
module load blast/2.2.31 ##load the module
blastp -query test.fas -db nr -remote -outfmt 6 -out blasttest.out ##run the program
 
</syntaxhighlight>
</div>
Notice two things:
#I needed to load the blast module in order for my code to run; you need to do this whenever your program must be loaded before it can run.
#You did not have to write "srun" before the command.
 
Since, after the #SBATCH lines, the script is just a bash script, '''we can also use loops'''. For example, say I had a bunch of sequences in a folder that I wanted to blast; I would use this code:
 
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#!/bin/bash
#
#SBATCH --job-name=blasttest
#SBATCH --output=blasttest.log
#
#SBATCH --ntasks=1
#SBATCH --time=10:00 #10 minutes
#SBATCH --mem-per-cpu=100 #megabytes
#SBATCH --cpus-per-task=4

module load blast/2.2.31 ##load the module
for file in ./*.fasta ##go through every file that ends in .fasta
do
name=$(basename "$file" .fasta) ## get the file name without the .fasta extension
blastp -query "$file" -db nr -remote -outfmt 6 -out "$name.out" ##run blast on the current file
done
</syntaxhighlight>
</div>
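As an alternative to looping inside a single job, SLURM job arrays run one task per input file, so the files can be processed in parallel on different nodes. A minimal sketch, assuming ten .fasta files in the current directory (adjust --array to match your file count):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
#!/bin/bash
#
#SBATCH --job-name=blastarray
#SBATCH --output=blastarray_%a.log   # %a expands to the array index
#SBATCH --array=0-9                  # ten tasks, numbered 0 through 9
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

module load blast/2.2.31
files=(./*.fasta)                    # collect the input files into a bash array
file=${files[$SLURM_ARRAY_TASK_ID]}  # each task picks the file matching its index
name=$(basename "$file" .fasta)
blastp -query "$file" -db nr -remote -outfmt 6 -out "$name.out"
</syntaxhighlight>
</div>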
Cool beans? Now, after we finish writing and saving the script, how do we run it exactly?
To do this, we use the following command in the terminal:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> sbatch <script.sh>
</syntaxhighlight>
</div>
So if the above script were called "test.sh", I would run it with the command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> sbatch test.sh
</syntaxhighlight>
</div>
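When the submission succeeds, sbatch prints the ID assigned to the job; keep it handy for the monitoring commands in the next section (the number below is illustrative):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> sbatch test.sh
Submitted batch job 171
</syntaxhighlight>
</div>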
It should run without any problems, and it is now safe to close the connection to the cluster. If we want to check on our job, there are two ways to do it: the not-so-cool way of looking at the output file we specified in the script (in this case blasttest.log), or the cooler ways described in the next section:
 
==Monitoring Cluster Jobs==
To check whether your cluster job is running, we use the command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> squeue
</syntaxhighlight>
</div>
Our output would be something like this:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
            JOBID PARTITION    NAME     USER ST       TIME  NODES NODELIST(REASON)
              164      defq    bash     root  R   17:45:29      1 compute003
              165      defq    bash     root  R   17:45:11      1 compute004
              168      defq    bash  averdes  R   17:42:19      1 compute006
              163     himem    bash  jgorson  R 1-17:51:33      1 compute018
              170     himem  blastp  rrahman  R 1-17:51:33      1 compute018
</syntaxhighlight>
</syntaxhighlight>
</div>
Notice that we can see not only our own jobs, but also the jobs of the other people on the cluster. So if you are doing something that is incredibly computationally intensive and slowing everyone else down, we will see it, and we will bring pitchforks.
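If you only want to see your own jobs, squeue can filter by user; for example:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> squeue -u $USER   # show only the jobs you submitted
</syntaxhighlight>
</div>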
We can also use the command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> sstat -j <jobid>
</syntaxhighlight>
</div>
to see what's going on with our job, like how much RAM or how many CPU threads it's using.
If we want to kill our job, for whatever reason, we would use the command:
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> scancel <jobid>
</syntaxhighlight>
</div>
To get our job ID, we use squeue; it's in the first column.
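Putting the two together, a short sketch (the job ID here is the one from our sample squeue output above):
<div class="toccolours mw-collapsible">
<syntaxhighlight lang="bash">
$> squeue -u $USER   # find your job's ID in the first column
$> scancel 170       # cancel job 170
</syntaxhighlight>
</div>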
==Cluster File System and General Information==
To be added


==Conclusion==
Hopefully after following this tutorial you have risen from cluster padawan to cluster Jedi. But complete, is your training not. There is always something new you can learn about using the cluster system from Google and Stack Overflow; if you find something new or interesting that you think might be useful for other people, email me at rayees.rahman40@myhunter.cuny.edu and I can add it to this write-up.


This write-up is far from complete (we don't even have a name for our cluster yet, come on!), so new things will be added all the time. Watch this page!


Rayees
