PaCE
THE PaCE CLUSTERING AND ASSEMBLY FRAMEWORK
There are three main steps in running the framework:
(i) PREPROCESSING - this step gets the input sequence data "cleaned" up and ready for processing. Both the input and output files are in FASTA format.
(ii) Run PaCE using a job scheduler
- this step requires access to a parallel machine with MPICH installed and tested. The input is the preprocessed input FASTA file and the output is a set of "clusters" - which is basically a partition of the input sequences based on overlap information.
(iii) Run CAP3 assembler
- This step is for "assembling" each PaCE cluster into an assembled set of sequences (also called "contigs").
INSTRUCTIONS FOR PREPROCESSING
Open a command prompt in BCCD and run the following commands in the order given. (Note: the symbol "$" represents the shell prompt on your system, i.e., there is no need to type it.)
$ ssh username@cluster.earlham.edu
$ ssh c0
$ cd PaCE-pipeline2.2/datafiles
$ perl ../bin/PreprocessPaCE.pl tEST.data 1 > tEST.data.PaCE
This will create a new preprocessed input file called "tEST.data.PaCE". This is the file that will be the actual input file for PaCE.
$ perl ../bin/FastaStat.pl tEST.data.PaCE | grep -i "Number of sequences"
Take a note of the number in the output. This will be the "n" that we will use in the next few steps. For this example, n is 500.
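As a cross-check on "n", the number of sequences can also be counted directly, since every FASTA record begins with a header line starting with ">". The following is a minimal sketch, not what FastaStat.pl does internally; the function name and the three-record sample are hypothetical.

```python
def count_fasta_records(lines):
    """Count FASTA records: each record starts with a '>' header line."""
    return sum(1 for line in lines if line.startswith(">"))


# Hypothetical three-record sample. For a real run, pass an open file
# instead, e.g. open("tEST.data.PaCE").
sample = """>seq1
ACGTACGT
>seq2
GGCATTA
CCGA
>seq3
TTTT
""".splitlines()

n = count_fasta_records(sample)
print(n)
```

If this count disagrees with the number reported by FastaStat.pl, it is worth inspecting the preprocessed file before submitting the job.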
INSTRUCTIONS FOR RUNNING PaCE
$ cd ../PaCE-clustering
$ cat pace-batch.sh
This is the file that specifies the input parameters to PaCE and the number of processors. The first time, use the default pace-batch.sh provided in that folder. If you want to change any of the input parameters, open the file in an editor of your choice and modify the appropriate fields. Here is a description of the parameters:
- N <name your job according to your choice>
- nodes=<number of nodes>:ppn=<number of processing elements per node> (The product of "nodes" and "ppn" is the maximum number of processes that can be used for this job. E.g., if we want to run a job on 8 processors, "nodes=4:ppn=2" is an apt choice.)
- The last line has the exact command line that runs the MPI parallel job. It will look like:
<path>../mpirun -np 8 -machinefile $PBS_NODEFILE <PaCE-clustering's full path>/PaCE <datafiles' full path>/tEST.data.PaCE <n> <PaCE-clustering's full path>/PaCE.cfg 2>&1 > pace.output
Basically, PaCE takes three parameters:
PaCE {name & location of the preprocessed input FASTA file} {n: i.e., number of sequences in the input} {name & location of the PaCE.cfg file}
What precedes these is the mpirun command and the arguments it needs to run a parallel job. Note that the number specified after "-np" should match "#nodes*#ppn" (it can actually be less than or equal to that product).
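The constraint between "-np" and the requested resources is simple arithmetic, and can be checked before submitting a job. This is only an illustrative sketch; the function name is hypothetical and nothing like it ships with the pipeline.

```python
def check_process_count(nodes, ppn, np):
    """Return the slot total nodes*ppn, raising if -np asks for more."""
    slots = nodes * ppn
    if np > slots:
        raise ValueError(f"-np {np} exceeds nodes*ppn = {slots}")
    return slots


# The example from the text: nodes=4, ppn=2 and "-np 8".
print(check_process_count(nodes=4, ppn=2, np=8))
```

A smaller "-np" (for example 6 with the same resources) also passes the check, but leaves some of the requested processors idle.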
The PaCE.cfg contains all clustering parameters and has a readme called PaCE.cfg.README explaining what each parameter means.
$ qsub pace-batch.sh
This submits a new parallel job with a new id. Monitor the job progress using qstat. A successful run will create the following files:
(i) pace.output: the output file containing all the runtime and final clustering statistics
(ii) estClust.500.3.PaCE: set of clusters and their corresponding sequence members
(iii) estClustSize.500.3.PaCE: a file containing the size of each cluster
(iv) ContainedESTs.500.PaCE: sequences that are entirely contained in other sequences
(v) NonContainedESTs.500.PaCE: sequences that are not contained in any other sequence
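The cluster file can be inspected programmatically. The sketch below assumes the "{Cluster#}" / "{Member#}" line layout shown in the example under INTERPRETATION OF THE RESULTS; the function name is hypothetical, and cluster 24 with its member is made up purely to illustrate a singleton.

```python
def parse_clusters(lines):
    """Map each cluster number to the list of its member sequence names."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("{Cluster#}"):
            current = int(line.split()[1])
            clusters[current] = []
        elif line.startswith("{Member#}") and current is not None:
            # Drop the tag and the leading '>' of the FASTA header.
            clusters[current].append(line.split(None, 1)[1].lstrip(">"))
    return clusters


# Cluster 23 is taken from the example in this document; cluster 24 is a
# hypothetical singleton added for illustration only.
sample = """{Cluster#} 23
{Member#} >gi|9437186|gb|BE439344.1|BE439344
{Member#} >gi|9437153|gb|BE439311.1|BE439311
{Member#} >gi|9437155|gb|BE439313.1|BE439313
{Member#} >gi|9437188|gb|BE439346.1|BE439346
{Cluster#} 24
{Member#} >hypothetical_singleton
""".splitlines()

clusters = parse_clusters(sample)
sizes = {cid: len(members) for cid, members in clusters.items()}
singletons = [cid for cid, size in sizes.items() if size == 1]
print(sizes, singletons)
```

For a real run, the lines would come from the estClust file itself, and the computed sizes could be compared against the estClustSize file.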
INSTRUCTIONS FOR RUNNING CAP3 (ASSEMBLER)
$ cd ../CAP3-assembly
$ mkdir cap3-on-tEST
Note that this is where the assembled output for the input file tEST.data.PaCE will be stored. If you use a different input file later, you need to create a new folder for it. This example shows how to perform these steps for tEST.data.PaCE.
$ cd cap3-on-tEST
$ perl ~/PaCE-pipeline2.2/bin/extractCF.pl ~/PaCE-pipeline2.2/PaCE-clustering/estClust.500.3.PaCE ~/PaCE-pipeline2.2/datafiles/tEST.data.PaCE
This will create a "lot" of files in the current folder. There will be exactly as many files as there are PaCE clusters. For the example there will be 386 cluster files.
$ ~/PaCE-pipeline2.2/bin/caploop .
This step will run cap3 on the current folder. Please DO NOT omit the "." at the end of the above command.
This completes the PaCE + CAP3 pipeline.
INTERPRETATION OF THE RESULTS
Let us take an example cluster created by PaCE and track it through the CAP3 assembly. If you open the ~/PaCE-pipeline2.2/PaCE-clustering/estClust.500.3.PaCE file (which was the clustering output from PaCE), you will notice that most of the clusters are "singleton" clusters, meaning that many sequences had no overlap with any other sequence and so were each left alone in a separate cluster. But a few clusters will have multiple sequences in them. Take, for example, cluster #23:
{Cluster#} 23
{Member#} >gi|9437186|gb|BE439344.1|BE439344
{Member#} >gi|9437153|gb|BE439311.1|BE439311
{Member#} >gi|9437155|gb|BE439313.1|BE439313
{Member#} >gi|9437188|gb|BE439346.1|BE439346
This cluster has 4 sequences. Now change directory to the folder that contains the CAP3 output:
$ cd ~/PaCE-pipeline2.2/CAP3-assembly/cap3-on-tEST
$ ls cluster.23.*
These are the two CAP3 generated files for cluster #23 that are of interest to us:
(i) cluster.23.cap.contigs: This file contains the contigs (or assembled sequences) generated as a result of running CAP3 on this cluster. In this example, you will see only one contig, which means that all four sequence members had a good overlap with one another and assembled into one long sequence.
(ii) cluster.23.align: This file shows how the 4 sequence members were assembled. The alignments are displayed line by line. Each line under the line with "-----" is the consensus sequence, i.e., the characters from all 4 member sequences at each aligned position are consolidated into one "representative" character, which is determined by a consensus/majority mechanism. E.g., if you see 4 A's lined up, the contig sequence will also have an A in that position. Ties are broken internally in an arbitrary fashion if no external information is provided.
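The majority idea behind the consensus line can be illustrated with a simple column-wise vote. This is only a sketch of plain majority voting, not CAP3's actual algorithm; the function name and the aligned rows are hypothetical, the rows are assumed to be of equal length, and ties here go to the character seen first in the column (one arbitrary choice among many).

```python
from collections import Counter


def majority_consensus(aligned_rows):
    """Pick the most frequent character in each column of the alignment."""
    consensus = []
    for column in zip(*aligned_rows):
        # most_common(1) returns the top (character, count) pair; among
        # equal counts, the character encountered first wins.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical alignment of 4 member sequences (equal-length rows).
rows = [
    "ACGTA",
    "ACGTA",
    "ACCTA",
    "ACGTT",
]
print(majority_consensus(rows))
```

In the third column three rows carry G and one carries C, so the vote yields G; in the last column three A's outvote the single T.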