cMonkey installation and setup

1. BASIC installation

Install R: The R Project for Statistical Computing
http://www.r-project.org/
and required packages:

Extract cmonkey.tgz with the command: tar -zxf cmonkey.tgz

In the cmonkey directory that is created, there will be 3 subdirectories:

To run, cMonkey needs access to working copies of:

cMonkey has been configured to look for these in a progs/ subdirectory of the main cmonkey
directory. To point cMonkey at the copies on your local machine instead, edit the following
lines in initializeDefaultParams.R:
params$meme.cmd <- "./progs/meme"
params$mast.cmd <- "./progs/mast"
params$mcast.cmd <- "./progs/mcast" # not currently used
params$mhmm.cmd <- "./progs/mhmm" # not currently used
params$mhmmscan.cmd <- "./progs/mhmmscan" # not currently used
params$max.blast.e.value <- 1
params$blast.cmd <- "./progs/blastall -a 2"
params$bl2seq.cmd <- "./progs/bl2seq"
params$formatdb.cmd <- "./progs/formatdb"
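
Before the first run it can help to confirm that each configured command points at a file
that exists. The following is a minimal sketch and is not part of cMonkey: the helper name
check.cmd.paths is hypothetical, and the params list is rebuilt here only so the example
runs on its own.

```r
# Hypothetical helper (not part of cMonkey): for every setting whose name
# ends in ".cmd", report whether the program it names exists on disk.
check.cmd.paths <- function(params) {
  cmd.names <- grep("\\.cmd$", names(params), value = TRUE)
  # Keep only the program path, dropping any flags such as "-a 2"
  progs <- vapply(params[cmd.names],
                  function(cmd) strsplit(cmd, " ", fixed = TRUE)[[1]][1],
                  character(1))
  found <- file.exists(progs)
  names(found) <- cmd.names
  found
}

# Stand-in for the settings in initializeDefaultParams.R
params <- list()
params$meme.cmd <- "./progs/meme"
params$mast.cmd <- "./progs/mast"
params$max.blast.e.value <- 1
params$blast.cmd <- "./progs/blastall -a 2"
params$bl2seq.cmd <- "./progs/bl2seq"
params$formatdb.cmd <- "./progs/formatdb"

# TRUE for each program that was found, FALSE otherwise
print(check.cmd.paths(params))
```

Any entry reported as FALSE needs its path corrected before cMonkey can call that program.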

Once BLAST and MEME are configured, cMonkey is set up to run automatically on the E. coli
data provided, which is included as an example of the data format(s) cMonkey is
configured to use.


2. RUNNING cmonkey (on sample data)

To run cMonkey on the sample data, from the shell:

Note: you will likely want to run cMonkey from a screen session (Unix command 'screen').
Several good tutorials on basic use of screen can be found on the web, including:

To run cMonkey on the yeast sample data:


3. CONFIGURING cmonkey to work with your data & configuring your data to work with cMonkey

3a. Organism-specific param files:

To configure cMonkey to work with your data, modify one of the initializeOrgParams.*.R
files provided and rename it to initializeOrgParams.R.

These files are self-documenting: they include instructions on how to modify them and the
locations where you can find the relevant sequence and network files for your organism.

Note, in the initializeOrgParams.R file provided, cMonkey is set to read all data from the
"data/ecoli" directory, and put all output into an "output/ecoli" directory. This behavior
is specified with the following params in the org-specific params files, and is the recommended
organization for the data and output.

params$organism <- "ecoli"
... lines
params$output.dir <- paste( "output/", params$organism, "/", sep="" )
params$data.dir <- paste( "data/", params$organism, "/", sep="" )

You will also likely want to change the following parameters, which have been set to small
values to expedite the initial run:
## number of iterations to optimize clusters
params$n.iters <- 3 #120 #
## Max # of clusters to generate
params$kmax <- 10 #300

3b. Gene naming conventions for built-in cMonkey data parsing

Coordinate data:
cMonkey's built-in gene coordinate file parser is written to parse the gene coordinate files from
NCBI, and uses the 'Synonym' field for gene names (as these are more complete and unique to the organism).

Expression data:
cMonkey's built-in expression matrix file parser is written with the assumption that the expression
matrix is in gene x condition format (rows x cols), with each row and column named by the appropriate
gene or condition. If the genes are named by their "common" gene names, cMonkey will map these
to the corresponding Synonym names specified in the coordinate file.
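
To make the gene x condition layout concrete, here is a minimal sketch that writes a tiny
matrix to a temporary file and reads it back with base R. The tab-delimited layout, the
gene names and the condition names are assumptions made for illustration; the sample data
provided shows the format cMonkey itself reads.

```r
# Write a tiny gene x condition matrix to a temporary file.
# Gene and condition names here are made up for illustration.
expr.file <- tempfile(fileext = ".tsv")
writeLines(c(
  "\tcond1\tcond2\tcond3",
  "geneA\t0.12\t-1.30\t0.45",
  "geneB\t-0.70\t0.05\t1.10"
), expr.file)

# Rows are genes, columns are conditions; the first column holds the gene names
ratios <- as.matrix(read.delim(expr.file, row.names = 1, check.names = FALSE))

print(dim(ratios))        # number of genes, number of conditions
print(rownames(ratios))   # gene names
print(colnames(ratios))   # condition names
```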

Sequence data:
cMonkey's built-in sequence data parser is written to parse files from rsa-tools, available from:
http://rsat.ulb.ac.be/rsat/ (FASTA format).

NOTE: the parser assumes each sequence is labeled with rsa-tools' full-identifier. If it is not,
the sequence data will not be linked correctly with the expression data. If your organization
uses its own header format, you can modify the readUpstreamSeqs.R file to process it.

Network data (prolinks):
cMonkey's built-in prolinks file parser is written to parse organism-specific prolinks data files and should not need to be modified.

Network data (all other types):
For other network types, cMonkey's built-in parser is written to read files in .sif format, a tab-delimited format that specifies one edge per line. Examples of this format can be found in the sample data provided.
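
As an illustration of the one-edge-per-line layout, the sketch below writes a three-edge
network to a temporary file and reads it back with base R. It assumes the common
three-column .sif layout (source node, interaction type, target node); the gene names and
interaction labels are made up, and the sample data remains the reference for what
cMonkey expects.

```r
# Write a tiny network, one tab-delimited edge per line.
# Node names and interaction labels are made up for illustration.
sif.file <- tempfile(fileext = ".sif")
writeLines(c(
  "geneA\tpp\tgeneB",
  "geneA\tpp\tgeneC",
  "geneB\tpd\tgeneC"
), sif.file)

# Assumed column layout: source node, interaction type, target node
edges <- read.delim(sif.file, header = FALSE, stringsAsFactors = FALSE,
                    col.names = c("source", "interaction", "target"))

print(nrow(edges))                # number of edges
print(table(edges$interaction))   # edges per interaction type
```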


4. OUTPUT:

When optimizing clusters, cMonkey will produce 3 types of files, 2
examples of which can be found in the sample.output directory:

<organism>_###.RData  - an R .RData file that can be loaded using the R 'load' function, containing
                        an example optimized clusterStack of 10 biclusters (more below)
clust_###.ps          - a visualization of the data types for cluster ### as it is optimized, one page
                        per iteration.
cstats_###.ps         - a visualization of cMonkey's scoring metrics for cluster ### as it is being
                        optimized, one page per iteration (can be ignored or removed after optimization
                        is completed, but is provided for verification purposes).
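
A minimal sketch of loading such a file with R's 'load' function follows. So that it runs
on its own, it first saves a mock clusterStack to a temporary file; the mock's 'rows'
field and gene names are hypothetical, and the sketch assumes the bicluster entries occupy
the first $k positions of the list. With real output, point rdata.file at the .RData file
cMonkey wrote.

```r
# Mock stand-in for cMonkey output: one list entry per bicluster plus $k.
# The 'rows' field and the gene names are hypothetical.
clusterStack <- list(
  list(rows = c("geneA", "geneB")),
  list(rows = c("geneC")),
  k = 2
)
rdata.file <- file.path(tempdir(), "mock_002.RData")
save(clusterStack, file = rdata.file)
rm(clusterStack)

# load() restores the saved objects and returns their names
loaded <- load(rdata.file)
print(loaded)

# $k gives the number of biclusters in the stack
n.clusters <- clusterStack$k
for (i in seq_len(n.clusters)) {
  cat("bicluster", i, "has", length(clusterStack[[i]]$rows), "rows\n")
}
```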

4a. <organism>_###.RData:

An R .RData file holding the optimized clusterStack of ### clusters, along with 3 additional
logging variables:
biclust.version - logging variable
date.biclust.run - logging variable
out.logs - logging variable
clusterStack - the optimized clusterStack. The clusterStack variable is an R list object containing
list entries, with one entry per bicluster (plus one entry, $k, the length of the stack).
Each bicluster entry is a list, containing the following items:

4b. clust_###.ps:

A postscript file containing a visualization of the bicluster during the optimization. Each page shows:

4c. cstats_###.ps

A visualization of cMonkey's scoring metrics for cluster ### as it is being optimized, one page per iteration
(these can be ignored or removed after optimization is completed, but are provided for verification purposes).


For further questions, please send email to:

tjk229@nyu.edu