cMonkey installation and setup
1. BASIC installation
Install R: The R Project for Statistical Computing
http://www.r-project.org/
and required packages:
Extract cmonkey.tgz with the command 'tar -zxf cmonkey.tgz'.
In the cmonkey directory that is created, there will be 3 subdirectories:
- R_scripts - containing the R scripts for cMonkey
- sample.output - containing some sample output from cMonkey (described below, section 4)
- data - sample data directories containing sample data in the following directories:
  - data/ecoli - a sample data directory containing data for E. coli
  - data/yeast - a sample data directory containing data for yeast (S. cerevisiae)
To run, cMonkey needs access to working copies of:
- BLAST (bl2seq, blastall, formatdb), which can be found at ftp://ftp.ncbi.nih.gov/blast/executables/LATEST
- MEME (mast, meme), which can be found at http://meme.nbcr.net/downloads/
cMonkey has been configured to look for these in a progs/ subdirectory of the main cmonkey directory. To point cMonkey at the copies on your local machine instead, edit the following lines in initializeDefaultParams.R:
params$meme.cmd <- "./progs/meme"
params$mast.cmd <- "./progs/mast"
params$mcast.cmd <- "./progs/mcast" # not currently used
params$mhmm.cmd <- "./progs/mhmm" # not currently used
params$mhmmscan.cmd <- "./progs/mhmmscan" # not currently used
params$max.blast.e.value <- 1
params$blast.cmd <- "./progs/blastall -a 2"
params$bl2seq.cmd <- "./progs/bl2seq"
params$formatdb.cmd <- "./progs/formatdb"
Once BLAST & MEME are configured, cMonkey has been set up to automatically run the E. coli data provided, which has been included to provide examples of the data format(s) cMonkey is configured to use.
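Before the first run, it can help to confirm that the configured command paths point at real files. The following is a minimal sketch, not part of cMonkey: check.cmds is a hypothetical helper, and the paths are the defaults listed above.

```r
# Hypothetical helper (not part of cMonkey): report which configured
# commands point at an existing file.
check.cmds <- function(cmds) {
  # Keep only the executable path; drop trailing arguments such as "-a 2".
  exe <- vapply(strsplit(cmds, " ", fixed = TRUE),
                function(x) x[1], character(1))
  ok <- file.exists(exe)
  names(ok) <- names(cmds)
  ok
}

# Default locations from initializeDefaultParams.R, as listed above.
cmds <- c(meme     = "./progs/meme",
          mast     = "./progs/mast",
          blast    = "./progs/blastall -a 2",
          bl2seq   = "./progs/bl2seq",
          formatdb = "./progs/formatdb")

print(check.cmds(cmds))  # FALSE entries mark paths that still need editing
```

The check only tests that a file exists at each path; it does not verify that the program is executable or is the right version.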
2. RUNNING cmonkey (on sample data)
To run cMonkey on the sample data, from the shell:
- set the cmonkey directory as your working directory.
- start R
- type 'source( "R_scripts/main.R" )'
Note, you will likely want to run cmonkey from a screen session (unix
command 'screen').
Several good tutorials on basic use of screen can be found on the web,
including:
To run cMonkey on the yeast sample data:
- rename the initializeOrgParams.R file for ecoli to
initializeOrgParams.ecoli.R
- rename the initializeOrgParams.yeast.R file to
initializeOrgParams.R
OR (if on a unix/linux machine):
- create a symbolic link to initializeOrgParams.yeast.R called initializeOrgParams.R
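The symbolic-link option can also be done from within R. This is a sketch under assumptions: link.org.params is a hypothetical helper, and the 'dir' argument stands for whichever directory holds the initializeOrgParams files in your copy. It refuses to overwrite a regular initializeOrgParams.R, so the E. coli file must be renamed first, as in the steps above.

```r
# Hypothetical helper (not part of cMonkey): make initializeOrgParams.R a
# symbolic link to initializeOrgParams.<org>.R inside 'dir'.
link.org.params <- function(org, dir = ".") {
  src <- paste("initializeOrgParams", org, "R", sep = ".")
  dst <- file.path(dir, "initializeOrgParams.R")
  if (!file.exists(file.path(dir, src))) stop("missing params file: ", src)

  target <- Sys.readlink(dst)  # "" for a regular file, NA if dst is absent
  if (!is.na(target) && nzchar(target)) {
    unlink(dst)                # an old symbolic link: safe to replace
  } else if (file.exists(dst)) {
    stop("initializeOrgParams.R is a regular file; rename it first")
  }
  file.symlink(src, dst)       # relative link, resolved inside 'dir'
}

# Example (unix/linux only):
# link.org.params("yeast")
```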
3. CONFIGURING cmonkey to work with your data & configuring your data to work with cMonkey
3a. Organism-specific param files:
To configure cMonkey to work with your data, modify either of the initializeOrgParams.*.R files provided and rename it as initializeOrgParams.R. These files are self-documenting, and include the locations where you can find the relevant sequence and network files for your organism.
Note, in the initializeOrgParams.R file provided, cMonkey is set to read all data from the "data/ecoli" directory, and put all output into an "output/ecoli" directory. This behavior is specified with the following params in the org-specific params files, and is the recommended organization for the data and output.
params$organism <- "ecoli"
... lines
params$output.dir <- paste( "output/", params$organism, "/", sep="" )
params$data.dir <- paste( "data/", params$organism, "/", sep="" )
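As a quick illustration of how these settings combine, the snippet below evaluates the same paste() calls on a standalone list. The 'params' list here is created only for the example; in cMonkey it is set up by the params files.

```r
# Standalone illustration; in cMonkey, 'params' comes from the params files.
params <- list()
params$organism   <- "ecoli"
params$output.dir <- paste("output/", params$organism, "/", sep = "")
params$data.dir   <- paste("data/", params$organism, "/", sep = "")

print(params$output.dir)  # "output/ecoli/"
print(params$data.dir)    # "data/ecoli/"
```

Changing params$organism alone is therefore enough to redirect both the data and output directories.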
You will also likely want to change the following parameters, which have been set to small values to expedite the initial run:
## number of iterations to optimize clusters
params$n.iters <- 3 #120
## Max # of clusters to generate
params$kmax <- 10 #300
3b. Gene naming conventions for built-in cMonkey data parsing
Coordinate data:
cMonkey's built-in gene coordinate file parser is written to parse the gene coordinate files from NCBI, and uses the 'Synonym' field for gene names (as these are more complete & unique to the organism).
Expression data:
cMonkey's built-in expression matrix file parser is written with the assumption that the expression matrix is in gene x condition format (rows x cols), with each row and column named by the appropriate gene or condition. If the genes are named by the "common" gene names, cMonkey will map these to the correct Synonym names specified in the coordinate file.
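The layout described above can be sketched in R as follows. The gene and condition names and all values are invented for illustration, and check.expr is a hypothetical helper, not a cMonkey function; the on-disk file format is best taken from the sample data provided.

```r
# Toy gene x condition matrix: genes in rows, conditions in columns.
# All names and values are made up for illustration.
expr <- matrix(
  c( 0.12, -0.50, 1.30,
     0.08, -0.41, 1.10,
    -0.90,  0.75, 0.02),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cond1", "cond2", "cond3"))
)

# Hypothetical sanity check: numeric matrix with unique row and column names.
check.expr <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m))
  stopifnot(!is.null(rownames(m)), !is.null(colnames(m)))
  stopifnot(!anyDuplicated(rownames(m)), !anyDuplicated(colnames(m)))
  invisible(TRUE)
}

check.expr(expr)
print(dim(expr))  # 3 genes x 3 conditions
```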
Sequence data:
cMonkey's built-in sequence data parser is written to parse files from rsa-tools
from:
NOTE, the parser assumes each sequence is specified with rsa-tools' full-identifier, and will not correctly link the sequence data with expression data if it is not used. If your organization uses its own header format, you can modify the readUpstreamSeqs.R file to process these.
Network data (prolinks):
cMonkey's built-in prolinks file parser is written to correctly parse organism-specific prolinks data files and should not need to be modified.
Network data (all other types):
For other network types, cMonkey's built-in parser is written to read files in .sif format, a tab-delimited format which specifies one edge per line. Examples of this format can be found in the sample data provided.
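To make the one-edge-per-line idea concrete, the sketch below reads a small tab-delimited edge list in R. It assumes the common three-column .sif layout (source node, interaction type, target node); the node names and the "pp" interaction type are invented, and the column layout should be checked against the sample data before relying on it.

```r
# Two example edges, one per line, tab-delimited (contents invented).
sif.text <- c("geneA\tpp\tgeneB",
              "geneB\tpp\tgeneC")

# Assumed column layout: source node, interaction type, target node.
edges <- read.delim(text = paste(sif.text, collapse = "\n"),
                    header = FALSE,
                    col.names = c("node1", "type", "node2"),
                    stringsAsFactors = FALSE)

print(edges)
```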
4. OUTPUT:
When optimizing clusters, cMonkey will produce 3 types of files, 2
examples of which can be found in the sample.output directory:
<organism>_###.RData
- an R .RData file that can be loaded using the R 'load' function, containing an example optimized clusterStack of 10 biclusters (more below)
clust_###.ps
- a visualization of the data types for cluster ### as it is optimized, one page per iteration.
cstats_###.ps
- a visualization of cMonkey's scoring metrics for cluster ### as it is being optimized, one page per iteration (can be ignored or removed after optimization is completed, but provided for verification purposes).
4a. <organism>_###.RData:
An R .RData file containing the optimized clusterStack containing ###
clusters, and 3 additional
logging variables:
biclust.version - logging var
date.biclust.run - same
out.logs - same
clusterStack - the optimized clusterStack. The clusterStack variable is
an R list object containing
list entries, with one entry per bicluster (plus one entry, $k, the
length of the stack).
Each bicluster entry is a list, containing the following items:
- k - # of bicluster
- nrows - # of genes in bicluster
- ncols - # of conditions in bicluster
- rows - genes in bicluster
- cols - conditions in bicluster
- resid - the "residual" of the bicluster's expression sub-matrix - a measure of the coherence of the bicluster.
- motif.out - R list object, containing info from MEME/MAST, i.e. pssms, diagrams, p/e-vals (see http://meme.sdsc.edu/meme/mast-output.html#pvalues for an explanation of MAST scoring metrics)
- p.clust - combined p-value for seqs in bicluster (from MAST)
- e.val - e-values of top n motifs found in bicluster (from MAST)
- net.p.old - the p-values of each assoc'n type for the bicluster as described in the Integrative Biclustering paper by Reiss & Bonneau
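The structure above can be explored with a few lines of R. The sketch below builds a small mock clusterStack (all values invented), saves and reloads it the way a real <organism>_###.RData file would be loaded, and tabulates a few fields per bicluster. It assumes the bicluster entries occupy list positions 1 to k; verify this against the sample output before reusing the loop.

```r
# Mock clusterStack following the layout described above (values invented).
clusterStack <- list(
  list(k = 1, nrows = 3, ncols = 2,
       rows = c("geneA", "geneB", "geneC"), cols = c("cond1", "cond2"),
       resid = 0.42),
  list(k = 2, nrows = 2, ncols = 3,
       rows = c("geneD", "geneE"), cols = c("cond1", "cond3", "cond4"),
       resid = 0.31)
)
clusterStack$k <- 2  # the extra entry holding the length of the stack

# Round trip through an .RData file, as with real cMonkey output.
rdata <- tempfile(fileext = ".RData")
save(clusterStack, file = rdata)
rm(clusterStack)
load(rdata)  # restores the variable 'clusterStack'

# Assumption: bicluster entries occupy positions 1..k of the list.
stack.summary <- do.call(rbind, lapply(seq_len(clusterStack$k), function(i) {
  b <- clusterStack[[i]]
  data.frame(k = b$k, nrows = b$nrows, ncols = b$ncols, resid = b$resid)
}))
print(stack.summary)
```

For a real run, replace the mock and the temporary file with a call to load() on the .RData file in your output directory.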
4b. clust_###.ps:
A postscript file containing a visualization of the bicluster during
the optimization. Each page shows:
- top left: an image of the bicluster's expression v. the rest of the expression matrix (the bicluster's expression is to the left of the red-dotted line).
- top right: an image of the bicluster's expression by itself.
- lower left: a visualization of the valid motifs (as PSSMs) & network edges associated with the bicluster. Note, motifs may disappear/appear between iterations, dependent upon the sequences in the bicluster.
- lower right: a visualization of the locations of the motifs along the upstream seqs for the bicluster.
Note, the sample versions provided in sample.output are in .pdf format.
4c. cstats_###.ps
A visualization of cMonkey's scoring metrics for cluster ### as it is being optimized, one page per iteration (these can be ignored or removed after optimization is completed, but are provided for verification purposes).
For further questions, please send email to:
tjk229@nyu.edu