The Inferelator - genetic regulatory networks inference algorithm
Here you can:
1) download the inferelator
2) get hands-on experience with the Inferelator
* For more information about the inner workings of the Inferelator, read the
DREAM3 top performer paper and the
DREAM4 top performer paper (coming up soon).
**For a systems-biology application of the inferelator (on Halobacter) read
Downloading the Inferelator
The latest version of the Inferelator can be found here. Join our Google Group
below if you would like to receive notifications of new releases or need support.
Alex Greenfield, Christoph Hafemeister, and Richard Bonneau
Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks
Bioinformatics first published online March 21, 2013 doi:10.1093/bioinformatics/btt099
Link to Open Access Article
For older versions of the Inferelator:
You need to input a username and password to download the Inferelator.
This password can be found on the home page of our Google Group "Inferelator announcement".
Please join the group with your academic email address to receive the download password and notifications of new releases and bug
fixes. Feel free to post discussions to the group.
Download the top performer DREAM4 pipeline: DREAM4 pipeline V:1.4.1.
Download the original Inferelator code (tutorial below): Inferelator V:1.1.
Registration is simple:
note: If you don't want to receive any email from the group, please
remember to set the 'Delivery' type of your account as 'No Email'.
Inferelator Tutorial for version 1.1
1) extract the inferelator folder from inferelator.zip.
2) you should now see one inferelator folder, and under it three more folders:
---data (here we have the input data structures);
---R_scripts (all scripts required, for now you should only pay attention to main.R);
---output (here the output of a run is saved).
3) open the data folder, and see 4 main data structures:
---tfNames (has the names of putative transcription factors we consider as predictors);
---ratios (this is the expression matrix, rows are genes, columns are conditions);
---clusterStack (to reduce complexity of whole genomes you may want to cluster genes, each cluster is then a transcriptional unit);
---colMap (the inferelator learns causation out of time series data, but uses equilibrium measurements as well to increase statistical power, thus it requires a data structure that keeps a tab on which conditions belong to a time-series experiment).
4) open the R interface (the next sections assume familiarity with the R statistical language).
5) load the four data structures for E. coli (these data structures are required for an Inferelator run):
---type tfNames to see a list of b-numbers used as gene names in E. coli (this list is taken from RegulonDB).
---type str(ratios) to see a matrix containing the expression data (each column is a microarray experiment, and we have more than 4000 mRNA measurements in each experiment).
---type length(clusterStack); this is the first non-trivial data structure (note that there are only ~370 clusters, thus we reduced the number of measurements from ~4000 to ~370).
---type str(clusterStack[[1]]); this shows you the object for the first bicluster out of a list of ~370 clusters (it should look something like this):
(each line below is text copied from the R interface, followed in parentheses by a comment to help you make sense of it)
$ nrows : int 41 (number of genes in cluster 1)
$ ncols : int 285 (this is actually a bicluster so here are the number of conditions in the cluster, a bicluster is simply a cluster over rows and columns)
$ rows : chr [1:41] "B3786" "B3620" "B3166" "B4161" ... (names of 41 genes in the cluster)
$ cols : chr [1:276] "luc___U_N0025_r2" "mazF___U_N0025_r1" "mazF___U_N0025_r2" "mazF___U_N0025_r3" ... (names of 276 conditions in the cluster)
$ p.clust : num -4.33 (associated confidence level we have for this cluster)
$ e.val : num 4.8e-05 (associated confidence level we have for this cluster)
$ k : int 1 (this is the k'th cluster in clusterStack)
$ resid : num 0.309 (a measure of how similar the expression of all the genes in the cluster is)
$ net.p.old: Named num [1:6] 1.00e-30 1.00e-30 7.67e-13 2.00e+00 1.00e-30 ... (not important for us)
..- attr(*, "names")= chr [1:6] "prolinks.GC" "prolinks.GN" "prolinks.PP" "prolinks.RS" ...
$ motif.out:List of 4 (not important for us)
..$ p.values: Named num [1:41] -6.04 -0.65 -4.35 -4.46 -2.51 ...
.. ..- attr(*, "names")= chr [1:41] "B3786" "B3620" "B3166" "B4161" ...
..$ e.values: num [1:3] 4.8e-05 4.4 96
..$ pssms :List of 3
.. ..$ : num [1:25, 1:4] 0.1 0.1 0 0 0 0.2 0.8 0.2 0.2 0 ...
.. ..$ : num [1:21, 1:4] 0 0 0 0.2 0.3 0 0 0.1 0 0 ...
.. ..$ : num [1:20, 1:4] 0 0 1 0 0.2 0 0 0 0 0.4 ...
..$ diagrams: Named chr [1:41] "26_[-1]_9_[-2]_9_[+3]_40_[+1]_1_[+1]" "[-2]_18_[-2]_4_[-3]_20_[-3]_14_[+3]_14_[-1]_4" "40_[+2]_1_[+1]_31_[-3]_4_[+1]_34" "35_[+3]_3_[-3]_90_[+3]_13" ...
.. ..- attr(*, "names")= chr [1:41] "B3786" "B3620" "B3166" "B4161" ...
$ redExp : Named num [1:445] 0.511 0.521 0.466 0.418 0.417 ... (this is the average expression, over genes that belong to this cluster; thus we get a vector of the cluster expression in all 445 conditions)
..- attr(*, "names")= chr [1:445] "dinI___U_N0025_r1" "dinI___U_N0025_r2" "dinI___U_N0025_r3" "dinP___U_N0025_r1" ...
In order to use the Inferelator with your own data you will need to create a clusterStack (each cluster could be a single gene!) in which every cluster contains $nrows, $ncols, $rows, $cols and $redExp.
---type length(colMap); this is the number of conditions in our data set.
---type str(colMap[[200]]); this shows you the object for condition 200 out of 445 conditions total in colMap:
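The following is a minimal sketch of how such a clusterStack could be assembled from your own expression matrix. It builds only the five fields listed above; the toy matrix, the gene and condition names, and the helper make_cluster are invented for this illustration and are not part of the Inferelator code.

```r
# Hypothetical toy expression matrix: rows are genes, columns are conditions.
ratios <- matrix(
  c(0.5, 0.1, -0.3,
    1.2, 0.9,  0.4,
    0.8, 0.5,  0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cond1", "cond2", "cond3"))
)

# make_cluster is a helper invented for this sketch. It fills the five
# fields the tutorial names; redExp is the mean expression over the
# cluster's genes in every condition.
make_cluster <- function(genes, ratios) {
  list(
    nrows  = length(genes),
    ncols  = ncol(ratios),
    rows   = genes,
    cols   = colnames(ratios),
    redExp = colMeans(ratios[genes, , drop = FALSE])
  )
}

# One cluster per gene (the "each cluster could be a single gene" case).
clusterStack <- lapply(rownames(ratios), make_cluster, ratios = ratios)

# A multi-gene cluster works the same way.
pairCluster <- make_cluster(c("geneB", "geneC"), ratios)

str(clusterStack[[1]])
```

Here every cluster spans all conditions; a true bicluster would restrict $cols and $ncols to a subset of the columns.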
$ isTs : logi TRUE (True because this condition is part of a Time series)
$ is1stLast: Factor w/ 4 levels "e","f","m","l": 3 (it is "m" for middle of a time series; the other three options are: "e"=equilibrium/not-Time-series, "f"=first-in-Ts and "l"=last-in-Ts)
$ prevCol : chr "T0_N0025_r3" (the previous time-point/condition-name was "T0_N0025_r3")
$ del.t : int 24 (between the previous and current time-points 24 minutes have passed)
$ condName : Named chr "T24_N0000_r3" (this condition name is "T24_N0000_r3", thus this is the name of condition 200)
..- attr(*, "names")= chr "alias"
The other option is that a condition is not part of a time series (unlike most other methods, the Inferelator can use both time-series and equilibrium data simultaneously, thus increasing statistical power).
---type str(colMap[[1]]); this shows you that the first condition is not part of a time series:
$ isTs : logi FALSE (False because this condition is not part of a time series)
$ is1stLast: Factor w/ 4 levels "e","f","m","l": 1 (it is "e" for equilibrium, thus condition 1 is not part of a time-series experiment)
$ prevCol : logi NA (it has no meaningful previous conditions/column, because it is not part of a time series, thus we set it to NA)
$ del.t : logi NA (del.t is not defined as it is not part of a time series, thus we set it to NA)
$ condName : Named chr "dinI___U_N0025_r1" (this is the name of condition 1)
..- attr(*, "names")= chr "alias"
In order to use the Inferelator with your own data you will need to create the colMap data structure.
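As a rough sketch, a colMap with the fields shown above could be built from a table of condition metadata. The condition names and the table conds are hypothetical, and setting prevCol and del.t to NA for the first point of a time series is an assumption of this example (the tutorial only shows a middle point and an equilibrium condition).

```r
# Hypothetical condition table: one equilibrium condition followed by a
# three-point time series sampled at 0, 10 and 30 minutes.
conds <- data.frame(
  condName  = c("eq1", "ts_T0", "ts_T10", "ts_T30"),
  isTs      = c(FALSE, TRUE, TRUE, TRUE),
  is1stLast = c("e", "f", "m", "l"),
  prevCol   = c(NA, NA, "ts_T0", "ts_T10"),  # NA for "f" is an assumption
  del.t     = c(NA, NA, 10L, 20L),           # minutes since prevCol
  stringsAsFactors = FALSE
)

# One list element per condition, mirroring the fields printed by str().
colMap <- lapply(seq_len(nrow(conds)), function(i) {
  list(
    isTs      = conds$isTs[i],
    is1stLast = factor(conds$is1stLast[i], levels = c("e", "f", "m", "l")),
    prevCol   = conds$prevCol[i],
    del.t     = conds$del.t[i],
    condName  = c(alias = conds$condName[i])
  )
})

str(colMap[[3]])
```

Keeping the factor levels in the order "e", "f", "m", "l" reproduces the integer codes seen in the str() output above (1 for equilibrium, 3 for a middle time point).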
6) in your favorite editor open R_scripts/main.R (we will take a look at the first few lines where we set the run parameters)
#////////////////// USER PARAMS #//////////////////
trunc.cs <- T (here you can choose to truncate clusterStack, as we will do here; if it is not truncated, the Inferelator will infer a model for every transcriptional unit in clusterStack)
# trunc.cs.i <- 900000000 ## 20 (if you don't truncate, set this to a very big value, i.e. a value bigger than the number of clusters in clusterStack)
trunc.cs.i <- 2 (we will truncate the run after inferring models for clusters 1 and 2 of clusterStack)
organism = "halo" # organism == "ecoli" (here we set a run for halo, change this for ecoli as we loaded data for ecoli)
mod.bi.cols <- FALSE (not important)
plot.it <- TRUE (set to True if you want plots to show under the output folder at the end of the run)
singleInfer <- "all" # singleInfer <- 350 (similarly to truncation we can choose to infer a model for only one bicluster, or for all)
# set how many predictor max you want per transcriptional unit
num.single.pred = 7; num.inter.pred = 5 (here we set the maximum number of single predictors and interaction predictors; the actual number of surviving predictors is reduced further by L1-shrinkage)
# set correlation by which interaction predictors are filtered (if bigger)
max.inter.corr.cutoff = 0.75 (we want our predictors to be different from each other; this does not allow two interaction predictors to be correlated with each other by more than 0.75)
# set correlation by which single predictors are filtered (if bigger)
max.single.corr.cutoff = 0.9 (similar to above only for single predictors)
# set the time scale (should be around the expected half-life-time of mRNA in minutes)
tau=15 (this is the time scale for mRNA reactions in minutes; for the bacterium E. coli this is probably around 10 minutes)
7) run main.R from the inferelator folder, i.e., type source("R_scripts/main.R") in the R interactive shell.
---this starts an Inferelator run and infers a model for the first 2 clusters in clusterStack (it should take about 30 seconds per cluster on a modest single processor).
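To give a feel for how tau and the colMap field del.t interact, here is a small sketch based on the kinetic model in the published description of the Inferelator, where the response for a time-series condition is tau * (y_cur - y_prev) / del.t + y_prev and the response for an equilibrium condition is the expression level itself. This is an assumption drawn from the papers, not a transcription of main.R, and the function name and numbers are made up for illustration.

```r
# Time scale in minutes, as set in the USER PARAMS block.
tau <- 15

# ts_response is a hypothetical helper, not a function from main.R.
# y_prev and y_cur are the expression levels of one transcriptional unit
# at the previous and current time points; del_t is the time between
# them in minutes (the del.t field of colMap).
ts_response <- function(y_prev, y_cur, del_t, tau) {
  tau * (y_cur - y_prev) / del_t + y_prev
}

# Example with invented expression values and a 24-minute interval.
resp <- ts_response(y_prev = 0.2, y_cur = 0.5, del_t = 24, tau = tau)
print(resp)
```

Under this form, a larger tau gives more weight to the rate of change between consecutive time points relative to the expression level itself.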
8) after the run ends, open output folder, in it you will see four files:
---2 plot files showing the L1-shrinkage process for each of the biclusters.
---1 log.txt file that gives information about the run time for each cluster.
---A models.RData file containing the shrunken models for each cluster.
9) load models.RData into the R interactive shell.
---type str(models) to see a list of the two models corresponding to the two inferred clusters.
List of 2
$ : Named num [1:12] 0.20875 0.08486 0.04205 0.04009 0.00644 ...
..- attr(*, "names")= chr [1:12] "VNG6389G" "VNG5163G" "VNG1237C" "VNG1510C" ...
$ : Named num [1:12] 0.1391 0.0624 0.0628 0.0836 0.0000 ...
..- attr(*, "names")= chr [1:12] "VNG5163G" "VNG1237C" "VNG6389G" "VNG1383G" ...
---in this list each single/interaction predictor is assigned a negative or positive weight, corresponding to a repressor or an activator respectively. The absolute value of the weight corresponds to the strength of the predictor relative to the other predictors. Note that some predictors are assigned a weight of zero and were thus "shrunk" out of the model.
10) repeat steps 5 to 9 for Halobacter. (Note that Halobacter has an extra data structure, lambda, that needs to be loaded. It has to do with the custom microarray used for the experiments and in general is not required for an Inferelator run.)
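The reading rules above can be sketched as a small helper that splits one model into activators, repressors, and predictors that were shrunk out. The helper summarize_model and the example weights and predictor names are invented for this illustration; only the shape (a named numeric vector per cluster) follows the str(models) output.

```r
# Hypothetical model with the same shape as one element of `models`:
# a named numeric vector of predictor weights.
model <- c(tfA = 0.21, tfB = -0.08, tfC = 0, tfD = 0.04)

# summarize_model is a helper invented for this sketch.
summarize_model <- function(weights) {
  kept <- weights[weights != 0]
  list(
    activators = names(kept)[kept > 0],     # positive weight
    repressors = names(kept)[kept < 0],     # negative weight
    shrunk.out = names(weights)[weights == 0],
    # surviving predictors, strongest first by absolute weight
    ranked     = kept[order(abs(kept), decreasing = TRUE)]
  )
}

s <- summarize_model(model)
str(s)
```

After loading models.RData, the same helper could be applied to every cluster with lapply(models, summarize_model).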
The Inferelator algorithm is continually developed at the Center for Genomics and Systems Biology
by the Bonneau research group.
For questions about the inferelator algorithm or suggestions/corrections for the web-site please contact Aviv Madar.