V22.0480 Final Project
Final projects for v22.0480
All projects will use data-set number 1 and associated datasets (the B. anthracis microarray dataset and annotations for the B. anthracis genome).
For each project I have one student in mind already (based on that student already seeming to “get” the point behind the project). Each student must sign up for one and only one project. Projects are due on the day of the course final, NO EXCEPTIONS. Each group should give the other groups a mostly working function OR an example input / data-structure to work with 2 weeks before the final.
All projects must be handed in as an R package that can be loaded using the function library(). THIS includes documentation. Other languages (Perl, Python, Flash (is that a language?), C, etc.) can be incorporated in the R package, although this is not strictly required.
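As a starting point for the packaging requirement, here is a minimal sketch that uses base R's package.skeleton() to generate the directory layout of a package. The package name "finalproj" and the function medoid_demo() are hypothetical placeholders, not names required by the course.

```r
# Hypothetical placeholder function; a real project would export its own code.
medoid_demo <- function(x) {
  # Return the index of the row with the smallest total distance to all rows.
  d <- as.matrix(dist(x))
  which.min(rowSums(d))
}

# Build the skeleton in a temporary directory so nothing is overwritten.
pkg_dir <- tempfile("pkgdemo")
dir.create(pkg_dir)

# package.skeleton() writes DESCRIPTION, NAMESPACE, R/ and man/ stubs.
utils::package.skeleton(
  name = "finalproj",            # hypothetical package name
  list = "medoid_demo",
  environment = environment(),
  path = pkg_dir
)

list.files(file.path(pkg_dir, "finalproj"))
```

After the generated man/ stubs are filled in, the package can be built and installed with R CMD build and R CMD INSTALL, and then loaded with library(finalproj).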
Project work flow
Approach the project in the following way:
Step 1. Sign up and talk with me to make sure you understand the project objective.
Step 2. Share your data input and output formats with other interdependent projects. All projects will be interoperable (meaning the clusters from project 1 will be used by project 2 and visualized by project 5, etc.). I will give more details on what must be shared below in the descriptions of the individual projects.
Step 3. Implement the project individually (the two project 5 people) or in groups of 2 (projects 1-4). Check in with the groups that will take your output as input along the way.
Step 4. Make the package and write the vignette / docs.
Step 5. During our scheduled final we’ll present our projects to each other. You don’t need to make a very formal presentation, but will be expected to demo your code and explain how and why you did what you did.
Don’t break the chain & what if the chain is broken?
Remember : this project is your final, and the largest component of your grade. You will have to stand up in front of the class and present your code.
We'll run the projects one after another during the time-slot scheduled for this class's final and watch the different projects work together. As close to the real world as we can get without actually doing something real ;-)
THUS (!!!!), the interoperability of the projects could be a disaster if group one quits ... BUT it won’t be!
If any group falls flat and cannot produce the results the other groups need, I will provide example data-structures to use as inputs OR I will provide functions. If this happens I’ll be very angry (like Godzilla angry) BUT don’t worry, you won’t be slowed down by other groups because I can connect the dots if any one group fails.
Project 1 : clustering bigger data : a scalable k-medoids
2 students should sign up for this one: Sign up : ( Kin F. Tam! Mike Franchitti )
We went over a simple implementation of k-means and k-medoids clustering in class. This algorithm did not scale very well with respect to data size (number of observations OR number of genes/objects). We are often in a situation where we expect a modest number of classes/clusters, but have 500,000 or more genes/objects/individuals. Make a version of k-medoids that reads only subsets of the dataset from a SQLite database at any given time and thus has better run time and memory behavior (bounded by the subset of the data read in at any given moment), but has limiting behavior similar to running k-medoids on the full dataset.
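One possible shape for this is sketched below: run k-medoids on small random samples, then score each candidate set of medoids against the full data one chunk at a time (similar in spirit to the CLARA idea). This is only a sketch under assumptions: fetch_rows() is a hypothetical stand-in for a SQLite query that returns the requested rows, and here it simply indexes an in-memory toy matrix. All function names and parameters are illustrative.

```r
# Euclidean distances from each row of x to each row of centers.
dist_to <- function(x, centers) {
  d <- vapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
  }, numeric(nrow(x)))
  matrix(d, nrow = nrow(x))
}

# Simple alternating k-medoids on a sample small enough to hold in memory.
kmedoids_small <- function(x, k, max_iter = 20) {
  d <- as.matrix(dist(x))
  med <- sample(nrow(x), k)
  for (iter in seq_len(max_iter)) {
    cl <- max.col(-d[, med, drop = FALSE], ties.method = "first")
    new_med <- vapply(seq_len(k), function(j) {
      members <- which(cl == j)
      if (length(members) == 0) return(med[j])   # keep medoid of an empty cluster
      members[which.min(colSums(d[members, members, drop = FALSE]))]
    }, integer(1))
    if (all(new_med == med)) break
    med <- new_med
  }
  med
}

# fetch_rows(idx) is assumed to return rows idx of the data as a matrix.
subset_kmedoids <- function(fetch_rows, n, k, sample_size = 200,
                            n_samples = 5, chunk_size = 1000) {
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  best <- list(cost = Inf)
  for (s in seq_len(n_samples)) {
    idx <- sample(n, min(sample_size, n))
    med_rows <- idx[kmedoids_small(fetch_rows(idx), k)]
    centers <- fetch_rows(med_rows)
    cl <- integer(n)
    cost <- 0
    for (ch in chunks) {                         # only one chunk in memory at a time
      d <- dist_to(fetch_rows(ch), centers)
      cl[ch] <- max.col(-d, ties.method = "first")
      cost <- cost + sum(d[cbind(seq_along(ch), cl[ch])])
    }
    if (cost < best$cost) best <- list(cost = cost, medoids = med_rows, cluster = cl)
  }
  best
}

# Toy data: two well-separated groups of 100 rows each.
set.seed(1)
toy <- rbind(matrix(rnorm(300, mean = 0), ncol = 3),
             matrix(rnorm(300, mean = 6), ncol = 3))
fetch_rows <- function(idx) toy[idx, , drop = FALSE]   # stand-in for a database query
fit <- subset_kmedoids(fetch_rows, n = nrow(toy), k = 2, sample_size = 50)
table(fit$cluster)
```

Memory use is bounded by the larger of sample_size and chunk_size rows, at the price of several passes over the data.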
Can you use resampling to judge the robustness of your cluster results? Can you prove that your method converges to mostly the right answer OR can you show this empirically for the B. anthracis and B. subtilis data sets?
- Inputs: Expression data matrix - load .rda data matrix into the working directory. Source the project code, and then run the function "medoidize(x,y,z)" where x = data matrix to cluster, y = number of clusters, z = number of iterations to run the k-medoids clustering algorithm. !!!NOTE: the program will set a working directory, so set that in the code before running the clustering function!!!
- Outputs: Clusters - After the medoidize() function is run, the program saves the clusters as a list of numeric vectors (each number in the vector representing a row (gene for baa.ratios.rda) in that cluster) as "clusters.medoids.rda" to the working directory. The medoidize() function also saves "matrix.medoids.rda" to the working directory, which is a matrix of the medoid values used in the subsequent plot function to plot the medoid values over that medoid cluster's plot graph. After medoidize() is run, run "plot.clusters(x)", where x is the data matrix originally medoidized, and the function will load the resulting saved clusters.medoids.rda and matrix.medoids.rda from the working directory, and use the data matrix information to plot each cluster.
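To make the shared cluster format concrete for downstream groups, the following sketch builds a list of row-index vectors from a toy membership vector, saves it under the file name mentioned above, and loads it back. The object name "clusters" inside the .rda file is an assumption; the description above only fixes the file name.

```r
# Toy membership vector: the cluster label of each of 6 rows (genes).
membership <- c(1, 2, 1, 3, 2, 1)

# List form: one vector of row indices per cluster.
clusters <- split(seq_along(membership), membership)

# Written to a temporary directory here; the project uses the working directory.
out_file <- file.path(tempdir(), "clusters.medoids.rda")
save(clusters, file = out_file)

# A downstream project loads the object back under the same (assumed) name.
env <- new.env()
load(out_file, envir = env)
str(env$clusters)
```

Loading into a separate environment avoids silently overwriting an object of the same name in the caller's workspace.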
Project 2 : linear models of gene regulation : model selection
- 2 students needed: Sign up :( Lisa Nguyen, Andrew Stone )
Transcription factors control the expression of genes in our datasets. Given a cluster of genes, find the transcription factors that are the best predictors of that cluster's median expression level. I’ll give you a list of TFs, a paper that describes this sort of approach in the setting of a functional genomics project, and an example data-structure for output.
Can you use resampling to give error estimates on these linear models OR the components of these models? Can you, given more than one clustering (more than one way to pre-cluster the genes), tell me which cluster result from project 1 is more “predictable”?
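One common way to get such error estimates is a nonparametric bootstrap over conditions: refit the model on resampled conditions and look at the spread of each coefficient. The sketch below uses simulated data; the names tf_a, tf_b and cluster_median are hypothetical, and percentile intervals are just one of several reasonable choices.

```r
set.seed(42)
n <- 40                                   # number of conditions (toy value)
tf_a <- rnorm(n)
tf_b <- rnorm(n)                          # pure noise predictor in this toy setup
cluster_median <- 2 * tf_a + rnorm(n, sd = 0.5)
dat <- data.frame(cluster_median, tf_a, tf_b)

# Refit the model on 500 bootstrap resamples of the conditions.
boot_coefs <- t(replicate(500, {
  rows <- sample(n, replace = TRUE)
  coef(lm(cluster_median ~ tf_a + tf_b, data = dat[rows, ]))
}))

# Percentile intervals: one column per coefficient.
ci <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
ci
```

A coefficient whose interval comfortably excludes zero (tf_a here, by construction) is a more trustworthy model component than one whose interval straddles zero.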
- Inputs: clusters from 1, expression data matrix, list of transcription factors (genes that get to be predictors)
- Outputs: a sparse linear model for each cluster (with statistics showing reliability / significance of each model).
- Code: Proj2.txt
- Database: proj.sqlite3
- R-data object: proj2.rda
Transcription factors are the genes that regulate other genes negatively or otherwise. In this project we determine which TFs are the best predictors of each cluster. You will be given the TFs for this dataset :
tfs <- scan(file = "tfs.txt", what = "character", sep = "\n")
geneID <- labels(ratios)[[1]]
xpred <- ratios[which(geneID %in% tfs), ]
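Building on that snippet, one simple way to get a sparse model is greedy forward selection: add the TF that most improves BIC, and stop when no candidate improves it. The sketch below is an assumption-laden illustration, not the method from the paper mentioned above; the toy ratios matrix, the TF list and the cluster are all simulated, and forward_bic() is a hypothetical helper.

```r
set.seed(7)
n_cond <- 30
ratios <- matrix(rnorm(20 * n_cond), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), NULL))
tfs <- c("gene1", "gene2", "gene3", "gene4")       # hypothetical TF list
# Make genes 10-15 follow gene2 so the toy cluster has a true regulator.
for (g in 10:15) ratios[g, ] <- 1.5 * ratios["gene2", ] + rnorm(n_cond, sd = 0.3)
cluster <- 10:15                                   # row indices, as in project 1 output

xpred <- ratios[rownames(ratios) %in% tfs, , drop = FALSE]
y <- apply(ratios[cluster, , drop = FALSE], 2, median)   # cluster median per condition
dat <- data.frame(y = y, t(xpred))

# Greedy forward selection: add one predictor at a time while BIC decreases.
forward_bic <- function(dat, response = "y") {
  chosen <- character(0)
  candidates <- setdiff(names(dat), response)
  best_fit <- lm(reformulate("1", response), data = dat)   # intercept-only model
  while (length(candidates) > 0) {
    fits <- lapply(candidates, function(v)
      lm(reformulate(c(chosen, v), response), data = dat))
    bics <- vapply(fits, BIC, numeric(1))
    if (min(bics) >= BIC(best_fit)) break
    best <- which.min(bics)
    chosen <- c(chosen, candidates[best])
    candidates <- candidates[-best]
    best_fit <- fits[[best]]
  }
  best_fit
}

model <- forward_bic(dat)
summary(model)$coefficients
```

The coefficient table gives a standard error and p-value for each retained TF, which is one form the "statistics showing reliability" could take. Other sparsity-inducing approaches (for example penalized regression) would fit the same inputs and outputs.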
Project 3 : scoring / judging co-expressed groups : annotation data
2 students needed : Sign up : (Peter)
Use something like Fisher’s exact 2x2 test to “score” the degree to which the clusters provided by project 1 are significant.
You have two types of annotations in the annotations set (KEGG and GO). Which is more enriched in the clustering and thus more related to co-expression? Can you work out a means of choosing between alternate clusterings and make an argument for the cluster size your method suggests (if any)?
- Inputs: clusters from project 1, annotations for Bacillus genes.
- Outputs: summary of significance and score for clusters, a choice for the number of clusters that optimizes the functional significance of the output clusters (or the number of clusters with one or more significant annotations).
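The 2x2 test suggested above can be set up as follows for a single cluster and a single annotation term. The gene names and the annotation set are made up for illustration, and enrichment_test() is a hypothetical helper; the one-sided alternative asks whether the term is over-represented in the cluster.

```r
all_genes <- paste0("gene", 1:200)
cluster_genes <- all_genes[1:20]
# Hypothetical annotation term carried by 15 genes, 10 of them in the cluster.
annotated <- c(all_genes[1:10], all_genes[101:105])

enrichment_test <- function(cluster_genes, annotated, all_genes) {
  in_cluster <- all_genes %in% cluster_genes
  has_term <- all_genes %in% annotated
  # Put TRUE first so the top-left cell counts genes in the cluster with the term.
  tab <- table(in_cluster = factor(in_cluster, levels = c(TRUE, FALSE)),
               has_term = factor(has_term, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")
}

res <- enrichment_test(cluster_genes, annotated, all_genes)
res$p.value
```

When many terms and clusters are tested, the resulting p-values should be corrected for multiple testing, for example with p.adjust(pvals, method = "BH").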
Project 4 : Visualization and Animation
View final result: Project 4
2 or 3 students can sign up for this one but students that sign up for this one must work individually: Sign up : Stephanie
Each student will have to think about this and come see me; the following are suggestions. In the end I’m asking for a lot more than just a plot OR just a function that makes a plot.
Possible things you could make visualizations for, possible projects:
- clusters in the context of the cell cycle: the data we’ve been working with is the cell as it goes through a cell cycle and then sporulates (goes dormant). Can you map clusters onto this timeline with an interactive function OR make an annotated Flash animation? Can you include data about which annotations are significant?
- the dynamics of the cell’s circuit: given the linear models that represent a circuit, can you show how the changing levels of the predictors affect future levels of target genes? (If a regulator goes up in time point 1, what happens to its targets at the next time point?)
- HTML summary of clusters and annotations: given a set of clusters, make an HTML page that organizes plots of each cluster’s expression over the conditions in the dataset.
- Inputs: inputs can be clusters, linear models, associated error predictions, the original data, etc.
- Outputs: flash animations, interactive visualizations, automatically generated websites that organize sets of plots, be creative or else!
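For the HTML summary idea, a very small sketch is shown below. It writes one section per cluster with the cluster's mean expression profile drawn as an inline SVG line, which avoids needing a graphics device. The toy data, the cluster list and the helper svg_profile() are all hypothetical; a real project would likely use proper plots and add annotation results.

```r
set.seed(11)
ratios <- matrix(rnorm(12 * 8), nrow = 12)       # toy genes x conditions matrix
clusters <- list(`1` = 1:5, `2` = 6:12)          # row indices per cluster

# Turn a numeric profile into an inline SVG polyline.
svg_profile <- function(y, width = 300, height = 80) {
  x_px <- seq(5, width - 5, length.out = length(y))
  rng <- range(y)
  # Scale to pixel coordinates; SVG's y axis points downwards.
  y_px <- height - 5 - (y - rng[1]) / max(diff(rng), 1e-9) * (height - 10)
  pts <- paste(sprintf("%.1f,%.1f", x_px, y_px), collapse = " ")
  sprintf('<svg width="%d" height="%d"><polyline fill="none" stroke="black" points="%s"/></svg>',
          width, height, pts)
}

# One heading plus one profile per cluster.
sections <- vapply(names(clusters), function(id) {
  rows <- clusters[[id]]
  profile <- colMeans(ratios[rows, , drop = FALSE])
  sprintf("<h2>Cluster %s (%d genes)</h2>\n%s", id, length(rows), svg_profile(profile))
}, character(1))

html <- c("<html><body>", "<h1>Cluster summary</h1>", sections, "</body></html>")
out_file <- file.path(tempdir(), "clusters.html")
writeLines(html, out_file)
```

Opening the written file in a browser shows one line plot per cluster; the same loop could emit links to per-cluster pages or tables of significant annotations.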
Project 5 : Publish prior graded homework assignments as web services, or integrate into Python main function that demonstrates the use of all examples in the class
- 1 or 2 students can sign up for this group: Sign up : Kelsey, Tal (tentative)
- One good way to share complicated analysis is via a web service. Take the homework assignments from this class and make them into web services that can be accessed by a Java program. You’ll have to make a small Java program that uses the web service.
For example : have the Java program send gene names to the R web service, which then returns things like the p-values and names of all significant annotations (assignment 3b), other genes that are in the 95% confidence interval of these genes (assignment 3b), etc.
Before you start we’ll sit down and make sure each of the functions you’ll publish is correct (using examples on the wiki is ok, using your own functions is better).
- Another good way to share complex analyses is to wrap them in a language that more people are familiar with, such as Python, and provide clear examples of using each component of the analysis via the (in this case) Python program.
If you choose either flavor of this project I'll work with you to make sure that you have good clean versions of the assignments to start.
- Inputs: SQLite database with expression data, annotations data, functions
- Outputs: valid web service
Project 6 : suggest your own
If you have an idea then let me know and it is very likely I’ll agree.
