From Bonneau Courses
Identifying human non-coding RNA genes that influence Th17 differentiation
The human immune system protects the body from pathogens and abnormal cells such as bacteria, fungi, viruses and cancer cells. It consists of many cell types, including CD4 T cells, CD8 T cells, regulatory T cells, B cells, macrophages, neutrophils, eosinophils, basophils, monocytes and dendritic cells. Of these, CD4 T cells are the focus of our research. CD4 T cells do not interact directly with pathogens or abnormal cells; instead they help other immune cells eliminate them. The immune system therefore cannot function well when CD4 T cells are depleted, for example by Human Immunodeficiency Virus (HIV). Broadly, there are four kinds of CD4 T cells: Th1, Th2, Th17 and T-reg. Th1 cells help CD8 T cells kill virus-infected cells or cancer cells; Th2 cells stimulate B cells to produce antibodies, which bind to and thereby inactivate pathogens or abnormal cells.
Th1, Th2, Th17 and T-reg cells all differentiate from naive CD4 T cells (Th0) in response to signals from their environment. Of these, the differentiation of Th0 into Th17 is the process we study. It involves expression of the gene "ROR-gammaT", which is responsible for production of the IL-23R protein (see supplements).
So what is gene expression? Gene expression is the process by which the information in a gene is converted into a functional product such as a protein, a tRNA or another non-coding RNA. For protein-coding genes it usually involves three steps: transcription, splicing and translation. Transcription produces an RNA that is complementary to the DNA template, and it takes place in the cell nucleus. This initial transcript contains both exons (protein-coding) and introns (non-protein-coding). Because introns are not used to make protein, splicing removes them, also in the nucleus. The spliced RNA, which is the finished transcription product, then leaves the nucleus and is used to produce protein; this step is called translation. Splicing can also leave out coding regions so that one gene yields different proteins, which is called alternative splicing. In other words, when alternative splicing occurs, some exons are treated as non-coding regions and are removed from the initial transcript. Such non-coding regions are of particular interest to us.
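The three steps can be illustrated with a small toy model. This is only a sketch: the DNA sequence, the exon coordinates and the six-entry codon table are invented for illustration and do not describe a real gene. Transcription is modelled by copying the coding strand, since an RNA complementary to the template strand has the same sequence as the coding strand with U in place of T.

```python
# Toy model of transcription, splicing and translation.
# The DNA sequence, exon coordinates and the tiny codon table are invented
# for illustration only; they do not describe a real gene.

CODON_TABLE = {"AUG": "M", "GCU": "A", "GGC": "G", "UUU": "F", "AAA": "K", "UAA": "*"}


def transcribe(coding_strand):
    """Return the pre-mRNA matching the DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Join the exon intervals (0-based, end-exclusive); everything else is dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read codons from position 0 until a stop codon ("*") or the end of the RNA."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# exon 1 + intron 1 + exon 2 + intron 2 + exon 3
DNA = "ATGGCT" + "GTAAGTCAG" + "GGCTTT" + "GTCCAG" + "AAATAA"
ALL_EXONS = [(0, 6), (15, 21), (27, 33)]
SKIP_EXON_2 = [(0, 6), (27, 33)]  # alternative splicing: exon 2 is left out

pre_mrna = transcribe(DNA)
full_protein = translate(splice(pre_mrna, ALL_EXONS))
short_protein = translate(splice(pre_mrna, SKIP_EXON_2))
print(full_protein, short_protein)
```

With all three exons kept the toy transcript translates to MAGFK; when exon 2 is skipped the same gene yields the shorter product MAK.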
The Littman lab (NYU Langone Medical Center) has recently collected a large dataset that follows CD4 T cells as they develop from Th0 to Th17 cells. The Bonneau lab has used the data to build a regulatory network that connects transcription factors and chromatin remodeling factors to target genes. We mainly concentrate on already annotated genes, but the RNA-seq data also give us a deep and comprehensive look at which other long non-coding RNAs are expressed and change during the differentiation of these cells. We take the whole set of RNA-seq data and use new tools to assemble these RNA genes so that we can check whether they are homologous to human genes.
We collect Th17 differentiation data, namely gene expression levels, in two observation windows. The first (Th0) covers 0 to 4 hours, and the second (Th17) covers 24 to 48 hours. The data are stored on the High Performance Computing (HPC) cluster at New York University as Cufflinks output. We process them further with cuffmerge, cuffcompare and cuffdiff to obtain the genes related to Th17 differentiation. These genes are then compared to the human genome sequence using NCBI BLAST.
The purpose of using Cufflinks is to obtain gene expression levels. Cufflinks builds on two other tools: Bowtie, which aligns short sequencing reads to the reference genome, and TopHat, which uses Bowtie to map the RNA-seq reads and then identifies splice junctions, i.e. the points where introns have been removed between exons. From these spliced alignments Cufflinks assembles transcripts (the exons joined across the junctions) and estimates their abundance. Once the Cufflinks output is available, cuffmerge, cuffcompare and cuffdiff are used to analyse the data further.
Our datasets come from Illumina RNA-seq experiments on CD4 T cells, with expression levels reported in fragments per kilobase of exon per million fragments mapped (FPKM). Samples were read at different time points along the differentiation of naive CD4 T cells (Th0) into mature Th17 cells. First, the RNA-seq reads were passed to TopHat, a fast splice junction mapper that aligns RNA-Seq reads to the genome using the ultra high-throughput short read aligner Bowtie and then analyzes the mapping results to identify splice junctions between exons.
The alignment file produced by TopHat, accepted_hits.bam, is the input to Cufflinks, which assembles transcripts against the reference transcript annotation.
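The FPKM unit can be made concrete with a short calculation. The sketch below applies only the basic definition of the unit; Cufflinks itself estimates abundances with a statistical model, so its reported values need not match this simple ratio. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped.

    fragments              -- fragments assigned to the gene or transcript
    exon_length_bp         -- summed exon length in base pairs
    total_mapped_fragments -- all mapped fragments in the sample
    """
    # Dividing by (length / 1e3) and by (total / 1e6) equals multiplying by 1e9.
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Made-up example: 500 fragments on 2,000 bp of exon in a 20 million fragment library.
example_fpkm = fpkm(500, 2000, 20_000_000)
print(example_fpkm)
```

For this made-up example the result is 12.5. Normalising by both length and library size is what allows values to be compared between genes of different sizes and between the Th0 and Th17 samples.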
The Cufflinks outputs for the two conditions, th0_transcripts.gtf and th17_transcripts.gtf, were merged with cuffmerge (a script in the Cufflinks package). cuffmerge merges novel and known isoforms gracefully, maximizes overall assembly quality, and automatically filters out a number of transfrags that are probably artifacts. It is run with a FASTA file of the genomic DNA sequence and a GTF file of annotated genes as references. The main purpose of this script is to make it easier to produce an assembly GTF file suitable for use with cuffdiff.
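The TopHat, Cufflinks and cuffmerge steps described so far could be laid out as command lines along the following lines. This is a sketch under stated assumptions, not the exact commands we ran: every file and directory name except accepted_hits.bam, th0_transcripts.gtf and th17_transcripts.gtf is a hypothetical placeholder, and the choice of Cufflinks' -g option (reference annotation used as a guide, so that novel isoforms can still be assembled) is an assumption. The code only builds and prints the commands; it does not execute them.

```python
import shlex

# Hypothetical placeholder paths (assumptions, not taken from the project).
BOWTIE_INDEX = "mouse_genome_index"
GENOME_FASTA = "mouse_genome.fa"
ANNOTATION_GTF = "annotated_genes.gtf"


def tophat_cmd(reads_fastq, out_dir):
    """Align reads with TopHat; the output directory will hold accepted_hits.bam."""
    return ["tophat", "-o", out_dir, "-G", ANNOTATION_GTF, BOWTIE_INDEX, reads_fastq]


def cufflinks_cmd(accepted_hits_bam, out_dir):
    """Assemble transcripts, using the annotation as a guide (-g, an assumption)."""
    return ["cufflinks", "-o", out_dir, "-g", ANNOTATION_GTF, accepted_hits_bam]


def manifest_text(gtf_paths):
    """cuffmerge reads a text file that lists one assembly GTF per line."""
    return "\n".join(gtf_paths) + "\n"


def cuffmerge_cmd(manifest_path, out_dir):
    """Merge assemblies against the reference annotation (-g) and genome (-s)."""
    return ["cuffmerge", "-o", out_dir, "-g", ANNOTATION_GTF,
            "-s", GENOME_FASTA, manifest_path]


manifest = manifest_text(["th0_transcripts.gtf", "th17_transcripts.gtf"])
commands = [
    tophat_cmd("th0_reads.fastq", "th0_tophat"),
    cufflinks_cmd("th0_tophat/accepted_hits.bam", "th0_cufflinks"),
    tophat_cmd("th17_reads.fastq", "th17_tophat"),
    cufflinks_cmd("th17_tophat/accepted_hits.bam", "th17_cufflinks"),
    cuffmerge_cmd("assemblies.txt", "merged_asm"),
]
for command in commands:
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to hand the same lists to a job script on the HPC cluster later.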
For differential analysis with gene and transcript discovery, we run cuffdiff on the output of cuffmerge, merged.gtf, to identify differentially expressed genes, isoforms, transcription start sites (TSS), promoters and coding sequences (CDS).
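As an illustration of how the cuffdiff results might be screened, the sketch below reads a gene-level table in the tab-separated layout of cuffdiff's gene_exp.diff file and keeps the genes that were tested successfully and flagged as significant. The rows, gene names and the fold-change cutoff are invented for the example; they are not results from our data.

```python
import csv
import io

COLUMNS = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
           "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
           "q_value", "significant"]

# Invented rows, for illustration only.
ROWS = [
    ["XLOC_000001", "XLOC_000001", "geneA", "chr1:100-900", "Th0", "Th17", "OK",
     "2.0", "16.0", "3.0", "4.1", "0.0001", "0.002", "yes"],
    ["XLOC_000002", "XLOC_000002", "geneB", "chr2:50-700", "Th0", "Th17", "OK",
     "10.0", "11.0", "0.1375", "0.3", "0.6", "0.8", "no"],
    ["XLOC_000003", "XLOC_000003", "geneC", "chr3:10-400", "Th0", "Th17", "NOTEST",
     "0.1", "0.2", "1.0", "0", "1", "1", "no"],
    ["XLOC_000004", "XLOC_000004", "geneD", "chr4:10-400", "Th0", "Th17", "OK",
     "8.0", "2.0", "-2.0", "-3.5", "0.0005", "0.001", "yes"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [COLUMNS] + ROWS) + "\n"


def significant_genes(diff_file, min_abs_log2_fc=1.0):
    """Return (gene, log2 fold change, q-value) tuples, most significant first.

    min_abs_log2_fc is a hypothetical cutoff chosen for this example.
    """
    hits = []
    for row in csv.DictReader(diff_file, delimiter="\t"):
        # Skip genes that could not be tested or were not called significant.
        if row["status"] != "OK" or row["significant"] != "yes":
            continue
        log2_fc = float(row["log2(fold_change)"])
        if abs(log2_fc) >= min_abs_log2_fc:
            hits.append((row["gene"], log2_fc, float(row["q_value"])))
    return sorted(hits, key=lambda hit: hit[2])


result = significant_genes(io.StringIO(SAMPLE))
for gene, log2_fc, q_value in result:
    print(gene, log2_fc, q_value)
```

On a real run, `io.StringIO(SAMPLE)` would be replaced by an open handle on the gene_exp.diff file in the cuffdiff output directory.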
2. CummeRbund Package in R
CummeRbund was designed to simplify the analysis and exploration of RNA-Seq data derived from a cuffdiff differential expression analysis, with the goal of providing fast and intuitive access to the results. We downloaded the CummeRbund package from Bioconductor and ran it in R (2.14.1) to analyze the cuffdiff results.
To detect non-coding RNAs that influence the Th17 differentiation,
3. NCBI BLAST
We compare the Th0 and Th17 data to find non-coding RNAs using Cufflinks, and then compare them to the human genome using NCBI BLAST on the HPC cluster.
Since the Cufflinks data were obtained from mice, we compare them to the human genome using NCBI BLAST. We plan to compare translated sequences by searching a translated query against a translated database (tblastx).
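A possible way to set up this search and screen its output is sketched below. The query, database and output names are hypothetical placeholders, and the E-value cutoff of 1e-5 is an assumption rather than a setting we have settled on. The parser expects BLAST+ tabular output (-outfmt 6) with its default twelve columns, and the example hits are invented.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def tblastx_cmd(query_fasta, database, out_path, max_evalue=1e-5):
    """Build (but do not run) a tblastx command; all paths are placeholders."""
    return ["tblastx", "-query", query_fasta, "-db", database,
            "-evalue", str(max_evalue), "-outfmt", "6", "-out", out_path]


def best_hits(lines, max_evalue=1e-5):
    """Keep, for each query, the hit with the highest bit score under the cutoff."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], evalue, bitscore)
    return best


# Invented hits, for illustration only.
EXAMPLE_HITS = [
    "\t".join(["TCONS_1", "chr1", "85.0", "60", "9", "0", "1", "180", "1000", "1179", "1e-20", "95.0"]),
    "\t".join(["TCONS_1", "chr5", "70.0", "40", "12", "0", "10", "129", "500", "619", "1e-6", "50.0"]),
    "\t".join(["TCONS_2", "chr2", "40.0", "30", "18", "0", "1", "90", "1", "90", "0.5", "20.0"]),
]

command = tblastx_cmd("candidate_transcripts.fa", "human_genome", "hits.tsv")
top = best_hits(EXAMPLE_HITS)
print(" ".join(command))
print(top)
```

In this made-up example TCONS_1 keeps its stronger hit on chr1, while TCONS_2 is dropped because its only hit has an E-value above the cutoff.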
By applying the preceding methodology, we have identified the following non-coding genes:X,Y,Z. The reason we get such results is....