Functional Interpretation of Gene Expression Data


Microarrays are at the center of a revolution in biotechnology, allowing researchers to monitor the expression of tens of thousands of genes simultaneously. The ultimate aim of a typical microarray experiment is to find a molecular explanation for a given macroscopic observation (e.g., which pathways are affected by the loss of glucose in a cell, or which biological processes differentiate a healthy control from a diseased case). This task is called functional interpretation of gene expression data.

The thesis presents two new methods for functional interpretation of gene expression data that combine biological knowledge stored in different kinds of databases. The interpretation is done by identifying and describing gene sets that have a significantly altered expression profile (e.g., over- or under-expressed). The search for interesting gene sets is performed both in the space of already defined gene sets (genes that share an annotation by a predefined ontological term) and in the space of newly generated gene sets that have predefined characteristics (e.g., a minimum number of member genes declared to be differentially expressed). Three well-established methods, Fisher's exact test, Gene Set Enrichment Analysis (GSEA), and Parametric Analysis of Gene Set Enrichment (PAGE), were employed to identify gene sets with a significantly altered expression profile.
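As an illustration of the first of these tests, the following minimal sketch computes a one-sided Fisher's exact test p-value for over-representation of a gene set among the selected genes. It is a generic textbook formulation, not the code used in the thesis; the function name and its parameters are hypothetical.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_selected, n_in_set, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= n_overlap), where X is the number of gene-set members
    among n_selected genes drawn without replacement from a universe of
    n_universe genes that contains n_in_set gene-set members
    (hypergeometric upper tail).
    """
    max_overlap = min(n_selected, n_in_set)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (made up): 10 genes measured, 3 selected as differentially
# expressed, 3 genes in the gene set, all 3 of them among the selected genes.
p_value = fisher_enrichment_p(10, 3, 3, 3)
```

A small p-value suggests that the gene set contains more selected genes than expected by chance. When many gene sets are tested, a multiple-testing correction would also be needed, which this sketch omits.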

Both methods share the same mechanism for constructing first-order (relational) features from the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology, gene annotations, and gene-gene interaction data. These features are used as generalized gene annotations.
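To make the idea of a relational feature concrete, the sketch below builds one hypothetical feature, "interacts with a gene annotated by a given term", from annotation and interaction data. The data layout, the term identifiers, and the function names are assumptions made for illustration only; interactions are treated as undirected, and the ontology hierarchy is not modelled.

```python
def genes_with_term(annotations, term):
    """Genes directly annotated with the given term."""
    return {gene for gene, terms in annotations.items() if term in terms}


def interacts_with_term(annotations, interactions, term):
    """Genes that interact with at least one gene annotated with the term.

    Each interaction is a pair (gene_a, gene_b), treated as undirected.
    """
    annotated = genes_with_term(annotations, term)
    result = set()
    for gene_a, gene_b in interactions:
        if gene_b in annotated:
            result.add(gene_a)
        if gene_a in annotated:
            result.add(gene_b)
    return result


# Made-up toy data: gene identifiers and term identifiers are placeholders.
annotations = {
    "g1": {"TERM:A"},
    "g2": {"TERM:B"},
    "g3": {"TERM:A", "TERM:B"},
    "g4": set(),
}
interactions = [("g1", "g2"), ("g3", "g4")]

feature_genes = interacts_with_term(annotations, interactions, "TERM:A")
```

The resulting set of genes can be handled exactly like the set of genes annotated by an ordinary term, which is the sense in which such a feature acts as a generalized annotation.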

The first method is based on threshold-based functional analysis and is performed in two steps. In the first step, the "top" genes of interest are selected using differential expression as the selection criterion. This selection does not take into account that these genes act cooperatively in the cell; consequently, for a better interpretation of the selected gene list, their behavior must be coupled to some extent. In the second step, the selected genes are described in a language constructed from GO, gene annotations, and gene-gene interaction data. Using this background knowledge together with the paradigm of relational subgroup discovery, as implemented in the RSD algorithm, we found common descriptions of groups of genes differentially expressed in specific cancers that can be used directly by medical experts.
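The two steps can be sketched as follows. The selection uses a simple absolute-score threshold, and candidate descriptions are scored with weighted relative accuracy (WRAcc), a quality measure commonly used in subgroup discovery. Whether the thesis uses exactly this threshold rule or this measure is an assumption, and all names below are hypothetical.

```python
def select_top_genes(scores, threshold):
    """Step 1: keep genes whose absolute differential-expression score
    reaches the threshold."""
    return {gene for gene, score in scores.items() if abs(score) >= threshold}


def wracc(covered, positives, universe):
    """Step 2 (scoring): weighted relative accuracy of a description.

    covered   -- genes that satisfy the candidate description
    positives -- genes selected as differentially expressed
    universe  -- all genes under study
    WRAcc = P(covered) * (P(positive | covered) - P(positive))
    """
    covered = covered & universe
    if not covered:
        return 0.0
    n = len(universe)
    coverage = len(covered) / n
    precision = len(covered & positives) / len(covered)
    prior = len(positives & universe) / n
    return coverage * (precision - prior)


# Made-up toy scores (e.g., t statistics) for four genes.
scores = {"g1": 3.0, "g2": -2.5, "g3": 0.4, "g4": 0.1}
universe = set(scores)
top = select_top_genes(scores, threshold=2.0)

# A description covering exactly the selected genes scores highest.
quality = wracc({"g1", "g2"}, top, universe)
```

A positive WRAcc means the description covers selected genes more often than the background rate would predict; a full subgroup discovery algorithm would search over many candidate descriptions rather than score a single one.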

The second method is based on threshold-free functional analysis and is also performed in two steps. In the first step, genes are ranked by their differential expression between predefined classes (e.g., tumor vs. healthy controls), using any appropriate statistical test (e.g., the t-test). In the second step, the positions of the members of predefined gene sets (e.g., those defined by GO and KEGG Orthology terms) in the ranked list are analyzed using an appropriate statistical test (e.g., Kolmogorov-Smirnov). Gene sets whose members are found predominantly at the top of the list are considered enriched and responsible for the phenotype difference (e.g., tumor vs. normal). Our contribution to this methodology is an efficient algorithm, inspired by the first-order feature construction of RSD, for constructing new, potentially enriched gene sets. The new gene sets are defined by conjunctions of relational features constructed from the background knowledge.
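The following minimal sketch illustrates the two steps and the conjunction of features. It ranks genes by Welch's t statistic, computes an unweighted Kolmogorov-Smirnov-style running-sum score for a gene set, and forms a new gene set as the intersection of the gene sets of several features. It is an illustration under these assumptions, not the algorithm developed in the thesis; all function names and the data layout are hypothetical, and significance assessment (e.g., by permutation) is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch's t statistic; needs at least two values per class and a
    non-zero pooled standard error."""
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se


def rank_genes(expression, case_idx, control_idx):
    """Step 1: order genes from most up-regulated to most down-regulated."""
    scores = {
        gene: welch_t([values[i] for i in case_idx],
                      [values[i] for i in control_idx])
        for gene, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def enrichment_score(ranked, gene_set):
    """Step 2: signed maximum deviation of the running sum that goes up
    at gene-set members and down at all other genes."""
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = best = 0.0
    for hit in hits:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def conjunction(*feature_gene_sets):
    """New gene set: genes that satisfy every feature in the conjunction."""
    return set.intersection(*feature_gene_sets)


# Made-up toy data: three genes, samples 0-2 are cases, 3-5 are controls.
expression = {
    "up": [5, 6, 7, 1, 2, 3],
    "flat": [1, 2, 3, 1, 2, 3],
    "down": [1, 2, 3, 5, 6, 7],
}
ranked = rank_genes(expression, case_idx=[0, 1, 2], control_idx=[3, 4, 5])
```

A score close to 1 indicates that the members of the gene set are concentrated at the top of the ranked list, and a score close to -1 that they are concentrated at the bottom.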

The two methods have proved to be of interest to medical experts. The extracted knowledge was consistent with the relevant literature, and the methods have the potential to guide biomedical research and to generate new hypotheses that explain microarray measurements.

A by-product of the thesis is an easy-to-use relational database that integrates several sources of biological knowledge (GO, KEGG Orthology, gene annotations, and gene-gene interaction data) in a unified format. This database is now publicly available to the wider scientific community.

==============================================

Full text of the thesis can be found here.
Part of the thesis is implemented in the web tool SEGS (Search for Enriched Gene Sets).