Data Warehouse for Multidimensional Gene Expression Analysis
Toralf Kirsten, Jörg Lange
Interdisciplinary Centre for Bioinformatics
University of Leipzig
Dept. of Computer Science
University of Leipzig
- Max-Planck-Institute for Evolutionary Anthropology
- Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig
- Medical Department, University of Leipzig, Interdisciplinary Centre for Clinical Research, University of Leipzig
- Biotechnical-Biomedical Centre Leipzig
Many popular high-throughput techniques, such as microarrays, produce huge amounts of data.
To effectively support experiments and studies using these techniques, a comprehensive database solution is necessary to manage
diverse data from many experiments together with all relevant annotations. In an earlier study (Do et al., 2003), we evaluated current data management solutions for
microarray data and observed significant limitations. First, gene annotations are either ignored or integrated only through web links, which
prevents automated analysis of subgroups of genes of interest. Second, sample and experiment annotations are mostly captured as
free text, whose heterogeneity considerably complicates experiment comparisons and cross-experiment analysis. Third, expression analysis
is mostly restricted to stand-alone software tools outside the database system, which cannot take advantage of all relevant annotations.
Commercial solutions, such as LIMS by Affymetrix and GeneExpress by GeneLogic, are typically restricted to proprietary algorithms, e.g.
for normalization and analysis, which do not necessarily reflect the current state of research.
We have designed and implemented the gene expression warehouse (GeWare) as an integrated platform to overcome the limitations of
previous approaches. Figure 1 shows the overall architecture of GeWare. Data is imported from several sources and transformed within
a so-called staging area before it is integrated and stored in the central warehouse database for analysis. Gene annotations from several
public sources, such as LocusLink, Ensembl, GeneOntology and NetAffx, are accessible through the Sequence Retrieval System (SRS).
Specific portions of data can be extracted from the data warehouse and stored in so-called data marts to support application-driven
analysis needs. All administration and analysis functions of GeWare are accessible via web interfaces.
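The import path described above (source data, staging area, central warehouse) can be illustrated with a minimal sketch. The tab-separated input layout, the column names `experiment`, `probe` and `intensity`, and the functions `stage` and `load` are assumptions made purely for illustration; they are not taken from GeWare, and a dictionary stands in for the warehouse database.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionRecord:
    """One validated measurement, ready to be loaded into the warehouse."""
    experiment_id: str
    probe_id: str
    intensity: float


def stage(raw_text):
    """Staging step: parse tab-separated text and separate clean from bad rows.

    Returns (accepted, rejected): accepted is a list of ExpressionRecord,
    rejected is a list of the raw row dictionaries that failed validation.
    """
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text), delimiter="\t"):
        try:
            record = ExpressionRecord(
                experiment_id=row["experiment"].strip(),
                probe_id=row["probe"].strip(),
                intensity=float(row["intensity"]),
            )
        except (KeyError, AttributeError, TypeError, ValueError):
            # Missing columns, missing values, or non-numeric intensities.
            rejected.append(row)
            continue
        accepted.append(record)
    return accepted, rejected


def load(warehouse, records):
    """Load step: store staged records in the (hypothetical) central store."""
    for record in records:
        warehouse[(record.experiment_id, record.probe_id)] = record.intensity


# Hypothetical input: one valid row and one row with an unparsable value.
raw = "experiment\tprobe\tintensity\nE1\tprobe_1\t12.5\nE1\tprobe_2\tn/a\n"
accepted, rejected = stage(raw)
warehouse = {}
load(warehouse, accepted)
```

Keeping validation in a separate staging step means that malformed rows are collected for inspection and never reach the store used for analysis.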
The key aspects of our approach are the following:
- GeWare follows the data warehouse approach to centrally integrate and store all relevant data, i.e. expression and annotation data.
A data warehouse promises significant advantages because all relevant data is directly accessible for analysis, allowing for both good
performance and extensive analysis capabilities.
- We have developed a multidimensional data model where expression data is stored in raw and pre-processed form and annotations on genes,
samples, experiments, and processing methods are represented within multiple hierarchical dimensions. An advantage of the data model is the
flexible support for focused analysis. For example, one can restrict the analysis to experiments involving only a given tissue and/or genes
with particular characteristics by filtering the corresponding dimensions of expression data. Furthermore, the data model is easy to extend
with new dimensions, e.g. to cover new annotations.
- We enforce consistent experiment annotation by means of pre-defined annotation templates and controlled vocabularies. A template comprises
various annotation categories for which values have to be captured. The categories can be ordered hierarchically, which makes it possible to
generate different web pages automatically from the template definition and to group specific categories. Controlled vocabularies are
used to avoid semantic heterogeneity and allow us to find experiments with shared properties. Annotation templates and controlled
vocabularies are managed in a generic format and can thus be easily adapted to new research requirements.
- We apply a hybrid integration approach (Kirsten et al., 2005) to analyze expression data together with gene annotations. While expression data is stored
directly in the data warehouse, gene annotations are stored locally in their original format, i.e. flat, entry-based files or relational databases.
To access and use this data in the analysis process, we apply the SRS tool of Lion Bioscience. This integration approach makes it easy
to integrate new annotation sources and is robust against changes in external sources.
- GeWare provides different algorithms for pre-processing and analyzing expression data, e.g. to identify lists of
differentially expressed genes. The analysis methods process experiment groups, gene groups and gene expression matrices through uniform interfaces.
This not only supports the specification of automatic analysis workflows but also allows the user to reuse pre-computed results to continue
an analysis.
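The focused analysis enabled by the multidimensional data model can be illustrated with a small star schema: a fact table of expression values linked to gene and sample dimension tables, queried by filtering on both dimensions. This is a minimal sketch under stated assumptions. All table names, column names, and values below are hypothetical and do not reflect the actual GeWare schema, and the dimensions are kept flat, whereas the model described above uses hierarchical dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table and two flat dimension tables.
conn.executescript("""
CREATE TABLE gene_dim (
    gene_id INTEGER PRIMARY KEY,
    symbol TEXT,
    functional_category TEXT
);
CREATE TABLE sample_dim (
    sample_id INTEGER PRIMARY KEY,
    tissue TEXT
);
CREATE TABLE expression_fact (
    gene_id INTEGER REFERENCES gene_dim(gene_id),
    sample_id INTEGER REFERENCES sample_dim(sample_id),
    intensity REAL
);
""")

# Made-up illustration data, not real measurements.
conn.executemany(
    "INSERT INTO gene_dim VALUES (?, ?, ?)",
    [(1, "GENE_A", "transport"), (2, "GENE_B", "signaling")],
)
conn.executemany(
    "INSERT INTO sample_dim VALUES (?, ?)",
    [(10, "brain"), (11, "liver")],
)
conn.executemany(
    "INSERT INTO expression_fact VALUES (?, ?, ?)",
    [(1, 10, 120.0), (1, 11, 80.0), (2, 10, 45.0), (2, 11, 300.0)],
)


def focused_expression(connection, tissue, functional_category):
    """Return expression values restricted by sample and gene dimensions."""
    query = """
        SELECT g.symbol, s.tissue, f.intensity
        FROM expression_fact AS f
        JOIN gene_dim AS g ON g.gene_id = f.gene_id
        JOIN sample_dim AS s ON s.sample_id = f.sample_id
        WHERE s.tissue = ? AND g.functional_category = ?
        ORDER BY g.symbol
    """
    return connection.execute(query, (tissue, functional_category)).fetchall()


rows = focused_expression(conn, "brain", "transport")
```

In such a layout, covering a new kind of annotation amounts to adding a further dimension table and a foreign key in the fact table, which matches the extensibility argument made for the data model.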
We have designed and implemented the GeWare system as an integrated gene expression analysis platform.
This comprehensive database solution makes it possible to analyze gene expression data together with both experiment annotations, which are specified
by the experimenter and follow the MIAME standard, and publicly available gene annotations from LocusLink, Ensembl, GeneOntology and NetAffx.
The GeWare system is fully operational and has been employed in several research projects in Leipzig. Currently, GeWare manages data for more
than 1,500 experiments focussing on Affymetrix microarrays. Interactive access is available at https://ducati.izbi.uni-leipzig.de/Geware.
Kirsten, T., Lange, J., Rahm, E.:
An integrated platform for analyzing molecular-biological data within clinical studies.
In: Proc. International Workshop on Information Integration in Healthcare Applications (IIHA), in conjunction with EDBT 2006, Munich, 2006.

Kirsten, T., Do, H.-H., Körner, C., Rahm, E.:
Hybrid Integration of molecular-biological Annotation Data.
In: Ludäscher, B., Raschid, L. (Eds.), Proc. 2nd International Workshop on Data Integration in the Life Sciences (DILS), San Diego, 2005. Springer Verlag, Berlin Heidelberg.

Körner, C., Kirsten, T., Do, H.-H., Rahm, E.:
Hybride Integration von molekular-biologischen Annotationsdaten.
Proc. 11th Conf. Database Systems for Business, Technology and Web (BTW), 2005.

Kirsten, T., Do, H.-H., Rahm, E.:
A Data Warehouse for Multidimensional Analysis.
Technical Report, IZBI, November 2004.

Do, H.-H., Kirsten, T., Rahm, E.:
Comparative Evaluation of Microarray-based Gene Expression Databases.
Proc. 10th Conf. Database Systems for Business, Technology and Web (BTW), 2003.