Interdisciplinary Centre
for Bioinformatics



Data Warehouse for Multidimensional Gene Expression Analysis

Toralf Kirsten, Jörg Lange
Interdisciplinary Centre for Bioinformatics
University of Leipzig

Erhard Rahm
Dept. of Computer Science
University of Leipzig


Cooperation:

  • Max-Planck-Institute for Evolutionary Anthropology
  • Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig
  • Medical Department, University of Leipzig, Interdisciplinary Centre for Clinical Research, University of Leipzig
  • Biotechnical-Biomedical Centre Leipzig


Many popular high-throughput techniques, such as microarrays, produce huge amounts of data. To effectively support experiments and studies using these techniques, a comprehensive database solution is necessary to manage the diverse data of many experiments together with all relevant annotations. In [1], we evaluated current data management solutions for microarray data and observed significant limitations. First, gene annotations are either ignored or integrated only through web links, which prevents automated analysis of subgroups of genes of interest. Second, sample and experiment annotations are mostly captured as free text, whose heterogeneity substantially complicates experiment comparisons and cross-experiment analysis. Third, expression analysis is mostly restricted to stand-alone software tools outside the database system, which cannot take advantage of all relevant annotations. Commercial solutions, such as LIMS by Affymetrix and GeneExpress by GeneLogic, are typically restricted to proprietary algorithms, e.g. for normalization and analysis, which do not necessarily reflect the current state of research.


We have designed and implemented the gene expression warehouse (GeWare) as an integrated platform to overcome the limitations of previous approaches. Figure 1 shows the overall architecture of GeWare. Data is imported from several sources and transformed within a so-called staging area before it is integrated and stored in the central warehouse database for analysis. Gene annotations from several public sources, such as Locuslink, Ensembl, GeneOntology and NetAffx, are accessible through the Sequence Retrieval System (SRS). Specific portions of the data can be extracted from the data warehouse and stored in so-called data marts to support application-driven analysis needs. All administration and analysis functions of GeWare are accessible via web interfaces.
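
The flow from source data through the staging area into the warehouse and on to a data mart can be illustrated with a minimal Python sketch. It is not GeWare's implementation: the record layout, the probe and chip names, and the functions `stage`, `load` and `build_mart` are all hypothetical and chosen only to show the three stages.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from a chip-reader export.
RAW_ROWS = [
    {"chip": "chip_A", "probe": "1000_at", "signal": "120.5"},
    {"chip": "chip_A", "probe": "1001_at", "signal": "n/a"},  # unusable value
    {"chip": "chip_B", "probe": "1000_at", "signal": "98.0"},
    {"chip": "chip_B", "probe": "1001_at", "signal": "45.5"},
]


def stage(rows):
    """Staging area: parse source rows and drop those that cannot be parsed."""
    staged = []
    for row in rows:
        try:
            value = float(row["signal"])
        except ValueError:
            continue  # skip values that are not numeric
        staged.append((row["chip"], row["probe"], value))
    return staged


def load(warehouse, staged):
    """Warehouse: store cleaned values keyed by (chip, probe)."""
    for chip, probe, value in staged:
        warehouse[(chip, probe)] = value


def build_mart(warehouse, probes):
    """Data mart: extract the mean signal per probe for a chosen probe set."""
    mart = {}
    for probe in probes:
        values = [v for (_, p), v in warehouse.items() if p == probe]
        if values:
            mart[probe] = mean(values)
    return mart


warehouse = {}
load(warehouse, stage(RAW_ROWS))
mart = build_mart(warehouse, ["1000_at"])
print(mart)
```

Keeping the cleaning step separate from the load step means that a malformed source row never reaches the central store, and a data mart can be rebuilt from the warehouse at any time without re-importing the sources.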




The key aspects of our approach are the following:


  • GeWare follows the data warehouse approach [2] to centrally integrate and store all relevant data, i.e. expression and annotation data. A data warehouse promises significant advantages because all relevant data is directly accessible for analysis, allowing for both good performance and extensive analysis capabilities.
  • We have developed a multidimensional data model in which expression data is stored in raw and pre-processed form, and annotations on genes, samples, experiments, and processing methods are represented within multiple hierarchical dimensions. An advantage of the data model is its flexible support for focused analysis. For example, one can restrict the analysis to experiments involving only a given tissue and/or to genes with particular characteristics by filtering the corresponding dimensions of the expression data. Furthermore, the data model is easy to extend with new dimensions, e.g. to cover new annotations.
  • We enforce consistent experiment annotation by means of pre-defined annotation templates and controlled vocabularies. A template comprises various annotation categories for which values have to be captured. The categories can be ordered hierarchically, which makes it possible to generate different web pages automatically from the template definition and to group specific categories. Controlled vocabularies are used to avoid semantic heterogeneity and allow us to find experiments with shared properties. Annotation templates and controlled vocabularies are managed in a generic format and can thus be easily adapted to new research requirements.
  • We apply a hybrid integration approach [3] to analyze expression data together with gene annotations. While expression data is stored directly in the data warehouse, gene annotations are stored locally in their original format, i.e. flat, entry-based files or relational databases. To access and use this data in the analysis process, we apply the SRS tool of Lion Bioscience. This integration approach makes it easy to integrate new annotation sources and is robust against changes in external sources.
  • GeWare provides different algorithms for pre-processing and analyzing expression data, e.g. to identify lists of differentially expressed genes. The analysis methods process experiment groups, gene groups and gene expression matrices through uniform interfaces. This not only supports the specification of automatic analysis workflows but also allows the user to continue an analysis from pre-computed results.
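
The multidimensional model and the controlled vocabularies described above can be sketched with a small star schema in SQLite. This is only an illustration under stated assumptions, not the GeWare schema: all table and column names (`tissue_vocab`, `experiment_dim`, `gene_dim`, `expression_fact`), the gene symbols and the values are hypothetical, and the dimensions are kept flat here, whereas the model described above uses hierarchical dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Controlled vocabulary: the only tissue terms an experiment may use.
    CREATE TABLE tissue_vocab (term TEXT PRIMARY KEY);
    -- Experiment dimension, annotated with a vocabulary term.
    CREATE TABLE experiment_dim (
        experiment_id INTEGER PRIMARY KEY,
        tissue TEXT NOT NULL REFERENCES tissue_vocab(term)
    );
    -- Gene dimension with one illustrative annotation column.
    CREATE TABLE gene_dim (
        gene_id INTEGER PRIMARY KEY,
        symbol TEXT NOT NULL,
        go_category TEXT NOT NULL
    );
    -- Fact table: one expression value per (experiment, gene) pair.
    CREATE TABLE expression_fact (
        experiment_id INTEGER NOT NULL REFERENCES experiment_dim(experiment_id),
        gene_id INTEGER NOT NULL REFERENCES gene_dim(gene_id),
        intensity REAL NOT NULL
    );
""")

conn.executemany("INSERT INTO tissue_vocab VALUES (?)", [("brain",), ("liver",)])
conn.executemany("INSERT INTO experiment_dim VALUES (?, ?)",
                 [(1, "brain"), (2, "liver")])
conn.executemany("INSERT INTO gene_dim VALUES (?, ?, ?)",
                 [(10, "GENE_A", "transport"), (11, "GENE_B", "signaling")])
conn.executemany("INSERT INTO expression_fact VALUES (?, ?, ?)",
                 [(1, 10, 7.5), (1, 11, 3.0), (2, 10, 6.0), (2, 11, 9.0)])

# A free-text label that is not in the vocabulary is rejected.
try:
    conn.execute("INSERT INTO experiment_dim VALUES (3, 'Brain tissue')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Focused analysis: filter the experiment and gene dimensions at once.
rows = conn.execute("""
    SELECT g.symbol, f.intensity
    FROM expression_fact AS f
    JOIN experiment_dim AS e ON e.experiment_id = f.experiment_id
    JOIN gene_dim AS g ON g.gene_id = f.gene_id
    WHERE e.tissue = ? AND g.go_category = ?
""", ("brain", "transport")).fetchall()
print(rejected, rows)
```

Because every experiment must reference a vocabulary term, a query such as `e.tissue = 'brain'` finds all matching experiments without having to reconcile spelling variants in free text. Adding a further dimension would amount to one more table and one more key column in the fact table.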



We have designed and implemented the GeWare system as an integrated gene expression analysis platform. This comprehensive database solution makes it possible to analyze gene expression data together with both experiment annotations, which are specified by the experimenter and follow the MIAME standard, and publicly available gene annotations from Locuslink, Ensembl, GeneOntology and NetAffx. The GeWare system is fully operational and has been employed in several research projects in Leipzig. Currently, GeWare manages data for more than 1,500 experiments focussing on Affymetrix microarrays. Interactive access is available at https://ducati.izbi.uni-leipzig.de/Geware.


Publications:
Kirsten, T., Lange, J., Rahm, E.:
An integrated platform for analyzing molecular-biological data within clinical studies.
In: Proc. International Workshop on Information Integration in Healthcare Applications (IIHA) in conjunction with EDBT 2006, Munich, 2006.
Kirsten, T., Do, H.-H., Körner, C., Rahm, E.:
Hybrid Integration of molecular-biological Annotation Data.
In: Ludäscher, B., Raschid, L. (Eds.), Proc. of the 2nd International Workshop on Data Integration in the Life Sciences (DILS), San Diego, 2005, Springer Verlag, Berlin Heidelberg.
ISBN 3-540-27967-9
Körner, C., Kirsten, T., Do, H.-H., Rahm, E.:
Hybride Integration von molekular-biologischen Annotationsdaten.
Proc. 11th Conf. Database Systems for Business, Technology and Web (BTW), 2005
Kirsten, T., Do, H.-H., Rahm, E.:
A Data Warehouse for Multidimensional Analysis.
Technical Report, IZBI, November 2004.
Do, H.-H., Kirsten, T., Rahm, E.:
Comparative Evaluation of Microarray-based Gene Expression Databases.
Proc. 10th Conf. Database Systems for Business, Technology and Web (BTW), 2003
