Interdisciplinary Centre
for Bioinformatics



BioFuice: A decentralized approach for data integration in bioinformatics


Toralf Kirsten
Interdisciplinary Centre for Bioinformatics
University of Leipzig

Erhard Rahm
Dept. of Computer Science
University of Leipzig


Cooperation:

  • David Aumüller, Nick Golovin, Andreas Thor
    DBS
    University of Leipzig
  • Peter Stadler, Claudia Fried
    Bioinformatics Group
    University of Leipzig
  • Jürgen Jost, Anirban Banerjee
    MPI MIS Leipzig


Many bioinformatics applications require data from different sources to answer complex research questions. Integrating such highly diverse data is a major challenge in bioinformatics and is often far too laborious and error-prone for scientists. Traditional database approaches are mostly too rigid and do not scale.
We developed the BioFuice approach for interconnecting and integrating data from different autonomous sources. It is based on a decentralized, peer-to-peer-like infrastructure. We directly use instance-level (object-level) correspondences between different sources, which are often already available in the sources in the form of web links, e.g. based on accession ids. Sets of such correspondences represent mappings between sources that describe objects of different types, such as genes, proteins, and their functions. A data source can comprise several such object types. For instance, the Ensembl source contains, amongst others, the object types "Gene" and "Transcript", NetAffx the type "Gene", and SwissProt/UniProt the type "Protein". These object types and their corresponding intra- and inter-source mappings form the so-called source mapping model. Each mapping is also assigned a semantic mapping type. Together with the object types, the mapping types mirror the semantics of the domain within a so-called domain model.
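
The relationship between the source mapping model and the domain model can be illustrated with a minimal Python sketch. It is not BioFuice code: the class names, the mapping type names ("HasTranscript", "SameGene", "EncodesProtein") and all object ids are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class LogicalSource:
    """One object type within one data source, e.g. Ensembl/"Gene"."""
    source: str
    object_type: str


@dataclass
class Mapping:
    """A set of instance-level correspondences between two logical sources."""
    from_source: LogicalSource
    to_source: LogicalSource
    mapping_type: str  # semantic mapping type (hypothetical names below)
    correspondences: Set[Tuple[str, str]] = field(default_factory=set)

    def targets(self, ids: Iterable[str]) -> Set[str]:
        """Return all objects that the given objects correspond to."""
        wanted = set(ids)
        return {t for s, t in self.correspondences if s in wanted}


ensembl_gene = LogicalSource("Ensembl", "Gene")
ensembl_transcript = LogicalSource("Ensembl", "Transcript")
netaffx_gene = LogicalSource("NetAffx", "Gene")
swissprot_protein = LogicalSource("SwissProt", "Protein")

# Source mapping model: object types per source plus intra- and
# inter-source mappings. All ids are made up, not real accession ids.
source_mapping_model: List[Mapping] = [
    Mapping(ensembl_gene, ensembl_transcript, "HasTranscript",
            {("ens-g1", "ens-t1"), ("ens-g1", "ens-t2")}),   # intra-source
    Mapping(ensembl_gene, netaffx_gene, "SameGene",
            {("ens-g1", "na-g7")}),                          # inter-source
    Mapping(ensembl_gene, swissprot_protein, "EncodesProtein",
            {("ens-g1", "sp-p3")}),                          # inter-source
]


def domain_model(mappings: Iterable[Mapping]) -> Set[Tuple[str, str, str]]:
    """Drop the source names and keep only object types and mapping types."""
    return {(m.from_source.object_type, m.mapping_type, m.to_source.object_type)
            for m in mappings}


model = domain_model(source_mapping_model)
transcripts = source_mapping_model[0].targets(["ens-g1"])
```

In this sketch the domain model is simply the source mapping model with the source names projected away, which is one plausible reading of how the two models relate.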


Figure: Generating the semantic domain model



To process objects and mappings, we have devised a set of high-level operators. They can be used within script programs (workflows) to combine and analyze data from different sources. For instance, a script can identify and retrieve all chemokine-related genes of the NetAffx source, which can then be used to focus the analysis of microarray-based experiments on relevant genes. To form this gene group, the sources HUGO, SwissProt and GeneOntology are included in the query process to compensate for incomplete sources and mappings.
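
The chemokine example can be sketched as a small Python script. This is a rough illustration under stated assumptions, not the BioFuice script language: the operator names `query`, `traverse` and `compose` are hypothetical stand-ins, the route through the sources is one assumed possibility, and every id and term name is toy data.

```python
from typing import Dict, Iterable, Set

Mapping = Dict[str, Set[str]]  # object id -> ids of corresponding objects


def query(objects: Dict[str, str], keyword: str) -> Set[str]:
    """Select the objects whose name contains the keyword."""
    return {oid for oid, name in objects.items() if keyword in name.lower()}


def traverse(ids: Iterable[str], mapping: Mapping) -> Set[str]:
    """Follow a mapping from a set of objects to their target objects."""
    targets: Set[str] = set()
    for oid in ids:
        targets |= mapping.get(oid, set())
    return targets


def compose(first: Mapping, second: Mapping) -> Mapping:
    """Chain mappings A->B and B->C into a derived mapping A->C."""
    return {a: traverse(bs, second) for a, bs in first.items()}


# Toy data; none of these ids or names come from the real sources.
go_terms = {"go:1": "chemokine activity",
            "go:2": "chemokine receptor binding",
            "go:3": "DNA repair"}
go_to_swissprot = {"go:1": {"sp:A", "sp:B"}, "go:2": {"sp:C"}, "go:3": {"sp:D"}}
swissprot_to_hugo = {"sp:A": {"hugo:X"}, "sp:B": {"hugo:Y"},
                     "sp:D": {"hugo:W"}}  # sp:C has no link: incomplete mapping
hugo_to_netaffx = {"hugo:X": {"probe:1"}, "hugo:Y": {"probe:2", "probe:3"},
                   "hugo:W": {"probe:9"}}
go_to_netaffx = {"go:2": {"probe:4"}}     # sparse direct mapping

# Script: start from chemokine-related terms and reach NetAffx two ways.
terms = query(go_terms, "chemokine")
long_path = compose(compose(go_to_swissprot, swissprot_to_hugo), hugo_to_netaffx)
via_proteins = traverse(terms, long_path)
direct = traverse(terms, go_to_netaffx)

# Taking the union compensates for gaps in either route.
chemokine_genes = via_proteins | direct
```

Here the route through SwissProt and HUGO misses the gene behind `go:2` because one link is absent, while the direct mapping alone would miss everything else; combining both paths yields the more complete gene group.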

The key aspects of the BioFuice integration approach are:


  • Peer-to-peer-like integration of different sources: BioFuice extends the iFuice approach (1) to interconnect local data, such as a personal gene list, with publicly available data. In addition, applications that produce structured output data from a given input can also act as sources to be integrated. BioFuice uses generic mapping services, such as SQL, XML, and Java programs, to access sources and retrieve data. On the one hand, this makes it possible to integrate data of different formats. On the other hand, the use of well-established query languages, such as SQL, XPath, and XQuery, greatly reduces the need for resource-consuming wrapper implementations. New sources can be integrated easily and quickly as needed by creating at least one mapping that connects the new source with an existing one.
  • Semantic integration: Modelling domain-specific, semantically meaningful object types, such as 'Gene', 'Protein' or 'Interaction', and the mapping types that connect them makes semantic data integration possible.
  • Script-based integration: Applying high-level operators within a script abstracts the integration task from the implementation level. This allows reacting quickly to new application and analysis needs.
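
The idea of a generic SQL mapping service, and of attaching a new local source through a single mapping, can be sketched in Python with the standard `sqlite3` module. The helper names `sql_mapping` and `table_mapping`, the table layout and all ids are assumptions made for this illustration; they do not describe BioFuice's actual interfaces.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Set

MappingService = Callable[[Iterable[str]], Set[str]]


def sql_mapping(conn: sqlite3.Connection, sql: str) -> MappingService:
    """Wrap a parameterised SQL query as a mapping (hypothetical helper).

    The query takes one source id and returns one target id per row.
    """
    def run(ids: Iterable[str]) -> Set[str]:
        result: Set[str] = set()
        for oid in ids:
            result.update(row[0] for row in conn.execute(sql, (oid,)))
        return result
    return run


def table_mapping(table: Dict[str, Set[str]]) -> MappingService:
    """Wrap an in-memory correspondence table as a mapping."""
    def run(ids: Iterable[str]) -> Set[str]:
        result: Set[str] = set()
        for oid in ids:
            result |= table.get(oid, set())
        return result
    return run


# An existing source held in a relational database (toy data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_protein (gene_id TEXT, protein_id TEXT)")
conn.executemany("INSERT INTO gene_protein VALUES (?, ?)",
                 [("G1", "P1"), ("G1", "P2"), ("G2", "P3")])

gene_to_protein = sql_mapping(
    conn, "SELECT protein_id FROM gene_protein WHERE gene_id = ?")

# A new local source (a personal gene list) becomes usable through one
# mapping that connects it to the existing gene source.
my_list_to_gene = table_mapping({"my-1": {"G1"}, "my-2": {"G2"}})

proteins_for_my_list = gene_to_protein(my_list_to_gene(["my-1", "my-2"]))
```

Because both services share the same call signature, the database-backed mapping and the local one can be chained without a source-specific wrapper, which is the point the first bullet makes about generic mapping services.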



Currently, BioFuice integrates data from more than 20 public molecular biological annotation sources, such as Ensembl, Bind, NetAffx, HUGO and HomoloGene, as well as personal sources that result from different analyses. The integration approach is applied in various collaborative research projects, ranging from the analysis of microarray data (IZBI) and the analysis of protein interaction networks (MPI MIS) to the detection of non-coding RNAs and gene homologues (BioInf).


Publications:
Kirsten, T.; Rahm, E.:
BioFuice: Mapping-based data integration in bioinformatics.
Proc. of 3rd Int. Workshop on Data Integration in the Life Sciences (DILS), Hinxton, 2006
Rahm, E.; Thor, A.; Aumüller, D.; Do, H.-H.; Golovin, N.; Kirsten, T.:
iFuice - Information Fusion utilizing Instance Correspondences and Peer Mappings.
8th International Workshop on the Web & Databases (WebDB) in conjunction with SIGMOD 2005, Baltimore, 2005
Kirsten, T.; Rahm, E.:
BioFuice: A decentralized approach to integrate molecular biological annotation data.
3rd Research Festival, Leipzig, 2005
