IVOA Interest Group - Internal Draft

In this note we propose that the IVOA develop a "meta-standard" for the design of IVOA data models and derived products, and we describe an initial version of such a meta-standard together with an implementation of it. We assume that one main use of a data model is that instances of it ("objects") must be created, represented by some means, and stored for use by third-party users. We therefore assume a generic DATABASE that stores these instances and offers services for interacting with it. We assume two major modes of interaction: update and query. For the first, we suggest that external users send their DM instances, represented as XML, to the DATABASE. For the second, a user sends ADQL queries to the DATABASE to browse its contents. The Simulation Database designed in the Theory Interest Group was based on this approach, and its results are used for examples at various points in this note.

@@ TODO describe other usages, maybe concentrate on META-data models? @@

We propose that DM efforts should be built around a logical UML data model describing the application domain for a given proposed standard specification in full detail. This model is designed using a prescribed subset of the full UML modelling language, which we represent with a UML profile. TBD expand on profile.

For use in specific software contexts, physical models must be built based on this model. Relevant examples are XML schemas, relational database schemas and Java classes, but also human-readable documentation. The second main feature of the proposed meta-standard is that these physical models can be derived from the logical model using a predefined and fixed set of mapping rules. These rules must be decided on in the WG effort we propose here, but we provide a first attempt at them in this note. We also provide an implementation of these rules as a set of XSLT [REF] scripts that transform the logical model into the physical representations. Here we use the fact that the logical model is represented as XMI @@ TODO add REF@@, a standard XML format for representing UML models.
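The idea of deriving a physical model by applying fixed mapping rules can be sketched as follows. This is a minimal illustration only: it uses Python in place of the XSLT scripts described in this note, and the XML element names (`objectType`, `attribute`, `reference`), the datatype table and the `to_ddl` function are assumptions made for the example, not the intermediate representation or the rules actually proposed.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate representation of a logical model. The element
# and attribute names are illustrative, not the schema used in this note.
MODEL_XML = """
<model name="demo">
  <objectType name="Experiment">
    <attribute name="name" datatype="string"/>
    <attribute name="executionTime" datatype="datetime"/>
  </objectType>
  <objectType name="Snapshot">
    <attribute name="time" datatype="real"/>
    <reference name="experiment" target="Experiment"/>
  </objectType>
</model>
"""

# Assumed rule: every logical datatype maps to exactly one SQL column type.
SQL_TYPES = {
    "string": "VARCHAR(255)",
    "datetime": "TIMESTAMP",
    "real": "DOUBLE PRECISION",
}


def to_ddl(model_xml):
    """Apply fixed rules: one table per object type, a surrogate ID column
    for identity, one column per attribute, one foreign key per reference."""
    root = ET.fromstring(model_xml)
    statements = []
    for obj in root.findall("objectType"):
        columns = ["ID BIGINT PRIMARY KEY"]
        for attr in obj.findall("attribute"):
            sql_type = SQL_TYPES[attr.get("datatype")]
            columns.append(f"{attr.get('name')} {sql_type}")
        for ref in obj.findall("reference"):
            columns.append(
                f"{ref.get('name')}Id BIGINT REFERENCES {ref.get('target')}(ID)"
            )
        body = ",\n  ".join(columns)
        statements.append(f"CREATE TABLE {obj.get('name')} (\n  {body}\n);")
    return statements


ddl = to_ddl(MODEL_XML)
print("\n\n".join(ddl))
```

Because the rules are fixed, the only input that changes between data models is the logical model itself; the same function could in principle be paired with sibling functions that emit XML schema or documentation from the same input.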

The main results of this note are

A UML profile for creating logical data models (including an intermediate XML representation to buffer against XMI changes).
Rules for mapping data models created according to the profile to
- XML schema
- Relational Database Schema (plus related TAP metadata representations)
- UTYPE-s
- Java classes annotated with JPA and JAXB attributes for translating between XML and relational database representation
- A standard HTML documentation format
- JSP pages/TAG pages (LAUREN?) for browsing a DATABASE created with these components.
A reference implementation of these rules using an XSLT pipeline.

We feel that the results presented in this note are sufficiently far evolved that the DM WG might consider starting a project to work this out to a standard it proposes for DM efforts in the future.

Acknowledgments

We thank the many people who contributed useful discussions in the course of this work, in particular Jeremy Blaizot, who invited GL and LB to Lyon to work on his Horizon project, where this project was conceived. We also thank the "tiger team" working on the SNAP and later SimDB project (Rick Wagner, Herve Wozniak, Patrizia Manzato and Mireille Louys) and the participants of the SNAP workshop in Garching, April 2007.

1. Summary

We propose a "meta-standard" for IVOA data modelling efforts whose goal is to define meta-models describing astronomical data products and related resources. We propose that the DM WG define a project to evaluate this proposal and, if it is of interest, to work out the details. So far this proposal has been followed by the Theory Interest Group in its Simulation Database (SimDB) effort, where it has been shown to speed up development and produce standardised results very efficiently.

We assume that data models are created with the goal of defining structured representations according to which information about one's resources must be made available to the VO. This includes storing such information in a database, relational or otherwise, from which it can be queried; manipulating such information in code; representing it in human- and/or machine-readable form for communication; and any combination of these. These different usages must have in common that a given information item is identifiable irrespective of its representation. This implies that a common underlying model must be available, from which the physical representations can be derived.

This realisation (which is by no means new!) forms the core of our proposal. We propose that the core task of a data modelling effort should be to define this so-called logical model. The task of deriving the physical representations should be no more than applying a standardised set of mapping rules. To make this work requires two components, which we propose to the DM WG:

2. Data modelling

2.1 Analysis models

2.1.1 Universe of Discourse

2.1.2 Domain Model for Astronomy

2.2 Logical Models

2.3 Physical Models

2.4 Representations as views

3. UML Profile

4. Mapping rules

4.1 Identity and Referencing

The main elements in our profile are the classes (i.e. object types); these embody the core concepts that we model. We follow standard object-oriented design approaches (see [10]), in which object types are assumed to have an explicit identity. Two objects (i.e. instances of an object type) can have the same values for all fields, but if their identities differ they are not the same object. Objects can be referenced by stating their identity (in whatever form this comes). In contrast, instances of value types are assumed to be identical if their value (or values, in the case of structured value types) is the same. In our UML model we do not define an explicit identifier attribute on each object type to represent its identity; its existence is assumed and its representation is left to the mapping to the physical model.
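The distinction between object types and value types can be illustrated with a small sketch. The class names `Experiment` and `Quantity` and their fields are chosen for illustration only and are not taken from the model. The sketch relies on the fact that a plain Python class compares by identity by default, while a dataclass compares by field values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Structured value type: equal whenever all its values are equal."""
    value: float
    unit: str


class Experiment:
    """Object type: its identity is implicit, not a modelled attribute."""

    def __init__(self, name, duration):
        self.name = name
        self.duration = duration

    def same_fields(self, other):
        # Compares attribute values only, ignoring identity.
        return (self.name, self.duration) == (other.name, other.duration)


a = Experiment("run-1", Quantity(3.5, "Gyr"))
b = Experiment("run-1", Quantity(3.5, "Gyr"))

print(a.same_fields(b))          # all field values agree
print(a == b)                    # default equality for objects is identity
print(a.duration == b.duration)  # value types compare by value
```

Here `a` and `b` agree in every field yet remain distinct objects, whereas the two `Quantity` instances are interchangeable. How the implicit identity is represented (a primary key, an XML ID, a pointer) is exactly what the mapping to a physical model has to decide.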

Related to this issue is that we need to be able to represent reference relations between different objects. Most contexts provide a natural mapping for references. For example relational databases have the concept of foreign keys, XML documents allow references using ID/IDREF and other mechanisms for references to entities in the same document, Java uses pointers (implicitly) to objects in the same virtual machine. Problems arise when we need to leave the local contexts: references to resources not in the current database, or in another XML document.

It is easy to imagine cases where this may occur. For example when registering a simulation run with the open source Gadget [12] simulation code, one needs to have a reference to the corresponding Gadget SimDB/Simulator. Unless one registers the experiment in the same SimDB where Gadget is registered, one needs to use a reference across SimDB-s. One obvious way is to map all references to globally unique identifiers, possibly using URIs or IVOA Identifiers [11]. The size of such URI-s makes this a rather expensive storage mechanism for use in a relational database, certainly compared to simple integer (or bigint) columns.
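The storage argument can be made concrete with a rough back-of-the-envelope sketch. The identifier below is a made-up example in the general style of an IVOA identifier and does not refer to a registered resource; the row count is likewise an arbitrary assumption, and real database engines add their own per-value overhead that is ignored here.

```python
import struct

# Hypothetical global reference, for illustration only.
global_ref = "ivo://example.org/simdb#simulator/42"

# Size of the reference when stored as UTF-8 text.
uri_bytes = len(global_ref.encode("utf-8"))

# Size of a 64-bit signed integer, as used for a bigint key column.
bigint_bytes = struct.calcsize("<q")

# Assumed number of rows that each hold one reference.
n_rows = 1_000_000
uri_total = uri_bytes * n_rows
bigint_total = bigint_bytes * n_rows

print(f"URI reference:    {uri_bytes} bytes per row, {uri_total} bytes in total")
print(f"bigint reference: {bigint_bytes} bytes per row, {bigint_total} bytes in total")
print(f"ratio: {uri_bytes / bigint_bytes:.1f}x")
```

The ratio grows with the length of the identifier, which is why a mapping that uses global identifiers for every reference is costly compared with integer keys for references that stay within one database.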

4.2 RDBMS Schema

4.3 XML Schema

4.4 UTYPE-s

The Spectrum data model was the first to add explicit UTYPE-s for each of the attributes in its model, and the Characterisation data model has followed that example. As long as the precise usage of UTYPE-s and the relation of their syntax to the underlying data model are not defined, we will follow these examples by assigning UTYPE-s explicitly to all elements in the model. However, we will follow a fixed set of rules to make this assignment and implement these in XSLT. If a similar approach is at some time accepted within the IVOA, possibly in an alternative form, it will be straightforward to adjust our definitions. The important point we want to make is that it is possible to define simple rules that automatically produce the UTYPE-s for a given data model, i.e. the only discussion required is on the rules for doing so.

Our assumption is that a UTYPE should be able to represent any element in the data model uniquely, and in a manner that is also easily interpreted. For now we assume that we need to point to those elements that can be stored in a column in a VOTable, i.e. we are looking for "simple" elements. We can use our relational mapping to identify all these features. They are

Of course we could give each of the elements a uniquely generated identifier, but we assume that UTYPE-s should hold semantic information, otherwise we could use the XMI-ids generated by the UML modelling tool. To identify any of these elements uniquely within the context of the IVOA, we then need the following components:

One could argue that each of the elements could also be given a nice, unique name, but to find out what the actual element is in the model and in other representations one would still need to perform a look-up. Such a unique name would likely include some of the elements above anyhow. We therefore believe this would be a waste of effort and instead propose a simple convention for deriving the UTYPE-s from the model based on this hierarchy. We have done so using these rules (in BNF-like notation)

4.5 Java/JPA+JAXB (non normative)

5. Usage scenarios: METADATABASE

5.1 ADQL + TAP

5.2 REST

Appendix A: Data modelling specifics

As a first step in our simulation pipeline we generate an XML document that represents the data model in a form that is more easily interpreted than the XMI representation, both by human readers and by XSLT scripts. This document is itself structured according to an XML schema that represents the UML profile rather directly, and which we briefly describe here.

We illustrate our UML profile using an example data model derived from the SimDB/DM, shown in the following diagram:

We now describe the individual elements. Some of these are standard; others are domain-specific extensions that follow the standard UML profile mechanism of stereotype extension elements and associated tag definitions.


Data Modelling Pipeline
Version 0.1

IVOA Theory Interest Group
Internal Draft 2008 May 15

Abstract

Status of this Document