In this note we propose that the IVOA develop a "meta-standard" for the design of IVOA data models and derived products. We describe an initial version of such a meta-standard and an implementation of it. We assume that one main usage of a data model is that instances of it ("objects") must be created, represented by some means, and stored for use by third-party users. We therefore assume a generic DATABASE that stores these instances and offers services for interacting with it. We assume two major modes of interaction: update and query. The first, we suggest, will include external users sending their DM instances, represented as XML, to the DATABASE. The second will include allowing a user to send ADQL queries to the DATABASE to browse its contents. The Simulation Database designed in the Theory Interest Group was based on this approach, and its results will be used as examples from time to time.
@@ TODO describe other usages, maybe concentrate on META-data models? @@ We propose that DM efforts should be built around a logical UML data model describing the application domain for a given proposed standard specification in full detail. This model is designed using a prescribed subset of the full UML modelling language, which we represent with a UML profile. TBD expand on profile.
For use in specific software contexts, physical models must be built based on this model. Relevant examples are XML schemas, relational database schemas and Java classes, but also human-readable documentation. The second main feature of the proposed meta-standard is that these can be derived from the logical model using a predefined and fixed set of mapping rules. These rules must be decided on in the WG effort we propose here, but we provide a first attempt at them and present it in this note. We also provide an implementation of these rules as a set of XSLT [REF] scripts that transform the logical model into the physical representations. Here we use the fact that the logical model is represented as XMI @@ TODO add REF @@, a standard XML format for representing UML models.
The main results of this note are
This is an IVOA Note expressing suggestions from and opinions of the authors.
It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory.
It should not be referenced or otherwise interpreted as a standard specification.
We thank various persons for useful discussions in the course of this work, in particular Jeremy Blaizot, for inviting GL and LB to Lyon to work on his Horizon project, where this project was conceived. We thank the "tiger team" working on the SNAP and later SimDB project: Rick Wagner, Herve Wozniak, Patrizia Manzato and Mireille Louys. We also thank the participants of the SNAP workshop in Garching, April 2007.
We propose a "meta-standard" for IVOA data modelling efforts whose goal is to define meta-models describing astronomical data products and related resources. We propose that the DM WG define a project to evaluate this proposal and, if it is of interest, work out the details. So far this proposal has been followed by the Theory Interest Group in its Simulation Database (SimDB) effort, where it has been shown to speed up development and produce standardised results very efficiently.
We assume that data models are created with the goal of defining structured representations according to which information about one's resources must be made available to the VO. This includes storing such information in a database, relational or otherwise, from which it can be queried; manipulating such information in code; representing it in human- and/or machine-readable form for communication; and any combination of these. These different usages must have in common that a given information item is identifiable irrespective of its representation. This implies that a common underlying model must be available, from which the physical representations can be derived.
This realisation (which is by no means new!) forms the core of our proposal. We propose that the core task of a data modelling effort should be to define this so-called logical model. The task of deriving the physical representations should be no more than applying a standardised set of mapping rules. Making this work requires two components, which we propose to the DM WG:
@@TODO Gerard @@ An analysis model, also called domain model, is an abstract, high-level representation of the universe of discourse (UoD), the part of the world that our application deals with. It is a UML model, with emphasis on the concepts and their exact relationships in the UoD, though details such as attributes need not be completely filled in. Importantly, it should not be influenced by application scenarios apart from knowledge of their UoD. Here we describe the UoD and our analysis model. The model is strongly influenced by patterns discovered in earlier work on a Domain model for Astronomy, co-written by one of the authors of the present note. We describe some of its main patterns below as well. @@ TODO or will we? @@
A physical model is (see @@TODO reference to some standard reference on data modelling@@) a representation of the logical model that is adapted to a particular software environment. We propose that physical models are derived, using mapping rules, from a logical model described according to the profile. The XMI representation of the logical model is particularly useful, as it allows us to implement these rules in XSLT.
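To make the idea of a fixed mapping rule concrete, the following is a minimal sketch in Python (not the XSLT implementation described in this note). It assumes a drastically simplified logical model with only object types and primitive attributes. The class names, the type table `SQL_TYPES` and the rule "one table per object type with an implicit ID column" are illustrative assumptions, not the rules of the proposal.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for logical model elements. A real
# profile would also cover packages, references, collections and inheritance.
@dataclass
class Attribute:
    name: str
    datatype: str  # must be a key of SQL_TYPES below

@dataclass
class ObjectType:
    name: str
    attributes: list = field(default_factory=list)

# Assumed mapping rule: logical primitive type -> SQL column type.
SQL_TYPES = {
    "string": "VARCHAR(255)",
    "integer": "INTEGER",
    "real": "DOUBLE PRECISION",
}

def to_ddl(obj: ObjectType) -> str:
    """Apply one fixed rule set: a table per object type, an implicit ID
    primary key for the object identity, and a column per attribute."""
    columns = ["ID BIGINT PRIMARY KEY"]
    columns += [f"{a.name} {SQL_TYPES[a.datatype]}" for a in obj.attributes]
    return f"CREATE TABLE {obj.name} (\n  " + ",\n  ".join(columns) + "\n);"

# Illustrative object type; the names are assumptions.
simulator = ObjectType(
    "Simulator",
    [Attribute("name", "string"), Attribute("version", "string")],
)
ddl = to_ddl(simulator)
print(ddl)
```

Because the rule set is a pure function of the logical model, a second function applying a different rule set to the same `ObjectType` could emit, say, an XML schema fragment, and the two outputs would stay consistent by construction.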
We present mapping rules for the following physical models:
The main elements in our profile are the classes (= object types); these embody the core concepts that we model. We follow standard object-oriented design approaches (see [10]), in which object types are assumed to have an explicit identity. Two objects (i.e. instances of an object type) can have the same values for all fields, but if their identity is not the same they are not the same object. Objects can be referenced by stating their identity (in whatever form this comes). In contrast, value types are assumed to be identical if their value (or values, in the case of structured value types) is the same. In our UML model we do not define an explicit identifier attribute on each object type to represent its identity; its existence is assumed and its representation is left to the mapping to the physical model.
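The distinction between object types and value types can be illustrated with a small Python sketch. The class names and the integer counter used as identity are assumptions made for illustration only. As stated above, the actual representation of identity is left to each physical mapping.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical identity source; in practice a database or another physical
# mapping would supply the identity.
_next_id = count(1)

@dataclass(frozen=True)
class Quantity:
    """Value type: two instances are identical if their values are."""
    value: float
    unit: str

@dataclass(eq=False)
class Simulator:
    """Object type: two instances are identical only if their identity is."""
    name: str

    def __post_init__(self):
        self.identity = next(_next_id)

    def __eq__(self, other):
        return isinstance(other, Simulator) and self.identity == other.identity

    def __hash__(self):
        return hash(self.identity)

# Same field values, different identity: not the same object.
first = Simulator("Gadget")
second = Simulator("Gadget")
print(first == second)

# Same values: the same value.
print(Quantity(1.0, "Mpc") == Quantity(1.0, "Mpc"))
```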
Related to this issue is that we need to be able to represent reference relations between different objects. Most contexts provide a natural mapping for references. For example relational databases have the concept of foreign keys, XML documents allow references using ID/IDREF and other mechanisms for references to entities in the same document, Java uses pointers (implicitly) to objects in the same virtual machine. Problems arise when we need to leave the local contexts: references to resources not in the current database, or in another XML document.
It is easy to imagine cases where this may occur. For example when registering a simulation run with the open source Gadget [12] simulation code, one needs to have a reference to the corresponding Gadget SimDB/Simulator. Unless one registers the experiment in the same SimDB where Gadget is registered, one needs to use a reference across SimDB-s. One obvious way is to map all references to globally unique identifiers, possibly using URIs or IVOA Identifiers [11]. The size of such URI-s makes this a rather expensive storage mechanism for use in a relational database, certainly compared to simple integer (or bigint) columns.
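Purely to make the trade-off concrete, the sketch below uses an in-memory SQLite database in which a reference is stored either as a compact local foreign key or as a textual global identifier. The table and column names and the `ivo://example.org/...` identifier are hypothetical, and this arrangement is an illustration of the problem, not a resolution adopted by this note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Simulator (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Experiment (
    ID INTEGER PRIMARY KEY,
    name TEXT,
    simulatorId INTEGER REFERENCES Simulator(ID),  -- compact local reference
    simulatorUri TEXT                              -- bulky cross-database reference
);
""")

# One experiment refers to a simulator in the same database, the other to a
# simulator registered elsewhere (hypothetical identifier).
conn.execute("INSERT INTO Simulator VALUES (1, 'Gadget')")
conn.execute("INSERT INTO Experiment VALUES (1, 'run-local', 1, NULL)")
conn.execute(
    "INSERT INTO Experiment VALUES (2, 'run-remote', NULL, ?)",
    ("ivo://example.org/simdb#Gadget",),
)

def simulator_reference(experiment_id):
    """Return ('local', integer key) or ('remote', global identifier)."""
    local_id, uri = conn.execute(
        "SELECT simulatorId, simulatorUri FROM Experiment WHERE ID = ?",
        (experiment_id,),
    ).fetchone()
    return ("local", local_id) if local_id is not None else ("remote", uri)

print(simulator_reference(1))
print(simulator_reference(2))
```

Only the local reference can be checked by the database itself (the foreign key constraint). The remote one is an opaque string that some external service would have to resolve.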
This issue is not yet resolved satisfactorily. The following possible approaches offer themselves and need discussion:

The DM WG has mandated (IVOA interoperability meeting, Cambridge, UK, May 2003) that each data model should come with an XML schema that represents valid XML serialisations of the data model. We foresee that this representation can be used to communicate instances of SimDB/Resource-s as XML documents. Such communication can serve to register new SimDB/Resources in a SimDB, or be used in messages to communicate instances of the SimDB Resource type. Here we briefly describe some of the rules for deriving an XML schema from our logical model.
It is generally the case that contents of databases may be represented in ways that do not conform to one of the standard serialisations. Nothing prevents services from being developed on top of SimDB that represent SimDB/Resource-s, or even fragments of these, in another form. The standard example would be VOTables storing the results of a generic ADQL query on the SimDB/RDB representation. VOTable first introduced the option to have a UTYPE attribute in FIELD definition tags that stores a pointer to the element in a data model that the column represents.
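As a small illustration of how a column can point back into a data model, the following Python sketch builds a VOTable FIELD element carrying a `utype` attribute. The column name and the UTYPE string are placeholders chosen for the example, not values defined by SimDB.

```python
import xml.etree.ElementTree as ET

def field_element(name, datatype, utype, arraysize=None):
    """Build a VOTable FIELD element whose utype attribute points to the
    data model element that the column represents."""
    attributes = {"name": name, "datatype": datatype, "utype": utype}
    if arraysize is not None:
        attributes["arraysize"] = arraysize
    return ET.Element("FIELD", attributes)

# Placeholder column name and UTYPE value, for illustration only.
field = field_element(
    "simulator_name", "char", "SimDB:simulator/Simulator.name", arraysize="*"
)
xml_text = ET.tostring(field, encoding="unicode")
print(xml_text)
```

A client receiving such a table can use the `utype` value, rather than the column name chosen in the query, to decide which model element each column holds.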
The Spectrum data model was the first to add explicit UTYPE-s for each of the attributes in its model, and the Characterisation data model has followed that example. As long as the precise usage of UTYPE-s and their relation to the syntax of the underlying data model is not defined, we will follow these examples by assigning UTYPE-s explicitly to all elements in the model. However, we will follow a fixed set of rules to make this assignment and implement these in XSLT. If a similar approach is at some time accepted within the IVOA, possibly in an alternative form, it will be straightforward to adjust our definitions. The important point we want to make is that it is possible to define simple rules that will then automatically produce the UTYPE-s for a given data model, i.e. the only discussion that is required is on the rules for doing so.
Our assumption is that the UTYPE should be able to uniquely represent any element in the data model, in a manner that is also easily interpreted. For now we assume that we need to point to those elements that can be stored in a column in a VOTable, i.e. for now we are looking for "simple" elements. We can use our relational mapping to identify all these features; they are
Of course we could give each of the elements a uniquely generated identifier, but we assume that UTYPE-s should hold semantic information, otherwise we could use the XMI-ids generated by the UML modelling tool. To identify any of these elements uniquely within the context of the IVOA, we then need the following components:
One could argue that one could also give nice, unique names to each of the elements, but to find out what the actual element is in the model and in other representations one would still need to perform a look-up. Such a unique name would likely include some of the elements above anyhow. We therefore believe it would be a waste of effort to do so, and instead propose a simple convention for deriving the UTYPE-s from the model based on this hierarchy. We have done so using these rules (in BNF-like notation)
<model-name> ":" <package-name>[ "/" <package-name>]* "/" <objecttype-name> "." <attribute-name> [ "." <attribute-name>]*
<model-name> ":" <package-name>[ "/" <package-name>]* "/" <objecttype-name> "." <reference-name>
<model-name> ":" <package-name>[ "/" <package-name>]* "/" <objecttype-name> "." <collection-name>
<model-name> ":" <package-name>[ "/" <package-name>]* "/" <objecttype-name> "." "CONTAINER"
<model-name> ":" <package-name>[ "/" <package-name>]* "/" <objecttype-name> "." "ID"
<model-name> ":" <package-name>[ "/" <package-name>]* "/" <objecttype-name> "." "DTYPE"
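The rules above are simple enough to be applied mechanically. As a minimal sketch in Python (the note's own implementation is in XSLT), the function below assembles a UTYPE from its components. The model, package, object type and attribute names in the usage lines are hypothetical.

```python
def utype(model, packages, object_type, *path):
    """Assemble a UTYPE following the BNF-like rules:
    <model> ":" <package> ["/" <package>]* "/" <objecttype> "." <name> ["." <name>]*

    `path` holds one reference or collection name, one of the literals
    "CONTAINER", "ID" or "DTYPE", or one or more attribute names (more than
    one when descending into a structured value type).
    """
    if not packages:
        raise ValueError("at least one package name is required")
    if not path:
        raise ValueError("an attribute, reference, collection or literal is required")
    return f"{model}:{'/'.join(packages)}/{object_type}.{'.'.join(path)}"

# Hypothetical names, for illustration only.
print(utype("SimDB", ["simulator"], "Simulator", "name"))
print(utype("SimDB", ["resource", "experiment"], "Experiment", "ID"))
print(utype("SimDB", ["simulator"], "Parameter", "range", "min"))
```

Because the result depends only on the position of an element in the model hierarchy, the same strings can be regenerated at any time from the logical model and need not be maintained by hand.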
The previous chapter has defined a number of physical representations of the logical simulation data model. Using these we can implement a database that can store instances of SimDB/Resources. This could be done using an XML database, or using a relational database management system such as Postgres, MySQL or any of the commercial products. The data model is rather complex, and more hierarchical than most other data models so far defined in the IVOA. Querying such a data model requires a rich query language, and we propose to use ADQL working on the relational representation. ADQL produces tabular results, whose structure is completely governed by the query itself. We also assume that it is possible, once appropriate information is available, to retrieve complete SimDB/Resource-s as XML documents, and we propose a simple REST-like query interface for that. Such an XML-based interface will likely also be used to upload new resources to SimDB implementations that support that functionality.
We expect no problems in formulating ADQL queries based on the relational representation of the data model described in the previous chapter. We do, however, need an appropriate protocol for sending these queries to a SimDB service. In DAL, work has started on the Table Access Protocol (TAP), and clearly some version of that seems to be applicable to our situation. However, there are some simplifying features. Foremost is that we pre-define the relational schema, so a generic TAP "getMetadata" service seems unnecessary. There are likely going to be other standard DAL service features that we need to support (getCapabilities?), but as metadata databases are expected to be relatively small we may again not require the full richness of asynchronous querying, staging, VOSpace and so on.
Issues that need discussion:

Under this heading we mean a protocol whereby data products can be retrieved through HTTP GET requests. Possibly they can also be POST-ed or PUT. This needs to be discussed further, but can perhaps be deferred to a future release. A GET will only ever retrieve a complete SimDB resource, serialised to SimDB/XML, similar to the IVOA Resource Registry interface @@ TODO is this actually a correct statement? @@.
UML allows communities to create a domain-specific modelling language through its Profiling capabilities @@ TODO is this the proper term? @@. We have an initial implementation of a UML profile as created by MagicDraw available under this link. Here we list the main elements and give a short motivation for their inclusion in the model. It is our opinion that the DM working group should ultimately be responsible for a profile such as this, defining a domain-specific language for all IVOA data modelling efforts.
As the first step in our generation pipeline we generate an XML document that represents the data model in a form that is more easily interpreted, both by human readers and by XSLT scripts, than the XMI representation. This document is itself structured according to an XML schema that represents the UML profile rather directly and that we briefly describe here.
This schema is located at http://volute.googlecode.com//svn/trunk/projects/theory/snapdm/input/intermediateModel.xsd. We introduce our own XML format, defined by the XML schema in intermediateModel.xsd, for representing the logical model. For the time being we call this the intermediate representation. The first step in the generation pipeline is a translation of the XMI to an XML document following this format. This transformation is implemented in the xmi2intermediate.xsl XSLT script. The latest version of the intermediate representation for the SimDB data model can be found in this location. All other generation scripts work on this intermediate representation, not on the XMI document. One reason why this is useful is that different tools may produce different versions or different dialects of XMI; such variations can be supported by an appropriately adjusted XSLT script for this first step alone. Another reason for this representation is that XMI is a rather complex representation of a UML model. Since we are using a rather restricted profile we do not need this generality, which allows us to represent the model using XML documents that are much easier to handle with XSLT.
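The two-stage structure of the pipeline can be sketched as follows. This is a Python stand-in for the first stage only. The actual transformation is the xmi2intermediate.xsl script, and the output element names used here (`model`, `objectType`, `attribute`) are assumptions that are not taken from intermediateModel.xsd. The input is a toy XMI-like fragment rather than real tool output.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://schema.omg.org/spec/XMI/2.1"

# Toy XMI-like input; real tool-generated XMI is considerably more verbose.
TOY_XMI = f"""<xmi:XMI xmlns:xmi="{XMI_NS}" xmlns:uml="http://schema.omg.org/spec/UML/2.1">
  <uml:Model name="Example">
    <packagedElement xmi:type="uml:Class" name="Simulator">
      <ownedAttribute name="name"/>
      <ownedAttribute name="version"/>
    </packagedElement>
  </uml:Model>
</xmi:XMI>"""

def xmi_to_intermediate(xmi_text):
    """Stage 1: reduce the XMI to a small intermediate document.
    Only this function needs to know about the XMI dialect."""
    root = ET.fromstring(xmi_text)
    model = ET.Element("model")
    for element in root.iter("packagedElement"):
        if element.get(f"{{{XMI_NS}}}type") != "uml:Class":
            continue
        object_type = ET.SubElement(model, "objectType", name=element.get("name"))
        for attribute in element.findall("ownedAttribute"):
            ET.SubElement(object_type, "attribute", name=attribute.get("name"))
    return model

def object_type_names(intermediate):
    """Stage 2 (placeholder): a generator that sees only the intermediate form."""
    return [o.get("name") for o in intermediate.findall("objectType")]

intermediate = xmi_to_intermediate(TOY_XMI)
print(ET.tostring(intermediate, encoding="unicode"))
print(object_type_names(intermediate))
```

Supporting another XMI dialect would then mean replacing `xmi_to_intermediate` alone. Every later stage, represented here by `object_type_names`, is unaffected.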
We illustrate our UML profile using an example data model derived from the SimDB/DM, shown in the following diagram:
We now describe the individual elements.
Some of these are standard; others are domain-specific extensions following standard UML profile stereotype extension elements and associated tag definitions.
<<ontologyterm>> stereotype. Attributes with this stereotype are assumed to take their values from such a predefined "ontology".
[3] Martin Fowler, Analysis Patterns, 1997, Addison Wesley.
http://
[4] Lemson & Colberg, Theory in the virtual observatory
http://
[5] ???, Characterisation DM
http://
[6] @@ TODO @@references on global-as-view and information integration
http://
[7] @@ TODO @@reference to VisIVO
http://
[8] @@ TODO @@reference to Spectrum data model
http://
[9] Some links to pages on data model normalisation
http://www.datamodel.org/NormalizationRules.html
http://en.wikipedia.org/wiki/Database_normalization
[10] Some data model references
http://www.agiledata.org/essays/dataModeling101.html
Meyer, B. Object Oriented Software Construction, 2nd edition, Prentice Hall, 1997
On object identity: http://en.wikipedia.org/wiki/Identity_(object-oriented_programming)