IVOA

Simulation Data Access Protocol (SimDAP)
Draft

IVOA Note July 2008

This version:
http://www.ivoa.net/Documents/...
Latest version:
http://www.ivoa.net/Documents/latest/...
Previous versions:
http://www.ivoa.net/Documents/...
http://www.ivoa.net/Documents/...
Interest Group:
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaTheory
Author(s):
Claudio Gheller
Gerard Lemson
Rick Wagner

Abstract

This specification defines a protocol for retrieving data produced by numerical simulations from a variety of data repositories through a uniform interface. The interface is meant to be reasonably simple for service providers to implement. Data are selected through a search procedure. Once the data of interest are identified, specific quantities can be selected and sub-samples can be extracted and downloaded. Data are returned in a simulation-specific VOTable format, with support for external binary files.

Status of this Document

This is a Note. The first release of this document was 18 May 2008.

This is an IVOA Note expressing suggestions from and opinions of the authors.
It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.

A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.

Acknowledgments

We thank Ugo Becciani, Laurent Bourgès, Patrizia Manzato and Hervé Wozniak for discussions and feedback on the topic.

1 Introduction

This specification defines a prototype standard for accessing theoretical data from a variety of astrophysical simulation repositories: the Simulation Data Access Protocol (hereafter SimDAP). In this context, theoretical data is defined as the outcome of different kinds of numerical applications, such as dynamical simulations, semi-analytical models, Monte Carlo simulations, etc.

SimDAP deals with datasets that can always be represented as (large or huge) tables in which rows identify a simulated element (a mesh cell, a particle, a pixel...) and columns represent the associated physical parameters (the 3D spatial coordinates, the velocity, the temperature...). Datasets can represent different timesteps (that is, evolutionary configurations) of the same simulated system. In the rest of the document, we refer to these datasets as snapshots of a numerical application. Snapshots are the data sources. No further assumptions are made about the data. Data are described and can be searched by means of the SimDM theoretical data model (Lemson et al., 2008).

The simplest access mode is the download of a data file in a standard format. In general, however, the data are so large that direct download is unfeasible. The SimDAP protocol describes a standard interface to services that let the user reduce the data volume to be moved over the network (e.g. by focusing on a suitable subsample of the data), making its download practical. The protocol also defines the interface to preview services, which let the user choose between different datasets and set the parameters that reduce the data volume.

In operation, SimDAP represents a negotiation between the client and the data service, which allows the user to preview data and to select and retrieve specific subsets. The retrieval of the complete dataset can be considered a degenerate selection of the whole volume and all the parameters. Note, however, that even this case does not lead to a simple download of the original data file, since data are delivered in the standard TVO format described in section X.X.

The SimDAP protocol is designed primarily as a "data on demand" service, with datasets created on the fly by the service given the position and size of the desired output dataset as specified by the client. This is not a simple task, for various reasons. First, simulation data adopt specific units and coordinate systems, which depend on the nature of the problem, the characteristics of the algorithms and their implementation. Furthermore, simulation outputs can be represented by a wide variety of completely different data objects. For example, the output can consist of a set of particles in a given volume, where each particle has its physical position and a set of associated scalar and vector quantities, like velocity, mass density, temperature, etc. On the other hand, mesh-based simulations describe their data as discrete fields defined on a regular or adaptive mesh. The SimDAP protocol aims to provide a uniform description of the selection service that remains simple while covering as many different kinds of simulations and data as possible.

2 Requirements for Compliance

The keywords "MUST", "REQUIRED", "SHOULD", and "MAY" as used in this document are to be interpreted as described in RFC 2119 [34]. An implementation is compliant if it satisfies all the MUST or REQUIRED level requirements for the protocols it implements. An implementation that satisfies all the MUST or REQUIRED level and all the SHOULD level requirements for its protocols is said to be "unconditionally compliant"; one that satisfies all the MUST level requirements but not all the SHOULD level requirements for its protocols is said to be "conditionally compliant".

Compliance with this specification requires that a SimDAP service is maintained with the following characteristics:

.........................................


3 Service Types

Search and exploration of available data archives and collections is part of the SimDB protocol, presented in Lemson et al. (2008). The result of the search and exploration phase consists of a set of parameters (metadata) which describe each dataset, and an access reference to each specific snapshot (see section 4.2). SimDAP uses the access reference for the following services:

  • data download
  • data preview
  • data cutout
  • custom

Further services will be defined and supported in the future.

3.1 Download

The search and exploration phase, or another SimDAP service, finds or produces one or more data products, identified by their access references, which can be stored either in files or in databases. The SimDAP download service MUST allow the user to retrieve the data, selecting only the fields of interest (e.g. calculated physical quantities) among those listed as a result of the previous operations (search or SimDAP services). The data are delivered according to the Theoretical Data File Format (TDFF), described in section 8.

The SimDAP download service proceeds as follows:

  1. The user selects the fields of interest (possibly all available fields).
  2. Data are extracted and stored in one or more binary files according to the TDFF standard.
  3. An associated VOTable, following the TDFF standard, is created. The VOTable describes the content of the files according to the SimDB data model and contains their access references.
  4. The XML and binary files can be downloaded by the user by means of any appropriate, available protocol (http, ftp, grid-ftp...).

NOTE: anticipating Section 8, the TDFF standard accepts standard formats such as HDF5, FITS or NetCDF as its binary part (the outcome of step 2), provided that the format is declared to the user in the associated VOTable.

3.2 Preview

In order to let the user explore the datasets and decide which snapshots are of interest, the service SHOULD provide tools to preview the available datasets. Such functionality MUST be implemented if the cutout service is supported (in order to select a sub-region by specifying its size and position, see section 3.3).

    This service could require the availability of a "simplified" (but meaningful) version of the data, namely a thumbnail, that is easy to download and handle. Alternatively, the service could offer the user a pre-defined view of one or more snapshots. The preview and thumbnail features depend on the data, and their implementation is up to the data provider. For example, they could be represented by:

    • projections of the computational box along the three coordinate directions (images),
    • a random or decimated sample of the dataset (in particular for point-like data),
    • a reduced-resolution realization of the dataset (e.g. averages over neighboring cells of a computational mesh),
    • a "clever" selection of regions according to specific criteria (e.g. "overdense" regions), implemented by suitable algorithms,
    • a data file in the TDFF format.

Fig 1. Examples of preview services. Orthogonal projection of the simulated box (left). Decimated points set from an N-Body simulation (right)
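
Two of the thumbnail strategies listed above, decimation of point-like data and reduced resolution on a mesh, can be sketched as follows. This is only an illustration: the function names, the fixed stride and the restriction to a single mesh axis are assumptions and are not part of the protocol.

```python
def decimate(rows, stride):
    """Keep every stride-th row of a point-like dataset (hypothetical helper)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return rows[::stride]


def block_average(cells, factor):
    """Average consecutive groups of `factor` cells along one mesh axis.

    A trailing group shorter than `factor` is averaged over the cells it has.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [
        sum(cells[i:i + factor]) / len(cells[i:i + factor])
        for i in range(0, len(cells), factor)
    ]
```

A random sample could replace the fixed stride where a regular decimation would bias the preview; that choice is left to the data provider.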

3.3 Cutout

The SimDAP service SHOULD support a rectangular cutout of the data in a generic N-dimensional (N-Dim) parameter space. This lets the user focus on a region of interest, extract the corresponding data and download the resulting file, strongly reducing data movement. The service extracts all the simulated elements for which some parameters have values in a given range. No assumption is made on the dimensionality of the problem (selection can be done on any number of parameters) or on the nature of the parameters (any parameter can be used in the selection). However, it is convenient to treat as a favoured case the 3D geometric selection, in which data are extracted according to their position in 3D space, so that spatial coordinates are used as cutout parameters. This case is particularly simple and intuitive, and it is common to a large number of applications. In the rest of the document we will refer to:

  • position, as an N-tuple of selection parameters which define a position in the phase space (e.g. the center of the 3D geometric box);
  • size, as the extension of the selected region (dependent on the position specification) in the N-Dim phase space (e.g. the sides of the 3D geometric box);
  • field, as a physical quantity resulting from the theoretical calculation and stored in a snapshot (e.g. the temperature or the velocity on a Cartesian mesh).

Fig 2. Selection and extraction of a sub volume from a snapshot of a cosmological N-Body simulation

The cutout region is selected using the preview tools introduced in section 3.2. The results of the cutout operation are stored in files according to the TDFF standard.
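
The rectangular selection defined by position and size can be sketched as below. The in-memory representation (a list of dictionaries), the function name and the inclusive treatment of the box edges are assumptions made for illustration; position is taken as the center of the box, as in the 3D example.

```python
def cutout(rows, geom, pos, size):
    """Return the rows whose geometry fields fall inside the selected box.

    rows: list of dicts, one per simulated element (hypothetical in-memory form)
    geom: names of the fields used as selection parameters
    pos:  center of the box, one value per geometry field
    size: full extension of the box, one value per geometry field
    """
    if not (len(geom) == len(pos) == len(size)):
        raise ValueError("geom, pos and size must have the same length")
    # Per-axis [low, high] limits derived from center and full size.
    bounds = [(p - s / 2.0, p + s / 2.0) for p, s in zip(pos, size)]
    return [
        row for row in rows
        if all(lo <= row[name] <= hi for name, (lo, hi) in zip(geom, bounds))
    ]
```

Because the selection parameters are plain field names, the same sketch applies to non-spatial cutouts, for example a range in temperature.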

3.4 Custom

Any further service which allows the user to get data from a distributed theoretical data archive is considered Custom. For such services, a complete description of the service must be available as a registry entry.

3.5 Future developments: GetCapability and Data Staging

Several further services appear to be important for SimDAP. They will be analyzed and included in the protocol as extensions of the first release, which accounts only for the critical and essential functionalities.

The GetCapability service will allow the data provider to publish the supported SimDAP services, and the user to query registries to find where the needed services are available.

By Data Staging we refer to the processing the server performs to retrieve or generate the requested simulation volumes and cache them in online storage for retrieval by a client. Staging can be necessary for large archives which must retrieve simulation data from hierarchical storage, or for services which can dynamically extract subvolumes, where it may take a substantial time (e.g. minutes or hours) to retrieve the data which falls inside the relevant region of the simulation box. Issuing a staging request for a set of simulation subvolumes (e.g. for a set of small cubes randomly placed in a simulation box) also permits large servers to optimize subvolume extraction, for example to take advantage of parallelization for large requests.

4 Simulation and Snapshot Selection

Search and exploration of available data archives and collections is described in detail in Lemson et al. (2008). The results are represented by a set of parameters (metadata) according to the SimDB data model summarized in section 4.1. SimDAP makes use of a few of these parameters for its services. Several of these are common to all SimDAP functionalities.

4.1 The Data Model Metadata

Gerard and Rick, please complete...

4.2 Common and Additional Parameters

According to the data model, two parameters are needed to identify a dataset: the Simulation ID, which is the unique identifier of the numerical experiment, and the Snapshot ID, which is the unique identifier of the dataset of a particular experiment/simulation. These two parameters, referred to as EXPERIMENT_ID and SNAP_ID in section 5, are required as input by any SimDAP service.

Two additional parameters are required by a few SimDAP operations: FIELDS and UNITS.

If the service supports the extraction of a subset of the available fields in a snapshot, the FIELDS parameter is used to specify the list of fields to select. The argument of FIELDS is a comma-separated list of physical quantities expressed according to the VO standards.

For instance, the snapshot of interest could contain five fields: mass density, temperature and the three components of the velocity. The user wants to download only the first two:

FIELDS = "mass_density,temperature"

The UNITS parameter MUST be available for all SimDAP operations. A UNITS metadata item is associated with each available physical quantity. It can represent either the unit in which that quantity is internally represented (i.e. at server side, in the data file), or a conversion factor to some "natural" unit. In the first case a standard representation of the units must be available. In the second, the "natural" unit must be specified (e.g. Mpc for cosmological simulations, a.u. for planetary science simulations). This choice is still under debate. A further issue is the treatment of special coordinate systems, like comoving or h^-1 coordinates.

The UNITS parameter is used exclusively at client side, to specify the units of available quantities, for the sake of presentation and to convert query parameters to server compatible units.
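
As an illustration of the client-side conversion, the sketch below assumes the second option discussed above, in which UNITS carries a conversion factor defined so that natural_value = server_value * factor. Both the function name and this convention are assumptions, since the choice is still open.

```python
def to_server_units(values, factor):
    """Convert user values in the "natural" unit to server-side units.

    Assumes `factor` is the UNITS conversion factor defined so that
    natural_value = server_value * factor (one of the two options discussed).
    """
    if factor == 0:
        raise ValueError("conversion factor must be non-zero")
    return [v / factor for v in values]
```

For a box stored in normalized coordinates with a factor of 100 (for instance a 100 Mpc box), a user position of 50 Mpc becomes 0.5 in server units.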

5 Specific Parameters

Each SimDAP service, implemented as a web method, requires specific input parameters. In the following sections, the parameters needed by standard services are presented. Any special (i.e. non-standard) parameter or custom service MUST be registered in order to be available and usable in the VO framework.

5.1 Download

The download service requires only the basic EXPERIMENT_ID and SNAP_ID parameters to work. The FIELDS parameter may be used if the corresponding field selection service is available (otherwise it is discarded). A missing or blank FIELDS parameter is interpreted as a request to download all available fields. If FIELDS requests unavailable quantities, the corresponding request is discarded.
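
The FIELDS rules can be sketched as follows. The function name is hypothetical, and the sketch assumes that "discarded" means the whole request is rejected rather than only the unavailable entries.

```python
def resolve_fields(fields_param, available):
    """Return the list of fields to deliver, or None if the request is discarded.

    fields_param: raw FIELDS value (None or blank means "all available fields")
    available:    field names offered by the snapshot, in their stored order
    """
    if fields_param is None or fields_param.strip() == "":
        return list(available)
    requested = [name.strip() for name in fields_param.split(",")]
    if any(name not in available for name in requested):
        return None  # unavailable quantity requested: discard the request
    return requested
```

With the five-field snapshot of section 4.2, FIELDS="mass_density,temperature" yields those two fields, while a blank FIELDS yields all five.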

5.2 Preview

The preview can be implemented in different ways, depending on the specific data being served. In all cases, if the service is supported, a getPreview method MUST be implemented. The input of this method is the basic pair EXPERIMENT_ID and SNAP_ID. The FIELDS parameter may be used to specify which fields to preview (if supported, otherwise it is discarded). A missing or blank FIELDS parameter is interpreted as a request to preview all available fields. If FIELDS requests unavailable quantities, the corresponding request is discarded.

If the cutout service is available, the preview service MUST provide instruments to select the fields of interest and the cutout region. This will result in setting the parameters presented in Section 5.3.

5.3 Cutout

The goal of the cutout service is to select and extract a sub-volume of data from a given snapshot. The operation refers to a single snapshot. Cutouts from multiple sources, such as various time steps of the same simulation, are not supported by the protocol. Their implementation is up to the client, for example as a sequence of requests with the same sub-box and fields but different datasets.

A getCutout web method MUST be implemented, accepting the following input parameters:

Dataset and fields identification:

The following input parameters are defined in 4.2:

The snapshot is identified by the SERVICE_ID, SOURCE_ID parameters.

Geometric variables:

The GEOM parameter lets the user select a few snapshot fields (e.g. the coordinates of the particles in an N-Body simulation) as those which define the sub-volume to be extracted (hereafter geometry fields). It is specified as a comma-separated list of strings.

Example: GEOM="xpos,ypos,zpos"

The cutout service supports a maximum number of geometry fields, namely the rank of the service. If GEOM specifies more fields than the rank, the exceeding fields (starting from the last one) are dropped. A blank GEOM is accepted only if the service supports an "intrinsic" geometry (e.g. a cubic regular mesh).
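
The rank rule can be sketched as below. The function name and the intrinsic_geometry flag are hypothetical; an empty list is used here to signal that the intrinsic geometry of the service applies.

```python
def parse_geom(geom_param, rank, intrinsic_geometry=False):
    """Split GEOM into field names and keep at most `rank` of them.

    Fields beyond the rank are dropped starting from the last one, which is
    equivalent to keeping the first `rank` names.
    """
    names = [n.strip() for n in geom_param.split(",") if n.strip()] if geom_param else []
    if not names:
        if intrinsic_geometry:
            return []  # the service falls back on its intrinsic geometry
        raise ValueError("blank GEOM requires a service with intrinsic geometry")
    return names[:rank]
```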

Units (UNITS parameter, see 4.2) are necessary for the geometry fields. Conversion of the following POS and SIZE parameters from user to server units must be performed at client side, BEFORE invoking the getCutout method.

Region of interest:

The sub-volume is set by specifying its center (POS) and size (SIZE).

The POS and SIZE parameters are converted client-side to the proper units using the corresponding UNITS parameter. For discrete fields, SIZE will be approximated so that it represents the smallest interval containing the requested one. POS and SIZE are returned to the user in the VOTable with their corrected values.
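
For a discrete field, the correction of POS and SIZE can be sketched for one axis of a uniform mesh. The function name, the uniform cell width and the mesh origin are assumptions; adaptive meshes would need a different treatment.

```python
import math


def snap_to_cells(pos, size, cell_width, origin=0.0):
    """Expand [pos - size/2, pos + size/2] to the enclosing whole cells.

    Returns the corrected (pos, size) for one axis of a uniform mesh.
    """
    lo = pos - size / 2.0
    hi = pos + size / 2.0
    # Index of the first cell edge at or below lo, and of the first at or above hi.
    first = math.floor((lo - origin) / cell_width)
    last = math.ceil((hi - origin) / cell_width)
    new_lo = origin + first * cell_width
    new_hi = origin + last * cell_width
    return (new_lo + new_hi) / 2.0, new_hi - new_lo
```

With a cell width of 0.25, a request centered on 0.5 with size 0.3 is enlarged to size 0.5, which is the smallest run of whole cells containing it.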

POS and SIZE are expressed as N-tuples of comma-separated values (embedded whitespace is not permitted). A NULL value represents the center of the complete computational box for POS and the whole computational box size for SIZE.

Example: POS="0.3,0.25,0.1", SIZE="0.5,NULL,0.5".

Notice that the number of elements in POS and SIZE MUST be the same and MUST be equal to the number of elements in GEOM.
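
A minimal sketch of how a service might parse and validate POS and SIZE is given below. The function name and the box_min/box_max inputs are hypothetical; a real service would take the box limits from its own metadata.

```python
def parse_region(pos_param, size_param, geom, box_min, box_max):
    """Parse POS and SIZE strings into numeric lists, resolving NULL entries.

    box_min/box_max: per-axis limits of the complete computational box
    (hypothetical inputs; a real service would know them from its metadata).
    """
    if any(c.isspace() for c in pos_param + size_param):
        raise ValueError("embedded whitespace is not permitted in POS or SIZE")
    pos_items = pos_param.split(",")
    size_items = size_param.split(",")
    if not (len(pos_items) == len(size_items) == len(geom)):
        raise ValueError("POS, SIZE and GEOM must have the same number of elements")
    pos, size = [], []
    for p, s, lo, hi in zip(pos_items, size_items, box_min, box_max):
        # NULL means box center for POS and full box size for SIZE.
        pos.append((lo + hi) / 2.0 if p == "NULL" else float(p))
        size.append(hi - lo if s == "NULL" else float(s))
    return pos, size
```

Applied to the example above with a unit box, the NULL entry in SIZE resolves to the full extent of the box along the second geometry field.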

Boundaries

For regions that intersect the boundary of the simulation box, the service has the option of applying different types of boundary conditions. Possible solutions are truncated boundary conditions (the sub-box is truncated at the box boundaries) or periodic boundary conditions (if applicable). The service MAY support the following parameter specifying the adopted boundary conditions:

BOUNDARY

This parameter has one value for each GEOM element. Possible values are:

Registry metadata of the service indicates what kind of boundary conditions are supported.
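
The two boundary treatments can be sketched for a single axis as follows. The value names "truncated" and "periodic" and the function name are assumptions, since the allowed BOUNDARY values are not fixed in this draft; the truncated case also assumes that the requested interval intersects the box.

```python
def apply_boundary(lo, hi, box_lo, box_hi, mode):
    """Return the list of intervals actually extracted along one axis.

    mode: "truncated" or "periodic" (hypothetical value names).
    """
    if mode == "truncated":
        # Clip the requested interval to the box.
        return [(max(lo, box_lo), min(hi, box_hi))]
    if mode == "periodic":
        length = box_hi - box_lo
        if hi - lo >= length:
            return [(box_lo, box_hi)]
        # Wrap the lower edge into the box, then split if the interval spills over.
        start = box_lo + (lo - box_lo) % length
        end = start + (hi - lo)
        if end <= box_hi:
            return [(start, end)]
        return [(start, box_hi), (box_lo, end - length)]
    raise ValueError("unsupported boundary mode: " + mode)
```

In the periodic case a request that crosses the upper face of the box is returned as two intervals, one at each end of the axis.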

Further Custom Parameters

The service MAY support additional service-specific parameters. The names, meanings, and allowed values are defined by the service and published on registries. The names need not be upper-case; however, they should not match any of the reserved parameter names defined above.
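
A service provider could check custom parameter names against the reserved ones as sketched below. The set of reserved names is collected from sections 4.2 to 5.3 of this draft, including both naming pairs used for the dataset identifiers; the case-insensitive comparison is an assumption.

```python
# Reserved names as they appear in this draft (assumed to be the complete set).
RESERVED = {
    "EXPERIMENT_ID", "SNAP_ID", "SERVICE_ID", "SOURCE_ID",
    "FIELDS", "UNITS", "GEOM", "POS", "SIZE", "BOUNDARY",
}


def check_custom_names(names):
    """Return the custom parameter names that collide with reserved ones."""
    return [n for n in names if n.upper() in RESERVED]
```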

6 Service Request, Response and Result

Any SimDAP request is implemented as a web method. The response is represented by the INFO information parameter (error code) which can be Ok or Rejected (however custom values are accepted). ... RICK THIS IS FOR YOU

The result of any SimDAP request (in some cases, the preview could be an exception) is the data of interest, which is delivered in the standard TDFF format (see section 8), consisting of a metadata VOTable plus one or more external binary files. The VOTable MUST contain the access reference to the external files.
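
On the client side, locating the external files could look like the sketch below. It assumes that the access reference is carried by the href attribute of STREAM elements, as in the examples of Appendix A; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def access_references(votable_text):
    """Return the href of every STREAM element found in a VOTable document."""
    root = ET.fromstring(votable_text)
    # Match on the local name so that namespaced documents also work.
    return [
        elem.attrib["href"]
        for elem in root.iter()
        if elem.tag.rsplit("}", 1)[-1] == "STREAM" and "href" in elem.attrib
    ]
```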

The access reference is... RICK THIS IS FOR YOU

7 Service Metadata and Registration

... RICK THIS IS FOR YOU

8 Theoretical Data File Format

Data can be stored in files in a large number and variety of formats. We propose a standard which makes the file content easily accessible and sharable. At the same time, we want to avoid any re-work on data, like format conversion or encoding, which would result in an overwhelming and error-prone effort for the data provider/producer. Finally, we want to have compact and dense files, therefore the bulk of the data must be stored in a binary format.

We propose to adopt a VOTable based solution, with the following characteristics:

The combination of the XML and binary files defines the Theoretical Data File Format (TDFF).

8.1 The VOTable and the data parameters

8.2 The binary file and its parameters

The binary file can have any format.

If the format follows a precise standard, like HDF5 or FITS, the corresponding MIME type must be specified in the VOTable. Further information is not necessary, since the related APIs or tools can properly access the file.

If the format is not standard, a MIME type "PROPRIETARY" is defined. In this case the VOTable should describe the file structure. In some cases this may not be possible. These cases have MIME type "UNKNOWN" and are not the subject of the present document.

The file-format-specific parameters comprise file parameters, which describe properties of the file as a whole, and field parameters, which are related to each physical field stored in the file (not to be confused with the field parameters describing the data content, presented in section 8.1).

File global parameters

File fields parameters

I WILL COMPLETE THIS SECTION IN THE NEXT DAYS (what follows is the old version, just for reference)

The VOTable result of a SimDAP operation MUST have the following items:

  1. The VOTable MUST contain a RESOURCE element, identified with the tag type="results", containing one or more TABLE elements with the metadata results of the setSimDAP operation. The VOTable is permitted to contain additional RESOURCE elements, but the usage of any such elements is not defined here. If multiple resources are present it is recommended that the query results be returned in the first resource element.
  2. The VOTable MUST contain a SERVICE_ID parameter which identifies the used service.
  3. The VOTable MUST contain a REQUEST_ID parameter which identifies uniquely the job request on the service.
  4. The VOTable MUST contain a TABLE with the list of the GEOM fields and the selected SELECTED_FIELDS, with the associated metadata. GEOM and SELECTED_FIELDS are stored in the external binary file.
  5. SELECTED_FIELDS are scalars. Vectors and more generally, multidimensional quantities, are not supported. This means that each FIELD represents a scalar value. E.g. temperature of each point, x coordinate of a particle.
  6. The VOTable MUST specify the ACREF parameter with the reference to the associated binary data file, as specified in X.X

Appendix A: VOTable examples

A.1 VOTable for the velocity field of a fluid on a fixed 3D mesh

[GL – We still need a proper way I guess of indicating what the spatial dimensions are for a representation like this. FITS has its WCS system for implicitly specifying the spatial coordinates of a multidimensional array. Is something like this in existence for VOTable ? We need to inquire.]


<VOTABLE>
  <RESOURCE name="myVectorField" type="results">
    <DESCRIPTION>Velocity Field from N-Body run</DESCRIPTION>
    <INFO name="QUERY_STATUS" value="OK"/>

    <TABLE name="VelocityField" ID="Vel" order="sequential">
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x" datatype="float"
             arraysize="41x41x41" unit="km/s" geometry="mesh"/>
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y" datatype="float"
             arraysize="41x41x41" unit="km/s" geometry="mesh"/>
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z" datatype="float"
             arraysize="41x41x41" unit="km/s" geometry="mesh"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
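
As a consistency check on an example such as A.1, a client could compare the size of the binary file with the size implied by the FIELD declarations. The sketch below assumes a single row, no header or padding in the binary file, and 4 bytes per value for the "float" datatype; the function name is hypothetical.

```python
def expected_bytes(fields, bytes_per_value=4):
    """Total payload size implied by the arraysize of each FIELD.

    fields: list of arraysize strings such as "41x41x41"
    bytes_per_value: 4 for the VOTable "float" datatype
    """
    total = 0
    for arraysize in fields:
        count = 1
        for dim in arraysize.split("x"):
            count *= int(dim)
        total += count * bytes_per_value
    return total
```

For the three 41x41x41 velocity components of A.1 this gives 3 x 68921 x 4 = 827052 bytes.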

A.2 VOTable for the velocity and position fields of particles from an N-Body simulation


<VOTABLE>
  <RESOURCE name="myParticles" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="Particles" ID="NBody" order="tabular">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x"
             datatype="float" arraysize="100000" unit="km/s" geometry="particles"/>
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y"
             datatype="float" arraysize="100000" unit="km/s" geometry="particles"/>
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z"
             datatype="float" arraysize="100000" unit="km/s"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

A.3 VOTable for the temperature field of a mesh-based quantity and the position of N-Body particles extracted from the same spatial region


<VOTABLE>
  <RESOURCE name="myMixedData" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="ParticlesAndMesh" ID="NBody" order="sequential">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="temperature" ID="temp" ucd="phys.temperature;pos.cartesian.x"
             datatype="float" arraysize="41x41x41" unit="K" geometry="mesh"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

An alternate version


<VOTABLE>
  <RESOURCE name="myMixedData" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="Particles" ID="NBodyParticles" order="sequential">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z"
             datatype="float" arraysize="100000" unit="Mpc" geometry="particles"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test_particles.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
    <TABLE name="Mesh" ID="NBodyMesh" order="sequential">
      <FIELD name="temperature" ID="temp" ucd="phys.temperature;pos.cartesian.x"
             datatype="float" arraysize="41x41x41" unit="K" geometry="mesh"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test_mesh.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

[GL – Do we need an example of an "ordinary" tabular VOTable as well ? Something like


<VOTABLE>
  <RESOURCE name="myParticles" type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE name="Particles" ID="NBody">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x"
             datatype="float" unit="Mpc"/>
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y"
             datatype="float" unit="Mpc"/>
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z"
             datatype="float" unit="Mpc"/>
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x"
             datatype="float" unit="km/s"/>
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y"
             datatype="float" unit="km/s"/>
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z"
             datatype="float" unit="km/s"/>
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

]
