Requirements for a Provenance Data Model
This page is a collection of requirements for a provenance data model that should cover the provenance of observations. Let's keep simulations in mind, but focus on observations.
This page was discussed within GAVO; comments from after the Face2Face meeting in early 2013 are marked in green. Last update: April 16, 2013.
Keywords like must, should, can, etc. are just suggestions.
Please add your comments/suggestions!
0 Use cases
The provenance data model should cover the following issues:
- Aid in debugging
- Attribution (who was involved in the project? Who can I ask about these data?)
- Aid in reprocessing (but not: allow reprocessing on keypress)
- Steps of production (allow figuring out which processing steps have already been done)
- Allow *people* to assess the "quality" of the observation/reduction
- Let people search in structured provenance metadata
1 General requirements
We divide objects here strictly into data sets/objects and processes or actions. In Gerard's model, "data sets" are called "results" and actions are "experiments".
1.1 Data sets are always connected to each other via actions.
No direct link between data sets is necessary, since there is always an action involved.
1.2 The Provenance Data Model must be able to cope with the following data sets
- processed data with existing raw data, where all the processing information is available
- processed data without raw data (i.e. the raw data are not accessible or do not exist anymore), where all the information about the processing is still available
Example: LOFAR data - processed data sets without any raw data and without processing information
Example: satellite data - processed data with raw data, but only partial processing information
Example: unknown pipeline (black box), only "unspecified" information given
F2F: general agreement on 1.1 and 1.2
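To make 1.1/1.2 concrete, here is a minimal sketch in Python (all class and attribute names are ours, not part of any agreed model) of data sets that are connected only through actions; a processed data set whose raw data are gone simply hangs off an action with no inputs:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DataSet:
        identifier: str
        produced_by: Optional["Action"] = None   # the action that created this data set; None for raw data

    @dataclass
    class Action:
        name: str
        description: str = ""
        inputs: List[DataSet] = field(default_factory=list)    # may be empty if raw data are gone (cf. 1.2)
        outputs: List[DataSet] = field(default_factory=list)

    # LOFAR-like case from 1.2: processed data whose raw data are not accessible anymore
    pipeline = Action(name="unknown pipeline", description="black box, unspecified")
    processed = DataSet(identifier="lofar-image-001", produced_by=pipeline)
    pipeline.outputs.append(processed)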
1.3 The provenance data model should be able to describe the following processes or actions
(Markus would call them "Meta-Actions")
- Processes without human interaction
Example: running a standard pipeline with no human control in between
- Processes using standardized software with standard parameters
- Processes with some human interaction
Example: User needs to test the effects of some parameters and thus changes the standard values; user tries to get better results for a certain aspect of the data set
Example: running a pipeline where a user needs to confirm the end result or do some other step(s) in between
- Processes with logging
- Processes without logging
- Processes without any logging, just human interaction
Example: user performs several steps without tracking these steps or the used tools, e.g. adjusting the contrast of an image in Photoshop, fitting a line to data points by eye, using awk or other search/replace tools to convert numbers in a table to a different unit
- Processes where no code name or details are known, just a description of the algorithm
Example: A user does not want to provide the source of a custom-written piece of code, and no code name exists, but he describes the used method/algorithm
- Processes for which no description exists, just the source code
Example: A user has written some piece of code for a certain transformation step; he provides the source code of his tool for further reference.
- The data model must have the possibility to provide or include standard process descriptions.
F2F: Some issues: how do you handle non-automatic steps (based on a human being's experience)? Also, the points given under 1.3 address different questions (e.g. logging is not relevant for the question about the relationship between raw data and processed data).
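As an illustration only (Python; the enum values and attribute names are our own, not an agreed vocabulary), the distinctions listed under 1.3 could be captured as attributes of a single action description rather than as separate classes:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Automation(Enum):
        AUTOMATIC = "no human interaction"
        INTERACTIVE = "some human interaction"
        MANUAL = "human interaction only"

    @dataclass
    class ActionDescription:
        automation: Automation
        logged: bool                        # was the step logged at all?
        code_name: Optional[str] = None     # may be unknown or withheld
        algorithm: Optional[str] = None     # free-text description of the method
        source_url: Optional[str] = None    # source code may be all that exists

    # "fitting a line to data points by eye": manual, unlogged, no code at all
    by_eye = ActionDescription(Automation.MANUAL, logged=False,
                               algorithm="linear fit estimated by eye")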
1.4 The Provenance Data Model should allow grouping data and their actions into a container, e.g. called “Meta-Actions”. This would allow different “resolutions” of the provenance of a data set.
These could also be called "composites" or "macros", grouping "experiments" together.
Example: A user may just be interested in the big steps that created a data set, and not in every detail.
F2F: Markus would not include this. It would in theory be nice if you could add macros to the model. However, that's a major thing and thus is probably out of scope.
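If composites/macros were included after all, a minimal sketch could look like this (Python; names are illustrative only): a meta-action groups member actions so that provenance can be read at coarse or fine resolution:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Action:
        name: str

    @dataclass
    class MetaAction(Action):
        steps: List[Action] = field(default_factory=list)   # the grouped ("macro") steps, in order

    reduction = MetaAction(name="standard reduction", steps=[
        Action("bias subtraction"),
        Action("flat fielding"),
        Action("wavelength calibration"),
    ])
    # coarse view: just the meta-action; fine view: iterate over reduction.steps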
2 Raw observation data
These are the data that were directly taken from the telescope, without any further processing.
2.1 For observations, ambient conditions should be provided (link? directly in data file?)
Example: link to the weather report, including weather conditions, lunar phase, etc.
2.2 The ambient conditions should/can/may be searchable => ask scientists whether this is needed!
Example: User might want to extract all data where the seeing was better than …
2.3 Each observation needs to be characterized with a set of keywords or attributes that still needs to be specified.
Example keywords: date, start time, end time or exposure time, target
=> These keywords are already included in characterization data model.
F2F: What kind of phenomena do we want to cover? Suggestion: collect FITS files from as many different instruments as we can and collect what's in there. Of course, there's more data than that; see, e.g., the comprehensive weather reports at http://www.ls.eso.org/lasilla/dimm. We agree, however, that such information is far too detailed for our purposes (though we should probably let people include links to such information if available). Still, we should profit as much as possible from the work of the instrumentation people in selecting information relevant for interpreting the data (i.e., consider their choice of concepts when designing their FITS headers).
A result of a provisional survey of FITS data in the VO is available at http://svn.ari.uni-heidelberg.de/svn/reports/provenance/fitsheaders/observations.txt
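As a strawman (Python; the attribute names are ours, loosely following common FITS keywords, and the example values are made up), the searchable core of 2.1-2.3 could be as small as this, with detailed ambient conditions only linked, not embedded:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Observation:
        date_obs: str                   # e.g. "2013-04-16"
        time_start: str                 # e.g. "03:12:00"
        exposure_time_s: float          # or alternatively an end time
        target: str
        ambient_conditions_url: Optional[str] = None   # cf. 2.1: link to weather report etc.

    obs = Observation("2013-04-16", "03:12:00", 1200.0, "NGC 1234",
                      ambient_conditions_url="http://www.ls.eso.org/lasilla/dimm")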
2.4 Each telescope must be characterized
=> use VO-registry?
2.5 Each instrument must be characterized
=> where to point to? Weblinks are not persistent ...
Example: Filter, gratings, …
2.6 Possibly, each observatory should also be characterized (an observatory database already exists)
Example: observatory site, typical weather conditions/limitations;
If an observatory object exists, it can be referenced by several telescopes
F2F: Could it be sufficient to just link to a web page of the observatory describing their machinery? Probably not, since that's neither machine readable (which would violate our requirement for searchability) nor sufficiently stable; we should allow the inclusion of that via a reference URL, though. It would be attractive to require people to store some structured description of their gear in a stable place (i.e., the VO registry) and then just have a reference to that in the actual files, but there are doubts whether that would catch on. To collect the concepts we'd like to have represented, we should again use the sample of headers found in the wild.
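One possible compromise along the lines of the F2F note, sketched in Python (the ivo:// identifier and URLs are placeholders, not real registry entries): prefer a stable, machine-readable registry reference and keep a plain web link only as a fallback:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Facility:          # telescope, instrument or observatory (2.4-2.6)
        name: str
        ivo_id: Optional[str] = None         # stable, structured registry reference, if any
        reference_url: Optional[str] = None  # non-persistent fallback (cf. 2.5)

    observatory = Facility("La Silla", ivo_id="ivo://example.org/lasilla")      # placeholder id
    telescope = Facility("ESO 3.6m")
    instrument = Facility("HARPS", reference_url="http://example.org/harps")    # placeholder URL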
2.7 The observer of the data must be provided
2.8 The observer should be characterized by name and current affiliation
F2F: Status: visiting/resident? Probably doesn't help too much, because that alone doesn't say much about the observer's proficiency (and even pros foul up things now and then...). Affiliation is probably useful for figuring out who it was, in particular for names like John Doe. Note added by Markus: Well, of course affiliations are another can of worms, since there are just so many ways to write affiliations for one and the same shop.
2.9 A link to the log-book should be provided
2.10 A link to the calibration data set must be provided for each observation
2.11 An observation can be one of a set of observations
Example: Integral Field Units of MUSE
2.12 An observation can also be a container for other observations
Example: RAVE observations, MUSE fibres
F2F: compound data -- SimDM doesn't model this kind of thing, although of course their data in general are compound (e.g., snapshots); this just gets too complex for them. VOResource has a related thing, the relationships, which basically are triples (relation-type, source-resource, dest-resource), where relation-type is something like "mirror-of", "service-for", etc. Note added by Markus: There's also the DataLink effort going on for "grouping" of data products. Maybe this kind of thing can be partially solved by allowing provenance on DataLink documents? Harry should explain a bit more where this comes from.
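A minimal way to express 2.11/2.12 (Python sketch; identifiers and attribute names are illustrative only): an observation may be a member of a set and may itself contain sub-observations, e.g. individual IFUs or fibres:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Observation:
        identifier: str
        part_of: Optional["Observation"] = None                     # 2.11: member of a set
        parts: List["Observation"] = field(default_factory=list)    # 2.12: container

    exposure = Observation("muse-exposure-42")                      # made-up identifiers
    for i in range(3):
        ifu = Observation(f"muse-exposure-42/ifu-{i}", part_of=exposure)
        exposure.parts.append(ifu)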
3 Calibration data
3.1 The calibration data set must be characterized (similar to raw data?)
Example: CCD drift, bad pixels, type of flat field
3.2 The calibration data are directly linked with the observation data
=> Including them would probably become too complex, so skip it for now?
F2F: links to calibration data; that could be interesting for raw data (for reduced data, the calibration data used is declared in the processing steps). However, the subject of what someone *could* do for calibration is probably out of scope -- should there be a raw file DM? If at all, that would be stuff for non-machine-readable freetext.
4 Processes/actions
4.1 The post-processing must be traceable
4.2 Each process must be characterized
4.3 Possible keywords for the characterization are (a sketch follows the list below):
- Name (of software)
- Version
- Description of algorithm
- Parameters; or: link to standard parameter set
- Link to source code
- Link to documentation
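A minimal sketch of such a characterization (Python; field names are only suggestions and the example values are illustrative):

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class ProcessCharacterization:
        name: str                                   # name of the software
        version: Optional[str] = None
        algorithm: Optional[str] = None             # description of the algorithm
        parameters: Dict[str, object] = field(default_factory=dict)
        parameter_set_url: Optional[str] = None     # or: link to a standard parameter set
        source_url: Optional[str] = None            # link to source code
        documentation_url: Optional[str] = None     # link to documentation

    step = ProcessCharacterization(
        name="some-pipeline-task", version="1.2",
        parameters={"detect_threshold": 1.5},
        documentation_url="http://example.org/pipeline-docs")       # placeholder URL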
4.4 Input (if possible) and output of the action should be tracked
Example: The "halo finding" action has a simulation snapshot as input and a halo catalogue as output data set.
4.5 A link to the input data must be provided (if possible)
Example: Creating a first input data set (e.g. for cosmological simulations) may not require an input data set
4.6 A quality flag can be provided for each action, also at a later stage; it could be similar to VO's "validation level" & "validated by".
Better: use a free-text field for "Warnings";
We could try to find a common set of error measures like chi-squared etc.
Example: Observer gives a warning that something was weird, e.g. a plane crossed the field of view, ...
F2F: Quality Flags; how could those be added later? Also, quality is something fairly hard to define in a generic way, and it's probably not going to be a single number. Use cases for that appear to be mainly: "Plane was crossing" or similar warnings coming from reduction steps. So, it would seem most of this could just be covered by letting experiments add warnings.
4.7 The quality flag can be different for different input data sets
Example: algorithm may work for some cases, but not for all
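Following the F2F suggestion to use warnings rather than a single quality number, a sketch (Python; names and values are made up) with warnings attached to the action and optionally scoped to one input data set (cf. 4.7):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class WarningNote:
        message: str
        applies_to: Optional[str] = None    # identifier of an input data set, if the warning is scoped

    @dataclass
    class Action:
        name: str
        warnings: List[WarningNote] = field(default_factory=list)

    reduction = Action("reduction run 17")                            # made-up example
    reduction.warnings.append(WarningNote("plane crossed the field of view"))
    reduction.warnings.append(WarningNote("fit did not converge", applies_to="frame-0042"))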
4.8 Access control flags can be provided for the action
=> Setting a parameter to null is different from saying: it's there, but you won't get it.
Example: A user may not want to publish his steps which produced the final data set, because he wants to publish them first in a scientific article
F2F: Access Control. Examples: (1) I've derived that from a proprietary spectrum, don't bother trying to get it. (2) Here's a parameter that I set, the value of which is my secret. For (1), there's probably no modelling necessary since people will realize something is proprietary on access (plus, the thing may have become free in the meantime, in which case such a declaration would actually hurt). In the second case, it would make perfect sense to model the difference between NULL and "won't tell". Let's see if our parameter can easily support such a thing.
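The NULL vs. "won't tell" distinction from the F2F note could be modelled with an explicit marker value; a sketch in Python (names are ours):

    from dataclasses import dataclass

    class _Withheld:
        """Marker: the value exists, but its owner deliberately does not publish it."""
        def __repr__(self):
            return "WITHHELD"

    WITHHELD = _Withheld()

    @dataclass
    class Parameter:
        name: str
        value: object = None      # None = unknown / not set

    p1 = Parameter("smoothing_length")             # genuinely unknown (NULL)
    p2 = Parameter("secret_threshold", WITHHELD)   # set, but the owner won't tell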
5 (Processed) Data object
5.1 The data set must be linked to the action which created it
5.2 The data set should/must not be linked directly to the previous data set
5.3 The ownership or authorship of the data set should be provided.
5.4 Access control flags can be provided
Example: the data is only available for internal usage, but not (yet) public
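Summarizing 5.1-5.4 as a sketch (Python; attribute names and values are illustrative only): the processed data object links back to the creating action, never directly to the previous data set, and carries authorship plus an access flag:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Action:
        name: str

    @dataclass
    class ProcessedDataSet:
        identifier: str
        created_by: Action                  # 5.1: link to the action that created it
        author: Optional[str] = None        # 5.3: ownership/authorship
        public: bool = False                # 5.4: e.g. internal use only for now

    catalogue = ProcessedDataSet("halo-catalogue-v1", Action("halo finding"),
                                 author="J. Doe", public=False)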
6 Quality Object
Replace it with "Warnings". These should be attached to the process, unless they are directly coupled to the data (like a chi-squared).
7 Theory
7.1 The provenance data model should be able to include theory models as input data sets
Example: A user wants to fit a stellar spectrum and uses a catalogue of stellar spectra created from stellar models
F2F: Whatever we do, we can't ask the theory people to provide both SimDM provenance and whatever we come up with. So, SimDM provenance should count as provenance in our sense. Then again, the thought here has rather been to describe theory data used to process observations (e.g., a theoretical spectrum used as a gauge). Here, we need to be careful not to become an analysis data model.
The clean way to allow SimDM provenance would be to derive both SimDM provenance and ours from a common basis; Gerard's Domain Model could be such a base. There's no VO-DML for that yet, but that might change if we know where we're going.
Note that even table columns could have provenance; e.g., in a table of redshifts, each row could be derived from a different spectrum and thus would have a different provenance.
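The per-row provenance mentioned above could, in the simplest case, be nothing more than a provenance reference carried by each row; a trivial sketch (Python, made-up identifiers):

    # each redshift was derived from a different spectrum and therefore has its own provenance
    redshift_table = [
        {"object": "obj-001", "z": 0.023, "derived_from": "spectrum-0001"},
        {"object": "obj-002", "z": 1.107, "derived_from": "spectrum-0073"},
    ]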