% Revision 5691, Fri Nov 15 16:39:50 2019 UTC, by mathieu.servillat
% update intro phrase on FAIR principles, update agent role description, update figures

In this document, we propose an IVOA standard data model (DM) for describing the provenance of astronomical data.
How this specification of the Provenance model can be implemented is described in a companion document to be published as an IVOA Note \citep{std:ProvenanceImplementationNote}.

The provenance of scientific data is cited in the FAIR principles for data sharing \citep{FAIR-principles}.
Provenance is relevant for science data in general, and specifically in an open publishing context, which is a requirement for many
projects and collaborations.
% and corresponds to particularly relevant information in the context of openly published science data.

We follow the definition of provenance as proposed by the W3C \citep{std:W3CProvDM}, i.e.~that provenance is ``information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness''.

In astronomy, such entities are generally datasets composed of VOTables, FITS files, database tables or files containing values (spectra, light curves). They could also be any value, log or document, as well as physical objects such as instruments, detectors or photographic plates.
The activities correspond to processes like an observation, a simulation, processing steps (image stacking, object extraction, etc.), execution of data analysis code, publication, etc.
The people involved can be, for example, individual persons (observer, publisher, etc.), groups or organisations, i.e.~any agent related to an activity or an entity.

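As an illustration only, the three notions above can be sketched as a small in-memory structure. The class names, attribute names and identifiers below are hypothetical and are not part of this specification; the comments merely indicate which W3C PROV relation each attribute loosely mirrors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """A person, group or organisation (hypothetical minimal form)."""
    id: str
    name: str


@dataclass
class Activity:
    """A process such as an observation or a processing step."""
    id: str
    description: str
    used: List["Entity"] = field(default_factory=list)   # cf. PROV "used"
    agents: List[Agent] = field(default_factory=list)    # cf. PROV "wasAssociatedWith"


@dataclass
class Entity:
    """A dataset, file, value or physical object."""
    id: str
    description: str
    generated_by: Optional[Activity] = None              # cf. PROV "wasGeneratedBy"


# Illustrative records: a raw image is calibrated by an operator.
operator = Agent("agent-1", "pipeline operator")
raw = Entity("raw-001", "raw FITS image")
calibration = Activity("act-001", "calibration", used=[raw], agents=[operator])
calibrated = Entity("cal-001", "calibrated image", generated_by=calibration)
```

Under these assumptions, \texttt{calibrated.generated\_by.used} leads back to the raw image, and \texttt{calibrated.generated\_by.agents} to the responsible agent.
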
An example of activities, entities and agents, as they can be discovered backwards in time, is given in Figure~\ref{fig:example-workflow}.


\begin{figure}[ht]
\centering
\includegraphics[width=1\textwidth]{workflow-backwards.pdf}
\caption[Example graph of provenance discovery]{An example graph of provenance discovery. Starting with a released dataset (left), the involved activities (blue boxes),
progenitor entities (yellow rounded boxes) and responsible agents (orange pentagons) are
discovered.}
\label{fig:example-workflow}
\end{figure}


\subsection{Goal of the provenance model}
\label{sec:goals}

The goal of this Provenance DM is to describe how provenance information arising from astronomy projects can be modelled, stored and exchanged.
Its scope is mainly the modelling of the flow of data, of the relations between pieces of data, and of processing steps.
However, the Provenance DM is sufficiently abstract that its core pattern could be applied to any kind of process related to either observation or simulation data.

Information attached to observation activities, such as ambient conditions and instrument characteristics, provides useful information to assess the quality and reliability of the generated entities.
Contextual information during the execution of processing activities (computer structure, nodes, operating system used, etc.) can also be relevant for the description of the main entities generated.
This complementary information should be included in the form of metadata or additional entities connected to an activity.
However, the precise structure and modelling of this information is outside the scope of this document.

In general, the model shall capture information in a machine-readable way that would enable a scientist who has no prior knowledge about a dataset to get more background information.
This will help the scientist to decide if the dataset is adequate for her research goal, assess its quality and reliability and get enough information to be able to trace back its history as far as required or possible.

Provenance information can be exposed with different granularity. Each project has to decide on this granularity.
The granularity and amount of provenance information provided depend on the available information, the needs of the project and the intended usage of this information.

This flexible approach has an impact on the interoperability between different services, as the level of detail is not known a priori.
The objective of the model is to propose a general structure for the provenance information. In addition, proposed vocabularies of reserved words help to further formalize the detailed provenance information.


The following list is a collection of use cases addressed by the Provenance DM.


\paragraphlb{A: Traceability of products}
Track the lineage of a product back to the raw material (backwards search), show the
workflow or the data flow that led to a product.

\noindent Examples:
\begin{itemize}
\item Having a dataset, find the main progenitors and in particular locate the raw data.
\item Find out what processing steps have already been performed for a given dataset: Is an image already calibrated? What about dark field subtraction? Were foreground stars removed?
\item Find out if a filter to remove atmospheric background muons has been applied.
\end{itemize}
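
A backwards search of this kind can be sketched as a simple graph traversal. The sketch below is only an illustration under stated assumptions: the two dictionaries, the identifiers in them and the function \texttt{trace\_back} are hypothetical and do not correspond to any serialisation defined in this document.

```python
# Hypothetical provenance records (illustrative identifiers only):
# each entity maps to the activity that generated it,
# and each activity maps to the entities it used.
was_generated_by = {
    "stacked_image": "stacking",
    "calibrated_image_1": "calibration_1",
    "calibrated_image_2": "calibration_2",
}
used = {
    "stacking": ["calibrated_image_1", "calibrated_image_2"],
    "calibration_1": ["raw_image_1", "dark_frame"],
    "calibration_2": ["raw_image_2", "dark_frame"],
}


def trace_back(entity):
    """Walk backwards from an entity; return (activities, raw entities)."""
    activities, raw, seen = [], set(), set()
    stack = [entity]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        activity = was_generated_by.get(current)
        if activity is None:
            # No generating activity recorded: treat the entity as raw material.
            raw.add(current)
            continue
        if activity not in activities:
            activities.append(activity)
        # Continue the search with the progenitors of the current entity.
        stack.extend(used.get(activity, []))
    return activities, raw


activities, raw = trace_back("stacked_image")
```

With these records, the search collects the stacking and both calibration activities, and identifies the two raw images and the dark frame as the raw material.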


\paragraphlb{B: Acknowledgement and contact information}
Find the people involved in the production of a dataset, the people\slash{}organizations\slash{}institutes that one may want to acknowledge or can be asked for more information.

\noindent Examples:
\begin{itemize}
\item I want to use an image for my own work -- who was involved in creating it? Who can I contact to get information?
\item Who was on shift while the data was taken?
\item I have a question about column xxx in a data table. Who can I ask about that?
\end{itemize}


\paragraphlb{C: Quality and Reliability assessment}
Assess the quality and reliability of an observation, production step or dataset, e.g., based on detailed descriptions of the processing steps and manipulated entities.

\noindent Examples:
\begin{itemize}
\item Get detailed information on the methods/tools/software that were involved: What algorithm was used for Cherenkov photon reconstruction? How was the stacking of images performed?
\item Check if the processing steps (including data acquisition) went ``well'': Were there any warnings during the data processing? Any quality control parameters?
\item Extract the ambient conditions during data acquisition (cloud coverage? wind? temperature?).
\item Is the dataset produced or published by a person/organisation I rely on? Using methods I trust?
\end{itemize}


\paragraphlb{D: Identification of error location}
Find the location of possible error sources in the generation of a product. This is connected to use case C above, but requires access to more information on the execution, such as the configuration or the execution environment.

\noindent Examples:
\begin{itemize}
\item I found something strange in an image. Was there anything strange noted when the image was taken? Was there a warning during the processing?
\item Which pipeline version was used, the old one with a known bug for treating bright objects or a newer version?
\item What was the execution environment of the pipeline (operating system, system dependencies, software version, etc.)?
\item What was the detailed configuration of the pipeline? Were the parameters correctly set for the image cleaning step?
\end{itemize}


\paragraphlb{E: Search in structured provenance metadata}
Use provenance criteria to locate datasets (forward search), e.g., finding all images produced by a certain processing step or derived from data which were taken by a given facility.

\noindent Examples:
\begin{itemize}
\item Find more images that were produced using the same version of the CTA pipeline.
\item Get an overview of all images reduced with the same calibration dataset.
\item Are there any more images attributed to this observer?
\item Find all datasets generated using this given algorithm, with this given configuration, for this given step of the data processing.
\item Find all generated data files that used incorrectly generated file X as an input, so that they can be marked for re-processing.
\item Extract all the provenance information of a SVOM light curve or spectrum to reprocess the raw data with refined parameters.
\end{itemize}
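
A forward search, such as marking everything derived from an incorrectly generated file X, can be sketched in the same spirit. This is again a minimal illustration under stated assumptions: the dictionaries, the identifiers and the function \texttt{derived\_from} are hypothetical and not prescribed by this document.

```python
# Hypothetical provenance records (illustrative identifiers only):
# inputs used by each activity and outputs generated by each activity.
used = {
    "calib_A": ["file_X", "raw_1"],
    "calib_B": ["raw_2", "flat"],
    "stack": ["cal_A", "cal_B"],
}
generated = {
    "calib_A": ["cal_A"],
    "calib_B": ["cal_B"],
    "stack": ["stacked"],
}


def derived_from(entity):
    """Return every entity produced, directly or indirectly, from `entity`."""
    affected = set()
    queue = [entity]
    while queue:
        current = queue.pop()
        for activity, inputs in used.items():
            if current not in inputs:
                continue
            # Every output of an activity that used `current` is affected.
            for output in generated.get(activity, []):
                if output not in affected:
                    affected.add(output)
                    queue.append(output)
    return affected


to_reprocess = sorted(derived_from("file_X"))
```

With these records, \texttt{to\_reprocess} contains \texttt{cal\_A} and \texttt{stacked}, while \texttt{cal\_B} is unaffected because none of its progenitors depend on file X.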

\paragraphlb{General Remarks}
In addition to those use cases, if the stored information is sufficiently fine-grained, it is possible to enable the \textbf{reproducibility} of an activity or sequence of activities, with the exact same configuration and exact same conditions.

Provenance information delivers additional information about a scientific dataset to enable the scientist to evaluate its \textbf{relevance for their work}.
