% Source: /trunk/projects/dm/provenance/description/intro-general.tex
% Revision 4378, Thu Sep 21 12:54:18 2017 UTC, by mathieu.servillat
% Commit message: fix intro-general missing words

In this document, we discuss a draft of an IVOA standard data model for
describing the provenance of astronomical data.
We follow the definition of provenance proposed by the W3C \citep{std:W3CProvDM}, i.e. that provenance is ``information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness''.

In astronomy, such entities are generally datasets composed of VOTables, FITS
files, database tables or files containing values (spectra, light curves), logs,
parameters, etc. The activities correspond to processes such as an observation, a
simulation, or processing steps (image stacking, object extraction, etc.). The
people involved can be individual persons (observer, publisher, \ldots), groups
or organisations. An example of activities, entities and agents as they can be
discovered backwards in time is given in Figure~\ref{fig:example-workflow}.

\begin{figure}[h]
\centering
\includegraphics[width=1\textwidth]{workflow-backwards.pdf}
\caption[Example graph of provenance discovery]{An example graph of provenance discovery. Starting with a released dataset (left), the involved activities (blue boxes),
progenitor entities (yellow rounded boxes) and responsible agents (orange pentagons) are
discovered.}
\label{fig:example-workflow}
\end{figure}

The Provenance Data Model discussed here is sufficiently abstract that its core pattern could be applied to any kind of process using either observation or simulation data.
It could also be used to describe the workflow for observation proposals or the publication of scientific articles based on (astronomical) data. However, here we focus on astronomical data. The links between the Provenance Data Model and other IVOA data models
are discussed in Section~\ref{sec:dmlinks}. We note here that the provenance of simulated data is already covered by the Simulation Data Model
\citep[SimDM,][]{std:SimDM}. Therefore, we also give a mapping between SimDM and the Provenance Data Model in Section~\ref{sec:dmlinks}.

%including extraction of data from
%databases or even the flow of scientific proposals from application to
%acceptance, including scheduling of the observations proposed therein.
%Provenance information could also be used to check internal processes,
%e.g., whether a proposal was approved by a person from a certain committee,
%or whether the time span between application and acceptance or rejection
%does not extend a certain period, etc...

\subsection{Goal of the provenance model}\label{sec:goals}

The goal of this Provenance Data Model is to describe how provenance information
can be modeled, stored and exchanged. Its scope
is mainly the modeling of the flow of data, of the relations between data,
and of processing steps.

Characteristics of observation activities such as ambient conditions and
instrument characteristics can be associated with provenance information.
Experimental configuration or contextual information during
the execution of processing activities (computer structure, nodes, operating
system used, \dots) can also be connected to provenance information. However,
they are not modeled explicitly here. This additional information can be
included in the form of data or entities linked to those activities, or as
attributes of activities (see also Section~\ref{sec:parameters} for parameters
of activities).

In general, the model shall capture information in a machine-readable way that enables a scientist who has no prior knowledge about a dataset to get more background information.
This will help the scientist to decide whether the dataset
is adequate for her research goal, to assess its quality and to get enough information
to trace back its history as far as required or possible.

Provenance information may be recorded in minute detail or by using coarser
elements. The appropriate granularity depends on the needs of the specific project
and on the intended usage of the system that tracks the provenance information.
% NOTE: maybe we need to define minimal requirements of what needs to be included as provenance information?

The following list is a collection of tasks that the Provenance Data Model should help to solve. They are flagged with [S] for problems that are more interesting for the end user of datasets (usually a scientist) and with [P] for tasks that are probably more important for data producers and publishers.
More specific use cases in the astronomy domain for different types of datasets and workflows, along with example implementations, are given in Section~\ref{sec:usecases-implementations}.

\paragraphlb{A: Tracking the production history [S]}
Find out which steps were taken to produce a dataset and list the
methods\slash{}tools\slash{}software that were involved. Track the
history back to the raw data files \slash{} raw images, show the
workflow (backwards search), or return a list of progenitor datasets.

\noindent Examples:
\begin{itemize}
\item Is an image from catalogue xxx already calibrated?
What about dark field subtraction? Were foreground stars removed? Which technique
was used?
\item Is the background noise of atmospheric muons still present in my neutrino data sample?
\end{itemize}

We do not go as far as to consider easy reproducibility as a use case -- this would be too ambitious. But at least the
major steps undertaken to create a piece of data should be recoverable.

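\noindent The backwards search of task A can be illustrated with a minimal Python sketch. It is not part of the data model: the record layout, the dataset names and the function \texttt{progenitors} are hypothetical, and only the relation names (``used'', ``was generated by'') are borrowed from the W3C provenance vocabulary.

```python
from collections import deque

# Hypothetical toy provenance records; all names are illustrative only.
# WAS_GENERATED_BY maps an entity to the activity that produced it.
WAS_GENERATED_BY = {
    "stacked_image": "stacking",
    "calibrated_image_1": "calibration_1",
    "calibrated_image_2": "calibration_2",
}
# USED maps an activity to the entities it consumed.
USED = {
    "stacking": ["calibrated_image_1", "calibrated_image_2"],
    "calibration_1": ["raw_image_1", "dark_frame"],
    "calibration_2": ["raw_image_2", "dark_frame"],
}


def progenitors(entity):
    """Walk backwards from an entity; return (activities, progenitor entities)."""
    activities, entities = [], []
    seen = {entity}
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        activity = WAS_GENERATED_BY.get(current)
        if activity is None:
            continue  # e.g. raw data: no generating activity was recorded
        if activity not in activities:
            activities.append(activity)
        for parent in USED.get(activity, []):
            if parent not in seen:
                seen.add(parent)
                entities.append(parent)
                queue.append(parent)
    return activities, entities


steps, inputs = progenitors("stacked_image")
print(steps)   # processing steps, nearest first
print(inputs)  # progenitor datasets, nearest first
```

A breadth-first walk is used here so that the most recent processing steps are listed first; a real implementation would query whatever store holds the provenance records.
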
\paragraphlb{B: Attribution and contact information [S]}
Find the people involved in the production of a dataset,
the people\slash{}organizations\slash{}institutes that need to be cited
or can be asked for more information.

\noindent Examples:
\begin{itemize}
\item I want to use an image for my own work -- who was involved in
creating it? Who do I need to cite or who can I contact to get this information? Is a license attached to the data?
\item I have a question about column xxx in a data
table. Who can I ask about that?
\item Who should be cited or acknowledged if I use this data in my work?
\end{itemize}

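\noindent As a rough illustration of task B, the following Python sketch looks up the agents attached to a dataset, optionally filtered by role. The record layout, the agent names and the function \texttt{agents\_for} are assumptions made for this example only; the roles ``observer'' and ``publisher'' are taken from the examples of people mentioned above.

```python
# Hypothetical attribution records: (entity, agent, role).
# Names and roles are illustrative, not prescribed by the data model.
WAS_ATTRIBUTED_TO = [
    ("survey_image", "A. Observer", "observer"),
    ("survey_image", "Example Data Centre", "publisher"),
    ("source_table", "B. Analyst", "creator"),
]


def agents_for(entity, role=None):
    """Return the agents attributed to an entity, optionally restricted to one role."""
    return [
        agent
        for ent, agent, agent_role in WAS_ATTRIBUTED_TO
        if ent == entity and (role is None or agent_role == role)
    ]


# Who was involved in creating the image, and who published it?
print(agents_for("survey_image"))
print(agents_for("survey_image", role="publisher"))
```
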
\paragraphlb{C: Locate error sources [S, P]}
Find the location of possible error sources in the generation of a dataset.

\noindent Examples:
\begin{itemize}
\item I found something strange in an image. Where does
the image come from? Which instrument was used, with which characteristics,
etc.? Was there anything strange noted when the image was taken?
\item Which pipeline version was used -- the old one
with a known bug for treating bright objects or a newer version?
\item This light curve doesn't look quite right. How was
the photometry determined for each data point?
\end{itemize}

\paragraphlb{D: Quality assessment [P]}
Judge the quality of an observation, production step or dataset.

\noindent Examples:
\begin{itemize}
\item Since wrong calibration images may increase the
number of artifacts in an image rather than removing them, knowledge about
the calibration image set will help to assess the quality of the calibrated
image.
\end{itemize}

\paragraphlb{E: Search in structured provenance metadata [P, S]}
Searching structured provenance metadata also allows a ``forward search'', i.e. locating derived datasets or outputs, e.g. finding all images produced by a certain processing step or derived from data that were taken by a given facility.

\noindent Examples:
\begin{itemize}
\item Give me more images that were produced using the same pipeline.
\item Give me an overview of all images reduced with the same calibration dataset.
\item Are there any more images attributed to this observer?
\item Which images of the Crab Nebula are of good quality and were produced within the last 10 years by someone not from ESO or NASA?
\item Find all datasets generated using this given algorithm for this given step of the data processing.
% add another specific use case for tracking scientific productivity?
\end{itemize}

This task is probably the most challenging. It also includes tracking the history of data items as in A, but we list it separately because we may decide that it cannot be kept, whereas A is definitely required.

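\noindent The forward search of task E can be sketched in Python as follows. This is only an illustration under stated assumptions: the activity records, the pipeline labels and the functions \texttt{produced\_by\_pipeline} and \texttt{derived\_from} are hypothetical and do not correspond to any interface defined by the model.

```python
from collections import defaultdict, deque

# Hypothetical activity records: (activity, pipeline, inputs, outputs).
ACTIVITIES = [
    ("calibration_1", "pipeline-v1", ["raw_image_1", "dark_frame"], ["calibrated_image_1"]),
    ("calibration_2", "pipeline-v2", ["raw_image_2", "dark_frame"], ["calibrated_image_2"]),
    ("stacking", "pipeline-v2", ["calibrated_image_1", "calibrated_image_2"], ["stacked_image"]),
]


def produced_by_pipeline(pipeline):
    """Return all outputs of activities that ran with the given pipeline."""
    return [
        out
        for _activity, name, _inputs, outputs in ACTIVITIES
        if name == pipeline
        for out in outputs
    ]


def derived_from(entity):
    """Return all entities derived, directly or indirectly, from an entity."""
    # Invert the records: input entity -> entities produced from it.
    children = defaultdict(list)
    for _activity, _name, inputs, outputs in ACTIVITIES:
        for inp in inputs:
            children[inp].extend(outputs)
    derived, seen = [], {entity}
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                derived.append(child)
                queue.append(child)
    return derived


# "Give me all images produced with the same pipeline / calibration dataset."
print(produced_by_pipeline("pipeline-v2"))
print(derived_from("dark_frame"))
```

The forward search needs the relations indexed from inputs to outputs, which is the reverse of the direction used for tracking the production history; this is one reason why the task is listed separately from A.
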