% Source: /trunk/projects/dm/provenance/description/intro-general.tex
% Revision 4377, Thu Sep 21 10:48:15 2017 UTC, by mnullmei:
% remaining -- minimal -- grammar, style, and TeX fixes for Section 1
In this document, we discuss a draft of an IVOA standard data model for
describing the provenance of astronomical data.
We follow the definition of provenance as proposed by the W3C \citep{std:W3CProvDM}, i.e.\ that provenance is ``information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness''.

In astronomy, such entities are generally datasets composed of VOTables, FITS
files, database tables or files containing values (spectra, light curves), logs,
parameters, etc. The activities correspond to processes like an observation, a
simulation, or processing steps (image stacking, object extraction, etc.). The
people involved can be individual persons (observer, publisher, \ldots), groups
or organisations. An example of activities, entities and agents as they can be
discovered backwards in time is given in Figure~\ref{fig:example-workflow}.

\begin{figure}[h]
\centering
\includegraphics[width=1\textwidth]{workflow-backwards.pdf}
\caption[Example graph of provenance discovery]{An example graph of provenance discovery. Starting with a released dataset (left), the involved activities (blue boxes),
progenitor entities (yellow rounded boxes) and responsible agents (orange pentagons) are
discovered.}
\label{fig:example-workflow}
\end{figure}

The Provenance Data Model discussed here is sufficiently abstract that its core pattern could be applied to any kind of process using either observation or simulation data.
It could also be used to describe the workflow for observation proposals or the publication of scientific articles based on (astronomical) data. However, here we focus on astronomical data. The links between the Provenance Data Model and other IVOA data models
will be discussed in Section~\ref{sec:dmlinks}. We note here that the provenance of simulated data is already covered by the Simulation Data Model
\citep[SimDM,][]{std:SimDM}. Therefore, we also give a mapping between SimDM and the Provenance Data Model in Section~\ref{sec:dmlinks}.

%including extraction of data from
%databases or even the flow of scientific proposals from application to
%acceptance, including scheduling of the observations proposed therein.
%Provenance information could also be used to check internal processes,
%e.g., whether a proposal was approved by a person from a certain committee,
%or whether the time span between application and acceptance or rejection
%does not extend a certain period, etc...
\subsection{Goal of the provenance model}\label{sec:goals}

The goal of this Provenance Data Model is to describe how provenance information
can be modeled, stored and exchanged. Its scope
is mainly modeling of the flow of data, of the relations between data,
and of processing steps.

Characteristics of observation activities such as ambient conditions and
instrument characteristics can be associated with provenance information.
Information recorded during
the execution of processing activities (computer structure, nodes, operating
system used, \dots) can also be connected to provenance information. However,
neither will be modeled explicitly here. This additional information can be
included in the form of data or entities linked to those activities, or as
attributes of activities (see also Section~\ref{sec:parameters} for parameters
of activities).

In general, the model shall capture information in a machine-readable way that would enable a scientist who has no prior knowledge about a dataset to get more background information.
This will help the scientist to decide if the dataset
is adequate for her research goal, assess its quality and get enough information
to be able to trace back its history as far as required or possible.

Provenance information may be recorded in minute detail or by using coarser
elements.
The appropriate granularity depends on the needs of the project and on the intended usage of the system that tracks provenance information.
% NOTE: maybe we need to define minimal requirements of what needs to be included as provenance information?

The following list is a collection of tasks which the Provenance Data Model should help to solve. They are flagged with [S] for problems which are more interesting for the end user of datasets (usually a scientist) and with [P] for tasks that are probably more important for data producers and publishers.
More specific use cases in the astronomy domain for different types of datasets and workflows along with example implementations are given in Section~\ref{sec:usecases-implementations}.
\paragraphlb{A: Tracking the production history [S]}
Find out which steps were taken to produce a dataset and list the
methods\slash{}tools\slash{}software that were involved. Track the
history back to the raw data files \slash{} raw images, show the
workflow (backwards search), or return a list of progenitor datasets.

\noindent Examples:
\begin{itemize}
\item Is an image from catalogue xxx already calibrated?
What about dark field subtraction? Were foreground stars removed? Which technique
was used?
\item Is the background noise of atmospheric muons still present in my neutrino data sample?
\end{itemize}

We do not go as far as to consider easy reproducibility as a use case -- this would be too ambitious. But at least the
major steps undertaken to create a piece of data should be recoverable.
\paragraphlb{B: Attribution and contact information [S]}
Find the people involved in the production of a dataset,
the people\slash{}organizations\slash{}institutes that need to be cited
or can be asked for more information.

\noindent Examples:
\begin{itemize}
\item I want to use an image for my own work -- who was involved in
creating it? Who do I need to cite or who can I contact to get this information? Is a license attached to the data?
\item I have a question about column xxx in a data
table. Who can I ask about that?
\item Who should be cited or acknowledged if I use this data in my work?
\end{itemize}
\paragraphlb{C: Locate error sources [S, P]}
Find the location of possible error sources in the generation of a dataset.

\noindent Examples:
\begin{itemize}
\item I found something strange in an image. Where does
the image come from? Which instrument was used, with which characteristics,
etc.? Was there anything strange noted when the image was taken?
\item Which pipeline version was used -- the old one
with a known bug for treating bright objects or a newer version?
\item This light curve doesn't look quite right. How was
the photometry determined for each data point?
\end{itemize}
\paragraphlb{D: Quality assessment [P]}
Judge the quality of an observation, production step or dataset.

\noindent Examples:
\begin{itemize}
\item Since wrong calibration images may increase the
number of artifacts in an image rather than remove them, knowledge about
the calibration image set will help to assess the quality of the calibrated
image.
\end{itemize}
\paragraphlb{E: Search in structured provenance metadata [P, S]}
Searching structured provenance metadata also allows a ``forward search'', i.e.\ locating derived datasets or outputs, e.g.\ finding all images produced by a certain processing step or derived from data which were taken by a given facility.

\noindent Examples:
\begin{itemize}
\item Give me more images that were produced using the same pipeline.
\item Give me an overview of all images reduced with the same calibration dataset.
\item Are there any more images attributed to this observer?
\item Which images of the Crab Nebula are of good quality and were produced within the last 10 years by someone not from ESO or NASA?
\item Find all datasets generated using a given algorithm for a given step of the data processing.
% add another specific use case for tracking scientific productivity?
\end{itemize}

This task is probably the most challenging. It also includes tracking the history of data items as in A. We nevertheless list it separately, since we may decide that we cannot keep this task, whereas we definitely want A.
