# Contents of /trunk/projects/dm/provenance/description/intro-general.tex

Revision 4378
Thu Sep 21 12:54:18 2017 UTC by mathieu.servillat
fix intro-general missing words

In this document, we discuss a draft of an IVOA standard data model for
describing the provenance of astronomical data.
We follow the definition of provenance as proposed by the W3C \citep{std:W3CProvDM}, i.e. that provenance is ``information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness''.

In astronomy, such entities are generally datasets composed of VOTables, FITS
files, database tables or files containing values (spectra, light curves), logs,
parameters, etc. The activities correspond to processes like an observation, a
simulation, or processing steps (image stacking, object extraction, etc.). The
people involved can be individual persons (observer, publisher, \ldots), groups
or organizations. An example of activities, entities and agents as they can be
discovered backwards in time is given in Figure~\ref{fig:example-workflow}.

\begin{figure}[h]
\centering
\includegraphics[width=1\textwidth]{workflow-backwards.pdf}
\caption[Example graph of provenance discovery]{An example graph of provenance discovery. Starting with a released dataset (left), the involved activities (blue boxes),
progenitor entities (yellow rounded boxes) and responsible agents (orange pentagons) are
discovered.}
\label{fig:example-workflow}
\end{figure}


The currently discussed Provenance Data Model is sufficiently abstract that its core pattern could be applied to any kind of process using either observation or simulation data.
It could also be used to describe the workflow for observation proposals or the publication of scientific articles based on (astronomical) data. However, here we focus on astronomical data. The links between the Provenance Data Model and other IVOA data models
will be discussed in Section~\ref{sec:dmlinks}.
We note here that the provenance of simulated data is already covered by the Simulation Data Model
\citep[SimDM,][]{std:SimDM}. Therefore we also give a mapping between SimDM and the Provenance Data Model in Section~\ref{sec:dmlinks}.


%including extraction of data from
%databases or even the flow of scientific proposals from application to
%acceptance, including scheduling of the observations proposed therein.
%Provenance information could also be used to check internal processes,
%e.g., whether a proposal was approved by a person from a certain committee,
%or whether the time span between application and acceptance or rejection
%does not extend a certain period, etc...


\subsection{Goal of the provenance model}\label{sec:goals}

The goal of this Provenance Data Model is to describe how provenance information
can be modeled, stored and exchanged. Its scope
is mainly the modeling of the flow of data, of the relations between data,
and of processing steps.

Characteristics of observation activities such as ambient conditions and
instrument characteristics can be associated with provenance information.
Experimental configuration or contextual information during
the execution of processing activities (computer structure, nodes, operating
system used, \dots) can also be connected to provenance information. However,
they will not be modeled here explicitly. This additional information can be
included in the form of data or entities linked to those activities, or as
attributes of activities (see also Section~\ref{sec:parameters} for parameters
of activities).

In general, the model shall capture information in a machine-readable way that would enable a scientist who has no prior knowledge about a dataset to get more background information.
This will help the scientist to decide if the dataset
is adequate for her research goal, assess its quality and get enough information
to be able to trace back its history as far as required or possible.

Provenance information may be recorded in minute detail or by using coarser
elements. The appropriate granularity depends on the needs of the project
that records provenance and on the intended usage of the system that tracks it.
% NOTE: maybe we need to define minimal requirements of what needs to be included as provenance information?

The following list is a collection of tasks which the Provenance Data Model should help to solve. They are flagged with [S] for problems which are more interesting for the end user of datasets (usually a scientist) and with [P] for tasks that are probably more important for data producers and publishers.
More specific use cases in the astronomy domain for different types of datasets and workflows along with example implementations are given in Section~\ref{sec:usecases-implementations}.


\paragraphlb{A: Tracking the production history [S]}
Find out which steps were taken to produce a dataset and list the
methods\slash{}tools\slash{}software that were involved. Track the
history back to the raw data files \slash{} raw images, show the
workflow (backwards search), or return a list of progenitor datasets.

\noindent Examples:
\begin{itemize}
\item Is an image from catalogue xxx already calibrated?
What about dark field subtraction? Were foreground stars removed? Which technique
was used?
\item Is the background noise of atmospheric muons still present in my neutrino data sample?
\end{itemize}

We do not go as far as to consider easy reproducibility as a use case -- this would be too ambitious.
But at least the
major steps undertaken to create a piece of data should be recoverable.


\paragraphlb{B: Attribution and contact information [S]}
Find the people involved in the production of a dataset,
the people\slash{}organizations\slash{}institutes that need to be cited
or can be asked for more information.

\noindent Examples:
\begin{itemize}
\item I want to use an image for my own work -- who was involved in
creating it? Who do I need to cite or who can I contact to get this information? Is a license attached to the data?
\item I have a question about column xxx in a data
table. Who can I ask about that?
\item Who should be cited or acknowledged if I use this data in my work?
\end{itemize}


\paragraphlb{C: Locate error sources [S, P]}
Find the location of possible error sources in the generation of a dataset.

\noindent Examples:
\begin{itemize}
\item I found something strange in an image. Where does
the image come from? Which instrument was used, with which characteristics,
etc.? Was there anything strange noted when the image was taken?
\item Which pipeline version was used -- the old one
with a known bug for treating bright objects or a newer version?
\item This light curve doesn't look quite right. How was
the photometry determined for each data point?
\end{itemize}


\paragraphlb{D: Quality assessment [P]}
Judge the quality of an observation, production step or dataset.

\noindent Examples:
\begin{itemize}
\item Since wrong calibration images may increase the
number of artifacts in an image rather than removing them, knowledge about
the calibration image set will help to assess the quality of the calibrated
image.
\end{itemize}


\paragraphlb{E: Search in structured provenance metadata [P, S]}
This would allow one to also do a ``forward search'', i.e. locate derived datasets or outputs, e.g.
finding all images produced by a certain processing step or derived from data which were taken by a given facility.

\noindent Examples:
\begin{itemize}
\item Give me more images that were produced using the same pipeline.
\item Give me an overview of all images reduced with the same calibration dataset.
\item Are there any more images attributed to this observer?
\item Which images of the Crab Nebula are of good quality and were produced within the last 10 years by someone not from ESO or NASA?
\item Find all datasets generated using this given algorithm for this given step of the data processing.
% add another specific use case for tracking scientific productivity?
\end{itemize}

This task is probably the most challenging. It also includes tracking the history of data items as in A, but we still have listed this task separately, since we may decide that we can't keep this one, but we definitely want A.
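
To make the difference between the backward search of task A, the attribution of task B and the forward search of task E more concrete, the following minimal Python sketch may help. It is only an illustration under strong assumptions and is not part of the data model: the records, the dictionary layout and all names (\verb|ACTIVITIES|, \verb|progenitors|, \verb|responsible_agents|, \verb|derived|, and the dataset and agent identifiers) are hypothetical and were invented for this example.

```python
# Toy provenance records. All identifiers are invented for illustration and
# are NOT defined by the Provenance Data Model.
# Each activity lists the entities it used, the entities it generated,
# and the agents responsible for it.
ACTIVITIES = {
    "observation": {
        "used": [],
        "generated": ["raw_image"],
        "agents": ["observer"],
    },
    "calibration": {
        "used": ["raw_image", "dark_frame"],
        "generated": ["calibrated_image"],
        "agents": ["pipeline_operator"],
    },
    "stacking": {
        "used": ["calibrated_image"],
        "generated": ["stacked_image"],
        "agents": ["pipeline_operator"],
    },
}


def generating_activity(entity):
    """Return the name of the activity that generated `entity`, or None."""
    for name, activity in ACTIVITIES.items():
        if entity in activity["generated"]:
            return name
    return None


def progenitors(entity):
    """Backward search (task A): every entity that `entity` was derived from."""
    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        name = generating_activity(current)
        if name is None:
            continue  # no recorded history for this entity
        for parent in ACTIVITIES[name]["used"]:
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


def responsible_agents(entity):
    """Attribution (task B): agents of all activities in the entity's history."""
    agents = set()
    for item in [entity] + progenitors(entity):
        name = generating_activity(item)
        if name is not None:
            agents.update(ACTIVITIES[name]["agents"])
    return agents


def derived(entity):
    """Forward search (task E): every entity derived from `entity`."""
    found = []
    stack = [entity]
    while stack:
        current = stack.pop()
        for activity in ACTIVITIES.values():
            if current in activity["used"]:
                for child in activity["generated"]:
                    if child not in found:
                        found.append(child)
                        stack.append(child)
    return found
```

For these toy records, \verb|progenitors("stacked_image")| yields the calibrated image, the raw image and the dark frame, while \verb|derived("dark_frame")| yields the calibrated and the stacked image. The backward search only needs to follow the links stored with one dataset, whereas this forward search scans all recorded activities, which hints at why task E is the more demanding one for an implementation.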