
Contents of /trunk/projects/dm/provenance/ProvenanceDMRequirements.html


Revision 2601 - (show annotations)
Thu May 8 13:42:42 2014 UTC (7 years, 4 months ago) by volute@g-vo.org
File MIME type: text/html
File size: 21686 byte(s)
(Kristin) Adjusted mime-type for ProvenanceDMRequirements.html

1 <!DOCTYPE html
2 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
3 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
5 <!--
6 //
7 // 2005-12-01 ksc Initial release
8 //
9 -->
11 <html>
12 <head>
13 <link rel="icon" href="/favicon.ico" type="image/x-icon" />
15 <title>GAVO ProvenanceDMRequirements</title>
16 <meta http-equiv='Content-Style-Type' content='text/css' />
17 <!--HeaderText--><style type='text/css'><!--
18 ul, ol, pre, dl, p { margin-top:0px; margin-bottom:0px; }
19 code.escaped { white-space: nowrap; }
20 .vspace { margin-top:1.33em; }
21 .indent { margin-left:40px; }
22 .outdent { margin-left:40px; text-indent:-40px; }
23 a.createlinktext { text-decoration:none; border-bottom:1px dotted gray; }
24 a.createlink { text-decoration:none; position:relative; top:-0.5em;
25 font-weight:bold; font-size:smaller; border-bottom:none; }
26 img { border:0px; }
27 .editconflict { color:green;
28 font-style:italic; margin-top:1.33em; margin-bottom:1.33em; }
30 table.markup { border:2px dotted #ccf; width:90%; }
31 td.markup1, td.markup2 { padding-left:10px; padding-right:10px; }
32 table.vert td.markup1 { border-bottom:1px solid #ccf; }
33 table.horiz td.markup1 { width:23em; border-right:1px solid #ccf; }
34 table.markup caption { text-align:left; }
35 div.faq p, div.faq pre { margin-left:2em; }
36 div.faq p.question { margin:1em 0 0.75em 0; font-weight:bold; }
37 div.faqtoc div.faq * { display:none; }
38 div.faqtoc div.faq p.question
39 { display:block; font-weight:normal; margin:0.5em 0 0.5em 20px; line-height:normal; }
40 div.faqtoc div.faq p.question * { display:inline; }
42 .frame
43 { border:1px solid #cccccc; padding:4px; background-color:#f9f9f9; }
44 .lfloat { float:left; margin-right:0.5em; }
45 .rfloat { float:right; margin-left:0.5em; }
46 a.varlink { text-decoration:none; }
49 html, body {
50 min-height: 10%;
51 }
53 * html, * body {
54 min-height: 10%;
55 }
57 body {
58 font-family: Arial, Verdana, Helvetica, sans-serif;
59 color: #042E74;
60 font-size: 11pt;
61 margin: 0 auto;
62 /* background: #D2DcF0 url(images/background.gif) top center repeat-y; */
63 }
65 code{ font-family: "courier new", monospace; font-size: inherit; }
66 pre{ margin-left: 3em; background-color: #EFE; font-family: "courier new", monospace; font-size: medium; }
68 hr { border: 1px; height: 1px; background: #000; color: #FFF; width: 90%; }
70 a {
71 color: #042EF0;
72 text-decoration: none;
73 }
74 a:hover {
75 color: #042EA0;
76 text-decoration: underline;
77 }
79 /* including background again (mostly) covers up a problem with the background in
80 Firefox & Camino when resizing the window to less than content width */
81 #wrapper {
82 position: relative;
83 /* width: 97.3%;
84 width: 1500px; */
85 height: 97.3%;
86 margin: 0;
87 /*background: #92847b url(images/background.gif) top center repeat-y; */
88 }
89 #wikilogo { margin-top:0; padding:0; border-bottom:0 #cccccc solid; }
91 #head {
92 position: relative;
93 top: 0;
94 padding: 0;
95 height: 86px;
96 /*height: 110px; -- prev. used for big header */
97 /*width: 1500px; */
98 background: #899dbe url(images/gavo_header_166_curve.jpg) top left no-repeat;
99 }
100 #page-title {
101 position: absolute;
102 top: 35px;
103 left: 235px;
104 /*width: 605px;*/
105 font-size: 8pt;
106 display: none;
107 font-weight: bold;
108 color: #eee;
109 }
110 #page-subtitle {
111 position: absolute;
112 top: 65px;
113 left: 245px;
114 /*width: 605px;*/
115 font-size: small;
116 font-weight: lighter;
117 color: #F0F4FF;
118 }
119 /*#head #page-actions { */
120 #col-left #page-actions {
121 position: relative;
122 bottom: 0;
123 left: 0;
124 /* width: 605px; */
126 font-size: 6pt;
127 text-align: right;
128 background: #D2DCF0;
129 }
130 #page-actions ul { list-style: none; margin: 0; padding: 2px; color: #042E74;}
131 #page-actions li { display: inline; margin: 0; padding: 2px; color: #042E74;}
132 #page-actions li a { text-decoration: none; color: #eee; margin: 0; padding: 2px; color: #042E74;}
133 #page-actions li a:hover { text-decoration: underline; color: #eee; margin: 0; padding: 2px; color: #042E74;}
135 #content {
136 margin: 0 auto;
137 }
139 #col-left {
140 position: relative;
141 float: left;
142 /* width: 154px; */
143 width: 18.5ex; /*19ex;*/
144 height: 1200px;
145 margin:0;
146 /*margin: 0 0 0 0;*/
147 padding-left: 1.5ex;
148 padding-right: 0.5ex; /* KR */
149 padding-top: 1em;
150 background: #D2DCF0;
151 border: 1px solid silver; /* KR */
152 }
153 #col-right {
154 /*float:left; */
155 margin: 0.1ex 0 0 0.1ex;
156 padding-left: 1.5ex;
157 padding-top: 0.5em;
158 /*border-top: 0.3em solid #E28F6E;*/
159 }
162 #wikitext h1, h2, h3, h4, h5, h6 { color: #666; }
163 #wikitext h1 {
164 font-size: 13pt;
165 font-weight: bold;
166 letter-spacing: 0.1em;
167 font-family: Arial, Verdana, Helvetica, sans-serif;
168 color: #042E74;
169 background: #F0F4FF;
170 line-height:2em
171 }
173 #wikitext h2 { font-size: 12pt; font-weight: bold;}
174 #wikitext h3 { font-size: 11pt; font-weight: bold;}
175 #wikitext h4 { font-size: 11pt; background: #F0F4FF;}
176 #wikitext h5 { font-size: 11pt; font-style: italic; }
178 /* some own styles, KR */
179 #wikitext h6 { margin-top: 0; border-bottom: 1px solid #cccccc; margin-bottom: 10px; font-size: 110%;}
180 #wikitext h6 a { color: #666; }
181 #wikitext div.img {
182 position: absolute;
183 text-align: center;
184 bottom: 1.25ex;
185 width: 40ex;
186 height: 12ex;
187 margin: -1px;
188 padding: -1px;
190 background-color: #fff;
191 border: 1px solid #cccccc;
192 }
193 #wikitext div.img img {
194 width: 100%;
195 height: 100%;
196 }
202 </style>
203 <meta name='robots' content='index,follow' />
205 <!-- <link rel='stylesheet' title="A Bit Modern" href='http://www.g-vo.org/pmwiki/pub/skins/abitmodern/abitmodern.css' type='text/css' />
206 -->
208 </head>
210 <body>
211 <div id='wrapper'>
213 <div id='content'>
215 <div id='col-right'>
216 <!--PageText-->
217 <div id='wikitext'>
218 <h1>Requirements for a Provenance Data Model </h1>
219 <p>This page is a collection of requirements for a provenance data model that should cover the provenance of observations. Let's keep simulations in mind, but focus on observations.
220 </p>
221 <p>
222 This page was discussed within GAVO; comments added after the Face2Face meeting in early 2013 are marked in green.
223 Last update: April 16, 2013.</p>
225 <p class='vspace'>Keywords like <em>must</em>, <em>should</em>, <em>can</em> etc. are just suggestions. <br />Please add your comments/suggestions!<br />
226 </p><h2>0 Use cases</h2>
227 <p>The provenance data model should cover the following use cases:
228 </p><ul><li>Aid in debugging
229 </li><li>Attribution (who was involved in the project? Who can I ask about these data?)
230 </li><li>Aid in reprocessing (but not: allow reprocessing on keypress)
231 </li><li>Steps of production (allow figuring out which processing steps have already been done)
232 </li><li>Allow *people* to assess the "quality" of the observation/reduction
233 </li><li>Let people search in structured provenance metadata
234 </li></ul><div class='vspace'></div><h2>1 General requirements</h2>
235 <p>We divide objects here strictly into data sets/objects and processes or actions.
236 In Gerard's model, "data sets" are called "results" and actions are "experiments".
237 </p>
238 <div class='vspace'></div><h3>1.1 Data sets are always connected to each other via actions.</h3>
239 <p>No direct link between data sets is necessary, since there is always an action involved.
240 </p>
241 <div class='vspace'></div><h3>1.2 The Provenance Data Model must be able to cope with the following kinds of data sets</h3>
242 <ul><li>processed data with existing raw data, where all the processing information is available
243 </li><li>processed data without raw data, where all the information about the processing is still available
244 <br /> <em>Example: LOFAR data</em>
245 </li><li>processed data sets without any raw data; no processing information
246 <br /> i.e. raw data are not accessible or do not exist anymore <br /> <em>Example: satellite data</em>
247 </li><li>processed data with raw data, but only partial processing information<br /> <em>Example: unknown pipeline (black box), only "unspecified" information given</em>
248 </li></ul><p class='vspace'><span style='color: green;'><strong>F2F:</strong> general agreement on 1.1 and 1.2</span>
249 </p>
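<p class='vspace'>Requirements 1.1 and 1.2 can be illustrated with a minimal sketch. It assumes hypothetical class names (<code>DataSet</code>, <code>Action</code>) and fields that are not part of any agreed model; the only point is that data sets reference each other via actions, and that missing raw data or missing processing information can still be represented.
</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSet:
    """A data set (a "result" in Gerard's model). Hypothetical sketch."""
    name: str
    # The action that produced this data set; None for raw data.
    created_by: Optional["Action"] = None


@dataclass
class Action:
    """A process or action (an "experiment" in Gerard's model)."""
    name: str
    # Input data sets; may be empty if raw data are lost or inaccessible.
    inputs: List[DataSet] = field(default_factory=list)
    # Free-text processing information; None for a "black box".
    description: Optional[str] = None


def predecessors(data_set: DataSet) -> List[DataSet]:
    """Return the data sets this one was derived from, going via its action."""
    if data_set.created_by is None:
        return []
    return list(data_set.created_by.inputs)


# Processed data with raw data and processing information.
raw = DataSet("raw_frame")
reduce_step = Action("reduction", inputs=[raw], description="standard pipeline")
reduced = DataSet("reduced_frame", created_by=reduce_step)

# Processed data without raw data and without processing information.
orphan = DataSet("satellite_product", created_by=Action("unknown"))
```
</pre>
<p class='vspace'>There is deliberately no field pointing from one <code>DataSet</code> to another; the previous data set is only reachable through the action.
</p>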
250 <div class='vspace'></div><h3>1.3 The provenance data model should be able to describe the following processes or actions</h3>
251 <p>(Markus would call them "Meta-Actions")
252 </p><ul><li>Processes without human interaction<br /> <em>Example: running a standard pipeline with no human control in between</em>
253 </li><li>Processes using standardized software with standard parameters
254 </li><li>Processes with some human interaction<br /> <em>Example: User needs to test the effects of some parameters and thus changes the standard values, user tries to get better results for a certain aspect of the data set</em><br /> <em>Example: running a pipeline, where a user needs to confirm the end result or do some other step(s) in between</em>
255 </li><li>Processes with logging
256 </li><li>Processes without logging
257 </li><li>Processes without any logging, just human interaction<br /> <em>Example: user performs several steps without tracking these steps or the used tools, e.g. adjusting the contrast of an image in Photoshop, fitting a line to data points by eye, using awk or other search/replace tools to convert numbers in a table to a different unit</em>
258 </li><li>Processes where no code name or details are known, just a description of the algorithm<br /> <em>Example: A user does not want to provide the source of a custom-written piece of code, and no code name exists. But he describes the method/algorithm used.</em>
259 </li><li>Processes for which no description exists, just the source code<br /> <em>Example: A user has written some piece of code for a certain transformation step; he provides the source code of his tool for further reference.</em>
260 </li><li>The data model must have the possibility to provide or include standard process descriptions.
261 </li></ul><p class='vspace'><span style='color: green;'><strong>F2F:</strong> Some issues: how do you handle non-automatic steps (based on a human being's experience)? The points given under 1.3 address different questions (e.g. logging is not relevant for the question about the relationship between raw data and processed data). </span>
262 </p>
263 <div class='vspace'></div><h3>1.4 The Provenance Data Model should allow grouping data and their actions into a container, e.g. called “Meta-Actions”. This would allow different “resolutions” of the provenance of a data set.</h3>
264 <p>These could also be called "composites" or "macros", grouping "experiments" together.<br /> <em>Example: A user may just be interested in the big steps that created a data set, and not in every detail.</em><br />
265 <span style='color: green;'><strong>F2F:</strong> Markus would not include this. It would in theory be nice if you could add macros to the model. However, that's a major thing and thus is probably out of scope. </span>
266 </p>
267 <div class='vspace'></div><h2>2 Raw observation data</h2>
268 <p>These are the data that were directly taken from the telescope, without any further processing.
269 </p>
270 <div class='vspace'></div><h3>2.1 For observations, ambient conditions should be provided (link? directly in data file?)</h3>
271 <p><em>Example: link to the weather report, including weather conditions, lunar phase, etc.</em>
272 </p><h3>2.2 The ambient conditions should/can/may be searchable =&gt; ask scientists whether this is needed!</h3>
273 <p><em>Example: User might want to extract all data where the seeing was better than … </em>
274 </p><h3>2.3 Each observation needs to be characterized with a set of keywords or attributes that still needs to be specified. </h3>
275 <p><em>Example keywords: date, start time, end time or exposure time, target</em><br />=&gt; These keywords are already included in the Characterization data model.
276 </p>
277 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> What kind of phenomena do we want to cover? Suggestion: collect FITS files from as many different instruments as we can and collect what's in there. Of course, there's more data than that; see, e.g., the comprehensive weather reports at <a style='color: green' class='external' target='_blank' href='http://www.ls.eso.org/lasilla/dimm'>http://www.ls.eso.org/lasilla/dimm</a>. We agree, however, that such information is far too detailed for our purposes (though we should probably let people include links to such information if available). Still, we should profit as much as possible from the work of the instrumentation people in selecting information relevant for interpreting the data (i.e., consider their choice of concepts when designing their FITS headers). </span>
278 </p>
279 <p class='vspace'><span style='color: green;'>A result of a provisional survey of FITS data in the VO is available at <a style='color: green' class='external' target='_blank' href='http://svn.ari.uni-heidelberg.de/svn/reports/provenance/fitsheaders/observations.txt'>http://svn.ari.uni-heidelberg.de/svn/reports/provenance/fitsheaders/observations.txt</a></span>
280 </p>
281 <div class='vspace'></div><h3>2.4 Each telescope must be characterized</h3>
282 <p>=&gt; use VO-registry?
283 </p>
284 <div class='vspace'></div><h3>2.5 Each instrument must be characterized</h3>
285 <p>=&gt; where to point to? Weblinks are not persistent ...<br /><em>Example: Filter, gratings, …</em>
286 </p><h3>2.6 Possibly, each observatory should also be characterized (there already exists an observatory database)</h3>
287 <p><em>Example: observatory site, typical weather conditions/limitations;<br />If observatory object exists, it can be referenced by several telescopes</em>
288 </p>
289 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> </span>
290 Could it be sufficient to just link to a web page of the observatory describing their machinery? Probably not, since that's neither machine readable (which would violate our requirement for searchability) nor sufficiently stable; we should allow the inclusion of that via a reference URL, though. It would be attractive to require people to store some structured description of their gear in a stable place (i.e., the VO registry) and then just have a reference to that in the actual files, but there are doubts whether that would catch on. To collect the concepts we'd like to have represented, we should again use the sample of headers found in the wild.
291 </p>
292 <div class='vspace'></div><h3>2.7 The observer of the data must be provided</h3>
293 <h3>2.8 The observer should be characterized by name and current affiliation</h3>
294 <p><span style='color: green;'><strong>F2F:</strong> Status: visiting/resident? Probably doesn't help too much because that alone doesn't say much about the observer's proficiency (and even pros foul up things now and then...) Affiliation? is probably useful for figuring out who it was, in particular for names like John Doe. Note added by Markus: Well, of course affiliations are another can of worms, since there's just so many ways to write affiliations for one and the same shop. </span>
295 </p>
296 <div class='vspace'></div><h3>2.9 A link to the log-book should be provided</h3>
297 <h3>2.10 A link to the calibration data set must be provided for each observation</h3>
298 <h3>2.11 An observation can be one of a set of observations</h3>
299 <p><em>Example: Integral Field Units of MUSE</em>
300 </p><h3>2.12 An observation can also be a container for other observations </h3>
301 <p><em>Example: RAVE observations, MUSE fibres</em>
302 </p>
303 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> compound data -- SimDM doesn't model this kind of thing, although of course their data in general is compound (e.g., snapshots); this just gets too complex for them. VOResource has a related thing, the relationships, which basically are triples (relation-type, source-resource, dest-resource), where relation-type is something like "mirror-of", "service-for", etc. Note added by Markus: There's also the DataLink effort going on for "grouping" of data products. Maybe this kind of thing can be partially solved by allowing provenance on DataLink documents? Harry should explain a bit more where this comes from. </span>
304 </p>
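<p class='vspace'>The triple idea mentioned above could be applied to observation containers (2.11, 2.12) roughly as follows. This is only a sketch: the relation type <code>"part-of"</code> and all identifiers are made up for illustration and are not taken from VOResource or DataLink.
</p>
<pre>
```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    """A (relation-type, source, dest) triple; all values are hypothetical."""
    relation_type: str   # assumed vocabulary, e.g. "part-of"
    source: str          # identifier of the source observation
    dest: str            # identifier of the destination observation


def members(relations: List[Relation], container: str) -> List[str]:
    """List all observations declared as "part-of" the given container."""
    return [r.source for r in relations
            if r.relation_type == "part-of" and r.dest == container]


# Made-up identifiers: two fibres belong to field A, one to field B.
relations = [
    Relation("part-of", "obs/fibre-001", "obs/field-A"),
    Relation("part-of", "obs/fibre-002", "obs/field-A"),
    Relation("part-of", "obs/fibre-101", "obs/field-B"),
]
```
</pre>
<p class='vspace'>Keeping the grouping in separate triples means the observation records themselves stay flat; whether that is preferable to nesting is an open design question.
</p>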
305 <div class='vspace'></div><h2>3 Calibration data</h2>
306 <h3>3.1 The calibration data set must be characterized (similar to raw data?)</h3>
307 <p><em>Example: CCD drift, bad pixels, type of flat field</em>
308 </p><h3>3.2 The calibration data are directly linked with the observation data</h3>
309 <p>=&gt; Including them would probably become too complex, so skip it for now?
310 </p>
311 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> links to calibration data; that could be interesting for raw data (for reduced data, the calibration data used is declared in the processing steps). However, the subject of what someone *could* do for calibration is probably out of scope -- should there be a raw file DM? If at all, that would be stuff for non-machine-readable freetext. </span>
312 </p>
313 <div class='vspace'></div><h2>4 Processes/actions</h2>
314 <h3>4.1 The post-processing must be traceable</h3>
315 <h3>4.2 Each process must be characterized</h3>
316 <h3>4.3 Possible keywords for the characterization are</h3>
317 <ul><li>Name (of software)
318 </li><li>Version
319 </li><li>Description of algorithm
320 </li><li>Parameters; or: link to standard parameter set
321 </li><li>Link to source code
322 </li><li>Link to documentation
323 </li></ul><div class='vspace'></div><h3>4.4 Input (if possible) and output of the action should be tracked </h3>
324 <p><em>Example: The "halo finding" action has a simulation snapshot as input and a halo catalogue as output data set.</em>
325 </p>
326 <div class='vspace'></div><h3>4.5 A link to the input data must be provided (if possible)</h3>
327 <p><em>Example: Creating a first input data set (e.g. for cosmological simulations) may not require an input data set</em>
328 </p>
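<p class='vspace'>The keywords from 4.3, together with the input and output tracking from 4.4 and 4.5, could be collected in a single record. The sketch below is an assumption about how such a record might look; the class name, the field names and the <code>is_characterized</code> helper are hypothetical.
</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProcessDescription:
    """Characterization of one process, using the keywords listed in 4.3."""
    name: Optional[str] = None                 # name of the software
    version: Optional[str] = None
    algorithm: Optional[str] = None            # description of the algorithm
    parameters: Dict[str, str] = field(default_factory=dict)
    parameter_set_link: Optional[str] = None   # link to a standard parameter set
    source_link: Optional[str] = None          # link to source code
    documentation_link: Optional[str] = None
    inputs: List[str] = field(default_factory=list)    # may be empty (4.5)
    outputs: List[str] = field(default_factory=list)

    def is_characterized(self) -> bool:
        """True if a name, an algorithm description or a source link is given."""
        return any([self.name, self.algorithm, self.source_link])


# The example from 4.4: a snapshot goes in, a halo catalogue comes out.
halo_finding = ProcessDescription(
    algorithm="halo finding",
    inputs=["simulation snapshot"],
    outputs=["halo catalogue"],
)

# A "black box" step: only the output is known.
black_box = ProcessDescription(outputs=["reduced image"])
```
</pre>
<p class='vspace'>All fields are optional on purpose, since 1.3 asks for processes that are known only by an algorithm description or only by their source code.
</p>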
329 <div class='vspace'></div><h3>4.6 A quality flag can be provided for each action, also at a later stage; it could be similar to the VO's "validation level" &amp; "validated by". </h3>
330 <p>Better: use a free-text field for "Warnings";<br />We could try to find a common set of error measures like chi-squared etc.<br /><em>Example: Observer gives a warning that something was weird, e.g. a plane crossed the field of view, .... </em>
331 </p>
332 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> Quality Flags; how could those be added later? Also, quality is something fairly hard to define in a generic way, and it's probably not going to be a single number. Use cases for that appear to be mainly: "Plane was crossing" or similar warnings coming from reduction steps. So, it would seem most of this could just be covered by letting experiments add warnings. </span>
333 </p>
334 <div class='vspace'></div><h3>4.7 The quality flag can be different for different input data sets</h3>
335 <p><em>Example: algorithm may work for some cases, but not for all</em>
336 </p><h3>4.8 Access control flags can be provided for the action</h3>
337 <p>=&gt; Setting a parameter to null is different from saying: it's there, but you won't get it.<br /><em>Example: A user may not want to publish his steps which produced the final data set, because he wants to publish them first in a scientific article</em>
338 </p>
339 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> Access Control. Examples: (1) I've derived that from a proprietary spectrum, don't bother trying to get it. (2) Here's a parameter that I set, the value of which is my secret. For (1), there's probably no modelling necessary since people will realize something is proprietary on access (plus, the thing may have become free in the meantime, in which case such a declaration would actually hurt). In the second case, it would make perfect sense to model the difference between NULL and "won't tell". Let's see if our parameter can easily support such a thing. </span>
340 </p>
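<p class='vspace'>One possible way to model the difference between NULL and "won't tell" is an explicit availability state next to the value. The sketch below is only an assumption about how a parameter could support this; the names <code>Availability</code>, <code>Parameter</code> and <code>public_value</code> are hypothetical.
</p>
<pre>
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Availability(Enum):
    GIVEN = "given"          # a value is provided
    NULL = "null"            # no value exists or it is unknown
    WITHHELD = "withheld"    # a value exists, but is not disclosed


@dataclass
class Parameter:
    name: str
    value: Optional[str] = None
    availability: Availability = Availability.NULL

    def public_value(self) -> Optional[str]:
        """Return the value only if it is both present and disclosed."""
        if self.availability is Availability.GIVEN:
            return self.value
        return None


# Made-up parameters illustrating the three states.
threshold = Parameter("threshold", "3.5", Availability.GIVEN)
secret = Parameter("smoothing", "0.7", Availability.WITHHELD)
unset = Parameter("mask")
```
</pre>
<p class='vspace'>Both <code>secret</code> and <code>unset</code> expose no value, but a reader can still tell them apart through <code>availability</code>.
</p>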
341 <div class='vspace'></div><h2>5 (Processed) Data object</h2>
342 <h3>5.1 The data set must be linked to the action which created it</h3>
343 <h3>5.2 The data set should/must not be linked directly to the previous data set</h3>
344 <h3>5.3 The ownership or authorship of the data set should be provided.</h3>
345 <h3>5.4 Access control flags can be provided</h3>
346 <p><em>Example: the data is only available for internal usage, but not (yet) public</em>
347 </p>
348 <div class='vspace'></div><h2>6 Quality Object</h2>
349 <p>Replace it by "Warnings". These should be attached to the process, unless they are directly coupled to the data (like a chi-squared).
350 </p>
351 <div class='vspace'></div><h2>7 Theory</h2>
352 <h3>7.1 The provenance data model should be able to include theory models as input data sets</h3>
353 <p><em>Example: A user wants to fit a stellar spectrum and uses a catalogue of stellar spectra created from stellar models</em>
354 </p>
355 <p class='vspace'><span style='color: green;'><strong>F2F:</strong> Whatever we do, we can't ask the theory people to provide both SimDM provenance and whatever we come up with. So, SimDM provenance should count as provenance in our sense. Then again, the thought here has rather been to describe theory data used to process observations (e.g., a theoretical spectrum used as a gauge). Here, we need to be careful not to become an analysis data model.</span>
356 </p>
357 <p class='vspace'>The clean way to allow SimDM provenance would be to derive both SimDM provenance and ours from a common basis; Gerard's Domain Model could be such a base. There's no VO-DML for that yet, but that might change if we know where we're going.
358 </p>
359 <p class='vspace'>Note that even table columns could have provenance; e.g., in a table of redshifts, each row could be derived from a different spectrum and thus would have a different provenance.
360 </p>
361 </div>
363 <!--PageFooterFmt-->
364 <!--/PageFooterFmt-->
365 </div>
366 <div class='clearfix'></div>
367 </div>
368 </div>
370 </body>
371 </html>


Name Value
svn:mime-type text/html
