
Contents of /trunk/projects/dm/provenance/description/datamodel-description.tex



Revision 4238 - (show annotations)
Mon Sep 11 14:33:33 2017 UTC (3 years, 10 months ago) by mathieu.servillat
File MIME type: application/x-tex
File size: 48671 byte(s)
modify EntityDescription paragraphs, add min/max/option to ParameterDescription table
1 % updates Mireille 2017 April/May 2nd
2 %roles for Agents -updates + funder
3 %
4 In this section, we describe the Provenance Data Model as currently discussed. We
5 start with a UML class diagram, explain the core elements, and then give
6 more details for each class and relation in the following sections.
7
8 \subsection{Overview: Conceptual UML class diagram and introduction to core classes}
9 %We give in this section an overview on the main classes. More details about
10 %each class and their relations will be explained in the following sections.
11 %Its core elements are colored in blue. These core elements can also be found in the W3C Provenance Data
12 %Model. The pattern defined by these classes is very general and can be reused everywhere where provenance is needed.
13
14 \begin{figure}[h]
15 \centering
16 \includegraphics[width=1.0\textwidth]{../datamodel-diagrams/images/domain-classdiagram.pdf}
17 \caption{Overview of the classes for the Provenance Data Model in a conceptual class diagram. The blue classes are core elements. Several many-to-many relationships have attached association classes (grey), which can contain additional attributes.}
18 %Objects in the blue box also appear in the W3C Provenance Data Model.
19 %Green classes are links to the IVOA Dataset Metadata Model.}
20 \label{fig:classdiagram-conceptional}
21 \end{figure}
22
23
24 %\label{sec:core}
25 % Some examples for different use cases are given in Section \ref{sec:usecases-implementations}.
26 % The elements of a provenance model can be expressed as a directed graph to capture the causal dependencies.
27
28 Figure~\ref{fig:classdiagram-conceptional} shows the conceptual UML diagram for an IVOA Provenance Data
29 Model.
30 The core elements of the Provenance Data Model are \class{Entity}, \class{Activity} and \class{Agent}.
31 We chose for these elements the same names as were used in the Provenance Data
32 Model of the World Wide Web Consortium (W3C, \citealt{std:W3CProvDM}), which defines
33 a very abstract pattern that can be reused here. Here are the core classes with
34 a short description and some examples:
35
36 \begin{itemize}
37 \item \class{Entity:} a thing in a certain state\\
38 examples: data products like images, catalogs, parameter files, calibration data, instrument characteristics
39
40 \item \class{Activity:} an action/process or a series of actions, occurs over a period of time, performed on or caused by entities, usually results in new entities\\
41 examples: data acquisition like observation, simulation; regridding, fusion, calibration steps, reconstruction
42
43 \item \class{Agent:} executes/controls an activity, is responsible for an activity or an entity\\
44 examples: telescope astronomer, pipeline operator, principal investigator, software engineer, project helpdesk
45
46 \end{itemize}
47
48 \noindent
49
50
51
52 \begin{figure}[h]
53 \centering
54 \includegraphics[scale=0.8]{../datamodel-diagrams/images/classes-core-w3c}
55 \caption{The main core classes and relations of the Provenance Data Model, which also occur in the W3C model.}
56 \label{fig:coreclasses}
57 \end{figure}
58
59 These core classes along with their relations to each other are provided in Figure~\ref{fig:coreclasses}.
60 We use the following relation classes to specify the mapping between the three core
61 classes.
62 The relation names were again chosen to match the W3C model names:
63 \begin{itemize}
64 \item \class{WasGeneratedBy:} a new entity is generated by an activity\\
65 (entity ``image m31.fits'' wasGeneratedBy activity ``observation'')
66 \item \class{Used:} an entity is used by an activity\\
67 (activity ``calibration'' used entities ``calibration data'', ``raw images'')
68 \item \class{WasAssociatedWith:} agents have responsibility for an activity\\
69 (agent ``observer Max Smith'' wasAssociatedWith activity ``observation'')
70 \item \class{WasAttributedTo:} an entity can be attributed to an agent\\
71 (entity ``image m31.fits'' wasAttributedTo ``M31 observation campaign'')
72 \end{itemize}
73
74 Note that the relations appear as extra classes (and thus boxes in the diagrams, instead of just having annotated relations), because they can have additional attributes -- when mapping the model to a relational database, these relations would appear as mapping tables.
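
As an illustration of why relations are modeled as classes of their own, the following minimal Python sketch represents three of the relations as small record types, so that each can carry extra attributes. The identifiers, the \verb|role| attribute placement and the helper function \verb|generated_by| are assumptions made for this example only; they are not part of the model definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    id: str
    name: str


@dataclass(frozen=True)
class Activity:
    id: str
    name: str


@dataclass(frozen=True)
class Agent:
    id: str
    name: str


# Relations are classes of their own so that they can carry attributes.
@dataclass(frozen=True)
class WasGeneratedBy:
    entity: Entity
    activity: Activity
    role: Optional[str] = None  # illustrative extra attribute


@dataclass(frozen=True)
class Used:
    activity: Activity
    entity: Entity
    role: Optional[str] = None  # illustrative extra attribute


@dataclass(frozen=True)
class WasAssociatedWith:
    activity: Activity
    agent: Agent


# Hypothetical instances, following the examples in the list above.
observation = Activity("act:obs1", "observation")
calibration = Activity("act:cal1", "calibration")
raw = Entity("ent:m31", "image m31.fits")
observer = Agent("ag:max", "observer Max Smith")

relations = [
    WasGeneratedBy(raw, observation),
    Used(calibration, raw, role="raw image"),
    WasAssociatedWith(observation, observer),
]


def generated_by(entity: Entity, rels: list) -> List[Activity]:
    """Return the activities linked to `entity` by a WasGeneratedBy relation."""
    return [r.activity for r in rels
            if isinstance(r, WasGeneratedBy) and r.entity == entity]
```

In a relational database, each relation class in this sketch would correspond to one mapping table, with one row per relation instance.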
75
76 In the domain of astronomy, certain processes and steps are repeated again and
77 again with different parameters. We therefore separate the descriptions of activities
78 from the actual processes and introduce an additional \class{ActivityDescription} class (see Figure~\ref{fig:classdiagram-conceptional}).
79 Likewise, we also apply the same pattern for \class{Entity} and add an \class{EntityDescription}
80 class.
81 Defining such descriptions allows them to be reused, which is very useful
82 when performing a series of tasks of the same type, as is typically done in
83 astronomy.
84
85 A similar normalization of descriptions of the actual processes and datasets
86 can also be found in the IVOA Simulation Data Model \citep[SimDM, ][]{std:SimDM},
87 which describes simulation metadata. The SimDM classes \class{Experiment} and \class{Protocol}
88 correspond to the Provenance terms \class{Activity} and \class{ActivityDescription}.
89
90 %The W3C-model has the advantage of being already an approved standard, and it
91 %contains all the necessary main features needed for a Provenance model for
92 %Astronomy. However, it is very general, and by adding reusable prototypes,
93 %templates or descriptions for activities and entities, the model may fit better
94 %to the astronomy domain.
95
96 This separation into two classes may not be needed for every project,
97 and everyone is free to choose which classes make sense for their use case.
98 When serializing provenance, one can integrate the description side into the
99 other classes, thus producing a W3C-compliant provenance description. More details about
100 all these classes and relations are given in the following section.
101
102
103 %It still remains to be seen if this separation into two classes is necessary,
104 %useful or just nice to have. Currently, we include the descriptions in our model,
105 %for normalization purposes.
106
107 %But when serialising the provenance one could
108 %integrate the description side into the other classes, thus producing W3C
109 %compliant provenance.
110
111
112 \subsection{Model description}
113
114 \subsubsection{Class diagram and VO-DML compatibility}
115 \begin{figure}[h]
116 \centering
117 \includegraphics[width=1.0\textwidth]{../datamodel-diagrams/images/classes-overview.pdf}
118 \caption{More detailed overview of the classes for the Provenance Data Model. Note that this UML class diagram is more compatible with VO-DML.}
119 \label{fig:classdiagram}
120 \end{figure}
121
122 Figure~\ref{fig:classdiagram} shows the full class diagram with the association classes for the many-to-many relations modeled more directly as mapping classes. When implementing the model in a relational database, these classes can be represented as individual tables for mapping the relation. We model one of the associations of the many-to-many relationships as composition (full diamond), if the mapping class belongs more strongly to one of its linked classes, e.g. the \emph{Used} relations are strongly dependent on the corresponding \emph{Activities}. The documentation of all classes and an automatically generated figure based on the underlying xmi-description behind this UML diagram is available in the Volute repository at \url{https://volute.g-vo.org/svn/trunk/projects/dm/vo-dml/models/provenancedm/ProvenanceDM.html}.
123
124 This version of the UML diagram is fully VO-DML compliant, i.e. we just used the restricted subset of UML to model
125 Provenance and reused the IVOA datatypes.
126
127
128 \subsubsection{Entity and EntityDescription}
129
130 Entities in astronomy are usually astronomical or astrophysical datasets in the
131 form of images, tables, numbers, etc. But they can also be observation or
132 simulation log files, files containing system information, environment variables, names and versions of packages, ambient conditions or, in a wider sense, also observation proposals, scientific
133 articles, or manuals and other documents.
134
135 An entity is not restricted to being a file.
136 It can even be just a number in a table, depending on how fine-grained the
137 provenance description needs to be.
138
139 \begin{figure}[h]
140 \centering
141 \includegraphics[scale=0.6]{../datamodel-diagrams/images/entity-details.pdf}
142 \caption{The relation between Entity, EntityDescription and Collection (see Section~\ref{sec:collection}).
143 Links to the Dataset class from the Dataset Metadata Model are described in Section~\ref{sec:dmlinks}.}
144 \label{fig:entity-details}
145 \end{figure}
146
147 The VO concept closest to Entity is the notion of ``Dataset'', which could mean a single
148 table, an image or a collection of them. The Dataset Metadata Model
149 \citep{std:DatasetDM} specifies an ``IVOA Dataset'' as ``a file or files which
150 are considered to be a single deliverable''.
151 Most attributes of the \class{Dataset} class can be mapped
152 directly to attributes of the \class{Entity} and \class{EntityDescription} classes; see Table~\ref{tab:datasetmapping} in Section~\ref{sec:dmlinks}.
153
154
155 \begin{table}[h]
156
157 \small
158 \tymax 0.5\textwidth
159
160 \textbf{\normalsize Entity}\vspace{0.25em}\\
161 \begin{tabulary}{1.0\textwidth}{@{}lp{3.5cm}p{2cm}L@{}}
162 \toprule
163 \head{Attribute} & \head{W3C ProvDM} & \head{Data type} & \head{Description}\\
164 \midrule
165 \textbf{id} & prov:id & (qualified) string & a unique id for this entity (unique in its realm)\\
166 name & prov:label & string & a human-readable name for the entity (to be displayed by clients)\\
167 type & prov:type & string & a provenance type, i.e. one of: prov:collection, prov:bundle, prov:plan, prov:entity; not needed for a simple entity\\
168 %description\_ref & & foreign key/url & link to \class{EntityDescription}\\
169 annotation & prov:description & string & text describing the entity in more detail\\
170 rights & -- & string & access rights for the data, values: public, restricted or internal; can be linked to Curation.Rights from ObsCore/DatasetDM\\
171 creationTime & -- & datetime & date and time at which the entity was created (e.g. timestamp of a file)\\
172 \bottomrule
173 \end{tabulary}
174 \caption{Attributes of entities. Mandatory attributes are marked in bold.
175 }\label{tab:entity-attributes}
176 \end{table}
177
178 For entities, we suggest the attributes given in Table
179 \ref{tab:entity-attributes}. If the attribute also exists in the W3C
180 Provenance Data Model, we list its name in the second column.
181
182 %We discussed further attributes like \emph{size} and \emph{format}, but we decided to treat an
183 %entity of the same content but different format (and thus size) as the same entity,
184 %unless they do not have the same provenance (e.g. when the ``transformation'' activity
185 %for converting one format into another is included in the provenance description).
186
187 %\TODO{format and size may not be needed, if entities with the same content but different format and size are considered as the same entity.}
188
189 The difference between entities that are used as input data or output data
190 becomes clear by specifying the relations between the data and activities producing or using these data.
191 More details on this will follow in Section \ref{sec:entity-activity-relations}.
192
193 \paragraph{EntityDescription.}
194 %The Entity class can have an EntityDescription class attached.
195 The types of entities, or datasets in astronomy, can be predefined using a description class \class{EntityDescription}.
196 This class is meant to store information about an Entity that is known before the Entity instance is created. For example, if we run an activity to create an RGB image from three grey images, we may have a mandatory format for the input and output images before the execution (JPG, PNG, FITS\dots), but we probably cannot know the final size of the image that will be created. Therefore, ``format'' would be an EntityDescription attribute, while ``size'' would be an attribute of the Entity instance.
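
The format/size example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: neither \verb|format| nor \verb|size| appears in the attribute tables of this section, and the identifiers and byte counts are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityDescription:
    id: str
    name: str
    format: str  # known before execution (illustrative attribute)


@dataclass
class Entity:
    id: str
    description: EntityDescription
    size: Optional[int] = None  # bytes; known only once the file exists


# One description, defined before any activity has run.
rgb_desc = EntityDescription("desc:rgb", "RGB image", "PNG")

# Two runs of the same activity yield two entities sharing that description.
first = Entity("ent:rgb-1", rgb_desc, size=204800)
second = Entity("ent:rgb-2", rgb_desc, size=198112)
```

Both entities point to the same description object, so the format is stored once, while each instance keeps its own size.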
197
198 %This class thus stores entity-related
199 Some of the attributes that describe the content of the data could be derived from
200 the Dataset Metadata Model.
201
202 The \class{EntityDescription} does NOT contain any information about the usage
203 of the data; it says nothing about whether they are used as input or output. This is
204 defined only by the relations (and the relation descriptions) between activities
205 and entities (see Section \ref{sec:entity-activity-relations}).
206
207 The EntityDescription general attributes are summarized in Table
208 \ref{tab:entitydescription-attributes}.
209
210
211 \begin{table}[h]
212 \small
213 \tymax 0.5\textwidth
214 \textbf{\normalsize EntityDescription}\vspace{0.25em}\\
215 \begin{tabulary}{\textwidth}{@{}p{2.75cm}p{0cm}p{2cm}L@{}}
216 \toprule
217 \head{Attribute} & \head{} & \head{Data type} & \head{Description}\\
218 \midrule
219 \textbf{id} & & (qualified) string & a unique identifier for this description\\
220 name & & string & a human-readable name for the entity description\\
221 annotation & & string & a descriptive text for this kind of entity\\
222 category & & string & specifies if the entity contains information on logging, system (environment), calibration, simulation, observation, configuration, ...\\
223 doculink & & url & link to more documentation\\
224 % removed the obscore attributes, since specific for observations only, not applicable to configuration entities etc.
225 % dataproduct\_ type & & string & from ObsCore data model \citep{std:ObsCore}, if applicable; describes, what kind of product it is (e.g. image, table)\\
226 % dataproduct\_ subtype & & string & from ObsCore data model, more specific subtype\\
227 % level & & enum integer & the level of processing or calibration; for ObsCore's calib\_level it is an integer between 0 and 3\\
228 \bottomrule
229 \end{tabulary}
230 \caption{Attributes of \class{EntityDescription}. For simple use cases,
231 the description classes may be ignored and their attributes may be used for
232 \class{Entity} instead.
233 %The utypes may vary depending on the data model, e.g. for simulation data they
234 %would point to utypes of SimDM.
235 }\label{tab:entitydescription-attributes}
236 \end{table}
237
238
239 \begin{table}[h]
240
241 \small
242 \tymax 0.5\textwidth
243
244 \textbf{\normalsize WasDerivedFrom}\vspace{0.25em}\\
245 \begin{tabulary}{1.0\textwidth}{@{}lp{3cm}L@{}}
246 \toprule
247 \head{Attribute} & \head{Data type} & \head{Description}\\
248 \midrule
249 id & string & a unique id for this relation (unique in its realm)\\
250 \textbf{generatedEntity} & string & foreign key to the entity\\
251 \textbf{usedEntity} & string & foreign key to the progenitor, from which the generatedEntity was derived\\
252 activity & string & foreign key to the generation activity\\
253 generation & string & foreign key to the wasGeneratedBy relation\\
254 usage & string & foreign key to the used relation\\
255 \bottomrule
256 \end{tabulary}
257 \caption{Attributes of the WasDerivedFrom relation. This is the same as used in W3C's ProvDM. Mandatory attributes are marked in bold.
258 }\label{tab:wasderivedfrom-attributes}
259 \end{table}
260
261
262 \paragraph{WasDerivedFrom.}
263 In Figure~\ref{fig:entity-details} there is one more relation that we have not mentioned yet:
264 the \class{WasDerivedFrom}-relation which links two entities together, borrowed from the W3C model.
265 It is used to express that
266 one entity was derived from another, i.e. it can be used to find one (or more) progenitor(s)
267 of a dataset, without having to look for the activities in between. It can therefore serve as
268 a shortcut.
269
270 The information this relation provides is somewhat redundant, since progenitors for entities
271 can be found through the links to activity and the corresponding descriptions.
272 Nevertheless, we include \class{WasDerivedFrom} for those cases where an explicit
273 link between an entity and its progenitor is useful (e.g. for speeding up searches for
274 progenitors or if the activity in between is not important).
275
276 Note that the \class{WasDerivedFrom} relation
277 cannot always be inferred automatically from the \class{WasGeneratedBy} and \class{Used} relations alone:
278 if there is more than one input and more than one output of an activity, it is not clear (without
279 consulting the activityDescription and entity roles in the relation-descriptions) which entity was derived from which.
280 Only by specifying the descriptions and roles accordingly, or by adding a \class{WasDerivedFrom} relation,
281 does this direct derivation become known.
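
The ambiguity can be made concrete with a small Python sketch. It assumes a hypothetical activity \verb|act:split| with two inputs and two outputs, and stores the relations as plain tuples; all identifiers and function names are invented for this example.

```python
# Relations as plain tuples (hypothetical identifiers).
used = [("act:split", "ent:imgA"), ("act:split", "ent:imgB")]
was_generated_by = [("ent:outA", "act:split"), ("ent:outB", "act:split")]
was_derived_from = [("ent:outA", "ent:imgA"), ("ent:outB", "ent:imgB")]


def candidate_progenitors(entity_id):
    """All inputs of the generating activity: an upper bound only."""
    activities = {a for e, a in was_generated_by if e == entity_id}
    return sorted(e for a, e in used if a in activities)


def progenitors(entity_id):
    """Prefer explicit WasDerivedFrom links; fall back to the candidates."""
    explicit = sorted(u for g, u in was_derived_from if g == entity_id)
    return explicit or candidate_progenitors(entity_id)
```

Following only \class{WasGeneratedBy} and \class{Used}, \verb|candidate_progenitors("ent:outA")| returns both input images; with the explicit \class{WasDerivedFrom} links, \verb|progenitors("ent:outA")| returns the single actual progenitor.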
282
283
284
285 \subsubsection{Collection}\label{sec:collection}
286 Collections are entities that are grouped together and can be treated as one single entity.
287 From the provenance point of view, they have to have the \emph{same origin}, i.e., they were
288 produced by the same activity (which could also be the activity of collecting
289 data for a publication or similar). The term ``collection'' is
290 also used in the Dataset Metadata Model for grouping datasets.
291 % (but with a slightly different meaning).
292 As an example, a collection
293 with the name `RAVE survey' could consist of a number of database tables and spectra files.
294
295 %\TODO{Do we allow empty collections? Or should collections always contain at least 1 member? (otherwise they are just prov:entities?)}
296
297 The Entity-Collection relation can be modeled using the \emph{Composite} design pattern:
298 Collection is a subclass of Entity, but also an aggregation of 1 to many entities,
299 which could be collections themselves.
300 In order to be compliant with VO-DML, we model the membership-relation explicitly
301 by including a \class{HadMember} class in our model, which is connected to the
302 \emph{Collection} class via a composition. It may contain an additional role attribute.
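
A minimal Python sketch of this Composite pattern, with an explicit \class{HadMember} membership class, might look as follows. The method names \verb|add| and \verb|leaves| and all identifiers are assumptions made for illustration; the model itself does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Entity:
    id: str
    name: str


@dataclass
class HadMember:
    entity: Entity
    role: Optional[str] = None  # optional role of the member


@dataclass
class Collection(Entity):
    # Composition: the membership records belong to the collection.
    members: List[HadMember] = field(default_factory=list)

    def add(self, entity: Entity, role: Optional[str] = None) -> None:
        self.members.append(HadMember(entity, role))

    def leaves(self) -> Iterator[Entity]:
        """Yield all non-collection entities, descending into sub-collections."""
        for member in self.members:
            if isinstance(member.entity, Collection):
                yield from member.entity.leaves()
            else:
                yield member.entity


# Hypothetical example: a survey made of a table and a nested collection.
spectra = Collection("col:spectra", "spectra files")
spectra.add(Entity("ent:s1", "spectrum 1"))
rave = Collection("col:rave", "RAVE survey")
rave.add(Entity("ent:t1", "catalog table"))
rave.add(spectra)
```

Because \verb|Collection| is itself an \verb|Entity|, it can take part in any relation an entity can, which is what allows a client to switch between levels of detail.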
303
304 Collections are also known in the W3C model, in the same sense as used here.
305 The relation between entity and collection is also called ``HadMember'' in the W3C model.
306
307 An additional class \class{CollectionDescription} is only
308 needed if it has different attributes than
309 the \class{EntityDescription}. This class should therefore only be introduced if a use case requires it.
310
311 \paragraph{Advantages of collections:} Collections can be used to collect entities with the same provenance information together,
312 in order to hide complexity where necessary. They can be used for defining
313 different levels of detail (granularity).
314
315 %\TODO{Find a really strong use case for Collections to convince everyone that they are useful/needed.}
316
317 \subsubsection{Activity and ActivityDescription}
318
319 \begin{figure}[h]
320 \centering
321 \includegraphics[scale=0.5]{../datamodel-diagrams/images/activity-details.pdf}
322 \caption{Details for Activity, ActivityDescription and ActivityFlow (see Section~\ref{sec:activityflow}).
323 }
324 \label{fig:activity-details}
325 \end{figure}
326
327 \begin{table}[h]
328
329 \small
330 \tymax 0.5\textwidth
331
332 \textbf{\normalsize Activity}\vspace{0.25em}\\
333 \begin{tabulary}{1.0\textwidth}{@{}lp{2.5cm}p{2cm}L@{}}
334 \toprule
335 \head{Attribute} & \head{W3C ProvDM} & \head{Data type} & \head{Description}\\
336 \midrule
337 \textbf{id} & prov:id & (qualified) string & a unique id for this activity (unique in its realm)\\
338 name & prov:label & string & a human-readable name (to be displayed by clients)\\
339 \textbf{startTime} & prov:startTime & datetime & start of an activity\\
340 \textbf{endTime} & prov:endTime & datetime & end of an activity\\
341 annotation & prov:description & string & additional explanations for the specific activity instance\\
342 %description\_ref & & foreign key/url & link to \class{ActivityDescription}\\
343 \bottomrule
344 \end{tabulary}
345 \caption{Attributes of \class{Activity}, their data types and equivalents in the W3C Provenance
346 Data Model, if existing. Attributes in bold are \textbf{mandatory}.}
347 \end{table}
348
349
350 \begin{table}[ht]
351 \small
352 \tymax 0.5\textwidth
353 \textbf{\normalsize ActivityDescription}\vspace{0.25em}\\
354 \begin{tabulary}{1.0\textwidth}{@{}p{0cm}p{2.5cm}lL@{}}
355 \toprule
356 \head{Attribute} & \head{} & \head{Data type} & \head{Description}\\
357 \midrule
358 \textbf{id} & & string & a unique id for this activity description (unique in its realm)\\
359 name & & string & a human-readable name (to be displayed by clients)\\
360 type & & string & type of the activity, from a vocabulary or list, e.g. data acquisition (observation or simulation), reduction, calibration, publication\\
361 subtype & & string & more specific subtype of the activity\\
362 annotation & & string & additional free text description for the activity\\
363 %code & & string & the code used for this process\\
364 %version & & string & a version number for the code\\
365 doculink & & url & link to further documentation on this process, e.g. a
366 paper, the source code in a version control system etc.\\
367 \bottomrule
368 \end{tabulary}
369 \caption{Attributes of \class{ActivityDescription}.}
370 \end{table}
371
372
373 Activities in astronomy include all steps from obtaining data to the reduction of
374 images and production of new datasets, like image calibration, bias subtraction, image stacking;
375 light curve generation from a number of observations, radial velocity
376 determination from spectra, post-processing steps of simulations etc.
377
378 \paragraph{ActivityDescription.}
379 The method underlying an activity can be specified by a corresponding
380 \class{ActivityDescription} class (previously named \class{Method}, corresponds
381 to the \class{Protocol} class in SimDM). This could be,
382 for instance, the name of the code used to perform an activity or a more general
383 description of the underlying algorithm or process. An activity is then a
384 concrete case (instance) of using such a method, with a startTime and endTime,
385 and it refers to a corresponding description for further information.
386
387 There MUST be at most one (i.e. zero or one) \class{ActivityDescription} per \class{Activity}. If steps from a
388 pipeline are to be grouped together, one needs to create a proper
389 \class{ActivityDescription} for describing all the steps at once. This method can then
390 be referred to by the pipeline-activity.
391
392 When serializing the data model, the attributes
393 of the description class may be assigned to the activity in order to produce
394 a W3C compliant serialization (same as with Entity/EntityDescription).
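
One possible way to perform this flattening is sketched below in Python. The output keys for the description attributes (the \verb|desc:| prefix) are purely hypothetical, since this section does not define a serialization vocabulary for them; only \verb|prov:id|, \verb|prov:startTime| and \verb|prov:endTime| are taken from the attribute table above.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ActivityDescription:
    id: str
    name: str
    type: Optional[str] = None
    doculink: Optional[str] = None


@dataclass
class Activity:
    id: str
    startTime: datetime
    endTime: datetime
    description: Optional[ActivityDescription] = None  # zero or one


def flatten(activity: Activity) -> dict:
    """Return one flat record, copying description attributes onto the activity."""
    flat = {
        "prov:id": activity.id,
        "prov:startTime": activity.startTime.isoformat(),
        "prov:endTime": activity.endTime.isoformat(),
    }
    if activity.description is not None:
        for key, value in asdict(activity.description).items():
            if key != "id" and value is not None:
                flat["desc:" + key] = value  # hypothetical key prefix
    return flat


# Hypothetical instance: one concrete run of a reusable method.
stacking = ActivityDescription("desc:stack", "image stacking", type="reduction")
run = Activity("act:stack-1",
               datetime(2017, 9, 11, 14, 0),
               datetime(2017, 9, 11, 14, 30),
               description=stacking)
record = flatten(run)
```

The flattened record repeats the description attributes for every activity instance, which trades the normalization of the model for a simpler, self-contained serialization.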
395
396
397 \paragraph{WasInformedBy.}
398 The individual steps of a pipeline can be chained
399 together directly, without mentioning the intermediate datasets, using the \class{WasInformedBy}-relation.
400 This relation can be used as a shortcut if the exchanged datasets are deemed not important
401 enough to be recorded. For grouping activities, see also
402 Section~\ref{sec:activityflow}.
403
404
405 \subsubsection{ActivityFlow}\label{sec:activityflow}
406 \TODO{Link to D-PROV!}
407 For facilitating grouping of activities (and their related entities etc.)
408 we introduce the class \class{ActivityFlow}.
409 It can be used for hiding and grouping a part of the workflow/pipeline
410 or provenance
411 description, if different levels of granularity are needed. Such pipelines and workflows are very common in astronomical data production and processing. Figure \ref{fig:provgraph-activityflow}
412 illustrates an example provenance graph in a detailed level (left side)
413 and using the ActivityFlow (right side).
414
415
416 \begin{figure}[h]
417 \centering
418 \includegraphics[width=1\textwidth]{../datamodel-diagrams/images/provgraph-activityflow}
419 \caption{An example provenance graph. The detailed version is shown on the left side. It also shows
420 the shortcut \class{WasInformedBy} to connect two activities, which could be used if the entity e2
421 were not needed anywhere else.
422 An ActivityFlow can be used to ``hide'' a part of the provenance graph as is shown on the right side.
423 Activities are marked by blue rectangles, entities by yellow ellipses.}
424 \label{fig:provgraph-activityflow}
425 \end{figure}
426
427 We also explored the different ways to describe a set of activities in the W3C
428 provenance model. This model uses \class{Bundle}, i.e. an entity with type ``Bundle'',
429 for wrapping a provenance description. Each part of a provenance description can be
430 put into a bundle, and the bundle can then be reused in other provenance descriptions.
431 W3C's \class{Plan} is an entity with type ``Plan'' and is used for describing a
432 set of actions or steps. Both \class{Bundle} and \class{Plan} are entities and
433 have the attributes and relations of this class (and thus one can define provenance of bundles and plans as well).
434
435 But we would like to consider a set of activities as being an \class{Activity} itself,
436 with all the relations and properties that an activity also has. Therefore we do not reuse
437 W3C's classes for describing workflows and plans, but add
438 the class \class{ActivityFlow} as an activity composed of activities. The composition is represented by
439 the ``hadStep'' relation, as is shown in Figure~\ref{fig:activity-details}.
440
441 %while still making it obvious that this
442 %group contains activities, we introduce the class \class{ActivityFlow}.
443 %This can be used for describing workflows or pipelines, or for
444 %
445 %We also allow ActivityCollections to consist of a whole provenance graph of
446 %activities and entities being linked together.
447
448
449 %We could introduce an additional abstract class, e.g. \class{AbstractActivity}, with \class{Activity} and
450 %\class{ActivityFlow} being subclasses to this one. But this adds another layer of complexity
451 %that we may not want in this data model.
452
453 %Since we introduced \class{ActivityFlow} mainly for having different view levels,
454 %we may want to add an attribute \emph{viewLevel} to descriptions of activityflows.
455 % But where to set the 0 point for viewLevel???
456
457 \begin{figure}[h]
458 \centering
459 \includegraphics[scale=0.6]{../datamodel-diagrams/images/entity-activity-relations.pdf}
460 \hspace{0.15\textwidth}
461 \includegraphics[scale=0.6]{../datamodel-diagrams/images/entity-activity-relations-nodesc.pdf}
462 \caption{\class{Entity} and \class{Activity} are linked via the \class{Used} and \class{WasGeneratedBy} relations. In the left image, the \emph{role} played by an entity that was used or generated by an activity is recorded with the corresponding \emph{UsedDescription} and \emph{WasGeneratedByDescription}; see also Section~\ref{sec:entity-roles}. If these description classes are not used, the \emph{role} can be used directly as an attribute within the \emph{Used} and \emph{WasGeneratedBy} classes (right image).}
463 \label{fig:entity-activity-relations}
464 \end{figure}
465
466
467 \subsubsection{Entity-Activity relations}\label{sec:entity-activity-relations}
468
469 For each data flow it should be possible to clearly identify entities and
470 activities.
471 %If the activities shall not be recorded explicitely, one could also
472 %use the \emph{Derivation}-relation as suggested in the W3C Provenance Data Model
473 %to link derived entities to their originals.
474 Each entity is usually a result from an activity, expressed by a link from
475 the entity to its generating activity using the \class{WasGeneratedBy} relation,
476 and can be used as input for (many) other activities, expressed by the \class{Used} relation.
477 Thus the information on whether data is used as input or was produced as output of
478 some activity is given by the \emph{relation-types} between activities and entities.
479 %In fact,
480 %it would be enough to provide this information just for the relations on the description side (right).
481 % -- Is this true?
482
483 We use two relations, \class{Used} and \class{WasGeneratedBy}, instead of just one
484 mapping class with a flag for input/output, because their descriptions and role-attributes
485 can be different.
486 %in order to model the different
487 %multiplicities explicitely: an entity always has only one (or none)
488 %\class{WasGeneratedBy} relation, but may be \class{Used} many times as input for
489 %different activities.
490
491 The \class{WasGeneratedBy}-relation can have the optional attribute \emph{time}, which is the time at which
492 the generation of the entity is finished. This generation time corresponds to e.g. \emph{DataID.date} in
493 the Dataset Metadata Model.
494 %It therefore corresponds to the \emph{created}-time used in
495 %the Simulation Data Model (SimDM).
496
497 \paragraph{Compositions and multiplicities}
498 In principle, an entity is produced by just one activity.
499 However, by introducing the \class{ActivityFlow} class for grouping activities together,
500 one entity can now have many wasGeneratedBy-links to activities. One of them must
501 be the actual generation activity, the other activities can only be activityFlows
502 containing this generation-activity. This restriction of having only one ``true'' generation activity is not explicitly expressed in the current model\footnote{The reason for this is that we want to keep the model simple and avoid introducing even more classes.}.
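
Since the restriction is not expressed in the model, an implementation may want to check it separately. The following Python sketch shows one way to do so, assuming that the actual generation activity is a plain \class{Activity} and that an \class{ActivityFlow} lists its steps via hadStep; the function name \verb|check_generation| and the identifiers are invented for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Activity:
    id: str


@dataclass
class ActivityFlow(Activity):
    steps: List[Activity] = field(default_factory=list)  # hadStep

    def contains(self, activity: Activity) -> bool:
        """True if `activity` is a step of this flow, directly or nested."""
        return any(
            step is activity
            or (isinstance(step, ActivityFlow) and step.contains(activity))
            for step in self.steps
        )


def check_generation(generators: List[Activity]) -> bool:
    """Check the wasGeneratedBy targets of a single entity.

    Assumption: exactly one target is a plain Activity (the actual
    generation activity) and every other target is an ActivityFlow
    that contains it.
    """
    plain = [a for a in generators if not isinstance(a, ActivityFlow)]
    if len(plain) != 1:
        return False
    flows = [a for a in generators if isinstance(a, ActivityFlow)]
    return all(flow.contains(plain[0]) for flow in flows)


# Hypothetical example.
stack = Activity("act:stack")
other = Activity("act:other")
pipeline = ActivityFlow("flow:pipeline", [stack])
```

Here \verb|check_generation([stack, pipeline])| succeeds, whereas \verb|check_generation([stack, other])| and \verb|check_generation([other, pipeline])| fail.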
503
504
505 The \emph{Used} relation is closely coupled to the \emph{Activity}, so we use a composition here, indicated
506 in Figure~\ref{fig:classdiagram} by a filled diamond:
507 if an activity is deleted, then the corresponding used relations need to be removed as well.
508 The entities that were used still remain, since they may have been used for other activities as well.
509 We need a multiplicity * between \emph{Used} and \emph{Entity}, because an entity can be used more than once
510 (by different activities).
511
512 Similarly, the \emph{WasGeneratedBy} relation is closely coupled with the \emph{Entity} via a composition,
513 since a wasGeneratedBy relation makes no sense without its entity. So if an entity is deleted,
514 then its wasGeneratedBy relation must be deleted as well. There is a multiplicity * between \emph{Activity}
515 and \emph{WasGeneratedBy}, because an activity can generate many entities.
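
The deletion behaviour implied by the two compositions can be sketched with a small in-memory store in Python. The class \verb|ProvStore| and its method names are hypothetical. The sketch implements only the two cascades described above and deliberately leaves all other relations untouched, because the text does not specify what should happen to them.

```python
class ProvStore:
    """Toy in-memory store; relations are kept as id tuples."""

    def __init__(self):
        self.entities = set()
        self.activities = set()
        self.used = []              # (activity_id, entity_id)
        self.was_generated_by = []  # (entity_id, activity_id)

    def delete_activity(self, activity_id):
        # Used is composed into Activity: its relations go with it,
        # while the entities that were used remain.
        self.activities.discard(activity_id)
        self.used = [(a, e) for a, e in self.used if a != activity_id]

    def delete_entity(self, entity_id):
        # WasGeneratedBy is composed into Entity: its relations go with it.
        self.entities.discard(entity_id)
        self.was_generated_by = [
            (e, a) for e, a in self.was_generated_by if e != entity_id
        ]


# Hypothetical example data.
store = ProvStore()
store.entities = {"ent:raw", "ent:calibrated"}
store.activities = {"act:calibration", "act:stacking"}
store.used = [("act:calibration", "ent:raw"), ("act:stacking", "ent:raw")]
store.was_generated_by = [("ent:calibrated", "act:calibration")]

store.delete_activity("act:calibration")
```

After the call, the raw image is still present and still used by the stacking activity; only the \emph{Used} relation owned by the deleted activity has been removed.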
516
517
518 \paragraph{Entity roles}\label{sec:entity-roles}
519 Each activity requires specific roles for each input or output entity, thus
520 we store this information with description classes, in the role-attributes of
521 the \class{UsedDescription} and \class{WasGeneratedByDescription} relations.
522 For example, an activity for darkframe-subtraction requires two input images. But it is
523 very important to know which of the images is the raw image and
524 which one fulfils the role of dark frame.
525
526 The role is in general \emph{not} an attribute of \class{EntityDescription} or \class{Entity},
527 since the same entity (e.g. a specific FITS file containing an image) may play
528 different roles in different activities. Only if an entity can play
529 one and the same role everywhere is that role an intrinsic
530 property of the entity, which should then be stored in the \class{EntityDescription}.
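
The following Python sketch illustrates why the role belongs to the relation rather than to the entity. It is a simplified illustration, not a prescribed implementation: the role is attached directly to a flattened, hypothetical \texttt{Used} record instead of a separate \class{UsedDescription}, and the activity and file names are invented. The role names are borrowed from Table~\ref{tab:entity-roles}.

\begin{verbatim}
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Used:
    """Hypothetical flattened Used relation carrying the role directly."""
    activity: str
    entity: str
    role: str


relations = [
    Used("darkframe-subtraction", "img_001.fits", "main input"),
    Used("darkframe-subtraction", "img_002.fits", "auxiliary input"),
    # The same file plays a different role in another (invented) activity.
    Used("dark-quality-check", "img_002.fits", "main input"),
]


def roles_of(entity, relations):
    """Return a mapping activity -> role for one entity."""
    return {r.activity: r.role for r in relations if r.entity == entity}


print(roles_of("img_002.fits", relations))
```
\end{verbatim}

Here \texttt{img\_002.fits} is an auxiliary input of one activity and the main input of another, which a single role attribute on the entity could not express.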
531
532 %Additionally, input (and also output) data can take different roles in an
533 %activity. For example, one file could
534 %be a parameter file, another one is the raw image, and the third one is the
535 %dark field that should be subtracted. Since these roles are very important,
536 %it must be made explicit which data component needs to fulfill which role as
537 %input in or output from an activity.
538 %Each activity requires specific roles for each input or output entity, thus
539 %we store this information on the description side, in the role-attributes for
540 %the \class{UsedDescription} and \class{WasGeneratedByDescription} relation.
541
542 %In W3C, this is partially solved by adding a derivation relation between the Entities (data). Here, we have a mapping-class between Activity and DataEntities as well as between ActivityDescription and DataDescription. The mapping-class at the description side, i.e. between the ActivityDescription and its DataEntityDescriptions, contains additionally a role for each relation, e.g. parameter, dark frame, raw image, etc. If a dataset is used as input to an activity or if it results from it, will become clear with these roles.
543
544
545 Some example roles are given in Table~\ref{tab:entity-roles}.
546 Note that these roles do not have to be unique: many datasets may play the same role in
547 a process. For example, many image entities may be used as science-ready images for an
548 image stacking process.
549
550 \begin{table}[h]
551 \small
552 \begin{tabulary}{1.0\textwidth}{@{}lL@{}}
553 \toprule
554 \head{Role} & \head{Example entities}\\
555 \midrule
556 configuration & configuration file \\ %& used for entities that contain configuration details for an activity\\
557 auxiliary input & calibration image, dark frame, etc. \\%& \\
558 main input & raw image, science-ready images \\%& used for entities that are the main input for an activity\\
559 main result & image, cube or spectrum \\%& used for entities that are the main result of an activity\\
560 log & logging output file \\%& used for logging output \\
561 red & image used for red channel of a composite activity\\%& used for images that will be used as the red channel of a composite activity\\
562 \bottomrule
563 \end{tabulary}
564 \caption{Examples of entity roles as attributes in the
565 \class{UsedDescription} and \class{WasGeneratedByDescription}.}
566 \label{tab:entity-roles}
567 \end{table}
568 % here we cross some notions encountered in parameter descriptions and Activity descriptions while describing parameters
569
570 In order to facilitate interoperability, the possible
571 entity-roles could be defined and described for each activity by the IVOA community, in a
572 vocabulary list or thesaurus.
573 % TODO!!
574
575
576 %\TODO{Roles can be used for checking (validation) if processes use the correct type of entities,
577 %e.g. check if entity-type matches used-role!}
578
579 %Without the mapping tables, the relation between \class{Activity}
580 %(\class{ActivityDescription}) and \class{Entity} (\class{EntityDescription})
581 %would be an aggregation relation, or in other words: an association with the
582 %aggregation kind ``shared''. That would be required to ensure that all
583 %entities linked to an activity (either as input or output) will survive if
584 %the activity is destroyed, since they are almost always shared with other
585 %activities.
586 %
587 %By using the mapping tables we make the role of an entity in an activity more
588 %explicit and thus can replace the aggregation by a composition relation to the
589 %\class{Activity}/\class{ActivityDescription} and simple associations to the
590 %individual data components and their descriptions.
591
592
593 % The derivation relation together with entities is already enough to produce a
594 % Data flow view, but in astronomy we are probably even more interested in the
595 % Processes (as discussed in our first draft for requirements for provenance).
596
597 %\TODO{Add an example here! (From discussions in Heidelberg.)}
598
599
600
601 \subsubsection{Parameters}\label{sec:parameters}
602
603 The concept of activity configuration, generally a set of parameters that can be configured, differs from the concept of provenance information, but the two are tightly connected. We identify three different ways to link configuration information to an activity:
604 \begin{itemize}
605 \item Declare a parameter set (or each parameter) as an input entity that is used by the activity. \\
606 This also allows tracking the provenance of the parameter further.
607 \item Define families of activities, each one with fixed attributes.\\
608 I.e. use different subclasses for activities with different fixed attributes.
609 \item Add activity attributes in the form of key-value parameters.
610 \end{itemize}
611
612 To enable the latter solution, we add a \class{Parameter} class along with a \class{ParameterDescription} for describing additional properties of activities. In this solution, Parameters are directly connected to an Activity without complex Entity-Activity relations. Moreover, we can then describe each parameter in the same way as in FIELD and PARAM elements in VOTable \citep{std:VOTable}.
613
614
615 \begin{table}[h]
616 \small
617 \tymax 0.5\textwidth
618 \textbf{\normalsize Parameter}\vspace{0.25em}\\
619 \begin{tabulary}{1.0\textwidth}{@{}p{0cm}p{2.5cm}lL@{}}
620 \toprule
621 \head{Attribute} & \head{} & \head{Data type} & \head{Description}\\
622 \midrule
623 \textbf{id} & & string & parameter unique identifier\\
624 %description\_ref & & foreign key/url & link to \emph{ParameterDescription}\\
625 %name & & string & parameter name, if no link to ParameterDescription is given\\
626 \textbf{value} & & (value dependent) & the value of the parameter\\
627 \bottomrule
628 \end{tabulary}
629 \caption{Attributes of \class{Parameter}. Attributes in bold are \textbf{mandatory}.}
630 \end{table}
631
632 \begin{table}[ht]
633 \small
634 \tymax 0.5\textwidth
635 \textbf{\normalsize ParameterDescription}\vspace{0.25em}\\
636 \begin{tabulary}{1.0\textwidth}{@{}p{0cm}p{2.5cm}lL@{}}
637 \toprule
638 \head{Attribute} & \head{} & \head{Data type} & \head{Description}\\
639 \midrule
640 \textbf{id} & & string & parameter unique identifier\\
641 \textbf{name} & & string & parameter name\\
642 annotation & & string & additional free text description\\
643 datatype & & string & datatype \\
644 unit & & string & physical unit \\
645 ucd & & string & Unified Content Descriptor, supplying a standardized classification of the physical quantity\\
646 utype & & string & UType, meant to express the role of the parameter in the context of an external data model \\
647 min & & number & minimum value \\
648 max & & number & maximum value\\
649 options & & list & list of accepted values\\
650 \bottomrule
651 \end{tabulary}
652 \caption{Attributes of \class{ParameterDescription}.}
653 \end{table}
654
655 For example, observations generally require information on \emph{ambient conditions} as well as
656 \emph{instrument characteristics}. This contextual data associated with an observation is not directly modelled in the ProvenanceDM. However, this information can be stored as different entities. Alternatively, one could list the instrument characteristics as a set of key-value parameters using the \class{Parameter} class, so that this information is structured and stored with the provenance information (and can thus be queried simultaneously). In the case of a processing activity that cleans an image with a sigma-clipping method, the input and output images would be entities and the value of the number of sigma for sigma-clipping could be a parameter instead of an entity. We may also want to define a 3-sigma-clipping activity where this parameter is fixed to 3.
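
As an illustration of the key-value approach, the sigma-clipping parameter could be represented as in the Python sketch below. The attribute names follow the tables for \class{Parameter} and \class{ParameterDescription}; everything else is an assumption made for this example: the direct \texttt{description} link, the identifiers, the chosen limits, and the \texttt{accepts} check, which is only one possible way of using \emph{min}, \emph{max} and \emph{options}.

\begin{verbatim}
```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class ParameterDescription:
    id: str
    name: str
    datatype: Optional[str] = None
    unit: Optional[str] = None
    min: Optional[float] = None
    max: Optional[float] = None
    options: Optional[Sequence[Any]] = None

    def accepts(self, value):
        """Assumed check: value respects min, max and options if given."""
        if self.min is not None and value < self.min:
            return False
        if self.max is not None and value > self.max:
            return False
        if self.options is not None and value not in self.options:
            return False
        return True


@dataclass
class Parameter:
    id: str
    value: Any
    description: ParameterDescription  # assumed link, not in the table


nsigma_desc = ParameterDescription(
    id="pd:nsigma", name="nsigma", datatype="float", min=0.0, max=10.0
)
nsigma = Parameter(id="p:nsigma-1", value=3.0, description=nsigma_desc)
print(nsigma.description.accepts(nsigma.value))
```
\end{verbatim}

A dedicated 3-sigma-clipping activity would correspond to fixing \texttt{value} to 3 in the description of that activity instead of recording it per execution.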
657
658
659 %For example for observations, the \emph{ambient conditions} as well as
660 %\emph{instrument characteristics} need to be stored. But they can both be treated
661 %as additional entities as well.
662 %Our model can then also take into account that a certain observation
663 %method requires special ambient conditions, already defined via the
664 %ActivityDescription (e.g. radio observations rely on different ambient
665 %conditions than observations
666 %of gamma rays), just following our data -- data description scheme.
667 %Ambient conditions are recorded for a certain time (startTime, endTime) and are
668 %usually only valid for a certain time interval. This time interval should be recorded
669 %with a \emph{validity}-attribute for such entities.
670 %
671 %In contrast to ambient conditions, instrument characteristics do (usually) not
672 %change from one observation to the other, so they are static, strictly related to
673 %the instrument.
674 %All the characteristics could be described either as key-value pairs directly with the
675 %observation (as attributes) or just as datasets, using the \class{Entity} class.
676 %One would then
677 %link the instrument characteristics as a type of input (or output?) dataset to a certain
678 %observation activity. Thus we don't need a separate Instrument or Device class.
679
680 %\note{One should also keep in mind that some instrument related parameters can change within time,
681 %e.g. the CCD temperature. The instruments can also change within time because of aging.}
682
683
684
685 \subsubsection{Agent}\label{sec:w3c-agent}
686
687 An \class{Agent} describes someone who is responsible for a certain task or
688 entity, e.g. who pressed a button,
689 ran a script, performed the observation or published a dataset.
690 The agent can be a single person, a group of persons (e.g. MUSE WISE Team), a
691 project (CTA) or an institute.
692 This is also reflected in the IVOA Dataset Metadata Model, where \class{Party}
693 represents an agent, and it has two types: \class{Individual} and \class{Organization},
694 which are explained in more detail in Table \ref{tab:agent-types} (also see Section~\ref{sec:dmlinks} for comparison between \class{Agent} and \class{Party}).
695 Both agent types are also used in the W3C Provenance Data Model, though
696 \class{Individual} is called \class{Person} there.
697 We decided not to include the type \class{SoftwareAgent} from W3C (yet), since it is not required for our current use cases. This may change in the future.
698
699 \begin{table}[h]
700 \small
701 \tymax 0.5\textwidth
702 \begin{center}
703 \begin{tabulary}{1.0\textwidth}{@{}lllL@{}}
704 \multicolumn{4}{c}{\textbf{AgentType}}\\
705 \toprule
706 \head{Class or type} & \head{W3C ProvDM} & \head{DatasetDM} &\head{Comment} \\
707 \midrule
708 Agent & Agent & Party & \\
709 Individual & Person & Individual & a person, specified by name, email and address
710 (though all of these may change over time)\\
711 Organization & Organization & Organization & a publishing house, institute or scientific project\\
712
713
714 \bottomrule
715 \end{tabulary}
716 \caption{Agent class and types of agents/subclasses in this data model, compared to W3C ProvDM and DatasetDM.}
717 \label{tab:agent-types}
718 \end{center}
719 \end{table}
720
721 \begin{table}[h]
722 \small
723 \tymax 0.5\textwidth
724 \begin{center}
725 \begin{tabulary}{1.0\textwidth}{@{}llp{2cm}L@{}}
726 \multicolumn{4}{c}{\textbf{Agent}}\\
727 \toprule
728 \head{Attribute} & \head{W3C ProvDM} & \head{Data type} & \head{Description}\\
729 \midrule
730 \textbf{id} & prov:id & (qualified) string & unique identifier for an agent\\
731 \textbf{name} & prov:name & string & a common name for this agent; e.g. first name and last name; project name, agency name...\\
732 type & prov:type & string & type of the agent: either Individual (Person) or Organization\\
733 % insert here the attributes dedicated to contact for a Party in DataSet Metadata DM.
734 % \hline
735 % \multicolumn{4}{l}{Additional optional attributes from Dataset.Party subclasses:}\\
736 % \hline
737 % address & & string & Address of the agent both for Individual (Person) and Organization\\
738 % phone & & string & Contact phone number of the agent both for Individual (Person) and Organization\\
739 % email & & string & Contact email of the agent both for Individual (Person) and Organization\\
740 \bottomrule
741 \end{tabulary}
742 \caption{Agent attributes}
743 \label{tab:agent-attributes}
744 \end{center}
745 \end{table}
746
747
748
749 A definition of organizations is given in the
750 IVOA Recommendation on Resource Metadata \citep{std:ResourceMeta}, hereafter
751 referred to as RM: ``An organisation is [a] specific type of resource that
752 brings people together to pursue participation in VO applications.''
753 It also specifies further that scientific projects can be considered
754 as organisations on a finer level:
755 ``At a high level, an organisation could be a university, observatory, or government
756 agency. At a finer level, it could be a specific scientific project, space mission,
757 or individual researcher. A provider is an organisation that makes data and/or services
758 available to users over the network.''
759
760
761
762 For each agent a \emph{name} should be specified; a summary of the attributes of \class{Agent} is given in Table~\ref{tab:agent-attributes}.
763 One could also add the optional attributes \emph{address}, \emph{phone} and \emph{email} (compare with subclasses of \emph{Party} in Section~\ref{sec:dmlinks}). However, we skip them here in this main class, since an advanced system may use permanent identifiers (e.g. ORCIDs) to identify agents and retrieve their properties from an external system.
764 It would also increase the value of the given
765 information if the (current) affiliation of the agent (and a project leader/group
766 leader) were specified in order to maximize the chance of finding any contact
767 person later on.
768 The contact information is needed in case more information about a certain step in the past of a dataset is required,
769 but also in order
770 to know who was involved and to fulfill our ``Attribution'' requirement
771 (Section~\ref{sec:requirements}), so that proper credits are given to the right
772 people/projects.
773
774
775
776 Ideally, at least one agent is given for each activity (and entity), but this
777 is not enforced.
778 % , hence the multiplicity between \class{Entity}/\class{Activity} and the relations
779 %to the \class{Agent} starts with 0.
780 There can also be more than one agent for each activity/entity with different \emph{roles}
781 and one agent can be responsible for more than one activity or entity. This
782 many-to-many relationship is made explicit in our model by adding the two
783 following relation classes:
784
785 \begin{itemize}
786 \item wasAssociatedWith: relates an \emph{activity} to an agent
787 \item wasAttributedTo: relates an \emph{entity} to an agent
788 \end{itemize}
789
790 We adopt here the same naming scheme as W3C ProvDM.
791 Note that the agent to which a dataset is attributed may differ from the
792 agent that is associated with the activity that created it.
793 Someone who is performing a task is not necessarily given full attribution,
794 especially if they act on behalf of someone else (the project, university, ...).
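
The distinction between the two relations can be illustrated with the following Python sketch. It is a minimal example under stated assumptions: the agents, identifiers and role values are invented, and storing the role as an attribute of the relation is an assumption of this sketch.

\begin{verbatim}
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    id: str
    name: str
    type: str  # "Individual" or "Organization"


@dataclass(frozen=True)
class WasAssociatedWith:
    activity: str
    agent: Agent
    role: str


@dataclass(frozen=True)
class WasAttributedTo:
    entity: str
    agent: Agent
    role: str


observer = Agent("ag:jdoe", "Jane Doe", "Individual")
project = Agent("ag:project-x", "Project X", "Organization")

# The observer performs the activity, the project is credited for the result.
associations = [WasAssociatedWith("act:observation-1", observer, "observer")]
attributions = [WasAttributedTo("ent:image-1", project, "publisher")]


def agents_for_entity(entity, attributions):
    """Return (agent name, role) pairs attributed to one entity."""
    return [(a.agent.name, a.role) for a in attributions if a.entity == entity]


print(agents_for_entity("ent:image-1", attributions))
```
\end{verbatim}

In this sketch the image is attributed to the project, although the associated agent of the generating activity is the individual observer.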
795
796
797 To clarify what an agent is useful for, we suggest
798 possible roles an agent can have (along with descriptions partially taken from RM)
799 in Table~\ref{tab:agent-roles}.
800 For comparison, SimDM contains the following roles for its contacts:
801 owner, creator, publisher and contributor. Note that the \emph{Party} classes in DatasetDM and SimDM are very similar to the \emph{Agent} class, which is explained in more detail in Section~\ref{sec:dmlinks}.
802
803
804 \begin{table}[h]
805 \small
806 \tymax 0.5\textwidth
807 \begin{center}
808 \begin{tabulary}{1.0\textwidth}{@{}lp{3cm}L@{}}
809 \multicolumn{3}{c}{\textbf{AgentRoles}}\\
810 \toprule
811 \head{role} & \head{type or sub class} & \head{Comment} \\
812 \midrule
813 author & Individual & someone who wrote an article, software, proposal\\
814 contributor & Individual & someone who contributed to something (but not enough to gain authorship)\\
815 editor & Individual & editor of e.g. an article, before publishing\\
816 creator & Individual & someone who created a dataset; creators of articles or software are rather called ``author''\\
817 curator & Individual & someone who checked and corrected a dataset before publishing\\
818 publisher & Organization {(maybe also Individual?)}& organization (publishing house, institute) that published something\\
819 observer & Individual & observer at the telescope\\
820 operator & Individual & someone performing a given task \\ % removed executor: ambiguous
821 coordinator/PI & Individual & someone coordinating/leading a project\\ % we should choose one word : PI?
822 funder & Organization & agency or sponsor for a project as in Prov-N\\
823 provider & Organization & ``an organization that makes data and/or services available to users over the network'' (definition from RM)\\
824 %(owner) & voprov:Individual or voprov:Organization & Does anyone really own the data?\\
825 \bottomrule
826 \end{tabulary}
827 \caption{Examples for roles of agents and the typical type of that agent}
828 \label{tab:agent-roles}
829 \end{center}
830 \end{table}
831
832 %\TODO{\textbf{Mireille + Fran\c{c}ois}: Go through these roles, pick only the necessary ones, crosscheck with other data models.}
833
834 This list is \emph{not} complete. We consider providing a vocabulary list for this
835 in a future version of this model, collected from (future) implementations of this model.
836
837 %\TODO{Do we have a specific use case for fixing the agent-roles? Is anyone
838 %going to search for specific roles in the Provenance meta-data?
839 %Or shall we leave it open, which roles can be defined and just give examples here?}
840 % ... Yes, just give examples here. Should have a vocabulary list somewhere ...
841
842 %\subsubsection{Shortcuts: WasDerivedFrom and WasInformedBy}\label{sec:shortcuts}
843 %The classes \class{WasDerivedFrom} and \class{WasInformedBy} can be used as ``shortcuts'' and
844 %are used in the same way as the corresponding W3C classes.
845
846 %\class{WasDerivedFrom} defines the relation that links two entities together, if one entity was derived
847 %from the other entity. In principle, one can find this information also by tracing the
848 %history of an entity backwards to the generating activity and its input entities.
849 %The descriptions for activity, entity and their relations should provide enough
850 %information to find the progenitor entity from which an entity was derived.
851 %Nevertheless, we include \class{WasDerivedFrom} for those cases where an explicite
852 %link between an entity and its progenitor is useful (e.g. for speeding up searches for
853 %progenitors or if the activity in between is not important).
854
855 %The class \class{WasInformedBy} links two activities together without defining the
856 %intermediate entities that may have been exchanged. This is useful for e.g. pipelines,
857 %if the intermediate entities don't play a major role or only exist temporarily, so that
858 %their provenance information is not deemed to be important enough to be recorded.
859 %``WasInformedBy'' relation (also called ``Communication'' relation, borrowed from W3C's model)
860
861
862 %\subsection{Implementation hints}
