% registry.tex (/trunk/projects/registry/acregistry/registry.tex), revision 2512
% Tue Apr 8 10:25:53 2014 UTC, by volute@g-vo.org
% Log message: A&C registry: incremental harvesting whining.

%\documentclass[preprint,authoryear,12pt]{elsarticle}
\documentclass[final,authoryear,5p,times]{elsarticle}

\usepackage{graphicx}
\usepackage{amssymb}
\biboptions{}

% remove the following two lines before submission
\usepackage[usenames]{color}
\newcommand{\redcomment}[1]{{\color{red}#1}}

\journal{Astronomy \& Computing}

\begin{document}

\begin{frontmatter}

%% Title, authors and addresses

%% use the tnoteref command within \title for footnotes;
%% use the tnotetext command for the associated footnote;
%% use the fnref command within \author or \address for footnotes;
%% use the fntext command for the associated footnote;
%% use the corref command within \author for corresponding author footnotes;
%% use the cortext command for the associated footnote;
%% use the ead command for the email address,
%% and the form \ead[url] for the home page:
%%
%% \title{Title\tnoteref{label1}}
%% \tnotetext[label1]{}
%% \author{Name\corref{cor1}\fnref{label2}}
%% \ead{email address}
%% \ead[url]{home page}
%% \fntext[label2]{}
%% \cortext[cor1]{}
%% \address{Address\fnref{label3}}
%% \fntext[label3]{}

\title{The Virtual Observatory Registry}

%% use optional labels to link authors explicitly to addresses:
%% \author[label1,label2]{<author name>}
%% \address[label1]{<address>}
%% \address[label2]{<address>}

\author[stsci]{Gretchen Greene}
\author[ari]{Markus Demleitner}
\author[obspm]{Pierre LeSidaner}
\author[ncsa]{Ray Plante}

\address[stsci]{Space Telescope Science Institute}
\address[obspm]{Observatoire de Paris FIXME}
\address[ncsa]{NCSA FIXME}
\address[ari]{Universit\"at Heidelberg,
  Astronomisches Rechen-Institut, M\"onchhofstra\ss e 12-14, 69120 Heidelberg}

\begin{abstract}
In the Virtual Observatory, the Registry provides the mechanism with which
users and applications can discover and select resources -- e.g., data
and services -- that are relevant for a particular scientific problem.
This article describes the design and operation of the system TBD.
\end{abstract}

\begin{keyword}
%% keywords here, in the form: keyword \sep keyword

%% MSC codes here, in the form: \MSC code \sep code
%% or \MSC[2008] code \sep code (2000 is the default)

\end{keyword}

\end{frontmatter}

% \linenumbers

\section{Introduction}
\label{sec:intro}

A Registry is a system within the IVOA which provides a directory for
published resources for data collections provided throughout the
astronomical community. There is a network of registries in the IVOA
which host collections, provide searching and publication of
collections, and follow standard protocols for exchanging information
between them. Each registry contains a very rich set of descriptive
metadata for the hosted collections, defined by a set of IVOA registry
standards for data services, astronomical resources, and various
standards described by astronomical data models. Registries are
critical repositories for the implemented standards developed by the
IVOA community. Science data discovery tools rely on registry access to
efficiently perform searches across the distributed archives to find
data provider services which are discoverable through standard
application program interfaces (APIs). A fully searchable registry also
provides a resource publication system for individual or organizational
data providers to publish metadata describing their data resources and
services. This metadata includes the use of standard IVOA protocols for
images, spectra, and catalogs, in addition to a wide range of more
general information. Another key aspect of registries is resource
curation. The IVOA registries are routinely validated and curated to
ensure consistency with IVOA standards. Users can publish, fix, and
update resource records; records and the services they describe are
validated against standards, and errors are actively reported to the
responsible parties.


\section{Resources and Resource Records}

The Virtual Observatory can be seen as a collection of \emph{resources}.
\citet{std:RMI} defines a VO resource as a ``VO element that can be
described in terms of who curates or maintains it and which can be given
a name and a unique identifier,'' and names sky coverages, instrumental
setups, organisations, or data collections as examples. In practice,
over 95\% of resources in the current VO are services.

From the outset, it was clear that a common way of describing these
resources would be required as a very basic building block for
interoperability. For instance, VO client programs need to be able to
find out what protocols a service supports and at what ``endpoints'' --
typically, HTTP URLs -- these are available, and scientists should have
reliable and standardized ways to work out who to reference, who to
report bugs to, and so on.

Fortunately, the VO did not have to develop the technology to support
such descriptions itself, as library sciences have worked on very
comparable problems for centuries already. \redcomment{Say something
about Dublin Core, \citet{std:RFC5013}, and introduce the \emph{resource
record}}

A somewhat subtle distinction resulting from these concepts is that a
resource is different from the resource record. To illustrate this, say
the data collection X contains spectra obtained using the spectrograph
S; the resource record R gives a brief description of X. Now,
during the lifetime of the instrument, S will add new data to X on every
clear night, which means the resource changes. Nevertheless, in the
current VO R will not change (though it is conceivable that it will be
updated now and then, e.g., to update rough numbers on the number of
datasets contained in X).

For the converse scenario of a changing resource record with a constant
resource, suppose S is now decommissioned, while the standard defining
the content of the resource record is updated to include the spatial
coverage of the data collection. Now, R needs an update without X
changing.

In the VO, the decision was made to hide the distinction between
resources and resource records to a large degree. In particular, the
identifiers in use in the VO, the IVORNs, always refer to both at the
same time. This has worked surprisingly well. The only place where the
distinction is actually relevant in the current VO architecture is in
the handling of times; both in OAI-PMH and in the resource element, the
concepts of creation and update times refer to the resource record, whereas
modification times of the resource itself can be communicated in the
curation/date child of resource records.

As to the content of the resource record, very early on in the history
of the VO the desired pieces of metadata were collected, which led up
to \citet{std:RMI}. This then was translated into an XML
representation in several standards, in particular VOResource
\citep{std:VOR} and VODataService \citep{std:VODS} as well as several
extensions for particular types of resources or services. This XML
representation is used in several service interfaces
\citep[e.g.,][]{std:VOSI} as well as for communication between registries.


\section{Registries}

Having a set of resource records alone is not enough to build a useful
system, even if they already are in a standard format. There must also
be ways in which users can locate records of the resources relevant to
them within this set. To enable this, there need to be systems in
place that allow service operators to feed their resource records into
the set and users to execute queries against them. Both requirements
are covered by \emph{registries} within the VO.

Given the VO's highly diverse and distributed structure, it is evident
that a distributed system is required, in which neither requirement
can be made the task of a single entity. Instead, every publisher can
run their own \emph{publishing registry}. This is a service exposing
the collection of this publisher's resource records.

Conceivably, a user looking for a resource matching some constraints
could now query each publisher's publishing registry in turn to obtain a
list of all matching VO resources. This architecture obviously will not
scale well with the number of publishers. It also introduces many
points of failure into the system, as all publishers would have to keep
their registries highly available to avoid a severe degradation of the
whole system.

Therefore, retrieving resource records from the publishing registries,
joining the sets of resource records thus obtained, and offering means
of querying this joined set to VO users is the task of a specialized
agent, the \emph{searchable registry}. The process of retrieval of
resource records by a searchable registry is known as \emph{harvesting}.
To allow this harvesting, publishing and searchable registries must
agree on a common protocol. Again, the VO could build on work done by
the bibliographic community, which had already defined
OAI-PMH\footnote{The Open Archives Initiative Protocol for
Metadata Harvesting, \citet{std:OAIPMH}}. The actual application of
OAI-PMH within the VO is described in \citet{std:RI1}, which in
particular defines that the VO's own resource record format is selected
in OAI-PMH using a metadata prefix of \texttt{ivo\_vor}. VO registries
are, however, also required to emit the much simpler Dublin Core metadata
records on request and are thus interoperable with bibliographic
services outside of the VO.
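
As a minimal sketch of what such a harvesting request can look like, the
following Python fragment assembles OAI-PMH \texttt{ListRecords} URLs.
The endpoint \texttt{http://registry.example.org/oai.xml} and the helper
\texttt{list\_records\_url} are made up for illustration; \texttt{verb},
\texttt{metadataPrefix}, and \texttt{from} are standard OAI-PMH request
arguments, and \texttt{oai\_dc} is the prefix OAI-PMH uses for Dublin
Core records.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of a publishing registry, for illustration only.
BASE_URL = "http://registry.example.org/oai.xml"


def list_records_url(base_url, metadata_prefix="ivo_vor", since=None):
    """Build an OAI-PMH ListRecords request URL.

    since, if given, is a UTC timestamp string passed as OAI-PMH's
    `from` argument, restricting the response to records changed
    at or after that time.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since is not None:
        params["from"] = since
    return base_url + "?" + urlencode(params)


# VO resource records in the VO's own format.
vo_records = list_records_url(BASE_URL)
# The same records as plain Dublin Core.
dublin_core = list_records_url(BASE_URL, metadata_prefix="oai_dc")
# Only records changed since a given time.
recent = list_records_url(BASE_URL, since="2014-04-07T00:00:00Z")
```

A real harvester would additionally have to follow OAI-PMH resumption
tokens to page through large responses, which this sketch leaves out.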


\section{Harvesting}

The VO registry system is de-centralized in both directions: A given
publishing registry does not know which searchable registries will
eventually carry its records. An implication of this is that it cannot
tell the searchable registry when a resource record changes. This, in
turn, implies that a searchable registry will have to poll the
publishing registries it harvests. The largest publishing registries in
the VO currently emit more than 100 megabytes of resource records, and
due to paging and other delays the transfer takes about 10 minutes.

On the other hand, to keep up to date, searchable registries should poll
the publishing registries with a fairly high frequency. Most active
searchable registries today poll once or twice a day. To nevertheless
keep network and CPU load low, OAI-PMH supports \emph{incremental
harvesting}. This allows searchable registries to query publishing
registries for records updated since some point in time.

A common harvesting strategy is that searchable registries memorize the
date and time of the last harvest and, on re-harvesting, query the
publishing registry for records updated since then. This strategy
unfortunately puts rather severe correctness requirements on the
publishing registries to prevent losing records due to race conditions.
To see why, consider the publishing registry P that contains the
resource record R, which is edited at $t_1$, with its updated attribute
set to $t_1$. At $t_2>t_1$, a searchable registry S harvests P and
memorizes $t_2$ as the date of the last harvest. P typically responds
from a database table containing metadata on the resource records. As
the row for R in that table has not been updated yet, the new version of
R is not harvested.

At $t_3>t_2$, R's metadata is ingested into the metadata table. The
updated timestamp is taken from the resource record, i.e., it is $t_1$.
Now, when S comes back for an incremental harvest at $t_4>t_3$, it will
ask for records updated after $t_2$, which, as $t_1<t_2$, R is not.
Hence, S will miss the update and continue to hold the outdated version
of R.
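
The sequence of events above can be replayed in a few lines of Python.
This is only a toy model under simplifying assumptions: the classes
\texttt{PublishingRegistry} and \texttt{Harvester}, the identifier
\texttt{ivo://sample/R}, and the numeric times are invented for the
illustration and do not correspond to any actual registry software.

```python
from dataclasses import dataclass


@dataclass
class Row:
    ivoid: str
    updated: float  # timestamp copied from the resource record


class PublishingRegistry:
    """Toy publishing registry answering from a metadata table."""

    def __init__(self):
        self.table = {}

    def ingest(self, ivoid, record_updated):
        # The problematic behaviour: the row keeps the record's own
        # timestamp, not the time it became visible to harvesters.
        self.table[ivoid] = Row(ivoid, record_updated)

    def list_records(self, since):
        return [r.ivoid for r in self.table.values() if r.updated >= since]


class Harvester:
    """Toy searchable registry remembering the time of its last harvest."""

    def __init__(self):
        self.last_harvest = float("-inf")
        self.seen = set()

    def harvest(self, registry, now):
        self.seen.update(registry.list_records(self.last_harvest))
        self.last_harvest = now


t1, t2, t3, t4 = 1.0, 2.0, 3.0, 4.0
P, S = PublishingRegistry(), Harvester()

# R is edited at t1, but the edit has not reached P's table yet.
S.harvest(P, now=t2)            # first harvest: nothing new is visible
P.ingest("ivo://sample/R", t1)  # ingestion at t3, stamped with t1
S.harvest(P, now=t4)            # incremental harvest for changes since t2

missed = "ivo://sample/R" not in S.seen
```

In this model, \texttt{missed} ends up true: the row for R carries the
stamp $t_1$, which lies before the $t_2$ that S asks for, so R is never
delivered.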

What sounds like a fairly exotic scenario has in practice haunted the VO
registry for quite some time and keeps doing so, even though some
mitigation is possible by harvesters using the time of the last-but-one
harvest. Still, user-visible differences between the content of
different searchable registries are frequently due to race conditions of
this type. Several solutions are possible -- publishing registries
could use the ingestion time as the updated timestamp for their records,
searchable registries could do full re-harvests every time, or OAI-PMH
could be changed to use a more robust mechanism. Meanwhile, incremental
harvesting is a curation challenge not to be underestimated, with further
subtleties mentioned below.
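
To make the effect of two of these measures concrete, the following
Python sketch compares them on a toy timeline. The function
\texttt{records\_since}, the identifier \texttt{ivo://sample/R}, and the
numeric times (including $t_0$, the assumed time of the harvest before
$t_2$) are illustrative assumptions, not part of any standard.

```python
def records_since(table, since):
    """Return the ids whose datestamp is at or after `since`."""
    return {ivoid for ivoid, stamp in table.items() if stamp >= since}


# Edit, first harvest, ingestion, second harvest.
t1, t2, t3, t4 = 1.0, 2.0, 3.0, 4.0

# Variant A: the table row carries the record's own update time t1.
table_record_time = {"ivo://sample/R": t1}
# Variant B: the table row carries the ingestion time t3.
table_ingest_time = {"ivo://sample/R": t3}

# A harvester remembering only its last harvest asks for changes since t2.
lost = records_since(table_record_time, t2)    # R is missed
found = records_since(table_ingest_time, t2)   # R is delivered

# Harvester-side mitigation: ask since the last-but-one harvest instead.
t0 = 0.5  # hypothetical time of the harvest preceding t2
mitigated = records_since(table_record_time, t0)  # delivered, as t0 < t1
```

Note that the last-but-one strategy only helps in this sketch because
the edit at $t_1$ happened after $t_0$; if ingestion lags by more than
one harvest interval, the record is still lost, whereas stamping rows
with the ingestion time does not depend on such timing.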


\redcomment{RofR, Incremental (Races!), Validation, standardId.
Resource Types, OAI-PMH, searchable registry, validation, sets, the
\texttt{ivo\_managed} set, distribution of resource types:
\texttt{select res\_type, count(*) from rr.resource group by res\_type}}

\section{Authorities}

When, as in the VO, the creation of identifiers is distributed, there
needs to be a mechanism ensuring uniqueness, which in the case of the VO
Registry means making sure that no identifier is assigned to two
different resources. In the VO, this mechanism is founded on the notion
of \emph{authorities}, which are entities creating IVORNs. Each
authority is assigned a namespace, within which the authority is free to
create new names, as long as some basic syntactic rules are followed.
As with DOIs \citep{std:DOI}, identifiers are then a combination of the
authority identifier and the local part. As long as the IVOA makes sure
authority identifiers are unique and each authority ensures uniqueness
\emph{within their namespace}, the system yields globally unique
identifiers.

Technically, authority identifiers are IVORNs just consisting of the
scheme and the URI authority part in the terminology of
\citet{std:RFC3986} (it is obviously no coincidence that the VO
authority coincides with the URI authority here), e.g.,
\texttt{ivo://ivoa.net}. By \citet{std:RI1}, this must already be a
valid IVORN, i.e., refer to a resource record, which in this case must
be of the type \texttt{vg:Authority}. Resource records of this type
(``authority records'' in the following) are an ``assertion of control
over a namespace represented by an authority identifier''
\citep{std:RI1}. In practice, the metadata should describe
what organisational detail suggests the creation of a new authority. In
consequence, the contact would be the person responsible for ensuring
the uniqueness of the local parts.
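
The decomposition of an identifier into authority identifier and local
part can be sketched with Python's standard URI parser. The helper
\texttt{split\_ivorn} and the local part \texttt{std/hypothetical} are
made up for this illustration; only the generic URI syntax is assumed,
not any further IVORN-specific rules.

```python
from urllib.parse import urlsplit


def split_ivorn(ivorn):
    """Split an IVORN into (authority identifier, local part)."""
    parts = urlsplit(ivorn)
    if parts.scheme != "ivo" or not parts.netloc:
        raise ValueError("not an IVORN: %r" % ivorn)
    # Scheme plus URI authority form the authority identifier.
    authority_id = "ivo://" + parts.netloc
    # Whatever follows is the local part chosen by the authority.
    local_part = parts.path.lstrip("/")
    return authority_id, local_part


authority, local = split_ivorn("ivo://ivoa.net/std/hypothetical")
bare_authority, empty_local = split_ivorn("ivo://ivoa.net")
```

For an authority identifier such as \texttt{ivo://ivoa.net} itself, the
local part is simply empty.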

In addition to the usual VOResource pieces of metadata -- like content,
curation, and title -- authority records have exactly one
\texttt{managingOrg}. This is the organisation that is responsible for
an authority, and the distinction from the authority itself is somewhat
subtle and best illustrated by an example: An observatory with an
infrared unit and an ultraviolet unit that want to avoid having to
negotiate before minting identifiers could claim the authorities
\texttt{infrared.sample}, \texttt{ultraviolet.sample}, and
\texttt{sample}. The observatory itself would then be
\texttt{ivo://sample/org}, and it would be the managing organisation for
all the authorities. All authority records would also list ``The sample
observatory'' (or similar) as their publisher.

Note that URI authorities are opaque and unstructured, which means that
clients are not supposed to infer any relationship from the fact that
\texttt{sample} is contained in \texttt{infrared.sample}. There has
been a recommendation to re-use DNS names as authority IDs, which has
been largely ignored and appears mainly to result in IVORNs that are
unnecessarily long. Today, we would suggest basing authority names on
the names of national VO projects where available.

According to \citet{std:RI1}, the uniqueness of the authority names is
maintained by putting the burden on the publishing registries: ``Before
the publishing registry commits the [authority] record for export, it
must first search a full registry to determine if a vg:Authority with
this identifier already exists; if it does, the publishing of the new
vg:Authority record must fail.'' Given the delays involved in
harvesting, this procedure obviously has very real issues with race
conditions, and to our knowledge, no engine for publishing registries
actually implements such a check.

In actual operation, the managing organisation is much less important
than the registry that manages the authority. The intent here is that
only one registry is accepted as the source for registry records under
the authority (but a given registry can manage multiple authorities).
Full registries can use this mapping from authorities to their managing
registries to decide whether to ingest records they harvest when
harvesting full registries, either complementary to evaluating
\texttt{ivo\_managed} or instead of it; in the history of the
VO Registry, this has at times been the more stable approach.
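
A minimal sketch of such a filter is given below in Python. It assumes
that the mapping from authorities to managing registries is already
available as a dictionary; the function \texttt{accept\_records} and all
identifiers in the example are hypothetical.

```python
def accept_records(harvested, source_registry, managed_by):
    """Keep records whose authority is managed by the registry they came from.

    harvested: iterable of IVORN strings obtained from source_registry.
    managed_by: dict mapping authority names to the IVORN of the
        registry managing them.
    """
    accepted = []
    for ivorn in harvested:
        # The authority is whatever sits between "ivo://" and the next slash.
        authority = ivorn[len("ivo://"):].split("/", 1)[0]
        if managed_by.get(authority) == source_registry:
            accepted.append(ivorn)
    return accepted


# Hypothetical mapping, as it could be derived from registry records.
managed_by = {
    "infrared.sample": "ivo://sample/registry",
    "other": "ivo://other/registry",
}
harvested = ["ivo://infrared.sample/survey", "ivo://other/catalogue"]
kept = accept_records(harvested, "ivo://sample/registry", managed_by)
```

Here only \texttt{ivo://infrared.sample/survey} is kept, since the
second record belongs to an authority managed by a different registry.
Records under authorities missing from the mapping are dropped as well,
which is a design choice of this sketch rather than a requirement.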

The difficulty here is the maintenance of this mapping. While name
clashes in authorities at the time they are created have not been a
problem in practice, keeping the mapping current has been, as authorities
sometimes move from one registry to another. It happened surprisingly
frequently that the releasing registry failed to drop its declaration of
managing the departing authority, or did not update the record's
modification date such that it did not get harvested incrementally. Such
cases lead to severe inconsistencies within the registry system. At this
point we believe the way to deal with them is manual curation at the
RofR, as the updated resource record from the accepting registry comes in
and the name clash becomes obvious.


\section{Conclusions}
\label{sec:conc}


\bibliographystyle{elsarticle-harv}
\bibliography{registry}

\end{document}
