% Revision 2569, Wed Apr 30 14:02:07 2014 UTC, by rplante@ncsa.uiuc.edu:
% updated Plante's name and address
1 %\documentclass[preprint,authoryear,12pt]{elsarticle}
2 \documentclass[final,authoryear,5p,times]{elsarticle}
3
4 \usepackage{graphicx}
5 \usepackage{amssymb}
6 \usepackage[utf8]{inputenc}
7 \biboptions{}
8
9 % remove the following two lines before submission
10 \usepackage[usenames]{color}
11 \newcommand{\redcomment}[1]{{\color{red}#1}}
12
13 \journal{Astronomy \& Computing}
14
15 \begin{document}
16
17 \begin{frontmatter}
18
19 %% Title, authors and addresses
20
21 %% use the tnoteref command within \title for footnotes;
22 %% use the tnotetext command for the associated footnote;
23 %% use the fnref command within \author or \address for footnotes;
24 %% use the fntext command for the associated footnote;
25 %% use the corref command within \author for corresponding author footnotes;
26 %% use the cortext command for the associated footnote;
27 %% use the ead command for the email address,
28 %% and the form \ead[url] for the home page:
29 %%
30 %% \title{Title\tnoteref{label1}}
31 %% \tnotetext[label1]{}
32 %% \author{Name\corref{cor1}\fnref{label2}}
33 %% \ead{email address}
34 %% \ead[url]{home page}
35 %% \fntext[label2]{}
36 %% \cortext[cor1]{}
37 %% \address{Address\fnref{label3}}
38 %% \fntext[label3]{}
39
40 \title{The Virtual Observatory Registry}
41
42 %% use optional labels to link authors explicitly to addresses:
43 %% \author[label1,label2]{<author name>}
44 %% \address[label1]{<address>}
45 %% \address[label2]{<address>}
46
47 \author[ari]{Markus Demleitner}
48 \ead{msdemlei@ari.uni-heidelberg.de}
\address[ari]{Universität Heidelberg,
50 Astronomisches Rechen-Institut, M\"onchhofstra\ss e 12-14, 69120
51 Heidelberg, Germany}
52 \author[stsci]{Gretchen Greene}
53 \address[stsci]{Space Telescope Science Institute, 3700 San Martin Dr,
54 Baltimore, MD 21218, USA}
55 \author[obspm]{Pierre Le Sidaner}
56 \address[obspm]{Observatoire de Paris, Paris, France}
57 \author[ncsa]{Raymond L. Plante}
58 \address[ncsa]{National Center for Supercomputing Applications,
59 University of Illinois, 1205 W. Clark St. Urbana, IL 61821}
60
61 \begin{abstract}
62 In the Virtual Observatory (VO), the Registry provides the mechanism with which
63 users and applications can discover and select resources -- e.g., data
64 and services -- that are relevant for a particular scientific problem.
65 This article describes the design and operation of the VO Registry as a
66 distributed information system from a conceptual side, in particular as
67 regards the format and content of the resource records, their
68 interchange, and their interpretation. The client perspective will be
69 given in a forthcoming paper.
70 \end{abstract}
71
72 \begin{keyword}
73 virtual observatory\sep registry\sep standards
74 %% keywords here, in the form: keyword \sep keyword
75
76 %% MSC codes here, in the form: \MSC code \sep code
77 %% or \MSC[2008] code \sep code (2000 is the default)
78 \MSC 68U35
79 \end{keyword}
80
81 \end{frontmatter}
82
83 % \linenumbers
84
85 \section{Introduction}
86 \label{sec:intro}
87
88 The Virtual Observatory (VO) is a distributed system -- by design, there
89 is no central node either running services, handing out data, or even
90 just a single link list-style directory. In order to still maintain the
91 appearance of a single, integrated information system, however, users
92 and clients must have a means of discovering metadata of VO-compliant
93 resources (in the sense discussed in section~\ref{sect:recs}). This
94 means is provided by the VO Registry; written in upper case in the
95 following, the term refers to the entire system, as opposed to the
96 lower-case ``registry,'' which denotes a concrete service.
97
98 Following the VO philosophy, the VO Registry is not a single, central
99 system but rather a network of several types of services, some of which
host and publish metadata collections, while others provide capabilities
101 for querying such collections. All follow standard protocols for
102 exchanging information between them and between them and client
103 software.
104
105 The metadata collections consist of records containing a
106 rich set of descriptive metadata for each resource with a description in
107 the registry. These records are defined by a set of standards covering data
108 services, astronomical resources and various other types of resources.
109
The VO Registry is a critical component of the Virtual Observatory as run by
111 the community organized around the International Virtual Observatory
112 Association (IVOA). For instance, science data discovery tools rely on
113 registry access to efficiently perform searches across the distributed
114 archives, and desktop tools use the standard application program interfaces
115 (APIs) exposed by the registries to discover services that can be
116 operated using whatever protocols these tools understand.
117
118 Another key aspect of the VO Registry is the
119 monitoring of the health and functionality of the VO.
120 The registries themselves are routinely validated and
121 curated to ensure consistency with IVOA standards, which uncovers errors
122 in the metadata supplied by the service operators. Even more
importantly, services within the Registry are validated for compliance with the
standards they claim to implement, and errors are actively reported
125 to responsible parties.
126
The VO Registry thus is a complex ecosystem governed by a fairly large
128 set of standards. Anticipating some terms that will be explained later,
129 let us collect and arrange the relevant standards already in the
130 introduction.\footnote{For an even bigger picture of the VO and its
131 components, see \citet{note:VOARCH}.} Where the standards have short
132 names in common use in the VO community, we introduce these here.
133
134 The basic standards define how to identify entities in the VO
135 \citep[IVOA Identifiers;][]{std:VOID} and what pieces of metadata we
136 consider relevant for the VO's use cases \citep[Resource Metadata for
137 the Virtual Observatory;][]{std:RM}. Based on this, VOResource
138 \citep{std:VOR} lays out the basics of encoding resource metadata in XML
139 and defines the basic types. Several extensions apply these building
140 blocks to more specialized types of services or interfaces:
141 VODataService \citep{std:VODS11} says how to describe data collections
142 and services exposing them, SimpleDALRegExt \citep{std:DALREGEXT}
143 contains capabilities for several ``simple'' protocols of the VO's Data
144 Access Layer (DAL), TAPRegExt \citep{std:TAPREGEXT} does the same for
145 the Table Access Protocol TAP, and StandardsRegExt \citep{std:STDREGEXT}
146 contains resource types for standard texts. Registry Interfaces
147 \citep{std:RI1} specifies how registries exchange the XML records
148 defined in VOResource and extensions, as well as
149 how registries themselves are
150 described within VOResource. It builds on the non-VO OAI-PMH
151 \citep{std:OAIPMH} standard. Registry Interfaces in the still-current
152 version 1.0 also defines two APIs for searching the registry, although
153 it is expected that two new APIs will soon deprecate these. One new
154 API is specified in a standard called RegTAP \citep{std:RegTAP} which
155 currently is under consideration.
156
157 \begin{figure}
158 \includegraphics[width=\hsize]{RegistryExchange.png}
159 \caption{A sketch of the registry system in the Virtual Observatory
160 taken from \citet{2007ASPC..382..445P}:
161 Searchable registries harvest from publishing registries operated by the data
162 providers. Users and client applications can then discover VO
163 resources through queries to a searchable registry, either a full
164 searchable registry that contains everything known to the VO, or a
165 specialized one focused on a particular subset.}
166 \label{fig:arch}
167 \end{figure}
168
169 In the remainder of this paper, we introduce the notions of resources
170 and resource records somewhat more rigorously before going into more
171 details on what registries there are and how they cooperate in
172 section~\ref{sect:registries}. The distributed nature of the VO
173 Registry is enabled by harvesting, which we introduce and critically
174 assess in section~\ref{sect:harvesting}; to ensure correct separation of
175 responsibilities in this process, authority management as discussed in
section~\ref{sect:auths} is essential. The remaining sections up to the
177 conclusions are dedicated to several aspects of the resource records and
178 the metadata contained.
179
180 This paper does not discuss the user-facing parts of the Registry in any
181 detail, i.e., user interfaces, query APIs, and the like.
182 For that as well as further statistics on current Registry content,
183 we refer the reader to an upcoming
184 paper~II.
185
186 \section{Resources and Resource Records}
187 \label{sect:recs}
188
189 The Virtual Observatory can be seen as a collection of \emph{resources}.
190 \citet{std:RM} defines a VO resource as a ``VO element that can be
191 described in terms of who curates or maintains it and which can be given
a name and a unique identifier.'' The standard goes on to name
193 sky coverages, instrumental
194 setups, organisations, or data collections as examples. In practice,
195 over 95\% of resources in the current VO are data services.
196
197 From the outset, it was clear that a common way of describing these
198 resources would be required as a very basic building block for
interoperability. For instance, VO-enabled client programs need to be able to
find out what protocols a service supports and at what ``endpoints'' --
typically, HTTP URLs -- these are available, and scientists should have
202 reliable and standardized ways to work out who to reference, who to
203 report bugs to, and so on.
204 Of course, having a standardised structure for content metadata (like
205 keywords, a title, description) helps writing more focused data
206 discovery queries as well.
207
208 Fortunately, the VO did not have to develop the technology to support
209 such descriptions itself, as library sciences have worked on very
210 comparable problems for centuries already. The VO's registry
211 architecture in particular re-uses the Open Archives Initiative's
212 protocol for metadata harvesting \citep[OAI-PMH;][]{std:OAIPMH} for
213 a conceptual framework and the metadata exchange protocol, and
214 Dublin Core \citep{std:RFC5013} for a basis on which to build the
215 metadata model.
216
217 Central to OAI-PMH is the notion of a \emph{unique identifier}, which
218 ``unambiguously identifies an item within'' the set of resources.
219 Other than that these should be URIs \citep{std:RFC2396}, OAI-PMH does
220 not state details on how they should be formed. For VO
221 resources, \citet{std:VOID} prescribes the use of IVOA resource names or
\emph{IVORNs}. In short, these are URIs with a scheme of \texttt{ivo},
an authority part as discussed in section~\ref{sect:auths}, and a local
part governed by some reasonable restrictions on which characters are
225 allowed to occur.
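
As an illustration of this structure, the following Python sketch splits
an IVORN into its authority and local part. The function name
\texttt{split\_ivorn} is hypothetical, and the sketch assumes that checking
the scheme and the presence of an authority is sufficient; the character
restrictions on the local part are deliberately not checked.

```python
from urllib.parse import urlsplit


def split_ivorn(ivorn):
    """Split an IVORN into (authority, local part).

    Assumption for this sketch: only the scheme and a non-empty
    authority are checked; the character rules are not enforced.
    """
    parts = urlsplit(ivorn)
    if parts.scheme != "ivo":
        raise ValueError("not an IVORN (scheme must be ivo): %r" % ivorn)
    if not parts.netloc:
        raise ValueError("IVORN lacks an authority part: %r" % ivorn)
    # The path without its leading slash is the local part; it is empty
    # for identifiers that consist of scheme and authority only.
    return parts.netloc, parts.path.lstrip("/")


print(split_ivorn("ivo://sample/org"))  # ('sample', 'org')
```

An identifier consisting only of scheme and authority, such as
\texttt{ivo://ivoa.net}, yields an empty local part in this sketch.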
226
227 A somewhat subtle but nevertheless important distinction made in OAI-PMH
228 is between a resource and a \emph{resource record} containing its
229 description.
230 To see that this distinction has actual consequences, say
231 the data collection X contains spectra obtained using the spectrograph
232 S; the resource record R describes X. Now,
233 during the lifetime of the instrument, S will add new data to X on every
234 clear night, which means the resource changes. Nevertheless, in the
235 current VO R will not change (though it is conceivable that it will be
236 updated now and then, e.g., as the description might contain
237 rough estimates on the number of datasets contained in X).
238
239 For the converse scenario of a changing resource record with a constant
240 resource, suppose S is now decommissioned, while the standard defining
241 the content of the resource record is updated to include the spatial
242 coverage of the data collection. Now, R needs an update without X
243 changing.
244
As stated above, OAI-PMH defines that the unique identifiers -- and hence
246 the VO's IVORNs -- always reference resource records. As to how the
247 resources themselves should be referenced, OAI-PMH declares that the
248 ``nature of a resource identifier is outside the scope.'' This
249 reservation is motivated by the library use case, where a single book
250 might be described by different libraries and hence have multiple
resource records. The libraries must agree on a common identifier for the
252 book to see that the different resource records actually all describe
253 the same book, but OAI-PMH did not want to endorse any particular
254 mechanism for this.
255
256 In the IVOA, it was expected that such complications would not arise as
257 the resource records would almost always come from the resource
258 publishers themselves, and no need for multiple resource records for a
259 single resource was foreseen. It was therefore decided that the IVORN of
260 a resource record should also identify the resource itself.
261
262 This explains some duplication of information in OAI-PMH messages in the
263 VO. Awareness of the distinction is relevant to registry users to
264 understand the meaning of the creation or update times in the resource
265 record (which refer to the record itself) and the dates and times given
266 in the curation/date child of the resource record, which pertain to the
267 resource.
268
269 As to the content of the resource record, very early on in the history
of the VO the desired pieces of metadata were collected, which led up
271 to \citet{std:RM}. This then was translated into an XML
272 representation in several standards, in particular VOResource
273 \citep{std:VOR} and VODataService \citep{std:VODS11} as well as several
274 extensions for particular types of resources or services.
275
276
277 \section{Registries}
278 \label{sect:registries}
279
280 Having a set of resource records alone is not enough to build a useful
281 system, even if they already are in a standard format. There must also
282 be ways in which users can locate records of the resources relevant to
283 them within this set. Therefore, systems are required enabling
284 service operators to feed their resource records into
285 the set. Also, users must have a way to execute queries against the set.
286 Both requirements
287 are covered by \emph{registries} within the VO.
288
289 Given the VO's highly diverse and distributed structure, it is evident
290 that a distributed system is required, in which neither requirement
291 can be made the task of a single entity. Instead, every publisher can
292 run their own \emph{publishing registry}. This is a service exposing
293 the collection of this publisher's resource records.
294
295 Conceivably, a user looking for a resource matching some constraints
296 could now query each publisher's publishing registry in turn to obtain a
297 list of all matching VO resources. This architecture obviously will not
298 scale well with the number of publishers. It also introduces many
299 points of failure into the system, as all publishers would have to keep
300 their registries highly available to avoid a severe degradation of the
301 whole system.
302
303 Therefore, retrieving resource records from the publishing registries,
304 joining the sets of resource records thus obtained, and offering means
305 of querying this joined set to VO users is the task of a specialized
306 agent, a \emph{searchable registry}. The process of retrieval of
307 resource records by a searchable registry is known as \emph{harvesting}.
308 To allow this harvesting, publishing and searchable registries must
309 agree on a common protocol.
310 As mentioned in section~\ref{sect:recs}, the adoption of OAI-PMH already
311 defined such a protocol for the VO.
312
313 A secondary distinction between searchable registries is between
314 \emph{full registries} (the term ``searchable'' is usually implied in
315 this case) which strive to harvest all publishing registries
316 in the VO and \emph{local searchable registries} which only carry a
317 selection of records. An example for the second kind that is currently
318 in discussion is an ``educational'' registry that contains a manually
319 curated subset of services delivering data suitable for classroom use
320 (i.e., data of moderate size, with easily understood data types, etc).
321
322 The actual application of
323 OAI-PMH within the VO is described in \citet{std:RI1}, which in
324 particular defines that the VO's own resource record format is selected
325 in OAI-PMH using a metadata prefix of \texttt{ivo\_vor}. VO registries
are, however, also required to emit the much simpler Dublin Core metadata
327 records on request and are thus interoperable with bibliographic
328 services outside of the VO.
329
330 One additional building block needs to be mentioned, the Registry of
331 Registries or RofR for short \citep{std:RofR}. This is a special
332 publishing registry from which searchable registries can harvest the set
333 of available publishing registries. As such, it is a single point of
334 failure, as there is only one such service globally. On the other hand,
335 as no client code directly accesses the RofR, an outage does not impair
336 the user-visible functionality of the VO. The main impact would be that
337 no new publishing registries could be added to the VO's registry system,
and existing registries' endpoints would have to be discovered from
339 searchable registries.
340
341 In the current VO, the RofR also doubles as the publishing
342 registry for standards and other resources managed by IVOA, and it
343 operates a service for validating the content of publishing registries
344 (cf.~section~\ref{sect:validation}).
345
346
347 \section{Harvesting}
348 \label{sect:harvesting}
349
350 The VO registry system is de-centralized in both directions: A given
351 publishing registry does not know which searchable registries will
352 eventually carry its records. An implication of this is that it cannot
353 notify the searchable registries when a resource record changes. This, in
354 turn, implies that the searchable registries will have to poll the
publishing registries they harvest. This is not entirely trivial, as
356 the largest publishing registry in
357 the VO currently emits more than 100 Megabytes of resource records, and
358 due to paging and other delays the transfer takes about 10 minutes.
359
360 On the other hand, to keep up to date, searchable registries should poll
361 the publishing registries with a fairly high frequency. Most active
362 searchable registries today poll once or twice a day. To nevertheless
363 keep network and CPU load low, OAI-PMH supports \emph{incremental
364 harvesting}. This allows searchable registries to query publishing
365 registries for records updated since some point in time.
366
367 A common harvesting strategy is that searchable registries persist the
368 date and time of the last harvest and, on re-harvesting, query the
369 publishing registry for records updated since then. Together with a
370 very natural-seeming (but incorrect) implementation on the part of the
371 publishing registry, this can lead to a loss of records with
372 incremental harvesting.
373
374 To see how this happens, consider a publishing registry P that, as is
375 usual, keeps the updated dates of its resources in a database table
376 to facilitate quick responses to OAI-PMH queries with
377 date constraints. Now say a new resource record R is created at $t_1$
378 and its updated
379 attribute accordingly is set to $t_1$ in the record itself. For one
380 reason or another, the program that ingests the updated dates for the
381 record into the database table does not run immediately.
382
383 At $t_2>t_1$, a searchable registry S harvests P and memorizes $t_2$ as
384 the date of the last harvest. As the database table does not contain R
385 yet, R is not harvested.
386
387 At $t_3>t_2$, the program ingesting R into the database table
388 is finally run, but the timestamp is taken from the resource record,
389 i.e., it is $t_1$. Now, when S comes back for an incremental harvest at
390 $t_4>t_3$, it will ask for records updated after $t_2$, which, as
391 $t_1<t_2$, R is not. Hence, the record will be missed by S, which then
392 will not contain R. An analogous problem exists for updates and
393 deletions of records.
394
395 What sounds like a fairly exotic scenario is not uncommon at all with
396 current registry implementations and regularly causes user-visible
397 differences between the content of different registries. Some
398 mitigation is possible if harvesters use the time of the last-but-one
399 harvest to constrain their incremental queries.
400 The correct solution, though, is that publishing registries set the
401 ingestion time as updated timestamp for their records. As the condition
402 outlined above is the result of a straightforward implementation,
403 however, we believe in the medium term a more robust method for
incremental harvesting, presumably based on monotonically increasing IDs,
405 should be put in place, to make the straightforward implementation also
406 a race-free one.
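
The race described above can be replayed in a few lines of Python. This
is only a toy model under stated assumptions: the class
\texttt{PublishingRegistry}, its methods, and the integer times standing
in for $t_1$ to $t_4$ are all hypothetical and do not correspond to any
actual registry software.

```python
class PublishingRegistry:
    """Toy publishing registry with a separate date index (hypothetical)."""

    def __init__(self, stamp_with_ingestion_time):
        self.stamp_with_ingestion_time = stamp_with_ingestion_time
        self.pending = []  # (identifier, updated date in the record)
        self.index = {}    # identifier -> datestamp used to answer queries

    def create(self, identifier, now):
        # The record exists, but the date index does not know it yet.
        self.pending.append((identifier, now))

    def ingest(self, now):
        # Delayed ingestion of pending records into the date index.
        for identifier, updated in self.pending:
            stamp = now if self.stamp_with_ingestion_time else updated
            self.index[identifier] = stamp
        self.pending = []

    def list_records(self, since=None):
        # Incremental harvest: only records stamped at or after `since`.
        return sorted(identifier for identifier, stamp in self.index.items()
                      if since is None or stamp >= since)


def run_scenario(stamp_with_ingestion_time):
    """Replay t1..t4 and return what the searchable registry S ends up with."""
    p = PublishingRegistry(stamp_with_ingestion_time)
    harvested = set()
    p.create("ivo://sample/r", now=1)                     # t1: R is created
    harvested.update(p.list_records())                    # t2: S harvests P
    last_harvest = 2
    p.ingest(now=3)                                       # t3: index catches up
    harvested.update(p.list_records(since=last_harvest))  # t4: incremental
    return harvested


print(run_scenario(stamp_with_ingestion_time=False))  # set(): R is lost
print(run_scenario(stamp_with_ingestion_time=True))   # {'ivo://sample/r'}
```

With the record's own updated date as datestamp, S never sees R; with the
ingestion time as datestamp, the incremental harvest at $t_4$ picks it up.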
407
Another sometimes misunderstood feature has to do with
409 \emph{sets}. These are
410 a feature of OAI-PMH that lets archive operators define subsets of their
411 data holdings sharing some property. The VO's registry interface
412 standard defines one such set that must be supported by all registries,
413 \texttt{ivo\_managed}. This set is defined to comprise all records that
414 originate from the registry \emph{and} should be visible in a searchable
415 VO registry. The idea behind this is that a harvesting registry can
constrain its queries to \texttt{ivo\_managed} and will then not see records
that the harvested registry has itself harvested from other registries.
418 Note that set membership is a property of a registry, not of a record,
419 so information on set membership is lost at harvesting time.
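
To make the pieces concrete, the following Python sketch assembles the
URL of an incremental harvesting request. It combines the OAI-PMH
\texttt{ListRecords} verb with the \texttt{ivo\_vor} metadata prefix and the
\texttt{ivo\_managed} set named in the text. The function name
\texttt{list\_records\_url} and the endpoint
\texttt{http://registry.example/oai} are assumptions made for
illustration only.

```python
from urllib.parse import urlencode


def list_records_url(base_url, since=None, only_managed=True):
    """Build an OAI-PMH ListRecords URL for harvesting VOResource records.

    base_url is the harvesting endpoint of a publishing registry
    (hypothetical here); since is an OAI-PMH datestamp string or None
    for a full harvest.
    """
    params = [("verb", "ListRecords"), ("metadataPrefix", "ivo_vor")]
    if only_managed:
        # Restrict the harvest to records originating from this registry.
        params.append(("set", "ivo_managed"))
    if since is not None:
        # Incremental harvest: only records updated since this date.
        params.append(("from", since))
    return base_url + "?" + urlencode(params)


print(list_records_url("http://registry.example/oai", since="2014-04-01"))
```

Omitting \texttt{since} yields the request for a full harvest.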
420
421
422 \section{Authorities}
423 \label{sect:auths}
424
425 When, as in the VO, the creation of identifiers is distributed, there
426 needs to be a mechanism ensuring uniqueness, which in the case of the VO
427 Registry means making sure that no identifier is assigned to two
428 different resources. In the VO, this mechanism is founded on the notion
429 of \emph{authorities}, which are entities creating IVORNs. Each
430 authority is assigned a namespace, within which the authority is free to
431 create new names, as long as some basic syntactic rules are followed.
432 As with DOIs, identifiers are then a combination of the
433 authority identifier and the local part. As long as the IVOA makes sure
authority identifiers are unique and each authority ensures uniqueness
435 \emph{within their namespace}, the system yields globally unique
436 identifiers.
437
438 Technically, authority identifiers are IVORNs just consisting of the
439 scheme and the URI authority part, for instance, \texttt{ivo://ivoa.net}.
440 By \citet{std:RI1}, this must already be a
441 valid IVORN, i.e., refer to a resource record, which in this case must
442 be of the type \texttt{vg:Authority}. Resource records of this type
443 (``authority records'' in the following) are an ``assertion of control
444 over a namespace represented by an authority identifier''
445 \citep{std:RI1}. In practice, the metadata
446 should describe
447 what organisational detail suggests the creation of a new authority. In
448 consequence, the contact would be the person responsible for ensuring
449 the uniqueness of the local parts.
450
451 In addition to the usual VOResource pieces of metadata --
452 discussed in detail in section~\ref{sect:vorr} --
453 authority records have exactly one
454 \texttt{managingOrg}. This is the organisation that is responsible for
455 an authority, and the distinction from the authority itself is somewhat
456 subtle and best illustrated by an example: An observatory with an
457 infrared unit and an ultraviolet unit that want to avoid having to
458 negotiate before minting identifiers could claim the authorities
459 \texttt{infrared.sample}, \texttt{ultraviolet.sample}, and
460 \texttt{sample}. The observatory itself would then be
461 \texttt{ivo://sample/org}, and it would be the managing organisation for
462 all the authorities. All authority records would also list ``The sample
463 observatory'' (or similar) as their publisher.
464
465 Note that URI authorities are opaque and unstructured, which means that
466 clients are not supposed to infer any relationship from the fact that
467 \texttt{sample} is contained in \texttt{infrared.sample}. There has
468 been a recommendation to re-use DNS names as authority IDs, which has
469 been largely ignored, probably because it tends to make IVORNs
unnecessarily long. Today, we would suggest basing authority names on
471 the names of national VO projects where available.
472
473 In \citet{std:RI1}, the burden of ensuring the uniqueness
474 of the authority names is put on the publishing registries: ``Before
475 the publishing registry commits the [authority] record for export, it
476 must first search a full registry to determine if a vg:Authority with
477 this identifier already exists; if it does, the publishing of the new
478 vg:Authority record must fail.'' Given the delays involved in
479 harvesting, this procedure obviously has very real issues with race
480 conditions, and to our knowledge, no engine for publishing registries
481 implements such a check.
482
483 In actual operation, the managing organisation is much less important
484 than the registry that manages the authority. The intent here is that
485 only one registry is accepted as the source for registry records under
486 the authority (but a given registry can manage multiple authorities).
Full registries can use this mapping from authorities to their managing
registries to decide whether to ingest the records they obtain when
harvesting other full registries, either complementing the evaluation of
\texttt{ivo\_managed} or replacing it; using the mapping has, in the
history of the VO Registry, at times been the more stable approach.
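
A minimal sketch of such a filter in Python might look as follows. The
function name \texttt{accept\_record}, the plain dictionary holding the
mapping, and the identifiers used are hypothetical; a real harvester
would derive the mapping from the \texttt{vg:Registry} records it knows.

```python
from urllib.parse import urlsplit


def accept_record(ivorn, harvested_from, managing_registry):
    """Decide whether a harvested record should be ingested.

    managing_registry maps authority identifiers to the IVORN of the
    registry accepted as their source (a hypothetical structure).
    A record is accepted only if it was harvested from the registry
    that manages the authority of its identifier.
    """
    authority = urlsplit(ivorn).netloc
    return managing_registry.get(authority) == harvested_from


# Illustrative mapping; all names are made up.
mapping = {
    "sample": "ivo://sample/registry",
    "infrared.sample": "ivo://sample/registry",
}
print(accept_record("ivo://sample/spectra", "ivo://sample/registry", mapping))
print(accept_record("ivo://sample/spectra", "ivo://other/registry", mapping))
```

Records under authorities unknown to the mapping are rejected in this
sketch; whether that is the desired policy is a design decision for the
harvesting registry.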
492
493 While name
494 clashes in authorities at the time they are created have not been a
495 problem in practice, maintaining this mapping has been,
496 as authorities sometimes move from
497 one registry to another. It happened surprisingly frequently that the
498 releasing registry failed to drop its declaration of managing the
499 departing authority, or did not update the record's modification date
500 such that it did not get harvested incrementally. Such cases lead to
501 severe inconsistencies within the registry system. At this point we
502 believe the way to deal with them is manual curation at the RofR, as the
503 updated resource record from the accepting registry comes in and the
504 conflicting claims of authority can be diagnosed.
505
506
507 \section{VO resource records}
508 \label{sect:vorr}
509
510 Following \citet{std:RM}, VO resource records contain a fairly
511 comprehensive set of metadata. All resource records must have a title,
an identifier, and a status as well as information on their content and
513 the curation. They also have timestamps for the creation and the last
514 update of the resource record. Additional optional metadata includes a
515 short name (primarily for use in cramped displays) and validation
516 information (cf.~section~\ref{sect:validation}).
517
518 The status attribute is used by publishing registries, which may push
519 out a resource record with status set to \texttt{deleted} when asked for modified
520 resource records. This should cause the harvesting registry to remove any
521 resource record with the same identifier it has cached from previous
522 harvests.
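
On the harvesting side, this behaviour could be sketched in Python as
follows. Records are modelled as plain dictionaries with
\texttt{identifier} and \texttt{status} keys, which is an assumption of
this sketch rather than the actual VOResource serialisation, and the
function name \texttt{apply\_harvest} is hypothetical.

```python
def apply_harvest(cache, harvested):
    """Merge freshly harvested records into a cache keyed by identifier.

    Records with status "deleted" remove any cached copy; all other
    records replace or add the entry for their identifier.
    """
    for record in harvested:
        if record["status"] == "deleted":
            # Silently ignore deletions of records never seen before.
            cache.pop(record["identifier"], None)
        else:
            cache[record["identifier"]] = record
    return cache


# Illustrative data; identifiers and the "active" status are examples only.
cache = {"ivo://sample/a": {"identifier": "ivo://sample/a", "status": "active"}}
apply_harvest(cache, [
    {"identifier": "ivo://sample/a", "status": "deleted"},
    {"identifier": "ivo://sample/b", "status": "active"},
])
print(sorted(cache))  # ['ivo://sample/b']
```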
523
524 Content metadata consists of subjects -- keywords which are supposed to
525 be drawn from the IVOA thesaurus \citep{std:IVOAT} --, a human readable
526 description, the URL of a reference page giving more information about
527 the resource (the \emph{reference URL}), as well as optionally
528 a bibliographic source and some additional ancillary information.
529 Content also allows defining relationships to other resources, examples
of which include ``mirror-of'' or ``service-for'', which is
531 particularly interesting for data collections to declare services
532 allowing access to them.
533
534 Curation metadata gives a simple provenance of a
535 resource: Who has created it -- to a first approximation, this usually
536 is the ``authors'' --, who has published it, who can repair it. Curation
537 also lets publishers specify dates relevant to the history of the
538 resource itself (as opposed to the resource record).
539
540 Resource records also have types, and certain types have additional
541 metadata. As can be seen from table~\ref{tab:typedist}, the
542 overwhelming majority of resources in the current VO registry are of
543 type \texttt{vs:CatalogService}\footnote{Following widespread practice, we
544 abbreviate the namespaces VOResource types come from with their
545 ``canonical'' prefixes. A review of this, including a translation from
546 prefixes to their namespaces, is given in chapter 4 of
547 \citet{std:RegTAP}.}. These are access services for entities with
548 sky coordinates, and most SIAP, SCS, or SSAP services will use this
549 type.
550
551 \begin{table}
552 \begin{center}
553 \begin{tabular}{|l|r|}
554 \hline
555 \multicolumn{1}{|c|}{\vrule width 0pt height 10pt depth 5pt res\_type} &
556 \multicolumn{1}{c|}{$N$} \\
557 \hline
558 \vrule width 0pt height 10pt vs:CatalogService & 13706\\
559 vs:DataCollection & 144\\
560 vg:Authority & 131\\
561 vr:Organisation & 76\\
562 vr:Service & 48\\
563 vs:DataService & 29\\
564 vg:Registry & 24\\
565 vstd:Standard & 7\\
566 vstd:ServiceStandard & 4\\
567 other & 153\\
568 \hline\end{tabular}
569 \end{center}
570 \caption{Distribution of resource types in the GAVO relational registry,
571 April 2014. Other includes deprecated or experimental types.}
572 \label{tab:typedist}
573 \end{table}
574
575 In addition to basic VOResource metadata, catalog services can contain
576 additional information on the facility and the instrument that produced
the data, on whether the data is public or proprietary, on the sky area
covered by the data, and on the structure of the
579 table that feeds the service. Catalog services share this metadata with
580 \texttt{vs:DataCollection}.
581
582 In contrast to data collections, however, catalog services have
583 capability metadata, which in particular lets clients work out what
584 protocols are available at what network endpoints. Note that capability
585 types and resource types are largely decoupled, and no rules are
586 enforced as to what resource types are allowed for which capabilities if
587 a resource type allows capabilities at all.
588 As capabilities are a fairly complex part of VOResource, we defer their
589 closer discussion to section~\ref{sect:caps}.
590
591 A \texttt{vs:DataService} record is like \texttt{vs:CatalogService}, but without claiming
592 to be based on some tabular structure. In retrospect, it seems doubtful
593 that this distinction should be reflected in the resource type, as
594 witnessed by its low and inconsistent use.
595
596 The interplay between \texttt{vg:Authority}, \texttt{vr:Organisation},
597 and \texttt{vg:Registry} was
598 discussed in section~\ref{sect:auths}, and VOResource just follows the
599 roles laid out there: \texttt{vg:Authority}, in addition to the basic metadata,
600 just gives the organisation that manages the authority;
601 \texttt{vr:Organisation}
602 allows the specification of the organisation's facilities and
603 instruments; and \texttt{vg:Registry} lists the authorities it manages, states whether
604 it is a full registry, and has capabilities. Whether a registry is
605 searchable or publishing or both is determined by its capabilities in
606 the current registry interfaces scheme \citep{std:RI1}.
607
608 While few in number, records of types \texttt{vstd:Standard} and
609 \texttt{vstd:ServiceStandard} are nevertheless important. They serve as
610 destinations for references to standards as required in, e.g.,
611 capability records as discussed below. Such records allow the
612 declaration of the various versions of a standard, associated XML
613 namespace URIs, and also the declaration of terms. This latter feature
614 provides a relatively lightweight way to generate IVORNs for certain
615 concepts standards might need. In the registry extension for TAP
616 \citep{std:TAPREGEXT}, for example, this mechanism is used to introduce
617 identifiers for output formats not distinguishable by MIME type.
618 Service standards, in addition, allow a simple specification of a
619 standard service's interface.
620
621
622 \section{Capabilities}
623 \label{sect:caps}
624
625 Resource types that offer endpoints for interaction (services,
626 registry) also contain zero or more capability elements. Capabilities
627 essentially are VOResource's way to describe the possible interactions
628 with a resource.\footnote{An exception to the interact-through-capability
629 concept is \texttt{vs:DataCollection}'s accessURL, which allows retrieval of the
630 data and is a top-level attribute of the resource.}
631
632 VOResource's basic capability element consists of optional validation
633 information, an optional human-readable description, and zero or more
634 interfaces.
635
636 The interfaces are again typed, with most interfaces in the current VO
637 being one of \texttt{vs:ParamHTTP} -- an interface for operation by HTTP and HTTP
638 request parameters (about 64\%) -- and \texttt{vr:WebBrowser} -- services based
639 on HTML forms (about 35\%). The remaining interfaces are a few
640 SOAP-based services, the special OAIHTTP type used by publishing
641 registries, and some types from abandoned standards.
642
643 Interfaces have one or more access URLs, although we expect that the next
644 version of VOResource will restrict this to exactly one. In addition,
645 a role attribute should be set to \texttt{std} if the interface is a standard
646 interface for the standard the capability claims to implement. In that
647 case, version can give the version of this standard. In current VO
648 practice, this version attribute is typically ignored, as incompatible
649 standards are told apart by the standard identifier of the capability.
650
651 Derivations of \texttt{vr:Interface} may have additional properties. In
652 particular, \texttt{vs:ParamHTTP} declares a result type -- supposed to be a MIME
653 type -- and the input parameters with their names, UCDs, and types,
654 expressed in a simplified type system. This is a cross-protocol way of
655 discovering the parameter metadata which should be provided in addition
656 to protocol-specific means. Compared to the parameter declarations
657 emitted from metadata queries in SIAP and SSAP
658 \citep{std:SIAP,std:SSAP}, parameter declarations in interfaces are less
659 expressive, since the VOTable PARAMs employed in SIAP/SSAP metadata can have
660 VALUES children giving ranges or possible values for enumerated
661 parameters. It is somewhat unfortunate that the same kind of information
662 is exposed in two non-equivalent ways.
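
As an illustration, a capability with a standard \texttt{vs:ParamHTTP}
interface might look roughly like the following sketch. The access URL
and the single parameter shown are made up for this example, namespace
declarations are omitted, and the schema files accompanying the
standards remain the authoritative reference for element names and
their order.

\begin{verbatim}
<capability
  standardID="ivo://ivoa.net/std/ConeSearch">
 <interface xsi:type="vs:ParamHTTP"
   role="std" version="1.0">
  <accessURL>http://example.org/scs?</accessURL>
  <resultType>text/xml</resultType>
  <param>
   <name>RA</name>
   <ucd>pos.eq.ra</ucd>
   <dataType>real</dataType>
  </param>
 </interface>
</capability>
\end{verbatim}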
663
664 In addition to these basic capability metadata, registry extensions can
665 define capabilities with richer metadata. For instance, the registry
666 extensions for SIAP, SSAP, and SCS \citep{std:DALREGEXT} define things
667 like test queries, limits to search and response sizes, but also the
668 kind of data contained, which for SIAP declares whether the service
669 returns cutouts, pointed observations, mosaiced images, or is an
670 atlas-type service. The most complex capability structure so far is the
671 one for the Table Access Protocol TAP
672 \citep{std:TAPREGEXT}, which exposes many aspects of
673 the TAP service and the languages supported by it. In the context of a
674 paper on the registry, TAPRegExt's
675 \texttt{dataModel} element deserves particular
676 attention. It contains an IVORN of a standard defining a data model,
677 more specifically a set of relational tables. This can be used to
678 locate TAP services having these tables. Both Obscore and the upcoming
679 relational registry use this mechanism to enable service discovery.
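
Within a TAP capability, such a declaration might look like the
following sketch, in which both the IVORN and the label are
placeholders invented for this example rather than the identifier of
an actual data model standard:

\begin{verbatim}
<dataModel
  ivo-id="ivo://example.org/std/somemodel"
  >SomeModel-1.0</dataModel>
\end{verbatim}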
680
681 Capabilities are not only used directly in the registry.
682 The VOSI and DALI standards \citep{std:VOSI, std:DALI} mandate that
683 services should also emit the capability elements on a specialised
684 endpoint next to the science endpoints. An example of where these
685 endpoints are already in everyday use is again TAP, where clients determine
686 the details of a TAP service (user-defined functions, support for
687 optional features, output formats, limits, etc.) without having to
688 consult a registry.
689
690 \section{Validation}
691 \label{sect:validation}
692
693 In a distributed system in which many parties operate services, partly
694 using home-grown implementations, it is inevitable that not all services
695 actually comply with the standards they claim to implement. With a
696 complex system like the VO Registry, it is not trivial to even write
697 correct and complete resource records, let alone follow all rules
698 ensuring that a publishing registry fits into the whole system. Hence,
699 validation on many levels is crucial for maintaining the
700 integrity of the VO.
701
702 As regards the VOResource records themselves, their validity essentially
703 is equivalent to their compliance with the XML schema files that accompany
704 the pertinent standards. For a publishing registry, a large number of
705 further properties need to be checked, for instance a correct
706 implementation of OAI-PMH, the definition of the authorities managed by
707 the registry, the support of the \texttt{ivo\_managed} set, and so
708 forth.
709
710 A service performing such a validation is operated at the RofR, and it
711 has proven instrumental for building a working registry system. In
712 particular, publishing registries that try to enlist themselves in the
713 RofR are validated and can only enter if they are valid.
714
715 Registries may become non-compliant after this initial validation due to
716 software updates or, more commonly, invalid registry records entering
717 the set of resource records. No automatic re-validation is taking
718 place, and registries that become invalid are not removed from the RofR.
719 Relying on the registry operators to re-validate and repair their
720 services has so far proven sufficient for keeping the VO Registry
721 operational.
722
723 There is, however, a second and much larger aspect to validation:
724 resource validation. This is another case in which the distinction
725 between resource record and the resource itself becomes relevant -- a
726 valid resource record might very well describe a service that does not
727 comply with the underlying standard. Validating a resource means
728 examining as many aspects of its operation as possible. While this
729 validation can in principle be performed by anyone, a publishing
730 registry is a natural place for the operation of a service validator:
731 (a) it already has the metadata available; (b) it has a means to
732 disseminate its results.
733
734 As to (a), this metadata obviously includes the access URL and the
735 standard implemented. However, meaningful validation typically requires
736 additional metadata, in particular query parameters for which the service
737 must return a non-empty response. SimpleDALRegExt contains elements designed for
738 that purpose. For instance, the cone search capability has a
739 testQuery element that separately lists values for the \texttt{RA},
740 \texttt{DEC}, and \texttt{SR} parameters that VO cone searches require.
741 In actual use, it turned out that
742 separating out the individual parameters of protocols did not
743 significantly help either validators or other VO components. In the
744 most recent simple DAL extension, the one for SSAP, testQuery
745 hence admits the specification of a complete query string otherwise
746 opaque to the validator.
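
The two styles can be contrasted in a short sketch. The values are
arbitrary examples, and the element names are given as we understand
SimpleDALRegExt, so the schema should be consulted before relying on
them. A cone search capability lists the parameters separately,

\begin{verbatim}
<testQuery>
  <ra>10.0</ra>
  <dec>-20.0</dec>
  <sr>0.5</sr>
</testQuery>
\end{verbatim}

whereas an SSAP capability can give the query string as a whole:

\begin{verbatim}
<testQuery>
 <queryDataCmd
  >POS=10.0,-20.0&amp;SIZE=0.5</queryDataCmd>
</testQuery>
\end{verbatim}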
747
748 As to (b), VOResource introduces a validation type that allows
749 operators of validators to communicate their results. It consists of a
750 numeric code and a mandatory URI identifying the validating entity. The
751 numeric code currently ranges from 0 -- ``has a description that is
752 stored in a registry'' -- to 4 -- ``meets additional quality criteria
753 set by the human inspector,'' where from 2 up there is a requirement
754 that the resource described exists and has been ``demonstrated to be
755 functionally compliant.'' The codes are defined in \citet{std:RM}.
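
In a record, such an assessment might appear as in the following
sketch, in which the IVORN of the validating entity is a made-up
placeholder:

\begin{verbatim}
<validationLevel
  validatedBy="ivo://example.org/registry"
  >2</validationLevel>
\end{verbatim}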
756
757 A resource record may contain validation information for both the full
758 record and for a single capability. While the exact semantics of this
759 distinction is not easy to define, the rough guideline from
760 \citet{std:VOR} suffices for a useful interpretation. According to
761 this, when a validation level is given for a resource, the
762 ``grade applies to the core set of metadata,'' whereas ``capability and
763 interface metadata, as well as the compliance of the service with the
764 interface standard, is rated by validationLevel tag in the capability
765 element.''
766
767 Validation information is different from the rest of the resource record
768 in that it is the only part designed to be changed by a third party on
769 the way from the resource record author through publishing and
770 searchable registry to the resource record consumer. It is also the
771 only piece of information that a harvester should accept from a resource
772 record it harvests from somewhere other than the originating
773 registry.
774
775 Like almost all other aspects of the VO, validation is distributed.
776 Conceptually, everyone is free to offer a harvestable registry handing
777 out validity assessments. In practice, validity assessments
778 differ between validating entities, for example because
779 the feature sets exercised by the various validators are different.
780 Several organisations in the VO operate validators.
781
782
783 \section{Conclusions}
784 \label{sec:conc}
785
786 The VO Registry is a central component of the VO infrastructure.
787 It allows global resource discovery using a wealth of metadata while not
788 introducing a single point of failure. This is enabled by a strictly
789 defined metadata format, the use of standard protocols in the
790 communication between registries, judicious use of cross-harvesting,
791 authority management, and continuous validation.
792
793 The Registry has additional roles to play on top of resource discovery.
794 For instance, information on the publishers, creators, and maintainers
795 of the resources is available in a standardised way. This lets
796 client software present the end user with information on who to credit
797 in a study using VO data, or to find out where to direct questions in
798 case of technical malfunction or scientific issues.
799
800 We contend that the VO Registry is one of the most complex, but also one
801 of the most widely used infrastructures of its type. The success of
802 this concept may also be seen from the adoption of the underlying
803 technologies in similar projects in other fields, for instance a VO-like
804 effort in molecular and atomic spectroscopy called VAMDC
805 \citep{2011ASPC..442...89W}.
806
807
808 \section*{References}
809
810 \bibliographystyle{elsarticle-harv}
811 \bibliography{registry}
812
813 \end{document}
