% Source: /trunk/projects/registry/acregistry/registry.tex, revision 2569,
% Wed Apr 30 14:02:07 2014 UTC, by rplante@ncsa.uiuc.edu:
% updated Plante's name and address
}

\author[ari]{Markus Demleitner}
\ead{msdemlei@ari.uni-heidelberg.de}
\address[ari]{Universität Heidelberg,
Astronomisches Rechen-Institut, M\"onchhofstra\ss e 12-14, 69120
Heidelberg, Germany}
\author[stsci]{Gretchen Greene}
\address[stsci]{Space Telescope Science Institute, 3700 San Martin Dr,
Baltimore, MD 21218, USA}
\author[obspm]{Pierre Le Sidaner}
\address[obspm]{Observatoire de Paris, Paris, France}
\author[ncsa]{Raymond L. Plante}
\address[ncsa]{National Center for Supercomputing Applications,
University of Illinois, 1205 W. Clark St. Urbana, IL 61821}

\begin{abstract}
In the Virtual Observatory (VO), the Registry provides the mechanism with which
users and applications can discover and select resources -- e.g., data
and services -- that are relevant for a particular scientific problem.
This article describes the design and operation of the VO Registry as a
distributed information system from a conceptual side, in particular as
regards the format and content of the resource records, their
interchange, and their interpretation.  The client perspective will be
given in a forthcoming paper.
\end{abstract}

\begin{keyword}
virtual observatory\sep registry\sep standards
%% keywords here, in the form: keyword \sep keyword

%% MSC codes here, in the form: \MSC code \sep code
%% or \MSC[2008] code \sep code (2000 is the default)
\MSC 68U35
\end{keyword}

\end{frontmatter}

% \linenumbers

\section{Introduction}
\label{sec:intro}

The Virtual Observatory (VO) is a distributed system -- by design, there
is no central node that runs services or hands out data, and not even
a single link-list-style directory.
In order to still maintain the
appearance of a single, integrated information system, however, users
and clients must have a means of discovering metadata of VO-compliant
resources (in the sense discussed in section~\ref{sect:recs}).  This
means is provided by the VO Registry; written in upper case in the
following, the term refers to the entire system, as opposed to the
lower-case ``registry,'' which denotes a concrete service.

Following the VO philosophy, the VO Registry is not a single, central
system but rather a network of several types of services, some of which
host and publish metadata collections, while others provide capabilities
for querying such collections.  All follow standard protocols for
exchanging information among themselves and with client
software.

The metadata collections consist of records containing a
rich set of descriptive metadata for each resource described in
the registry.  These records are defined by a set of standards covering data
services, astronomical resources and various other types of resources.

The VO Registry is a critical component of the Virtual Observatory as run by
the community organized around the International Virtual Observatory
Association (IVOA).  For instance, science data discovery tools rely on
registry access to efficiently perform searches across the distributed
archives, and desktop tools use the standard application program interfaces
(APIs) exposed by the registries to discover services that can be
operated using whatever protocols these tools understand.

Another key aspect of the VO Registry is the
monitoring of the health and functionality of the VO.
The registries themselves are routinely validated and
curated to ensure consistency with IVOA standards, which uncovers errors
in the metadata supplied by the service operators.
Even more
importantly, services within the Registry are validated for compliance with the
standards they claim to implement, and errors are actively reported
to the responsible parties.

The VO Registry thus is a complex ecosystem governed by a fairly large
set of standards.  Anticipating some terms that will be explained later,
let us collect and arrange the relevant standards already in the
introduction.\footnote{For an even bigger picture of the VO and its
components, see \citet{note:VOARCH}.}  Where the standards have short
names in common use in the VO community, we introduce these here.

The basic standards define how to identify entities in the VO
\citep[IVOA Identifiers;][]{std:VOID} and what pieces of metadata we
consider relevant for the VO's use cases \citep[Resource Metadata for
the Virtual Observatory;][]{std:RM}.  Based on this, VOResource
\citep{std:VOR} lays out the basics of encoding resource metadata in XML
and defines the basic types.  Several extensions apply these building
blocks to more specialized types of services or interfaces:
VODataService \citep{std:VODS11} says how to describe data collections
and services exposing them, SimpleDALRegExt \citep{std:DALREGEXT}
contains capabilities for several ``simple'' protocols of the VO's Data
Access Layer (DAL), TAPRegExt \citep{std:TAPREGEXT} does the same for
the Table Access Protocol TAP, and StandardsRegExt \citep{std:STDREGEXT}
contains resource types for standard texts.  Registry Interfaces
\citep{std:RI1} specifies how registries exchange the XML records
defined in VOResource and extensions, as well as
how registries themselves are
described within VOResource.  It builds on the non-VO OAI-PMH
\citep{std:OAIPMH} standard.  Registry Interfaces in the still-current
version 1.0 also defines two APIs for searching the registry, although
it is expected that two new APIs will soon supersede these.
One new
API is specified in a standard called RegTAP \citep{std:RegTAP}, which
is currently under consideration.

\begin{figure}
\includegraphics[width=\hsize]{RegistryExchange.png}
\caption{A sketch of the registry system in the Virtual Observatory
taken from \citet{2007ASPC..382..445P}:
Searchable registries harvest from publishing registries operated by the data
providers.  Users and client applications can then discover VO
resources through queries to a searchable registry, either a full
searchable registry that contains everything known to the VO, or a
specialized one focused on a particular subset.}
\label{fig:arch}
\end{figure}

In the remainder of this paper, we introduce the notions of resources
and resource records somewhat more rigorously before going into more
detail on what registries there are and how they cooperate in
section~\ref{sect:registries}.  The distributed nature of the VO
Registry is enabled by harvesting, which we introduce and critically
assess in section~\ref{sect:harvesting}; to ensure correct separation of
responsibilities in this process, authority management as discussed in
section~\ref{sect:auths} is essential.  The remaining sections up to the
conclusions are dedicated to several aspects of the resource records and
the metadata they contain.

This paper does not discuss the user-facing parts of the Registry in any
detail, i.e., user interfaces, query APIs, and the like.
For that as well as further statistics on current Registry content,
we refer the reader to an upcoming
paper~II.

\section{Resources and Resource Records}
\label{sect:recs}

The Virtual Observatory can be seen as a collection of \emph{resources}.
\citet{std:RM} defines a VO resource as ``a VO element that can be
described in terms of who curates or maintains it and which can be given
a name and a unique identifier.''  The document goes on to name
sky coverages, instrumental
setups, organizations, or data collections as examples.  In practice,
over 95\% of resources in the current VO are data services.

From the outset, it was clear that a common way of describing these
resources would be required as a very basic building block for
interoperability.  For instance, VO-enabled client programs need to be able to
find out what protocols a service supports and at what ``endpoints'' --
typically, HTTP URLs -- they are available, and scientists should have
reliable and standardized ways to work out who to reference, who to
report bugs to, and so on.
Of course, having a standardized structure for content metadata (like
keywords, a title, description) helps in writing more focused data
discovery queries as well.

Fortunately, the VO did not have to develop the technology to support
such descriptions itself, as library sciences have worked on very
comparable problems for centuries already.  The VO's registry
architecture in particular re-uses the Open Archives Initiative's
protocol for metadata harvesting \citep[OAI-PMH;][]{std:OAIPMH} for
a conceptual framework and the metadata exchange protocol, and
Dublin Core \citep{std:RFC5013} for a basis on which to build the
metadata model.

Central to OAI-PMH is the notion of a \emph{unique identifier}, which
``unambiguously identifies an item within'' the set of resources.
Other than that these should be URIs \citep{std:RFC2396}, OAI-PMH does
not state details on how they should be formed.  For VO
resources, \citet{std:VOID} prescribes the use of IVOA resource names or
\emph{IVORNs}.
In short, these are URIs with a scheme of \texttt{ivo},
an authority part as discussed in section~\ref{sect:auths}, and a local
part governed by some reasonable restrictions on which characters are
allowed to occur.

A somewhat subtle but nevertheless important distinction made in OAI-PMH
is between a resource and a \emph{resource record} containing its
description.
To see that this distinction has actual consequences, say
the data collection X contains spectra obtained using the spectrograph
S; the resource record R describes X.  Now,
during the lifetime of the instrument, S will add new data to X on every
clear night, which means the resource changes.  Nevertheless, in the
current VO R will not change (though it is conceivable that it will be
updated now and then, e.g., as the description might contain
rough estimates on the number of datasets contained in X).

For the converse scenario of a changing resource record with a constant
resource, suppose S is now decommissioned, while the standard defining
the content of the resource record is updated to include the spatial
coverage of the data collection.  Now, R needs an update without X
changing.

As stated above, OAI-PMH defines that the unique identifiers -- and hence
the VO's IVORNs -- always reference resource records.  As to how the
resources themselves should be referenced, OAI-PMH declares that the
``nature of a resource identifier is outside the scope.''  This
reservation is motivated by the library use case, where a single book
might be described by different libraries and hence have multiple
resource records.  The libraries must agree on a common identifier for the
book to see that the different resource records actually all describe
the same book, but OAI-PMH did not want to endorse any particular
mechanism for this.

In the IVOA, it was expected that such complications would not arise as
the resource records would almost always come from the resource
publishers themselves, and no need for multiple resource records for a
single resource was foreseen.  It was therefore decided that the IVORN of
a resource record should also identify the resource itself.

This explains some duplication of information in OAI-PMH messages in the
VO.  Awareness of the distinction is relevant to registry users to
understand the meaning of the creation or update times in the resource
record (which refer to the record itself) and the dates and times given
in the curation/date child of the resource record, which pertain to the
resource.

As to the content of the resource record, very early on in the history
of the VO the desired pieces of metadata were collected, which led
to \citet{std:RM}.  This then was translated into an XML
representation in several standards, in particular VOResource
\citep{std:VOR} and VODataService \citep{std:VODS11} as well as several
extensions for particular types of resources or services.


\section{Registries}
\label{sect:registries}

Having a set of resource records alone is not enough to build a useful
system, even if they already are in a standard format.  There must also
be ways in which users can locate records of the resources relevant to
them within this set.  Therefore, systems are required that enable
service operators to feed their resource records into
the set.  Also, users must have a way to execute queries against the set.
Both requirements
are covered by \emph{registries} within the VO.

Given the VO's highly diverse and distributed structure, it is evident
that a distributed system is required, in which neither requirement
can be made the task of a single entity.
Instead, every publisher can
run their own \emph{publishing registry}.  This is a service exposing
the collection of this publisher's resource records.

Conceivably, a user looking for a resource matching some constraints
could now query each publisher's publishing registry in turn to obtain a
list of all matching VO resources.  This architecture obviously will not
scale well with the number of publishers.  It also introduces many
points of failure into the system, as all publishers would have to keep
their registries highly available to avoid a severe degradation of the
whole system.

Therefore, retrieving resource records from the publishing registries,
joining the sets of resource records thus obtained, and offering means
of querying this joined set to VO users is the task of a specialized
agent, a \emph{searchable registry}.  The process of retrieval of
resource records by a searchable registry is known as \emph{harvesting}.
To allow this harvesting, publishing and searchable registries must
agree on a common protocol.
As mentioned in section~\ref{sect:recs}, the adoption of OAI-PMH already
defined such a protocol for the VO.

A secondary distinction between searchable registries is between
\emph{full registries} (the term ``searchable'' is usually implied in
this case), which strive to harvest all publishing registries
in the VO, and \emph{local searchable registries}, which only carry a
selection of records.  An example of the second kind that is currently
in discussion is an ``educational'' registry that contains a manually
curated subset of services delivering data suitable for classroom use
(i.e., data of moderate size, with easily understood data types, etc.).
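
As a rough illustration of harvesting, a searchable registry might issue
a request like the following to a publishing registry.  The host name
and path are hypothetical; \texttt{verb}, \texttt{ListRecords}, and
\texttt{metadataPrefix} are part of OAI-PMH, with the value of
\texttt{metadataPrefix} selecting the format of the records returned
(here, the VO's own resource record format):

\begin{verbatim}
http://registry.example.org/oai?verb=ListRecords&metadataPrefix=ivo_vor
\end{verbatim}

The response is an XML document containing the resource records held by
the publishing registry, possibly split into several pages.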

The actual application of
OAI-PMH within the VO is described in \citet{std:RI1}, which in
particular defines that the VO's own resource record format is selected
in OAI-PMH using a metadata prefix of \texttt{ivo\_vor}.  VO registries
are, however, also required to emit the much simpler Dublin Core metadata
records on request and are thus interoperable with bibliographic
services outside of the VO.

One additional building block needs to be mentioned, the Registry of
Registries or RofR for short \citep{std:RofR}.  This is a special
publishing registry from which searchable registries can harvest the set
of available publishing registries.  As such, it is a single point of
failure, as there is only one such service globally.  On the other hand,
as no client code directly accesses the RofR, an outage does not impair
the user-visible functionality of the VO.  The main impact would be that
no new publishing registries could be added to the VO's registry system,
and existing registries' endpoints would have to be discovered from
searchable registries.

In the current VO, the RofR also doubles as the publishing
registry for standards and other resources managed by IVOA, and it
operates a service for validating the content of publishing registries
(cf.~section~\ref{sect:validation}).


\section{Harvesting}
\label{sect:harvesting}

The VO registry system is de-centralized in both directions: A given
publishing registry does not know which searchable registries will
eventually carry its records.  An implication of this is that it cannot
notify the searchable registries when a resource record changes.  This, in
turn, implies that each searchable registry has to poll the
publishing registries it harvests.
This is not entirely trivial, as
the largest publishing registry in
the VO currently emits more than 100 Megabytes of resource records, and
due to paging and other delays the transfer takes about 10 minutes.

On the other hand, to keep up to date, searchable registries should poll
the publishing registries with a fairly high frequency.  Most active
searchable registries today poll once or twice a day.  To nevertheless
keep network and CPU load low, OAI-PMH supports \emph{incremental
harvesting}.  This allows searchable registries to query publishing
registries for records updated since some point in time.

A common harvesting strategy is that searchable registries persist the
date and time of the last harvest and, on re-harvesting, query the
publishing registry for records updated since then.  Together with a
very natural-seeming (but incorrect) implementation on the part of the
publishing registry, this can lead to a loss of records with
incremental harvesting.

To see how this happens, consider a publishing registry P that, as is
usual, keeps the updated dates of its resources in a database table
to facilitate quick responses to OAI-PMH queries with
date constraints.  Now say a new resource record R is created at $t_1$
and its updated
attribute accordingly is set to $t_1$ in the record itself.  For one
reason or another, the program that ingests the updated dates for the
record into the database table does not run immediately.

At $t_2>t_1$, a searchable registry S harvests P and memorizes $t_2$ as
the date of the last harvest.  As the database table does not contain R
yet, R is not harvested.

At $t_3>t_2$, the program ingesting R into the database table
is finally run, but the timestamp is taken from the resource record,
i.e., it is $t_1$.
Now, when S comes back for an incremental harvest at
$t_4>t_3$, it will ask for records updated after $t_2$, which, as
\$t_1