# Contents of /trunk/projects/registry/acregistry/registry.tex

Revision 2512 - (show annotations)
Tue Apr 8 10:25:53 2014 UTC (7 years, 3 months ago) by volute@g-vo.org
File MIME type: application/x-tex
File size: 17324 byte(s)
A&C registry: incremental harvesting whining.


}

\author{Gretchen Greene}
\author[ari]{Markus Demleitner}
\author[obspm]{Pierre LeSidaner}
\author[ncsa]{Ray Plante}

\address[stsci]{Space Telescope Science Institute}
\address[obspm]{Observatoire de Paris FIXME}
\address[ncsa]{NCSA FIXME}
\address[ari]{Universit\"at Heidelberg,
Astronomisches Rechen-Institut, M\"onchhofstra\ss e 12-14, 69120 Heidelberg}

\begin{abstract}
In the Virtual Observatory, the Registry provides the mechanism with which
users and applications can discover and select resources -- e.g., data
and services -- that are relevant for a particular scientific problem.
This article describes the design and operation of the system TBD.
\end{abstract}

\begin{keyword}
%% keywords here, in the form: keyword \sep keyword

%% MSC codes here, in the form: \MSC code \sep code
%% or \MSC[2008] code \sep code (2000 is the default)

\end{keyword}

\end{frontmatter}

% \linenumbers

\section{Introduction}
\label{sec:intro}

A Registry is a system within the IVOA which provides a directory of
published resources for data collections provided throughout the
astronomical community. There is a network of registries in the IVOA
which host collections, provide searching and publication of
collections, and follow standard protocols for exchanging information
between them. Each registry contains a very rich set of descriptive
metadata for the hosted collections defined by a set of IVOA registry
standards for data services, astronomical resources and various
standards described by astronomical data models. Registries are
critical repositories for the implemented standards developed by the
IVOA community. Science data discovery tools rely on registry access to
efficiently perform searches across the distributed archives to find
data provider services which are discoverable through standard
application program interfaces (APIs).
A fully searchable registry also
provides a resource publication system for individual or organizational
data providers to publish metadata describing their data resources and
services. This metadata includes the use of standard IVOA protocols for
images, spectra, and catalogs, in addition to a wide range of more
general information. Another key aspect of registries is resource
curation. The IVOA registries are routinely validated and curated to
ensure consistency with IVOA standards. Users can publish, fix and
update resource records; records and the services described are
validated against standards; and errors are actively reported to
responsible parties.


\section{Resources and Resource Records}

The Virtual Observatory can be seen as a collection of \emph{resources}.
\citet{std:RMI} defines a VO resource as a ``VO element that can be
described in terms of who curates or maintains it and which can be given
a name and a unique identifier,'' and names sky coverages, instrumental
setups, organisations, or data collections as examples. In practice,
over 95\% of resources in the current VO are services.

From the outset, it was clear that a common way of describing these
resources would be required as a very basic building block for
interoperability. For instance, VO client programs need to be able to
find out what protocols a service supports and at what ``endpoints'' --
typically, HTTP URLs -- they are available, and scientists should have
reliable and standardized ways to work out who to reference, who to
report bugs to, and so on.

Fortunately, the VO did not have to develop the technology to support
such descriptions itself, as library sciences have worked on very
comparable problems for centuries already.
\redcomment{Say something
about Dublin Core, \citet{std:RFC5013}, and introduce the \emph{resource
record}}

A somewhat subtle distinction resulting from these concepts is that a
resource is different from the resource record. To illustrate this, say
the data collection X contains spectra obtained using the spectrograph
S; the resource record R gives a brief description of X. Now,
during the lifetime of the instrument, S will add new data to X on every
clear night, which means the resource changes. Nevertheless, in the
current VO, R will not change (though it is conceivable that it will be
updated now and then, e.g., to update rough numbers on the number of
datasets contained in X).

For the converse scenario of a changing resource record with a constant
resource, suppose S is now decommissioned, while the standard defining
the content of the resource record is updated to include the spatial
coverage of the data collection. Now, R needs an update without X
changing.

In the VO, the decision was made to hide the distinction between
resources and resource records to a large degree. In particular, the
identifiers in use in the VO, the IVORNs, always refer to both at the
same time. This has worked surprisingly well. The only place where the
distinction is actually relevant in the current VO architecture is in
the notation of times; both in OAI-PMH and in the resource element, the
concepts of creation or update times refer to the resource record, whereas
modification times of the resource itself can be communicated in the
curation/date child of resource records.

As to the content of the resource record, very early on in the history
of the VO the desired pieces of metadata were collected, which led up
to \citet{std:RMI}.
This then was translated into an XML
representation in several standards, in particular VOResource
\citep{std:VOR} and VODataService \citep{std:VODS} as well as several
extensions for particular types of resources or services. This XML
representation is used in several service interfaces \citep[e.g.,][]{std:VOSI}
as well as for communication between registries.


\section{Registries}

Having a set of resource records alone is not enough to build a useful
system, even if they already are in a standard format. There must also
be ways in which users can locate records of the resources relevant to
them within this set. To enable this, there need to be systems in
place that allow service operators to feed their resource records into
the set and users to execute queries against them. Both requirements
are covered by \emph{registries} within the VO.

Given the VO's highly diverse and distributed structure, it is evident
that a distributed system is required, in which neither requirement
can be made the task of a single entity. Instead, every publisher can
run their own \emph{publishing registry}. This is a service exposing
the collection of this publisher's resource records.

Conceivably, a user looking for a resource matching some constraints
could now query each publisher's publishing registry in turn to obtain a
list of all matching VO resources. This architecture obviously will not
scale well with the number of publishers. It also introduces many
points of failure into the system, as all publishers would have to keep
their registries highly available to avoid a severe degradation of the
whole system.
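
As a rough illustration of the availability concern, assume (hypothetically)
that each of $N$ publishing registries is reachable with probability $p$
at any given time, independently of the others. A query that has to
contact every one of them then obtains a complete answer with probability
\[
  P_{\mathrm{complete}} = p^N .
\]
For example, with the illustrative values $p = 0.99$ and $N = 50$, this
gives $P_{\mathrm{complete}} = 0.99^{50} \approx 0.61$, so roughly two
out of five queries would miss at least one publisher. These numbers are
assumptions chosen for illustration, not measurements of the actual VO.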

Therefore, retrieving resource records from the publishing registries,
joining the sets of resource records thus obtained, and offering means
of querying this joined set to VO users is the task of a specialized
agent, the \emph{searchable registry}. The process of retrieval of
resource records by a searchable registry is known as \emph{harvesting}.
To allow this harvesting, publishing and searchable registries must
agree on a common protocol. Again, the VO could build on work done by
the bibliographic community, which had already defined
OAI-PMH\footnote{The Open Archives Initiative Protocol for
Metadata Harvesting, \citet{std:OAIPMH}}. The actual application of
OAI-PMH within the VO is described in \citet{std:RI1}, which in
particular defines that the VO's own resource record format is selected
in OAI-PMH using a metadata prefix of \texttt{ivo\_vor}. VO registries
are, however, also required to emit the much simpler Dublin Core metadata
records on request and are thus interoperable with bibliographic
services outside of the VO.


\section{Harvesting}

The VO registry system is decentralized in both directions: A given
publishing registry does not know which searchable registries will
eventually carry its records. An implication of this is that it cannot
tell the searchable registry when a resource record changes. This, in
turn, implies that a searchable registry will have to poll the
publishing registries it harvests. The largest publishing registries in
the VO currently emit more than 100 megabytes of resource records, and
due to paging and other delays the transfer takes about 10 minutes.

On the other hand, to keep up to date, searchable registries should poll
the publishing registries with a fairly high frequency. Most active
searchable registries today poll once or twice a day.
To nevertheless
keep network and CPU load low, OAI-PMH supports \emph{incremental
harvesting}. This allows searchable registries to query publishing
registries for records updated since some point in time.

A common harvesting strategy is that searchable registries memorize the
date and time of the last harvest and, on re-harvesting, query the
publishing registry for records updated since then. This strategy
unfortunately puts rather severe correctness requirements on the
publishing registries to prevent losing records due to race conditions.
To see why, consider the publishing registry P that contains the
resource record R, which is edited at $t_1$, with its updated attribute
set to $t_1$. At $t_2>t_1$, a searchable registry S harvests P and
memorizes $t_2$ as the date of the last harvest. P typically responds
from a database table containing metadata on the resource records. As
the row for R has not been updated yet, R is not harvested.

At $t_3>t_2$, R's metadata is ingested into the metadata table. The
updated timestamp is taken from the resource record, i.e., it is $t_1$.
Now, when S comes back for an incremental harvest at $t_4>t_3$, it will
ask for records updated after $t_2$, which, as $t_1<t_2$