
Diff of /trunk/projects/registry/acregistry/registry.tex


revision 2511 by volute@g-vo.org, Tue Apr 8 09:48:58 2014 UTC revision 2512 by volute@g-vo.org, Tue Apr 8 10:25:53 2014 UTC
# Line 205  Line 205 
206  \section{Harvesting}  \section{Harvesting}
208  Incremental (Races!), Validation, standardId  The VO registry system is de-centralized in both directions: A given
209    publishing registry does not know which searchable registries will
210    eventually carry its records.  An implication of this is that it cannot
211    tell the searchable registry when a resource record changes.  This, in
212    turn, implies that a searchable registry will have to poll the
213    publishing registries it harvests.  The largest publishing registries in
214    the VO currently emit more than 100 Megabytes of resource records, and
215    due to paging and other delays the transfer takes about 10 minutes.
217  Resource Types,  On the other hand, to keep up to date, searchable registries should poll
218    the publishing registries with a fairly high frequency.  Most active
219    searchable registries today poll once or twice a day.  To nevertheless
220    keep network and CPU load low, OAI-PMH supports \emph{incremental
221    harvesting}.  This allows searchable registries to query publishing
222    registries for records updated since some point in time.
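As a minimal sketch of what such an incremental query could look like, the following builds an OAI-PMH ListRecords request restricted by a from timestamp. The endpoint URL is hypothetical, and the defaults (metadata prefix ivo_vor, set ivo_managed) are assumptions about a typical VO harvest rather than something this text prescribes.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def incremental_harvest_url(base_url, last_harvest,
                            metadata_prefix="ivo_vor",
                            set_spec="ivo_managed"):
    """Build a ListRecords URL for records updated since last_harvest.

    base_url is a hypothetical OAI-PMH endpoint; the prefix and set
    defaults are assumptions for illustration.
    """
    # OAI-PMH expects UTC timestamps such as 2014-04-08T09:48:58Z
    stamp = last_harvest.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": stamp,
    })
    return f"{base_url}?{query}"


url = incremental_harvest_url(
    "http://registry.example.org/oai",  # hypothetical endpoint
    datetime(2014, 4, 8, 9, 48, 58, tzinfo=timezone.utc))
print(url)
```

A real harvester would additionally have to follow resumption tokens for paged responses; that is left out here.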
224    A common harvesting strategy is that searchable registries memorize the
225    date and time of the last harvest and, on re-harvesting, query the
226    publishing registry for records updated since then.  This strategy
227    unfortunately puts rather severe correctness requirements on the
228    publishing registries to prevent losing records due to race conditions.
229    To see why, consider the publishing registry P that contains the
230    resource record R, which is edited at $t_1$, with its updated attribute
231    set to $t_1$.  At $t_2>t_1$, a searchable registry S harvests P and
232    memorizes $t_2$ as the date of the last harvest.  P typically responds
233    from a database table containing metadata on the resource records.  As
234    the row for R has not been updated yet, R is not harvested.
236    At $t_3>t_2$, R's metadata is ingested into the metadata table.  The
237    updated timestamp is taken from the resource record, i.e., it is $t_1$.
238    Now, when S comes back for an incremental harvest at $t_4>t_3$, it will
239    ask for records updated after $t_2$.  As $t_1<t_2$, R does not match
240    that condition.  Hence, S misses the update and keeps serving the
241    stale version of R.
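The scenario can be replayed with a toy model. This is only a sketch under simplifying assumptions: times are plain numbers, P's metadata table is a dictionary, and the names Row and list_records are made up for illustration. The time 0.5 stands for a hypothetical earlier harvest by S.

```python
from dataclasses import dataclass


@dataclass
class Row:
    identifier: str
    updated: float  # the timestamp P serves to harvesters


def list_records(table, since):
    """Return identifiers of rows with an updated stamp at or after since."""
    return [row.identifier for row in table.values()
            if row.updated >= since]


t0, t1, t2, t3, t4 = 0.0, 1.0, 2.0, 3.0, 4.0
previous_harvest = 0.5  # hypothetical earlier harvest by S

# P's metadata table still holds the old version of R (stamp t0),
# although R was already edited at t1.
table = {"R": Row("R", updated=t0)}

# t2: S harvests; the row for R has not been updated yet.
seen_at_t2 = list_records(table, since=previous_harvest)
last_harvest = t2

# t3: R is ingested; the stamp is taken from the record itself (t1).
table["R"] = Row("R", updated=t1)

# t4: S asks for everything updated since t2.
seen_at_t4 = list_records(table, since=last_harvest)

print(seen_at_t2, seen_at_t4)  # R appears in neither harvest
```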
243    What sounds like a fairly exotic scenario has in practice haunted the VO
244    registry for quite some time and keeps doing so, even though some
245    mitigation is possible by harvesters using the time of the last-but-one
246    harvest.  Still, user-visible differences between the content of
247    different searchable registries are frequently due to race conditions of
248    this type.  Several solutions are possible: publishing registries
249    should use the ingestion time as the updated timestamp of their records,
250    harvesters could do full re-harvests every time, or OAI-PMH could be
251    fixed to use a more robust mechanism.  Meanwhile, incremental harvesting is a
252    curation challenge not to be underestimated, with further subtleties
253    mentioned below.
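To illustrate two of these options, here is a small sketch using plain numbers as times; the function name matches and the concrete values are assumptions for illustration only. It compares what an incremental query at $t_4$ returns when the served timestamp is the record's own ($t_1$) or the ingestion time ($t_3$), and when the harvester asks since the last harvest ($t_2$) or the last-but-one harvest (here assumed to be at 0.5).

```python
def matches(updated, since):
    """True if a record with this updated stamp is returned for since."""
    return updated >= since


t1, t2, t3 = 1.0, 2.0, 3.0  # edit, last harvest, ingestion
last_but_one = 0.5          # hypothetical harvest before t2

# Record's own timestamp, query since the last harvest: R is missed.
missed = not matches(updated=t1, since=t2)

# Publisher-side fix: serve the ingestion time as updated stamp.
fixed_by_ingestion_time = matches(updated=t3, since=t2)

# Harvester-side mitigation: query since the last-but-one harvest.
fixed_by_last_but_one = matches(updated=t1, since=last_but_one)

print(missed, fixed_by_ingestion_time, fixed_by_last_but_one)
```

In this model the last-but-one mitigation only helps while the edit is more recent than the last-but-one harvest, whereas using the ingestion time does not depend on how long ingestion was delayed.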
258    RofR, Incremental (Races!), Validation, standardId
260    Resource Types,
261  OAI-PMH,  searchable registry, validation, sets, the  OAI-PMH,  searchable registry, validation, sets, the
262  ivo_managed set, distribution of resource types: select res_type,  ivo_managed set, distribution of resource types: select res_type,
263  count(*) from rr.resource group by res_type  count(*) from rr.resource group by res_type

