International Virtual Observatory Alliance

TAP Implementation Notes
Version 0.1

Filled in automatically

Working Group
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaGridAndWebServices
This version:
filled in automatically
Latest version:
not issued outside DAL WG
Previous version(s):
None
Authors:
Markus Demleitner
Paul Harrison
Mark Taylor

Abstract

This IVOA Note discusses several clarifications to the TAP protocol stack, i.e., to the ADQL dialect, the UWS job system, the VOSI metadata interfaces, and TAP itself. It also proposes a number of enhancements that might be incorporated in the next versions of the respective standards. The authors hope that the proposed text changes and additions can mature while in the relatively fluid note state to achieve a rapid and easy standards process later on.

Further contributions to this text are most welcome.

Status of This Document

This is an IVOA note published within the IVOA DAL working group. The first release of this document was on 2013-12-13.

(updated automatically)

A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.

Acknowledgements

Several sections of this document are based on the TAPImplementationNotes page on the IVOA wiki IVOAWIKI. Several persons contributed to its content, including Mark Taylor, Paul Harrison, Pierre LeSidaner, Tom McGlynn, and Markus Demleitner.

Contents

Introduction

The protocol stack for exchanging database queries and their results within the Virtual Observatory context is, by 2013, implemented in several software packages, both on the server and on the client side.

Several implementors found that the respective standards leave some questions open. The first purpose of this document is to collect these questions and give answers reflecting a broad consensus on the part of the implementors. The points raised in these clarifications, errata and recommendations should be addressed in future revisions of the standard texts. It is the intent of this document to serve as an evolving reference for implementors that should eventually reflect the updates to the actual standards.

With the experience gathered from roll-out and use of the protocols, several additions to (or deletions from) the standards appeared beneficial. This document collects such proposals for changes to the content of the standards. Some of these changes have been written such that neither servers nor clients break and thus are candidates for minor updates to the standards, whereas the adoption of others might require new major releases. Again, the authors plan to evolve this document to have the note reflect the eventual plans for updates to the standards.

ADQL

ADQL: Clarifications, Errata, and Recommendations

The Separator Nonterminal

The grammar given in appendix A of std:ADQL gives a nonterminal separator, expanding to either a comment or whitespace. This nonterminal, however, is only referenced within the rule for character_string_literal. It is uncontentious that the intent is to allow comments and whitespace wherever SQL1992 allows them. As the nonterminal stands in the grammar, however, the ADQL standard says otherwise, and a clarification is needed.

One option for such a clarification is to amend section 2.1 of std:ADQL with a subsection 2.1.4, "Tokens and literals", containing text like the following (taken essentially from std:SQL1992).

Any token may be followed by a separator. A nondelimiter token shall be followed by a delimiter token or a separator.

Since the full rules for the separator are somewhat more complex in std:ADQL, an attractive alternative could be to omit the separator nonterminal from the grammar and to just note:

Whitespace and comments can occur wherever they can occur in std:SQL1992.
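For illustration (and not as part of the proposed normative text), such a clarification would make explicit that comments and whitespace may appear between arbitrary tokens, as in the following sketch against a hypothetical table:

    -- select a few bright sources
    SELECT TOP 10
        ra, dec  -- positions in degrees
    FROM mycat.main
    WHERE mag < 10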

Accepted as errata - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that this item should be included in the errata note for the current, std:ADQL-20081030, version of the standard.

Type System

The ADQL specification does not explicitly talk about types. Some intentions regarding types can be taken from the grammar (e.g., the lack of a boolean type), but it is clear that for a predictable behaviour across individual ADQL implementations, ADQL should talk about types. The TAP specification has already covered most of the ground here, with a table on PDF page 19 in version 1.0. The following proposal mainly builds on this.

To introduce a notion of types into section 2 of the ADQL recommendation, it should be amended with a subsection 2.6, "ADQL Type System", as follows:

ADQL defines no data definition language (DDL). It is assumed that table definition and data ingestion are performed in the backend database's native language and type system.

However, column metadata needs to give column types in order to allow the construction of queries that are both syntactically and semantically correct. Examples of such metadata include VODataService's vs:TAPType std:VODS11 or TAP's TAP_SCHEMA. Services SHOULD, if at all possible, try to express their column metadata in these terms even if the underlying database employs different types. Services SHOULD also use the following mapping when interfacing to user data, either by serializing result sets into VOTables or by ingesting user-provided VOTables into ADQL-visible tables. Where non-ADQL types are employed in the backend, implementors SHOULD make sure that all operations that are possible with the recommended ADQL type are also possible with the type used in the backend engine. For instance, the ADQL string concatenation operator || should be applicable to all columns resulting from VOTable char-typed columns.

VOTable                                       ADQL
datatype       arraysize   xtype              type
boolean        1                              implementation defined
short          1                              SMALLINT
int            1                              INTEGER
long           1                              BIGINT
float          1                              REAL
double         1                              DOUBLE
(numeric)      > 1                            implementation defined
char           1                              CHAR(1)
char           n*                             VARCHAR(n)
char           n                              CHAR(n)
unsignedByte   n*                             VARBINARY(n)
unsignedByte   n                              BINARY(n)
unsignedByte   n, *, n*    adql:BLOB          BLOB
char           n, *, n*    adql:CLOB          CLOB
char           n, *, n*    adql:TIMESTAMP     TIMESTAMP
char           n, *, n*    adql:POINT         POINT
char           n, *, n*    adql:REGION        REGION

"Implementation defined" in the above table means that an implementation is free to reject attempts to (de-) serialize values in these types. They are to be considered unsupported by ADQL, and the language provides no means to manipulate "native" representations of them.

References to REGION-typed columns must be valid wherever the ADQL region nonterminal is allowed. References to POINT-typed columns must be valid wherever the ADQL point nonterminal is allowed.

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that this item should be discussed further, with a view to including it in the next (minor) version of the std:ADQL standard.

DATETIME

The term TIMESTAMP has additional meanings above and beyond the simple meaning of a date and time, some of which impose additional constraints on the range of values that can be represented.

With this in mind, we would like to propose replacing the terms TIMESTAMP and adql:TIMESTAMP in the preceding section with DATETIME and adql:DATETIME.

Rejected - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed to keep the existing term, TIMESTAMP, to maintain compatibility with the original term defined in std:SQL1992.

Empty Coordinate Systems

The legal values and the semantics of the first arguments to the geometry constructors (POINT, BOX, CIRCLE, POLYGON) have been left largely open by the ADQL standard. The TAP standard clarified those somewhat to the effect that the prescriptions became implementable. On the other hand, the only thing clients can reasonably expect according to TAP (on a recommendation basis) from a server is one of four reference frames. Compared to the implementation effort and the potential for user confusion, the additional expressiveness gained by keeping the first argument seems minute. Even allowing more expressive system strings will not help the feature much, since non-trivial transformations (e.g., between reference positions) will need more data than merely the celestial coordinates available to the geometry constructors.

We therefore propose to deprecate the first argument in a point release of ADQL. In the next major release, the first argument as defined in ADQL2 should be declared as ignored. The standard should require constructors both with and without the current first argument, though, in order to ensure backward compatibility for ADQL2 queries.

To implement the first step, we propose replacing the second paragraph on PDF page 10 of std:ADQL (starting with "For all these functions...") with:

For historical reasons, the geometry constructors (BOX, CIRCLE, POINT, POLYGON) require a string-valued first argument. It was intended to carry information on a reference system or other coordinate system metadata. In this version, we recommend ignoring this first argument, and clients are advised to pass an empty string here. Future versions of this specification will make this first, string-valued parameter optional for the listed functions.

In consequence, the COORDSYS function would be taken out of the enumeration on PDF page 9, and its description on PDF page 11 would be removed, too. All examples would use an empty string rather than "ICRS GEOCENTER" -- which is not contained in the TAP clarification anyway -- as in the current text.
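For illustration, a typical region constraint would then be written with empty strings as the first arguments, as in this sketch (table and column names are placeholders):

    SELECT ra, dec
    FROM mycat.main
    WHERE 1 = CONTAINS(
        POINT('', ra, dec),
        CIRCLE('', 83.63, 22.01, 0.1))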

A library of standard generalized user defined functions (see section af-genudf) could provide for simple conversion between reference frames as well as more demanding transformations, e.g., between epochs or reference positions. This, however, depends on allowing geometry-valued user defined functions and is outside of the scope of a clarification. See also section af-genudf.

Requires further discussion - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the proposal needs more work done on it before it could be included in the std:ADQL standard.

Explanation of optional features

We would like to propose adding a section of text to both the std:TAP and std:ADQL specifications that clarifies the optional/required status of the geometric functions, and explains how tr:languageFeatures elements from the std:TAPREGEXT schema extension can be used to describe which of the geometric functions are supported by a particular TAP service.

In the current release documents

  • Section 1.2.1 of std:TAP-20100327 states that "Support for ADQL queries is mandatory".
  • Section 2.4 of std:ADQL-20081030 describes the geometric functions, and section 2.5 describes support for user defined functions.
However, the current std:TAP-20100327 and std:ADQL-20081030 specifications do not describe how tr:languageFeatures elements from the std:TAPREGEXT-20120827 schema extension may be used to describe which of the geometric functions are supported by a service.

The description of the features-adqlgeo feature in the std:TAPREGEXT-20120827 schema extension implies that some of the geometric functions may be optional. However, the current text refers the user back to the std:TAP-20100327 specification for details of which combinations are permitted.

"support for these functions is in general optional for ADQL implementations, though TAP imposes some constraints on what combinations of support are permitted"

The proposed changes would not alter the technical details of any of the specifications. The aim is just to add some additional explanations, references and examples.

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that this item should be discussed further, with a view to including it in the next (minor) version of the std:ADQL standard.

ADQL: Proposed New Features

Simple Crossmatch Function

Since a simple positional crossmatch is such a common operation, we should define a function CROSSMATCH(ra1, dec1, ra2, dec2, radius) -> INTEGER returning 1 if (ra1, dec1) and (ra2, dec2) are within radius degrees of each other. This allows more compact expressions than the conventional CONTAINS(POINT, CIRCLE) construct, and ADQL to SQL translators can more easily exploit special constructs for fast crossmatching that may be built into the backend databases.
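A sketch of the intended use, pairing two hypothetical catalogues within one arcsecond (the function itself is only a proposal at this point):

    SELECT a.id, b.id
    FROM cat_a AS a
    JOIN cat_b AS b
      ON CROSSMATCH(a.ra, a.dec, b.ra, b.dec, 1.0/3600.0) = 1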

Requires further discussion - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the proposal needs more work done on it before it could be included in the std:ADQL standard.

  • It was agreed that this would be a useful feature for end users
  • It was noted that adding this feature could be difficult to implement
  • It was noted that part of the rationale for the IVOA services was to implement difficult things on the server side, making things easier for the end user

No Type-based Decay of INTERSECTS

Section 2.4.11 of std:ADQL stipulates that a call to INTERSECTS should decay to a CONTAINS when one argument is a POINT. This rule is a major implementation liability for simple translators, since it is the only place in the ADQL specification that actually requires a type calculus. For a feature that does not actually add functionality, this seems a high price to pay.

We therefore recommend to strike the text from "Note that if one of the arguments" through "equivalent to INTERSECTS(b,a)" and add at the end for 2.4.11:

The arguments to INTERSECTS SHOULD be geometric expressions evaluating to either BOX, CIRCLE, POLYGON, or REGION. Previous versions of this specification allow POINTs as well and require servers to interpret the expression as a CONTAINS with the POINT moved into the first position. Servers SHOULD still implement that behaviour, but clients SHOULD NOT expect it. It will be dropped in the next major version of this specification.
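To illustrate the behaviour being deprecated: under ADQL 2.0, a server must evaluate the first of the following (equivalent) expressions as the second; under the proposed text, clients should simply write the CONTAINS form themselves (column names are placeholders):

    INTERSECTS(POINT('', ra, dec), CIRCLE('', 10, 20, 1)) = 1
    CONTAINS(POINT('', ra, dec), CIRCLE('', 10, 20, 1)) = 1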

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the proposed text should be included in the next (minor) version of the std:ADQL standard.

Generalized User Defined Functions

Currently, user defined functions may only return numbers or strings (in terms of the grammar, only numeric_value_function and string_value_function can expand to user_defined_function). Many interesting functions (e.g., coordinate transforms, applying proper motions) are extremely inconvenient to define with such a restriction. Therefore, we propose to add | <user_defined_function> to the right hand side of the geometry_value_function rule.

With this, we could define some standard functions for manipulating geometries; these should be defined in the standard, but they could remain optional. Clients can determine their availability using std:TAPREGEXT.

A future version of this note will propose a library of such functions, including proper motion, precession, and system transformation.
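As a purely hypothetical sketch of what a geometry-valued user defined function might look like once the grammar change is in place (neither the function name nor its signature are defined anywhere yet):

    SELECT *
    FROM mycat.main
    WHERE 1 = CONTAINS(
        udf_epoch_prop(POINT('', ra, dec), pmra, pmdec, 2000.0, 2015.5),
        CIRCLE('', 83.63, 22.01, 0.1))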

Accepted as errata - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that there should be no restriction on the return types of User Defined Functions.
It was agreed that this should be included in the errata note for the current, std:ADQL-20081030, version of the standard.

Further discussion - It was also noted that the SimDAL working group would like to be able to define table value functions in std:ADQL.
It was agreed to continue the discussion to find a way of adding support for table value functions in a future version of the std:ADQL standard.

Case-Insensitive String Comparisons

ADQL currently has no facility reliably allowing case-insensitive string comparisons. This is particularly regrettable since UCDs and at least the majority of the defined utypes are to be compared case-insensitively.

Thus, we propose the addition of a string function LOWER and the case-insensitive variant of LIKE, ILIKE. Since case folding is a nontrivial operation in a multi-encoding world, ADQL would only require standard behaviour for the ASCII characters (which would suffice for UCDs and utypes) and only recommend following algorithm R2 in section 3.13, "Default Case Algorithms" of std:UNICODE outside of ASCII.

The grammar changes are trivial.
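A minimal sketch of the intended use, selecting columns from TAP_SCHEMA by UCD regardless of case (the UCD values are only illustrative):

    SELECT table_name, column_name
    FROM TAP_SCHEMA.columns
    WHERE LOWER(ucd) = 'phot.mag;em.opt.v'
       OR ucd ILIKE 'pos.eq.%'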

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the following functions should be included as an optional feature in the next (minor) version of the std:ADQL standard.

  • UPPER
  • LOWER
It was agreed that the following operator should be included as an optional feature in the next (minor) version of the std:ADQL standard.
  • ILIKE

Set Operators

ADQL 2.0 does not support any of the SQL UNION, EXCEPT and INTERSECT operators. Since at least set union and intersection are basic operations of relational algebra and combining data from several tables is an operation of significant practical use, this is a serious deficit. Also, there is probably no backend SQL system that does not support these operations.

Thus, to add minimal support of set operations to ADQL, ADQL systems will mainly need to update their grammars. The following rules, adapted from std:SQL1992, will suffice (the query_expression rule replaces the one given in the current grammar, all others are new rules):

    <query_expression> ::=
          <non_join_query_expression>
        | <joined_table>

    <non_join_query_expression> ::=
          <non_join_query_term>
        | <query_expression> UNION  [ ALL ] <query_term>
        | <query_expression> EXCEPT [ ALL ] <query_term>

    <query_term> ::=
          <non_join_query_term>
        | <joined_table>

    <non_join_query_term> ::=
          <non_join_query_primary>
        | <query_term> INTERSECT [ ALL ] <query_primary>

    <query_primary> ::=
          <non_join_query_primary>
        | <joined_table>

    <non_join_query_primary> ::=
          <simple_table>
        | <left_paren> <non_join_query_expression> <right_paren>

This leaves out the CORRESPONDING specifications of SQL92, and it still does not include VALUES and explicit table specifications (which would enter through non_join_query_primary) in ADQL. None of these seem indispensable, although one could probably make a case for VALUES.
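A sketch of how such a set operation might look in practice, combining positions from two tables with compatible columns (table names are placeholders):

    SELECT ra, dec FROM survey_a.main WHERE rmag < 15
    UNION
    SELECT ra, dec FROM survey_b.main WHERE rmag < 15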

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the following operators should be included in the next (minor) version of the std:ADQL standard.

  • UNION
  • EXCEPT
  • INTERSECT
It was agreed that the text describing the set operators in the std:ADQL standard should include the following caveats.
  • The set operands MUST produce the same number of columns
  • The corresponding columns in the operands MUST have the same data types
  • The corresponding columns in the operands SHOULD have the same metadata
  • The metadata for the results SHOULD be generated from the left-hand operand

Adding a Boolean Type

Having a boolean type in ADQL could make some expressions nicer (e.g., it could eliminate the comparison against 1 for the geometry predicate functions). However, adding boolean functions and allowing references to boolean columns complicates catching syntax errors significantly, since expressions like WHERE colref would then parse and only would raise an error when it turns out that colref does not refer to a boolean column. Simple ADQL translators may not be able to verify this.

We therefore propose to add a boolean type to the ADQL type system (see section ac-typesystem) without any grammatical support for it. However, the standard prose should be amended to contain:

If the backend database contains columns of type boolean, a comparison of those against the literal strings True and False must be true and false when the column is true and false, respectively. The comparison to other literals is undefined by this specification. Clients should note that the strings have to be entered exactly as given here, without changing case, adding whitespace, or any other modification.
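For illustration, with a boolean-typed backend column named is_primary (a hypothetical name), a query would then select on it like this:

    SELECT * FROM mycat.main WHERE is_primary = 'True'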

If this change is adopted, the type system table given in section ac-typesystem should be updated; luckily, the VODataService specification underlying VOSI already allows BOOLEAN as a TAPType. In the table row for VOTable boolean, "implementation defined" should be replaced with "BOOLEAN".

Requires further discussion - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that although making these changes would be a good thing, more work needs to be done on identifying and solving potential compatibility issues before the changes can be included in the std:ADQL standard.

  • It was agreed that BOOLEAN data type would be a useful feature to add
  • It was agreed that changing the return type of CONTAINS() to be a BOOLEAN would make it easier to use
  • It was agreed that making these changes could cause compatibility issues which could not be addressed in a (minor) increment of the std:ADQL standard
  • It was agreed that both of these changes should be considered for a future (major) increment of the std:ADQL standard

Casting to Unit

ADQL translators can typically introspect the tables they operate on, and thus can typically infer the (physical) unit of a column. Manually converting units (as in col_in_deg*3600) is error-prone, and expressions like that make it almost impossible to infer the unit of the result.

This problem is addressed by the introduction of a function IN_UNIT(expr, <character_string_literal>); the second argument has to be a literal in order to make sure that an ADQL translator has access to its value; this value must be in the format defined by std:VOUNIT. The intended functionality is that the translator replaces the function call with a new expression that is expr given in the unit defined by the second argument if the translator can figure out expr's unit, and it knows how to convert values in one unit into another. In every other case, the query must be rejected as erroneous.
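A sketch of the intended use, assuming a column pmra whose declared unit is mas/yr; the translator would fold the conversion factor into the generated SQL:

    SELECT IN_UNIT(pmra, 'arcsec/yr') AS pmra_arcsec
    FROM mycat.main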

Requires further discussion - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the proposal needs more work done on it before it could be included in the std:ADQL standard.

  • It was agreed that scaling conversions would not be difficult to implement
  • It was agreed that conversion between wavelength and frequency would be difficult to implement consistently
  • It was agreed that unit conversions would be most useful in a SELECT list
  • It was agreed that unit conversions would be most difficult to implement in a WHERE clause

Column References with UCD Patterns

In the same spirit of a function that really is a macro evaluated by an ADQL translator, we suggest a new function UCDCOL(<character_string_literal>). The character_string_literal in this case specifies a posix shell pattern (i.e., users write * for a sequence of 0 or more arbitrary chars, ? for exactly one arbitrary char, [] for a character range, and the backslash is the escape character) for a UCD. The translator replaces the entire function call with the first match of a column matching this pattern. If no such column exists, the query must be rejected as erroneous.
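For example, the following sketch would select whatever column carries a V-band magnitude UCD, whatever its actual name (the pattern and table name are only illustrative):

    SELECT UCDCOL('phot.mag;em.opt.v*')
    FROM mycat.main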

Requires further discussion - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the proposal needs more work done on it before it could be considered ready to be included in the std:ADQL standard.

Modulo operator

ADQL currently supports modulo as the MOD(x,y) function.

Many of our science users are more familiar with the x % y operator syntax.

We would therefore like to propose adding the x % y operator syntax to ADQL.

Many of the popular RDBMS platforms support both the MOD(x,y) function and the x % y operator syntax.

However, we are aware that some platforms only support the MOD(x,y) function syntax and not the x % y operator.

Adding the x % y operator syntax to ADQL would mean that some platforms would need to translate the ADQL x % y operator syntax into the MOD(x,y) function syntax.

Rejected - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the benefits of adding the operator, x % y, syntax were outweighed by the cost of the compatibility issues caused by adding a new operator to the grammar.

Bitwise operators

There are a number of fields in astronomy catalogs that contain combinations of bit flags. For example, the Quality Error Bit Flags in the OmegaCAM Science Archive encodes 32 bits of quality error information for each source detection in a single integer column.

In order to use these as filters in a WHERE clause we need to be able to perform bitwise operations on them.

    ... > -9.99995e+8
        AND
            jmksExt > -9.99995e+8
        AND (
                (jppErrBits | ksppErrBits) < 0x10000
            OR
                (jppErrBits | ksppErrBits) & 0x00800000 != 0
            )

We would therefore like to propose adding support for the four main bitwise operators, AND, OR, XOR and NOT to the ADQL language.

We would also like to propose adding support for hexadecimal literals.

    <hex_literal> ::= "0x" (<HEX_DIGIT>)+
    <HEX_DIGIT>   ::= ["0"-"9","a"-"f","A"-"F"]

Many of the popular RDBMS platforms provide the full set of bitwise operators. However, some platforms only provide a limited set of bitwise operators.

Platform BIT_AND BIT_OR BIT_XOR BIT_NOT
PostgreSQL YES YES YES YES
MySQL YES YES YES YES
MariaDB YES YES YES YES
HyperSQL YES YES YES YES
SQLServer YES YES YES YES
SQLite YES YES NO YES
Oracle YES NO NO NO
Derby NO NO NO NO

Given that support for the bitwise operators is not universal, it may be necessary to define the bitwise operators as ADQL functions in addition to the bitwise operators.

We can also define a feature in the std:TAPREGEXT specification, features-bitwise, to describe which of the bitwise functions a service implements, similar to the features-adqlgeo feature defined for the geometric functions.

The meaning of the features-bitwise feature would cover both the function syntax and the corresponding operator syntax, so a registration entry that declares support for the bitwise AND function would also imply support for the bitwise AND operator.
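Such a declaration might, for instance, look like the following TAPRegExt fragment; note that the features-bitwise type URI is not defined anywhere yet and is only a placeholder:

    <languageFeatures
        type="ivo://ivoa.net/std/TAPRegExt#features-bitwise">
      <feature>
        <form>BIT_AND</form>
        <description>Bitwise AND of two integer values</description>
      </feature>
    </languageFeatures>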

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that hexadecimal literal values should be included in the next (minor) version of the std:ADQL standard.
It was agreed that the following functions should be included as an optional feature in the next (minor) version of the std:ADQL standard.

  • BIT_AND(x, y)
  • BIT_OR(x, y)
  • BIT_XOR(x, y)
  • BIT_NOT(x)
It was agreed that the benefits of adding the operator, exp op exp, syntax for each operation were outweighed by the cost of the compatibility issues caused by adding new operators to the grammar.

CAST operator

There are a number of cases where our scientists use the CAST operator to control the type, size and precision of numerical values. An example of this is a query described in Hambly et al. (2008, MNRAS, 384, 637-662) which is useful for summarizing the contents of a selection and 'binning' the data.

With this use case in mind we would like to propose adding a limited version of the CAST operator to the ADQL language.

    CAST(<expression> AS <type>)

Note that the proposed syntax is slightly different to that of a normal ADQL function, using AS rather than a comma as the separator, and a fixed enumeration of types for the second parameter which would cover just the standard numeric types.

The proposed CAST(<expression> AS <type>) syntax is similar to that used in many of the standard RDBMS.

The proposed change does not aim to replicate the full functionality and range of types provided by the different RDBMS implementations of CAST. In particular we are not proposing to support CHAR, VARCHAR or DATETIME conversions. The aim is just to cover the primary use case of providing a mechanism for converting between the standard numeric types.
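A sketch of the binning use case mentioned above, assuming a numeric magnitude column mag (the column and table names are placeholders):

    SELECT CAST(FLOOR(mag * 10) AS INTEGER) AS bin,
           COUNT(*) AS number
    FROM mycat.main
    GROUP BY CAST(FLOOR(mag * 10) AS INTEGER)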

    From \ To   SHORT  INTEGER  LONG  FLOAT  DOUBLE
    SHORT         -       Y      Y      Y      Y
    INTEGER       Y       -      Y      Y      Y
    LONG          Y       Y      -      Y      Y
    FLOAT         Y       Y      Y      -      Y
    DOUBLE        Y       Y      Y      Y      -

Accepted for next version - This item was discussed by members of the working group at the May 2014 IVOA Interop meeting.
It was agreed that the CAST operator should be included as a required operator in the next (minor) version of the std:ADQL standard.
It was agreed that the set of type conversions should be discussed further, with a view to finalising the set of conversions supported in the next (minor) version of the std:ADQL standard.

UWS

UWS: Clarifications, Errata, and Recommendations

Updating Parameters

Section 2.1.11 of std:UWS states that a "particular implementation of UWS may choose to allow the parameters to be updated after the initial job creation step, before the Phase is set to the executing state" and successively allows POSTing to jobs/job-id, jobs/job-id/parameters and PUTting to jobs/job-id/parameters/parameter-name.

It turned out that the concrete semantics of this cavalier approach quickly become difficult. We therefore propose to amend the language on changing parameters post-creation by:

In most cases, the values of the parameters are all established during the initial POST that creates the job. However, a particular implementation of UWS may choose to allow the parameters to be updated after the initial job creation step, before the Phase is set to the executing state. It should, however, not offer the ability to create new parameters nor delete existing parameters. The next major version of this specification will remove the ability to set an individual parameter.

From the client perspective, there is only one guaranteed way to set a parameter that all UWS services must implement: In the initial POST that creates the job.
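As an illustration of that initial POST for a TAP async job (the endpoint path and parameter values are only illustrative):

    POST /tap/async HTTP/1.1
    Content-Type: application/x-www-form-urlencoded

    REQUEST=doQuery&LANG=ADQL&QUERY=SELECT+TOP+5+*+FROM+TAP_SCHEMA.tables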

Behaviour for Failed Job Creation

In Section 2.2.3.1 of std:UWS a UWS is required to return a "code 303 'See other'" "unless the service rejects the request". It is not specified what should happen when the service rejects the request.

We propose to add, at an appropriate position, the following text:

If the execution of an UWS request fails, the service has to generate an appropriate error message with codes in the 400 (client error) or 500 (server error) ranges according to std:HTTP. If the erroneous request is recoverable (e.g., a request for a transition to an impossible state), the job does not go into the ERROR state because of a failed request.

The payload of such an error message SHOULD be a user-presentable error message in plain text, which SHOULD NOT be re-flowed by clients. Clients MUST accept other documents coming back as payloads of such request responses. As such events can be assumed to indicate major server failures, it is recommended to abandon a job that had a non-text/plain response to any UWS request.
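Under the proposed text, a response to a recoverable erroneous request might look like this sketch (the message wording is, of course, up to the service):

    HTTP/1.1 400 Bad Request
    Content-Type: text/plain

    Cannot start job 42: the job is already in phase COMPLETED.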

UWS: Proposed New Features

Format of Quote

Section 2.2.1 of std:UWS states that the jobs/job-id/quote resource represents quote as a number of seconds, while the schema represents quote as an xs:dateTime.

This is an unnecessary inconsistency. If no schema change is required by other changes in a UWS revision, we propose to solve it by requiring the representation in the resource to be in std:DALI YYYY-mm-ddThh:mm:ss form. While doing this, we should also clarify the format for the value of destruction, which currently just defers to std:iso8601; this should now refer to std:DALI, as ISO 8601 allows many variants that are clearly not intended here.

If the UWS schema needs changing for other reasons, we suggest unifying the representations as the number of seconds, on the grounds that it is the more logical specification for the estimated duration of a job.

TAP

TAP: Clarifications, Errata, and Recommendations

Names of Uploaded Tables

Section 2.5 of std:TAP requires the name of the uploaded tables to be a "legal ADQL table name with no catalog or schema (e.g. an unqualified table name)". This language probably allows delimited identifiers, as the ADQL table_name can expand to one. This, however, was clearly not the intention of the text, as the use of delimited identifiers is not (fully) supported by the syntax of the UPLOAD parameter. To resolve these difficulties, we propose to replace the parenthesis starting with "e.g." with:

i.e., a string following the regular_identifier production of std:ADQL.

This could, in theory, invalidate existing clients that might want to use delimited identifiers in uploads. Due to the difficulties with the UPLOAD parameter syntax, however, that would not really be supported in version 1, either. Thus, we claim that this language can enter in a minor version.

Multiple UPLOAD Posts

Since UWS allows posting parameters after job creation, Section 2.5.1 of std:TAP needs to specify what happens when the UPLOAD parameter is posted into a job that already has one or more uploads. We propose to add at the end of the section:

UPLOADs are accumulating, i.e., each UPLOAD parameter given will create one or more tables in TAP_UPLOAD. When the table names from two or more upload items agree after case folding, the service behaviour is unspecified. Clients thus cannot reliably overwrite uploaded tables; to correct errors, they have to tear down the existing job and create a new one.

Database Column Types

Section 2.5 of std:TAP gives "database column types" for all kinds of VOTable objects. Given the lack of an ADQL type system, this must clearly be taken with a grain of salt; the types given in this column at least cannot be taken as conformance criteria. We propose to add the following language before section 2.5.1:

Note that the last column of Table (x) is not normative. Implementations SHOULD try to make sure that the actual types chosen are at least signature-compatible with the recommended types (i.e., integers should remain integers, floating-point values floating-point values, etc.), such that clients can reliably write queries against uploaded tables.

For columns with xtype adql:REGION, this is particularly critical, since databases typically use different types to represent various STC-S objects. Clients are advised to assume that such columns will be approximated with polygons in the actual database table.

The size Column in TAP_SCHEMA

The table TAP_SCHEMA.columns as specified in section 2.6.3 of std:TAP has a column named size. This is unfortunate since SIZE is an ADQL reserved word, and thus must be quoted in queries.

We therefore propose to append the following language to section 2.6.3:

To use size in a query, it must be put in double quotes since it collides with an ADQL reserved word. Since delimited identifiers are case-sensitive, for the size column both clients and servers MUST always (in particular, in the DDL for TAP_SCHEMA) use lower case exclusively.

In the next major version of TAP, this column will be called arraysize.
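For reference, a query selecting this column would then have to be written like this (the table name in the constraint is only illustrative):

    SELECT column_name, "size"
    FROM TAP_SCHEMA.columns
    WHERE table_name = 'ivoa.obscore'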

To allow the text to be consistent with the rules for VOTable error documents, we propose the following changes in Section 2.9 of std:TAP:

Current:
    The VOTable must contain a RESOURCE element identified with the attribute type='results', containing a single TABLE element with the results of the query.
New:
    The VOTable must contain a RESOURCE element identified with the attribute type='results', containing exactly one TABLE element with the results of the query if the job execution was successful or no TABLE element if the job execution failed to produce a result.

Current:
    The RESOURCE element must contain, before the TABLE element, an INFO element with attribute name = "QUERY_STATUS". The value attribute must contain one of the following values:
New:
    The RESOURCE element must contain an INFO element with attribute name="QUERY_STATUS" indicating the success of the operation. For RESOURCE elements that contain a TABLE element, this INFO element must appear lexically before the TABLE. The following values are defined for this INFO element's value attribute:

TAP: New Features

An examples Endpoint

Feedback from TAP users indicates that providing query examples is considered most helpful, which is probably not surprising since to effectively use a TAP service, a user has to combine knowledge of a fairly complex query language with server-specific metadata like table schemata and local extensions as well as domain knowledge. A head start as provided by examples doing something related to what the users actually want is therefore most welcome.

TAP services are usually accessed through specialized clients. Therefore, a simple link "for examples see here" will in general not work for them. In principle, one could simply communicate an example URL to a client and let the user browse it. Allowing a certain amount of structuring within the document at this URL, however, lets clients do some useful in-application presentation of the examples.

std:DALI defines a simple system to communicate examples to humans and machine clients alike, based on RDFa. This section specifies how the generic DALI specification is to be applied to TAP.

The Endpoint

A TAP server exposes the example queries in an examples endpoint residing next to the sync, async, and VOSI endpoints. A GET from this endpoint MUST yield a document with a MIME type of either application/xhtml+xml or text/html. A service that does not provide examples MUST return a 404 HTTP status on accessing this resource.

If present, the endpoint must be represented in a capability in the TAP service's registry record. The capability's standardID is, as defined by DALI, ivo://ivoa.net/std/DALI#examples. A capability element could hence look like this:

    <capability standardID="ivo://ivoa.net/std/DALI#examples">
      <interface xsi:type="vr:WebBrowser">
        <accessURL use="full">http://localhost:8080/tap/examples</accessURL>
      </interface>
    </capability>

Document Content

The document at examples MUST follow the rules laid out for DALI-examples in std:DALI; in particular, it must be valid XML, viewable with "common web browsers".

TAP defines two additional properties within the ivo://ivoa.net/std/DALI-examples vocabulary (note that at the time of writing, the DALI PR has "DALI#examples" here; we use the corrected form):

  • query -- each example MUST have a unique child element with simple text content having a property attribute valued query. It contains the query itself, preferably with extra whitespace for easy human consumption and editing. This will usually be an HTML pre element.
  • table -- examples MAY also have descendants with property attributes having the value table. These must have pure text content and contain fully qualified table names to which the query is somehow "pertaining". Suitable HTML elements holding these include span or a (which would allow linking to further information on the table).

When using elements with src or href attributes to carry the property attributes, note that the element content must be repeated in a content attribute, as otherwise RDFa clients would interpret the embedded link rather than the element content as the object in the triple. For instance,

    <a href="..." property="table" content="theschema.thename">theschema.thename</a>

would make sure that RDFa clients see the table name (as they should) rather than the link target (as they would without the content attribute).

An example for a document served from the examples endpoint is given in Appendix appA.

Intended Use

In the simplest case, TAP clients can provide links to the current server's examples endpoint. A more advanced client would offer an interface element allowing the selection of example titles, with the option of entering the sample query into the query field of the user interface. The documentation for the query would be accessed by opening a web browser using the base examples URL and the example's fragment identifier.

Advanced clients could render the HTML div elements themselves, and they could provide a means to discover example queries involving particular tables in their table metadata browser based on property=table markup.

Validation

Appendix appB gives an XSLT 1.0 stylesheet that extracts the machine readable information from compliant documents and emits the results in text format.

The style sheet checks for proper vocabulary declaration. If you have no element declaring the vocabulary, the output will be empty.

Service operators should also use RDFa validation tools, e.g., the W3C RDFa validator svc:RDFaVal, to make sure their document is usable from RDF tools.

A plan Endpoint

CDS have a debug endpoint with additional information; join their concepts with this.

As already noted in std:TAP, it is notoriously difficult to predict the runtime of SQL queries. For nontrivial queries, even experts may have a hard time figuring out performance bottlenecks. Therefore, most database systems provide some mechanism to obtain a query plan, that is, to inspect what elementary operations will be performed for a given query.

Since TAP queries are typically formulated by persons not intimately familiar with the database queried, the need for a mechanism allowing insights into the database engine's reasoning is even more pronounced. On the other hand, different database systems give their plans in completely different formats and even schemata. In addition, as the Postgres Documentation says: "Plan-reading is an art that deserves an extensive tutorial" (doc:Postgres92, Sect. 14.1).

Thus, specifying a fixed format for query plans that would be both expressive enough and sufficiently generic to be easily adaptable to various backend database engines is probably impossible. To still allow users to inspect actual query plans, we propose the following language be added at the end of section 2.2.2 of std:TAP:

In addition to the UWS resources, a TAP server SHOULD support a child plan for each job resource. If retrieving this resource is successful (i.e., results in a 200 HTTP response after possible redirects and authentication), it MUST be a preformatted document with MIME type text/plain. Within it, the actual query as executed by the database engine MUST come first.

After at least one blank line, a rendering of the query plan follows. Note that the query as executed may contain blank lines, which means that machine clients cannot use the blank line to separate query and plan. In general, clients SHOULD display the plan without any reformatting in a fixed-width font.

Since it is hard to define a generic and sufficiently expressive format for query plans and the authors want to avoid excessive implementation cost for this feature, this specification does not give a format for the query plan. Implementors are advised to keep as much of the "native" plan format of their database engine as possible.

After the plan, the service is free to give additional debugging information. The intended audience for this information are again humans, so even in cases where proprietary clients actually parse out information from that area, such information should still be decipherable by knowledgeable humans.

If the creation of the query plan fails, the service MUST reply with a 400 (if the failure appears to be due to syntax errors in the query, the query plan not being available in this UWS phase, or similar problems) or 500 HTTP status code. Errors in plan generation do not change the phase of the job. Clients may thus use the plan endpoint to check the syntax of a query on services supporting it.

Services that cannot or choose not to support the retrieval of query plans MUST respond with a 404 HTTP code to requests for plan children of job resources.

Except for 404 responses, all documents delivered from the plan endpoint MUST have the MIME type text/plain. They should contain ASCII exclusively, but clients SHOULD assume UTF-8 encoding if no character set is declared by HTTP means.
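A response from such a plan resource might, for a service backed by PostgreSQL, look roughly like the following sketch; the plan format is entirely up to the backend engine and is shown here only as an illustration:

    SELECT ra, dec FROM ppmxl.main WHERE mag < 10

    Seq Scan on main  (cost=0.00..48925.50 rows=12550 width=16)
      Filter: (mag < 10::double precision)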

A Scalable tables Endpoint

For archives serving hundreds or thousands of tables, the tables endpoint on TAP services as defined by std:VOSI will have to return documents of several dozen megabytes. This results in nontrivial transfer times for data that in all likelihood is uninteresting to the user that typically will only write queries against fairly few of those tables.

To mitigate this problem, we propose to define that vs:Table typed elements in responses from VOSI table endpoints that have no column children are to be regarded as stubs by clients. A client SHOULD give the user the possibility to request "full" information on such a stubbed table. This full information is available from a child resource of tables named like the table, in exactly the capitalization given in the name child of the table stub; it would come as the full table element.

As an example, a service might return the following from its tables endpoint:

    <tableset>
      <schema>
        <name>ppmxl</name>
        <table>
          <name>ppmxl.main</name>
        </table>
      </schema>
    </tableset>

A client could then retrieve the URL .../tables/ppmxl.main and would receive something like this:

    <table xsi:type="vs:Table">
      <name>ppmxl.main</name>
      <description>PPMXL is a catalog of positions, proper motions...</description>
      <column>
        <name>ipix</name>
        <description>Identifier (Q3C ipix of the USNO-B 1.0 object)</description>
        ...
      </column>
      ...
    </table>

More formally, we propose to replace the last paragraph of section 3.4, "Table metadata", of std:VOSI, Version 1.0, with the following text:

In the REST binding, the registered URL retrieves an XML document containing this element. However, services exposing a large number of tables may only write table stubs into the document retrieved from this web resource. Table stubs are table elements containing no column children. While the XSD requires a name child to be present, the services may or may not include any of the remaining table metadata.

Still in the REST binding, a server that has produced such a columnless table element should provide a child of the tables resource named like the content of the table's name child element, with any leading or trailing whitespace removed. If a request for this resource is successful, the document received must be an XML document containing a single element of the type {http://www.ivoa.net/xml/VODataService/v1.1}Table with all metadata available for the table.

Making the async Endpoint Optional

Some existing TAP-like services have data that is small and simple enough that synchronous queries are likely to be sufficient. They therefore chose not to implement the async endpoint, which makes these services technically non-TAP. Given the implementation overhead of a UWS for something that is not really required by the services in question, the choice seems reasonable, though, and the services are "mostly interoperable" with existing clients in that there are usually ways to operate the services from the clients.

It therefore seems reasonable to make the async endpoint optional and add language that requires clients to offer ways to fall back to synchronous operation for services that do not support async.

Against this proposal, the following objections have been raised:

  1. While it is nice to apply TAP to simple tasks, its primary focus is solving hard problems with potentially long runtimes. To make this possible (reliably), asynchronous operation is a must.
  2. Optional features cannot be relied upon by clients. They are thus undesirable in principle, but in particular whatever is necessary to deal with the principal problems which the standard is intended to solve must not be optional;
  3. Thus, if anything, sync should be made optional, as it is easily simulated using async but not the other way round;
  4. Given that, we should not downgrade the standard but upgrade the implementations. There are several good, interoperable, open TAP implementations available, so nobody should be forced to run homegrown, non-compliant services.

Making TAP_SCHEMA optional

The TAP_SCHEMA specification makes a number of assumptions:

  • The system is based on a standard RDBMS
  • The data model closely follows the layout of the physical schema in the database
  • All of the tables are visible to everyone
If these assumptions hold true, then implementing TAP_SCHEMA is relatively easy and provides a rich set of tools for clients to query the available information. However, if some of these assumptions are not true, then implementing TAP_SCHEMA can be problematic, and in some cases it may be impossible to publish a full set of metadata describing all of the tables and columns available in a service.

We would therefore like to propose that TAP_SCHEMA is listed as an optional rather than mandatory feature of the TAP specification.

The original intent of the TAP and ADQL specifications was to provide an abstraction layer which hides the details of the underlying database implementation. On the other hand, the TAP_SCHEMA data model is based directly on the system tables available in many of the standard RDBMS. This close mapping between the TAP_SCHEMA data model and the RDBMS system tables means losing a level of the abstraction provided by TAP.

Examples

Virtual data

If a TAP service is providing access to a virtual dataset that contains data from more than one RDBMS, then there is no single set of system tables that contains information about all of the tables available via the TAP service. It may be possible to create a set of tables that combines the metadata from the individual component datasets, but if the list of tables in the datasets is itself dynamic (e.g. user-uploaded data) then maintaining an up-to-date copy of the table metadata becomes a problem.

Private data

If a TAP service is providing storage for user data, then some or all of that may be in a protected space, not visible to other users. In this situation, the service would need to selectively filter the visibility of individual tables and columns depending on who is asking. This would mean selectively controlling the visibility of individual rows in the TAP_SCHEMA tables based on the identity of the user making the request.

Alternative systems

Many of the alternative 'NoSQL' systems do not have anything equivalent to the system tables available in many RDBMS. Requiring a TAP service to implement the TAP_SCHEMA tables on such a system adds a barrier to entry for new and experimental services based on alternative database providers.

Letting TAP_SCHEMA name be custom

Even with TAP_SCHEMA set as an optional feature for a TAP service, the fixed name TAP_SCHEMA itself assumes that no more than one service per SQL server is possible unless some workaround is in place.
A fixed schema name is useful, e.g., because it lets a single query be understood by all TAP services, and it probably simplifies the work of client applications. However, it complicates the development of server-side code for multiple services on a single server.
Given that client applications usually interrogate the tables endpoint to retrieve the available tablesets from a TAP service, a solution could be to embed in that response the name of the schema that acts as TAP_SCHEMA if a different name is used.
Letting the schema name be the publisher's choice (whatever mechanism is used to communicate the schema name to clients) would disrupt the idea of a single query running on all available TAP services. A possible fix is to reserve a word in ADQL that uniquely identifies the TAP_SCHEMA schema at the query level.

Dynamic metadata

With the advent of TAP services that publish user generated data, the metadata for the tables and columns in these services is likely to change on a regular basis. It would be useful to be able to describe the volatility of a particular metadata record in some way, enabling a client that caches the metadata to be able to manage their cache more efficiently.

Time to live

We would like to propose adding a TimeToLive attribute to elements in the metadata, which would function in a similar manner to the TTL attribute on std:DNS records.

  • Records that are expected to change regularly, e.g. recently uploaded user data, would be assigned a relatively low lifetime, in the order of a few minutes.
  • Records that are not expected to change regularly, e.g. a published science archive, would be assigned a relatively high lifetime, in the order of a few hours.

These attributes would enable a client to prioritise which metadata records needed to be refreshed.
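A sketch of what this might look like in a VOSI tables response, using a hypothetical ttl attribute giving the lifetime in seconds (neither the attribute name nor its placement is defined anywhere yet):

    <table ttl="300">
      <name>user_uploads.table_20140512</name>
      ...
    </table>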

References