Notebook description

This notebook shows how a "naive" client can serialize and deserialize instances according to the Mapping Data Model Instances to VOTable Working Draft.

The following examples may not be complete, general, or efficient. Their goal is to show in a practical way how one can use UTYPEs to serialize and deserialize instances of Data Models in VOTable. Actual implementations may vary significantly and depend on the local setup, requirements, and programming language.

We say that a client is naive if:
1. it does not parse the VO-DML description file
2. it assumes a priori knowledge of one or more Data Models
3. it discovers information by looking for a set of predefined UTYPEs in the VOTable

Serializing instances is generally easier than deserializing them. By introducing deserialization first, step by step, serialization patterns should become clear, and serializing instances should then be straightforward.

Data Model

In this tutorial we will use a very simplistic Data Model for STC:

The above figure represents a UML Class Diagram, i.e. a conceptual representation of the domain under study, in this case a small subset of Space Time Coordinates.

This model defines these vodml-ids:

SkyCoordinate                (type) 
SkyCoordinate.longitude
SkyCoordinate.latitude
SkyCoordinateFrame           (type)
SkyCoordinateFrame.name
SkyCoordinateFrame.equinox

UTYPEs are pointers that refer to vodml-ids with the following syntax:

<<model_id>>:<<vodml_id>>

The model_id for STCX is stcx, so:

stcx:SkyCoordinate

is a UTYPE pointing to the SkyCoordinate type in STCX.
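
As a minimal sketch of the syntax above, a client could split a UTYPE into its model prefix and vodml-id as follows. The function name `split_utype` is hypothetical and is not used elsewhere in this notebook.

```python
def split_utype(utype):
    """Split '<model_id>:<vodml_id>' into (model_id, vodml_id)."""
    # Split on the first colon only; the vodml-id keeps its dots
    prefix, sep, vodml_id = utype.partition(":")
    if not sep or not prefix or not vodml_id:
        raise ValueError("not a prefixed UTYPE: {!r}".format(utype))
    return prefix, vodml_id


print(split_utype("stcx:SkyCoordinate"))            # ('stcx', 'SkyCoordinate')
print(split_utype("stcx:SkyCoordinate.longitude"))  # ('stcx', 'SkyCoordinate.longitude')
```

A naive client only compares whole UTYPE strings, so it rarely needs this split; it is shown here to make the two parts of the pointer explicit.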

A note on VOTable serialization

According to the Mapping Data Model Instances to VOTable specification, a VOTable must include a preamble that declares the data models used in the file. This signals to readers that the VOTable falls under the Mapping specification, and allows more advanced clients to get a copy of the standard model description file (VO-DML/XML).

For such clients, the preamble also provides a resolution mechanism for the model prefixes (more information below).

Naive clients, however, assume a priori knowledge of the data model, so they do not parse the VO-DML/XML file, and they can assume globally unique prefixes.

The file positions.xml contains a list of positions, represented in the following UML Object Diagram.

A UML Object Diagram represents specific instances of the model described by the Class Diagram. In this case, the class diagram describes the attributes of generic sky positions, while the object diagram represents specific values of sky positions.

The following cells will show how to retrieve information about such instances using the Mapping specification.

We will use the lxml Python package to parse the VOTable as XML, and to serialize and deserialize instances. For deserialization we will mostly use XPath expressions.

In [1]:
import lxml.etree as ET
pos_vot = ET.parse('positions.xml').getroot()

VO-DML special UTYPEs

Some special UTYPEs are used to mark-up VOTable and work as handles for clients.

In particular, vo-dml:Instance.root tags GROUP elements that contain an instance representation according to the Mapping specification.

On the other hand, vo-dml:Instance.type is used as the @utype attribute of PARAMs in GROUPs to store the type of the instance serialized in the GROUP itself.

The mapping is recursive, so an instance inside an instance will be represented by a GROUP nested inside a GROUP.

However, only the first GROUP in the hierarchy (the root) will have the @utype set to vo-dml:Instance.root, while the nested GROUPs may have at least one PARAM with @utype=vo-dml:Instance.type.

So, the following command finds all the instance representations that the Data Provider serialized in the file, but other instances may be nested.

In [2]:
pos_vot.findall('.//GROUP[@utype="vo-dml:Instance.root"]')
Out[2]:
[<Element GROUP at 0x1033707e8>, <Element GROUP at 0x103370998>]
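
The difference between root GROUPs and nested, typed GROUPs can be illustrated with a small self-contained sketch. It uses the standard library's `xml.etree.ElementTree` instead of lxml, and the XML fragment is a hypothetical, stripped-down example rather than the content of positions.xml. The helper name `instance_type` is also an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one root instance with one nested instance
SAMPLE = """
<TABLE>
  <GROUP utype="vo-dml:Instance.root">
    <PARAM utype="vo-dml:Instance.type" value="src:Source"/>
    <GROUP utype="src:Source.position">
      <PARAM utype="vo-dml:Instance.type" value="stcx:SkyCoordinate"/>
    </GROUP>
  </GROUP>
</TABLE>
"""


def instance_type(group):
    """Return the type declared by a GROUP's own PARAM children, or None."""
    for param in group.findall("PARAM"):
        if param.get("utype") == "vo-dml:Instance.type":
            return param.get("value")
    return None


table = ET.fromstring(SAMPLE)

# Only the outermost GROUP carries the root UTYPE
roots = table.findall('.//GROUP[@utype="vo-dml:Instance.root"]')

# Both GROUPs declare a type through a PARAM
typed = [g for g in table.iter("GROUP") if instance_type(g) is not None]

print(len(roots))                         # 1
print([instance_type(g) for g in typed])  # ['src:Source', 'stcx:SkyCoordinate']
```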

The following command shows how to get all instance representations of a specific type, in this case a SkyCoordinate.

The idea is to get all the GROUPs having a PARAM with @utype="vo-dml:Instance.type" and @value="stcx:SkyCoordinate", i.e. the UTYPE that points to the SkyCoordinate type in the example model.

In [3]:
positions = pos_vot.xpath('''.//GROUP[PARAM[@utype="vo-dml:Instance.type"
                                            and
                                            @value="stcx:SkyCoordinate"]]''')
print(len(positions))
2

Although the file contains four positions, only two GROUPs are found.

Direct vs indirect representation

In the positions.xml file, there are two GROUPs representing SkyCoordinates, i.e. positions in the sky according to a very simplistic STC model.

One GROUP is an example of a direct serialization, i.e., a GROUP that has no FIELDrefs, but only PARAMs (and the same is true for any nested GROUPs therein).

A direct serialization is completely defined by its GROUP as all the values are defined for the instance.

The other GROUP, instead, is an example of an indirect serialization, because the root GROUP, or one of its nested GROUPs, contains FIELDrefs. So, the GROUP represents a kind of template for instances that have values stored in table cells.

So, one GROUP represents a complete instance, the other represents a template for positions that are serialized in the table, in different rows. In fact, the table in positions.xml contains three rows.
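
The counting argument above (two GROUPs, four positions) can be sketched as follows. The table is a hypothetical fragment shaped like the description of positions.xml, not the actual file, and the sketch uses the standard library's `xml.etree.ElementTree` rather than lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: one direct GROUP, one indirect GROUP, three rows
SAMPLE = """
<TABLE>
  <GROUP utype="vo-dml:Instance.root">
    <PARAM utype="vo-dml:Instance.type" value="stcx:SkyCoordinate"/>
    <PARAM utype="stcx:SkyCoordinate.longitude" value="12.0"/>
    <PARAM utype="stcx:SkyCoordinate.latitude" value="-12.0"/>
  </GROUP>
  <GROUP utype="vo-dml:Instance.root">
    <PARAM utype="vo-dml:Instance.type" value="stcx:SkyCoordinate"/>
    <FIELDref utype="stcx:SkyCoordinate.longitude" ref="_ra"/>
    <FIELDref utype="stcx:SkyCoordinate.latitude" ref="_dec"/>
  </GROUP>
  <FIELD ID="_ra"/>
  <FIELD ID="_dec"/>
  <DATA><TABLEDATA>
    <TR><TD>1.0</TD><TD>1.1</TD></TR>
    <TR><TD>2.0</TD><TD>2.1</TD></TR>
    <TR><TD>3.0</TD><TD>3.1</TD></TR>
  </TABLEDATA></DATA>
</TABLE>
"""

table = ET.fromstring(SAMPLE)
groups = table.findall("GROUP")
n_rows = len(table.findall("DATA/TABLEDATA/TR"))

n_instances = 0
for group in groups:
    if group.find(".//FIELDref") is not None:
        # Indirect: the GROUP is a template, one instance per table row
        n_instances += n_rows
    else:
        # Direct: the GROUP itself is one complete instance
        n_instances += 1

print(len(groups), n_instances)  # 2 4
```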

The following code prints the values of longitude and latitude for the direct serialization GROUP, by using the UTYPEs defined in the model for these attributes of a SkyCoordinate class to find the PARAMs holding the actual values.

In [4]:
for position in positions:
    # FIND PARAMs for longitude and latitude, using UTYPEs
    longitude = position.xpath('PARAM[@utype="stcx:SkyCoordinate.longitude"]')
    latitude = position.xpath('PARAM[@utype="stcx:SkyCoordinate.latitude"]')
    
    # IF ANY PARAMs ARE FOUND for longitude
    if len(longitude):
        # GET THE VALUE
        print("longitude: {}".format(longitude[0].attrib['value']))
        
    # IF ANY PARAMs ARE FOUND for latitude
    if len(latitude):
        # GET THE VALUE
        print("latitude: {}".format(latitude[0].attrib['value']))
longitude: 12.0
latitude: -12.0

The following code focuses on the indirect serialization to find, by means of UTYPEs, the FIELD ID and index for latitude and longitude.

In [5]:
for position in positions:
    # FIND FIELDrefs for longitude and latitude, using UTYPEs
    longitude = position.xpath('FIELDref[@utype="stcx:SkyCoordinate.longitude"]')
    latitude = position.xpath('FIELDref[@utype="stcx:SkyCoordinate.latitude"]')
    
    # IF ANY FIELDrefs ARE FOUND for longitude
    if len(longitude):
        # GET THE FIELD ID
        fid = longitude[0].attrib['ref']
        
        # GET THE FIELD INDEX
        idx = pos_vot.xpath("count(.//FIELD[@ID = $fid]/preceding-sibling::FIELD)", fid=fid)
        
        # PRINT THE RESULTS
        print("Longitude ID:{} Index:{}".format(fid, int(idx)))
    if len(latitude):
        fid = latitude[0].attrib['ref']
        idx = pos_vot.xpath("count(.//FIELD[@ID = $fid]/preceding-sibling::FIELD)", fid=fid)
        print("Latitude ID:{} Index:{}".format(fid, int(idx)))
Longitude ID:_ra Index:0
Latitude ID:_dec Index:1
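
The XPath expression `count(.../preceding-sibling::FIELD)` used above yields the zero-based position of a FIELD among the FIELDs of its table. The same lookup can be sketched in plain Python with the standard library's `xml.etree.ElementTree`; the function name `field_index` and the two-column table are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table header with two columns
SAMPLE = """
<TABLE>
  <FIELD ID="_ra"/>
  <FIELD ID="_dec"/>
</TABLE>
"""


def field_index(table, field_id):
    """Return the zero-based column index of the FIELD with this ID, or None."""
    for index, field in enumerate(table.findall("FIELD")):
        if field.get("ID") == field_id:
            return index
    return None


table = ET.fromstring(SAMPLE)
print(field_index(table, "_ra"))      # 0
print(field_index(table, "_dec"))     # 1
print(field_index(table, "_absent"))  # None
```

Note that XPath positional predicates such as `TD[n]` are one-based, which is why the functions later in this notebook add 1 to the count.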

Recap

In the above examples we showed how to use UTYPEs to find VOTable elements that make up an instance of a data model class, namely a sky coordinate with longitude and latitude.

Important points:
- We are using a simplistic example model that defines some IDs for STC concepts.
- The IDs defined in the model are used in the VOTable to annotate GROUPs, FIELDrefs, and PARAMs.
- The IDs defined in the model are prefixed in VOTable by a string (prefix, or namespace) that identifies the model ('stcx').
- We are assuming the mapping strategies defined in the Mapping document.

In other words, we are assuming direct knowledge of this simple model, and such knowledge is represented by nothing more than the UTYPE strings and domain knowledge regarding STC.

Functions

To do something more useful, we can define functions that take UTYPE strings as parameters.

The following function assumes that the concept represented by a UTYPE is a column, and fetches the column values as a Python list.

Note: As acknowledged before, functions of this kind are not supposed to be scalable or efficient, and they may not be complete. For example, the following function definition (and many others below) does not perform any error handling.

In [6]:
def get_column_array(element, utype, type_):
    """
    Given a VOTable element, get the column values for the concept represented by utype,
    casting the elements to type_
    """
    
    # GET THE FIELDref FOR THE CONCEPT REPRESENTED BY utype
    el = element.xpath('FIELDref[@utype=$utype]', utype=utype)
    
    # IF ANY SUCH FIELDrefs exist
    if len(el):
        
        # GET THE FIELD ID
        fid = el[0].attrib['ref']
        
        # GET THE FIELD INDEX
        idx = element.xpath("count(//FIELD[@ID = $fid]/preceding-sibling::FIELD)", fid=fid)+1
        
        # GET THE TDs for that column
        tds = element.xpath('//FIELD[@ID = $fid]/following-sibling::DATA/TABLEDATA/TR/TD[$idx]', fid=fid, idx=int(idx))
        
        # BUILD AND RETURN THE ARRAY OF VALUES
        array = [type_(td.text) for td in tds]
        return array

    

The following simple helper function checks whether an element is an indirect representation.

In [7]:
def is_indirect(element):
    """
    Return true if the element is an indirect representation
    """
    # LOOK FOR ANY FIELDref INSIDE THE ELEMENT
    el = element.xpath('.//FIELDref')
    
    # RETURN TRUE IF THERE IS AT LEAST ONE FIELDref (INDIRECT REPRESENTATION)
    return len(el) > 0

The following example shows how the above functions can be used to get the array of values for a concept serialized in a VOTable.

Again, we are only assuming knowledge of the IDs that define the concept in the Data Model.

In [8]:
utype = "stcx:SkyCoordinate.longitude"
for position in positions:
    if is_indirect(position):
        print(get_column_array(position, utype, float))
[1.0, 2.0, 3.0]
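
As noted, get_column_array performs no error handling. In particular, an empty TD has no text, so `type_(td.text)` would be called with None. The following is a minimal sketch of a more defensive cast, assuming that empty cells should map to a placeholder value; the function name `cast_cells` and its `missing` parameter are hypothetical.

```python
def cast_cells(texts, type_, missing=None):
    """Cast a list of cell texts to type_, mapping empty cells to `missing`."""
    values = []
    for text in texts:
        # An empty <TD/> yields None; whitespace-only cells are treated the same
        if text is None or not text.strip():
            values.append(missing)
            continue
        try:
            values.append(type_(text))
        except ValueError:
            # Report which cell could not be converted
            raise ValueError(
                "cannot convert cell {!r} with {}".format(text, type_.__name__)
            )
    return values


print(cast_cells(["1.0", None, " 3.0 "], float))  # [1.0, None, 3.0]
```

Such a helper could replace the list comprehension at the end of get_column_array, at the cost of a slightly longer function.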

Helper Functions. A library for VOTable I/O.

A few more helper functions are defined below. While they may be interesting as concrete examples of how to do some simple I/O, the main reason for their creation is that they represent an I/O-specific library that implements the mapping patterns defined in the Mapping document.

In other words, they show how it is possible to separate the I/O layer from the business layer. These helper functions are general (although not complete, for the sake of simplicity) in the sense that they do not depend on the specific model.

In [9]:
def get_cell(element, utype, type_, index):
    """Return the cell at row `index` of the column referenced by utype, cast to type_."""
    el = element.xpath('.//FIELDref[@utype=$utype]', utype=utype)
    if len(el):
        fid = el[0].attrib['ref']
        # XPath positions are 1-based, hence the +1
        idx = element.xpath("count(//FIELD[@ID = $fid]/preceding-sibling::FIELD)", fid=fid)+1
        tds = element.xpath('//FIELD[@ID = $fid]/following-sibling::DATA/TABLEDATA/TR/TD[$idx]', fid=fid, idx=int(idx))
        return type_(tds[index].text)
    
def get_nrows(element):
    """Return the number of rows of the table referenced by the first FIELDref in element."""
    el = element.xpath('.//FIELDref')
    if len(el):
        fid = el[0].attrib['ref']
        nrows = element.xpath('count(//FIELD[@ID = $fid]/following-sibling::DATA/TABLEDATA/TR)', fid=fid)
        return int(nrows)

def get_param(element, utype, type_):
    """Return the raw @value string of the first PARAM with the given utype."""
    # NOTE: type_ is accepted for symmetry with get_cell, but no cast is applied
    el = element.xpath('.//PARAM[@utype=$utype]', utype=utype)
    if len(el):
        return el[0].attrib['value']

def find_type(element, utype):
    """Return the GROUPs below element whose vo-dml:Instance.type PARAM equals utype."""
    type_utype = "vo-dml:Instance.type"
    return element.xpath('.//GROUP[PARAM[@utype=$type_u and @value=$utype]]',
                              type_u = type_utype,
                              utype = utype)

def get_from_field_or_param(element, utype, type_, row):
    """Look the value up in the table cells first, then fall back to a PARAM."""
    value = None
    if is_indirect(element):
        value = get_cell(element, utype, type_, row)
    if value is None:
        value = get_param(element, utype, type_)
    return value

def get_column_array_from_field(element, utype, type_):
    """Return the values of the column whose FIELD carries the given utype, cast to type_."""
    el = element.xpath(".//FIELD[@utype=$utype]", utype=utype)
    if len(el):
        # XPath positions are 1-based, hence the +1
        idx = el[0].xpath("count(preceding-sibling::FIELD)")+1
        tds = el[0].xpath('following-sibling::DATA/TABLEDATA/TR/TD[$idx]', idx=int(idx))
        return [type_(td.text) for td in tds]
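
Note that get_param accepts a type_ argument but returns the raw attribute string. As a hedged sketch, a variant that actually applies the cast could look like the following. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs on its own, and the name `get_typed_param` is hypothetical.

```python
import xml.etree.ElementTree as ET


def get_typed_param(element, utype, type_):
    """Return the value of the first PARAM with this utype, cast to type_."""
    # iter() walks the element and all of its descendants, like './/PARAM'
    for param in element.iter("PARAM"):
        if param.get("utype") == utype:
            return type_(param.get("value"))
    return None


# Hypothetical GROUP with a single PARAM
group = ET.fromstring(
    '<GROUP><PARAM utype="stcx:SkyCoordinate.longitude" value="12.00000"/></GROUP>'
)

print(get_typed_param(group, "stcx:SkyCoordinate.longitude", float))  # 12.0
print(get_typed_param(group, "stcx:SkyCoordinate.latitude", float))   # None
```

With such a variant, values read from PARAMs and values read from table cells would have the same Python type.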
        

Position Class: a simple STCLib

The next cell is interesting because it defines a Position class that implements a simple structured object.

This class puts the I/O library defined above to work for deserializing instances of positions from VOTable. In order to do so, the class uses three UTYPEs that point to the STCX data model elements.

In [10]:
class Position(object):
    position_utype = "stcx:SkyCoordinate"
    longitude_utype = "stcx:SkyCoordinate.longitude"
    latitude_utype = "stcx:SkyCoordinate.latitude"
    
    def __init__(self, longitude, latitude):
        self.longitude = longitude
        self.latitude = latitude
    
    @staticmethod
    def find(element):
    
        positions = find_type(element, Position.position_utype)
        
        return_positions = []
    
        for position in positions:
            if is_indirect(position):
                nrows = get_nrows(position)
                for row in range(nrows):
                    longitude = get_from_field_or_param(position, Position.longitude_utype, float, row)
                    latitude = get_from_field_or_param(position, Position.latitude_utype, float, row)
                    return_positions.append(Position(longitude, latitude))
            else:
                longitude = get_param(position, Position.longitude_utype, float)
                latitude = get_param(position, Position.latitude_utype, float)
                return_positions.append(Position(longitude, latitude))
                
        return return_positions
    
    def __repr__(self):
        return "Position {{longitude: {}, latitude: {}}}".format(self.longitude, self.latitude)
In [11]:
positions = Position.find(pos_vot)

for position in positions:
    print(position)
Position {longitude: 12.0, latitude: -12.0}
Position {longitude: 1.0, latitude: 1.1}
Position {longitude: 2.0, latitude: 2.1}
Position {longitude: 3.0, latitude: 3.1}

Recap

To sum up:
- We defined a number of helper functions that implement some mapping strategies from the Mapping to VOTable specification.
- We defined a Position class that implements a simplistic STCX model. The implementation uses the helper functions and the vodml-ids defined by the STCX model.

In this simplistic example we can identify a generic I/O library made of the helper functions (let's call it volib), and a model specific library for STCX, that uses the helper functions (let's call it stclib).

Source Data Model example

The STCX Model can be useful if we attach coordinates to something. Let's say that this something is a Source, according to the following (still simple) Source data model.

In particular, the file catalog.xml contains a Catalog with three Sources, each of which has a position, as specified in the object diagram below.

First of all, we need to load the VOTable using lxml:

In [12]:
catalog_vot = ET.parse('catalog.xml').getroot()

SRCLib reuses STCLib

The following Source and Catalog classes implement the types defined in the Source data model, just as Position above implemented a class from the STCX model.

We might think of these classes as a SRCLib library. Sources have positions that are STCX's SkyCoordinates. The Mapping document allows SRCLib to easily reuse the code in STCLib.

In [13]:
class Source(object):
    def __init__(self, name, position):
        self.name = name
        self.position = position

    @staticmethod
    def find(element):
        source_utype = "src:Source"
        name_utype = "src:Source.name"
        position_utype = "src:Source.position"
        
        sources = find_type(element, source_utype)
        
        return_sources = []
    
        for source in sources:
            if is_indirect(source):
                nrows = get_nrows(source)
                for row in range(nrows):
                    name = get_from_field_or_param(source, name_utype, str, row)
                    position = Position.find(source)[row]
                    return_sources.append(Source(name, position))
            else:
                name = get_param(source, name_utype, str)
                position = Position.find(source)[0]
                return_sources.append(Source(name, position))
                
        return return_sources
    
    def __repr__(self):
        return "Source {{name: {}, position: {}}}".format(self.name, self.position)

    
    
class Catalog(object):
    def __init__(self, name, description, sources):
        self.name = name
        self.description = description
        self.sources = sources

    @staticmethod
    def find(element):
        catalog_utype = "src:Catalog"
        name_utype = "src:Catalog.name"
        description_utype = "src:Catalog.description"
        source_utype = "src:Catalog.source"
        
        catalogs = find_type(element, catalog_utype)
        
        return_catalogs = []
    
        for catalog in catalogs:
            if is_indirect(catalog):
                nrows = get_nrows(catalog)
                for row in range(nrows):
                    name = get_from_field_or_param(catalog, name_utype, str, row)
                    description = get_from_field_or_param(catalog, description_utype, str, row)
                    sources = Source.find(catalog)
                    return_catalogs.append(Catalog(name, description, sources))
            else:
                name = get_param(catalog, name_utype, str)
                description = get_param(catalog, description_utype, str)
                sources = Source.find(catalog)
                return_catalogs.append(Catalog(name, description, sources))
                
        return return_catalogs
    
    def __repr__(self):
        ret = "Catalog {{name: {}, description: {}, sources:\n".format(self.name, self.description)
        for source in self.sources:
            ret += '\t\t'+ repr(source)+'\n'
        ret+='}'
        return ret

One can access objects at any level of the instance hierarchy. For example, one can get all Sources in the file:

In [14]:
sources = Source.find(catalog_vot)

for source in sources:
    print("{} {} {}".format(source.name, source.position.longitude, source.position.latitude))
SOURCE_1 1.0 1.1
SOURCE_2 2.0 2.1
SOURCE_3 3.0 3.1

Or one can access the main catalog object:

In [15]:
catalog = Catalog.find(catalog_vot)[0]
print(catalog)
Catalog {name: My Catalog, description: My Description, sources:
		Source {name: SOURCE_1, position: Position {longitude: 1.0, latitude: 1.1}}
		Source {name: SOURCE_2, position: Position {longitude: 2.0, latitude: 2.1}}
		Source {name: SOURCE_3, position: Position {longitude: 3.0, latitude: 3.1}}
}

We can still access individual positions, by using STCLib:

In [16]:
positions = Position.find(catalog_vot)
for position in positions:
    print(position)
Position {longitude: 12.00000, latitude: -12.00000}
Position {longitude: 1.0, latitude: 1.1}
Position {longitude: 2.0, latitude: 2.1}
Position {longitude: 3.0, latitude: 3.1}

Old and Custom UTYPEs

The catalog.xml file also contains some old-style STC UTYPEs:
- stc:AstroCoords.Position2D.Value2.C1
- stc:AstroCoords.Position2D.Value2.C2

The Mapping document allows such UTYPEs to live side-by-side with the new-style ones, and the following call shows how one can access the same data using such UTYPEs. Notice, however, that the modular implementation we explored in this tutorial is not possible with the old-style UTYPEs.

In [17]:
print(get_column_array_from_field(catalog_vot, "stc:AstroCoords.Position2D.Value2.C1", float))
[1.0, 2.0, 3.0]
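
The lookup by an old-style UTYPE attached directly to a FIELD can also be sketched without XPath. The example below uses the standard library's `xml.etree.ElementTree` and a hypothetical two-column table, not the actual catalog.xml; the function name `column_by_field_utype` is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical table in which a FIELD carries an old-style UTYPE
SAMPLE = """
<TABLE>
  <FIELD ID="_name"/>
  <FIELD ID="_long" utype="stc:AstroCoords.Position2D.Value2.C1"/>
  <DATA><TABLEDATA>
    <TR><TD>SOURCE_1</TD><TD>1.0</TD></TR>
    <TR><TD>SOURCE_2</TD><TD>2.0</TD></TR>
  </TABLEDATA></DATA>
</TABLE>
"""


def column_by_field_utype(table, utype, type_):
    """Return the column whose FIELD has this utype, cast to type_, or None."""
    for index, field in enumerate(table.findall("FIELD")):
        if field.get("utype") == utype:
            # The FIELD position gives the TD position inside every row
            return [type_(tr.findall("TD")[index].text)
                    for tr in table.findall("DATA/TABLEDATA/TR")]
    return None


table = ET.fromstring(SAMPLE)
print(column_by_field_utype(table, "stc:AstroCoords.Position2D.Value2.C1", float))
# [1.0, 2.0]
```

Here the UTYPE identifies a column directly, with no GROUP structure around it, which is why the instance hierarchy cannot be reconstructed from old-style UTYPEs alone.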

Data Providers: How to serialize instances

Serializing instances is much easier than deserializing them, at least for data providers, who do not need to implement the specification in a complete way. Helper tools, on the other hand, must be smarter, and they can help data providers or users even further.

To keep code simple and intuitive the example below does not produce a valid VOTable.

The following function just enables pretty-printing of the XML.

In [18]:
from xml.dom import minidom

def prettify(elem):
    rough_string = ET.tostring(elem)
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="  ")

The following function serializes a catalog instance like the one we deserialized in the first part of the tutorial.

The code should be rather self-explanatory.

In [19]:
def print_catalog(catalog):
    resource = ET.Element("RESOURCE")
    table = ET.SubElement(resource, "TABLE")
    
    catalog_repr = ET.SubElement(table, "GROUP",
                                 utype = "vo-dml:Instance.root")
    
    ET.SubElement(catalog_repr, "PARAM",
                  utype = "vo-dml:Instance.type",
                  value = "src:Catalog")
    
    ET.SubElement(catalog_repr, "PARAM",
                  utype = "src:Catalog.name",
                  value = catalog.name)
    
    ET.SubElement(catalog_repr, "PARAM",
                  utype = "src:Catalog.description",
                  value = catalog.description)
    
    source_repr = ET.SubElement(catalog_repr, "GROUP",
                                  utype = "src:Catalog.source")
    
    ET.SubElement(source_repr, "PARAM",
                  utype = "vo-dml:Instance.type",
                  value = "src:Source")
    
    ET.SubElement(source_repr, "FIELDref",
                  utype = "src:Source.name",
                  ref = "_name")
    
    position_repr = ET.SubElement(source_repr, "GROUP",
                                  utype = "src:Source.position")
    
    ET.SubElement(position_repr, "PARAM",
                  utype = "vo-dml:Instance.type",
                  value = "stcx:SkyCoordinate")
    
    ET.SubElement(position_repr, "FIELDref",
                  utype = "stcx:SkyCoordinate.longitude",
                  ref = "_long")
    
    ET.SubElement(position_repr, "FIELDref",
                  utype = "stcx:SkyCoordinate.latitude",
                  ref = "_lat")
    
    ET.SubElement(table, "FIELD",
                  ID="_name")
    
    ET.SubElement(table, "FIELD",
                  ID="_long")
    
    ET.SubElement(table, "FIELD",
                  ID="_lat")
    
    data = ET.SubElement(table, "DATA")
    
    tabledata = ET.SubElement(data, "TABLEDATA")
    
    for source in catalog.sources:
        row = ET.SubElement(tabledata, "TR")
        ET.SubElement(row, "TD").text = source.name
        ET.SubElement(row, "TD").text = str(source.position.longitude)
        ET.SubElement(row, "TD").text = str(source.position.latitude)

    print(prettify(resource))
In [20]:
print_catalog(catalog)
<?xml version="1.0" ?>
<RESOURCE>
  <TABLE>
    <GROUP utype="vo-dml:Instance.root">
      <PARAM utype="vo-dml:Instance.type" value="src:Catalog"/>
      <PARAM utype="src:Catalog.name" value="My Catalog"/>
      <PARAM utype="src:Catalog.description" value="My Description"/>
      <GROUP utype="src:Catalog.source">
        <PARAM utype="vo-dml:Instance.type" value="src:Source"/>
        <FIELDref ref="_name" utype="src:Source.name"/>
        <GROUP utype="src:Source.position">
          <PARAM utype="vo-dml:Instance.type" value="stcx:SkyCoordinate"/>
          <FIELDref ref="_long" utype="stcx:SkyCoordinate.longitude"/>
          <FIELDref ref="_lat" utype="stcx:SkyCoordinate.latitude"/>
        </GROUP>
      </GROUP>
    </GROUP>
    <FIELD ID="_name"/>
    <FIELD ID="_long"/>
    <FIELD ID="_lat"/>
    <DATA>
      <TABLEDATA>
        <TR>
          <TD>SOURCE_1</TD>
          <TD>1.0</TD>
          <TD>1.1</TD>
        </TR>
        <TR>
          <TD>SOURCE_2</TD>
          <TD>2.0</TD>
          <TD>2.1</TD>
        </TR>
        <TR>
          <TD>SOURCE_3</TD>
          <TD>3.0</TD>
          <TD>3.1</TD>
        </TR>
      </TABLEDATA>
    </DATA>
  </TABLE>
</RESOURCE>
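
The serialization code repeats one pattern several times: create a GROUP, then add a PARAM that declares the instance type. A minimal sketch of a helper that captures this pattern is shown below. It uses the standard library's `xml.etree.ElementTree` so that it runs on its own, the name `add_instance_group` is hypothetical, and the role UTYPEs are assumed to carry the model prefix, following the UTYPE syntax introduced at the beginning of the tutorial.

```python
import xml.etree.ElementTree as ET


def add_instance_group(parent, type_utype, role_utype):
    """Append a GROUP that declares an instance of type_utype and return it."""
    # The GROUP's own @utype is the root marker or the role it plays in its parent
    group = ET.SubElement(parent, "GROUP", utype=role_utype)
    # The type of the instance is stored in a PARAM inside the GROUP
    ET.SubElement(group, "PARAM", utype="vo-dml:Instance.type", value=type_utype)
    return group


table = ET.Element("TABLE")
catalog_group = add_instance_group(table, "src:Catalog", "vo-dml:Instance.root")
source_group = add_instance_group(catalog_group, "src:Source", "src:Catalog.source")
ET.SubElement(source_group, "FIELDref", utype="src:Source.name", ref="_name")

print(ET.tostring(table, encoding="unicode"))
```

Like the example above, this sketch does not produce a valid VOTable; it only shows how the nesting of GROUPs mirrors the nesting of instances.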

