John Philip McCrae

Lexical Linked Data Case Study: ALPINO Treebank

In this post, I will detail how to publish a linguistic resource as linked data from scratch. These instructions are based on a Linux server with apache2, but should apply to other server types as well. As a case study we use the ALPINO Treebank, a treebank for Dutch that is distributed in XML and released under the GPLv2; hence we can republish it in RDF as long as we attribute the original authors.

We will start by obtaining the resource, decompressing it and removing the non-data folders:

wget http://www.let.rug.nl/~vannoord/ftp/AlpinoCDROM/AlpinoCDROM.tgz
tar xzvf AlpinoCDROM.tgz
rm -fr Clig/ Papers/ stylesheets/ thistle-2-0-1/ xmlmatch/

Next, we do a simple RDF conversion. We start from a simple XSLT stylesheet (linked at the end of this post) and apply it with the xsltproc command:


for file in *.xml
do
  xsltproc xml2rdf3.xsl "$file" > "$file.rdf"
done
rename .xml.rdf .rdf *.xml.rdf
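
The loop above only picks up XML files in the current directory. If the XML files sit in subfolders (the URLs used later in this post include a cgn_exs/ folder), a recursive variant may be handy. The following is a minimal sketch, assuming the stylesheet is saved as xml2rdf3.xsl as above; the helper names rdf_name and convert_tree are hypothetical and not part of the original scripts.

```shell
#!/bin/sh
# Sketch: convert every .xml file under a directory tree to RDF.
# rdf_name and convert_tree are hypothetical helper names.

# Map an input path such as cgn_exs/1.xml to cgn_exs/1.rdf.
rdf_name() {
  printf '%s\n' "${1%.xml}.rdf"
}

# Run xsltproc on each .xml file below the given directory.
# $1 = stylesheet, $2 = root directory (defaults to the current one).
convert_tree() {
  stylesheet=$1
  root=${2:-.}
  find "$root" -type f -name '*.xml' | while IFS= read -r file
  do
    xsltproc "$stylesheet" "$file" > "$(rdf_name "$file")"
  done
}
```

Calling `convert_tree xml2rdf3.xsl .` would then write each .rdf file next to its source, which also makes the separate rename step unnecessary.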

Now we simply create a new folder on our apache2 server and copy the result there:


cd /var/www/lexinfo.net/htdocs/
mkdir -p corpora/alpino
cp -r ~/AlpinoCDROM corpora/alpino
chown -R apache:apache corpora/

And now we see that the data is available.

In fact, a linked data server is just a normal web server that returns RDF data. We make a quick modification to the MIME types so that it returns the correct content type, in /etc/apache2/modules.d/00_mod_mime.conf (on my server; check your Linux distribution's documentation), and then restart the server.


AddType application/rdf+xml .rdf
AddType text/turtle .ttl

# For type maps (negotiated resources):
AddHandler type-map var

We can check that this works very simply, as follows:


jmccrae@greententacle ~/AlpinoCDROM $ curl -I -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1.rdf
HTTP/1.1 200 OK
Date: Thu, 09 Aug 2012 19:13:58 GMT
Server: Apache
Last-Modified: Thu, 09 Aug 2012 19:12:38 GMT
ETag: "1c96013-6e9-4c6da02bdb580"
Accept-Ranges: bytes
Content-Length: 1769
Cache-Control: max-age=1209600
Expires: Thu, 23 Aug 2012 19:13:58 GMT
Content-Type: application/rdf+xml

So far, so good. The next step is to enable content negotiation. For Alpino there is an issue: the raw XML files are named without an extension, and MultiViews only negotiates when the requested path does not itself exist as a file, so we rename all these files to use the extension .txt. Then in each folder we create a file called .htaccess and add the following line to it:


Options +MultiViews
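
Doing this by hand for every folder gets tedious, so the two preparation steps can be scripted. This is a minimal sketch under assumptions: the helper name enable_multiviews is hypothetical, and it treats every file without a dot in its name as one of the extensionless raw files.

```shell
#!/bin/sh
# Sketch: prepare a directory tree for MultiViews negotiation.
# enable_multiviews is a hypothetical helper name.
# $1 = root of the published corpus, e.g. corpora/alpino
enable_multiviews() {
  root=$1
  # Give every extensionless file a .txt extension so that it no longer
  # shadows the negotiated name (1 becomes 1.txt next to 1.rdf).
  find "$root" -type f ! -name '*.*' | while IFS= read -r file
  do
    mv "$file" "$file.txt"
  done
  # Drop a .htaccess with the MultiViews option into every folder.
  find "$root" -type d | while IFS= read -r dir
  do
    printf 'Options +MultiViews\n' > "$dir/.htaccess"
  done
}
```

Note that Apache only honours an Options line in .htaccess when the server configuration allows it through AllowOverride. Since .htaccess settings also apply to subfolders, a single file at the top of the corpus may be enough on your setup.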

Now we test it:


jmccrae@greententacle ~ $ curl -I -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1
HTTP/1.1 200 OK
Date: Thu, 09 Aug 2012 19:38:08 GMT
Server: Apache
Content-Location: 1.rdf
Vary: negotiate,accept
TCN: choice
Last-Modified: Thu, 09 Aug 2012 19:12:38 GMT
ETag: "1c96013-6e9-4c6da02bdb580;4c6da48d60b80"
Accept-Ranges: bytes
Content-Length: 1769
Cache-Control: max-age=1209600
Expires: Thu, 23 Aug 2012 19:38:08 GMT
Content-Type: application/rdf+xml
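
Both curl checks above come down to reading one or two response headers, so they are easy to automate for more than one file. The following is a rough sketch; header_value is a hypothetical helper name, and it assumes tr and awk are available.

```shell
#!/bin/sh
# Sketch: pull a single header value out of `curl -I` output.
# header_value is a hypothetical helper name.
# $1 = header name (case-insensitive); header text is read from stdin.
header_value() {
  # Strip carriage returns (HTTP lines end in CRLF), then print the value
  # of the first header whose name matches.
  tr -d '\r' | awk -v name="$1" '
    BEGIN { name = tolower(name) }
    {
      colon = index($0, ":")
      if (colon > 0 && tolower(substr($0, 1, colon - 1)) == name) {
        value = substr($0, colon + 1)
        sub(/^[ \t]+/, "", value)
        print value
        exit
      }
    }'
}
```

For example, `curl -sI -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1 | header_value Content-Type` should print application/rdf+xml if negotiation is working, and the same call with Content-Location shows which file was chosen.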

It works. Now to link it to something. Inspecting the data, there are three clear groups of categories in the corpus: “cat” for categories/phrase types, “rel” for dependency relations and “pos” for part-of-speech tags. Many of these can be aligned to a data category registry or linguistic ontology. I chose to provide alignments to ISOcat and to LexInfo. This was done by creating an OWL ontology to describe the categories used in the resource; for example, the following describes “adverbs” in ALPINO:


<owl:NamedIndividual rdf:about="http://lexinfo.net/corpora/alpino/categories#adv">
   <rdf:type rdf:resource="http://lexinfo.net/corpora/alpino/categories#PartOfSpeech"/>
   <rdfs:label xml:lang="en">Adverb</rdfs:label>
   <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
   <owl:sameAs rdf:resource="&lexinfo;adverb"/>
</owl:NamedIndividual>

Next, we modify the XSLT to use these new categories. In particular, we modify the script at line 105 so that it generates a triple with a URI object, as follows:

<xsl:choose>
  <xsl:when test="name()='rel'">
    <cat:rel>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:rel>
  </xsl:when>
  <xsl:when test="name()='cat'">
    <cat:cat>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:cat>
  </xsl:when>
  <xsl:when test="name()='pos'">
    <cat:pos>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:pos>
  </xsl:when>
  <xsl:otherwise>
    <xsl:element name="{name()}" namespace="{$ns}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:otherwise>
</xsl:choose>
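
Once the files are regenerated, it is worth checking that every category they point to has an entry in the ontology. One possible way, sketched here under the assumption that your grep supports the -o and -h options, is to list the distinct category names that appear in the RDF output; list_categories is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the distinct category names referenced in RDF output.
# list_categories is a hypothetical helper name.
# Arguments: one or more RDF files produced by the stylesheet.
list_categories() {
  # Extract every category URI, keep only the part after '#',
  # and collapse duplicates.
  grep -oh 'http://lexinfo\.net/corpora/alpino/categories#[A-Za-z0-9_-]*' "$@" |
    sed 's/.*#//' |
    sort -u
}
```

Running `list_categories cgn_exs/*.rdf` gives a list that can be compared by eye, or with diff, against the individuals declared in the ontology file.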

We apply the modified stylesheet, publish the result together with our ontology, and have the first version of our linked data corpus.

Finally, I make the resource browsable by creating a zipped dump of all the data and a new index page.
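
The post does not say which tool produced the dump, so the following is only one way to do it: a small sketch that bundles the published folder into a gzip-compressed tar archive. The helper name make_dump and the archive name in the usage line are assumptions.

```shell
#!/bin/sh
# Sketch: bundle a published corpus folder into one compressed archive.
# make_dump is a hypothetical helper name; the archive name is up to you.
# $1 = corpus folder, $2 = archive file to write
make_dump() {
  corpus=$1
  archive=$2
  # -C switches to the parent folder so that paths inside the archive
  # start at the corpus folder name rather than at the filesystem root.
  tar -czf "$archive" -C "$(dirname "$corpus")" "$(basename "$corpus")"
}
```

For instance, `make_dump corpora/alpino /tmp/alpino.tar.gz` writes the archive outside the corpus folder, after which it can be moved into place and linked from the index page.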

In Part 2, we set up a SPARQL endpoint and register the resource with CKAN.

Ontology File: http://lexinfo.net/corpora/alpino/categories.rdf

XSLT File: http://lexinfo.net/corpora/alpino/alpino_xml2rdf3.xsl