John Philip McCrae


2020


Linguistic Linked Data: Representation, Generation and Applications. Philipp Cimiano, Christian Chiarcos, John P. McCrae and Jorge Gracia, Springer, (2020).

7th Workshop on Linked Data in Linguistics (LDL-2020). Maxim Ionov, John P. McCrae, Christian Chiarcos, Thierry Declerck, Julia Bosque-Gil and Jorge Gracia (eds), European Language Resources Association (ELRA) - LREC 2020 Workshop Language Resources and Evaluation Conference, (2020). PDF

Globalex Workshop on Linked Lexicography. Ilan Kernerman, Simon Krek, John P. McCrae, Jorge Gracia, Sina Ahmadi and Besim Kabashi (eds), European Language Resources Association (ELRA) - LREC 2020 Workshop Language Resources and Evaluation Conference, (2020).

Code switching is a prevalent phenomenon in multilingual communities and social media interaction. In the past ten years, we have witnessed an explosion of code-switched data on social media that brings together languages, from low-resourced to high-resourced, in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research have played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate such datasets and categorize them accordingly.

Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding. Ruba Priyadharshini, Bharathi Raja Chakravarthi, Mani Vegupatti and John P. McCrae, ICACCS 2020: International Conference on Advanced Computing & Communication Systems (ICACCS), pp 68-72, (2020). PDF Abstract

In this paper, we utilize pre-trained embeddings, sub-word embeddings and the closely related languages in the code-mixed corpus to create a meta-embedding. We then use a Transformer to encode the code-mixed sentence and a Conditional Random Field to predict the named entities in the code-mixed text. In contrast to classical named entity recognition, where the text is monolingual, our approach can predict named entities in a code-mixed corpus written in both the native script and the Roman script. Our method is a novel way of combining the embeddings of closely related languages to identify named entities in code-mixed Indian social media text written in native and Roman scripts.

On the Linguistic Linked Open Data Infrastructure. Christian Chiarcos, Bettina Klimek, Christian Fäth, Thierry Declerck and John P. McCrae, Proceedings of the 1st International Workshop on Language Technology Platforms at LREC 2020, pp 8-15, (2020). PDF Abstract

In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminology and metadata repositories. We give a detailed overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD infrastructure.

NUIG at TIAD: Combining Unsupervised NLP and Graph Metrics for Translation Inference. John P. McCrae and Mihael Arcan, Proceedings of the Globalex Workshop on Linked Lexicography (@LREC 2020), pp 92-97, (2020). PDF Abstract

In this paper, we present the NUIG system at the TIAD shared task. This system includes graph-based metrics calculated using novel algorithms, with an unsupervised document embedding tool called ONETA and an unsupervised multi-way neural machine translation method. The results are an improvement over our previous system and produce the highest precision among all systems in the task as well as very competitive F-Measure results. Incorporating features from other systems should be easy in the framework we describe in this paper, suggesting this could very easily be extended to an even stronger result.

A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods in Hindi-English Code-Mixed Data. Priya Rani, Shardul Suryawanshi, Koustava Goswami, Bharathi Raja Chakravarthi, Theodorus Fransen and John Philip McCrae, Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying at LREC 2020, pp 42-48, (2020). PDF Abstract

Hate speech detection in social media communication has become one of the primary concerns to avoid conflicts and curb undesired activities. In an environment where multilingual speakers switch among multiple languages, hate speech detection becomes a challenging task using methods that are designed for monolingual corpora. In our work, we attempt to analyze, detect and provide a comparative study of hate speech in a code-mixed social media text. We also provide a Hindi-English code-mixed data set consisting of Facebook and Twitter posts and comments. Our experiments show that deep learning models trained on this code-mixed corpus perform better.

Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability. Georg Rehm, Dimitris Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian Chiarcos, Nils Feldhus, Julian Moreno-Schneider, Florian Kintzel, Elena Montiel-Ponsoda, Víctor Rodriguez-Doncel, John Philip McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina Bontcheva, Ian Roberts, Andrejs Vasiļjevs and Andis Lagzdins, Proceedings of the 1st International Workshop on Language Technology Platforms at LREC 2020, pp 96-107, (2020). PDF Abstract

With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.

A Dataset for Classification of Tamil Memes. Shardul Suryawanshi, Bharathi Raja Chakravarthi, Pranav Verma, Mihael Arcan, John Philip McCrae and Paul Buitelaar, Proceedings of the 5th Workshop on Indian Language Data: Resources and Evaluation (WILDRE-5) at LREC-2020, pp 7-13, (2020). PDF Abstract

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious content targeting users or communities. One way of trolling is by making memes, which in most cases combine an image with a concept or catchphrase. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. To facilitate the computational modelling of trolling in memes for Indian languages, we created a meme dataset for Tamil (TamilMemes). We annotated and released the dataset containing suspected troll and not-troll memes. In this paper, we use an image classifier to address the difficulties involved in the classification of troll memes with the existing methods. We found that the identification of a troll meme with such an image classifier is not feasible, which has been corroborated by precision, recall and F1-score.

Modelling Frequency and Attestations for OntoLex-Lemon. Christian Chiarcos, Maxim Ionov, Jesse de Does, Katrien Depuydt, Anas Fahad Khan, Sander Stolk, Thierry Declerck and John Philip McCrae, Proceedings of the Globalex Workshop on Linked Lexicography (@LREC 2020), pp 1-9, (2020). PDF Abstract

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

Challenges of Word Sense Alignment: Portuguese Language Resources. Ana Salgado, Sina Ahmadi, Alberto Simões, John Philip McCrae and Rute Costa, Proceedings of the 7th Workshop on Linked Data in Linguistics: Building tools and infrastructure at LREC 2020, pp 45-51, (2020). PDF Abstract

This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries, and even less so across different projects where different options may have been assumed in terms of structure and especially the wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using Semantic Web standards. The results obtained are useful for the discussion within the community.

English WordNet 2020: Improving and Extending a WordNet for English using an Open-Source Methodology. John Philip McCrae, Alexandre Rademaker, Ewa Rudnicka and Francis Bond, Proceedings of the Multimodal Wordnets Workshop at LREC 2020, pp 14-19, (2020). PDF Abstract

The Princeton WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project, English WordNet, has arisen to continue the development of the model under an open-source paradigm. In this paper, we detail the second release of this resource, entitled “English WordNet 2020”. The work has focused firstly on the introduction of new synsets and senses and developing guidelines for this, and secondly on the integration of contributions from other projects. We present the changes in this edition, which total over 15,000 changes over the previous release.

There is an increasing demand for sentiment analysis of text from social media, much of which is code-mixed. Systems trained on monolingual data fail on code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific to it. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed texts.

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text. Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini and John Philip McCrae, Proceedings of 1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop at LREC 2020, pp 202-210, (2020). PDF Abstract

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

Towards Automatic Linking of Lexicographic Data: the case of a historical and a modern Danish dictionary. Sina Ahmadi, Sanni Nimb, Thomas Troelsgård, John P. McCrae and Nicolai H. Sørensen, Proceedings of the XIX EURALEX International Congress, (2020, accepted).

Figure Me Out: A Gold Standard Dataset for Metaphor Interpretation. Omnia Zayed, John P. McCrae and Paul Buitelaar, Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 5810-5819, (2020). PDF Abstract

Metaphor comprehension and understanding is a complex cognitive task that requires interpreting metaphors by grasping the interaction between the meaning of their target and source concepts. This is very challenging for humans, let alone computers. Thus, automatic metaphor interpretation is understudied, in part due to the lack of publicly available datasets. The creation and manual annotation of such datasets is a demanding task which requires huge cognitive effort and time. Moreover, there will always be a question of accuracy and consistency of the annotated data due to the subjective nature of the problem. This work addresses these issues by presenting an annotation scheme to interpret verb-noun metaphoric expressions in text. The proposed approach is designed with the goal of reducing the workload on annotators and maintaining consistency. Our methodology employs an automatic retrieval approach which utilises external lexical resources, word embeddings and semantic similarity to generate possible interpretations of identified metaphors in order to enable quick and accurate annotation. We validate our proposed approach by annotating around 1,500 metaphors in tweets, which were annotated by six native English speakers. As a result of this work, we publish as linked data the first gold standard dataset for metaphor interpretation, which will facilitate research in this area.

Some Issues with Building a Multilingual Wordnet. Francis Bond, Luis Morgado da Costa, Michael Wayne Goodman, John P. McCrae and Ahti Lohk, Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 3189-3197, (2020). PDF Abstract

In this paper we discuss the experience of bringing together over 40 different wordnets. We introduce some extensions to the GWA wordnet LMF format proposed in Vossen et al. (2016) and look at how this new information can be displayed. Notable extensions include: confidence, corpus frequency, orthographic variants, lexicalized and non-lexicalized synsets and lemmas, new parts of speech, and more. Many of these extensions already exist in multiple wordnets – the challenge was to find a compatible representation. To this end, we introduce a new version of the Open Multilingual Wordnet (Bond and Foster, 2013) that integrates a new set of tools that test the extensions introduced by this new format, while also ensuring the integrity of the Collaborative Interlingual Index (CILI: Bond et al., 2016), avoiding the introduction of the same new concept through multiple projects.

A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment. Sina Ahmadi, John P. McCrae, Sanni Nimb, Thomas Troelsgård, Sussi Olsen, Bolette S. Pedersen, Thierry Declerck, Tanja Wissik, Monica Monachini, Andrea Bellandi, Fahad Khan, Irene Pisani, Simon Krek, Veronika Lipp, Tamás Váradi, László Simon, András Győrffy, Carole Tiberius, Tanneke Schoonheim, Yifat Ben Moshe, Maya Rudich, Raya Abu Ahmad, Dorielle Lonke, Kira Kovalenko, Margit Langemets, Jelena Kallas, Oksana Dereza, Theodorus Fransen, David Cillessen, David Lindemann, Mikel Alonso, Ana Salgado, José Luis Sancho, Rafael-J. Ureña-Ruiz, Kiril Simov, Petya Osenova, Zara Kancheva, Ivaylo Radev, Ranka Stanković, Cvetana Krstev, Biljana Lazić, Aleksandra Marković, Andrej Perdih and Dejan Gabrovšek, Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 3232-3242, (2020). PDF Abstract

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

Recent Developments for the Linguistic Linked Open Data Infrastructure. Thierry Declerck, John Philip McCrae, Christian Chiarcos, Philipp Cimiano, Jorge Gracia, Matthias Hartung, Deirdre Lee, Elena Montiel-Ponsoda, Artem Revenko and Roser Saurí, Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 5660-5667, (2020). PDF Abstract

In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.

2019


A Comparative Study of SVM and LSTM Deep Learning Algorithms for Stock Market Prediction. Sai Krishna Lakshminarayanan and John McCrae, Proceedings for the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, (2019). Abstract

The paper presents a comparative study of the performance of Long Short-Term Memory (LSTM) neural network models with Support Vector Machine (SVM) regression models. The framework built as part of this study comprises eight models: four built using LSTM and four using SVM. Two major datasets are used for this paper. One is the standard Dow Jones Index (DJI) stock price dataset and the other combines this stock price dataset with external input parameters of crude oil and gold prices. This comparative study identifies the best model in combination with our input dataset. The performance of the models is measured in terms of their Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error, Mean Absolute Percentage Error (MAPE) and R squared (R2) score values. The methodologies and the results of the models are discussed and possible enhancements to this work are also provided.

Linguistic Linked Open Data for All. John P. McCrae and Thierry Declerck, Proceedings of the Language Technology 4 All Conference, (2019). PDF Abstract

In this paper we briefly describe the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’). This project aims to increase the uptake of language technologies by exploiting the combination of linked data and language technologies, that is Linguistic Linked Open Data (LLOD), to create ready-to-use multilingual data. Prêt-à-LLOD aims to achieve this by creating a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies, in particular the usage of the LLOD.

Towards a Global Lexicographic Infrastructure. Simon Krek, Thierry Declerck, John Philip McCrae and Tanja Wissik, Proceedings of the Language Technology 4 All Conference, (2019). PDF Abstract

In this paper we briefly describe the European project ELEXIS (European Lexicographic Infrastructure). ELEXIS aims to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will enable efficient access to high-quality lexical data in the digital age, and bridge the gap between more advanced and lesser-supported lexicographic resources. For this, ELEXIS makes use of or establishes common standards and solutions for the development of lexicographic resources, and develops strategies and tools for extracting, structuring and linking lexicographic resources.

Cardamom: Comparative Deep Models for Minority and Historical Languages. John Philip McCrae and Theodorus Fransen, Proceedings of the Language Technology 4 All Conference, (2019). PDF Abstract

This paper gives an overview of the Cardamom project, which aims to close the resource gap for minority and under-resourced languages by means of deep-learning-based natural language processing (NLP) and exploiting similarities of closely-related languages. The project further extends this concept to historical languages, which can be considered as closely related to their modern form, and as such aims to provide NLP through both space and time for languages that have been ignored by current approaches.

Challenges for the Representations for Morphology in Ontology Lexicons. Bettina Klimek, John P. McCrae, Maxim Ionov, James K. Tauber, Christian Chiarcos, Julia Bosque-Gil and Paul Buitelaar, Proceedings of Sixth Biennial Conference on Electronic Lexicography, eLex 2019, (2019). PDF Abstract

Recent years have seen a growing trend in the publication of language resources as Linguistic Linked Data (LLD) to enhance their discovery, reuse and the interoperability of tools that consume language data. To this aim, the OntoLex-lemon model has emerged as a de-facto standard to represent lexical data on the Web. However, traditional dictionaries contain a considerable amount of morphological information which is not straightforwardly representable as LLD within the current model. In order to fill this gap, a new Morphology Module of OntoLex-lemon is currently being developed. This paper presents the current state of this model as ongoing work, as well as the underlying challenges that emerged during the module's development. Based on the MMoOn Core ontology, it aims to account for a wide range of morphological information, ranging from endings for deriving whole paradigms to the decomposition and generation of lexical entries, in compliance with other OntoLex-lemon modules, and facilitates the encoding of complex morphological data in ontology lexicons.

The ELEXIS Interface for Interoperable Lexical Resources. John P. McCrae, Carole Tiberius, Anas Fahad Khan, Ilan Kernerman, Thierry Declerck, Simon Krek, Monica Monachini and Sina Ahmadi, Proceedings of Sixth Biennial Conference on Electronic Lexicography, eLex 2019, (2019). PDF Abstract

ELEXIS is a project that aims to create a European network of lexical resources, and one of the key challenges for this is the development of an interoperable interface for different lexical resources so that further tools may improve the data. This paper describes this interface and, in particular, the five methods of entry into the infrastructure: through retrodigitization, by conversion to TEI-Lex0, by providing data in the TEI-Lex0 format, by the OntoLex format, or through the REST interface described in this paper.

Towards Electronic Lexicography for the Kurdish Language. Sina Ahmadi, Hossein Hassani and John P. McCrae, Proceedings of Sixth Biennial Conference on Electronic Lexicography, eLex 2019, (2019). PDF Abstract

This paper describes the development of lexicographic resources for Kurdish and provides a lexical model for this language. Kurdish is considered a less-resourced language and currently lacks machine-readable lexicon resources. The unique potential which Linked Data and the Semantic Web offer to e-lexicography enables interoperability across lexical resources by elevating traditional linguistic data to machine-processable semantic formats. Therefore, we present our lexicon in the Ontolex-Lemon ontology as a standard model for sharing lexical information on the Semantic Web. The research covers the Sorani, Kurmanji, and Hawrami dialects of Kurdish. This research suggests that although Kurdish is a less-resourced language, in terms of documented lexicons it owns a wide range of resources; but because they are not machine-readable, they could not contribute to language processing. The outcome of this project, which is made publicly available, assists scholars in their efforts towards making Kurdish a resource-rich language.

Taxonomy Extraction for Customer Service Knowledge Base Construction. Bianca Pereira, Cécile Robin, Tobias Daudert, John P. McCrae, Paul Buitelaar and Pranab Mohanty, Proceedings of the SEMANTicS 2019, (2019). PDF Abstract

Customer service agents play an important role in bridging the gap between customers' vocabulary and business terms. In a scenario where organisations are moving into semi-automatic customer service, semantic technologies with capacity to bridge this gap become a necessity. In this paper we explore the use of automatic taxonomy extraction from text as a means to reconstruct a customer-agent taxonomic vocabulary. We evaluate our proposed solution in an industry use case scenario in the financial domain and show that our approaches for automated term extraction and using in-domain training for taxonomy construction can improve the quality of automatically constructed taxonomic knowledge bases.

A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles. Adrian Doyle, John P. McCrae and Clodagh Downey, Proceedings of the Celtic Language Technology Workshop 2019, (2019). PDF Abstract

This paper examines difficulties inherent in tokenization of Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts which contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented and the creation of a character-level LSTM network is detailed, its accuracy assessed, and efforts at optimising its performance are recorded. Based on the results of this research it is expected that a character-level LSTM model may provide a viable solution for tokenization of historical texts where the use of Scriptio Continua, or alternative spacing conventions, makes the automatic separation of tokens difficult.

Adapting Term Recognition to an Under-Resourced Language: the Case of Irish. John P. McCrae and Adrian Doyle, Proceedings of the Celtic Language Technology Workshop 2019, (2019). PDF Abstract

Automatic Term Recognition (ATR) is an important method for the summarization and analysis of large corpora, and normally requires a significant amount of linguistic input, in particular the use of part-of-speech taggers. For an under-resourced language such as Irish, the resources necessary for this may be scarce or entirely absent. We evaluate two methods for the automatic extraction of terms, based on the small part-of-speech-tagged corpora that are available for Irish and on a large terminology list, and show that both methods can produce viable term extractors. We evaluate this with a newly constructed corpus that is the first available corpus for term extraction in Irish. Our results shine some light on the challenge of adapting natural language processing systems to under-resourced scenarios.

WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural Machine Translation. Bharathi Raja Chakravarthi, Mihael Arcan and John P. McCrae, Proceedings of the MomenT Workshop, (2019). PDF Abstract

In this paper, we translate the glosses in the English WordNet based on the expand approach for improving and generating wordnets with the help of multilingual neural machine translation. Neural Machine Translation (NMT) has recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. However, the performance of NMT often suffers in low-resource scenarios where large corpora cannot be obtained. Using training data from closely related languages has proven to be invaluable for improving performance. In this paper, we describe how we trained multilingual NMT for closely related languages utilizing phonetic transcription for Dravidian languages. We report the evaluation results of the generated wordnet senses in terms of precision. By comparing to a recently proposed approach, we show improvement in terms of precision.

Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Bernardo Stearns, Arun Jayapal, S Srivedy, Mihael Arcan, Manel Zarrouk and John P. McCrae, Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages (LoResMT 2019), (2019). PDF Abstract

Multimodal machine translation is the task of translating from a source language to a target language using information from other modalities. Existing multimodal datasets have been restricted to only highly resourced languages. These datasets were collected by manual translation of English descriptions from the Flickr30K dataset. In this work, we introduce MMDravi, a Multilingual Multimodal dataset for under-resourced Dravidian languages. It comprises 30K sentences which were created utilizing several machine translation outputs. Using data from MMDravi and a phonetic transcription of the corpus, we build an MMNMT system for closely related Dravidian languages to take advantage of the multilingual corpus and other modalities. We evaluate the MMNMT translations generated by the proposed approach with human-annotated evaluation tests in terms of BLEU, METEOR, and TER. Relying on multilingual corpora, phonetic transcription, and image features, our approach improves the translation quality for the under-resourced languages.

English WordNet 2019 -- An Open-Source WordNet for English. John P. McCrae, Alexandre Rademaker, Francis Bond, Ewa Rudnicka and Christiane Fellbaum, Proceedings of the 10th Global WordNet Conference – GWC 2019, (2019). Abstract

We describe the release of a new wordnet for English based on the Princeton WordNet, but now developed under an open-source model. In particular, this version of WordNet, which we call English WordNet 2019, has been developed by multiple people around the world through GitHub and fixes many errors in previous wordnets for English. We give some details of the changes that have been made in this version and give some perspectives about likely future changes that will be made as this project continues to evolve.

Identification of Adjective-Noun Neologisms using Pretrained Language Models. John P. McCrae, Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) at ACL 2019, (2019). PDF Abstract

Neologism detection is a key task in the construction of lexical resources and has wider implications for NLP; however, the identification of multiword neologisms has received little attention. In this paper, we show that we can effectively distinguish between compositional and non-compositional adjective-noun pairs by using pretrained language models and comparing this with individual word embeddings. Our results show that the use of these models significantly improves over baseline linguistic features; however, the combination with linguistic features improves the results still further, suggesting the strength of a hybrid approach.

Inferring translation candidates for multilingual dictionary generation. Mihael Arcan, Daniel Torregrosa, Sina Ahmadi and John P. McCrae, Proceedings of the 2nd Translation Inference Across Dictionaries (TIAD) Shared Task, (2019). PDF Abstract

In the widely-connected digital world, multilingual lexical resources are among the most important resources for natural language processing applications, including information retrieval, question answering and knowledge management. These applications benefit from the multilingual knowledge as well as from the semantic relations between the words documented in these resources. Since multilingual dictionary creation and curation is a time-consuming task, we explored the use of multi-way neural machine translation, trained on corpora of languages from the same family and additionally on a relatively small human-validated dictionary, to infer new translation candidates. Our results showed not only that new dictionary entries can be identified and extracted from the translation model, but also that the expected precision and recall of the resulting dictionary can be adjusted by using different thresholds.

TIAD 2019 Shared Task: Leveraging Knowledge Graphs with Neural Machine Translation for Automatic Multilingual Dictionary Generation. Daniel Torregrosa, Mihael Arcan, Sina Ahmadi and John P. McCrae, Proceedings of the 2nd Translation Inference Across Dictionaries (TIAD) Shared Task, (2019). PDF Abstract

This paper describes the different approaches proposed for the TIAD 2019 Shared Task, which consisted of the automatic discovery and generation of dictionaries leveraging multilingual knowledge bases. We present three methods based on graph analysis and neural machine translation and show that we can generate translations without parallel data.

TIAD Shared Task 2019: Orthonormal Explicit Topic Analysis for Translation Inference across Dictionaries. John P. McCrae, Proceedings of the 2nd Translation Inference Across Dictionaries (TIAD) Shared Task, (2019). PDF Abstract

The task of inferring translations can be achieved by means of comparable corpora, and in this paper we apply explicit topic modelling over comparable corpora to the task of inferring translation candidates. In particular, we use the Orthonormal Explicit Topic Analysis (ONETA) model, which has been shown to be the state-of-the-art explicit topic model through its elimination of correlations between topics. The method proves highly effective at selecting translations with high precision.

Lexical Sense Alignment using Weighted Bipartite b-Matching. Sina Ahmadi, Mihael Arcan and John McCrae, Proceedings of the Poster Track of LDK 2019, pp 12-16, (2019). PDF Abstract

Lexical resources are important components of natural language processing (NLP) applications, providing linguistic information about the vocabulary of a language and the semantic relationships between its words. While there is an increasing number of lexical resources, particularly expert-made ones such as WordNet or FrameNet as well as collaboratively-curated ones such as Wikipedia or Wiktionary, manual construction and maintenance of such resources is a cumbersome task. This can be efficiently addressed by NLP techniques. Aligned resources have been shown to improve word, knowledge and domain coverage and increase multilingualism by creating new lexical resources such as Yago, BabelNet and ConceptNet. In addition, they can improve the performance of NLP tasks such as word sense disambiguation, semantic role tagging and semantic relation extraction.

Representing Arabic Lexicons in Lemon - a Preliminary Study. Mustafa Jarrar, Hamzeh Amayreh and John McCrae, Proceedings of the Poster Track of LDK 2019, pp 29-33, (2019). PDF Abstract

We present our progress in representing 150 Arabic multilingual lexicons using Lemon, which we have been digitizing from scratch. These lexicons are available through a lexicographic search engine (https://ontology.birzeit.edu) that allows searching for translations, synonyms, and definitions. Representing these lexicons in Lemon will enable them to be used by ontologies and NLP applications, as well as to be interlinked with the Open Linguistic Data Cloud.

Crowd-sourcing A High-Quality Dataset for Metaphor Identification in Tweets. Omnia Zayed, John P. McCrae and Paul Buitelaar, 2nd Conference on Language, Data and Knowledge (LDK 2019), (2019). PDF Abstract

Metaphor is one of the most important elements of human communication, especially in informal settings such as social media. There have been a number of datasets created for metaphor identification; however, this task has proven difficult due to the nebulous nature of metaphoricity. In this paper, we present a crowd-sourcing approach for the creation of a dataset for metaphor identification that is able to rapidly achieve large coverage over the different usages of metaphor in a given corpus while maintaining high accuracy. We validate this methodology by creating a set of 2,500 manually annotated tweets in English, for which we achieve inter-annotator agreement scores over 0.8, which is higher than other reported results that did not limit the task. This methodology is based on the use of an existing classifier for metaphor in order to assist in the identification and selection of the examples for annotation, in a way that reduces the cognitive load for annotators and enables quick and accurate annotation. We selected a corpus of both general-language tweets and political tweets relating to Brexit, and we compare the resulting corpus on these two domains. As a result of this work, we have published the first dataset of tweets annotated for metaphors, which we believe will be invaluable for the development, training and evaluation of approaches for metaphor identification in tweets.

Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages. Bharathi Raja Chakravarthi, Mihael Arcan and John P. McCrae, Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek and Milan Dojchinovski (eds), 2nd Conference on Language, Data and Knowledge (LDK 2019), pp 6:1--6:14, (2019). PDF Abstract

Under-resourced languages are a significant challenge for statistical approaches to machine translation, and recently it has been shown that the usage of training data from closely-related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes taking advantage of these language similarities difficult. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration into the Latin script with fine-grained IPA transliteration. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in terms of the BLEU, METEOR and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.

2nd Conference on Language, Data and Knowledge (LDK 2019). Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek and Milan Dojchinovski (eds), Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik - OpenAccess Series in Informatics (OASIcs), (2019).

2018


On Lexicographical Networks. Sina Ahmadi, Mihael Arcan and John McCrae, Workshop on eLexicography: Between Digital Humanities and Artificial Intelligence, (2018). PDF Abstract

Lexical resources are important components of natural language processing (NLP) applications, providing machine-readable knowledge for various tasks. One of the most popular examples of lexical resources are lexicons. Lexicons provide linguistic information about the vocabulary of a language and the semantic relationships between words in a pair of languages. In addition to lexicons, there are various other types of lexical resources, particularly those which are made by experts, such as WordNet, VerbNet and FrameNet, and those which are collaboratively curated, such as Wikipedia and Wiktionary.

6th Workshop on Linked Data in Linguistics: Towards Linguistic Data Science. John P. McCrae, Christian Chiarcos, Thierry Declerck, Jorge Gracia and Bettina Klimek (eds), European Language Resources Association - LREC-2018 Workshop Proceedings, (2018). Abstract

Since its establishment in 2012, the Linked Data in Linguistics (LDL) workshop series has become the major forum for presenting, discussing and disseminating technologies, vocabularies, resources and experiences regarding the application of Semantic Web standards and the Linked Open Data paradigm to language resources in order to facilitate their visibility, accessibility, interoperability, reusability, enrichment, combined evaluation and integration. The LDL workshop series is organized by the Open Linguistics Working Group of the Open Knowledge Foundation, and has contributed greatly to the emergence and growth of the Linguistic Linked Open Data (LLOD) cloud. LDL workshops contribute to the discussion, dissemination and establishment of community standards that drive this development, most notably the Lemon/OntoLex model for lexical resources, as well as standards for other types of language resources still under development. Building on our earlier success in creating and linking language resources, LDL-2018 will focus on Linguistic Data Science, i.e., research methodologies and applications building on Linguistic Linked Open Data and the existing technology and resource stack for linguistics, natural language processing and digital humanities. LDL-2018 builds on the success of the workshop series, including two appearances at LREC (2014, 2016), where we attracted a large number of interested participants. As of 2016, LDL workshops alternate with our stand-alone conference on Language, Data and Knowledge (LDK). LDK-2017 was held in Galway, Ireland, as a 3-day event with 150 registrants and several satellite workshops. Continuing the LDL workshop series together with LDK is important in order to facilitate dissemination within, and to receive input from, the language resource community, and LREC is the obvious host conference for this purpose. LDL-2018 will be supported by the ELEXIS project on a European Lexicographic Infrastructure.

European Lexicographic Infrastructure (ELEXIS). Simon Krek, John McCrae, Iztok Kosem, Tanja Wissek, Carole Tiberius, Roberto Navigli and Bolette Sandford Pedersen, Proceedings of the XVIII EURALEX International Congress on Lexicography in Global Contexts, pp 881-892, (2018). Abstract

In the paper we describe a new EU infrastructure project dedicated to lexicography. The project is part of the Horizon 2020 program, with a duration of four years (2018-2022). The result of the project will be an infrastructure which will (1) enable efficient access to high-quality lexicographic data, and (2) bridge the gap between more advanced and less-resourced scholarly communities working on lexicographic resources. One of the main issues addressed by the project is the fact that current lexicographic resources have different levels of (incompatible) structuring, and are not equally suitable for application in Natural Language Processing and other fields. The project will therefore develop strategies, tools and standards for extracting, structuring and linking lexicographic resources to enable their inclusion in Linked Open Data and the Semantic Web, as well as their use in the context of digital humanities.

Constructing an Annotated Corpus of Verbal MWEs for English. Abigail Walsh, Claire Bonial, Kristina Geeraert, John P. McCrae, Nathan Schneider and Clarissa Somers, Proceedings of Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), (2018). Abstract

This paper describes the construction and annotation of a corpus of verbal MWEs for English as part of the PARSEME Shared Task 1.1 on automatic identification of verbal MWEs. The criteria for corpus selection, the categories of MWEs used, and the training process are discussed, along with the particular issues that led to revisions in edition 1.1 of the annotation guidelines. Finally, an overview of the characteristics of the final annotated corpus is presented, as well as some discussion on inter-annotator agreement.

Phrase-Level Metaphor Identification using Distributed Representations of Word Meaning. Omnia Zayed, John P. McCrae and Paul Buitelaar, Proceedings of the Workshop on Figurative Language Processing, (2018). Abstract

Metaphor is an essential element of human cognition which is often used to express ideas and emotions that might be difficult to express using literal language. Processing metaphoric language is a challenging task for a wide range of applications ranging from text simplification to psychotherapy. Despite the variety of approaches that are trying to process metaphor, there is still a need for better models that mimic the human cognition while exploiting fewer resources. In this paper, we present an approach based on distributional semantics to identify metaphors on the phrase-level. We investigated the use of different word embeddings models to identify verb-noun pairs where the verb is used metaphorically. Several experiments are conducted to show the performance of the proposed approach on benchmark datasets.

Linking Datasets Using Semantic Textual Similarity. John P. McCrae and Paul Buitelaar, Cybernetics and Information Technologies, 18(1), pp 109-123, (2018). Abstract

Linked data has been widely recognized as an important paradigm for representing data, and one of the most important aspects of supporting its use is the discovery of links between datasets. For many datasets, there is a significant amount of textual information in the form of labels, descriptions and documentation about the elements of the dataset, and the foundation of precise linking lies in the application of semantic textual similarity to link these datasets. However, most linking tools so far rely on only simple string similarity metrics such as Jaccard scores. We present an evaluation of some metrics that have performed well in recent semantic textual similarity evaluations and apply these to linking existing datasets.

Preservation of Original Orthography in the Construction of an Old Irish Corpus. Adrian Doyle, John P. McCrae and Clodagh Downey, Proceedings of the 3rd Workshop for Collaboration and Computing for Under-Resourced Languages, (2018). Abstract

This paper will examine the process of creating a digital corpus based on the Würzburg glosses, the earliest large collection of glosses written in the Irish language. Modern editorial standards applied in publications of these glosses can alter spelling, punctuation, and even the semantic meaning of a sentence where one word is used in place of another. Therefore, an understanding of the original orthography utilised by Old Irish scribes is important in determining the orthography which should be utilised in a modern digital corpus. This paper will outline why the text of the Würzburg glosses as it appears in Thesaurus Palaeohibernicus is the best candidate for digitisation. The automated digitisation and proofing process of the corpus will be outlined, and details will be given of a tag-set utilised within the digital corpus in order to preserve information present in Thesaurus Palaeohibernicus as metadata.

ELEXIS - European Lexicographic Infrastructure: Contributions to and from the Linguistic Linked Open Data. Thierry Declerck, John McCrae, Roberto Navigli, Ksenia Zaytseva and Tanja Wissik, Proceedings of the Globalex 2018 Workshop, (2018). Abstract

In this paper we outline the interoperability aspects of the recently started European project ELEXIS (European Lexicographic Infrastructure). ELEXIS aims to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will enable efficient access to high quality lexical data in the digital age, and bridge the gap between more advanced and lesser-supported lexicographic resources. For this, ELEXIS will make use of or establish common standards and solutions for the development of lexicographic resources and develop strategies and tools for extracting, structuring and linking lexicographic resources.

A supervised approach to taxonomy extraction using word embeddings. Rajdeep Sarkar, John P. McCrae and Paul Buitelaar, Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018). PDF Abstract

Large collections of texts are commonly generated by large organizations, and making sense of these collections of texts is a significant challenge. One method for handling this is to organize the concepts into a hierarchical structure such that similar concepts can be discovered and easily browsed. This approach was the subject of a recent evaluation campaign, TExEval; however, the results of this task showed that none of the systems consistently outperformed a relatively simple baseline. In order to solve this issue, we propose a new method that uses supervised learning to combine multiple features, including the baseline features, with a support vector machine classifier. We show that this outperforms the baseline and thus provides a stronger method for identifying taxonomic relations than previous methods.

A Comparison Of Emotion Annotation Schemes And A New Annotated Data Set. Ian Wood, John P. McCrae, Vladimir Andryushechkin and Paul Buitelaar, Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018). PDF Abstract

While the recognition of positive/negative sentiment in text is an established task with many standard data sets and well-developed methodologies, the recognition of more nuanced affect has received less attention, and in particular, there are very few publicly available gold-standard annotated resources. To address this lack, we present a series of emotion annotation studies on tweets, culminating in a publicly available collection of 2,019 tweets with scores on four emotion dimensions: valence, arousal, dominance and surprise, following the emotion representation model identified by Fontaine et al. (2007). Further, we make a comparison of relative vs. absolute annotation schemes. We find improved annotator agreement with a relative annotation scheme (comparisons) on a dimensional emotion model over a categorical annotation scheme on Ekman's six basic emotions (Ekman et al., 1987); however, when we compare inter-annotator agreement for comparisons with agreement for a rating-scale annotation scheme (both with the same dimensional emotion model), we find improved inter-annotator agreement with rating scales, challenging a common belief that relative judgements are more reliable.

Teanga: A Linked Data based platform for Natural Language Processing. Housam Ziad, John P. McCrae and Paul Buitelaar, Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018). Abstract

In this paper, we describe Teanga, a linked-data-based platform for natural language processing (NLP). Teanga enables the use of many NLP services from a single interface, whether the need is to use a single service or multiple services in a pipeline. Teanga focuses on the problem of NLP service interoperability by using linked data to define the types of the services' input and output. Teanga's strengths include being easy to install and run, easy to use, able to run multiple NLP tasks from one interface and helping users to build a pipeline of tasks through a graphical user interface.

Automatic Enrichment of Terminological Resources: the IATE RDF Example. Mihael Arcan, Elena Montiel-Ponsoda, John P. McCrae and Paul Buitelaar, Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018). PDF Abstract

Terminological resources have proven necessary in many organizations and institutions to ensure communication between experts. However, the maintenance of these resources is a very time-consuming and expensive process. Therefore, the work described in this contribution aims to automate the maintenance process of such resources. As an example, we demonstrate enriching the RDF version of IATE with new terms in the languages for which no translation was available, as well as with domain-disambiguated sentences and information about usage frequency. This is achieved by relying on machine translation trained on parallel corpora that contain the terms in question, and on multilingual word sense disambiguation performed on the context provided by the sentences. Our results show that for most languages, translating the terms within a disambiguated context significantly outperforms the approach with randomly selected sentences.

MixedEmotions: An Open-Source Toolbox for Multi-Modal Emotion Analysis. Paul Buitelaar, Ian D. Wood, Sapna Negi, Mihael Arcan, John P. McCrae, Andrejs Abele, Cécile Robin, Vladimir Andryushechkin, Housam Ziad, Hesam Sagha, J. Fernando Sánchez-Rada, Carlos A. Iglesias, Carlos Navarro, Andreas Giefer, Nicolaus Heise, Vincenzo Masucci, Francesco A. Danza, Ciro Caterino, Pavel Smrž, Michal Hradiš, Filip Povolný, Marek Klimeš, Pavel Matějka and Giovanni Tummarello, IEEE Transactions on Multimedia, 20(9), (2018). Abstract

Recently, there has been an increasing tendency to embed functionalities for recognizing emotions from user-generated media content in automated systems such as call-centre operations, recommendations, and assistive technologies, providing richer and more informative user and content profiles. However, to date, adding these functionalities has been a tedious, costly, and time-consuming effort, requiring identification and integration of diverse tools with diverse interfaces as required by the use case at hand. The MixedEmotions Toolbox addresses the need for such functionalities by providing tools for text, audio, video, and linked data processing within an easily integrable plug-and-play platform. These functionalities include: 1) for text processing: emotion and sentiment recognition; 2) for audio processing: emotion, age, and gender recognition; 3) for video processing: face detection and tracking, emotion recognition, facial landmark localization, head pose estimation, face alignment, and body pose estimation; and 4) for linked data: knowledge graph integration. Moreover, the MixedEmotions Toolbox is open-source and free. In this paper, we present this toolbox in the context of the existing landscape, and provide a range of detailed benchmarks on standard test-beds showing its state-of-the-art performance. Furthermore, three real-world use cases show its effectiveness, namely, emotion-driven smart TV, call center monitoring, and brand reputation analysis.

Mapping WordNet Instances to Wikipedia. John P. McCrae, Proceedings of the 9th Global WordNet Conference, (2018). Abstract

Lexical resources differ from encyclopaedic resources; they represent two distinct types of resource, covering general language and named entities respectively. However, many lexical resources, including Princeton WordNet, contain many proper nouns referring to named entities in the world, yet it is not possible or desirable for a lexical resource to cover all named entities that may reasonably occur in a text. In this paper, we propose that instead of including synsets for instance concepts, PWN should instead provide links to Wikipedia articles describing the concept. In order to enable this, we have created a gold-quality mapping between all of the 7,742 instances in PWN and Wikipedia (where such a mapping is possible). As such, this resource aims to provide a gold standard for link discovery, while also allowing PWN to distinguish itself from other resources such as DBpedia or BabelNet. Moreover, this linking connects PWN to the Linguistic Linked Open Data cloud, thus creating a richer, more usable resource for natural language processing.

Towards a Crowd-Sourced WordNet for Colloquial English. John P. McCrae, Ian Wood and Amanda Hicks, Proceedings of the 9th Global WordNet Conference, (2018). Abstract

Princeton WordNet is one of the most widely-used resources for natural language processing, but is updated only infrequently and cannot keep up with the fast-changing usage of the English language on social media platforms such as Twitter. The Colloquial WordNet aims to provide an open platform whereby anyone can contribute, while still following the structure of WordNet. Crowdsourced lexical resources often have significant quality issues, and as such care must be taken in the design of the interface to ensure quality. In this paper, we present the development of a platform that can be opened on the Web to any lexicographer who wishes to contribute to this resource, and the lexicographic methodology applied by this interface.

Improving Wordnets for Under-Resourced Languages Using Machine Translation information. Bharathi Raja Chakravarthi, Mihael Arcan and John P. McCrae, Proceedings of the 9th Global WordNet Conference, (2018). Abstract

Wordnets are extensively used in natural language processing, but the current approaches for manually building a wordnet from scratch involve large research groups working for a long period of time, which are typically not available for under-resourced languages. Even if wordnet-like resources are available for under-resourced languages, they are often not easily accessible, which can alter the results of applications using these resources. Our proposed method presents an expand approach for improving and generating wordnets with the help of machine translation. We apply our methods to improve and extend wordnets for the Dravidian languages, i.e., Tamil, Telugu and Kannada, which are severely under-resourced languages. We report evaluation results of the generated wordnet senses in terms of precision for these languages. In addition, we carried out a manual evaluation of the translations for the Tamil language, where we demonstrate that our approach can aid in improving wordnet resources for under-resourced Dravidian languages.

ELEXIS - a European infrastructure fostering cooperation and information exchange among lexicographical research communities. Bolette Pedersen, John McCrae, Carole Tiberius and Simon Krek, Proceedings of the 9th Global WordNet Conference, (2018). Abstract

The paper describes the objectives, concept and methodology of ELEXIS, a European infrastructure fostering cooperation and information exchange among lexicographical research communities. The infrastructure is a newly granted project under the Horizon 2020 INFRAIA call, with the topic Integrating Activities for Starting Communities. The project is planned to start in January 2018.

2017


Knowledge Graphs and Language Technology - ISWC 2016 International Workshops: KEKI and NLP&DBpedia. Marieke van Erp, Sebastian Hellmann, John P. McCrae, Christian Chiarcos, Key-Sun Choi, Jorge Gracia, Yoshihiko Hayashi, Seiji Koide, Pablo Mendes, Heiko Paulheim and Hideaki Takeda (eds), Springer - Lecture Notes in Computer Science, (2017).

Language, Data, and Knowledge. Jorge Gracia, Francis Bond, John P. McCrae, Paul Buitelaar, Christian Chiarcos and Sebastian Hellmann (eds), Springer - Lecture Notes in Artificial Intelligence, (2017).

Linking Knowledge Graphs across Languages with Semantic Similarity and Machine Translation. John P. McCrae, Mihael Arcan and Paul Buitelaar, Proceedings of the First Workshop on Multi-Language Processing in a Globalising World (MLP2017), (2017).

The OntoLex-Lemon Model: development and applications. John P. McCrae, Julia Bosque-Gil, Jorge Gracia, Paul Buitelaar and Philipp Cimiano, Proceedings of eLex 2017, pp 587-597, (2017). PDF

OnLiT: An Ontology for Linguistic Terminology. Bettina Klimek, John P. McCrae, Christian Chiarcos and Sebastian Hellmann, Proceedings of the First Conference on Language, Data and Knowledge (LDK2017), pp 42-57, (2017).

The Colloquial WordNet: Extending Princeton WordNet with Neologisms. John P. McCrae, Ian Wood and Amanda Hicks, Proceedings of the First Conference on Language, Data and Knowledge (LDK2017), pp 194-202, (2017).

An Evaluation Dataset for Linked Data Profiling. Andrejs Abele, John P. McCrae and Paul Buitelaar, Proceedings of the First Conference on Language, Data and Knowledge (LDK2017), pp 1-9, (2017).

2016


Lexicon Model for Ontologies: Community Report. Philipp Cimiano, John P. McCrae and Paul Buitelaar, Technical Report: W3C(2016).

Expanding wordnets to new languages with multilingual sense disambiguation. Mihael Arcan, John P. McCrae and Paul Buitelaar, Proceedings of The 26th International Conference on Computational Linguistics, (2016).

Identifying Poorly-Defined Concepts in WordNet with Graph Metrics. John P. McCrae and Narumol Prangnawarat, Proceedings of the First Workshop on Knowledge Extraction and Knowledge Integration (KEKI-2016), (2016).

LIXR: Quick, succinct conversion of XML to RDF. John P. McCrae and Philipp Cimiano, Proceedings of the ISWC 2016 Posters and Demo Track, (2016).

Yuzu: Publishing Any Data as Linked Data. John P. McCrae, Proceedings of the ISWC 2016 Posters and Demo Track, (2016).

NUIG-UNLP at SemEval-2016 Task 1: Soft Alignment and Deep Learning for Semantic Textual Similarity. John P. McCrae, Kartik Asooja, Nitish Aggarwal and Paul Buitelaar, SemEval-2016, (2016).

Linked Data and Text Mining as an Enabler for Reproducible Research. John P. McCrae, Georgeta Bordea and Paul Buitelaar, 1st Workshop on Cross-Platform Text Mining and Natural Language Processing Interoperability, (2016).

Domain adaptation for ontology localization. John P. McCrae, Mihael Arcan, Kartik Asooja, Jorge Gracia, Paul Buitelaar and Philipp Cimiano, Web Semantics, 36, pp 23-31, (2016).

Representing Multiword Expressions on the Web with the OntoLex-Lemon model. John P. McCrae, Philipp Cimiano, Paul Buitelaar and Georgeta Bordea, PARSEME/ENeL workshop on MWE e-lexicons, (2016).

The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud. John P. McCrae, Christian Chiarcos, Francis Bond, Philipp Cimiano, Thierry Declerck, Gerard de Melo, Jorge Gracia, Sebastian Hellmann, Bettina Klimek, Steven Moran, Petya Osenova, Antonio Pareja-Lora and Jonathan Pool, 10th Language Resources and Evaluation Conference (LREC), pp 2435-2441, (2016).

CILI: the Collaborative Interlingual Index. Francis Bond, Piek Vossen, John P. McCrae and Christiane Fellbaum, Proceedings of the Global WordNet Conference 2016, (2016).

Toward a truly multilingual Global Wordnet Grid. Piek Vossen, Francis Bond and John P. McCrae, Proceedings of the Global WordNet Conference 2016, (2016).

2015


Multilingual Linked Data (editorial). John P. McCrae, Steven Moran, Sebastian Hellmann and Martin Brümmer, Semantic Web, 6(4), pp 315-317, (2015).

lemonUby - a large, interlinked, syntactically-rich lexical resource for ontologies. Judith Eckle-Kohler, John McCrae and Christian Chiarcos, Semantic Web, 6(4), pp 371-378, (2015).

Linghub: a Linked Data based portal supporting the discovery of language resources. John P. McCrae and Philipp Cimiano, Proceedings of the 11th International Conference on Semantic Systems, pp 88-91, (2015).

Linking Four Heterogeneous Language Resources as Linked Data. Benjamin Siemoneit, John P. McCrae and Philipp Cimiano, Proceedings of the 4th Workshop on Linked Data in Linguistics, pp 59-63, (2015).

Reconciling Heterogeneous Descriptions of Language Resources. John P. McCrae, Philipp Cimiano, Victor Rodriguez-Doncel, Daniel Vila-Suero, Jorge Gracia, Luca Matteis, Roberto Navigli, Andrejs Abele, Gabriela Vulcu and Paul Buitelaar, Proceedings of the 4th Workshop on Linked Data in Linguistics, pp 39-48, (2015).

Linked Terminology: Applying Linked Data Principles to Terminological Resources. Philipp Cimiano, John P. McCrae, Victor Rodriguez-Doncel, Tatiana Gornostaya, Asuncion Gómez-Pérez, Benjamin Siemoneit and Andis Lagzdins, Proceedings of eLex 2015, pp 504-517, (2015).

One ontology to bind them all: The META-SHARE OWL ontology for the interoperability of linguistic datasets on the Web. John P. McCrae, Penny Labropoulou, Jorge Gracia, Marta Villegas, Victor Rodriguez-Doncel and Philipp Cimiano, Proceedings of the 4th Workshop on the Multilingual Semantic Web, (2015).

LIME: the Metadata Module for OntoLex. Manuel Fiorelli, Armando Stellato, John P. McCrae, Philipp Cimiano and Maria Teresa Pazienza, Proceedings of 12th Extended Semantic Web Conference, (2015).

Language Resources and Linked Data: A Practical Perspective. Jorge Gracia, Daniel Vila-Suero, John P. McCrae, Tiziano Flati, Ciro Baron and Milan Dojchinovski, In: Knowledge Engineering and Knowledge Management, (2015).

2014


Design Patterns for Engineering the Ontology-Lexicon Interface. John P. McCrae and Christina Unger, In: Towards the Multilingual Semantic Web, Paul Buitelaar and Philipp Cimiano (eds), pp 15-30, (2014).

Representing Swedish Lexical Resources in RDF with lemon. Lars Borin, Dana Dannells, Markus Forsberg and John P. McCrae, Proceedings of the ISWC 2014 Posters & Demonstrations Track - a track within the 13th International Semantic Web Conference, pp 329-332, (2014).

Towards assured data quality and validation by data certification. John P. McCrae, Cord Wiljes and Philipp Cimiano, Proceedings of the 1st Workshop on Linked Data Quality, (2014).

Bielefeld SC: Orthonormal Topic Modelling for Grammar Induction. John P. McCrae and Philipp Cimiano, Proceedings of the 8th International Workshop on Semantic Evaluation, (2014).

Default Physical Measurements in SUMO. Francesca Quattri, Adam Pease and John P. McCrae, Proceedings of 4th Workshop on Cognitive Aspects of the Lexicon, (2014).

Modelling the Semantics of Adjectives in the Ontology-Lexicon Interface. John P. McCrae, Christina Unger, Francesca Quattri and Philipp Cimiano, Proceedings of 4th Workshop on Cognitive Aspects of the Lexicon, (2014).

Publishing and Linking WordNet using lemon and RDF. John P. McCrae, Christiane Fellbaum and Philipp Cimiano, Proceedings of the 3rd Workshop on Linked Data in Linguistics, (2014).

A Multilingual Semantic Network as Linked Data: lemon-BabelNet. Maud Ehrmann, Francesco Cecconi, Daniela Vannella, John P. McCrae, Philipp Cimiano and Roberto Navigli, Proceedings of the 3rd Workshop on Linked Data in Linguistics, (2014).

Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. Maud Ehrmann, Francesco Cecconi, Daniela Vannella, John P. McCrae, Philipp Cimiano and Roberto Navigli, Proceedings of the 9th Language Resources and Evaluation Conference, pp 401-408, (2014).

3LD: Towards high quality, industry-ready Linguistic Linked Licensed Data. Daniel Vila-Suero, Victor Rodriguez-Doncel, Asunción Gómez-Pérez, Philipp Cimiano, John P. McCrae and Guadalupe Aguado-de-Cea, European Data Forum 2014, (2014).

Ontology-based interpretation of natural language. Philipp Cimiano, Christina Unger and John McCrae, Morgan & Claypool, (2014).

2013


Orthonormal explicit topic analysis for cross-lingual document matching. John McCrae, Philipp Cimiano and Roman Klinger, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1732-1742, (2013).

A lemon lexicon for DBpedia. Christina Unger, John McCrae, Sebastian Walter, Sara Winter and Philipp Cimiano, Proceedings of 1st International Workshop on NLP and DBpedia, (2013).

Multilingual variation in the context of linked data. Elena Montiel-Ponsoda, John McCrae, Guadalupe Aguado-de-Cea and Jorge Gracia, Proceedings of the 10th International Conference on Terminology and Artificial Intelligence, pp 19-26, (2013).

Mining translations from the web of open linked data. John P. McCrae and Philipp Cimiano, Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction, pp 9-13, (2013).

Releasing multimodal data as Linguistic Linked Open Data: An experience report. Peter Menke, John P. McCrae and Philipp Cimiano, Proceedings of the 2nd Workshop on Linked Data in Linguistics, pp 44-52, (2013).

Towards open data for linguistics: Lexical Linked Data. Christian Chiarcos, John McCrae, Philipp Cimiano and Christiane Fellbaum, In: New Trends of Research in Ontologies and Lexical Resources, pp 7-25, (2013).

On the role of senses in the Ontology-Lexicon. Philipp Cimiano, John McCrae, Paul Buitelaar and Elena Montiel-Ponsoda, In: New Trends of Research in Ontologies and Lexical Resources, pp 43-62, (2013).

2012


Using SPIN to formalize accounting regulation on the Semantic Web. Dennis Spohr, Philipp Cimiano, John McCrae and Sean O'Riain, First International Workshop on Finance and Economics on the Semantic Web in conjunction with 9th Extended Semantic Web Conference, pp 1-15, (2012).

Collaborative semantic editing of linked data lexica. John McCrae, Elena Montiel-Ponsoda and Philipp Cimiano, Proc. of the 2012 International Conference on Language Resources and Evaluation, pp 2619-2625, (2012).

Three steps for creating high quality ontology-lexica. John McCrae and Philipp Cimiano, Proc. of the Workshop on Collaborative Resource Development and Delivery at the 2012 International Conference on Language Resources and Evaluation, (2012).

Integrating WordNet and Wiktionary with lemon. John McCrae, Philipp Cimiano and Elena Montiel-Ponsoda, In: Linked Data and Linguistics, Christian Chiarcos, Sebastian Nordhoff and Sebastian Hellmann (eds), pp 25-34, (2012).

Interchanging lexical resources on the Semantic Web. John McCrae, Guadalupe Aguado-de-Cea, Paul Buitelaar, Philipp Cimiano, Thierry Declerck, Asunción Gómez-Pérez, Jorge Gracia, Laura Hollink, Elena Montiel-Ponsoda, Dennis Spohr and Tobias Wunner, Language Resources and Evaluation, 46(6), pp 701-709, (2012).

2011


LexInfo: A declarative model for the lexicon-ontology interface. Philipp Cimiano, Paul Buitelaar, John McCrae and Michael Sintek, Web Semantics: Science, Services and Agents on the World Wide Web, 9(1), pp 29-51, (2011).

Challenges for the Multilingual Web of Data. Jorge Gracia, Elena Montiel-Ponsoda, Philipp Cimiano, Asunción Gómez-Pérez, Paul Buitelaar and John McCrae, Web Semantics: Science, Services and Agents on the World Wide Web, pp 63-71, (2011).

Combining statistical and semantic approaches to the translation of ontologies and taxonomies. John McCrae, Mauricio Espinoza, Elena Montiel-Ponsoda, Guadalupe Aguado-de-Cea and Philipp Cimiano, Fifth Workshop on Syntax, Structure and Semantics in Statistical Translation in conjunction with 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, (2011).

Linking Lexical Resources and Ontologies on the Semantic Web with lemon. John McCrae, Dennis Spohr and Philipp Cimiano, Proc. of the 8th Extended Semantic Web Conference, pp 245-249, (2011).

Ontology Lexicalization: The lemon perspective. Paul Buitelaar, Philipp Cimiano, John McCrae, Elena Montiel-Ponsoda and Thierry Declerck, Proc. of 9th International Conference on Terminology and Artificial Intelligence, (2011).

Representing Term Variation in lemon. Elena Montiel-Ponsoda, Guadalupe Aguado-de-Cea and John McCrae, Proc. of 9th International Conference on Terminology and Artificial Intelligence, (2011).

2010


CLOVA: An architecture for cross-language semantic data querying. John McCrae, Jesus R. Campaña and Philipp Cimiano, Proceedings of the 1st Workshop on the Multilingual Semantic Web, pp 5-12, (2010).

Navigating the Information Storm: Web-based global health surveillance in BioCaster. Nigel Collier, Son Doan, Reiko Matsuda Goodwin, John McCrae, Mike Conway, Mika Shigematsu and Ai Kawazoe, In: Biosurveillance: Methods and Case Studies, Taha Kass-Hout and Xiaohui Zhang (eds), pp 291-312, (2010).

Ontology-based multilingual access to financial reports for sharing business knowledge across Europe. Thierry Declerck, Hans-Ulrich Krieger, Susan-Marie Thomas, Paul Buitelaar, Sean O'Riain, Tobias Wunner, Gilles Maguet, John McCrae, Dennis Spohr and Elena Montiel-Ponsoda, In: International Financial Control Assessment applying Multilingual Ontology Frameworks, pp 67-76, (2010).

An ontology-driven system for detecting global health events. Nigel Collier, Reiko Matsuda Goodwin, John McCrae, Son Doan, Ai Kawazoe, Mike Conway, Asanee Kawtrakul, K. Takeuchi and D. Dien, In Proc. of the 23rd International Conference on Computational Linguistics, pp 215-222, (2010).

2009


Automatic extraction of logically consistent ontologies from text corpora. John McCrae, PhD Thesis, The Graduate University for Advanced Studies (SOKENDAI), (2009).

SRL Editor: A rule development tool for text mining. John McCrae and Nigel Collier, Proc. of Workshop on Semantic Authoring, Annotation and Knowledge Markup in conjunction with the 5th International Conference on Knowledge Capture, (2009).

2008


Synonym set extraction from the biomedical literature by lexical pattern discovery. John McCrae and Nigel Collier, BMC Bioinformatics, 9(156), (2008).