John P. McCrae - Associate Professor at Data Science Institute, University of Galway

Towards a Gold Standard for Adjectival Hypernymy: Enriching the Open English WordNet with a Hybrid Approach. Lorenzo Augello, John P. McCrae and Marco Passarotti Proceedings of the Fifteenth biennial Language Resources and Evaluation Conference (LREC), (2026 Accepted)

Towards a Comprehensive English Wordnet-Wikidata Mapping. John P. McCrae, Johann Bergh and Krasimir Angelov Proceedings of the Fifteenth biennial Language Resources and Evaluation Conference (LREC), (2026 Accepted)

Cross-Corpus CEFR Classification through Artificial Learners Perplexities. Bernardo Stearns, John P. McCrae and Thomas Gaillat Proceedings of the Fifteenth biennial Language Resources and Evaluation Conference (LREC), (2026 Accepted)

Investigating transformer models for textual bias detection in model, data, and dataspace cards. Andy Donald, Apostolos Galanopoulos, Atul Kumar Ojha, Edward Curry, Emir Muñoz, Ihsan Ullah, John P. McCrae, Manan Kalra, Sagar Saxena and Talha Iqbal (2026) PDF Abstract

Identifying hidden biases in AI documentation metadata (model, data, and dataspace cards) is essential for responsible AI; yet this domain remains largely unexplored. The proposed work evaluates four Transformer models (XLNet, DistilBERT, RoBERTa, and ELECTRA) for bias detection across publicly available, synthetic, and custom datasets. On the BABE news corpus, all models achieved 77–80% accuracy, with only ELECTRA exceeding 80% on every metric. To address the absence of publicly available AI-card datasets, we generated synthetic metadata for two use cases (Customer Interaction and Customer Data Uploaded by Organisations) using ChatGPT. Models trained on this synthetic corpus displayed near-perfect scores, reflecting shared stylistic cues embedded in the generated text. To test real-world robustness, we curated a Hugging Face dataset by scraping documentation comments, filtering for bias-related keywords, and obtaining annotations from four independent labellers in a single-blind setting. Partial fine-tuning (zero-shot) evaluations of models trained only on BABE or synthetic data revealed substantial performance drops on this real-world set. To mitigate this cross-domain loss, we introduce a cascaded, full fine-tuning (few-shot) pipeline in which Transformer models are sequentially fine-tuned on BABE, synthetic text, and a subset of the Hugging Face corpus. Evaluation on the remaining portion achieved over 85% across all performance metrics, enhancing precision and generalisation. This study demonstrates the challenges of bias detection beyond controlled or synthetic data and highlights cascaded fine-tuning as a practical, low-resource strategy. Future directions include leveraging evidence fusion methods, integrating cross-attention with bias taxonomies, and adopting dual-encoder architectures to advance bias detection toward more in-depth, knowledge-guided reasoning.

Lossy Text Compression Using Genetic Algorithms with LLM-Guided Operators. Rajesh Sudam and John McCrae 33rd International Conference on Artificial Intelligence and Cognitive Science (AICS 2025), (2025)

DA-ATE: Data Augmentation for Automatic Term Extraction. Shubhanker Banerjee, Bharathi Raja Chakravarthi and John P. McCrae Proceedings of the 21st International Conference on Semantic Systems, (2025) PDF

When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection. Alamgir Munir Qazi, John P. McCrae and Jamal Nasir Proceedings of the 5th Conference on Language, Data and Knowledge, (2025) PDF

Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis. Shubhanker Banerjee, Bharathi Raja Chakravarthi and John P. McCrae Proceedings of the 5th Conference on Language, Data and Knowledge, (2025) PDF

Cuaċ: Fast and Small Universal Representations of Corpora. John P. McCrae, Bernardo Stearns, Alamgir Munir Qazi, Shubhanker Banerjee and Atul Kr. Ojha Proceedings of the 5th Conference on Language, Data and Knowledge, (2025) PDF

Empowering Recommender Systems using Automatically Generated Knowledge Graphs and Reinforcement Learning. Ghanshyam Verma, Simanta Sarkar, Devishree Pillai, Huan Chen, John P. McCrae, János A. Perge, Shovon Sengupta and Paul Buitelaar Proceedings of the 5th Conference on Language, Data and Knowledge, (2025) PDF

Development of Old Irish Lexical Resources, and Two Universal Dependencies Treebanks for Diplomatically Edited Old Irish Text. Adrian Doyle and John P. McCrae Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, (2025) PDF Abstract

The quantity and variety of Old Irish text which survives in contemporary manuscripts, those dating from the Old Irish period, is quite small by comparison to what is available for Modern Irish, not to mention better-resourced modern languages. As no native speakers have existed for more than a millennium, no more text will ever be created by native speakers. For these reasons, text surviving in contemporary sources is particularly valuable. Ideally, all such text would be annotated using a single, common standard to ensure compatibility. At present, discrete Old Irish text repositories make use of incompatible annotation styles, few of which are utilised by text resources for other languages. This limits the potential for using text from more than any one resource simultaneously in NLP applications, or as a basis for creating further resources. This paper describes the production of the first Old Irish text resources to be designed specifically to ensure lexical compatibility and interoperability.

Revisiting Dalgado: Tracing the Heritage of the Portuguese Language in South Asia. Anas Fahad Khan, Ana de Castro Salgado, Isuri Anuradha, Rute Costa, Francesca Frontini, David Lindemann, Chamila Liyanage, John P. McCrae, Atul Kr. Ojha and Priya Rani Alliance of Digital Humanities Organizations (ADHO) International Conference (DH2025), (2025 Accepted)

Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?. Sourabrata Mukherjee, Atul Kr. Ojha, John Philip McCrae and Ondrej Dusek Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pp 418-434, (2025) PDF Abstract

Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. As in other NLP tasks, human evaluation is ideal but costly; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of both existing metrics and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of Large Language Models (LLMs) as tools for TST evaluation. Our findings highlight that certain advanced NLP metrics and experimental hybrid techniques provide better insights than existing TST metrics for delivering more accurate, consistent, and reproducible TST evaluations.

MOOC on Linguistic Linked Data. Jorge Gracia, Slavko Žitnik, Max Ionov, Christian Chiarcos, Dagmar Gromann, Francesco Mambrini, Marco Passarotti, Armando Stellato, John P. McCrae, Gilles Sérasset, Elena Montiel-Ponsoda, Sara Carvalho, Penny Labropoulou and Rute Costa The Semantic Web. ESWC 2025. Lecture Notes in Computer Science, vol 15719, (2025) PDF

An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text. Adrian Doyle and John P. McCrae Proceedings of the 5th Celtic Language Technology Workshop, (2025) PDF

Renovating the Verb Hierarchy of English Wordnet. John P. McCrae Proceedings of the 13th Global Wordnet Conference, (2025) PDF Abstract

English Wordnet's hierarchy of senses is a key feature that enables the resource to be used for a wide range of analyses; however, it is only complete for nouns and not for other parts of speech. In this work, we propose an improvement of the hierarchy of verbs, such that all verbs are connected to one of eight top synsets. We evaluate this resource in terms of improved connectivity and in comparison to SimVerb-3500, and show that this hierarchy makes the resource more useful. We extensively discuss further improvements that would make English Wordnet more practical for a wide range of applications and bring it closer in line with other lexical resources for verbs.

SHACL4GW: SHACL Shapes for the Global Wordnet Association RDF Schema. Fahad Khan and John P. McCrae Proceedings of the 13th Global Wordnet Conference, (2025) PDF Abstract

In this article, we introduce SHACL4GW, a new resource which uses the Semantic Web SHACL standard for the validation of RDF files using the Global Wordnet Association RDF format. We begin by giving a motivation for the creation of such a resource, continue by describing the resource and end with a list of things still to do.

Remedying Gender Bias in Open English Wordnet. John P. McCrae, Haotian Zhu, Fei Xia, Al Waskow and Kexin Gao Proceedings of the 13th Global Wordnet Conference, (2025) PDF Abstract

Open English Wordnet aims to improve and maintain a wordnet for English, based on the Princeton WordNet. In this context, we identify a number of gender biases in the existing wordnet and consider the challenges of remediating the biases in the resource. In particular, we look at structural, contextual and definitional biases in the resource and examine how changes to the structure of the wordnet and to the textual definitions can create a wordnet that more fairly represents reality. We propose a number of changes that introduce 317 new synsets as well as changing the definitions or relations of over 400 further synsets. We show that these changes reduce certain kinds of gender bias within the resource.

English-to-Low-Resource Translation: A Multimodal Approach for Hindi, Malayalam, Bengali, and Hausa. Ali Hatami, Shubhanker Banerjee, Mihael Arcan, Bharathi Raja Chakravarthi, Paul Buitelaar and John Philip McCrae Ninth Conference on Machine Translation (WMT24), (2024) PDF Abstract

Multimodal machine translation leverages multiple data modalities to enhance translation quality, particularly for low-resourced languages. This paper uses a multimodal model that integrates visual information with textual data to improve translation accuracy from English to Hindi, Malayalam, Bengali, and Hausa. This approach employs a gated fusion mechanism to effectively combine the outputs of textual and visual encoders, enabling more nuanced translations that consider both language and contextual visual cues. The model's performance was evaluated against the text-only machine translation model based on BLEU, ChrF2 and TER. Experimental results demonstrate that the multimodal approach consistently outperforms the text-only baseline, highlighting the potential of integrating visual information in low-resourced language translation tasks.
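The gated fusion idea mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so the form below (a sigmoid gate computed from the concatenated text and image vectors, then used to interpolate between them) is an assumption based on one common variant. The function names, toy vectors, and weights are all hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_vec, image_vec, weights, bias):
    """Blend two same-sized vectors with a per-dimension gate.

    Assumed form (not taken from the paper):
        gate_i  = sigmoid(weights[i] . [text_vec; image_vec] + bias[i])
        fused_i = gate_i * text_i + (1 - gate_i) * image_i
    """
    concat = text_vec + image_vec  # list concatenation: [text; image]
    fused = []
    for i, (t, v) in enumerate(zip(text_vec, image_vec)):
        score = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(score)
        fused.append(gate * t + (1.0 - gate) * v)
    return fused


# Hypothetical toy encoder outputs and gate parameters (3 dimensions each).
text_vec = [0.2, -0.5, 0.9]
image_vec = [0.7, 0.1, -0.3]
weights = [
    [0.5, -0.2, 0.1, 0.3, 0.0, -0.4],
    [-0.1, 0.4, 0.2, 0.0, 0.6, 0.1],
    [0.3, 0.3, -0.5, 0.2, -0.1, 0.0],
]
bias = [0.0, 0.0, 0.0]

fused = gated_fusion(text_vec, image_vec, weights, bias)
```

Because the gate lies strictly between 0 and 1, each fused value falls between the corresponding text and image values, so the model can lean on visual cues for some dimensions and on text for others.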

Co-Creational Teaching of Natural Language Processing. John P. McCrae Proceedings of the Sixth Workshop on Teaching NLP, pp 33-42, (2024) PDF Abstract

Traditional lectures have poorer outcomes compared to active learning methodologies, yet many natural language processing classes in higher education still follow this outdated methodology. In this paper, we present co-creational teaching, a methodology that encourages partnership between staff and lecturers, and show how this can be applied to teach natural language processing. As a fast-moving and dynamic area of study with high interest from students, natural language processing is an ideal subject for innovative teaching methodologies to improve student outcomes. We detail our experience with teaching natural language processing through partnership with students and provide detailed descriptions of methodologies that can be used by others in their teaching, including considerations of diverse student populations.

Multilingual Text Style Transfer: Datasets & Models for Indian Languages. Sourabrata Mukherjee, Atul Kr. Ojha, Akanksha Bansal, Deepak Alok, John P. McCrae and Ondrej Dusek Proceedings of the 17th International Natural Language Generation Conference, pp 494-522, (2024) PDF Abstract

Text style transfer (TST) involves altering the linguistic style of a text while preserving its style-independent content. This paper focuses on sentiment transfer, a popular TST subtask, across a spectrum of Indian languages: Hindi, Magahi, Malayalam, Marathi, Punjabi, Odia, Telugu, and Urdu, expanding upon previous work on English-Bangla sentiment transfer. We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of these eight languages. We then evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches, including the Llama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the significance of parallel data in TST and demonstrate the effectiveness of the Masked Style Filling (MSF) approach in non-parallel techniques. Moreover, cross-lingual and joint multilingual learning methods show promise, offering insights into selecting optimal models tailored to the specific language and task requirements. To the best of our knowledge, this work represents the first comprehensive exploration of the TST task as sentiment transfer across a diverse set of languages.

Cultural HeritAge and Multilingual Understanding through lexiCal Archives,- CHAMUÇA: Portuguese borrowings in contemporary Asian languages. Fahad Khan, Ana Salgado, Isuri Anuradha, Rute Costa, Chamila Liyanage, John P. McCrae, Atul K. Ojha, Priya Rani and Francesca Frontini Proceedings of EURALEX 2024, (2024) PDF

BRECS: Enhanced Binary Representation of Word Embeddings via Cosine Similarity. Rajdeep Sarkar, Sourav Dutta and John P. McCrae 27th European Conference on Artificial Intelligence, 19–24 October 2024, Santiago de Compostela, Spain – Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS 2024), pp 3995-4002, (2024) PDF Abstract

Word representations like GloVe and Word2Vec encapsulate semantic and syntactic attributes and constitute the fundamental building block in diverse Natural Language Processing (NLP) applications. Such vector embeddings are typically stored in float32 format, and for a substantial vocabulary size, they impose considerable memory and computational demands due to the resource-intensive float32 operations. Thus, representing words via binary embeddings has emerged as a promising but challenging solution. In this paper, we introduce BRECS, an autoencoder-based Siamese framework for the generation of enhanced binary word embeddings (from the original embeddings). We propose the use of the novel Binary Cosine Similarity (BCS) regularisation in BRECS, which enables it to learn the semantics and structure of the vector space spanned by the original word embeddings, leading to better binary representation generation. We further show that our framework is tailored with independent parameters within the various components, thereby providing it with better learning capability. Extensive experiments across multiple datasets and tasks demonstrate the effectiveness of BRECS, compared to existing baselines for static and contextual binary word embedding generation. The source code is available at https://github.com/rajbsk/brecs.
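To illustrate why binary embeddings are cheaper than float32 ones, the sketch below binarises vectors by naive sign thresholding and computes cosine similarity from the Hamming distance of the resulting bit codes. This is not the BRECS autoencoder, nor its BCS regulariser, neither of which the abstract specifies; it is only a simple baseline under that stated assumption, with hypothetical function names and toy vectors.

```python
def sign_binarize(vec):
    """Pack the sign of each dimension into an int bit mask (bit i set if vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def sign_code_cosine(a_bits, b_bits, dim):
    """Cosine similarity between two {-1, +1}^dim vectors given as bit masks.

    For such vectors the dot product is dim - 2 * hamming and both norms are
    sqrt(dim), so the cosine reduces to 1 - 2 * hamming / dim.
    """
    hamming = bin(a_bits ^ b_bits).count("1")  # number of differing dimensions
    return 1.0 - 2.0 * hamming / dim


# Hypothetical 4-dimensional float embeddings.
a = [0.3, -1.2, 0.8, 0.1]
b = [0.5, -0.7, -0.2, 0.4]

a_bits = sign_binarize(a)
b_bits = sign_binarize(b)
similarity = sign_code_cosine(a_bits, b_bits, dim=4)  # codes differ in one dimension -> 0.5
```

Each dimension costs one bit instead of 32, and similarity needs only an XOR and a bit count. The trade-off is the information lost by thresholding, which learned approaches aim to reduce.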

Evaluating the Generalisation of an Artificial Learner. Bernardo Stearns, Nicolas Ballier, Thomas Gaillat, Andrew Simpkin and John P. McCrae Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning, (2024) PDF Abstract

This paper focused on the creation of LLM-based artificial learners. Motivated by the capability of language models to encode language representation, we evaluated such models for predicting masked tokens in learner corpora. We domain-adapted the BERT model, pre-trained on native English, by further pre-training two learner models on learner corpora: a natural learner model on the EFCAMDAT dataset and a synthetic learner model on the C4200m dataset. We evaluated the two artificial learner models alongside the baseline native model using an external English-for-specific-purposes corpus from French undergraduates. We evaluated metrics related to accuracy, consistency, and divergence. While the native model performed reasonably well, the natural learner pre-trained model showed improvements in recall-at-k. We analysed error patterns, showing that the native model made “overconfident” errors by assigning high probabilities to incorrect predictions, while the artificial learners distributed probabilities more evenly when wrong. Finally, we showed that the general token choices from the native model diverged from the natural learner model and this divergence was higher at lower proficiency levels.
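The recall-at-k metric mentioned in this abstract can be sketched for masked-token prediction as follows. The abstract does not state the paper's exact definition, so this assumes the usual one: the fraction of masked positions whose gold token appears among the model's top k ranked predictions. The function name and toy data are hypothetical.

```python
def recall_at_k(ranked_predictions, gold_tokens, k):
    """Fraction of masked positions whose gold token is in the top-k predictions.

    ranked_predictions: one list per masked position, best guess first.
    gold_tokens: the token the learner actually wrote at each position.
    """
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_tokens)
        if gold in preds[:k]
    )
    return hits / len(gold_tokens)


# Hypothetical model output for three masked positions.
ranked_predictions = [
    ["go", "went", "goes"],
    ["a", "the", "an"],
    ["on", "in", "at"],
]
gold_tokens = ["went", "the", "by"]

r1 = recall_at_k(ranked_predictions, gold_tokens, k=1)  # no gold token is ranked first
r2 = recall_at_k(ranked_predictions, gold_tokens, k=2)  # two of three are in the top 2
```

Larger values of k credit a model for ranking the learner's actual choice highly even when it is not the top prediction, which matters when comparing native and learner-adapted models.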

Large Language Models for Few-Shot Automatic Term Extraction. Shubhanker Banerjee, Bharathi Raja Chakravarthi and John P. McCrae Natural Language Processing and Information Systems: 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Turin, Italy, June 25–27, 2024, Proceedings, Part I, pp 137-150, (2024) PDF Abstract

Automatic term extraction is the process of identifying domain-specific terms in a text using automated algorithms and is a key first step in ontology learning and knowledge graph creation. Large language models have shown good few-shot capabilities; thus, in this paper, we present a study to evaluate the few-shot in-context learning performance of GPT-3.5-Turbo on automatic term extraction. To benchmark the performance, we compare the results with fine-tuning of a BERT-sized model. We also carry out experiments with count-based term extractors to assess their applicability to few-shot scenarios. We quantify prompt sensitivity with experiments to analyze the variation in performance of large language models across different prompt templates. Our results show that in-context learning with GPT-3.5-Turbo outperforms the BERT-based model and unsupervised count-based methods in few-shot scenarios.

Findings of the WILDRE Shared Task on Code-mixed Less-resourced Sentiment Analysis for Indo-Aryan Languages. Priya Rani, Gaurav Negi, Saroj Jha, Shardul Suryawanshi, Atul Kr. Ojha, Paul Buitelaar and John P. McCrae Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation, pp 17-23, (2024) PDF Abstract

This paper describes the structure and findings of the WILDRE 2024 shared task on Code-mixed Less-resourced Sentiment Analysis for Indo-Aryan Languages. The participants were asked to submit their final predictions for the test data on CodaLab. A total of fourteen teams registered for the shared task. Only four participants submitted a system for evaluation on CodaLab, and only two teams submitted a system description paper. All submitted systems show rather promising performance and outperform the baseline scores.

MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis. Priya Rani, Theodorus Fransen, John P. McCrae and Gaurav Negi Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp 10880-10890, (2024) PDF Abstract

The present paper introduces a new sentiment dataset, MaCmS, for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. Further, we also provide a linguistic analysis of the dataset to understand the structure of code-mixing and a statistical study to understand the language preferences of speakers with different polarities. Alongside these analyses, we also train baseline models to evaluate the dataset’s quality.

Developing a Part-of-speech Tagger for Diplomatically Edited Old Irish Text. Adrian Doyle and John P. McCrae Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pp 11-21, (2024) PDF Abstract

POS-tagging is typically considered a fundamental text preprocessing task, with a variety of downstream NLP tasks and techniques being dependent on the availability of POS-tagged corpora. As such, POS-taggers are important precursors to further NLP tasks, and their accuracy can impact the potential accuracy of these dependent tasks. While a variety of POS-tagging methods have been developed which work well with modern languages, historical languages present orthographic and editorial challenges which require special attention. The effectiveness of POS-taggers developed for modern languages is reduced when applied to Old Irish, with its comparatively complex orthography and morphology. This paper examines some of the obstacles to POS-tagging Old Irish text, and shows that inconsistencies between extant annotated corpora reduce the quantity of data available for use in training POS-taggers. The development of a multi-layer neural network model for POS-tagging Old Irish text is described, and an experiment is detailed which demonstrates that this model outperforms a variety of off-the-shelf POS-taggers. Moreover, this model sets a new benchmark for POS-tagging diplomatically edited Old Irish text.

Findings of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. Oksana Dereza, Adrian Doyle, Priya Rani, Atul Kr. Ojha, Pádraic Moran and John McCrae Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pp 160-172, (2024) PDF Abstract

This paper discusses the organisation and findings of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The shared task was split into the constrained and unconstrained tracks and involved solving either 3 or 5 problems for either 13 or 16 ancient and historical languages belonging to 4 language families, and making use of 6 different scripts. There were 14 registrations in total, of which 3 teams submitted to each track. Out of these 6 submissions, 2 systems were successful in the constrained setting and another 2 in the unconstrained setting, and 4 system description papers were submitted by different teams. The best average result for morphological feature prediction was about 96%, while the best average results for POS-tagging and lemmatisation were 96% and 94% respectively. At the word level, the winning team could not achieve a higher average accuracy across all 16 languages than 5.95%, which demonstrates the difficulty of this problem. At the character level, the best average result over 16 languages was 55.62%.

Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024. Christian Chiarcos, Katerina Gkirtzou, Maxim Ionov, Fahad Khan, John P. McCrae, Elena Montiel Ponsoda and Patricia Martín Chozas ELRA and ICCL, (2024) PDF

Cross-Lingual Ontology Matching using Structural and Semantic Similarity. Shubhanker Banerjee, Bharathi Raja Chakravarthi and John Philip McCrae Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, pp 11-21, (2024) PDF Abstract

The development of ontologies in various languages is attracting attention as the amount of multilingual data available on the web increases. Cross-lingual ontology matching facilitates interoperability amongst ontologies in different languages. Although supervised machine learning-based methods have shown good performance on ontology matching, their application to the cross-lingual setting is limited by the availability of training data. Current state-of-the-art unsupervised methods for cross-lingual ontology matching focus on lexical similarity between entities. These approaches follow a two-stage pipeline where the entities are translated into a common language using a translation service in the first step followed by computation of lexical similarity between the translations to match the entities in the second step. In this paper we introduce a novel ontology matching method based on the fusion of structural similarity and cross-lingual semantic similarity. We carry out experiments using 3 language pairs and report substantial improvements on the performance of the lexical methods thus showing the effectiveness of our proposed approach. To the best of our knowledge this is the first work which tackles the problem of unsupervised ontology matching in the cross-lingual setting by leveraging both structural and semantic embeddings.

CHAMUÇA: Towards a Linked Data Language Resource of Portuguese Borrowings in Asian Languages. Fahad Khan, Ana Salgado, Isuri Anuradha, Rute Costa, Chamila Liyanage, John P. McCrae, Atul Kumar Ojha, Priya Rani and Francesca Frontini Christian Chiarcos, Katerina Gkirtzou, Maxim Ionov, Fahad Khan, John P. McCrae, Elena Montiel Ponsoda and Patricia Martín Chozas (eds) Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, pp 44-48, (2024) PDF Abstract

This paper presents the development of CHAMUÇA, a novel lexical resource designed to document the influence of the Portuguese language on various Asian languages, with an initial focus on the languages of South Asia. Through the utilization of linked open data and the OntoLex vocabulary, CHAMUÇA offers structured insights into the linguistic characteristics and cultural ramifications of Portuguese borrowings across multiple languages. The article outlines CHAMUÇA’s potential contributions to the linguistic linked data community, emphasising its role in addressing the scarcity of resources for lesser-resourced languages and serving as a test case for organising etymological data in a queryable format. CHAMUÇA emerges as an initiative, based on linked data technology, towards the comprehensive cataloguing and analysis of Portuguese borrowings, offering valuable insights into language contact dynamics, historical evolution, and cultural exchange in Asia.

The Teanga Data Model for Linked Corpora. John P. McCrae, Priya Rani, Adrian Doyle and Bernardo Stearns Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, pp 66-74, (2024) PDF Abstract

Corpus data is the main source of data for natural language processing applications; however, no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serialization of the model, which is concise and uses a widely-deployed format, and we describe how this can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora.

Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024. Atul Kr. Ojha, Sina Ahmadi, Silvie Cinková, Theodorus Fransen, Chao-Hong Liu and John P. McCrae ELRA and ICCL, (2024) PDF

Text Detoxification as Style Transfer in English and Hindi. Sourabrata Mukherjee, Akanksha Bansal, Atul Kr. Ojha, John P. McCrae and Ondrej Dusek The 2024 International Conference on Natural Language Processing (ICON), (2024) PDF

Findings of the IWSLT 2023 Evaluation Campaign. Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Luisa Bentivogli, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Kevin Duh, Yannick Estève, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Kr. Ojha, John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe and Rodolfo Zevallos Elizabeth Salesky, Marcello Federico and Marine Carpuat (eds) Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pp 1-61, (2023) PDF Abstract

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

Intent Classification by the Use of Automatically Generated Knowledge Graphs. Mihael Arcan, Sampritha Manjunath, Cécile Robin, Ghanshyam Verma, Devishree Pillai, Simon Sarkar, Sourav Dutta, Haytham Assem, John P. McCrae and Paul Buitelaar Information, 14(5) (2023) PDF

Documenting the Open Multilingual Wordnet. Francis Bond, Michael Wayne Goodman, Ewa Rudnicka, Luis Morgado da Costa, Alexandre Rademaker and John P. McCrae Proceedings of the 2023 Global WordNet Conference, (2023) PDF Abstract

In this project note we describe our work to make better documentation for the Open Multilingual Wordnet (OMW), a platform integrating many open wordnets. This includes the documentation of the OMW website itself as well as of semantic relations used by the component wordnets. Some of this documentation work was done with the support of the Google Season of Docs. The OMW project page, which links both to the actual OMW server and the documentation has been moved to a new location: https://omwn.org.

Detecting abusive comments at a fine-grained level in a low-resource language. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Shubanker Banerjee, Manoj Balaji Jagadeeshan, Prasanna Kumar Kumaresan, Rahul Ponnusamy, Sean Benhur and John Philip McCrae Natural Language Processing Journal, 3 (2023) PDF Abstract

YouTube is a video-sharing and social media platform where users create profiles and share videos for their followers to view, like, and comment on. Abusive comments on videos or replies to other comments may be offensive and detrimental to the mental health of users on the platform. The language used in these comments is often informal and does not necessarily adhere to the formal syntactic and lexical structure of the language. Therefore, creating a rule-based system for filtering out abusive comments is challenging. This article introduces four datasets of abusive comments in Tamil and code-mixed Tamil–English extracted from YouTube. Comment-level annotation has been carried out for each dataset by assigning polarities to the comments. We hope these datasets can be used to train effective machine learning-based comment filters for these languages by mitigating the challenges associated with rule-based systems. In order to establish baselines on these datasets, we have carried out experiments with various machine learning classifiers and reported the results using F1-score, precision, and recall. Furthermore, we have employed a t-test to analyze the statistical significance of the results generated by the machine learning classifiers, and SHAP values to analyze and explain those results. The primary contribution of this paper is the construction of a publicly accessible dataset of social media messages annotated with fine-grained abusive speech labels in the low-resource Tamil language. Overall, we discovered that MURIL performed well on the binary abusive comment detection task, showing the applicability of multilingual transformers for this work. Nonetheless, fine-grained annotation for fine-grained abusive comment detection (FGACD) resulted in a significantly lower number of samples per class, and classical machine learning models outperformed deep learning models, which require extensive training datasets, on this challenge. To our knowledge, this was the first Tamil-language study on FGACD focused on diverse ethnicities. The methodology for detecting abusive messages described in this work may aid in the creation of comment filters for other under-resourced languages on social media.

Empowering recommender systems using automatically generated Knowledge Graphs and Reinforcement Learning. Ghanshyam Verma, Simon Simanta, Huan Chen, Devishree Pillai, John P. McCrae, János A. Perge, Shovon Sengupta and Paul Buitelaar OARS-KDD 2023, (2023) PDF

MG2P: An Empirical Study Of Multilingual Training for Manx G2P. Shubhanker Banerjee, Bharathi Raja Asoka Chakravarthi and John Philip McCrae LDK 2023 Conference, pp 246-255, (2023) PDF

The Cardamom Workbench for Historical and Under-Resourced Languages. Adrian Doyle, Theodorus Fransen, Bernardo Stearns, John Philip McCrae, Oksana Dereza and Priya Rani LDK 2023 Conference, pp 109-120, (2023) PDF

PICKD: In-Situ Prompt Tuning for Knowledge-Grounded Dialogue Generation. Rajdeep Sarkar, Koustava Goswami, Mihael Arcan and John McCrae PAKDD 2023: Advances in Knowledge Discovery and Data Mining, pp 124-136, (2023) PDF Abstract

Generating informative, coherent and fluent responses to user queries is challenging yet critical for a rich user experience and the eventual success of dialogue systems. Knowledge-grounded dialogue systems leverage external knowledge to induce relevant facts in a dialogue. These systems need to understand the semantic relatedness between the dialogue context and the available knowledge, thereby utilising this information for response generation. Although various innovative models have been proposed, they neither utilise the semantic entailment between the dialogue history and the knowledge nor effectively process knowledge from both structured and unstructured sources. In this work, we propose PICKD, a two-stage framework for knowledgeable dialogue. The first stage involves the Knowledge Selector choosing knowledge pertinent to the dialogue context from both structured and unstructured knowledge sources. PICKD leverages novel In-Situ prompt tuning for knowledge selection, wherein prompt tokens are injected into the dialogue-knowledge text tokens during knowledge retrieval. The second stage employs the Response Generator for generating fluent and factual responses by utilising the retrieved knowledge and the dialogue context. Extensive experiments on three domain-specific datasets exhibit the effectiveness of PICKD over other baseline methodologies for knowledge-grounded dialogue. The source is available at https://github.com/rajbsk/pickd.

Temporal Domain Adaptation for Historical Irish. Oksana Dereza, Theodorus Fransen and John P. McCrae Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), (2023) PDF Abstract

The digitisation of historical texts has provided new horizons for NLP research, but such data also presents a set of challenges, including scarcity and inconsistency. The lack of an editorial standard during digitisation exacerbates these difficulties. This study explores the potential for temporal domain adaptation in Early Modern Irish and pre-reform Modern Irish data. We describe two experiments carried out on the book subcorpus of the Historical Irish Corpus, which includes Early Modern Irish and pre-reform Modern Irish texts from 1581 to 1926. We also propose a simple orthographic normalisation method for historical Irish that reduces the type-token ratio by 21.43% on average in our data. The results demonstrate that the use of out-of-domain data significantly improves a language model’s performance. Providing a model with additional input from another historical stage of the language improves its quality by 12.49% on average on non-normalised texts and by 27.02% on average on normalised (demutated) texts. Most notably, using only out-of-domain data for both pre-training and training stages allowed for up to 86.81% of the baseline model quality on non-normalised texts and up to 95.68% on normalised texts without any target domain data. Additionally, we investigate the effect of temporal distance between the training and test data. The hypothesis that there is a positive correlation between performance and temporal proximity of training and test data has been validated, which manifests best in normalised data. Expanding this approach even further back, to Middle and Old Irish, and testing it on other languages is a further research direction.

Do not Trust the Experts: How the Lack of Standard Complicates NLP for Historical Irish. Oksana Dereza, Theodorus Fransen and John P. McCrae The Fourth Workshop on Insights from Negative Results in NLP, (2023) PDF Abstract

In this paper, we describe how we unearthed some fundamental problems while building an analogy dataset modelled on BATS (Gladkova et al., 2016) to evaluate historical Irish embeddings on their ability to detect orthographic, morphological and semantic similarity. The performance of our models in the analogy task was extremely poor regardless of the architecture, hyperparameters and evaluation metrics, while the qualitative evaluation revealed positive tendencies. We argue that low agreement between field experts on fundamental lexical and orthographic issues, and the lack of a unified editorial standard in available resources make it impossible to build reliable evaluation datasets for computational models and obtain interpretable results. We emphasise the need for such a standard, particularly for NLP applications, and prompt Celticists and historical linguists to engage in further discussion. We would also like to draw NLP scholars’ attention to the role of data and its (extra)linguistic properties in testing new models, technologies and evaluation scenarios.

Findings of the SIGTYP 2023 Shared task on Cognate and Derivative Detection For Low-Resourced Languages. Priya Rani, Koustava Goswami, Adrian Doyle, Theodorus Fransen, Bernardo Stearns and John P. McCrae Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, (2023) PDF Abstract

This paper describes the structure and findings of the SIGTYP 2023 shared task on cognate and derivative detection for low-resourced languages, broken down into a supervised and an unsupervised sub-task. Participants were asked to submit their final predictions on the test data. A total of nine teams registered for the shared task, of which seven registered for both sub-tasks. Only two participants ended up submitting system descriptions, with only one submitting systems for both sub-tasks. While all systems show rather promising performance, all remain within the baseline score for the supervised sub-task. However, the system submitted for the unsupervised sub-task outperforms the baseline score.

Some Considerations in the Construction of a Historical Language WordNet. Fahad Khan, John P. McCrae, Francisco Javier Minaya Gómez, Rafael Cruz González and Javier E. Díaz-Vera Proceedings of the Global WordNet Conference 2023, (2023) PDF

Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion. Bharathi Raja Chakravarthi, B Bharathi, John P McCrae, Manel Zarrouk, Kalika Bali and Paul Buitelaar Association for Computational Linguistics, (2022) PDF

Overview of The Shared Task on Homophobia and Transphobia Detection in Social Media Comments. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Thenmozhi Durairaj, John McCrae, Paul Buitelaar, Prasanna Kumaresan and Rahul Ponnusamy Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, (2022) PDF Abstract

Homophobia and Transphobia Detection is the task of identifying homophobia, transphobia, and non-anti-LGBT+ content from the given corpus. Homophobia and transphobia are both forms of toxic language directed at LGBTQ+ individuals and are described as hate speech. This paper summarizes our findings on the “Homophobia and Transphobia Detection in social media comments” shared task held at LT-EDI 2022 - ACL 2022. This shared task focused on three sub-tasks for Tamil, English, and Tamil-English (code-mixed) languages. It received 10 systems for Tamil, 13 systems for English, and 11 systems for Tamil-English. The best systems for Tamil, English, and Tamil-English scored 0.570, 0.870, and 0.610, respectively, on average macro F1-score.

Overview of the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion. Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, Subalalitha Cn, John McCrae, Miguel Ángel García, Salud María Jiménez-Zafra, Rafael Valencia-García, Prasanna Kumaresan, Rahul Ponnusamy, Daniel García-Baena and José García-Díaz Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, (2022) PDF Abstract

Hope Speech detection is the task of classifying a sentence as hope speech or non-hope speech given a corpus of sentences. Hope speech is any message or content that is positive, encouraging, reassuring, inclusive and supportive that inspires and engenders optimism in the minds of people. In contrast to identifying and censoring negative speech patterns, hope speech detection is focussed on recognising and promoting positive speech patterns online. In this paper, we report an overview of the findings and results from the shared task on hope speech detection for Tamil, Malayalam, Kannada, English and Spanish languages conducted in the second workshop on Language Technology for Equality, Diversity and Inclusion (LT-EDI-2022) organised as a part of ACL 2022. The participants were provided with annotated training & development datasets and unlabelled test datasets in all five languages. The goal of the shared task is to classify the given sentences into one of the two hope speech classes. The performances of the systems submitted by the participants were evaluated in terms of micro-F1 score and weighted-F1 score. The datasets for this challenge are openly available.

Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference. Thierry Declerck, John P. McCrae, Elena Montiel, Christian Chiarcos and Maxim Ionov Association for Computational Linguistics, (2022) PDF

EDIE - Elexis DIctionary Evaluation Tool. Seung-Bin Yim, Lenka Bajcetic, Thierry Declerck and John McCrae Proceedings of the Workshop on PROfiling LINGuistic KNOWledgE gRaphs (ProLingKNOWER), (2022) PDF

Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference. Atul Kr. Ojha, Sina Ahmadi, Chao-Hong Liu and John P. McCrae European Language Resources Association, (2022) PDF Abstract

Eurasia is the largest continental area comprising all of Europe and Asia. It is also home to more than 2,500 languages from seven language families. Despite the rich linguistic diversity in this area, the respective language communities are under-represented while their languages are low-resource, endangered and/or have been systematically politically oppressed in history. Others, such as Kurdish, Gilaki, Santali, Kashmiri, Laz, and Abkhaz, are not only endangered but also understudied. One interesting characteristic of these languages is the influence of communal languages on their lexicon through borrowed words and a partially shared vocabulary of phylogenetically related words (cognates). Furthermore, contact-induced similarities can be observed to some extent even in the syntax of the languages, despite typological differences across different language families. In addition, relying on a lingua franca, many of these linguistic communities are facing standardization issues, particularly in the written form of their respective languages. This commonly results in the use of other scripts by speakers of these under-resourced languages. In line with the necessity of language technology for under-resourced and understudied languages, this workshop aims to spur the development of resources and tools for indigenous, endangered and lesser-resourced languages in Eurasia. The goal is to increase visibility and promote research for these languages in a global arena. Through collaboration between NLP researchers, language experts and linguists working for endangered languages in these communities, we aim to create language technology that will help to preserve these languages and give them a chance to receive more attention in the language processing realm. Seeing that this is the first edition of the EURALI workshop, we are very happy to have received many submissions on various aspects regarding Eurasian languages.
In the EURALI 2022 Proceedings, 18 research papers are included, dealing with no fewer than 18 Eurasian languages. We would like to thank all the colleagues who submitted their work to the workshop, the LREC 2022 organisers, as well as reviewers for making the first EURALI workshop a success.

A Dataset for Term Extraction in Hindi. Shubhanker Banerjee, Bharathi Raja Chakravarthi and John Philip McCrae Terminology in the 21st century: many faces, many places @ LREC 2022, pp 19-25, (2022) PDF Abstract

Automatic Term Extraction (ATE) is one of the core problems in natural language processing and forms a key component of text mining pipelines of domain specific corpora. Complex low-level tasks such as machine translation and summarization for domain specific texts necessitate the use of term extraction systems. However, the development of these systems requires the use of large annotated datasets and thus there has been little progress made on this front for under-resourced languages. As a part of ongoing research, we present a dataset for term extraction from Hindi texts in this paper. To the best of our knowledge, this is the first dataset that provides term annotated documents for Hindi. Furthermore, we have evaluated this dataset on statistical term extraction methods and the results obtained indicate the problems associated with development of term extractors for under-resourced languages.

Overview of the Shared Task on Machine Translation in Dravidian Languages. Anand Kumar Madasamy, Asha Hegde, Shubhanker Banerjee, Bharathi Raja Chakravarthi, Ruba Priyadharshini, Hosahalli Shashirekha and John McCrae Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages, pp 271-278, (2022) PDF Abstract

This paper presents an outline of the shared task on translation of under-resourced Dravidian languages at DravidianLangTech-2022 workshop to be held jointly with ACL 2022. A description of the datasets used, approach taken for analysis of submissions and the results have been illustrated in this paper. Five sub-tasks organized as a part of the shared task include the following translation pairs: Kannada to Tamil, Kannada to Telugu, Kannada to Sanskrit, Kannada to Malayalam and Kannada to Tulu. Training, development and test datasets were provided to all participants and results were evaluated on the gold standard datasets. A total of 16 research groups participated in the shared task and a total of 12 submission runs were made for evaluation. Bilingual Evaluation Understudy (BLEU) score was used for evaluation of the translations.

Building a Knowledge Base Taxonomy from Structured or Unstructured Computer Text for Use in Automated User Interaction. Pranab Mohanty, Bianca De Oliveira Pereira, Cécile Robin, Tobias Daudert, John McCrae and Paul Buitelaar Patent: US Patent 11,328,707 (2022) PDF

Bengali and Magahi PUD Treebank and Parser. Pritha Majumdar, Deepak Alok, Akanksha Bansal, Atul Kr. Ojha and John P. McCrae 6th Workshop on Indian Language Data Resource and Evaluation @ LREC 2022, pp 60-67, (2022) PDF Abstract

This paper presents the development of the Parallel Universal Dependency (PUD) Treebank for two Indo-Aryan languages: Bengali and Magahi. A treebank of 1,000 sentences has been created using a parallel corpus of English and the UD framework. A preliminary set of sentences was annotated manually - 600 for Bengali and 200 for Magahi. The rest of the sentences were built using the Bengali and Magahi parser. The sentences have been translated and annotated manually by the authors, some of whom are also native speakers of the languages. The objective behind this work is to build a syntactically-annotated linguistic repository for the aforementioned languages that can prove to be a useful resource for building further NLP tools. Additionally, Bengali and Magahi parsers were created using a machine learning approach. The accuracy of the Bengali parser is 78.13% in the case of UPOS; 76.99% in the case of XPOS; 56.12% in the case of UAS; and 47.19% in the case of LAS. The accuracy of the Magahi parser is 71.53% in the case of UPOS; 66.44% in the case of XPOS; 58.05% in the case of UAS; and 33.07% in the case of LAS. This paper also includes an illustration of the annotation schema followed, the findings of the Parallel Universal Dependency (PUD) treebank, and its resulting linguistic analysis.

Towards Classification of Legal Pharmaceutical Text using GAN-BERT. Tapan Auti, Rajdeep Sarkar, Bernardo Stearns, Atul Kr. Ojha, Arindam Paul, Michaela Comerford, Jay Megaro, John Mariano, Vall Herard and John P. McCrae 1st Computing Social Responsibility Workshop-NLP Approaches to Corporate Social Responsibilities @ LREC 2022, pp 52-57, (2022) PDF Abstract

Pharmaceutical text classification is an important area of research for commercial and research institutions working in the pharmaceutical domain. Addressing this task is challenging due to the need for expert-verified labelled data, which can be expensive and time-consuming to obtain. Towards this end, we leverage predictive coding methods for the task as they have been shown to generalise well for sentence classification. Specifically, we utilise the GAN-BERT architecture to classify pharmaceutical texts. To capture the domain specificity, we propose to utilise the BioBERT model as our BERT model in the GAN-BERT framework. We conduct extensive evaluation to show the efficacy of our approach over baselines on multiple metrics.

Towards the Construction of a WordNet for Old English. Fahad Khan, Francisco J. Minaya Gómez, Rafael Cruz González, Harry Diakoff, Javier E. Diaz Vera, John P. McCrae, Ciara O'Loughlin, William Michael Short and Sander Stolk 13th International Conference on Language Resources and Evaluation, pp 3934–3941, (2022) PDF Abstract

In this paper we will discuss our preliminary work towards the construction of a WordNet for Old English, taking our inspiration from other similar WN construction projects for ancient languages such as Ancient Greek, Latin and Sanskrit (on this overall endeavour, see now Biagetti et al., 2021). The Old English WordNet (OldEWN) will build upon this innovative work in a number of different ways which we articulate in the article, most importantly by treating figurative meaning as a ‘first-class citizen’ in the structuring of the semantic system. From a more practical perspective we will describe our plan to utilize a pre-existing lexicographic resource and the naisc system to automatically compile a provisional version of the WordNet which will then be checked and enriched by experts in Old English.

Linghub2: Language Resource Discovery Tool for Language Technologies. Cécile Robin, Gautham Vadakkekara Suresh, Víctor Rodriguez-Doncel, John P. McCrae and Paul Buitelaar 13th International Conference on Language Resources and Evaluation, pp 6352-6360, (2022) PDF Abstract

Language resources are an essential component of natural language processing, as well as related research and applications. Users of language resources have different needs in terms of format, language, topics, etc. for the data they need to use. Linghub (McCrae and Cimiano, 2015) was first developed for this purpose, using the capabilities of linked data to represent metadata, and tackling the heterogeneous metadata issue. Linghub is aimed at helping language resources and technology users to easily find and retrieve relevant data, and identify important information on access, topics, etc. This work describes a rejuvenation and modernisation of the 2015 platform, using a popular open-source data management system, DSpace, as its foundation. The new platform, Linghub2, contains updated and extended resources, more languages, and continues the work towards the homogenisation of metadata through conversions and through linkage to standardisation strategies and community groups, such as the Open Digital Rights Language (ODRL) community group.

MHE: Code-Mixed Corpora for Similar Language Identification. Priya Rani, John P. McCrae and Theodorus Fransen 13th International Conference on Language Resources and Evaluation, pp 3425-3433, (2022) PDF Abstract

This paper introduces a new Magahi-Hindi-English (MHE) code-mixed dataset for similar language identification, where Magahi is a less-resourced minority language. This corpus provides language identification at two levels: word and sentence. This dataset is the first Magahi-Hindi-English code-mixed dataset for the similar language identification task. Furthermore, we discuss the complexity of the dataset and provide a few baselines for the language identification task.

KG-CRuSE: Recurrent Walks over Knowledge Graph for Explainable Conversation Reasoning using Semantic Embeddings. Rajdeep Sarkar, Mihael Arcan and John Philip McCrae 4th Workshop on NLP for Conversational AI, Co-located with ACL 2022, pp 98-107, (2022) PDF Abstract

Knowledge-grounded dialogue systems utilise external knowledge such as knowledge graphs to generate informative and appropriate responses. A crucial challenge of such systems is to select facts from a knowledge graph pertinent to the dialogue context for response generation. This fact selection can be formulated as path traversal over a knowledge graph conditioned on the dialogue context. Such paths can originate from facts mentioned in the dialogue history and terminate at the facts to be mentioned in the response. These walks, in turn, provide an explanation of the flow of the conversation. This work proposes KG-CRuSE, a simple yet effective LSTM-based decoder that utilises the semantic information in the dialogue history and the knowledge graph elements to generate such paths for effective conversation explanation. Extensive evaluations showed that our model outperforms the state-of-the-art models on the OpenDialKG dataset on multiple metrics.

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly and John Philip McCrae Language Resources and Evaluation, pp 756-806, (2022) PDF Abstract

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.

Toward an Integrative Approach for Making Sense Distinction. John P. McCrae, Theodorus Fransen, Sina Ahmadi, Paul Buitelaar and Koustava Goswami Frontiers in Artificial Intelligence, 5 (2022) PDF Abstract

Word senses are the fundamental unit of description in lexicography, yet it is rarely the case that different dictionaries reach any agreement on the number and definition of senses in a language. With the recent rise in natural language processing and other computational approaches there is an increasing demand for quantitatively validated sense catalogues of words, yet no consensus methodology exists. In this paper, we look at four main approaches to making sense distinctions: formal, cognitive, distributional, and intercultural and examine the strengths and weaknesses of each approach. We then consider how these may be combined into a single sound methodology. We illustrate this by examining two English words, “wing” and “fish,” using existing resources for each of these four approaches and illustrate the weaknesses of each. We then look at the impact of such an integrated method and provide some future perspectives on the research that is necessary to reach a principled method for making sense distinctions.

When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data. Fahad Khan, Christian Chiarcos, Thierry Declerck, Daniela Gifu, Elena González-Blanco García, Jorge Gracia, Max Ionov, Penny Labropoulou, Francesco Mambrini, John McCrae, Émilie Pagé-Perron, Marco Passarotti, Salvador Ros and Ciprian-Octavian Truica Semantic Web Journal, 13(6) (2022) PDF

Convertir le Trésor de la Langue Française en Ontolex-Lemon : un zeste de données liées. Sina Ahmadi, Mathieu Constant, Karën Fort, Bruno Guillaume and John P. McCrae LIFT 2021 : Journées scientifiques “Linguistique informatique, formelle & de terrain”, (2021) PDF Abstract

In this paper, we report our efforts to convert one of the most comprehensive lexicographic resources of French, the Trésor de la Langue Française, into the Ontolex-Lemon model. Despite the widespread usage of this resource, the original XML format seems to impede its integration in language technology tools. In order to breathe new life into this resource, we examine the usage and the conversion to more interoperable formats, primarily those based on the linguistic linked data, to provide this resource to a broader range of applications and users.

Meta-Learning for Offensive Language Detection in Code-Mixed Texts. Gautham Vadakkekara Suresh, Bharathi Raja Chakravarthi and John P. McCrae 13th meeting of Forum for Information Retrieval Evaluation 2021, pp 58-66, (2021) PDF Abstract

This research investigates the application of Model-Agnostic Meta-Learning (MAML) and ProtoMAML to identify offensive code-mixed text content on social media in Tamil-English and Malayalam-English code-mixed texts. We follow a two-step strategy: The XLM-RoBERTa (XLM-R) model is trained using the meta-learning algorithms on a variety of tasks having code-mixed data, monolingual data in the same language as the target language and related tasks in other languages. The model is then fine-tuned on target tasks to identify offensive language in Malayalam-English and Tamil-English code-mixed texts. Our results show that meta-learning improves the performance of models significantly in low-resource (few-shot learning) tasks. We also introduce a weighted data sampling approach which helps the model converge better in the meta-training phase compared to traditional methods.

Findings of Shared Task on Offensive Language Identification in Tamil and Malayalam. Prasanna Kumar Kumaresan, Premjith, Ratnasingam Sakuntharaj, Sajeetha Thavareesan, Subalalitha Navaneethakrishnan, Anand Kumar Madasamy, Bharathi Raja Chakravarthi and John P. McCrae Forum for Information Retrieval Evaluation, pp 16–18, (2021) PDF Abstract

In this paper, we present the results of the HASOC-Dravidian-CodeMix shared task held at FIRE 2021, a track on offensive language identification for Dravidian languages in code-mixed text. This paper will detail the task, its organisation, and the submitted systems. The identification of offensive language was viewed as a classification task. For this, 16 teams participated in identifying offensive language from Tamil-English code-mixed data, 11 teams for Malayalam-English code-mixed data and 14 teams for Tamil data. The teams detected offensive language using various machine learning and deep learning classification models. This paper has analysed those benchmark systems to find out how well they accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam.

Few-shot and Zero-shot Approaches to Legal Text Classification: A Case Study in the Financial Sector. Rajdeep Sarkar, Atul Kr. Ojha, Jay Megaro, John Mariano, Vall Herard and John P. McCrae NLLP 2021 @ EMNLP 2021, (2021) PDF Abstract

The application of predictive coding techniques to legal texts has the potential to greatly reduce the cost of legal review of documents; however, there is such a wide array of legal tasks and continuously evolving legislation that it is hard to construct sufficient training data to cover all cases. In this paper, we investigate few-shot and zero-shot approaches that require substantially less training data and introduce a triplet architecture, which for promissory statements produces performance close to that of a supervised system. This method allows predictive coding methods to be rapidly developed for new regulations and markets.

How Computers Can Future-Proof Minority Languages. John P. McCrae and Theodorus Fransen Cois Coiribe, NUI Galway Views and Opinions, (2021) PDF Abstract

Dr. Theodorus Fransen & Dr. John McCrae explore how digital language tools can potentially resolve the underrepresentation of minority languages in terms of digital technology and the Web.

Cross-lingual Sentence Embedding using Multi-Task Learning. Koustava Goswami, Sourav Dutta, Haytham Assem, Theodorus Fransen and John P. McCrae EMNLP 2021, (2021) PDF Abstract

The scarcity of labeled training data across many languages is a significant roadblock for multilingual neural language processing. We approach the lack of in-language training data using sentence embeddings that map text written in different languages, but with similar meanings, to nearby embedding space representations. The representations are produced using a dual-encoder based model trained to maximize the representational similarity between sentence pairs drawn from parallel data. The representations are enhanced using multitask training and unsupervised monolingual corpora. The effectiveness of our multilingual sentence embeddings is assessed on a comprehensive collection of monolingual, cross-lingual, and zero-shot/few-shot learning tasks.

NUIG at TIAD 2021: Cross-lingual Word Embeddings for Translation Inference. Sina Ahmadi, Atul Kr. Ojha, Shubhanker Banerjee and John P. McCrae Proceedings of the Workshops and Tutorials held at LDK 2021 co-located with the 3rd Language, Data and Knowledge Conference (LDK 2021), (2021) PDF Abstract

Inducing new translation pairs across dictionaries is an important task that facilitates processing and maintaining lexicographical data. This paper describes our submissions to the Translation Inference Across Dictionaries (TIAD) shared task of 2021. Our systems mainly rely on the MUSE and VecMap cross-lingual word embedding mappings to create new translation pairs between English, French and Portuguese data. We also create two regression models based on graph analysis features. Our systems perform above the baseline systems.

A Survey of Orthographic Information in Machine Translation. Bharathi Raja Chakravarthi, Priya Rani, Mihael Arcan and John P. McCrae SN Computer Science, 2(330) (2021) PDF Abstract

Machine translation is one of the applications of natural language processing which has been explored in different languages. Recently, researchers have started paying attention to machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the linguistic difference and variation in orthographic conventions, which causes many issues for traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthography’s influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from this. Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey. This article ends with a discussion of the way forward in machine translation with orthographic information, focusing on multilingual settings and bilingual lexicon induction.

Encoder-Attention based Automatic Term Recognition (EA-ATR). Sampritha Manjunath and John P. McCrae Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo and Barbara Heinisch (eds) 3rd Conference on Language, Data and Knowledge (LDK 2021), pp 23:1--23:13, (2021) PDF Abstract

Automated Term Recognition (ATR) is the task of finding terminology from raw text. It involves designing and developing techniques for the mining of possible terms from the text and filtering these identified terms based on their scores calculated using scoring methodologies like frequency of occurrence and then ranking the terms. Current approaches often rely on statistics and regular expressions over part-of-speech tags to identify terms, but this is error-prone. We propose a deep learning technique to improve the process of identifying a possible sequence of terms. We improve the term recognition by using Bidirectional Encoder Representations from Transformers (BERT) based embeddings to identify which sequence of words is a term. This model is trained on Wikipedia titles. We assume all Wikipedia titles to be the positive set, and random n-grams generated from the raw text as a weak negative set. The positive and negative set will be trained using the Embed, Encode, Attend and Predict (EEAP) formulation using BERT as embeddings. The model will then be evaluated against different domain-specific corpora like GENIA - annotated biological terms and Krapivin - scientific papers from the computer science domain.

Automatic construction of knowledge graphs from text and structured data: A preliminary literature review. Maraim Masoud, Bianca Pereira, John McCrae and Paul Buitelaar Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo and Barbara Heinisch (eds) 3rd Conference on Language, Data and Knowledge (LDK 2021), pp 19:1--19:9, (2021) PDF Abstract

Knowledge graphs have been shown to be an important data structure for many applications, including chatbot development, data integration, and semantic search. In the enterprise domain, such graphs need to be constructed based on both structured (e.g. databases) and unstructured (e.g. textual) internal data sources, preferentially using automatic approaches due to the costs associated with manual construction of knowledge graphs. However, despite the growing body of research that leverages both structured and textual data sources in the context of automatic knowledge graph construction, the research community has centered on either one type of source or the other. In this paper, we conduct a preliminary literature review to investigate approaches that can be used for the integration of textual and structured data sources in the process of automatic knowledge graph construction. We highlight the solutions currently available for use within enterprises and point out areas that would benefit from further research.

The ELEXIS system for monolingual sense linking in dictionaries. John P. McCrae, Sina Ahmadi, Seung-bin Yim and Lenka Bajčetić Proceedings of The Seventh Biennial Conference on Electronic Lexicography, eLex 2021, pp 542-559, (2021) PDF

Enriching a terminology for under-resourced languages using knowledge graphs. John P. McCrae, Atul Kumar Ojha, Bharathi Raja Chakravarthi, Ian Kelly, Patricia Buffini, Grace Tang, Eric Paquin and Manuel Locria Proceedings of The Seventh Biennial Conference on Electronic Lexicography, eLex 2021, pp 560-571, (2021) PDF

Heteronym Sense Linking. Lenka Bajčetić, Thierry Declerck and John P. McCrae Proceedings of The Seventh Biennial Conference on Electronic Lexicography, eLex 2021, pp 505-513, (2021) PDF

Conversation Concepts: Understanding Topics and Building Taxonomies for Financial Services. John McCrae, Pranab Mohanty, Siddharth Narayanan, Bianca Pereira, Paul Buitelaar, Saurav Karmakar and Rajdeep Sarkar Information, 12(4) (2021) PDF Abstract

Knowledge graphs are proving to be an increasingly important part of modern enterprises, and new applications of such enterprise knowledge graphs are still being found. In this paper, we report on the experience with the use of an automatic knowledge graph system called Saffron in the context of a large financial enterprise and show how this has found applications within this enterprise as part of the “Conversation Concepts Artificial Intelligence” tool. In particular, we analyse the use cases for knowledge graphs within this enterprise, and this led us to a new extension to the knowledge graph system. We present the results of these adaptations, including the introduction of a semi-supervised taxonomy extraction system, which includes analysts in-the-loop. Further, we extend the kinds of relations extracted by the system and show how the use of the BERT and ELMo models can produce high-quality results. Thus, we show how this tool can help realize a smart enterprise and how requirements in the financial industry can be realised by state-of-the-art natural language processing technologies.

ULD-NUIG at Social Media Mining for Health Applications (#SMM4H) Shared Task 2021. Atul Kr. Ojha, Priya Rani, Koustava Goswami, Bharathi Raja Chakravarthi and John P. McCrae Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task, pp 149–152, (2021) PDF

Monolingual Word Sense Alignment as a Classification Problem. Sina Ahmadi and John P. McCrae Proceedings of the 11th Global Wordnet Conference, pp 73-80, (2021) PDF Abstract

Words are defined based on their meanings in various ways in different resources. Aligning word senses across monolingual lexicographic resources increases domain coverage and enables integration and incorporation of data. In this paper, we explore the application of classification methods using manually-extracted features along with representation learning techniques in the task of word sense alignment and semantic relationship detection. We demonstrate that the performance of classification methods dramatically varies based on the type of semantic relationships due to the nature of the task but outperforms the previous experiments.

The GlobalWordNet Formats: Updates for 2020. John P. McCrae, Michael Wayne Goodman, Francis Bond, Alexandre Rademaker, Ewa Rudnicka and Luis Morgado Da Costa Proceedings of the 11th Global Wordnet Conference, pp 91-99, (2021) PDF Abstract

The Global Wordnet Formats have been introduced to enable wordnets to have a common representation that can be integrated through the Global WordNet Grid. As a result of their adoption, a number of shortcomings of the format were identified, and in this paper we describe the extensions to the formats that address these issues. These include: ordering of senses, dependencies between wordnets, pronunciation, syntactic modelling, relations, sense keys, metadata and RDF support. Furthermore, we provide some perspectives on how these changes help in the integration of wordnets.

Towards a Linking between WordNet and Wikidata. John P McCrae and David Cillessen Proceedings of the 11th Global Wordnet Conference, pp 252-257, (2021) PDF Abstract

WordNet is the most widely used lexical resource for English, while Wikidata is one of the largest knowledge graphs of entities and concepts available. While there is a clear difference in the focus of these two resources, there is also a significant overlap, and as such a complete linking of these resources would have many uses. We propose the development of such a linking, first by means of the hapax legomenon links and secondly by the use of natural language processing techniques. We show that these can be done with high accuracy but that human validation is still necessary. This has resulted in over 9,000 links being added between these two resources.

Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion. Bharathi Raja Chakravarthi, John P. McCrae, Manel Zarrouk, Kalika Bali and Paul Buitelaar Association for Computational Linguistics, (2021) PDF

Findings of the Shared Task on Machine Translation in Dravidian languages. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Shubhanker Banerjee, Richard Saldanha, John P. McCrae, Anand Kumar M, Parameswari Krishnamurthy and Melvin Johnson Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pp 119-125, (2021) PDF Abstract

This paper presents an overview of the shared task on machine translation of Dravidian languages. We presented the shared task results at the EACL 2021 workshop on Speech and Language Technologies for Dravidian Languages. This paper describes the datasets used, the methodology used for the evaluation of participants, and the experiments’ overall results. As a part of this shared task, we organized four sub-tasks corresponding to machine translation of the following language pairs: English to Tamil, English to Malayalam, English to Telugu and Tamil to Telugu, which are available at https://competitions.codalab.org/competitions/27650. We provided the participants with training and development datasets to perform experiments, and the results were evaluated on unseen test data. In total, 46 research groups participated in the shared task and 7 experimental runs were submitted for evaluation. We used BLEU scores for assessment of the translations.

Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Navya Jose, Anand Kumar M, Thomas Mandl, Prasanna Kumar Kumaresan, Rahul Ponnusamy, Hariharan R L, John P. McCrae and Elizabeth Sherly Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pp 133-145, (2021) PDF Abstract

Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification for under-resourced languages like Tamil, Malayalam and Kannada is of essential importance. As user-generated content is often code-mixed and not well studied for under-resourced languages, it is imperative to create resources and conduct benchmark studies to encourage research in under-resourced Dravidian languages. We created a shared task on offensive language detection in Dravidian languages. We summarize the dataset for this challenge, which is openly available at https://competitions.codalab.org/competitions/27654, and present an overview of the methods and the results of the competing systems.

Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons. Stella Markantonatou, John McCrae, Jelena Mitrović, Carole Tiberius, Carlos Ramisch, Ashwini Vaidya, Petya Osenova and Agata Savary Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, (2020) PDF

ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text. Koustava Goswami, Priya Rani, Bharathi Raja Chakravarthi, Theodorus Fransen and John P. McCrae Proceedings of the International Workshop on Semantic Evaluation 2020 (SemEval-2020) at COLING 2020, (2020) PDF Abstract

Code mixing is a common phenomenon in multilingual societies where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) Model sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiments of the given English-Hindi code-mixed tweets without using word-level language tags, instead inferring this automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which has outperformed the baseline F1-score on the test data-set as well as the validation data-set. Our results can be found under the user name koustava on the Sentimix Hindi English https://competitions.codalab.org/competitions/20654#learn_the_details-results page.

CogALex-VI Shared Task: Bidirectional Transformer based Identification of Semantic Relations. Saurav Karmakar and John P. McCrae Proceedings of CogALex - Cognitive Aspects of the Lexicon Workshop at COLING 2020, (2020) PDF Abstract

This paper presents a bidirectional transformer based approach to recognise semantic relationships between a pair of words, as proposed by the CogALex VI shared task in 2020. The system presented here employs BERT embeddings of the words and passes them through a tuned neural network to produce a learned model for the pair of words and their relationships. Afterwards, the same model is used to predict the relationship between unknown words from the test set. CogALex VI provided subtask 1 as the identification of relationships of three specific categories amongst English word pairs, and the presented system opts to work on that. The resulting relationships of the unknown words are analysed here, showing balanced overall performance with some scope for improvement as a further challenge.

Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network. Md. Rezaul Karim, Bharathi Raja Chakravarthi, John P. McCrae and Michael Cochez 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), (2020) PDF Abstract

Exponential growth of social media and micro-blogging sites not only provides platforms for empowering freedom of expression and individual voices, but also enables people to express anti-social behaviour like online harassment, cyberbullying, and hate speech. Numerous works have been proposed to utilize these data for social and anti-social behaviour analysis, document characterization, and sentiment analysis by predicting the contexts, mostly for highly resourced languages like English. However, some languages are under-resourced, e.g., South Asian languages like Bengali, Tamil, Assamese, Malayalam that lack computational resources for natural language processing. In this paper, we provide several classification benchmarks for Bengali, an under-resourced language. We prepared three datasets of expressing hate, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis. We built the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText. We perform three experiments, covering document classification, sentiment analysis, and hate speech detection. We incorporate word embeddings into a Multichannel Convolutional-LSTM (MC-LSTM) network for predicting different types of hate speech, document classification, and sentiment analysis. Experiments demonstrate that BengFastText can capture the semantics of words from respective contexts correctly. Evaluations against several baseline embedding models, e.g., Word2Vec and GloVe, yield up to 92.30%, 82.25%, and 90.45% F1-scores in case of document classification, sentiment analysis, and hate speech detection, respectively, during 5-fold cross-validation tests.

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Shardul Suryawanshi, Navya Jose, Elizabeth Sherly and John P. McCrae Proceedings of Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC 2020), (2020) Abstract

Sentiment analysis of Dravidian languages has received attention in recent years. However, most social media text is code-mixed and there is no research available on sentiment analysis of code-mixed Dravidian languages. The Dravidian-CodeMix-FIRE 2020, a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem. There were two languages for this track: (i) Tamil, and (ii) Malayalam. The participants were given a dataset of YouTube comments and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying them into positive, negative, neutral, mixed-feeling classes or by recognising whether the comment is not in the intended language. The performance of the systems was evaluated by weighted-F1 score.
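As an illustration of the evaluation measure named in this abstract, the following is a minimal sketch of a support-weighted F1 score in plain Python. The function name `weighted_f1` and the example labels are hypothetical; the sketch assumes the common definition (per-class F1 averaged with weights proportional to each class's frequency in the gold labels) and is not taken from the track's own evaluation script.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Average per-class F1, weighting each class by its gold-label support."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, count in support.items():
        # True positives and false positives for this class.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * count / total
    return score


# Hypothetical gold and predicted sentiment labels for four comments.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means frequent classes dominate the score, which matters for label sets as imbalanced as sentiment classes in social media comments typically are.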

NUIG-Panlingua-KMI Hindi↔Marathi MT Systems for Similar Language Translation Task @ WMT 2020. Atul Kr. Ojha, Priya Rani, Akanksha Bansal, Bharathi Raja Chakravarthi, Ritesh Kumar and John P. McCrae Proceedings of the Fifth Conference on Machine Translation (WMT20), (2020) PDF Abstract

NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state-of-the-art in Similar Language Translation Task for Hindi <-> Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges for translation between similar languages. Among the 4 MT systems prepared under this task, 1 PBSMT systems were prepared for Hindi <-> Marathi each and 1 NMT systems were developed for Hindi <-> Marathi using Byte Pair Encoding (BPE) into subwords. The results show that different architectures in NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated for the task.

Contextual Modulation for Relation-Level Metaphor Identification. Omnia Zayed, John P. McCrae and Paul Buitelaar Findings of the Association for Computational Linguistics: EMNLP 2020, (2020) PDF Abstract

Identifying metaphors in text is very challenging and requires comprehending the underlying comparison. The automation of this cognitive process has gained wide attention lately. However, the majority of existing approaches concentrate on word-level identification by treating the task as either single-word classification or sequential labelling without explicitly modelling the interaction between the metaphor components. On the other hand, while existing relation-level approaches implicitly model this interaction, they ignore the context where the metaphor occurs. In this work, we address these limitations by introducing a novel architecture for identifying relation-level metaphoric expressions of certain grammatical relations based on contextual modulation. In a methodology inspired by works in visual reasoning, our approach is based on conditioning the neural network computation on the deep contextualised features of the candidate expressions using feature-wise linear modulation. We demonstrate that the proposed architecture achieves state-of-the-art results on benchmark datasets. The proposed methodology is generic and could be applied to other textual classification problems that benefit from contextual interaction.

Unsupervised Deep Language and Dialect Identification for Short Texts. Koustava Goswami, Rajdeep Sarkar, Bharathi Raja Chakravarthi, Theodorus Fransen and John P. McCrae Proceedings of the 28th International Conference on Computational Linguistics (COLING'2020), (2020) PDF Abstract

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods and our model has outperformed state-of-the-art LI and DI systems in supervised settings.

“Suggest me a movie for tonight”: Leveraging Knowledge Graphs for Conversational Recommendation. Rajdeep Sarkar, Koustava Goswami, Mihael Arcan and John P. McCrae Proceedings of the 28th International Conference on Computational Linguistics (COLING'2020), (2020) PDF Abstract

Conversational recommender systems focus on the task of suggesting products to users based on the conversation flow. Recently, the use of external knowledge in the form of knowledge graphs has been shown to improve the performance in recommendation and dialogue systems. Information from knowledge graphs aids in enriching those systems by providing additional information such as closely related products and textual descriptions of the items. However, knowledge graphs are incomplete since they do not contain all factual information present on the web. Also, when working on a specific domain, knowledge graphs in their entirety contribute extraneous information and noise. In this work, we study several subgraph construction methods and compare their performance across the recommendation task. We incorporate pre-trained embeddings from the subgraphs along with positional embeddings in our models. Extensive experiments show that our method has a relative improvement of at least 5.62% compared to the state-of-the-art on multiple metrics on the recommendation task.

Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages. Bharathi Raja Chakravarthi, Navaneethan Rajasekaran, Mihael Arcan, Kevin McGuinness, Noel E. O’Connor and John P. McCrae Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020), (2020) PDF Abstract

Bilingual lexicons are a vital tool for under-resourced languages, and recent state-of-the-art approaches to this leverage pretrained monolingual word embeddings using supervised or semi-supervised approaches. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. Especially in the case of low-resourced languages, seed dictionaries are not readily available, and as such, these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging as they are written in unique scripts. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereas we demonstrate that the longest common sub-sequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times, making bilingual lexicon induction approaches feasible for such under-resourced languages.
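To make the two string measures contrasted in this abstract concrete, here is a minimal sketch of the Levenshtein edit distance and a longest common subsequence (LCS) similarity in plain Python. The function names, the normalisation by the longer string's length, and the example strings are assumptions made for illustration; the paper's actual scoring and transliteration steps are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length normalised by the longer string (an assumed normalisation)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


# Generic example strings, standing in for words already mapped to one script.
print(levenshtein("kitten", "sitting"), lcs_length("kitten", "sitting"))
```

Unlike edit distance, the LCS score only rewards characters that appear in the same order in both words, so inserted affixes lower the ratio without counting as substitution errors.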

iLOD: InterPlanetary File System based Linked Open Data Cloud. Jamal A. Nasir and John P. McCrae Proceedings of MEPDaW'20 - Managing the Evolution and Preservation of the Data Web at ISWC 2020, (2020) Abstract

The proliferation of the World Wide Web and the Semantic Web applications has led to an increase in distributed services and datasets. This increase has placed an infrastructural load in terms of availability, immutability, and security, and the Linked Open Data (LOD) cloud is failing to meet these challenges due to the brittleness of its decentralisation. In this work, we present iLOD: a peer-to-peer decentralized storage infrastructure using the InterPlanetary File System (IPFS). iLOD is a dataset sharing system that leverages content-based addressing to support a resilient internet, and can speed up the web by getting nearby copies. In this study, we empirically analyze and evaluate the availability limitations of LOD and propose a distributed system for storing and accessing linked datasets without requiring any centralized server.

COST Action “European network for Web-centred linguistic data science” (NexusLinguarum). Thierry Declerck, Jorge Gracia and John P. McCrae Procesamiento del Lenguaje Natural, pp 93-96, (2020) PDF Abstract

We present the current state of the large “European network for Web-centred linguistic data science”. In its first phase, the network has put in place several working groups to deal with specific topics. The network has also already implemented a first round of Short Term Scientific Missions (STSM).

English WordNet: A new open-source WordNet for English. John P. McCrae, Ewa Rudnicka and Francis Bond K Lexical News, pp 37-44, (2020) PDF

Linguistic Linked Data: Representation, Generation and Applications. Philipp Cimiano, Christian Chiarcos, John P. McCrae and Jorge Gracia Springer, (2020) PDF

7th Workshop on Linked Data in Linguistics (LDL-2020). Maxim Ionov, John P. McCrae, Christian Chiarcos, Thierry Declerck, Julia Bosque-Gil and Jorge Gracia (eds) European Language Resources Association (ELRA) - LREC 2020 Workshop Language Resources and Evaluation Conference, (2020) PDF

Globalex Workshop on Linked Lexicography. Ilan Kernerman, Simon Krek, John P. McCrae, Jorge Gracia, Sina Ahmadi and Besim Kabashi (eds) European Language Resources Association (ELRA) - LREC 2020 Workshop Language Resources and Evaluation Conference, (2020) PDF

A Survey of Current Datasets for Code-Switching Research. Navya Jose, Bharathi Raja Chakravarthi, Shardul Suryawanshi, Elizabeth Sherly and John P. McCrae ICACCS 2020: International Conference on Advanced Computing & Communication Systems (ICACCS), pp 136-141, (2020) PDF Abstract

Code switching is a prevalent phenomenon in the multilingual community and social media interaction. In the past ten years, we have witnessed an explosion of code switched data in the social media that brings together languages from low resourced languages to high resourced languages in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation, etc. The available corpora for code switching research played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate the dataset and categorize them accordingly.

Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding. Ruba Priyadharshini, Bharathi Raja Chakravarthi, Mani Vegupatti and John P. McCrae ICACCS 2020: International Conference on Advanced Computing & Communication Systems (ICACCS), pp 68-72, (2020) PDF Abstract

In this paper, we utilize the pre-trained embedding, sub-word embedding and closely related languages of languages in the code mixed corpus to create a meta-embedding. We then use the Transformer to encode the code mixed sentence and use Conditional Random Field to predict the Named Entities in the code-mixed text. In contrast to classical Named Entity recognition where the text is monolingual, our approach can predict the Named Entities in code-mixed corpus written both in the native script as well as Roman script. Our method is a novel method to combine the embeddings of closely related languages to identify Named Entity from Code-Mixed Indian text written using native script and Roman script in social media.

On the Linguistic Linked Open Data Infrastructure. Christian Chiarcos, Bettina Klimek, Christian Fäth, Thierry Declerck and John P. McCrae Proceedings of the 1st International Workshop on Language Technology Platforms at LREC 2020, pp 8-15, (2020) PDF Abstract

In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD (sub-)cloud of linguistic resources, which covers various linguistic data bases, lexicons, corpora, terminology and metadata repositories. We give, in some detail, an overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD.

NUIG at TIAD: Combining Unsupervised NLP and Graph Metrics for Translation Inference. John P. McCrae and Mihael Arcan Proceedings of the Globalex Workshop on Linked Lexicography (@LREC 2020), pp 92-97, (2020) PDF Abstract

In this paper, we present the NUIG system at the TIAD shared task. This system includes graph-based metrics calculated using novel algorithms, with an unsupervised document embedding tool called ONETA and an unsupervised multi-way neural machine translation method. The results are an improvement over our previous system and produce the highest precision among all systems in the task as well as very competitive F-Measure results. Incorporating features from other systems should be easy in the framework we describe in this paper, suggesting this could very easily be extended to an even stronger result.

A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods in Hindi-English Code-Mixed Data. Priya Rani, Shardul Suryawanshi, Koustava Goswami, Bharathi Raja Chakravarthi, Theodorus Fransen and John Philip McCrae Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying at LREC 2020, pp 42-48, (2020) PDF Abstract

Hate speech detection in social media communication has become one of the primary concerns to avoid conflicts and curb undesired activities. In an environment where multilingual speakers switch among multiple languages, hate speech detection becomes a challenging task using methods that are designed for monolingual corpora. In our work, we attempt to analyze, detect and provide a comparative study of hate speech in a code-mixed social media text. We also provide a Hindi-English code-mixed data set consisting of Facebook and Twitter posts and comments. Our experiments show that deep learning models trained on this code-mixed corpus perform better.

Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability. Georg Rehm, Dimitris Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian Chiarcos, Nils Feldhus, Julian Moreno-Schneider, Florian Kintzel, Elena Montiel-Ponsoda, Víctor Rodriguez-Doncel, John Philip McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina Bontcheva, Ian Roberts, Andrejs Vasiļjevs and Andis Lagzdins Proceedings of the 1st International Workshop on Language Technology Platforms at LREC 2020, pp 96-107, (2020) PDF Abstract

With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest implementing in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.

A Dataset for Classification of Tamil Memes. Shardul Suryawanshi, Bharathi Raja Chakravarthi, Pranav Verma, Mihael Arcan, John Philip McCrae and Paul Buitelaar Proceedings of the 5th Workshop on Indian Language Data: Resources and Evaluation (WILDRE-5) at LREC-2020, pp 7-13, (2020) PDF Abstract

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious content targeting users or communities. One way of trolling is by making memes, which in most cases combine an image with a concept or catchphrase. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. To facilitate the computational modelling of trolling in memes for Indian languages, we created a meme dataset for Tamil (TamilMemes). We annotated and released the dataset containing suspected troll and not-troll memes. In this paper, we use image classification to address the difficulties involved in the classification of troll memes with the existing methods. We found that the identification of a troll meme with such an image classifier is not feasible, which has been corroborated with precision, recall and F1-score.

Modelling Frequency and Attestations for OntoLex-Lemon. Christian Chiarcos, Maxim Ionov, Jesse de Does, Katrien Depuydt, Anas Fahad Khan, Sander Stolk, Thierry Declerck and John Philip McCrae Proceedings of the Globalex Workshop on Linked Lexicography (@LREC 2020), pp 1-9, (2020) PDF Abstract

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

Challenges of Word Sense Alignment: Portuguese Language Resources. Ana Salgado, Sina Ahmadi, Alberto Simões, John Philip McCrae and Rute Costa Proceedings of the 7th Workshop on Linked Data in Linguistics: Building tools and infrastructure at LREC 2020, pp 45-51, (2020) PDF Abstract

This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web standards. The results obtained are useful for the discussion within the community.

English WordNet 2020: Improving and Extending a WordNet for English using an Open-Source Methodology. John Philip McCrae, Alexandre Rademaker, Ewa Rudnicka and Francis Bond Proceedings of the Multimodal Wordnets Workshop at LREC 2020, pp 14-19, (2020) PDF Abstract

The Princeton WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project, English WordNet, has arisen to continue the development of the model under an open-source paradigm. In this paper, we detail the second release of this resource, entitled “English WordNet 2020”. The work has focused, firstly, on the introduction of new synsets and senses and developing guidelines for this and, secondly, on the integration of contributions from other projects. We present the changes in this edition, which total over 15,000 changes over the previous release.

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English. Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly and John Philip McCrae Proceedings of 1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop at LREC 2020, pp 177-184, (2020) PDF Abstract

There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific for this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff’s alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text. Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini and John Philip McCrae Proceedings of 1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop at LREC 2020, pp 202-210, (2020) PDF Abstract

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

Towards Automatic Linking of Lexicographic Data: the case of a historical and a modern Danish dictionary. Sina Ahmadi, Sanni Nimb, Thomas Troelsgård, John P. McCrae and Nicolai H. Sørensen Proceedings of the XIX EURALEX International Congress, (2020) PDF

Figure Me Out: A Gold Standard Dataset for Metaphor Interpretation. Omnia Zayed, John P. McCrae and Paul Buitelaar Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 5810-5819, (2020) PDF Abstract

Metaphor comprehension and understanding is a complex cognitive task that requires interpreting metaphors by grasping the interaction between the meaning of their target and source concepts. This is very challenging for humans, let alone computers. Thus, automatic metaphor interpretation is understudied in part due to the lack of publicly available datasets. The creation and manual annotation of such datasets is a demanding task which requires huge cognitive effort and time. Moreover, there will always be a question of accuracy and consistency of the annotated data due to the subjective nature of the problem. This work addresses these issues by presenting an annotation scheme to interpret verb-noun metaphoric expressions in text. The proposed approach is designed with the goal of reducing the workload on annotators and maintaining consistency. Our methodology employs an automatic retrieval approach which utilises external lexical resources, word embeddings and semantic similarity to generate possible interpretations of identified metaphors in order to enable quick and accurate annotation. We validate our proposed approach by annotating around 1,500 metaphors in tweets which were annotated by six native English speakers. As a result of this work, we publish as linked data the first gold standard dataset for metaphor interpretation which will facilitate research in this area.

Some Issues with Building a Multilingual Wordnet. Francis Bond, Luis Morgado da Costa, Michael Wayne Goodman, John P. McCrae and Ahti Lohk Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 3189-3197, (2020) PDF Abstract

In this paper we discuss the experience of bringing together over 40 different wordnets. We introduce some extensions to the GWA wordnet LMF format proposed in Vossen et al. (2016) and look at how this new information can be displayed. Notable extensions include: confidence, corpus frequency, orthographic variants, lexicalized and non-lexicalized synsets and lemmas, new parts of speech, and more. Many of these extensions already exist in multiple wordnets – the challenge was to find a compatible representation. To this end, we introduce a new version of the Open Multilingual Wordnet (Bond and Foster, 2013) that integrates a new set of tools that test the extensions introduced by this new format, while also ensuring the integrity of the Collaborative Interlingual Index (CILI: Bond et al., 2016), preventing the same new concept from being introduced through multiple projects.

A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment. Sina Ahmadi, John P. McCrae, Sanni Nimb, Thomas Troelsgård, Sussi Olsen, Bolette S. Pedersen, Thierry Declerck, Tanja Wissik, Monica Monachini, Andrea Bellandi, Fahad Khan, Irene Pisani, Simon Krek, Veronika Lipp, Tamás Váradi, László Simon, András Győrffy, Carole Tiberius, Tanneke Schoonheim, Yifat Ben Moshe, Maya Rudich, Raya Abu Ahmad, Dorielle Lonke, Kira Kovalenko, Margit Langemets, Jelena Kallas, Oksana Dereza, Theodorus Fransen, David Cillessen, David Lindemann, Mikel Alonso, Ana Salgado, José Luis Sancho, Rafael-J. Ureña-Ruiz, Kiril Simov, Petya Osenova, Zara Kancheva, Ivaylo Radev, Ranka Stanković, Cvetana Krstev, Biljana Lazić, Aleksandra Marković, Andrej Perdih and Dejan Gabrovšek Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 3232-3242, (2020) PDF Abstract

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

Recent Developments for the Linguistic Linked Open Data Infrastructure. Thierry Declerck, John Philip McCrae, Christian Chiarcos, Philipp Cimiano, Jorge Gracia, Matthias Hartung, Deirdre Lee, Elena Montiel-Ponsoda, Artem Revenko and Roser Saurí Proceedings of the 12th Language Resource and Evaluation Conference (LREC 2020), pp 5660-5667, (2020) PDF Abstract

In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.

A Comparative Study of SVM and LSTM Deep Learning Algorithms for Stock Market Prediction. Sai Krishna Lakshminarayanan and John McCrae Proceedings for the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, (2019) PDF Abstract

The paper presents a comparative study of the performance of Long Short-Term Memory (LSTM) neural network models with Support Vector Machine (SVM) regression models. The framework built as a part of this study comprises eight models: four built using LSTM and four using SVM. Two major datasets are used for this paper. One is the base standard Dow Jones Index (DJI) stock price dataset and the other is the combination of this stock price dataset with externally added input parameters of crude oil and gold prices. This comparative study shows the best model in combination with our input dataset. The performance of the models is measured in terms of their Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and R squared (R2) score values. The methodologies and the results of the models are discussed and possible enhancements to this work are also provided.
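As an illustration of the five error measures named in this abstract, the following is a minimal sketch using their standard textbook definitions. It is not taken from the paper: the function name `regression_metrics` and the sample price values are invented for the example, and the MAPE line assumes no actual value is zero.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE, MAPE (in percent) and R2 for two equal-length lists.

    Hypothetical helper for illustration only; assumes no actual value is 0
    (needed for MAPE) and that the actual values are not all identical
    (needed for R2).
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n              # mean squared error
    rmse = math.sqrt(mse)                             # root mean squared error
    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination

    return {"mse": mse, "rmse": rmse, "mae": mae, "mape": mape, "r2": r2}


# Invented closing prices and predictions, purely for demonstration.
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 390.0])
```

Lower values are better for the first four measures, while R2 is better the closer it is to 1, so a comparison between models has to read the two groups in opposite directions.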

Linguistic Linked Open Data for All. John P. McCrae and Thierry Declerck Proceedings of the Language Technology 4 All Conference, (2019) PDF Abstract

In this paper we briefly describe the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’). This project aims to increase the uptake of language technologies by exploiting the combination of linked data and language technologies, that is Linguistic Linked Open Data (LLOD), to create ready-to-use multilingual data. Prêt-à-LLOD aims to achieve this by creating a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies, in particular the usage of the LLOD.

Towards a Global Lexicographic Infrastructure. Simon Krek, Thierry Declerck, John Philip McCrae and Tanja Wissik Proceedings of the Language Technology 4 All Conference, (2019) PDF Abstract

In this paper we briefly describe the European project ELEXIS (European Lexicographic Infrastructure). ELEXIS aims to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will enable efficient access to high quality lexical data in the digital age, and bridge the gap between more advanced and lesser-supported lexicographic resources. For this, ELEXIS makes use of or establishes common standards and solutions for the development of lexicographic resources and develops strategies and tools for extracting, structuring and linking lexicographic resources.

Cardamom: Comparative Deep Models for Minority and Historical Languages. John Philip McCrae and Theodorus Fransen Proceedings of the Language Technology 4 All Conference, (2019) PDF Abstract

This paper gives an overview of the Cardamom project, which aims to close the resource gap for minority and under-resourced languages by means of deep-learning-based natural language processing (NLP) and exploiting similarities of closely-related languages. The project further extends this concept to historical languages, which can be considered as closely related to their modern form, and as such aims to provide NLP through both space and time for languages that have been ignored by current approaches.

Challenges for the Representations for Morphology in Ontology Lexicons. Bettina Klimek, John P. McCrae, Maxim Ionov, James K. Tauber, Christian Chiarcos, Julia Bosque-Gil and Paul Buitelaar Proceedings of Sixth Biennial Conference on Electronic Lexicography, eLex 2019, (2019) PDF Abstract

Recent years have experienced a growing trend in the publication of language resources as Linguistic Linked Data (LLD) to enhance their discovery, reuse and the interoperability of tools that consume language data. To this aim, the OntoLex-lemon model has emerged as a de-facto standard to represent lexical data on the Web. However, traditional dictionaries contain a considerable amount of morphological information which is not straightforwardly representable as LLD within the current model. In order to fill this gap, a new Morphology Module of OntoLex-lemon is currently being developed. This paper presents the results of this model as on-going work as well as the underlying challenges that emerged during the module development. Based on the MMoOn Core ontology, it aims to account for a wide range of morphological information, ranging from endings to derive whole paradigms to the decomposition and generation of lexical entries, which is in compliance with other OntoLex-lemon modules and facilitates the encoding of complex morphological data in ontology lexicons.

The ELEXIS Interface for Interoperable Lexical Resources. John P. McCrae, Carole Tiberius, Anas Fahad Khan, Ilan Kernerman, Thierry Declerck, Simon Krek, Monica Monachini and Sina Ahmadi Proceedings of Sixth Biennial Conference on Electronic Lexicography, eLex 2019, (2019) PDF Abstract

ELEXIS is a project that aims to create a European network of lexical resources, and one of the key challenges for this is the development of an interoperable interface for different lexical resources so that further tools may improve the data. This paper describes this interface and in particular describes the five methods of entrance into the infrastructure, through retrodigitization, by conversion to TEI-Lex0, by the TEI-Lex0 format, by the OntoLex format or through the REST interface described in this paper.

Towards Electronic Lexicography for the Kurdish Language. Sina Ahmadi, Hossein Hassani and John P. McCrae Proceedings of Sixth Biennial Conference on Electronic Lexicography, eLex 2019, (2019) PDF Abstract

This paper describes the development of lexicographic resources for Kurdish and provides a lexical model for this language. Kurdish is considered a less-resourced language and currently lacks machine-readable lexicon resources. The unique potential which Linked Data and the Semantic Web offer to e-lexicography enables interoperability across lexical resources by elevating the traditional linguistic data to machine-processable semantic formats. Therefore, we present our lexicon in the Ontolex-Lemon ontology as a standard model for sharing lexical information on the Semantic Web. The research covers the Sorani, Kurmanji, and Hawrami dialects of Kurdish. This research suggests that although Kurdish is a less-resourced language in terms of documented lexicons, it owns a wide range of resources, but because they are not machine-readable, they could not contribute to the language processing. The outcome of this project, which is made publicly available, assists scholars in their efforts towards making Kurdish a resource-rich language.

Taxonomy Extraction for Customer Service Knowledge Base Construction. Bianca Pereira, Cécile Robin, Tobias Daudert, John P. McCrae, Paul Buitelaar and Pranab Mohanty Proceedings of the SEMANTicS 2019, (2019) PDF Abstract

Customer service agents play an important role in bridging the gap between customers' vocabulary and business terms. In a scenario where organisations are moving into semi-automatic customer service, semantic technologies with capacity to bridge this gap become a necessity. In this paper we explore the use of automatic taxonomy extraction from text as a means to reconstruct a customer-agent taxonomic vocabulary. We evaluate our proposed solution in an industry use case scenario in the financial domain and show that our approaches for automated term extraction and using in-domain training for taxonomy construction can improve the quality of automatically constructed taxonomic knowledge bases.

A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles. Adrian Doyle, John P. McCrae and Clodagh Downey Proceedings of the Celtic Language Technology Workshop 2019, (2019) PDF Abstract

This paper examines difficulties inherent in tokenization of Early Irish texts and demonstrates that a neural-network-based approach may provide a viable solution for historical texts which contain unconventional spacing and spelling anomalies. Guidelines for tokenizing Old Irish text are presented and the creation of a character-level LSTM network is detailed, its accuracy assessed, and efforts at optimising its performance are recorded. Based on the results of this research it is expected that a character-level LSTM model may provide a viable solution for tokenization of historical texts where the use of Scriptio Continua, or alternative spacing conventions, makes the automatic separation of tokens difficult.

Adapting Term Recognition to an Under-Resourced Language: the Case of Irish. John P. McCrae and Adrian Doyle Proceedings of the Celtic Language Technology Workshop 2019, (2019) PDF Abstract

Automatic Term Recognition (ATR) is an important method for the summarization and analysis of large corpora, and normally requires a significant amount of linguistic input, in particular the use of part-of-speech taggers. For an under-resourced language such as Irish, the resources necessary for this may be scarce or entirely absent. We evaluate two methods for the automatic extraction of terms, based on the small part-of-speech-tagged corpora that are available for Irish and on a large terminology list, and show that both methods can produce viable term extractors. We evaluate this with a newly constructed corpus that is the first available corpus for term extraction in Irish. Our results shine some light on the challenge of adapting natural language processing systems to under-resourced scenarios.

WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural Machine Translation. Bharathi Raja Chakravarthi, Mihael Arcan and John P. McCrae Proceedings of the MomenT Workshop, (2019) PDF Abstract

In this paper, we translate the glosses in the English WordNet based on the expand approach for improving and generating wordnets with the help of multilingual neural machine translation. Neural Machine Translation (NMT) has recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. However, the performance of NMT often suffers in low resource scenarios where large corpora cannot be obtained. Using training data from closely related languages has proven to be invaluable for improving performance. In this paper, we describe how we trained multilingual NMT from closely related languages utilizing phonetic transcription for Dravidian languages. We report the evaluation result of the generated wordnet senses in terms of precision. By comparing to the recently proposed approach, we show improvement in terms of precision.

Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription. Bharathi Raja Chakravarthi, Ruba Priyadharshini, Bernardo Stearns, Arun Jayapal, S Srivedy, Mihael Arcan, Manel Zarrouk and John P. McCrae Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages (LoResMT 2019), (2019) PDF Abstract

Multimodal machine translation is the task of translating from source language to target language using information from other modalities. Existing multimodal datasets have been restricted to only highly resourced languages. These datasets were collected by manual translation of English descriptions from the Flickr30K dataset. In this work, we introduce MMDravi, a Multilingual Multimodal dataset for under-resourced Dravidian languages. It comprises 30K sentences which were created utilizing several machine translation outputs. Using data from MMDravi and a phonetic transcription of the corpus, we build an MMNMT system for closely related Dravidian languages to take advantage of multilingual corpus and other modalities. We evaluate our MMNMT translations generated by the proposed approach with human annotated evaluation tests in terms of BLEU, METEOR, and TER. Relying on multilingual corpora, phonetic transcription, and image features, our approach improves the translation quality for the under-resourced languages.

English WordNet 2019 -- An Open-Source WordNet for English. John P. McCrae, Alexandre Rademaker, Francis Bond, Ewa Rudnicka and Christiane Fellbaum Proceedings of the 10th Global WordNet Conference – GWC 2019, (2019) Abstract

We describe the release of a new wordnet for English based on the Princeton WordNet, but now developed under an open-source model. In particular, this version of WordNet, which we call English WordNet 2019, which has been developed by multiple people around the world through GitHub, fixes many errors in previous wordnets for English. We give some details of the changes that have been made in this version and give some perspectives about likely future changes that will be made as this project continues to evolve.

Identification of Adjective-Noun Neologisms using Pretrained Language Models. John P. McCrae Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) at ACL 2019, (2019) PDF Abstract

Neologism detection is a key task in the construction of lexical resources and has wider implications for NLP; however, the identification of multiword neologisms has received little attention. In this paper, we show that we can effectively identify the distinction between compositional and non-compositional adjective-noun pairs by using pretrained language models and comparing this with individual word embeddings. Our results show that the use of these models significantly improves over baseline linguistic features; however, the combination with linguistic features still further improves the results, suggesting the strength of a hybrid approach.

Inferring translation candidates for multilingual dictionary generation. Mihael Arcan, Daniel Torregrosa, Sina Ahmadi and John P. McCrae Proceedings of the 2nd Translation Inference Across Dictionaries (TIAD) Shared Task, (2019) PDF Abstract

In the widely-connected digital world, multilingual lexical resources are one of the most important resources for natural language processing applications, including information retrieval, question answering or knowledge management. These applications benefit from the multilingual knowledge as well as from the semantic relation between the words documented in these resources. Since multilingual dictionary creation and curation is a time-consuming task, we explored the use of multi-way neural machine translation trained on corpora of languages from the same family and trained additionally with a relatively small human-validated dictionary to infer new translation candidates. Our results showed not only that new dictionary entries can be identified and extracted from the translation model, but also that the expected precision and recall of the resulting dictionary can be adjusted by using different thresholds.
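The trade-off described in the last sentence of this abstract can be illustrated with a minimal sketch: keep only candidates whose score reaches a threshold, then measure precision and recall against a validated dictionary. Everything here is an assumption made for the example, not the paper's method or data: the function name `precision_recall_at`, the candidate scores, the English-Spanish word pairs and the gold set are all invented.

```python
def precision_recall_at(candidates, gold, threshold):
    """Return (precision, recall) of candidates scoring at or above `threshold`.

    `candidates` maps (source, target) pairs to a confidence score and `gold`
    is the set of correct pairs. Hypothetical helper for illustration only;
    by convention it returns (0.0, 0.0) when no candidate passes the threshold.
    """
    kept = {pair for pair, score in candidates.items() if score >= threshold}
    if not kept:
        return 0.0, 0.0
    true_positives = len(kept & gold)
    return true_positives / len(kept), true_positives / len(gold)


# Invented scored translation candidates, purely for demonstration.
candidates = {
    ("house", "casa"): 0.9,
    ("dog", "perro"): 0.8,
    ("house", "hogar"): 0.6,
    ("house", "caza"): 0.4,   # incorrect candidate
    ("dog", "gato"): 0.3,     # incorrect candidate
}
gold = {("house", "casa"), ("house", "hogar"), ("dog", "perro")}

strict = precision_recall_at(candidates, gold, 0.7)    # fewer, safer entries
lenient = precision_recall_at(candidates, gold, 0.35)  # more entries, more noise
```

With these invented scores, the stricter threshold keeps only correct pairs but misses one gold entry, while the more lenient threshold recovers every gold entry at the cost of admitting an incorrect pair.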

TIAD 2019 Shared Task: Leveraging Knowledge Graphs with Neural Machine Translation for Automatic Multilingual Dictionary Generation. Daniel Torregrosa, Mihael Arcan, Sina Ahmadi and John P. McCrae Proceedings of the 2nd Translation Inference Across Dictionaries (TIAD) Shared Task, (2019) PDF Abstract

This paper describes the different proposed approaches to the TIAD 2019 Shared Task, which consisted in the automatic discovery and generation of dictionaries leveraging multilingual knowledge bases. We present three methods based on graph analysis and neural machine translation and show that we can generate translations without parallel data.

TIAD Shared Task 2019: Orthonormal Explicit Topic Analysis for Translation Inference across Dictionaries. John P. McCrae Proceedings of the 2nd Translation Inference Across Dictionaries (TIAD) Shared Task, (2019) PDF Abstract

The task of inferring translations can be achieved by the means of comparable corpora and in this paper we apply explicit topic modelling over comparable corpora to the task of inferring translation candidates. In particular, we use the Orthonormal Explicit Topic Analysis (ONETA) model, which has been shown to be the state-of-the-art explicit topic model through its elimination of correlations between topics. The method proves highly effective at selecting translations with high precision.

Lexical Sense Alignment using Weighted Bipartite b-Matching. Sina Ahmadi, Mihael Arcan and John McCrae Proceedings of the Poster Track of LDK 2019, pp 12-16, (2019) PDF Abstract

Lexical resources are important components of natural language processing (NLP) applications providing linguistic information about the vocabulary of a language and the semantic relationships between the words. While there is an increasing number of lexical resources, particularly expert-made ones such as WordNet or FrameNet as well as collaboratively-curated ones such as Wikipedia or Wiktionary, manual construction and maintenance of such resources is a cumbersome task. This can be efficiently addressed by NLP techniques. Aligned resources have been shown to improve word, knowledge and domain coverage and increase multilingualism by creating new lexical resources such as Yago, BabelNet and ConceptNet. In addition, they can improve the performance of NLP tasks such as word sense disambiguation, semantic role tagging and semantic relations extraction.

Representing Arabic Lexicons in Lemon - a Preliminary Study. Mustafa Jarrar, Hamzeh Amayreh and John McCrae Proceedings of the Poster Track of LDK 2019, pp 29-33, (2019) PDF Abstract

We present our progress in representing 150 Arabic multilingual lexicons using Lemon, which we have been digitizing from scratch. These lexicons are available through a lexicographic search engine (https://ontology.birzeit.edu) that allows searching for translations, synonyms, and definitions. Representing these lexicons in Lemon will enable them to be used by ontologies and NLP applications, as well as to be interlinked with the Open Linguistic Data Cloud.

Crowd-sourcing A High-Quality Dataset for Metaphor Identification in Tweets. Omnia Zayed, John P. McCrae and Paul Buitelaar 2nd Conference on Language, Data and Knowledge (LDK 2019), (2019) PDF Abstract

Metaphor is one of the most important elements of human communication, especially in informal settings such as social media. There have been a number of datasets created for metaphor identification, however, this task has proven difficult due to the nebulous nature of metaphoricity. In this paper, we present a crowd-sourcing approach for the creation of a dataset for metaphor identification, that is able to rapidly achieve large coverage over the different usages of metaphor in a given corpus while maintaining high accuracy. We validate this methodology by creating a set of 2,500 manually annotated tweets in English, for which we achieve inter-annotator agreement scores over 0.8, which is higher than other reported results that did not limit the task. This methodology is based on the use of an existing classifier for metaphor in order to assist in the identification and the selection of the examples for annotation, in a way that reduces the cognitive load for annotators and enables quick and accurate annotation. We selected a corpus of both general language tweets and political tweets relating to Brexit and we compare the resulting corpus on these two domains. As a result of this work, we have published the first dataset of tweets annotated for metaphors, which we believe will be invaluable for the development, training and evaluation of approaches for metaphor identification in tweets.

Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages. Bharathi Raja Chakravarthi, Mihael Arcan and John P. McCrae Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek and Milan Dojchinovski (eds) 2nd Conference on Language, Data and Knowledge (LDK 2019), pp 6:1--6:14, (2019) PDF Abstract

Under-resourced languages are a significant challenge for statistical approaches to machine translation, and recently it has been shown that the usage of training data from closely-related languages can improve machine translation quality of these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes taking advantage of these language similarities difficult. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration to the Latin script with fine-grained IPA transliteration. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in terms of the BLEU, METEOR and chrF scores from transliteration and we find that the transliteration into the Latin script outperforms the fine-grained IPA transcription.

2nd Conference on Language, Data and Knowledge (LDK 2019). Maria Eskevich, Gerard de Melo, Christian Fäth, John P. McCrae, Paul Buitelaar, Christian Chiarcos, Bettina Klimek and Milan Dojchinovski (eds) Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik - OpenAccess Series in Informatics (OASIcs), (2019) PDF

On Lexicographical Networks. Sina Ahmadi, Mihael Arcan and John McCrae Workshop on eLexicography: Between Digital Humanities and Artificial Intelligence, (2018) PDF Abstract

Lexical resources are important components of natural language processing (NLP) applications providing machine-readable knowledge for various tasks. One of the most popular examples of lexical resources are lexicons. Lexicons provide linguistic information about the vocabulary of a language and the semantic relationships between the words in a pair of languages. In addition to the lexicons, there are various other types of lexical resources, particularly those which are made by experts such as WordNet, VerbNet and FrameNet and, those which are collaboratively curated such as Wikipedia and Wiktionary.

6th Workshop on Linked Data in Linguistics: Towards Linguistic Data Science. John P. McCrae, Christian Chiarcos, Thierry Declerck, Jorge Gracia and Bettina Klimek (eds) European Language Resources Association - LREC-2018 Workshop Proceedings, (2018) PDF Abstract

Since its establishment in 2012, the Linked Data in Linguistics (LDL) workshop series has become the major forum for presenting, discussing and disseminating technologies, vocabularies, resources and experiences regarding the application of Semantic Web standards and the Linked Open Data paradigm to language resources in order to facilitate their visibility, accessibility, interoperability, reusability, enrichment, combined evaluation and integration. The LDL workshop series is organized by the Open Linguistics Working Group of the Open Knowledge Foundation, and has contributed greatly to the emergence and growth of the Linguistic Linked Open Data (LLOD) cloud. LDL workshops contribute to the discussion, dissemination and establishment of community standards that drive this development, most notably the Lemon/OntoLex model for lexical resources, as well as standards for other types of language resources still under development. Building on our earlier success in creating and linking language resources, LDL-2018 will focus on Linguistic Data Science, i.e., research methodologies and applications building on Linguistic Linked Open Data and the existing technology and resource stack for linguistics, natural language processing and digital humanities. LDL-2018 builds on the success of the workshop series, incl. two appearances at LREC (2014, 2016), where we attracted a large number of interested participants. As of 2016, LDL workshops alternate with our stand-alone conference on Language, Data and Knowledge (LDK). LDK-2017 was held in Galway, Ireland, as a 3-day event with 150 registrants and several satellite workshops. Continuing the LDL workshop series together with LDK is important in order to facilitate dissemination within and to receive input from the language resource community, and LREC is the obvious host conference for this purpose. LDL-2018 will be supported by the ELEXIS project on a European Lexicographic Infrastructure.

European Lexicographic Infrastructure (ELEXIS). Simon Krek, John McCrae, Iztok Kosem, Tanja Wissik, Carole Tiberius, Roberto Navigli and Bolette Sandford Pedersen Proceedings of the XVIII EURALEX International Congress on Lexicography in Global Contexts, pp 881-892, (2018) PDF Abstract

In the paper we describe a new EU infrastructure project dedicated to lexicography. The project is part of the Horizon 2020 program, with a duration of four years (2018-2022). The result of the project will be an infrastructure which will (1) enable efficient access to high quality lexicographic data, and (2) bridge the gap between more advanced and less-resourced scholarly communities working on lexicographic resources. One of the main issues addressed by the project is the fact that current lexicographic resources have different levels of (incompatible) structuring, and are not equally suitable for application in Natural Language Processing and other fields. The project will therefore develop strategies, tools and standards for extracting, structuring and linking lexicographic resources to enable their inclusion in Linked Open Data and the Semantic Web, as well as their use in the context of digital humanities.

Constructing an Annotated Corpus of Verbal MWEs for English. Abigail Walsh, Claire Bonial, Kristina Geeraert, John P. McCrae, Nathan Schneider and Clarissa Somers Proceedings of Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), (2018) Abstract

This paper describes the construction and annotation of a corpus of verbal MWEs for English as part of the PARSEME Shared Task 1.1 on automatic identification of verbal MWEs. The criteria for corpus selection, the categories of MWEs used, and the training process are discussed, along with the particular issues that led to revisions in edition 1.1 of the annotation guidelines. Finally, an overview of the characteristics of the final annotated corpus is presented, as well as some discussion on inter-annotator agreement.

Phrase-Level Metaphor Identification using Distributed Representations of Word Meaning. Omnia Zayed, John P. McCrae and Paul Buitelaar Proceedings of the Workshop on Figurative Language Processing, (2018) Abstract

Metaphor is an essential element of human cognition which is often used to express ideas and emotions that might be difficult to express using literal language. Processing metaphoric language is a challenging task for a wide range of applications ranging from text simplification to psychotherapy. Despite the variety of approaches that are trying to process metaphor, there is still a need for better models that mimic the human cognition while exploiting fewer resources. In this paper, we present an approach based on distributional semantics to identify metaphors on the phrase-level. We investigated the use of different word embeddings models to identify verb-noun pairs where the verb is used metaphorically. Several experiments are conducted to show the performance of the proposed approach on benchmark datasets.

Linking Datasets Using Semantic Textual Similarity. John P. McCrae and Paul Buitelaar Cybernetics and Information Technologies, 18(1) pp 109-123, (2018) PDF Abstract

Linked data has been widely recognized as an important paradigm for representing data and one of the most important aspects of supporting its use is discovery of links between datasets. For many datasets, there is a significant amount of textual information in the form of labels, descriptions and documentation about the elements of the dataset, and the foundation of precise linking is the application of semantic textual similarity to link these datasets. However, most linking tools so far rely on only simple string similarity metrics such as Jaccard scores. We present an evaluation of some metrics that have performed well in recent semantic textual similarity evaluations and apply these to linking existing datasets.

Preservation of Original Orthography in the Construction of an Old Irish Corpus. Adrian Doyle, John P. McCrae and Clodagh Downey Proceedings of the 3rd Workshop for Collaboration and Computing for Under-Resourced Languages, (2018) PDF Abstract

This paper will examine the process of creating a digital corpus based on the Würzburg glosses, the earliest large collection of glosses written in the Irish language. Modern editorial standards applied in publications of these glosses can alter spelling, punctuation, and even the semantic meaning of a sentence where one word is used in place of another. Therefore, an understanding of the original orthography utilised by Old Irish scribes is important in determining the orthography which should be utilised in a modern digital corpus. This paper will outline why the text of the Würzburg glosses as it appears in Thesaurus Palaeohibernicus is the best candidate for digitisation. The automated digitisation and proofing process of the corpus will be outlined, and details will be given of a tag-set utilised within the digital corpus in order to preserve information present in Thesaurus Palaeohibernicus as metadata.

ELEXIS - European Lexicographic Infrastructure: Contributions to and from the Linguistic Linked Open Data. Thierry Declerck, John McCrae, Roberto Navigli, Ksenia Zaytseva and Tanja Wissik Proceedings of the Globalex 2018 Workshop, (2018) Abstract

In this paper we outline the interoperability aspects of the recently started European project ELEXIS (European Lexicographic Infrastructure). ELEXIS aims to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will enable efficient access to high quality lexical data in the digital age, and bridge the gap between more advanced and lesser-supported lexicographic resources. For this, ELEXIS will make use of or establish common standards and solutions for the development of lexicographic resources and develop strategies and tools for extracting, structuring and linking lexicographic resources.

A supervised approach to taxonomy extraction using word embeddings. Rajdeep Sarkar, John P. McCrae and Paul Buitelaar Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018) PDF Abstract

Large collections of texts are commonly generated by large organizations and making sense of these collections of texts is a significant challenge. One method for handling this is to organize the concepts into a hierarchical structure such that similar concepts can be discovered and easily browsed. This approach was the subject of a recent evaluation campaign, TExEval; however, the results of this task showed that none of the systems consistently outperformed a relatively simple baseline. In order to solve this issue, we propose a new method that uses supervised learning to combine multiple features with a support vector machine classifier including the baseline features. We show that this outperforms the baseline and thus provides a stronger method for identifying taxonomic relations than previous methods.

A Comparison Of Emotion Annotation Schemes And A New Annotated Data Set. Ian Wood, John P. McCrae, Vladimir Andryushechkin and Paul Buitelaar Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018) PDF Abstract

While the recognition of positive/negative sentiment in text is an established task with many standard data sets and well developed methodologies, the recognition of more nuanced affect has received less attention, and in particular, there are very few publicly available gold standard annotated resources. To address this lack, we present a series of emotion annotation studies on tweets culminating in a publicly available collection of 2,019 tweets with scores on four emotion dimensions: valence, arousal, dominance and surprise, following the emotion representation model identified by Fontaine et al. (2007). Further, we make a comparison of relative vs. absolute annotation schemes. We find improved annotator agreement with a relative annotation scheme (comparisons) on a dimensional emotion model over a categorical annotation scheme on Ekman’s six basic emotions (Ekman et al., 1987). However, when we compare inter-annotator agreement for comparisons with agreement for a rating scale annotation scheme (both with the same dimensional emotion model), we find improved inter-annotator agreement with rating scales, challenging a common belief that relative judgements are more reliable.

Teanga: A Linked Data based platform for Natural Language Processing. Housam Ziad, John P. McCrae and Paul Buitelaar Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018) PDF Abstract

In this paper, we describe Teanga, a linked data based platform for natural language processing (NLP). Teanga enables the use of many NLP services from a single interface, whether the need was to use a single service or multiple services in a pipeline. Teanga focuses on the problem of NLP services interoperability by using linked data to define the types of services input and output. Teanga’s strengths include being easy to install and run, easy to use, able to run multiple NLP tasks from one interface and helping users to build a pipeline of tasks through a graphical user interface.

Automatic Enrichment of Terminological Resources: the IATE RDF Example. Mihael Arcan, Elena Montiel-Ponsoda, John P. McCrae and Paul Buitelaar Proceedings of the 11th Language Resource and Evaluation Conference (LREC), (2018) PDF Abstract

Terminological resources have proven necessary in many organizations and institutions to ensure communication between experts. However, the maintenance of these resources is a very time-consuming and expensive process. Therefore, the work described in this contribution aims to automate the maintenance process of such resources. As an example, we demonstrate enriching the RDF version of IATE with new terms in the languages for which no translation was available, as well as with domain-disambiguated sentences and information about usage frequency. This is achieved by relying on machine translation trained on parallel corpora that contain the terms in question and multilingual word sense disambiguation performed on the context provided by the sentences. Our results show that for most languages translating the terms within a disambiguated context significantly outperforms the approach with randomly selected sentences.

MixedEmotions: An Open-Source Toolbox for Multi-Modal Emotion Analysis. Paul Buitelaar, Ian D. Wood, Sapna Negi, Mihael Arcan, John P. McCrae, Andrejs Abele, Cécile Robin, Vladimir Andryushechkin, Housam Ziad, Hesam Sagha, J. Fernando Sánchez-Rada, Carlos A. Iglesias, Carlos Navarro, Andreas Giefer, Nicolaus Heise, Vincenzo Masucci, Francesco A. Danza, Ciro Caterino, Pavel Smrž, Michal Hradiš, Filip Povolný, Marek Klimeš, Pavel Matějka and Giovanni Tummarello IEEE Transactions on Multimedia, 20(9) (2018) PDF Abstract

Recently, there is an increasing tendency to embed functionalities for recognizing emotions from user-generated media content in automated systems such as call-centre operations, recommendations, and assistive technologies, providing richer and more informative user and content profiles. However, to date, adding these functionalities was a tedious, costly, and time-consuming effort, requiring identification and integration of diverse tools with diverse interfaces as required by the use case at hand. The MixedEmotions Toolbox leverages the need for such functionalities by providing tools for text, audio, video, and linked data processing within an easily integrable plug-and-play platform. These functionalities include: 1) for text processing: emotion and sentiment recognition; 2) for audio processing: emotion, age, and gender recognition; 3) for video processing: face detection and tracking, emotion recognition, facial landmark localization, head pose estimation, face alignment, and body pose estimation; and 4) for linked data: knowledge graph integration. Moreover, the MixedEmotions Toolbox is open-source and free. In this paper, we present this toolbox in the context of the existing landscape, and provide a range of detailed benchmarks on standard test-beds showing its state-of-the-art performance. Furthermore, three real-world use cases show its effectiveness, namely, emotion-driven smart TV, call center monitoring, and brand reputation analysis.

Mapping WordNet Instances to Wikipedia. John P. McCrae Proceedings of the 9th Global WordNet Conference, (2018) Abstract

Lexical resources differ from encyclopaedic resources and represent two distinct types of resource covering general language and named entities respectively. However, many lexical resources, including Princeton WordNet, contain many proper nouns, referring to named entities in the world, yet it is not possible or desirable for a lexical resource to cover all named entities that may reasonably occur in a text. In this paper, we propose that instead of including synsets for instance concepts, PWN should provide links to Wikipedia articles describing the concept. In order to enable this we have created a gold-quality mapping between all of the 7,742 instances in PWN and Wikipedia (where such a mapping is possible). As such, this resource aims to provide a gold standard for link discovery, while also allowing PWN to distinguish itself from other resources such as DBpedia or BabelNet. Moreover, this linking connects PWN to the Linguistic Linked Open Data cloud, thus creating a richer, more usable resource for natural language processing.

Towards a Crowd-Sourced WordNet for Colloquial English. John P. McCrae, Ian Wood and Amanda Hicks Proceedings of the 9th Global WordNet Conference, (2018) Abstract

Princeton WordNet is one of the most widely-used resources for natural language processing, but is updated only infrequently and cannot keep up with the fast-changing usage of the English language on social media platforms such as Twitter. The Colloquial WordNet aims to provide an open platform whereby anyone can contribute, while still following the structure of WordNet. Many crowdsourced lexical resources have significant quality issues, and as such care must be taken in the design of the interface to ensure quality. In this paper, we present the development of a platform that can be opened on the Web to any lexicographer who wishes to contribute to this resource and the lexicographic methodology applied by this interface.

Improving Wordnets for Under-Resourced Languages Using Machine Translation information. Bharathi Raja Chakravarthi, Mihael Arcan and John P. McCrae Proceedings of the 9th Global WordNet Conference, (2018) Abstract

Wordnets are extensively used in natural language processing, but the current approaches for manually building a wordnet from scratch involve large research groups for a long period of time, which are typically not available for under-resourced languages. Even if wordnet-like resources are available for under-resourced languages, they are often not easily accessible, which can alter the results of applications using these resources. Our proposed method presents an expand approach for improving and generating wordnets with the help of machine translation. We apply our methods to improve and extend wordnets for the Dravidian languages, i.e., Tamil, Telugu, Kannada, which are severely under-resourced languages. We report evaluation results of the generated wordnet senses in terms of precision for these languages. In addition to that, we carried out a manual evaluation of the translations for the Tamil language, where we demonstrate that our approach can aid in improving wordnet resources for under-resourced Dravidian languages.

ELEXIS - a European infrastructure fostering cooperation and information exchange among lexicographical research communities. Bolette Pedersen, John McCrae, Carole Tiberius and Simon Krek Proceedings of the 9th Global WordNet Conference, (2018) Abstract

The paper describes objectives, concept and methodology for ELEXIS, a European infrastructure fostering cooperation and information exchange among lexicographical research communities. The infrastructure is a newly granted project under the Horizon 2020 INFRAIA call, with the topic Integrating Activities for Starting Communities. The project is planned to start in January 2018.

Knowledge Graphs and Language Technology - ISWC 2016 International Workshops: KEKI and NLP&DBpedia. Marieke van Erp, Sebastian Hellmann, John P. McCrae, Christian Chiarcos, Key-Sun Choi, Jorge Gracia, Yoshihiko Hayashi, Seiji Koide, Pablo Mendes, Heiko Paulheim and Hideaki Takeda (eds) Springer - Lecture Notes in Computer Science, (2017) PDF

Language, Data, and Knowledge. Jorge Gracia, Francis Bond, John P. McCrae, Paul Buitelaar, Christian Chiarcos and Sebastian Hellmann (eds) Springer - Lecture Notes in Artificial Intelligence, (2017) PDF

Linking Knowledge Graphs across Languages with Semantic Similarity and Machine Translation. John P. McCrae, Mihael Arcan and Paul Buitelaar Proceedings of the First Workshop on Multi-Language Processing in a Globalising World (MLP2017), (2017)

The OntoLex-Lemon Model: development and applications. John P. McCrae, Julia Bosque-Gil, Jorge Gracia, Paul Buitelaar and Philipp Cimiano Proceedings of eLex 2017, pp 587-597, (2017) PDF

OnLiT: An Ontology for Linguistic Terminology. Bettina Klimek, John P. McCrae, Christian Chiarcos and Sebastian Hellmann Proceedings of the First Conference on Language, Data and Knowledge (LDK2017), pp 42-57, (2017)

The Colloquial WordNet: Extending Princeton WordNet with Neologisms. John P. McCrae, Ian Wood and Amanda Hicks Proceedings of the First Conference on Language, Data and Knowledge (LDK2017), pp 194-202, (2017)

An Evaluation Dataset for Linked Data Profiling. Andrejs Abele, John P. McCrae and Paul Buitelaar Proceedings of the First Conference on Language, Data and Knowledge (LDK2017), pp 1-9, (2017)

Lexicon Model for Ontologies: Community Report. Philipp Cimiano, John P. McCrae and Paul Buitelaar Technical Report: W3C (2016) PDF

Expanding wordnets to new languages with multilingual sense disambiguation. Mihael Arcan, John P. McCrae and Paul Buitelaar Proceedings of The 26th International Conference on Computational Linguistics, (2016) PDF

Identifying Poorly-Defined Concepts in WordNet with Graph Metrics. John P. McCrae and Narumol Prangnawarat Proceedings of the First Workshop on Knowledge Extraction and Knowledge Integration (KEKI-2016), (2016)

LIXR: Quick, succinct conversion of XML to RDF. John P. McCrae and Philipp Cimiano Proceedings of the ISWC 2016 Posters and Demo Track, (2016)

Yuzu: Publishing Any Data as Linked Data. John P. McCrae Proceedings of the ISWC 2016 Posters and Demo Track, (2016)

NUIG-UNLP at SemEval-2016 Task 1: Soft Alignment and Deep Learning for Semantic Textual Similarity. John P. McCrae, Kartik Asooja, Nitish Aggarwal and Paul Buitelaar SemEval-2016, (2016) PDF

Linked Data and Text Mining as an Enabler for Reproducible Research. John P. McCrae, Georgeta Bordea and Paul Buitelaar 1st Workshop on Cross-Platform Text Mining and Natural Language Processing Interoperability, (2016) PDF

Domain adaptation for ontology localization. John P. McCrae, Mihael Arcan, Kartik Asooja, Jorge Gracia, Paul Buitelaar and Philipp Cimiano Web Semantics, 36 pp 23-31, (2016) PDF

Representing Multiword Expressions on the Web with the OntoLex-Lemon model. John P. McCrae, Philipp Cimiano, Paul Buitelaar and Georgeta Bordea PARSEME/ENeL workshop on MWE e-lexicons, (2016) PDF

The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud. John P. McCrae, Christian Chiarcos, Francis Bond, Philipp Cimiano, Thierry Declerck, Gerard de Melo, Jorge Gracia, Sebastian Hellmann, Bettina Klimek, Steven Moran, Petya Osenova, Antonio Pareja-Lora and Jonathan Pool 10th Language Resource and Evaluation Conference (LREC), pp 2435-2441, (2016) PDF

CILI: the Collaborative Interlingual Index. Francis Bond, Piek Vossen, John P. McCrae and Christiane Fellbaum Proceedings of the Global WordNet Conference 2016, (2016) PDF

Toward a truly multilingual Global Wordnet Grid. Piek Vossen, Francis Bond and John P. McCrae Proceedings of the Global WordNet Conference 2016, (2016) PDF

Multilingual Linked Data (editorial). John P. McCrae, Steven Moran, Sebastian Hellmann and Martin Brümmer Semantic Web, 6(4) pp 315-317, (2015) PDF

lemonUby - a large, interlinked, syntactically-rich lexical resource for ontologies. Judith Eckle-Kohler, John McCrae and Christian Chiarcos Semantic Web, 6(4) pp 371-378, (2015) PDF

Linghub: a Linked Data based portal supporting the discovery of language resources. John P. McCrae and Philipp Cimiano Proceedings of the 11th International Conference on Semantic Systems, pp 88-91, (2015)

Linking Four Heterogeneous Language Resources as Linked Data. Benjamin Siemoneit, John P. McCrae and Philipp Cimiano Proceedings of the 4th Workshop on Linked Data in Linguistics, pp 59-63, (2015) PDF

Reconciling Heterogeneous Descriptions of Language Resources. John P. McCrae, Philipp Cimiano, Victor Rodriguez-Doncel, Daniel Vila-Suero, Jorge Gracia, Luca Matteis, Roberto Navigli, Andrejs Abele, Gabriela Vulcu and Paul Buitelaar Proceedings of the 4th Workshop on Linked Data in Linguistics, pp 39-48, (2015) PDF

Linked Terminology: Applying Linked Data Principles to Terminological Resources. Philipp Cimiano, John P. McCrae, Victor Rodriguez-Doncel, Tatiana Gornostaya, Asuncion Gómez-Pérez, Benjamin Siemoneit and Andis Lagzdins Proceedings of eLex 2015, pp 504-517, (2015) PDF

One ontology to bind them all: The META-SHARE OWL ontology for the interoperability of linguistic datasets on the Web. John P. McCrae, Penny Labropoulou, Jorge Gracia, Marta Villegas, Victor Rodriguez-Doncel and Philipp Cimiano Proceedings of the 4th Workshop on the Multilingual Semantic Web, (2015) PDF

LIME: the Metadata Module for OntoLex. Manuel Fiorelli, Armando Stellato, John P. McCrae, Philipp Cimiano and Maria Teresa Pazienza Proceedings of 12th Extended Semantic Web Conference, (2015) PDF

Language Resources and Linked Data: A Practical Perspective. Jorge Gracia, Daniel Vila-Suero, John P. McCrae, Tiziano Flati, Ciro Baron and Milan Dojchinovski In: Knowledge Engineering and Knowledge Management (2015) PDF

Design Patterns for Engineering the Ontology-Lexicon Interface. John P. McCrae and Christina Unger In: Towards the Multilingual Semantic Web, Paul Buitelaar and Philipp Cimiano (eds) pp 15-30, (2014) PDF

Representing Swedish Lexical Resources in RDF with lemon. Lars Borin, Dana Dannells, Markus Forsberg and John P. McCrae Proceedings of the ISWC 2014 Posters & Demonstrations Track - a track within the 13th International Semantic Web Conference, pp 329-332, (2014) PDF

Towards assured data quality and validation by data certification. John P. McCrae, Cord Wiljes and Philipp Cimiano Proceedings of the 1st Workshop on Linked Data Quality, (2014) PDF

Bielefeld SC: Orthonormal Topic Modelling for Grammar Induction. John P. McCrae and Philipp Cimiano Proceedings of the 8th International Workshop on Semantic Evaluation, (2014) PDF

Default Physical Measurements in SUMO. Francesca Quattri, Adam Pease and John P. McCrae Proceedings of 4th Workshop on Cognitive Aspects of the Lexicon, (2014) PDF

Modelling the Semantics of Adjectives in the Ontology-Lexicon Interface. John P. McCrae, Christina Unger, Francesca Quattri and Philipp Cimiano Proceedings of 4th Workshop on Cognitive Aspects of the Lexicon, (2014) PDF

Publishing and Linking WordNet using lemon and RDF. John P. McCrae, Christiane Fellbaum and Philipp Cimiano Proceedings of the 3rd Workshop on Linked Data in Linguistics, (2014) PDF

A Multilingual Semantic Network as Linked Data: lemon-BabelNet. Maud Ehrmann, Francesco Ceconi, Daniela Vannella, John P. McCrae, Philipp Cimiano and Roberto Navigli Proceedings of the 3rd Workshop on Linked Data in Linguistics, (2014)

Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. Maud Ehrmann, Francesco Ceconi, Daniela Vannella, John P. McCrae, Philipp Cimiano and Roberto Navigli Proceedings of the 9th Language Resource and Evaluation Conference, pp 401-408, (2014)

3LD: Towards high quality, industry-ready Linguistic Linked Licensed Data. Daniel Vila-Suero, Victor Rodriguez-Doncel, Asunción Gómez-Pérez, Philipp Cimiano, John P. McCrae and Guadalupe Aguado-de-Cea European Data Forum 2014, (2014)

Ontology-based interpretation of natural language. Philipp Cimiano, Christina Unger and John McCrae Morgan & Claypool, (2014) PDF

Orthonormal explicit topic analysis for cross-lingual document matching. John McCrae, Philipp Cimiano and Roman Klinger Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1732-1742, (2013) PDF

A lemon lexicon for DBpedia. Christina Unger, John McCrae, Sebastian Walter, Sara Winter and Philipp Cimiano Proceedings of 1st International Workshop on NLP and DBpedia, (2013) PDF

Multilingual variation in the context of linked data. Elena Montiel-Ponsoda, John McCrae, Guadalupe Aguado-de-Cea and Jorge Gracia Proceedings of the 10th International Conference on Terminology and Artificial Intelligence, pp 19-26, (2013) PDF

Mining translations from the web of open linked data. John P. McCrae and Philipp Cimiano Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction, pp 9-13, (2013) PDF

Releasing multimodal data as Linguistic Linked Open Data: An experience report. Peter Menke, John P. McCrae and Philipp Cimiano Proceedings of the 2nd Workshop on Linked Data in Linguistics, pp 44-52, (2013) PDF

Towards open data for linguistics: Lexical Linked Data. Christian Chiarcos, John McCrae, Philipp Cimiano and Christiane Fellbaum In: New Trends of Research in Ontologies and Lexical Resources, pp 7-25, (2013)

On the role of senses in the Ontology-Lexicon. Philipp Cimiano, John McCrae, Paul Buitelaar and Elena Montiel-Ponsoda In: New Trends of Research in Ontologies and Lexical Resources, pp 43-62, (2013)

Using SPIN to formalize accounting regulation on the Semantic Web. Dennis Spohr, Philipp Cimiano, John McCrae and Sean O'Riain First International Workshop on Finance and Economics on the Semantic Web in conjunction with 9th Extended Semantic Web Conference, pp 1-15, (2012) PDF

Collaborative semantic editing of linked data lexica. John McCrae, Elena Montiel-Ponsoda and Philipp Cimiano Proc. of the 2012 International Conference on Language Resources and Evaluation, pp 2619-2625, (2012) PDF

Three steps for creating high quality ontology-lexica. John McCrae and Philipp Cimiano Proc. of the Workshop on Collaborative Resource Development and Delivery at the 2012 International Conference on Language Resources and Evaluation, (2012)

Integrating WordNet and Wiktionary with lemon. John McCrae, Philipp Cimiano and Elena Montiel-Ponsoda In: Linked Data and Linguistics, Christian Chiarcos, Sebastian Nordhoff and Sebastian Hellmann (eds), pp 25-34, (2012)

Interchanging lexical resources on the Semantic Web. John McCrae, Guadalupe Aguado-de-Cea, Paul Buitelaar, Philipp Cimiano, Thierry Declerck, Asunción Gómez-Pérez, Jorge Gracia, Laura Hollink, Elena Montiel-Ponsoda, Dennis Spohr and Tobias Wunner Language Resources and Evaluation, 46(6), pp 701-709, (2012)

LexInfo: A declarative model for the lexicon-ontology interface. Philipp Cimiano, Paul Buitelaar, John McCrae and Michael Sintek Web Semantics: Science, Services and Agents on the World Wide Web, 9(1), pp 29-51, (2011)

Challenges for the Multilingual Web of Data. Jorge Gracia, Elena Montiel-Ponsoda, Philipp Cimiano, Asunción Gómez-Pérez, Paul Buitelaar and John McCrae Web Semantics: Science, Services and Agents on the World Wide Web, pp 63-71, (2011) PDF

Combining statistical and semantic approaches to the translation of ontologies and taxonomies. John McCrae, Mauricio Espinoza, Elena Montiel-Ponsoda, Guadalupe Aguado-de-Cea and Philipp Cimiano Fifth Workshop on Syntax, Structure and Semantics in Statistical Translation in conjunction with 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, (2011)

Linking Lexical Resources and Ontologies on the Semantic Web with lemon. John McCrae, Dennis Spohr and Philipp Cimiano Proc. of the 8th Extended Semantic Web Conference, pp 245-249, (2011)

Ontology Lexicalization: The lemon perspective. Paul Buitelaar, Philipp Cimiano, John McCrae, Elena Montiel-Ponsoda and Thierry Declerck Proc. of 9th International Conference on Terminology and Artificial Intelligence, (2011)

Representing Term Variation in lemon. Elena Montiel-Ponsoda, Guadalupe Aguado-de-Cea and John McCrae Proc. of 9th International Conference on Terminology and Artificial Intelligence, (2011)

CLOVA: An architecture for cross-language semantic data querying. John McCrae, Jesus R. Campaña and Philipp Cimiano Proceedings of the 1st Workshop on the Multilingual Semantic Web, pp 5-12, (2010) PDF

Navigating the Information Storm: Web-based global health surveillance in BioCaster. Nigel Collier, Son Doan, Reiko Matsuda Goodwin, John McCrae, Mike Conway, Mika Shigematsu and Ai Kawazoe In: Biosurveillance: Methods and Case Studies, Taha Kass-Hout and Xiaohui Zhang (eds), pp 291-312, (2010) PDF

Ontology-based multilingual access to financial reports for sharing business knowledge across Europe. Thierry Declerck, Hans-Ulrich Krieger, Susan-Marie Thomas, Paul Buitelaar, Sean O'Riain, Tobias Wunner, Gilles Maguet, John McCrae, Dennis Spohr and Elena Montiel-Ponsoda In: International Financial Control Assessment applying Multilingual Ontology Frameworks, pp 67-76, (2010)

An ontology-driven system for detecting global health events. Nigel Collier, Reiko Matsuda Goodwin, John McCrae, Son Doan, Ai Kawazoe, Mike Conway, Asanee Kawtrakul, K. Takeuchi and D. Dien Proc. of the 23rd International Conference on Computational Linguistics, pp 215-222, (2010)

Automatic extraction of logically consistent ontologies from text corpora. John McCrae PhD Thesis for Graduate University of Advanced Studies (SoKenDai), (2009)

SRL Editor: A rule development tool for text mining. John McCrae and Nigel Collier Proc. of Workshop on Semantic Authoring, Annotation and Knowledge Markup in conjunction with the 5th International Conference on Knowledge Capture, (2009) PDF

Synonym set extraction from the biomedical literature by lexical pattern discovery. John McCrae and Nigel Collier BMC Bioinformatics, 9(156) (2008) PDF