Introduction to Linguistic Data Science

A course at ESSLLI 2025

Big data is fundamentally changing the way that linguists can investigate linguistic facts, leading to a new research area that combines data science with linguistics. This course introduces the area of linguistic data science through hands-on data analysis focused on key questions in linguistics. It first provides a basic introduction to data science, and in particular to how it can be applied to large corpora using natural language processing techniques. We will then show how these techniques can be used to answer questions in syntax, semantics, multilinguality and other areas of linguistics, and close with perspectives on how students can apply these methods to their own research.

This course provides a broad overview of how data science techniques, including machine learning, natural language processing and data visualisation, can be applied to linguistics, and will equip students with powerful tools to analyse their own research challenges quantitatively.

Course outline

  • Monday: Fundamentals of Data Science
  • Slides   Worksheet
    • What is data science?
    • Methods for data science
    • Natural language processing
    • Machine learning
    • Text preprocessing
    • Hands-on: How can we infer authors of texts using stylometrics?
  • Tuesday: Linguistic Data Science
  • Slides   Worksheet
    • Corpora and data
    • Corpus linguistics
    • Types, tokens and morphology
    • Hands-on: English Clitics
    • Social Media Analytics
    • Hands-on: Sarcasm detection
  • Wednesday: Linguistic Data Science for Syntax
  • Slides   Worksheet
    • Part-of-speech Analysis
    • Parsing
    • Hands-on: Adverbs in English
    • Language Usage
    • Hands-on: Diachronic Analysis
  • Thursday: Linguistic Data Science for Semantics
  • Slides   Worksheet
    • Vector space models
    • Word embeddings
    • Understanding Semantic Spaces
    • Large Language Models
    • Prompt Engineering
    • Hands-on: How can we infer word senses?
  • Friday: Multilingual Linguistic Data Science
  • Slides   Worksheet
    • Machine Translation
    • Under-resourced languages
    • Hands-on: Code-switching
    • Distant Reading
  • Bonus Content: Perspectives
  • Slides
    • Evaluation of Machine Translation
    • Linked Data for Linguistics

Expected level and prerequisites

This course is aimed at PhD students engaged in linguistics, computer science or a related field. The course will illustrate key concepts using Python; however, we do not expect students to have any prior programming experience, as they will be provided with IPython notebooks and will only be expected to make minor changes in order to complete their investigations. Although we will cover some technical concepts, we do not assume any prerequisites in mathematics, computer science or linguistics. As such, we expect that this course will be accessible to all students at ESSLLI. This course is partly based on a similar course, “Machine Learning and Natural Language Processing for Managers”, offered as part of a postgraduate diploma at the University of Galway.

References

  • Manning, Christopher, and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
  • Koehn, Philipp. Statistical machine translation. Cambridge University Press, 2009.
  • Goldberg, Yoav. "Neural network methods for natural language processing." Synthesis lectures on human language technologies 10, no. 1 (2017): 1-309.
  • McKinney, Wes. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.

About the lecturer

This course is lectured by John P. McCrae, a lecturer at the University of Galway and a senior researcher at the Insight Centre for Data Analytics and the ADAPT Centre. He has worked on linguistic data science for over 10 years, has taught similar courses at previous ESSLLI schools and other summer schools, and has published over 100 papers in the area.