Introduction to Linguistic Data Science

Big data is fundamentally changing how linguists investigate linguistic facts, giving rise to a new research area that combines data science with linguistics. This course introduces linguistic data science through hands-on data analysis focused on key questions in linguistics. It first provides a basic introduction to data science, in particular how it can be applied to large corpora using natural language processing techniques. We will then show how these techniques can be used to answer questions in syntax, semantics, multilinguality and other areas of linguistics, and close with a summary giving perspectives on how these methods can be applied to students' own research.

This course provides a broad overview of how data science techniques, including machine learning, natural language processing and data visualisation, can be applied to linguistics, and will equip students with powerful tools to analyse their own research questions quantitatively.

Course outline

Monday: Fundamentals of Data Science

Slides

Worksheet

What is data science?
Methods for data science
Natural language processing
Machine learning
Text preprocessing
Hands-on: How can we infer authors of texts using stylometrics?

Tuesday: Linguistic Data Science

Slides

Worksheet

Corpora and data
Corpus linguistics
Types, tokens and morphology
Hands-on: English Clitics
Social Media Analytics
Hands-on: Sarcasm detection

Wednesday: Linguistic Data Science for Syntax

Slides

Worksheet

Part-of-speech Analysis
Parsing
Hands-on: Adverbs in English
Language Usage
Hands-on: Diachronic Analysis

Thursday: Linguistic Data Science for Semantics

Slides

Worksheet

Vector space models
Word embeddings
Understanding Semantic Spaces
Large Language Models
Prompt Engineering
Hands-on: How can we infer word senses?

Friday: Multilingual Linguistic Data Science

Slides

Worksheet

Machine Translation
Under-resourced languages
Hands-on: Code-switching
Distant Reading

Bonus Content: Perspectives

Slides

Evaluation of Machine Translation
Linked Data for Linguistics

Expected level and prerequisites

This course is aimed at PhD students in linguistics, computer science or a related field. The course will illustrate key concepts using Python; however, we do not expect students to have any prior programming experience, as students will be provided with IPython notebooks and will only be expected to make minor changes in order to complete their investigations. Although we will cover some technical concepts, we do not assume any prerequisites in mathematics, computer science or linguistics. As such, we expect that this course will be accessible to all students at ESSLLI. This course is partly based on a similar course, “Machine Learning and Natural Language Processing for Managers”, offered as part of a postgraduate diploma at the University of Galway.

References

Manning, Christopher, and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
Koehn, Philipp. Statistical machine translation. Cambridge University Press, 2009.
Goldberg, Yoav. "Neural network methods for natural language processing." Synthesis lectures on human language technologies 10, no. 1 (2017): 1-309.
McKinney, Wes. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, 2012.

About the lecturer

This course is taught by John P. McCrae, a lecturer at the University of Galway and a senior researcher at the Insight Centre for Data Analytics and the ADAPT Centre. He has worked on linguistic data science for over 10 years, has taught similar courses at previous ESSLLI schools and other summer schools, and has published over 100 papers in the area.