Big data is fundamentally changing the way linguists investigate linguistic facts, leading to a new research area that combines data science with linguistics. This course introduces this new area, linguistic data science, through hands-on data analysis focused on key questions in linguistics. It first provides a basic introduction to data science and, in particular, to how it can be applied to large corpora using natural language processing techniques. We then show how these techniques can be used to answer questions in syntax, semantics, multilinguality and other areas of linguistics, and close with a summary giving perspectives on how these methods can be applied to students' own research.
This course provides a broad overview of how data science techniques, including machine learning, natural language processing and data visualisation, can be applied to linguistics, and will equip students with powerful tools to analyse their own challenges in a quantitative manner.
This course is aimed at PhD students in linguistics, computer science or a related field. The course will illustrate key concepts using Python; however, we do not expect students to have any prior programming experience, as students will be provided with IPython notebooks and will only be expected to make minor changes in order to complete their investigations. Although we will cover some technical concepts, we do not assume any prerequisites in mathematics, computer science or linguistics. As such, we expect this course to be accessible to all students at ESSLLI. This course is partly based on a similar course, “Machine Learning and Natural Language Processing for Managers”, offered as part of a postgraduate diploma at the University of Galway.
This course is taught by John P. McCrae, a lecturer at the University of Galway and a senior researcher at the Insight Centre for Data Analytics and the ADAPT Centre. He has worked on linguistic data science for over 10 years, has taught similar courses at previous ESSLLI schools and other summer schools, and has published over 100 papers in the area.