A mathematical model for representing text as a vector of numbers.
This lets us analyse text with linear algebra.
A term-document matrix is a matrix where each row represents a term and each column represents a document.
|        | Document 1 | Document 2 | Document 3 |
|--------|------------|------------|------------|
| Term 1 | 1          | 0          | 1          |
| Term 2 | 0          | 1          | 1          |
| Term 3 | 1          | 1          | 0          |
Python example
import numpy as np

# vocabulary: list of unique terms; documents: list of tokenised documents
term_document_matrix = np.zeros((len(vocabulary), len(documents)))
for doc_index, doc in enumerate(documents):
    for term in doc:
        term_index = vocabulary.index(term)
        term_document_matrix[term_index, doc_index] += 1

Cosine similarity (angle between vectors)
Euclidean distance (distance between vectors)
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Compare the first two term vectors; sklearn expects 2D arrays
cosine_similarity(term_document_matrix[0].reshape(1, -1),
                  term_document_matrix[1].reshape(1, -1))
euclidean_distances(term_document_matrix[0].reshape(1, -1),
                    term_document_matrix[1].reshape(1, -1))

Very large
Very sparse (many zeros)
Not very informative
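As a rough illustration, a minimal sketch reusing the term_document_matrix built above (for a real corpus with tens of thousands of terms and documents, the fraction of zeros is typically far higher):

# Fraction of zero entries in the term-document matrix
sparsity = 1 - np.count_nonzero(term_document_matrix) / term_document_matrix.size
print(f"{sparsity:.0%} of the entries are zero")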
Instead of a large, sparse vector, compress all this information into a small, dense vector.
Word vectors capture linguistic regularities
vec(Berlin) ≃ vec(Germany) + vec(Paris) - vec(France)
We will use the GloVe vectors
from gensim.models import KeyedVectors

# Load the 50-dimensional GloVe vectors (plain-text format, no header line)
glove = KeyedVectors.load_word2vec_format('glove.6B.50d.txt',
                                          binary=False, no_header=True)

glove.most_similar("belgium")
[('netherlands', 0.8926310539245605),
('france', 0.8631513118743896),
('switzerland', 0.8280506134033203),
('austria', 0.8187914490699768),
 ('luxembourg', 0.8173472285270691)]

def analogy(x1, x2, y1):
    # x1 is to x2 as y1 is to ?
    result = glove.most_similar(positive=[y1, x2],
                                negative=[x1])
    return result[0][0]
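For example, reproducing the capital-city regularity above (with the 50-dimensional GloVe vectors the top neighbour is usually, though not guaranteed to be, 'berlin'):

analogy('france', 'paris', 'germany')
# expected: 'berlin'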
A language model is a function that calculates the likelihood of a string of words.
P("this string") = 0.0001
The probability of a text is higher if the text is fluent, grammatical, and plausible.
Language models can generate text
\[ \arg\max_{w \in \mathrm{vocabulary}} p(\mathrm{what~is~the~next~} w) \]
Repeating this allows us to generate text
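A minimal sketch of this loop, with a hand-written bigram table standing in for a real language model (the words and probabilities are made up purely for illustration):

# Toy "language model": probability of the next word given only the previous word
bigram_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"on": 0.9, "down": 0.1},
    "on":  {"the": 1.0},
}

words = ["the"]
for _ in range(5):
    distribution = bigram_model.get(words[-1], {})
    if not distribution:
        break
    # Greedy decoding: always pick the most probable next word (the argmax above)
    words.append(max(distribution, key=distribution.get))
print(" ".join(words))  # the cat sat on the cat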
Most models are trained autoregressively
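Concretely, autoregressive training fits the chain-rule factorisation of the probability of a text into next-word predictions:

\[ p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1}) \]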
Can you guess the word?
Generative Pre-trained Transformer
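As a small sketch of running such a model locally (assuming the Hugging Face transformers library and the openly available gpt2 checkpoint, which the text does not mention explicitly):

from transformers import pipeline

# Download a small GPT model and generate a continuation of the prompt
generator = pipeline("text-generation", model="gpt2")
generator("The capital of Germany is", max_new_tokens=10)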
Input:
Classify the text into neutral, negative or positive. Text: I think the vacation is okay. Sentiment:
Output:
positive
Input:
Classify the text into neutral, negative or positive.
Text: I think the vacation is okay. Sentiment: neutral
Text: I think the vacation is great. Sentiment: positive
Text: I think the vacation is terrible. Sentiment: negative
Text: I think the vacation is okay. Sentiment:
Output:
neutral
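A sketch of how such a few-shot prompt can be assembled programmatically (the build_prompt helper is ours, not from the text; the examples mirror the ones above):

examples = [
    ("I think the vacation is okay.", "neutral"),
    ("I think the vacation is great.", "positive"),
    ("I think the vacation is terrible.", "negative"),
]

def build_prompt(query, examples):
    # Few-shot prompting: prepend labelled examples before the new query
    prompt = "Classify the text into neutral, negative or positive.\n"
    for text, label in examples:
        prompt += f"Text: {text} Sentiment: {label}\n"
    return prompt + f"Text: {query} Sentiment:"

print(build_prompt("I think the vacation is okay.", examples))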
Input:
Sarah has 3 packs of pencils. Each pack contains 5 pencils. How many pencils does she have in total?
Output:
<think>Each pack has 5 pencils, and there are 3 packs. Multiplying 3 × 5 gives the total number of pencils.</think> 15 pencils
Input:
What is 9 + 10?
Output:
21
Input:
What is 9 + 10? 21 Do you think 21 is the correct answer?
Output:
No