Semantics

John P. McCrae - University of Galway

Course at ESSLLI 2025

Vector space models

Vector space models

A mathematical model for representing text as a vector of numbers.

This enables the use of linear algebra to analyse text.

Term-document matrix

A term-document matrix is a matrix where each row represents a term and each column represents a document.

         Document 1   Document 2   Document 3
Term 1       1            0            1
Term 2       0            1            1
Term 3       1            1            0

Term-document matrix

Python example

import numpy as np

term_document_matrix = np.zeros((len(vocabulary), len(documents)))
for doc_idx, doc in enumerate(documents):
    for term in doc:
        # count this occurrence in the row for the term, column for the document
        term_document_matrix[vocabulary.index(term), doc_idx] += 1
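
Here `documents` is assumed to be a list of tokenised documents and `vocabulary` a list of the distinct terms; a toy setup for the snippet above might look like:

documents = [["the", "cat", "sat"], ["the", "dog", "sat", "the"]]
# the distinct terms across all documents, in a fixed order
vocabulary = sorted({term for doc in documents for term in doc})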

Similarity measures

Cosine similarity (angle between vectors)

Euclidean distance (straight-line distance between vectors)

from sklearn.metrics.pairwise import (cosine_similarity,
                                      euclidean_distances)

# scikit-learn expects 2-D arrays, so each vector is reshaped to (1, n)
cosine_similarity(term_document_matrix[0].reshape(1, -1),
                  term_document_matrix[1].reshape(1, -1))
euclidean_distances(term_document_matrix[0].reshape(1, -1),
                    term_document_matrix[1].reshape(1, -1))

Term-document matrix

  • Very large
  • Very sparse (many zeros)
  • Not very informative

Hands-on: Vector space models

Word embeddings

Word embeddings

Instead of one large, sparse vector, compress all of this information into a small, dense vector.

Word embeddings - Autoencoders

Word embeddings - Word2Vec

Analogy

Word vectors capture linguistic regularities

vec(Berlin) ≃ vec(Germany) + vec(Paris) - vec(France)

Understanding semantic spaces

Loading word embeddings

We will use the GloVe vectors

from gensim.models import KeyedVectors
# glove.6B.50d.txt is available from https://nlp.stanford.edu/projects/glove/
glove = KeyedVectors.load_word2vec_format('glove.6B.50d.txt',
                                          binary=False, no_header=True)

Most similar words

glove.most_similar("belgium")
[('netherlands', 0.8926310539245605),
 ('france', 0.8631513118743896),
 ('switzerland', 0.8280506134033203),
 ('austria', 0.8187914490699768),
 ('luxembourg', 0.8173472285270691)]

Analogy

def analogy(x1, x2, y1):
  result = glove.most_similar(positive=[y1, x2], 
                              negative=[x1])
  return result[0][0]
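
For example, with the GloVe vectors loaded above (the result depends on the vectors, so 'berlin' is the expected rather than guaranteed answer):

analogy('france', 'paris', 'germany')
# expected to return 'berlin'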

PCA and t-SNE

  • Word embeddings have many dimensions
  • Reduce to 2 dimensions to visualise
    • Principal Component Analysis (PCA) - linear
    • t-distributed Stochastic Neighbour Embedding (t-SNE) - non-linear

Visualisation with PCA
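
A minimal sketch of such a plot, assuming the `glove` vectors loaded earlier plus scikit-learn and matplotlib; the word list is illustrative:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# project a handful of 50-dimensional word vectors down to 2 dimensions
words = ["france", "paris", "germany", "berlin", "cat", "dog"]
points = PCA(n_components=2).fit_transform([glove[w] for w in words])

for word, (x, y) in zip(words, points):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()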

Hands-on: Word embeddings

Pretrained language models

Language Models

A language model is a function that calculates the likelihood of a string of words.

P("this string") = 0.0001

What's the big deal???

The probability of text is higher if the text is:

  • In a language
  • Grammatically well-formed
  • Coherent
  • Plausible

Generative Language Models

Language models can generate text

\[ \max_{w \in \mathrm{vocabulary}} p(\mathrm{what~is~the~next~} w)\]

Repeating this allows us to generate text
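
A sketch of this loop using greedy (argmax) decoding, again assuming the transformers library and GPT-2 as an illustrative model:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits[0, -1]  # scores for the next token
        next_id = torch.argmax(logits)     # max over the vocabulary
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))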

Pretraining

Most models are trained autoregressively, that is, to predict the next word from the words that precede it.

Can you guess the word?

  • for the humanities, literature and culture in the ????
  • the theme of the album is the life of the ???
  • shipping in the caribbean and off the ???
  • she was the daughter of an african ???

Transformers

  • Most popular architecture at the moment
  • Why transformers?
  • Natural language is hard to do math with:
    • Words not numbers
    • Sentences of different lengths
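
Tokenisation addresses the first of these problems by mapping words (or word pieces) to integer ids; a small sketch, assuming the transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# each word (or word piece) becomes an integer id the model can compute with
print(tokenizer("Natural language is hard to do math with")["input_ids"])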

Transformers

ChatGPT

Generative Pre-trained Transformer

Prompt Engineering

Prompt Engineering

  • Large (>10B parameter) models demonstrate emergent properties
  • Using the correct initial text (prompt) we can extract information from the language model
  • Many such models are available on the Huggingface Hub

Zero-shot prompting

Input:

Classify the text into neutral, negative or positive. 
Text: I think the vacation is okay.
Sentiment: 

Output:

positive
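
A sketch of running such a prompt through a model from the Huggingface Hub; the model name is illustrative, and any instruction-tuned checkpoint should work:

from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
prompt = ("Classify the text into neutral, negative or positive.\n"
          "Text: I think the vacation is okay.\n"
          "Sentiment: ")
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])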

Few-shot prompting

Input:

Classify the text into neutral, negative or positive. 
Text: I think the vacation is okay.
Sentiment: neutral
Text: I think the vacation is great.
Sentiment: positive
Text: I think the vacation is terrible.
Sentiment: negative
Text: I think the vacation is okay.
Sentiment:

Output:

neutral

Chain-of-thought prompting

Input:

Sarah has 3 packs of pencils. Each pack contains 5 pencils. 
How many pencils does she have in total?

Output:

<think>Each pack has 5 pencils, and there are 3 packs.
Multiplying 3 × 5 gives the total number of pencils.</think>
15 pencils

Self-evaluation

Input:

What is 9 + 10?

Output:

21

Input:

What is 9 + 10?
21
Do you think 21 is the correct answer?

Output:

No

Hands-on: Word Sense Induction

Summary

Summary

  • Vectors allow us to do maths with language
  • Word embeddings reveal hidden semantic relations
  • Large language models show an (astonishing) ability to generate plausible dialogue