A mathematical model for representing text as a vector of numbers.
This lets us analyse text with linear algebra.
A term-document matrix is a matrix where each row represents a term and each column represents a document.
|        | Document 1 | Document 2 | Document 3 |
|--------|------------|------------|------------|
| Term 1 | 1          | 0          | 1          |
| Term 2 | 0          | 1          | 1          |
| Term 3 | 1          | 1          | 0          |
Python example
import numpy as np

# vocabulary: list of unique terms; documents: list of tokenised documents
term_document_matrix = np.zeros((len(vocabulary), len(documents)))
for doc_index, doc in enumerate(documents):
    for term in doc:
        term_index = vocabulary.index(term)
        term_document_matrix[term_index, doc_index] += 1

Cosine similarity (angle between vectors)
Euclidean distance (distance between vectors)
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Compare the first two term vectors; sklearn expects 2D arrays
cosine_similarity(term_document_matrix[0].reshape(1, -1),
                  term_document_matrix[1].reshape(1, -1))
euclidean_distances(term_document_matrix[0].reshape(1, -1),
                    term_document_matrix[1].reshape(1, -1))

Very large
Very sparse (many zeros)
Not very informative
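As a rough illustration, a minimal sketch reusing the term_document_matrix built above (for a real corpus with tens of thousands of terms and documents, the fraction of zeros is typically far higher):

# Fraction of zero entries in the term-document matrix
sparsity = 1 - np.count_nonzero(term_document_matrix) / term_document_matrix.size
print(f"{sparsity:.0%} of the entries are zero")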
Instead of a large, sparse vector, compress all this information into a small, dense vector.
Word vectors capture linguistic regularities
vec(Berlin) ≃ vec(Germany) + vec(Paris) - vec(France)
We will use the GloVe vectors
from gensim.models import KeyedVectors

# Load the 50-dimensional GloVe vectors (plain-text format, no header line)
glove = KeyedVectors.load_word2vec_format('glove.6B.50d.txt',
                                          binary=False, no_header=True)

glove.most_similar("belgium")
[('netherlands', 0.8926310539245605),
('france', 0.8631513118743896),
('switzerland', 0.8280506134033203),
('austria', 0.8187914490699768),
 ('luxembourg', 0.8173472285270691)]

def analogy(x1, x2, y1):
    # x1 is to x2 as y1 is to ?
    result = glove.most_similar(positive=[y1, x2],
                                negative=[x1])
    return result[0][0]
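For example, reproducing the capital-city regularity above (with the 50-dimensional GloVe vectors the top neighbour is usually, though not guaranteed to be, 'berlin'):

analogy('france', 'paris', 'germany')
# expected: 'berlin'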
A language model is a function that calculates the likelihood of a string of words.
P("this string") = 0.0001
The probability of a text is higher if the text is fluent, grammatical, and plausible.
Language models can generate text
\[ \arg\max_{w \in \mathrm{vocabulary}} p(\mathrm{what~is~the~next~} w) \]
Repeating this allows us to generate text
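A minimal sketch of this loop, with a hand-written bigram table standing in for a real language model (the words and probabilities are made up purely for illustration):

# Toy "language model": probability of the next word given only the previous word
bigram_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"on": 0.9, "down": 0.1},
    "on":  {"the": 1.0},
}

words = ["the"]
for _ in range(5):
    distribution = bigram_model.get(words[-1], {})
    if not distribution:
        break
    # Greedy decoding: always pick the most probable next word (the argmax above)
    words.append(max(distribution, key=distribution.get))
print(" ".join(words))  # the cat sat on the cat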
Most models are trained autoregressively
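Concretely, autoregressive training fits the chain-rule factorisation of the probability of a text into next-word predictions:

\[ p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1}) \]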
Can you guess the word?
Generative Pre-trained Transformer
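As a small sketch of running such a model locally (assuming the Hugging Face transformers library and the openly available gpt2 checkpoint, which the text does not mention explicitly):

from transformers import pipeline

# Download a small GPT model and generate a continuation of the prompt
generator = pipeline("text-generation", model="gpt2")
generator("The capital of Germany is", max_new_tokens=10)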
Input:
Classify the text into neutral, negative or positive. Text: I think the vacation is okay. Sentiment:
Output:
positive
Input:
Classify the text into neutral, negative or positive.
Text: I think the vacation is okay. Sentiment: neutral
Text: I think the vacation is great. Sentiment: positive
Text: I think the vacation is terrible. Sentiment: negative
Text: I think the vacation is okay. Sentiment:
Output:
neutral
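A sketch of how such a few-shot prompt can be assembled programmatically (the build_prompt helper is ours, not from the text; the examples mirror the ones above):

examples = [
    ("I think the vacation is okay.", "neutral"),
    ("I think the vacation is great.", "positive"),
    ("I think the vacation is terrible.", "negative"),
]

def build_prompt(query, examples):
    # Few-shot prompting: prepend labelled examples before the new query
    prompt = "Classify the text into neutral, negative or positive.\n"
    for text, label in examples:
        prompt += f"Text: {text} Sentiment: {label}\n"
    return prompt + f"Text: {query} Sentiment:"

print(build_prompt("I think the vacation is okay.", examples))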
Input:
Sarah has 3 packs of pencils. Each pack contains 5 pencils. How many pencils does she have in total?
Output:
<think>Each pack has 5 pencils, and there are 3 packs. Multiplying 3 × 5 gives the total number of pencils.</think> 15 pencils
Input:
What is 9 + 10?
Output:
21
Input:
What is 9 + 10? 21 Do you think 21 is the correct answer?
Output:
No