COMP 3225 Natural Language Processing - Revision
Stuart E. Middleton, sem03@soton.ac.uk, University of Southampton
Copyright University of Southampton 2022. Content for internal use at University of Southampton only. Slides may include content publicly shared for education purposes via https://web.stanford.edu/~jurafsky/slp3/

Sections
• Past Exam Questions
• Landscape of Content
• Non-Neural NLP Revision
• Neural NLP Revision
This lecture is supporting material to help you focus your revision. All core examinable content has already been covered in previous lectures. Re-watch the videos to recap the content.

Past Exam Questions
• Exam guidance on the module wiki
  – https://secure.ecs.soton.ac.uk/notes/comp3225/exam/COMP3225-6253-exam-paper-guidance.pdf
  – Part A (answer 2 of 2) - core, one classic NLP topic and one neural NLP topic
  – Part B (answer 1 of 3) - any topic
  – <look at guidance>
• Typical exam question structure
  – 5 marks assessing recall / application of lectures
  – 10 marks assessing recall / application of lectures & book
  – 15 marks assessing understanding of lectures & book & wider reading
• Past exam paper on the module wiki
  – https://secure.ecs.soton.ac.uk/notes/comp3225/exam/COMP3225-past-exam-paper-2022.pdf
  – The exam will be a 2 hour in-person examination
  – Older papers (e.g. 2021) were online 24 hour papers
  – <look at past exam paper>

Landscape of Non-Neural Content (topic boxes on the slide are annotated Coursework / Core)
• Text Processing: Tokenization, Stemming, BPE; Levenshtein Distance, Minimum Edit Distance; Basic regex, Greedy, Groups, Lookahead
• Evaluation: ROUGE, BLEU, Perplexity, F1
• Syntax and Grammar: CFG, Head Finding, Syntactic Parsing, CKY Parsing; Dependency Grammar, Projectivity; Transition-based Parser, Graph-based Dependency Parsing
• Language Models: n-Grams, Maximum Likelihood Estimation, Smoothing, Backoff, Interpolation
• Sequence Labelling: POS, Tagsets, POS tagging, HMM POS tagger; NER, BIO tagging, CRF (train, inference), Feature sets

Landscape of Neural Content (topic boxes on the slide are annotated Core)
• Embeddings: Sparse Embeddings, Term-Doc matrix, Term-Term matrix, Cosine Distance, TF-IDF, PMI; Dense Embeddings, Skip-gram Embeddings, Loss Function, Word Embedding, Semantic Properties of Embeddings, Bias and Embeddings
• Deep NLP Methods: Matrix/Vector Shapes (inc. tips/tricks), Activation Functions, One-hot Vectors, Cross-entropy Loss; Neural NLP Patterns, Deep Learning Stacks; MLP/RNN/LSTM/GRU, Transformer, Multi-head Self-attention
• Sequence Labelling: WordNet, WSD using BERT; Semantic Role, PropBank, FrameNet, Neural SRL
• Sequence to Sequence: MT, Word Order Typology; Encoder-Decoder, Attention, Beam Search
• Sequence Classification: IE, Supervised RE, Semi-supervised RE, Unsupervised RE, KBP
• Span Labelling: IR-based QA, MRC; Graph-based QA, Neural Entity Linking, Neural Relation Detection

Overview - Non-Neural NLP Revision
• Landscape of Non-Neural Content
• Words (2)
• Regular Expressions (3)
• Training, Evaluation & Linguistic Resources (4)
• N-Grams (5)
• Parts of Speech Tagging (6)
• Named Entity Recognition (7)
• Constituency Grammars (14)
• Syntactic Parsing (15)
• Dependency Parsing (16)

Landscape of Non-Neural Content (topic boxes on the slide are annotated Coursework / Core)
• Text Processing: Tokenization, Stemming, BPE; Levenshtein Distance, Minimum Edit Distance; Basic regex, Greedy, Groups, Lookahead
• Evaluation: ROUGE, BLEU, Perplexity, F1
• Syntax and Grammar: CFG, Head Finding, Syntactic Parsing, CKY Parsing; Dependency Grammar, Projectivity; Transition-based Parser, Graph-based Dependency Parsing
• Language Models: n-Grams, Maximum Likelihood Estimation, Smoothing, Backoff, Interpolation
• Sequence Labelling: POS, Tagsets, POS tagging, HMM POS tagger; NER, BIO tagging, CRF (train, inference), Feature sets
Metrics - ROUGE
• Used to evaluate text summarization
  – a good machine summary is one that includes many of the same sequences of words as a human-generated (reference) summary
  – ROUGE-n measures the overlap of n-grams (see lecture 5) between the machine summaries and the reference summaries, as a proportion of the n-grams in the reference summaries
• Variations for n = 1, 2, longest common subsequence, etc.
  – recall-oriented: the score depends on the quantity of reference material that is matched
• ROUGE: A Package for Automatic Evaluation of Summaries, by Chin-Yew Lin (ISI)
  – https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/was2004.pdf

Metrics - BLEU
• Used to evaluate machine translations
  – a good machine translation is one that includes many of the same sequences of words as a human-generated (reference) translation
• Precision-based
  – how many n-word sequences from the machine-generated output also appear in the reference set, for n = 1, 2, 3, 4
  – the proportion of n-grams in the machine output that match the reference set, compared to the total number of n-grams the machine produced
• BLEU ignores recall-based factors (how much of the material did it translate) and focuses only on precision (how much of the material that it translated did it translate well)
  – one solution is to combine both kinds of metric (see F1 score)
  – instead, BLEU penalizes translations that are shorter than the reference translations (Equation 11.23; see chapter 11.8.2)

Metrics - PERPLEXITY
• Used to evaluate language models
  – how good a vocabulary, or a list of word sequences (n-grams), is at "predicting" a target text
  – based on the probability of all the words in the text appearing in that order, inverted and normalized by the number of words: perplexity(W) = P(w_1 w_2 ... w_N)^(-1/N) (Equation 3.14)
• Minimizing perplexity is equivalent to maximizing the test set probability according to the language model
• Example from the book: unigram, bigram, and trigram grammars were trained on 38 million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary, and the perplexity of each grammar was then computed on a 1.5 million word WSJ test set. Perplexity falls as the n-gram order increases (see the table on page 37)

Metrics - Precision, Recall & F1
• Used in text classification, searching, etc.
• Recall = the proportion of all relevant items that were actually selected
  – items that were correctly classified, compared to all the items that should have been classified
• Precision = the proportion of selected items that are genuinely relevant
  – items that were correctly classified, compared to all the items that were classified
• F1 combines the two as their harmonic mean: F1 = 2PR / (P + R)
• Trade-off: it is easy to have 100% precision but 1% recall, or 100% recall but 1% precision
(Diagram sourced from Wikipedia)
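As an illustration of how these overlap-based metrics relate, the short Python sketch below (my own illustration, not taken from the lecture slides; the sentences, function name and variable names are invented for this example) computes clipped unigram overlap between a candidate and a reference sentence, giving a BLEU-1-style precision, a ROUGE-1-style recall, and their harmonic mean F1. Real BLEU combines precisions for n = 1..4 and applies the brevity penalty, and real ROUGE is computed against reference summaries, but the core counting is the same.

from collections import Counter

def unigram_overlap(candidate_tokens, reference_tokens):
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # clipped counts: a candidate word only matches as many times as it
    # appears in the reference (as in BLEU's modified n-gram precision)
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    precision = overlap / sum(cand.values())  # fraction of candidate unigrams found in the reference (BLEU-1 flavour)
    recall = overlap / sum(ref.values())      # fraction of reference unigrams found in the candidate (ROUGE-1 flavour)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    reference = "the bus was so crowded that the driver stopped".split()
    candidate = "the bus was crowded so the driver stopped".split()
    p, r, f1 = unigram_overlap(candidate, reference)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")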
N-grams
• An n-gram model informs us of the probability of the next word in the text, given the previous n-1 words in the text
  – P(w | h) is the probability of word w given the previous history h, where h is a sequence of words
  – out of the times that h occurred, how many times was it followed by w?
• Example: P(the | bus was so crowded that) = Count(bus was so crowded that the) / Count(bus was so crowded that) = 35 / 28700 = 0.12% (counts from Google)

Maximum Likelihood Estimation (MLE)
• The process of choosing the right set of bigram parameters to make our model correctly predict (maximise the likelihood of) the nth word in the text is called maximum likelihood estimation
• The MLE estimates for the parameters of an n-gram model are obtained by
  – observing the n-gram counts from a representative corpus
  – normalizing them (dividing by a total count) to lie between 0 and 1
• Bigram parameter estimation (Equation 3.11): P(w_n | w_n-1) = Count(w_n-1 w_n) / Count(w_n-1)
• N-gram parameter estimation (Equation 3.12): P(w_n | w_n-N+1 ... w_n-1) = Count(w_n-N+1 ... w_n-1 w_n) / Count(w_n-N+1 ... w_n-1)
• This is the same process as the "bus was so crowded that" example above

Laplace smoothing
• Apply Laplace smoothing to unigrams
  – the unsmoothed MLE of word w_i is its count c_i normalized by N, the total number of tokens: P(w_i) = c_i / N
  – since there are V words in the vocabulary and each count is incremented by one, the denominator is also adjusted to take into account the extra V observations: LP(w_i) = (c_i + 1) / (N + V)
  – instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count c* which is easier to compare directly with the MLE counts: c*_i = (c_i + 1) N / (N + V)
  – normalising c*_i by N yields the same expression as LP(w_i) above
  – also consider the discount rate d_c = c* / c
  (add-one smoothing is also used in the worked sketch at the end of this section)

Backoff & Interpolation
• If data about the appearance of higher-order n-grams is sparse, we can fall back to information about lower-order n-grams
• To compute P(w_n | w_n-2 w_n-1) without counts of the trigram w_n-2 w_n-1 w_n
  – estimate its probability using the bigram probability P(w_n | w_n-1)
  – if there are no counts of the bigram, estimate using the unigram P(w_n)
• In the absence of detailed information, using less context can be a good strategy: it allows generalization to contexts that the model has not been trained on
• There are two ways to use this n-gram "hierarchy"
  – Backoff: only use a lower-order n-gram if we have zero evidence for the higher-order n-gram
  – Interpolation: always mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts

Introduction to Parts of Speech (POS)
• Closed Classes - fixed membership
  – typically function words used for structuring grammar (of, it, and, you)
• Open Classes - open membership
  – noun (including proper noun), verb, adjective, adverb, interjection

Tagsets
• A list of POS labels is called a tagset
• Tagsets come in different shapes and sizes
  – Penn Treebank (45 labels)
  – Brown Corpus (87 labels)
  – C7 Tagset (146 labels)
• Penn Treebank cite: Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313-330

POS Tagging
• POS tagging is the process of assigning a POS tag to each word in a text
• Input sequence X = x_1, x_2, ..., x_n of (tokenized) words
• Output sequence Y = y_1, y_2, ..., y_n of POS tags
• Each output y_i corresponds exactly to one input x_i

Hidden Markov Model (HMM) POS tagger
• In a Hidden Markov Model, the POS tags are hidden states, which we must infer from the observed words
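To tie the n-gram estimation and POS tagging material together, the worked Python sketch below (my own illustration, not taken from the lecture slides; the toy tagged corpus, tag set and function names are assumptions made purely for this example) estimates HMM transition and emission probabilities by MLE with add-one (Laplace) smoothing, then uses Viterbi decoding to infer the hidden tag sequence for an observed word sequence.

from collections import Counter
import math

# toy tagged corpus (an assumption for this example)
corpus = [[("the", "DT"), ("bus", "NN"), ("stopped", "VBD")],
          [("the", "DT"), ("crowded", "JJ"), ("bus", "NN"), ("stopped", "VBD")]]

tags = sorted({t for sent in corpus for _, t in sent})
vocab = sorted({w for sent in corpus for w, _ in sent})

trans = Counter()      # counts of (prev_tag, tag); prev_tag is "<s>" at sentence start
emit = Counter()       # counts of (tag, word)
tag_count = Counter()
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[(prev, tag)] += 1
        emit[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag
start_count = len(corpus)

def p_trans(prev, tag):
    # add-one smoothed MLE: (count + 1) / (total + V), V = number of tags
    total = start_count if prev == "<s>" else tag_count[prev]
    return (trans[(prev, tag)] + 1) / (total + len(tags))

def p_emit(tag, word):
    # add-one smoothed MLE over the vocabulary
    return (emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab))

def viterbi(words):
    # log-space Viterbi: best[t] = score of the best tag path ending in tag t
    best = {t: math.log(p_trans("<s>", t)) + math.log(p_emit(t, words[0])) for t in tags}
    back = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev_tag, score = max(
                ((pt, best[pt] + math.log(p_trans(pt, t))) for pt in tags),
                key=lambda x: x[1])
            new_best[t] = score + math.log(p_emit(t, word))
            pointers[t] = prev_tag
        back.append(pointers)
        best = new_best
    # follow back-pointers from the best final tag
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "crowded", "bus", "stopped"]))

On this toy data the sketch prints ['DT', 'JJ', 'NN', 'VBD'], the hidden tag sequence that maximises the product of the smoothed transition and emission probabilities for the observed words.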