Intro: Concept of attention

Selective attention
- What do you actually pay attention to?
- Stuff I notice about people: stuff they're carrying, cars they drive, glasses or not, piercings and tattoos
- Stuff I deliberately notice: hair color and length, approximate height
- Stuff my daughter notices: whether outfits match, hairstyle, fashion choices, other stuff like that
- The same thing applies to reading

Speedreading and skimming: sparse attention
- It can work
- In a sense, we don't even need most of the words
- The BFG

Order of language matters:
- A recurrent neural network processes the words sequentially
- That resists parallelization on a GPU
- A transformer "transforms" its input and can be parallelized
- This enabled language analysis at a much larger scale

Attention
- "Attention Is All You Need"
- Word order varies from one language to the next (English vs. Spanish)

Self-attention
- This piece was new in 2017
- Relationships like the equivalence of words can be learned
- Attention for a word is focused on the other words in the sentence
- Used to disambiguate meaning
- Think about using this for translation: if you know what a word means, you can generate synonyms
- Can learn which words are used together

Positional encoding
- Each word has a position in the input sentence, so meaning can vary based on position
- Just number them? Not that easy
- Overlapping frequencies of sine and cosine
- Can determine both the exact position and "close to here"
- A series of spacings can also be represented, as in "it tends to be in odd-numbered positions" or "it's either early or late in the sentence"
- This was also new to transformers
- Allows parallelism in training; done on GPUs in the original paper (8 P100s)
- See the positional-encoding sketch at the end of these notes

Multi-headed attention:
- Calculates how each word relates to every other word
- query, key, value
- dot(query, key) determines the extent of the relationship
- The query, key, and value projections come from weight matrices learned during training
- softmax normalizes the scores and pushes them toward the extremes; it can act as an activation threshold
- 8 heads in the original paper; each can learn something different
- Multiply by a very large weight matrix
- Fed through a feed-forward network after that
- Encoder layers can be stacked (the original paper stacks 6)
- See the attention sketch at the end of these notes

Decoding process:
- Central question: what word goes next?
- A "masked" multi-head attention process
- 0 attention score for future words in the output: we haven't generated them yet, so we shouldn't pay attention to them
- Kinda inhuman...
- Another way to put it: for each word in the vocabulary, what is the probability that it's the next word?
- The end token counts as an entry in that vector of all words
- See the masking sketch at the end of these notes

Notes:
- Dimensionality of the embedding space: "2048 or higher" for GPT

That's about it for our technical overview of AI.
Questions? What gray areas did I leave?
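Appendix: code sketches

A few of the mechanisms above are easier to see in code. The sketches below are minimal NumPy illustrations, not the paper's implementation; the function names, toy sizes, and random data are mine.

First, sinusoidal positional encoding: overlapping sine and cosine frequencies, with even dimensions using sine and odd dimensions using cosine, so each position gets a unique vector and nearby positions look similar.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) matrix of positional encodings.

    Even dimensions use sine and odd dimensions use cosine, each at a
    different frequency, so nearby positions get similar vectors and a
    fixed spacing between positions is easy to represent.
    """
    positions = np.arange(max_len)[:, np.newaxis]    # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
    # Frequencies fall off geometrically as the dimension index grows.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Encodings for a 10-word sentence in a 16-dimensional embedding space.
print(sinusoidal_positions(10, 16).shape)   # (10, 16)
```

These vectors are simply added to the word embeddings, which is what lets the whole sentence be processed in parallel without losing word order.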
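Next, self-attention itself: dot(query, key) scores, softmax normalization, and a weighted sum of the values. The multi-head version runs several of these in parallel on smaller projections and concatenates the results; the toy example below uses a single head with random stand-in weight matrices (in a real model Wq, Wk, and Wv are learned during training).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of query, key, and value vectors.

    dot(query, key) scores how strongly each word relates to every other
    word; softmax normalizes each row of scores (and pushes them toward
    the extremes); the normalized weights then blend the value vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) relationship scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 4 "words" with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                 # stand-in for embedded words
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))    # random stand-ins for learned weights
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.round(2))   # who attends to whom
```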
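Finally, the decoder's masked attention: the mask adds negative infinity to the scores for future positions, so after softmax the words we haven't generated yet get exactly zero attention. The last step (not shown) projects each decoder output onto a score for every word in the vocabulary, including the end token, and softmaxes that into next-word probabilities. The mask shape and toy scores below are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """0 on and below the diagonal, -inf above it.

    Added to the attention scores, the -inf entries become exactly 0
    after softmax: no attention to words that haven't been generated yet.
    """
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

# Pretend attention scores for a 4-word output generated so far.
scores = np.random.default_rng(1).normal(size=(4, 4))
masked = scores + causal_mask(4)

# Softmax each row: row i now spreads its attention only over words 0..i.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # entries above the diagonal are all 0
```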