Intro: Concept of attention

Selective attention
- What do you actually pay attention to?
- Stuff I notice about people: stuff they're carrying, cars they drive, glasses or not, piercings and tattoos
- Stuff I deliberately notice: hair color and length, approximate height
- Stuff my daughter notices: whether outfits match, hairstyle, fashion choices, other stuff like that
- The same thing applies to reading

Speedreading and skimming: sparse attention
- It can work
- In a sense, we don't even need most of the words
- The BFG

Order of language matters:
- A recurrent neural network processes the words sequentially
- That resists parallelization on a GPU
- A transformer "transforms" its input and can be parallelized
- This enabled language analysis at a much larger scale

Attention
- "Attention Is All You Need"
- Word order varies from one language to the next (English vs. Spanish)

Self-attention
- This piece was new in 2017
- Relationships like the equivalence of words can be learned
- Attention for a word is focused on the other words in the sentence
- Used to disambiguate meaning
- Think about using this for translation: if you know what a word means, you can generate synonyms
- Can learn which words are used together

Positional encoding
- Each word has a position in the input sentence, so meaning can vary based on position
- Just number them? Not that easy
- Overlapping frequencies of sine and cosine
- Can determine both the exact position and "close to here"
- A series of spacings can also be represented, as in "it tends to be in odd-numbered positions" or "it's either early or late in the sentence"
- This was also new to transformers
- Allows parallelism in training; done on GPUs in the original paper (8 P100s)
- See the positional-encoding sketch at the end of these notes

Multi-headed attention:
- Calculates how each word relates to every other word
- query, key, value
- dot(query, key) determines the extent of the relationship
- The query, key, and value projections come from weight matrices learned during training
- softmax normalizes the scores and pushes them toward the extremes; it can act as an activation threshold
- 8 heads in the original paper; each can learn something different
- Multiply by a very large weight matrix
- Fed through a feed-forward network after that
- Encoder layers can be stacked (the original paper stacks 6)
- See the attention sketch at the end of these notes

Decoding process:
- Central question: what word goes next?
- A "masked" multi-head attention process
- 0 attention score for future words in the output: we haven't generated them yet, so we shouldn't pay attention to them
- Kinda inhuman...
- Another way to put it: for each word in the vocabulary, what is the probability that it's the next word?
- The end token counts as an entry in that vector of all words
- See the masking sketch at the end of these notes

Notes:
- Dimensionality of the embedding space: "2048 or higher" for GPT

That's about it for our technical overview of AI.
Questions? What gray areas did I leave?
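Appendix: code sketches

A few of the mechanisms above are easier to see in code. The sketches below are minimal NumPy illustrations, not the paper's implementation; the function names, toy sizes, and random data are mine.

First, sinusoidal positional encoding: overlapping sine and cosine frequencies, with even dimensions using sine and odd dimensions using cosine, so each position gets a unique vector and nearby positions look similar.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) matrix of positional encodings.

    Even dimensions use sine and odd dimensions use cosine, each at a
    different frequency, so nearby positions get similar vectors and a
    fixed spacing between positions is easy to represent.
    """
    positions = np.arange(max_len)[:, np.newaxis]    # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
    # Frequencies fall off geometrically as the dimension index grows.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

# Encodings for a 10-word sentence in a 16-dimensional embedding space.
print(sinusoidal_positions(10, 16).shape)   # (10, 16)
```

These vectors are simply added to the word embeddings, which is what lets the whole sentence be processed in parallel without losing word order.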
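Next, self-attention itself: dot(query, key) scores, softmax normalization, and a weighted sum of the values. The multi-head version runs several of these in parallel on smaller projections and concatenates the results; the toy example below uses a single head with random stand-in weight matrices (in a real model Wq, Wk, and Wv are learned during training).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of query, key, and value vectors.

    dot(query, key) scores how strongly each word relates to every other
    word; softmax normalizes each row of scores (and pushes them toward
    the extremes); the normalized weights then blend the value vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) relationship scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 4 "words" with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                 # stand-in for embedded words
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))    # random stand-ins for learned weights
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.round(2))   # who attends to whom
```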
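Finally, the decoder's masked attention: the mask adds negative infinity to the scores for future positions, so after softmax the words we haven't generated yet get exactly zero attention. The last step (not shown) projects each decoder output onto a score for every word in the vocabulary, including the end token, and softmaxes that into next-word probabilities. The mask shape and toy scores below are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """0 on and below the diagonal, -inf above it.

    Added to the attention scores, the -inf entries become exactly 0
    after softmax: no attention to words that haven't been generated yet.
    """
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

# Pretend attention scores for a 4-word output generated so far.
scores = np.random.default_rng(1).normal(size=(4, 4))
masked = scores + causal_mask(4)

# Softmax each row: row i now spreads its attention only over words 0..i.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # entries above the diagonal are all 0
```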