Grading note: I'm caught up. In general, Homework 5 was the most fun assignment I've ever graded.

Text processing, parsing, and linguistics: I took a class on that once...

Tokenization!
Consider a pretty cut-and-dried problem: parsing a C-family programming language
1. Tokenization of input
2. Forming a parse tree
3. If we were actually compiling, optimization
4. If we were actually compiling, generating machine code

Parsing English is a lot harder
The C-family language was designed for easy parsing! English was not:
- The same set of letters can have more than one meaning
- The same meaning can be expressed by multiple words
- Speakers may invent new meanings for words or use local idioms
- The sentiment of the speaker, such as sarcasm, may change the meaning
If you really want to learn to do this: https://explosion.ai/blog/parsing-english-in-python

When we read, we don't sound out words
- We don't even read the words
- All we do is pick from the sensible choices
- The same thing happens when we listen
There's a lot of knowledge that goes into understanding English
- That's part of what makes models like ChatGPT better
Imagine one could make a "sensibility" score
- Choose the most sensible interpretation
- Even if it's not the most technically correct one

Prolog and first-order logic: Prolog demo
So, if we could extract facts, we could use first-order logic!
- An early idea
- Not really flexible enough
- We need to know how people really speak
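Step 1 above, tokenization, can be sketched for a tiny C-family fragment. This is a minimal sketch: the token categories and regular expressions are illustrative assumptions, not any real compiler's lexer.

```python
import re

# Illustrative token categories for a tiny C-family fragment
# (an assumption for this sketch, not a real language's grammar).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=<>!]=?"),
    ("PUNCT",  r"[(){};,]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (kind, text) pairs, skipping whitespace."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("x = y + 42;")))
# → [('IDENT', 'x'), ('OP', '='), ('IDENT', 'y'), ('OP', '+'),
#    ('NUMBER', '42'), ('PUNCT', ';')]
```

Because the categories are fixed and unambiguous, a single pass with one combined regular expression is enough — exactly the "cut-and-dried" property that English lacks.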
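The "sensibility score" idea — pick from the sensible choices — can be imitated in miniature: score each reading of an ambiguous word by how much its typical context overlaps the sentence. The senses and context words below are made-up illustrations, not real lexical data.

```python
# Toy word-sense data: hypothetical senses and context words for "bank".
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "deposit", "account"},
        "river edge":            {"river", "water", "fishing", "shore"},
    },
}

def most_sensible(word, sentence_words):
    """Return the sense whose context set overlaps the sentence most."""
    scores = {
        sense: len(context & sentence_words)
        for sense, context in SENSES[word].items()
    }
    return max(scores, key=scores.get)

print(most_sensible("bank", {"i", "deposited", "money", "at", "the", "bank"}))
# → financial institution
```

The real problem, of course, is where the context sets come from — that background knowledge is what models like ChatGPT have absorbed at scale.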