Artificial Intelligence Introduction:
- Artificial Intelligence includes Machine Learning
- But machine learning isn't all there is to AI

"Solved" games: think about tic-tac-toe
- 9 possible first moves, but only 3 unique (center, corner, or side)
- Each of these leads to 8 responses, but not all are unique:
  - Center: 2 (side or corner)
  - Corner: 5
  - Side: 5
- So there are 12 possible openings, each of which continues
- A game is no more than 9 moves, and the last move is forced
- We're 2 moves in, and the forced move doesn't count, leaving 6 choice points
- If the average number of options is 4: 4^6 = 4096
- But a lot of those lead to an immediate loss!
- So how many possible games are there? No more than 12 * 4096 = 49,152
- From any point, a computer could calculate all possible outcomes: a minimax tree (sketched in code below)

Chess:
- I can't find the quote I remember about this
- At any rate, chess computers don't have to be intelligent: minimax tree search
- No need for elaborate pruning with tic-tac-toe, but pruning is very important with chess!
  - More pruning = longer lookahead
  - Too much pruning = missed important moves
- How many moves can you make at each turn? This branching factor varies a lot: checkers vs. chess vs. Go vs. Mancala
- Main point here: AI is a bag of tricks, and sometimes they work really well!

Intro to machine learning:
- Machine learning is increasingly applied to chess
- An important principle in AI: simpler is better if two approaches perform equally well
  - Put another way: complicated is worse unless it performs better
- Perform? That really depends on the task
  - Chess or tic-tac-toe: does it defeat other players? (There are computer chess tournaments)
  - Classification: accuracy? Maybe! What about detecting rare conditions?
  - "How does this perform?" can be a pretty complicated question
- Machine learning is usually on the "more complicated" end

An example task that's really common: supervised learning for classification
- Problem setup:
  - We have a bunch of examples of two classes, X and Y
  - We know some that are X for sure and some that are Y for sure
  - We also have, or will have, a bunch where we're not sure which they are
  - Assumption: the future or unknown X and Y examples are similar to the old ones
- We can create a training set using the data we already have classified
  - A set where we know the answer already
- We can create a test set the same way
  - But we won't use it to train the model!
  - It must also consist of examples where we know the right classification
  - We expect test-set performance to carry over to the unseen examples
- Some examples of what this could be
  - One I've worked with: brain scans, some from patients with neurological conditions
- Balance and training:
  - Do we have more of one type of example than the other?
  - Is one type just plain more likely?
  - Accuracy as % correct is only meaningful for balanced datasets!

Representation of examples:
- Usually a feature vector: a series of features, consistent from one example to the next
- A lot of choices go into how to make these!
- (Will we have Weka available?)

First algorithm: C4.5 (J48) decision trees
- J48 is a widely-used Java implementation of C4.5
- Tree nodes split on a feature, trying to purify the resulting groups
- Entropy: the disorder at any point in the tree
  - Definition: https://en.wikipedia.org/wiki/Entropy_(information_theory)
- Let's make a quick Python function to calculate entropy
  - Easy enough! (I hope...) math.log2 does the heavy lifting
- Information gain: https://en.wikipedia.org/wiki/Information_gain_(decision_tree)
  - The decrease in entropy after a split, with each child's entropy weighted by its share of the examples
- OK, now we can calculate these:
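A minimal sketch (the function names and the list-of-labels representation are my choices, not from any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy decrease from splitting `parent` into `children`,
    weighting each child's entropy by its share of the examples."""
    total = len(parent)
    remainder = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder
```

Checking it against the classic Weka weather data (14 days, "play?" is 9 yes / 5 no; splitting on outlook gives sunny = 2 yes / 3 no, overcast = 4 yes, rainy = 3 yes / 2 no):

```python
play = ["yes"] * 9 + ["no"] * 5
print(round(entropy(play), 3))  # 0.94 bits

by_outlook = [["yes"] * 2 + ["no"] * 3,   # sunny
              ["yes"] * 4,                # overcast
              ["yes"] * 3 + ["no"] * 2]   # rainy
print(round(information_gain(play, by_outlook), 3))  # 0.247 bits
```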
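And, as promised back in the games section, a minimal exhaustive minimax for tic-tac-toe. This is a sketch under my own representation choices (a 9-cell tuple, scores from X's point of view), not anyone's reference implementation:

```python
# Board: a tuple of 9 cells, each "X", "O", or None.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return "X" or "O" if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, to_move):
    """Exhaustive game-tree search: +1 if X can force a win from here,
    -1 if O can, 0 if best play on both sides is a draw."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    open_cells = [i for i, cell in enumerate(board) if cell is None]
    if not open_cells:
        return 0  # board full with no winner: draw
    nxt = "O" if to_move == "X" else "X"
    scores = [minimax(board[:i] + (to_move,) + board[i + 1:], nxt)
              for i in open_cells]
    return max(scores) if to_move == "X" else min(scores)

# A few hundred thousand positions, so this takes a moment:
print(minimax((None,) * 9, "X"))  # 0 -- perfect play is a draw
```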
Back to decision trees: let's put together a tree for a classic Weka example (the weather data used above)
- Note: the classifications don't have to be binary
- Can, for example, try ranges for numerical values

What can a decision tree represent?
- And: if feature1 is A and feature2 is B
- Or: if feature1 is A or feature2 is B
- Xor: if feature1 is A xor feature2 is B
- Not: if not feature1
- You can reduce a decision tree to a logical expression: a disjunction of conjunctions, one conjunction per path to a leaf (see the sketch at the end of these notes)
  - It could possibly be simplified after that
  - You could even implement it directly as a logic circuit in silicon
  - So a C4.5 hypothesis can be applied very quickly and efficiently!

Another decision tree pro: comprehensibility
- It's possible to see why a decision was made
- And to figure out which features were important

Yet another pro: large numbers of useless features don't hurt
- If they're useless, they have little information gain, so they'll be ignored
- Good if you don't even know which features are important!
- Can use this as feature selection for other algorithms

A disadvantage: lots of significant features are a problem
- Imagine a feature vector built from a 4K image: the relations between neighboring pixels would be very important!
- Some other algorithms do a lot better here

Another disadvantage: doesn't really do degree
- Outputs are specific classes rather than floating-point values
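Circling back to the "reduce a tree to logic" point, here's the idea made concrete. This is the tree J48 famously learns on the weather data, hand-transcribed (treat the exact tree as an assumption; rebuild it in Weka to check), collapsed into a single boolean expression with one conjunction per "yes" leaf:

```python
def play(outlook, humidity, windy):
    """The weather tree as a disjunction of conjunctions:
    each clause is the path to one "yes" leaf."""
    return (outlook == "overcast"
            or (outlook == "sunny" and humidity == "normal")
            or (outlook == "rainy" and not windy))

print(play("sunny", "high", False))    # False: sunny and humid, stay home
print(play("overcast", "high", True))  # True: overcast always plays
```

No arithmetic, just a handful of comparisons: that's why a C4.5 hypothesis is so cheap to apply, and why you could wire it straight into logic gates.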