Heap project, again: in-place sorting

One more major data structures idea: hash tables!

Consider the O(log n) tree retrieval:
- It scales well, but O(1) would be better yet
- Lots of pointers are used per data item, which can reduce efficiency
- Indexing into an array is faster!
  - But the "key" has to be the array index
  - The key is the value we search by
- What if our keys are spread over a wide range of values?
- What if they're a different data type, like strings?
- Example with a Python dictionary (a dict is a hash table under the hood)

How to convert keys to an array index:
- Simple idea for strings: add up the letter values
- Then take that sum modulo the length of the array
  - Every key has to land somewhere inside the table
- (A sketch of this hash function appears after these notes)

Collision management: matching the correct key, not just the correct hash value:
- Hash functions will collide
- Key values are unique
- So if the value ends up stored in a different slot, we can still find it by comparing keys

Linear probing:
- We do need a "null" value of some sort to mark empty slots
  - Easy if we're storing memory references; could be complicated otherwise
- Just keep checking successive slots until the right key or a null is found (see the probing sketch below)
- Problems: clustering, and nearly-full tables get slow

Other probing methods:
- We can try quadratic probing, etc.: just jump through the table in some other pattern
- Double hashing: a second hash function picks the step size
- The real problem is that the table is getting full

Re-hashing:
- Allocate a new, larger table and re-insert all items (a rehash sketch also appears below)
- O(n) and kind of slow, but sometimes you have to
- How full is too full? Open question, but somewhere around 50% is a common threshold

Separate chaining:
- Each table entry is the head of a linked list (sketched in code below)
- If all items collide, operations take as long as a linked list
  - But hopefully they won't!
- Re-hash if the lists are getting too long
- How about making each entry a link to a tree? Or another hash table?
  - Then we've got a hash table tree! Sounds very expandable

The STL historically lacked a hash table:
- Except for the Silicon Graphics (SGI) extension, hash_map
- std::unordered_map is now available in the standard library (a short map vs. unordered_map example appears below)
  - hash_map was its name back when it was only a nonstandard extension
- The "regular" std::map is usually a tree of some sort
  - The standard doesn't actually require this

Hash table vs. balanced binary search tree:
- Both waste some space
  - For the BST, this depends on the size of the items stored
  - For the hash table, it depends on how full the table is
- Both have fast retrieval: O(1) for the hash table, O(log n) for the BST
  - The BST's cost is mainly memory accesses (following pointers)
  - The hash table's cost is mainly computing the offset via the hash function
- The BST has a sorted traversal available in O(n) time
  - A hash table isn't sorted, and you can't sort it in place
  - A hash table only has an unsorted O(n) traversal
- The BST has consistent O(log n) insert and delete
  - The hash table is O(1), or O(n) when a rehash is needed
  - Chaining can avoid the rehash, but the worst case for a hash table is still usually O(n)
  - Even so, the hash table is usually O(1) in practice
- Both can be used to solve the same problems most of the time

Difference from a cryptographic hash:
- For a cryptographic hash, the hash value doesn't reveal anything about the original input
  - Ideally, anyway
- Collisions are a security problem!
- A cryptographic hash function usually takes much longer to run
- SHA-256 example (a short one appears at the end of these notes)
- Using a "regular" hash for cryptographic purposes:
  - Probably a bad idea; it won't have proper collision/preimage/etc. resistance
- Using a cryptographic hash for a hash table:
  - It will distribute keys very evenly
  - You'll still have to take a modulus to put the result into an acceptable range
  - It will be very slow!
- Bitcoin miners are computing SHA-256 hashes
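
To make the "add up the letter values, then take it modulo the table length" idea concrete, here is a minimal C++ sketch. The names simpleHash and TABLE_SIZE are made up for illustration, and the table size of 101 is arbitrary since the notes don't specify one.

    #include <cstddef>
    #include <string>

    const std::size_t TABLE_SIZE = 101;   // arbitrary table length for the example

    std::size_t simpleHash(const std::string& key) {
        std::size_t sum = 0;
        for (char c : key) {
            sum += static_cast<unsigned char>(c);   // add up the letter values
        }
        return sum % TABLE_SIZE;                    // modulus so the key lands in the table
    }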
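
One possible shape for the linear-probing lookup described in the notes, reusing simpleHash and TABLE_SIZE from the sketch above. Storing pointers makes nullptr a natural "null" value for empty slots; the Entry type and the find name are assumptions for the example.

    #include <vector>

    struct Entry { std::string key; int value; };

    // Open-addressing table: nullptr marks an empty slot.
    std::vector<Entry*> table(TABLE_SIZE, nullptr);

    // Keep checking successive slots until the right key or a null is found.
    Entry* find(const std::string& key) {
        std::size_t i = simpleHash(key);
        for (std::size_t probes = 0; probes < table.size(); ++probes) {
            if (table[i] == nullptr) return nullptr;    // hit a null: key isn't present
            if (table[i]->key == key) return table[i];  // correct key, not just a matching hash
            i = (i + 1) % table.size();                 // step to the next slot, wrapping around
        }
        return nullptr;                                 // scanned the whole table
    }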
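
Here is one way to sketch separate chaining, where each table entry is the head of a linked list of key/value pairs. The ChainedTable class and its method names are hypothetical, and it hard-codes int values just to keep the example short.

    #include <cstddef>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    class ChainedTable {
    public:
        explicit ChainedTable(std::size_t size) : buckets_(size) {}

        void insert(const std::string& key, int value) {
            buckets_[index(key)].push_back({key, value});   // collisions just grow the list
        }

        // Walk only the one list the key hashes to; other buckets are never touched.
        const int* find(const std::string& key) const {
            for (const auto& kv : buckets_[index(key)]) {
                if (kv.first == key) return &kv.second;
            }
            return nullptr;   // not in this chain, so not in the table
        }

    private:
        std::size_t index(const std::string& key) const {
            std::size_t sum = 0;
            for (char c : key) sum += static_cast<unsigned char>(c);  // letter-value hash
            return sum % buckets_.size();                             // land in the table
        }

        std::vector<std::list<std::pair<std::string, int>>> buckets_;
    };

Using std::list keeps the "linked list per bucket" idea literal; a std::vector per bucket would work just as well in practice.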
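
A rehash in this style just allocates a bigger bucket vector and re-inserts every stored pair, which is the O(n) step the notes mention. The letterSum and rehash names are illustrative, and the bucket layout matches the chaining sketch above.

    #include <cstddef>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    using Bucket = std::list<std::pair<std::string, int>>;

    // Letter-sum hash without the modulus, so it can be reduced by any table size.
    std::size_t letterSum(const std::string& key) {
        std::size_t sum = 0;
        for (char c : key) sum += static_cast<unsigned char>(c);
        return sum;
    }

    // Allocate a new, larger table and re-insert all items.
    std::vector<Bucket> rehash(const std::vector<Bucket>& old, std::size_t newSize) {
        std::vector<Bucket> bigger(newSize);
        for (const Bucket& chain : old) {
            for (const auto& kv : chain) {
                bigger[letterSum(kv.first) % newSize].push_back(kv);  // new slot in the new table
            }
        }
        return bigger;
    }

A typical call would be something like buckets = rehash(buckets, buckets.size() * 2 + 1) once the chains (or the load factor) grow too large.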
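
The STL names mentioned above can be seen side by side: std::unordered_map (the standardized hash table, descendant of the old hash_map extension) versus std::map (typically implemented as a balanced tree). The keys and values here are arbitrary; the point is the iteration order.

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> hashed;   // hash table
        std::map<std::string, int> ordered;            // usually a red-black tree

        for (const auto& name : {"carol", "alice", "bob"}) {
            hashed[name] = 1;
            ordered[name] = 1;
        }

        // The tree-backed map iterates in sorted key order: alice bob carol
        for (const auto& kv : ordered) std::cout << kv.first << ' ';
        std::cout << '\n';

        // The hash table iterates in an unspecified, unsorted order.
        for (const auto& kv : hashed) std::cout << kv.first << ' ';
        std::cout << '\n';
    }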
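
For the SHA-256 example, one option is OpenSSL's one-shot SHA256() helper from <openssl/sha.h> (link with -lcrypto; newer OpenSSL releases prefer the EVP interface, but this is enough for a demonstration). The message string is arbitrary.

    #include <cstdio>
    #include <cstring>
    #include <openssl/sha.h>   // OpenSSL; build with -lcrypto

    int main() {
        const char* msg = "hello world";
        unsigned char digest[SHA256_DIGEST_LENGTH];   // 32 bytes = 256 bits
        SHA256(reinterpret_cast<const unsigned char*>(msg), std::strlen(msg), digest);

        // Print the digest as 64 hex characters.
        for (unsigned char byte : digest) std::printf("%02x", byte);
        std::printf("\n");

        // To use this as a hash-table index you'd still reduce it with a modulus,
        // and it would be far slower than the letter-sum hash above.
    }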