CS211 Project 4

Due Wednesday, December 13

It is common practice in many systems using password authentication not to store actual passwords. Despite this, users can log in using a password. The usual way to accomplish this is to store not the password itself, but a hash value derived from the password. Suppose a user has the password "hamster". Instead of storing "hamster", the system will store a sha256 hash of "hamster". You can calculate these on unix-like systems with the tool sha256sum, like this:

seth@nimrod:~/cs211/sha-2 $ sha256sum
hamster
12e1d34b00a928b8cebb6a5f739f656b339c0e9d2f4507148a5b1afdadd634d0  -

You do have to be careful using this tool, because usage like this can lead to inaccurate results:

seth@nimrod:~/cs211/sha-2 $ echo hamster | sha256sum
78b696a042c8c52aec9c867262e30aaa0782f817c48c13d62202a9fbef32960a  -

It's different because echo put something extra on the end, probably a newline. Press ctrl+d twice after you've typed the word you want a hash value for, and sha256sum will terminate and give you the right value. hamster\n yields a completely different hash value than hamster. So would hamster with a space after it, or hamster with a carriage return, or any other permutation. The hash values will also look very different with only minor changes to the original word. Consider these hash values:

gamster: 56e3574e281f6fd3f10aae80198cf75a395a13a85300d7b1713e56d6a2675a88
hamster: 12e1d34b00a928b8cebb6a5f739f656b339c0e9d2f4507148a5b1afdadd634d0
iamster: e8da8ebd2f3628f80175bcf51b6c9f83bcb5fb988c17f0d3d394ac3679740cf9

One implication of this is that if a different word hashed to 12e1d34b00a928b8cebb6a5f739f656b339c0e9d2f4507148a5b1afdadd634d0, then it is a correct password just like hamster! Thus it is possible that more than one password will work. However, this is unlikely due to the large size of the hash value. Since the number of possible hash values is much larger than the number of passwords, it is likely that most hash values are not generated by any password. Finding two passwords that hash to the same value would be an example of a hash collision. Some historical hash functions such as MD5 has suffered from predictable collisions, but sha256 is still highly regarded in this respect, and underlies many important security systems including bitcoin.
Suppose you have a hash value, and wish to find the associated password. sha256 is intended be impossible to reverse. Take a few minutes to read the wikipedia entry on SHA-2 if you have time (sha256 is part of the sha-2 family). However, since many passwords are dictionary words, we could perform a dictionary attack. For project 4, create a program that will perform a multithreaded dictionary attack on a sha256 hash.

Calculating a sha256 hash value

Sha256 is well-defined, and you could certainly start from scratch and write a program for it. But, there are good open implementations around. For this project, I recommend starting with Alain Mosnier's code, available on github here: https://github.com/amosnier/sha-2. On a unix-type system, you can check it out like this:

seth@nimrod:/tmp $ git clone https://github.com/amosnier/sha-2
Cloning into 'sha-2'...
remote: Enumerating objects: 206, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 206 (delta 82), reused 74 (delta 74), pack-reused 117
Receiving objects: 100% (206/206), 57.54 KiB | 577.00 KiB/s, done.
Resolving deltas: 100% (120/120), done.

For anyone working on isoptera, that's an easy way to go. Other operating systems might vary a little. If working with online GDB, I'd recommend downloading sha-256.c and sha-256.h and uploading them both as part of a new project. The repository is written in C, and will work fine in C++, with one exception. C++ requires an explicit cast when copying a pointer of type void to a different type, and C does not. So mixing with C++ and building with g++ will lead to an error:

seth@nimrod:/tmp/sha-2 $ g++ 211test.cpp  sha-256.c
sha-256.c: In function 'void sha_256_write(Sha_256*, const void*, size_t)':
sha-256.c:139:28: error: invalid conversion from 'const void*' to 'const uint8_t*' {aka 'const unsigned char*'} [-fpermissive]
  139 |         const uint8_t *p = data;
      |                            ^~~~
      |                            |
      |                            const void*

To fix this, open up sha-256.c to line 139, and make this revision:

const uint8_t *p = (const uint8_t*)data;

After that, it should build. For a C++ test program, download login_system.cpp from the class examples area, and build it together with sha-256.c. Here's an example:

seth@nimrod:/tmp/sha-2 $ g++ login_system.cpp sha-256.c
seth@nimrod:/tmp/sha-2 $ ./a.out
Enter the password:  hamster
Sorry, you don't seem to know the password!

If you get the password right, it'll let you in.

Creating the Multithreaded Dictionary Attack

Since you have the hash value already in login_system.cpp, you can perform a dictionary attack on it. Dictionary attacks fall into the category of programming problems known as "embarrassingly parallel": https://en.wikipedia.org/wiki/Embarrassingly_parallel To load the dictionary, look at the example word_list.cpp from October 12. Alternatively, word_list_vector.cpp, from the same day, if you'd prefer an STL vector to an unmanaged heap array. This should allow you to form a list of dictionary words. Then put together an algorithm like this:

for each word in the dictionary:
	Check to see if that word hashes to the hash value in login_system.cpp
	if it does, print out the word

Adapt the timing code from project 1 to see how long your attack takes. It probably won't be all that long, but remember, we're only trying the spell check dictionary here. It would be better to try different variations of capitalization, etc. The more we add, the longer the attack will take. Also consider what would happen if we had a file with 1,000,000 hashes, and wanted to see if any of them hash to a dictionary word. This could be useful, for example, to make sure none of the users on the system had an easily-guessed password.
Set up a function to test a range of values. It should take a starting index, ending index, hash value to look for, and possibly the word list (unless yours is stored globally, in which case it doesn't need to take it as a parameter). Then calculate ranges for each thread. For example, if you have 8 CPU cores, you should use 8 threads, each of which will try an eighth of the words in the dictionary. So the first would check 0 through wordcount/8, the second wordcount/8 to (2 * wordcount/8), etc. Set up a loop to start these threads. I'd recommend pushing them onto an STL vector by reference:

vector threads;
for(int i = 0; i < threadcount; i++){
	threads.push_back( ... );
}
// Afterward, to wait for them to finish
for(auto t : threads)
	t->join();

Using a reference will avoid making copies of the threads, which is probably a bad idea. I'd also recommend some kind of exit mechanism for if a word is found:

// At global scope
bool done = false;

// Loop in each thread:
for(size_t i = startpoint; i < endpoint && !done; i++)

// When you find a word that matches the hash:
done = true

Time how long it takes with one thread per core. It should show an improvement, although there is some overhead for starting extra threads.

A Note on Bitcoin Miners

Bitcoin mining is based on extensive sha256 calculation. For a valid block to be added to the blockchain, it must hash to a value within a narrow range. So bitcoin miners adjust a value inside the potential next block until the hash value is correct. They simply change the value, calculate a hash, change the value, calculate a hash, etc. Project 4 solutions are, broadly speaking, not that different from bitcoin miners. In the early days of mining, a multithreaded program like this could make some money. Of course, some electricity is used as well. This electricity cost is more than the profit from bitcoin mining on a CPU these days, and has been for quite a while. After CPU mining, graphics processors were used, but these eventually were forced to yield to application-specific integrated circuit mining hardware (ASIC). Even older generation ASIC miners aren't profitable to run anymore either, unless your electricity is very cheap. Newer mining hardware can still pull in a profit. When the gas was out in Lewiston recently, I did fire up my old ASIC miner. The electricity would have been used for the space heater otherwise!