def hash(num: int) -> int:
return num % 1000
Cryptographic Hash
What is a hash function?
A hash function is a function that takes an input of any length and returns a fixed size output.
h: \{0,1\}^* \rightarrow \{0,1\}^n
e.g.
Example of a hash parameter and output, with n=8.
Input | Output |
---|---|
01010101110110101 | 10101010 |
01 | 00101010 |
1011101101010101010101010101010101010101010101 | 10101010 |
So, the input can be of any length, but the output is always the same length.
Non cryptographic hash
Let’s create a simple hash function
print(hash(123983))
print(hash(123))
print(hash(293939393939393122))
983
123
122
Regardless how long the input is, the output is always a number [0, 1000), which fulfills the definition of a hash function
h: \{0,1\}^* \rightarrow \{0,1\}^n
Collision
A collision happen when two different inputs produce the same output.
print(hash(123983))
print(hash(983))
983
983
Good hash function should have a low probability of collision. But - by definition - it is impossible to have a hash function that never collide (unless the output is the same size as the input)
One Way
A hash function is one way if it is easy to compute the output from the input, but it is hard to compute the input from the output.
The following is not a one way hash function:
def hash_function(input):
return upper(input)
Well, it’s pretty obvious that the input is lower(output)
, so it’s not one way.
Another non one way hash function is: encryption algorithm. If you know the key, you can easily decrypt the output.
Hash Applications
Password Storage
username | password |
---|---|
alice | 123456 |
bob | rahasia |
charlie | Password! |
Oh, no! Don’t store password in plain text, do you? Why not?
- The database can be stolen, the whole world will know the passwords
- We don’t want the system administrator to know the password, do we?
So, how to store it securely while still being able to authenticate the user?
Use hash!
username | hash(password) |
---|---|
alice | 8237492392332892302183243 |
bob | 2837498234792109348923582 |
charlie | 490447284045892492384232 |
Upon login, the system will hash the password and compare it with the stored hash.
def check_password(username, password) -> bool:
= get_stored_hash_from_db(username)
stored_hash return stored_hash == hash(password)
It’s not secure yet, we will discuss later why.
Q: Why not just encrypt the password?
A: Well, we can, but then we need to decrypt it to compare it with the input. So, the system administrator can still know the password.
Checksum
We can also use hash to check the integrity of a file, i.e. to check whether the file is corrupted/altered or not.
Let’s say the file contains the following text:
Hello! My name is Alice. I am a student studying computer science. I love cryptography!
We can calculate the hash of the file and store it in a separate file, e.g. checksum.txt
23885728283
If someone alter the file, e.g. by adding a period at the end of the file, the hash will be different.
When we want to check the integrity of the file, we can calculate the hash of the file and compare it with the stored hash.
def check_integrity(filename) -> bool:
= get_stored_hash_from_file(filename)
stored_hash return stored_hash == hash(filename)
Many websites provide the hash of the file so that you can check the integrity of the file you download, e.g. OpenOffice
But is it secure? What if someone alter the file and also alter the stored hash? Hash is not secure if the attacker can alter both the file and the stored hash.
We could use a digital signature to tackle this problem, but it’s a topic for another day (or another course).
Attacks on Hash Functions
Preimage Attack
Given a hash h and a hash function H, it should be difficult to find a message m such that H(m) = h.
Let’s say we have the following simple hash function
def hash(s: str) -> int:
return sum([ord(c) for c in s]) % 1000
= "Password"
secret = hash(secret) hashed
print(hashed)
851
Given that we know that 851, can we find m?
Yes*, we can try all possible values of m until we find the one that produces 983.
= ["secret", "1234", "password", "Password"]
common_passwords
for guess in common_passwords:
if hash(guess) == hashed:
print("Password cracked! It was:", guess)
break
Password cracked! It was: Password
But, what if we don’t the password is not in the common password list?
We can find another message m' that produces the same hash h (i.e. find collision).
Look at the previous hash function
def hash(s: str) -> int:
return sum([ord(c) for c in s]) % 1000
Can we find a string s
that produces the same hash as 851
?
# draw ascii table of 'a' to 'z'
for i in range(26):
print(chr(ord('a') + i) + " -> " + str(ord('a') + i), end=" | ")
a -> 97 | b -> 98 | c -> 99 | d -> 100 | e -> 101 | f -> 102 | g -> 103 | h -> 104 | i -> 105 | j -> 106 | k -> 107 | l -> 108 | m -> 109 | n -> 110 | o -> 111 | p -> 112 | q -> 113 | r -> 114 | s -> 115 | t -> 116 | u -> 117 | v -> 118 | w -> 119 | x -> 120 | y -> 121 | z -> 122 |
Let’s analyze, what characters should be chosen so that the sum of their ASCII values is equal to 851
?
hash("zzzzzzw")
851
Bingo! We found a collision!
Since collision will happens, to prevent pre-image attack, we need to make sure that it’s computationally infeasible to mount a pre-image attack.
For a n
bit hash, the complexity of pre-image attack is O(2^n) (brute force). Modern hash function has a hash size of 256 bits, which means that the complexity of pre-image attack is O(2^{256}).
Rainbow Table
O(2^n) seems to be a big number, but it’s not impossible to mount a pre-image attack.
Let’s use the following hash function
import hashlib
def h(s: str) -> int:
# Encode the string into bytes
= s.encode('utf-8')
text_bytes = hashlib.md5(text_bytes)
hash_object
# Get the hexadecimal representation of the MD5 hash
= hash_object.hexdigest()
md5_hash
return md5_hash
And we have the following users table:
= []
users
"Alice", h("password123")))
users.append(("Bob", h("123456")))
users.append(("Charlie", h("qwerty")))
users.append(("Diana", h("secret")))
users.append(("Eve", h("aman")))
users.append(("Frank", h("iloveyou")))
users.append((
users
[('Alice', '482c811da5d5b4bc6d497ffa98491e38'),
('Bob', 'e10adc3949ba59abbe56e057f20f883e'),
('Charlie', 'd8578edf8458ce06fbc5bb76a58c5ca4'),
('Diana', '5ebe2294ecd0e0f08eab7690d2a6ee69'),
('Eve', 'ccda1683d8c97f8f2dff2ea7d649b42c'),
('Frank', 'f25a2fc72690b780b2a14e140ef6a9e0')]
Can we crack the password of Eve
?
Using mere brute force, we need to try all possible passwords. Let’s try
def crack(hashed_password) -> str:
# it surely can be made recursive, but for simplicity we will use 4 nested loops
for c1 in range(26):
for c2 in range(26):
for c3 in range(26):
for c4 in range(26):
= chr(ord('a') + c1) + chr(ord('a') + c2) + chr(ord('a') + c3) + chr(ord('a') + c4)
guess = h(guess)
hashed_guess # print(guess, hashed_guess)
if hashed_password == hashed_guess:
return guess
return None
= users[4][1]
hashed_password crack(hashed_password)
'aman'
Wow! We found the password of Eve
!
The complexity is $O(26^4)
, which is not that big. That’s why we are encouraged to use a long password and use alphanumeric characters.
= [4, 8, 10]
number_of_characters = [26, 52, 62]
character_sets_size
for i in range(len(number_of_characters)):
print("Number of characters:", number_of_characters[i])
print("Character set size:", character_sets_size[i])
print("Number of possible passwords:", character_sets_size[i] ** number_of_characters[i])
print("")
Number of characters: 4
Character set size: 26
Number of possible passwords: 456976
Number of characters: 8
Character set size: 52
Number of possible passwords: 53459728531456
Number of characters: 10
Character set size: 62
Number of possible passwords: 839299365868340224
Ok, bruteforce attack seems infeasible, but what if we just try common/leaked passwords? People reuse password afterall.
= [
known_password "password", "secure", "secret", "qwerty", "123456", "12345678", "12345", "iloveyou", "princess", "1234567",
]
We can pre-calculate the hash of common passwords and store it in a table, this is called rainbow table.
for password in known_password:
print(password, h(password))
password 5f4dcc3b5aa765d61d8327deb882cf99
secure 1c0b76fce779f78f51be339c49445c49
secret 5ebe2294ecd0e0f08eab7690d2a6ee69
qwerty d8578edf8458ce06fbc5bb76a58c5ca4
123456 e10adc3949ba59abbe56e057f20f883e
12345678 25d55ad283aa400af464c76d713c07ad
12345 827ccb0eea8a706c4c34a16891f84e7b
iloveyou f25a2fc72690b780b2a14e140ef6a9e0
princess 8afa847f50a716e64932d995c8e7435a
1234567 fcea920f7412b5da7be0cf42b8c93759
Given the following users table, can you crack the password of Diana
?
users
[('Alice', '482c811da5d5b4bc6d497ffa98491e38'),
('Bob', 'e10adc3949ba59abbe56e057f20f883e'),
('Charlie', 'd8578edf8458ce06fbc5bb76a58c5ca4'),
('Diana', '5ebe2294ecd0e0f08eab7690d2a6ee69'),
('Eve', 'ccda1683d8c97f8f2dff2ea7d649b42c'),
('Frank', 'f25a2fc72690b780b2a14e140ef6a9e0')]
Yes, just look up the rainbow table!
Another problem with simply hashing password is that, if two users have the same password, they will have the same hash
"Grace", h("qwerty")))
users.append(( users
[('Alice', '482c811da5d5b4bc6d497ffa98491e38'),
('Bob', 'e10adc3949ba59abbe56e057f20f883e'),
('Charlie', 'd8578edf8458ce06fbc5bb76a58c5ca4'),
('Diana', '5ebe2294ecd0e0f08eab7690d2a6ee69'),
('Eve', 'ccda1683d8c97f8f2dff2ea7d649b42c'),
('Frank', 'f25a2fc72690b780b2a14e140ef6a9e0'),
('Grace', 'd8578edf8458ce06fbc5bb76a58c5ca4')]
Ooho, “Charlie” and “Grace” have the same password!
Salt
The solution is to add a random string to the password before hashing it. This random string is called salt.
So, rather than just hash(password)
, we do hash(salt + password)
= "supersecretsalt"
salt = []
users_with_salt
"Alice", h("password123" + salt)))
users_with_salt.append(("Bob", h("123456" + salt)))
users_with_salt.append(("Charlie", h("qwerty" + salt)))
users_with_salt.append(("Diana", h("secret" + salt)))
users_with_salt.append(("Eve", h("aman" + salt)))
users_with_salt.append(("Frank", h("iloveyou" + salt)))
users_with_salt.append(("Grace", h("qwerty" + salt)))
users_with_salt.append(( users_with_salt
[('Alice', 'bef36a6b8abe65a1ca0017fb29ce389a'),
('Bob', '792a39b2d91e301c841b0b4e6dc11d06'),
('Charlie', 'db899b76a6609a8aac3dbd60a1962236'),
('Diana', '41ade3cb6140859235528f6b1c507850'),
('Eve', 'b42d96585e5c0a13eac7438a813d32bf'),
('Frank', '760f87a08cbe8f2316d283b41f9caf99'),
('Grace', 'db899b76a6609a8aac3dbd60a1962236')]
Now, we can’t mount a rainbow table attack anymore, since we don’t know the salt.
But what if we know the salt? Can we still mount a rainbow table attack?
Yes!
Just pre-calculate the hash of common passwords with the salt.
Different salt for each user
We missed one thing, the salt is the same for all users. So, if two users have the same password, they will have the same hash. And it is also susceptible to rainbow table attack.
The solution is to use different salt for each user.
= []
users_with_salt
= "supersecretsalt"
salt "Alice", h("password123" + salt), salt))
users_with_salt.append((
= "2839sdfsjkdfkl"
salt "Bob", h("123456" + salt), salt))
users_with_salt.append((
= "sdfjksdfjksdf"
salt "Charlie", h("qwerty" + salt), salt))
users_with_salt.append((
= "iou23u4iousod"
salt "Diana", h("secret" + salt), salt))
users_with_salt.append((
= "23894798sdjkfskl"
salt "Eve", h("aman" + salt), salt))
users_with_salt.append((
= "89273klscdnkljkls"
salt "Frank", h("iloveyou" + salt), salt))
users_with_salt.append((
= "i2349iusdjfksf"
salt "Grace", h("qwerty" + salt), salt))
users_with_salt.append((
users_with_salt
[('Alice', 'bef36a6b8abe65a1ca0017fb29ce389a', 'supersecretsalt'),
('Bob', '155727303933b14ead7a48751f073166', '2839sdfsjkdfkl'),
('Charlie', '753d4d9815f30a4ed36b89ed1876565c', 'sdfjksdfjksdf'),
('Diana', '1dff7cfa2c56261f3e52e1120b2b564b', 'iou23u4iousod'),
('Eve', '6107eef0d46f7b3db6ae22e3ca45f4db', '23894798sdjkfskl'),
('Frank', '38a2d0871c9e925e7f6f03408fe665df', '89273klscdnkljkls'),
('Grace', 'f751c0d1057e2bbb30174091d69cd9f7', 'i2349iusdjfksf')]
Rainbow table becomes not effective anymore, since we need to pre-calculate the hash of common passwords with each salt.
Bonus: Grace and Bob now have different hash!
BCrypt
BCrypt is a password hashing function that is designed to be slow and expensive to compute. It is designed to be slow to prevent brute force attack.
It’s used by Devise, a popular authentication library for Ruby on Rails.
!pip install bcrypt
Use Cases
Hashing Password
import bcrypt
# Generate a salt
= bcrypt.gensalt()
salt
# Hash a password for the first time
= b"super secret password"
password = bcrypt.hashpw(password, salt)
hashed
# Print the hashed password
print(hashed)
b'$2b$12$9wUZrRjuimKbqdz.uAR0UuDl5pCDLvbvqyZNXxEPoeqnGiM837HNG'
The hashed
contains the salt and the hash of the password. This hashed
value is stored in the database.
username | hashed |
---|---|
alice | $2b$12$9wUZrRjuimKbqdz.uAR0UuDl5pCDLvbvqyZNXxEPoeqnGiM837HNG |
Verifying Password
# retrieved from user input (in login form)
= b"super secret password"
password
# retrieved from database
= b'$2b$12$9wUZrRjuimKbqdz.uAR0UuDl5pCDLvbvqyZNXxEPoeqnGiM837HNG'
hashed
if bcrypt.checkpw(password, hashed):
print("Login successful.")
else:
print("Incorrect password.")
Login successful.