ProgrammingBeginners Guide to Hash Tables and Complexity
Posted:

ProgrammingBeginners Guide to Hash Tables and ComplexityPosted:

CriticaI
  • 2 Million
Status: Offline
Joined: Nov 05, 201310Year Member
Posts: 2,737
Reputation Power: 448
Status: Offline
Joined: Nov 05, 201310Year Member
Posts: 2,737
Reputation Power: 448
Keep reading if you want to know how to create a hash table and why they are efficient data types.

First things first. Why are hash tables important?
Hash tables are essentially the same thing as declaring another variable. Sure variables are the best solution but what if you have an unknown amount of variables? Maybe your first thought is to store the values in an array, but that is not the best choice.


Why are arrays bad for multiple values?
Arrays are fine for small chunks of data. However, if you need to constantly filter/search through the array, then your program will be slow. Searching through an array is essentially trial and error until the correct item is found. Very slow. In big O notation, this is O(n)


To understand the performance difference of hash tables, you need to know Big O notation.
Loops, nested loops, and recursion all increase complexity. The goal is to be as close to O(1) as possible.
https://nedbatchelder.com/text/bigo_pix/011.png


How does a hash table work?
A hash table works by storing values in an array with a fixed size. Instead of appending the array, the index is calculated using a hash function. Essentially, this means the value will be stored randomly. When it's time the get the value, the index is recomputed and retrieved from the array.


What is the complexity of a hash table?
Hash tables have a complexity of O(1) except when there is a collision. Then the complexity is O(n). This is why collisions should be avoided or reduced.






Just show me the code:

The first step is to create a hashing algorithm.
Make sure your hashing algorithm does not add complexity. (I'm increasing complexity for the sake of making the code easier to understand). A salt is used to increase entropy. It is best to pick prime numbers as a salt because 0 and multiples of two are more likely to cause collisions. The hash function can be anything that produces a pseudorandom number. I chose to multiply the decimal values of each character in the string. For more efficient algorithms refer to this StackOverflow post.

def create_hash(plaintext, salt=43):
    total = salt
    for char in str(plaintext):
        total = total * ord(char)
    return total


Next, we need a compression algorithm.
Since the hash has a fixed size, we need a function that takes the big integer from the hash and produces another pseudorandom number that is smaller than the length of the array. Otherwise, we'll cause an overflow error. We'll use a common compression algorithm found in other languages. This is just math for now. Remember do not to add complexity!

a = random int
b = random int
h = hash from key
m = size of hash table array

c = (a * h + b) % m


Creating the hash table class
Since the compression requires some random values, its best to calculate those values as little as possible. So, we'll calculate it when a hash is made. Depending on the scenario it may be better to define these values as constants.

class hash_table():

    def __init__(self, size, salt=43):
        self.arr = [None] * math.ceil(size * 1.3)
        self.prime = 2147483647
        self.salt = salt
        self.a = random.randint(self.salt, self.prime)
        self.b = random.randint(self.salt, self.prime)

    def compress(self, key):
        return (self.a * create_hash(key, self.salt) + self.b) % len(self.arr)


Storing values and handling collisions
Collisions will happen. To handle them we will store all duplicates and we store the key and value in an array. Remember, using these arrays barely affects the complexity because we are reducing collisions. Most likely there will only be a few items in the array in the worst-case scenario O(n). If there is only one item in the array then the complexity is still O(1).

def set_item(self, key, value):
        lookup = self.compress(key)
        if self.arr[lookup]:
            self.arr[lookup].append([key, value])
        else:
            self.arr[lookup] = [[key, value]]


Getting a value:
Now use the lookup to get all the collisions from a hash. When we loop over the collisions we return the value if the key matches. Break to reduce complexity.

def get_item(self, key):
        lookup = self.compress(key)
        if self.arr[lookup]:
            for collision in self.arr[lookup]:
                if collision[0] == key:
                    return collision[1]
        return None


Removing a key:
Removing a key is like getting a key. We search through the possible collisions and delete the key/value pair from the collision array if the key matches.

def remove_item(self, key):
        lookup = self.compress(key)
        if self.arr[lookup]:
            for index in range(len(self.arr[lookup])):
                if self.arr[lookup][index][0] == key:
                    del self.arr[lookup][index]
                    break


All together now.
Here is all the code and how to use it.

import math
import random

def create_hash(plaintext, salt=43):
    total = salt
    for char in str(plaintext):
        total = total * ord(char)
    return total

class hash_table():

    def __init__(self, size, salt=43):
        self.arr = [None] * math.ceil(size * 1.3)
        self.prime = 2147483647
        self.salt = salt
        self.a = random.randint(self.salt, self.prime)
        self.b = random.randint(self.salt, self.prime)

    def compress(self, key):
        return (self.a * create_hash(key, self.salt) + self.b) % len(self.arr)

    def set_item(self, key, value):
        lookup = self.compress(key)
        if self.arr[lookup]:
            self.arr[lookup].append([key, value])
        else:
            self.arr[lookup] = [[key, value]]
       
    def get_item(self, key):
        lookup = self.compress(key)
        if self.arr[lookup]:
            for collision in self.arr[lookup]:
                if collision[0] == key:
                    return collision[1]
        return None

    def remove_item(self, key):
        lookup = self.compress(key)
        if self.arr[lookup]:
            for index in range(len(self.arr[lookup])):
                if self.arr[lookup][index][0] == key:
                    del self.arr[lookup][index]
                    break


my_hash = hash_table(200) # -> arr[200]
my_hash.set_item('name', 'my name')
my_hash.get_item('name')  # -> "criticai"
my_hash.remove_item('name')
my_hash.set_item('site', 'TTG')
my_hash.set_item('name', 'critical')
my_hash.get_item('name') # -> critical
my_hash.get_item('site') # -> TTG


Additional notes
Remember hash tables can have a complexity of O(1) but after more data is added it has a higher chance of colliding which creates a complexity of O(n).

Since the hash is compressed based on the size of the internal array, then to reduce collisions you should increase the internal array size.

Ideally, you should use a size that is 1.3 times the number of keys you plan to have.

The following 1 user thanked CriticaI for this useful post:

FOV (12-19-2020)
#2. Posted:
FOV
  • Mind Charity
Status: Offline
Joined: Aug 31, 201112Year Member
Posts: 1,883
Reputation Power: 8098
Motto: The extent of the observable world that is seen at any given moment.
Motto: The extent of the observable world that is seen at any given moment.
Status: Offline
Joined: Aug 31, 201112Year Member
Posts: 1,883
Reputation Power: 8098
Motto: The extent of the observable world that is seen at any given moment.
This is very informative. Thanks for the guide!
Users browsing this topic: None
Jump to:


RECENT POSTS

HOT TOPICS