Data Structure Study Notes <9> Hash Search

发表于 2020-08-29 16:05 2233 字 12 min read

cos avatar

cos

FE / ACG / 手工 / 深色模式强迫症 / INFP / 兴趣广泛养两只猫的老宅女 / remote

文章介绍了散列表(哈希表)的基本概念、工作原理及应用。通过散列函数将关键词映射到存储地址,实现平均查找时间接近常量O(1)的高效查找,但需处理冲突并控制装填因子在0.5到0.85之间。文中详细讲解了散列函数的构造方法(如除留余数法、移位法)、冲突解决策略(开放定址法中的线性、平方、双散列探测和链地址法),以及不同方法在性能上的表现和适用场景。

This article has been machine-translated from Chinese. The translation may contain inaccuracies or awkward phrasing. If in doubt, please refer to the original Chinese version.

I. Hash Table

1. Concept Introduction

Variable management in compilation: Dynamic search problem

  • Use a search tree? -- Comparing two variable names (strings) is inefficient
  • Convert strings to numbers, then process? -- This is the idea behind hash search Known search methods:
  • Sequential search O(N)
  • Binary search (static search) O(log2N)
  • Binary search tree O(h) where h is the height of the binary search tree
  • AVL tree O(log2N)

Insert image description here

Problem: How to quickly search for the desired keyword? What if the keywords are inconvenient to compare?

The essence of searching: Finding the position of a known object

  • Orderly arrangement of objects: Total order (completely ordered, e.g., binary search), partial order (some keywords have an ordering relationship, e.g., search trees)
  • Directly "compute" the object's position: Hashing

Basic Operations of Hashing

  • Computing position: Construct a hash function to determine the storage location of keywords
  • Resolving collisions: Apply a strategy to handle cases where multiple keywords map to the same position

Time Complexity

The time complexity is nearly constant: O(1), meaning the search time is independent of the problem size!

(1) Basic Idea

  • Using the keyword key as the independent variable, compute the corresponding function value h(key) through a deterministic function h (hash function), which serves as the storage address of the data object.
  • Different keywords may map to the same hash address, i.e., h(keyi) = h(keyj) (when keyi != keyj), which is called a "Collision". -- A collision resolution strategy is needed.

In the absence of collisions, searching only requires computing the address using the hash function and looking it up.

Insert image description here

Load Factor

Load Factor: If the hash table has a space of size m and n elements are stored in it, then alpha = n / m is called the load factor of the hash table.

  • As in the above figure, alpha = 11 / 17 approximately equals 0.65 If there is no overflow, then Tsearch = Tinsert = Tdelete = O(1)

Rehashing

  • When the hash table has too many elements (i.e., the load factor alpha is too large), search efficiency decreases
    • The practical maximum load factor is generally 0.5 <= alpha <= 0.85
  • The solution is to double the hash table size. This process is called "Rehashing"

(2) Constructing Hash Functions

Two Factors

A "good" hash function should generally consider the following two factors:

  • Simple computation, to improve conversion speed
  • Uniform distribution of keyword addresses, to minimize collisions as much as possible.

Hash Function Construction for Numeric Keywords

1. Direct Addressing Method

Take a linear function of the keyword as the hash address: h(key) = a * key + b (a, b are constants)

Insert image description here

2. Division Method (Commonly Used)

Hash function: h(key) = key mod p e.g., h(key) = key % 17

Insert image description here
Generally, to distribute keyword addresses as uniformly as possible, p should be a prime number.

3. Digit Analysis Method

Analyze the variation of digits at each position of numeric keywords and select the more random digits as the hash address.

Insert image description here

4. Folding Method

Split the keyword into several parts of equal digit length, then sum them up.

Insert image description here

5. Mid-Square Method

Split the keyword into several parts of equal digit length, then sum them up

Hash Function Construction for String Keywords

Simple Hash Function -- ASCII Code Sum Method

For a character keyword key, define the hash function as: h(key) = (sum of key[i]) mod TableSize Severe collisions!!

Simple Improvement -- First 3 Characters Shift Method

h(key) = (key[0] * 272 + key[1] * 27 + key[2]) mod TableSize Still has collisions and space waste

Good Hash Function -- Shift Method

Considers all n characters of the keyword and distributes well. h(key) = (i=0n1key[ni1]×32i\begin{matrix} \sum_{i=0}^{n-1}key[n-i-1] \times32^i \end{matrix}) mod TableSize Code:

typedef string DataType;//Data type
typedef int Index;//Index after hashing
//Return the index computed by the hash function
Index Hash(string Key, int TableSize) {
    unsigned int h = 0; //Hash function value, initialized to 0
    int len = Key.length();
    for(int i = 0; i < len; ++i)
        h = (h << 5) + Key[i];

    return h % TableSize;
}

Common approaches for handling collisions:

  • Try another position -- Open Addressing
  • Organize all conflicting objects at the same position together -- Separate Chaining

Open Addressing

Open Addressing: Once a collision occurs (another element is already at that position), find another empty address according to some rule. Its advantage is that the hash table is an array with high storage efficiency and random access; its disadvantage is the "clustering" phenomenon. In open addressing, deletion must be done carefully -- only "lazy deletion" is allowed, meaning a deletion marker must be added rather than actually deleting, so that searches won't "break the chain". The space can be reused on the next insertion.

  • If the i-th collision occurs, the next probed address is incremented by di. The basic formula is: hi(key) = (h(key) + di) mod TableSize (1 <= i < TableSize)
  • The value of di determines different collision resolution strategies: Linear probing (di = i), Quadratic probing (di = +/- i2), Double hashing (di = i * h2(key)).

1. Linear Probing

Try the next storage addresses in the sequence 1, 2, ..., (TableSize - 1) cyclically.

Insert image description here
Insert image description here

2. Quadratic Probing (Secondary Probing)

Try the next storage addresses with the increment sequence 12, -12, 22, -22, ..., q2, -q2 where q <= |TableSize/2|, cyclically.

Insert image description here
Insert image description here
Can quadratic probing find all available space? The following theorem holds:

  • If the hash table length TableSize is a prime number of the form 4k+3 (where k is a positive integer), then quadratic probing can probe the entire hash table space.
//Search for Key element, using quadratic probing for collision resolution
Index Find(ptrHash H, DataType Key) {
    Index nowp, newp;
    int Cnum = 0;//Record number of collisions
    newp = nowp = Hash(Key, H->TableSize);
    //Collision occurs when the position is non-empty and not the target element
    while(H->Units[newp].flag != Empty && H->Units[newp].data != Key) {
        ++Cnum;//One more collision
        if(++Cnum % 2) {
            newp = nowp + (Cnum+1)*(Cnum+1)/4;//Increment is +i^2, where i is (Cnum+1)/2
            if(newp >= H->TableSize)
                newp = newp % H->TableSize;
        } else {
            newp = nowp - Cnum*Cnum/4;//Increment is -i^2, where i is Cnum/2
            while(newp < 0)
                newp += H->TableSize;
        }
    }
    return newp;//Return position; if it's an empty cell, the key was not found
}

3. Double Hashing

di is i * h2(key), where h2(key) is another hash function. The probe sequence becomes h2(key), 2h2(key), 3h2(key)...

  • For any key, h2(key) != 0!
  • The probe sequence should also ensure all hash table cells can be probed. The following form has good results: h2(key) = p - (key mod p) where p < TableSize, and both p and TableSize are prime numbers.

Separate Chaining

Separate Chaining: Store all keywords that collide at the same position in a singly linked list. Its advantage is that deletion does not require "lazy deletion," so there is no storage garbage. Its disadvantage is that the storage efficiency and search efficiency of the linked list portions are relatively low. Uneven linked list lengths can cause severe degradation in time efficiency.

Insert image description here

(4) Operation Set for Hash Search (Code)

Here we provide hash search for strings, using the shift method for the hash function and quadratic probing from open addressing for collision resolution. Other methods can be modified from this basis.

Definition

const int MaxSize = 100000;
typedef int Index;//Index after hashing
typedef string DataType;//Element type stored in hash cells
//Hash cell status type: has a legitimate element, empty cell, has a deleted element
typedef enum {Legitimate, Empty, Deleted} EntryType;
struct HashNode {   //Hash table cell type
    DataType data;      //Store element
    EntryType flag;     //Cell status
};
struct HashTable {  //Hash table type
    int TableSize;      //Table size
    HashNode *Units; //Array storing hash cells
};
typedef HashTable *ptrHash;

Getting a Prime Number

Return the smallest prime number greater than N and not exceeding MaxSize, to ensure the hash table's maximum length is prime and keyword addresses are distributed as uniformly as possible.

//Return the smallest prime number greater than N and not exceeding MaxSize
int NextPrime(int N) {
    int i, p = (N%2) ? N+2 : N+1;//Start from the next odd number greater than N
    while(p <= MaxSize) {
        for(i = (int)sqrt(p); i >= 2; i--)
            if(!(p % i)) break;//Not a prime
        if(i == 2) break;//for loop ended normally, it's a prime
        else p += 2;//Try the next odd number
    }
    return p;
}

Creating an Empty Table

Create an empty table with length greater than TableSize (ensuring the length is prime).

ptrHash CreateTable(int TableSize) {
    ptrHash H;
    int i;
    H = new HashTable;
    H->TableSize = NextPrime(TableSize);
    H->Units = new HashNode[H->TableSize];
    for(int i = 0; i < H->TableSize; ++i)
        H->Units[i].flag = Empty;
    return H;
}

Hash Computation

Return the index computed by the hash function. Here the keyword data type is string, using the shift method for hashing.

Index Hash(DataType Key, int TableSize) {
    unsigned int h = 0; //Hash function value, initialized to 0
    string str = Key.str;
    int len = str.length();
    for(int i = 0; i < len; ++i)
        h = (h << 5) + str[i];
    return h % TableSize;
}

Search Operation

Search for Key element, using quadratic probing for collision resolution. Returns the position index; if the position is an empty cell, the key was not found.

Index Find(ptrHash H, DataType Key) {
    Index nowp, newp;
    int Cnum = 0;//Record number of collisions
    newp = nowp = Hash(Key, H->TableSize);
    //Collision occurs when the position is non-empty and not the target element
    while(H->Units[newp].flag != Empty && H->Units[newp].data != Key) {
        ++Cnum;//One more collision
        if(++Cnum % 2) {
            newp = nowp + (Cnum+1)*(Cnum+1)/4;//Increment is +i^2, where i is (Cnum+1)/2
            if(newp >= H->TableSize)
                newp = newp % H->TableSize;
        } else {
            newp = nowp - Cnum*Cnum/4;//Increment is -i^2, where i is Cnum/2
            while(newp < 0)
                newp += H->TableSize;
        }
    }
    return newp;//Return position; if it's an empty cell, the key was not found
}

Insert Operation

Insert Key into the table. Returns whether insertion was successful or not; failure means the key already exists.

bool Insert(ptrHash H, DataType Key) {
    Index p = Find(H, Key);
    if(H->Units[p].flag != Legitimate) {//This position can accept an element
        H->Units[p].flag = Legitimate;
        H->Units[p].data = Key;
        //Other operations
        return true;
    } else {//The key already exists
        //Other operations
        return false;
    }
}

(5) Performance Analysis

The search efficiency of a hash table is analyzed through the following factors:

  • Average Successful Search Length (ASLs)
  • Average Unsuccessful Search Length (ASLu)

Taking the example from the linear probing section, let's analyze the performance of this hash table:

Insert image description here

  • ASLs is the average number of search comparisons for keywords in the table (i.e., the number of collisions plus 1) ASLs = (1+7+1+1+2+1+4+2+4) / 9 = 23 / 9 approximately equals 2.56
  • ASLu is the average number of search comparisons for keywords not in the hash table (unsuccessful) General method: Classify keywords not in the hash table into several categories. If classified by H(key) values, here H(key) = key mod 11, we have 11 categories. ASLu is as follows: ASLu = (3+2+1+2+1+1+1+9+8+7+6) / 11 = 41 / 11 approximately equals 3.73

The number of keyword comparisons depends on how many collisions occur, which is influenced by three factors:

  1. Whether the hash function distributes uniformly
  2. The collision resolution method
  3. The load factor alpha of the hash table The impact of different collision resolution methods and load factors on efficiency is as follows:

1. Search Performance of Linear Probing

Insert image description here

2. Search Performance of Quadratic Probing and Double Hashing

Insert image description here
When the load factor alpha < 0.5, the expected number of probes for all probing methods is small and relatively close. As alpha increases, the expected number of probes for linear probing increases more rapidly. The expected number of probes for unsuccessful searches and insertions is larger than that for successful searches. Therefore, a reasonable maximum load factor should not exceed 0.85.

3. Search Performance of Separate Chaining

Insert image description here

Summary

  • With an appropriate h(key), the expected search efficiency of hashing is constant O(1), which is almost independent of the size n of the keyword space! It is also suitable for problems where direct keyword comparison is computationally expensive.
  • This is predicated on a small alpha, so hashing is a method that trades space for time.
  • The storage of hash methods is random with respect to keywords, making it inconvenient for sequential search of keywords, and also not suitable for range searches or finding maximum/minimum values.

喜欢的话,留下你的评论吧~

© 2020 - 2026 cos @cosine
Powered by theme astro-koharu · Inspired by Shoka