C Algorithm 04: Searching Massive Data

【Problem】Given 4 billion unique unsigned int integers that are not sorted. Given a number, how can we quickly determine if this number exists among those 4 billion numbers?

1 billion is 9 zeros, so 4 billion is 4000000000. Our instinctive thought for a solution is to use a for loop to find the equal number, but this method is definitely not feasible. On a 32-bit machine, an unsigned int occupies 4 bytes, so 4 billion integers would require 4000000000 / (1024 * 1024 * 1024) = 14.90116119384765625, which is about 15 GB. Therefore, a conventional linear search cannot meet performance requirements.

Let’s review the basic knowledge of bit operations, bits, and bytes, as these are precious resources due to limited memory.

1 byte = 8 bits, 1 KB = 1024 bytes, 1 MB = 1024 KB

Using the above byte units for conversion, 4000000000 / (8 * 1024 * 1024) = 476.837158203125, which means we only need 476.84 MB of memory. For a 32-bit machine, an int occupies 4 bytes, and for the range of 0 to 31 bits, if it is an int array, as shown in the figure below:

C Algorithm 04: Searching Massive DataThe bits of A[0] are 0 to 31

The bits of A[1] are 32 to 63

The bits of A[2] are 64 to 95

The bits of A[3] are 96 to 127

And so on for A[n]……

One bit can represent one number. We assume the initial value is 0; if a certain bit is set to 1, it indicates that it represents a certain value. Now let’s simulate this: if there are 100 numbers, we need to search among these 100 numbers. 100 bits only require 12.5 bytes (100/8=12.5), which means we only need to define int[4] (since one int has 32 bits, 4 ints have 128 bits, which is just over 100).

The initial value of int[4] is 0. Let’s look at the program:

1. First, calculate which byte position the value is in, and which bit in that byte position corresponds to the array shown above, which bit in A[n].

int arr[4] = {0, 0, 0, 0};

void SetBit(int n)
{
    int bytePos = (n >> 5); // Calculate byte position
    int bitPos = (n & 0x1f); // Calculate which bit in the byte
    arr[bytePos] |= (1 << (bitPos - 1)); // Set the logical bit to 1
}

2. Search if the corresponding bit value is 1; if it is, it indicates that the number is found, and return its value.

int SearchBit(int n)
{
    int bytePos = (n >> 5); // Calculate byte position
    int bitPos = (n & 0x1f); // Calculate which bit in the byte

    return arr[bytePos] & (1 << (bitPos - 1)); // Check if this bit is 1 and return its value
}

3. Clear the bit to 0.

int ClearBit(int n)
{
    int bytePos = (n >> 5); // Calculate byte position
    int bitPos = (n & 0x1f); // Calculate which bit in the byte

    arr[bytePos] &= ~(1 << (bitPos - 1)); // Clear the bit to 0
}

Main function, please refer to:

int main()
{
    int a[100] = {1,10,20,30,40,50,60,70,80,90,
            2,11,21,31,41,51,61,71,81,91,
            3,12,22,32,42,52,62,72,82,92,
            4,13,23,33,43,53,63,73,83,93,
            5,14,24,34,44,54,64,74,84,94,
            6,15,25,35,45,55,65,75,85,95,
            7,16,26,36,46,56,66,76,86,96,
            8,17,27,37,47,57,67,77,87,97,
            9,18,28,38,48,58,68,78,88,98,
            19,29,39,49,59,69,79,89,99,103};

    for (int i = 0; i < 100; i++) SetBit(a[i]); // Set bit for each value

    for (int i = 1; i <= 100; i++)
    {
        if (SearchBit(i)) printf("%d is exist\n", i);
        else printf("%d is not exist\n", i);
    }

    for (int i = 0; i < 100; i++) ClearBit(a[i]);

    return 0;
}

【Running Result】C Algorithm 04: Searching Massive DataThe running result shows that the number 100 does not exist; it is not in the array. This is a typical bitmap problem, allowing for fast data searching while saving memory.

Leave a Comment