Developing a Data Compression Tool in C Language

Developing a Data Compression Tool in C Language

Introduction

In today’s article, we will introduce how to write a simple data compression tool using the C language. Data compression is a technique that reduces the storage size of information by encoding it. In this article, we will implement a basic character frequency count and Huffman coding to achieve data compression.

Introduction to Huffman Coding

Huffman coding is a lossless data compression algorithm that uses variable-length binary codes to represent characters. The most frequently occurring characters are assigned shorter codes, while less common characters are assigned longer codes, thus achieving compression. This tutorial will cover the construction of the Huffman tree, pre-order traversal, and how to generate the final output file.

Tool Preparation

To proceed with development, you will need the following tools:

  • A C language compiler, such as GCC.
  • A text editor to write our code.

Project Structure

First, we create a new file named <span>huffman.c</span> and organize the following main modules within it:

  1. Data structure definitions
  2. Huffman tree nodes
  3. Character frequency counting
  4. Huffman tree construction
  5. Code generation
  6. Main function logic

Next, we will implement these parts step by step.

1. Data Structure Definitions

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_TREE_NODES 256 // Maximum number of nodes (256 ASCII characters)
typedef struct Node {
    char character;
    int frequency;
    struct Node *left, *right;
} Node;
// Node comparison function for sorting
int compare(const void *a, const void *b) {
    return ((Node *)a)->frequency - ((Node *)b)->frequency;
}

2. Character Frequency Counting

void calculate_frequency(const char *filename, int freq[]) {
    FILE *file = fopen(filename, "r");
    if (!file) {
        perror("Failed to open file");
        exit(EXIT_FAILURE);
    }
    char ch;
    while ((ch = fgetc(file)) != EOF) {
        freq[(unsigned char)ch]++;
    }
    fclose(file);
}

3. Creating the Huffman Tree

Node* create_node(char character, int frequency) {
    Node* new_node = (Node*)malloc(sizeof(Node));
    new_node->character = character;
    new_node->frequency = frequency;
    new_node->left = NULL;
    new_node->right = NULL;
    return new_node;
}
Node* build_huffman_tree(int freq[]) {
    Node* nodes[MAX_TREE_NODES];
    int node_count = 0;
    for (int i = 0; i < MAX_TREE_NODES; i++) {
        if (freq[i] > 0) { // Create non-zero frequency nodes
            nodes[node_count++] = create_node(i, freq[i]);
        }
    }
    while (node_count > 1) { // Build the Huffman tree
        qsort(nodes, node_count, sizeof(Node*), compare);
        Node* left_child = nodes[0]; // Pop the two smallest nodes
        Node* right_child = nodes[1];
        Node* parent_node = create_node('\0', left_child->frequency + right_child->frequency);
        parent_node->left = left_child;
        parent_node->right = right_child;
        nodes[1] = parent_node; // Insert the new parent node into the array, replacing the smallest element, maintaining order
        node_count--;
    }
    return nodes[0]; // Return the root node
}

4. Code Generation and Output Saving

void generate_codes(Node *root, char **codes, char *code_str, int top) {
    if (root->left == NULL && root->right == NULL) {
        code_str[top] = '\0';
        codes[(unsigned char)(root->character)] = strdup(code_str);
        return;
    }
    code_str[top] = '0';
    generate_codes(root->left, codes, code_str, top + 1);
    code_str[top] = '1';
    generate_codes(root->right, codes, code_str, top + 1);
}
void compress_file(const char *input_filename, const char *output_filename) {
    int frequencies[MAX_TREE_NODES] = {0};
    calculate_frequency(input_filename, frequencies);
    Node *root = build_huffman_tree(frequencies);
    char **codes = (char **)malloc(MAX_TREE_NODES * sizeof(char *));
    char code_string[MAX_TREE_NODES];
    generate_codes(root, codes, code_string, 0);
    FILE *output_file = fopen(output_filename, "wb");
    fprintf(output_file, "%d\n", MAX_TREE_NODES);
    for (int i = 0; i < MAX_TREE_NODES; i++) {
        if (codes[i] != NULL) {
            fprintf(output_file, "%d:%s\n", i, codes[i]);
        }
    }
    fclose(output_file);
    free(codes);
}

5. Main Function Logic

Finally, we call the above functions in the main function to execute the entire program:

int main(int argc, char **argv) {
    if (argc != 3) {
        printf("Usage: %s input.txt output.txt\n", argv[0]);
        return EXIT_FAILURE;
    }
    compress_file(argv[1], argv[2]);
    return EXIT_SUCCESS;
}

Conclusion and Test Run

In the terminal or command line window, run the program with the following command, providing the input file and the desired output file name:

gcc huffman.c -o huffcompress && ./huffcompress input.txt output.txt

Please ensure you have a data text file to compress: input.txt, and the above steps will generate output.txt, which contains the corresponding Huffman coding information for that text.

I hope this article helps everyone understand how to build a simple data compression tool using the C language. If you have any questions about the content, please feel free to leave a comment for discussion.

Leave a Comment