Developing a Data Compression Tool in C Language
Introduction
In today’s article, we will introduce how to write a simple data compression tool using the C language. Data compression is a technique that reduces the storage size of information by encoding it. In this article, we will implement a basic character frequency count and Huffman coding to achieve data compression.
Introduction to Huffman Coding
Huffman coding is a lossless data compression algorithm that uses variable-length binary codes to represent characters. The most frequently occurring characters are assigned shorter codes, while less common characters are assigned longer codes, thus achieving compression. This tutorial will cover the construction of the Huffman tree, pre-order traversal, and how to generate the final output file.
Tool Preparation
To proceed with development, you will need the following tools:
- A C language compiler, such as GCC.
- A text editor to write our code.
Project Structure
First, we create a new file named <span>huffman.c</span>
and organize the following main modules within it:
- Data structure definitions
- Huffman tree nodes
- Character frequency counting
- Huffman tree construction
- Code generation
- Main function logic
Next, we will implement these parts step by step.
1. Data Structure Definitions
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_TREE_NODES 256 // Maximum number of nodes (256 ASCII characters)
typedef struct Node {
char character;
int frequency;
struct Node *left, *right;
} Node;
// Node comparison function for sorting
int compare(const void *a, const void *b) {
return ((Node *)a)->frequency - ((Node *)b)->frequency;
}
2. Character Frequency Counting
void calculate_frequency(const char *filename, int freq[]) {
FILE *file = fopen(filename, "r");
if (!file) {
perror("Failed to open file");
exit(EXIT_FAILURE);
}
char ch;
while ((ch = fgetc(file)) != EOF) {
freq[(unsigned char)ch]++;
}
fclose(file);
}
3. Creating the Huffman Tree
Node* create_node(char character, int frequency) {
Node* new_node = (Node*)malloc(sizeof(Node));
new_node->character = character;
new_node->frequency = frequency;
new_node->left = NULL;
new_node->right = NULL;
return new_node;
}
Node* build_huffman_tree(int freq[]) {
Node* nodes[MAX_TREE_NODES];
int node_count = 0;
for (int i = 0; i < MAX_TREE_NODES; i++) {
if (freq[i] > 0) { // Create non-zero frequency nodes
nodes[node_count++] = create_node(i, freq[i]);
}
}
while (node_count > 1) { // Build the Huffman tree
qsort(nodes, node_count, sizeof(Node*), compare);
Node* left_child = nodes[0]; // Pop the two smallest nodes
Node* right_child = nodes[1];
Node* parent_node = create_node('\0', left_child->frequency + right_child->frequency);
parent_node->left = left_child;
parent_node->right = right_child;
nodes[1] = parent_node; // Insert the new parent node into the array, replacing the smallest element, maintaining order
node_count--;
}
return nodes[0]; // Return the root node
}
4. Code Generation and Output Saving
void generate_codes(Node *root, char **codes, char *code_str, int top) {
if (root->left == NULL && root->right == NULL) {
code_str[top] = '\0';
codes[(unsigned char)(root->character)] = strdup(code_str);
return;
}
code_str[top] = '0';
generate_codes(root->left, codes, code_str, top + 1);
code_str[top] = '1';
generate_codes(root->right, codes, code_str, top + 1);
}
void compress_file(const char *input_filename, const char *output_filename) {
int frequencies[MAX_TREE_NODES] = {0};
calculate_frequency(input_filename, frequencies);
Node *root = build_huffman_tree(frequencies);
char **codes = (char **)malloc(MAX_TREE_NODES * sizeof(char *));
char code_string[MAX_TREE_NODES];
generate_codes(root, codes, code_string, 0);
FILE *output_file = fopen(output_filename, "wb");
fprintf(output_file, "%d\n", MAX_TREE_NODES);
for (int i = 0; i < MAX_TREE_NODES; i++) {
if (codes[i] != NULL) {
fprintf(output_file, "%d:%s\n", i, codes[i]);
}
}
fclose(output_file);
free(codes);
}
5. Main Function Logic
Finally, we call the above functions in the main function to execute the entire program:
int main(int argc, char **argv) {
if (argc != 3) {
printf("Usage: %s input.txt output.txt\n", argv[0]);
return EXIT_FAILURE;
}
compress_file(argv[1], argv[2]);
return EXIT_SUCCESS;
}
Conclusion and Test Run
In the terminal or command line window, run the program with the following command, providing the input file and the desired output file name:
gcc huffman.c -o huffcompress && ./huffcompress input.txt output.txt
Please ensure you have a data text file to compress: input.txt, and the above steps will generate output.txt, which contains the corresponding Huffman coding information for that text.
I hope this article helps everyone understand how to build a simple data compression tool using the C language. If you have any questions about the content, please feel free to leave a comment for discussion.