I am writing this post because of a comment.
I had written an article about recognizing calculator screens, explaining that the solution can be deployed independently in apps, mini-programs, Raspberry Pi boards, and other embedded devices. One commenter replied that this is all old technology that has been in use for many years.
He is not wrong; it may well have existed many years ago. In practice, though, no matter how long a technology has been around, people still spend money every year developing new applications with it.
I wonder if those born in the 80s and 90s still remember that feature phones in the early 2000s already had handwriting recognition. They even came with a stylus.
Yet even though this technology has been around for 30 years, if you ask software companies today whether they still pay third-party services for handwriting recognition, they will likely say yes.
Why?
This is a very good question and worth pondering.
First, the technology back then is different from today’s. In the pre-AI era, handwriting recognition relied on rule-based template matching: the characters you wrote had to be compared against templates in a library. It was like a customer-service bot that replies based on keywords: if you ask about a “house,” it understands, but if you say “room,” it claims not to understand. Today, handwriting recognition is based on deep learning; ask it about a “house,” and it knows that is a place to live.
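To make the contrast concrete, here is a minimal sketch of the old template-matching idea, assuming characters are tiny binary grids. The templates, the classify helper, and the distance threshold are all hypothetical illustrations and are not taken from any real handwriting engine.

```python
# Hypothetical 3x3 binary "templates"; a real engine would use far larger grids.
TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized grids."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph, max_distance=1):
    """Return the closest template label, or None when nothing is close enough."""
    label, distance = min(
        ((name, hamming(glyph, template)) for name, template in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    return label if distance <= max_distance else None


neat_l = ("100", "100", "111")     # identical to the stored "L"
slanted_l = ("010", "100", "011")  # a sloppier "L" that no template covers

print(classify(neat_l))     # L
print(classify(slanted_l))  # None
```

The sloppy stroke is rejected because no stored template is within the allowed distance, which is the “house” versus “room” problem in miniature. A learned model generalizes from many examples instead of relying on an exact library entry.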
Second, even if the technology hasn’t changed, its practical implementation has barriers. This manifests in two aspects. First, just because you have an open-source project doesn’t mean you can use it effectively. Second, each person’s specific needs are different; even if you can run it, it’s difficult to modify for personalized customization.
Today, I will explain an open-source OCR project to illustrate my point.
This project was announced several months ago, and news articles about it have come in waves: “Better than xx using OCR,” “Table Recognition Miracle,” “Top Ten Open Source Projects of the Year”…
Everyone is spreading and sharing, saying how good and useful this thing is. Even my friend who works with rolling shutters shared it with me, saying it’s great. However, no one has written an article explaining how they actually used it, what the results were, its principles, how it was trained, what its advantages are, what its disadvantages are, whether the disadvantages can be optimized, and how to optimize them. Today, I’ll fill that gap; otherwise, someone will say, “Oh, this has long been solved, at zero cost, and has matured many years ago.”
The project is called Surya, an OCR project.
Its open-source address is github.com/VikParuchuri/surya, which currently has 14K stars on GitHub. It supports local deployment and is free for commercial use for companies with annual revenues below $5 million.
I set it up on my computer. It runs on a CPU, and it is faster on a GPU. I ran a small experiment and will show you the features.
1. Feature Demonstration
I used this image for testing, which is a news article from a certain newspaper.
It can detect what types of structures are in the image, such as paragraphs, images, titles, etc. The image below shows the detected areas marked.
In addition, layout detection comes with a reading order feature. Reading order is the sequence in which a person reads the document, for example left to right and top to bottom. It matters a great deal: if the order is wrong, the extracted text comes out scrambled.
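A small sketch shows why reading order is not trivial. It assumes a hypothetical two-column page with four text blocks. Both sorting functions are simple heuristics written for illustration; Surya uses a trained model for this, not these rules.

```python
from typing import NamedTuple


class Block(NamedTuple):
    name: str
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical two-column page, 800 px wide.
blocks = [
    Block("col1-para1", 50, 100, 380, 300),
    Block("col2-para1", 420, 100, 750, 250),
    Block("col1-para2", 50, 320, 380, 600),
    Block("col2-para2", 420, 270, 750, 600),
]


def naive_order(items):
    """Sort strictly top to bottom, then left to right."""
    return [b.name for b in sorted(items, key=lambda b: (b.top, b.left))]


def column_order(items, page_width=800):
    """Assign each block to the left or right half, then read each column downward."""
    def key(block):
        center = (block.left + block.right) / 2
        column = 0 if center < page_width / 2 else 1
        return (column, block.top)
    return [b.name for b in sorted(items, key=key)]


print(naive_order(blocks))
# ['col1-para1', 'col2-para1', 'col2-para2', 'col1-para2']
print(column_order(blocks))
# ['col1-para1', 'col1-para2', 'col2-para1', 'col2-para2']
```

The naive sort jumps between columns, so the paragraphs would be stitched together in the wrong sequence. Even the column-aware rule only works for this fixed two-column layout, which is why a learned reading-order model is useful.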
Since it is OCR, it must convert images to text. To do that, it first needs to know which areas contain text, so it also provides text line detection.
After detecting the position of the text, the next step is to recognize the text. Below is the recognition result.
Finally, let’s show its table recognition. Here’s the test image.
Running table detection gives the following result.
From the recognized data, it shows 4 rows, 3 columns, and 12 cells.
Next, we will perform OCR content recognition.
2. Algorithm Integration
The above shows its features. Let’s set aside how well they work for now. Next, I want to ask: how does it achieve all this? Answering this question will help us better understand its capabilities.
The author lists many thanks at the end, stating that without the assistance of many excellent open-source models, he could not have completed this project. For example, he thanks the CRAFT project, which is an open-source text detection model with over 3k stars.
It also uses Donut, which is a new method for understanding documents without OCR. We know that to understand a document, we generally first need to know what it says, and then analyze the document to make judgments. Donut combines multimodal approaches to directly parse images, requiring minimal text processing, thus skipping the full-text analysis step.
Look at the image above. If you ask Donut what the title of this image is, it can answer correctly. This is understanding the document.
Therefore, from an algorithmic perspective, Surya employs many top-tier open-source models. Those models are also built upon the shoulders of giants. It can be said that the algorithms it integrates are currently at the forefront of public knowledge.
Now let’s talk about its training data.
Its training data can be found under the vikp account on huggingface.co.
3. Training Data
For example, for text area type detection, its training data looks like this:
Let’s look at one of its records. The image field is a picture, bboxes are the region boxes, and labels are the region types, which include text types and table types. This data has to be annotated, meaning boxes are drawn on the image and each region’s type is marked. The training set totals 1910 images. Not many.
For example, for table analysis detection, its training data looks like this:
The image is a table image, the bboxes are cells, rows are regions of each row, and cols are regions of each column. Providing these labeled data to the algorithm allows it to learn what features define rows and columns. The data here is relatively abundant, with 9680 images. Hence, it is said that its table recognition is very strong.
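The relationship between rows, columns, and cells can be sketched with plain geometry. The example below assumes a hypothetical, perfectly regular table with 4 rows and 3 columns and derives each cell as the overlap of one row box and one column box. It only illustrates the annotation structure and is not how Surya computes cells.

```python
def intersect(row, col):
    """Overlap of a row box and a column box as (left, top, right, bottom), or None."""
    left, top = max(row[0], col[0]), max(row[1], col[1])
    right, bottom = min(row[2], col[2]), min(row[3], col[3])
    if left < right and top < bottom:
        return (left, top, right, bottom)
    return None


# Hypothetical table: 300 px wide, 160 px tall, 4 rows of 40 px, 3 columns of 100 px.
rows = [(0, y, 300, y + 40) for y in range(0, 160, 40)]
cols = [(x, 0, x + 100, 160) for x in range(0, 300, 100)]

cells = []
for row in rows:
    for col in cols:
        box = intersect(row, col)
        if box is not None:
            cells.append(box)

print(len(rows), len(cols), len(cells))  # 4 3 12
print(cells[0], cells[-1])               # (0, 0, 100, 40) (200, 120, 300, 160)
```

Real tables have merged cells, skewed photos, and missing borders, so this clean intersection rarely holds. That is the part the labeled rows, cols, and bboxes teach the model.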
For text line detection, its training data looks like this:
The structure of the training data consists of: image, a specific area in the image, the type of text in that area, and additionally, a text content. For example, in the selected data above, it indicates that there is a text line of type 7 within this image, and its area is defined by the rectangle [88, 96, 865, 134] (left, top, right, bottom); please learn carefully.
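As a quick illustration of the box format, the sketch below stores the example annotation in a small data class and derives its width and height. The class and field names are hypothetical and are not the dataset’s actual schema; only the numbers come from the example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LineAnnotation:
    """One labeled text line; field names are illustrative only."""
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    label: int

    @property
    def width(self) -> int:
        return self.bbox[2] - self.bbox[0]

    @property
    def height(self) -> int:
        return self.bbox[3] - self.bbox[1]

    def as_xywh(self) -> Tuple[int, int, int, int]:
        """Convert corner coordinates to (x, y, width, height)."""
        left, top, _, _ = self.bbox
        return (left, top, self.width, self.height)


line = LineAnnotation(bbox=(88, 96, 865, 134), label=7)
print(line.width, line.height)  # 777 38
print(line.as_xywh())           # (88, 96, 777, 38)
```

A box 777 pixels wide and 38 pixels tall is a typical single line of printed text, which matches the description of a text line annotation.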
Finally, we have OCR recognition.
The composition of the training data is still the same: image, area box, text content. It mainly informs the model how many areas are in this image, what text content those areas contain, and please study carefully. Additionally, there is a language field indicating the type of text language.
Surya claims to support recognition of over 90 languages. This is not an exaggeration because its training data indeed has annotations for over 90 languages. However, the total amount is too small. There are only 4635 images in total, meaning on average, there are only about 50 training images per language.
Therefore, the OCR recognition effect of Surya for Chinese is not particularly good (even though it claims to be on par with Tesseract). The main reason is not that the algorithm is poor, but rather that the training data for Chinese is too limited. The English alphabet has 26 letters, and 50 images can cover them. However, for Chinese, with tens of thousands of characters, it is difficult to cover them all. And for handwriting recognition, Surya can only say it’s a matter of luck since there is essentially no training data, and the recognition results will vary widely.
Among the training data, table recognition has the most, with about 9,700 samples in total. Reading order detection has the least, with only 126 images. The volume of data largely determines the recognition quality. Gathering training data at scale is a challenge even for well-funded commercial companies, so the author managing to collect these thousands of samples is already quite an achievement.
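The figures quoted above can be put side by side with a few lines of arithmetic. The task names are informal labels of my own, and the language count is taken as exactly 90, so the per-language average is an upper bound.

```python
# Sample counts quoted in this article; the task names are informal labels.
TRAINING_IMAGES = {
    "layout detection": 1910,
    "table recognition": 9680,
    "ocr recognition": 4635,
    "reading order": 126,
}
LANGUAGES = 90  # "over 90" in the text, so the true average is slightly lower

largest = max(TRAINING_IMAGES, key=TRAINING_IMAGES.get)
smallest = min(TRAINING_IMAGES, key=TRAINING_IMAGES.get)
images_per_language = TRAINING_IMAGES["ocr recognition"] / LANGUAGES

print(f"largest set:  {largest} ({TRAINING_IMAGES[largest]} images)")
print(f"smallest set: {smallest} ({TRAINING_IMAGES[smallest]} images)")
print(f"OCR images per language: {images_per_language}")  # 51.5
```

Roughly 50 images per language is workable for an alphabet of 26 letters but nowhere near enough for a script with tens of thousands of characters.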
In conclusion, Surya is a free, open-source project built on top-tier algorithms. Its table analysis is indeed powerful. However, its training data is still too limited and mainly suits clear, undistorted electronic documents, and it barely supports handwriting recognition. If you want to replace a paid OCR solution with it unmodified, the feasibility is low. Even for table recognition alone, you would need reasonably specialized people to handle the step from photos to electronic documents. If the paid solution already performs poorly, you should drop the idea of replacing it with a free one. The algorithm is open-source, but the investment in training data and training hardware must come from somewhere.
4. Local Deployment
I won’t repeat what is already clearly written in the official README.md. For instance, that you need to run pip install streamlit, or that it takes several parameters, the first of which, --langs, specifies the OCR language.
Otherwise, I would sound like a parrot.
Additionally, since you want to study it, you should be able to run it without problems. Look at its source code; I will only mention the key points.
First, download the source code. In the source code, you can see two files: pyproject.toml and poetry.lock. This indicates that Surya uses Poetry as its project management tool. Poetry can manage both dependencies and virtual environments.
It’s best to find a Linux environment and install Poetry. Even if you are on Windows, you can easily install an Ubuntu virtual machine now. Linux can avoid many issues.
Open the Linux command line, navigate to the root directory of the source code. First, run pip install poetry to install Poetry. Then run poetry install to install the dependent environment. Finally, run poetry shell to enter the environment, and you will see:
(surya-ocr-py3.12) root@tf:/mnt/d/surya#
At this point, running surya_gui will launch its web page. Normally, you should see the following output:
(surya-ocr-py3.12) root@tf:/mnt/d/surya# surya_gui
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://192.168.1.109:8501
gio: http://localhost:8501: Operation not supported
Loaded detection model /mnt/d/surya/vikp/surya_det3 on device cpu with dtype torch.float32
Loaded recognition model /mnt/d/surya/vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded detection model /mnt/d/surya/vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model /mnt/d/surya/vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model /mnt/d/surya/vikp/surya_tablerec on device cpu with dtype torch.float32
Visiting localhost:8501 should display a page like this:
However, it may not work properly. This is often due to it failing to automatically download the weight model from huggingface.co. In this case, you will need to manually download the model and place it in a fixed location.
From the error message, it indicates that it cannot load the model. Following the code leads to surya/settings.py.
# Text detection
DETECTOR_MODEL_CHECKPOINT: str = "vikp/surya_det3"
DETECTOR_BENCH_DATASET_NAME: str = "vikp/doclaynet_bench"
# Text recognition
RECOGNITION_MODEL_CHECKPOINT: str = "vikp/surya_rec2"
RECOGNITION_BENCH_DATASET_NAME: str = "vikp/rec_bench"
# Layout
LAYOUT_MODEL_CHECKPOINT: str = "vikp/surya_layout3"
LAYOUT_BENCH_DATASET_NAME: str = "vikp/publaynet_bench"
# Ordering
ORDER_MODEL_CHECKPOINT: str = "vikp/surya_order"
ORDER_BENCH_DATASET_NAME: str = "vikp/order_bench"
# Table Rec
TABLE_REC_MODEL_CHECKPOINT: str = "vikp/surya_tablerec"
TABLE_REC_BENCH_DATASET_NAME: str = "vikp/fintabnet_bench"
……
These settings cover the five main functions (detection, recognition, layout, ordering, table) and configure the model weights and dataset paths for each. Normally, the weights are downloaded and cached automatically. Here, we need to download them and configure them manually. To download, go to huggingface.co/vikp and find the corresponding model files.
Download the model file you need, meaning if you want to use a specific function, download that model. This can be tricky for newcomers since some functions are interdependent. For instance, table recognition often requires first detecting the table area before recognizing row and column areas. In practice, it will go through several models. Therefore, if you are unfamiliar, it’s best to download all the MODEL_CHECKPOINTs.
The DATASET_NAME entries are datasets (the names suggest they are benchmark sets). Download them only if you want to retrain; if you never call the training code, skipping them will not cause an error.
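If clicking through the website is tedious, the download can also be scripted. The sketch below assumes the huggingface_hub package is installed and that the machine running it can reach huggingface.co (for example, a different machine from which you then copy the folders). The helper names and the folder layout under the project root are my own choices, not something Surya prescribes.

```python
from pathlib import Path

# Checkpoint ids as they appear in surya/settings.py above.
CHECKPOINTS = [
    "vikp/surya_det3",
    "vikp/surya_rec2",
    "vikp/surya_layout3",
    "vikp/surya_order",
    "vikp/surya_tablerec",
]


def local_dir_for(repo_id, project_root):
    """Hypothetical layout: <project_root>/vikp/<model_name>."""
    return Path(project_root) / repo_id


def download_all(project_root):
    """Fetch every checkpoint; needs `pip install huggingface_hub` and network access."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id in CHECKPOINTS:
        target = local_dir_for(repo_id, project_root)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        print(f"{repo_id} -> {target}")


# Example call (not executed here): download_all("/mnt/d/surya")
print(local_dir_for("vikp/surya_det3", "/mnt/d/surya"))
```

Downloading all five checkpoints avoids the dependency problem described above, where one function quietly relies on another model.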
You can download the weight files to the root directory of the project. Then make the following configurations:
Change vikp/surya_det3 to
os.path.join(BASE_DIR, "vikp/surya_det3")
Since BASE_DIR is defined as the project root directory, this path is correct.
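A slightly more defensive variant of the same idea is to use the local folder only when it actually exists and otherwise keep the original id. The resolve_checkpoint helper below is hypothetical and is not part of Surya’s settings.py; it is only a sketch of how the path substitution behaves.

```python
import os
import tempfile


def resolve_checkpoint(repo_id, base_dir):
    """Prefer a local copy under base_dir; fall back to the original id."""
    local_path = os.path.join(base_dir, repo_id)
    return local_path if os.path.isdir(local_path) else repo_id


# Demonstration with a temporary folder standing in for the project root.
with tempfile.TemporaryDirectory() as base_dir:
    os.makedirs(os.path.join(base_dir, "vikp", "surya_det3"))

    found = resolve_checkpoint("vikp/surya_det3", base_dir)
    missing = resolve_checkpoint("vikp/surya_rec2", base_dir)
    found_is_local = found == os.path.join(base_dir, "vikp/surya_det3")

print(found_is_local)  # True
print(missing)         # vikp/surya_rec2
```

With this pattern a checkpoint that has not been downloaded yet still falls back to the automatic download instead of failing on a missing folder.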
After that, running surya_gui should work normally.
Visit localhost:8501 to upload files for testing the five major functions.
It will display the corresponding results.
In the console, it will also output the operation types and time consumed:
Detecting bboxes: 100%|███████| 1/1 [00:02<00:00, 2.61s/it]
Detecting bboxes: 100%|███████| 1/1 [00:02<00:00, 2.06s/it]
Detecting bboxes: 100%|███████| 1/1 [00:02<00:00, 2.44s/it]
Recognizing tables: 100%|███████| 1/1 [00:01<00:00, 1.19s/it]
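To see where the time goes, the progress lines can be summed per stage. The sketch below parses the console output above with a regular expression. It assumes each line reports a single iteration (1/1), so the s/it figure is treated as the elapsed time of that step.

```python
import re
from collections import defaultdict

# Console lines copied from the run above.
LOG = """\
Detecting bboxes: 100%|███████| 1/1 [00:02<00:00, 2.61s/it]
Detecting bboxes: 100%|███████| 1/1 [00:02<00:00, 2.06s/it]
Detecting bboxes: 100%|███████| 1/1 [00:02<00:00, 2.44s/it]
Recognizing tables: 100%|███████| 1/1 [00:01<00:00, 1.19s/it]
"""

# Stage name before the first colon, seconds-per-iteration before "s/it]".
PATTERN = re.compile(r"^(?P<stage>[^:]+):.*?(?P<secs>\d+(?:\.\d+)?)s/it\]")


def seconds_per_stage(log):
    """Sum the reported seconds for each stage name."""
    totals = defaultdict(float)
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            totals[match["stage"]] += float(match["secs"])
    return {stage: round(total, 2) for stage, total in totals.items()}


timings = seconds_per_stage(LOG)
print(timings)  # {'Detecting bboxes': 7.11, 'Recognizing tables': 1.19}
```

In this run the three detection passes account for most of the time on CPU, while the table recognition step itself takes a little over a second.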
From here, you can study its source code. You can modify a bit of code, run it, and see what changes. The specific functional modules and their corresponding code are explained in the official README.md. Whether you want to expose its capabilities through an interface, modify internal functions, or retrain on your own data, you now have a starting point.
5. Conclusion
Excellent open-source projects are like a well-built shell of a house; compared to commercial software, they often lack comfortable living conditions. However, they have a solid foundation, a reasonable structure, and excellent quality. To thrive, they need someone to do the interior decoration. Conversely, some commercial software, despite its fancy decoration, may just have a flimsy frame.
Why do we say that we are now in an era where data is king? From the discussion above, it can be seen that, within a given time and space, the algorithms are public and computing power can be bought, but data is the hard part. Only with good and abundant data can good AI models be produced.
I am an IT guy, someone who enjoys practical applications.