This article is reproduced from Machine Heart.
How do you build an intelligent car system without modifying the car? For some time, the author, Robert Lucian Chiriac, had been thinking about how to give a car the ability to detect and recognize objects. The idea is appealing because we have already seen what Tesla can do, and even though he could not buy a Tesla right away (the Model 3 does look more and more attractive), he wanted to work toward that dream. So he built it with a Raspberry Pi that, mounted on the car, detects license plates in real time. Below we walk through each step of the project and give the GitHub address; the linked repository is only the client tool, while the datasets and pre-trained models can be found at the end of the original blog post.
Now, let's see how Robert Lucian Chiriac built a useful on-board detection and recognition system, step by step. Here is a picture of the finished product.

Step 1: Define the project scope

Before starting, the first question that came to mind was what this system should be able to do. If I have learned anything in life, it is that taking things step by step is always the best strategy. So, beyond the basic vision task, all I needed was to clearly recognize license plates while driving. This recognition process involves two steps:
Detect the license plate.
Recognize the text within each license plate bounding box.
I think that if I can accomplish these two tasks, other similar tasks (such as determining collision risk, distance, and so on) will be much easier. I might even be able to create a vector space to represent the surrounding environment, which sounds pretty cool just thinking about it. Before settling on the details, I knew I would need:
A machine learning model that takes unlabelled images as input to detect license plates;
Some hardware. In simple terms, I need a computer system connected to one or more cameras to call my model.
Let's start with the first item: building the object detection model.

Step 2: Select the right model

After careful research, I decided to use these machine learning models:
YOLOv3 – one of the fastest models available today, with an mAP comparable to other state-of-the-art models. We use it to detect objects;
CRAFT text detector – We use it to detect text in images;
CRNN – simply put, a recurrent convolutional neural network. The detected characters have to be treated as sequential data so they can be arranged into words in the correct order;
How do these three models work together? The flow is as follows (a minimal code sketch of the pipeline appears after this list):
First, the YOLOv3 model receives frames from the camera and finds the license plate bounding boxes in each frame. Very tight predicted bounding boxes are not recommended; the box should be a bit larger than the detected object, because a crop that is too tight hurts the performance of the later stages;
The text detector receives the license plate crops from YOLOv3. If a bounding box is too small, part of the plate text will likely have been cropped away and the prediction will suffer. With a slightly enlarged box, the CRAFT model can detect where the letters are, giving a very accurate position for each letter;
Finally, we can pass the bounding boxes of each word from CRAFT to the CRNN model to predict the actual words.
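To make this flow concrete, here is a minimal Python sketch of the three-stage pipeline. It is an illustration rather than the author's code: detect_plates is a hypothetical stand-in for the YOLOv3 detector (the concrete model is chosen in Step 4), the text stages use keras-ocr's bundled CRAFT + CRNN pipeline (which the author adopts later), and the padding reflects the advice above about not cropping the plates too tightly.

import keras_ocr

# keras-ocr bundles CRAFT (text detection) and CRNN (text recognition).
ocr = keras_ocr.pipeline.Pipeline()

def detect_plates(frame):
    """Hypothetical YOLOv3 wrapper: returns plate boxes as (x1, y1, x2, y2)."""
    raise NotImplementedError

def read_plates(frame, pad=0.15):
    """frame: an RGB image as an (H, W, 3) numpy array."""
    h, w = frame.shape[:2]
    plates = []
    for x1, y1, x2, y2 in detect_plates(frame):
        # Enlarge the box a little so no plate characters get cut off.
        dx, dy = int((x2 - x1) * pad), int((y2 - y1) * pad)
        crop = frame[max(0, y1 - dy):min(h, y2 + dy),
                     max(0, x1 - dx):min(w, x2 + dx)]
        # CRAFT locates each word/letter box, CRNN reads the text.
        words = ocr.recognize([crop])[0]          # list of (text, box) pairs
        plates.append(" ".join(text for text, box in words))
    return plates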
With a rough sketch of the model architecture, I could start working on the hardware.

Step 3: Select hardware

When I realized I needed low-power hardware, I thought of my old love: the Raspberry Pi. It has a dedicated camera, the Pi Camera, and enough compute to preprocess frames at a decent frame rate. The Pi Camera is the physical camera for the Raspberry Pi and comes with a mature, complete library.

For internet access, I could use 4G connectivity through an EC25-E module; I had already used its GPS module in a previous project, described here:

Blog address: https://www.robertlucian.com/2018/08/29/mobile-network-access-rpi/

Then I needed to design the enclosure. Hanging it on the car's rear-view mirror should work, so I ended up designing a support structure split into two parts:
On the rear-view-mirror side sit the Raspberry Pi, the GPS module, and the 4G module. You can check my article about the EC25-E module for the GPS and 4G antennas I used;
On the other side, an arm with a ball joint supports the Pi Camera.
I will print these parts on my trusty Prusa i3 MK3S 3D printer; the printing parameters are also provided at the end of the original post.

Figure 1: Shape of the Raspberry Pi + 4G/GPS shell

Figure 2: Using a ball-joint arm to support the Pi Camera

Figures 1 and 2 show the parts as rendered. Note that the C-shaped bracket is pluggable, so the Raspberry Pi housing and the Pi Camera support are not printed together with the bracket; they share a socket that the bracket plugs into. This is very handy for any reader who wants to replicate the project: they only need to adapt the bracket to their own rear-view mirror. Currently, the mount works well on my car (a Land Rover Freelander).

Figure 3: Side view of the Pi Camera support structure

Figure 4: Front view of the Pi Camera support structure and RPi base

Figure 5: Expected camera field of view

Figure 6: Close-up of the embedded system, with the Raspberry Pi, built-in 4G/GPS module, and Pi Camera

Clearly, modelling these parts takes some time, and I needed several attempts to get a sturdy structure. I printed in PETG at a 200-micron layer height. PETG holds up well at 80-90 degrees Celsius and has strong UV resistance; not as good as ASA, but still strong.

Everything was designed in SolidWorks, so all my SLDPRT/SLDASM files, along with the STLs and G-code, can be found at the end of the original post, and you can use them to print your own version.

Step 4: Train the model

Now that the hardware was settled, it was time to train the models. As everyone knows, it is best to stand on the shoulders of giants; that is the core of transfer learning: first learn from a very large dataset, then reuse the knowledge gained there.

YOLOv3

I found many pre-trained license plate models online, though not as many as I had initially expected, but I did find one trained on 3,600 license plate images. The training set is not large, but it is better than nothing, and since it was trained on top of Darknet's pre-trained weights, I could use it directly.

Model address: https://github.com/ThorPham/License-plate-detection

Since I already had hardware that could record, I decided to drive around town for a few hours to collect new video frames and fine-tune the model above. I used VOTT to label the frames containing license plates, ending up with a small dataset of 534 images, each with labeled bounding boxes for the plates.

Dataset address: https://github.com/RobertLucian/license-plate-dataset

I then found a Keras implementation of YOLOv3, used it to train on my dataset, and submitted my model back to that repo so others can use it too. I ended up with an mAP of 90% on the test set, which is a good result considering how small my dataset is.
Keras implementation: https://github.com/experiencor/keras-yolo3
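As a rough illustration only (not the author's actual inference code, and with a hypothetical weights file name), running a frame through a fine-tuned Keras YOLOv3 model looks roughly like this; decoding the raw network output into boxes relies on utilities in the keras-yolo3 repo, which are omitted here:

import numpy as np
import cv2
from tensorflow.keras.models import load_model

model = load_model("license_plate_yolov3.h5")   # hypothetical path to the fine-tuned model

frame = cv2.imread("frame.jpg")
net_in = cv2.resize(frame, (416, 416)).astype(np.float32) / 255.0   # YOLOv3 input size
net_in = np.expand_dims(net_in, axis=0)                             # batch of one frame

raw_out = model.predict(net_in)   # one output tensor per YOLO detection scale
# The repo's decode_netout(...) and non-max suppression would turn raw_out into plate boxes.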
CRAFT & CRNN

To find a suitable network for text recognition, I went through countless trials and eventually stumbled upon keras-ocr, which packages CRAFT and CRNN together, is very flexible, and ships with pre-trained models, which is fantastic. I decided not to fine-tune them and to use them as they are.

keras-ocr address: https://github.com/faustomorales/keras-ocr

Best of all, predicting text with keras-ocr takes only a few lines of code; you can check the project homepage to see how it is done.

Step 5: Deploy my license plate detection model

There are two main ways to deploy the models:
Perform all inference locally;
Perform inference in the cloud.
Both approaches have their challenges. The first means shipping a central "brain" computer system, which is complex and expensive. The second brings latency and infrastructure challenges, especially when using GPUs for inference.

During my research I stumbled upon an open-source project called cortex. It is a newcomer in the AI field, but as the next evolutionary step for AI development tools it undoubtedly makes sense.

Cortex project address: https://github.com/cortexlabs/cortex

Basically, cortex is a platform for deploying machine learning models as production web services. That means I can focus on my application and leave the rest to cortex. It handles all the provisioning on AWS, and all I have to do is write predictors using its templates. Even better, each model only needs a few dozen lines of code.

Below is a terminal recording of cortex at work, taken from the GitHub repo. If this isn't elegant and concise, I don't know what to call it:

Since this computer vision system is not meant for autonomous driving, latency matters much less to me, so I can use cortex here. If it were part of an autonomous driving system, relying on a cloud provider's services would not be a good idea, at least not today.

Deploying ML models with cortex only requires:
Define a cortex.yaml file, the configuration file for our APIs (a sketch follows this list). Each API handles one type of task: I gave the yolov3 API the job of detecting license plate bounding boxes in a given frame, while the crnn API predicts the plate number with the help of the CRAFT text detector and CRNN;
Define a predictor for each API. Basically, you just define a predict method on a specific class in cortex that receives a payload (all the server plumbing is already handled by the platform), runs the prediction on that payload, and returns the result. It's that simple!
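For orientation, the cortex.yaml for the two APIs might look roughly like the sketch below. This is an illustrative reconstruction, not the author's actual file: field names vary between cortex versions, and the paths, bucket, and key values are placeholders.

# cortex.yaml (illustrative only; exact fields depend on the cortex version)
- name: yolov3
  predictor:
    type: python
    path: yolov3_predictor.py           # hypothetical predictor file
    config:
      bucket: my-models-bucket          # placeholder S3 bucket
      key: yolov3/license-plates.h5     # placeholder model key
  compute:
    gpu: 1

- name: crnn
  predictor:
    type: python
    path: crnn_predictor.py             # hypothetical predictor file
  compute:
    gpu: 1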
Here is an example of a predictor for the classic iris dataset; due to space constraints I won't go into the details here. You can find how to use the two APIs in the project link, and the remaining resources for this project are at the end of this article.

# predictor.py
import boto3
import pickle

labels = ["setosa", "versicolor", "virginica"]

class PythonPredictor:
    def __init__(self, config):
        s3 = boto3.client("s3")
        s3.download_file(config["bucket"], config["key"], "model.pkl")
        self.model = pickle.load(open("model.pkl", "rb"))

    def predict(self, payload):
        measurements = [
            payload["sepal_length"],
            payload["sepal_width"],
            payload["petal_length"],
            payload["petal_width"],
        ]
        label_id = self.model.predict([measurements])[0]
        return labels[label_id]

To make a prediction, you just use curl like this:

curl http://***.amazonaws.com/iris-classifier \
-X POST -H "Content-Type: application/json" \
-d '{"sepal_length": 5.2, "sepal_width": 3.6, "petal_length": 1.4, "petal_width": 0.3}'

You then get back a response like setosa. Very simple!

Step 6: Develop the client

With cortex handling deployment, I could start designing the client, which was the trickier part. I came up with the following architecture (a simplified code sketch follows this list):
Collect frames from the Pi Camera at an acceptable resolution (800×450 or 480×270) at 30 FPS and push each frame onto a shared queue;
In a separate process, take frames off the queue and distribute them to a pool of workers running on different threads;
Each worker thread (or inference thread, as I call it) makes API requests to my cortex APIs: first a request to the yolov3 API, then, if any license plates are detected, another request with a batch of cropped plates to the crnn API. The predicted plate numbers come back as text;
Push each detected license plate (with or without the recognized text) to another queue, which will ultimately broadcast it to the browser page. Meanwhile, the predicted license plate numbers will also be pushed to another queue to be saved to disk in CSV format later;
The broadcast queue receives frames in no particular order. Its consumer places them in a very small buffer (a few frames in size), reorders them, and broadcasts each new frame to the client. This consumer runs separately in another process, and it must also try to keep the queue at a fixed size so frames are displayed at a consistent frame rate; if the queue shrinks, the frame rate drops proportionally, and vice versa;
Meanwhile, another thread will run in the main process to fetch predictions and GPS data from another queue. When the client receives a termination signal, predictions, GPS data, and time will also be saved to a CSV file.
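To make the architecture above concrete, here is a heavily simplified sketch of the capture-and-inference part of the client. It is not the code from the actual client repo: the API URLs and request/response formats are placeholders, and the reordering buffer, broadcast, GPS, and CSV consumers are omitted.

import base64
import io
import threading
from queue import Queue

import cv2
import numpy as np
import picamera
import requests

YOLO_API = "https://***.amazonaws.com/yolov3"   # placeholder endpoint
CRNN_API = "https://***.amazonaws.com/crnn"     # placeholder endpoint

frames = Queue(maxsize=60)    # JPEG frames captured from the Pi Camera
results = Queue()             # recognized plate numbers

def capture(camera):
    """Producer: push each JPEG frame from the Pi Camera onto the shared queue."""
    stream = io.BytesIO()
    for _ in camera.capture_continuous(stream, format="jpeg", use_video_port=True):
        frames.put(stream.getvalue())
        stream.seek(0)
        stream.truncate()

def inference_worker():
    """Consumer: yolov3 API first, then a batch of cropped plates to the crnn API."""
    while True:
        jpeg = frames.get()
        boxes = requests.post(YOLO_API, data=jpeg,
                              headers={"Content-Type": "application/octet-stream"}).json()
        if boxes:  # assumed response shape: a list of [x1, y1, x2, y2] plate boxes
            frame = cv2.imdecode(np.frombuffer(jpeg, np.uint8), cv2.IMREAD_COLOR)
            crops = [frame[y1:y2, x1:x2] for x1, y1, x2, y2 in boxes]
            payload = [base64.b64encode(cv2.imencode(".jpg", c)[1]).decode()
                       for c in crops]
            plates = requests.post(CRNN_API, json={"plates": payload}).json()
            results.put(plates)

camera = picamera.PiCamera(resolution=(480, 270), framerate=30)
threading.Thread(target=capture, args=(camera,), daemon=True).start()
for _ in range(4):            # several inference threads working in parallel
    threading.Thread(target=inference_worker, daemon=True).start()

threading.Event().wait()      # keep the main process alive in this sketch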
The following diagram illustrates the flow between the client and the cloud APIs served by cortex.

Figure 7: Flow between the cloud APIs provided by cortex and the client

In our case, the client is the Raspberry Pi, and the cloud APIs that the inference requests go to are provided by cortex on AWS.

The source code of the client is also on GitHub: https://github.com/robertlucian/cortex-licens-plate-reader-client

One challenge I had to overcome was the 4G bandwidth. It is best to minimize the bandwidth this application needs, to reduce possible hangs and to avoid overusing the available data. I decided to run the Pi Camera at a very low resolution, 480×270 (we can get away with a small resolution because the Pi Camera's field of view is very narrow, so license plates are still easy to recognize).

However, even at this resolution, each JPEG frame is about 100 KB (0.8 Mbit). Multiply that by 30 frames per second and you get 3,000 KB/s, about 24 Mbit/s, and that is without HTTP overhead; it is a lot.

So I used a few tricks (a preprocessing sketch follows this list):
Reduce the width to 416 pixels, the input size the YOLOv3 model needs, keeping the aspect ratio intact;
Convert the image to grayscale;
Remove the top 45% of the image. The idea here is that license plates do not appear at the top of the frame since cars don’t fly, right? As far as I know, removing 45% of the image does not affect the predictor’s performance;
Convert the image back to JPEG, but this time the quality is much lower.
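A minimal sketch of these preprocessing tricks, assuming OpenCV on the client; the exact implementation in the client repo may differ, and the JPEG quality value here is only a guess at "much lower":

import cv2

def shrink_frame(frame, jpeg_quality=20):
    """Apply the bandwidth tricks: resize to width 416, grayscale,
    drop the top 45%, and re-encode as a low-quality JPEG."""
    h, w = frame.shape[:2]
    new_w = 416
    new_h = int(h * new_w / w)                       # keep the aspect ratio
    frame = cv2.resize(frame, (new_w, new_h))
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # grayscale
    frame = frame[int(new_h * 0.45):, :]             # remove the top 45%
    ok, jpeg = cv2.imencode(".jpg", frame,
                            [int(cv2.IMWRITE_JPEG_QUALITY), jpeg_quality])
    return jpeg.tobytes() if ok else None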
The final frame size is about 7-10 KB, which is excellent. That corresponds to about 2.8 Mbit/s, or roughly 3.5 Mbit/s with all the overhead. For the crnn API, the cropped license plates do not need much space: even uncompressed they are only about 2-3 KB each.

In summary, running at 30 FPS, the bandwidth needed for the inference APIs is around 6 Mbit/s, a number I can live with.

Results

Success! The above is an example of real-time inference through cortex. I needed about 20 GPU-equipped instances to run it smoothly; depending on the latency of the GPUs, you may need more or fewer instances. The average latency from capturing a frame to broadcasting it to the browser window is about 0.9 seconds, which is amazing given that the inference happens far away; I am still surprised by it.

The text recognition part may not be the best, but it at least proves the point: it could be made more accurate by increasing the video resolution, narrowing the camera's field of view, or fine-tuning the models.

As for the high GPU demand, that can be addressed through optimization, for example by using mixed precision or full half precision (FP16/BFP16) in the models. Mixed precision generally has minimal impact on accuracy, so there is little trade-off to make.

In summary, with all optimizations in place, reducing the number of GPUs from 20 down to one looks feasible; properly optimized, even a single GPU might not be fully utilized.
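The article does not show how mixed precision would be wired in; as one hedged illustration, in TensorFlow/Keras (TF 2.4 or newer, which the Keras-based models here could run on) a global policy can be set like this before the models are built:

import tensorflow as tf

# Most layer computations run in float16 while variables stay in float32,
# which usually cuts GPU memory use and inference time with little accuracy loss.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Layers and models created after this point pick up the policy.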