Efficient Pose Estimation Inference with LitePose

Hi everyone, I'm Lite. I recently shared Parts 1 through 19 of the Efficient Large Model Full-Stack Technology series, covering large model quantization and fine-tuning, efficient LLM inference, quantum computing, generative AI acceleration, and more. The most recent installment is linked below:
Efficient Large Model Full-Stack Technology (Part 19): Efficient Training and Inference Framework | TorchSparse++: Efficient Training and Inference Framework for Sparse Convolutions on GPU
I also recently shared Parts 1 through 10 of the TinyML Project series, covering efficient ML systems, efficient CNN algorithm and system co-design, efficient Transformers, efficient CV task inference, and more. The content links are as follows:
TinyML Project (Part 1): Efficient ML System | TinyChat: Visual Language Model and Edge AI 2.0
TinyML Project (Part 2): Efficient ML System | TinyChat Engine: On-device LLM Inference Library
TinyML Project (Part 3): Efficient CNN Algorithm & System Co-Design | MCUNetV3: On-Device Training Under 256KB Memory
TinyML Project (Part 4): Efficient CNN Algorithm & System Co-Design | TinyTL: Reducing Activations, Not Trainable Parameters, for Efficient On-Device Learning
TinyML Project (Part 5): Efficient LLM Inference | Block Sparse Attention
TinyML Project (Part 6): Efficient Intelligent Driving Applications | BEVFusion: Multi-task Multi-sensor Fusion with Unified Bird’s Eye View Representation
TinyML Project (Part 7): Efficient Transformer | Flattened Window Attention for Efficient Point Cloud Transformer
TinyML Project (Part 8): Efficient Transformer | Lite Transformer with Long-Short Range Attention
TinyML Project (Part 9): Efficient Transformer | SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution ViT
TinyML Project (Part 10): Efficient Point Cloud Inference | TorchSparse: Efficient Point Cloud Inference Engine

Article Summary

This article introduces LitePose, an efficient architecture design for real-time multi-person 2D human pose estimation on resource-constrained edge devices.

1. Background and Challenges:

– Human pose estimation plays a crucial role in many visual applications that require understanding human behavior.

– Existing HRNet-based pose estimation models are computationally expensive and difficult to deploy on resource-constrained edge devices.

2. Research Objectives:

– Design an efficient architecture for real-time multi-person pose estimation on edge devices while maintaining good performance.

3. Method Introduction:

– Gradual Shrinking Experiment: Gradually reducing the depth of the high-resolution branches reveals that, in the low-computation regime, shrinking them actually improves performance.

– LitePose Architecture: A single-branch architecture named LitePose was designed, with model capability enhanced by a fusion deconv head and large kernel convolutions.

– Fusion Deconv Head: Eliminates redundancy in the high-resolution branch, allowing for low-overhead scale-aware feature fusion.

– Large Kernel Convs: Significantly enlarge the receptive field and improve model capacity while keeping computation cost low.

4. Experimental Results:

– On the CrowdPose dataset, LitePose achieved a 2.8× reduction in MACs and up to a 5.0× reduction in latency compared with HRNet-based methods, while delivering better performance.

– On the COCO dataset, compared to EfficientHRNet, LitePose achieved up to a 2.9× reduction in latency while providing better performance.

5. Contribution Summary:

– Revealed the redundancy of high-resolution branches in the low-computation regime.

– Proposed the LitePose architecture and two techniques to enhance its capability: the fusion deconv head and large kernel convolutions.

– Demonstrated the effectiveness of LitePose through extensive experiments on two benchmark datasets.

Efficient Pose Estimation Inference with LitePose

Article link: https://arxiv.org/pdf/2205.01271

Project link: https://github.com/mit-han-lab/litepose


Article Methods

1. Method Overview:

– LitePose is an efficient single-branch architecture obtained by eliminating the redundancy of high-resolution branches.

– It introduces two techniques, the fusion deconv head and large kernel convolutions, to enhance model capability.

2. Key Technologies:

– Gradual Shrinking Experiment:

– Gradually reducing the depth of the high-resolution branches shows that model performance actually improves in the low-computation regime.

– This indicates that single-branch architectures are more efficient in the low-computation regime.

– Fusion Deconv Head:

– Removes redundancy in the high-resolution branch, allowing for low-overhead scale-aware feature fusion.

– Directly uses low-level high-resolution features for deconvolution and final prediction, avoiding the redundancy of multi-branch HR fusion modules (a PyTorch sketch follows this list).

– Large Kernel Convs:

– In pose estimation, large kernel convolutions deliver a substantially greater performance boost than small kernels.

– Replacing small kernels with 7×7 convolutions improved mAP on the CrowdPose dataset by 14% at only a 25% increase in computation cost (see the second sketch below).
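
Below is a minimal PyTorch sketch of the fusion deconv head idea, assuming a single-branch backbone that exposes low-level high-resolution feature maps as skip connections; the class name, channel sizes, and two-stage layout are illustrative, not the official LitePose implementation:

```python
import torch
import torch.nn as nn

class FusionDeconvHead(nn.Module):
    """Sketch of a fusion deconv head: upsample the deepest feature with
    deconvolutions, and at each scale concatenate a low-level high-resolution
    feature from the backbone (illustrative, not the official code)."""

    def __init__(self, deep_ch, skip_chs, num_joints, mid_ch=32):
        super().__init__()
        self.deconvs = nn.ModuleList()
        in_ch = deep_ch
        for skip_ch in skip_chs:
            self.deconvs.append(nn.Sequential(
                nn.ConvTranspose2d(in_ch, mid_ch, 4, stride=2, padding=1),
                nn.BatchNorm2d(mid_ch),
                nn.ReLU(inplace=True),
            ))
            in_ch = mid_ch + skip_ch  # channels after concatenating the skip
        self.final = nn.Conv2d(in_ch, num_joints, kernel_size=1)

    def forward(self, deep, skips):
        x = deep
        for deconv, skip in zip(self.deconvs, skips):
            x = deconv(x)
            x = torch.cat([x, skip], dim=1)  # low-overhead scale-aware fusion
        return self.final(x)

# Hypothetical shapes: a 16x16 deep feature plus 32x32 and 64x64 skips
head = FusionDeconvHead(deep_ch=128, skip_chs=[24, 16], num_joints=17)
deep = torch.randn(1, 128, 16, 16)
skips = [torch.randn(1, 24, 32, 32), torch.randn(1, 16, 64, 64)]
print(head(deep, skips).shape)  # torch.Size([1, 17, 64, 64])
```

The point of the design is that multi-scale fusion reuses features the backbone has already computed, so scale awareness comes at almost no extra cost compared with maintaining a separate high-resolution branch.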

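And a sketch of a large-kernel building block, here a MobileNetV2-style inverted residual whose depthwise convolution uses a 7×7 kernel. Because the spatial kernel is depthwise, most of the block's MACs sit in the 1×1 pointwise layers, so growing the kernel from 3×3 to 7×7 enlarges the receptive field sharply while total computation grows only modestly (block name and hyperparameters are illustrative, not the official block):

```python
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Inverted-residual block with a large depthwise conv (sketch)."""

    def __init__(self, in_ch, out_ch, kernel_size=7, expand=6, stride=1):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),        # pointwise expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size, stride,  # large depthwise conv
                      padding=kernel_size // 2, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),       # pointwise project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```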

Experimental Results

1. CrowdPose Dataset:

– Model Performance: LitePose achieved better performance than existing methods on the CrowdPose dataset.

– Computational Efficiency: Compared to HRNet-based methods, LitePose achieved a 2.8× reduction in MACs and up to a 5.0× reduction in latency.

– Specific Data:

– HigherHRNet-W48: 63.8M parameters, 154.6 GMACs, AP of 65.9.

– LitePose-S: 2.7M parameters, 5.0 GMACs, AP of 58.3.

– Latency: On the Qualcomm Snapdragon 855, Raspberry Pi 4B, and NVIDIA Jetson Nano GPU, LitePose achieved latency reductions of 5.0×, 4.9×, and 5.0×, respectively (a timing sketch follows the ablation results below).

2. Microsoft COCO Dataset:

– Model Performance: LitePose also performed strongly on the COCO dataset, offering a better accuracy-efficiency trade-off than HRNet-based methods.

– Computational Efficiency: Compared to EfficientHRNet, LitePose achieved a 1.8× reduction in MACs and up to a 2.9× reduction in latency.

– Specific Data:

– HigherHRNet-W48: 63.8M parameters, 155.1 GMACs, AP of 69.9.

– LitePose-S: 2.7M parameters, 5.0 GMACs, AP of 56.8.

– Latency: On the Qualcomm Snapdragon 855, Raspberry Pi 4B, and NVIDIA Jetson Nano GPU, LitePose achieved latency reductions of 2.9×, 2.5×, and 2.3×, respectively.

3. Comparison with Lightweight OpenPose:

– Performance: LitePose outperformed Lightweight OpenPose by 14.0 AP on the COCO dataset.

– Latency: LitePose has lower latency on mobile platforms.

4. Ablation Study:

– Large Kernel Convolutions: The 7×7 large kernel convolution improved mAP on the CrowdPose dataset by 14% with a 25% increase in computation cost.

– Fusion Deconv Head: On the CrowdPose dataset, the fusion deconv head provided a +7.6 AP performance boost with minimal increase in computation cost.

– Neural Architecture Search: Through neural architecture search, LitePose achieved a +2.2 AP performance improvement on the CrowdPose dataset.
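
The latency numbers above come from on-device measurements. As a rough illustration of the usual warmup-then-average timing protocol (my own sketch, not the authors' benchmarking setup; desktop CPU timing will not match edge-device numbers, and the 448×448 input size is only an assumption):

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, input_size=(1, 3, 448, 448), warmup=10, runs=50):
    """Average single-input forward latency in milliseconds (sketch)."""
    model.eval()
    x = torch.randn(*input_size)
    for _ in range(warmup):       # warm up allocator/caches before timing
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1e3
```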


Final Thoughts

Finally, I recommend MIT HAN Lab's latest Fall 2024 course, 6.5940 (currently in session).
Course link: https://efficientml.ai
Courseware: https://pan.quark.cn/s/1324c20d7efd
Your likes, views, and follows are my greatest motivation to continue!

Add me on WeChat (ID: LiteAI01) for discussions on research, careers, and professional planning; please include a note with "research direction + school/region + name".
