Exploring the New ChatGLM Language Model

I have grown tired of the Stable Diffusion setup I was using before; by now I am a well-known artist with thousands of followers on Pixiv. Today I picked up a new toy: ChatGLM!

It is a large-scale bilingual language model with question-answering and dialogue capabilities, optimized for Chinese. The hosted version is currently in an invitation-only beta that will gradually expand. Meanwhile, following the open-source release of the GLM-130B base model, the developers have officially open-sourced the latest bilingual dialogue GLM model, ChatGLM-6B. With model quantization, users can deploy it locally on consumer-grade graphics cards (only 6GB of VRAM is required at the INT4 quantization level). After training on approximately 1 trillion tokens of Chinese and English text, supplemented by supervised fine-tuning, self-feedback, and reinforcement learning from human feedback, the 6.2-billion-parameter ChatGLM-6B is much smaller than the 100-billion-scale models, but it significantly lowers the deployment barrier for users and can already generate responses that align well with human preferences.
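To get a feel for why quantization matters here, the weight storage can be estimated as parameters times bits per parameter divided by 8. The snippet below is only a back-of-the-envelope sketch: the `weight_gb` helper is my own hypothetical name, and it counts the weights alone, ignoring activations, the attention cache, and framework overhead.

```python
# Rough weight-only memory estimate: parameters x bits per parameter / 8.
# This deliberately ignores activations, caches, and CUDA context overhead.
PARAMS = 6.2e9  # parameter count quoted above for ChatGLM-6B


def weight_gb(params: float, bits: int) -> float:
    """Return weight storage in GB (10^9 bytes) for a given bit width."""
    return params * bits / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(PARAMS, bits):.1f} GB")
```

By this estimate the INT4 weights take roughly 3.1 GB, so the 6GB figure quoted above presumably includes runtime overhead on top of the weights. That reading is my assumption, not something the release notes spell out.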

On my laptop with a single RTX 3070 (8GB), I can load the model, but it throws an out-of-memory error during inference, exceeding the VRAM limit.

Any graphics card at the level of an RTX 3060 12GB or above can run ChatGLM-6B. Depending on how much VRAM the card has, the model can be loaded at different quantization precisions. The minimum VRAM required for each precision is as follows:

Quantization Level	Minimum GPU VRAM	Quantization Code
FP16 (No Quantization)	13 GB	model.half().cuda()
INT8	10 GB	model.half().quantize(8).cuda()
INT4	6 GB	model.half().quantize(4).cuda()
—	CPU	model.float()
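The table can be turned into a small loader. This is a minimal sketch, not the official loading script: `pick_precision` and `load_chatglm` are hypothetical helper names, the thresholds are simply the minimums from the table, and the Hugging Face model id `THUDM/chatglm-6b` is an assumption on my part. The `quantize()` call comes from the model's own remote code, which is why `trust_remote_code=True` is needed.

```python
def pick_precision(vram_gb: float) -> str:
    """Pick the highest precision whose minimum VRAM in the table fits."""
    if vram_gb >= 13:
        return "fp16"
    if vram_gb >= 10:
        return "int8"
    if vram_gb >= 6:
        return "int4"
    return "cpu"


def load_chatglm(precision: str, name: str = "THUDM/chatglm-6b"):
    """Load tokenizer and model using the quantization code from the table.

    The model id is an assumption; replace it with a local path if needed.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True)
    if precision == "fp16":
        model = model.half().cuda()
    elif precision == "int8":
        model = model.half().quantize(8).cuda()
    elif precision == "int4":
        model = model.half().quantize(4).cuda()
    else:
        model = model.float()  # CPU fallback: no GPU, full precision
    return tokenizer, model.eval()


# Hypothetical usage on an 8GB card (downloads the model on first run):
# tokenizer, model = load_chatglm(pick_precision(8))
# response, history = model.chat(tokenizer, "Hello", history=[])
```

Keep in mind that these are minimums. Memory use grows with the length of the conversation, so a card sitting right at a threshold can still run out of memory during inference.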

Behind ChatGLM is GLM-130B, a 130-billion-parameter bilingual dense base model, which has some unique advantages:

Bilingual: Supports both Chinese and English.
High Accuracy (English): Outperforms GPT-3 175B (API: davinci, base model), OPT-175B, and BLOOM-176B on the public English natural language benchmarks LAMBADA, MMLU, and BIG-bench-lite.
High Accuracy (Chinese): Significantly outperforms ERNIE TITAN 3.0 260B and YUAN 1.0-245B on 7 zero-shot CLUE datasets and 5 zero-shot FewCLUE datasets.
Fast Inference: The first 100-billion-scale model to achieve INT4 quantization, supporting fast and nearly lossless inference on a single server with 4 RTX 3090 or 8 RTX 2080 Ti cards.
Reproducibility: All results (over 30 tasks) can be reproduced using the open-source code and model parameters.
Cross-Platform: Supports training and inference on China-made Hygon DCU, Huawei Ascend 910, and Sunway processors, as well as NVIDIA chips from the US.

Currently, I am trying to feed it data from the field I want to specialize in, hoping to turn it into a small assistant for that area (a writer of romantic stories, perhaps?). Haha, time to go back to Pixiv and find myself a new title.

Finally, on the investment side, this is the only piece of news I have seen.

Lingyun Guang (688400.SH) stated on the investor interaction platform on March 28 that the company has applied ChatGLM to intelligent content creation for digital humans, improving the efficiency and quality of intelligent voice generation and enabling purely AI-driven voice generation for digital humans.
