Is C Language Being Phased Out?

Source丨Reprinted with permission from OSC Open Source Community (ID: oschina2013)

Author丨 Bai Kaishui

Researchers from Carnegie Mellon University have released PolyCoder, an open-source automatic code generation model with 2.7 billion parameters, based on the GPT-2 architecture. It was trained on a single machine using 249GB of code spanning 12 programming languages. The results indicate that, in the C programming language, PolyCoder outperforms all other models evaluated, including Codex.

“Recently, large language models (LMs) for code have shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (such as Codex) are not publicly available, leaving many questions about their model and data design decisions. Our goal is to fill some of these gaps through a systematic evaluation of the largest existing models across various programming languages: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot. Although Codex itself is not open-source, we found that existing open-source models achieve comparable results in some programming languages, even though they are primarily aimed at natural language modeling. We further identified a significant missing piece: a large open-source model trained specifically on a multilingual code corpus.”

The researchers noted that OpenAI’s Codex, released last August, is available through Microsoft’s GitHub Copilot tool, but only as “non-free access” to model outputs via black-box API calls; the model weights and training data are not available. DeepMind claims that its recently released AlphaCode code generator ranked within the top 54.3% of participants in programming competitions with human contestants, but training that model required “hundreds of petaFLOPS days” in Google’s data centers.

“Despite the tremendous success of large language code models, the strongest models have yet to be made public. This hinders the application of these models outside resource-rich companies and limits research in this field for resource-poor organizations.”

To address this issue, they introduced PolyCoder. The model was trained on data from multiple GitHub repositories covering 12 popular programming languages: C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, and TypeScript. The unfiltered dataset contains 631GB of data in 38.9 million files. Because of budget constraints, the researchers chose the GPT-2 architecture for PolyCoder.

However, while PolyCoder outperforms all other models in C, Codex still surpasses PolyCoder in the other languages.

“It is worth noting that PolyCoder outperforms Codex and all other models in the C language. When comparing only the open-source models, PolyCoder performs better than the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala, and TypeScript. In the other 11 languages besides C, all other open-source models, including ours, are significantly worse than Codex (higher perplexity).”

For more details, see: https://arxiv.org/pdf/2202.13169.pdf
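For readers unfamiliar with the metric in the quote above: perplexity is commonly defined as the exponential of the average negative log-probability a model assigns to each token, so lower is better. The sketch below illustrates only that general definition. The function name and the probability values are hypothetical, and the paper's exact tokenization and normalization may differ.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token probabilities two models assign to the same code snippet.
# These numbers are made up for illustration; they are not from the paper.
confident_model = [math.log(0.5)] * 4   # each token predicted with p = 0.5
uncertain_model = [math.log(0.1)] * 4   # each token predicted with p = 0.1

print(perplexity(confident_model))  # lower perplexity: better fit to the snippet
print(perplexity(uncertain_model))  # higher perplexity: worse fit to the snippet
```

Under this definition, a model that assigns higher probability to the actual next tokens of a C file gets a lower perplexity on C, which is the sense in which one model "outperforms" another in the comparison quoted above.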

Because the model does best in C, some netizens joked that C will be the first language to be phased out.
