StarCoder 2: A New AI-Powered Code Generator That Runs on Most GPUs
Demand for AI tools that make coding easier keeps growing. Capitalizing on that demand, AI firm Hugging Face teamed up with workflow company ServiceNow last year to create StarCoder, an open-source code generator released under a less restrictive license than many of its competitors. Since that debut, the two teams have been working on an improved version known as StarCoder 2.
Introducing StarCoder 2
This new iteration is not a single model but a 'family' of code generators. StarCoder 2 comes in three variants, the first two of which are designed to run on most modern consumer GPUs:
- A 3-billion-parameter model trained by ServiceNow.
- A 7-billion-parameter model trained by Hugging Face.
- A 15-billion-parameter model developed in collaboration with Nvidia.
Parameters are the values an AI model learns from its training data, and they largely determine the model's skill at a given problem, in this case generating code.
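Concretely, a parameter is just one trainable weight. The toy PyTorch sketch below (a tiny network, not StarCoder 2 itself) shows how such a count is taken; the same sum over StarCoder 2's weight tensors would come to roughly 3, 7, or 15 billion:

```python
# Toy illustration: a model's "parameters" are its trainable weights.
import torch.nn as nn

toy_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Sum the number of values in every weight and bias tensor.
num_params = sum(p.numel() for p in toy_model.parameters())
print(f"{num_params:,} trainable parameters")  # 35,594 for this toy network
```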
Capabilities of StarCoder 2
StarCoder 2 can complete partially written code, summarize code snippets, and retrieve code from natural-language descriptions. Trained on a data set roughly four times the size of the original StarCoder's (67.5 TB versus 16.4 TB), it delivers substantially better performance at lower operating costs. Fine-tuning takes just a few hours on a GPU such as Nvidia's A100, which speeds up the development of applications like chatbots and coding assistants. And because the training data spans 619 programming languages, its predictions should be more accurate and contextually relevant.
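As a concrete illustration of the completion use case, here is a minimal inference sketch using the Hugging Face transformers library. It assumes the 3-billion-parameter checkpoint is published on the Hub as bigcode/starcoder2-3b and falls back to CPU if no GPU is available:

```python
# Minimal code-completion sketch with StarCoder 2 via Hugging Face transformers.
# The checkpoint ID below is an assumption based on the bigcode organization's
# naming; swap in the 7B or 15B variant if your hardware allows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)

# Partially written code; the model proposes a continuation.
prompt = "def fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs["input_ids"], max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the two smaller variants fit on a single consumer GPU, a script like this runs entirely locally, with no code leaving the machine.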
Yet the question remains: does StarCoder 2 outshine other code generators? Preliminary benchmarks suggest it might, showing it can keep pace with the 33-billion-parameter Code Llama on certain tasks while performing them twice as fast. Hugging Face also underscores the benefits of an open-source model: developers can deploy it locally and have it learn their own codebase, avoiding the risk of exposing proprietary code to a cloud-hosted AI. And because the training data is available, developers can inspect, fork, reproduce, or audit the model.
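To sketch what learning a developer's codebase might look like in practice, the example below fine-tunes the model locally with a LoRA adapter from the peft library, which keeps memory requirements modest. The checkpoint ID, the attention projection names passed to target_modules, and the my_project path are illustrative assumptions, not an official recipe:

```python
# Hedged sketch: LoRA fine-tuning of StarCoder 2 on a local Python codebase.
from pathlib import Path

import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bigcode/starcoder2-3b"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Train only a small set of adapter weights instead of all 3B parameters.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Build a training set from the source files of a local repository.
files = [p.read_text() for p in Path("my_project").rglob("*.py")]
dataset = Dataset.from_dict({"text": files}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="starcoder2-local",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

An adapter run of this sort on an A100-class GPU fits comfortably within the "few hours" fine-tuning window the teams cite.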
Ethical and Legal Considerations
The ethics of AI is a cornerstone of StarCoder 2's development. Rather than training on copyrighted code as some rivals allegedly do, it draws its training material exclusively from the Software Heritage archive. Moreover, ahead of the training phase, BigCode, the ServiceNow-Hugging Face collaboration behind the project, gave code owners the opportunity to opt their data out. Licensing could still pose challenges for some, however. The BigCode Open RAIL-M 1.0 license, while less restrictive than many, does not qualify as 'truly open' because it limits certain applications, such as apps that dispense medical advice. And RAIL-M's guidelines may be too ambiguous to square with emerging AI regulations like the EU AI Act.