ByteDance’s New Large Model, Doubao, Delivers Stellar Performance

ByteDance recently unveiled its Doubao large model. Beyond sparking a wave of price cuts across the large-model market with its strikingly low cost, Doubao has also drawn significant industry attention for its modeling capabilities.

The Doubao model team disclosed results from a round of internal testing. The Doubao-pro-4k model scored a total of 76.8 across 11 mainstream public evaluation sets, including MMLU, BBH, GSM8K, and HumanEval. This represents a 19% improvement over the 64.5 scored by the previous-generation model, Skylark2, and surpasses the scores of other domestic models tested during the same period.

The evaluation results show that Doubao made substantial strides in coding, improving by around 50% over the previous-generation model on the HumanEval and MBPP evaluation sets. Doubao also posted significant gains on evaluation sets for professional knowledge and instruction following, improving by 33% and 24% respectively, which made it the highest-scoring domestic model in those areas.

Doubao also performed well in evaluations of mathematical capability and language comprehension, as well as on the comprehensive evaluation sets CMMLU and CEval, placing within the top three. Across all 11 public evaluation sets, Doubao-pro's total score was 76.8; according to OpenAI's published test scores, GPT-4 retained a slight edge with a total of 80.1 on the same sets.

It is worth noting that the Doubao model launched only recently, on May 15, and has not yet been included in third-party testing, though many third-party evaluation institutions are expected to release results over the next one to two months. The AI dialogue assistant "Doubao," which shares its name with the model, has already reached 26 million monthly active users and lets users try the model for free.

In an earlier evaluation report from the Beijing Academy of Artificial Intelligence covering 91 language models worldwide, Skylark2 topped the subjective evaluation of Chinese language capabilities, outperforming GPT-4.