
Zhiyuan Research Institute releases Chinese Internet Corpus 3.0, containing 1000GB of high-quality data

2024-09-20


On September 20, at the parallel forum "Cultural Trends: Integration of Emerging Business Models and Technologies" of the 2024 Beijing Cultural Forum, Liu Guang, head of the Tianying language model at the Zhiyuan Research Institute, released the Chinese Internet Corpus 3.0.
The Chinese Internet Corpus 3.0 (CCI 3.0) is described as having three characteristics: unprecedented scale and a wide range of sources; fine-grained annotation that supports applications; and breakthrough results, with a better understanding of Chinese. At present, CCI 3.0 totals 1000GB of data covering 268 million web pages, and its high-quality subset (CCI 3.0 HQ) totals 498GB. Each entry in the corpus is analyzed and labeled along more than 10 dimensions, including security score, quality score, and information density. This makes it easier for users to select high-value data, meets the practical needs of enterprises, and allows the data to be put to better use.
According to Liu Guang, data is both the cornerstone and the bottleneck of large model development. The scale of data required for model training has grown significantly, and the makeup of internet sources has left Chinese data in short supply. Only labeled, high-quality data can unlock the value of artificial intelligence, and if the industry pays more attention to data quality, artificial intelligence will develop faster. This is the background to the launch of the Chinese Internet Corpus 3.0.