Welcome to the COIG-CQIA project page. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need, a high-quality Chinese instruction fine-tuning dataset. It is designed to give the Chinese NLP community instruction fine-tuning data that is both high in quality and aligned with real human interaction.
Inspired by studies such as LIMA: Less Is More for Alignment, COIG-CQIA builds its dataset from Chinese internet sources, including Q&A posts and articles. The collected data is thoroughly cleaned, restructured, and manually reviewed to ensure quality, diversity, and relevance.
- [2023.12.04] 🎉 Released version 0.1 of the dataset. The SFT models fully fine-tuned on v0.1 use Yi-6B-base and Yi-34B-base as their base models.
Leveraging the COIG-CQIA data, we have developed a series of SFT models based on the Yi series.
Model Name | Base Model | Download Link |
---|---|---|
CQIA-Yi-6B-v0.1 | Yi-6B-base | Download |
CQIA-Yi-34B-v0.1 | Yi-34B-base | Download |
from transformers import AutoModel
Logical Reasoning
Input:
Response:
{
  "instruction": "Example question or instruction",
  "input": "Supplementary content for the question or instruction",
  "output": "Response to the input",
  "task_type": {
    "major": ["Q&A"],
    "minor": ["Encyclopedic Q&A"]
  },
  "domain": ["Encyclopedia", "Maternal and Child Health"],
  "answer_from": "human",
  "human_verified": true,
  "copyright": "Copyright information including author details..."
}
- `instruction`: The command or question for input.
- `input`: Supplementary content for the instruction or question.
- `output`: The corresponding response.
- `task_type`: The main and sub-task types the data belongs to.
- `domain`: The field to which the data belongs.
- `answer_from`: Whether the response is written by humans or generated by large models (with human verification).
- `human_verified`: Indicates if the data has been verified by humans.
- `copyright`: Information about the data's copyright, including the author.
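As a minimal sketch of how a record in this format might be consumed, the Python snippet below parses the sample record, checks that each documented field is present with the type shown in the sample, and joins `instruction` and `input` into a single prompt. The helper names `validate_record` and `build_prompt`, the type checks, and the newline used to join the two fields are illustrative assumptions, not part of the dataset or an official schema.

```python
import json

# Field names and types are taken from the sample record in this README.
# Treating every field as required is an assumption made for illustration.
EXPECTED_FIELDS = {
    "instruction": str,
    "input": str,
    "output": str,
    "task_type": dict,
    "domain": list,
    "answer_from": str,
    "human_verified": bool,
    "copyright": str,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    task_type = record.get("task_type")
    if isinstance(task_type, dict):
        for key in ("major", "minor"):
            if not isinstance(task_type.get(key), list):
                problems.append(f"task_type.{key} should be a list")
    return problems


def build_prompt(record):
    """Join instruction and optional input (hypothetical prompt layout)."""
    if record["input"]:
        return record["instruction"] + "\n" + record["input"]
    return record["instruction"]


sample = json.loads("""
{
  "instruction": "Example question or instruction",
  "input": "Supplementary content for the question or instruction",
  "output": "Response to the input",
  "task_type": {"major": ["Q&A"], "minor": ["Encyclopedic Q&A"]},
  "domain": ["Encyclopedia", "Maternal and Child Health"],
  "answer_from": "human",
  "human_verified": true,
  "copyright": "Copyright information including author details..."
}
""")

print(validate_record(sample))
print(build_prompt(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once when scanning a whole file.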
Social Media & Forum
Category | Quantity | Source | Construction Method |
---|---|---|---|
Zhihu | 8837 | [Website] | Multi-stage filtering and human verification. |
Douban | 3132 | [Website] | Manually-written prompt templates. |
Xiaohongshu | 1508 | [Website] | Manually-written prompt templates. |
Segmentfault | 458 | [Website] | Rule-based method for cleaning and filtering, followed by manual review. |
Total | 13935 | - | - |
Encyclopedia
Category | Quantity | Source | Construction Method |
---|---|---|---|
Encyclopedic Article | 980 | Collected from the internet: [Website] [Website] [Website] [Website] | Rule-based method for cleaning and filtering, followed by manual review. |
Encyclopedia of China | 1706 | [Website] | Manually-written prompt templates. |
wikiHow-zh | 1876 | [Website] & [Open Dataset] | Rule-based method for cleaning and filtering. |
Total | 4571 | - | - |
General NLP tasks
Category | Quantity | Source | Construction Method |
---|---|---|---|
COIG-PC-Core | 3000 | [Open Dataset] | Manual review of question quality. |
Total | 3000 | - | - |
Examinations & Quiz
Category | Quantity | Source | Construction Method |
---|---|---|---|
Chinese National College Entrance Examination & Middle School Entrance Examinations | 2000 | [Open Dataset] | - |
Nationwide Master's Program Unified Admissions Examination | 475 | Collected from the internet | Rule-based method for cleaning and filtering. |
Logical Reasoning | 422 | Collected from the internet | Rule-based method for cleaning and filtering. |
Total | 2897 | - | - |
Human value
Category | Quantity | Source | Construction Method |
---|---|---|---|
100poison | 906 | [Open Dataset] | - |
COIG-human-value | 101 | [Open Dataset] | Manual review of question quality. |
Total | 1007 | - | - |
Traditional Chinese Culture
Category | Quantity | Source | Construction Method |
---|---|---|---|
Traditional Knowledge Quiz | 232 | Collected from the internet | Rule-based method for cleaning and filtering, followed by manual review. |
Chinese Idiom | 112 | [Open Dataset] | Rule-based method for cleaning and filtering, followed by manual review. |
Classical Chinese Poetry Writing | 47 | [Open Dataset] | Rule-based method for cleaning and filtering, followed by manual review. |
Classical Chinese Translation | 112 | [Open Dataset] | Rule-based method for cleaning and filtering, followed by manual review. |
Total | 1112 | - | - |
Finance & Economy Management
Category | Quantity | Source | Construction Method |
---|---|---|---|
MBA Encyclopedia | 10689 | [Website] | Manually-written prompt templates. |
Finance NLP tasks | 600 | [Open Dataset] | Manual review of question quality. |
Total | 12689 | - | - |
Medical
Category | Quantity | Source | Construction Method |
---|---|---|---|
Medical Encyclopedia | 8351 | [Website] | Manually-written prompt templates. |
Medical Articles | 186 | [Website][Website] | Rule-based method for cleaning and filtering. |
Total | 8537 | - | - |
Law
Category | Quantity | Source | Construction Method |
---|---|---|---|
Nationwide Master's Program Unified Admissions Examination | 2645 | Collected from the internet | Rule-based method for cleaning and filtering. |
Total | 2645 | - | - |
To cite COIG-CQIA in your work, please use the following format:
@misc{COIG-CQIA,
author = {},
title = {COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/paralym/COIG-CQIA}},
}
Additional relevant citations:
@article{zhang2023chinese,
title={Chinese open instruction generalist: A preliminary release},
author={Zhang, Ge and Shi, Yemin and Liu, Ruibo and Yuan, Ruibin and Li, Yizhi and Dong, Siwei and Shu, Yu and Li, Zhaoqun and Wang, Zekun and Lin, Chenghua and others},
journal={arXiv preprint arXiv:2304.07987},
year={2023}
}
@misc{Firefly,
author = {Jianxin Yang},
title = {Firefly(流萤): 中文对话式大语言模型},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/yangjianxin1/Firefly}},
}
@misc{xu2023cvalues,
title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility},
author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
year={2023},
eprint={2307.09705},
archivePrefix={arXiv},
primaryClass={cs.CL}
}