Welcome to the COIG-CQIA project page. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need, a high-quality Chinese instruction fine-tuning dataset. It is designed to provide the Chinese NLP community with high-quality instruction fine-tuning data aligned with real human interaction.
Project Overview
Inspired by studies such as LIMA: Less Is More for Alignment, COIG-CQIA builds its dataset from Chinese internet sources, including Q&A threads and articles. These are thoroughly cleaned, restructured, and manually reviewed to ensure quality, diversity, and relevance.
Updates
[2023.12.04] 🎉 Released version 0.1 of the dataset. SFT models fully fine-tuned using v0.1 of the dataset are based on Yi-6B-base and Yi-34B-base.
Models
Leveraging the COIG-CQIA data, we have developed a series of SFT models based on the Yi series.
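For reference, a checkpoint fine-tuned on COIG-CQIA could be loaded and queried with the transformers library as sketched below. The model identifier is a hypothetical placeholder, not an official release name.

# Minimal sketch: loading a COIG-CQIA SFT checkpoint with Hugging Face transformers.
# "your-org/CQIA-Yi-6B" is a hypothetical placeholder, not an official model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/CQIA-Yi-6B"  # replace with the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Briefly introduce the COIG-CQIA dataset."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))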
Data Format
Each entry in the dataset follows the structure shown below:
{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {
        "major": ["Q&A"],
        "minor": ["Encyclopedic Q&A"]
    },
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}
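As a quick way to inspect entries in this format, the sketch below reads a locally exported JSON Lines file; the file name is hypothetical and not an official distribution artifact.

# Minimal sketch: inspecting COIG-CQIA entries exported as JSON Lines.
# "cqia_sample.jsonl" is a hypothetical local file name used for illustration.
import json

with open("cqia_sample.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]

for entry in entries[:3]:
    # task_type carries both the major and minor categories.
    print(entry["task_type"]["major"], entry["task_type"]["minor"])
    print(entry["instruction"])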
Data Fields
instruction: The instruction or question given as input.
input: Supplementary content for the instruction or question.
output: The corresponding response.
task_type: The major and minor task types the entry belongs to.
domain: The domain(s) the entry belongs to.
answer_from: Whether the response was written by humans or generated by a large model (with human verification).
human_verified: Whether the entry has been verified by humans.
copyright: Copyright information for the entry, including the author.
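To show how these fields might be combined during supervised fine-tuning, here is a minimal sketch that turns one entry into a prompt/response pair; the template is an illustrative assumption, not the project's prescribed format.

# Minimal sketch: assembling a prompt/response pair from one entry.
# The prompt template below is an assumption for illustration only.
def build_sft_pair(entry: dict) -> tuple[str, str]:
    instruction = entry["instruction"].strip()
    extra = entry.get("input", "").strip()
    # Append the supplementary input, if present, below the instruction.
    prompt = f"{instruction}\n{extra}" if extra else instruction
    return prompt, entry["output"]

example = {
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
}
prompt, response = build_sft_pair(example)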
Data Sources
| Source | Count | Collection | Processing |
|---|---|---|---|
| Nationwide Master's Program Unified Admissions Examination | 2645 | Collected from the internet | Rule-based cleaning and filtering |
| Total | 2645 | - | - |
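The table does not spell out the cleaning rules; the sketch below illustrates the kind of rule-based filtering referred to (length bounds and exact deduplication). These particular rules are assumptions, not the project's documented pipeline.

# Minimal sketch: rule-based cleaning and filtering of raw entries.
# The specific rules (length bounds, exact deduplication) are illustrative
# assumptions, not the project's documented pipeline.
def rule_based_filter(entries, min_chars=10, max_chars=4096):
    seen = set()
    kept = []
    for entry in entries:
        text = entry["instruction"] + entry.get("input", "") + entry["output"]
        if not (min_chars <= len(text) <= max_chars):
            continue  # drop entries that are too short or too long
        if text in seen:
            continue  # drop exact duplicates
        seen.add(text)
        kept.append(entry)
    return kept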
Citation
To cite COIG-CQIA in your work, please use the following format:
@misc{COIG-CQIA,
author = {},
title = {COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/paralym/COIG-CQIA}},
}
Additional relevant citations:
@article{zhang2023chinese,
title={Chinese open instruction generalist: A preliminary release},
author={Zhang, Ge and Shi, Yemin and Liu, Ruibo and Yuan, Ruibin and Li, Yizhi and Dong, Siwei and Shu, Yu and Li, Zhaoqun and Wang, Zekun and Lin, Chenghua and others},
journal={arXiv preprint arXiv:2304.07987},
year={2023}
}
@misc{Firefly,
author = {Jianxin Yang},
title = {Firefly(流萤): 中文对话式大语言模型},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/yangjianxin1/Firefly}},
}
@misc{xu2023cvalues,
title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility},
author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
year={2023},
eprint={2307.09705},
archivePrefix={arXiv},
primaryClass={cs.CL}
}