Stars
[ICML 2024] Selecting High-Quality Data for Training Language Models
Acceptance rates for the major AI conferences
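MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) internet slang. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.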
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DomainWordsDict: a Chinese domain-specific dictionary and knowledge base covering 68 domains and 9.16 million words in total, which can be used for text classification, knowledge enhancement, domain vocabulary expansion, and other natural language processing applications.
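SuperCLUE: A Comprehensive Benchmark for General-Purpose Foundation Models in Chinese
A curated collection of open-source Chinese large language models, focusing on smaller models that can be deployed privately and trained at low cost, including base models, domain-specific fine-tunes and applications, datasets, and tutorials.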
Awesome-LLM: a curated list of Large Language Models
[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge"
Reading list on hallucination in LLMs. Check out our new survey paper: "Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models"
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
pyspark🍒🥭 is delicious, just eat it!😋😋
TruthfulQA: Measuring How Models Imitate Human Falsehoods
A programmer's guide to living longer
Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
A collection of resources on applications of Transformers in Medical Imaging.
An Index for Medical Imaging Datasets
Transfer learning / domain adaptation / domain generalization / multi-task learning, etc. Papers, code, datasets, applications, tutorials.
A PyTorch-based library for semi-supervised learning (NeurIPS'21)
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/