A curated list of papers on efficient pre-trained language models, organized by topic:
- Knowledge Distillation
- Network Pruning
- Quantization
- Inference Acceleration
- Structure Design
- Hardware
- Evaluation
- Others
This list is still under construction; it aims to cover papers published from 2018 to 2023.
- A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training. Nitay Calderon, Subhabrata Mukherjee, Roi Reichart, Amir Kantor. [Paper][Github]
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation. Yuxin Ren, Zihan Zhong, Xingjian Shi, Yi Zhu, Chun Yuan, Mu Li. [Paper][Github]
- f-Divergence Minimization for Sequence-Level Knowledge Distillation. Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou. [Paper][Github]
- AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression. Siyue Wu, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Rui Wang. [Paper][Github]
- Lifting the Curse of Capacity Gap in Distilling Language Models. Chen Zhang, Yang Yang, Jiahao Liu, Jingang Wang, Yunsen Xian, Benyou Wang, Dawei Song. [Paper][Github]
- Bridging the Gap between Decision and Logits in Decision-based Knowledge Distillation for Pre-trained Language Models. Qinhong Zhou, Zonghan Yang, Peng Li, Yang Liu. [Paper][Github]
- How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives. Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank. [Paper][Github]
- ReAugKD: Retrieval-Augmented Knowledge Distillation for Pre-trained Language Models. Jianyi Zhang, Aashiq Muhamed, Aditya Anantharaman, Guoyin Wang, Changyou Chen, Kai Zhong, Qingjun Cui, Yi Xu, Belinda Zeng, Trishul Chilimbi, Yiran Chen. [Paper]
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression. Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao. [Paper][Github]
- Adaptive Contrastive Knowledge Distillation for BERT Compression. Jinyang Guo, Jiaheng Liu, Zining Wang, Yuqing Ma, Ruihao Gong, Ke Xu, Xianglong Liu. [Paper]
- Distilling Reasoning Capabilities into Smaller Language Models. Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan. [Paper]
- Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method. Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng Zhang, Jie Tang. [Paper][Github]
- TempLM: Distilling Language Models into Template-Based Generators. Tianyi Zhang, Mina Lee, Lisa Li, Ende Shen, Tatsunori B. Hashimoto. [Paper][Github]
- Cost-effective Distillation of Large Language Models. Sayantan Dasgupta, Trevor Cohn, Timothy Baldwin. [Paper]
- A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models. Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Sung Ju Hwang, Alexander Min. [Paper]
- A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models. Takuma Udagawa, Aashka Trivedi, Michele Merler, Bishwaranjan Bhattacharjee. [Paper]
- What is Lost in Knowledge Distillation?. Manas Mohanty, Tanya Roosta, Peyman Passban. [Paper]
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models. Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Ju Hwang, Alexander Min. [Paper]
- Pruning Pre-trained Language Models Without Fine-Tuning. Ting Jiang, Deqing Wang, Fuzhen Zhuang, Ruobing Xie, Feng Xia. [Paper][Github]
- Gradient-based Intra-attention Pruning on Pre-trained Language Models. Ziqing Yang, Yiming Cui, Xin Yao, Shijin Wang. [Paper][Github]
- Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model. Yeskendir Koishekenov, Alexandre Berard, Vassilina Nikoulina. [Paper][Github]
- Structured Pruning for Efficient Generative Pre-trained Language Models. Chaofan Tao, Lu Hou, Haoli Bai, Jiansheng Wei, Xin Jiang, Qun Liu, Ping Luo, Ngai Wong. [Paper]
- Pruning Pre-trained Language Models with Principled Importance and Self-regularization. Siyu Ren, Kenny Zhu. [Paper][Github]
- Knowledge-preserving Pruning for Pre-trained Language Models without Retraining. Seungcheol Park, Hojun Choi, U Kang. [Paper]
- Towards Robust Pruning: An Adaptive Knowledge-Retention Pruning Strategy for Language Models. Jianwei Li, Qi Lei, Wei Cheng, Dongkuan Xu. [Paper]
- Transfer Learning for Structured Pruning under Limited Task Data. Lucio Dery, David Grangier, Awni Hannun. [Paper]
- Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference. Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney. [Paper]
- DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization. Rahul Chand, Yashoteja Prabhu, Pratyush Kumar. [Paper]
- PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation. Nadav Benedek, Lior Wolf. [Paper]
- The Need for Speed: Pruning Transformers with One Recipe. Samir Khaki, Konstantinos N. Plataniotis. [Paper]
- Outlier Suppression: Pushing the Limit of Low-bit Transformer. Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu. [Paper][Github]
- Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models. James O’Neill, Sourav Dutta. [Paper]
- Understanding Int4 Quantization for Language Models: Latency Speedup, Composability, and Failure Cases. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. [Paper]
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models. Zhuocheng Gong, Jiahao Liu, Qifan Wang, Yang Yang, Jingang Wang, Wei Wu, Yunsen Xian, Dongyan Zhao, Rui Yan. [Paper]
- Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization. Chong Yu, Tao Chen, Zhongxue Gan. [Paper]
- Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models. Miaoxi Zhu, Qihuang Zhong, Li Shen, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao. [Paper]
- PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, Ashish Verma. [Paper][Github]
- A Simple Hash-Based Early Exiting Approach For Language Understanding and Generation. Tianxiang Sun, Xiangyang Liu, Wei Zhu, Zhichao Geng, Lingling Wu, Yilong He, Yuan Ni, Guotong Xie, Xuanjing Huang, Xipeng Qiu. [Paper][Github]
- Learned Token Pruning for Transformers. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer. [Paper][Github]
- Confident Adaptive Language Modeling. Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler. [Paper][Github]
- Sparse Token Transformers with Attention Back Tracking. Heejun Lee, Minki Kang, Youngwan Lee, Sung Ju Hwang. [Paper]
- Dynamic and Efficient Inference for Text Generation via BERT Family. Xiaobo Liang, Juntao Li, Lijun Wu, Ziqiang Cao, Min Zhang. [Paper][Github]
- Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference. Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang. [Paper][Github]
- Dynamic Sparse Attention for Scalable Transformer Acceleration. Liu Liu, Zheng Qu, Zhaodong Chen, Fengbin Tu, Yufei Ding, Yuan Xie. [Paper]
- Exponentially Faster Language Modelling. Peter Belcak, Roger Wattenhofer. [Paper][Github]
- Quantized Transformer Language Model Implementations on Edge Devices. Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, Soheil Salehi. [Paper]
- Are Compressed Language Models Less Subgroup Robust?. Leonidas Gee, Andrea Zugarini, Novi Quadrianto. [Paper][Github]
- NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning. Tianxin Wei, Zeming Guo, Yifan Chen, Jingrui He. [Paper][Github]
- MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining. Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle. [Paper][Github][Project]
- PuMer: Pruning and Merging Tokens for Efficient Vision Language Models. Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi. [Paper]
- Lightweight Adaptation of Neural Language Models via Subspace Embedding. Amit Kumar Jaiswal, Haiming Liu. [Paper]
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning. Miles Williams, Nikolaos Aletras. [Paper]
- Approximating Two-Layer Feedforward Networks for Efficient Transformers. Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber. [Paper][Github]
- Not all layers are equally as important: Every Layer Counts BERT. Lucas Georges Gabriel Charpentier, David Samuel. [Paper][Github]