Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Abstract

Large Language Models (LLMs) are known to struggle with accurately retrieving key information from their context. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively with a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par with or better than NTP on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code is provided in the link.

*Equal contribution. 1School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 2SCITIX (SGP) TECH PTE. LTD., Singapore; 3South China Normal University, China; 4University of Texas at Austin, USA; 5Sun Yat-Sen University, China; 6University of Oxford, UK. Correspondence to: Zheng Cao <zcao@[Link]>, Shiwei Liu <shi[Link]@[Link]>. The entire development process relies on the Siflow platform ([Link]) provided by SCITIX (SGP) TECH PTE. LTD.

1. Introduction

Next-token prediction (NTP) (Radford, 2018) is the foundational training objective for many large language models (LLMs), including OpenAI's GPT series (Brown, 2020). NTP trains models to predict the next word (or token) in a sequence, given all preceding tokens. Its scaling efficiency and exceptional performance in text generation have established it as the dominant paradigm for state-of-the-art LLMs such as GPT-4 (Achiam et al., 2023), LLaMa3 (Dubey et al., 2024), Gemini 1.5 Pro (Team et al., 2024), and DeepSeek-V3 (Liu et al., 2024a). However, recent studies highlight the limitations of NTP-based LLMs in accurately retrieving key information from context (Liu et al., 2024b; Kamradt, 2023).

In contrast, masked language modeling (MLM), used in BERT (Devlin, 2018), adopts a denoising objective that reconstructs masked inputs using bidirectional attention. This cloze-type nature makes MLM particularly effective for tasks requiring precise information retrieval and sentence-level understanding. However, MLM's inherent focus on reconstructing masked tokens reduces its effectiveness in tasks requiring coherent, long-form text generation (Wang & Cho, 2019; Dong et al., 2019).

While intuitively appealing, combining NTP and MLM to leverage their respective strengths remains a non-trivial challenge. MLM typically operates best within two-stack encoder-decoder architectures, and its performance degrades significantly when applied to decoder-only Transformers (Tay et al., 2022). Efforts to integrate the two often rely on unified pre-training pipelines in which multiple objectives are alternated during pre-training (Dong et al., 2019; Tay et al., 2022). However, this multi-objective approach introduces substantial complexity to the training pipeline, making it cumbersome to scale, especially for models with billions or trillions of parameters.

To this end, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective LLM training paradigm that seamlessly integrates masked tokens into next-token prediction. Specifically, we first randomly mask a small fraction of the input tokens and then directly perform standard next-token prediction using a decoder-only Transformer in an autoregressive manner.
This straightforward modification eliminates the need for bidirectional attention or an expensive encoder-decoder architecture, thereby incurring no additional computational overhead during training. During inference, the resulting LLMs can be used exactly like LLMs trained with NTP, with no extra engineering effort. The simplicity of MEAP enables us to enhance LLMs' performance on key information retrieval and long-context reasoning, while retaining the impressive scaling efficiency of decoder-only LLMs. Figure 1 illustrates the different training paradigms.

Figure 1. Overview of next-token prediction, masked language modeling, and our MEAP.

As a general pre-training paradigm, MEAP works effectively in both pre-training and fine-tuning scenarios. For the pre-training setting, we conduct controlled experiments by pre-training 1.1B LLaMa-style LLMs (Zhang et al., 2024) with NTP and MEAP, where the training tokens scale from 40B to 200B. Our results demonstrate that MEAP substantially improves performance on key information retrieval tasks such as Needle in a Haystack (Kamradt, 2023), by up to 33% in average score, and Multi-Document Question Answering (MDQA) (Liu et al., 2024b), by up to 27.2 percentage points, while preserving the general knowledge learned during pre-training. It is noteworthy that MEAP achieves 85.8% accuracy on Needle in a Haystack with 60B training tokens, while NTP requires 200B for similar performance, highlighting MEAP's superior data efficiency in key information retrieval. In addition, compared to the original NTP, MEAP also suffers less from hallucination.

In addition, the promise of MEAP also holds for LLM fine-tuning. Our MEAP framework demonstrates consistent improvements across multiple commonsense reasoning tasks, achieving an average gain of 1.12 points over the NTP baseline. On Multi-Document Question Answering, MEAP achieves an average improvement of 11.77% across all positions.

Our analysis suggests that MEAP's effectiveness stems from its ability to enhance attention distinguishability by focusing on a reduced set of non-masked tokens. This mechanism sharpens the model's attention to task-relevant signals while reducing the impact of peripheral context. In essence, MEAP learns more by attending to fewer tokens.

The structure of this paper is as follows. Section 3 details the MEAP algorithm. The evaluation of MEAP on LLM pre-training and fine-tuning is presented in Sections 4.1 and 4.2, respectively.

2. Related Work

Masked Language Modeling. Pre-training is one of the most important pillars of LLMs. BERT first trained a bidirectional, encoder-only Transformer with masked language modeling (MLM), where the model is trained to predict masked input tokens. XLNet (Yang, 2019) introduced permutation-based language modeling to account for dependencies between masked tokens during training. RoBERTa (Liu, 2019) further improves the pre-training of BERT by training the model longer, over more data, and with longer sequences. MLM was further advanced by T5 (Roberts et al., 2019). Specifically, T5 frames every text processing task as a 'text-to-text' problem, leveraging increased lengths of corrupted tokens to achieve improved performance on classification tasks, which has contributed to its growing popularity. However, these models have shown limited performance in open-ended text generation and in-context learning, limiting their usage in modern LLMs.

Next Token Prediction. In a parallel vein, Radford et al. (2019) proposed next-token prediction (NTP), in which a decoder-only Transformer is trained to predict the next token from left to right using unidirectional attention enforced by a causal mask. By predicting the next token based on previously generated tokens and the given input context, NTP maintains coherence and logical flow in the generated text, making it well suited for text generation. Moreover, NTP eliminates the need for an encoder, significantly improving the scalability of language models. Due to these advantages, NTP serves as the most popular pre-training objective of modern LLMs (Brown, 2020; Achiam et al., 2023; Touvron et al., 2023; Jiang et al., 2023; Yang et al., 2024; Liu et al., 2024a).

Unified Training Paradigms. Several works propose unified training paradigms that train one Transformer with multiple objective functions. For instance, UniLM (Dong et al., 2019) trains a bidirectional encoder on unidirectional language modeling (LM), bidirectional LM, and sequence-to-sequence LM. UL2 (Tay et al., 2022) proposes a unified pre-training paradigm with Mixture-of-Denoisers (MoD) to combine diverse pre-training paradigms, improving performance over T5 and GPT. While effective, the preference for encoder-decoder architectures and the complicated switches among different training objectives hinder their application in practice.

In contrast, our approach seamlessly integrates masked tokens into NTP without incurring any additional pre-training or inference costs, while preserving the ultra-efficiency of decoder-only LLMs.
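To make this integration concrete, the sketch below shows one way a mask-then-predict batch could be prepared for a decoder-only model, based on our reading of the description above. The mask-token choice, the 15% ratio, and the Hugging-Face-style model(input_ids=..., labels=...) call are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def meap_style_batch(input_ids: torch.Tensor, mask_token_id: int,
                     mask_ratio: float = 0.15):
    """Corrupt a small fraction of input tokens while keeping the original
    sequence as the next-token-prediction target (our reading of MEAP)."""
    labels = input_ids.clone()                  # NTP targets stay uncorrupted
    corrupted = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    corrupted[mask] = mask_token_id             # replace selected inputs with [MASK]
    return corrupted, labels

# Hypothetical training step with any decoder-only causal LM:
# corrupted, labels = meap_style_batch(batch["input_ids"], tokenizer.mask_token_id)
# loss = model(input_ids=corrupted, labels=labels).loss   # standard shifted NTP loss
# loss.backward()
```

Because attention remains strictly causal and the loss is the ordinary next-token objective, such a scheme adds no bidirectional components and no extra compute beyond the token replacement itself.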
We pre-train 1.1B LLaMa-style decoder-only Transformers (Vaswani, 2017) following the setting of Zhang et al. (2024). Our model has 24 layers, 32 attention heads, a hidden size of 2,048, an intermediate hidden size of 5,632, and a context length of 4,096. We follow the common configurations of LLM components, e.g., Rotary Positional Embedding (RoPE) (Su et al., 2024), Pre-Norm (Ba, 2016) with RMSNorm (Zhang & Sennrich, 2019), SwiGLU (Shazeer, 2020), and Grouped-Query Attention (Ainslie et al., 2023). To assess the scalability of MEAP, we increase the training token size from 40B to 60B, and further to 200B.

For all experiments, we implement a learning rate warm-up during the first 10% of the training steps, followed by a cosine annealing schedule that decays the learning rate to 10% of its initial value. We use the AdamW optimizer with β1 = 0.9 and β2 = 0.95. The maximum learning rate is set to 4 × 10⁻⁴, the minimum learning rate is 4 × 10⁻⁵, and the weight decay is 5 × 10⁻².

4.1.1. Language Modeling Evaluation

While the primary goal of MEAP is to enhance LLM performance in key information retrieval, it is essential to ensure that integrating MLM into NTP does not compromise the model's fundamental language modeling capability. To evaluate this, we employ the LM Eval Harness benchmark (Gao et al., 2024), assessing models in a zero-shot setting. The results, presented in Table 1, show that MEAP performs comparably to, or even outperforms, NTP, achieving a 1.6% improvement in the overall average score. This finding provides strong evidence that incorporating random masking into NTP does not degrade the model's language modeling capacity. In the following evaluations, we examine whether MEAP further improves performance in key information retrieval and long-context modeling.

Table 1. Pre-training Evaluation. Zero-shot performance of MEAP and NTP on various commonsense reasoning tasks. Results are measured directly after pre-training on 200B tokens with no fine-tuning.

4.1.2. Needle-in-a-Haystack Retrieval

For key information retrieval, we choose the well-established Needle-in-a-Haystack evaluation (Liu et al., 2024b), where the model is asked to retrieve a random fact or statement (the 'needle') placed in the middle of a long context window (the 'haystack'). This approach provides quantitative metrics for assessing precise information extraction from extended contexts, particularly relevant for document analysis applications.

As this evaluation involves long-context modeling capacity, we follow the setting of Ye et al. (2024) and conduct a length extension to 64K. In particular, we continue training our model for an additional 4B tokens from SlimPajama (Soboleva et al., 2023) using the approach proposed by Fu et al. (2024). The implementation utilizes modified Rotary Position Embeddings with θ_base = 640,000.

To demonstrate MEAP's scalability, we increase the training token size to 40B, 60B, and 200B, reporting the results of needle retrieval in Table 2. The results show that MEAP consistently outperforms NTP across different training scales. At 40B tokens, MEAP achieves 80.2% accuracy, significantly surpassing the baseline's 65.9%. The performance gap peaks at 60B tokens, with MEAP maintaining steady improvement and reaching 85.8% accuracy. At 200B tokens, MEAP approaches optimal performance, attaining 98.2% accuracy, while the NTP baseline still falls short of 90% accuracy. It is noteworthy that MEAP achieves 85.8% accuracy using just 60B training tokens, whereas NTP requires approximately three times as many (200B tokens) to reach a similar level. This demonstrates MEAP's superior data efficiency over NTP in key information retrieval.

Table 2. Single-needle accuracy (%) with 32K context.

Tokens    40B     60B     200B
NTP       65.9    52.8    87.1
MEAP      80.2    85.8    98.2

We further illustrate the retrieval performance of our 200B-token model with a 32K context length in Figure 3. The accuracy is reported across varying answer needle depths (y-axis) and context lengths (x-axis). The results show that MEAP generally maintains perfect accuracy across different context lengths and depths, with errors limited to only two grid cells. In contrast, NTP begins to exhibit accuracy degradation at a context length of 24K, affecting a wide range of depths from 50% to 100%.
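For readers who want to reproduce a single-needle run, the following sketch assembles a haystack prompt with the needle placed at a chosen depth. The filler text, question, and the commented-out generate call are placeholders for illustration, not the paper's evaluation code.

```python
import random

def build_haystack_prompt(needle: str, filler_sentences: list,
                          n_sentences: int, depth_percent: float) -> str:
    """Insert the needle at a given depth (0-100%) inside a long filler context."""
    haystack = random.choices(filler_sentences, k=n_sentences)
    insert_at = int(len(haystack) * depth_percent / 100)
    haystack.insert(insert_at, needle)
    context = " ".join(haystack)
    question = "What is the best thing to do in San Francisco?"
    return f"{context}\n\nQuestion: {question}\nAnswer:"

needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
filler = ["The quick brown fox jumps over the lazy dog."]   # stand-in for essay text
prompt = build_haystack_prompt(needle, filler, n_sentences=2000, depth_percent=50.0)
# response = generate(model, prompt)     # hypothetical decoding call
# Score the response against the needle, e.g., with the ROUGE-1 sketch after Figure 3.
```

Sweeping depth_percent and the context length over a grid yields heatmaps of the kind shown in Figure 3.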
(a) NTP Pre-training    (b) MEAP Pre-training

Figure 3. Training dynamics comparison between standard pre-training and the MEAP framework (retrieval score as a function of context length, in tokens, and needle depth percent). Scores are computed using ROUGE-1, measuring unigram overlap between model responses and expected answers.
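The ROUGE-1 scoring used here can be approximated in a few lines; the sketch below uses simple whitespace tokenization, whereas the actual evaluation harness may tokenize and normalize differently.

```python
from collections import Counter

def rouge1(response: str, reference: str) -> dict:
    """Unigram overlap (ROUGE-1) between a model response and the expected answer."""
    hyp = Counter(response.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())                 # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: partial overlap with the expected answer yields a recall between 0 and 1.
print(rouge1("eat a sandwich in Dolores Park",
             "the best thing to do is eat a sandwich in Dolores Park")["recall"])
```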
4.1.3. Multi-Document Question Answering

We report the accuracy improvement of MEAP over NTP in Table 3. MEAP again consistently outperforms NTP by good margins across all configurations, with significant gains at later positions (+30.6% at position 10 in the 10-document setting and +27.2% at position 20 in the 20-document setting). These results indicate that MEAP enhances the model's ability to retrieve relevant information from long contexts, maintain performance across different context lengths and positions, and handle complex scenarios with multiple distractors. The improvements highlight the effectiveness of the masking strategy in enhancing the model's overall capability for long-context information retrieval tasks.

Table 3. Pre-training Evaluation. Relative accuracy (%) improvement of MEAP over NTP on multi-document QA.

Answer Position    1       5       10      15      20
10 documents       +7.6    +7.0    +30.6   –       –
20 documents       +12.4   +4.0    +5.1    +3.7    +27.2

4.1.4. Long-Context Reasoning Evaluation

We evaluate long-context reasoning capabilities using the Multi-Needle Reasoning Task (M-RS) (Li et al., 2024a), which requires models to retrieve and extract multiple pieces of information from long texts and use them to logically answer questions that demand an integrated understanding and reasoning of various text segments. This forces the model to distribute attention across contextually relevant tokens rather than focusing solely on local patterns.

We leverage the OpenCompass evaluation framework (Contributors, 2023) and report the results in Figure 4. MEAP consistently outperforms NTP across context lengths, with a 6.6 percentage point average improvement, demonstrating MEAP's enhanced capacity to maintain attention coherence over extended sequences.

Figure 4. Long-context reasoning performance comparison between MEAP and NTP on the Multi-Needle Reasoning Task (M-RS) across different context lengths (performance score, %, over 8K–32K contexts).

4.1.5. Contextual Hallucination Evaluation

Since MEAP enables more accurate key information retrieval, we expect it to suffer less from contextual hallucination. To verify this, we evaluate MEAP's ability to reduce contextual hallucinations on three summarization datasets: XSum (Narayan et al., 2018), WikiSum (Cohen et al., 2021), and MultiNews (Fabbri et al., 2019), following Ye et al. (2024).

Table 4. Accuracy (i.e., free of hallucinations) on text summarization datasets.

Task    XSum    MultiNews    WikiSum
NTP     0.09    0.17         0.24
MEAP    0.13    0.19         0.33
For this setting, we fine-tune the pre-trained models with Alpaca and evaluate them. We compare model-generated and reference summaries using DeepSeek-V3 (Liu et al., 2024a) as the hallucination detector across 100 random samples per dataset. As shown in Table 4, our masking strategy achieves a consistent reduction in hallucination rates across all datasets.
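A rough sketch of this judge-based protocol is given below; judge_client.chat stands in for whichever DeepSeek-V3 API wrapper is used, and the prompt wording is our own illustration rather than the paper's exact setup.

```python
def is_hallucination_free(judge_client, source_doc: str, summary: str) -> bool:
    """Ask a judge model whether every claim in a summary is supported by the source.
    judge_client.chat(...) is a hypothetical wrapper around DeepSeek-V3."""
    prompt = (
        "You are checking a summary for contextual hallucinations.\n"
        f"Source document:\n{source_doc}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer 'yes' if every claim in the summary is supported by the source, "
        "otherwise answer 'no'."
    )
    reply = judge_client.chat(prompt)        # hypothetical API call
    return reply.strip().lower().startswith("yes")

# Accuracy in Table 4 corresponds to the fraction of sampled summaries judged
# hallucination-free:
# accuracy = sum(is_hallucination_free(judge, doc, summ)
#                for doc, summ in samples) / len(samples)
```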
4.2. Fine-tuning Evaluation

The training configuration uses a global batch size of 512.
Table 5. Fine-tuning Evaluation. Performance of MEAP and NTP on various commonsense reasoning tasks. Results are measured by
fine-tuning with Llama-3-8B.
Results. The attention distributions during inference for both models are visualized in Figure 6. Notably, MEAP exhibits a substantial improvement in answer-relevant attention (34.5% vs. 9.4%) while reducing the dominance of context-before attention from 73.1% to 49.1%. Both models maintain similar attention levels for peripheral components, including context-after, query sections, and EOS tokens (all approximately 5%–6%). These results demonstrate that the MEAP framework enhances attention allocation during inference, prioritizing key information more effectively.

Table 8. Attention score comparison between NTP and MEAP.

Input Length    Mask Ratio    Score Decay    Var. Increase
1,024           0.15          34.08%         12.66%
4,096           0.15          53.34%         7.80%

6. Ablation Study

We conduct ablation studies on the mask ratio for both pre-training and fine-tuning settings. Table 7 summarizes the results. For pre-training, we evaluate our pre-trained model from Section 4.1 on the Multi-Document QA task using the nq-open-oracle dataset (Liu et al., 2024b). For fine-tuning, we train MEAP on the Alpaca dataset (Taori et al., 2023b) for 3 epochs with different mask ratios against standard NTP baselines trained for 6 epochs for a fair comparison. The results show that a mask ratio of 0.15 achieves the best performance in pre-training, while a mask ratio of 0.10 yields the highest accuracy in fine-tuning. MEAP consistently outperforms standard NTP in both pre-training and fine-tuning, demonstrating its effectiveness in leveraging masked tokens for improved performance.

Table 7. Performance comparison of different mask ratios in pre-training and fine-tuning. The best results are highlighted in bold.

                Pre-training                              Fine-tuning
Mask Ratio      NTP     0.05    0.10    0.15    0.20      NTP     0.05    0.10    0.15
Accuracy        0.52    0.54    0.56    0.58    0.56      0.72    0.77    0.81    0.71

7. Conclusion

This work addresses the difficulty LLMs face in retrieving key information from long contexts through a straightforward approach that masks 10%–15% of input tokens while maintaining the standard next-token prediction objective. Our results show significant improvements in comprehension across longer contexts, achieved without additional computational costs. The approach is also remarkably data-efficient, matching with just 60B training tokens the performance that conventional training reaches only with 200B tokens. The results indicate that this strategy leads to more effective processing of key information through improved focus on relevant content. Since it requires no structural changes, the method can be readily integrated into existing systems without disrupting workflows.
Impact Statement

This work proposes a modified pre-training paradigm that may influence how both industry and academia approach language model training. MEAP integrates seamlessly with existing LLM frameworks without requiring additional engineering effort or computational resources. While the improvement in information retrieval and reasoning capabilities could have broad implications for downstream applications, the method's computational efficiency and architectural compatibility mean it can be readily adopted within current training infrastructures. We anticipate this work will contribute to more efficient model development while maintaining established training pipelines and computational requirements.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Ba, J. L. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Brown, T. B. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Cohen, N., Kalinsky, O., Ziser, Y., and Moschitti, A. WikiSum: Coherent summarization dataset for efficient human-evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 212–219, 2021.

Contributors, O. OpenCompass: A universal evaluation platform for foundation models. [Link]com/open-compass/opencompass, 2023.

Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. Advances in Neural Information Processing Systems, 32, 2019.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fabbri, A. R., Li, I., She, T., Li, S., and Radev, D. R. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019.

Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL [Link]12608602.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Kamradt, G. Needle in a haystack: pressure testing LLMs. GitHub repository, 2023.

Li, M., Zhang, S., Liu, Y., and Chen, K. NeedleBench: Can LLMs do retrieval and reasoning in 1 million context window?, 2024a. URL [Link]2407.11963.

Li, T., Zhang, G., Do, Q. D., Yue, X., and Chen, W. Long-context LLMs struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024b.

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.

Liu, Y. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Narayan, S., Cohen, S. B., and Lapata, M. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint, 2018.

Radford, A. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Ren, J., Rajbhandari, S., Aminabadi, R. Y., Ruwase, O., Yang, S., Zhang, M., Li, D., and He, Y. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pp. 551–564, 2021.

Roberts, A., Raffel, C., Lee, K., Matena, M., Shazeer, N., Liu, P. J., Narang, S., Li, W., and Zhou, Y. Exploring the limits of transfer learning with a unified text-to-text transformer. Google, Tech. Rep., 2019.

Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL [Link]co/datasets/cerebras/SlimPajama-627B.

Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.

Tay, Y., Dehghani, M., Tran, V. Q., Garcia, X., Wei, J., Wang, X., Chung, H. W., Shakeri, S., Bahri, D., Schuster, T., et al. UL2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Wang, A. and Cho, K. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094, 2019.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

Yang, Z. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.

Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.

Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
Table 10. Hyperparameters of pre-trained MEAP models. Data amounts are specified in tokens.

Hyperparameter       Value
optimizer            AdamW
lr schedule          cosine
clip                 1.0
max learning rate    4 × 10⁻⁴
min learning rate    4 × 10⁻⁵
weight decay         5 × 10⁻²
sequence length      4096
batch size           256
epochs               1
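As a reference, the warm-up-plus-cosine schedule implied by these hyperparameters (10% linear warm-up, cosine decay from 4 × 10⁻⁴ to 4 × 10⁻⁵) could be set up roughly as follows; the LambdaLR-based formulation is an assumption rather than the authors' training code.

```python
import math
import torch

def build_optimizer_and_schedule(model, total_steps: int, max_lr: float = 4e-4,
                                 min_lr: float = 4e-5, warmup_frac: float = 0.10,
                                 weight_decay: float = 5e-2):
    """AdamW with linear warm-up followed by cosine decay down to min_lr."""
    opt = torch.optim.AdamW(model.parameters(), lr=max_lr,
                            betas=(0.9, 0.95), weight_decay=weight_decay)
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(warmup_steps, 1)                    # linear warm-up
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))       # decays 1 -> 0
        return (min_lr + (max_lr - min_lr) * cosine) / max_lr     # factor for LambdaLR

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```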
Table 12. Results of the 1.1B MEAP model under different masking rates.
Hyperparameter              Value
optimizer                   AdamW
lr schedule                 cosine
clip                        1.0
learning rate               2 × 10⁻⁵
weight decay                5 × 10⁻²
maximum sequence length     4096
batch size                  512