Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Xialie Zhuang*1,2  Zhikai Jia*2  Jianjin Li*3  Zhenyu Zhang4  Li Shen5  Zheng Cao2  Shiwei Liu6

arXiv:2502.07490v1 [[Link]] 11 Feb 2025

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information from their context. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction with a decoder-only Transformer in an autoregressive manner. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par with or better than NTP on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code is provided in the link.

*Equal contribution. 1 School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 2 SCITIX (SGP) TECH PTE. LTD., Singapore; 3 South China Normal University, China; 4 University of Texas at Austin, USA; 5 Sun Yat-Sen University, China; 6 University of Oxford, UK. Correspondence to: Zheng Cao <zcao@[Link]>, Shiwei Liu <shi-[Link]@[Link]>. The entire development process relies on the Siflow platform ([Link]) provided by SCITIX (SGP) TECH PTE. LTD.

1. Introduction

Next-token prediction (NTP) (Radford, 2018) is the foundational training objective for many large language models (LLMs), including OpenAI's GPT series (Brown, 2020). NTP trains models to predict the next word (or token) in a sequence, given all preceding tokens. Its scaling efficiency and exceptional performance in text generation have established it as the dominant paradigm for state-of-the-art LLMs such as GPT-4 (Achiam et al., 2023), LLaMa3 (Dubey et al., 2024), Gemini 1.5 Pro (Team et al., 2024), and DeepSeek-V3 (Liu et al., 2024a). However, recent studies highlight the limitations of NTP-based LLMs in accurately retrieving key information from context (Liu et al., 2024b; Kamradt, 2023).

In contrast, masked language modeling (MLM), used in BERT (Devlin, 2018), adopts a denoising objective that reconstructs masked inputs using bidirectional attention. This cloze-type nature makes MLM particularly effective for tasks requiring precise information retrieval and sentence-level understanding. However, MLM's inherent focus on reconstructing masked tokens reduces its effectiveness in tasks requiring coherent and long-form text generation (Wang & Cho, 2019; Dong et al., 2019).

While intuitively appealing, combining NTP and MLM to leverage their respective strengths remains a non-trivial challenge. MLM typically operates best within two-stack encoder-decoder architectures, and performance degrades significantly when it is applied to decoder-only Transformers (Tay et al., 2022). Efforts to integrate the two often rely on unified pre-training pipelines in which multiple objectives are alternated during the pre-training process (Dong et al., 2019; Tay et al., 2022). However, this multi-objective approach introduces substantial complexity to the training pipeline, making it cumbersome to scale, especially for models with billions or trillions of parameters.

To this end, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective LLM training paradigm that seamlessly integrates masked tokens into next-token prediction. Specifically, we first randomly mask a small fraction of the input tokens and then directly perform standard next-token prediction using a decoder-only Transformer in an autoregressive manner. This straightforward modification eliminates the need for bidirectional attention or an expensive encoder-decoder architecture, thereby incurring no additional computational overhead during training.
During inference, the resulting LLMs work just like LLMs trained with NTP, with no extra engineering effort. The simplicity of MEAP enables us to enhance LLMs' performance in key information retrieval and long-context reasoning, while retaining the impressive scaling efficiency of decoder-only LLMs. Figure 1 illustrates the different training paradigms.

As a general pre-training paradigm, MEAP works effectively in both pre-training and fine-tuning scenarios. For the pre-training setting, we conduct controlled experiments by pre-training 1.1B LLaMa-style LLMs (Zhang et al., 2024) with NTP and MEAP, where the training tokens scale from 40B to 200B. Our results demonstrate that MEAP substantially improves performance in key information retrieval tasks such as Needle in a Haystack (Kamradt, 2023), by up to 33% average score, and Multi-Document Question Answering (MDQA) (Liu et al., 2024b), by up to 27.2 percentage points, while preserving general knowledge learned during pre-training. It is noteworthy that MEAP achieves 85.8% accuracy on Needle in a Haystack with 60B training tokens, while NTP requires 200B tokens for similar performance, highlighting MEAP's superior data efficiency in key information retrieval. In addition, compared to the original NTP, MEAP also suffers less from hallucination.

The promise of MEAP also holds for LLM fine-tuning. Our MEAP framework demonstrates consistent improvements across multiple commonsense reasoning tasks, achieving an average gain of 1.12 points over the NTP baseline. On Multi-Document Question Answering, MEAP achieves an average improvement of 11.77% across all positions.

Our analysis suggests that MEAP's effectiveness stems from its ability to enhance attention distinguishability by focusing on a reduced set of non-masked tokens. This mechanism sharpens the model's attention to task-relevant signals while reducing the impact of peripheral context. In essence, MEAP learns more by attending to fewer tokens.

The structure of this paper is as follows. Section 3 details the MEAP algorithm. The evaluation of MEAP on LLM pre-training and fine-tuning is presented in Sections 4.1 and 4.2, respectively. In Section 5, we further analyze the underlying reasons for MEAP's effectiveness. Section 6 provides an ablation study, and we conclude the paper in Section 7.

[Figure 1: schematic comparison of next token prediction, masked language modeling, and MEAP.]
Figure 1. Overview of next token prediction, masked language modeling, and our MEAP.

2. Related Work

Masked Language Modeling. Pre-training is one of the most important pillars of LLMs. BERT first trained a bidirectional, encoder-only Transformer with masked language modeling (MLM), where the model is trained to predict masked input tokens. XLNet (Yang, 2019) introduced permutation-based language modeling to account for dependencies between masked tokens during training. RoBERTa (Liu, 2019) further improves the pre-training of BERT by training the model longer, over more data, and with longer sequences. MLM was further advanced by T5 (Roberts et al., 2019). Specifically, T5 frames every text processing task as a 'text-to-text' problem, leveraging increased lengths of corrupted tokens to achieve improved performance on classification tasks, which has contributed to its growing popularity. However, these models have shown limited performance in open-text generation and in-context learning, limiting their usage in modern LLMs.

Next Token Prediction. In a parallel vein, Radford et al. (2019) proposed next-token prediction (NTP), where a decoder-only Transformer is trained to predict the next token from left to right using unidirectional attention enforced by a causal mask. By predicting the next token based on previously generated tokens and the given input context, NTP maintains coherence and logical flow in the generated text, making it well suited for text generation. Moreover, NTP eliminates the need for an encoder, significantly improving the scalability of language models. Due to these advantages, NTP serves as the most popular pre-training objective of modern LLMs (Brown, 2020; Achiam et al., 2023; Touvron et al., 2023; Jiang et al., 2023; Yang et al., 2024; Liu et al., 2024a).

Unified Training Paradigms. Several works propose unified training paradigms that train one Transformer with multiple objective functions. For instance, UniLM (Dong et al., 2019) trains a bidirectional encoder on unidirectional language modeling (LM), bidirectional LM, and sequence-to-sequence LM. UL2 (Tay et al., 2022) proposes a unified pre-training paradigm with Mixture-of-Denoisers (MoD) to combine diverse pre-training paradigms, improving performance over T5 and GPT. While effective, the preference for encoder-decoder architectures and the complicated switching among different training objectives hinder their adoption in practice.

In contrast, our approach seamlessly integrates masked tokens into NTP without incurring any additional pre-training or inference costs, while preserving the ultra-efficiency of NTP.
More importantly, MEAP is well suited to modern LLMs: since our method does not alter the core mechanism of NTP, the resulting models remain fully compatible with existing pipelines, platforms, and hardware optimized for modern LLMs.

[Figure 2: pre-training and fine-tuning frameworks of MEAP.]
Figure 2. Training frameworks of MEAP. Left (pre-training): a certain portion of input tokens is randomly masked, followed by standard next-token prediction (NTP). Right (fine-tuning): training samples are duplicated, and the random masking strategy is applied to the copied sequences. Standard NTP is then performed on the modified input for fine-tuning.

3. Mask-Enhanced Autoregressive Prediction

In this section, we introduce Mask-Enhanced Autoregressive Prediction (MEAP).

LLM pre-training. To enhance the performance of LLMs in handling and understanding long texts, particularly for key information retrieval and long-context tasks, we design and implement a simple yet efficient random masking strategy. The core idea of this method is to selectively mask portions of the input during the pre-training phase. Specifically, we employ a fixed-proportion masking mechanism, where tokens in the input sequence are randomly masked according to a predefined percentage P. In this way, the model is forced to learn in the absence of some contextual information, which helps improve its deep understanding and reasoning capabilities.

Formally, given a decoder-only Transformer θ and an input sequence X = (x_1, x_2, ..., x_{n-1}, x_n), we first randomly mask a fraction P of the tokens, obtaining X' = (x_1, [mask], ..., x_{t-1}, x_t). We then perform standard next-token prediction on the masked input in a left-to-right manner:

    p_θ(X') = ∏_{t=1}^{T} p_θ(x_t | x_1, [mask], ..., x_{t-1})

As in NTP, when the model is tasked with predicting a masked token, it employs causal masked attention, using only the preceding tokens to predict it. The sketch shown below gives a concrete instance of this recipe.

We carefully select the masking ratio P = 15% for pre-training to ensure that the model receives an adequate level of training difficulty and learning signal without excessively disrupting the pre-training process. The relatively moderate number of masked tokens allows this approach to be seamlessly integrated into existing NTP frameworks without significantly increasing pre-training overhead or altering the original training procedure.
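To make the pre-training recipe concrete, the following PyTorch-style sketch is an illustration under assumptions, not the authors' released code: `model` is a placeholder decoder-only network returning next-token logits, `mask_id` is a reserved [mask] token id, and keeping the original (unmasked) tokens as prediction targets is one reading of the objective above.

import torch
import torch.nn.functional as F

def meap_pretrain_loss(model, input_ids, mask_id, mask_ratio=0.15):
    """One MEAP pre-training step (sketch).

    input_ids: LongTensor of shape (batch, seq_len), the clean token ids.
    mask_id:   id of the reserved [mask] token (assumed to exist).
    """
    # Randomly choose ~mask_ratio of the positions and replace them with [mask].
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    corrupted = input_ids.masked_fill(mask, mask_id)

    # Standard left-to-right NTP on the corrupted sequence: predict the
    # original token at position t+1 from corrupted tokens up to position t.
    logits = model(corrupted)                      # (batch, seq_len, vocab), placeholder call
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]                # targets remain the clean tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

Because the attention computation and the loss are untouched, the masking step is the only change required to an existing NTP training loop.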
LLM fine-tuning. MEAP can also be extended to fine-tuning scenarios. In this approach, we duplicate the training samples and apply the same random masking strategy to the copied sequences during fine-tuning. The original sequences and their masked counterparts are then combined into a single input sequence and fed to the model. Formally, next-token prediction with the causal attention mechanism models the following objective:

    p_θ(X') = ∏_{t=1}^{T} p_θ(x_t | x_1, ..., x_{t-1}; x_1, [mask], ..., x_{t-1})

This design addresses a critical concern: input sequences in supervised fine-tuning often contain key information essential for downstream tasks. Directly masking the original sequence risks removing crucial information, potentially compromising the model's performance on the target tasks. Masking the duplicated sequence instead incorporates MLM into NTP while avoiding this concern. We choose P = 10% for fine-tuning in our experiments. We only apply MEAP to QA pairs whose answer length exceeds 50; otherwise, we perform standard NTP for the pair. A sketch of this duplication scheme is given after this paragraph.
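The minimal sketch below is again PyTorch-style and hypothetical: the exact packing, any separator tokens, and which positions are supervised may differ in the actual implementation. Here, following the description in Section 4.2 that only the masked tokens are predicted, loss is computed only at the masked positions of the copied half.

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss

def meap_finetune_example(input_ids, mask_id, mask_ratio=0.10):
    """Build one MEAP fine-tuning example: [clean sequence ; masked copy].

    input_ids: LongTensor of shape (batch, seq_len).
    The clean half preserves the key information; supervision comes from
    recovering the original tokens at the masked positions of the copy.
    """
    mask = torch.rand(input_ids.shape) < mask_ratio
    masked_copy = input_ids.masked_fill(mask, mask_id)

    # Single packed input: original tokens followed by their masked copy.
    packed_input = torch.cat([input_ids, masked_copy], dim=-1)

    # Labels: ignore the clean half entirely; on the copied half, supervise
    # only the masked positions with the original token ids.
    ignore = torch.full_like(input_ids, IGNORE_INDEX)
    labels_copy = torch.where(mask, input_ids, ignore)
    labels = torch.cat([ignore, labels_copy], dim=-1)
    return packed_input, labels

def shifted_ntp_loss(logits, labels):
    # Standard shifted next-token loss; IGNORE_INDEX positions contribute nothing.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=IGNORE_INDEX,
    )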

Notably, while MEAP doubles the sequence length during fine-tuning, Figure 5 shows that it achieves superior performance to NTP with only half the training time, essentially attaining stronger results with even fewer training tokens. We believe that the effectiveness of MEAP stems from its ability to promote more distinguishable attention by focusing on fewer tokens during LLM training, as masked tokens typically receive negligible attention. This modification helps the model focus on task-relevant signals while reducing the impact of peripheral context, as verified in Section 5.

4. Experimental Results

To evaluate the effectiveness of MEAP in training LLMs, we conduct controlled experiments comparing LLMs pre-trained or fine-tuned with MEAP against those trained with NTP.

4.1. Pre-training Evaluation
Setup. We follow the Llama architecture (Touvron et al., 2023) as our base model. Specifically, we train 1.1B decoder-only Transformers (Vaswani, 2017) following the setting of Zhang et al. (2024). Our model has 24 layers, 32 attention heads, a hidden size of 2,048, an intermediate hidden size of 5,632, and a context length of 4,096. We follow the common configurations of LLM components, e.g., Rotary Positional Embedding (RoPE) (Su et al., 2024), Pre-Norm (Ba, 2016) with RMSNorm (Zhang & Sennrich, 2019), SwiGLU (Shazeer, 2020), and Grouped-Query Attention (Ainslie et al., 2023). To assess the scalability of MEAP, we increase the training token count from 40B to 60B, and further to 200B.

For all experiments, we implement a learning rate warm-up during the first 10% of the training steps, followed by a cosine annealing schedule that decays the learning rate to 10% of its initial value. We use the AdamW optimizer with β1 = 0.9 and β2 = 0.95. The maximum learning rate is set to 4 × 10−4, the minimum learning rate is 4 × 10−5, and the weight decay is 5 × 10−2.

Table 1. Pre-training Evaluation. Zero-shot performance of MEAP and NTP on various commonsense reasoning tasks. Results are measured directly after pre-training on 200B tokens with no fine-tuning.

        ARC-c  ARC-e  BoolQ  PIQA  HellaSwag  WinoGrande  OBQA  Average
NTP     22.9   55.7   53.3   73.6  44.1       55.0        23.2  46.2
MEAP    25.4   56.4   59.5   72.3  43.4       55.3        22.6  47.8

4.1.1. Language Modeling Evaluation

While the primary goal of MEAP is to enhance LLM performance in key information retrieval, it is essential to ensure that integrating MLM into NTP does not compromise the model's fundamental language modeling capability. To evaluate this, we employ the LM Eval Harness benchmark (Gao et al., 2024), assessing models in a zero-shot setting. The results, presented in Table 1, show that MEAP performs comparably to, or even outperforms, NTP, achieving a 1.6% improvement in the overall average score. This finding provides strong evidence that incorporating random masking into NTP does not degrade the model's language modeling capacity. In the following evaluations, we examine whether MEAP further improves performance in key information retrieval and long-context modeling.

4.1.2. Needle-in-a-Haystack Retrieval

For key information retrieval, we choose the well-established Needle-in-a-Haystack evaluation (Liu et al., 2024b), where the model is asked to retrieve a random fact or statement (the 'needle') placed in the middle of a long context window (the 'haystack'). This approach provides quantitative metrics for assessing precise information extraction from extended contexts, particularly relevant for document analysis applications.

As this evaluation involves long-context modeling capacity, we follow the setting of Ye et al. (2024) and conduct a length extension to 64K. In particular, we continue training our model for an additional 4B tokens from SlimPajama (Soboleva et al., 2023) using the approach proposed in Fu et al. (2024). The implementation utilizes modified Rotary Position Embeddings with θ_base = 640,000.

Table 2. Single needle accuracy (%) with 32K context.

Tokens  40B   60B   200B
NTP     65.9  52.8  87.1
MEAP    80.2  85.8  98.2

To demonstrate MEAP's scalability, we increase the training token count to 40B, 60B, and 200B, reporting the needle-retrieval results in Table 2. The results show that MEAP consistently outperforms NTP across different training scales. At 40B tokens, MEAP achieves 80.2% accuracy, significantly surpassing the baseline's 65.9%. The performance gap peaks at 60B tokens, with MEAP maintaining steady improvement and reaching 85.8% accuracy. At 200B tokens, MEAP approaches optimal performance, attaining 98.2% accuracy, while the NTP baseline still falls short of 90%. It is noteworthy that MEAP achieves 85.8% accuracy using just 60B training tokens, whereas NTP requires approximately three times as many (200B tokens) to reach a similar level. This demonstrates MEAP's superior data efficiency over NTP in key information retrieval.

We further illustrate the retrieval performance of our 200B-token model with a 32K context length in Figure 3. Accuracy is reported across varying answer needle depths (y-axis) and context lengths (x-axis). The results show that MEAP generally maintains perfect accuracy across different context lengths and depths, with errors limited to only two grid cells. In contrast, NTP begins to exhibit accuracy degradation at a context length of 24K, affecting a wide range of depths from 50% to 100%.
[Figure 3: Needle-in-a-Haystack score heatmaps for (a) NTP pre-training and (b) MEAP pre-training; x-axis: context length (4K-32K tokens), y-axis: needle depth percent.]
Figure 3. Training dynamics comparison between standard pretraining and the MEAP framework. Scores are computed using ROUGE-1, measuring unigram overlap between model responses and expected answers.
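For reference, the following plain-Python sketch outlines the kind of evaluation summarized in Figure 3; the `model_generate` callable, the prompt format, and the whitespace-level tokenization are placeholders, not the benchmark's actual harness. A needle is inserted at a given depth of the haystack, and the response is scored by a ROUGE-1-style unigram recall against the expected answer.

def insert_needle(haystack_words, needle_words, depth_percent):
    """Place the needle at depth_percent (0-100) of the haystack."""
    pos = int(len(haystack_words) * depth_percent / 100)
    return haystack_words[:pos] + needle_words + haystack_words[pos:]

def rouge1_recall(response, reference):
    """Fraction of reference unigrams that appear in the response."""
    resp = set(response.lower().split())
    ref = reference.lower().split()
    return sum(1 for w in ref if w in resp) / len(ref) if ref else 0.0

def evaluate_needle(model_generate, haystack_words, needle_words,
                    question, expected_answer, depth_percent):
    """model_generate is a placeholder callable mapping a prompt string to a response string."""
    context = insert_needle(haystack_words, needle_words, depth_percent)
    prompt = " ".join(context) + "\n" + question
    return rouge1_recall(model_generate(prompt), expected_answer)

Sweeping depth_percent and the truncated haystack length over a grid reproduces the kind of heatmap shown in Figure 3.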

4.1.3. Multi-Document Question Answering

Table 3. Pre-training Evaluation. Relative accuracy (%) improvement of MEAP over NTP on multi-document QA.

Answer Position   1      5      10     15    20
10 documents      +7.6   +7.0   +30.6  –     –
20 documents      +12.4  +4.0   +5.1   +3.7  +27.2

We report the accuracy improvement of MEAP over NTP in Table 3. MEAP again consistently outperforms NTP by good margins across all configurations, with significant gains at later positions (+30.6% at position 10 in the 10-document setting and +27.2% at position 20 in the 20-document setting). These results indicate that MEAP enhances the model's ability to retrieve relevant information from long contexts, maintain performance across different context lengths and positions, and handle complex scenarios with multiple distractors. The improvements highlight the effectiveness of the masking strategy in strengthening the model's overall capability for long-context information retrieval.

4.1.4. Long-Context Reasoning Evaluation

We evaluate long-context reasoning capabilities using the Multi-Needle Reasoning Task (M-RS) (Li et al., 2024a), which requires models to retrieve multiple pieces of information from long texts and use them to logically answer questions that demand an integrated understanding of various text segments. This forces the model to distribute attention across contextually relevant tokens rather than focusing solely on local patterns.

We leverage the OpenCompass evaluation framework (Contributors, 2023) and report the results in Figure 4. MEAP consistently outperforms NTP across context lengths, with a 6.6 percentage point average improvement, demonstrating MEAP's enhanced capacity to maintain attention coherence over extended sequences.

[Figure 4: performance score (%) versus context length (8K-32K) for MEAP and NTP.]
Figure 4. Long-context reasoning performance comparison between MEAP and NTP on the Multi-Needle Reasoning Task (M-RS) across different context lengths.

4.1.5. Contextual Hallucination Evaluation

Table 4. Accuracy (i.e., free of hallucinations) on text summarization datasets.

Task  XSum  MultiNews  WikiSum
NTP   0.09  0.17       0.24
MEAP  0.13  0.19       0.33

Since MEAP enables more accurate key information retrieval, we expect it to suffer less from contextual hallucination. To verify this, we evaluate MEAP's ability to reduce contextual hallucinations on three summarization datasets: XSum (Narayan et al., 2018), WikiSum (Cohen et al., 2021), and MultiNews (Fabbri et al., 2019), following Ye et al. (2024).
For this setting, we fine-tune the pre-trained models on Alpaca and evaluate them. We compare model-generated and reference summaries using DeepSeek-V3 (Liu et al., 2024a) as the hallucination detector across 100 random samples per dataset. As shown in Table 4, our masking strategy achieves a consistent reduction in hallucination rates across all datasets.

4.2. Fine-tuning Evaluation

Setup. We fine-tune Llama-3-8B (Dubey et al., 2024) on the Alpaca instruction dataset (Taori et al., 2023a). The training configuration uses a global batch size of 512. The model is optimized with AdamW (β1 = 0.9, β2 = 0.95), a learning rate of 2 × 10−5 (with 10% warmup and cosine decay), and weight decay set to 0.01. We retain key architectural components from Llama-3, such as RoPE embeddings (Su et al., 2024), RMSNorm (Zhang & Sennrich, 2019), and grouped-query attention (Ainslie et al., 2023).

During fine-tuning, we randomly mask a portion of tokens in the assistant's response, while keeping the source context intact. Only the masked tokens are predicted during fine-tuning. The training process uses bfloat16 precision with DeepSpeed ZeRO Stage 2 (Ren et al., 2021) and the Llama-3 tokenizer (Dubey et al., 2024) with a maximum sequence length of 4096 tokens.

4.2.1. Language Modeling Evaluation

Similar to the pre-training evaluation, we first assess MEAP's effectiveness in language modeling. Table 5 presents the evaluation results. Our MEAP framework demonstrates consistent improvements across multiple tasks, achieving an average gain of 1.12 points over the NTP baseline. The performance improvements are particularly notable on ARC-c and WinoGrande, indicating enhanced reasoning capabilities. The results highlight MEAP's effectiveness in fine-tuning for complex reasoning tasks.

4.2.2. Multi-Document Question Answering

We evaluate MEAP's context-aware reasoning using the multi-document QA task with distractor suppression (Liu et al., 2024b). To ensure a fair comparison, we train MEAP for 2 epochs and NTP for 4 epochs, such that both approaches process a similar number of tokens. Table 6 quantifies the exact match (EM) improvements across critical document positions in the 20-document setting. MEAP consistently achieves notable gains across all positions, further demonstrating its superiority over NTP. Two key patterns emerge from the experimental results:

• Consistent Improvement: MEAP achieves substantial gains across all positions with an average improvement of 11.77%, showing robust performance throughout the document range.

• Mid-Context Advantage: The maximum improvement at position 20 (+15.22%) demonstrates enhanced long-range dependency modeling, crucial for connecting concepts across scientific documents.

These findings validate MEAP's effectiveness in preserving signal integrity across long contexts while highlighting opportunities for temporal reasoning enhancement and cross-document entity disambiguation.

[Figure 5: accuracy (%) versus training time (s) for MEAP and NTP fine-tuning runs.]
Figure 5. Comparison of fine-tuning efficiency between MEAP and NTP. 'MEAP-n' refers to MEAP training for n epochs.

4.3. Training Efficiency Analysis

MEAP introduces no additional overhead for pre-training or inference compared to standard NTP, as the only difference lies in the masking operation. During fine-tuning, MEAP requires duplicating the input sequence and training with a doubled sequence length, resulting in increased training overhead. This overhead, however, is effectively amortized by MEAP's higher data utilization efficiency. Specifically, compared to NTP, MEAP requires only 50% of the epochs with a similar number of tokens being processed, while outperforming the latter significantly.

To verify this, we report results on multi-document QA retrieval from 20 documents (Liu et al., 2024b), where retrieval performance is assessed by computing the average retrieval values across 5 positions. As shown in Figure 5, a single epoch of MEAP training significantly outperforms two epochs of NTP training by a large margin while also reducing total training time. This highlights MEAP's data efficiency, achieving similar or better results with fewer computational resources.

In summary, MEAP delivers significant training time reductions with improved or comparable performance on the retrieval task, highlighting its efficiency and effectiveness in large-scale training scenarios.
Table 5. Fine-tuning Evaluation. Performance of MEAP and NTP on various commonsense reasoning tasks. Results are measured by fine-tuning Llama-3-8B.

        ARC-c  ARC-e  BoolQ  PIQA   HellaSwag  WinoGrande  OBQA   Average
NTP     53.58  81.10  83.98  79.27  62.74      72.06       39.40  67.30
MEAP    55.12  83.21  83.82  81.01  63.31      74.27       38.20  68.42

Table 6. Fine-tuning Evaluation. Accuracy (%) of MEAP and NTP on multi-document QA with 20 documents.

Position  1      5       10      15      20
NTP       24.29  22.82   24.11   25.46   31.11
MEAP      33.22  34.16   36.01   36.91   46.33
∆         +8.93  +11.34  +11.90  +11.45  +15.22

5. Why Does MEAP Work?

In this section, we attempt to interpret the underlying reasons for the effectiveness of MEAP. We conjecture that MEAP's effectiveness stems from its ability to promote more distinguishable attention by focusing on fewer tokens during LLM training, as masked tokens [MASK] are expected to receive marginal attention scores.

While effective, attention mechanisms in LLMs often struggle with long-context understanding, where redundant and non-informative attention is assigned to tokens (Liu et al., 2024b; Li et al., 2024b). A plausible explanation is that the attention module relies on the Softmax function to normalize attention scores within (0, 1), which tends to minimize differences among tokens, especially when training on sequences of thousands of tokens. Furthermore, LLMs exhibit a phenomenon known as attention sinks, where the first few tokens receive disproportionately high attention scores compared to the rest (Xiao et al., 2023). Collectively, these factors lead to small and nearly indistinguishable attention scores across tokens, which is generally undesirable. When LLMs fail to properly differentiate between tokens, they are more likely to generate incorrect outputs.

By randomly replacing tokens with masks, MEAP implicitly penalizes the attention scores at masked positions, thereby amplifying the attention differences among non-masked tokens. This masking mechanism encourages the model to generate more distinguishable attention scores, allowing it to focus on task-relevant text while mitigating the influence of peripheral context. We validate this hypothesis through the following experiments.

5.1. Masking Leads to More Distinguishable Attention

To elucidate the mechanistic impact of our masking strategy on model behavior, we conducted a detailed analysis of attention distribution patterns. Our experimental protocol involved sampling 500 sequences. The original, unmodified samples X_N serve as the input for NTP, and their masked counterparts X_M (the same samples with 15% of tokens masked) serve as the input for MEAP. These sequence pairs were then processed through our 1.1B models pre-trained with NTP and MEAP, respectively. We compare two values:

(1) Attention Score Decay: the percentage decrease in the averaged attention score at masked positions, computed as

    (Att(X_N[mask = 1]) − Att(X_M[mask = 1])) / Att(X_N[mask = 1])

(2) Attention Variance Increase: the increase in attention variance at non-masked positions, computed as

    Var(Att(X_M[mask = 0])) − Var(Att(X_N[mask = 0]))

Expectations. We anticipate that the average attention score at masked positions will undergo a significant decline in the MEAP-trained model, indicating that masked tokens receive minimal attention in MEAP. Consequently, this reduction is expected to increase the attention variance at non-masked positions, making the attention distribution in MEAP more distinguishable compared to NTP.

Results. Table 8 confirms our expectations. MEAP assigns 53.34% less attention to masked tokens, resulting in a 7.80% increase in attention variance. This finding validates that MEAP learns more distinguishable attention scores compared to NTP.
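A simplified sketch of these two measurements follows; it assumes the attention map has already been averaged over layers and heads, and it takes the attention a position "receives" as the mean of its column over query rows. The authors' exact aggregation over the 500 sampled sequences is not specified here.

import torch

def attention_score_decay(att_ntp, att_meap, mask_positions):
    """Percentage decrease of the average attention received at masked positions,
    i.e. (Att(X_N[mask=1]) - Att(X_M[mask=1])) / Att(X_N[mask=1]).

    att_*: (seq_len, seq_len) attention maps averaged over heads and layers.
    mask_positions: boolean vector of length seq_len (True where masked).
    """
    recv_ntp = att_ntp[:, mask_positions].mean()
    recv_meap = att_meap[:, mask_positions].mean()
    return ((recv_ntp - recv_meap) / recv_ntp).item()

def attention_variance_increase(att_ntp, att_meap, mask_positions):
    """Var(Att(X_M[mask=0])) - Var(Att(X_N[mask=0])) at non-masked positions."""
    keep = ~mask_positions
    var_meap = att_meap[:, keep].mean(dim=0).var()
    var_ntp = att_ntp[:, keep].mean(dim=0).var()
    return (var_meap - var_ntp).item()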
5.2. MEAP Focuses More on Task-Relevant Tokens

To verify whether MEAP learns more effective attention, we measure the average attention scores that the model assigns to different input segments. We structured our input sequences into distinct segments: context-before, answer, context-after, query, and EOS token. The complete input sequence was formed by concatenating these segments sequentially, followed by an EOS token. This structured format enabled precise tracking of attention allocation across different functional components.

Expectation. Our expectation is that MEAP amplifies attention to answer spans while reducing attention to less relevant tokens.
[Figure 6: attention distribution over input segments for (a) NTP and (b) MEAP.]
Figure 6. Attention distribution comparison between NTP and MEAP during inference. The input sequence consists of: context-before (“In the heart of Paris, the Eiffel Tower stands tall, symbolizing both the city and the entire country.”), answer (“Designed by Gustave Eiffel”), context-after (“, it was completed in 1889 for the World’s Fair. Originally criticized for its unusual design, it has since become one of the most recognizable landmarks in the world. Tourists from all over the globe visit it every year, making it one of the most photographed monuments.”), and query (“question: Who designed the Eiffel Tower?”). MEAP allocates a much higher attention score to answer-relevant tokens (0.345) compared to NTP (0.094).

Table 8. Attention score comparison between NTP and MEAP.

Input Length  Mask Ratio  Score Decay  Var. Increase
1,024         0.15        34.08%       12.66%
4,096         0.15        53.34%       7.80%

Results. The attention distributions during inference for both models are visualized in Figure 6. Notably, MEAP exhibits a substantial improvement in answer-relevant attention (34.5% vs. 9.4%) while reducing the dominance of context-before attention from 73.1% to 49.1%. Both models maintain similar attention levels for peripheral components, including context-after, query sections, and EOS tokens (all approximately 5%–6%). These results demonstrate that the MEAP framework enhances attention allocation during inference, prioritizing key information more effectively.
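As an illustration of this segment-level measurement, the attention share of each segment can be computed by summing the (layer- and head-averaged) attention received over that segment's token span; the segment boundaries below are hypothetical (start, end) indices, not the actual tokenization used in Figure 6.

import torch

def segment_attention_shares(att, segments):
    """att: (seq_len, seq_len) attention map averaged over heads and layers.
    segments: dict mapping segment name -> (start, end) token index range, e.g.
        {"context_before": (0, 25), "answer": (25, 31),
         "context_after": (31, 90), "query": (90, 100), "eos": (100, 101)}
    Returns the share of total attention that each segment receives.
    """
    total = att.sum()
    return {
        name: (att[:, start:end].sum() / total).item()
        for name, (start, end) in segments.items()
    }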
Table 7. Performance comparison of different mask ratios in pre-training and fine-tuning. The best results are highlighted in bold.

              Pre-training                        Fine-tuning
Mask Ratio    NTP   0.05  0.10  0.15  0.20        NTP   0.05  0.10  0.15
Accuracy      0.52  0.54  0.56  0.58  0.56        0.72  0.77  0.81  0.71

6. Ablation Study

We conduct ablation studies on the mask ratio for both pre-training and fine-tuning settings. Table 7 summarizes the results. For pre-training, we evaluate our pre-trained model from Section 4.1 on the multi-document QA task using the nq-open-oracle dataset (Liu et al., 2024b). For fine-tuning, we train MEAP on the Alpaca dataset (Taori et al., 2023b) for 3 epochs with different mask ratios against standard NTP baselines trained for 6 epochs for a fair comparison. The results show that a mask ratio of 0.15 achieves the best performance in pre-training, while a mask ratio of 0.10 yields the highest accuracy in fine-tuning. MEAP consistently outperforms standard NTP in both pre-training and fine-tuning, demonstrating its effectiveness in leveraging masked tokens for improved performance.

7. Conclusion

This work addresses challenges in information processing through a straightforward approach that masks 10%–15% of the input while retaining the standard next-token prediction objective. Our results show significant improvements in comprehension across longer contexts, achieved without additional computational costs. This approach demonstrates remarkable efficiency, matching with just 60B training tokens the performance that typically requires 200B tokens with conventional methods. The results indicate that this strategy leads to more effective processing of key information through improved focus on relevant content. Since it requires no structural changes, this method can be readily integrated into existing systems without disrupting workflows.
Impact Statement

This work proposes a modified pre-training paradigm that may influence how both industry and academia approach language model training. MEAP integrates seamlessly with existing LLM frameworks without requiring additional engineering effort or computational resources. While the improvement in information retrieval and reasoning capabilities could have broad implications for downstream applications, the method's computational efficiency and architectural compatibility mean it can be readily adopted within current training infrastructures. We anticipate this work will contribute to more efficient model development while maintaining established training pipelines and computational requirements.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Ba, J. L. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Brown, T. B. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Cohen, N., Kalinsky, O., Ziser, Y., and Moschitti, A. WikiSum: Coherent summarization dataset for efficient human-evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 212–219, 2021.

Contributors, O. OpenCompass: A universal evaluation platform for foundation models. [Link], 2023.

Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. Advances in Neural Information Processing Systems, 32, 2019.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Fabbri, A. R., Li, I., She, T., Li, S., and Radev, D. R. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019.

Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL [Link].

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Kamradt, G. Needle in a haystack: Pressure testing LLMs. GitHub repository, 2023.

Li, M., Zhang, S., Liu, Y., and Chen, K. NeedleBench: Can LLMs do retrieval and reasoning in 1 million context window?, 2024a. URL [Link].

Li, T., Zhang, G., Do, Q. D., Yue, X., and Chen, W. Long-context LLMs struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024b.

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.

Liu, Y. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Narayan, S., Cohen, S. B., and Lapata, M. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint, 2018.

Radford, A. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Ren, J., Rajbhandari, S., Aminabadi, R. Y., Ruwase, O., Yang, S., Zhang, M., Li, D., and He, Y. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pp. 551–564, 2021.

Roberts, A., Raffel, C., Lee, K., Matena, M., Shazeer, N., Liu, P. J., Narang, S., Li, W., and Zhou, Y. Exploring the limits of transfer learning with a unified text-to-text transformer. Google, Tech. Rep., 2019.

Shazeer, N. GLU variants improve Transformer. arXiv preprint arXiv:2002.05202, 2020.

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. URL [Link].

Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. [Link], 2023a.

Taori, R., Gulrajani, I., Zhang, T., et al. Alpaca: A strong, replicable instruction-following model. [Link], 2023b.

Tay, Y., Dehghani, M., Tran, V. Q., Garcia, X., Wei, J., Wang, X., Chung, H. W., Shakeri, S., Bahri, D., Schuster, T., et al. UL2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Wang, A. and Cho, K. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094, 2019.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

Yang, Z. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.

Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.

Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.

A. Experimental Details of Pre-training


A.1. Architecture and Hyperparameters
This section outlines the pre-training hyperparameters of the MEAP model, designed to ensure efficient training and optimal
performance. The sequence length is fixed at 4096 tokens, enabling the model to handle long-range dependencies while
maintaining computational efficiency. The learning rate schedule includes an initial warm-up phase for the first 10% of
training steps, followed by cosine decay to 10% of the initial value, allowing gradual and precise parameter adjustments.
The AdamW optimizer is used with standard hyperparameters β1 = 0.9 and β2 = 0.95 to stabilize the optimization process.
Learning rate bounds are set between 4 × 10−4 and 4 × 10−5 to ensure effective learning throughout training, while a
weight decay of 5 × 10−2 helps prevent overfitting and promote generalization by penalizing excessively large weights.
Complete training hyperparameters are documented in Table 9.
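As a worked example of this schedule (a sketch, not the training code itself), the learning rate at a given step can be computed as linear warm-up over the first 10% of steps followed by cosine decay from 4 × 10−4 down to 4 × 10−5:

import math

def learning_rate(step, total_steps, max_lr=4e-4, min_lr=4e-5, warmup_frac=0.10):
    """Linear warm-up for the first warmup_frac of steps, then cosine decay
    from max_lr down to min_lr (10% of the initial value)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))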
The model sizes and corresponding hyperparameters are shown in Table 10.

Table 9. Hyperparameters of training

Name Hyperparameter
optimizer AdamW
lr schedule cosine
clip 1.0
max learning rate 4 × 10−4
min learning rate 4 × 10−5
weight decay 5 × 10−2
sequence length 4096
batch size 256
epoch 1

Table 10. Hyperparameters of pretrained MEAP models. Data amount are specified in tokens.

Params Hidden Intermediate Heads Layers Steps Data amount


100M    768     2048    12    12    2K      2B
500M    1024    4864    16    24    10K     10B
1.1B    2048    5632    24    32    190K    200B

A.2. Pre-training Loss of Different Model Sizes


The loss curves of the MEAP model at various sizes, as shown in Figure 7, provide a detailed visualization of the model’s
performance across different scales.

A.3. Language Modeling Evaluation of All Model Sizes for Pre-training


As shown in Table 11, we present the evaluation results of models of different scales implemented using our method. To
comprehensively assess the language modeling performance of these models, we conducted a detailed analysis for each
model, with particular focus on their performance at varying scales.

A.4. Pretrained Model Evaluation Under Different Masking Rates


As shown in Table 12, we present the evaluation results of models implemented with our approach, where different mask
rates are applied during training. A comprehensive and detailed analysis of the language modeling performance is conducted
for each mask rate, with a focus on how varying levels of masking influence key performance metrics. This analysis
elucidates the effects of mask rates on the model’s ability to handle diverse linguistic tasks, highlighting any changes in
accuracy as the mask rate is adjusted.

Figure 7. Overview of the pre-training loss curves for all model sizes.

Table 11. Results of MEAP pre-trained models of all sizes

Benchmark 100M 500M 1.1B


ARC-Challenge 17.32 18.4 25.4
ARC-Easy 31.99 42.0 56.4
BoolQ 45.14 55.63 59.5
HellaSwag 26.82 30.77 43.4
OpenBookQA 11.41 16.40 22.6
PIQA 58.49 66.81 72.3
WinoGrande 52.09 49.57 55.3
Avg 34.75 39.94 47.85

A.5. Details Of Contextual Hallucination Evaluation


Here is the prompt for summary generation, where "doc" is the original text to be summarized.
Summarize the following article: doc
We use the following prompt to have the DeepSeek-V3 model perform binary classification, determining whether there is hallucination in the model output compared to the human summary.
The ”model output” is the output of the model, and the ”predicted label” is the manually annotated
label. Please compare the ”model output” with the ”predicted label”. By comparing the two, check
if the ”model output” is similar. If it is similar, return 1; otherwise, return 0. An explanation of the
output is required. Here is the output format I provide. Please follow it strictly!! Score: xx
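A small sketch of how this judging step could be wired up is shown below; the judge_model callable stands in for a DeepSeek-V3 client, and the way the model output and reference label are appended to the prompt is an assumption, since only the instruction text is given above.

import re

JUDGE_INSTRUCTION = (
    'The "model output" is the output of the model, and the "predicted label" is the manually '
    'annotated label. Please compare the "model output" with the "predicted label". By comparing '
    'the two, check if the "model output" is similar. If it is similar, return 1; otherwise, '
    'return 0. An explanation of the output is required. Here is the output format I provide. '
    "Please follow it strictly!! Score: xx"
)

def judge_hallucination(judge_model, model_output, predicted_label):
    """judge_model: placeholder callable mapping a prompt string to the judge's reply string.
    Returns 1 if the judge deems the output similar to the label, else 0."""
    prompt = (
        JUDGE_INSTRUCTION
        + "\n\nmodel output: " + model_output      # assumed way of attaching the two texts
        + "\npredicted label: " + predicted_label
    )
    reply = judge_model(prompt)
    match = re.search(r"Score:\s*([01])", reply)
    return int(match.group(1)) if match else 0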

Table 12. Results of the 1.1B MEAP model under different masking rates

Benchmark        Mask Ratio 0.15    Mask Ratio 0.05    Mask Ratio 0.10
ARC Challenge 25.4 26.11 24.3
ARC Easy 56.4 56.1 54.3
BoolQ 59.5 56.5 53.4
HellaSwag 43.4 43.69 43.85
OpenBookQA 22.6 22.0 21.8
PIQA 72.3 72.63 72.91
Winogrande 55.3 56.4 56.91
Avg 47.84 47.63 46.78

Table 13. Attention change of example 1

Area Content MEAP attention NTP attention


context before The Great Wall of China, stretching 0.491 0.731
over 13,000 miles, is one of the most
impressive feats of ancient engineer-
ing.
answer Built to protect Chinese states from 0.329 0.108
invasions
context after the wall took several dynasties over 0.078 0.080
2,000 years to complete. Its im-
mense length and historical signif-
icance make it a popular tourist
attraction today. The wall’s con-
struction involved countless work-
ers, many of whom faced difficult
conditions.
query question:What was the purpose of 0.067 0.070
the Great Wall of China?
eos <s> 0.069 0.071

A.6. Details of Attention Distribution of MEAP and NTP


To validate the generality of attention changes, we conducted corresponding tests on additional examples and observed that
the attention changes in these examples are consistent with the results presented in the main text. The specific changes are
detailed in Table 13, Table 14, and Table 15.

Table 14. Attention change of example 2

Area Content MEAP attention NTP attention


context before In the early 20th century, Albert Ein- 0.435 0.694
stein introduced his theory of rela-
tivity, which changed the way we
understand space, time, and gravity.
answer His famous equation, E=mc² 0.386 0.115
context after shows the relationship between en- 0.066 0.074
ergy and mass. Einstein’s ideas rev-
olutionized physics, and his work
led to the development of technolo-
gies like GPS and nuclear energy.
Despite facing initial skepticism,
his theories were eventually proven
through experiments and observa-
tions, earning him a Nobel Prize in
Physics in 1921.
query question:What famous equation did 0.057 0.060
Albert Einstein create?
eos <s> 0.055 0.057

Table 15. Attention change of example 3

Area Content MEAP attention NTP attention


context before At the center of Rome, the Colos- 0.579 0.748
seum rises as a magnificent testa-
ment to ancient Roman architecture,
symbolizing the grandeur of the Ro-
man Empire.
answer Constructed between 70 and 80 AD 0.219 0.065
under the emperors Vespasian and
Titus,
context after it was used for gladiatorial contests 0.071 0.067
and public spectacles. Once a sym-
bol of Roman power, the Colosseum
has weathered centuries of change
but remains one of the most iconic
structures in the world. Tourists
flock to see it every year, making it
one of the most photographed mon-
uments in history.
query question:Who built the Colosseum? 0.063 0.050
eos <s> 0.068 0.069

B. Details of Fine-tuning Experiments


B.1. Architecture and Hyperparameters
This section details the MEAP fine-tuning hyperparameters for the Llama3 model. The maximum sequence length is 4096
tokens, optimizing long-range dependencies and efficiency. The batch size is 512, and the learning rate schedule includes a
warm-up for the first 10% of training steps. The AdamW optimizer is used with β1 = 0.9 and β2 = 0.95, and the learning
rate is set to 2 × 10−5 .

Table 16. MEAP fine-tuning hyperparameters of Llama3 model

Name Hyperparameter
optimizer AdamW
lr schedule cosine
clip 1.0
learning rate 2 × 10−5
weight decay 5 × 10−2
maximum sequence length 4096
batch size 512
