0% found this document useful (0 votes)
86 views13 pages

Chen Et Al. - 2025 - Multimodal Air-Quality Prediction A Multimodal Feature Fusion Network Based On Shared-Specific Moda

This study presents a Shared-Specific Modality Decoupling-based Spatiotemporal Multimodal Fusion Network for air-quality prediction, integrating remote sensing data with ground monitoring observations to enhance feature representation. The proposed framework effectively disentangles shared and modality-specific features, improving the accuracy of air-quality forecasting through a hierarchical attention-graph convolution fusion module. Evaluations on real-world datasets demonstrate the model's superior performance compared to baseline methods, highlighting the benefits of multimodal data integration in environmental monitoring.

Uploaded by

tgk67196
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
86 views13 pages

Chen Et Al. - 2025 - Multimodal Air-Quality Prediction A Multimodal Feature Fusion Network Based On Shared-Specific Moda

This study presents a Shared-Specific Modality Decoupling-based Spatiotemporal Multimodal Fusion Network for air-quality prediction, integrating remote sensing data with ground monitoring observations to enhance feature representation. The proposed framework effectively disentangles shared and modality-specific features, improving the accuracy of air-quality forecasting through a hierarchical attention-graph convolution fusion module. Evaluations on real-world datasets demonstrate the model's superior performance compared to baseline methods, highlighting the benefits of multimodal data integration in environmental monitoring.

Uploaded by

tgk67196
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Environmental Modelling and Software 192 (2025) 106553

Contents lists available at ScienceDirect

Environmental Modelling and Software


journal homepage: [Link]/locate/envsoft

Multimodal air-quality prediction: A multimodal feature fusion network


based on shared-specific modal feature decoupling
Xiaoxia Chen a ,∗, Zhen Wang a , Fangyan Dong b , Kaoru Hirota c
a
The Faculty of Electrical Engineering and Computer Science, Ningbo University, No. 818 Fenghua Road, Ningbo, 315211, Zhejiang Province, China
b
The Faculty of Mechanical Engineering and Mechanics, Ningbo University, No. 818 Fenghua Road, Ningbo, 315211, Zhejiang Province, China
c
The School of Automation, Beijing Institute of Technology, Beijing, 100081, China

ARTICLE INFO ABSTRACT

Keywords: Severe air pollution degrades air quality and threatens human health, necessitating accurate prediction for
Air-quality prediction pollution control. While spatiotemporal networks integrating sequence models and graph structures dominate
Multimodal fusion current methods, prior work neglects multimodal data fusion to enhance feature representation. This study
Spatial–temporal network
addresses the spatial limitations of single-perspective ground monitoring by synergizing remote sensing data,
Time series forecasting
which provides global air quality distribution, with ground observations. We propose a Shared-Specific
Modality Decoupling-based Spatiotemporal Multimodal Fusion Network for air-quality prediction, comprising:
(1) feature extractors for remote sensing images and ground monitoring data, (2) a decoupling module
separating shared and modality-specific features, and (3) a hierarchical attention-graph convolution fusion
module. This framework achieves effective multimodal fusion by disentangling cross-modal dependencies while
preserving unique characteristics. Evaluations on two real-world datasets demonstrate superior performance
over baseline models, validating the efficacy of multimodal integration for spatial–temporal air quality
forecasting.

1. Introduction Air-quality prediction has garnered significant attention in the envi-


ronmental field, with research methods continually evolving alongside
Since the widespread use of fossil fuels began, air-quality has technological advancements. These methods can be categorized into
emerged as a significant environmental issue garnering widespread three main types: traditional statistical methods, numerical model-
concern. Air-quality is closely linked to human health, with poor air- based approaches, and artificial intelligence-based methods. Statistical
quality being a common cause of respiratory diseases. Among the methods, such as autoregressive integrated moving average (ARIMA)
many air pollutants, Particulate Matter 2.5 (PM2.5 ) is one of the and multiple linear regression (Kumar and Jain, 2010; Gu et al.,
primary contributors to the deterioration of air-quality. The increase 2021), struggle to capture complex temporal features and interac-
in PM2.5 concentrations significantly raises the likelihood of smog tions between pollutants, making precise prediction challenging under
formation, thereby degrading air-quality (Lelieveld et al., 2023; Li complicated meteorological conditions. Numerical model-based ap-
et al., 2024). Moreover, because PM2.5 often carries harmful substances proaches (He et al., 2013), through multiple iterations, have advanced
and pathogens, prolonged exposure to high concentrations of PM2.5 to third-generation models capable of simulating intricate atmospheric
can lead to cardiovascular and respiratory diseases, posing severe risks chemical and physical processes using three-dimensional atmospheric
to human health. Timely monitoring and accurate forecasting of air- grids. While numerical models can accurately predict various pollutants
quality are crucial for protecting human health (Yu et al., 2024; Hu and extreme weather, they suffer from high complexity and lack the
et al., 2023; Sun et al., 2023). With the widespread establishment of air- ability to be optimized in a data-driven manner. In recent years,
quality monitoring stations across the country, the availability of vast artificial intelligence techniques have come to dominate air-quality
amounts of real-time and accurate air-quality data has made precise air- prediction. Machine learning models like support vector regression and
quality forecasting possible. Accurate air-quality predictions can guide random forests offer good interpretability but still fall short in capturing
people’s travel and social activity planning, provide decision-making complex temporal features (Liu et al., 2017; Hasnain et al., 2023; Ma
support for governments, and underpin further atmospheric research. et al., 2020). Sequence modeling methods in deep learning, such as

∗ Corresponding author.
E-mail address: chenxiaoxia@[Link] (X. Chen).

[Link]
Received 23 March 2025; Received in revised form 23 May 2025; Accepted 2 June 2025
Available online 17 June 2025
1364-8152/© 2025 Published by Elsevier Ltd.
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

long short-term memory (LSTM) (Chen et al., 2021), gate recurrent


unit (GRU) (Wang and Shao, 2022), and transformer (Cao et al., 2024;
Zhang et al., 2023), excel at extracting sequential features, making
them highly effective in air-quality prediction. Moreover, spatiotempo-
ral network models that integrate graph neural networks have emerged,
combining temporal sequence modeling with spatial diffusion modeling
to achieve highly accurate air-quality prediction (Li et al., 2017; Yu
et al., 2018; Bai et al., 2020; Guo et al., 2019; Song et al., 2020; Chen
et al., 2024b,a, 2025b; Chi et al., 2024; Chen et al., 2025a).
Previous studies on air-quality prediction have primarily used air-
quality time series data. To enhance prediction performance,
researchers have also introduced several auxiliary data sources, in-
cluding meteorological information, points of interest (POI) data, and
geographical location information. Such additional data provide valu-
Fig. 1. Schematic diagram of the multimodal feature decoupling framework.
able supplementary insights beyond the core time series data (Wang
et al., 2021a; Gu et al., 2018; Liu et al., 2022; Chadalavada et al.,
2024). However, these data sources have typically been used only to
offer additional context, aiding in the deeper extraction of features the challenges of extracting and fusing effective features from multi-
from air-quality data. We have observed that prior studies have often modal sources to enhance the performance of multimodal air-quality
prediction. To address these challenges, we aim to effectively integrate
limited their focus to deeply mining the feature information inherent
spatiotemporal multimodal data. This data comes from both ground-
within the air-quality time series data itself, with little attention given
based air-quality monitoring and satellite-derived PM2.5 remote sensing
to integrating other relevant data sources to further enrich the feature
imagery. As a first step, we have designed a feature extraction mod-
set and thereby enhance the air-quality prediction task. In recent
ule, which captures both temporal and spatial features from ground
years, multimodality has garnered widespread attention. Multimodal
monitoring data and remote sensing images. Additionally, we proposed
data refers to data acquired through multiple sensory channels or
a Multimodal Fusion Framework (MMFF) based on shared-specific
information sources, which can complement each other, providing a
modality decoupling, as shown in Fig. 1. This framework employs
more comprehensive and in-depth analytical perspective. a data-driven approach to disentangle shared and specific modality
This provides inspiration for our research, where we aim to en- features from multiple modes and enhances the complementarity of
rich feature information by integrating air-quality data from different multimodal features through a hierarchical graph convolutional fusion
modalities, achieving complementarity and refinement to enhance the module. Based on this framework, we further developed a Spatial–
accuracy of air-quality prediction. Data from ground monitoring sta- Temporal Multimodal Fusion Network (STMFNet) for air-quality pre-
tions offer high precision but only reflect local air-quality around the diction, grounded on the shared-specific modality decoupling fusion
station, whereas satellite remote sensing images provide broad cover- framework. Specifically, we incorporated a time series feature extrac-
age, capturing air-quality distribution across cities, regions, and even tion (TSFE) module and a multi-scale remote sensing images feature
entire nations, though with less precision in localized areas. Therefore, extraction (RSIFE) module, enabling the comprehensive extraction of
ground monitoring data and satellite imagery can complement each both time-series and remote sensing image features. For TSFE mod-
other: remote sensing images offer spatial coverage and global air- ule, we adopted a decomposition method that separates air-quality
quality distribution, while ground monitoring data provide detailed data into trend components reflecting long-term changes and seasonal
information on localized and dynamic changes. Although multimodal components capturing short-term fluctuations, and designed specialized
data can achieve mutual supplementation and enhancement of feature long-term and short-term feature extraction modules based on their
information, it also brings forth new challenges. The foremost challenge respective data distribution characteristics. For RSIFE module, we de-
lies in the fusion of multimodal features—how to effectively integrate signed a network capable of extracting both local and global features
data from different modalities to maximize their complementary advan- from remote sensing images, thus capturing their spatial variability on
tages remains a significant issue. Additionally, there is the challenge of multiple scales. In summary, the main contributions of this paper are
feature selection and extraction within multimodal features. In multi- as follows:
modal data, information from different modalities may possess varying
1. We propose a spatiotemporal multimodal fusion network based
degrees of importance and redundancy. Extracting and selecting the
on shared-specific modality decoupling for air-quality predic-
most representative features from multimodal data, and effectively
tion. This network extends traditional time series forecasting
utilizing these features during the fusion process, is a considerable
by incorporating multimodal data to enrich feature information,
challenge. significantly enhancing the accuracy of air-quality prediction.
The ground-based air quality monitoring data and remote sens-
2. For ground-level air-quality monitoring data and high-altitude
ing image data with different modalities both fundamentally reflect remote sensing image data, we have designed separate tempo-
the spatiotemporal distribution characteristics of pollutant concentra- ral and spatial feature extraction modules. The temporal fea-
tions in air quality. The distinction lies in their differing observation ture extraction module employs a seasonal-trend decomposition
methods and spatiotemporal sampling scales. If we can effectively method to capture both long-term and cyclical features, while
separate the high-dimensional shared features between them, it will the spatial feature extraction module utilizes a multi-scale ap-
help fundamentally capture the spatiotemporal characteristics of air proach to capture local and global variation features from the
quality variations. Therefore, we consider the feasibility of introducing remote sensing images.
decoupling methods to isolate modality-specific features unique to each 3. A multimodal feature fusion framework based on shared-specific
data type, thereby extracting common spatiotemporal features that modality decoupling was proposed. This framework utilizes a
reflect air quality changes across different modal characteristics. This decoupling approach to separate shared and specific modality
approach aims to mitigate conflicts and biases caused by heterogeneous features from spatiotemporal multimodal data and constructs
modal data while further enhancing the effectiveness of air quality a hierarchical relational graph to facilitate the complementary
predictions. exchange and deep fusion of feature information between the
Therefore, in order to further integrate multimodal air-quality infor- specific and shared modalities. This ensures the thorough inte-
mation and enrich the associated feature data, it is crucial to address gration of multimodal data.

2
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

4. Extensive experiments were conducted on two real-world Some studies have employed matrix theory for multimodal feature
datasets, and the results demonstrate that our model consistently fusion. Zhang et al. achieved feature fusion by performing tensor outer
outperforms all baseline models. products on different modal feature data. Zhao et al. (2024) and
Liu et al. (2021), recognizing the complexity of matrix operations,
The remainder of this paper is organized as follows: Section 2 re- further combined matrix decomposition methods, transforming the
views prior research related to this work. Section 3 provides a detailed original feature matrix into low-dimensional vectors, thereby eliminat-
description of the proposed model. Section 4 presents comprehensive ing redundant features and reducing computational complexity to some
performance experiments and visualizations of prediction outcomes, extent. Additionally, methods based on Transformer networks or incor-
along with extensive ablation studies on different architectures and key porating attention mechanisms have become common approaches for
components. Section 5 concludes the paper. multimodal feature fusion. Ma et al. (2024) employed a Transformer-
based architecture to effectively model long-range dependencies and
2. Related work achieve precise remote sensing semantic segmentation by utilizing a
cross-attention mechanism for multi-scale fusion of multimodal infor-
2.1. Air-quality prediction mation. Zadeh et al. (2018) proposed the MFN network, introducing
memory attention mechanisms and gating mechanisms to simulta-
Air-quality prediction is a quintessential time series prediction task.
neously capture temporal and cross-modal interactions. Fang et al.
In data-driven approaches, several classical statistical methods such
(2023) used multi-level attention mechanisms to capture significant
as ARIMA (Kumar and Jain, 2010), multiple linear regression (Gu
intra-modal features and employed attention mechanisms to learn cor-
et al., 2021), and classification and regression trees (Gass et al., 2014)
relations between features across various modalities. Zou et al. (2022)
have been employed for air-quality prediction tasks. However, these
combined Transformer networks for multimodal feature fusion, intro-
methods are typically limited to single time series and often fall short
ducing cross-modal interactions in the multi-head attention mechanism
in terms of prediction accuracy. On the other hand, machine learning-
and enriching the feature information obtained from other modalities
based methods, such as support vector regression (Liu et al., 2017),
while preserving the integrity of the primary modal features. Although
gradient boosting trees (Ma et al., 2020), random forests (Hasnain
attention mechanisms effectively capture global correlations in feature
et al., 2023), and decision trees (Naveen et al., 2023), are also widely
information, their high complexity and computational overhead are
applied in the field of air-quality prediction. While these approaches
significant concerns. Ma et al. (2022) further integrated CNN and
offer commendable interpretability, they struggle to achieve precise
ViT within a unified fusion framework, combining shallow and deep
air-quality prediction.
features in a multi-level manner to accurately characterize both lo-
Within deep learning-based approaches, some studies have lever-
cal details and global semantics, thereby accomplishing multimodal
aged sequence models such as LSTM (Chen et al., 2021), GRU (Wang
fusion-based semantic segmentation of remote sensing imagery. Some
and Shao, 2022), temporal convolutional networks (TCN) (Ren et al.,
studies (Li et al., 2020) have integrated generative adversarial networks
2023), Transformer (Cao et al., 2024; Xia et al., 2025), as well as MLP-
to generate fused modal features, but the reliability and stability of
based models to model air-quality data (Maciąg et al., 2023), achieving
generative adversarial networks require further consideration.
promising results. However, these sequence models typically treat the
data from each station as independent entities, overlooking the inter- In the environmental and remote sensing fields, multimodal fusion
relationships between stations. Consequently, some research (Li et al., methods are commonly used to integrate multisource remote sensing
2017; Yu et al., 2018) has highlighted that further extraction of spatial images with environmental data from other modalities. These methods
correlations between stations can aid in uncovering hidden features are widely applied in tasks such as remote sensing image classification,
in air-quality data. By incorporating graph neural networks, spatial object detection, semantic segmentation, regional meteorological im-
correlations between stations can be modeled in the form of graph age simulation and reconstruction, as well as pollutant concentration
structures to extract spatial features. Common methods for constructing prediction. Roy et al. (2023) proposed a multimodal feature fusion
graphs include measuring geographic distance, regional functional sim- Transformer, achieving image classification of multisource remote sens-
ilarity, and dynamic time warping to create adjacency matrices (Wang ing images. Qingyun and Zhaokui (2022) introduced a cross-modal
et al., 2021b). Additionally, some studies have combined multiple attention fusion network for object detection in multispectral remote
metrics to construct more comprehensive graph structures and adaptive sensing images. In the domain of air-quality prediction, methods com-
graphs (Wu et al., 2019; Zhang et al., 2025, 2024), further exploring bining multimodal features are relatively scarce. Among them, Rowley
spatial features. and Karakuş (2023) combined satellite remote sensing images and
The aforementioned methods primarily focus on extracting deep pollutant data to predict pollutants such as nitrogen dioxide and ozone.
data features from air-quality data from both temporal and spatial per- Xia et al. (2024) proposed a multimodal prediction method that in-
spectives, achieving in-depth spatiotemporal feature extraction. How- tegrates remote sensing image data with pollutant time series data
ever, these methods are limited to the characteristics of the air-quality from multiple stations. Zhang et al. (2021) combined NO2 concentra-
data itself, neglecting other types of environmental data that also tion data at multiple vertical levels with ground-level concentration
contain valuable feature information. When deep features have already observations and meteorological data to achieve NO2 concentration
been thoroughly explored, integrating additional environmental data prediction from a three-dimensional perspective. Lei et al. (2022) in-
to enrich the feature set and provide more effective information for tegrated reconstructed data, satellite images, and ground observation
prediction tasks presents a viable approach. data, employing CNN-LSTM and random forest methods to simulate
PM2.5 concentrations across China.
2.2. Multi-modal fusion By summarizing and analyzing the content of the aforementioned
related work, we can identify several issues within the fusion methods
Multimodal data typically refers to data obtained through various for multimodal environmental data. First, when dealing with multi-
sensory channels or data sources, often representing different manifes- modal environmental data, it is crucial to account for the distinct char-
tations of the same object in distinct forms. This data usually includes acteristics of different modalities, as significant modality differences
various types of information such as images, text, audio, video, and often exist, making feature representation and fusion more challenging.
sequence data. Multimodal feature fusion techniques are widely applied Second, environmental data often contain substantial noise and redun-
in fields such as computer vision, emotion recognition, remote sensing dant information, which can negatively impact the effectiveness of fea-
semantic segmentation, and object detection. ture fusion. Moreover, multimodal environmental data typically exhibit

3
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Fig. 2. Overall framework diagram of proposed model.

spatiotemporal characteristics, requiring further refinement in the fea- 𝑆


ture extraction or fusion stages to better accommodate the unique data 𝑋𝑠𝑒𝑎 = 𝑋 𝑆 − 𝑋𝑡𝑟𝑒𝑛𝑑
𝑆
(2)
properties. To address these issues, our model takes into account the 𝐴𝑣𝑔𝑝𝑜𝑜𝑙() denotes average pooling, which smooths the input se-
feature differences across modalities which separates modality-specific quence by applying average pooling with a specified kernel size.
characteristics from shared ones through a decoupling mechanism. In 𝑃 𝑎𝑑𝑑𝑖𝑛𝑔() fills the input data with zeros to ensure that its dimensions
addition, a multi-scale feature extraction strategy was applied, which are a multiple of the kernel size.
allows the model to effectively capture useful information at vari- The trend component and the seasonal component exhibit distinct
ous spatial and temporal scales. Additionally, our model incorporates data distribution characteristics. The trend component captures long-
distinct feature extraction modules tailored to spatiotemporal data, term changes and typically has a relatively stable distribution. In
ensuring that it adapts to the spatiotemporal nature of the multimodal contrast, the seasonal component reflects fixed periodic fluctuations
data. and shows more frequent data variations. To model the more stable
long-term trend in the trend component, we employ a simple linear
3. Methodology layer.
𝑆 𝑆
3.1. Overview 𝐹𝑡𝑟𝑒𝑛𝑑 = 𝐿𝑖𝑛𝑒𝑎𝑟(𝑋𝑡𝑟𝑒𝑛𝑑 ) (3)

For the seasonal component, we first extract its primary periodic


The overall structure of the proposed model is illustrated in Fig. 2. In and frequency information based on the Fast Fourier Transform. Uti-
essence, STMFNet employs a dual-branch
[ ] architecture. For the input air lizing the periodicity, we transform the one-dimensional time series
𝑆
quality time-series
[ ] data 𝑋 𝑆 ∈ R 𝐵,𝐿 ,𝑁 and remote sensing image data data into two-dimensional data that encapsulates periodic information,
𝑅
𝑋 𝑅 ∈ R 𝐵,𝐿 ,𝐻,𝑊 , the time-series feature extraction module and the thereby fully capturing the seasonal component’s cyclical time series
remote sensing image feature extraction module separately extract the features across different temporal scales.
corresponding features 𝐹 𝑆 ∈ R[𝐵,𝑁,𝐷] and 𝐹 𝑅 ∈ R[𝐵,𝐶,𝐷] . Respectively,
𝑆
𝐿𝑆 and 𝐿𝑅 correspond to the data quantities of time-series data and 𝐴 = 𝐴𝑣𝑔(𝐴𝑚𝑝(𝐹 𝐹 𝑇 (𝑋𝑠𝑒𝑎 ))) (4)
remote sensing data under different sampling frequencies. 𝑁 represents
{ } 𝑇
the number of stations, 𝐻 and 𝑊 denote the height and width of the 𝑓1 , 𝑓2 , … , 𝑓𝑘 = 𝑎𝑟𝑔𝑇 𝑜𝑝𝑘(𝐴), 𝑝𝑖 = , 𝑖 ∈ {1, … , 𝑘} (5)
remote sensing images, 𝐶 signifies the number of image channels, and 𝑓𝑖
𝐷 represents the dimension of the universal features. Subsequently, 𝑖
( ( 𝑆 ))
𝑋𝑠𝑒𝑎 = 𝑅𝑒𝑠ℎ𝑎𝑝𝑒 𝑃 𝑎𝑑𝑑𝑖𝑛𝑔 𝑋𝑠𝑒𝑎 , 𝑓𝑖 , 𝑝𝑖 , 𝑖 ∈ {1, … , 𝑘} (6)
the shared-specific feature decoupling module is utilized to isolate the
modality-specific features 𝑀𝑠𝑝𝑒𝑐𝑆 ∈ R[𝐵,𝑁,𝐷] , 𝑀𝑠𝑝𝑒𝑐
𝑅 ∈ R[𝐵,𝐶,𝐷] , and the Here, 𝐴𝑚𝑝() is used to calculate the amplitude values 𝐴 of the
𝑆𝑅
modality-shared features 𝑀𝑠ℎ𝑎𝑟𝑒𝑑 ∈ R[𝐵,𝑆,𝐷] . 𝑆 represents the dimension various frequency domain components after the Fourier Transform. The
of the fused modality features. Finally, the hierarchical attention graph 𝐾 frequency domain components with the highest amplitude values
convolution module constructs cross-modal attention matrices from are then selected, and their corresponding frequencies 𝑓1 , 𝑓2 , … , 𝑓𝑘
specific modalities to shared modalities, facilitating the process of are determined. These frequencies are used to derive the respective
information supplementation and fusion. The shared modality features periods 𝑝1 , 𝑝2 , … , 𝑝𝑘 of the different frequency components, which are
𝑆𝑅
𝐻𝑆ℎ𝑎𝑟𝑒𝑑 , together with the cross-modal transmission features 𝐻𝑇𝑆𝑆 and then utilized to convert the original one-dimensional time series data
𝑟𝑎𝑛𝑠
𝐻𝑇𝑅𝑆 are concatenated and then passed to the regression layer to into two-dimensional data. Subsequently, we apply an adaptive graph
𝑟𝑎𝑛𝑠 [ ]
obtain the final predicted output 𝑌 𝑆 ∈ R 𝐵,𝐿 ,𝑁 .
𝑃 convolution method to model the spatial correlations between multiple
air-quality monitoring stations, with the formula described as follows:
( ( ( )𝑇 ))
3.2. Time series feature extraction module 𝐴𝑑𝑝𝑖 = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 𝑅𝑒𝐿𝑢 𝐸1𝑖 ⋅ 𝐸2𝑖 (7)

To fully capture the temporal characteristics of air-quality data, this ( )


𝐹 𝐺𝑖𝑠𝑒𝑎 = 𝜎 𝐴𝑑𝑝𝑖 ⋅ 𝑋𝑠𝑒𝑎
𝑖
⋅𝑊𝑖 (8)
paper, with careful consideration of its data distribution characteristics,
designs feature extraction components tailored to its data distribution. 𝑖
Here, 𝐴𝑑𝑝 represents the adaptive graph corresponding to the 𝑖th
Specifically, we first employ a sequence decomposition strategy, break- frequency domain component, 𝑋𝑠𝑒𝑎 𝑖 denotes the transformed frequency

ing down the original input time series data into a trend component 𝑖
domain input, and 𝑊 refers to the corresponding trainable parameters.
representing long-term changes and a seasonal component representing Subsequently, a multi-head attention (MHA) mechanism is em-
short-term fluctuations. The decomposition operation is described by ployed to model the global correlations within the time series data of
the following formula: a single period, as described by the following formula:
𝑆
( )
𝑋𝑡𝑟𝑒𝑛𝑑 = 𝐴𝑣𝑔𝑝𝑜𝑜𝑙(𝑃 𝑎𝑑𝑑𝑖𝑛𝑔(𝑋 𝑆 )) (1) 𝐹 𝐴𝑖𝑠𝑒𝑎 = 𝑀𝐻𝐴 𝐹 𝐺𝑖𝑠𝑒𝑎

4
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553
(𝐻 ( ( ) ))
⋃ 𝑄ℎ𝑠𝑒𝑎 ⋅ 𝐾𝑠𝑒𝑎


= 𝐿𝑖𝑛𝑒𝑎𝑟 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥 √ 𝑉𝑠𝑒𝑎 (9)
ℎ 𝑑𝑖𝑚
where 𝐻 represents the number of attention heads, ℎ represents the h-

th attention head, 𝐻 ℎ represents merging the features of 𝐻 attention
heads, dim denotes the general feature dimension, and 𝑄ℎ𝑠𝑒𝑎 , 𝐾𝑠𝑒𝑎
ℎ and
ℎ denote the mappings of 𝐹 𝐺𝑖
𝑉𝑠𝑒𝑎 to the Query, Key, and Value of
𝑠𝑒𝑎
attention mechanism respectively.
We then convert the time series data, which has been transformed
into a two-dimensional shape, back to its original one-dimensional
form. This is followed by reaggregating the data into the original sea-
sonal component sequence, using the amplitude values as weightings.

( )
𝑤̂ 1 , 𝑤̂ 2 , … , 𝑤̂ 𝑘 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥 𝐴𝑓1 , 𝐴𝑓2 , … , 𝐴𝑓𝑘 (10)


𝑘
𝑆
𝐹𝑠𝑒𝑎 = 𝑤̂ 𝑖 ⋅ 𝐹 𝐴𝑖𝑠𝑒𝑎 (11) Fig. 3. Structural diagram of the shared-specific modality decoupling module.
𝑖=1
𝑆
Finally, we obtain the trend component feature 𝐹𝑡𝑟𝑒𝑛𝑑 and the sea-
sonal component feature 𝐹𝑠𝑒𝑎𝑆 from the two separate time series feature
images capture the spatial variations in pollutant concentrations across
extraction branches for the trend and seasonal components, respec-
different regions at the same time and their temporal changes, while
tively. These are then summed to derive the final time series feature
ground observation data also reflect these variations. The data from dif-
𝐹 𝑆 , as follows:
ferent modalities reside in distinct feature spaces, each with its unique
𝐹 𝑆 = 𝐹𝑡𝑟𝑒𝑛𝑑
𝑆 𝑆
+ 𝐹𝑠𝑒𝑎 (12) representation: remote sensing images tend to emphasize long-term
global trends, whereas ground observation data focus on short-term
local changes. Accurately identifying and integrating the shared and
3.3. Remote-sensing image feature extraction module
specific features of different modalities is crucial for the joint analysis
and prediction of multimodal data, aiding in the enhancement of
Remote sensing images contain rich variation features on both
understanding and prediction accuracy of spatiotemporal changes in
global and local scales. Additionally, air-quality changes may exhibit
air-quality.
different patterns across various spatial scales; for instance, while pol-
To further clarify the definitions of multimodal shared features
lutant concentrations might show a decreasing trend globally, a specific
local area could display an increase. Therefore, extracting features and modality-specific features in air quality analysis, based on the
from remote sensing images at different scales to accommodate varying differences in how ground monitoring data and remote sensing images
trends across spatial scales is essential. This approach aids in accu- reflect air quality characteristics, we define these features as follows:
rately identifying changes in pollutant concentrations at different spa- Shared features refer to high-level abstract representations com-
tial scales, thereby enhancing the precision of pollutant concentration monly existing across multimodal data (e.g., remote sensing images
predictions. and ground monitoring data) that capture the essential patterns of
The multi-scale feature extraction module for remote sensing images air quality. These features originate from the physical consistency of
primarily consists of three components: the multi-scale convolution spatiotemporal variations in air quality, independent of data acquisition
component, the feature pyramid component, and the multi-scale fusion methods or representational forms. They embody dynamic correlations
component. The multi-scale convolution component, which comprises of pollutant concentrations over time and spatial distribution patterns.
convolutional blocks with different convolutional kernels, is responsi- Modality-specific features are unique representations within a single
ble for extracting image features at various scales from the original modality, arising from data acquisition principles or inherent resolution
remote sensing images. The formula is described as follows: constraints. These features reflect the observation perspectives and
( ) information granularity specific to a particular modality, requiring
{𝑐2, 𝑐3, 𝑐4, 𝑐5} = 𝑀𝑢𝑙𝑡𝑖𝑆𝑐𝑎𝑙𝑒𝐶𝑜𝑛𝑣 𝑋 𝑅 (13) extraction through modality-adapted modeling methods to supplement
The feature pyramid contains two feature pathways: a top-down path- refined information not covered by shared features.
way, which utilizes local features at smaller scales to complement We presents a Shared-Specific Modality Decoupling Module, which
global features at larger scales, and a same-layer feature transmission extracts shared and specific modality features from different modal
pathway used to calculate image features across different scales. The data in a similarity-constrained and data-driven manner. The structural
formula is described as follows: diagram of this module is shown in Fig. 3. Specifically, the module
( ) ( ) primarily separates shared and specific features based on a shared gat-
𝑝𝑖 = 𝑆𝑐𝑎𝑙𝑒𝐶𝑜𝑛𝑣5 𝑐𝑖 + 𝑈 𝑝𝑆𝑎𝑚𝑝𝑙𝑖𝑛𝑔 𝑝𝑖+1 , 𝑖 ∈ {2, 3, 4, 5} (14) ing mechanism and residual mechanism. The shared gating mechanism
The multi-scale fusion component is responsible for integrating the receives the corresponding modality features as input and adaptively
image features from different scales to obtain the final multi-scale computes the weight of the shared features within the input features
remote sensing image features. The formula is described as follows: in a data-driven manner. Furthermore, to guide the shared gating
mechanism in learning the shared features between two modalities,
𝐹 𝑅 = 𝑆𝑐𝑎𝑙𝑒𝐹 𝑢𝑠𝑖𝑜𝑛(𝑝2, 𝑝3, 𝑝4, 𝑝5) (15) we introduce cosine similarity as the loss function for this module,
constraining the shared gating mechanism to learn the shared features
of the two modal features. Finally, the extracted shared features are
3.4. Shared-specific modality decouple module
averaged and output as the shared modality features. The process is
Remote sensing images of air-quality and time series data belong mathematically described as follows:
to different modalities, with significant differences in their forms of 𝑆
( )
𝑀𝑆ℎ𝑎𝑟𝑒𝑑 = 𝑆ℎ𝑎𝑟𝑒𝑑𝐺𝑎𝑡𝑒𝑆 𝐹 𝑆
expression and feature scales, yet they share certain representational ( 𝑆 )
characteristics in terms of object features. For instance, remote sensing = 𝜎 𝑊𝐺𝑎𝑡𝑒 ⋅ 𝐹 𝑆 + 𝑏𝑆 ⋅ 𝐹 𝑆 (16)

5
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

𝑅
( ) ⎛ 𝑀 𝑆 𝑊𝑠𝑝𝑒𝑐
𝑄 𝑆 𝑊𝐾 ⎞
𝑀𝑆ℎ𝑎𝑟𝑒𝑑 = 𝑆ℎ𝑎𝑟𝑒𝑑𝐺𝑎𝑡𝑒𝑅 𝐹 𝑅 ⋅ 𝑀𝑆𝑝𝑒𝑐 𝑠𝑝𝑒𝑐
= 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 ⎜ ⎟ ∈ R[𝑁,
𝑆𝑝𝑒𝑐 𝑁]
( 𝑅 ) √
= 𝜎 𝑊𝐺𝑎𝑡𝑒 ⋅ 𝐹 𝑅 + 𝑏𝑅 ⋅ 𝐹 𝑅 (17) ⎜ 𝑑𝑖𝑚 ⎟
⎝ ⎠
( 𝑆 𝑅
) Here, 𝑆𝑒𝑙𝑓 𝑀𝑜𝑑𝑎𝑙𝐴𝑡𝑡() represents the calculation function of the self-
𝐿𝑜𝑠𝑠𝑆ℎ𝑎𝑟𝑒𝑑 = 1 − 𝐶𝑜𝑠𝑆𝑖𝑚 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 , 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝑄 𝐾 represent the weight matrices
modal attention matrix, 𝑊𝑠𝑝𝑒𝑐 and 𝑊𝑠𝑝𝑒𝑐
𝑆
𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝑅
⋅ 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝑆
that map 𝑀𝑆𝑝𝑒𝑐 to Query and Key, respectively. 𝑄𝑢𝑒𝑟𝑦𝑆 and 𝐾𝑒𝑦𝑆
=1− (18)
‖ 𝑆 ‖ ‖ 𝑅 ‖ represent the Query and Key for computing attention weights in the
‖𝑀𝑆ℎ𝑎𝑟𝑒𝑑 ‖ ⋅ ‖𝑀𝑆ℎ𝑎𝑟𝑒𝑑 ‖
‖ ‖ ‖ ‖ time-series modality, respectively. Dim denotes the general feature.
( ( 𝑆 )) The calculation methods for 𝐴𝑅 and 𝐴𝑆ℎ are the same as described
𝑆𝑅 𝑅
𝑀𝑆ℎ𝑎𝑟𝑒𝑑 = 𝐿𝑖𝑛𝑒𝑎𝑟 𝐴𝑣𝑔 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 , 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 (19) above. Here, 𝑁 represents the number of feature nodes in the time
series features, 𝐶 represents the number of feature channels in the
Here, 𝜎 represents the 𝑆𝑖𝑔𝑚𝑜𝑖𝑑 activation function, 𝐶𝑜𝑠𝑆𝑖𝑚() denotes
𝑅 𝑆 remote sensing image features, and 𝑆 represents the number of feature
cosine similarity, 𝑊𝐺𝑎𝑡𝑒 and 𝑊𝐺𝑎𝑡𝑒 represent the trainable parameters
nodes in the shared features.
of the shared gating mechanism, and 𝑏𝑆 and 𝑏𝑅 correspond to the reg-
For features between different levels, namely modality-shared fea-
ularization terms. Subsequently, the residual mechanism is employed
tures and modality-specific features, we utilize cross-modality attention
to obtain the corresponding specific modality features based on the
to calculate their feature correlations. This results in the correla-
extracted shared features, with the following calculation formula:
tion weight matrices between the modality-specific features and the
( ) ( 𝑆 )
𝑆
𝑀𝑆𝑝𝑒𝑐 = 𝑅𝑒𝑠 𝐹 𝑆 , 𝑀𝑆ℎ𝑎𝑟𝑒𝑑
𝑆
= 𝐹 𝑆 − 𝐿𝑖𝑛𝑒𝑎𝑟 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 (20) modality-shared features. Specifically, 𝐴𝑆𝑆 and 𝐴𝑅𝑆 represent the
correlation weights between the time series-specific features and the
𝑅
( ) ( 𝑅 ) modality-shared features, and between the remote sensing image-
𝑀𝑆𝑝𝑒𝑐 = 𝑅𝑒𝑠 𝐹 𝑅 , 𝑀𝑆ℎ𝑎𝑟𝑒𝑑
𝑅
= 𝐹 𝑅 − 𝐿𝑖𝑛𝑒𝑎𝑟 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 (21)
specific features and the modality-shared features, respectively.
Ultimately, through the decoupling module, we obtain the modality ( )
𝑆𝑅
shared feature 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝑆
as well as the modality-specific features 𝑀𝑆𝑝𝑒𝑐 𝐴𝑆𝑆 = 𝐶𝑟𝑜𝑠𝑠𝑀𝑜𝑑𝑎𝑙𝐴𝑡𝑡 𝑀𝑆𝑝𝑒𝑐 𝑆 𝑆𝑅
, 𝑀𝑆ℎ𝑎𝑟𝑒𝑑
𝑅 ( )
and 𝑀𝑆𝑝𝑒𝑐 from the time-series feature data and remote sensing image 𝑄𝑢𝑒𝑟𝑦𝑆 ⋅ 𝐾𝑒𝑦𝑆ℎ
data, which are then used as inputs for the hierarchical attention graph = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 √ (23)
𝑑𝑖𝑚
convolution module.
⎛ 𝑀 𝑆 𝑊𝑠𝑝𝑒𝑐
𝑄 𝑆𝑅
⋅ 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝐾
𝑊𝑠ℎ𝑎𝑟𝑒𝑑 ⎞
= 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 ⎜ ⎟ ∈ R[𝑁,𝑆]
𝑆𝑝𝑒𝑐

3.5. Hierarchical attention graph convolution module ⎜ 𝑑𝑖𝑚 ⎟
⎝ ⎠
Traditional multimodal feature fusion methods often achieve fused ( )
modality features by simply concatenating different modality features 𝐴𝑅𝑆 = 𝐶𝑟𝑜𝑠𝑠𝑀𝑜𝑑𝑎𝑙𝐴𝑡𝑡 𝑀𝑆𝑝𝑒𝑐𝑅 𝑆𝑅
, 𝑀𝑆ℎ𝑎𝑟𝑒𝑑
and assigning weights. However, this approach overlooks the fact that ( )
𝑄𝑢𝑒𝑟𝑦𝑅 ⋅ 𝐾𝑒𝑦𝑆ℎ
multimodal data have different forms of representation, and their = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 √ (24)
modality features also exhibit specific feature distributions. Directly 𝑑𝑖𝑚
concatenating features may lead to insufficient feature fusion and even ⎛ 𝑀 𝑅 𝑊𝑠𝑝𝑒𝑐
𝑄𝑢𝑒𝑟𝑦 𝑆𝑅
⋅ 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝐾𝑒𝑦 ⎞
𝑊𝑠ℎ𝑎𝑟𝑒𝑑
= 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 ⎜ ⎟ ∈ R[𝐶,𝑆]
𝑆𝑝𝑒𝑐
introduce additional bias into subsequent tasks. Therefore, designing a √
⎜ 𝑑𝑖𝑚 ⎟
fusion strategy that takes into account the distinct feature distributions ⎝ ⎠
of multimodal data not only allows for the full utilization of the unique Here, 𝐶𝑟𝑜𝑠𝑠𝑀𝑜𝑑𝑎𝑙𝐴𝑡𝑡() represents the calculation function of the cross-
information inherent in each modality but also reduces noise and bias 𝑄 𝐾
modal attention matrix. 𝑊𝑠𝑝𝑒𝑐 and 𝑊𝑠ℎ𝑎𝑟𝑒𝑑 represent the weight matri-
caused by inconsistencies, thereby enhancing the overall performance 𝑅
ces that map 𝑀𝑆𝑝𝑒𝑐 to Query and map 𝑀𝑆ℎ𝑎𝑟𝑒𝑑𝑆𝑅 to Key respectively,
of multimodal tasks. and dim denotes the general feature. Similarly, 𝑊𝑠𝑝𝑒𝑐 𝑄𝑢𝑒𝑟𝑦 𝐾𝑒𝑦
and 𝑊𝑠ℎ𝑎𝑟𝑒𝑑
Considering the different feature distributions of multimodal fea- 𝑅 𝑅
represent the weight matrices that map 𝑀𝑆𝑝𝑒𝑐 to 𝑄𝑢𝑒𝑟𝑦 and 𝑀𝑆ℎ𝑎𝑟𝑒𝑑 𝑆𝑅
tures, this paper divides modality-shared features and modality-specific
to 𝐾𝑒𝑦𝑆ℎ respectively.
features into two distinct levels. Within each level, self-modality atten-
Subsequently, we employ a hierarchical graph convolutional
tion is employed to compute the relevance weights between different
method to establish the process of feature propagation within intra-
feature nodes, capturing the correlations within the same level. Be-
layer features as well as between features of different levels. This
tween different levels, cross-modality attention is used to calculate
approach facilitates the refinement and supplementation of modality-
the relevance weights between the feature nodes of modality-shared
specific features into modality-shared features. The process primarily
features and modality-specific features. Subsequently, a graph convo-
consists of three parts: the intra-layer feature propagation of modality-
lutional network is utilized to propagate features and transmit infor- specific features based on self-modality correlations, the intra-layer
mation between nodes of different modalities, thereby supplementing feature propagation of modality-shared features also based on self-
and refining modality-specific features with information from modality- modality correlations, and the inter-layer feature propagation based on
shared features. This process further achieves deep interaction and cross-modality correlations. The intra-layer propagation process can be
fusion of multimodal features. Specifically, for the features within described by the following equations:
the same level, including the time series-specific feature representa- ( )
𝑆 𝑆
𝑆 , the remote sensing image-specific feature representation
tion 𝑀Spec 𝐻𝑠𝑝𝑒𝑐 = 𝐼𝑛𝑡𝑟𝑎𝐻𝑖𝑒𝑟𝐺𝐶𝑁 𝑀𝑆𝑝𝑒𝑐 , 𝐴𝑆
𝑅 𝑆𝑅 , the self-
𝑀Spec , and the shared modality feature representation 𝑀Shared

𝐾
modality attention mechanism is applied to calculate the relevance = 𝐴𝑆 ⋅ 𝑀𝑆𝑝𝑒𝑐
𝑆 𝑘
⋅ 𝑊𝑠𝑝𝑒𝑐 (25)
weight matrices between different feature nodes within each level. 𝑘=0
These relevance weight matrices are denoted as 𝐴𝑆 , 𝐴𝑅 , and 𝐴𝑆ℎ , 𝑆
Here, 𝐻𝑠𝑝𝑒𝑐 𝑅
and 𝐻𝑠𝑝𝑒𝑐 represent the hidden features of the time se-
respectively. ries and remote sensing image-specific modality features, respectively,
( ) 𝑆
𝐴𝑆 = 𝑆𝑒𝑙𝑓 𝑀𝑜𝑑𝑎𝑙𝐴𝑡𝑡 𝑀𝑆𝑝𝑒𝑐 𝑆 after intra-layer feature propagation. 𝐻𝑠ℎ𝑎𝑟𝑒𝑑 𝑅 represents the hidden
( ) features of the modality-shared features after intra-layer feature prop-
𝑄𝑢𝑒𝑟𝑦𝑆 ⋅ 𝐾𝑒𝑦𝑆 agation. The intra-layer feature propagation primarily assigns dynamic
= 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥 √ (22)
𝑑𝑖𝑚 weights to intra-layer modal features by constructing a self-correlation

6
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Table 1
Detailed information of the two datasets.
Datasets #Time span #Time interval #Time stamps #Stations
Beijing-2018 1/1/2018–1/1/2021 1 h 26 280 34
Tianjin-2014 5/1/2014–5/1/2015 1 h 8760 27

attention matrix of intra-modal features, thereby achieving feature


𝑆 ,
screening and filtering. For intra-layer propagation features of 𝐻𝑠𝑝𝑒𝑐
the original time-series-specific modality features are multiplied with
the self-correlation attention matrix, establishing a point-to-point fea-
ture transmission process within the layer through a graph convolution
approach. Additionally, the inter-layer feature propagation process can
be described by the following equations:
( ) Fig. 4. Geographical distribution map of PM2.5 monitoring stations.
𝐻𝑇𝑆𝑆 𝑆
𝑟𝑎𝑛𝑠 = 𝐶𝑟𝑜𝑠𝑠𝐻𝑖𝑒𝑟𝐺𝐶𝑁 𝐻𝑠𝑝𝑒𝑐 , 𝐴
𝑆𝑆


𝐾
= 𝐴𝑆𝑆 ⋅ 𝐻𝑠𝑝𝑒𝑐
𝑆 𝑘
⋅ 𝑊𝑡𝑟𝑎𝑛𝑠 (26)
𝑘=0

( )
𝐻𝑇𝑅𝑆 𝑅
𝑟𝑎𝑛𝑠 = 𝐶𝑟𝑜𝑠𝑠𝐻𝑖𝑒𝑟𝐺𝐶𝑁 𝐻𝑠𝑝𝑒𝑐 , 𝐴
𝑅𝑆


𝐾
= 𝐴𝑅𝑆 ⋅ 𝐻𝑠𝑝𝑒𝑐
𝑅 𝑘
⋅ 𝑊𝑡𝑟𝑎𝑛𝑠 (27)
𝑘=0

Here, 𝐻𝑇𝑆𝑆𝑟𝑎𝑛𝑠
and 𝐻𝑇𝑅𝑆
𝑟𝑎𝑛𝑠
represent the propagated features between the
time series-specific modality features and the modality-shared features,
and between the remote sensing image-specific modality features and
the modality-shared features, respectively. In contrast to the intra-
layer feature transmission process, the inter-layer feature transmission
weight matrix represents mapping weights between features at different
Fig. 5. Visualization of PM2.5 remote-sensing images.
hierarchical levels, while following the same computational principles
as graph convolution operations.
The correlation weight matrices between modality features of dif-
ferent levels represent the feature weights during the mapping of (TEOM) or the β-attenuation method, with uncertainties of ±1.5%
feature information from different feature spaces. By incorporating or 0.1 μg∕m3 , respectively [30]. The raw data were then preprocessed
cross-modality attention, the hierarchical graph convolution can map through steps including data imputation, outlier handling, standardiza-
features from different feature spaces into a unified feature space tion, and data splitting, ultimately resulting in the ground monitoring
based on these cross-modality weights. Consequently, the resulting station air-quality observation data.
propagated features and modality-shared features reside in a unified
feature space. After concatenating them and passing through the output 4.1.2. Remote sensing images
mapping layer, the final prediction output 𝑌 𝑆 can be obtained. This The remote sensing image data used in this study were sourced from
process can be mathematically described as follows: the CHAP dataset (Wei et al., 2021), a comprehensive, long-term, na-
( ( )) tionwide, high-resolution, and high-quality ground air pollutant dataset
𝑌 𝑆 = 𝑂𝑢𝑡𝑝𝑢𝑡 𝑐𝑜𝑛𝑐𝑎𝑡 𝐻𝑇𝑆𝑆 𝑅𝑆 𝑆𝑅
𝑟𝑎𝑛𝑠 , 𝐻𝑇 𝑟𝑎𝑛𝑠 , 𝐻𝑆ℎ𝑎𝑟𝑒𝑑
( 𝑆𝑆 ) for China. This dataset primarily includes seven major air pollutants,
= 𝐿𝑖𝑛𝑒𝑎𝑟 𝐻𝑇 𝑟𝑎𝑛𝑠 ∥ 𝐻𝑇𝑅𝑆 𝑆𝑅
𝑟𝑎𝑛𝑠 ∥ 𝐻𝑆ℎ𝑎𝑟𝑒𝑑 (28) and for this study, PM2.5 remote sensing images were selected as the
data source. The temporal resolution is daily, and the spatial resolution
is 1 km. The original remote sensing data were cropped based on
4. Experiments results and discussion
geographic boundary and latitude-longitude information to obtain the
corresponding PM2.5 remote sensing retrieval images for the study area.
4.1. Datasets
Remote sensing image samples are as shown in Fig. 5.
The research area of this study encompasses Beijing and Tianjin,
with the experimental datasets comprising observed PM2.5 concen- 4.2. Baselines
trations from ground pollutant monitoring stations, along with high-
resolution remotely sensed PM2.5 concentration images and correspond- 1. LSTM (Rao et al., 2019): Incorporates memory gating to address
ing geographic auxiliary information for the study area. the vanishing gradient problem in traditional recurrent neural
networks, and is widely applied in time series prediction tasks.
4.1.1. Ground observation data 2. STGCN (Yu et al., 2017): Proposes a spatiotemporal convolu-
The hourly PM2.5 concentration data were sourced from the China tional network that combines graph neural networks to model
National Environmental Monitoring Center. Currently, nearly 1800 spatial correlations.
monitoring stations have been established across mainland China. For 3. Transformer (Vaswani, 2017): Introduces a multi-head atten-
our study, we selected observational data from 34 and 27 monitoring tion mechanism to model global correlations, widely used in
stations within the research area for training and validation purposes. sequence modeling tasks.
The geographical distribution of the monitoring stations is shown in 4. Graph Wavenet (Wu et al., 2019): Integrates adaptive graph
Fig. 4, and detailed information can be found in Table 1. The PM2.5 convolution with temporal convolutional networks to adaptively
data were measured using the tapered element oscillating microbalance infer spatial correlations.

7
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Table 2
Detailed hyperparameter settings of baseline model.
Model Hyperparameter
LearningRate WindowsSize Dims GraphOrder AttentionHeads DropOutRate
LSTM 1E−04 96 64 \ \ 0.2
STGCN 1E−04 96 96 2 \ 0.2
Transformer 1E−04 96 128 \ 8 0.2
GraphWavenet 1E−04 96 96 2 \ 0.2
Informer 1E−04 96 256 \ 8 0.2
DLinear 1E−04 96 256 \ \ \
Timesnet 1E−04 96 128 2 8 0.3
Res-GCN 1E−04 96 128 2 8 0.3
STMFNet 1E−04 96 96 2 8 0.3

5. Informer (Zhou et al., 2021): Enhances the original Transformer Table 3


The single-step prediction performance of STMFNet and other baselines on two datasets,
by incorporating Fourier transforms to compute frequency do-
the best performance is highlighted in bold, while the second-best performance is
main attention weights, thereby reducing computational com- underlined.
plexity. Model Datasets
6. DLinear (Zeng et al., 2023): A simple yet effective linear model
Beijing Tianjin
that incorporates decomposition mechanisms.
MAE RMSE MAPE MAE RMSE MAPE
7. TimesNet (Wu et al., 2022): Transforms raw data into periodic
LSTM 7.09 10.12 0.31 14.65 21.84 0.34
data using Fourier transforms and extracts periodic temporal
STGCN 6.18 9.05 0.27 12.98 19.84 0.31
features based on convolutional networks. Transformer 6.05 9.12 0.29 13.15 20.14 0.32
8. Res-GCN (Xia et al., 2024): A multimodal air quality prediction GraphWavenet 5.74 8.64 0.26 12.51 18.96 0.26
model combining residual networks and GCN. Informer 5.32 8.23 0.29 11.92 19.26 0.31
DLinear 4.84 8.18 0.3 10.43 17.62 0.24
All experiments were implemented using the Pytorch (Paszke et al., Timesnet 5.48 8.61 0.31 11.32 18.42 0.24
Res-GCN 5.3 7.8 0.24 11.44 19.02 0.24
2019) 1.12.1 framework in a Python 3.9 environment and were con- STMFNet 4.27 6.93 0.24 10.12 16.66 0.23
ducted on a server running the Ubuntu 22.04 operating system
equipped with an Nvidia RTX 4080 GPU. To ensure a fair comparison
between models, we standardized the input and output data sizes
across all models. Each model was provided with data of the same Models based on GNNs (STGCN, Graph Wavenet), which further
dimensions during evaluation. Additionally, all other hyperparameters capture spatial dependencies, outperform traditional sequence models
were appropriately tuned based on the original settings provided in the (LSTM, Transformer, etc.) in predictive performance. This confirms the
papers. The proposed model in this study was trained using the Adam effectiveness of capturing spatial dependencies in deeply extracting
optimizer (Kingma, 2014) with a learning rate of 1e−4. Additional time series features and improving prediction accuracy. Additionally,
detailed parameters of the proposed model are presented in Table 2. DLinear, which utilizes a seasonal-trend decomposition mechanism,
achieves SOTA performance in most prediction tasks using only lin-
ear layers, demonstrating that decomposition methods contribute to a
deeper understanding of the composition of time series. Furthermore,
4.3. Evaluation metrics
methods incorporating Fourier transforms (Informer, TimesNet, etc.)
achieve strong predictive performance by leveraging frequency domain
In this paper, we use MAE, RMSE, and MAPE as evaluation metrics
weight information or periodic information.
for model comparison. The formulas for each evaluation metric are as
The proposed method achieved the best predictive performance
follows:
across various prediction tasks on two datasets. STMFNet effectively
1 ∑| 𝑆 |
𝑀𝐴𝐸 = |𝑦 − 𝑦𝑖 | (29) extracts both global and local spatial feature information by incorpo-
N 𝑖∈N | 𝑖 |
rating remote sensing image data. For time series data, it integrates a

1 ∑( 𝑆 )2 seasonal-trend decomposition method to gain a deeper understanding
𝑅𝑀𝑆𝐸 = 𝑦 − 𝑦𝑖 (30) of the composition mechanisms of air-quality time series. Based on the
N 𝑖∈N 𝑖
data distribution characteristics of different components, specialized
1 ∑ || 𝑦𝑖 − 𝑦𝑖 ||
𝑆
feature extraction modules are designed. For the highly fluctuating,
𝑀𝐴𝑃 𝐸 = | | (31)
N 𝑖∈N || 𝑦𝑖 || strongly periodic seasonal components, the method leverages Fourier
transforms to fully extract time series features by incorporating periodic
Here, 𝑦𝑆𝑖 and 𝑦𝑖 represent the predicted values and the true values, features from the frequency domain. Additionally, considering the chal-
respectively, for a total of 𝑁 data points. lenges posed by integrating multimodal data, this study introduces a
multimodal feature fusion method based on Shared-Specific modal fea-
4.4. Main results ture decoupling. This approach effectively addresses the feature space
mismatch problem between different modalities, enabling effective fu-
We compared the predictive performance of STMFNet with several sion of multimodal information, thereby achieving complementary and
baseline models on two real-world datasets. Tables 3 and 4 provide comprehensive feature information and further enhancing predictive
the MAE, RMSE, and MAPE metrics for single-step and multi-step performance.
prediction tasks, with the best results highlighted in bold. The experi-
mental results demonstrate that STMFNet consistently outperforms all 4.5. Ablation analysis
baseline models in both single-step and multi-step prediction tasks.
Specifically, on the two datasets, STMFNet achieved an average im- 4.5.1. Ablation analysis on STMFNet
provement in MAE performance of 5.74% and 5.49% compared to the In this section, we compare STMFNet with its variants across two
best-performing state-of-the-art models. datasets and conduct ablation experiments to understand how different

8
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Table 4
The multistep prediction performance of STIAGNN and other baselines on two datasets, the best performance is highlighted in bold, while the
second-best performance is underlined.
Datasets Models 12 h 24 h 36 h 48 h
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
LSTM 14.51 20.98 0.76 17.65 25.41 1.19 19.87 27.54 1.49 22.65 29.42 1.63
STGCN 13.06 17.84 0.74 15.94 22.79 1.14 17.64 25.69 1.37 20.48 26.98 1.46
Transformer 12.74 19.81 0.79 16.32 24.96 1.14 18.03 26.24 1.35 19.29 26.87 1.49
GraphWavenet 12.57 18.55 0.76 15.69 23.41 1.16 17.81 25.96 1.3 20.16 27.84 1.39
Beijing Informer 12.65 20.59 0.73 16.26 25.01 1.18 18.15 26.69 1.38 19.24 29.67 1.31
DLinear 12.64 20.39 0.92 15.81 24.19 1.26 17.61 25.85 1.48 18.6 26.9 1.55
Timesnet 12.4 19.95 0.82 16.07 24.18 1.21 18.2 27.12 1.33 19.19 28.36 1.41
Res-GCN 11.89 18.97 0.79 15.3 23.26 1.08 17.73 26.59 1.29 19.27 28.11 1.43
STMFNet 11.62 18.22 0.68 15.32 23.22 1.12 17.11 25.22 1.39 18.11 26.43 1.41
LSTM 24.91 34.73 0.67 29.68 38.53 0.86 34.28 41.98 1.19 38.36 47.69 1.35
STGCN 24.16 31.89 0.6 27.64 37.16 0.83 31.03 40.2 1.15 36.74 46.98 1.28
Transformer 24.09 32.76 0.58 29.68 38.69 0.86 32.59 41.56 0.91 33.56 44.16 0.94
GraphWavenet 23.15 35.36 0.56 27.49 37.65 0.76 30.65 40.89 0.95 34.52 45.26 1.09
Tianjin Informer 23.99 33.42 0.56 27.99 38.02 0.78 31.22 41.54 0.88 33.37 44.03 0.93
DLinear 21.66 30.97 0.51 30.08 40.44 0.91 32.06 41.94 1.01 33.63 43.17 1.07
Timesnet 22.02 32.12 0.48 27.54 37.46 0.76 30.19 40.25 0.86 32.56 42.17 0.92
Res-GCN 21.78 30.49 0.52 27.08 37.71 0.72 29.68 38.43 0.76 30.15 41.69 0.87
STMFNet 21.24 29.81 0.51 26.52 36.44 0.64 27.85 37.95 0.69 28.94 39.14 0.72

Fig. 6. Results of ablation analysis on STMFNet (a) Beijing. (b) Tianjin.

components affect the model’s performance and to evaluate the impor- 4.5.2. Ablation analysis on time feature extraction module
tance of each component. The detailed descriptions of the variants are To investigate the effectiveness of each component and mechanism
as follows: within the time series feature extraction module, we performed abla-
tion studies on the various components of the time feature extraction
1. w/o MMF: Multimodal Feature Ablation: This variant retains module to create different variants. The descriptions of each variant are
only the time series features, inputting them into the output as follows:
mapping module to obtain the prediction output.
2. w/o FFM: Feature Fusion Module Ablation: This variant removes 1. w/o STD: Seasonal-Trend Decomposition Ablation: This variant
the feature fusion module, and the time series features are removes the seasonal-trend decomposition operation, directly
concatenated with the remote sensing image features to generate inputting the raw time series data into the seasonal feature
the prediction output. extraction module to obtain the final time series features.
2. w/o MSR: Multi-Scale Reshape Ablation: This variant elimi-
3. w/o DM: Decoupling Module Ablation: This variant eliminates
nates the multi-scale reshape operation combined with Fourier
the decoupling module, preserving the original multimodal fea-
transforms, instead applying adaptive graph convolution and
tures.
multi-head attention mechanisms on the raw data.
4. w/o HAGCN: Hierarchical Attention Graph Convolutional Net-
3. w/o AGCN: Adaptive Graph Convolution Ablation: This variant
work Module Ablation: This variant removes the hierarchical
removes the adaptive graph convolution operation.
attention graph convolutional network module, concatenating
4. w/o MHA: Multi-Head Attention Ablation: This variant removes
the decoupled features before inputting them into the output
the multi-head attention operation.
mapping layer to obtain the final prediction output.
As shown in Fig. 7, removing any component from the time series
As shown in Fig. 6, removing any component from the model feature extraction module results in a performance decline. The w/o
results in a performance decline. Notably, the variant w/o MMF ex- STD variant confirms that the seasonal-trend decomposition method
hibits a significant performance drop on both datasets, confirming helps explain the composition of sequence data, thereby improving the
that the inclusion of multimodal information greatly enhances model accuracy of time series prediction. The w/o MSR variant demonstrates
performance. Additionally, the w/o FFM variant demonstrates that that the reshape operation based on periodic data is effective in ex-
the addition of the multimodal feature fusion module enables effec- tracting periodic features. Additionally, the w/o AGCN variant confirms
tive integration of multimodal features, further improving predictive that adaptive graph convolution effectively captures spatial correlations
performance. The results of the w/o DM and w/o HAGCN variants between stations, while the w/o MHA variant verifies the effectiveness
also validate the effectiveness of the decoupling mechanism and the of the multi-head attention mechanism in capturing global correlations
hierarchical attention graph convolutional network, respectively. within the periodic data.

9
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Fig. 7. Results of ablation analysis on time feature extraction module (a) Beijing. (b) Tianjin.

4.6. Parameter sensitivity analysis 4.7.2. Multi-station prediction visualization


In Fig. 10, we visualized the prediction data for all stations from
This section further analyzes the parameter sensitivity of STMFNet both datasets and conducted a comparative analysis. By observing the
by selecting three hyperparameters: the general feature dimension similarity between the visualizations of the predicted and actual values
(#𝐷), the order of the graph convolution (#𝐾), and the size of the in the left and right images, it is evident that our model achieves strong
input time window (#𝑊 ). The MAE and RMSE values for single-step predictive performance across all stations.
prediction under various settings of these hyperparameters on two
datasets are visualized, providing an intuitive understanding of how 4.7.3. Regional prediction visualization
different hyperparameters affect model performance. Specifically, the To evaluate the regional prediction performance of the model, we
general feature dimension was set to [32, 64, 96, 128], the graph aggregated the data from each monitoring station by administrative
convolution order was set to [1, 2, 3, 4], and the input time window region to obtain regional average predictions. These were then com-
size was set to [24, 48, 96, 120]. The prediction error comparison under pared with actual observations and remote sensing images, as shown in
different parameter settings for the single-step prediction task is shown Figs. 11 and 12. This comparison clearly demonstrates that the model is
in the figure. capable of accurately predicting high-pollution areas and performs well
From Fig. 8, it can be observed that on the Beijing dataset, for the across regions with varying pollution levels. Additionally, we further
general feature dimension (#𝐷), as the feature dimension gradually discuss the impact of monitoring station distribution and scale on air
increases, the prediction error decreases and reaches optimal perfor- quality prediction, based on the quantity and spatial arrangement of
mance at 𝐷=96. However, when 𝐷 continues to increase, the model stations within each administrative region.
complexity increases, making it difficult to train, which leads to an In the case of Beijing, as shown in Fig. 11, the model accurately
increase in prediction error. For the graph convolution order (#𝐾), the captures high pollution concentrations in the central and south-central
best performance is achieved at 𝐾=2. When 𝐾=1, the model cannot areas, aligning well with both ground observations and remote sens-
acquire sufficient effective information from neighboring feature nodes, ing data. This suggests the model effectively learns spatial pollution
resulting in poor performance. When 𝐾=3 or 4, the graph convolution patterns driven by urban density and industrial activity. Minor spatial
order becomes too large, causing node features to become overly deviations are likely due to uneven station distribution, yet the overall
smooth, which leads to the inability to extract important feature infor- prediction remains robust, demonstrating the model’s ability to gen-
mation from key nodes. As the input time window size (#𝑊 ) gradually eralize pollutant distribution using limited direct observations. Under
increases, the prediction error also gradually decreases, achieving the low-pollution conditions (Fig. 11b), the model also performs reliably,
best performance at 𝑊 =96. This can be understood as the increasing particularly along regional edges, further confirming its stability across
time window size providing more time feature information in the input. varying pollution levels.
However, it also introduces more noise, which leads to an increase in In the case of Tianjin, as illustrated in Fig. 12, the model identi-
prediction error at 𝑊 =120. fies pollution hotspots in both the northern urban core and southern
The changes in prediction error under different settings of the three coastal industrial zones, which is consistent with known sources such
hyperparameters on the Tianjin dataset are broadly similar to those as traffic congestion and coastal manufacturing. These results indicate
on the Beijing dataset. On the Tianjin dataset, the best performance the model’s capacity to reflect complex pollution distributions shaped
is achieved at 𝐷=64 and 𝑊 =48, which the study attributes to the by urbanization and industrial structure. Its ability to align with both
influence of the dataset size and data distribution. observations and remote sensing data underlines its potential for re-
gional air quality assessment and targeted pollution control, especially
in spatially heterogeneous urban-industrial environments.
4.7. Visualization
In addition, we analyzed the influence of monitoring station distri-
bution and density within each administrative region on the model’s
To more intuitively demonstrate the performance of our model in prediction performance. Regions with a higher density of evenly dis-
actual predictions, we visualized the prediction results at three different tributed stations, such as central Beijing, generally yielded more ac-
levels: single-station prediction visualization, multi-station prediction curate predictions due to the availability of sufficient observational
visualization, and regional prediction visualization. data for model calibration. Conversely, in areas with sparse or uneven
station distribution, slight spatial deviations were observed between
4.7.1. Single-station prediction visualization predictions and actual pollutant concentrations. This indicates that
In Fig. 9, we randomly selected several stations from two datasets while the model is capable of generalizing patterns using learned spatial
and visualized the predicted and actual values. From the images, it can features, its performance can be affected by data sparsity, especially in
be observed that our model accurately captures the changing trends of capturing fine-grained local variations. Therefore, both the number and
air-quality time series data and provides precise predictions. Moreover, spatial coverage of monitoring stations play a crucial role in enhancing
it demonstrates strong predictive performance even in the face of model reliability, highlighting the importance of optimizing station
frequent short-term fluctuations and sudden changes. layout for improving regional air quality forecasts.

10
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Fig. 8. Results of parameter analysis on STMFNet.

Fig. 9. Visualization of prediction results on single station.

5. Conclusion
Fig. 10. Visualization of prediction results on multi-stations.
This paper proposes a novel deep learning network, STMFNet, for
multimodal air-quality prediction. The network primarily consists of
three modules: the time series feature extraction module, the remote
sensing image feature extraction module, and the multimodal feature with specific modal feature information, enabling deep fusion and
fusion module. Specifically, the time series feature extraction module complementarity of multimodal information, which ultimately yields
utilizes a decomposition strategy to transform the raw sequence into the final prediction results. Extensive experiments on two real-world
seasonal and trend components, extracting long-term trend and peri- datasets have confirmed the effectiveness of STMFNet, demonstrating
odic fluctuation features. The remote sensing image feature extraction its superiority over other baseline models and proving the effectiveness
module captures air-quality variation features at different scales and of incorporating multimodal data for air-quality prediction. However,
integrates these features to achieve both global and local scale vari- the current research is limited to station-level air-quality prediction
ation feature extraction. The multimodal feature fusion module first within urban areas. In the future, we plan to expand the study area
employs a decoupling strategy to decompose the original multimodal to encompass provincial and even national regions, extending station-
features into shared and specific features. It then utilizes a hierarchical level air-quality prediction to pixel-level precision air quality prediction
attention graph convolutional network to supplement the shared modal across the broader study area.

11
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Declaration of competing interest

The authors declare that they have no known competing finan-


cial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Founda-


tion of China under Grant 61803214; the Zhejiang Provincial Natural
Science Foundation of China under Grant LY21F030004; the Natural
Science Foundation of Ningbo, China under Grant 2019A610451; the
K.C. Wong Magna Fund in Ningbo University, China.

Data availability

I have shared the link to my code and data in my article.

Fig. 11. Visualization of prediction results in the region of Beijing.


References

Bai, L., Yao, L., Li, C., Wang, X., Wang, C., 2020. Adaptive graph convolutional
recurrent network for traffic forecasting. Adv. Neural Inf. Process. Syst. 33,
17804–17815.
Cao, W., Qi, W., Lu, P., 2024. Air quality prediction based on time series decomposition
and convolutional sparse self-attention mechanism transformer model. IEEE Access.
Chadalavada, S., Faust, O., Salvi, M., Seoni, S., Raj, N., Raghavendra, U., Gudigar, A.,
Barua, P.D., Molinari, F., Acharya, R., 2024. Application of artificial intelligence
in air pollution monitoring and forecasting: A systematic review. Environ. Model.
Softw. 106312.
Chen, H., Guan, M., Li, H., 2021. Air quality prediction based on integrated dual LSTM
model. IEEE Access 9, 93285–93297.
Chen, X., Hu, Y., Dong, F., Chen, K., Xia, H., 2024a. A multi-graph spatial-temporal
attention network for air-quality prediction. Process. Saf. Environ. Prot. 181,
442–451.
Chen, X., Hu, Y., Liu, C., Chen, A., Chi, Z., 2025a. Dynamic spatio-temporal graph
network based on multi-level feature interaction for sinter TFe prediction. J.
Process Control 148, 103401. [Link]
URL [Link]
Chen, X., Liu, C., Xia, H., Chi, Z., 2025b. Burn-through point prediction and control
based on multi-cycle dynamic spatio-temporal feature extraction. Control Eng.
Pract. 154, 106165. [Link] URL
[Link]
Chen, X., Xia, H., Wu, M., Hu, Y., Wang, Z., 2024b. Spatiotemporal hierarchical transmit
neural network for regional-level air-quality prediction. Knowl.-Based Syst. 289,
111555.
Chi, Z., Chen, X., Xia, H., Liu, C., Wang, Z., 2024. An adaptive control system
based on spatial–temporal graph convolutional and disentangled baseline-volatility
prediction of bellows temperature for iron ore sintering process. J. Process Con-
Fig. 12. Visualization of prediction results in the region of Tianjin. trol 140, 103254. [Link] URL https:
//[Link]/science/article/pii/S0959152424000945.
Fang, M., Peng, S., Liang, Y., Hung, C., Liu, S., 2023. A multimodal fusion model with
multi-level attention mechanism for depression detection. Biomed. Signal Process.
CRediT authorship contribution statement Control. 82, 104561.
Gass, K., Klein, M., Chang, H.H., Flanders, W.D., Strickland, M.J., 2014. Classification
Xiaoxia Chen: Writing – review & editing, Validation, Supervision, and regression trees for epidemiologic research: An air pollution example. Environ.
Resources, Investigation, Funding acquisition, Formal analysis, Concep- Heal. 13, 1–10.
Gu, K., Qiao, J., Lin, W., 2018. Recurrent air quality predictor based on
tualization. Zhen Wang: Writing – review & editing, Writing – original
meteorology-and pollution-related factors. IEEE Trans. Ind. Informatics 14 (9),
draft, Validation, Software, Project administration, Methodology, Inves- 3946–3955.
tigation, Data curation. Fangyan Dong: Writing – review & editing, Gu, Y., Zhao, Y., Zhou, J., Li, H., Wang, Y., 2021. A fuzzy multiple linear regression
Validation, Supervision. Kaoru Hirota: Writing – review & editing, model based on meteorological factors for air quality index forecast. J. Intell. Fuzzy
Validation, Supervision. Systems 40 (6), 10523–10547.
Guo, S., Lin, Y., Feng, N., Song, C., Wan, H., 2019. Attention based spatial-temporal
Software and data availability graph convolutional networks for traffic flow forecasting. In: Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 33, pp. 922–929.
Hasnain, A., Sheng, Y., Hashmi, M.Z., Bhatti, U.A., Ahmed, Z., Zha, Y., 2023. Assessing
• Name of software: STMFNet for multimodal air-quality prediction the ambient air quality patterns associated to the COVID-19 outbreak in the yangtze
• Developer: Zhen Wang River Delta: A random forest approach. Chemosphere 314, 137638.
• Contact: chenxiaoxia@[Link] He, J., Yu, Y., Liu, N., Zhao, S., 2013. Numerical model-based relationship between
• Data first available: March 24, 2025 meteorological conditions and air quality and its implication for urban air quality
• Program language: Python management. Int. J. Environ. Pollut. 53 (3–4), 265–286.
Hu, J., Zhou, R., Ding, R., Ye, D., Su, Y., 2023. Effect of PM2.5 air pollution on the
• Source code at: [Link]
global burden of lower respiratory infections, 1990–2019: A systematic analysis
• Original Ground Monitoring Data: [Link] from the global burden of disease study 2019. J. Hazard. Mater. 459, 132215.
• Original Remote Sensing Data: [Link] [Link] URL [Link]
[Link] com/science/article/pii/S030438942301498X.

12
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553

Kingma, D.P., 2014. Adam: A method for stochastic optimization. arXiv preprint Wang, H., Shao, S., 2022. Prediction of PM2.5 in Hefei based on a hybrid CNN-GRU
arXiv:1412.6980. model. In: 2022 5th International Conference on Data Science and Information
Kumar, U., Jain, V., 2010. ARIMA forecasting of ambient air pollutants (O3, NO, NO2 Technology. DSIT, IEEE, pp. 1–6.
and CO). Stoch. Environ. Res. Risk Assess. 24, 751–760. Wang, C., Zhu, Y., Zang, T., Liu, H., Yu, J., 2021a. Modeling inter-station relationships
Lei, C., Xu, X., Ma, Y., Jin, S., Liu, B., Gong, W., 2022. Full coverage estimation of with attentive temporal graph convolutional network for air quality prediction. In:
the PM concentration across china based on an adaptive spatiotemporal approach. Proceedings of the 14th ACM International Conference on Web Search and Data
IEEE Trans. Geosci. Remote Sens. 60, 1–14. Mining. pp. 616–634.
Lelieveld, J., Haines, A., Burnett, R., Tonne, C., Klingmüller, K., Münzel, T., Pozzer, A., Wang, C., Zhu, Y., Zang, T., Liu, H., Yu, J., 2021b. Modeling inter-station relationships
2023. Air pollution deaths attributable to fossil fuels: Observational and modelling with attentive temporal graph convolutional network for air quality prediction. In:
study. BMJ 383, [Link] arXiv:[Link] Proceedings of the 14th ACM International Conference on Web Search and Data
[Link]/content/383/[Link]. Mining. pp. 616–634.
Li, J., Hua, C., Ma, L., Chen, K., Zheng, F., Chen, Q., Bao, X., Sun, J., Xie, R., Bianchi, F., Wei, J., Li, Z., Lyapustin, A., Sun, L., Peng, Y., Xue, W., Su, T., Cribb, M., 2021.
Kerminen, V., Petäjä, T., Kulmala, M., Liu, Y., 2024. Key drivers of the oxidative po- Reconstructing 1-km-resolution high-quality PM2.5 data records from 2000 to 2018
tential of PM2.5 in Beijing in the context of air quality improvement from 2018 to in China: Spatiotemporal variations and policy implications. Remote Sens. Environ.
2022. Environ. Int. 187, 108724. [Link] 252, 112136.
URL [Link] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M., 2022. Timesnet: Temporal
Li, X., Wang, C., Tan, J., Zeng, X., Ou, D., Ou, D., Zheng, B., 2020. Adversarial multi- 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.
modal representation learning for click-through rate prediction. In: Proceedings of 02186, arXivpreprint.
the Web Conference 2020. pp. 827–836. Wu, Z., Pan, S., Long, G., Jiang, J., Zhang, C., 2019. Graph wavenet for deep
Li, Y., Yu, R., Shahabi, C., Liu, Y., 2017. Diffusion convolutional recurrent neural spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926. Xia, H., Chen, X., Chen, B., Hu, Y., 2025. Dynamic synchronous graph transformer
Liu, B., Binaykia, A., Chang, P., Tiwari, M.K., Tsao, C., 2017. Urban air quality fore- network for region-level air-quality forecasting. Neurocomputing 616, 128924.
casting based on multi-dimensional collaborative support vector regression(SVR): A [Link] URL [Link]
case study of Beijing-tianjin-shijiazhuang. PLOS ONE 12 (7), 1–17. [Link] com/science/article/pii/S0925231224016953.
org/10.1371/[Link].0179763. Xia, H., Chen, X., Wang, Z., Chen, X., Dong, F., 2024. A multi-modal deep-
Liu, F., Chen, J., Tan, W., Cai, C., 2021. A multi-modal fusion method based on learning air quality prediction method based on multi-station time-series data and
higher-order orthogonal iteration decomposition. Entropy 23 (10), 1349. remote-sensing images: Case study of Beijing and Tianjin. Entropy 26 (1).
Liu, Y., Wang, P., Li, Y., Wen, L., Deng, X., 2022. Air quality prediction models based Yu, Z., Sun, Z., Liu, L., Li, C., Zhang, X., Amat, G., Ran, M., Hu, X., Xu, Y.,
on meteorological factors and real-time data of industrial waste gas. Sci. Rep. 12 Zhao, X., Zhou, J., 2024. Environmental surveillance in jinan city of East China
(1), 9253. (2014–2022) reveals improved air quality but remained health risks attributable
Ma, J., Yu, Z., Qu, Y., Xu, J., Cao, Y., 2020. Application of the xgboost machine to PM2.5-bound metal contaminants. Environ. Pollut. 343, 123275. [Link]
learning method in PM2.5 prediction: A case study of Shanghai. Aerosol Air org/10.1016/[Link].2023.123275, URL [Link]
Qual. Res. 20 (1), 128–138. [Link] URL article/pii/S0269749123022777.
[Link] Yu, B., Yin, H., Zhu, Z., 2017. Spatio-temporal graph convolutional networks: A deep
Ma, X., Zhang, X., Pun, M.-O., 2022. A crossmodal multiscale fusion network for learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
semantic segmentation of remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Yu, B., Yin, H., Zhu, Z., 2018. Spatio-temporal graph convolutional networks: A deep
Remote. Sens. 15, 3463–3474. [Link] learning framework for traffic forecasting. In: Proceedings of the Twenty-Seventh
Ma, X., Zhang, X., Pun, M.-O., Liu, M., 2024. A multilevel multimodal fusion International Joint Conference on Artificial Intelligence. IJCAI-18, International
transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Joint Conferences on Artificial Intelligence Organization, pp. 3634–3640. http:
Sens. 62, 1–15. [Link] //[Link]/10.24963/ijcai.2018/505.
Maciąg, P.S., Bembenik, R., Piekarzewicz, A., Del Ser, J., Lobo, J.L., Kasabov, N.K., Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., Cambria, E., Morency, L.-P., 2018.
2023. Effective air pollution prediction by combining time series decomposition Memory fusion network for multi-view sequential learning. In: Proceedings of the
with stacking and bagging ensembles of evolving spiking neural networks. Environ. AAAI Conference on Artificial Intelligence, vol. 32.
Model. Softw. 170, 105851. Zeng, A., Chen, M., Zhang, L., Xu, Q., 2023. Are transformers effective for time series
Naveen, S., Upamanyu, M., Chakki, K., Chandan, M., Hariprasad, P., 2023. Air quality forecasting? In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.
prediction based on decision tree using machine learning. In: 2023 International 37, pp. 11121–11128.
Conference on Smart Systems for Applications in Electrical Sciences. ICSSES, IEEE, Zhang, B., Chen, W., Li, M.-Z., Guo, X., Zheng, Z., Yang, R., 2024. Mgatt-LSTM: A
pp. 1–6. multi-scale spatial correlation prediction model of PM2.5 concentration based on
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., multi-graph attention. Environ. Model. Softw. 179, 106095.
Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative style, Zhang, S., Li, B., Liu, L., Hu, Q., Liu, H., Zheng, R., Zhu, Y., Liu, T., Sun, M., Liu, C.,
high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32. 2021. Prediction of vertical profile of NO2 using deep multimodal fusion network
Qingyun, F., Zhaokui, W., 2022. Cross-modality attentive feature fusion for object based on the ground-based 3-D remote sensing. IEEE Trans. Geosci. Remote Sens.
detection in multispectral remote sensing imagery. Pattern Recognit. 130, 108786. 60, 1–13.
Rao, K.S., Devi, G.L., Ramesh, N., et al., 2019. Air quality prediction in visakhapatnam Zhang, J., Luo, Z., Yang, Z., 2023. Research on air quality prediction based on
with LSTM based recurrent neural networks. Int. J. Intell. Syst. Appl. 11 (2), 18–24. LSTM-transformer with adaptive temporal attention mechanism. In: 2023 2nd
Ren, Y., Wang, S., Xia, B., 2023. Deep learning coupled model based on TCN-LSTM for International Conference on Artificial Intelligence and Intelligent Information
particulate matter concentration prediction. Atmos. Pollut. Res. 14 (4), 101703. Processing. AIIIP, IEEE, pp. 320–323.
Rowley, A., Karakuş, O., 2023. Predicting air quality via multimodal AI and satellite Zhang, B., Qin, H., Zhang, Y., Li, M., Qin, D., Guo, X., Li, M., Guo, C., 2025. Multi-
imagery. Remote Sens. Environ. 293, 113609. granularity PM2.5 concentration long sequence prediction model combined with
Roy, S.K., Deria, A., Hong, D., Rasti, B., Plaza, A., Chanussot, J., 2023. Multimodal spatial–temporal graph. Environ. Model. Softw. 106400.
fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Zhao, L., Yang, Y., Ning, T., 2024. A three-stage multimodal emotion recognition
Remote Sens. 61, 1–20. network based on text low-rank fusion. Multimedia Syst. 30 (3), 142.
Song, C., Lin, Y., Guo, S., Wan, H., 2020. Spatial-temporal synchronous graph convolu- Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W., 2021. Informer: Be-
tional networks: A new framework for spatial-temporal network data forecasting. In: yond efficient transformer for long sequence time-series forecasting. In: Proceedings
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 914–921. of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11106–11115.
Sun, D., Liu, C., Ding, Y., Yu, C., Guo, Y., Sun, D., Pang, Y., Pei, P., Du, H., Yang, L., et Zou, S., Huang, X., Shen, X., Liu, H., 2022. Improving multimodal fusion with main
al., 2023. Long-term exposure to ambient PM2· 5, active commuting, and farming modal transformer for emotion recognition in conversation. Knowl.-Based Syst. 258,
activity and cardiovascular disease risk in adults in China: A prospective cohort 109978.
study. Lancet Planet. Heal. 7 (4), e304–e312.
Vaswani, A., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst..

13

You might also like