Chen et al. - 2025 - Multimodal Air-Quality Prediction: A Multimodal Feature Fusion Network Based on Shared-Specific Modality Decoupling
Keywords: Air-quality prediction, Multimodal fusion, Spatial–temporal network, Time series forecasting

Severe air pollution degrades air quality and threatens human health, necessitating accurate prediction for pollution control. While spatiotemporal networks integrating sequence models and graph structures dominate current methods, prior work neglects multimodal data fusion to enhance feature representation. This study addresses the spatial limitations of single-perspective ground monitoring by synergizing remote sensing data, which provides global air quality distribution, with ground observations. We propose a Shared-Specific Modality Decoupling-based Spatiotemporal Multimodal Fusion Network for air-quality prediction, comprising: (1) feature extractors for remote sensing images and ground monitoring data, (2) a decoupling module separating shared and modality-specific features, and (3) a hierarchical attention-graph convolution fusion module. This framework achieves effective multimodal fusion by disentangling cross-modal dependencies while preserving unique characteristics. Evaluations on two real-world datasets demonstrate superior performance over baseline models, validating the efficacy of multimodal integration for spatial–temporal air quality forecasting.
∗ Corresponding author.
E-mail address: chenxiaoxia@[Link] (X. Chen).
Received 23 March 2025; Received in revised form 23 May 2025; Accepted 2 June 2025
Available online 17 June 2025
1364-8152/© 2025 Published by Elsevier Ltd.
X. Chen et al. Environmental Modelling and Software 192 (2025) 106553
4. Extensive experiments were conducted on two real-world datasets, and the results demonstrate that our model consistently outperforms all baseline models.

The remainder of this paper is organized as follows: Section 2 reviews prior research related to this work. Section 3 provides a detailed description of the proposed model. Section 4 presents comprehensive performance experiments and visualizations of prediction outcomes, along with extensive ablation studies on different architectures and key components. Section 5 concludes the paper.

2. Related work

2.1. Air-quality prediction

Air-quality prediction is a quintessential time series prediction task. In data-driven approaches, several classical statistical methods such as ARIMA (Kumar and Jain, 2010), multiple linear regression (Gu et al., 2021), and classification and regression trees (Gass et al., 2014) have been employed for air-quality prediction tasks. However, these methods are typically limited to single time series and often fall short in terms of prediction accuracy. On the other hand, machine learning-based methods, such as support vector regression (Liu et al., 2017), gradient boosting trees (Ma et al., 2020), random forests (Hasnain et al., 2023), and decision trees (Naveen et al., 2023), are also widely applied in the field of air-quality prediction. While these approaches offer commendable interpretability, they struggle to achieve precise air-quality prediction.

Within deep learning-based approaches, some studies have leveraged sequence models such as LSTM (Chen et al., 2021), GRU (Wang and Shao, 2022), temporal convolutional networks (TCN) (Ren et al., 2023), Transformer (Cao et al., 2024; Xia et al., 2025), as well as MLP-based models (Maciąg et al., 2023) to model air-quality data, achieving promising results. However, these sequence models typically treat the data from each station as independent entities, overlooking the inter-relationships between stations. Consequently, some research (Li et al., 2017; Yu et al., 2018) has highlighted that further extraction of spatial correlations between stations can aid in uncovering hidden features in air-quality data. By incorporating graph neural networks, spatial correlations between stations can be modeled in the form of graph structures to extract spatial features. Common methods for constructing graphs include measuring geographic distance, regional functional similarity, and dynamic time warping to create adjacency matrices (Wang et al., 2021b). Additionally, some studies have combined multiple metrics to construct more comprehensive graph structures and adaptive graphs (Wu et al., 2019; Zhang et al., 2025, 2024), further exploring spatial features.

The aforementioned methods primarily focus on extracting deep data features from air-quality data from both temporal and spatial perspectives, achieving in-depth spatiotemporal feature extraction. However, these methods are limited to the characteristics of the air-quality data itself, neglecting other types of environmental data that also contain valuable feature information. When deep features have already been thoroughly explored, integrating additional environmental data to enrich the feature set and provide more effective information for prediction tasks presents a viable approach.

2.2. Multi-modal fusion

Multimodal data typically refers to data obtained through various sensory channels or data sources, often representing different manifestations of the same object in distinct forms. This data usually includes various types of information such as images, text, audio, video, and sequence data. Multimodal feature fusion techniques are widely applied in fields such as computer vision, emotion recognition, remote sensing semantic segmentation, and object detection.

Some studies have employed matrix theory for multimodal feature fusion. Zhang et al. achieved feature fusion by performing tensor outer products on different modal feature data. Zhao et al. (2024) and Liu et al. (2021), recognizing the complexity of matrix operations, further combined matrix decomposition methods, transforming the original feature matrix into low-dimensional vectors, thereby eliminating redundant features and reducing computational complexity to some extent. Additionally, methods based on Transformer networks or incorporating attention mechanisms have become common approaches for multimodal feature fusion. Ma et al. (2024) employed a Transformer-based architecture to effectively model long-range dependencies and achieve precise remote sensing semantic segmentation by utilizing a cross-attention mechanism for multi-scale fusion of multimodal information. Zadeh et al. (2018) proposed the MFN network, introducing memory attention mechanisms and gating mechanisms to simultaneously capture temporal and cross-modal interactions. Fang et al. (2023) used multi-level attention mechanisms to capture significant intra-modal features and employed attention mechanisms to learn correlations between features across various modalities. Zou et al. (2022) combined Transformer networks for multimodal feature fusion, introducing cross-modal interactions in the multi-head attention mechanism and enriching the feature information obtained from other modalities while preserving the integrity of the primary modal features. Although attention mechanisms effectively capture global correlations in feature information, their high complexity and computational overhead are significant concerns. Ma et al. (2022) further integrated CNN and ViT within a unified fusion framework, combining shallow and deep features in a multi-level manner to accurately characterize both local details and global semantics, thereby accomplishing multimodal fusion-based semantic segmentation of remote sensing imagery. Some studies (Li et al., 2020) have integrated generative adversarial networks to generate fused modal features, but the reliability and stability of generative adversarial networks require further consideration.

In the environmental and remote sensing fields, multimodal fusion methods are commonly used to integrate multisource remote sensing images with environmental data from other modalities. These methods are widely applied in tasks such as remote sensing image classification, object detection, semantic segmentation, regional meteorological image simulation and reconstruction, as well as pollutant concentration prediction. Roy et al. (2023) proposed a multimodal feature fusion Transformer, achieving image classification of multisource remote sensing images. Qingyun and Zhaokui (2022) introduced a cross-modal attention fusion network for object detection in multispectral remote sensing images. In the domain of air-quality prediction, methods combining multimodal features are relatively scarce. Among them, Rowley and Karakuş (2023) combined satellite remote sensing images and pollutant data to predict pollutants such as nitrogen dioxide and ozone. Xia et al. (2024) proposed a multimodal prediction method that integrates remote sensing image data with pollutant time series data from multiple stations. Zhang et al. (2021) combined NO2 concentration data at multiple vertical levels with ground-level concentration observations and meteorological data to achieve NO2 concentration prediction from a three-dimensional perspective. Lei et al. (2022) integrated reconstructed data, satellite images, and ground observation data, employing CNN-LSTM and random forest methods to simulate PM2.5 concentrations across China.

By summarizing and analyzing the content of the aforementioned related work, we can identify several issues within the fusion methods for multimodal environmental data. First, when dealing with multimodal environmental data, it is crucial to account for the distinct characteristics of different modalities, as significant modality differences often exist, making feature representation and fusion more challenging. Second, environmental data often contain substantial noise and redundant information, which can negatively impact the effectiveness of feature fusion. Moreover, multimodal environmental data typically exhibit
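The tensor-outer-product fusion attributed above to Zhang et al. can be sketched in a few lines. The following is an illustrative reconstruction of the general idea, not the cited paper's implementation; the function name is ours:

```python
import numpy as np

def outer_product_fusion(f_a, f_b):
    """Fuse two modality feature vectors via a tensor (outer) product.

    A constant 1 is appended to each vector first, so the flattened result
    keeps the unimodal features alongside all pairwise interaction terms.
    """
    f_a = np.concatenate([np.asarray(f_a, dtype=float), [1.0]])
    f_b = np.concatenate([np.asarray(f_b, dtype=float), [1.0]])
    return np.outer(f_a, f_b).ravel()  # length (|f_a|+1) * (|f_b|+1)

fused = outer_product_fusion([0.2, 0.5, 0.1], [1.0, 3.0])
# a 3-dim and a 2-dim input yield 4 x 3 = 12 fused features
```

The quadratic growth of the fused dimension is precisely why the matrix-decomposition variants cited above project the result back to low-dimensional vectors.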
ing down the original input time series data into a trend component representing long-term changes and a seasonal component representing short-term fluctuations. The decomposition operation is described by the following formula:

$$X_{trend}^{S} = \mathrm{AvgPool}\left(\mathrm{Padding}\left(X^{S}\right)\right) \tag{1}$$

…domain input, and $W$ refers to the corresponding trainable parameters.

Subsequently, a multi-head attention (MHA) mechanism is employed to model the global correlations within the time series data of a single period, as described by the following formula:

$$FA_{sea}^{i} = \mathrm{MHA}\left(FG_{sea}^{i}\right) = \mathrm{Linear}\left(\bigcup_{h}^{H} \mathrm{softmax}\left(\frac{Q_{sea}^{h} \cdot K_{sea}^{h}}{\sqrt{dim}}\right) V_{sea}^{h}\right) \tag{9}$$

where $H$ represents the number of attention heads, $h$ represents the $h$-th attention head, $\bigcup_{h}^{H}$ represents merging the features of the $H$ attention heads, $dim$ denotes the general feature dimension, and $Q_{sea}^{h}$, $K_{sea}^{h}$ and $V_{sea}^{h}$ denote the mappings of $FG_{sea}^{i}$ to the Query, Key, and Value of the attention mechanism, respectively.

We then convert the time series data, which has been transformed into a two-dimensional shape, back to its original one-dimensional form. This is followed by reaggregating the data into the original seasonal component sequence, using the amplitude values as weightings.

$$\left(\hat{w}_{1}, \hat{w}_{2}, \ldots, \hat{w}_{k}\right) = \mathrm{softmax}\left(A_{f_{1}}, A_{f_{2}}, \ldots, A_{f_{k}}\right) \tag{10}$$

$$F_{sea}^{S} = \sum_{i=1}^{k} \hat{w}_{i} \cdot FA_{sea}^{i} \tag{11}$$

Fig. 3. Structural diagram of the shared-specific modality decoupling module.
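The period handling around Eqs. (10)–(11) — dominant periods picked from the FFT amplitude spectrum, then an amplitude-softmax-weighted sum of the per-period seasonal features — can be sketched as follows. This is a simplified numpy illustration under our reading of the text; the function names are ours, not the authors':

```python
import numpy as np

def dominant_periods(x, k=2):
    """Pick the k dominant periods of a 1-D series from the FFT
    amplitude spectrum, returning (periods, amplitudes A_f)."""
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                      # ignore the DC component
    idx = np.argsort(amp)[-k:]        # k largest-amplitude frequencies
    return [len(x) // i for i in idx], amp[idx]

def aggregate_seasonal(period_feats, amplitudes):
    """Eqs. (10)-(11): softmax over the amplitudes, then the weighted
    sum of the per-period seasonal features FA^i_sea."""
    a = np.asarray(amplitudes, dtype=float)
    w = np.exp(a - a.max())
    w /= w.sum()
    return sum(wi * fi for wi, fi in zip(w, period_feats))

t = np.arange(96, dtype=float)
periods, amps = dominant_periods(np.sin(2 * np.pi * t / 24), k=1)
# the 24-step daily cycle dominates, so periods == [24]
```

With hourly air-quality data, this kind of selection typically surfaces the daily (24 h) and weekly cycles, which is what motivates the per-period feature branches.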
Finally, we obtain the trend component feature $F_{trend}^{S}$ and the seasonal component feature $F_{sea}^{S}$ from the two separate time series feature extraction branches for the trend and seasonal components, respectively. These are then summed to derive the final time series feature $F^{S}$, as follows:

$$F^{S} = F_{trend}^{S} + F_{sea}^{S} \tag{12}$$

3.3. Remote-sensing image feature extraction module

Remote sensing images contain rich variation features on both global and local scales. Additionally, air-quality changes may exhibit different patterns across various spatial scales; for instance, while pollutant concentrations might show a decreasing trend globally, a specific local area could display an increase. Therefore, extracting features from remote sensing images at different scales to accommodate varying trends across spatial scales is essential. This approach aids in accurately identifying changes in pollutant concentrations at different spatial scales, thereby enhancing the precision of pollutant concentration predictions.

The multi-scale feature extraction module for remote sensing images primarily consists of three components: the multi-scale convolution component, the feature pyramid component, and the multi-scale fusion component. The multi-scale convolution component, which comprises convolutional blocks with different convolutional kernels, is responsible for extracting image features at various scales from the original remote sensing images. The formula is described as follows:

$$\{c_2, c_3, c_4, c_5\} = \mathrm{MultiScaleConv}\left(X^{R}\right) \tag{13}$$

The feature pyramid contains two feature pathways: a top-down pathway, which utilizes local features at smaller scales to complement global features at larger scales, and a same-layer feature transmission pathway used to calculate image features across different scales. The formula is described as follows:

$$p_i = \mathrm{ScaleConv}_5\left(c_i\right) + \mathrm{UpSampling}\left(p_{i+1}\right), \quad i \in \{2, 3, 4, 5\} \tag{14}$$

The multi-scale fusion component is responsible for integrating the image features from different scales to obtain the final multi-scale remote sensing image features. The formula is described as follows:

$$F^{R} = \mathrm{ScaleFusion}(p_2, p_3, p_4, p_5) \tag{15}$$

3.4. Shared-specific modality decoupling module

Remote sensing images of air-quality and time series data belong to different modalities, with significant differences in their forms of expression and feature scales, yet they share certain representational characteristics in terms of object features. For instance, remote sensing images capture the spatial variations in pollutant concentrations across different regions at the same time and their temporal changes, while ground observation data also reflect these variations. The data from different modalities reside in distinct feature spaces, each with its unique representation: remote sensing images tend to emphasize long-term global trends, whereas ground observation data focus on short-term local changes. Accurately identifying and integrating the shared and specific features of different modalities is crucial for the joint analysis and prediction of multimodal data, aiding in the enhancement of understanding and prediction accuracy of spatiotemporal changes in air-quality.

To further clarify the definitions of multimodal shared features and modality-specific features in air quality analysis, based on the differences in how ground monitoring data and remote sensing images reflect air quality characteristics, we define these features as follows:

Shared features refer to high-level abstract representations commonly existing across multimodal data (e.g., remote sensing images and ground monitoring data) that capture the essential patterns of air quality. These features originate from the physical consistency of spatiotemporal variations in air quality, independent of data acquisition methods or representational forms. They embody dynamic correlations of pollutant concentrations over time and spatial distribution patterns.

Modality-specific features are unique representations within a single modality, arising from data acquisition principles or inherent resolution constraints. These features reflect the observation perspectives and information granularity specific to a particular modality, requiring extraction through modality-adapted modeling methods to supplement refined information not covered by shared features.

We present a Shared-Specific Modality Decoupling Module, which extracts shared and specific modality features from different modal data in a similarity-constrained and data-driven manner. The structural diagram of this module is shown in Fig. 3. Specifically, the module primarily separates shared and specific features based on a shared gating mechanism and a residual mechanism. The shared gating mechanism receives the corresponding modality features as input and adaptively computes the weight of the shared features within the input features in a data-driven manner. Furthermore, to guide the shared gating mechanism in learning the shared features between the two modalities, we introduce cosine similarity as the loss function for this module, constraining the shared gating mechanism to learn the shared features of the two modal features. Finally, the extracted shared features are averaged and output as the shared modality features. The process is mathematically described as follows:

$$M_{Shared}^{S} = \mathrm{SharedGate}^{S}\left(F^{S}\right) = \sigma\left(W_{Gate}^{S} \cdot F^{S} + b^{S}\right) \cdot F^{S} \tag{16}$$
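The decoupling just described — a sigmoid gate that extracts the shared share of each modality, a residual that keeps the specific remainder, and a cosine-similarity constraint tying the two shared outputs together — can be sketched as a minimal numpy illustration. This is our reading of the module, with the learned Linear mappings reduced to identity and all names ours:

```python
import numpy as np

def shared_gate(F, W, b):
    """Shared gate: a sigmoid weighting that keeps the shared share of F
    (W and b are trainable in the paper; plain arrays here)."""
    gate = 1.0 / (1.0 + np.exp(-(F @ W + b)))
    return gate * F

def residual_specific(F, M_shared):
    """Residual mechanism: specific features as what is left after
    removing the shared part (the paper maps M_shared through a
    learned Linear first; identity is used here for illustration)."""
    return F - M_shared

def shared_loss(m_s, m_r, eps=1e-8):
    """Similarity constraint: 1 - cosine similarity between the shared
    features of the two modalities."""
    num = float(np.dot(m_s.ravel(), m_r.ravel()))
    den = float(np.linalg.norm(m_s) * np.linalg.norm(m_r)) + eps
    return 1.0 - num / den

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 8))                      # one modality's features
M_shared = shared_gate(F, rng.normal(size=(8, 8)), 0.0)
M_spec = residual_specific(F, M_shared)          # F decomposes exactly
```

Driving `shared_loss` toward zero pushes the two gates to agree on a common representation, which is the intended "similarity-constrained" behavior.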
$$M_{Shared}^{R} = \mathrm{SharedGate}^{R}\left(F^{R}\right) = \sigma\left(W_{Gate}^{R} \cdot F^{R} + b^{R}\right) \cdot F^{R} \tag{17}$$

$$Loss_{Shared} = 1 - \mathrm{CosSim}\left(M_{Shared}^{S}, M_{Shared}^{R}\right) = 1 - \frac{M_{Shared}^{S} \cdot M_{Shared}^{R}}{\left\|M_{Shared}^{S}\right\| \cdot \left\|M_{Shared}^{R}\right\|} \tag{18}$$

$$M_{Shared}^{SR} = \mathrm{Linear}\left(\mathrm{Avg}\left(M_{Shared}^{S}, M_{Shared}^{R}\right)\right) \tag{19}$$

Here, $\sigma$ represents the Sigmoid activation function, $\mathrm{CosSim}()$ denotes cosine similarity, $W_{Gate}^{R}$ and $W_{Gate}^{S}$ represent the trainable parameters of the shared gating mechanism, and $b^{S}$ and $b^{R}$ correspond to the regularization terms. Subsequently, the residual mechanism is employed to obtain the corresponding specific modality features based on the extracted shared features, with the following calculation formula:

$$M_{Spec}^{S} = \mathrm{Res}\left(F^{S}, M_{Shared}^{S}\right) = F^{S} - \mathrm{Linear}\left(M_{Shared}^{S}\right) \tag{20}$$

$$M_{Spec}^{R} = \mathrm{Res}\left(F^{R}, M_{Shared}^{R}\right) = F^{R} - \mathrm{Linear}\left(M_{Shared}^{R}\right) \tag{21}$$

Ultimately, through the decoupling module, we obtain the modality-shared feature $M_{Shared}^{SR}$ as well as the modality-specific features $M_{Spec}^{S}$ and $M_{Spec}^{R}$ from the time-series feature data and remote sensing image data, which are then used as inputs for the hierarchical attention graph convolution module.

3.5. Hierarchical attention graph convolution module

Traditional multimodal feature fusion methods often achieve fused modality features by simply concatenating different modality features and assigning weights. However, this approach overlooks the fact that multimodal data have different forms of representation, and their modality features also exhibit specific feature distributions. Directly concatenating features may lead to insufficient feature fusion and even introduce additional bias into subsequent tasks. Therefore, designing a fusion strategy that takes into account the distinct feature distributions of multimodal data not only allows for the full utilization of the unique information inherent in each modality but also reduces noise and bias caused by inconsistencies, thereby enhancing the overall performance of multimodal tasks.

Considering the different feature distributions of multimodal features, this paper divides modality-shared features and modality-specific features into two distinct levels. Within each level, self-modality attention is employed to compute the relevance weights between different feature nodes, capturing the correlations within the same level. Between different levels, cross-modality attention is used to calculate the relevance weights between the feature nodes of modality-shared features and modality-specific features. Subsequently, a graph convolutional network is utilized to propagate features and transmit information between nodes of different modalities, thereby supplementing and refining modality-specific features with information from modality-shared features. This process further achieves deep interaction and fusion of multimodal features. Specifically, for the features within the same level, including the time series-specific feature representation $M_{Spec}^{S}$, the remote sensing image-specific feature representation $M_{Spec}^{R}$, and the shared modality feature representation $M_{Shared}^{SR}$, the self-modality attention mechanism is applied to calculate the relevance weight matrices between different feature nodes within each level. These relevance weight matrices are denoted as $A^{S}$, $A^{R}$, and $A^{Sh}$, respectively.

$$A^{S} = \mathrm{SelfModalAtt}\left(M_{Spec}^{S}\right) = \mathrm{Softmax}\left(\frac{Query^{S} \cdot Key^{S}}{\sqrt{dim}}\right) = \mathrm{Softmax}\left(\frac{M_{Spec}^{S} W_{spec}^{Q} \cdot M_{Spec}^{S} W_{spec}^{K}}{\sqrt{dim}}\right) \in \mathbb{R}^{[N,N]} \tag{22}$$

Here, $\mathrm{SelfModalAtt}()$ represents the calculation function of the self-modal attention matrix, and $W_{spec}^{Q}$ and $W_{spec}^{K}$ represent the weight matrices that map $M_{Spec}^{S}$ to Query and Key, respectively. $Query^{S}$ and $Key^{S}$ represent the Query and Key for computing attention weights in the time-series modality, respectively, and $dim$ denotes the general feature dimension. The calculation methods for $A^{R}$ and $A^{Sh}$ are the same as described above. Here, $N$ represents the number of feature nodes in the time series features, $C$ represents the number of feature channels in the remote sensing image features, and $S$ represents the number of feature nodes in the shared features.

For features between different levels, namely modality-shared features and modality-specific features, we utilize cross-modality attention to calculate their feature correlations. This results in the correlation weight matrices between the modality-specific features and the modality-shared features. Specifically, $A^{SS}$ and $A^{RS}$ represent the correlation weights between the time series-specific features and the modality-shared features, and between the remote sensing image-specific features and the modality-shared features, respectively.

$$A^{SS} = \mathrm{CrossModalAtt}\left(M_{Spec}^{S}, M_{Shared}^{SR}\right) = \mathrm{Softmax}\left(\frac{Query^{S} \cdot Key^{Sh}}{\sqrt{dim}}\right) = \mathrm{Softmax}\left(\frac{M_{Spec}^{S} W_{spec}^{Q} \cdot M_{Shared}^{SR} W_{shared}^{K}}{\sqrt{dim}}\right) \in \mathbb{R}^{[N,S]} \tag{23}$$

$$A^{RS} = \mathrm{CrossModalAtt}\left(M_{Spec}^{R}, M_{Shared}^{SR}\right) = \mathrm{Softmax}\left(\frac{Query^{R} \cdot Key^{Sh}}{\sqrt{dim}}\right) = \mathrm{Softmax}\left(\frac{M_{Spec}^{R} W_{spec}^{Query} \cdot M_{Shared}^{SR} W_{shared}^{Key}}{\sqrt{dim}}\right) \in \mathbb{R}^{[C,S]} \tag{24}$$

Here, $\mathrm{CrossModalAtt}()$ represents the calculation function of the cross-modal attention matrix. $W_{spec}^{Q}$ and $W_{shared}^{K}$ represent the weight matrices that map $M_{Spec}^{S}$ to Query and map $M_{Shared}^{SR}$ to Key respectively, and $dim$ denotes the general feature dimension. Similarly, $W_{spec}^{Query}$ and $W_{shared}^{Key}$ represent the weight matrices that map $M_{Spec}^{R}$ to $Query$ and $M_{Shared}^{SR}$ to $Key^{Sh}$ respectively.

Subsequently, we employ a hierarchical graph convolutional method to establish the process of feature propagation within intra-layer features as well as between features of different levels. This approach facilitates the refinement and supplementation of modality-specific features into modality-shared features. The process primarily consists of three parts: the intra-layer feature propagation of modality-specific features based on self-modality correlations, the intra-layer feature propagation of modality-shared features also based on self-modality correlations, and the inter-layer feature propagation based on cross-modality correlations. The intra-layer propagation process can be described by the following equations:

$$H_{spec}^{S} = \mathrm{IntraHierGCN}\left(M_{Spec}^{S}, A^{S}\right) = \sum_{k=0}^{K} A^{S} \cdot M_{Spec}^{S} \cdot W_{spec}^{k} \tag{25}$$

Here, $H_{spec}^{S}$ and $H_{spec}^{R}$ represent the hidden features of the time series and remote sensing image-specific modality features, respectively, after intra-layer feature propagation. $H_{shared}$ represents the hidden features of the modality-shared features after intra-layer feature propagation. The intra-layer feature propagation primarily assigns dynamic weights to intra-layer modal features by constructing a self-correlation
Table 1
Detailed information of the two datasets.
Datasets #Time span #Time interval #Time stamps #Stations
Beijing-2018 1/1/2018–1/1/2021 1 h 26,280 34
Tianjin-2014 5/1/2014–5/1/2015 1 h 8760 27
$$H_{Trans}^{SS} = \mathrm{CrossHierGCN}\left(H_{spec}^{S}, A^{SS}\right) = \sum_{k=0}^{K} A^{SS} \cdot H_{spec}^{S} \cdot W_{trans}^{k} \tag{26}$$

$$H_{Trans}^{RS} = \mathrm{CrossHierGCN}\left(H_{spec}^{R}, A^{RS}\right) = \sum_{k=0}^{K} A^{RS} \cdot H_{spec}^{R} \cdot W_{trans}^{k} \tag{27}$$
Here, $H_{Trans}^{SS}$ and $H_{Trans}^{RS}$ represent the propagated features between the time series-specific modality features and the modality-shared features, and between the remote sensing image-specific modality features and the modality-shared features, respectively. In contrast to the intra-layer feature transmission process, the inter-layer feature transmission weight matrix represents mapping weights between features at different hierarchical levels, while following the same computational principles as graph convolution operations.

The correlation weight matrices between modality features of different levels represent the feature weights during the mapping of feature information from different feature spaces. By incorporating cross-modality attention, the hierarchical graph convolution can map features from different feature spaces into a unified feature space based on these cross-modality weights. Consequently, the resulting propagated features and modality-shared features reside in a unified feature space. After concatenating them and passing through the output mapping layer, the final prediction output $Y^{S}$ can be obtained. This process can be mathematically described as follows:

$$Y^{S} = \mathrm{Output}\left(\mathrm{concat}\left(H_{Trans}^{SS}, H_{Trans}^{RS}, H_{Shared}^{SR}\right)\right) = \mathrm{Linear}\left(H_{Trans}^{SS} \parallel H_{Trans}^{RS} \parallel H_{Shared}^{SR}\right) \tag{28}$$

4. Experimental results and discussion

4.1. Datasets

The research area of this study encompasses Beijing and Tianjin, with the experimental datasets comprising observed PM2.5 concentrations from ground pollutant monitoring stations, along with high-resolution remotely sensed PM2.5 concentration images and corresponding geographic auxiliary information for the study area.

4.1.1. Ground observation data

The hourly PM2.5 concentration data were sourced from the China National Environmental Monitoring Center. Currently, nearly 1800 monitoring stations have been established across mainland China. For our study, we selected observational data from 34 and 27 monitoring stations within the research area for training and validation purposes. The geographical distribution of the monitoring stations is shown in Fig. 4, and detailed information can be found in Table 1. The PM2.5 data were measured using the tapered element oscillating microbalance (TEOM) or the β-attenuation method, with uncertainties of ±1.5% or 0.1 μg/m³, respectively [30]. The raw data were then preprocessed through steps including data imputation, outlier handling, standardization, and data splitting, ultimately resulting in the ground monitoring station air-quality observation data.

4.1.2. Remote sensing images

The remote sensing image data used in this study were sourced from the CHAP dataset (Wei et al., 2021), a comprehensive, long-term, nationwide, high-resolution, and high-quality ground air pollutant dataset for China. This dataset primarily includes seven major air pollutants, and for this study, PM2.5 remote sensing images were selected as the data source. The temporal resolution is daily, and the spatial resolution is 1 km. The original remote sensing data were cropped based on geographic boundary and latitude-longitude information to obtain the corresponding PM2.5 remote sensing retrieval images for the study area. Remote sensing image samples are shown in Fig. 5.

Fig. 5. Visualization of PM2.5 remote-sensing images.

4.2. Baselines

1. LSTM (Rao et al., 2019): Incorporates memory gating to address the vanishing gradient problem in traditional recurrent neural networks, and is widely applied in time series prediction tasks.
2. STGCN (Yu et al., 2017): Proposes a spatiotemporal convolutional network that combines graph neural networks to model spatial correlations.
3. Transformer (Vaswani, 2017): Introduces a multi-head attention mechanism to model global correlations, widely used in sequence modeling tasks.
4. Graph Wavenet (Wu et al., 2019): Integrates adaptive graph convolution with temporal convolutional networks to adaptively infer spatial correlations.
Table 2
Detailed hyperparameter settings of baseline model.
Model Hyperparameter
LearningRate WindowSize Dims GraphOrder AttentionHeads DropoutRate
LSTM 1E−04 96 64 \ \ 0.2
STGCN 1E−04 96 96 2 \ 0.2
Transformer 1E−04 96 128 \ 8 0.2
GraphWavenet 1E−04 96 96 2 \ 0.2
Informer 1E−04 96 256 \ 8 0.2
DLinear 1E−04 96 256 \ \ \
Timesnet 1E−04 96 128 2 8 0.3
Res-GCN 1E−04 96 128 2 8 0.3
STMFNet 1E−04 96 96 2 8 0.3
Table 4
The multistep prediction performance of STMFNet and other baselines on the two datasets; the best performance is highlighted in bold, while the second-best performance is underlined.
Datasets Models 12 h 24 h 36 h 48 h
MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE MAE RMSE MAPE
LSTM 14.51 20.98 0.76 17.65 25.41 1.19 19.87 27.54 1.49 22.65 29.42 1.63
STGCN 13.06 17.84 0.74 15.94 22.79 1.14 17.64 25.69 1.37 20.48 26.98 1.46
Transformer 12.74 19.81 0.79 16.32 24.96 1.14 18.03 26.24 1.35 19.29 26.87 1.49
GraphWavenet 12.57 18.55 0.76 15.69 23.41 1.16 17.81 25.96 1.3 20.16 27.84 1.39
Beijing Informer 12.65 20.59 0.73 16.26 25.01 1.18 18.15 26.69 1.38 19.24 29.67 1.31
DLinear 12.64 20.39 0.92 15.81 24.19 1.26 17.61 25.85 1.48 18.6 26.9 1.55
Timesnet 12.4 19.95 0.82 16.07 24.18 1.21 18.2 27.12 1.33 19.19 28.36 1.41
Res-GCN 11.89 18.97 0.79 15.3 23.26 1.08 17.73 26.59 1.29 19.27 28.11 1.43
STMFNet 11.62 18.22 0.68 15.32 23.22 1.12 17.11 25.22 1.39 18.11 26.43 1.41
LSTM 24.91 34.73 0.67 29.68 38.53 0.86 34.28 41.98 1.19 38.36 47.69 1.35
STGCN 24.16 31.89 0.6 27.64 37.16 0.83 31.03 40.2 1.15 36.74 46.98 1.28
Transformer 24.09 32.76 0.58 29.68 38.69 0.86 32.59 41.56 0.91 33.56 44.16 0.94
GraphWavenet 23.15 35.36 0.56 27.49 37.65 0.76 30.65 40.89 0.95 34.52 45.26 1.09
Tianjin Informer 23.99 33.42 0.56 27.99 38.02 0.78 31.22 41.54 0.88 33.37 44.03 0.93
DLinear 21.66 30.97 0.51 30.08 40.44 0.91 32.06 41.94 1.01 33.63 43.17 1.07
Timesnet 22.02 32.12 0.48 27.54 37.46 0.76 30.19 40.25 0.86 32.56 42.17 0.92
Res-GCN 21.78 30.49 0.52 27.08 37.71 0.72 29.68 38.43 0.76 30.15 41.69 0.87
STMFNet 21.24 29.81 0.51 26.52 36.44 0.64 27.85 37.95 0.69 28.94 39.14 0.72
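For reference, the MAE, RMSE, and MAPE columns in Table 4 follow the standard definitions; the sketch below is a generic illustration, not the authors' evaluation code:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mape(y, y_hat, eps=1e-8):
    """Mean absolute percentage error, as a fraction (matching the
    sub-1 values in Table 4); eps guards against zero observations."""
    return float(np.mean(np.abs((y - y_hat) / (y + eps))))

y = np.array([10.0, 20.0, 40.0])      # observed PM2.5
y_hat = np.array([12.0, 18.0, 44.0])  # predicted PM2.5
# mae -> 2.667, rmse -> 2.828, mape -> 0.133 (all approximately)
```

Because RMSE squares the residuals, it penalizes the occasional large miss (e.g., a pollution episode) more heavily than MAE, which is why the two columns can rank models differently.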
components affect the model's performance and to evaluate the importance of each component. The detailed descriptions of the variants are as follows:

1. w/o MMF: Multimodal Feature Ablation: This variant retains only the time series features, inputting them into the output mapping module to obtain the prediction output.
2. w/o FFM: Feature Fusion Module Ablation: This variant removes the feature fusion module, and the time series features are concatenated with the remote sensing image features to generate the prediction output.
3. w/o DM: Decoupling Module Ablation: This variant eliminates the decoupling module, preserving the original multimodal features.
4. w/o HAGCN: Hierarchical Attention Graph Convolutional Network Module Ablation: This variant removes the hierarchical attention graph convolutional network module, concatenating the decoupled features before inputting them into the output mapping layer to obtain the final prediction output.

As shown in Fig. 6, removing any component from the model results in a performance decline. Notably, the variant w/o MMF exhibits a significant performance drop on both datasets, confirming that the inclusion of multimodal information greatly enhances model performance. Additionally, the w/o FFM variant demonstrates that the addition of the multimodal feature fusion module enables effective integration of multimodal features, further improving predictive performance. The results of the w/o DM and w/o HAGCN variants also validate the effectiveness of the decoupling mechanism and the hierarchical attention graph convolutional network, respectively.

4.5.2. Ablation analysis on time feature extraction module

To investigate the effectiveness of each component and mechanism within the time series feature extraction module, we performed ablation studies on the various components of the time feature extraction module to create different variants. The descriptions of each variant are as follows:

1. w/o STD: Seasonal-Trend Decomposition Ablation: This variant removes the seasonal-trend decomposition operation, directly inputting the raw time series data into the seasonal feature extraction module to obtain the final time series features.
2. w/o MSR: Multi-Scale Reshape Ablation: This variant eliminates the multi-scale reshape operation combined with Fourier transforms, instead applying adaptive graph convolution and multi-head attention mechanisms on the raw data.
3. w/o AGCN: Adaptive Graph Convolution Ablation: This variant removes the adaptive graph convolution operation.
4. w/o MHA: Multi-Head Attention Ablation: This variant removes the multi-head attention operation.

As shown in Fig. 7, removing any component from the time series feature extraction module results in a performance decline. The w/o STD variant confirms that the seasonal-trend decomposition method helps explain the composition of sequence data, thereby improving the accuracy of time series prediction. The w/o MSR variant demonstrates that the reshape operation based on periodic data is effective in extracting periodic features. Additionally, the w/o AGCN variant confirms that adaptive graph convolution effectively captures spatial correlations between stations, while the w/o MHA variant verifies the effectiveness of the multi-head attention mechanism in capturing global correlations within the periodic data.
Fig. 7. Results of ablation analysis on time feature extraction module: (a) Beijing; (b) Tianjin.
Fig. 10. Visualization of prediction results on multi-stations.

5. Conclusion

This paper proposes a novel deep learning network, STMFNet, for multimodal air-quality prediction. The network primarily consists of three modules: the time series feature extraction module, the remote sensing image feature extraction module, and the multimodal feature fusion module. Specifically, the time series feature extraction module utilizes a decomposition strategy to transform the raw sequence into seasonal and trend components, extracting long-term trend and periodic fluctuation features. The remote sensing image feature extraction module captures air-quality variation features at different scales and integrates these features to achieve both global and local scale variation feature extraction. The multimodal feature fusion module first employs a decoupling strategy to decompose the original multimodal features into shared and specific features. It then utilizes a hierarchical attention graph convolutional network to supplement the shared modal features with specific modal feature information, enabling deep fusion and complementarity of multimodal information, which ultimately yields the final prediction results. Extensive experiments on two real-world datasets have confirmed the effectiveness of STMFNet, demonstrating its superiority over other baseline models and proving the effectiveness of incorporating multimodal data for air-quality prediction. However, the current research is limited to station-level air-quality prediction within urban areas. In the future, we plan to expand the study area to encompass provincial and even national regions, extending station-level air-quality prediction to pixel-level air-quality prediction across the broader study area.
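The shared-specific decoupling idea summarized above can be sketched in a few lines. This is a toy illustration under assumed names (matvec, decouple, orthogonality_penalty), not STMFNet's actual module: each modality's feature vector passes through two separate projections, and an orthogonality-style penalty discourages the shared and specific parts from overlapping during training.

```python
def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def decouple(feat, W_shared, W_specific):
    """Project one modality's features into a shared component and a
    modality-specific component via two learned linear maps (toy weights here)."""
    return matvec(W_shared, feat), matvec(W_specific, feat)

def orthogonality_penalty(shared, specific):
    """Squared inner product between the two components; minimizing it
    pushes the shared and specific features into disjoint subspaces."""
    return sum(a * b for a, b in zip(shared, specific)) ** 2
```

With perfectly disjoint projections the penalty is zero; a fusion stage can then recombine the shared component of one modality with the specific component of another without redundancy.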
Acknowledgments

Data availability

Fig. 12. Visualization of prediction results in the region of Tianjin.

CRediT authorship contribution statement

Xiaoxia Chen: Writing – review & editing, Validation, Supervision, Resources, Investigation, Funding acquisition, Formal analysis, Conceptualization. Zhen Wang: Writing – review & editing, Writing – original draft, Validation, Software, Project administration, Methodology, Investigation, Data curation. Fangyan Dong: Writing – review & editing, Validation, Supervision. Kaoru Hirota: Writing – review & editing, Validation, Supervision.

Software and data availability

• Name of software: STMFNet for multimodal air-quality prediction
• Developer: Zhen Wang
• Contact: chenxiaoxia@[Link]
• Data first available: March 24, 2025
• Program language: Python
• Source code at: [Link]
• Original Ground Monitoring Data: [Link]
• Original Remote Sensing Data: [Link]

References

Bai, L., Yao, L., Li, C., Wang, X., Wang, C., 2020. Adaptive graph convolutional recurrent network for traffic forecasting. Adv. Neural Inf. Process. Syst. 33, 17804–17815.
Cao, W., Qi, W., Lu, P., 2024. Air quality prediction based on time series decomposition and convolutional sparse self-attention mechanism transformer model. IEEE Access.
Chadalavada, S., Faust, O., Salvi, M., Seoni, S., Raj, N., Raghavendra, U., Gudigar, A., Barua, P.D., Molinari, F., Acharya, R., 2024. Application of artificial intelligence in air pollution monitoring and forecasting: A systematic review. Environ. Model. Softw. 106312.
Chen, H., Guan, M., Li, H., 2021. Air quality prediction based on integrated dual LSTM model. IEEE Access 9, 93285–93297.
Chen, X., Hu, Y., Dong, F., Chen, K., Xia, H., 2024a. A multi-graph spatial-temporal attention network for air-quality prediction. Process. Saf. Environ. Prot. 181, 442–451.
Chen, X., Hu, Y., Liu, C., Chen, A., Chi, Z., 2025a. Dynamic spatio-temporal graph network based on multi-level feature interaction for sinter TFe prediction. J. Process Control 148, 103401. [Link]
Chen, X., Liu, C., Xia, H., Chi, Z., 2025b. Burn-through point prediction and control based on multi-cycle dynamic spatio-temporal feature extraction. Control Eng. Pract. 154, 106165. [Link]
Chen, X., Xia, H., Wu, M., Hu, Y., Wang, Z., 2024b. Spatiotemporal hierarchical transmit neural network for regional-level air-quality prediction. Knowl.-Based Syst. 289, 111555.
Chi, Z., Chen, X., Xia, H., Liu, C., Wang, Z., 2024. An adaptive control system based on spatial–temporal graph convolutional and disentangled baseline-volatility prediction of bellows temperature for iron ore sintering process. J. Process Control 140, 103254. [Link]
Fang, M., Peng, S., Liang, Y., Hung, C., Liu, S., 2023. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 82, 104561.
Gass, K., Klein, M., Chang, H.H., Flanders, W.D., Strickland, M.J., 2014. Classification and regression trees for epidemiologic research: An air pollution example. Environ. Heal. 13, 1–10.
Gu, K., Qiao, J., Lin, W., 2018. Recurrent air quality predictor based on meteorology- and pollution-related factors. IEEE Trans. Ind. Informatics 14 (9), 3946–3955.
Gu, Y., Zhao, Y., Zhou, J., Li, H., Wang, Y., 2021. A fuzzy multiple linear regression model based on meteorological factors for air quality index forecast. J. Intell. Fuzzy Systems 40 (6), 10523–10547.
Guo, S., Lin, Y., Feng, N., Song, C., Wan, H., 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 922–929.
Hasnain, A., Sheng, Y., Hashmi, M.Z., Bhatti, U.A., Ahmed, Z., Zha, Y., 2023. Assessing the ambient air quality patterns associated to the COVID-19 outbreak in the Yangtze River Delta: A random forest approach. Chemosphere 314, 137638.
He, J., Yu, Y., Liu, N., Zhao, S., 2013. Numerical model-based relationship between meteorological conditions and air quality and its implication for urban air quality management. Int. J. Environ. Pollut. 53 (3–4), 265–286.
Hu, J., Zhou, R., Ding, R., Ye, D., Su, Y., 2023. Effect of PM2.5 air pollution on the global burden of lower respiratory infections, 1990–2019: A systematic analysis from the global burden of disease study 2019. J. Hazard. Mater. 459, 132215. [Link]
Kingma, D.P., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kumar, U., Jain, V., 2010. ARIMA forecasting of ambient air pollutants (O3, NO, NO2 and CO). Stoch. Environ. Res. Risk Assess. 24, 751–760.
Lei, C., Xu, X., Ma, Y., Jin, S., Liu, B., Gong, W., 2022. Full coverage estimation of the PM concentration across China based on an adaptive spatiotemporal approach. IEEE Trans. Geosci. Remote Sens. 60, 1–14.
Lelieveld, J., Haines, A., Burnett, R., Tonne, C., Klingmüller, K., Münzel, T., Pozzer, A., 2023. Air pollution deaths attributable to fossil fuels: Observational and modelling study. BMJ 383. [Link]
Li, J., Hua, C., Ma, L., Chen, K., Zheng, F., Chen, Q., Bao, X., Sun, J., Xie, R., Bianchi, F., Kerminen, V., Petäjä, T., Kulmala, M., Liu, Y., 2024. Key drivers of the oxidative potential of PM2.5 in Beijing in the context of air quality improvement from 2018 to 2022. Environ. Int. 187, 108724. [Link]
Li, X., Wang, C., Tan, J., Zeng, X., Ou, D., Ou, D., Zheng, B., 2020. Adversarial multimodal representation learning for click-through rate prediction. In: Proceedings of the Web Conference 2020. pp. 827–836.
Li, Y., Yu, R., Shahabi, C., Liu, Y., 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
Liu, B., Binaykia, A., Chang, P., Tiwari, M.K., Tsao, C., 2017. Urban air quality forecasting based on multi-dimensional collaborative support vector regression (SVR): A case study of Beijing-Tianjin-Shijiazhuang. PLOS ONE 12 (7), 1–17. [Link]
Liu, F., Chen, J., Tan, W., Cai, C., 2021. A multi-modal fusion method based on higher-order orthogonal iteration decomposition. Entropy 23 (10), 1349.
Liu, Y., Wang, P., Li, Y., Wen, L., Deng, X., 2022. Air quality prediction models based on meteorological factors and real-time data of industrial waste gas. Sci. Rep. 12 (1), 9253.
Ma, J., Yu, Z., Qu, Y., Xu, J., Cao, Y., 2020. Application of the XGBoost machine learning method in PM2.5 prediction: A case study of Shanghai. Aerosol Air Qual. Res. 20 (1), 128–138. [Link]
Ma, X., Zhang, X., Pun, M.-O., 2022. A crossmodal multiscale fusion network for semantic segmentation of remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 15, 3463–3474. [Link]
Ma, X., Zhang, X., Pun, M.-O., Liu, M., 2024. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 62, 1–15. [Link]
Maciąg, P.S., Bembenik, R., Piekarzewicz, A., Del Ser, J., Lobo, J.L., Kasabov, N.K., 2023. Effective air pollution prediction by combining time series decomposition with stacking and bagging ensembles of evolving spiking neural networks. Environ. Model. Softw. 170, 105851.
Naveen, S., Upamanyu, M., Chakki, K., Chandan, M., Hariprasad, P., 2023. Air quality prediction based on decision tree using machine learning. In: 2023 International Conference on Smart Systems for Applications in Electrical Sciences. ICSSES, IEEE, pp. 1–6.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32.
Qingyun, F., Zhaokui, W., 2022. Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognit. 130, 108786.
Rao, K.S., Devi, G.L., Ramesh, N., et al., 2019. Air quality prediction in Visakhapatnam with LSTM based recurrent neural networks. Int. J. Intell. Syst. Appl. 11 (2), 18–24.
Ren, Y., Wang, S., Xia, B., 2023. Deep learning coupled model based on TCN-LSTM for particulate matter concentration prediction. Atmos. Pollut. Res. 14 (4), 101703.
Rowley, A., Karakuş, O., 2023. Predicting air quality via multimodal AI and satellite imagery. Remote Sens. Environ. 293, 113609.
Roy, S.K., Deria, A., Hong, D., Rasti, B., Plaza, A., Chanussot, J., 2023. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–20.
Song, C., Lin, Y., Guo, S., Wan, H., 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 914–921.
Sun, D., Liu, C., Ding, Y., Yu, C., Guo, Y., Sun, D., Pang, Y., Pei, P., Du, H., Yang, L., et al., 2023. Long-term exposure to ambient PM2.5, active commuting, and farming activity and cardiovascular disease risk in adults in China: A prospective cohort study. Lancet Planet. Heal. 7 (4), e304–e312.
Vaswani, A., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst.
Wang, H., Shao, S., 2022. Prediction of PM2.5 in Hefei based on a hybrid CNN-GRU model. In: 2022 5th International Conference on Data Science and Information Technology. DSIT, IEEE, pp. 1–6.
Wang, C., Zhu, Y., Zang, T., Liu, H., Yu, J., 2021a. Modeling inter-station relationships with attentive temporal graph convolutional network for air quality prediction. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. pp. 616–634.
Wang, C., Zhu, Y., Zang, T., Liu, H., Yu, J., 2021b. Modeling inter-station relationships with attentive temporal graph convolutional network for air quality prediction. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. pp. 616–634.
Wei, J., Li, Z., Lyapustin, A., Sun, L., Peng, Y., Xue, W., Su, T., Cribb, M., 2021. Reconstructing 1-km-resolution high-quality PM2.5 data records from 2000 to 2018 in China: Spatiotemporal variations and policy implications. Remote Sens. Environ. 252, 112136.
Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M., 2022. TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186.
Wu, Z., Pan, S., Long, G., Jiang, J., Zhang, C., 2019. Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
Xia, H., Chen, X., Chen, B., Hu, Y., 2025. Dynamic synchronous graph transformer network for region-level air-quality forecasting. Neurocomputing 616, 128924. [Link]
Xia, H., Chen, X., Wang, Z., Chen, X., Dong, F., 2024. A multi-modal deep-learning air quality prediction method based on multi-station time-series data and remote-sensing images: Case study of Beijing and Tianjin. Entropy 26 (1).
Yu, Z., Sun, Z., Liu, L., Li, C., Zhang, X., Amat, G., Ran, M., Hu, X., Xu, Y., Zhao, X., Zhou, J., 2024. Environmental surveillance in Jinan city of East China (2014–2022) reveals improved air quality but remained health risks attributable to PM2.5-bound metal contaminants. Environ. Pollut. 343, 123275. [Link]
Yu, B., Yin, H., Zhu, Z., 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
Yu, B., Yin, H., Zhu, Z., 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, pp. 3634–3640. [Link]
Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., Cambria, E., Morency, L.-P., 2018. Memory fusion network for multi-view sequential learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32.
Zeng, A., Chen, M., Zhang, L., Xu, Q., 2023. Are transformers effective for time series forecasting? In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 11121–11128.
Zhang, B., Chen, W., Li, M.-Z., Guo, X., Zheng, Z., Yang, R., 2024. Mgatt-LSTM: A multi-scale spatial correlation prediction model of PM2.5 concentration based on multi-graph attention. Environ. Model. Softw. 179, 106095.
Zhang, S., Li, B., Liu, L., Hu, Q., Liu, H., Zheng, R., Zhu, Y., Liu, T., Sun, M., Liu, C., 2021. Prediction of vertical profile of NO2 using deep multimodal fusion network based on the ground-based 3-D remote sensing. IEEE Trans. Geosci. Remote Sens. 60, 1–13.
Zhang, J., Luo, Z., Yang, Z., 2023. Research on air quality prediction based on LSTM-transformer with adaptive temporal attention mechanism. In: 2023 2nd International Conference on Artificial Intelligence and Intelligent Information Processing. AIIIP, IEEE, pp. 320–323.
Zhang, B., Qin, H., Zhang, Y., Li, M., Qin, D., Guo, X., Li, M., Guo, C., 2025. Multi-granularity PM2.5 concentration long sequence prediction model combined with spatial–temporal graph. Environ. Model. Softw. 106400.
Zhao, L., Yang, Y., Ning, T., 2024. A three-stage multimodal emotion recognition network based on text low-rank fusion. Multimedia Syst. 30 (3), 142.
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., Zhang, W., 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11106–11115.
Zou, S., Huang, X., Shen, X., Liu, H., 2022. Improving multimodal fusion with main modal transformer for emotion recognition in conversation. Knowl.-Based Syst. 258, 109978.