A Multi-View Multi-Task Learning Framework For Multi-Variate Time Series Forecasting
Abstract—Multi-variate time series (MTS) data is a ubiquitous class of data abstraction in the real world. Each instance of MTS is generated from a hybrid dynamical system whose specific dynamics are usually unknown. The hybrid nature of such a dynamical system stems from complex external attributes, such as geographic location and time of day, each of which can be categorized as either a spatial attribute or a temporal attribute. Therefore, there are two fundamental views from which MTS data can be analyzed, namely the spatial view and the temporal view. From each of these two views, the set of MTS data samples can be partitioned into disjoint forecasting tasks according to their associated attribute values. Samples of the same task then manifest similar forthcoming patterns, which are easier to predict than under the original single-view setting. Motivated by this insight, we propose a novel multi-view multi-task (MVMT) learning framework for MTS forecasting. In most scenarios, MVMT information is not explicitly presented but deeply concealed in the MTS data, which severely hinders the model from capturing it naturally. To this end, we develop two kinds of basic operations, namely task-wise affine transformation and task-wise normalization. Applying these two operations with prior knowledge on the spatial and temporal views allows the model to adaptively extract MVMT information while predicting. Extensive experiments on three datasets illustrate that canonical architectures can be greatly enhanced by the MVMT learning framework in terms of both effectiveness and efficiency. In addition, we design rich case studies to reveal the properties of the representations produced at different phases of the prediction procedure.
Index Terms—Time series forecasting, deep learning, normalization, multi-view multi-task learning
1 INTRODUCTION
Time series forecasting is a significant problem in many industrial and business applications [1], [2], [3]. For instance, a public transport operator can allocate sufficient capacity to mitigate the queuing time in a region in advance, if they have the means to foresee that a particular region will suffer from a supply shortage in the next couple of hours [4], [5], [6], [7]. As another example, an investor can avoid economic loss with the assistance of a robo-advisor which is able to predict a potential market crash [8].

Due to the complex and continuous fluctuation of impacting factors, real-world time series tends to be extraordinarily non-stationary, that is, exhibiting diverse dynamics. For instance, traffic volume is largely affected by the road's condition, location, and the current time and weather condition. In the retail sector, the current season, price and brand are determinants for the sales of merchandise. The diverse dynamics impose an enormous challenge on time series forecasting. In this work, we study multi-variate time series forecasting, where multiple variables evolve with time.
Traditional time series forecasting algorithms, such as ARIMA and state space models (SSMs), provide a principled framework for modeling and learning time series patterns. However, these algorithms have a rigorous requirement for the stationarity of a time series, and thus suffer from severe limitations in practical use if most of the impacting factors are unavailable. Recent studies show that, thanks to the nonlinearity of activation functions, deep learning models possess the capacity to handle complex dynamics, theoretically in any form, even in the absence of additional impacting factors [9]. Therefore, the nonstationarity issue can be addressed to some degree. Common neural architectures applied on time series data include recurrent neural networks (RNNs), long short-term memory (LSTM) [10], Transformer [11], Wavenet [12] and temporal convolution networks (TCNs) [13].

In our article, we conjecture that MTS forecasting can essentially be treated as a multi-view multi-task learning problem. To the best of our knowledge, we are the first to formulate MTS in this way. There are typically two additional views in the MTS problem apart from the original spatial-temporal view, namely the temporal view and the spatial view. From each of these two views, we can divide the forecasting samples into disjoint tasks in accordance with their associated attribute values.
Fig. 4. (a) Spatial indistinguishability; (b) Temporal indistinguishability.

measurement pairs of region A, and separate them based on weekday or weekend. It is obvious that these two clusters also have a strong correlation.

To address these issues, we propose two fundamental operations, task-wise affine transformation and task-wise normalization, each of which can weaken the inter-task correlation while maintaining the intra-task correlation. Task-wise affine transformation transforms the representations of each sample with task-specific affine parameters, so that task-specific characteristics can be encoded into the feature space. The limitation of this operation is that it can only be applied on the spatial view, whose task partition is static over time; in other words, the set of tasks does not change with time. When it comes to the temporal view with a dynamic task partition, the model cannot pre-learn the affine parameters for tasks appearing at a future time. To complement task-wise affine transformation, task-wise normalization is proposed, which can be applied not only on the spatial view but also on the temporal view. Basically, it performs normalization over the entire group of samples assigned to the same task, which also results in representations with task-specific characteristics. In our study, we realize task-wise affine transformation and normalization from the spatial view and the temporal view respectively, giving rise to a compound operation known as ST-Norm, abbreviated as STN.

We summarize our contributions as follows:

- We propose a novel MVMT learning framework for time series forecasting. We account for three views in this framework, namely the original view, the spatial view and the temporal view. From each of the spatial view and the temporal view, learning is performed in a multi-task manner where each task is associated with a variable or a timestamp.
- We develop task-wise affine transformation and normalization to enable the feature space to be encoded with MVMT information. Either of these two operations can weaken the inter-task correlations while keeping the intra-task correlations in the representation space, which emulates the explicit partitioning of data samples in the normal setting of multi-task learning.
- We propose a compound operation, ST-Norm, consisting of realizations of task-wise affine transformation and normalization from the spatial view and the temporal view respectively.
- We conduct extensive experiments to quantitatively and qualitatively validate the effectiveness of the MVMT learning framework.

2 RELATED WORK

2.1 Time Series Forecasting
Time series forecasting has been studied for decades. Traditional methods, such as ARIMA, can only learn the linear relationship among different timesteps, which is an inherent deficiency in fitting many real-world time series that are highly nonlinear. With the power of deep learning models, a large volume of work in this area has recently achieved impressive performance. For instance, [17] adopts LSTM to capture the nonlinear dynamics and long-term dependencies in time series data. However, the memorizing capacity of LSTM is still restricted, as pointed out by [18]. To resolve this issue, [19], [20] create an external memory to explicitly store representative patterns that are frequently observed in the history, which can effectively guide the forecasting when similar patterns occur. [21] makes use of a skip connection to enable information to be transmitted from the distant history. The attention mechanism is another option to deal with the vanishing memory problem [22], [23]. Of these methods, Transformer is a representative architecture which consists of only attention operations [24]. To overcome the computation bottleneck of the canonical Transformer, [11] proposes a novel mechanism that periodically skips some timesteps when performing attention. As far as we know, Wavenet [12], TCN [13] and Transformer [24] are currently the superior choices for modeling long-term time series data [25], [26], [27], [28]. In contrast to the above approaches, which focus on enriching the semantics of the current state with information extracted from the history, [29] attempted to improve the emissions from the current state into the future, formulating the forecasts at different forthcoming time steps as separate tasks and modeling the interplay between them.

To tackle MTS, several studies [27], [30], [31], [32] assume that multi-variate time series data has a low-rank structure. Another thread of works [17], [33], [34] leverages the attention mechanism to learn the correlations among individual time series, where [34] opted to obtain the attentive scores with cosine similarity rather than the dot-product similarity used in [17], [33]. Recently, [26] inferred the inherent structure over the variables from self-learned encodings associated with each variable. [35] used the Fourier transform to decompose original MTS data into a group of orthogonal signals. [36] proposed a dual self-attention network (DSANet) to dynamically capture both local and global patterns with convolution and attention operators, effectively handling MTS data with non-periodic dynamics. The above methods make point estimations, whereas [30], [37], [38] produce a confidence interval that is likely to contain the forthcoming observation. With prior knowledge on the application scenario, rich kinds of inter-variate relationships can be utilized. For instance, in a real-world transportation system, there are three typical relationships, namely spatial closeness [4], [5], [6], [7], functional similarity [4] and origin-destination connectiveness [39]. In particular, spatial closeness follows the first law of geography, "near things are more related than distant things"; functional similarity explains the phenomenon that two locations far away from each other can still exhibit similar movement patterns of traffic recordings over a day; origin-destination (OD)
connectiveness is characterized as the quantity of the traffic flow between the OD pair [39]. [40] proposed a new objective function for training deep neural networks, aiming at accurately predicting sudden changes. [41] explored the structure of LSTM to learn variable-wise hidden states, with the aim of distinguishing the contribution of each variable to the prediction. In some scenarios, MTS may evolve asynchronously and be spaced unevenly. To deal with this more general case, [42] organized the asynchronous MTS data as a single series of observations, alongside a series of external features expressing the spatial-temporal relationships between the observations. Afterward, they married convolutional neural networks and auto-regressive models to perform forecasting on top of this new data representation.

In contrast to the existing works, we pioneer the development of spatial and temporal normalization in the context of MTS forecasting, and demonstrate their rationality from the perspective of multi-view multi-task learning.

2.2 Normalization
Normalization was first adopted in deep image processing, and has significantly enhanced the performance of deep learning models for nearly all tasks. There are multiple normalization methods, such as batch normalization [43], instance normalization [44], group normalization [45], layer normalization [46] and positional normalization [47], each of which was proposed to address a particular group of computer vision tasks. Of these, instance normalization has the greatest potential for our study; it was originally designed for image synthesis owing to its power to remove style information from images. Researchers have found that feature statistics can capture the style of an image, and that the features remaining after normalizing out those statistics are responsible for the content. Such a separable property enables the content of an image to be rendered in the style of another image, which is also known as style transfer. The style information in an image is analogous to the scale information in a time series. There is another line of work which explores why the normalization trick facilitates the learning of deep neural networks [48], [49], [50], [51]. One of their major discoveries is that normalization can increase the rankness of the feature space; in other words, it enables the model to extract more diverse features.

3 PRELIMINARIES
In this section, we introduce the definitions and the assumptions. All frequently used notations are reported in Table 1.

TABLE 1
Notations

- N, M: number of variables / tasks.
- T, T_in, T_out: length of history, number of input / output steps.
- X \in R^{N \times T}: historical observations for forecasting.
- Z, \hat{Z}, \tilde{Z} \in R^{N \times T \times d}: historical latent representations.
- I = [[1, N]] \times [[1, T]]: the set of sample indexes.
- P = {I_1, ..., I_M}: a task partition, i.e., a partition of the entire set of sample indexes.
- G^P \in R^{M \times d}: global components conditioned on partition P.
- L^P \in R^{N \times T \times d}: local components conditioned on partition P.
- T, S: task partitions from the temporal / spatial view.
- x, y, z: vectors or matrices that represent certain variables.
- +, \odot, /: element-wise addition / multiplication / division.

Fig. 5. Illustration of task partition schemes from: (a) the temporal view; (b) the spatial view.

3.1 Task Partition Schemes
Given the complete set of sample indexes I = [[1, N]] \times [[1, T]], a task partition is defined as P = {I_1, ..., I_M}, where I_1, ..., I_M are disjoint subsets (also known as tasks in our context) of I, and M is the number of tasks. Besides, we introduce a function P which maps any sample index to the index of the task it belongs to, i.e., P(n, t) = m under the given task partition scheme.

We employ two schemes to partition tasks, respectively from the temporal view and the spatial view, introduced as follows:

- Temporal view: samples collected at the same time are put in the same task, as shown in Fig. 5a, ignoring the spatial difference. In this case, M = T and I_m = {(n, m)}_{n=1}^{N}.
- Spatial view: samples of the same variable are put in the same task, as shown in Fig. 5b, ignoring the temporal difference. In this case, M = N and I_m = {(m, t)}_{t=1}^{T}.

In terms of either of the aforementioned partition schemes, samples put in the same task are impacted by the same factor and thus should manifest a strong correlation. Meanwhile, samples across tasks are impacted by different factors, which results in weakly correlated patterns.

There are many partition schemes with different granularities. For example, we could also put the samples collected during a specified period into one task. In our work, we adopt the finest granularity, and the question of what the optimal granularity is is left for future work.

3.2 Other Preliminaries

Definition 1 (Time series forecasting). Time series forecasting is formulated as the following conditional distribution:

    P(X_{:, t+1 : t+T_{out}} \mid X_{:, t+1-T_{in} : t}) = \prod_{i=1}^{T_{out}} P(X_{:, t+i} \mid X_{:, t+1-T_{in} : t}).

Definition 2 (Time series factorization). Specifying a task partition scheme P, any sample of the time series can be factorized into a global component G^P, shared by all samples of the same task, and a local component L^P, specific to each individual sample.
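To make the two partition schemes of Section 3.1 concrete, the following is a minimal Python sketch (not taken from the paper's code release) that maps each sample index (n, t) of an N-variable, T-step series to a task id under the temporal and the spatial view, and groups the sample indexes accordingly. The function names and the dictionary-based grouping are illustrative assumptions; indexes are 0-based for simplicity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def temporal_task(n: int, t: int) -> int:
    # Temporal view: all variables observed at the same timestamp t share one task,
    # so the task id is simply t and there are M = T tasks.
    return t

def spatial_task(n: int, t: int) -> int:
    # Spatial view: all timestamps of the same variable n share one task,
    # so the task id is simply n and there are M = N tasks.
    return n

def build_partition(N: int, T: int, task_fn) -> Dict[int, List[Tuple[int, int]]]:
    """Group every sample index (n, t) into disjoint tasks under task_fn."""
    partition = defaultdict(list)
    for n in range(N):
        for t in range(T):
            partition[task_fn(n, t)].append((n, t))
    return dict(partition)

# Example: 3 variables observed over 4 timestamps.
temporal_view = build_partition(N=3, T=4, task_fn=temporal_task)  # 4 tasks of 3 samples each
spatial_view = build_partition(N=3, T=4, task_fn=spatial_task)    # 3 tasks of 4 samples each
print(len(temporal_view), len(spatial_view))  # -> 4 3
```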
showing regular patterns, such identification can be accomplished given prior knowledge on regularity. Nonetheless, for the ones with irregular patterns, it would be cumbersome to identify eligible tasks.

Due to this limitation, we apply task-wise affine transformation from the spatial view. The implementation takes the following form:

    \tilde{Z}^{S}_{n,t} = Z_{n,t} \odot w^{S}_{n} + b^{S}_{n},    (3)

where w^{S}_{n} and b^{S}_{n} are learnable affine parameters.

Next, we introduce another thread of approaches which can address the cold-start problem.

4.2 Task-Wise Normalization
Task-wise normalization explicitly encodes global components into the representation space. The global component varies from task to task, and thus can be used to separate tasks. However, the observation of any sample is a mixture of the global component and the local component, which hinders the capture of the global component. To extract the global component, we start by applying task-wise normalization to eliminate the global component from the representation; then we combine the normalized representation with the original representation to obtain an augmented representation for each sample. The augmented representation space can manifest the difference in global components, and hence the resulting inter-task correlation is weaker than the original one.

Likewise, for a task partition scheme P and its associated mapping function P, task-wise normalization is performed as follows:

    \hat{Z}^{P}_{n,t} = \frac{Z_{n,t} - \mu^{P}_{m}}{\sigma^{P}_{m}},    (4)

where we let m denote P(n, t), the task to which the sample belongs,

    \mu^{P}_{m} = \mathbb{E}[Z_{i,j} \mid P(i,j) = m] = \frac{1}{|I_m|} \sum_{(i,j) \in I_m} Z_{i,j},    (5)

    \sigma^{P}_{m} = \sqrt{\mathbb{E}[(Z_{i,j} - \mu^{P}_{m})^2 \mid P(i,j) = m]}.    (6)

Instantiated from the temporal view, the normalization becomes

    \hat{Z}^{T}_{n,t} = \frac{Z_{n,t} - \mu^{T}_{t}}{\sigma^{T}_{t}},

where

    \mu^{T}_{t} = \frac{1}{N} \sum_{i=1}^{N} Z_{i,t},

    \sigma^{T}_{t} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Z_{i,t} - \mu^{T}_{t})^2 + \epsilon}.

For conciseness, we do not include the implementation from the spatial view.

Next, we explain why task-wise normalization works. As we indicate in Remark 3, certain cases of indistinguishability are caused by task-wise transformations that differ only in scaling factors. Task-wise normalization resolves this issue by converting the scaling transformation into a rotating transformation. Basically, we construct a rotation with a rotating angle which relies on the scaling factor. First, we rewrite the normalized representation from Eq. (4) in the following way:

    \hat{Z}^{P}_{n,t} = \frac{Z_{n,t} - \mu^{P}_{m}}{\sigma^{P}_{m}}    (7)
                      = \frac{L^{P}_{n,t} - \beta^{P}_{m}}{\gamma^{P}_{m}},    (8)

where Eq. (8) is deduced by substituting Eq. (1), Eq. (2), Eq. (5) and Eq. (6) into Eq. (7). Then, combining the original view, the spatial view and the temporal view, we map the original representation to an augmented representation as follows:

    Z_{n,t} \rightarrow [Z_{n,t}; \hat{Z}^{S}_{n,t}; \hat{Z}^{T}_{n,t}] = [Z_{n,t}; \frac{L^{S}_{n,t} - \beta^{S}_{m_1}}{\gamma^{S}_{m_1}}; \frac{L^{T}_{n,t} - \beta^{T}_{m_2}}{\gamma^{T}_{m_2}}],

where m_1 = S(n, t) and m_2 = T(n, t). At this step, from the view of geometric transformation, each sample point is rotated from its original position by an angle depending on G^{P}_{m} and \gamma^{P}_{m}, and is then translated by \beta^{P}_{m} / \gamma^{P}_{m}. Thus, the differences in G^{P}_{m}, in \beta^{P}_{m} and in \gamma^{P}_{m} can all be manifested in the new space. It is noteworthy that the new space also maintains the correlation among the G^{P}_{m} belonging to different tasks, which can facilitate the learning process.

4.3 Wavenet
We illustrate the architecture of our work in Fig. 7. Some key variables with their shapes are labeled at their corresponding positions along the computation path. Generally, our framework is instantiated as a structure like Wavenet [12]. The causal convolution formula can be easily generalized for a multi-dimension signal, but we omit its general form here for brevity. Moreover, padding (zero or replicate) with size k - 1 is appended to the left tail of the signal to ensure length consistency. We can stack multiple causal convolution layers to obtain a larger receptive field for each element.

One shortcoming of using causal convolution is that either the kernel size or the number of layers increases in a linear manner with the range of the receptive field, and the linear relationship causes an explosion of parameters when modeling
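As a complement to the formulas in Sections 4.1 and 4.2, the following is a compact PyTorch sketch of the two task-wise operations acting on a hidden representation Z of shape (N, T, d): spatial-view affine transformation with per-variable parameters, and temporal/spatial normalization that standardizes each task's group of samples. This is a minimal re-implementation written for illustration only; it follows Eqs. (3)-(6) but is not copied from the authors' released code, and the class and argument names are our own. The batch dimension is omitted for clarity.

```python
import torch
import torch.nn as nn

class SpatialAffine(nn.Module):
    """Task-wise affine transformation from the spatial view (Eq. (3)):
    every variable n owns its learnable scale w_n and shift b_n."""
    def __init__(self, num_vars: int, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_vars, 1, channels))
        self.b = nn.Parameter(torch.zeros(num_vars, 1, channels))

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (N, T, d)
        return z * self.w + self.b

class TaskWiseNorm(nn.Module):
    """Task-wise normalization (Eqs. (4)-(6)).
    dim=0 normalizes over variables (temporal view: one task per timestamp);
    dim=1 normalizes over timestamps (spatial view: one task per variable)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.dim, self.eps = dim, eps

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (N, T, d)
        mu = z.mean(dim=self.dim, keepdim=True)
        sigma = torch.sqrt(z.var(dim=self.dim, unbiased=False, keepdim=True) + self.eps)
        return (z - mu) / sigma

# Augmented representation: concatenate the original view with both normalized
# views along the channel axis before the next (causal-convolution) layer.
N, T, d = 32, 16, 16
z = torch.randn(N, T, d)
z_aug = torch.cat([z, TaskWiseNorm(dim=1)(z), TaskWiseNorm(dim=0)(z)], dim=-1)  # (N, T, 3d)
z_aff = SpatialAffine(num_vars=N, channels=d)(z)                                # (N, T, d)
print(z_aug.shape, z_aff.shape)
```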
5 EVALUATION
In this section, we describe the extensive experiments on three common datasets to validate the effectiveness of the MVMT framework from different aspects.

TABLE 2
Dataset Statistics

The time period spans from March 1st, 2013 to February 28th, 2017.

5.1.2 Network Setting
The batch size is 8, and the input length of each sample is 16. For the Wavenet backbone, the layer number is set to 4, the kernel size of each DCC component is 2, and the associated dilation rate is 2^i, where i is the index of the layer (counting from 0). These settings collectively enable the output from Wavenet to perceive 16 input steps. The number of hidden channels d_z in each DCC is 16. We apply zero-padding on the left tail of the input so that the length of the output from each DCC is 16 as well. The learning rate of the Adam optimizer is 0.00016. Code available at: https://github.com/JLDeng/ST-Norm.git

5.1.3 Evaluation Metrics
We validate our model using root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE).

We conduct cross validation on each dataset to get a comprehensive evaluation of our model against competing ones. In particular, we split the time series data into 10 chunks of approximately equal length along the temporal axis, and then create 5 groups of training/validation/test sets from these 10 chunks. For the first group, the first 4 chunks construct the training set, while the 5th and the 6th chunks respectively construct the validation and test sets; for the second group, the first 5 chunks construct the training set, while the 6th and the 7th chunks respectively construct the validation and test sets, and so on. We repeat the experiment 10 times for each model on each group of training/validation/test sets and report the average performance.

5.2 Baseline Models
- MTGNN [26]. MTGNN constructs inter-variate relationships by introducing a graph-learning module. Specifically, the graph-learning module connects each hub node with its top-k nearest neighbors in a defined metric space. MTGNN's backbone architecture for temporal modeling is Wavenet.
- Graph Wavenet [25]. The architecture of Graph Wavenet is similar to MTGNN. The major difference is that the former derives a soft graph where each pair of nodes has a continuous probability of being connected.
- AGCRN [55]. AGCRN is also equipped with a graph-learning module to establish inter-variate relationships. Furthermore, it uses a personalized RNN to model the temporal relationship for each time series.
- Transformer [11]. This model captures the long-term dependencies in time series data by using an attention mechanism, where the keys and queries are yielded by causal convolution over the local context to model segment-level correlation.
- LSTNet [21]. There are two components in LSTNet: one is a conventional autoregressive model, and the other is an LSTM with an additional skip connection over the temporal dimension.
- TCN [13]. The architecture of TCN is similar to Wavenet, except that the nonlinear transformation in each residual block is made up of two rectified linear units (ReLU).

We also test the performance of TCN and Transformer incorporating STN, where STN is similarly applied before the causal convolution operation in each layer. We do not compare our method to linear models such as ARIMA, because the involved baseline models have been shown to be superior to the linear models in the original works [11], [21], [25], [26]. For all baselines, we conduct a grid search on the number of hidden units over {4, 8, 16, 32, 64}, and select the best models according to the validation results.

5.3 Experimental Results
The experimental results on the five datasets are reported in Table 3. The improvements achieved by Wavenet + STN over the best benchmarks are recorded in the last row of each sub-table.

TABLE 3
Performance for Multi-Step Prediction on 5 Datasets

It is obvious that Wavenet + STN achieves SOTA results over almost all horizons on the BikeNYC, PeMSD7 and electricity data. The reason for this is that we refine the high-frequency components from both the temporal view and the spatial view, which are generally overlooked by baseline models. Next, we reveal the cause of Wavenet + STN's under-performance on the electricity dataset over the first horizon with respect to MAPE. Electricity data follows a long-tailed distribution; a certain proportion of the observations exceeds a relatively high level. Recall that the optimization involves minimizing the mean squared error, which means that more weight is placed on large errors. Moreover, every sample is treated equivalently in the estimation of the global statistics. Therefore, the model can fit long-tailed samples better, but at the cost of degrading the fit on normal samples.
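The trade-off described above can be seen with a toy calculation (illustrative numbers only, not taken from the electricity dataset): under a squared-error objective, an error on a large-magnitude sample dominates the loss, while the same relative error on a small-magnitude sample barely registers, even though it hurts MAPE much more.

```python
import numpy as np

def rmse(y, p):  return float(np.sqrt(np.mean((y - p) ** 2)))
def mae(y, p):   return float(np.mean(np.abs(y - p)))
def mape(y, p):  return float(np.mean(np.abs((y - p) / y)) * 100)

# Hypothetical long-tailed ground truth: many small readings plus one large one.
y = np.array([2.0, 3.0, 2.5, 3.5, 200.0])

# Forecast A fits the large sample tightly but is sloppy on the small ones.
pred_a = np.array([3.0, 4.0, 3.5, 4.5, 201.0])
# Forecast B fits the small samples tightly but misses the large one by more.
pred_b = np.array([2.1, 3.1, 2.6, 3.6, 190.0])

for name, p in [("A (fits the tail)", pred_a), ("B (fits the bulk)", pred_b)]:
    print(name, "RMSE=%.2f  MAE=%.2f  MAPE=%.1f%%" % (rmse(y, p), mae(y, p), mape(y, p)))
# A wins on RMSE (the squared loss rewards fitting the tail), while B wins on MAPE.
```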
We further investigate the improvement in terms of an individual variable in MTS. To prove that STN captures the difference in scaling factors among various variables, we characterize a variable by the mean of its historical observations. For succinctness, we calculate the variable-wise reduction in RMSE obtained by Wavenet + STN compared to AGCRN, and then plot the reduction against the scale of each variable in Fig. 9. We can see that for BikeNYC and electricity, the improvement becomes more prominent as the scale grows, which meets our previous expectation. For PeMSD7, the improvement is less significant, as all the variables in this dataset vary in the same range, which signifies that they have approximately the same scaling factor.

Figs. 10 and 11 illustrate the efficiency of our model. Fig. 11 shows that with the additional STN module, the converging speeds of the models are accelerated by a large margin, faster than nearly all the baseline models. Fig. 10 indicates that the running time of training our model on the same volume of data is also competitive with the baselines.
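The per-variable analysis behind Fig. 9 can be reproduced along the following lines. This is a sketch under the assumption that per-variable test targets and the two models' forecasts are already available as arrays; the array names and the synthetic stand-in data are ours, not the paper's.

```python
import numpy as np
import matplotlib.pyplot as plt

def per_variable_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """y_true, y_pred: (num_vars, num_timesteps). Returns the RMSE of each variable."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1))

# Hypothetical stand-ins for test-set targets and the two models' forecasts.
rng = np.random.default_rng(0)
y_true = rng.gamma(shape=2.0, scale=20.0, size=(50, 500))
pred_stn = y_true + rng.normal(0.0, 3.0, y_true.shape)    # Wavenet + STN (stand-in)
pred_agcrn = y_true + rng.normal(0.0, 4.0, y_true.shape)  # AGCRN (stand-in)

scale = y_true.mean(axis=1)  # variable scale = mean of its historical observations
reduction = per_variable_rmse(y_true, pred_agcrn) - per_variable_rmse(y_true, pred_stn)

plt.scatter(scale, reduction)
plt.xlabel("variable scale (mean of observations)")
plt.ylabel("RMSE reduction of Wavenet + STN over AGCRN")
plt.tight_layout()
plt.show()
```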
representation is deemed to show the region-wise pattern, which plays the role of a local component from this view.

We select three representative regions at specified times to reflect what the operations extract from the data. We examine the intermediate representations output from the two views in the top residual block. As a comparison, we also inspect their associated input representations, each of which is a concatenation of raw measurements. Next, we discuss the outcomes from these two views separately in detail.

Temporal View. In Fig. 16, we display the demand evolution during a given period over the three investigated regions. We can observe that the three regions have similar evolution patterns, especially regions B and C. The representations concatenated from the original measurements are plotted in Fig. 17a, and the intermediate representations output from the temporal view are plotted in Fig. 17b. For the sake of visualization, we obtain two-dimensional embeddings of these representations via t-Distributed Stochastic Neighbor Embedding (t-SNE). We can observe that the representations are completely rearranged in accordance with the regional identity. This observation demonstrates that the local components are roughly invariant within the group belonging to the same region. This coincides with our understanding that some regional attributes, such as population and functionality, are stable over time.

Spatial View. To reflect the characteristics of the representations output from the spatial view, we take another region D into consideration, as shown in Fig. 18. Noticeably, the magnitude of the demand over region D is substantially

TABLE 4
Ablation Study

Fig. 16. Example data selected to study the output of intermediate representations from the spatial view.
Fig. 20. Four groups of samples selected to inspect temporal distinguishability.
Fig. 21. The percentages of variance explained by the first seven components.
Fig. 23. Two groups of samples selected to inspect spatial distinguishability.
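For the temporal-view case study above (Fig. 17), a visualization along the following lines embeds the per-sample intermediate representations in two dimensions with t-SNE and colors them by region, which is how the regrouping by regional identity becomes visible. This is a sketch assuming scikit-learn and matplotlib; the variable names and the synthetic placeholder representations are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one representation vector per (region, timestamp) sample.
# reps: (num_samples, hidden_dim); region_ids: (num_samples,) with values {0: A, 1: B, 2: C}.
rng = np.random.default_rng(0)
reps = rng.normal(size=(300, 16)) + np.repeat(np.arange(3), 100)[:, None]
region_ids = np.repeat(np.arange(3), 100)

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(reps)

for rid, label in zip(range(3), ["region A", "region B", "region C"]):
    mask = region_ids == rid
    plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=label)
plt.legend()
plt.title("t-SNE of intermediate representations, colored by region")
plt.show()
```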
over the others is extremely small. According to this finding, we visualize in Fig. 22 the projections on the first four components, which explain almost all the variance over the representations. It is obvious that the enhanced model produces representations showing the strongest intra-task correlation and the weakest inter-task correlation of the three models. Hence, the distinguishability from the temporal view is improved by applying the proposed operations.

Spatial View. In the second case, we examine the distinguishability from the spatial view, concentrating on a particular time. We roughly divide the regions into two groups based on their evolving patterns, as shown in Fig. 23. Basically, bike demand over the first group of regions experiences a v-shaped variation during the period being visualized and will continue rising in the forthcoming time steps; on the contrary, the second group of regions remains approximately constant, e.g., at 0. We clarify that the assignment is not rigorous and unique, but such fuzziness will not affect the conclusion we can draw.

Similar to the first case, we perform PCA, plot the percentage of variance explained by each of the selected components in Fig. 24, and visualize in Fig. 25 the projections on the first two components, which explain most of the variance. At this step, it is apparent that the enhanced model yields the most distinguishable representation space from the spatial view. Moreover, we recolor the sample points based on the scale of the observations, and the visualization results show that the representations also encode the information regarding scale.

6 CONCLUSION AND FUTURE WORK
In this work, we develop a novel multi-view multi-task learning framework for multi-variate time series forecasting. In our design, this framework consists of the original view, the spatial view and the temporal view, but it is flexible enough to account for more views depending on the specific application. Forecasting from each view is accomplished in a multi-task manner via two task-dependent operations, namely task-wise normalization and task-wise affine transformation. Diverse experiments were performed to quantitatively show the effectiveness, the efficiency and the robustness of the framework. Furthermore, we conducted multiple case studies on the representations produced at different stages of the forecasting procedure. The outcome qualitatively demonstrates that this framework strengthens the intra-task correlation while weakening the inter-task correlation.

It is noticeable that the framework is modular, flexible and extensible. In practice, we start by defining a variety of task partition schemes from different views, beyond the spatial and the temporal view, based on the structure underlying the specified data. Then, we instantiate a collection of task-wise affine transformation and normalization modules to accommodate these schemes. For example, if the set of variables is only partially correlated, we can identify the correlated subset and define a scheme to distinguish this subset. Like crafting advantageous features, formulating views and tasks beneficial for forecasting relies on prior knowledge, i.e., the practitioners' insight into the data.

When dealing with variable-length sequences, the prevalent forecasting methods pad or truncate them to the same length at the pre-processing step, which unavoidably introduces noise or abandons valuable samples. In contrast, the MVMT framework can directly process variable-length sequences, as the mean and standard deviation computations are unaffected by differing sample sizes.

There are two limitations of this framework. First, whether trialing different schemes can be automated to save effort in model architecture design is an open problem. Second, it is worth exploring how to apply this framework to streaming data, where the data distribution shifts over time. In this new scenario, the current implementation of the MVMT framework will not be suitable, or will at least have considerable room for improvement, since the parameters of the spatial-view affine transformation cannot adapt to the unknown data distribution.

REFERENCES
[1] R. Jiang et al., "DeepCrowd: A deep model for large-scale citywide crowd density and flow prediction," IEEE Trans. Knowl. Data Eng., early access, May 03, 2021, doi: 10.1109/TKDE.2021.3077056.
[2] J. Deng, X. Chen, Z. Fan, R. Jiang, X. Song, and I. W. Tsang, "The pulse of urban transport: Exploring the co-evolving pattern for spatio-temporal forecasting," ACM Trans. Knowl. Discov. Data, vol. 15, no. 6, pp. 1-25, 2021.
[3] R. Jiang et al., "DeepUrbanEvent: A system for predicting citywide crowd dynamics at big events," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 2114-2122.
[4] X. Geng et al., "Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 3656-3663.
[5] Z. Pan et al., "Spatio-temporal meta learning for urban traffic prediction," IEEE Trans. Knowl. Data Eng., vol. 34, no. 3, pp. 1462-1476, Mar. 2022.
[6] J. Sun, J. Zhang, Q. Li, X. Yi, Y. Liang, and Y. Zheng, "Predicting citywide crowd flows in irregular regions using multi-view graph convolutional networks," IEEE Trans. Knowl. Data Eng., vol. 34, no. 5, pp. 2348-2359, May 2022.
[7] Y. Gong, Z. Li, J. Zhang, W. Liu, and Y. Zheng, "Online spatio-temporal crowd flow distribution prediction for complex metro system," IEEE Trans. Knowl. Data Eng., vol. 34, no. 2, pp. 865-880, Feb. 2022.
[8] D. Ding, M. Zhang, X. Pan, M. Yang, and X. He, "Modeling extreme events in time series prediction," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 1114-1122.
[9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
[11] S. Li et al., "Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5243-5253.
[12] A. V. D. Oord et al., "WaveNet: A generative model for raw audio," 2016, arXiv:1609.03499.
[13] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," 2018, arXiv:1803.01271.
[14] C. Christoudias, R. Urtasun, and T. Darrell, "Multi-view learning in the presence of view disagreement," 2012, arXiv:1206.3242.
[15] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," 2013, arXiv:1304.5634.
[16] J. Deng, X. Chen, R. Jiang, X. Song, and I. W. Tsang, "ST-Norm: Spatial and temporal normalization for multi-variate time series forecasting," in Proc. 27th ACM SIGKDD Conf. Knowl. Discov. Data Mining, 2021, pp. 269-278.
[17] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," 2017, arXiv:1704.02971.
[18] J. Zhao et al., "Do RNN and LSTM have long memory?," 2020, arXiv:2006.03860.
[19] Y.-Y. Chang, F.-Y. Sun, Y.-H. Wu, and S.-D. Lin, "A memory-network based solution for multivariate time-series forecasting," 2018, arXiv:1809.02105.
[20] X. Tang, H. Yao, Y. Sun, C. C. Aggarwal, P. Mitra, and S. Wang, "Joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values," in Proc. Conf. Assoc. Advance. Artif. Intell., 2020, pp. 5956-5963.
[21] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, "Modeling long- and short-term temporal patterns with deep neural networks," in Proc. 41st Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2018, pp. 95-104.
[22] Q. Tan et al., "DATA-GRU: Dual-attention time-aware gated recurrent unit for irregular multivariate time series," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 930-937.
[23] C. Fan et al., "Multi-horizon time series forecasting with temporal attention learning," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 2527-2535.
[24] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998-6008.
[25] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, "Graph WaveNet for deep spatial-temporal graph modeling," in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 1907-1913.
[26] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, "Connecting the dots: Multivariate time series forecasting with graph neural networks," 2020, arXiv:2005.11650.
[27] R. Sen, H.-F. Yu, and I. S. Dhillon, "Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 4837-4846.
[28] H. Zhou et al., "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 11106-11115.
[29] J. Cheng, K. Huang, and Z. Zheng, "Towards better forecasting by fusing near and distant future visions," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 3593-3600.
[30] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski, "DeepAR: Probabilistic forecasting with autoregressive recurrent networks," Int. J. Forecasting, vol. 36, pp. 1181-1191, 2019.
[31] H.-F. Yu, N. Rao, and I. S. Dhillon, "Temporal regularized matrix factorization for high-dimensional time series prediction," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 847-855.
[32] R. Yu, S. Zheng, A. Anandkumar, and Y. Yue, "Long-term forecasting using higher order tensor RNNs," 2017, arXiv:1711.00073.
[33] Y. Liang, S. Ke, J. Zhang, X. Yi, and Y. Zheng, "GeoMAN: Multi-level attention networks for geo-sensory time series prediction," in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 3428-3434.
[34] G. Spadon, S. Hong, B. Brandoli, S. Matwin, J. F. Rodrigues-Jr, and J. Sun, "Pay attention to evolution: Time series forecasting with deep graph-evolution learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5368-5384, Sep. 2022.
[35] D. Cao et al., "Spectral temporal graph neural network for multivariate time-series forecasting," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 17766-17778.
[36] S. Huang, D. Wang, X. Wu, and A. Tang, "DSANet: Dual self-attention network for multivariate time series forecasting," in Proc. 28th ACM Int. Conf. Inf. Knowl. Manage., 2019, pp. 2129-2132.
[37] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, "Deep state space models for time series forecasting," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7785-7794.
[38] Y. Wang, A. Smola, D. C. Maddix, J. Gasthaus, D. Foster, and T. Januschowski, "Deep factors for forecasting," 2019, arXiv:1905.12417.
[39] Y. Wang, H. Yin, H. Chen, T. Wo, J. Xu, and K. Zheng, "Origin-destination matrix prediction via graph convolution: A new perspective of passenger demand modeling," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 1227-1235.
[40] V. Le Guen and N. Thome, "Shape and time distortion loss for training deep time series forecasting models," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 4189-4201.
[41] T. Guo, T. Lin, and N. Antulov-Fantulin, "Exploring interpretable LSTM neural networks over multi-variable data," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2494-2504.
[42] M. Binkowski, G. Marti, and P. Donnat, "Autoregressive convolutional neural networks for asynchronous time series," in Proc. Int. Conf. Mach. Learn., 2018, pp. 580-589.
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167.
[44] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," 2016, arXiv:1607.08022.
[45] Y. Wu and K. He, "Group normalization," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3-19.
[46] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[47] B. Li, F. Wu, K. Q. Weinberger, and S. Belongie, "Positional normalization," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1622-1634.
[48] X. Lian and J. Liu, "Revisit batch normalization: New understanding and refinement via composition optimization," in Proc. 22nd Int. Conf. Artif. Intell. Statist., 2019, pp. 3254-3263.
[49] N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger, "Understanding batch normalization," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7694-7705.
[50] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, "How does batch normalization help optimization?," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2483-2493.
[51] H. Daneshmand, J. Kohler, F. Bach, T. Hofmann, and A. Lucchi, "Batch normalization provably avoids rank collapse for randomly initialised deep networks," 2020, arXiv:2003.01652.
[52] L. Huang, L. Zhao, Y. Zhou, F. Zhu, L. Liu, and L. Shao, "An investigation into the stochasticity of batch whitening," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 6439-6448.
[53] A. Kessy, A. Lewin, and K. Strimmer, "Optimal whitening and decorrelation," Amer. Statistician, vol. 72, no. 4, pp. 309-314, 2018.
[54] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[55] L. Bai, L. Yao, C. Li, X. Wang, and C. Wang, "Adaptive graph convolutional recurrent network for traffic forecasting," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 17804-17815.

Jinliang Deng received the BS degree in computer science from Peking University in 2017, and the MS degree in computer science from The Hong Kong University of Science and Technology in 2019. He is currently working toward the PhD degree with the Australian Artificial Intelligence Institute, University of Technology Sydney, and the Department of Computer Science and Engineering, Southern University of Science and Technology. His research interests include time series forecasting, urban computing and deep learning.

Xiusi Chen received the BS and MS degrees in computer science from Peking University, in 2015 and 2018, respectively. He is currently working toward the PhD degree with the Department of Computer Science, University of California, Los Angeles. His research interests include natural language processing, knowledge graphs, neural machine reasoning and reinforcement learning.

Renhe Jiang received the BS degree in software engineering from the Dalian University of Technology, China, in 2012, the MS degree in information science from Nagoya University, Japan, in 2015, and the PhD degree in civil engineering from The University of Tokyo, Japan, in 2019. Since 2019, he has been an assistant professor with the Information Technology Center, The University of Tokyo. His research interests include ubiquitous computing, deep learning, and spatio-temporal data analysis.
Xuan Song received the PhD degree in signal and information processing from Peking University in 2010. In 2017, he was selected as an Excellent Young Researcher of Japan MEXT. He has served as an associate editor, guest editor, area chair, and senior program committee member for many prestigious journals and top-tier conferences, such as IMWUT, IEEE Transactions on Multimedia, WWW Journal, ACM Transactions on Intelligent Systems and Technology, IEEE Transactions on Knowledge and Data Engineering, Big Data Journal, UbiComp, IJCAI, AAAI, ICCV, CVPR, etc. His main research interests are AI and its related research areas, such as data mining and urban computing. To date, he has published more than 100 technical publications in journals, book chapters, and international conference proceedings, including more than 60 high-impact papers in top-tier publications for computer science. His research has been featured in many Chinese, Japanese and international venues, including the United Nations, the Discovery Channel, and Fast Company Magazine. He received the Honorable Mention Award at UbiComp 2015.

Ivor W. Tsang (Fellow, IEEE) is an ARC Future Fellow and professor of Artificial Intelligence with the University of Technology Sydney (UTS), Australia. He is also the research director of the Australian Artificial Intelligence Institute. His research interests include transfer learning, generative models, and Big Data analytics for data with extremely high dimensions. In 2013, he received the prestigious ARC Future Fellowship for his research on machine learning for Big Data. In 2019, his JMLR paper titled "Towards ultrahigh dimensional feature selection for Big Data" received the International Consortium of Chinese Mathematicians Best Paper Award. In 2020, he was recognized as the AI 2000 AAAI/IJCAI Most Influential Scholar in Australia for his outstanding contributions to the field of AAAI/IJCAI between 2009 and 2019. His research on transfer learning was granted the Best Student Paper Award at CVPR 2010 and the 2014 IEEE Transactions on Multimedia Prize Paper Award. In addition, he received the IEEE TNN Outstanding 2004 Paper Award in 2007. He serves as a senior area chair/area chair for NeurIPS, ICML, AISTATS, AAAI and IJCAI, and on the editorial boards of the Journal of Machine Learning Research, MLJ, and IEEE Transactions on Pattern Analysis and Machine Intelligence.