
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 8, AUGUST 2023

A Multi-View Multi-Task Learning Framework
for Multi-Variate Time Series Forecasting

Jinliang Deng, Xiusi Chen, Renhe Jiang, Xuan Song, and Ivor W. Tsang, Fellow, IEEE

Abstract—Multi-variate time series (MTS) data is a ubiquitous class of data abstraction in the real world. Any instance of MTS is generated from a hybrid dynamical system, and its specific dynamics are usually unknown. The hybrid nature of such a dynamical system results from complex external attributes, such as geographic location and time of day, each of which can be categorized as either a spatial attribute or a temporal attribute. Therefore, there are two fundamental views from which MTS data can be analyzed, namely the spatial view and the temporal view. Moreover, from each of these two views, we can partition the set of MTS data samples into disjoint forecasting tasks in accordance with their associated attribute values. Samples of the same task then manifest similar forthcoming patterns, which are less sophisticated to predict than in the original single-view setting. In light of this insight, we propose a novel multi-view multi-task (MVMT) learning framework for MTS forecasting. Rather than being explicitly presented, in most scenarios MVMT information is deeply concealed in the MTS data, which severely hinders the model from capturing it naturally. To this end, we develop two kinds of basic operations, namely task-wise affine transformation and task-wise normalization. Applying these two operations with prior knowledge of the spatial and temporal views allows the model to adaptively extract MVMT information while predicting. Extensive experiments on three datasets are conducted to illustrate that canonical architectures can be greatly enhanced by the MVMT learning framework in terms of both effectiveness and efficiency. In addition, we design rich case studies to reveal the properties of the representations produced at different phases of the entire prediction procedure.

Index Terms—Time series forecasting, deep learning, normalization, multi-view multi-task learning

1 INTRODUCTION

Time series forecasting is a significant problem in many industrial and business applications [1], [2], [3]. For instance, a public transport operator can allocate sufficient capacity to mitigate the queuing time in a region in advance if it has the means to foresee that the region will suffer from a supply shortage in the next couple of hours [4], [5], [6], [7]. As another example, an investor can avoid economic loss with the assistance of a robo-advisor which is able to predict a potential market crash [8]. Due to the complex and continuous fluctuation of impacting factors, real-world time series tend to be extraordinarily non-stationary, that is, to exhibit diverse dynamics. For instance, traffic volume is largely affected by the road's condition and location, and by the current time and weather conditions. In the retail sector, the current season, price and brand are determinants of the sales of merchandise. These diverse dynamics impose an enormous challenge on time series forecasting. In this work, we study multi-variate time series forecasting, where multiple variables evolve with time.

Traditional time series forecasting algorithms, such as ARIMA and state space models (SSMs), provide a principled framework for modeling and learning time series patterns. However, these algorithms have a rigorous requirement for the stationarity of a time series, and they suffer from severe limitations in practical use if most of the impacting factors are unavailable. Recent studies show that, thanks to the nonlinearity of activation functions, deep learning models possess the capacity to handle complex dynamics, theoretically of any form, even in the absence of additional impacting factors [9]. Therefore, the nonstationarity issue can be addressed to some degree. Common neural architectures applied to time series data include recurrent neural networks (RNNs), long short-term memory (LSTM) [10], Transformer [11], Wavenet [12] and temporal convolution networks (TCNs) [13].

In this article, we conjecture that MTS forecasting can essentially be treated as a multi-view multi-task learning problem. To the best of our knowledge, we are the first to formulate MTS in this way. There are typically two additional views in the MTS problem apart from the original spatial-temporal view, namely the temporal view and the spatial view. From each of these two views, we can divide the forecasting samples into different tasks based on certain criteria.

Jinliang Deng is with the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China, and also with the Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney 2007, Australia. E-mail: [email protected].
Xiusi Chen is with the University of California, Los Angeles, CA 90095 USA. E-mail: [email protected].
Renhe Jiang is with the Center for Spatial Information Science, University of Tokyo, Tokyo 113-8654, Japan. E-mail: [email protected].
Xuan Song is with the SUSTech-UTokyo Joint Research Center on Super Smart City, Department of Computer Science and Engineering, and also with the Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China. E-mail: [email protected].
Ivor W. Tsang is with the Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia. E-mail: [email protected].

Manuscript received 1 September 2021; revised 11 September 2022; accepted 30 October 2022. Date of publication 2 November 2022; date of current version 21 June 2023.
This work was supported in part by ARC under Grants DP180100106 and DP200101328, in part by the National Key Research and Development Project of China under Grant 2021YFB1714400, and in part by Guangdong Provincial Key Laboratory under Grant 2020B121201001.
(Corresponding authors: Xuan Song and Ivor W. Tsang.)
Recommended for acceptance by E. Chen.
Digital Object Identifier no. 10.1109/TKDE.2022.3218803


Fig. 1. NYC shared bike demand.

Fig. 2. Three equivalent paradigms for multi-task learning.

Fig. 3. (a) Desirable feature space. (b), (c) Undesirable feature spaces, where (b) has a strong inter-task correlation and (c) has a weak intra-task correlation.

We take shared bike demand as a concrete example, and display the demand data collected from three regions over a five-day period in Fig. 1. In this example, from the temporal view, forecasting at a particular time over all the regions can be grouped into a task; from the spatial view, forecasting at all times over a particular region can be grouped into a task. Herein, the task partition scheme follows the principle that data points sampled from the same task are impacted by a common external factor, which makes them display common patterns. By handling each task with an exclusive predictor, the prediction complexity can be largely reduced in contrast with using a single predictor for all tasks. Sometimes, if the diversity of tasks suitably coincides with the diversity of dynamics, even a linear regression model has sufficient ability to undertake the individual tasks.

Distinct from a majority of previous multi-view learning problems, whose objectives are to identify the shared information among multiple views [14], [15], our formulation resorts to the supplementary information presented by the additional views. In particular, the temporal view is advantageous for capturing abrupt changes, e.g., due to weather conditions, while the spatial view is beneficial for capturing local patterns which are stable over time. Therefore, these two views can reinforce forecasting from different aspects.

To achieve multi-task learning given any view, the key is to produce a feature space manifesting inter-task weak correlation and intra-task strong correlation. To make this idea more comprehensible, we start by displaying the canonical paradigm of multi-task learning in Fig. 2a together with its two equivalent derivatives in Figs. 2b and 2c. In Fig. 2b, we create an augmented feature space with three times as many dimensions as the original one. The augmented feature space is equally partitioned into three subspaces, where each subspace is associated with one task. Given a sample from any task, we let the corresponding subspace of its belonging task accommodate its features, while the other two subspaces are padded with 0. It is easy to verify the equivalence between Figs. 2a and 2b. Next, let us have a closer look at Fig. 2b. An immediate judgement can be made from this formulation that samples from different tasks are orthogonal to each other in the augmented feature space, and samples from the same task maintain their relative positions as in the original feature space. More generally, even if orthogonality is not rigorously satisfied, each task can be captured in a more individual way provided that the correlations between different tasks diminish. Going a step further, we can deduce that, given the condition that the inter-task weak correlation and the intra-task strong correlation are manifested in the feature space, the predictor will automatically differentiate the task identity of every given sample. Therefore, all we need is an augmented feature space encoding these two types of relationships, as shown in Fig. 2c. In addition, we display two kinds of undesirable geometries in Fig. 3 to highlight the key properties of the geometry which fits the multi-task learning paradigm.

However, only using raw time series data as input features does not yield the inter-task weak correlation and the intra-task strong correlation from either the spatial view or the temporal view. Although some tasks are inherently separated (e.g., 9am versus 6pm) based on the sequential pattern, most tasks are indistinguishable in the feature space. Following our previous work [16], there are two types of indistinguishability due to the strong inter-task correlation from the spatial view and the temporal view: (1) Spatial indistinguishability means that the dynamics yielded by different variables are not adequately discernible. For instance, looking at the three regions in Fig. 1, we consider their dynamics measured between 8pm and 9pm on different days. In Fig. 4a, we plot the measurement at 8pm versus the measurement at 9pm over the three regions, where the data points are colored in accordance with their regional identities. Different clusters of dynamics are supposed to be distinguishable. However, the cluster-wise relationships (indicated by the direction of a straight line fitting the intra-cluster data points) are highly correlated, which signifies the inter-task strong correlation. (2) Temporal indistinguishability means that dynamics measured at specific times are not substantially discrete. In Fig. 4b, we only plot the measurement pairs of region A, and separate them based on weekday or weekend. It is obvious that these two clusters also have a strong correlation.
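The Fig. 2b construction is easy to state in code. Below is a toy sketch of the block-wise feature augmentation (our own illustration, not from the paper's released code; all names are made up for the example): each sample's features are copied into the subspace of its task, and the other subspaces are zero-padded, so samples from different tasks become orthogonal.

```python
import numpy as np

def augment(x: np.ndarray, task: int, num_tasks: int) -> np.ndarray:
    """Place features x in the subspace of `task`; zero-pad the others."""
    d = x.shape[0]
    out = np.zeros(num_tasks * d)
    out[task * d:(task + 1) * d] = x
    return out

a = augment(np.array([1.0, 2.0]), task=0, num_tasks=3)  # [1, 2, 0, 0, 0, 0]
b = augment(np.array([3.0, 1.0]), task=2, num_tasks=3)  # [0, 0, 0, 0, 3, 1]
print(a @ b)  # 0.0 -- samples from different tasks are orthogonal
```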



Fig. 4. (a) Spatial indistinguishability; (b) Temporal indistinguishability.

To address these issues, we propose two fundamental operations, task-wise affine transformation and task-wise normalization, each of which can weaken the inter-task correlation while maintaining the intra-task correlation. Task-wise affine transformation transforms the representations of each sample with task-specific affine parameters; hence, task-specific characteristics can be encoded into the feature space. The limitation of this operation is that it can only be applied on the spatial view, whose task partition is static over time, or in other words, whose set of tasks does not change with time. When it comes to the temporal view with a dynamic task partition, the model cannot pre-learn the affine parameters for tasks appearing at a future time. To complement task-wise affine transformation, task-wise normalization is proposed, which can be applied not only on the spatial view but also on the temporal view. Basically, it performs normalization over the entire group of samples divided into the same task, which also results in representations with task-specific characteristics. In our study, we realize task-wise affine transformation and normalization from the spatial view and the temporal view respectively, giving rise to a compound operation known as ST-Norm, abbreviated as STN.

We summarize our contributions as follows:

- We propose a novel MVMT learning framework for time series forecasting. We account for three views in this framework, namely the original view, the spatial view and the temporal view. From each of the spatial view and the temporal view, learning is performed in a multi-task manner where each task is associated with a variable or a timestamp.
- We develop task-wise affine transformation and normalization to enable the feature space to be encoded with MVMT information. Either of these two operations can weaken the inter-task correlations while keeping the intra-task correlations in the representation space, which emulates the explicit partitioning of data samples as in the normal setting of multi-task learning.
- We propose a compound operation, ST-Norm, consisting of different realizations of task-wise affine transformation and normalization, respectively from the spatial view and the temporal view.
- We conduct extensive experiments to quantitatively and qualitatively validate the effectiveness of the MVMT learning framework.

2 RELATED WORK

2.1 Time Series Forecasting

Time series forecasting has been studied for decades. Traditional methods, such as ARIMA, can only learn the linear relationship among different timesteps, which is an inherent deficiency when fitting the many real-world time series that are highly nonlinear. With the power of deep learning models, a large volume of recent work in this area has achieved impressive performance. For instance, [17] adopts LSTM to capture the nonlinear dynamics and long-term dependencies in time series data. However, the memorizing capacity of LSTM is still restricted, as pointed out by [18]. To resolve this issue, [19], [20] create an external memory to explicitly store representative patterns that are frequently observed in the history, which can effectively guide the forecasting when similar patterns occur. [21] makes use of a skip connection to enable information to be transmitted from the distant history. The attention mechanism is another option for dealing with the vanishing memory problem [22], [23]. Of these methods, Transformer is a representative architecture which consists of only attention operations [24]. To overcome the computation bottleneck of the canonical Transformer, [11] proposes a novel mechanism that periodically skips some timesteps when performing attention. As far as we know, Wavenet [12], TCN [13] and Transformer [24] are currently the superior choices for modeling long-term time series data [25], [26], [27], [28]. In contrast to the above approaches, which focus on enriching the semantics of the current state with information extracted from the history, [29] attempted to improve the emissions from the current state into the future; they formulated the forecasts at different forthcoming time steps as separate tasks and modeled the interplay between them.

To tackle MTS, several studies [27], [30], [31], [32] assume that multi-variate time series data has a low-rank structure. Another thread of works [17], [33], [34] leverages the attention mechanism to learn the correlations among individual time series, where [34] opted to obtain the attentive scores with cosine similarity rather than the dot-product similarity used by [17], [33]. Recently, [26] inferred the inherent structure over the variables from self-learned encodings associated with each variable. [35] used the Fourier transform to decompose the original MTS data into a group of orthogonal signals. [36] proposed a dual self-attention network (DSANet) to dynamically capture both local and global patterns with convolution and attention operators, effectively handling MTS data with non-periodic dynamics. The above methods make point estimates; [30], [37], [38] instead propose a confidence interval that is likely to contain the forthcoming observation. With prior knowledge of the application scenario, rich kinds of inter-variate relationships can be utilized. For instance, in a real-world transportation system, there are three typical relationships, namely spatial closeness [4], [5], [6], [7], functional similarity [4] and origin-destination connectiveness [39]. In particular, spatial closeness follows the first law of geography: "near things are more related than distant things"; functional similarity explains the phenomenon that although two locations are far away from each other, they exhibit similar daily movement patterns of traffic recordings; origin-destination (OD)

connectiveness is characterized as the quantity of the traffic flow between the OD pair [39]. [40] proposed a new objective function for training deep neural networks, aiming at accurately predicting sudden changes. [41] explored the structure of LSTM to learn variable-wise hidden states, with the aim of distinguishing the contribution of each variable to the prediction. In some scenarios, MTS may evolve asynchronously and be spaced unevenly. To deal with this more general case, [42] organized the asynchronous MTS data as a single series of observations, alongside a series of external features expressing the spatial-temporal relationships between the observations. Afterward, they married convolutional neural networks and auto-regressive models to perform forecasting on top of this new data representation.

Distinguishing our work from the existing ones, we are the pioneer in developing spatial and temporal normalization in the context of MTS forecasting, and we demonstrate their rationality from the perspective of multi-view multi-task learning.

2.2 Normalization

Normalization was first adopted in deep image processing, and has significantly enhanced the performance of deep learning models for nearly all tasks. There are multiple normalization methods, such as batch normalization [43], instance normalization [44], group normalization [45], layer normalization [46] and positional normalization [47], each of which was proposed to address a particular group of computer vision tasks. Of these, instance normalization has the greatest potential for our study; it was originally designed for image synthesis owing to its power to remove style information from images. Researchers have found that feature statistics can capture the style of an image, and that the remaining features, upon normalizing out those statistics, are responsible for the content. Such a separable property enables the content of one image to be rendered in the style of another, which is also known as style transfer. The style information in an image is analogous to the scale information in a time series. There is another line of work which explores the reason why the normalization trick facilitates the learning of deep neural networks [48], [49], [50], [51]. One of their major discoveries is that normalization can increase the rankness of the feature space; in other words, it enables the model to extract more diverse features.

3 PRELIMINARIES

In this section, we introduce the definitions and the assumptions. All frequently used notations are reported in Table 1.

TABLE 1
Notations

Notation | Description
$N$; $M$ | Number of variables / tasks.
$T$; $T_{in}$; $T_{out}$ | Length of history; number of input / output steps.
$X \in \mathbb{R}^{N \times T}$ | Historical observations for forecasting.
$Z, \hat{Z}, \bar{Z} \in \mathbb{R}^{N \times T \times d}$ | Historical latent representations.
$\mathcal{I} \subseteq [[1, N]] \times [[1, T]]$ | A set of sample indexes.
$\mathcal{P} = \{\mathcal{I}_1, \dots, \mathcal{I}_M\}$ | A task partition, i.e., a partition of the entire set of sample indexes.
$G^{\mathcal{P}} \in \mathbb{R}^{M \times d}$ | Global components conditioned on partition $\mathcal{P}$.
$L^{\mathcal{P}} \in \mathbb{R}^{N \times T \times d}$ | Local components conditioned on partition $\mathcal{P}$.
$\mathcal{T}$; $\mathcal{S}$ | Task partitions from the temporal / spatial view.
$x$; $y$; $z$ | Vector or matrix representing a certain variable.
$+$; $\odot$; $/$ | Element-wise addition / multiplication / division.

3.1 Task Partition Schemes

Given the complete set of sample indexes $\mathcal{I} = [[1, N]] \times [[1, T]]$, a task partition is defined as $\mathcal{P} = \{\mathcal{I}_1, \dots, \mathcal{I}_M\}$, where $\mathcal{I}_1, \dots, \mathcal{I}_M$ are disjoint subsets (also known as tasks in our context) of $\mathcal{I}$, and $M$ is the number of tasks being partitioned. Besides, we introduce a function $P$ which maps any sample index to the index of its belonging task, i.e., $P(n, t) = m$ under the task partition scheme.

Fig. 5. Illustration of task partition schemes from: (a) the temporal view; (b) the spatial view.

We employ two schemes to partition tasks, respectively from the temporal view and the spatial view, which are introduced as follows (a small sketch of both schemes is given below):

- Temporal view: samples collected at the same time are put in the same task, as shown in Fig. 5a, ignoring the spatial difference. In this case, $M = T$ and $\mathcal{I}_m = \{(n, m)\}_{n=1}^{N}$.
- Spatial view: samples of the same variable are put in the same task, as shown in Fig. 5b, ignoring the temporal difference. In this case, $M = N$ and $\mathcal{I}_m = \{(m, t)\}_{t=1}^{T}$.

Under either of these partition schemes, samples put in the same task are impacted by the same factor, so they should manifest a strong correlation. Meanwhile, samples across tasks are impacted by different factors, which results in weakly correlated patterns.

There are many partition schemes with different granularities. For example, we could also put all the samples collected during a specified period in one task. In our work, we adopt the finest granularity; the question of the optimal granularity is left for future work.
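To make the two schemes concrete, the following toy sketch (ours, not from the released code; the sizes are illustrative) labels every index of an N x T grid with its task id under each view.

```python
import numpy as np

N, T = 4, 6  # toy numbers of variables and timestamps

# Temporal view: all samples sharing a timestamp form one task (M = T),
# so the task id of sample (n, t) is simply t.
temporal_task = np.tile(np.arange(T), (N, 1))

# Spatial view: all samples of one variable form one task (M = N),
# so the task id of sample (n, t) is simply n.
spatial_task = np.tile(np.arange(N)[:, None], (1, T))

# The members of temporal task m are {(n, m)} for every variable n,
# matching the definition of I_m in Section 3.1.
m = 2
print(np.argwhere(temporal_task == m))  # rows of the form (n, 2)
```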
3.2 Other Preliminaries

Definition 1 (Time series forecasting). Time series forecasting is formulated as the following conditional distribution:

$$P(X_{:,\,t+1:t+T_{out}} \mid X_{:,\,t+1-T_{in}:t}) = \prod_{i=1}^{T_{out}} P(X_{:,\,t+i} \mid X_{:,\,t+1-T_{in}:t}).$$

Definition 2 (Time series factorization). Specifying a task partition scheme $\mathcal{P}$, any sample of the time series can be factorized in the following way:

$$Z_{n,t} = G^{\mathcal{P}}_{m} \odot L^{\mathcal{P}}_{n,t}, \qquad (1)$$

where $m = P(n, t)$ denotes the task to which the sample belongs under $\mathcal{P}$, $G^{\mathcal{P}}_{m}$ is a global component shared by all the samples from task $m$ under $\mathcal{P}$, and $L^{\mathcal{P}}_{n,t}$ is a local component possessed only by sample $(n, t)$. Be aware that different task partition schemes will result in different forms of factorization.

Assumption 1. We postulate that different dimensions of the local component are independent, and that they follow a multi-variate normal distribution:

$$L^{\mathcal{P}}_{n,t} \sim \mathcal{N}\left( \begin{bmatrix} b^{\mathcal{P}}_{m}[0] \\ \vdots \\ b^{\mathcal{P}}_{m}[d-1] \end{bmatrix},\; \mathrm{diag}\left( (\gamma^{\mathcal{P}}_{m}[0])^2, \dots, (\gamma^{\mathcal{P}}_{m}[d-1])^2 \right) \right), \qquad (2)$$

where $b^{\mathcal{P}}_{m}$ denotes the mean vector and $b^{\mathcal{P}}_{m}[i]$ denotes its $i$th entry; the covariance matrix is a diagonal matrix, where the vector of entries on the main diagonal is denoted as $(\gamma^{\mathcal{P}}_{m})^2$ and all the off-diagonal entries are 0.
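The following toy simulation (our own illustration; the task layout and parameter ranges are invented for the example) draws data according to Eq. (1) and Assumption 1, taking the spatial view so that each variable is one task.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 3, 500, 8  # spatial view: one task per variable, so M = N

G = rng.uniform(0.5, 5.0, (N, d))      # global components G_m (positive)
b = rng.normal(0.0, 1.0, (N, d))       # task-wise means b_m of the locals
gamma = rng.uniform(0.1, 1.0, (N, d))  # task-wise stds gamma_m

# L[n, t] ~ N(b_n, diag(gamma_n^2));  Z[n, t] = G_n (element-wise) L[n, t]
L = b[:, None, :] + gamma[:, None, :] * rng.standard_normal((N, T, d))
Z = G[:, None, :] * L

# Normalizing within each task recovers the standardized local component
# (L - b) / gamma up to sampling noise, anticipating Eq. (8) in Section 4.2.
Z_hat = (Z - Z.mean(1, keepdims=True)) / Z.std(1, keepdims=True)
print(np.abs(Z_hat - (L - b[:, None]) / gamma[:, None]).max())  # small
```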
Remark 1. We assume the off-diagonal entries of the covariance matrix to be 0, in the sense that this setup simplifies the form of the time series factorization and the following analysis. This kind of simplification facilitates revealing the rationale of our method. For a more complicated setup where different dimensions are correlated with each other, sophisticated whitening operations can be applied [52], [53], which is left to be investigated in future work.

Remark 2. Given a collection of samples from any task $m$, their local components are denoted as $\{L^{\mathcal{P}}_{n,t} \mid P(n,t) = m\}$. Intuitively, $\{L^{\mathcal{P}}_{n,t} \mid P(n,t) = m\}$ is expected to span a subspace whose dimension is lower than the effective dimension $d$ of the latent representation space, as $L^{\mathcal{P}}_{n,t}$ results from a smaller number of environmental factors than $Z_{n,t}$. Hence, only a part of the entries in $\gamma^{\mathcal{P}}_{m}$ have non-zero values. In addition, if $\{L^{\mathcal{P}}_{n,t} \mid P(n,t) = m_1\}$ and $\{L^{\mathcal{P}}_{n,t} \mid P(n,t) = m_2\}$ with different task identities are impacted by the same group of environmental factors, they are supposed to span the same subspace.

Remark 3. Indistinguishability is attributed to the phenomenon where latent representations from different tasks span the same subspace. In the rest of this paragraph, we discuss the root cause of this phenomenon. To begin with, we can treat Eq. (1) from the view of geometric transformation. In this way, local components from the same task will experience the same transformation, relying on the corresponding global component, and those from different tasks will experience different transformations. In practice, there are multiple types of geometric transformations, which are made up of three basic transformations, namely scaling, translating and rotating. Scaling is the one that will not change the space spanned by the local components, so any group of transformations that only differ in scaling factors will cause the produced latent representations to be indistinguishable. A concrete instance of scaling in the real world is the effect imposed by population: taxi demand over a region is normally proportional to the population residing in this region.

Fig. 6. An overview of the MVMT framework, where the spatial-view and the temporal-view manipulations compose ST-Norm.

4 METHODOLOGY

An overview of the multi-view multi-task learning block is displayed in Fig. 6. Any representation input to this block is first replicated into three copies. Then, the three copies are separately transformed by specific operations, respectively from the spatial view, the temporal view and the original view. Finally, the resulting three copies are concatenated together to obtain an augmented representation, which is taken to be the output of this block.

In this section, we start by introducing the proposed task-wise affine transformation and task-wise normalization, respectively, in Sections 4.1 and 4.2; then, in Section 4.3, we demonstrate how to alter Wavenet under the guidance of the MVMT framework; we conclude this section by introducing the process of forecasting and learning in Section 4.4.

4.1 Task-Wise Affine Transformation

Task-wise affine transformation differentiates tasks by assigning each task an exclusive group of affine parameters. As each group of parameters is only responsible for capturing the dynamics of a single task, the indistinguishability issue can be mitigated.

Formally, with respect to a task partition scheme $\mathcal{P}$ and its associated mapping function $P$, task-wise affine transformation takes the following form:

$$\bar{Z}^{\mathcal{P}}_{n,t} = Z_{n,t} \odot w^{\mathcal{P}}_{m} + b^{\mathcal{P}}_{m},$$

where $m$ denotes $P(n,t)$, and $w^{\mathcal{P}}_{m}$ and $b^{\mathcal{P}}_{m}$ are two affine parameters targeting task $m$ under the partition $\mathcal{P}$. Note that affine transformation is a special case of a fully connected layer, which would additionally take interactions between the feature channels into consideration. We will empirically show that task-wise affine transformation, in conjunction with the following task-wise normalization, achieves competitive performance while introducing far fewer parameters than a task-wise fully connected layer.

Although task-wise affine transformation allows more freedom to learn task-wise dynamics, it has a severe limitation in handling cold-start tasks. In the practice of time series forecasting, tasks partitioned from the temporal view accumulate with time. For each new task encountered in the testing phase, we must identify a task that not only presents similar dynamics to the one being tested, but has also been observed in the training data.

For a time series showing regular patterns, such identification can be accomplished given prior knowledge of the regularity. Nonetheless, for series with irregular patterns, it would be cumbersome to identify eligible tasks.

Due to this limitation, we apply task-wise affine transformation from the spatial view. The implementation takes the following form:

$$\bar{Z}^{\mathcal{S}}_{n,t} = Z_{n,t} \odot w^{\mathcal{S}}_{n} + b^{\mathcal{S}}_{n}, \qquad (3)$$

where $w^{\mathcal{S}}_{n}$ and $b^{\mathcal{S}}_{n}$ are learnable affine parameters.

Next, we introduce another thread of approaches which can address the cold-start problem.
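A PyTorch-style sketch of Eq. (3) follows (our own code; the class and argument names are hypothetical, not from the released repository): every variable n owns one scale vector and one shift vector, applied element-wise.

```python
import torch
import torch.nn as nn

class SpatialAffine(nn.Module):
    """Task-wise affine transformation from the spatial view, cf. Eq. (3)."""

    def __init__(self, num_vars: int, channels: int):
        super().__init__()
        # one (w, b) pair per variable; the singleton dim broadcasts over time
        self.w = nn.Parameter(torch.ones(num_vars, 1, channels))
        self.b = nn.Parameter(torch.zeros(num_vars, 1, channels))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, N, T, d) -> element-wise scale and shift per variable
        return z * self.w + self.b

z = torch.randn(8, 128, 16, 16)   # e.g., a BikeNYC-sized toy batch
out = SpatialAffine(128, 16)(z)   # same shape as z
```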
4.2 Task-Wise Normalization

Task-wise normalization explicitly encodes global components into the representation space. The global component varies from task to task, and thus can be used to separate tasks. However, the observation of any sample is a mixture of the global component and the local component, which hinders the capture of the global component. To extract the global component, we start by applying task-wise normalization to eliminate the global component from the representation; then we combine the normalized representation with the original representation to obtain an augmented representation for each sample. The augmented representation space can manifest the difference in global components, and hence the resulting inter-task correlation is weaker than the original one.

Likewise, for a task partition scheme $\mathcal{P}$ and its associated mapping function $P$, task-wise normalization is performed as follows:

$$\hat{Z}^{\mathcal{P}}_{n,t} = \frac{Z_{n,t} - \mu^{\mathcal{P}}_{m}}{\sigma^{\mathcal{P}}_{m}}, \qquad (4)$$

where we let $m$ denote $P(n,t)$, the task to which the sample belongs, and

$$\mu^{\mathcal{P}}_{m} = E\left[ Z_{i,j} \mid P(i,j) = m \right] = \frac{1}{|\mathcal{I}_m|} \sum_{(i,j) \in \mathcal{I}_m} Z_{i,j}, \qquad (5)$$

$$\sigma^{\mathcal{P}}_{m} = \sqrt{E\left[ \left( Z_{i,j} - \mu^{\mathcal{P}}_{m} \right)^2 \mid P(i,j) = m \right]} \approx \sqrt{\frac{1}{|\mathcal{I}_m|} \sum_{(i,j) \in \mathcal{I}_m} \left( Z_{i,j} - \mu^{\mathcal{P}}_{m} \right)^2 + \epsilon}, \qquad (6)$$

where $\epsilon$ is a small constant added to preserve numerical stability. By implementing $P$ with different task partition schemes, we obtain representations normalized in different ways. For instance, the task partition from the temporal view produces the following implementation of task-wise normalization:

$$\hat{Z}^{\mathcal{T}}_{n,t} = \frac{Z_{n,t} - \mu^{\mathcal{T}}_{t}}{\sigma^{\mathcal{T}}_{t}},$$

where

$$\mu^{\mathcal{T}}_{t} = \frac{1}{N} \sum_{i=1}^{N} Z_{i,t}, \qquad \sigma^{\mathcal{T}}_{t} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Z_{i,t} - \mu^{\mathcal{T}}_{t} \right)^2 + \epsilon}.$$

For conciseness, we do not spell out the analogous implementation from the spatial view.

Next, we explain why task-wise normalization works. As we indicated in Remark 3, certain cases of indistinguishability are caused by task-wise transformations that differ only in scaling factors. Task-wise normalization resolves this issue by converting a scaling transformation into a rotating transformation; basically, we construct a rotation whose angle relies on the scaling factor. First, we rewrite the normalized representation from Eq. (4) in the following way:

$$\hat{Z}^{\mathcal{P}}_{n,t} = \frac{Z_{n,t} - \mu^{\mathcal{P}}_{m}}{\sigma^{\mathcal{P}}_{m}} \qquad (7)$$

$$= \frac{L^{\mathcal{P}}_{n,t} - b^{\mathcal{P}}_{m}}{\gamma^{\mathcal{P}}_{m}}, \qquad (8)$$

where Eq. (8) is deduced by substituting Eqs. (1), (2), (5) and (6) into Eq. (7). Then, combining the original view, the spatial view and the temporal view, we map the original representation to an augmented representation as follows:

$$Z_{n,t} \;\rightarrow\; \begin{bmatrix} Z_{n,t} \\[2pt] \hat{Z}^{\mathcal{S}}_{n,t} \\[2pt] \hat{Z}^{\mathcal{T}}_{n,t} \end{bmatrix} = \begin{bmatrix} Z_{n,t} \\[2pt] \dfrac{L^{\mathcal{S}}_{n,t} - b^{\mathcal{S}}_{m_1}}{\gamma^{\mathcal{S}}_{m_1}} \\[2pt] \dfrac{L^{\mathcal{T}}_{n,t} - b^{\mathcal{T}}_{m_2}}{\gamma^{\mathcal{T}}_{m_2}} \end{bmatrix},$$

where $m_1 = S(n,t)$ and $m_2 = T(n,t)$. At this step, from the view of geometric transformation, each sample point is rotated by an angle depending on $G^{\mathcal{P}}_{m} \gamma^{\mathcal{P}}_{m}$ from its original position and is then translated by $b^{\mathcal{P}}_{m} / \gamma^{\mathcal{P}}_{m}$. Thus, the differences in $G^{\mathcal{P}}_{m}$, in $b^{\mathcal{P}}_{m}$ and in $\gamma^{\mathcal{P}}_{m}$ can be manifested in the new space. It is noteworthy that the new space also maintains the correlation among the $G^{\mathcal{P}}_{m}$ belonging to different tasks, which can facilitate the learning process.
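Putting the pieces together, here is a compact sketch of the ST-Norm block of Fig. 6 (our own simplification of the equations above, not the released implementation; for instance, it omits any learnable affine parameters inside the normalizations).

```python
import torch
import torch.nn as nn

class STNorm(nn.Module):
    """Concatenate original, spatially normalized and temporally
    normalized views of a representation, as in Fig. 6."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def _norm(self, z: torch.Tensor, dim: int) -> torch.Tensor:
        mu = z.mean(dim=dim, keepdim=True)
        var = z.var(dim=dim, keepdim=True, unbiased=False)
        return (z - mu) / torch.sqrt(var + self.eps)  # cf. Eqs. (4)-(6)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, N, T, d)
        z_t = self._norm(z, dim=1)  # temporal view: stats over variables
        z_s = self._norm(z, dim=2)  # spatial view: stats over time
        return torch.cat([z, z_s, z_t], dim=-1)  # (batch, N, T, 3d)
```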
4.3 Wavenet

Fig. 7. Overall architecture, showing two residual blocks for illustration; multiple blocks can be stacked layer by layer. $+$ and $\odot$ respectively denote element-wise addition and element-wise multiplication.

We illustrate the architecture of our work in Fig. 7. Some key variables, with their shapes, are labeled at the corresponding positions along the computation path. Generally, our framework is instantiated as a structure like Wavenet [12], except that we incorporate an ST-Norm block into each residual block.

We briefly introduce dilated causal convolution, in which the filter is applied while skipping values. For a 1-D signal $z \in \mathbb{R}^{T}$ and a filter $f : \{0, \dots, k-1\} \rightarrow \mathbb{R}$, the causal convolution on element $t$ is defined as follows:

$$F(t) = (z * f)(t) = \sum_{i=0}^{k-1} f(i) \cdot z_{t-i}. \qquad (9)$$

This formula can easily be generalized to a multi-dimensional signal, but we omit the general form here for brevity. Moreover, padding (zero or replicate) of size $k - 1$ is appended to the left tail of the signal to ensure length consistency. We can stack multiple causal convolution layers to obtain a larger receptive field for each element.

One shortcoming of causal convolution is that either the kernel size or the number of layers increases linearly with the range of the receptive field, and this linear relationship causes an explosion of parameters when modeling a long history.

Fig. 8. Dilated causal convolution.

Pooling is a natural choice to address this issue, but it sacrifices the order information present in the signal. To this end, dilated causal convolution, as shown in Fig. 8, is used; this form supports exponential expansion of the receptive field. The formal computing process is written as:

$$F(t) = (z *_{d} f)(t) = \sum_{i=0}^{k-1} f(i) \cdot z_{t - d \cdot i}, \qquad (10)$$

where $d$ is the dilation factor. Normally, $d$ increases exponentially w.r.t. the depth of the network (i.e., $2^{l}$ at level $l$ of the network). If $d$ is 1 ($2^{0}$), the dilated convolution operator $*_{d}$ reduces to a regular convolution operator $*$.
4.4 Forecasting and Learning

We let $Z^{(L)} \in \mathbb{R}^{N \times T_{in} \times d_z}$ denote the output from the last residual block, where each row $z^{(L)} \in \mathbb{R}^{T_{in} \times d_z}$ represents a variable. Then, we employ a temporal pooling block to perform temporal aggregation for each variable. Several types of pooling operations can be applied, such as max pooling and mean pooling, depending on the problem being studied. In our case, we select the vector at the most recent time slot as the pooling result, which is treated as the representation of the entire signal. Finally, we make a separate prediction for each variable based on the obtained representation, using a fully connected layer shared across variables.

In the learning phase, our objective is to minimize the mean squared error between the predicted values and the ground truth values. We use the Adam optimizer [54] to optimize this target.

The present SOTA is monopolized by models equipped with a graph learning module, which establishes mutual relationships between different time series [25], [26], [55]. As a result, the computational complexity of this module is $O(TN^2)$. In contrast, the task-wise affine transformation and task-wise normalization modules only involve $O(TN)$ operations, presenting better scalability than graph-based models when the number of nodes is massive.
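A sketch of the prediction head and training objective just described (our own code; names are illustrative): keep the most recent time slot as the pooled representation and predict all output steps with one fully connected layer shared across variables.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForecastHead(nn.Module):
    def __init__(self, d_z: int, t_out: int):
        super().__init__()
        self.fc = nn.Linear(d_z, t_out)  # shared by all N variables

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, N, T_in, d_z); "pool" by selecting the last time slot
        return self.fc(z[:, :, -1, :])   # -> (batch, N, T_out)

head = ForecastHead(d_z=16, t_out=3)
opt = torch.optim.Adam(head.parameters(), lr=1e-4)

z, y = torch.randn(8, 128, 16, 16), torch.randn(8, 128, 3)  # toy batch
loss = F.mse_loss(head(z), y)  # MSE objective, optimized with Adam
loss.backward()
opt.step()
```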

5 EVALUATION

In this section, we describe extensive experiments on five real-world datasets, validating the effectiveness of the MVMT framework from different aspects.

5.1 Experimental Setting

5.1.1 Datasets

We validate our model on five real-world datasets, namely BikeNYC, PeMSD7, Electricity, PM2.5 and Solar energy. The statistics of each dataset, as well as the corresponding settings of the designed task, are reported in Table 2. We standardize the values in each dataset to facilitate training, and transform them back to the original scale in the testing phase. More details regarding the datasets are as follows.

TABLE 2
Dataset Statistics

              | Electricity | PeMSD7     | BikeNYC   | Solar      | PM2.5
Start time    | 10/1/2014   | 5/1/2012   | 4/1/2014  | 1/1/2006   | 3/1/2013
End time      | 12/31/2014  | 6/30/2012  | 9/30/2014 | 12/31/2006 | 2/28/2017
Sample rate   | 1 hour      | 30 minutes | 1 hour    | 10 minutes | 1 hour
# Variate     | 336         | 228        | 128       | 137        | 10
Input length  | 16          | 16         | 16        | 16         | 16
Output length | 3           | 3          | 3         | 3          | 3

- PeMSD7 (https://pems.dot.ca.gov/). The data is collected from the Caltrans Performance Measurement System (PeMS) by sensor stations deployed to monitor traffic speed across the major metropolitan areas of the California state highway system. We further aggregate the data into 30-minute intervals by average pooling.
- Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). The original dataset contains the electricity consumption of 370 points/clients, from which 34 outlier points containing extreme values are removed. Moreover, we calculate the hourly average consumption for each point, and take it as the time series being modeled.
- BikeNYC (https://ride.citibikenyc.com/system-data). Each time series in this dataset denotes the aggregate demand for shared bikes over a region in New York City. We do not consider the spatial relationships present in the PeMSD7 and BikeNYC data, since our objective is to study the temporal patterns.
- Solar energy (http://www.nrel.gov/grid/solar-power-data.html). It contains the solar power production records for the year 2006, sampled every 10 minutes from 137 PV plants in Alabama State.
- PM2.5 (https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data). It contains hourly PM2.5 data from multiple air-quality monitoring sites, acquired from the Beijing Municipal Environmental Monitoring Center. The time period spans from March 1st, 2013 to February 28th, 2017.
5.1.2 Network Setting

The batch size is 8, and the input length of each batch sample is 16. For the Wavenet backbone, the number of layers is set to 4, the kernel size of each DCC component is 2, and the associated dilation rate is $2^{i}$, where $i$ is the index of the layer (counting from 0). These settings collectively enable the output from Wavenet to perceive all 16 input steps. The number of hidden channels $d_z$ in each DCC is 16. We apply zero-padding on the left tail of the input so that the length of the output from each DCC is 16 as well. The learning rate of the Adam optimizer is 0.0001. Code is available at https://github.com/JLDeng/ST-Norm.git.
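As a sanity check (our own arithmetic, not spelled out in the text), the receptive field implied by these settings is

$$R = 1 + \sum_{l=0}^{3} (k - 1)\, 2^{l} = 1 + (1 + 2 + 4 + 8) = 16,$$

which matches the 16 input steps the output is stated to perceive.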
5.1.3 Evaluation Metrics

We validate our model using root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE).
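For reference, these are the standard definitions (the paper does not restate them here); $y_i$ and $\hat{y}_i$ denote a ground-truth value and its prediction over all evaluated entries $i = 1, \dots, n$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|.$$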
We conduct cross-validation on each dataset to obtain a comprehensive evaluation of our model against competing ones. In particular, we split the time series data into 10 chunks of approximately equal length along the temporal axis, then create 5 groups of training/validation/test sets from these 10 chunks. For the first group, the first 4 chunks constitute the training set, while the 5th and 6th chunks respectively constitute the validation and test sets; for the second group, the first 5 chunks constitute the training set, while the 6th and 7th chunks respectively constitute the validation and test sets; and so on. We repeat the experiment 10 times for each model on each group of training/validation/test sets and report the average performance.
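This split protocol reads as a rolling-origin scheme; a small sketch (ours) enumerates the five groups under this reading.

```python
def rolling_splits(num_chunks: int = 10, num_groups: int = 5, base: int = 4):
    """Yield (train_chunks, val_chunk, test_chunk) for each group."""
    for g in range(num_groups):
        train = list(range(base + g))         # the first 4 + g chunks train
        yield train, base + g, base + g + 1   # the next two: validation, test

for train, val, test in rolling_splits():
    print(train, val, test)
# group 1: [0, 1, 2, 3] 4 5   ...   group 5: [0, ..., 7] 8 9
```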
5.2 Baseline Models

- MTGNN [26]. MTGNN constructs inter-variate relationships by introducing a graph-learning module. Specifically, the graph-learning module connects each hub node with its top-k nearest neighbors in a defined metric space. MTGNN's backbone architecture for temporal modeling is Wavenet.
- Graph Wavenet [25]. The architecture of Graph Wavenet is similar to MTGNN. The major difference is that the former derives a soft graph where each pair of nodes has a continuous probability of being connected.
- AGCRN [55]. AGCRN is also equipped with a graph-learning module to establish inter-variate relationships. Furthermore, it uses a personalized RNN to model the temporal relationship of each time series.
- Transformer [11]. This model captures the long-term dependencies in time series data by using an attention mechanism, where the keys and queries are yielded by causal convolution over the local context to model segment-level correlation.
- LSTNet [21]. There are two components in LSTNet: one is a conventional autoregressive model, and the other is an LSTM with an additional skip connection over the temporal dimension.
- TCN [13]. The architecture of TCN is similar to Wavenet, except that the nonlinear transformation in each residual block is made up of two rectified linear units (ReLU).

We also test the performance of TCN and Transformer incorporating STN, where STN is similarly applied before the causal convolution operation in each layer. We do not compare our method to linear models such as ARIMA, because the involved baseline models show superiority over linear models, as illustrated in the original works [11], [21], [25], [26]. For all baselines, we conduct a grid search on the number of hidden units over {4, 8, 16, 32, 64}, and select the best models according to the validation results.

5.3 Experimental Results

The experimental results on the five datasets are reported in Table 3. The improvements achieved by Wavenet + STN over the best benchmarks are recorded in the last row of each sub-table.

It is obvious that Wavenet + STN achieves SOTA results over almost all horizons on the BikeNYC, PeMSD7 and Electricity data. The reason for this is that we refine the high-frequency components from both the temporal view and the spatial view, which are generally overlooked by baseline models. Next, we reveal the cause of Wavenet + STN's under-performance on the Electricity dataset over the first horizon with respect to MAPE. Electricity data follows a long-tailed distribution: a certain proportion of the values exceed a relatively high level. Recall that the optimization involves minimizing the mean squared error, which means that more weight is placed on large errors. Moreover, every sample is treated equivalently in the estimation of the global statistics. Therefore, the model can fit long-tailed samples better, but at the cost of degrading the fit on normal samples.

We further investigate the improvement in terms of individual variables in MTS. To prove that STN captures the difference in scaling factors among the various variables, we characterize each variable by the mean of its historical observations.
DENG ET AL.: MULTI-VIEW MULTI-TASK LEARNING FRAMEWORK FOR MULTI-VARIATE TIME SERIES FORECASTING 7673

TABLE 3
Performance for Multi-Step Prediction on 5 Datasets
(Each cell lists MAPE, MAE and RMSE; the last row of each block records the improvement of Wavenet + STN over the best benchmark.)

BikeNYC:
Model | 1st horizon (MAPE, MAE, RMSE) | 2nd horizon | 3rd horizon
LSTNet | 18.2% 2.37 4.97 | 19.5% 2.58 5.56 | 21.0% 2.78 6.06
AGCRN | 17.8% 2.34 5.01 | 18.9% 2.48 5.40 | 20.1% 2.62 5.74
Graph Wavenet | 18.0% 2.39 4.96 | 19.6% 2.66 5.65 | 21.1% 2.87 6.14
MTGNN | 18.8% 2.47 5.15 | 20.6% 2.74 5.85 | 22.1% 2.95 6.33
Transformer | 22.9% 2.99 6.32 | 26.6% 3.59 7.85 | 28.8% 3.99 8.79
Transformer + STN | 17.9% 2.35 4.85 | 19.3% 2.56 5.51 | 20.7% 2.76 5.96
TCN | 22.7% 3.04 6.43 | 26.8% 3.72 8.07 | 28.9% 4.15 9.04
TCN + STN | 17.2% 2.26 4.66 | 18.9% 2.47 5.31 | 20.4% 2.67 5.76
Wavenet | 21.7% 2.88 6.08 | 25.3% 3.48 7.57 | 27.5% 3.90 8.53
Wavenet + STN | 17.0% 2.22 4.60 | 18.4% 2.43 5.21 | 19.9% 2.62 5.66
Improvements | +3.4% +5.1% +7.2% | +2.6% +2.1% +3.5% | +1.0% +0.0% +1.3%

PeMSD7:
Model | 1st horizon (MAPE, MAE, RMSE) | 2nd horizon | 3rd horizon
LSTNet | 8.29% 4.05 6.79 | 8.57% 4.19 6.96 | 9.13% 4.44 7.29
AGCRN | 5.38% 2.58 4.59 | 7.07% 3.38 6.00 | 8.06% 3.85 6.70
Graph Wavenet | 5.28% 2.53 4.51 | 7.34% 3.51 6.12 | 8.41% 4.02 6.86
MTGNN | 5.46% 2.62 4.61 | 7.64% 3.65 6.28 | 8.73% 4.17 7.03
Transformer | 5.87% 2.81 5.06 | 9.06% 4.32 7.46 | 11.2% 5.32 9.81
Transformer + STN | 5.51% 2.63 4.67 | 7.54% 3.61 6.28 | 8.64% 4.15 7.04
TCN | 5.74% 2.75 4.97 | 9.07% 4.32 7.45 | 11.4% 5.43 8.85
TCN + STN | 5.24% 2.51 4.46 | 7.00% 3.35 5.94 | 7.88% 3.78 6.60
Wavenet | 5.50% 2.61 4.80 | 8.75% 4.10 7.20 | 11.0% 5.16 8.61
Wavenet + STN | 5.14% 2.46 4.44 | 6.85% 3.27 5.91 | 7.75% 3.71 6.58
Improvements | +2.6% +2.7% +1.5% | +3.1% +2.7% +1.5% | +3.8% +3.6% +1.7%

Electricity:
Model | 1st horizon (MAPE, MAE, RMSE) | 2nd horizon | 3rd horizon
LSTNet | 14.2% 21.9 46.0 | 14.6% 22.4 46.8 | 15.3% 23.3 48.4
AGCRN | 9.58% 14.6 33.8 | 11.0% 16.8 38.4 | 12.5% 19.1 42.3
Graph Wavenet | 9.17% 14.3 32.1 | 11.9% 18.3 40.9 | 13.7% 20.7 46.2
MTGNN | 9.31% 14.5 31.8 | 12.3% 19.0 40.8 | 14.1% 21.4 45.5
Transformer | 10.8% 16.5 37.4 | 16.0% 24.1 53.3 | 19.4% 28.9 62.0
Transformer + STN | 8.9% 14.0 31.0 | 11.2% 17.4 38.5 | 12.7% 19.4 42.2
TCN | 10.5% 16.2 36.6 | 15.7% 23.9 52.7 | 19.2% 28.6 61.2
TCN + STN | 8.50% 13.2 28.2 | 10.7% 16.4 35.4 | 12.0% 18.4 39.6
Wavenet | 10.2% 15.8 36.2 | 15.0% 22.9 51.2 | 18.2% 27.3 59.3
Wavenet + STN | 8.19% 12.7 27.7 | 10.3% 16.0 35.5 | 11.8% 18.1 40.3
Improvements | +10.0% +11.1% +12.8% | +6.3% +4.7% +7.8% | +5.7% +5.4% +4.8%

Solar:
Model | 1st horizon (MAPE, MAE, RMSE) | 2nd horizon | 3rd horizon
LSTNet | 13.5% 1.50 2.31 | 16.6% 1.89 2.89 | 18.7% 2.19 3.28
Graph Wavenet | 8.0% 1.12 1.87 | 10.7% 1.48 2.51 | 12.8% 1.80 2.95
AGCRN | 7.9% 1.07 1.83 | 10.6% 1.46 2.49 | 12.6% 1.75 2.90
MTGNN | 7.9% 1.08 1.85 | 10.7% 1.46 2.49 | 12.8% 1.79 2.94
Wavenet + STN | 6.9% 0.80 1.64 | 10.4% 1.27 2.32 | 12.6% 1.59 2.74
Improvements | +12.7% +25.2% +10.4% | +1.9% +13.0% +6.8% | 0% +9.1% +5.5%

PM2.5:
Model | 1st horizon (MAPE, MAE, RMSE) | 2nd horizon | 3rd horizon
LSTNet | 18.0% 11.68 19.82 | 26.3% 17.42 29.64 | 32.6% 21.91 36.85
Graph Wavenet | 17.4% 10.24 19.23 | 25.2% 15.48 28.67 | 31.4% 19.90 35.85
AGCRN | 16.9% 10.07 19.30 | 24.6% 15.23 28.08 | 30.9% 19.47 34.86
MTGNN | 18.5% 10.41 19.93 | 27.0% 16.35 30.36 | 33.5% 21.07 37.99
Wavenet + STN | 17.1% 10.00 18.75 | 24.6% 15.08 27.88 | 30.7% 19.38 34.88
Improvements | -1.2% +0.7% +2.5% | +2.4% +1.0% +0.7% | 0.3% +0.5% -0.6%

For succinctness, we calculate the variable-wise reduction in RMSE obtained by Wavenet + STN compared to AGCRN, and then plot this reduction against the scale of each variable in Fig. 9. We can see that for BikeNYC and Electricity, the improvement becomes more prominent as the scale grows, which meets our expectation. For PeMSD7, the improvement is less significant, as all the variables in this dataset vary over the same range, which signifies that they have approximately the same scaling factor.

Figs. 10 and 11 illustrate the efficiency of our model. Fig. 11 shows that, with the additional STN module, the converging speeds of the models are accelerated by a large margin, faster than nearly all the baseline models. Fig. 10 indicates that the running time of training our model on the same volume of data is also competitive with the baselines.

Fig. 10. Running time of an epoch in the training process on BikeNYC data.

Fig. 9. Variable-wise improvements over AGCRN.

5.4 Ablation Study

We design several variants within the MVMT framework, as shown in Fig. 12, to validate the effectiveness of the different operations. We evaluate these variants on the first three datasets and report the results in Table 4. From this table, we can draw the following major conclusions:

- Taking either the spatial view or the temporal view can greatly enhance the capability of vanilla Wavenet, and taking both achieves the best performance.
- Contrasting (a) with (b), we conclude that the original view is indispensable in the MVMT framework, especially when applied to the Electricity data. An intuitive explanation for this indispensability is that task-wise normalization more or less loses information encoded in the original representation, especially for data which presents complex dynamics, like the Electricity dataset.
- Based on (a), (c) and (e), we find that task-wise affine transformation significantly increases the improvement contributed by the spatial branch. The reason for this is that this operation links the tasks associated with the same region over time, which is paramount for online forecasting where only recent observations are input to the model.
- Based on (a), (d), (e) and (f), we find that the performance gain entailed by merely taking the temporal view is limited on the Electricity data. We conjecture that this is also due to the complex patterns of the data: without encoding the spatial attribute into the representations, it is difficult for the model to capture typical temporal patterns from the data.

5.5 Hyper-Parameter Analysis

We further study the effect of different settings of the hyper-parameters in the proposed modules. There are four hyper-parameters that need to be manually set by practitioners, namely the dimension of hidden channels $d_z$, the number of historical steps input to the model, the kernel size of the DCC, and the batch size. When evaluating each hyper-parameter, the remaining three are fixed to their default settings as introduced in Section 5.1.2. The study results are reported in Fig. 13, from which we can draw a major conclusion: STN not only boosts the performance, but also increases the stability of the performance under different hyper-parameter settings.

5.6 Case Study

To obtain more insights into the algorithm, we conduct multiple studies to qualitatively analyze the representations generated while forecasting, including the initial representation, the intermediate representation and the final representation. The dataset we select for this investigation is BikeNYC.

5.6.1 Initial Representation

We apply task-wise normalization over the raw input data, respectively from the spatial view and the temporal view, and examine whether the indistinguishability issues we raised in Fig. 4 are mitigated. We plot the original quantity versus the temporally normalized quantity in Fig. 14, and the original quantity versus the spatially normalized quantity in Fig. 15. It is apparent that the pairwise relationship between the original quantity and the temporally normalized quantity separates different regions, and that the pairwise relationship between the original quantity and the spatially normalized quantity separates different days.

5.6.2 Intermediate Representation

In this part, we investigate the output from the spatial branch and the temporal branch in STN. We start by discussing the property that these intermediate representations are supposed to express. The output from the spatial branch is expected to encode the local component, which reflects the temporary pattern at a single timestamp. The reason for this is that task-wise normalization eliminates the global component from the input representation, where the global component encodes the long-term pattern regarding a region, which has nothing to do with a specific timestamp. With respect to the temporal branch, the resulting representation is deemed to show the region-wise pattern, which plays the role of a local component from this view.

Fig. 11. Loss convergence.

We select three representative regions at specified times to reflect what the operations extract from the data. We examine the intermediate representations output from the two views in the top residual block. As a comparison, we also inspect their associated input representations, each of which is a concatenation of raw measurements. Next, we discuss the outcomes from these two views separately in detail.

Temporal View. In Fig. 16, we display the demand evolution during a given period over the three investigated
regions. We can observe that the three regions have similar
evolution patterns, especially regions B and C. The repre-
sentations concatenated by the original measurements are
plotted in Fig. 17a, and the output of the intermediate rep-
resentations from the temporal view are plotted in
Fig. 17b. For the sake of visualization, we obtain the two-
dimensional embeddings of these representations via t-
Distributed Stochastic Neighbor Embedding (t-SNE). We
can observe that the representations are completely rear-
ranged in accordance with the regional identity. This
observation demonstrates that the local components are
roughly invariant within the group belonging to the same
region. This coincides with our understanding that some
regional attributes, such as population and functionality,
are stable over time.
Spatial View. To reflect the characteristics of the output representations from the spatial view, we take another region D into consideration, as shown in Fig. 18. Noticeably, the magnitude of the demand over region D is substantially smaller than those over regions A or B.

Fig. 12. Variants for ablation study.

TABLE 4
Ablation Study
(B = BikeNYC, P = PeMSD7, E = Electricity; columns (a)-(f) are the variants of Fig. 12.)

Dataset | Metric | (a) | (b) | (c) | (d) | (e) | (f)
B | RMSE | 5.32 | 5.40 | 6.34 | 6.15 | 6.48 | 7.55
B | MAE | 2.60 | 2.65 | 3.04 | 2.93 | 3.07 | 3.53
B | MAPE (%) | 19.1 | 19.5 | 22.9 | 21.7 | 23.4 | 26.3
P | RMSE | 5.25 | 5.38 | 5.99 | 6.12 | 6.00 | 6.78
P | MAE | 2.88 | 3.01 | 3.39 | 3.44 | 3.42 | 3.89
P | MAPE (%) | 6.02 | 6.28 | 7.15 | 7.20 | 7.20 | 8.23
E | RMSE | 38.9 | 41.0 | 43.1 | 42.9 | 46.4 | 47.2
E | MAE | 18.8 | 21.1 | 21.4 | 20.2 | 23.9 | 22.8
E | MAPE (%) | 14.2 | 17.7 | 16.2 | 14.9 | 18.9 | 16.4
Fig. 13. Hyper-parameter analysis.

Fig. 14. Initial representations produced: (a) without task-wise normalization; (b) with task-wise normalization from the temporal view.

Fig. 15. Initial representations produced: (a) without task-wise normalization; (b) with task-wise normalization from the spatial view.

Fig. 16. Example data selected to study the output of intermediate representations from the spatial view.

Fig. 17. Two-dimensional embeddings of the intermediate representations produced by: (a) the vanilla Wavenet; (b) the spatial view in MVMT.

Fig. 18. Example data picked out to study intermediate representations output from the temporal view.

Fig. 19. Two-dimensional embeddings of the intermediate representations produced by: (a) the vanilla Wavenet; (b) the temporal view in MVMT. The sampling time is distinguished by the color of the marker, while the sampling location is distinguished by the shape of the marker.

Fig. 20. Four groups of samples selected to inspect temporal distinguishability.

5.6.3 Final Representation

In this part, we investigate the representations directly used for prediction. The major purpose is to examine the spatial distinguishability and the temporal distinguishability of the representations after processing by STN. These two types of distinguishability are illustrated separately with two cases.
Temporal View. In the first case, we examine the distinguishability from the temporal view, concentrating on a specific region. With prior knowledge that bike demand shows daily periodicity, where the same evolving pattern appears iteratively at the same time every day, we select four representative times of day which exhibit entirely different patterns. Then, we extract the input sequences at these times and visualize them in Fig. 20, grouped by time.
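A minimal sketch of this grouping step is given below; it assumes hourly timestamps, and the four hours chosen are placeholders, since the paper does not state the exact times.

```python
from collections import defaultdict

def group_by_time_of_day(samples, hours=(0, 6, 12, 18)):
    # samples: iterable of (timestamp, input_sequence) pairs, where
    # `timestamp` exposes an integer `.hour` attribute (e.g., datetime).
    groups = defaultdict(list)
    for ts, seq in samples:
        if ts.hour in hours:
            groups[ts.hour].append(seq)  # one group per representative time
    return groups
```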

Fig. 21. The percentages of variance explained by the first seven components.

Fig. 22. Projections of representations on the subspace spanned by (a), (c), (e): the first two components; (b), (d), (f): the third and fourth components.

Fig. 23. Two groups of samples selected to inspect spatial distinguishability.

Fig. 24. The percentage of variance explained by the first seven components.

Fig. 25. Projections of representations on the subspace spanned by the first two components, colored in accordance with (a), (c), (e): group identity and (b), (d), (f): scale.

To demonstrate that the distinguishability of the representations can be greatly enhanced by the additional operations, we employ the raw sequence and the representation produced by the vanilla Wavenet as two benchmark representations. We first perform principal component analysis (PCA) to reduce the dimensionality of the representations. The reason for adopting PCA rather than t-SNE is that PCA preserves the linearity of the representation space. The percentage of variance explained by each of the selected components is plotted in Fig. 21, from which we can observe that only a few components are effective, as the variance explained by the remaining components is extremely small.
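A hedged sketch of this PCA step with scikit-learn is shown below; reps is an assumed (n_samples, d) array of final representations, not a variable from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_inspect(reps, n_components=7):
    pca = PCA(n_components=n_components)
    proj = pca.fit_transform(reps)
    # Per-component fraction of total variance (what Figs. 21 and 24 plot).
    print(np.round(pca.explained_variance_ratio_, 3))
    # Subspaces visualized in Fig. 22: first two, then third and fourth.
    return proj[:, :2], proj[:, 2:4]
```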

According to this finding, we visualize in Fig. 22 the projections on the first four components, which together explain almost all of the variance over the representations. It is obvious that the enhanced model produces the representations showing the strongest intra-task correlation and the weakest inter-task correlation among the three models. Hence, the distinguishability from the temporal view is improved by applying the proposed operations.
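The paper makes this contrast visually; the sketch below is one illustrative way to quantify it, using the average pairwise cosine similarity within and across tasks. The function and variable names are assumptions.

```python
import numpy as np

def intra_inter_similarity(reps, task_ids):
    # reps: (n, d) representations; task_ids: (n,) task label per sample.
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = normed @ normed.T                      # pairwise cosine similarity
    same = task_ids[:, None] == task_ids[None, :]
    off_diag = ~np.eye(len(reps), dtype=bool)
    intra = sim[same & off_diag].mean()          # high for the enhanced model
    inter = sim[~same].mean()                    # low for the enhanced model
    return intra, inter
```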
Spatial View. In the second case, we examine the distinguishability from the spatial view, concentrating on a particular time. We roughly divide the regions into two groups based on their evolving patterns, as shown in Fig. 23. Basically, bike demand over the first group of regions experiences a V-shaped variation during the period being visualized and will continue rising in the forthcoming time steps; on the contrary, the second group of regions remains approximately constant, e.g., at 0. We clarify that the assignment is neither rigorous nor unique, but such fuzziness will not affect the conclusions we can draw.

Similar to the first case, we perform PCA, plot the percentage of variance explained by each of the selected components in Fig. 24, and visualize in Fig. 25 the projections on the first two components, which explain most of the variance. At this step, it is apparent that the enhanced model yields the most distinguishable representation space from the spatial view. Moreover, we recolor the sample points based on the scale of the observations, and the visualization results show that the representations also encode information regarding scale.
6 CONCLUSION AND FUTURE WORK

In this work, we develop a novel multi-view multi-task learning framework for multi-variate time series forecasting. In our design, the framework consists of the original view, the spatial view and the temporal view, but it is flexible enough to account for more views, depending on the specific application. Forecasting from each view is accomplished in a multi-task manner via two task-dependent operations, namely task-wise normalization and task-wise affine transformation. Diverse experiments were performed to quantitatively show the effectiveness, efficiency and robustness of the framework. Furthermore, we conducted multiple case studies on the representations produced at different stages of the forecasting procedure. The outcomes qualitatively demonstrate that the framework strengthens the intra-task correlation while weakening the inter-task correlation.

It is noticeable that the framework is modular, flexible and extensible. In practice, we start by defining a variety of task partition schemes from different views, beyond the spatial and the temporal views, based on the structure underlying the specified data. Then, we instantiate a collection of task-wise affine transformation and normalization modules to accommodate these schemes, as sketched below. For example, if the set of variables is only partially correlated, we can identify the correlated subset and define a scheme to distinguish this subset. Like crafting advantageous features, formulating views and tasks that benefit forecasting relies on prior knowledge, i.e., the practitioners' insight into the data.
lection of task-wise affine transformation and normaliza- 2013, arXiv:1304.5634.
[16] J. Deng, X. Chen, R. Jiang, X. Song, and I. W. Tsang, “ST-Norm:
tion modules to accommodate these schemes. For Spatial and temporal normalization for multi-variate time series
example, if the set of variables is only partially correlated, forecasting,” in Proc. 27th ACM SIGKDD Conf. Knowl. Discov. Data
we can identify the correlated subset and define a scheme Mining, 2021, pp. 269–278.
to distinguish this subset. Like crafting advantageous fea- [17] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, “A
dual-stage attention-based recurrent neural network for time
tures, formulating views and tasks beneficial for forecast series prediction,” 2017, arXiv:1704.02971.
relies on prior knowledge, i.e., the practitioners’ insight [18] J. Zhao et al., “Do RNN and LSTM have long memory?,”
into the data. 2020, arXiv: 2006.03860.

Jinliang Deng received the BS degree in computer science from Peking University in 2017, and the MS degree in computer science from The Hong Kong University of Science and Technology in 2019. He is currently working toward the PhD degree with the Australian Artificial Intelligence Institute, University of Technology Sydney, and the Department of Computer Science and Engineering, Southern University of Science and Technology. His research interests include time series forecasting, urban computing and deep learning.

Xiusi Chen received the BS and MS degrees in computer science from Peking University, in 2015 and 2018, respectively. He is currently working toward the PhD degree with the Department of Computer Science, University of California, Los Angeles. His research interests include natural language processing, knowledge graphs, neural machine reasoning and reinforcement learning.

Renhe Jiang received the BS degree in software engineering from the Dalian University of Technology, China, in 2012, the MS degree in information science from Nagoya University, Japan, in 2015, and the PhD degree in civil engineering from The University of Tokyo, Japan, in 2019. Since 2019, he has been an assistant professor with the Information Technology Center, The University of Tokyo. His research interests include ubiquitous computing, deep learning, and spatio-temporal data analysis.
Xuan Song received the PhD degree in signal and information processing from Peking University in 2010. In 2017, he was selected as an Excellent Young Researcher of Japan MEXT. He has served as an associate editor, guest editor, area chair, and senior program committee member for many prestigious journals and top-tier conferences, such as IMWUT, IEEE Transactions on Multimedia, WWW Journal, ACM Transactions on Intelligent Systems and Technology, IEEE Transactions on Knowledge and Data Engineering, Big Data Journal, UbiComp, IJCAI, AAAI, ICCV, and CVPR. His main research interests are AI and its related research areas, such as data mining and urban computing. To date, he has published more than 100 technical publications in journals, book chapters, and international conference proceedings, including more than 60 high-impact papers in top-tier venues for computer science. His research has been featured in many Chinese, Japanese and international media, including the United Nations, the Discovery Channel, and Fast Company Magazine. He received the Honorable Mention Award at UbiComp 2015.

Ivor W. Tsang (Fellow, IEEE) is an ARC Future Fellow and professor of Artificial Intelligence with the University of Technology Sydney (UTS), Australia. He is also the research director of the Australian Artificial Intelligence Institute. His research interests include transfer learning, generative models, and Big Data analytics for data with extremely high dimensions. In 2013, he received the prestigious ARC Future Fellowship for his research regarding machine learning on Big Data. In 2019, his JMLR paper titled "Towards ultrahigh dimensional feature selection for Big Data" received the International Consortium of Chinese Mathematicians Best Paper Award. In 2020, he was recognized as the AI 2000 AAAI/IJCAI Most Influential Scholar in Australia for his outstanding contributions to the field between 2009 and 2019. His research on transfer learning earned him the Best Student Paper Award at CVPR 2010 and the 2014 IEEE Transactions on Multimedia Prize Paper Award. In addition, he received the IEEE TNN Outstanding 2004 Paper Award in 2007. He serves as a senior area chair/area chair for NeurIPS, ICML, AISTATS, AAAI and IJCAI, and on the editorial boards of the Journal of Machine Learning Research, MLJ, and IEEE Transactions on Pattern Analysis and Machine Intelligence.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.