Probabilistic Forecasting for Dynamical Systems with Missing or Imperfect Data

Siddharth Rout1,2 Eldad Haber1,2 Stéphane Gaudreault3


arXiv:2503.12273v1 [[Link]-ph] 15 Mar 2025

1 Institute of Applied Mathematics, University of British Columbia, Vancouver, BC, Canada
2 Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia, Vancouver, BC, Canada
3 Recherche en prévision numérique atmosphérique, Environnement et Changement climatique Canada, Dorval, QC, Canada

Abstract

The modeling of dynamical systems is essential in many fields, but applying machine learning techniques is often challenging due to incomplete or noisy data. This study introduces a variant of stochastic interpolation (SI) for probabilistic forecasting, estimating future states as distributions rather than single-point predictions. We explore its mathematical foundations and demonstrate its effectiveness on various dynamical systems, including the challenging WeatherBench dataset.

1 INTRODUCTION

The modeling of dynamical systems is of paramount importance across numerous scientific and engineering disciplines, including weather prediction, climate science, finance, and biology [Brin and Stuck, 2002, Tu, 2012]. Traditional methods for predicting the behavior of these systems often rely on explicit mathematical models derived from known physical laws. Such approaches are well summarized in Reich and Cotter [2015] and typically employ a combination of numerical methods, Bayesian inference, and sensitivity analysis. However, these methods can be limited by their underlying assumptions and the complexity of the systems they aim to model [Guckenheimer and Worfolk, 1993, Guckenheimer and Holmes, 2013]. In particular, they require a high-fidelity model with sufficiently high resolution and accurate input data. Furthermore, such approaches often demand significant computational resources, making them difficult to compute within a reasonable time frame [Moin and Mahesh, 1998, Lynch, 2008, Wedi et al., 2015b].

In recent years, machine learning (ML) has emerged as a powerful tool for predicting dynamical systems, leveraging its ability to learn patterns and relationships from data without requiring explicit knowledge of the underlying physical laws [Ghadami and Epureanu, 2022, Pathak et al., 2022b, Lam et al., 2022, Bodnar et al., 2024]. ML approaches are well suited for this task, as they can automatically identify patterns in high-dimensional data without the need for a precise mathematical formulation of all physical processes. This is particularly advantageous when complete physical models are either unknown or computationally intractable. Furthermore, ML can efficiently capture multi-scale phenomena while maintaining computational efficiency during inference.

The application of machine learning models to dynamical systems spans a wide range of fields, including finance, epidemiology, rumor propagation, and predictive maintenance [Cheng et al., 2023a]. Our primary motivation is the recent success of machine learning techniques in weather and climate prediction, where vast amounts of meteorological data have been analyzed to forecast weather patterns and climate changes [Pathak et al., 2022b, Lam et al., 2022, Bodnar et al., 2024].

Machine learning models, particularly those based on deep learning, excel at processing large and complex datasets. In the context of dynamical systems, these models can be trained on historical data to learn system behavior and predict future states—essentially integrating the system over a very large time step. Among the most effective models are Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), and Transformers [Chattopadhyay et al., 2020, Vaswani et al., 2017]. These models are specifically designed for sequential data and retain memory of previous inputs, making them well-suited for time series prediction in dynamical systems. Convolutional Neural Networks (CNNs) have also been adapted for time series prediction, particularly in spatiotemporal problems [Shi et al., 2015, Li et al., 2020, Liu et al., 2021, Zakariaei et al., 2024].

Most traditional models for dynamical systems aim for deterministic predictions, providing a single forecast of the system's future state. However, with incomplete or noisy data, such predictions can be misleading or unrealistic due to large errors. In contrast, probabilistic approaches account for this uncertainty by forecasting a distribution of possible future states rather than a single outcome. This shift from point predictions to probability distributions is crucial for many complex systems, where small uncertainties in initial conditions or missing data can lead to vastly different outcomes, and where quantifying uncertainty is essential for decision-making.

Motivation: Machine learning techniques have made significant strides in predicting dynamical systems. However, despite their success, these models face several challenges. One of the primary difficulties lies in the quality, quantity, and completeness of the data [Ganesan et al., 2004, Geer and Bauer, 2011, Cheng et al., 2023b]. High-quality, large-scale datasets are essential for training accurate machine learning models, yet such data are often low resolution, incomplete, or noisy. As we show here, this violates standard machine learning assumptions. A model may receive incorrect or insufficient input (e.g. low-resolution data), preventing it from making accurate predictions. In such cases, deterministic predictions fail, necessitating the use of probabilistic forecasting, an approach that is intuitive for practitioners in the field. Instead of predicting a single outcome, one also quantifies uncertainty by leveraging historical data [Cheng et al., 2023b, Smith, 2024].

One way to address this issue is through ensemble forecasting [Kalnay, 2003]. This technique is widely used in numerical weather prediction (NWP) as it accounts for uncertainties in initial conditions and model physics, producing a range of possible outcomes to enhance forecast reliability and accuracy. However, ensemble forecasting relies on extensive physical simulations and requires significant computational resources.

This study explores a framework for addressing this challenge using recent advances in machine learning. Specifically, we employ a variant of flow matching to map the current state distribution to the future state distribution. Flow matching is a family of techniques, including stochastic interpolation (SI) [Albergo et al., 2023, Lipman et al., 2022] and score matching [Song et al., 2020], that enables learning a transformation between two distributions. While these techniques are commonly used for image generation by mapping Gaussian distributions to images, they can be readily adapted to match the distribution of the current state to the distribution of future states.

Recent work by Chen et al. [2024] introduced the use of SI and the Föllmer process for probabilistic forecasting via stochastic differential equations. Here, we propose a similar approach to handle missing and noisy data but adopt a deterministic framework instead.

Stochastic interpolation (SI) is a homotopy-based technique used in machine learning and statistics to generate new data points by probabilistically interpolating between known distributions [Albergo et al., 2023, Lipman et al., 2022]. Unlike deterministic methods, which produce a single interpolated value, flow matching introduces randomness into the prediction, enabling the generation of a diverse range of possible outputs.

By leveraging probability distributions and stochastic processes, flow matching based generative models can capture the inherent variability and complexity of real-world data, resulting in more robust and versatile models.

In this work, we apply Stochastic Interpolation (SI) to probabilistic forecasting in dynamical systems and demonstrate that it is a natural choice, particularly when the dynamics are unresolved due to noise or when some states are unavailable. Specifically, we build on previous approaches that transition from Stochastic Differential Equations (SDEs) to Ordinary Differential Equations (ODEs), facilitating easier and more accurate integration [Yang et al., 2024].

As recently shown, selecting an appropriate loss function during training significantly improves inference efficiency [Lipman et al., 2024]. To achieve this, we further employ SI to encode and decode transport maps, introducing physical perturbations to non-Gaussian data samples. This enables sampling from the base state and propagating these samples to future states.

Finally, we demonstrate that our approach is applicable to various problems, with a key application in weather prediction, where it enhances forecasts by quantifying uncertainty.

2 MATHEMATICAL FOUNDATION

In this section, we lay the mathematical foundation of the ideas behind the methods proposed in this work. Assume that there is a dynamical system of the form

ẏ = f(y, t, p).   (1)

Here, y ∈ Rn is the state vector and f is a function that depends on y, the time t and some parameters p ∈ Rk. We assume that f is smooth and differentiable so that given an initial condition y(0) one could compute y(T) using some numerical integration method [Stuart and Humphries, 1998]. The learning problem arises in the case when the function f or the initial conditions are unknown and we have observations about y(t). In this case, our goal is to learn the function f given the observations y(t). There are a number of approaches by which this can be done [Reich and Cotter, 2015, Chattopadhyay et al., 2020, Ghadami and Epureanu, 2022]. Assume first that the data on y is measured in small intervals of time (T = h) that for simplicity we assume to be constant. Then, we can solve the optimization problem

\min_\theta \; \frac{1}{2} \sum_j \left\| g\!\left(\tfrac{1}{2}(y_j + y_{j+1}),\, t_{j+1/2},\, \theta\right) - \frac{y_{j+1} - y_j}{h} \right\|^2,   (2)

where g is some functional architecture that is sufficiently expressive to approximate f and θ are parameters in g. If the original function f is known, then one can choose f = g and p = θ. However, in many cases g is only a surrogate of f and the parameters θ are very different than p.

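For illustration, the optimization problem in equation 2 can be set up in a few lines. The sketch below is a minimal, hypothetical implementation and not the configuration used in our experiments: the small two-layer surrogate, the trajectory array y, the time stamps t, and the optimizer settings are all illustrative choices.

# Minimal sketch of the finite-difference learning problem in equation 2.
# Assumptions (not from the paper): a single float trajectory `y` of shape (N, n)
# sampled every h time units, measurement times `t` of shape (N,), a small MLP
# surrogate g_theta, and Adam as the optimizer.
import torch
import torch.nn as nn

def fit_surrogate(y, h, t, n_epochs=2000, lr=1e-3):
    n = y.shape[1]
    g = nn.Sequential(nn.Linear(n + 1, 64), nn.Tanh(), nn.Linear(64, n))
    opt = torch.optim.Adam(g.parameters(), lr=lr)

    y_mid = 0.5 * (y[:-1] + y[1:])                   # midpoint states (y_j + y_{j+1}) / 2
    t_mid = (0.5 * (t[:-1] + t[1:])).unsqueeze(-1)   # midpoint times t_{j+1/2}
    dydt = (y[1:] - y[:-1]) / h                      # finite-difference derivative

    for _ in range(n_epochs):
        opt.zero_grad()
        residual = g(torch.cat([y_mid, t_mid], dim=-1)) - dydt
        loss = 0.5 * (residual ** 2).sum(dim=-1).mean()
        loss.backward()
        opt.step()
    return g
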
The second, less trivial case is when the time interval T is very large (relative to the smoothness of the problem). In such cases, one cannot use a simple finite difference to approximate the derivative. Instead, we note that since the differential equation has a unique solution and its trajectories do not intersect, we have that

y(t + T) = F(y(t), t, p) = \int_t^{t+T} f(y(\tau), \tau, p)\, d\tau.   (3)

Here F is the function that integrates the ODE from time t to t + T. We can then approximate F using an appropriate functional architecture, thus predicting y at a later time given observations on it at earlier times.

The two scenarios above fall under the case where we can predict the future based on the current values of y. Formally, we define the following:

Definition 1 Closed System. Let Y be the space of some observed data y from a dynamical system D. We say the system is closed if given y(t) we can uniquely recover y(t + T) for all t and finite bounded T ≤ τ, where τ is some constant. Practically, given the data y(t) and a constant ϵ, we can estimate a function F such that

∥y(t + T) − F(y(t), t, p)∥2 ≤ ϵ.   (4)

Definition 1 implies that we can learn the function F in equation 3, assuming that we have a sufficient amount of data of sufficient accuracy and an expressive enough architecture to approximate F(·, ·, ·). In this case, the focus of any ML based method should be given to the appropriate architecture that can express F, perhaps using some of the known structure of the problem. This concept and its limitations are illustrated using the predator prey model. This example demonstrates how a seemingly simple dynamical system can exhibit complex behavior through the interaction of just two variables. While theoretically deterministic, the system can become practically unpredictable if measured in regions where the trajectories are very close.

Example 1 The Predator Prey Model [Murray, 2007, Chapter 3]: The predator prey model is

dy_1/dt = p_1 y_1 − p_2 y_1 y_2,   dy_2/dt = p_3 y_1 y_2 − p_4 y_2.   (5)

Figure 1: Trajectories for the Predator Prey Model. Note that the trajectories get very close but do not intersect.

The trajectories for different starting points are shown in fig. 1. Assuming that we record data in sufficient accuracy, the trajectory at any time can be determined by the measurement point at earlier times, thus justifying the model in equation 3. Furthermore, since the system is periodic, it is easy to set a learning problem that can integrate the system for a long time.

Note, however, that the trajectories cluster at the lower left corner. This implies that even though the trajectories never meet in theory, they may be very close, so numerically, if the data is noisy, the corner can be thought of as a bifurcation point.

The system is therefore closed if the data y is accurate; however, the system is open if there is noise on the data such that late times can be significantly influenced by earlier times.

The predator prey model draws our attention to a different scenario that is much more common in practice. Assume that some data is missing or that the data is polluted by noise. For example, in weather prediction, where the primitive equations are integrated to approximate global atmospheric flow, it is common that we do not observe the pressure, temperature or wind speed in sufficient resolution. In this case, having information about the past is, in general, insufficient to predict the future. In this case we define the system as open.

While we could use classical machine learning techniques to solve closed systems, it is unreasonable to attempt to solve open systems with similar techniques. This is because the system does not have a unique solution given partial or noisy data. To demonstrate this we return to the predator prey model, but this time with partial or noisy data.

Example 2 The Predator Prey Model with Partial or Noisy Data: Consider again the predator prey model, but this time assume that we record y1 only. Now assume that we know that at t = 0, y1 = 1 and our goal is to predict the solution at time t = 200, which is very far into the future. Clearly, this is impossible to do. Nonetheless, we can run many simulations where y1 = 1 and y2 ∼ π(y2), where π is some density. For example, we choose π(y2) = U(0, 1). In this case we ran simulations to obtain the results presented in fig. 2 (left). Clearly, the results are far from unique. Nonetheless, the points at t = 200 can be thought of as samples from a probability density function. This probability density function resides on a curve. Our goal is now transformed from learning the function F(y, t) in equation 3 to learning a probability density function π(y(T)).

A second case that is of interest is when the data is noisy. Consider the case that y0_obs = y(0) + ϵ. In this case we consider y0 = [0.1, 0.3]⊤ and we contaminate it with Gaussian noise with 0 mean and 0.05 standard deviation. Again, we attempt to predict the data at T = 200. We plot the results from 1000 simulations in fig. 2 (right). Again, we see that the incomplete information transforms into a density estimation problem.

Figure 2: Left: The solution for y1(0) = 1 and y2(0) ∼ U(0, 1) at t = 200. Right: The solution for y(0) = [0.1, 0.3]⊤ + ϵ, where ϵ ∼ N(0, 0.05I), at t = 200.

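For concreteness, the ensembles in fig. 2 can be reproduced by plain numerical integration, with no learning involved. The sketch below assumes the Lotka–Volterra parameters listed later in section 5.1 and uses scipy's solve_ivp; both choices are illustrative rather than the exact setup behind the figure.

# Ensemble simulation for Example 2: fixed y1(0), random y2(0) ~ U(0, 1).
# The parameter values are illustrative, matching those quoted in section 5.1.
import numpy as np
from scipy.integrate import solve_ivp

p1, p2, p3, p4 = 2/3, 4/3, 1.0, 1.0

def predator_prey(t, y):
    y1, y2 = y
    return [p1 * y1 - p2 * y1 * y2, p3 * y1 * y2 - p4 * y2]

rng = np.random.default_rng(0)
T = 200.0
samples = []
for _ in range(1000):
    y0 = [1.0, rng.uniform(0.0, 1.0)]            # y1(0) = 1, y2(0) ~ U(0, 1)
    sol = solve_ivp(predator_prey, (0.0, T), y0, rtol=1e-8, atol=1e-10)
    samples.append(sol.y[:, -1])                  # state at t = 200
samples = np.asarray(samples)                     # (1000, 2): samples from pi_T
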
Example 2 represents a much more realistic scenario, since in reality data is almost always sampled only partially or at low resolution and includes noise.

Comment 1 It is important to note that in many cases one can transform an open system into a closed one by considering higher order dynamics. For example, in the predator prey model, one could (formally) eliminate y2 by setting

y2 = (p2 y1)⁻¹ (p1 y1 − ẏ1)

and then substituting y2 into the second equation, obtaining a second order system. This, however, requires data that is sufficiently accurate to be differentiated more than once.

Motivated by the two examples above, we now form the learning problem of probabilistic forecasting (see, e.g. Reich and Cotter [2015]).

Definition 2 Probabilistic forecasting. Probabilistic forecasting predicts future outcomes by providing a range of possible scenarios along with their associated probabilities, rather than a single point estimate. To be specific, let the initial vector y(0) ∼ π0(y) and assume that y(T) is obtained from integrating the dynamical system D. Probabilistic forecasting refers to estimating and sampling from the distribution πT(y).

This approach acknowledges that a unique prediction may not be attainable but allows for generating samples of future outcomes. In contrast, deterministic prediction assumes a closed system and highly accurate data, making it often inapplicable.

3 PROBABILISTIC FORECASTING AND STOCHASTIC INTERPOLATION

In this section, we discuss a version of Stochastic Interpolation (SI). SI is the technique that we use to propagate the uncertainty and also sample from the initial distribution.

To this end, let us consider the case that we observe a noisy or partial version of the system. In particular, let y(t) be a vector that is obtained from an unknown dynamical system D and let

q0(t) = Sy(t) + ϵ.   (6)

Here S is some sampling operator that typically reduces the dimension of y and ϵ ∼ N(0, σ²I). The distinction between y and q0 is important. The vector y represents the full state of the system, while q0 represents only partial knowledge of the system. For many systems, it is impossible to observe the complete state space. For example, for local weather prediction we may have temperature stations but not wind stations, where wind speed is essential to model the advection of heat. The inability to obtain information about the full state implies that most likely it is impossible to deliver a deterministic prediction, and thus stochastic prediction is the natural choice.

Given q0(t), our goal is to sample from the distribution of the observed partial state qT = q(t + T) ∼ πT(q), given samples from the distribution of the data at q0(t) ∼ π0(q). Clearly, we do not have a functional expression for either π0 or πT; however, if we are able to obtain many samples from both distributions, we can use them to build such a distribution. This assumption is unrealistic without further assumptions in the general case, since we typically have only a single time series. A few assumptions are common and we use them in our experiments. The most common assumptions are: i) The time dependent process is autoregressive. This implies that we can treat every instance (or instances) of the time series as q0 and use the future time series as qT. Such an assumption is very common although it may be unrealistic. ii) Periodicity and clustering. For many problems, especially problems that relate to earth science, there is a natural periodicity and data can be clustered using the natural cycle of the year.

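Under assumption (i), training pairs (q0, qT) are simply windows of the observed series. A minimal sketch of this pairing is given below; the array name, its shape, and the lead time are illustrative placeholders.

# Build autoregressive training pairs (q_0, q_T) from a single observed series.
# `series` is assumed to have shape (num_steps, *state_shape); `lead` is the
# forecast horizon T expressed in time steps.
import numpy as np

def make_pairs(series, lead):
    q0 = series[:-lead]        # states at time t
    qT = series[lead:]         # states at time t + T
    return q0, qT

# Example: one record per time step at 93 stations, 48-step lead time.
series = np.random.rand(10_000, 93)   # placeholder data
q0, qT = make_pairs(series, lead=48)
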
Given our ability to obtain data that represents samples from the probability at time 0 and at time T, we are able to use the data and provide a stochastic prediction. To this end, we note that stochastic prediction is nothing but a problem of probability transformation, which is at the base of mass transport [Benamou and Brenier, 2003]. While efficient solutions for low-dimensional problems have been addressed [Fisher, 1970, Villani et al., 2009], solutions for problems in high dimensions are only recently being developed.

One recent and highly effective technique to solve such a problem is stochastic interpolation (SI) [Albergo and Vanden-Eijnden, 2022], which is a technique that belongs to a family of flow matching methods [Lipman et al., 2022, Albergo et al., 2023, Song et al., 2020]. The basic idea of stochastic interpolation is to generate a simple interpolant for all possible points in q0 and qT. A simple interpolant of this kind is linear and reads

qt = t qT + (1 − t) q0,   (7)

where qT ∼ πT, q0 ∼ π0, and t ∈ [0, 1] is a parameter. The points qt are associated with a distribution πt(q) that converges to πT(q) at t = 1 and to π0(q) at t = 0. In SI one learns (estimates) the velocity

vt(qt) = q̇t = qT − q0   (8)

by solving the stochastic optimization problem

\min_\theta \; \frac{1}{2} \int_0^1 E_{q_0, q_T} \left\| v_\theta(q_t, t) - (q_T - q_0) \right\|^2 dt.   (9)

Here vθ(qt, t) is an interpolant for the velocity v that is given at points qt. A common model that is used for vθ(qt, t) is a deep neural network. While for simple problems such models can be composed of a simple network with a small number of hidden layers, for complex problems, especially problems that relate to space-time predictions, complex models such as U-Nets are commonly used [Ronneberger et al., 2015].

Assume that a velocity model vθ(qt, t) is trained and used to integrate q from time 0 to T, that is

dq/dt = vθ(q, t),   q(0) = q0.   (10)

Note that we use a deterministic framework rather than a stochastic framework. This allows us to incorporate high accuracy integrators that can take larger step sizes. However, this implies that the solution of the ODE given a single initial condition q0 gives a single prediction. In order to sample from the target distribution πT, we sample from q0 and then use the ODE (equation 10) to push many samples forward. We thus obtain M different samples for q0 (see next section) and use them in order to sample from πT.

Comment 2 Note that the ODE obtained for q is not physical. It is merely used to interpolate the density from time 0 to T. To demonstrate, we continue with the predator prey model.

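To make the procedure concrete, the sketch below strings together equations 7 to 10: pairs (q0, qT) are drawn, t is sampled uniformly, a small network is regressed onto the constant velocity qT − q0, and trained samples are pushed forward with an explicit integrator. The network size, optimizer, mini-batch size, and forward Euler stepping are illustrative assumptions; in our experiments U-Nets and higher-order integrators are used instead.

# Sketch of stochastic interpolation training (eq. 7-9) and the deterministic
# push-forward of samples (eq. 10). Sizes and optimizer settings are illustrative.
import torch
import torch.nn as nn

def train_si_velocity(q0_data, qT_data, n_iters=5000, lr=1e-3):
    # q0_data, qT_data: paired samples from pi_0 and pi_T, shape (N, d).
    d = q0_data.shape[1]
    v = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))
    opt = torch.optim.Adam(v.parameters(), lr=lr)
    for _ in range(n_iters):
        idx = torch.randint(0, q0_data.shape[0], (256,))
        q0, qT = q0_data[idx], qT_data[idx]
        t = torch.rand(q0.shape[0], 1)
        qt = t * qT + (1 - t) * q0                        # linear interpolant (eq. 7)
        loss = 0.5 * ((v(torch.cat([qt, t], dim=-1)) - (qT - q0)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return v

def push_forward(v, q0, n_steps=100):
    # Integrate dq/dt = v(q, t) over the interpolation parameter t in [0, 1]
    # (eq. 10) with forward Euler; q0 holds a batch of initial samples.
    q, dt = q0.clone(), 1.0 / n_steps
    with torch.no_grad():
        for k in range(n_steps):
            t = torch.full((q.shape[0], 1), k * dt)
            q = q + dt * v(torch.cat([q, t], dim=-1))
    return q   # samples approximately distributed as pi_T
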
4 SAMPLING THE PERTURBED STATES

Sampling from the distribution q0 in a meaningful way is not trivial. There are a number of approaches to achieve this goal. A brute force approach can search through the data for so-called similar states; for example, given a particular state q0, we can search for other states in the data set such that ∥q0 − qi∥2 ≤ ϵ. While this approach is possible in low dimension, it is difficult if not impossible in very high dimensions. For example, for global predictions, it is difficult to find two states that are very close everywhere. To this end, we turn to a machine learning approach that is designed to generate realistic perturbations.

We turn our attention to Variational Auto Encoders (VAEs). In particular, we use a flow based VAE [Dai and Wipf, 2019, Vahdat et al., 2021]. Variational Autoencoders are particularly useful. In the encoding stage, they push the data q0 from the original distribution π0(q) to a vector z that is sampled from a Gaussian, N(0, I). Next, in the decoding stage, the model pushes the vector z back to q0. Since the Gaussian space is convex, small perturbations can be applied to the latent vector z, which lead to samples from π0(q) that are centered around q0.

The difference between standard VAEs and flow based VAEs is that, given q0 ∼ π0, VAEs attempt to learn a transformation to a Gaussian space N(0, I) directly. However, as has been demonstrated in Ruthotto and Haber [2021], VAEs are only able to create maps to a latent state that is similar to but not Gaussian, making it difficult to perturb and sample from them. VAEs that are flow based can generate a Gaussian map to a much higher accuracy. Furthermore, such flows can learn the decoding from a Gaussian back to the original distribution. Flows that can transform the points between the two distributions are sometimes referred to as symmetric flows [Ho et al., 2020, Albergo et al., 2023].

Such flows can be considered as a special case of SI, an encoder-decoder scheme for a physical state q0 being encoded to a Gaussian state z. To this end, let us define the linear interpolant as

q_t = \begin{cases} (1 - 2t)\, q_0 + 2t\, z, & \text{if } t \in [0, \tfrac{1}{2}) \\ (2t - 1)\, q_0 + 2(1 - t)\, z, & \text{if } t \in [\tfrac{1}{2}, 1] \end{cases}   (11)

The velocities associated with the interpolant are simply

u_t = \frac{dq_t}{dt} = \begin{cases} 2(-q_0 + z), & \text{if } t \in [0, \tfrac{1}{2}) \\ 2(q_0 - z), & \text{if } t \in [\tfrac{1}{2}, 1] \end{cases}   (12)

Note that the flow starts at q0 ∼ π0 at t = 0, moves towards z ∼ N(0, I), and arrives at z at t = 1/2. This is the encoding stage. In the second part of the flow we learn a decoding map that pushes the points z back to q0. Note also that u is symmetric about t = 1/2, which is used in training. Training these models is straightforward and is done in a similar way to the training of our stochastic interpolation model, namely we solve a stochastic optimization problem of the form

\min_\theta \; \frac{1}{2} \int_0^1 E_{q_0, z} \left\| u_\theta(q_t, t) + 2(q_0 - z) \right\|^2 dt.   (13)

Note that we can use the same velocity for the reverse process. Given the velocity uθ, we now generate perturbed samples in the following way. First, we integrate the ODE to t = 1/2, that is, we solve the ODE

dq/dt = uθ(q, t),   q(0) = q0,   t ∈ [0, 1/2].   (14)

This yields the state q(1/2) ∼ N(0, I). We then perturb the state

q̂(1/2) = q(1/2) + σω,   (15)

where ω ∼ N(0, I) and σ is a hyper-parameter. We then integrate the ODE equation 14 from 1/2 to 1 starting from q̂(1/2), obtaining a perturbed state q̂0. The integration is done in batch mode, that is, we integrate M vectors simultaneously to obtain M samples from the initial state around q0. These states are then used to obtain M samples from qT as explained in section 3.

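The perturbation procedure of equations 14 and 15 can be summarized compactly once the symmetric velocity uθ of equation 13 has been trained: encode to t = 1/2, add Gaussian noise of size σ, and decode back to t = 1. The sketch below is a hypothetical minimal implementation; the Euler stepping, the batch handling, and the assumption that u_theta(q, t) accepts a batch q and a scalar t are our own choices.

# Perturbed-state sampling with a trained symmetric flow (eq. 11-15).
# `u_theta(q, t)` is assumed to return the learned velocity for a batch q at time t.
import torch

def perturb_states(u_theta, q0, sigma=0.1, n_samples=64, n_steps=50):
    # Replicate the base state into a batch of M = n_samples copies.
    q = q0.unsqueeze(0).repeat(n_samples, 1)
    dt = 0.5 / n_steps
    # Encode: integrate eq. 14 from t = 0 to t = 1/2, arriving near z ~ N(0, I).
    for k in range(n_steps):
        q = q + dt * u_theta(q, k * dt)
    # Perturb the (approximately Gaussian) latent state, eq. 15.
    q = q + sigma * torch.randn_like(q)
    # Decode: integrate the same velocity field from t = 1/2 to t = 1.
    for k in range(n_steps):
        q = q + dt * u_theta(q, 0.5 + k * dt)
    return q   # M perturbed samples centred around q0
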
5 EXPERIMENTS

Our goal in this section is to apply the proposed framework to a set of various temporal datasets and show how it can be used to estimate the uncertainty in a forecast. We experiment with two synthetic datasets that can be fully analyzed and a realistic dataset for weather prediction. Further experiments on other data sets can be found in the appendix.

5.1 DATASETS

We now describe the datasets considered in our experiments.

Lotka–Volterra predator–prey model: This is a nonlinear dynamical system. It follows equation 5, where p1 = 2/3, p2 = 4/3, p3 = 1 and p4 = 1. The initial distribution of states is Gaussian with a mean of y(0) = [0.1, 0.3]⊤ and a standard deviation of 0.05; the final distribution of states is the distribution obtained after fine numerical integration over t ∈ [0, 200]. Note that the length of the integration time is very long, which implies that the output probability space is widely spread; this makes this simple problem of predicting the final distribution from an initial distribution very difficult.

MovingMNIST: Moving MNIST is a variation of the well known MNIST dataset [Srivastava et al., 2015]. The dataset is a synthetic video dataset designed to test sequence prediction models. It features 20-frame sequences where two MNIST digits move with random trajectories. Following the setup described in Srivastava et al. [2015], we trained our model on 10,000 randomly generated trajectories and then used the standard publicly available dataset of 10,000 trajectories for testing.

WeatherBench: The dataset used for global weather prediction in this work is derived from WeatherBench [Rasp et al., 2020], which provides three scaled-down subsets of the original ERA5 reanalysis data [Hersbach et al., 2020]. The specific variables and model configurations are detailed in appendix B.1. Our forecasting models use the 6-hour reanalysis window as the model time step.

The statistics and configurations for each of the datasets during experiments are mentioned in appendix B. We have two additional real datasets, Vancouver93 and CloudCast, in the appendix for additional analysis. In the next section we give details for experiments and results on the predator-prey model, the MovingMNIST and the WeatherBench datasets.

5.2 RESULTS

Our goal is to learn a homotopy based deterministic functional map for complex timeseries to predict the future state from the current state, but with uncertainty. To do so, we evaluate an ensemble of predictions from an ensemble of initial states. The initial states are obtained by using the auto-encoder described in Section 4. We then push those states forward, obtaining an ensemble of final states. We then report on the statistics of the final states.

The Predator Prey Model: We apply SI to learn a deterministic functional map on the predator-prey model of section 5.1 for a long time integration. The initial state of the data is sampled as a group of noisy initial states, that is, we consider the case that q = y(0) + ϵ. In this case we consider y(0) = [0.1, 0.3]⊤ and we contaminate it with the noise vector ϵ ∼ N(0, 0.05I). Our goal is to predict the state q(T) at time T = 200. The distribution of 250 noisy points for q(0) and the final distributions for q(200) using numerical integration and SI can be seen in fig. 3. For this simple case, a multilayer perceptron (MLP) is used to learn the mapping. It can be noticed that SI is able to match a sample from the final complex distribution; specifically, the areas with high probabilities are captured very well. Some more statistical comparisons of the variables' states are reported in appendix E.1 to support our results. Appendix D shows some results on Vancouver93, where the learned mapping from a simple distribution to a complex conditional final distribution is clearly evident.

Figure 3: Comparison of the actual final distribution and that obtained using SI on the predator-prey model. Trajectories for the transport learned by SI are in blue. Note that the trajectories are not physical.

The same concept can be extended to high-dimensional spatiotemporal image space instead of vector space without any loss of generality. Like in Lipman et al. [2022], Karras et al. [2024], Bieder et al. [2024], we use a U-Net architecture from Dhariwal and Nichol [2021] to learn the velocity.

Moving MNIST: We use the Moving MNIST to train a network (U-Net) to predict 10 time frames into the future, given 10 past frames. The hyperparameter settings of the U-Net are in appendix G.1. A set of initial states is obtained using the auto-encoder, creating 50 initial and final states. Using SI and the algorithm proposed in section 3, we sample predictions of final states. The predictions can be observed in fig. 4. The predictions are similar yet different from the true final state. Statistical comparisons of the ensemble predictions using SI with the ensemble of real targets can be seen in section 5.2 and in appendix E.3.

WeatherBench: We now use the WeatherBench dataset to build a forecasting model using SI. We use a similar U-Net as in Moving MNIST for learning the functional mapping. The hyperparameter settings are in appendix G.1. A particular time step is selected from the test set and the 77 most similar samples are searched from the test set to collect the initial states, where similarity is measured by the L2 norm. We hence have the ensemble of actual initial and final states of the true data. This ensemble is used as ground truth and allows an unbiased testing of our method. Using our technique, we generate an ensemble of initial and final states. A few samples of generated states, along with the mean and standard deviation of the ensemble of 78 forecasts, can be seen in fig. 5. We use the standard variables U10 and T850 for visual observation here. Some statistical comparisons of our ensemble prediction, for the variables Z500, T2m, U10 and T850, can be seen in section 5.2 and in appendix E.5.

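For reference, the search for the most similar test states described above amounts to an L2 nearest-neighbour query. A minimal numpy sketch follows; the array layout (flattened states) and the number of neighbours are illustrative.

# Collect an ensemble of initial states by L2-nearest-neighbour search in the test set.
# `test_states` is assumed to have shape (num_samples, num_features) after flattening.
import numpy as np

def nearest_states(test_states, query_index, k=77):
    query = test_states[query_index]
    dists = np.linalg.norm(test_states - query, axis=1)   # L2 distance to every state
    order = np.argsort(dists)
    # Skip index 0 (the query itself) and return the k closest remaining states.
    return order[1:k + 1]
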
Metrics: Below, we describe the metrics used to evaluate the performance of our model across different datasets. For each ensemble of predictions generated by our model, there is a corresponding ensemble of target states. The goal is to ensure that the statistical characteristics of the predicted ensemble closely match those of the target ensemble. The simplest metric for comparing these high-dimensional distributions is the mean score, which represents the average of the pixel values in a state. Similarly, we compute the standard deviation score. For two distributions to be considered similar, these scores should ideally align: the more similar the scores of two distributions, the more similar the distributions. Table 1 presents the mean and standard deviation scores for both the target ensemble and the predicted ensemble, demonstrating a strong alignment between the two. Inspired by methods used in turbulent flow field analysis, we further compare the predicted and target ensembles by computing their mean state and standard deviation state. To quantify the similarity between these states, we employ standard image comparison metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and Structural Similarity Index Measure (SSIM). These metrics are applied to both the ensemble mean states and the ensemble standard deviation states, as shown in Tables 2 and 3. The lower the MSE and MAE, and the higher the SSIM, the better the uncertainty is captured.

Data            True Mean Score   Our Mean Score   True Std Dev Score   Our Std Dev Score
Predator–Prey   7.55e-1           7.43e-1          1.04                 1.07
MovingMNIST     6.01e-2           5.58e-2          2.21e-1              1.87e-1
WeatherBench    1.36e4            1.36e4           9.15e1               9.66e1

Table 1: Comparison of mean score and standard deviation score of ensemble predictions.

Data            MSE(↓)   MAE(↓)   SSIM(↑)
Predator–Prey   8.0e-3   8.8e-2   NA
MovingMNIST     2.5e-3   1.8e-2   0.843
WeatherBench    1.9e4    5.2e1    0.868

Table 2: Accuracy of our ensemble mean.

Data            MSE(↓)   MAE(↓)   SSIM(↑)
Predator–Prey   2.7e-2   1.6e-1   NA
MovingMNIST     3.6e-3   2.5e-2   0.744
WeatherBench    1.1e4    3.6e1    0.608

Table 3: Accuracy of our ensemble standard deviation.

As can be readily observed in the tables, our approach provides an accurate representation of the mean and standard deviation, which are the first and second order statistics of each data set. The definitions of the metrics are in appendix C.

Perturbation of Physical States using SI: Perturbation of a physical state in a dynamical system refers to a deliberate natural deviation in a system's parameters or state variables. Using the algorithm proposed in section 4, we generate perturbed states for MovingMNIST and WeatherBench. All the generated perturbed states are very realistic. Some sample perturbed states from MovingMNIST can be seen in fig. 6. Physical disturbances like bending or deformation of digits, and changes in position and shape, are clearly captured. Similarly, we generated perturbed states for WeatherBench as well. Some samples of perturbed states from WeatherBench's U10 and T850 variables (as these are easier for humans to visualise) can be seen in fig. 7 and fig. 8, respectively. As can be seen, the perturbations are very much physical. Additional studies can be seen in appendix F.

Figure 4: Six of 50 Moving MNIST trajectory predictions obtained using SI and their ensemble mean and standard deviation.

Figure 5: Two sample stochastic forecasts of U10 and T850 after 2 days obtained using SI and the ensemble mean and standard deviation for 78 forecasts.

Figure 6: Three random perturbed states of a Moving MNIST sample state.

Figure 7: Three random perturbed states of a WeatherBench U10 state.

Figure 8: Perturbed states of a WeatherBench T850 state.

6 CONCLUSION

Probabilistic forecasting is an important topic for many scientific problems. It expands traditional machine learning techniques beyond paired inputs and outputs, to mapping between distributions. While generative AI has been focused on mapping Gaussians to data, similar methodologies can be applied for the generation of predictions given current states. In this work we have coupled stochastic interpolation, to propagate the current state distribution, with an auto-encoder that allows us to sample from the current distribution. We have shown that this approach can lead to a computationally efficient way to sample future states even for long integration times and for highly non-Gaussian distributions in high dimensions.

References

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

J. D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. SIAM J. Math. Analysis, 35:61–97, 2003.

Florentin Bieder, Julia Wolleb, Alicia Durrer, Robin Sandkuehler, and Philippe C. Cattin. Memory-efficient 3D denoising diffusion models for medical image processing. In Medical Imaging with Deep Learning, volume 227 of Proceedings of Machine Learning Research, pages 552–567. PMLR, 10–12 Jul 2024.

Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan Weyn, Haiyu Dong, Anna Vaughan, et al. Aurora: A foundation model of the atmosphere. arXiv preprint arXiv:2405.13063, 2024.

Michael Brin and Garrett Stuck. Introduction to Dynamical Systems. Cambridge University Press, 2002.

Ashesh Chattopadhyay, Pedram Hassanzadeh, and Devika Subramanian. Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network. Nonlinear Processes in Geophysics, 27(3):373–389, 2020.

Yifan Chen, Mark Goldstein, Mengjian Hua, Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Probabilistic forecasting with stochastic interpolants and Föllmer processes. arXiv preprint arXiv:2403.13724, 2024.

Sibo Cheng, César Quilodrán-Casas, Said Ouala, Alban Farchi, Che Liu, Pierre Tandeo, Ronan Fablet, Didier Lucor, Bertrand Iooss, Julien Brajard, et al. Machine learning with data assimilation and uncertainty quantification for dynamical systems: a review. IEEE/CAA Journal of Automatica Sinica, 10(6):1361–1387, 2023a.

Sibo Cheng, César Quilodrán-Casas, Said Ouala, Alban Farchi, Che Liu, Pierre Tandeo, Ronan Fablet, Didier Lucor, Bertrand Iooss, Julien Brajard, Dunhui Xiao, Tijana Janjic, Weiping Ding, Yike Guo, Alberto Carrassi, Marc Bocquet, and Rossella Arcucci. Machine learning with data assimilation and uncertainty quantification for dynamical systems: A review. IEEE/CAA Journal of Automatica Sinica, 10(6):1361–1387, 2023b.

Bin Dai and David Wipf. Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789, 2019.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in Statistics: Methodology and Distribution, pages 66–70. Springer, 1970.

Deepak Ganesan, Sylvia Ratnasamy, Hanbiao Wang, and Deborah Estrin. Coping with irregular spatio-temporal sampling in sensor networks. ACM SIGCOMM Computer Communication Review, 34(1):125–130, 2004.

Alan J. Geer and Peter Bauer. Observation errors in all-sky data assimilation. Quarterly Journal of the Royal Meteorological Society, 137(661):2024–2037, 2011.

Amin Ghadami and Bogdan I. Epureanu. Data-driven prediction in dynamical systems: recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 380(2229):20210213, 2022.

John Guckenheimer and Philip Holmes. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields, volume 42. Springer Science & Business Media, 2013.

John Guckenheimer and Patrick Worfolk. Dynamical systems: Some computational problems. In Bifurcations and Periodic Orbits of Vector Fields, pages 241–277. Springer, 1993.

Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, Adrian Simmons, Cornel Soci, Saleh Abdalla, Xavier Abellan, Gianpaolo Balsamo, Peter Bechtold, Gionata Biavati, Jean Bidlot, Massimo Bonavita, Giovanna De Chiara, Per Dahlgren, Dick Dee, Michail Diamantakis, Rossana Dragani, Johannes Flemming, Richard Forbes, Manuel Fuentes, Alan Geer, Leo Haimberger, Sean Healy, Robin J. Hogan, Elías Hólm, Marta Janisková, Sarah Keeley, Patrick Laloyaux, Philippe Lopez, Cristina Lupu, Gabor Radnoti, Patricia de Rosnay, Iryna Rozum, Freja Vamborg, Sebastien Villaume, and Jean-Noël Thépaut. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

E Kalnay. Atmospheric Modeling, Data Assimilation and Predictability, volume 341. Cambridge University Press, 2003.

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24174–24184, June 2024.

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. GraphCast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794, 2022.

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.

Christian Sebastian Lamprecht. Meteostat API. https://[Link]/en/.

Sylvie Leroyer, Stéphane Bélair, Syed Z Husain, and Jocelyn Mailhot. Subkilometer numerical weather prediction in an urban coastal area: A case study over the Vancouver metropolitan area. Journal of Applied Meteorology and Climatology, 53(6):1433–1453, 2014.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

Peter Lynch. The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7):3431–3444, 2008.

Parviz Moin and Krishnan Mahesh. Direct numerical simulation: a tool in turbulence research. Annual Review of Fluid Mechanics, 30(1):539–578, 1998.

James D Murray. Mathematical Biology: I. An Introduction, volume 17. Springer Science & Business Media, 2007.

Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. ClimaX: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023.

A. H. Nielsen, A. Iosifidis, and H. Karstoft. CloudCast: A satellite-based dataset and baseline for forecasting clouds. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:3485–3494, 2021. doi: 10.1109/JSTARS.2021.3062936.

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022a.

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022b.

Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey. WeatherBench: A benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems, 12(11), 2020.

Sebastian Reich and Colin Cotter. Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press, 2015.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.

Lars Ruthotto and Eldad Haber. An introduction to deep generative modeling. GAMM-Mitteilungen, 44(2):e202100008, 2021.

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 2015.

Ralph C Smith. Uncertainty Quantification: Theory, Implementation, and Applications. SIAM, 2024.

Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 843–852. [Link], 2015.

Andrew Stuart and Anthony R Humphries. Dynamical Systems and Numerical Analysis, volume 2. Cambridge University Press, 1998.

Pierre NV Tu. Dynamical Systems: An Introduction with Applications in Economics and Biology. Springer Science & Business Media, 2012.

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.

Phillip Vannini, Dennis Waskul, Simon Gottschalk, and Toby Ellis-Newstead. Making sense of the weather: Dwelling and weathering on Canada's rain coast. Space and Culture, 15(4):361–380, 2012.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.

Cédric Villani et al. Optimal Transport: Old and New, volume 338. Springer, 2009.

NP Wedi, P Bauer, W Denoninck, M Diamantakis, M Hamrud, C Kuhnlein, S Malardel, K Mogensen, G Mozdzynski, and PK Smolarkiewicz. The modelling infrastructure of the Integrated Forecasting System: Recent advances and future challenges. European Centre for Medium-Range Weather Forecasts, 2015a.

NP Wedi, P Bauer, W Denoninck, M Diamantakis, M Hamrud, C Kuhnlein, S Malardel, K Mogensen, G Mozdzynski, and PK Smolarkiewicz. The modelling infrastructure of the Integrated Forecasting System: Recent advances and future challenges. 2015b.

Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Ermon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency. arXiv preprint arXiv:2407.02398, 2024.

Niloufar Zakariaei, Siddharth Rout, Eldad Haber, and Moshe Eliasof. Advection augmented convolutional neural networks. arXiv preprint arXiv:2406.19253, 2024.

Appendix

Siddharth Rout1,2 Eldad Haber1,2 Stéphane Gaudreault3

1 Institute of Applied Mathematics, University of British Columbia, Vancouver, BC, Canada
2 Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia, Vancouver, BC, Canada
3 Recherche en prévision numérique atmosphérique, Environnement et Changement climatique Canada, Dorval, QC, Canada

A ADDITIONAL DATASETS

A.1 DATASETS

• Vancouver 93 Temperature Trend (Vancouver93): This is a real nonlinear chaotic high dimensional dynamical system, in which only a single state (temperature) is recorded. The daily average temperatures at 93 different weather stations in and around Vancouver, BC, Canada are captured for the past 34 years from sources such as the National Oceanic and Atmospheric Administration (NOAA), the Government of Canada, and Germany's national meteorological service (DWD), through Meteostat's python library [Lamprecht]. Essentially, it is a time series of 12,419 records at 93 stations. The complex urban coastal area of Vancouver, with coasts, mountains, valleys, and islands, makes it an interesting location where forecasting is more difficult than in most regions [Vannini et al., 2012, Leroyer et al., 2014]. The historical temperature alone is insufficient to predict the temperature in the future, as this requires additional variables like precipitation, pressure, and many more variables at higher resolution. This makes the dataset fit the proposed framework.

• CloudCast [Nielsen et al., 2021]: A real physical nonlinear chaotic spatiotemporal dynamical system. The dataset comprises 70,080 satellite images capturing 11 different cloud types for multiple layers of the atmosphere, annotated on a pixel level every 15 minutes from January 2017 to December 2018, with a resolution of 128×128 pixels (15×15 km).

B DATASETS

Table 4 describes the statistics, train-test splits, image sequences used for modeling, and frame resolutions.

Dataset Ntrain Ntest (C, H, W ) History Prediction


Predator-Prey 10,000 500 (1, 1, 2) 1 1
Vancouver93 9,935 2,484 (93, 1, 1) 5 (5 days) 5 (5 days)
Moving MNIST 10,000 10,000 (1, 64, 64) 10 10
CloudCast 52,560 17,520 (1, 128, 128) 4 (1 Hr) 4 (1 Hr)
WeatherBench (5.625◦ ) 324,336 17,520 (48, 32, 64) 8 (48 Hrs) 8 (48 Hrs)

Table 4: Datasets statistics, training and testing splits, image sequences, and resolutions
B.1 WEATHERBENCH

The original ERA5 dataset [Hersbach et al., 2020] has a resolution of 721×1440, recorded every hour over almost 40 years from 1979 to 2018, across 37 vertical levels on a 0.25° latitude-longitude grid. The raw data is extremely bulky for running experiments, even with powerful computing resources. We hence use the typical reduced resolutions (32×64, i.e. a 5.625° latitude-longitude grid, and 128×256, i.e. a 1.40625° latitude-longitude grid) as per the transformations made by Rasp et al. [2020]. We, however, stick to using the configuration set by Nguyen et al. [2023] for standard comparison. The prediction model considers 6 atmospheric variables at 7 vertical levels, 3 surface variables, and 3 constant fields, resulting in 48 input channels in total, for predicting four target variables. These target variables are considered by most medium-range NWP models, such as the state-of-the-art IFS [Wedi et al., 2015a], and are often used for benchmarking in previous deep learning work as well, such as Lam et al. [2023] and Pathak et al. [2022a]: geopotential at 500 hPa (Z500), temperature at 850 hPa (T850), temperature at 2 meters from the ground (T2m), and zonal wind speed at 10 meters from the ground (U10). We use a leap period of 6 hours as a single timestep; hence our model takes in 8 timesteps (48 hours or 2 days) to predict the next 8 timesteps (48 hours or 2 days). Following the same setting, we used the data from 1979 to 2015 for training, the year 2016 as the validation set, and the years 2017 and 2018 as the testing set. The details of the variables considered are in Table 5.

Type          Variable Name                Abbrev.   ECMWF ID   Levels

Static        Land-sea mask                LSM       172
              Orography                    OROG      228002
              Soil Type                    SLT       43
Single        2 metre temperature          T2m       167
              10 metre U wind component    U10       165
              10 metre V wind component    V10       166
Atmospheric   Geopotential                 Z         129        50, 250, 500, 600, 700, 850, 925
              U wind component             U         131        50, 250, 500, 600, 700, 850, 925
              V wind component             V         132        50, 250, 500, 600, 700, 850, 925
              Temperature                  T         130        50, 250, 500, 600, 700, 850, 925
              Specific humidity            Q         133        50, 250, 500, 600, 700, 850, 925
              Relative humidity            R         157        50, 250, 500, 600, 700, 850, 925

Table 5: Variables considered for global weather prediction model.

C COMPARISON METRICS

C.1 ENSEMBLE MEAN AND ENSEMBLE STANDARD DEVIATION

Let S = {I_1, .., I_N} be a set of images, I_i ∈ R^{C×H×W}, such that i, j, k, l, N, C, H, W ∈ N, i ≤ N, j ≤ C, k ≤ H, and l ≤ W.

An image I_i is defined as

I_i = {P_i^{j,k,l} ∈ R | j ≤ C, k ≤ H, l ≤ W},

where P_i^{j,k,l} is called a pixel in an image I_i.

Ensemble mean state (image), I_EM ∈ R^{C×H×W}, is defined as

I_{EM} = \left\{ P_{EM}^{j,k,l} = \frac{1}{N} \sum_{i=1}^{N} P_i^{j,k,l} \;\middle|\; P_i^{j,k,l} \in I_i \right\}.   (16)

Ensemble mean score V_EM is defined as

V_{EM} = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{j=1}^{C} \sum_{k=1}^{H} \sum_{l=1}^{W} P_i^{j,k,l},   (17)

where P_i^{j,k,l} ∈ I_i.

Ensemble standard deviation state (image), I_ES ∈ R^{C×H×W}, is defined as

I_{ES} = \left\{ P_{ES}^{j,k,l} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( P_i^{j,k,l} - P_{EM}^{j,k,l} \right)^2} \;\middle|\; P_i^{j,k,l} \in I_i,\; P_{EM}^{j,k,l} \in I_{EM} \right\}.   (18)

Ensemble variance score V_ES is defined as

V_{ES} = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{j=1}^{C} \sum_{k=1}^{H} \sum_{l=1}^{W} \left( P_i^{j,k,l} - V_{EM} \right)^2,   (19)

where P_i^{j,k,l} ∈ I_i.

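The ensemble statistics of equations 16 to 19 reduce to averages over the ensemble and pixel axes. A short numpy sketch is given below, assuming the ensemble is stored as an array of shape (N, C, H, W); the function name is our own.

# Ensemble mean/std states (eq. 16, 18) and mean/variance scores (eq. 17, 19).
# `ensemble` is assumed to be a numpy array of shape (N, C, H, W).
import numpy as np

def ensemble_statistics(ensemble):
    I_EM = ensemble.mean(axis=0)                    # ensemble mean state, eq. 16
    V_EM = ensemble.mean()                          # ensemble mean score, eq. 17
    I_ES = ensemble.std(axis=0)                     # ensemble std-dev state, eq. 18
    V_ES = ((ensemble - V_EM) ** 2).mean()          # ensemble variance score, eq. 19
    return I_EM, V_EM, I_ES, V_ES
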

C.2 MSE, MAE, SSIM

MSE = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} (y - \hat{y})^2   (20)

MAE = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} |y - \hat{y}|   (21)

SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}   (22)

SSIM = \frac{1}{N} \sum_{i=1}^{N} SSIM(x, y)   (23)

where:

N is the number of images in the dataset,


H is the height of the images,
W is the width of the images,
C is the number of channels (e.g., 3 for RGB images),
y is the true pixel value at position (i, h, w, c), and
ŷ is the predicted pixel value at position (i, h, w, c).
µx is the average of x,
µy is the average of y,
σx2 is the variance of x,
σy2 is the variance of y,
σxy is the covariance of x and y,
C1 = (K1 L)2 and C2 = (K2 L)2 are two variables to stabilize the division with weak denominator,
L is the dynamic range of the pixel values (typically, this is 255 for 8-bit images),
K1 and K2 are small constants (typically, K1 = 0.01 and K2 = 0.03).
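
These metrics can be computed directly from the definitions. The sketch below implements equations 20 to 23 with numpy, using the global (whole-image) form of SSIM written above; the array layout and the constants K1 = 0.01, K2 = 0.03 and L = 255 follow the conventions listed here, and the function names are our own.

# MSE (eq. 20), MAE (eq. 21) and the global SSIM of eq. 22-23, computed with numpy.
# `y_true` and `y_pred` are assumed to be arrays of shape (N, H, W, C) on a common scale.
import numpy as np

def mse(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

def mae(y_true, y_pred):
    return np.abs(y_true - y_pred).mean()

def global_ssim(y_true, y_pred, L=255.0, K1=0.01, K2=0.03):
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    ssim_vals = []
    for x, y in zip(y_true, y_pred):                       # one image pair at a time
        mu_x, mu_y = x.mean(), y.mean()
        var_x, var_y = x.var(), y.var()
        cov_xy = ((x - mu_x) * (y - mu_y)).mean()
        ssim_vals.append(((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) /
                         ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)))
    return float(np.mean(ssim_vals))                        # eq. 23: average over images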

D ADDITIONAL RESULTS

Probabilistic Forecasting for Vancouver Temperature: We now use the Vancouver93 dataset to build a forecasting
model using SI. The approximating function is a residual-network-based TCN, inspired by Bai et al. [2018], with a time
embedding on each layer; a minimal sketch of such a block is given below. The model takes the temperatures of the last
5 days as input and predicts the temperatures of the next 5 days. For this problem, it is easy to visualize how SI learns the
transport from one distribution to another.
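
The following is a minimal sketch of a residual TCN block with an additive time embedding, in the spirit of Bai et al. [2018]; the layer sizes, the embedding scheme, and all names are illustrative assumptions rather than the exact architecture used.

```python
import torch
import torch.nn as nn

class ResidualTCNBlock(nn.Module):
    """One dilated causal convolution block with a residual connection
    and an additive embedding of the interpolation time t in [0, 1]."""

    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding keeps the convolution causal
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.time_mlp = nn.Sequential(nn.Linear(1, channels), nn.SiLU(),
                                      nn.Linear(channels, channels))
        self.act = nn.ReLU()

    def forward(self, x, t):
        # x: (batch, channels, length); t: (batch, 1) interpolation time
        temb = self.time_mlp(t).unsqueeze(-1)            # (batch, channels, 1), broadcast over time
        h = nn.functional.pad(x, (self.pad, 0))
        h = self.act(self.conv1(h)) + temb
        h = nn.functional.pad(h, (self.pad, 0))
        h = self.conv2(h)
        return self.act(x + h)                           # residual connection

# Example usage (hypothetical shapes): 5-day sequences with 64 channels.
# block = ResidualTCNBlock(64)
# out = block(torch.randn(8, 64, 5), torch.rand(8, 1))
```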
In the first experiment with this dataset, we take all the cases from the test set where the temperature at Station 1 is 20°C
and compare them with the actual distribution of temperatures at the same stations after 10 days. If we look at the phase
diagram between any two stations for the distributions of states 10 days apart, we can see that the final distributions
obtained using SI are very similar to the distributions obtained from the test data. These distributions are often strongly
skewed: Figure 9 shows an extremely skewed distribution matching the state after 5 days very well, and Figure 10 shows
the distribution matching the state after 10 days very well. Appendix E.2 shows the matching histograms of the
distributions for the latter case. Metrics comparing the results are given in Tables 6 to 8.

Figure 9: Phase diagram showing initial state and final states from data and model using SI after 5 days in between station 2
and station 10.

Figure 10: Phase diagram showing initial state and final states from data and model using SI after 10 days in between station
20 and station 1.

Data         True Mean   Our Mean   True Std. Dev.   Our Std. Dev.

Vancouver93  1.85e1      1.89e1     4.05             3.97
CloudCast    1.53e-2     1.51e-2    9.84e-2          9.84e-2

Table 6: Comparison of mean and standard deviation of ensemble predictions.

Data MSE MAE SSIM


Vancouver93 5.9e-1 6.1e-1 NA
CloudCast 2.0e-7 6.0e-4 0.99

Table 7: Similarity metrics for ensemble average.

Data MSE MAE SSIM


Vancouver93 1.8e-1 3.4e-1 NA
CloudCast 4.9e-7 6.0e-4 0.85

Table 8: Similarity metrics for ensemble standard deviation.

E STATISTICAL COMPARISON OF OUR ENSEMBLE PREDICTIONS

E.1 PREDATOR-PREY MODEL

Figure 11 shows the histograms of y1 and y2. The histogram of each actual state variable matches very well with the
histogram obtained from our results using SI.

Figure 11: Histograms of actual final distribution of y1 (left) and y2 (right) compared with that obtained using SI on the
predator-prey model

E.2 VANCOUVER93

Figure 12 shows the histograms of the two distributions. They are very similar, suggesting that SI efficiently learns the
transport map to the distribution of temperatures at a station, even though a deterministic solution to this problem is very hard to obtain.

Figure 12: Histograms of distributions observed after 10 days and distribution obtained using SI excluding outliers.

E.3 MOVING MNIST

Let us take the first 10 frames of a case from the Moving MNIST test set as history, and the actual 10 subsequent frames of
the sequence as the future. For such a case, Figures 13 and 14 show the statistical images for small and large perturbations,
respectively. 100 perturbed samples are taken and their final states are predicted using SI. The figures show how similar
the ensemble means and ensemble standard deviations are, which indicates that the uncertainty designed into Moving
MNIST is captured very well.

E.4 CLOUDCAST

E.4.1 Predictions

As per the default configuration, a sequence of 4 timeframes from the CloudCast test set is used to predict the 4 subsequent
timeframes. Note that each time step corresponds to 15 minutes of real time, so the changes we observe are differential.
For one such case, Figure 15 shows twelve different predictions; they are very similar but not identical once we zoom in at
high resolution. Figure 16 shows the predictions after 3 hours, obtained by autoregression (feeding predicted frames back
in as history, as sketched below), to illustrate how the clouds become noticeably different: the small cloud patches differ
visibly in shape and size. Since clouds form a complex, nonlinear dynamical system, differences that are barely noticeable
after 1 hour can lead to very noticeably different predictions after 3 hours.
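
A minimal sketch of such an autoregressive rollout, extending a 1-hour (4-step) forecast to 3 hours, is given below; the model call, tensor shapes, and step counts are illustrative assumptions.

```python
import torch

def autoregressive_rollout(model, history, n_steps=12):
    """history: (batch, window, H, W), the last observed frames (15-min spacing).
    Repeatedly predicts one window of frames and feeds it back as the new
    history until n_steps frames (e.g. 12 x 15 min = 3 h) have been produced."""
    preds, state, total = [], history, 0
    with torch.no_grad():
        while total < n_steps:
            nxt = model(state)          # (batch, window, H, W) predicted frames
            preds.append(nxt)
            total += nxt.shape[1]
            state = nxt                 # predicted frames become the new history
    return torch.cat(preds, dim=1)[:, :n_steps]
```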

E.4.2 Statistical comparison

Figure 17 shows the statistical images for outputs generated on the CloudCast test set using SI. Similar samples are taken
as the set of initial states, and their final states are predicted using SI. The figures show how similar the ensemble means
and ensemble standard deviations are.

Figure 13: Statistical comparison of Moving MNIST sequences predicted using SI with random perturbed initial states.

E.5 WEATHERBENCH

E.5.1 Predictions

Figure 18 shows some of the predictions obtained using SI with mild perturbations.

E.5.2 Statistical comparison

Figure 19 shows the statistical comparison of ensemble predictions with 20 mildly perturbed states, and Figure 20 shows
the same comparison with 78 strongly perturbed states. The means and standard deviations of the states for the 2-day
(48-hour) predictions can be easily compared. Tables 9 and 10 show the metrics comparing the accuracy of our ensemble
mean and ensemble standard deviation for the 6-hour and 2-day predictions, respectively.

Ensemble Mean Ensemble Std. Dev.


Variable True Score Our Score MSE(↓) MAE(↓) SSIM(↑) True Score Our Score MSE(↓) MAE(↓) SSIM(↑)
Z500 5.40e4 5.40e4 2.00e4 1.04e2 0.992 3.96e2 3.74e2 1.83e4 9.16e1 0.850
T2m 2.78e2 2.78e2 2.62 9.33e-1 0.986 1.78 1.84 8.90e-1 5.60e-1 0.745
U10 -1.85e-1 -2.70e-1 1.27 8.40e-1 0.886 2.51 2.34 1.12 7.35e-1 0.708
T850 -2.43e-2 -3.74e-2 1.62 9.79e-1 0.812 3.75 3.58 1.29 8.26e-1 0.774

Table 9: Similarity metrics for the WeatherBench ensemble prediction after 6 hours.

Figure 14: Statistical comparison of Moving MNIST sequence with random highly perturbed initial states.

Figure 15: Random predictions of differential change from a single sequence from cloudcast testset.

Figure 16: Random predictions from a single sequence from CloudCast testset for prediction after 3 hours.

Ensemble Mean Ensemble Std. Dev.


Variable True Score Our Score MSE(↓) MAE(↓) SSIM(↑) True Score Our Score MSE(↓) MAE(↓) SSIM(↑)
Z500 5.40e4 5.40e4 7.86e4 2.04e2 0.978 3.58e2 3.78e2 4.52e4 1.41e2 0.630
T2m 2.78e2 2.78e2 3.05 1.06 0.986 1.83 1.95 1.54 7.3e-1 0.681
U10 -1.16e-1 -1.10e-1 2.42 1.17 0.820 2.42 2.58 2.19 1.04 0.561
T850 -8.09e-2 -8.12e-2 4.30 1.53 0.688 3.64 3.72 3.12 1.28 0.561

Table 10: Similarity metrics for the WeatherBench ensemble prediction after 2 days.

F PERTURBATIONS USING SI: ADDITIONAL STUDY

F.1 MOVINGMNIST

Figure 21 shows how the perturbed states generated using SI are better than Gaussian perturbations. It also illustrates a
crucial aspect of the sensitivity of the perturbation: the third row of the figure shows a generated sequence from
MovingMNIST that contains a different digit. With mild perturbation, however, the mean of the 100 samples matches the
original state very well, and the standard deviation of those states shows the extent of the perturbation, which is desirable.
Figure 22 shows how a perturbed state varies with different noise levels. With larger noise, the digits seem to transform:
'one' turns into 'four', 'eight' turns into 'three', and so on. The model has learned that, when transitioning, a digit should
turn into another digit.

F.2 WEATHERBENCH

Figure 23 shows the four key variables for weather prediction, Z500, T2m, U10 and T850, perturbed with different noise
levels.

G HYPERPARAMETER SETTINGS AND COMPUTATIONAL RESOURCES

G.1 UNET TRAINING

Table 11 shows the hyperparameter settings for training on the MovingMNIST and CloudCast datasets, while Table 12
shows the settings for training on WeatherBench.

Hyperparameter Symbol Value
Learning Rate η 1e − 04
Batch Size B 64
Number of Epochs N 200
Optimizer - Adam
Dropout - 0.1
Number of Attention Heads - 4
Number of Residual Blocks - 2

Table 11: Neural Network Hyperparameters for training Moving MNIST and CloudCast

Hyperparameter Symbol Value


Learning Rate η 1e − 04
Batch Size B 8
Number of Epochs N 50
Optimizer - Adam
Dropout - 0.1
Number of Attention Heads - 4
Number of Residual Blocks - 2

Table 12: Neural Network Hyperparameters for training WeatherBench
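
As a concrete illustration, the optimizer-related settings from Tables 11 and 12 could be wired up as follows; the model constructor and dataset objects are placeholders and not part of the released code.

```python
import torch

# Hyperparameters from Table 12 (WeatherBench); Table 11 differs only in
# batch size (64) and number of epochs (200).
config = {"lr": 1e-4, "batch_size": 8, "epochs": 50,
          "dropout": 0.1, "attention_heads": 4, "residual_blocks": 2}

# model = build_unet(dropout=config["dropout"])   # placeholder constructor
# optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
# loader = torch.utils.data.DataLoader(train_set, batch_size=config["batch_size"], shuffle=True)
```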

G.2 COMPUTATIONAL RESOURCES

All our experiments are conducted using an NVIDIA RTX-A6000 GPU with 48GB of memory.

Figure 17: Statistical comparison of CloudCast sequences predicted using SI for a single initial state.

Figure 18: Four sample stochastic forecasts of Z500, T2m, U10 and T850 after 2 days obtained using SI.

Figure 19: Statistical comparisons of Z500, T2m, U10 and T850 for 2 day ensemble forecasting using SI using 20 mildly
perturbed samples.

Figure 20: Statistical comparisons of Z500, T2m, U10 and T850 for 2 day ensemble forecasting using SI using 78 strongly
perturbed samples.

Figure 21: Statistical comparison of 100 perturbed states from a single MovingMNIST sequence using SI.

Figure 22: Perturbed states for a single MovingMNIST sequence using SI for different levels of noise.

Figure 23: Perturbed states for a WeatherBench state using SI for different levels of noise.

