Probabilistic Forecasting For Dynamical Systems With Missing or Imperfect Data

1 Institute of Applied Mathematics, University of British Columbia, Vancouver, BC, Canada
2 Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia, Vancouver, BC, Canada
3 Recherche en prévision numérique atmosphérique, Environnement et Changement climatique Canada, Dorval, QC, Canada
where g is some functional architecture that is sufficiently
expressive to approximate f and θ are parameters in g. If
the original function f is known, then one can choose f = g
and p = θ. However, in many cases g is only a surrogate of
f and the parameters θ are very different from p.
The second, less trivial case is when the time interval T is very large (relative to the smoothness of the problem). In such cases, one cannot use simple finite differences to approximate the derivative. Instead, we note that since the differential equation has a unique solution and its trajectories do not intersect, we have that

y(t + T) = F(y(t), t, p) = \int_t^{t+T} f(y(\tau), \tau, p)\, d\tau.   (3)

Here F is the function that integrates the ODE from time t to t + T. We can then approximate F using an appropriate functional architecture, thus predicting y at a later time given observations of it at earlier times.

The two scenarios above fall under the case where we can predict the future based on the current values of y. Formally, we define the following:

Definition 1 (Closed System). Let Y be the space of some observed data y from a dynamical system D. We say the system is closed if, given y(t), we can uniquely recover y(t + T) for all t and finite bounded T ≤ τ, where τ is some constant. Practically, given the data y(t) and a constant ϵ, we can estimate a function F such that

\|y(t + T) - F(y(t), t, p)\|_2 \le \epsilon.   (4)

Definition 1 implies that we can learn the function F in equation 3, assuming that we have a sufficient amount of data of sufficient accuracy and an expressive enough architecture to approximate F(·, ·, ·). In this case, the focus of any ML based method should be given to an appropriate architecture that can express F, perhaps using some of the known structure of the problem. This concept and its limitations are illustrated using the predator prey model. This example demonstrates how a seemingly simple dynamical system can exhibit complex behavior through the interaction of just two variables. While theoretically deterministic, the system can become practically unpredictable if measured in regions where the trajectories are very close.

Example 1 (The Predator Prey Model [Murray, 2007, Chapter 3]): The predator prey model is

\frac{dy_1}{dt} = p_1 y_1 - p_2 y_1 y_2, \qquad \frac{dy_2}{dt} = p_3 y_1 y_2 - p_4 y_2.   (5)

The trajectories for different starting points are shown in fig. 1. Assuming that we record data in sufficient accuracy, the trajectory at any time can be determined by the measurement point at earlier times, thus justifying the model in equation 3. Furthermore, since the system is periodic, it is easy to set up a learning problem that can integrate the system for a long time.

Figure 1: Trajectories for the Predator Prey Model. Note that the trajectories get very close but do not intersect.

Note, however, that the trajectories cluster at the lower left corner. This implies that even though the trajectories never meet in theory, they may be very close, so numerically, if the data is noisy, the corner can be thought of as a bifurcation point.

The system is therefore closed if the data y is accurate; however, the system is open if there is noise on the data such that late times can be significantly influenced by earlier times.

The predator prey model draws our attention to a different scenario that is much more common in practice. Assume that some data is missing or that the data is polluted by noise. For example, in weather prediction, where the primitive equations are integrated to approximate global atmospheric flow, it is common that we do not observe the pressure, temperature or wind speed in sufficient resolution. In this case, having information about the past is, in general, insufficient to predict the future. In this case we define the system as open.

While we could use classical machine learning techniques to solve closed systems, it is unreasonable to attempt to solve open systems with similar techniques. This is because the system does not have a unique solution given partial or noisy data. To demonstrate that, we return to the predator prey model, but this time with partial or noisy data.

Example 2 (The Predator Prey Model with Partial or Noisy Data): Consider again the predator prey model, but this time assume that we record y_1 only. Now assume that we know that at t = 0, y_1 = 1, and our goal is to predict the solution at time t = 200, which is very far into the future. Clearly, this is impossible to do. Nonetheless, we can run many simulations where y_1 = 1 and y_2 \sim \pi(y_2), where \pi is some density. For example, we choose \pi(y_2) = U(0, 1). In this case we ran simulations to obtain the results pre-
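The experiment in Example 2 can be made concrete with a short Monte Carlo sketch: the unobserved component y_2(0) is drawn from \pi(y_2) = U(0, 1) and every sample is integrated to t = 200. The parameter values below follow the Lotka–Volterra configuration of section 5.1; the sample count and solver tolerances are illustrative assumptions.

# Monte Carlo ensemble for the partially observed predator-prey model.
import numpy as np
from scipy.integrate import solve_ivp

p1, p2, p3, p4 = 2 / 3, 4 / 3, 1.0, 1.0

def predator_prey(t, y):
    y1, y2 = y
    return [p1 * y1 - p2 * y1 * y2, p3 * y1 * y2 - p4 * y2]

rng = np.random.default_rng(0)
n_samples, T = 1000, 200.0
finals = np.empty((n_samples, 2))
for i in range(n_samples):
    y0 = [1.0, rng.uniform(0.0, 1.0)]        # y1(0) observed, y2(0) ~ U(0, 1)
    sol = solve_ivp(predator_prey, (0.0, T), y0, rtol=1e-8, atol=1e-10)
    finals[i] = sol.y[:, -1]                  # one sample of y(200)

# The ensemble of end states is a sample from the forecast distribution.
print("mean y(200):", finals.mean(axis=0), "std:", finals.std(axis=0))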
This approach acknowledges that a unique prediction may
not be attainable but allows for generating samples of future
outcomes. In contrast, deterministic prediction assumes a
closed system and highly accurate data, making it often
inapplicable.
solutions for low-dimensional problems have been addressed [Fisher, 1970, Villani et al., 2009], solutions for problems in high dimensions are only recently being developed.

One recent and highly effective technique to solve such a problem is stochastic interpolation (SI) [Albergo and Vanden-Eijnden, 2022], which is a technique that belongs to a family of flow matching methods [Lipman et al., 2022, Albergo et al., 2023, Song et al., 2020]. The basic idea of stochastic interpolation is to generate a simple interpolant for all possible points in q_0 and q_T. A simple interpolant of this kind is linear and reads

q_t = t q_T + (1 - t) q_0,   (7)

where q_T \sim \pi_T, q_0 \sim \pi_0, and t \in [0, 1] is a parameter. The points q_t are associated with a distribution \pi_t(q) that converges to \pi_T(q) at t = 1 and to \pi_0(q) at t = 0. In SI one learns (estimates) the velocity

v_t(q_t) = \dot q_t = q_T - q_0   (8)

by solving the stochastic optimization problem

\min_\theta \frac{1}{2} \int_0^1 \mathbb{E}_{q_0, q_T} \|v_\theta(q_t, t) - (q_T - q_0)\|^2 \, dt.   (9)

Here v_\theta(q_t, t) is an interpolant for the velocity v that is given at points q_t. A common model that is used for v_\theta(q_t, t) is a deep neural network. While for simple problems such models can be composed of a simple network with a small number of hidden layers, for complex problems, especially problems that relate to space-time predictions, complex models such as U-Nets are commonly used [Ronneberger et al., 2015].

Assume that the velocity model v_\theta(q_t, t) is trained and we use it to integrate q from time 0 to T, that is,

\frac{dq}{dt} = v_\theta(q, t), \qquad q(0) = q_0.   (10)

Note that we use a deterministic framework rather than a stochastic framework. This allows us to incorporate high-accuracy integrators that can take larger step sizes. However, this implies that the solution of the ODE given a single initial condition q_0 gives a single prediction. In order to sample from the target distribution \pi_T, we sample from q_0 and then use the ODE (equation 10) to push many samples forward. We thus obtain M different samples for q_0 (see next section) and use them in order to sample from \pi_T.

Comment 2 Note that the ODE obtained for q is not physical. It is merely used to interpolate the density from time 0 to T. To demonstrate, we continue with the predator prey model.

4 SAMPLING THE PERTURBED STATES

Sampling from the distribution around q_0 in a meaningful way is not trivial. There are a number of approaches to achieve this goal. A brute force approach can search through the data for so-called similar states; for example, given a particular state q_0 we can search for other states in the data set such that \|q_0 - q_i\|_2 \le \epsilon. While this approach is possible in low dimensions, it is difficult if not impossible in very high dimensions. For example, for global predictions, it is difficult to find two states that are very close everywhere. To this end, we turn to a machine learning approach that is designed to generate realistic perturbations.

We turn our attention to Variational Auto Encoders (VAEs). In particular, we use a flow based VAE [Dai and Wipf, 2019, Vahdat et al., 2021]. Variational Autoencoders are particularly useful. In the encoding stage, they push the data q_0 from the original distribution \pi_0(q) to a vector z that is sampled from a Gaussian, N(0, I). Next, in the decoding stage, the model pushes the vector z back to q_0. Since the Gaussian space is convex, small perturbations can be applied to the latent vector z, which lead to samples from \pi_0(q) that are centered around q_0.

The difference between standard VAEs and flow based VAEs is that, given q_0 \sim \pi_0, VAEs attempt to learn a transformation to a Gaussian space N(0, I) directly. However, as has been demonstrated in Ruthotto and Haber [2021], VAEs are only able to create maps to a latent state that is similar to but not Gaussian, making it difficult to perturb and sample from them. VAEs that are flow based can generate a Gaussian map to a much higher accuracy. Furthermore, such flows can learn the decoding from a Gaussian back to the original distribution. Flows that can transform the points between the two distributions are sometimes referred to as symmetric flows [Ho et al., 2020, Albergo et al., 2023].

Such flows can be considered as a special case of SI, an encoder-decoder scheme for a physical state q_0 being encoded to a Gaussian state z. To this end, let us define the linear interpolant as

q_t = \begin{cases} (1 - 2t)\, q_0 + 2t\, z, & \text{if } t \in [0, \tfrac{1}{2}) \\ (2t - 1)\, q_0 + 2(1 - t)\, z, & \text{if } t \in [\tfrac{1}{2}, 1]. \end{cases}   (11)

The velocities associated with the interpolant are simply

u_t = \frac{dq_t}{dt} = \begin{cases} 2(-q_0 + z), & \text{if } t \in [0, \tfrac{1}{2}) \\ 2(q_0 - z), & \text{if } t \in [\tfrac{1}{2}, 1]. \end{cases}   (12)

Note that the flow starts at q_0 \sim \pi_0 towards z \sim N(0, I) at t = 0 and arrives at z at t = 1/2. This is the encoding stage. In the second part of the flow we learn a decoding map that pushes the points z back to q_0. Note also that u is symmetric about t = 1/2, which is used in training. Training these models is straightforward and is done in a similar way to the training of our stochastic interpolation model, namely we solve a stochastic optimization problem of the form

\min_\theta \frac{1}{2} \int_0^1 \mathbb{E}_{q_0, z} \|u_\theta(q_t, t) + 2(q_0 - z)\|^2 \, dt.   (13)

Note that we can use the same velocity for the reverse process. Given the velocity u_\theta we now generate perturbed samples in the following way. First, we integrate the ODE to t = 1/2, that is, we solve the ODE

\frac{dq}{dt} = u_\theta(q, t), \qquad q(0) = q_0, \qquad t \in [0, \tfrac{1}{2}].   (14)

This yields the state q(1/2) \sim N(0, I). We then perturb the state

\hat q(1/2) = q(1/2) + \sigma \omega,   (15)

where \omega \sim N(0, I) and \sigma is a hyper-parameter. We then integrate the ODE equation 14 from 1/2 to 1 starting from \hat q(1/2), obtaining a perturbed state \hat q_0. The integration is done in batch mode, that is, we integrate M vectors simultaneously to obtain M samples from the initial state around q_0. These states are then used to obtain M samples from q_T as explained in section 3.

5 EXPERIMENTS

Our goal in this section is to apply the proposed framework to a set of various temporal datasets and show how it can be used to estimate the uncertainty in a forecast. We experiment with two synthetic datasets that can be fully analyzed and a realistic dataset for weather prediction. Further experiments on other data sets can be found in the appendix.

5.1 DATASETS

We now describe the datasets considered in our experiments.

Lotka–Volterra predator–prey model: This is a nonlinear dynamical system. It follows equation 5, where p_1 = 2/3, p_2 = 4/3, p_3 = 1 and p_4 = 1. The initial distribution of states is Gaussian with a mean of y(0) = [0.1, 0.3]^T and standard deviation of 0.05; the final distribution of states is the distribution obtained after fine numerical integration over t \in [0, 200]. Note that the length of the integration time is very long, which implies that the output probability space is widely spread, making this simple problem of predicting the final distribution from an initial distribution very difficult.

MovingMNIST: Moving MNIST is a variation of the well known MNIST dataset [Srivastava et al., 2015]. The dataset is a synthetic video dataset designed to test sequence prediction models. It features 20-frame sequences where two MNIST digits move with random trajectories. Following the setup described in Srivastava et al. [2015], we trained our model on 10,000 randomly generated trajectories and then used the standard publicly available dataset of 10,000 trajectories for testing.

WeatherBench: The dataset used for global weather prediction in this work is derived from WeatherBench [Rasp et al., 2020], which provides three scaled-down subsets of the original ERA5 reanalysis data [Hersbach et al., 2020]. The specific variables and model configurations are detailed in appendix B.1. Our forecasting models use the 6-hour reanalysis window as the model time step.

The statistics and configurations for each of the datasets during experiments are mentioned in appendix B. We have two additional real datasets, Vancouver93 and CloudCast, in the appendix for additional analysis. In the next section we give details for experiments and results on the predator-prey model, the MovingMNIST and the WeatherBench datasets.

5.2 RESULTS

Our goal is to learn a homotopy based deterministic functional map for complex timeseries to predict the future state from the current state but with uncertainty. To do so, we evaluate an ensemble of predictions from an ensemble of initial states. The initial states are obtained by using the auto-encoder described in Section 4. We then push those states forward, obtaining an ensemble of final states. We then report on the statistics of the final states.

The Predator Prey Model: We apply SI to learn a deterministic functional map on the predator-prey model of section 5.1 for a long time integration. The initial state of the data is sampled as a group of noisy initial states, that is, we consider the case that q = y(0) + \epsilon. In this case we consider y(0) = [0.1, 0.3]^\top and we contaminate it with the noise vector \epsilon \sim N(0, 0.05 I). Our goal is to predict the state q(T) at time T = 200. The distribution of 250 noisy points for q(0) and the final distributions for q(200) using numerical integration and SI can be seen in fig. 3. For this simple case, a multilayer perceptron (MLP) is used to learn the mapping. It can be noticed that SI is able to match a sample from the final complex distribution; specifically, the areas with high probabilities are captured very well. Some more statistical comparisons of the variables' states are reported in appendix E.1 to support our results. Appendix D shows some results on Vancouver93 where the learned final distribution, from a simple distribution to a complex conditional distribution, is clearly evident.

The same concept can be extended to a high-dimensional spatiotemporal image space instead of a vector space without any loss of generality. Like in Lipman et al. [2022], Karras et al. [2024], Bieder et al. [2024], we use a U-Net architecture from Dhariwal and Nichol [2021] to learn the velocity.
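For reference, the following is a minimal PyTorch sketch of the forecasting step described by equations (7)-(10): a velocity network is fit to the linear interpolant and an ensemble of initial states is then pushed forward through the learned ODE. The network width, the Euler integrator and the `pairs` loader are illustrative assumptions rather than the exact configuration used in our experiments (an MLP for the predator-prey case, a U-Net for the image datasets).

import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """Small velocity model v_theta(q, t) for vector-valued states."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, q, t):
        return self.net(torch.cat([q, t], dim=-1))

def train_si(model, pairs, epochs=100, lr=1e-3):
    """pairs yields (q0, qT) batches drawn from the data set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for q0, qT in pairs:
            t = torch.rand(q0.shape[0], 1)                    # t ~ U(0, 1)
            qt = t * qT + (1 - t) * q0                        # linear interpolant, eq. (7)
            loss = ((model(qt, t) - (qT - q0)) ** 2).mean()   # objective of eq. (9)
            opt.zero_grad(); loss.backward(); opt.step()
    return model

@torch.no_grad()
def push_forward(model, q0_ensemble, n_steps=100):
    """Integrate eq. (10) with forward Euler for a batch of initial states."""
    q, dt = q0_ensemble.clone(), 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((q.shape[0], 1), k * dt)
        q = q + dt * model(q, t)
    return q                                                  # samples from pi_T

The flow based auto-encoder of section 4 is trained with the same loop, replacing the target velocity by the symmetric velocity of equation (12) as in equation (13).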
Data            True Mean Score   Our Mean Score   True Std Dev Score   Our Std Dev Score
Predator–Prey   7.55e-1           7.43e-1          1.04                 1.07
MovingMNIST     6.01e-2           5.58e-2          2.21e-1              1.87e-1
WeatherBench    1.36e4            1.36e4           9.15e1               9.66e1
Figure 4: Six of 50 Moving MNIST trajectory predictions obtained using SI and their ensemble mean and standard deviation.
Figure 5: Two sample stochastic forecasts of U10 and T850 after 2 days obtained using SI and the ensemble mean and
standard deviation for 78 forecasts.
6 CONCLUSION
Probabilistic forecasting is an important topic for many scientific problems. It expands traditional machine learning techniques beyond paired inputs and outputs, to mapping between distributions. While generative AI has been focused on mapping Gaussians to data, similar methodologies can be applied for the generation of predictions given current states. In this work we have coupled a stochastic interpolation, to propagate the current state distribution, with an auto-encoder that allows us to sample from the current distribution. We have shown that this approach can lead to a computationally efficient way to sample future states even for long integration times and for highly non-Gaussian distributions in high dimensions.

Figure 8: Perturbed states of a WeatherBench T850 state.
References

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
J. D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. SIAM J. Math. Analysis, 35:61–97, 2003.
Florentin Bieder, Julia Wolleb, Alicia Durrer, Robin Sandkuehler, and Philippe C. Cattin. Memory-efficient 3d denoising diffusion models for medical image processing. In Medical Imaging with Deep Learning, volume 227 of Proceedings of Machine Learning Research, pages 552–567. PMLR, 10–12 Jul 2024.
Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan Weyn, Haiyu Dong, Anna Vaughan, et al. Aurora: A foundation model of the atmosphere. arXiv preprint arXiv:2405.13063, 2024.
Michael Brin and Garrett Stuck. Introduction to dynamical systems. Cambridge University Press, 2002.
Ashesh Chattopadhyay, Pedram Hassanzadeh, and Devika Subramanian. Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network. Nonlinear Processes in Geophysics, 27(3):373–389, 2020.
Lucor, Bertrand Iooss, Julien Brajard, Dunhui Xiao, Tijana Janjic, Weiping Ding, Yike Guo, Alberto Carrassi, Marc Bocquet, and Rossella Arcucci. Machine learning with data assimilation and uncertainty quantification for dynamical systems: A review. IEEE/CAA Journal of Automatica Sinica, 10(6):1361–1387, 2023b.
Bin Dai and David Wipf. Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789, 2019.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in Statistics: Methodology and Distribution, pages 66–70. Springer, 1970.
Deepak Ganesan, Sylvia Ratnasamy, Hanbiao Wang, and Deborah Estrin. Coping with irregular spatio-temporal sampling in sensor networks. ACM SIGCOMM Computer Communication Review, 34(1):125–130, 2004.
Alan J. Geer and Peter Bauer. Observation errors in all-sky data assimilation. Quarterly Journal of the Royal Meteorological Society, 137(661):2024–2037, 2011.
Amin Ghadami and Bogdan I. Epureanu. Data-driven prediction in dynamical systems: recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 380(2229):20210213, 2022.
John Guckenheimer and Philip Holmes. Nonlinear oscillations, dynamical systems, and bifurcations of vector fields, volume 42. Springer Science & Business Media, 2013.
John Guckenheimer and Patrick Worfolk. Dynamical systems: Some computational problems. In Bifurcations and Periodic Orbits of Vector Fields, pages 241–277. Springer, 1993.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
E Kalnay. Atmospheric Modeling, Data Assimilation and Predictability, volume 341. Cambridge University Press, 2003.
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24174–24184, June 2024.
Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Graphcast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794, 2022.
Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.
Christian Sebastian Lamprecht. Meteostat api. https://[Link]/en/.
Sylvie Leroyer, Stéphane Bélair, Syed Z Husain, and Jocelyn Mailhot. Subkilometer numerical weather prediction in an urban coastal area: A case study over the Vancouver metropolitan area. Journal of Applied Meteorology and Climatology, 53(6):1433–1453, 2014.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
Peter Lynch. The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7):3431–3444, 2008.
Parviz Moin and Krishnan Mahesh. Direct numerical simulation: a tool in turbulence research. Annual Review of Fluid Mechanics, 30(1):539–578, 1998.
James D Murray. Mathematical biology: I. An introduction, volume 17. Springer Science & Business Media, 2007.
Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023.
A. H. Nielsen, A. Iosifidis, and H. Karstoft. Cloudcast: A satellite-based dataset and baseline for forecasting clouds. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:3485–3494, 2021. doi: 10.1109/JSTARS.2021.3062936.
Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022a.
Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022b.
Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey. Weatherbench: A benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems, 12(11), 2020.
Sebastian Reich and Colin Cotter. Probabilistic forecasting and Bayesian data assimilation. Cambridge University Press, 2015.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
Lars Ruthotto and Eldad Haber. An introduction to deep generative modeling. GAMM-Mitteilungen, 44(2):e202100008, 2021.
Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 2015.
Ralph C Smith. Uncertainty quantification: theory, implementation, and applications. SIAM, 2024.
Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, page 843–852. [Link], 2015.
Andrew Stuart and Anthony R Humphries. Dynamical systems and numerical analysis, volume 2. Cambridge University Press, 1998.
Pierre NV Tu. Dynamical systems: an introduction with applications in economics and biology. Springer Science & Business Media, 2012.
Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
Phillip Vannini, Dennis Waskul, Simon Gottschalk, and Toby Ellis-Newstead. Making sense of the weather: Dwelling and weathering on Canada's rain coast. Space and Culture, 15(4):361–380, 2012.
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.
Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.
NP Wedi, P Bauer, W Denoninck, M Diamantakis, M Hamrud, C Kuhnlein, S Malardel, K Mogensen, G Mozdzynski, and PK Smolarkiewicz. The modelling infrastructure of the Integrated Forecasting System: Recent advances and future challenges. European Centre for Medium-Range Weather Forecasts, 2015a.
NP Wedi, P Bauer, W Denoninck, M Diamantakis, M Hamrud, C Kuhnlein, S Malardel, K Mogensen, G Mozdzynski, and PK Smolarkiewicz. The modelling infrastructure of the Integrated Forecasting System: Recent advances and future challenges. 2015b.
Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Ermon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency. arXiv preprint arXiv:2407.02398, 2024.
Niloufar Zakariaei, Siddharth Rout, Eldad Haber, and Moshe Eliasof. Advection augmented convolutional neural networks. arXiv preprint arXiv:2406.19253, 2024.
Appendix
A ADDITIONAL DATASETS
A.1 DATASETS
• Vancouver 93 Temperature Trend (Vancouver93): This is a real nonlinear chaotic high dimensional dynamical
system, in which only a single state (temperature) is recorded. The daily average temperatures at 93 different weather
stations in and around Vancouver, BC, Canada are captured for the past 34 years from sources such as National
Oceanic and Atmospheric Administration (NOAA), the Government of Canada, and Germany’s national meteorological
service (DWD) through Meteostat’s python library Lamprecht. Essentially, it is a time series of 12,419 records at
93 stations. The complex urban coastal area of Vancouver with coasts, mountains, valleys, and islands makes it an
interesting location where forecasting is much more difficult than in general Vannini et al. [2012], Leroyer et al. [2014]. The
historical temperature alone is insufficient to predict the temperature in the future, as it requires additional variables
like precipitation, pressure, and many more variables at higher resolution. This makes the dataset fit the proposed
framework.
• CloudCast Nielsen et al. [2021]: a real physical nonlinear chaotic spatiotemporal dynamical system. The dataset
comprises 70,080 satellite images capturing 11 different cloud types for multiple layers of the atmosphere annotated on
a pixel level every 15 minutes from January 2017 to December 2018 and has a resolution of 128×128 pixels (15×15
km).
B DATASETS
Table 4 describes the statistics, train-test splits, image sequences used for modeling, and frame resolutions.
Table 4: Datasets statistics, training and testing splits, image sequences, and resolutions
B.1 WEATHERBENCH
The original ERA5 dataset Hersbach et al. [2020] has a resolution of 721×1440, recorded every hour over almost 40 years from 1979 to 2018, across 37 vertical levels on a 0.25° latitude-longitude grid. The raw data is extremely bulky for running experiments, even with powerful computing resources. We hence use the typical reduced resolutions (32×64: 5.625° latitude-longitude grid; 128×256: 1.40625° latitude-longitude grid) as per the transformations made by Rasp et al. [2020]. We, however, stick to using the configuration set by Nguyen et al. [2023] for standard comparison. The prediction model considers 6 atmospheric variables at 7 vertical levels, 3 surface variables, and 3 constant fields, resulting in 48 input channels in total (6 × 7 + 3 + 3 = 48). It predicts four target variables that are considered in most medium-range NWP models, such as the state-of-the-art IFS Wedi et al. [2015a], and that are often used for benchmarking in previous deep learning work such as Lam et al. [2023], Pathak et al. [2022a]: geopotential at 500 hPa (Z500), temperature at 850 hPa (T850), temperature at 2 meters above the ground (T2m), and zonal wind speed at 10 meters above the ground (U10). We use a leap period of 6 hours as a single timestep, and hence our model takes in 8 timesteps (48 hours or 2 days) to predict the next 8 timesteps (48 hours or 2 days). Following the same setting, we use the data from 1979 to 2015 for training, the year 2016 as the validation set, and the years 2017 and 2018 as the testing set. The details of the variables considered are in Table 5.
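As a quick sanity check of the channel bookkeeping above, the following sketch reproduces the input-channel count and lists the four target variables; the variable grouping is taken from the text and everything else is illustrative.

# Input channels: 6 atmospheric variables x 7 levels + 3 surface + 3 constant fields.
n_atmospheric, n_levels = 6, 7
n_surface, n_constant = 3, 3
n_input_channels = n_atmospheric * n_levels + n_surface + n_constant
assert n_input_channels == 48          # 6*7 + 3 + 3

targets = ["Z500", "T850", "T2m", "U10"]   # the four target variables
steps_in = steps_out = 8                   # 8 x 6 h = 48 h of history and of forecast
print(n_input_channels, targets, steps_in, steps_out)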
C COMPARISON METRICS
Ensemble mean state (image), I_{EM} \in R^{C \times H \times W}, is defined as

I_{EM} = \{ P_{EM}^{j,k,l} = \frac{1}{N} \sum_{i=1}^{N} P_i^{j,k,l} \mid P_i^{j,k,l} \in I_i \},   (16)

where P_i^{j,k,l} \in I_i. Ensemble standard deviation state (image), I_{ES} \in R^{C \times H \times W}, is defined as

I_{ES} = \{ P_{ES}^{j,k,l} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( P_i^{j,k,l} - P_{EM}^{j,k,l} )^2 } \mid P_i^{j,k,l} \in I_i, \; P_{EM}^{j,k,l} \in I_{EM} \}.   (18)

The ensemble variance value V_{ES} is defined as

V_{ES} = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{j=1}^{C} \sum_{k=1}^{H} \sum_{l=1}^{W} ( P_i^{j,k,l} - V_{EM} )^2,   (19)

where P_i^{j,k,l} \in I_i.

MSE = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} ( y - \hat y )^2   (20)

MAE = \frac{1}{N \cdot C \cdot H \cdot W} \sum_{i=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} | y - \hat y |   (21)

SSIM(x, y) = \frac{ (2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2) }{ (\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2) }   (22)

SSIM = \frac{1}{N} \sum_{i=1}^{N} SSIM(x, y)   (23)

where \mu_x and \mu_y are the means of x and y, \sigma_x^2 and \sigma_y^2 their variances, \sigma_{xy} their covariance, and C_1, C_2 are small constants that stabilize the division.
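For completeness, a small NumPy sketch of these metrics for an ensemble of N predicted states of shape (C, H, W) is given below; the array names and the use of scikit-image (version 0.19 or later) for SSIM are assumptions for illustration.

import numpy as np
from skimage.metrics import structural_similarity

def ensemble_stats(preds):
    """preds: (N, C, H, W) ensemble. Returns I_EM and I_ES of eqs. (16) and (18)."""
    i_em = preds.mean(axis=0)          # pixel-wise ensemble mean
    i_es = preds.std(axis=0)           # pixel-wise ensemble standard deviation (1/N)
    return i_em, i_es

def error_metrics(y, y_hat):
    """MSE and MAE of eqs. (20)-(21), averaged over all samples and pixels."""
    mse = np.mean((y - y_hat) ** 2)
    mae = np.mean(np.abs(y - y_hat))
    return mse, mae

def mean_ssim(y, y_hat, data_range=1.0):
    """Average SSIM of eq. (23) over the N samples, each with channel axis 0."""
    scores = [structural_similarity(a, b, channel_axis=0, data_range=data_range)
              for a, b in zip(y, y_hat)]
    return float(np.mean(scores))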
D ADDITIONAL RESULTS
Probabilistic Forecasting For Vancouver Temperature: We now use the Vancouver93 dataset to build a forecasting
model using SI. The approximating function we use is a residual-network based TCN inspired by Bai et al. [2018] along
with time embedding on each layer. In this case we propose a model that uses a sequence of the last 5 days’ temperatures to
predict the next 5 days' temperatures. For this problem, it is easy to visualize how SI learns the transport from one
distribution to another.
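A minimal PyTorch sketch of this kind of residual TCN block with a per-layer time embedding is given below; the channel width, kernel size and embedding dimension are illustrative assumptions rather than the exact architecture used for Vancouver93.

import torch
import torch.nn as nn

class ResidualTCNBlock(nn.Module):
    """One causal residual temporal-convolution block with an additive time embedding."""
    def __init__(self, channels, kernel_size=3, dilation=1, emb_dim=32):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=self.pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=self.pad, dilation=dilation)
        self.t_proj = nn.Linear(emb_dim, channels)
        self.act = nn.ReLU()

    def forward(self, x, t_emb):
        # x: (batch, channels, length), t_emb: (batch, emb_dim)
        h = self.conv1(x)[..., : -self.pad]                   # drop right padding: causal
        h = self.act(h + self.t_proj(t_emb).unsqueeze(-1))    # inject time embedding
        h = self.conv2(h)[..., : -self.pad]
        return self.act(x + h)                                # residual connection

block = ResidualTCNBlock(channels=93)     # 93 stations treated as channels (illustrative)
x = torch.randn(8, 93, 5)                 # batch of 5-day input sequences
t_emb = torch.randn(8, 32)                # e.g. an embedding of the interpolation time t
y = block(x, t_emb)                       # (8, 93, 5): same shape as the 5-day target
print(y.shape)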
In the first experiment with this dataset, we take all the cases from the test-set where the temperature at Station 1 is 20◦ C
and we compare the actual distribution of temperatures at the same stations after 10 days. Essentially, if we look at the phase
diagrams between any two stations for distributions of states after 10 days apart, we can visualize how the final distributions
obtained using SI are very similar to the distributions we obtain from test data. Such distributions often look skewed, as we can see in figure 9, which shows an extremely skewed distribution matching very well with the state after 5 days, and similarly in figure 10, which shows the distribution matching very well with the state after 10 days. Appendix E.2 shows the matching histograms of the distributions for the latter case. Some metrics for
comparison of results can be seen in tables 6 to 8.
Figure 9: Phase diagram showing initial state and final states from data and model using SI after 5 days in between station 2
and station 10.
Figure 10: Phase diagram showing initial state and final states from data and model using SI after 10 days in between station
20 and station 1.
Data          True Mean   Our Mean   True Std. Dev.   Our Std. Dev.
Vancouver93   1.85e1      1.89e1     4.05             3.97
CloudCast     1.53e-2     1.51e-2    9.84e-2          9.84e-2
Figure 11 shows the histograms of y1 and y2 respectively. It can be noticed that the histogram of each actual state variable
matches very well with the histogram obtained from our results using SI.
Figure 11: Histograms of actual final distribution of y1 (left) and y2 (right) compared with that obtained using SI on the
predator-prey model
E.2 VANCOUVER93
Figure 12 shows the histograms of the two distributions, which are very similar, suggesting that SI efficiently learns the
transport map to the distribution of temperatures at a station for this problem, whose deterministic solution is very hard.
Figure 12: Histograms of distributions observed after 10 days and distribution obtained using SI excluding outliers.
Let us take as input the initial 10 frames from a case in the Moving MNIST test set as history and the actual 10 subsequent frames in the
sequence as the future. For such a case, Figures 13 and 14 show statistical images for small and large perturbations
respectively. 100 perturbed samples are taken and their final states are predicted using SI. The figures showcase how similar
the ensemble mean and the ensemble standard deviations are. This is a fair justification that the uncertainty designed on
Moving MNIST is captured very well.
E.4 CLOUDCAST
E.4.1 Predictions
As per the default configuration, a sequence of 4 timeframes from the test set of the CloudCast dataset is used to predict the 4
subsequent timeframes in the future. A point to notice is that each time step equals 15 minutes in real time, and hence what we
observe are differential changes. For one such case, Figure 15 shows twelve different predictions. They are very
similar but not the same once we zoom in to high resolution. Figure 16 shows the prediction after 3 hours obtained by autoregression,
to showcase how the clouds look noticeably different. The small cloud patches are visibly different in shape and size. Clouds
being a complex and nonlinear dynamical system, slight, barely noticeable differences over 1 hour can lead to very noticeably
different predictions after 3 hours.
Figure 17 shows statistical images for generated outputs on CloudCast testset using SI. Similar samples are taken as the set
of initial states to predict their final states using SI. The figures showcase how similar the ensemble mean and the ensemble
standard deviations are.
Figure 13: Statistical comparison of Moving MNIST sequences predicted using SI with random perturbed initial states.
E.5 WEATHERBENCH
E.5.1 Predictions
Figure 19 shows the statistical comparison of ensemble predictions with 20 mildly perturbed states. Figure 20 shows the
statistical comparison of ensemble predictions with 78 strongly perturbed states. The mean and standard deviation of the
states for 2-day (48-hour) predictions can be easily compared. Tables 9 and 10 show the metrics to compare the
accuracy of our ensemble mean and ensemble standard deviation for 6-hour and 2-day predictions respectively.
Figure 14: Statistical comparison of Moving MNIST sequence with random highly perturbed initial states.
Figure 15: Random predictions of differential change from a single sequence from the CloudCast test set.
Figure 16: Random predictions from a single sequence from CloudCast testset for prediction after 3 hours.
Table 10: Similarity metrics for WeatherBench ensemble prediction after 2 days.
F.1 MOVINGMNIST
Figure 21 shows how the perturbed state generated using SI is better than Gaussian perturbations. Also, a crucial factor
for the sensitivity of the perturbation is showcased, where a generated sequence from MovingMNIST with a different digit is
shown in the third row of the figure. With mild perturbation, however, the mean of the 100 samples matches very well with
the original state. The standard deviation of those states shows the scope of the perturbation, which is good. Figure 22 shows how
a perturbed state varies with different noise levels. With larger noise, the digits seem to transform, for example 'one' turns to 'four',
'eight' turns to 'three', and so on. The model understands that on transitioning, a digit should turn into another digit.
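For reference, the perturbation procedure of section 4 (equations 13-15) that generates these states can be sketched as follows; the Euler integrator, step counts and noise level are illustrative assumptions, and u_theta stands for the trained symmetric-flow velocity.

import torch

@torch.no_grad()
def perturb_states(u_theta, q0, sigma=0.1, n_copies=100, n_steps=50):
    """Encode q0 to the Gaussian latent at t = 1/2, add noise, and decode back."""
    q = q0.expand(n_copies, *q0.shape).clone()   # M copies of the same state
    dt = 0.5 / n_steps
    for k in range(n_steps):                     # encode: t in [0, 1/2)
        t = torch.full((n_copies, 1), k * dt)
        q = q + dt * u_theta(q, t)
    q = q + sigma * torch.randn_like(q)          # perturb the latent, eq. (15)
    for k in range(n_steps):                     # decode: t in [1/2, 1]
        t = torch.full((n_copies, 1), 0.5 + k * dt)
        q = q + dt * u_theta(q, t)
    return q                                     # M perturbed samples around q0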
F.2 WEATHERBENCH
Figure 23 shows the four curated important variables for weather prediction, Z500, T2m, U10 and T850, perturbed with
different values of noise.
Table 11 shows the hyperparameter settings for training on the MovingMNIST and CloudCast datasets, while Table 12 shows
the settings for training on WeatherBench.
Hyperparameter Symbol Value
Learning Rate η 1e − 04
Batch Size B 64
Number of Epochs N 200
Optimizer - Adam
Dropout - 0.1
Number of Attention Heads - 4
Number of Residual Blocks - 2
Table 11: Neural Network Hyperparameters for training Moving MNIST and CloudCast
All our experiments are conducted using an NVIDIA RTX-A6000 GPU with 48GB of memory.
Figure 17: Statistical comparison of CloudCast sequences predicted using SI for a single initial state.
Figure 18: Four sample stochastic forecasts of Z500, T2m, U10 and T850 after 2 days obtained using SI.
Figure 19: Statistical comparisons of Z500, T2m, U10 and T850 for 2-day ensemble forecasting using SI with 20 mildly perturbed samples.
Figure 20: Statistical comparisons of Z500, T2m, U10 and T850 for 2-day ensemble forecasting using SI with 78 strongly perturbed samples.
Figure 21: Statistical comparison of 100 perturbed states from a single MovingMNIST sequence using SI.
Figure 22: Perturbed states for a single MovingMNIST sequence using SI for different levels of noise.
Figure 23: Perturbed states for a WeatherBench state using SI for different levels of noise.