
This article has been accepted for publication in IEEE Transactions on Network and Service Management. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TNSM.2023.3287936.

Dealing with Changes: Resilient Routing via Graph Neural Networks and
Multi-Agent Deep Reinforcement Learning
Sai Shreyas Bhavanasi, Student Member, IEEE, Lorenzo Pappone, Student Member, IEEE, and
Flavio Esposito, Member, IEEE

All authors are affiliated with the Department of Computer Science, Saint Louis University, St. Louis, MO 63103 USA (e-mail: [email protected]). This work has been supported by NSF Awards #1908574 and #2201536. Bhavanasi completed this work as an undergraduate student. We extend our gratitude to Hunter Parks for his contributions as an undergraduate student during the initial stages of this project.

Abstract—The computer networking community has been steadily increasing its investigations into machine learning to help solve tasks such as routing, traffic prediction, and resource management. The traditional best-effort nature of Internet connections allows a single link to be shared among multiple flows competing for network resources, often without consideration of in-network states. In particular, due to its recent successes in other applications, Reinforcement Learning has seen steady growth in network management and, more recently, in routing. However, if there are changes in the network topology, retraining is often required to avoid significant performance losses. This restriction has chiefly prevented the deployment of Reinforcement Learning-based routing in real environments. In this paper, we approach routing as a reinforcement learning problem with two novel twists: we minimize flow set collisions, and we construct a reinforcement learning policy capable of routing in dynamic network conditions without retraining. We compare this approach to other routing protocols, including multi-agent learning, with respect to various Quality-of-Service metrics, and we report our lessons learned.

Index Terms—Routing protocols, Machine learning algorithms, Reinforcement learning, IP networks, Network Management.

I. INTRODUCTION

Machine learning (ML) has recently seen many applications within computer networking. Many ML techniques and algorithms have been proven to enhance performance for many tasks, ranging from network traffic prediction to resource management and anomaly detection. One of the most prolific areas of ML research in recent years has been Reinforcement Learning (RL). Since the authors in [1] demonstrated how deep neural networks could be trained to approximate a Q-function efficiently, RL became a topic of significant prominence. Since then, RL has been in the spotlight due to a slew of recent artificial intelligence breakthroughs, including defeating humans in games (e.g., Go, chess, StarCraft), self-driving cars, smart-home automation, and service robots, to name a few applications. Computer networks in general, and packet routing problems in particular, have also been tackled with RL, although with a single agent that learns the environment and copes with the lack of performance awareness of commonly deployed routing protocols based on the Dijkstra and Bellman-Ford algorithms. Despite all recent RL achievements, many simple tasks can still elude a single agent. The main limitation of existing single-agent RL-based routing is that the agent has to implicitly learn the topology through a centralized strategy. While such a strategy may work for logically centralized Software-Defined Networks, it may be impractical for many other computer networks. To overcome this limitation, we propose to tackle the routing problem using Graph Convolutional Networks (GCN) and distributed learning approaches, such as multi-agent RL.

In multi-agent RL, agents share a joint objective function and can cooperate to learn faster, safeguard privacy to a certain extent, and rely on a less failure-prone algorithm, overcoming some of the limitations of a single RL agent. Both GCN and multi-agent RL help tame the complexity of creating a fully distributed routing learning strategy that balances picking short and less congested paths based on local observations with being informed about global and dynamic network states.

Recent literature has shown different techniques to embed RL models into routing algorithms, achieving promising results. Nonetheless, no prior solution considered resiliency against network changes as a primary concern. A practical example is the failure of one or multiple links or nodes in the network, which leads to unwanted and unexpected topology variations. If such an event occurs after the training phase has been completed, the RL model should be retrained from scratch to learn the new change in the environment.

In this paper, we approach the problem of RL-based routing with resiliency in mind. In particular, by resiliency in this work we mean the ability to dynamically adapt to computer network changes without the need to retrain the RL model. Our results can be applied to Wide-Area Networks, within intra- and inter-domain routing (e.g., iBGP and eBGP), and to Software-Defined Networks. In some cases, e.g., at the network edge [2], to fine-tune traffic engineering policies, it is desirable to overwrite classical interior BGP routing rules, such as Equal Cost Multi-Path (ECMP) and Open Shortest Path First (OSPF). In other cases, a distributed approach is the only viable solution [3].

In particular, in this paper we present the design, implementation, and evaluation of two resilient RL-based routing schemes with different learning algorithms: a single-agent and a multi-agent solution. Both approaches show benefits when dealing with drastic network topology changes. The single-agent RL routing algorithm leverages Graph Neural Networks [4] to minimize retraining needs. The multi-agent RL solution, instead, is based on Deep Q-Networks, where federated routing agents cooperate to achieve a shared optimization goal.


The idea behind our Single-Agent Graph Convolutional Network algorithm (SA-GCN) is to operate directly on network traffic datasets encoded in a graph format, instead of traditional vector and matrix data representations. The advantage of training an RL routing system whose input is a graph lies in the ability to train such a policy on any network topology, represented by a computer network adjacency matrix. This advantage naturally permits an RL policy (i.e., a given reward function) to operate in any topology and learn from changes in routing and congestion events, without having to retrain the neural network. In our Multi-Agent routing with Deep Q-Network algorithm (MA-DQN), instead, each agent makes its routing decisions locally.

We found that, not surprisingly, when RL agents learn about the network topology explicitly, the training time significantly improves. This is because, by compiling the network topology into the RL agent's state space, the machine learning model is able to handle changes in latency and bandwidth more efficiently. We also found some surprising results by evaluating several RL models. In particular, we test their ability to optimize Quality of Service (QoS) computer network metrics, such as latency and bandwidth, while considering different network topology factors (i.e., size, number of competing flows on the same physical or virtual link, congested links, and link and node failure scenarios).

We report that our RL-based algorithms outperform the standard routing algorithms, with an overall improvement evaluated with different metrics. Furthermore, we compare our solutions and show that our MA-DQN routing algorithm can achieve the optimal policy much faster than our SA-GCN model. Nonetheless, one of the major drawbacks of MA-DQN is that the model cannot easily be translated to other topology configurations, as the topology is encoded into the neural network model itself, whereas the SA-GCN algorithm results in a more adaptive solution with regard to topology changes.

In summary, we analyze and dissect the application of RL to the routing problem as follows: (i) reporting the pros and cons of using single-agent and multi-agent approaches individually; (ii) comparing them in terms of network change resiliency and retraining needs; (iii) discussing our lessons learned and leaving several insights regarding the overall optimization process when dealing with routing under network topology changes; (iv) comparing their performance against existing standard and RL-based routing algorithms.

The rest of the paper is structured as follows. Section II highlights the state-of-the-art RL-based solutions to routing problems, focusing on single and multi-agent approaches. Section III formulates our problem more formally and reports the mathematical model for both the GCN and DQN-based algorithms. Section IV reports the results achieved with an extensive evaluation of performance in terms of QoS metrics and gives a detailed comparison between our algorithms and different baselines. In Section V we summarize our work and the take-away messages.

II. RELATED WORK

Routing protocols pose challenging requirements for ML models. Examples of such challenges include the capacity to deal with and scale to complex and dynamic network topologies, the ability to learn the correlation between the selected path and the perceived QoS, and the ability to forecast the repercussions of routing decisions. Traditional RL algorithms, particularly Q-learning, have been used to route traffic in a variety of network scenarios, given their scarce compute and communication needs and their ability to identify an ideal solution and adapt to changes in the environment. Different techniques to apply RL to the traffic routing problem have been proposed in the literature; see, e.g., these examples [5, 6, 7] or this recent survey [8]. These techniques differ in terms of (i) the distribution of the learning capability and (ii) the amount of collaboration among the learners. Different techniques lend themselves better to specific network topologies and utility purposes. The presence of a central node (the controller in Software-Defined Networks (SDN) and the sink in Internet of Things (IoT) networks, respectively) allows for centralized learning in SDN [9] and IoT. Routing in IoT, on the other hand, necessitates decentralized RL [10, 11], with the learning capability dispersed across the routing nodes. In the rest of this section, we focus on the solutions that most closely match our contribution, with respect to two dimensions: Single-Agent Q-learning solutions for traffic engineering and routing with Multi-Agent learning.

Single-Agent Deep Q-learning for Traffic Engineering. A few recent solutions have proposed employing Deep Q-Learning to tackle traffic engineering problems. Most of these approaches investigated the use of RL specifically for real-time routing optimization [12], congestion control [13], and resource management [14]. The authors in [15] implement a deep-RL solution to improve the performance of baseline TE algorithms of an SD-WAN-based network in terms of service availability. Specifically, they evaluated three deep-RL methods: Deep Q-Learning, policy optimization, and TD-λ (a Temporal-Difference value function algorithm). Results show that Deep Q-Learning achieves better performance with respect to the other deep-RL algorithms and the baselines in terms of the percentage of time in which the service is up.

In [16], deep Q-Learning is used to specifically build a greedy online routing algorithm, improving different QoS metrics in SDNs. The authors implemented a greedy online QoS routing method based on a dueling deep Q-network with prioritized experience replay, proving that this solution can learn the network topology to solve multiple QoS metric optimization tasks. The approach reduced delay, cost, and loss while maximizing bandwidth, outperforming existing learning-based methods. Differently from all these sound solutions, in this work we investigate the application of GCNs to Deep Q-Learning for a QoS routing optimization problem.

Routing with Multi-Agent Reinforcement Learning. Although previous studies of DRL-based techniques have demonstrated the ability to deploy routing configurations in dynamic networks autonomously, some researchers have argued that centralized controller approaches are likely to face challenges in large-scale networks due to the difficulties of collecting widely distributed network status in real-time.

Multi-Agent Reinforcement Learning has rapidly become an important research direction, and a few authors have proposed bringing these approaches to the optimization of routing protocols; see, e.g., [17, 18, 19].


In particular, [19] proposes a multi-agent reinforcement learning framework for adaptive routing in communication networks, which takes advantage of both real-time Q-learning and actor-critic methods.

Pinyoanuntapong et al. [20] formulated the traffic engineering decision-making problem as a Multi-Agent Markov Decision Process (MA-MDP) instead of a Partially Observable Markov Decision Process (POMDP). Similar to these approaches, we use the Q-routing technique for traffic-aware routing, but our solution uses a fully distributed multi-agent Deep Q-Learning Reinforcement Learning algorithm to deal with QoS requirements and topological changes.

III. MODEL AND BACKGROUND ON GCN AND MA-DQN

In this section, we detail the design of our single-agent and multi-agent routing approaches. Specifically, we describe the RL model and define the state and action spaces along with the selected reward functions, both for the GCN-based algorithm (Section III-A) and for DQN (Section III-B).

A. Single Agent GCN Policy Background and Settings

While Convolutional Neural Networks (CNNs) have performed well with structured data, they cannot be used for unstructured data such as graphs. In these situations, Graph Convolutional Neural Networks (GCNs) have shown great promise. GCNs use convolutional operations to extract features from nodes and edges in a graph. The main idea is to propagate information from neighboring nodes to update the representation of each node in the graph. This is achieved by defining a convolution operation on the graph, where the filters are learned using backpropagation. The filters capture local patterns in the graph and are used to update the feature representations of the nodes. We leverage GCNs as the policy network for our DRL algorithm. A policy network is a type of neural network that takes in the current state of an environment as input and outputs a probability distribution over the actions that the agent can take in that state.

The application of RL-based approaches to routing has been criticized for the inability to generalize beyond the topologies used for training. If there are any changes in the network topology, retraining is required. This restriction has mostly prevented the deployment of RL-based routing in real environments. The principal benefit of incorporating Graph Convolutional Networks (GCNs) into the policy network lies in their capacity to deliver high performance even when operating on previously unseen network topologies. This is primarily because the topology of the network is explicitly provided as input to the policy network. In our work, each episode of our RL algorithm is trained on a new network topology that the model has not encountered during the training phase.

As a network experiences congestion when several flow sets coexist on the same underlying path, we address this problem by learning how to minimize the coexistence of flow sets as long as alternative routes exist. We define such flow coexistence as a collision: two or more flow sets coexist in the same forwarding application process, i.e., a (virtual) router or switch, simultaneously.

Since an action that an RL agent chooses influences subsequent actions, we model the problem as a Sequential Decision-making Problem (SDP). The problem instance in our study is characterized by the tuple M = <S, A, R, γ>, where S is a finite set of states, A is a finite set of actions, R is the immediate reward, and γ is the discount factor. We denote by f the number of flow sets and by N the number of nodes in the system. In order to optimize the policy network output (i.e., the probability distribution of actions), we employ an "on-policy" approach. Specifically, we use the PPO (Proximal Policy Optimization [21]) reinforcement learning algorithm to facilitate stable training and prevent divergence.

We next describe each element of the Sequential Decision-making Problem tuple in detail.

State Space. Let S denote the finite set of states that are admissible in the environment. An arbitrary state contains three parts. The first part is the adjacency matrix of the network. The second part is a 2-column matrix containing a one-hot encoding of the source and destination nodes of the flow set to be routed by the GCN. The third part is an f × 2 matrix where each layer of the matrix is a similar 2-column matrix, but for a competing flow set in the network. It is important to clarify that the state does not contain the path that other flow sets are taking across the network but merely their source and sink nodes. If there are n nodes and up to f other flows, there are (f + 1) × n^2 possible states.

Action Space. Let A denote the finite set of actions that the agent can take. We construct the action space as a one-hot vector the size of the highest-degree node in the graph. Having the action space be dim(G) as opposed to |E| offers a significant action space compression. When the agent is at a node i, given that dim(n_i) < dim(G), only the first dim(n_i) output indices are considered when sampling from the policy.

RL Reward Functions. Let R(s, a) denote the immediate reward (or expected immediate reward) received for selecting an action (routing decision) a ∈ A at a state s ∈ S, which causes the state transition s → s′, with R(s, a) ∈ [−1, 1]. One dimension of our GCN-based routing policy evaluation is the analysis of reward functions and their influence on the strength of such a policy with respect to what it was supposed to optimize. The first reward function considered is shown in Equation 1. This discrete, sparse reward function, while simple, is deceptive. The goal of RL algorithms is to select actions that will yield the largest reward by the end of the episode, attempting to achieve the largest reward in as few actions as possible. If an agent were to use this reward function to learn to route within a network, the agent would learn how to get to the destination node in as few hops as possible while remaining oblivious to each link's capacity and delay.

R(s, a) = \begin{cases} -1 & \text{could not reach destination} \\ 1 & \text{reached destination} \\ 0 & \text{still in transit} \end{cases}    (1)
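To make the state layout described above concrete, the sketch below assembles the three components (adjacency matrix, one-hot source/destination columns for the routed flow set, and one layer per competing flow set) for a small example network. It uses NumPy and networkx; the array layout and function names are our own illustrative choices, not the authors' exact implementation.

```python
import numpy as np
import networkx as nx

def build_state(graph, flow, competing_flows, max_flows):
    """Assemble the three-part SA-GCN state described above.

    graph: networkx graph of the current topology
    flow: (source, destination) pair of the flow set being routed
    competing_flows: list of (source, destination) pairs of other flow sets
    max_flows: f, the maximum number of competing flow sets encoded
    """
    n = graph.number_of_nodes()

    # Part 1: adjacency matrix of the network.
    adjacency = nx.to_numpy_array(graph)

    # Part 2: n x 2 matrix with one-hot source and destination columns.
    src, dst = flow
    own_flow = np.zeros((n, 2))
    own_flow[src, 0] = 1.0
    own_flow[dst, 1] = 1.0

    # Part 3: f layers, each an n x 2 one-hot matrix for a competing flow set.
    others = np.zeros((max_flows, n, 2))
    for k, (s, d) in enumerate(competing_flows[:max_flows]):
        others[k, s, 0] = 1.0
        others[k, d, 1] = 1.0

    return adjacency, own_flow, others

# Example: a toy 6-node ring with one competing flow set.
G = nx.cycle_graph(6)
state = build_state(G, flow=(0, 3), competing_flows=[(1, 4)], max_flows=2)
```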


The natural second reward structure is defined in Equation 2; in this equation, r is the proportion of the best achievable QoS metric to the QoS realized by the path the agent chose from source to destination.

R(s, a) = \begin{cases} -1 & \text{could not reach destination} \\ r & \text{reached destination} \\ 0 & \text{still in transit} \end{cases}    (2)

For routing with other flow sets coexisting in the network, r is the ratio between the measured metric and the optimal QoS metric (Equation 3). If we consider end-to-end latency as the metric, C1 represents the measured latency on a given path taken into consideration by an RL policy, and C2 is the latency of the shortest path when no other flow sets coexist in the network. If instead we wish to optimize over bandwidth, C1 and C2 represent the bandwidth available on the route considered and the best bandwidth that can theoretically be achieved when no other flows are present in the network:

r = \frac{C_2}{C_1}.    (3)

The agent operates in an environment with three inputs: (1) an adjacency matrix representing the connections between nodes in the network, (2) a two-column matrix indicating the current node and the destination node, and (3) a matrix of size numflows × 2 × dim(G) representing the start and end points of other flow sets in the network, which are one-hot encoded.

The first input is processed by a graph convolutional network (GCN) specifically designed to handle graph-structured data. The GCN architecture is inspired by traditional convolutional networks and consists of a graph convolution layer, a rectified linear unit (ReLU) activation function for non-linearity, and a graph pooling layer to aggregate information. This implementation is described in detail in the GCN implementation [22].

The second and third inputs are processed by a feed-forward network with two hidden layers of size |E|/2 with ReLU activation. The output of the GCN is flattened, and the other two sub-networks' outputs are concatenated lengthwise. The resulting vector is then propagated through another sub-network, a feed-forward neural network with three hidden layers of size |E|/2. The last layer of the policy has dim(G) nodes. The value function is another feed-forward network with three hidden layers of size |E|/2 with ReLU activation, and its last layer contains one neuron.

Reward Engineering and Considerations over Large Networks. While Equation 2 would properly train an agent to optimize an arbitrary QoS metric r, the sparse nature of non-zero rewards in these functions is undesirable. Mathematically, as the average eccentricity of a network increases, the RL agent's horizon organically grows, making the credit assignment of actions increasingly difficult. Our reward structure is semi-sparse, as the agent is given a small reward when it successfully moves closer to the destination node and receives a much larger reward at the end of the episode. Naively, if the agent takes the shortest path to the destination node, it would receive positive rewards at every timestep. However, the final reward would not be as significant if the agent collided with other flow sets along the way.

Another challenge when operating on large networks is the use of ϵ-greedy exploration, namely a strategy for balancing exploration and exploitation by randomizing actions. With random exploration, a packet is increasingly unlikely to stumble upon the destination node, therefore delaying the reception of a non-zero reward and hence prolonging the learning phase. To address such limitations, we consider the following more specific reward function (Equation 4):

R(s, a) = \begin{cases} -1 & \text{could not reach destination} \\ r & \text{reached destination} \\ 0.1 & \text{moved closer to target} \\ 0 & \text{otherwise} \end{cases}    (4)


B. Multi Agent DQN Background and Settings

One of the first approaches to Deep Learning from high-dimensional input is outlined in [1]. Their model, a Convolutional Neural Network trained with a variant of Q-Learning, could surpass some of the previous approaches on the Atari 2600 games. The authors in [23] took this approach further: they used reinforcement learning combined with deep neural networks to develop the deep Q-Network (DQN).

While tabular methods come with convergence guarantees and work well with smaller state-action spaces, we use neural networks and Deep Reinforcement Learning because we aim to develop a more flexible and scalable approach that can be easily adapted to larger and more complex problems. Deep RL provides a framework that allows us to handle these more challenging future scenarios without significant alterations to the underlying algorithm.

Using neural networks for RL can cause instability during training. This instability is attributed to correlations caused by sequences of observations, to minor changes to the neural network that can change the policy, and to correlations between the action and target values [23].

To counter these issues, we use the following techniques with DQN to stabilize training. First, we employ a replay buffer in our algorithm that keeps track of the (s, a, r, s′, a′) tuples. A batch of these tuples is sampled randomly to adjust the neural network, preventing data correlations. Secondly, DQN uses an iterative update that periodically adjusts action values toward the target values to avoid correlations.

The computer network in which the RL agents operate is described by a standard graph G(V, E), where V represents the set of routing nodes and E represents the set of transmission links. Each transmission link carries some traffic that we simulate. The goal is to find the path that minimizes the travel time between each source (s) and destination (d).

Each node acts as an independent agent in the MA-DQN model. Each agent has a separate neural network and makes routing decisions locally.

State Space. Each agent receives the destination as a one-hot vector with size equal to |V|, i.e., the number of routers. The state space remains consistent across all agents for each episode.

Action Space. We also create the action space as a one-hot vector whose size is equal to the number of the agent's neighbors. The agent selects a neighbor at each time step t after looking at the state space.

Reward Functions. We employ a dense reward function to route in larger networks. The agent receives either a positive or a negative reward for every routing decision taken, based on Equations 5-6. We find that this structure significantly increases the rate at which the model converges. We also discover that the only factors that significantly alter the model's performance are the positive and negative rewards' relative sign and magnitude differences. When the magnitude of the negative reward is greater than the magnitude of the positive reward, the model avoids making those decisions and its behavior is more strongly reinforced.

Therefore, if the packet is more than one step away from the destination:

R(s, a) = \begin{cases} r & \text{packet moved closer to the destination} \\ -0.75 & \text{otherwise,} \end{cases}    (5)

where r represents the time difference between the current node and the previous node in terms of packet delivery time to the destination. When a packet can reach the destination using a single link, the reward is defined as:

R(s, a) = \begin{cases} 1.5 & \text{best link to destination chosen} \\ -0.75 & \text{sub-optimal link to destination chosen} \end{cases}    (6)

The agent is given a one-hot encoding of the destination as input. This input feeds a feed-forward neural network with ReLU activations. The size of the input layer is the same as the number of graph nodes. Then, there are two hidden layers with a combined size of |V|/2, followed by an output layer with a size equal to the number of neighbors of the agent.

Algorithm 1 Deep Q-Network Running on Routing Agent
1: for agent i = 1, N do
2:   Initialize replay buffer Di = ∅
3:   Initialize action-value function Qi with weights θi
4:   Initialize target action-value function Q̂i with weights θi− = θi
5: end for
6: for episode = 1, M do
7:   for each decision epoch t do
8:     Assign current agent n the current packet p
9:     Observe current state st
10:    Select and execute an action:
11:    at = a random action with probability ϵ, otherwise arg max_a Qn(st, a; θn)
12:    Forward p to the next agent vt
13:    Observe reward rt and next state st+1
14:    Store transition (st, at, rt, st+1) in Dn
15:    Sample minibatch (sj, aj, rj, sj+1) from Dn
16:    yj = rj if the episode terminates at step j + 1, otherwise rj + γ max_a′ Q̂n(sj+1, a′; θn−)
17:    Gradient descent on (yj − Qn(sj, aj; θn))^2 w.r.t. θn
18:    Every C steps reset Q̂n = Qn
19:  end for
20: end for

The DQN algorithm that we use is shown in Algorithm 1. We first initialize the replay buffer and the action and target neural networks for each agent (lines 1-4). For each episode, we randomly determine the source and destination nodes. We then assign the packet to the agent at the source (line 8). For each epoch in each episode, the agent observes the state s at timestep t. The agent then selects an action a based on ϵ: it chooses either a random action or the best action a∗ given by the following maximization problem:

a∗ = arg max_a Q(st, a; θ),

depending on the value of ϵ (lines 9-11). The value of epsilon decays as the episodes pass, so that the agent can explore at the beginning of the training and exploit the best-known values toward the end. We find that the rate at which ϵ decays plays a significant role in the model's performance. If ϵ decays too fast, the model converges to a poor value. If ϵ decays too slowly, the model never converges. Once the action is selected, we pass the packet to the neighbor indicated by the selected action and observe the reward rt and the next state st+1 (lines 12-13).

We store the transition (st, at, rt, st+1) in the agent's replay buffer (line 14). We then randomly sample a batch of transitions (sj, aj, rj, sj+1) from the agent's replay buffer and calculate the expected value yj from the "older" target network (lines 15-16). We then perform gradient descent on (yj − Q(sj, aj; θ))^2 with respect to θ for the agent, to optimize the weights of the neural network (line 17). Periodically, the weights of the target neural network are reset to the "newer" neural network (line 18). By sampling transitions, we avoid correlations in training. Additionally, this replay buffer technique is more efficient, as each step can be used in many neural network updates [23].
st+1 (lines 12-13). and its ability to re-train can be used over new topologies


IV. EVALUATION

In this section, we evaluate both proposed algorithms, single-agent via GCN and multi-agent via DQN, with respect to Quality of Service (QoS) metrics for traffic engineering. In particular, in Section IV-A we describe our multi-agent evaluation settings. In Section IV-B we discuss the evaluation results of the multi-agent algorithm, showing how it outperforms two widely deployed algorithms, Open Shortest Path First (OSPF) and Equal-Cost Multi-Path (ECMP). We also discuss a few lessons learned from reward engineering and the impact of over-training the RL algorithm in our case. In Section IV-D, we discuss the evaluation of the centralized routing solutions using GCN, showing interesting and surprising results on its ability to learn not only how to route, but also how to avoid collisions (i.e., coexistence and hence competition) with other flow sets.

The MA-DQN model is able to train faster than SA-GCN, and its ability to re-train can be used over new topologies with different parameters. Indeed, we found that MA-DQN performs a fast re-training when there are small changes in topologies. The MA-DQN model is also able to learn an optimal policy without explicit knowledge of other flows in the network.

A. MA-DQN Evaluation Settings

In all our experiments, we captured transmission and queuing delays, and we considered network topologies following either the Waxman (random-biased) or the Barabasi-Albert (preferential attachment) model. Our network configurations consist of three sizes: 50 vertices and 100 edges (50V, 100E), 100 vertices and 200 edges (100V, 200E), and 150 vertices and 300 edges (150V, 300E). These topologies were generated using the BRITE topology generator [24]. We used a uniform bandwidth distribution from 0 to 0.99 Mbps. For the Waxman topology [24], we used an alpha parameter of 0.15 and a beta parameter of 0.2 (Waxman-specific exponents), given in this equation:

P(u, v) = \alpha e^{-d/(\beta L)},    (7)

where P(u, v) is the probability of having an edge between nodes u and v, 0 < α, β ≤ 1, d is the Euclidean distance between nodes u and v, and L is the maximum distance between any two nodes. We simulate several traffic scenarios in static and dynamic topology conditions after generating these topologies. Each link has a bandwidth as generated by the BRITE topology generator before the simulation. We simulate traffic by generating flows that take the shortest path between randomly generated source-destination pairs following a uniform distribution. For each packet on a link, we reduce the corresponding residual capacity and increase the latency proportionally to the flow packet size. In all our network simulations, we capture propagation delay as well as queuing delay. Unless otherwise specified, we use a replay buffer size of 5000 (the memory that stores the state-action tuples of the RL algorithms), a batch size of 192, and the RMSprop [25] optimizer; RMSprop maintains a moving average of the squared gradients, thus resulting in an adaptive learning rate. We use the default learning rate of 10^-2. All experiments for the MA-DQN were performed on a Dell Inspiron 15. Convergence times for the MA-DQN model with respect to topology size are as follows: (50N, 100E) took 5 minutes, (100N, 200E) took 10 minutes, and (150N, 300E) took 17 minutes.

B. Evaluation of Multi-Agent DQN

1) Multi-Agent DQN-based policies outperform OSPF and ECMP: In this section, we compare the latency and bandwidth of converged DQN policies, OSPF, and ECMP. Figures 1(a-c) and 2(a-c) compare the latency of the aforementioned policies on networks of various sizes and topologies, and show that Multi-Agent DQN-based policies learn near-optimal routing policies for networks of sizes up to (150V, 300E). Our results presented in Figure 3 demonstrate the ability of the MA-DQN model to balance bandwidth effectively, consistently outperforming OSPF and ECMP.

One of the most significant results from our experiments is the model's scalability. We find that the multi-agent model can outperform classically adopted routing protocols such as OSPF and ECMP even on large networks. From Figures 1(d), 2(d), and 5(a-b) we observe that the training time scales well with respect to the topology size, and from Figure 5(c) we observe that MA-DQN is able to minimize latency across various topology sizes and types. We attribute this result to the fact that each agent is responsible only for local routing decisions. This means that such an agent does not need to receive the current location of the packet in the state space, thus reducing the size of the inputs to the model. Our application of a dense reward function also plays a significant role in performance.

2) Impact of Retraining Needs: The present set of experiments aims to investigate the performance of a model when the network topology changes, specifically when nodes and links become unavailable. The objective is to comprehend how well the model adapts to these alterations and to assess whether adjusting the exploration-exploitation balance could improve the model's performance in such scenarios.

To simulate topology changes, we randomly remove a percentage of nodes and links from a 50-node network generated using the Waxman topology model after the RL model training had reached convergence. For clarity, the average performance over 20 runs is plotted in Figures 4(a-b). Similar results are obtained on the Barabasi-Albert topologies.

Before training or testing the model with a source-destination pair, we verify that a valid route is available. When a valid route is unavailable, sampling continues until a connected pair is found. After training the model, we route 2000 packets for evaluation. However, in some cases, heavy reductions of nodes result in a topology change too large for the MA-DQN model to relearn an optimal policy. To address this limitation, we adjust the model's exploration-exploitation balance by changing the exploration value (ϵ), thus prompting the model to explore more instead of exploiting known routes.

In a separate set of experiments, we examine the impact of varying the ϵ value, which determines the rate at which the RL agent makes random moves during its exploration phase (results shown in Figure 5d). However, we find that encouraging the model to explore more does not result in any performance improvement. The model's knowledge, which was optimized for the existing environment, may not be suitable for the new environment, which is drastically different from what it had learned.
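For reproducibility, the sketch below shows one way to generate a Waxman-style topology directly from Equation 7 and to apply the kind of random node and link removal used in these experiments. It uses networkx and is our own illustration under simplifying assumptions (uniform node placement in the unit square), not the BRITE-based setup of the paper.

```python
import math
import random
import networkx as nx

def waxman_topology(n, alpha=0.15, beta=0.2, seed=0):
    """Add edge (u, v) with probability alpha * exp(-d / (beta * L)), as in Equation 7,
    with nodes placed uniformly at random in the unit square."""
    rng = random.Random(seed)
    pos = {i: (rng.random(), rng.random()) for i in range(n)}
    dist = lambda u, v: math.dist(pos[u], pos[v])
    L = max(dist(u, v) for u in range(n) for v in range(u + 1, n))

    g = nx.Graph()
    g.add_nodes_from(pos)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < alpha * math.exp(-dist(u, v) / (beta * L)):
                g.add_edge(u, v)
    return g

def perturb(graph, node_fraction=0.1, link_fraction=0.1, seed=0):
    """Randomly remove a fraction of nodes and links to emulate failures."""
    rng = random.Random(seed)
    g = graph.copy()
    g.remove_nodes_from(rng.sample(list(g.nodes), int(node_fraction * g.number_of_nodes())))
    g.remove_edges_from(rng.sample(list(g.edges), int(link_fraction * g.number_of_edges())))
    return g

g = waxman_topology(50)
g_failed = perturb(g, node_fraction=0.1, link_fraction=0.1)
# As in the evaluation, only source-destination pairs that still have a valid route are used.
valid = [(s, d) for s in g_failed.nodes for d in g_failed.nodes
         if s != d and nx.has_path(g_failed, s, d)]
```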


Figure 1: CDF of the latency of a converged MA-DQN model on (a) (50V, 100E), (b) (100V, 200E), and (c) (150V, 300E) Barabasi-Albert networks. We observe that the MA-DQN model is able to consistently outperform OSPF and ECMP. (d) Episodic latency of the MA-DQN model during training on a Barabasi-Albert network.

Figure 2: CDF of the latency of a converged MA-DQN model on (a) (50V, 100E), (b) (100V, 200E), and (c) (150V, 300E) Waxman networks. We observe that the MA-DQN model is able to consistently outperform OSPF and ECMP. (d) Episodic latency of the MA-DQN model during training on a Waxman network.

Figure 3: CDF of the bandwidth of a converged MA-DQN model on (a) (50V, 100E), (b) (100V, 200E), and (c) (150V, 300E) Waxman networks. We observe that the MA-DQN model is able to consistently outperform OSPF and ECMP.

3) Lessons Learned from Reward Engineering in Routing: We explored several reward functions to determine the best one. We first consider the performance of the MA-DQN model when using the reward function described in Equation 4. Figure 6(c) shows the error rate obtained when using (4). We observe that the model was unable to converge. We also observe that what produced optimal results for a single-agent model performed poorly in a multi-agent model. The reward function does not penalize poor routing decisions enough, but rewards the agent for reaching the destination regardless of what route it took when it was one step away from it. Such behavior leads to positive rewards for greedy routing decisions. Additionally, we noted that, since each routing decision is independent, punishing the model for failing to send the packet yielded poor routing performance, as only the last agent's weights are affected by the punishment; the previous agents' weights are not impacted if the packet is not routed. Moreover, the model gave the same reward regardless of how much closer it got to the destination. Since the reward function of Equation (4) did not provide enough incentives for better routes, the model took longer to reach the destination.

After learning these lessons, we redesigned the reward functions (Equations 5-6). To determine the negative constant used in the reward function defined in Equations 5-6, we ran a grid search over the range [−1, −.75, −.5, −.25, −.1] on a 100-node Waxman topology. This allowed us to find the optimal constant. As shown in Figures 6(a) and 6(b), we find that for the agent to learn how to avoid poor routes effectively, the magnitude of the negative reward needs to be larger than the magnitude of the positive rewards. The model with the negative constant of −.1 did not learn the best policy. The value of the positive reward received for non-terminal packet forwarding, a, can only be in the range (0, 1), since each link has a latency between (0, 1). The positive terminal constant encourages the agent to send the packet via the best possible link when the packet is one step away from the destination. We choose the value of 1.5 as it is sufficiently larger than a, giving the agent an incentive to choose the best possible link for the final hop.

After modifying the reward, we assessed the performance of the new reward function described in Equations 5-6. The last agent to make a packet routing decision does not receive a punishment for a failure to route the packet, as a single agent cannot be the only one responsible for such a failure to deliver the packet. Additionally, the agent receives a punishment for making a greedy decision despite being one step away from the destination, and it receives a larger reward for decisions that bring it much closer to the destination than for decisions that bring it only slightly closer. We found that such modifications to the reward structure lead to a significant training convergence speed-up.
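The redesigned per-hop reward of Equations 5-6, with the grid-searched negative constant of −0.75, can be sketched as follows. This is a simplified illustration with our own argument names, not the authors' code.

```python
def ma_dqn_reward(one_hop_from_destination, chose_best_link, moved_closer, delivery_time_gain):
    """Dense per-decision reward of Equations 5-6.

    delivery_time_gain: r, the reduction in packet delivery time to the destination
    obtained by moving from the previous node to the current one (assumed in (0, 1)).
    """
    if one_hop_from_destination:
        # Equation 6: terminal decision.
        return 1.5 if chose_best_link else -0.75
    # Equation 5: non-terminal decision.
    return delivery_time_gain if moved_closer else -0.75

# Example: a non-terminal hop that shaves 0.3 time units off the delivery time.
print(ma_dqn_reward(False, False, True, 0.3))   # 0.3
print(ma_dqn_reward(True, False, True, 0.3))    # -0.75 (sub-optimal final link chosen)
```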


Figure 4: Performance comparison of MA-DQN and GNN when the network topology changes. (a) MA-DQN performance with links removed, (b) MA-DQN performance with nodes removed, (c) GNN performance with links removed, and (d) GNN performance with nodes removed. Both models are evaluated on 50-node Waxman networks, and the results illustrate how each model adapts to changes in network topology.

Figure 5: Rate at which the agent makes poor decisions in (a) Barabasi-Albert and (b) Waxman networks. (c) Comparison of the latency of converged MA-DQN models across Barabasi-Albert and Waxman topologies. We observe that the models' performance remains similar across network sizes and topologies. (d) Evaluation of the model's performance when ϵ (the rate at which the model explores) is reset to an initial value of 1 after 5 nodes are removed from a 100-node Barabasi-Albert network.

Figure 6: (a) Latency of the MA-DQN model during training when r = −.75 in Equations 5-6 is replaced with different reward values. (b) Frequency of positive non-terminal rewards (a in Equations 5-6) the MA-DQN model received during training when the reward for a poor routing decision is −.1. (c) The error rate for the MA-DQN model when the reward function in Equation 4 is used on a (100V, 200E) Barabasi-Albert network. (d) Evaluation of the error rate when the MA-DQN model continues to be trained even after it has converged.

4) Over-training Avoidance: In this section, we analyze the results of over-fitting the model on a 100-node Barabasi-Albert network topology. One of the most significant parameters for training an RL model is the so-called ϵ: this parameter represents the trade-off between exploring and exploiting the RL environment, and it decays toward 0 during the training phase. When the parameter decays to exactly 0, the type of accuracy loss shown in Figure 6(d) does not occur; decaying ϵ to exactly 0 yields the results shown in Figure 5(a,b). In Figure 6(d), ϵ decayed only to 0.05, causing the model to pick up on noise and hence perform poorly.

C. GCN Evaluation Settings

In reinforcement learning, two factors are commonly considered when comparing different approaches: (i) the number of timesteps to policy convergence and (ii) the strength of the learned policy. Through a purely theoretical lens, the latter is the more critical metric. However, in systems where retraining is required, the former metric becomes essential. The traditional unattractive view towards machine learning-based routing comes from the protocol's inability to quickly adjust to a change in network state. This inability requires full retraining of the policy, given that transfer learning is an open and unsolved problem in reinforcement learning.

We improve the vanilla Q-Routing [26] by implementing two different approaches: Double Q-Learning [27] and Dueling DQN [28]. Double Q-Learning fixes the Q-network's tendency to overestimate the value of actions, which introduces a maximization bias in learning, ultimately leads to unstable training, and negatively impacts the quality of the policy. Dueling DQN is an innovation in policy architecture: it separates the value and advantage estimators into two streams and then rejoins them to select the action. The key concept behind this design decision is that it is unnecessary to know each action's value at every time step. These modifications result in more stable training (regardless of the RL application domain). In our experiments, network topologies are generated by keeping the settings used in Section IV-A. We train the SA-GCN model on NVIDIA RTX A4000 GPUs with 8GB of RAM, and train the model for 1 million episodes. The model requires 2GB of RAM.
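For reference, a dueling head of the kind described above can be written in a few lines. This PyTorch sketch is generic (layer sizes and class names are placeholders of our own) and is not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling DQN head: separate value and advantage streams, recombined into Q-values."""

    def __init__(self, feature_dim, num_actions, hidden=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)        # state value V(s)
        a = self.advantage(features)    # advantages A(s, a)
        # Subtracting the mean advantage keeps the two streams identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

# Example: Q-values for a batch of 4 states with 5 candidate next hops.
q_values = DuelingHead(feature_dim=32, num_actions=5)(torch.randn(4, 32))
```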


Figure 7: (a) Comparison of the reward strength between a GCN policy, a Q-Network, and the reward used in the Learning to Route paper. (b) Comparison of the number of training timesteps required for convergence between a GCN policy and a Q-Network. (c) Analysis of reward functions with respect to latency and bandwidth in a (150V, 350E) Waxman topology.

Figure 8: Analysis of latency for a GCN policy and a Q-Network for three different network topology classes: fat tree, Waxman, and Barabasi-Albert (BA). (a): (50V, 100E), (b): (100V, 200E), and (c): (150V, 350E).

D. Evaluation of GNN-Based Routing

1) GCN-based Policies Lead to Faster Convergence and Stronger Reward Signals: In this experiment set, we compare the characteristics of converged GCN-based policies, Q-routing policies, and the policy used by the Learning to Route paper [29], used as a benchmark (Figure 7). We also explore how the use of Equation 3 to optimize a particular QoS metric influences the routing performance in several networks.

Once the model is fully trained, we test it using a test set of a hundred different topologies for every network size. We randomly choose the source-destination pairs. In Figures 7-a and 7-b, the reward function used is the one described in Equation 3, where we optimize the reward r to prioritize low-latency paths.

Figure 7-b compares the number of training timesteps required for policy convergence, where convergence requires the policy to successfully route every source-destination pair in the test set of 100 randomized topologies for every network size. The results show that the GCN approach achieves a significantly higher percentage score. Indeed, graph convolutions are excellent at feature extraction on graph-based data. We believe that such results are due to the application of GCN to a network routing task, combined with a reinforcement learning policy update algorithm that outperforms Q-Learning in both efficiency and the ability to obtain better rewards.

Lastly, we explore how various reward functions operate with respect to throughput and latency on a Barabasi-Albert graph sized (100V, 350E) (Figure 7-c). For additional comparison, the performance of two of the most commonly used iBGP routing protocols, OSPF and ECMP, is plotted as well. The three reward structures explored are as follows: (i) optimize latency only, (ii) optimize bandwidth only, and (iii) co-optimize latency and bandwidth. We find that each algorithm optimizes what its respective reward function sought to optimize, showing that RL-based approaches can potentially be customized according to a particular set of QoS metrics [30]. In Figure 11, we compare the performance of MA-DQN with GNN-based methods inspired by [31], along with the reward function in Equations 5-6. We observe that in smaller networks, MA-DQN and GNN perform similarly. However, in networks of 75 nodes and above, MA-DQN outperforms the GNN-based methods.
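One plausible way to read the co-optimization case (iii), using the ratio of Equation 3 for each metric with equal weights, is sketched below. The paper does not spell out its exact combination beyond optimizing both metrics with equal weights, so the normalization of the bandwidth term is our own assumption.

```python
def co_optimization_reward(measured_latency, optimal_latency, achieved_bandwidth, best_bandwidth):
    """Equal-weight combination of a latency ratio and a bandwidth ratio, each kept in (0, 1]."""
    r_latency = optimal_latency / measured_latency        # Equation 3 for the latency case
    r_bandwidth = achieved_bandwidth / best_bandwidth     # our normalization for bandwidth
    return 0.5 * r_latency + 0.5 * r_bandwidth
```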



Figure 9: Analysis of bandwidth for a GCN policy and a Q-Network for three different network topology classes: fat tree, Waxman, and BA. (a): (50V, 100E), (b): (100V, 200E), and (c): (150V, 350E). (d) Comparison of the average number of collisions per routing for the SA-GCN model between the routed flow and the other flow sets in the network.

Figure 10: Comparing QoS metrics for GCN (blue), Q-Network (green), Learning to Route (black) [29], OSPF (red), and ECMP (yellow) with a variety of flow sets: (a) 1, (b) 2, (c) 4, and (d) 8, among 3 different network sizes: (50V, 100E), (100V, 200E), and (150V, 350E).

Figure 11: Comparing GCN and MA-DQN: latency CDF for MA-DQN and GCN models on (a) (50V, 100E) and (b) (75V, 150E) Waxman networks.

evaluated on each topology. Such expressive power can be 3) Bandwidth Improvements: Similar to the latency analysis
attributed to the policy construction of Q-Networks. Densely of GCN and Q-Routing, Figure 9 (a-c) compares the two
connected neural networks do not have as much expressive approaches and their ability to optimize bandwidth across the
power as their convolutional counterpart. The performance same three network topologies, considering the exact three
deficiency is compounded when the data is graphical and network sizes. It is observed that our GCN policy construction
compounded even further when the neural network has to learn outperforms Q-Routing in all nine scenarios.
the state transitions implicitly, s′ , from all state-action pairs, The GCN policy is only trained once for each network
(s, a). topology size and then tested the previously unseen test
In Figure 8 we show the network performance on a test set topologies for each class without any adjustment. Contrarily,
of 100 networks for each topology type. We selected a Fat Q-Routing is trained explicitly on the test topologies. Despite
Tree topology to simulate data center networks. Waxman and this, GCN was capable of achieving better performance in
Barabasi-Albert topology generators were used for the other terms of bandwidth than the Q-Routing solution.
3) Bandwidth Improvements: Similar to the latency analysis of GCN and Q-Routing, Figure 9(a-c) compares the two approaches and their ability to optimize bandwidth across the same three network topology classes, considering the same three network sizes. We observe that our GCN policy construction outperforms Q-Routing in all nine scenarios.

The GCN policy is only trained once for each network topology size and then tested on the previously unseen test topologies for each class without any adjustment. Contrarily, Q-Routing is trained explicitly on the test topologies. Despite this, GCN was capable of achieving better performance in terms of bandwidth than the Q-Routing solution.

4) Reward Engineering: To complete our analysis, in this experiment set we investigate how different reward functions influence both the latency and the bandwidth of the selected routes. The three reward functions we consider minimize latency, maximize bandwidth, and simultaneously optimize both with equal weights.
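A minimal sketch of these three reward variants is given below; the normalization constants and the equal-weight combination are illustrative assumptions, not the exact reward shaping used in our implementation.

```python
# Three per-step reward variants for the routing agent (illustrative scaling).
def reward_latency(link_delay_ms: float) -> float:
    # Minimize latency: penalize the delay of the traversed link.
    return -link_delay_ms

def reward_bandwidth(available_bw_mbps: float, link_capacity_mbps: float) -> float:
    # Maximize bandwidth: reward the fraction of capacity still available on the chosen link.
    return available_bw_mbps / link_capacity_mbps

def reward_combined(link_delay_ms, available_bw_mbps, link_capacity_mbps,
                    max_delay_ms=100.0):
    # Equal-weight combination: normalize both terms to [0, 1] and average them.
    latency_term = 1.0 - min(link_delay_ms / max_delay_ms, 1.0)
    bandwidth_term = available_bw_mbps / link_capacity_mbps
    return 0.5 * latency_term + 0.5 * bandwidth_term
```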


In these experiments, no other flow sets coexisted in the network. As an additional comparison, we included OSPF and ECMP to highlight how our policies compare to industry-standard algorithms. For single-flow protocols (i.e., all algorithms except ECMP), GCN learns more robust policies than Q-routing and LTR. This evaluation yields promising observations for reinforcement learning routing protocols that are adequately fine-tuned to the desired QoS metrics.
While GCNs outperform Q-Routing and LTR in several metrics, the evaluation thus far has been under non-stressed network conditions. That is, we assumed that there are enough resources and no competition among the flows to be routed. While such an assumption may be appropriate in eBGP, we cannot route unaware of other flows on transit links within each Autonomous System or in other routing use cases. Even in a small data center network, tens of thousands of flows may compete across server racks under the administration of a single domain. However, not all flow collisions are equally damaging. To assess the impact of other flows and see how our algorithm learns to route while avoiding collisions, we conducted two closely related experiments: we compared the average number of collisions each protocol endured as a function of the number of other flow sets in the network, and we measured how those collisions impact the latency and bandwidth of the flow set being routed by the policy.
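To make the collision metric concrete, the sketch below counts, for a routed path, how many of its links are also used by other flow sets. The per-link overlap definition and the helper names are assumptions introduced only to illustrate the experiment, not a verbatim excerpt of our measurement code.

```python
# Counting collisions between a routed path and the paths of other flow sets (illustrative).
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]

def path_links(path: List[str]) -> Set[Edge]:
    # Treat links as undirected by normalizing endpoint order.
    return {tuple(sorted(edge)) for edge in zip(path, path[1:])}

def count_collisions(routed_path: List[str], other_paths: Iterable[List[str]]) -> int:
    # A collision is counted once per link that the routed flow shares with another flow set.
    routed_links = path_links(routed_path)
    return sum(len(routed_links & path_links(p)) for p in other_paths)

# Example: one routed flow competing with two other flow sets.
routed = ["s", "a", "b", "t"]
others = [["s", "a", "c", "t"], ["x", "b", "a", "y"]]
print(count_collisions(routed, others))  # -> 2 shared links: (s, a) and (a, b)
```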
Figure 9(d) depicts the average number of collisions as a function of the number of other flow sets for a network of size (100V, 200E). We observed that ECMP resulted in the highest number of collisions because the flow is split among several distinct paths from source to destination. Our other benchmark, based on Q-Routing, achieved a similar average number of collisions to the GCN policy. To quantify how the number of packet collisions impacts QoS metrics, Figure 10 provides a more in-depth analysis of such an impact on latency and bandwidth. In the case of 1 and 2 competing flow sets (say, mice and elephant flows), ECMP outperformed all other protocols. However, when there were 4 and 8 competing flow sets, the GCN policy's learned ability to avoid the other flow sets produced the best realized bandwidth and latency among all protocols.
5) Impact of retraining needs for GNN: Similar to Section IV-B2, in Figures 4(c-d) we replicate the experiment for the GNN and observe that the GNN can relearn an optimal policy for minor changes in the environment. However, with larger network topology changes, the GNN is unable to relearn an optimal policy in a timely fashion. Hence, we conclude that in those cases, retraining the model from scratch leads to better results.
V. CONCLUSION
In this paper, we explored packet routing with reinforcement learning with a few novel twists. Our objective has been to study the impact of topology and traffic changes on a trained neural network, evaluating the ability of the model to dynamically adapt without the need for retraining. To do so, we focused on single-domain routing, suitable, e.g., for interior BGP, and on multi-domain routing, valuable for larger-scale routing protocols such as exterior BGP. We proposed a single-agent RL model, based on a Graph Convolutional Network (GCN) [4], to fit the former, and a multi-agent Deep Q-Learning Network model for the latter.

We report reward engineering considerations on our routing algorithms, evaluating both single- and multi-agent solutions in extensive experimental settings with different network sizes and connectivity models. Our results show that the single-agent model with GCN improves the ability to achieve high QoS metrics when the computer network topology changes after the RL training converges. Moreover, the multi-agent model is able to scale well with larger networks, consistently outperforming OSPF and ECMP. Our findings indicate that the MA-DQN model is capable of adapting to changes in topology in networks with up to 50 nodes, but we also observed limitations in its ability to relearn an optimal policy on larger networks. These observations highlight interesting directions for future research in this area. Our code is available with an MIT license at [34].

REFERENCES

[1] Volodymyr Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: CoRR abs/1312.5602 (2013).
[2] Brandon Schlinker et al. “Engineering egress with edge fabric: Steering oceans of content to the world”. In: Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 2017, pp. 418–431.
[3] Timothy G. Griffin, F. Bruce Shepherd, and Gordon Wilfong. “Policy disputes in path-vector protocols”. In: Proceedings of the Seventh International Conference on Network Protocols. IEEE. 1999, pp. 21–30.
[4] Thomas N. Kipf and Max Welling. “Semi-Supervised Classification with Graph Convolutional Networks”. In: Proceedings of the 5th International Conference on Learning Representations (ICLR). Palais des Congrès Neptune, Toulon, France, 2017.
[5] Justin A. Boyan and Michael L. Littman. “Packet routing in dynamically changing networks: A reinforcement learning approach”. In: Advances in Neural Information Processing Systems. 1994, pp. 671–678.
[6] Michael Littman and Justin Boyan. “A distributed reinforcement learning scheme for network routing”. In: Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications. Psychology Press. 2013, pp. 55–61.
[7] Samuel P. M. Choi and Dit-Yan Yeung. “Predictive Q-routing: A memory-based reinforcement learning approach to adaptive traffic control”. In: Advances in Neural Information Processing Systems. 1996, pp. 945–951.
[8] Zoubir Mammeri. “Reinforcement learning based routing in networks: Review and classification of approaches”. In: IEEE Access 7 (2019), pp. 55916–55950.
[9] Shih-Chun Lin et al. “QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach”. In: 2016 IEEE International Conference on Services Computing (SCC). IEEE. 2016, pp. 25–33.


[10] Abhijeet A. Bhorkar et al. “Adaptive opportunistic routing for wireless ad hoc networks”. In: IEEE/ACM Transactions on Networking 20.1 (2011), pp. 243–256.
[11] Zhichu Lin and Mihaela van der Schaar. “Autonomic and distributed joint routing and power control for delay-sensitive applications in multi-hop wireless networks”. In: IEEE Transactions on Wireless Communications 10.1 (2010), pp. 102–113.
[12] Xiaohong Huang et al. “Deep reinforcement learning for multimedia traffic control in software defined networking”. In: IEEE Network 32.6 (2018), pp. 35–41.
[13] Dehao Lan et al. “A deep reinforcement learning based congestion control mechanism for NDN”. In: ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE. 2019, pp. 1–7.
[14] Rongpeng Li et al. “Deep reinforcement learning for resource management in network slicing”. In: IEEE Access 6 (2018), pp. 74429–74441.
[15] Sebastian Troia et al. “On deep reinforcement learning for traffic engineering in SD-WAN”. In: IEEE Journal on Selected Areas in Communications (2020).
[16] Syed Qaisar Jalil et al. “DQR: Deep Q-Routing in Software Defined Networks”. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE. 2020, pp. 1–8.
[17] Xinyu You et al. “Toward packet routing with fully distributed multiagent deep reinforcement learning”. In: IEEE Transactions on Systems, Man, and Cybernetics: Systems 52.2 (2020), pp. 855–868.
[18] Ruijin Ding et al. “Deep reinforcement learning for router selection in network with heavy traffic”. In: IEEE Access 7 (2019), pp. 37109–37120.
[19] Siliang Zeng, Xingfei Xu, and Yi Chen. “Multi-Agent Reinforcement Learning for Adaptive Routing: A Hybrid Method using Eligibility Traces”. In: 2020 IEEE 16th International Conference on Control and Automation (ICCA). 2020, pp. 1332–1339.
[20] Pinyarash Pinyoanuntapong, Minwoo Lee, and Pu Wang. “Delay-optimal traffic engineering through multi-agent reinforcement learning”. In: Proc. of IEEE INFOCOM Workshops. IEEE. 2019, pp. 435–442.
[21] John Schulman et al. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).
[22] Mikael Henaff, Joan Bruna, and Yann LeCun. “Deep Convolutional Networks on Graph-Structured Data”. In: arXiv preprint (June 2015).
[23] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533.
[24] A. Medina et al. “BRITE: an approach to universal topology generation”. In: Proc. of MASCOTS. 2001, pp. 346–353.
[25] Tom M. Mitchell. Machine Learning, International Edition. McGraw-Hill Series in Computer Science. McGraw-Hill, 1997.
[26] Rocio Arroyo-Valles et al. “Q-probabilistic routing in wireless sensor networks”. In: 2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information. IEEE. 2007, pp. 1–6.
[27] Hado Hasselt. “Double Q-learning”. In: Advances in Neural Information Processing Systems 23 (2010).
[28] Ziyu Wang et al. “Dueling network architectures for deep reinforcement learning”. In: International Conference on Machine Learning. PMLR. 2016, pp. 1995–2003.
[29] Asaf Valadarsky et al. “Learning to Route with Deep RL”. In: 31st Conference on Neural Information Processing Systems. 2017.
[30] Doron Zarchy et al. “Axiomatizing Congestion Control”. In: SIGMETRICS ’19. Phoenix, AZ, USA: ACM, 2019. DOI: 10.1145/3309697.3331501.
[31] Xuan Mai, Quanzhi Fu, and Yi Chen. Packet Routing with Graph Attention Multi-agent Reinforcement Learning. 2021. arXiv: 2107.13181 [cs.AI].
[32] B. M. Waxman. “Routing of multipoint connections”. In: IEEE Journal on Selected Areas in Communications 6.9 (1988), pp. 1617–1622.
[33] Albert-László Barabási and Réka Albert. “Emergence of scaling in random networks”. In: Science 286.5439 (1999), pp. 509–512.
[34] Sai Shreyas Bhavanasi, Lorenzo Pappone, and Flavio Esposito. https://github.com/routing-drl/main/. Online. 2023.

Sai Shreyas Bhavanasi is a graduate student at Saint Louis University pursuing a Master’s in Artificial Intelligence. He received his Bachelor of Science in Computer Science and Data Science from Saint Louis University. His research interests include applied machine learning, reinforcement learning, and computer networks.

Lorenzo Pappone received the M.Sc. degree in Computer Engineering from the University of Naples Federico II in 2021. He is currently pursuing a Ph.D. degree in Computer Science at Saint Louis University, USA. His main research interests include applied machine learning, deep learning, traffic engineering, and distributed systems.

Flavio Esposito is an Associate Professor of Computer Science at Saint Louis University. He obtained his BS and MS in Telecommunication Engineering from the University of Florence, Italy, and his Ph.D. in Computer Science from Boston University. His research centers on the intersection of networked systems and artificial intelligence. Before joining academia, Flavio was a senior software engineer and worked in a few research laboratories in Europe and the USA. He is a Principal Investigator on several research awards from the National Science Foundation. His funded projects include edge computing, machine learning for network management, next-generation wireless networks, distributed artificial intelligence, and computer security. Flavio’s main research interests include network management, network virtualization, and distributed systems. Flavio is the recipient of several awards, including several National Science Foundation awards and the Comcast Innovation Award in 2021.
