Deep Reinforcement Learning for Router Selection in Network With Heavy Traffic
ABSTRACT The rapid development of wireless communications brings a tremendous increase in the number of data streams and poses significant challenges to traditional routing protocols. In this paper, we leverage deep reinforcement learning (DRL) for router selection in networks with heavy traffic, aiming at reducing network congestion and the length of the data transmission path. We first illustrate the challenges of the existing routing protocols when the amount of data explodes. We then utilize the router selection Markov decision process (RSMDP) to formulate the routing problem. Two novel deep Q network (DQN)-based algorithms are designed to reduce the network congestion probability with a short transmission path: one focuses on reducing the congestion probability, while the other focuses on shortening the transmission path. The simulation results demonstrate that the proposed algorithms can achieve higher network throughput compared to existing routing algorithms in heavy network traffic scenarios.
INDEX TERMS Deep reinforcement learning, routing, network congestion, network throughput, deep
Q network.
object detection [15]–[19], communications [20]–[25], as well as many other fields.

DL has also been adopted in routing problems. For example, it can imitate the OSPF protocol [26] to reduce the signaling overhead. However, the algorithm in [26] is essentially an imitation of traditional protocols and is insufficiently intelligent to deal with complicated network states. Following [26], a deep convolutional neural network based routing algorithm has been proposed in [27], which utilizes the neural network to judge the network congestion caused by each path combination. However, building a neural network for each possible path combination results in a large number of neural networks to train, and therefore increases the demand on computing resources.

However, DL generally requires label information for the training data, which demands massive manual effort. In addition, DL is inherently an approximation of a certain function and is not suitable for decision-making problems such as routing, energy allocation, and recommender systems. In this case, deep reinforcement learning (DRL) [28] emerges as an alternative for solving decision-making problems. Compared with traditional reinforcement learning methods¹ [29], DRL takes advantage of the function approximation ability of DL to solve practical problems with large-scale state and action spaces [30]–[32]. For instance, DRL can help energy harvesting devices allocate energy to maximize the sum rate of the communications and predict the battery power accurately [33], or guide two-hop communications to achieve high throughput [34]. Moreover, DRL has been utilized for ranking in an E-commerce search engine to improve the total transaction amount [35].

¹Reinforcement learning (RL) is a learning technique in which an agent learns from interaction with the environment via trial-and-error.

In this paper, we design two DRL-based online routing algorithms to address the network congestion problem. The proposed algorithms can reduce the probability of network congestion and shorten the length of transmission paths, i.e., the number of hops from the source router to the destination. The main contributions of this paper are summarized as follows:
• We leverage the router selection Markov decision process (RSMDP) to formulate the routing problem and define the corresponding state space, action space, reward function, and value function.
• We propose two online routing algorithms, i.e., source-destination multi-task deep Q network (SDMT-DQN) and destination-only multi-task deep Q network (DOMT-DQN), which can learn from past experiences and update routing policies in real time. SDMT-DQN is able to significantly reduce the congestion probability, while the corresponding path length may occasionally be long. In comparison, DOMT-DQN can significantly shorten the path length while maintaining the congestion probability at an acceptably low level.

The rest of the paper is organized as follows. Section II states the routing problem and outlines the system model. In Section III, we introduce the RSMDP in detail and analyze the setting of some parameters. The proposed DRL algorithms are detailed in Section IV. Section V provides the simulation results, while Section VI concludes the paper.

II. PROBLEM STATEMENT AND SYSTEM MODEL
A. PROBLEM STATEMENT
We assume that the network operates in a time-slotted fashion with normalized time slots. Transmitting a data packet from the source router to the destination is regarded as a data transmission task. At each time slot, a task selects the next hop router and the data packet is transferred to it. This process continues until the data packet arrives at the destination. Network congestion happens when the size of the arriving packet exceeds the remaining buffer size of the router.

Traditional routing protocols are formulated as a classical combinatorial optimization problem, where the data packets are transmitted along the shortest path. Under such a shortest-path principle, certain routers may be simultaneously selected by multiple tasks, which very likely leads to network congestion due to the finite buffer size of the routers. For example, as shown in Fig. 1, three packets from L0, L1, L2 are transmitted to the destination L8. Based on the shortest-path principle, L4 would be chosen as the next hop for the packets by the traditional protocols. When the packets are relatively large, the remaining buffer size of L4 will not be sufficient and the network is prone to congestion. Moreover, when the same or a similar situation appears again, traditional routing protocols would fall into congestion again. Even though the network congestion has occurred many times before, the traditional routing protocols would still select the same or a similar routing path. Therefore, it is necessary and important for the routing strategy to learn from past experience and make itself sufficiently intelligent to choose optimal routing paths according to the network states.

B. SYSTEM MODEL
Consider a general backbone network with N routers in the set L = {L_1, L_2, ..., L_N}. Define L_s, L_d, and L_r as the disjoint sets of source routers, destination routers, and regular routers, respectively, with L = L_s ∪ L_d ∪ L_r. Moreover, |L_s| ≜ N_s, |L_d| ≜ N_d, |L_r| ≜ N_r, and N_s + N_d + N_r = N. Let D_{i,t} and B_{i,t} denote the total size of all packets and the remaining buffer size in L_i at time slot t, respectively. Define B_t = [B_{1,t}, ..., B_{N,t}] and D_t = [D_{1,t}, ..., D_{N,t}]. We denote the size of the packet newly generated by data source i at time slot t by V_{i,t} and define V_t = [V_{1,t}, ..., V_{N_s,t}] as the size vector of all input packets. The data generation is set as a Poisson process. The state of the network at time slot t can be characterized by the tuple (V_t, D_t, B_t).
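As a concrete illustration of this state bookkeeping, the short Python sketch below draws the newly generated packet sizes and assembles the tuple (V_t, D_t, B_t) for one slot. It is not the authors' simulator; the variable names are ours, and the 45 MB buffer size and 15 MB mean packet size are simply the values quoted later in Section V.

```python
import numpy as np

rng = np.random.default_rng(0)

N, N_s = 9, 3                 # illustrative topology size: 9 routers, 3 of them sources
BUFFER_CAP = 45.0             # buffer size per router in MB (value from Section V)
POISSON_MEAN = 15.0           # mean generated packet size per source per slot in MB (Section V)

D = np.zeros(N)               # D_t: total size of the packets currently held by each router
B = np.full(N, BUFFER_CAP)    # B_t: remaining buffer size of each router

def network_state():
    """Return the network-state tuple (V_t, D_t, B_t) for the current slot."""
    V = rng.poisson(POISSON_MEAN, size=N_s).astype(float)   # Poisson packet generation
    return V, D.copy(), B.copy()

V_t, D_t, B_t = network_state()
print(V_t, D_t, B_t)
```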
During time slot t, the input packets are generated by the data sources, and then flow to the source routers and change the remaining buffer size of the source routers. We assume that a packet can be completely transferred from one router to another in one time slot, and the values of D_{i,t} and B_{i,t} change during the transmission process. For instance, if a data packet of size f flows from L_i to L_j at time slot t, then at time slot t + 1 the tuple (D_{i,t+1}, D_{j,t+1}, B_{i,t+1}, B_{j,t+1}) has six possible cases, as shown in (1). When L_i or L_j is the source router, the newly generated data should be considered, and if L_j is the destination router, the data are transferred to the terminals directly without being stored in the buffer.

Note that the current location and the size of the data packet also affect the selection of the next hop router. We therefore adopt a modified one-hot encoding vector O_t of size N to represent these characteristics. When the packet is in router L_i, the i-th element of O_t is the size of the data packet, while all other elements are zero. Such a modified one-hot encoding helps the computer understand the size and position of the packet. Overall, we can denote the state of each task by S_t = (V_t, D_t, B_t; O_t).
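The modified one-hot encoding is easy to state in code. The sketch below is only illustrative (the helper names and the flattening of S_t into a single input vector are our assumptions), but it reproduces the 3N + N_s = 30 input size quoted later in Section V-A.

```python
import numpy as np

def modified_one_hot(position, packet_size, num_routers):
    """O_t: zero everywhere except at the router holding the packet,
    where the entry stores the packet size."""
    o = np.zeros(num_routers)
    o[position] = packet_size
    return o

def task_state(V, D, B, position, packet_size):
    """S_t = (V_t, D_t, B_t; O_t), flattened into one network input vector."""
    O = modified_one_hot(position, packet_size, num_routers=len(D))
    return np.concatenate([V, D, B, O])

# Example: a 12 MB packet currently sitting in router L4 of a 9-router network
s_t = task_state(V=np.array([15.0, 10.0, 20.0]), D=np.zeros(9),
                 B=np.full(9, 45.0), position=4, packet_size=12.0)
print(s_t.shape)   # (30,) = 3*N + N_s input units
```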
Moreover, the network can be represented by a directed graph G = {V, E}, where V is the set of all vertices corresponding to the routers and E is the set of edges corresponding to the links between the routers. The data transmission task chooses an action according to the network state along with the position and size of the packet, where an action is defined as the link between the current router and the next hop router. For instance, if the task whose packet is in L_i selects L_j as the next router, this means that link(i, j) ∈ E is selected as the action. Besides, the link between two routers is bidirectional, i.e., a data packet can be transferred from L_i to L_j or conversely, denoted by link(i, j) and link(j, i), respectively. Let A denote the set of all possible actions, i.e., A = E, with cardinality |A| = N_a. Note that not all actions are valid for a data transmission task, since the packet can only be transferred to a router connected to its current position. Namely, the task can only choose a link starting from the current position of its packet as a valid action. Therefore, during the transmission process, the valid actions of the task keep changing according to its current position.
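The valid-action rule only needs the adjacency of the topology graph. The sketch below stores every undirected link as the two directed actions link(i, j) and link(j, i) and filters those that start at the packet's current router; the edge list itself is a made-up placeholder, not the Fig. 1 topology.

```python
# Each undirected link of G = {V, E} is stored as two directed actions.
# The pairs below are illustrative placeholders, not the Fig. 1 topology.
EDGES = [(0, 3), (3, 0), (1, 4), (4, 1), (2, 4), (4, 2), (3, 8), (8, 3), (4, 8), (8, 4)]
ACTIONS = {a: link for a, link in enumerate(EDGES)}   # action index -> link(i, j)

def valid_actions(current_router):
    """Only links that start at the packet's current position are valid."""
    return [a for a, (i, _) in ACTIONS.items() if i == current_router]

print(valid_actions(4))   # the actions leaving L4
```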
III. ROUTER SELECTION MARKOV DECISION PROCESS
In this section, we formulate the routing process as a Markov decision process (MDP), where the agent is the data transmission task and the environment is the network.

A. DEFINITION OF RSMDP
In the considered scenario, the tasks decide the next hop routers, and the corresponding decision-making process can be modeled as an MDP with rewards and actions. The MDP is represented by a tuple (S, A, P, R, γ), where
• The state space is denoted by S, which consists of the terminal state and the nonterminal states. The terminal state is a special state indicating that the task terminates. If the action is invalid or causes network congestion, the state turns into the terminal state. Besides, if the data packet arrives at the destination router, the task also terminates. The nonterminal states contain all continuing events, where the packets are transferred to the next hop routers without congestion and have not reached the destination.
• The action space is denoted by A, which corresponds to all the edges of the network topology graph. The actions are divided into valid and invalid parts, depending on the current location of the packet.
• The state transition probability function is denoted by P(s, a, s') = P[S_{t+1} = s' | S_t = s, A_t = a]. In the considered scenario, the state transition probability function is related to the probability distribution of the sizes of the packets newly generated by the data sources, because the vector of newly generated packet sizes V_t in the state tuple is random.
• The immediate reward on the transition from state s to s' under action a is denoted by R(s, a, s').
• The discount rate is denoted by γ ∈ [0, 1), which determines the present value of future rewards [29].

FIGURE 2. Router selection Markov decision process.

As Fig. 2 shows, at each time slot, the task selects the next hop router based on its current state, and the corresponding reward is obtained. This decision-making and reward feedback process is repeated, and we name it the RSMDP.

An MDP should satisfy the Markov property, which means that the future state is independent of the past states given the present state. Mathematically, the Markov property for the MDP is defined as follows:

P(s_{t+1} \mid s_0, a_0, s_1, \cdots, s_t, a_t) = P(s_{t+1} \mid s_t, a_t).    (2)

From (1), it is obvious that the next state is only related to the current state and the current action. Hence, the router selection process satisfies the Markov property.

B. REWARD FUNCTION
For any state s ∈ S, R(s, a, s') is the immediate reward that numerically characterizes the performance of action a taken with the state transiting from s to s'. For the problem defined in Section II-A, avoiding network congestion is the prerequisite of seeking the shortest path. Thus, the reward should first punish network congestion and then minimize the path length. As described in Section II, since each task can only choose an edge that starts from the router where the packet currently stays, the reward function is also supposed to punish invalid actions. Moreover, the reward function needs to account for the path length of the task. In summary, we set the reward function R(s, a, s') as follows:

R(s, a, s') = \begin{cases} r_c, & \text{if network congestion occurs,} \\ r_e, & \text{if } a \text{ is invalid,} \\ 0, & \text{if the packet arrives at the destination,} \\ -1, & \text{otherwise,} \end{cases}    (3)

where the reward −1 helps record the number of hops the data packet is transferred in the network. The constant r_c is the congestion reward, which takes a negative value smaller than −1 since network congestion should be avoided, while the constant r_e is the error reward for choosing an invalid action, which is also a negative value smaller than −1. The network feeds back a non-negative reward only when the packet arrives at the destination router. As a result, to avoid network congestion and invalid actions and to reduce the path length of each data transmission task, the objective of the routing algorithm is to find the optimal policy that maximizes the expected cumulative reward for each task. The details are described in the next subsection.
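The reward in (3) translates directly into code. The sketch below uses placeholder values r_c = r_e = −10 only to satisfy the "smaller than −1" requirement; the actual bound they must obey is derived in Section III-D.

```python
R_C = -10.0   # congestion reward r_c, assumed value < -1
R_E = -10.0   # error reward r_e for invalid actions, assumed value < -1

def reward(congested, invalid, at_destination):
    """Immediate reward R(s, a, s') from (3)."""
    if congested:
        return R_C
    if invalid:
        return R_E
    if at_destination:
        return 0.0
    return -1.0   # one more hop was taken
```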
C. VALUE FUNCTION
From (3), the reward at time slot t can be denoted by R_t = R(s_t, a_t, s_{t+1}). Assume the task turns into the terminal state after T time slots. Then, the cumulative discounted reward from time slot t can be expressed as

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_{t+T} = \sum_{k=1}^{T} \gamma^{k-1} R_{t+k}.    (4)

Define policy π as a probability distribution over action a, given state s, as

\pi(a \mid s) = P[A_t = a \mid S_t = s].    (5)

In the considered problem, policy π determines which router should be chosen as the next hop router conditioned on the current state of the transmission task.

Define

Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]    (6)

as the action-value function of the MDP based on policy π, i.e., the expectation of the cumulative discounted reward starting from s, taking action a, and following policy π.

The objective of the routing algorithm is to find a policy that maximizes the action-value function, i.e.,

Q^*(s, a) = \max_{\pi} Q_\pi(s, a).    (7)

The optimal policy can be found by maximizing over the optimal action-value function Q*(s, a) as

\pi^*(a \mid s) = \begin{cases} 1, & \text{if } a = \arg\max_{a \in A} Q^*(s, a), \\ 0, & \text{otherwise.} \end{cases}    (8)

From (8), if the optimal action-value function Q*(s, a) can be obtained, we can input (V_t, D_t, B_t; O_t) to compute the value of each action, and then choose the action that maximizes Q*(s, a). As mentioned in Section III-B, the optimal policy obtained from (8) can reduce the path length while avoiding network congestion.

One possible way to obtain the optimal action-value function Q*(s, a) is Q-learning, which can be iteratively implemented as

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]    (9)

during the training process, where α is the learning rate. Iteration (9) updates the estimated values of states based on the values of successor states, which is called bootstrapping. In this case, the learned action-value function converges to the optimal action-value function Q* [29].

To obtain the value of every action, the reinforcement learning algorithm must try every possible action. However, if the task only chooses the action that maximizes Q(s, a) during training, then actions that have not been tried before will barely be chosen, which makes the action-value function fall into a local optimum. Therefore, the algorithm should not only exploit the actions that have been tried before, but also explore new actions. Hence, the ε-greedy method is usually applied as

a = \begin{cases} \arg\max_{a} Q(s, a), & \text{with probability } 1 - \epsilon, \\ \text{random action}, & \text{with probability } \epsilon, \end{cases}    (10)

where ε is the probability of randomly choosing actions.
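For reference, a minimal tabular sketch that combines the update (9) with the ε-greedy rule (10). The hyper-parameter values and the dictionary-based Q table are our own illustrative choices; the algorithms in Section IV replace this table with neural networks.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative hyper-parameters
Q = defaultdict(float)                  # Q[(state, action)], zero-initialized

def epsilon_greedy(state, actions):
    """Rule (10): explore with probability epsilon, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, next_actions):
    """One application of the Q-learning iteration (9)."""
    bootstrap = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += ALPHA * (r + GAMMA * bootstrap - Q[(s, a)])
```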
D. DISCOUNT RATE
In this subsection, we consider the influence of the discount rate γ in the RSMDP. From (4), we know that the cumulative discounted reward leads to a ''myopic'' or ''far-sighted'' evaluation when γ is close to 0 or 1, respectively. Specifically, when γ is close to 0, the future rewards are hardly considered, while when γ is close to 1, the future rewards are taken into account with heavier weight. The value of the discount rate γ affects the DRL-based routing algorithm mainly in two aspects:
• How does the objective balance the congestion reward r_c, the error reward r_e, and the remaining cumulative reward?
• What is the relationship between the cumulative reward and the number of hops the packet takes to arrive at its destination?

1) REWARDS OF DIFFERENT TYPES
In the RSMDP, there are three situations that can terminate a task: (i) the packet has reached its destination; (ii) the transmission of the packet results in congestion at the next hop router; (iii) the action chosen by the task is invalid for transmission. The latter two situations should be averted, which is the prerequisite before shortening the length of transmission paths. Therefore, we should guarantee that the congestion reward and the error reward are smaller than the cumulative reward starting from the current state. According to the reason for the termination of the task, there are three cases of the cumulative reward:
• The task reaches the destination router at time slot T. In this case, R_t = −1 for t = 1, ..., T. Then the cumulative reward for the whole transmission process of the task equals

G_t = \sum_{t=1}^{T} \gamma^{t-1} R_t = -\sum_{t=1}^{T} \gamma^{t-1} = -\frac{1 - \gamma^{T}}{1 - \gamma}.    (11)

• The task chooses an action that leads to network congestion at time slot T. In this case, R_T = r_c, while R_t = −1 for t = 1, ..., T − 1. Then the cumulative reward for the whole transmission process of the task equals

G_t = \sum_{t=1}^{T} \gamma^{t-1} R_t = -\sum_{t=1}^{T-1} \gamma^{t-1} + \gamma^{T-1} r_c = -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_c.    (12)

• The task chooses an invalid action at time slot T. In this case, R_T = r_e, while R_t = −1 for t = 1, ..., T − 1. Then the cumulative reward for the whole transmission process of the task equals

G_t = \sum_{t=1}^{T} \gamma^{t-1} R_t = -\sum_{t=1}^{T-1} \gamma^{t-1} + \gamma^{T-1} r_e = -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_e.    (13)

Then, we should set r_c and r_e such that

r_c, r_e < \min\left\{ -\frac{1 - \gamma^{T}}{1 - \gamma},\; -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_c,\; -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_e \right\}.    (14)

As we mentioned in Section III-B, both r_c and r_e are less than −1; therefore,

-\frac{1 - \gamma^{T}}{1 - \gamma} > -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_c,    (15)

and

-\frac{1 - \gamma^{T}}{1 - \gamma} > -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_e.    (16)
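The three cumulative rewards (11)–(13) and the condition (14) can be checked numerically for candidate settings; the horizon T and the candidate r_c, r_e below are assumed values used only for illustration.

```python
GAMMA = 0.9

def g_destination(T):                      # cumulative reward (11)
    return -(1 - GAMMA**T) / (1 - GAMMA)

def g_congestion(T, r_c):                  # cumulative reward (12)
    return -(1 - GAMMA**(T - 1)) / (1 - GAMMA) + GAMMA**(T - 1) * r_c

def g_invalid(T, r_e):                     # cumulative reward (13)
    return -(1 - GAMMA**(T - 1)) / (1 - GAMMA) + GAMMA**(T - 1) * r_e

def satisfies_condition_14(r_c, r_e, T):
    """Check the bound (14) for a candidate pair (r_c, r_e) and horizon T."""
    bound = min(g_destination(T), g_congestion(T, r_c), g_invalid(T, r_e))
    return r_c < bound and r_e < bound

print(satisfies_condition_14(r_c=-50.0, r_e=-50.0, T=8))   # assumed values -> True here
```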
B. THE PROPOSED SDMT-DQN AND DOMT-DQN ALGORITHMS
Originally, DQN is designed for a single agent and therefore cannot help multiple tasks choose their next hop routers. To tackle this issue, we assume there is a centralized controller with sufficient computation ability that can collect information about the input packets and instruct the routers to send the data packets to the next hop routers.

We are interested in a distributed solution to find the routing policies of the tasks. Even if the data packets of different tasks are currently in the same router, the tasks may choose different actions due to their different goals. Hence, the centralized controller needs multiple neural networks to instruct every router to deliver the packets properly. Furthermore, we should categorize the tasks into different classes and apply one uniform neural network to each class. In this paper, we adopt two criteria to classify the tasks, which yields two different algorithms, respectively.

1) THE SDMT-DQN ALGORITHM
In the SDMT-DQN algorithm, we classify all the data transmission tasks into N_s × N_d categories based on their source routers and destination routers. Specifically, all tasks with the same source router and the same destination router are considered the same type of task and share the same neural network. As a result, N_s × N_d neural networks are needed to represent the action-value functions of all kinds of tasks. For the tasks from source router i to destination router j, there is a corresponding replay memory D_{i,j} with capacity C to store the experience tuples for training. Moreover, there is a target network to reduce the correlation among the input data. The algorithm is illustrated in Fig. 5.

At the centralized controller, we set a task queue Z to store the information of the tasks, e.g., the source and destination router of the task, the packet size, and the current position. The centralized controller selects the tasks in Z one by one. Then the neural network corresponding to the source router and destination router of the selected task takes the state of the selected task as input and outputs the value of each action. Afterwards, the centralized controller chooses an action for the data packet based on the ε-greedy method. If the selected action is invalid, then the centralized controller (i) regards the task as terminated and stores the corresponding state, action, reward r_e, and terminal state in the corresponding experience memory, and (ii) re-chooses the action whose value is the largest among the valid actions and continues the transmission. Therefore, an invalid action leads to two experience tuples. This procedure guarantees the validity of the selected action while storing the invalid action with r_e in the memory, thereby reducing the probability of choosing invalid actions afterwards. Then, according to the selected action, the centralized controller can determine the next hop router and the next state of the task. The possible situations are listed as follows:
• If the next router is the destination router, then the data transmission task is complete and the state turns into the terminal state. The corresponding reward is 0 in this case.
• If the action causes congestion, then the task is terminated and the reward is r_c.
• Otherwise, the centralized controller updates the state of the task and re-appends it to the end of Z. Moreover, the network returns a reward of −1.
Then, the centralized controller stores the experience tuples in the corresponding experience memory D, and the neural network samples data from D for training. The above procedures are repeated until each task in the queue has selected an action. Finally, the centralized controller sends the action commands of each task to the routers, and the routers send their packets to the next hop routers in accordance with these commands. With such an online algorithm, the neural networks can utilize the latest experiences to improve the performance. The overall algorithm is summarized in Algorithm 1.
Algorithm 1 Source-Destination Multi-Task Deep Q Network (SDMT-DQN)
1: Initialize the task queue Z, the replay memories with capacity C for every source-destination pair D_{1,1}, D_{2,1}, ..., D_{N_s,N_d}, the action-value functions Q with random parameters θ for every source-destination pair θ_{1,1}, θ_{2,1}, ..., θ_{N_s,N_d}, the corresponding target action-value functions Q̂ with parameters θ⁻_{1,1} = θ_{1,1}, ..., θ⁻_{N_s,N_d} = θ_{N_s,N_d}, the buffer size of all the routers, and the network state.
2: for t = 1, 2, ..., T do
3:   The sources generate data tasks and append them to Z.
4:   The controller obtains the information of the newly generated tasks and computes the network state.
5:   for n = 1, ..., N_t (N_t is the number of tasks in Z) do
6:     Pop a task n from Z, and combine the current network state with the position and size of task n to get state s_{t,n}.
7:     Select the neural network with parameters θ_{i,j} based on source router i and destination router j.
8:     Choose a random action â with probability ε; otherwise select â = arg max_a Q(s_{t,n}, a; θ_{i,j}).
9:     if â is invalid then
10:      Store the experience tuple (s_{t,n}, â, r_{t,n}, terminal state) in D_{i,j}.
11:      Re-choose a valid action a_{t,n} with the largest value.
12:    else
13:      a_{t,n} = â.
14:    end if
15:    Simulate the execution of action a_{t,n} in the controller, get reward r_{t,n} and next state s'_{t,n}, then update the network state.
16:    Store the experience tuple (s_{t,n}, a_{t,n}, r_{t,n}, s'_{t,n}) in D_{i,j}.
17:    Sample a random minibatch of experience tuples (s_k, a_k, r_k, s'_k) from D_{i,j}.
18:    Set y_k =
19:      r_k, if the task terminates,
20:      r_k + γ max_{a'} Q̂(s'_k, a'; θ⁻_{i,j}), otherwise.
21:    Perform a gradient descent step with learning rate α on (y_k − Q(s_k, a_k; θ_{i,j}))² with respect to the network parameters θ_{i,j}.
22:    Reset Q̂_{i,j} = Q_{i,j} every N_u steps.
23:  end for
24:  The controller sends N_t commands to all routers, and the routers send packets according to the commands.
25: end for
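Steps 17–22 of Algorithm 1 correspond to a standard DQN update, with one online network, one target network, and one optimizer per source-destination pair. The PyTorch sketch below is our own minimal rendering, not the authors' released code: the two-layer fully connected architecture, the SGD optimizer, and the hyper-parameter values are assumptions, while the 30-input/32-output sizes follow Section V-A and the three pairs correspond to the three source routers of Fig. 1.

```python
import torch
import torch.nn as nn

N_IN, N_OUT = 30, 32              # input/output layer sizes quoted in Section V-A
GAMMA, LR, N_U = 0.9, 1e-3, 100   # discount, learning rate alpha, target reset period (assumed)

def make_net():
    # light-weight fully connected action-value network (hidden width assumed)
    return nn.Sequential(nn.Linear(N_IN, 64), nn.ReLU(), nn.Linear(64, N_OUT))

# one online network Q, target network Q-hat, and optimizer per (source, destination) pair
pairs = [(0, 8), (1, 8), (2, 8)]
q_net    = {p: make_net() for p in pairs}
q_target = {p: make_net() for p in pairs}
optim    = {p: torch.optim.SGD(q_net[p].parameters(), lr=LR) for p in pairs}
for p in pairs:
    q_target[p].load_state_dict(q_net[p].state_dict())

def train_step(p, s, a, r, s_next, terminal, step):
    """One gradient step on (y_k - Q(s_k, a_k; theta_{i,j}))^2 for pair p = (i, j).
    `terminal` is a 0/1 float tensor marking tuples whose task terminates."""
    with torch.no_grad():
        y = r + GAMMA * q_target[p](s_next).max(dim=1).values * (1.0 - terminal)
    q_sa = q_net[p](s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optim[p].zero_grad()
    loss.backward()
    optim[p].step()
    if step % N_U == 0:                          # step 22: reset the target network every N_u steps
        q_target[p].load_state_dict(q_net[p].state_dict())
    return loss.item()

# shape check with a random minibatch of 32 experience tuples
B = 32
train_step((0, 8), torch.randn(B, N_IN), torch.randint(N_OUT, (B,)),
           torch.full((B,), -1.0), torch.randn(B, N_IN), torch.zeros(B), step=1)
```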
2) THE DOMT-DQN ALGORITHM
The DOMT-DQN algorithm can reduce the number of required neural networks. It differs from the SDMT-DQN algorithm mainly in that the data transmission tasks are classified into N_d categories that correspond only to their destination routers. Hence, the corresponding neural network and replay memory depend only on the destination of the task. As the number of categories is reduced, the number of tasks in each category increases. Therefore, there is more training data for each corresponding neural network, which leads to faster convergence.

Note that DOMT-DQN can be obtained by modifying Algorithm 1. Specifically, the replay memories D_{1,1}, D_{2,1}, ..., D_{N_s,N_d} are changed into D_1, D_2, ..., D_{N_d}, and the parameters of the neural networks θ_{1,1}, θ_{2,1}, ..., θ_{N_s,N_d} are substituted with θ_1, θ_2, ..., θ_{N_d}. The remaining procedures are very similar to Algorithm 1, and the overall steps of DOMT-DQN are summarized in Algorithm 2.

Algorithm 2 Destination-Only Multi-Task Deep Q Network (DOMT-DQN)
1: Initialize the whole system, including the buffer size of all the routers, the network state, the task queue Z, the replay memories D_1, D_2, ..., D_{N_d}, the action-value functions with random parameters θ_1, θ_2, ..., θ_{N_d}, and the corresponding target networks θ⁻_1 = θ_1, ..., θ⁻_{N_d} = θ_{N_d}.
2: for t = 1, 2, ..., T do
3:   The sources generate data tasks and append them to Z.
(The remaining steps follow Algorithm 1, with D_{i,j} and θ_{i,j} replaced by D_j and θ_j.)
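The only structural difference introduced by DOMT-DQN is the key under which experiences and network parameters are stored. A small sketch of the two keying schemes (the dictionary layout and capacity value are assumptions):

```python
from collections import deque

CAPACITY = 10_000                        # replay capacity C (assumed value)
sources, destinations = [0, 1, 2], [8]   # the Fig. 1 setting: L0, L1, L2 -> L8

# SDMT-DQN: one replay memory per (source, destination) pair
replay_sdmt = {(i, j): deque(maxlen=CAPACITY) for i in sources for j in destinations}

# DOMT-DQN: one replay memory per destination only
replay_domt = {j: deque(maxlen=CAPACITY) for j in destinations}

def memory_for(src, dst, scheme="domt"):
    """Select the replay memory that a task's experience tuples go to."""
    return replay_sdmt[(src, dst)] if scheme == "sdmt" else replay_domt[dst]
```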
V. SIMULATION RESULTS
In this section, simulations are conducted to evaluate the performance of SDMT-DQN and DOMT-DQN. The probability of randomly choosing an action is set to 0.9. We use Python and the deep learning framework PyTorch for coding, and the program is executed on a computer with an Intel Core i7-8700K CPU, 32 GB of random access memory (RAM), and an Nvidia GTX 1070 GPU. The operating system is Ubuntu 16.04.

We compare the performance of the proposed algorithms with the deep learning based algorithm [27] and the traditional routing protocol OSPF. To better demonstrate the performance comparison, we consider the simple network with the topology depicted in Fig. 1. Each node is deemed a router and each edge is deemed a transmission link. The routers L0, L1, L2 are set as source routers that receive input packets from the data sources and transmit them to the destination router L8. All the routers in the network can receive and send data packets. We assume that no matter how big the data packets are, they can be transferred from one router to another in one time slot. If the network congests in a time slot, we mark it, and then compute the network congestion probability as the proportion of congested time slots in every 1000 time slots. The buffer size of each router is set to 45 MB, and the packet generation process is set as Poisson.
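The congestion-probability metric is simply the fraction of congested slots within each window of 1000 slots; a small helper, with dummy data standing in for the simulator output:

```python
def congestion_probability(congested_flags, window=1000):
    """congested_flags[t] is True if the network congested in time slot t.
    Returns one probability per complete window of `window` slots."""
    probs = []
    for start in range(0, len(congested_flags) - window + 1, window):
        block = congested_flags[start:start + window]
        probs.append(sum(block) / window)
    return probs

flags = [t % 7 == 0 for t in range(3000)]   # placeholder congestion marks for 3000 slots
print(congestion_probability(flags))        # three points of the congestion-probability curve
```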
A. COMPLEXITY ANALYSIS
Based on the definition of the input state in Section II, there are 3 × N + N_s = 30 units in the input layer of the neural network, while the number of units in the output layer is N_a = 32, since the output represents the value of each action. The controller should choose the next hop router for each task in a very short time; therefore, light-weight neural networks ought to be used. The specific neural network architectures for SDMT-DQN and DOMT-DQN are shown in Table 1.

TABLE 1. The neural network architecture.

The number of required neural networks for our algorithms is significantly reduced compared with the DL-based method in [27]. To be specific, N_s × N_d and N_d neural networks are required for SDMT-DQN and DOMT-DQN, respectively. For example, considering the network topology of Fig. 1, SDMT-DQN requires three neural networks while DOMT-DQN only needs one.

In addition, the required number of floating point operations (FLOPs) is used as the metric of computational complexity. For convolutional layers, the number of FLOPs is

\text{FLOPs} = 2 H_{in} W_{in} (C_{in} K^2 + 1) C_{out},    (23)

where H_{in}, W_{in}, and C_{in} are the height, width, and number of channels of the input feature map, K is the kernel size, and C_{out} is the number of output channels.

For fully connected layers, the number of FLOPs is computed as

\text{FLOPs} = (2 N_{in} - 1) N_{out},    (24)

where N_{in} is the number of input neurons and N_{out} is the number of output neurons [36].

The total computational complexity is summarized in Table 2. Compared with the DL-based method, the proposed algorithms have far fewer FLOPs per neural network and far fewer neural networks. Therefore, the total computational complexity of the two proposed algorithms is much lower.

TABLE 2. The total complexity comparison of the three algorithms for the network topology in Fig. 1.
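Formulas (23) and (24) translate into two one-line helpers. The example evaluates a 30 → 64 → 32 fully connected network: the input and output widths follow the figures quoted above, while the 64-unit hidden layer is an assumption (the exact architecture is given in Table 1).

```python
def flops_conv(h_in, w_in, c_in, k, c_out):
    """FLOPs of a convolutional layer, formula (23)."""
    return 2 * h_in * w_in * (c_in * k**2 + 1) * c_out

def flops_fc(n_in, n_out):
    """FLOPs of a fully connected layer, formula (24)."""
    return (2 * n_in - 1) * n_out

# FLOPs of an assumed 30 -> 64 -> 32 fully connected action-value network
print(flops_fc(30, 64) + flops_fc(64, 32))   # 3776 + 4064 = 7840
```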
B. PERFORMANCE COMPARISON
In Fig. 6, we compare the congestion probabilities of SDMT-DQN, DOMT-DQN, the DL based algorithm, and OSPF versus the number of training steps. The discount rate γ is set to 0.9, and the mean of the Poisson data generation process is set to 15 MB per time slot. The congestion probability of OSPF stays at a high level due to its lack of intelligence. In contrast, the congestion probabilities of SDMT-DQN and DOMT-DQN decrease significantly as the number of training steps increases, because the network has learned from past congestion and then generates a policy that reduces the congestion probability. Moreover, both proposed algorithms achieve lower congestion probability than the DL based algorithm [27]. This is because the DL based algorithm can only choose from the pre-defined path combinations, instead of exploring the best possible paths from the instantaneous states. We see that the training process of DOMT-DQN converges faster than that of SDMT-DQN. The reason can be explained as follows: the training data of SDMT-DQN is divided into N_s × N_d categories, while that of DOMT-DQN is divided into only N_d categories. Therefore, at the beginning of the training process, the training data for each neural network in DOMT-DQN is more plentiful than in SDMT-DQN. It is also seen that, as training proceeds, the congestion probability of SDMT-DQN reduces to almost zero, while that of DOMT-DQN is maintained at an acceptably low level, because adopting more neural networks gives SDMT-DQN better learning ability than DOMT-DQN. Besides, since further classifying the data transmission tasks based on the source routers makes the

FIGURE 10. The comparison of different discount rates γ in terms of path length. (a) Path length under different discount rates γ based on SDMT-DQN. (b) Path length under different discount rates γ based on DOMT-DQN.

the task tends to choose the actions which would not cause congestion or be invalid along with the training. As a result, when γ = 1, our algorithms can also reduce the congestion probability, just slightly more slowly. In addition, DOMT-DQN performs better than SDMT-DQN. Specifically, for SDMT-DQN there are a very few paths that are longer than 10 hops, which never happens for DOMT-DQN. This is because, when we use SDMT-DQN, the task from one source router may occasionally choose another source router as the next hop router. Since the probability of this behavior is very low, the training data that guides the network to handle this situation cannot be sufficient. Then, the data packets may be repeatedly transferred between two source routers, and the path length of the corresponding task becomes very long. On the other hand, DOMT-DQN does not differentiate the tasks according to their source routers. Hence, no matter which router the data packet is transferred to, there are always sufficient training samples.

FIGURE 11. A more complicated network topology.

In the last example, we demonstrate the scalability of SDMT-DQN and DOMT-DQN in a more complicated network, as shown in Fig. 11. In Fig. 12, we compare the proposed algorithms with OSPF in terms of congestion probability.² Both proposed algorithms can significantly reduce the congestion probability. Similar to Fig. 6, SDMT-DQN performs better than DOMT-DQN in terms of congestion probability, while DOMT-DQN converges faster than SDMT-DQN. Both SDMT-DQN and DOMT-DQN converge more slowly when applied to the more complicated network, and the corresponding congestion probability after training is slightly increased. This is because, when the number of routers in the network increases,

²The DL based algorithm [27] cannot be implemented under the current computer configuration, since the number of possible path combinations is too large.