Deep Reinforcement Learning for Router Selection in Network With Heavy Traffic
ABSTRACT The rapid development of wireless communications brings a tremendous increase in the number of data streams and poses significant challenges to traditional routing protocols. In this paper, we leverage deep reinforcement learning (DRL) for router selection in networks with heavy traffic, aiming at reducing network congestion and the length of the data transmission path. We first illustrate the challenges of the existing routing protocols when the amount of data explodes. We then utilize the router selection Markov decision process (RSMDP) to formulate the routing problem. Two novel deep Q network (DQN)-based algorithms are designed to reduce the network congestion probability with a short transmission path: one focuses on reducing the congestion probability, while the other focuses on shortening the transmission path. The simulation results demonstrate that the proposed algorithms can achieve higher network throughput compared to existing routing algorithms in heavy network traffic scenarios.
INDEX TERMS Deep reinforcement learning, routing, network congestion, network throughput, deep
Q network.
object detection [15]–[19], communications [20]–[25], as well as many other fields.

DL has also been adopted in routing problems. For example, it can imitate the OSPF protocol [26] to reduce the signaling overhead. However, the algorithm in [26] is essentially an imitation of traditional protocols and is insufficiently intelligent to deal with complicated network states. Following [26], a deep convolutional neural network based routing algorithm has been proposed in [27], which utilizes the neural network to judge the network congestion caused by each path combination. However, building a neural network for each possible path combination results in a large number of neural networks to train, and therefore increases the demand on computing resources.

However, DL generally requires label information for the training data, which demands massive manual effort. In addition, DL is inherently an approximation of a certain function and is not suitable for decision-making problems such as routing, energy allocation, and recommender systems. In this case, deep reinforcement learning (DRL) [28] emerges as an alternative for solving decision-making problems. Compared with traditional reinforcement learning methods¹ [29], DRL takes advantage of the function approximation ability of DL to solve practical problems with large-scale state and action spaces [30]–[32]. For instance, DRL can help energy harvesting devices allocate energy to maximize the sum rate of the communications and predict the battery power accurately [33], or guide two-hop communications to achieve high throughput [34]. Moreover, DRL has been utilized for ranking in an E-commerce search engine to improve the total transaction amount [35].

¹Reinforcement learning (RL) is a learning technique in which an agent learns from interaction with the environment via trial-and-error.

In this paper, we design two DRL-based online routing algorithms to address the network congestion problem. The proposed algorithms can reduce the probability of network congestion and shorten the length of transmission paths, i.e., the number of hops from the source router to the destination. The main contributions of this paper are summarized as follows:
• We leverage the router selection Markov decision process (RSMDP) to formulate the routing problem and define the corresponding state space, action space, reward function, and value function.
• We propose two online routing algorithms, i.e., source-destination multi-task deep Q network (SDMT-DQN) and destination-only multi-task deep Q network (DOMT-DQN), which can learn from past experiences and update routing policies in real time. SDMT-DQN is able to significantly reduce the congestion probability, while the corresponding path length may occasionally be long. In comparison, DOMT-DQN can significantly shorten the path length while maintaining the congestion probability at an acceptably low level.

The rest of the paper is organized as follows. Section II states the routing problem and outlines the system model. In Section III, we introduce the RSMDP in detail and analyze the setting of some parameters. The proposed DRL algorithms are detailed in Section IV. Section V provides the simulation results, while Section VI concludes the paper.

II. PROBLEM STATEMENT AND SYSTEM MODEL
A. PROBLEM STATEMENT
We assume that the network operates in a time-slotted fashion with normalized time slots. Transmitting a data packet from the source router to the destination is regarded as a data transmission task. At each time slot, a task selects the next hop router and the data packet is transferred to it. This process continues until the data packet arrives at the destination. Network congestion happens when the size of the arriving packet exceeds the remaining buffer size of the router.

Traditional routing protocols are formulated as a classical combinatorial optimization problem, where the data packets are transmitted along the shortest path. Under such a shortest-path principle, certain routers may be simultaneously selected by multiple tasks, which very likely leads to network congestion due to the finite buffer size of the routers. For example, as shown in Fig. 1, three packets from L0, L1, L2 are transmitted to the destination L8. Based on the shortest-path principle, L4 would be chosen as the next hop for the packets by the traditional protocols. When the packets are relatively large, the remaining buffer size of L4 will not be sufficient and the network is prone to congestion. Moreover, when the same or a similar situation appears again, traditional routing protocols would fall into congestion again. Even though the network congestion has occurred many times before, the traditional routing protocols would still select the same or a similar routing path. Therefore, it is necessary and important for the routing strategy to learn from past experience and make itself sufficiently intelligent to choose optimal routing paths according to the network states.

B. SYSTEM MODEL
Consider a general backbone network with N routers in the set L = {L_1, L_2, ..., L_N}. Define L_s, L_d, and L_r as the disjoint sets of source routers, destination routers, and regular routers, respectively, with L = L_s ∪ L_d ∪ L_r. Moreover, |L_s| ≜ N_s, |L_d| ≜ N_d, |L_r| ≜ N_r, and N_s + N_d + N_r = N. Let D_{i,t} and B_{i,t} denote the total size of all packets and the remaining buffer size in L_i at time slot t, respectively. Define B_t = [B_{1,t}, ..., B_{N,t}] and D_t = [D_{1,t}, ..., D_{N,t}]. We denote the size of the packet newly generated by data source i at time slot t by V_{i,t} and define V_t = [V_{1,t}, ..., V_{N_s,t}] as the size vector of all input packets. The data generation is set as a Poisson process. The state of the network at time slot t can be characterized by the tuple (V_t, D_t, B_t).
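As a concrete illustration of this state bookkeeping, the short Python sketch below draws the newly generated packet sizes and assembles the tuple (V_t, D_t, B_t) for one slot. It is not the authors' simulator; the variable names are ours, and the 45 MB buffer size and 15 MB mean packet size are simply the values quoted later in Section V.

```python
import numpy as np

rng = np.random.default_rng(0)

N, N_s = 9, 3                 # illustrative topology size: 9 routers, 3 of them sources
BUFFER_CAP = 45.0             # buffer size per router in MB (value from Section V)
POISSON_MEAN = 15.0           # mean generated packet size per source per slot in MB (Section V)

D = np.zeros(N)               # D_t: total size of the packets currently held by each router
B = np.full(N, BUFFER_CAP)    # B_t: remaining buffer size of each router

def network_state():
    """Return the network-state tuple (V_t, D_t, B_t) for the current slot."""
    V = rng.poisson(POISSON_MEAN, size=N_s).astype(float)   # Poisson packet generation
    return V, D.copy(), B.copy()

V_t, D_t, B_t = network_state()
print(V_t, D_t, B_t)
```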
During time slot t, the input packets are generated by the data sources, and then flow to the source routers and change the remaining buffer size of the source routers. We assume that a packet can be completely transferred from one router to another in one time slot, and the values of D_{i,t} and B_{i,t} change during the transmission process. For instance, if a data packet of size f flows from L_i to L_j at time slot t, then at time slot t + 1 the tuple (D_{i,t+1}, D_{j,t+1}, B_{i,t+1}, B_{j,t+1}) has six possible cases, as shown in (1). When L_i or L_j is the source router, the newly generated data should be considered, and if L_j is the destination router, the data are transferred to the terminals directly without being stored in the buffer.

Note that the current location and the size of the data packet also affect the selection of the next hop router. We therefore adopt a modified one-hot encoding vector O_t of size N to represent these characteristics. When the packet is in router L_i, the i-th element of O_t is the size of the data packet, while all other elements are zero. Such a modified one-hot encoding helps the computer understand the size and position of the packet. Overall, we can denote the state of each task by S_t = (V_t, D_t, B_t; O_t).
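The modified one-hot encoding is easy to state in code. The sketch below is only illustrative (the helper names and the flattening of S_t into a single input vector are our assumptions), but it reproduces the 3N + N_s = 30 input size quoted later in Section V-A.

```python
import numpy as np

def modified_one_hot(position, packet_size, num_routers):
    """O_t: zero everywhere except at the router holding the packet,
    where the entry stores the packet size."""
    o = np.zeros(num_routers)
    o[position] = packet_size
    return o

def task_state(V, D, B, position, packet_size):
    """S_t = (V_t, D_t, B_t; O_t), flattened into one network input vector."""
    O = modified_one_hot(position, packet_size, num_routers=len(D))
    return np.concatenate([V, D, B, O])

# Example: a 12 MB packet currently sitting in router L4 of a 9-router network
s_t = task_state(V=np.array([15.0, 10.0, 20.0]), D=np.zeros(9),
                 B=np.full(9, 45.0), position=4, packet_size=12.0)
print(s_t.shape)   # (30,) = 3*N + N_s input units
```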
Moreover, the network can be represented by a directed graph G = {V, E}, where V is the set of all vertices corresponding to the routers and E is the set of edges corresponding to the links between the routers. The data transmission task chooses an action according to the network state along with the position and size of the packet, where an action is defined as the link between the current router and the next hop router. For instance, if the task whose packet is in L_i selects L_j as the next router, this means that link(i, j) ∈ E is selected as the action. Besides, the link between two routers is bidirectional, i.e., a data packet can be transferred from L_i to L_j or conversely, denoted by link(i, j) and link(j, i), respectively. Let A denote the set of all possible actions, i.e., A = E, with cardinality |A| = N_a. Note that not all actions are valid for a data transmission task, since the packet can only be transferred to a router connected to its current position. Namely, the task can only choose a link starting from the current position of its packet as a valid action. Therefore, during the transmission process, the valid actions of the task keep changing according to its current position.
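The valid-action rule only needs the adjacency of the topology graph. The sketch below stores every undirected link as the two directed actions link(i, j) and link(j, i) and filters those that start at the packet's current router; the edge list itself is a made-up placeholder, not the Fig. 1 topology.

```python
# Each undirected link of G = {V, E} is stored as two directed actions.
# The pairs below are illustrative placeholders, not the Fig. 1 topology.
EDGES = [(0, 3), (3, 0), (1, 4), (4, 1), (2, 4), (4, 2), (3, 8), (8, 3), (4, 8), (8, 4)]
ACTIONS = {a: link for a, link in enumerate(EDGES)}   # action index -> link(i, j)

def valid_actions(current_router):
    """Only links that start at the packet's current position are valid."""
    return [a for a, (i, _) in ACTIONS.items() if i == current_router]

print(valid_actions(4))   # the actions leaving L4
```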
III. ROUTER SELECTION MARKOV DECISION PROCESS
In this section, we formulate the routing process as a Markov decision process (MDP), where the agent is the data transmission task and the environment is the network.

A. DEFINITION OF RSMDP
In the considered scenario, the tasks decide the next hop routers, and the corresponding decision-making process can be modeled as an MDP with rewards and actions. The MDP is represented by a tuple (S, A, P, R, γ), where
• The state space is denoted by S, which consists of the terminal state and the nonterminal states. The terminal state is a special state indicating that the task terminates. If the action is invalid or causes network congestion, the state turns into the terminal state. Besides, if the data packet arrives at the destination router, the task also terminates. The nonterminal states contain all continuing events, where the packets are transferred to the next hop routers without congestion and have not reached the destination.
• The action space is denoted by A, which corresponds to all the edges of the network topology graph. The actions are divided into valid and invalid parts, depending on the current location of the packet.
• The state transition probability function is denoted by P(s, a, s') = P[S_{t+1} = s' | S_t = s, A_t = a]. In the considered scenario, the state transition probability function is related to the probability distribution of the sizes of the packets newly generated by the data sources, because the vector of newly generated packet sizes V_t in the state tuple is random.
• The immediate reward on the transition from state s to s' under action a is denoted by R(s, a, s').
• The discount rate is denoted by γ ∈ [0, 1), which determines the present value of future rewards [29].

FIGURE 2. Router selection Markov decision process.

As Fig. 2 shows, at each time slot, the task selects the next hop router based on its current state, and the corresponding reward is obtained. This decision-making and reward feedback process is repeated, and we name it the RSMDP.

An MDP should satisfy the Markov property, which means that the future state is independent of the past states given the present state. Mathematically, the Markov property for the MDP is defined as follows:

P(s_{t+1} \mid s_0, a_0, s_1, \cdots, s_t, a_t) = P(s_{t+1} \mid s_t, a_t).    (2)

From (1), it is obvious that the next state is only related to the current state and the current action. Hence, the router selection process satisfies the Markov property.

B. REWARD FUNCTION
For any state s ∈ S, R(s, a, s') is the immediate reward that numerically characterizes the performance of action a taken with the state transiting from s to s'. For the problem defined in Section II-A, avoiding network congestion is the prerequisite of seeking the shortest path. Thus, the reward should first punish network congestion and then minimize the path length. As described in Section II, since each task can only choose an edge that starts from the router where the packet currently stays, the reward function is also supposed to punish invalid actions. Moreover, the reward function needs to account for the path length of the task. In summary, we set the reward function R(s, a, s') as follows:

R(s, a, s') = \begin{cases} r_c, & \text{if network congestion occurs,} \\ r_e, & \text{if } a \text{ is invalid,} \\ 0, & \text{if the packet arrives at the destination,} \\ -1, & \text{otherwise,} \end{cases}    (3)

where the reward −1 helps record the number of hops the data packet is transferred in the network. The constant r_c is the congestion reward, which takes a negative value smaller than −1 since network congestion should be avoided, while the constant r_e is the error reward for choosing an invalid action, which is also a negative value smaller than −1. The network feeds back a non-negative reward only when the packet arrives at the destination router. As a result, to avoid network congestion and invalid actions and to reduce the path length of each data transmission task, the objective of the routing algorithm is to find the optimal policy that maximizes the expected cumulative reward for each task. The details are described in the next subsection.
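The reward in (3) translates directly into code. The sketch below uses placeholder values r_c = r_e = −10 only to satisfy the "smaller than −1" requirement; the actual bound they must obey is derived in Section III-D.

```python
R_C = -10.0   # congestion reward r_c, assumed value < -1
R_E = -10.0   # error reward r_e for invalid actions, assumed value < -1

def reward(congested, invalid, at_destination):
    """Immediate reward R(s, a, s') from (3)."""
    if congested:
        return R_C
    if invalid:
        return R_E
    if at_destination:
        return 0.0
    return -1.0   # one more hop was taken
```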
C. VALUE FUNCTION
From (3), the reward at time slot t can be denoted by R_t = R(s_t, a_t, s_{t+1}). Assume the task turns into the terminal state after T time slots. Then, the cumulative discounted reward from time slot t can be expressed as

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_{t+T} = \sum_{k=1}^{T} \gamma^{k-1} R_{t+k}.    (4)

Define policy π as a probability distribution over action a, given state s, as

\pi(a \mid s) = P[A_t = a \mid S_t = s].    (5)

In the considered problem, policy π determines which router should be chosen as the next hop router conditioned on the current state of the transmission task.

Define

Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]    (6)

as the action-value function of the MDP based on policy π, i.e., the expectation of the cumulative discounted reward starting from s, taking action a, and following policy π.

The objective of the routing algorithm is to find a policy that maximizes the action-value function, i.e.,

Q^*(s, a) = \max_{\pi} Q_\pi(s, a).    (7)

The optimal policy can be found by maximizing over the optimal action-value function Q*(s, a) as

\pi^*(a \mid s) = \begin{cases} 1, & \text{if } a = \arg\max_{a \in A} Q^*(s, a), \\ 0, & \text{otherwise.} \end{cases}    (8)

From (8), if the optimal action-value function Q*(s, a) can be obtained, we can input (V_t, D_t, B_t; O_t) to compute the value of each action, and then choose the action that maximizes Q*(s, a). As mentioned in Section III-B, the optimal policy obtained from (8) can reduce the path length while avoiding network congestion.

One possible way to obtain the optimal action-value function Q*(s, a) is Q-learning, which can be iteratively implemented as

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]    (9)

during the training process, where α is the learning rate. Iteration (9) updates the estimated values of states based on the values of successor states, which is called bootstrapping. In this case, the learned action-value function converges to the optimal action-value function Q* [29].

To obtain the value of every action, the reinforcement learning algorithm must try every possible action. However, if the task only chooses the action that maximizes Q(s, a) during training, then actions that have not been tried before will barely be chosen, which makes the action-value function fall into a local optimum. Therefore, the algorithm should not only exploit the actions that have been tried before, but also explore new actions. Hence, the ε-greedy method is usually applied as

a = \begin{cases} \arg\max_{a} Q(s, a), & \text{with probability } 1 - \epsilon, \\ \text{random action}, & \text{with probability } \epsilon, \end{cases}    (10)

where ε is the probability of randomly choosing actions.
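For reference, a minimal tabular sketch that combines the update (9) with the ε-greedy rule (10). The hyper-parameter values and the dictionary-based Q table are our own illustrative choices; the algorithms in Section IV replace this table with neural networks.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative hyper-parameters
Q = defaultdict(float)                  # Q[(state, action)], zero-initialized

def epsilon_greedy(state, actions):
    """Rule (10): explore with probability epsilon, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, next_actions):
    """One application of the Q-learning iteration (9)."""
    bootstrap = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += ALPHA * (r + GAMMA * bootstrap - Q[(s, a)])
```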
D. DISCOUNT RATE
In this subsection, we consider the influence of the discount rate γ in the RSMDP. From (4), we know that the cumulative discounted reward leads to a ''myopic'' or ''far-sighted'' evaluation when γ is close to 0 or 1, respectively. Specifically, when γ is close to 0, the future rewards are hardly considered, while when γ is close to 1, the future rewards are taken into account with heavier weight. The value of the discount rate γ affects the DRL-based routing algorithm mainly in two aspects:
• How does the objective balance the congestion reward r_c, the error reward r_e, and the remaining cumulative reward?
• What is the relationship between the cumulative reward and the number of hops the packet takes to arrive at its destination?

1) REWARDS OF DIFFERENT TYPES
In the RSMDP, there are three situations that can terminate a task: (i) the packet has reached its destination; (ii) the transmission of the packet results in congestion at the next hop router; (iii) the action chosen by the task is invalid for transmission. The latter two situations should be averted, which is the prerequisite before shortening the length of transmission paths. Therefore, we should guarantee that the congestion reward and the error reward are smaller than the cumulative reward starting from the current state. According to the reason for the termination of the task, there are three cases of the cumulative reward:
• The task reaches the destination router at time slot T. In this case, R_t = −1 for t = 1, ..., T. Then the cumulative reward for the whole transmission process of the task equals

G_t = \sum_{t=1}^{T} \gamma^{t-1} R_t = -\sum_{t=1}^{T} \gamma^{t-1} = -\frac{1 - \gamma^{T}}{1 - \gamma}.    (11)

• The task chooses an action that leads to network congestion at time slot T. In this case, R_T = r_c, while R_t = −1 for t = 1, ..., T − 1. Then the cumulative reward for the whole transmission process of the task equals

G_t = \sum_{t=1}^{T} \gamma^{t-1} R_t = -\sum_{t=1}^{T-1} \gamma^{t-1} + \gamma^{T-1} r_c = -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_c.    (12)

• The task chooses an invalid action at time slot T. In this case, R_T = r_e, while R_t = −1 for t = 1, ..., T − 1. Then the cumulative reward for the whole transmission process of the task equals

G_t = \sum_{t=1}^{T} \gamma^{t-1} R_t = -\sum_{t=1}^{T-1} \gamma^{t-1} + \gamma^{T-1} r_e = -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_e.    (13)

Then, we should set r_c and r_e such that

r_c, r_e < \min\left\{ -\frac{1 - \gamma^{T}}{1 - \gamma},\; -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_c,\; -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_e \right\}.    (14)

As we mentioned in Section III-B, both r_c and r_e are less than −1; therefore,

-\frac{1 - \gamma^{T}}{1 - \gamma} > -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_c,    (15)

and

-\frac{1 - \gamma^{T}}{1 - \gamma} > -\frac{1 - \gamma^{T-1}}{1 - \gamma} + \gamma^{T-1} r_e.    (16)
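The three cumulative rewards (11)–(13) and the condition (14) can be checked numerically for candidate settings; the horizon T and the candidate r_c, r_e below are assumed values used only for illustration.

```python
GAMMA = 0.9

def g_destination(T):                      # cumulative reward (11)
    return -(1 - GAMMA**T) / (1 - GAMMA)

def g_congestion(T, r_c):                  # cumulative reward (12)
    return -(1 - GAMMA**(T - 1)) / (1 - GAMMA) + GAMMA**(T - 1) * r_c

def g_invalid(T, r_e):                     # cumulative reward (13)
    return -(1 - GAMMA**(T - 1)) / (1 - GAMMA) + GAMMA**(T - 1) * r_e

def satisfies_condition_14(r_c, r_e, T):
    """Check the bound (14) for a candidate pair (r_c, r_e) and horizon T."""
    bound = min(g_destination(T), g_congestion(T, r_c), g_invalid(T, r_e))
    return r_c < bound and r_e < bound

print(satisfies_condition_14(r_c=-50.0, r_e=-50.0, T=8))   # assumed values -> True here
```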
B. THE PROPOSED SDMT-DQN AND DOMT-DQN ALGORITHMS
Originally, DQN is designed for a single agent and therefore cannot help multiple tasks choose their next hop routers. To tackle this issue, we assume there is a centralized controller with sufficient computation ability that can collect information about the input packets and instruct the routers to send the data packets to the next hop routers.

We are interested in a distributed solution to find the routing policies of the tasks. Even if the data packets of different tasks are currently in the same router, the tasks may choose different actions due to their different goals. Hence, the centralized controller needs multiple neural networks to instruct every router to deliver the packets properly. Furthermore, we should categorize the tasks into different classes and apply one uniform neural network to each class. In this paper, we adopt two criteria to classify the tasks, which yields two different algorithms, respectively.

1) THE SDMT-DQN ALGORITHM
In the SDMT-DQN algorithm, we classify all the data transmission tasks into N_s × N_d categories based on their source routers and destination routers. Specifically, all tasks with the same source router and the same destination router are considered the same type of task and share the same neural network. As a result, N_s × N_d neural networks are needed to represent the action-value functions of all kinds of tasks. For the tasks from source router i to destination router j, there is a corresponding replay memory D_{i,j} with capacity C to store the experience tuples for training. Moreover, there is a target network to reduce the correlation among the input data. The algorithm is illustrated in Fig. 5.

At the centralized controller, we set a task queue Z to store the information of the tasks, e.g., the source and destination router of the task, the packet size, and the current position. The centralized controller selects the tasks in Z one by one. Then the neural network corresponding to the source router and destination router of the selected task takes the state of the selected task as input and outputs the value of each action. Afterwards, the centralized controller chooses an action for the data packet based on the ε-greedy method. If the selected action is invalid, then the centralized controller (i) regards the task as terminated and stores the corresponding state, action, reward r_e, and terminal state in the corresponding experience memory, and (ii) re-chooses the action whose value is the largest among the valid actions and continues the transmission. Therefore, an invalid action leads to two experience tuples. This procedure guarantees the validity of the selected action while storing the invalid action with r_e in the memory, thereby reducing the probability of choosing invalid actions afterwards. Then, according to the selected action, the centralized controller can determine the next hop router and the next state of the task. The possible situations are listed as follows:
• If the next router is the destination router, then the data transmission task is complete and the state turns into the terminal state. The corresponding reward is 0 in this case.
• If the action causes congestion, then the task is terminated and the reward is r_c.
• Otherwise, the centralized controller updates the state of the task and re-appends it to the end of Z. Moreover, the network returns a reward of −1.
Then, the centralized controller stores the experience tuples in the corresponding experience memory D, and the neural network samples data from D for training. The above procedures are repeated until each task in the queue has selected an action. Finally, the centralized controller sends the action commands of each task to the routers, and the routers send their packets to the next hop routers in accordance with these commands. With such an online algorithm, the neural networks can utilize the latest experiences to improve the performance. The overall algorithm is summarized in Algorithm 1.
Algorithm 1 Source-Destination Multi-Task Deep Q Network (SDMT-DQN)
1: Initialize the task queue Z, the replay memories with capacity C for every source-destination pair D_{1,1}, D_{2,1}, ..., D_{N_s,N_d}, the action-value functions Q with random parameters θ for every source-destination pair θ_{1,1}, θ_{2,1}, ..., θ_{N_s,N_d}, the corresponding target action-value functions Q̂ with parameters θ⁻_{1,1} = θ_{1,1}, ..., θ⁻_{N_s,N_d} = θ_{N_s,N_d}, the buffer size of all the routers, and the network state.
2: for t = 1, 2, ..., T do
3:   The sources generate data tasks and append them to Z.
4:   The controller obtains the information of the newly generated tasks and computes the network state.
5:   for n = 1, ..., N_t (N_t is the number of tasks in Z) do
6:     Pop a task n from Z, and combine the current network state with the position and size of task n to get state s_{t,n}.
7:     Select the neural network with parameters θ_{i,j} based on source router i and destination router j.
8:     Choose a random action â with probability ε; otherwise select â = arg max_a Q(s_{t,n}, a; θ_{i,j}).
9:     if â is invalid then
10:      Store the experience tuple (s_{t,n}, â, r_{t,n}, terminal state) in D_{i,j}.
11:      Re-choose a valid action a_{t,n} with the largest value.
12:    else
13:      a_{t,n} = â.
14:    end if
15:    Simulate the execution of action a_{t,n} in the controller, get reward r_{t,n} and next state s'_{t,n}, then update the network state.
16:    Store the experience tuple (s_{t,n}, a_{t,n}, r_{t,n}, s'_{t,n}) in D_{i,j}.
17:    Sample a random minibatch of experience tuples (s_k, a_k, r_k, s'_k) from D_{i,j}.
18:    Set y_k =
19:      r_k, if the task terminates,
20:      r_k + γ max_{a'} Q̂(s'_k, a'; θ⁻_{i,j}), otherwise.
21:    Perform a gradient descent step with learning rate α on (y_k − Q(s_k, a_k; θ_{i,j}))² with respect to the network parameters θ_{i,j}.
22:    Reset Q̂_{i,j} = Q_{i,j} every N_u steps.
23:  end for
24:  The controller sends N_t commands to all routers, and the routers send packets according to the commands.
25: end for
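Steps 17–22 of Algorithm 1 correspond to a standard DQN update, with one online network, one target network, and one optimizer per source-destination pair. The PyTorch sketch below is our own minimal rendering, not the authors' released code: the two-layer fully connected architecture, the SGD optimizer, and the hyper-parameter values are assumptions, while the 30-input/32-output sizes follow Section V-A and the three pairs correspond to the three source routers of Fig. 1.

```python
import torch
import torch.nn as nn

N_IN, N_OUT = 30, 32              # input/output layer sizes quoted in Section V-A
GAMMA, LR, N_U = 0.9, 1e-3, 100   # discount, learning rate alpha, target reset period (assumed)

def make_net():
    # light-weight fully connected action-value network (hidden width assumed)
    return nn.Sequential(nn.Linear(N_IN, 64), nn.ReLU(), nn.Linear(64, N_OUT))

# one online network Q, target network Q-hat, and optimizer per (source, destination) pair
pairs = [(0, 8), (1, 8), (2, 8)]
q_net    = {p: make_net() for p in pairs}
q_target = {p: make_net() for p in pairs}
optim    = {p: torch.optim.SGD(q_net[p].parameters(), lr=LR) for p in pairs}
for p in pairs:
    q_target[p].load_state_dict(q_net[p].state_dict())

def train_step(p, s, a, r, s_next, terminal, step):
    """One gradient step on (y_k - Q(s_k, a_k; theta_{i,j}))^2 for pair p = (i, j).
    `terminal` is a 0/1 float tensor marking tuples whose task terminates."""
    with torch.no_grad():
        y = r + GAMMA * q_target[p](s_next).max(dim=1).values * (1.0 - terminal)
    q_sa = q_net[p](s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optim[p].zero_grad()
    loss.backward()
    optim[p].step()
    if step % N_U == 0:                          # step 22: reset the target network every N_u steps
        q_target[p].load_state_dict(q_net[p].state_dict())
    return loss.item()

# shape check with a random minibatch of 32 experience tuples
B = 32
train_step((0, 8), torch.randn(B, N_IN), torch.randint(N_OUT, (B,)),
           torch.full((B,), -1.0), torch.randn(B, N_IN), torch.zeros(B), step=1)
```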
2) THE DOMT-DQN ALGORITHM
The DOMT-DQN algorithm can reduce the number of required neural networks. It differs from the SDMT-DQN algorithm mainly in that the data transmission tasks are classified into N_d categories that correspond only to their destination routers. Hence, the corresponding neural network and replay memory depend only on the destination of the task. As the number of categories is reduced, the number of tasks in each category increases. Therefore, there is more training data for each corresponding neural network, which leads to faster convergence.

Note that DOMT-DQN can be obtained by modifying Algorithm 1. Specifically, the replay memories D_{1,1}, D_{2,1}, ..., D_{N_s,N_d} are changed into D_1, D_2, ..., D_{N_d}, and the parameters of the neural networks θ_{1,1}, θ_{2,1}, ..., θ_{N_s,N_d} are substituted with θ_1, θ_2, ..., θ_{N_d}. The remaining procedures are very similar to Algorithm 1, and the overall steps of DOMT-DQN are summarized in Algorithm 2.

Algorithm 2 Destination-Only Multi-Task Deep Q Network (DOMT-DQN)
1: Initialize the whole system, including the buffer size of all the routers, the network state, the task queue Z, the replay memories D_1, D_2, ..., D_{N_d}, the action-value functions with random parameters θ_1, θ_2, ..., θ_{N_d}, and the corresponding target networks θ⁻_1 = θ_1, ..., θ⁻_{N_d} = θ_{N_d}.
2: for t = 1, 2, ..., T do
3:   The sources generate data tasks and append them to Z.
(The remaining steps follow Algorithm 1, with D_{i,j} and θ_{i,j} replaced by D_j and θ_j.)
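The only structural difference introduced by DOMT-DQN is the key under which experiences and network parameters are stored. A small sketch of the two keying schemes (the dictionary layout and capacity value are assumptions):

```python
from collections import deque

CAPACITY = 10_000                        # replay capacity C (assumed value)
sources, destinations = [0, 1, 2], [8]   # the Fig. 1 setting: L0, L1, L2 -> L8

# SDMT-DQN: one replay memory per (source, destination) pair
replay_sdmt = {(i, j): deque(maxlen=CAPACITY) for i in sources for j in destinations}

# DOMT-DQN: one replay memory per destination only
replay_domt = {j: deque(maxlen=CAPACITY) for j in destinations}

def memory_for(src, dst, scheme="domt"):
    """Select the replay memory that a task's experience tuples go to."""
    return replay_sdmt[(src, dst)] if scheme == "sdmt" else replay_domt[dst]
```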
V. SIMULATION RESULTS
In this section, simulations are conducted to evaluate the performance of SDMT-DQN and DOMT-DQN. The probability of randomly choosing an action is set to 0.9. We use Python and the deep learning framework PyTorch for coding, and the program is executed on a computer with an Intel Core i7-8700K CPU, 32 GB of random access memory (RAM), and an Nvidia GTX 1070 GPU. The operating system is Ubuntu 16.04.

We compare the performance of the proposed algorithms with the deep learning based algorithm [27] and the traditional routing protocol OSPF. To better demonstrate the performance comparison, we consider the simple network with the topology depicted in Fig. 1. Each node is deemed a router and each edge is deemed a transmission link. The routers L0, L1, L2 are set as source routers that receive input packets from the data sources and transmit them to the destination router L8. All the routers in the network can receive and send data packets. We assume that no matter how big the data packets are, they can be transferred from one router to another in one time slot. If the network congests in a time slot, we mark it, and then compute the network congestion probability as the proportion of congested time slots in every 1000 time slots. The buffer size of each router is set to 45 MB, and the packet generation process is set as Poisson.
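The congestion-probability metric is simply the fraction of congested slots within each window of 1000 slots; a small helper, with dummy data standing in for the simulator output:

```python
def congestion_probability(congested_flags, window=1000):
    """congested_flags[t] is True if the network congested in time slot t.
    Returns one probability per complete window of `window` slots."""
    probs = []
    for start in range(0, len(congested_flags) - window + 1, window):
        block = congested_flags[start:start + window]
        probs.append(sum(block) / window)
    return probs

flags = [t % 7 == 0 for t in range(3000)]   # placeholder congestion marks for 3000 slots
print(congestion_probability(flags))        # three points of the congestion-probability curve
```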
A. COMPLEXITY ANALYSIS
Based on the definition of the input state in Section II, there are 3 × N + N_s = 30 units in the input layer of the neural network, while the number of units in the output layer is N_a = 32, since the output represents the value of each action. The controller should choose the next hop router for each task in a very short time; therefore, light-weight neural networks ought to be used. The specific neural network architectures for SDMT-DQN and DOMT-DQN are shown in Table 1.

TABLE 1. The neural network architecture.

The number of required neural networks for our algorithms is significantly reduced compared with the DL-based method in [27]. To be specific, N_s × N_d and N_d neural networks are required for SDMT-DQN and DOMT-DQN, respectively. For example, considering the network topology of Fig. 1, SDMT-DQN requires three neural networks while DOMT-DQN only needs one.

In addition, the required number of floating point operations (FLOPs) is used as the metric of computational complexity. For convolutional layers, the number of FLOPs is

\text{FLOPs} = 2 H_{in} W_{in} (C_{in} K^2 + 1) C_{out},    (23)

where H_{in}, W_{in}, and C_{in} are the height, width, and number of channels of the input feature map, K is the kernel size, and C_{out} is the number of output channels.

For fully connected layers, the number of FLOPs is computed as

\text{FLOPs} = (2 N_{in} - 1) N_{out},    (24)

where N_{in} is the number of input neurons and N_{out} is the number of output neurons [36].

The total computational complexity is summarized in Table 2. Compared with the DL-based method, the proposed algorithms have far fewer FLOPs per neural network and far fewer neural networks. Therefore, the total computational complexity of the two proposed algorithms is much lower.

TABLE 2. The total complexity comparison of the three algorithms for the network topology in Fig. 1.
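Formulas (23) and (24) translate into two one-line helpers. The example evaluates a 30 → 64 → 32 fully connected network: the input and output widths follow the figures quoted above, while the 64-unit hidden layer is an assumption (the exact architecture is given in Table 1).

```python
def flops_conv(h_in, w_in, c_in, k, c_out):
    """FLOPs of a convolutional layer, formula (23)."""
    return 2 * h_in * w_in * (c_in * k**2 + 1) * c_out

def flops_fc(n_in, n_out):
    """FLOPs of a fully connected layer, formula (24)."""
    return (2 * n_in - 1) * n_out

# FLOPs of an assumed 30 -> 64 -> 32 fully connected action-value network
print(flops_fc(30, 64) + flops_fc(64, 32))   # 3776 + 4064 = 7840
```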
B. PERFORMANCE COMPARISON
In Fig. 6, we compare the congestion probabilities of SDMT-DQN, DOMT-DQN, the DL based algorithm, and OSPF versus the number of training steps. The discount rate γ is set to 0.9, and the mean of the Poisson data generation process is set to 15 MB per time slot. The congestion probability of OSPF stays at a high level due to its lack of intelligence. In contrast, the congestion probabilities of SDMT-DQN and DOMT-DQN decrease significantly as the number of training steps increases, because the network has learned from past congestion and then generates a policy that reduces the congestion probability. Moreover, both proposed algorithms achieve lower congestion probability than the DL based algorithm [27]. This is because the DL based algorithm can only choose from the pre-defined path combinations, instead of exploring the best possible paths from the instantaneous states. We see that the training process of DOMT-DQN converges faster than that of SDMT-DQN. The reason can be explained as follows: the training data of SDMT-DQN is divided into N_s × N_d categories, while that of DOMT-DQN is divided into only N_d categories. Therefore, at the beginning of the training process, the training data for each neural network in DOMT-DQN is more plentiful than in SDMT-DQN. It is also seen that, as training proceeds, the congestion probability of SDMT-DQN reduces to almost zero, while that of DOMT-DQN is maintained at an acceptably low level, because adopting more neural networks gives SDMT-DQN better learning ability than DOMT-DQN. Besides, since further classifying the data transmission tasks based on the source routers makes the

FIGURE 10. The comparison of different discount rates γ in terms of path length. (a) Path length under different discount rates γ based on SDMT-DQN. (b) Path length under different discount rates γ based on DOMT-DQN.

the task tends to choose the actions which would not cause congestion or be invalid along with the training. As a result, when γ = 1, our algorithms can also reduce the congestion probability, just slightly more slowly. In addition, DOMT-DQN performs better than SDMT-DQN. Specifically, for SDMT-DQN there are a very few paths that are longer than 10 hops, which never happens for DOMT-DQN. This is because, when we use SDMT-DQN, the task from one source router may occasionally choose another source router as the next hop router. Since the probability of this behavior is very low, the training data that guides the network to handle this situation cannot be sufficient. Then, the data packets may be repeatedly transferred between two source routers, and the path length of the corresponding task becomes very long. On the other hand, DOMT-DQN does not differentiate the tasks according to their source routers. Hence, no matter which router the data packet is transferred to, there are always sufficient training samples.

FIGURE 11. A more complicated network topology.

In the last example, we demonstrate the scalability of SDMT-DQN and DOMT-DQN in a more complicated network, as shown in Fig. 11. In Fig. 12, we compare the proposed algorithms with OSPF in terms of congestion probability.² Both proposed algorithms can significantly reduce the congestion probability. Similar to Fig. 6, SDMT-DQN performs better than DOMT-DQN in terms of congestion probability, while DOMT-DQN converges faster than SDMT-DQN. Both SDMT-DQN and DOMT-DQN converge more slowly when applied to the more complicated network, and the corresponding congestion probability after training is slightly increased. This is because, when the number of routers in the network increases,

²The DL based algorithm [27] cannot be implemented under the current computer configuration, since the number of possible path combinations is too large.