Article
A Routing Optimization Method for Software-Defined Optical
Transport Networks Based on Ensembles and
Reinforcement Learning
Junyan Chen 1,2, Wei Xiao 1, Xinmei Li 1, Yang Zheng 3,*, Xuefeng Huang 1, Danli Huang 1 and Min Wang 1
1 School of Computer Science and Information Security, Guilin University of Electronic Technology,
Guilin 541004, China
2 School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
3 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
* Correspondence: [email protected]
Abstract: Optical transport networks (OTNs) are widely used in backbone- and metro-area transmission networks to increase network transmission capacity. In the OTN, it is particularly crucial to rationally allocate routes and maximize network capacities. By employing deep reinforcement learning (DRL)- and software-defined networking (SDN)-based solutions, the capacity of optical networks can be effectively increased. However, because most DRL-based routing optimization methods have low sample usage and difficulty in coping with sudden network connectivity changes, converging in software-defined OTN scenarios is challenging. Additionally, the generalization ability of these methods is weak. This paper proposes an ensembles- and message-passing neural-network-based Deep Q-Network (EMDQN) method for optical network routing optimization to address this problem. To effectively explore the environment and improve agent performance, the multiple EMDQN agents select actions based on the highest upper-confidence bounds. Furthermore, the EMDQN agent captures the network's spatial feature information using a message passing neural network (MPNN)-based DRL policy network, which enables the DRL agent to have generalization capability. The experimental results show that the EMDQN algorithm proposed in this paper performs better in terms of convergence. EMDQN effectively improves the throughput rate and link utilization of optical networks and has better generalization capabilities.
Keywords: optical transport network; software-defined networking; deep Q-network; message-passing neural network; ensemble learning
2. Related Research
Traditional routing optimization schemes are usually based on the OSPF (open shortest
path first) [7] or ECMP (equal-cost multipath routing) [8]. The OSPF protocol routes all
flow requests individually to the shortest path. The ECMP protocol increases transmission
bandwidth using multiple links simultaneously. However, these approaches, based on
fixed forwarding rules, are prone to link congestion and cannot meet the demand of
exponential traffic growth. Recently, most heuristic algorithm-based approaches have been
built under the architecture of the SDN. The authors in [9] proposed a heuristic ant-colony-
based dynamic layout algorithm for SDNs with multiple controllers, which can effectively
reduce controller-to-switch and controller-to-controller communication delays caused by
link failures. The authors in [10] applied a random-based heuristic method called the
alienated ant algorithm, which forces ants to spread out across all available paths while
searching for food rather than converging on a single path. The authors in [11] analytically
extracted historical user data through a semi-supervised clustering algorithm for efficient
data classification, analysis, and feature extraction. Subsequently, they used a supervised
classification algorithm to predict the flow of service demand. The authors in [12] proposed
a heuristic algorithm-based solution for DWDM-based OTN network planning. The authors
in [13] proposed a least-cost tree heuristic algorithm to solve the OTN path-sharing and load-
balancing problem. However, because heuristic algorithms do not learn from historical data,
they can only build models for specific problems. When the network changes, it is difficult to
re-determine the model parameters, and their limited scalability makes it hard to guarantee
service quality. Furthermore, because of their tremendous computational effort and high
computational complexity, heuristic algorithms do not perform well on OTN networks.
With SDN’s maturity and large-scale commercialization, the SD-OTN based on the
SDN is becoming increasingly popular in the industry. SD-OTN adapts the reconfigurable
optical add-drop multiplexer (ROADM) nodes through the southbound interface protocol
and establishes a unified resource and service model. The SD-OTN controller can realize
topology and network status data collection, routing policy distribution, and network
monitoring. Therefore, many researchers deploy artificial intelligence algorithms in the
controller. Deep learning, with its powerful learning algorithms and excellent performance
advantages, has gradually been applied to the SDN. To solve the SDN load-balancing
problem, Chen et al. [14] used the long short-term memory (LSTM) to predict the network
traffic in the SDN application plane. The authors in [15] proposed a weighted Markov
prediction model based on mobile user classification to optimize network resources and
reduce network congestion. The authors in [16] proposed an intrusion detection system
based on SDN and deep learning, reducing the burden of security configuration files on
network devices. However, deep learning requires large amounts of training data and has poor
generalization ability because it cannot interact with the environment. These factors
make it difficult to optimize the performance of dynamic networks. Compared with deep
learning, reinforcement learning uses online learning for model training, changing agent
behaviors through continuous exploration, learning, and experimentation to obtain the
best return. Therefore, reinforcement learning does not require the model to be trained
in advance. It can change its action according to the environment and reward feedback.
The authors in [17] designed a Q-learning-based localization-free routing for underwater
sensor networks. The authors in [18] proposed a deep Q-routing algorithm to compute
the path of any source-destination pair request using a deep Q-network with prioritized
experience replay. The authors in [19] proposed traction control ideas to solve the routing
problem. The authors in [20] proposed a routing optimization algorithm based on the
proximal policy optimization (PPO) model in reinforcement learning. The authors in [21]
discussed a solution for automatic routing in the OTN using DRL. Although the studies
described above have been successful for the SDN demand-routing optimization problem,
they do not perform as well in new topologies because they do not consider the model’s
generalization capability.
Traditional DRL algorithms use a typical neural network (NN) as the policy network.
The NN extracts and filters features of the input data layer by layer to finally produce
results for tasks such as classification and prediction. However, as research has advanced,
it has become clear that conventional neural networks cannot solve all network routing
problems and struggle to handle non-Euclidean graph-structured data. Therefore, we
need to optimize the traditional reinforcement learning algorithm to improve its ability
to extract feature information from samples. Off-policy reinforcement learning (Off-
policy RL) algorithms significantly improve sample utilization by reusing past experiences.
The authors in [22] proposed an off-policy actor–critic RL algorithm based on the maximum
entropy reinforcement learning framework, in which the actor aims to maximize the
expected reward while also maximizing the entropy. By adopting this maximum entropy
framework, they achieved state-of-the-art sample efficiency. However, in practice, the
commonly used off-policy approximate dynamic programming methods based on
Q-learning and actor–critic methods are sensitive to the data distribution. They can
only make limited progress without collecting additional on-policy data. To address this
problem, the authors in [23] proposed bootstrapping error accumulation reduction to reduce
the instability of off-policy algorithms caused by error accumulation through the Bellman
backup operator. The authors in [24] developed a new estimator called double reinforcement
learning, which is based on cross-fold estimation of Q-functions and marginalized
density ratios. The authors in [25] used a framework combining imitation learning and deep
reinforcement learning, effectively reducing the RL algorithm’s instability. The authors
in [26] used the DQN replay datasets to study off-policy RL, effectively reducing the off-
policy algorithm’s instability. The authors in [27] proposed an intelligent routing algorithm
combining the graph neural network (GNN) and deep deterministic policy gradient (DDPG)
in the SDN environment, which can be effectively extended to different network topologies,
improving load-balancing capabilities and generalizability. The authors in [28] combined
GNN with the DQN algorithm to address the lack of generalization abilities in untrained
OTN topologies. OTN topology graphs are non-Euclidean data, and the nodes in their
topology graphs typically contain useful feature information that most neural networks
are unable to comprehend. They use MPNN to extract feature information between OTN
topological nodes, which improves the generalization performance of the DRL algorithm.
However, it is challenging for a single DRL agent to balance exploration and exploitation,
which limits its convergence performance. Ensemble learning solves a single prediction
problem by building several models. It works by generating several classifiers or models,
each of which learns and predicts independently. These predictions are finally combined
into an aggregate prediction that outperforms any single classifier [29]. There are two types
of ensemble base learning machines. One type uses various learning algorithms on the same
dataset to obtain a base learning machine, which is usually referred to as heterogeneous
[30–32]. The other type
applies the same learning algorithm on a different training set (which can be obtained
by random sampling based on the original training dataset, etc.), and the base learning
machine obtained using this method is said to be a homogeneous type. However, because
of the high implementation difficulty and low scalability of heterogeneous types of base
learning machines, expansion to high-dimensional state and action spaces is difficult, mak-
ing it unsuitable for solving OTN routing optimization problems. Table 1 summarizes
the description of the papers reviewed, whether SDN and RL are considered, and the
evaluation indicators. The EMDQN algorithm we propose applies the same reinforcement
learning algorithm to different training sets to generate the base learning machine. We
combine multiple EMDQN agents to construct an ensemble learning machine and generate
diverse samples to effectively generate learning machines with high generalization abilities
and significant differences.
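To make the homogeneous approach concrete, the following minimal sketch trains the same algorithm on different bootstrap resamples of one dataset and averages the resulting predictions. The polynomial regressor and synthetic data are placeholders for illustration only, not the paper's learners or training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: noisy samples of an arbitrary target function (placeholder data).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

def fit_poly(X, y, degree=5):
    """Fit one 'base learning machine': here a simple polynomial regressor."""
    coeffs = np.polyfit(X[:, 0], y, degree)
    return lambda Xq: np.polyval(coeffs, Xq[:, 0])

# Homogeneous ensemble: the same algorithm on different bootstrap training sets.
ensemble = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
    ensemble.append(fit_poly(X[idx], y[idx]))

# The combined prediction is the average over all base learners.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
preds = np.stack([model(X_test) for model in ensemble])
print("ensemble mean prediction:", preds.mean(axis=0))
print("disagreement (std) between learners:", preds.std(axis=0))
```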
3. SD-OTN Architecture
In this paper, the designed SD-OTN architecture consists of the application, control,
and data planes, as shown in Figure 1. The description of each part of the network
architecture is as follows:
Figure 1. The SD-OTN architecture. The architecture consists of the application plane, control plane,
and data plane.
1. Data plane. The data plane consists of the ROADM nodes and the predefined optical
paths connecting them. In the data plane, the capacity of the links and the connection
status of the ROADM nodes are predefined. The data plane must collect the current
optical data unit (ODU) signal requests and network status information, which it
must then send to the control plane via the southbound interface. The data plane
implements the new routing forwarding policy after receiving it from the control
plane. It communicates the new network state and traffic demand to the control plane,
from which decision-makers in the application plane learn.
2. Control plane. The control plane consists of the SDN controller. The control plane
obtains the ODU signal request and network status information via the southbound in-
terface and calculates the reward using the reward function. Through the northbound
interface, the control plane sends the network state, traffic demand, and reward to the
application plane. When it receives an optimized routing
action from the application plane, the control plane sends a routing forwarding policy
to the data plane based on the routing action.
3. Application plane. The application plane manages the EMDQN agents. The agents
obtain network state information from the control plane, encode it, and feed it into
the agents’ policy network, which generates optimized routing actions. Subsequently,
the routing actions are sent down to the control plane.
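As a rough illustration of the interaction loop among the three planes described above, the following Python sketch uses hypothetical placeholder classes and values (OTNController, EMDQNAgent, the capacities, and the candidate paths); it is not an actual SDN controller API, only the flow of state, action, and reward.

```python
import random

class OTNController:
    """Control plane: relays state/reward upward and routing actions downward."""
    def __init__(self, num_links, num_paths):
        self.remaining_bw = [8] * num_links          # predefined link capacities (placeholder ODU units)
        self.num_paths = num_paths

    def get_state_and_demand(self):
        # Collected from the data plane via the southbound interface.
        demand = random.choice([1, 2, 4])            # e.g. an ODU-sized traffic request
        return list(self.remaining_bw), demand

    def apply_route(self, path_links, demand):
        # Install the forwarding policy; the reward is positive only if capacity suffices.
        if all(self.remaining_bw[l] >= demand for l in path_links):
            for l in path_links:
                self.remaining_bw[l] -= demand
            return 1.0
        return 0.0

class EMDQNAgent:
    """Application plane: maps network state to a routing action (random placeholder)."""
    def select_action(self, state, demand, num_paths):
        return random.randrange(num_paths)

# One interaction step of the loop: state up, action down, reward back up.
controller = OTNController(num_links=6, num_paths=3)
agent = EMDQNAgent()
state, demand = controller.get_state_and_demand()
action = agent.select_action(state, demand, controller.num_paths)
candidate_paths = [[0, 1], [2, 3], [4, 5]]           # k candidate paths (placeholder)
reward = controller.apply_route(candidate_paths[action], demand)
print("action:", action, "reward:", reward)
```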
The network topology is modeled as a graph, as shown in Equation (1):
G = (V, E), (1)
where V and E represent the set of n ROADM nodes and m optical links in the network
topology, respectively, as shown in Equations (2) and (3).
V = [v_1, v_2, ..., v_n], (2)
E = [e_1, e_2, ..., e_m]. (3)
We use C to denote the set of link bandwidth capacities, as shown in Equation (4), where |C| = |E| = m:
C = [c_1, c_2, ..., c_m]. (4)
The path k from node v_i to node v_j is defined as a sequence of links, as shown in Equation (5), where e_{k(i)} ∈ E:
p_k = {e_{k(0)}, e_{k(1)}, ..., e_{k(n)}}. (5)
We use d_k to denote the traffic demand of the path k and define D as the set of all traffic demands, as shown in Equation (6):
D = [d_1, d_2, ..., d_{n×n}]. (6)
The traffic routing problem in OTN is a classical resource allocation problem [26]. If the
bandwidth capacity of the distributed routing path is greater than the size of the bandwidth
requirement, the allocation is successful. After successfully allocating bandwidth capacity
for a node pair’s traffic demand, the routing path will not be able to release the bandwidth
occupied by that demand until the end of this episode. We use rb_i to describe the remaining bandwidth of the link e_i, which is the link bandwidth capacity c_i minus the traffic demands of all paths passing through link e_i, as shown in Equation (7). RB is the set of the remaining bandwidth of all links, as shown in Equation (8):
rb_i = c_i − ∑_{k: e_i ∈ p_k} d_k. (7)
Q = [q_1, q_2, ..., q_{n×n}]. (10)
The optimization objective in this paper is to successfully allocate as much of the traffic demand as possible, as shown in Equation (11):
max ∑_{q_i ∈ Q} q_i. (11)
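The following minimal Python sketch illustrates the allocation model of Equations (2)–(7) and the objective of Equation (11) on a tiny hypothetical topology. As an assumption for illustration, q_i is taken here as the amount of traffic successfully allocated for the i-th demand (and 0 on failure); the link capacities and demand sizes are made up.

```python
# Minimal sketch: links have fixed capacities, a path consumes bandwidth on every
# link it traverses, and an allocation succeeds only if all links can carry the demand.
capacity = {"e1": 8, "e2": 8, "e3": 4}          # C = [c1, ..., cm] (placeholder values)
remaining = dict(capacity)                      # rb_i as in Equation (7)

def allocate(path, demand):
    """Try to place `demand` on `path` (a list of link ids); return the allocated amount or 0."""
    if all(remaining[e] >= demand for e in path):
        for e in path:
            remaining[e] -= demand              # bandwidth stays reserved until episode end
        return demand                           # q_i: demand successfully allocated
    return 0                                    # allocation fails, q_i = 0

demands = [(["e1", "e2"], 4), (["e2", "e3"], 4), (["e1", "e3"], 2)]
q = [allocate(path, d) for path, d in demands]
print("total allocated traffic (objective of Eq. (11)):", sum(q))
print("remaining bandwidth per link:", remaining)
```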
In view of the above optimization objective, the routing optimization can be modeled
as a Markov decision process, defined by the tuple {S, A, P, R}, where S is the state space,
A is the action space, P is the set of transfer probabilities, and R is the set of rewards. The
specific design is as follows:
1. Action space: The action space is designed as k shortest hop-based paths of source-
destination nodes. The action selects one of the k paths to transmit the traffic demand
3. Reward function: The reward function returns a positive reward if the selected
link has sufficient capacity to support the traffic demand in an episode; otherwise, it
returns no reward and terminates the episode. According to the optimization objective
in Equation (11), the final reward for the episode is the sum of the rewards of all
successfully assigned traffic demand tuples {src, dst, demand}, as shown in Equation
(13), where N is the number of traffic demand tuples, rt represents the reward after
the action at time t, qi represents the i-th traffic demand successfully assigned, and
qmax represents the maximum traffic demand successfully assigned. The higher the
reward, the more bandwidth demands are successfully allocated in that time step,
and the better the network load-balancing capability.
r_t = ∑_{i=1}^{N} q_i / q_max. (13)
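A short sketch of the episode reward in Equation (13), using illustrative demand values and the same reading of q_i as in the sketch above.

```python
# Sketch of Equation (13): the normalized sum of all successfully assigned demands.
assigned = [2, 4, 1, 4]                 # q_i for demands assigned so far this episode (placeholder)
q_max = max(assigned)                   # largest successfully assigned demand
reward = sum(q / q_max for q in assigned)
print("episode reward r_t =", reward)   # higher reward -> more demand carried
```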
The DQN agent selects and executes an action based on an ε-greedy policy. The policy generates a random number in the [0, 1] interval from a uniform distribution. If the number is less than 1 − ε, it selects the action that maximizes the Q-value; otherwise, it selects an action randomly, as shown in Equation (15):
a_t = argmax_a Q(s_t, a, θ) with probability 1 − ε; otherwise, a random action. (15)
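A minimal sketch of the ε-greedy rule in Equation (15); the Q-values below are placeholder numbers for k candidate paths.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Equation (15): exploit the max-Q action with probability 1 - epsilon,
    otherwise pick a random action."""
    if rng.random() < 1.0 - epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.7, 0.1, 0.4])      # Q(s_t, a, theta) for k candidate paths
print(epsilon_greedy(q_values, epsilon=0.05, rng=rng))
```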
The target network calculates the target value y from a random mini-batch of samples drawn from the replay buffer, as shown in Equation (16), where r is the reward value and γ is the discount factor:
y = r + γ max_{a′} Q̂(s′, a′, θ̄). (16)
The DQN defines the loss function of the network using the mean-square error, as shown in Equation (17). The parameter θ is updated by mini-batch semi-gradient descent, as shown in Equations (18) and (19):
L(θ) = E[(y − Q(s, a, θ))^2], (17)
∇_θ L(θ) ≈ (1/N) ∑_i (y − Q(s, a, θ)) ∇_θ Q(s, a, θ), (18)
θ ← θ − α ∇_θ L(θ), (19)
where N represents the number of samples and α represents the update parameter.
The target network is used by the DQN to keep the target Q-value constant over
time, which reduces the correlation between the predicted and target Q-values to a certain
extent. This operation reduces the possibility of loss value oscillation and divergence
during training and improves the algorithm’s stability.
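The following compact sketch walks through Equations (16)–(19) with a linear function standing in for the Q-network; the state features, mini-batch, dimensions, and learning rate are illustrative assumptions, not the paper's MPNN-based network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
theta = rng.normal(scale=0.1, size=(n_features, n_actions))        # online network parameters
theta_target = theta.copy()                                        # frozen target network

def q_values(state, params):
    """Linear stand-in for Q(s, ., theta): one value per action."""
    return state @ params

def dqn_update(batch, theta, theta_target, gamma=0.95, alpha=0.01):
    """One mini-batch semi-gradient step, in the spirit of Equations (16)-(19)."""
    grad = np.zeros_like(theta)
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * q_values(s_next, theta_target).max()  # Eq. (16)
        td_error = target - q_values(s, theta)[a]                                  # y - Q(s, a, theta)
        grad[:, a] += -2.0 * td_error * s              # gradient of (y - Q)^2 w.r.t. theta (cf. Eq. (18))
    theta -= alpha * grad / len(batch)                 # parameter update, Eq. (19)
    return theta

# Fake mini-batch of transitions (s, a, r, s', done) just to exercise the update.
batch = [(rng.normal(size=n_features), rng.integers(n_actions), 1.0,
          rng.normal(size=n_features), False) for _ in range(32)]
theta = dqn_update(batch, theta, theta_target)
print("updated parameter norm:", np.linalg.norm(theta))
```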
First, each link k combines its hidden state with the hidden state of each neighboring link through the message function m(·). The message function m(·) is a fully connected neural network. After iterating over all links, the link k receives messages from all neighboring links (denoted by N(k)) and generates a new feature vector M_k^{t+1} using message aggregation, as shown in Equation (21):
M_k^{t+1} = ∑_{i ∈ N(k)} m(h_k^t, h_i^t). (21)
Second, we update the hidden state of the link by aggregating the feature vector M_k^{t+1} with the link hidden state h_k^t through the update function u(·), as shown in Equation (22). The update function u(·) is a gated recurrent unit (GRU), which has high training efficiency:
h_k^{t+1} = u(h_k^t, M_k^{t+1}). (22)
Finally, after T message-passing steps, we use the readout function R(·) to aggregate the hidden states of all links and obtain the Q-value, as shown in Equation (23):
Q(s, a, θ) = R(∑_{k ∈ E} h_k). (23)
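A small numpy sketch of the message passing in Equations (21)–(23). The link topology and weights are random placeholders, and a simple tanh layer stands in for both the fully connected message function and the GRU update used in the paper; the readout is a plain linear map over the summed hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_links, hidden = 5, 4
h = rng.normal(scale=0.1, size=(n_links, hidden))            # h_k^0: initial link hidden states
# Neighbouring links of each link (illustrative line topology).
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

W_msg = rng.normal(scale=0.1, size=(2 * hidden, hidden))     # stands in for the message function m(.)
W_upd = rng.normal(scale=0.1, size=(2 * hidden, hidden))     # simplified update in place of the GRU u(.)
w_out = rng.normal(scale=0.1, size=hidden)                   # readout R(.)

for t in range(3):                                           # T message-passing steps
    M = np.zeros_like(h)
    for k in range(n_links):
        for i in neighbours[k]:                              # Eq. (21): aggregate messages from N(k)
            M[k] += np.tanh(np.concatenate([h[k], h[i]]) @ W_msg)
    # Eq. (22): update each hidden state from (h_k^t, M_k^{t+1}).
    h = np.tanh(np.concatenate([h, M], axis=1) @ W_upd)

q_value = float(h.sum(axis=0) @ w_out)                       # Eq. (23): readout over all links
print("Q(s, a):", q_value)
```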
The EMDQN agents select actions using an upper-confidence-bound (UCB) rule over the ensemble of Q-networks, as shown in Equation (24):
a_t = argmax_a {Q_mean(s_t, a, θ) + λ Q_std(s_t, a, θ)} with probability 1 − ε; otherwise, a random action, (24)
where Q_mean(s_t, a, θ) and Q_std(s_t, a, θ) are the mean and standard deviation of the Q-values output by all MPNN policy networks {Q(s, a, θ_i)}_{i=1}^N. The exploration reward λ > 0 is a hyper-parameter. When λ increases, the EMDQN agents become more active in accessing unknown state–action pairs.
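A minimal sketch of the UCB action rule in Equation (24); the ensemble Q-values and the λ and ε settings are illustrative numbers.

```python
import numpy as np

def ucb_ensemble_action(q_ensemble, lam, epsilon, rng):
    """Equation (24): pick the action with the highest Q_mean + lambda * Q_std
    over the ensemble, falling back to a random action with probability epsilon.
    `q_ensemble` has shape (num_agents, num_actions)."""
    if rng.random() < 1.0 - epsilon:
        ucb = q_ensemble.mean(axis=0) + lam * q_ensemble.std(axis=0)
        return int(np.argmax(ucb))
    return int(rng.integers(q_ensemble.shape[1]))

rng = np.random.default_rng(0)
# Q-values of two EMDQN agents for k = 4 candidate paths (illustrative numbers).
q_ensemble = np.array([[0.3, 0.6, 0.2, 0.5],
                       [0.4, 0.5, 0.9, 0.4]])
print(ucb_ensemble_action(q_ensemble, lam=5.0, epsilon=0.05, rng=rng))
```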
The traditional DQN loss function (Equation (17)) may be affected by error propagation; that is, the error of the target Q-network Q̂(s′, a′, θ̄) propagates to the current Q-network Q(s, a, θ). This error propagation can lead to unstable convergence. To alleviate this problem, for each EMDQN agent i, this paper uses weighted Bellman backups, as shown in Equation (25):
L_WQ^EMDQN(θ_i) = w(s) (r + γ max_{a′} Q̂(s′, a′, θ̄_i) − Q(s, a, θ_i))^2, (25)
where w(s) represents the confidence weight of the set of target Q-networks in the interval [0.5, 1.0]. w(s) is calculated from Equation (26), where the weight parameter W is a hyper-parameter, σ is the sigmoid function, and Q̂_std(s) is the empirical standard deviation of {max_a Q̂(s, a, θ̄_i)}_{i=1}^N over all target Q-networks. L_WQ^EMDQN(·) reduces the weights of sample transitions with high variance between target Q-networks, resulting in a better signal-to-noise ratio for network updates:
w(s) = σ(−Q̂_std(s) · W) + 0.5. (26)
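The weighted backup of Equations (25) and (26) can be sketched as follows; the target-network values and the way the per-agent target is indexed are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_backup_loss(r, gamma, q_sa_i, target_max_all, i, weight_W):
    """Equations (25)-(26) for ensemble member i: the Bellman error uses agent i's
    own target network, while the confidence weight w(s) uses the disagreement
    (standard deviation) of max_a' Q_hat(s', a') across all target networks."""
    q_std = np.std(target_max_all)                   # empirical std over all target networks
    w = sigmoid(-q_std * weight_W) + 0.5             # Eq. (26): confidence weight in (0.5, 1.0]
    y = r + gamma * target_max_all[i]                # Bellman target for agent i
    return w * (y - q_sa_i) ** 2                     # Eq. (25): weighted squared error

# Illustrative numbers: two target networks that mildly disagree on the next state.
print(weighted_backup_loss(r=1.0, gamma=0.95, q_sa_i=0.8,
                           target_max_all=np.array([1.1, 0.7]), i=0, weight_W=0.05))
```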
Figure 4. The optical transmission network topologies: (a) the NSFNET topology; (b) the GEANT2
topology; (c) the GBN topology.
Figure 5. Comparison of the effect of some hyper-parameters: (a) ensemble number; (b) learning rate; (c) ε-decay; (d) UCB exploration reward λ; (e) weight parameter W.
Figure 5a shows the training results for the different numbers of EMDQN agents.
When the number of agents is high, the training slows down and aggravates the overfitting
of DRL agents in the application scenario, resulting in poorer results. The performance is
optimal when the number of EMDQN agents is two. Figure 5b shows the training results of
the stochastic gradient descent algorithm with different learning rates. When the learning
rate was 0.001, the algorithm reward achieved the highest value. Figure 5c shows the
training results for different decay rates of ε. In the initial stage of training, ε is close to
1. We executed 70 iterations and then started to reduce ε exponentially using ε-decay until it
decreased to 0.05. During the ε reduction process, the training curve tends to flatten
out, finally reaching convergence. The training results show that the reward value curve is
most stable after convergence when ε-decay is 0.995. Figure 5d shows the training results
for different λ values in Equation (24). λ denotes the exploration reward of the EMDQN
agent. From the results in Figure 5d, it is clear that the algorithm reward is highest
when λ is 5. Figure 5e depicts the training results for different weight parameters
W in Equation (26). In this paper, we set the sample batch size to 32. As W increases, the
sample weights converge and become less than one, which affects the sample efficiency of
the EMDQN. The reward of this algorithm reaches its highest value when W
is 0.05.
Table 2 shows some relevant parameters of the EMDQN and values taken after tuning
the parameters.
Parameter Value
Batch size 32
Learning rate 0.001
Soft weights copy α 0.08
Dropout rate 0.01
State hidden 20
Ensemble number 2
UCB exploration reward λ 5
Weight parameter W 0.05
ε-decay 0.995
Discount factor γ 0.95
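For reference, these tuned settings might be collected into a plain configuration dictionary as sketched below; the key names are illustrative and not taken from the paper's code.

```python
# Table 2 settings gathered into a configuration dictionary (hypothetical key names).
EMDQN_CONFIG = {
    "batch_size": 32,
    "learning_rate": 1e-3,
    "soft_weights_copy_alpha": 0.08,
    "dropout_rate": 0.01,
    "state_hidden": 20,
    "ensemble_number": 2,
    "ucb_exploration_reward_lambda": 5,
    "weight_parameter_W": 0.05,
    "epsilon_decay": 0.995,
    "discount_factor_gamma": 0.95,
}
```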
Figure 6 shows the average reward of all algorithms for the three evaluation scenarios,
where the confidence interval is 95%. In this paper, we design the reward based on whether
the bandwidth demand can be successfully allocated. The greater the reward, the more
bandwidth demand is successfully allocated, and the better the network load-balancing
capability. In all three evaluation scenarios, the EMDQN algorithm proposed in this
paper performs better than other algorithms after convergence. The EMDQN algorithm
outperforms the DQN+GNN algorithm with ensemble learning removed after convergence
by more than 7%, demonstrating that the multi-agent ensemble learning approach can
effectively improve the convergence performance of the DQN. Additionally, the EMDQN
and DQN+GNN outperform the classical reinforcement learning algorithms (DQN and
PPO) by more than 25% in all three evaluated scenarios. This indicates that the MPNN
can effectively improve the decision performance of the reinforcement learning model by
capturing information about the relationship between the demand on links and network
topology. The DQN and PPO algorithms perform about as well as the ECMP algorithm after
convergence. The OSPF algorithm, on the other hand, routes all flow requests individually to
the shortest path. Since this method is based on fixed forwarding rules, it can easily lead to
link congestion. Therefore, the OSPF algorithm only comes close to ECMP in the GBN scenario
and achieves the lowest reward in the other scenarios.
Figure 6. Comparison of the rewards of each algorithm in different scenarios: (a) NSFNET scenario
evaluation; (b) GEANT2 scenario evaluation; (c) GBN scenario evaluation.
Table 3 shows the average throughput of each algorithm in ODU0 bandwidth units
for the three network topologies. Table 4 displays the average link utilization of each
algorithm across the three network topologies. The average throughput and link utilization
of the EMDQN are higher than those of other algorithms under various network topologies,
indicating that the EMDQN algorithm has a better load-balancing capability for the network
after convergence. The performance of the EMDQN algorithm is higher than that of the
DQN+GNN algorithm, which is a good indication that ensemble learning can improve the
convergence performance of the model. The results show that the EMDQN has excellent
decision-making abilities.
Table 4. A comparison of the average link utilization of each algorithm in different scenarios.
Figure 7. Evaluation of the model in different network scenarios with randomly broken lightpaths:
(a) broken lightpath at NSFNET; (b) broken lightpath at GEANT2; (c) broken lightpath at GBN.
algorithms. This confirms that the EMDQN agent can still maintain excellent decision-
making abilities in the case of network connectivity changes.
Figure 8. Performance of the algorithm after changing the network topology: (a) train in the NSFNET
and evaluate in the GEANT2; (b) train in the NSFNET and evaluate in the GBN.
Author Contributions: Conceptualization, J.C. and Y.Z.; methodology, J.C.; software, W.X. and X.L.;
validation, J.C., W.X., and X.L.; formal analysis, X.H.; investigation, M.W.; resources, D.H.; data
curation, J.C.; writing—original draft preparation, J.C., W.X., and X.L.; writing—review and editing,
J.C. and Y.Z.; visualization, X.H.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition,
J.C. All authors have read and agreed to the published version of the manuscript.
Funding: This work is supported by the major program of Guangxi Natural Science Foundation
(No.2020GXNSFDA238001) and the Middle-aged and Young Teachers’ Basic Ability Promotion Project
of Guangxi (No.2020KY05033).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The authors confirm that the data supporting the findings of this study
are available within the article.
Conflicts of Interest: The authors declare that they have no conflict of interest.
References
1. Karakus, M.; Durresi, A. Quality of service (QoS) in software defined networking (SDN). J. Netw. Comput. Appl. 2017, 80, 200–218.
[CrossRef]
2. Guo, X.; Lin, H.; Li, Z.; Peng, M. Deep-Reinforcement-Learning-Based QoS-Aware Secure Routing for SDN-IoT. IEEE Internet
Things J. 2020, 7, 6242–6251. [CrossRef]
3. Sun, P.; Lan, J.; Guo, Z.; Xu, Y.; Hu, Y. Improving the Scalability of Deep Reinforcement Learning-Based Routing with Control on
Partial Nodes. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP
2020), Barcelona, Spain, 4–8 May 2020; pp. 3557–3561.
4. Nguyen, T.G.; Phan, T.V.; Hoang, D.T.; Nguyen, T.N.; So-In, C. Federated Deep Reinforcement Learning for Traffic Monitoring in
SDN-Based IoT Networks. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 1048–1065. [CrossRef]
5. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings
of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 4–11 August 2017; Volume 70, pp.
1263–1272.
6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforce-
ment learning. arXiv 2013, arXiv:1312.5602.
7. Ali Khan, A.; Zafrullah, M.; Hussain, M.; Ahmad, A. Performance analysis of OSPF and hybrid networks. In Proceedings of the
International Symposium on Wireless Systems and Networks (ISWSN 2017), Lahore, Pakistan, 19–22 November 2017; pp. 1–4.
8. Chiesa, M.; Kindler, G.; Schapira, M. Traffic engineering with Equal-Cost-Multipath: An algorithmic perspective. IEEE/ACM
Trans. Netw. 2017, 25, 779–792. [CrossRef]
9. Li, C.; Jiang, K.; Luo, Y. Dynamic placement of multiple controllers based on SDN and allocation of computational resources
based on heuristic ant colony algorithm. Knowl. Based Syst. 2022, 241, 108330. [CrossRef]
10. Di Stefano, A.; Cammarata, G.; Morana, G.; Zito, D. A4SDN—Adaptive Alienated Ant Algorithm for Software-Defined Network-
ing. In Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015),
Krakow, Poland, 4–6 November 2015; pp. 344–350.
11. Chen, F.; Zheng, X. Machine-learning based routing pre-plan for sdn. In International Workshop on Multi-Disciplinary Trends in
Artificial Intelligence; Springer: Cham, Switzerland, 2015; pp. 149–159.
12. Xavier, A.; Silva, J.; Martins-Filho, J.; Bastos-Filho, C.; Chaves, D.; Almeida, R.; Araujo, D.; Martins, J. Heuristic planning algorithm
for sharing restoration interfaces in OTN over DWDM networks. Opt. Fiber Technol. 2021, 61, 102426. [CrossRef]
13. Fang, C.; Feng, C.; Chen, X. A heuristic algorithm for minimum cost multicast routing in OTN network. In Proceedings of the
19th Annual Wireless and Optical Communications Conference (WOCC 2010), Shanghai, China, 14–15 May 2010; pp. 1–5.
14. Chen, J.; Wang, Y.; Huang, X.; Xie, X.; Zhang, H.; Lu, X. ALBLP: Adaptive Load-Balancing Architecture Based on Link-State
Prediction in Software-Defined Networking. Wirel. Commun. Mob. Comput. 2022, 2022, 8354150. [CrossRef]
15. Yan, M.; Li, S.; Chan, C.A.; Shen, Y.; Yu, Y. Mobility Prediction Using a Weighted Markov Model Based on Mobile User
Classification. Sensors 2021, 21, 1740. [CrossRef]
16. Wani, A.; Revathi, S.; Khaliq, R. SDN-based intrusion detection system for IoT using deep learning classifier (IDSIoT-SDL). CAAI
Trans. Intell. Technol. 2021, 6, 281–290. [CrossRef]
17. Zhou, Y.; Cao, T.; Xiang, W. Anypath Routing Protocol Design via Q-Learning for Underwater Sensor Networks. IEEE Internet
Things J. 2021, 8, 8173–8190. [CrossRef]
18. Jalil, S.Q.; Rehmani, M.; Chalup, S. DQR: Deep Q-Routing in Software Defined Networks. In Proceedings of the 2020 International
Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
19. Sun, P.; Guo, Z.; Lan, J.; Li, J.; Hu, Y.; Baker, T. ScaleDRL: A scalable deep reinforcement learning approach for traffic engineering
in SDN with pinning control. Comput. Netw. 2021, 190, 107891. [CrossRef]
20. Che, X.; Kang, W.; Ouyang, Y.; Yang, K.; Li, J. SDN Routing Optimization Algorithm Based on Reinforcement Learning. Comput.
Eng. Appl. 2021, 57, 93–98.
21. Suárez-Varela, J.; Mestres, A.; Yu, J.; Kuang, L.; Feng, H.; Barlet-Ros, P.; Cabellos-Aparicio, A. Routing based on deep reinforcement
learning in optical transport networks. In Proceedings of the 2019 Optical Fiber Communications Conference and Exhibition
(OFC), San Diego, CA, USA, 3–7 March 2019; pp. 1–3.
22. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a
stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden,
10–15 July 2018; Volume 80, pp. 1861–1870.
23. Kumar, A.; Fu, J.; Soh, M.; Tucker, G.; Levine, S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Proceedings
of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019;
pp. 11784–11794.
24. Kallus, N.; Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. J. Mach.
Learn. Res. 2020, 21, 6742–6804.
25. Qiang, F.; Xin, X.; Xitong, W.; Yujun, Z. Target-driven visual navigation in indoor scenes using reinforcement learning and
imitation learning. CAAI Trans. Intell. Technol. 2022, 7, 167–176.
26. Agarwal, R.; Schuurmans, D.; Norouzi, M. An optimistic perspective on offline reinforcement learning. In Proceedings of the 37th
International Conference on Machine Learning (ICML 2020), Virtual Event, 13–18 July 2020; pp. 104–114.
27. Shahri, E.; Pedreiras, P.; Almeida, L. Extending MQTT with Real-Time Communication Services Based on SDN. Sensors 2022, 22,
3162. [CrossRef] [PubMed]
28. Almasan, P.; Suárez-Varela, J.; Badia-Sampera, A.; Rusek, K.; Barlet-Ros, P.; Cabellos-Aparicio, A. Deep Reinforcement Learning
meets Graph Neural Networks: Exploring a routing optimization use case. arXiv 2020, arXiv:1910.07421v2. [CrossRef]
29. Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258. [CrossRef]
30. Zhang, T.; Zhang, D.; Yan, H.; Qiu, J.; Gao, J. A new method of data missing estimation with FNN-based tensor heterogeneous
ensemble learning for internet of vehicle. Neurocomputing 2021, 420, 98–110. [CrossRef]
31. Fang, Z.; Wang, Y.; Peng, L.; Hong, H. A comparative study of heterogeneous ensemble-learning techniques for landslide
susceptibility mapping. Int. J. Geogr. Inf. Sci. 2021, 35, 321–347. [CrossRef]
32. Lei, L.; Kou, L.; Zhan, X.; Zhang, J.; Ren, Y. An Anomaly Detection Algorithm Based on Ensemble Learning for 5G Environment.
Sensors 2022, 22, 7436. [CrossRef] [PubMed]
33. Strand, J.; Chiu, A.; Tkach, R. Issues for routing in the optical layer. IEEE Commun. Mag. 2001, 39, 81–87. [CrossRef]
34. Chen, R.; Sidor, S.; Abbeel, P.; Schulman, J. UCB exploration via Q-ensembles. arXiv 2017, arXiv:1706.01502.