Resource Usage Cost Optimization in Cloud Computing Using Machine Learning
Abstract—Cloud computing is gaining popularity among small and medium-sized enterprises. The cost of cloud resources plays a significant role for these companies, which is why cloud resource optimization has become an important issue. Numerous methods have been proposed to match cloud computing resources to actual demand and thereby reduce the cost of cloud services. Such approaches mostly optimize a single factor (e.g., compute power), which can yield unsatisfactory results for real-world cloud workloads, which are multi-factor, dynamic and irregular. This article presents a novel approach which uses anomaly detection, machine learning and particle swarm optimization to achieve a cost-optimal cloud resource configuration. It is a complete solution which works in a closed loop without the need for external supervision or initialization, builds knowledge about the usage patterns of the system being optimized and filters out anomalous situations on the fly. Our solution can adapt to changes in both system load and the cloud provider's pricing plan. It was tested in Microsoft's Azure cloud environment using data collected from a real-life system. Experiments demonstrate that over a period of 10 months, a cost reduction of 85 percent was achieved.

Index Terms—Cloud resource usage prediction, anomaly detection, machine learning, particle swarm optimization, resource cost optimization
location and network infrastructure, which allows a cloud provider to reduce costs. In comparison to the above solutions, our approach is focused on cost optimization from the end-user perspective; although a reduction in server operation costs could possibly lead to a provider offering a discount, the solution proposed by us provides direct cost savings. The solutions described in the aforementioned articles take into consideration only the provisioning of virtual machines as a cloud provider's building blocks, while we are focused on cost optimization from the end user's perspective, and thus not only IaaS, but also PaaS and SaaS are considered.

Usage prediction enables us to develop a resource usage plan. Many works describe different techniques of resource allocation. In [31], Wei et al. present a game-theoretic method, while the authors of [32] propose a coral-reef and game theory-based approach. Machine learning is proposed in [33], and a combinatorial auction algorithm and a combinatorial double auction algorithm are described in [34] and [35], respectively. Zhang et al. [36] propose machine learning-based resource allocation, and in [37] the authors put forward greedy particle swarm optimization. Our solution uses the more lightweight, although accurate, Integer-PSO algorithm described in [38], which we adapt and use for resource allocation planning purposes.

In the survey [39], Gondhi et al. review different virtual machine scheduling algorithms. Besides particle swarm optimization, which is the basis for Integer-PSO, the authors describe a genetic algorithm, simulated annealing, ant colony optimization, an artificial immune system and other meta-heuristic algorithms. Despite providing comparisons of the advantages and disadvantages of the methods presented, the survey does not describe complete solutions. For example, a continuous PSO algorithm has to be first adapted to the discrete resource allocation problem (Integer-PSO) and only then can it be used in the optimization process, while the aforementioned survey does not cover this adaptation. On the other hand, our work describes a complete solution which was tested on real-world data.
In addition to the research described above, there are some commercial solutions which enable cloud resource optimization. For example, scaling components, as exemplified by Azure Autoscale,1 AWS Autoscale2 and Google Cloud Autoscale,3 are part of the cloud environment. Unfortunately, only threshold-based scaling and simple time-based scaling are available. Both of those scaling techniques require an analysis of system usage patterns, which might be difficult when the system is complicated. There are also commercial cloud provider-independent systems4 that offer cloud resource optimization. These systems analyze spending and present it in an easy-to-understand form. Additionally, they provide hints about potential scale-downs of some cloud components or reorganizations which reduce cloud running costs. These systems are not automated, so administrators must approve the changes proposed every time they find them useful. Some of these systems advertise that they are using ML in their analysis,5,6 but in fact they offer an overall view of spending sources and a simple scheduling of component scaling plus human support, which helps reduce costs but without automation.

Our analysis of existing solutions shows that currently none tackle the problem of optimizing different types of cloud resources (IaaS, PaaS, SaaS) with proactive usage prediction, anomaly detection and efficient, cloud provider-specific, automatic resource allocation. The contribution of this study is to define such a fully automatic system along with simulations and tests of its behavior using real-life usage data. Our solution does not require initial reservation schedules or knowledge about the type of tasks performed by the system. It works with different combinations of cloud component types (IaaS, PaaS, SaaS) and accounts for various resource properties (CPU, IOPS, RAM, etc.). The cost optimization mechanism is resistant to anomalies (i.e., temporary usage spikes) and adapts to price changes (i.e., periodic discounts), as the pricing policy is obtained directly from the cloud provider.

1. Azure Autoscale – https://azure.microsoft.com/en-us/features/autoscale
2. AWS Autoscale – https://aws.amazon.com/autoscaling
3. Google Cloud Autoscale – https://cloud.google.com/compute/docs/autoscaler
4. Azure Cost Management – https://azure.microsoft.com/en-us/services/cost-management
5. Cloud Cost Management, Efficiency and Optimization – https://www.cloudability.com
6. Next-Generation Cloud Optimization for CloudOps – https://www.densify.com

3 CLOUD RESOURCE COST OPTIMIZATION

Systems located in the cloud can be complicated and involve multiple different resource types. The demand for those resources varies over time, which is conditioned by:

1) usage patterns generated by users, which depend on the time of the day and the day of the week;
2) usage patterns which depend on end-point machine configuration (usage generated by automated devices, i.e., IoT);
3) changes in system configuration (new functionalities, new devices);
4) accidental changes caused by temporary conditions (a software bug, communication issues).

A system must meet availability demands. A change in demand for cloud resources necessitates changes in those resources' configurations, which means scaling them. Resources can be scaled up or out. For example, a virtual machine can be scaled up by increasing its CPU parameters, or it can be scaled out by provisioning another copy of the given VM. Depending on the cloud provider's pricing plan, either scaling up or scaling out can be more cost-effective while providing the same computing power. Scaling takes time, so it must be performed before it is needed, which requires resource usage prediction.

To meet the above requirements, we have developed a solution which performs prediction and monitoring. It consists of a Prediction module, a Monitoring module and a Database to store predicted data (Fig. 2). We designed it (Fig. 3) to periodically (every week) gather historical usage data from the last month for each resource which needs to be tailored. This task is done by the Prediction module. In the next step, the solution filters out anomalies to improve prediction quality.
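The overall control flow just described, together with the configuration step detailed next, can be summarized in a short outline. The following Python sketch is illustrative only: the helper functions are hypothetical stand-ins for the Prediction and Monitoring modules and are left as stubs, not names from the implementation described here.

"""Minimal sketch of the closed optimization loop; all helper names are illustrative."""
from datetime import timedelta

def gather_usage(resource, window):
    """Stub: fetch hourly usage history for `resource` (e.g., via the provider's API)."""
    raise NotImplementedError

def filter_anomalies(history):
    """Stub: remove temporary usage spikes before prediction."""
    raise NotImplementedError

def predict_usage(history, horizon):
    """Stub: ML regression producing an hourly usage forecast."""
    raise NotImplementedError

def calculate_configuration(forecast, price_list):
    """Stub: Integer-PSO search for the cheapest configuration covering the forecast."""
    raise NotImplementedError

def prediction_step(resources, price_list, plan_db):
    # Runs weekly: rebuild each resource's plan from the last month of usage.
    for resource in resources:
        history = gather_usage(resource, window=timedelta(days=30))
        forecast = predict_usage(filter_anomalies(history), horizon=timedelta(weeks=1))
        plan_db[resource.name] = calculate_configuration(forecast, price_list)  # WriteToDB()

def monitoring_step(resources, plan_db, scale):
    # Runs hourly: apply the stored plan whenever the live configuration drifts from it.
    for resource in resources:
        planned = plan_db.get(resource.name)
        if planned is not None and planned != resource.current_configuration:
            scale(resource, planned)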
Next, the module chooses a cost-optimal configuration (CalculateConfiguration()) using a particle swarm optimization algorithm. Based on the solution described by A. S. Ajeena Beegom et al. [38], we defined our own version of the Integer-PSO algorithm which is suited to our needs. Given the predicted required level of resources $L = [L_1, \ldots, L_m]$ (e.g., CPU core count or RAM amount) and $n$ different component configuration types (e.g., compute-optimized, memory-optimized, general-purpose) from the cloud provider, $[T_1, \ldots, T_n]$, our problem is to find a set of configurations $Q$ which will meet the $L$ constraint and will be cost-efficient at the same time. $Q = [z_1, \ldots, z_n]$ defines how many instances of every configuration type should be used. As an example, we can take virtual machines with CPU core count ($L_1$) and RAM amount ($L_2$) as the resources examined, along with the predicted required level $L = [7, 16]$, which means 7 CPU cores and 16 GB of RAM. A sample cloud provider offers 3 different machine types:

T1: 4 CPU cores, 1 GB of RAM, €12.00/month;
T2: 2 CPU cores, 8 GB of RAM, €14.00/month;
T3: 2 CPU cores, 2 GB of RAM, €10.00/month.

In this case $Q$, which meets the $L$ constraint, can be defined as $[1, 2, 0]$. It means one virtual machine of type $T_1$ and 2 machines of type $T_2$. The maximum value $k$ for $z_i$ ($i \in (1, \ldots, n)$) which has to be taken into consideration while finding $Q$ can be defined as the number of the least powerful configurations needed to meet the $L$ level. Adding more resources would be more expensive and is not necessary, as $L$ is already met. Following the above example, $k = 16$, as 16 virtual machines of type $T_1$ fulfill the $L$ requirement in terms of RAM amount. $Q$ is defined as

$$Q = [z_1, \ldots, z_n], \qquad (1)$$

where $\forall i \in (1, \ldots, n):\ 0 \le z_i \le k$. The cost $C$ of such a set is defined as

$$C(Q, M) = Q \cdot M = Q \cdot \begin{bmatrix} m_1 \\ \vdots \\ m_n \end{bmatrix} = \sum_{i=1}^{n} z_i m_i, \qquad (2)$$

where $m_i$ is the price of the $T_i$ configuration type. The resource level $P$ provided by $Q$ is defined as

$$P = Q \cdot \begin{bmatrix} s_{11} & \cdots & s_{1m} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nm} \end{bmatrix} = [P_1, \ldots, P_m], \qquad (3)$$

where $P_j = \sum_{i=1}^{n} z_i s_{ij}$ and $s_{ij}$ is the $j$-th resource level provided by the $T_i$ configuration type. In the example defined before, the cost is calculated as

$$C = [1, 2, 0] \cdot \begin{bmatrix} €12 \\ €14 \\ €10 \end{bmatrix} = €40.00, \qquad (4)$$

and the resource level as

$$P = [1, 2, 0] \cdot \begin{bmatrix} 4 & 1 \\ 2 & 8 \\ 2 & 2 \end{bmatrix} = [8, 17]. \qquad (5)$$
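As a sanity check, the worked example can be reproduced numerically. The following sketch (ours, using numpy) encodes the three machine types and evaluates Equations (2), (3) and (7) for Q = [1, 2, 0]:

import numpy as np

# Rows = configuration types T1..T3; columns = resource levels (CPU cores, RAM in GB).
S = np.array([[4, 1],
              [2, 8],
              [2, 2]])
M = np.array([12.0, 14.0, 10.0])   # monthly price of T1..T3 in EUR
L = np.array([7, 16])              # required level: 7 CPU cores, 16 GB of RAM

Q = np.array([1, 2, 0])            # one T1 instance and two T2 instances

C = Q @ M                          # Equation (2): total cost
P = Q @ S                          # Equation (3): provided resource level

print(C)               # 40.0
print(P)               # [ 8 17]
print(np.all(P >= L))  # True: Q satisfies the L constraint of Equation (7)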
The cost definition $D$ for the minimization algorithm is as follows:

$$D(C, P, L) = \begin{cases} C & \text{if } P \succeq L \\ \infty & \text{otherwise,} \end{cases} \qquad (6)$$

where $P \succeq L$ is defined as

$$P \succeq L \iff \forall i \in (1, \ldots, m):\ P_i \ge L_i. \qquad (7)$$

In the example, $D = C = €40.00$, as $8 \ge 7$ and $17 \ge 16$.

The $Q$ with the minimal cost can be found using the cost function $D$ from Equation (6) and the Integer-PSO algorithm. As cloud providers' pricing policies are usually complex, it is impossible to define how many minima exist in the cost function, which is discrete, as a fractional component cannot be provisioned. The final stage of the original algorithm described in [38] was altered, as we are looking for multiples of available machines $[z_1, \ldots, z_n]$ rather than a task assignment configuration.

To reduce frequent configuration changes, the newly calculated configuration $Q'$ is compared to the previous configuration. If the old $Q$ still meets the $P \succeq L$ constraint and if $\forall i \in (1, \ldots, m):\ d_i < F$ (where $d_i = \frac{P_i - P_i'}{P_i}$ and $F$ is a stability factor), $Q'$ is discarded and $Q$ is used instead. $F$ determines how probable it is that the algorithm will keep the previous configuration set. Continuing the example defined previously, where $Q = [1, 2, 0]$ and $P = [8, 17]$, we can take a new predicted required level $L' = [4, 15]$ and a new set $Q' = [0, 2, 0]$ with $P' = [4, 16]$, and we can define the stability factor as $F = 0.4$. For the CPU count, $d_1 = \frac{8 - 4}{8} = 0.5$; for the RAM amount, $d_2 = \frac{17 - 16}{17} \approx 0.06$. In this case, $d_i < F$ is not met for the CPU count ($i = 1$) and the new value $Q'$ will be used. Each time the old configuration is used, $F$ is decremented; when $Q'$ is used, $F$ is reset to its initial value. The final results are stored in the database (WriteToDB()) and are later used by the Monitoring module.
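The penalized cost function of Equation (6) and the stability check just described can be sketched as follows; the function names are ours, for illustration, and the running example is reproduced at the end:

import numpy as np

def cost(Q, M, S, L):
    """Equation (6): configuration price if the provided level covers L, infinity otherwise."""
    return Q @ M if np.all(Q @ S >= L) else np.inf

def keep_previous(Q_old, Q_new, S, L, F):
    """Keep the old configuration while it still covers L and every
    relative level drop d_i stays below the stability factor F."""
    P_old, P_new = Q_old @ S, Q_new @ S
    if not np.all(P_old >= L):
        return False
    d = (P_old - P_new) / P_old
    return np.all(d < F)

# Running example: Q = [1, 2, 0] with P = [8, 17], new level L' = [4, 15].
S = np.array([[4, 1], [2, 8], [2, 2]])
Q_old, Q_new = np.array([1, 2, 0]), np.array([0, 2, 0])
print(keep_previous(Q_old, Q_new, S, np.array([4, 15]), F=0.4))
# False: d_1 = 0.5 >= 0.4, so the new configuration Q' is adopted.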
In a separate loop, the Monitoring module runs every hour. It monitors whether a given resource must be scaled according to the predicted configuration, and scales it if needed.

To estimate the quality of the predicted components' set, we use common prediction measurements: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). To compare the predicted configuration with the real usage history, we defined the $R$ metric, which is the mean of overusage errors. For the given predicted usage during hours $t_1$ to $t_m$, $R$ is defined as

$$R = \frac{\sum_{t=1}^{m} E_t}{m}, \qquad (8)$$

where $E_t$ is the prediction error for the hour $t$, defined as

$$E_t = (u_t - p_t) \cdot H(u_t - p_t), \qquad (9)$$

where $H$ is the discrete Heaviside step function

$$H(n) = \begin{cases} 0, & n < 0 \\ 1, & n \ge 0, \end{cases} \qquad (10)$$

$p_t$ is the calculated level for hour $t$ and $u_t$ is the actual resource usage level for hour $t$.
In the end, we measure the average cost savings per hour, $V$. For a given resource and its given predicted usage during hours $t_1$ to $t_m$, $V$ is defined as

$$V = \frac{\sum_{t=1}^{m} (G_t - C_t)}{m}, \qquad (11)$$

where $G_t$ is the cost of the configuration without optimization during the hour $t$ and $C_t$ is the cost of the predicted configuration during the hour $t$. Both are expressed in the cloud provider's currency.
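Both quality metrics translate directly into array operations. A minimal numpy sketch (our illustration; u and p are the hourly actual and calculated levels, g and c the hourly costs without and with optimization):

import numpy as np

def overusage_error(u, p):
    """R, Equations (8)-(10): mean error over hours where actual usage u_t
    exceeded the calculated level p_t; H(0) = 1 as in Equation (10)."""
    e = (u - p) * np.heaviside(u - p, 1.0)
    return e.mean()

def mean_hourly_savings(g, c):
    """V, Equation (11): average hourly difference between the unoptimized
    cost g_t and the cost c_t of the predicted configuration."""
    return (g - c).mean()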
The system defined above, which uses machine learning combined with anomaly detection along with the PSO algorithm, calculates the optimal cloud resource configuration. As a result, a resource usage cost reduction is achieved.

4 EVALUATION

Based on the concept from the previous section, we have developed an optimization system which uses Azure cloud computing. The efficiency of our system was proved during tests with cloud simulators; using a simulator reduces testing time and improves testing elasticity, as described in [41]. Azure (Microsoft's cloud service) exposes an API which gives access to a component's historical usage and makes it possible to get and set a component's parameters. The Azure API also exposes the current pricing plan. Since this is convenient, we focus on the Azure cloud only, especially on virtual machines (IaaS), App Services (PaaS) and Azure SQL (SaaS). Virtual machines and App Services can be scaled in terms of Azure Compute Units (ACUs), which represent unified compute (CPU) performance. The available RAM can be scaled for an App Service, and the maximum level of input/output operations per second (IOPS) can be scaled for a virtual machine. SQL databases can be scaled in terms of storage size and available Database Transaction Units (DTUs), which are a blend of used memory, CPU power and IOPS level. Nevertheless, our solution is suitable for any cloud provider and any cloud resources which can be scaled.

To take advantage of the Azure environment, we selected Microsoft Azure Machine Learning Studio as our main prediction engine. Machine Learning Studio offers ready-to-use data processing and ML components. It also allows custom functions written in the R and Python languages.

Fig. 5. Architecture of TMS – a real-life system used as the test data provider for our simulations.

We tested our solution using real-life data from a working system called Terminal Management System (TMS), which is a cloud-based manager of Internet of Things devices [42]. TMS enables credit card payments in vending machines and kiosks. It consists of many endpoint devices which connect to the central server. The central server processes payment transactions and allows operators to configure and maintain end-point devices. The central server consists of microservices (which deal with payment) and virtual machines (which host management/reporting webpages). Both the microservices and the virtual machines connect to the SQL database (Fig. 5).

Payment devices connect to the Payment service during the credit card payment process. These devices are located in Asia, Europe and America and are used mostly in unattended vending machines. This causes daily variations in resource demand. Web browsers connect to the Management webpage when the operator changes configurations or generates reports. Also, devices connect to the Management webpage to report their status and check for configuration changes. The main load comes from the devices, which are configured to connect periodically. Therefore, there is no visible resource demand variation pattern. The database is used by both the Payment service and the webpage, and thus the resource demand variations visible in the payment module are also present in database usage to a certain extent. As TMS consists of components with different usage characteristics, we can test our idea in different test conditions.

We set up a test environment (Fig. 6) that allowed us to perform time-compressed tests. Instead of using the real production system (TMS), we created mock components: the Payment service, the Management web page and the Database, which were used as inputs for our solution. Data collected from the production TMS (10 months in total) were stored in a separate database for test purposes. The entire TMS system was monitored and all types of components (PaaS, IaaS and SaaS) were taken into account; ACU, RAM, IOPS, DTU and storage usage were used in tests. We implemented four different prediction types: Bayesian Linear (BL), Decision Forest Regression (DF), Boosted Decision Tree Regression (BDT) and Neural Network Regression (NN).
The Bayesian approach uses linear regression enhanced by information in the form of a probability distribution. Statistical analysis is undertaken, prior knowledge about model parameters is merged with a likelihood function, and posterior estimates for the parameters are generated [43]. Decision trees are models which execute a sequence of data analyses until a decision is achieved. The Decision Forest Regression model consists of multiple decision trees. Each tree creates a prediction (a Gaussian distribution) which is compared to the combined distribution for all trees in the model [44]. Boosted Decision Tree Regression uses the MART gradient boosting algorithm, which gradually builds a series of decision trees. The optimal tree is selected using an arbitrary loss function [45]. Neural Network Regression uses a neural network as a model. This type of regression is suitable for difficult problems where other regression models cannot fit a solution [46].

Each prediction type operates in the "Tune Model Hyperparameters"7 self-tune mode, which means that prediction algorithm parameters are picked automatically. Prediction was performed for every type of machine learning, thus we were able to compare the results. As initial values for self-tuning, the following default configurations were used:

1) BL
   Regularization weight = 1
   Tune Model Hyperparameters maximum number of runs = 15
2) DF
   Re-sampling method = Bagging
   Number of decision trees = 8
   Maximum depth of the decision trees = 32
   Number of random splits per node = 128
   Minimum number of samples per leaf node = 1
   Tune Model Hyperparameters maximum number of runs = 5
3) BDT
   Maximum number of leaves per tree = 20
   Minimum number of samples per leaf node = 10
   Learning rate = 0.2
   Total number of trees constructed = 100
   Tune Model Hyperparameters maximum number of runs = 5
4) NN
   Hidden layers = 1, fully connected
   Number of hidden nodes = 100
   Learning rate = 0.02
   Number of iterations = 80
   Initial learning weights diameter = 0.1
   Momentum = 0
   Normalizer type = Do not normalize

7. Tune Model Hyperparameters – https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters
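For readers reproducing the experiment outside Machine Learning Studio, the four regressors have rough open-source counterparts. The following sketch maps the defaults above onto scikit-learn estimators; the mapping is our approximation, as the Studio modules and scikit-learn do not share parameter semantics (e.g., the number of random splits per node and the learning weights diameter have no direct equivalents):

# Approximate scikit-learn counterparts of the four Studio regressors (our mapping).
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

models = {
    "BL": BayesianRidge(),                               # Bayesian linear regression
    "DF": RandomForestRegressor(n_estimators=8,          # 8 bagged decision trees
                                max_depth=32,
                                min_samples_leaf=1),
    "BDT": GradientBoostingRegressor(n_estimators=100,   # 100 boosted trees
                                     learning_rate=0.2,
                                     max_leaf_nodes=20,
                                     min_samples_leaf=10),
    "NN": MLPRegressor(hidden_layer_sizes=(100,),        # one fully connected hidden layer
                       solver="sgd", learning_rate_init=0.02,
                       momentum=0.0, max_iter=80),
}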
Integer-PSO was used with 300 particles over 500 epochs. As in [38], we set the inertia weight to 0.6 and the acceleration coefficients to 0.2. The maximum velocity was set to $0.1 \cdot n$, where $n$ is the number of available configuration options, and the minimum velocity was set accordingly with a minus sign. Accuracy was set to 3 digits.
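To make the search loop concrete, the following compact sketch implements a discrete PSO in the spirit of Integer-PSO [38], wired to the parameters just listed. It is our reconstruction for illustration rather than the authors' implementation, and it omits details such as the 3-digit accuracy handling:

import numpy as np

def integer_pso(M, S, L, k, n_particles=300, epochs=500,
                inertia=0.6, c1=0.2, c2=0.2, seed=0):
    """Discrete PSO over instance-count vectors in [0, k]^n minimizing Equation (6)."""
    rng = np.random.default_rng(seed)
    n = len(M)                 # number of configuration types
    v_max = 0.1 * n            # velocity clamp, as in the parameter setting above

    def cost(Q):
        # Equation (6): price if the provided level covers L, infinity otherwise.
        return Q @ M if np.all(Q @ S >= L) else np.inf

    X = rng.integers(0, k + 1, size=(n_particles, n)).astype(float)
    V = rng.uniform(-v_max, v_max, size=(n_particles, n))
    P_best = X.copy()
    p_cost = np.array([cost(np.rint(x).astype(int)) for x in X])
    g_best = P_best[p_cost.argmin()].copy()

    for _ in range(epochs):
        r1, r2 = rng.random((n_particles, n)), rng.random((n_particles, n))
        V = inertia * V + c1 * r1 * (P_best - X) + c2 * r2 * (g_best - X)
        V = np.clip(V, -v_max, v_max)
        X = np.clip(X + V, 0, k)                 # positions stay inside [0, k]
        costs = np.array([cost(np.rint(x).astype(int)) for x in X])
        improved = costs < p_cost
        P_best[improved], p_cost[improved] = X[improved], costs[improved]
        g_best = P_best[p_cost.argmin()].copy()
    return np.rint(g_best).astype(int)

M = np.array([12.0, 14.0, 10.0])
S = np.array([[4, 1], [2, 8], [2, 2]])
print(integer_pso(M, S, np.array([7, 16]), k=16))  # should converge to [1, 2, 0], cost 40.0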
For Payment service (PaaS) optimization, we selected ACU and RAM utilization levels as the optimization factors. For the Management web page (IaaS), ACU and IOPS were selected, and for the Database (SaaS), DTU and disk space were selected. Because we are using the Database equally for reading and writing, it is hard to scale it out (multiply its instances), so in this case we set $k = 1$ in Equation (1) to limit Integer-PSO to one instance only. We performed our simulation using 10 months of data from TMS, and we compared the results with the production configuration. We also compared the results obtained from different ML algorithms.

In total, we made 24 predictions (6 optimization factors multiplied by 4 algorithms). Each prediction consisted of more than 6,500 points (10 months with hourly resolution). For purposes of clarity, we chose one optimization factor for every component and present them separately for selected periods (Fig. 7). In fact, as defined in Equations (3), (2), and (7), all resources included in the component in question are calculated together. For every component, the chart presents a workload characteristic (Actual usage) and all prediction algorithm results. We selected periods so that they contain anomalies (Figs. 7a, 7b, 7c), visible patterns (Fig. 7b) and longer high-usage events (Fig. 7a). In addition, the negative impact of previous data on the prediction process can be observed (Fig. 7b) where, especially in the beginning, the usage level predicted is clearly lower than the actual one. Nevertheless, even this error has no negative impact on the real system, as we aim to keep resource usage at 70 percent, which gives us a 30 percent safety margin.

To clearly visualize the optimization process, we chose 3 weeks (8th to 31st May 2019), one component (SaaS), one property (DTU) and one ML algorithm (DF). Fig. 8 presents the actual usage level along with the usage level after anomaly detection (with the anomalies removed). Prediction is based on the data after anomaly detection and thus it is not distorted by temporary usage spikes (Fig. 9).

Although we are using the Integer-PSO algorithm to find the optimum component configuration, due to cloud resource granulation (the cloud provider only offers pre-configured component variants, e.g., a VM with 210 ACUs and 4,000 IOPS), the values predicted are not used exactly in the calculated configuration. In the chart (Fig. 9) we present the actual resource usage, the predicted usage and the calculated configuration based on the DF prediction. Despite this granulation, we still observe a significant reduction in resource costs (Fig. 10). In the TMS system, the cost of SaaS in May 2019 equals €392 and the optimization achieved by our system reduces the cost to €23. For the entire period tested, PaaS costs were reduced by 88 percent, which results in savings of €4,268.

Our tests demonstrate that in the case of anomalous behavior (sudden high resource usage), the calculated configuration does not cover 100 percent of resource demand and the cloud provider resorts to throttling. This slows down the processing of incoming requests or, in cases of prolonged high-level usage, results in a timeout response to the client (we did not observe such long-lasting anomalies). Nevertheless, the TMS system is designed to handle such situations, as timeouts are often caused by poor network conditions at the endpoint side (in this case, the credit card payment terminal) anyway.

During tests conducted for data between 8th and 31st May 2019, we observed a reduction in resource usage cost not only for SaaS (as presented here), but also for the other component types: IaaS and PaaS. Although we did not find similar solutions or test data to compare with our system, we used the Azure Autoscale mechanism described in Section 2 as the point of reference. Although Azure Autoscale performs only a horizontal (quantity) optimization for IaaS and PaaS, we chose it for its out-of-the-box availability. As vertical (quality) optimization is not available, we used the cheapest possible components (Autoscale is not enabled for low-price PaaS) to ensure the most detailed scaling.
Fig. 7. Comparison of predictions made with different ML algorithms for different resources.
In Table 1, we present the financial savings along with the common prediction quality metrics described in Section 3 and the R (mean of overusage errors) and V (cost savings per hour) parameters calculated according to Equations (8) and (11). The high savings percentage compared to the original value is caused by the considerable resource over-provisioning in the TMS system, due to the tendency described in Section 1. Our anomaly detection solution makes this over-provisioning unnecessary. Azure Autoscale reduced the cost to some degree, but it was still only half as efficient as our solution. Additionally, being a reactive system, it introduced a performance decline. For IaaS, where dynamic resource demand was observed, the mean response time was 2.5 times longer and the variance of the response time was 7 times higher when compared to our solution. This led to longer periods of availability issues. On the other hand, for PaaS, where the resource demand was stable, both the response time and the variance were similar to our system. Azure Autoscale was not able to optimize SaaS resources, and PaaS and IaaS were optimized only in one dimension; the final result was more expensive and, in the case of IaaS, performance was much lower.
Fig. 8. Comparison of the actual and anomaly-filtered DTU usage level (SaaS, May 2019).

Fig. 9. Comparison of the actual DTU usage level, its DF prediction and the calculated configuration (SaaS, May 2019).

Fig. 10. Comparison of the cost of the database with and without the optimization (SaaS, May 2019).

Optimization introduces quality degradation when compared to the original system. During our test period (from 8th to 31st May 2019), in which the original system was highly over-provisioned, we observed a mean response time that was 4 times longer and a variance that was almost 100 times higher, while the original system's response time was almost constant. However, when tested during high usage periods (from 12th to 19th September 2019), the original system's mean response time and variance were similar to the values observed after our optimization. Despite the fact that longer response times still allow the system to operate properly, optimization with quality of user experience as a parameter will be a topic of our further studies, as mentioned in Section 5.

The optimization solution runs independently from the working system which is being optimized, and thus the optimization process does not introduce any performance overhead. The Monitoring module only runs when a component change is required (usually once every couple of hours), and the Prediction module runs once a week and uses Microsoft Azure Machine Learning Studio. All these operations fit in the free plans offered by Azure.
5 CONCLUSION AND FURTHER WORK

In this work, we present a solution for optimizing cloud resource costs. Our approach operates autonomously, in a closed-loop configuration, without any need for external tuning. We used real-world data from a production system. Tests show that the calculated savings are significant and that our system works properly, minimizing cloud resource usage and cost. A comparison between current system costs and those after optimization demonstrates that during the 10 months covered by the tests, the solution, if implemented in the working system, would have resulted in savings of €6,128, which translates to an 85 percent cost reduction.

Our solution aims to reduce the cost of using cloud resources by predicting future demand for resources and adjusting the provisioned resources accordingly. Therefore, any cloud-based system which uses scalable resources (i.e., IaaS, PaaS or SaaS) can be optimized using our solution. Optimization is performed at the resource allocation level, and knowledge of the internal structure of the system being optimized is not required; however, any performance improvements in this system will be captured by our solution and fewer resources will be provisioned in the future. Since we are using prediction techniques, the greatest cost reduction will be observed for systems with usage patterns that are complicated, hard to define and varying over time; these patterns will be determined by machine learning algorithms. Scaling resources is simpler when client-server communication is stateless, as every call can be directed to the appropriate resource independently; nevertheless, cloud providers also offer scaling of stateful communications. Our solution is compatible with many cloud-based system types, e.g., IoT hubs or Enterprise Resources Planning services in the form of web services, payment gateways that process online transactions, e-commerce solutions, as well as web information portals and social networks.
TABLE 1. Savings and Quality Metrics for the Best Algorithms (May 2019)
Time-compressed tests demonstrate that the efficiency of our solution improves over time. This is why, if historical data are available, the solution can be trained in advance to boost efficiency from the start. This topic may be the subject of our further studies. Currently, we are monitoring and storing over 100 parameters of the production system (TMS). In the future, we would like to incorporate quality of user experience criteria in our resource prediction process, which may result in better resource usage optimization and quicker system response times.

ACKNOWLEDGMENT

The research presented in this article was supported by funds from the Polish Ministry of Science and Higher Education assigned to the AGH University of Science and Technology.

REFERENCES

[1] A. S. Andrae and T. Edler, "On global electricity usage of communication technology: Trends to 2030," Challenges, vol. 6, no. 1, pp. 117–157, 2015.
[2] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2011, pp. 1–12.
[3] J. Yang, W. Xiao, C. Jiang, M. S. Hossain, G. Muhammad, and S. U. Amin, "AI-powered green cloud and data center," IEEE Access, vol. 7, pp. 4195–4203, 2019.
[4] S. Abrishami, "Deadline-constrained workflow scheduling algorithms for infrastructure as a service clouds," Future Gener. Comput. Syst., vol. 29, pp. 158–169, 2013.
[5] S. Memeti, S. Pllana, A. Binotto, J. Kolodziej, and I. Brandic, "A review of machine learning and meta-heuristic methods for scheduling parallel computing systems," in Proc. Int. Conf. Learn. Optim. Algorithms: Theory Appl., 2018, pp. 5:1–5:6.
[6] Y. Zhang, J. Yao, and H. Guan, "Intelligent cloud resource management with deep reinforcement learning," IEEE Cloud Comput., vol. 4, no. 6, pp. 60–69, Nov./Dec. 2017.
[7] M. H. Hilman, M. A. Rodriguez, and R. Buyya, "Task runtime prediction in scientific workflows using an online incremental learning approach," in Proc. IEEE/ACM 11th Int. Conf. Utility Cloud Comput., 2018, pp. 93–102.
[8] Y. Yu, V. Jindal, F. Bastani, F. Li, and I. Yen, "Improving the smartness of cloud management via machine learning based workload prediction," in Proc. IEEE 42nd Annu. Comput. Softw. Appl. Conf., 2018, pp. 38–44.
[9] R. Yang, X. Ouyang, Y. Chen, P. Townend, and J. Xu, "Intelligent resource scheduling at scale: A machine learning perspective," in Proc. IEEE Symp. Service-Oriented Syst. Eng., 2018, pp. 132–141.
[10] C.-C. Crecana and F. Pop, "Monitoring-based auto-scalability across hybrid clouds," in Proc. 33rd Annu. ACM Symp. Appl. Comput., 2018, pp. 1087–1094.
[11] T. Mehmood, S. Latif, and S. Malik, "Prediction of cloud computing resource utilization," in Proc. 15th Int. Conf. Smart Cities: Improving Qual. Life Using ICT IoT, 2018, pp. 38–42.
[12] I. K. Kim, W. Wang, Y. Qi, and M. Humphrey, "CloudInsight: Utilizing a council of experts to predict future cloud application workloads," in Proc. IEEE 11th Int. Conf. Cloud Comput., 2018, pp. 41–48.
[13] B. Sniezynski, P. Nawrocki, M. Wilk, M. Jarzab, and K. Zielinski, "VM reservation plan adaptation using machine learning in cloud computing," J. Grid Comput., vol. 17, pp. 797–812, Jul. 2019.
[14] S. Chen, Y. Shen, and Y. Zhu, "Modeling conceptual characteristics of virtual machines for CPU utilization prediction," in Proc. Int. Conf. Conceptual Model., 2018, pp. 319–333.
[15] M. Ghobaei-Arani, S. Jabbehdari, and M. A. Pourmina, "An autonomic resource provisioning approach for service-based cloud applications: A hybrid approach," Future Gener. Comput. Syst., vol. 78, pp. 191–210, 2018.
[16] Q. Zhang, L. T. Yang, Z. Yan, Z. Chen, and P. Li, "An efficient deep learning model to predict cloud workload for industry informatics," IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3170–3178, Jul. 2018.
[17] A. Abdelaziz, M. Elhoseny, A. S. Salama, and A. Riad, "A machine learning model for improving healthcare services on cloud computing environment," Measurement, vol. 119, pp. 117–128, 2018.
[18] J. Kumar and A. K. Singh, "Workload prediction in cloud using artificial neural network and adaptive differential evolution," Future Gener. Comput. Syst., vol. 81, pp. 41–52, 2018.
[19] J. N. Witanto, H. Lim, and M. Atiquzzaman, "Adaptive selection of dynamic VM consolidation algorithm using neural network for cloud resource management," Future Gener. Comput. Syst., vol. 87, pp. 35–42, 2018.
[20] K. Mason, M. Duggan, E. Barrett, J. Duggan, and E. Howley, "Predicting host CPU utilization in the cloud using evolutionary neural networks," Future Gener. Comput. Syst., vol. 86, pp. 162–173, 2018.
[21] A. M. Al-Faifi, B. Song, M. M. Hassan, A. Alamri, and A. Gumaei, "Performance prediction model for cloud service selection from smart data," Future Gener. Comput. Syst., vol. 85, pp. 97–106, 2018.
[22] A. A. Rahmanian, M. Ghobaei-Arani, and S. Tofighy, "A learning automata-based ensemble resource usage prediction algorithm for cloud computing environment," Future Gener. Comput. Syst., vol. 79, pp. 54–71, 2018.
[23] M. Ranjbari and J. A. Torkestani, "A learning automata-based algorithm for energy and SLA efficient consolidation of virtual machines in cloud data centers," J. Parallel Distrib. Comput., vol. 113, pp. 55–62, 2018.
[24] G. Kaur, A. Bala, and I. Chana, "An intelligent regressive ensemble approach for predicting resource usage in cloud computing," J. Parallel Distrib. Comput., vol. 123, pp. 1–12, 2019.
[25] X. Chen, J. Lin, B. Lin, T. Xiang, Y. Zhang, and G. Huang, "Self-learning and self-adaptive resource allocation for cloud-based software services," Concurrency Comput., Practice Experience, vol. 31, 2018, Art. no. e4463.
[26] C. Qu, R. N. Calheiros, and R. Buyya, "Auto-scaling web applications in clouds: A taxonomy and survey," ACM Comput. Surv., vol. 51, no. 4, pp. 73:1–73:33, Jul. 2018.
[27] Y. Al-Dhuraibi, F. Paraiso, N. Djarallah, and P. Merle, "Elasticity in cloud computing: State-of-the-art and research challenges," IEEE Trans. Services Comput., vol. 11, no. 2, pp. 430–447, Mar./Apr. 2018.
[28] H. M. Makrani, H. Sayadi, D. Motwani, H. Wang, S. Rafatirad, and H. Homayoun, "Energy-aware and machine learning-based resource provisioning of in-memory analytics on cloud," in Proc. ACM Symp. Cloud Comput., 2018, pp. 517–517.
[29] D. Minarolli and B. Freisleben, "Virtual machine resource allocation in cloud computing via multi-agent fuzzy control," in Proc. Int. Conf. Cloud Green Comput., 2013, pp. 188–194.
[30] A. Singh, D. Juneja, and M. Malhotra, "A novel agent based autonomous and service composition framework for cost optimization of resource provisioning in cloud computing," J. King Saud Univ. Comput. Inf. Sci., vol. 29, no. 1, pp. 19–28, 2017.
[31] G. Wei, A. V. Vasilakos, Y. Zheng, and N. Xiong, "A game-theoretic method of fair resource allocation for cloud computing services," J. Supercomput., vol. 54, no. 2, pp. 252–269, Nov. 2010.
[32] M. Ficco, C. Esposito, F. Palmieri, and A. Castiglione, "A coral-reefs and game theory-based approach for optimizing elastic cloud resource allocation," Future Gener. Comput. Syst., vol. 78, pp. 343–352, 2018.
[33] S. Sotiriadis, N. Bessis, and R. Buyya, "Self managed virtual machine scheduling in cloud systems," Inf. Sci., vol. 433/434, pp. 381–400, 2018.
[34] D. Gudu, M. Hardt, and A. Streit, "Combinatorial auction algorithm selection for cloud resource allocation using machine learning," in Proc. Eur. Conf. Parallel Process., 2018, pp. 378–391.
[35] S. A. Tafsiri and S. Yousefi, "Combinatorial double auction-based resource allocation mechanism in cloud computing market," J. Syst. Softw., vol. 137, pp. 322–334, 2018.
[36] J. Zhang, N. Xie, K. Yue, W. Li, and D. Kumar, "Machine learning based resource allocation of cloud computing in auction," Comput. Mater. Continua, vol. 56, pp. 123–135, Jan. 2018.
[37] Z. Zhong, K. Chen, X. Zhai, and S. Zhou, "Virtual machine-based task scheduling algorithm in a cloud computing environment," Tsinghua Sci. Technol., vol. 21, no. 6, pp. 660–667, Dec. 2016.
[38] A. S. Ajeena Beegom and M. S. Rajasree, "Integer-PSO: A discrete PSO algorithm for task scheduling in cloud computing systems," Evol. Intell., vol. 12, pp. 227–239, Feb. 2019.
[39] N. K. Gondhi and A. Gupta, "Survey on machine learning based scheduling in cloud computing," in Proc. Int. Conf. Intell. Syst. Metaheuristics Swarm Intell., 2017, pp. 57–61.
[40] G. Cherubin, A. Baldwin, and J. Griffin, "Exchangeability martingales for selecting features in anomaly detection," in Proc. 7th Symp. Conformal Probabilistic Prediction Appl., 2018, pp. 157–170.
[41] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano, "A review of auto-scaling techniques for elastic applications in cloud environments," J. Grid Comput., vol. 12, pp. 559–592, 2014.
[42] A. Botta, W. de Donato, V. Persico, and A. Pescape, "On the integration of cloud computing and Internet of Things," in Proc. Int. Conf. Future Internet Things Cloud, 2014, pp. 23–30.
[43] C. Bishop and M. Tipping, "Bayesian regression and classification," in Advances in Learning Theory: Methods, Models and Applications, J. Suykens, I. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam, The Netherlands: IOS Press, 2003, pp. 267–285.
[44] A. Criminisi, J. Shotton, and E. Konukoglu, "Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning," Found. Trends Comput. Graph. Vis., vol. 7, no. 2/3, pp. 81–227, Feb. 2012.
[45] C. J. Burges, "From RankNet to LambdaRank to LambdaMART: An overview," Microsoft, Redmond, WA, USA, Tech. Rep. MSR-TR-2010-82, Jun. 2010.
[46] C. M. Bishop, "Neural networks: A pattern recognition perspective," Aston Univ., Birmingham, U.K., Tech. Rep. NCRG/96/001, Jan. 1996.

Patryk Osypanka received the MSc degree, and is currently working toward the doctoral degree with the Department of Computer Science, AGH University of Science and Technology, Krakow, Poland. He works professionally with ASEC S.A. as a software development team leader, mainly using Microsoft technologies (.NET, Azure). His research focuses on cloud computing.

Piotr Nawrocki received the PhD degree. He is an associate professor with the Department of Computer Science, AGH University of Science and Technology, Krakow, Poland. His research interests include distributed systems, computer networks, mobile systems, cloud computing, Internet of Things, and service-oriented architectures. He has participated in several EU research projects including MECCANO, 6WINIT, UniversAAL and national projects including IT-SOA and ISMOP. He is a member of the Polish Information Processing Society (PTI).