
entropy

Article
Short Term Electrical Load Forecasting Using Mutual Information Based Feature Selection with Generalized Minimum-Redundancy and Maximum-Relevance Criteria
Nantian Huang *, Zhiqiang Hu, Guowei Cai and Dongfeng Yang
School of Electrical Engineering, Northeast Dianli University, Jilin 132012, China;
[email protected] (Z.H.); [email protected] (G.C.); [email protected] (D.Y.)
* Correspondence: [email protected]; Tel.: +86-432-6480-6691

Academic Editor: Raúl Alcaraz Martínez


Received: 21 July 2016; Accepted: 5 September 2016; Published: 8 September 2016

Abstract: A feature selection method based on the generalized minimum redundancy and maximum
relevance (G-mRMR) is proposed to improve the accuracy of short-term load forecasting (STLF).
First, mutual information is calculated to analyze the relations between the original features and the
load sequence, as well as the redundancy among the original features. Second, a weighting factor
selected by statistical experiments is used to balance the relevance and redundancy of features when
using the G-mRMR. Third, each feature is ranked in a descending order according to its relevance
and redundancy as computed by G-mRMR. A sequential forward selection method is utilized for
choosing the optimal subset. Finally, an STLF predictor is constructed based on random forest with the
obtained optimal subset. The effectiveness and improvement of the proposed method were tested
with actual load data.

Keywords: short term load forecasting; generalized minimum redundancy and maximum relevance;
random forest; sequential forward selection

1. Introduction
Short-term load forecasting (STLF) predicts future electric loads over a horizon ranging from one hour
up to several days. The primary targets of smart grids, such as reducing the difference between peak
and valley electric loads, absorbing large-scale renewable energy, enabling demand side response,
and optimizing the economic operation of the power grid, all require accurate STLF results [1].
In addition, with the development of competitive electricity markets, an accurate STLF is an important
basis for drafting a reasonable electricity price and improving the stability of electricity market
operation [2].
The existing STLF methods can be divided into traditional methods and artificial intelligence
methods. Among the traditional methods, autoregressive integrated moving average (ARIMA) [3],
regression analysis [4], Kalman filtering [5], and exponential smoothing [6] are commonly used.
The combination of autoregressive and moving average terms makes ARIMA a good time series model for
STLF [7]. Based on the historical time-varying load data, an ARIMA model is established and applied
for predicting the forthcoming electrical load. Regression analysis uses historical data to establish
simple but highly efficient regression models [8]. The Kalman filter improves the accuracy of STLF
by estimating each component of the load, which is apportioned into random and fixed components.
Exponential smoothing eliminates the noise in the load time series, and the degree to which the future
load is influenced by recent load data can be reflected by adjusting the weights of the data, which is
helpful for improving the accuracy of STLF [9]. Overall, the traditional STLF methods can analyze
the linear relationships between input and output, but not the nonlinear relationships [10]. If the load
presents large fluctuations caused by environmental factors, the traditional methods may provide
inaccurate forecasts.
In recent years, predictors based on artificial intelligence algorithms have been widely used in the STLF
of power systems [10–17]. Approaches such as fuzzy logic [14], expert systems [16,17], artificial neural
networks (ANNs) [18,19], and support vector machines (SVMs) [20,21] are currently used in STLF.
Fuzzy logic methods divide the input and the output into different kinds of membership functions,
and the relationship between input and output is then established by a set of fuzzy rules for fuzzy
systems for STLF [22]. However, fuzzy systems with single if-then rules lack the self-learning and
adaptive ability to learn the input information effectively. An ANN acquires the complicated
non-linear relationship between input and output variables by learning the training samples. However,
there is no scientific way of acquiring the optimal network architecture when establishing an ANN
model. In addition, it also encounters the problems of falling into local optima and over-fitting [15,23].
SVMs overcome these deficiencies of ANNs by solving a quadratic programming problem to acquire
the global optimal solution, and compared to an ANN the SVM has many advantages. However, the
SVM parameters, such as the type and variance of the kernel function and the penalty factor, are
selected empirically. To achieve the optimal parameters, SVMs combined with genetic and particle
swarm optimization algorithms have been utilized [24,25]. The random forest (RF) is a combination of
classification and regression trees (CARTs) and the bagging learning method. By randomly sampling
from the training samples and randomly selecting the features for splitting each node, the RF resists
noise and is largely free from over-fitting problems [26]. Furthermore, in actual practice, only two
parameters (the number of trees and the number of features for node splitting) need to be set
when RF is applied for STLF [15], making RF highly suitable for STLF.
Considering the effect of various factors, artificial intelligence methods analyze the complicated
nonlinear relationships between power load and related factors to achieve higher precision of
prediction. However, the features that the predictor employs will influence the accuracy and efficiency
of STLF. Therefore, a feature selection schedule should be generated for choosing the optimal feature
subset for a predictor. The common features, including historical load, time, and meteorology, are used
for STLF modeling [11,27,28]. Historical load can reflect the variation of the load accurately, as it
contains plenty of information. The features of time, such as the hour point, day of week, and on/off work day,
can also indirectly show the load pattern. In addition, a short-term power load is mainly affected by
the changing weather conditions which have a strong correlation with load demand. The accurate
meteorological information of the numerical weather prediction (NWP) can improve the accuracy of
STLF effectively. Consequently, NWP errors will reduce the accuracy of STLF [29].
A feature selection is a process of choosing the most effective features from an original feature
set. The optimal feature subset extracted from a given feature set can improve the efficiency and
accuracy of the predictor in STLF [30]. The manner of selecting features has therefore become a hot
topic in short-term load forecasting research. Reference [31] adopted conditional mutual information
for feature selection: the mutual information values between features and load were measured,
the features were ranked by their values, and the first 50 features were kept as a threshold for
filtering out the irrelevant and weakly relevant features. Reference [10] constructed an original
feature set by using phase space reconstruction theory; the correlation between features and load
was analyzed to discover the optimal feature subset. In reference [29], mutual information was
applied to extract the effective features from the weather features, and historical load data
features were also extracted to improve the accuracy of holiday load forecasting. Reference [32] used
a memetic algorithm to extract a proper feature subset from an original feature set for medium-term
load forecasting. Reference [33] analyzed the daily and weekly patterns by autocorrelation function
and chose 50 features as the best features for very short-term load forecasting. Mutual
information-based feature selection was used in reference [23]: by calculating the mutual information
values between feature vectors and the target variable, a lower boundary criterion was defined
to filter the features, and the optimal feature subset with the best features was achieved for STLF. All of these
studies [10,23,29,31–33] made important contributions to feature selection in STLF. However,
these feature selection methods only analyzed the correlation between features and load;
the redundancy among the features was not considered.
To improve the accuracy of STLF, a mutual information-based generalized minimum redundancy
and maximum relevance (G-mRMR) feature selection method combined with RF is proposed for STLF.
First, an original feature set is formed by extracting historical load features and time features from
the original load data. Second, G-mRMR is used to generate the candidate features, which are ranked
in a descending order. Third, the sequential forward selection (SFS) method and a decision criterion
based on the mean absolute percentage error (MAPE) are utilized to obtain the optimal feature subset
by adding one feature at a time to the input feature set of RF. Finally, the RF-based predictor is
constructed with the optimal feature subset to achieve the optimal predictor. The proposed method is
validated through STLF experiments using actual load data from a city in Northeast China.
The experimental results are compared with different feature selection methods and predictors.

2. Methodology

2.1. Mutual Information-Based Generalized Minimal-Redundancy and Maximal-Relevance


The minimum-redundancy and maximum-relevance (mRMR) criterion uses mutual
information (MI) to measure the dependence between two variables. The MI-based mRMR not only
considers the effective information between a feature and the target variable, but also accounts for the
repetitive information among features [34]. It has the advantage of obtaining helpful features accurately
when dealing with high dimensional data.
Given two random variables X and Y, the MI between them can be estimated as:

I(X, Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)} \quad (1)

where P(x) and P(y) are the marginal density functions, and P(x, y) is the joint probability
density function.
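As a concrete illustration of Equation (1), the following Python sketch estimates the MI of two continuous variables by discretizing them into histogram bins. The function name, the bin count, and the synthetic data are illustrative assumptions, not part of the original method.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based plug-in estimate of Equation (1).

    `bins` controls the discretization and is an illustrative choice,
    not a value prescribed by the paper.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()               # joint probability P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal P(y)
    mask = p_xy > 0                          # zero cells contribute nothing
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Example: MI between a signal and a lagged copy of itself.
rng = np.random.default_rng(0)
load = rng.normal(size=2000).cumsum()
print(mutual_information(load[:-24], load[24:]))
```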
The target of feature selection methods based on MI is to find a feature subset J with n features
which reflects the largest dependency on the target variable l from a feature set F_m with m features
(n ≪ m).
The maximum-relevance criterion uses the mean value of the MI between each feature x_i and the
target l, described as follows:

\max D(J, l), \quad D = \frac{1}{|J|} \sum_{x_i \in J} I(x_i, l) \quad (2)

The redundancy indicated by the MI value describes the overlapping information among features,
wherein a larger MI signifies more overlapping information and vice versa. In the process of feature
selection, the features selected by the maximum-relevance criterion can have high redundancy, and
redundant features, which carry information similar to previously selected features, cannot improve
the accuracy of the predictor. Therefore, the redundancy among features should also be evaluated in
the process of feature selection.
The minimum-redundancy criterion requires a minimum dependency among the features:

\min R(J), \quad R = \frac{1}{|J|^2} \sum_{x_i, x_j \in J} I(x_i, x_j) \quad (3)

The mRMR criterion, combining Equations (2) and (3), is computed as follows:

\max \psi(D, R), \quad \psi = D - R \quad (4)


Generally, an incremental search method is used to search for the optimal features [34].
Supposing there is a feature set J_{n-1} with n−1 features that has already been selected, the aim is to
select the nth feature from the rest of the set {F_m − J_{n−1}} according to Equation (4). The incremental
search method with respect to this condition is as follows:

\mathrm{mRMR}: \max_{x_j \in F_m - J_{n-1}} \left[ I(x_j, l) - \frac{1}{|J_{n-1}|} \sum_{x_i \in J_{n-1}} I(x_j, x_i) \right] \quad (5)

where |J_{n−1}| refers to the number of features in J_{n−1}.
Restructuring Equation (5) by using a weighting factor α to balance the redundancy and relevance
of the feature subset yields the generalized mRMR (G-mRMR), presented as follows [35]:

\text{G-mRMR}: \max_{x_j \in F_m - J_{n-1}} \left[ I(x_j, l) - \alpha \sum_{x_i \in J_{n-1}} I(x_j, x_i) \right] \quad (6)

2.2. Random Forest

The random forest (RF) is a kind of machine learning algorithm presented by Leo Breiman, which
integrates the classification and regression tree (CART) and the bagging algorithm [26]. An RF
generates many different CARTs by sampling with replacement, wherein each CART produces one
result. The final forecasting result is achieved by computing the average value of all CARTs' results.
2.2.1. CART

The CART employs binary recursive partitioning technology for solving classification and
regression issues [36]. A CART, which consists of a root node, non-leaf nodes, branches, and leaf
nodes, is shown in Figure 1. Each non-leaf node must be divided according to the Gini index when the
CART grows.

Figure 1. A simple CART.

Supposing there is a dataset D with d samples which includes C classes, the Gini index of D can
be defined as:

G(D) = 1 - \sum_{i=1}^{C} \left( \frac{d_i}{d} \right)^2 \quad (7)

where d_i is the number of samples of the ith class.
Afterward, the feature f is used to divide D into the subsets D_1 and D_2, wherein the Gini index
after the split is:

G_{split}(D) = \frac{d_1}{d} G(D_1) + \frac{d_2}{d} G(D_2) \quad (8)
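Equations (7) and (8) can be checked with a few lines of Python; the following is a minimal sketch, with the function names chosen here purely for illustration.

```python
import numpy as np

def gini(labels):
    """Equation (7): G(D) = 1 - sum_i (d_i / d)^2 over the C classes in D."""
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - np.sum((counts / counts.sum()) ** 2)

def gini_split(left_labels, right_labels):
    """Equation (8): size-weighted Gini index after splitting D into D1 and D2."""
    d1, d2 = len(left_labels), len(right_labels)
    d = d1 + d2
    return (d1 / d) * gini(left_labels) + (d2 / d) * gini(right_labels)

# Toy check: a pure split drives the Gini index after the split to 0.
labels = np.array([0, 0, 0, 1, 1, 1])
print(gini(labels))                        # 0.5 before splitting
print(gini_split(labels[:3], labels[3:]))  # 0.0 for the perfect split
```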

2.2.2. Bagging

The bagging is an integrated learning algorithm proposed by Leo Breiman [37]. Given a dataset
B with M features and a learning rule H, bootstrapping is carried out to generate the training sets
{B^1, B^2, ..., B^q}. The samples in dataset B may be drawn many times or not at all. A forecasting
system consisting of a group of learning rules {H^1, H^2, ..., H^q}, each of which has learned one
training set, is thus obtained. Breiman pointed out that bagging can improve the prediction accuracy
of unstable learning algorithms such as CART and ANN [37].
2.2.3. RF

The RF is a group of predictors {p(x, Θ_k), k = 1, 2, ...} composed of a number of CARTs,
where x is the input vector and {Θ_k} represents independent identically distributed random vectors.
The modeling process of RF is:

(1) k training sets are sampled with replacement from the dataset B by bootstrap.
(2) Each training set grows up to a tree according to the CART algorithm. Supposing dataset B has
M features, mtry features are randomly selected from B for each non-leaf node. Afterward,
the node is split by a feature selected from these mtry features.
(3) Each tree grows completely without pruning.
(4) The forecasting result is obtained by calculating the mean value of the predictions of all trees.

The flow chart of the RF model is illustrated in Figure 2.

Figure 2. Random Forest modeling and predicting process.
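A minimal sketch of the modeling process (1)-(4), assuming the RandomForestRegressor of scikit-learn; the random data are placeholders, and the parameter values echo the settings discussed later in the paper (nTree = 500, mtry = M/3).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative stand-ins for the paper's data; shapes only, not the real load set.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 148)), rng.normal(size=500)
X_test = rng.normal(size=(24, 148))

M = X_train.shape[1]
forest = RandomForestRegressor(
    n_estimators=500,     # nTree: number of bootstrapped CARTs
    max_features=M // 3,  # mtry: features drawn at random for each node split
    bootstrap=True,       # step (1): sample training sets with replacement
    random_state=0,
)
forest.fit(X_train, y_train)
# Steps (3)-(4): trees are grown without pruning and the forecast is the
# average of the individual tree predictions.
y_hat = forest.predict(X_test)
```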

The bagging and the random selection of features for splitting ensure the good performance of
RF, wherein:

(1) The same capacity of the training sets sampled by bootstrap guarantees that each sample in
dataset B is appraised equally. A situation in which one sample may appear many times in the
same training set while others may not appear at all causes low correlation among the trees.
(2) The manner of selecting features for node splitting applies randomness and ensures the
generalization performance of RF.

The number of features mtry and the number of trees nTree should be set when applying RF.
Generally, the suggested setting is mtry = ⌊log2(M) + 1⌋, mtry = √M, or mtry = M/3. The scale of RF is
generally selected empirically as the largest size in order to improve the diversity of the trees and
guarantee the performance of RF.

3. Data Analysis

The historical load data used in this paper is archived data from a city in Northeast China from
2005 to 2012. As shown in Figure 3a,b, the load demand from 2005 to 2012 increased rapidly with
the increase in population and the development of the local society. It is difficult to generate a highly
accurate STLF with this kind of load pattern. Figure 3c shows the correlation analysis results of the
historical load by the autocorrelation function [38]. Evidently, the autocorrelation coefficient is reduced
gradually along with the increasing hour lag. According to Figure 3c, load far from the current point
has low correlation. Only the correlation of the load data from 2011 to 2012 is above the confidence
interval of positive correlation (above the blue line). With the increase of the load, historic load with
large lag has very low correlation with the forecasting point. Therefore, the data from 2011 to 2012 are
used for further research.

Figure 3. Yearly load curve analysis: (a) Average daily load from 8 January 2005 to 31 December 2012;
(b) The population and GDP from 2005 to 2012; (c) Hourly load autocorrelation of historical load data.

Figure 4 shows the average daily load pattern occurring in different seasons. These loads have
visibly different patterns, which are caused by the varying climate.
Figure 4. Four seasons average daily load profile from December 2010 to November 2011.

By observing Figure 5, it is possible to see that the load demand presents a kind of cycling
mode with a period of 7 days. The load demand from Monday to Friday is similar, whereas on
Saturday and Sunday it is dissimilar. This pattern is due to the concurrent changing of the load level
with the varying electricity consumption behavior of people within a week.
Figure 5. Load curve from 15 to 28 August 2011.
The load at the forecasting point is highly correlated with the load at the same moment on the day
before as well as with that of the previous week. As shown in Figure 6, the load points throughout the
week at lags of 1, 24, 48, 72, 96, 120, 144, and 168 have strong relevance, where each lag corresponds to
a difference of 1 h. Furthermore, load values at other moments also show varying degrees
of dependence.

Figure 6. Autocorrelation coefficient of load with 168 lags.

The original feature set for STLF can be achieved based on the above analysis. The 168 load
variables {L_{t-168}, L_{t-167}, ..., L_{t-2}, L_{t-1}} are extracted as part of the original feature set. When doing
day-ahead load forecasting, assuming the current moment is t, the load values from the moment t−1 to t−24
are unknown. Therefore, the variables {L_{t-24}, L_{t-23}, ..., L_{t-1}} are eliminated from the original feature
set. In addition, features such as the hour of day, whether the day is a weekday or weekend, the day of
week, and the season are considered for constructing the original feature set.
Though meteorological factors affect the load demand, they are not considered in this paper because
the error of NWP influences the accuracy of STLF [29]. If needed, the meteorological features can be added
into the original feature set for feature selection in the same manner. There are 148 features in the original
feature set F, as shown in Table 1.

Table 1. The original feature set.

Feature Type          Original Feature

Exogenous features    1.F_Hour, 2.F_WW, 3.F_DW, 4.F_S
Endogenous features   5.F_L(t-25), 6.F_L(t-26), 7.F_L(t-27), 8.F_L(t-28), ...,
                      146.F_L(t-166), 147.F_L(t-167), 148.F_L(t-168)

The meaning of features in Table 1 is:


Exogenous features:
FHour means the moment of hour, which is tagged by the numbers from 1 to 24.
FWW is either weekday or weekend marked by binary numbers, wherein 0 means weekend and
1 means weekday.
FDW refers to the day of week, which is labeled by the numbers from 1 to 7.
FS denotes the season, labeled by the numbers from 1 to 4.
Endogenous features:
FL(t-25) is the load 25 h before, FL(t-26) means the load 26 h before, and so on.
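Under these definitions, the original feature set of Table 1 can be assembled from an hourly load series as in the following sketch. The pandas-based helper and the concrete season encoding are assumptions made here for illustration, since the paper only specifies the label ranges.

```python
import numpy as np
import pandas as pd

def build_feature_set(load: pd.Series) -> pd.DataFrame:
    """Assemble the 148-feature original set of Table 1 from an hourly load series.

    `load` is assumed to be indexed by a DatetimeIndex; mapping meteorological
    seasons to 1-4 is an assumption, as the paper only says F_S uses 1 to 4.
    """
    f = pd.DataFrame(index=load.index)
    f["F_Hour"] = load.index.hour + 1                   # 1..24
    f["F_WW"] = (load.index.dayofweek < 5).astype(int)  # 1 weekday, 0 weekend
    f["F_DW"] = load.index.dayofweek + 1                # 1..7
    f["F_S"] = (load.index.month % 12 // 3) + 1         # 1..4 (assumed encoding)
    for lag in range(25, 169):                          # F_L(t-25) .. F_L(t-168)
        f[f"F_L(t-{lag})"] = load.shift(lag)
    return f.dropna()

hours = pd.date_range("2011-01-01", "2012-12-31 23:00", freq="h")
load = pd.Series(np.random.default_rng(0).normal(2500, 300, len(hours)), index=hours)
features = build_feature_set(load)
print(features.shape)  # (n_samples, 148)
```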

4. The Proposed Feature Selection Method and STLF Model


A feature selection method combined with G-mRMR and RF is proposed. First, the redundancy
of features in F and the relevance between features and load are measured by G-mRMR. Each feature
with its mRMR value is ranked in descending order. Afterward, an SFS-based RF is used to search for the
optimal feature subset. The MAPE used as a performance index in the feature subset selection process
is defined as:
\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| Z_i - \hat{Z}_i \right|}{Z_i} \times 100\% \quad (9)

where Z_i is the actual value of the load, Ẑ_i is the forecasting value, and N is the number of samples.
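Equation (9) transcribes directly into Python; this is a minimal sketch, assuming all actual loads Z_i are nonzero.

```python
import numpy as np

def mape(z, z_hat):
    """Equation (9): mean absolute percentage error, in percent."""
    z, z_hat = np.asarray(z, float), np.asarray(z_hat, float)
    return np.mean(np.abs(z - z_hat) / z) * 100.0

print(mape([2500, 2600, 2700], [2450, 2680, 2650]))  # ~2.31
```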

4.1. G-mRMR for Feature Selection


Suppose an original feature set F_m including m features and a selected feature set J. The details of
the feature selection process are enumerated below:
(1) Initialization: Ø→J.
(2) Compute the relevance between each feature and the target variable l. Pick out the feature from F_m
which satisfies Equation (2) and add it into J.
(3) Find the feature among the remaining m−1 features in F_m that satisfies Equation (4) and add it into J.
(4) Repeat step (3) until F_m becomes Ø.
(5) Rank the features in feature set J in descending order in accordance with the measured
mRMR values.
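A naive Python sketch of steps (1)-(5) is given below. The helper name and the fallback MI estimator from scikit-learn are illustrative choices, and no attempt is made to cache the pairwise MI values as an efficient implementation would.

```python
import numpy as np

def g_mrmr_rank(F, l, alpha, mi=None):
    """Rank all m features of F by the G-mRMR criterion of Equation (6).

    F: (n_samples, m) feature matrix; l: target vector; alpha: weighting factor.
    `mi` is any pairwise mutual-information function, e.g. the histogram
    estimator sketched in Section 2.1; this is a naive O(m^2) illustration.
    """
    if mi is None:
        from sklearn.feature_selection import mutual_info_regression
        mi = lambda a, b: mutual_info_regression(a.reshape(-1, 1), b)[0]
    m = F.shape[1]
    relevance = np.array([mi(F[:, j], l) for j in range(m)])
    remaining, selected = list(range(m)), []
    # Step (2): start from the single most relevant feature.
    first = int(np.argmax(relevance))
    selected.append(first)
    remaining.remove(first)
    # Steps (3)-(4): greedily add the feature maximizing relevance - alpha * redundancy.
    while remaining:
        scores = [
            relevance[j] - alpha * sum(mi(F[:, j], F[:, i]) for i in selected)
            for j in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected  # step (5): features in descending G-mRMR order
```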

4.2. Wrapper for Feature Selection

The common wrapper is a sequential forward or backward selection, both of which do not
consider the feature weighting [34,39]. Therefore, the effects of features of every dimensionality
must be measured, making the wrapper a complex and computationally expensive feature selection
method. Based on the result of the feature selection by G-mRMR, a wrapper for finding a feature
subset can be applied in a simpler manner. Considering that the features selected by mRMR are
ranked in a descending order, the features at the front of the ranking list contain more effective
information; thus, SFS is used for finding a small feature subset.

In an SFS, features are sequentially added to an empty candidate set until the addition of further
features no longer decreases the criterion. Defining an empty set S and an original feature set F_m,
in the first step the wrapper searches for the feature subset with only one feature, marked as S_1,
wherein the feature x_1 selected in S_1 leads to the largest prediction error reduction. In the second
step, the wrapper selects the feature x_2 from {F_m − S_1} which, combined with S_1, leads to the
largest prediction error reduction. The search schedule is repeated until the prediction error
stops decreasing.

4.3. The Proposed STLF Model

Based on the methods in Sections 4.1 and 4.2, the method of feature selection with RF for STLF is
proposed. The feature selection and short-term load forecasting process are shown in Figure 7, where p
is the number of features and α is the weighting factor from 0.1 to 0.9, with an increment of 0.1.

Figure 7. The feature selection process based on G-mRMR and RF for STLF.
5. Case Study and Results Analysis

The data for the experiment consist of the actual data from 2011 to 2012 from a city in Northeast
China. For the purpose of feature selection and STLF, the data are divided into three parts: (1) training
set (eight months extracted randomly from 2011); (2) validation set (the remaining four months of 2011);
and (3) test set (one week extracted from each season of 2012). More information about the data set is
shown in Table 2.

Table 2. Detail information about the data set.

Data Set         Information                                                                  Purpose
Training Set     January, February, May, June, August, September, October, December (2011)    Train RF
Validation Set   March, April, July, November (2011)                                          Used to obtain the best weighting factor
Test Set         23–29 February 2012 (Winter); 13–19 May 2012 (Spring);
                 21–27 August 2012 (Summer); 24–30 November 2012 (Fall)                       Test performance of RF

The number of features mtry, to which RF is not overly sensitive, is recommended as mtry = p/3 [40].
The complexity of RF is affected by the number of trees. Under the premise of non-reduction of
prediction accuracy, the initial number of trees nTree is set as 500 [15].
Let Equation (9) be one of the criteria of RF. In addition, the root mean square error (RMSE) is
also used. The RMSE is defined in the following equation:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Z_i - \hat{Z}_i)^2} \quad (10)

5.1. Feature Selection Results Based on G-mRMR and RF

In this subsection, the optimal subset is achieved according to the minimum MAPE by setting
different weighting factor values of G-mRMR. Figure 8 shows the MAPE curves of the results from
RF predictions under different weighting factors α. As shown in Figure 8a, the MAPE is reduced and
reaches a minimum value with the increase in the number of features. Subsequently, it ceases to
decrease and gradually increases, indicating that the later addition of features does not improve the
performance of RF, but only brings an adverse effect. As shown in Figure 8b, the error is reduced
rapidly when adopting a small value of α, for instance α = 0.1, which indicates that the features have
useful information for improving the performance of RF. When the redundancy among features is
weighted excessively with a large value of α, the selected feature subset does not provide enough
relevant information for the prediction of the RF-based predictor.
Figure 8. Prediction error curves: (a) Prediction error curves corresponding to different weighting
factors α; (b) The enlarged figure of the red box in (a).

Table 3 presents the results of feature selection. When α = 0.4, the feature subset has the least
number of features and the RF generates the minimum MAPE, so this feature subset is selected as the optimal one.

Table 3. Feature subsets selected by minimum MAPE under different weighting factors.

α Min MAPE (%) Number of Features Feature Subset


FL(t-168) , FL(t-25) , FL(t-48) , FL(t-144) , FL(t-72) , FHour , FL(t-47) ,
FL(t-26) , FL(t-120) , FL(t-167) , FWW , FS , FDW , FL(t-34) ,
0.1 2.5640 26
FL(t-158) , FL(t-103) , FL(t-27) , FL(t-96) , FL(t-162) , FL(t-132) ,
FL(t-44) , FL(t-88) , FL(t-149) , FL(t-153) , FL(t-37) , FL(t-107)
FL(t-168) , FL(t-25) , FL(t-48) , FL(t-144) , FHour , FS , FWW ,
FL(t-71) , FDW , FL(t-27) , FL(t-106) , FL(t-162) , FL(t-38) , FL(t-127) ,
0.2 2.5857 25
FL(t-93) , FL(t-156) , FL(t-88) , FL(t-32) , FL(t-29) , FL(t-96) , FL(t-44) ,
FL(t-134) , FL(t-26) , FL(t-166) , FL(t-59)
FL(t-168) , FL(t-25) , FL(t-48) , FHour , FWW , FS , FL(t-144) ,
FDW , FL(t-103) , FL(t-37) , FL(t-162) , FL(t-70) , FL(t-131) ,
0.3 2.5858 27 FL(t-28) , FL(t-88) , FL(t-153) , FL(t-106) , FL(t-75) , FL(t-159) ,
FL(t-34) , FL(t-125) , FL(t-96) , FL(t-43) , FL(t-165) , FL(t-109) ,
FL(t-31) , FL(t-26)
FL(t-168) , FL(t-25) , FL(t-48) , FWW , FS , FL(t-127) , FL(t-85) ,
0.4 2.5597 15 FL(t-139) , FDW , FL(t-34) , FL(t-160) , FL(t-70) , FL(t-28) ,
FL(t-120) , FL(t-141)
FL(t-168) , FL(t-25) , FL(t-47) , FWW , FS , FL(t-127) , FL(t-86) ,
FDW , FL(t-139) , FL(t-35) , FL(t-99) , FL(t-160) , FL(t-69) , FL(t-29 ),
0.5 2.5897 80 FL(t-154) , FL(t-120) , FL(t-41) , FL(t-81) , FL(t-133) , FL(t-148) ,
FL(t-166) , FL(t-32) , FL(t-63) , FL(t-92) , FL(t-26) , FL(t-108) ,
FL(t-162) , FL(t-78) , . . .
FL(t-168) , FL(t-25) , FHour , FL(t-47) , FS , FL(t-127) , FL(t-88) ,
FDW , FL(t-156) , FL(t-139) , FL(t-76) , FL(t-34) , FL(t-110) , FL(t-69) ,
0.6 2.5868 46 FL(t-149) , FL(t-120) , FL(t-41) , FL(t-81) , FL(t-27) , FL(t-165) ,
FL(t-37) , FL(t-162) , FL(t-98) , FL(t-30) , FL(t-131) , FL(t-159) ,
FL(t-104) , FL(t-44) , . . .
FL(t-168) , FL(t-25) , FWW , FS , FL(t-103) , FL(t-61) , FL(t-139) ,
FDW , FL(t-47) , FL(t-160) , FL(t-82) , FL(t-124) , FL(t-30) , FL(t-93) ,
0.7 2.5891 88 FL(t-156) , FL(t-41) , FL(t-146) , FL(t-33) , FL(t-110) , FL(t-72) ,
FL(t-152) , FL(t-164) , FL(t-27) , FL(t-90) , FL(t-131) , FL(t-39) ,
FL(t-118) , FL(t-77) , . . .
FL(t-168) , FL(t-25) , FWW , FS , FL(t-103) , FL(t-61) , FL(t-139) ,
FDW , FL(t-47) , FL(t-160) , FL(t-82) , FL(t-124) , FL(t-30) , FL(t-93) ,
0.8 2.6046 93 FL(t-156) , FL(t-41) , FL(t-146) , FL(t-33) , FL(t-110) , FL(t-166) ,
FL(t-75) , FL(t-152) , FL(t-90) , FL(t-72) , FL(t-44) , FL(t-131) ,
FL(t-28) , FL(t-39) , . . .
FL(t-168) , FL(t-25) , FWW , FS , FL(t-103) , FL(t-67) , FL(t-133) ,
FDW , FL(t-34) , FL(t-160) , FL(t-46) , FL(t-148) , FL(t-96) , FL(t-29) ,
0.9 2.5918 35 FL(t-84) , FL(t-140) , FL(t-39) , FL(t-153) , FL(t-75) , FL(t-114) ,
FL(t-165) , FL(t-56) , FL(t-122) , FL(t-62) , FL(t-155) , FL(t-126) ,
FL(t-41) , FL(t-119) , . . .

The RF will forecast poorly with too few trees, while excessive trees will make it a complicated
predictor. In order to obtain a reasonable number of trees for RF, an experiment is designed as follows:

(1) The training set and test set with the optimal features are used for the experiment.
(2) The initial number of trees is nTree = 1.
(3) Train and test the RF with different nTree values, incrementing by 1, until nTree = 500.

The experimental result is shown in Figure 9.


Figure 9. Correlation between tree number and prediction of RF.
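The experiment of steps (1)-(3) can be reproduced with a sketch like the following; the use of warm_start is an efficiency shortcut assumed here, not part of the paper's procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ntree_sweep(X_tr, y_tr, X_te, y_te, max_trees=500):
    """MAPE versus nTree = 1..500, as in the experiment of Figure 9.

    warm_start lets the forest grow one tree per iteration instead of
    refitting from scratch at every step.
    """
    rf = RandomForestRegressor(n_estimators=0, warm_start=True, random_state=0)
    mape_curve = []
    for n in range(1, max_trees + 1):
        rf.n_estimators = n  # grow one more tree on each pass
        rf.fit(X_tr, y_tr)
        pred = rf.predict(X_te)
        mape_curve.append(np.mean(np.abs(y_te - pred) / y_te) * 100)
    best = int(np.argmin(mape_curve)) + 1
    return best, mape_curve
```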
The prediction error decreases with the increasing number of trees. When nTree > 100, the error
tends to be steady. By analyzing the result, nTree = 184 with the minimum MAPE = 2.5389% is obtained;
this number of trees is used as the parameter of RF in the subsequent experiments.
using this number of trees as the parameter of RF in the future experiment.
5.2. Comparison Experiments for STLF

The data shown in Table 2 are used in the comparison experiments.
5.2.1. Comparison of Different Feature Selection Methods
By using RF as the predictor, feature selection methods such as the Pearson correlation
coefficient (PCC), MI, and SFS are compared with the proposed method in order to estimate the effect
of the feature selection by G-mRMR. The results of these feature selection methods are presented in
Figure 10.

In Figure 10, with the same predictor, the SFS provides the best performance, followed by
G-mRMR (α = 0.4) and MI, and finally the PCC. The SFS, which is wrapped around RF, selects
22 features and achieves the minimum MAPE = 2.4925%. Considering both the relevance between
features and load and the redundancy among features, G-mRMR (α = 0.4) selects 15 features with a
minimum MAPE = 2.5597%. The MAPE of the feature subset selected by MI, which does not consider
the redundancy among features, is higher than that of G-mRMR (α = 0.4). The PCC only analyzes the
linear relation between features and load; hence, the feature subset selected through this method is
not as good as that of G-mRMR (α = 0.4).
Figure 10. Prediction error curves: (a) Prediction error curves corresponding to different feature
selection methods; (b) The enlarged figure of the red box in (a).

In order to verify the validity of the feature subsets applied for STLF, four weeks distributed
among the four seasons in 2012 are used to test each feature subset with RF. For comparison, the full set
of features with RF is also tested. The experimental results are shown in Figure 11. By examining
the results in Figure 11a–d, the generalized minimum redundancy and maximum relevance-random
forest (G-mRMR-RF) (α = 0.4), mutual information-random forest (MI-RF), sequential forward
selection-random forest (SFS-RF), and RF (full features) can fit the true load values accurately,
whereas the accuracy of the Pearson correlation coefficient-random forest (PCC-RF) is low. The results
of the fifth day prediction in Figure 11a and the seventh day in Figure 11c show that G-mRMR-RF has
a better fit than MI-RF, indicating the necessity of considering the redundancy among features.
The results of the fifth day prediction in Figure 11a show that SFS-RF has better prediction performance
than G-mRMR, while the seventh day prediction results in Figure 11c indicate that G-mRMR-RF
predicts better.

Figure 11. Load curves of forecasting results of four weeks in four seasons and the true values:
(a) Forecasting from 23 to 29 February 2012; (b) Forecasting from 13 to 19 May 2012; (c) Forecasting
from 21 to 27 August 2012; (d) Forecasting from 24 to 30 November 2012.

By analyzing Figures 10 and 11 and Tables 4–7 comprehensively, although SFS achieved the
best forecasting results in the feature selection process, the proposed method achieved the better
result in the testing schedule. When predicting the 28 days in the test set, the proposed method
yields the best forecast on 20 days, and the MAPE in the remaining eight days is higher than that of
the other methods, with the difference ranging from 0.04% to 0.37%. The average MAPE and the
average RMSE indicate that G-mRMR-RF performs the best among the methods, which demonstrates
the validity and advancement of G-mRMR.
The new method also has the minimum value of the maximum error of STLF in the test set.
As shown in Table 6, the maximum MAPE and maximum RMSE of the proposed method are 6.12%
and 208.00 MW. Although the maximum error of the new method is high, compared with other
methods, the proposed method still performed better. The high prediction error can be caused by two
factors. On the one hand, the load of the forecasting day is much larger than the historical load data in
the training set. In this paper, most features in the original feature set are extracted from the historical
load data; without the consideration of other features, the prediction results cannot improve just
by improving the feature selection and forecasting method. On the other hand, with the significant
economic rise of China from 2005 to 2012, the growth rate of the gross domestic product of the city is
more than 10%. Under this premise, the electric load of the city increases rapidly, which makes STLF
a challenging task.

Table 4. Comparison of prediction error (MAPE (%) and RMSE (MW)) from 23 to 29 February 2012.

Day        G-mRMR-RF (α = 0.4)    MI-RF          PCC-RF         SFS-RF         RF with Full Features
           MAPE    RMSE           MAPE   RMSE    MAPE   RMSE    MAPE   RMSE    MAPE    RMSE
Day 1 1.93 75.24 1.79 69.42 10.28 401.01 2.07 79.58 1.91 70.74
Day 2 1.77 66.63 1.78 67.90 9.78 388.26 2.22 79.46 1.80 69.13
Day 3 1.58 53.24 1.63 51.63 7.59 285.51 1.47 49.33 1.50 50.49
Day 4 1.69 79.28 1.59 70.02 5.35 189.65 2.52 105.32 1.98 76.33
Day 5 2.26 90.72 2.66 104.16 11.14 440.91 2.04 83.68 2.91 113.32
Day 6 1.58 57.73 2.37 83.87 9.78 396.44 1.61 57.41 2.54 87.59
Day 7 1.28 51.92 0.97 36.35 9.26 362.46 1.87 73.03 1.29 44.60
Average 1.72 67.82 1.82 69.05 9.02 352.03 1.97 75.40 1.99 73.17

Table 5. Comparison of prediction error (MAPE (%) and RMSE (MW)) from 13 to 19 May 2012.

Day        G-mRMR-RF (α = 0.4)    MI-RF          PCC-RF         SFS-RF         RF with Full Features
           MAPE    RMSE           MAPE   RMSE    MAPE   RMSE    MAPE   RMSE    MAPE    RMSE
Day 1 1.20 42.36 1.22 39.15 3.76 110.39 1.43 50.90 1.57 47.92
Day 2 1.64 60.32 1.33 50.26 8.98 273.78 1.28 46.10 1.37 53.34
Day 3 2.04 66.88 2.04 67.09 6.56 246.64 2.03 69.43 2.00 66.78
Day 4 0.94 34.38 0.96 34.48 7.04 263.29 0.89 34.98 1.11 41.54
Day 5 1.55 53.26 1.40 46.62 7.17 261.54 1.40 50.04 1.50 52.38
Day 6 1.28 41.45 1.34 44.68 6.66 237.55 1.28 40.22 1.45 40.03
Day 7 0.84 26.82 0.99 36.97 5.51 178.83 0.92 50.61 1.01 49.05
Average 1.35 46.49 1.33 48.03 6.53 224.57 1.32 48.90 1.40 50.15

Table 6. Comparison of prediction error (MAPE (%) and RMSE (MW)) from 21 to 27 August 2012.

Day      G-mRMR-RF (α = 0.4)    MI-RF            PCC-RF           SFS-RF           RF with Full Features
         MAPE      RMSE         MAPE     RMSE    MAPE     RMSE    MAPE     RMSE    MAPE      RMSE
Day 1 2.88 92.71 2.61 83.05 6.69 258.31 3.32 104.41 2.89 90.68
Day 2 1.48 55.30 1.55 56.81 8.22 319.74 1.77 62.53 1.59 57.62
Day 3 0.91 31.93 0.82 29.02 7.04 263.33 1.00 38.68 1.07 36.28
Day 4 1.88 76.95 2.27 90.86 8.97 344.82 1.99 84.88 2.17 87.44
Day 5 1.77 54.77 1.87 56.56 6.42 227.25 2.16 70.95 1.91 58.15
Day 6 2.08 73.60 1.78 71.44 5.91 181.33 1.71 65.13 1.86 74.78
Day 7 6.12 208.00 6.77 237.00 11.26 458.19 6.98 247.66 6.57 227.17
Average 2.45 72.83 2.52 89.25 7.79 293.28 2.70 96.32 2.58 90.30

Table 7. Comparison of prediction error (MAPE (%) and RMSE (MW)) from 24 to 30 November 2012.

Day      G-mRMR-RF (α = 0.4)    MI-RF            PCC-RF           SFS-RF           RF with Full Features
         MAPE      RMSE         MAPE     RMSE    MAPE     RMSE    MAPE     RMSE    MAPE      RMSE
Day 1 1.61 58.60 1.64 58.26 6.80 263.40 1.96 68.90 1.78 63.62
Day 2 1.05 48.24 1.12 43.26 6.97 242.74 2.11 78.43 1.09 40.30
Day 3 1.98 74.62 2.06 74.67 10.78 427.39 1.98 73.90 2.12 76.64
Day 4 1.47 57.14 1.33 48.09 9.21 387.30 1.80 67.13 1.50 57.63
Day 5 1.12 42.84 0.90 33.83 10.29 413.17 1.15 46.87 1.26 45.01
Day 6 1.31 53.79 1.33 52.33 9.03 389.52 1.32 54.74 1.23 47.40
Day 7 1.10 42.86 1.22 45.31 9.53 387.69 1.06 44.21 1.18 43.42
Average 1.38 54.01 1.37 50.82 8.93 358.74 1.63 62.02 1.45 53.43

5.2.2. Comparison of Different Intelligent Methods


To compare the influence of different predictors on STLF, support vector regression (SVR) and back propagation neural network (BPNN) are examined with G-mRMR for feature selection in this subsection. The parameters of SVR are set as follows: the penalty factor is C = 100, the insensitive loss function is ε = 0.1, and the kernel width is δ² = 2 [41]. The parameters of BPNN are set as follows: the number of neurons in the hidden layer is Nneu = 2p + 1 [42], and the number of iterations is T = 2000 [43].
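As a rough illustration of these settings, the sketch below instantiates the three predictors with scikit-learn. The mapping of the kernel width δ² = 2 onto scikit-learn's gamma parameter, the feature count p = 15, and the RF tree count are assumptions made for illustration; the paper's actual implementation and tuned RF structure may differ.

```python
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

p = 15  # number of input features; 15 matches the G-mRMR-RF subset in Table 8

# SVR with C = 100 and epsilon = 0.1; assuming an RBF kernel of the form
# exp(-||x - x'||^2 / (2 * delta^2)), a kernel width delta^2 = 2 maps to
# scikit-learn's gamma = 1 / (2 * delta^2) = 0.25 (this mapping is an assumption).
svr = SVR(kernel="rbf", C=100.0, epsilon=0.1, gamma=0.25)

# BPNN with Nneu = 2p + 1 hidden neurons and T = 2000 training iterations.
bpnn = MLPRegressor(hidden_layer_sizes=(2 * p + 1,), max_iter=2000)

# RF predictor; n_estimators here is a placeholder, not the paper's tuned value.
rf = RandomForestRegressor(n_estimators=500, random_state=0)
```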
The training, validation, and test sets are the same as in Section 4.2. The SVR and BPNN are used to generate their own optimal feature subsets.
Table 8 presents the feature subsets selected by the different intelligent STLF methods. With different predictors the optimal weighting factors differ, and thus the selected features vary. Although SVR and BPNN select fewer features than RF, the RF-based predictor achieves higher prediction accuracy, which is the main target of STLF.

Table 8. The optimal subset selected by using different intelligent STLF methods.

Predictor              Min MAPE (%)   Number of Features   Feature Subset
G-mRMR-RF (α = 0.4)    2.5389         15                   FL(t-168), FL(t-25), FL(t-48), FWW, FS, FL(t-127), FL(t-85), FL(t-139), FDW, FL(t-34), FL(t-160), FL(t-70), FL(t-28), FL(t-120), FL(t-141)
G-mRMR-SVR (α = 0.3)   3.3293         5                    FL(t-168), FL(t-25), FL(t-48), FHour, FWW
G-mRMR-BPNN (α = 0.1)  2.7186         11                   FL(t-168), FL(t-25), FL(t-48), FL(t-144), FL(t-72), FHour, FL(t-47), FL(t-26), FL(t-120), FL(t-167), FWW

The test sets, four weeks distributed over the four seasons, are used to evaluate each predictor with the features chosen above. Figure 12 shows the MAPE for comparison, and Table 9 gives the predictive accuracy of each model through the maximum, minimum, and average MAPE. In addition, a direct comparison between G-mRMR-RF, generalized minimum redundancy and maximum relevance-back propagation neural network (G-mRMR-BPNN), and generalized minimum redundancy and maximum relevance-support vector regression (G-mRMR-SVR), in terms of MAPE, is also presented in this figure. Except for the prediction on the seventh day shown in Figure 12c, the MAPE of G-mRMR-RF is between 1% and 2%, with one point above 2%. In the whole experiment, G-mRMR-RF forecasted worse than the other models on only four days. Clearly, G-mRMR-RF is the best prediction model for its low MAPE and small fluctuation of error. The G-mRMR-BPNN shows slightly better performance than G-mRMR-SVR. The maximum MAPEs of G-mRMR-RF over these four weeks are 2.26%, 2.04%, 6.12%, and 1.98%, respectively, which are smaller than those of the other models. The same conclusion can be drawn by analyzing the minimum and average MAPE.

Figure 12. Forecasting error profiles of different predictors: (a) Forecasting from 23 to 29 February 2012; (b) Forecasting from 13 to 19 May 2012; (c) Forecasting from 21 to 27 August 2012; (d) Forecasting from 24 to 30 November 2012. [Each panel plots daily MAPE (%) against the day of the week for G-mRMR-RF, G-mRMR-BPNN, and G-mRMR-SVR.]

Table 9. Max, Min and Average daily MAPEs of the test set corresponding to different predictors.

Test Set                                 MAPE (%)
                                         G-mRMR-RF   G-mRMR-BPNN   G-mRMR-SVR
23–29 February 2012 (Winter)   Max       2.26        3.72          4.24
                               Min       1.28        1.23          1.37
                               Ave       1.72        2.18          2.61
13–19 May 2012 (Spring)        Max       2.04        3.14          4.42
                               Min       0.84        1.24          1.87
                               Ave       1.35        2.05          2.78
21–27 August 2012 (Summer)     Max       6.12        6.32          7.68
                               Min       0.91        1.78          1.69
                               Ave       2.45        2.87          3.50
24–30 November 2012 (Fall)     Max       1.98        2.84          4.01
                               Min       1.05        1.38          1.80
                               Ave       1.38        2.10          2.86
Based on the comprehensive analysis above, RF combined with G-mRMR is more suitable for STLF than BPNN or SVR.
6. Conclusions
To address the issue of selecting reasonable features for STLF, a feature selection method based on G-mRMR and RF is proposed in this paper. The experimental results show that the proposed feature selection approach selects fewer features than other feature selection methods, and that the features it identifies are useful for STLF. In addition, the experimental results show that the forecasting results of RF are better than those of the other predictors.
The advantages of the proposed method are as follows:

(1) MI is adopted as the criterion to measure the relevance between features and the load time series as well as the dependency among features, which is the basis of the quantitative analysis of feature selection by mRMR.
(2) Both the correlation between features and load and the redundancy among these features are considered. As compared to the maximum relevance method, the G-mRMR method for feature selection reduces the size of the optimal feature subset and prevents STLF accuracy from being degraded by feature redundancy. At the same time, the relevance and redundancy are balanced by a variable weighting factor; a minimal sketch of one such weighted criterion is given after this list. The features selected by G-mRMR make RF more accurate than those selected by mRMR.
(3) The optimal structure of RF is designed to reduce the complexity of the model and to improve the accuracy of STLF.
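To make point (2) concrete, the following Python sketch ranks features by a weighted mRMR-style score, using scikit-learn's mutual_info_regression as the MI estimator. The weighted form J(f) = (1 − α)·I(f; y) − α·(mean MI between f and the already-selected features), the estimator choice, and the function name g_mrmr_rank are assumptions made for illustration; the paper's exact generalization and MI estimation may differ.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def g_mrmr_rank(X, y, alpha=0.4):
    # Greedy ranking: at each step, pick the feature with the best
    # weighted relevance/redundancy trade-off (illustrative form only).
    relevance = mutual_info_regression(X, y)   # I(f; y) for every feature
    selected = [int(np.argmax(relevance))]     # seed with the most relevant feature
    remaining = [f for f in range(X.shape[1]) if f != selected[0]]
    while remaining:
        scores = []
        for f in remaining:
            # Mean MI between candidate f and the features selected so far.
            redundancy = np.mean(mutual_info_regression(X[:, selected], X[:, f]))
            scores.append((1.0 - alpha) * relevance[f] - alpha * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected  # feature indices in descending G-mRMR order
```

A sequential forward selection would then grow subsets along this ranking and keep the prefix with the lowest validation MAPE.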

Acknowledgments: This work is supported by the National Natural Science Foundation of China (No. 51307020), the Science and Technology Development Project of Jilin Province (No. 20160411003XH) and the Science and Technology Foundation of the Department of Education of Jilin Province (2016, No. 90).
Author Contributions: Nantian Huang proposed the main idea and designed the overall structure of this paper. Zhiqiang Hu performed the experiments and prepared the manuscript. Guowei Cai guided the experiments and the paper writing. Dongfeng Yang provided materials. All authors have read and approved the final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Moslehi, K.; Kumar, R. A reliability perspective of the smart grid. IEEE Trans. Smart Grid 2010, 1, 57–64.
[CrossRef]
2. Ren, Y.; Suganthan, P.N.; Srikanth, N.; Amaratunga, G. Random vector functional link network for short-term
electricity load demand forecasting. Inf. Sci. 2016, 367–368, 1078–1093. [CrossRef]
3. Lee, C.-M.; Ko, C.-N. Short-term load forecasting using lifting scheme and ARIMA models. Expert Syst. Appl.
2011, 38, 5902–5911. [CrossRef]
4. Goia, A.; May, C.; Fusai, G. Functional clustering and linear regression for peak load forecasting.
Int. J. Forecast. 2010, 26, 700–711. [CrossRef]
5. Al-Hamadi, H.M.; Soliman, S.A. Fuzzy short-term electric load forecasting using Kalman filter. IEE Proc.
Gener. Transm. Distrib. 2006, 153, 217–227. [CrossRef]
6. Ramos, S.; Soares, J.; Vale, Z. Short-term load forecasting based on load profiling. In Proceedings of the 2013
IEEE Power and Energy Society General Meeting, Vancouver, BC, Canada, 21–25 July 2013; pp. 1–5.
7. Li, W.; Zhang, Z.G. Based on Time Sequence of ARIMA Model in the Application of Short-Term Electricity
Load Forecasting. In Proceedings of the 2009 International Conference on Research Challenges in Computer
Science, Shanghai, China, 28–29 December 2009; pp. 11–14.
8. Deshmukh, M.R.; Mahor, A. Comparisons of Short Term Load Forecasting using Artificial Neural Network
and Regression Method. Int. J. Adv. Comput. Res. 2011, 1, 96–100.
9. Taylor, J.W. Short-Term Load Forecasting With Exponentially Weighted Methods. IEEE Trans. Power Syst.
2012, 27, 458–464. [CrossRef]
10. Kouhi, S.; Keynia, F.; Ravadanegh, S.N. A new short-term load forecast method based on neuro-evolutionary
algorithm and chaotic feature selection. Int. J. Electr. Power Energy Syst. 2014, 62, 862–867. [CrossRef]
11. Raza, M.Q.; Khosravi, A. A review on artificial intelligence based load demand forecasting techniques for
smart grid and buildings. Renew. Sustain. Energy Rev. 2015, 50, 1352–1372. [CrossRef]
12. Lin, C.T.; Chou, L.D.; Chen, Y.M.; Tseng, L.M. A hybrid economic indices based short-term load forecasting
system. Int. J. Electr. Power Energy Syst. 2014, 54, 293–305. [CrossRef]
13. Yu, F.; Xu, X. A short-term load forecasting model of natural gas based on optimized genetic algorithm and
improved BP neural network. Appl. Energy 2014, 134, 102–113. [CrossRef]
14. Çevik, H.H.; Çunkaş, M. Short-term load forecasting using fuzzy logic and ANFIS. Neural Comput. Appl.
2015, 26, 1355–1367. [CrossRef]
15. Lahouar, A.; Slama, J.B.H. Day-ahead load forecast using random forest and expert input selection.
Energy Convers. Manag. 2015, 103, 1040–1051. [CrossRef]
16. Ho, K.L.; Hsu, Y.Y.; Chen, C.F.; Lee, T.E. Short term load forecasting of Taiwan power system using
a knowledge-based expert system. IEEE Trans. Power Syst. 1990, 5, 1214–1221.
17. Srinivasan, D.; Tan, S.S.; Cheng, C.S.; Chan, E.K. Parallel neural network-fuzzy expert system strategy for
short-term load forecasting: System implementation and performance evaluation. IEEE Trans. Power Syst.
1999, 14, 1100–1106. [CrossRef]
18. Quan, H.; Srinivasan, D.; Khosravi, A. Uncertainty handling using neural network-based prediction intervals
for electrical load forecasting. Energy 2014, 73, 916–925. [CrossRef]
19. Hernández, L.; Baladrón, C.; Aguiar, J.M.; Carro, B.; Sánchez-Esguevillas, A.; Lloret, J. Artificial neural
networks for short-term load forecasting in microgrids environment. Energy 2014, 75, 252–264. [CrossRef]
20. Ko, C.N.; Lee, C.M. Short-term load forecasting using SVR (support vector regression)-based radial basis
function neural network with dual extended Kalman filter. Energy 2013, 49, 413–422. [CrossRef]
21. Che, J.X.; Wang, J.Z. Short-term load forecasting using a kernel-based support vector regression combination
model. Appl. Energy 2014, 132, 602–609. [CrossRef]
22. Pandian, S.C.; Duraiswamy, K.; Rajan, C.C.A.; Kanagaraj, N. Fuzzy approach for short term load forecasting.
Electr. Power Syst. Res. 2006, 76, 541–548. [CrossRef]
23. Božić, M.; Stojanović, M.; Stajić, Z.; Stajić, N. Mutual Information-Based Inputs Selection for Electric Load
Time Series Forecasting. Entropy 2013, 15, 926–942. [CrossRef]
24. Ma, L.; Zhou, S.; Lin, M. Support Vector Machine Optimized with Genetic Algorithm for Short-Term Load
Forecasting. In Proceedings of the 2008 International Symposium on Knowledge Acquisition and Modeling,
Wuhan, China, 21–22 December 2008; pp. 654–657.
25. Gao, R.; Liu, X. Support vector machine with PSO algorithm in short-term load forecasting. In Proceedings
of the 2008 Chinese Control and Decision Conference, Yantai, China, 2–4 July 2008; pp. 1140–1142.
26. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
27. Jurado, S.; Nebot, À.; Mugica, F.; Avellana, N. Hybrid methodologies for electricity load forecasting:
Entropy-based feature selection with machine learning and soft computing techniques. Energy 2015, 86,
276–291. [CrossRef]
28. Wilamowski, B.M.; Cecati, C.; Kolbusz, J.; Rozycki, P. A Novel RBF Training Algorithm for short-term Electric
Load Forecasting and Comparative Studies. IEEE Trans. Ind. Electron. 2015, 62, 6519–6529.
29. Wi, Y.M.; Joo, S.K.; Song, K.B. Holiday load forecasting using fuzzy polynomial regression with weather
feature selection and adjustment. IEEE Trans. Power Syst. 2012, 27, 596–603. [CrossRef]
30. Viegas, J.L.; Vieira, S.M.; Melício, M.; Mendes, V.M.F.; Sousa, J.M.C. GA-ANN Short-Term Electricity
Load Forecasting. In Proceedings of the 7th IFIP WG 5.5/SOCOLNET Advanced Doctoral Conference on
Computing, Electrical and Industrial Systems, Costa de Caparica, Portugal, 11–13 April 2016; pp. 485–493.
31. Li, S.; Wang, P.; Goel, L. A novel wavelet-based ensemble method for short-term load forecasting with hybrid
neural networks and feature selection. IEEE Trans. Power Syst. 2015, 1788–1798. [CrossRef]
32. Hu, Z.; Bao, Y.; Chiong, R.; Xiong, T. Mid-term interval load forecasting using multi-output support vector
regression with a memetic algorithm for feature selection. Energy 2015, 84, 419–431. [CrossRef]
33. Koprinska, I.; Rana, M.; Agelidis, V.G. Yearly and seasonal models for electricity load forecasting.
In Proceedings of the 2011 International Joint Conference on Neural Networks (IJCNN), San Jose, CA,
USA, 31 July–5 August 2011; pp. 1474–1481.
34. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: Criteria of max-dependency,
max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [CrossRef]
[PubMed]
35. Nguyen, X.V.; Chan, J.; Romano, S.; Bailey, J. Effective global approaches for mutual information based feature
selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, New York, NY, USA, 24–27 August 2014; pp. 512–521.
36. Speybroeck, N. Classification and regression trees. Int. J. Public Health 2012, 57, 243–246. [CrossRef] [PubMed]
37. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [CrossRef]
38. Sood, R.; Koprinska, I.; Agelidis, V.G. Electricity load forecasting based on autocorrelation analysis.
In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain,
18–23 July 2010; pp. 1–8.
39. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [CrossRef]
40. Dudek, G. Short-Term Load Forecasting Using Random Forests. In Intelligent Systems’2014; Springer: Cham,
Switzerland, 2015; pp. 821–828.
41. Che, J.X.; Wang, J.Z.; Tang, Y.J. Optimal training subset in a support vector regression electric load forecasting
model. Appl. Soft Comput. 2012, 12, 1523–1531. [CrossRef]
42. Sheela, K.G.; Deepa, S.N. Review on Methods to Fix Number of Hidden Neurons in Neural Networks.
Math. Probl. Eng. 2013, 2013, 425740. [CrossRef]
43. Rana, M.; Koprinska, I. Forecasting electricity load with advanced wavelet neural networks. Neurocomputing
2016, 182, 118–132. [CrossRef]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
