Journal of Physics: Conference Series

PAPER • OPEN ACCESS

A Regression Equation Model for Height and Weight Prediction


To cite this article: Xu Zhang et al 2018 J. Phys.: Conf. Ser. 1087 022018


First International Conference on Advanced Algorithms and Control Engineering IOP Publishing
IOP Conf. Series: Journal of Physics: Conf. Series 1087 (2018)
022018 doi:10.1088/1742-6596/1087/2/022018

A Regression Equation Model for Height and Weight Prediction

Xu ZHANG1,2, Chao ZHAO2, Hai-tao WANG2, Fang ZHANG2, Jing ZHAO2 and Gang WU2,*
1 School of Mathematical Sciences, Capital Normal University, Beijing, China
2 China National Institute of Standardization, Beijing, China
[email protected]

Abstract. Height and weight prediction is a popular problem in ergonomics research. In
this paper, we reduce the dimension of the data by principal component analysis and choose the
best regression equation using several statistical criteria: Residual Mean Square (RMSq),
Mallows' Cp and the Akaike information criterion (AIC). Finally, we compare the predictions
with the real values to analyze the fitting accuracy of the proposed regression equations.

1. Introduction
Regression analysis is the most widely used method in multivariate statistical analysis. It is
applied to data analysis in social, economic, scientific and technological fields, and to the
establishment of empirical formulas for forecasting, such as weather forecasting, earthquake
prediction and stock market analysis.
Height and weight inference is an important part of individual identification in forensic
anthropology. In actual casework, height and weight prediction plays a key role in identifying
the source of remains in cases involving unidentified bodies, skeletonized remains, dismembered
corpses and similar situations. In previous studies, height and weight were predicted from a
single variable such as foot length, head circumference or shoulder width. Some articles use neck
data and head data for multiple regression.
The primary purpose of this research is to establish the relationship between body data and
height and weight. This paper not only covers the prediction of height and weight from one
variable, but also gives the optimal combination of variables when several are fitted together.
These equations have great reference value for forensic science and anthropology.

2. Related work
The physical characteristics of the body are an important part of anthropological research. In
forensic personal identification, it is important to estimate height and weight. The more
accurately height and weight are estimated, the more quickly the deceased can be identified,
which shortens the investigation time and improves its efficiency. At present, domestic research
on these aspects is rare, especially on weight prediction.
We focus on height and weight prediction in forensic anthropology. Many body measurements can
often be taken at the scene of an incident, so the key question is how to predict height and
weight more accurately from them. When conditions do not allow it, only limited data are
available; for that case, this paper first carries out regression prediction based on a single
factor. When many measurements are available, we need to consider how to choose the most
appropriate variables; the second part of this article focuses on variable selection.

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd

3. Methodology
A total of 2000 participants were recruited for this experiment. Participants were between 18 and
60 years old and had no bodily deformity.
Table 1 Data resources
items                          n     minimum  maximum  average   SD
height (mm)                    2000  1496     1893     1677.605  58.9680
weight (hg)                    2000  370      880      596.667   80.5707
armreach from back (mm)        2000  712      956      833.698   35.4544
hand length (mm)               2000  153      212      182.845   8.0007
foot length (mm)               2000  212      283      246.726   10.3434
lower extremity length (mm)    2000  834      1128     992.421   43.4268
neck circumference (mm)        2000  296      428      352.664   18.0861
bust (mm)                      2000  718      1085     871.142   54.0766
waist circumference (mm)       2000  571      1036     748.869   76.2112
head width (mm)                2000  512      615      561.132   15.3977

To predict height and weight, we need to model the relationship between the measured variables
and height and weight. Fitting the above data directly, we find that the variance inflation
factors (VIF) of the explanatory variables are quite large, which indicates serious
multicollinearity among the explanatory variables. In the least squares regression, partial
regression coefficients take negative signs and are very unstable. Moreover, although the overall
F-test is passed, most of the individual regression coefficients fail the t-test, and the mean
square error is as high as 29.76.
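
As a hedged illustration of the diagnostic mentioned above: the variance inflation factor of predictor j is VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on the other predictors. With only two predictors, R_j^2 is simply the squared Pearson correlation between them. The sketch below covers that two-predictor special case only; the function names and the sample measurements are made up for illustration and are not taken from the study.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2); valid only when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical hand and foot lengths in mm (illustrative values, not study data).
hand = [170, 175, 180, 185, 190, 195]
foot = [232, 236, 245, 249, 256, 262]

vif = vif_two_predictors(hand, foot)
print(f"VIF = {vif:.1f}")  # a value far above 10 is commonly read as strong collinearity
```

Strongly correlated body measurements such as these push the VIF far above 1, which is the situation the paragraph above describes.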
When the multicollinearity in the design matrix is too high, the mean square error of the
regression coefficient estimates becomes too large, which harms the goodness of fit of the model.
If the correlated variables can be compressed into a few uncorrelated variables that retain the
information in the original variables, these problems are avoided. Principal component analysis
is an effective way to convert multiple variables into a few uncorrelated variables by means of
linear combinations, which simplifies the subsequent analysis.
Principal component analysis is carried out in R with the princomp function, applied to the nine
variables. The total variance contribution rate is 87.1%, the maximum value is 3.7985 and the
minimum is 0.1982. The four principal components are used as the new variables for further
analysis. For convenience, the four principal components are referred to below as limb data.
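
To make the notion of a variance contribution rate concrete, here is a minimal sketch of how the cumulative contribution is computed from the eigenvalues of a correlation matrix and how many components are needed to reach a chosen threshold. The eigenvalues and the 85% threshold are hypothetical; the paper does not list its full eigenvalue spectrum, and this is not the authors' R code.

```python
from itertools import accumulate

def cumulative_contribution(eigenvalues):
    """Cumulative share of total variance, largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]

def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative share reaches the threshold."""
    for k, share in enumerate(cumulative_contribution(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, chosen only to illustrate the calculation.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]

print(cumulative_contribution(eigenvalues))
print(components_needed(eigenvalues, 0.85))
```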

Fig.1 The scree plot

3.1. The relation between human height and limb data


In predictive modelling, the simplest and most common case is that two characteristic parameters
of the system are closely and linearly related. For such a model, linear regression by the least
squares method is generally used. For data that basically conform to a linear relationship, the
least squares method obtains the coefficients a and b of the regression line by minimizing the
distance in the y direction between the regression line and the scatter points. Given n data
points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the regression equation is
$$y = bx + a. \tag{1}$$
We use $\sum_{i=1}^{n} [y_i - (a + b x_i)]^2$ to quantify the distance in the y direction from
the points to the straight line. It can be seen as a function of two variables,
$$Q(a, b) = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2. \tag{2}$$
So the problem of finding the straight line closest to the n points becomes the problem of
finding two numbers $\hat{a}$, $\hat{b}$ that minimize $Q(a, b)$. Setting the partial derivatives
of Q with respect to a and b to zero, we have
$$\hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \tag{3}$$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$; $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$; $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$.
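
The closed-form solution in equation (3) translates directly into code. The following is a minimal sketch using only the Python standard library; the function name fit_line and the sample points are illustrative and not part of the paper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (a, b) for y = b*x + a, following equation (3)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Illustrative points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
print(a, b)
```

Because the sample points lie exactly on a line, the fit recovers that line's intercept and slope; with noisy data the same function returns the line that minimizes Q(a, b) in equation (2).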

A scatter plot shows the general trend of the dependent variable against the independent
variable, which helps in choosing an appropriate function to fit the data points. We draw scatter
plots of height against the other factors based on the data.

Fig.2 The scatter plot of height with other factors


By calculating the correlation coefficients, we found that height is highly correlated with the
limb data and that the relationship is positive and approximately linear. For each independent
variable, we estimated a separate linear regression equation. Each equation is highly
statistically significant by the t-test and F-test.
Table 2 Regression equation of height and the related detection value
items X                   height Y               T value  Adjusted R-squared  F-statistic
armreach from back        Y=1.1322X+733.69265    41.54    0.4631              1725
hand length               Y=4.8645X+788.1596     39.27    0.4353              1524
foot length               Y=3.92466X+709.28839   42.42    0.4737              1800
lower extremity length    Y=1.09538X+590.52738   61.02    0.6506              3723
Comparing the adjusted R-squared values in Table 2, 0.6506 > 0.4737 > 0.4631 > 0.4353, we
conclude that the regression equation based on lower extremity length is the best of the four,
and its estimation error range is the smallest. Therefore, lower extremity length should be the
first-choice indicator for estimating height.
When all four variables are used, the regression equation is
$$Y = 339.54799 + 0.27558 X_1 + 0.7695 X_2 + 1.06205 X_3 + 0.71095 X_4 \tag{4}$$
and the adjusted R-squared rises to 0.7323. Compared with the prediction from lower extremity
length alone, the accuracy improves considerably.
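
One simple consistency check on equation (4): a least-squares fit with an intercept reproduces the sample mean of Y when it is evaluated at the sample means of the predictors. The sketch below plugs the Table 1 averages into equation (4). It assumes that X1 to X4 denote armreach from back, hand length, foot length and lower extremity length, in that order, and that every coefficient enters with a positive sign; neither assumption is stated explicitly in the text.

```python
# Table 1 averages in mm; the X1..X4 mapping is an assumption (see lead-in).
means = {
    "X1": 833.698,   # armreach from back
    "X2": 182.845,   # hand length
    "X3": 246.726,   # foot length
    "X4": 992.421,   # lower extremity length
}

# Coefficients of equation (4); positive signs are an assumption.
intercept = 339.54799
coef = {"X1": 0.27558, "X2": 0.7695, "X3": 1.06205, "X4": 0.71095}

predicted_mean_height = intercept + sum(coef[k] * means[k] for k in coef)
table_mean_height = 1677.605  # average height in Table 1 (mm)

print(round(predicted_mean_height, 2), table_mean_height)
```

Under these assumptions the two numbers agree to within rounding of the published coefficients, which supports the reading of the signs and of the variable order.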

3.2. Predicting weight by optimal subset regression


We now apply the same method to analyze weight with the four variables. We draw scatter plots of
weight against the other factors based on the data.

Fig.3 The scatter plot of weight with other factors

Table 3 Regression equation of weight and the related detection value
items X                   weight Y               T value  Adjusted R-squared  F-statistic
armreach from back        Y=1.1308X-346.0583     25.641   0.2472              657.5
hand length               Y=3.9348X-122.7905     18.974   0.1522              350
foot length               Y=3.3096X-219.8888     20.979   0.1801              440.1
lower extremity length    Y=0.74869X-146.34772   19.714   0.1624              388.6
We find that all the adjusted R-squared values are less than 0.3. This indicates that a single
variable does not fit weight well, so we introduce all the variables into the regression. The
resulting regression equation shows that the coefficient of X2 is not significant. The next step
is to select variables so that all remaining coefficients are significant.
First, we compute the RMSq and Cp statistics for all possible subsets of the independent
variables. RMSq is defined by
$$\mathrm{RMS}_q = \frac{1}{n - p}\,\mathrm{RSS}_q, \tag{5}$$
where $\mathrm{RSS}_q = y^{\top}\left(I - X_q (X_q^{\top} X_q)^{-1} X_q^{\top}\right) y$ is the robust residual sum of squares from
a robust fit of the selected model.
Mallows' Cp is a criterion for model selection in regression (Mallows 1974). The Cp statistic
assesses fit when models with different numbers of parameters are being compared. It is given by
$$C_p = \frac{\mathrm{RSS}_q}{\hat{\sigma}^2} - (n - 2q), \tag{6}$$
where $\hat{\sigma}^2$ is an estimate of the error variance.
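
A minimal sketch of equation (6) follows. It assumes that q counts the fitted parameters including the intercept and that the error variance is estimated from the full model with p parameters as RSS_full / (n - p); both conventions are common for Mallows' Cp but are not spelled out in the text. The residual sum of squares used here is a made-up number for illustration.

```python
def mallows_cp(rss_q, sigma2_hat, n, q):
    """Equation (6): Cp = RSS_q / sigma^2 - (n - 2q)."""
    return rss_q / sigma2_hat - (n - 2 * q)

n = 2000            # sample size reported in the paper
p = 5               # assumed: intercept plus four predictors in the full model
rss_full = 9.39e6   # made-up residual sum of squares, for illustration only

# Assumed convention: the error variance is estimated from the full model.
sigma2_hat = rss_full / (n - p)

cp_full = mallows_cp(rss_full, sigma2_hat, n, p)
print(round(cp_full, 6))
```

Under this convention the full model always has Cp equal to its own parameter count, since RSS_full / sigma^2 = n - p. This is consistent with the value Cp = 5 reported for the subset X1,X2,X3,X4 in Table 4.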

Table 4 The RMSq and Cp values for the regression equation
Variable subset   Cp         RMSq
X2                981.9697   7015.616
X3                3.53E+33   8.31E+33
X1                3.62E+33   8.52E+33
X4                3.97E+33   9.36E+33
X1,X2             3.561356   4708.288
X2,X3             445.0235   5748.823
X2,X4             516.1024   5916.357
X1,X3             3.36E+33   7.92E+33
X3,X4             3.49E+33   8.22E+33
X1,X4             3.57E+33   8.42E+33
X1,X2,X4          3.543942   4705.89
X1,X2,X3          5.517213   4710.543
X2,X3,X4          340.461    5500.407
X1,X3,X4          3.36E+33   7.92E+33
X1,X2,X3,X4       5          4706.965
By the properties of RMSq and Cp, we choose the subset of independent variables with the smallest
value of each criterion. From the table we can see that both the RMSq and Cp criteria choose
{X1,X2,X4}, a subset without X3, as the optimal subset. Thus, the optimal subset regression that
can be used for prediction is
$$Y = -472.24592 + 0.83625 X_1 + 0.8322 X_2 + 0.22124 X_4. \tag{7}$$
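
The selection rule described above (take the subset with the smallest criterion value) can be replayed on the numbers of Table 4. This is only a lookup over the published table values, not a refit of the models; the dictionary layout and variable names are illustrative.

```python
# (Cp, RMSq) for each variable subset, copied from Table 4.
table4 = {
    "X2": (981.9697, 7015.616),
    "X3": (3.53e33, 8.31e33),
    "X1": (3.62e33, 8.52e33),
    "X4": (3.97e33, 9.36e33),
    "X1,X2": (3.561356, 4708.288),
    "X2,X3": (445.0235, 5748.823),
    "X2,X4": (516.1024, 5916.357),
    "X1,X3": (3.36e33, 7.92e33),
    "X3,X4": (3.49e33, 8.22e33),
    "X1,X4": (3.57e33, 8.42e33),
    "X1,X2,X4": (3.543942, 4705.89),
    "X1,X2,X3": (5.517213, 4710.543),
    "X2,X3,X4": (340.461, 5500.407),
    "X1,X3,X4": (3.36e33, 7.92e33),
    "X1,X2,X3,X4": (5.0, 4706.965),
}

# Pick the subset minimizing each criterion separately.
best_by_cp = min(table4, key=lambda s: table4[s][0])
best_by_rmsq = min(table4, key=lambda s: table4[s][1])

print(best_by_cp, best_by_rmsq)
```

Both criteria point to the same subset here, which is why a single model can be reported without having to trade one criterion off against the other.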

3.3. Predicting weight by stepwise regression


Variables can also be selected by stepwise regression. Stepwise regression of a multivariate
regression equation is based on the least squares principle: factors that have little or no
effect on the dependent variable are removed, the significant factors are retained, and the
optimal regression model is obtained. Since Akaike (1978) put forward the AIC (Akaike Information
Criterion) from the maximum likelihood principle and information theory, the method of successive
hypothesis tests in model selection has gradually been replaced. The AIC criterion is a
generalization of the maximum likelihood principle. Let the parameter vector $\theta$ of the k-th
model $M_k$ lie in the parameter space $\Theta_k$, denote the likelihood function by $L(\theta)$, and let
$$L_k = \sup_{\theta \in \Theta_k} L(\theta), \quad k = 0, \ldots, p - 1. \tag{8}$$
According to the AIC criterion, we select the model $M_{\hat{q}}$, where
$$\hat{q} = \arg\min\{\mathrm{AIC}(k) : k = 0, \ldots, p - 1\}, \tag{9}$$
and $\mathrm{AIC}(k) = -\log L_k + v(k)$. Here, $v(k)$ is the number of free parameters to be
estimated within $\Theta_k$.
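
As a sketch of how equation (9) is applied in practice, assume a Gaussian linear model whose error variance is estimated by maximum likelihood. The maximized log-likelihood then depends on the data only through the residual sum of squares: log L_k = -(n/2)(log(2*pi*RSS_k/n) + 1). The Gaussian assumption, the candidate models and their RSS values below are all illustrative and are not taken from the paper.

```python
import math

def gaussian_max_log_likelihood(rss, n):
    """Log-likelihood of a Gaussian linear model, maximized over coefficients and variance."""
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)

def aic(rss, n, n_params):
    """AIC(k) = -log L_k + v(k), in the form used with equation (9)."""
    return -gaussian_max_log_likelihood(rss, n) + n_params

# Hypothetical candidates: name -> (residual sum of squares, number of free parameters).
candidates = {
    "M0": (1200.0, 2),
    "M1": (900.0, 3),
    "M2": (899.0, 4),
}
n = 100  # hypothetical sample size

# Equation (9): choose the model with the smallest AIC.
best = min(candidates, key=lambda m: aic(candidates[m][0], n, candidates[m][1]))
print(best)
```

The extra parameter in M2 lowers the residual sum of squares only marginally, so its penalty term outweighs the gain in likelihood and the middle model is selected. This is the trade-off the criterion is designed to express.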
Table 5 The result of variable selection using AIC
The variables in the model   Adjusted R-squared   AIC
X1,X2,X3,X4                  0.2735               16922.6
X1,X2,X4                     0.2624               16951.81
X1,X3,X4                     0.2738               16920.68
The set of variables that minimizes AIC is the optimal selection.

By the AIC criterion, the model with the variables X1, X3 and X4 is the best. Accordingly, the
regression equation is
$$Y = -524.66568 + 0.7871 X_1 + 1.2947 X_3 + 0.14681 X_4. \tag{10}$$

4. Discussion
In this paper, we establish equations relating male body measurements to height and weight, based
on physical measurements of 2000 men. When only one of the four variables is used, the
best-fitting variable is lower extremity length. With the resulting linear regression equation,
more than 84.2% of subjects have a height prediction error within ±5 cm. When all the limb data
are used to predict height, the share of subjects within the 5 cm error range rises to 89.5%.
Therefore, the more variables are used, the more accurate the height prediction. Weight
prediction, however, does not follow the same pattern. Within an error range of 50 hg, using
three variables gives a higher prediction accuracy than using all variables. Although the
difference between the selected variable sets is very small, we still choose the more
appropriate model.
The above equations each have their own applicable conditions. In an actual estimate of height
and weight, the most appropriate indicators should be selected according to the specific
circumstances, which narrows the error and gives a more reliable estimate. The subjects of this
study are adult males, so the regression equations established here have limitations when applied
to the whole population. Estimating the height and weight of females, and estimating by age
group, require further exploration and research.
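
The accuracy figures quoted above are shares of subjects whose prediction error falls within a tolerance. A minimal sketch of that calculation is given below; the function name and the five sample heights are hypothetical and only illustrate the definition.

```python
def share_within(actual, predicted, tolerance):
    """Fraction of cases whose absolute prediction error is at most the tolerance."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)

# Hypothetical heights in mm; a tolerance of 50 mm corresponds to 5 cm.
actual = [1650, 1702, 1688, 1731, 1599]
predicted = [1661, 1640, 1690, 1700, 1652]

print(share_within(actual, predicted, 50))
```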

Acknowledgement
This paper is supported by grants from the National Key R&D Program of China (2016YFF0204205) and
the China National Institute of Standardization (712016Y-4941-2016, 522016Y-4681-2016).

