Machine Learning Project Car Price Prediction Algorithm
Intent:
Train a Machine Learning algorithm and predict Car price based on the selected features.
Method:
The Machine Learning method used for this project, Car Price Prediction, is Multiple Linear
Regression with Gradient Descent.
Process:
1. Packages:
• Numpy
• Pandas
• Matplotlib.pyplot
2. Data:
The dataset was collected from PAKWHEELS, processed and saved in CSV (comma delimited) format.
To access the data, the ‘Pandas’ library function read_csv() is used. The data is then separated
into features and label, as ‘x_train’ and ‘y_train’, respectively.
The data() function returns ‘x_train’ and ‘y_train’ to the main() function for further use.
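The loading step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names, the label column name ‘price’ and the sample values are assumptions made up for the example, and an in-memory string stands in for the real CSV file.

```python
import io

import pandas as pd


def data(csv_source, label_column="price"):
    """Read the CSV and split it into features (x_train) and label (y_train)."""
    frame = pd.read_csv(csv_source)
    # The label is the column to predict; every other column is a feature.
    y_train = frame[label_column].to_numpy(dtype=float)
    x_train = frame.drop(columns=[label_column]).to_numpy(dtype=float)
    return x_train, y_train


# Tiny in-memory stand-in for the real CSV file (hypothetical columns and values).
sample_csv = io.StringIO(
    "year,mileage,engine_cc,price\n"
    "2015,60000,1300,1.8\n"
    "2018,30000,1000,2.1\n"
    "2010,120000,1800,1.2\n"
)
x_train, y_train = data(sample_csv)
print(x_train.shape, y_train.shape)  # (3, 3) (3,)
```

With the real file, a path such as the CSV file name would be passed in place of the in-memory buffer.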
3. Normalization
The feature set is then normalized using the formula
X = (X – MEAN) / (STANDARD DEVIATION)
The normal() function returns ‘x_train’, ‘data_mean’ and ‘data_std’ to the main() function.
These are the normalized features, the mean of the features and the standard deviation of
the features, respectively. The ‘data_mean’ and ‘data_std’ values are reused in prediction.py
to normalize the TEST INPUT before predicting the car price.
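A possible shape for the normal() function is sketched below. It assumes the mean and standard deviation are taken per feature column and that no column is constant (a zero standard deviation would cause a division by zero); the sample array is made up.

```python
import numpy as np


def normal(x_train):
    """Standardize each feature column: (X - mean) / standard deviation."""
    data_mean = x_train.mean(axis=0)  # one mean per feature column
    data_std = x_train.std(axis=0)    # one standard deviation per feature column
    x_norm = (x_train - data_mean) / data_std
    return x_norm, data_mean, data_std


# Made-up feature matrix with two columns on very different scales.
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_norm, data_mean, data_std = normal(x)
```

After this step every column has mean 0 and standard deviation 1, which keeps features with large raw values from dominating the gradient updates.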
4. Complete Data
The feature set is completed for processing by adding the ‘bias’ unit for all the rows, in the
data_complete() function, using the numpy.ones() function.
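The bias step can be sketched as below. Placing the column of ones first is an assumption of this example; the report only states that a column of ones is added.

```python
import numpy as np


def data_complete(x):
    """Add a bias column of ones so theta can include an intercept term."""
    ones = np.ones((x.shape[0], 1))
    # Assumption: the bias column goes first; the report does not say where it is placed.
    return np.hstack((ones, x))


x = np.array([[0.5, -1.0], [1.5, 0.0], [-2.0, 1.0]])
x_complete = data_complete(x)
```

A feature set with 9 columns becomes one with 10 columns, which matches the dimensions reported in the Training section.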
5. Parameter Initialization
The parameter vector ‘theta’ is randomly initialized using numpy.random.rand() and is
returned to the main() function.
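A minimal sketch of this initialization follows. The function name initialize() is hypothetical, and the column-vector shape is an assumption based on how the theta values are printed in the Training section.

```python
import numpy as np


def initialize(n_columns):
    """Return one random parameter in [0, 1) per column of the completed feature set."""
    return np.random.rand(n_columns, 1)


# 9 features plus the bias unit give 10 parameters.
theta = initialize(10)
```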
6. Gradient Descent
The gradient() function minimizes the cost by updating the parameter theta iteratively. On
each iteration the theta values are updated and the resulting cost is recorded, which
generates a cost history. The last training cost is then multiplied by one million to give
the final cost for the training set.
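The update loop can be sketched as batch gradient descent. The report does not state the cost formula, so the halved mean squared error used here is an assumption, as are the learning rate ‘alpha’, the iteration count and the toy data.

```python
import numpy as np


def cost(x, y, theta):
    """Halved mean squared error (assumed; the report does not give its cost formula)."""
    m = x.shape[0]
    error = x @ theta - y
    return float(np.sum(error ** 2) / (2 * m))


def gradient(x, y, theta, alpha=0.1, iterations=500):
    """Update theta iteratively and record the cost after every update."""
    m = x.shape[0]
    cost_history = []
    for _ in range(iterations):
        h = x @ theta                                   # hypothesis for all rows
        theta = theta - (alpha / m) * (x.T @ (h - y))   # simultaneous update of all parameters
        cost_history.append(cost(x, y, theta))
    return theta, cost_history


# Toy data generated from y = 1 + 2 * x1; the first column is the bias unit.
x = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[-1.0], [1.0], [3.0]])
theta, cost_history = gradient(x, y, np.zeros((2, 1)))
```

On this toy data theta converges to approximately 1 and 2, and the cost history decreases towards zero, which is the behaviour a plot of the cost history should show.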
7. Mean Absolute error
The mae() function calculates the mean absolute error for the training set, the CV set and
the test set. It is called in the main() function and its results are displayed to the user
with print().
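A sketch of mae() based on the np.sum(abs(h - y)) / m formula quoted in the Training section is given below. The function signature and the sample numbers are assumptions for illustration.

```python
import numpy as np


def mae(x, y, theta):
    """Mean absolute error between the hypothesis h = x . theta and the labels y."""
    m = x.shape[0]
    h = x @ theta
    return float(np.sum(np.abs(h - y)) / m)


# Made-up example: predictions are 1 and 3, labels are 2 and 1.
x = np.array([[1.0, 0.0], [1.0, 1.0]])
theta = np.array([[1.0], [2.0]])
y = np.array([[2.0], [1.0]])
error = mae(x, y, theta)
print(error)  # 1.5
```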
Flow of Algorithm:
Train Model:
Retrieve Data
Define Features and Label
Append bias units
1. Generate Random Theta Values
2. Calculate Hypothesis
4. Gradient Descent: Update values of theta iteratively to generate cost history
Training:
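In the training part, the data is read from a CSV file and separated into a training feature set and
a training label set. For the data set provided, the training feature set has 16281 rows and 9 columns
and the training label set has 16281 entries, consistent with the completed dimensions listed below.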
The data is then divided into a second pair of sets, the cross-validation feature set and the
cross-validation label set. The dimensions of these feature and label sets are
(5425, 9) (5425,) respectively.
After that the last pair of sets, the test feature set and the test label set, is created
similarly. The dimensions of this pair of sets are (5428, 9) (5428,).
These sets are then completed by adding the bias units, that is, by appending a column of ones,
so the resulting dimensions become (16281, 10) (16281, 1) for the training set,
(5425, 10) (5425, 1) for the cross-validation set and (5428, 10) (5428, 1) for the test set.
Random theta values are initialized
[0.62653841]
[0.11832303]
[0.0742843 ]
[0.42119429]
[0.39886133]
[0.27029176]
[0.76941718]
[0.92276763]
[0.51739262]].
After completing the sets, the cost function is evaluated on the training set before and after
training, and on the cross-validation set.
The final cost for the training set and the cross-validation set is also plotted as a graph.
The mean absolute error is computed with the formula error = np.sum(abs(h - y)) / m,
where all the variables used have already been defined in the code.
Prediction:
The training2 code file is imported into the prediction file so that the values calculated
during training can be reused.
In the prediction() function an array holding all nine features is created, and the data is normalized
using the mean and standard deviation computed in the training2 code file.
The array is then extended with a one (the bias unit) so that the dot product of the test data with
the final value of theta can be calculated.
The value obtained in the previous step is multiplied by one million to get an appropriate price for the
car whose price needs to be predicted.
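The prediction steps above can be sketched as follows. For brevity the example uses two features instead of nine, and the mean, standard deviation and theta values are made up; in the project they would come from the training2 code. Placing the bias unit first is also an assumption.

```python
import numpy as np


def prediction(features, data_mean, data_std, theta):
    """Normalize one test input, add the bias unit and return the price."""
    x = (np.asarray(features, dtype=float) - data_mean) / data_std
    x = np.hstack(([1.0], x))  # bias unit first (assumed position)
    # The model output is scaled by one million to obtain the price.
    return float(x @ theta.ravel() * 1_000_000)


# Hypothetical training statistics and parameters for two features.
data_mean = np.array([2015.0, 50000.0])
data_std = np.array([5.0, 25000.0])
theta = np.array([[2.0], [0.5], [-0.25]])

price = prediction([2020, 25000], data_mean, data_std, theta)
print(price)  # 2750000.0
```

The test input must be normalized with the training mean and standard deviation, not with statistics of its own, because theta was fitted on features scaled that way.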
Conclusion:
The program predicts car prices by dividing the provided data into training, cross-validation
and test sets. The features are normalized, and theta is fitted with gradient descent on the
training set to generate the prediction results.