Predicting House Prices
Objective: Use Simple Linear Regression to predict house prices (medv) from house size, measured here by the average number of rooms per dwelling (rm).
Dataset: https://www.kaggle.com/c/boston-housing
Tasks:
1. Load and explore the dataset.
2. Create a scatter plot to visualize the relationship between house size and price.
3. Implement Simple Linear Regression to predict prices.
4. Evaluate the model's performance using R² and Mean Squared Error (MSE).
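For reference, the model in task 3 and the two metrics in task 4 can be written out as below. The notation is an assumption introduced here rather than taken from the notebook: $x_i$ is the rm value and $y_i$ the medv value of observation $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the observed values, and $n$ is the number of observations being scored.

```latex
\begin{align*}
\hat{y}_i &= \beta_0 + \beta_1 x_i \\
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
\end{align*}
```

Here $\beta_1$ corresponds to the slope and $\beta_0$ to the intercept reported by the fitted model.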
import os
import pandas as pd
from google.colab import drive
# Mount Google Drive so the dataset path below is reachable
drive.mount('/content/drive')
data_path = '/content/drive/MyDrive/nkphd/bostan/'
# Load the datasets
submission_example = pd.read_csv(os.path.join(data_path, 'submission_example.csv'))
train = pd.read_csv(os.path.join(data_path, 'train.csv'))
test = pd.read_csv(os.path.join(data_path, 'test.csv'))
# Display first few rows to confirm
print("Submission Example:")
print(submission_example.head())
print("\nTrain Dataset:")
print(train.head())
print("\nTest Dataset:")
print(test.head())
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
# Define dataset path
data_path = '/content/drive/MyDrive/nkphd/bostan/'
# Load train dataset
train = pd.read_csv(data_path + 'train.csv')
# Scatter plot of the relationship between house size ('rm') and price ('medv')
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.scatterplot(x=train['rm'], y=train['medv'])
plt.title('Relationship Between House Size (RM) and Price (MEDV)',
fontsize=14)
plt.xlabel('Average Number of Rooms per Dwelling (RM)', fontsize=12)
plt.ylabel('Median Value of Owner-Occupied Homes (MEDV)', fontsize=12)
plt.grid(True)
plt.show()
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
# Load the train dataset
data_path = '/content/drive/MyDrive/nkphd/bostan/'
train = pd.read_csv(data_path + 'train.csv')
# Implement Simple Linear Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Prepare the data
X = train[['rm']] # Average number of rooms per dwelling
y = train['medv'] # Median value of owner-occupied homes
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize and fit the linear regression model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
# Predict on the test set
y_pred = linear_regressor.predict(X_test)
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Display results
print("Mean Squared Error (MSE):", mse)
print("R-squared (R^2):", r2)
print("Coefficient (Slope):", linear_regressor.coef_[0])
print("Intercept:", linear_regressor.intercept_)
Mean Squared Error (MSE): 36.361622515889756
R-squared (R^2): 0.5959747117709422
Coefficient (Slope): 8.584424490365215
Intercept: -30.96185860010203
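With a single feature, ordinary least squares has a closed-form solution: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below illustrates that calculation on made-up numbers. The function `fit_simple_ols` and the toy values are hypothetical and are not part of the notebook or the Boston data.

```python
def fit_simple_ols(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (hypothetical): room counts and prices
toy_rooms = [4, 5, 6, 7, 8]
toy_prices = [10, 15, 22, 27, 36]

toy_slope, toy_intercept = fit_simple_ols(toy_rooms, toy_prices)
print("Slope:", toy_slope)
print("Intercept:", toy_intercept)
```

Fitting `LinearRegression` on the same toy inputs should give a matching `coef_[0]` and `intercept_`, up to floating-point error.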
from sklearn.metrics import mean_squared_error, r2_score
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
# Calculate R²
r2 = r2_score(y_test, y_pred)
# Print the results
print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)
Mean Squared Error (MSE): 36.361622515889756
R-squared (R²): 0.5959747117709422
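To make the two metrics above less of a black box, here is a minimal sketch that computes them from their definitions in plain Python. The helper names and the toy values are hypothetical, chosen only for illustration. They are not the notebook's `y_test` and `y_pred`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical)
toy_actual = [20.0, 25.0, 30.0, 35.0]
toy_predicted = [22.0, 24.0, 29.0, 37.0]

toy_mse = mean_squared_error_manual(toy_actual, toy_predicted)
toy_r2 = r2_score_manual(toy_actual, toy_predicted)
print("Toy MSE:", toy_mse)
print("Toy R²:", toy_r2)
```

An R² of 1 would mean the predictions match the actual values exactly, while 0 would mean the model does no better than always predicting the mean.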
Evaluation of performance using R² and Mean Squared Error (MSE).
The Simple Linear Regression model was evaluated on the held-out 20% test split using MSE and
R². The MSE was 36.36, the mean of the squared differences between predicted and actual medv
values; its square root (RMSE) is about 6.03, in the same units as medv. The R² score of 0.596
means that the number of rooms per dwelling explains 59.6% of the variance in house prices on the
test split. The slope of 8.58 means that each additional room is associated with an increase of
8.58 units in the predicted price. The model is a moderate fit, and including more features could
improve accuracy.
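As a rough illustration of how to read these numbers, the sketch below plugs the reported slope, intercept, and MSE into the fitted line. The constants are copied from the output above. The function name `predict_medv` and the room counts are hypothetical, chosen only for the example.

```python
import math

# Values copied from the output reported in this notebook
SLOPE = 8.584424490365215
INTERCEPT = -30.96185860010203
MSE = 36.361622515889756


def predict_medv(rooms):
    """Predicted medv for a given average number of rooms, using the fitted line."""
    return INTERCEPT + SLOPE * rooms


# RMSE is the square root of MSE, expressed in the same units as medv
rmse = math.sqrt(MSE)
print("RMSE:", round(rmse, 2))

# Hypothetical room counts chosen for illustration
for rooms in (5, 6, 7):
    print(rooms, "rooms ->", round(predict_medv(rooms), 2))
```

Each step of one room changes the prediction by exactly the slope. Because the intercept is negative, the line crosses zero at roughly 3.6 rooms, so predictions below that point are negative and should not be read literally.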