Skip to content

Data preprocessing, feature engineering, and EDA for "Big Mart Sales" data set using SAS Studio. The dataset is taken from Kaggle (https://www.kaggle.com/mrmorj/big-mart-sales).

Notifications You must be signed in to change notification settings

caesarmario/big-mart-sales-preprocessing-SAS-studio

Repository files navigation

🛒 Big Mart Sales Data Preprocessing & EDA 🛒

using SAS Studio 🖥


Star Badge Linktree


📃 Table of Contents:


🖋 About Project

👉 Dataset is taken from Kaggle.
👉 In this project, will perform:

  • Initial data exploration
  • Data Preprocessing
  • EDA
  • Hypothesis testing (statistical and non-statistical)
  • Feature engineering (one-hot encoding, label encoding, and binning)

📌 Objectives

  • Perform initial data exploration
  • Perform data preprocessing
  • Perform EDA
  • Perform hypothesis testing (statistical and non-statistical)
  • Perform feature engineering (one-hot encoding, label encoding, and binning)

🧾 Data Set Description

👉 There are 12 variables in this data sets:

  • 5 categorical variables,
  • 5 continuous variables,
  • 1 variable to accommodate the Item identifier (ID), and
  • 1 variable to accommodate the Outlet identifier (ID).

👉 The variables available in this dataset are:

Column Name Data Type Type Description
Item_Identifier char Nominal Product ID
Item_Weight num Ratio Weight of product
Item_Fat_Content char Nominal Content of product (low fat or regular)
Item_Visibility num Nominal The percentage of all products in the store that are assigned to a specific product in the total display area
Item_Type char Nominal Category of product
Item_MRP num Ratio Maximum retail price of a product
Outlet_Identifier char Nominal Store ID
Outlet_Establishment_Year num Nominal Year the store established
Outlet_Size char Ordinal Size of the store
Outlet_Location_Type char Ordinal The type of city where the store is located
Outlet_Type char Ordinal Type of the store
Item_Outlet_Sales num Nominal Sales of product


📊 Initial Data Exploration

🏛 Data Sets Summary:



🔠 Character Column Exploration:

  • Item_Identifier
    Item_Identifier

    • It can be seen that there are 1559 different item IDs in the “Item_Identifier” column.
    • Besides, there are “Item_Identifier” values that start with the same specific character, such as “DR”, “FD”, and “NC”, followed by numbers at the end of a value.

  • Item_Fat_Content
    Item_Fat_Content

    • It can be seen that most of the products are categorized as “Low Fat” products with a percentage of 59.71%.
    • It can be seen that there are inconsistent values have the same meaning but in different values, such as “LF”, “low fat”, and “reg”.
    • These variables will be preprocessed in the next section.

  • Item_Type
    Item_Type

    • There are 16 item types in the dataset, with “Fruits and vegetables” became the item type with the highest number, with a percentage of 14.46%.
    • However, the lowest number item type is “Seafood”, which only has a percentage of 0.75%.

  • Outlet_Identifier
    Outlet_Identifier

    • It can be seen there are ten outlets in the dataset, with all the outlets percentage have almost the same number of outlets (about 10%).
    • However, the percentage of outlets “OUT10” and “OUT19” have the least number of outlets, only 6%.

  • Outlet_Size
    Outlet_Size

    • There are 3 types of outlet size, and “Medium” size becomes the size of outlet with the most number in the dataset with a percentage of 45.69%.
    • The smallest number outlet size is the “High” size which only has the percentage of 15.25%.

  • Outlet_Location_Type
    Outlet_Location_Type

    • There are 3 levels of outlet, and “Tier 3” becomes the tier of outlet with the most number in the dataset with a percentage of 39.31%.
    • The smallest tier of outlet is the “Tier 1”, which only have a percentage of 28.02%.

  • Outlet_Type
    Outlet_Type

    • It can be seen there are four outlet types in this dataset, and “Supermarket Type 1” become the type of outlet with the most number in the dataset with a percentage of 65.43%.
    • The smallest number of outlet types are “Supermarket Type 2” with only a percentage of 10.89% and “Supermarket Type 3” with only 10.97%.

🔢 Continuous Column Exploration:

  • Item_Weight
    Item_Weight

    • It can be seen that the distribution is normal, which means no outliers in this column.

  • Item_Visibility
    Item_Visibility_1
    Item_Visibility_2

    • This figure shows that the distribution is right-skewed distributions, which has a long right tail, and the mean position is on the right side of the data.
    • It can be seen that there are outliers in this column and the minimum value in this column is 0.
    • These outliers will be pre processed in the next section.

  • Item_MRP
    Item_MRP

    • It can be seen that the distribution is normal (no outliers detected).

  • Outlet_Establishment_Year
    Outlet_Establishment_Year

    • It can be seen that the distribution is normal, which means no outliers in this column.

  • Item_Outlet_Sales
    Item_Outlet_Sales
    Item_Outlet_Sales

    • It can be seen that the distribution is right-skewed distributions, which has a long right tail, and the mean is usually to the right of the median of the data.
    • It can be seen that there are outliers value in this column.
    • These outliers will be pre processed in the next section.

❓ Missing Values:

Missing Values_1
Missing Values_2
👉 It can be seen that there are missing values in the “Item_Weight” column (1463 missing values) and the “Outlet_Size” column (2410 missing values).
👉 These missing values will be pre processed in the next section.



⚙ Data Pre-processing

🧹 Handling Dirty Data:

👉 The previous section already mentioned inconsistent values in the “Item_Fat_Content” column, which have the same meaning but in different forms.

Handling Dirty Data

👉 It can be seen now that the inconsistent values have been replaced into the same values, such as “LF” to “Low Fat”, “reg” to “Regular”, etc.

❓ Handling Missing Values:

  • Outlet_Size
    Handling Missing Values_1
    👉 It can be seen that the “Small” outlet size belongs to “Grocer Store” and “Supermarket Type1” outlet type.
    👉 However, if the total “Outlet_Size” outlets are added up, the “Medium” size outlet is much bigger than the “Small” size outlets.
    👉 In this case, it was decided to fill in the missing data by following the previous data (last value replacement method) to reduce the biased of analysis.

Handling Missing Values_2
👉 Based on "Outlet_Identifier" column, it can be seen that “OUT010”, “OUT017”, and “OUT045” outlets have missing values in the “Outlet_Size” column.

Handling Missing Values_3
👉 The table shows the comparison between before and after imputing missing values in the “Outlet_Size” column.
👉 It can be seen that the missing values have disappeared, and outlet size for “OUT010”, “OUT017”, and “OUT045” already imputed.

  • Item_Weight
    Item_Weight based on Item_Identifier - 1
    👉 It can be seen that there are some previous “Item_Weight” values in the dataset based on the “Item_Identifier” column.
    👉 In this case, it was decided to fill these missing values by following the previous data (last value replacement method) to reduce the biased of analysis that will be performed in the next section.

Item_Weight based on Item_Identifier - 2
👉 There are still missing values in the “Item_Weight” column.
👉 This is because the item ID does not have the value in the previous record.
👉 The values will be filled with the “Item_Weight” column's mean value.

Item_Weight based on Item_Identifier - 3
👉 The table shows the comparison between before and after imputing missing values in the “Item_Weight” column.
👉 It can be seen that the missing values have disappeared.

🔧 Handling Outliers:

👉 From previous section, it can be seen that "Item_Visibility" and "Item_Outlet_Sales" have outliers.
👉 This section will transform the column values using log and square root transformation, then compare which one is the best transformation for this case.

Item_Visibility & Item_Outlet_Sales Transformation

👉 It can be seen that the square root transformation is better than the log transformation for these two columns since the frequency distribution of the square root transformation is closer to the normal distribution even though there are still outliers in both columns.



📈 EDA

🧮 Pearson Correlation:

EDA 1

👉 It can be seen there is a high correlation between the “Item_MRP” and the square root transformation of “Item_Outlet_Sales”, with a coefficient of 0.56343.
👉 there is also a weak correlation between “Item_MRP” with “Item_Weight”, with a coefficient of 0.02597.

🥧 Item Fat Content based on Item Type:

EDA 2

👉 As can be seen, the “Household” item becomes the item type with the most number in the “Low Fat” category with a percentage of 16.49%. 👉 However, the item type with the smallest number in the “Low Fat” category is the “Baking Good” items with a percentage of 5.96%.
👉 In addition, “Fruits and Vegetables” is the most number in the “Regular” category with a percentage of 20.03%.
👉 However, “Meat” is the item with the smallest number in the “Regular” category with a percentage of 8.48%.

📉 Outlet Location Type based on Outlet Type:

EDA 3

👉 In general, “Supermarket Type1” is an outlet type owned only by the “Tier 2” outlet locations.
👉 In addition, the “Supermarket Type1” outlet type has the most “Tier 1” outlet locations, with a percentage of 21.8% compared to the “Grocery Store” outlet type with a percentage of 6.2%.
👉 The total number of “Tier 3” outlet locations is almost the same for each outlet type, approximately 11%.

💹 Outlet Identifier based on Outlet Size:

EDA 4

👉 Most outlets have more “Small” size-types than the other size types, and the number each outlet has is approximately the same.
👉 Other than that, outlets “OUT046”, “OUT045”, “OUT035”, “OUT019”, “OUT017”, and “OUT010” only have the “Small” size type.
👉 However, the “OUT019” and “OUT010” outlets have fewer outlets than the other outlets with a similar type size, around 500 outlets.
👉 “OUT049”, “OUT027”, and “OUT018” only have the size type “Medium,” and “OUT013” only have the size type “High”.

📊 Outlet Identifier based on Outlet Type:

EDA 5

👉 In general, the data for “Supermarket Type1” has symmetrical data (the median is in the middle of the box plot), which means that the data is normally distributed.
👉 The larger the outlet type, the more items are purchased at the outlet.



🧪 Hypothesis

1️⃣ Hypothesis 1:

  • H0: Outlet_Establishment_Year is normally distributed.
  • H1: Outlet_Establishment_Year is not normally distributed.

Hypo 1
Conclusion: H0 rejected
👉 From Kolmogorov-Smirnov test, it can be concluded that H0 rejected since the D value is < 1 and the p-value < 0.05.

2️⃣ Hypothesis 2:

  • H0: There is no heteroscedasticity between “Item_MRP” and “Item_Weight”.
  • H1: There is heteroscedasticity between “Item_MRP” and “Item_Weight”.

Hypo 2
Conclusion: H0 accepted
👉 It can be seen that there is no specific pattern for irregular points spread above and below the 0 axes on the Y-axis.
👉 It can be concluded that there is no heteroscedasticity between “Item_MRP” and “Item_Weight”.

3️⃣ Hypothesis 3:

  • H0: Tier 3 has the most numbers in “High” size outlets.
  • H1: Tier 3 does not have the most numbers in “High” size outlets.

Hypo 3
Conclusion: H0 rejected
👉 As can be seen, “Tier 3” has the most numbers in “Medium” size outlets with 1863 outlets, followed by “High” size outlets with 932 outlets.

4️⃣ Hypothesis 4:

  • H0: There was a decrease in the number of outlets from 1985 to 1998.
  • H1: There was an increase in the number of outlets from 1985 to 1998.

Hypo 4
Conclusion: H0 accepted
👉 It can be seen that there was a decrease in the number of outlets from 1463 outlets in the year 1985 to 555 outlets in the year 1998.

5️⃣ Hypothesis 5:

  • H0: The number of “Supermarket Type1” reached its peak in 1987.
  • H1: The number of “Supermarket Type1” not reached its peak in 1987.

Hypo 5
Conclusion: H0 accepted
👉 It can be seen that the number of “Supermarket Type1” reached its peak in 1987 with 932 outlets.
👉 However, in 2007, the number of “Supermarket Type1” only 926 outlets.



⚒ Feature Engineering

🧺 Binning

  • Item_Category
    Item_Category

    • It can be seen that for item ID in the "Item_Identifier" column that starts with character "DR" will be classified as "Drink, "FD" will be classified as "Food", and "NC" will be classified as "Non-Consumable".

  • Age_Outlet
    Age_Outlet

    • It can be seen that the age of an outlet is derived from the reduction in the year the outlet was established by 2021.

  • Outlet_Category
    Outlet_Category

    • It can be seen that the age of outlets is classified into three categories: "New Outlet" for outlets aged 1-10 years, "Moderate Outlet" for outlets aged 11-20 years, and "Old Outlet" for those over 20 years old

🔖 Label Encoding

👉 This section will perform label encoding for "Item_Fat_Content" column.
👉 The label encoding will be described as follows:

Column Name Level Encoded Label
Item_Fat_Content Low Fat 0
Regular 1

Item_Fat_Content_Encoding


🔥 One-Hot Encoding

  • Item_Category
    Item_Category_Enc

    • For "Drink" category will be included in group 1, "Food" will be included in group 2, and "Non-Consumable" will be included in group 3.

  • Outlet_Category
    Outlet_Category_Enc

    • For "New Outlet" category will be included in group 1, "Moderate Outlet" will be included in group 2, and "Old Outlet" will be included in group 3.

  • Outlet_Size
    Outlet_Size_Enc

    • For "Small" category will be included in group 1, "Medium" will be included in group 2, and "High" will be included in group 3.

  • Outlet_Location_Type
    Outlet_Location_Type_Enc

    • For "Tier 1" category will be included in group 1, "Tier 2" will be included in group 2, and "Tier 3" will be included in group 3.

  • Outlet_Type
    Outlet_Type_Enc

    • For "Grocery Store" category will be included in group 1, "Supermarket Type1" will be included in group 2, "Supermarket Type2" will be included in group 3, and "Supermarket Type3" will be included in group 4.



📜 New Metadata

Column Name Data Type Description Instances
Item_Identifier Char ProductID {FDA15, DRC01, FDN15, …}
Item_Category Char Category of a product
(Drink, Food, Non-Consumable)
{Food, Drink, Food, …}
Item_Category_GROUP_1 Num Encoded Item_Category
0≠ Drink
1= Drink
{0, 1, 0, …}
Item_Category_GROUP_2 Num Encoded Item_Category
0≠ Food
1= Food
{1, 0, 1, …}
Item_Category_GROUP_3 Num Encoded Item_Category
0≠ Non-Consumable
1= Non-Consumable
{0, 0, 0, …}
Item_Weight Num The product's weight {9.3, 5.92, 17.5, …}
Item_Fat_Content Char Content of product
(Low Fat or Regular)
{Low Fat, Regular, Low Fat, …}
Item_Fat_Content_Encoding Num Types of Product Content
0= Low Fat
1= Regular
{0,1, 0, …}
Item_Visibility Num The percentage of a store's overall display area that is dedicated to a single product. {0.016047301,0.019278216, 0.016760075, …}
Item_Type Char Category of product
(Soft Drinks, Dairy, Hard Drinks, Canned, Frozen Foods, Fruits and Vegetables, Snack Foods, Baking Goods, Starchy Foods, Meat, Seafood, Breakfast, Breads, Health and Hygiene, Household, Others)
{Dairy, Soft Drinks, Meat, …}
Item_MRP Num Maximum Retail Price of a product {249.8092,48.2692, 141.618, …}
Outlet_Identifier Char Outlet ID {OUT049,OUT018, OUT049, …}
Outlet_Establishment_Year Num Year the outlet established {1999,2009, 1999, …}
Age_Outlet Num Age of an outlet based on "Outlet_Establishment_Year" and 2021 {22,12, 22, …}
Outlet_Category Char Category of an outlet based on age
0-10= New Outlet
11-20= Moderate Outlet
>20= Old Outlet
{OldOutlet, Moderate Outlet, Old Outlet, …}
Outlet_Category_GROUP_1 Num Encoded Outlet_Category
0≠ New Outlet
1= New Outlet
{0, 0, 0, …}
Outlet_Category_GROUP_2 Num Encoded Outlet_Category
0≠ Moderate Outlet
1= Moderate Outlet
{0, 1, 0, …}
Outlet_Category_GROUP_3 Num Encoded Outlet_Category
0≠ Old Outlet
1= Old Outlet
{1, 0, 1, …}
Outlet_Size Char Size of the store
(Small, Medium, High)
{Medium, Medium, Medium, …}
Outlet_Size_GROUP_1 Num EncodedOutlet_Size
0≠ Small
1= Small
{0, 0, 0, …}
Outlet_Size_GROUP_2 Num EncodedOutlet_Size
0≠ Medium
1= Medium
{1, 1, 1, …}
Outlet_Size_GROUP_3 Num EncodedOutlet_Size
0≠ High
1= High
{0, 0, 0, …}
Outlet_Location_Type Char Thetype of city where the store is located(Tier1, Tier 2, Tier 3) {Tier1, Tier 3, Tier 1, …}
Outlet_Location_Type_GROUP_1 Num Encoded Outlet_Location_Type
0≠ Tier 1
1= Tier 1
{1, 0, 1, …}
Outlet_Location_Type_GROUP_2 Num Encoded Outlet_Location_Type
0≠ Tier 2
1= Tier 2
{0, 0, 0, …}
Outlet_Location_Type_GROUP_3 Num Encoded Outlet_Location_Type
0≠ Tier 3
1= Tier 3
{0, 1, 0, …}
Outlet_Type Char Type of the store
(Grocery Store, Supermarket Type1, Supermarket Type2, Supermarket Type3)
{Supermarket Type1, Supermarket Type2, Supermarket Type1, …}
Outlet_Type_GROUP_1 Num EncodedOutlet_Type
0≠ Grocery Store
1= Grocery Store
{0, 0, 0, …}
Outlet_Type_GROUP_2 Num Encoded Outlet_Type
0≠ Supermarket Type1
1= Supermarket Type1
{1, 0, 1, …}
Outlet_Type_GROUP_3 Num EncodedOutlet_Type
0≠ Supermarket Type2
1= Supermarket Type2
{0, 1, 0, …}
Outlet_Type_GROUP_4 Num Encoded Outlet_Type
0≠ Supermarket Type3
1= Supermarket Type3
{0, 0, 0, …}
Item_Outlet_Sales Num Sales of product {3735.138, 443.4228, 2097.27, …}




🙌 Support me!

👉 If you find this project useful, please ⭐ this repository 😆!

🎈 Check out my data pre-processing and feature engineering using Python on Kaggle here!


👉 More about myself: here

Releases

No releases published

Packages

No packages published

Languages