using SAS Studio 🖥
- About Project
- Objectives
- Data Set Description
- Initial Data Exploration
- Data Pre Processing
- EDA
- Hypothesis
- Feature Engineering
- New Metadata
👉 Dataset is taken from Kaggle.
👉 This project covers:
- Initial data exploration
- Data Preprocessing
- EDA
- Hypothesis testing (statistical and non-statistical)
- Feature engineering (one-hot encoding, label encoding, and binning)
- Perform initial data exploration
- Perform data preprocessing
- Perform EDA
- Perform hypothesis testing (statistical and non-statistical)
- Perform feature engineering (one-hot encoding, label encoding, and binning)
👉 There are 12 variables in this dataset:
- 5 categorical variables,
- 5 continuous variables,
- 1 variable to accommodate the Item identifier (ID), and
- 1 variable to accommodate the Outlet identifier (ID).
👉 The variables available in this dataset are:
Column Name | Data Type | Type | Description |
---|---|---|---|
Item_Identifier | char | Nominal | Product ID |
Item_Weight | num | Ratio | Weight of product |
Item_Fat_Content | char | Nominal | Content of product (low fat or regular) |
Item_Visibility | num | Ratio | Percentage of the store's total display area that is allocated to the product |
Item_Type | char | Nominal | Category of product |
Item_MRP | num | Ratio | Maximum retail price of a product |
Outlet_Identifier | char | Nominal | Store ID |
Outlet_Establishment_Year | num | Interval | Year the store was established |
Outlet_Size | char | Ordinal | Size of the store |
Outlet_Location_Type | char | Ordinal | The type of city where the store is located |
Outlet_Type | char | Ordinal | Type of the store |
Item_Outlet_Sales | num | Ratio | Sales of product |
- Item_Identifier
- There are 1559 distinct item IDs in the “Item_Identifier” column.
- The IDs start with a shared prefix, such as “DR”, “FD”, or “NC”, and end with numbers.
- Item_Fat_Content
- Most products are categorized as “Low Fat”, with a percentage of 59.71%.
- Some values are written differently although they mean the same thing, such as “LF”, “low fat”, and “reg”.
- These values will be cleaned in the next section.
- Item_Type
- There are 16 item types in the dataset. “Fruits and Vegetables” is the most frequent item type, with a percentage of 14.46%.
- “Seafood” is the least frequent item type, with a percentage of only 0.75%.
- Outlet_Identifier
- There are ten outlets in the dataset, and most of them account for almost the same share of the records (about 10%).
- Outlets “OUT010” and “OUT019” have the smallest shares, only 6%.
- Outlet_Size
- There are 3 outlet sizes. “Medium” is the most frequent size in the dataset, with a percentage of 45.69%.
- “High” is the least frequent size, with a percentage of only 15.25%.
- Outlet_Location_Type
- There are 3 location tiers. “Tier 3” is the most frequent tier in the dataset, with a percentage of 39.31%.
- “Tier 1” is the least frequent tier, with a percentage of only 28.02%.
- Outlet_Type
- There are four outlet types in this dataset. “Supermarket Type1” is the most frequent outlet type, with a percentage of 65.43%.
- The least frequent outlet types are “Supermarket Type2”, with only 10.89%, and “Supermarket Type3”, with only 10.97%.
- Item_Weight
- The distribution looks approximately normal, and no outliers are visible in this column.
- Item_Visibility
- The distribution is right-skewed: it has a long right tail, and the mean lies to the right of the median.
- There are outliers in this column, and the minimum value is 0.
- These outliers will be handled in the next section.
- Item_MRP
- The distribution looks approximately normal (no outliers detected).
- Outlet_Establishment_Year
- The distribution looks approximately normal, and no outliers are visible in this column.
- Item_Outlet_Sales
- The distribution is right-skewed: it has a long right tail, and the mean lies to the right of the median.
- There are outliers in this column.
- These outliers will be handled in the next section.
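
👉 The outlier check described above can be illustrated with the common 1.5 × IQR box plot rule. This is only a minimal Python sketch, not the SAS Studio code used in the project; the function name `iqr_outliers`, the multiplier `k`, and the sample values are assumptions made for the example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up visibility values with one large value in the right tail.
visibility = [0.0, 0.016, 0.019, 0.017, 0.05, 0.07, 0.33]
print(iqr_outliers(visibility))
```

Values flagged this way are candidates for transformation or further inspection, not automatically errors.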
👉 There are missing values in the “Item_Weight” column (1463 missing values) and the “Outlet_Size” column (2410 missing values).
👉 These missing values will be handled in the next section.
👉 The previous section noted inconsistent values in the “Item_Fat_Content” column, which have the same meaning but are written in different forms.
👉 The inconsistent values have now been replaced with consistent ones, such as “LF” with “Low Fat” and “reg” with “Regular”.
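
👉 The cleanup above amounts to mapping every variant spelling to one canonical label. A minimal Python sketch follows (the project itself uses SAS Studio); the mapping name `FAT_CONTENT_MAP` and the function `normalize_fat_content` are hypothetical, and only the variants named in this README are covered.

```python
# Variant spellings (lower-cased) mapped to the canonical label.
FAT_CONTENT_MAP = {
    "lf": "Low Fat",
    "low fat": "Low Fat",
    "reg": "Regular",
    "regular": "Regular",
}


def normalize_fat_content(value):
    """Return the canonical label, or the input unchanged if it is unknown."""
    return FAT_CONTENT_MAP.get(value.strip().lower(), value)


raw = ["Low Fat", "LF", "low fat", "reg", "Regular"]
print([normalize_fat_content(v) for v in raw])
```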
- Outlet_Size
👉 The “Small” outlet size belongs to the “Grocery Store” and “Supermarket Type1” outlet types.
👉 However, when the records are added up per “Outlet_Size”, the “Medium” size is much larger than the “Small” size.
👉 In this case, the missing data was filled by following the previous data (last value replacement method) to reduce bias in the analysis.
👉 Based on the "Outlet_Identifier" column, the “OUT010”, “OUT017”, and “OUT045” outlets have missing values in the “Outlet_Size” column.
👉 The table compares the “Outlet_Size” column before and after imputing the missing values.
👉 The missing values are gone, and the outlet size for “OUT010”, “OUT017”, and “OUT045” has been imputed.
- Item_Weight
👉 For some items, an “Item_Weight” value is available in a previous record with the same “Item_Identifier”.
👉 In this case, the missing values were filled by following the previous data (last value replacement method) to reduce bias in the analysis that will be performed in the next section.
👉 After this step, there are still missing values in the “Item_Weight” column.
👉 This is because those item IDs do not have a value in any previous record.
👉 The remaining values are filled with the mean of the “Item_Weight” column.
👉 The table compares the “Item_Weight” column before and after imputing the missing values.
👉 The missing values are gone.
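
👉 The two-step imputation of “Item_Weight” described above can be sketched as follows. This is a minimal Python illustration, not the project's SAS Studio code. The function name `impute_item_weight` and the row layout are hypothetical, and the fallback mean is assumed to be computed from the originally observed weights.

```python
from statistics import mean


def impute_item_weight(rows):
    """Fill missing weights from an earlier row of the same item, then the mean.

    rows: list of dicts with 'Item_Identifier' and 'Item_Weight' (None = missing).
    Returns new dicts; the input rows are left untouched.
    """
    last_seen = {}  # most recent known weight per item ID
    filled = []
    for row in rows:
        item, weight = row["Item_Identifier"], row["Item_Weight"]
        if weight is None:
            weight = last_seen.get(item)  # stays None if no earlier value exists
        else:
            last_seen[item] = weight
        filled.append({**row, "Item_Weight": weight})

    # Assumption: the fallback is the mean of the observed (non-missing) weights.
    fallback = mean(r["Item_Weight"] for r in rows if r["Item_Weight"] is not None)
    for row in filled:
        if row["Item_Weight"] is None:
            row["Item_Weight"] = fallback
    return filled


rows = [
    {"Item_Identifier": "FDA15", "Item_Weight": 9.3},
    {"Item_Identifier": "DRC01", "Item_Weight": 5.92},
    {"Item_Identifier": "FDA15", "Item_Weight": None},  # filled from the first row
    {"Item_Identifier": "FDN15", "Item_Weight": None},  # no earlier value: mean
]
filled = impute_item_weight(rows)
print([r["Item_Weight"] for r in filled])
```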
👉 The previous section showed that "Item_Visibility" and "Item_Outlet_Sales" have outliers.
👉 This section transforms the column values using log and square root transformations, then compares which transformation works better for this case.
👉 The square root transformation works better than the log transformation for these two columns: its frequency distribution is closer to the normal distribution, even though there are still outliers in both columns.
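
👉 The comparison above is based on the frequency distributions. One numeric way to make a similar comparison is to look at the skewness of each transformed column, where a value closer to 0 means a more symmetric distribution. The Python sketch below is only an illustration under that assumption; the `skewness` helper, the use of `log1p` (chosen so that a value of 0 can be transformed), and the sample values are not taken from the project.

```python
from math import log1p, sqrt
from statistics import fmean


def skewness(values):
    """Population skewness: third central moment over the cubed std deviation."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Made-up right-skewed sample standing in for a sales-like column.
sales = [1, 4, 9, 16, 25, 36, 400]
log_sales = [log1p(v) for v in sales]   # log(1 + x), defined for x = 0
sqrt_sales = [sqrt(v) for v in sales]

for name, column in [("raw", sales), ("log", log_sales), ("sqrt", sqrt_sales)]:
    print(name, round(skewness(column), 3))
```

Which transformation ends up closer to symmetric depends on the data, so the result for this toy sample says nothing about the real columns.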
👉 There is a moderate positive correlation between “Item_MRP” and the square root transformation of “Item_Outlet_Sales”, with a coefficient of 0.56343.
👉 The correlation between “Item_MRP” and “Item_Weight” is very weak, with a coefficient of 0.02597.
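
👉 The coefficients above are correlation coefficients between pairs of numeric columns. As a reminder of how such a coefficient is computed, here is a minimal Python sketch of the Pearson correlation (an assumption; the README does not say which coefficient was used). The `pearson` helper is hypothetical, and the three sample rows are far too few to reproduce the reported values.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Three sample rows only, for illustration.
item_mrp = [249.8092, 48.2692, 141.618]
sqrt_sales = [sqrt(v) for v in [3735.138, 443.4228, 2097.27]]
r = pearson(item_mrp, sqrt_sales)
print(round(r, 3))
```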
👉 “Household” is the most frequent item type in the “Low Fat” category, with a percentage of 16.49%.
👉 The least frequent item type in the “Low Fat” category is “Baking Goods”, with a percentage of 5.96%.
👉 In addition, “Fruits and Vegetables” is the most frequent item type in the “Regular” category, with a percentage of 20.03%.
👉 “Meat” is the least frequent item type in the “Regular” category, with a percentage of 8.48%.
👉 In general, “Tier 2” outlet locations only have the “Supermarket Type1” outlet type.
👉 In addition, in “Tier 1” outlet locations, “Supermarket Type1” is the most frequent outlet type, with a percentage of 21.8% compared to 6.2% for the “Grocery Store” outlet type.
👉 In “Tier 3” outlet locations, the share is almost the same for each outlet type, approximately 11%.
👉 More outlets have the “Small” size type than any other size type, and the number of records per outlet is approximately the same.
👉 Outlets “OUT046”, “OUT045”, “OUT035”, “OUT019”, “OUT017”, and “OUT010” only have the “Small” size type.
👉 However, “OUT019” and “OUT010” have fewer records (around 500) than the other outlets with the same size type.
👉 “OUT049”, “OUT027”, and “OUT018” only have the “Medium” size type, and “OUT013” only has the “High” size type.
👉 In general, the data for “Supermarket Type1” is symmetrical (the median is in the middle of the box plot), which is consistent with an approximately normal distribution.
👉 The larger the outlet type, the more items are purchased at the outlet.
- H0: Outlet_Establishment_Year is normally distributed.
- H1: Outlet_Establishment_Year is not normally distributed.
▶ Conclusion: H0 rejected
👉 Based on the Kolmogorov-Smirnov test, H0 is rejected because the p-value is < 0.05 (the reported D value is < 1).
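
👉 The D value in the test above is the largest gap between the empirical distribution of the data and the fitted normal distribution. The Python sketch below computes only this statistic, with the mean and standard deviation estimated from the sample; it does not compute a p-value and is not the SAS procedure used in the project. The function name `ks_statistic_normal` and both samples are made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Made-up sample with many repeated years, similar in shape to a year column.
years = [1985] * 5 + [1987] * 3 + [1997, 1998, 1999, 2002, 2004, 2007, 2009]
# Evenly spaced normal quantiles: a sample that is close to normal by construction.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 15) for i in range(15)]

d_years = ks_statistic_normal(years)
d_near = ks_statistic_normal(near_normal)
print(round(d_years, 3), round(d_near, 3))
```

D always lies between 0 and 1, so the decision to reject H0 comes from the p-value (or a critical value for the sample size), not from D being below 1.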
- H0: There is no heteroscedasticity between “Item_MRP” and “Item_Weight”.
- H1: There is heteroscedasticity between “Item_MRP” and “Item_Weight”.
▶ Conclusion: H0 accepted
👉 The points are spread irregularly above and below 0 on the Y-axis, with no specific pattern.
👉 It can be concluded that there is no heteroscedasticity between “Item_MRP” and “Item_Weight”.
- H0: Tier 3 has the most numbers in “High” size outlets.
- H1: Tier 3 does not have the most numbers in “High” size outlets.
▶ Conclusion: H0 rejected
👉 “Tier 3” has its largest number in “Medium” size outlets, with 1863 outlets, followed by “High” size outlets, with 932 outlets.
- H0: There was a decrease in the number of outlets from 1985 to 1998.
- H1: There was an increase in the number of outlets from 1985 to 1998.
▶ Conclusion: H0 accepted
👉 The number of outlets decreased from 1463 in 1985 to 555 in 1998.
- H0: The number of “Supermarket Type1” reached its peak in 1987.
- H1: The number of “Supermarket Type1” did not reach its peak in 1987.
▶ Conclusion: H0 accepted
👉 The number of “Supermarket Type1” reached its peak in 1987, with 932 outlets.
👉 In 2007, by comparison, the number of “Supermarket Type1” was only 926 outlets.
- Item_Category
- Item IDs in the "Item_Identifier" column that start with "DR" are classified as "Drink", those that start with "FD" as "Food", and those that start with "NC" as "Non-Consumable".
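
👉 The prefix rule above can be sketched as a simple lookup on the first two characters of the item ID. This is a minimal Python illustration, not the project's SAS Studio code; the names `PREFIX_TO_CATEGORY` and `item_category` are hypothetical, and the "NC" item ID in the example is made up.

```python
PREFIX_TO_CATEGORY = {"DR": "Drink", "FD": "Food", "NC": "Non-Consumable"}


def item_category(item_identifier):
    """Map an item ID to its category by its two-character prefix (None if unknown)."""
    return PREFIX_TO_CATEGORY.get(item_identifier[:2].upper())


# "NCX00" is a made-up ID used only to show the third prefix.
ids = ["FDA15", "DRC01", "FDN15", "NCX00"]
print([item_category(i) for i in ids])
```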
- Age_Outlet
- The age of an outlet is obtained by subtracting the year the outlet was established from 2021.
- Outlet_Category
- The age of outlets is classified into three categories: "New Outlet" for outlets aged 1-10 years, "Moderate Outlet" for outlets aged 11-20 years, and "Old Outlet" for those over 20 years old.
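
👉 The age calculation and the binning described above can be sketched together. This is a minimal Python illustration, not the project's SAS Studio code; the names `REFERENCE_YEAR`, `outlet_age`, and `outlet_category` are hypothetical, and ages of 10 or less are all treated as "New Outlet".

```python
REFERENCE_YEAR = 2021  # year used in this README to compute the outlet age


def outlet_age(establishment_year):
    """Age of an outlet in years, relative to REFERENCE_YEAR."""
    return REFERENCE_YEAR - establishment_year


def outlet_category(age):
    """Bin an outlet age into the three categories used in this README."""
    if age <= 10:
        return "New Outlet"
    if age <= 20:
        return "Moderate Outlet"
    return "Old Outlet"


for year in (1999, 2009):
    age = outlet_age(year)
    print(year, age, outlet_category(age))
```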
👉 This section applies label encoding to the "Item_Fat_Content" column.
👉 The labels are encoded as follows:
Column Name | Level | Encoded Label |
---|---|---|
Item_Fat_Content | Low Fat | 0 |
Item_Fat_Content | Regular | 1 |
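
👉 The label encoding in the table above is a direct mapping from level to integer. A minimal Python sketch follows (not the project's SAS Studio code); the names `FAT_CONTENT_LABELS` and `encode_fat_content` are hypothetical.

```python
FAT_CONTENT_LABELS = {"Low Fat": 0, "Regular": 1}


def encode_fat_content(value):
    """Return the integer label for a cleaned Item_Fat_Content value."""
    return FAT_CONTENT_LABELS[value]  # raises KeyError for uncleaned values


fat_content = ["Low Fat", "Regular", "Low Fat"]
print([encode_fat_content(v) for v in fat_content])
```

The lookup fails on purpose for values such as "LF", so the cleaning step has to run before encoding.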
- Item_Category
- "Drink" is assigned to group 1, "Food" to group 2, and "Non-Consumable" to group 3.
- Outlet_Category
- "New Outlet" is assigned to group 1, "Moderate Outlet" to group 2, and "Old Outlet" to group 3.
- Outlet_Size
- "Small" is assigned to group 1, "Medium" to group 2, and "High" to group 3.
- Outlet_Location_Type
- "Tier 1" is assigned to group 1, "Tier 2" to group 2, and "Tier 3" to group 3.
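
👉 The group assignments above correspond to one-hot encoding: each level gets its own 0/1 column (the `_GROUP_1`, `_GROUP_2`, ... columns in the metadata). A minimal Python sketch follows, not the project's SAS Studio code; the helper `one_hot` and the `SIZE_LEVELS` list are hypothetical names.

```python
def one_hot(value, levels):
    """Return one 0/1 flag per level, in the order the levels are listed."""
    return [1 if value == level else 0 for level in levels]


# Group 1 = Small, group 2 = Medium, group 3 = High.
SIZE_LEVELS = ["Small", "Medium", "High"]

for size in ("Medium", "Small", "High"):
    print(size, one_hot(size, SIZE_LEVELS))
```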
Column Name | Data Type | Description | Instances |
---|---|---|---|
Item_Identifier | Char | Product ID | {FDA15, DRC01, FDN15, …} |
Item_Category | Char | Category of a product (Drink, Food, Non-Consumable) | {Food, Drink, Food, …} |
Item_Category_GROUP_1 | Num | Encoded Item_Category: 0 = not Drink, 1 = Drink | {0, 1, 0, …} |
Item_Category_GROUP_2 | Num | Encoded Item_Category: 0 = not Food, 1 = Food | {1, 0, 1, …} |
Item_Category_GROUP_3 | Num | Encoded Item_Category: 0 = not Non-Consumable, 1 = Non-Consumable | {0, 0, 0, …} |
Item_Weight | Num | The product's weight | {9.3, 5.92, 17.5, …} |
Item_Fat_Content | Char | Content of product (Low Fat or Regular) | {Low Fat, Regular, Low Fat, …} |
Item_Fat_Content_Encoding | Num | Type of product content: 0 = Low Fat, 1 = Regular | {0, 1, 0, …} |
Item_Visibility | Num | The percentage of a store's overall display area that is dedicated to a single product | {0.016047301, 0.019278216, 0.016760075, …} |
Item_Type | Char | Category of product (Soft Drinks, Dairy, Hard Drinks, Canned, Frozen Foods, Fruits and Vegetables, Snack Foods, Baking Goods, Starchy Foods, Meat, Seafood, Breakfast, Breads, Health and Hygiene, Household, Others) | {Dairy, Soft Drinks, Meat, …} |
Item_MRP | Num | Maximum Retail Price of a product | {249.8092, 48.2692, 141.618, …} |
Outlet_Identifier | Char | Outlet ID | {OUT049, OUT018, OUT049, …} |
Outlet_Establishment_Year | Num | Year the outlet was established | {1999, 2009, 1999, …} |
Age_Outlet | Num | Age of an outlet based on "Outlet_Establishment_Year" and 2021 | {22, 12, 22, …} |
Outlet_Category | Char | Category of an outlet based on age: 0-10 = New Outlet, 11-20 = Moderate Outlet, >20 = Old Outlet | {Old Outlet, Moderate Outlet, Old Outlet, …} |
Outlet_Category_GROUP_1 | Num | Encoded Outlet_Category: 0 = not New Outlet, 1 = New Outlet | {0, 0, 0, …} |
Outlet_Category_GROUP_2 | Num | Encoded Outlet_Category: 0 = not Moderate Outlet, 1 = Moderate Outlet | {0, 1, 0, …} |
Outlet_Category_GROUP_3 | Num | Encoded Outlet_Category: 0 = not Old Outlet, 1 = Old Outlet | {1, 0, 1, …} |
Outlet_Size | Char | Size of the store (Small, Medium, High) | {Medium, Medium, Medium, …} |
Outlet_Size_GROUP_1 | Num | Encoded Outlet_Size: 0 = not Small, 1 = Small | {0, 0, 0, …} |
Outlet_Size_GROUP_2 | Num | Encoded Outlet_Size: 0 = not Medium, 1 = Medium | {1, 1, 1, …} |
Outlet_Size_GROUP_3 | Num | Encoded Outlet_Size: 0 = not High, 1 = High | {0, 0, 0, …} |
Outlet_Location_Type | Char | The type of city where the store is located (Tier 1, Tier 2, Tier 3) | {Tier 1, Tier 3, Tier 1, …} |
Outlet_Location_Type_GROUP_1 | Num | Encoded Outlet_Location_Type: 0 = not Tier 1, 1 = Tier 1 | {1, 0, 1, …} |
Outlet_Location_Type_GROUP_2 | Num | Encoded Outlet_Location_Type: 0 = not Tier 2, 1 = Tier 2 | {0, 0, 0, …} |
Outlet_Location_Type_GROUP_3 | Num | Encoded Outlet_Location_Type: 0 = not Tier 3, 1 = Tier 3 | {0, 1, 0, …} |
Outlet_Type | Char | Type of the store (Grocery Store, Supermarket Type1, Supermarket Type2, Supermarket Type3) | {Supermarket Type1, Supermarket Type2, Supermarket Type1, …} |
Outlet_Type_GROUP_1 | Num | Encoded Outlet_Type: 0 = not Grocery Store, 1 = Grocery Store | {0, 0, 0, …} |
Outlet_Type_GROUP_2 | Num | Encoded Outlet_Type: 0 = not Supermarket Type1, 1 = Supermarket Type1 | {1, 0, 1, …} |
Outlet_Type_GROUP_3 | Num | Encoded Outlet_Type: 0 = not Supermarket Type2, 1 = Supermarket Type2 | {0, 1, 0, …} |
Outlet_Type_GROUP_4 | Num | Encoded Outlet_Type: 0 = not Supermarket Type3, 1 = Supermarket Type3 | {0, 0, 0, …} |
Item_Outlet_Sales | Num | Sales of product | {3735.138, 443.4228, 2097.27, …} |
👉 If you find this project useful, please ⭐ this repository 😆!
🎈 Check out my data pre-processing and feature engineering using Python on Kaggle here!
👉 More about myself: here