World Sales Analysis
Description: Our project “World Sales Analysis” focuses on the sales of
different commodities across the world through online and offline
shopping channels. We use HDFS, Hive, HBase and Zeppelin to analyse the
data.
Step 1: First, we upload the “[Link]” file to Hadoop at the
path hadoop/user/maria_dev.
Step 2: Next, we create a Hive external table using the following
query. Note that creating a separate database to hold the tables related
to our project is crucial.
Query: CREATE DATABASE worldsalesInfo;
Next, we create the external table in this database:
CREATE EXTERNAL TABLE IF NOT EXISTS worldSalesData_external (
  Order_Id INT, Region STRING, Country STRING, Item_Type STRING,
  Sales_channel STRING, Customer_Id INT, Order_Date DATE, Ship_Date DATE,
  Units_Sold INT, Unit_Price DECIMAL, Unit_Cost DECIMAL,
  Total_Revenue DECIMAL, Total_Cost DECIMAL, Total_Profit DECIMAL)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/maria_dev';
The database “worldsalesInfo” now contains our table.
Step 3: The next step is to load the data from the file into the table.
We use the following query:
LOAD DATA INPATH '/user/maria_dev/[Link]' OVERWRITE INTO
TABLE worldSalesData_external;
The data has been loaded into the table.
Scrolling right in the query result shows the remaining columns.
All the records present in the CSV file are now stored in the external
Hive table.
Step 4: We then create an internal (managed) table as follows:
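The query used for this step is not reproduced in the text. A minimal
sketch of what it could look like, assuming the internal table mirrors
the columns of worldSalesData_external; the name
worldSalesData_internal is hypothetical:

```sql
-- Sketch only: the table name is hypothetical, and the columns are
-- copied from worldSalesData_external as an assumption.
CREATE TABLE IF NOT EXISTS worldSalesData_internal (
  Order_Id INT, Region STRING, Country STRING, Item_Type STRING,
  Sales_channel STRING, Customer_Id INT, Order_Date DATE, Ship_Date DATE,
  Units_Sold INT, Unit_Price DECIMAL, Unit_Cost DECIMAL,
  Total_Revenue DECIMAL, Total_Cost DECIMAL, Total_Profit DECIMAL)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

Because the statement has no EXTERNAL keyword and no LOCATION clause,
Hive manages the table's files itself, so dropping the table also
deletes its data.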
Now the database has an internal table too.
Step 5: Having done that, we load the new internal table from the
external table as follows:
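The load query is not reproduced in the text. One common way to copy
rows from one Hive table to another is an INSERT ... SELECT; the sketch
below assumes a hypothetical internal table named
worldSalesData_internal with the same column order as the external
table:

```sql
-- Sketch only: worldSalesData_internal is a hypothetical name.
-- OVERWRITE replaces any rows already present in the target table.
INSERT OVERWRITE TABLE worldSalesData_internal
SELECT * FROM worldSalesData_external;
```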
Viewing the data from the internal table:
Scrolling all the way to the right and to the bottom:
Step 6: Now we first create an HBase table and then a Hive table that
maps onto it.
Creating the HBase table with one column family:
Now a table is created in Hive.
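A Hive table can be mapped onto an existing HBase table through the
HBase storage handler. The sketch below is an assumption, not the query
actually used: the Hive table name worldSalesData_hbase, the HBase table
name worldsales, the column family cf and the subset of columns are all
hypothetical.

```sql
-- Sketch only: table names, column family and column subset are assumed.
-- ":key" maps the first Hive column (Order_Id) to the HBase row key.
CREATE EXTERNAL TABLE IF NOT EXISTS worldSalesData_hbase (
  Order_Id INT, Region STRING, Sales_channel STRING, Total_Revenue DECIMAL)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:region,cf:sales_channel,cf:total_revenue")
TBLPROPERTIES ("hbase.table.name" = "worldsales");
```

The table is declared EXTERNAL because the HBase table already exists;
dropping the Hive table then leaves the HBase data in place.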
Now, populating the table in Hive:
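Rows written to an HBase-backed Hive table end up in HBase. A minimal
sketch, assuming a hypothetical mapped table worldSalesData_hbase with
four columns and worldSalesData_external as the source (the text does
not say which table was actually used):

```sql
-- Sketch only: worldSalesData_hbase and its column list are assumed.
-- INSERT INTO is used because HBase-backed tables cannot be overwritten.
INSERT INTO TABLE worldSalesData_hbase
SELECT Order_Id, Region, Sales_channel, Total_Revenue
FROM worldSalesData_external;
```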
The table in HBase has also been populated.
Now, viewing the results for order ID 9:
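A lookup by order ID could be written as follows; the table name
worldSalesData_hbase is hypothetical, and the query works the same way
against any table that has an Order_Id column:

```sql
-- Sketch only: worldSalesData_hbase is an assumed table name.
SELECT *
FROM worldSalesData_hbase
WHERE Order_Id = 9;
```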
Filtering online sales
Filtering offline sales
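The two filters can be sketched as below. This assumes the
Sales_channel column holds the literal values 'Online' and 'Offline',
and it queries worldSalesData_external because the text does not state
which table was filtered:

```sql
-- Sketch only: the channel values and the choice of table are assumed.
SELECT * FROM worldSalesData_external WHERE Sales_channel = 'Online';

SELECT * FROM worldSalesData_external WHERE Sales_channel = 'Offline';
```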
From the results, it is evident that Amazon has sold more products
through the online channel.
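A comparison like this can also be made with a single aggregate query
rather than by inspecting the filtered rows. The sketch below is an
assumption about how it might be done, using Units_Sold from
worldSalesData_external:

```sql
-- Sketch only: sums the units sold per sales channel for comparison.
SELECT Sales_channel, SUM(Units_Sold) AS total_units_sold
FROM worldSalesData_external
GROUP BY Sales_channel;
```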
Creating new notebook
Notebook created
Creating dataframe
Printing schema:
Creating dataframe for [Link]
Printing schema
Data:
Customers data:
Creating temp view:
Now we can work with views using [Link]
Checking views data:
Customers who have placed orders:
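The query behind this result is not shown. A minimal Spark SQL sketch,
assuming two hypothetical temp views named sales and customers that
both expose a Customer_Id column:

```sql
-- Sketch only: the view names and the join column are assumed.
-- DISTINCT keeps one row per customer even with several orders.
SELECT DISTINCT c.*
FROM customers c
JOIN sales s
  ON c.Customer_Id = s.Customer_Id
```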
Total revenue grouped by region:
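An aggregation of this kind could be written as below, for example in a
Zeppelin SQL paragraph. The view name sales is hypothetical, and the
column names are assumed to match those of the Hive table:

```sql
-- Sketch only: 'sales' is an assumed temp view name.
SELECT Region, SUM(Total_Revenue) AS total_revenue
FROM sales
GROUP BY Region
ORDER BY total_revenue DESC
```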
Scatter chart:
Item types with a unit cost less than 100:
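A sketch of such a filter, assuming a hypothetical temp view named
sales with Item_Type and Unit_Cost columns:

```sql
-- Sketch only: 'sales' is an assumed temp view name.
SELECT DISTINCT Item_Type, Unit_Cost
FROM sales
WHERE Unit_Cost < 100
```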
How many customers prefer the offline and online sales channels:
Online: 2378
Offline: 2950
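Counts per channel can be obtained with a GROUP BY. The sketch below
assumes a hypothetical temp view named sales and counts distinct
customers; whether the figures above count distinct customers or
individual orders is not stated, so the query may not reproduce them
exactly:

```sql
-- Sketch only: 'sales' is an assumed view name; use COUNT(*) instead
-- to count orders rather than distinct customers.
SELECT Sales_channel, COUNT(DISTINCT Customer_Id) AS customers
FROM sales
GROUP BY Sales_channel
```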
Records of customers who have placed orders
Customer data by the unit cost of the order placed, filtered by
order date
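The exact query is not shown. As a sketch under stated assumptions, the
join below uses hypothetical temp views sales and customers, assumes
Order_Date was parsed as a date, and uses an arbitrary cutoff date
purely for illustration:

```sql
-- Sketch only: view names, join column and cutoff date are assumed.
SELECT c.*, s.Order_Date, s.Unit_Cost
FROM customers c
JOIN sales s
  ON c.Customer_Id = s.Customer_Id
WHERE s.Order_Date >= DATE '2015-01-01'  -- hypothetical cutoff
ORDER BY s.Unit_Cost DESC
```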