Dev Answer Key
1. Data Collection. Data collection is an essential part of exploratory data analysis. ...
2. Data Cleaning. Data cleaning refers to the process of removing unwanted variables and values
from your dataset and getting rid of any irregularities in it. ...
3. Univariate Analysis. ...
4. Bivariate Analysis.
Businesses frequently gather large amounts of data about their online customers and other website visitors.
In the case of any company that sells a product or service online, aggregated data might include statistics on
customer demographics (e.g., gender, average age, and location).
8. Using what function can we group data together based on certain criteria?
The Pandas dataframe.groupby() function is used to split the data into groups based on some criteria.
Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a
mapping of labels to group names.
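A minimal sketch of groupby() in action, assuming a small, made-up DataFrame with "team" and "points" columns:

import pandas as pd

# Hypothetical data: two teams with a few scores each
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 12, 7, 9, 11],
})

# Split the rows into groups by the "team" column, then
# compute one aggregate value per group
grouped = df.groupby("team")
print(grouped["points"].mean())   # A -> 11.0, B -> 9.0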
There are five main types of aggregate functions that are essential for SQL coding:
COUNT()
COUNTIF()
SUM()
AVG()
MIN() / MAX()
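As a hedged illustration, the same five aggregates can be sketched in pandas on a hypothetical "sales" Series (COUNTIF() is a spreadsheet-style conditional count rather than a standard SQL function; a boolean-filtered count plays the same role):

import pandas as pd

sales = pd.Series([250, 40, 310, 40, 125])   # made-up values

print(sales.count())             # COUNT()              -> 5
print((sales > 100).sum())       # COUNTIF(sales > 100) -> 3
print(sales.sum())               # SUM()                -> 765
print(sales.mean())              # AVG()                -> 153.0
print(sales.min(), sales.max())  # MIN() / MAX()        -> 40 310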
10. Distinguish between pivot table and cross tabulation techniques.
A pivot table is a broader concept, similar to a cross tab. While the cross tab assumes there is only
one horizontal dimension and all the values belong to that dimension, a pivot table assumes there might exist one
or more horizontal dimensions.
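A minimal pandas sketch of the difference, using a hypothetical sales table (pd.crosstab() tabulates one row and one column dimension; pivot_table() can aggregate a value column over one or more dimensions):

import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 200, 250],
})

# Cross tabulation: frequency counts for one row and one column dimension
print(pd.crosstab(df["region"], df["product"]))

# Pivot table: aggregates a value column, and can take multiple
# row/column dimensions at once
print(df.pivot_table(index="region", columns="product",
                     values="sales", aggfunc="sum"))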
Numerical data
This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure,
heart rate, temperature, number of teeth, number of bones, and number of family members. This data is often
referred to as quantitative data in statistics. A numerical dataset can be either discrete or continuous.
Discrete data
This is data that is countable, and its values can be listed out. For example, if we flip a coin, the number of heads in
200 coin flips can take any value from 0 to 200, a finite set of cases. A variable that represents a discrete dataset is referred to
as a discrete variable. A discrete variable takes a fixed number of distinct values. For example,
the Country variable can have values such as Nepal, India, Norway, and Japan; the set is fixed. The Rank variable of a
student in a classroom can take values 1, 2, 3, 4, 5, and so on.
Continuous data
A variable that can have an infinite number of numerical values within a specific range is classified as continuous
data. A variable describing continuous data is a continuous variable. For example, what is the temperature of your
city today? Its possible values cannot be listed out; within a range, it can take infinitely many values. Similarly, the weight variable in the previous section is a continuous variable.
Categorical data
This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or
movie categories. This data is often referred to as qualitative data in statistics.
A binary categorical variable can take exactly two values and is also referred to as a dichotomous
variable. For example, when you create an experiment, the result is either success or failure. Hence, results
can be understood as a binary categorical variable.
Polytomous variables are categorical variables that can take more than two possible values.
Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal, interval, and ratio.
Nominal
These are used for labeling variables without any quantitative value. The scales are generally referred to
as labels.
Ordinal
The main difference between the ordinal and nominal scales is order. In ordinal scales, the order of the values is a
significant factor.
Interval
In interval scales, both the order and exact differences between the values are significant.
Ratio
Ratio scales contain order, exact values, and an absolute zero, which makes it possible to use them in descriptive and
inferential statistics.
Step 1: Generate Questions
The first step in EDA is to generate questions about your data. Just like a detective, you want to know
everything there is to know about the data. What is it about? What kind of patterns might you see? What
kinds of relationships might exist between different variables?
Step 2: Apply Visualization
The next step is to apply visualization techniques to the data to help you answer those questions. You can
use charts, tables, graphs, and other tools to help you visualize the data and identify any patterns or
relationships that might exist.
Step 3: Transform and Model Data to Look for Answers
Once you have some visualizations, you can start to explore the data more deeply. This might involve
transforming the data in some way, or applying statistical models to help you look for relationships
between different variables.
Step 4: Use What You Learn to Refine Questions or Generate New Questions
Finally, you can use what you learn from your exploratory analysis to refine your questions or generate
new questions. For example, you might find that there is a relationship between two variables that you
didn’t expect, which leads you to generate new questions about that relationship.
11(b) Illustrate the different visualization techniques in detail.
1. Box plots
A box plot, or box-and-whisker plot, gives a visual summary of data through its quartiles.
First, a box is drawn from the first quartile to the third quartile of the data set. A line inside the box represents the median.
"Whiskers," or lines, are then drawn extending from the box to the minimum (lower extreme) and maximum (upper extreme).
Outliers are represented by individual points plotted in line with the whiskers.
This kind of chart is useful for quickly determining whether the data is symmetric or skewed, as well as giving a visual
summary of the data set that can be easily interpreted.
In simple language, we can understand that box plots indicate the five-number summary of a set of data which comprises
the minimum score, lower quartile, median, upper quartile, and maximum score.
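A minimal matplotlib sketch of a box plot, using randomly generated sample data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # made-up sample

# The box spans Q1..Q3, the inner line marks the median, whiskers
# extend toward the extremes, and outliers appear as individual points
plt.boxplot(data)
plt.title("Box plot of a sample distribution")
plt.show()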
2. Histograms
A histogram is a graphical presentation of information using bars of various heights; in a histogram, each bar groups
numbers into ranges.
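A minimal matplotlib sketch of a histogram, with fabricated ages grouped into ten ranges:

import matplotlib.pyplot as plt
import numpy as np

ages = np.random.default_rng(1).integers(18, 80, size=500)   # made-up ages

plt.hist(ages, bins=10)   # each bar counts the values falling in one range
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()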
3. Heat maps
A heatmap takes a different approach to representing data. It is a graphical portrayal of data that uses different colors
to represent different values. This difference in color representation makes it easy for viewers to grasp the trend
more quickly.
It is beneficial for two major purposes.
For instance, if you need to analyze which time of day a store makes the most sales, you can use a heat map
that indicates the day of the week on the vertical axis and the time of day on the horizontal axis.
Then, by shading in the matrix with colors that correspond to the number of sales at each time of day, you can spot
the trends in the data that let you determine the specific times your store experiences the most sales.
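A minimal matplotlib sketch of that store-sales heat map; the sales matrix is fabricated for illustration:

import matplotlib.pyplot as plt
import numpy as np

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
hours = list(range(9, 18))   # 9:00 through 17:00

sales = np.random.default_rng(2).integers(0, 50, size=(len(days), len(hours)))

# Each cell's color encodes the number of sales for that day/hour
plt.imshow(sales, aspect="auto")
plt.yticks(range(len(days)), days)
plt.xticks(range(len(hours)), hours)
plt.colorbar(label="Number of sales")
plt.show()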
4. Charts
Bar Charts
It is one of the simplest techniques of data visualization. These types of charts are used to compare the quantities of
different categories.
The values of a category are represented by bars, which can be drawn vertically or horizontally,
with the length or height of each bar representing the value.
If you want to examine data over time, or data that is grouped into multiple sectors such as different industries or
varieties of food, a bar graph is the best option.
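A minimal matplotlib sketch of a bar chart over made-up categories:

import matplotlib.pyplot as plt

categories = ["Food", "Clothing", "Electronics"]   # hypothetical categories
units_sold = [120, 75, 45]

plt.bar(categories, units_sold)   # plt.barh() would draw horizontal bars instead
plt.ylabel("Units sold")
plt.show()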
Line Charts
It is used to plot the relationship or dependence of one variable on another. If you want to show data over very
long periods, or continuously changing data, the line graph is a solid option to consider.
To plot the connection between the two variables, we can simply call the plot function. The line chart is most
often used to indicate trends and evaluate how the data has changed over time.
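A minimal matplotlib sketch of a line chart showing a trend over time, with made-up monthly values:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [10, 12, 15, 14, 18, 21]   # fabricated figures

plt.plot(months, revenue, marker="o")   # the plot function draws the connecting line
plt.ylabel("Revenue")
plt.show()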
Pie Charts
The pie chart is one of the most basic and well-known techniques of data visualization. It is very simple and easy to
understand. It is a circular statistical graph that is divided into slices to illustrate numerical proportions. The arc size
of each slice is proportional to the quantity it indicates.
For example, a company witnessed growth of 150%, and found that 60 percentage points of that growth were due to
marketing, 40 to sales, 30 to product, and 20 to technology adoption.
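A minimal matplotlib sketch of a pie chart using the growth breakdown above; each arc's size is proportional to its share of the 150-point total:

import matplotlib.pyplot as plt

contributions = [60, 40, 30, 20]
labels = ["Marketing", "Sales", "Product", "Technology adoption"]

plt.pie(contributions, labels=labels, autopct="%1.0f%%")
plt.show()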
Scatter Charts
It is a two-dimensional plot denoting the joint variation of two data elements.
Simply put, it is a type of mathematical illustration that shows the values of two variables for a set of data by
using Cartesian coordinates.
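A minimal matplotlib sketch of a scatter chart; the height/weight pairs are randomly generated for illustration:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
height = rng.normal(170, 10, size=100)
weight = height * 0.5 + rng.normal(0, 5, size=100)

# Each point's Cartesian coordinates come from the two variables
plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()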
Bubble Charts
Bubble charts are a variation of scatter charts in which the data points are replaced with bubbles. An extra
dimension of the data is portrayed in the size of the bubbles. You can use this chart for analyzing patterns or
correlations.
Each bubble in a bubble chart corresponds to a single data point. The variables' values for each point are implied by
its horizontal position, its vertical position, and the size of its bubble.
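A minimal sketch of a bubble chart as a scatter chart whose marker sizes carry the third, made-up variable:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 14, 12, 20, 16]
population = [30, 80, 45, 200, 120]   # third variable, encoded as bubble size

plt.scatter(x, y, s=population, alpha=0.5)
plt.show()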
5. Treemaps
This method displays hierarchical data in a nested format.
In a treemap, the size of the rectangle used for each category is proportional to its percentage of the whole.
A leaf node's rectangle has an area proportional to a specified dimension of the data.
Depending on the choice, the leaf node is colored, sized, or both according to chosen attributes.
Treemaps use space efficiently, and can therefore show a great many items on the screen at once.
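A minimal treemap sketch, assuming the third-party squarify package (pip install squarify) is available alongside matplotlib; the sizes and labels are fabricated:

import matplotlib.pyplot as plt
import squarify

sizes = [500, 250, 150, 100]    # each rectangle's area is proportional to its share
labels = ["A", "B", "C", "D"]

squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.show()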
12(a) Discuss the data transformation techniques given below in an elaborate manner:
(i) Merging databases
(ii) Reshaping and pivoting
Data merging is the process of combining two or more data sets into a single data set. Most often, this process is necessary when
you have raw data stored in multiple files, worksheets, or data tables that you want to analyze all in one go.
There are two common examples in which a data analyst will need to merge new cases into a main, or principal, data file:
1. They have collected data in a longitudinal study (tracker) – a project in which an analyst collects data over a period of time and
analyzes it at intervals.
2. They have collected data in a before-and-after project – where the analyst collects data before an event, and then again after.
Similarly, some analysts collect data for a specific set of variables, but may at a later stage augment it with either data in different
variables, or with data that comes from a different source altogether. Thus, there are three situations that may necessitate merging
data into an existing file: you can add new cases, new variables (or both new cases and variables), or data based on one or more
look-up values.
Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the two files to be the same,
but the variable names should be different. In this scenario, which is sometimes referred to as augmenting your data (or in SQL,
“joins”) or merging data by columns (i.e. you’re adding new columns of data to each row), you’re adding in new variables with
information for each existing case in your data file. As with merging new cases where not all variables are present, the same thing
applies if you merge in new variables where some cases are missing – these should simply be given blank values.
It could also happen that you have a new file with both new cases and new variables. The approach here will depend on the
software you’re using for your merge. If the software cannot handle merging both variables and cases at the same time, then
consider first merging in only the new variables for the existing sample (i.e. augment first), and then append the new cases across
all variables as a second step to your merge.
You will have your survey data on the one hand, and a list of zip-codes with corresponding income
values on the other. Here, the zip-code would be referred to as a look-up code and functions as the ID value
did in our previous examples.
In other words, we use the look-up code as the identifier and add the income values into a new variable in our data file.
The data is matched up for each case by looking up the zip-code and then augmenting the original data with
the income data for each matching zip-code. For those familiar with Excel, for instance, the formula to perform this type of
augmentation is =VLOOKUP().
The look-up code should be unique in the file that contains the additional data (in our example, each zip-code should only appear
once, with a single associated income), but the same value can appear multiple times in the file you’re wanting to augment.
Think of it like this: lots of people can share a zip-code, but there’s only one average income for each of those locations.
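A minimal pandas sketch of this look-up merge; both tables are fabricated, and pd.merge() plays the role that =VLOOKUP() plays in Excel:

import pandas as pd

survey = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "zip_code":   ["10001", "10002", "10001", "10003"],
})
income = pd.DataFrame({
    "zip_code":   ["10001", "10002", "10003"],   # unique in the look-up file
    "avg_income": [55000, 48000, 61000],
})

# Left join: every survey row is kept, and the income for its
# zip-code is added as a new column
augmented = survey.merge(income, on="zip_code", how="left")
print(augmented)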
The simplest aggregations compute a summary such as sum(), mean(), median(), min(), or max(), in which a single number gives
insight into the nature of a potentially large dataset. More flexible operations on grouped data follow a split, apply, combine pattern:
The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
The apply step involves computing some function, usually an aggregate, transformation, or filtering, within
the individual groups.
The combine step merges the results of these operations into an output array.
Column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a
modified GroupBy object.
The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame.
GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of
useful operations before combining the grouped data.
Aggregation
An aggregation computes a reduced summary value (or values) for each group.
Filtering
A filtering operation allows you to drop data based on the group properties.
Transformation
While aggregation must return a reduced version of the data, transformation can return some transformed version of
the full data to recombine.
The apply() method lets you apply an arbitrary function to the group results.
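A minimal sketch of the four methods on a hypothetical two-column frame:

import pandas as pd

df = pd.DataFrame({
    "key":  ["A", "B", "A", "B", "A"],
    "data": [1, 2, 3, 4, 5],
})
grouped = df.groupby("key")

# Aggregation: one or more summary values per group
print(grouped["data"].aggregate(["min", "max"]))

# Filtering: keep only groups whose sum exceeds a threshold
print(grouped.filter(lambda g: g["data"].sum() > 6))

# Transformation: a full-length result, here group-centered values
print(grouped["data"].transform(lambda x: x - x.mean()))

# Apply: an arbitrary function per group, here the range of each group
print(grouped["data"].apply(lambda s: s.max() - s.min()))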
(https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)
Data aggregation
Data aggregation refers to a process of collecting information from different sources and presenting it in a
summarized format so that business analysts can perform statistical analyses of business schemes. The collected
information may be gathered from various data sources and summarized into a draft for data
analysis. Data aggregation involves three steps:
o Collection of data
o Processing of data
o Presentation of data
Collection of data
As the name suggests, the collection of data means gathering data from different sources; the data can be extracted
from sources such as internet of things (IoT) devices and news headlines.
Processing of data
Once data is collected, the data aggregator determines the atomic data and aggregates it. In the data
processing step, data aggregators use numerous algorithms from AI or ML techniques, and also apply
statistical methodology, such as predictive analysis, to process the data.
Presentation of data
In this step, the gathered information is summarized, providing a desirable statistical output with
accurate data.
Data aggregation can also be applied manually. A startup that is just getting started can choose manual
aggregation, using Excel sheets and creating charts to manage performance, marketing, and budget.
A well-established organization typically uses middleware, usually third-party software, to aggregate the data
automatically using various marketing tools. In the case of huge datasets, however, a data
aggregator system is needed because it provides accurate outcomes.
There are two types of data aggregation:
1. Time aggregation
2. Spatial aggregation
Time Aggregation
Time aggregation provides the data point for an individual resource for a defined period.
Spatial aggregation
Spatial aggregation provides the data point for various groups of resources for a defined period.
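A minimal pandas sketch of time aggregation: daily measurements for a single, hypothetical resource rolled up into one value per defined period (here, per week):

import pandas as pd
import numpy as np

idx = pd.date_range("2024-01-01", periods=28, freq="D")
cpu_usage = pd.Series(np.random.default_rng(4).uniform(20, 90, size=28), index=idx)

weekly_avg = cpu_usage.resample("W").mean()   # one aggregated point per week
print(weekly_avg)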
Pivot tables and crosstabs are ways to display and analyze sets of data. Both are similar to each other, with pivot
tables having just a few added features.
Pivot tables and crosstabs present data in tabular format, with rows and columns displaying certain data. This data
can be aggregated as a sum, count, max, min, or average if desired. These tools allow the user to easily recognize
trends, see relationships between their data, and access information quickly and efficiently.
The Differences Between Pivot Tables and Crosstabs
Pivot tables and crosstabs are nearly identical in form, and the terms are often used interchangeably. However, pivot
tables present some added benefits that regular crosstabs do not.
Pivot tables allow the user to create additional reports on the spot by easily rearranging, adding, counting, and
deleting certain data entries.
Pivot tables work well with hierarchical organization, where data sets can be drilled into to reveal more
information. For example, when viewing the total sales at a store by month, you can drill further into the data
and see the sales data on individual products for each month (see the sketch after this list). With a basic crosstab, you would have to go back
to the program and create a separate crosstab with the information on individual products.
Pivot tables let the user filter through their data, add or remove custom fields, and change the appearance of
their report.
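A minimal pandas sketch of that drill-down, using a fabricated sales table; two row levels (month, then product) allow starting at monthly totals and drilling into individual products:

import pandas as pd

df = pd.DataFrame({
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "product": ["Tea", "Coffee", "Tea", "Tea", "Coffee", "Coffee"],
    "sales":   [100, 80, 60, 120, 90, 70],
})

pivot = df.pivot_table(index=["month", "product"], values="sales",
                       aggfunc="sum", margins=True)   # margins adds the totals row
print(pivot)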
When They Are Most Effective
Pivot tables and crosstabs work well with data sets of any size. They both present quick and efficient ways to analyze
and summarize data. They are most useful with larger sets of data, because the more data
there is, the more difficult it becomes to recognize relationships without pivot tables, crosstabs, or other visualization tools.