DA-Interview Reference Material
SQL stands for Structured Query Language. It is the language used to communicate with relational databases.
MySQL, on the other hand, is a relational database management system (RDBMS) that uses SQL as its query language.
* Inner join: Returns only the rows from the two tables where the join condition is met.
* Outer join (full outer join): Returns all rows from both tables; where a row has no match on the other
side, the missing columns are filled with NULL.
* Left join: Returns all rows from the left table, and the matching rows from the right table.
* Right join: Returns all rows from the right table, and the matching rows from the left table.
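The difference between an inner and a left join can be seen on a tiny example. This is a minimal sketch using Python's built-in sqlite3 module; the employees and departments tables and their rows are hypothetical, invented only for illustration. Right and full outer joins are left out because older SQLite versions do not support them.

```python
import sqlite3

# Hypothetical tables and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ben', 20), (3, 'Chen', NULL);
    INSERT INTO departments VALUES (10, 'IT'), (20, 'HR'), (30, 'Finance');
""")

# Inner join: only employees that have a matching department.
inner_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

# Left join: every employee; dept_name is NULL (None) when nothing matches.
left_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The employee without a department appears only in the left join result, with None in place of the department name.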
The GROUP BY clause goes before the ORDER BY clause because ORDER BY operates on the final result of the
query.
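A short sketch of this ordering, assuming a hypothetical orders table (not from the original material): the rows are grouped first, and only the grouped totals are then sorted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical orders table, used only to illustrate clause order.
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('A', 10), ('B', 50), ('A', 30), ('C', 20), ('B', 5);
""")

# GROUP BY builds one row per customer; ORDER BY then sorts those grouped rows.
totals = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").fetchall()

print(totals)
conn.close()
```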
Write a SQL statement that selects all products with a price between 10 and 20. In addition, do not show products
with a CategoryID of 1, 2, or 3. The table name is Products. Columns are Price and CategoryID.
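One possible answer combines BETWEEN with NOT IN. The sketch below runs it against an in-memory SQLite database; the ProductName column and the sample rows are assumptions added so the result is readable, since only Price and CategoryID are given in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ProductName and all sample rows are hypothetical.
conn.executescript("""
    CREATE TABLE Products (ProductName TEXT, Price REAL, CategoryID INTEGER);
    INSERT INTO Products VALUES
        ('Tea', 18, 1), ('Syrup', 10, 2), ('Jam', 15, 4),
        ('Cheese', 21, 4), ('Bread', 20, 5), ('Salt', 9, 6);
""")

# BETWEEN is inclusive on both ends; NOT IN drops the excluded categories.
rows = conn.execute("""
    SELECT *
    FROM Products
    WHERE Price BETWEEN 10 AND 20
      AND CategoryID NOT IN (1, 2, 3)
    ORDER BY ProductName
""").fetchall()

print(rows)
conn.close()
```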
Write a SQL query to get the top 5 people from the IT department. Let's say we have a customers table with
3 columns: cust_name, cust_id, dept. (Make use of the LIMIT clause.)
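A possible answer is sketched below with SQLite. The question does not say what "top" is ranked by, so ordering by cust_id is an assumption, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_name TEXT, cust_id INTEGER, dept TEXT)")

# Hypothetical sample data: odd ids belong to IT, even ids to HR.
people = [(f"Person{i}", i, "IT" if i % 2 else "HR") for i in range(1, 15)]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", people)

# Filter to IT, order by cust_id (an assumed ranking), keep the first 5 rows.
top5_it = conn.execute("""
    SELECT cust_name, cust_id, dept
    FROM customers
    WHERE dept = 'IT'
    ORDER BY cust_id
    LIMIT 5
""").fetchall()

print(top5_it)
conn.close()
```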
Write a SQL statement using ROWNUM for the 3 employees who were hired earliest, with the
result displayed in increasing order of their hire date. Columns are employee_id, first_name,
last_name, email, contact_num, hire_date, department_id, salary.
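ROWNUM is specific to Oracle, where the usual pattern is to sort in a subquery and then filter on ROWNUM in the outer query. The sketch below shows that pattern in a comment and runs a stand-in using the ROW_NUMBER() window function in SQLite (this assumes SQLite 3.25 or newer). Only a subset of the listed columns is used, and the sample rows are hypothetical.

```python
import sqlite3

# Oracle-style pattern, shown for reference only (not executed here):
#   SELECT * FROM (SELECT * FROM employees ORDER BY hire_date)
#   WHERE ROWNUM <= 3;

conn = sqlite3.connect(":memory:")
# Subset of the listed columns; rows are invented for illustration.
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, first_name TEXT, hire_date TEXT);
    INSERT INTO employees VALUES
        (1, 'Dev',  '2019-06-01'), (2, 'Eva',  '2015-03-12'),
        (3, 'Finn', '2021-01-20'), (4, 'Gita', '2012-11-05'),
        (5, 'Hari', '2017-08-30');
""")

# Number the rows by hire date, then keep the first three.
earliest = conn.execute("""
    SELECT employee_id, first_name, hire_date
    FROM (
        SELECT e.*, ROW_NUMBER() OVER (ORDER BY hire_date) AS rn
        FROM employees e
    )
    WHERE rn <= 3
    ORDER BY hire_date
""").fetchall()

print(earliest)
conn.close()
```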
The above SQL query selects all columns (*) from the salesman table alias a and the customer table
alias b, and performs a cross join between the two tables. The query also includes a WHERE clause
that filters the results to only include rows from the salesman table where the 'city' column is not
null.
This means that the query will return all combinations of rows from the salesman table where the
'city' column is not null and the customer table, effectively creating a Cartesian product of the two
tables.
Suppose table A has 5 rows and table B has 6 rows. You perform a cross join on these two tables.
How many rows will it have?
30
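The row count of a cross join is the product of the two table sizes, 5 × 6 = 30. A quick check with SQLite, using placeholder single-column tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (a INTEGER)")
conn.execute("CREATE TABLE B (b INTEGER)")

# 5 rows in A and 6 rows in B, as in the question.
conn.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(5)])
conn.executemany("INSERT INTO B VALUES (?)", [(i,) for i in range(6)])

# Every row of A is paired with every row of B.
row_count = conn.execute("SELECT COUNT(*) FROM A CROSS JOIN B").fetchone()[0]
print(row_count)
conn.close()
```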
Power BI Questions:
What is DAX?
DAX stands for Data Analysis Expressions. It's a collection of functions, operators, and constants used in
formulas to calculate and return values. In other words, it helps you create new info from data you already
have.
Explain the FILTER function and the ALL function, with the syntax for each.
· ALL Function: It can retrieve all values in a column or rows in a given table, overriding any
previous filters.
ALL(<table> or <column>)
Can you have a table in the model which does not have any relationship with other tables?
Yes. There are two main reasons why you can have disconnected tables:
· The table is used to present the user with parameter values to be exposed and selected in slicers
Bidirectional cross-filtering lets data modelers decide how they want their Power BI Desktop filters to flow
for data, using the relationships between tables. The filter context is transmitted to a second related table that
exists on the other side of any given table relationship. This helps data modelers solve the many-to-many
issue without having to write complicated DAX formulas.
What are the prerequisites for Append and Union All in Power BI?
The data types and the number of columns should be the same. It also helps if the column names are the same.
Let us suppose you need to put the Pack size (Column C) values in different buckets.
Pack size less than or equal to 500, then “SMALL PACK”
Pack size between 500 and 1000, then “MEDIUM PACK”
Pack size between 1000 and 2000, then “LARGE PACK”
Anything above 2000, then “PACKAGE”
In this case you will nest 3 IF statements, setting the conditions accordingly.
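The nested-IF logic can be sketched in Python as a chain of three conditions, where each later branch only runs if the earlier ones failed. The function name is hypothetical, and the sketch assumes the buckets are cut at 500, 1000, and 2000 with each upper bound inclusive.

```python
def pack_bucket(pack_size: float) -> str:
    """Map a pack size to its bucket label (assumed cut points: 500, 1000, 2000)."""
    if pack_size <= 500:
        return "SMALL PACK"
    elif pack_size <= 1000:   # reached only when pack_size > 500
        return "MEDIUM PACK"
    elif pack_size <= 2000:   # reached only when pack_size > 1000
        return "LARGE PACK"
    return "PACKAGE"          # anything above 2000


for size in (250, 500, 750, 1500, 2500):
    print(size, pack_bucket(size))
```

Three conditions are enough for four buckets because the last bucket is simply the fall-through case.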
Tableau interview questions:
A parameter is a variable (number, string, or date) created to replace a constant value in calculations,
filters, or reference lines. For example, suppose you create a field that returns true if sales are greater than
30,000 and false otherwise. A parameter can replace the constant (30,000 in this case) so that the threshold
can be set dynamically while the calculation runs. Parameters can accept values in the following options:
LIVE: Live connection is a dynamic way to extract real-time data by directly connecting to the data
source. Tableau directly creates queries against the database entries and retrieves the query results in a
workbook.
EXTRACT: An extract is a snapshot of the data, stored in an extract file (.tde or .hyper). The data can come
from a relational database or from a static source such as an Excel spreadsheet. You can schedule refreshes
of the snapshot using Tableau Server. Working with an extract doesn't need a live connection to the
database.
LOD (Level of Detail) expressions are used to perform aggregations at a level of detail different from the
view's own, whether more granular, less granular, or entirely independent of it. There are three types of LOD
expressions: FIXED, INCLUDE, and EXCLUDE.
• Data Source Filter: This filter restricts users from viewing sensitive information and thus reduces data feeds.
In what situation will you prefer to use a treemap over a heat map?
When we have to deal with large quantitative values having hierarchically structured data, we can prefer
treemaps. Each rectangular set on the same hierarchy level denotes a data table column.
Mention some significant ways of improving Tableau's performance.
There are different ways of improving Tableau's performance. Some well-known techniques are:
· We can minimize the scope of data and keep only the data that we need for our visualization. It
will eventually decrease the volume of data making Tableau's processing faster.
· To run our Tableau workbook faster, we can use the Extract.
· We can avoid using strings while dealing with numbers and prefer using Boolean and integer
values. It is because they are faster than strings.
· We can hide unnecessary or unused fields.
· We can also eliminate needless calculations and sheets.
Python Questions:
The reason for Python's popularity is its extensive collection of libraries. These libraries include various
functionalities and tools to analyze and manage data. The popular Python libraries for data science are:
• SciPy
• NumPy
• Pandas
• Matplotlib
• PyTorch
• Scrapy
• BeautifulSoup
Get good hands-on knowledge of these libraries and understand how to manage data and perform data-related
operations with them.
The simplest series that can be created is an empty series. The Series() function of Pandas is used to create a series
of any kind.
Code Example 1:
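import pandas as pd
# Create an empty Series; an explicit dtype avoids a warning in some pandas versions
ser = pd.Series(dtype="float64")
print(ser)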
How to create a series from an array: Pandas is built on top of the NumPy library. To create a series from
a NumPy array, we import the NumPy module and use the numpy.array() function.
Code Example 2:
import pandas as pd
import numpy as np
# Simple array of characters
data = np.array(['s', 'c', 'a', 'l', 'a', 'r'])
# Build a Series from the NumPy array
ser = pd.Series(data)
# Count how often each value occurs
print(ser.value_counts())
Hypothesis testing is used to find out the statistical significance of an insight. To elaborate, the null hypothesis and the alternate
hypothesis are stated. The null hypothesis is assumed true, and the p-value is calculated under that assumption. The alpha
value, which denotes the significance level, sets the threshold for the decision. If the p-value turns out to be less than the alpha,
then the null hypothesis is rejected.
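The comparison of a p-value against alpha can be sketched with a two-sided one-sample z-test, using only the Python standard library. The choice of test, the function name, the sample values, and the assumption of a known population standard deviation are all illustrative and not taken from the original material.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma (assumed)."""
    n = len(sample)
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a result at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Reject H0 when the p-value falls below the significance level.
    return z, p_value, p_value < alpha


# Hypothetical measurements; H0 says the true mean is 50.
sample = [52.1, 49.8, 53.4, 51.0, 54.2, 50.9, 52.7, 53.1]
z, p_value, reject = one_sample_z_test(sample, mu0=50.0, sigma=2.0)
print(f"z={z:.2f}, p={p_value:.4f}, reject H0: {reject}")
```

With these made-up numbers the p-value falls below alpha = 0.05, so the null hypothesis would be rejected.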
Outliers are data points that vary in a large way when compared to other observations in the dataset. Depending on the learning
process, an outlier can worsen the accuracy of a model and decrease its efficiency sharply.
· Standard deviation/z-score
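A minimal sketch of the z-score approach: a value is flagged when it lies more than a chosen number of standard deviations from the mean. The threshold of 3, the function name, and the sample data are assumptions for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical data with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
outliers = zscore_outliers(data)
print(outliers)
```

On small samples a single extreme value inflates the standard deviation itself, so the z-score method can miss outliers; that is one reason it is usually combined with other checks.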
State the case where the median is a better measure when compared to the
mean.
In the case where there are a lot of outliers that can positively or negatively skew data, the median is preferred as it provides an