Python Interview Questions
1. What is Python?
Python is a high-level, interpreted programming language known for its simplicity and
readability.
2. What are Python’s key features?
o Easy to learn and read.
o Interpreted and dynamically typed.
o Extensive libraries.
o Object-oriented.
o Portable and open-source.
3. What are Python’s data types?
Common types include:
o Numeric: int, float, complex
o Sequence: list, tuple, range
o Text: str
o Set: set, frozenset
o Mapping: dict
o Boolean: bool
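These categories can be checked at runtime with type(); a quick illustrative sketch (the sample values are arbitrary):

```python
# One sample value per built-in data type, checked with type()
samples = [42, 3.14, 2 + 3j, [1, 2], (1, 2), range(3), "hi",
           {1, 2}, frozenset({1}), {"a": 1}, True]
for v in samples:
    print(type(v).__name__)
```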
4. How is Python interpreted?
Python code is executed line-by-line at runtime using the Python interpreter.
5. What are Python’s built-in data structures?
o List
o Tuple
o Set
o Dictionary
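A minimal side-by-side sketch of the four structures (sample values are arbitrary):

```python
# The four built-in data structures side by side
my_list = [1, 2, 3]         # ordered, mutable
my_tuple = (1, 2, 3)        # ordered, immutable
my_set = {1, 2, 2, 3}       # unordered, unique elements only
my_dict = {"a": 1, "b": 2}  # key-value mapping

my_list.append(4)           # lists can grow in place
my_dict["c"] = 3            # dicts accept new keys
print(len(my_set))          # duplicates collapse, so this prints 3
```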
Control Flow
Functions
square = lambda x: x ** 2
def values():
    return 1, 2, 3
class MyClass:
    pass
Collections
Strings
name = "World"
print(f"Hello, {name}")
reversed_str = my_string[::-1]
if "sub" in "substring":
    print("Exists")
Advanced Concepts
@decorator  # assumes a decorator function named "decorator" is defined
def func():
    pass
Error Handling
try:
    ...  # code that may raise an exception
except Exception as e:
    print(e)
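A fuller pattern adds else and finally clauses; a small sketch with a hypothetical safe_divide helper:

```python
def safe_divide(a, b):
    """Divide a by b, returning None on failure."""
    try:
        result = a / b
    except ZeroDivisionError as e:
        print(f"Error: {e}")
        return None
    else:
        # runs only when no exception was raised
        return result
    finally:
        # runs in every case, e.g. for cleanup
        pass

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # prints the error, then returns None
```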
File Handling
import os
exists = os.path.exists("file.txt")
os.remove("file.txt")
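Reading and writing file contents is typically done with a context manager so the file is closed automatically; a small sketch (the filename is just an example):

```python
# Write then read a text file; "file.txt" is only an example name
with open("file.txt", "w") as f:
    f.write("Hello, file!\n")

with open("file.txt") as f:
    content = f.read()

print(content)  # Hello, file!
```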
SQL in Python
import sqlite3
conn = sqlite3.connect("database.db")
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table")  # "table" is an SQL keyword; use a real table name
results = cursor.fetchall()
conn.commit()
conn.close()
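When a query takes user-supplied values, placeholders should be used instead of string formatting; a self-contained sketch using an in-memory database and a hypothetical users table:

```python
import sqlite3

# In-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO users VALUES (?, ?)", (1, "Alice"))
conn.commit()

# "?" placeholders let sqlite3 escape values safely (avoids SQL injection)
cursor.execute("SELECT name FROM users WHERE id = ?", (1,))
row = cursor.fetchone()
print(row)  # ('Alice',)
conn.close()
```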
import pandas as pd
df = pd.read_csv("file.csv")
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
10. How do you filter rows where a column value is greater than 50?
o Answer:
df_filtered = df[df['column_name'] > 50]
NumPy Questions
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
def my_function():
    print("Hello, World!")
for i in range(3):
    print(i)
o Answer:
0
1
2
square = lambda x: x ** 2
print(square(5)) # Output: 25
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")
class Car:
    def __init__(self, make):
        self.make = make
my_car = Car("Toyota")
df.isnull().sum()
df.dropna(inplace=True)
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
df.fillna(0, inplace=True)
df.drop_duplicates(inplace=True)
Handling Outliers
6. How do you detect outliers using the IQR (Interquartile Range) method?
o Answer:
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) |
(df['column_name'] > (Q3 + 1.5 * IQR))]
median_value = df['column_name'].median()
df.loc[outliers.index, 'column_name'] = median_value
df = df[df['column_name'] >= 0]
df['column_name'] = df['column_name'].str.lower()
12. How do you remove leading and trailing spaces from string values in a column?
o Answer:
df['column_name'] = df['column_name'].str.strip()
df['date_column'] = pd.to_datetime(df['date_column'])
df['column_name'].unique()
df['column_name'] = df['column_name'].replace({'old_value': 'new_value'})
df['column_name'] = df['column_name'].astype(int)
17. How do you change a column with numeric values stored as strings into float?
o Answer:
df['column_name'] = pd.to_numeric(df['column_name'],
errors='coerce')
df['year'] = df['date_column'].dt.year
19. How do you fill missing categorical values with the most frequent value?
o Answer:
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
1. How do you load a CSV file into a pandas DataFrame and handle missing values?
df = pd.read_csv("file.csv")
df_cleaned = df.dropna()
df_cleaned = df.dropna(axis=1)
df["column_name"] = df["column_name"].fillna(df["column_name"].mean())
df.interpolate(inplace=True)
2. Write a command to select rows from a DataFrame where a column value is greater
than a specified threshold.
You can use boolean indexing in pandas to select rows where a column value is greater than a
specified threshold.
Command:
df_filtered = df[df["column_name"] > threshold]
Example:
import pandas as pd
# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 22, 35]}
df = pd.DataFrame(data)
df_filtered = df[df["Age"] > 25]
print(df_filtered)
Output:
Name Age
1 Bob 30
3 David 35
This command filters the DataFrame to include only rows where "Age" is greater than 25.
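Multiple conditions can be combined with & (and), | (or), and ~ (not), each wrapped in parentheses; a short sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David"],
                   "Age": [25, 30, 22, 35]})

# Rows where Age is over 23 AND under 32
both = df[(df["Age"] > 23) & (df["Age"] < 32)]
print(both["Name"].tolist())    # ['Alice', 'Bob']

# Rows where Age is under 23 OR over 32
either = df[(df["Age"] < 23) | (df["Age"] > 32)]
print(either["Name"].tolist())  # ['Charlie', 'David']
```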
3. How do you create a new column in a DataFrame based on the values of existing
columns?
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]})
df["C"] = df["A"] + df["B"]
print(df)
Output:
A B C
0 10 1 11
1 20 2 22
2 30 3 33
Example:
df["Category"] = df["C"].apply(lambda c: "High" if c > 15 else "Low")  # threshold chosen to match the output
print(df)
Output:
A B C Category
0 10 1 11 Low
1 20 2 22 High
2 30 3 33 High
Example:
df["D"] = df["A"].apply(lambda a: a / 2 if a < 30 else a * 2)  # rule inferred from the output below
print(df)
Output:
A B C Category D
0 10 1 11 Low 5.0
1 20 2 22 High 10.0
2 30 3 33 High 60.0
These methods allow you to create new columns dynamically based on existing values. 🚀
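As a sketch of one more option: assign() builds the new columns on a copy and leaves the original DataFrame untouched (the Ratio column here is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]})

# assign() returns a new DataFrame; df itself is not modified
df2 = df.assign(C=df["A"] + df["B"],
                Ratio=lambda d: d["A"] / d["B"])
print(df2["C"].tolist())   # [11, 22, 33]
print("C" in df.columns)   # False: original unchanged
```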
4. Write a pandas command to group data by one column and calculate the sum of
another column.
You can use the groupby() function in pandas to group data by one column and calculate the
sum of another column.
Command:
df_grouped = df.groupby("column_name")["another_column"].sum().reset_index()
Example:
import pandas as pd
# Sample DataFrame
data = {"Category": ["A", "B", "A", "B", "C"],
"Sales": [100, 200, 150, 300, 250]}
df = pd.DataFrame(data)
df_grouped = df.groupby("Category")["Sales"].sum().reset_index()
print(df_grouped)
Output:
Category Sales
0 A 250
1 B 500
2 C 250
Explanation:
groupby("Category") groups the rows by category, ["Sales"].sum() totals sales within each
group, and reset_index() turns the grouped result back into a regular DataFrame.
5. How do you merge two DataFrames on a common column?
Syntax:
df_merged = pd.merge(df1, df2, on="common_column", how="inner")
Example:
import pandas as pd
# First DataFrame
df1 = pd.DataFrame({
"ID": [1, 2, 3],
"Name": ["Alice", "Bob", "Charlie"]
})
# Second DataFrame
df2 = pd.DataFrame({
"ID": [1, 2, 4],
"Salary": [50000, 60000, 70000]
})
df_merged = pd.merge(df1, df2, on="ID", how="inner")
print(df_merged)
Output:
ID Name Salary
0 1 Alice 50000
1 2 Bob 60000
2. Left Join (how="left")
o Keeps all records from the left DataFrame, filling missing values from the right
DataFrame with NaN.
3. Right Join (how="right")
o Keeps all records from the right DataFrame, filling missing values from the left
DataFrame with NaN.
4. Outer Join (how="outer")
o Keeps all records from both DataFrames, filling missing values with NaN.
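To compare the join types concretely, here is a sketch that runs all four on the two example frames above:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Salary": [50000, 60000, 70000]})

kept_ids = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(df1, df2, on="ID", how=how)
    # inner keeps IDs 1,2; left adds 3; right adds 4; outer keeps all
    kept_ids[how] = sorted(merged["ID"].tolist())
    print(how, kept_ids[how])
```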
6. How do you remove duplicate rows from a DataFrame?
Example:
import pandas as pd
# Sample DataFrame (row 2 duplicates row 1, reconstructed from the output below)
data = {"Name": ["Alice", "Bob", "Bob", "Charlie"],
"Age": [25, 30, 30, 35]}
df = pd.DataFrame(data)
df_cleaned = df.drop_duplicates()
print(df_cleaned)
Output:
Name Age
0 Alice 25
1 Bob 30
3 Charlie 35
df_cleaned = df.drop_duplicates(keep="last")
df_cleaned = df.drop_duplicates(keep=False)
7. How do you fill missing values in a DataFrame with the mean of that column?
Command:
df["column_name"] = df["column_name"].fillna(df["column_name"].mean())
This replaces all NaN values in "column_name" with the mean of that column.
Example:
import pandas as pd
import numpy as np
# Sample DataFrame (missing Age reconstructed from the output below)
data = {"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, np.nan, 35]}
df = pd.DataFrame(data)
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
Output:
Name Age
0 Alice 25.0
1 Bob 30.0
2 Charlie 30.0 # Missing value replaced with mean (30.0)
3 David 35.0
To fill every numerical column with its own mean at once:
df = df.fillna(df.mean(numeric_only=True))
This fills all numerical columns with their respective mean values.
8. Write a command to filter a DataFrame to include only rows where a column value is
within a specific range.
Command:
df_filtered = df[(df["column_name"] >= lower_bound) & (df["column_name"] <=
upper_bound)]
Example:
import pandas as pd
# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
"Age": [25, 30, 22, 35, 28]}
df = pd.DataFrame(data)
df_filtered = df[(df["Age"] >= 25) & (df["Age"] <= 30)]
print(df_filtered)
Output:
Name Age
0 Alice 25
1 Bob 30
4 Emma 28
9. How do you rename columns in a DataFrame?
Example:
import pandas as pd
# Sample DataFrame
data = {"A": [1, 2, 3], "B": [4, 5, 6]}
df = pd.DataFrame(data)
df = df.rename(columns={"A": "Alpha", "B": "Beta"})
print(df)
Output:
Alpha Beta
0 1 4
1 2 5
2 3 6
10. How do you sort a DataFrame by multiple columns?
Example:
import pandas as pd
# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [30, 25, 30, 25],
"Score": [85, 90, 80, 95]}
df = pd.DataFrame(data)
df_sorted = df.sort_values(by=["Age", "Score"], ascending=[True, False])
print(df_sorted)
Output:
Name Age Score
3 David 25 95
1 Bob 25 90
0 Alice 30 85
2 Charlie 30 80
11. How do you apply a function to every value in a column?
Command:
df["column_name"] = df["column_name"].apply(function_name)
Example:
import pandas as pd
# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35]}
df = pd.DataFrame(data)
def add_five(x):
    return x + 5
df["Age"] = df["Age"].apply(add_five)
print(df)
Output:
Name Age
0 Alice 30
1 Bob 35
2 Charlie 40
Equivalent one-liner using a lambda:
df["Age"] = df["Age"].apply(lambda x: x + 5)
12. How do you create a pivot table in pandas?
Example:
import pandas as pd
# Sample DataFrame
data = {"Category": ["A", "A", "B", "B", "C"],
"Region": ["East", "West", "East", "West", "East"],
"Sales": [100, 200, 150, 250, 300]}
df = pd.DataFrame(data)
df_pivot = df.pivot_table(index="Category", columns="Region", values="Sales", aggfunc="sum")
print(df_pivot)
Output:
Region East West
Category
A 100.0 200.0
B 150.0 250.0
C 300.0 NaN
Additional Options:
df.pivot_table(index="Category", columns="Region", values="Sales", fill_value=0)  # replace NaN with 0
df.pivot_table(index="Category", columns="Region", values="Sales", margins=True)  # add row/column totals
13. How do you concatenate two DataFrames vertically?
Example:
import pandas as pd
# Sample DataFrames (reconstructed from the output below)
df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"ID": [3, 4], "Name": ["Charlie", "David"]})
df_combined = pd.concat([df1, df2])  # original indices are kept (0, 1, 0, 1)
print(df_combined)
Output:
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David
To concatenate side by side instead, pass axis=1:
Example:
df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"Age": [25, 30], "City": ["NY", "LA"]})
df_combined = pd.concat([df1, df2], axis=1)
print(df_combined)
Output:
ID Name Age City
0 1 Alice 25 NY
1 2 Bob 30 LA
14. How do you compute a rolling mean over a column?
Example:
import pandas as pd
# Sample DataFrame
data = {"Day": [1, 2, 3, 4, 5, 6],
"Sales": [100, 200, 300, 400, 500, 600]}
df = pd.DataFrame(data)
df["Rolling_Mean"] = df["Sales"].rolling(window=3).mean()
print(df)
Output:
Day Sales Rolling_Mean
0 1 100 NaN
1 2 200 NaN
2 3 300 200.0
3 4 400 300.0
4 5 500 400.0
5 6 600 500.0
The first two rows are NaN because the window size is 3, meaning it needs at least 3
values to calculate the mean.
Additional Options:
df["Rolling_Sum"] = df["Sales"].rolling(window=3).sum()
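If the leading NaN rows are unwanted, min_periods lets the window emit a value from fewer observations; a short sketch:

```python
import pandas as pd

df = pd.DataFrame({"Sales": [100, 200, 300, 400]})

# With min_periods=1, the first rows average whatever values are available
df["Rolling_Mean"] = df["Sales"].rolling(window=3, min_periods=1).mean()
print(df["Rolling_Mean"].tolist())  # [100.0, 150.0, 200.0, 300.0]
```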
15. How do you convert a string column to datetime?
Example:
import pandas as pd
# Sample DataFrame (date strings reconstructed from the output below)
data = {"Date": ["2024-01-01", "2024-02-15", "2024-03-10"]}
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
print(df)
Output:
Date datetime64[ns]
dtype: object
Date
0 2024-01-01
1 2024-02-15
2 2024-03-10
If the format is inconsistent, pandas attempts to detect it automatically. However, for better
performance, specify the format:
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
Handling Errors
If some values are not valid dates, you can handle errors:
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")  # invalid entries become NaT
16. How do you cap values in a column at an upper limit?
Example:
import pandas as pd
# Sample DataFrame
data = {"Product": ["A", "B", "C", "D"], "Sales": [100, 250, 400, 500]}
df = pd.DataFrame(data)
df["Sales"] = df["Sales"].clip(upper=300)  # values above 300 are capped at 300
print(df)
Output:
Product Sales
0 A 100
1 B 250
2 C 300
3 D 300
17. How do you use the groupby function to perform multiple aggregation
operations?
Command:
df.groupby("column_name").agg({"col1": "sum", "col2": "mean"})
Example:
import pandas as pd
# Sample DataFrame
data = {"Category": ["A", "A", "B", "B", "C", "C"],
"Sales": [100, 200, 150, 250, 300, 400],
"Profit": [10, 20, 15, 25, 30, 40]}
df = pd.DataFrame(data)
df_grouped = df.groupby("Category").agg({"Sales": ["sum", "mean"], "Profit": ["max", "min"]})
print(df_grouped)
Output:
Sales Profit
sum mean max min
Category
A 300 150.0 20 10
B 400 200.0 25 15
C 700 350.0 40 30
Alternative Approaches
Named aggregation produces flat, descriptive column names:
df.groupby("Category").agg(
Total_Sales=("Sales", "sum"),
Average_Sales=("Sales", "mean"),
Max_Profit=("Profit", "max"),
Min_Profit=("Profit", "min")
)
18. Write a command to drop rows with missing values from a DataFrame.
Command:
df_cleaned = df.dropna()
Removes all rows that contain at least one NaN (missing value).
Returns a new DataFrame (does not modify the original).
Example:
import pandas as pd
import numpy as np
# Sample DataFrame (NaN positions reconstructed from the output below)
data = {"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, np.nan, 30],
"City": ["NY", "LA", np.nan]}
df = pd.DataFrame(data)
df_cleaned = df.dropna()
print(df_cleaned)
Output:
Name Age City
0 Alice 25.0 NY
Additional Options
df.dropna(how="all")  # drop only rows where every value is NaN
df.dropna(subset=["Age"])  # drop rows with NaN in specific columns
df.dropna(inplace=True)  # modify the DataFrame in place
19. How do you save a DataFrame to a CSV file?
Example:
import pandas as pd
# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "Los Angeles", "Chicago"]}
df = pd.DataFrame(data)
df.to_csv("output.csv", index=False)
Contents of output.csv:
Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
Additional Options
df.to_csv("output.csv", index=True)  # keep the index as the first column
20. How do you convert column data types in a DataFrame?
Example:
import pandas as pd
# Sample DataFrame (numbers stored as strings)
data = {"Age": ["25", "30", "35"], "Salary": ["50000", "60000", "70000"]}
df = pd.DataFrame(data)
df["Age"] = df["Age"].astype(int)
df["Salary"] = df["Salary"].astype(float)
print(df.dtypes)
Output:
Age int64
Salary float64
dtype: object
21. How do you select specific columns from a DataFrame?
Example:
import pandas as pd
# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "Los Angeles", "Chicago"],
"Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Select only "Name" and "Salary" columns
df_filtered = df[["Name", "Salary"]]
print(df_filtered)
Output:
Name Salary
0 Alice 50000
1 Bob 60000
2 Charlie 70000
Additional Methods
df_filtered = df.loc[:, ["Name", "Salary"]]  # label-based selection
df_filtered = df.filter(items=["Name", "Salary"])  # select by column name
22. Write code to create a new DataFrame with only unique rows from an
existing DataFrame.
Command:
df_unique = df.drop_duplicates()
Example:
import pandas as pd
# Sample DataFrame (row 3 duplicates row 0, reconstructed from the output below)
data = {"Name": ["Alice", "Bob", "Charlie", "Alice"],
"Age": [25, 30, 35, 25],
"City": ["New York", "Los Angeles", "Chicago", "New York"]}
df = pd.DataFrame(data)
df_unique = df.drop_duplicates()
print(df_unique)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
Additional Options
df_unique = df.drop_duplicates(keep="last")  # keep the last occurrence instead of the first
df_unique = df.drop_duplicates().reset_index(drop=True)  # renumber the index afterwards
23. How do you reset the index of a DataFrame?
Example:
import pandas as pd
# Sample DataFrame with a custom string index
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35]},
index=["a", "b", "c"])
print("Before Reset:")
print(df)
# Reset index
df.reset_index(drop=True, inplace=True)
print("\nAfter Reset:")
print(df)
Output:
Before Reset:
Name Age
a Alice 25
b Bob 30
c Charlie 35
After Reset:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Old index ("a", "b", "c") is removed, and a new default numeric index (0, 1, 2) is
created.
Alternative Approaches
df.reset_index(drop=False)  # keep the old index as a regular column
df = df.dropna().reset_index(drop=True)  # common pattern after dropping rows
24. How do you compute the correlation matrix of a DataFrame?
Example:
import pandas as pd
# Sample DataFrame
data = {
"Sales": [100, 200, 300, 400, 500],
"Profit": [10, 20, 30, 40, 50],
"Discount": [5, 3, 4, 2, 1]
}
df = pd.DataFrame(data)
corr_matrix = df.corr()
print(corr_matrix)
Output:
Sales Profit Discount
Sales 1.0 1.0 -0.9
Profit 1.0 1.0 -0.9
Discount -0.9 -0.9 1.0
Additional Options
df.corr(method="kendall")  # Kendall rank correlation
df.corr(method="spearman")  # Spearman rank correlation
25. How do you handle categorical data in a DataFrame, including encoding and one-hot
encoding?
Categorical data needs to be converted into a numerical format for machine learning models. The
two most common encoding methods are Label Encoding and One-Hot Encoding.
Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample DataFrame
df = pd.DataFrame({"Category": ["Low", "Medium", "High", "Medium", "Low"]})
# Initialize LabelEncoder
label_encoder = LabelEncoder()
df["Category_Encoded"] = label_encoder.fit_transform(df["Category"])
print(df)
Output:
Category Category_Encoded
0 Low 1
1 Medium 2
2 High 0
3 Medium 2
4 Low 1
from sklearn.preprocessing import OneHotEncoder
# Initialize OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, drop="first")  # sparse_output replaced the old sparse argument
encoded_values = ohe.fit_transform(df[["Category"]])
# Convert to DataFrame
df_encoded = pd.DataFrame(encoded_values,
columns=ohe.get_feature_names_out(["Category"]))
df_final = pd.concat([df, df_encoded], axis=1)
print(df_final)
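As an alternative sketch, pandas itself offers one-hot encoding via get_dummies, with no scikit-learn dependency:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Low", "Medium", "High", "Medium", "Low"]})

# One indicator column per category; drop_first=True removes one redundant column
dummies = pd.get_dummies(df["Category"], prefix="Category", drop_first=True)
df_final = pd.concat([df, dummies], axis=1)
print(df_final.columns.tolist())
```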
1. Find all unique employee names who work in more than one department.
Sample DataFrame:
df = pd.DataFrame({'EmployeeName': ['John Doe', 'Jane Smith', 'Alice Johnson',
'John Doe'], 'Department': ['Sales', 'Marketing', 'Sales', 'Marketing']})
You can find all unique employee names who work in more than one department using the
groupby() and nunique() methods.
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'EmployeeName': ['John Doe', 'Jane Smith', 'Alice Johnson', 'John Doe'],
'Department': ['Sales', 'Marketing', 'Sales', 'Marketing']
})
dept_counts = df.groupby('EmployeeName')['Department'].nunique()
unique_employees = dept_counts[dept_counts > 1].index.tolist()
print(unique_employees)
Output:
['John Doe']
✅ John Doe works in both Sales and Marketing, so he is included in the result.
2. Calculate the monthly average sales for each product. Assume sales data is daily.
Sample DataFrame:
df = pd.DataFrame({'Date': pd.date_range(start='2023-01-01', end='2023-03-31',
freq='D'), 'Product': np.random.choice(['ProductA', 'ProductB'], 90), 'Sales':
np.random.randint(100, 500, 90)})
You can calculate the monthly average sales for each product using groupby() along with
resampling.
Solution:
import pandas as pd
import numpy as np
# Sample DataFrame
np.random.seed(42) # Ensures reproducibility
df = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', end='2023-03-31', freq='D'),
'Product': np.random.choice(['ProductA', 'ProductB'], 90),
'Sales': np.random.randint(100, 500, 90)
})
# Group by calendar month (via to_period) and product, then average the daily sales
monthly_avg_sales = (df.groupby([df['Date'].dt.to_period('M'), 'Product'])['Sales']
.mean().reset_index())
print(monthly_avg_sales)
Output (Example)
Date Product Sales
0 2023-01 ProductA 289.032258
1 2023-01 ProductB 315.000000
2 2023-02 ProductA 292.500000
3 2023-02 ProductB 312.000000
4 2023-03 ProductA 320.000000
5 2023-03 ProductB 305.000000
Explanation
df['Date'].dt.to_period('M') converts each daily date to its calendar month; grouping by month
and product before calling .mean() yields the monthly average sales per product.
3. Identify the top 3 employees with the highest sales in each quarter.
Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe', 'Smith', 'Alice'], 'Quarter': ['Q1',
'Q1', 'Q2', 'Q2', 'Q3'], 'Sales': [200, 150, 300, 250, 400]})
You can find the top 3 employees with the highest sales in each quarter using groupby()
together with head(), nlargest(), or rank().
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe', 'Smith', 'Alice', 'Bob', 'Eve',
'Charlie', 'David', 'Frank'],
'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3'],
'Sales': [200, 150, 300, 250, 400, 380, 220, 500, 450, 420]
})
# Sort by Sales, keep the top 3 rows within each quarter, then restore quarter order
top_3_per_quarter = (df.sort_values('Sales', ascending=False)
.groupby('Quarter').head(3)
.sort_values(['Quarter', 'Sales'], ascending=[True, False])
.reset_index(drop=True))
print(top_3_per_quarter)
Output (Example)
Employee Quarter Sales
0 Doe Q1 300
1 Smith Q1 250
2 John Q1 200
3 Alice Q2 400
4 Bob Q2 380
5 Eve Q2 220
6 Charlie Q3 500
7 David Q3 450
8 Frank Q3 420
Explanation
Sorting by Sales and keeping head(3) within each quarter retains the three highest-selling
employees per quarter; Jane (Q1, 150) is the only one dropped.
4. Analyze the attendance records to find employees with more than 95% attendance
throughout the year.
Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe'], 'TotalDays': [365, 365, 365],
'DaysAttended': [365, 350, 360]})
You can find employees with more than 95% attendance using a percentage calculation and
filtering.
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe'],
'TotalDays': [365, 365, 365],
'DaysAttended': [365, 350, 360]
})
df['AttendanceRate'] = (df['DaysAttended'] / df['TotalDays'] * 100).round(1)
high_attendance_employees = df[df['AttendanceRate'] > 95]
print(high_attendance_employees)
Output:
Employee TotalDays DaysAttended AttendanceRate
0 John 365 365 100.0
1 Jane 365 350 95.9
2 Doe 365 360 98.6
Explanation
AttendanceRate is DaysAttended divided by TotalDays, expressed as a percentage; filtering on
> 95 keeps only employees above that threshold.
5. Calculate the monthly customer retention rate.
Sample DataFrame:
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'CustomerID': [1, 1, 1, 2, 2, 3], 'TransactionCount': [1, 2, 1, 3, 2, 1]})
Calculate Monthly Customer Retention Rate
Customer retention rate is the percentage of customers who made a purchase in a previous
month and returned in the current month.
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'CustomerID': [1, 1, 1, 2, 2, 3],
'TransactionCount': [1, 2, 1, 3, 2, 1]
})
# Customers seen in each month, in explicit chronological order
months = ['Jan', 'Feb', 'Mar']
customers = df.groupby('Month')['CustomerID'].apply(set)
retained = {'Jan': 0.0}  # no previous month to compare against
for prev_m, cur_m in zip(months, months[1:]):
    retained[cur_m] = round(len(customers[cur_m] & customers[prev_m]) / len(customers[prev_m]) * 100, 1)
# Convert to DataFrame
retention_rate = pd.Series(retained, name='Retained')
retention_rate.index.name = 'Month'
retention_rate = retention_rate.reset_index().rename(columns={'Retained':
'RetentionRate (%)'})
print(retention_rate)
Example Output:
Month RetentionRate (%)
0 Jan 0.0
1 Feb 100.0
2 Mar 50.0
Explanation
For each month, the retained customers are those who also appear in the previous month's
customer set; dividing by the size of the previous month's set gives the retention rate.
6. Calculate the average time employees spent on projects.
You can calculate the average duration employees spent on projects by computing the
difference between ProjectEnd and ProjectStart and then taking the mean.
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe'],
'ProjectStart': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-01']),
'ProjectEnd': pd.to_datetime(['2023-01-31', '2023-03-15', '2023-04-01'])
})
# +1 makes the duration inclusive of both the start and end day
df['Duration'] = (df['ProjectEnd'] - df['ProjectStart']).dt.days + 1
avg_duration = df['Duration'].mean()
print(f"Average time employees spent on projects: {avg_duration:.2f} days")
Output:
Average time employees spent on projects: 30.67 days
Explanation
Subtracting the two datetime columns yields a Timedelta; .dt.days extracts the day count, and
.mean() averages the durations across employees.
7. Compute the month-on-month growth rate in sales for each product.
Sample DataFrame:
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'], 'Product':
['A', 'A', 'A', 'B', 'B', 'B'], 'Sales': [200, 220, 240, 150, 165, 180]})
Compute Month-on-Month Growth Rate in Sales for Each Product
To calculate the month-on-month (MoM) growth rate, we use the formula:
MoM Growth (%) = (Sales_current - Sales_previous) / Sales_previous * 100
Then, we identify products that have more than 10% growth in consecutive months.
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
'Sales': [200, 220, 240, 150, 165, 180]
})
# pct_change() compares each month with the previous one within each product
df['MoM_Growth'] = df.groupby('Product')['Sales'].pct_change() * 100
df['HighGrowth'] = df['MoM_Growth'] > 10
# Display results
print(df)
Output:
Month Product Sales MoM_Growth HighGrowth
0 Jan A 200 NaN False
1 Feb A 220 10.000000 False
2 Mar A 240 9.090909 False
3 Jan B 150 NaN False
4 Feb B 165 10.000000 False
5 Mar B 180 9.090909 False
Explanation
pct_change() computes the relative change from the previous row within each product group;
since no month strictly exceeds 10% growth, HighGrowth is False throughout.
8. Identify the time of day (morning, afternoon, evening) when sales peak for
each category of products.
Sample DataFrame:
df = pd.DataFrame({'Category': ['Electronics', 'Clothing', 'Electronics',
'Clothing'], 'TimeOfDay': ['Morning', 'Afternoon', 'Evening', 'Morning'], 'Sales':
[300, 150, 500, 200]})
Identify Peak Sales Time for Each Product Category
To find the time of day when sales peak for each category, we need to group by Category and
find the TimeOfDay with the maximum sales.
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
'TimeOfDay': ['Morning', 'Afternoon', 'Evening', 'Morning'],
'Sales': [300, 150, 500, 200]
})
# Find the time of day with the highest sales for each category
peak_sales_time = df.loc[df.groupby('Category')['Sales'].idxmax(),
['Category', 'TimeOfDay', 'Sales']]
print(peak_sales_time)
Output:
Category TimeOfDay Sales
3 Clothing Morning 200
2 Electronics Evening 500
Explanation
idxmax() returns the row index of the maximum Sales within each category, and df.loc selects
those rows to show the peak time of day per category.
9. Calculate the mean, median, and standard deviation of tasks assigned to employees.
Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe'], 'TasksAssigned': [20,
25, 15]})
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe'],
'TasksAssigned': [20, 25, 15]
})
mean_tasks = df['TasksAssigned'].mean()
median_tasks = df['TasksAssigned'].median()
std_tasks = df['TasksAssigned'].std()  # sample standard deviation (ddof=1)
# Print statistics
print(f"Mean tasks assigned: {mean_tasks}")
print(f"Median tasks assigned: {median_tasks}")
print(f"Standard deviation: {std_tasks}")
Explanation
For TasksAssigned values of 20, 25, and 15, the mean and median are both 20.0 and the sample
standard deviation is 5.0.
10. Calculate the profit margin for each product category based on revenue
and cost data.
Sample DataFrame:
df = pd.DataFrame({'Category': ['Electronics', 'Clothing'], 'Revenue': [1000,
500], 'Cost': [700, 300]})
Solution:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Category': ['Electronics', 'Clothing'],
'Revenue': [1000, 500],
'Cost': [700, 300]
})
df['Profit'] = df['Revenue'] - df['Cost']
df['Profit_Margin (%)'] = (df['Profit'] / df['Revenue'] * 100).round(1)
# Display results
print(df)
Output:
Category Revenue Cost Profit Profit_Margin (%)
0 Electronics 1000 700 300 30.0
1 Clothing 500 300 200 40.0
Explanation
Profit is Revenue minus Cost, and the profit margin expresses Profit as a percentage of
Revenue: 300/1000 = 30% for Electronics and 200/500 = 40% for Clothing.