Python Interview Questions

The document provides a comprehensive overview of Python programming, covering its key features, data types, control flow, functions, object-oriented programming, file handling, and error handling. It also includes questions and answers related to data analytics using Pandas and NumPy, as well as data cleaning techniques. The content is structured in a question-and-answer format, making it a useful resource for learners and practitioners of Python.


1. What is Python?

Python is a high-level, interpreted programming language known for its simplicity and
readability.
2. What are Python’s key features?
o Easy to learn and read.
o Interpreted and dynamically typed.
o Extensive libraries.
o Object-oriented.
o Portable and open-source.
3. What are Python’s data types?
Common types include:
o Numeric: int, float, complex
o Sequence: list, tuple, range
o Text: str
o Set: set, frozenset
o Mapping: dict
o Boolean: bool
4. How is Python interpreted?
Python code is executed line-by-line at runtime using the Python interpreter.
5. What are Python’s built-in data structures?
o List
o Tuple
o Set
o Dictionary

Control Flow

6. What are Python’s loop constructs?


o for loops: Iterate over a sequence.
o while loops: Run as long as a condition is True.
7. What is the difference between break, continue, and pass?
o break: Exits the loop.
o continue: Skips the current iteration.
o pass: Does nothing; placeholder.
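
A minimal loop showing all three (a sketch):

for n in range(5):
    if n == 1:
        continue  # skip this iteration
    if n == 4:
        break     # exit the loop entirely
    pass          # placeholder; has no effect
    print(n)      # prints 0, 2, 3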
8. How do you handle conditional statements?
Use if, elif, and else.
9. Write a Python one-liner to check if a number is even.

print("Even" if n % 2 == 0 else "Odd")

10. How does Python handle errors in loops?


Use try and except blocks to catch exceptions.

Functions

11. How do you define a function in Python?


def function_name(parameters):
    # body

12. What is *args and **kwargs?


o *args: Pass a variable number of positional arguments.
o **kwargs: Pass a variable number of keyword arguments.
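
For example (a minimal sketch; the function name is illustrative):

def show(*args, **kwargs):
    print(args)    # tuple of positional arguments
    print(kwargs)  # dict of keyword arguments

show(1, 2, name="Alice")  # prints (1, 2) then {'name': 'Alice'}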
13. What is a lambda function?
An anonymous function:

square = lambda x: x ** 2

14. What is the purpose of return?


Returns a value from a function.
15. Can a function return multiple values?
Yes:

def values():
    return 1, 2, 3

Object-Oriented Programming (OOP)

16. What is OOP?


A paradigm based on objects containing data and methods.
17. What is a class in Python?
A blueprint for creating objects:

class MyClass:
    pass

18. What is self in Python?


Refers to the current instance of the class.
19. What is inheritance?
A class can inherit properties/methods from another class.
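
For example (a minimal sketch):

class Animal:
    def speak(self):
        return "..."

class Dog(Animal):  # Dog inherits speak() from Animal
    pass

print(Dog().speak())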
20. What are dunder methods?
Special methods with double underscores, e.g., __init__, __str__.
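
A small sketch (the class name is illustrative):

class Point:
    def __init__(self, x, y):   # runs when the object is created
        self.x, self.y = x, y

    def __str__(self):          # used by str() and print()
        return f"Point({self.x}, {self.y})"

print(Point(1, 2))  # Point(1, 2)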

Collections

21. What is the difference between list and tuple?


o List: Mutable.
o Tuple: Immutable.
22. How do you create a dictionary?

my_dict = {"key": "value"}

23. How do you access items in a set?


Use a loop; sets are unordered.
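
For example:

s = {1, 2, 3}
for item in s:
    print(item)  # iteration order is not guaranteed
print(2 in s)    # membership test: True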
24. What is the purpose of collections.Counter?
Counts occurrences of elements in a sequence.
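
For example:

from collections import Counter

counts = Counter("banana")
print(counts)                 # Counter({'a': 3, 'n': 2, 'b': 1})
print(counts.most_common(1))  # [('a', 3)]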
25. How do you iterate over a dictionary?

for key, value in my_dict.items():
    print(key, value)

Strings

26. How do you concatenate strings?

result = "Hello" + " World"

27. How do you format strings?

print(f"Hello, {name}")

28. How do you reverse a string?

reversed_str = my_string[::-1]

29. How do you check if a substring exists?

if "sub" in "substring":
print("Exists")

30. How do you split and join strings?

split_str = "hello world".split()
joined_str = " ".join(split_str)

Advanced Concepts

31. What is a decorator?


A function that modifies another function:

@decorator
def func():
    pass
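
A minimal working version of such a decorator (a sketch, assuming a simple logging wrapper):

import functools

def decorator(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@decorator
def func():
    pass

func()  # prints "Calling func"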

32. What is a generator?


Functions using yield to return values lazily.
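
For example (a minimal sketch):

def count_up_to(n):
    i = 1
    while i <= n:
        yield i  # pauses here until the next value is requested
        i += 1

for value in count_up_to(3):
    print(value)  # 1, 2, 3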
33. What are Python's built-in modules?
Examples: os, sys, math, random.
34. What is Python’s GIL?
The Global Interpreter Lock allows only one thread to execute Python bytecode at a time, even on multi-core machines.
35. How does Python handle memory management?
o Reference counting.
o Garbage collection.

Error Handling

36. How do you handle exceptions?

try:
    # code
except Exception as e:
    print(e)

37. What is a finally block?


Runs code after try and except, regardless of the outcome.
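
For example:

try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")
finally:
    print("Always runs")  # executes whether or not an exception occurred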
38. What is an assertion?
Debugging tool to check conditions:

assert x > 0, "x must be positive"

39. What are common exceptions?


o ValueError
o TypeError
o KeyError
o IndexError
40. How do you raise an exception?

raise ValueError("Invalid value")

File Handling

41. How do you read a file?

with open("file.txt", "r") as file:


data = file.read()

42. How do you write to a file?

with open("file.txt", "w") as file:


file.write("content")

43. How do you check if a file exists?

import os
exists = os.path.exists("file.txt")

44. How do you delete a file?

os.remove("file.txt")

45. How do you append to a file?


with open("file.txt", "a") as file:
file.write("more content")

SQL in Python

46. What library is used for SQL in Python?


sqlite3 or SQLAlchemy.
47. How do you connect to a database?

import sqlite3
conn = sqlite3.connect("database.db")

48. How do you execute a query?

cursor = conn.cursor()
cursor.execute("SELECT * FROM table")

49. How do you fetch query results?

results = cursor.fetchall()

50. How do you commit changes?

conn.commit()

Basic Python Questions

1. What is the difference between a list and a tuple in Python?


o Answer: Lists are mutable (can be modified), whereas tuples are immutable
(cannot be modified).
2. How do you create a dictionary in Python?
o Answer:

my_dict = {"name": "Alice", "age": 25, "city": "New York"}

3. What will be the output of bool([])?


o Answer: False, because an empty list evaluates to False in a boolean context.
4. How do you check the data type of a variable?
o Answer: type(variable)
5. What is the difference between is and == in Python?
o Answer:
 is checks object identity (whether two variables refer to the same object in memory).
 == checks value equality (whether two variables have the same value).
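
For example:

a = [1, 2, 3]
b = [1, 2, 3]
print(a == b)  # True: same values
print(a is b)  # False: two distinct list objects in memory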

Data Analytics & Pandas Questions


6. How do you read a CSV file using Pandas?
o Answer:

import pandas as pd
df = pd.read_csv("file.csv")

7. How do you get the first 5 rows of a DataFrame?


o Answer: df.head(5)
8. How do you drop duplicate values in a DataFrame?
o Answer: df.drop_duplicates(inplace=True)
9. How do you fill missing values in a Pandas DataFrame with the mean of a column?
o Answer:

df['column_name'].fillna(df['column_name'].mean(), inplace=True)

10. How do you filter rows where a column value is greater than 50?
o Answer:

df_filtered = df[df['column_name'] > 50]

NumPy Questions

11. How do you create a NumPy array?


o Answer:

import numpy as np
arr = np.array([1, 2, 3, 4, 5])

12. How do you generate an array of zeros of shape (3, 3)?


o Answer: np.zeros((3, 3))
13. How do you reshape a NumPy array?
o Answer: arr.reshape(rows, columns)
14. How do you calculate the mean of a NumPy array?
o Answer: np.mean(arr)
15. What function is used to find the standard deviation in NumPy?
o Answer: np.std(arr)

Python Functions & Loops

16. How do you define a function in Python?


o Answer:

def my_function():
    print("Hello, World!")

17. What will be the output of the following code?


for i in range(3):
    print(i)

o Answer:

0
1
2

18. What is the purpose of the enumerate() function?


o Answer: It adds an index to an iterable, returning (index, value) pairs.
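
For example:

for index, value in enumerate(["a", "b", "c"]):
    print(index, value)  # 0 a, then 1 b, then 2 c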
19. How do you use a lambda function to square a number?
o Answer:

square = lambda x: x ** 2
print(square(5)) # Output: 25

20. What will be the output of list(range(2, 10, 2))?


o Answer: [2, 4, 6, 8]

File Handling & Exception Handling

21. How do you open a file in Python for reading?


o Answer: file = open("file.txt", "r")
22. How do you write to a file in Python?
o Answer:

with open("file.txt", "w") as file:
    file.write("Hello, Python!")

23. What is the purpose of the try-except block?


o Answer: It is used for exception handling to catch and handle runtime errors.
24. How do you handle a division-by-zero error in Python?
o Answer:

try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")

25. What does the finally block do in exception handling?


o Answer: It runs always, regardless of whether an exception occurred.

OOP & Advanced Python

26. How do you define a class in Python?


o Answer:
class Car:
    def __init__(self, brand):
        self.brand = brand

27. What is the purpose of self in Python classes?


o Answer: It represents the instance of the class and allows access to its
attributes/methods.
28. How do you create an object of a class?
o Answer:

my_car = Car("Toyota")

29. What is the difference between deepcopy() and copy()?


o Answer:
 copy() creates a shallow copy (nested objects are shared, so changes to them show up in both).
 deepcopy() creates a deep copy (fully independent of the original).
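
A small sketch:

import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)

original[0][0] = 99
print(shallow[0][0])  # 99: the shallow copy shares the nested lists
print(deep[0][0])     # 1: the deep copy is fully independent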
30. How do you check if a variable is an instance of a specific class?
o Answer: isinstance(variable, ClassName)

Here are 20 Data Cleaning Questions with Answers in Pandas:

Basic Data Cleaning Questions

1. How do you check for missing values in a Pandas DataFrame?


o Answer:

df.isnull().sum()

o This returns the count of missing values in each column.


2. How do you remove rows with missing values?
o Answer:

df.dropna(inplace=True)

o This removes all rows with at least one missing value.


3. How do you fill missing values with the mean of a column?
o Answer:

df['column_name'].fillna(df['column_name'].mean(), inplace=True)

4. How do you replace all NaN values with a specific value?


o Answer:

df.fillna(0, inplace=True)

o This replaces all NaN values with 0.


5. How do you remove duplicate rows in a DataFrame?
o Answer:

df.drop_duplicates(inplace=True)

Handling Outliers

6. How do you detect outliers using the IQR (Interquartile Range) method?
o Answer:

Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) |
              (df['column_name'] > (Q3 + 1.5 * IQR))]

7. How do you remove outliers using the IQR method?


o Answer:

df = df[(df['column_name'] >= (Q1 - 1.5 * IQR)) &
        (df['column_name'] <= (Q3 + 1.5 * IQR))]

8. How do you replace outliers with the median?


o Answer:

median_value = df['column_name'].median()
df.loc[outliers.index, 'column_name'] = median_value

9. How do you detect outliers using the Z-score method?


o Answer:

from scipy import stats

df['z_score'] = stats.zscore(df['column_name'])
outliers = df[df['z_score'].abs() > 3]

10. How do you remove negative values in a column?


o Answer:

df = df[df['column_name'] >= 0]

Handling Inconsistent Data

11. How do you convert a column to lowercase?


o Answer:

df['column_name'] = df['column_name'].str.lower()
12. How do you remove leading and trailing spaces from string values in a column?
o Answer:

df['column_name'] = df['column_name'].str.strip()

13. How do you standardize date format in Pandas?


o Answer:

df['date_column'] = pd.to_datetime(df['date_column'])

14. How do you find unique values in a column?


o Answer:

df['column_name'].unique()

15. How do you replace specific values in a column?


o Answer:

df['column_name'].replace({'old_value': 'new_value'}, inplace=True)

Data Type Handling

16. How do you convert a column to integer type?


o Answer:

df['column_name'] = df['column_name'].astype(int)

17. How do you change a column with numeric values stored as strings into float?
o Answer:

df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

18. How do you extract year from a date column?


o Answer:

df['year'] = df['date_column'].dt.year

19. How do you fill missing categorical values with the most frequent value?
o Answer:

df['column_name'].fillna(df['column_name'].mode()[0], inplace=True)

20. How do you rename multiple column names?


o Answer:
df.rename(columns={'old_col1': 'new_col1', 'old_col2': 'new_col2'}, inplace=True)

1. How do you load a CSV file into a pandas DataFrame and handle missing values?

Step 1: Import Pandas


import pandas as pd

Step 2: Load the CSV File


df = pd.read_csv("your_file.csv")

Step 3: Handle Missing Values

1. Check for Missing Values

print(df.isnull().sum()) # Count missing values per column

2. Remove Rows or Columns with Missing Values


o Remove rows with missing values:

df_cleaned = df.dropna()

o Remove columns with missing values:

df_cleaned = df.dropna(axis=1)

3. Fill Missing Values


o Fill with a specific value:

df_filled = df.fillna(0) # Replace NaN with 0

o Fill with the column mean:

df["column_name"].fillna(df["column_name"].mean(), inplace=True)

o Fill with forward or backward fill:

df.fillna(method='ffill', inplace=True)  # Forward fill (newer pandas: df.ffill())
df.fillna(method='bfill', inplace=True)  # Backward fill (newer pandas: df.bfill())

4. Replace Missing Values Using Interpolation

df.interpolate(inplace=True)
2. Write a command to select rows from a DataFrame where a column value is greater
than a specified threshold.

You can use boolean indexing in pandas to select rows where a column value is greater than a
specified threshold.

Command:
df_filtered = df[df["column_name"] > threshold]

Example:
import pandas as pd

# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 22, 35]}

df = pd.DataFrame(data)

# Select rows where Age is greater than 25


df_filtered = df[df["Age"] > 25]

print(df_filtered)

Output:
Name Age
1 Bob 30
3 David 35

This command filters the DataFrame to include only rows where "Age" is greater than 25.

3. How do you create a new column in a DataFrame based on the values of existing
columns?

Method 1: Using Arithmetic Operations


df["new_column"] = df["column1"] + df["column2"]

Example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]})

# Create a new column as the sum of A and B


df["C"] = df["A"] + df["B"]
print(df)

Output:

A B C
0 10 1 11
1 20 2 22
2 30 3 33

Method 2: Using Conditional Statements


df["new_column"] = df["column1"].apply(lambda x: "High" if x > 50 else "Low")

Example:

df["Category"] = df["A"].apply(lambda x: "High" if x > 15 else "Low")


print(df)

Output:

A B C Category
0 10 1 11 Low
1 20 2 22 High
2 30 3 33 High

Method 3: Using apply() with Multiple Columns


df["new_column"] = df.apply(lambda row: row["column1"] * 2 if row["column2"] >
2 else row["column1"] / 2, axis=1)

Example:

df["D"] = df.apply(lambda row: row["A"] * 2 if row["B"] > 2 else row["A"] / 2,


axis=1)
print(df)

Output:

A B C Category D
0 10 1 11 Low 5.0
1 20 2 22 High 10.0
2 30 3 33 High 60.0

These methods allow you to create new columns dynamically based on existing values. 🚀

4. Write a pandas command to group data by one column and calculate the sum of
another column.
You can use the groupby() function in pandas to group data by one column and calculate the
sum of another column.
Command:
df_grouped = df.groupby("column_name")["another_column"].sum().reset_index()

Example:
import pandas as pd

# Sample DataFrame
data = {"Category": ["A", "B", "A", "B", "C"],
"Sales": [100, 200, 150, 300, 250]}

df = pd.DataFrame(data)

# Group by "Category" and sum the "Sales"


df_grouped = df.groupby("Category")["Sales"].sum().reset_index()

print(df_grouped)

Output:
Category Sales
0 A 250
1 B 500
2 C 250

Explanation:

 groupby("Category") groups the data by the "Category" column.


 ["Sales"].sum() calculates the sum of the "Sales" column for each category.
 reset_index() converts the grouped result back into a DataFrame.

5. How do you merge two DataFrames on a common column?

Syntax:
df_merged = pd.merge(df1, df2, on="common_column", how="inner")

 on="common_column" → Specifies the column to merge on.


 how="inner" → Specifies the type of join (can be "inner", "left", "right", or
"outer").

Example:
import pandas as pd

# First DataFrame
df1 = pd.DataFrame({
"ID": [1, 2, 3],
"Name": ["Alice", "Bob", "Charlie"]
})

# Second DataFrame
df2 = pd.DataFrame({
"ID": [1, 2, 4],
"Salary": [50000, 60000, 70000]
})

# Merge on "ID" column using an inner join


df_merged = pd.merge(df1, df2, on="ID", how="inner")

print(df_merged)

Output:
ID Name Salary
0 1 Alice 50000
1 2 Bob 60000

Different Types of Joins in Pandas

1. Inner Join (Default)

df_inner = pd.merge(df1, df2, on="ID", how="inner")

o Keeps only matching records from both DataFrames.


2. Left Join

df_left = pd.merge(df1, df2, on="ID", how="left")

o Keeps all records from the left DataFrame, fills missing values with NaN from the
right DataFrame.
3. Right Join

df_right = pd.merge(df1, df2, on="ID", how="right")

o Keeps all records from the right DataFrame.


4. Outer Join

df_outer = pd.merge(df1, df2, on="ID", how="outer")

o Keeps all records from both DataFrames, filling missing values with NaN.

6. Write code to remove duplicate rows from a DataFrame.


Command to Remove Duplicates:
df_cleaned = df.drop_duplicates()
 This removes all duplicate rows while keeping the first occurrence.

Example:
import pandas as pd

# Sample DataFrame with duplicates


data = {"Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
"Age": [25, 30, 25, 35, 30]}

df = pd.DataFrame(data)

# Remove duplicate rows


df_cleaned = df.drop_duplicates()

print(df_cleaned)

Output:
Name Age
0 Alice 25
1 Bob 30
3 Charlie 35

Additional Options in drop_duplicates()

1. Keep the Last Occurrence:

df_cleaned = df.drop_duplicates(keep="last")

o Retains the last occurrence of duplicates instead of the first.


2. Remove Duplicates Based on Specific Columns:

df_cleaned = df.drop_duplicates(subset=["Name"], keep="first")

o Removes duplicates based on the "Name" column only.


3. Remove All Duplicate Occurrences (Keep None):

df_cleaned = df.drop_duplicates(keep=False)

o Removes all occurrences of duplicate rows.

7. How do you fill missing values in a DataFrame with the mean of that column?

Command:
df["column_name"].fillna(df["column_name"].mean(), inplace=True)
 This replaces all NaN values in "column_name" with the mean of that column.

Example:
import pandas as pd

# Sample DataFrame with missing values


data = {"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, None, 35]}

df = pd.DataFrame(data)

# Fill missing values in "Age" with the mean


df["Age"].fillna(df["Age"].mean(), inplace=True)

print(df)

Output:
Name Age
0 Alice 25.0
1 Bob 30.0
2 Charlie 30.0 # Missing value replaced with mean (30.0)
3 David 35.0

Alternative Approach: Filling All Numeric Columns at Once

df.fillna(df.mean(numeric_only=True), inplace=True)

 This fills all numerical columns with their respective mean values (numeric_only=True skips non-numeric columns such as "Name").

Using transform() (Recommended for Large DataFrames)


df["Age"] = df["Age"].transform(lambda x: x.fillna(x.mean()))

 This is efficient for handling large datasets.

8. Write a command to filter a DataFrame to include only rows where a column value is
within a specific range.

Command:
df_filtered = df[(df["column_name"] >= lower_bound) & (df["column_name"] <= upper_bound)]

 lower_bound → Minimum value of the range.


 upper_bound → Maximum value of the range.
 The & operator ensures both conditions are met.

Example:
import pandas as pd

# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
"Age": [25, 30, 22, 35, 28]}

df = pd.DataFrame(data)

# Filter rows where Age is between 25 and 30 (inclusive)


df_filtered = df[(df["Age"] >= 25) & (df["Age"] <= 30)]

print(df_filtered)

Output:
Name Age
0 Alice 25
1 Bob 30
4 Emma 28

Alternative: Using between()

A more readable way to filter by range:

df_filtered = df[df["Age"].between(25, 30)]

 between(lower, upper, inclusive="both") makes it easy to filter values in a given range.

9. How do you rename columns in a DataFrame?


Command:
df.rename(columns={"old_column_name": "new_column_name"}, inplace=True)

 This renames a specific column.


 Setting inplace=True modifies the DataFrame directly.

Example:
import pandas as pd

# Sample DataFrame
data = {"A": [1, 2, 3], "B": [4, 5, 6]}
df = pd.DataFrame(data)

# Rename column "A" to "Alpha" and "B" to "Beta"


df.rename(columns={"A": "Alpha", "B": "Beta"}, inplace=True)

print(df)

Output:
Alpha Beta
0 1 4
1 2 5
2 3 6

Renaming Multiple Columns

You can rename multiple columns at once:

df.rename(columns={"A": "Alpha", "B": "Beta"}, inplace=True)

Renaming All Columns Using columns Attribute

If you want to rename all columns at once:

df.columns = ["Column1", "Column2"]

⚠️Ensure the number of new names matches the number of columns.

Using str.replace() for Pattern-Based Renaming

If you want to rename columns by replacing parts of their names:

df.columns = df.columns.str.replace("old_text", "new_text")

🚀 These methods help in cleaning and structuring your dataset efficiently!

10. Write a command to sort a DataFrame by multiple columns.


Command:
df_sorted = df.sort_values(by=["column1", "column2"], ascending=[True, False])

 by=["column1", "column2"] → Specifies multiple columns for sorting.


 ascending=[True, False] → Sorts column1 in ascending order and column2 in
descending order.

Example:
import pandas as pd

# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [30, 25, 30, 25],
"Score": [85, 90, 80, 95]}

df = pd.DataFrame(data)

# Sort by Age (ascending) and Score (descending)


df_sorted = df.sort_values(by=["Age", "Score"], ascending=[True, False])

print(df_sorted)

Output:
Name Age Score
3 David 25 95
1 Bob 25 90
0 Alice 30 85
2 Charlie 30 80

Sorting in Descending Order for All Columns

To sort all specified columns in descending order:

df_sorted = df.sort_values(by=["column1", "column2"], ascending=False)

 This applies descending order to both columns.

11. How do you apply a function to every element in a column of a DataFrame?

Command:
df["column_name"] = df["column_name"].apply(function_name)

 This applies function_name to every value in "column_name".

Example:
import pandas as pd

# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35]}

df = pd.DataFrame(data)

# Define a function to add 5 years to age


def add_five(x):
    return x + 5

# Apply function to "Age" column


df["Age"] = df["Age"].apply(add_five)

print(df)

Output:
Name Age
0 Alice 30
1 Bob 35
2 Charlie 40

Using a Lambda Function (Shorter Syntax)

Instead of defining a separate function, you can use a lambda function:

df["Age"] = df["Age"].apply(lambda x: x + 5)

 This adds 5 years to each value in the "Age" column.

Applying a Function to String Columns

You can also apply functions to string values:

df["Name"] = df["Name"].apply(lambda x: x.upper())

 Converts all names to uppercase.

Why Use apply()?

 Works well for custom transformations.


 More efficient than loops in pandas.
 Can be used with both numerical and string columns.

12. Write code to create a pivot table from a DataFrame.


Command:
df_pivot = df.pivot_table(index="column1", columns="column2",
values="column3", aggfunc="sum")

 index="column1" → Rows of the pivot table.


 columns="column2" → Columns of the pivot table.
 values="column3" → Data to aggregate.
 aggfunc="sum" → Aggregation function (can be "sum", "mean", "count", etc.).

Example:
import pandas as pd

# Sample DataFrame
data = {"Category": ["A", "A", "B", "B", "C"],
"Region": ["East", "West", "East", "West", "East"],
"Sales": [100, 200, 150, 250, 300]}

df = pd.DataFrame(data)

# Create pivot table


df_pivot = df.pivot_table(index="Category", columns="Region", values="Sales",
aggfunc="sum")

print(df_pivot)

Output:
Region East West
Category
A 100.0 200.0
B 150.0 250.0
C 300.0 NaN

Additional Options:

1. Change Aggregation to Mean:

df_pivot = df.pivot_table(index="Category", columns="Region",
                          values="Sales", aggfunc="mean")

2. Fill Missing Values (NaN):

df_pivot = df.pivot_table(index="Category", columns="Region",
                          values="Sales", aggfunc="sum", fill_value=0)

o Replaces NaN with 0.


3. Multiple Aggregations:

df_pivot = df.pivot_table(index="Category", columns="Region",
                          values="Sales", aggfunc=["sum", "mean"])

o Shows both sum and mean of Sales.


13. How do you concatenate two DataFrames along rows and along columns?
Concatenating Along Rows (axis=0)
df_combined = pd.concat([df1, df2], axis=0)

 Stacks DataFrames vertically (adds rows).


 Index is not reset by default.

Example:
import pandas as pd

# Create two DataFrames


df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"ID": [3, 4], "Name": ["Charlie", "David"]})

# Concatenate along rows


df_combined = pd.concat([df1, df2], axis=0)

print(df_combined)
Output:
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David

 The index is duplicated; use ignore_index=True to reset it:

df_combined = pd.concat([df1, df2], axis=0, ignore_index=True)

Concatenating Along Columns (axis=1)


df_combined = pd.concat([df1, df2], axis=1)

 Joins DataFrames side by side (adds columns).

Example:
df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"Age": [25, 30], "City": ["NY", "LA"]})

df_combined = pd.concat([df1, df2], axis=1)

print(df_combined)
Output:
ID Name Age City
0 1 Alice 25 NY
1 2 Bob 30 LA

Handling Different Column Names (Outer Join)


By default, concat() performs an outer join, keeping all columns:

df_combined = pd.concat([df1, df2], axis=0, ignore_index=True, sort=False)

 Missing values are filled with NaN.

14. Write a command to calculate the rolling mean of a column in a DataFrame.

Command:
df["rolling_mean"] = df["column_name"].rolling(window=3).mean()

 rolling(window=3) → Defines a window size of 3.


 .mean() → Computes the mean for each rolling window.

Example:
import pandas as pd

# Sample DataFrame
data = {"Day": [1, 2, 3, 4, 5, 6],
"Sales": [100, 200, 300, 400, 500, 600]}

df = pd.DataFrame(data)

# Calculate rolling mean for Sales (window size = 3)


df["Rolling_Mean"] = df["Sales"].rolling(window=3).mean()

print(df)

Output:
Day Sales Rolling_Mean
0 1 100 NaN
1 2 200 NaN
2 3 300 200.0
3 4 400 300.0
4 5 500 400.0
5 6 600 500.0

 The first two rows are NaN because the window size is 3, meaning it needs at least 3
values to calculate the mean.

Additional Options:

1. Set min_periods=1 to avoid NaN for the first rows:

df["Rolling_Mean"] = df["Sales"].rolling(window=3, min_periods=1).mean()


2. Calculate rolling sum instead of mean:

df["Rolling_Sum"] = df["Sales"].rolling(window=3).sum()

15. How do you convert a column of strings to datetime objects in a DataFrame?

Command:
df["column_name"] = pd.to_datetime(df["column_name"])

 This automatically detects and converts various date formats.

Example:
import pandas as pd

# Sample DataFrame with date as string


data = {"Date": ["2024-01-01", "2024-02-15", "2024-03-10"]}

df = pd.DataFrame(data)

# Convert "Date" column to datetime format


df["Date"] = pd.to_datetime(df["Date"])

print(df.dtypes)
print(df)

Output:
Date datetime64[ns]
dtype: object

Date
0 2024-01-01
1 2024-02-15
2 2024-03-10

 The Date column is now in datetime64 format.

Handling Different Date Formats

If the format is inconsistent, pandas automatically detects it. However, for better performance,
specify the format:

df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

 Common date formats:


o %Y-%m-%d → "2024-01-01"
o %d/%m/%Y → "01/01/2024"
o %m-%d-%Y → "01-01-2024"

Handling Errors

If some values are not valid dates, you can handle errors:

df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

 Invalid dates will be converted to NaT (Not a Time).

16. Write code to replace values in a DataFrame based on a condition.


Method 1: Using .loc[]
df.loc[df["column_name"] > threshold, "column_name"] = new_value

 df.loc[condition, "column_name"] = new_value → Updates values where the condition is met.

Example: Replace Sales > 300 with 300


import pandas as pd

# Sample DataFrame
data = {"Product": ["A", "B", "C", "D"], "Sales": [100, 250, 400, 500]}
df = pd.DataFrame(data)

# Replace values where Sales > 300


df.loc[df["Sales"] > 300, "Sales"] = 300

print(df)
Output:
Product Sales
0 A 100
1 B 250
2 C 300
3 D 300

Method 2: Using apply() for Complex Conditions


df["column_name"] = df["column_name"].apply(lambda x: new_value if condition
else x)
Example: Replace Sales > 300 with "High" and others with "Low"
df["Sales_Category"] = df["Sales"].apply(lambda x: "High" if x > 300 else
"Low")
print(df)
Output:
Product Sales Sales_Category
0 A 100 Low
1 B 250 Low
2 C 300 Low
3 D 300 Low
Method 3: Using .replace()
df["column_name"] = df["column_name"].replace({old_value: new_value})
Example: Replace "A" with "Apple" in Product Column
df["Product"] = df["Product"].replace({"A": "Apple"})

Method 4: Using .where() and .mask()


df["column_name"] = df["column_name"].where(~condition, new_value)

 Keeps values where the condition is False, otherwise replaces them.

df["column_name"] = df["column_name"].mask(condition, new_value)

 Replaces values where the condition is True.

Example: Set Sales > 300 to NaN


df["Sales"] = df["Sales"].mask(df["Sales"] > 300, float("nan"))

17. How do you use the groupby function to perform multiple aggregation
operations?
Command:
df.groupby("column_name").agg({"col1": "sum", "col2": "mean"})

 groupby("column_name") → Groups by a specific column.


 .agg({"col1": "sum", "col2": "mean"}) → Applies different aggregation functions.

Example:
import pandas as pd

# Sample DataFrame
data = {"Category": ["A", "A", "B", "B", "C", "C"],
"Sales": [100, 200, 150, 250, 300, 400],
"Profit": [10, 20, 15, 25, 30, 40]}

df = pd.DataFrame(data)

# Group by "Category" and perform multiple aggregations


df_grouped = df.groupby("Category").agg({"Sales": ["sum", "mean"],
                                         "Profit": ["max", "min"]})

print(df_grouped)

Output:
Sales Profit
sum mean max min
Category
A 300 150.0 20 10
B 400 200.0 25 15
C 700 350.0 40 30

 Sales: Sum and Mean calculated per Category.


 Profit: Max and Min calculated per Category.

Alternative Approaches

1. Using .agg() with a list of functions:

df.groupby("Category").agg({"Sales": ["sum", "mean"], "Profit": ["max",


"min"]})

2. Using Named Aggregations (Pandas 1.0+):

df.groupby("Category").agg(
Total_Sales=("Sales", "sum"),
Average_Sales=("Sales", "mean"),
Max_Profit=("Profit", "max"),
Min_Profit=("Profit", "min")
)

o This renames columns in the output.

18. Write a command to drop rows with missing values from a DataFrame.
Command:
df_cleaned = df.dropna()

 Removes all rows that contain at least one NaN (missing value).
 Returns a new DataFrame (does not modify the original).

Example:
import pandas as pd

# Sample DataFrame with missing values


data = {"Name": ["Alice", "Bob", None, "David"],
"Age": [25, None, 30, 40],
"City": ["NY", "LA", "SF", None]}

df = pd.DataFrame(data)

# Drop rows with any missing values


df_cleaned = df.dropna()
print(df_cleaned)

Output:
Name Age City
0 Alice 25.0 NY

 Only rows without NaN values are kept.

Additional Options

1. Drop rows only if all values are NaN:

df.dropna(how="all")

2. Drop rows only if specific column has NaN:

df.dropna(subset=["Age"])

3. Modify the DataFrame in place:

df.dropna(inplace=True)

19. How do you export a DataFrame to a CSV file?


Command:
df.to_csv("filename.csv", index=False)

 "filename.csv" → Specifies the name of the output file.


 index=False → Excludes the DataFrame index from the CSV file.

Example:
import pandas as pd

# Sample DataFrame
data = {"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "Los Angeles", "Chicago"]}

df = pd.DataFrame(data)

# Export DataFrame to CSV


df.to_csv("output.csv", index=False)
print("CSV file saved successfully!")

📂 This will create an output.csv file with the following content:

Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago

Additional Options

1. Include index in CSV:

df.to_csv("output.csv", index=True)

2. Change delimiter (e.g., | instead of ,):

df.to_csv("output.csv", sep="|", index=False)

3. Save without headers:

df.to_csv("output.csv", index=False, header=False)

4. Specify encoding (for special characters):

df.to_csv("output.csv", index=False, encoding="utf-8")

20. Write a command to change the data type of a column in a DataFrame.


Command:
df["column_name"] = df["column_name"].astype(new_dtype)

 Converts "column_name" to the specified new_dtype.

Example 1: Convert Column to Integer


import pandas as pd

# Sample DataFrame
data = {"Age": ["25", "30", "35"], "Salary": ["50000", "60000", "70000"]}
df = pd.DataFrame(data)

# Convert Age and Salary to integer


df["Age"] = df["Age"].astype(int)
df["Salary"] = df["Salary"].astype(float)

print(df.dtypes)
Output:
Age int64
Salary float64
dtype: object

🔹 "Age" is now an integer, and "Salary" is a float.

Example 2: Convert to Datetime


df["Date"] = pd.to_datetime(df["Date"])

✅ Converts a string column into datetime format.

Example 3: Convert to Categorical


df["Category"] = df["Category"].astype("category")

✅ Reduces memory usage for categorical data.

 Handle errors in conversion:

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

 Convert multiple columns at once:

df = df.astype({"Age": int, "Salary": float})

21. How do you filter a DataFrame to select only specific columns?


Command:
df_filtered = df[["column1", "column2"]]

 Creates a new DataFrame with only the selected columns.

Example:
import pandas as pd

# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "Los Angeles", "Chicago"],
"Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
# Select only "Name" and "Salary" columns
df_filtered = df[["Name", "Salary"]]

print(df_filtered)

Output:
Name Salary
0 Alice 50000
1 Bob 60000
2 Charlie 70000

✅ Only "Name" and "Salary" columns are selected.

Additional Methods

1. Select columns using .loc[] (explicit label-based selection)

df_filtered = df.loc[:, ["Name", "Salary"]]

2. Select columns using .iloc[] (index-based selection)

df_filtered = df.iloc[:, [0, 3]] # Selects first and fourth columns

3. Filter columns using a condition (e.g., column names containing a keyword)

df_filtered = df.filter(like="Sal") # Selects columns containing "Sal"

22. Write code to create a new DataFrame with only unique rows from an
existing DataFrame.
Command:
df_unique = df.drop_duplicates()

 Removes duplicate rows while keeping only unique ones.

Example:
import pandas as pd

# Sample DataFrame with duplicate rows


data = {
"Name": ["Alice", "Bob", "Charlie", "Alice"],
"Age": [25, 30, 35, 25],
"City": ["New York", "Los Angeles", "Chicago", "New York"]
}
df = pd.DataFrame(data)

# Create a new DataFrame with unique rows


df_unique = df.drop_duplicates()

print(df_unique)

Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago

✅ The duplicate row for Alice has been removed.

Additional Options

1. Keep only unique rows based on specific columns

df_unique = df.drop_duplicates(subset=["Name", "Age"])

2. Keep the last occurrence of a duplicate row instead of the first

df_unique = df.drop_duplicates(keep="last")

3. Reset index after removing duplicates

df_unique = df.drop_duplicates().reset_index(drop=True)

23. How do you reset the index of a DataFrame?


Command:
df.reset_index(drop=True, inplace=True)

 drop=True → Removes the old index instead of adding it as a column.


 inplace=True → Modifies the DataFrame directly.

Example:
import pandas as pd

# Sample DataFrame with custom index


data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]}
df = pd.DataFrame(data, index=["a", "b", "c"])

print("Before Reset:")
print(df)
# Reset index
df.reset_index(drop=True, inplace=True)

print("\nAfter Reset:")
print(df)

Output:
Before Reset:
Name Age
a Alice 25
b Bob 30
c Charlie 35

After Reset:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35

 Old index ("a", "b", "c") is removed, and a new default numeric index (0, 1, 2) is
created.

Alternative Approaches

1. Keep old index as a column:

df.reset_index(drop=False)

2. Create a new index after dropping some rows:

df = df.dropna().reset_index(drop=True)

24. Write a command to calculate the correlation matrix of a DataFrame.


Command:
df.corr()

 Computes the pairwise correlation between numerical columns.

Example:
import pandas as pd

# Sample DataFrame
data = {
"Sales": [100, 200, 300, 400, 500],
"Profit": [10, 20, 30, 40, 50],
"Discount": [5, 3, 4, 2, 1]
}

df = pd.DataFrame(data)

# Calculate correlation matrix


corr_matrix = df.corr()

print(corr_matrix)

Output:
          Sales  Profit  Discount
Sales      1.00    1.00     -0.90
Profit     1.00    1.00     -0.90
Discount  -0.90   -0.90      1.00

 1.00 → Perfect positive correlation.


 -0.87 → Strong negative correlation.

Additional Options

1. Use a specific correlation method (pearson, kendall, spearman)

df.corr(method="kendall")

2. Visualize correlation matrix using a heatmap (with Seaborn)

import seaborn as sns


import matplotlib.pyplot as plt

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")


plt.show()

25. How do you handle categorical data in a DataFrame, including encoding and one-hot
encoding?

Handling Categorical Data in Pandas

Categorical data needs to be converted into a numerical format for machine learning models. The
two most common encoding methods are Label Encoding and One-Hot Encoding.

1️⃣ Label Encoding

 Assigns a unique integer to each category.


 Suitable for ordinal (ordered) data.

Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame
df = pd.DataFrame({"Category": ["Low", "Medium", "High", "Medium", "Low"]})

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Transform categorical values to numbers


df["Category_Encoded"] = label_encoder.fit_transform(df["Category"])

print(df)
Output:
Category Category_Encoded
0 Low 1
1 Medium 2
2 High 0
3 Medium 2
4 Low 1

 "High" → 0, "Low" → 1, "Medium" → 2


 Use when order matters (e.g., Low < Medium < High).

2️⃣ One-Hot Encoding (OHE)

 Converts each category into separate binary (0/1) columns.


 Suitable for nominal (unordered) data.

Example: Using pd.get_dummies()


df_ohe = pd.get_dummies(df, columns=["Category"], drop_first=True)
print(df_ohe)
Output:
   Category_Low  Category_Medium
0             1                0
1             0                1
2             0                0
3             0                1
4             1                0

 Each remaining category becomes a separate binary (0/1) column; "High" is implied when both are 0.


 drop_first=True drops the first category to avoid the dummy variable trap.

3️⃣ Using OneHotEncoder from Scikit-Learn


from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder (scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, drop="first")

# Transform categorical column


encoded_values = ohe.fit_transform(df[["Category"]])

# Convert to DataFrame
df_encoded = pd.DataFrame(encoded_values,
                          columns=ohe.get_feature_names_out(["Category"]))

# Concatenate with original DataFrame


df_final = pd.concat([df, df_encoded], axis=1)

print(df_final)

1.Find all unique employee names who work in more than one department.

Sample DataFrame:
df = pd.DataFrame({'EmployeeName': ['John Doe', 'Jane Smith', 'Alice Johnson', 'John Doe'],
                   'Department': ['Sales', 'Marketing', 'Sales', 'Marketing']})
You can find all unique employee names who work in more than one department using the
groupby() and nunique() methods.

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'EmployeeName': ['John Doe', 'Jane Smith', 'Alice Johnson', 'John Doe'],
'Department': ['Sales', 'Marketing', 'Sales', 'Marketing']
})

# Find employees working in more than one department


employees_multiple_depts = df.groupby("EmployeeName")["Department"].nunique()
unique_employees = employees_multiple_depts[employees_multiple_depts > 1].index.tolist()

print(unique_employees)

Output:
['John Doe']

✅ John Doe works in both Sales and Marketing, so he is included in the result.

2. Calculate the monthly average sales for each product. Assume sales data is daily.

Sample DataFrame:
df = pd.DataFrame({'Date': pd.date_range(start='2023-01-01', end='2023-03-31', freq='D'),
                   'Product': np.random.choice(['ProductA', 'ProductB'], 90),
                   'Sales': np.random.randint(100, 500, 90)})
You can calculate the monthly average sales for each product using groupby() on a monthly period (dt.to_period('M')).

Solution:
import pandas as pd
import numpy as np

# Sample DataFrame
np.random.seed(42) # Ensures reproducibility
df = pd.DataFrame({
'Date': pd.date_range(start='2023-01-01', end='2023-03-31', freq='D'),
'Product': np.random.choice(['ProductA', 'ProductB'], 90),
'Sales': np.random.randint(100, 500, 90)
})

# Convert 'Date' column to datetime format (if not already)


df['Date'] = pd.to_datetime(df['Date'])

# Group by month and product, then calculate average sales


monthly_avg_sales = df.groupby([df['Date'].dt.to_period('M'), 'Product'])['Sales'].mean().reset_index()

# Convert 'Date' back to a string format


monthly_avg_sales['Date'] = monthly_avg_sales['Date'].astype(str)

print(monthly_avg_sales)

Output (Example)
Date Product Sales
0 2023-01 ProductA 289.032258
1 2023-01 ProductB 315.000000
2 2023-02 ProductA 292.500000
3 2023-02 ProductB 312.000000
4 2023-03 ProductA 320.000000
5 2023-03 ProductB 305.000000

Explanation

1. Convert Date to datetime format (if not already).


2. Group by month (dt.to_period('M')) and Product.
3. Calculate the average sales (mean()) for each product per month.
4. Convert the Date period back to a string for better readability.

3. Identify the top 3 employees with the highest sales in each quarter.

Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe', 'Smith', 'Alice'],
                   'Quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q3'],
                   'Sales': [200, 150, 300, 250, 400]})
You can find the top 3 employees with the highest sales in each quarter using groupby() along with nlargest() or rank().

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe', 'Smith', 'Alice', 'Bob', 'Eve',
'Charlie', 'David', 'Frank'],
'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3'],
'Sales': [200, 150, 300, 250, 400, 380, 220, 500, 450, 420]
})

# Get top 3 employees per quarter


top_3_per_quarter = df.groupby("Quarter").apply(lambda x: x.nlargest(3, "Sales")).reset_index(drop=True)

print(top_3_per_quarter)

Output (Example)
  Employee Quarter  Sales
0      Doe      Q1    300
1    Smith      Q1    250
2     John      Q1    200
3    Alice      Q2    400
4      Bob      Q2    380
5      Eve      Q2    220
6  Charlie      Q3    500
7    David      Q3    450
8    Frank      Q3    420

(nlargest() returns each quarter's rows in descending order of Sales.)

Explanation

1. Group by "Quarter" to analyze sales per quarter.


2. Use nlargest(3, "Sales") to get the top 3 employees by sales in each quarter.
3. Reset the index to flatten the grouped data.

4. Analyze the attendance records to find employees with more than 95% attendance
throughout the year.

Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe'], 'TotalDays': [365, 365, 365],
                   'DaysAttended': [365, 350, 360]})
You can find employees with more than 95% attendance using a percentage calculation and filtering.

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe'],
'TotalDays': [365, 365, 365],
'DaysAttended': [365, 350, 360]
})

# Calculate attendance percentage


df['AttendanceRate'] = (df['DaysAttended'] / df['TotalDays']) * 100

# Filter employees with attendance > 95%


high_attendance_employees = df[df['AttendanceRate'] > 95]

print(high_attendance_employees)

Output:
  Employee  TotalDays  DaysAttended  AttendanceRate
0     John        365           365      100.000000
1     Jane        365           350       95.890411
2      Doe        365           360       98.630137

✅ All three employees have attendance greater than 95% (Jane just clears the bar at ~95.9%).

Explanation

1. Calculate AttendanceRate = (DaysAttended / TotalDays) * 100


2. Filter employees where AttendanceRate > 95%.

5. Calculate the monthly customer retention rate based on the transaction logs.

Sample DataFrame:
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
                   'CustomerID': [1, 1, 1, 2, 2, 3], 'TransactionCount': [1, 2, 1, 3, 2, 1]})
Calculate Monthly Customer Retention Rate

Customer retention rate is the percentage of customers who made a purchase in a previous
month and returned in the current month.
Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'CustomerID': [1, 1, 1, 2, 2, 3],
'TransactionCount': [1, 2, 1, 3, 2, 1]
})

# Convert month names to datetime format for sorting


df['Month'] = pd.to_datetime(df['Month'], format='%b')

# Sort data by month and customer


df = df.sort_values(by=['CustomerID', 'Month'])

# Mark retained customers (any appearance after a customer's first month counts as retained)
df['Retained'] = df.duplicated(subset=['CustomerID'], keep='first')

# Calculate retention rate per month


retention_rate = df.groupby(df['Month'].dt.strftime('%b'))['Retained'].mean() * 100

# Convert to DataFrame
retention_rate = retention_rate.reset_index().rename(columns={'Retained': 'RetentionRate (%)'})

print(retention_rate)

Example Output:
  Month  RetentionRate (%)
0   Feb              100.0
1   Jan                0.0
2   Mar               50.0

(groupby sorts the month labels alphabetically; with this data no customer is a repeat in Jan, both customers seen in Feb are repeats, and one of the two customers seen in Mar is a repeat.)

Explanation

1. Convert Month to datetime for proper sorting.


2. Sort by CustomerID and Month to track customer activity.
3. Identify retained customers:
o A customer is marked as retained in every month after their first appearance.
4. Calculate retention rate per month:
o (Retained customers in month) / (Total customers in that month) * 100
5. Convert to a readable format.

6. Determine the average time employees spent on projects, assuming you have start and end dates for each project participation.
Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe'],
                   'ProjectStart': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-01']),
                   'ProjectEnd': pd.to_datetime(['2023-01-31', '2023-03-15', '2023-04-01'])})

Determine the Average Time Employees Spent on Projects

You can calculate the average duration employees spent on projects by computing the
difference between ProjectEnd and ProjectStart and then taking the mean.

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe'],
    'ProjectStart': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-01']),
    'ProjectEnd': pd.to_datetime(['2023-01-31', '2023-03-15', '2023-04-01'])
})

# Calculate project duration in days


df['ProjectDuration'] = (df['ProjectEnd'] - df['ProjectStart']).dt.days

# Calculate the average time spent on projects


average_time_spent = df['ProjectDuration'].mean()

print(f"Average time employees spent on projects: {average_time_spent} days")

Output:
Average time employees spent on projects: 29.67 days

Explanation

1. Convert ProjectStart and ProjectEnd to datetime (if not already).


2. Calculate the project duration by subtracting ProjectStart from ProjectEnd.
3. Compute the average duration using .mean().

7. Compute the month-on-month growth rate in sales for each product, highlighting products with more than 10% growth for consecutive months.

Sample DataFrame:
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
                   'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Sales': [200, 220, 240, 150, 165, 180]})

Compute Month-on-Month Growth Rate in Sales for Each Product
To calculate the month-on-month (MoM) growth rate, we use the formula:

Growth Rate (%) = (Current Month Sales − Previous Month Sales) / Previous Month Sales × 100

Then, we identify products that have more than 10% growth for consecutive months.

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
'Sales': [200, 220, 240, 150, 165, 180]
})

# Convert month names to datetime for proper sorting


df['Month'] = pd.to_datetime(df['Month'], format='%b')

# Sort data by Product and Month


df = df.sort_values(by=['Product', 'Month'])

# Calculate Month-on-Month Growth Rate


df['MoM_Growth'] = df.groupby('Product')['Sales'].pct_change() * 100

# Identify products with >10% growth for consecutive months


df['HighGrowth'] = (df['MoM_Growth'] > 10) & (df.groupby('Product')['MoM_Growth'].shift(1) > 10)

# Display results
print(df)

Output:
       Month Product  Sales  MoM_Growth  HighGrowth
0 1900-01-01       A    200         NaN       False
1 1900-02-01       A    220    10.00000       False
2 1900-03-01       A    240     9.09091       False
3 1900-01-01       B    150         NaN       False
4 1900-02-01       B    165    10.00000       False
5 1900-03-01       B    180     9.09091       False

(parsing bare month names with format='%b' defaults the year to 1900; only the month ordering matters here)

Explanation

1. Convert Month to datetime for correct ordering.


2. Sort the DataFrame by Product and Month.
3. Compute the MoM growth rate using pct_change() * 100.
4. Mark products that have >10% growth in consecutive months.
o If both the current and previous month's growth rate exceed 10%, set HighGrowth
= True.

8. Identify the time of day (morning, afternoon, evening) when sales peak for
each category of products.

Sample DataFrame:
df = pd.DataFrame({'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
                   'TimeOfDay': ['Morning', 'Afternoon', 'Evening', 'Morning'],
                   'Sales': [300, 150, 500, 200]})
Identify Peak Sales Time for Each Product Category

To find the time of day when sales peak for each category, we need to group by Category and
find the TimeOfDay with the maximum sales.

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
'TimeOfDay': ['Morning', 'Afternoon', 'Evening', 'Morning'],
'Sales': [300, 150, 500, 200]
})

# Find the time of day with the highest sales for each category
peak_sales_time = df.loc[df.groupby('Category')['Sales'].idxmax(),
                         ['Category', 'TimeOfDay', 'Sales']]

print(peak_sales_time)

Output:
Category TimeOfDay Sales
2 Electronics Evening 500
3 Clothing Morning 200

Explanation

1. Group by Category to analyze each category separately.


2. Find the TimeOfDay with max sales using idxmax() to locate the row with the highest
sales per category.
3. Extract the relevant columns (Category, TimeOfDay, Sales) to display the peak sales
time.
9. Evaluate the distribution of workload among employees based on the
number of tasks assigned in the last six months.

Sample DataFrame:
df = pd.DataFrame({'Employee': ['John', 'Jane', 'Doe'], 'TasksAssigned': [20, 25, 15]})

Evaluate the Distribution of Workload Among Employees

To analyze the workload distribution, we can:

1. Visualize the distribution using a bar chart.


2. Compute basic statistics such as mean, median, and standard deviation.

Solution:
import pandas as pd

import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
'Employee': ['John', 'Jane', 'Doe'],
'TasksAssigned': [20, 25, 15]
})

# Compute basic statistics


mean_tasks = df['TasksAssigned'].mean()
median_tasks = df['TasksAssigned'].median()
std_tasks = df['TasksAssigned'].std()

# Print statistics
print(f"Mean tasks assigned: {mean_tasks}")
print(f"Median tasks assigned: {median_tasks}")
print(f"Standard deviation: {std_tasks}")

# Visualizing the workload distribution


plt.bar(df['Employee'], df['TasksAssigned'], color=['blue', 'green', 'red'])
plt.axhline(mean_tasks, color='black', linestyle='dashed', label='Mean')
plt.xlabel("Employee")
plt.ylabel("Number of Tasks Assigned")
plt.title("Workload Distribution Among Employees")
plt.legend()
plt.show()

Output (Example Statistics)


Mean tasks assigned: 20.0
Median tasks assigned: 20.0
Standard deviation: 5.0
🔹 Visualization: A bar chart showing how tasks are distributed among employees.

Explanation

1. Compute key statistics:


o Mean: Average workload.
o Median: Middle value of assigned tasks.
o Standard deviation: How much workload varies.
2. Plot a bar chart:
o Show tasks assigned to each employee.
o Include a dashed line for the mean to highlight imbalance.

10. Calculate the profit margin for each product category based on revenue
and cost data.

Sample DataFrame:
df = pd.DataFrame({'Category': ['Electronics', 'Clothing'], 'Revenue': [1000, 500], 'Cost': [700, 300]})

Calculate Profit Margin for Each Product Category

The profit margin is calculated as:

Profit Margin (%) = (Revenue − Cost) / Revenue × 100

This formula expresses profit margin as a percentage.

Solution:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Category': ['Electronics', 'Clothing'],
'Revenue': [1000, 500],
'Cost': [700, 300]
})

# Calculate Profit and Profit Margin


df['Profit'] = df['Revenue'] - df['Cost']
df['Profit_Margin (%)'] = (df['Profit'] / df['Revenue']) * 100

# Display results
print(df)
Output:
Category Revenue Cost Profit Profit_Margin (%)
0 Electronics 1000 700 300 30.0
1 Clothing 500 300 200 40.0

Explanation

1. Calculate profit: Profit = Revenue - Cost


2. Compute profit margin: Profit_Margin (%) = (Profit / Revenue) * 100
3. Result:
o Electronics: 30% profit margin.
o Clothing: 40% profit margin.
