PREDICTIVE MODELLING –
LINEAR REGRESSION
SHOWTIME – OTT SERVICES
APRIL 20, 2025
PGPDSBA
Sham Solanki
Contents
Problem Statement
Business Context
Objective
Data Description
Data Overview
Displaying the first 5 rows of dataset
Displaying the shape of the dataset
Displaying the Datatypes of the Columns
Displaying the Statistical Summary
Checking the missing values
Exploratory Data Analysis
Univariate Analysis
Bivariate Analysis
KEY QUESTIONS
DATA PROCESSING
1. Duplicates and Missing values
2. Feature Engineering
3. Outlier Detection and Treatment
4. Data Preparation for Modelling
Model Building
Mean Squared Error (MSE): 0.0025
Interpretation
Coefficients
Testing Assumptions Of Linear Regression Model
Model Performance Evaluation Based On Different Metrics
Actionable Insights & Business Recommendation
Overall Assessment
Key Takeaways For Business
Problem Statement
Business Context
An over-the-top (OTT) media service is a media service offered directly to viewers via
the internet. The term is most synonymous with subscription-based video-on-demand
services that offer access to film and television content, including existing series
acquired from other producers, as well as original content produced specifically for the
service. They are typically accessed via websites on personal computers, apps on
smartphones and tablets, or televisions with integrated Smart TV platforms.
Presently, OTT services are at a relatively nascent stage and are widely accepted as a
trending technology across the globe. With customers' social behavior increasingly
shifting from traditional broadcast subscriptions to OTT on-demand video and music
subscriptions every year, OTT streaming is expected to grow at a very fast pace. The
global OTT market size was valued at $121.61 billion in
2019 and is projected to reach $1,039.03 billion by 2027, growing at a CAGR of 29.4%
from 2020 to 2027. The shift from television to OTT services for entertainment is driven
by benefits such as on-demand services, ease of access, and access to better networks
and digital connectivity.
With the outbreak of COVID-19, OTT services are striving to meet the growing
entertainment appetite of viewers, with some platforms already experiencing a 46%
increase in consumption and subscriber count as viewers seek fresh content. With
innovations and advanced transformations, which will enable the customers to access
everything they want in a single space, OTT platforms across the world are expected to
increasingly attract subscribers on a concurrent basis.
Objective
ShowTime is an OTT service provider and offers a wide variety of content (movies, web
shows, etc.) for its users. They want to determine the driver variables for first-day
content viewership so that they can take necessary measures to improve the viewership
of the content on their platform. Some of the reasons for the decline in viewership of
content would be the decline in the number of people coming to the platform, decreased
marketing spend, content timing clashes, weekends and holidays, etc. They have hired
you as a Data Scientist, shared the data of the current content in their platform, and
asked you to analyze the data and come up with a linear regression model to determine
the driving factors for first-day viewership.
Data Description
1. visitors: Average number of visitors, in millions, to the platform in the past week
2. ad_impressions: Number of ad impressions, in millions, across all ad campaigns
for the content (running and completed)
3. major_sports_event: Any major sports event on the day
4. genre: Genre of the content
5. dayofweek: Day of the release of the content
6. season: Season of the release of the content
7. views_trailer: Number of views, in millions, of the content trailer
8. views_content: Number of first-day views, in millions, of the content
Data Overview
The initial steps to get an overview of any dataset are to:
• observe the first few rows of the dataset, to check whether the dataset has been
loaded properly or not
• get information about the number of rows and columns in the dataset
• find out the data types of the columns to ensure that data is stored in the
preferred format and the value of each property is as expected.
• check the statistical summary of the dataset to get an overview of the numerical
columns of the data
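These checks can be sketched in pandas. The DataFrame below is a small synthetic stand-in for the real ShowTime dataset (which the report loads from a file); the column names follow the data description above:

```python
import pandas as pd

# Small stand-in for the real dataset (the report loads it via pd.read_csv)
df = pd.DataFrame({
    "visitors": [1.67, 1.46, 1.55],
    "genre": ["Horror", "Thriller", "Thriller"],
    "views_content": [0.51, 0.32, 0.39],
})

print(df.head())          # first rows - sanity-check the load
print(df.shape)           # (rows, columns)
df.info()                 # column dtypes and non-null counts
print(df.describe())      # statistical summary of the numeric columns
print(df.isnull().sum())  # missing values per column
```

The same five calls, run on the full dataset, produce the outputs shown in the subsections that follow.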
Displaying the first 5 rows of dataset
Displaying the shape of the dataset
The data has 1000 rows and 8 columns.
(1000, 8)
Displaying the Datatypes of the Columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 visitors 1000 non-null float64
1 ad_impressions 1000 non-null float64
2 major_sports_event 1000 non-null int64
3 genre 1000 non-null object
4 dayofweek 1000 non-null object
5 season 1000 non-null object
6 views_trailer 1000 non-null float64
7 views_content 1000 non-null float64
dtypes: float64(4), int64(1), object(3)
memory usage: 62.6+ KB
Displaying the Statistical Summary
Checking the missing values
visitors 0
ad_impressions 0
major_sports_event 0
genre 0
dayofweek 0
season 0
views_trailer 0
views_content 0
dtype: int64
Exploratory Data Analysis
Univariate Analysis
Visitors
Observation
• Shape: Approximately normal distribution with a slight right skew
• Range: Primarily between 1.2 and 2.2 (likely in millions)
• Peak: Highest frequency around 1.7-1.8
Key Observations:
• Most content receives between 1.6-1.8 million visitors
• There's a gradual decline in frequency for both very low (<1.4) and very high (>2.0) visitor counts
• The relatively symmetrical distribution suggests stable platform traffic patterns
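Distributions like this are typically inspected with a histogram; a minimal matplotlib sketch, using synthetic visitor counts as a stand-in for the real column:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic stand-in: roughly normal visitor counts centred near 1.7 million
visitors = rng.normal(loc=1.7, scale=0.15, size=1000)

counts, edges = np.histogram(visitors, bins=20)  # bin counts for inspection
plt.hist(visitors, bins=20, edgecolor="black")
plt.xlabel("Visitors (millions)")
plt.ylabel("Frequency")
plt.title("Distribution of weekly visitors")
plt.savefig("visitors_hist.png")
```

The same call, pointed at each numeric column in turn, produces the univariate plots discussed in this section.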
Ad Impressions
Observation
• Shape: Bimodal distribution (two peaks)
• Range: 1000-2400 impressions
• Primary Peak: Around 1300-1400 impressions
• Secondary Peak: Around 1600-1700 impressions
Key Observations:
• The bimodal nature suggests two distinct advertising strategies or campaigns
• The majority of content receives between 1200-1800 ad impressions
• A long right tail indicates some content getting significantly higher advertising exposure
• This could indicate targeted marketing for certain content types
Views Trailers
Observation
• Shape: Highly right-skewed with a sharp peak
• Range: 25-200 views
• Peak: Concentrated around 50 views
Key Observations:
• Most trailers receive around 50 views
• Very few trailers receive more than 75 views
• The sharp peak suggests consistency in trailer viewing behavior
• A long right tail indicates a few highly successful trailer campaigns
Views Content
Observation
• Shape: Slightly right-skewed with a clear central tendency
• Range: 0.2-0.9 (likely in millions)
• Peak: Around 0.4-0.5
Key Observations:
• Most content receives between 0.4-0.5 million views
• More symmetric than trailer views, suggesting more predictable content viewing patterns
• A gradual decline in higher view counts indicates natural viewing limits
Genres
Observations
Genre Distribution Pattern:
• The distribution is multimodal, with several peaks across different genres
• The "Others" category shows the highest frequency, with around 250 entries
• Most other genres have frequencies between 100-125 entries
Genre-specific Frequencies:
• The "Others" category is significantly overrepresented compared to specific genres
• Horror, Thriller, and Sci-Fi show similar frequencies (around 100 entries each)
• Comedy and Romance also show comparable frequencies to each other
• Drama and Action have slightly lower frequencies than other specific genres
Category Groupings:
• Thriller and Sci-Fi show similar frequencies, suggesting comparable popularity
• Comedy and Romance have nearly identical frequencies, possibly indicating viewer overlap
• Horror maintains a moderate presence in the collection
Balance Analysis:
• Excluding the "Others" category, there's a relatively even distribution across genres
• The specific genres show a balanced representation in the content library
• No specific genre (excluding "Others") dominates or is significantly underrepresented
Bivariate Analysis
Correlation Heatmap
Observation
1. Strong Positive Correlation:
• The strongest correlation (0.75) exists between views_trailer and views_content
• This suggests that people who watch trailers are likely to watch the actual content
• The relationship makes intuitive sense, as interest in trailers often translates to content viewing
2. Weak Correlations:
• Visitors and ad_impressions show almost no correlation (0.03)
• Ad_impressions has very weak correlations with all other metrics: only 0.01 with views_trailer and 0.05 with views_content
• This suggests ad impressions operate somewhat independently of the other metrics
3. Moderate Correlations:
• Visitors has a weak positive correlation (0.26) with views_content, indicating that higher visitor numbers somewhat translate to more content views, but the relationship isn't very strong
• Visitors has a slight negative correlation (-0.03) with views_trailer
4. Pattern Insights:
• The effectiveness of ads seems questionable given their weak correlations with viewing metrics
• Trailer views are a much better predictor of content viewing than visitor numbers or ad impressions
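A correlation matrix like the one discussed here comes straight from pandas; a sketch with synthetic stand-in columns, where views_content is deliberately built to track views_trailer:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
trailer = rng.normal(56, 10, 500)
df = pd.DataFrame({
    "visitors": rng.normal(1.7, 0.15, 500),
    "ad_impressions": rng.normal(1400, 300, 500),
    "views_trailer": trailer,
    # Content views constructed to correlate strongly with trailer views
    "views_content": 0.005 * trailer + rng.normal(0.2, 0.03, 500),
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
```

On this synthetic data, views_trailer vs views_content dominates while the independent columns stay near zero, mirroring the pattern the report found; the report's heatmap is typically drawn by passing a matrix like `corr` to a plotting library.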
1. Visitors vs All Other Variables
Visitors vs Genre
Observation
Distribution Patterns:
• Action shows the highest median visitors (~1.75)
• Romance and Comedy follow closely
• Horror, Thriller, and Sci-Fi show similar visitor patterns
• The 'Others' category shows the highest variability
Key Insights:
• Action content drives more platform traffic
• Traditional genres (Horror, Thriller) show consistent but lower visitor numbers
• All genres maintain baseline visitors between 1.4-1.8 million
Outlier Treatment
Visitors vs Day of Week
Outlier Treatment
Observation
Temporal Patterns:
• Saturday shows the highest median visitors
• Weekdays (Tuesday-Thursday) show relatively consistent patterns
• Weekend days (Friday-Sunday) show slightly higher visitor numbers
• Monday shows moderate visitor numbers
Before vs After Comparison:
• More consistent patterns in the "After" period
• Reduced outliers in recent data
• Slightly higher median values across most days
• Weekend preference remains consistent
Visitors vs Season
Outlier Treatment
Observation
From the original seasonal distribution (Image 1):
Seasonal Patterns:
• All seasons show similar median visitor numbers (around 1.7)
• The interquartile ranges (box sizes) are relatively consistent across seasons
• Each season has some outliers at higher visitor numbers (around 2.2)
• The distribution appears fairly symmetric for all seasons
Looking at the Before/After outlier treatment (Image 2):
Changes After Treatment:
• The extreme outliers at the top (around 2.3-2.4) have been removed
• The overall shape and central tendency of the distributions remain largely unchanged
• The whiskers (vertical lines extending from the boxes) are now more uniform across seasons
• The boxes (interquartile ranges) maintain their original positions
Impact of Treatment:
• The outlier treatment has preserved the underlying seasonal patterns
• The data is now more consistent across seasons
• The treatment appears to have been conservative, maintaining the natural variation while removing only extreme values
• Winter and Summer show slightly higher medians after treatment, but the difference is minimal
Overall, this suggests that:
• Visitor numbers are relatively stable across seasons, with no strong seasonal effect
• The outlier treatment successfully removed extreme values while preserving the natural seasonal patterns
• The data cleaning has made the distributions more comparable across seasons without losing important trends
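A conservative treatment like the one described is commonly done by capping values at the boxplot whisker bounds. A sketch on synthetic data; the 1.5×IQR rule here is an assumption about the method used, not something the report states explicitly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic visitor counts with a few extreme values appended
visitors = pd.Series(
    np.append(rng.normal(1.7, 0.15, 995), [2.6, 2.5, 2.45, 2.4, 0.8])
)

q1, q3 = visitors.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) rather than drop, so the row count is preserved
treated = visitors.clip(lower=lower, upper=upper)
print(f"bounds: [{lower:.2f}, {upper:.2f}], values capped: {(visitors != treated).sum()}")
```

Capping keeps every observation while pulling the extremes to the whisker bounds, which matches the observed behavior: medians and boxes unchanged, only the most extreme points affected.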
Visitors vs Views Content
Observation
Relationship Pattern:
• There's a positive correlation between visitors and content views
• The relationship appears to be somewhat linear, but with considerable scatter
• The density of points is highest in the middle range (around 0.4-0.6 content views)
Distribution Features:
• Most data points cluster between 1.4 and 2.0 visitors
• Content views mostly fall between 0.3 and 0.7
• There are some outlier points, particularly in the upper right quadrant
• The scatter becomes more dispersed as content views increase
Notable Insights:
• Higher content views generally correspond to higher visitor numbers
• There's significant variation in visitor numbers for any given level of content views
• The relationship seems to have more variability at higher content view levels
2. Views Content vs Other Variables
Views Content vs Day of Week
Observation
Daily Patterns:
• Wednesday shows the highest median content views and the largest interquartile range
• Friday appears to have the lowest median content views
• The distribution is relatively consistent across the other days
• All days show several outliers (dots) above the whiskers, indicating some unusually high viewing periods
Distribution Characteristics:
• Each day shows a similar interquartile range (box size), suggesting consistent variability
• The medians (horizontal lines in the boxes) are fairly stable across days, mostly between 0.4 and 0.5
• There's a slight positive skew across all days, shown by the higher position of outliers
Outlier Treatment
Views Content vs Season
Observation
• The median content views are fairly consistent across seasons, hovering around 0.45-0.5
• Winter appears to have a slightly higher median and larger spread
• There are several outliers in all seasons, particularly in Summer, with content views reaching up to 0.8-0.9
• Summer and Winter show slightly more variability in their distributions compared to Spring and Fall
Views Content vs Views Trailer
Observation
• There's a clear positive correlation between trailer views and content views
• The relationship appears to be non-linear, with the correlation becoming stronger as trailer views increase
• There's a dense cluster of data points around 50 trailer views, suggesting this might be a common baseline
• The spread of content views increases with higher trailer views, indicating more variability in content performance for videos with popular trailers
• The maximum content views (around 0.9) are associated with higher trailer views (in the 150-200 range)
Views Content vs Genre
Observation
• The median content views are relatively consistent across genres, typically between 0.4-0.5
• Romance and Comedy show slightly higher variability in their distributions
• The Sci-Fi and Thriller genres have several high-performing outliers (reaching 0.8+ views)
• The "Others" category shows the most compact distribution, suggesting more consistent but moderate performance
3. Ad Impressions
Ad Impressions vs Views Trailer
Observation
• There's a clear concentration of trailer views around the 50-view mark across all ad impression levels
• There's a scattered distribution of higher trailer views (75-200) that doesn't show a strong correlation with ad impressions
• Ad impressions range primarily from 1000 to 2400, with the densest cluster between 1000-1800
• The relationship doesn't appear to be strongly linear: increasing ad impressions doesn't necessarily lead to proportionally higher trailer views
Ad Impressions vs Views Content
Observation
• Content views are fairly dispersed between 0.3 and 0.8, regardless of ad impression count
• There's no clear correlation between ad impressions and content views
• The highest content views (0.8-0.9) appear across various ad impression levels
• The density of points is highest in the 0.4-0.6 content views range
KEY QUESTIONS
1. What does the distribution of the content views look like?
The distribution is approximately normal with a slight right skew, and the mean values lie between 0.4 and 0.5 million.
2. What does the distribution of genres look like?
The "Others" category has the highest content release, followed by the Comedy and Thriller genres.
3. The day of the week on which content is released generally plays a key role in the viewership. How does the viewership vary with the day of release?
Wednesday and Saturday have the highest content release. Friday has the least viewership, while the other days show similar average views on release.
4. How does the viewership vary with the season of release?
Season has little impact on the viewership and release of content. Summer has the highest viewership, followed by Winter.
5. What is the correlation between trailer views and content views?
There is a positive relationship between trailer views and content views: as trailer views increase, content views increase correspondingly.
DATA PROCESSING
1. Duplicates and Missing values
There are no duplicate or missing values.
2. Feature Engineering
1. Data Copy:
Making a copy of the input data to keep the original dataset unchanged.
2. Categorical Encoding:
Converting the genre column into multiple binary columns (one-hot encoding)
and turning the dayofweek and season columns into numeric codes. This
makes all categorical data usable for modeling.
3. Cyclical Feature Creation:
To capture repeating patterns, transform the numeric day-of-week and season
values using sine and cosine functions, allowing the model to recognize cycles in
the data.
4. Numerical Transformations:
Applying a logarithmic transformation to features like ad impressions, content
views, and trailer views. This reduces the effect of extreme values and helps
normalize their distributions.
5. Interaction Features:
Creating new features by multiplying content views with both the day-of-week
code and the genre code. These capture how the effect of content views might
change depending on the day or genre.
6. Derived Features:
Calculate engagement per visitor (content views divided by visitors) and the
difference between content views and trailer views, providing more insight into
user behavior.
7. Merging Encoded Features:
The one-hot encoded genre columns are added back to the main dataset,
ensuring all relevant information is included.
8. Feature Scaling:
Standardize all important numerical features so they have a mean of zero and a
standard deviation of one, which helps many machine learning models perform
better.
9. Output:
The function returns the processed dataset, now containing a rich set of
engineered features that are ready for use in predictive modeling.
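A condensed sketch of several of the steps above (categorical encoding, cyclical features, log transforms, the trailer-to-content feature, and merging the one-hot columns back). The data values are synthetic stand-ins; the column names follow the report's processed-feature list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "genre": ["Horror", "Thriller", "Sci-Fi"],
    "dayofweek": ["Wednesday", "Friday", "Sunday"],
    "ad_impressions": [1200.0, 1500.0, 1800.0],
    "views_trailer": [56.0, 52.0, 49.0],
    "views_content": [0.51, 0.32, 0.39],
})
fe = df.copy()  # keep the original dataset unchanged

# Categorical encoding: numeric day codes plus one-hot genre columns
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
fe["dayofweek_num"] = fe["dayofweek"].map({d: i for i, d in enumerate(days)})
dummies = pd.get_dummies(fe["genre"], prefix="genre")

# Cyclical features so Sunday (6) sits next to Monday (0) in feature space
fe["dayofweek_sin"] = np.sin(2 * np.pi * fe["dayofweek_num"] / 7)
fe["dayofweek_cos"] = np.cos(2 * np.pi * fe["dayofweek_num"] / 7)

# Log transforms to damp the effect of extreme values
for col in ["ad_impressions", "views_trailer", "views_content"]:
    fe[f"log_{col}"] = np.log1p(fe[col])

# Derived feature: difference between content and trailer interest
fe["conversion_trailer_to_content"] = fe["views_content"] - fe["views_trailer"]

# Merge the one-hot encoded genre columns back into the main dataset
fe = pd.concat([fe, dummies], axis=1)
print(fe.columns.tolist())
```

The season features, interaction terms, engagement-per-visitor feature, and standard scaling follow the same pattern and are omitted here for brevity.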
Processed features:
['visitors', 'ad_impressions', 'major_sports_event', 'genre', 'dayofweek',
'season', 'views_trailer', 'views_content', 'dayofweek_num', 'season_num',
'dayofweek_sin', 'dayofweek_cos', 'season_sin', 'season_cos',
'log_ad_impressions', 'log_views_content', 'log_views_trailer',
'interaction_views_day', 'interaction_views_genre', 'engagement_per_visitor',
'conversion_trailer_to_content', 'genre_Comedy', 'genre_Drama', 'genre_Horror',
'genre_Others', 'genre_Romance', 'genre_Sci-Fi', 'genre_Thriller']
Sample of processed data:
visitors ad_impressions major_sports_event genre dayofweek season \
0 -0.147893 -1.108892 0 Horror Wednesday Spring
1 -1.053623 0.220110 1 Thriller Friday Fall
2 -1.010493 -1.228523 1 Thriller Wednesday Fall
3 0.628448 -0.317711 1 Sci-Fi Friday Fall
4 -1.053623 0.220110 0 Sci-Fi Sunday Winter
views_trailer views_content dayofweek_num season_num ... \
0 -0.292011 0.345735 6 1 ...
1 -0.406636 -1.449064 0 0 ...
2 -0.519546 -0.787822 6 0 ...
3 -0.488961 -0.315507 0 0 ...
4 -0.316880 -0.126581 3 3 ...
interaction_views_genre engagement_per_visitor \
0 -0.146168 0.370693
1 0.489149 -0.929328
2 0.927607 -0.233740
3 0.847074 -0.647948
4 0.954451 0.516651
conversion_trailer_to_content genre_Comedy genre_Drama genre_Horror \
0 0.293727 False False True
1 0.403170 False False False
2 0.518344 False False False
3 0.489121 False False False
4 0.317220 False False False
genre_Others genre_Romance genre_Sci-Fi genre_Thriller
0 False False False False
1 False False False True
2 False False False True
3 False False True False
4 False False True False
[5 rows x 28 columns]
Notable Insights:
• The data suggests different viewing behaviors across different days and seasons
• There appear to be varying levels of engagement between trailers and content
• The conversion rates from trailer to content views vary significantly
• The genre distribution shows a mix of different content types
3. Outlier Detection and Treatment
We have carried out some outlier treatment earlier and further
treatment is unnecessary at this point.
4. Data Preparation for Modelling
• The target variable is views_content
• The categorical features are major_sports_event, genre, season, and dayofweek
Model Building
Model Performance:
Mean Squared Error: 0.0025
R^2 Score: 0.7743
Model Coefficients:
Feature Coefficient
0 visitors 0.128909
12 dayofweek_Saturday 0.052561
16 dayofweek_Wednesday 0.049532
11 dayofweek_Monday 0.045065
18 season_Summer 0.044605
13 dayofweek_Sunday 0.038818
15 dayofweek_Tuesday 0.032412
19 season_Winter 0.026532
17 season_Spring 0.023201
14 dayofweek_Thursday 0.019637
10 genre_Thriller 0.011518
5 genre_Drama 0.010636
9 genre_Sci-Fi 0.010008
6 genre_Horror 0.009434
7 genre_Others 0.004984
4 genre_Comedy 0.004389
3 views_trailer 0.002311
1 ad_impressions 0.000008
8 genre_Romance -0.001385
2 major_sports_event -0.059559
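Metrics of this kind come from fitting an ordinary least-squares model and scoring held-out data; a scikit-learn sketch on synthetic two-feature data (the report itself fits on the full set of processed ShowTime features, so the numbers differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(7)
n = 1000
visitors = rng.normal(1.7, 0.15, n)
trailer = rng.normal(56, 10, n)
X = np.column_stack([visitors, trailer])
# Synthetic target with a known linear signal plus noise
y = 0.13 * visitors + 0.005 * trailer + rng.normal(0, 0.05, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
print("Coefficients:", model.coef_)
```

The fitted `model.coef_` values play the role of the coefficient table above: each entry is the weight the model assigns to one feature.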
Mean Squared Error (MSE): 0.0025
Interpretation: This indicates the average squared difference between predicted and
actual values. A lower MSE reflects better model performance.
Comment: While the MSE is low, its meaningfulness depends on the scale of the target
variable (views_content). Since views_content takes small values (roughly between 0.2
and 0.9 million), this MSE is acceptable.
R² Score: 0.7743
Interpretation:
The R² score explains how much of the variance in the target variable is
explained by the model. Here, approximately 77.43% of the variation in
views_content is explained by the independent variables.
Comment: An R² of 0.77 is generally considered a good fit for linear regression.
However, it also suggests there is still 22.57% of unexplained variance, potentially
due to missing features, non-linear relationships, or noise in the data.
Coefficients
The coefficients indicate the contribution (weight) of each independent variable to the
prediction of the target (views_content). Key observations:
i. Positive Impact: visitors has the highest positive coefficient (0.1289), suggesting it is
the most significant predictor of views_content. Several dayofweek and season variables
(e.g., dayofweek_Saturday, season_Summer) also positively impact the target.
ii. Negative Impact: major_sports_event has the most significant negative impact
(-0.0596), indicating that the presence of major sports events might reduce
views_content. genre_Romance has a slight negative impact, potentially due to lower
engagement for this genre compared to others.
iii. Minimal Impact: Variables like ad_impressions and views_trailer have coefficients
close to zero, suggesting they might have negligible influence on views_content. These
might need further exploration or even exclusion.
Testing Assumptions Of Linear Regression Model
Shapiro-Wilk Test: Test Statistic = 0.9979, p-value = 0.4379
Residuals appear to be normally distributed.
Variance Inflation Factor (VIF):
Feature VIF
0 visitors 26.065912
1 ad_impressions 19.731307
2 major_sports_event 1.742907
3 views_trailer 4.511708
4 genre_Comedy 1.939941
5 genre_Drama 2.068384
6 genre_Horror 2.037733
7 genre_Others 3.261380
8 genre_Romance 2.038509
9 genre_Sci-Fi 1.937147
10 genre_Thriller 2.010787
11 dayofweek_Monday 1.076586
12 dayofweek_Saturday 1.228074
13 dayofweek_Sunday 1.202715
14 dayofweek_Thursday 1.281478
15 dayofweek_Tuesday 1.084141
16 dayofweek_Wednesday 1.897971
17 season_Spring 2.011645
18 season_Summer 2.001957
19 season_Winter 2.088696
Observations
1. Linearity (Residuals vs Fitted Plot):
The residuals plot shows points scattered around the horizontal zero line (red dashed
line), with no clear pattern in the residuals. This suggests the linearity assumption is
met.
2. Normality of Residuals (Q-Q Plot and Shapiro-Wilk Test):
The Q-Q plot shows points following the diagonal red line quite closely. The Shapiro-Wilk
test gives statistic = 0.9979, p-value = 0.4379. Since the p-value > 0.05, we fail to reject
the null hypothesis that the residuals are normally distributed. Both the visual and
statistical tests strongly support the normality assumption.
3. Multicollinearity (VIF values):
Most variables show acceptable VIF values (< 5), except visitors (VIF = 26.07) and
ad_impressions (VIF = 19.73), which show very high multicollinearity. This is a concern,
as it could affect coefficient stability and interpretation. Consider removing one of
these variables or combining them into a single metric.
4. Homoscedasticity (Residuals vs Fitted Plot):
The spread of residuals appears consistent across fitted values, with no clear funnel
or fan shape. This suggests the homoscedasticity assumption is met.
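A VIF table like the one above is often produced with `statsmodels`' `variance_inflation_factor`; the underlying calculation can also be sketched from scratch, regressing each feature on all the others, since VIF_j = 1 / (1 − R²_j). Synthetic data below, with one deliberately collinear pair:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on all the remaining columns."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.2, size=500)  # strongly collinear with a
c = rng.normal(size=500)                 # independent of both
X = np.column_stack([a, b, c])
print([round(v, 2) for v in vif(X)])     # first two VIFs are large, third is near 1
```

This mirrors the pattern in the report's table: the collinear pair (analogous to visitors and ad_impressions there) blows past the usual VIF < 5 threshold while the independent feature stays near 1.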
Model Performance Evaluation Based On Different Metrics
Model Performance Metrics:
Train Mean Absolute Error (MAE): 0.0389
Test Mean Absolute Error (MAE): 0.0399
Train Mean Squared Error (MSE): 0.0024
Test Mean Squared Error (MSE): 0.0025
Train Root Mean Squared Error (RMSE): 0.0489
Test Root Mean Squared Error (RMSE): 0.0500
Train R^2 Score: 0.7868
Test R^2 Score: 0.7743
Adjusted R^2 for Test Set: 0.7661
Observations
1. Mean Absolute Error (MAE) Train MAE: 0.0389 Test MAE: 0.0399 The MAE
values for both train and test sets are very close, indicating that the model
generalizes well and is not overfitting or underfitting. The average prediction
error is approximately 0.04, meaning the model predicts values with minimal
error.
2. Mean Squared Error (MSE) Train MSE: 0.0024 Test MSE: 0.0025 The MSE
values for train and test sets are also very close, further confirming that the
model performs similarly on both sets. The small MSE indicates that the squared
errors (and thus large prediction deviations) are minimal.
3. Root Mean Squared Error (RMSE) Train RMSE: 0.0489 Test RMSE: 0.0500
RMSE gives a sense of the average magnitude of the prediction error in the same
scale as the target variable. A small and comparable RMSE for train and test sets
suggests a good fit and no significant overfitting.
4. R-squared: Train R²: 0.7868, Test R²: 0.7743. R² indicates how well the
model explains the variability of the dependent variable. Approximately 78.68%
of the variance in the training data and 77.43% of the variance in the test data
are explained by the model. The slight drop in R² from train to test is expected
and acceptable, indicating no major overfitting.
5. Adjusted R-squared: Adjusted R² for the test set is 0.7661. Adjusted R² accounts
for the number of predictors in the model and penalizes unnecessary predictors.
The small difference between R² and adjusted R² on the test set (0.7743 vs.
0.7661) suggests that the predictors in the model are relevant and contribute
meaningfully to explaining the target variable.
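Adjusted R² is derived from R² by penalizing the predictor count via Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). A sketch; the sample size n and predictor count p below are illustrative, not the report's actual values, so the result differs from the 0.7661 reported:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of observations and p the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative: 200 test rows, 20 predictors
print(round(adjusted_r2(0.7743, 200, 20), 4))  # → 0.7491
```

The penalty grows with p relative to n, which is why a small gap between R² and adjusted R² signals that the predictors are pulling their weight.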
Actionable Insights & Business Recommendation
Overall Assessment
The model performs well on both the train and test datasets, with comparable errors
and variance explained. The small difference in performance metrics between train and
test sets indicates that the model generalizes well and is not overfitting. The high R²
and adjusted R² values suggest that the model explains a significant proportion of
the variance in the target variable.
1. Visitors (Average visitors in the past week): High significance: The number of
visitors to the platform in the past week is a strong predictor of first-day
viewership. A higher visitor count indicates a larger engaged audience, directly
translating to increased content viewership. Business Insight: Boosting platform
traffic (e.g., through promotions, partnerships, or improved user experience) in
the days leading up to content releases can significantly improve first-day
viewership.
2. Ad Impressions (Number of ad impressions): Low significance: the near-zero
model coefficient and the weak correlations observed earlier suggest that ad
impressions, as currently deployed, do little to drive first-day viewership.
Business Insight: Rather than simply increasing impression volume, re-evaluate
ad targeting and creative quality, and optimize ad spend to maximize returns.
3. Major Sports Event: Negative correlation: The presence of a major sports event
reduces viewership, likely due to competition for audience attention. Business
Insight: Avoid scheduling major content releases on days when significant sports
events are happening. Use such periods for less critical content or alternative
marketing strategies.
4. Genre:
Significant variation: Certain genres perform better than others (e.g., action, drama,
comedy). This indicates audience preferences for specific types of content. Business
Insight: Tailor content production and acquisition strategies toward high-performing
genres. For lower-performing genres, consider niche marketing to target specific
audience segments.
5. Day of the Week:
Significant impact: Content released on weekends and holidays typically garners higher
first-day viewership, likely due to increased free time and leisure activities. Business
Insight: Strategically schedule major content releases on weekends and public holidays
to maximize viewership.
6. Season of Release:
Moderate significance: Seasonal factors (e.g., holiday seasons or summer vacations)
affect viewership trends. Content released during popular seasons tends to perform
better. Business Insight: Capitalize on seasonal trends by aligning content releases with
periods of high platform activity.
7. Trailer Views (Number of trailer views):
Highly significant: Trailer views are directly proportional to first-day viewership,
indicating that effective trailers create anticipation and engagement. Business Insight:
Invest in creating high-quality, engaging trailers to build excitement for upcoming
content. Promote trailers across multiple platforms to maximize visibility.
Key Takeaways For Business
1. Boost Platform Traffic:
Increase platform visitors by using strategic campaigns, offering free trials, discounts, or
limited-time access to premium content to attract more users.
2. Optimize Marketing Spend:
Focus ad budgets on campaigns for genres and content types that show high potential
for viewership. Target platforms and times that align with audience habits to maximize
ad impressions.
3. Strategic Scheduling:
Avoid content clashes with major sports events or other high-viewership activities.
Release high-priority content on weekends, holidays, or during seasons with higher
engagement.
4. Focus on High-Performing Genres:
Double down on producing or acquiring content in genres that consistently perform
well. For less popular genres, identify niche audiences and personalize marketing
efforts.
5. Enhance Trailer Effectiveness:
Trailers are critical for driving excitement and engagement. Ensure trailers are
compelling, well-produced, and promoted on both traditional and social media
channels.
6. Seasonal Campaign Planning:
Align major content releases with seasonal periods of higher activity (e.g., holiday
seasons or summer vacations) to maximize reach and viewership.
7. Competitive Landscape:
Monitor competitor activity and schedule content releases strategically to avoid head-
to-head clashes with major sports events, blockbuster releases, or other high-profile
launches.