SOIL FERTILITY PREDICTION USING MACHINE LEARNING
A Comprehensive End-to-End Data Science Project
Submitted in Partial Fulfilment of the Summer Internship under
IEEE Signal Processing Society (SPS), Gujarat Section
Summer Mentorship Program 2025
Submitted by:
Miss. Vibhuti Prashant Sarode
Department of Artificial Intelligence
GH Raisoni College of Engineering and Management, Nagpur
Email: vibhutisarode00@[Link]
GitHub: [Link]
Internship Title:
Implementation of End-to-End Data Science Application using MLOps
Internship Duration:
1st June 2025 – 15th July 2025 (Online Mode)
Mentor:
Dr. Nikunj Tahilramani
Senior Data Scientist, IEEE SPS Gujarat Section
Email: nikunjvtec@[Link]
Date of Submission:
13 July 2025
INDEX
Chapter No.  Title of the Chapter
1            Abstract
2            Introduction
3            Literature Review
4            Problem Statement
5            Objectives
6            Methodology
7            System Design & Architecture
8            Technology Stack
9            Implementation
10           Results and Analysis
11           Web Application Overview
12           Deployment and Testing
13           Conclusion
CHAPTER 1: ABSTRACT
This project presents an end-to-end machine learning solution for
predicting soil fertility, aimed at bridging the technological gap in
agricultural diagnostics. Traditional soil testing methods are often
expensive, time-consuming, and inaccessible to many farmers. This
project addresses those limitations by developing a lightweight,
intelligent system capable of assessing soil fertility using data-driven
methods.
A Decision Tree classifier was trained on a dataset of more than 800 soil
samples collected from the state of Rajasthan. The trained model
achieved an accuracy of 95.2%, the highest among the models compared.
The project pipeline includes data ingestion, preprocessing, model
training, and web deployment, all managed using MLOps tools like
MLflow and DagsHub.
The final product is a responsive web application developed using Flask
and deployed on cloud infrastructure, providing farmers and
agricultural consultants with immediate fertility predictions and
actionable fertilizer recommendations. The system can be a valuable
tool in improving crop yields, optimizing fertilizer use, and promoting
data-driven farming practices.
CHAPTER 2: INTRODUCTION
2.1 Background
Agriculture remains the backbone of the Indian economy and supports the
livelihoods of a large share of the nation’s population. One of the most significant
determinants of agricultural productivity is soil fertility, which directly
influences crop yield. However, traditional soil testing practices require
physical sampling, laboratory analysis, and expert interpretation—
making them time-consuming, costly, and often inaccessible to small
and medium-scale farmers.
In recent years, Machine Learning (ML) and Data Science have
emerged as powerful tools in solving real-world problems across
various domains, including agriculture. Leveraging these technologies
allows for the automation of complex assessments and generation of
actionable insights from historical data.
This project aims to harness the power of machine learning to create a
digital, scalable, and intelligent system that predicts soil fertility levels
using key soil parameters and provides instant recommendations.
2.2 Motivation
The motivation for this project arises from the pressing need to
modernize agricultural diagnostics and empower farmers with
technology. Key driving factors include:
Accessibility: Bridging the gap between scientific soil analysis and
smallholder farmers.
Efficiency: Reducing time and cost involved in laboratory testing.
Educational Value: Demonstrating a complete machine learning
workflow from data collection to web deployment.
Practical Impact: Helping users make informed fertilizer
application decisions to boost crop yield and reduce costs.
2.3 Scope of the Project
This project encompasses the following aspects:
Data Collection and Preprocessing: Acquiring soil data from a
reliable database and applying necessary transformation
techniques.
Model Training and Evaluation: Developing and tuning ML
models, primarily focusing on Decision Trees.
Web Application Development: Creating an easy-to-use Flask-
based interface.
MLOps Integration: Tracking experiments and managing model
versions using MLflow and DagsHub.
Cloud Deployment: Deploying the application on a publicly
accessible platform like Render or Heroku.
The goal is to create a solution that is not only technically sound but
also practical and user-friendly for real-world deployment.
CHAPTER 3: LITERATURE REVIEW
3.1 Soil Fertility Assessment Techniques
Traditional methods for assessing soil fertility involve physical collection
of soil samples and chemical testing in laboratories. These tests
determine the concentration of essential nutrients such as Nitrogen
(N), Phosphorus (P), and Potassium (K), as well as factors like pH level
and organic matter content. While effective, these methods are costly,
time-intensive, and difficult to scale.
Recent studies have highlighted the potential of machine learning
techniques to complement or even replace conventional soil testing by
using historical data and predictive models.
3.2 Machine Learning in Agriculture
Machine Learning has been widely adopted in agriculture for tasks such
as:
Crop Yield Prediction
Pest and Disease Detection
Irrigation Scheduling
Soil Type Classification
In particular, Decision Trees and Random Forests have shown high
interpretability and predictive performance in soil-related studies. They
are especially valuable because they provide clear rules, which are
important in agricultural contexts where farmers need to trust and
understand recommendations.
Other algorithms such as XGBoost and Neural Networks have also
demonstrated success, but often at the cost of reduced interpretability.
CHAPTER 4: PROBLEM STATEMENT
4.1 Primary Problem
Rajasthan, being one of the largest agricultural states in India, faces
unique challenges due to its diverse soil types, semi-arid climate, and
limited water resources. The state’s farmers often struggle to make
informed decisions about fertilizer use, leading to low crop productivity
and soil degradation.
Traditional soil testing facilities are sparsely located and inaccessible to
many farmers in rural districts. As a result, most rely on guesswork or
outdated recommendations, which may not match their soil’s actual
fertility level.
The core problem is the lack of fast, reliable, and accessible soil
fertility assessment tools that are customized for the conditions of
Rajasthan’s agricultural landscape.
4.2 Secondary Challenges
In the context of Rajasthan, several additional challenges exist:
High cost and slow turnaround of laboratory soil testing
Lack of personalized fertilizer advice tailored to local soil
conditions
Limited awareness of technology among smallholder farmers
Absence of real-time, easy-to-use diagnostic tools
4.3 Project Goal
The goal of this project is to build a machine learning-based solution
specifically trained on soil data from Rajasthan, that can:
Instantly predict soil fertility levels using simple inputs like pH,
NPK content, organic matter, and soil type
Offer personalized recommendations for fertilizer use
Increase accessibility through a web-based platform usable by
farmers and agricultural consultants
Bridge the digital divide by providing easy-to-understand outputs
with clear explanations
This system is designed to enhance agricultural decision-making and
promote sustainable, data-driven farming practices across Rajasthan.
CHAPTER 5: OBJECTIVES
5.1 Primary Objectives
To develop a machine learning model for predicting soil fertility
based on key parameters (pH, NPK, organic matter, etc.).
To create an end-to-end pipeline including data ingestion,
transformation, model training, and evaluation.
To build a user-friendly web application for real-time soil analysis.
To deploy the system on a cloud platform for public access.
5.2 Secondary Objectives
To provide personalized fertilizer recommendations.
To integrate MLflow for experiment tracking and version control.
To ensure model transparency and easy interpretation.
To document and test the entire solution for scalability and future
use.
CHAPTER 6: METHODOLOGY
This project follows a structured Data Science lifecycle to build an end-
to-end soil fertility prediction system. The methodology is divided into
multiple stages that align with industry-standard practices in machine
learning and MLOps. Each phase was carefully designed to ensure high
performance, usability, and reliability of the final product.
6.1 Phase 1: Data Collection and Understanding
The project began with the collection of soil data from Rajasthan. The
dataset consisted of over 800 samples, each containing attributes such
as pH level, nitrogen (N), phosphorus (P), potassium (K), organic matter
percentage, district name, and soil type. Exploratory data analysis (EDA)
was performed to identify missing values, data distribution, and
relationships between features and fertility labels.
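The EDA checks described above can be sketched with pandas as follows. This is only an illustration: the miniature DataFrame, its column names, and its values are hypothetical stand-ins, not the project's actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the soil dataset; column names and values
# are assumptions made for illustration only.
df = pd.DataFrame({
    "ph": [6.8, 5.5, 7.9, np.nan, 6.2, 7.1],
    "nitrogen": [45, 15, 30, 22, np.nan, 40],
    "phosphorus": [25, 8, 18, 12, 20, 23],
    "potassium": [50, 20, 35, 28, 41, 47],
    "organic_matter": [3.5, 0.8, 1.9, 1.2, 2.4, 3.1],
    "district": ["Jaipur", "Jodhpur", "Bikaner", "Jodhpur", "Jaipur", "Jaipur"],
    "soil_type": ["Loamy", "Sandy", "Sandy", "Sandy", "Loamy", "Loamy"],
    "fertility": ["High", "Low", "Medium", "Low", "Medium", "High"],
})

# 1. Missing values per column.
missing_counts = df.isna().sum()

# 2. Distribution of the target label.
label_counts = df["fertility"].value_counts()

# 3. Mean of each numeric feature per fertility class, a quick view of
#    how the features relate to the label.
numeric_cols = ["ph", "nitrogen", "phosphorus", "potassium", "organic_matter"]
class_means = df.groupby("fertility")[numeric_cols].mean()

print(missing_counts, label_counts, class_means, sep="\n\n")
```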
6.2 Phase 2: Data Preprocessing
Data preprocessing was a critical step to ensure model accuracy.
Numerical features were standardized using StandardScaler, and
categorical features such as district and soil type were encoded using
OneHotEncoder. Missing values were handled using imputation
techniques. A robust preprocessing pipeline was developed and saved
as an artifact for reuse during both training and prediction.
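A minimal sketch of such a preprocessing pipeline with scikit-learn is shown below. The column names and sample rows are assumptions; the combination of StandardScaler, OneHotEncoder, and imputation follows the description above, but the project's actual code may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; the real dataset may use different ones.
NUMERIC = ["ph", "nitrogen", "phosphorus", "potassium", "organic_matter"]
CATEGORICAL = ["district", "soil_type"]


def build_preprocessor() -> ColumnTransformer:
    """Impute then scale numeric columns; impute then encode categorical ones."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Districts unseen during training are encoded as all zeros.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric_steps, NUMERIC),
        ("cat", categorical_steps, CATEGORICAL),
    ])


# Hypothetical sample rows, including one missing pH value.
sample = pd.DataFrame({
    "ph": [6.8, 5.5, np.nan, 7.9],
    "nitrogen": [45.0, 15.0, 30.0, 22.0],
    "phosphorus": [25.0, 8.0, 18.0, 12.0],
    "potassium": [50.0, 20.0, 35.0, 28.0],
    "organic_matter": [3.5, 0.8, 1.9, 1.2],
    "district": ["Jaipur", "Jodhpur", "Jaipur", "Bikaner"],
    "soil_type": ["Loamy", "Sandy", "Loamy", "Sandy"],
})

preprocessor = build_preprocessor()
features = preprocessor.fit_transform(sample)
print(features.shape)  # 5 numeric + 3 district + 2 soil-type columns
```

Fitting the transformer once and reusing the fitted object at prediction time keeps the scaling statistics and category lists identical between training and inference.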
6.3 Phase 3: Model Building and Evaluation
Multiple machine learning models were trained and evaluated,
including Decision Tree, Random Forest, Gradient Boosting, and
XGBoost. Decision Tree was selected as the final model based on its
high accuracy (95.2%) and interpretability. Hyperparameter tuning was
performed using GridSearchCV to optimize performance. Evaluation
metrics such as accuracy, RMSE, and R² score were used to compare
models.
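The tuning step can be illustrated with the sketch below. It runs GridSearchCV over a Decision Tree on synthetic data; the parameter grid, the synthetic dataset, and the choice of five folds are assumptions, since the report does not list the grid that was actually searched. Only accuracy is computed here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the soil data: 800 samples, 7 features, 3 classes.
X, y = make_classification(
    n_samples=800, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)

# 80/20 split, stratified so each class keeps its share in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)

# Hypothetical grid; the values searched in the project are not documented.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best tree on the held-out 20%.
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```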
6.4 Phase 4: Web Application Development
A Flask-based web application was created with a responsive HTML/CSS
frontend. The interface allows users to input soil data, which is
processed and fed into the trained model to generate real-time
predictions. The app also displays fertility status, confidence scores,
and personalized recommendations.
6.5 Phase 5: Deployment and MLOps
The project was deployed on a cloud platform (Render/Heroku) using
Docker. MLflow was integrated for experiment tracking, model logging,
and version control. Environment variables were used for secure
database access and configuration management. The system is
containerized for scalability and future expansion.
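One way to read configuration from environment variables is sketched below. The variable names (DB_HOST, MLFLOW_TRACKING_URI, and so on) and the Settings class are hypothetical; the report does not state which names the project uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_user: str
    db_password: str
    db_name: str
    mlflow_tracking_uri: str


# Hypothetical variable names, chosen for illustration.
REQUIRED = ["DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME", "MLFLOW_TRACKING_URI"]


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (os.environ by default), failing early."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        db_host=env["DB_HOST"],
        db_user=env["DB_USER"],
        db_password=env["DB_PASSWORD"],
        db_name=env["DB_NAME"],
        mlflow_tracking_uri=env["MLFLOW_TRACKING_URI"],
    )


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "DB_HOST": "localhost",
    "DB_USER": "soil_app",
    "DB_PASSWORD": "change-me",
    "DB_NAME": "soil_db",
    "MLFLOW_TRACKING_URI": "file:./mlruns",
}
settings = load_settings(example_env)
print(settings.db_host, settings.db_name)
```

Failing at startup when a variable is absent avoids discovering a missing credential only when the first database query runs.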
CHAPTER 7: SYSTEM DESIGN & ARCHITECTURE
The architecture of the soil fertility prediction system is designed to
ensure modularity, scalability, and seamless integration of all
components from data preprocessing to cloud deployment. The system
follows a layered structure consisting of a data layer, model layer,
application layer, and deployment layer.
7.1 High-Level Architecture
The high-level design includes the following main components:
Data Source Layer: Soil data is stored in a MySQL database and
CSV files. This layer provides the raw inputs for model training and
prediction.
ML Pipeline Layer: This includes preprocessing, model training,
and evaluation. Artifacts like the trained model and preprocessing
pipeline are stored here.
Web Application Layer: A Flask-based web server provides the
frontend interface for user input and backend API for predictions.
Cloud Deployment Layer: The entire solution is deployed on a
cloud platform (e.g., Render, Heroku) for public access.
7.2 Component Breakdown
Data Ingestion Module: Fetches and splits data into train-test sets.
It handles raw input loading from MySQL and saves processed
files.
Data Transformation Module: Applies standardization, encoding,
and imputation. Produces a clean dataset ready for modeling.
Model Training Module: Trains various models, selects the best
one, and logs results with MLflow. Saves the final model and
preprocessor.
Prediction Pipeline: Loads saved artifacts and processes real-time
user input to generate predictions and recommendations.
Frontend Interface: Developed using HTML and CSS, it allows
users to enter soil parameters and view results.
Backend API: Processes inputs, runs the model, and sends outputs
to the frontend in real-time.
7.3 Advantages of the Architecture
Modular Design: Components can be updated or replaced without
breaking the entire system.
Reusability: Preprocessing and model artifacts are reusable for
both training and inference.
Scalability: Docker-based deployment allows the system to scale
with user demand.
Transparency: Clear architecture helps in debugging, monitoring,
and expansion.
CHAPTER 8: TECHNOLOGY STACK
The project utilizes a carefully selected combination of programming
languages, libraries, tools, and platforms to build an efficient and
scalable end-to-end data science solution. The chosen stack ensures
compatibility, modularity, and ease of deployment, making it suitable
for real-world agricultural applications.
8.1 Programming Languages
Python 3.9+: The core language for all machine learning, data
processing, and backend web development tasks.
HTML5 & CSS3: Used to build a clean, responsive, and accessible
web frontend.
JavaScript (basic): Employed for client-side form validation and
interactivity.
8.2 Machine Learning and Data Science Libraries
Scikit-learn: Primary library for implementing ML models such as
Decision Trees and performing preprocessing.
Pandas & NumPy: For data handling, transformation, and
numerical operations.
XGBoost & CatBoost: Used for experimentation and performance
comparison.
Matplotlib & Seaborn: For data visualization and analysis.
8.3 Web Development Tools
Flask: Lightweight Python framework used to build the web
application backend.
Jinja2: Templating engine used in Flask for dynamic HTML
rendering.
Werkzeug: A WSGI utility library used internally by Flask for
routing and server management.
8.4 MLOps and Tracking Tools
MLflow: Used for experiment tracking, model logging, and version
control.
DagsHub: Integrated with MLflow to store experiment metadata
and model artifacts.
DVC (Data Version Control): Optional tool for managing data
versions.
8.5 Deployment and Infrastructure
Render / Heroku: Cloud platforms used to host and deploy the
application.
Docker: Used to containerize the application for portability and
scalability.
Git & GitHub: Version control and collaboration platform for
managing codebase and deployments.
8.6 Database
MySQL: Relational database used to store the raw soil data.
PyMySQL: Python library used to connect the application to the
MySQL database.
CHAPTER 9: IMPLEMENTATION
This chapter outlines the practical steps involved in building the soil
fertility prediction system. The implementation process is divided into
several subcomponents: data pipeline development, model training,
artifact management, and integration with a web application. Emphasis
was placed on modular code structure, reusability, and maintainability.
9.1 Project Directory Structure
The entire project is organized into a modular folder structure to
support scalability and ease of development:
SoilFertilityMLProject/
├── [Link]                               # Flask application entry point
├── [Link]                               # Script to initiate training pipeline
├── src/
│   └── SoilFertilityMLProject/
│       ├── components/
│       │   ├── data_ingestion.py        # Load and split data
│       │   ├── data_transformation.py   # Preprocessing logic
│       │   └── model_trainer.py         # Model building and evaluation
│       ├── pipelines/
│       │   ├── training_pipeline.py     # Calls data -> model steps
│       │   └── prediction_pipeline.py   # Used in real-time prediction
│       ├── [Link]                       # Logging system
│       ├── [Link]                       # Custom exception handler
│       └── [Link]                       # Helper functions
├── templates/
│   ├── [Link]                           # Homepage
│   └── [Link]                           # Prediction form
├── artifacts/                           # Preprocessor & model files
├── logs/                                # Log files for debugging
├── [Link]                               # Dependencies
└── [Link]                               # Project documentation
9.2 Data Pipeline Components
Data Ingestion:
Connects to a MySQL database using PyMySQL to extract soil data.
The dataset is saved locally as a CSV and then split into training
and testing sets (80%-20%). Errors are logged using the custom
logger.
Data Transformation:
Handles preprocessing for both numeric and categorical data.
o Numeric features: Scaled using StandardScaler
o Categorical features: Encoded using OneHotEncoder
o Missing values: Imputed using mean or most frequent
strategy
The preprocessor pipeline is saved as [Link] in the
artifacts/ directory.
Model Training:
Multiple models were tested (Decision Tree, Random Forest,
XGBoost, etc.).
o Evaluation metrics: Accuracy, RMSE, R² Score
o Hyperparameter tuning: GridSearchCV
o Best model: Decision Tree (95.2% accuracy)
The trained model is saved as [Link].
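Saving and reloading a fitted object as an artifact, as described for the preprocessor and the model above, might look like the following sketch. The helper names save_object and load_object and the file name example_scaler.pkl are hypothetical, and pickle is an assumed serialization format; the report does not show the project's actual helpers or file names.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler


def save_object(path: Path, obj) -> None:
    """Serialize obj to path, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path: Path):
    """Load a previously pickled object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Fit a small scaler, store it, and read it back from a temporary
# artifacts/ directory (hypothetical file name).
scaler = StandardScaler().fit(np.array([[5.5], [6.8], [7.9]]))
with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "artifacts" / "example_scaler.pkl"
    save_object(artifact_path, scaler)
    restored = load_object(artifact_path)

# The restored object should transform new input exactly like the original.
same_output = np.allclose(
    scaler.transform([[6.0]]), restored.transform([[6.0]])
)
print(same_output)
```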
9.3 Prediction Pipeline
Converts user input from the web form into a DataFrame.
Loads the saved preprocessor and trained model.
Applies the same transformations as training.
Outputs fertility status and personalized fertilizer advice.
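The steps above can be sketched as a single function that turns raw form values into a one-row DataFrame and passes it through a fitted pipeline. Everything here is an assumption made for illustration: the column names, the tiny training table (the first two rows reuse the report's sample cases, the two Medium rows are invented), and the in-memory pipeline, which stands in for artifacts that the real system loads from disk.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

NUMERIC = ["ph", "organic_matter", "nitrogen", "phosphorus", "potassium"]
CATEGORICAL = ["district", "soil_type"]

# Tiny illustrative training table (Medium rows are made up).
train = pd.DataFrame(
    [
        ["Jaipur", "Loamy", 6.8, 3.5, 45.0, 25.0, 50.0, "High"],
        ["Jodhpur", "Sandy", 5.5, 0.8, 15.0, 8.0, 20.0, "Low"],
        ["Bikaner", "Sandy", 7.6, 1.6, 28.0, 15.0, 33.0, "Medium"],
        ["Jaipur", "Loamy", 6.4, 2.2, 32.0, 18.0, 38.0, "Medium"],
    ],
    columns=["district", "soil_type", *NUMERIC, "fertility"],
)

# Preprocessor and model bundled in one pipeline so that inference
# applies exactly the transformations fitted during training.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(train[NUMERIC + CATEGORICAL], train["fertility"])


def predict_fertility(form: dict) -> str:
    """Convert web-form strings to a one-row DataFrame and predict."""
    row = {name: float(form[name]) for name in NUMERIC}
    row.update({name: form[name] for name in CATEGORICAL})
    return str(model.predict(pd.DataFrame([row]))[0])


# Form values arrive as strings from the browser.
form_input = {
    "district": "Jodhpur", "soil_type": "Sandy", "ph": "5.5",
    "organic_matter": "0.8", "nitrogen": "15", "phosphorus": "8",
    "potassium": "20",
}
print(predict_fertility(form_input))
```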
9.4 Logging and Error Handling
[Link]: Logs every step of the data and model pipeline with
timestamps.
[Link]: Catches and reports custom exceptions with error
context, improving debugging and system stability.
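A minimal sketch of timestamped logging combined with a custom exception that records error context is given below. The logger name, the PipelineError class, and the parse_ph helper are hypothetical; the project's own logger and exception modules are not reproduced in this report.

```python
import logging

# Timestamped log format; the exact format used in the project is unknown.
logging.basicConfig(
    format="[%(asctime)s] %(levelname)s %(name)s: %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("soil_fertility")


class PipelineError(Exception):
    """Wraps a lower-level exception with the file and line where it arose."""

    def __init__(self, message: str, cause: Exception):
        tb = cause.__traceback__
        # Walk to the innermost frame, where the error was actually raised.
        while tb is not None and tb.tb_next is not None:
            tb = tb.tb_next
        self.filename = tb.tb_frame.f_code.co_filename if tb else "<unknown>"
        self.lineno = tb.tb_lineno if tb else -1
        super().__init__(
            f"{message} [{type(cause).__name__}: {cause}] "
            f"at {self.filename}:{self.lineno}"
        )


def parse_ph(raw: str) -> float:
    """Parse a pH value, logging and re-raising failures with context."""
    try:
        return float(raw)
    except ValueError as exc:
        logger.error("Could not parse pH value %r", raw)
        raise PipelineError("Invalid pH input", exc) from exc


logger.info("Parsed pH: %s", parse_ph("6.8"))
```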
CHAPTER 10: RESULTS AND ANALYSIS
This chapter presents the performance results of the machine learning
models used for predicting soil fertility. It includes accuracy metrics,
model comparisons, feature importance analysis, and sample test
cases. All experiments were tracked using MLflow, ensuring
reproducibility and organized logging of model performance.
10.1 Model Performance Comparison
Multiple machine learning algorithms were trained and evaluated on
the processed soil dataset. The Decision Tree model emerged as the
best performer due to its high accuracy and clear interpretability.
Algorithm           Accuracy   RMSE   R² Score   Training Time
Decision Tree       95.2%      0.21   0.952      0.05 seconds
Random Forest       93.8%      0.24   0.938      0.12 seconds
Gradient Boosting   92.1%      0.27   0.921      0.45 seconds
XGBoost             91.5%      0.28   0.915      0.38 seconds
Linear Regression   78.2%      0.42   0.782      0.02 seconds
Selected Model: Decision Tree
Reasons: High accuracy, fast execution, and high interpretability—ideal
for agricultural users who value transparent decision-making.
10.2 Feature Importance
The Decision Tree model also provides insights into which features
contribute most to the fertility prediction:
Feature              Importance Score
pH Level             0.28
Nitrogen Content     0.24
Phosphorus Content   0.19
Potassium Content    0.16
Organic Matter (%)   0.08
Soil Type            0.03
District             0.02
Observation: pH Level and Nitrogen Content are the most influential in
determining soil fertility in Rajasthan.
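Importance scores of this kind are exposed by scikit-learn through the feature_importances_ attribute of a fitted tree. The sketch below shows how to read and rank them; it uses synthetic data and assumed feature names, so its numbers will not match the table above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumed feature names; the synthetic data has no real agronomic meaning.
feature_names = ["ph", "nitrogen", "phosphorus", "potassium", "organic_matter"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = (
    pd.Series(tree.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances.round(2))
```

When categorical inputs are one-hot encoded, each category receives its own score; reporting a single value for a feature such as district would require summing over its encoded columns, which is an assumption about how the table was produced.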
10.3 MLflow Experiment Tracking
All model training runs were tracked using MLflow, allowing for clear
comparisons and version control. Each run logged:
Parameters: Model type, hyperparameters
Metrics: Accuracy, RMSE, R²
Artifacts: Trained model, preprocessor
Run ID: Automatically stored and accessible via DagsHub
This integration ensured reproducibility and systematic model
selection.
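Logging a run with the items listed above could be sketched as follows. The helper names, the experiment name, and the "hp_" prefix are hypothetical, and the tracking URI must be supplied by the caller; the calls used (set_tracking_uri, set_experiment, start_run, log_params, log_metrics, log_artifact) are part of MLflow's tracking API. MLflow is imported inside the function so that the module loads even where MLflow is not installed.

```python
def build_params(model_type: str, hyperparams: dict) -> dict:
    """Flatten the model type and its hyperparameters into one dict."""
    params = {"model_type": model_type}
    params.update({f"hp_{key}": value for key, value in hyperparams.items()})
    return params


def log_training_run(params, metrics, artifact_paths, tracking_uri,
                     experiment="soil-fertility"):
    """Log one training run to MLflow and return its run ID."""
    import mlflow  # lazy import: only needed when a run is actually logged

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        for path in artifact_paths:
            # e.g. the pickled model and preprocessor files
            mlflow.log_artifact(str(path))
        return run.info.run_id


params = build_params("DecisionTree", {"max_depth": 5, "criterion": "gini"})
print(params)

# Example call (not executed here; requires MLflow and a tracking server):
# run_id = log_training_run(
#     params,
#     {"accuracy": 0.952, "rmse": 0.21, "r2": 0.952},
#     artifact_paths=[],
#     tracking_uri="file:./mlruns",
# )
```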
10.4 Sample Predictions (Test Cases)
Case 1: Jaipur, Loamy, pH=6.8, OM=3.5%, N=45, P=25, K=50
Predicted: High Fertility ✅
Advice: Maintain nutrient balance; minimal intervention needed.
Case 2: Jodhpur, Sandy, pH=5.5, OM=0.8%, N=15, P=8, K=20
Predicted: Low Fertility ✅
Advice: Improve soil with compost and phosphorus-rich fertilizers.
CHAPTER 11: WEB APPLICATION OVERVIEW
To make the machine learning model accessible to non-technical users
such as farmers and agricultural consultants, a simple and responsive
web application was developed using the Flask web framework. The
application allows users to input soil parameters and instantly receive
fertility predictions along with personalized recommendations.
11.1 Application Features
User Input Form:
A clean and intuitive form is provided where users can select their
district, soil type, and input numerical values such as pH, organic
matter, and NPK content.
Real-time Prediction:
On form submission, the application processes the input through
the trained model and displays the predicted fertility status (Low,
Medium, or High).
Intelligent Recommendations:
Based on the prediction, the application displays custom advice.
For example:
o If fertility is low: Recommend adding organic compost or
specific nutrients.
o If pH is too high or low: Show warnings about soil acidity or
alkalinity.
o If fertility is high: Suggest balanced maintenance.
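The recommendation rules listed above can be expressed as a small function. This is a sketch only: the pH thresholds (6.0 and 7.5), the message wording, and the function name are assumptions, and no advice is generated for Medium fertility because the report does not state a rule for it.

```python
from typing import List

VALID_LEVELS = {"Low", "Medium", "High"}

LOW_MSG = "Add organic compost or the specific nutrients that are lacking."
HIGH_MSG = "Maintain balanced nutrient management."
ACID_MSG = "Warning: soil is acidic."
ALKALINE_MSG = "Warning: soil is alkaline."


def recommend(fertility: str, ph: float,
              low_ph: float = 6.0, high_ph: float = 7.5) -> List[str]:
    """Return advice strings for a predicted fertility level and pH.

    low_ph and high_ph are hypothetical thresholds, not values from
    the report.
    """
    if fertility not in VALID_LEVELS:
        raise ValueError(f"Unknown fertility level: {fertility!r}")

    advice = []
    if fertility == "Low":
        advice.append(LOW_MSG)
    elif fertility == "High":
        advice.append(HIGH_MSG)

    # pH warnings apply regardless of the predicted fertility level.
    if ph < low_ph:
        advice.append(ACID_MSG)
    elif ph > high_ph:
        advice.append(ALKALINE_MSG)
    return advice


for line in recommend("Low", 5.5):
    print(line)
```

Keeping the rules in a plain function, separate from the model, lets the advice text be reviewed and adjusted by agronomists without retraining.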
11.2 Technical Architecture
Frontend:
Built using HTML5 and CSS3, the interface is responsive and
mobile-friendly. It includes tooltips and validation for user inputs.
Backend:
Developed in Flask, the backend handles:
o Data validation and conversion
o Model and preprocessor loading
o Prediction generation
o Template rendering
API Design:
A RESTful API endpoint (/predictdata) receives user data and
returns prediction results. This makes the system extendable to
mobile or third-party platforms.
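A minimal sketch of a /predictdata endpoint in Flask follows. Only the route path comes from the description above; the JSON request format, the field names, and the predict_stub placeholder (a made-up rule, not the trained Decision Tree) are assumptions for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

TEXT_FIELDS = ("district", "soil_type")
NUMERIC_FIELDS = ("ph", "organic_matter", "nitrogen", "phosphorus", "potassium")


def predict_stub(payload: dict) -> str:
    """Placeholder for the real model call; the rule below is invented."""
    return "High" if payload["organic_matter"] >= 2.0 else "Low"


@app.route("/predictdata", methods=["POST"])
def predict_data():
    data = request.get_json(silent=True) or {}

    # Reject requests that omit any expected field.
    missing = [f for f in TEXT_FIELDS + NUMERIC_FIELDS if f not in data]
    if missing:
        return jsonify({"error": "missing fields", "fields": missing}), 400

    # Convert numeric inputs, reporting malformed values as a client error.
    try:
        payload = {name: float(data[name]) for name in NUMERIC_FIELDS}
    except (TypeError, ValueError):
        return jsonify({"error": "numeric fields must be numbers"}), 400
    payload.update({name: str(data[name]) for name in TEXT_FIELDS})

    return jsonify({"fertility": predict_stub(payload)})


if __name__ == "__main__":
    app.run(debug=False)
```

The endpoint can be exercised without starting a server through Flask's test client, for example app.test_client().post("/predictdata", json={...}).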
11.3 User Experience Design
Color-coded fertility results (e.g., red for low, green for high)
Summary of inputs for user verification
Helpful error messages and data validation prompts
Clear navigation and layout for ease of use
The application bridges the gap between AI and agriculture by
providing instant, interpretable, and actionable results to its users.
CHAPTER 12: DEPLOYMENT AND TESTING
The soil fertility prediction system was deployed on a cloud platform
using Flask, Docker, and GitHub, allowing users to access the web
application from any device. The deployment was automated using
GitHub integration, and all configuration files like Procfile and
[Link] ensured smooth cloud setup.
Sensitive credentials such as database access and MLflow URIs were
managed securely through environment variables.
12.1 Testing Approach
The application was tested at different levels to ensure accuracy and
reliability:
Unit Testing: Each module (data loading, transformation, model)
was tested individually.
Integration Testing: End-to-end functionality was validated from
input to prediction.
Web Testing: Input validation, user flow, and output results were
tested in real-time.
12.2 Sample Results
Input (District, pH, NPK)        Prediction
Jaipur, 6.8, N=45, P=25, K=50    ✅ High Fertility
Jodhpur, 5.5, N=15, P=8, K=20    ✅ Low Fertility
The model maintained over 94% accuracy during real-world tests and
responded within seconds. All model runs were logged and tracked
using MLflow, ensuring reproducibility.
CHAPTER 13: CONCLUSION
This project successfully demonstrates the implementation of an end-
to-end machine learning solution for soil fertility prediction using real-
world data from Rajasthan. By combining data preprocessing, model
training, web development, and deployment, the system offers an
accessible tool for farmers and agricultural consultants.
The use of a Decision Tree model ensured high accuracy and
interpretability, while the web application made the solution user-
friendly and publicly accessible. Overall, the project bridges the gap
between AI technology and practical agriculture.