IR Mini Project-Doc Summarization System

The document outlines a mini-project aimed at developing an Extractive Text Summarization System using Natural Language Processing (NLP) techniques. The system focuses on automatically generating concise summaries of lengthy documents by selecting key sentences based on word frequency analysis. It includes a Python implementation that demonstrates the summarization logic, benefiting users by improving reading efficiency and comprehension.

Uploaded by

ayushagc.44

Modern Education Society’s

Wadia College of Engineering, Pune

NAME OF STUDENT: CLASS:

SEMESTER/YEAR: ROLL NO:

DATE OF PERFORMANCE: DATE OF SUBMISSION:

EXAMINED BY: EXPERIMENT NO: Mini-Project

ASSIGNMENT - Mini Project

AIM: Mini Project Implementation

TITLE: Develop Document Summarization System

OBJECTIVES:
The primary objective of this project is to develop an Extractive Text Summarization System
using Natural Language Processing (NLP) techniques to automatically generate concise and
coherent summaries of lengthy documents.

THEORY:
1. Introduction
In the modern digital world, a tremendous amount of textual information is generated daily in the
form of news articles, research papers, blogs, and social media posts. It becomes difficult for
users to manually read and extract important points from lengthy documents.
Text Summarization is the process of automatically generating a concise and coherent summary
that contains the most important information from the original document.
There are two main types of summarization:
• Extractive Summarization: Selects key sentences from the text.
• Abstractive Summarization: Generates new sentences to represent the main ideas.
This project focuses on an Extractive Text Summarization System using Word Frequency
Ranking in Natural Language Processing (NLP).
2. Objectives

The project’s core and specific objectives are as follows:

• To design a system that can automatically summarize long text documents.
• To use Word Frequency Ranking to identify the most significant sentences.
• To provide a simple interface for text input and summary output.
• To demonstrate the use of NLP techniques like tokenization and stopword removal.
• To make reading and comprehension faster and more efficient.

3. System Overview

The Document Summarization System is composed of a single core component: the
Summarization Logic (Python Core). This component is implemented and prototyped in a
Jupyter Notebook, providing a clear, step-by-step demonstration of how text can be
automatically summarized using NLP techniques.

3.1 Summarization Logic (Python Core)

This component handles the complete process of extracting key sentences from a document.
Core Functional Steps:
• Sentence and Word Tokenization:
The input text is first split into individual sentences using the sent_tokenize() function
from the NLTK library. Each sentence is further divided into words using
word_tokenize(). This step structures the text for detailed analysis.
• Word Frequency Analysis:
A frequency dictionary is built to count occurrences of each non-stopword in the text.
Common stopwords (like “the,” “is,” and “and”) are removed using NLTK’s stopword
list. This ensures only meaningful words contribute to sentence importance.
• Frequency Normalization:
Each word’s raw frequency is divided by the highest frequency in the document. This
scales every weight into the range 0 to 1: the most frequent non-stopword scores 1.0, and
rarer words receive proportionally smaller weights.
• Sentence Scoring:
Every sentence receives a score equal to the sum of the normalized frequencies of the
words it contains. Longer sentences naturally accumulate higher scores, so a length check
skips any sentence of 30 or more words to keep very long sentences from dominating
the summary.
• Summary Selection:
Sentences are ranked by score and the top-scoring ones are selected. The number kept can
be a fixed count (the program in this report keeps three) or a user-defined ratio, e.g.,
30% of total sentences. Presenting the selected sentences in their original document
order preserves coherence and logical flow. The result is a concise extractive summary.
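
The five steps above can be sketched end to end as follows. This is a minimal illustration, not the report's program: it swaps NLTK for simple regular expressions so that it runs without downloads, and the names summarize, split_sentences, tokenize, STOPWORDS, ratio, and max_words are all hypothetical. It selects sentences by ratio and restores their original order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK's English list is much larger).
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in",
             "that", "are", "it", "for", "on", "be", "can"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Return lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, ratio=0.3, max_words=30):
    sentences = split_sentences(text)
    words = [w for s in sentences for w in tokenize(s) if w not in STOPWORDS]
    if not words:
        return ""

    # Word frequency analysis, then normalization by the highest count.
    counts = Counter(words)
    highest = max(counts.values())
    weights = {w: c / highest for w, c in counts.items()}

    # Sentence scoring: sum of word weights, skipping very long sentences.
    scores = {}
    for idx, sent in enumerate(sentences):
        if len(sent.split()) >= max_words:
            continue
        scores[idx] = sum(weights.get(w, 0.0) for w in tokenize(sent))

    # Summary selection: keep the top fraction, then restore document order.
    keep = max(1, round(len(sentences) * ratio))
    best = sorted(scores, key=scores.get, reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(best))


if __name__ == "__main__":
    sample = ("Cats chase mice. Dogs chase cats and cats run. "
              "The sky is blue. Mice fear cats.")
    print(summarize(sample, ratio=0.5))
```

Sorting the selected indices before joining is what keeps the summary in document order even when a later sentence outscores an earlier one.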

4. Key Theoretical Concepts

The system relies fundamentally on concepts from NLP:

• Natural Language Processing (NLP):

Techniques that allow computers to process, analyze, and understand human language.
Tokenization, stopword removal, and frequency analysis are fundamental NLP tasks.

• Tokenization:

• Sentence Tokenization: Splits the text into sentences.

• Word Tokenization: Splits sentences into words for detailed analysis.

• Stopwords:
Commonly occurring words (e.g., “the,” “is,” “and”) that are usually ignored in analysis because
they carry little semantic meaning.

• Word Frequency Analysis:

Counts occurrences of words to identify important terms. The more a word appears, the more
likely it is important, after removing stopwords.

• Frequency Normalization:
Dividing each word frequency by the maximum frequency in the document to create weighted
scores between 0 and 1. The most frequent non-stopword scores 1.0; the scaling keeps scores
in a fixed range but does not change the ranking of words.

• Sentence Scoring:
A method of assigning numerical importance to each sentence by summing the normalized
frequencies of its words.

• Extractive Summarization:
The process of selecting and concatenating the most important sentences from the original
document to create a summary. It preserves the original wording and sentence structure.

• Compression Ratio / User-defined Ratio:

A parameter that determines the fraction of sentences to include in the summary (e.g., 30% of
original sentences). It allows control over summary length.
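
As a worked illustration of frequency normalization and sentence scoring, the sketch below uses a toy token list rather than the report's sample text. The helper names normalized_frequencies and sentence_score are hypothetical and chosen only for this example.

```python
from collections import Counter


def normalized_frequencies(tokens):
    """Divide each word count by the highest count, giving weights in (0, 1]."""
    counts = Counter(tokens)
    if not counts:
        return {}
    highest = max(counts.values())
    return {word: count / highest for word, count in counts.items()}


def sentence_score(sentence_tokens, weights):
    """Sum the weights of the tokens in one sentence; unknown tokens add 0."""
    return sum(weights.get(tok, 0.0) for tok in sentence_tokens)


# Toy document: "search" occurs 3 times, "engine" 2 times, "index" once.
tokens = ["search", "engine", "search", "index", "search", "engine"]
weights = normalized_frequencies(tokens)
# weights: search = 3/3, engine = 2/3, index = 1/3

short_sentence = ["search", "engine"]
long_sentence = ["search", "engine", "index", "search"]

print(sentence_score(short_sentence, weights))
print(sentence_score(long_sentence, weights))
```

The longer sentence scores higher simply because it contains more weighted words, which is the length bias that a sentence-length check is meant to limit.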

5. Benefits
The Document Summarization System offers significant advantages in managing information:
● Efficiency: Dramatically reduces the time required to understand the core message
of a long document.
● Fidelity: Preserves the original language and context since it only extracts
existing sentences.
● Scalability: Can process documents of arbitrary length rapidly, making it suitable
for quick analysis of large corpora.
● Accessibility: The Jupyter Notebook prototype exposes each step of the pipeline,
making basic NLP functionality easy to inspect and reuse.

PROGRAM CODE:

import heapq
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer models and stopword list; newer NLTK releases may also need 'punkt_tab'.
nltk.download('punkt')
nltk.download('stopwords')

text = """Information retrieval (IR) in computing and information science is the task
of identifying and retrieving information system resources that
are relevant to an information need. The information need can be specified in the
form of a search query. In the case of document retrieval, queries can be based on
full-text or other content-based indexing. Information retrieval is the science[1] of
searching for information in a document, searching for documents
themselves, and also searching for the metadata that describes data, and for
databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called
information overload. An IR system is a software system that provides
access to books, journals and other documents; it also stores and manages those
documents. Web search engines are the most visible IR applications.
"""

# Remove bracketed citation markers such as [1], then collapse runs of whitespace.
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text).strip()

sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))

# Count every alphabetic, non-stopword token in the document.
word_frequencies = {}
for word in word_tokenize(text):
    word = word.lower()
    if word.isalpha() and word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

# Normalize: the most frequent word gets weight 1.0.
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] /= maximum_frequency

# Score each sentence as the sum of its word weights.
sentence_scores = {}
for sent in sentences:
    if len(sent.split(' ')) >= 30:  # Ignore very long sentences
        continue
    for word in word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

# Keep the three highest-scoring sentences, then restore document order.
top_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)
summary_sentences = [sent for sent in sentences if sent in top_sentences]
summary = ' '.join(summary_sentences)

print("🔹 ORIGINAL TEXT:\n")
print(text)
print("\n\n🔸 SUMMARIZED TEXT:\n")
print(summary)

OUTPUTS:

ORIGINAL TEXT:

Information retrieval (IR) in computing and information science is the task of identifying and retrieving
information system resources that are relevant to an information need. The information need can be specified
in the form of a search query. In the case of document retrieval, queries can be based on full-text or other
content-based indexing. Information retrieval is the science of searching for information in a document,
searching for documents themselves, and also searching for the metadata that describes data, and for
databases of texts, images or sounds. Automated information retrieval systems are used to reduce what has
been called information overload. An IR system is a software system that provides access to books, journals
and other documents; it also stores and manages those documents. Web search engines are the most visible
IR applications.

SUMMARIZED TEXT:

Information retrieval (IR) in computing and information science is the task of identifying and retrieving
information system resources that are relevant to an information need. Automated information retrieval
systems are used to reduce what has been called information overload. An IR system is a software system that
provides access to books, journals and other documents; it also stores and manages those documents.

CONCLUSION:

The Document Summarization System (Python Core) demonstrates a simple approach to
automatic text summarization using Word Frequency Ranking:

• It identifies and extracts the highest-scoring sentences from a document, as the
sample run shows.

• The system is easy to implement and interpret, making it suitable for educational
purposes and lightweight text analytics tasks.

• By adjusting the number or ratio of selected sentences, summaries can be concise or
detailed depending on the user’s needs.

• This foundational approach can serve as a basis for more advanced summarization
techniques, including graph-based (TextRank) or transformer-based
abstractive summarization.

In essence, the system converts long, information-dense documents into shorter, readable
summaries that keep the original wording, improving comprehension and efficiency.
