IR Mini Project-Doc Summarization System

The document outlines a mini-project aimed at developing an Extractive Text Summarization System using Natural Language Processing (NLP) techniques. The system focuses on automatically generating concise summaries of lengthy documents by selecting key sentences based on word frequency analysis. It includes a Python implementation that demonstrates the summarization logic, benefiting users by improving reading efficiency and comprehension.

Uploaded by

ayushagc.44
Copyright
© All Rights Reserved

Modern Education Society’s

Wadia College of Engineering, Pune

NAME OF STUDENT: CLASS:

SEMESTER/YEAR: ROLL NO:

DATE OF PERFORMANCE: DATE OF SUBMISSION:

EXAMINED BY: EXPERIMENT NO: Mini-Project

ASSIGNMENT - Mini Project

AIM: Mini Project Implementation

TITLE: Develop Document Summarization System

OBJECTIVES:
The primary objective of this project is to develop an Extractive Text Summarization System
using Natural Language Processing (NLP) techniques to automatically generate concise and
coherent summaries of lengthy documents.

THEORY:
1. Introduction
In the modern digital world, a tremendous amount of textual information is generated daily in the
form of news articles, research papers, blogs, and social media posts. It becomes difficult for
users to manually read and extract important points from lengthy documents.
Text Summarization is the process of automatically generating a concise and coherent summary
that contains the most important information from the original document.
There are two main types of summarization:
• Extractive Summarization: Selects key sentences from the text.
• Abstractive Summarization: Generates new sentences to represent the main ideas.
This project focuses on an Extractive Text Summarization System using Word Frequency
Ranking in Natural Language Processing (NLP).
2. Objectives

The project’s core and specific objectives are as follows:

• To design a system that can automatically summarize long text documents.
• To use Word Frequency Ranking to identify the most significant sentences.
• To provide a simple interface for text input and summary output.
• To demonstrate the use of NLP techniques like tokenization and stopword removal.
• To make reading and comprehension faster and more efficient.

3. System Overview

The Document Summarization System is composed of a single core component: the
Summarization Logic (Python Core). This component is implemented and prototyped in a
Jupyter Notebook, providing a clear, step-by-step demonstration of how text can be
automatically summarized using NLP techniques.

3.1 Summarization Logic (Python Core)

This component handles the complete process of extracting key sentences from a document.
Core Functional Steps:
• Sentence and Word Tokenization:
The input text is first split into individual sentences using the sent_tokenize() function
from the NLTK library. Each sentence is further divided into words using
word_tokenize(). This step structures the text for detailed analysis.
• Word Frequency Analysis:
A frequency dictionary is built to count occurrences of each non-stopword in the text.
Common stopwords (like “the,” “is,” and “and”) are removed using NLTK’s stopword
list, and tokens that are not purely alphabetic (punctuation, numbers, hyphenated
words) are skipped. This ensures only meaningful words contribute to sentence importance.
• Frequency Normalization:
Each word’s raw frequency is divided by the highest frequency in the document, so every
score falls between 0 and 1 and the most frequent word scores exactly 1. This puts the
scores on a common scale regardless of document length; it does not change the relative
ranking of the words.
• Sentence Scoring:
Every sentence receives a score equal to the sum of the normalized frequencies of the
words it contains. Because longer sentences naturally accumulate higher scores, a length
check excludes sentences of 30 or more words so that they cannot dominate the summary.
• Summary Selection:
Sentences are ranked by score and the top-ranked ones are selected. The program listed
under PROGRAM CODE selects a fixed count of three sentences; a user-defined ratio (e.g.,
30% of total sentences) can be used instead to set this count. The selected sentences
should be presented in their original document order to preserve coherence and logical
flow. The result is a concise, coherent extractive summary.
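
The frequency, normalization, and scoring steps above can be illustrated on a toy example. The following is a minimal sketch, not the project's implementation: it assumes a regex tokenizer in place of NLTK's word_tokenize() and a tiny hand-written stopword set in place of NLTK's list, and the names STOP_WORDS, tokenize_words, normalized_frequencies, and score_sentences are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "are", "on"}

def tokenize_words(text):
    """Lowercase the text and return its alphabetic tokens (regex stand-in for word_tokenize)."""
    return re.findall(r"[a-z]+", text.lower())

def normalized_frequencies(text):
    """Count non-stopword tokens and divide each count by the highest count."""
    counts = Counter(w for w in tokenize_words(text) if w not in STOP_WORDS)
    if not counts:
        return {}
    highest = max(counts.values())
    return {word: count / highest for word, count in counts.items()}

def score_sentences(sentences, weights, max_words=30):
    """Sum the word weights of each sentence, skipping sentences of max_words or more words."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split()) >= max_words:
            continue
        scores[sentence] = sum(weights.get(w, 0.0) for w in tokenize_words(sentence))
    return scores

sentences = [
    "Cats chase mice.",
    "Mice fear cats and dogs.",
    "The weather is mild.",
]
weights = normalized_frequencies(" ".join(sentences))
scores = score_sentences(sentences, weights)
for sentence in sentences:
    print(f"{scores[sentence]:.1f}  {sentence}")
```

Here "cats" and "mice" each occur twice, so both receive the maximum weight of 1.0, while every other non-stopword receives 0.5. The second sentence scores highest because it contains both high-weight words plus two others.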

4. Key Theoretical Concepts

The system relies fundamentally on concepts from NLP:

• Natural Language Processing (NLP):
Techniques that allow computers to process, analyze, and understand human language.
Tokenization, stopword removal, and frequency analysis are fundamental NLP tasks.

• Tokenization:
• Sentence Tokenization: Splits the text into sentences.
• Word Tokenization: Splits sentences into words for detailed analysis.

• Stopwords:
Commonly occurring words (e.g., “the,” “is,” “and”) that are usually ignored in analysis because
they carry little semantic meaning.

• Word Frequency Analysis:
Counts occurrences of words to identify important terms. Once stopwords are removed, a
word that appears more often is treated as more likely to be important.

• Frequency Normalization:
Dividing each word frequency by the maximum frequency in the document to create weighted
scores between 0 and 1. This rescales the counts without changing the relative ranking of
the words.

• Sentence Scoring:
A method of assigning numerical importance to each sentence by summing the normalized
frequencies of its words.

• Extractive Summarization:
The process of selecting and concatenating the most important sentences from the original
document to create a summary. It preserves the original wording and sentence structure.

• Compression Ratio / User-defined Ratio:
A parameter that determines the fraction of sentences to include in the summary (e.g., 30% of
original sentences). It allows control over summary length.
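
Ratio-based selection can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the function name select_summary and its ratio parameter are hypothetical, and rounding the sentence count up (while keeping at least one sentence) is one possible convention among several.

```python
import heapq
import math

def select_summary(sentences, scores, ratio=0.3):
    """Return the top-scoring sentences, restored to document order.

    sentences: all sentences in their original order.
    scores: mapping from sentence to score (unscored sentences may be absent).
    ratio: fraction of the total sentence count to keep (hypothetical parameter).
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Assumption: round the fractional count up and keep at least one sentence.
    keep = max(1, math.ceil(ratio * len(sentences)))
    top = heapq.nlargest(keep, scores, key=scores.get)
    # Restore the original order so the summary reads coherently.
    position = {sentence: index for index, sentence in enumerate(sentences)}
    return sorted(top, key=position.__getitem__)

sentences = ["A first.", "B second.", "C third.", "D fourth."]
scores = {"A first.": 1.0, "B second.": 2.0, "C third.": 0.5, "D fourth.": 3.0}
print(select_summary(sentences, scores, ratio=0.5))
```

With a ratio of 0.5 over four sentences, two are kept. "D fourth." has the highest score, but it still appears after "B second." in the result because the selection is re-sorted by document position.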

5. Benefits
The Document Summarization System offers significant advantages in managing information:
● Efficiency: Dramatically reduces the time required to understand the core message
of a long document.
● Fidelity: Preserves the original language and context since it only extracts
existing sentences.
● Scalability: Can process documents of arbitrary length rapidly, making it suitable
for quick analysis of large corpora.
● Accessibility: Runs as a short, self-contained Jupyter Notebook that takes plain text
as input and prints the summary, making basic NLP functionality easy to try.

PROGRAM CODE:

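import heapq
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the tokenizer models and the stopword list (needed once per environment).
# Note: newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')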

text = """Information retrieval (IR) in computing and information science is the task
of identifying and retrieving information system resources that
are relevant to an information need. The information need can be specified in the
form of a search query. In the case of document retrieval, queries can be based on
full-text or other content-based indexing. Information retrieval is the science[1] of
searching for information in a document, searching for documents
themselves, and also searching for the metadata that describes data, and for
databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called
information overload. An IR system is a software system that provides
access to books, journals and other documents; it also stores and manages those
documents. Web search engines are the most visible IR applications.
"""

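# Remove citation markers such as [1] first, then collapse whitespace,
# so that removing a marker does not leave a double space behind.
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text).strip()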

sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))

# Count every alphabetic, non-stopword token in the document.
word_frequencies = {}
for word in word_tokenize(text):
    word = word.lower()
    if word.isalpha() and word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

# Normalize: divide each count by the highest count, so scores lie in (0, 1].
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# Score each sentence as the sum of the normalized frequencies of its words.
sentence_scores = {}
for sent in sentences:
    if len(sent.split(' ')) >= 30:  # Ignore very long sentences
        continue
    for word in word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

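# Pick the three highest-scoring sentences, then restore their document order.
top_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)
summary_sentences = sorted(top_sentences, key=sentences.index)

summary = ' '.join(summary_sentences)

print("🔹 ORIGINAL TEXT:\n")
print(text)
print("\n\n🔸 SUMMARIZED TEXT:\n")
print(summary)
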
OUTPUTS:

ORIGINAL TEXT:

Information retrieval (IR) in computing and information science is the task of identifying and retrieving
information system resources that are relevant to an information need. The information need can be specified
in the form of a search query. In the case of document retrieval, queries can be based on full-text or other
content-based indexing. Information retrieval is the science of searching for information in a document,
searching for documents themselves, and also searching for the metadata that describes data, and for
databases of texts, images or sounds. Automated information retrieval systems are used to reduce what has
been called information overload. An IR system is a software system that provides access to books, journals
and other documents; it also stores and manages those documents. Web search engines are the most visible
IR applications.

SUMMARIZED TEXT:

Information retrieval (IR) in computing and information science is the task of identifying and retrieving
information system resources that are relevant to an information need. Automated information retrieval
systems are used to reduce what has been called information overload. An IR system is a software system that
provides access to books, journals and other documents; it also stores and manages those documents.

CONCLUSION:

The Document Summarization System (Python Core) demonstrates an effective
approach for automatic text summarization using Word Frequency Ranking:

• It identifies and extracts the highest-scoring sentences from a document, as the
sample output above illustrates.

• The system is easy to implement and interpret, making it suitable for educational
purposes and lightweight text analytics tasks.

• By adjusting the number (or ratio) of sentences selected, summaries can be concise
or detailed depending on the user’s needs.

• This foundational approach can serve as a basis for more advanced summarization
techniques, including graph-based (TextRank) or transformer-based
abstractive summarization.

In essence, the system converts long, information-dense documents into readable,
coherent summaries while maintaining the original meaning, improving comprehension
and efficiency.
