Automatic Text Summarization Using Natural Language Processing
This is to certify that this project report entitled Automatic Text Summarization
Using Natural Language Processing, submitted to Jaypee University of
Information Technology, is a bona fide record of work done by
under my supervision from July 2018 to May 2019, in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Computer
Science & Engineering.
The matter embodied in the report has not been submitted for the award of any other
degree or diploma.
ACKNOWLEDGMENT
We would like to express our special thanks and gratitude to our Project Supervisor,
Dr. Vivek Sehgal, as well as our Project Coordinator, Dr. Hemraj Saini, who gave us
the golden opportunity to work on this wonderful project on the topic Automatic Text
Summarization Using Natural Language Processing. The project also required a great
deal of research, through which we came to know about many new things; we are
really thankful to them. Secondly, we would also like to thank our parents and friends,
who helped us a lot in finalizing this project within the limited time frame.
ABSTRACT
TABLE OF CONTENTS

Chapters

1. INTRODUCTION
   1.1 Introduction
   1.3 Objectives
3. SYSTEM DEVELOPMENT
   3.1 NLP
4. PERFORMANCE ANALYSIS
   4.1 Approaches to Sentence Extraction
       4.1.1 Frequency-based approach
       4.1.2 Feature-based approach
   4.3 Training Dataset
5. CONCLUSION
   5.1 Conclusion
6. REFERENCES
Chapter-1
INTRODUCTION
1.1 Introduction
With the growing amount of information, it has become hard to find concise information,
so it is important to build a system that can summarize the way a human does.
Automatic text summarization with the help of Natural Language Processing is a
tool that provides summaries of a given document. Text summarization techniques are divided
into two approaches, extractive and abstractive. The extractive approach
selects the distinctive sentences, passages, and so on of the original document to make a
shorter version of it. The sentences are scored and chosen based on statistical features of the
sentences. In the extractive technique, we have to choose a subset of the given
phrases or sentences to form the summary. Extractive summarization
systems rely on two methods, extraction and prediction, which involve
identifying the particular sentences that are essential to the overall understanding of
the document. The other approach, abstractive text summarization,
involves generating completely new formulations to capture the meaning of the original
document. This approach is more difficult, but it is also the
approach used by humans.
New methodologies, such as machine learning techniques from closely related fields
like text mining and information retrieval, have been used to support automatic
text summarization.
Besides Fully Automated Summarizers (FAS), there are techniques that assist
users in summarizing (MAHS = Machine-Aided Human Summarization), for instance by
highlighting candidate passages to be included in the summary, and there are systems that rely
upon post-processing by a human (HAMS = Human-Aided Machine Summarization).
There are two types of extractive summarization tasks, depending on what the summarization
application focuses on. One is generic summarization, which concentrates on obtaining a generic
summary or abstract of the document (whether documents, news stories, and so on). The other is
query-relevant summarization, sometimes called query-based summarization, which summarizes
specifically with respect to a query. Summarization systems can produce both query-relevant text
summaries and generic machine-generated summaries, depending on what the user
needs.
Likewise, summarization methods attempt to find subsets of objects which carry the information
of the whole set; such a subset is also known as the core set. These algorithms model
qualities such as coverage, diversity, informativeness, and representativeness of the summary.
Query-based summarization techniques additionally model the relevance of the summary
to the query. Some techniques and algorithms which naturally model summarization problems are
TextRank and PageRank, submodular set functions, determinantal point processes,
maximal marginal relevance (MMR), and so on.
In the present era, where a tremendous amount of information is available on the Web, it is
crucial to provide improved mechanisms to find information quickly. It is extremely tedious for
people to manually extract the summary of large documents of text. So there is
an issue of searching for the important documents out of the available ones, and of finding
the important information within them. Automatic text summarization is therefore the need of the
hour. Text summarization is the process of identifying the most important, meaningful
information in a document or set of related documents, and compressing it into a shorter
version while preserving its meaning.
1.3 Objectives
The objective of the project is to understand the concepts of natural language processing
and to create a tool for text summarization. Interest in automatic summarization is
growing rapidly, since it removes manual work. The project concentrates on creating a
tool which automatically summarizes a document.
1.4 Methodologies
For automatic text summarization, there are basically two major techniques:
abstraction-based text summarization and extraction-based text summarization.
Extractive summaries highlight the relevant words from the input source document. The
summary is generated by concatenating the selected sentences in order of appearance, and a
decision is made for every sentence as to whether that particular sentence will be included in
the summary or not. For example, search engines typically use extractive summary generation
methods to generate snippets from web pages. Many kinds of logical and mathematical
formulations have been used to create summaries: regions are scored, and the words with the
highest scores are taken into consideration. In extraction, only important sentences are
selected, which makes this approach the easier one to implement.

There are three main obstacles for the extractive approach. The first is the ranking problem,
which involves ranking the units (words or sentences). The second is the selection problem,
which involves selecting a subset of the ranked units. The third is coherence, that is, knowing
how to select units that together form an understandable summary. Many algorithms exist to
solve the ranking problem. The other two obstacles, selection and coherence, are addressed to
improve diversity, minimize redundancy, and pick out the lines which are important. Each
sentence is scored and arranged in decreasing order of score. Selecting a subset of sentences
that forms a coherent summary while reducing redundancy is not a trivial problem. Once the
list is ordered, the first sentence is the most important one and seeds the summary; at each
following step, the next sentence picked is the one from the top half of the list with the
highest score. The process is repeated until the length limit is reached and a relevant summary
is generated.
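To make the score-rank-select loop above concrete, here is a minimal sketch of a
frequency-based extractive summarizer in Python with NLTK. It illustrates the general
approach described here, not the exact tool built in this project; the summarize function
and its scoring are our own names.

    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def summarize(text, num_sentences=3):
        """Score sentences by normalized content-word frequency, keep the best."""
        stop_words = set(stopwords.words("english"))
        words = [w.lower() for w in word_tokenize(text)
                 if w.isalnum() and w.lower() not in stop_words]
        freq = Counter(words)

        sentences = sent_tokenize(text)
        scores = {}
        for s in sentences:
            tokens = [w.lower() for w in word_tokenize(s) if w.isalnum()]
            # Sum of content-word frequencies, normalized by sentence length.
            scores[s] = sum(freq.get(w, 0) for w in tokens) / max(len(tokens), 1)

        # Rank in decreasing order of score, then restore original document order.
        top = sorted(sentences, key=scores.get, reverse=True)[:num_sentences]
        return " ".join(s for s in sentences if s in top)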
Abstraction Based summarization
People generally write abstractive summaries: after reading a text, a person
understands the topic and writes a short summary in their own particular
manner, creating their own sentences without losing any essential information. However,
it is difficult for a machine to create abstractive summaries. Thus, it can be
said that the goal of abstraction-based summarization is to create a summary
using natural language processing techniques to generate new sentences that
are grammatically correct. Abstractive summary generation is harder than the extractive
technique, as it needs a semantic understanding of the text to be fed into the natural
language system. Sentence fusion, the major problem here, gives rise
to inconsistency in the generated summary, as it is not a well-developed field yet.
1.5 Organization
Chapter 2: Includes the literature survey. We have studied various papers
and journals from reputed sources on machine learning and artificial
neural networks and have summarized them in this chapter.
Chapter-2
LITERATURE SURVEY
2.2.1
…that knowledge from the source text and presents it. There
are three main approaches to summarization:
statistical, graph-based, and machine learning approaches.
Another approach is the clustering approach.

Statistical approaches rest on sentence ranking: the important
sentences are selected from the given document according to the
desired summary compression ratio.
Graph-based approaches concentrate on semantic analysis and the
relationships among sentences; a graph is used as the
representation of the text inside documents.
Machine learning approaches produce the summary by applying
machine learning algorithms. This approach treats the summarization
process as a classification problem: sentences are classified as
belonging to the summary or not, based on their characteristics.
2.2.2
…results are very long and hard to read; thus the demand for automatic
summarization has increased. Automatic summarization collects the
important information from the given document and generates a
summary which keeps what is important and saves time. The review covers
single-document and multi-document summarization methods. H. P. Luhn
pioneered automatic summarization of text in 1958; summarization became
a subfield of NLP. In automatic summarization, important points are not
lost. There are two approaches: the abstraction approach and the
extraction approach.

Extraction is domain independent and provides a summary.
Abstraction is domain dependent; it understands the whole
document and produces the summary.

There are two types of summarization:
- Single-document text summarization
- Multi-document text summarization

The idea of single-document summarization was largely dropped, and
the focus moved to multi-document summarization, which helps reduce size
while maintaining syntactic and semantic relationships. Abstractive
sequence-to-sequence models are generally trained on titles and subtitles;
a similar approach is adopted with document context, which helps with
scaling. All the sentences are then rearranged into order during
inference. Document summarization is cast as a supervised or
semi-supervised problem. In the supervised learning approach, hints such
as topic words, blacklisted words, etc. are used to label the sentences
as positive or negative classes, or the sentences are manually tagged.
A binary classifier then produces a score for each sentence. However,
these methods are not successful at providing document-specific summaries:
the model cannot predict well if document-level information is not provided.
2.2.3
…for example, word and phrase frequency. Weighting the
sentences was proposed, and high-frequency words were given high
preference.
The two methodologies for automatic summarization are extraction and
abstraction. Extractive summarization methods identify the important
sections of the text and reproduce them verbatim. Abstractive
summarization methods produce the important content anew:
natural language processing is used to interpret and examine the text
in order to generate a new, shorter text. Extractive summarization gives
better results than abstractive summarization.
2.2.4
…of text summarization and poses excellent questions about research and
methodologies for future directions.
Text summarization extracts valuable information from this huge
volume of data, which can be used to create the summary. The
fundamental aim is to examine how different techniques have been
used to build summarization systems and to survey them.
2.2.5
…decided for the particular sentence.

Abstractive approach: it consists of understanding the original
text and converting it into a summary. It examines the text and interprets
it, then describes it by generating a shorter form which includes the most
important information from the given document.

A twofold problem is therefore faced: finding the important documents
among the huge number of documents available, and absorbing the large
quantity of important information inside them. Automatic text
summarization is used to shorten the source text into a shorter version
while protecting its information content and overall meaning. The
advantage of a summary is that reading time is reduced and
repetition is kept to a minimum. Summarization tools also search
for headings to identify the key points of a document. Microsoft
Word's AutoSummarize function is an example of text summarization.
The extractive text summarization process is divided into two steps:
1) the pre-processing step, and
2) the processing step.

Pre-processing builds a structured description of the original text. It
usually involves:
a) Sentence boundary identification: boundaries are identified by the
appearance of a full stop at the end of a sentence.
b) Stop-word elimination: common words with no semantic content are removed.
c) Stemming: stemming reduces every word to its stem, which
highlights its semantics.

In the processing step, the importance of each sentence is decided and
a weight is assigned using a weight-learning method. The score is
calculated using a feature-weight equation. The highest-ranking
sentences are selected for the summary.
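As a concrete illustration of the pre-processing step, here is a minimal sketch using
NLTK, the library this project relies on; the function name and structure are ours,
not taken from the paper under review.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def preprocess(text):
        """Return, per sentence, the stemmed content words of the input text."""
        stemmer = PorterStemmer()
        stop_words = set(stopwords.words("english"))
        processed = []
        for sentence in sent_tokenize(text):        # a) sentence boundary identification
            words = [w.lower() for w in word_tokenize(sentence) if w.isalpha()]
            words = [w for w in words if w not in stop_words]   # b) stop-word elimination
            processed.append([stemmer.stem(w) for w in words])  # c) stemming
        return processed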
2.2.6
A distraction mechanism is used to avoid redundancy.
This paper studies how neural summarization models generate, or come to
recognize, the salient information of a particular document.
Building on graph-based extractive approaches, a novel graph-based
attention mechanism is introduced into the encoder-decoder framework.
A seq2seq model is also built, and a new hierarchical
decoding algorithm with a reference mechanism is proposed for
generating abstractive summaries. The proposed method helps satisfy the
constraints of saliency, non-redundancy, information correctness, and
fluency under various frameworks.
Experiments have been conducted on two large-scale corpora with
human-generated summaries. The results show that the model
outperforms previous neural abstractive summarization models and
is also competitive with state-of-the-art extractive methods. Various
methods and experiments are described in the paper.
2.2.7
A well-organized summary is generated for single and multiple
documents. Multi-document summarization has become a very
important part of our daily lives: there is so much information about
any particular topic that it becomes very difficult to read it all. A
summary of the documents helps readers easily understand the topic, with
the important information extracted. The extractive approach, which is
popular for document summarization, is used: the summary is generated by
selecting words and sentences from the provided documents, because it is
difficult to guarantee the linguistic quality of generated text. Maximal
marginal relevance (MMR) is used to score every textual unit and pick out
the highest-scoring ones. The greedy MMR algorithm is also used, but due
to its greediness it does not take the quality of the whole summary into
account. Global inference algorithms are also used for summarization.
However, these algorithms create many problems in the integer linear
programming (ILP) formulation of scoring, and the time complexity is
prohibitive. So there is a great need for efficient algorithms. This paper
presents a new approach called Automatic Summarization using
Reinforcement Learning (ASRL), where the summary is generated
within a search framework that scores candidate summaries. The method
adapts naturally to the automatic summarization problem, and
sentence compression is also incorporated as an action of the
framework. ASRL evaluates comparably to state-of-the-art ILP-style
methods on ROUGE scores. Evaluation is also done
on the basis of execution time. The state space is searched efficiently
for a sub-optimal solution under the score function, producing a summary
whose score approaches the expected score of states with the same
features. The quality of the summary depends only on the score function.
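A sketch of the greedy MMR selection discussed above follows; the data layout
(similarity dictionaries) and function name are our assumptions for illustration,
with lam playing the role of the relevance/redundancy trade-off parameter λ.

    def mmr_select(candidates, query_sim, pairwise_sim, k=3, lam=0.7):
        """Greedy Maximal Marginal Relevance: trade relevance against redundancy.

        candidates   -- list of sentence ids
        query_sim    -- dict: sentence id -> similarity to the query/centroid
        pairwise_sim -- dict: (id, id) -> similarity between two sentences
        """
        selected = []
        while candidates and len(selected) < k:
            def mmr(s):
                # Penalize a candidate by its closest already-selected sentence.
                redundancy = max((pairwise_sim[(s, t)] for t in selected), default=0.0)
                return lam * query_sim[s] - (1 - lam) * redundancy
            best = max(candidates, key=mmr)
            selected.append(best)
            candidates.remove(best)
        return selected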
2.2.8
…important information is generated accurately, and out-of-vocabulary (OOV) words are
also taken into consideration while summarizing a paragraph.
A coverage mechanism, intra-decoder attention mechanisms, and many
other approaches have been used.
2.2.9
…focuses on selecting sentences to generate the output summary. Neural
architectures have also been used for geometric reasoning. With the help of
a number of transformation and scoring algorithms, highlights are matched
to document content, and two large training data sets are
constructed: one for sentence extraction and the other
for word extraction. The authors viewed
summarization as a problem analogous to statistical machine
translation and generated headlines using statistical models for
selecting and ordering the summary words. The model operates on
learned representations and helps produce multi-sentence
output, organizing summary words into sentences so that the text is easy
for the reader to read. The meaning of sentences is also determined, and
a neural network is employed directly to generate the actual summarization.
A neural attention model is built for abstractive sentence
compression, trained on pairs of headlines and
sentences from articles. Rather than selecting from the whole vocabulary,
the decoder selects output symbols from a restricted set.
The model accommodates both generation and extraction. The
evaluation is done both ways, automatically and by
humans, on both datasets.
2.2.10
Chapter-3
System Development
3.1 NLP

NLP is the process of enabling computers to understand and produce human language.
NLP techniques are applied in text extraction, machine translation, and voice agents
like Alexa and Siri. NLP is one of the fields that has profited from modern
approaches in machine learning, especially from deep learning techniques.
The Natural Language Toolkit (NLTK) is the main platform for building Python
programs that work with human language data. It is easy to use,
providing interfaces to more than 40 corpora and lexical resources, with
libraries for classification, for splitting paragraphs into sentences and sentences
into words, for reducing words to their base form, and for tagging,
parsing, and semantic reasoning. NLTK offers a huge set of tools
and supports the whole natural language processing
pipeline: splitting sentences from paragraphs, splitting up words,
recognizing the part of speech of those words, and marking the main subjects. Doing
this helps the machine recognize what matters in the content.
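A few lines suffice to exercise the NLTK capabilities just listed (sentence splitting,
word splitting, part-of-speech tagging); the sample text here is ours.

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    text = "Automatic summarization shortens a document. It keeps the key points."
    sentences = nltk.sent_tokenize(text)       # split a paragraph into sentences
    tokens = nltk.word_tokenize(sentences[0])  # split a sentence into words
    print(nltk.pos_tag(tokens))                # tag the syntactic role of each word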
3.2 Lesk Algorithm
WordNet is a semantically organized electronic database of nouns, verbs,
adjectives, and adverbs. Similar words are grouped together to form synonym sets
(synsets). The Lesk algorithm works with words that belong to at least one synset,
known as WordNet words. Synsets are interrelated by semantic and lexical relations.
WordNet links not only word forms but also the senses of the words. It groups
English words together and provides short definitions (glosses). It is accessible to
human users via a web browser and is used in automatic text analysis and artificial
intelligence applications. It excludes prepositions and other function words, and
includes nouns, verbs, etc. All synsets are connected by semantic relations.
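NLTK ships an implementation of the Lesk algorithm over WordNet; a minimal usage
example follows (the sample sentence is ours).

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    nltk.download("punkt", quiet=True)
    nltk.download("wordnet", quiet=True)

    context = word_tokenize("I went to the bank to deposit my money")
    # Pick the WordNet synset whose gloss best overlaps the context words.
    sense = lesk(context, "bank")
    print(sense, "-", sense.definition())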
Step 1: Data pre-processing. The automatic document summary generator removes
material that is not required from the content. This stage performs sentence
splitting, stop-word removal, and stemming.

Step 2: Sentence weighting. Weights are computed using the Lesk overlap count, and
WordNet is used to process every sentence. For each sentence of the document, the
stop words are first removed, as they are of no interest in the sense-assignment
process. Every remaining word is then disambiguated with the help of WordNet: the
overlap between each candidate sense's gloss and the sentence content is computed.
The sum of these intersection counts gives the weight of the sentence.
Step 3: Summarization. This is the last stage of automatic summarization, in which
the final output is produced and reviewed once all the sentences have been weighted.
First, the list of weighted sentences is sorted in descending order of weight. A
number of sentences is then picked according to the summary ratio. The picked
sentences are finally recomposed in the order in which they appear in the input,
relying only on the denotative information lying in the sentences themselves rather
than on any particular external object. The resulting summary is plain written text.
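Below is a compact sketch of how the three steps could fit together, assuming the
weight of a sentence is the total gloss-overlap (Lesk) score of its disambiguated
words. All function names here are ours; this is an interpretation of the steps
above, not the project's exact code.

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.wsd import lesk

    for pkg in ("punkt", "stopwords", "wordnet"):
        nltk.download(pkg, quiet=True)

    def lesk_weight(sentence, stop_words):
        """Step 2: weight a sentence by the gloss overlap of its word senses."""
        tokens = [w for w in word_tokenize(sentence)
                  if w.isalpha() and w.lower() not in stop_words]
        weight = 0
        for word in tokens:
            sense = lesk(tokens, word)               # WordNet sense disambiguation
            if sense:
                gloss = set(word_tokenize(sense.definition()))
                weight += len(gloss & set(tokens))   # gloss/sentence intersection
        return weight

    def summarize(text, ratio=0.3):
        stop_words = set(stopwords.words("english"))
        sentences = sent_tokenize(text)              # step 1: pre-processing
        weights = {s: lesk_weight(s, stop_words) for s in sentences}
        n = max(1, int(len(sentences) * ratio))
        top = sorted(sentences, key=weights.get, reverse=True)[:n]
        # Step 3: restore the original order of the selected sentences.
        return " ".join(s for s in sentences if s in top)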
The proposed approach uses machine and deep learning concepts. The flow chart for this
approach is as follows:
3.8 Platform Used
3.8.1 Windows 10
Windows 10 is Microsoft's operating system for PCs, tablets, embedded devices, and
more. Microsoft released Windows 10 as the follow-up to Windows 8. It was said in
July that Windows 10 would be continually updated rather than replaced by a new,
separately released successor.
Windows 8 came up with a new idea and offered a touch-enabled, gesture-driven UI
like those on cell phones and tablets, but it did not translate well to
traditional desktop and laptop PCs, particularly in enterprise settings. In
Windows 10, Microsoft sought to address this issue and other complaints about
Windows 8, such as a lack of enterprise-friendly features.

Microsoft announced Windows 10 in September 2014, and the Windows Insider program
was created at that time. Microsoft released Windows 10 to the general public in
July 2015. Users then observed that Windows 10 is friendlier than Windows 8 because
of its more conventional interface, which echoes the desktop-oriented layout of
Windows 7.
The Windows 10 Anniversary Update, which came out in August 2016, made some
modifications to the taskbar and Start Menu. It additionally introduced browser
extensions in Edge and gave users access to Cortana on the lock screen. In April
2017, Microsoft released the Windows 10 Creators Update, which made Windows Hello's
facial recognition technology faster and enabled users to save tabs in Microsoft
Edge to view later.

The Windows 10 Fall Creators Update appeared in October 2017, adding Windows
Defender Exploit Guard to protect against zero-day attacks. The update likewise
enabled users and IT to put applications running in the background into an
energy-efficient mode to preserve battery life and improve performance.
Ubuntu is a free and open-source operating system and Linux distribution based on
Debian. Ubuntu has three official editions: Ubuntu Desktop for PCs,
Ubuntu Server for servers and the cloud, and Ubuntu Core for Internet of Things
devices and robots. New Ubuntu releases happen at regular intervals, while long-term
support (LTS) releases happen every two years; the latest one, 18.04 LTS (Bionic
Beaver), is supported for five years.

Ubuntu is the operating system for the cloud and is the reference operating system
for OpenStack. With our dedicated server plans, you do not have to share resources:
you are entitled to use 100% of the provided server to handle the traffic to your
site and run your business. If your requirements grow with time, you can upgrade
the business to a bigger and faster server; we ensure that we grow as your business
grows. We provide the most secure, reliable, and scalable Ubuntu-based dedicated
server hosting services. HostingRaja provides focused and customized dedicated
servers on the Ubuntu operating system.

It is the second most-used Linux distribution for dedicated hosting and VPS
hosting in the hosting industry. Ubuntu is the second most popular choice of
hosters for running Linux virtual machines or dedicated servers, and is mostly
used in cloud platforms and cloud hosting setups.
…MiaCMS, Microweber, Midgard CMS, MODX, Novius OS, Core CMS, OctoberCMS,
Omeka, OpenCart, papaya CMS, pH7CMS, Phire CMS, PHP-Nuke, phpWebLog,
phpWiki, Pimcore, PivotX, Pixie (CMS), PmWiki, PrestaShop, ProcessWire, Luck,
SilverStripe, SMW+, SPIP, Textpattern, Tiki Wiki CMS Groupware, TYPO3,
WordPress, Xaraya, XOOPS.
These are the main PHP-related platforms; there are numerous other platforms such
as Java, Java packages, Microsoft's official ASP.NET site, Perl, Python, etc., and
the list just goes on and on. HostingRaja assists with all the technologies used
on the Ubuntu operating system.
Chapter-4
Performance Analysis
Early work on text summarization assumed that the important words in a document
are repeated many times compared to the other words in the document. The importance
of sentences in the document was therefore modelled using word frequency. Since
then, a considerable number of summarization systems have used frequency-based
approaches for sentence extraction. Two techniques that use frequency as the
primary measure in text summarization are word probability and term
frequency-inverse document frequency (TF-IDF).
One of the simplest ways of using frequency is to count the raw frequency of a
word, i.e., to simply count every occurrence of the word in the document. However,
such measures are greatly affected by document length. One approach to adjusting
for document length is to compute the word probability. Equation 1 gives the
probability of a particular word:
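Equation 1 itself is missing from this copy; in the standard form used by
frequency-based summarizers, the word probability is

\[ P(w) = \frac{f(w)}{N} \tag{1} \]

where f(w) is the number of occurrences of word w in the document and N is the
total number of words in the document.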
Findings from studies of human-written summaries show that people tend to use word
frequency to determine the key topics of a document. An example of a summarization
framework that exploits word probability to create summaries is SumBasic. The
SumBasic system first computes the word probabilities from the input document. For
each sentence Sj, it computes the sentence weight as a function of its words'
probabilities, and the best-scoring sentence is picked on the basis of sentence
weight.
For the term frequency, every word count is normalized by the total number of words
in document j; the term-frequency weight is thus computed like the word probability
of Equation 1. The inverse document frequency of a word i is computed as:
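The IDF equation is missing from this copy; its standard form is

\[ \mathrm{idf}_i = \log \frac{D}{d_i} \]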
where the total number of documents D in the corpus is divided by the number of
documents d_i that contain word i. Based on Equations 3 and 4, the TF-IDF weight
of word i in document j is calculated as:
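The TF-IDF equation is likewise missing; its standard form is

\[ \mathrm{tfidf}_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_i = \mathrm{tf}_{ij} \times \log \frac{D}{d_i} \]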
Sentence Position
The sentences at the beginning of a document usually describe its main information.

Title/Headline Word
Title words appearing in a sentence suggest that the sentence contains important
information.

Term Weight
Words with high occurrence within the document are used to decide the significance
of the sentence.

Sentence Length
Very short sentences contain little information, while very long sentences are not
appropriate for representing the summary.
The figure depicts the basic model of a feature-based summarizer. Scores are
computed for each feature and combined for sentence scoring. Before the sentences
are scored, the features are given weights to determine their level of importance.
In this setting, feature weighting is applied to determine the weight associated
with each feature, and the sentence score is then computed as the linear
combination of each feature score multiplied by its corresponding weight:
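The linear-combination equation is missing here; in standard form, with feature
scores f_k(S) and weights w_k, it is

\[ \mathrm{score}(S) = \sum_{k} w_k \, f_k(S) \]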
Binwahlan et al. (2009) proposed a text summarization model based on Particle
Swarm Optimization to determine feature weights. The researchers used a genetic
algorithm to approximate the best weight combination for multi-document
summarization. Differential evolution algorithms have also been used to scale the
relevance of feature weights. An examination of the effect of different feature
combinations was carried out by Hariharan, where it was found that better results
were obtained by combining term-frequency weight with position and node weight.
In this project we are going to use deep learning concepts for an abstractive
summarizer based on a food-review dataset. Before developing the model, let us
understand the concept of deep learning. The basic structure of a neural network
with its hidden layers is shown in the following figure.
Neural networks (NNs) are also used for Natural Language Processing (NLP),
including summarizers. Neural networks are effective in solving almost any machine
learning classification problem. The important parameters in defining the
architecture of a neural network are the total number of hidden layers used, the
number of hidden units in each layer, the activation function for each node, the
error threshold for the data, the type of interconnections, etc. Neural networks
can capture very complex characteristics of data without any significant manual
feature engineering, as opposed to classical machine learning systems. Deep
learning uses deep neural networks to learn good representations of the input
data, which can then be used to perform specific tasks.
Recurrent Neural Networks (RNNs) were introduced in the 1980s but have become very
popular thanks to the increased computational power of GPUs. They are useful for
sequential data because each neuron can use its internal memory to maintain
information about the previous inputs. This matters because in language, "I had
washed my house" is very different from "I had my house washed." The network thus
gains a deeper understanding of the given statement. An RNN contains loops through
which information is carried across neurons while reading the input.
Here xt is the input, A is the recurrent part of the RNN, and ht is the output.
Words from the given sentences, or even individual characters from a string, are
fed in as xt, and the network comes up with ht. The ht is used as the output and is
compared to the test data, from which the error rate is determined. After the
comparison of the output with the test data, the back-propagation technique is
applied: back-propagation through time (BPTT) goes back through the network and
adjusts the weights depending on the error rate. The RNN can therefore use context
from the start of the sentence so that the prediction is correct.
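A minimal sketch of the recurrent step just described, written with NumPy; the
dimensions and random weights are toy values chosen purely for illustration.

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        """One recurrent step: new hidden state from current input and prior state."""
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    # Toy dimensions: 4-dim inputs, 8-dim hidden state.
    rng = np.random.default_rng(0)
    W_xh, W_hh, b_h = rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), np.zeros(8)

    h = np.zeros(8)
    for x_t in rng.normal(size=(5, 4)):       # a sequence of 5 input vectors
        h = rnn_step(x_t, h, W_xh, W_hh, b_h) # h carries context forward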
4.2.3 Encoders and Decoders
An effective architecture for sequence-to-sequence prediction problems is the
Encoder-Decoder LSTM. It contains two models: "one for reading the input sequence
and encoding it into a fixed-length vector, and a second for decoding the
fixed-length vector and outputting the predicted sequence". The Encoder-Decoder
LSTM is designed specifically for sequence-to-sequence problems. It was developed
for NLP problems where it gave state-of-the-art performance, notably in the
translation of text, in the field called statistical machine translation. The core
technique is sequence embedding. In the translation task, the model was more
effective when the input sequence was reversed, and it was effective on long input
sequences. This approach has also been used with image inputs.
…the decoder seeks y* = argmax_y P(y|x), where y is a random variable denoting a
sequence of N words. The conditional probability is modelled by a parametric
function with parameters θ: P(y|x) = P(y|x; θ). We need to find the θ that
maximizes the conditional probability of the sentence-summary pairs in the
training corpus. If the model generates the next word one step at a time, the
conditional probability can be factorized into a product of individual conditional
probabilities:
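The factorization referred to here did not survive in this copy; in standard form
it is

\[ P(y \mid x; \theta) = \prod_{t=1}^{N} p\left(y_t \mid y_1, \ldots, y_{t-1}, x; \theta\right) \]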
4.4 Training Snippet
4.5 Custom Input and Output
4.5.1 Input
4.5.2 Output
4.6 Loss Graph
Chapter-5
Conclusion
As time goes on, the internet is growing at a very fast rate, and with it data and
information are also increasing. It is going to be difficult for humans to summarize
such large amounts of data; thus there is a need for automatic text summarization
because of this huge volume of data. So far, we have read multiple papers regarding
text summarization, natural language processing, and the Lesk algorithm. There are
multiple automatic text summarizers with great capabilities that give good results.
We have learned the basics of the extractive and abstractive methods of automatic
text summarization and implemented the extractive one. We have built a basic
automatic text summarizer with the NLTK library in Python, and it works on small
documents. We used the extractive approach to perform text summarization.
5.1 Future Scope
References
[1] Ahmad T. Al-Taani, "Automatic Text Summarization Approaches", International
Conference on Infocom Technologies and Unmanned Systems (ICTUS'2017).
[2] Neelima Bhatia, Arunima Jaiswal, "Automatic Text Summarization: Single and
Multiple Summarizations", International Journal of Computer Applications.
[3] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D.
Trippe, Juan B. Gutierrez, Krys Kochut, "Text Summarization Techniques: A Brief
Survey", (IJACSA) International Journal of Advanced Computer Science and
Applications.
[4] Pankaj Gupta, Ritu Tiwari, Nirmal Robert, "Sentiment Analysis and Text
Summarization of Online Reviews: A Survey", International Conference on
Communication and Signal Processing, August 2013.
[8] Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, Chandan K. Reddy, "Neural
Abstractive Text Summarization with Sequence-to-Sequence Models".