UNIT 1
Human Languages in Natural Language Processing (NLP)
Human languages are complex systems of communication that have evolved over thousands of
years. In the field of Natural Language Processing (NLP), understanding human languages is
essential because it allows computers to interact with humans in a more natural and meaningful
way. This guide will explain human languages, their structure, and how NLP techniques work
with them, using simple examples.
What is a Human Language?
A human language is a system of symbols (words) and rules (grammar) that allows people to
communicate. There are thousands of human languages in the world, each with its own unique
set of symbols and rules. For example, English, Hindi, Chinese, and Spanish are all human
languages, but they each have different words and grammatical structures.
Components of a Human Language
1. Phonetics and Phonology:
- Phonetics: The study of sounds. For example, the sound of "p" in "pat".
- Phonology: The study of how sounds are organized in a language. For example, in English, "str" can start a word (like "street"), but "rtz" cannot.
2. Morphology:
- The study of the structure of words. For example, the word "unhappiness" is made up of three parts: "un-", "happy", and "-ness".
3. Syntax:
- The study of sentence structure. For example, in English, the typical sentence structure is Subject-Verb-Object (SVO), as in "The cat (Subject) eats (Verb) fish (Object)".
4. Semantics:
- The study of meaning. For example, the word "bank" can mean the side of a river or a financial institution, depending on the context.
5. Pragmatics:
- The study of how context influences the meaning of language. For example, "Can you pass the salt?" is understood as a request rather than a question about ability.
Example: English Language
Let's use English as an example to illustrate these components.
- Phonetics: The sound /k/ in "cat" and /p/ in "pat".
- Phonology: The difference in pronunciation between "cat" and "bat".
- Morphology: The word "cats" consists of "cat" (a noun) and "s" (a plural marker).
- Syntax: "The dog chased the cat." follows the SVO structure.
- Semantics: "I went to the bank." The meaning of "bank" depends on whether we're talking about a river or money.
- Pragmatics: "It's cold in here, isn't it?" might be a request to close the window rather than just a statement about temperature.
NLP and Human Languages
NLP is the field of computer science focused on enabling computers to understand and process
human languages. NLP techniques are used in various applications like chatbots, translation
services, and voice assistants.
Steps in NLP
1. Text Preprocessing:
- Cleaning and preparing the text for analysis. For example, removing punctuation, converting text to lowercase, and tokenization (splitting text into words).
2. Tokenization:
- Splitting text into words or sentences. For example, the sentence "The cat sat on the mat." can be tokenized into ["The", "cat", "sat", "on", "the", "mat"].
3. Part-of-Speech Tagging:
- Identifying the grammatical parts of speech in a sentence. For example, in the sentence "The cat sat on the mat," "The" is a determiner, "cat" is a noun, "sat" is a verb, and so on.
4. Named Entity Recognition (NER):
- Identifying and classifying proper nouns in text. For example, in "Barack Obama was born in Hawaii," "Barack Obama" is a person, and "Hawaii" is a location.
5. Dependency Parsing:
- Analyzing the grammatical structure of a sentence and how words are related. For example, in the sentence "She enjoys playing tennis," "enjoys" is the main verb, and "playing tennis" is the object of "enjoys."
6. Sentiment Analysis:
- Determining the sentiment or emotion expressed in text. For example, "I love this movie!" expresses positive sentiment, while "I hate this movie!" expresses negative sentiment.
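The first few of these steps can be sketched with the standard library alone. The tiny part-of-speech lexicon and entity gazetteer below are illustrative assumptions, not real NLP resources:

```python
import re

# Toy sketch of three pipeline steps (tokenization, POS tagging, NER)
# using hand-made lookup tables; coverage is limited to the examples.

POS_LEXICON = {"the": "DT", "cat": "NN", "sat": "VBD", "on": "IN", "mat": "NN"}
GAZETTEER = {"Barack Obama": "PERSON", "Hawaii": "LOCATION"}

def tokenize(text):
    # Split on runs of word characters; this drops punctuation.
    return re.findall(r"\w+", text)

def pos_tag(tokens):
    # Look each lowercased token up in the toy lexicon.
    return [(t, POS_LEXICON.get(t.lower(), "UNK")) for t in tokens]

def find_entities(text):
    # Mark any gazetteer phrase that appears verbatim in the text.
    return {name: label for name, label in GAZETTEER.items() if name in text}

tokens = tokenize("The cat sat on the mat.")
print(tokens)            # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(pos_tag(tokens))
print(find_entities("Barack Obama was born in Hawaii"))
```

Real taggers and NER systems are learned from data rather than hand-listed, but the input/output shapes are the same.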
Live Example: Chatbots
Let's see how a chatbot interacts using NLP.
Scenario: You are chatting with a customer service bot about your internet connection problem.
1. User Input: "My internet is not working."
2. Text Preprocessing:
- Convert text to lowercase: "my internet is not working."
- Tokenization: ["my", "internet", "is", "not", "working"]
3. Part-of-Speech Tagging:
- "my" (pronoun), "internet" (noun), "is" (verb), "not" (adverb), "working" (verb)
4. Named Entity Recognition:
- No named entities in this sentence.
5. Dependency Parsing:
- "internet" is the subject, "working" is the main verb, "not" modifies "working".
6. Intent Recognition:
- The chatbot recognizes the user's intent as reporting an issue with the internet.
7. Response Generation:
- Based on the recognized intent, the chatbot generates a response: "I'm sorry to hear that your internet is not working. Have you tried restarting your router?"
Example Interaction:
- User: "My internet is not working."
- Chatbot: "I'm sorry to hear that your internet is not working. Have you tried restarting your router?"
- User: "Yes, I have."
- Chatbot: "Let me check your connection. Please wait a moment."
In this example, the chatbot uses NLP techniques to understand the user's problem and provide
an appropriate response.
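A stripped-down version of this flow can be written as a keyword-based intent matcher. The intent names, keyword sets, and canned replies below are invented for illustration; production chatbots use trained intent classifiers instead:

```python
# Minimal rule-based sketch of the chatbot flow: preprocess the input,
# match keywords to an intent, then pick a canned response.

INTENTS = {
    "internet_issue": {"internet", "connection", "wifi"},
    "billing_issue": {"bill", "charge", "payment"},
}
RESPONSES = {
    "internet_issue": ("I'm sorry to hear that your internet is not working. "
                       "Have you tried restarting your router?"),
    "billing_issue": "Let me pull up your billing details.",
}

def recognize_intent(text):
    # Lowercase, strip periods, split into tokens (crude preprocessing).
    tokens = set(text.lower().replace(".", "").split())
    for intent, keywords in INTENTS.items():
        if tokens & keywords:          # any keyword present
            return intent
    return None

def reply(text):
    intent = recognize_intent(text)
    return RESPONSES.get(intent, "Could you tell me more about the problem?")

print(reply("My internet is not working."))
```

The fallback reply handles inputs that match no intent, which is where real systems hand off to a richer model or a human agent.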
Challenges in NLP
1. Ambiguity:
- Words and sentences can have multiple meanings. For example, "I saw the man with the telescope" can mean either the man had the telescope or I used the telescope to see the man.
2. Context Understanding:
- Understanding the context is crucial. For example, "bat" can mean an animal or a piece of sports equipment, depending on the context.
3. Idioms and Expressions:
- Idiomatic expressions can be difficult for computers to understand. For example, "kick the bucket" means "to die," not literally kicking a bucket.
4. Sarcasm and Humor:
- Detecting sarcasm and humor is challenging because they often rely on tone and context that are not explicitly stated in the text.
Conclusion
Understanding human languages is fundamental to NLP. By breaking down language into its
components—phonetics, morphology, syntax, semantics, and pragmatics—NLP techniques can
process and analyze text to enable meaningful interactions between computers and humans.
Despite the challenges, advancements in NLP continue to improve the capabilities of
applications like chatbots, translation services, and voice assistants, making human-computer
communication more natural and effective.
Main Approaches of NLP with Live Examples
Natural Language Processing (NLP) is a field that focuses on the interaction between
computers and humans through natural language. The main approaches to NLP include
rule-based methods, machine learning-based methods, and deep learning-based methods. Let's
explore each approach with simple explanations and live examples.
1. Rule-Based Approaches
Rule-based approaches rely on a set of predefined linguistic rules crafted by experts. These
rules can include grammar rules, dictionaries, and patterns to analyze and process text.
Example: Spam Email Detection
Imagine you want to filter out spam emails. You can create rules such as:
- If the email contains phrases like "win money" or "claim your prize", mark it as spam.
- If the email is from an unknown sender with a suspicious domain, mark it as spam.
Live Example:
- Email: "Congratulations! You have won $1000. Click here to claim your prize."
- Rule Applied: Contains "claim your prize" -> Mark as spam.
Rule-based systems are simple and effective for specific tasks but can struggle with complex
language patterns and variations.
2. Machine Learning-Based Approaches
Machine learning-based approaches use algorithms to learn from data. Instead of manually
creating rules, these systems learn patterns from labeled examples.
Example: Sentiment Analysis
Sentiment analysis determines whether a piece of text expresses a positive, negative, or neutral
sentiment. Using machine learning, we can train a model on a dataset of labeled sentences
(positive, negative, neutral).
Live Example:
- Training Data:
- "I love this product!" -> Positive
- "This is the worst service ever." -> Negative
- "The event was okay." -> Neutral
- New Input: "I am so happy with my purchase!"
- Model Prediction: Positive
The model learns from the training data and can predict the sentiment of new, unseen
sentences.
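The learning step can be made concrete with a tiny hand-rolled Naive Bayes classifier trained on the three example sentences. This is a sketch at toy scale, with add-one (Laplace) smoothing so that unseen words do not zero out a class; real systems train on thousands of labeled examples:

```python
import math
import re
from collections import Counter, defaultdict

TRAIN = [
    ("I love this product!", "Positive"),
    ("This is the worst service ever.", "Negative"),
    ("The event was okay.", "Neutral"),
]

def words(text):
    return re.findall(r"\w+", text.lower())

word_counts = defaultdict(Counter)   # per-class word frequencies
class_counts = Counter()             # sentences per class
vocab = set()
for text, label in TRAIN:
    class_counts[label] += 1
    for w in words(text):
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for label in class_counts:
        # log P(class) + sum over words of log P(word | class), smoothed
        score = math.log(class_counts[label] / len(TRAIN))
        total = sum(word_counts[label].values())
        for w in words(text):
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("I am so happy with my purchase!"))  # Positive
```

With only three training sentences the decision rests on very little evidence, which is exactly why the text above stresses labeled data: more examples give the model more reliable word statistics.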
3. Deep Learning-Based Approaches
Deep learning-based approaches use neural networks with multiple layers to model complex
patterns in data. These methods are particularly powerful for tasks like language translation,
speech recognition, and text generation.
Example: Language Translation
Deep learning models, such as the Transformer model, can translate text from one language to
another by learning from large datasets of parallel sentences (sentences in two languages that
mean the same thing).
Live Example:
- Training Data:
- English: "How are you?" -> French: "Comment ça va ?"
- English: "Good morning!" -> Spanish: "¡Buenos días!"
- New Input: "Thank you very much."
- Model Output: "Muchas gracias." (Spanish translation)
Deep learning models can capture more nuanced language patterns and produce more
accurate translations than rule-based or basic machine learning models.
NLP has evolved significantly, from rule-based approaches to machine learning and deep
learning. Each approach has its strengths and weaknesses, but together they contribute to
powerful applications like spam detection, sentiment analysis, and language translation. By
understanding these methods, we can appreciate how NLP enables computers to process and
understand human language, making our interactions with technology more seamless and
natural.
Knowledge in Speech and Language Processing
Definition: Knowledge in speech and language processing refers to the information and rules
that systems use to interpret, process, and generate human speech and text. This includes
understanding grammar, context, vocabulary, and the rules of a language.
Live Example: Voice Assistants
- Scenario: You ask your voice assistant, "What's the weather like today?"
- Knowledge Used:
- Understanding Words: The assistant needs to recognize the words "weather" and "today."
- Contextual Understanding: It understands that "weather" refers to atmospheric conditions like temperature, precipitation, etc.
- Information Retrieval: It accesses the internet to find current weather data for your location.
- Response: "Today, it's sunny with a high of 25°C."
Here, the assistant uses its knowledge of language to interpret your question and provide a
meaningful answer.
AMBIGUITY
Definition: Ambiguity occurs when a word, phrase, or sentence has multiple meanings, making it
difficult to determine the exact interpretation without additional context.
Types of Ambiguity:
- Lexical Ambiguity: When a single word has multiple meanings.
- Syntactic Ambiguity: When a sentence can be parsed in multiple ways due to its structure.
- Semantic Ambiguity: When a sentence has multiple interpretations based on meaning.
- Pragmatic Ambiguity: When the context of the conversation leads to multiple interpretations.
Live Example: "I saw the man with the telescope."
- Lexical Ambiguity: The word "saw" could mean to perceive with eyes or to cut with a tool
(though context usually clears this up).
- Syntactic Ambiguity:
- Interpretation 1: You used a telescope to see the man.
- Interpretation 2: The man had a telescope, and you saw him.
- Disambiguation: Additional context or information is needed to determine the correct meaning.
Without more context, it's unclear which interpretation is correct. Disambiguation can be
achieved through further conversation or additional information.
MODELS AND ALGORITHMS
Definition: Models and algorithms in NLP are the mathematical frameworks and procedures
used to process and analyze language data.
- Models: These are trained on data to recognize patterns and make predictions.
- Algorithms: Step-by-step procedures or formulas for solving a problem, which are used to train
and apply the models.
Live Example: Spam Email Detection
- Algorithm: Naive Bayes
- Training Phase: The algorithm is trained on a dataset of emails labeled as spam or not spam.
It learns the probability of certain words or phrases appearing in spam emails versus non-spam
emails.
- Model: The trained Naive Bayes model.
- Application Phase: When a new email arrives, the model analyzes it and assigns a probability
score indicating whether it's likely to be spam based on the learned patterns.
- Process:
- Input: "Congratulations! You have won $1000. Click here to claim your prize."
- Model Analysis: The model checks for phrases like "won $1000" and "click here," which are
common in spam.
- Output: The email is marked as spam because it matches the learned patterns.
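The application phase can be sketched as follows. The per-word probabilities below are invented toy values standing in for what the training phase would have learned from a labeled corpus; only the scoring mechanics are the point:

```python
import math

# P(word | spam) and P(word | ham) for a few indicative words.
# These numbers are illustrative, not learned from real data.
P_WORD = {
    "won":     {"spam": 0.05,  "ham": 0.001},
    "prize":   {"spam": 0.04,  "ham": 0.001},
    "click":   {"spam": 0.06,  "ham": 0.005},
    "meeting": {"spam": 0.001, "ham": 0.03},
}
P_CLASS = {"spam": 0.4, "ham": 0.6}
DEFAULT = 0.01   # probability assumed for words not in the table

def classify(email):
    words = email.lower().replace("!", " ").replace(".", " ").split()
    scores = {}
    for label in P_CLASS:
        # Naive Bayes: log prior plus log likelihood of each word.
        score = math.log(P_CLASS[label])
        for w in words:
            score += math.log(P_WORD.get(w, {}).get(label, DEFAULT))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("Congratulations! You have won $1000. "
               "Click here to claim your prize."))  # spam
```

Because "won", "click", and "prize" are far more likely under the spam class, the spam log-score dominates and the email is flagged, mirroring the worked example above.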
FORMAL LANGUAGE AND NATURAL LANGUAGE
Definition: Formal language is a set of strings of symbols that follow specific grammatical rules,
often used in computer science and mathematics. Natural language is the language used by
humans for everyday communication, which is more flexible and context-dependent.
Characteristics:
- Formal Language:
- Strict rules and syntax.
- Predictable and unambiguous.
- Used in programming languages, mathematics, and logic.
- Natural Language:
- Flexible and often ambiguous.
- Context and culture influence meaning.
- Used in everyday human communication.
Live Example:
- Formal Language: Mathematical expressions.
- Expression: "3 + 4 = 7"
- Rules: Numbers and operators follow mathematical syntax. The meaning is clear and
unambiguous.
- Natural Language: Conversational English.
- Sentence: "Three plus four equals seven."
- Flexibility: The sentence can be understood in different contexts. For example, a math class
or casual conversation.
REGULAR EXPRESSIONS AND AUTOMATA
Definition: Regular expressions (regex) are sequences of characters that define search
patterns, typically used for string matching. Automata are abstract machines used to recognize
patterns defined by regular expressions.
Regular Expressions:
- Purpose: Used for searching, matching, and manipulating text.
- Syntax: Specific patterns that represent sets of strings.
- For example, the regex `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$` matches valid
email addresses.
Automata:
- Types: Finite automata, pushdown automata, Turing machines, etc.
- Function: Process input strings and determine if they match certain patterns.
- Finite Automaton: An automaton with a finite number of states that can recognize patterns
defined by regular expressions.
Live Example: Validating an Email Address
- Regular Expression: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
- Pattern Explanation:
- `^[a-zA-Z0-9._%+-]+`: Starts with one or more alphanumeric characters, periods,
underscores, percent signs, plus signs, or hyphens.
- `@`: Followed by an "@" symbol.
- `[a-zA-Z0-9.-]+`: One or more alphanumeric characters, periods, or hyphens.
- `\.`: Followed by a period.
- `[a-zA-Z]{2,}$`: Ends with two or more alphabetic characters.
- Automaton: A finite state machine that processes the input string (email address) to determine
if it matches the regex pattern.
- Process:
- The automaton checks each part of the email step-by-step to confirm it meets the criteria.
- For the email "[email protected]", the automaton would verify each character and structure according to the regex rules.
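This check can be run directly with Python's `re` module, whose compiled patterns play the role of the finite automaton. The sample addresses below are made up for illustration:

```python
import re

# The email pattern from the section above; fullmatch requires the whole
# string to match, which is what the ^...$ anchors express.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(s):
    return EMAIL_RE.fullmatch(s) is not None

print(is_valid_email("alice.smith@example.com"))  # True
print(is_valid_email("not-an-email"))             # False
print(is_valid_email("user@domain"))              # False: no dot + TLD
```

Under the hood the regex engine does exactly what the automaton description says: it consumes the input one character at a time, transitioning between states, and accepts only if it ends in an accepting state with no input left over.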
By using these concepts, NLP enables computers to process and understand human language,
making interactions more intuitive and effective.
Unit 2
Text Pre-processing
Definition: Text pre-processing involves preparing and cleaning text data for analysis and
modeling in NLP. This step is crucial for converting raw text into a structured format that
algorithms can easily process.
Steps and Live Example:
1. Lowercasing: Converting all text to lowercase to ensure uniformity.
- Example: "The Cat Sat On The Mat." becomes "the cat sat on the mat."
2. Removing Punctuation: Eliminating punctuation marks that do not contribute to the meaning.
- Example: "Hello, world!" becomes "Hello world"
3. Removing Stop Words: Stop words are common words (like "is", "and", "the") that are usually
removed to focus on more meaningful words.
- Example: "The cat is on the mat" becomes "cat mat"
4. Stemming/Lemmatization: Reducing words to their base or root form.
- Stemming Example: "running", "runner" become "run"
- Lemmatization Example: "better" becomes "good"
5. Tokenization: Splitting text into individual words or tokens.
- Example: "The cat sat on the mat" becomes ["the", "cat", "sat", "on", "the", "mat"]
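The five steps above can be chained into one function using only the standard library. The stop-word list is a small illustrative subset, and the suffix-stripping "stemmer" is deliberately crude (real stemmers like Porter's are far more careful):

```python
import re
import string

STOP_WORDS = {"the", "is", "and", "a", "an", "on", "of", "are"}

def preprocess(text):
    text = text.lower()                                   # 1. lowercasing
    # 2. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                                 # 5. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop words
    # 4. crude suffix stripping as a stand-in for stemming
    tokens = [re.sub(r"(ing|s)$", "", t) for t in tokens]
    return tokens

print(preprocess("The cats are playing on the mats."))  # ['cat', 'play', 'mat']
```

Note that the order matters: lowercasing first makes the stop-word lookup reliable, and punctuation must go before tokenization so "mats." and "mats" are not treated as different tokens.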
TOKENIZATION
Definition: Tokenization is the process of breaking down text into smaller units called tokens,
typically words or phrases.
Live Example:
- Input Sentence: "The quick brown fox jumps over the lazy dog."
- Tokenization:
- Word Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
- Sentence Tokenization: If the input contained multiple sentences, such as "The quick brown fox jumps over the lazy dog. It was a sunny day.", it would break into ["The quick brown fox jumps over the lazy dog.", "It was a sunny day."]
Feature Extraction from Text
Definition: Feature extraction involves converting text into numerical features that can be used
by machine learning models.
Methods and Live Example:
1. Bag of Words (BoW): Represents text as a set of words and their frequencies.
- Example: "The cat sat on the mat."
- BoW Representation: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
2. Term Frequency-Inverse Document Frequency (TF-IDF): Weighs words by their importance in
a document relative to a collection of documents.
- Example: In a corpus, the word "the" appears frequently in many documents, so its TF-IDF
score is lower. "Cat" might have a higher score if it's less common.
3. Word Embeddings: Represent words in a continuous vector space where semantically similar
words are closer together.
- Example: Using pre-trained embeddings like Word2Vec, "king" and "queen" would have
similar vectors.
Morphology: Inflectional and Derivational
Morphology studies the structure of words. Inflectional morphology deals with grammatical variations of words, while derivational morphology involves creating new words from base forms.
Live Example:
1. Inflectional Morphology: Changes the form of a word to express different grammatical features (tense, number, etc.).
- Example: "cat" (singular) -> "cats" (plural)
2. Derivational Morphology: Creates new words by adding prefixes or suffixes.
- Example: "happy" -> "happiness" (adding "-ness" to form a noun)
Finite State Morphological Parsing
Definition: Finite state morphological parsing uses finite state machines to analyze and generate
word forms based on their morphological structure.
Live Example:
- Word: "cats"
- Parsing: The finite state machine might identify "cat" as the root and "s" as the plural suffix.
Finite State Transducer
Definition: A finite state transducer (FST) is a type of automaton used for processing and
transforming strings, useful in tasks like morphological analysis and machine translation.
Live Example:
- Input: "cats"
- Transformation: An FST could transform "cats" into "cat + PLURAL", indicating the root word
and its grammatical feature.
Part of Speech Tagging
Definition: Part of Speech (POS) tagging assigns grammatical categories (nouns, verbs,
adjectives, etc.) to each word in a sentence.
Rule-Based POS Tagging
Definition: Uses predefined linguistic rules to assign POS tags to words.
Live Example:
- Sentence: "The cat sat on the mat."
- Tagging Rules:
- "The" -> Determiner (DT)
- "cat" -> Noun (NN)
- "sat" -> Verb (VBD)
- "on" -> Preposition (IN)
- "the" -> Determiner (DT)
- "mat" -> Noun (NN)
Stochastic POS Tagging
Definition: Uses probabilistic models, such as Hidden Markov Models (HMMs), to assign POS
tags based on the likelihood of sequences of tags.
Live Example:
- Sentence: "The cat sat on the mat."
- Tagging Process:
- The model uses probabilities learned from a tagged corpus to predict that "cat" is likely a
noun (NN) and "sat" is likely a past tense verb (VBD).
Transformation-Based Tagging
Definition: Also known as Brill tagging, it starts with an initial tagging and iteratively refines it
using transformation rules.
Live Example:
- Sentence: "The cat sat on the mat."
- Initial Tagging: Might start with a simple rule-based or stochastic tagging.
- Transformation Rules: Apply rules to correct mistakes, such as changing a tag if the context
indicates a different POS.
- Example Rule: If a word is tagged as a verb but is preceded by a determiner, change the tag
to noun.
Text Pre-processing: In-Depth Example
1. Lowercasing:
- Original Text: "Natural Language Processing is FUN!"
- Processed Text: "natural language processing is fun!"
2. Removing Punctuation:
- Original Text: "Hello, world! Welcome to NLP."
- Processed Text: "Hello world Welcome to NLP"
3. Removing Stop Words:
- Original Text: "This is an example of text preprocessing."
- Processed Text: "example text preprocessing"
4. Stemming/Lemmatization:
- Original Text: "The cats are playing with the toys."
- Stemmed Text: "the cat are play with the toy"
- Lemmatized Text: "the cat be play with the toy" (lemmatization maps "are" to its dictionary form "be", which stemming does not)
5. Tokenization:
- Original Text: "The quick brown fox jumps over the lazy dog."
- Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Feature Extraction from Text: In-Depth Example
1. Bag of Words (BoW):
- Text: "The cat sat on the mat. The dog sat on the mat."
- Vocabulary: ["the", "cat", "sat", "on", "mat", "dog"]
- BoW Representation:
- Sentence 1: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1, "dog": 0}
- Sentence 2: {"the": 2, "cat": 0, "sat": 1, "on": 1, "mat": 1, "dog": 1}
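These two vectors can be computed directly with `collections.Counter` over a shared, fixed-order vocabulary:

```python
from collections import Counter

docs = ["The cat sat on the mat.", "The dog sat on the mat."]
tokenized = [d.lower().replace(".", "").split() for d in docs]

# Shared vocabulary across the corpus, in a fixed (sorted) order.
vocab = sorted({w for doc in tokenized for w in doc})
vectors = [{w: Counter(doc)[w] for w in vocab} for doc in tokenized]

print(vocab)       # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # {'cat': 1, 'dog': 0, 'mat': 1, 'on': 1, 'sat': 1, 'the': 2}
```

Fixing the vocabulary order matters because downstream models expect every document to be represented over the same dimensions, including zero counts for absent words.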
2. TF-IDF:
- Corpus: ["The cat sat on the mat.", "The dog sat on the mat."]
- TF-IDF Calculation:
- "the": low weight (common in both sentences)
- "cat", "dog": higher weight (less common)
3. Word Embeddings:
- Text: "The king rules the kingdom."
- Embedding Example: Word2Vec generates vectors where "king" and "queen" have similar
values because of their semantic similarity.
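"Closer together" is usually measured with cosine similarity. The 3-dimensional vectors below are invented purely to illustrate the idea; real embeddings like Word2Vec have hundreds of dimensions learned from text:

```python
import math

# Hand-made toy "embeddings" (illustrative numbers, not learned).
VECTORS = {
    "king":  [0.8, 0.65, 0.1],
    "queen": [0.75, 0.7, 0.15],
    "apple": [0.1, 0.05, 0.9],
}

def cosine(u, v):
    # Cosine of the angle between u and v: 1 = same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(VECTORS["king"], VECTORS["queen"]))  # close to 1: similar words
print(cosine(VECTORS["king"], VECTORS["apple"]))  # much lower
```

Cosine similarity is preferred over raw distance for embeddings because it ignores vector length and compares only direction, which is where the semantic signal lives.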
FINITE STATE MORPHOLOGICAL PARSING
Definition: Finite state morphological parsing involves using finite state automata (FSA) to
analyze the structure of words and determine their morphological components, such as roots,
prefixes, suffixes, and infixes. It’s a method of breaking down and understanding the formation
and transformation of words.
Explanation:
- A finite state automaton (FSA) is a computational model used to perform tasks like recognizing
patterns and processing strings.
- In morphological parsing, the FSA reads an input word and transitions through states
according to predefined rules, identifying morphological components.
Live Example:
Let's take the word "cats":
1. State 1: Start (S)
- The automaton starts here and looks for the root of the word.
2. State 2: Root detection (R)
- It recognizes "cat" as the root.
3. State 3: Suffix detection (Suffix)
- It identifies "s" as the suffix, which denotes the plural form.
Steps:
- Input: "cats"
- The FSA starts in the initial state (S).
- It reads "cat" and transitions to the root detection state (R).
- It reads "s" and transitions to the suffix state (Suffix).
Output:
- Root: "cat"
- Suffix: "s"
- Meaning: The word "cats" is a plural form of "cat".
This simple parsing helps in understanding how words are constructed and can be useful in
various NLP tasks like text analysis and generation.
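The three-state parse above can be encoded directly as a small state loop. The root lexicon is an illustrative assumption, and only the bare plural suffix "s" is handled:

```python
# Tiny sketch of the Start -> Root -> Suffix parse for regular plurals.
ROOTS = {"cat", "dog", "mat"}

def parse(word):
    state = "S"                       # State 1: start
    for root in ROOTS:
        if word.startswith(root):
            state = "R"               # State 2: root detected
            rest = word[len(root):]
            if rest == "s":
                state = "Suffix"      # State 3: plural suffix detected
                return {"root": root, "suffix": "s", "number": "plural"}
            if rest == "":
                return {"root": root, "suffix": None, "number": "singular"}
    return None                       # word not accepted by this automaton

print(parse("cats"))  # {'root': 'cat', 'suffix': 's', 'number': 'plural'}
```

A realistic morphological analyzer compiles a full lexicon and suffix rules (including irregular forms like "mice") into one large automaton, but the accept/reject logic is the same.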
Finite State Transducer (FST)
Definition: A finite state transducer (FST) is an extension of the finite state automaton. It maps
between two sets of symbols, effectively transforming input sequences into output sequences.
FSTs are widely used in tasks like morphological analysis, speech recognition, and machine
translation.
Explanation:
- An FST reads an input string and produces an output string while transitioning through states.
- Each state transition can produce output symbols based on the input symbols.
Live Example:
Let's consider transforming the word "cats" into its morphological components:
1. State 1: Start (S)
- Reads the input word and starts processing.
2. State 2: Root detection (R)
- Detects the root "cat" and outputs it.
3. State 3: Suffix detection (Suffix)
- Detects the suffix "s" and outputs the feature "+PLURAL".
Steps:
- Input: "cats"
- The FST starts in the initial state (S).
- It reads "cat" and transitions to the root detection state (R), outputting "cat".
- It reads "s" and transitions to the suffix state (Suffix), outputting "+PLURAL".
Output:
- Transformed Output: "cat +PLURAL"
- Meaning: The word "cats" is transformed to indicate it is the plural form of "cat".
This transformation process is crucial for understanding word forms and is applied in various
NLP applications like translation and morphological analysis.
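What distinguishes an FST from a plain automaton is that each transition emits output as well as consuming input. The character-level transition table below encodes only the "cat(+s)" path, which is enough to show the input/output pairing:

```python
# (state, input symbol) -> (next state, output string)
TRANSITIONS = {
    ("S", "c"):  ("S1", "c"),
    ("S1", "a"): ("S2", "a"),
    ("S2", "t"): ("R", "t"),              # root "cat" fully emitted
    ("R", "s"):  ("Suffix", " +PLURAL"),  # suffix consumed, feature emitted
}
FINAL_STATES = {"R", "Suffix"}

def transduce(word):
    state, output = "S", ""
    for ch in word:
        if (state, ch) not in TRANSITIONS:
            return None               # input rejected
        state, out = TRANSITIONS[(state, ch)]
        output += out
    # Accept only if we stop in a final state with no input left.
    return output if state in FINAL_STATES else None

print(transduce("cats"))  # "cat +PLURAL"
print(transduce("cat"))   # "cat"
```

Run in the other direction (output to input), the same table generates surface forms from analyses, which is why FSTs are popular for morphology: one machine covers both analysis and generation.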
Part of Speech Tagging
Definition: Part of Speech (POS) tagging involves assigning a part of speech (such as noun,
verb, adjective, etc.) to each word in a sentence. This is a fundamental step in many NLP tasks,
including parsing and text analysis.
Rule-Based POS Tagging
Definition: Rule-based POS tagging uses a set of handcrafted linguistic rules to assign POS
tags to words. These rules consider the context of words within sentences.
Live Example:
Let's tag the sentence: "The cat sat on the mat."
Rules:
1. Determiners: Words like "the", "a" are tagged as determiners (DT).
2. Nouns: Words following determiners are usually nouns (NN).
3. Verbs: Words indicating actions are tagged as verbs (VB).
4. Prepositions: Words like "on", "in" are tagged as prepositions (IN).
Steps:
- "The" -> DT (Determiner)
- "cat" -> NN (Noun)
- "sat" -> VB (Verb)
- "on" -> IN (Preposition)
- "the" -> DT (Determiner)
- "mat" -> NN (Noun)
Tagged Sentence:
- "The/DT cat/NN sat/VB on/IN the/DT mat/NN."
Rule-based taggers rely on these handcrafted rules and are effective for many cases, but they
may struggle with ambiguities and exceptions in language.
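The four rules above translate almost line-for-line into code. The word lists are a toy lexicon covering only this sentence, not real English coverage:

```python
DETERMINERS = {"the", "a", "an"}
PREPOSITIONS = {"on", "in", "at"}

def rule_based_tag(tokens):
    tags = []
    for i, word in enumerate(tokens):
        w = word.lower()
        if w in DETERMINERS:
            tags.append("DT")            # Rule 1: determiners
        elif w in PREPOSITIONS:
            tags.append("IN")            # Rule 4: prepositions
        elif i > 0 and tags[-1] == "DT":
            tags.append("NN")            # Rule 2: noun after a determiner
        else:
            tags.append("VB")            # Rule 3: otherwise assume a verb
    return list(zip(tokens, tags))

print(rule_based_tag(["The", "cat", "sat", "on", "the", "mat"]))
```

The "otherwise assume a verb" fallback is exactly the kind of brittle default that makes rule-based taggers fail on sentences outside their rule set, motivating the stochastic approach next.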
Stochastic POS Tagging
Definition: Stochastic POS tagging uses probabilistic models, such as Hidden Markov Models
(HMMs), to assign POS tags based on the likelihood of sequences of tags. It learns from a
corpus of tagged text.
Live Example:
Let's tag the sentence: "The dog barks loudly."
Training:
- The model is trained on a tagged corpus, learning probabilities for tag sequences.
- Example probabilities: P(NN|DT), P(VB|NN), etc.
Tagging Process:
1. Initial Probabilities: P(DT|START) for "The".
2. Transition Probabilities: P(NN|DT) for "dog".
3. Emission Probabilities: P(barks|VB), P(loudly|RB).
Steps:
- "The" -> DT (Highest initial probability)
- "dog" -> NN (Highest probability after DT)
- "barks" -> VB (Highest probability after NN)
- "loudly" -> RB (Highest probability after VB)
Tagged Sentence:
- "The/DT dog/NN barks/VB loudly/RB."
Stochastic taggers can handle ambiguities better by considering probabilities from training data,
making them more robust than rule-based taggers.
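The "highest probability" steps above are usually computed with the Viterbi algorithm. The sketch below decodes the example sentence over hand-set probabilities; the numbers are invented so the example is reproducible, not learned from a real corpus:

```python
import math

STATES = ["DT", "NN", "VB", "RB"]
START = {"DT": 0.8, "NN": 0.1, "VB": 0.05, "RB": 0.05}
TRANS = {   # P(next tag | current tag), rows sum to 1
    "DT": {"NN": 0.9, "VB": 0.05, "DT": 0.025, "RB": 0.025},
    "NN": {"VB": 0.6, "NN": 0.2, "DT": 0.1, "RB": 0.1},
    "VB": {"RB": 0.4, "DT": 0.3, "NN": 0.2, "VB": 0.1},
    "RB": {"DT": 0.4, "NN": 0.2, "VB": 0.2, "RB": 0.2},
}
EMIT = {    # P(word | tag) for the words we care about
    "DT": {"the": 0.7},
    "NN": {"dog": 0.4},
    "VB": {"barks": 0.3},
    "RB": {"loudly": 0.3},
}
SMALL = 1e-6    # emission probability for unseen word/tag pairs

def viterbi(words):
    # best[tag] = (log probability, best tag sequence ending in tag)
    best = {t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], SMALL)), [t])
            for t in STATES}
    for w in words[1:]:
        new = {}
        for t in STATES:
            prev = max(STATES, key=lambda p: best[p][0] + math.log(TRANS[p][t]))
            score = (best[prev][0] + math.log(TRANS[prev][t])
                     + math.log(EMIT[t].get(w, SMALL)))
            new[t] = (score, best[prev][1] + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks", "loudly"]))  # ['DT', 'NN', 'VB', 'RB']
```

Working in log space avoids numeric underflow from multiplying many small probabilities, and keeping only the best path into each state is what makes Viterbi efficient.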
Transformation-Based Tagging (Brill Tagging)
Definition: Transformation-based tagging, also known as Brill tagging, starts with an initial
tagging (often from a stochastic or rule-based tagger) and iteratively applies transformation
rules to correct errors.
Live Example:
Let's tag the sentence: "Time flies like an arrow."
Initial Tagging (e.g., using a simple stochastic tagger):
- "Time/NN flies/NNS like/IN an/DT arrow/NN."
Transformation Rules:
1. If a word tagged as a plural noun (NNS) follows a noun (NN) and the sentence has no other verb, change its tag to a verb (VBZ).
Steps:
- Apply Rule 1: "Time/NN flies/NNS" -> "Time/NN flies/VBZ".
- "like/IN an/DT arrow/NN" -> No change.
Final Tagged Sentence:
- "Time/NN flies/VBZ like/IN an/DT arrow/NN."
Brill taggers refine the initial tagging through transformation rules, which improves accuracy by
leveraging learned corrections from training data.
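The transformation mechanism itself is simple to sketch: start from an initial tagging and rewrite a tag wherever a context pattern matches. The single rule encoded below (NNS to VBZ after NN) is one plausible correction for this sentence; a real Brill tagger learns an ordered list of such rules from a corpus:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    # Change `from_tag` to `to_tag` whenever the previous word's tag
    # is `prev_tag` (a context-triggered transformation).
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("Time", "NN"), ("flies", "NNS"), ("like", "IN"),
           ("an", "DT"), ("arrow", "NN")]

# Rule: NNS -> VBZ when preceded by NN ("Time flies" needs a verb).
corrected = apply_rule(initial, "NNS", "VBZ", "NN")
print(corrected)
```

Training a Brill tagger amounts to repeatedly picking, from a template of candidate rules like this one, whichever rule fixes the most remaining errors on the training corpus.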
Conclusion
Understanding these concepts and examples is essential for grasping the foundations of NLP.
Finite state morphological parsing and transducers help in understanding and transforming word
structures, while POS tagging (rule-based, stochastic, and transformation-based) is crucial for
grammatical analysis and further NLP applications. These methodologies work together to
enable computers to process and understand human language effectively.