NLP Techniques Before Deep Learning

The document outlines a structured approach to Natural Language Processing (NLP) before the advent of deep learning, detailing essential themes such as text handling, linguistic features, vectorization, and classical NLP tasks. It provides step-by-step instructions for implementing various techniques, including information extraction, topic modeling, and information retrieval, using tools like Python and libraries such as NLTK and spaCy. Additionally, it includes evaluation metrics and capstone project ideas to apply the learned concepts effectively.


NLP (Before Deep Learning)

Part   Theme                         Outcome

1      Setup                         Python, NLTK/spaCy, scikit-learn, gensim

2      Text Handling                 Clean, normalize, tokenize text

3      Linguistic Features           POS, lemmas, chunks, regex patterns

4      Vectorization                 Bag-of-Words, n-grams, TF-IDF

5      Classical Tasks               Spam/sentiment, intent with linear models

6      Information Extraction        Rules/CRF for NER; regex for dates/emails

7      Topic Modeling                LDA & evaluation (coherence)

8      Information Retrieval         BM25 search mini-engine

9      Evaluation & Error Analysis   F1/ROC/PR; error buckets

10     Capstones & Checklists        End-to-end small projects

Part 1 — Setup
Use a simple and reproducible toolchain. No GPUs required.

• Python ≥3.10; libraries: numpy, pandas, scikit-learn, nltk, spacy, gensim, regex, matplotlib.

• Create a virtual environment; pin versions in a requirements.txt file.

• Download a small English model for spaCy and NLTK resources.


pip install numpy pandas scikit-learn nltk spacy gensim regex matplotlib rank-bm25
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

Part 2 — Text Handling


Turn raw text into clean tokens you can analyze.

• Unicode normalization; lowercasing rules; punctuation & numbers handling.

• Sentence splitting vs. tokenization (whitespace, spaCy tokenizer).

• Stop-words; stemming vs. lemmatization (trade-offs).

• Regex recipes: emails, URLs, dates, hashtags.

Step-by-step:
1 Load a small corpus (movie reviews, product comments).

2 Write a reusable clean(text) function.

3 Tokenize with spaCy; compare to simple split().

4 Create a unit test that guards your clean() behavior.


import re, unicodedata

def clean(text):
    text = unicodedata.normalize("NFKC", text)       # Unicode normalization
    text = re.sub(r"https?://\S+", " ", text)        # strip URLs
    text = re.sub(r"[^A-Za-z0-9@#'\s]", " ", text)   # keep word chars, @, #, '
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text
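Step 4 above asks for a unit test that guards clean(). A minimal sketch with plain asserts; the expected strings assume the normalization rules shown in the snippet above:

```python
import re
import unicodedata

def clean(text):
    # Same pipeline as the clean() function above
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z0-9@#'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def test_clean():
    # URLs are stripped, punctuation removed, whitespace collapsed
    assert clean("Visit https://example.com now!") == "Visit now"
    assert clean("  Hello,\n\tWorld!  ") == "Hello World"
    # NFKC folds full-width characters to their ASCII equivalents
    assert clean("ＡＢＣ") == "ABC"

test_clean()
```

Running this as part of a pytest suite keeps later refactors of clean() from silently changing tokenizer input.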

Part 3 — Linguistic Features


Use rule-based and statistical linguistic features without deep nets.

• Part-of-speech (POS) tags and lemmas via spaCy small models.

• Chunking: noun & verb phrases; dependency basics for patterns.

• Custom pattern matchers (spaCy Matcher) for intents/keywords.

Step-by-step:
1 Extract lemmas & POS; build frequency tables per class.

2 Create simple phrase patterns (e.g., 'want to *', 'error when *').

3 Use features (counts/ratios) in a small classifier.
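The pattern step can be sketched with spaCy's Matcher. A blank pipeline is enough for token-level attributes like LOWER, so no model download is needed; the pattern names are illustrative:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; sufficient for token patterns
matcher = Matcher(nlp.vocab)

# 'want to <word>' and 'error when' style patterns from the steps above
matcher.add("WANT_TO", [[{"LOWER": "want"}, {"LOWER": "to"}, {"IS_ALPHA": True}]])
matcher.add("ERROR_WHEN", [[{"LOWER": "error"}, {"LOWER": "when"}]])

doc = nlp("I want to cancel my order. I get an error when saving.")
matches = [(nlp.vocab.strings[mid], doc[s:e].text) for mid, s, e in matcher(doc)]
print(matches)
```

Match counts per pattern (or per class) then become features for the small classifier in step 3.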

Part 4 — Vectorization
Represent documents numerically with simple, effective methods.

• Bag-of-Words; character/word n-grams.

• TFIDF weighting; vocabulary pruning (min_df, max_df).

• Feature scaling and leakage pitfalls.

Step-by-step:
1 Build TFIDF with bigrams; inspect top features by class.

2 Try character 3–5 grams for robustness to misspellings.

3 Export a fitted vectorizer to reuse at inference time.


from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)
X = vec.fit_transform(corpus)
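Steps 2 and 3 might look like the sketch below; the toy corpus and filename are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

corpus = ["free prize winner", "win a free prize now", "fre prize winer",
          "meeting moved to friday", "see you at the meeting"]

# analyzer="char_wb" builds character n-grams inside word boundaries,
# so 'winner' and the misspelled 'winer' still share most 3-5 grams
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=1)
X = char_vec.fit_transform(corpus)

# Persist the *fitted* vectorizer; inference must reuse the same vocabulary
joblib.dump(char_vec, "tfidf_char.joblib")
loaded = joblib.load("tfidf_char.joblib")
print(loaded.transform(["free prize"]).shape[1] == X.shape[1])
```

Loading the pickled vectorizer at inference time avoids the leakage pitfall of re-fitting on new data.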

Part 5 — Classical NLP Tasks


Supervised text tasks using linear models and Naive Bayes.

• Sentiment/spam/intent classification with Logistic Regression, Linear SVM, or Multinomial NB.

• Model selection via cross-validation; class weights and calibration.

• Saving pipelines with joblib; simple CLI or FastAPI for inference.

Step-by-step:
1 Split dataset (train/val/test); stratify by label.

2 Pipeline: TFIDF → LogisticRegression; tune C and n-gram range.

3 Evaluate on test; plot confusion matrix; write a README.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(text_train, y_train)
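Tuning C and the n-gram range (step 2) can be done with a cross-validated grid search over the pipeline; the toy texts below stand in for your train split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labelled data (1 = spam-like, 0 = normal); replace with your train split
texts = ["free prize now", "win cash fast", "claim your prize",
         "meeting at noon", "see you friday", "lunch next week"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("lr", LogisticRegression(max_iter=1000))])

# Pipeline step names prefix the parameter grid keys
grid = GridSearchCV(pipe,
                    {"lr__C": [0.1, 1.0, 10.0],
                     "tfidf__ngram_range": [(1, 1), (1, 2)]},
                    cv=3, scoring="f1")
grid.fit(texts, labels)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, each fold re-fits TF-IDF on its own training portion, which avoids leakage during model selection.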

Part 6 — Information Extraction (IE)


Pull structured facts from text without deep learning.

• Regex & rule-based extractors for emails, IDs, prices, dates.

• Dictionary & gazetteer matching (e.g., product catalogs, locations).

• Sequence labeling with CRFs (e.g., sklearn-crfsuite) for NER without neural nets.

Step-by-step:

1 Write patterns & tests for key entities in your domain.

2 Add post-processing: validation (e.g., date ranges), normalization (ISO formats).

3 Optionally train a CRF for PER/ORG/LOC using classic features.
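Steps 1 and 2 can be sketched with stdlib regex plus datetime validation; the dd/mm/yyyy format here is an assumption for illustration:

```python
import re
from datetime import date

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # assumes dd/mm/yyyy

def extract(text):
    emails = EMAIL.findall(text)
    dates = []
    for d, m, y in DATE.findall(text):
        try:
            # Validation: date() rejects impossible dates like 31/02;
            # normalization: isoformat() yields ISO 8601 (yyyy-mm-dd)
            dates.append(date(int(y), int(m), int(d)).isoformat())
        except ValueError:
            continue
    return {"emails": emails, "dates": dates}

out = extract("Contact ops@example.com before 05/03/2024; ignore 31/02/2024.")
print(out)  # {'emails': ['ops@example.com'], 'dates': ['2024-03-05']}
```

Unit tests over edge cases (leap days, malformed addresses) keep such extractors honest as patterns grow.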

Part 7 — Topic Modeling


Unsupervised exploration of themes.

• LDA with gensim; choose k by coherence and interpretability.

• Preprocessing choices (stops, bigrams) change topic quality.

• Label topics and build a tiny report.

Step-by-step:
1 Build a dictionary & corpus; train LDA for several k values.

2 Compute coherence; pick k and show top words per topic.

3 Assign dominant topic to each document and review samples.

Part 8 — Information Retrieval (IR)


Build a simple search engine and evaluate it.

• Inverted index idea; BM25 ranking using rank-bm25 package.

• Query preprocessing; query expansion with synonyms.

• IR evaluation: precision, recall, MAP.

Step-by-step:
1 Index a small corpus; implement BM25 search.

2 Create a few queries with relevance labels.

3 Measure precision and conduct manual error analysis.

Part 9 — Evaluation & Error Analysis


Measure what matters and improve systematically.

• Metrics: accuracy, precision/recall, F1, ROC-AUC; calibration curves.

• Data issues: duplicates, imbalance, leakage, domain shift.

• Slice analysis (by length, POS patterns, misspellings).

Step-by-step:
1 Compute precision/recall/F1; tune decision thresholds.

2 Create slices and compare metrics; document top 3 failure modes.

3 Write actions to fix the top failures (data, features, rules).
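Threshold tuning (step 1) works on predicted probabilities rather than hard labels; the scores below are illustrative stand-ins for classifier output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])  # e.g. predict_proba

# Metrics at the default 0.5 threshold
p, r, f1, _ = precision_recall_fscore_support(
    y_true, (y_prob >= 0.5).astype(int), average="binary")
print("default f1:", round(f1, 3))

# Sweep all candidate thresholds and pick the one maximizing F1
prec, rec, thr = precision_recall_curve(y_true, y_prob)
f1s = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = thr[np.argmax(f1s[:-1])]  # f1s[:-1] aligns with thr
print("best threshold:", best)
```

Tune the threshold on validation data only; picking it on the test set is a form of leakage.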

Part 10 — Capstones & Checklists


A) Spam Filter (Email/Comments)
• Collect & clean data; label small set.

• TFIDF + Logistic Regression; evaluate F1.

• Ship a CLI script to classify new messages.

B) FAQ Intent Classifier


• Define 8–12 intents; gather examples.

• Pipeline with TFIDF + Linear SVM; export model with joblib.

• Write a README with examples & decision policy.

C) Rule-based NER for Dates/Amounts


• Write regex & validators; unit tests for edge cases.

• Normalize outputs to ISO formats.

• Measure precision/recall on a labelled sample.

D) Mini Search Engine


• Index docs; implement BM25 queries.

• Add highlight of matched terms; evaluate precision@5.

• Create a small HTML report of results.
