NLP (Before Deep Learning)
Part | Theme | Outcome
0 | Setup | Python, NLTK/spaCy, scikit-learn, gensim
1 | Text Handling | Clean, normalize, tokenize text
2 | Linguistic Features | POS, lemmas, chunks, regex patterns
3 | Vectorization | Bag-of-Words, n-grams, TF-IDF
4 | Classical Tasks | Spam/sentiment, intent with linear models
5 | Information Extraction | Rules/CRF for NER; regex for dates/emails
6 | Topic Modeling | LDA & evaluation (coherence)
7 | Information Retrieval | BM25 search mini-engine
8 | Evaluation & Error Analysis | F1/ROC/PR; error buckets
9 | Capstones & Checklists | End-to-end small projects
Part 1 — Setup
Use a simple and reproducible toolchain. No GPUs required.
• Python ≥3.10; libraries: numpy, pandas, scikit-learn, nltk, spacy, gensim, regex, matplotlib.
• Create a virtual environment; pin versions in [Link].
• Download a small English model for spaCy and NLTK resources.
pip install numpy pandas scikit-learn nltk spacy gensim regex matplotlib rank-bm25
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
Part 2 — Text Handling
Turn raw text into clean tokens you can analyze.
• Unicode normalization; lowercasing rules; punctuation & numbers handling.
• Sentence splitting vs. tokenization (whitespace, spaCy tokenizer).
• Stop-words; stemming vs. lemmatization (trade-offs).
• Regex recipes: emails, URLs, dates, hashtags.
Step-by-step:
1 Load a small corpus (movie reviews, product comments).
2 Write a reusable clean(text) function.
3 Tokenize with spaCy; compare to simple split().
4 Create a unit test that guards your clean() behavior.
import re, unicodedata

def clean(text):
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z0-9@#'\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
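Step 4 can be covered with a few plain assertions. A minimal sketch, repeating the clean() function so the example is self-contained:

```python
import re
import unicodedata

def clean(text):
    # Same pipeline as above: NFKC normalization, strip URLs, drop
    # characters outside A-Za-z0-9@#' and whitespace, squeeze spaces.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z0-9@#'\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def test_clean():
    # URLs removed, trailing punctuation stripped, whitespace squeezed.
    assert clean("Visit https://example.com now!") == "Visit now"
    # NFKC folds compatibility characters like the "fi" ligature.
    assert clean("\ufb01le") == "file"
    # Hashtags survive because '#' is whitelisted.
    assert clean("#nlp rocks!!") == "#nlp rocks"

test_clean()
```

Run it with pytest (or directly, as here); any future change to clean() that alters these behaviors will fail loudly.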
Part 3 — Linguistic Features
Use rule-based and statistical linguistic features without deep nets.
• Part-of-speech (POS) tags and lemmas via spaCy small models.
• Chunking: noun & verb phrases; dependency basics for patterns.
• Custom pattern matchers (spaCy Matcher) for intents/keywords.
Step-by-step:
1 Extract lemmas & POS; build frequency tables per class.
2 Create simple phrase patterns (e.g., 'want to *', 'error when *').
3 Use features (counts/ratios) in a small classifier.
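Step 1 boils down to per-class counters. A minimal sketch using hand-labeled toy (lemma, POS) pairs so it runs without a spaCy model; in practice the pairs would come from `token.lemma_` and `token.pos_` on a spaCy doc:

```python
from collections import Counter, defaultdict

# Toy documents already tagged as (lemma, POS) pairs, with a class label.
docs = [
    ("complaint", [("error", "NOUN"), ("when", "ADV"), ("save", "VERB")]),
    ("complaint", [("crash", "NOUN"), ("open", "VERB"), ("file", "NOUN")]),
    ("request",   [("want", "VERB"), ("to", "PART"), ("export", "VERB")]),
]

# Per-class frequency tables of lemmas and POS tags.
lemma_freq = defaultdict(Counter)
pos_freq = defaultdict(Counter)
for label, tokens in docs:
    for lemma, pos in tokens:
        lemma_freq[label][lemma] += 1
        pos_freq[label][pos] += 1

print(lemma_freq["complaint"].most_common(3))
print(pos_freq["complaint"]["NOUN"])  # 3
```

Ratios derived from these tables (e.g. nouns per token, verbs per token) make cheap, interpretable features for the small classifier in step 3.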
Part 4 — Vectorization
Represent documents numerically with simple, effective methods.
• Bag-of-Words; character/word n-grams.
• TF-IDF weighting; vocabulary pruning (min_df, max_df).
• Feature scaling and leakage pitfalls.
Step-by-step:
1 Build TF-IDF with bigrams; inspect top features by class.
2 Try character 3–5 grams for robustness to misspellings.
3 Export a fitted vectorizer to reuse at inference time.
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)
X = vec.fit_transform(corpus)
Part 5 — Classical NLP Tasks
Supervised text tasks using linear models and Naive Bayes.
• Sentiment/spam/intent classification with Logistic Regression, Linear SVM, or Multinomial NB.
• Model selection via cross validation; class weights and calibration.
• Saving pipelines with joblib; simple CLI or FastAPI for inference.
Step-by-step:
1 Split dataset (train/val/test); stratify by label.
2 Pipeline: TF-IDF → LogisticRegression; tune C and n-gram range.
3 Evaluate on test; plot confusion matrix; write a README.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(text_train, y_train)
Part 6 — Information Extraction (IE)
Pull structured facts from text without deep learning.
• Regex & rule-based extractors for emails, IDs, prices, dates.
• Dictionary & gazetteer matching (e.g., product catalogs, locations).
• Sequence labeling with CRFs (e.g., sklearn-crfsuite) for NER without neural nets.
Step-by-step:
1 Write patterns & tests for key entities in your domain.
2 Add post-processing: validation (e.g., date ranges), normalization (ISO formats).
3 Optionally train a CRF for PER/ORG/LOC using classic features.
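Steps 1 and 2 fit in a few lines for a single entity type. A minimal sketch for dates written as DD/MM/YYYY (the pattern and format are illustrative assumptions; swap in your domain's formats):

```python
import re
from datetime import date

# Hypothetical extractor for DD/MM/YYYY dates.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_dates(text):
    results = []
    for d, m, y in DATE_RE.findall(text):
        try:
            # Validation: datetime.date rejects impossible dates (31/02),
            # and isoformat() gives the ISO 8601 normalization of step 2.
            results.append(date(int(y), int(m), int(d)).isoformat())
        except ValueError:
            continue  # post-processing drops invalid matches
    return results

print(extract_dates("Due 05/03/2024, not 31/02/2024."))  # ['2024-03-05']
```

The same pattern-validate-normalize shape carries over to IDs, prices, and emails.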
Part 7 — Topic Modeling
Unsupervised exploration of themes.
• LDA with gensim; choose k by coherence and interpretability.
• Preprocessing choices (stops, bigrams) change topic quality.
• Label topics and build a tiny report.
Step-by-step:
1 Build a dictionary & corpus; train LDA for several k values.
2 Compute coherence; pick k and show top words per topic.
3 Assign dominant topic to each document and review samples.
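In practice step 2 is a call to gensim's CoherenceModel; to show what it computes, here is a from-scratch sketch of UMass coherence on a toy corpus of token sets (all names and data are illustrative):

```python
import math

# Toy corpus: each document as a set of tokens.
docs = [
    {"cat", "dog", "pet"},
    {"cat", "pet", "food"},
    {"dog", "pet"},
    {"stock", "market"},
]

def doc_freq(word):
    return sum(1 for d in docs if word in d)

def co_freq(w1, w2):
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(top_words):
    # UMass: sum over ordered word pairs of log((D(wi, wj) + 1) / D(wj)),
    # where D counts (co-)occurrences across documents.
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((co_freq(wi, wj) + 1) / doc_freq(wj))
    return score

# A tight topic scores higher (closer to 0) than a mixed one.
print(umass_coherence(["pet", "cat", "dog"]))
print(umass_coherence(["pet", "stock", "dog"]))
```

Comparing this score across several k values (averaged over topics) is the usual way to pick k in step 2.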
Part 8 — Information Retrieval (IR)
Build a simple search engine and evaluate it.
• Inverted index idea; BM25 ranking using the rank-bm25 package.
• Query preprocessing; query expansion with synonyms.
• IR evaluation: precision, recall, MAP.
Step-by-step:
1 Index a small corpus; implement BM25 search.
2 Create a few queries with relevance labels.
3 Measure precision and conduct manual error analysis.
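The rank-bm25 package handles step 1 out of the box; to make the ranking function concrete, here is a minimal from-scratch BM25 sketch (toy corpus and parameter values are illustrative):

```python
import math
from collections import Counter

# Tokenized toy corpus.
docs = [
    "the cat sat on the mat".split(),
    "dogs and cats are pets".split(),
    "the stock market fell today".split(),
]
k1, b = 1.5, 0.75                       # standard BM25 parameters
N = len(docs)
avgdl = sum(len(d) for d in docs) / N   # average document length
df = Counter(term for d in docs for term in set(d))  # document frequencies

def bm25_score(query, doc):
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

query = "cat mat".split()
ranked = sorted(range(N), key=lambda i: bm25_score(query, docs[i]), reverse=True)
print(ranked[0])  # doc 0 matches both query terms
```

Note that "cats" in doc 1 does not match the query term "cat": stemming or lemmatizing both queries and documents (query preprocessing, above) is what closes that gap.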
Part 9 — Evaluation & Error Analysis
Measure what matters and improve systematically.
• Metrics: accuracy, precision/recall, F1, ROC-AUC; calibration curves.
• Data issues: duplicates, imbalance, leakage, domain shift.
• Slice analysis (by length, POS patterns, misspellings).
Step-by-step:
1 Compute precision/recall/F1; tune decision thresholds.
2 Create slices and compare metrics; document top 3 failure modes.
3 Write actions to fix the top failures (data, features, rules).
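Step 1 is worth doing once by hand before reaching for sklearn.metrics. A minimal sketch of precision/recall/F1 plus threshold tuning (the toy labels and probabilities are illustrative):

```python
def prf1(y_true, y_pred, positive=1):
    # Confusion counts for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Threshold tuning: pick the probability cutoff that maximizes F1
# on a validation set instead of defaulting to 0.5.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
best = max((t / 10 for t in range(1, 10)),
           key=lambda t: prf1(y_true, [int(p >= t) for p in probs])[2])
print(best)  # 0.4
```

The same prf1 helper, applied per slice (short vs. long texts, misspelled vs. clean), gives the slice comparison in step 2.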
Part 10 — Capstones & Checklists
A) Spam Filter (Email/Comments)
• Collect & clean data; label small set.
• TF-IDF + Logistic Regression; evaluate F1.
• Ship a CLI script to classify new messages.
B) FAQ Intent Classifier
• Define 8–12 intents; gather examples.
• Pipeline with TF-IDF + Linear SVM; export model with joblib.
• Write a README with examples & decision policy.
C) Rule-based NER for Dates/Amounts
• Write regex & validators; unit tests for edge cases.
• Normalize outputs to ISO formats.
• Measure precision/recall on a labelled sample.
D) Mini Search Engine
• Index docs; implement BM25 queries.
• Add highlighting of matched terms; evaluate precision@5.
• Create a small HTML report of results.
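The precision@5 check in capstone D is a one-liner worth writing explicitly; a minimal sketch (the doc ids and relevance labels are hypothetical):

```python
def precision_at_k(retrieved, relevant, k=5):
    # Fraction of the top-k retrieved doc ids that are labelled relevant.
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

retrieved = [3, 7, 1, 9, 4, 2]   # ranked doc ids from the engine
relevant = {3, 1, 4}             # relevance labels for one query
print(precision_at_k(retrieved, relevant))  # 0.6
```

Averaging this over your labelled queries gives the single number to track in the HTML report.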