NLP — Before the "Interesting Stuff"
Most NLP articles jump straight to the fun parts: intent detection, embeddings, transformers, LLMs.
But before a model can figure out what a piece of text means, there's often a forgotten phase — the small preprocessing steps that shape the input.
This article covers:
- tokenization
- stopwords
- stemming vs lemmatization
- POS tagging
- NER
The real question: is this stuff still relevant, or just "old skool NLP"?
Language is messy. Models are literal.
People are very good at ignoring noise. Models, not so much.
We casually write: run, running, ran — car, cars, car's — filler words, inconsistent casing, punctuation everywhere.
Early NLP systems were built on rules, not context. Preprocessing wasn't about elegance — it was about survival.
The goal was simple: reduce chaos without throwing away meaning.
That goal hasn't changed. What has changed is when and how we do it.
The classic preprocessing steps
The examples below use NLTK, not because it's cutting-edge, but because it makes the ideas obvious.
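import nltk

# Recent NLTK releases look for the "_tab" variants of the tokenizer and
# chunker data, while older ones use the original names. Downloading both is harmless.
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("maxent_ne_chunker")
nltk.download("maxent_ne_chunker_tab")
nltk.download("words")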
text = "Apple is looking at buying U.K. startup for $1 billion — and it's not joking."
Tokenization
Breaking raw text into units a model can work with.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
If you mess this up, nothing downstream matters.
(Modern models still tokenize — they just do it at the subword level.)
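To make the subword point concrete, here is a minimal sketch of greedy longest-match splitting, the general idea behind WordPiece-style tokenizers. The vocabulary and the `##` continuation marker are assumptions made up for this example, not any real model's vocabulary, and real tokenizers handle many more edge cases.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this sketch.
TOY_VOCAB = {"run", "##ning", "##s", "token", "##ization", "##ize"}

print(subword_tokenize("running", TOY_VOCAB))       # ['run', '##ning']
print(subword_tokenize("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(subword_tokenize("xyz", TOY_VOCAB))           # ['[UNK]']
```

Note how "run" and "running" end up sharing the piece "run" without any stemming step: the tokenizer does some of the normalisation work itself.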
Lowercasing & light normalization
lower_tokens = [t.lower() for t in tokens]
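Boring, but it shrinks the vocabulary, which matters most for small models and count-based features. The trade-off is that case distinctions (Apple the company vs apple the fruit) disappear.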
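A quick way to see the effect on vocabulary size is to count distinct tokens before and after lowercasing. The token list below is a hand-written toy example, not output from the article's sentence.

```python
from collections import Counter

# Hand-picked tokens that differ only by casing (illustrative data).
sample_tokens = ["The", "the", "THE", "Apple", "apple", "Run", "run"]

vocab_raw = Counter(sample_tokens)
vocab_lower = Counter(t.lower() for t in sample_tokens)

print(len(vocab_raw))    # 7 distinct surface forms
print(len(vocab_lower))  # 3 after lowercasing
```

The same merge that shrinks the vocabulary also folds "Apple" into "apple", which is why lowercasing is usually skipped before NER.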
Stopword removal (use with intent)
Stopwords are extremely common words: the, is, and.
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
# isalpha() also drops punctuation, numbers, and any token containing non-letters, such as "u.k."
filtered = [t for t in lower_tokens if t.isalpha() and t not in stop_words]
Good for: TF-IDF, keyword extraction, topic modelling.
Risky for: transformer-based models, anything relying on syntax.
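Stopwords are often safe to drop… except when they aren't. Negations such as "not" are on NLTK's English list, so removing them can flip the meaning of a sentence.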
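Here is a small sketch of the "except when they aren't" case. The stopword set is a tiny hand-written subset used as an assumption for this example (NLTK's real list is much longer and does include "not"), and `drop_stopwords` is a hypothetical helper, not part of NLTK.

```python
# Illustrative subset of an English stopword list (assumption for this sketch).
STOPWORDS_SUBSET = {"the", "is", "a", "this", "not", "no"}


def drop_stopwords(sentence, stopwords=STOPWORDS_SUBSET):
    """Lowercase, split on whitespace, and remove stopwords."""
    return [w for w in sentence.lower().split() if w not in stopwords]


positive = drop_stopwords("This movie is good")
negative = drop_stopwords("This movie is not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the negation is gone

# One common workaround: keep negations out of the removal set.
keep_negations = STOPWORDS_SUBSET - {"not", "no"}
print(drop_stopwords("This movie is not good", keep_negations))
# ['movie', 'not', 'good']
```

For a sentiment classifier, the first two outputs being identical is exactly the failure to avoid.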
Stemming vs Lemmatization
Stemming (fast, blunt):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]
Lemmatization (slower, cleaner):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
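# Without a pos argument, lemmatize() treats every token as a noun,
# so verb forms such as "looking" are left unchanged.
lemmas = [lemmatizer.lemmatize(t) for t in filtered]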
Lemmatization returns dictionary forms rather than chopped stems, so it preserves meaning better, especially when paired with POS tagging.
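Pairing the two could look roughly like this. `pos_tag` returns Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects a WordNet part-of-speech letter ("n", "v", "a", "r"), so a small mapping is needed. Both helper names below are hypothetical, and the mapping is a simplification that falls back to nouns.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object exposing lemmatize(word, pos)."""
    return [
        lemmatizer.lemmatize(word.lower(), penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]


print(penn_to_wordnet("VBG"))  # v
print(penn_to_wordnet("NNS"))  # n
```

With the objects from this article, the call would be `lemmatize_tagged(pos_tag(tokens), WordNetLemmatizer())`. A verb such as "looking" should then come back as "look" instead of being left alone.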
POS tagging
Answers a simple but powerful question: what role is this word playing?
from nltk import pos_tag
pos = pos_tag(tokens)
"Book a flight" vs "read a book" — same token, different role, different intent.
Named Entity Recognition
NER extracts named entities such as people, places, organisations and dates.
from nltk import ne_chunk
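tree = ne_chunk(pos)  # pos is the output of pos_tag(tokens) above
entities = []
for subtree in tree:
    # Named entities are subtrees with a label; plain tokens are (word, tag) tuples.
    if hasattr(subtree, "label"):
        name = " ".join(leaf[0] for leaf in subtree.leaves())
        entities.append((name, subtree.label()))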
Even today, explicit NER is useful for: routing requests, normalising data, filtering PII, adding structure before storage or search.
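As a rough sketch of two of those uses, the functions below take entities in the same `(name, label)` shape as the loop above and use them for PII redaction and for routing. The function names, the queue names, and the sample data are all hypothetical; the labels PERSON and ORGANIZATION are ones NLTK's chunker emits.

```python
def redact_entities(text, entities, labels=frozenset({"PERSON"})):
    """Replace each entity whose label is in `labels` with a [LABEL] placeholder."""
    # Longest names first, so "John Smith" is replaced before a bare "John".
    for name, label in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        if label in labels:
            text = text.replace(name, f"[{label}]")
    return text


def route(entities):
    """Pick a (made-up) queue name from the entity labels that are present."""
    found = {label for _, label in entities}
    if "ORGANIZATION" in found:
        return "business"
    if "PERSON" in found:
        return "personal"
    return "general"


# Hand-written entities standing in for real NER output.
sample_text = "John Smith emailed Acme about the invoice."
sample_entities = [("John Smith", "PERSON"), ("Acme", "ORGANIZATION")]

print(redact_entities(sample_text, sample_entities))
# [PERSON] emailed Acme about the invoice.
print(route(sample_entities))
# business
```

Plain string replacement is the simplest thing that works here; a production system would more likely redact by character offsets to avoid touching unrelated matches.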
But don't modern models already handle this?
Yes — large pretrained transformers do a lot of this under the hood. They use subword tokenisation, encode syntax, and infer entities implicitly.
Which means blindly applying old preprocessing rules can actively make things worse:
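- removing stopwords strips the function words (not, to, of) that transformers use to read syntax and negation
- aggressive stemming produces fragments the model never saw in pretraining and merges unrelated words (the Porter stemmer reduces both "university" and "universe" to "univers")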
This is where people trip into thinking preprocessing is obsolete. It isn't. It's just no longer automatic.
When I'd still use these steps
Classic ML (TF-IDF, Naive Bayes, Logistic Regression, LDA)
Tokenisation, lowercasing, stopword removal, lemmatisation. Preprocessing often makes or breaks performance here.
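To see why preprocessing matters so much for this family of models, here is a bare-bones TF-IDF sketch. It uses relative term frequency and idf = log(N / df), which is one common variant among several; libraries such as scikit-learn apply smoothing and normalisation on top. The function name and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy corpus, already tokenized and lowercased.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]
scores = tf_idf(docs)
print(scores[0]["the"])  # 0.0, because "the" appears in every document
print(scores[0]["cat"] > scores[0]["sat"])  # True: rarer terms weigh more
```

Terms are matched as exact strings, so "Run", "run" and "running" would each get their own column unless they are normalised first. That is the whole reason lowercasing and lemmatisation pay off here.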
Small / domain-specific transformers
Minimal normalisation only. No stopword removal. NER as a separate step can still add a lot of value.
LLM-based pipelines
Avoid aggressive preprocessing. Use it for: filtering junk, normalising entities, reducing token count before embeddings, routing/metadata extraction.
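"Filtering junk" in this setting can be as light as the sketch below: collapse whitespace and drop blank or repeated lines before text goes to an embedding model, while leaving the wording itself untouched. The function name and sample input are hypothetical, and whitespace-separated word counts are only a rough proxy for model tokens.

```python
import re


def clean_for_embedding(text):
    """Collapse whitespace and drop empty or exactly repeated lines."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line or line in seen:
            continue  # skip blanks and exact duplicates (e.g. repeated footers)
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


raw = "Order   status:  shipped\n\n\nUnsubscribe here\nThanks!\nUnsubscribe here\n"
cleaned = clean_for_embedding(raw)

print(cleaned)
print(len(raw.split()), "->", len(cleaned.split()))  # 8 -> 6
```

Nothing here removes stopwords or rewrites words, so the model still sees natural sentences, just fewer wasted ones.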
Cost-, latency-, or explainability-sensitive systems
Fewer tokens = cheaper inference. Explicit features = easier debugging.
The actual takeaway
These preprocessing steps aren't "old NLP". They're tools for shaping the problem — not cleaning data for the sake of it.
Modern models changed how often we need them, not why they exist.
Understanding them makes you better at building NLP systems — even when you're using LLMs.