Naive Bayes Explained

Beginner 9 min read

What you'll learn

✓ What Naive Bayes is and why it works
✓ The independence assumption
✓ How to compute class probabilities
✓ When Naive Bayes shines
✓ Common pitfalls and fixes

Prerequisites

• Basic probability
• Familiarity with classification problems

Naive Bayes is one of the oldest and most underrated classifiers in machine learning. It is fast, easy to implement, and often produces a strong baseline before you reach for anything more complex. This post explains how it works, where the name comes from, and how to actually use it on a real problem.

What it is and why use it

Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. Given some features, it estimates the probability of each class label and picks the most likely one. The word "naive" refers to its core assumption: that features are conditionally independent given the class. This is rarely true in real data, yet the model still performs remarkably well, especially on text.

You reach for Naive Bayes when you need something quick, interpretable, and resistant to overfitting on small datasets. Spam filters, document categorization, and sentiment analysis all use it as a starting point.

Mental model

Imagine each class as a separate bag of features. Training counts how often each feature appears inside each bag. At prediction time, you peek at the new sample and ask each bag: how surprised would you be to see these features? The bag that is least surprised wins.

Bayes' theorem gives this intuition a formal shape. The posterior probability of a class is proportional to the prior times the product of the feature likelihoods. Taking logs turns the product into a sum, which is what most implementations actually compute.
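In symbols, writing c for a class and x_1, ..., x_n for the features of a sample (notation introduced here purely for illustration), the proportionality and its log form can be written as:

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

\hat{c} \;=\; \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right]
```

The second line is the decision rule: the predicted class is the one with the largest log prior plus summed log likelihoods.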

Hands-on example

Suppose you want to classify short messages as spam or ham. You build a vocabulary, count how often each word appears in each class during training, and store those frequencies. At prediction time, you multiply each class's prior by the likelihood of every word in the new message given that class, then pick the class with the larger result.

Training:
messages -> tokenize -> count(word | class)
                               |
                               v
                     likelihood tables

Prediction:
new message -> tokenize -> for each class:
                             P(class) * prod P(word | class)
                           -> pick argmax

Naive Bayes classification flow
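The flow in the diagram can be sketched in plain Python. This is a minimal illustration, not a production implementation: the whitespace tokenizer, the function names, the add-one smoothing, and the four toy messages are all assumptions made for the example. It also sums log probabilities rather than multiplying raw ones.

```python
import math
from collections import Counter, defaultdict

def tokenize(message):
    # Deliberately simple tokenizer: lowercase and split on whitespace.
    return message.lower().split()

def train(messages, labels):
    """Count words per class and return what prediction needs."""
    class_counts = Counter(labels)        # number of messages per class
    word_counts = defaultdict(Counter)    # word_counts[label][word]
    vocabulary = set()
    for message, label in zip(messages, labels):
        for word in tokenize(message):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict(message, class_counts, word_counts, vocabulary):
    """Return the class with the highest log-posterior score."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label, n_messages in class_counts.items():
        score = math.log(n_messages / total_messages)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(message):
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data, invented for illustration only.
messages = [
    "win a free prize now",
    "free cash offer",
    "lunch at noon tomorrow",
    "see you at the meeting",
]
labels = ["spam", "spam", "ham", "ham"]

model = train(messages, labels)
print(predict("free prize offer", *model))  # spam
print(predict("meeting at noon", *model))   # ham
```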

In scikit-learn this is a few lines: a CountVectorizer to turn text into counts, then MultinomialNB to fit. On a few thousand labelled messages, accuracy in the high nineties is achievable when the spam patterns are clear-cut. The point is not perfection but the speed at which you get a usable baseline.
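A minimal sketch of those few lines might look like the following. The toy messages are invented for illustration and are far too few to say anything about accuracy. Wrapping the two steps in make_pipeline is a convenience chosen here, not something the approach requires.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
messages = [
    "win a free prize now",
    "free cash offer",
    "lunch at noon tomorrow",
    "see you at the meeting",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns text into word counts; MultinomialNB fits on them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

predictions = model.predict(["free prize offer", "meeting at noon"])
print(list(predictions))
```

On real data you would hold out a test set and measure accuracy there rather than eyeballing a couple of predictions.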

Trade-offs

The independence assumption is the obvious weakness. Words in real sentences are heavily correlated, so the probabilities Naive Bayes outputs are usually poorly calibrated, even when the predicted class is correct. If you need reliable confidence scores, you will want to calibrate the outputs separately.

Continuous features force a choice of distribution. Gaussian Naive Bayes assumes each feature is normally distributed per class, which often does not hold. Discretizing or transforming features can help.
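To make the Gaussian assumption concrete, here is a small sketch with one continuous feature. The feature (message length in characters), the numbers, and the equal class priors are all hypothetical, chosen only to show the mechanics: fit a mean and variance per class, then score a new value with the normal log density.

```python
import math
from statistics import mean, pvariance

def gaussian_log_pdf(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Hypothetical training values of a single continuous feature per class.
lengths = {
    "spam": [120.0, 135.0, 150.0],
    "ham": [30.0, 45.0, 60.0],
}

# "Training" is just estimating a mean and variance for each class.
params = {label: (mean(v), pvariance(v)) for label, v in lengths.items()}

def predict_by_length(x):
    # Equal priors are assumed, so only the likelihood term matters.
    scores = {
        label: gaussian_log_pdf(x, mu, var)
        for label, (mu, var) in params.items()
    }
    return max(scores, key=scores.get)

print(predict_by_length(50.0))   # ham
print(predict_by_length(140.0))  # spam
```

If the real feature is heavily skewed, the fitted mean and variance describe it poorly, which is where transforming or discretizing can help.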

It also struggles with feature interactions. If two features only matter when combined, Naive Bayes will miss the signal, because it scores each feature on its own. Tree models or gradient boosting typically outperform it in those cases.
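An XOR-style toy dataset, made up for this illustration, shows the problem in its purest form: the label depends only on the combination of the two features, so every per-feature likelihood comes out the same for both classes.

```python
# XOR-style toy data: the label is 1 only when exactly one feature is on.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def conditional(feature_index, value, label):
    """Estimate P(feature == value | label) by counting."""
    in_class = [x for x, y in samples if y == label]
    matches = sum(1 for x in in_class if x[feature_index] == value)
    return matches / len(in_class)

for feature_index in (0, 1):
    for label in (0, 1):
        p = conditional(feature_index, 1, label)
        print(f"P(x{feature_index}=1 | y={label}) = {p}")  # always 0.5
```

With identical likelihoods and equal priors, both classes get the same score for every input, so the model has nothing to separate them with.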

Practical tips

Use Laplace smoothing to avoid zero probabilities when a feature never co-occurs with a class in training. Most libraries default to this, but the alpha parameter, which sets how much smoothing is applied, is worth tuning.
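As a rough sketch, additive smoothing for word likelihoods can be written as a one-line formula. The function name and the counts below are hypothetical: a word never seen in a class that has 100 tokens, with a 50-word vocabulary.

```python
def smoothed_likelihood(count, total, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    alpha=1.0 is classic Laplace smoothing; alpha=0.0 disables smoothing.
    """
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical counts for a word that never appeared in this class.
print(smoothed_likelihood(0, 100, 50, alpha=0.0))  # 0.0
print(smoothed_likelihood(0, 100, 50, alpha=1.0))  # 1/150
print(smoothed_likelihood(0, 100, 50, alpha=0.1))  # smaller, but not zero
```

Without smoothing, that single zero would wipe out the whole product for the class, no matter what the other words say.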

Work in log space. Multiplying many small probabilities underflows quickly, so summing log probabilities is the standard implementation trick.
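The underflow is easy to reproduce. In this small demonstration the 100 identical probabilities of 1e-5 are an arbitrary choice standing in for a long message full of rare words.

```python
import math

# Arbitrary stand-in for the likelihoods of 100 rare words.
probabilities = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is far below what a
# 64-bit float can represent, so the result collapses to zero.
product = 1.0
for p in probabilities:
    product *= p
print(product)  # 0.0

# Summing logs keeps the same information in a comfortable range.
log_sum = sum(math.log(p) for p in probabilities)
print(log_sum)
```

Because the logarithm is monotonic, comparing log scores across classes picks the same winner as comparing the raw products would.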

For text, try both Multinomial and Bernoulli variants. The first uses word counts, the second uses presence or absence. Their accuracy often differs by a few percentage points on the same data.
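The difference is easiest to see in what each variant is fed. The two helper functions below are hypothetical and only illustrate the feature representations, not the full models. A real Bernoulli model also scores vocabulary words that are absent from the message, which this sketch leaves out.

```python
from collections import Counter

def multinomial_features(message):
    """Word counts: a repeated word counts once per occurrence."""
    return dict(Counter(message.lower().split()))

def bernoulli_features(message):
    """Presence only: every distinct word maps to 1."""
    return {word: 1 for word in set(message.lower().split())}

message = "free free free prize"
print(multinomial_features(message))  # {'free': 3, 'prize': 1}
print(bernoulli_features(message))    # 'free' and 'prize' both map to 1
```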

Always compare against a simple logistic regression. If logistic regression beats Naive Bayes by a wide margin, your features are probably strongly correlated, which the independence assumption cannot account for, and you should respect that.

Wrap-up

Naive Bayes is the duct tape of classifiers. It will not win Kaggle competitions, but it will give you a working model in minutes and a clear baseline against which fancier methods must justify their complexity. Keep it in your toolkit for the first pass on any new classification problem, especially when text is involved.