Sentiment Analysis in Python with Keras 3: A Complete 2026 Guide

Olatunji Azeez
May 14, 2026

If you have ever wondered how apps automatically detect whether a customer review is glowing or scathing, you are about to find out. In this guide, you will build a full sentiment classification pipeline — from raw text all the way to a trained convolutional neural network — using the modern Keras 3 API running on TensorFlow.

By the end, you will understand:

  • How to turn raw sentences into numerical representations a model can learn from

  • Why learned word embeddings outperform simple word counts

  • How to build, train, and evaluate a 1D CNN for text classification

  • How to package preprocessing directly inside your model for clean, production-ready inference


Prerequisites

You should be comfortable with Python and have a basic understanding of what machine learning is trying to do. You do not need to be a neural-network expert — everything will be explained step by step.

Install the required packages before you begin:

bash

pip install tensorflow keras pandas scikit-learn numpy

This guide targets Keras 3 (bundled with TensorFlow 2.16+). Verify your versions:

python

import tensorflow as tf
import keras

print(tf.__version__)   # e.g. 2.18.0
print(keras.__version__)  # e.g. 3.4.1

The Dataset

We will work with the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository. It contains 3,000 sentences drawn from three sources — Amazon product reviews, IMDb movie reviews, and Yelp restaurant reviews — each labelled 1 (positive) or 0 (negative).

Download it from: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

Unzip it into a data/sentiment/ folder, then load everything with pandas:

python

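import csv
import pandas as pd

sources = {
    "yelp":   "data/sentiment/yelp_labelled.txt",
    "amazon": "data/sentiment/amazon_cells_labelled.txt",
    "imdb":   "data/sentiment/imdb_labelled.txt",
}

frames = []
for name, path in sources.items():
    # QUOTE_NONE stops stray double quotes inside reviews from being treated as
    # field quoting, which can otherwise merge several lines into a single row
    df = pd.read_csv(path, sep="\t", names=["text", "label"], quoting=csv.QUOTE_NONE)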
    df["source"] = name
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
print(data.head())
                              text  label source
0       Wow... Loved this place.      1   yelp
1  Crust is not good.               0   yelp
2  Not tasty and the texture was...  0   yelp

Each row is a short sentence paired with a binary sentiment label. Simple structure, real-world messiness.


Establishing a Baseline

Before reaching for a neural network, it is always worth setting a baseline — a simple, interpretable model that tells you how much the more complex approach actually helps.

Bag-of-Words Representation

The most direct way to represent text numerically is the bag-of-words (BOW) model. You build a vocabulary of every unique word in your corpus and represent each sentence as a vector of word counts. Word order is discarded; only word frequency matters.
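To make the idea concrete, here is a minimal hand-rolled sketch of bag-of-words counting that uses only the standard library. The two sentences and the helper name `bow_vector` are made up for illustration, and splitting on whitespace is a simplification rather than a description of what any library does internally.

```python
from collections import Counter

# Toy corpus, invented for illustration only
corpus = ["good food good service", "service was not good"]

# Vocabulary: every unique word, sorted so each word gets a stable column
vocab = sorted({word for sentence in corpus for word in sentence.split()})

def bow_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(s, vocab) for s in corpus]
print(vocab)
for v in vectors:
    print(v)
```

Shuffling the words inside either sentence would leave its vector unchanged, which is exactly the word-order information this representation throws away.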

scikit-learn's CountVectorizer handles this in a few lines:

python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Work with just the Yelp subset first
yelp = data[data["source"] == "yelp"]
X_raw = yelp["text"].values
y = yelp["label"].values

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.25, random_state=42
)

vectorizer = CountVectorizer()
vectorizer.fit(X_train_raw)          # learn vocabulary from training set only

X_train = vectorizer.transform(X_train_raw)
X_test  = vectorizer.transform(X_test_raw)

clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)
print(f"Logistic Regression accuracy: {clf.score(X_test, y_test):.4f}")
Logistic Regression accuracy: 0.7960

Almost 80% accuracy from bag-of-words features and logistic regression is a solid starting point. The question now is how much a neural network can add on top of it.


Modern Text Preprocessing with Keras 3

The old Keras workflow relied on a standalone Tokenizer utility that lived outside the model, so its vocabulary had to be serialised and shipped separately. The recommended replacement in Keras 3 is TextVectorization, a proper preprocessing layer that lives inside your model, travels with it when saved, and eliminates the gap between training and serving behaviour.

How TextVectorization Works

When you call .adapt() on the layer, it scans your training corpus and constructs an integer vocabulary. From that point on, the layer maps each word to its integer index and pads or truncates sequences to a fixed length. One layer handles the combined responsibilities of Tokenizer, pad_sequences, and vocabulary bookkeeping.

python

import keras
from keras import layers
import numpy as np

MAX_VOCAB  = 10_000   # cap vocabulary size
SEQ_LENGTH = 100      # pad/truncate to this many tokens

vectorize_layer = layers.TextVectorization(
    max_tokens=MAX_VOCAB,
    output_mode="int",
    output_sequence_length=SEQ_LENGTH,
    standardize="lower_and_strip_punctuation",
)

# Adapt ONLY on training text — never on test data
vectorize_layer.adapt(X_train_raw)

# Sanity check
sample = vectorize_layer(["Absolutely loved this product!"])
print(sample.numpy())

The vocabulary now lives inside the layer. Pass a raw string in; get a padded integer sequence out.
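If you want to see the mechanics without any Keras machinery, the following pure-Python sketch mimics the three jobs described above: standardise, build a frequency-ranked vocabulary, and map text to a fixed-length integer sequence. It is an approximation, not the layer's actual implementation. The helper names are hypothetical, `string.punctuation` only roughly matches the characters the layer strips, and the alphabetical tie-breaking is an arbitrary choice. What it does mirror is the layer's convention in int mode of reserving index 0 for padding and index 1 for out-of-vocabulary words, both of which count toward max_tokens.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (approximates the layer's default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Most frequent words first; index 0 = padding, index 1 = out-of-vocabulary."""
    counts = Counter(w for t in texts for w in standardize(t).split())
    # Tie-breaking (alphabetical here) is an arbitrary choice for this sketch
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(text, vocab, seq_length):
    """Map words to indices, then truncate or right-pad with zeros."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(text).split()][:seq_length]
    return ids + [0] * (seq_length - len(ids))

toy_vocab = build_vocab(["Loved it!", "Loved the food.", "Not good."], max_tokens=10)
print(toy_vocab)
print(vectorize("Loved the pizza!", toy_vocab, seq_length=5))
```

The word "pizza" never appeared in the toy corpus, so it maps to index 1, and the two trailing zeros are padding.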


Building Your First Neural Network

The Sequential API

Keras models are assembled from layers. The Sequential API stacks layers into a linear chain — the right tool for classification tasks with a clear input-to-output flow.

python

EMBED_DIM = 64

model_dense = keras.Sequential([
    vectorize_layer,                            # raw text → integer sequence
    layers.Embedding(MAX_VOCAB, EMBED_DIM),    # integers → dense vectors
    layers.GlobalAveragePooling1D(),            # sequence → single vector
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),     # binary output
], name="dense_classifier")

model_dense.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

model_dense.summary()

Notice that vectorize_layer is the first layer. The model now accepts raw strings directly — no external preprocessing step required at inference time.

Train it:

python

history = model_dense.fit(
    X_train_raw, y_train,
    epochs=15,
    batch_size=32,
    validation_split=0.15,
    verbose=1,
)

Evaluate on the held-out test split:

python

loss, acc = model_dense.evaluate(X_test_raw, y_test, verbose=0)
print(f"Dense model test accuracy: {acc:.4f}")

You should see an improvement over logistic regression, typically landing around 83–85%. With only 1,000 Yelp sentences in total, expect the exact figure to shift with the random seed.


Word Embeddings: Learning Meaning from Context

The Embedding layer you used above might look like a minor detail, but it is one of the most consequential components in any NLP model. It deserves its own section.

Why Not One-Hot Vectors?

A one-hot vector represents each word as a vector with a single 1 and zeros everywhere else. A vocabulary of 10,000 entries means every word becomes a 10,000-dimensional sparse vector. Worse, all pairs of words are equally distant — "amazing" and "excellent" look as different as "amazing" and "tyre". The model cannot exploit the fact that similar words carry similar meaning.
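A quick NumPy check makes the "equally distant" point tangible. This is a small sketch with a three-word vocabulary chosen for illustration; the helper `distance` is a hypothetical name.

```python
import numpy as np

words = ["amazing", "excellent", "tyre"]
one_hot = np.eye(len(words))          # row i is the one-hot vector for words[i]

def distance(a, b):
    """Euclidean distance between the one-hot vectors of two words."""
    return float(np.linalg.norm(one_hot[words.index(a)] - one_hot[words.index(b)]))

print(distance("amazing", "excellent"))
print(distance("amazing", "tyre"))
```

Both calls print the same value, the square root of 2, because any two distinct one-hot vectors differ in exactly two positions. The representation carries no notion of similarity at all.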

Dense Embeddings

An embedding maps each word to a compact, dense vector of real numbers — 64 or 128 dimensions rather than 10,000. These vectors are learned during training. Words that appear in similar contexts end up near one another in this space. "Wonderful" and "fantastic" cluster together; "terrible" and "awful" cluster together elsewhere. Geometric proximity encodes semantic similarity.
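Cosine similarity is a common way to measure that proximity. The sketch below uses three hand-picked 3-dimensional vectors that are invented purely for illustration; they are not learned embeddings, and real ones have far more dimensions. The point is only to show how the comparison is computed.

```python
import numpy as np

# Hand-picked toy vectors, invented for illustration (not learned values)
toy_embeddings = {
    "wonderful": np.array([0.9, 0.8, 0.1]),
    "fantastic": np.array([0.8, 0.9, 0.2]),
    "terrible":  np.array([-0.9, -0.7, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1 = same direction, -1 = opposite direction."""
    va, vb = toy_embeddings[a], toy_embeddings[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(f"wonderful vs fantastic: {cosine('wonderful', 'fantastic'):.3f}")
print(f"wonderful vs terrible:  {cosine('wonderful', 'terrible'):.3f}")
```

With these toy values the first pair scores close to 1 and the second is negative, which is the kind of structure a trained embedding space is expected to develop on its own.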

Inside the Embedding Layer

python

# Shape perspective:
# input  (batch_size, SEQ_LENGTH)       — integer indices
# output (batch_size, SEQ_LENGTH, 64)   — dense vectors
emb = layers.Embedding(input_dim=MAX_VOCAB, output_dim=EMBED_DIM)

The embedding matrix is a trainable weight table of shape (vocab_size, embed_dim). Each forward pass is a fast integer-indexed lookup into that table.
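The lookup itself can be reproduced with plain NumPy indexing. In this sketch a small random matrix stands in for the learned weights, and the sizes are shrunk for readability; none of the values come from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4                      # tiny sizes for readability
table = rng.normal(size=(vocab_size, embed_dim))  # stands in for the learned weights

# Two padded sequences of token indices: shape (batch_size=2, seq_length=3)
token_ids = np.array([[2, 5, 0],
                      [3, 3, 1]])

vectors = table[token_ids]    # integer indexing performs the lookup
print(vectors.shape)
```

Each index is swapped for the matching row of the table, so a repeated token (the two 3s in the second sequence) yields two identical vectors.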

Using Pretrained Embeddings (GloVe)

Training embeddings from scratch on a small dataset has limits. A smarter approach is to initialise the embedding matrix with vectors pretrained on a massive external corpus, then fine-tune from there.

GloVe (Global Vectors for Word Representation) offers freely available pretrained vectors. Download glove.6B.zip from https://nlp.stanford.edu/projects/glove/, extract it, and load the 100-dimensional version:

python

import numpy as np

GLOVE_PATH = "glove.6B.100d.txt"
EMBED_DIM  = 100   # must match the GloVe file; overrides the earlier value of 64

# Build a word → vector lookup table
glove_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word  = parts[0]
        vec   = np.array(parts[1:], dtype="float32")
        glove_index[word] = vec

# Retrieve the vocabulary the TextVectorization layer built
vocab = vectorize_layer.get_vocabulary()

# Construct an embedding matrix aligned to that vocabulary
embedding_matrix = np.zeros((len(vocab), EMBED_DIM))
for idx, word in enumerate(vocab):
    vec = glove_index.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec

print(f"Vocabulary size:        {len(vocab)}")
print(f"Embedding matrix shape: {embedding_matrix.shape}")

Pass that matrix to the embedding layer:

python

pretrained_embedding = layers.Embedding(
    input_dim=len(vocab),
    output_dim=EMBED_DIM,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=True,   # allow fine-tuning on your specific data
)

With trainable=True the model can adapt the GloVe vectors toward your domain. For very small datasets, try trainable=False to avoid overfitting — benchmark both settings.
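Before benchmarking, it is worth checking how much of your vocabulary GloVe actually covers, because every word it misses keeps an all-zero row in the embedding matrix built above. The helper below is a hypothetical sketch of that check. The toy vocabulary and toy GloVe dictionary exist only so the snippet runs without the download; in practice you would pass in `vocab` and `glove_index`.

```python
def glove_coverage(vocab, glove_index):
    """Return the share of real vocabulary words found in glove_index, plus the misses."""
    # Skip the reserved padding ("") and out-of-vocabulary ("[UNK]") entries
    words = [w for w in vocab if w not in ("", "[UNK]")]
    missing = [w for w in words if w not in glove_index]
    covered = 1.0 - len(missing) / len(words) if words else 0.0
    return covered, missing

# Toy stand-ins so the sketch runs without the GloVe download
toy_vocab = ["", "[UNK]", "good", "food", "yummmy"]
toy_glove = {"good": [0.1], "food": [0.2]}

covered, missing = glove_coverage(toy_vocab, toy_glove)
print(f"{covered:.0%} covered; missing: {missing}")
```

A low coverage figure is a hint that trainable=True may matter more, since the missing words have nothing useful to start from.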


Convolutional Neural Networks for Text

Convolutional Neural Networks made their name in image recognition, but 1D convolutions are remarkably effective for text. The intuition is clean: a filter slides across a sequence of word vectors looking for local patterns — bigrams, trigrams, short phrases — that signal sentiment.

A filter of width 3 examines three consecutive word vectors at once. Through training, filters learn to activate on phrases like "not worth buying" or "absolutely loved it" wherever those patterns appear in the sentence. The position does not matter; the pattern does.

The 1D CNN Architecture

python

model_cnn = keras.Sequential([
    vectorize_layer,
    layers.Embedding(MAX_VOCAB, EMBED_DIM),

    # Two stacked convolutions: the second operates on the first one's output,
    # so together they cover a window of 3 + 4 - 1 = 6 tokens
    layers.Conv1D(filters=128, kernel_size=3, activation="relu", padding="same"),
    layers.Conv1D(filters=128, kernel_size=4, activation="relu", padding="same"),

    # Global max-pooling selects the strongest activation across the sequence
    layers.GlobalMaxPooling1D(),

    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
], name="cnn_classifier")

model_cnn.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

GlobalMaxPooling1D takes the maximum value across the time dimension for each filter channel, producing a fixed-length summary of the most decisive patterns detected throughout the sentence.
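The combination of a sliding filter and global max-pooling is what makes the detector indifferent to position. Here is a stripped-down NumPy sketch of a single filter, with invented 2-dimensional "word vectors", no bias term, and no padding. It is simpler than the Conv1D layer in the model above, but it shows the same mechanism.

```python
import numpy as np

def conv1d_global_max(sequence, kernel):
    """Slide one filter over a (T, D) sequence, apply ReLU, keep the largest response."""
    width = kernel.shape[0]
    responses = [
        np.sum(sequence[t : t + width] * kernel)      # dot product with one window
        for t in range(sequence.shape[0] - width + 1)
    ]
    return float(np.max(np.maximum(responses, 0.0)))  # ReLU, then global max

# Invented word vectors and a width-3 filter tuned to the pattern
pattern = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
filler  = np.zeros((4, 2))
kernel  = pattern.copy()

early = np.vstack([pattern, filler])   # pattern at the start of the sentence
late  = np.vstack([filler, pattern])   # same pattern at the end

print(conv1d_global_max(early, kernel), conv1d_global_max(late, kernel))
```

Both sequences produce the same pooled value, because the max picks out the window where the filter matches best regardless of where that window sits.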

Train with early stopping:

python

early_stop = keras.callbacks.EarlyStopping(
    patience=3,
    restore_best_weights=True,
)

history_cnn = model_cnn.fit(
    X_train_raw, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.15,
    callbacks=[early_stop],
    verbose=1,
)

loss, acc = model_cnn.evaluate(X_test_raw, y_test, verbose=0)
print(f"CNN test accuracy: {acc:.4f}")

The EarlyStopping callback monitors validation loss by default and halts training once it has gone three epochs without improving, then restores the weights from the best epoch. This helps limit overfitting without manually guessing the right epoch count.
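The patience rule is easy to trace by hand. The function below is a simplified pure-Python sketch of that logic, not the callback's actual code. The function name and the validation-loss values are invented for illustration.

```python
def early_stopping_epochs(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1                                      # no improvement this epoch
            if wait >= patience:
                return epoch, best_epoch                   # stop; keep weights from best_epoch
    return len(val_losses) - 1, best_epoch                 # ran out of epochs without stopping

# Invented validation-loss curve: improves, then drifts upward
losses = [0.69, 0.55, 0.48, 0.50, 0.52, 0.51, 0.60]
stop_epoch, best_epoch = early_stopping_epochs(losses, patience=3)
print(stop_epoch, best_epoch)
```

On this curve training stops at epoch 5, three epochs after the best loss at epoch 2, and the final value in the list is never reached.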


Running All Sources Together

So far we only trained on Yelp reviews. Let's evaluate across all three domains:

python

results = {}

for source in data["source"].unique():
    subset = data[data["source"] == source]
    X = subset["text"].values
    y_all = subset["label"].values

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_all, test_size=0.25, random_state=42
    )

    # Fresh vectorization layer per source
    vl = layers.TextVectorization(
        max_tokens=MAX_VOCAB,
        output_mode="int",
        output_sequence_length=SEQ_LENGTH,
        standardize="lower_and_strip_punctuation",
    )
    vl.adapt(X_tr)

    m = keras.Sequential([
        vl,
        layers.Embedding(MAX_VOCAB, EMBED_DIM),
        layers.Conv1D(128, 3, activation="relu", padding="same"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    m.fit(
        X_tr, y_tr,
        epochs=20, batch_size=32, validation_split=0.15,
        callbacks=[keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)],
        verbose=0,
    )

    _, acc = m.evaluate(X_te, y_te, verbose=0)
    results[source] = acc
    print(f"{source:8s} → {acc:.4f}")

Typical output:

yelp     → 0.8360
amazon   → 0.8440
imdb     → 0.8640

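On Yelp, the CNN's 0.8360 clears the 0.7960 logistic regression baseline from earlier. That baseline was fitted on Yelp only, so a fair comparison for Amazon and IMDb needs the same bag-of-words baseline run on each of those subsets.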


Deploying for Inference: The End-to-End Model

Because TextVectorization lives inside the model, saving it produces a fully self-contained artefact that accepts raw strings at inference time. No external tokeniser, no separate preprocessing script to ship.

python

# Save
model_cnn.save("sentiment_model.keras")

# Load and predict on raw strings
loaded_model = keras.models.load_model("sentiment_model.keras")

test_sentences = [
    "This product exceeded all my expectations!",
    "Absolute waste of money. Broke after one day.",
    "Decent quality for the price.",
]

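# A plain Python list may be rejected as input in Keras 3, so pass a string tensor
# (tf is the TensorFlow module imported at the start of this guide)
predictions = loaded_model.predict(tf.constant(test_sentences))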
for sentence, score in zip(test_sentences, predictions):
    sentiment = "Positive" if score[0] >= 0.5 else "Negative"
    print(f"[{score[0]:.2f}] {sentiment}: {sentence}")
[0.92] Positive: This product exceeded all my expectations!
[0.06] Negative: Absolute waste of money. Broke after one day.
[0.61] Positive: Decent quality for the price.

Comparing All Approaches

| Approach                  | Typical Accuracy |
|---------------------------|------------------|
| Logistic Regression (BOW) | ~79–80%          |
| Dense Network + Embedding | ~83–85%          |
| 1D CNN + Embedding        | ~84–87%          |
| 1D CNN + GloVe Embeddings | ~85–88%          |

The exact numbers vary with the domain and random seed. On small datasets the gap can be modest. The real advantage of neural approaches becomes apparent when you scale to tens of thousands of labelled examples.


What Has Changed Since the Early Keras Days

If you have worked with older Keras tutorials, you will notice several differences:

TextVectorization replaces Tokenizer. The legacy Tokenizer utility (keras.preprocessing.text.Tokenizer in Keras 2) is deprecated and is no longer the recommended path. TextVectorization integrates into the model and is saved and restored automatically with it.

pad_sequences is no longer needed. Setting output_sequence_length on the TextVectorization layer handles padding and truncation transparently.

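Multi-backend Keras 3. Keras is now a standalone library (pip install keras) that runs on TensorFlow, JAX, or PyTorch. Switch backends via os.environ["KERAS_BACKEND"] = "jax" before importing keras. One caveat for this guide: TextVectorization processes strings through TensorFlow operations, so models that contain the layer, like the ones built here, should be run on the TensorFlow backend.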

.keras save format. model.save("model.keras") uses the native Keras format, which serialises custom and preprocessing layers more robustly than the older .h5 HDF5 format.

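EarlyStopping(restore_best_weights=True). This argument rolls the in-memory model back to the weights from its best epoch when training stops, which covers the common case. Add a ModelCheckpoint callback only if you also need the best weights written to disk during training.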


Taking It Further

This guide covered the fundamentals. Here are directions worth exploring next:

Bidirectional LSTMs. Replace the CNN with layers.Bidirectional(layers.LSTM(64)) to model sequential dependencies in both directions. Particularly useful when the order of words carries meaning that local n-gram filters might miss.

Transformer-based transfer learning. Libraries such as keras-hub and Hugging Face transformers let you fine-tune BERT or DistilBERT on your own labelled data. This is the standard approach for state-of-the-art results on custom classification tasks in 2026.

Larger datasets. The UCI Sentiment dataset contains 3,000 examples. The IMDB 50K dataset and the Stanford Sentiment Treebank are natural next steps if you want to test how your pipeline scales.

KerasTuner for systematic hyperparameter search. Replace hand-picked values for embedding dimension, filter counts, dropout rate, and learning rate with a principled search using keras_tuner.BayesianOptimization or keras_tuner.RandomSearch.


Wrapping Up

You have built a complete, modern text classification pipeline using Keras 3:

  1. Loaded and explored the UCI Sentiment dataset across three review domains

  2. Established a logistic regression baseline using bag-of-words features

  3. Replaced hand-crafted preprocessing with Keras TextVectorization

  4. Trained a dense network and a 1D CNN, both outperforming the baseline

  5. Initialised a pretrained GloVe embedding layer to give the model a semantic head start

  6. Saved a fully self-contained model that accepts raw strings at inference time

The patterns here — keep preprocessing inside the model, separate vocabulary learning from evaluation, always verify against a baseline — transfer directly to any text classification problem you encounter in practice.
