Sentiment Analysis in Python with Keras 3: A Complete 2026 Guide

If you have ever wondered how apps automatically detect whether a customer review is glowing or scathing, you are about to find out. In this guide, you will build a full sentiment classification pipeline — from raw text all the way to a trained convolutional neural network — using the modern Keras 3 API running on TensorFlow.
By the end, you will understand:
How to turn raw sentences into numerical representations a model can learn from
Why learned word embeddings outperform simple word counts
How to build, train, and evaluate a 1D CNN for text classification
How to package preprocessing directly inside your model for clean, production-ready inference
Prerequisites
You should be comfortable with Python and have a basic understanding of what machine learning is trying to do. You do not need to be a neural-network expert — everything will be explained step by step.
Install the required packages before you begin:
bash
pip install tensorflow keras pandas scikit-learn numpy
This guide targets Keras 3 (bundled with TensorFlow 2.16+). Verify your versions:
python
import tensorflow as tf
import keras
print(tf.__version__) # e.g. 2.18.0
print(keras.__version__) # e.g. 3.4.1
The Dataset
We will work with the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository. It contains 3,000 sentences drawn from three sources — Amazon product reviews, IMDb movie reviews, and Yelp restaurant reviews — each labelled 1 (positive) or 0 (negative).
Download it from: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
Unzip it into a data/sentiment/ folder, then load everything with pandas:
python
import pandas as pd
sources = {
    "yelp": "data/sentiment/yelp_labelled.txt",
    "amazon": "data/sentiment/amazon_cells_labelled.txt",
    "imdb": "data/sentiment/imdb_labelled.txt",
}
frames = []
for name, path in sources.items():
    df = pd.read_csv(path, sep="\t", names=["text", "label"])
    df["source"] = name
    frames.append(df)
data = pd.concat(frames, ignore_index=True)
print(data.head())

                               text  label source
0          Wow... Loved this place.      1   yelp
1                Crust is not good.      0   yelp
2  Not tasty and the texture was...      0   yelp

Each row is a short sentence paired with a binary sentiment label. Simple structure, real-world messiness.
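To make the file layout concrete, here is a minimal sketch of how one raw line maps to a (text, label) pair. It assumes the tab-separated "sentence, then label" layout implied by the read_csv call above; the helper name parse_labelled_line is hypothetical and only for illustration.

```python
def parse_labelled_line(line):
    """Split one 'sentence<TAB>label' line into (text, label)."""
    # Split from the right so a stray tab inside the sentence cannot
    # be mistaken for the separator in front of the label.
    text, label = line.rstrip("\n").rsplit("\t", 1)
    return text.strip(), int(label)


# A sample line in the same shape as the dataset files.
text, label = parse_labelled_line("Crust is not good.\t0\n")
print(repr(text), label)
```

In practice pandas does this work for you; the sketch only shows what each row contains.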
Establishing a Baseline
Before reaching for a neural network, it is always worth setting a baseline — a simple, interpretable model that tells you how much the more complex approach actually helps.
Bag-of-Words Representation
The most direct way to represent text numerically is the bag-of-words (BOW) model. You build a vocabulary of every unique word in your corpus and represent each sentence as a vector of word counts. Word order is discarded; only word frequency matters.
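To see the idea in miniature, here is a pure-Python sketch of bag-of-words counting. The two toy sentences and the helper name bag_of_words are assumptions made for illustration; real vectorizers apply their own tokenisation rules, so their output can differ in detail.

```python
from collections import Counter


def bag_of_words(sentences):
    """Return (vocabulary, count_vectors) for a list of sentences."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption;
    # real vectorizers also deal with punctuation and very short tokens).
    tokenised = [s.lower().split() for s in sentences]
    # A sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["good food good service", "service was slow"])
print(vocabulary)  # ['food', 'good', 'service', 'slow', 'was']
print(vectors)     # [[1, 2, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Because only counts survive, "good food good service" and "good service good food" would produce the same vector.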
scikit-learn's CountVectorizer handles this in a few lines:
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Work with just the Yelp subset first
yelp = data[data["source"] == "yelp"]
X_raw = yelp["text"].values
y = yelp["label"].values
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.25, random_state=42
)
vectorizer = CountVectorizer()
vectorizer.fit(X_train_raw) # learn vocabulary from training set only
X_train = vectorizer.transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)
print(f"Logistic Regression accuracy: {clf.score(X_test, y_test):.4f}")

Logistic Regression accuracy: 0.7960

Almost 80% accuracy from bag-of-words features and logistic regression is a solid starting point. The question now is how much a neural network can add on top.
Modern Text Preprocessing with Keras 3
The old Keras workflow relied on a standalone Tokenizer utility that lived outside the model, requiring careful manual serialisation. Keras 3 favours TextVectorization instead: a preprocessing layer that can live inside your model, travels with it when saved, and closes the gap between training and serving behaviour.
How TextVectorization Works
When you call .adapt() on the layer, it scans your training corpus and constructs an integer vocabulary. From that point on, the layer maps each word to its integer index and pads or truncates sequences to a fixed length. One layer handles the combined responsibilities of Tokenizer, pad_sequences, and vocabulary bookkeeping.
python
import keras
from keras import layers
import numpy as np
MAX_VOCAB = 10_000 # cap vocabulary size
SEQ_LENGTH = 100 # pad/truncate to this many tokens
vectorize_layer = layers.TextVectorization(
    max_tokens=MAX_VOCAB,
    output_mode="int",
    output_sequence_length=SEQ_LENGTH,
    standardize="lower_and_strip_punctuation",
)
# Adapt ONLY on training text, never on test data
vectorize_layer.adapt(X_train_raw)
# Sanity check
sample = vectorize_layer(["Absolutely loved this product!"])
print(sample.numpy())
The vocabulary now lives inside the layer. Pass a raw string in; get a padded integer sequence out.
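As a rough mental model of what adapting and calling the layer do in "int" mode, the pure-Python sketch below builds a frequency-ranked vocabulary and then pads or truncates each sequence. The helper names (standardize, adapt, vectorize) are hypothetical, the punctuation handling is only an approximation of lower_and_strip_punctuation, and the ordering of equally frequent words is an assumption; the real layer is implemented differently.

```python
import string
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # int mode reserves 0 for padding and 1 for unknown words


def standardize(text):
    """Lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def adapt(corpus, max_tokens):
    """Build a word -> id map from the most frequent words in the corpus."""
    counts = Counter(word for text in corpus for word in standardize(text).split())
    # Two slots are reserved, so at most max_tokens - 2 real words are kept.
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(kept, start=2)}


def vectorize(text, vocab, seq_length):
    """Map words to ids, then truncate or right-pad to seq_length."""
    ids = [vocab.get(word, OOV_ID) for word in standardize(text).split()]
    ids = ids[:seq_length]
    return ids + [PAD_ID] * (seq_length - len(ids))


vocab = adapt(["Loved this place!", "Did not love this place."], max_tokens=10)
print(vectorize("Loved this product!", vocab, seq_length=5))
```

The unseen word "product" maps to the unknown-word id 1, and the two trailing zeros are padding.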
Building Your First Neural Network
The Sequential API
Keras models are assembled from layers. The Sequential API stacks layers into a linear chain — the right tool for classification tasks with a clear input-to-output flow.
python
EMBED_DIM = 64
model_dense = keras.Sequential([
    vectorize_layer,                          # raw text → integer sequence
    layers.Embedding(MAX_VOCAB, EMBED_DIM),   # integers → dense vectors
    layers.GlobalAveragePooling1D(),          # sequence → single vector
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),    # binary output
], name="dense_classifier")
model_dense.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model_dense.summary()
Notice that vectorize_layer is the first layer. The model now accepts raw strings directly, so no external preprocessing step is required at inference time.
Train it:
python
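# Assumption: if your Keras build rejects NumPy object arrays of strings,
# pass tf.constant(X_train_raw) instead of X_train_raw.
history = model_dense.fit(
    X_train_raw, y_train,
    epochs=15,
    batch_size=32,
    validation_split=0.15,
    verbose=1,
)
Evaluate on the held-out test split: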
python
loss, acc = model_dense.evaluate(X_test_raw, y_test, verbose=0)
print(f"Dense model test accuracy: {acc:.4f}")
You should see an improvement over logistic regression, typically in the 83–86% range depending on the random seed.
Word Embeddings: Learning Meaning from Context
The Embedding layer you used above might look like a minor detail, but it is one of the most consequential components in any NLP model. It deserves its own section.
Why Not One-Hot Vectors?
A one-hot vector represents each word as a vector with a single 1 and zeros everywhere else. A vocabulary of 10,000 entries means every word becomes a 10,000-dimensional sparse vector. Worse, all pairs of words are equally distant — "amazing" and "excellent" look as different as "amazing" and "tyre". The model cannot exploit the fact that similar words carry similar meaning.
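The "equally distant" claim is easy to check by hand. The sketch below uses a hypothetical four-word vocabulary and measures Euclidean distance between one-hot vectors.

```python
import math


def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical four-word vocabulary, for illustration only.
vocab = {"amazing": 0, "excellent": 1, "tyre": 2, "awful": 3}
vectors = {word: one_hot(idx, len(vocab)) for word, idx in vocab.items()}

d_similar = euclidean(vectors["amazing"], vectors["excellent"])
d_unrelated = euclidean(vectors["amazing"], vectors["tyre"])
print(d_similar, d_unrelated)  # both equal sqrt(2)
```

Any two distinct one-hot vectors differ in exactly two positions, so the distance is sqrt(2) no matter which words you pick.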
Dense Embeddings
An embedding maps each word to a compact, dense vector of real numbers: 64 or 128 dimensions rather than 10,000. These vectors are learned during training. Words that the model needs to treat alike (and, in pretrained vectors such as GloVe, words that appear in similar contexts) tend to end up near one another in this space. "Wonderful" and "fantastic" cluster together; "terrible" and "awful" cluster together elsewhere. Geometric proximity encodes semantic similarity.
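One common way to measure proximity in such a space is cosine similarity. The sketch below uses hand-written three-dimensional vectors, not learned embeddings, purely to show how the comparison works.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors, NOT learned embeddings.
toy_embeddings = {
    "wonderful": [0.9, 0.8, 0.1],
    "fantastic": [0.8, 0.9, 0.2],
    "terrible": [-0.7, -0.9, 0.1],
}

close = cosine_similarity(toy_embeddings["wonderful"], toy_embeddings["fantastic"])
far = cosine_similarity(toy_embeddings["wonderful"], toy_embeddings["terrible"])
print(f"wonderful vs fantastic: {close:.2f}")
print(f"wonderful vs terrible:  {far:.2f}")
```

With real embeddings you would pull the vectors out of the trained Embedding layer's weight matrix instead of writing them by hand.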
Inside the Embedding Layer
python
# Shape perspective:
# input (batch_size, SEQ_LENGTH) — integer indices
# output (batch_size, SEQ_LENGTH, 64) — dense vectors
emb = layers.Embedding(input_dim=MAX_VOCAB, output_dim=EMBED_DIM)
The embedding matrix is a trainable weight table of shape (vocab_size, embed_dim). Each forward pass is a fast integer-indexed lookup into that table.
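The lookup itself can be imitated in plain NumPy. This is a sketch of the idea rather than of the Keras implementation: the table below holds random stand-in values, and the small sizes are chosen only to keep the shapes readable.

```python
import numpy as np

vocab_size, embed_dim = 6, 4
rng = np.random.default_rng(0)
# Stand-in for the trainable weight table; random values, not trained ones.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# A batch of two padded sequences of token ids (0 is padding).
token_ids = np.array([[2, 5, 3, 0],
                      [4, 2, 0, 0]])

# Integer-array indexing performs the whole lookup in one step.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (2, 4, 4): (batch_size, seq_length, embed_dim)
```

Every occurrence of the same token id receives the same row of the table, which is why training that row updates the word everywhere it appears.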
Using Pretrained Embeddings (GloVe)
Training embeddings from scratch on a small dataset has limits. A smarter approach is to initialise the embedding matrix with vectors pretrained on a massive external corpus, then fine-tune from there.
GloVe (Global Vectors for Word Representation) offers freely available pretrained vectors. Download glove.6B.zip from https://nlp.stanford.edu/projects/glove/, extract it, and load the 100-dimensional version:
python
import numpy as np
GLOVE_PATH = "glove.6B.100d.txt"
EMBED_DIM = 100
# Build a word → vector lookup table
glove_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        word = parts[0]
        vec = np.array(parts[1:], dtype="float32")
        glove_index[word] = vec
# Retrieve the vocabulary the TextVectorization layer built
vocab = vectorize_layer.get_vocabulary()
# Construct an embedding matrix aligned to that vocabulary
embedding_matrix = np.zeros((len(vocab), EMBED_DIM))
for idx, word in enumerate(vocab):
    vec = glove_index.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec
print(f"Vocabulary size: {len(vocab)}")
print(f"Embedding matrix shape: {embedding_matrix.shape}")
Pass that matrix to the embedding layer:
python
pretrained_embedding = layers.Embedding(
    input_dim=len(vocab),
    output_dim=EMBED_DIM,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=True, # allow fine-tuning on your specific data
)
To use it, put pretrained_embedding in place of the plain Embedding layer when you assemble a model. With trainable=True the model can adapt the GloVe vectors toward your domain. For very small datasets, try trainable=False to avoid overfitting, and benchmark both settings.
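Words that are missing from GloVe keep an all-zero row in the matrix built above, so it can be worth checking how much of your vocabulary is actually covered. The sketch below is one way to do that; glove_coverage is a hypothetical helper, and the toy vocabulary and index stand in for vectorize_layer.get_vocabulary() and glove_index.

```python
def glove_coverage(vocab, pretrained_index):
    """Return (fraction of real words found in the index, list of missing words)."""
    # Skip the reserved padding ("") and unknown ("[UNK]") entries.
    words = [w for w in vocab if w not in ("", "[UNK]")]
    missing = [w for w in words if w not in pretrained_index]
    covered = len(words) - len(missing)
    fraction = covered / len(words) if words else 0.0
    return fraction, missing


# Toy stand-ins for vectorize_layer.get_vocabulary() and glove_index.
toy_vocab = ["", "[UNK]", "the", "food", "yummmy", "great"]
toy_index = {"the": [0.1], "food": [0.2], "great": [0.3]}

fraction, missing = glove_coverage(toy_vocab, toy_index)
print(f"coverage: {fraction:.0%}, missing: {missing}")
```

Low coverage is a hint that much of the vocabulary starts from zeros, in which case keeping the layer trainable matters more.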
Convolutional Neural Networks for Text
Convolutional Neural Networks made their name in image recognition, but 1D convolutions are remarkably effective for text. The intuition is clean: a filter slides across a sequence of word vectors looking for local patterns — bigrams, trigrams, short phrases — that signal sentiment.
A filter of width 3 examines three consecutive word vectors at once. Through training, filters learn to activate on phrases like "not worth buying" or "absolutely loved it" wherever those patterns appear in the sentence. The position does not matter; the pattern does.
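The sliding-window idea can be sketched in a few lines of pure Python. This is an illustration under simplifying assumptions: a single hand-picked filter, two-dimensional toy word vectors, and no padding at the edges ("valid" convolution). A trained layer learns its filter weights and uses many filters at once.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors (no edge padding)."""
    width = len(kernel)
    activations = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Dot product between the window and the filter weights.
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        activations.append(max(0.0, total))  # ReLU
    return activations


# Hypothetical 2-dimensional word vectors for a five-word sentence.
sentence = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# One hand-picked filter of width 3.
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(conv1d_relu(sentence, kernel))  # [0.0, 2.0, 0.0]
```

The filter fires only on the middle window, where the three word vectors line up with its weights; the same weights are reused at every position.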
The 1D CNN Architecture
python
model_cnn = keras.Sequential([
    vectorize_layer,
    layers.Embedding(MAX_VOCAB, EMBED_DIM),
    # Two stacked convolutions: the second reads the first's output,
    # so together they can respond to patterns spanning up to 6 tokens
    layers.Conv1D(filters=128, kernel_size=3, activation="relu", padding="same"),
    layers.Conv1D(filters=128, kernel_size=4, activation="relu", padding="same"),
    # Global max-pooling selects the strongest activation across the sequence
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
], name="cnn_classifier")
model_cnn.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
GlobalMaxPooling1D takes the maximum value across the time dimension for each filter channel, producing a fixed-length summary of the most decisive patterns detected throughout the sentence.
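Here is a small pure-Python sketch of that pooling step. The activation values are made up, and global_max_pool is a hypothetical helper rather than the Keras implementation.

```python
def global_max_pool(feature_map):
    """feature_map: list of time steps, each a list with one value per filter."""
    # zip(*...) regroups the values by filter, then max() picks the strongest.
    return [max(channel) for channel in zip(*feature_map)]


# Hypothetical activations: 4 time steps, 3 filters.
short_sentence = [
    [0.1, 0.0, 0.7],
    [0.9, 0.2, 0.0],
    [0.3, 0.0, 0.1],
    [0.0, 0.4, 0.2],
]
# A longer sentence (6 time steps) still yields 3 values.
long_sentence = short_sentence + [[0.0, 1.5, 0.0], [0.2, 0.1, 0.0]]

print(global_max_pool(short_sentence))  # [0.9, 0.4, 0.7]
print(global_max_pool(long_sentence))   # [0.9, 1.5, 0.7]
```

The output length equals the number of filters regardless of sentence length, which is what lets a Dense layer follow directly.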
Train with early stopping:
python
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",  # the default, written out for clarity
    patience=3,
    restore_best_weights=True,
)
history_cnn = model_cnn.fit(
    X_train_raw, y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.15,
    callbacks=[early_stop],
    verbose=1,
)
loss, acc = model_cnn.evaluate(X_test_raw, y_test, verbose=0)
print(f"CNN test accuracy: {acc:.4f}")
The EarlyStopping callback monitors validation loss and halts training once it has gone three epochs (patience=3) without improving, then restores the weights from the best epoch. This limits overfitting without manually guessing the right epoch count.
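The patience logic is simple enough to simulate by hand. The sketch below is a simplified model of the behaviour described above, not the Keras source: it ignores options such as min_delta and baseline, and the validation losses are made up.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch) for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch (Keras would also keep its weights).
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training halts here
    return len(val_losses) - 1, best_epoch  # ran to completion


# Made-up validation losses: improvement stalls after epoch 3 (0-indexed).
losses = [0.69, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]
stopped, best = simulate_early_stopping(losses, patience=3)
print(f"stopped after epoch {stopped}, restoring weights from epoch {best}")
```

In this run training stops at epoch 6 and the model rolls back to epoch 3, so the final epoch of the 8 listed is never trained.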
Running All Sources Together
So far we have only trained on Yelp reviews. Let's train and evaluate a separate model for each of the three domains:
python
results = {}
for source in data["source"].unique():
    subset = data[data["source"] == source]
    X = subset["text"].values
    y_all = subset["label"].values
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_all, test_size=0.25, random_state=42
    )
    # Fresh vectorization layer per source
    vl = layers.TextVectorization(
        max_tokens=MAX_VOCAB,
        output_mode="int",
        output_sequence_length=SEQ_LENGTH,
        standardize="lower_and_strip_punctuation",
    )
    vl.adapt(X_tr)
    m = keras.Sequential([
        vl,
        layers.Embedding(MAX_VOCAB, EMBED_DIM),
        layers.Conv1D(128, 3, activation="relu", padding="same"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    m.fit(
        X_tr, y_tr,
        epochs=20, batch_size=32, validation_split=0.15,
        callbacks=[keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)],
        verbose=0,
    )
    _, acc = m.evaluate(X_te, y_te, verbose=0)
    results[source] = acc
    print(f"{source:8s} → {acc:.4f}")

Typical output:
yelp     → 0.8360
amazon   → 0.8440
imdb     → 0.8640

On Yelp, the CNN beats the 0.7960 logistic regression baseline we started with. For a like-for-like comparison on Amazon and IMDb, fit the same baseline on those subsets as well.
Deploying for Inference: The End-to-End Model
Because TextVectorization lives inside the model, saving it produces a fully self-contained artefact that accepts raw strings at inference time. No external tokeniser, no separate preprocessing script to ship.
python
# Save
model_cnn.save("sentiment_model.keras")
# Load and predict on raw strings
loaded_model = keras.models.load_model("sentiment_model.keras")
test_sentences = [
    "This product exceeded all my expectations!",
    "Absolute waste of money. Broke after one day.",
    "Decent quality for the price.",
]
# Wrap the list in a string tensor; Keras 3 may not accept a plain Python list here
predictions = loaded_model.predict(tf.constant(test_sentences))
for sentence, score in zip(test_sentences, predictions):
    sentiment = "Positive" if score[0] >= 0.5 else "Negative"
    print(f"[{score[0]:.2f}] {sentiment}: {sentence}")

[0.92] Positive: This product exceeded all my expectations!
[0.06] Negative: Absolute waste of money. Broke after one day.
[0.61] Positive: Decent quality for the price.

Comparing All Approaches
| Approach | Typical Accuracy |
|---|---|
| Logistic Regression (BOW) | ~79–80% |
| Dense Network + Embedding | ~83–85% |
| 1D CNN + Embedding | ~84–87% |
| 1D CNN + GloVe Embeddings | ~85–88% |
The exact numbers vary with the domain and random seed. On small datasets the gap can be modest. The real advantage of neural approaches becomes apparent when you scale to tens of thousands of labelled examples.
What Has Changed Since the Early Keras Days
If you have worked with older Keras tutorials, you will notice several differences:
TextVectorization replaces Tokenizer. The legacy Tokenizer utility is deprecated and is no longer the recommended path. TextVectorization integrates into the model and is saved and restored automatically with it.
pad_sequences is no longer needed. Setting output_sequence_length on the TextVectorization layer handles padding and truncation transparently.
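Multi-backend Keras 3. Keras is now a standalone library (pip install keras) that runs on TensorFlow, JAX, or PyTorch. Switch backends by setting os.environ["KERAS_BACKEND"] = "jax" before importing keras. One caveat: TextVectorization uses TensorFlow internally, so a model that embeds the layer, as the models in this guide do, needs the TensorFlow backend; on other backends, keep the layer in the input pipeline instead.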
.keras save format. model.save("model.keras") uses the native Keras format, which serialises custom and preprocessing layers more robustly than the older .h5 HDF5 format.
EarlyStopping(restore_best_weights=True). This argument rolls the model back to the weights from its best epoch, which removes the need for a separate ModelCheckpoint callback when all you want is the best weights in memory.
Taking It Further
This guide covered the fundamentals. Here are directions worth exploring next:
Bidirectional LSTMs. Replace the CNN with layers.Bidirectional(layers.LSTM(64)) to model sequential dependencies in both directions. Particularly useful when the order of words carries meaning that local n-gram filters might miss.
Transformer-based transfer learning. Libraries such as keras-hub and Hugging Face transformers let you fine-tune BERT or DistilBERT on your own labelled data. This is the standard approach for state-of-the-art results on custom classification tasks in 2026.
Larger datasets. The UCI Sentiment dataset contains 3,000 examples. The IMDB 50K dataset and the Stanford Sentiment Treebank are natural next steps if you want to test how your pipeline scales.
KerasTuner for systematic hyperparameter search. Replace hand-picked values for embedding dimension, filter counts, dropout rate, and learning rate with a principled search using keras_tuner.BayesianOptimization or keras_tuner.RandomSearch.
Wrapping Up
You have built a complete, modern text classification pipeline using Keras 3:
Loaded and explored the UCI Sentiment dataset across three review domains
Established a logistic regression baseline using bag-of-words features
Replaced hand-crafted preprocessing with Keras TextVectorization
Trained a dense network and a 1D CNN, both outperforming the baseline
Initialised a pretrained GloVe embedding layer to give the model a semantic head start
Saved a fully self-contained model that accepts raw strings at inference time
The patterns here — keep preprocessing inside the model, separate vocabulary learning from evaluation, always verify against a baseline — transfer directly to any text classification problem you encounter in practice.