
Why the Physical Layout of Your Database Tables Matters More Than You Think

Azeez
November 16, 2025


Modern applications depend heavily on fast, reliable access to data. When we set out to improve the online grocery shopping experience for our users, one of our goals was to serve personalized product recommendations instantly. To do this efficiently, we pre-computed the top 200 recommended items for each customer and stored the results in a PostgreSQL OLTP database, expecting quick lookups at runtime.

But we soon ran into a major roadblock: simple queries on this recommendation table were taking far too long, sometimes close to a full second. With a table approaching 12GB and millions of rows, even a seemingly straightforward SELECT turned into a performance bottleneck.
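Sizes like these are easy to confirm directly from PostgreSQL itself (the table name below is the one from our schema):

```sql
-- Total on-disk size of the table, including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('personalized_recommendations'));

-- Planner's row-count estimate, without a full scan
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'personalized_recommendations';
```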


Understanding the Performance Problem

Our table structure was simple—each row contained a customer_id, a product_id, and an associated recommendation score. We created a B-tree index on customer_id to accelerate searches:


SELECT * FROM personalized_recommendations
WHERE customer_id = 1;
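For reference, the index backing this query would have been created along these lines (the index name is illustrative):

```sql
CREATE INDEX idx_recommendations_customer_id
    ON personalized_recommendations USING btree (customer_id);
```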


Even with this index, PostgreSQL reported execution times around 100 ms for some customers. Something wasn’t adding up.

To dig deeper, we analyzed the query plan. PostgreSQL was indeed using the index, but fetching the matching rows from the table heap still triggered a large number of scattered page reads. The deeper question became:

Why is the database performing so much random I/O even though we have a proper index?
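The standard tool for answering this is EXPLAIN with the ANALYZE and BUFFERS options, which shows both the chosen plan and how many buffers had to be read:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM personalized_recommendations
WHERE customer_id = 1;
```

A `Buffers: shared read` count that is high relative to the number of rows returned is a strong hint that the matching rows are spread over many heap pages.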


The Hidden Factor: Physical Row Placement

Indexes only tell PostgreSQL where matching rows are, not how close they are to each other on disk. To inspect actual storage locations, we looked at the ctid system column, which reveals the physical disk page and offset where each row is stored.
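A query along these lines makes the layout visible; ctid values are reported as (page, offset) pairs:

```sql
-- ctid is (page_number, tuple_offset); widely varying page numbers
-- for one customer mean its rows live on many different heap pages
SELECT ctid, product_id
FROM personalized_recommendations
WHERE customer_id = 1
ORDER BY ctid;
```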

What we found was telling:

Rows belonging to the same customer were scattered across thousands of different pages. Even though the index could quickly identify the relevant rows, PostgreSQL still had to jump across the disk fetching each one—an expensive, random I/O pattern.

This explained the sluggish performance.

Tracing the Root Cause Back to Spark

The next question was why the data ended up distributed this way. The rows were inserted by a Spark job running a collaborative filtering model.

Spark writes data partition by partition. When we examined the transformation pipeline, we discovered the dataframe had been partitioned by product_id. That meant:

  • all rows with the same product_id were grouped together in one partition
  • rows for a single customer were spread across many partitions
  • PostgreSQL inserted the data in that same fragmented order

A quick test confirmed it: rows sharing a product ID were stored contiguously on disk, while customer-specific rows were dispersed widely.
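One way to run such a test is to count how many distinct heap pages hold the rows for a given key (the product ID below is illustrative); extracting the page number from ctid via a text-to-point cast is a common trick:

```sql
-- Few pages per product_id but many pages per customer_id
-- confirms the product-ordered physical layout
SELECT count(*) AS matching_rows,
       count(DISTINCT (ctid::text::point)[0]) AS heap_pages
FROM personalized_recommendations
WHERE product_id = 42;
```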

Realigning the Data for Faster Queries

Our goal was to physically group rows by customer_id to reduce random I/O during index scans.
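For completeness: PostgreSQL can also rewrite a table in index order on its own via CLUSTER, but the ordering is not maintained for future inserts, so fixing the write path in Spark was the better fit for a table that is rebuilt in batch anyway (the index name below is illustrative):

```sql
-- One-time physical reorder by an index on customer_id;
-- takes an exclusive lock and is not maintained afterwards
CLUSTER personalized_recommendations
    USING idx_recommendations_customer_id;
```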

Attempt 1: Repartition by customer_id

We tried:

df.repartition($"customer_id")


But with over 10 million customers and Spark’s default of 200 partitions, each partition still contained data for tens of thousands of users. PostgreSQL still received mixed-customer batches, so the physical layout barely changed.

Attempt 2: Repartition + sort within each partition

This time we added:

df.repartition($"customer_id").sortWithinPartitions($"customer_id")



Sorting inside each partition dramatically improved row locality for each user. After writing this version back to PostgreSQL, inspecting ctid values showed what we hoped to see:

All rows for the same customer were now stored close together on disk.
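This physical ordering also surfaces in the planner's statistics: after an ANALYZE, the correlation column of pg_stats measures how closely the physical row order tracks a column's logical order, and values near 1.0 let the planner choose cheaper, more sequential access paths:

```sql
-- correlation close to 1.0 means rows are stored
-- in nearly the same order as this column's values
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'personalized_recommendations'
  AND attname = 'customer_id';
```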

The Final Outcome: 30x Faster Queries

Once the data was physically aligned, PostgreSQL switched to a much more efficient bitmap heap scan. The new query plan brought execution times down from roughly 100 ms to about 3 ms.

This improvement was significant enough to support real-time recommendation delivery and bulk operations for over 200,000 active users.

The overall table size has since grown from 12GB to over 22GB, yet lookup performance remains stable thanks to this storage-aware optimization. While the Spark workload increased due to repartitioning and sorting, this cost is acceptable because the pipeline runs in batch.

Key Takeaways

  • Indexes speed up lookups, but physical row locality determines how efficiently data is retrieved from disk.
  • Tools like ctid can help uncover whether data is scattered or clustered.
  • Distributed systems like Spark can inadvertently degrade database performance depending on how partitions are written.
  • Repartitioning and sorting can dramatically improve data locality—and in turn, query latency.

As data volume and user counts grow, understanding how your storage systems physically arrange records becomes critical for designing truly scalable architectures.
