Zero2Dataengineer

Zero2Dataengineer

Zero to DE Daily Lessons

Indexing Secrets They Don’t Teach You

How to actually use indexes like an engineer — not a tutorial bot

Avantikka_Penumarty's avatar
Avantikka_Penumarty
May 08, 2025
∙ Paid

Welcome back, data minds —
We’ve talked tables. We’ve shaped schemas.
But today, we talk about speed.

Because all the cleanest SQL in the world means nothing…
If your query takes 17 minutes and your dashboard cries blood.

Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.


Why Indexing Matters

Imagine searching for a contact in your phone:

  • No Index: Scroll, scroll, scroll… manually.

  • With Index: Type “S” → boom, “Sara” shows up instantly.

That’s what indexes do to your database.
They help the engine locate what you need — without scanning every row.


Why Indexing Exists

Imagine walking into a bookstore with no signage.
No sections, no labels. Just 100,000 books in a pile.

That’s your database without an index.
Even the simplest query becomes a full table scan — the database flips through every row, like a librarian on Red Bull.

With an index?
You give the DB a map. It jumps to exactly where the answer lives.



The Two Most Common Index Types


Deep Dive: B-Tree vs Hash Index

Interview tip: Most RDBMS default to B-Tree, so unless otherwise needed, go with that.


Real World Example: The Wrong Index Costs Real Money

When I worked on a loyalty points table with 30M+ rows,
a teammate created an index on user_id.

Problem?
All queries filtered on created_at That index was useless.

We swapped it to:

CREATE INDEX idx_created_at ON loyalty_points (created_at);

Result: Dashboard load time dropped from 38s → 2s.
That tiny change saved $600/month in Snowflake compute credits.


Real-World: How One Index Saved a Failing Job

At a logistics startup, we had a shipment tracking job that queried a 60M row tracking_events table every 5 minutes.

We thought it was clean:

SELECT * FROM tracking_events WHERE status = 'delivered';

But it took 45 seconds.

The problem? No index on status.

Worse — status had only 3 values (in_transit, delivered, failed) → low cardinality.

So indexing that didn’t help. Instead, we looked at event_time, which had high uniqueness and was used in the WHERE clause:

CREATE INDEX idx_event_time ON tracking_events (event_time);

Query dropped to 2.1 seconds.

And that’s when we learned:

Index the column that filters the MOST rows — and does so selectively.


UPGRADE TO ANNUAL

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2025 Avantika
Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture