Day 17/30 SQL, Python, ETL, Data Modelling Challenge FREE Solutions

Solutions for March 18th, 2025 CHALLENGE – unlock solutions + reasoning

Avantika_Penumarty

Mar 19, 2025

👋 Hey Data Engineers!

Welcome to Day 17 of the 30-Day Data Engineering Challenge! 🚀

Difficulty Level: Intermediate

Today's Challenge covers:

SQL Indexing & Query Performance
Memory Optimization in Python
Ensuring Data Integrity in ETL Pipelines
Choosing the Right Fact Table Granularity in Data Modeling

Not subscribed yet? Get daily challenges + expert breakdowns delivered straight to your inbox!

🧠 Understand, Don't Memorize:

✅ Clear explanations & reasoning
✅ Why this solution works
✅ Key optimizations & best practices

💡 Want deep dives + runnable code? Upgrade to the Annual Plan and master these concepts like a pro!

SQL Challenge - Query Performance with Indexing

Challenge Recap:

Question: Which SQL query is least likely to use an index efficiently?

🔘 A) SELECT * FROM users WHERE email = 'test@email.com';
🔘 B) SELECT * FROM users WHERE LOWER(name) = 'john doe';
🔘 C) SELECT * FROM users ORDER BY id;
🔘 D) SELECT * FROM users WHERE age > 30;

Answer: Option B

SELECT * FROM users WHERE LOWER(name) = 'john doe';

Explanation:

Applying a function (LOWER()) to an indexed column forces a full table scan because indexes store values as-is. Instead, store pre-lowered values or use case-insensitive collations to retain index efficiency.

Best Practices:

Avoid functions on indexed columns in WHERE clauses.
Use computed columns for case transformations instead of applying LOWER() dynamically.
Verify index utilization with EXPLAIN ANALYZE.

If you’re enjoying these challenges, it’s time to go beyond just knowing the answers.

To truly master SQL, Python, ETL, and Data Modeling, you need deep conceptual clarity, hands-on practice, and real-world applications—not just solutions.

UPGRADE TO DEEP DIVE

Python Challenge - Memory Optimization in Python

Challenge Recap:

Question: Which approach consumes the least memory when processing a large dataset?

🔘 A) list(range(10**6))
🔘 B) tuple(range(10**6))
🔘 C) set(range(10**6))
🔘 D) (x for x in range(10**6))

Answer: Option D : x for x in range(10**6)

Explanation:

Generators ((x for x in range(n))) don’t store the entire dataset in memory. Instead, they generate items on demand, significantly reducing memory usage compared to lists, tuples, or sets.

Best Practices:

Use generators when iterating over large datasets to avoid excessive memory consumption.
Prefer yield statements in functions instead of returning large lists.
When processing data from files, use iterators instead of loading everything into memory.

ETL Challenge - Ensuring Data Integrity in ETL Pipelines

Challenge Recap:

Question: Which technique is most effective for preventing duplicate records in an ETL process?

🔘 A) Using DISTINCT in SQL queries
🔘 B) Implementing an idempotent ETL design
🔘 C) Running a full reload daily
🔘 D) Ignoring primary key constraints

Answer: Option B - Implementing an idempotent ETL design

Explanation:

An idempotent ETL design ensures that reprocessing the same data multiple times does not create duplicates or corrupt data. It often relies on primary keys, upserts, and deduplication logic to guarantee consistency.

Best Practices:

Use UPSERT (MERGE) operations to insert new records and update existing ones.
Track ETL job run states to prevent reprocessing of the same data.
Implement checksum validation to detect duplicate or corrupted records.

Data Modeling Challenge - Choosing the Right Fact Table Granularity

Challenge Recap:

Question: Which scenario benefits most from a low-granularity fact table?

🔘 A) A daily sales summary report
🔘 B) Real-time order tracking for an e-commerce platform
🔘 C) A detailed clickstream analysis in an ad-tracking system
🔘 D) A transaction log capturing every user action

Answer: Option A - A daily sales summary report

Explanation:

A low-granularity fact table stores data at a summary level rather than an atomic level. This reduces storage and improves query performance for aggregated reporting, like daily sales summaries.

Best Practices:

Use low granularity for aggregated reporting to optimize storage and performance.
For detailed analytics, maintain high-granularity tables (e.g., every transaction).
Balance between query speed and data availability when designing fact tables.

JOIN 1000+ PAID MEMBERS

Discussion about this post

Ready for more?