📢 Day 16/30 - SQL, PYTHON, ETL, DATA MODELING CHALLENGE Solutions

Solutions for March 17th, 2025 CHALLENGE – unlock solutions + reasoning!

Mar 18, 2025

👋 Hey Data Engineers!

Welcome to Day 16 of the 30-Day Data Engineering Challenge 🚀.

Today’s Challenge covers:

🚀 SQL: Recursive CTEs for Hierarchical Data
🔹 Learn how to retrieve all employees reporting to a manager, including indirect reports, using recursive queries.

🐍 Python: Multi-threading & Performance
🔹 Understand when to use threading vs. multiprocessing for executing concurrent tasks efficiently.

⚡ ETL: Handling Failures in Data Pipelines
🔹 Discover best practices for retrying failed ETL jobs due to intermittent network issues.

📊 Data Modeling: Normalization vs. Denormalization
🔹 Learn when to denormalize a schema for analytical workloads vs. when to normalize for transactional integrity.

Want deep dives + runnable code? Upgrade to the Annual Plan and master these concepts like a pro!

🧠 Understand, Don't Memorize:

✅ Clear explanations & reasoning
✅ Why this solution works
✅ Key optimizations & best practices

📌 SQL Challenge - Recursive Queries

👉 Question:
Which SQL query correctly retrieves all employees reporting to a given manager, including indirect reports?

🔘 A) SELECT * FROM employees WHERE manager_id = 5;
🔘 B) WITH RECURSIVE emp_cte AS (...) SELECT * FROM emp_cte;
🔘 C) SELECT * FROM employees INNER JOIN managers ON employees.id = managers.id;
🔘 D) SELECT * FROM employees WHERE id = ANY(SELECT manager_id FROM employees);

✅ Answer: B) WITH RECURSIVE emp_cte AS (...) SELECT * FROM emp_cte;

📖 Explanation:

Recursive CTEs (WITH RECURSIVE) allow us to traverse hierarchical relationships such as employee-manager structures.
This approach recursively selects direct and indirect reports, making it ideal for organizational trees, category hierarchies, and dependency chains.

💡 Best Practices:
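✔ Use recursive CTEs when the hierarchy depth is unknown or variable; a fixed, shallow hierarchy can be handled with plain self-joins.
✔ Make sure the recursion terminates: the recursive member must eventually return no new rows, and cycles in the data (e.g., two employees listed as each other's manager) will otherwise loop forever.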
✔ Use depth limits if necessary to control recursion depth.
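
The answer option elides the CTE body as (...), so here is one way it could be filled in. This is a minimal sketch run through Python's built-in sqlite3 module; the employees table layout, the sample rows, the depth column, and the reports_of helper are all assumptions made for illustration, not part of the original challenge.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),
        (5, 'Bo', 1),
        (6, 'Cy', 5),
        (7, 'Di', 5),
        (8, 'Ed', 6),
        (9, 'Flo', 1);
""")

REPORTS_SQL = """
WITH RECURSIVE emp_cte(id, name, depth) AS (
    -- Anchor member: direct reports of the given manager.
    SELECT id, name, 1 FROM employees WHERE manager_id = :manager_id
    UNION ALL
    -- Recursive member: reports of anyone already in emp_cte.
    SELECT e.id, e.name, c.depth + 1
    FROM employees AS e
    JOIN emp_cte AS c ON e.manager_id = c.id
    WHERE c.depth < :max_depth  -- depth limit, also a guard against cycles
)
SELECT id, name, depth FROM emp_cte ORDER BY depth, id;
"""

def reports_of(manager_id, max_depth=10):
    """Return (id, name, depth) for every direct and indirect report."""
    params = {"manager_id": manager_id, "max_depth": max_depth}
    return conn.execute(REPORTS_SQL, params).fetchall()

for row in reports_of(5):
    print(row)
```

With this sample data, manager 5 has two direct reports (depth 1) and one indirect report (depth 2). Option A from the question would return only the two direct reports.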

🐍 Python Challenge - Multi-threading

👉 Question:
Which Python call is best for running multiple tasks concurrently with threads?

🔘 A) threading.Thread(target=function).start()
🔘 B) multiprocessing.Process(target=function).start()
🔘 C) asyncio.run(function())
🔘 D) os.fork()

✅ Answer: A) threading.Thread(target=function).start()

📖 Explanation:

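Threads run concurrently in the same memory space, which makes threading a good fit for I/O-bound tasks like network requests or file reading: in CPython the global interpreter lock (GIL) is released while a thread waits on I/O, so other threads can make progress.
multiprocessing creates separate processes, each with its own interpreter and memory, so it is better suited to CPU-bound tasks.
asyncio runs coroutines cooperatively on a single-threaded event loop, which is a different concurrency model from multithreading.
os.fork() is a low-level, Unix-only process-creation call, not a general-purpose threading tool.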

💡 Best Practices:
✔ Use threading for I/O-bound tasks (e.g., API calls, file reading).
✔ Use multiprocessing for CPU-intensive tasks (e.g., data processing, ML).
✔ Avoid shared mutable state (such as global variables) in threaded programs, or guard it with a lock, to prevent race conditions.
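
To make the threading advice concrete, here is a small sketch that starts one threading.Thread per task and protects the shared results with a lock. The fetch and fetch_all functions and the source names are hypothetical, and time.sleep stands in for a real network call.

```python
import threading
import time

def fetch(source, results, lock):
    """Stand-in for an I/O-bound call such as an API request (hypothetical)."""
    time.sleep(0.1)  # simulated network wait; other threads run meanwhile
    with lock:  # guard the shared dict against concurrent writes
        results[source] = f"payload from {source}"

def fetch_all(sources):
    """Run one thread per source and wait for all of them to finish."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fetch, args=(source, results, lock))
        for source in sources
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # block until this thread has finished
    return results

start = time.perf_counter()
results = fetch_all(["orders", "users", "events"])
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three simulated waits overlap, the total time should be close to a single wait rather than three in a row. For larger batches, concurrent.futures.ThreadPoolExecutor from the standard library offers the same pattern with a bounded number of threads.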

⚡ ETL Challenge - Handling Failures

👉 Question:
What is the best way to handle an ETL job that fails due to an intermittent network issue?

🔘 A) Retry the job with exponential backoff
🔘 B) Ignore the failure and proceed to the next step
🔘 C) Stop the pipeline and notify the user
🔘 D) Delete and reload all data

✅ Answer: A) Retry the job with exponential backoff

📖 Explanation:

Exponential backoff multiplies the wait time after each failed attempt (typically doubling it), which gives a transient network problem time to clear and avoids hammering a struggling service with immediate retries.
Ignoring the failure (B) can lead to data inconsistencies.
Stopping the pipeline (C) without retrying isn't ideal unless the failure is persistent.
Deleting and reloading all data (D) is inefficient and unnecessary for transient issues.

💡 Best Practices:
✔ Implement retry logic with increasing delays (e.g., 1s, 2s, 4s, 8s).
✔ Use logging and monitoring to track failures and retries.
✔ Store intermediate checkpoints to avoid reprocessing large datasets.
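
The retry advice above can be sketched roughly as follows. This is an illustrative helper, not a production implementation: retry_with_backoff, flaky_load, and every parameter name are made up for this example, and it assumes the transient failure surfaces as a ConnectionError.

```python
import logging
import random
import time

logger = logging.getLogger("etl.retry")

def retry_with_backoff(job, max_attempts=5, base_delay=1.0, jitter=0.0,
                       sleep=time.sleep):
    """Call job(); on ConnectionError wait base_delay * 2**n, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # persistent failure: give up and surface the error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, 8s, ...
            delay += random.uniform(0, jitter)  # optional random spread
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)

# Demo: a hypothetical job that fails three times, then succeeds.
attempts = {"count": 0}

def flaky_load():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("simulated network blip")
    return "loaded"

waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_load, sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter keeps the demo instant and makes the delays easy to inspect. Setting jitter above zero adds a random offset to each wait so that many failing jobs do not all retry at the same moment.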

📊 Data Modeling Challenge - Normalization vs. Denormalization

👉 Question:
Which scenario is best suited for a fully denormalized schema?

🔘 A) Analytical reporting with frequent aggregations
🔘 B) Transactional systems requiring strict data integrity
🔘 C) Highly normalized OLTP databases
🔘 D) Large-scale e-commerce platforms processing orders

✅ Answer: A) Analytical reporting with frequent aggregations

📖 Explanation:

Denormalization speeds up read performance by reducing joins, making it ideal for analytical workloads (OLAP systems, dashboards, data warehouses).
Transactional systems (OLTP) need normalization to ensure data integrity and avoid redundancy.
E-commerce platforms (D) often balance both approaches depending on performance needs.

💡 Best Practices:
✔ Normalize OLTP databases (typically to 3NF) to prevent redundancy and update anomalies.
✔ Denormalize OLAP databases for faster reads.
✔ Use materialized views or caching instead of full denormalization when possible.
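
As a rough illustration of the trade-off, the sketch below builds a tiny normalized schema and a denormalized copy in SQLite, then runs the same aggregation against both. The customers, orders, and orders_wide tables and their data are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): region is stored once per customer.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 50.0);

    -- Denormalized (OLAP-style): region is copied onto every order row.
    CREATE TABLE orders_wide AS
        SELECT o.id, o.amount, c.region
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id;
""")

# Aggregation on the normalized schema needs a join.
normalized = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# The same aggregation on the denormalized table reads a single table.
denormalized = conn.execute("""
    SELECT region, SUM(amount)
    FROM orders_wide
    GROUP BY region
    ORDER BY region
""").fetchall()

print(normalized)
print(denormalized)
```

Both queries return the same totals. The denormalized version skips the join at read time, at the cost of repeating region on every row, so a change to a customer's region would have to be applied to all of that customer's orders.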

🚀 Want the Full DEEP DIVE Analysis?
🔍 Concept breakdowns, live runnable code, and expert strategies are available for paid members.

🔥 Upgrade to Annual Membership for:
✅ Advanced SQL & Python solutions
✅ Real-world ETL & Data Modeling case studies
✅ FAANG-level interview strategies
