<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Zero2Dataengineer]]></title><description><![CDATA[AI and Data Engineering lessons for engineers who want to move fast, build smarter, and get the promotion in data roles you have been working toward. Learn the concepts shaping every industry, fix real DE and AI problems, crack the interview.]]></description><link>https://zero2dataengineer.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!P4V8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F480087e2-d585-43e3-8076-9e1282f0eb2d_200x200.png</url><title>Zero2Dataengineer</title><link>https://zero2dataengineer.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 12 Jun 2026 09:22:06 GMT</lastBuildDate><atom:link href="https://zero2dataengineer.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Avantika]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[zero2dataengineer@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[zero2dataengineer@substack.com]]></itunes:email><itunes:name><![CDATA[Avantika_Penumarty]]></itunes:name></itunes:owner><itunes:author><![CDATA[Avantika_Penumarty]]></itunes:author><googleplay:owner><![CDATA[zero2dataengineer@substack.com]]></googleplay:owner><googleplay:email><![CDATA[zero2dataengineer@substack.com]]></googleplay:email><googleplay:author><![CDATA[Avantika_Penumarty]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why Facebook Never Goes Down: The Two Systems Behind 3 Billion Users]]></title><description><![CDATA[Most engineers 
never study LogDevice and RocksDB. Here is why they should.]]></description><link>https://zero2dataengineer.substack.com/p/why-facebook-never-goes-down-the</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/why-facebook-never-goes-down-the</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Thu, 02 Apr 2026 00:30:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!P4V8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F480087e2-d585-43e3-8076-9e1282f0eb2d_200x200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p style="text-align: justify;">Every time you open Facebook, something extraordinary happens behind the scenes.</p><p style="text-align: justify;">Three billion people are doing the exact same thing at the same time. Posting. Scrolling. Messaging. Watching. And somehow the whole thing just works. No crashes. No waiting. No downtime.</p><p style="text-align: justify;">The answer is not more servers. It is not a bigger database. It is a fundamental design decision that most engineers never think about until their system is already on fire.</p><p style="text-align: justify;">Meta treats reads and writes as two completely separate problems.</p><p style="text-align: justify;">In most systems, reads and writes share the same path. The same database handles both. Which means when one gets busy, the other suffers. Your users are trying to load their feed while your pipeline is ingesting millions of new events at the same time. They are fighting over the same resources. And under load, everybody loses.</p><p style="text-align: justify;">Most engineers respond to this by adding memory, scaling horizontally, or upgrading their database tier. None of that fixes the actual problem. 
Because the actual problem is architectural, not operational.</p><p style="text-align: justify;">Meta solved it by building two completely different systems.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p style="text-align: justify;"><strong>LOGDEVICE: BUILT FOR WRITES ONLY</strong></p><p style="text-align: justify;">LogDevice is Meta&#8217;s distributed log storage system. It was designed with one purpose: ingest data at massive speed without ever slowing down.</p><p style="text-align: justify;">Every like, every message, every video view, every backend sensor ping across three billion users: LogDevice takes it all in. It uses a log-structured approach, which means it appends data sequentially rather than jumping around the disk at random. Sequential writes are dramatically faster than random writes. That is not an accident. That is a deliberate design choice made specifically to maximize write throughput.</p><p style="text-align: justify;">LogDevice is not built to answer arbitrary queries. It hands records back in the order they were written, which suits consumers tailing a log, but it was never designed to serve random lookups efficiently. That is the point.
By giving up on flexible reads, it becomes extraordinarily good at the one thing it was built for.</p><p style="text-align: justify;">Most engineers designing their first production system try to pick one database that handles everything. LogDevice is the proof that this instinct, while understandable, is wrong at scale.</p><p style="text-align: justify;"><strong>ROCKSDB: BUILT FOR READS WITH SURGICAL PRECISION</strong></p><p style="text-align: justify;">RocksDB started as a fork of Google&#8217;s LevelDB. Meta rebuilt it for server workloads and open-sourced it in 2013. Today it powers systems at Facebook, LinkedIn, Yahoo, Twitter, and many other companies running at scale.</p><p style="text-align: justify;">The reason Meta built RocksDB instead of using an existing solution is the same reason they built LogDevice. Nothing on the market gave them the control they needed.</p><p style="text-align: justify;">RocksDB is an embeddable key-value store that lets you tune read and write performance independently at the instance level. This is the part most engineers miss.</p><p style="text-align: justify;">You can deploy one RocksDB instance configured entirely for fast point lookups, optimized for the read patterns of a news feed where you need to retrieve a specific user&#8217;s data in milliseconds. You can deploy another instance configured for high write throughput, optimized for the ingestion patterns of an analytics pipeline processing billions of events. Same underlying technology. Completely different configurations. Completely different jobs.</p><p style="text-align: justify;">They never compete for the same resources because they were never meant to run the same workload.</p><p style="text-align: justify;">RocksDB also uses a data structure called an LSM tree (log-structured merge tree), which batches writes in memory and flushes them to disk in sorted order. This makes writes fast and keeps related data physically close together on disk.
When you request data, the disk has to seek less to find it. Less seeking means faster reads. Meta takes this even further by pre-arranging the most frequently accessed bytes so they are physically adjacent on disk. The result is a feed that loads in milliseconds regardless of how many people are using it simultaneously.</p><p style="text-align: justify;"><strong>WHY THIS MATTERS FOR YOUR SYSTEM RIGHT NOW</strong></p><p>You are probably not building for three billion users. But the principle applies at every scale.</p><p>If you have a system that slows down under load, the first question to ask is not &#8220;What hardware do I need?&#8221; The question is &#8220;Are my reads and writes competing for the same resources?&#8221;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/p/why-facebook-never-goes-down-the/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/p/why-facebook-never-goes-down-the/comments"><span>Leave a comment</span></a></p><p>I have seen this exact problem at companies with 50 engineers and companies with 5,000. A shared database handling both analytical queries and transactional writes. A single Kafka consumer group processing both real-time and batch workloads. One pipeline serving five different use cases with completely different performance requirements.</p><p>The symptom is always the same. Things work fine until load increases. Then everything degrades together because everything is coupled together.</p><p>The fix is always the same too. Separate the concerns. Define the job. Build for that job specifically.</p><p>LogDevice does not try to be RocksDB. RocksDB does not try to be LogDevice. And Facebook never goes down.</p><p>Here is the three-step framework I apply before designing any new data system:</p><p>Step one.
Write down every read pattern your system needs to support. How frequently. What latency is acceptable. What the data shape looks like.</p><p>Step two. Write down every write pattern separately. How much volume. How fast does it need to land. What consistency guarantees do you need.</p><p>Step three. Ask honestly whether one system can serve both patterns without compromising either. If the answer is no, you already know what to do.</p><p>Separate the concern first. Then optimize. That is how you build something that survives contact with real traffic.</p><p>If you found this valuable, Thursday&#8217;s paid newsletter goes even deeper.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;Not subscribed yet? Now is a good time.&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>Not subscribed yet? Now is a good time.</span></a></p><p>I am breaking down the exact career moves that separate engineers who understand systems from engineers who just operate them. The difference in compensation between those two groups at companies like Meta is not small.</p><p>Thursday 5:30pm. Paid subscribers only.</p><p>See you Thursday.</p><p>&#8212; Avantika</p>]]></content:encoded></item><item><title><![CDATA[Scale Does Not Break Your Code. It Breaks Your Assumptions.]]></title><description><![CDATA[I was wrong about retries. 
It cost a million users.]]></description><link>https://zero2dataengineer.substack.com/p/scale-does-not-break-your-code-it</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/scale-does-not-break-your-code-it</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Wed, 01 Apr 2026 00:31:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dAIQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4 style="text-align: justify;">I remember the exact moment I realized I had no idea what I was doing.</h4><p style="text-align: justify;">It was my first week at Meta. I had just been handed access to one of the most complex data pipelines I had ever seen. A DAG running on 40 trillion events a day. Likes. Messages. Video views. Sensor pings from devices most people don&#8217;t even know exist.</p><p style="text-align: justify;">I sat there thinking, I have built pipelines before. Real ones. For Fortune 500 clients. I know Spark. I know SQL. I know how to ship. I&#8217;ve got this.</p><p style="text-align: justify;">I did not have this.</p><p style="text-align: justify;">The first thing that humbled me was not the complexity of the code. It was how wrong my assumptions were.</p><p style="text-align: justify;">Before Meta, I assumed deduplication was a solved problem. You write the logic once. It works. Done. At 10 million rows that is true. At 40 trillion events, I was generating duplicates that lived quietly in production for weeks before anyone noticed. 
And by the time we caught it, half the company was downstream of that bad data.</p><p style="text-align: justify;"><strong>The code was not wrong. My assumption was wrong.</strong></p><p style="text-align: justify;">The second assumption that broke me was around retries. In most systems, if a job fails you retry it. Simple. Safe. Standard practice. At Meta scale, a retry meant potentially processing the same event twice. Which meant potentially double-charging a million users. Which meant a P0 incident at 2am with half the engineering org on a call.</p><p style="text-align: justify;">I had never once thought about idempotency as a design requirement. At scale it is not a nice-to-have. It is the difference between a working system and a crisis.</p><p style="text-align: justify;">The third assumption was around SLAs. I assumed if a job had a 4-hour SLA and ran in 2 hours in staging, we were fine. Until the cluster was hot. Until three other high-priority jobs were competing for the same resources.
Until my 2-hour job was at hour 6 and my SLA was breached and I was explaining to my manager why downstream dashboards were empty.</p><p style="text-align: justify;">Here is what I learned from all of this. The engineers who survive at that level are not the smartest ones in the room. They are the ones who document every assumption their system makes and then intentionally try to break each one. Not in production. Not after an incident. Before it ever goes live.</p><p style="text-align: justify;">That is the skill nobody teaches you. Not in bootcamps. Not in certifications. Not in any course I have ever taken. You learn it by being on the wrong side of an incident and spending 6 hours in a war room tracing back to the assumption you made three months ago that seemed totally reasonable at the time.</p><p style="text-align: justify;">Or you learn it here. Before it costs you a night of sleep.</p><p style="text-align: justify;">The framework below is the exact checklist I run before any pipeline goes to production. The same one I wish someone had handed me in my first week at Meta.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;Unlock the Full Framework&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>Unlock the Full Framework</span></a></p><h4><strong>THE ASSUMPTION AUDIT: HOW TO BREAK YOUR PIPELINE BEFORE IT BREAKS YOU</strong></h4><p>After years of building at scale, I now run every pipeline through five assumption categories before it goes live. Not because I am paranoid.
Because every incident I have ever been part of traced back to exactly one of these five.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dAIQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dAIQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!dAIQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!dAIQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!dAIQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dAIQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f36d99c0-f515-46e6-951d-ba028293430a_1408x768.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7744e48b-0890-4904-9952-fcd2e865afe2_1408x768.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1629403,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/192665197?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7744e48b-0890-4904-9952-fcd2e865afe2_1408x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dAIQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!dAIQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!dAIQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!dAIQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff36d99c0-f515-46e6-951d-ba028293430a_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
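<p>The retry story above is easier to see in code. Here is a minimal sketch of idempotency as a design requirement, assuming every event carries a unique ID. The class, method, and field names are hypothetical illustrations, not Meta&#8217;s actual code, and a production system would keep the seen IDs in a durable store rather than in memory.</p><pre><code class="language-python">class ChargeLedger:
    """Applies each charge at most once, keyed by a caller-supplied event ID."""

    def __init__(self):
        self.balances = {}            # user_id mapped to total amount charged
        self.seen_event_ids = set()   # IDs of events that were already applied

    def apply_charge(self, event_id, user_id, amount):
        # A retry re-sends the same event_id, so it is skipped here.
        if event_id in self.seen_event_ids:
            return False
        self.balances[user_id] = self.balances.get(user_id, 0) + amount
        self.seen_event_ids.add(event_id)
        return True


ledger = ChargeLedger()
ledger.apply_charge("evt-1", "user-42", 10)
ledger.apply_charge("evt-1", "user-42", 10)  # simulated retry of the same event
print(ledger.balances["user-42"])  # prints 10, not 20
</code></pre><p>The design choice is that the caller, not the ledger, owns the event ID. That way a retried job sends the exact same ID and the second attempt becomes a no-op instead of a second charge.</p>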
      <p>
          <a href="https://zero2dataengineer.substack.com/p/scale-does-not-break-your-code-it">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The 30-Day DE Roadmap: Your Fast-Track to a $150K Job: How to Think Like a Data Engineer]]></title><description><![CDATA[Learn the Right Skills. Skip the Random Tools. Most people learn the wrong things. Here is what actually matters.]]></description><link>https://zero2dataengineer.substack.com/p/the-30-day-data-engineering-roadmap</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/the-30-day-data-engineering-roadmap</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Sun, 01 Mar 2026 18:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fIG8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Welcome back to Zero2DataEngineer. </strong></p><p><strong>New here? &#128075; </strong> <strong>Hi. I am Avantika. </strong></p><p>I spent 5 years at Meta building data systems for AR/VR (Reality Labs), Marketplace, and Consumer Connectivity. I then moved on to build platforms for Walmart and Marriott, processing half a billion events a day. Yes, half a billion. Before my morning coffee.</p><p>But I have also been the person who Googled &#8220;what is a slowly changing dimension&#8221; at 11pm the night before an interview. Zero connections in tech. Broke into MAANG the hard way.
Both versions of me write this newsletter: the one who learned everything the hard way, and the one who finally knows enough to make it easier for you.</p><p>Let me tell you what nobody says out loud&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CfAD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CfAD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 424w, https://substackcdn.com/image/fetch/$s_!CfAD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 848w, https://substackcdn.com/image/fetch/$s_!CfAD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 1272w, https://substackcdn.com/image/fetch/$s_!CfAD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CfAD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png" width="1509" height="905" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:905,&quot;width&quot;:1509,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1531913,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9ab583f-6a74-477a-981f-944a00abaf4f_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CfAD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 424w, https://substackcdn.com/image/fetch/$s_!CfAD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 848w, https://substackcdn.com/image/fetch/$s_!CfAD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 1272w, https://substackcdn.com/image/fetch/$s_!CfAD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb006b514-a147-4b82-95ec-2ba6df3c3998_1509x905.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Most engineers who fail to break into Data Engineering did not fail because they lacked talent. They failed because they spent months learning tools instead of learning how to think. They memorized syntax without understanding the problem the syntax was solving. They chased certificates while the engineers getting hired were building things.</p><p>In 2026, the bar has shifted. Companies are not impressed by a list of tools on your resume.
They are impressed by engineers who understand data at a conceptual level, write clean purposeful code, and can walk into a system they have never seen before and figure it out.</p><p>Because honestly, when AI can write the code in seconds, memorizing syntax is no longer your competitive advantage. What AI cannot do is understand why a pipeline is failing at 3am, make a judgment call on a schema design that has to last five years, or walk into a broken system and reason through it from first principles. That is what companies are paying $150K for in 2026. Not your ability to remember a function name. Your ability to think.</p><p>This roadmap is not about tools.
It is about building the mental models that make everything else click.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gZQ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gZQ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 424w, https://substackcdn.com/image/fetch/$s_!gZQ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 848w, https://substackcdn.com/image/fetch/$s_!gZQ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!gZQ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gZQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png" width="825" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:825,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1366371,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gZQ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 424w, https://substackcdn.com/image/fetch/$s_!gZQ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 848w, https://substackcdn.com/image/fetch/$s_!gZQ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!gZQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cece3c2-5084-4620-92a8-51d11e100f9f_825x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Did you know most Data Engineers who land $150K roles cannot list every feature of every tool they use? But they can explain exactly why data moves the way it does, where it breaks, and how to fix it. That is the real skill.</p><p>And the engineers who move into staff and principal and manager roles? They are almost always the most technical people in the room. But not because they memorized the most.</p><p>Because they mastered the one skill nobody talks about enough: being the human in the room. The judgment. The context. The ability to sit across from a stakeholder and translate ambiguity into architecture. The instinct that comes from having actually broken things and fixed them.</p><p>In the fastest moving AI era we have ever seen, that is not a soft skill.
That is your sharpest competitive edge.</p><div><hr></div><h2><strong>Week 1: Learn to Think in SQL</strong></h2><p>SQL is not just a query language. It is a way of reasoning about data relationships.</p><p>Before you write a single line of code, understand what you are actually asking the database to do. Every JOIN is a question about how two sets of information relate to each other. Every window function is a question about context: how does this row relate to the rows around it? 
Every CTE is a way of breaking a complex question into smaller, readable pieces.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fIG8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIG8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 424w, https://substackcdn.com/image/fetch/$s_!fIG8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 848w, https://substackcdn.com/image/fetch/$s_!fIG8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 1272w, https://substackcdn.com/image/fetch/$s_!fIG8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIG8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png" width="1456" height="1674" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1674,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8195704,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIG8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 424w, https://substackcdn.com/image/fetch/$s_!fIG8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 848w, https://substackcdn.com/image/fetch/$s_!fIG8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 1272w, https://substackcdn.com/image/fetch/$s_!fIG8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17c845e8-81a7-4a32-87cf-0a54d05b108b_1920x2208.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Level 1: Foundation</strong> How data is stored and why. SELECT, WHERE, GROUP BY, ORDER BY. INNER, LEFT, RIGHT and FULL JOINs. Aggregations: COUNT, SUM, AVG, MIN, MAX. Filtering with HAVING versus WHERE and why the difference matters.</p><p><strong>Level 2: Intermediate</strong> CTEs and subqueries. When to use one over the other and why it affects readability. Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD. Running totals and moving averages. Multi-table joins and handling NULL values intentionally.</p><p><strong>Level 3: Advanced</strong> Query optimization. How a database engine reads your query, what an execution plan actually tells you, and where queries die. Indexing strategy and when indexes hurt more than they help. Writing SQL for ETL transformations, not just reporting. 
Schema design decisions and how they affect every query written against them.</p><p>What to build: take one messy real-world dataset, something with missing values, duplicates, and inconsistent formats, and write SQL that cleans it, transforms it, and answers three business questions from it. Do not move on until you can explain every line of code out loud.</p><p>The concept that changes everything: query optimization. Understanding why a query is slow teaches you more about how databases actually work than any course ever will.</p><div><hr></div><h2><strong>Week 2: Learn to Think in Python</strong></h2><p>Python in Data Engineering is not about knowing Pandas. It is about understanding how data flows through a system.</p><p>An ETL pipeline is just three ideas: get data from somewhere, change its shape, put it somewhere else. Everything else is implementation detail. Once that concept clicks, the code almost writes itself.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oG-4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oG-4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oG-4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!oG-4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oG-4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oG-4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg" width="890" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:890,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273996,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oG-4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!oG-4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oG-4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oG-4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6110c99-4fcf-454d-803d-0101a5787fac_890x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Level 1: Foundation</strong> Python data types and when to use each one. Reading and writing files: CSV, JSON, TXT. Functions, loops, and error handling. Connecting to a database and running a query from Python.</p><p><strong>Level 2: Intermediate</strong> Data manipulation with Pandas. Merging, reshaping, filtering and cleaning DataFrames. Writing reusable ETL functions. Working with APIs and parsing responses. File format differences: CSV versus JSON versus Parquet and when each one is the right choice.</p><p><strong>Level 3: Advanced</strong> Building a full ETL pipeline from scratch. Scheduling and automating jobs. Handling failures gracefully with logging and alerting. Writing Python that a team can maintain, not just code that runs once. Performance optimization when your dataset stops fitting in memory.</p><p>What to build: a Python script that pulls data from a public API, transforms it into a clean structured format, and loads it into a local database. No frameworks. No shortcuts. Just you, the data, and the logic.</p><p>The concept that changes everything: understanding file formats. Why does Parquet exist? What problem does it solve that CSV does not? Engineers who understand the why behind format choices make better decisions at every level of a system.</p><div><hr></div><h2><strong>Week 3: Learn to Think About Scale</strong></h2><p>Here is where most people get confused. Cloud platforms are not the skill. 
Scale is the skill.</p><p>The question is never &#8220;how do I use AWS?&#8221; The question is &#8220;what happens to my pipeline when the data is 100 times bigger than it is today?&#8221; Cloud platforms are just where you go to find out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6IJN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6IJN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6IJN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6IJN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6IJN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6IJN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg" width="890" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:890,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:297009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6IJN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6IJN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6IJN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6IJN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e5cdb3-4dda-4f6b-8e7e-dfa457ba3286_890x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Level 1: Foundation</strong> What cloud infrastructure actually is and why it exists. Object storage versus databases versus data warehouses. The difference between batch processing and streaming and when each one is the right choice. Cost awareness: why architectural decisions have a dollar amount attached to them.</p><p><strong>Level 2: Intermediate</strong> Designing pipelines that separate storage from compute. Serverless functions and when to use them. Infrastructure as code: why writing infrastructure in code is the same discipline as writing application code. Monitoring pipelines and knowing when something is wrong before users tell you.</p><p><strong>Level 3: Advanced</strong> Designing for failure. What happens when a pipeline breaks halfway through and how idempotency protects you. 
Data partitioning strategies and how they affect query performance downstream. The tradeoffs between real-time and near-real-time architectures and the cost of each.</p><p>What to build: take the ETL pipeline from the previous section and think through what would break at 10x the data volume. Then redesign it to handle that load. The redesign does not need to be deployed anywhere. The thinking is the exercise.</p><p>The concept that changes everything: serverless architecture. Understanding when to let infrastructure scale automatically versus when to control it manually is a decision every senior engineer makes constantly.</p><div><hr></div><h2><strong>Week 4: Learn to Think About Distributed Systems</strong></h2><p>Apache Spark is not a tool you learn. It is a mental model you develop.</p><p>The core idea: some problems are too big for one machine. Distributed computing is the art of breaking a big problem into smaller problems, solving them in parallel, and assembling the results. Once that concept is clear, the code is just syntax.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OOQh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OOQh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OOQh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!OOQh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!OOQh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OOQh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg" width="890" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:890,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OOQh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!OOQh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OOQh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!OOQh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8f02a4-213a-4a92-ad26-0c471382eefb_890x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Level 1: Foundation</strong> Why distributed computing exists and what problems it actually solves. The difference between a single-node and a multi-node system. How data gets split across machines and reassembled. What a shuffle is and why it is often the most expensive operation in a distributed job.</p><p><strong>Level 2: Intermediate</strong> DataFrames in a distributed context versus a single-machine context. Transformations versus actions and why the distinction changes how you write code. Partitioning strategies and how the wrong one quietly destroys performance. Reading and writing large datasets efficiently.</p><p><strong>Level 3: Advanced</strong> Performance tuning from first principles. Understanding data skew and why it kills parallel execution. Caching strategy and when holding data in memory helps versus hurts. Debugging a distributed job that fails silently on only some partitions. Designing pipelines that stay performant as data volumes grow unpredictably.</p><p>What to build: take a large dataset, at least a few million rows, and write a transformation job that would be painfully slow in regular Python. Then think through how distributing that work across multiple machines changes the execution. Understand the tradeoffs between partitioning strategies before worrying about which cluster to run it on.</p><p>The concept that changes everything: data locality. Where the data lives relative to where the computation happens determines everything about performance. This is the insight that separates engineers who tune systems from engineers who just run them.</p><div><hr></div><h2><strong>Four Weeks. One Shift. Everything Changes.</strong></h2><p>Four mental models. SQL thinking, Python thinking, scale thinking, distributed thinking. That is the entire curriculum of a $15K bootcamp. 
You just got it in 30 days by understanding the concepts, not memorizing the commands.</p><p>Now go build something that breaks. That is where the real learning starts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LSmv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LSmv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LSmv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LSmv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LSmv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LSmv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png" width="825" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:825,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1454930,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/158895900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LSmv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LSmv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LSmv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LSmv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3637d20f-8d58-40f3-b2d5-f0a7cfa973ec_825x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p style="text-align: center;">Drop your biggest conceptual challenge in the comments. Not &#8220;I do not know which tool to learn.&#8221; The real question underneath that. 
That is where the work actually starts.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/p/the-30-day-data-engineering-roadmap/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/p/the-30-day-data-engineering-roadmap/comments"><span>Leave a comment</span></a></p><div><hr></div><p style="text-align: center;"><em>That is all for today.</em></p><p><em>If this helped, forward it to a friend or colleague who is figuring out their next move in tech. This newsletter costs less than your daily coffee and it might be the thing that gets them unstuck.</em></p><p><em>Follow me on LinkedIn and Instagram @avantika.tech for daily AI and Data Engineering content.</em></p><p><em>See you tomorrow.</em></p><p><em>&#8212; Avantika Penumarty</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share Avantikka_Penumarty&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share Avantikka_Penumarty</span></a></p>]]></content:encoded></item><item><title><![CDATA[Almost Didn’t Apply: I procrastinated applying to data engineering roles for 8 months. 
]]></title><description><![CDATA[How Imposter Syndrome Nearly Cost Me My Data Engineering Career]]></description><link>https://zero2dataengineer.substack.com/p/almost-didnt-apply-i-procrastinated</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/almost-didnt-apply-i-procrastinated</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Sat, 17 Jan 2026 03:40:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!P4V8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F480087e2-d585-43e3-8076-9e1282f0eb2d_200x200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey there,</p><p>I need to tell you about the biggest mistake I almost made in my career.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>It was 2021. I was scrolling through LinkedIn (procrastinating, as usual) when I saw it:</p><blockquote><p><strong>&#8220;Data Engineer &#8211; Series B Startup &#8211; $180K base&#8221;</strong></p></blockquote><p>My heart raced.<br>This was it. 
The role I&#8217;d been working toward.</p><p>Then I opened the job description.</p><h2>The Requirements That Almost Stopped Me</h2><p>The list felt intimidating:</p><ul><li><p>5+ years of data engineering experience<br><em>(I had 3 years in automation + 2 years as a data analyst)</em></p></li><li><p>Spark expert<br><em>(I&#8217;d used it in a few projects, but expert?)</em></p></li><li><p>Data engineering background required<br><em>(I came from automation engineering)</em></p></li><li><p>CS degree preferred<br><em>(I had one, plus a Master&#8217;s in Engineering Management and still doubted myself)</em></p></li></ul><p>I stared at the screen for 20 minutes.</p><p>Then I closed the tab.</p><blockquote><p><em>&#8220;I&#8217;m not qualified.&#8221;</em></p></blockquote><div><hr></div><h2>The Voice of Self-Sabotage</h2><p>For three weeks, that job posting haunted me.</p><p>Every morning, I&#8217;d see it in my saved jobs. Every evening, I&#8217;d tell myself:</p><ul><li><p>&#8220;I need more Spark experience first&#8221;</p></li><li><p>&#8220;They want someone from a pure DE background&#8221;</p></li><li><p>&#8220;I should wait until I&#8217;m more ready&#8221;</p></li><li><p>&#8220;What if they figure out I&#8217;m not qualified?&#8221;</p></li></ul><p>Here&#8217;s what I didn&#8217;t know at the time:</p><p>I was already qualified. I just couldn&#8217;t see it.</p><h2>What Changed Everything</h2><p>On a random Tuesday, I was venting to a former colleague about my job search frustration.</p><p>Me: &#8220;I keep finding roles that want 5+ years of pure data engineering experience. I have automation and analyst background, but that doesn&#8217;t count, right?&#8221;</p><p>Him: &#8220;Wait, you built those automation pipelines that processed millions of events per day, right?&#8221;</p><p>Me: &#8220;Yeah, but that&#8217;s not data engineering&#8212;&#8221;</p><p>Him: &#8220;That IS data engineering. You built data pipelines. You designed data models. 
You worked with stakeholders on data requirements. What do you think data engineering is?&#8221;</p><p>That conversation changed my perspective completely.</p><h2>The Truth About &#8220;Qualifications&#8221;</h2><p>I went back and looked at my actual experience with fresh eyes:</p><h3>From Automation Engineering, I Had:</h3><ul><li><p>Built pipelines processing 2M+ events daily</p></li><li><p>Designed monitoring and alerting systems</p></li><li><p>Debugged production issues under pressure</p></li><li><p>Understood reliability and system design</p></li></ul><h3>From Data Analyst Work, I Had:</h3><ul><li><p>Mastered SQL (wrote hundreds of complex queries)</p></li><li><p>Understood stakeholder needs deeply</p></li><li><p>Built data models for business analytics</p></li><li><p>Created dashboards and reports</p></li></ul><h3>From My Education, I Had:</h3><ul><li><p>Master&#8217;s in Computer Science</p></li><li><p>Master&#8217;s in Engineering Management</p></li><li><p>Strong fundamentals in algorithms and systems</p></li></ul><p>I wasn&#8217;t missing data engineering experience. I was just calling it different names.</p><h2>How I Reframed My Background</h2><p>Instead of apologizing for my &#8220;non-traditional&#8221; path, I rewrote my story:</p><h3>What I Used to Say:</h3><p>&#8220;I&#8217;m trying to transition into data engineering. I don&#8217;t have direct DE experience, but I&#8217;m a fast learner.&#8221;</p><h3>What I Started Saying:</h3><p>&#8220;I&#8217;ve built data pipelines processing millions of events in my automation work. I bring SQL expertise from my analyst background and system design thinking from my engineering management studies. Here are three projects that demonstrate my DE skills...&#8221;</p><p>The skills were always there. I just needed to own them.</p><h2>I Applied. 
Here&#8217;s What Happened.</h2><p>I finally clicked &#8220;Apply&#8221; on that job posting (it had been reposted after 3 weeks - they were still looking).</p><p>The interview came two days later.</p><p>Interviewer: &#8220;Tell me about your data engineering experience.&#8221;</p><p>Old me would have said: &#8220;Well, I don&#8217;t have traditional DE experience, but...&#8221;</p><p>What I actually said: &#8220;I&#8217;ve built automation pipelines that process 2 million events daily, designed data models for business analytics, and worked extensively with SQL for data transformation. Let me walk you through my most complex project...&#8221;</p><p>I got the offer two weeks later.</p><p>$185K base + equity + remote flexibility.</p><p>More importantly? I loved the work. My &#8220;non-traditional&#8221; background turned out to be an advantage. I brought perspectives the team needed.</p><h2>Why This Matters for You</h2><p>If you&#8217;re reading this and thinking &#8220;but my background is different too,&#8221; that&#8217;s exactly my point.</p><p>Your background is probably MORE relevant than you think.</p><p>Coming from Software Engineering? You understand production systems, testing, and code quality - things many DEs struggle with.</p><p>Coming from Data Analysis? You understand stakeholder needs and data modeling - the &#8220;why&#8221; behind the pipelines.</p><p>Coming from DevOps/SRE? You understand reliability, monitoring, and infrastructure - critical for production data systems.</p><p>Coming from QA/Testing? You understand data quality, edge cases, and validation - increasingly important in DE.</p><p>&#8220;Non-traditional&#8221; is your advantage, not your weakness.</p><h2>The Real Requirements</h2><p>Here&#8217;s what companies ACTUALLY need (vs. 
what they write in job descriptions):</p><h3>Job Description Says:</h3><ul><li><p>5+ years data engineering experience</p></li><li><p>Expert in Spark, Airflow, dbt</p></li><li><p>CS degree required</p></li><li><p>Big data experience</p></li></ul><h3>What They Actually Need:</h3><ul><li><p>Can you solve data problems?</p></li><li><p>Can you learn new tools quickly?</p></li><li><p>Can you communicate with stakeholders?</p></li><li><p>Can you ship working pipelines?</p></li></ul><p>If you can do the work, the specific background doesn&#8217;t matter as much as you think.</p><h2>How to Position Your &#8220;Non-Traditional&#8221; Background</h2><h3>Map Your Transferable Skills</h3><p>Your Experience &#8594; DE Skill &#8594; How to Frame It</p><p>Built automation scripts &#8594; Pipeline development &#8594; &#8220;Built automated data pipelines processing X events/day&#8221;</p><p>SQL for reporting &#8594; Data transformation &#8594; &#8220;Designed and optimized SQL queries for data transformation&#8221;</p><p>API integrations &#8594; Data ingestion &#8594; &#8220;Integrated data from 10+ external APIs&#8221;</p><p>Monitoring systems &#8594; Data observability &#8594; &#8220;Implemented monitoring and alerting for data quality&#8221;</p><p>Reframe, don&#8217;t apologize.</p><h3>Build 1-2 Targeted Projects</h3><p>Pick projects that fill your biggest gaps:</p><p>If you lack pipeline experience: Build an end-to-end ETL pipeline with Airflow + dbt</p><p>If you lack big data experience: Process a large public dataset with Spark</p><p>If you lack cloud experience: Deploy a pipeline on AWS/GCP free tier</p><p>Time investment: 2-4 weekends per project</p><p>ROI: &#8220;Here&#8217;s a project I built&#8221; beats &#8220;I&#8217;m trying to learn&#8221; every time</p><h2>Apply Even If &#8220;Underqualified&#8221;</h2><p>My rule: If you have 60% of the requirements, apply.</p><p>Why?</p><ul><li><p>Job descriptions are wish lists, not requirements</p></li><li><p>They&#8217;re often 
written by recruiters who don&#8217;t know the role</p></li><li><p>Teams are frequently flexible on specific requirements</p></li><li><p>Worst case? They say no. You lose nothing.</p></li></ul><p>Let THEM decide if you&#8217;re qualified. Stop deciding for them.</p><h2>The Action Plan</h2><p>If you&#8217;re in a similar position, here&#8217;s what to do this week:</p><p><strong>Monday: Audit Your Skills (30 minutes)</strong><br>List everything you&#8217;ve built or done that involves data, even tangentially.</p><p><strong>Tuesday: Rewrite Your Resume (1 hour)</strong><br>Reframe your experience using data engineering language.</p><p><strong>Wednesday: Start Your First Project</strong><br>Pick one weekend project that addresses your biggest gap.</p><p><strong>Thursday-Friday: Apply to 5 Jobs (1 hour)</strong><br>Apply to roles where you have 60%+ of the requirements.</p><p><strong>Weekend: Ship Your Project</strong><br>Get something working and on GitHub.</p><p><strong>Next Monday: Apply to 5 More Jobs</strong><br>Iterate weekly.</p><p>Don&#8217;t wait until you&#8217;re &#8220;ready.&#8221; You&#8217;re already more ready than you think.</p><h2>The Hard Truth</h2><p>Three months after I got that job, they added someone else to the team.</p><p>Want to know her background?</p><p>Marketing analytics.</p><p>She had even less &#8220;traditional&#8221; DE experience than me. But she was brilliant, motivated, and brought a perspective the team needed.</p><p>That&#8217;s when I realized: The &#8220;perfect&#8221; candidate doesn&#8217;t exist. Teams need diverse backgrounds.</p><h2>What&#8217;s Holding You Back?</h2><p>I shared my story on LinkedIn yesterday, and within an hour, I got dozens of messages:</p><p>&#8220;This is exactly my situation. I have X experience but the job wants Y.&#8221;</p><p>So let me ask you:</p><p>What&#8217;s actually stopping you from applying?</p><ul><li><p>Is it missing skills? (You probably have transferable ones)</p></li><li><p>Is it the job description? (It&#8217;s a wish list, not a requirement)</p></li><li><p>Is it imposter syndrome? 
(Everyone has it, even with 2 Master&#8217;s degrees)</p></li></ul><p>The only thing that will definitely stop you from getting a DE job is not applying.</p><h2>Your Next Step</h2><p>Here&#8217;s what I want you to do after reading this:</p><ol><li><p>Find 3 job postings you&#8217;ve been avoiding because you feel &#8220;not qualified&#8221;</p></li><li><p>Make a list of why you think you&#8217;re not qualified</p></li><li><p>Reframe each objection as a transferable skill or learnable gap</p></li><li><p>Apply to at least 1 of them this week</p></li></ol><p>If you&#8217;re serious about making this transition, I&#8217;ve created a resource to help.</p><p><strong>What this resource includes</strong></p><ul><li><p>A <strong>Data Engineering Career Readiness Assessment</strong> to help you honestly evaluate where you stand today.</p></li><li><p>A curated list of <strong>40 hands-on data engineering projects</strong>, organized by difficulty, so you know exactly what to build next.</p></li><li><p>Clear guidance on <strong>what each project demonstrates</strong>, how long it takes, and how to talk about it in interviews.</p></li><li><p>Practical frameworks to <strong>translate non-traditional experience</strong> into data engineering language recruiters understand.</p></li><li><p>An interview-focused approach so you&#8217;re not just building projects, but learning how to <strong>explain decisions, tradeoffs, and impact.</strong></p></li></ul><p><a href="https://notebooklm.google.com/notebook/3d884043-d547-4dc2-b76c-5832cad7b89c">Access the resource here</a></p><h2>One Last Thing</h2><p>The person who got that job I almost didn&#8217;t apply for?</p><p>Me. 
</p><p>But it almost wasn&#8217;t.</p><p>I almost let imposter syndrome make that decision for me.</p><p>Don&#8217;t let your doubt decide your future.</p><p>That voice saying &#8220;not yet&#8221; is a liar.</p><p>You&#8217;re ready. Or close enough.</p><p>Apply anyway.</p><p>See you next week,</p><p>Avantikka Penumarty</p><p>P.S. What job are you going to apply for this week? Hit reply and tell me. I want to know.</p><p>P.P.S. If this resonated with you, forward it to someone else who needs to hear it. We all need permission to believe in ourselves sometimes.</p><div><hr></div><p><strong>Follow my journey:</strong><br>LinkedIn: @Avantikka_Penumarty<br>Twitter: @avantikka_penumarty<br>Instagram: @avantikka.penumarty</p><p>Not subscribed yet? Subscribe here for weekly data engineering career advice.<br>Want to chat? Reply to this email. I read every response.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Everyone Talks About Spark. 
SQL Still Runs the Data World.]]></title><description><![CDATA[How modern data pipelines actually move and scale]]></description><link>https://zero2dataengineer.substack.com/p/sql-in-data-engineering-2026</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/sql-in-data-engineering-2026</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Thu, 15 Jan 2026 13:15:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fJvW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>Hi everyone,</p><p>Before we begin, I want to share a quick, honest note and sincere apologies to all my readers. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I&#8217;ve been quieter than usual over the last few months. I was dealing with some personal and health-related challenges and needed to step back briefly. 
Thank you for your patience, messages, and continued support. It truly means more than you know.</p><p>I&#8217;m back now, and it felt right to restart with a topic that sits at the very core of data engineering, one that has quietly shaped almost every system I&#8217;ve worked on.</p><h2>Table of Contents</h2><ul><li><p>What is SQL in Data Engineering?</p></li><li><p>Why SQL is Crucial for Data Engineering</p></li><li><p>SQL for ETL vs. ELT Pipelines</p></li><li><p>Essential SQL for Data Engineers</p></li><li><p>SQL in Modern Data Engineering Tools</p></li><li><p>Best Practices for Writing SQL in Data Engineering</p></li><li><p>Future of SQL in Data Engineering</p></li><li><p>Conclusion</p></li><li><p>FAQs</p></li></ul><div><hr></div><h2>Prefer listening over reading?</h2><p>Are you on your way to work or heading back home?<br>Starting a run, folding laundry, or just taking a quiet break?</p><p>If reading feels like too much right now, I&#8217;ve got you.</p><p>I recorded an audio version of this newsletter so you can listen while you move through your day. Same ideas, same depth, just in a format that fits real life.</p><p>Plug in your headphones, press play, and let SQL make sense in the background while you take care of everything else.</p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;189282ce-5be2-43b1-b333-eab1075954c4&quot;,&quot;duration&quot;:271.64734,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><div><hr></div><h2>Introduction</h2><p>Structured Query Language (SQL) remains the foundation of data engineering, enabling data professionals to design, build, and maintain scalable data pipelines. 
Despite the rise of modern technologies like Apache Spark and NoSQL databases, SQL&#8217;s declarative syntax and universal adoption make it indispensable in real-world data engineering workflows.</p><p>In this piece, I&#8217;ll walk you through how SQL shows up in real data engineering work, what actually matters in practice, and why it continues to be one of the most valuable skills you can invest in as a data engineer.</p><h2>What is SQL in Data Engineering?</h2><p><em>Meta story:</em> Early in my career, I believed mastering tools would make me a great data engineer. Spark, Airflow, Kafka: I chased them all. What actually made my work reliable wasn&#8217;t a tool. It was the moment I truly understood SQL as a way of thinking: describing <em>what</em> the data should look like, not <em>how</em> to move every row. That shift changed how I designed pipelines forever.</p><p>At its core, SQL is the language we use to talk to data stored in relational systems: to ask questions, shape answers, and turn raw records into something meaningful.</p><p>In data engineering, SQL is used to:</p><ul><li><p>Ingest raw data</p></li><li><p>Clean and validate datasets</p></li><li><p>Transform data into analytics-ready models</p></li><li><p>Load data into warehouses and lakes</p></li></ul><p>SQL sits at the center of both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines, which makes it the backbone of modern data platforms.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fJvW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!fJvW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 424w, https://substackcdn.com/image/fetch/$s_!fJvW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 848w, https://substackcdn.com/image/fetch/$s_!fJvW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!fJvW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fJvW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png" width="2742" height="1500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bc234bb-99ac-4f12-9c09-867c104ebce5_2742x1500.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1500,&quot;width&quot;:2742,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8159196,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/184632503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbec5eb62-8686-48cd-a88f-bbf11fdc3997_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fJvW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 424w, https://substackcdn.com/image/fetch/$s_!fJvW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 848w, https://substackcdn.com/image/fetch/$s_!fJvW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!fJvW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7396fed-3677-4b5b-9129-93f280a4d81a_2742x1500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why SQL is Crucial for Data Engineering</h2><p><em>Project moment:</em> On one production pipeline, we had Python transformations, custom logic, and retries everywhere and still broke SLAs weekly. The fix wasn&#8217;t a rewrite. It was replacing fragmented logic with clear, well-structured SQL. Fewer lines. Fewer bugs. More trust.</p><h3>1. Data Extraction</h3><p>SQL makes it surprisingly easy to pull data from structured systems like PostgreSQL, MySQL, and Oracle, and even from modern platforms that support SQL-style querying such as BigQuery and Redshift.</p><h3>2. Data Transformation</h3><p>Data engineers rely on SQL for cleansing, aggregation, and normalization. 
Features like Common Table Expressions (CTEs), window functions, and subqueries allow complex transformations to remain readable and maintainable.</p><h3>3. Data Loading</h3><p>In practice, SQL-powered pipelines move data into warehouses and lakes in a way that keeps analytics teams productive and downstream systems stable.</p><h3>4. Data Integration</h3><p>By joining datasets across multiple systems, SQL helps engineers create unified data models that power reliable reporting and decision-making.</p><h3>5. Performance Optimization</h3><p>Modern SQL engines such as Apache Hive, Presto, and Spark SQL provide query optimization capabilities that reduce execution time, improve resource utilization, and scale analytics workloads.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!snTI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!snTI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 424w, https://substackcdn.com/image/fetch/$s_!snTI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 848w, https://substackcdn.com/image/fetch/$s_!snTI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 1272w, 
https://substackcdn.com/image/fetch/$s_!snTI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!snTI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png" width="1340" height="1782" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84e87705-07fa-4fbc-90ee-06144368ba3f_1340x1782.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1782,&quot;width&quot;:1340,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:685127,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/184632503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84e87705-07fa-4fbc-90ee-06144368ba3f_1340x1782.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!snTI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 424w, https://substackcdn.com/image/fetch/$s_!snTI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 848w, 
https://substackcdn.com/image/fetch/$s_!snTI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 1272w, https://substackcdn.com/image/fetch/$s_!snTI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58603c05-d225-498b-88d6-c1ac6f9b9754_1340x1782.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>SQL for ETL vs. 
ELT Pipelines</h2><p><em>Career insight:</em> I&#8217;ve seen engineers struggle not because they chose ETL or ELT, but because they didn&#8217;t understand <em>where</em> SQL belongs in the system. Once you see SQL as a first-class layer, not an afterthought, architectural decisions become simpler.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vf17!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vf17!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 424w, https://substackcdn.com/image/fetch/$s_!vf17!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 848w, https://substackcdn.com/image/fetch/$s_!vf17!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 1272w, https://substackcdn.com/image/fetch/$s_!vf17!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vf17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png" width="690" height="244.31906614785993" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10752722-3ab0-4f88-bc0e-ca32e4681069_1542x546.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1542,&quot;resizeWidth&quot;:690,&quot;bytes&quot;:147665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/184632503?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10752722-3ab0-4f88-bc0e-ca32e4681069_1542x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vf17!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 424w, https://substackcdn.com/image/fetch/$s_!vf17!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 848w, https://substackcdn.com/image/fetch/$s_!vf17!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 1272w, https://substackcdn.com/image/fetch/$s_!vf17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b19bd-e21c-4664-9f10-6f2aaa88171e_1542x546.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>SQL sits at the center of both approaches. In recent years, ELT has become more common, largely because cloud warehouses make it easier and cheaper to run large-scale transformations in parallel after the data has been loaded.</p><h2>Essential SQL for Data Engineers</h2><p><em>Interview reality:</em> Almost every senior data engineering interview I&#8217;ve seen comes down to this section. Not syntax trivia, but whether you can express business logic clearly, safely, and efficiently in SQL.</p><h3>1. Window Functions</h3><p>Use window functions when you need running totals, rankings, or comparisons across groups without losing row-level detail.</p><pre><code>SELECT
    customer_id,
    order_date,
    SUM(order_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS cumulative_sales
FROM orders;</code></pre><h3>2. Common Table Expressions (CTEs)</h3><p>CTEs make complex logic easier to read, reason about, and safely modify over time.</p><pre><code>WITH recent_orders AS (
    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE order_date &gt; '2026-01-01'
)
SELECT * FROM recent_orders;</code></pre><h3>3. Joins</h3><p>Most real-world datasets only make sense once multiple tables are joined together; this is where SQL earns its keep.</p><pre><code>SELECT customers.name, orders.order_id
FROM customers
JOIN orders
ON customers.customer_id = orders.customer_id;</code></pre><h3>4. Indexes and Query Optimization</h3><p>Indexes improve read performance, while query planners and EXPLAIN statements help identify bottlenecks in large-scale systems.</p><h3>5. Data Partitioning</h3><p>Partitioning large tables improves performance in distributed systems such as Hive and BigQuery by limiting the amount of data scanned.</p><h2>SQL in Modern Data Engineering Tools</h2><p><em>Meta observation:</em> Tools change faster than job titles. What stays constant is SQL acting as the common language across platforms, the one skill that transfers cleanly when stacks evolve.</p><table><thead><tr><th>Tool</th><th>Purpose</th><th>SQL Role</th></tr></thead><tbody><tr><td>Apache Hive</td><td>Data warehousing on Hadoop</td><td>HiveQL for querying HDFS</td></tr><tr><td>Apache Spark SQL</td><td>Distributed data processing</td><td>SQL on DataFrames</td></tr><tr><td>Google BigQuery</td><td>Serverless data warehouse</td><td>Standard SQL</td></tr><tr><td>AWS Redshift</td><td>Cloud data warehouse</td><td>PostgreSQL-like SQL</td></tr><tr><td>Snowflake</td><td>Cloud data platform</td><td>ANSI SQL</td></tr><tr><td>dbt</td><td>Data transformation</td><td>SQL-based modeling</td></tr></tbody></table><h2>Best Practices for Writing SQL in Data Engineering</h2><p><em>Production lesson:</em> If someone else can&#8217;t understand your SQL six months later, including you, it will eventually cost time, trust, or money. 
Readability is not optional in production systems.</p><ol><li><p><strong>Use CTEs for complex logic</strong><br>Break queries into logical steps to improve readability and maintainability.</p></li><li><p><strong>Avoid </strong><code>SELECT *</code><br>Explicitly select required columns to reduce data scanning and improve performance.</p></li><li><p><strong>Leverage indexes and partitioning</strong><br>Use clustering, partition keys, and indexes to limit how much data each query has to scan.</p></li><li><p><strong>Monitor query performance</strong><br>Analyze execution plans using EXPLAIN statements to identify inefficiencies.</p></li><li><p><strong>Follow data governance standards</strong><br>Ensure compliance with organizational policies around data security, privacy, and access control.</p></li></ol><h2>Future of SQL in Data Engineering</h2><p><em>Forward-looking thought:</em> SQL isn&#8217;t competing with new paradigms; it&#8217;s absorbing them. Streaming, federated queries, and data mesh architectures are all bending toward SQL as the shared interface.</p><p>Despite the growth of NoSQL and distributed systems, SQL&#8217;s declarative nature keeps it relevant: you describe the result you want, and the engine decides how to compute it.</p><p>Key trends shaping the future:</p><ul><li><p><strong>SQL on streaming data</strong> using platforms like Apache Flink and ksqlDB</p></li><li><p><strong>Federated queries</strong> enabling cross-platform data access</p></li><li><p><strong>SQL in data mesh architectures</strong> as a shared querying layer across decentralized domains</p></li></ul><p>SQL is not being replaced; it is evolving alongside modern data architectures.</p><h2>Conclusion</h2><p><em>Closing reflection:</em> When data systems fail, the root cause is rarely &#8220;bad data.&#8221; It&#8217;s unclear logic. SQL, when written well, makes intent explicit, and that&#8217;s why it continues to matter.</p><p>SQL is far more than just a querying language. 
It is the backbone of data engineering, powering data ingestion, transformation, integration, and analytics at scale.</p><p>As tools and platforms evolve, SQL&#8217;s clarity, expressiveness, and adaptability ensure it remains an essential skill for data engineers, data scientists, and analytics professionals.</p><h2>FAQs</h2><p><strong>Is SQL used in data engineering?</strong><br>Yes. SQL is fundamental to data engineering and is used extensively for data extraction, transformation, loading, validation, and modeling.</p><p><strong>How do I become a SQL data engineer?</strong><br>Build a strong foundation in SQL and database systems, practice query optimization, learn data modeling, and gain hands-on experience with modern cloud data warehouses. Complement SQL with Python for automation and orchestration.</p><p><strong>Is SQL still relevant in 2026?</strong><br>Absolutely. SQL remains one of the most in-demand skills due to its deep integration with cloud platforms, analytics tools, and modern data stacks.</p><p><strong>Are Python and SQL enough for data engineering?</strong><br>They form a strong foundation, but data engineers also benefit from learning data orchestration tools, distributed systems, and cloud platforms.</p><p><strong>Should data engineers know SQL?</strong><br>Yes. 
SQL is essential for building reliable data pipelines, modeling data, and ensuring data quality.</p><p><strong>What are some of the best SQL courses for data engineers?</strong></p><ul><li><p>PostgreSQL for Everybody (Coursera)</p></li><li><p>SQL Fundamentals (Dataquest)</p></li><li><p>The Ultimate MySQL Bootcamp (Udemy)</p></li><li><p>Complete SQL Mastery (CodeWithMosh)</p></li><li><p>Advanced SQL for Data Engineering (Udemy)</p></li></ul><div><hr></div><p>&#128279; <strong>Follow <a href="https://www.linkedin.com/in/avantikkapenumarty/">Avantikka Penumarty</a> on LinkedIn</strong></p>
<p>&#128274; <strong>Want to turn this into real skill?</strong><br>Access interactive SQL flashcards, quizzes, and AI-powered explanations in the paid section below.</p><h2>Take This From Insight to Instinct</h2><p>Reading builds understanding.<br>Practice builds confidence.</p><p>If SQL plays a role in your job, interviews, or long-term growth, passive reading isn&#8217;t enough.</p><p>I&#8217;ve created an <strong>interactive learning extension</strong> to this newsletter designed to help you <em>think in SQL</em>, not just recognize syntax.</p><p><strong>Inside the paid section, you&#8217;ll get:</strong></p><ul><li><p>&#129504; <strong>Flashcards</strong> to reinforce core SQL concepts and mental models</p></li><li><p>&#129514; <strong>Scenario-based quizzes</strong> that mirror real data engineering decisions</p></li><li><p>&#129302; <strong>AI-powered explanations</strong> that walk you through the <em>why</em>, not just the answer</p></li><li><p>&#127959;&#65039; <strong>Applied reasoning</strong> you can reuse in production systems and interviews</p></li></ul><p>This is how you move from:</p><blockquote><p>&#8220;I&#8217;ve read this&#8221;<br>to<br>&#8220;I can apply this under pressure.&#8221;</p></blockquote>
      <p>
          <a href="https://zero2dataengineer.substack.com/p/sql-in-data-engineering-2026">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Go from New Grad to Data Engineer ]]></title><description><![CDATA[Use this roadmap to build your first DE portfolio & land your first offer &#8212; without a CS degree or 5 YOE. Designed by Avantikka Penumarty (Ex-META | Snr. Data Engineer)]]></description><link>https://zero2dataengineer.substack.com/p/how-to-go-from-new-grad-to-data-engineer</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/how-to-go-from-new-grad-to-data-engineer</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Mon, 02 Jun 2025 16:00:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f02c0e3f-6f8b-4b1b-998a-542023c26eda_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>&#128075; Welcome!</h2><p>Hi there,</p><p>You&#8217;re here because you want more than just theory. You want a roadmap that actually works.</p><p>This isn&#8217;t fluff. I built this guide based on what I wish someone gave me when I was starting out &#8212; real skills, real proof, and real steps that lead to interviews and offers. No gatekeeping. No unnecessary jargon.</p><p>I&#8217;m not here to sell you a bootcamp. I&#8217;m here to show you that you can build production-level data systems &#8212; even without a CS degree, fancy title, or 5+ years of experience.</p><p>If you're serious about making the jump into DE, everything you need to start is in this email.</p><p>Let&#8217;s begin.</p>
<div><hr></div><h3>If you're a new grad trying to break into Data Engineering, you're not alone.</h3><p>Data Engineering is one of the fastest-growing roles in tech today &#8212; but also one of the most misunderstood. Most beginners assume they just need to learn SQL and Python. But here's the hard truth:</p><p><strong>Knowing </strong><code>SELECT * FROM table_name</code><strong> is not enough.</strong></p><p>Thousands of grads take bootcamps, binge YouTube tutorials, and still never land interviews. Why? Because they have no real-world proof that they can build or maintain data systems.</p><p>And in this field, that <strong>proof</strong> matters more than any degree or certificate.</p><p>So let&#8217;s fix that.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;JOIN A COMMUNITY OF 10K+ Data Engineers&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>JOIN A COMMUNITY OF 10K+ Data Engineers</span></a></p><div><hr></div><h3>Why Data Engineering?</h3><p>It&#8217;s one of the most practical, future-proof, and underrated roles in tech today.</p><p>Perfect for those who:</p><ul><li><p>Enjoy backend more than frontend</p></li><li><p>Love building real systems over tweaking models</p></li><li><p>Want high pay, low ego, and fewer LeetCode puzzles</p></li></ul><p>But DE is also one of the most misunderstood roles.</p><p>New grads 
think they just need SQL + Python.<br>Others spend months collecting certifications.<br>But neither gets interviews.</p><p><strong>Why?</strong> Because they&#8217;re missing the one thing that matters most:</p><p><strong>Proof that you can build and run data systems.</strong></p><p>That <strong>proof &gt; any degree, certificate, or keyword-stuffed resume.</strong></p><div><hr></div><h3>What Do Data Engineers <em>Actually</em> Do at Work?</h3><p>This is what you'll be expected to do in a real DE role:</p><ul><li><p>Build pipelines that ingest, clean, and transform raw data</p></li><li><p>Maintain data quality, integrity, and freshness</p></li><li><p>Work with analysts, ML teams, and backend engineers</p></li><li><p>Manage cloud resources (like S3, Snowflake, BigQuery)</p></li><li><p>Monitor systems, set up alerts, and debug failures</p></li><li><p>Optimize queries, improve performance, and own uptime</p></li></ul><p>Think of yourself as the <strong>plumber of the data world</strong>. You build and maintain the pipes that make everything else possible.</p><div><hr></div><h3>Here&#8217;s What You Actually Need to Land Your First DE Role:</h3><ol><li><p><strong>Go beyond basic SQL.</strong><br>Learn how to use CTEs, window functions, GROUPING SETS, and indexes. Hiring managers want to see that you can write performant queries, not just run simple reports.</p></li><li><p><strong>Build ETL pipelines.</strong><br>Get hands-on with Airflow and dbt. Learn to extract messy data, clean it up, and load it into warehouses. Use cloud tools like AWS S3 or GCP BigQuery to simulate real infrastructure.</p></li><li><p><strong>Understand data modeling.</strong><br>Study how companies design scalable data systems. Know what a Star Schema is. Learn dimensional modeling. Explore how tools like Snowflake and Redshift structure data for analytics at scale.</p></li><li><p><strong>Publish real projects.</strong><br>Build 1&#8211;2 end-to-end data pipelines using public datasets. 
Document them clearly on GitHub. A great portfolio can often do more than a great resume &#8212; especially when you're just starting out.</p></li></ol><p>And what you <strong>don&#8217;t</strong> need:</p><p>&#10060; A CS degree<br>&#10060; 5+ years of experience<br>&#10060; Perfect code</p><p>You just need to prove that you understand how data flows through a system &#8212; and that you can make it work in production, not just in a classroom.</p><div><hr></div><h3>New Grad to Data Engineer: The Real Roadmap</h3><div><hr></div><h4>Month 1 &#8211; Build Your Foundations (SQL + Python)</h4><p>This month is NOT just about &#8220;learning basics.&#8221; It&#8217;s about mastering fundamentals that you&#8217;ll actually use in interviews and pipelines.</p><h4>SQL:</h4><ul><li><p>Learn SELECT, WHERE, GROUP BY, ORDER BY</p></li><li><p>Go deep into <strong>Window Functions</strong>: ROW_NUMBER, RANK, LAG, LEAD</p></li><li><p>Practice <strong>CTEs, Subqueries</strong>, and <strong>Indexing Basics</strong></p></li><li><p>Learn performance tuning (EXPLAIN plans, indexes)</p></li></ul><p>&#128073; <em>Resources:</em> LeetCode (SQL), Mode Analytics SQL Tutorial, OneCompiler</p><h4>Python:</h4><ul><li><p>Learn data structures: Lists, Dicts, Tuples</p></li><li><p>Write basic ETL scripts using <code>pandas</code></p></li><li><p>Practice reading/writing files (CSV, JSON, API calls)</p></li><li><p>Understand exception handling and basic functions</p></li></ul><p>&#128073; <em>Resources:</em> Dataquest, Jupyter Notebook, Kaggle Datasets</p><div><hr></div><h3><strong>My DE Prep Scheduler (SQL + Python)</strong></h3><p><em>Here&#8217;s how I structured my own study plan when I first applied for Data Engineering roles. No fluff. 
Just habits that stuck.</em></p><div><hr></div><h3><strong>Week 1 &#8211; SQL Core + Python Setup</strong></h3><p><strong>Goal:</strong> Build confidence with basic SQL and get your Python environment ready.</p><p><strong>Mon&#8211;Tue:</strong></p><ul><li><p>SQL: SELECT, WHERE, GROUP BY, ORDER BY</p></li><li><p>Resource: Mode SQL Tutorial + OneCompiler (hands-on)</p></li><li><p>Python: Install Python, set up Jupyter Notebook or VS Code</p></li></ul><p><strong>Wed&#8211;Thu:</strong></p><ul><li><p>Python: Data types, loops, functions, lists &amp; dicts</p></li><li><p>SQL: Write queries using GROUP BY and filters on public datasets (Kaggle or LeetCode)</p></li></ul><p><strong>Friday:</strong></p><ul><li><p>Mini Project: Analyze a CSV file using pandas, filter rows by a condition, and keep the top results</p></li><li><p>Output: Save your script and a snapshot of the query results</p></li></ul><div><hr></div><h3><strong>Week 2 &#8211; SQL Intermediate + ETL in Python</strong></h3><p><strong>Goal:</strong> Get hands-on with intermediate SQL and write your first ETL script.</p><p><strong>Mon&#8211;Tue:</strong></p><ul><li><p>SQL: CTEs, Subqueries (solve 3 problems/day on LeetCode)</p></li><li><p>Python: Read/write CSV + JSON files, understand APIs with <code>requests</code></p></li></ul><p><strong>Wed&#8211;Thu:</strong></p><ul><li><p>Build: ETL Script (Extract COVID API &#8594; Clean &#8594; Write to local file or SQLite)</p></li><li><p>SQL: Join practice + filtering with aliases and subqueries</p></li></ul><p><strong>Friday:</strong></p><ul><li><p>GitHub Upload: Push your ETL script with README</p></li><li><p>Bonus: Create a 1-pager explaining what your script does</p></li></ul><div><hr></div><h3><strong>Week 3 &#8211; Window Functions + Real Data Handling</strong></h3><p><strong>Goal:</strong> Use advanced SQL and clean real-world data.</p><p><strong>Mon&#8211;Tue:</strong></p><ul><li><p>SQL: ROW_NUMBER, RANK, LAG, LEAD (focus on order + partition logic)</p></li><li><p>Python: Use pandas to clean messy 
real-world data (missing values, data types)</p></li></ul><p><strong>Wed&#8211;Thu:</strong></p><ul><li><p>Practice: NYC Taxi or Netflix dataset &#8212; clean with pandas, summarize with SQL</p></li><li><p>SQL: Practice use cases like top 3 per category, running totals</p></li></ul><p><strong>Friday:</strong></p><ul><li><p>Output: Save your cleaned dataset, sample queries, and explain what insights you found</p></li></ul><div><hr></div><h3><strong>Week 4 &#8211; Optimization + Portfolio Building</strong></h3><p><strong>Goal:</strong> Wrap Month 1 with real proof of work.</p><p><strong>Mon&#8211;Tue:</strong></p><ul><li><p>SQL: Learn EXPLAIN plan, indexing, query performance basics</p></li><li><p>Python: Add logging, exception handling to your ETL script</p></li></ul><p><strong>Wed&#8211;Thu:</strong></p><ul><li><p>Portfolio Time: Write a full README for your project</p></li><li><p>Push code + screenshots to GitHub</p></li></ul><p><strong>Friday:</strong></p><ul><li><p>Reflection:</p><ul><li><p>What did you learn?</p></li><li><p>What would you do differently?</p></li><li><p>What&#8217;s the next dataset you want to try?</p></li></ul></li></ul><div><hr></div><h3>Tools I Used (And Still Recommend):</h3><ul><li><p><strong>SQL Practice:</strong> LeetCode SQL, Mode, OneCompiler</p></li><li><p><strong>Python IDE:</strong> Jupyter Notebook, VS Code</p></li><li><p><strong>Data:</strong> Kaggle, NYC OpenData, COVID API</p></li><li><p><strong>Version Control:</strong> GitHub</p></li><li><p><strong>ETL Stack:</strong> pandas + SQLite (perfect beginner combo)</p></li></ul><div><hr></div><h3><strong>Month 2 &#8211; Learn ETL + Orchestration (Real Project Work Begins)</strong></h3><p>This month is about translating your learning into <strong>real pipelines</strong>. 
You&#8217;ll now build actual systems that move and transform data.</p><blockquote><p>Choose 1&#8211;2 projects below based on your tool comfort (local/cloud/dbt/Airflow).<br>The goal is <strong>not</strong> to learn everything &#8212; it&#8217;s to <strong>complete one pipeline end-to-end</strong> and publish it on GitHub.</p></blockquote><div><hr></div><h3><strong>Project 1: Local CSV to SQLite ETL (Beginner-Friendly, No Cloud)</strong></h3><p><strong>Stack:</strong> <code>Python</code>, <code>pandas</code>, <code>SQLite</code>, <code>cron job</code> (or Airflow optional)</p><p><strong>Problem Statement:</strong><br>Build a pipeline that pulls NYC Taxi data (CSV), cleans it using pandas, and stores it in a local SQLite DB.</p><p><strong>Steps:</strong></p><ol><li><p>Download public CSV dataset</p></li><li><p>Clean &amp; transform with pandas (fix datatypes, nulls, etc.)</p></li><li><p>Load into a local SQLite DB</p></li><li><p>Schedule pipeline using cron or basic Airflow DAG</p></li></ol><p><strong>Outcome:</strong></p><ul><li><p>Simple file-based ETL</p></li><li><p>Lightweight &amp; fully local</p></li><li><p>Teaches end-to-end scripting</p></li></ul><div><hr></div><h3><strong>Project 2: API to Cloud Warehouse with Airflow</strong></h3><p><strong>Stack:</strong> <code>Python</code>, <code>Airflow</code>, <code>AWS S3</code>, <code>BigQuery</code> or <code>Snowflake</code></p><p><strong>Problem Statement:</strong><br>Pull daily COVID-19 data from a public API, store raw files in S3, process data, and load it into a warehouse.</p><p><strong>Steps:</strong></p><ol><li><p>Extract from API (requests + JSON)</p></li><li><p>Save raw JSON/CSV to S3</p></li><li><p>Clean data in pandas or dbt</p></li><li><p>Load to Snowflake or BigQuery</p></li><li><p>Orchestrate everything with Airflow DAG</p></li></ol><p><strong>Outcome:</strong></p><ul><li><p>Full cloud-native ETL pipeline</p></li><li><p>Shows data ingestion + orchestration</p></li><li><p>Great proof of cloud 
skillset</p></li></ul><div><hr></div><h3><strong>Project 3: Local CSV to dbt + DuckDB (No Cloud, SQL Focus)</strong></h3><p><strong>Stack:</strong> <code>dbt</code>, <code>DuckDB</code>, <code>CSV</code>, <code>Jinja</code>, <code>SQL</code></p><p><strong>Problem Statement:</strong><br>Use dbt to build a transformation pipeline that models ecommerce order data from a local CSV.</p><p><strong>Steps:</strong></p><ol><li><p>Load CSV into DuckDB (acts like a local warehouse)</p></li><li><p>Create staging and mart models with dbt</p></li><li><p>Apply SQL transformations using dbt</p></li><li><p>Generate documentation &amp; DAG visualizations</p></li></ol><p><strong>Outcome:</strong></p><ul><li><p>Teaches modeling &amp; transformations</p></li><li><p>Easy to run locally</p></li><li><p>Helps learn dbt structure and SQL best practices</p></li></ul><div><hr></div><h3><strong>Project 4: Reddit Data Pipeline with Python + MongoDB</strong></h3><p><strong>Stack:</strong> <code>Python</code>, <code>PRAW API</code>, <code>MongoDB</code>, <code>Airflow</code> (optional)</p><p><strong>Problem Statement:</strong><br>Extract posts from a subreddit using Reddit&#8217;s API and store them into a NoSQL database.</p><p><strong>Steps:</strong></p><ol><li><p>Authenticate with Reddit using PRAW</p></li><li><p>Extract posts &amp; comments</p></li><li><p>Clean and process text</p></li><li><p>Store into MongoDB</p></li><li><p>Optional: schedule using Airflow</p></li></ol><p><strong>Outcome:</strong></p><ul><li><p>Exposure to unstructured data</p></li><li><p>Real-world use of APIs + NoSQL</p></li><li><p>Fun and engaging project for resumes</p></li></ul><div><hr></div><h3><strong>Project 5: Batch to Analytics Dashboard (SQL + Streamlit)</strong></h3><p><strong>Stack:</strong> <code>pandas</code>, <code>SQLite</code>, <code>Streamlit</code>, <code>Matplotlib</code>, <code>SQL</code></p><p><strong>Problem Statement:</strong><br>Ingest historical sales data and build an analytics dashboard to track 
trends.</p><p><strong>Steps:</strong></p><ol><li><p>Ingest CSV files weekly into SQLite</p></li><li><p>Use pandas/SQL to analyze key metrics (revenue, retention, cohort)</p></li><li><p>Build a live dashboard with Streamlit</p></li><li><p>Optional: Automate ingestion with cron job</p></li></ol><p><strong>Outcome:</strong></p><ul><li><p>Combines data engineering with dashboarding</p></li><li><p>Useful for end-user reporting</p></li><li><p>Makes you stand out for DE/BI hybrid roles</p></li></ul><div><hr></div><h3>Tip for All Projects:</h3><p>Each project should include:</p><ul><li><p><code>README.md</code> (with overview, stack, steps, and diagram)</p></li><li><p>Pipeline code/scripts/notebooks</p></li><li><p>Screenshot of output or working dashboard</p></li><li><p>Optional: Loom video explaining your pipeline</p></li></ul><div><hr></div><h3>How to Talk About Projects in Interviews</h3><p>Most candidates freeze here. Don&#8217;t.</p><p>Here&#8217;s how to prep for interview questions about your projects:</p><ul><li><p><em>Why did you choose this dataset?</em></p></li><li><p><em>What challenges did you face while cleaning or transforming it?</em></p></li><li><p><em>What trade-offs did you make in designing the pipeline?</em></p></li><li><p><em>How would you scale this pipeline for daily use?</em></p></li><li><p><em>What would you improve if given more time?</em></p></li></ul><p>Have clear answers. Show thought process. 
That&#8217;s how you stand out.</p><div><hr></div><h3>Common Traps to Avoid</h3><p>Avoid these if you want to land a job faster:</p><ul><li><p>Spending 3 months making your README &#8220;aesthetic&#8221;</p></li><li><p>Overbuilding with 6 AWS tools before writing your first ETL script</p></li><li><p>Learning SQL, Python, and Airflow in isolation &#8212; instead of <em>connecting them in one project</em></p></li><li><p>Binge-watching tutorials without building a single pipeline</p></li><li><p>Applying to 100 jobs with no GitHub proof of work</p></li></ul><div><hr></div><h3><strong>One Last Thought:</strong></h3><p>If you're stuck, it&#8217;s not because you're not smart enough. It's because no one ever gave you the real playbook.</p><p>This newsletter <em>is</em> that playbook.<br>If it helped you, forward it to a friend who's trying to get into tech. You never know whose life it might change.</p><p>&#8211;<br>Avantikka Penumarty<br>Ex-META | Snr. Data Engineer | Founder, Zero to Data Engineer<br>zero2dataengineer.substack.com</p>]]></content:encoded></item><item><title><![CDATA[Prefect vs Airflow vs Dagster]]></title><description><![CDATA[What You Pick Tells Me Everything About How You Think]]></description><link>https://zero2dataengineer.substack.com/p/prefect-vs-airflow-vs-dagster</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/prefect-vs-airflow-vs-dagster</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Sat, 24 May 2025 00:30:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5f906a77-1248-4735-a8b1-7acd36c91db3_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most engineers choose orchestration tools like they&#8217;re picking a side in a debate.</p><p>&#8220;Airflow is legacy!&#8221;<br>&#8220;Dagster is the future!&#8221;<br>&#8220;Prefect is so clean!&#8221;</p><p>But you&#8217;re not here for debate. 
You&#8217;re here to <strong>ship pipelines, learn fast, and get hired</strong>.</p><p>So let&#8217;s get real:<br>Which tool should <em>you</em> pick &#8212; for your project, for your portfolio, or for your resume?</p><p>This guide won&#8217;t compare every feature.<br>It&#8217;ll show you how to <strong>think like a system builder</strong>, not a tool fangirl.</p><div><hr></div><h3>1. When to Use <strong>Airflow</strong></h3><p>Use Airflow when:</p><ul><li><p>You&#8217;re applying to big tech or mature data teams</p></li><li><p>You want to show you understand orchestration, retries, and scheduling</p></li><li><p>You need strong scheduling and visibility (UI, logs, retries)</p></li></ul><p>Why it works for your <strong>resume</strong>:</p><blockquote><p>&#8220;Familiar with industry-standard orchestration using Airflow, including DAG design, sensors, retry logic, and alerting.&#8221;</p></blockquote><p>Why it&#8217;s great for <strong>teaching yourself the fundamentals</strong>:<br>Airflow makes you learn how pipelines <strong>actually</strong> run: tasks, triggers, failures, dependencies. 
It exposes orchestration in raw form.</p><p>Where it&#8217;s weaker:</p><ul><li><p>Not great for data scientists or notebooks</p></li><li><p>Heavier to deploy without Docker</p></li><li><p>Harder to reason about with dynamic workflows</p></li></ul><div><hr></div><h3>2. When to Use <strong>Prefect</strong></h3><p>Use Prefect when:</p><ul><li><p>You&#8217;re building a lightweight project</p></li><li><p>You don&#8217;t want to mess with Airflow configs</p></li><li><p>You like writing clean Python with decorators</p></li></ul><p>Why it works for <strong>solo projects</strong>:</p><ul><li><p>It&#8217;s Pythonic and elegant</p></li><li><p>Easy to get up and running</p></li><li><p>Great docs, fast dev feedback loop</p></li></ul><p>Why it works for <strong>data science/analytics engineers</strong>:</p><ul><li><p>You can orchestrate model training, dbt, API tasks quickly</p></li><li><p>You don&#8217;t need to explain DAGs to non-engineers</p></li></ul><p>Where it shines on a resume:</p><blockquote><p>&#8220;Used Prefect to orchestrate model training and dbt transformations in a low-latency ML workflow.&#8221;</p></blockquote><p>Where it&#8217;s weaker:</p><ul><li><p>Less recognized by recruiters</p></li><li><p>Smaller community</p></li><li><p>Not ideal for heavy-duty, multi-team orchestration</p></li></ul><div><hr></div><h3>3. 
When to Use <strong>Dagster</strong></h3><p>Use Dagster when:</p><ul><li><p>You care about type safety, IO contracts, and observability</p></li><li><p>You want to model your data pipeline like software</p></li><li><p>You&#8217;re building a <strong>data platform</strong>, not just a DAG</p></li></ul><p>What makes Dagster interesting:</p><ul><li><p>Built-in concepts like assets, software-defined pipelines</p></li><li><p>First-class support for data lineage and testing</p></li><li><p>Powerful for collaborative teams that want <em>engineer-level</em> visibility</p></li></ul><p>On your resume:</p><blockquote><p>&#8220;Designed asset-aware DAGs in Dagster to enforce data lineage, retry logic, and schema-aware contracts.&#8221;</p></blockquote><p>Where Dagster may not be right:</p><ul><li><p>Overkill for simple projects</p></li><li><p>Less intuitive for beginners</p></li><li><p>Still evolving rapidly &#8212; may have rough edges</p></li></ul><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!585-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!585-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 424w, https://substackcdn.com/image/fetch/$s_!585-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 848w, 
https://substackcdn.com/image/fetch/$s_!585-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 1272w, https://substackcdn.com/image/fetch/$s_!585-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!585-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png" width="1456" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/164423537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!585-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 424w, 
https://substackcdn.com/image/fetch/$s_!585-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 848w, https://substackcdn.com/image/fetch/$s_!585-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 1272w, https://substackcdn.com/image/fetch/$s_!585-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd951a63c-3fec-4309-aaba-70c9cebf6b1a_1666x732.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3>What Your Tool Choice Says About You</h3><p>If you picked one of these tools, here is what you&#8217;re signaling:</p><ul><li><p><strong>Airflow:</strong> &#8220;I understand production-scale systems.&#8221;</p></li><li><p><strong>Prefect:</strong> &#8220;I move fast and iterate cleanly in Python.&#8221;</p></li><li><p><strong>Dagster:</strong> &#8220;I think in contracts, lineage, and scale.&#8221;</p></li></ul><p>There is no &#8220;best&#8221; tool &#8212; only the one that aligns with:</p><ul><li><p><strong>The story you want to tell</strong></p></li><li><p><strong>The type of work you want to do</strong></p></li><li><p><strong>The types of teams you want to join</strong></p></li></ul><div><hr></div><h3>The Hidden Skill Behind Tool Choice: Narrative Alignment</h3><p>The tool you use reflects how you <strong>frame problems</strong>.</p><p>Interviewers aren&#8217;t just looking at which orchestrator you picked &#8212; they&#8217;re evaluating whether your <strong>mental model</strong> matches the role.</p><ul><li><p>Use <strong>Airflow</strong> if you want to tell a story about <strong>scale, resilience, and legacy integration</strong></p></li><li><p>Use <strong>Prefect</strong> if you want to show you&#8217;re <strong>experiment-driven, lean, and fast-moving</strong></p></li><li><p>Use <strong>Dagster</strong> if you want to prove you're a <strong>system thinker who designs for complexity and traceability</strong></p></li></ul><p>This is less about syntax, more about <strong>signal</strong>. 
Make your choice reflect your <strong>intended audience</strong>.</p><div><hr></div><h3>What Top Companies Use (And Why It Matters)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dtvB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dtvB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 424w, https://substackcdn.com/image/fetch/$s_!dtvB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 848w, https://substackcdn.com/image/fetch/$s_!dtvB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 1272w, https://substackcdn.com/image/fetch/$s_!dtvB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dtvB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png" width="1456" height="479" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88305,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/164423537?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dtvB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 424w, https://substackcdn.com/image/fetch/$s_!dtvB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 848w, https://substackcdn.com/image/fetch/$s_!dtvB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 1272w, https://substackcdn.com/image/fetch/$s_!dtvB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14bed34-f7c4-4cd5-bb08-73222295b077_1630x536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>If you're applying somewhere &#8212; match your project tool with what that company runs in production (or wants to).</em></p><div><hr></div><h3>How to Frame This on Your Resume</h3><p>Don&#8217;t write:</p><blockquote><p>&#8220;Used Airflow for DAG orchestration.&#8221;</p></blockquote><p>Write:</p><blockquote><p>&#8220;Designed idempotent, retry-aware data pipelines using Airflow for scalable ETL orchestration across S3, Spark, and Snowflake &#8212; with SLA tracking and alerting.&#8221;</p></blockquote><p>Or:</p><blockquote><p>&#8220;Built fast, modular Prefect flows to orchestrate data science pipelines with resume-to-dashboard visibility in under 15 minutes per run.&#8221;</p></blockquote><p>Or:</p><blockquote><p>&#8220;Architected lineage-aware pipelines using Dagster assets and config mapping to reduce failure recovery 
time by 80% across multiple data teams.&#8221;</p></blockquote><p><strong>The tool isn&#8217;t the flex. The system design behind it is.</strong></p><div><hr></div><h3>Thinking Like a Hiring Manager</h3><p>If I&#8217;m hiring:</p><ul><li><p><strong>Airflow</strong> tells me you&#8217;ve worked with mature pipelines and understand operational burden.</p></li><li><p><strong>Prefect</strong> tells me you can build fast, are probably solo or hybrid (data + product), and can ship.</p></li><li><p><strong>Dagster</strong> tells me you write pipelines like software &#8212; clean, testable, typed &#8212; and are comfortable with architecture decisions.</p></li></ul><p>If you&#8217;re junior? Pick one and go deep.<br>If you&#8217;re senior? Know when to choose which &#8212; and explain it.</p><div><hr></div><h3>Final Advice</h3><p>Don&#8217;t just pick the tool that&#8217;s hyped.<br>Pick the one that <strong>matches how you want to think and explain your work</strong>.</p><p>What you build is one thing.<br>How you <em>talk about it</em> is what gets you hired.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Airflow Best Practices — What to Avoid in Projects & Interviews]]></title><description><![CDATA[From &#8216;It Worked on My Machine&#8217; to &#8216;It Survives in Prod&#8217;]]></description><link>https://zero2dataengineer.substack.com/p/airflow-best-practices-what-to-avoid</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/airflow-best-practices-what-to-avoid</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Fri, 23 May 2025 00:30:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cb9d4320-3c31-4338-aa6a-262c6e9a236f_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most people say,</p><blockquote><p>&#8220;Yeah, I&#8217;ve used Airflow.&#8221;</p></blockquote><p>But interviewers can tell <em>within 60 seconds</em> whether you actually 
understand it &#8212; or just ran someone else&#8217;s DAG.</p><p>This post is not a checklist of features.<br>It&#8217;s a breakdown of <strong>the top Airflow mistakes that break pipelines, burn teams, and ruin interviews</strong> &#8212; and how to avoid them.</p><p>Let&#8217;s make you sound like someone who&#8217;s deployed DAGs in production, not just built toy examples.</p><div><hr></div><h3>Mistake #1: Treating Airflow Like a Script Runner</h3><p>If your tasks are running huge pandas transformations, calling 15 APIs, and returning massive objects between tasks, you're doing it wrong.</p><p>Airflow is not Spark. Not dbt. Not a transformation engine. 
It&#8217;s the <strong>orchestrator</strong> &#8212; the conductor of the data workflow.</p><p><strong>Do this instead:</strong></p><ul><li><p>Keep tasks modular</p></li><li><p>Push heavy lifting into external jobs (Spark, SQL, cloud ETL)</p></li><li><p>Use Airflow to manage dependencies, not business logic</p></li></ul><p><strong>How to say this in interviews:</strong></p><blockquote><p>&#8220;I used Airflow to orchestrate pipeline steps, but offloaded heavy data processing to Snowflake and Spark jobs to keep the DAGs lean and observable.&#8221;</p></blockquote><div><hr></div><h3>Mistake #2: Ignoring Retry Logic and Failure Handling</h3><p>Most junior engineers write DAGs that work&#8230; when nothing goes wrong.</p><p>But in production:</p><ul><li><p>APIs time out</p></li><li><p>S3 files get delayed</p></li><li><p>Database connections drop</p></li></ul><p><strong>Best practices:</strong></p><ul><li><p>Always configure <code>retries</code>, <code>retry_delay</code>, and <code>on_failure_callback</code></p></li><li><p>Log why each task failed (don&#8217;t just rely on the default log dump)</p></li><li><p>Use idempotent task design &#8212; so retries don&#8217;t break downstream logic</p></li></ul><p><strong>Pro-level tip:</strong> Add exponential backoff and SLA alerts.</p>
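<p>A minimal sketch of the retry settings above, written as a plain <code>default_args</code> dictionary. The keys <code>retries</code>, <code>retry_delay</code>, <code>retry_exponential_backoff</code>, <code>max_retry_delay</code>, and <code>on_failure_callback</code> are Airflow 2.x operator arguments; the callback name <code>notify_failure</code> and the specific values are assumptions for illustration, so tune them to your own pipeline.</p>

```python
from datetime import timedelta


def notify_failure(context):
    """Hypothetical callback; Airflow passes a context dict to on_failure_callback."""
    task_instance = context.get("task_instance")
    exception = context.get("exception")
    message = f"Task {task_instance} failed: {exception}"
    print(message)  # swap this for a Slack or email call in a real project
    return message


# Assumed values for illustration only.
default_args = {
    "retries": 3,                              # try each task up to 3 more times
    "retry_delay": timedelta(minutes=5),       # base wait between attempts
    "retry_exponential_backoff": True,         # grow the wait after each failure
    "max_retry_delay": timedelta(minutes=30),  # cap the backoff
    "on_failure_callback": notify_failure,     # runs once retries are exhausted
}
```

<p>You would pass this dictionary as <code>default_args</code> when creating the DAG, so every task inherits the same retry behavior unless it overrides it.</p>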
      <p>
          <a href="https://zero2dataengineer.substack.com/p/airflow-best-practices-what-to-avoid">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Build Your First DAG in Airflow]]></title><description><![CDATA[From Zero to Working Pipeline in 15 Minutes]]></description><link>https://zero2dataengineer.substack.com/p/build-your-first-dag-in-airflow</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/build-your-first-dag-in-airflow</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Fri, 23 May 2025 00:00:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d366aa5-7187-4a5a-809a-e975617b4568_832x832.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>You&#8217;ve heard the theory.</h3><p>You get the metaphors.<br>You&#8217;ve seen the screenshots.</p><p>But let&#8217;s be real:<br>Until you write a DAG yourself, Airflow will always feel like someone else&#8217;s tool.</p><p>Today&#8217;s Elite drop is different.<br>You&#8217;re not reading <em>about</em> DAGs &#8212; you&#8217;re building one. From scratch. With your hands. With real code.</p><p>This walkthrough is intentionally barebones.<br>Because clarity is power.<br>You&#8217;ll leave this with a runnable pipeline and confidence to answer:</p><blockquote><p>&#8220;Yes, I&#8217;ve built DAGs before.&#8221;</p></blockquote><div><hr></div><h3>What You&#8217;re Building</h3><p>You&#8217;ll create a DAG that:</p><ul><li><p>Ingests a CSV file</p></li><li><p>Cleans it using Python</p></li><li><p>Loads the clean data to a Snowflake table</p></li><li><p>Runs every 6 hours</p></li><li><p>Sends a Slack alert if something fails</p></li></ul><p>Simple. But production-grade.</p><div><hr></div><h3>Step 1: Set Up Your Environment</h3><p>If you don&#8217;t already have Airflow:</p><pre><code>pip install apache-airflow</code></pre><p>Then initialize Airflow:</p><pre><code>airflow db init
airflow users create \
  --username admin \
  --firstname Firstname \
  --lastname Lastname \
  --role Admin \
  --email your@email.com
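# The webserver runs in the foreground, so keep this terminal open
airflow webserver --port 8080
# In a second terminal, start the scheduler
airflow scheduler</code></pre><p>You now have Airflow running locally.</p>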
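<p>With Airflow running, the &#8220;cleans it using Python&#8221; step from the list above could be sketched as a plain function that a task calls. This is only an illustration: the column names <code>order_id</code> and <code>amount</code>, the sample rows, and the helper <code>clean_rows</code> are assumptions, not part of the original pipeline.</p>

```python
import csv
import io

COLUMNS = ("order_id", "amount")  # hypothetical column names


def clean_rows(rows):
    """Drop rows with missing fields and cast the amount column to float."""
    cleaned = []
    for row in rows:
        values = {col: (row.get(col) or "").strip() for col in COLUMNS}
        if not all(values.values()):
            continue  # drop rows with a missing or empty field
        try:
            values["amount"] = float(values["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        cleaned.append(values)
    return cleaned


# Made-up sample data standing in for the incoming CSV file.
sample_csv = "order_id,amount\nA1,19.99\nA2,\nA3,abc\nA4, 5 \n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = clean_rows(rows)
print(cleaned)
```

<p>Keeping the cleaning logic in a small, pure function like this makes it easy to unit test outside Airflow before wiring it into a task.</p>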
      <p>
          <a href="https://zero2dataengineer.substack.com/p/build-your-first-dag-in-airflow">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Scheduling vs Triggering]]></title><description><![CDATA[How Workflows Actually Run in Production]]></description><link>https://zero2dataengineer.substack.com/p/scheduling-vs-triggering</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/scheduling-vs-triggering</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Thu, 22 May 2025 00:30:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f377859c-0d1d-4a1a-bbaa-99fe54cafc4e_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most people set <code>schedule_interval='@daily'</code> and move on.<br>But in production, nothing is that simple.</p><p>Data arrives late.<br>APIs fail.<br>Files drop into S3 at random.<br>And your pipeline has to <strong>wait</strong>, <strong>trigger</strong>, or <strong>backfill</strong> &#8212; not just run on a timer.</p><p>Today, we&#8217;re digging into how scheduling <em>really</em> works &#8212; and how you should answer when interviewers ask:</p><blockquote><p>&#8220;How do you schedule and trigger your Airflow DAGs?&#8221;</p></blockquote><div><hr></div><h3>First: Understand the Two Types of Runs</h3><ol><li><p><strong>Scheduled Runs</strong><br>You tell Airflow to run a DAG on a fixed interval:</p><ul><li><p>Every hour, day, week, etc.</p></li><li><p>Use cases: batch ETL, daily reporting, metrics updates</p></li></ul></li><li><p><strong>Triggered Runs</strong><br>Airflow runs the DAG <strong>when something happens</strong>:</p><ul><li><p>A file lands in S3</p></li><li><p>An upstream DAG completes</p></li><li><p>An API returns a signal</p></li></ul></li></ol><div><hr></div><p><strong>Elite Bonus Drop Coming:</strong><br>This Thursday, I&#8217;ll walk through <strong>Sensors, ExternalTask dependency patterns, and DAG chaining</strong> in a real multi-DAG system.</p><p>This is where your interview answers start sounding like a Staff Data Engineer.</p><div><hr></div>
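<p>A rough plain-Python analogy for a triggered run: instead of firing on a timer, the pipeline polls for a condition and proceeds once it is met. This is not Airflow code; <code>wait_for</code> and <code>file_has_landed</code> are hypothetical names, though Airflow sensors expose similarly named <code>poke_interval</code> and <code>timeout</code> settings.</p>

```python
import time


def wait_for(condition, poke_interval=0.01, timeout=1.0):
    """Poll condition() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True   # the event happened, downstream work can start
        if time.monotonic() >= deadline:
            return False  # give up so the run can fail loudly or alert
        time.sleep(poke_interval)


# Simulate a file that "lands" on the third check.
checks = {"count": 0}


def file_has_landed():
    checks["count"] += 1
    return checks["count"] >= 3


triggered = wait_for(file_has_landed)
print(triggered, checks["count"])
```

<p>The timeout matters as much as the polling: a wait that never gives up hides late data instead of surfacing it.</p>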
      <p>
          <a href="https://zero2dataengineer.substack.com/p/scheduling-vs-triggering">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Build Your First DAG in Airflow]]></title><description><![CDATA[Hands-on Workflow You Can Actually Run]]></description><link>https://zero2dataengineer.substack.com/p/build-your-first-dag-in-airflow-f5c</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/build-your-first-dag-in-airflow-f5c</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Wed, 21 May 2025 00:30:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3620148f-53a9-4728-83c1-5e88acd77497_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You can read 100 articles about DAGs.<br>You can quote what &#8220;Directed Acyclic Graph&#8221; means.<br>But none of it matters until you write one.</p><p>Today&#8217;s goal?<br>You&#8217;ll build your <strong>first working DAG</strong> &#8212; start to finish &#8212; that does something useful and gets you closer to job-ready.</p><p>If you can complete this, you&#8217;ll be able to walk into any interview and say:</p><blockquote><p>&#8220;Yes, I&#8217;ve built production-style pipelines in Airflow.&#8221;</p></blockquote><p>Let&#8217;s do this.</p><div><hr></div><h3>Why This Matters</h3><p>Anyone can Google "What is Airflow?"<br>You&#8217;re here to <strong>build</strong> with it &#8212; and <strong>speak about it like an engineer who&#8217;s done it in prod</strong>.</p><p>Today&#8217;s drop isn&#8217;t theory.<br>You&#8217;ll create your <strong>first working DAG</strong> &#8212; a mini pipeline that does something useful, reliable, and interview-worthy.</p><p>You&#8217;ll leave this newsletter with:</p><ul><li><p>A runnable DAG</p></li><li><p>A deployable GitHub project</p></li><li><p>A STAR-based story you can use in interviews</p></li><li><p>And clarity about what makes DAGs production-grade</p></li></ul><div><hr></div><h3>What You&#8217;re Building</h3><p>The business use case is real:<br>A team receives a messy CSV of order data every few hours. Right now it&#8217;s being cleaned manually. 
You&#8217;ve been tasked with automating this.</p><p>Here&#8217;s what your DAG will do:</p><ul><li><p><strong>Extract:</strong> Read CSV</p></li><li><p><strong>Transform:</strong> Drop nulls, clean types</p></li><li><p><strong>Load:</strong> Write to Snowflake</p></li><li><p><strong>Retry:</strong> If anything breaks</p></li><li><p><strong>Alert:</strong> Slack notifications</p></li><li><p><strong>Schedule:</strong> Every 6 hours</p></li></ul><div><hr></div><h2>Your Airflow Starter Pack</h2><h3>Step 1: Install Airflow (If not done yet)</h3><pre><code>pip install apache-airflow
airflow db migrate  # Airflow 2.7+; on older releases use: airflow db init
airflow webserver --port 8080
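airflow scheduler</code></pre><p>Run the webserver and the scheduler in separate terminals. Create a login first (on Airflow 2.x, with <code>airflow users create</code>), then open the Airflow UI at <code>localhost:8080</code> and sign in.</p>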
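<p>The <strong>Retry</strong> and <strong>Alert</strong> items in the list above boil down to a small control loop. Here is a hedged plain-Python sketch of that loop; in a real DAG, Airflow does this for you through task-level retry settings and failure callbacks. <code>run_with_retries</code>, <code>flaky_load</code>, and the <code>alerts</code> list (standing in for a Slack message) are hypothetical names.</p>

```python
import time


def run_with_retries(task, retries=2, delay_seconds=0.0, on_failure=None):
    """Call task(); retry up to `retries` extra times, then alert and re-raise."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return task()
        except Exception as exc:
            if attempt > retries:
                # Out of retries: send the alert, then surface the error.
                if on_failure is not None:
                    on_failure(f"task failed after {attempt} attempts: {exc}")
                raise
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_load():
    """Hypothetical load step that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("warehouse unavailable")
    return "loaded"


alerts = []  # stand-in for a Slack notification
result = run_with_retries(flaky_load, retries=2, on_failure=alerts.append)
print(result, calls["count"], alerts)
```

<p>The alert fires only once the retries are exhausted, which keeps a transient warehouse blip from paging anyone.</p>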
      <p>
          <a href="https://zero2dataengineer.substack.com/p/build-your-first-dag-in-airflow-f5c">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Airflow Isn’t Scary — It’s a Life Saver]]></title><description><![CDATA[Why Every Data Engineer Needs to Master DAGs]]></description><link>https://zero2dataengineer.substack.com/p/airflow-isnt-scary-its-a-life-saver</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/airflow-isnt-scary-its-a-life-saver</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Tue, 20 May 2025 00:30:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fDr8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882e00e9-776a-4cc4-9e31-c5a4d979ec95_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Let&#8217;s Talk About Chaos.</h3><p>Not the kind you can meditate through &#8212;<br>The kind that breaks your data pipeline at 2AM while your dashboards go blank and your PM starts Slacking you with &#8220;???&#8221;</p><p>That&#8217;s why Airflow exists.</p><p>Airflow doesn&#8217;t just help you schedule things.<br>It&#8217;s the brain of your data pipelines &#8212; the system that says:<br>&#8220;Hey, this job failed, let&#8217;s retry.&#8221;<br>&#8220;Wait, don&#8217;t run until the upstream finished.&#8221;<br>&#8220;Log it, alert it, and move on.&#8221;</p><p>And yet, for most people, Airflow seems intimidating.</p><p>Today, we&#8217;ll break it down &#8212; no fluff, no buzzwords.<br>By the end of this, you&#8217;ll understand what Airflow really does, how DAGs work, and how to confidently talk about orchestration in any interview.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a 
reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>What Airflow Actually Does</h3><p>Forget the textbook.</p><p>Here&#8217;s what Airflow <em>really</em> handles:</p><ul><li><p>Running your ETL jobs <strong>in the right order</strong></p></li><li><p>Retrying failed jobs automatically</p></li><li><p>Monitoring status of every task</p></li><li><p>Sending alerts if something breaks</p></li><li><p>Triggering tasks on a <strong>schedule</strong> or based on <strong>dependencies</strong></p></li></ul><p>That&#8217;s orchestration. It&#8217;s not the &#8220;doing&#8221; &#8212; it&#8217;s the <strong>directing</strong>.</p><p>Airflow isn&#8217;t processing your data. It&#8217;s making sure the <em>process</em> happens correctly.</p><div><hr></div><h3>How DAGs Work (Without the Jargon)</h3><p>A <strong>DAG</strong> = Directed Acyclic Graph = A set of tasks that:</p><ol><li><p>Have <strong>a clear start and end</strong></p></li><li><p>Never loop back on themselves</p></li><li><p>Run in a <strong>specific order</strong></p></li></ol><p>Think of it like a checklist for your pipeline:</p><ul><li><p>Step 1: Pull data from S3</p></li><li><p>Step 2: Transform using Spark</p></li><li><p>Step 3: Load to Snowflake</p></li><li><p>Step 4: Run quality checks</p></li><li><p>Step 5: Refresh dashboard</p></li></ul><p>Each of these is a <strong>task</strong>.<br>The entire thing is a <strong>DAG</strong>.</p><p>In Airflow, these steps are defined in Python. 
You set dependencies like this:</p><p><strong>extract_task &gt;&gt; transform_task &gt;&gt; load_task</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ELITE/ANNUAL MEMBERSHIP&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ELITE/ANNUAL MEMBERSHIP</span></a></p>
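<p>To see why the &#8220;directed&#8221; and &#8220;acyclic&#8221; parts matter, here is a small sketch that mimics the <code>&gt;&gt;</code> syntax with the standard library&#8217;s <code>graphlib</code> (Python 3.9+). The <code>Task</code> class is a hypothetical stand-in, not Airflow&#8217;s operator API.</p>

```python
from graphlib import CycleError, TopologicalSorter


class Task:
    """Tiny stand-in for an Airflow task; `a >> b` means b runs after a."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph
        graph.setdefault(name, set())

    def __rshift__(self, other):
        self.graph[other.name].add(self.name)  # other depends on self
        return other  # returning `other` lets a >> b >> c chain


graph = {}  # maps task name -> set of upstream task names
extract_task = Task("extract", graph)
transform_task = Task("transform", graph)
load_task = Task("load", graph)

extract_task >> transform_task >> load_task

# A valid run order exists because the graph has no loops.
order = list(TopologicalSorter(graph).static_order())
print(order)

# "Acyclic" matters: once load feeds back into extract, no run order exists.
load_task >> extract_task
try:
    list(TopologicalSorter(graph).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

<p>A scheduler can only decide what to run next if an ordering like this exists, which is why a loop in the graph is rejected instead of executed.</p>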
      <p>
          <a href="https://zero2dataengineer.substack.com/p/airflow-isnt-scary-its-a-life-saver">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ETL Interview Breakdown: How Data Engineers Are Tested]]></title><description><![CDATA[Why "Build a pipeline" isn&#8217;t really what they&#8217;re asking]]></description><link>https://zero2dataengineer.substack.com/p/etl-interview-breakdown-how-data</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/etl-interview-breakdown-how-data</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Mon, 19 May 2025 17:30:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d913f3c-1b77-4736-bb68-3b16e27dd872_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Let&#8217;s Get Real About ETL Interviews</h3><p>You walk into a Data Engineering interview. They ask:</p><blockquote><p><em>&#8220;Can you walk me through an ETL pipeline you&#8217;ve built?&#8221;</em></p></blockquote><p>Seems basic, right?<br>But here&#8217;s the catch: <strong>they&#8217;re not looking for just a tool dump.</strong><br>They&#8217;re trying to reverse-engineer your thinking.</p><p>You&#8217;ve now built ETL pipelines, cleaned real data, and understood how batch vs streaming works.</p><p>Today&#8217;s goal is simple. 
In this Zero2DataEngineer breakdown, we&#8217;ll <strong>decode how ETL interview questions are framed</strong>, what they&#8217;re secretly testing, and how to <strong>structure your answers like a pro</strong>, even if you&#8217;ve never worked at FAANG.</p><p>Whether you&#8217;re applying for a Data Engineer, Analytics Engineer, or even a backend-heavy role, <strong>ETL questions will show up</strong>.</p><p>Here&#8217;s how to answer them like someone who&#8217;s done it before.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>How ETL Interview Questions Are Really Framed</h3><p>They won&#8217;t ask:</p><blockquote><p>&#8220;What is ETL?&#8221;</p></blockquote><p>They&#8217;ll ask:</p><ul><li><p>How would you design a pipeline to load millions of rows from an external source?</p></li><li><p>What happens if your load step fails halfway through?</p></li><li><p>How do you make your pipeline idempotent?</p></li><li><p>When do you use batch vs real-time ingestion?</p></li></ul><p>The trick is to answer <strong>like a system thinker</strong>, not just a coder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!jC4y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jC4y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 424w, https://substackcdn.com/image/fetch/$s_!jC4y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 848w, https://substackcdn.com/image/fetch/$s_!jC4y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 1272w, https://substackcdn.com/image/fetch/$s_!jC4y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jC4y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png" width="1456" height="697" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/163926328?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jC4y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 424w, https://substackcdn.com/image/fetch/$s_!jC4y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 848w, https://substackcdn.com/image/fetch/$s_!jC4y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 1272w, https://substackcdn.com/image/fetch/$s_!jC4y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f1b9223-edd7-4991-a242-9f6ac8c33536_1672x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h3>Interview Answer Formula (Reframe Your Thinking)</h3><p>Use this <strong>ETL STAR + Stack Formula</strong> when answering:</p><p><strong>S</strong>ituation: What was the use case?<br><strong>T</strong>ool stack: Which tools did you choose and why?<br><strong>A</strong>rchitecture: Show the pipeline stages.<br><strong>R</strong>esilience: How did you handle failures, alerts, schema drift?<br><strong>+ Stack Justification</strong>: Why this combo (Airflow + 
S3 + Spark etc.)?</p><p>Pro tip: If you haven&#8217;t built one end-to-end yet, use this:</p><blockquote><p>&#8220;Here&#8217;s how I <em>would</em> design it for a [use case].&#8221;<br>Then walk them through your design <strong>with concrete choices and trade-offs, not vague hypotheticals.</strong></p></blockquote><div><hr></div><h3>Master These Real ETL Questions Before Your Next Interview</h3><div><hr></div><p><strong>Q1. What are some common challenges in ETL pipelines?</strong><br>1. Handling bad records<br>2. Schema changes<br>3. Late-arriving data<br>4. Dependencies &amp; retries</p><p><strong>Sample Answer:</strong></p><blockquote><p>&#8220;One of the biggest challenges I&#8217;ve faced is handling schema drift, especially when upstream sources silently change a column name or data type. I&#8217;ve built schema validation into the extraction step using Great Expectations, with schema versioning through the AWS Glue Data Catalog.</p><p>I&#8217;ve also handled bad records by logging and quarantining them into a separate S3 bucket with alerting. For late-arriving data, I design pipelines to be idempotent, using UPSERT logic or late data windows. And for dependencies, I make sure every task declares its upstream tasks explicitly, sets <code>depends_on_past</code> where a run must wait for the previous run to succeed, and has retry logic and failure alerts configured.&#8221;</p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ANNUAL&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ANNUAL</span></a></p>
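<p>If the interviewer follows up with &#8220;show me what idempotent means&#8221;, a sketch like this helps. It uses SQLite from the standard library as a stand-in for the warehouse (UPSERT needs SQLite 3.24 or newer), and a plain list as a stand-in for the quarantine bucket. The table, column names, and sample rows are assumptions for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")


def load_batch(conn, rows):
    """Upsert good rows; return bad rows so they can be quarantined."""
    quarantined = []
    for row in rows:
        try:
            record = (int(row["order_id"]), float(row["amount"]))
        except (KeyError, TypeError, ValueError):
            quarantined.append(row)  # bad record: set aside, keep going
            continue
        conn.execute(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
            record,
        )
    conn.commit()
    return quarantined


batch = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "5.00"},
    {"order_id": "3", "amount": "not-a-number"},  # bad record
]

bad_first = load_batch(conn, batch)
bad_second = load_batch(conn, batch)  # simulate a rerun after a failure
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count, len(bad_first), len(bad_second))
```

<p>Loading the same batch twice leaves the table in the same state as loading it once. That is the property to name when they ask what happens if the load step fails halfway through.</p>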
      <p>
          <a href="https://zero2dataengineer.substack.com/p/etl-interview-breakdown-how-data">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Batch vs Streaming Pipelines]]></title><description><![CDATA[Choosing the right flow for the right kind of data]]></description><link>https://zero2dataengineer.substack.com/p/batch-vs-streaming-pipelines</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/batch-vs-streaming-pipelines</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Fri, 16 May 2025 00:30:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d366aa5-7187-4a5a-809a-e975617b4568_832x832.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;ve built an ETL pipeline.<br>You&#8217;ve transformed and loaded data.</p><p>Now the question is:<br><strong>How often should your pipeline run?</strong><br>And more importantly&#8230; <strong>should it run in batches &#8212; or in real time?</strong></p><p>Let&#8217;s break it down.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>What Is a Batch Pipeline?</h2><p>A <strong>batch pipeline</strong> runs on a schedule.<br>It pulls a large chunk of data at once &#8212; typically daily, hourly, or weekly.</p><p>Think:</p><ul><li><p>Nightly revenue reports</p></li><li><p>Weekly customer churn rollups</p></li><li><p>Monthly sales summaries</p></li></ul><p>It&#8217;s like picking up laundry every Sunday.<br>No need to track every sock in real time &#8212; just do one large pickup.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;Upgrade to Annual&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>Upgrade to Annual</span></a></p><div><hr></div><h3>When to Use Batch</h3><ul><li><p>Your data changes slowly (e.g., payments, orders)</p></li><li><p>You&#8217;re running reports, not alerts</p></li><li><p>You want to keep cloud costs low</p></li><li><p>You need high data completeness over speed</p></li></ul>
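<p>Here is a rough sketch of the batch idea next to its streaming counterpart, in plain Python. The event data and the names <code>batch_revenue</code> and <code>StreamingRevenue</code> are made up for illustration; a real streaming system would read from a queue or log, not a list.</p>

```python
from collections import defaultdict

# Hypothetical (day, amount) order events.
events = [
    ("2025-05-15", 20.0),
    ("2025-05-15", 5.0),
    ("2025-05-16", 12.5),
]


def batch_revenue(all_events):
    """Batch: wait for the full set, then compute totals in one pass."""
    totals = defaultdict(float)
    for day, amount in all_events:
        totals[day] += amount
    return dict(totals)


class StreamingRevenue:
    """Streaming: update the running total as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, day, amount):
        self.totals[day] += amount
        return self.totals[day]  # an up-to-date answer after every event


batch_result = batch_revenue(events)

stream = StreamingRevenue()
running = [stream.on_event(day, amount) for day, amount in events]

print(batch_result)
print(running)
```

<p>Both approaches end at the same totals. The difference is when the answer becomes available and how much state you have to keep alive in between.</p>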
      <p>
          <a href="https://zero2dataengineer.substack.com/p/batch-vs-streaming-pipelines">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Data Transformation Deep Dive]]></title><description><![CDATA[How to reshape, enrich, and prepare data for decision-makers]]></description><link>https://zero2dataengineer.substack.com/p/data-transformation-deep-dive</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/data-transformation-deep-dive</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Thu, 15 May 2025 00:30:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6cd4ad-b111-41a3-8ec6-5de23d870ee0_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Day 3 of Zero2DataEngineer &#8211; Week 5</p><p>You&#8217;ve written your first full ETL pipeline.<br>But here&#8217;s the part that defines your value as a data engineer:</p><blockquote><p>Can you <strong>reshape raw data into something useful</strong>?</p></blockquote><p>That&#8217;s where transformation lives &#8212; and where good engineers become great ones.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>What Data Transformation Really Means</h3><p>It&#8217;s more than <code>.fillna()</code> or <code>.dropna()</code>.<br>It&#8217;s about making messy, inconsistent data <strong>make sense</strong> &#8212; across teams, systems, and use cases.</p><p>Think of transformation as <strong>translation</strong>:<br>You&#8217;re turning raw, ambiguous data into structured, meaningful signals.</p><div><hr></div><h3>Real-World Scenario: Cleaning Order Data</h3><p>You get a CSV dump from a vendor with these columns:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tv6z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tv6z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 424w, https://substackcdn.com/image/fetch/$s_!Tv6z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 848w, 
https://substackcdn.com/image/fetch/$s_!Tv6z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 1272w, https://substackcdn.com/image/fetch/$s_!Tv6z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tv6z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png" width="1456" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/163548790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tv6z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 424w, 
https://substackcdn.com/image/fetch/$s_!Tv6z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 848w, https://substackcdn.com/image/fetch/$s_!Tv6z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 1272w, https://substackcdn.com/image/fetch/$s_!Tv6z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c3d56c-965b-436a-b511-0d51fd2d6198_1642x386.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Your tasks:</p><ul><li><p>Normalize dates to standard format</p></li><li><p>Strip <code>$</code> and convert to float</p></li><li><p>Unify state values (<code>CA</code>, <code>Calif.</code> &#8594; <code>CA</code>)</p></li><li><p>Deduplicate based on key columns</p></li><li><p>Group purchases by user</p></li></ul><div><hr></div><h3>Sample Code Walkthrough</h3><pre><code>import pandas as pd

df = pd.read_csv("vendor_orders.csv")

# Standardize column names
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Parse dates and coerce invalid entries
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Clean monetary values
df["value"] = df["value"].replace(r"[\$,]", "", regex=True).astype(float)

# Normalize state values
df["state"] = df["state"].replace({"Calif.": "CA"})

# Deduplicate
df = df.drop_duplicates(subset=["order_id", "user_name"])

# Group and summarize
summary = df.groupby("user_name")["value"].sum().reset_index()</code></pre><div><hr></div>
      <p>
          <a href="https://zero2dataengineer.substack.com/p/data-transformation-deep-dive">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Build Your First ETL Pipeline]]></title><description><![CDATA[A hands-on walkthrough from raw CSV to a clean, queryable table]]></description><link>https://zero2dataengineer.substack.com/p/build-your-first-etl-pipeline</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/build-your-first-etl-pipeline</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Wed, 14 May 2025 00:30:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VvmC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbaa917a-1be7-48f5-ba41-1e4cee63d636_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday, you learned what ETL is &#8212; and why it matters.<br>Today, we get practical. You&#8217;ll write your <strong>first real ETL pipeline</strong>.</p><p>We&#8217;re not just talking about it.<br>We&#8217;re building it.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>The Setup</h2><p>You&#8217;ve been given a CSV file of product orders.<br>It&#8217;s messy. 
It has:</p><ul><li><p>Blank rows</p></li><li><p>Wrong column types</p></li><li><p>Duplicate users</p></li><li><p>Date inconsistencies</p></li></ul><p>Your job?</p><ul><li><p>Extract the data</p></li><li><p>Transform it for cleanliness and insight</p></li><li><p>Load it into a structured database</p></li></ul><div><hr></div><h2>Step 1: Extract (Read the Data)</h2><p>Let&#8217;s use Python&#8217;s pandas library to ingest the file.</p><pre><code>import pandas as pd

df = pd.read_csv("orders_raw.csv")
print(df.head())</code></pre><p>Scan the data. Understand it. Ask:</p><ul><li><p>Are column names clean?</p></li><li><p>Are dates formatted?</p></li><li><p>Are there obvious outliers or NULLs?</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ANNUAL&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ANNUAL</span></a></p></li></ul>
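<p>Those three questions can be answered with a few lines of code. The sketch below uses only the standard library so it runs without pandas installed; the inline sample and its column names are hypothetical stand-ins for <code>orders_raw.csv</code>.</p>

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for orders_raw.csv.
raw = io.StringIO(
    "Order ID,Order Date,Amount\n"
    "1,2025-05-01,19.99\n"
    "2,05/02/2025,\n"
    "3,2025-05-03,7.50\n"
)
rows = list(csv.DictReader(raw))


def is_iso_date(text):
    """True if text parses as YYYY-MM-DD."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Are column names clean? Flag names with spaces or uppercase letters.
messy_columns = [c for c in rows[0] if c != c.strip().lower().replace(" ", "_")]

# Are dates formatted? Count values that are not ISO dates.
bad_dates = sum(not is_iso_date(r["Order Date"]) for r in rows)

# Are there NULLs? Count empty cells per column.
empty_cells = {c: sum(r[c] == "" for r in rows) for c in rows[0]}

print(messy_columns, bad_dates, empty_cells)
```

<p>Writing the checks down like this gives you a baseline: run them again after the transform step and the counts should drop to zero.</p>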
      <p>
          <a href="https://zero2dataengineer.substack.com/p/build-your-first-etl-pipeline">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What the Hell Is ETL?]]></title><description><![CDATA[Real-world breakdown of a misunderstood concept &#8212; and why every data engineer must master it]]></description><link>https://zero2dataengineer.substack.com/p/what-the-hell-is-etl</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/what-the-hell-is-etl</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Tue, 13 May 2025 00:30:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7c37!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6211054c-ba33-429c-8284-85c474d859c9_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Welcome to Week 5 of Zero2DataEngineer</h2><p>This week, we&#8217;re shifting from data structure to <strong>data movement</strong>.<br>Because understanding SQL is table stakes &#8212; but knowing how raw data becomes trustworthy, usable insights?</p><p>That&#8217;s where <strong>ETL</strong> comes in.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>What the Hell Is ETL?</h2><p>ETL stands for:</p><ul><li><p><strong>Extract</strong> &#8211; pull raw data from source systems</p></li><li><p><strong>Transform</strong> &#8211; clean, reformat, validate, and reshape</p></li><li><p><strong>Load</strong> &#8211; push structured data into a final destination</p></li></ul><p>That&#8217;s it &#8212; in theory.</p><p>But in practice, ETL is <strong>messy, strategic, and essential.</strong></p><div><hr></div><h2>Let&#8217;s Make It Real: The Restaurant Analogy</h2><p>You're running a restaurant.</p><ul><li><p><strong>Extract</strong> = ordering raw ingredients from vendors</p></li><li><p><strong>Transform</strong> = chopping, cooking, seasoning, plating</p></li><li><p><strong>Load</strong> = serving the final dish to the customer</p></li></ul><p>You wouldn&#8217;t send a customer raw flour and onions.<br>That&#8217;s what skipping ETL looks like in a data system.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ANNUAL&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ANNUAL</span></a></p>
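<p>The Extract, Transform, and Load stages described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the inline CSV, the <code>orders</code> table, and the cleaning rules are all assumptions invented for the example, and an in-memory SQLite database stands in for a real warehouse.</p>

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real pipeline would read a file, API, or queue.
RAW_CSV = """order_id,customer,amount,order_date
1, Alice ,19.99,2025-05-01
2,Bob,,2025-05-02
3,carol,5.00,2025-05-03
"""


def extract(raw_text):
    """Extract: read raw rows exactly as the source provides them."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: trim whitespace, normalize names, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # validation: skip rows missing a required value
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),
                float(amount),
                row["order_date"].strip(),
            )
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    with conn:  # commits on success, rolls back if an insert raises
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

<p>Keeping each stage in its own function is a design choice for the sketch: it lets you test the cleaning rules without touching the source or the destination.</p>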
      <p>
          <a href="https://zero2dataengineer.substack.com/p/what-the-hell-is-etl">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[SQL + RDBMS = Love]]></title><description><![CDATA[Why Structured Query Language Still Rules the World of Data]]></description><link>https://zero2dataengineer.substack.com/p/sql-rdbms-love</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/sql-rdbms-love</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Sat, 10 May 2025 00:30:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zQxA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb79a41ed-d251-4719-b3a0-c334329ee992_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week, we&#8217;ve explored the foundation of structured data systems: relational databases, indexing, ACID transactions, and schema design.<br>Today, we bring it all together &#8212; and show how SQL is more than a language.<br>It&#8217;s a mindset. A framework. A career unlock.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. 
Make sure you're subscribed.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>You&#8217;ve learned SQL.<br>You&#8217;ve learned schema design.<br>You now understand indexing, ACID, joins, and denormalization.</p><p>But here&#8217;s the part nobody tells you:</p><blockquote><p>SQL isn&#8217;t just a query language.<br>It&#8217;s a way of <strong>thinking</strong> about data.</p></blockquote><p>The best data engineers? They don&#8217;t &#8220;run SQL.&#8221;<br>They <strong>design systems</strong> that make SQL sing.</p><div><hr></div><h2>The Layer Cake of Modern Data</h2><p>Let&#8217;s pull back the curtain. When you&#8217;re working at a company like Stripe, Airbnb, or Netflix, your data system isn&#8217;t just tables and dashboards.</p><p>It&#8217;s a <strong>layered architecture</strong>, and SQL flows through it all:</p><ol><li><p><strong>Raw Layer</strong> &#8211; messy, fast, unfiltered (S3, GCS)</p></li><li><p><strong>Staging Layer</strong> &#8211; cleaned, typed, deduped</p></li><li><p><strong>Warehouse Layer</strong> &#8211; modeled, relational, denormalized</p></li><li><p><strong>Semantic Layer</strong> &#8211; Looker, Tableau, Power BI</p></li><li><p><strong>Delivery Layer</strong> &#8211; Dashboards, APIs, notebooks</p></li></ol><p>SQL is the thread that weaves it all together &#8212; whether you write it, generate it with dbt, or visualize it through tools.</p><div><hr></div><h2>So What&#8217;s the Real Job of SQL?</h2><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ANNUAL&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a 
class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ANNUAL</span></a></p>
      <p>
          <a href="https://zero2dataengineer.substack.com/p/sql-rdbms-love">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Transactions & ACID: The Rules That Keep Data Sane]]></title><description><![CDATA[How databases protect your pipelines when everything else fails]]></description><link>https://zero2dataengineer.substack.com/p/transactions-and-acid-the-rules-that</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/transactions-and-acid-the-rules-that</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Fri, 09 May 2025 00:30:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36f15e77-0b2a-45de-8f3a-aba575afca3e_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Day 4 of Zero2DataEngineer &#8212; this week is all about how real databases behave in production, not just in notebooks.</p><p>In data engineering, pipelines break, APIs fail, jobs timeout.<br>But the <em>data itself</em>?</p><p>That still needs to be correct.</p><p>Today we talk about how databases keep your world from falling apart &#8212; even during chaos.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>What is a Transaction?</h2><p>A transaction is a single unit of work in the database.</p><ul><li><p>Transfer money? Transaction.</p></li><li><p>Insert a row? Transaction.</p></li><li><p>Update 3 tables in sequence? Still a transaction.</p></li></ul><p>A good system either completes the transaction <strong>entirely</strong> or <strong>rolls everything back</strong> &#8212; no half-baked updates.</p><div><hr></div><h2>The ACID Model (Explained Like You&#8217;re On-Call)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UAkK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UAkK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 424w, https://substackcdn.com/image/fetch/$s_!UAkK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 848w, 
https://substackcdn.com/image/fetch/$s_!UAkK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 1272w, https://substackcdn.com/image/fetch/$s_!UAkK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UAkK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png" width="1456" height="382" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:382,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72974,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/163148250?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UAkK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 424w, 
https://substackcdn.com/image/fetch/$s_!UAkK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 848w, https://substackcdn.com/image/fetch/$s_!UAkK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 1272w, https://substackcdn.com/image/fetch/$s_!UAkK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b030b1-6bbd-4ae8-b7bf-34cec61a66c9_1618x424.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Why this matters:</strong><br>You don&#8217;t want to deduct money from a customer&#8230; but fail to generate the order.<br>You also don&#8217;t want two people editing the same record and overwriting each other&#8217;s data.</p><div><hr></div><h2>Real-Life Scenario: E-Commerce Checkout</h2><p>Imagine a customer places an order. Here&#8217;s what happens:</p><ol><li><p>Add a row in <code>orders</code></p></li><li><p>Deduct stock from <code>inventory</code></p></li><li><p>Charge the customer in <code>payments</code></p></li><li><p>Send confirmation email</p></li></ol><p>Without transactions, if step 3 fails after steps 1 &amp; 2 succeed&#8230;</p><ul><li><p>An order row exists with no payment behind it</p></li><li><p>Inventory is decremented for an item nobody paid for</p></li><li><p>No order ships</p></li><li><p>No confirmation email goes out</p></li></ul><p>Now your support team is flooded.</p><p>With transactions, the three database writes (steps 1&#8211;3) are wrapped in one block. If payment fails, nothing is saved. The confirmation email can&#8217;t be rolled back, so send it only after the commit succeeds.</p><div><hr></div><h2>SQL in Action</h2><pre><code>BEGIN;

-- Step 1: record the order
INSERT INTO orders (...) VALUES (...);
-- Step 2: deduct one unit of stock
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 123;
-- Step 3: record the payment
INSERT INTO payments (...) VALUES (...);

COMMIT;</code></pre><p>If any statement fails before <code>COMMIT</code>, Postgres marks the transaction as aborted: it rejects every later statement in the block and discards all changes made since <code>BEGIN</code>, even if the client still sends <code>COMMIT</code>. No manual cleanup. No half-written rows. No silent data corruption.</p><div><hr></div><h2>Interview Angle: How to Stand Out</h2><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ANNUAL MEMBERSHIP&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ANNUAL MEMBERSHIP</span></a></p>
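<p>If you want to watch the all-or-nothing behavior from the checkout scenario yourself, here is a small sketch you can run locally. It makes several assumptions: an in-memory SQLite database stands in for Postgres, the table layouts are invented, and the hypothetical <code>checkout</code> helper forces a failure by sending a payment with no amount. SQLite&#8217;s error handling differs from Postgres in the details, but the outcome is the same: a failed payment leaves no order row and no stock change behind.</p>

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; the table layouts are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_id INTEGER);
    CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE payments (order_id INTEGER, amount REAL NOT NULL);
    INSERT INTO inventory VALUES (123, 5);
    """
)


def checkout(conn, order_id, product_id, amount):
    """Run the three checkout writes as one all-or-nothing unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, product_id))
            conn.execute(
                "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ?",
                (product_id,),
            )
            conn.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False


# A payment with no amount violates NOT NULL, so the whole checkout is undone.
failed = checkout(conn, 1, 123, None)
orders_after_failure = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_after_failure = conn.execute("SELECT quantity FROM inventory").fetchone()[0]

# A valid payment lets all three writes commit together.
succeeded = checkout(conn, 2, 123, 49.0)
stock_after_success = conn.execute("SELECT quantity FROM inventory").fetchone()[0]
print(failed, orders_after_failure, stock_after_failure, succeeded, stock_after_success)
```

<p>The confirmation email is deliberately absent from <code>checkout</code>: it is a side effect outside the database, so it belongs after the function returns <code>True</code>.</p>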
      <p>
          <a href="https://zero2dataengineer.substack.com/p/transactions-and-acid-the-rules-that">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Indexing Secrets They Don’t Teach You]]></title><description><![CDATA[How to actually use indexes like an engineer &#8212; not a tutorial bot]]></description><link>https://zero2dataengineer.substack.com/p/indexing-secrets-they-dont-teach</link><guid isPermaLink="false">https://zero2dataengineer.substack.com/p/indexing-secrets-they-dont-teach</guid><dc:creator><![CDATA[Avantika_Penumarty]]></dc:creator><pubDate>Thu, 08 May 2025 00:30:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SEss!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back, data minds &#8212;<br>We&#8217;ve talked tables. We&#8217;ve shaped schemas.<br>But today, we talk about <strong>speed</strong>.</p><p>Because all the cleanest SQL in the world means nothing&#8230;<br>If your query takes 17 minutes and your dashboard cries blood.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Zero2Dataengineer is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>Why Indexing Matters</h3><p>Imagine searching for a contact in your phone:</p><ul><li><p><strong>No Index:</strong> Scroll, scroll, scroll&#8230; manually.</p></li><li><p><strong>With Index:</strong> Type &#8220;S&#8221; &#8594; boom, &#8220;Sara&#8221; shows up instantly.</p></li></ul><p>That&#8217;s what indexes do to your database.<br>They <strong>help the engine locate what you need &#8212; without scanning every row</strong>.</p><div><hr></div><h3>Why Indexing Exists</h3><p>Imagine walking into a bookstore with no signage.<br>No sections, no labels. Just 100,000 books in a pile.</p><p>That&#8217;s your database without an index.<br>Even the simplest query becomes a <strong>full table scan</strong> &#8212; the database flips through every row, like a librarian on Red Bull.</p><p><strong>With an index?</strong><br>You give the DB a map. 
It jumps to exactly where the answer lives.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SEss!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SEss!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SEss!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SEss!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SEss!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SEss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1767465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/163065723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SEss!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SEss!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SEss!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SEss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618d8ead-c882-4ef1-be6a-640b71600335_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3>The Two Most Common Index Types</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VvAd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VvAd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 424w, 
https://substackcdn.com/image/fetch/$s_!VvAd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 848w, https://substackcdn.com/image/fetch/$s_!VvAd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 1272w, https://substackcdn.com/image/fetch/$s_!VvAd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VvAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png" width="1456" height="272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/163065723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!VvAd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 424w, https://substackcdn.com/image/fetch/$s_!VvAd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 848w, https://substackcdn.com/image/fetch/$s_!VvAd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 1272w, https://substackcdn.com/image/fetch/$s_!VvAd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e011261-a796-4c9e-8bec-e7b5f808b47b_1680x314.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div><hr></div><h3>Deep Dive: B-Tree vs Hash Index</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LZd0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LZd0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 424w, https://substackcdn.com/image/fetch/$s_!LZd0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 848w, 
https://substackcdn.com/image/fetch/$s_!LZd0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 1272w, https://substackcdn.com/image/fetch/$s_!LZd0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LZd0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png" width="1456" height="296" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:296,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zero2dataengineer.substack.com/i/163065723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LZd0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 424w, 
https://substackcdn.com/image/fetch/$s_!LZd0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 848w, https://substackcdn.com/image/fetch/$s_!LZd0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 1272w, https://substackcdn.com/image/fetch/$s_!LZd0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6b36fb5-84e5-4cdf-bdbb-058da78c19a7_1660x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><strong>Interview tip:</strong> Most RDBMSs default to <strong>B-Tree</strong>, so go with that unless the workload clearly calls for something else.</p></blockquote><div><hr></div><h3>Real World Example: The Wrong Index Costs Real Money</h3><p>When I worked on a loyalty points table with <strong>30M+ rows</strong>,<br>a teammate created an index on <code>user_id</code>.</p><p>Problem?<br>All queries filtered on <strong>created_at</strong>. That index was <strong>useless</strong>.</p><p>We swapped it to:</p><pre><code>CREATE INDEX idx_created_at ON loyalty_points (created_at);</code></pre><p>Result: Dashboard load time dropped from 38s &#8594; 2s.<br>That tiny change saved $600/month in Snowflake compute credits.</p><div><hr></div><h3>Real-World: How One Index Saved a Failing Job</h3><p>At a logistics startup, we had a <strong>shipment tracking job</strong> that queried a 60M-row <code>tracking_events</code> table every 5 minutes.</p><p>We thought it was clean:</p><pre><code>SELECT * FROM tracking_events WHERE status = 'delivered';</code></pre><p>But it took 45 seconds.</p><p>The problem? 
No index on <code>status</code>.</p><p>Worse, <code>status</code> had only 3 values (in_transit, delivered, failed) &#8594; low cardinality.</p><p>So indexing it wouldn&#8217;t help much: every value matches a large share of the table. Instead, we looked at <code>event_time</code>, which had high cardinality and was used in the WHERE clause:</p><pre><code>CREATE INDEX idx_event_time ON tracking_events (event_time);</code></pre><p>Query dropped to 2.1 seconds.</p><p>And that&#8217;s when we learned:</p><p>Index the column whose filter eliminates the most rows, not just any column that appears in the query.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://zero2dataengineer.substack.com/subscribe&quot;,&quot;text&quot;:&quot;UPGRADE TO ANNUAL&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://zero2dataengineer.substack.com/subscribe"><span>UPGRADE TO ANNUAL</span></a></p>
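<p>You can check the selectivity argument from the <code>tracking_events</code> story on your own machine. The sketch below is an illustration under stated assumptions: an in-memory SQLite database stands in for the production system, the table holds 10,000 made-up rows instead of 60M, <code>event_time</code> is a plain integer, and the range filter in <code>QUERY</code> is invented for the demo. It asks the planner how it will run the query before and after the index exists, then compares how many rows each filter keeps.</p>

```python
import sqlite3

# Assumption: SQLite stands in for the production database; the table is a
# hypothetical, much smaller copy of tracking_events with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracking_events "
    "(id INTEGER PRIMARY KEY, status TEXT, event_time INTEGER)"
)
statuses = ["in_transit", "delivered", "failed"]
conn.executemany(
    "INSERT INTO tracking_events (status, event_time) VALUES (?, ?)",
    [(statuses[i % 3], i) for i in range(10_000)],
)

# Hypothetical filter on the high-cardinality column.
QUERY = "SELECT COUNT(*) FROM tracking_events WHERE event_time >= 9900"


def plan(conn, sql):
    """Return the planner's description of how it will run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


plan_before = plan(conn, QUERY)
conn.execute("CREATE INDEX idx_event_time ON tracking_events (event_time)")
plan_after = plan(conn, QUERY)

# Selectivity: the fraction of rows each filter keeps (lower is more selective).
total = conn.execute("SELECT COUNT(*) FROM tracking_events").fetchone()[0]
kept_by_status = conn.execute(
    "SELECT COUNT(*) FROM tracking_events WHERE status = 'delivered'"
).fetchone()[0] / total
kept_by_time = conn.execute(QUERY).fetchone()[0] / total

print(plan_before)
print(plan_after)
print(round(kept_by_status, 3), kept_by_time)
```

<p>The exact plan text varies by SQLite version, but the first plan should describe a full scan and the second should name <code>idx_event_time</code>. The status filter keeps about a third of the rows, while the time filter keeps 1%, which is why the time index is the one worth paying for.</p>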
      <p>
          <a href="https://zero2dataengineer.substack.com/p/indexing-secrets-they-dont-teach">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>