Data Engineering Insights

Choosing the Right Data Ingestion Strategy

Nov 16, 2024

Welcome to Data Engineering Insights!

Hi everyone, and welcome back to Data Engineering Insights! In our last issue, we introduced the core components of a data pipeline. Today, we’re diving into the first key stage: data ingestion. Choosing the right ingestion strategy sets the stage for a reliable and efficient pipeline. In this edition, we’ll cover batch vs. real-time ingestion, the tools that support each, and techniques for handling diverse data formats.

Let’s start with some background on data sources and why data ingestion is essential for making data accessible across an organization.

Where Does Data Come From, and Why Does it Need to be Ingested?

Most companies collect data from multiple sources, each with unique formats and structures. Here are some common sources:

User Interaction Data: Generated from websites or applications, capturing details about user actions and behavior. This type of data is often delivered in lightweight formats like JSON to support fast, real-time updates.
Sensor or IoT Data: For companies using field equipment or sensors, such as in logistics or manufacturing, data is frequently sent in CSV format at regular intervals (e.g., hourly or daily).
External Market or Vendor Data: Data from third-party APIs, often delivered in XML, provides insights on external conditions like market trends, competitor analysis, or weather impacts. XML is a structured format that’s common for data exchange between systems.
Internal Sales or Transactional Data: Sales records and transaction logs are often stored in Parquet, a columnar format that compresses well and is efficient to query at large volumes.

Each of these data sources is like a puzzle piece—valuable on its own but far more powerful when combined. Data ingestion is the process that collects, organizes, and unifies these pieces, turning them into a centralized resource for analysis.

Dealing with Different Data Formats

With diverse data sources come different file formats, each posing unique challenges. Here’s how a company might handle a range of file types before ingestion:

Parsing JSON: User interaction data in JSON format is parsed to extract relevant details and structure them into rows and columns, enabling easy analysis of user behavior.
Validating CSV: Sensor data in CSV format is checked for consistency and completeness to ensure accuracy, especially important for IoT data that might miss updates due to network issues.
Flattening XML: XML from external sources is “flattened” by mapping nested tags to columns, making it compatible with other data sources.
Standardizing Parquet: Sales data, already in Parquet, is validated for schema consistency, ensuring it can be easily queried alongside other data.
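
As a rough illustration of the first two items above, the sketch below flattens a nested JSON event into a single row and checks CSV sensor rows for missing or non-numeric values. It uses only the Python standard library. The field names (user, event, sensor_id, temperature) and the helper functions are made up for this example; a real pipeline would follow its own schema and rejection rules.

```python
import csv
import io
import json


def flatten_json(obj, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten_json(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def validate_csv(text, required, numeric):
    """Split CSV rows into valid rows and (line_number, reason) rejects."""
    valid, rejected = [], []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [c for c in required if not (row.get(c) or "").strip()]
        if missing:
            rejected.append((line_no, f"missing {missing}"))
            continue
        try:
            for col in numeric:
                row[col] = float(row[col])
        except ValueError:
            rejected.append((line_no, f"non-numeric {col}"))
            continue
        valid.append(row)
    return valid, rejected


# Hypothetical sample inputs, invented for illustration.
event = json.loads('{"user": {"id": 7, "plan": "pro"}, "event": "click"}')
sensor_csv = "sensor_id,temperature\nA1,21.5\nA2,\nA3,warm\n"

flat_event = flatten_json(event)
valid_rows, rejected_rows = validate_csv(
    sensor_csv, required=["sensor_id", "temperature"], numeric=["temperature"]
)
```

Keeping rejected rows with a line number and a reason, rather than silently dropping them, makes it easier to spot a sensor that keeps missing updates.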

Real-Life Example: Handling Diverse Formats in a Unified Pipeline

Imagine a company that needs a seamless data pipeline to support several departments. It receives data from multiple sources, each with its own format and frequency:

User Interaction Data in JSON format from a mobile app, arriving in real time.
Sensor Data in CSV format, batch-delivered every hour.
Market Data in XML format from an external API, delivered daily.
Sales Data in Parquet format from internal systems, updated weekly.

Each source is valuable, but to make this data useful across the organization, the team needs a way to ingest, convert, and standardize it all into a unified format—Parquet—before storing it in a centralized data lake. Here’s how they accomplish this:

Step 1: Ingesting Each Data Source

The data pipeline begins by ingesting data from each source, with each data type arriving in different formats and at different intervals:

Real-Time JSON Data: The user interaction data is ingested in real-time from the mobile app. Real-time data ingestion captures JSON events as they happen, allowing the team to monitor app activity closely.
Batch CSV Data: Sensor data, arriving in hourly CSV files, is ingested in batches. Automated batch ingestion detects each new CSV file and prepares it for conversion.
Daily XML Data: The market data from the external API arrives daily in XML format. The pipeline fetches and extracts this XML data from the API at scheduled times.
Parquet Data: Internal sales data, already in Parquet format, is directly ingested without additional transformation, simplifying the process.
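
One simple way to picture the automated batch detection described above is a job that scans a landing directory on each run and picks up only the files it has not seen before. The sketch below assumes a hypothetical landing directory and file naming scheme, and it tracks processed files in an in-memory set; a production pipeline would more likely persist that state or rely on its orchestration tool.

```python
import tempfile
from pathlib import Path


def find_new_files(landing_dir, seen, pattern="*.csv"):
    """Return files matching the pattern that have not been processed yet."""
    new_files = sorted(
        p for p in Path(landing_dir).glob(pattern) if p.name not in seen
    )
    seen.update(p.name for p in new_files)  # remember them for the next run
    return new_files


# Simulate two hourly runs against a temporary landing directory.
with tempfile.TemporaryDirectory() as landing:
    seen = set()

    Path(landing, "sensors_00.csv").write_text("sensor_id,temperature\nA1,21.5\n")
    first_run = [p.name for p in find_new_files(landing, seen)]

    Path(landing, "sensors_01.csv").write_text("sensor_id,temperature\nA1,22.0\n")
    second_run = [p.name for p in find_new_files(landing, seen)]
```

Each run returns only the file that arrived since the previous run, so the same hourly file is not converted twice.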

Step 2: Converting Data to a Unified Format (Parquet)

With ingestion complete, each data type is converted to Parquet format, making it consistent across the data lake. Here’s how each format is processed:

Converting JSON to Parquet: The real-time JSON data is parsed to extract key fields and organized into a tabular format, making it easier to store as Parquet. Each JSON event is structured into rows and columns to support efficient querying in the data lake.
Converting CSV to Parquet: Sensor data in CSV format is read row-by-row, mapped to a structured table, and saved as Parquet. During conversion, data types are validated, and any missing values are handled to maintain data quality.
Converting XML to Parquet: XML data, which may contain nested tags, is “flattened” by mapping nested elements into columns in a table format. Once structured, it’s saved as Parquet to keep it consistent with other data sources. Flattening XML ensures that all elements can be accessed and queried easily in the data lake.
No Conversion for Existing Parquet Data: Sales data, already in Parquet format, needs no format conversion. After the schema consistency check described earlier, it is stored in the data lake as-is, saving time and resources.
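
To make the XML flattening step more concrete, here is a minimal sketch using Python’s built-in xml.etree.ElementTree. The feed structure, tag names, and helper functions are invented for this example. The sketch stops at flat rows (a list of dictionaries); writing those rows out as Parquet would need a library such as pyarrow, which is left out so the example runs with the standard library alone.

```python
import xml.etree.ElementTree as ET


def flatten_element(elem, prefix=""):
    """Map nested child tags and attributes to flat column names.

    Simplification: repeated sibling tags overwrite each other, and text
    directly inside an element that also has children or attributes is ignored.
    """
    row = {}
    for attr, value in elem.attrib.items():
        row[f"{prefix}{attr}"] = value
    for child in elem:
        name = f"{prefix}{child.tag}"
        if len(child) or child.attrib:  # has children or attributes: recurse
            row.update(flatten_element(child, prefix=f"{name}_"))
        else:
            row[name] = (child.text or "").strip()
    return row


def xml_to_rows(xml_text, record_tag):
    """Produce one flat dict per <record_tag> element."""
    root = ET.fromstring(xml_text.strip())
    return [flatten_element(rec) for rec in root.iter(record_tag)]


# Hypothetical market feed; the tags and values are made up.
market_xml = """
<feed>
  <quote symbol="ACME">
    <price><open>10.0</open><close>10.5</close></price>
    <date>2024-11-15</date>
  </quote>
  <quote symbol="GLOBEX">
    <price><open>20.0</open><close>19.0</close></price>
    <date>2024-11-15</date>
  </quote>
</feed>
"""

rows = xml_to_rows(market_xml, "quote")
```

Each quote becomes one row with the columns symbol, price_open, price_close, and date. All values are still strings at this point, so type conversion would happen before the data is written out.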

Step 3: Storing in the Data Lake

Once all data is in Parquet, it is loaded into a centralized data lake. Using a consistent format across all datasets allows the company to join and query across sources, simplifying analysis and making insights readily available across departments.
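
To show why a shared tabular shape matters, the sketch below joins rows from two sources on a common column. It is plain Python working on lists of dictionaries, not a data lake query engine, and the column names (date, revenue, index_close) and values are hypothetical.

```python
from collections import defaultdict


def join_on(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for l_row in left:
        for r_row in index.get(l_row[key], []):
            # On a column-name clash, the value from the left row wins.
            joined.append({**r_row, **l_row})
    return joined


# Hypothetical rows from two sources that share a "date" column.
sales = [
    {"date": "2024-11-01", "revenue": 1200.0},
    {"date": "2024-11-02", "revenue": 900.0},
]
market = [{"date": "2024-11-01", "index_close": 101.2}]

combined = join_on(sales, market, key="date")
```

The join only works because both sources expose the same key with the same type and format, which is what the conversion step is meant to guarantee.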

Batch vs. Real-Time Ingestion

Now that we understand data sources and formats, let’s look at two main ingestion strategies: batch and real-time.

Batch Ingestion: This method gathers data at specific intervals (e.g., hourly, daily) and processes it all at once. It’s ideal for data that doesn’t need instant updates, such as periodic reports. Batch ingestion is typically simpler and more economical.
Real-Time Ingestion: This method continuously streams data as it arrives, which is ideal for applications where speed is critical, such as monitoring live interactions or updating dashboards. It requires more complex infrastructure than batch ingestion but provides immediate insights.
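
The difference between the two strategies can be sketched in a few lines. In the example below, the batch function groups records and hands them over a group at a time, while the streaming function passes each record to a handler as soon as it arrives. The function names are made up for this illustration, and batches are cut by record count for simplicity, whereas the batch jobs described above are triggered by time (hourly, daily).

```python
from itertools import islice


def ingest_batches(records, batch_size):
    """Batch style: collect records into fixed-size groups, one group at a time."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def ingest_stream(records, handler):
    """Streaming style: pass each record to the handler as soon as it arrives."""
    count = 0
    for record in records:
        handler(record)
        count += 1
    return count


# Hypothetical events standing in for incoming data.
events = [{"id": i} for i in range(5)]

batches = list(ingest_batches(events, batch_size=2))

streamed = []
handled = ingest_stream(iter(events), streamed.append)
```

With batching, the last record waits until its group is complete or the input ends. With streaming, every record is available to downstream consumers immediately, at the cost of a process that has to stay running.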

Tools to Get Started

Here are some popular tools for both batch and real-time ingestion:

Batch Processing Tools:
- Apache NiFi: A flow-based tool that handles many file formats and supports complex workflows, which makes batch ETL accessible.
- Talend: Known for its drag-and-drop interface, Talend offers powerful data integration tools ideal for setting up batch ETL processes.
Real-Time Processing Tools:
- Apache Kafka: Handles large volumes of streaming data, making it a strong choice for real-time ingestion. Kafka itself is format-agnostic; messages are commonly serialized as JSON or Avro.
- Amazon Kinesis: A managed streaming service that works well with real-time data and integrates easily with other AWS tools, making it a good fit for AWS-based infrastructure.

Resources to Learn More

If you’d like to explore data ingestion further, here are some valuable resources:

Books:
- "Designing Data-Intensive Applications" by Martin Kleppmann – Covers data ingestion strategies, data formats, and real-world applications.
- "Streaming Systems" by Tyler Akidau, Slava Chernyak, and Reuven Lax – A comprehensive guide for those working with real-time data processing, especially relevant for tools like Kafka and Apache Beam.
Videos and Courses:
- Coursera’s Data Engineering on Google Cloud: A hands-on course covering data engineering essentials, including ingestion with Google Cloud Pub/Sub.
- Kafka Training by Confluent: Confluent’s online Kafka training dives deep into real-time data streaming, ideal for engineers implementing Kafka in production.

Final Thoughts

Data ingestion isn’t just the first step in a pipeline—it’s the foundation that ensures data is clean, timely, and ready for analysis. Whether you’re working with batch or real-time ingestion, understanding data formats and choosing the right ingestion strategy can help your organization turn raw data into valuable insights.

Feel free to reach out if you have any questions or want help setting up your first ingestion pipeline.

Happy Ingesting!

Best,
Avantika
