Monday, 28 August 2017

Paper Summary - Data Ingestion for the Connected World

Data Ingestion for the Connected World
John Meehan, Cansu Aslantas, Stan Zdonik, Nesime Tatbul, Jiang Du


Businesses have been using “Big Data” applications to perform timely analytics and make real-time or near-real-time decisions. The effectiveness of these analytics and decisions depends on how quickly the necessary data can be extracted, transformed, and loaded from the operational platform into the analytical platform while ensuring correctness. According to the authors, this is challenging for latency-sensitive “Big Data” applications when done via traditional ETL processes, which are cumbersome and very slow. They propose a new architecture for ETL which they call streaming ETL. Streaming ETL takes advantage of the push-based nature of a stream processing system.

In this paper, the authors lay out the requirements for streaming ETL. Streaming ETL must ensure the correctness and predictability of its results. At the same time, a streaming ETL system must be able to scale with the number of incoming data sources and process data in as timely a manner as possible. They divide the requirements into three categories:

  • ETL requirements
  • Streaming requirements
  • Infrastructure requirements

ETL Requirements (Data Collection + Bulk Loading + Heterogeneous Data Types)

In the case of streaming data sources, data must be collected, queued, and routed to the appropriate processing channels. A data collection mechanism should have the ability to transform traditional ETL data sources into streaming ETL sources, and data collection should scale with the number of data sources. A streaming ETL engine must have the ability to bulk load freshly transformed data into the data warehouse. It should also have data routing capability to load semantically related data into multiple target systems.
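
To make the bulk-loading requirement concrete, here is a minimal sketch of my own (not from the paper) that loads an already-cleaned batch into Postgres with COPY via psycopg2; the table name, columns, and connection string are all hypothetical.

    # Hypothetical sketch: bulk load a cleaned batch into Postgres using COPY.
    # Assumes psycopg2 is installed and a table readings(sensor_id, ts, value) exists.
    import io
    import psycopg2

    def bulk_load(conn, batch):
        """batch: list of (sensor_id, ts, value) tuples, already cleaned and transformed."""
        buf = io.StringIO()
        for sensor_id, ts, value in batch:
            buf.write(f"{sensor_id}\t{ts}\t{value}\n")
        buf.seek(0)
        with conn.cursor() as cur:
            # COPY is much faster than row-by-row INSERTs for large batches.
            cur.copy_from(buf, "readings", columns=("sensor_id", "ts", "value"))
        conn.commit()

    conn = psycopg2.connect("dbname=warehouse")   # placeholder connection string
    bulk_load(conn, [("s1", "2017-08-28 10:00:00", 21.5)])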

Streaming Requirements (Out-of-Order and Missing Tuples + Dataflow Ordering + Exactly-Once Processing)

When the number of data sources and/or the data volume is large, data may arrive out of timestamp order, and sometimes tuples can be missing altogether. Waiting for late data to arrive can introduce unacceptable latency. The authors propose using timeout values and predictive techniques (e.g. regression) to overcome these issues. To improve performance, streaming ETL should break large batches into smaller ones, and large operations into a number of smaller operations. Streaming ETL must use ordering constraints to ensure that these smaller operations on smaller batches still produce the same results as their larger counterparts. Also, any data migration to and from the streaming ETL engine must occur exactly once.
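
The timeout idea can be pictured with a small sketch of mine (not the authors' implementation): a reordering buffer releases tuples in timestamp order once they have waited out a timeout, and a simple linear extrapolation stands in for the regression techniques the authors mention for missing values.

    # Hypothetical sketch of a timeout-based reordering buffer with a
    # regression-style stand-in for imputing missing values.
    import heapq
    import time

    class ReorderBuffer:
        def __init__(self, timeout_s=5.0):
            self.timeout_s = timeout_s
            self.heap = []                    # min-heap ordered by event timestamp

        def insert(self, event_ts, value):
            heapq.heappush(self.heap, (event_ts, time.time(), value))

        def release(self):
            """Emit tuples in timestamp order once they have waited out the timeout."""
            ready, now = [], time.time()
            while self.heap and now - self.heap[0][1] >= self.timeout_s:
                event_ts, _, value = heapq.heappop(self.heap)
                ready.append((event_ts, value))
            return ready

    def extrapolate_missing(prev_value, curr_value):
        """Linear extrapolation for a missing next value (stand-in for regression)."""
        return curr_value + (curr_value - prev_value)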

Infrastructure Requirements (Local Storage + ACID Transactions + Scalability + Data Freshness and Latency)

Any ETL or data ingestion pipeline needs to maintain local storage for temporary staging of new batches of data while they are being prepared for loading into the backend data warehouse, and streaming ETL is no different. Local storage also helps ensure the correctness of the temporal ordering and alignment of the data. Since a streaming ETL engine will be processing multiple streams at once, and each dataflow instance may try to modify the same state simultaneously, streaming ETL must follow ACID transaction semantics. It must also ensure the scalability of data ingestion and the freshness of the ingested data.
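
As a rough illustration of local, transactional staging (my own sketch, with SQLite standing in for the engine's local store; the table and file names are made up), each dataflow commits its batch atomically so concurrent writers never expose a half-written batch.

    # Hypothetical sketch: local ACID staging area for incoming batches.
    import sqlite3

    conn = sqlite3.connect("staging.db", isolation_level=None)  # explicit BEGIN/COMMIT
    conn.execute("CREATE TABLE IF NOT EXISTS staging (batch_id TEXT, ts TEXT, value REAL)")

    def stage_batch(batch_id, rows):
        try:
            conn.execute("BEGIN IMMEDIATE")      # take a write lock so batches do not interleave
            conn.executemany(
                "INSERT INTO staging VALUES (?, ?, ?)",
                [(batch_id, ts, value) for ts, value in rows])
            conn.execute("COMMIT")               # the whole batch becomes visible atomically
        except Exception:
            conn.execute("ROLLBACK")             # on failure, nothing from the batch is visible
            raise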

Streaming ETL Architecture

The authors propose a new architecture based on the above requirements. This architecture has four primary components:

Data collection: This component is a collection of data collectors, which primarily serve as message queues. Data collectors consume data from different sources, create logical batches of data, and push them to the streaming ETL engine.
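
Since the proof of concept described below uses Apache Kafka, a data collector might look roughly like this sketch of mine (assuming the kafka-python client; the topic name, broker address, and batch size are made up).

    # Hypothetical sketch of a data collector: read records from a source,
    # group them into logical batches, and push each batch to a Kafka topic.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    def collect(source, batch_size=100):
        batch = []
        for record in source:                  # source: any iterable of dict records
            batch.append(record)
            if len(batch) == batch_size:
                producer.send("ingest-batches", value=batch)   # one message per logical batch
                batch = []
        if batch:
            producer.send("ingest-batches", value=batch)
        producer.flush()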

Streaming ETL: This component contains a range of ETL tools, including data cleaning and transformation operators. A dataflow graph can be created from these operators to massage the incoming batches into normalised data. Once the data has been fully cleaned and transformed, it can either be pushed into the data warehouse or pulled by it.
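
A dataflow of such operators might be composed as in the sketch below (my own illustration, not the paper's API), where each operator takes a batch and returns a cleaned batch; the operator names and fields are made up.

    # Hypothetical sketch: a linear dataflow of cleaning/transformation operators.
    def drop_nulls(batch):
        return [row for row in batch if row.get("value") is not None]

    def normalise_units(batch):
        # e.g. convert Fahrenheit readings to Celsius
        for row in batch:
            if row.get("unit") == "F":
                row["value"] = (row["value"] - 32) * 5.0 / 9.0
                row["unit"] = "C"
        return batch

    def deduplicate(batch):
        seen, out = set(), []
        for row in batch:
            key = (row["sensor_id"], row["ts"])
            if key not in seen:
                seen.add(key)
                out.append(row)
        return out

    DATAFLOW = [drop_nulls, normalise_units, deduplicate]

    def run_dataflow(batch):
        for op in DATAFLOW:
            batch = op(batch)
        return batch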

OLAP backend: This component consists of a query processor and one or more OLAP engines. Each OLAP engine contains its own data warehouse as well as a delta data warehouse, both with the same schema. The streaming ETL engine writes all updates to the delta data warehouse, and the OLAP engine periodically merges these updates into the full data warehouse.
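
The delta-warehouse idea can be pictured with plain SQL run from Python; this is a sketch under the assumption of Postgres tables readings and readings_delta with identical schemas, not the paper's actual code.

    # Hypothetical sketch: periodically merge the delta data warehouse into the
    # full data warehouse inside a single transaction (Postgres via psycopg2).
    import psycopg2

    def merge_delta(conn):
        with conn:                               # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("INSERT INTO readings SELECT * FROM readings_delta")
                cur.execute("TRUNCATE readings_delta")

    conn = psycopg2.connect("dbname=warehouse")  # placeholder connection string
    merge_delta(conn)                            # call on a schedule, e.g. every few seconds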

Data migrator: The data migrator ensures that no batch of data gets lost as it moves from the streaming ETL component to the OLAP backend. It should also fully support ACID transactions.
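
One common way to get exactly-once behaviour on top of ACID transactions is to record migrated batch IDs in the same transaction as the data, so that retries become no-ops; the sketch below is my own illustration with made-up table names, not the paper's migrator.

    # Hypothetical sketch: idempotent (exactly-once) batch migration. conn is a
    # psycopg2 connection, as in the earlier sketches.
    def migrate_batch(conn, batch_id, rows):
        with conn:                                   # one ACID transaction per batch
            with conn.cursor() as cur:
                cur.execute("SELECT 1 FROM migrated_batches WHERE batch_id = %s", (batch_id,))
                if cur.fetchone():
                    return                           # batch already applied; retry is a no-op
                cur.executemany(
                    "INSERT INTO readings_delta (sensor_id, ts, value) VALUES (%s, %s, %s)",
                    rows)
                cur.execute("INSERT INTO migrated_batches (batch_id) VALUES (%s)", (batch_id,))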


The authors have built a proof-of-concept implementation of this architecture using Apache Kafka, S-Store, Intel’s BigDAWG polystore, and Postgres.

In this paper, the authors also try to answer another important question: how frequently should a streaming ETL system migrate data to the data warehouse? There are two methods: push (the ingestion engine periodically pushes data to the warehouse) and pull (the warehouse pulls data from the ingestion engine when it is needed). The authors ran an experiment to test the pros and cons of each method, and according to them, pulling new data with each query is the best option if minimising data staleness is the priority. They also suggest that smaller, more frequent migrations are better in both push and pull scenarios.
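
The push/pull trade-off can be stated in a few lines of Python (a sketch of mine, not the authors' experimental setup, reusing the merge_delta helper sketched above): push merges on a timer regardless of queries, while pull merges lazily just before a query runs.

    # Hypothetical sketch contrasting the two migration policies.
    import time

    def push_loop(conn, period_s=5.0):
        # Push: migrate on a fixed schedule, independent of query arrivals.
        while True:
            merge_delta(conn)
            time.sleep(period_s)

    def pull_query(conn, sql):
        # Pull: migrate lazily, right before answering a query, so the query
        # sees the freshest data at the cost of extra query latency.
        merge_delta(conn)
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()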

Conclusion

The authors think that streaming ETL can be extended to create an all-in-one ingestion and analytics engine specifically for time-series data, which they call Metronome (time-series ETL). This paper focuses on the functional requirements of streaming ETL, and the authors have also built a proof-of-concept implementation based on these requirements.