Monday, 28 August 2017

Paper Summary - Data Ingestion for the Connected World

Data Ingestion for the Connected World
John Meehan, Cansu Aslantas, Stan Zdonik, Nesime Tatbul, Jiang Du

Businesses have been using “Big Data” applications to perform timely analytics to make real/near-real time decisions. Effectiveness of these analytics and decisions depends on how quickly necessary data can be extracted, transformed, and loaded from operational platform to analytical platform while ensuring correctness. According to the authors, it is challenging for these latency sensitive “Big Data” applications to do this via traditional ETL processes which are cumbersome and very slow. They propose a new architecture for ETL which they call streaming ETL. Streaming ETL can take the advantages of the push-based nature of a stream processing system.

In this paper, authors have proposed streaming ETL requirements. Streaming ETL must ensure the correctness and predictability of its results. At the same time, a streaming ETL system must be able to scale with the number of incoming data sources and process data in as timely as possible. They have divided the requirements into three categories:

  • ETL requirements
  • Streaming requirements
  • Infrastructure requirements

ETL Requirements (Data Collection + Bulk Loading + Heterogeneous Data Types)

In the case of streaming data sources, data must be collected, queued, and routed to the appropriate processing channels. A data collection mechanism should have the ability to transform traditional ETL data sources into streaming ETL sources. Data collection should scale with the number of data sources. A streaming ETL engine must have the ability to bulk load freshly transformed data into the data warehouse. Streaming ETL engine should have data routing capability to load semantically related data into multiple target systems.

Streaming Requirements (Out-of-Order and Missing Tuples + Dataflow Ordering + Exactly-Once Processing)

When number of data sources and/or data volume is huge, there is a possibility that data may get out of time-stamp order and sometimes data can be missing altogether. Waiting for the things to be sorted out can introduce an unacceptable latency. Authors have proposed to use timeout value and predictive techniques (e.g. regression) to overcome these issues. To improve the performance, streaming ETL should break large batches into smaller ones and large operation also needs to be broken into a number of smaller operations. Streaming ETL must use ordering constraints to ensure that these smaller operations on smaller batches still produce the same result as their larger counter parts. Also, any data migration to and from the streaming ETL engine must occur once and only once.

Infrastructure Requirements (Local Storage + ACID Transactions + Scalability + Data Freshness and Latency)

Any ETL or data ingestion pipeline needs to maintain local storage for temporary staging of new batches of data while they are being prepared for loading into the backend data warehouse. Streaming ETL is no different. Having local storage will also help to ensure the correctness of temporal ordering and alignment of the data. Since streaming ETL engine will be processing multiple stream at once, and each dataflow instance may try to make modifications to the same state simultaneously, it is expected that streaming ETL must follow ACID transaction semantics. Streaming ETL must also ensure that scalability of data ingestion and data freshness.

Streaming ETL Architecture

Authors propose a new architecture based on the above requirements.This new architecture has four primary components:

Data collection: This component has a collection of data collectors. These data collectors primarily serve as messaging queues. Data collectors consume data from different sources, create logical batches of data and push them to the streaming ETL engine.

Streaming ETL: This component contains a range of ETL tools, including data cleaning and transformation operators. Dataflow graph can be created using these operators to massaged the incoming batches of data into normalised data. Once the data has been fully cleaned and transformed, it can be either pushed into data warehouse or pulled by data warehouse.

OLAP backend: This component consists of a query processor and one or several OLAP engines. Each OLAP engine contains its own data warehouse, as well as a delta data warehouse. Both data warehouses have same schema. Streaming ETL engine writes all updates to the delta data warehouse, and OLAP engine periodically merges these updates into the full data warehouse.

Data migrator: Data migrator ensures that no batch of data get lost when it moves from streaming ETL to OLAP backend components. This should also fully support ACID transactions.

Authors have built a proof-of-concept implementation based on this new architecture using Apache Kafka, S-Store, Intel’s BigDAWG polystore, and Postgres.

In this paper, authors have also tried to answer another important question regarding the frequency of the data migration to the data warehouse by a streaming ETL system. There are two methods: push (ingestion engine periodically pushes the data to the warehouse) and pull (warehouse pulls the data from the ingestion engine when it is needed). Authors have run an experiment to test the pros and cons of each method and according to them pulling new data with each query is the best option if the data staleness is the priority. They also suggested that it is better to go for smaller, more frequent migrations in both push and pull scenarios.


Authors think that streaming ETL can be extended to create all-in-one ingestion and analytics engine specifically for time-series data which they call Metronome (time-series ETL). This paper focuses on the functional requirements of streaming ETL. Authors also build a proof-of-concept implementation based on these requirements.

Sunday, 6 August 2017

Visualising Software Architecture Effectively in Service Description

Somedays back one of my team members told me about Simon Brown's C4 model. Since then I have been following this to document the software architecture. This presentation is about the diagrams that I draw (or I like to see) in service description based on C4 model.

Friday, 4 August 2017

Paper Summary - Prioritizing Attention in Fast Data: Principles and Promise

Prioritizing Attention in Fast Data: Principles and Promise
Peter Bailis, Edward Gan, Kexin Rong, Sahaana Suri
Stanford InfoLab

Processing and interpreting huge volume data that is in motion (fast data) to get timely answer is challenging and sometimes infeasible due to the scarce of resources (both human and computational). Human attention is limited. According to the authors, a new generation analytic system is needed to bridge the gap between limited human attention and growing volume of data. This new type of analytic system will prioritise attention in fast data. In this paper, authors have proposed three design principles that can be used to design and develop such fast data analytic system:

Principle 1: Prioritise Output – The design must deliver more information using less output.

Fast data analytic system should produce fewer and good quality output. If a system produces lot of raw (output) data, then it becomes difficult for a human to give attention. For example, if the end result is to find out which device is producing more problematic records, then it would be ideal if the system can simply return the device id with the count of records rather than producing every raw problematic record. According to the authors – “A few general results are better than many specific results”.

Principle 2: Prioritise Iteration – The design should allow iterative feedback-driven development.

Modern analytics workflows consist of many steps – including feature engineering, model selection, parameter tuning, and performance engineering. It is difficult to get the final model at first attempt. This means that analytics system should empower the end users by giving them necessary tools so that they can improve these steps iteratively based on the feedback. Today this is very labour intensive and time-consuming task. Fast data analytics system should lower this barrier. Fast data system should be designed for modularity and incremental extensibility.

Principle 3: Prioritise Computation – The design must prioritise computation on inputs that most affect its output.

One of the key property of fast data is – not all inputs contribute equally to the output. Therefore, it is waste of valuable computational resource if the system gives equal importance to all inputs. But how will fast data system select these inputs that contribute most to the output? According to authors – “fast data systems should start from the output and work backwards to the input, doing as little work as needed on each piece of data, prioritizing computation over data that matters most”.


Authors have built a new fast data analysis engine called MacroBase based on the principles outlined above. At present MacroBase’s core dataflow pipelines contain a sequence of data ingestion, feature extraction, classification, and explanation operators. These operators perform tasks including feature extraction, supervised and unsupervised classification, explanation and summarisation. MacroBase can process data as it arrives. It can also process data in batch mode.

MacroBase System Architecture

Users can engage at three interface levels with MacroBase:
  • Basic: Web based graphical user interface. This one is an easy interface.
  • Intermediate: Custom pipelines configuring using Java.
  • Advanced: Custom dataflow operators using Java/C++.
These interfaces will enable users of varying skill levels to quickly obtain the initial results and further improve result quality by iteratively refining their analyses. Users can highlight the key performance metrics (like, power drain, latency) and metadata attributes (like, hostname, device id). MacroBase reports explanations of the abnormal behaviour. For example, MacroBase may report that queries running on host 5 are 10 times more likely to experience high latency than the rest of the cluster. MacroBase is currently doing mostly anomaly or outlier detection, it is not doing any deep machine learning training.


Today we collect large volume of data in analytical platform. Some of these data are never read. Sometimes we may go back and analysis these data to find the root cause of the problem after it happened. Moreover, tools that we use to do these kinds of analysis are not easily accessible and process is time consuming. I think, these design principles provide good guidance which can be used to design and build a new generation analytics engine which can process huge volume of data and produce good quality output in timely manner.