Peter Bailis, Edward Gan, Kexin Rong, Sahaana Suri
Processing and interpreting huge volume data that is in motion (fast data) to get timely answer is challenging and sometimes infeasible due to the scarce of resources (both human and computational). Human attention is limited. According to the authors, a new generation analytic system is needed to bridge the gap between limited human attention and growing volume of data. This new type of analytic system will prioritise attention in fast data. In this paper, authors have proposed three design principles that can be used to design and develop such fast data analytic system:
Principle 1: Prioritise Output – The design must deliver more information using less output.
Fast data analytic system should produce fewer and good quality output. If a system produces lot of raw (output) data, then it becomes difficult for a human to give attention. For example, if the end result is to find out which device is producing more problematic records, then it would be ideal if the system can simply return the device id with the count of records rather than producing every raw problematic record. According to the authors – “A few general results are better than many specific results”.
Principle 2: Prioritise Iteration – The design should allow iterative feedback-driven development.
Modern analytics workflows consist of many steps – including feature engineering, model selection, parameter tuning, and performance engineering. It is difficult to get the final model at first attempt. This means that analytics system should empower the end users by giving them necessary tools so that they can improve these steps iteratively based on the feedback. Today this is very labour intensive and time-consuming task. Fast data analytics system should lower this barrier. Fast data system should be designed for modularity and incremental extensibility.
Principle 3: Prioritise Computation – The design must prioritise computation on inputs that most affect its output.
One of the key property of fast data is – not all inputs contribute equally to the output. Therefore, it is waste of valuable computational resource if the system gives equal importance to all inputs. But how will fast data system select these inputs that contribute most to the output? According to authors – “fast data systems should start from the output and work backwards to the input, doing as little work as needed on each piece of data, prioritizing computation over data that matters most”.
Authors have built a new fast data analysis engine called MacroBase based on the principles outlined above. At present MacroBase’s core dataflow pipelines contain a sequence of data ingestion, feature extraction, classification, and explanation operators. These operators perform tasks including feature extraction, supervised and unsupervised classification, explanation and summarisation. MacroBase can process data as it arrives. It can also process data in batch mode.
Users can engage at three interface levels with MacroBase:
- Basic: Web based graphical user interface. This one is an easy interface.
- Intermediate: Custom pipelines configuring using Java.
- Advanced: Custom dataflow operators using Java/C++.
These interfaces will enable users of varying skill levels to quickly obtain the initial results and further improve result quality by iteratively refining their analyses. Users can highlight the key performance metrics (like, power drain, latency) and metadata attributes (like, hostname, device id). MacroBase reports explanations of the abnormal behaviour. For example, MacroBase may report that queries running on host 5 are 10 times more likely to experience high latency than the rest of the cluster. MacroBase is currently doing mostly anomaly or outlier detection, it is not doing any deep machine learning training.
Today we collect large volume of data in analytical platform. Some of these data are never read. Sometimes we may go back and analysis these data to find the root cause of the problem after it happened. Moreover, tools that we use to do these kinds of analysis are not easily accessible and process is time consuming. I think, these design principles provide good guidance which can be used to design and build a new generation analytics engine which can process huge volume of data and produce good quality output in timely manner.
More on this can be found: