We’ve given Hadoop almost 10 years to mature, invested billions, and very few companies are seeing a return on investment. Several companies have tried to turn Hadoop into a real-time analytical platform by layering SQL-like facades on top, but the latency is still not where it needs to be for interactive applications. Even Google, a true big data user, has moved on and now favors dataflow / flow-based programming approaches. Why? It just makes sense…
Why should I store all my data in HDFS?
How many adaptors do I need to write to get data into HDFS?
Why is my data stagnant when my business process is fluid?
I already have massive amounts of data in my existing triple stores and relational dbs… you want me to do what?
Instead of saving all data into HDFS and then trying to run full parallel table scans to reduce it, alternative systems can answer the same questions while the data is in motion. We live in an extremely chaotic data environment. Data is everywhere, exists in many different formats, and comes at us from many different directions. It’s unrealistic to expect that we’ll ever be able to duplicate all of this data into a single HDFS.
By defining your processing scheme as functional blocks with inputs and outputs, and then describing the dependencies between those blocks as dataflow connections, a platform can systematically parallelize the execution of the processing scheme. As data flows, it can be filtered, routed, aggregated, and disseminated to other systems. The result is a system that makes decisions and gets answers out of its data in real time.
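To make the idea concrete, here is a minimal sketch in Python. The block names and event schema are hypothetical, and this is not any particular platform’s API; it only illustrates the shape of the approach: each stage is a functional block with an input stream and an output stream, the dataflow connections are just the wiring between blocks, and a fresh answer is available after every event rather than after a batch scan.

```python
# Minimal flow-based sketch (hypothetical block names and event schema).
# Each block reads from an input stream and yields to an output stream;
# the pipeline is simply the wiring between blocks.

from typing import Iterable, Iterator, Dict


def source(events: Iterable[dict]) -> Iterator[dict]:
    """Emit raw events one at a time, as they arrive."""
    for event in events:
        yield event


def filter_block(stream: Iterator[dict], min_value: float) -> Iterator[dict]:
    """Drop low-value events in flight instead of storing them for a later scan."""
    for event in stream:
        if event["value"] >= min_value:
            yield event


def aggregate_block(stream: Iterator[dict]) -> Iterator[Dict[str, float]]:
    """Maintain a running sum per key and emit the updated totals immediately."""
    totals: Dict[str, float] = {}
    for event in stream:
        totals[event["key"]] = totals.get(event["key"], 0.0) + event["value"]
        yield dict(totals)  # an up-to-date answer after every event


def sink(stream: Iterator[Dict[str, float]]) -> None:
    """Disseminate each intermediate answer to a downstream consumer."""
    for snapshot in stream:
        print(snapshot)


if __name__ == "__main__":
    incoming = [
        {"key": "sensor-a", "value": 3.0},
        {"key": "sensor-b", "value": 0.5},  # filtered out below
        {"key": "sensor-a", "value": 4.0},
    ]
    # Dataflow connections: source -> filter -> aggregate -> sink.
    sink(aggregate_block(filter_block(source(incoming), min_value=1.0)))
```

Because each block depends only on its declared inputs and outputs, a platform that knows the wiring can schedule independent blocks in parallel and swap any block out without touching the rest of the pipeline.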
It’s platforms like the Composable DataOps Platform (https://composable.ai) that are going to be the big players in this next big data wave. Flow-based programming approaches provide agile and fluid mechanics for processing data.
Data got you overwhelmed? Go with the flow!