StreamSets vs Composable

Many first-time users of Composable have asked, “How does Composable compare to StreamSets?” I finally had the opportunity to use StreamSets in a real-life setting a couple of months ago, and boy am I glad I did. Hopefully this comparison will shed some light on the strengths and weaknesses of each platform.

StreamSets is just dataflows. That’s it, nothing more. Composable goes beyond dataflows, with many more features for data scientists and developers. Some of these unique Composable features include:

Query Views – Interactive and customizable grid-based web interfaces with zero coding; just write the SQL statement.

Web Apps – Build pixel-perfect UIs for end-user workflows. Read-only dashboards are a thing of the past. Web Apps allow your users to submit and update information.

Entity Hub – Build master data models across 1000s of sources with scalable fuzzy matching.

Data Portals – Build ER data models and CRUD (Create, Read, Update, Delete) interfaces with zero coding.

Real-time Alerting – Send and receive first-class alerts in and out of the platform.

Data Catalog – Discover data in your existing databases and dynamically create queries.

Notebooks – Build, train, and test your machine learning algorithms.

StreamSets is built to handle ETL pipelines, and was engineered just for that. So let’s do an in-depth comparison of:

Composable Dataflows vs StreamSets Pipelines

  • Module Inputs
    • Many configuration inputs in StreamSets have to be hardcoded. Examples include a database server name, a database name, max request size, URL, port, query, etc. In contrast, all module inputs in Composable can be connected to other modules, which lets your Composable dataflow be completely dynamic and avoid hardcoded values. In Composable, you’re never stuck: you can always set an input to a value coming from anywhere.
      • So how do StreamSets users get around these hardcoded limits? They end up templating their Pipelines and building scripts that create Pipelines dynamically outside of StreamSets. Instead of having one configurable pipeline, you create one in StreamSets, export it, maintain many copies outside of StreamSets (or write a script to generate them), and then import them. This technique is pretty tedious and only works for certain scenarios; if you ever need to specify an input dynamically during execution in a StreamSets Pipeline, you’re out of luck.
  • Loops
    • Composable has first-class looping (foreach, do while) as well as recursion. Modules downstream from a loop can execute multiple times. Looping in StreamSets is more difficult: you need to send messages back into your pipeline carrying any state, essentially creating a recursive pipeline. And good luck creating your halting condition in StreamSets. You’ll probably bring down the system before you get it right.
  • Type Systems
    • StreamSets is record oriented: modules produce streams of records, whether these are JSON nodes or tables in a SQL database. Composable Dataflows have a type system akin to those of modern programming languages, with signed integers, big integers, booleans, doubles, tables, files, strings, complex types, etc. It is also extensible: in Composable you can create your own types.
  • A Run vs a Dataflow
    • In Composable, a Dataflow produces multiple runs. If a message comes into a Composable Dataflow, or if you click the run button in the designer, a new run is created. In StreamSets, a pipeline has a single state: all messages are routed to the same Pipeline. This is a critical difference. Because Composable keeps multiple runs, you can go back to each run and look at all the intermediate outputs. It also promotes reuse: you can use the same dataflow with different inputs, and different users can execute the same dataflow while maintaining separate runs.
  • Nesting Dataflows
    • Composable allows you to repackage a dataflow as a subflow, so you can call it from many parent dataflows. StreamSets doesn’t support that.
  • Security
    • StreamSets and Composable have comparable security models surrounding dataflows. You can specify which users or groups have read, write, execute, and delete permissions.
  • Change Tracking
    • Both systems support a history of modifications to the dataflow / pipeline.
  • Scale of Dataflows
    • Most StreamSets Pipelines I’ve seen have around 5-10 modules and usually don’t go beyond 20. Composable dataflows can approach 100s of modules, and each module may itself contain 100s of other modules.
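
The templating workaround described under Module Inputs can be sketched roughly as follows. This is only an illustration of the general idea, not StreamSets' actual export format: the `TEMPLATE` structure, the `{{name}}` placeholder syntax, and the helper names `fill_placeholders` and `render_pipelines` are all hypothetical. A real exported pipeline would be loaded from its JSON file, and the generated definitions would still have to be imported back into StreamSets one by one.

```python
import json


def fill_placeholders(node, values):
    """Recursively replace '{{key}}' placeholders in every string of a JSON-like structure."""
    if isinstance(node, dict):
        return {k: fill_placeholders(v, values) for k, v in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(item, values) for item in node]
    if isinstance(node, str):
        for key, value in values.items():
            node = node.replace("{{" + key + "}}", str(value))
        return node
    return node  # numbers, booleans and None are returned unchanged


def render_pipelines(template, variants):
    """Produce one concrete pipeline definition per set of values."""
    return [fill_placeholders(template, values) for values in variants]


# Hypothetical, heavily simplified stand-in for an exported pipeline definition.
TEMPLATE = {
    "title": "load_{{db_name}}",
    "stages": [
        {
            "name": "jdbc_source",
            "config": {
                "server": "{{db_server}}",
                "database": "{{db_name}}",
                "query": "SELECT * FROM {{table}}",
            },
        }
    ],
}

# Each entry yields a separate pipeline with its values baked in.
VARIANTS = [
    {"db_server": "db1.example.com", "db_name": "sales", "table": "orders"},
    {"db_server": "db2.example.com", "db_name": "hr", "table": "employees"},
]

pipelines = render_pipelines(TEMPLATE, VARIANTS)
for pipeline in pipelines:
    print(json.dumps(pipeline["stages"][0]["config"]))
```

Note that every value is fixed at generation time, which is exactly the limitation discussed above: nothing here can react to a value that only becomes known while a pipeline is executing.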

 

We recently had a customer that had previously developed many StreamSets pipelines and didn’t want to take the development hit of migrating all Pipelines over to Composable at once. To help with this migration, we created a Composable Dataflow that uses the StreamSets APIs to execute a StreamSets Pipeline. One interesting tidbit is that no custom modules were necessary: you can execute a StreamSets Pipeline from Composable using the existing modules. This is a dataflow that you could never create with the primitives exposed in StreamSets.
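
For readers curious what such an API call looks like outside of Composable, here is a minimal Python sketch of starting a pipeline over HTTP. It is not the Composable dataflow itself. The endpoint path (`/rest/v1/pipeline/<id>/start`) and the `X-Requested-By` header are assumptions based on the StreamSets Data Collector REST API and should be checked against the documentation for your version; the host, port, pipeline ID, and credentials are placeholders, and `build_start_request` and `start_pipeline` are hypothetical helper names.

```python
import base64
import urllib.parse
import urllib.request


def build_start_request(base_url, pipeline_id, user, password):
    """Build (but do not send) a POST request that asks the server to start a pipeline."""
    # Assumed endpoint layout; verify against your Data Collector's REST API docs.
    path = "/rest/v1/pipeline/" + urllib.parse.quote(pipeline_id, safe="") + "/start"
    request = urllib.request.Request(base_url.rstrip("/") + path, data=b"", method="POST")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    # Assumed requirement: a custom header on state-changing calls.
    request.add_header("X-Requested-By", "sdc")
    return request


def start_pipeline(base_url, pipeline_id, user, password, timeout=30):
    """Send the start request and return the raw response body as text."""
    request = build_start_request(base_url, pipeline_id, user, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder values; nothing is sent over the network here.
request = build_start_request("http://localhost:18630", "myPipelineId", "admin", "admin")
print(request.get_method(), request.full_url)
```

Calling `start_pipeline(...)` with real connection details would actually send the request. In the Composable dataflow, the same call can be assembled from existing modules with the URL and pipeline ID wired in as inputs.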

 

Exported Dataflows used for this integration:

In summary, Composable is a much more comprehensive DataOps platform. When comparing just the dataflow / pipeline features, you quickly find that Composable is a true dataflow programming language with zero limitations.