This post will cover two recent additions to Composable DataFlows that have yet to receive a comprehensive write-up: DataFlow Resume and Pinned Results. We’ll be discussing them together due to their conceptual overlap; both facilitate DataFlow development and debugging by allowing you to reuse results that have already been calculated, saving computation time and helping to consistently reproduce specific execution states.
Simply put, Pinned Results let you carry a module’s output values from one run to the next, so you can skip executing that module and guarantee that its results will not change. To pin a module, right-click on it after the DataFlow has executed and select ‘Pin Results’. Scenarios where Pinned Results can be useful include:
- Developing a dataflow containing a module with a long runtime: When the behavior you’re testing doesn’t specifically depend on that module’s results, pinning it can save you time every run.
- Testing a dataflow with side-effects, such as email and alert sending: Pinning the offending modules can allow you to repeatedly try the logic you care about without spamming anyone’s inbox.
- Debugging an error that occurs under specific conditions: The ability to freeze specific module results in place can help reproduce those conditions, particularly when opening a failed run from the DataFlow’s Runs page.
Like right-clicking to disable a module, result pinning is not permanently saved and will only apply to the current DataFlow Designer session; it is intended to be used for development and debugging, not persistent changes. If you need constant outputs from a module that would otherwise vary, consider replacing that module with one that lets you enter your desired output data directly (e.g. String Input, Array Builder, or Table Editor).
It’s worth reiterating that pinned modules will not execute; when you pin a module that has out-of-DataFlow side effects, such as a SQL Writer or Email Sender, those actions will not take place, even if the module outputs indicate otherwise (remember, the output values are pinned, and do not necessarily reflect the current execution state).
Pinning module results is useful, but it can also be dangerous; a pinned module will always produce its pinned output values, even if it’s not provided with input values. If you pin a module on one side of a branch, then make changes that toggle which branch output is set, both sides of the branch will execute (one via normal execution, the other due to pinning).
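The branch pitfall can be illustrated with a small simulation. This is only a sketch of the behavior described above, not Composable's execution engine; `run_branch` and the side names are hypothetical and exist purely for illustration.

```python
def run_branch(condition: bool, pinned: dict[str, str]) -> dict[str, str]:
    """Simulate a two-way branch and return the output of each side that produced one.

    `pinned` maps a side name to the output value frozen from an earlier run.
    """
    results = {}
    active = "true_side" if condition else "false_side"
    for side in ("true_side", "false_side"):
        if side in pinned:
            # A pinned module emits its stored value without executing,
            # even when the branch gave it no input.
            results[side] = pinned[side]
        elif side == active:
            # An unpinned module only runs when the branch feeds it.
            results[side] = f"computed by {side}"
    return results


# First run: the branch takes the true side, which we then pin.
first = run_branch(True, pinned={})
pins = {"true_side": first["true_side"]}

# Second run: the condition flips, but the pinned side still produces output.
second = run_branch(False, pinned=pins)
print(sorted(second))  # ['false_side', 'true_side']
```

In the second run both sides yield values, so anything downstream of either side will see data, which is usually not what the branch was meant to allow.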
Pinning helps when isolating results from a small number of modules, but what about scenarios where you want to keep more than that? To save you from having to pin wide swathes of your dataflow, let’s move on to discuss another feature you might find useful.
Simply put, DataFlow Resume allows you to carry many module output values from one run to the next; when you click the ‘Resume’ button in the designer, the execution engine will detect changes to your DataFlow since the previous run and re-execute only the modules that fall into at least one of the following categories:
- Modules whose inputs (connections or values) have been changed since the last run.
- Modules with errors in the previous run.
- Modules that have the ‘Force Execution’ option set.
- Modules whose outputs cannot be cached (certain File executors and Property loaders).
- Modules that are downstream of any of the above modules.
All other modules will skip execution and reuse their results from the previous run. For iterating modules (e.g. loops and streams) that can execute multiple times in a run, Resume will start from the most recent iteration, so resuming from a mid-loop failure will behave as expected.
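The selection rules above can be sketched as a short graph walk. This is an illustrative model under stated assumptions, not the actual engine: the `Module` class, its field names, and `modules_to_rerun` are all hypothetical, and iteration-level resume for loops and streams is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    upstream: list[str] = field(default_factory=list)  # names of modules feeding this one
    changed: bool = False    # inputs (connections or values) edited since the last run
    errored: bool = False    # failed in the previous run
    force: bool = False      # 'Force Execution' is set
    cacheable: bool = True   # False when outputs cannot be cached


def modules_to_rerun(modules: list[Module]) -> set[str]:
    """Return the names of modules a resumed run would re-execute in this model."""
    # Seed with modules matching any of the first four rules.
    rerun = {m.name for m in modules
             if m.changed or m.errored or m.force or not m.cacheable}
    # Apply the last rule: keep adding downstream modules until nothing new appears.
    grew = True
    while grew:
        grew = False
        for m in modules:
            if m.name not in rerun and any(u in rerun for u in m.upstream):
                rerun.add(m.name)
                grew = True
    return rerun


flow = [
    Module("query"),
    Module("transform", upstream=["query"]),
    Module("lookup"),
    Module("join", upstream=["transform", "lookup"], errored=True),
    Module("report", upstream=["join"]),
]
print(sorted(modules_to_rerun(flow)))  # ['join', 'report']
```

Here only the failed `join` module and the `report` module downstream of it are re-executed; `query`, `transform`, and `lookup` reuse their earlier results.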
‘Force Execution’ is another new module-level option (like Pin Results and Disable/Enable) that tells the execution engine to always execute this module, regardless of whether it has results from a previous run it can draw from. In a sense, Pin Results and Force Execution are opposites; a Pinned module will never execute and always use past results, and a Forced module will always execute and never use past results. For reference, here’s a table of the current module options and their behavior.
| |Default|Pinned|Force Execution|Disabled|
|---|---|---|---|---|
|Normal Run|Executes|Does Not Execute|Executes|Does Not Execute|
|Resumed Run|Executes If Changed|Does Not Execute|Executes|Does Not Execute|
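The option semantics described above can also be written as one small decision function. This is a sketch for reasoning about the options, with hypothetical names (`Option`, `will_execute`); the `changed` flag stands in for every reason Resume might re-run a default module (changed inputs, a previous error, uncacheable outputs, or an affected upstream module).

```python
from enum import Enum


class Option(Enum):
    DEFAULT = "default"
    PINNED = "pinned"
    FORCED = "forced"
    DISABLED = "disabled"


def will_execute(option: Option, resumed: bool, changed: bool = False) -> bool:
    """Return True if a module with this option would execute in the given run."""
    if option in (Option.PINNED, Option.DISABLED):
        return False   # never executes, in either kind of run
    if option is Option.FORCED:
        return True    # always executes, ignoring any previous results
    # Default module: always runs in a normal run,
    # but in a resumed run only when something affecting it changed.
    return changed if resumed else True


print(will_execute(Option.DEFAULT, resumed=True))                # False
print(will_execute(Option.DEFAULT, resumed=True, changed=True))  # True
print(will_execute(Option.FORCED, resumed=True))                 # True
```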
Here are a few scenarios where Resume can be useful:
- An activated DataFlow encountered an intermittent error (e.g. an external service it depends on was down), and you want to ‘rescue’ the failed run from the point where the error happened.
- You’re developing a large/complex DataFlow, and want to see the results of your changes without having to re-execute the entire DataFlow every time.
The caveats that applied to Pinned Results also apply to DataFlow Resume; when an unchanged module reuses its prior results, it will not execute, and out-of-dataflow side effects like SQL writes will not occur. Additionally, take care when modifying logic after a failure within a loop or stream that collects its results in an Accumulator module. If your correction changes the content or format of the data that would be collected, then after resuming, the Accumulator may contain a mixture of both data profiles (those cached from the prior run and those generated after the resume). To avoid this, you can set Force Execution on the parent loop/stream module; this will force it to re-execute from scratch, re-populating the downstream Accumulator with only output data generated by the current logic.
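The Accumulator caveat is easier to see with a toy example. The sketch below is plain Python rather than anything Composable-specific; `process_v1`, `process_v2`, and `run_loop` are hypothetical stand-ins for the loop body before and after your fix and for the loop itself.

```python
def process_v1(item: int) -> dict:
    # Original loop logic: emits the value as a string.
    return {"id": item, "value": str(item * 10)}


def process_v2(item: int) -> dict:
    # Corrected loop logic: emits the value as an integer.
    return {"id": item, "value": item * 10}


def run_loop(items, process, accumulator, start=0):
    """Run `process` over items[start:] and collect the results in `accumulator`."""
    for i in range(start, len(items)):
        accumulator.append(process(items[i]))
    return accumulator


items = [1, 2, 3, 4]

# Previous run: two iterations succeeded with the old logic before the failure.
cached = run_loop(items[:2], process_v1, [])

# Resume: continue from the failed iteration with the new logic, keeping cached results.
resumed = run_loop(items, process_v2, list(cached), start=2)

# Force Execution on the loop: start over so every result uses the new logic.
forced = run_loop(items, process_v2, [])

print({type(r["value"]).__name__ for r in resumed})  # contains both 'str' and 'int'
print({type(r["value"]).__name__ for r in forced})   # {'int'}
```

The resumed accumulator holds two string values and two integer values, while the forced run holds integers only, which is the mixed-profile situation the paragraph above warns about.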
There you have it! Module Pinning and DataFlow Resume are both powerful additions to the Composable DataFlow development toolkit, and I hope this post helped explain how they can be used effectively. As always, if you have any questions, feel free to reach out on the Composable Support Forums.
- Saving Your Progress: DataFlow Resume and Pinned Results - February 28, 2021