How to Refactor a Pipeline
When creating pipelines in Datameer, users often find it difficult to anticipate the final outcome. Datameer therefore supports an iterative approach to designing transformation flows, so users can make adjustments as needed.
In addition, existing pipelines sometimes need to be adapted to changes that occur outside of Datameer. These changes range from schema modifications to revised business requirements.
In all of these scenarios, the common factor is that analysts need to make modifications to an existing transformation flow. These modifications may involve actions such as:
- swapping source datasets or existing transformations with new ones
- deleting transformation nodes
- inserting an additional transformation
Datameer provides the necessary tools and flexibility to facilitate these modifications, enabling analysts to refine their transformation flows effectively.
All of the refactoring procedures described below rely on the 'Exchange Source' feature. Note, however, that this feature is not available for Join, Union, and SQL nodes. For these transformation types, make the necessary changes directly in the respective built-in transformation editor.
Changes that must be made in the transformation editor instead of with 'Exchange Source' include:
- editing the left or right side of a Join operation
- adding data sets to or removing data sets from a Union operation
- modifying the "FROM" clause in a SQL statement by referencing another source
How to Refactor the Pipeline
The pipeline can be refactored in three different ways:
- exchanging source data sets: replacing the existing source data sets with new ones
- deleting a transformation: removing a transformation from the pipeline, which eliminates its impact on the data flow
- inserting additional transformations: adding new transformations to the existing pipeline
Exchanging Source Data Sets
To exchange the source data set, follow these steps:
1. Add the new source data set to the Flow Area. The new source node is displayed alongside the existing pipeline.
2. Select the transformation node that requires modification, open the Inspector, and click "Exchange Source".
3. Specify the newly added source data set as the replacement and confirm your selection. Note that you can only select new sources from upstream, and circular dependencies are not allowed.
4. Observe the result of the exchange. If the new source data set does not provide the schema that the transformation node requires, an error occurs. In that case, adjust the existing transformation or add a new transformation to accommodate the schema changes.
5. If the former source node is no longer needed, remove it from the Project to keep the Flow Area clean and organized.
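The rule that new sources must come from upstream and must not create a circular dependency can be illustrated with a small sketch. This is a hypothetical in-memory model, not Datameer's API: the `flow` dictionary, the node names, and the functions `depends_on` and `exchange_source` are assumptions made purely for illustration.

```python
# Hypothetical model of a flow; not Datameer's API.
# Each node maps to the list of nodes it reads from.
flow = {
    "orders_v1": [],                  # original source data set
    "orders_v2": [],                  # newly added source data set
    "filter_paid": ["orders_v1"],     # transformation to be repointed
    "sum_by_day": ["filter_paid"],    # downstream transformation
}


def depends_on(flow, node, target):
    """Return True if `node` is `target` or reads from it, directly or indirectly."""
    if node == target:
        return True
    return any(depends_on(flow, src, target) for src in flow[node])


def exchange_source(flow, node, old_source, new_source):
    """Repoint `node` from `old_source` to `new_source`, refusing cycles."""
    if depends_on(flow, new_source, node):
        raise ValueError(f"{new_source} depends on {node}: circular dependency")
    flow[node] = [new_source if s == old_source else s for s in flow[node]]


# Exchange the source of the transformation node.
exchange_source(flow, "filter_paid", "orders_v1", "orders_v2")

# Optional cleanup: the former source node is no longer used.
del flow["orders_v1"]
```

In this model, trying to use `sum_by_day` as the new source of `filter_paid` raises a `ValueError`, because `sum_by_day` is downstream of `filter_paid`.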
Deleting a Transformation from the Pipeline
To delete a transformation node from the pipeline:
1. Select the node that follows the transformation node you wish to delete. Open the Inspector and click "Exchange Source".
2. Select the node that precedes the transformation you are removing as the new source and confirm your selection. Keep in mind that you can only choose new sources from upstream, and circular dependencies are not permitted.
3. If the former middle node is no longer required, delete it from the pipeline to keep the flow streamlined and organized.
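Conceptually, deleting a transformation means bypassing it first and only then removing it. The following is a minimal sketch of that order of operations, using a hypothetical dictionary model of a flow; the node names and the function `remove_middle_node` are illustrative assumptions and not part of Datameer.

```python
# Hypothetical model of a flow; not Datameer's API.
# Each node maps to the list of nodes it reads from.
flow = {
    "customers": [],
    "trim_names": ["customers"],   # the middle node to be removed
    "dedupe": ["trim_names"],      # the node that follows it
}


def remove_middle_node(flow, middle):
    """Point the consumers of `middle` at its own source, then drop `middle`."""
    sources = flow[middle]
    if len(sources) != 1:
        raise ValueError("this sketch only handles nodes with exactly one source")
    upstream = sources[0]
    # Exchange the source of every node that reads from `middle`.
    for node in list(flow):
        flow[node] = [upstream if s == middle else s for s in flow[node]]
    # The middle node is now unused and can be deleted.
    del flow[middle]


remove_middle_node(flow, "trim_names")
print(flow)
```

The sketch always deletes the bypassed node. In the procedure above, that last step is optional and depends on whether the node is still needed elsewhere.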
Inserting Additional Transformations to the Pipeline
To insert an additional transformation to the pipeline:
1. Select the source node. Then create a new transformation node by clicking the "+" symbol and configure the transformation according to your requirements.
2. Select the existing transformation node that should follow the newly created transformation. Open the Inspector and click "Exchange Source".
3. In the 'Exchange Source' dialog, select the newly created transformation node as the source node for the existing transformation and confirm your selection. Remember that you can only choose new sources from upstream, and circular dependencies are not allowed.
4. Once the exchange is complete, observe the outcome of the change. If the output of the newly inserted transformation does not provide the schema that the subsequent node requires, an error may occur. In that case, you might need to adjust the subsequent node's configuration.
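The insertion steps and the schema check that follows them can be sketched as below. This is a simplified, hypothetical model: the `flow` structure, the `requires` and `outputs` column sets, and the function `insert_before` are assumptions for illustration only and do not reflect how Datameer represents schemas internally.

```python
# Hypothetical model: node name -> its source, the columns it requires,
# and the columns it outputs. None of these names come from Datameer.
flow = {
    "sales": {
        "source": None,
        "requires": set(),
        "outputs": {"id", "amount", "country"},
    },
    "sum_by_country": {
        "source": "sales",
        "requires": {"amount", "country"},
        "outputs": {"country", "total"},
    },
}


def insert_before(flow, existing, new_name, requires, outputs):
    """Insert a new node between `existing` and its current source.

    Returns the set of columns `existing` requires but the new node
    does not output; an empty set means the schemas line up.
    """
    # Create the new transformation on the current source node.
    flow[new_name] = {
        "source": flow[existing]["source"],
        "requires": requires,
        "outputs": outputs,
    }
    # Exchange the source of the existing node to the new node.
    flow[existing]["source"] = new_name
    # Check the outcome: which required columns are now missing?
    return flow[existing]["requires"] - outputs


missing = insert_before(
    flow, "sum_by_country", "filter_positive",
    requires={"amount"}, outputs={"id", "amount", "country"},
)
print(missing or "schemas match")
```

If the inserted node dropped the `country` column instead, `insert_before` would return `{"country"}`, which corresponds to the error case described in the last step.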
What to Learn Next
For a more detailed guide on supporting a migration use case that involves changing data sources and deployment targets of an existing pipeline, see How-To: Migrate from Development to Production Environments.