Getting Started Tutorial 2.x
Harmonizing the Product Data
Now that we have modeled the Product entity, we can use the Data Hub Framework’s code scaffolding to generate boilerplate code for harmonizing our data. Recall from earlier that the Data Hub Framework can use the Entity Services model definition to generate code.
- Click on the Flows tab in the top navigation bar.
- Click on the + icon next to Harmonize Flows
- Type Harmonize Products into the Harmonize Flow Name field
- Click the CREATE button
This time we want to use the default option of Create Structure from Entity Definition. This means that the Data Hub Framework will create boilerplate code based on our Entity model. The code will pre-populate the fields we need to add.
Click on the Harmonize Products flow. You can run the harmonize flow from the Flow Info tab. The other tabs allow you to edit the source code for the generated plugins. Take note that there are five plugins for harmonize flows: collector, content, headers, triples, writer.
Harmonize flows are designed to run as batch jobs. To support batching, the Data Hub Framework exposes a collector plugin whose purpose is to return a list of items to process. The Data Hub Framework then breaks that list into batches of a configurable size, runs the batches in parallel, and sends each item through the content, headers, triples, and writer plugins as a transaction.
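The batching step can be pictured with a small plain-JavaScript sketch. It only shows how a collected list might be split into fixed-size batches. The helper name toBatches and the example URIs are hypothetical, and the framework's actual batching and parallel execution happen outside the plugins.

```javascript
// Split a list of ids into batches of at most `batchSize` items.
// `toBatches` is a hypothetical helper, not a Data Hub Framework function.
function toBatches(ids, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let start = 0; start < ids.length; start += batchSize) {
    batches.push(ids.slice(start, start + batchSize));
  }
  return batches;
}

// Made-up ids, standing in for what a collector might return.
const exampleIds = ['/products/1.json', '/products/2.json', '/products/3.json'];
const exampleBatches = toBatches(exampleIds, 2);
// exampleBatches holds two batches: the first with two ids, the second with one.
```

A smaller batch size means more, shorter units of work. A larger one means fewer, longer units.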
If you do not want to run harmonize flows as batches, the framework also provides ways to run them on demand for single items.
The five plugins do the following:
- collector: returns a list of strings to operate on
- content: returns data to put into the content section of the envelope
- headers: returns data to put into the headers section of the envelope
- triples: returns data to put into the triples section of the envelope
- writer: receives the final envelope and writes it to the database. You can do whatever you like in the writer. The default code inserts the envelope into the database, but you could push the envelope onto a message bus or send a tweet if you like.
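To make the division of labor concrete, here is a rough plain-JavaScript sketch of one id passing through the four per-item plugins. Everything in it is a stand-in: the plugin bodies, their argument lists, the envelope key names, and the in-memory store are assumptions for illustration, not the framework's generated code.

```javascript
// Hypothetical stand-ins for the four per-item plugins. The real ones are
// generated by the Data Hub Framework and run inside MarkLogic.
const samplePlugins = {
  content: (id) => ({ sku: 'ABC-123', sourceId: id }),
  headers: (id, content) => ({ harmonizedFrom: id }),
  triples: (id, content, headers) => [],
  // This writer stores the envelope in an in-memory Map instead of a database.
  writer: (id, envelope, store) => { store.set(id, envelope); }
};

// Run one id through content, headers, triples, then writer.
function harmonizeOne(id, plugins, store) {
  const content = plugins.content(id);
  const headers = plugins.headers(id, content);
  const triples = plugins.triples(id, content, headers);
  // Key names here are illustrative, not the framework's exact envelope layout.
  const envelope = { headers, triples, content };
  plugins.writer(id, envelope, store);
  return envelope;
}

const sampleStore = new Map();
harmonizeOne('/products/1.json', samplePlugins, sampleStore);
```

Swapping the writer for one that publishes to a message bus would leave the other three plugins unchanged, which is the point of keeping the writer separate.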
Click on the Collector tab.
Collector Plugin
This collector code is returning a list of URIs, one for every Product document in the staging database. We are using URIs because we intend to create one harmonized document for every ingested staging document.
The code you see uses cts.uris to get values from the URI lexicon. We pass cts.collectionQuery as the third parameter to constrain the results to the URIs of documents in the Product collection. The collection name comes from options.entity. The Data Hub Framework passes options from Java to the plugins.
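Based on that description, the collector probably resembles the sketch below. cts.uris and cts.collectionQuery are MarkLogic built-ins, so the function only works inside MarkLogic Server. The function name collect and the null placeholders for the first two arguments are assumptions, so compare this with the code shown in your own Collector tab.

```javascript
// Sketch of a collector plugin. It relies on `cts`, a global that exists
// only inside MarkLogic Server.
// The function name and the null placeholders are assumptions.
function collect(options) {
  // Third argument: restrict the URI lexicon to documents in the
  // collection named after the entity (here, "Product").
  return cts.uris(null, null, cts.collectionQuery(options.entity));
}
```

Because the collection name comes from options.entity, the same collector would work unchanged for another entity whose documents live in a collection of the same name.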
The default options passed in to the plugin are:
- entity: the name of the entity this plugin belongs to
- flow: the name of the flow this plugin belongs to
- flowType: the type of flow being run (input or harmonize)
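For the flow created in this tutorial, the options would presumably look something like the object below. The values are inferred from the steps above rather than captured from a real run, and describeRun is a hypothetical helper that only shows how a plugin might read them.

```javascript
// Illustrative options object for this tutorial's flow. The values are
// inferred from the tutorial steps, not captured from a real run.
const harmonizeOptions = {
  entity: 'Product',
  flow: 'Harmonize Products',
  flowType: 'harmonize'
};

// Hypothetical helper: build a short log line from the options.
function describeRun(options) {
  return `${options.flowType} flow "${options.flow}" for entity ${options.entity}`;
}

const runDescription = describeRun(harmonizeOptions);
```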
Click on the Content tab.
Content Plugin
The content code receives an id as the first parameter. In this flow, the id is the URI of a staging Product document. The id can be anything: a URI, a relational row id, a Twitter handle, or a random number. It’s up to you to decide how to use that id to harmonize your data.
The only modification we need to make to this file is to change the way we look up the sku.
let sku = xs.string(source.sku || source.SKU);
This change uses whichever of sku or SKU is present on the source document, which handles the two different field names in our data.
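The fallback can be tried outside MarkLogic with a plain-JavaScript stand-in. In this sketch String() takes the place of xs.string, which exists only inside MarkLogic, and pickSku is a hypothetical helper name. Returning null when neither field is present is a choice made for the sketch, not documented framework behavior.

```javascript
// Plain-JavaScript stand-in for the lookup. String() replaces xs.string,
// which is available only inside MarkLogic. `pickSku` is a hypothetical helper.
function pickSku(source) {
  // `||` falls through to SKU when sku is missing (or otherwise falsy).
  const raw = source.sku || source.SKU;
  return raw === undefined || raw === null ? null : String(raw);
}

const fromLowercase = pickSku({ sku: 'ABC-123' });
const fromUppercase = pickSku({ SKU: 456 });
const fromNeither = pickSku({ title: 'no sku here' });
```

Note that `||` also skips falsy values such as an empty string, so a document with an empty sku and a populated SKU would use SKU.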
Here is the final content.sjs file:
After making the code change, click SAVE.
Now click on the Flow Info tab.
Let’s run the flow. Click the RUN HARMONIZE button to start the flow.
Check out the Harmonized Products
After running the input flow, we verified that the job finished. Let’s do that again.
- Click on the Jobs tab.
- Make sure the job finished.
Now let’s explore our Harmonized Data.
- Click on the Browse tab.
- Change Database to FINAL.
- Click Search.
You should see harmonized documents in the search results.
Click on a result to see the raw data.
Up Next
Congratulations! You just loaded and harmonized your Product data. Up next is doing the same thing for the Order data.