In this article we walk through how Sync works. We introduce the concepts that are central to Sync and explain them with an example.
Introduction
At a high level, Sync can be thought of as a process that happens in Stages. These Stages are sequential and non-overlapping. The Stages in the Pipeline are:
- Read from Sources.
- Process Source Entity Pipeline.
- Process Source Field Pipelines.
- Deduplication and Merge.
- Save Record and Log Transactions.
- Process Destination Entity Pipeline.
- Process Destination Field Pipelines.
- Write data to Destinations.
Before we go further and talk about these Stages in detail, let's look at the Pipeline we will use throughout this article.
This is an Account pipeline with Salesforce and Hubspot as both source and destination. We will look at this pipeline in more detail later in the article. Now let’s look at the Stages.
Read from Sources
The first Stage each time Sync runs (each run is called a Sync Cycle) is reading from sources. In each Sync Cycle, a fixed number of records is read from each source to be processed through the Pipeline.
When Sync reads data from a source, it needs to read the changes that have happened since the last Sync Cycle. Each Synapse entity exposes a watermark field, and Sync uses that field to query the Synapse for records that have changed since the last Sync Cycle. Broadly, a typical Synapse entity needs to satisfy three conditions.
- Watermark field. In Salesforce Account this field would be SystemModstamp.
- Id field. In Salesforce Account this field would be Account ID. The Id field is required to identify and match records from the Synapse with records in Syncari. We will discuss this further when we look at how IDs are mapped.
- Records are returned from the source sorted by the watermark field.
The third requirement, that records are returned from the source sorted by the watermark field, is necessary because it lets Sync process records in the order in which the changes happened in the Synapse, so that new data is never overwritten with old data.
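The watermark-based read described above can be sketched as follows. This is a minimal illustration, not Syncari's implementation: the `SourceRecord` class, the `read_batch` function, and the strict "greater than" comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SourceRecord:
    id: str              # source-side Id, e.g. the Salesforce Account ID
    watermark: datetime  # e.g. SystemModstamp for Salesforce Account
    fields: dict


def read_batch(records, last_watermark, batch_size):
    """Return up to batch_size records changed after last_watermark, oldest first."""
    changed = [r for r in records if r.watermark > last_watermark]
    # A real source would return rows already sorted; we sort here to model that.
    changed.sort(key=lambda r: r.watermark)
    return changed[:batch_size]


source = [
    SourceRecord("001C", datetime(2024, 1, 3), {"Name": "Acme"}),
    SourceRecord("001A", datetime(2024, 1, 1), {"Name": "Globex"}),
    SourceRecord("001B", datetime(2024, 1, 2), {"Name": "Initech"}),
]
batch = read_batch(source, last_watermark=datetime(2024, 1, 1), batch_size=10)
batch_ids = [r.id for r in batch]
```

Because the batch is ordered by watermark, an older change to a record is always processed before a newer change to the same record.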
Now we are ready for the next Stage: Process Source Entity Pipeline.
Process Source Entity Pipeline
In this Stage, the records read from the sources are processed through each function in the Entity Pipeline. In our pipeline, records from Salesforce Account go through the "Attach Salesforce" function and records from Hubspot Company go through the "Attach Hubspot" function.
"Attach Salesforce" and "Attach Hubspot" are both instances of the "Attach Record" function, which is used to perform data unification. Let's look at how the "Attach Salesforce" function is configured. We match the incoming Salesforce record's Account Name against the Account Name of all Account records in the Syncari Database. If such a Syncari Account record exists, we associate the incoming Salesforce record's ID with that Syncari record. We call this association ID Mapping. Below is a representation of the ID Mapping in Data Studio for a record in Syncari. Together, these two functions allow Sync to unify records from Salesforce and Hubspot. More details on data unification and the Attach Record function can be found here.
ID Mapping lets us associate an incoming source record with an existing record in Syncari. It also helps us identify the correct Transaction Log operation: Create, Update, or Delete. We will discuss transactions in the next section.
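To make the matching and ID Mapping idea concrete, here is a small sketch. The `SyncariRecord` shape, the `attach_record` function, and the operation names are hypothetical; they only illustrate how matching on a field can both attach a source ID and distinguish a Create from an Update (Delete handling is left out).

```python
from dataclasses import dataclass, field


@dataclass
class SyncariRecord:
    syncari_id: str
    fields: dict
    # ID Mapping: synapse name -> that synapse's record ID (assumed shape)
    id_mapping: dict = field(default_factory=dict)


def attach_record(incoming_id, incoming_fields, synapse, store, match_field):
    """Attach an incoming source record to a matching Syncari record, if any."""
    # 1. The source ID is already mapped: this is an update to a known record.
    for rec in store:
        if rec.id_mapping.get(synapse) == incoming_id:
            return rec, "UPDATE"
    # 2. Otherwise match on the configured field, e.g. Account Name.
    key = incoming_fields.get(match_field)
    if key is not None:
        for rec in store:
            if rec.fields.get(match_field) == key:
                rec.id_mapping[synapse] = incoming_id  # record the ID Mapping
                return rec, "UPDATE"
    # 3. No match: the record is new to Syncari.
    return None, "CREATE"


store = [SyncariRecord("syn-1", {"Account Name": "Acme"}, {"hubspot": "hs-77"})]
matched, op = attach_record(
    "001A", {"Account Name": "Acme"}, "salesforce", store, "Account Name"
)
unmatched, new_op = attach_record(
    "001B", {"Account Name": "Globex"}, "salesforce", store, "Account Name"
)
```

After the first call, the single Syncari record carries IDs from both Hubspot and Salesforce, which is the unification the two Attach functions achieve together.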
Process Source Field Pipelines
When building a Pipeline, we configure Field Mappings like the ones below. These mappings define how source and destination fields are mapped to Syncari fields. A Field Pipeline is created around each mapped Syncari field. Most Field Pipelines are simple mappings where Sync just copies the field value from the source Synapse into the Syncari field.
We can also define transformations on these Field Pipelines. Below is an example of the Domain Field Pipeline. Here we use the "Extract Domain" function to extract the domain from the incoming Salesforce Account Website field value and map it to the Syncari Account Domain field.
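As an illustration of what a domain-extracting transformation might do, here is a short sketch. The exact behavior of the "Extract Domain" function is not specified in this article, so the `extract_domain` helper and its handling of a missing scheme and a leading "www." are assumptions.

```python
from urllib.parse import urlsplit


def extract_domain(website):
    """Return the host part of a website value, without a leading 'www.'."""
    if not website:
        return None
    value = website.strip()
    # urlsplit only recognises the host when a '//' separator is present.
    if "//" not in value:
        value = "//" + value
    host = urlsplit(value).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host or None


domain_from_url = extract_domain("https://www.acme.com/about")
domain_from_bare = extract_domain("acme.com")
```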
Transaction Log
As each source Field Pipeline is processed, Sync records the newly processed field value and the existing field value (if there is no existing record, the existing value is recorded as null). This change log is aggregated across all the fields in the record and logged as the Transaction Log in a later Stage of the Sync.
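The per-field change log can be pictured with a sketch like the one below. The dictionary layout and the `build_transaction` name are hypothetical; the real Transaction Log format is not described here.

```python
def build_transaction(record_id, existing, processed):
    """Aggregate old/new values for every processed field into one transaction.

    existing is None when there is no existing Syncari record.
    """
    old_values = existing or {}
    changes = [
        # .get() yields None (null) when there is no existing value.
        {"field": name, "old": old_values.get(name), "new": new_value}
        for name, new_value in processed.items()
    ]
    return {
        "record_id": record_id,
        "operation": "CREATE" if existing is None else "UPDATE",
        "changes": changes,
    }


created = build_transaction("syn-1", None, {"Account Name": "Acme"})
updated = build_transaction(
    "syn-1", {"Account Name": "Acme"}, {"Account Name": "Acme Corp"}
)
```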
Data Authority
In our example, the Syncari Account Name field is mapped to Salesforce Account Name and Hubspot Company Name. If both sources attempt to update the same field of a given record at the same time, the Data Authority configuration on the Field Pipeline decides which value is chosen. In the example below, the value from Salesforce is chosen. See Data Authority for details.
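One simple way to think about this conflict resolution is as a ranked list of sources. The sketch below assumes exactly that; the actual Data Authority configuration may offer other options, and `resolve_by_authority` is a name invented for this example.

```python
def resolve_by_authority(candidates, authority_order):
    """Pick the value from the highest-ranked source that supplied one.

    candidates: dict of source name -> proposed field value
    authority_order: list of source names, highest authority first
    """
    for source in authority_order:
        if source in candidates:
            return source, candidates[source]
    raise ValueError("no candidate came from a source in authority_order")


winner_source, winner_value = resolve_by_authority(
    {"hubspot": "Acme Inc", "salesforce": "Acme"},
    authority_order=["salesforce", "hubspot"],
)
```

With Salesforce ranked first, its value wins whenever both sources propose one; Hubspot's value is used only when Salesforce supplies nothing.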
At the end of this Stage, incoming records from the sources have been processed through the Entity and Field Pipelines, unified, and had Data Authority rules applied. Now we are ready to save the records in the Syncari Database.
Deduplication and Merge
The next Stage in Sync is Deduplication and Merge. We merge the incoming record with existing records in Syncari based on rules defined in Merge Studio. Below is an example of a simple rule.
If the incoming record has the same Account Name as one or more existing Syncari records, we consider these records duplicates. The "Select Winner" rule decides which of the duplicates is the winning record; the others become losers. Field values from the losing records are copied over to the winning record based on the "Default Merge Policy" and "Default Override Policy", and the losing records are deleted from Syncari. The winning record, with its updated field values, is saved in the Syncari Database. More details on Merge Studio configuration can be found here.
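The merge flow can be sketched roughly as below. This is only an illustration under stated assumptions: the winner is chosen by a caller-supplied function (here, the oldest record), and the merge policy simply fills the winner's empty fields from the losers. The actual "Select Winner", "Default Merge Policy", and "Default Override Policy" behaviors are configured in Merge Studio and may differ.

```python
def merge_duplicates(records, match_field, select_winner):
    """Group records by match_field, keep one winner per group, merge losers in."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[match_field], []).append(rec)

    winners, deleted_ids = [], []
    for group in groups.values():
        winner = dict(select_winner(group))  # copy so inputs stay untouched
        for loser in group:
            if loser["id"] == winner["id"]:
                continue
            # Assumed merge policy: only fill fields the winner lacks.
            for name, value in loser.items():
                if winner.get(name) in (None, ""):
                    winner[name] = value
            deleted_ids.append(loser["id"])
        winners.append(winner)
    return winners, deleted_ids


records = [
    {"id": "syn-1", "Account Name": "Acme", "Domain": None, "created": 1},
    {"id": "syn-2", "Account Name": "Acme", "Domain": "acme.com", "created": 2},
    {"id": "syn-3", "Account Name": "Globex", "Domain": "globex.com", "created": 3},
]
winners, deleted_ids = merge_duplicates(
    records, "Account Name", select_winner=lambda g: min(g, key=lambda r: r["created"])
)
```

Here the two "Acme" records collapse into one: the older record wins, picks up the Domain value from the loser, and the loser's ID is marked for deletion.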
Save Record and Log Transactions
The next Stage saves the processed records in Syncari. Transactions generated in the previous Stages are logged in this Stage.
Destination Processing
Destination-side processing works on the data that was saved in Syncari in the previous Stage. In the example below, data can be thought of as moving from the Account node to the Salesforce and Hubspot destinations along two paths.
Different Synapses may have different limits and quotas on how many requests they can accept, so processing for one destination may be slower than for another. We keep track of this with a destination watermark for each destination Synapse.
A destination watermark represents the time up to which changes in Syncari have been processed and written to that specific destination. It also lets you check the lag between when a record is processed and saved in Syncari and when the changes are written out to the destination. In the majority of cases this watermark will be very close to the current time.
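The lag implied by a destination watermark is just the difference between the current time and that watermark. A minimal sketch, with hypothetical names and made-up timestamps:

```python
from datetime import datetime, timedelta, timezone


def destination_lag(destination_watermarks, now):
    """Return how far each destination is behind the current time."""
    return {dest: now - wm for dest, wm in destination_watermarks.items()}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
watermarks = {
    "salesforce": now - timedelta(seconds=5),  # nearly caught up
    "hubspot": now - timedelta(minutes=3),     # slower, e.g. due to API quotas
}
lag = destination_lag(watermarks, now)
```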
Process Destination Entity Pipeline
Destination Entity Pipeline processing is similar to Source Entity Pipeline processing, except that it processes data saved in Syncari instead of data from source Synapses. Below is an example of how a Destination Entity Pipeline may be configured. Here, Sync stops Account data from being written to Hubspot if the Account Type is not Customer. The Destination Entity Pipeline supports all the Functions and Actions that the Source Entity Pipeline does.
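Expressed as code, the filter in this example amounts to a simple predicate applied to each record before it continues toward Hubspot. The function name and record shape below are assumptions for illustration only.

```python
def allow_write_to_hubspot(account):
    """Only Accounts whose Account Type is Customer continue to Hubspot."""
    return account.get("Account Type") == "Customer"


accounts = [
    {"Account Name": "Acme", "Account Type": "Customer"},
    {"Account Name": "Globex", "Account Type": "Prospect"},
    {"Account Name": "Initech"},  # no Account Type set
]
to_hubspot = [a["Account Name"] for a in accounts if allow_write_to_hubspot(a)]
```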
Process Destination Field Pipelines
As with the Destination Entity Pipeline above, Destination Field Pipeline processing is similar to Source Field Pipeline processing. Transformations in the Field Pipeline are intended to transform a field value before it is written out to the destination. For example, you might want to set Company Type to "Other" in Hubspot if it is not Prospect or Partner. Below is the Pipeline.
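The Company Type transformation described above is equivalent to a small mapping function. The sketch below uses a hypothetical function name and assumes the comparison is an exact, case-sensitive match.

```python
ALLOWED_COMPANY_TYPES = {"Prospect", "Partner"}


def hubspot_company_type(syncari_type):
    """Pass Prospect and Partner through; map everything else to 'Other'."""
    return syncari_type if syncari_type in ALLOWED_COMPANY_TYPES else "Other"


mapped = [hubspot_company_type(t) for t in ["Prospect", "Customer", None]]
```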
Write data to Destinations
Once the Entity and Field Pipelines are processed, Sync is ready to write data to the destinations. Before a record is written to a destination, we need to check whether the record has changed. There are four possibilities.
- This is a new record that is not present in the destination Synapse.
- This record already exists in the destination Synapse, but its field values have changed after being processed through the pipeline.
- This record already exists in the destination Synapse and its field values have not changed.
- This record was deleted in Syncari and now needs to be deleted in the destination Synapse.
In all of the above cases except the third, changes are written to the destination.
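The four possibilities map naturally onto a small decision function. This is a sketch with invented names; in particular, returning "SKIP" for a record deleted in Syncari that never existed in the destination is an assumption, since the article does not cover that case.

```python
def destination_operation(processed, destination_record, deleted_in_syncari=False):
    """Decide what to write for one record.

    processed: field values after the destination pipelines have run
    destination_record: current field values in the destination, or None
    """
    if deleted_in_syncari:
        # Assumption: nothing to delete if the destination never had the record.
        return "DELETE" if destination_record is not None else "SKIP"
    if destination_record is None:
        return "CREATE"   # new record, not present in the destination
    if processed != destination_record:
        return "UPDATE"   # exists, and field values have changed
    return "SKIP"         # exists and unchanged: nothing is written


ops = [
    destination_operation({"Name": "Acme"}, None),
    destination_operation({"Name": "Acme Corp"}, {"Name": "Acme"}),
    destination_operation({"Name": "Acme"}, {"Name": "Acme"}),
    destination_operation({"Name": "Acme"}, {"Name": "Acme"}, deleted_in_syncari=True),
]
```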
This was the last Stage in the Sync process. Before the Sync Cycle is marked as complete, Sync's internal bookkeeping moves the source watermarks forward, so that the next Sync Cycle does not read already processed records from the source. Source watermarks are moved only at the end of the Sync Cycle, which guarantees that if a Sync Cycle is interrupted, the next Sync Cycle reads the same data from the sources and finishes the processing. This ensures that data is processed at least once.
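The at-least-once behavior follows directly from moving the watermark last. The sketch below simulates an interrupted cycle followed by a successful one; `run_cycle`, the integer watermarks, and the in-memory `state` dictionary are simplifications invented for the example.

```python
def run_cycle(state, source, process, batch_size=2):
    """Run one simplified Sync Cycle over (watermark, record_id) pairs.

    source is assumed to be sorted by watermark.
    """
    batch = [r for r in source if r[0] > state["watermark"]][:batch_size]
    for _, record_id in batch:
        process(record_id)  # if this raises, the watermark below is never moved
    if batch:
        state["watermark"] = batch[-1][0]  # advance only after all processing
    return [record_id for _, record_id in batch]


state = {"watermark": 0}
source = [(1, "a"), (2, "b"), (3, "c")]
processed = []
fail_once = {"b"}


def process(record_id):
    processed.append(record_id)
    if record_id in fail_once:
        fail_once.discard(record_id)
        raise RuntimeError("cycle interrupted")


try:
    run_cycle(state, source, process)  # interrupted while processing "b"
except RuntimeError:
    pass
watermark_after_failure = state["watermark"]

run_cycle(state, source, process)      # re-reads "a" and "b" and finishes
```

Record "a" is processed twice across the two cycles, which is the trade-off of at-least-once processing: nothing is lost on interruption, but downstream steps must tolerate seeing the same record again.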