
Best practices for using an integration agent

mark-vandewiel
Fivetranner

Integration agents are only applicable to the HVR Solution. When using the Fivetran managed service, you don't have to worry about whether or not to use an integration agent, because Fivetran writes to your target for you.

The HVR Solution was built from the ground up on the principle of distributing load through agents. The three main reasons for using integration agents are:

  1. Performance. An agent close to the target has low-latency access to the data store. Communication between hub and agent is always compressed, with compression ratios up to 10x (illustrated in the sketch just after this list). Network communication is further optimized to limit sensitivity to high latency and achieve maximum bandwidth.
  2. Scalability. The agent performs a subset of the work. If more work has to be performed, an additional (stateless) agent can be added to distribute the load. Because data flows (pipelines, or channels) are often consolidated into the same target, customers sometimes use an agent farm consisting of multiple agents, with a load balancer to automatically scale the number of agents up or down as required.
  3. Security. HVR communication between hub and agents uses TLS 1.3.
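
To make the compression point concrete, here is a minimal, self-contained sketch of why change streams compress so well: row images repeat column names and similar values, so even a general-purpose codec reaches high ratios. The record layout and table name below are made up, and zlib merely stands in for HVR's wire compression, which is not shown here.

```python
import json
import zlib

# Hypothetical change records; real CDC payloads are similarly repetitive.
changes = [
    {"op": "insert", "table": "orders", "id": i, "status": "NEW", "region": "EMEA"}
    for i in range(10_000)
]

raw = json.dumps(changes).encode("utf-8")
compressed = zlib.compress(raw, level=6)

print(f"raw: {len(raw):,} bytes, compressed: {len(compressed):,} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
```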

What processing does the integration agent perform?

The amount of work required to integrate changes depends on the destination technology and the pipeline (channel) configuration. At a high level, there are two main approaches:

  1. Continuous mode, for predominantly operational use cases with either a transaction processing (OLTP) database or Kafka as the target. Examples include replication between Oracle and Kafka, between PostgreSQL and MySQL, or a homogeneous use case (target technology identical to the source).

    In continuous mode, HVR applies changes to the target in commit order, row by row (see the apply-loop sketch after this list). The bulk of the processing is performed by the target technology. However, the row-by-row nature of the integration requires low latency to achieve fast performance, so continuous mode benefits from an integration agent - close to the target database, if not on the database server itself - primarily for the low latency.

  2. Burst (micro-batch) mode, used for all analytical database technologies such as Snowflake, Google BigQuery, and Databricks, as well as for use cases with files as a destination (e.g. S3, ADLS, GCS).

    HVR uses micro-batches because, without them, the destination technology would not be able to keep up with the rate of changes coming from one or more sources. Burst mode computes a net operation (insert, update, or delete) per unique row before preparing a data set that is processed as a micro-batch. Computing the net operation is called coalescing, and it is both CPU- and memory-intensive (if memory thresholds are exceeded, it spills to disk, writing temporary files); a sketch of the idea follows below. Formatting files - staging files or regular destination files - is also CPU-intensive, as are operations like client-side encryption (configuration-dependent).

    Burst mode benefits primarily from the scalability aspect of an integration agent.
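
As a rough illustration of the continuous-mode pattern, here is a minimal sketch of a row-by-row apply loop over a generic Python DB-API connection. The table, columns, and change-record shape are hypothetical, and this is not HVR's implementation; the point is that every statement is a network round trip to the target.

```python
def apply_continuous(conn, transactions):
    """Apply captured transactions to the target in source commit order, row by row.

    `conn` is any DB-API connection (e.g. psycopg2); `transactions` is an
    iterable of objects carrying a `.changes` list - both are illustrative.
    """
    cur = conn.cursor()
    for txn in transactions:          # transactions arrive in source commit order
        for change in txn.changes:    # each change is a single row operation
            if change.op == "insert":
                cur.execute(
                    "INSERT INTO orders (id, status) VALUES (%s, %s)",
                    (change.row["id"], change.row["status"]),
                )
            elif change.op == "update":
                cur.execute(
                    "UPDATE orders SET status = %s WHERE id = %s",
                    (change.row["status"], change.row["id"]),
                )
            elif change.op == "delete":
                cur.execute("DELETE FROM orders WHERE id = %s", (change.row["id"],))
        conn.commit()  # preserve source transaction boundaries
```

Each `execute` waits on a round trip to the target, so throughput is bounded by network latency - the reason continuous mode wants the agent next to (or on) the database server.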

Agent farms are most popular on the target side when burst mode is used.
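
To make the coalescing step concrete, here is a minimal sketch of the net-operation computation, assuming every change record carries a unique row key. Names and record shapes are made up, and unlike the real implementation this version keeps everything in memory rather than spilling to disk when thresholds are exceeded.

```python
def coalesce(changes):
    """Reduce a stream of row changes to one net operation per unique key."""
    net = {}  # key -> ("insert" | "update" | "delete", latest row image)
    for ch in changes:  # changes arrive in commit order
        prev = net.get(ch.key)
        if ch.op == "insert":
            # a delete followed by a re-insert nets out to an update
            net[ch.key] = (
                ("update", ch.row) if prev and prev[0] == "delete"
                else ("insert", ch.row)
            )
        elif ch.op == "update":
            if prev and prev[0] == "insert":
                net[ch.key] = ("insert", ch.row)  # still a net insert, latest image
            else:
                net[ch.key] = ("update", ch.row)
        elif ch.op == "delete":
            if prev and prev[0] == "insert":
                del net[ch.key]                   # insert + delete cancel out
            else:
                net[ch.key] = ("delete", None)
    return net  # one micro-batch, ready to stage and merge into the target
```

The result shrinks the work the destination has to do: however many times a row changed during the batch window, the target processes it only once.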
