
Sizing your HVR hub server

mark-vandewiel
Fivetranner

Users of the HVR technology have to manage their own HVR installation.

When you start using HVR 6, or you migrate from version 5 to 6, you may wonder: how do I size my HVR hub server? The HVR documentation provides high-level guidelines. In this post I want to clarify some of the considerations and discuss the repository database.

Agents

The HVR technology was designed to leverage agents in a distributed setup. Agents perform the heavy lifting of data replication, allowing for a hub with very little load that can handle a lot of channels (data pipelines).

However, HVR was also designed to be very flexible, allowing log-based CDC and data delivery to be performed remotely (with applicable considerations for performance and the data volumes that can be processed). Remote access to a source or destination endpoint can be achieved through an agent, or by the hub server directly. The sizing guidelines make broad-stroke, and fairly conservative, assumptions about the load that remote access to the endpoints puts on the hub server.

The most cost-effective configuration in your environment may deviate from our recommended configurations. Through testing you can determine what load your configuration can handle.

Remote capture

The amount of resources remote capture requires varies somewhat depending on how much processing, and filtering, takes place in the database. HVR will, as much as possible, push down predicates when retrieving log fragments remotely. For SQL Server remote capture, for example, no filtering is possible. DB2i capture, on the other hand, allows log change records to be retrieved based on the impacted table names. Hence remote capture for SQL Server requires more bandwidth than remote capture for DB2i, while DB2i performs more processing on the source to serve up the changes.

Remote integrate

HVR supports two modes of integration: continuous and burst. Transaction processing destinations like Oracle, SQL Server, and PostgreSQL may use continuous integration. Analytical destinations like Snowflake, BigQuery, and Databricks always use burst mode.

Continuous integration is highly sensitive to network latency between the integration server and the destination. Most of the processing for continuous integration is performed by the destination database system. The agent or hub integrating the data into the target is not performing much work, except when an SAP source replicates cluster and pool tables.

Burst integration always involves staging changes in burst tables that map 1:1 to the actual target tables. The burst tables are populated through bulk loads. These may involve one more layer of staging, with an external table on top, or use a bulk load utility such as copy. The first phase of the burst cycle is particularly resource intensive on the integrate server. For every row in a data set, the integration computes the net change based on the outstanding changes. This process is referred to as coalescing and is reported separately in the logs. Coalescing is both CPU- and memory-intensive.
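To illustrate why coalescing is CPU- and memory-intensive, here is a minimal sketch of the general idea in Python. The record layout and function name are hypothetical, not HVR's actual implementation: every outstanding change for a key must be held and reduced to at most one net change before the bulk load.

```python
# Illustrative sketch of change coalescing (hypothetical record layout,
# not HVR's implementation): reduce an ordered stream of change records
# to at most one net change per row key before the bulk load.

def coalesce(changes):
    """changes: iterable of (op, key, values) in commit order,
    where op is 'insert', 'update', or 'delete'.
    Returns {key: (net_op, values)} with cancelled changes dropped."""
    net = {}  # key -> (op, values), or None when changes cancel out
    for op, key, values in changes:
        prev = net.get(key)
        if prev is None:
            # First change for this key (or prior changes cancelled out)
            net[key] = (op, values)
            continue
        prev_op, _ = prev
        if op == "delete":
            # insert followed by delete cancels out entirely
            net[key] = None if prev_op == "insert" else ("delete", None)
        elif op == "update":
            # insert + update stays an insert, carrying the latest values
            new_op = "insert" if prev_op == "insert" else "update"
            net[key] = (new_op, values)
        else:
            # re-insert after a delete nets out to an update
            net[key] = ("update", values)
    return {k: v for k, v in net.items() if v is not None}
```

The memory cost is visible in the sketch: the `net` dictionary holds one entry per distinct row key in the cycle, which is why large, infrequent burst cycles demand more resources than small, frequent ones.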

If integrate runs on your hub server, then its sizing can be tuned based on the type of integration as well as the frequency of integration runs.

Repository sizing

The HVR repository contains 23 tables. Most tables will remain very limited in size. However, depending on your activity, the following tables can become very large:

  • hvr_stats: this table stores the statistics that are used to populate the insights. You can modify the granularity and retention policy under SYSTEM -> Current Hub -> Statistics Tuning.
  • hvr_event_result: this table contains a value column that is defined as a LOB. Depending on your database technology and the frequency with which events run, hvr_event_result may allocate a lot of storage to the LOB column.

The hvr_stats table has two indexes that, over time, each may exceed the size of the hvr_stats table. A busy hub, with dozens of channels, will likely see the hvr_stats table grow to tens of GBs. With fewer channels and less activity the table will grow less rapidly. Depending on the database technology for your hub, you may also benefit from occasionally rebuilding the indexes (hvr_stats__x0 and hvr_stats_pkey) and/or reorganizing the table(s).

You should plan to allocate 20+ GB of storage to your repository database, and more for a busier hub.
