What is the impact of running log-based CDC on my database server?

mark-vandewiel
Fivetranner

Several technologies, including Fivetran and Fivetran's HVR Solution, support log-based Change Data Capture (CDC) from transaction processing databases. DBAs and system administrators often wonder what impact running CDC has on the database server in terms of CPU, memory and IO resources, including disk and network IO.

This post covers generic information that applies to most transaction processing database technologies (e.g. Oracle, SQL Server, DB2, MySQL, PostgreSQL, SAP HANA and others). However, concepts and features are not identical across technologies, so there are subtle differences that result in slightly higher or lower impact. Also, for many database technologies Fivetran provides multiple methods to perform log-based CDC, which may change the impact. Please do respond to this post or start a new thread if you'd like to discuss a specific technology or capture method.

Log-based CDC pulls table changes from the transaction log, both while transactions are in flight and after they complete. The transaction log is the foundation for database recovery. Replication technologies generally require that the actual data changes be written to the transaction log, as opposed to just the statements that were executed. Log-based CDC runs independently of the transaction that is making the change. As a result, on a well-configured system with sufficient resources, the transactions themselves may not experience any noticeable impact. However, even when it is not noticeable, there is an impact on database processing and the database server.

What causes the impact?

To allow for the highest possible transaction rates, database technologies generally minimize the amount of change data logged without compromising recoverability. As a result, updates only record the column values that were changed. Also, identifiers for existing rows are often internal identifiers rather than application-level identifiers (primary or unique key column values). Log-based CDC - commonly used to push changes into a different technology - requires application-level identifiers. The first cause of impact is so-called supplemental logging: the database is instructed to write additional information to the log, requiring more IO to the log and more storage space for backups.
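
As an illustration, this is roughly what enabling supplemental logging looks like on two common platforms. This is a sketch only - schema and table names like app.orders are hypothetical, and the exact settings required depend on the database technology and capture method:

    -- Oracle: enable minimal supplemental logging database-wide,
    -- then log primary key columns for a specific table
    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
    ALTER TABLE app.orders ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

    -- PostgreSQL: the rough equivalent is the replica identity, which
    -- controls how much of the old row is written to the WAL on update/delete
    ALTER TABLE app.orders REPLICA IDENTITY FULL;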

The amount of impact from supplemental logging varies, depending on the relative percentage of database activity you want to capture, the transaction mix between inserts, updates and deletes, the number of columns per table, as well as differences between database technologies. Also, if you already replicate your database, then you may already absorb this overhead. It is, however, not uncommon to see significant additional transaction log generation (20+%) as a result of supplemental logging. In practice, though, this rarely slows down the database server or even database processing.

A second cause of overhead is log parsing: which operations in the log are relevant, and what do they contain? Several database technologies provide built-in routines or APIs to retrieve the changes. For example, SQL Server can populate CDC tables you can select from, DB2 provides the db2ReadLog API, and PostgreSQL uses logical decoding plugins to retrieve data from replication slots.
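
For instance, here is a minimal sketch of enabling these built-in mechanisms on SQL Server and PostgreSQL. Object names such as dbo.orders and cdc_slot are hypothetical, and PostgreSQL additionally requires wal_level = logical:

    -- SQL Server: enable CDC on the database, then on a table;
    -- a capture job populates change tables you can select from
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'orders',
        @role_name     = NULL;

    -- PostgreSQL: create a logical replication slot using the
    -- built-in test_decoding output plugin
    SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding');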

Some CDC providers, including Fivetran, provide so-called binary log readers. A binary log reader has the ability to parse a transaction log file or log fragments directly. Such an approach bypasses database processing altogether, and can often run on a different server, provided there is access to the transaction logs. However, to achieve the lowest possible latency, such a binary log reader may also run directly on the database server. In fact, binary log readers almost always consume fewer resources (CPU, memory, IO) than database routines or APIs, because in-database processing introduces additional overhead.

A third cause of overhead, which is not always applicable, is the presentation of the change data. For example, SQL Server will publish changes either in a distribution database (in the case of replication) or through CDC tables (when CDC is used). In either case additional database processing and storage are required to make the changes visible.
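
To make this concrete, once SQL Server CDC is enabled, changes land in change tables that are read through generated table-valued functions. A sketch, continuing the hypothetical dbo_orders capture instance from above:

    -- SQL Server: read all changes for the dbo_orders capture instance
    -- between two log sequence numbers (LSNs)
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all');

Populating and storing these change tables is exactly the extra database processing and storage this cause refers to.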

A fourth cause of overhead is the retrieval of changes: what is the cost, in terms of database processing and network IO, of pulling the changes? When pulling data remotely, a large data transfer may benefit from optimizations in the database connectivity driver. Generally, passing less data and using optimized network connectivity will result in the lowest possible impact.
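
As an example of retrieval, PostgreSQL lets you consume a replication slot in batches via plain SQL (continuing the hypothetical cdc_slot from above):

    -- PostgreSQL: consume up to 1000 changes from the slot;
    -- consumed changes advance the slot so the WAL can be recycled
    SELECT * FROM pg_logical_slot_get_changes('cdc_slot', NULL, 1000);

    -- or peek without consuming, e.g. to inspect the output first
    SELECT * FROM pg_logical_slot_peek_changes('cdc_slot', NULL, 10);

Retrieving in bounded batches like this keeps the database processing and network transfer per call predictable.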

Fivetran's agent-based technology, including HVA connectors and the HVR Solution, always passes compressed data on the wire. With 5-10x compression on just the change data, the data volume crossing the network is drastically reduced relative to uncompressed CDC data transfer, let alone relative to shipping full transaction logs to a binary log parser running elsewhere.

How high is the impact?

Many factors influence the impact of log-based CDC. What percentage of database activity must be captured? What are the table definitions involved, including the data types they use and the row width? What is the transaction mix? What approach is used to capture the changes?

In practice it is rare to see more than 10% overhead on database server CPU due to the introduction of log-based CDC. On a well-configured system you should be able to drive database processing up to 100% CPU utilization. This implies that if your database server CPU utilization is generally no higher than 70-80%, introducing CDC will have very limited, if any, noticeable impact: for example, a server averaging 75% CPU that takes on the full 10% CDC overhead still peaks around 85%, well below saturation.

Ironically, the less activity there is in your database, the higher the relative overhead of CDC. Not only is the base much lower on a mostly idle system, but the overhead of validating that there is no change data also becomes noticeable. Of course, on a mostly idle system this overhead will not hurt.
