cancel
Showing results for 
Search instead for 
Did you mean: 

Using Snowflake Internal, or External Staging?

mark-vandewiel
Fivetranner
Fivetranner

Snowflake is one of the most popular destinations for Fivetran, also with the use of the HVR Technology. When Fivetran syncs data into Snowflake we stage data using files.

This post covers considerations between using external and internal staging.

Why Staging?

Snowflake points out that bulk load is the most efficient way to load large volumes of data. Naturally for the insert-only initial sync we leverage staging. Data from files in the staging area is loaded directly into the target tables using copy.

For increments following the initial sync, Fivetran - including HVR - processes changes using controlled routines. With the many different data sources we support - SaaS applications, databases and others - we rarely come across an insert-only workload. Almost all use cases include updates, and deletes. We use staging to bulk load incremental changes into staging tables, followed by merge statements to apply the changes to the target tables. With data already residing in Snowflake the merges are very efficient, allowing for high throughput (averaging a high number of changes per second).

Internal vs. external staging

The Fivetran managed service always uses external staging in our staging environment. Data is always encrypted, and we automatically generate temporary credentials to load the data. Data is deleted when it was successfully loaded. This is all part of the managed service, and you don't have to worry about the management of the data in the staging area.

With the use of the HVR solution you have a choice between the use of external or internal staging. In either case, HVR will manage the files in the staging area (i.e. delete the files when they are no longer needed). Keeping the files HVR uses to apply changes to the target in the staging area would lead to inconsistencies. You can configure an alternative (file) destination to keep data in files, for example in a data lake.

Given the options internal versus external staging, here are a number of considerations:

  • The use of internal staging is easier to configure than external staging.
  • With the use of internal staging data must first be written to a local file system, and then uploaded into the staging area. Data cannot be streamed directly into Snowflake's internal staging area. External staging on the other hand allows for changes to be streamed directly into the staging area, bypassing the step to write files locally. This may lead to better performance for external versus internal staging.
  • HVR loads data using four parallel streams when internal staging is used. For external staging parallelism is dependent on the cloud running your Snowflake (AWS, Azure, GCP) which may result both in faster or slower performance for external staging relative to internal staging. On AWS external staging will likely perform faster, especially for large volumes of change data. If low latency is critical then you should run a benchmark on your data set to learn how you can get the best performance.
  • Internal staging is accessible through Snowflake credentials. To use external staging you must use alternative credentials. Many organizations require regularly rotating of credentials, introducing additional management when using external staging.
  • The files to be uploaded to internal staging are not encrypted at the OS level. Of course file system encryption may be enabled. In Snowflake's internal staging the data is encrypted. Communication to Snowflake is also encrypted. With the use of external staging in S3 you can use KMS client side encryption, allowing for end-to-end encrypted data transfers.
  • Costs for the use of internal staging are on the Snowflake invoice, whereas you pay the cloud provider directly for the use of external staging.

Given these considerations, what option should you use? With HVR I recommend you use internal staging unless you see a strong reason not to.

0 REPLIES 0