Pointers for successful data lake and warehouse migrations
Congratulations! Your team has finally secured budget for a new enterprise cloud data warehouse or lake to accelerate your data initiatives. This is a step in the right direction, except for one open question: what happens to your old warehouse, and how do you actually cut everything over?
When it comes to moving the data itself, there are a few questions that commonly come up:
- Which data do you actually need or want?
- How do you set up a replication process that doesn’t lock any tables that are still receiving updates?
- How do you safeguard data integrity and verify that all of the data actually makes its way to the new warehouse?
A typical warehouse migration can take anywhere from a few months to two years, depending on the maturity of your organization. Many organizations also opt to keep using their old warehouse as a transactional record store, making it important to think about how you’ll implement these safeguards on an ongoing basis.
Data migration strategy
As a general rule of thumb, you likely won’t need to migrate all of the data in your old enterprise warehouse. This step will take some cross-organizational collaboration to ensure that you’re bringing over critical data being actively used for analytics, as well as any recently added objects with upcoming use cases. If you’re unsure whether you’ll use a given dataset, and you’re migrating to a cloud-based warehouse, a common solution is to err on the side of caution and bring the data over anyway. In tools that bill compute and storage separately, storage is typically no more than 5% of overall costs.
Once your migration team has settled on which schemas and objects to bring to your new warehouse, you should consider using a tool like Fivetran, which not only supports popular transactional databases (e.g. SQL Server, Oracle), but also includes cloud warehouses such as BigQuery and Snowflake as sources to be replicated from. Some of the benefits built into its connectors include:
- Object Selection - After finalizing your objects to be brought over, you can select them in the Fivetran interface via checkboxes. If, at any time, you decide that a previously excluded table has value, then you’ll be able to select that object to be imported.
- Simple setup - Generally speaking, you’ll only need to provision a database user with read access to replication logs (for transactional databases) and tables that you’d like to replicate. If you have the proper admin permissions, you should be able to do this in less than 15 minutes.
- Automated replication - Fivetran connector replication consists of two phases: an initial historical sync of existing data, followed by update syncs that reconcile the source (connector) with the target (destination). You won’t need to do anything beyond the optional object selection portion.
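To make the two phases concrete, here’s a minimal sketch of the pattern — a historical sync that copies everything, then incremental syncs that upsert only rows changed since the last recorded high-watermark. All names and structures here are illustrative assumptions, not Fivetran’s actual implementation:

```python
# Hypothetical sketch of two-phase replication. Rows are dicts keyed by "id",
# with "updated_at" acting as the change marker (high-watermark).

def historical_sync(source_rows):
    """Phase 1: copy every existing row into the destination."""
    destination = {row["id"]: row for row in source_rows}
    # Remember the highest change marker seen so far.
    watermark = max((r["updated_at"] for r in source_rows), default=0)
    return destination, watermark

def incremental_sync(destination, watermark, source_rows):
    """Phase 2: upsert only rows modified after the last watermark."""
    for row in source_rows:
        if row["updated_at"] > watermark:
            destination[row["id"]] = row  # insert or update, idempotently
            watermark = row["updated_at"]
    return destination, watermark

# Example: initial sync, then one later update arrives on the source.
rows = [{"id": 1, "name": "a", "updated_at": 10},
        {"id": 2, "name": "b", "updated_at": 20}]
dest, wm = historical_sync(rows)
rows.append({"id": 1, "name": "a2", "updated_at": 30})  # row 1 was updated
dest, wm = incremental_sync(dest, wm, rows)
```

Because each sync only applies rows newer than the watermark, re-running a sync after a failure doesn’t duplicate data — the same idempotence property the managed connectors rely on.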
When it comes to automated replication, as mentioned before, in the interest of data integrity you’ll want to ensure that no ongoing transactions (i.e. updates, inserts) on the source data warehouse are missed during the migration.
Database replication pointers
Migrations were traditionally run overnight or over a weekend to avoid interfering with ongoing work, but with today’s tools you can often simply set one up on a normal work day.
Continuing the example from above, if you’re using a tool like Fivetran, you’ll want to note these two points:
- Use log-based replication when possible: Assuming your database administrator can provision enough storage to hold transaction logs, use those logs for replication. This lets you avoid scanning tables, which is especially useful for avoiding locks on tables with a high rate of transactions, and it relies on a reliable, native mechanism originally built for primary-to-secondary database replication.
- Change management: Fivetran has two ways of recording changes as they propagate from source to target. The default is soft deletes, where Fivetran adds a boolean column indicating whether a row was deleted from the source. The alternative is History Mode, which keeps a record of every change made to any given row.
To confirm that your replication isn’t interfering with any ongoing processes on your source data warehouse, Metaplane offers both Volume and Freshness monitors that will alert your team to any sudden drops in volume or update frequency. Those are typically early indicators of data quality issues caused by broken or lagging processes that update your source data warehouse.
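At its core, a freshness check compares a table’s last update time against the interval you expect. Here’s a toy sketch of that idea under assumed names and a fixed threshold (production monitors like Metaplane’s learn the expected interval rather than hard-coding it):

```python
from datetime import datetime, timedelta

def is_stale(last_updated_at, expected_interval, now=None):
    """Return True if the table hasn't been updated within the expected window."""
    now = now or datetime.utcnow()
    return now - last_updated_at > expected_interval

# Example: a table expected to receive updates every 15 minutes.
now = datetime(2024, 1, 1, 12, 0)
fresh = datetime(2024, 1, 1, 11, 50)   # updated 10 minutes ago
stale = datetime(2024, 1, 1, 11, 30)   # updated 30 minutes ago

is_stale(fresh, timedelta(minutes=15), now)  # False: within the window
is_stale(stale, timedelta(minutes=15), now)  # True: replication may be lagging
```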
Data integrity safeguards
Fivetran already has safeguards built in, such as an idempotent replication process to avoid missing data, and it follows best practices like relying on native database transaction logs. Even so, particular scenarios, such as cloud provider or network outages, can still cause replication processes to stall. If you’re already using your new warehouse for any production use cases, you’ll want to be aware of any potential data quality issues so you can proactively alert stakeholders and build trust.
To mitigate risk in this scenario, we’d recommend not only using Metaplane’s data quality monitors on your previous data warehouse, but also layering additional monitor types on your new data warehouse. Here are a few that you might consider:
- Freshness: This will alert you to any delays in replication, creating stale data.
- Volume: If the normal change in volume (e.g. row count) drops for a table that has historically been updated regularly, there may be a replication issue.
- Numeric Distributions: Consider these for downstream (i.e. transformed) tables built from your raw datasets; they help identify errant values originating from your source database.
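A simple way to see what a volume monitor is reacting to is to compare the latest row-count change against its recent history. The sketch below uses a z-score over assumed daily deltas purely for illustration; real monitors use learned models rather than a hard-coded threshold:

```python
import statistics

def volume_anomaly(daily_row_deltas, latest_delta, z_threshold=3.0):
    """Flag the latest row-count change if it deviates sharply from history."""
    mean = statistics.mean(daily_row_deltas)
    stdev = statistics.stdev(daily_row_deltas)
    if stdev == 0:
        return latest_delta != mean
    return abs(latest_delta - mean) / stdev > z_threshold

# Hypothetical history of daily inserted-row counts for one table.
history = [1000, 1020, 980, 1010, 990, 1005]

volume_anomaly(history, 995)  # False: an ordinary day
volume_anomaly(history, 0)    # True: replication likely stalled
```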
Forward-looking data teams will want to place data quality monitors on both their raw data and any critical “cleaned” datasets that are heavily referenced in downstream business intelligence dashboards. Metaplane can help you do this with a combination of:
- Column-level lineage and usage analytics: Metaplane offers a full view of your data stack, down to individual fields. This means you’ll be able to infer which Fivetran connector is loading data into your new warehouse and which fields are used in your integrated business intelligence tool(s). Icons within the Metaplane platform also indicate which tables are most used in downstream flows, so you can identify which tables most need data quality monitors.
- Machine-learning based monitors: One of the biggest challenges in tracking update frequencies and other data quality metrics at the object level is maintenance as your processes change over time. For example, you may opt to move your Fivetran connectors from a 15-minute replication frequency to 5 minutes, which would require updating every single unit test. Metaplane was built with this in mind - we use your data to train and update our models for accurate alerting, every time.
If you’ve recently started, or will soon start, a cloud migration, you can set up a Fivetran account here. While you’re waiting for your initial sync, we’d recommend setting up a Metaplane account here, so that you’re ready to ensure data quality and build trust from day one!