Schema Changes
Whether your team is currently using one database for both transactions and analytics, or is centralizing several databases’ worth of data into your analytics lakehouse, it’s important to track schema changes, with distinct benefits for different types of changes.
- Changed data types: Ensure that existing queries, such as those applicable to numeric data types only, can still function.
- Deprecated fields and objects: Align your data strategy by deprecating dashboards that rely on those objects, or determine a replacement.
- Renamed fields and objects: Update queries with new field and object references.
- New fields and objects: Determine the value of new fields and objects to current or upcoming data projects.
Transactional Database Schema Change Monitoring
Transactional databases are used to store data for software application events, website activity, manufacturing progress, and more. As a result, in a high-growth company using data to drive the business forward, it’s common to see fields change, or new fields added altogether, as your teams want to track different aspects of their activities.
Tracking schema changes can also improve communication between groups and help them align on activities. For example, in a software application where the Product team dictates changes that an Engineering team implements, surfacing the list of schema changes can lead to better cohesion with the downstream Product Analytics team.
Examples of tracking schema changes
One method that works for many databases is to create a process (e.g. an event trigger or DDL trigger) that writes DDL statements to a table that you create. Here are examples of how you’d approach this for different databases:
SQL Server
Note: In SQL Server you can optionally enable change data capture instead, which captures both DML changes and DDL statements along with additional metadata. This will likely use slightly more storage and compute than creating your own trigger.
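The trigger below writes into a log table that you create up front. Here’s a minimal sketch in SQL Server syntax; the table and column names are assumptions chosen to match the trigger’s INSERT:

```sql
CREATE TABLE schema_change_log_table (
    log_time      DATETIME2 DEFAULT SYSDATETIME(),  -- when the change was captured
    object_name   NVARCHAR(255),                    -- object affected by the DDL
    operation     NVARCHAR(100),                    -- event type, e.g. ALTER_TABLE
    ddl_statement NVARCHAR(MAX)                     -- the full DDL text
);
```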
```sql
CREATE TRIGGER schema_change_trigger
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS
AS
BEGIN
    SET NOCOUNT ON;

    -- EVENTDATA() returns an XML document describing the DDL event that fired the trigger
    DECLARE @eventdata XML;
    SET @eventdata = EVENTDATA();

    -- Pull the object name, event type, and raw DDL text out of the event XML
    INSERT INTO schema_change_log_table (object_name, operation, ddl_statement)
    VALUES (
        @eventdata.value('(/EVENT_INSTANCE/ObjectName)[1]', 'NVARCHAR(255)'),
        @eventdata.value('(/EVENT_INSTANCE/EventType)[1]', 'NVARCHAR(100)'),
        @eventdata.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'NVARCHAR(MAX)')
    );
END;
```
You’ll also want to make sure the trigger is enabled on the database:
```sql
ENABLE TRIGGER schema_change_trigger ON DATABASE;
```
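To verify the trigger is firing, you could run any DDL statement and then check the log table (using the assumed names from above):

```sql
-- This CREATE TABLE should produce a new row in the log
CREATE TABLE dbo.schema_change_smoke_test (id INT);

SELECT * FROM schema_change_log_table;
```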
PostgreSQL
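The function below writes to a log table that you’d create first; here’s a sketch with column names matching the function’s INSERT (the table name is the same assumed schema_change_log_table):

```sql
CREATE TABLE IF NOT EXISTS schema_change_log_table (
    log_time        timestamptz,  -- when the change was captured
    schema_name     text,         -- schema containing the changed object
    object_identity text,         -- fully qualified object name
    operation       text,         -- command tag, e.g. ALTER TABLE
    ddl_statement   text          -- the statement that fired the event trigger
);
```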
```sql
CREATE OR REPLACE FUNCTION schema_change_function_name()
RETURNS event_trigger
LANGUAGE plpgsql
AS $$
DECLARE
    rec record;
BEGIN
    -- pg_event_trigger_ddl_commands() returns one row per DDL command
    -- in the statement that fired the event trigger
    FOR rec IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        -- Insert into a table that you create
        INSERT INTO schema_change_log_table (log_time, schema_name, object_identity, operation, ddl_statement)
        VALUES (current_timestamp, rec.schema_name, rec.object_identity, rec.command_tag, current_query());
    END LOOP;
END;
$$;
```

Here’s how you would set up the corresponding event trigger:

```sql
CREATE EVENT TRIGGER schema_change_trigger
ON ddl_command_end
EXECUTE FUNCTION schema_change_function_name();
```
MySQL
MySQL doesn’t support DDL (event) triggers the way SQL Server and PostgreSQL do, so you can’t create a trigger that fires on ALTER or CREATE statements. One workaround is to write the general query log to a table and periodically copy DDL statements from it into your own log table. Note that the general log records every statement, so weigh the overhead before enabling it in production; the schema_change_log_table columns below are assumptions.

```sql
-- Route the general query log to the mysql.general_log table
SET GLOBAL log_output = 'TABLE';
SET GLOBAL general_log = 'ON';

-- Periodically copy DDL statements into a log table that you create
INSERT INTO schema_change_log_table (log_time, ddl_statement)
SELECT event_time, CONVERT(argument USING utf8mb4)
FROM mysql.general_log
WHERE command_type = 'Query'
  AND (CONVERT(argument USING utf8mb4) LIKE 'ALTER %'
       OR CONVERT(argument USING utf8mb4) LIKE 'CREATE %'
       OR CONVERT(argument USING utf8mb4) LIKE 'DROP %'
       OR CONVERT(argument USING utf8mb4) LIKE 'RENAME %');
```
Warehouse Schema Change Monitoring
Because data is centralized within the same lakehouse, it’s common to reuse the same raw tables, and even models created by others with overlapping responsibilities. As a result, it’s also common for one schema change to create a ripple effect across multiple data products.
Sharing schema changes across an analytics team and subset of their key stakeholders can also unify the broader organization’s understanding of where and when schema changes occur, and how these changes impact various data products.
Schema Change Tracking Examples
As of July 25th, 2023, Snowflake, BigQuery, and Redshift don’t offer a single metadata table or explicit option that writes all schema changes to a physical table in your warehouse. As a result, you’ll want to track context about your schema (e.g. column counts in a given table) and cross-reference any schema changes with your query_history to determine what type of change occurred.
Snowflake
Here’s one example of a query that uses two important INFORMATION_SCHEMA views, TABLES and SCHEMATA, to provide table counts for each schema in a given database (though it can do much more!). You would then store the output in a table, ideally with a timestamp column indicating when you snapshotted this metadata.
```sql
WITH TABLE_COUNTS AS (
SELECT TABLE_CATALOG, TABLE_SCHEMA, COUNT(*) AS TABLE_COUNT
FROM <database_name>.INFORMATION_SCHEMA.TABLES
GROUP BY TABLE_CATALOG, TABLE_SCHEMA
)
SELECT
CATALOG_NAME AS DATABASE_NAME,
SCHEMA_NAME AS NAME,
TABLE_COUNT,
CREATED AS SOURCE_CREATED_AT,
COMMENT AS DESCRIPTION
FROM <database_name>.INFORMATION_SCHEMA.SCHEMATA as s
LEFT JOIN TABLE_COUNTS t
ON s.CATALOG_NAME = t.TABLE_CATALOG
AND s.SCHEMA_NAME = t.TABLE_SCHEMA
```
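To turn this into change tracking, you could append each run’s counts to a snapshot table and compare consecutive runs. A minimal sketch, assuming a schema_snapshot table you create for this purpose:

```sql
-- Assumed snapshot table; append one row per schema on each run
CREATE TABLE IF NOT EXISTS schema_snapshot (
    snapshotted_at TIMESTAMP_LTZ,
    database_name  VARCHAR,
    schema_name    VARCHAR,
    table_count    NUMBER
);

-- Append the current table counts with a snapshot timestamp
INSERT INTO schema_snapshot
SELECT CURRENT_TIMESTAMP(), TABLE_CATALOG, TABLE_SCHEMA, COUNT(*)
FROM <database_name>.INFORMATION_SCHEMA.TABLES
GROUP BY TABLE_CATALOG, TABLE_SCHEMA;
```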
One alternative way to look for schema changes is to use Snowflake’s Time Travel feature, which allows you to see what a table looked like at a given point in time. The challenge with Time Travel as a solution, however, is that the number of lookback days is limited by your pricing plan.
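For example, Time Travel lets you query a table as it was at a point in time and compare it against the current version; the table name and timestamps below are placeholders:

```sql
-- Query a table as of a specific timestamp (within your retention window)
SELECT *
FROM my_table AT(TIMESTAMP => '2023-07-01 00:00:00'::TIMESTAMP_LTZ);

-- Or relative to now, e.g. 24 hours ago (offset is in seconds)
SELECT *
FROM my_table AT(OFFSET => -60*60*24);
```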
BigQuery
You could create snapshots of schema context (e.g. table counts) using views from INFORMATION_SCHEMA, and track any changes made between snapshots. For example, to capture the column names and data types within a single table:
```sql
SELECT
column_name,
data_type
FROM
`your_project_id.your_dataset_id.INFORMATION_SCHEMA.COLUMNS`
WHERE
table_name = 'your_table_name';
```
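To track changes over time, you might append this output to a snapshot table on a schedule and diff consecutive snapshots; the schema_snapshot table below is an assumed name:

```sql
-- Assumed snapshot table living in the same dataset
CREATE TABLE IF NOT EXISTS `your_project_id.your_dataset_id.schema_snapshot` (
  snapshotted_at TIMESTAMP,
  table_name STRING,
  column_name STRING,
  data_type STRING
);

-- Append the current column metadata for the whole dataset
INSERT INTO `your_project_id.your_dataset_id.schema_snapshot`
SELECT CURRENT_TIMESTAMP(), table_name, column_name, data_type
FROM `your_project_id.your_dataset_id.INFORMATION_SCHEMA.COLUMNS`;
```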
An alternative approach would be to use BigQuery audit logs and look for ALTER TABLE statements.
Redshift
In Redshift, there are a few additional system tables that you can use (in lieu of information_schema), which you’ll find prefixed with PG_*, SYS_*, and SVV_*. Some tables of note will be:
- SYS_QUERY_HISTORY: This will give you your query history, with timestamps. If there are specific query patterns associated with your DDL statements (e.g. ALTER <table_name> RENAME <x>), you can explore parsing the query_text with that in mind (see the sketch after this list).
- SVV_TABLE_INFO: This will give you row counts and table size for your given table(s). You can use these counts and sizes as early indicators of when data issues may exist.
- PG_TABLE_DEF: You can use this table to understand what data types are associated with each column in any of your tables.
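As a sketch of the SYS_QUERY_HISTORY approach above, you could pull recent DDL statements and then parse query_text for patterns like renames (the 7-day window is illustrative):

```sql
-- Recent DDL statements; query_text may be truncated for very long statements
SELECT query_id, start_time, query_text
FROM SYS_QUERY_HISTORY
WHERE query_type = 'DDL'
  AND start_time > DATEADD(day, -7, GETDATE())
ORDER BY start_time DESC;
```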