3 ways to improve data sampling efficiency in Snowflake

Using conditional statements before you start sampling ensures you’re only working with the data that matters. Learn how to make your Snowflake data sampling queries more efficient with partitioned sampling and two different conditional statements.

Data @ Metaplane
February 28, 2024

In the last few years, you’ve probably heard the phrase “data is the new currency” at least once. And there's a lot of truth to that.

Like money, data has become incredibly valuable. Both businesses and individuals are doing everything they can to gather, analyze, and leverage data to their advantage. So how we handle our data can make a huge difference.

Think about it: the longer a query takes to execute, the more expensive it becomes. Not just in terms of compute resources, but also our most precious resource—time.

While it’s not much of a problem when your tables are small, as your tables grow, the cost, execution time, and iteration time of downstream tasks follow suit. That’s why writing efficient queries is just as important as writing queries that work.

That's where conditional statements come into play. In Snowflake in particular, conditional statements can drastically reduce the resources and time spent on data queries.

How do conditional statements in Snowflake make sampling queries more efficient?

Think of conditional statements as setting up smart filters for your data collection. These “filters” sort through all the data and pick out just the bits that are actually useful and relevant to the problem you’re trying to solve. Like if you're fishing and you only want to catch salmon, you'd ideally use a net that only lets salmon through and keeps all the other fish out (if such a thing existed).

Using conditional statements before you start sampling makes sure you’re only working with the data that matters, so you’re not wasting time and resources on data you don't need. And when you do take a representative sample, it's more likely to give you the information you need without hitting random data roadblocks (or worse, sampling bias/sampling error).

With that said, we’re starting with a fairly simple sampling technique. It’s not a conditional statement per se, but it’s a great way to wrangle your larger datasets more efficiently.

Option 1: Using Partitioned Tables with Sampling

Though more of a technique than a conditional statement, partitioning and sampling data in Snowflake is a great, easy way to enhance your query performance—especially if you’re using large datasets. Partitioning organizes the data based on certain keys or criteria, facilitating quicker access to relevant data segments and reducing the scope of data scans during queries. Essentially, it speeds up query execution by focusing only on pertinent data partitions.

Sampling after partitioning allows you to work with a smaller sample that still represents the larger whole. By analyzing a sample, you can infer patterns, trends, and insights without the overhead of processing the entire dataset, saving compute time and cost down the road.

To combine these steps into a single SQL query in Snowflake, you’d typically make sure your table is organized based on a key that’s relevant to your query patterns. But since Snowflake automatically manages micro-partitions and doesn’t allow manual partitioning like traditional databases, we’ll focus on using clustering keys to organize the data and then sampling it.
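
Snowflake does let you choose a clustering key, which is the closest analogue to manual partitioning. If your table isn’t already organized the way your queries filter it, a clustering key like the following can help; this is a minimal sketch, assuming the `sales_data` table used in the example below, and the exact key should match your own query patterns:

```sql
-- Cluster the table by region (and sale date) so Snowflake can prune
-- micro-partitions for region- and date-filtered queries.
-- Note: automatic reclustering runs in the background and consumes credits,
-- so weigh that cost against the expected query savings.
ALTER TABLE sales_data CLUSTER BY (region_id, TO_DATE(sale_date));
```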

Here’s what that looks like in practice:

Let's say we have a sales data table (`sales_data`), and we're interested in analyzing sales performance by region. We assume that the table is clustered by region_id to optimize performance for queries filtered by region. Now, we want to sample a subset of this data for a quick analysis using the SQL query below:

```sql
SELECT
  *
FROM
  sales_data TABLESAMPLE BERNOULLI (10) -- Sample approximately 10% of the table's rows
WHERE
  region_id = 'NorthAmerica' -- Assuming you're interested in North American sales data
  AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31';
```


In this case:

  • `WHERE region_id = 'NorthAmerica'` focuses the query on the North American sales data, using the table's clustering on `region_id` to improve performance.
  • `AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31'` further filters the data to only include sales from the year 2023.
  • `TABLESAMPLE BERNOULLI (10)` retrieves approximately 10% of the table’s rows at random. Note that the sample clause is applied to the table before the `WHERE` filter, so the conditions above narrow down the sampled rows. The BERNOULLI method provides a random, row-by-row sample of the data.

This query is designed to filter and sample the data efficiently based on the table’s organization (clustering by `region_id`): the clustering aligns the data with your query patterns so the region filter only scans the relevant micro-partitions, and the sample clause keeps the amount of data returned small.
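
If row-by-row BERNOULLI sampling is still slow on a very large table, Snowflake also supports block-level sampling and an optional seed for repeatable samples. Here’s a minimal variation of the query above; the percentage and seed are illustrative:

```sql
-- SYSTEM (block) sampling keeps or discards whole blocks of rows rather than
-- individual rows, which is typically faster on large tables but less
-- uniformly random than BERNOULLI. SEED makes the sample repeatable.
SELECT
  *
FROM
  sales_data SAMPLE SYSTEM (10) SEED (42)
WHERE
  region_id = 'NorthAmerica'
  AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31';
```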

While partitioning is great for speeding up searches targeting specific regions, if your data isn't neatly organized around the criteria you're using to partition it, or if your searches don't align well with how the data is split up, partitioning won't help much. For instance, if you need information that's spread across multiple partitions, or if your search conditions change a lot and don't match the partitioning scheme, you might not see the performance boost you were hoping for. If that’s the case, you might want to investigate conditional statements.

Option 2: Using `CASE` statements

Using `CASE` statements in your data sampling queries in Snowflake adds a layer of conditional logic to the sampling process, which is particularly useful when you want to apply different sampling rates or methods based on specific criteria within your data. 

For instance, you might want to sample more heavily in areas where your data is denser or more variable, and less so in more uniform areas. The `CASE` statement allows you to dynamically adjust the sampling rate or method based on the characteristics of the data (e.g. region, time period, or any other relevant dimension). 

To use `CASE` statements to analyze sales performance by region (like in the above sales_data table example), you can design a query that selects a sample of sales data based on certain conditions related to regions. Since Snowflake's SQL does not support using `TABLESAMPLE` directly within a `CASE` statement, you’ll have to use a workaround that involves filtering data in subqueries or using conditional logic to assign sample rates and then applying these rates in a subsequent operation.

Here's what this looks like in practice:

```sql
WITH region_sample AS (
  SELECT
    sale_id,
    region,
    sale_amount,
    CASE
      WHEN sale_amount < 1000 THEN 1
      WHEN sale_amount BETWEEN 1000 AND 10000 THEN 2
      WHEN sale_amount > 10000 THEN 3
    END AS sale_group
  FROM
    sales_data
),
sampled_data AS (
  SELECT
    *
  FROM
    region_sample
  WHERE
    -- UNIFORM(0, 1, RANDOM()) yields a float between 0 and 1;
    -- Snowflake's RANDOM() alone returns a 64-bit integer.
    (sale_group = 1 AND UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.05) -- For sales under $1000, sample ~5%
    OR (sale_group = 2 AND UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.1) -- For sales between $1000 and $10000, sample ~10%
    OR (sale_group = 3 AND UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.15) -- For sales over $10000, sample ~15%
)
SELECT
  *
FROM
  sampled_data;
```

In this case:

  • The `region_sample` CTE assigns a `sale_group` value to each row based on its `sale_amount`, so small, mid-sized, and large sales each land in a different group.
  • `sampled_data` then filters the `region_sample` data by applying a different random sampling condition to each `sale_group`. `UNIFORM(0::FLOAT, 1::FLOAT, RANDOM())` generates a random value between 0 and 1 for each row, and a row is kept when that value falls below its group’s threshold (e.g., 0.05 for a ~5% sample rate).
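
If you want to sanity-check the per-group sample sizes, you can swap the final `SELECT * FROM sampled_data` for a quick aggregation over the same CTEs:

```sql
-- Drop-in replacement for the final SELECT in the query above:
-- counts how many rows each sale_group contributed to the sample.
SELECT
  sale_group,
  COUNT(*) AS sampled_rows
FROM
  sampled_data
GROUP BY
  sale_group
ORDER BY
  sale_group;
```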

Rather than partitioning, this approach allows for nuanced sampling based on sale amount (or any other dimension you encode in the `CASE` logic). As a result, you get a much more targeted analysis of sales performance. But you also have to deal with increased complexity and reduced readability in your SQL queries. As you add more conditions and logic with `CASE` statements, the queries become harder to understand and maintain (which is especially true for teams where multiple analysts work on the same codebase). If this doesn’t work for your scenario, try using a `JOIN`.

Option 3: Using `JOIN` statements

Using `JOIN` statements with conditional logic allows you to sample data based on relationships between tables or within subsets of a single table. You can create a derived table or a Common Table Expression (CTE) that contains the specific conditions or subsets you care about, then join this derived table or CTE with the original (or another related) table and apply the sampling on this joined result set.

This method is particularly useful when the sampling criteria involve complex conditions or multiple tables.

Now, back to the `sales_data` table example from above. Let’s assume we have a related table, `regions`, that contains detailed information about each sales region, and suppose we want to sample sales data from the `NorthAmerica` region more efficiently by joining the `sales_data` table with the `regions` table.

This is what that SQL query looks like:

```sql
WITH RegionSales AS (
  SELECT
    sd.*
  FROM
    sales_data sd
    JOIN regions r ON sd.region_id = r.region_id
  WHERE
    r.region_name = 'NorthAmerica' -- Condition to filter sales data by region
    AND DATE(sd.sale_date) BETWEEN '2023-01-01' AND '2023-12-31'
)
SELECT
  *
FROM
  RegionSales TABLESAMPLE BERNOULLI (10);
```


In this case:

  • `RegionSales` creates a temporary result set that joins the `sales_data` table with the regions table. It filters the sales data to include only those records from the `NorthAmerica` region and within the specified date range ('2023-01-01' to '2023-12-31').
  • The `TABLESAMPLE BERNOULLI (10)` clause is applied to this filtered and joined dataset, sampling approximately 10% of the rows.

`JOIN`s are particularly advantageous when the sampling criteria involve complex conditions or multiple tables that are interconnected. Imagine trying to get a snapshot of data that’s spread across different tables, each with its own set of rules or relationships. `JOIN`s bring all that related information together first, so you can then apply your sampling logic to one combined dataset. This is super helpful when your analysis depends on understanding how different pieces of data relate to each other, like how customer profiles link to their purchase histories.

But keep in mind: while `JOIN`s are powerful for relating datasets, they’re not always the best choice if simplicity and performance are priorities. When you join tables, especially multiple or large ones, you increase the amount of data being processed before sampling can even occur, which requires more compute resources upfront and slows down query execution. Doing the `JOIN` after sampling will improve this efficiency slightly, but it won’t fix the problem entirely.
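
One way to claw back some of that efficiency is to sample the large table first and join only the sampled rows to the smaller lookup table. Here’s a sketch using the same hypothetical tables; note the trade-off that the sample is drawn from all regions before the region filter is applied:

```sql
-- Sample the large fact table first, then join the much smaller result
-- to the regions lookup table and filter it.
WITH sampled_sales AS (
  SELECT
    *
  FROM
    sales_data TABLESAMPLE BERNOULLI (10)
)
SELECT
  ss.*
FROM
  sampled_sales ss
  JOIN regions r ON ss.region_id = r.region_id
WHERE
  r.region_name = 'NorthAmerica'
  AND DATE(ss.sale_date) BETWEEN '2023-01-01' AND '2023-12-31';
```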

A better way

Metaplane makes it easy for users to configure how they want their data to be sampled in future Snowflake queries, with options like Time Windows and `WHERE` clauses. With Time Windows and “Include Data Since” options, users can configure lookback periods to only include their most recent data, and with `WHERE` clauses they can further restrict the data being queried by any dimension of their table.

As a bonus, if your goal is to ensure that new data is accurate, consider using date functions supported by your warehouse, such as `CURRENT_DATE()`, to only scan data created or updated from today onwards.
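
For instance, a minimal version of that filter against the `sales_data` table from the earlier examples (using `sale_date` as a stand-in for whatever created/updated timestamp your table tracks) might look like:

```sql
-- Only sample rows dated today or later, so the scan stays small as the table grows.
SELECT
  *
FROM
  sales_data TABLESAMPLE BERNOULLI (10)
WHERE
  DATE(sale_date) >= CURRENT_DATE();
```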

Want to see how Metaplane’s configurable monitors can make your Snowflake data sampling more efficient? Talk to us or start a free trial today.
