Document toolboxDocument toolbox

Setting up Snowflake Data Metrics Functions

This page will walk you through the setup of Data Quality rules using Snowflake Data Metric Functions.

 


Step 1) Understanding Data Metric Functions in Snowflake

Data Quality Functions (DMF) are Snowflake functions that monitor the state and integrity of your data. You can use DMFs to measure key metrics, such as, but not limited to, freshness and counts that measure duplicates, NULLs, rows, and unique values.

 

Before getting started with setting up DMFs we recommend you complete the related Snowflake DMF documentation and tutorial

https://docs.snowflake.com/en/user-guide/tutorials/data-quality-tutorial-start

https://docs.snowflake.com/en/user-guide/data-quality-intro


Step 2) Adding K compliant Data Metric Functions

K does NOT SUPPORT the out of the box DMFs (in the CORE schema).

K only uses user defined DMFs because K measures DQ results as a % of rows that meet the test condition.

K expects the DMF result to be a value between 0 to 100

Out of the box DMFs measure the number of rows that meet the test condition which does not provide sufficient context

Replacement DMFs for out of the box DMFs are provided in below

 

Log into Snowflake and ensure your user has sufficient permissions to create functions.

DMFs are stored in the database you have either selected, or specify in the CREATE statement. If you intend to store all of the DMFs in a centralised place, create a new DB and SCHEMA.

 

Run the following code to create a column Null Count DMF.

Update or remove the [db].[schema]. as required.

CREATE OR REPLACE DATA METRIC FUNCTION
[DB].[SCHEMA].null_col_count_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of column values that are null expressed as a percentage of total records'
AS
'SELECT COUNT_IF(ARG_C1 IS NULL)/(select count(1) from ARG_T) FROM ARG_T';

 

You can test the DMF by using the following code and adding in the target object

SELECT [DB].[SCHEMA].null_col_count_perc
(SELECT [COLUMN] from [DB].[SCHEMA].[TABLE]);

 

The result should be a value between 0 and 100.


Step 3) Adding the DMF to your column

Run the following code to attach the Null Count DMF to a column in a table

ALTER [TABLE/VIEW] [DB].[SCHEMA].[TABLE] add data metric function [DB].[SCHEMA].null_col_count_perc on ([COLUMN]);

 


Step 4) Scheduling your DMFs to run

DMFs runs are scheduled per table.

If a table has multiple DMFs attached to it, the schedule applied will run all DMFs attached to that table.

 

Run the following code to schedule the DMF to run at 5am every day (UTC). This needs to be set for every table you have attached DMFs to.

ALTER VIEW [DB].[SCHEMA].[TABLE] DATA_METRIC_SCHEDULE = 'USING CRON 0 5 * * * UTC';

 

 

After the scheduled DMF has run, you can check the results by running the following code

USE ROLE CATALOG_READ_ONLY;
SELECT * FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS WHERE METRIC_NAME ='NULL_COL_COUNT_PERC'

 


K compliant replacements for Snowflake Out of the box DMFs

Null count / Null percent (Completeness)

CREATE OR REPLACE DATA METRIC FUNCTION
[db].[schema].null_col_count_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of column values that are null expressed as a percentage of total records'
AS
'SELECT COUNT_IF(ARG_C1 IS NULL)/(select count(1) from ARG_T) FROM ARG_T';

 

Duplicate count (Accuracy)

CREATE OR REPLACE DATA METRIC FUNCTION
[db].[schema].duplicate_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of duplicate records expressed as a percentage of total unique records'
AS
$$
SELECT
(1-((select count(1) from (select ARG_C1 from ARG_T group by 1 having count(1)>1))
/(select count(1) from ARG_T)))*100
$$;

 

Unique count (Uniqueness)

CREATE OR REPLACE DATA METRIC FUNCTION
[db].[schema].unique_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of unique records expressed as a percentage of total records'
AS
'SELECT ((select count(distinct ARG_C1) from ARG_T)/(select count(1) from ARG_T))*100';


Useful Data Metrics Functions to implement

More DMFs for different dimensions or industries will be added over time.

Valid email (Accuracy)