This page will walk you through the setup of Data Quality rules using Snowflake Data Metric Functions.
Step 1) Understanding Data Metric Functions in Snowflake
Data Quality Functions (DMF) are Snowflake functions that monitor the state and integrity of your data. You can use DMFs to measure key metrics, such as, but not limited to, freshness and counts that measure duplicates, NULLs, rows, and unique values.
Before getting started with setting up DMFs we recommend you complete the related Snowflake DMF documentation and tutorial
https://docs.snowflake.com/en/user-guide/tutorials/data-quality-tutorial-start
https://docs.snowflake.com/en/user-guide/data-quality-intro
Step 2) Adding K compliant Data Metric Functions
K does NOT SUPPORT the out of the box DMFs (in the CORE schema).
K only uses user defined DMFs because K measures DQ results as a % of rows that meet the test condition.
K expects the DMF result to be a value between 0 to 100
Out of the box DMFs measure the number of rows that meet the test condition which does not provide sufficient context
Replacement DMFs for out of the box DMFs are provided in below
Log into Snowflake and ensure your user has sufficient permissions to create functions.
DMFs are stored in the database you have either selected, or specify in the CREATE statement. If you intend to store all of the DMFs in a centralised place, create a new DB and SCHEMA.
Run the following code to create a column Null Count DMF.
Update or remove the [db].[schema].
as required.
CREATE OR REPLACE DATA METRIC FUNCTION
[DB].[SCHEMA].null_col_count_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of column values that are null expressed as a percentage of total records'
AS
'SELECT COUNT_IF(ARG_C1 IS NULL)/(select count(1) from ARG_T) FROM ARG_T';
You can test the DMF by using the following code and adding in the target object
SELECT [DB].[SCHEMA].null_col_count_perc
(SELECT [COLUMN] from [DB].[SCHEMA].[TABLE])
;
The result should be a value between 0 and 100.
Step 3) Adding the DMF to your column
Run the following code to attach the Null Count DMF to a column in a table
ALTER [TABLE/VIEW] [DB].[SCHEMA].[TABLE] add data metric function [DB].[SCHEMA].null_col_count_perc on ([COLUMN]);
All DMFs will be detached from the table when the table is dropped/recreated.
A workaround is to attached the DMFs using a scheduled task.
Step 4) Scheduling your DMFs to run
DMFs runs are scheduled per table.
If a table has multiple DMFs attached to it, the schedule applied will run all DMFs attached to that table.
Run the following code to schedule the DMF to run at 5am every day (UTC). This needs to be set for every table you have attached DMFs to.
ALTER VIEW [DB].[SCHEMA].[TABLE] DATA_METRIC_SCHEDULE = 'USING CRON 0 5 * * * UTC';
https://crontab.guru/ is a useful page to use to configure CRON based schedules
After the scheduled DMF has run, you can check the results by running the following code
USE ROLE CATALOG_READ_ONLY;
SELECT * FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS WHERE METRIC_NAME ='NULL_COL_COUNT_PERC'
K compliant replacements for Snowflake Out of the box DMFs
Null count / Null percent (Completeness)
CREATE OR REPLACE DATA METRIC FUNCTION
[db].[schema].null_col_count_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of column values that are null expressed as a percentage of total records'
AS
'SELECT COUNT_IF(ARG_C1 IS NULL)/(select count(1) from ARG_T) FROM ARG_T';
Duplicate count (Accuracy)
CREATE OR REPLACE DATA METRIC FUNCTION
ces_governance.dmfs.duplicate_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of duplicate records expressed as a percentage of total unique records'
AS
$$
SELECT
(1-((select count(1) from (select ARG_C1 from ARG_T group by 1 having count(1)>1))
/(select count(1) from ARG_T)))*100
$$;
Unique count (Uniqueness)
CREATE OR REPLACE DATA METRIC FUNCTION
ces_governance.dmfs.unique_perc (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER(19,6)
COMMENT = 'Count of unique records expressed as a percentage of total records'
AS
'SELECT ((select count(distinct ARG_C1) from ARG_T)/(select count(1) from ARG_T))*100';
Useful Data Metrics Functions to implement
More DMFs for different dimensions or industries will be added over time.