Sensitive Data (e.g. PII) Scanner

Pre-requisites

Limitations

The PII Scanner can currently only be used with Snowflake, Redshift and Oracle Database.

The following scenarios will result in a FAILED scan status by the scanner.

  1. Unable to scan tables with case sensitive names, which is usually controlled by quoting in the SQL

  2. Unable to scan tables with special characters that break SQL format without quoting

  3. Unable to scan tables that are named after keywords which require quoting

  4. Unable to scan tables (inclusive of the schema and database names that the table belongs to) that contain a period (.) in the name.

  5. If the table contains a column that causes a data retrieval error

  6. If the view has issues executing the underlying Stored Proc or SQL


Step 1: Generate a scanner configuration

The scanner configuration is generated for all tables in a database or schema within a database source. A future enhancement (Q1 2024) will allow for additional custom table selection.

Log into K and go to Data Applications. Select Ask K.

Select the scanner tab. Go to the Scanner config and click on Run.


Select the source and tables you want to run the scanner on. You can select all the tables by database or schema.


Click Create to generate the scanner config to be used in the scanner.



Step 2: Install the PII Scanner Collector

The PII Scanner Collector is currently hosted in KADA’s Azure Blob Storage. Reach out to KADA Support (support@kada.ai) to obtain the collector package and receive a SAS token to access the repository.

Make sure that you’ve already set up the collector packages for the Sources (e.g. Snowflake) that you’d like to run the PII Scanner Collector on.


Step 3: Configure the Collector

Check to ensure that the following steps have been completed:

  • Installed the relevant Source Collector .whl

  • Installed any external dependencies described on the Source Collector page.

  • Installed the common library package kada_collectors_lib-<version>-py3-none-any.whl or higher

  • Installed the PII Scanner .whl (as per Step 2)

  • Created the Source Collector config json file as described on the Source Collector page

The following code is an example of how to run the extractor targeting Snowflake as the source. You may need to uplift this code to meet any code standards at your organisation.

This is the wrapper script: kada_pii_scanner.py

import csv
import argparse

from kada_collectors.extractors.utils import load_config, get_generic_logger
from kada_collectors.extractors.pii_scanner import PIIScanner, VALID_DEFAULT_DETECTORS

get_generic_logger('root')  # Set to use the root logger, you can change the context accordingly or define your own logger

parser = argparse.ArgumentParser(description='KADA PII Scanner.')
parser.add_argument('--extractor-config', '-e', dest='extractor_config', type=str, required=True,
                    help='Location of the extractor configuration json.')
parser.add_argument('--objects-file-path', '-f', dest='objects_file_path', type=str, required=True,
                    help='Location of the .txt file that contains the list of objects to scan for the source.')
parser.add_argument('--source-type', '-t', dest='source_type', type=str, required=True,
                    help='What kind of source are we scanning? E.g. snowflake, oracle etc. See documentation for the full list of supported source types.')
parser.add_argument('--sample-size', '-s', dest='sample_size', type=int, required=True,
                    help='How many rows are we sampling for each object, note that 0 means all rows will be sampled, number should be greater than or equal to 0.')
parser.add_argument('--parallel', '-p', dest='concurrency', type=int, default=1,
                    help='Should it be running in parallel, default is 1 meaning no parallelism, consider your CPU resources when setting this value.')
parser.add_argument('--default-detectors', '-d', dest='default_detectors', type=str,
                    help='Comma separated list of default detectors the scanner should use {}'.format(','.join(VALID_DEFAULT_DETECTORS)))
parser.add_argument('--delta', '-a', dest='delta', action='store_true',
                    help='Produces a DELTA extract file if you are doing a partial scan.')
parser.add_argument('--pii-output-path', '-o', dest='pii_output_path', type=str, required=True,
                    help='The output path to the folder where the extract should appear.')
args = parser.parse_args()

# ######
# Implement additional logic for checking existence if you wish.
# You may also choose to call the PIIScanner differently and not use an input file, this is completely up to you.
# You can also feed in from your own custom producer for the list of objects.
# This is simply the default out of the box implementation to call the PIIScanner and produce the required Extract File for K.
# ######

def read_validate_object_file(file_path):
    """Reads the flat file, validates the header and returns a list of objects.

    This is simply the out of the box way to feed the scanner, you may choose a different way to feed the scanner.
    """
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        header = next(reader)  # Skip the header
        if [x.upper() for x in header] != ['OBJECT_TYPE', 'OBJECT_ID']:
            # Should be a comma delimited flat file with the headers OBJECT_TYPE and OBJECT_ID
            raise Exception('Invalid object file')
        return [x for x in reader]  # Return a list, not an iterator, as the file will be closed

if __name__ == '__main__':  # Do not omit this guard as the class implements multiprocessing
    extractor_config = load_config(args.extractor_config)  # Load the corresponding collector config file
    object_list = read_validate_object_file(args.objects_file_path)  # 2D array of objects

    # You can define your own Detector classes and register them before calling the .scan() method to ensure
    # the scanner picks up the new Detector class, read the documentation on how to implement new classes.
    # You'll need to decorate the class with kada_collectors.extractors.pii_scanner.register_detector
    default_detectors = [x.strip() for x in args.default_detectors.split(',')] if args.default_detectors else []

    pii_scanner = PIIScanner(args.source_type, args.sample_size, args.concurrency, object_list, args.pii_output_path,
                             default_detectors=default_detectors, delta=args.delta, **extractor_config)
    pii_scanner.scan()

This can be executed anywhere provided the wheels are installed, and can be called by running:

python kada_pii_scanner.py -e ./kada_snowflake_extractor_config.json -f ./pii_test_scan.csv -t snowflake -s 10 -p 8 -o /tmp/output -d Email,AUPhoneNumber,CreditCard,AUTaxFileNumber,AUZipCode,AUDriversLicense,AUBSBNumber,AUBankAccountNumber,AUMedicare,AUPassport

The arguments are as follows. You can also view them by calling help on the default wrapper script:

python kada_pii_scanner.py --help

  • --extractor-config (-e): STRING, required. Location of the extractor configuration json. Example: /tmp/kada_snowflake_collector_config.json

  • --objects-file-path (-f): STRING, required. Location of the .txt file that contains the list of objects to scan for the source. See the reference section below for a sample file. This is applicable when using the default wrapper script. Example: /tmp/object_list.csv

  • --source-type (-t): STRING, required. What kind of source are we scanning? E.g. snowflake, oracle etc. See the reference section below for the full list of supported source types. Example: snowflake

  • --sample-size (-s): INTEGER, required. How many rows are sampled for each object. Note that 0 means all rows will be sampled; the number should be greater than or equal to 0. Example: 10

  • --parallel (-p): INTEGER, optional. Whether the scanner should run in parallel. The default is 1, meaning no parallelism; consider your CPU resources when setting this value. Example: 4

  • --default-detectors (-d): LIST<STRING>, comma separated. Mandatory unless custom detectors are defined. Lists the out of the box default detectors that should be run, as a comma separated list of detector names. See Advanced Usage for how to define your own Detectors rather than only using the out of the box ones. Example: Email,AUPhoneNumber,CreditCard,AUTaxFileNumber,AUZipCode,AUDriversLicense,AUBSBNumber,AUBankAccountNumber,AUMedicare,AUPassport,AUAddress

  • --delta (-a): FLAG, optional. If this flag is specified the collector will produce a DELTA file. Use this flag if you are not intending to scan everything that's loaded in K. Note that producing DELTAs means K won't know what has changed and will only upsert your extract. Example: N/A

  • --pii-output-path (-o): STRING, required. The path to the output folder for the PII extract. Note this folder will be cleared during the extraction process, so choose carefully. Example: /tmp/output



Step 4 (Optional): Defining your own Detectors

You may choose which default detectors you wish to use by passing in a list as the default_detectors value (see Controlling Scanner Inputs in the Advanced Usage section for the PIIScanner class definition).

Alternatively, you may choose to overwrite one of the out of the box detectors. In this case, make sure the class name for your detector is the same as the class name of the out of the box detector. To list the names of the out of the box detectors, you can inspect the registry.
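If you only need the names of the out of the box detectors, one minimal way to list them (assuming the VALID_DEFAULT_DETECTORS list imported by the wrapper script covers them; the registry itself may expose more detail) is:

from kada_collectors.extractors.pii_scanner import VALID_DEFAULT_DETECTORS

print(VALID_DEFAULT_DETECTORS)  # names of the out of the box default detectors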

To define your own detector you'll need to do the following before you instantiate the Scanner class.

Your class must:

  1. inherit the DatabaseDatumDetector class

  2. be decorated with register_detector so the Scanner class knows it exists. If you don't do this, the scanner will not pick it up. If you don't inherit the right class, registration will fail with an exception

  3. contain a detect method that takes a datum value (it is up to you to transform this datum value as it can be any type) and the column_name, and returns a list of matching PIIType instances, so your Detector can return multiple PIIType matches if needed

In some cases you may wish to package your detectors. In such scenarios you can continue to use the decorator method via a direct call, as sketched below.
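A minimal sketch of registering a packaged detector via a direct call rather than the decorator syntax; the my_company.detectors module and CustomDetector class are hypothetical:

from kada_collectors.extractors.pii_scanner import register_detector
from my_company.detectors import CustomDetector  # hypothetical packaged detector class

register_detector(CustomDetector)  # equivalent to decorating the class definition with @register_detector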


Example: Wrapper with defined Detector and PII Type

Below is an example of a wrapper with a defined Detector and PII Type using the register_detector decorator.
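The snippet below is a minimal sketch only: the import locations of DatabaseDatumDetector and PIIType, and the PIIType constructor, are assumptions, and EmployeeIdDetector with its EMP-###### pattern is purely illustrative. Check the collector package for the exact class definitions.

import re

from kada_collectors.extractors.utils import load_config, get_generic_logger
from kada_collectors.extractors.pii_scanner import (
    PIIScanner,
    register_detector,
    DatabaseDatumDetector,  # assumed import location
    PIIType,                # assumed import location
)

get_generic_logger('root')

@register_detector
class EmployeeIdDetector(DatabaseDatumDetector):
    """Illustrative detector that flags internal employee ids."""

    def detect(self, datum, column_name):
        # datum can be any type, so coerce it to a string before matching
        value = '' if datum is None else str(datum)
        matches = []
        if re.fullmatch(r'EMP-\d{6}', value) or 'employee_id' in column_name.lower():
            matches.append(PIIType('EmployeeId'))  # assumed constructor, adjust to the real PIIType definition
        return matches  # a detector may return multiple PIIType matches

if __name__ == '__main__':
    extractor_config = load_config('./kada_snowflake_extractor_config.json')  # illustrative config path
    object_list = [['TABLE', 'kada_snowflakecomputing_com.landing.fcs.lookup_codes']]

    # The custom detector is registered above, so the scanner will pick it up alongside the defaults listed here
    pii_scanner = PIIScanner('snowflake', 10, 1, object_list, '/tmp/output',
                             default_detectors=['Email'], **extractor_config)
    pii_scanner.scan()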


Advanced Usage

Controlling Scanner Inputs

You may have other processes that produce the object list, and you may want to handle the object list in your own way. In these scenarios, the scanner can be called as follows.


The PII Scanner is designed to instantiate a scanner per target object. Its parameters are:

  • source_type: what kind of source the scanner will be pointing to

  • rows: how many rows are sampled; 0 means all rows will be sampled

  • concurrency: whether the Scanner should run in parallel; 0 means single threaded

  • object_list: 2D array in the form of a list of OBJECT_TYPE and OBJECT_ID

  • default_detectors: list of default detectors to run; if this is empty no defaults will run

Any additional args and kwargs are passed to the Extractor class. More information can be found on the relevant Collector page for that particular source Extractor.
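A minimal sketch of calling the scanner directly with an object list produced by your own process, following the same positional arguments as the default wrapper script above; the config path and the second object id are illustrative:

from kada_collectors.extractors.utils import load_config
from kada_collectors.extractors.pii_scanner import PIIScanner

extractor_config = load_config('./kada_snowflake_extractor_config.json')  # illustrative config path

# Object list produced by your own process: a 2D array of [OBJECT_TYPE, OBJECT_ID]
object_list = [
    ['TABLE', 'kada_snowflakecomputing_com.landing.fcs.lookup_codes'],
    ['TABLE', 'kada_snowflakecomputing_com.landing.fcs.customers'],  # illustrative object id
]

pii_scanner = PIIScanner(
    'snowflake',    # source_type
    10,             # rows sampled per object, 0 samples all rows
    4,              # concurrency
    object_list,
    '/tmp/output',  # output path for the extract
    default_detectors=['Email', 'AUPhoneNumber', 'CreditCard'],
    delta=False,
    **extractor_config,
)
pii_scanner.scan()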

Controlling the Scan and Extract Process

In some situations you may wish to scan individual sets of objects but not produce an extract until the end of an orchestration. You can do so by passing in a blank object list to instantiate the class and explicitly calling scan_data yourself.

The object list is still the same 2D array; if you want to specify just one object, place a single element into the 2D array, as shown below.
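For example, a single-object list holds one [OBJECT_TYPE, OBJECT_ID] pair (the object id below is the sample used in the reference section):

object_list = [['TABLE', 'kada_snowflakecomputing_com.landing.fcs.lookup_codes']]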

The method returns another 2D array which tells you the object picked up from the scan and any associated PIITypes that were detected. If nothing is detected it returns an empty array.

In a non-empty payload, the UUID that appears (e.g. 498364cf-6310-35d0-94a5-89676d67d04b) is simply a UUID for K's internal purposes.

Once you have multiple payloads from multiple individual calls, you can combine all the payloads together into a single payload by concatenating the lists together.

You can then process the final payloads using scan_call_back and produce the extract by calling produce_extract, as sketched below.

This will produce the extract as if you called the scan() process itself.
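A minimal sketch of this flow, assuming scan_data accepts the same 2D object list and returns the payload array, and that scan_call_back takes the combined payloads; the exact method signatures are assumptions, so verify them against the package before use:

from kada_collectors.extractors.utils import load_config
from kada_collectors.extractors.pii_scanner import PIIScanner

extractor_config = load_config('./kada_snowflake_extractor_config.json')  # illustrative config path

# Instantiate with a blank object list, objects are fed in per call instead
pii_scanner = PIIScanner('snowflake', 10, 1, [], '/tmp/output',
                         default_detectors=['Email'], **extractor_config)

batches = [
    [['TABLE', 'kada_snowflakecomputing_com.landing.fcs.lookup_codes']],
    [['TABLE', 'kada_snowflakecomputing_com.landing.fcs.customers']],  # illustrative object id
]

final_payloads = []
for batch in batches:
    payload = pii_scanner.scan_data(batch)  # assumed signature, returns a 2D payload array
    final_payloads += payload               # concatenate the individual payloads

pii_scanner.scan_call_back(final_payloads)  # assumed signature, process the combined payloads
pii_scanner.produce_extract()               # write the extract as if scan() had been called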

Hardcoding the Payloads to the Extract Process

Sometimes you will already know PII data exists and there is no requirement to run a PII scan. In these scenarios, you can manually inject your own payload for the known PII objects when constructing the final_payloads.

When hardcoding a payload, check to ensure you have replicated the payload correctly:

  • Ensure that the detected PII matches one of the PIIType names

  • Ensure the object type is correct

  • Ensure that the object id is correct; for TABLE objects K follows a period delimited string format
    <host name>.<database name>.<schema name>.<table name>
    You should reference the object list or other produced outputs for examples of the format. Note that if any of the parts contains a period in its name, the period is transformed into an underscore (_), e.g. a table named KADA.AI becomes KADA_AI so it does not break the period delimited format (see the sketch below). Ask the KADA team if you are still unsure of what to do.
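A minimal sketch of building a TABLE object id in the period delimited format described above; the helper name and sample part values are illustrative:

def build_table_object_id(host, database, schema, table):
    # Replace any periods within a part with underscores so they do not
    # interfere with the <host>.<database>.<schema>.<table> format
    parts = [host, database, schema, table]
    return '.'.join(part.replace('.', '_') for part in parts)

print(build_table_object_id('kada.snowflakecomputing.com', 'landing', 'fcs', 'lookup_codes'))
# kada_snowflakecomputing_com.landing.fcs.lookup_codes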


References

Supported Data Sources

List of supported data sources

  1. Snowflake

  2. Oracle

  3. Redshift

Object List File

When using the default wrapper script you'll need to pass in an object list file. The file spec is as follows:

  1. A flat file that is comma separated; it can have any extension provided it is text readable, e.g. .txt, .csv

  2. UTF-8 encoded

  3. Contains the following headers

  • OBJECT_TYPE: STRING, required. The object type being scanned; currently only TABLE is supported. Example: TABLE

  • OBJECT_ID: STRING, required. The ID of the object being scanned. For TABLEs this is a 4 part ID in the form <host>.<database>.<schema>.<table>, where the value of host should match the value of the host onboarded in K. Each part of the ID should have any periods (.) replaced by an underscore (_) so they do not interfere with the ID format. Example: kada_snowflakecomputing_com.landing.fcs.lookup_codes
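A sample object list file (the second row is illustrative only):

OBJECT_TYPE,OBJECT_ID
TABLE,kada_snowflakecomputing_com.landing.fcs.lookup_codes
TABLE,kada_snowflakecomputing_com.landing.fcs.customers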