
Setup Infrastructure

Kubernetes deployment

  1. Platform:
    Set up a Kubernetes cluster on any cloud provider (AWS, Azure, Google Cloud) or an on-premise solution (e.g. OpenShift).

  2. Specification:
    The Kubernetes cluster requires a minimum of 4 nodes, with each node having 4 CPUs and 16GB memory.

  3. Storage:

    1. Set up an object store such as AWS S3, Azure Blob Storage, etc.

      1. Setup: s3 bucket setup

    2. ~200GB storage to be mounted into the Kubernetes cluster. In cloud setups the Kubernetes services will automatically provision this.

    3. Where the organisation defines its own PV definitions (e.g. OpenShift), set the Reclaim Policy to Retain. This is important to ensure no data is lost during a prolonged outage at the Kubernetes layer (see the example after this list).

  4. Networking:

    1. Firewall rules may be required to enable access to HTTPS (443)

    2. You may choose to use your own Kubernetes ingress services or use the one provided by KADA's configuration scripts.
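
For reference, on platforms where you manage your own PV definitions, the Reclaim Policy from the storage step above can be checked and set with kubectl. This is a minimal sketch only; the volume name kada-pv is a placeholder for your actual PersistentVolume.

Code Block
# Check the reclaim policy of the provisioned volumes
kubectl get pv

# Patch an existing PersistentVolume to Retain (kada-pv is a placeholder name)
kubectl patch pv kada-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'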

...

  1. This setup requires a single machine with a minimum spec of 16 CPUs, 64GB memory and 200GB storage.

  2. Install docker: https://docs.docker.com/engine/install/

  3. Install docker compose: https://docs.docker.com/compose/install/
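
Once installed, it is worth confirming both tools are available before proceeding. A minimal check (exact output varies by version; older installations expose Compose as docker-compose):

Code Block
# Confirm the Docker engine is installed and the daemon is reachable
docker --version
docker info

# Confirm Docker Compose is installed (older setups use: docker-compose --version)
docker compose version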

Configuring Access to KADA Image Repository

The KADA installation requires access to the KADA image repository.

...

Code Block
docker login -u _json_key --password-stdin https://asia.gcr.io < /tmp/kada-client-[code]-key.json

# Run the following to test connectivity
docker pull busybox:1.28
docker pull asia.gcr.io/kada-external/postgres:1.7.0-pg11

KADA Platform Installation

  1. KADA is packaged as a set of configuration files.

  2. Download the latest package.

...

  1. Edit kada_docker_compose.env and set the following values:

    Code Block
    KADA_ROOT_DATA_DIR=
    KADA_DOCKER_NETWORK=
    KADA_URL=
  2. In conf/kada-realm.json, replace DOMAIN_URL with the base URL of your installation, e.g. https://example.com

  3. Generate CA certificates based on the domain name of the host. In conf/, rename and replace cortex.crt and cortex.key with your generated certificates (an example sketch follows these steps).

  4. Deploy the environment

    Code Block
    ./kada_docker_compose.sh setup
    
    #Create storage_root_folder folder (default name is kada-data)
    cd ${KADA_ROOT_DATA_DIR}/cerebrum/data
    mkdir kada-data 
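
If your organisation does not issue certificates, a self-signed certificate can be generated for testing step 3 above. This is a minimal sketch only; kada.example.com is a placeholder for your domain, and production deployments should use certificates issued by your CA.

Code Block
# Generate a self-signed certificate for testing (replace kada.example.com with your domain)
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout cortex.key -out cortex.crt \
  -subj "/CN=kada.example.com" \
  -addext "subjectAltName=DNS:kada.example.com"

# Place the generated files in the package conf/ directory
cp cortex.key cortex.crt conf/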

KADA Platform Configuration

  1. Platform Settings
    On the bottom left of the screen, click the GEAR icon and select Platform Settings.

    Then set the following properties depending on your deployment:

    Code Block
    storage_type = [LOCAL, S3, AZURE]
    
    # For Azure blob storage
    root_folder = <azure storage container name>
    storage_azure_access_key
    storage_azure_storage_account
    
    # For aws S3 configuration
    root_folder = <s3 bucket name>
    storage_aws_region 
    storage_aws_access_key
    storage_aws_secret_access_key
    
    # Set this for LOCAL storage
    storage_local_root_directory
    
    # Set this if connecting to sources directly
    enable_connection_test = true
  2. Integrating sources to KADA
    KADA needs to be configured for each source that you want to integrate. Setup can be configured via the KADA front end. See How to: Onboard a new source.

  3. KADA Platform Initial load

    1. Setup the following Platform Setting values for initial load

      Code Block
      celery_batch_task_soft_time_limit = 0
      celery_batch_task_time_limit = 0
      metric_window = 30
    2. KADA provides a built-in Batch Manager for triggering the loading of sources.

    3. See https://kadaai.atlassian.net/wiki/spaces/KS/pages/675708946/How+to%3A+Onboard+a+new+source#4.-Manually-Triggering-Source-loads

    4. Once the sources have been loaded, manually trigger the following platform jobs. See https://kadaai.atlassian.net/wiki/spaces/KS/pages/1740931226/KADA+Batch+Manager#Manually-triggering-a-Platform-job
      1. GATHER_METRICS_AND_STATS
      2. POST_PROCESS_QUERIES
      3. DAILY

  4. Schedule sources to load.
    KADA provides a scheduler to periodically load the sources you have configured.
    Set the following Platform Setting value to enable the scheduler to run.

    Code Block
    enable_platform_batch = true

    Each Source can now be scheduled to run. See https://kadaai.atlassian.net/wiki/spaces/KS/pages/675708946/How+to%3A+Onboard+a+new+source#3.-Scheduling-a-Source

Upgrading KADA

KADA generally releases new updates each month. See our Release versions page for the latest available version.

To check your version, see How to: Check the version of K platform.

...

Code Block
# Download new package
# Transfer package over to kada server under /kada

# unzip package
cd /kada
tar -xvf kada_5.5.0_docker_compose.tar.gz
cd /kada/d_pkg_5.5.0

# From the current version directory copy over the following config
cp /kada/d_pkg_5.4.0/conf/cortex.key /kada/d_pkg_5.5.0/conf/cortex.key
cp /kada/d_pkg_5.4.0/conf/cortex.crt /kada/d_pkg_5.5.0/conf/cortex.crt
cp /kada/d_pkg_5.4.0/conf/kada-realm.json /kada/d_pkg_5.5.0/conf/kada-realm.json

# Update these variables in the newer version's kada_docker_compose.env
KADA_ROOT_DATA_DIR=
KADA_DOCKER_NETWORK=
KADA_URL=

# Then restart the environment
./kada_docker_compose_setup.sh update yes-i-can-proceed

KADA Integrations

1. Updating Kubernetes Configs for Source Onboarding

Some sources in KADA require additional configuration to establish connectivity. This section details the additional configuration steps per integration source.

...

Code Block
# Upload edited config
kubectl apply -f path/to/config/file.yaml

# Get service details
kubectl get pods

# Restart cerebrum services. 
# Replace the XXXXXXXXXX-XXXXX with the ID in your environment
kubectl delete pod cerebrum-deployment-XXXXXXXXXX-XXXXX 
kubectl delete pod cerebrum-celery-batch-worker-deployment-XXXXXXXXXX-XXXXX 
kubectl delete pod cerebrum-celery-worker-deployment-XXXXXXXXXX-XXXXX 

1.1. Teradata

Teradata uses ODBC, so an update is required to the cerebrum-odbc-ini.yaml file.

...

  1. Read access to the DBC and PDCRINFO schemas.

  2. Specifically these tables:

    1. PDCRINFO.DBQLogTbl, otherwise DBC.DBQLogTbl

    2. PDCRINFO.DBQLSqlTbl, otherwise DBC.DBQLSqlTbl

    3. DBC.TABLESV

    4. DBC.INDICESV

    5. DBC.ALL_RI_CHILDRENV
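
For illustration only, the read access above could be granted to the KADA service account along the following lines. kada_user is a placeholder; adjust the grants to your site's security standards.

Code Block
-- Illustrative Teradata grants (kada_user is a placeholder for the KADA service account)
GRANT SELECT ON DBC TO kada_user;
GRANT SELECT ON PDCRINFO TO kada_user;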

1.2. SQLServer 2012+

SQL Server uses ODBC, so an update is required to the cerebrum-odbc-ini.yaml file.

...

Code Block
--Query To Create Extended Events Session
CREATE EVENT SESSION [KADA] ON SERVER ADD EVENT sqlserver.sp_statement_completed (
	ACTION(package0.collect_system_time, package0.event_sequence, sqlos.task_time, sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_id, sqlserver.database_name, sqlserver.nt_username, sqlserver.query_hash, sqlserver.server_instance_name, sqlserver.server_principal_name, sqlserver.server_principal_sid, sqlserver.session_id, sqlserver.session_nt_username, sqlserver.transaction_id, sqlserver.username) WHERE (
		(
			[statement] LIKE '%CREATE %'
			OR [statement] LIKE '%DROP %'
			OR [statement] LIKE '%MERGE %'
			OR [statement] LIKE '%FROM %'
			)
		AND [sqlserver].[server_principal_name] <> N'USERS_TO_EXCLUDE'
		AND [sqlserver].[is_system] = (0)
		AND NOT [statement] LIKE 'Insert into % Values %'
		AND [sqlserver].[Query_hash] <> (0)
		)
	), ADD EVENT sqlserver.sql_statement_completed (
	SET collect_statement = (1) ACTION(package0.collect_system_time, package0.event_sequence, sqlos.task_time, sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_id, sqlserver.database_name, sqlserver.nt_username, sqlserver.query_hash, sqlserver.server_instance_name, sqlserver.server_principal_name, sqlserver.session_id, sqlserver.session_nt_username, sqlserver.transaction_id, sqlserver.username) WHERE (
		(
			[statement] LIKE '%CREATE %'
			OR [statement] LIKE '%DROP %'
			OR [statement] LIKE '%MERGE %'
			OR [statement] LIKE '%FROM %'
			)
		AND [sqlserver].[server_principal_name] <> N'USERS_TO_EXCLUDE'
		AND [sqlserver].[is_system] = (0)
		AND NOT [statement] LIKE 'Insert into % Values %'
		AND [sqlserver].[Query_hash] <> (0)
		)
	) ADD TARGET package0.event_file (SET filename = N'G:\extended events\Extendedevents.xel', max_file_size = (20), max_rollover_files = (100))
	WITH (MAX_MEMORY = 4096 KB, EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS, MAX_DISPATCH_LATENCY = 30 SECONDS, MAX_EVENT_SIZE = 0 KB, MEMORY_PARTITION_MODE = NONE, TRACK_CAUSALITY = ON, STARTUP_STATE = ON)
GO
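
-- Note (suggested step, not part of the original script): STARTUP_STATE = ON only
-- starts the session at the next server restart. To begin capturing immediately,
-- start the session manually:
ALTER EVENT SESSION [KADA] ON SERVER STATE = START;
GO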

-- Check if the session is dropping events and see other data about the session
-- https://sqlperformance.com/2019/10/extended-events/understanding-event-loss-with-extended-events
SELECT
   s.name, 
   s.total_regular_buffers,
   s.regular_buffer_size,
   s.total_large_buffers,
   s.large_buffer_size,
   s.dropped_event_count,
   s.dropped_buffer_count,
   s.largest_event_dropped_size
FROM sys.dm_xe_sessions AS s;

-- Also check log growth rate. Apply filters to remove noise.
-- some filters:
-- [sqlserver].[server_principal_name] = N'name of principal'
-- [sqlserver].[is_system] = (0)
-- [sqlserver].[client_app_name] = N'name of app'

-- NB grouping by XML is not supported in SQLServer. Need to use an external tool for this.

1.3. Oracle Cloud and Oracle Analytics

Requires an Oracle wallet and the following items to be updated in the cerebrum-oic.yaml file.

...

  1. Read access to the SYS.DBA_* tables

  2. Specifically the following tables:

    1. SYS.DBA_PROCEDURES

    2. SYS.DBA_VIEWS

    3. SYS.DBA_MVIEWS

    4. SYS.DBA_CONSTRAINTS

    5. SYS.DBA_CONS_COLUMNS

    6. SYS.DBA_TAB_COLUMNS

    7. SYS.DBA_HIST_ACTIVE_SESS_HISTORY

    8. SYS.DBA_HIST_SNAPSHOT

    9. SYS.DBA_USERS

    10. SYS.DBA_HIST_SQLTEXT
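
For illustration only, read access to these dictionary views could be granted to the KADA service account as sketched below. kada_user is a placeholder; whether to use per-object grants or SELECT ANY DICTIONARY depends on your DBA standards.

Code Block
-- Illustrative Oracle grants (kada_user is a placeholder for the KADA service account)
GRANT SELECT ON SYS.DBA_PROCEDURES TO kada_user;
GRANT SELECT ON SYS.DBA_VIEWS TO kada_user;
GRANT SELECT ON SYS.DBA_HIST_SQLTEXT TO kada_user;
-- ...repeat for the remaining views listed above, or grant SELECT ANY DICTIONARY
-- if your security policy allows it.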

1.4. Snowflake

No additional configuration is required. Snowflake uses a native Python driver.

...

  1. Read access to the SNOWFLAKE.ACCOUNT_USAGE schema

  2. User must have access to role: ACCOUNTADMIN

    1. Alternatively, grant access to another role (see the example below): https://docs.snowflake.com/en/sql-reference/account-usage.html#enabling-account-usage-for-other-roles
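
As a sketch of the alternative described in the linked Snowflake documentation, imported privileges on the SNOWFLAKE database can be granted to a dedicated role. KADA_ROLE is a placeholder for the role used by the KADA service account.

Code Block
-- Grant access to SNOWFLAKE.ACCOUNT_USAGE to a custom role (KADA_ROLE is a placeholder)
USE ROLE ACCOUNTADMIN;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE KADA_ROLE;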

1.5. Tableau

Permissions

  1. A read user for the Tableau workgroup PostgreSQL (repository) database

  2. Tableau enabled for Metadata API

    1. https://help.tableau.com/current/api/metadata_api/en-us/docs/meta_api_start.html#enable-the-tableau-metadata-api-for-tableau-server

  3. Tableau Server enabled for repository access (see the example command after this list)

    1. https://help.tableau.com/current/server/en-us/perf_collect_server_repo.htm

  4. Create a Tableau Server user with one of the following access roles: Site Administrator Creator or Server Administrator.
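
The repository access step above is enabled with TSM on the Tableau Server node. A sketch based on the linked Tableau documentation; replace the password placeholder with your own value.

Code Block
# Enable read access to the Tableau Server repository for the readonly user
# (note: this may restart Tableau Server; schedule accordingly)
tsm data-access repository-access enable --repository-username readonly --repository-password <PASSWORD>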

1.6. Informatica (coming soon)

Permissions

Read access to all Informatica repository tables.

1.7. DBT (coming soon)

Permissions

Read access to a location that contains the manifest.json and catalog.json files for each dbt project.