About Collectors
...
Collector Server Minimum Requirements
...
Step 3: Getting Access to the Source Landing Directory
...
The collector requires a set of parameters to connect to and extract metadata from ByteHouse.
The ByteHouse collector only extracts metadata; it does not extract or process query usage on the database.
FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE
---|---|---|---
api_key | string | API key to log into ByteHouse | "myapikey"
password | string | Password to log into ByteHouse | "password"
server | string | ByteHouse gateway; these are regionally specific, see https://docs.byteplus.com/en/docs/bytehouse/docs-supported-regions-and-providers | "gateway.aws-ap-southeast-1.bytehouse.cloud"
port | integer | The port to connect to the ByteHouse gateway, generally this is 19000 | 19000
host | string | The onboarded host in K for the ByteHouse Source. This is generally the gateway address, but we suggest adding a differentiator in case you have multiple ByteHouse accounts | "gateway.aws-ap-southeast-1.bytehouse.cloud"
tenant_account_id | string | This value can be found in the ByteHouse console under the Tenant Management tab and Basic Information. This is NOT the Login Account Id | "123456778"
meta_only | boolean | Currently we only support metadata-only extraction, so this must be true | true
output_path | string | Absolute path to the output location where files are to be written | "/tmp/output"
mask | boolean | To enable masking or not | true
compress | boolean | To enable compression to .csv.gz or not | true
timeout | integer | Timeout setting for sending and receiving data, in seconds; this normally defaults to 80000 | 80000
These parameters can be added directly into the run, or you can pass them in via a JSON file. The following is an example, referenced by the example run code below.
kada_bytehouse_extractor_config.json

```json
{
    "api_key": "",
    "password": "",
    "server": "",
    "port": 19000,
    "tenant_account_id": "",
    "host": "",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true,
    "meta_only": true,
    "timeout": 80000
}
```
...
This can be executed in any Python environment where the whl has been installed. It will produce and read a high water mark file called bytehouse_hwm.txt in the same directory as the execution, and produce files according to the configuration JSON.
This is the wrapper script: kada_bytehouse_extractor.py

```python
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.bytehouse import Extractor

get_generic_logger('root')  # Set to use the root logger; you can change the context accordingly or define your own logger

_type = 'bytehouse'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Bytehouse Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename, help='Location of the configuration json, default is the config json in the same directory as the script.')
parser.add_argument('--name', '-n', dest='name', default=_type, help='Name of the collector instance.')
args = parser.parse_args()

start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))

ext.test_connection()

ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

publish_hwm(_type, end_hwm)
```
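For example, assuming the whl has been installed in the active Python environment and the config JSON sits in the same directory as the script, a run is simply `python kada_bytehouse_extractor.py`. Both flags are optional: `--config` defaults to the config JSON next to the script, and `--name` defaults to bytehouse, so you only need to pass them when you store the config elsewhere or run multiple collector instances.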
...
```python
class Extractor(
    api_key: str = None,
    password: str = None,
    server: str = None,
    port: int = 8443,
    tenant_account_id: str = None,
    host: str = None,
    output_path: str = './output',
    mask: bool = False,
    compress: bool = False,
    meta_only: bool = False,
    timeout: int = 80000
) -> None
```
api_key: API key to sign into the Bytehouse server
password: password to sign into the Bytehouse server
server: Bytehouse gateway host or address for the connection
port: Bytehouse server port for the connection, default is 8443
tenant_account_id: The Bytehouse tenant account ID; this can be found in the console under Tenant Management
host: The Bytehouse host or address name; this should be the name you onboarded or will onboard into K with, generally this is the same as the connection server
output_path: full or relative path to where the outputs should go
mask: To mask the META/DATABASE_LOG files or not
compress: To gzip output files or not
meta_only: To extract metadata only
timeout: The timeout for the send/receive connection, default is 80000
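If you prefer not to use a configuration file, the Extractor can also be instantiated directly with keyword arguments. The following is a minimal sketch only; the credential, gateway, and tenant values are placeholders taken from the examples above.

```python
# Minimal sketch: run the ByteHouse extractor with inline parameters
# instead of a config JSON. All values below are placeholders.
from kada_collectors.extractors.utils import get_hwm, publish_hwm
from kada_collectors.extractors.bytehouse import Extractor

start_hwm, end_hwm = get_hwm('bytehouse')  # read the stored high water mark for this instance

ext = Extractor(
    api_key='myapikey',                                   # placeholder credential
    password='password',
    server='gateway.aws-ap-southeast-1.bytehouse.cloud',  # regional ByteHouse gateway
    port=19000,
    tenant_account_id='123456778',                        # Tenant Management > Basic Information
    host='gateway.aws-ap-southeast-1.bytehouse.cloud',    # onboarded host name in K
    output_path='/tmp/output',
    mask=True,
    compress=True,
    meta_only=True,                                       # only metadata extraction is supported
    timeout=80000,
)

ext.test_connection()
ext.run(start_hwm=start_hwm, end_hwm=end_hwm)
publish_hwm('bytehouse', end_hwm)  # writes bytehouse_hwm.txt for the next run
```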
...
A high water mark file called bytehouse_hwm.txt is created in the same directory as the execution, and files are produced according to the configuration JSON. The high water mark file is only produced if you call the publish_hwm method.
...
Example: Using Airflow to orchestrate the Extract and Push to K
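As a rough illustration only, the following sketch shows one way to run the extract as an Airflow task. The DAG id, schedule, and file path are placeholders, extract_bytehouse is hypothetical glue that mirrors the wrapper script above, and the push-to-K step is left as a comment since its API is not covered on this page.

```python
# Sketch only, not the official K integration: orchestrate the ByteHouse
# extract with Airflow. extract_bytehouse is hypothetical glue around the
# wrapper logic shown earlier; DAG id, schedule, and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm
from kada_collectors.extractors.bytehouse import Extractor


def extract_bytehouse(config_path: str, name: str = 'bytehouse') -> None:
    """Run the ByteHouse extractor between the stored high water marks."""
    start_hwm, end_hwm = get_hwm(name)
    ext = Extractor(**load_config(config_path))
    ext.test_connection()
    ext.run(start_hwm=start_hwm, end_hwm=end_hwm)
    publish_hwm(name, end_hwm)


with DAG(
    dag_id='kada_bytehouse_collector',  # placeholder DAG id
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={'retries': 1, 'retry_delay': timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id='extract_bytehouse',
        python_callable=extract_bytehouse,
        op_kwargs={'config_path': '/opt/kada/kada_bytehouse_extractor_config.json'},  # placeholder path
    )
    # A second task would push the files written to output_path into the
    # K landing directory, e.g. extract >> push_to_k
```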
...