About Collectors

Insert excerpt
KSL:Collector Method
nameabout

...

Collector Server Minimum Requirements

Insert excerpt
KSL:Collector Method
nameCollectorServerSpec
nopaneltrue

...

Step 3: Getting Access to the Source Landing Directory

Insert excerpt
KSL:Collector Method
namelanding

...

The collector requires a set of parameters to connect to and extract metadata from ByteHouse.

The ByteHouse collector only extracts metadata and does not extract or process query usage on the database.

| FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE |
| --- | --- | --- | --- |
| api_key | string | The API key for ByteHouse; you can generate one via the Console | “xasdaxcv” |
| password | string | Password to log into ByteHouse | “password” |
| server | string | ByteHouse gateway; these are regionally specific, see https://docs.byteplus.com/en/docs/bytehouse/docs-supported-regions-and-providers | “gateway.aws-ap-southeast-1.bytehouse.cloud” |
| port | integer | The port to connect to the ByteHouse instance; generally this is 19000 | 19000 |
| host | string | The onboarded host in K for the ByteHouse source. This is generally the gateway address, but we suggest you add a differentiator in case you have multiple ByteHouse accounts | “gateway.aws-ap-southeast-1.bytehouse.cloud” |
| tenant_account_id | string | This value can be found in the ByteHouse console under the Tenant Management tab, Basic Information. This is NOT the Login Account Id | “123456778” |
| meta_only | boolean | Currently we only support meta only as true | true |
| output_path | string | Absolute path to the output location where files are to be written | “/tmp/output” |
| mask | boolean | To enable masking or not | true |
| compress | boolean | To enable compression to .csv.gz or not | true |
| timeout | integer | Timeout setting for sending and receiving data in seconds; normally defaulted to 80000 | 80000 |

These parameters can be added directly into the run, or you can pass them in via a JSON file. The following is an example you can use; it is referenced in the example run code below.

kada_bytehouse_extractor_config.json

Code Block
{
    "api_key": "",
    "password": "",
    "server": "",
    "port": 19000,
    "tenant_account_id": "",
    "host": "",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true,
    "meta_only": true,
    "timeout": 80000
}
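Before handing the JSON to the extractor, it can help to sanity-check the fields against the types in the table above. A minimal sketch follows; the `validate_config` helper is hypothetical and not part of kada_collectors:

```python
import json

# Expected field types, mirroring the parameter table above.
# This checker is illustrative only and not part of the collector.
EXPECTED_TYPES = {
    "api_key": str, "password": str, "server": str, "port": int,
    "tenant_account_id": str, "host": str, "output_path": str,
    "mask": bool, "compress": bool, "meta_only": bool, "timeout": int,
}

def validate_config(config):
    """Return a list of problems found in the config dict (empty if OK)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in config:
            problems.append("missing field: " + field)
        elif not isinstance(config[field], expected):
            problems.append(field + ": expected " + expected.__name__)
    return problems

sample = json.loads('''{
    "api_key": "xasdaxcv", "password": "password",
    "server": "gateway.aws-ap-southeast-1.bytehouse.cloud", "port": 19000,
    "tenant_account_id": "123456778",
    "host": "gateway.aws-ap-southeast-1.bytehouse.cloud",
    "output_path": "/tmp/output", "mask": true, "compress": true,
    "meta_only": true, "timeout": 80000
}''')
print(validate_config(sample))  # -> []
```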

...

This can be executed in any Python environment where the whl has been installed. It will produce and read a high water mark file, bytehouse_hwm.txt, from the same directory as the execution, and produce files according to the configuration JSON.

This is the wrapper script: kada_bytehouse_extractor.py

Code Block
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.bytehouse import Extractor

get_generic_logger('root') # Set to use the root logger, you can change the context accordingly or define your own logger

_type = 'bytehouse'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Bytehouse Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename, help='Location of the configuration json, default is the config json in the same directory as the script.')
parser.add_argument('--name', '-n', dest='name', default=_type, help='Name of the collector instance.')
args = parser.parse_args()

start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))
ext.test_connection()
ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

publish_hwm(_type, end_hwm)

...

Code Block
class Extractor(
    api_key: str = None,
    password: str = None,
    server: str = None,
    port: int = 8443,
    tenant_account_id: str = None,
    host: str = None,
    output_path: str = './output',
    mask: bool = False,
    compress: bool = False,
    meta_only: bool = False,
    timeout: int = 80000
) -> None

api_key: api key to sign into Bytehouse
password: password to sign into the Bytehouse server
server: Bytehouse server host or address for the connection
port: Bytehouse server port for the connection, default is 8443
tenant_account_id: The Bytehouse account ID; this can be found in the console
host: The Bytehouse host or address name; this should be the name you onboarded or will onboard into K with, generally the same as the connection server
output_path: full or relative path to where the outputs should go
mask: To mask the META/DATABASE_LOG files or not
compress: To gzip output files or not
meta_only: To extract metadata only
timeout: The timeout for the send/receive connection, default is 80000
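When compress is true, output files are written as .csv.gz. The sketch below is not the collector's code; it only illustrates, with the stdlib and made-up rows, what a gzip-compressed CSV round trip looks like:

```python
import csv
import gzip
import io

# Illustrative only: the rows here are invented and not the collector's
# actual output schema.
rows = [["table_name", "column_name"], ["orders", "order_id"]]

buf = io.BytesIO()  # stand-in for a file under output_path
with gzip.open(buf, "wt", newline="") as fh:
    csv.writer(fh).writerows(rows)

buf.seek(0)
with gzip.open(buf, "rt", newline="") as fh:
    restored = list(csv.reader(fh))
print(restored)  # round-trips to the original rows
```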

...

A high water mark file, bytehouse_hwm.txt, is created in the same directory as the execution, and files are produced according to the configuration JSON. This file is only produced if you call the publish_hwm method.
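The pattern behind the high water mark file can be sketched as follows. This is an illustrative standalone version with an assumed timestamp format, not the library's get_hwm/publish_hwm implementation:

```python
import os
import tempfile
from datetime import datetime, timezone

# Illustrative high-water-mark pattern. The real bytehouse_hwm.txt is managed
# by get_hwm/publish_hwm in kada_collectors and its format is internal; the
# default value and timestamp format below are assumptions for this sketch.
def read_hwm(path, default="1970-01-01T00:00:00+00:00"):
    if os.path.exists(path):
        with open(path) as fh:
            return fh.read().strip()
    return default  # first ever run: extract everything from the epoch

def write_hwm(path, value):
    with open(path, "w") as fh:
        fh.write(value)

path = os.path.join(tempfile.mkdtemp(), "bytehouse_hwm.txt")
start_hwm = read_hwm(path)                 # default on the first run
end_hwm = datetime.now(timezone.utc).isoformat()
write_hwm(path, end_hwm)                   # persist only after a successful run
```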

...

Example: Using Airflow to orchestrate the Extract and Push to K

Insert excerpt
KSL:Collector Method
nameairflow

...