About Collectors
...
Collector Server Minimum Requirements
...
Step 3: Getting Access to the Source Landing Directory
...
The collector requires a set of parameters to connect to and extract metadata from ByteHouse.
The ByteHouse collector only extracts metadata; it does not extract or process query usage on the database.
FIELD | FIELD TYPE | DESCRIPTION | EXAMPLE
---|---|---|---
api_key | string | API key to log into ByteHouse | "myapikey"
password | string | Password to log into ByteHouse | "password"
server | string | ByteHouse gateway; these are regionally specific, see https://docs.byteplus.com/en/docs/bytehouse/docs-supported-regions-and-providers | "gateway.aws-ap-southeast-1.bytehouse.cloud"
port | integer | The port to connect to the ByteHouse gateway, generally this is 19000 | 19000
host | string | The onboarded host in K for the ByteHouse Source. This is generally the gateway address, but we suggest adding a differentiator in case you have multiple ByteHouse accounts | "gateway.aws-ap-southeast-1.bytehouse.cloud"
tenant_account_id | string | This value can be found in the ByteHouse console under the Tenant Management tab and Basic Information. This is NOT the Login Account Id | "123456778"
meta_only | boolean | Currently we only support metadata-only extraction, so this must be true | true
output_path | string | Absolute path to the output location where files are to be written | "/tmp/output"
mask | boolean | To enable masking or not | true
compress | boolean | To enable compression to .csv.gz or not | true
timeout | integer | Timeout setting for sending and receiving data, in seconds; this normally defaults to 80000 | 80000
These parameters can be added directly into the run, or you can pass them in via a JSON file. The following is an example, referenced by the example run code below.
kada_bytehouse_extractor_config.json

```json
{
    "api_key": "",
    "password": "",
    "server": "",
    "port": 19000,
    "tenant_account_id": "",
    "host": "",
    "output_path": "/tmp/output",
    "mask": true,
    "compress": true,
    "meta_only": true,
    "timeout": 80000
}
```
...
This can be executed in any Python environment where the whl has been installed. It will produce and read a high water mark file called bytehouse_hwm.txt in the same directory as the execution, and produce files according to the configuration JSON.
This is the wrapper script: kada_bytehouse_extractor.py

```python
import os
import argparse
from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm, get_generic_logger
from kada_collectors.extractors.bytehouse import Extractor

get_generic_logger('root')  # Set to use the root logger; you can change the context accordingly or define your own logger

_type = 'bytehouse'
dirname = os.path.dirname(__file__)
filename = os.path.join(dirname, 'kada_{}_extractor_config.json'.format(_type))

parser = argparse.ArgumentParser(description='KADA Bytehouse Extractor.')
parser.add_argument('--config', '-c', dest='config', default=filename, help='Location of the configuration json, default is the config json in the same directory as the script.')
parser.add_argument('--name', '-n', dest='name', default=_type, help='Name of the collector instance.')
args = parser.parse_args()

start_hwm, end_hwm = get_hwm(args.name)

ext = Extractor(**load_config(args.config))

ext.test_connection()

ext.run(**{"start_hwm": start_hwm, "end_hwm": end_hwm})

publish_hwm(_type, end_hwm)
```
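For example, assuming the whl has been installed in the active Python environment and the config JSON sits in the same directory as the script, a run is simply `python kada_bytehouse_extractor.py`. Both flags are optional: `--config` defaults to the config JSON next to the script, and `--name` defaults to bytehouse, so you only need to pass them when you store the config elsewhere or run multiple collector instances.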
...
```python
class Extractor(
    api_key: str = None,
    password: str = None,
    server: str = None,
    port: int = 8443,
    tenant_account_id: str = None,
    host: str = None,
    output_path: str = './output',
    mask: bool = False,
    compress: bool = False,
    meta_only: bool = False,
    timeout: int = 80000
) -> None
```
api_key: API key to sign into the Bytehouse server
password: password to sign into the Bytehouse server
server: Bytehouse gateway host or address for the connection
port: Bytehouse server port for the connection, default is 8443
tenant_account_id: The Bytehouse tenant account ID; this can be found in the console under Tenant Management
host: The Bytehouse host or address name; this should be the name you onboarded or will onboard into K with, generally this is the same as the connection server
output_path: full or relative path to where the outputs should go
mask: To mask the META/DATABASE_LOG files or not
compress: To gzip output files or not
meta_only: To extract metadata only
timeout: The timeout for the send/receive connection, default is 80000
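If you prefer not to use a configuration file, the Extractor can also be instantiated directly with keyword arguments. The following is a minimal sketch only; the credential, gateway, and tenant values are placeholders taken from the examples above.

```python
# Minimal sketch: run the ByteHouse extractor with inline parameters
# instead of a config JSON. All values below are placeholders.
from kada_collectors.extractors.utils import get_hwm, publish_hwm
from kada_collectors.extractors.bytehouse import Extractor

start_hwm, end_hwm = get_hwm('bytehouse')  # read the stored high water mark for this instance

ext = Extractor(
    api_key='myapikey',                                   # placeholder credential
    password='password',
    server='gateway.aws-ap-southeast-1.bytehouse.cloud',  # regional ByteHouse gateway
    port=19000,
    tenant_account_id='123456778',                        # Tenant Management > Basic Information
    host='gateway.aws-ap-southeast-1.bytehouse.cloud',    # onboarded host name in K
    output_path='/tmp/output',
    mask=True,
    compress=True,
    meta_only=True,                                       # only metadata extraction is supported
    timeout=80000,
)

ext.test_connection()
ext.run(start_hwm=start_hwm, end_hwm=end_hwm)
publish_hwm('bytehouse', end_hwm)  # writes bytehouse_hwm.txt for the next run
```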
...
A high water mark file called bytehouse_hwm.txt is created in the same directory as the execution, and files are produced according to the configuration JSON. The high water mark file is only produced if you call the publish_hwm method.
...
Example: Using Airflow to orchestrate the Extract and Push to K
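As a rough illustration only, the following sketch shows one way to run the extract as an Airflow task. The DAG id, schedule, and file path are placeholders, extract_bytehouse is hypothetical glue that mirrors the wrapper script above, and the push-to-K step is left as a comment since its API is not covered on this page.

```python
# Sketch only, not the official K integration: orchestrate the ByteHouse
# extract with Airflow. extract_bytehouse is hypothetical glue around the
# wrapper logic shown earlier; DAG id, schedule, and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from kada_collectors.extractors.utils import load_config, get_hwm, publish_hwm
from kada_collectors.extractors.bytehouse import Extractor


def extract_bytehouse(config_path: str, name: str = 'bytehouse') -> None:
    """Run the ByteHouse extractor between the stored high water marks."""
    start_hwm, end_hwm = get_hwm(name)
    ext = Extractor(**load_config(config_path))
    ext.test_connection()
    ext.run(start_hwm=start_hwm, end_hwm=end_hwm)
    publish_hwm(name, end_hwm)


with DAG(
    dag_id='kada_bytehouse_collector',  # placeholder DAG id
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={'retries': 1, 'retry_delay': timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id='extract_bytehouse',
        python_callable=extract_bytehouse,
        op_kwargs={'config_path': '/opt/kada/kada_bytehouse_extractor_config.json'},  # placeholder path
    )
    # A second task would push the files written to output_path into the
    # K landing directory, e.g. extract >> push_to_k
```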
...