
Integrating Amazon Glue with ClickHouse and Spark

ClickHouse Supported

Amazon Glue is a fully managed, serverless data integration service provided by Amazon Web Services (AWS). It simplifies the process of discovering, preparing, and transforming data for analytics, machine learning, and application development.

Installation

To integrate your Glue code with ClickHouse, you can use our official Spark connector in Glue via one of the following:

  • Installing the ClickHouse Glue connector from the AWS Marketplace (recommended).
  • Manually adding the Spark Connector's jars to your Glue job.
  1. Subscribe to the Connector

    To access the connector in your account, subscribe to the ClickHouse AWS Glue Connector from AWS Marketplace.

  2. Grant Required Permissions

    Ensure your Glue job’s IAM role has the necessary permissions, as described in the minimum privileges guide.

  3. Activate the Connector & Create a Connection

    After subscribing, select the Glue version that matches your job requirements. In the Additional details section, under Usage instructions, click the link Open Glue Studio - Add ClickHouse connector. This opens the Glue connection creation page with the key fields pre-filled. Give the connection a name and click Create (you do not need to provide ClickHouse connection details at this stage).

  4. Use in Glue Job

    In your Glue job, select the Job details tab and expand the Advanced properties section. Under Connections, select the connection you just created. The connector automatically injects the required JARs into the job runtime.
[Screenshot: Glue Notebook connections configuration]
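With the connection attached, the job's Spark session can address ClickHouse through the connector's catalog API. The sketch below shows one way to assemble that configuration; the catalog class name and option keys are assumptions based on the connector's `spark.sql.catalog.*` properties, and the host and credentials are placeholders — verify everything against the docs for the connector version you subscribed to.

```python
def clickhouse_catalog_conf(host, user, password, database="default",
                            protocol="https", http_port=8443,
                            catalog="clickhouse"):
    """Build the Spark conf entries that register the ClickHouse Spark
    connector as a Spark SQL catalog. Option names and the catalog class
    are assumptions -- check them against your connector version."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "com.clickhouse.spark.ClickHouseCatalog",  # assumed class name
        f"{prefix}.host": host,
        f"{prefix}.protocol": protocol,
        f"{prefix}.http_port": str(http_port),
        f"{prefix}.user": user,
        f"{prefix}.password": password,
        f"{prefix}.database": database,
    }

# In a Glue job script (placeholder endpoint and credentials):
#
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder
#   for key, value in clickhouse_catalog_conf(
#           "my-instance.clickhouse.cloud", "default", "password").items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Keeping the configuration in one helper makes it easy to reuse across jobs and to swap credentials in from Glue job parameters or AWS Secrets Manager rather than hard-coding them.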
Note

Make sure to select the connector version that matches your Glue job configuration:

  • Glue 4: Spark 3.3, Scala 2, Python 3
  • Glue 5: Spark 3.5, Scala 2, Python 3
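If you take the manual route instead of the Marketplace connector, the connector JARs can be attached with Glue's `--extra-jars` job parameter. A minimal sketch, in which the bucket, role, script path, and JAR names are all placeholders — the actual artifact names depend on the connector release you download:

```shell
# Hypothetical example: create a Glue 5 job and attach the connector
# jars manually. Bucket, role, and jar file names are placeholders.
aws glue create-job \
  --name clickhouse-spark-job \
  --role MyGlueJobRole \
  --glue-version "5.0" \
  --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/job.py \
  --default-arguments '{"--extra-jars":"s3://my-bucket/jars/clickhouse-spark-runtime.jar,s3://my-bucket/jars/clickhouse-jdbc-all.jar"}'
```

The same `--extra-jars` value can also be set on an existing job under Job details → Job parameters in Glue Studio.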

Examples

You can use the ClickHouse connector as either a source or a target in the Glue Studio visual editor. Simply drag the ClickHouse Spark Connector component onto the canvas and connect it to your data pipeline.
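Outside the visual editor, the same connector can be used directly from a job script through Spark SQL. A minimal sketch, assuming the catalog was registered under the name `clickhouse` as configured earlier; the database and table names are placeholders:

```python
def qualified_table(catalog: str, database: str, table: str) -> str:
    """Build the three-part identifier Spark SQL uses to address a
    table through a registered catalog, e.g. clickhouse.default.trips."""
    return f"{catalog}.{database}.{table}"

# In a Glue job (sketch; assumes a SparkSession `spark` with the
# ClickHouse catalog registered as "clickhouse"):
#
#   src = qualified_table("clickhouse", "default", "trips")       # placeholder
#   dst = qualified_table("clickhouse", "default", "trips_copy")  # placeholder
#   df = spark.sql(f"SELECT * FROM {src} LIMIT 10")
#   df.writeTo(dst).append()
```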

For more details, please visit our Spark documentation.