Integrate Apache Iceberg using Spark with IDrive® e2

    Apache Iceberg is an open-source table format for managing large-scale analytical datasets in data lakes. Integrating Apache Iceberg with IDrive® e2 using Apache Spark lets you manage large analytical datasets with powerful table-format features while using IDrive® e2 as scalable, secure cloud object storage. This integration enables cost-effective storage and high-performance analytics on your data lake.

    Prerequisites:

    Before you begin, ensure the following:

    1. An active IDrive® e2 account. Sign up here if you do not have one.
    2. A bucket in IDrive® e2. See how to create a bucket.
    3. Valid Access Key ID and Secret Access Key. Learn how to create an access key.
    4. The following packages were used during testing on Ubuntu 24.04 (a quick environment check is sketched after this list):
      1. Spark version 3.5.6
      2. iceberg-spark-runtime-3.5_2.12-1.9.2.jar
      3. hadoop-aws-3.3.4.jar
      4. aws-java-sdk-bundle-1.11.1026.jar
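
    Before changing any Spark configuration, you can confirm that the expected versions are in place. This is a minimal sketch; it assumes spark-submit and java are on your PATH, that SPARK_HOME is set, and that the Iceberg runtime JAR lives under $SPARK_HOME/jars (adjust the path to wherever you keep it).

      # Should report Spark 3.5.6
      spark-submit --version

      # Confirm the Java runtime Spark will use
      java -version

      # Confirm the Iceberg runtime JAR is where later steps expect it (path is an assumption)
      ls -lh $SPARK_HOME/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar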

    Steps to configure Apache Iceberg with IDrive® e2

    1. After installing the Iceberg library for Spark, update $SPARK_HOME/conf/spark-defaults.conf with the configuration below (an equivalent one-off form using --conf flags is sketched after this block):

      spark.driver.extraClassPath=/home/vishal/spark-extra-jars/hadoop-aws-3.3.4.jar:/home/vishal/spark-extra-jars/aws-java-sdk-bundle-1.11.1026.jar

      spark.executor.extraClassPath=/home/vishal/spark-extra-jars/hadoop-aws-3.3.4.jar:/home/vishal/spark-extra-jars/aws-java-sdk-bundle-1.11.1026.jar

      #Iceberg extensions and catalog setup:

      spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

      spark.sql.catalog.s3cat=org.apache.iceberg.spark.SparkCatalog

      spark.sql.catalog.s3cat.type=hadoop

      spark.sql.catalog.s3cat.warehouse=s3a://warehouse09876
      (Note: warehouse09876 is the name of your IDrive® e2 bucket)

      #S3A filesystem configs for IDrive® e2:

      spark.hadoop.fs.s3a.endpoint=<IDrivee2-endpoint>
      (Ex: v1e8.da.idrivee2-17.com)

      spark.hadoop.fs.s3a.path.style.access=true

      spark.hadoop.fs.s3a.connection.ssl.enabled=true

      spark.hadoop.fs.s3a.signing-region=us-east-1

      spark.hadoop.fs.s3a.access.key=<Access Key>

      spark.hadoop.fs.s3a.secret.key=<Secret Key>

      #Optional steps for performance tuning:

      spark.hadoop.fs.s3a.fast.upload=true

      spark.hadoop.fs.s3a.connection.maximum=100

      spark.hadoop.fs.s3a.threads.max=20
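
      If you want to test connectivity before committing these settings to spark-defaults.conf, the same properties can be passed as --conf flags on a one-off spark-sql run. The sketch below is illustrative only: it assumes the three JARs (downloaded in the next step) are reachable at the paths shown, reuses the s3cat catalog and warehouse09876 bucket names from above, and keeps the same placeholders for the endpoint and keys.

      # Replace the angle-bracket placeholders and JAR paths with your own values before running
      spark-sql \
        --jars $SPARK_HOME/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar,$SPARK_HOME/jars/hadoop-aws-3.3.4.jar,$SPARK_HOME/jars/aws-java-sdk-bundle-1.11.1026.jar \
        --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
        --conf spark.sql.catalog.s3cat=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.s3cat.type=hadoop \
        --conf spark.sql.catalog.s3cat.warehouse=s3a://warehouse09876 \
        --conf spark.hadoop.fs.s3a.endpoint=<IDrivee2-endpoint> \
        --conf spark.hadoop.fs.s3a.path.style.access=true \
        --conf spark.hadoop.fs.s3a.access.key=<Access Key> \
        --conf spark.hadoop.fs.s3a.secret.key=<Secret Key> \
        -e "SHOW NAMESPACES IN s3cat;"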

    2. Download the Hadoop AWS and AWS SDK JARs required by Spark 3.5.6 (a verification step is sketched after these commands):

      mkdir -p ~/spark-extra-jars

      cd ~/spark-extra-jars

      # Download matching Hadoop AWS + AWS SDK for Spark 3.5.6 (Hadoop 3.3.4)

      curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar

      curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar
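
      Step 3 also references the Iceberg Spark runtime JAR listed in the prerequisites. If it is not already present, the sketch below shows one way to fetch it from Maven Central and confirm the downloads; the download URL and the copy into $SPARK_HOME/jars are assumptions, so adjust them to match where your Spark installation actually loads JARs from.

      # Fetch the Iceberg Spark runtime JAR referenced in step 3 (standard Maven Central layout assumed)
      curl -O https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.9.2/iceberg-spark-runtime-3.5_2.12-1.9.2.jar

      # Confirm all three JARs downloaded completely
      ls -lh ~/spark-extra-jars/*.jar

      # Optionally copy them into Spark's default JAR directory so step 3 can reference $SPARK_HOME/jars
      cp ~/spark-extra-jars/*.jar $SPARK_HOME/jars/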

    3. Run the following script to create the namespace and table in the destination bucket (if they do not already exist) and insert sample rows into the table (a verification query is sketched after the script):

      #!/bin/bash

      spark-sql --jars \
      $SPARK_HOME/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar,$SPARK_HOME/jars/hadoop-aws-3.3.4.jar,$SPARK_HOME/jars/aws-java-sdk-bundle-1.11.1026.jar \
      -e "

      CREATE NAMESPACE IF NOT EXISTS s3cat.db;

      CREATE TABLE IF NOT EXISTS s3cat.db.events (
      id BIGINT,
      ts TIMESTAMP,
      data STRING
      ) USING iceberg;

      INSERT INTO s3cat.db.events VALUES (1, current_timestamp(), 'hello'), (2, current_timestamp(), 'world');

      SELECT * FROM s3cat.db.events;

      " --verbose

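      To confirm the commit actually landed in the IDrive® e2 bucket, you can query Iceberg's built-in metadata tables with the same spark-sql setup. This is a minimal sketch against the table created above; the snapshots metadata table is a standard Iceberg feature, though its exact columns can vary between Iceberg releases.

      spark-sql --jars \
      $SPARK_HOME/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar,$SPARK_HOME/jars/hadoop-aws-3.3.4.jar,$SPARK_HOME/jars/aws-java-sdk-bundle-1.11.1026.jar \
      -e "
      -- A successful INSERT should show an 'append' snapshot
      SELECT committed_at, snapshot_id, operation FROM s3cat.db.events.snapshots;
      "
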
    Note: Change access keys, endpoints and directory names according to your configuration.

    Note: Data restoration is handled by your specific backup solution provider and is affected by multiple variables that are unique to your environment. For application-related enquiries/support, it is strongly recommended you seek guidance from the technical team of your backup solution provider.