Integrate Apache Iceberg using Spark with IDrive® e2
Apache Iceberg is an open-source table format for managing large-scale analytical datasets in data lakes. Integrating Apache Iceberg with IDrive® e2 using Apache Spark lets you manage large analytical datasets with powerful table-format features while using IDrive® e2 as scalable, secure cloud object storage. This integration enables cost-effective storage and high-performance analytics on your data lake.
Prerequisites:
Before you begin, ensure the following:
- An active IDrive® e2 account. Sign up here if you do not have one.
- A bucket in IDrive® e2. See how to create a bucket.
- Valid Access Key ID and Secret Access Key. Learn how to create an access key.
- The following packages were used during testing on Ubuntu 24.04 (a quick version check is sketched after this list):
- Spark version 3.5.6
- iceberg-spark-runtime-3.5_2.12-1.9.2.jar
- hadoop-aws-3.3.4.jar
- aws-java-sdk-bundle-1.11.1026.jar
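The Iceberg runtime jar is built for a specific Spark and Scala version, so the installed Spark version should match it. A minimal sketch to confirm this before proceeding, assuming the Spark binaries are already on your PATH:
# Should report Spark 3.5.x, matching iceberg-spark-runtime-3.5_2.12-1.9.2.jar (Spark 3.5 / Scala 2.12)
spark-submit --version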
Steps to configure Apache Iceberg with IDrive® e2
- After installing the Iceberg library for Spark, update $SPARK_HOME/conf/spark-defaults.conf with the following configuration:
spark.driver.extraClassPath=/home/vishal/spark-extra-jars/hadoop-aws-3.3.4.jar:/home/vishal/spark-extra-jars/aws-java-sdk-bundle-1.11.1026.jar
spark.executor.extraClassPath=/home/vishal/spark-extra-jars/hadoop-aws-3.3.4.jar:/home/vishal/spark-extra-jars/aws-java-sdk-bundle-1.11.1026.jar
# Iceberg extensions and catalog setup:
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.s3cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3cat.type=hadoop
spark.sql.catalog.s3cat.warehouse=s3a://warehouse09876
(Note: This is the bucket name)
# S3A filesystem configs for IDrive® e2:
spark.hadoop.fs.s3a.endpoint=<Idrivee2-endpoint>
(Ex: v1e8.da.idrivee2-17.com)
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=true
spark.hadoop.fs.s3a.signing-region=us-east-1
spark.hadoop.fs.s3a.access.key=<Access Key>
spark.hadoop.fs.s3a.secret.key=<Secret Key>
# Optional settings for performance tuning:
spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.connection.maximum=100
spark.hadoop.fs.s3a.threads.max=20
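If editing spark-defaults.conf is not convenient, the same properties can instead be supplied per session on the command line. A minimal sketch, reusing the example bucket, endpoint, and key placeholders from the configuration above:
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.s3cat=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3cat.type=hadoop \
  --conf spark.sql.catalog.s3cat.warehouse=s3a://warehouse09876 \
  --conf spark.hadoop.fs.s3a.endpoint=<Idrivee2-endpoint> \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=<Access Key> \
  --conf spark.hadoop.fs.s3a.secret.key=<Secret Key>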
- Download the Hadoop AWS and AWS SDK JARs that match Spark 3.5.6 (Hadoop 3.3.4):
mkdir -p ~/spark-extra-jars
cd ~/spark-extra-jars
# Download matching Hadoop AWS + AWS SDK for Spark 3.5.6 (Hadoop 3.3.4)
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar
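The script in the next step references all three JARs under $SPARK_HOME/jars. If the Iceberg runtime jar is not already there, one way to fetch it and stage the downloaded JARs is sketched below; the Maven Central path is an assumption based on the versions listed in the prerequisites, so adjust the paths to match your layout:
# Optionally fetch the Iceberg Spark runtime jar (assumed Maven Central location)
curl -O https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.9.2/iceberg-spark-runtime-3.5_2.12-1.9.2.jar
# Copy the jars into $SPARK_HOME/jars so the --jars paths in the next step resolve (may require sudo)
cp ~/spark-extra-jars/*.jar $SPARK_HOME/jars/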
- Run the following script to create the namespace and table (if they do not already exist) and insert data into the table in the destination bucket:
#!/bin/bash
# Create an Iceberg namespace and table in the e2-backed warehouse, insert two rows, and read them back
spark-sql --jars \
$SPARK_HOME/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar,$SPARK_HOME/jars/hadoop-aws-3.3.4.jar,$SPARK_HOME/jars/aws-java-sdk-bundle-1.11.1026.jar \
-e "CREATE NAMESPACE IF NOT EXISTS s3cat.db;
CREATE TABLE IF NOT EXISTS s3cat.db.events (
id BIGINT,
ts TIMESTAMP,
data STRING
) USING iceberg;
INSERT INTO s3cat.db.events VALUES (1, current_timestamp(), 'hello'), (2, current_timestamp(), 'world');
SELECT * FROM s3cat.db.events;
" --verbose
Note: Change access keys, endpoints and directory names according to your configuration.
Note: Data restoration is handled by your specific backup solution provider and is affected by multiple variables that are unique to your environment. For application-related enquiries/support, it is strongly recommended you seek guidance from the technical team of your backup solution provider.