Integrate Apache Iceberg using Spark with IDrive® e2
Apache Iceberg is an open-source table format for managing large-scale analytical datasets in data lakes. Integrating Apache Iceberg with IDrive® e2 using Apache Spark lets you manage large analytical datasets with powerful table-format features while using IDrive® e2 as scalable, secure cloud object storage. This integration enables cost-effective storage and high-performance analytics on your data lake.
Prerequisites:
Before you begin, ensure the following:
- An active IDrive® e2 account. Sign up here if you do not have one.
- A bucket in IDrive® e2. See how to create a bucket.
- Valid Access Key ID and Secret Access Key. Learn how to create an access key.
- The following packages were used during testing on Ubuntu 24.04 (a quick version check is sketched after this list):
- Spark version 3.5.6
- iceberg-spark-runtime-3.5_2.12-1.9.2.jar
- hadoop-aws-3.3.4.jar
- aws-java-sdk-bundle-1.11.1026.jar
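The Iceberg runtime jar is built for a specific Spark and Scala version, so the installed Spark version should match it. A minimal sketch to confirm this before proceeding, assuming the Spark binaries are already on your PATH:
# Should report Spark 3.5.x, matching iceberg-spark-runtime-3.5_2.12-1.9.2.jar (Spark 3.5 / Scala 2.12)
spark-submit --version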
Steps to configure Apache Iceberg with IDrive® e2
- After installing the Iceberg library for Spark, update $SPARK_HOME/conf/spark-defaults.conf with the following configuration:
spark.driver.extraClassPath=/home/vishal/spark-extra-jars/hadoop-aws-3.3.4.jar:/home/vishal/spark-extra-jars/aws-java-sdk-bundle-1.11.1026.jar
spark.executor.extraClassPath=/home/vishal/spark-extra-jars/hadoop-aws-3.3.4.jar:/home/vishal/spark-extra-jars/aws-java-sdk-bundle-1.11.1026.jar
# Iceberg extensions and catalog setup:
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.s3cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3cat.type=hadoop
spark.sql.catalog.s3cat.warehouse=s3a://warehouse09876
(Note: This is the bucket name)
# S3A filesystem configs for IDrive® e2:
spark.hadoop.fs.s3a.endpoint=<Idrivee2-endpoint>
(Ex: v1e8.da.idrivee2-17.com)
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=true
spark.hadoop.fs.s3a.signing-region=us-east-1
spark.hadoop.fs.s3a.access.key=<Access Key>
spark.hadoop.fs.s3a.secret.key=<Secret Key>
# Optional settings for performance tuning:
spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.connection.maximum=100
spark.hadoop.fs.s3a.threads.max=20
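If editing spark-defaults.conf is not convenient, the same properties can instead be supplied per session on the command line. A minimal sketch, reusing the example bucket, endpoint, and key placeholders from the configuration above:
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.s3cat=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3cat.type=hadoop \
  --conf spark.sql.catalog.s3cat.warehouse=s3a://warehouse09876 \
  --conf spark.hadoop.fs.s3a.endpoint=<Idrivee2-endpoint> \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=<Access Key> \
  --conf spark.hadoop.fs.s3a.secret.key=<Secret Key>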
- Download the Hadoop AWS and AWS SDK JARs that match Spark 3.5.6 (Hadoop 3.3.4):
mkdir -p ~/spark-extra-jars
cd ~/spark-extra-jars
# Download matching Hadoop AWS + AWS SDK for Spark 3.5.6 (Hadoop 3.3.4)
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar
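The script in the next step references all three JARs under $SPARK_HOME/jars. If the Iceberg runtime jar is not already there, one way to fetch it and stage the downloaded JARs is sketched below; the Maven Central path is an assumption based on the versions listed in the prerequisites, so adjust the paths to match your layout:
# Optionally fetch the Iceberg Spark runtime jar (assumed Maven Central location)
curl -O https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.9.2/iceberg-spark-runtime-3.5_2.12-1.9.2.jar
# Copy the jars into $SPARK_HOME/jars so the --jars paths in the next step resolve (may require sudo)
cp ~/spark-extra-jars/*.jar $SPARK_HOME/jars/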
- Run the following script to create the namespace and table (if they do not already exist) and insert data into the table in the destination bucket:
#!/bin/bash
# Create an Iceberg namespace and table in the e2-backed warehouse, insert two rows, and read them back
spark-sql --jars \
$SPARK_HOME/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar,$SPARK_HOME/jars/hadoop-aws-3.3.4.jar,$SPARK_HOME/jars/aws-java-sdk-bundle-1.11.1026.jar \
-e "CREATE NAMESPACE IF NOT EXISTS s3cat.db;
CREATE TABLE IF NOT EXISTS s3cat.db.events (
id BIGINT,
ts TIMESTAMP,
data STRING
) USING iceberg;
INSERT INTO s3cat.db.events VALUES (1, current_timestamp(), 'hello'), (2, current_timestamp(), 'world');
SELECT * FROM s3cat.db.events;
" --verbose
Note: Change access keys, endpoints and directory names according to your configuration.
Note: Data restoration is handled by your specific backup solution provider and is affected by multiple variables that are unique to your environment. For application-related enquiries/support, it is strongly recommended you seek guidance from the technical team of your backup solution provider.