
Conversation

@suxiaogang223 (Contributor) commented on Dec 26, 2025

What problem does this PR solve?

Hudi Docker Environment

This directory contains the Docker Compose configuration for setting up a Hudi test environment with Spark, Hive Metastore, MinIO (S3-compatible storage), and PostgreSQL.

Components

  • Spark: Apache Spark 3.5.7 for processing Hudi tables
  • Hive Metastore: Starburst Hive Metastore for table metadata management
  • PostgreSQL: Database backend for Hive Metastore
  • MinIO: S3-compatible object storage for Hudi data files

Important Configuration Parameters

Container UID

  • Parameter: CONTAINER_UID in custom_settings.env
  • Default: doris--
  • Note: Must be set to a unique value to avoid conflicts with other Docker environments
  • Example: CONTAINER_UID="doris--bender--"

Port Configuration (hudi.env.tpl)

  • HIVE_METASTORE_PORT: Port for Hive Metastore Thrift service (default: 19083)
  • MINIO_API_PORT: MinIO S3 API port (default: 19100)
  • MINIO_CONSOLE_PORT: MinIO web console port (default: 19101)
  • SPARK_UI_PORT: Spark web UI port (default: 18080)

MinIO Credentials (hudi.env.tpl)

  • MINIO_ROOT_USER: MinIO access key (default: minio)
  • MINIO_ROOT_PASSWORD: MinIO secret key (default: minio123)
  • HUDI_BUCKET: S3 bucket name for Hudi data (default: datalake)
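
For illustration, the defaults above would look roughly like this in the generated hudi.env (a hedged sketch; the actual template may group or name entries differently):

# Illustrative defaults only; override values before starting the environment
HIVE_METASTORE_PORT=19083
MINIO_API_PORT=19100
MINIO_CONSOLE_PORT=19101
SPARK_UI_PORT=18080
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio123
HUDI_BUCKET=datalake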

Version Compatibility

⚠️ Important: Hadoop versions must match Spark's built-in Hadoop version

  • Spark Version: 3.5.7 (bundles Hadoop 3.3.4); the default build for Hudi 1.0.2
  • Hadoop AWS Version: 3.3.4 (matching Spark's Hadoop)
  • Hadoop Common Version: 3.3.4 (matching Spark's Hadoop)
  • Hudi Bundle Version: 1.0.2 Spark 3.5 bundle (the default build; matches Spark 3.5.7 and Doris's bundled Hudi version, avoiding versionCode compatibility issues)
  • AWS SDK Bundle Version: 1.12.262 (compatible with Hadoop 3.3.4)
  • PostgreSQL JDBC Version: 42.7.1 (compatible with Hive Metastore)
  • Hudi 1.0.x Compatibility: Supports Spark 3.5.x (default), 3.4.x, and 3.3.x

JAR Dependencies (hudi.env.tpl)

All JAR file versions and URLs are configurable (an illustrative entry is shown after this list):

  • HUDI_BUNDLE_VERSION / HUDI_BUNDLE_URL: Hudi Spark bundle
  • HADOOP_AWS_VERSION / HADOOP_AWS_URL: Hadoop S3A filesystem support
  • HADOOP_COMMON_VERSION / HADOOP_COMMON_URL: Hadoop common library
  • AWS_SDK_BUNDLE_VERSION / AWS_SDK_BUNDLE_URL: AWS Java SDK
  • POSTGRESQL_JDBC_VERSION / POSTGRESQL_JDBC_URL: PostgreSQL JDBC driver
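
For example, a single version/URL pair might look like the following (an illustrative sketch; the actual template layout and URL interpolation may differ):

# Hypothetical hudi.env.tpl entry pointing at the matching artifact on Maven Central
HADOOP_AWS_VERSION=3.3.4
HADOOP_AWS_URL="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar"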

Starting the Environment

# Start Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi

# Stop Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi --stop

Adding Data

⚠️ Important: To keep data consistent across Docker restarts, add data only through SQL scripts. Data added through the spark-sql interactive shell is temporary and will not persist after a container restart.

Using SQL Scripts

Add new SQL files to the scripts/create_preinstalled_scripts/hudi/ directory:

  • Files are executed in alphabetical order (e.g., 01_config_and_database.sql, 02_create_user_activity_log_tables.sql, etc.)
  • Use descriptive names with numeric prefixes to control execution order
  • Use environment variable substitution: ${HIVE_METASTORE_URIS} and ${HUDI_BUCKET}
  • Data created through SQL scripts will persist after Docker restart

Example: Create 08_create_custom_table.sql:

USE regression_hudi;

CREATE TABLE IF NOT EXISTS my_hudi_table (
  id BIGINT,
  name STRING,
  created_at TIMESTAMP
) USING hudi
TBLPROPERTIES (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'created_at',
  hoodie.datasource.hive_sync.enable = 'true',
  hoodie.datasource.hive_sync.metastore.uris = '${HIVE_METASTORE_URIS}',
  hoodie.datasource.hive_sync.mode = 'hms'
)
LOCATION 's3a://${HUDI_BUCKET}/warehouse/regression_hudi/my_hudi_table';

INSERT INTO my_hudi_table VALUES
  (1, 'Alice', TIMESTAMP '2024-01-01 10:00:00'),
  (2, 'Bob', TIMESTAMP '2024-01-02 11:00:00');

After adding SQL files, restart the container to execute them:

docker restart doris--hudi-spark
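
To confirm that the new script executed, a read-only check can be run against the Metastore (a sketch that reuses the spark-sql options shown in the Debugging section below):

# List tables in the target database without modifying any data
docker exec doris--hudi-spark /opt/spark/bin/spark-sql \
  --master local[*] \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  -e "SHOW TABLES IN regression_hudi;"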

Creating Hudi Catalog in Doris

After starting the Hudi Docker environment, you can create a Hudi catalog in Doris to access Hudi tables:

-- Create Hudi catalog
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://<externalEnvIp>:19083',
    's3.endpoint' = 'http://<externalEnvIp>:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);

-- Switch to Hudi catalog
SWITCH hudi_catalog;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Query Hudi table
SELECT * FROM user_activity_log_cow_partition LIMIT 10;

Configuration Parameters:

  • hive.metastore.uris: Hive Metastore Thrift service address (default port: 19083)
  • s3.endpoint: MinIO S3 API endpoint (default port: 19100)
  • s3.access_key: MinIO access key (default: minio)
  • s3.secret_key: MinIO secret key (default: minio123)
  • s3.region: S3 region (default: us-east-1)
  • use_path_style: Use path-style access for MinIO (required: true)

Replace <externalEnvIp> with your actual external environment IP address (e.g., 127.0.0.1 for localhost).
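
If tables added after the catalog was created do not show up immediately, refreshing the catalog from Doris is usually enough. A hedged example, referencing the hypothetical my_hudi_table from the script example above:

-- Refresh cached metadata, then query the newly added table
REFRESH CATALOG hudi_catalog;
SWITCH hudi_catalog;
USE regression_hudi;
SELECT * FROM my_hudi_table LIMIT 10;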

Debugging with Spark SQL

⚠️ Note: The methods below are for debugging only. Data created through the spark-sql interactive shell will not persist after a Docker restart. To add persistent data, use SQL scripts as described in the "Adding Data" section.

1. Connect to Spark Container

docker exec -it doris--hudi-spark bash

2. Start Spark SQL Interactive Shell

/opt/spark/bin/spark-sql \
  --master local[*] \
  --name hudi-debug \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.warehouse.dir=s3a://datalake/warehouse

3. Common Debugging Commands

-- Show databases
SHOW DATABASES;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Describe table structure
DESCRIBE EXTENDED user_activity_log_cow_partition;

-- Query data
SELECT * FROM user_activity_log_cow_partition LIMIT 10;

-- Check Hudi table properties
SHOW TBLPROPERTIES user_activity_log_cow_partition;

-- View Spark configuration
SET -v;

-- Check Hudi-specific configurations
SET hoodie.datasource.write.hive_style_partitioning;

4. View Spark Web UI

Access the Spark Web UI at http://localhost:18080 (or the configured SPARK_UI_PORT).
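
A quick reachability check from the host (assuming curl is available):

# Expect an HTTP 200 response from the Spark UI
curl -sI http://localhost:18080 | head -n 1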

5. Check Container Logs

# View Spark container logs
docker logs doris--hudi-spark --tail 100 -f

# View Hive Metastore logs
docker logs doris--hudi-metastore --tail 100 -f

# View MinIO logs
docker logs doris--hudi-minio --tail 100 -f

6. Verify S3 Data

# Access MinIO console
# URL: http://localhost:19101 (or configured MINIO_CONSOLE_PORT)
# Username: minio (or MINIO_ROOT_USER)
# Password: minio123 (or MINIO_ROOT_PASSWORD)

# Or use MinIO client
docker exec -it doris--hudi-minio-mc mc ls myminio/datalake/warehouse/regression_hudi/

Troubleshooting

Container Exits Immediately

  • Check logs: docker logs doris--hudi-spark
  • Verify the SUCCESS file exists: docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS
  • Ensure Hive Metastore is running: docker ps | grep metastore

ClassNotFoundException Errors

  • Verify JAR files are downloaded: docker exec doris--hudi-spark ls -lh /opt/hudi-cache/
  • Check that the JAR versions match Spark's Hadoop version (3.3.4); see the check after this list
  • Review hudi.env.tpl for correct version numbers
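
For example, the downloaded hadoop-aws JAR name encodes its version, so a quick check might look like this (a sketch using the cache path above):

# The file name should contain 3.3.4, matching Spark's built-in Hadoop
docker exec doris--hudi-spark ls /opt/hudi-cache/ | grep hadoop-aws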

S3A Connection Issues

  • Verify MinIO is running: docker ps | grep minio
  • Check MinIO credentials in hudi.env.tpl
  • Test S3 connection: docker exec doris--hudi-minio-mc mc ls myminio/

Hive Metastore Connection Issues

  • Check Metastore is ready: docker logs doris--hudi-metastore | grep "Metastore is ready"
  • Verify PostgreSQL is running: docker ps | grep metastore-db
  • Test connection: docker exec doris--hudi-metastore-db pg_isready -U hive
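
If the services look healthy but Doris still cannot connect, checking the Thrift port from the host can help (assuming nc is installed; 19083 is the default HIVE_METASTORE_PORT):

# Succeeds if the Metastore Thrift port is reachable from the host
nc -vz 127.0.0.1 19083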

File Structure

hudi/
├── hudi.yaml.tpl          # Docker Compose template
├── hudi.env.tpl           # Environment variables template
├── scripts/
│   ├── init.sh            # Initialization script
│   ├── create_preinstalled_scripts/
│   │   └── hudi/          # SQL scripts (01_config_and_database.sql, 02_create_user_activity_log_tables.sql, ...)
│   └── SUCCESS            # Initialization marker (generated)
└── cache/                 # Downloaded JAR files (generated)

Notes

  • All generated files (.yaml, .env, cache/, SUCCESS) are ignored by Git
  • SQL scripts support environment variable substitution using ${VARIABLE_NAME} syntax
  • Hadoop version compatibility is critical: JAR versions must match Spark's built-in Hadoop version
  • Container keeps running after initialization for healthcheck and debugging

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen (Contributor) commented:

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally include the specific error message) and how it was fixed.
  2. Which behaviors were modified: what the previous behavior was, what it is now, why it was changed, and what the possible impacts are.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what the difference is before and after the optimization.
