[regression](hudi) Impl new Hudi Docker environment #59401
What problem does this PR solve?
Hudi Docker Environment
This directory contains the Docker Compose configuration for setting up a Hudi test environment with Spark, Hive Metastore, MinIO (S3-compatible storage), and PostgreSQL.
Components
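The four components can be pictured as a compose topology like the following. This is an illustrative sketch only — service names, images, and port mappings are assumptions, since the real file is generated from the templates in this directory:

```yaml
services:
  hudi-spark:           # Spark with the Hudi bundle on its classpath
    image: apache/spark
    ports: ["18080:8080"]                # SPARK_UI_PORT -> Spark web UI
  hudi-metastore:       # Hive Metastore Thrift service
    image: apache/hive
    ports: ["19083:9083"]                # HIVE_METASTORE_PORT
  hudi-minio:           # S3-compatible object storage
    image: minio/minio
    ports: ["19100:9000", "19101:9001"]  # API / web console
  hudi-metastore-db:    # PostgreSQL backing the metastore
    image: postgres
```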
Important Configuration Parameters
Container UID
CONTAINER_UID is set in custom_settings.env (default: doris--), e.g. CONTAINER_UID="doris--bender--". Container names are prefixed with this value.

Port Configuration (hudi.env.tpl)
- HIVE_METASTORE_PORT: Hive Metastore Thrift service port (default: 19083)
- MINIO_API_PORT: MinIO S3 API port (default: 19100)
- MINIO_CONSOLE_PORT: MinIO web console port (default: 19101)
- SPARK_UI_PORT: Spark web UI port (default: 18080)

MinIO Credentials (hudi.env.tpl)
- MINIO_ROOT_USER: MinIO access key (default: minio)
- MINIO_ROOT_PASSWORD: MinIO secret key (default: minio123)
- HUDI_BUCKET: S3 bucket name for Hudi data (default: datalake)

Version Compatibility
JAR Dependencies (hudi.env.tpl)
All JAR file versions and URLs are configurable:
- HUDI_BUNDLE_VERSION / HUDI_BUNDLE_URL: Hudi Spark bundle
- HADOOP_AWS_VERSION / HADOOP_AWS_URL: Hadoop S3A filesystem support
- HADOOP_COMMON_VERSION / HADOOP_COMMON_URL: Hadoop common library
- AWS_SDK_BUNDLE_VERSION / AWS_SDK_BUNDLE_URL: AWS Java SDK
- POSTGRESQL_JDBC_VERSION / POSTGRESQL_JDBC_URL: PostgreSQL JDBC driver

Starting the Environment
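A typical startup looks like the following. The file names below are assumptions — the actual .yaml/.env files are generated from the templates in this directory (see the Notes section):

```shell
# Start all services in the background (file names are assumptions --
# check this directory for the actual generated names).
docker compose --env-file hudi.env -f hudi.yaml up -d

# Verify all containers are up (names are prefixed with CONTAINER_UID).
docker ps | grep doris--hudi
```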
Adding Data
Note: Data added through the spark-sql interactive shell is temporary and will not persist after a container restart.

Using SQL Scripts
Add new SQL files in the scripts/create_preinstalled_scripts/hudi/ directory:
- Files carry a numeric prefix that determines execution order (e.g. 01_config_and_database.sql, 02_create_user_activity_log_tables.sql, etc.)
- Scripts can reference the ${HIVE_METASTORE_URIS} and ${HUDI_BUCKET} variables

Example: create 08_create_custom_table.sql. After adding SQL files, restart the container to execute them:
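A sketch of both steps follows; the table schema is illustrative, and the container that runs the scripts is assumed to be the Spark container with the default CONTAINER_UID prefix:

```shell
# Write a hypothetical custom script (schema and names are illustrative).
# The quoted heredoc delimiter keeps ${HUDI_BUCKET} literal so it is
# substituted later, at script-execution time.
mkdir -p scripts/create_preinstalled_scripts/hudi
cat > scripts/create_preinstalled_scripts/hudi/08_create_custom_table.sql <<'SQL'
CREATE TABLE IF NOT EXISTS my_custom_table (
  id BIGINT,
  name STRING,
  ts TIMESTAMP
) USING hudi
LOCATION 's3a://${HUDI_BUCKET}/warehouse/my_custom_table'
TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts');
SQL

# Restart so the new script is executed.
docker restart doris--hudi-spark
```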
Creating a Hudi Catalog in Doris
After starting the Hudi Docker environment, you can create a Hudi catalog in Doris to access Hudi tables:
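A sketch of the statement, assembled from the parameters documented below; the catalog name hudi_catalog is arbitrary, and 'type' = 'hms' assumes the metastore-backed catalog type:

```sql
CREATE CATALOG hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://<externalEnvIp>:19083',
    's3.endpoint' = 'http://<externalEnvIp>:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);
```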
Configuration Parameters:
- hive.metastore.uris: Hive Metastore Thrift service address (default port: 19083)
- s3.endpoint: MinIO S3 API endpoint (default port: 19100)
- s3.access_key: MinIO access key (default: minio)
- s3.secret_key: MinIO secret key (default: minio123)
- s3.region: S3 region (default: us-east-1)
- use_path_style: Use path-style access for MinIO (required: true)

Replace <externalEnvIp> with your actual external environment IP address (e.g., 127.0.0.1 for localhost).

Debugging with Spark SQL
Note: Data created in the spark-sql interactive shell will not persist after a Docker restart. To add persistent data, use SQL scripts as described in the "Adding Data" section.

1. Connect to Spark Container
```shell
docker exec -it doris--hudi-spark bash
```

2. Start Spark SQL Interactive Shell
```shell
/opt/spark/bin/spark-sql \
  --master local[*] \
  --name hudi-debug \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.warehouse.dir=s3a://datalake/warehouse
```

3. Common Debugging Commands
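For example, inside the shell (the table name is taken from 02_create_user_activity_log_tables.sql; the database layout is an assumption, so adjust names as needed):

```sql
-- List what the metastore knows about
SHOW DATABASES;
SHOW TABLES;

-- Inspect a table created by the preinstalled scripts
DESCRIBE FORMATTED user_activity_log;
SELECT * FROM user_activity_log LIMIT 10;
```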
4. View Spark Web UI
Access the Spark Web UI at http://localhost:18080 (or the configured SPARK_UI_PORT).

5. Check Container Logs
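For example (container names assume the default CONTAINER_UID prefix "doris--"; the MinIO service container name here is an assumption):

```shell
docker logs doris--hudi-spark          # Spark and init-script output
docker logs doris--hudi-metastore      # Hive Metastore
docker logs doris--hudi-metastore-db   # PostgreSQL
docker logs -f doris--hudi-spark       # follow output live
```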
6. Verify S3 Data
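The bundled MinIO client container can browse the bucket directly (the "myminio" alias, datalake bucket, and warehouse path are this environment's defaults):

```shell
docker exec doris--hudi-minio-mc mc ls myminio/
docker exec doris--hudi-minio-mc mc ls --recursive myminio/datalake/warehouse/
```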
Troubleshooting
Container Exits Immediately
- Check the container logs: docker logs doris--hudi-spark
- Verify the init scripts completed: docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS
- Confirm the metastore is running: docker ps | grep metastore

ClassNotFoundException Errors
- Inspect the cached JARs: docker exec doris--hudi-spark ls -lh /opt/hudi-cache/
- Check hudi.env.tpl for correct version numbers

S3A Connection Issues
- Verify MinIO is running: docker ps | grep minio
- Check the credentials in hudi.env.tpl
- List the bucket contents: docker exec doris--hudi-minio-mc mc ls myminio/

Hive Metastore Connection Issues
- Confirm the metastore is ready: docker logs doris--hudi-metastore | grep "Metastore is ready"
- Verify the backing PostgreSQL container is running: docker ps | grep metastore-db
- Check PostgreSQL health: docker exec doris--hudi-metastore-db pg_isready -U hive

File Structure
Notes
- Generated files (.yaml, .env, cache/, SUCCESS) are ignored by Git
- Template files use the ${VARIABLE_NAME} syntax for variable substitution

Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merges this PR)