
Best practices for building a transactional data lake on Amazon S3 using Apache Hudi
The session is hosted by Nikhil Khokhar, Solutions Architect at AWS. In this session, you will learn how to generate and ingest use-case-specific streaming data into an Amazon Kinesis data stream, consume the data from the stream using an AWS Glue streaming job, and persist it in Amazon S3 as an Apache Hudi dataset. The Hudi configurations within the Glue job walk through best practices such as:

- Reducing query scan scope by setting up nested partitioning based on related schema attributes.
- Improving query performance by configuring clustering inside a partition to achieve data locality, using either linear ordering or Z-order.
- For large datasets, enabling Hudi's metadata-based file listing to avoid the performance impact of file-system bookkeeping operations.
- Using Hudi's Kafka commit callbacks to set up event-driven pipelines on existing Hudi-based S3 datasets.

The short sketches that follow illustrate how each of these might look in a Glue job.
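As a starting point, here is a minimal sketch of a Glue streaming job that consumes a Kinesis stream and upserts each micro-batch into a Hudi dataset on S3, with two partition levels so that queries filtering on both attributes scan only one leaf partition. The database, table, stream, bucket, and field names are illustrative placeholders, not values from the session.

```python
# Minimal sketch: Glue streaming job reading Kinesis, writing a Hudi dataset
# with nested partitioning. All names and paths below are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hudi write options; "country,state" yields two partition levels, which
# narrows the scan scope for queries that filter on both attributes.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.partitionpath.field": "country,state",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "upsert",
}

def process_batch(batch_df, batch_id):
    # Upsert each non-empty micro-batch into the Hudi table on S3.
    if batch_df.count() > 0:
        (batch_df.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save("s3://my-bucket/hudi/orders/"))

# The Kinesis source is defined as a Glue Data Catalog table.
stream_df = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="orders_kinesis",
    additional_options={"startingPosition": "TRIM_HORIZON",
                        "inferSchema": "true"},
)

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/orders/",
    },
)
job.commit()
```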
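Clustering inside a partition can then be layered onto the same write options. The sketch below assumes Hudi 0.12-era configuration key names and placeholder sort columns; exact keys and defaults vary by Hudi version, so check the release you are running.

```python
# Sketch: inline clustering options merged into the hudi_options dict above.
clustering_options = {
    # Run clustering inline with writes, every 4 commits.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Rewrite files below the small-file limit into ~1 GB target files.
    "hoodie.clustering.plan.strategy.small.file.limit": "268435456",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",
    # Sort records within each partition on these columns for data locality.
    "hoodie.clustering.plan.strategy.sort.columns": "customer_id,event_time",
    # "linear" sorts the columns lexicographically; "z-order" interleaves
    # them with a space-filling curve, which helps multi-column filters.
    "hoodie.layout.optimize.strategy": "z-order",
}
hudi_options.update(clustering_options)
```

Linear ordering is the simpler choice when queries mostly filter on the leading sort column; Z-order trades some write cost for better pruning when filters hit several columns.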
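Finally, the remaining two practices are plain configuration switches. The sketch below enables Hudi's metadata table for file listing and a Kafka write-commit callback; the callback class ships in the hudi-utilities bundle, and the broker address and topic name are placeholders.

```python
# Sketch: metadata-based file listing plus a Kafka commit callback,
# again merged into hudi_options. Broker and topic are placeholders.
hudi_options.update({
    # Keep file listings in Hudi's internal metadata table so large tables
    # avoid recursive S3 LIST calls on every query and write.
    "hoodie.metadata.enable": "true",
    # Publish a message to Kafka after each successful commit so downstream
    # consumers can react to new data (event-driven pipelines).
    "hoodie.write.commit.callback.on": "true",
    "hoodie.write.commit.callback.class":
        "org.apache.hudi.utilities.callback.kafka."
        "HoodieWriteCommitKafkaCallback",
    "hoodie.write.commit.callback.kafka.bootstrap.servers": "broker1:9092",
    "hoodie.write.commit.callback.kafka.topic": "hudi-commits",
})
```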
About Nikhil Khokhar
A cloud professional working on data-oriented services such as AWS Kinesis (Subject Matter Expert), Kafka, AWS Redshift, and RDS PostgreSQL.