
Best practices for building a transactional data lake on Amazon S3 using Apache Hudi
The session is hosted by Nikhil Khokhar, Solutions Architect at AWS. In this session, you will learn how to generate and ingest use-case-specific streaming data into an Amazon Kinesis data stream, consume the data from the stream using an AWS Glue streaming job, and persist it in Amazon S3 as an Apache Hudi dataset. The Hudi configurations within the Glue job walk through best practices such as:

- Reducing query scan scope by setting up nested partitioning based on related schema attributes.
- Improving query performance by configuring clustering inside a partition to achieve data locality, using either linear ordering or Z-order.
- For large datasets, enabling Hudi's metadata-based file listing to avoid the performance impact of file-system bookkeeping operations.
- Using Hudi's Kafka commit callbacks to set up event-driven pipelines on existing Hudi-based S3 datasets.

The short sketches that follow illustrate how each of these might look in a Glue job.
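As a starting point, here is a minimal sketch of a Glue streaming job that consumes a Kinesis stream and upserts each micro-batch into a Hudi dataset on S3, with two partition levels so that queries filtering on both attributes scan only one leaf partition. The database, table, stream, bucket, and field names are illustrative placeholders, not values from the session.

```python
# Minimal sketch: Glue streaming job reading Kinesis, writing a Hudi dataset
# with nested partitioning. All names and paths below are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hudi write options; "country,state" yields two partition levels, which
# narrows the scan scope for queries that filter on both attributes.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.partitionpath.field": "country,state",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "upsert",
}

def process_batch(batch_df, batch_id):
    # Upsert each non-empty micro-batch into the Hudi table on S3.
    if batch_df.count() > 0:
        (batch_df.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save("s3://my-bucket/hudi/orders/"))

# The Kinesis source is defined as a Glue Data Catalog table.
stream_df = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="orders_kinesis",
    additional_options={"startingPosition": "TRIM_HORIZON",
                        "inferSchema": "true"},
)

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/orders/",
    },
)
job.commit()
```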
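Clustering inside a partition can then be layered onto the same write options. The sketch below assumes Hudi 0.12-era configuration key names and placeholder sort columns; exact keys and defaults vary by Hudi version, so check the release you are running.

```python
# Sketch: inline clustering options merged into the hudi_options dict above.
clustering_options = {
    # Run clustering inline with writes, every 4 commits.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Rewrite files below the small-file limit into ~1 GB target files.
    "hoodie.clustering.plan.strategy.small.file.limit": "268435456",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",
    # Sort records within each partition on these columns for data locality.
    "hoodie.clustering.plan.strategy.sort.columns": "customer_id,event_time",
    # "linear" sorts the columns lexicographically; "z-order" interleaves
    # them with a space-filling curve, which helps multi-column filters.
    "hoodie.layout.optimize.strategy": "z-order",
}
hudi_options.update(clustering_options)
```

Linear ordering is the simpler choice when queries mostly filter on the leading sort column; Z-order trades some write cost for better pruning when filters hit several columns.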
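Finally, the remaining two practices are plain configuration switches. The sketch below enables Hudi's metadata table for file listing and a Kafka write-commit callback; the callback class ships in the hudi-utilities bundle, and the broker address and topic name are placeholders.

```python
# Sketch: metadata-based file listing plus a Kafka commit callback,
# again merged into hudi_options. Broker and topic are placeholders.
hudi_options.update({
    # Keep file listings in Hudi's internal metadata table so large tables
    # avoid recursive S3 LIST calls on every query and write.
    "hoodie.metadata.enable": "true",
    # Publish a message to Kafka after each successful commit so downstream
    # consumers can react to new data (event-driven pipelines).
    "hoodie.write.commit.callback.on": "true",
    "hoodie.write.commit.callback.class":
        "org.apache.hudi.utilities.callback.kafka."
        "HoodieWriteCommitKafkaCallback",
    "hoodie.write.commit.callback.kafka.bootstrap.servers": "broker1:9092",
    "hoodie.write.commit.callback.kafka.topic": "hudi-commits",
})
```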
About Nikhil Khokhar
A cloud professional working on data-oriented services such as AWS Kinesis (Subject Matter Expert), Kafka, AWS Redshift, and RDS PostgreSQL.