DEN Newsletter #12

Data Engineering News

Jan 12, 2025

1. A First Look at S3 (Iceberg) Tables

How do S3 (Iceberg) Tables change data storage and querying in modern data lakes?

The author highlights AWS's game-changing introduction of S3 Tables, bringing native Apache Iceberg support directly into S3 buckets. This deep integration automates table maintenance, simplifies Iceberg adoption, and reduces the effort required to handle compaction. A must-read for data engineers exploring the future of scalable and efficient data lake solutions.

https://meltware.com/2024/12/04/s3-tables

2. How Amazon S3 Tables use compaction to improve query performance by up to 3 times

Can Amazon S3 Tables' automatic compaction boost query performance by 3x?

The authors show how Amazon S3 Tables with automatic compaction can speed up data analytics. By consolidating small Parquet files into larger, optimized ones, S3 Tables reduce read requests, enabling query performance improvements of up to 3.2x. This removes much of the complexity of manual data maintenance and streamlines operations for storage-intensive workloads. Dive into the blog to explore the benchmarks and get started with S3 Tables.

https://aws.amazon.com/blogs/storage/how-amazon-s3-tables-use-compaction-to-improve-query-performance-by-up-to-3-times/

3. The History of the Decline and Fall of In-Memory Database Systems

What led to the decline of in-memory database systems, once hailed as the future of data management?

This article explores their meteoric rise, fueled by cheap memory, and their eventual decline as SSDs and cloud storage redefined the landscape. The author delves into what worked, what didn’t, and how lessons from these systems continue to shape modern data architectures. A must-read for anyone designing scalable, adaptable systems in a rapidly evolving tech world.

https://cedardb.com/blog/in_memory_dbms/

4. How to Speed Up Spark Jobs on Small Test Datasets

How can you optimize Spark jobs for small test datasets without sacrificing accuracy?

The author argues that thoughtful tuning can transform Spark's performance on small datasets, despite its heavyweight design. By adjusting parameters like shuffle partitions, executor resources, and compression, you can dramatically speed up Spark jobs. The post also questions whether Spark is always the best tool, encouraging exploration of alternatives like Pandas or DuckDB for efficiency.

https://luminousmen.com/post/how-to-speed-up-spark-jobs-on-small-test-datasets
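
To make the tuning idea concrete, here is a minimal sketch of what a small-dataset test configuration might look like in PySpark. The specific values, the helper names small_dataset_conf and build_session, and the choice to disable shuffle compression and the UI are my assumptions for illustration, not settings taken from the article; the configuration keys themselves are standard Spark properties.

```python
from __future__ import annotations


def small_dataset_conf(cores: int = 2) -> dict[str, str]:
    """Return Spark settings aimed at small local test datasets.

    Hypothetical helper: the values are illustrative assumptions,
    not recommendations quoted from the linked article.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {
        # Run locally with a fixed number of threads.
        "spark.master": f"local[{cores}]",
        # The default of 200 shuffle partitions is far too many for tiny data.
        "spark.sql.shuffle.partitions": str(cores),
        "spark.default.parallelism": str(cores),
        # Compression costs CPU and saves little when the data is small.
        "spark.shuffle.compress": "false",
        "spark.shuffle.spill.compress": "false",
        # Skip starting the web UI during tests.
        "spark.ui.enabled": "false",
    }


def build_session(conf: dict[str, str]):
    """Create a SparkSession from the settings (requires pyspark)."""
    # Imported lazily so the config helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("small-dataset-test")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    for key, value in small_dataset_conf(cores=2).items():
        print(f"{key}={value}")
```

In a test suite, something like build_session(small_dataset_conf()) could be wrapped in a shared fixture so the session starts only once per run. Measure before and after on your own jobs, since the right values depend on the workload.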

5. Typed Python in 2024: Well adopted, yet usability challenges persist

Is Typed Python the Future or Just a Passing Trend?

This blog dives into a survey by JetBrains, Meta, and Microsoft, exploring how developers use type hints, the tools they rely on, and the challenges they face. With 88% adoption among respondents, types are transforming Python development, yet usability issues and documentation gaps persist. A must-read for anyone shaping Python’s typed future.

https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/#:~:text=Overall%20findings,that%20leave%20some%20code%20unchecked

All rights reserved, Den Digital, India. I have provided links for informational purposes; they do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of my current, former, or future employers.

dat-a-man — Data with Aman
