1. The History and Evolution of Open Table Formats
How have open table formats like Delta Lake, Apache Iceberg, and Apache Hudi addressed the limitations of traditional Hive-style table formats in modern data management?
In this article, Alireza Sadeghi explores how these innovations address the limitations of traditional systems like Hive, enabling scalable, ACID-compliant, and flexible data management. This blog takes you on a journey through the history of table formats, their challenges, and the cutting-edge technologies shaping modern data lakes and lakehouses. A must-read to understand the past, present, and future of data architecture.
https://alirezasadeghi1.medium.com/the-history-and-evolution-of-open-table-formats-0f1b9ea10e1e
2. Supercharging dbt vol 2: how we modified dbt’s incremental materialisation to more than halve execution time for incremental loads
How did modifying dbt's incremental materialization drastically improve execution times for incremental loads, and what can you learn from this approach?
In this article, Dominik Golebiewski and Konrad Maliszewski discuss how a simple yet powerful adjustment to dbt's incremental materialization (adding a date filter) can drastically reduce full table scans, cutting execution time by more than half. A must-read for anyone working with large datasets and looking to optimize performance in their data pipelines.
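To make the trick concrete, here is a minimal Python sketch of the technique (not the authors' actual dbt code, which is SQL templated with Jinja): on incremental runs, the rendered query gains a date predicate with a short lookback window, so the warehouse prunes old partitions instead of scanning the whole table. The table name, column name, and three-day lookback are illustrative assumptions.

```python
from datetime import date, timedelta

def incremental_select(target_exists: bool, last_loaded: date | None,
                       lookback_days: int = 3) -> str:
    """Render the SELECT for an incremental load (illustrative names)."""
    base = "select * from source_schema.events"
    if not target_exists or last_loaded is None:
        return base  # first run: build the table from scratch
    # Re-process a short lookback window to catch late-arriving rows,
    # but never scan partitions older than the cutoff.
    cutoff = last_loaded - timedelta(days=lookback_days)
    return f"{base} where event_date >= date '{cutoff.isoformat()}'"

print(incremental_select(target_exists=True, last_loaded=date(2024, 5, 1)))
# select * from source_schema.events where event_date >= date '2024-04-28'
```

The lookback window trades a little recomputation for safety against late-arriving rows; the win comes from never touching partitions older than the cutoff.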
3. Presto® Express
How does Presto® Express optimize query processing to deliver high-speed results with minimal resource usage?
Uber’s Presto® Express tackles the challenge of short-running queries by introducing express clusters and dynamic routing, slashing end-to-end SLAs for over 75% of queries. This is an inspiring read for anyone passionate about optimizing large-scale data systems.
https://www.uber.com/en-IN/blog/presto-express/
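Since the post centers on routing short queries to dedicated clusters, here is a hypothetical Python sketch of that idea (not Uber's implementation): classify a query as short-running from cheap pre-execution signals and send it to an express pool so it never queues behind heavy queries. The QueryStats fields and thresholds are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class QueryStats:
    """Cheap, pre-execution signals about a query (hypothetical)."""
    estimated_scan_bytes: int
    joins: int

# Illustrative thresholds for what counts as a "short-running" query.
EXPRESS_SCAN_LIMIT = 10**9  # ~1 GB
EXPRESS_JOIN_LIMIT = 2

def route(query: QueryStats) -> str:
    """Pick the cluster pool a query should run on."""
    is_short = (query.estimated_scan_bytes <= EXPRESS_SCAN_LIMIT
                and query.joins <= EXPRESS_JOIN_LIMIT)
    return "express-pool" if is_short else "general-pool"

print(route(QueryStats(estimated_scan_bytes=200_000_000, joins=1)))  # express-pool
print(route(QueryStats(estimated_scan_bytes=5 * 10**10, joins=4)))   # general-pool
```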
4. Dynamic Data Pipelines with Airflow Datasets and Pub/Sub
How can Airflow Datasets and Pub/Sub be used together to build dynamic and efficient data pipelines?
This article shows that data pipelines don’t have to be rigid and inefficient. By combining Airflow Datasets with Google Cloud Pub/Sub, you can build dynamic, event-driven workflows that trigger tasks in real time based on data availability. This blog is a must-read to learn how to simplify dependencies, enhance scalability, and create smarter data pipelines for complex environments.
https://medium.astrafy.io/dynamic-data-pipelines-with-airflow-datasets-and-pub-sub-d91c81d75f51
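As a minimal sketch of the pattern (assuming Airflow 2.6+ with the Google provider installed; the project, subscription, and Dataset URI are placeholders): a listener DAG blocks on PubSubPullSensor and, when a message arrives, completes a task whose outlets emit a Dataset event, which in turn schedules the consumer DAG.

```python
import pendulum
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor

raw_events = Dataset("gs://my-bucket/raw_events")  # placeholder URI

with DAG(
    dag_id="pubsub_listener",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@continuous",  # requires Airflow 2.6+; one always-on run
    max_active_runs=1,
    catchup=False,
):
    wait_for_message = PubSubPullSensor(
        task_id="wait_for_message",
        project_id="my-project",        # placeholder
        subscription="raw-events-sub",  # placeholder
        ack_messages=True,
    )
    # Completing a task with this outlet emits a Dataset event,
    # which triggers every DAG scheduled on the Dataset.
    emit_dataset_event = EmptyOperator(
        task_id="emit_dataset_event",
        outlets=[raw_events],
    )
    wait_for_message >> emit_dataset_event

with DAG(
    dag_id="process_raw_events",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=[raw_events],  # runs whenever the Dataset is updated
    catchup=False,
):
    EmptyOperator(task_id="transform")
```

Scheduling the consumer on the Dataset instead of a cron means it runs exactly when new data lands, which is the dependency simplification the article highlights.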
5. Right-sizing Spark executor memory
How can you effectively right-size Spark executor memory to optimize performance and resource usage?
This article shows that tuning Spark configurations doesn’t have to be manual and inefficient. At LinkedIn’s scale, a data-driven approach to right-sizing Spark executor memory reduced failures by 90% and improved memory utilization by 13%. This blog dives into how automation and heuristics can transform resource efficiency and productivity in large-scale data pipelines.
https://www.linkedin.com/blog/engineering/infrastructure/right-sizing-spark-executor-memory
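As a hedged illustration of the data-driven idea (a simplified sketch, not LinkedIn's internal tooling): derive the executor memory setting from the peak usage observed across recent runs plus a safety headroom, instead of a hand-picked static default. The 20% headroom, floor, and ceiling are assumptions.

```python
def recommend_executor_memory_mb(peak_usage_mb: list[int],
                                 headroom: float = 0.20,
                                 floor_mb: int = 1024,
                                 ceiling_mb: int = 16384) -> int:
    """Suggest spark.executor.memory from observed peak usage."""
    observed_peak = max(peak_usage_mb)
    # Add headroom so occasional spikes don't cause OOM failures,
    # then clamp to sane bounds.
    suggestion = int(observed_peak * (1 + headroom))
    return min(max(suggestion, floor_mb), ceiling_mb)

# Runs that peaked around 6 GB get ~7.2 GB instead of an oversized
# hand-picked default like 16 GB.
peaks = [5800, 6100, 5900]
print(f"spark.executor.memory={recommend_executor_memory_mb(peaks)}m")
# spark.executor.memory=7320m
```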
6. From dbt to SQLMesh
How does SQLMesh redefine data transformation workflows, and what lessons can we learn from its evolution beyond dbt?
In this article, Alex Butler discusses how SQLMesh redefined his team’s approach to data transformation by simplifying workflows, reducing dependencies, and offering state-aware management. How does SQLMesh’s philosophy of clarity and determinism improve on dbt’s approach? Dive into this detailed journey of transitioning from dbt to SQLMesh, uncovering lessons on scaling, reducing inefficiencies, and embracing the future of data engineering.
https://www.harness.io/blog/from-dbt-to-sqlmesh
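To ground the "state-aware" idea, here is a deliberately simplified Python sketch (not SQLMesh's actual API; all names are hypothetical): fingerprint each model's definition, persist the fingerprints between runs, and rebuild only models whose definitions changed, rather than re-running everything on every invocation.

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("model_state.json")  # hypothetical local state store

def fingerprint(sql: str) -> str:
    """Hash a model definition so changes are detectable."""
    return hashlib.sha256(sql.encode()).hexdigest()

def plan(models: dict[str, str]) -> list[str]:
    """Return the models whose definitions changed since the last run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    stale = [name for name, sql in models.items()
             if state.get(name) != fingerprint(sql)]
    # Persist the new fingerprints for the next invocation.
    STATE_FILE.write_text(json.dumps(
        {name: fingerprint(sql) for name, sql in models.items()}))
    return stale

models = {
    "stg_orders": "select * from raw.orders",
    "fct_revenue": "select sum(amount) as revenue from stg_orders",
}
print(plan(models))  # first run: both models; unchanged re-runs: []
```

A real tool also cascades rebuilds to downstream dependents and tracks this state centrally; the point here is only the contrast with a stateless run-everything approach.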
All rights reserved, Den Digital, India. Links are provided for informational purposes only and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of any current, former, or future employer.