Switching to Apache Iceberg for Better Data Management (Explained Like I’m Talking to My 5-Year-Old Daughter)
Imagine you're a 5-year-old with a massive toy collection that keeps growing every day. You have toys everywhere, and it takes you over an hour just to find the specific toy you want to play with. That's exactly what was happening to LY Corporation, one of Japan's biggest tech companies!
What is LY Corporation?
LY Corporation is like a giant digital playground. Think of it as the company that runs:
LINE - the messaging app that almost everyone in Japan uses (like WhatsApp, but far more popular there)
Yahoo! JAPAN - the website where people search for things and read news
PayPay - the app people use to pay for things with their phones
They have over 320 million users across Asia and handle more data than you can imagine - we're talking about 1.1 exabytes, which is roughly 1.1 billion gigabytes. That's like having 1.1 billion huge toy boxes full of information!
How big is their data?
To understand how massive LY Corporation's data is, let's use simple numbers:
They have over 100,000 tables (think of each table as a different type of toy box)
Every single day, they process 1.2 trillion records (that's a 1.2 followed by 12 zeros!)
Their data is growing really, really fast because millions of people use their apps every day
What was their old data system like?
Before Apache Iceberg, LY Corporation was using something called Hive (think of it as their old toy organization system). Here's how it worked:
The old way (Hive System)
Data collection: People using LINE, Yahoo, and PayPay would create data (like sending messages, searching, making payments)
Data journey: This data would travel through something called Kafka (like a conveyor belt) to Flink (a processing machine) and finally land in HDFS storage (the toy warehouse)
Format conversion: The data would start as Avro and Protocol Buffers files and then get converted to the ORC format every hour
The big wait: It took 1 to 1.5 hours before anyone could actually use this data
The problems they faced
Remember when your room was so messy you couldn't find your favorite toy car under a pile of socks and Legos? LY Corporation had the same problem, just with, you know, terabytes of data.
1. Super slow data access
Getting fresh data took them over an hour. Yep, an hour.
In a world where modern businesses need data in minutes, not “go-make-a-coffee-and-come-back” hours, this was a big problem. Their machine learning models and real-time services were basically stuck waiting around, tapping their feet.
2. The tiny files chaos
Imagine having a million tiny toy boxes instead of a few big, organized ones. LY Corp’s system (HDFS) was drowning in these tiny files, each demanding attention.
And the “brain” of their system (the HDFS NameNode) was getting so overloaded trying to keep track of them all that it was running out of memory, like your laptop when you have 47 Chrome tabs open.
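To see why tiny files overwhelm the NameNode, here's a back-of-the-envelope sketch. The ~150 bytes-per-namespace-object figure is a commonly cited Hadoop rule of thumb, and the file counts are made up for illustration, not LY Corporation's real numbers:

```python
# Back-of-the-envelope: why millions of tiny files strain the HDFS NameNode.
# ~150 bytes of NameNode heap per namespace object (file or block) is a
# common Hadoop rule of thumb; the file counts below are illustrative.

BYTES_PER_NAMESPACE_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Approximate NameNode heap needed to track this many files."""
    objects = num_files * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_NAMESPACE_OBJECT / 1e9

# 100 million tiny files (1 block each) vs. 1 million well-compacted files:
print(f"tiny files:      {namenode_heap_gb(100_000_000):.1f} GB of heap")
print(f"compacted files: {namenode_heap_gb(1_000_000, blocks_per_file=8):.2f} GB of heap")
```

Same data, wildly different metadata cost: fewer, bigger files keep the "brain" from running out of memory.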
3. Deleting data was a nightmare
When someone requested their personal data be deleted (think GDPR and privacy laws), it wasn’t just a quick “delete” button. Nope.
They had to rewrite huge files just to remove a single person’s data. Imagine having to rewrite your entire diary just to erase one sentence. Painful.
4. Schema changes were a disaster
Adding or removing data fields? It was like having to reorganize your entire Lego collection every time you bought a new set.
Why did they choose Apache Iceberg?
Apache Iceberg is like a super-smart toy organization system. Here's why LY Corporation fell in love with it:
1. Delete specific things easily
Instead of emptying entire toy boxes to remove one toy, Iceberg can point to exactly what needs to be deleted
Perfect for when people ask to have their personal data removed (GDPR compliance)
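Here's a toy model of the difference. This is not Iceberg's real file format (actual Iceberg writes positional or equality delete files that readers merge at query time), but the shape of the idea is the same:

```python
# Toy contrast between the two delete strategies. Real Iceberg writes
# positional/equality delete files merged at read time; this just models
# the bookkeeping difference, with made-up rows.

def hive_style_delete(data_file: list[dict], user_id: str) -> tuple[list[dict], int]:
    """Old way: rewrite the whole file minus the deleted user's rows."""
    kept = [row for row in data_file if row["user"] != user_id]
    return kept, len(kept)  # every surviving row gets rewritten

def iceberg_style_delete(data_file: list[dict], user_id: str) -> list[int]:
    """Iceberg way: record only the row positions to skip (a 'delete file')."""
    return [i for i, row in enumerate(data_file) if row["user"] == user_id]

def read_with_deletes(data_file: list[dict], delete_positions: list[int]) -> list[dict]:
    """Readers apply the delete file on the fly (merge-on-read)."""
    skip = set(delete_positions)
    return [row for i, row in enumerate(data_file) if i not in skip]

table = [{"user": f"u{i}", "msg": "hi"} for i in range(1_000)]
kept, rewritten = hive_style_delete(table, "u42")   # 999 rows rewritten
deletes = iceberg_style_delete(table, "u42")        # 1 position recorded
assert read_with_deletes(table, deletes) == kept    # same logical result
```

Deleting one user costs one tiny delete record instead of rewriting everything else, which is exactly why GDPR-style requests became cheap.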
2. Make data available super fast
Reduced their data delay from 1.5 hours to just 5 minutes
That's 12-18 times faster!
3. Easy changes without big mess
Adding new data fields is like adding new compartments to your toy box without emptying everything out
Schema evolution made it simple to adapt as their services grew
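Under the hood, this works because Iceberg tracks columns by stable field IDs rather than by name or position, so old data files never need rewriting. A tiny sketch of the idea, with invented names and IDs:

```python
# Sketch of Iceberg-style schema evolution: data files store values keyed by
# stable field IDs, and the current schema maps IDs to names. Renames and
# added columns only touch metadata. All names/IDs here are invented.

old_schema = {1: "user_id", 2: "amount"}                   # field_id -> name
new_schema = {1: "user_id", 2: "amount_yen", 3: "coupon"}  # rename + new column

old_file_row = {1: "u42", 2: 500}  # written before the schema change

def read_row(row_by_id: dict, schema: dict) -> dict:
    """Project an old data file through the current schema; columns added
    after the file was written simply read as None - no rewrite needed."""
    return {name: row_by_id.get(fid) for fid, name in schema.items()}

print(read_row(old_file_row, new_schema))
# -> {'user_id': 'u42', 'amount_yen': 500, 'coupon': None}
```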
4. Better organization
Iceberg manages metadata (information about information) much more efficiently
Reduces pressure on their central database (Hive Metastore)
Think of Apache Iceberg as having a magical catalog (Iceberg’s manifest and metadata approach) that knows exactly where every piece of data is stored and can retrieve it when needed.
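That "magical catalog" can be sketched in a few lines: Iceberg's manifests keep per-file column statistics (such as min/max values), so a query can skip files without ever opening them. The manifest contents below are invented for illustration:

```python
# Toy version of manifest-based file pruning: metadata records each file's
# min/max for a column, so queries skip non-matching files entirely.
# File names and stats are made up.

manifest = [
    {"path": "data/f1.orc", "ts_min": 0,   "ts_max": 99},
    {"path": "data/f2.orc", "ts_min": 100, "ts_max": 199},
    {"path": "data/f3.orc", "ts_min": 200, "ts_max": 299},
]

def files_for_range(manifest: list[dict], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] range overlaps the query range."""
    return [m["path"] for m in manifest if m["ts_max"] >= lo and m["ts_min"] <= hi]

# A query for ts in [150, 180] touches one file instead of all three:
print(files_for_range(manifest, 150, 180))  # -> ['data/f2.orc']
```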
How did they actually make the switch?
The migration strategy
LY Corporation was smart - they didn't try to change everything at once! Here's what they did:
Gradual migration approach
Started with new tables that needed Iceberg's special features
Migrated 8,000+ tables to Iceberg by 2024
Still have 90,000+ tables on the old Hive system
Only migrate tables that actually benefit from Iceberg's features
The new technical setup
New data pipeline:
Kafka (data conveyor belt) → Flink (processing) → Apache Iceberg (smart storage)
Data gets flushed every 5 minutes instead of every hour
Files are stored in ORC format (efficient storage format)
Everything runs on Kubernetes (modern container platform)
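The freshness gain follows from simple arithmetic about flush intervals. A sketch, where the 30-minute ORC conversion time is an assumption chosen to match the roughly 1.5-hour worst case described earlier:

```python
# Worst-case data freshness in a periodic-flush pipeline: a record written
# just after a flush waits a full interval, plus downstream processing,
# before it is queryable. The 30-minute conversion time is an assumption
# picked to match the ~1.5-hour worst case reported for the old setup.

def worst_case_delay_min(flush_interval_min: int, post_processing_min: int) -> int:
    return flush_interval_min + post_processing_min

old = worst_case_delay_min(60, 30)  # hourly flush + assumed ORC conversion
new = worst_case_delay_min(5, 0)    # 5-minute Iceberg commits, queryable on commit
print(f"old: up to {old} min, new: up to {new} min ({old // new}x faster)")
```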
What the table optimizer does
Think of LY’s Table Optimizer as an automated janitor for Iceberg tables, keeping them fast, tidy, and cost-efficient without manual babysitting.
Coordinator
Acts like a manager, watching all tables to see which need maintenance. It keeps a smart queue, prioritizing based on file size, file count, or table health.
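A minimal sketch of such a prioritized queue, using Python's heapq. The urgency score and table stats below are invented for illustration; LY hasn't published the real heuristics in this detail:

```python
import heapq

# Toy coordinator queue: score each table's "unhealthiness" and pop the
# worst first. The scoring formula and table stats are invented.

def urgency(table: dict) -> float:
    """Higher score = more urgent: lots of tiny files hurt queries the most."""
    avg_file_mb = table["total_mb"] / table["file_count"]
    return table["file_count"] / max(avg_file_mb, 1.0)

def build_queue(tables: list[dict]) -> list:
    # heapq is a min-heap, so negate the score to pop the most urgent first
    heap = [(-urgency(t), t["name"]) for t in tables]
    heapq.heapify(heap)
    return heap

tables = [
    {"name": "payments", "file_count": 50_000, "total_mb": 100_000},  # tiny files
    {"name": "profiles", "file_count": 200,    "total_mb": 100_000},  # healthy
]
queue = build_queue(tables)
print(heapq.heappop(queue)[1])  # the unhealthiest table comes out first
```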
Workers
Workers are the system’s hands: they execute tasks from the coordinator, such as merging files and cleaning up data, all in parallel for speed and efficiency.
Automatic compaction
Automatic compaction merges many small files into larger ones, using smart strategies like bin-packing or sorting, to speed up queries and reduce storage waste.
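A toy bin-pack pass might look like this. Real Iceberg compaction (for example, Spark's rewrite_data_files procedure with the bin-pack strategy) is far more sophisticated; the file sizes here are invented:

```python
# Toy bin-pack compaction: greedily group small files into ~target-size
# output files, largest first. Sizes and the 512 MB target are illustrative.

TARGET_MB = 512

def bin_pack(file_sizes_mb: list[int], target: int = TARGET_MB) -> list[list[int]]:
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target:
            groups.append(current)          # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

small_files = [8] * 100 + [64, 64, 200]     # many tiny files plus a few bigger
groups = bin_pack(small_files)
print(f"{len(small_files)} files -> {len(groups)} compacted files")
```

A hundred-plus tiny files collapse into a handful of right-sized ones, which is exactly what keeps queries (and the NameNode) happy.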
Snapshot management
Iceberg’s time travel creates snapshots with every change. The optimizer prunes old snapshots and cleans up orphan files to save space and keep metadata lean.
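A simplified sketch of snapshot expiry, mirroring the idea behind Iceberg's expire_snapshots maintenance procedure (the snapshot structures are invented):

```python
# Toy snapshot expiry: keep the N most recent snapshots, then find data
# files that no surviving snapshot still references - those are safe to
# physically delete. Simplified structures, invented file names.

def expire_snapshots(snapshots: list[dict], keep_last: int):
    """Snapshots are ordered oldest -> newest; each lists its data files."""
    kept = snapshots[-keep_last:]
    expired = snapshots[:-keep_last]
    still_referenced = {f for snap in kept for f in snap["files"]}
    removable = sorted({f for snap in expired for f in snap["files"]
                        if f not in still_referenced})
    return kept, removable

snaps = [
    {"id": 1, "files": ["a.orc", "b.orc"]},
    {"id": 2, "files": ["b.orc", "c.orc"]},  # a.orc was compacted away
    {"id": 3, "files": ["c.orc", "d.orc"]},
]
kept, removable = expire_snapshots(snaps, keep_last=2)
print(removable)  # files no surviving snapshot needs
```

Note that b.orc stays even though snapshot 1 expired, because snapshot 2 still references it: that's the "orphan check" that makes cleanup safe.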
Spark jobs
File rewrites and data merges are handled by distributed Spark jobs, scheduled and assigned by the coordinator for efficient processing at scale.
How it’s built
Microservices on Kubernetes: Fully distributed, scalable, and resilient, with coordinators and workers scaled independently.
Automated scheduling: Maintenance runs on autopilot, prioritizing urgent tasks and balancing workload.
Monitoring & health checks: Continuously tracks table health, file counts, and performance, alerting or auto-fixing as needed.
Conflict prevention: Prevents overlapping maintenance and user operations on the same table, ensuring safe, conflict-free optimizations.
In short:
The Table Optimizer keeps LY’s Iceberg tables clean and fast, automatically handling compaction, cleanup, and snapshot pruning, so the team can focus on building, not babysitting tables.
What challenges did they face?
Even with a great plan, moving this much data wasn't easy:
Migration challenges
1. Deciding what to migrate
With 100,000+ tables, they had to carefully choose which ones would benefit from Iceberg
Tables that were small or didn't need Iceberg's features could stay on Hive
2. User training
Data engineers and analysts needed to learn new tools and concepts
Not everyone understood Iceberg's advanced features initially
3. Data copying between systems
Their old backup strategy (DistCP) needed to be redesigned for Iceberg
Moving data between different storage systems became more complex
Ongoing challenges
Even with their big clean-up, LY Corp still has a few chores to keep things running smoothly.
1. Table housekeeping
Iceberg tables don’t magically stay fast and efficient. They need regular “housekeeping” to keep them tidy and optimized, kind of like sweeping the floor before the dust bunnies take over.
Sure, their Table Optimizer helps a lot, but it’s still an ongoing chore they can’t ignore.
2. File size juggling
Too many tiny files? Bad for performance. Super large files? Harder to manage.
It’s a balancing act, like pouring cereal without spilling. Automatic compaction steps in to help, merging those tiny files into more manageable sizes to keep everything humming along.
What goals did they achieve?
LY Corporation hit some amazing targets:
Performance improvements
1. Massive speed boost
Data latency: From 1.5 hours to 5 minutes (18x faster)
Query performance: Much faster due to better file organization
Real-time capabilities: Better support for online machine learning
2. Operational efficiency
Automated maintenance: Less manual work for data engineers
Better resource utilization: More efficient use of computing power
Reduced complexity: Simpler data pipeline management
Business benefits
1. Compliance made easy
GDPR compliance: Can efficiently delete user data when requested
Data governance: Better control over data lifecycle
Audit capabilities: Clear tracking of data changes
2. Business agility
Faster insights: Data available for analysis much sooner
Better ML/AI: More timely data for machine learning models
Competitive advantage: Faster response to business needs
Cost savings and benefits
1. Direct cost reductions
Efficient file organization and compression lower compute and storage costs.
2. Operational savings
Automation reduces manual maintenance and pipeline failures, saving engineering time.
3. Infrastructure efficiency
Makes better use of their existing hardware while easing pressure on the Hive Metastore and HDFS.
Indirect benefits
1. Faster time-to-market
Teams get fresh data faster, enabling quicker feature rollouts and business decisions.
2. Improved user experience
Real-time features and personalization improve with fresher data and faster response to user behavior.
What's special about their implementation?
Innovation: the table optimizer
LY Corporation didn't just use Iceberg - they innovated on top of it. Their Table Optimizer is like having a smart assistant that:
Monitors thousands of tables automatically
Predicts when maintenance is needed
Runs optimization jobs without human intervention
Prevents conflicts between different operations
Provides detailed metrics and monitoring
Hybrid approach
They took a smart approach by not migrating everything at once. Stable, simple tables stayed on Hive (no need to fix what isn’t broken), while complex, dynamic tables moved to Iceberg, focusing effort where Iceberg’s features would bring the most benefit.
Why this matters for other companies
LY Corporation's journey shows that:
1. Scale matters: At their data volumes, small improvements have huge impacts
2. Gradual migration works: You don't need to change everything at once
3. Automation is key: Building tools like Table Optimizer pays off
4. Compliance is critical: Modern data platforms must support privacy regulations
The Future
LY Corporation is continuing to:
Expand Iceberg adoption to more suitable tables
Improve their Table Optimizer with more intelligence
Explore new Iceberg features like branching and tagging
Share their learnings with the broader Apache Iceberg community
What makes this special
This isn't just a story about switching technology - it's about transforming how a massive company handles data. LY Corporation proved that even with 1.1 exabytes of data and 100,000+ tables, you can make smart, gradual changes that deliver huge benefits.
They turned their data platform from a slow, rigid system into a fast, flexible, compliance-ready powerhouse that can adapt to whatever the future brings. And the best part? They did it without disrupting their services to 320 million users!
Concluding thoughts
Sometimes the biggest improvements come from making your data smarter, not just bigger. LY Corporation's Apache Iceberg journey is a perfect example of how the right technology choices can transform not just your data platform, but your entire business capability.
Got a project or idea?
Drop me a line. If it’s interesting, I’m in. If it’s weird, even better.