Optimizing costs of a Data Lakehouse

How we cut the storage costs of our data lakehouse for Xendit Philippines by 78%.

David Lexa
Xendit Engineering

--

Saving data infra costs

Technology around data lakes has evolved immensely in the last few years in terms of performance, governance, security, and support for features like data streaming. Removing some of the original blockers made data lakes competitive with traditional data warehouses while keeping several major benefits. One appealing benefit has always been low storage costs. As someone said before, operating a database is mainly a cost optimization problem. This became even more relevant in the current macroeconomic situation (Q2 2023), when many organizations started to look more closely at their costs and optimize them. But what do you do when cheap storage turns out to be not so cheap anymore? Let’s dig into how our S3 storage costs for Xendit Philippines started to grow rapidly and how we optimized them.

Data lakehouse use-cases

Xendit’s Data engineering team established a Data lakehouse to support the rest of the organization in getting insights into business operations, making data-driven decisions, and building scalable data products. It is a central element of our Data platform.

Let’s look at our setup for processing and storing data in our Data lakehouse before the optimization. Data usually goes through the following stages.

  1. Extract data from various sources and persist it in a raw Delta table (see the sketch after this list).
  2. Process data from the raw table into a clean Delta table using dbt.
  3. Construct transform tables using dbt.
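
As an illustration of the first stage, here is a minimal PySpark sketch of landing extracted data as a raw Delta table on S3. The source connection, bucket, and table paths are hypothetical, and it assumes a cluster with the Delta Lake package and the relevant JDBC driver available; our actual extraction jobs and the dbt models downstream are more involved.

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package (delta-spark) is available on the cluster.
spark = (
    SparkSession.builder
    .appName("raw-ingestion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical source: a JDBC extract of a single table (driver must be on the classpath).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/payments")  # placeholder connection
    .option("dbtable", "public.transactions")
    .load()
)

# Stage 1: append the extract into the raw Delta table on S3.
(
    source_df.write.format("delta")
    .mode("append")
    .save("s3://data-lake-raw/transactions")  # hypothetical bucket/path
)
```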

To ensure minimal disruption for stakeholders, we enabled S3 bucket versioning for fast recovery in case of issues. Aside from that, we also leveraged the time-travel feature in Delta and kept previous table versions for the last 7 days.

Data processing & storage
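
The 7-day window maps naturally onto Delta table properties and time-travel reads. Below is a minimal PySpark sketch of that setup; it assumes a Spark session already configured with the Delta Lake extensions, the table path and version number are hypothetical, and the exact retention values depend on how often VACUUM runs.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake extensions
# (see the earlier ingestion sketch); the table path is hypothetical.
spark = SparkSession.builder.getOrCreate()
table_path = "s3://data-lake-clean/transactions"

# Keep transaction log entries and deleted data files for 7 days so that
# time travel within that window keeps working (VACUUM respects these).
spark.sql(f"""
    ALTER TABLE delta.`{table_path}`
    SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# Inspect the version history of the table.
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)

# Read the table as it looked a few versions ago.
old_df = (
    spark.read.format("delta")
    .option("versionAsOf", 3)  # hypothetical version number
    .load(table_path)
)
```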

Analysis

We used AWS Cost Explorer and the S3 Storage Lens dashboard to analyze the S3 costs. Here are some of the findings we identified:

  • Overall, Data engineering spending on S3 had almost tripled within the last 12 months.
  • The team’s S3 costs ranked as the second biggest item in the AWS bill after EC2-related costs.
  • Most data (~87%) was stored in the data lake buckets. Within those buckets, ~76% of the data had accumulated in non-current object versions.
  • S3 Standard was the primary storage class used across our data, regardless of access patterns.

These findings confirmed that S3 storage was worth optimizing and helped us narrow down specific areas where to cut costs. During the review, we realized that we had never revisited our storage strategy after adopting the Delta format back in 2021, and that we had overlapping processes for handling data recovery. This resulted in substantial cost inefficiency.
One gap we recognized during the analysis was the inability to drill into the costs of the individual S3 folders used for organizing data. It wasn’t crucial for acting on the items above, but it is something to improve in the next round of optimization.
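
As a starting point for that follow-up, a rough boto3 sketch like the one below can break a bucket’s usage down by top-level prefix and by current versus non-current versions. The bucket name is hypothetical, and for large buckets an S3 Inventory report is usually a cheaper source than listing every object version.

```python
import boto3
from collections import defaultdict

# Rough sketch: size per top-level prefix, split into current vs. non-current versions.
# Listing every object version can be slow and costly on large buckets; S3 Inventory
# is usually the better source at scale. The bucket name is hypothetical.
s3 = boto3.client("s3")
bucket = "data-lake-clean"

sizes = defaultdict(lambda: {"current": 0, "noncurrent": 0})
paginator = s3.get_paginator("list_object_versions")

for page in paginator.paginate(Bucket=bucket):
    for version in page.get("Versions", []):
        prefix = version["Key"].split("/", 1)[0]
        state = "current" if version["IsLatest"] else "noncurrent"
        sizes[prefix][state] += version["Size"]

for prefix, totals in sorted(sizes.items()):
    current_gb = totals["current"] / 1024**3
    noncurrent_gb = totals["noncurrent"] / 1024**3
    print(f"{prefix}: current={current_gb:.1f} GiB, non-current={noncurrent_gb:.1f} GiB")
```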

Optimization

We came up with the following improvements:

  1. We disabled bucket versioning and switched to using Delta table history to continue supporting ad-hoc troubleshooting of data issues. This approach turned out to be more useful than keeping a full copy of the S3 objects, partly because Delta table versions are more cost- and storage-efficient: they persist only the records that have been modified instead of full data snapshots. Additionally, you can switch between table versions using SQL or even compare them against each other.
  2. We revisited the need for backups on top of the table history and decided to keep them in a separate bucket, as a safeguard against accidental object deletions or cases when a bucket becomes unavailable. We added an S3 lifecycle rule to ensure that any version older than 7 days is automatically wiped out (see the sketch after this list).
  3. The storage tier for data backups was switched from S3 Standard to S3 Glacier Instant Retrieval. This data is rarely touched but valuable for quick recovery without having to fully reload data from the source database. The rest of the data lake was moved to S3 Intelligent-Tiering to leverage automatic tier assignment based on access patterns. We verified that this change doesn’t impact the overall execution time of Trino and Spark queries.
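
For reference, lifecycle rules along the lines of items 2 and 3 can be sketched with boto3 roughly as below. The bucket names are hypothetical, and the call replaces any existing lifecycle configuration on the bucket, so in practice such rules are usually managed through infrastructure-as-code rather than ad-hoc scripts.

```python
import boto3

s3 = boto3.client("s3")

# Sketch: on the backup bucket, expire non-current versions after 7 days and
# move objects to Glacier Instant Retrieval. Bucket names are hypothetical,
# and this call overwrites the bucket's existing lifecycle configuration.
s3.put_bucket_lifecycle_configuration(
    Bucket="data-lake-backup",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-after-7-days",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
            },
            {
                "ID": "backup-to-glacier-instant-retrieval",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER_IR"}],
            },
        ]
    },
)

# Sketch: move the rest of the data lake to Intelligent-Tiering.
s3.put_bucket_lifecycle_configuration(
    Bucket="data-lake-clean",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-intelligent-tiering",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```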

Results

We wanted to track the impact of the individual optimizations and share it with the rest of the engineering community. Altogether, we were able to reduce S3 costs by 78%. Two main changes contributed to the result:

  • Replacing S3 bucket versioning with Delta table history.
  • Using S3 Glacier Instant Retrieval for data backups.

Impact of individual optimizations

We haven’t detected a significant cost reduction after moving data to Intelligent-Tiering.

We are also interested in a more granular analysis to assign costs per data owner, table type (raw, clean, transform, backup), database, or individual table. We learned that the cost of S3 tagging can grow rapidly with a high number of S3 objects and the number of tags you assign to each one ($0.01 per 10,000 tags per month). We plan to research alternative ways to get additional insights, including the use of Storage Class Analysis.
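
For a rough sense of scale: at that price, tagging 100 million objects with four tags each would already add about $400 per month, before counting the requests needed to apply the tags.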

Summary

Data engineering teams should formalize their strategy for storing data and revise it regularly in order to achieve optimal performance and cost. This exercise includes making decisions on how you treat data backups, data version control, and cold data, what your default partitioning strategy is, and more. The reward for doing so can be significant, as our use-case demonstrated (a total cost reduction of 78%).
