Oz Katz
November 4, 2022

Over the past year, words like recession, business slowdown, and budget cuts have been heard more and more often. These discussions are not confined to the economic press and the media: they take place in almost every company, in boardrooms, in management meetings, and in conversations with potential investors and customers.

As we sail our technological ships through these troubled waters, cost savings and financial discipline become ever more necessary, even in organizations showing stable profitability and growth, and even in those that have secured significant capital for the near future.

Reducing cloud storage costs at times of recession

Over the past decade, public cloud costs have grown to become one of the biggest line items in organizations' budgets. Reports forecast that cloud spending will reach nearly half of organizations' technology budgets within the next year. Storing data in the cloud is considered cheaper than other items on the monthly cloud bill, such as compute and networking. However, this expense is also growing as a share of the total, and reducing it should be part of any overall reduction of cloud spending.

Reducing data storage costs is possible

As organizations become increasingly reliant on their data, it may seem that this line in the cloud budget is doomed to grow inevitably. Data volumes will most certainly grow, but the good news is that there are effective ways to control the cost of storing that data. Cloud providers do their best to help organizations tailor their cloud architecture to be more cost effective. To reduce spending on data storage specifically, however, you need to architect solutions tailored to the unique characteristics of data storage, rather than settling for generic cloud cost-reduction techniques.

4 methods you should adopt today to reduce your cloud data storage costs

In this article, we will outline four proven methods for reducing data storage costs, based on an understanding of how data storage in the cloud is architected. These techniques are fairly simple to adopt and implement, and will show clear ROI in reduced cloud data storage costs within a few months.

1. Reduce data storage costs by using tiered storage 

Storage tiering is available from all major cloud providers and is fairly simple to use. With tiered storage, older data that is accessed less frequently can be stored in slower, less expensive storage classes. Typically, only around 20% of an organization's data is accessed frequently, so this can lead to substantial savings. All major cloud providers support lifecycle management, which lets users define rules for how and when data objects move between the different tiers, or are deleted entirely. Here are some of the offerings of the leading cloud providers:

Optimizing costs with data lifecycle management in Azure

Intelligent tiering in AWS S3 to reduce data storage costs

Object lifecycle management for cost reduction in GCP
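As a sketch of what such a rule looks like on AWS, here is an example using boto3's `put_bucket_lifecycle_configuration`. The bucket name, prefix, day thresholds, and storage classes are illustrative assumptions; tune them to your actual access patterns before applying anything.

```python
# Sketch: an S3 lifecycle policy that tiers objects by age, then expires them.
# All thresholds below are illustrative, not recommendations.

def build_lifecycle_rules(prefix: str = "") -> dict:
    """Build a lifecycle configuration in the shape S3 expects."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # Rarely-read data moves to cheaper tiers as it ages.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Delete objects nobody has needed for a year.
                "Expiration": {"Days": 365},
            }
        ]
    }


def apply_lifecycle(bucket: str, prefix: str = "") -> None:
    import boto3  # deferred: only needed when actually applying the policy

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_rules(prefix),
    )
```

Once applied, S3 evaluates the rules daily and moves or deletes objects automatically, with no further engineering effort.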

2. Use proper file sizes, formats and compression to reduce cloud data storage costs

Especially for analytics and tabular data, optimizing how data is stored can make a huge difference in storage costs. Storing large (>100 MB) objects in columnar formats such as ORC or Parquet allows for much more efficient compression. The same data, stored as many small JSON objects (even compressed ones!), can be 2x-5x larger and correspondingly more expensive.

For OLAP use cases, this optimization also yields significant compute savings, since well-organized data makes queries faster and more efficient. Choosing appropriate file formats and compression codecs will prove itself in reducing both storage costs and compute costs.
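The effect of object size on compression can be demonstrated with the standard library alone: compressing many tiny JSON objects one by one leaves the compressor almost no redundancy to exploit, while batching the same records into one large object exposes all of it. Columnar formats like Parquet go further with dictionary and run-length encoding, but the batching effect by itself is visible in this small sketch:

```python
# Compare storing 50,000 records as individually gzipped JSON objects
# versus one large gzipped object containing the same records.
import gzip
import json

records = [
    {"user_id": i % 500, "event": "page_view", "country": "US"}
    for i in range(50_000)
]

# One gzip stream per record: a lake full of tiny compressed JSON objects.
small_objects = sum(
    len(gzip.compress(json.dumps(r).encode())) for r in records
)

# All records batched into a single large object, compressed once.
one_large_object = len(gzip.compress(
    "\n".join(json.dumps(r) for r in records).encode()
))

print(f"50k tiny gzipped objects: {small_objects:,} bytes")
print(f"one large gzipped object: {one_large_object:,} bytes")
```

On repetitive data like this, the batched object comes out many times smaller than the sum of the tiny ones, before any columnar encoding is even applied.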

3. Reduce data storage costs by creating development and testing environments through branching

Building and maintaining multiple ETLs requires organizations to develop and test new and existing pipelines on a regular basis. To test these pipelines properly against the entire data lake, many data engineers create a full copy of the lake and test against it. Done regularly, this practice multiplies data storage usage, and that duplication is completely unnecessary.

The solution is a data versioning tool that provides branching capabilities, enabling the creation of development and testing environments without copying the data itself. This can significantly reduce data storage costs and increase data engineering efficiency, by eliminating the need to copy, maintain, and eventually delete these clones.

One such solution is lakeFS, an open-source project that provides Git-like operations on data lakes. Using lakeFS, data engineering teams can instantly create a development and testing environment without copying anything: lakeFS operates on metadata (pointers to the data) and does not create copies of the data itself.


By creating an isolated branch of the entire data lake, without copying it, data engineers can safely develop, test, and maintain high-quality, well-tested ETLs with no impact on storage costs, reducing data storage spend by up to 80%.
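As an illustration, here is a minimal sketch using the lakeFS Python SDK (the `lakefs` package). The repository and branch names are hypothetical, and the SDK calls are shown as I understand the high-level client, so verify them against the version you install:

```python
# Sketch: a zero-copy dev/test environment via a lakeFS branch.
# "example-repo" and "etl-test" are hypothetical names; a running lakeFS
# server and configured credentials are assumed when create_dev_branch()
# is actually called.

REPO = "example-repo"
DEV_BRANCH = "etl-test"


def branch_uri(repo: str, branch: str) -> str:
    """lakeFS path that pipelines under test should read and write."""
    return f"lakefs://{repo}/{branch}/"


def create_dev_branch():
    import lakefs  # deferred: the SDK is only needed when branching for real

    repo = lakefs.repository(REPO)
    # Branching copies metadata (pointers to objects), not the objects
    # themselves, so this is instant and adds no storage cost regardless
    # of the lake's size.
    return repo.branch(DEV_BRANCH).create(source_reference="main", exist_ok=True)
```

Pipelines under test then read and write under `lakefs://example-repo/etl-test/`; if the run fails, deleting the branch discards the changes at no storage cost, and if it succeeds, the branch can be merged back to main.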

4. Achieve data reproducibility by using versioning

In many organizations, multiple copies of the data are kept on a regular basis in order to track the changes it undergoes throughout its lifecycle. Whenever an error or inconsistency appears in the data, being able to time-travel between these copies helps a great deal in identifying when the error was introduced and reproducing what caused it.

Keeping versions of the data by copying it carries a significant cost, both in storage and in the need to regularly delete, manage, and maintain these copies.

While keeping track of what has changed in the data over time is very important, there is a better way to achieve it: keeping as many versions of the data as needed, instead of numerous physical copies.
The lakeFS open-source project can help here, as it keeps multiple versions of the data without copying anything, providing an easy way to maintain a history of what changed in the data over time. On top of that, lakeFS provides tagging, a way to give a meaningful name to a specific version of the data, which delivers the value of a version history without the need to create and maintain multiple copies of the data.
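A hedged sketch of tagging with the lakeFS Python SDK follows; the repository and tag names are illustrative, and the `tag(...).create(...)` call reflects my understanding of the high-level client, so check it against the SDK docs for your version:

```python
# Sketch: pinning a reproducible version of the data with a lakeFS tag.
# "example-repo" and the tag names are hypothetical; the `lakefs` SDK and
# a configured lakeFS server are assumed when tag_release() actually runs.

def tag_uri(repo: str, tag: str) -> str:
    """Read-only lakeFS path for a pinned version of the data."""
    return f"lakefs://{repo}/{tag}/"


def tag_release(repo_name: str, tag_name: str, ref: str = "main"):
    import lakefs  # deferred: the SDK is only needed when tagging for real

    repo = lakefs.repository(repo_name)
    # A tag is a named pointer to a commit. No data is copied, yet the
    # tagged version stays readable, e.g. to reproduce last quarter's report.
    return repo.tag(tag_name).create(ref)
```

A consumer can then read `lakefs://example-repo/2022-q3-report/` months later and get exactly the data that existed when the tag was created, with no physical copy ever made.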

Summary

While reducing cloud data storage costs is always good practice, in times of economic uncertainty the ability to save money becomes a necessity for ensuring a company's longevity and resilience. Applying data-specific cost reduction methods can be a one-time engineering investment that proves itself even in the short term, and undoubtedly in the long run.
