Lists are everywhere. And there’s a good reason for that: lists have an order to them. They reduce our stress by assuring us that everything is written down on paper (read: note app), and that we are keeping track of things that we would otherwise forget. People who adopt minimalist lifestyles often make complete lists of objects in their possession, in order to monitor the amount of things in their life (and to brag about it on the internet).
Similarly, data engineers and their organizations want to keep track of their objects. Although listing all the files you have seems like a basic capability, in the world of big-data it may not be so trivial to achieve. What if you have so many objects, that listing them would be cost-prohibitive? What if even the list of your files is too big for a MacBook Pro to hold in-memory?
If your S3 data lake has 500,000,000 objects, you’ll have to make 500,000 LIST requests just to cover everything once. It will take a substantial time to make so many requests, and at the end you’ll be left to deal with the resulting dataset, which is bigger than what many traditional databases can handle. This is where S3 inventory comes into play.
How S3 Inventory Works
S3 inventory will run once per day (or once per week) and give you a complete list of your objects. This list will be published to an S3 destination of your choice, and can be given in Parquet, ORC or CSV formats. As a bonus, it will also create Hive symlinks over this list, making it compatible with Athena out of the box. This is the beauty of it: you will be using big-data tools to analyze your list of objects. And the price? $0.0025 per million objects listed, which are $1.25 for the scenario described above.
S3 inventory can be easily configured under the Management tab of your bucket in the S3 console. After being configured, it will be published to the specified location daily or weekly. Once it has been published, you can use Athena to query it. With a little creativity, you will get some very interesting insights regarding your data usage patterns.
Testing Your Retention Solution
Let’s use S3 inventory to find out how many files changed their storage tier between two given dates. This can help you save money by monitoring and validating that S3’s intelligent tiering fits your needs. If you’re an in-house tiering solution, you can use a very similar query to test it.
SELECT t1.intelligent_tiering_access_tier AS from_tier, t2.intelligent_tiering_access_tier AS to_tier, COUNT(*) AS file_count FROM my_inventory t1 JOIN my_inventory t2 ON t1.key = t2.key WHERE t1.dt = '2021-01-01-00-00' AND t2.dt = '2021-01-02-00-00' AND t1.is_latest AND t2.is_latest AND t1.intelligent_tiering_access_tier <> t2.intelligent_tiering_access_tier GROUP BY 1,2
You can see that 10815 files went from the INFREQUENT to the FREQUENT tier. If this number doesn’t seem right, we can remove the GROUP BY clause from the query to see which files we are now paying more for.
Find Space Consuming “Directories”
Although S3 doesn’t really have directories, it is sometimes useful to look at your lake as a file tree. If you want to find subtrees that take up a lot of space, you can use the following query to get this information in a form similar to the output of the “du” command. You can check out lakeview, an open-source visibility tool for your data lake, to get an interactive version of this query (similar to “ncdu”).
SELECT depth, common_prefix, sum(size) AS size FROM (SELECT depth, key, CASE CARDINALITY(SPLIT(key, '/')) WHEN depth THEN key ELSE ARRAY_JOIN(SLICE(SPLIT(key, '/'), 1, depth), '/') || '/' END AS common_prefix, size FROM my_inventory, UNNEST(SEQUENCE(1, 10)) d(depth) -- replace "10" with the max-depth of your tree WHERE dt = '2021-01-01-00-00' AND is_latest AND NOT is_delete_marker) GROUP BY 1, 2 ORDER BY 1 ASC, 2 DESC
Finding Duplicate Objects
Having many processes writing to many different paths, can result in the same data being written to multiple places. Also, sometimes we just need some data to be under a specific key, so we copy it there and then forget about it. One can use S3 inventory to find duplicate objects, and save storage costs by deleting them. The following query finds duplicate objects with size over 10KB, along with all the keys they appear under.
SELECT e_tag, ARRAY_JOIN(ARRAY_AGG(key), ',') AS , COUNT(*) FROM my_inventory WHERE dt = '2021-01-01-00-00' AND is_latest AND NOT is_delete_marker AND size > 10000 GROUP BY 1 HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC
The Power of S3 Inventory
Amazon S3 inventory is a powerful tool which is overlooked by many in the data community. Using it wisely and creatively can help you better understand the topography of your data lake, and even save you money.
While developing lakeFS, we used S3 inventory in many ways. Firstly, it is one of the ways to import data into our product. Moreover, we used anonymized versions of our design partners’ inventories to investigate their data usage patterns. We even introduced lakeview, which uses techniques similar to the ones I described in this article.