Lakeview is a new open source visibility tool for AWS S3-based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to give you an easy way to see how much storage an S3 bucket, or any prefix within it, consumes. Scanning billions of objects through the S3 API would require millions of API calls and can get expensive fast, so lakeview instead uses Athena to query S3 Inventory reports.
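To make the Athena approach more concrete, here is a minimal Python sketch of the kind of query that could total object sizes per top-level prefix in a single pass over an inventory report. It is an illustration only, not lakeview's actual query: the function name is hypothetical, and the `key` and `size` columns and the `dt` partition assume the inventory table was created with the schema suggested in the AWS documentation.

```python
import re

# Conservative check for a table name such as "my_table" or "my_db.my_table".
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def top_level_sizes_query(table: str, snapshot_dt: str) -> str:
    """Build Athena SQL that sums object sizes per top-level key prefix.

    Assumptions (not taken from lakeview itself): the inventory table has
    `key` and `size` columns and is partitioned by a `dt` string column.
    """
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if "'" in snapshot_dt:
        raise ValueError("snapshot_dt must not contain quotes")
    return (
        "SELECT split_part(key, '/', 1) AS prefix,\n"
        "       SUM(size) AS total_bytes,\n"
        "       COUNT(*) AS object_count\n"
        f"FROM {table}\n"
        f"WHERE dt = '{snapshot_dt}'\n"
        "GROUP BY 1\n"
        "ORDER BY total_bytes DESC"
    )


if __name__ == "__main__":
    # Hypothetical table name and partition value, for illustration only.
    print(top_level_sizes_query("inventory_db.my_inventory", "2020-06-01-00-00"))
```

Athena reads the inventory files once for a query like this, which is what avoids paging through the bucket with one list request per thousand objects.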
What can it do?
- Aggregate the sizes of directories* in S3 – allowing you to drill down and find what is taking up space.
- Compare sizes between different dates – see how directory sizes change from one inventory report to the next.
- Find the largest duplicates in your directories (planned but not yet implemented)
* S3, being an object store and not a filesystem, doesn’t really have a notion of directories, but its API supports so-called “common prefixes”.
All capabilities are provided in both a human-consumable web interface and a machine-consumable JSON report – feel free to plug them into your favorite monitoring tool.
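The “common prefix” roll-up can be sketched in a few lines of Python. This is a simplified in-memory illustration of the idea, assuming a small list of (key, size) pairs; the function name is hypothetical and lakeview itself performs the aggregation in Athena rather than like this.

```python
from collections import defaultdict


def rollup_sizes(objects, parent="", delimiter="/"):
    """Sum object sizes under `parent`, grouped by the next path component.

    Keys that contain another delimiter below `parent` are rolled up into a
    common prefix (a "directory"); keys that do not are reported as-is.
    """
    totals = defaultdict(int)
    for key, size in objects:
        if not key.startswith(parent):
            continue  # object lives outside the prefix being inspected
        rest = key[len(parent):]
        head, sep, _ = rest.partition(delimiter)
        totals[parent + head + sep] += size
    # Largest entries first, as a disk-usage viewer would show them.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    objects = [
        ("logs/2020/a.gz", 100),
        ("logs/2021/b.gz", 250),
        ("data/x.parquet", 1000),
        ("README", 5),
    ]
    print(rollup_sizes(objects))           # {'data/': 1000, 'logs/': 350, 'README': 5}
    print(rollup_sizes(objects, "logs/"))  # {'logs/2021/': 250, 'logs/2020/': 100}
```

Calling the function again with one of the returned prefixes as `parent` is the drill-down step: each call shows one level of the tree.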
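As a rough illustration of comparing two inventory dates and emitting a machine-readable result, the Python sketch below diffs two per-prefix size maps and serializes the changes as JSON. The function name and the field names (`prefix`, `before_bytes`, `after_bytes`, `delta_bytes`) are hypothetical and are not lakeview's actual report schema.

```python
import json


def size_changes(before, after):
    """Return per-prefix size deltas between two snapshots.

    `before` and `after` map a prefix to its total size in bytes. A prefix
    missing from one snapshot is treated as having size 0 there.
    """
    rows = []
    for prefix in set(before) | set(after):
        old = before.get(prefix, 0)
        new = after.get(prefix, 0)
        rows.append({
            "prefix": prefix,
            "before_bytes": old,
            "after_bytes": new,
            "delta_bytes": new - old,
        })
    # Largest absolute change first; prefix name breaks ties deterministically.
    rows.sort(key=lambda r: (-abs(r["delta_bytes"]), r["prefix"]))
    return rows


if __name__ == "__main__":
    before = {"logs/": 100, "data/": 1000}
    after = {"logs/": 400, "data/": 900, "tmp/": 50}
    print(json.dumps(size_changes(before, after), indent=2))
```

A monitoring tool could alert on `delta_bytes` crossing a threshold for any prefix.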
Give it a try
- Ensure you have an S3 inventory set up (preferably as Parquet or ORC)
- Verify the table is registered in Athena
- Run the lakeview Docker image:
docker run -it -p 5000:5000 \
  -v $HOME/.aws:/home/lakeview/.aws \
  treeverse/lakeview:0.1.0 \
  --table <athena table name> \
  --output-location <s3 uri>
Here <athena table name> is the name of the table you registered in Athena in the previous step, and <s3 uri> is a location in S3 where Athena can store its query results (e.g. s3://my-bucket/athena/)
- Open http://localhost:5000/ and start exploring
You can also use lakeview as an API or run it locally.