Oz Katz
October 12, 2020

Lakeview is a new open source visibility tool for AWS S3-based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to give you an easy way to see the total size of the storage under an S3 bucket or prefix. Instead of scanning billions of objects with the S3 API, which would require millions of API calls and can get expensive fast, lakeview uses Athena to query S3 Inventory reports.

Simply compare sizes of directories between different dates

What can it do?

  1. Aggregate the sizes of directories* in S3 – allowing you to drill down and find what is taking up space.
  2. Compare sizes between different dates – see how directory sizes change over time between different inventory reports.
  3. Find the largest duplicates in your directories (planned but not yet implemented).

* S3, being an object store and not a filesystem, doesn’t really have a notion of directories, but its API supports so-called “common prefixes”.
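To make the "directories as common prefixes" idea concrete, here is a minimal Python sketch of the first two capabilities: rolling object sizes up to the next path segment, and diffing two such roll-ups. This is not lakeview's code (lakeview does this work with Athena queries over S3 Inventory reports); the function names `aggregate_by_prefix` and `diff_sizes`, the `(key, size)` input shape, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


def aggregate_by_prefix(objects, prefix="", delimiter="/"):
    """Sum object sizes under each immediate child of `prefix`.

    `objects` is an iterable of (key, size_in_bytes) pairs, for example
    rows taken from an inventory listing (hypothetical input shape).
    """
    totals = defaultdict(int)
    for key, size in objects:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # Keys that contain the delimiter roll up into a "directory"
        # (common prefix); keys without it sit directly under `prefix`.
        totals[prefix + head + sep] += size
    return dict(totals)


def diff_sizes(before, after):
    """Return the size change per prefix between two aggregations."""
    return {
        p: after.get(p, 0) - before.get(p, 0)
        for p in sorted(set(before) | set(after))
    }


# Made-up sample data standing in for two inventory dates.
day1 = [("logs/a.gz", 100), ("logs/b.gz", 50), ("tables/t1/part-0", 400)]
day2 = day1 + [("logs/c.gz", 25), ("README", 5)]

sizes1 = aggregate_by_prefix(day1)
sizes2 = aggregate_by_prefix(day2)
changes = diff_sizes(sizes1, sizes2)
```

Drilling down is just a matter of calling `aggregate_by_prefix` again with a longer prefix, such as `prefix="tables/"`.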

All capabilities are provided in both a human consumable web interface and a machine consumable JSON report – feel free to plug them into your favorite monitoring tool.

Lakeview allows you to easily find the total size of the storage under your S3 bucket or prefix

Give it a try

docker run -it -p 5000:5000 \
    -v $HOME/.aws:/home/lakeview/.aws \
    treeverse/lakeview:0.1.0 \
        --table <athena table name> \
        --output-location <s3 uri>
  • Note: <athena table name> is the name you gave in step 2, and <s3 uri> is a location in S3 where Athena can store its results (e.g. s3://my-bucket/athena/).

You can also use lakeview as an API or run it locally.

More information

lakeview was originally built (with 💚) by the team at Treeverse. We’re actively developing lakeFS as an open source tool that delivers resilience and manageability to object-storage based data lakes.

We’d love to hear your feedback on both of these projects. Share with us what you think on the lakeFS Slack channel or GitHub repository.
