
lakeFS Community
Robin Moffatt
March 20, 2023

We recently got a dog.

Bear with me here.

We have a dog walker who comes once a week to take him out for walks with other puppies his age. Since I’m often on work calls, the dog walker has a key so they let themselves in without disturbing me.

That key admits them to the whole house, even though they only need to go to the kitchen where the dog sleeps during the day. Of course, we trust the dog walker, but theoretically, we’d prefer granting them (and any other dog walker) access only to the kitchen while denying access to other rooms.

A dog sat by the door waiting for a walk

When deploying lakeFS, you’re likely to have similar considerations. You’ll want to give all your data engineers, data scientists, and any other folks of a data practitioner bent access to it since it’s going to be the heart of the data platforms they’re working on.

However, not all users are created equal, and this is where Role-Based Access Control comes in.

Just as I’d theoretically grant access to part of my house to a user (person) based on their role (dog walker, spouse, child), we can control access in lakeFS based on a user’s role.

Controlling Access based on Role in lakeFS

lakeFS supports Role-Based Access Control (RBAC), so you can precisely specify what (functionality) users can do and where (data) they can do it.

For example:

  • Data Infrastructure Engineers who administer the data platform (including lakeFS) should be able to do most of the repository and branch management, but perhaps be blocked from viewing any sensitive data it contains.
  • Data Engineers are allowed to create branches of data against which they can test the changes to data their code would make and merge those branches back once the test is successful. DEs in different departments may all be granted access to central data and restricted access to particular departmental data as needed.

As the name implies, access is usually granted based on the role of a user. It can be done directly against a user too, but that’s generally not good practice. If you do it by role, it’s simpler to track and audit.

Granting access to a role rather than to individual users is a lot more flexible too. If you have a dozen data scientists who now need to be able to create their own branches from production data to run experiments against, making that change once in the Data Scientist role configuration will be faster than getting a list of all the data scientists and changing the permissions for each user one by one.
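To make that concrete, here’s a hedged sketch of the difference. The group attach command mirrors the one used later in this post; the per-user attach subcommand, the AllowExperimentBranches policy, and the user IDs are illustrative assumptions rather than anything shipped with lakeFS.

# One change covers every member of the group
$ lakectl auth groups policies attach --id "DataScientists" --policy AllowExperimentBranches

# Versus repeating the change for each individual data scientist
$ for user in ds01 ds02 ds03; do
    lakectl auth users policies attach --id "$user" --policy AllowExperimentBranches
  done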

Note: RBAC is available with lakeFS Cloud and lakeFS Enterprise.

Terminology

If you’re anything like me and study the AWS IAM reference documentation each night before bed, then you’ll be familiar with the components that lakeFS bases its RBAC on. Just in case, let’s do a quick recap.

  • Users – Representing entities that access and use the system.
  • Groups – A named collection of users. Users can belong to multiple groups. This is really another word for Role.
  • Actions – Representing a logical action within the system – reading a file, creating a repository, etc.
  • Resources – A unique identifier representing a specific resource in the system – a repository, an object, a user, etc.
  • Policies – Representing a set of Actions, a Resource, and an effect (allow or deny). Policies are attached to a group. They can be attached to users too, but generally this is not good practice.
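To see how these pieces fit together, here’s a hypothetical policy created with lakectl, in the same format used later in this post. The policy name, repository, actions, and resource ARN shown here are illustrative; check the lakeFS authorization reference for the exact identifiers available to you.

$ lakectl auth policies create \
           --statement-document - \
           --id FSReadOnlyExampleRepo <<EOF
{
    "statement": [
        {
            "action": [
                "fs:ReadObject",
                "fs:ListObjects"
            ],
            "effect": "allow",
            "resource": "arn:lakefs:fs:::repository/example-repo/object/*"
        }
    ]
}
EOF

Attach that policy to a group, put users in the group, and those users can read (but not write) objects in example-repo.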

lakeFS RBAC in Action

To see just what RBAC can do, I created a small project with three protagonists:

  • Bob is the data platform owner. He’s responsible for setting everything up and getting the raw data loaded,
  • Carol is a data engineer tasked with transforming the raw data into a dimensional model ready for analysis,
  • Alice is a data scientist (or perhaps an analytics engineer; she can’t decide which title sounds better these days. Is data scientist passé now?). She’s going to use the transformed data to do some analysis.

Each of these protagonists has clear responsibilities and limits to what they should be able to do on lakeFS. We can enforce these with RBAC.

As the platform owner, Bob is an admin and has full access. After doing the initial data load, Bob sets write-protection on the main branch. This isn’t part of RBAC but bears mentioning anyway, as it’s core to the development pattern recommended for lakeFS and other versioning systems. Bob also writes some sensitive data that neither Carol nor Alice should be able to access.

Carol gets assigned to the built-in Developers group. This group is perfect for the data engineer role: its members develop and test code against a branch of data and merge that branch back to main once the tests pass.

We create a new Data Scientists group for Alice. Data Scientists will generally want to work against a branch of production data and perhaps write their findings or scratch files back to a branch, but not merge this to main, which the group policy blocks.
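Setting these groups up from the command line might look roughly like this; the group-management subcommands and the user IDs are assumptions on my part, so treat it as a sketch rather than a copy-paste recipe.

# Create the custom group (the built-in Developers group already exists)
$ lakectl auth groups create --id "DataScientists"

# Put Carol in the built-in Developers group
$ lakectl auth groups members add --id "Developers" --user carol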

Let’s now walk through each user’s interactions with lakeFS in a bit more detail and see how the RBAC implementation works at each stage.

Loading the data into lakeFS

We’re using S3 as our storage, with lakeFS on top giving us a nice way to enforce good version-control practices: we can commit and roll back changes as needed.

The dataset is drone registration data, made available to the public through a Freedom of Information Act request by Ben Welsh (hat tip to Jeremy Singer-Vine, whose Data is Plural newsletter is where I found it). Like much of the raw data we end up handling as data engineers, it’s a set of XLSX files, one file per year.

A folder listing of XLSX files

Bob the admin runs a little Python script that converts the Excel files into Parquet and uploads them to lakeFS. He also has a CSV of drone pilot details, which he uploads directly to lakeFS using the lakectl tool:

$ lakectl fs upload lakefs://drones03/main/sensitive/flyers.csv --source /data/flyers.csv
Path: sensitive/flyers.csv
Modified Time: 2023-03-01 09:49:08 +0000 UTC
Size: 80491 bytes
Human Size: 80.5 kB
Physical Address: s3://rmoff-01/drones03/data/gp7u8o5bjhuk6civtnrg/cfvhv55bjhuk6civtnug
Checksum: 64f27b8396e1feb68515d6df2c7e3872
Content-Type: application/octet-stream

This is sensitive data, and we’ll see later how RBAC blocks access to it for users who shouldn’t see it.
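For the Parquet files produced by the conversion script, Bob could equally have pushed the whole directory up with lakectl rather than Python. A minimal sketch, assuming the recursive upload flag and an illustrative local path:

$ lakectl fs upload lakefs://drones03/main/drone-registrations/ \
          --source /data/parquet/ \
          --recursive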

Bob can now review the files in lakeFS before committing them. The object viewer helpfully enough displays the parquet data directly in the web browser:

lakeFS UI Object page showing DuckDB query executed against Parquet file

Happy that the data files are as they should be, Bob goes ahead and commits them:

lakeFS UI Commit Changes dialog

With the main branch now loaded, Bob sets a protection on it so that any subsequent writes must be done via the merge of a branch.

$ lakectl branch-protect add lakefs://drones03 main

Bob checks that the protection is working by trying to directly delete a file, which fails as it should:

$ lakectl fs rm lakefs://drones03/main/drone-registrations/Registations-RecFlyer-Active-2015.parquet

cannot write to protected branch
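If you want to double-check which protection rules are in place, lakectl can also list them; I’m assuming the list subcommand here, so verify against the lakectl reference for your version:

$ lakectl branch-protect list lakefs://drones03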

Building a Dimensional Model on the Raw Data

Enter Carol. Carol likes nothing more than restoring order to the world by modeling data into a star schema ready for analytics.

RBAC for the Data Engineer

Before we look at how she does this, let’s look at the RBAC setup for her user. As a developer, she’s in the Developers group, which gives her permission to read and write files. However, she’s not allowed access to files that are deemed sensitive, which helpfully follow an object naming convention with a sensitive prefix. Thus, we can construct an RBAC policy to deny access:

lakeFS UI showing a custom Access Control policy

Now any object that begins with sensitive in the drones03 repository will be protected from access by any user to whom this policy is applied.
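The screenshot shows the policy being built in the UI; an equivalent created from the command line might look roughly like the following. The action list and resource ARN are my best guess at what the screenshot contains, so adjust them to match your own repository and naming convention.

$ lakectl auth policies create \
           --statement-document - \
           --id FSBlockAccessToSensitiveData <<EOF
{
    "statement": [
        {
            "action": [
                "fs:ReadObject"
            ],
            "effect": "deny",
            "resource": "arn:lakefs:fs:::repository/drones03/object/sensitive*"
        }
    ]
}
EOF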

Policies are usually attached to groups, which is what we’ll do here using the lakectl tool:

$ lakectl auth groups policies attach \
            --id "Developers" \
            --policy FSBlockAccessToSensitiveData 
        
Policy attached successfully
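To confirm the attachment took effect, the group’s policies can be listed too (assuming the corresponding list subcommand):

$ lakectl auth groups policies list --id "Developers"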

Let’s double-check that this has the correct effect on access. First up, Bob (remember him? He’s the owner of this whole data platform) tries to access a sensitive file, which will succeed as it should:

# Using Bob’s credentials
$ lakectl fs cat lakefs://drones03/main/sensitive/flyers.csv | head -n2

id,first_name,last_name,email,address_1,address_2,city
1,Archer,Ganiclef,aganiclef0@google.nl,38282 Maple Point,Suite 37,Penteado
2,Joann,Edwardes,jedwardes1@auda.org.au,056 Elgar Park,PO Box 81577,Oslo

Now Carol tries to do the same:

# Using Carol’s credentials
$ lakectl fs cat lakefs://drones03/main/sensitive/flyers.csv | head -n2

insufficient permissions
401 Unauthorized

🛑 Stop right there! RBAC is doing its job 👍🏻

Transforming the data

With a set of Parquet files of as-yet unknown structure, Carol ventures forth. She can see that there are two types of file, each split by year:

$ lakectl fs ls lakefs://drones03/main/drone-registrations/ 
object          2023-03-01 09:47:36 +0000 UTC    30.7 kB         Registations-P107-Active-2016.parquet
object          2023-03-01 09:48:54 +0000 UTC    119.7 kB        Registations-P107-Active-2017.parquet
object          2023-03-01 09:44:47 +0000 UTC    594.3 kB        Registations-P107-Active-2018.parquet
object          2023-03-01 09:45:04 +0000 UTC    1.3 MB          Registations-P107-Active-2019.parquet
object          2023-03-01 09:48:12 +0000 UTC    2.8 MB          Registations-P107-Active-2020.parquet
object          2023-03-01 09:48:51 +0000 UTC    3.2 MB          Registations-P107-Active-2021.parquet
object          2023-03-01 09:45:30 +0000 UTC    659.3 kB        Registations-RecFlyer-Active-2015.parquet
object          2023-03-01 09:46:48 +0000 UTC    1.5 MB          Registations-RecFlyer-Active-2016.parquet
object          2023-03-01 09:46:28 +0000 UTC    715.1 kB        Registations-RecFlyer-Active-2017.parquet
object          2023-03-01 09:45:35 +0000 UTC    318.5 kB        Registations-RecFlyer-Active-2018.parquet
object          2023-03-01 09:45:21 +0000 UTC    1.6 MB          Registations-RecFlyer-Active-2019.parquet
object          2023-03-01 09:47:35 +0000 UTC    4.3 MB          Registations-RecFlyer-Active-2020.parquet
object          2023-03-01 09:46:19 +0000 UTC    4.3 MB          Registations-RecFlyer-Active-2021.parquet

Using DuckDB, she quickly (quackly?) builds a view over each set of files:

🟡◗ CREATE VIEW Registations_RecFlyer_Active AS
    SELECT * FROM read_parquet(
        's3://drones03/main/drone-registrations/Registations-RecFlyer-Active-*.parquet') ;

🟡◗ CREATE VIEW Registations_P107_Active AS 
    SELECT * FROM read_parquet(
        's3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet',
        union_by_name=true) ;

From these views, she determines that they share the same data structure, except one uses UAS and the other uses Asset for two of the column names (the data being the same across each).

🟡◗ DESCRIBE Registations_P107_Active;
┌─────────────────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│       column_name       │ column_type │  null   │   key   │ default │ extra │
│         varchar         │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ Registration Date       │ VARCHAR     │ YES     │         │         │       │
│ Registion Expire Dt     │ VARCHAR     │ YES     │         │         │       │
│ Asset Type              │ VARCHAR     │ YES     │         │         │       │
│ RID Equipped            │ BOOLEAN     │ YES     │         │         │       │
│ Asset Model             │ VARCHAR     │ YES     │         │         │       │
│ Physical City           │ VARCHAR     │ YES     │         │         │       │
│ Physical State/Province │ VARCHAR     │ YES     │         │         │       │
│ Physical Postal Code    │ VARCHAR     │ YES     │         │         │       │
│ Mailing City            │ VARCHAR     │ YES     │         │         │       │
│ Mailing State/Province  │ VARCHAR     │ YES     │         │         │       │
│ Mailing Postal Code     │ VARCHAR     │ YES     │         │         │       │
│ filename                │ VARCHAR     │ YES     │         │         │       │
├─────────────────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤
│ 12 rows                                                           6 columns │
└─────────────────────────────────────────────────────────────────────────────┘

This gives rise to a fairly straightforward star schema:

Entity Relationship Diagram (ERD) of the star schema

Carol runs some SQL queries that apply this model to the data, resulting in three nicely defined tables:

🟡◗ DESCRIBE;
┌────────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│     table_name     │     column_names     │       column_types        │ temporary │
│      varchar       │      varchar[]       │         varchar[]         │  boolean  │
├────────────────────┼──────────────────────┼───────────────────────────┼───────────┤
│ dim_asset          │ [asset_id, type, m…  │ [INTEGER, VARCHAR, VARC…  │ false     │
│ dim_location       │ [location_id, CITY…  │ [INTEGER, VARCHAR, VARC…  │ false     │
│ fact_registrations │ [date_of_reg, reg_…  │ [VARCHAR, VARCHAR, VARC…  │ false     │
└────────────────────┴──────────────────────┴───────────────────────────┴───────────┘

Her work here is done, and she is ready to write this data back to lakeFS so that others can use it. Because she’s in the Developers RBAC group, she has permission to create a branch, which she duly does:

$ lakectl branch create \
            lakefs://drones03/build-star-schema \
            --source lakefs://drones03/main
Source ref: lakefs://drones03/main
created branch 'build-star-schema' 6182688a0574521e030b62e4c350d4050453f953d525c118e583a53d6deb498b

She then exports the freshly-modeled dataset:

🟡◗ EXPORT DATABASE 's3://drones03/build-star-schema/drone-registrations/star' (FORMAT PARQUET);

Finally, she commits this to the branch…

$ lakectl commit lakefs://drones03/build-star-schema \
        --message "Modeled raw data into star schema ⭐" 
  
Branch: lakefs://drones03/build-star-schema
Commit for branch "build-star-schema" completed.

ID: 617cf4e21ed85d9b5939c90b0b1e3043decf8f3e97899cbb35dc4bfd5c666c89
Message: Modeled raw data into star schema ⭐
Timestamp: 2023-03-03 18:07:12 +0000 UTC
Parents: 6182688a0574521e030b62e4c350d4050453f953d525c118e583a53d6deb498b

…and merges that back to main:

$ lakectl merge lakefs://drones03/build-star-schema lakefs://drones03/main 

Source: lakefs://drones03/build-star-schema
Destination: lakefs://drones03/main
Merged "build-star-schema" into "main" to get "858f4a9d987609e7dcb4a983bb3dbffeb898feaf2512eb712e8a9b5899d2fb33".

Branching the Data for Sandbox Use

As part of her role as a data scientist, Alice needs to analyze the data. This will include creating some tables that she’ll want to store, but that are really only for temporary purposes and shouldn’t be written to the main data.

To block that, we create a new policy to prevent commits to the main branch. Instead of using the web interface like we did above, we use lakectl here:

$ lakectl auth policies create \
           --statement-document - \
           --id FSBlockMergingToMain <<EOF
{
    "statement": [
        {
            "action": [
                "fs:CreateCommit"
            ],
            "effect": "deny",
            "resource": "arn:lakefs:fs:::repository/*/branch/main"
        }
    ]
}
EOF

Alice is assigned membership of the Data Scientists group, which has this new policy attached along with several others, including the FSBlockAccessToSensitiveData policy:

lakeFS UI showing policies attached to the Data Scientist group
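On the command line, the equivalent might look something like this; the group and user IDs are illustrative, and the members subcommand is an assumption on my part:

# Attach the policies to the Data Scientists group
$ lakectl auth groups policies attach --id "DataScientists" --policy FSBlockMergingToMain
$ lakectl auth groups policies attach --id "DataScientists" --policy FSBlockAccessToSensitiveData

# Add Alice as a member of the group
$ lakectl auth groups members add --id "DataScientists" --user alice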

To get underway, Alice creates a branch to work in:

$ lakectl branch create lakefs://drones03/alice-test \
                --source lakefs://drones03/main 

Source ref: lakefs://drones03/main
created branch 'alice-test' 858f4a9d987609e7dcb4a983bb3dbffeb898feaf2512eb712e8a9b5899d2fb33

With the data safely isolated in her own branch, she can now start working on it. The task at hand is to report on which models of drone are most commonly registered, and where. To start with, she loads the data into DuckDB:

🟡◗ CREATE TABLE dim_asset AS 
        SELECT * FROM read_parquet('s3://drones03/alice-test/drone-registrations/star/dim_asset.parquet');
🟡◗ CREATE TABLE dim_location AS 
        SELECT * FROM read_parquet('s3://drones03/alice-test/drone-registrations/star/dim_location.parquet');
🟡◗ CREATE TABLE fact_registrations AS 
        SELECT * FROM read_parquet('s3://drones03/alice-test/drone-registrations/star/fact_registrations.parquet');

Looking at the data, Alice can see that there are data quality issues – Mavic and MAVIC, Mavic Air 2S also listed as AIR 2S, and that’s just the beginning of it.

🟡◗ SELECT model,count(*) FROM dim_asset a left join fact_registrations r on a.asset_id=r.asset_id group by model order by 2 desc;
┌───────────────┬──────────────┐
│     model     │ count_star() │
│    varchar    │    int64     │
├───────────────┼──────────────┤
│ Mavic 2 Pro   │        14505 │
│ MAVIC Air 2s  │        11740 │
│ Air 2S        │         6953 │
│ Mini 2        │         6778 │
[…]
│ Mavic Air 2S  │         3608 │
│ AIR 2S        │         3568 │
│ Air 2s        │         3439 │
[…]

Alice starts off by looking at the example of the Air 2S:

🟡◗ SELECT model,count(*) FROM dim_asset a left join fact_registrations r on a.asset_id=r.asset_id 
     where lower(model) like '%air%' 
       and lower(model) like '%2s%' 
    group by model order by 2 desc 
    limit 30;
┌──────────────────┬──────────────┐
│      model       │ count_star() │
│     varchar      │    int64     │
├──────────────────┼──────────────┤
│ Air 2S           │         6953 │
│ Mavic Air 2S     │         3608 │
│ AIR 2S           │         3568 │
│ Air 2s           │         3439 │
│ DJI Air 2S       │         1217 │
[…]

Alice adds a column, model_cleaned, to the dim_asset table to hold the cleaned-up value; by default it takes the same value as model.

🟡◗ ALTER TABLE dim_asset ADD COLUMN model_cleaned VARCHAR;
🟡◗ UPDATE dim_asset SET model_cleaned = model;

Now she can start to clean up the Air 2S references:

🟡◗ UPDATE dim_asset
    SET model_cleaned = 'Air 2S'
    WHERE lower(model) LIKE '%air 2s%'
        OR lower(model) LIKE '%air2s%';

With this and several similar data fixes applied, Alice uses a notebook and a Python charting library to visualize the top models:

Bar Chart showing # of registrations of different models of drone

She also plots out the relative number of drone registrations by US state:

Choropleth Map showing relative number of drone registrations by US state

With her analysis done for now, Alice saves the results of the aggregate calculations that she’s done back to lakeFS.

🟡◗ CREATE TABLE TOP_DRONES AS 
    SELECT model_cleaned AS model,count(*)  as registrations
    FROM dim_asset a 
        left join fact_registrations r on a.asset_id=r.asset_id 
    group by model_cleaned 
    order by 2 desc; 

🟡◗ COPY TOP_DRONES TO 's3://drones03/alice-test/drone-registrations/star/top_drones.parquet';

She also saves the cleansed dimension data back to her branch, replacing what’s there.

🟡◗ COPY dim_asset 
      TO 's3://drones03/alice-test/drone-registrations/star/dim_asset.parquet' 
      (FORMAT PARQUET, ALLOW_OVERWRITE TRUE);

The nice thing about this is that the changes she’s made to the data are isolated from those in the main branch, which other users of the data will be accessing.

Query run from lakeFS showing dim_asset schema from the main branch, highlighting that the four original columns are present
Query run from lakeFS showing dim_asset schema from the alice-test branch, highlighting that there is now a fifth column present
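Just to round out the RBAC story: Alice can commit her changes to her own branch, but if she tried to merge them back, the FSBlockMergingToMain policy attached to her group would deny the commit on main and the merge would be rejected. A sketch of what she’d run (the commit message is illustrative, and output is omitted):

# Committing to her own branch is allowed
$ lakectl commit lakefs://drones03/alice-test \
        --message "Cleaned up drone model names"

# Merging to main would create a commit on main, which her group’s policy denies
$ lakectl merge lakefs://drones03/alice-test lakefs://drones03/main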

Conclusion

This little tour of RBAC in lakeFS showed how flexible and powerful it is for ensuring that different users can do exactly what they need to in lakeFS—and only that. With a good set of built-in groups and policies, lakeFS comes out of the box ready to deploy for general types of users, and by adding custom policies, you can tailor it precisely as needed.

You can try out lakeFS Cloud for free today. The Python code and notebook with SQL shown in this blog can be found here on GitHub if you’d like to run it for yourself.

RBAC is available with lakeFS Cloud and lakeFS Enterprise.
