Last week we held another lakeFS Community Call! We believe these calls are invaluable opportunities to have direct dialogue with our users on all things lakeFS.
Oz covered important new lakeFS functionality, previewed what's coming soon from the roadmap, and also shared two exciting updates from the community. Let's recap!
6 Important lakeFS Releases
1. post-merge and post-commit hooks (0.46)
Now you can use the lakeFS hooks feature to run validation tests triggered after a commit or merge action takes place. Unlike the “pre-” variety, these hooks won’t prevent bad data from making it into production datasets. But they will let downstream systems respond to the event and take their own action – like Hive Metastore adding a partition, or updating a data discovery tool.
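As a rough sketch, a post-merge hook is declared in an action file uploaded under the repository's `_lakefs_actions/` prefix; the file path and webhook URL below are illustrative:

```yaml
# _lakefs_actions/post_merge_metastore.yaml (path and URL are examples)
name: Notify metastore after merge
on:
  post-merge:
    branches:
      - main
hooks:
  - id: register_partition
    type: webhook
    properties:
      # hypothetical endpoint that registers the new partition downstream
      url: "http://metastore-sync.example.com/on-merge"
```

When a merge into `main` completes, lakeFS POSTs the event details to the configured URL.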
2. Hooks support triggering Airflow DAGs (0.47)
With this feature, lakeFS post-merge hooks can trigger an Airflow DAG that, for example, transforms or aggregates data. With this setup, Airflow goes from a sensor-driven architecture to an event-driven one. Cool stuff!
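A minimal sketch of an Airflow-triggering action file, assuming the `airflow` hook type and property names as documented around this release (the webserver URL, DAG ID, and credentials below are placeholders):

```yaml
# _lakefs_actions/post_merge_airflow.yaml (all values are illustrative)
name: Trigger transform DAG after merge
on:
  post-merge:
    branches:
      - main
hooks:
  - id: trigger_dag
    type: airflow
    properties:
      url: "http://airflow.example.com:8080"  # Airflow webserver
      dag_id: "transform_daily"               # hypothetical DAG
      username: "lakefs"
      password: "{{ ENV.AIRFLOW_PASSWORD }}"  # keep secrets out of the file
```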
3. Support repositories on multiple AWS regions (0.48)
One lakeFS installation can now support repositories whose underlying storage buckets live in different AWS regions. This is useful for larger organizations that want to keep data in different regions, and in particular when using the native Spark client, which talks to the data objects directly instead of going through the lakeFS S3 Gateway.
4. Optional S3 Gateway DNS settings (0.48)
Before this change, a lakeFS installation required two DNS records – one pointing to the lakeFS OpenAPI endpoint and one to the S3 Gateway. The different hosts were used to determine whether an incoming request was an S3 request or an API request.
Configuring these DNS records (and explaining why they were necessary) was often a challenge for new users, so we’ve added a Request Resolver component within lakeFS that automatically determines where each request should be routed.
No more multiple DNS records – and a simplified deployment process!
5. lakeFS protected branches (0.52)
Protected branches block direct commits to a branch, so changes can only land through merges – which in turn run your pre-merge hooks. The guardrails this gives data lake users, and the guarantees you can make about your data, are pretty neat!
6. Support for LDAP Authentication (0.53)
No more syncing a user database or managing users internally in lakeFS. Tap into an existing LDAP server to reuse your users and credentials in lakeFS.
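As a sketch, LDAP is enabled in the lakeFS server configuration; the field names follow the docs from around this release, and all values below are illustrative for a typical directory layout:

```yaml
# excerpt of the lakeFS server configuration (values are examples)
auth:
  ldap:
    server_endpoint: ldaps://ldap.example.com:636
    bind_dn: uid=service-account,ou=users,dc=example,dc=com
    bind_password: "<secret>"                  # inject via env/secret store
    username_attribute: uid                    # attribute holding the login name
    user_base_dn: ou=users,dc=example,dc=com   # subtree to search for users
    default_user_group: Viewers                # group assigned on first login
```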
Watch the full event replay below!
The Road Ahead
Want to know what’s coming in future lakeFS releases? We covered four of the ones we’re most excited about.
- Improved UI collaboration – better visualizations for diffs, aggregated stats per prefix, and line-diffs for human-readable formats (like CSV)
- Tracking changes at the collection or object level – similar to the way `git log -- some/path` lets you see who changed a file and how it has changed over time
- Integration with dbt – We’ve heard of two challenges dbt users face where lakeFS would help: 1. creating isolated data environments without copying data, and 2. running `dbt test` before data is exposed to consumers. More to come here!
- lakeFS Metastore – As we’ve made clear, the current metastore solution from Hive isn’t cutting it. Creating a modern, version-aware metastore within lakeFS is part of our future vision.
Community Updates
First, we heard from Prafful at Volvo, who shared how they are using lakeFS to give their data science teams the maximum isolation possible through data versioning.
Volvo developed a solution using versioning from lakeFS with Git branching. Every git branch points to a specific commit in lakeFS. This provides a way for data science teams to apply development best practices for managing data, without creating new complexity.
Second, Oz announced our upcoming panel on Hive Metastore. It’s taking place virtually on November 10th with a great lineup of panelists. See the event page for more details and to tune in!
Thank you for reading! If you would like to stay in touch with the community...