

Best Practices

Best Practices Machine Learning Tutorials

Building an ML Experimentation Platform for Easy Reproducibility Using lakeFS

Vino SD

MLOps is mostly data engineering. As organizations ride past the hype cycle of MLOps, we realize there is significant overlap between MLOps and data engineering. As ML engineers, we spend most of our time collecting, verifying, pre-processing, and engineering features from data before we can even begin training models. Only 5% of developing and deploying […]

Best Practices

How To Maintain Data Quality In Your Data Lake

The lakeFS Team

Enterprises increasingly use data as the foundation for their decisions and operations. The number of digital products that collect, analyze, and use data to feed decision-making algorithms and improve future services is also growing rapidly. Because of this, data quality has become the most important asset for businesses in almost […]

Best Practices Data Engineering

Big Data Testing: Benefits, Challenges & Tools

The lakeFS Team

When testing ETLs for big data applications, data engineers usually face a challenge that originates in the very nature of data lakes: since we’re writing or streaming huge volumes of data to a central location, it only makes sense to carry out data testing against equally massive amounts of data. You need to test with […]

Best Practices

Best Practices to Easily Adopt lakeFS

Iddo Avneri

lakeFS is gaining momentum as a solution for data versioning on top of an object store, and more and more data-driven organizations are adopting lakeFS as their data version control system. Once you start using lakeFS, the files on your object store will be organized in a new structure. Other solutions, such as Iceberg, also create a […]

Best Practices Data Engineering

Write-Audit-Publish for Data Pipelines: The Shortest Path to Your Destination with lakeFS

The lakeFS Team

Overview: Continuous integration (CI) of data is the process of exposing data to consumers only after ensuring it adheres to best practices for format, schema, and PII governance. Continuous deployment (CD) of data ensures the quality of data at each step of a production pipeline. These approaches are commonly used by application developers of […]

Best Practices Data Engineering

Data Version Control – A Data Engineering Best Practice You Must Adopt

Einat Orr, PhD

Imagine the software engineering world before distributed version control systems like Git became widespread. This is where the data world is currently at. The explosion in the volume of generated data forced organizations to move away from relational databases and instead store data in object storage. This escalated the manageability challenges that teams need to address […]

Best Practices Data Engineering

Git for Data – What, How and Why Now?

Einat Orr, PhD

Git, the Source Control, a.k.a. Code Version Control: when we wish for “Git for Data”, we already know what code version control is, and that Git is the standard tool for it. For the sake of those who have just joined us, let’s define those terms. Back in the ’60s of the 20th […]

Best Practices Tutorials

Building Rich CLI Applications with Go’s Built-in Templating

Barak Amar

Overview: The templating package text/template implements data-driven templates for generating textual output. Although we never execute a template more than once, we found the package easy to use and helpful for outputting text with colors, marshaling data, and rendering tabular information. By mapping additional functions by name, it is possible to extend […]

Best Practices Tutorials

Go Versions: Manage Multiple Go Versions with Go

Barak Amar

Updated on April 5, 2022. As a user of the Go programming language, I’ve found it useful to enable running multiple versions within a single project. If this is something you’ve tried or have considered, great! In this post I’ll present the when and the how of enabling multiple Go versions. Finally, we’ll conclude […]

Best Practices Data Engineering Tutorials

Concrete Graveler: Committing Data to Pebble SSTables

Ariel Shaqed (Scolnicov)

Introduction: In our recent version of lakeFS, we switched to basing metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but targets object stores rather than filesystems, with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits […]
