

Best Practices

Best Practices Machine Learning Tutorials

Building an ML Experimentation Platform for Easy Reproducibility Using lakeFS

Vino SD

MLOps is mostly data engineering. As organizations ride past the hype cycle of MLOps, we realize there is significant overlap between MLOps and data engineering. As ML engineers, we spend most of our time collecting, verifying, pre-processing, and engineering features from data before we can even begin training models. Only 5% of developing and deploying […]

Best Practices

How To Maintain Data Quality In Your Data Lake

The lakeFS Team

Enterprises increasingly use data as the foundation for their decisions and operations. The number of digital products that collect, analyze, and use data to feed decision-making algorithms and improve future services is also growing rapidly. Because of this, data quality has become the most important asset for businesses in almost […]

Best Practices Data Engineering

Big Data Testing: Benefits, Challenges & Tools

The lakeFS Team

When testing ETLs for big data applications, data engineers usually face a challenge that originates in the very nature of data lakes: since we’re writing or streaming huge volumes of data to a central location, it only makes sense to carry out data testing against equally massive amounts of data. You need to test with […]

Best Practices

Best Practices to Easily Adopt lakeFS

Iddo Avneri

lakeFS is gaining momentum as a solution for data versioning on top of an object store, and more and more data-driven organizations are adopting lakeFS as their data version control system. Once you start using lakeFS, the files on your object store will be organized in a new structure. Other solutions, such as Iceberg, also create a […]

Best Practices Data Engineering

Write-Audit-Publish for Data Pipelines: The Shortest Path to Your Destination with lakeFS

The lakeFS Team

Overview: Continuous integration (CI) of data is the process of exposing data to consumers only after ensuring it adheres to best practices for format, schema, and PII governance. Continuous deployment (CD) of data ensures the quality of data at each step of a production pipeline. These approaches are commonly used by application developers of […]

Best Practices Data Engineering

Data Version Control – A Data Engineering Best Practice You Must Adopt

Einat Orr, PhD

Imagine the software engineering world before distributed version control systems like Git became widespread. This is where the data world is currently at. The explosion in the volume of generated data forced organizations to move away from relational databases and instead store data in object storage. This escalated the manageability challenges that teams need to address […]

Best Practices Data Engineering

Git for Data – What, How and Why Now?

Einat Orr, PhD

Git, the Source Control, a.k.a. Code Version Control: when we wish for “Git for Data”, we already know what code version control is, and that Git is the standard tool for it. For the sake of those who have just joined us, let’s define those terms. Back in the ’60s of the 20th […]

Best Practices Tutorials

Building Rich CLI Applications with Go’s Built-in Templating

Barak Amar

Overview: The templating package text/template implements data-driven templates for generating textual output. Although we never execute a template more than once, we found the package easy to use and helpful for outputting text with colors, marshaling data, and rendering tabular information. By mapping additional functions by name, it is possible to extend […]

Best Practices Tutorials

Go Versions: Manage Multiple Go Versions with Go

Barak Amar

Updated on April 5, 2022. As a user of the Go programming language, I’ve found it useful to enable running multiple versions within a single project. If this is something you’ve tried or have considered, great! In this post I’ll present the when and the how of enabling multiple Go versions. Finally, we’ll conclude […]

Best Practices Data Engineering Tutorials

Concrete Graveler: Committing Data to Pebble SSTables

Ariel Shaqed (Scolnicov)

Introduction: In our recent version of lakeFS, we switched to basing metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but targets object stores rather than filesystems, with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits […]
