Project

Data Engineering Project

Building A Data Development Environment with lakeFS

Barak Amar
October 27, 2020

Overview: As part of our routine work with data, we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel with production to test the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, …


Project

The lakeFS Playground

Guy Hardonag
October 20, 2020

If you’re interested in playing around and exploring lakeFS, you can now easily get started using the Katacoda playground, which provides a personalized sandboxed environment – all from your browser, without installing anything. What you will learn: In this tutorial we will work with a sample dataset to give you a sense of the ways lakeFS …


Data Engineering Project

Introducing lakeview: A Visibility Tool for AWS S3 Based Data Lakes

Oz Katz
October 12, 2020

Lakeview is a new open source visibility tool for AWS S3 based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to provide you with an easy way to see the total size of your S3 bucket (prefix) storage. Instead of scanning billions of objects using the S3 API, which …


Data Engineering Project

How to Manage Your Data the Way You Manage Your Code

Einat Orr, PhD.
October 5, 2020

Fifty years ago it was very hard to collaborate over code. When developing large-scale software projects, it was difficult to manage changes to source code over time, as revision control tools were only starting to enter mainstream computing. The adoption of version control tools, first centralized and then distributed, changed all that, and now …


Go Project

Improving Postgres Performance Tenfold Using Go Concurrency

Tzahi Yaacobovicz
October 5, 2020

In this article I will show how Go concurrency enabled us to cut through a daunting DB performance barrier. This blog post continues our journey to big data performance. The first post on this issue discussed in-process caching in Go. The Pain: lakeFS is a versioned directory over object stores like AWS S3 and GCS …


Go Project

In-process caching in Go: scaling lakeFS to 100k requests/second

Barak Amar
October 5, 2020

This is the first in a series of posts describing our journey of scaling lakeFS. In this post we describe how adding an in-process cache to our Go server sped up our authorization flow. Background: lakeFS is an open-source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build …


Data Engineering Project

The Quick Guide for Running Presto Locally on S3

Guy Hardonag
September 11, 2020

This post covers our experience running Presto in a local environment with the ability to query Amazon S3 and other S3-compatible systems. We will: describe the components needed and how to configure them; provide a dockerized environment you can run; show an example of running the provided environment and querying a publicly …


Data Engineering Project

Introducing lakeFS

Einat Orr, PhD.
September 11, 2020

lakeFS is an open source platform that delivers resilience and manageability to your existing object-storage based data lake. With lakeFS you can build repeatable, atomic and versioned data lake operations – from complex ETL jobs to data science and analytics.
