Data Engineering

Best Practices Data Engineering

Big Data Testing: How To Test Data Pipelines In The ETL World

The lakeFS team
January 23, 2023

When testing ETLs for big data applications, data engineers usually face a challenge that originates in the very nature of data lakes. Since we’re writing or streaming huge volumes of data to a central location, it only makes sense to carry out data testing against equally massive amounts of data. You need to test with …

Data Engineering People

Data Engineering Conferences 2023

Ankit Srinivas
January 17, 2023

Conferences are back in full steam! 2023 is looking to be another great year for data conferences. This is a great time to learn, network, and engage with like-minded people. Let’s kick off this list with some of the top Data Engineering Conferences that you will want to attend! Developer Week 2023 Website: https://www.developerweek.com/ When: …

Data Engineering

ETL Testing: A Practical Guide

Iddo Avneri
January 16, 2023

What is ETL Testing? ETL testing is the process of evaluating and verifying that the ETL (Extract, Transform, Load) processes work correctly. What is ETL? An ETL process Extracts data in potentially many different structured or unstructured formats from multiple sources into a centralized repository. Then, an ETL process Transforms the data to a format …

Best Practices Data Engineering

CI/CD for data pipelines – The Shortest Path to Your Destination with lakeFS

The lakeFS team
February 7, 2023

Overview: Continuous integration (CI) of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment (CD) of data ensures the quality of data at each step of a production pipeline. These approaches are commonly used by application developers of …

Best Practices Data Engineering

Data Version Control – A Data Engineering Best Practice You Must Adopt

Einat Orr, PhD.
January 3, 2023

Imagine the software engineering world before distributed version control systems like Git became widespread. This is where the data world currently stands. The explosion in the volume of generated data forced organizations to move away from relational databases and instead store data in object storage. This escalated manageability challenges that teams need to address …

Data Engineering

Data Reproducibility and other Data Lake Best Practices

The lakeFS team
January 16, 2023

Overview: Data changes frequently, making the task of keeping track of its exact state over time difficult. Oftentimes, people maintain only one state of their data: its current state. Data lake best practices require reproducibility that lets us time travel between different versions of the data, giving us a snapshot of the data at different times …

Data Engineering Thought Leadership

4 Ways to Reduce Cloud Data Storage Costs

Oz Katz
November 7, 2022

Over the past year, words like recession, business slowdown, and budget cuts have been heard more and more often. These discussions are taking place not just in the economic press and the media but also in almost every company: in boardrooms, in management meetings, and when engaging with potential investors and customers. As …

Data Engineering Use Cases

How to Develop Spark ETL Pipelines in Isolation

Amit Kesarwani, Vino SD, Iddo Avneri
November 7, 2022

You’re bound to ask yourself this question at some point: Do I need to test the Spark ETLs I’m developing? The answer is yes; you certainly should – and not just with unit testing but also integration, performance, load, and regression testing. Naturally, the scale and complexity of your data set matter a lot, so …

