Itai Admi

Itai is an R&D team leader at lakeFS, an open-source...

Last updated on December 16, 2024

This article explains in detail how to use lakeFS with Amazon EMR. Today, it’s common to manage a data lake using cloud object stores like AWS S3, Azure Blob Storage, or Google Cloud Storage as the underlying storage service. Each cloud provider offers a set of managed services to simplify the way you consume data directly from the lake.

What is Amazon EMR?

Amazon EMR is an industry-leading big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. EMR makes it possible to run these frameworks on demand with elasticity and high reliability.

Applications running on EMR, from simple ETLs to complex ML pipelines, read their input from S3 and write their output back to S3. lakeFS gives any application running on EMR Git-like operations over that input and output data, letting you manage it like a Git repository and enabling experimentation, testing, and CI/CD for the data.
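With lakeFS’s S3 gateway, a path addresses a repository and a branch rather than a plain bucket: `s3a://<repository>/<branch>/<key>`. As a minimal sketch (the repository and branch names below are hypothetical), a small helper that builds such paths for a Spark job:

```python
def lakefs_s3a_path(repo: str, branch: str, key: str) -> str:
    """Build an s3a:// URI addressing an object on a lakeFS branch.

    lakeFS's S3 gateway maps the bucket part of the URI to a repository
    and the first path segment to a branch (or commit reference).
    """
    return f"s3a://{repo}/{branch}/{key.lstrip('/')}"

# Hypothetical names for illustration:
input_path = lakefs_s3a_path("example-repo", "main", "events/2024/12/")
output_path = lakefs_s3a_path("example-repo", "experiment-1", "events_clean/")
# A Spark job would then read from input_path and write to output_path,
# e.g. spark.read.parquet(input_path) ... df.write.parquet(output_path)
```

Writing the output to a different branch (here, `experiment-1`) is what makes experimentation safe: the `main` branch is untouched until you choose to merge.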

lakeFS with Amazon EMR

Benefits of EMR applications

There are numerous benefits to using Amazon EMR, such as the flexibility provided by AWS and the cost savings compared to building your own on-premises infrastructure. Let’s dive into the details:

  • Cost efficiency – Amazon EMR cost is determined by the type and amount of Amazon EC2 instances deployed and the region in which your cluster is launched. On-demand pricing is inexpensive, but you may save even more by purchasing Reserved Instances or Spot Instances.
  • Integration – Amazon EMR interfaces with other AWS services to provide networking, storage, and security capabilities for your cluster.
  • Easy deployment – Your EMR cluster is made up of EC2 instances that complete the tasks you assign to them. All you need to do is choose the instance size and type that best meet your cluster’s processing requirements: batch processing, low-latency queries, streaming data, or big data storage.
  • Scalability and flexibility – Amazon EMR allows you to scale your cluster up or down as your computing demands change. You can add instances during peak workloads and remove instances to control costs when demand subsides.
  • Monitoring – Amazon EMR monitors your cluster’s nodes and automatically terminates and replaces instances if they fail. You can utilize the Amazon EMR management interfaces and log files to diagnose cluster issues, including failures and errors. Amazon EMR supports archiving log files in Amazon S3, allowing you to store logs and address issues long after your cluster has terminated.
  • Security – Amazon EMR uses other AWS services, such as IAM and Amazon VPC, as well as features like Amazon EC2 key pairs, to help you secure your clusters and data.

Common use cases for EMR applications

Here are the most common use cases for Amazon EMR:

  • Batch processing – EMR is best suited for batch processing tasks such as data cleansing, transformation, and ETL (Extract, Transform, Load) operations on huge datasets.
  • Data Warehousing – EMR can be used to analyze and transform data before loading it into a data warehouse such as Amazon Redshift.
  • Fraud Detection – EMR can examine transaction data to detect patterns and abnormalities linked with fraudulent activity.
  • Recommendation Systems – Use user behavior data to create recommendation systems that suggest products, content, and services to users.
  • Machine Learning – EMR can preprocess and alter data to feed clean and prepared data into machine learning models.
  • Log Analysis – EMR can be used to analyze and process log and event data in order to extract insights and detect trends.
  • Real-Time Analytics – EMR can process streaming data in real time, allowing enterprises to acquire insights and take actions as they happen.
  • Sentiment Analysis – EMR applications can use social media data or consumer feedback to better understand sentiment and opinions.
  • Clickstream Analysis – You can use EMR to gather and analyze clickstream data from websites or apps to better understand user behavior and engagement.

Examples of EMR applications

Amazon EMR makes deploying distributed data processing frameworks simple and cost-efficient. Furthermore, it separates computation and storage, allowing them to scale independently and improving resource utilization.

Historically, users have found it difficult to operate traditional data processing frameworks such as Apache Spark, especially when combined with other frameworks such as Hadoop. The solution was complicated, costly, and time-consuming. Organizations had to purchase and integrate hardware (servers, PCs, etc.) and then install and manage software. Of course, software and hardware would require continual upgrades, increasing costs and complexity.

Multiple lines of business would frequently share centralized cluster resources. As a result, clusters sat underutilized during idle periods, while SLAs were missed during peaks. And as your data grew, so did your infrastructure: because storage and compute were inextricably linked, expanding storage meant scaling expensive compute alongside it.


How to configure an EMR application with lakeFS

To configure Spark on EMR to work with lakeFS, we set the lakeFS credentials and endpoint in the appropriate fields. The exact configuration keys depend on the application running on EMR, but they follow the form:

  • lakeFS endpoint: *.fs.s3a.endpoint
  • lakeFS access key: *.fs.s3a.access.key
  • lakeFS secret key: *.fs.s3a.secret.key
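The `*` prefix varies by framework: Spark, for example, reads Hadoop filesystem settings under a `spark.hadoop.` prefix, while Hive reads them directly from its Hadoop configuration. A small sketch (the endpoint and credentials are the example values used throughout this guide; the prefixes shown are illustrative, not exhaustive) of expanding the key template:

```python
# The three lakeFS s3a settings, using this guide's example values.
LAKEFS_S3A_KEYS = {
    "fs.s3a.endpoint": "https://s3.lakefs.example.com",
    "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
    "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}

def with_prefix(prefix: str) -> dict:
    """Return the lakeFS s3a keys under a framework-specific prefix."""
    sep = "." if prefix else ""
    return {f"{prefix}{sep}{key}": val for key, val in LAKEFS_S3A_KEYS.items()}

spark_conf = with_prefix("spark.hadoop")  # e.g. spark.hadoop.fs.s3a.endpoint
hadoop_conf = with_prefix("")             # plain fs.s3a.endpoint, as in hive-site
```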

EMR encourages the use of s3:// paths with Spark, which routes requests through EMR’s proprietary driver. For this guide to work, you must use s3a:// paths instead.

The Spark job’s reads and writes will then be directed to the lakeFS installation through its S3 gateway.

There are two options for configuring an EMR cluster to work with lakeFS:

  • At cluster creation – all steps inherit the cluster configuration, so no step-specific configuration is needed.
  • Per step – the cluster is created with the default S3 configuration, and each step that uses lakeFS passes the appropriate configuration parameters.

EMR cluster creation and Job Setup with lakeFS

Use the configuration below when creating the cluster. You may delete any application configuration that doesn’t apply to your use case.

[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.s3.aws-access-key": "AKIAIOSFODNN7EXAMPLE",
      "hive.s3.aws-secret-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "hive.s3.endpoint": "https://s3.lakefs.example.com",
      "hive.s3-file-system-type": "PRESTO"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.catalogImplementation": "hive"
    }
  }
]
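Save this JSON to a file and pass it at cluster creation (for example, via the AWS CLI’s `--configurations file://...` flag on `aws emr create-cluster`). Since a typo in one of the many repeated properties is easy to miss, here is a small sketch that sanity-checks such a file before launching, reporting which classifications actually point at a lakeFS endpoint:

```python
import json

def endpoint_classifications(text: str) -> list:
    """Return the classifications in an EMR configurations JSON whose
    Properties set an S3/S3A endpoint (i.e. that point at lakeFS)."""
    return [
        conf["Classification"]
        for conf in json.loads(text)
        if any(key.endswith(".endpoint") for key in conf.get("Properties", {}))
    ]

# Minimal check against a fragment of the configuration above:
sample = """[
  {"Classification": "core-site",
   "Properties": {"fs.s3a.endpoint": "https://s3.lakefs.example.com"}},
  {"Classification": "spark-defaults",
   "Properties": {"spark.sql.catalogImplementation": "hive"}}
]"""
assert endpoint_classifications(sample) == ["core-site"]
```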

If a cluster was created without the above configuration, you can still use lakeFS by configuring each step individually.

For example, when creating a Spark job:

aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \
--steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE,\
Args=[--conf,spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE,\
--conf,spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY,\
--conf,spark.hadoop.fs.s3a.endpoint=https://s3.lakefs.example.com,\
s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]"
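The same step can also be submitted programmatically. A hedged sketch using boto3 (the jar path and credentials are the placeholders from this guide; `command-runner.jar` is EMR’s standard mechanism for invoking `spark-submit` from a step) that builds the step definition before sending it:

```python
def lakefs_spark_step(jar_path: str, endpoint: str,
                      access_key: str, secret_key: str) -> dict:
    """Build an EMR step dict that runs spark-submit with lakeFS s3a settings."""
    return {
        "Name": "SparkApplication",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's runner for spark-submit
            "Args": [
                "spark-submit",
                "--conf", f"spark.hadoop.fs.s3a.access.key={access_key}",
                "--conf", f"spark.hadoop.fs.s3a.secret.key={secret_key}",
                "--conf", f"spark.hadoop.fs.s3a.endpoint={endpoint}",
                jar_path,
            ],
        },
    }

step = lakefs_spark_step(
    "s3a://example-repo/main/path/to/app.jar",  # hypothetical jar location
    "https://s3.lakefs.example.com",
    "AKIAIOSFODNN7EXAMPLE",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
# To submit (requires AWS credentials and a running cluster):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[step])
```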

Coming up 

lakeFS will support a Hadoop client that separates data management from metadata management. It will also allow direct access to S3 from the Hadoop client, calling the lakeFS server only for metadata resolution. For more details, see the GitHub milestone.


If you enjoyed this article, check out our GitHub repo and Slack channel to participate in all our discussions.
