This article provides a detailed explanation of how to use lakeFS with Amazon EMR. Today it's common to manage a data lake on a cloud object store such as AWS S3, Azure Blob Storage, or Google Cloud Storage, with each cloud provider offering managed services that simplify consuming data directly from the lake.
Amazon EMR is an industry-leading big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. EMR makes it possible to run these frameworks on demand, with elasticity and high reliability.
Applications running on EMR, from simple ETLs to complex ML pipelines, read their input from S3 and write their output back to S3. lakeFS gives any application running on EMR Git-like operations, so that input and output data can be managed like a Git repository, enabling experimentation, testing, and CI/CD for data.


Configuration
To configure Spark on EMR to work with lakeFS, set the lakeFS credentials and endpoint in the appropriate fields. The exact configuration keys depend on the application running on EMR, but they take the form:
- lakeFS endpoint: *.fs.s3a.endpoint
- lakeFS access key: *.fs.s3a.access.key
- lakeFS secret key: *.fs.s3a.secret.key
EMR encourages users to access S3 with s3:// URIs, which go through EMRFS, Amazon's proprietary driver. For this guide to work, you must use s3a:// URIs instead. The Spark job's reads and writes will then be directed to the lakeFS instance through its S3 gateway.
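To illustrate, here is a minimal PySpark sketch of a job whose s3a filesystem points at lakeFS. The repository name example-repo, the branch main, and the dataset paths are placeholders, and fs.s3a.path.style.access is set on the assumption that the gateway is addressed in path style:

from pyspark.sql import SparkSession

# Minimal sketch: "example-repo" and "main" are placeholder repository and
# branch names -- replace them with your own.
spark = (
    SparkSession.builder.appName("lakefs-example")
    # Direct all s3a:// traffic to the lakeFS S3 gateway instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.lakefs.example.com")
    # lakeFS credentials take the place of AWS credentials.
    .config("spark.hadoop.fs.s3a.access.key", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.hadoop.fs.s3a.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    # Assumes the gateway is addressed in path style rather than virtual-host style.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# lakeFS paths take the form s3a://<repository>/<branch>/<object path>.
df = spark.read.parquet("s3a://example-repo/main/events/")
df.write.parquet("s3a://example-repo/main/events-copy/")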
There are two options for configuring an EMR cluster to work with lakeFS:
- At cluster creation – all steps use the cluster configuration, and no step-specific configuration is needed.
- On each step – the cluster is created with the default S3 configuration, and each step that uses lakeFS passes the appropriate configuration parameters.
Configuration on cluster creation
Use the configuration below when creating the cluster. You may delete the configuration for any application that is not relevant to your use case.
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.s3.aws-access-key": "AKIAIOSFODNN7EXAMPLE",
      "hive.s3.aws-secret-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "hive.s3.endpoint": "https://s3.lakefs.example.com",
      "hive.s3-file-system-type": "PRESTO"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "fs.s3.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3.endpoint": "https://s3.lakefs.example.com",
      "fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE",
      "fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "fs.s3a.endpoint": "https://s3.lakefs.example.com"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.catalogImplementation": "hive"
    }
  }
]
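If you create clusters programmatically, the same JSON can be passed through the boto3 EMR client. The snippet below is a sketch, with the file name lakefs-emr.json, the release label, instance types, and region as placeholder values:

import json
import boto3  # assumes AWS credentials are configured in your environment

# Load the JSON configuration shown above from a local file.
with open("lakefs-emr.json") as f:
    configurations = json.load(f)

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="lakefs-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=configurations,  # applied cluster-wide, so steps need no extra config
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])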
Configuration on each step
If the cluster was created without the above configuration, you can still use lakeFS by passing the configuration parameters when adding a step. For example, when adding a Spark job:
aws emr add-steps --cluster-id j-197B3AEGQ9XE4 \
--steps="Type=Spark,Name=SparkApplication,ActionOnFailure=CONTINUE,\
Args=[--conf,spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE,\
--conf,spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY,\
--conf,spark.hadoop.fs.s3a.endpoint=https://s3.lakefs.example.com,\
s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar]"
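The same step can also be added programmatically with boto3, which avoids the shell escaping in the CLI form. This sketch reuses the cluster ID from the example above and relies on the fact that a CLI step of Type=Spark is executed as spark-submit through command-runner.jar:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-197B3AEGQ9XE4",  # cluster ID from the CLI example above
    Steps=[{
        "Name": "SparkApplication",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # Type=Spark in the CLI is shorthand for command-runner.jar + spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--conf", "spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE",
                "--conf", "spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
                "--conf", "spark.hadoop.fs.s3a.endpoint=https://s3.lakefs.example.com",
                "s3a://<lakefs-repo>/<lakefs-branch>/path/to/jar",
            ],
        },
    }],
)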
Coming up
lakeFS will support a Hadoop client that separates data management from metadata management: the Hadoop client will access data directly on S3, calling the lakeFS servers only for metadata resolution. For more details, see the relevant GitHub milestone.
If you enjoyed this article, check out our GitHub repo and Slack channel to participate in all our discussions.