A Step-by-Step Configuration Tutorial
Introduction
In today’s data-driven world, organizations are grappling with an explosion in the volume of data, compelling them to shift away from traditional relational databases and embrace the flexibility of object storage. Storing data in object storage repositories offers scalability, cost-effectiveness, and accessibility. However, efficiently analyzing or querying structured data in these repositories requires the right tools. One such tool is Trino.
Trino is a powerful ANSI SQL-compliant distributed query engine tailor-made for handling big data. Its capabilities extend to data warehousing and analytics, encompassing tasks like data analysis, aggregating large datasets, and generating comprehensive reports. These workloads often fall under the category of Online Analytical Processing (OLAP).
But the deluge of data also brings forth complex challenges that organizations must address to unlock the full potential of their big data resources. One of these critical challenges is data version control, and this is where lakeFS comes into play. Data versioning is crucial because it supercharges data teams’ velocity while substantially reducing the costs associated with errors.
Let’s delve deeper into why data versioning matters:
- Create Isolated Dev/Test Environments: Data versioning allows you to create isolated development and testing environments, significantly reducing the risks associated with testing in a live production environment. This isolation is pivotal for maintaining data quality and ensuring accurate results. (For more details, check out the lakeFS ETL Testing guide).
- Promote Only High-Quality Data to Production: With data versioning, you can establish a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline for your data. This means you can promote only high-quality, thoroughly tested data to your production environment, enhancing the reliability of your analytics and decision-making processes. (For more insights, see the lakeFS CI/CD for Data guide).
- Fix Bad Data with Production Rollback: Data errors happen, but with data version control, you have the ability to roll back to a previous, error-free version swiftly. This minimizes the impact of erroneous data on your operations and allows for quick recovery. (For detailed instructions, refer to the lakeFS Rollback guide).
- Backfilling Data: Data backfilling is a common operation in data management. It involves populating historical data to fill gaps or update information. Data versioning simplifies this process, making it more reliable and efficient. (For a comprehensive guide, explore lakeFS’s foolproof guide to backfilling data).
Trino’s magic lies within its catalogs. A Trino catalog is essentially a container for schemas that references a data source through a connector. For instance, you can configure a Hive Metastore as a Trino catalog to access Hive data sources. When you execute SQL statements in Trino, they run against one or more of these catalogs. In this blog post, we will demonstrate:
- How to use Trino and lakeFS, both open-source tools, to version control your big data and analyze it at scale.
- How to use AWS Glue Catalog to organize your big data in multiple databases, schemas and tables.
Throughout this article, we will provide a step-by-step guide, accompanied by a Jupyter demo notebook, to demonstrate how lakeFS and Trino work together to manage your analytical workloads.
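Before diving in, a quick illustration of how that catalog addressing looks in practice. In Trino, a table is referenced as catalog.schema.table; the snippet below is purely illustrative, with "hive" standing in for the Glue-backed Hive catalog used later and a hypothetical schema and table:

# Purely illustrative: Trino addresses tables as <catalog>.<schema>.<table>.
# "hive" is the Glue-backed Hive connector catalog; the schema and table
# names are hypothetical.
query = "SELECT country, count(*) FROM hive.sales_db.customers GROUP BY country"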
Demo Notebook
You will run the demo notebook in EMR Studio Workspaces (Notebooks). You can read more about the demo in this git repository.
Demo Prerequisites
- lakeFS installed and running in your AWS environment or in lakeFS Cloud. If you don’t already have lakeFS running, either use lakeFS Cloud, which provides a free lakeFS server on demand with a single click, or Deploy lakeFS on AWS.
- AWS CLI
- AWS IAM permissions to create an EMR cluster, EMR Studio Workspaces, and Glue Catalog resources
Step 1: Acquire lakeFS Access Key and Secret
In this step, you will acquire the lakeFS access key and secret that will be used in the following steps. If you already have an access key and secret, you can skip this step.
Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user.
Log in to lakeFS and click Administration -> Create Access Key
A new key will be generated:
As instructed, copy the Secret Access Key and store it somewhere safe. You will not be able to access it again (but you will be able to create new ones).
Step 2: Demo Setup: Clone Samples Repo
Clone the repo:
git clone https://github.com/treeverse/lakeFS-samples
cd lakeFS-samples/01_standalone_examples/aws-glue-trino
Step 3: Demo Setup: Create EMR Cluster
Change the lakeFS endpoint URL, access key, and secret key in the ‘trino_configurations.json’ file included in the Git repo under the ‘01_standalone_examples/aws-glue-trino’ folder. If you are using lakeFS Cloud, your lakeFS endpoint URL will be in the format ‘https://username.aws_region_name.lakefscloud.io’.
[
  {
    "Classification": "trino-connector-hive",
    "Properties": {
      "hive.metastore": "glue",
      "hive.s3.aws-access-key": "lakeFS Access Key",
      "hive.s3.aws-secret-key": "lakeFS Secret Key",
      "hive.s3.endpoint": "lakeFS Endpoint URL",
      "hive.s3.path-style-access": "true",
      "hive.s3-file-system-type": "TRINO"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "fs.s3a.access.key": "lakeFS Access Key",
      "fs.s3a.secret.key": "lakeFS Secret Key",
      "fs.s3a.endpoint": "lakeFS Endpoint URL",
      "fs.s3a.path.style.access": "true"
    }
  }
]
Run the following AWS CLI command from your computer to create an EMR cluster. Change region_name, log-uri (the S3 bucket where you want to store Trino logs), and ec2_subnet_name before running the command:
aws emr create-cluster \
--release-label emr-6.11.1 \
--applications Name=Trino Name=JupyterEnterpriseGateway Name=Spark \
--configurations file://trino_configurations.json \
--region region_name \
--name lakefs_glue_trino_demo_cluster \
--log-uri s3://bucket-name/trino/logs/ \
--instance-type m5.xlarge \
--instance-count 1 \
--service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=ec2_subnet_name
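The cluster takes a few minutes to start, and it must be up before you attach a Studio Workspace to it in the next step. If you prefer checking its state from Python instead of the console, here is a minimal sketch with boto3; the region and cluster ID are placeholders:

# A minimal sketch, assuming boto3 is installed and your AWS credentials are
# configured; replace the placeholders with your region and the ClusterId
# returned by the create-cluster command.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region
response = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")  # placeholder ID
print(response["Cluster"]["Status"]["State"])  # e.g. STARTING, then WAITING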
Step 4: Demo Setup: Create & Launch EMR Studio Workspace
- If you don’t already have an EMR Studio, set up an AWS EMR Studio. The EMR cluster and EMR Studio should run in the same subnet.
- Create and launch an AWS EMR Studio Workspace.
- Attach previously created AWS EMR cluster to Studio Workspace and launch Workspace in JupyterLab IDE.
- If you get an error like the following when attaching the EMR cluster to the Workspace:
Cluster failed to attach to the Workspace. Reason: Attaching the workspace(notebook) failed. Notebook security group sg-0123456789 does not have an egress rule to connect with the master security group sg-9876543210. Please fix the security group or use the default option.
Then add an outbound rule to security group sg-0123456789 with destination sg-9876543210 on TCP port 18888, and add an inbound rule to security group sg-9876543210 with source sg-0123456789 on TCP port 18888 (a boto3 sketch of these two rules appears after this list).
- Click the “Upload Files” button in the JupyterLab UI and upload the ‘trino-glue-demo-notebook’ included in the Git repo under the ‘01_standalone_examples/aws-glue-trino’ folder to the EMR Studio Workspace:
- Open the notebook, select Change Kernel under the Kernel menu, and select the PySpark kernel to run the notebook:
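If you did hit the security-group error above, the two rules can also be added programmatically. A minimal sketch with boto3, using the placeholder group IDs from the error message; substitute your actual Workspace and EMR primary-node security groups:

# A minimal sketch, assuming boto3 is installed and configured; the group IDs
# mirror the placeholders from the error message above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
notebook_sg = "sg-0123456789"  # Workspace (notebook) security group
master_sg = "sg-9876543210"    # EMR primary-node security group

# Outbound rule: notebook security group -> master security group, TCP 18888
ec2.authorize_security_group_egress(
    GroupId=notebook_sg,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 18888, "ToPort": 18888,
        "UserIdGroupPairs": [{"GroupId": master_sg}],
    }],
)

# Inbound rule: master security group <- notebook security group, TCP 18888
ec2.authorize_security_group_ingress(
    GroupId=master_sg,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 18888, "ToPort": 18888,
        "UserIdGroupPairs": [{"GroupId": notebook_sg}],
    }],
)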
Step 5: Notebook Config
Complete the following config steps in the notebook:
- Change the Spark configuration: change the lakeFS endpoint URL, access key, and secret key, then run the cell:
%%configure -f
{
  "conf": {
    "spark.hadoop.fs.s3a.endpoint": "<lakeFS Endpoint URL>",
    "spark.hadoop.fs.s3a.access.key": "<lakeFS Access Key>",
    "spark.hadoop.fs.s3a.secret.key": "<lakeFS Secret Key>",
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
  }
}
- Change the lakeFS endpoint and credentials (the notebook’s setup later wires these into the lakeFS Python client; see the sketch after this list):
lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
- Storage Information: Since you are going to create a lakeFS repository in the demo, you will need to change the storage namespace to point to a unique path. If you are using your own bucket, insert the path to your bucket. If you are using our bucket in lakeFS Cloud, you will want to create a repository in a subdirectory of the sample repository that was automatically created for you.
For example, if you log in to your lakeFS Cloud and look at the sample repository’s storage namespace, add a subdirectory to the existing path (in this case, s3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/), i.e. insert:
storageNamespace = 's3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/image-segmentation-repo/'
- Install lakectl: install and configure lakectl (the lakeFS command-line tool) on your computer.
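For context, the endpoint and credential values above are what end up configuring the lakeFS Python client that the notebook uses later (for example in lakefs.branches.create_branch). Here is a minimal sketch of that wiring, assuming the lakefs_client package; the notebook’s setup cells handle this for you, so treat it as illustration only:

# Illustrative sketch only -- the demo notebook's setup cells do this for you.
# Assumes the lakefs_client package is installed and the three variables from
# the config cell above are set.
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host=lakefsEndPoint)
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey

lakefs = LakeFSClient(configuration)  # used later, e.g. lakefs.branches.create_branch(...)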
Step 6: Notebook Setup
You can run the Setup cells in the notebook without changing anything. During the setup, you will:
- Define a few variables
- Install & import Python libraries
- Create a lakeFS repository
- Create a connection to Trino using the PyHive library (see the sketch after this list)
- Create a Glue database
- Define the schema structure for the Customers and Orders tables
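The Trino connection mentioned above is only a few lines of PyHive. Here is a minimal sketch of what such a cell might look like; the host name is a placeholder for your EMR primary node, 8889 is the default Trino port on EMR, and the execute_trino_query helper is a guess at the shape of the helper the notebook defines, not its actual code:

# A minimal sketch, assuming PyHive is available in the notebook environment;
# the host below is a placeholder for the EMR primary node's private DNS name.
from pyhive import trino

trino_conn = trino.connect(
    host="ip-10-0-0-1.ec2.internal",  # placeholder: EMR primary node
    port=8889,                        # default Trino port on EMR
    catalog="hive",                   # Glue-backed Hive catalog
)

def execute_trino_query(query):
    """Run a query over the Trino connection and return the fetched rows."""
    cursor = trino_conn.cursor()
    cursor.execute(query)
    return cursor.fetchall()

print(execute_trino_query("SHOW CATALOGS"))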
Step 7: Run the Demo
- Clone the samples repo: if you haven’t already cloned the samples repo, clone it:
git clone https://github.com/treeverse/lakeFS-samples
- Upload sample data: run the next cell to print a lakectl command, then run that command on your computer to upload some sample data to the lakeFS repository:
print(f"cd lakeFS-samples && lakectl fs upload -s ./data/OrionStar lakefs://{repo_name}/main/ --recursive && lakectl commit lakefs://{repo_name}/main -m 'Uploaded sample data'")
- Create tables: next, register the Customers and Orders tables in the Glue Catalog and populate both with sample data using Spark. The tables use an external_location pointing to the main branch of the lakeFS repository.
- Execute Trino queries: run Trino queries to select the data from both tables.
- Create a branch: you will create a new branch to run your ETL job or test your new code:
lakefs.branches.create_branch(
    repository=repo_name,
    branch_creation=BranchCreation(
        name=etlBranch, source=mainBranch))
When you create a new branch, lakeFS performs a zero-copy operation to create a new isolated environment, so you can run your ETL job or test your new code without impacting your production data.
- Create a Glue database: you will create a Glue database whose name ends with the new branch name, and register your tables in this new database in the Glue Catalog. In the future, this process will be fully automated via lakeFS hooks.
- Insert data: insert a record into the Customers table in the new branch:
execute_trino_query(
    f"INSERT INTO {glueDatabaseName}_{etlBranch}.{customersTable} VALUES "
    "(1,'US','M',2,'Scott Gibbs','Scott','Gibbs','12APR1970','556 Greywood Rd',9260103713,1068,1030)")
- Execute Trino queries: you will execute Trino queries to select the data from the Customers table in the new branch as well as the main/production branch. You will notice that the newly inserted record is only in the new branch.
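To make that last comparison concrete, here is a hedged sketch of the kind of queries involved. It reuses execute_trino_query, glueDatabaseName, customersTable, and etlBranch from earlier cells and assumes the production tables live in the database created during setup while the branch tables live in the per-branch database; the notebook’s exact queries may differ:

# Hypothetical comparison, reusing names defined earlier in the notebook and
# assuming execute_trino_query returns the fetched rows.
branch_rows = execute_trino_query(
    f"SELECT count(*) FROM {glueDatabaseName}_{etlBranch}.{customersTable}")
main_rows = execute_trino_query(
    f"SELECT count(*) FROM {glueDatabaseName}.{customersTable}")

# The record inserted on the ETL branch shows up only in the branch database.
print(f"{etlBranch} branch count: {branch_rows}")
print("main branch count:", main_rows)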
Summary
You configured an EMR cluster to read data from lakeFS using Trino and to execute Git actions via Python.
lakeFS accelerates your team and simplifies version control for analytical workloads:
- Unique zero-copy operation to create test/dev environments for your data
- Allows you to clone Glue databases without copying any data
- Integrates easily with Trino so users can seamlessly access production as well as test/dev data
Want to learn more?
If you have questions about lakeFS, then drop us a line at hello@treeverse.io or join the discussion on lakeFS’ Slack channel.