This post aims to cover our experience running Presto in a local environment with the ability to query Amazon S3 and other S3 Compatible Systems. We will:
- Describe the components needed and how to configure them.
- Provide a dockerized environment you could run.
- Show an example of running the provided environment and querying a publicly available dataset on S3.
TL;DR: If you just want to use the environment, you can skip to the example.
As part of developing lakeFS we needed to ensure that its API is compatible with S3. Doing that required enabling the option to run Presto on lakeFS.
One direction we considered was to deploy lakeFS on EC2 and run Presto on an EMR machine. We needed a fast feedback loop and the ability to debug, so running Presto on a local machine and pointing it at our instance was the best option. When we tried to do that, we noticed it was not as straightforward as we expected: information was scattered around, and finding the right environment and configuration was not easy.
As a result, we decided to write down the lessons learned from our experience and share a working environment to save others time and effort.
Why install and run Presto locally?
Potential use-cases that require installing and running Presto locally:
- Testing/debugging a local environment with lakeFS, MinIO, or any other S3-compatible system.
- Data CI/CD – validating data as part of Data CI/CD flow
- Query Data on S3 without deploying anything on AWS. This way you are not dependent on your DevOps team or Athena’s quirks.
Now that we've established a reason to run Presto locally, let's see how to do it.
Presto 101: The Presto Environment
Presto has two server types:
- Coordinator – The coordinator is responsible for parsing statements, planning queries and managing Presto worker nodes. It is the node to which a client connects to submit statements for execution.
- Worker – A Presto worker is a server in a Presto installation which is responsible for executing tasks and processing data.
A Presto installation must have a Presto coordinator alongside one or more Presto workers. For testing, Presto supports setting up a single machine that functions as both coordinator and worker, which is exactly what we needed.
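For reference, a single-node setup is controlled by Presto's etc/config.properties. A minimal sketch looks like this (port and discovery URI are illustrative; adjust them for your machine):

```properties
# etc/config.properties - one node acting as both coordinator and worker
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```

The key line is node-scheduler.include-coordinator=true, which lets the coordinator also schedule work on itself.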
OK, that’s enough to get Presto up and running. Now, how does it connect and interact with S3? That’s where the connector comes in.
The Hive Connector
Connectors are the source of all data for queries in Presto. A connector translates between the query and the storage concepts of the underlying data source. It implements Presto’s SPI (Service Provider Interface), which allows it to interact with a resource using a standard API. Presto contains several built-in connectors; the Hive connector is used to query data on HDFS or on S3-compatible systems. The Hive connector doesn’t need Hive to parse or execute the SQL query in any way. Rather, the Hive connector only uses the Hive metastore.
The connector’s metastore can be configured in two documented ways: Hive metastore or AWS Glue. In our case we wanted a self-contained environment, so Glue wasn’t applicable. Another option, which isn’t documented, is the file metastore, where metadata and data are stored in the file system. The file metastore was also not an option for us because we wanted to test that our product supports Hive Metastore. Still, the file metastore can be useful for a local Presto environment, because it is simple to set up (check out the example).
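For completeness, a catalog file using the file metastore can be sketched as follows (the catalog directory path is illustrative, and availability of these properties depends on your Presto version):

```properties
# etc/catalog/local.properties - Hive connector backed by the file metastore
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///tmp/hive_catalog
```

With this catalog, no metastore service or database is needed; table metadata is written under the configured directory.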
The Hive Metastore
The Hive Metastore saves metadata about locations of data files, and how they are mapped to schemas and tables. This metadata is stored in a database, such as MySQL, and is accessed via Hive Metastore service.
You can find more information about Hive Metastore and AWS Glue here.
The Metastore Standalone
Since Hive 3.0, the Hive metastore has been provided as a separate release, so that non-Hive systems can easily integrate with it.
In our case we needed Hive for running MSCK REPAIR and for creating a table with symlinks as its input format; neither is supported in Presto today. This is why we didn’t use the standalone metastore. However, it is still a good option if you don’t need such Hive functionality.
A ready-to-use dockerized environment
I created a repository with a local dockerized environment you could use.
The environment consists of:
- Standalone Presto server and client
- Standalone Hive Server and Hive Metastore
- MariaDB Database
The following configurations allow the servers to communicate with each other and with the S3 storage.
The S3 catalog –
As mentioned above, Presto accesses data via connectors. Connectors are mounted in catalogs, so we will create a configuration file for our S3 catalog:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive:9083
hive.non-managed-table-writes-enabled=true
hive.s3select-pushdown.enabled=true
hive.s3.aws-access-key=[YOUR_ACCESS_KEY]
hive.s3.aws-secret-key=[YOUR_SECRET_KEY]
#hive.s3.endpoint=http:[WANTED_ENDPOINT]
Hive – configure Hive and Hive Metastore
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://0.0.0.0:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mariadb:3306/metastore_db</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>admin</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>[YOUR_ACCESS_KEY]</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>[YOUR_SECRET_KEY]</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <!-- <property> -->
  <!--   <name>fs.s3a.endpoint</name> -->
  <!--   <value>[WANTED_ENDPOINT]</value> -->
  <!-- </property> -->
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>[YOUR_ACCESS_KEY]</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>[YOUR_SECRET_KEY]</value>
  </property>
  <!-- <property> -->
  <!--   <name>fs.s3a.endpoint</name> -->
  <!--   <value>[WANTED_ENDPOINT]</value> -->
  <!-- </property> -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
In the next steps we will set up a dockerized Presto environment with Hive, create an external table over the public Amazon Customer Reviews dataset, load the table’s partition metadata using MSCK REPAIR, and run some queries in Presto.
Clone the repository
git clone https://github.com/treeverse/blog-presto-local.git
First, you will need to edit the configuration files shown above (the S3 catalog properties and the Hive configuration files) to fit your credentials. Then bring up the environment:
docker-compose up -d
That’s it! Now you can run Presto with the S3 catalog we created:
docker-compose exec presto presto-cli --catalog s3 --schema default
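Before wiring up any tables, it’s worth sanity-checking the setup from inside presto-cli with a couple of trivial statements (neither touches S3):

```sql
-- verify the coordinator answers queries at all
SELECT 1;

-- list the schemas exposed by the s3 catalog
SHOW SCHEMAS;
```

If both return without errors, the Presto server and the catalog configuration are wired up correctly.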
Add the Amazon reviews dataset using Hive
Start Hive server:
docker-compose exec -d hive /opt/apache-hive-3.1.2-bin/bin/hiveserver2
Note that it will take Hive server some time to start.
Once Hive is running, connect to the Hive server with Beeline:
docker-compose exec hive /opt/apache-hive-3.1.2-bin/bin/beeline -u jdbc:hive2://localhost:10000
Once connected, you should see the Beeline prompt.
Create the table:
jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE amazon_reviews_parquet(
    marketplace string,
    customer_id string,
    review_id string,
    product_id string,
    product_parent string,
    product_title string,
    star_rating int,
    helpful_votes int,
    total_votes int,
    vine string,
    verified_purchase string,
    review_headline string,
    review_body string,
    review_date int,
    year int)
  PARTITIONED BY (product_category string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  LOCATION 's3a://amazon-reviews-pds/parquet/';
Run MSCK REPAIR to load all the partition metadata into the metastore:
jdbc:hive2://localhost:10000> MSCK REPAIR TABLE amazon_reviews_parquet;
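As an aside, recent Presto versions expose a Hive connector procedure that discovers partitions without going through Hive at all; whether it’s available depends on the Presto version in the environment. A sketch, run from presto-cli against our s3 catalog:

```sql
-- roughly equivalent to MSCK REPAIR: register partitions found on S3
CALL system.sync_partition_metadata('default', 'amazon_reviews_parquet', 'ADD');
```

If your Presto supports it, this removes one reason to keep HiveServer2 around.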
Go back to Presto to run queries on the table:
docker-compose exec presto presto-cli --catalog s3 --schema default
Let’s check what tables we have:
presto:default> show tables;
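If the table was created successfully in Hive, the output should look something like:

```
         Table
------------------------
 amazon_reviews_parquet
(1 row)
```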
Now let’s see the top ten voted reviews in video games:
presto:default> select product_title, review_headline, total_votes from amazon_reviews_parquet where product_category='Video_Games' order by total_votes DESC limit 10;
You should get the ten reviews with the most votes.
That’s it! Now you have the option to run Hive and Presto on your machine and query S3.
To find out more about lakeFS head on over to our GitHub repository and docs.