
lakeFS Community
Guy Hardonag, Author

Last updated on January 24, 2024

This post covers our experience running Presto in a local environment with the ability to query Amazon S3 and other S3-compatible systems. We will:

  1. Describe the components needed and how to configure them.
  2. Provide a dockerized environment you could run.
  3. Show an example of running the provided environment and querying a publicly available dataset on S3.

TL;DR: If you just want to use the environment, you can skip ahead to the example.


As part of developing lakeFS, we needed to ensure that its API is compatible with S3. Doing that required enabling the option to run Presto on lakeFS.

One direction we considered was to deploy lakeFS on EC2 and run Presto on an EMR machine. But we needed a fast feedback loop and the ability to debug, so running Presto on a local machine and pointing it at our instance was the better option. When we tried to do that, we found it wasn't as straightforward as expected: information was scattered around, and finding the right environment and configuration wasn't easy.

As a result, we decided to write down the lessons learned from our experience and share a working environment to save others time and effort.

Why install and run Presto locally?

Potential use-cases that require installing and running Presto locally:

  1. Testing/debugging a local environment with lakeFS, MinIO, or any other S3-compatible system.
  2. Data CI/CD – validating data as part of a data CI/CD flow.
  3. Querying data on S3 without deploying anything on AWS. This way you are not dependent on your DevOps team or Athena’s quirks.

Now that we’ve established a reason to run Presto locally, let’s see how to do it.

Presto 101: The Presto Environment

Presto has two server types:

  1. Coordinator – The coordinator is responsible for parsing statements, planning queries and managing Presto worker nodes. It is the node to which a client connects to submit statements for execution.
  2. Worker – A Presto worker is a server in a Presto installation which is responsible for executing tasks and processing data. 

A Presto installation must have a Presto coordinator alongside one or more Presto workers. For testing, Presto supports configuring a single machine to function as both coordinator and worker, which is exactly what we needed.
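Such a single-node setup is driven by Presto’s etc/config.properties. As a rough sketch (the port and discovery URI here are illustrative defaults, not taken from our repository), it might look like:

```properties
# Run this node as both coordinator and worker (single-machine testing setup)
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```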

OK, that’s enough to get Presto up and running. Now, how does it connect and interact with S3? That’s where the connector comes in.

The Hive Connector

Connectors are the source of all data for queries in Presto. A connector translates the query and storage concepts of the underlying data source. It implements Presto’s SPI (Service Provider Interface), which allows it to interact with a resource using a standard API. Presto contains several built-in connectors; the Hive connector is used to query data on HDFS or on S3-compatible engines. The Hive connector doesn’t need Hive to parse or execute the SQL query in any way. Rather, the Hive connector only uses the Hive metastore.

The connector’s metastore can be configured in two documented ways: Hive Metastore or AWS Glue. In our case we wanted a self-contained environment, so Glue wasn’t applicable. A third, undocumented option is the file metastore, where metadata is stored alongside the data in the file system. The file metastore was also not an option for us, because we wanted to test that our product supports the Hive Metastore. Still, it can be useful for a local Presto environment, because it is simple to set up.
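For reference, a catalog backed by the file metastore needs only a couple of properties. A minimal sketch, using a hypothetical local directory for the metadata:

```properties
# etc/catalog/s3.properties (file-metastore variant)
connector.name=hive-hadoop2
hive.metastore=file
# Hypothetical local path where table metadata will be kept
hive.metastore.catalog.dir=file:///tmp/hive_catalog
```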

The Hive Metastore 

The Hive Metastore saves metadata about locations of data files, and how they are mapped to schemas and tables. This metadata is stored in a database, such as MySQL, and is accessed via Hive Metastore service.
You can find more information about Hive Metastore and AWS Glue here.

The Metastore Standalone

Since Hive 3.0, Hive metastore is provided as a separate release in order to allow non-Hive systems to easily integrate with it.
In our case we needed Hive for running MSCK REPAIR and for creating a table with symlinks as its input format, neither of which is supported in Presto today. This is why we didn’t use the standalone metastore. However, it is still a good option if you don’t need such Hive functionality.

A ready-to-use dockerized environment

I created a repository with a local dockerized environment you could use.

The environment consists of:

  1. A standalone Presto server and client
  2. A standalone Hive server and Hive Metastore
  3. A MariaDB database

(Environment diagram)
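The wiring between these three services can be sketched in a docker-compose file. The service names, images, and ports below are assumptions for illustration; the repository’s actual compose file may differ:

```yaml
version: "3"
services:
  mariadb:                       # backing database for the Hive Metastore
    image: mariadb:10.5
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_DATABASE: metastore_db
  hive:                          # HiveServer2 + Hive Metastore (custom image assumed)
    image: local/hive:3.1.2
    depends_on:
      - mariadb
    ports:
      - "10000:10000"            # HiveServer2 (beeline)
      - "9083:9083"              # Metastore Thrift service
  presto:                        # single node acting as both coordinator and worker
    image: prestosql/presto:latest
    depends_on:
      - hive
    ports:
      - "8080:8080"
```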


The following configurations allow the servers to communicate with each other and with the S3 storage.

The S3 catalog – 

As mentioned above, Presto accesses data via connectors. Connectors are mounted in catalogs, so we will create a configuration file for our S3 catalog.
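Assuming the Hive Metastore is reachable at the hostname hive (as on a docker-compose network), etc/catalog/s3.properties might look like the sketch below; the keys and endpoint are placeholders to fill in:

```properties
# etc/catalog/s3.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive:9083
hive.s3.aws-access-key=[YOUR_ACCESS_KEY]
hive.s3.aws-secret-key=[YOUR_SECRET_KEY]
# For S3-compatible systems (MinIO, lakeFS), also set:
# hive.s3.endpoint=[WANTED_ENDPOINT]
# hive.s3.path-style-access=true
```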


Hive – configure Hive and Hive Metastore (hive-site.xml):

   <property>
       <name>hive.metastore.uris</name>
       <!-- Hostname assumed to match the metastore service name -->
       <value>thrift://localhost:9083</value>
       <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
   </property>
   <!-- Uncomment to point Hive at an S3-compatible endpoint (e.g. MinIO, lakeFS):
   <property>
       <name>fs.s3a.endpoint</name>
       <value>[WANTED_ENDPOINT]</value>
   </property>
   -->

Hadoop –

   <!-- Uncomment to point Hadoop at an S3-compatible endpoint:
   <property>
       <name>fs.s3a.endpoint</name>
       <value>[WANTED_ENDPOINT]</value>
   </property>
   -->


In the next steps we will set up the dockerized Presto environment with Hive, create an external table over the public Amazon Customer Reviews dataset, load the table’s partition metadata using MSCK REPAIR, and run some queries in Presto.

Clone the repository

git clone

First you will need to change the configuration to fit your credentials.

The configuration files you will need to change are:

  • etc/
  • hive/conf/hdfs-site.xml
  • hive/conf/hive-site.xml


docker-compose up -d

That’s it! Now you can run Presto with the s3 catalog we created:

docker-compose exec presto presto-cli --catalog s3 --schema default

Add the Amazon review dataset using Hive

Start Hive server:

docker-compose exec -d hive /opt/apache-hive-3.1.2-bin/bin/hiveserver2

Note that the Hive server takes some time to start.

Once Hive is running, connect to the Hive server with beeline:

docker-compose exec hive /opt/apache-hive-3.1.2-bin/bin/beeline -u jdbc:hive2://localhost:10000

Once connected, you should see the beeline prompt.

Create the table with:

jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE amazon_reviews_parquet(
marketplace string,
customer_id string,
review_id string,
product_id string,
product_parent string,
product_title string,
star_rating int,
helpful_votes int,
total_votes int,
vine string,
verified_purchase string,
review_headline string,
review_body string,
review_date int,
year int)
PARTITIONED BY (product_category string)
STORED AS PARQUET
LOCATION 's3a://amazon-reviews-pds/parquet/';

Run MSCK REPAIR to load all the partition metadata into the metastore:

jdbc:hive2://localhost:10000> MSCK REPAIR TABLE amazon_reviews_parquet;

Go back to Presto to run queries on the table:

docker-compose exec presto presto-cli --catalog s3 --schema default

Let’s check what tables we have:

presto:default> show tables;

Now let’s see the ten most-voted reviews in video games:

presto:default> select product_title, review_headline, total_votes from amazon_reviews_parquet where product_category='Video_Games' order by total_votes DESC limit 10;

You should get the ten reviews with the most total votes.

That’s it! You can now run Hive and Presto on your machine and query S3.

To find out more about lakeFS head on over to our GitHub repository and docs.
