Lots of applications require a Postgres database. Before you can install them, you will need a Postgres database. How do you pick the right Postgres for your application? There are a bewildering variety of possible ways to acquire a database running on a Postgres instance, but the biggest choice is “build or buy”: whether to install a Postgres version on your own or to purchase it as a service.
lakeFS is yet another Postgres application, so it requires a Postgres database. At Treeverse we decided for now to buy Postgres as a service. I’ll explain the factors why we decided, and how you might decide to whether or not to follow us.
If your organization already runs Postgres instances, they may well be the best choice: there can be extensive in-house experience and expertise already present. And you probably don’t need this guide.
The most common objection to buying a database as a service is sheer price: a DB instance from your cloud provider can cost more than twice the cost of the virtual machine instance that it runs on. Obviously you pay extra to get much more than just the virtualized resources: you get expertise, much of the ops, easier backup and clustering, sometimes even some features.
For pricing I believe we should prefer to buy what is not a core part of our business. Pricing can actually make this decision easy. If the cost of buying Postgres as a service is a significant expenditure, then necessarily it is a core part of business — and clearly we need to develop the requisite in-house capability to support it. Otherwise cost is not a driving factor in the decision.
Even if you decide Postgres is a core capability for your organization, you may wish buy Postgres as a service for an initial ramp-up period, planning to move away later. While migrating database installations generally requires some downtime, this too is often a good choice.
Every cloud provider provides managed Postgres as a service:
- AWS provide RDS as well as clustered Postgres-compatible Aurora;
- Azure provide Azure Database for PostgreSQL;
- GCP provide Postgres on Cloud SQL
Similarly, some hosting providers also have a Postgres as a service offering. If you are already using a hosting provider that offers Postgres — Heroku is one example — this can be a convenient choice.
Postgres maintain a page PostgreSQL: Hosting Providers that list a wide range of options for purchasing hosted solutions along with support. Additionally, many hosted application platforms offer Postgres.
Developing in-house expertise is often desirable, and not only as a means to save costs. For instance, by allowing rapid resolution of issues without having to go outside the organization, expertise reduces risk.
On the flip side, managing your in-house database carries an ops burden, which will include monitoring, backup, clustering, and versioning. The last can be particularly troubling: occasionally you will need to upgrade Postgres on a running system.
A middle ground between buying a service and building it yourself is to contract the setup but manage ops in-house. I believe that unless you do database-related development, databases are generally a poor fit for this middle ground alternative. Contracting only the setup means you will take longer to build expertise in the database. Meanwhile you already start carrying the ops burden of the database.
Buying from a large service provider appears to limit available versions. Currently, for example, the latest supported Postgres version is 12.4; the major release Postgres 12 was on 2019-10-03. Service providers have been slow to provide version 12:
- AWS RDS started supporting Postgres version 12 on 2020-03-31, and supports Postgres versions 9.4, 9.5, 9.6, 10, and 11; however Aurora clusters do not support version 12 yet;
- Azure still supports only versions 9.5, 9.6, 10.11, and 11.6;
- GCP started providing Postgres version 12 on 2020-05-21;
- ElephantSQL supports Postgres version 12, but the free tier provides version 11.9.
At the same time service providers can upgrade Postgres (and underlying OS) versions more transparently. This can be particularly useful for receiving patch releases. So low availability of newer Postgres versions is not a significant detriment. Of course, certain applications may require an absolutely latest version of Postgres, which may make it impossible to buy Postgres service.
Numerous extensions are available for Postgres. Service providers provide the most popular of these. But they cannot provide all of them, and typically it is not possible to install new extensions that require compilation.
To allow users flexibility, lakeFS requires only Postgres 11. So availability of versions and extensions does not influence choice of provider for a database intended to run only lakeFS.
Creating a new database instance with a service provider requires plenty of time. For example, on AWS RDS creating a small empty database instance can take 10 minutes. This is very reasonable for one-time setup of long-lived production and even staging environments.
For short-lived environments or for testing, it can take too long to set up a hosted database. Luckily Postgres is an isolated component and one installation may easily be replaced with another installation. If the database is small and the generated data is not particularly valuable then there are good alternatives with quick setup times. These include creating a new database on an existing Postgres database instance or even — when database instance performance is not critical — running a database instance inside a container. If you have permission to create a new database, both of these alternatives have equivalent setup times that are typically less than a second.
Using an existing database instance allows you to take advantage of existing setup and ops on that database instance. For example, it will be easy to backup your new database if the database already has backups, and clustering exists at the instance level. By the same token, an existing database instance will offer significantly lower isolation than a container.
An example of creating new databases inside an existing database instance is the ElephantSQL free tier: you are allocated a database inside an existing instance and can see that other databases exist inside that instance. It uses Postgres roles to prevent data leakage (a side effect is that some operations are impossible, for instance any application that requires multiple roles).
An example of running a database instance inside a container is when running tests. Every test can create its own blank container running Postgres. This is probably the best solution for component testing: it offers complete isolation between tests, the database container is small, and it scales directly with the number of running tests.
All our lakeFS instances currently use plain or Aurora Postgres on RDS:
- Our database usage is currently not high enough to justify the cost savings of running our own instances..
- We build lakeFS, and supporting Postgres as a service is important to the product. So by design, lakeFS does not require latest-version features or esoteric extensions.
- Our core business is currently to develop rather than to operate lakeFS; this is essentially the “ramp-up” time above.
For isolation and speed of setup, our unit and component testing use a containerized Postgres instance.