In this article I will show how Go concurrency enabled us to cut through a daunting DB performance barrier. This post continues our journey toward big-data performance; the first post in the series discussed in-process caching in Go.
The Pain
lakeFS is a versioned directory over object stores like AWS S3 and GCS (resembling GitHub for object stores). It uses Postgres to store the metadata of billions of objects that may reside in a large data lake. This gives big-data tools like Spark and Presto transactionality and immediate consistency in a transparent manner.
In some use cases hundreds of thousands of distinct SELECT statements hit the DB every second, each fetching metadata on a single object from a single row.
Postgres has a small fixed cost associated with each statement execution. For example, Postgres takes an ACCESS SHARE lock on the table to prevent DDL statements from changing the table structure during the read. Multiply this small cost by hundreds of thousands of requests per second and it can bring a strong Postgres server to its knees, requiring an expensive DB upgrade.
The Solution
The problem stems from executing too many statements, each doing too little work. The obvious solution is to combine the work of many statements into one. It is as simple as replacing 10 “SELECT * FROM table WHERE key=keyN” statements with a single “SELECT * FROM table WHERE key IN (key1,key2,…,key10)”. Why didn’t we see this before?
Well, there are some hurdles to overcome before reaching performance heaven:
- Each get-object request arrives separately. It is selfish and wants to return ASAP. We need a traffic cop that puts requests in line without changing the interface.
- Response time guarantees (SLA) must be kept.
- After squeezing many requests into a single SQL statement, we need to return a distinct result to each waiting request.
We implemented a solution in lakeFS, and the result was a tenfold increase in read requests per second. For this blog post I wrote a simplified version of it, and used this program to benchmark batched vs. discrete reading. You can clone or fork it from our GitHub repository.
Before explaining how Go goroutines and channels enabled an elegant solution, I would like to show the benchmark and its results.
Benchmark setup
Hardware
- Dell XPS 13 3790 two-in-one laptop
- 32 GB memory
- Intel i7-1065G7 4-core processor (8 hyperthreads)
- 1 TB NVMe disk.
Software
- Ubuntu 20.04
- Postgres 11.7
- Go 1.15
- pgxpool Go Postgres driver by the benevolent Jack Christensen. pgxpool's support for native array parameters improved batching performance by 20%.
Database
The test DB contains a single table “random_read_test” with 350 million rows composed of the following columns:
- Pk – indexed text field containing a sequence number padded to 20 bytes.
- Payload – text field containing the sequence number padded to 35 bytes.
Table raw size is 20 GB, and the index adds another 13 GB. You may create the table using just two SQL statements from this SQL file.
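To illustrate how a batched read against this table looks with pgxpool's native array parameters, here is a minimal, self-contained sketch. It is not the benchmark code itself: the connection string and the key values are placeholders, and error handling is reduced to log.Fatal.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	ctx := context.Background()

	// Connection string is a placeholder - point it at your own DB.
	pool, err := pgxpool.Connect(ctx, "postgres://localhost/testdb")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// One batch of keys (placeholder values). pgxpool binds the []string to the
	// text-array parameter of ANY($1), so a single round trip fetches all of them.
	keys := []string{"key1", "key2", "key3"}
	rows, err := pool.Query(ctx,
		"SELECT pk, payload FROM random_read_test WHERE pk = ANY($1)", keys)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var pk, payload string
		if err := rows.Scan(&pk, &payload); err != nil {
			log.Fatal(err)
		}
		fmt.Println(pk, payload)
	}
}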
The test driver
I wrote a test driver that generates 3 million random keys and sends them via a channel to three hundred reader goroutines (the number is tunable by a constant) that perform the reads. It can test different batch sizes as well as non-batched reading. The important factors of a run are controlled by a few constants that appear in the test driver; you may adjust them to fit your requirements. On each run, the driver appends a statistics row to the “statistics.csv” file.
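As a rough sketch (not the repository code), the driver's skeleton might look like this. The constant names, key format, and channel wiring are illustrative assumptions; only the numbers (3 million keys, 300 readers, 350 million rows) come from the setup described above.

package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// Illustrative constants - the repository uses its own names and values.
const (
	numKeys    = 3_000_000   // total random keys to read
	numReaders = 300         // concurrent reader goroutines
	tableRows  = 350_000_000 // rows in the test table
)

func main() {
	keyChan := make(chan string, 1000)
	var wg sync.WaitGroup

	// Start the reader goroutines; each drains keys from the channel.
	for i := 0; i < numReaders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range keyChan {
				_ = key // the real driver calls ReadEntry(key) here and records timings
			}
		}()
	}

	// Generate random keys and push them to the readers.
	for i := 0; i < numKeys; i++ {
		// Keys are padded sequence numbers; the exact padding format is an assumption.
		keyChan <- fmt.Sprintf("%020d", rand.Intn(tableRows))
	}
	close(keyChan)
	wg.Wait()
}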
Benchmark results
| Batch size | Run duration (seconds) | Average response time | Worst-case response time | Reads per second |
|---|---|---|---|---|
| Non-batched | 50 | 5.2 ms | 770 ms | 60,000 |
| 1 | 87 | 7.9 ms | n/a | 34,500 |
| 2 | 43 | 4 ms | 42 ms | 70,000 |
| 4 | 21 | 2.2 ms | 31 ms | 143,000 |
| 8 | 9.5 | 0.7 ms | 39 ms | 316,000 |
| 16 | 5.5 | 0.45 ms | 21 ms | 550,000 |
| 24 | 5.0 | 0.4 ms | 22 ms | 600,000 |
| 32 | 4.8 | 0.38 ms | 18 ms | 625,000 |
Conclusions:
- Batching gives a tenfold throughput improvement
- Even a batch size of 2 is better than non-batched reading
- Response time is an order of magnitude better than non-batched
- The improvement from bigger batches flattens out at a batch size of 16
- I measured CPU consumption using the graphical system monitor. The graph for the batched runs was flat at around 85%, while the non-batched run was at around 80%. So almost all CPU resources are dedicated to reading, regardless of batching.
Using Go Concurrency for micro-batching
The software solution is described in the architecture diagram below. The rectangles represent a request for a key, the lines represent directional channels, while the rounded rectangles represent goroutines that are started on server initialization and don’t terminate.
It works as follows:
- When a read request arrives, the ReadEntry function is called with the requested key. In the code below you can see that a reply channel is created and sent, together with the key, to the orchestrator. ReadEntry then waits on the channel it created.
func ReadEntry(pk string) (*rowType, error) {
	replyChan := make(chan readResponse, 1) // create reply channel
	request := readRequest{pk: pk, replyChan: replyChan}
	readRequestChan <- request // send request to the orchestrator
	select { // wait for reply or timeout
	case response := <-replyChan:
		return response.testRow, response.err
	case <-time.After(ReadTimeout):
		return nil, ErrReadEntryTimeout
	}
}
- The read orchestrator accepts the key and reply channel and adds them to the current batch. It sends the batch for execution to the batch readers when one of the following conditions is met:
- The batch has reached the configured maximum number of keys (e.g. 16 keys).
- A configured amount of time has passed since the first key was added (currently it is 0.5 ms).
- The batch reader goroutine accepts the batch of keys from the channel, puts the received keys in the WHERE clause of the SELECT statement, and executes it. It then sends the result for each key to the requester over the channel the requester provided. If a key was not found, it sends a “not found” message. A simplified sketch of the orchestrator and batch reader is shown below.
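To make the flow concrete, here is a minimal sketch of the orchestrator and batch reader. It is not the repository code: the package name, type definitions, constants (maxBatchSize, batchTimeout), the errKeyNotFound error, and the channel wiring are illustrative assumptions; only the overall technique follows the description above.

package batchread

import (
	"context"
	"errors"
	"time"

	"github.com/jackc/pgx/v4/pgxpool"
)

// Assumed types, constants and channels - the repository's real definitions may differ.
type rowType struct{ pk, payload string }

type readRequest struct {
	pk        string
	replyChan chan readResponse
}

type readResponse struct {
	testRow *rowType
	err     error
}

const (
	maxBatchSize = 16                     // flush when the batch holds this many keys
	batchTimeout = 500 * time.Microsecond // or when the first key has waited this long
)

var (
	readRequestChan = make(chan readRequest, 1000)
	errKeyNotFound  = errors.New("key not found")
)

// orchestrator collects incoming requests into a batch and flushes it to the
// batch readers when the batch is full or the oldest request has waited long enough.
func orchestrator(batchChan chan<- []readRequest) {
	var batch []readRequest
	var timeout <-chan time.Time // nil until the first request of a batch arrives
	for {
		select {
		case req := <-readRequestChan:
			if len(batch) == 0 {
				timeout = time.After(batchTimeout) // start the clock on the first key
			}
			batch = append(batch, req)
			if len(batch) >= maxBatchSize { // batch is full - flush it
				batchChan <- batch
				batch, timeout = nil, nil
			}
		case <-timeout: // time is up - flush whatever we have
			batchChan <- batch
			batch, timeout = nil, nil
		}
	}
}

// batchReader executes one SELECT for a whole batch and routes each result back
// to the requester over the reply channel it provided.
func batchReader(pool *pgxpool.Pool, batchChan <-chan []readRequest) {
	for batch := range batchChan {
		keys := make([]string, len(batch))
		for i, req := range batch {
			keys[i] = req.pk
		}
		rows, err := pool.Query(context.Background(),
			"SELECT pk, payload FROM random_read_test WHERE pk = ANY($1)", keys)
		if err != nil {
			for _, req := range batch { // report the failure to every waiting requester
				req.replyChan <- readResponse{err: err}
			}
			continue
		}
		found := make(map[string]*rowType, len(batch))
		for rows.Next() {
			var r rowType
			if err := rows.Scan(&r.pk, &r.payload); err == nil {
				found[r.pk] = &r
			}
		}
		rows.Close()
		for _, req := range batch { // each requester gets its own row or a "not found" error
			if row, ok := found[req.pk]; ok {
				req.replyChan <- readResponse{testRow: row}
			} else {
				req.replyChan <- readResponse{err: errKeyNotFound}
			}
		}
	}
}

Note that ReadEntry creates its reply channel with a buffer of one, so the batch reader never blocks on a requester that has already timed out and stopped listening.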
Does this solution fit your problem?
Good software solutions fit specific situations, and are not necessarily good for other challenges. Micro-batching may fit your needs if:
- The workload is a single-statement transaction that is executed at a high rate and creates a bottleneck on the DB.
- The time between arrival of the first request in the batch and the last request is short and tolerable for your SLA.
- Transactional consistency is not required.
- DB bottleneck is CPU and not I/O.
Micro-batching works well for write operations as well, as long as they are not part of a multi step transaction.
This approach will not help when:
- The load is composed of many different transaction types.
- The transaction is non-trivial and needs transactional consistency.
- The bottleneck is on the I/O side, and you can not provide enough memory or IOPS to move the bottleneck to the CPU.
What next
If you got this far, then I hope you will take the time to clone the GitHub repository, and play with the micro-batching. If anything is not clear in the explanation, please drop us a line and we will clarify.
In a future post, we will discuss more ways in which we made Postgres pull the throughput and response time required in a big data environment.
Notes
- This code is for demonstration purposes. Error handling is at “panic” level to keep the purpose clear.
- The idea is inspired by the way Postgres handles commit processing. The configuration parameter “commit_delay” instructs Postgres to batch commits that arrive within the parameterized time period. It is disabled by default.