Compression Benchmarks for PostgreSQL Archives
By far, my favorite RDBMS is PostgreSQL. Having worked with MySQL, MSSQL, Cassandra, Redis, and more, the versatility of PostgreSQL continues to inspire me to write better and more complex SQL. I have a piece of software that reaches out to various news websites while following best practices for crawling: scraping content using the URLs provided by the sitemap.xml and following the rules set forth by the robots.txt file. The content of these sites can be anything, but the general approach is to collect as much news as possible to see if I can develop a series of heuristics to use as technical indicators.
I've been running the software for about two years now, and a massive PostgreSQL table has grown out of it. Today, I'd like to start making regular backups of the data in that table. This article focuses on benchmarking how long it takes to back up a table using the pg_dump program and Docker.
Standing on the shoulders of giants, I've found a comprehensive review of compression algorithms. I'm not concerned with parallel processing; I'll stick to evaluating the programs as provided. What I'm looking to glean is a definitive and consistent measurement of how long it will take to export and compress the data in the massive PostgreSQL table.
Benchmark Setup
The benchmark setup is fairly straightforward. A table called location tracks various metrics about URLs, such as the crawl_delay, the domain of the URL, the change_freq, and the lastmod. These properties come from a site's sitemap.xml, which in turn allows for the development of an algorithm that selects more relevant pages while still letting the crawler search for archived content without overburdening the website. The location table will be used to identify the best compression algorithm before tackling the much larger table.
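To make the setup concrete, here is a minimal sketch of what the location table might look like. The column names come from the description above; the types, the id and url columns, and the constraints are assumptions rather than the real schema.

# Hypothetical schema sketch for the location table (types and constraints assumed)
docker exec -i postgresql-datastore psql -U dbuser -d dbname <<'SQL'
CREATE TABLE IF NOT EXISTS location (
    id          BIGSERIAL PRIMARY KEY,
    domain      TEXT NOT NULL,   -- domain of the URL
    url         TEXT NOT NULL,   -- page URL discovered via sitemap.xml
    crawl_delay INTEGER,         -- crawl delay observed for the domain
    change_freq TEXT,            -- changefreq hint from the sitemap
    lastmod     TIMESTAMPTZ      -- lastmod hint from the sitemap
);
SQL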
# Basic shape of each benchmark: record a start time, dump the table from
# inside the container to a file on the host, then record a stop time.
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
date -d "$timestamp_start" +"%s"
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname > $host_filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
The database runs in Docker, specifically the postgis:15-3.3-alpine image. Port 5432 is exposed to the host, but we won't use it, because it seems that connecting to a port exposed to the host routes traffic through the LAN. Instead, we'll export data inside the container and compress the output of the exec command to a filepath on the host.
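If you want to sanity-check that routing claim on your own setup, a rough comparison of the two paths looks like the sketch below; it assumes the PostgreSQL client tools are installed on the host and that the password is supplied via PGPASSWORD for the host-port connection.

# Path 1: through the port published on the host
PGPASSWORD=password pg_dump -h 127.0.0.1 -p 5432 -U dbuser -t location -a -d dbname | dd of=/dev/null status=progress
# Path 2: entirely inside the container, streamed out over docker exec
container_id=$(docker ps | grep datastore | awk '{print $1}')
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname | dd of=/dev/null status=progress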
Here is how the database was initialized using Docker:
docker run -p "5432:5432" --name postgresql-datastore -e 'POSTGRES_USER=dbuser' -e 'POSTGRES_PASSWORD=password' -e 'POSTGRES_DB=dbname' -d --shm-size=12g postgis:15-3.3-alpine
The compression algorithms to be tested are zstd, gzip, bzip2, lz4, and xz.
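Before kicking off the benchmark, it's worth a quick check that all five programs are actually installed on the host:

# Report any compression tool that is missing from the host
for tool in zstd gzip bzip2 lz4 xz; do
    command -v "$tool" >/dev/null || echo "missing: $tool"
done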
Here is the full script used to run the benchmark:
#!/usr/bin/env bash
set -e
TABLES="location"
timestamp=$(date '+%Y-%m-%dT%H:%M:%S')
dump_dir=/datastore/backups
for table in $TABLES; do
echo "ZSTD Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.zstd
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
zstd -T2 > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: ztsd. Filesize: $filesize. Took $timestamp_diff seconds"
echo "GZIP Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.gzip
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
gzip -c > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: gzip. Filesize: $filesize, Took $timestamp_diff seconds"
echo "bzip2 Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.bzip2
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
bzip2 -cz > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: bzip2. Filesize: $filesize. Took $timestamp_diff seconds"
echo "lz4 Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.lz4
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
lz4 > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: lz4. Filesize: $filesize. Took $timestamp_diff seconds"
echo "xz Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.xz
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
xz -cz > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: xz. Filesize: $filesize. Took $timestamp_diff seconds"
done
Benchmark Results
Benchmark for Table: location, Compression: zstd. Filesize: 1.6G. Took 88 seconds
Benchmark for Table: location, Compression: gzip. Filesize: 1.7G. Took 200 seconds
Benchmark for Table: location, Compression: bzip2. Filesize: 1.3G. Took 1144 seconds
Benchmark for Table: location, Compression: lz4. Filesize: 2.5G. Took 88 seconds
Benchmark for Table: location, Compression: xz. Filesize: 1.3G. Took 2662 seconds
A core trade-off when selecting a compression algorithm is the time taken relative to the compression ratio: the size of the file before compression divided by its size after compression.
$$ \text{Compression Ratio} = \frac{\text{Uncompressed Size}}{\text{Compressed Size}} $$
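As a purely hypothetical example, a 10GB dump that compresses down to a 2GB archive has a compression ratio of 5:

$$ \frac{10\ \text{GB}}{2\ \text{GB}} = 5 $$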
Understanding the Benchmark Results
I'm more interested in the backup running quickly and am willing to accept slightly larger archives. bzip2 and xz both produced the smallest files at 1.3GB, but archiving took roughly nineteen and forty-four minutes respectively. zstd took about a minute and a half and produced a slightly larger file at 1.6GB.
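Given that, zstd looks like the right choice for my use case. A minimal sketch of the recurring backup command, following the same docker exec pattern as the benchmark script (the destination directory here is an assumption; use whatever backup path you prefer):

# Back up the location table with zstd, mirroring the benchmark pipeline
container_id=$(docker ps | grep datastore | awk '{print $1}')
timestamp=$(date '+%Y-%m-%dT%H:%M:%S')
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname | \
    zstd -T2 > /datastore/backups/location.$timestamp.sql-data.zstd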
Data Stream
I'm not archiving files; I'm archiving a stream coming from a docker exec command, and that has additional overhead to consider. Rather than trying to characterize the overhead, let's produce a metric for how many MB/s of data the docker exec command transmits. How much data is being piped into the various compression programs? Piping the same stream into dd for about a minute tells us enough about how much data is being sent to the compressors:
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd > /dev/null
5936424960 bytes (5.9 GB, 5.5 GiB) copied, 60.3537 s, 98.4 MB/s
Knowing an average of 98.4 MB/s is being transmitted, we now have enough information to estimate the compression ratio of each archive.
$$ \frac{\text{Seconds} \times \text{Average MB/s}}{\text{Compressed Size}} $$
Algorithm | Math | Ratio |
---|---|---|
zstd | $$ \frac{88 \times 98.4}{1.6} $$ | 5412.0 |
gzip | $$ \frac{200 \times 98.4}{1.7} $$ | 11576.471 |
bzip2 | $$ \frac{1144 \times 98.4}{1.3} $$ | 86592.0 |
lz4 | $$ \frac{88 \times 98.4}{2.5} $$ | 3463.680 |
xz | $$ \frac{2662 \times 98.4}{1.3} $$ | 201492.923 |
Compression ratios can vary based on a multitude of factors, most notably the amount of repeated information in the data being archived. For instance, text data often repeats while random data does not, so we can achieve higher compression ratios on archives of text data (such as a PostgreSQL dump).
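A quick way to see this effect is to compress two synthetic 100MB inputs with zstd, one highly repetitive and one random, and compare the output sizes:

# Repetitive text compresses to a tiny fraction of its original size...
yes 'https://example.com/news/article' | head -c 100M | zstd | wc -c
# ...while random bytes barely shrink at all.
head -c 100M /dev/urandom | zstd | wc -c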