Compression Benchmarks for PostgreSQL Archives
By far, my favorite RDBMS is PostgreSQL. Having worked with MySQL, MSSQL, Cassandra, Redis, and more, the versatility of PostgreSQL continues to inspire me to write better and more complex SQL. I have a piece of software that reaches out to various news websites while following best practices for crawling: scraping content using the URLs provided by the sitemap.xml and following the rules set forth by the robots.txt file. The content of these sites can be anything, but the general approach is to collect as much news as possible to see if I can develop a series of heuristics to use as technical indicators.
I've been running the software for about two years now, and a massive PostgreSQL table has grown out of it. Today, I'd like to start making regular backups of the data in that table. This article focuses on benchmarking how long it takes to back up a table using the pg_dump program and Docker.
Standing on the shoulders of giants, I've found a comprehensive review of compression algorithms. I'm not concerned with parallel processing; I'll stick to evaluating the programs as provided. What I'm looking to glean is a definitive and consistent measurement of how long it will take to export and compress the data in the massive PostgreSQL table.
Benchmark Setup
The benchmark setup is fairly straightforward. A table called location tracks various metrics about URLs, such as the crawl_delay, the domain of the URL, the change_freq, and the lastmod. These properties come from a site's sitemap.xml, which in turn allows for the development of an algorithm that selects more relevant pages while still letting the crawler search for archived content without overburdening the website. The location table will be used to identify the best compression algorithm before tackling the much larger table.
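To make the setup concrete, here is a minimal sketch of what the location table might look like. The column names come from the description above; the types, the id and url columns, and the constraints are assumptions rather than the real schema.

# Hypothetical schema sketch for the location table (types and constraints assumed)
docker exec -i postgresql-datastore psql -U dbuser -d dbname <<'SQL'
CREATE TABLE IF NOT EXISTS location (
    id          BIGSERIAL PRIMARY KEY,
    domain      TEXT NOT NULL,   -- domain of the URL
    url         TEXT NOT NULL,   -- page URL discovered via sitemap.xml
    crawl_delay INTEGER,         -- crawl delay observed for the domain
    change_freq TEXT,            -- changefreq hint from the sitemap
    lastmod     TIMESTAMPTZ      -- lastmod hint from the sitemap
);
SQL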
# Basic shape of each benchmark: record a start time, dump the table from
# inside the container to a file on the host, then record a stop time.
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
date -d "$timestamp_start" +"%s"
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname > $host_filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
The database runs in Docker, specifically the postgis:15-3.3-alpine image. Port 5432 is exposed to the host, but we won't use it, because it seems that connecting to a port exposed to the host routes traffic through the LAN. Instead, we'll export data inside the container and compress the output of the exec command to a filepath on the host.
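If you want to sanity-check that routing claim on your own setup, a rough comparison of the two paths looks like the sketch below; it assumes the PostgreSQL client tools are installed on the host and that the password is supplied via PGPASSWORD for the host-port connection.

# Path 1: through the port published on the host
PGPASSWORD=password pg_dump -h 127.0.0.1 -p 5432 -U dbuser -t location -a -d dbname | dd of=/dev/null status=progress
# Path 2: entirely inside the container, streamed out over docker exec
container_id=$(docker ps | grep datastore | awk '{print $1}')
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname | dd of=/dev/null status=progress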
Here is how the database was initialized using Docker:
docker run -p "5432:5432" --name postgresql-datastore -e 'POSTGRES_USER=dbuser' -e 'POSTGRES_PASSWORD=password' -e 'POSTGRES_DB=dbname' -d --shm-size=12g postgis:15-3.3-alpine
The compression algorithms to be tested are zstd, gzip, bzip2, lz4, and xz.
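Before kicking off the benchmark, it's worth a quick check that all five programs are actually installed on the host:

# Report any compression tool that is missing from the host
for tool in zstd gzip bzip2 lz4 xz; do
    command -v "$tool" >/dev/null || echo "missing: $tool"
done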
Here is the full script used to run the benchmark:
#!/usr/bin/env bash
set -e
TABLES="location"
timestamp=$(date '+%Y-%m-%dT%H:%M:%S')
dump_dir=/datastore/backups
for table in $TABLES; do
echo "ZSTD Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.zstd
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
zstd -T2 > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: ztsd. Filesize: $filesize. Took $timestamp_diff seconds"
echo "GZIP Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.gzip
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
gzip -c > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: gzip. Filesize: $filesize, Took $timestamp_diff seconds"
echo "bzip2 Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.bzip2
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
bzip2 -cz > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: bzip2. Filesize: $filesize. Took $timestamp_diff seconds"
echo "lz4 Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.lz4
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
lz4 > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: lz4. Filesize: $filesize. Took $timestamp_diff seconds"
echo "xz Benchmark"
filepath=$dump_dir/$table.$timestamp.sql-data.xz
echo "Dumping table '$table' to $filepath"
container_id=$(docker ps|grep datastore|awk '{print $1}')
timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
xz -cz > $filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
filesize=$(du -sh $filepath|awk '{print $1}')
echo "Benchmark for Table: $table, Compression: xz. Filesize: $filesize. Took $timestamp_diff seconds"
done
Benchmark Results
Benchmark for Table: location, Compression: zstd. Filesize: 1.6G. Took 88 seconds
Benchmark for Table: location, Compression: gzip. Filesize: 1.7G. Took 200 seconds
Benchmark for Table: location, Compression: bzip2. Filesize: 1.3G. Took 1144 seconds
Benchmark for Table: location, Compression: lz4. Filesize: 2.5G. Took 88 seconds
Benchmark for Table: location, Compression: xz. Filesize: 1.3G. Took 2662 seconds
A core trade-off when selecting a compression algorithm is the time taken relative to the compression ratio: the size of the file before compression divided by its size after compression.
$$ \text{Compression Ratio} = \frac{\text{Uncompressed Size}}{\text{Compressed Size}} $$
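As a purely hypothetical example, a 10GB dump that compresses down to a 2GB archive has a compression ratio of 5:

$$ \frac{10\ \text{GB}}{2\ \text{GB}} = 5 $$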
Understanding the Benchmark Results
I'm more interested in the backup running quickly and am willing to accept slightly larger archives. bzip2 and xz both produced the smallest files at 1.3GB, but archiving took roughly nineteen and forty-four minutes respectively. zstd took about a minute and a half and produced a slightly larger file at 1.6GB.
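Given that, zstd looks like the right choice for my use case. A minimal sketch of the recurring backup command, following the same docker exec pattern as the benchmark script (the destination directory here is an assumption; use whatever backup path you prefer):

# Back up the location table with zstd, mirroring the benchmark pipeline
container_id=$(docker ps | grep datastore | awk '{print $1}')
timestamp=$(date '+%Y-%m-%dT%H:%M:%S')
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname | \
    zstd -T2 > /datastore/backups/location.$timestamp.sql-data.zstd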
Data Stream
I'm not archiving files; I'm archiving a stream coming from a docker exec command, and that has additional overhead to consider. Rather than trying to characterize the overhead, let's produce a metric for how many MB/s of data the docker exec command transmits. How much data is being piped into the various compression programs? Piping the same stream into dd for about a minute tells us enough about how much data is being sent to the compressors:
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd > /dev/null
5936424960 bytes (5.9 GB, 5.5 GiB) copied, 60.3537 s, 98.4 MB/s
Knowing an average of 98.4 MB/s is being transmitted, we now have enough information to estimate the compression ratio of each archive.
$$ \frac{\text{Seconds} \times \text{Average MB/s}}{\text{Compressed Size}} $$
Algorithm | Math | Ratio |
---|---|---|
zstd | $$ \frac{88 \times 98.4}{1.6} $$ | 5412.0 |
gzip | $$ \frac{200 \times 98.4}{1.7} $$ | 11576.471 |
bzip2 | $$ \frac{1144 \times 98.4}{1.3} $$ | 86592.0 |
lz4 | $$ \frac{88 \times 98.4}{2.5} $$ | 3463.680 |
xz | $$ \frac{2662 \times 98.4}{1.3} $$ | 201492.923 |
Compression ratios can vary based on a multitude of factors, most notably the amount of repeated information in the data being archived. For instance, text data often repeats while random data does not, so we can achieve higher compression ratios on archives of text data (such as a PostgreSQL dump).
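A quick way to see this effect is to compress two synthetic 100MB inputs with zstd, one highly repetitive and one random, and compare the output sizes:

# Repetitive text compresses to a tiny fraction of its original size...
yes 'https://example.com/news/article' | head -c 100M | zstd | wc -c
# ...while random bytes barely shrink at all.
head -c 100M /dev/urandom | zstd | wc -c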