Concepts

What is Bexhoma?

Bexhoma (Benchmark Experiment Host Manager) is a Python-based orchestration tool for running Database Management System (DBMS) benchmark experiments in cloud environments. It automates the full lifecycle of a benchmark: provisioning the DBMS inside a Kubernetes cluster, loading data, executing the workload, collecting metrics, and tearing everything down again — all from a single command or script.

The central motivation is reproducibility and comparability. Bexhoma encodes every hardware and software decision (DBMS image, resource limits, node placement, data scale, driver concurrency) in a single parameterized run, which makes it straightforward to repeat the same experiment with a different DBMS or configuration variant and compare the results side by side.

Bexhoma has been used on Amazon Web Services, Google Cloud, Microsoft Azure, IBM Cloud, Oracle Cloud, and Minikube installations, and has been validated across more than twenty DBMS products.

Goals

Automation: a single command starts and stops all cluster components — no manual pod creation or data import.
Repeatability: every experiment run is parameterized and logged; re-running the same command reproduces the same setup.
Comparability: multiple DBMS or configurations can be benchmarked within the same experiment, sharing data and producing a unified result set.
Scalability: driver and loader pods can be scaled out to match cloud-native environments where the workload is generated by many parallel clients.
Flexibility: supports analytical and transactional benchmarks, custom SQL workloads, multi-tenant architectures, and connection pooling.

Core Concepts

Experiment

An experiment is a complete benchmarking run: a specific DBMS tested with a specific dataset and workload. Each experiment receives a unique numeric code (derived from the start timestamp) that identifies all its Kubernetes resources, result files, and monitoring intervals.

An experiment consists of:

one or more configurations (DBMS variants being compared)
one or more experiment runs — the same setup executed multiple times for statistical confidence
one or more clients (benchmark phases within a run, e.g., with different concurrency levels)

Configuration

A configuration is a named DBMS variant, for example PostgreSQL or PostgreSQL-A. All components belonging to one configuration — the SUT pod, loader pods, benchmarker pods, storage volume — share a configuration label in Kubernetes. This label is the primary key for grouping and comparing results.

Phase

A phase (also called client in result files) is a discrete step within an experiment run that uses a fixed driver concurrency. Experiments can sweep across multiple concurrency levels by defining several phases, e.g., first with one benchmarker pod, then with eight.

A phase is identified by <configuration>-<experiment_run>-<client>.

Example: PostgreSQL-1-2-3 is the third client phase of the second experiment run of configuration PostgreSQL-1 (the first PostgreSQL instance).

Job

A job is a single benchmark execution within a phase. Multiple jobs can run in parallel within the same phase; the 1-based index of each parallel job is called benchmark_run. A job is identified by <phase>-<benchmark_run>, i.e. <configuration>-<experiment_run>-<client>-<benchmark_run>.

Example: PostgreSQL-1-2-3-4 — fourth parallel benchmark job in phase PostgreSQL-1-2-3.

Most phases have a single job (benchmark_run = 1), in which case the job name is always <phase>-1.

Connection

A connection is a single driver pod executing within a benchmark job. Its identifier is <job>-<pod>, i.e. <configuration>-<experiment_run>-<client>-<benchmark_run>-<pod>.

Example: PostgreSQL-1-2-3-4-5 — fifth pod of the fourth benchmark job in phase PostgreSQL-1-2-3.

In collectors, which aggregate results from multiple experiments, all identifiers are prefixed with the experiment code: <code>-<connection> and <code>-<job>.
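
The naming scheme above can be taken apart mechanically. The following is a minimal sketch, not part of Bexhoma's API: the names BexhomaId and parse_identifier are hypothetical, and the caller has to state which level (phase, job, or connection) the identifier belongs to, because the configuration name may itself end in a number.

```python
from typing import NamedTuple, Optional


class BexhomaId(NamedTuple):
    """Parts of a phase, job, or connection identifier (hypothetical helper)."""
    configuration: str
    experiment_run: int
    client: int
    benchmark_run: Optional[int] = None  # present for jobs and connections
    pod: Optional[int] = None            # present for connections only


def parse_identifier(identifier: str, level: str) -> BexhomaId:
    """Split an identifier into its parts.

    The numeric fields are read from the right, because the configuration
    name may itself contain hyphens (e.g. "PostgreSQL-1").
    """
    n_numeric = {"phase": 2, "job": 3, "connection": 4}[level]
    parts = identifier.rsplit("-", n_numeric)
    if len(parts) != n_numeric + 1:
        raise ValueError(f"{identifier!r} is not a valid {level} identifier")
    configuration, *numbers = parts
    return BexhomaId(configuration, *(int(n) for n in numbers))


connection = parse_identifier("PostgreSQL-1-2-3-4-5", "connection")
phase = parse_identifier("PostgreSQL-1-2-3", "phase")
```

Here connection.pod is 5 and connection.benchmark_run is 4, while phase leaves both fields unset. Identifiers carrying an experiment code prefix would need that prefix removed first.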

Host Setting

A host setting describes what the DBMS environment looks like:

the DBMS Docker image
the persistent storage volume holding the data
init scripts run before and after data loading (schema creation, index building, etc.)
Kubernetes resource limits (CPU, memory) and node placement constraints

Benchmark Setting

A benchmark setting describes how the workload is driven:

the benchmark tool and workload variant (e.g., TPC-H, YCSB workload A)
the data scale factor
the number and concurrency of driver processes or threads
timeouts and repetition counts
tool-specific parameters (target throughput, transaction mix, etc.)
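
To make the split between the two settings concrete, the sketch below models them as plain Python dataclasses. All class names, field names, default values, and script file names are assumptions made for illustration; they do not reflect how Bexhoma stores its configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HostSetting:
    """What the DBMS environment looks like (hypothetical field names)."""
    image: str                                   # DBMS Docker image
    storage_volume: str                          # PVC holding the data
    pre_load_scripts: List[str] = field(default_factory=list)
    post_load_scripts: List[str] = field(default_factory=list)
    cpu_limit: str = "4"                         # Kubernetes CPU limit
    memory_limit: str = "16Gi"                   # Kubernetes memory limit
    node_selector: Dict[str, str] = field(default_factory=dict)


@dataclass
class BenchmarkSetting:
    """How the workload is driven (hypothetical field names)."""
    benchmark: str                               # e.g. "TPC-H" or "YCSB workload A"
    scale_factor: int
    driver_pods: int
    threads_per_pod: int
    timeout_seconds: int = 600
    repetitions: int = 1
    tool_parameters: Dict[str, str] = field(default_factory=dict)


# Illustrative values only.
host = HostSetting(
    image="postgres:16",
    storage_volume="bexhoma-storage-postgresql-tpch-10",
    pre_load_scripts=["create_schema.sql"],
    post_load_scripts=["create_indexes.sql", "analyze.sql"],
)
workload = BenchmarkSetting(
    benchmark="TPC-H", scale_factor=10, driver_pods=1, threads_per_pod=8
)
```

Keeping the two settings separate means the same host setting can be paired with several benchmark settings, and vice versa.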

Architecture

Bexhoma deploys a set of cooperating components inside a Kubernetes cluster.

┌─────────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                             │
│                                                                 │
│  ┌──────────┐   ┌─────────────┐   ┌────────────────────────┐    │
│  │  SUT     │   │  Worker(s)  │   │  Connection Pool       │    │
│  │  (DBMS)  │◄──│  (shards /  │   │  (PgBouncer, optional) │    │
│  │          │   │  replicas)  │   └────────────────────────┘    │
│  └────┬─────┘   └─────────────┘                                 │
│       │                                                         │
│  ┌────┴──────┐  ┌─────────────┐   ┌────────────────────────┐    │
│  │  Storage  │  │  Loader(s)  │   │  Benchmarker(s)        │    │
│  │  Volume   │  │  (data gen) │   │  (workload drivers)    │    │
│  └───────────┘  └─────────────┘   └────────────────────────┘    │
│                                                                 │
│  ┌───────────┐  ┌─────────────┐   ┌────────────────────────┐    │
│  │ Monitoring│  │  Message    │   │  Dashboard / Results   │    │
│  │ Prometheus│  │  Queue      │   │  (shared result folder)│    │
│  │ + cAdvisor│  │  (Redis)    │   │                        │    │
│  └───────────┘  └─────────────┘   └────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
         ▲
         │ kubectl / Kubernetes API
         │
  ┌──────┴──────┐
  │  Bexhoma    │  (orchestrator — runs locally)
  │  Python     │
  └─────────────┘

System Under Test (SUT)

The SUT is the DBMS deployment — a Kubernetes Deployment backed by a persistent volume (PVC). For distributed DBMS (e.g., Citus, CockroachDB, YugabyteDB), additional worker pods form the sharding or replication layer.

Storage Volume

Data is stored in a dedicated PVC whose name encodes the DBMS, benchmark, and scale factor (e.g., bexhoma-storage-postgresql-tpch-10). Volumes are reused across experiments: once data is loaded it is not re-imported unless explicitly requested, saving time for large scale factors.
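
The naming and reuse rule can be sketched as follows. The pattern is inferred from the single example name above, and both function names are hypothetical; the actual naming logic in Bexhoma may differ.

```python
from typing import Set


def storage_volume_name(dbms: str, benchmark: str, scale_factor: int) -> str:
    """Build a PVC name following the pattern of the example above.

    Kubernetes object names must be lowercase, hence the .lower().
    """
    return f"bexhoma-storage-{dbms}-{benchmark}-{scale_factor}".lower()


def needs_loading(volume_name: str, loaded_volumes: Set[str],
                  force_reload: bool = False) -> bool:
    """Data is imported only if no loaded volume exists or a reload is forced."""
    return force_reload or volume_name not in loaded_volumes


name = storage_volume_name("PostgreSQL", "tpch", 10)
already_loaded = {"bexhoma-storage-postgresql-tpch-10"}
reload_needed = needs_loading(name, already_loaded)
```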

Loaders

Loader pods generate and import benchmark data into the SUT. Multiple loader pods can work in parallel, splitting the work by partition or table, to reduce import time at large scale factors.
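
One simple way to split work by partition is to give each loader a contiguous range of rows of nearly equal size. The sketch below illustrates that idea only; the function name is hypothetical, and the actual partitioning depends on the data generator of each benchmark.

```python
from typing import List


def split_rows(total_rows: int, num_loaders: int) -> List[range]:
    """Assign each loader pod a contiguous, nearly equal range of row numbers."""
    if num_loaders < 1:
        raise ValueError("at least one loader is required")
    base, extra = divmod(total_rows, num_loaders)
    chunks = []
    start = 0
    for loader in range(num_loaders):
        # The first `extra` loaders take one additional row each.
        size = base + (1 if loader < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = split_rows(total_rows=10, num_loaders=3)
```

With 10 rows and 3 loaders the ranges cover rows 0 to 3, 4 to 6, and 7 to 9, so every row is imported exactly once.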

Benchmarkers

Benchmarker pods run the actual workload against the SUT. They can be scaled out to simulate many concurrent clients in a cloud-native setting. Results (throughput, latency) are written to the shared result folder.

Message Queue

A Redis-based message queue coordinates distributed loader and benchmarker pods, distributing tasks and collecting status signals without coupling pods directly.
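
The coordination pattern can be illustrated without a Redis server. The sketch below swaps in Python's in-process queue.Queue and threads for the Redis queue and the pods; it shows only the decoupling idea (workers pull tasks and push status signals) and is not Bexhoma's implementation. All names are hypothetical.

```python
import queue
import threading
from typing import List, Tuple


def run_workers(tasks: List[str], num_workers: int) -> List[Tuple[int, str, str]]:
    """Distribute tasks to workers via a queue and collect their status signals."""
    task_queue: "queue.Queue[str]" = queue.Queue()
    status_queue: "queue.Queue[Tuple[int, str, str]]" = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    def worker(worker_id: int) -> None:
        # Each worker pulls tasks until the queue is empty; workers never
        # talk to each other directly.
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return
            status_queue.put((worker_id, task, "done"))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    statuses = []
    while not status_queue.empty():
        statuses.append(status_queue.get())
    return statuses


statuses = run_workers(["lineitem", "orders", "customer", "part"], num_workers=2)
```

Which worker handles which task is not deterministic, but every task is reported as done exactly once.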

Monitoring

Bexhoma integrates with Prometheus and cAdvisor to record hardware metrics (CPU, memory, network I/O) for each component during every benchmark phase. Monitoring can be scoped to the SUT only, or extended to all cluster components. Metrics are fetched after each phase and stored alongside benchmark results for joint analysis.
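
Fetching metrics for one phase amounts to a range query over the phase's time interval. The sketch below builds such a request URL for the Prometheus HTTP API endpoint /api/v1/query_range with a cAdvisor CPU metric. The host name, the pod name pattern, and the timestamps are assumptions for illustration, and this is not necessarily how Bexhoma issues its queries.

```python
from urllib.parse import urlencode


def range_query_url(base_url: str, promql: str, start: int, end: int,
                    step_seconds: int = 15) -> str:
    """Build a Prometheus range-query URL for the interval [start, end].

    start and end are Unix timestamps marking the beginning and end of a phase.
    """
    params = {
        "query": promql,
        "start": start,
        "end": end,
        "step": f"{step_seconds}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# CPU usage rate of all containers in pods matching a (hypothetical) SUT name pattern.
cpu_query = 'sum(rate(container_cpu_usage_seconds_total{pod=~"sut-.*"}[1m]))'
url = range_query_url("http://prometheus.example:9090", cpu_query,
                      start=1700000000, end=1700000600)
```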

Dashboard

An in-cluster dashboard (Dash / Plotly) serves as an interactive result browser. Results can also be inspected locally via the bexhoma CLI tool or Jupyter notebooks.

Experiment Lifecycle

An experiment proceeds through the following stages:

1. Prepare: Bexhoma creates Kubernetes deployments, services, and PVCs for the chosen DBMS and resource profile.
2. Start SUT: The DBMS container starts and becomes ready to accept connections.
3. Initialize: Pre-loading init scripts run inside the SUT (e.g., schema creation, extension installation).
4. Load Data: Loader pods generate and import benchmark data; the volume is marked as loaded when complete.
5. Post-Load Init: Post-loading scripts run (e.g., index creation, statistics gathering).
6. Benchmark: Benchmarker pods execute the workload for each configured phase; metrics and logs are collected.
7. Evaluate: Results are written to the result folder; a summary is printed.
8. Stop & Clean: All ephemeral components (SUT, loaders, benchmarkers, monitoring) are removed from the cluster; the data volume is retained.

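Steps 3–5 (Initialize, Load Data, and Post-Load Init) are skipped when a loaded data volume already exists for the chosen combination of DBMS, benchmark, and scale factor. The SUT itself is still prepared and started, because it is removed at the end of every experiment.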

Supported Benchmarks

Benchmark	Type	Tool
TPC-H	Analytical	DBMSBenchmarker
TPC-DS	Analytical	DBMSBenchmarker
YCSB	Key-value / transactional	YCSB
TPC-C	Transactional	HammerDB
TPC-C	Transactional	Benchbase
Custom SQL	Analytical / mixed	DBMSBenchmarker

Scale-Out and Cloud-Native Benchmarking

Bexhoma supports cloud-native benchmarking patterns where the load-generating infrastructure is itself distributed:

Parallel loaders: multiple pods split data generation and import, each responsible for a partition of the dataset.
Parallel benchmarkers: multiple driver pods share the workload, aggregated into a single throughput figure per phase.
Sweep experiments: a single command can run several phases with increasing driver counts (e.g., 1, 2, 4, 8 pods), producing a concurrency-vs-throughput curve in one pass.
Distributed DBMS: worker stateful sets extend the SUT for sharded or replicated deployments.
Connection pooling: an optional PgBouncer tier sits between drivers and the DBMS to test pooling overhead or multiplexing behaviour.
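
For parallel benchmarkers and sweep experiments, the per-pod results of each phase have to be combined before a concurrency-vs-throughput curve can be drawn. The sketch below assumes that the phase figure is the sum of the per-pod throughputs; the function name and the numbers are made up for illustration and are not measurements.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def throughput_curve(pod_results: Iterable[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Sum per-pod throughput into one figure per phase.

    Each input item is (number of driver pods in the phase, throughput of one pod).
    The result is sorted by pod count, ready for plotting.
    """
    totals = defaultdict(float)
    for pods_in_phase, pod_throughput in pod_results:
        totals[pods_in_phase] += pod_throughput
    return sorted(totals.items())


# Made-up per-pod throughput values (operations per second) for phases with 1, 2, and 4 pods.
measurements = [
    (1, 1000.0),
    (2, 950.0), (2, 930.0),
    (4, 800.0), (4, 800.0), (4, 800.0), (4, 800.0),
]
curve = throughput_curve(measurements)
```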

Multi-Tenancy

Bexhoma supports benchmarking multi-tenant architectures where multiple tenants share a DBMS instance. Three isolation strategies can be evaluated and compared:

Strategy	Description
Schema-per-tenant	Each tenant has a dedicated schema in a shared database
Database-per-tenant	Each tenant has a dedicated database within one DBMS instance
Container-per-tenant	Each tenant gets a dedicated DBMS container

Init scripts use placeholders ({BEXHOMA_SCHEMA}, {BEXHOMA_DATABASE}) that Bexhoma substitutes at runtime. Result collection and aggregation across tenants are handled transparently by the collectors.
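
The placeholder substitution can be pictured as a plain text replacement per tenant. In the sketch below, only the two placeholder names come from the text above; the function name, the SQL template, and the tenant naming are assumptions for illustration.

```python
def render_init_script(script: str, schema: str, database: str) -> str:
    """Replace the two placeholders in an init script.

    Plain str.replace is used instead of str.format so that other braces
    in the SQL text are left untouched.
    """
    return (script
            .replace("{BEXHOMA_SCHEMA}", schema)
            .replace("{BEXHOMA_DATABASE}", database))


template = (
    "CREATE SCHEMA IF NOT EXISTS {BEXHOMA_SCHEMA};\n"
    "SET search_path TO {BEXHOMA_SCHEMA};\n"
)

# Schema-per-tenant: one rendered script per tenant, all in the same database.
scripts = [
    render_init_script(template, schema=f"tenant_{i}", database="postgres")
    for i in range(3)
]
```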

Result Evaluation

Bexhoma structures its evaluation layer in two tiers:

Evaluators

An evaluator parses the raw result files (pod logs, pickles) for a single experiment code and exposes them as pandas DataFrames. Each benchmark type has its own evaluator class (evaluators.ycsb, evaluators.benchbase, evaluators.tpcc, evaluators.dbmsbenchmarker).

Collectors

A collector aggregates results across multiple experiment codes into a single analysis-ready view. Collectors are used when comparing results from several runs or parameter sweeps. Metric aggregation follows the metric type: counters are reported as deltas, gauges as means.
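
The aggregation rule can be stated in a few lines. This is a minimal sketch with a hypothetical function name, assuming the samples of one metric within one phase are given in time order: a counter only ever increases, so the meaningful value is its growth over the phase, while a gauge moves up and down, so its mean is reported.

```python
from statistics import mean
from typing import Sequence


def aggregate_metric(samples: Sequence[float], metric_type: str) -> float:
    """Aggregate the time-ordered samples of one metric within a phase."""
    if not samples:
        raise ValueError("no samples to aggregate")
    if metric_type == "counter":
        # Growth of the counter between the first and the last sample.
        return samples[-1] - samples[0]
    if metric_type == "gauge":
        return mean(samples)
    raise ValueError(f"unknown metric type: {metric_type!r}")


cpu_seconds = aggregate_metric([100.0, 160.0, 250.0], "counter")
memory_gib = aggregate_metric([2.0, 4.0, 6.0], "gauge")
```

In this example the counter yields 150.0 (CPU seconds consumed during the phase) and the gauge yields 4.0 (mean memory in GiB).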

Results can be inspected via:

The bexhoma CLI tool (status, summary, local result list)
The in-cluster dashboard (bexhoma dashboard)
A local dashboard (bexhoma localdashboard)
Jupyter notebooks (in images/evaluator_dbmsbenchmarker/notebooks/)

References

If you use Bexhoma in work contributing to a scientific publication, we kindly ask that you cite our application note [2] or [1]:

[1] A Framework for Supporting Repetition and Evaluation in the Process of Cloud-Based DBMS Performance Benchmarking

Erdelt, P.K. (2021). A Framework for Supporting Repetition and Evaluation in the Process of Cloud-Based DBMS Performance Benchmarking. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. TPCTC 2020. Lecture Notes in Computer Science, vol 12752. Springer, Cham. https://doi.org/10.1007/978-3-030-84924-5_6

[2] Orchestrating DBMS Benchmarking in the Cloud with Kubernetes

Erdelt, P.K. (2022). Orchestrating DBMS Benchmarking in the Cloud with Kubernetes. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. TPCTC 2021. Lecture Notes in Computer Science, vol 13169. Springer, Cham. https://doi.org/10.1007/978-3-030-94437-7_6

[3] DBMS-Benchmarker: Benchmark and Evaluate DBMS in Python

Erdelt, P.K., Jestel, J. (2022). DBMS-Benchmarker: Benchmark and Evaluate DBMS in Python. Journal of Open Source Software, 7(79), 4628. https://doi.org/10.21105/joss.04628

[4] A Cloud-Native Adoption of Classical DBMS Performance Benchmarks and Tools

Erdelt, P.K. (2024). A Cloud-Native Adoption of Classical DBMS Performance Benchmarks and Tools. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. TPCTC 2023. Lecture Notes in Computer Science, vol 14247. Springer, Cham. https://doi.org/10.1007/978-3-031-68031-1_9

[5] Benchmarking Multi-Tenant Architectures in PostgreSQL

Erdelt, P.K., Rabl, T. (2026). Benchmarking Multi-Tenant Architectures in PostgreSQL. In: Proceedings 29th International Conference on Extending Database Technology, EDBT 2026. OpenProceedings.org. https://doi.org/10.48786/edbt.2026.46