Concept: Evaluators
Overview
The evaluators module parses a single bexhoma experiment result folder and
exposes its data as structured pandas DataFrames.
It sits one level below the Collectors, which aggregate results
across multiple experiment codes.
evaluators.benchbase(code, path)
├── log_to_df() ← parse pod log files into DataFrames
├── get_df_benchmarking() ← read combined benchmarking pickle
├── get_df_loading() ← read combined loading pickle
└── get_connections_of_experiment() ← connection/pod metadata
Supported benchmark types:
| Evaluator class | Benchmark tool |
|---|---|
| benchbase | Benchbase |
| tpcc | HammerDB TPC-C |
| ycsb | YCSB |
| dbmsbenchmarker | DBMSBenchmarker |
Class hierarchy
base
└── logger
├── benchbase
├── tpcc
├── ycsb
└── dbmsbenchmarker
base provides connection metadata and loading throughput helpers.
logger adds the log-file → DataFrame → pickle pipeline used by all
benchmark-specific subclasses.
Constructor
All evaluators share the same constructor signature:
from bexhoma import evaluators
ev = evaluators.ycsb(code="1777285093", path="/data/benchmarks")
| Parameter | Description |
|---|---|
| code | Experiment identifier; also the name of the result sub-folder |
| path | Root directory that contains the per-experiment sub-folders |
|  | Whether loading-phase results are expected (evaluators enable this automatically) |
|  | Whether benchmarking-phase results are expected (default |
|  | 1-based index of the benchmark run to filter results to; |
Quick Reference
DataFrame Access
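| Method | Defined in | Returns |
|---|---|---|
| get_df_benchmarking() | logger | All benchmarking results for this experiment |
| get_df_loading() | logger | All loading results |
| get_connections_of_experiment() | base | Connection/pod metadata, one row per pod |
| get_workload() | base | Raw workload properties dict |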
Type Conversion & Aggregation
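| Method | Defined in | Description |
|---|---|---|
| benchmarking_set_datatypes(df) | all subclasses | Cast benchmarking DataFrame to correct dtypes |
| benchmarking_aggregate_by_parallel_pods(df) | all subclasses | Reduce parallel pods → one row per phase (default) or per job |
| loading_set_datatypes(df) | ycsb | Cast loading DataFrame to correct dtypes |
| loading_aggregate_by_parallel_pods(df) | ycsb | Reduce parallel loading pods → one row per job |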
Loading Throughput
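| Method | Defined in | Returns |
|---|---|---|
| get_loading_per_connection() | base | Loading metrics per connection |
| get_loading_per_run() | base | Loading metrics aggregated per experiment run |
| get_loading_per_run_multitenant() | base | Loading metrics per run with tenant grouping |
| get_loading_per_pod() | ycsb | Raw loading DataFrame, one row per pod |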
Monitoring
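| Method | Defined in | Returns |
|---|---|---|
| get_monitoring_metrics() | logger | List of metric keys from connections.config |
| get_monitoring_metric(metric, component) | logger | Time-series CSV for one metric, all connections |
|  |  | Combines per-connection CSVs into one file |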
Workflow & Testing
| Method | Defined in | Returns |
|---|---|---|
|  |  | Dict mapping configuration → |
|  | all | Exit code |
Class base
base is the root evaluator. It owns the experiment folder path, error tracking,
and the connection-metadata helpers that underpin all downstream aggregation.
Key attributes
| Attribute | Set by | Description |
|---|---|---|
|  |  | Absolute path to the experiment result folder ( |
|  |  | Experiment identifier string |
|  |  | Flag: loading-phase results expected |
|  |  | Flag: benchmarking-phase results expected |
|  |  | Reconstructed workflow dict |
|  |  | Dict of parse errors keyed by filename |
get_connections_of_experiment()
Returns a DataFrame of connection/pod metadata read from connections.config.
One row per pod (when orig_name is present) or per client (otherwise).
Key columns:
- phase: code-prefixed phase identifier, <code>-<configuration>-<experiment_run>-<client>
- job: code-prefixed job identifier, <code>-<configuration>-<experiment_run>-<client>-<benchmark_run>
- code, connection, configuration, experiment_run, benchmark_run, client, pods, time_load, time_preload, time_generate, time_ingest, time_postload, type_tenants, num_tenants, vol_tenants, plus flattened host_*, loading_parameters_*, benchmarking_parameters_*, sut_parameters_*, and arg_* fields
benchmark_run (numBenchmark) is the 1-based index of the parallel benchmark job
within a phase. Each job produces its own entry with a unique job string.
See Concepts for the definition of phase vs job.
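The relationship between the two identifiers can be shown with plain string formatting. This is a minimal sketch, not bexhoma code: the helper names make_phase_id and make_job_id are hypothetical and only illustrate that a job identifier is the phase identifier extended by the benchmark run.

```python
def make_phase_id(code, configuration, experiment_run, client):
    """Build <code>-<configuration>-<experiment_run>-<client> (hypothetical helper)."""
    return f"{code}-{configuration}-{experiment_run}-{client}"


def make_job_id(code, configuration, experiment_run, client, benchmark_run):
    """A job id is the phase id plus the 1-based benchmark_run index."""
    phase = make_phase_id(code, configuration, experiment_run, client)
    return f"{phase}-{benchmark_run}"


# Two parallel benchmark jobs that belong to the same phase.
phase = make_phase_id("1777285093", "PostgreSQL-64-8-65536", 1, 1)
jobs = [make_job_id("1777285093", "PostgreSQL-64-8-65536", 1, 1, n) for n in (1, 2)]
print(phase)
print(jobs)
```

Every job string starts with its phase string, which is why grouping by phase merges all parallel jobs of that phase.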
get_workload()
Reads queries.config from the experiment folder and returns its content as
a Python dictionary. Useful for accessing the scale factor:
sf = int(ev.get_workload()['defaultParameters']['SF'])
get_loading_per_connection()
Returns loading metrics enriched with the scale factor and a
'Throughput [SF/h]' column, one row per connection/pod.
get_loading_per_run()
Aggregates to one row per (code, configuration, experiment_run) by taking the
max across connections and recomputing the throughput from the aggregated load time.
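The per-run reduction described above can be sketched in plain Python. The rows, the scale factor, and the helper name loading_per_run are invented for illustration. The throughput formula (scale factor per hour, with time_load in seconds) is an assumption based on the column name, not something confirmed by the bexhoma source.

```python
from collections import defaultdict

# Hypothetical per-connection loading rows; values are made up.
rows = [
    {"code": "1777285093", "configuration": "PostgreSQL-64-8-65536",
     "experiment_run": 1, "time_load": 120.0},
    {"code": "1777285093", "configuration": "PostgreSQL-64-8-65536",
     "experiment_run": 1, "time_load": 150.0},
    {"code": "1777285093", "configuration": "PostgreSQL-64-8-65536",
     "experiment_run": 2, "time_load": 100.0},
]
scale_factor = 16  # assumed value; in bexhoma it would come from get_workload()


def loading_per_run(rows, sf):
    """Group by (code, configuration, experiment_run), keep the max load time,
    then recompute throughput from that aggregated time."""
    slowest = defaultdict(float)
    for r in rows:
        key = (r["code"], r["configuration"], r["experiment_run"])
        slowest[key] = max(slowest[key], r["time_load"])
    # Assumed formula: SF per hour, with time_load measured in seconds.
    return {key: {"time_load": t, "Throughput [SF/h]": sf * 3600.0 / t}
            for key, t in slowest.items()}


per_run = loading_per_run(rows, scale_factor)
```

Taking the max means the slowest connection determines the load time of the run, so the recomputed throughput is a conservative figure.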
get_loading_per_run_multitenant()
Like get_loading_per_run() but groups by
(code, experiment_run, type_tenants, vol_tenants, num_tenants).
Class logger
logger extends base with the log-file parsing and pickle-caching pipeline.
All benchmark-specific evaluators inherit from logger.
Log-file pipeline
When results have not yet been cached:
evaluate_results()
├── transform_all_logs_benchmarking()
│ └── end_benchmarking(jobname) ← log_to_df → .df.pickle
├── transform_all_logs_loading()
│ └── end_loading(jobname) ← log_to_df → .df.pickle
└── _collect_dfs() ← merges .df.pickle → .all.df.pickle
get_df_benchmarking() and get_df_loading() trigger this pipeline on first
call if the combined pickle does not yet exist.
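The "build on first call, read from the pickle afterwards" behaviour is a common caching pattern. The following sketch shows the idea with the standard library only. load_or_build and expensive_parse are hypothetical names, and the real pipeline stores pandas DataFrames rather than lists.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(combined_path, build):
    """Return the cached object if the combined pickle exists; otherwise
    call build(), cache its result, and return it."""
    combined_path = Path(combined_path)
    if combined_path.exists():
        with combined_path.open("rb") as f:
            return pickle.load(f)
    result = build()
    with combined_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_parse():
    # Stand-in for parsing all pod logs; records that it ran.
    calls.append(1)
    return [{"pod": "pod-1", "throughput": 100.0}]


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "bexhoma-benchmarker.all.df.pickle"
    first = load_or_build(target, expensive_parse)   # parses and writes the cache
    second = load_or_build(target, expensive_parse)  # served from the cache
```

With this pattern, deleting the combined pickle is enough to force a fresh parse on the next call.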
get_df_benchmarking()
Returns the combined benchmarking DataFrame from bexhoma-benchmarker.all.df.pickle,
running the log pipeline first if the file is absent.
get_df_loading()
Returns the combined loading DataFrame from bexhoma-loading.all.df.pickle.
get_monitoring_metrics()
Returns the list of metric keys defined in connections.config.
get_monitoring_metric(metric, component='loading')
Returns a time-series DataFrame for one metric and one component role.
Rows are timestamps; columns are connection names, prefixed with {code}-.
plot(df, column, x, y, ...)
Convenience matplotlib wrapper. With plot_by=None produces a single chart
with one line per value of column. With plot_by set, produces a grid of
sub-plots — one per group — with lines split by column within each sub-plot.
Class benchbase
Evaluator for Benchbase experiments.
Parses per-pod log files produced by the Benchbase benchmarking tool and exposes throughput, goodput, and latency distribution results.
log_to_df(filename)
Parses a single Benchbase pod log file. Extracts header fields
(connection, configuration, experiment_run, client, pod, etc.) and
the JSON result block delimited by ####BEXHOMA####. Returns a one-row
DataFrame whose columns include:
- Throughput (requests/second), Goodput (requests/second)
- Latency Distribution.* (25th / 50th / 75th / 90th / 95th / 99th percentile, average, min, max)
- efficiency (TPC-C key-and-think mode only)
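Extracting a marker-delimited JSON block from a log can be sketched as follows. The sample log is invented, and the sketch assumes the ####BEXHOMA#### marker appears once before and once after the JSON payload. The actual log layout and the helper name extract_result_block are not taken from the bexhoma source.

```python
import json

MARKER = "####BEXHOMA####"


def extract_result_block(log_text, marker=MARKER):
    """Return the JSON object found between the first two marker occurrences,
    or None if the log has no complete block."""
    parts = log_text.split(marker)
    if len(parts) < 3:
        return None
    return json.loads(parts[1])


# Made-up log content; a real pod log contains many more lines and fields.
sample_log = """\
starting benchmark
####BEXHOMA####
{"Throughput (requests/second)": 812.5, "Goodput (requests/second)": 810.0}
####BEXHOMA####
done
"""

result = extract_result_block(sample_log)
```

The resulting dict can then be turned into a one-row table, with one column per key.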
benchmarking_set_datatypes(df)
Casts a benchmarking DataFrame to the correct column types.
benchmarking_aggregate_by_parallel_pods(df, columns=['phase'])
Reduces parallel pods to one row per group.
The phase column holds the phase identifier (configuration-experiment_run-client)
and the job column holds the job identifier
(configuration-experiment_run-client-benchmark_run).
- Default columns=['phase'] groups by phase, aggregating all parallel jobs within the same phase into a single row. This is the grouping used by get_performance_aggregated_per_phase().
- Pass columns=['job'] to keep one row per job (jobs within a phase stay separate). This is the grouping used by get_performance_aggregated_per_job().
Throughput and goodput are summed; latency percentiles use max; minimum
latency uses min; average latency uses mean.
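The per-column reduction rule can be made explicit in a few lines of plain Python. This is an illustrative sketch with invented rows and shortened column names, not the bexhoma implementation, which operates on pandas DataFrames.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-pod results for two phases.
pods = [
    {"phase": "cfg-1-1", "throughput": 400.0, "p95": 12.0, "min": 1.0, "avg": 5.0},
    {"phase": "cfg-1-1", "throughput": 420.0, "p95": 15.0, "min": 0.8, "avg": 6.0},
    {"phase": "cfg-1-2", "throughput": 300.0, "p95": 20.0, "min": 1.2, "avg": 8.0},
]

# One reducer per column: sum throughput, max percentile, min minimum, mean average.
REDUCERS = {"throughput": sum, "p95": max, "min": min, "avg": mean}


def aggregate(pods, column="phase"):
    """Reduce all pods sharing the same value of `column` to one row."""
    groups = defaultdict(list)
    for pod in pods:
        groups[pod[column]].append(pod)
    out = {}
    for key, members in groups.items():
        row = {name: fn([m[name] for m in members]) for name, fn in REDUCERS.items()}
        row["pod_count"] = len(members)
        out[key] = row
    return out


agg = aggregate(pods)
```

Using max for percentiles reports the worst pod, since percentiles from separate pods cannot be combined exactly without the underlying samples.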
parse_benchbase_log_file(file_path)
Low-level parser. Extracts per-second throughput from lines matching the
Benchbase log format and returns a list of {'second': …, 'throughput': …} dicts.
benchmark_logs_to_timeseries_df(list_logs, metric='throughput', aggregate=True)
Builds a per-second time-series DataFrame from all pod logs matching each ID
in list_logs.
Aggregation across pods depends on the metric name:
- "9" or "Max" in the metric name → element-wise max across pods
- "Min" in the metric name → element-wise min
- all others → sum
Returns an aggregated DataFrame indexed by 'second' (with an 'avg' column)
when aggregate=True, or a list of per-pod DataFrames when aggregate=False.
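The name-based choice of aggregation can be sketched as a small dispatch function. The series below are invented, and pick_reducer and combine are hypothetical names. Two points are assumptions of this sketch: the max rule is checked before the min rule, and seconds missing from one pod are skipped for that pod.

```python
def pick_reducer(metric):
    """Choose the element-wise reducer from the metric name."""
    if "9" in metric or "Max" in metric:
        return max
    if "Min" in metric:
        return min
    return sum


def combine(per_pod_series, metric):
    """Combine per-pod {second: value} series into one series.
    Seconds missing from a pod are skipped for that pod (assumption)."""
    reducer = pick_reducer(metric)
    seconds = sorted({s for series in per_pod_series for s in series})
    return {s: reducer([series[s] for series in per_pod_series if s in series])
            for s in seconds}


pod_a = {1: 100.0, 2: 110.0, 3: 90.0}
pod_b = {1: 95.0, 2: 120.0}

total = combine([pod_a, pod_b], "throughput")               # summed
worst = combine([pod_a, pod_b], "95th Percentile Latency")  # max, name contains "9"
```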
get_benchmark_logs_timeseries_df_aggregated(metric='throughput', configuration='', client='1', experiment_run='1')
Convenience wrapper around benchmark_logs_to_timeseries_df with aggregate=True.
Filters get_df_benchmarking() by the given configuration, client, and
experiment_run to obtain the pod list.
df_ts = ev.get_benchmark_logs_timeseries_df_aggregated(
metric="throughput",
configuration="PostgreSQL-64-8-65536",
client=1,
experiment_run=1)
# df_ts.index → seconds since benchmark start
# df_ts['throughput'] → txn/sec, summed across pods
get_benchmark_logs_timeseries_df_single(metric='throughput', configuration='', client='1', experiment_run='1')
Like the aggregated variant but returns a list of per-pod DataFrames instead.
get_summary_benchmark_per_connection()
Returns benchmarking results with one row per pod, filtered to the key display
columns (experiment run, terminals, target, client, child, time, errors,
throughput, goodput, efficiency, and latency percentiles), sorted by
(experiment_run, client, child). Used by show_summary().
get_summary_benchmark_per_phase()
Returns benchmarking results aggregated over parallel pods (via
benchmarking_aggregate_by_parallel_pods), one row per phase, filtered to the
same display columns as the per-connection view plus pod_count.
Used by show_summary().
get_summary_loading_per_run()
Delegates to get_loading_per_run() of the base class. Returns one row per
(code, configuration, experiment_run) with time_load and
Throughput [SF/h]. Used by show_summary().
Class tpcc
Evaluator for HammerDB TPC-C experiments.
log_to_df(filename)
Parses a single HammerDB pod log file. Key extracted columns:
- NOPM (New Orders Per Minute), TPM (Transactions Per Minute)
- efficiency: meaningful only when key-and-think time is enabled
- Optional latency statistics when logged: CALLS, MIN [ms], AVG [ms], MAX [ms], TOTAL [ms], P99 [ms], P95 [ms], P50 [ms]
benchmarking_set_datatypes(df)
Casts all columns to the correct types. Handles two schemas: with and without the optional latency columns.
benchmarking_aggregate_by_parallel_pods(df, columns=['phase'])
Reduces parallel pods to one row per group. The phase column holds the phase
identifier (configuration-experiment_run-client) and the job column holds the job
identifier (configuration-experiment_run-client-benchmark_run). Default
columns=['phase'] produces one row per phase; pass columns=['job'] for one row per
job. NOPM and TPM are averaged (not summed) across pods; efficiency is recomputed after
aggregation.
Class ycsb
Evaluator for YCSB experiments.
Covers both the benchmarking phase and the loading phase, each with their own aggregation helpers and time-series methods.
log_to_df(filename)
Parses a single YCSB pod log file into a one-row DataFrame. Columns include
[OVERALL].Throughput(ops/sec) and, depending on the workload, per-operation
statistics such as:
- [READ].Operations, [READ].AverageLatency(us), [READ].99thPercentileLatency(us), …
- [UPDATE].*, [INSERT].*, [SCAN].*, [READ-MODIFY-WRITE].*
- [*-FAILED].* variants for error counting
benchmarking_set_datatypes(df) / loading_set_datatypes(df)
Cast the benchmarking or loading DataFrame columns to the correct types. Unknown operation types are handled gracefully by conditional type application.
benchmarking_aggregate_by_parallel_pods(df, columns=['phase'])
Reduces parallel benchmarking pods to one row per group. The phase column holds
the phase identifier and the job column holds the job identifier. Default
columns=['phase'] produces one row per phase; pass columns=['job'] for one row
per job. Throughput is summed; average latency uses mean; percentile latency uses max;
minimum uses min.
loading_aggregate_by_parallel_pods(df, columns=['phase'])
Same reduction logic for loading pods.
get_df_loading()
Returns the combined loading DataFrame from bexhoma-loading.all.df.pickle.
get_loading_per_connection()
Merges the aggregated loading results with connection metadata on
(code, configuration, experiment_run), normalises the index, drops rows
without a recorded loading phase, and attaches the scale factor.
get_loading_per_pod()
Returns the raw loading DataFrame (one row per pod) from get_df_loading().
parse_ycsb_log_file(file_path)
Low-level parser. Each line produces a dict with sec, total_operations,
current_ops_per_sec, and a nested metrics dict for per-operation statistics.
logs_to_timeseries_df(list_logs, metric='current_ops_per_sec', aggregate=True, filetype='benchmarker')
Core time-series builder. filetype controls whether benchmarker or loading
log files are matched. The last measurement of each pod is removed as unreliable.
Aggregation strategy:
| Metric name contains | Aggregation |
|---|---|
|  | element-wise max |
|  | element-wise min |
|  | sum |
| others | sum then divided by pod count |
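Two of these steps, dropping the last measurement of each pod and then either summing or averaging per second, can be sketched as follows. The sample values and function names are invented. The real method works on parsed YCSB log files and returns pandas DataFrames.

```python
def drop_last(series):
    """Remove the final (sec, value) sample of one pod's series."""
    return series[:-1]


def sum_per_second(per_pod):
    """Sum the values of all pods for each second, ignoring each pod's last sample."""
    totals = {}
    for series in per_pod:
        for sec, value in drop_last(series):
            totals[sec] = totals.get(sec, 0.0) + value
    return dict(sorted(totals.items()))


def mean_per_second(per_pod):
    """Sum per second, then divide by the number of pods."""
    return {sec: v / len(per_pod) for sec, v in sum_per_second(per_pod).items()}


# Made-up samples: the last interval of each pod is only partially covered.
pod_1 = [(10, 1000.0), (20, 1200.0), (30, 300.0)]
pod_2 = [(10, 900.0), (20, 1100.0), (30, 250.0)]

ops = sum_per_second([pod_1, pod_2])
avg = mean_per_second([pod_1, pod_2])
```

Dropping the final sample avoids an artificial dip at the end of the series, where a pod typically reports a shorter, incomplete interval.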
get_benchmark_logs_timeseries_df_aggregated(metric='current_ops_per_sec', configuration='', client='1', experiment_run='1')
Aggregated time series for the benchmarking phase, filtered by
(configuration, client, experiment_run).
df_ts = ev.get_benchmark_logs_timeseries_df_aggregated(
metric="current_ops_per_sec",
configuration="PostgreSQL-64-8-196608",
client=2,
experiment_run=1)
# df_ts.index → seconds (int)
# df_ts['current_ops_per_sec'] → summed across pods
get_benchmark_logs_timeseries_df_single(metric='current_ops_per_sec', configuration='', client='1', experiment_run='1')
Returns a list of per-pod DataFrames for the benchmarking phase.
get_loading_logs_timeseries_df_aggregated(metric='current_ops_per_sec', configuration='', experiment_run='1')
Aggregated time series for the loading phase. No client parameter —
loading pods are identified by (configuration, experiment_run) only.
df_ts = ev.get_loading_logs_timeseries_df_aggregated(
configuration="PostgreSQL-64-8-196608",
experiment_run=1)
get_loading_logs_timeseries_df_single(metric='current_ops_per_sec', configuration='', experiment_run='1')
Returns a list of per-pod DataFrames for the loading phase.
Class dbmsbenchmarker
Evaluator for DBMSBenchmarker
experiments. Uses the dbmsbenchmarker.inspector API rather than raw log
parsing.
get_inspector()
Loads the DBMSBenchmarker inspector for this experiment. Called automatically
by __init__; may be called again if self.evaluation is None.
get_df_loading()
Returns loading phase timing (generate, ingest, schema, index, load) extracted from the inspector’s connection data. Index is the DBMS name.
get_df_benchmarking()
Returns a combined DataFrame of throughput and timing metrics:
- Power@Size [~Q/h]: power metric (derived from geometric-mean execution time)
- Throughput@Size: throughput metric
- time [s]: wall-clock benchmark duration (derived from benchmark_start / benchmark_end timestamps)
- pod_count: number of parallel pods
benchmarking_aggregate_by_parallel_pods(df, columns=['phase'])
Reduces parallel pods within each job to one aggregated row. Geometric mean is used for
total_timer_execution and Power@Size; Throughput@Size is recomputed
from the aggregated timing.
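As a reminder of how a geometric mean differs from an arithmetic mean, here is a small sketch with invented per-pod execution times. It uses statistics.geometric_mean from the Python standard library (3.8+) and does not reproduce how bexhoma computes Power@Size or Throughput@Size.

```python
from statistics import geometric_mean, mean

# Made-up execution times (seconds) of the same workload on two parallel pods.
pod_times = [10.0, 40.0]

geo = geometric_mean(pod_times)  # square root of 10 * 40
arith = mean(pod_times)          # (10 + 40) / 2

print(f"geometric mean: {geo:.2f} s, arithmetic mean: {arith:.2f} s")
```

The geometric mean is pulled less strongly towards a single slow pod than the arithmetic mean, which makes it a common choice for averaging execution times.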
get_total_warnings(query_titles=False)
Returns per-query warning counts (result mismatches) as a DataFrame.
Pass query_titles=True to replace numeric query indexes with titles from
queries.config.
get_total_errors(query_titles=False)
Returns per-query error counts (failed executions).
get_query_latencies(query_titles=False)
Returns mean execution latency per query and DBMS, rounded to 2 decimal places.
Minimal Examples
Benchbase
from bexhoma import evaluators
ev = evaluators.benchbase(code="1777285093", path="/data/benchmarks")
# All benchmarking results
df_bench = ev.get_df_benchmarking()
# Aggregate parallel pods → one row per phase (default grouping)
df_bench = ev.benchmarking_set_datatypes(df_bench)
df_agg = ev.benchmarking_aggregate_by_parallel_pods(df_bench)
# Per-second throughput for one phase
df_ts = ev.get_benchmark_logs_timeseries_df_aggregated(
metric="throughput",
configuration="PostgreSQL-64-8-65536",
client=1,
experiment_run=1)
YCSB
ev = evaluators.ycsb(code="1777285093", path="/data/benchmarks")
# Summary DataFrames
df_bench = ev.get_df_benchmarking()
df_loading = ev.get_df_loading()
# Connection metadata
df_conn = ev.get_connections_of_experiment()
# Loading throughput
ev.get_loading_per_connection()
ev.get_loading_per_pod()
# Benchmarking time series
ev.get_benchmark_logs_timeseries_df_aggregated(
configuration="PostgreSQL-64-8-196608", client=2, experiment_run=1)
ev.get_benchmark_logs_timeseries_df_single(
configuration="PostgreSQL-64-8-196608", client=2, experiment_run=1)
# Loading time series (no client dimension)
ev.get_loading_logs_timeseries_df_aggregated(
configuration="PostgreSQL-64-8-196608", experiment_run=1)
ev.get_loading_logs_timeseries_df_single(
configuration="PostgreSQL-64-8-196608", experiment_run=1)
DBMSBenchmarker (TPC-H)
ev = evaluators.dbmsbenchmarker(code="1777285093", path="/data/benchmarks")
# Throughput and power metrics
ev.get_df_benchmarking()
# Loading times
ev.get_df_loading()
# Query-level diagnostics
ev.get_query_latencies(query_titles=True)
ev.get_total_errors(query_titles=True)
ev.get_total_warnings(query_titles=True)