@Mr0grog
Created November 14, 2025 22:37
Browsertrix-crawler 1.8.1 / 1.9.0 comparison script for epa.gov

Comparing Browsertrix-crawler v1.8.1 vs. v1.9.0 on EPA URLs

These are some quick and simple tools for comparing performance of different versions of browsertrix-crawler (using the webrecorder/browsertrix-crawler Docker images). You'll need Docker and uv to run this.

  1. First, adjust epatest.yaml for whatever settings you’d like.

  2. Then, run run-many.sh <name-of-scenario>. This will run 3 crawls each with v1.8.1 and v1.9.0, where the resulting collections are named epatest--<version>--<scenario>--<index>. For example:

    # Try it with default settings:
    ./run-many.sh default
    
    # Adjust epatest.yaml to turn off autoscroll, then:
    ./run-many.sh no-autoscroll

    (This script just runs epatest.sh repeatedly with appropriate version and name arguments. You can do a single test run with ./epatest.sh <version> <name>, e.g. ./epatest.sh 1.8.1 default--1.)

  3. Run epatest-logs.py with uv to load the logs from those crawls and compare them:

    $ uv run epatest-logs.py
    
         Median Time | Collections                                        | Individual Times
      -------------- | -------------------------------------------------- | --------------------
    https://hero.epa.gov/hero/index.cfm/search
        6.0s +/- 0.2 | 1-8-1--default-behaviors.......................... |   5.9s,   6.0s,   6.0s
        6.0s +/- 0.2 | 1-8-1--no-autoscroll.............................. |   6.0s,   5.8s,   6.1s
        6.0s +/- 0.1 | 1-9-0--default-behaviors.......................... |   6.0s,   5.9s,   6.0s
        5.9s +/- 0.0 | 1-9-0--no-autoscroll.............................. |   5.9s,   5.8s,   5.9s
    https://espanol.epa.gov/espanol/terminos-e
        3.6s +/- 0.5 | 1-8-1--default-behaviors.......................... |   3.5s,   3.6s,   4.1s
        3.6s +/- 0.6 | 1-8-1--no-autoscroll.............................. |   3.6s,   3.6s,   4.1s
       70.1s +/- 0.1 | 1-9-0--default-behaviors.......................... |  70.1s,  70.0s,  70.1s
       32.9s +/- 0.1 | 1-9-0--no-autoscroll.............................. |  32.9s,  32.8s,  32.9s
    

    So you can compare how different scenarios and versions perform on the same page(s).

    (NOTE: you can also add a --path <path> to point to a directory of crawl collections, e.g. uv run epatest-logs.py --path other-crawls/collections.)
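How the collections get grouped and what the "+/-" column means can be sketched in isolation. This is a minimal sketch, not the script itself: the two regular expressions are copied from epatest-logs.py, but the `group_name` and `summarize` helpers and the sample timings are made up for illustration. Note that "+/-" is the range of the runs (max minus min), not a standard deviation.

```python
import re
from collections import defaultdict
from statistics import median


def group_name(collection: str) -> str:
    """Reduce 'epatest--<version>--<scenario>--<index>' to '<version>--<scenario>'."""
    # Same patterns as epatest-logs.py: drop the prefix, then the run index.
    unprefixed = re.search(r"(^|-)\d-\d+-\d+-+.*$", collection).group(0).strip("-")
    return re.sub(r"-+\d+$", "", unprefixed)


def summarize(timings: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Map each group to (median, max - min) of its run times in seconds."""
    groups: dict[str, list[float]] = defaultdict(list)
    for collection, seconds in timings.items():
        groups[group_name(collection)].append(seconds)
    return {name: (median(runs), max(runs) - min(runs)) for name, runs in groups.items()}


# Hypothetical timings for a single URL, keyed by collection name.
sample = {
    "epatest--1-8-1--default--1": 3.5,
    "epatest--1-8-1--default--2": 3.6,
    "epatest--1-8-1--default--3": 4.1,
    "epatest--1-9-0--default--1": 70.1,
    "epatest--1-9-0--default--2": 70.0,
    "epatest--1-9-0--default--3": 70.1,
}

for name, (mid, spread) in sorted(summarize(sample).items()):
    print(f"{mid:5.1f}s +/-{spread:4.1f} | {name}")
```

Because the run index is stripped before grouping, the three runs of each version and scenario collapse into one row.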

Results

I sure got some surprising results here! I had started by grabbing the first 20 URLs of a random EDGI crawl of epa.gov hostnames (see epatest.yaml in this gist). It happened to include two hostnames: hero.epa.gov and espanol.epa.gov. Running this gave me results like what I expected… hero.epa.gov worked fine in both versions, and every URL at espanol.epa.gov took ~10× longer in v1.9.0. Dropping autoscroll helped v1.9.0 only on some long pages. Excerpted results:

     Median Time | Collections                                        | Individual Times
  -------------- | -------------------------------------------------- | --------------------
https://hero.epa.gov/hero/index.cfm/search
    6.0s +/- 0.2 | 1-8-1--default-behaviors.......................... |   5.9s,   6.0s,   6.0s
    6.0s +/- 0.2 | 1-8-1--no-autoscroll.............................. |   6.0s,   5.8s,   6.1s
    6.0s +/- 0.2 | 1-8-1--no-autoscroll-autoplay..................... |   6.2s,   6.0s,   6.0s
    5.8s +/- 0.1 | 1-8-1--no-autoscroll-autoplay-autofetch........... |   5.7s,   5.8s,   5.9s
    6.0s +/- 0.1 | 1-9-0--default-behaviors.......................... |   6.0s,   5.9s,   6.0s
    5.9s +/- 0.0 | 1-9-0--no-autoscroll.............................. |   5.9s,   5.8s,   5.9s
    5.9s +/- 0.0 | 1-9-0--no-autoscroll-autoplay..................... |   5.9s,   5.9s,   5.9s
    5.8s +/- 0.2 | 1-9-0--no-autoscroll-autoplay-autofetch........... |   5.8s,   5.7s,   5.8s
https://hero.epa.gov/
    5.2s +/- 0.0 | 1-8-1--default-behaviors.......................... |   5.2s,   5.2s,   5.2s
    5.2s +/- 0.1 | 1-8-1--no-autoscroll.............................. |   5.2s,   5.2s,   5.2s
    5.2s +/- 0.1 | 1-8-1--no-autoscroll-autoplay..................... |   5.2s,   5.1s,   5.2s
    5.4s +/- 0.1 | 1-8-1--no-autoscroll-autoplay-autofetch........... |   5.4s,   5.4s,   5.3s
    5.6s +/- 0.1 | 1-9-0--default-behaviors.......................... |   5.6s,   5.5s,   5.6s
    5.6s +/- 0.1 | 1-9-0--no-autoscroll.............................. |   5.5s,   5.6s,   5.6s
    5.6s +/- 0.0 | 1-9-0--no-autoscroll-autoplay..................... |   5.6s,   5.6s,   5.6s
    5.8s +/- 0.2 | 1-9-0--no-autoscroll-autoplay-autofetch........... |   5.7s,   5.9s,   5.8s
https://espanol.epa.gov/cai/manual-informativo-sobre-el-radon
    4.7s +/- 0.2 | 1-8-1--default-behaviors.......................... |   4.7s,   4.7s,   4.9s
    4.7s +/- 0.0 | 1-8-1--no-autoscroll.............................. |   4.7s,   4.7s,   4.7s
    4.7s +/- 0.1 | 1-8-1--no-autoscroll-autoplay..................... |   4.7s,   4.7s,   4.8s
    4.6s +/- 0.2 | 1-8-1--no-autoscroll-autoplay-autofetch........... |   4.5s,   4.6s,   4.7s
   34.6s +/- 0.0 | 1-9-0--default-behaviors.......................... |  34.6s,  34.6s,  34.5s
   33.8s +/- 0.0 | 1-9-0--no-autoscroll.............................. |  33.8s,  33.7s,  33.8s
   33.8s +/- 0.1 | 1-9-0--no-autoscroll-autoplay..................... |  33.8s,  33.7s,  33.8s
   33.8s +/- 0.0 | 1-9-0--no-autoscroll-autoplay-autofetch........... |  33.8s,  33.8s,  33.8s
https://espanol.epa.gov/watersense/en-sequia
    3.6s +/- 0.4 | 1-8-1--default-behaviors.......................... |   3.9s,   3.6s,   3.5s
    3.6s +/- 0.1 | 1-8-1--no-autoscroll.............................. |   3.6s,   3.6s,   3.5s
    4.3s +/- 0.8 | 1-8-1--no-autoscroll-autoplay..................... |   4.3s,   3.6s,   4.3s
    3.7s +/- 0.8 | 1-8-1--no-autoscroll-autoplay-autofetch........... |   4.3s,   3.5s,   3.7s
   35.4s +/- 0.0 | 1-9-0--default-behaviors.......................... |  35.4s,  35.4s,  35.4s
   32.9s +/- 0.0 | 1-9-0--no-autoscroll.............................. |  32.9s,  32.8s,  32.9s
   32.9s +/- 0.2 | 1-9-0--no-autoscroll-autoplay..................... |  32.9s,  32.9s,  33.0s
   32.9s +/- 0.0 | 1-9-0--no-autoscroll-autoplay-autofetch........... |  32.9s,  32.9s,  32.9s
https://espanol.epa.gov/espanol/terminos-e
    3.6s +/- 0.5 | 1-8-1--default-behaviors.......................... |   3.5s,   3.6s,   4.1s
    3.6s +/- 0.6 | 1-8-1--no-autoscroll.............................. |   3.6s,   3.6s,   4.1s
    3.5s +/- 0.1 | 1-8-1--no-autoscroll-autoplay..................... |   3.5s,   3.5s,   3.6s
    3.5s +/- 0.8 | 1-8-1--no-autoscroll-autoplay-autofetch........... |   3.5s,   3.5s,   4.3s
   70.1s +/- 0.1 | 1-9-0--default-behaviors.......................... |  70.1s,  70.0s,  70.1s
   32.9s +/- 0.1 | 1-9-0--no-autoscroll.............................. |  32.9s,  32.8s,  32.9s
   32.9s +/- 0.2 | 1-9-0--no-autoscroll-autoplay..................... |  32.9s,  32.9s,  33.0s
   32.9s +/- 1.3 | 1-9-0--no-autoscroll-autoplay-autofetch........... |  32.9s,  34.2s,  32.9s
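To put a number on the slowdown, here is a small sketch that divides the v1.9.0 median by the v1.8.1 median for each page. The medians are copied from the default-behaviors rows above; the `medians` dict and the `slowdown` helper are just illustrative names, not part of the scripts in this gist.

```python
# Median page times in seconds, copied from the default-behaviors rows above.
medians = {
    "https://hero.epa.gov/hero/index.cfm/search": {"1.8.1": 6.0, "1.9.0": 6.0},
    "https://hero.epa.gov/": {"1.8.1": 5.2, "1.9.0": 5.6},
    "https://espanol.epa.gov/cai/manual-informativo-sobre-el-radon": {"1.8.1": 4.7, "1.9.0": 34.6},
    "https://espanol.epa.gov/watersense/en-sequia": {"1.8.1": 3.6, "1.9.0": 35.4},
    "https://espanol.epa.gov/espanol/terminos-e": {"1.8.1": 3.6, "1.9.0": 70.1},
}


def slowdown(times: dict[str, float], old: str = "1.8.1", new: str = "1.9.0") -> float:
    """Return how many times longer the page took in `new` than in `old`."""
    return times[new] / times[old]


ratios = {url: slowdown(times) for url, times in medians.items()}
for url, ratio in ratios.items():
    print(f"{ratio:5.1f}x  {url}")
```

On these rows the hero.epa.gov pages stay close to 1×, while the espanol.epa.gov pages come out at roughly 7× to 19×.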

So I thought I’d try to simplify the test and just look at a couple of the slow pages. But then v1.8.1 slowed down to be the same as v1.9.0!

     Median Time | Collections                                        | Individual Times
  -------------- | -------------------------------------------------- | --------------------
https://espanol.epa.gov/cai/manual-informativo-sobre-el-radon
   34.9s +/- 0.1 | 1-8-1--default-behaviors.......................... |  35.0s,  34.9s,  34.9s
   34.1s +/- 0.0 | 1-8-1--no-autoscroll.............................. |  34.1s,  34.1s,  34.1s
   34.9s +/- 0.1 | 1-9-0--default-behaviors.......................... |  34.9s,  34.9s,  35.0s
   34.1s +/- 0.1 | 1-9-0--no-autoscroll.............................. |  34.1s,  34.2s,  34.1s
https://espanol.epa.gov/watersense/en-sequia
   32.8s +/- 0.1 | 1-8-1--default-behaviors.......................... |  32.9s,  32.8s,  32.8s
   32.8s +/- 0.1 | 1-8-1--no-autoscroll.............................. |  32.8s,  32.8s,  32.8s
   35.6s +/- 0.0 | 1-9-0--default-behaviors.......................... |  35.6s,  35.6s,  35.6s
   33.0s +/- 0.1 | 1-9-0--no-autoscroll.............................. |  33.0s,  33.0s,  32.9s

This holds even when I bumped it up to 5 URLs instead of two.

BUT if I include 2 URLs each from hero.epa.gov and espanol.epa.gov, we're back to the original speedy behavior in v1.8.1 and slow behavior in v1.9.0:

     Median Time | Collections                                        | Individual Times
  -------------- | -------------------------------------------------- | --------------------
https://hero.epa.gov/hero/index.cfm/search
    5.9s +/- 0.5 | 1-8-1--default-behaviors.......................... |   6.3s,   5.8s,   5.9s
    6.1s +/- 0.2 | 1-8-1--no-autoscroll.............................. |   6.1s,   6.0s,   6.2s
    5.9s +/- 0.0 | 1-9-0--default-behaviors.......................... |   5.9s,   5.9s,   5.9s
    5.9s +/- 0.0 | 1-9-0--no-autoscroll.............................. |   5.8s,   5.9s,   5.9s
https://hero.epa.gov/
    5.1s +/- 1.2 | 1-8-1--default-behaviors.......................... |   5.1s,   5.2s,   4.0s
    5.2s +/- 0.8 | 1-8-1--no-autoscroll.............................. |   4.4s,   5.2s,   5.2s
    5.6s +/- 0.8 | 1-9-0--default-behaviors.......................... |   4.9s,   5.6s,   5.6s
    5.6s +/- 0.8 | 1-9-0--no-autoscroll.............................. |   4.8s,   5.6s,   5.6s
https://espanol.epa.gov/cai/manual-informativo-sobre-el-radon
    4.6s +/- 0.1 | 1-8-1--default-behaviors.......................... |   4.7s,   4.6s,   4.6s
    4.7s +/- 0.0 | 1-8-1--no-autoscroll.............................. |   4.7s,   4.7s,   4.7s
   34.6s +/- 0.3 | 1-9-0--default-behaviors.......................... |  34.8s,  34.5s,  34.6s
   33.7s +/- 0.3 | 1-9-0--no-autoscroll.............................. |  34.0s,  33.7s,  33.7s
https://espanol.epa.gov/watersense/en-sequia
    3.6s +/- 0.7 | 1-8-1--default-behaviors.......................... |   3.6s,   3.6s,   4.2s
    3.6s +/- 0.8 | 1-8-1--no-autoscroll.............................. |   4.4s,   3.6s,   3.6s
   35.4s +/- 0.7 | 1-9-0--default-behaviors.......................... |  36.1s,  35.4s,  35.4s
   32.9s +/- 0.5 | 1-9-0--no-autoscroll.............................. |  33.3s,  32.8s,  32.9s

Doesn't make a whole lot of sense to me. Is this an issue with multiple hostnames? hero.epa.gov poisoning things somehow? Odd.
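One way to poke at the multiple-hostnames question is to generate seed lists per hostname plus a mixed list, then run each as its own scenario. The sketch below only builds those lists; it does not explain the behavior. The scenario names (`only-<host>`, `mixed`) and the helper functions are hypothetical, and the printed fragments would still need to be pasted into epatest.yaml by hand.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Two URLs from each hostname, as in the last test above.
seeds = [
    "https://hero.epa.gov/hero/index.cfm/search",
    "https://hero.epa.gov/",
    "https://espanol.epa.gov/cai/manual-informativo-sobre-el-radon",
    "https://espanol.epa.gov/watersense/en-sequia",
]


def seeds_by_host(urls: list[str]) -> dict[str, list[str]]:
    """Group seed URLs by hostname, keeping their original order."""
    by_host: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    return dict(by_host)


def scenarios(urls: list[str]) -> dict[str, list[str]]:
    """Build one single-host seed list per hostname, plus one mixed list."""
    result = {f"only-{host}": host_urls for host, host_urls in seeds_by_host(urls).items()}
    result["mixed"] = list(urls)
    return result


# Print a `seeds:` fragment for each scenario.
for name, urls in scenarios(seeds).items():
    print(f"# scenario: {name}")
    print("seeds:")
    for url in urls:
        print(f"  - {url}")
```

Swapping each fragment into epatest.yaml and running `./run-many.sh <scenario>` would give directly comparable single-host and mixed-host runs.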

# epatest-logs.py
# /// script
# dependencies = [
#   "python-dateutil",
# ]
# ///
from argparse import ArgumentParser
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from dateutil.parser import parse as parse_timestamp
import json
from pathlib import Path
import re
from statistics import median


@dataclass
class PageInfo:
    url: str
    index: int
    logs: list[dict] = field(default_factory=list)
    start_time: datetime | None = None
    end_time: datetime | None = None

    @property
    def duration(self) -> timedelta:
        return self.end_time - self.start_time


def loglines(filepath: str):
    with open(filepath) as file:
        for line in file:
            clean_line = line.strip()
            if clean_line:
                yield json.loads(clean_line)


def parse_logfile(filepath: str) -> dict[str, PageInfo]:
    print(f'Parsing "{filepath}"')
    index = 0
    current_page: PageInfo | None = None
    pages: dict[str, PageInfo] = {}
    for log in loglines(filepath):
        timestamp = parse_timestamp(log["timestamp"])
        log["timestamp"] = timestamp

        if log["context"] == "worker" and log["message"] == "Starting page":
            url = log["details"]["page"]
            if current_page:
                raise RuntimeError(f"Starting a new page when another has not finished! (old: '{current_page.url}', new: '{url}')")
            index += 1
            current_page = PageInfo(url=url, index=index, start_time=timestamp)

        if current_page:
            current_page.logs.append(log)

        if log["context"] == "pageStatus" and log["message"] == "Page Finished":
            url = log["details"]["page"]
            if not current_page:
                raise RuntimeError(f'Tried to finish page while no page is open! (url: {url})')
            current_page.end_time = timestamp
            pages[current_page.url] = current_page
            current_page = None

    return pages

def parse_collection_logs(collection_path: Path) -> dict[str, PageInfo]:
    logs_directory = collection_path / 'logs'
    results = {}
    for file in logs_directory.iterdir():
        if file.suffix == '.log':
            results.update(parse_logfile(str(file)))
    return results

parser = ArgumentParser()
parser.add_argument(
    "--path",
    type=Path,
    default=Path("./crawls/collections"),
    help="Path to collections directory to analyze",
)
args = parser.parse_args()
collections_path: Path = args.path
# collections_path = Path('./crawls/collections')

collections = sorted(p.name for p in collections_path.iterdir())
crawls = {c: parse_collection_logs(collections_path / c) for c in collections}
print('')

# UNGROUPED RESULTS
# # Show in order that pages were encountered in the first collection. This
# # hopefully compares like-to-like as much as possible when it comes to caching,
# # since each collection will have crawled in a *similar* order.
# for page in sorted(crawls[collections[0]].values(), key=lambda x: x.index):
#     print(page.url)
#     for collection in collections:
#         info = crawls[collection][page.url]
#         memory = next(
#             (log["details"] for log in info.logs if log["context"] == "memoryStatus"),
#             "?"
#         )
#         print(f' {info.duration.total_seconds():5.1f}s | {info.index:>2} | {collection:<50} | Mem: {json.dumps(memory)}')

# GROUPED RESULTS
# Group crawls. Expects crawls like "epatest--<version>--<tag>--<index>",
# e.g. "epatest--1-8-1--basic--1".
# The version and tag get combined to name a group we'll combine.
crawl_groups: dict[str, list[dict[str, PageInfo]]] = defaultdict(list)
for collection in collections:
    unprefixed = re.search(r'(^|-)\d-\d+-\d+-+.*$', collection).group(0).strip('-')
    name = re.sub(r'-+\d+$', '', unprefixed)
    crawl_groups[name].append(crawls[collection])

# Show in order that pages were encountered in the first collection. This
# hopefully compares like-to-like as much as possible when it comes to caching,
# since each collection will have crawled in a *similar* order.
print(f"  {'Median Time':>14} | {'Collections':<50} | Individual Times")
print(f"  {'-' * 14} | {'-' * 50} | {'-' * 20}")
for page in sorted(crawls[collections[0]].values(), key=lambda x: x.index):
    print(page.url)
    for group, members in sorted(crawl_groups.items(), key=lambda x: x[0]):
        page_infos = [c[page.url] for c in members]
        durations = [i.duration.total_seconds() for i in page_infos]
        print('  ' + ' | '.join([
            f"{median(durations):5.1f}s +/-{max(durations) - min(durations):4.1f}",
            f"{group:.<50}",
            ", ".join(f"{d:5.1f}s" for d in durations),
            # ",".join(str(i.index) for i in page_infos),
        ]))

# # Show a summary and relative timings of all the logs while loading a given
# # page in a given crawl.
# print('')
# print(' total s | incremental s | log')
# page = crawls["epatest-1-9-0--no-autoscroll-1"]["https://espanol.epa.gov/tri/encontrar-interpretar-y-utilizar-el-tri"]
# last_time = page.start_time
# for log in page.logs:
#     timestamp = log["timestamp"]
#     details = log["details"]
#     url = details.get("frameUrl", details.get("url", details.get("page")))
#     url_text = f"(url='{url}')" if url else ""
#     print(f'{(timestamp - page.start_time).total_seconds():6.2f} | {(timestamp - last_time).total_seconds():5.2f} | {log['context']}: {log['message']} {url_text}')
#     last_time = timestamp

#!/usr/bin/env bash
# epatest.sh: run a single crawl with the given browsertrix-crawler version.
set -eo pipefail

# Both arguments are required, so check each one separately.
if [[ -z "${1}" || -z "${2}" ]]; then
  echo 'You must specify a browsertrix-crawler version and crawl name as arguments.'
  echo 'For example, `epatest.sh 1.9.0 basic-1`'
  exit 1
fi

VERSION="${1}"
BROWSERTRIX_IMAGE="webrecorder/browsertrix-crawler:${VERSION}"
# Dots in the version become dashes, e.g. "epatest--1-9-0--basic-1".
COLLECTION="$(echo "epatest--${VERSION}--${2}" | tr '.' '-')"
echo "COLLECTION='${COLLECTION}'"

mkdir -p crawls
docker run \
  --rm \
  --attach stdout --attach stderr \
  --volume "./epatest.yaml:/app/config.yaml" \
  --volume "${PWD}/crawls/:/crawls/" \
  "${BROWSERTRIX_IMAGE}" \
  crawl \
  --config /app/config.yaml \
  --collection "${COLLECTION}" \
  --saveState always \
  --logging debug,stats \
  --logLevel debug,info,warn,error,fatal

# epatest.yaml
behaviors:
  - autoscroll
  - autoplay
  - autofetch
  - siteSpecific
pageLoadTimeout: 120
rolloverSize: 8000000000
saveStateHistory: 1
scopeType: page
seeds:
  - https://hero.epa.gov/hero/index.cfm/search
  - https://hero.epa.gov/
  - https://hero.epa.gov/hero/index.cfm/content/transparency
  - https://hero.epa.gov/hero/index.cfm/content/assessment
  - https://hero.epa.gov/hero/index.cfm/content/basic
  - https://hero.epa.gov/hero/index.cfm/litbrowser/public
  - https://hero.epa.gov/hero/index.cfm/content/howto
  - https://espanol.epa.gov/cai/manual-informativo-sobre-el-radon
  - https://espanol.epa.gov/watersense/en-sequia
  - https://espanol.epa.gov/cai/indoor-airplus-mejores-ambientes-adentro-y-afuera
  - https://espanol.epa.gov/plomo/acciones-para-reducir-la-exposicion-al-plomo
  - https://espanol.epa.gov/espanol/terminos-e
  - https://espanol.epa.gov/espanol/explicacion-sobre-el-oxido-de-etileno-eto
  - https://espanol.epa.gov/espanol/forms/contactenos-sobre-el-sitio-epa-en-espanol-preocupaciones-ambientales-o-alguna
  - https://espanol.epa.gov/programa-fronterizo/calendario-del-programa-fronterizo
  - https://espanol.epa.gov/tri/mision-y-metas-del-programa-del-tri
  - https://espanol.epa.gov/tri/encontrar-interpretar-y-utilizar-el-tri
  - https://espanol.epa.gov/espanol/resumen-del-programa-de-wifia
  - https://espanol.epa.gov/espanol/conceptos-basicos-sobre-el-material-particulado-pm-por-sus-siglas-en-ingles
  - https://espanol.epa.gov/cai/proteja-su-vida-y-la-de-su-familia-evite-el-envenenamiento-con-monoxido-de-carbono
warcinfo:
  operator: '"Environmental Data & Governance Initiative" <[email protected]>'
workers: 1

#!/usr/bin/env bash
# run-many.sh: run epatest.sh 3 times each with v1.8.1 and v1.9.0.
set -eo pipefail

NAME="${1}"
if [[ -z "${NAME}" ]]; then
  echo 'You must specify a name, e.g. `run-many.sh basic`'
  exit 1
fi

for i in $(seq 1 3); do
  echo '--------------------------------------------------------------------'
  echo "Running in 1.8.1... (run #${i})"
  ./epatest.sh 1.8.1 "${NAME}--${i}"
  echo ''
  echo '--------------------------------------------------------------------'
  echo "Running in 1.9.0... (run #${i})"
  ./epatest.sh 1.9.0 "${NAME}--${i}"
  echo ''
  echo ''
done