Skip to content

add GHA skeleton for routing regression pipeline#12

Open
Sherley-Sonali wants to merge 5 commits into
mainfrom
feat/gha-skeleton
Open

add GHA skeleton for routing regression pipeline#12
Sherley-Sonali wants to merge 5 commits into
mainfrom
feat/gha-skeleton

Conversation

@Sherley-Sonali

Copy link
Copy Markdown
Collaborator

Adds .github/workflows/routing_regression.yml - skeleton for the routing regression pipeline .

  • Four jobs: build Valhalla from source with ccache via Hetzner S3, build graph tiles from the LFS PBF and upload as artifact, run route requests in parallel against old and new router via pyvalhalla, diff and store results in RAD-data.
  • Processing scripts (run_routes.py, diff_responses.py, push_results.py) are left as TODOs for a follow-up PR. Known gaps are documented inline.

closes #10

Used AI as a drafting/thinking aid for implementation and test design. All changes were reviewed, tested, and understood before submission.-

@Sherley-Sonali

Copy link
Copy Markdown
Collaborator Author

key decisions in this skeleton:

  • merged build-valhalla and build-tiles into one job - ccache via Hetzner S3 handles making repeat builds fast, no need to pass binaries between jobs
  • workflow_dispatch inputs drive all 4 permutations (old/new router × old/new graph) without code changes - just pass different refs at trigger time
  • tiles uploaded as 90-day GHA artifact, reused across runs - only rebuilt when rebuild_tiles=true
  • run-routes matrix parallelizes old and new router on separate machines simultaneously
  • processing scripts (run_routes.py, diff_responses.py, push_results.py) are TODOs for next PR

@nilsnolde nilsnolde left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I think for now you can ignore the comments. I needed to think about this a bit more and see what makes most sense.

IMO we have to split this up into multiple ymls. one for building tiles, one for routing regression tests, at least. so we can keep the layout sane.

you know what @Sherley-Sonali . best you research how to semantically do this best with multiple files so the layout makes most sense. keep some stuff in mind for research:

  • workflow_call does "reusable workflows" (could be e.g. the testing yml calling the build_tiles workflow)
  • we can have our own .github/workflows/actions/build-valhalla.yml which centralizes the build for both (and eventually most) workflows

Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
Comment thread .github/workflows/routing-regression.yml Outdated
Comment thread .github/workflows/routing_regression.yml Outdated
- name: Download tiles artifact
uses: actions/download-artifact@v4
with:
name: ${{ inputs.tiles_artifact }}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per your description of the input: valhalla-tiles-master will download what from where? we have nothing building a tileset for master yet.

I'm not quite sure how to handle this exactly.. artifacts work per workflow run, meaning you'd always need to build both "old" and "new" graphs in each run. if we use artifacts for this. let me think for a second.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still stands @Sherley-Sonali, I still don't see how this could make sense with artifacts.

nothing here actually says "build tiles". there's apparently some GHA config for it, but who calls that?

@Sherley-Sonali

Copy link
Copy Markdown
Collaborator Author

Here's a first pass at the layout split, based on the points you raised:

  • build-valhalla.yml - reusable workflow (workflow_call), takes a valhalla_ref input, builds Valhalla from source with ccache (S3-backed). This is the single place that owns the build/cmake config - both other workflows call into it.
  • build-tiles.yml - workflow_dispatch, calls build-valhalla, then builds tiles from OSM data and uploads them as an artifact. Run independently whenever OSM data/Mjolnir changes.
  • routing-regression.yml - workflow_dispatch, calls build-valhalla for an old/new ref pair, downloads a tiles artifact, runs the route diff. Replaces routing_regression.yml.

Existing TODOs (admins.sqlite, run_routes.py, diff_responses.py, push_results.py, RAD-data push race) carried over as-is - this PR is just the layout split, no new functionality yet.

@nilsnolde

Copy link
Copy Markdown
Member

@Sherley-Sonali hm, seems I worded it wrong: of course a lot of my comments are still valid! and you need to address them! just not the ones which try to review logic which would be changing anyways with my suggestion to split it into multiple files.

@nilsnolde nilsnolde left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still some way to go. also keep in mind that eventually we need 2 refs to build tiles from, even if we just start with "old" for now.

Comment thread .github/workflows/build-tiles.yml Outdated
secrets: inherit

build-tiles:
runs-on: ubuntu-24.04

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
runs-on: ubuntu-24.04
runs-on: ubuntu-latest

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to know early if there's anything failing with current releases

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

Comment on lines +11 to +15
build-valhalla:
uses: ./.github/workflows/build-valhalla.yml
with:
valhalla_ref: ${{ inputs.valhalla_ref }}
secrets: inherit

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a redundant job..

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build-valhalla now produces a wheel artifact that build-tiles downloads and installs, no rebuilding inline

Comment thread .github/workflows/build-tiles.yml Outdated
Comment on lines +48 to +61
- name: Rebuild Valhalla
run: |
cmake -B valhalla-src/build -S valhalla-src \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_PYTHON_BINDINGS=ON \
-DENABLE_SERVICES=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_BENCHMARKS=OFF \
-DENABLE_CCACHE=ON \
-DENABLE_TOOLS=OFF \
-DENABLE_GEOTIFF=OFF \
-DENABLE_LZ4=OFF
make -C valhalla-src/build -j$(nproc)
sudo make -C valhalla-src/build install

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do the build "inline" if there's a drop-in workflow call?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in redesign

Comment thread .github/workflows/build-tiles.yml Outdated
Comment on lines +76 to +79
valhalla_build_admins -c valhalla.json data/liechtenstein_graph.osm.pbf
valhalla_build_config \
--mjolnir-tile-dir valhalla_tiles > valhalla.json
valhalla_build_tiles -c valhalla.json data/liechtenstein_graph.osm.pbf

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you actually try this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verified locally now

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep in mind to always verify locally before you actually commit, or at least before you ask for review! it's part of the learning process.

- name: Download tiles artifact
uses: actions/download-artifact@v4
with:
name: ${{ inputs.tiles_artifact }}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still stands @Sherley-Sonali, I still don't see how this could make sense with artifacts.

nothing here actually says "build tiles". there's apparently some GHA config for it, but who calls that?

Comment on lines +79 to +92
- name: Build Valhalla at router ref
run: |
cmake -B valhalla-src/build -S valhalla-src \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_PYTHON_BINDINGS=ON \
-DENABLE_SERVICES=OFF \
-DENABLE_TESTS=OFF \
-DENABLE_BENCHMARKS=OFF \
-DENABLE_CCACHE=ON \
-DENABLE_TOOLS=OFF \
-DENABLE_GEOTIFF=OFF \
-DENABLE_LZ4=OFF
make -C valhalla-src/build -j$(nproc)
sudo make -C valhalla-src/build install

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no inline build now

@Sherley-Sonali

Copy link
Copy Markdown
Collaborator Author

tiles are built by running build-tiles.yml separately and it uploads them as a 90-day artifact. routing-regression.yml takes tiles_run_id as input and uses gh run download to fetch that artifact cross-run.

And on needing - 2 refs to build tiles:
The new build-tiles.yml currently takes one valhalla_ref input. For the future 4 permutations (new router + new graph), you'd run build-tiles.yml twice with different refs. The artifact names valhalla-tiles-{ref_slug} already handle this and each ref produces a distinct artifact (just run it twice).

@nilsnolde nilsnolde left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so to summarize how it currently would work when I'd like to run a regression test some day:

  1. I run build-tiles.yml manually with the right git SHAs, which builds the bindings for each SHA
  2. I wait for that process to finish: AFAIK I have to refresh the PR page constantly, bcs I will not get notified from GH when the tile build run exited with success
  3. then I manually run routing-regression.yml with the same input as build-tiles.yml plus a tiles_run_id, which I have to hunt down from the 1. step (build-tiles.yml run)
  4. in routing-regression.yml, for each SHA we again build valhalla (as we already did for the build-tiles.yml step, but even more do I wonder why you sync wheels via artifacts in the first place), download the valhalla wheel & the graph from GH artifacts via a brittle tiles_run_id, then push the route responses before
  5. diffing them in another job

everything before 5. is not an ergonomical workflow and very very humany error prone.

I see it this way: the routing-regression.yml is the only thing we ever need to manually run for a route regression test. everything else derives from this one source of truth, which orchestrates everything else. build-valhalla.yml is run exactly once per SHA, build-tiles.yml is run inline for each SHA in the route regression workflow.

Comment on lines +56 to +60
pip wheel . --no-build-isolation --wheel-dir /tmp/valhalla-dist \
-Ccmake.build-type=Release \
-Ccmake.define.ENABLE_PYTHON_BINDINGS=ON \
-Ccmake.define.ENABLE_TESTS=OFF \
-Ccmake.define.ENABLE_SERVICES=OFF

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this'll currently work well for caching. ccache uses a lot of heuristics to invalidate cache hits and I'm quite sure compilation commands might use e.g. -I </tmp/pip-build-xxx> absolute paths which would trigger invalidation. that is bcs pip wheel uses /tmp to build the wheel.

this is just an educated guess. can you make sure that doesn't happen currently by simply executing this command twice on your local machine (of course with ccache installed)? you'll need to watch ccache hits/stats before & after the second run.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and what happened to #12 (comment)? the way you use it now would need ENABLE_TOOLS=ON but the others are still not necessary.

Comment on lines +50 to +52
- name: Install Python build dependencies
run: pip install scikit-build-core pyproject-metadata setuptools-scm pybind11

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty bad for maintenance: whenever we add a build dependency, we now have to update two places! remove --build-isolation and this step

Comment thread .github/workflows/build-tiles.yml Outdated
Comment on lines +76 to +79
valhalla_build_admins -c valhalla.json data/liechtenstein_graph.osm.pbf
valhalla_build_config \
--mjolnir-tile-dir valhalla_tiles > valhalla.json
valhalla_build_tiles -c valhalla.json data/liechtenstein_graph.osm.pbf

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep in mind to always verify locally before you actually commit, or at least before you ask for review! it's part of the learning process.

Comment on lines +47 to +48
- name: Install system dependencies
run: bash valhalla-src/scripts/install-linux-deps.sh

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first install, then restore cache

- name: Restore ccache
uses: tespkg/actions-cache/restore@v1
with:
endpoint: ${{ secrets.HETZNER_S3_ENDPOINT }}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did you get this secret from?

Comment on lines +62 to +64
- name: Install Valhalla from wheel
run: pip install /tmp/valhalla-dist/*.whl

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do want to control the python version this runs with.

Comment on lines +65 to +74
- name: Download tiles from build-tiles run
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
gh run download ${{ inputs.tiles_run_id }} \
--repo "${{ github.repository }}" \
--pattern "valhalla-tiles-*" \
--dir valhalla_tiles
shopt -s dotglob
mv valhalla_tiles/valhalla-tiles-*/* valhalla_tiles/ 2>/dev/null || true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not priority, but worth mentioning: one thing we said we want to look out for is future compatibility with e.g. scenario old graph/new graph. currently it's fixed to using a single graph, whatever tiles_run_id references. could do 2 of those, but then I'd need to look up 4 things: old/new SHA, old/new run_id. all separately. this is super duper error prone and can really really take time to realize, when smth nasty seems to happen on the diffs we kick off. avoid at all costs!

don't try hard to keep that requirement in mind for the next round of edits. we can deal with more scenarios once we get there. I just needed to mention it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also don't sync via artifacts. let the graphs be uploaded to S3, that's a much more idiomatic place for them, especially the master graph. PR graphs should be uploaded to S3 as well, they'll be cleaned once a PR closes.

Comment on lines +78 to +80
valhalla_build_config \
--mjolnir-tile-dir "$(pwd)/valhalla_tiles" \
> /tmp/valhalla.json

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass the dir (after your next edit: the tar of the dir) to run_routes.py instead. pyvalhalla can deal with simply the path, no need for an external config!

Comment on lines +86 to +91
- name: Upload responses artifact
uses: actions/upload-artifact@v4
with:
name: responses-${{ matrix.router.name }}
path: /tmp/responses-${{ matrix.router.name }}.jsonl
retention-days: 7

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is literally the only thing we want uploaded to GH artifacts.

Comment on lines +13 to +15
tiles_run_id:
description: "Run ID from a successful build-tiles.yml run"
required: true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty brittle and awkward to hunt down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GHA skeleton

2 participants