Montréal Road Inspector — Hybrid AI

HF Build Small Hackathon 2026 · Track 1 · Backyard AI · YOLOv8s 10M + VLM ≤32B (Modal GPU) · 4 Mtl Open Data sources

Pick a photo
Click an example below or upload your own.

Set GPS (strongly recommended)
The photo's EXIF GPS is read automatically, or you can enter lat/lon manually. Real coordinates unlock the Mtl Open Data context.

Analyze
7 transparent stages run live (~10-30s).

Photo — JPG / PNG / HEIC (iPhone)…

Preview

Bundled Mtl examples — click to load

Why it matters: with real coordinates the pipeline pulls the street's official record (class, arterial status, roughness IRI 2022) and fuses it into the severity verdict. Without GPS, the VLM guesses from pixels only.

Priority: photo EXIF (auto) > manual lat/lon below.

Latitude

Longitude

Weather is read from the photo, automatically. The VLM detects visual winter cues (snow, salt residue, ice patches) and estimates the freeze-thaw load — none/some/moderate/heavy → 0/2/4/6 cycles (Stage 5). We deliberately don't use a live weather API: a photo can be old, so its actual conditions matter more than today's weather. Why it matters: a pothole on Sherbrooke in March ≠ in July — freezing widens fissures and multiplies urgency.

Each stage runs sequentially. Cards update live. Total time: ~10-15 s (1st run: +5-10 s to load models).

Upload a photo to begin.

Stage 2 — YOLO annotated (all detections)


Stage 3 — Post-VLM filter (✅ green = kept, ❌ red X = rejected)


Stage 4 — Drivable surface + pothole exposure (🔴 center / ⚪ gutter)


How it works

This system combines a small fine-tuned vision model with a vision-language model to deliver operational road inspection — not just detection.

Architecture

PHOTO
  │
  ▼
┌─────────────────────────────────────────┐
│ STAGE 1-2 — YOLOv8s V2b multi-class     │
│ 10M params, fine-tuned on 7 datasets    │
│ Detects: pothole / crack / manhole      │
│ Visual severity via ASTM D6433 cascade  │
└─────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────┐
│ STAGE 3 — VLM cross-validation          │
│ Reviews each YOLO detection, rejects FPs│
│ with reasoning ("this is a car plate")  │
└─────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────┐
│ STAGE 4 — Surface seg. + exposure       │
│ SegFormer-B0 segments the drivable      │
│ surface, then measures each pothole's   │
│ exposure (center vs gutter) → impact    │
└─────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────┐
│ STAGE 5 — VLM road context inference    │
│ road_type / num_lanes / surface / winter│
│ (replaces GPS+Open Data if unavailable) │
└─────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────┐
│ STAGE 6 — Severity fusion (pure func)   │
│ visual ASTM × context = operational     │
│ priority S1-S4 (≤24h, <1 week, etc.)    │
└─────────────────────────────────────────┘
  │
  ▼
┌─────────────────────────────────────────┐
│ STAGE 7 — Operational output            │
│ Cost estimate · Crew · 311 ticket draft │
└─────────────────────────────────────────┘

Why hybrid?

Approach	Strength	Weakness
YOLOv8 alone	Fast (50ms), precise bbox	No reasoning, no context
VLM alone	Semantic reasoning	Slow (3-30s), bbox imprecise
Hybrid (this app)	YOLO speed + VLM intelligence	Combined orchestration

The VLM doesn't replace YOLO — it enhances it:

Filters obvious false positives (e.g. a license plate detected as a manhole)
Adds missing context (no GPS? infer from image)
Explains decisions in natural language (auditable)

Honest small-model fit

YOLOv8s = 10M params (truly small)
VLM = open-weight ≤32B — Gemma 4 12B or Nemotron 8B on your own Modal GPU (swappable live), or Qwen3-VL-32B via OpenRouter as a cloud fallback
Combined = each model does what it's best at

Total compute is dominated by YOLOv8 (~80% of analyses use detection only). The VLM runs on a remote Modal GPU, keeping the Space itself light.

Why weather matters (the freeze-thaw story)

"A pothole on Sherbrooke in March ≠ the same pothole in July. The visual is identical, but the operational urgency multiplies. A pothole detector that ignores climate misses 40% of Mtl's pothole crisis."

The physics:

Water infiltrates microfissures in the asphalt
Freezing → ice expands +9% volume → fissure widens
Thawing → water drains, leaves a void under the surface
Vehicle weight → asphalt collapses into the void → new pothole

→ Each freeze-thaw cycle is a destructive iteration.

Concrete impact: a Medium (S3) pothole on Sherbrooke (arterial + RAAV) becomes S4 (24h emergency) after 3+ freeze-thaw cycles in 7 days. Same pothole, different week, different priority.

That's why our combine() function fuses ASTM visual severity with a freeze-thaw signal — making the system operationally aware, not just visually aware.

Where does the freeze-thaw signal come from? From the photo itself. The VLM reads visual winter cues (snow, salt residue, ice patches) and maps them to a cycle count (none/some/moderate/heavy → 0/2/4/6). We deliberately avoid a live weather API: a citizen's photo can be weeks or months old, so its actual visible conditions matter far more than today's weather at those coordinates.

📖 Concepts & Glossary

Everything this app reasons about, explained plainly — the Montréal open-data terms, how severity is computed, and the design decision behind each stage. This page is the single source of truth for the "why".

What the app does

A civic-tech AI road inspector. A citizen or 311 operator uploads a pothole photo; the app inspects it like a road technician would:

Detects defects (potholes, cracks, manholes) with a small fine-tuned vision model.
Judges their visual severity (how bad the defect looks).
Contextualizes with Montréal open data + what the photo shows (road type, surface, traffic exposure, winter wear).
Drafts the municipal 311 repair ticket: priority, delay, cost, crew, materials.

The core idea: a pothole's urgency isn't just how it looks — it's where it is and what road it's on.

🏛️ Montréal open-data glossary

Géobase

The City of Montréal's official road-network dataset — ~47,000 street segments covering the island, each with attributes (name, class, direction…). When you provide GPS coordinates, we snap to the nearest segment to read its official record. Covers the island only — outside it, we fall back to the VLM.
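The snapping step can be sketched as a nearest-segment search. This is a minimal illustration, assuming each segment is reduced to a straight line between two (lat, lon) endpoints and that a flat-earth projection is accurate enough at city scale. The `Segment` type, the `snap_to_nearest` helper and the `max_distance_m` cutoff are hypothetical names introduced here; the real Géobase geometry and the app's actual lookup may differ.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0


@dataclass(frozen=True)
class Segment:
    """Hypothetical simplified street segment: a straight line between two points."""
    name: str
    start: tuple[float, float]  # (lat, lon) in degrees
    end: tuple[float, float]    # (lat, lon) in degrees


def _to_local_xy(lat: float, lon: float, ref_lat: float, ref_lon: float) -> tuple[float, float]:
    """Equirectangular projection to metres, centred on the reference point."""
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_M
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    return x, y


def distance_to_segment_m(lat: float, lon: float, seg: Segment) -> float:
    """Distance in metres from the query point to the closest point on the segment."""
    # The query point is the origin of the local frame.
    ax, ay = _to_local_xy(*seg.start, lat, lon)
    bx, by = _to_local_xy(*seg.end, lat, lon)
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(ax, ay)
    # Position of the closest point along the segment, clamped to its endpoints.
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / length_sq))
    return math.hypot(ax + t * dx, ay + t * dy)


def snap_to_nearest(lat: float, lon: float, segments: list[Segment],
                    max_distance_m: float = 50.0) -> Optional[Segment]:
    """Return the nearest segment, or None when nothing is close enough
    (assumed here to be the signal for falling back to the VLM)."""
    best = min(segments, key=lambda s: distance_to_segment_m(lat, lon, s), default=None)
    if best is None or distance_to_segment_m(lat, lon, best) > max_distance_m:
        return None
    return best
```

Returning `None` beyond a cutoff mirrors the fallback described above: when no segment is close enough, the pipeline would rely on the VLM rather than on a distant, unrelated street record.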

Road class (Géobase `CLASSE` 0-9)

Géobase classifies every street by function. A few key codes:

Code	Type	Meaning	Severity effect
0	locale	quiet residential street	−0.4 (can wait)
5	collecteur	feeds local traffic toward arteries	0.0 (neutral)
6	artère secondaire	secondary arterial	+0.3
7	artère principale	major arterial	+0.4 (urgent)
8	autoroute	highway	+0.5

A pothole on a major arterial (heavy traffic) is more urgent than the same one on a quiet local street — so road class shifts the severity.

RAAV — Réseau Artériel Administratif de la Ville

Montréal's administrative arterial network: the priority corridors the city manages as its main routes. Being on the RAAV bumps urgency (+0.1) — these roads carry the most traffic and matter most to keep open.

IRI — International Roughness Index

A standardized measure of how rough / bumpy a road surface is (m/km), recorded by instrumented survey vehicles. Low = smooth, high = degraded.

IRI state	Range	Severity effect
`good`	< 3.0	−0.2
`acceptable`	3.0–5.0	0.0
`bad`	5.0–7.0	+0.2
`very_bad`	≥ 7.0	+0.4
`unknown`	not surveyed	0.0 (neutral)

Why it matters: a pothole on an already-degraded road is more urgent (the whole surface is failing) than an isolated one on a smooth road.
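The IRI table above can be read as a simple threshold lookup. The sketch below assumes each range includes its lower bound (so 5.0 counts as `bad`), which the table does not state explicitly; `iri_state` and `IRI_SEVERITY_EFFECT` are illustrative names, not necessarily those used in the app.

```python
from typing import Optional

# Severity effect per IRI state, copied from the table above.
IRI_SEVERITY_EFFECT = {
    "good": -0.2,
    "acceptable": 0.0,
    "bad": 0.2,
    "very_bad": 0.4,
    "unknown": 0.0,
}


def iri_state(iri_m_per_km: Optional[float]) -> str:
    """Classify an IRI reading (m/km); None means the segment was not surveyed."""
    if iri_m_per_km is None:
        return "unknown"
    if iri_m_per_km < 3.0:
        return "good"
    if iri_m_per_km < 5.0:   # assumption: lower bound inclusive
        return "acceptable"
    if iri_m_per_km < 7.0:
        return "bad"
    return "very_bad"
```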

Auscultation 2022

The City's 2022 road-condition survey — the dataset that gives each surveyed segment its IRI. Not every street is surveyed every year (small collectors/locals are often skipped), so many segments return unknown. When that happens, the VLM's read of the surface fills the gap (see worst-wins below).

Freeze-thaw cycles

The number of times water trapped in the asphalt freezes (expands +9%) then thaws — the main driver of Montréal's pothole crisis. Each cycle widens fissures until the surface collapses. We read this from the photo (snow / salt / ice cues), not a live weather API — because a photo can be weeks or months old, so its actual visible conditions matter more than today's weather.

Cycles (7 d)	Severity effect
0	0.0
1–2	+0.1
3–4	+0.2
5–7	+0.3
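Putting the two mappings together (winter cues to a cycle count, then cycle count to a severity effect) gives a short lookup chain. This is a sketch under two assumptions that the text does not spell out: an `unknown` winter reading is treated as 0 cycles, and counts above 7 saturate at +0.3. The function names are illustrative.

```python
# VLM winter_signs value -> estimated freeze-thaw cycles over 7 days.
WINTER_SIGNS_TO_CYCLES = {"none": 0, "some": 2, "moderate": 4, "heavy": 6}


def freeze_thaw_weight(cycles_7d: int) -> float:
    """Severity effect for a 7-day freeze-thaw cycle count, per the table above."""
    if cycles_7d <= 0:
        return 0.0
    if cycles_7d <= 2:
        return 0.1
    if cycles_7d <= 4:
        return 0.2
    return 0.3  # 5-7 in the table; higher counts are assumed to saturate


def weight_from_winter_signs(winter_signs: str) -> float:
    """Map the VLM's winter_signs reading straight to a severity effect."""
    cycles = WINTER_SIGNS_TO_CYCLES.get(winter_signs, 0)  # assumption: "unknown" -> 0
    return freeze_thaw_weight(cycles)
```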

⚖️ How severity is computed

Three layers, each visible in the pipeline:

1. Visual severity (ASTM D6433 → S1-S4)

How bad the defect looks, from size + depth (MiDaS monocular depth). Mapped Low / Medium / High → base S2 / S3 / S4.

2. Context fusion (`combine()` — a pure, testable function)

base          = {Low: 2, Medium: 3, High: 4}[visual]
context_score = road_weight + raav_bonus + iri_weight + freeze_thaw_weight
                (clamped to [-1.0, +1.0])
delta         = round(context_score)        → -1, 0 or +1
final         = clamp(base + delta, 1, 4)    → S1..S4

So the same pothole can be S2 on a quiet street in summer, S3 on a degraded arterial, or S4 after a freezing week. Context can shift severity by at most ±1 level — it modulates, it doesn't override the visual.
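A runnable sketch of the formula above, using the weights from the glossary tables. The signature is an assumption (the real `combine()` may take different arguments), road classes not listed in the glossary are treated as neutral, and the rounding mode is a guess: Python's built-in `round` sends an exact 0.5 to 0, which happens to keep an arterial + RAAV pothole (+0.5) at its base level until another factor is added.

```python
BASE_LEVEL = {"Low": 2, "Medium": 3, "High": 4}

# Weights copied from the glossary tables; unlisted road classes are assumed neutral.
ROAD_CLASS_WEIGHT = {0: -0.4, 5: 0.0, 6: 0.3, 7: 0.4, 8: 0.5}
IRI_WEIGHT = {"good": -0.2, "acceptable": 0.0, "bad": 0.2, "very_bad": 0.4, "unknown": 0.0}
RAAV_BONUS = 0.1


def _freeze_thaw_weight(cycles_7d: int) -> float:
    if cycles_7d <= 0:
        return 0.0
    if cycles_7d <= 2:
        return 0.1
    if cycles_7d <= 4:
        return 0.2
    return 0.3


def combine(visual: str, road_class: int, iri_state: str,
            freeze_thaw_7d: int, is_raav: bool) -> int:
    """Fuse visual severity with road context; returns the level 1..4 (S1..S4)."""
    context_score = (
        ROAD_CLASS_WEIGHT.get(road_class, 0.0)
        + (RAAV_BONUS if is_raav else 0.0)
        + IRI_WEIGHT.get(iri_state, 0.0)
        + _freeze_thaw_weight(freeze_thaw_7d)
    )
    context_score = max(-1.0, min(1.0, context_score))
    delta = round(context_score)  # -1, 0 or +1; ties round to even in Python
    return max(1, min(4, BASE_LEVEL[visual] + delta))


# Medium pothole on a major arterial on the RAAV, no IRI record:
print(combine("Medium", 7, "unknown", 0, True))  # context +0.5
print(combine("Medium", 7, "unknown", 3, True))  # context +0.7 after a freezing week
```

Because the function has no side effects and depends only on its arguments, each row of the glossary tables can be covered by a plain unit test.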

3. Traffic exposure (from road segmentation)

Where is the pothole on the drivable surface?

Center of the wheel path + blocking a notable width → +1 (every vehicle hits it)
In the gutter / off the road → −1 (rarely hit)
otherwise → 0

This is what road detection is for — not drawing lines, but measuring impact: position + footprint on the drivable surface.
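One way to turn the three rules above into code is to compare the pothole's horizontal extent with the drivable surface's extent on the same image row. The sketch below is illustrative only: the function name, the pixel-based inputs, and all three thresholds (`center_band`, `min_width_ratio`, `gutter_band`) are hypothetical, and the middle of the drivable surface is used as a stand-in for the wheel path.

```python
def traffic_exposure(road_left_px: float, road_right_px: float,
                     pothole_left_px: float, pothole_right_px: float,
                     center_band: float = 0.5,
                     min_width_ratio: float = 0.15,
                     gutter_band: float = 0.15) -> int:
    """Return +1 (central and wide), -1 (gutter or off the road), or 0 (otherwise)."""
    road_width = road_right_px - road_left_px
    if road_width <= 0:
        return 0  # no drivable surface found on this row: stay neutral

    center = (pothole_left_px + pothole_right_px) / 2
    if center < road_left_px or center > road_right_px:
        return -1  # pothole centre lies outside the drivable surface

    rel = (center - road_left_px) / road_width            # 0 = left edge, 1 = right edge
    width_ratio = (pothole_right_px - pothole_left_px) / road_width

    if min(rel, 1.0 - rel) < gutter_band:
        return -1  # hugging the edge: rarely hit

    in_center = abs(rel - 0.5) <= center_band / 2
    if in_center and width_ratio >= min_width_ratio:
        return 1   # central and blocking a notable width
    return 0
```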

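Final severity combines visual severity, road context, and traffic exposure. Despite the × shorthand used elsewhere, each factor is an additive level shift, not a multiplication.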

Severity → operational action (S1-S4)

S	Meaning	Priority	Max delay
🔴 S4	emergency	P1	24 h
🟠 S3	major	P2	1 week
🟡 S2	moderate	P3	1 month
🟢 S1	minor	P4	next cycle

🧠 Design decisions (and why)

Worst-wins IRI + confidence gate

When the Géobase IRI (2022 record) and the VLM's read of the surface disagree, we keep the worse of the two — a stale optimistic record shouldn't hide visible degradation. But the VLM can only worsen the IRI if it's confident (≥ 0.7), so a hesitant guess doesn't inflate severity. Conflicts are always shown with a ⚠️ badge.
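The worst-wins rule with its confidence gate can be sketched as follows. The mapping from the VLM's `surface_condition` values to IRI states is an assumption made for illustration, as is the choice to apply the same 0.7 gate when the city record is `unknown`; `fuse_iri` is a hypothetical name.

```python
IRI_ORDER = ["good", "acceptable", "bad", "very_bad"]  # best to worst

# Assumed correspondence between the VLM's surface reading and IRI states.
SURFACE_TO_IRI_STATE = {
    "smooth": "good",
    "moderate_wear": "acceptable",
    "heavy_wear": "bad",
    "very_degraded": "very_bad",
}


def fuse_iri(record_state: str, vlm_surface: str, vlm_confidence: float,
             min_confidence: float = 0.7) -> tuple[str, bool]:
    """Return (state used for scoring, conflict flag for the warning badge)."""
    vlm_state = SURFACE_TO_IRI_STATE.get(vlm_surface)
    if vlm_state is None:
        return record_state, False  # the VLM has no opinion

    confident = vlm_confidence >= min_confidence
    if record_state not in IRI_ORDER:
        # No survey record: the VLM fills the gap (gate applied by assumption).
        return (vlm_state if confident else "unknown"), False

    conflict = vlm_state != record_state
    vlm_is_worse = IRI_ORDER.index(vlm_state) > IRI_ORDER.index(record_state)
    if vlm_is_worse and confident:
        return vlm_state, conflict
    return record_state, conflict  # the record is worse, or the VLM is hesitant
```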

Weather read from the photo, not an API

A live "last 7 days" weather feed is anchored to today — wrong for an old photo. The VLM reads winter cues from the image itself, so the signal matches the actual scene.

Surface segmentation, not lane-line tracing

Lane-line detectors (and VLMs asked to trace lines) fail on close-up, unmarked 311 photos — there's simply no line to trace. Instead we segment the road surface (SegFormer-B0) — a region that always exists — and measure pothole exposure on it. The research literature backs this: road-surface segmentation beats line tracing for unmarked roads.

Outside Montréal → the VLM takes over

Géobase covers the island only. For coordinates outside it, there is no city data — we say so honestly and rely on the VLM's inference from the photo, instead of presenting neutral defaults as if they were real.

Géobase primes the VLM

For administrative facts (road class, direction, lane count) the city knows better than a single photo, so we pass them to the VLM as a strong prior — it aligns instead of guessing, cutting gratuitous divergence. The VLM may still override on a clear visual contradiction.

🔭 The 7 stages

Metadata — GPS (EXIF / manual) + Montréal open-data lookup
YOLO detection — potholes / cracks / manholes + visual severity
VLM cross-validation — rejects false positives, with reasoning
Road segmentation + exposure — drivable surface + per-pothole traffic exposure
VLM context inference — road type, surface, winter cues (primed by Géobase)
Severity fusion — visual × context × exposure → S1-S4
Operational output — priority, delay, cost, crew, 311 ticket draft

⚠️ Honest limitations

VLM bounding boxes are approximate (not pixel-precise masks).
Road segmentation is dashcam-trained (Cityscapes): on an extreme oblique close-up the foreground curb may be over-included — it does not affect the severity score.
VLM filtering is conservative — it can reject borderline real defects (false negatives), favouring precision over recall.
IRI and visual condition are on different scales: IRI is segment-level roughness measured by instrument, while the VLM read is a local visual impression. The two are related but not identical.
A single photo yields relative impact (center vs edge), not metric measurements without scale calibration.

Everything here is implemented in src/ and visible live in the Analyze tab. Prompts sent to the VLM are in the Prompts tab; models and datasets in the Stack tab.

Prompts used (transparency)

All prompts sent to the active VLM during the pipeline. The same prompts are used regardless of backend, since the pipeline is model-agnostic. Versioned in src/vlm_hub/prompts/ (PROMPT_VERSION = "1.0").

Stage 3 — Detection validation

System prompt:

You are a road infrastructure inspector. You are reviewing detections from a YOLOv8 object detector running on road photos. Be strict but fair. Return ONLY valid JSON. No markdown fences. No preamble.

User prompt:

The image shows a road. A YOLOv8 detector flagged candidate road defects. For each candidate listed below, visually verify whether it really is what the detector claims. Common false positives to reject:
  - Car parts, license plates, shadows labelled as potholes
  - Manhole covers labelled as potholes
  - Lane markings or asphalt joints labelled as cracks
  - Wet spots or puddles without a real cavity

Detections to review:
{detections}

Return JSON in this EXACT format (no markdown):
{{
  "validations": [
    {{
      "detection_id": 0,
      "valid": true,
      "vlm_confidence": 0.95,
      "reasoning": "clear depth, asphalt cracking around indicates real pothole"
    }},
    {{
      "detection_id": 1,
      "valid": false,
      "vlm_confidence": 0.85,
      "reasoning": "this is a license plate, not a manhole"
    }}
  ]
}}

Include ALL detection_ids from the input, even if you reject them.
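On the receiving side, the reply to this prompt has to be parsed and checked before any detection is dropped. The sketch below is not part of the prompt and not taken from the app's code: `apply_validations` is a hypothetical helper, and raising an error when the model skips an id is one possible policy among several.

```python
import json


def apply_validations(raw_response: str, detection_ids: list[int]) -> dict[int, bool]:
    """Parse the Stage 3 reply and return {detection_id: keep?} for every input id."""
    payload = json.loads(raw_response)
    verdicts = {
        int(item["detection_id"]): bool(item["valid"])
        for item in payload["validations"]
    }
    # The prompt asks for ALL ids, so a missing one means the reply is unusable.
    missing = set(detection_ids) - set(verdicts)
    if missing:
        raise ValueError(f"VLM skipped detection ids: {sorted(missing)}")
    return {det_id: verdicts[det_id] for det_id in detection_ids}


reply = """{"validations": [
  {"detection_id": 0, "valid": true,  "vlm_confidence": 0.95, "reasoning": "real cavity"},
  {"detection_id": 1, "valid": false, "vlm_confidence": 0.85, "reasoning": "license plate"}
]}"""
kept = [i for i, ok in apply_validations(reply, [0, 1]).items() if ok]
print(kept)  # ids of the detections that survive the filter
```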

Stage 4 — Drivable-surface segmentation (no VLM prompt)

Stage 4 does not call the VLM. We found that asking any VLM (or even a dedicated lane detector) to trace lane lines fails on close-up, unmarked 311 photos — there is simply no line to trace. Instead we run SegFormer-B0 (Cityscapes) to segment the road surface directly and derive its boundary. A road-surface region always exists; a lane line often doesn't. See the literature on road-surface segmentation vs line tracing for unmarked roads.

Stage 5 — Context inference

System prompt:

You are a road infrastructure analyst. You infer contextual variables from a single road photo. You may receive HINTS from the city's official open data — for administrative facts (road class, direction, lane count) align with them unless the photo clearly contradicts. Return ONLY valid JSON with EXACT enum values listed. No markdown fences. No preamble.

User prompt:

Analyze this road photo and infer the operational context using only visual cues (road width, marking type, surrounding buildings, signage, visible weather effects).

Return JSON in this EXACT format (use STRICTLY these enum values):
{
  "road_type": "residential" | "collector" | "arterial_secondary" | "arterial" | "highway" | "pedestrian" | "private" | "business" | "unknown",
  "direction": "one_way" | "two_way" | "unknown",
  "num_lanes": integer 0 to 12,
  "surface_condition": "smooth" | "moderate_wear" | "heavy_wear" | "very_degraded" | "unknown",
  "winter_signs": "none" | "some" | "moderate" | "heavy" | "unknown",
  "confidence": float 0.0 to 1.0 (your overall confidence in this inference),
  "reasoning": "1-2 sentence justification citing the visual cues you noticed"
}

Quick reference for road_type:
  - residential : small two-lane neighborhood streets, low buildings
  - collector   : 2-4 lanes connecting neighborhoods to arteries
  - arterial    : 4+ lanes, major boulevards, commercial/mixed strips
  - highway     : limited-access, no pedestrians, multiple lanes per direction
  - pedestrian  : closed to vehicles, paved walkways
  - private     : driveway, parking lot, gated road
  - business    : parking lot or service road in commercial area
  - unknown     : not enough cues

Quick reference for winter_signs:
  - none     : no snow/ice/salt visible
  - some     : minor salt residue or wet patches
  - moderate : visible snow piles, slush, salt streaks
  - heavy    : snowpack on road, ice patches, deep slush
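Because the prompt insists on exact enum values, the reply can be validated mechanically before it reaches severity fusion. The following is a minimal sketch, not the app's actual parser: `parse_context` is a hypothetical name, and rejecting the whole reply on the first invalid field is an assumed policy.

```python
import json

# Allowed values, copied from the prompt above.
ENUMS = {
    "road_type": {"residential", "collector", "arterial_secondary", "arterial",
                  "highway", "pedestrian", "private", "business", "unknown"},
    "direction": {"one_way", "two_way", "unknown"},
    "surface_condition": {"smooth", "moderate_wear", "heavy_wear",
                          "very_degraded", "unknown"},
    "winter_signs": {"none", "some", "moderate", "heavy", "unknown"},
}


def parse_context(raw_response: str) -> dict:
    """Parse the Stage 5 reply and reject anything outside the declared schema."""
    data = json.loads(raw_response)
    for field, allowed in ENUMS.items():
        if data.get(field) not in allowed:
            raise ValueError(f"{field}: unexpected value {data.get(field)!r}")

    lanes = data.get("num_lanes")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(lanes, bool) or not isinstance(lanes, int) or not 0 <= lanes <= 12:
        raise ValueError(f"num_lanes: expected integer 0 to 12, got {lanes!r}")

    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence: expected float 0.0 to 1.0, got {conf!r}")
    return data
```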

Tech stack

Models

Component	Model	Params	Source
Detection	YOLOv8s V2b multi-class	10M	Fine-tuned on 7 datasets (see below)
Reasoning (VLM)	Gemma 4 12B / Nemotron 8B (Modal GPU) · Qwen3-VL-32B (OpenRouter fallback)	≤32B	Open weights, swappable live
Road surface seg.	SegFormer-B0 Cityscapes	3.7M	NVIDIA (pre-trained)
Depth (severity)	MiDaS DPT-Hybrid	123M	Intel (pre-trained)
Lane detection (V1 only)	YOLOPv2	80M	Open weights

Per-model cap: every model is ≤32B (hackathon-legal). The VLM runs on a remote Modal GPU under your own account; the Space itself stays light.

Datasets fused (15,580 training images, 1,994 validation images)

Dataset	Source	Images	Classes
Roboflow Mtl	Roboflow Universe	3,174	pothole
IVCNZ Chitale 2020	Open dataset	1,243	pothole
Chitholian Kaggle	Kaggle	665	pothole
Idanbaru severity	Kaggle	717	pothole + severity GT
Sabidrahman multi-class	Kaggle	2,130 (clean)	pothole + crack + manhole
Lorenzoarcioni multi-class	Kaggle	2,009	pothole + crack + manhole
RDD2022 US/CZ/NO	IEEE BigData	7,931	longitudinal/transverse/alligator + pothole

Performance (V2b validation, 1,994 images):

mAP50 global: 0.67
pothole: 0.68
crack: 0.54
manhole: 0.80

Severity scoring

ASTM D6433 cascade:

astm_full: diameter + depth → strict ASTM matrix
astm_partial_no_depth: diameter only → matrix with depth assumed medium
heuristic + depth: no calibration → composite score (area + depth + position)
heuristic: neither calibration nor depth → area + position
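The cascade above amounts to picking the strictest method that the available measurements support. A minimal sketch, assuming "calibration" means a known pixel-to-metric scale (needed to measure diameter) and that the two availability flags are the only inputs; `select_severity_method` is an illustrative name.

```python
def select_severity_method(has_scale_calibration: bool, has_depth: bool) -> str:
    """Pick the severity method by falling through the cascade, strictest first."""
    if has_scale_calibration and has_depth:
        return "astm_full"              # diameter + depth: strict ASTM matrix
    if has_scale_calibration:
        return "astm_partial_no_depth"  # diameter only, depth assumed medium
    if has_depth:
        return "heuristic + depth"      # composite score: area + depth + position
    return "heuristic"                  # area + position only
```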

Then combine() (pure function) merges visual severity with contextual variables:

road_class (Mtl Géobase 0-9)
historical_iri (Mtl auscultation 2022)
freeze_thaw_7d (estimated by the VLM from the photo's visual winter cues — snow/salt/ice — not a live weather API, so it stays valid for old photos)
is_raav (Mtl Réseau Artériel Administratif)

Repository

Code: huggingface.co/spaces/build-small-hackathon/montreal-road-inspector — see the Files tab
Models: bundled in this Space via Git LFS
License: MIT

Built for

HF Build Small Hackathon 2026 — Track 1 "Backyard AI"

🎬 Demo video · 📝 Field notes · 📣 Social post · 💻 Code
Independent project, not affiliated with the Ville de Montréal · HF Build Small Hackathon 2026 · Backyard AI
