Montréal Road Inspector — Hybrid AI
Click an example below or upload your own.
The photo's EXIF GPS is read automatically, or you can enter lat/lon manually. Real coordinates unlock the Montréal open-data context.
7 transparent stages run live (~10-30s).
Why it matters: with real coordinates the pipeline pulls the street's official record (class, arterial status, roughness IRI 2022) and fuses it into the severity verdict. Without GPS, the VLM guesses from pixels only.
Priority: photo EXIF (auto) > manual lat/lon below.
VLM served on your own Modal GPU endpoint (public web endpoint). Pick the model below and run a live head-to-head: Google Gemma 4 12B (best grounding) vs NVIDIA Nemotron 8B. The pipeline is model-agnostic (src/vlm_hub) — same 7 stages, only the VLM changes. The OpenRouter selector below is ignored in this backend.
Each stage runs sequentially. Cards update live. Total time: ~10-15 s (1st run: +5-10 s to load models).
Upload a photo to begin.
How it works
This system combines a small fine-tuned vision model with a vision-language model to deliver operational road inspection — not just detection.
Architecture
PHOTO
│
▼
┌─────────────────────────────────────────┐
│ STAGE 1-2 — YOLOv8s V2b multi-class │
│ 10M params, fine-tuned on 7 datasets │
│ Detects: pothole / crack / manhole │
│ Visual severity via ASTM D6433 cascade │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ STAGE 3 — VLM cross-validation │
│ Reviews each YOLO detection, rejects FPs│
│ with reasoning ("this is a car plate") │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ STAGE 4 — Surface seg. + exposure │
│ SegFormer-B0 segments the drivable │
│ surface, then measures each pothole's │
│ exposure (center vs gutter) → impact │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ STAGE 5 — VLM road context inference │
│ road_type / num_lanes / surface / winter│
│ (replaces GPS+Open Data if unavailable) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ STAGE 6 — Severity fusion (pure func) │
│ visual ASTM × context = operational │
│ priority S1-S4 (≤24h, <1 week, etc.) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ STAGE 7 — Operational output │
│ Cost estimate · Crew · 311 ticket draft │
└─────────────────────────────────────────┘
Why hybrid?
| Approach | Strength | Weakness |
|---|---|---|
| YOLOv8 alone | Fast (50ms), precise bbox | No reasoning, no context |
| VLM alone | Semantic reasoning | Slow (3-30s), bbox imprecise |
| Hybrid (this app) | YOLO speed + VLM intelligence | Combined orchestration |
The VLM doesn't replace YOLO — it enhances it:
- Filters obvious false positives (manhole on license plate)
- Adds missing context (no GPS? infer from image)
- Explains decisions in natural language (auditable)
Honest small-model fit
- YOLOv8s = 10M params (truly small)
- VLM = open-weight ≤32B — Gemma 4 12B or Nemotron 8B on your own Modal GPU (swappable live), or Qwen3-VL-32B via OpenRouter as a cloud fallback
- Combined = each model does what it's best at
Total compute is dominated by YOLOv8 (~80% of analyses use detection only). The VLM runs on a remote Modal GPU, keeping the Space itself light.
Why weather matters (the freeze-thaw story)
"A pothole on Sherbrooke in March ≠ the same pothole in July. The visual is identical, but the operational urgency multiplies. A pothole detector that ignores climate misses 40% of Mtl's pothole crisis."
The physics:
- Water infiltrates microfissures in the asphalt
- Freezing → ice expands +9% volume → fissure widens
- Thawing → water drains, leaves a void under the surface
- Vehicle weight → asphalt collapses into the void → new pothole
→ Each freeze-thaw cycle is a destructive iteration.
Concrete impact: a Medium (S3) pothole on Sherbrooke (arterial + RAAV)
becomes S4 (24h emergency) after 3+ freeze-thaw cycles in 7 days.
Same pothole, different week, different priority.
That's why our combine() function fuses ASTM visual severity with a
freeze-thaw signal — making the system operationally aware, not just
visually aware.
Where does the freeze-thaw signal come from? From the photo itself. The VLM reads visual winter cues (snow, salt residue, ice patches) and maps them to a cycle count (none/some/moderate/heavy → 0/2/4/6). We deliberately avoid a live weather API: a citizen's photo can be weeks or months old, so its actual visible conditions matter far more than today's weather at those coordinates.
📖 Concepts & Glossary
Everything this app reasons about, explained plainly — the Montréal open-data terms, how severity is computed, and the design decision behind each stage. This page is the single source of truth for the "why".
What the app does
A civic-tech AI road inspector. A citizen or 311 operator uploads a pothole photo; the app inspects it like a road technician would:
- Detects defects (potholes, cracks, manholes) with a small fine-tuned vision model.
- Judges their visual severity (how bad the defect looks).
- Contextualizes with Montréal open data + what the photo shows (road type, surface, traffic exposure, winter wear).
- Drafts the municipal 311 repair ticket: priority, delay, cost, crew, materials.
The core idea: a pothole's urgency isn't just how it looks — it's where it is and what road it's on.
🏛️ Montréal open-data glossary
Géobase
The City of Montréal's official road-network dataset — ~47,000 street segments covering the island, each with attributes (name, class, direction…). When you provide GPS coordinates, we snap to the nearest segment to read its official record. Covers the island only — outside it, we fall back to the VLM.
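The snapping step can be illustrated with a nearest-neighbour lookup. This is a minimal sketch under stated assumptions, not the app's implementation: real Géobase segments are polylines, whereas here each segment is reduced to a single hypothetical midpoint, and `snap_to_segment`, the `Segment` tuple layout and the 50 m cut-off are all invented for illustration.

```python
import math
from typing import Iterable, Optional, Tuple

# Hypothetical layout: (segment name, midpoint lat, midpoint lon)
Segment = Tuple[str, float, float]


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def snap_to_segment(lat: float, lon: float, segments: Iterable[Segment],
                    max_distance_m: float = 50.0) -> Optional[Segment]:
    """Return the closest segment, or None when nothing is within range."""
    best = min(segments, key=lambda s: haversine_m(lat, lon, s[1], s[2]), default=None)
    if best is None or haversine_m(lat, lon, best[1], best[2]) > max_distance_m:
        return None  # no official record nearby: fall back to the VLM
    return best
```

Returning `None` rather than a far-away segment mirrors the stated behaviour outside the island: no city record is better than a wrong one.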
Road class (Géobase CLASSE 0-9)
Géobase classifies every street by function. A few key codes:
| Code | Type | Meaning | Severity effect |
|---|---|---|---|
| 0 | locale | quiet residential street | −0.4 (can wait) |
| 5 | collecteur | feeds local traffic toward arteries | 0.0 (neutral) |
| 6 | artère secondaire | secondary arterial | +0.3 |
| 7 | artère principale | major arterial | +0.4 (urgent) |
| 8 | autoroute | highway | +0.5 |
A pothole on a major arterial (heavy traffic) is more urgent than the same one on a quiet local street — so road class shifts the severity.
RAAV — Réseau Artériel Administratif de la Ville
Montréal's administrative arterial network: the priority corridors the city manages as its main routes. Being on the RAAV bumps urgency (+0.1) — these roads carry the most traffic and matter most to keep open.
IRI — International Roughness Index
A standardized measure of how rough / bumpy a road surface is (m/km), recorded by instrumented survey vehicles. Low = smooth, high = degraded.
| IRI state | Range (m/km) | Severity effect |
|---|---|---|
| good | < 3.0 | −0.2 |
| acceptable | 3.0–5.0 | 0.0 |
| bad | 5.0–7.0 | +0.2 |
| very_bad | ≥ 7.0 | +0.4 |
| unknown | not surveyed | 0.0 (neutral) |
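The IRI thresholds translate directly into a small classifier. The sketch below is illustrative only: `classify_iri` and `IRI_WEIGHTS` are hypothetical names, and treating each range as lower-inclusive (so exactly 5.0 counts as bad) is an assumption made to stay consistent with the "≥ 7.0" row.

```python
from typing import Optional, Tuple

# Severity weight per IRI state, copied from the table.
IRI_WEIGHTS = {
    "good": -0.2,
    "acceptable": 0.0,
    "bad": 0.2,
    "very_bad": 0.4,
    "unknown": 0.0,
}


def classify_iri(iri_m_per_km: Optional[float]) -> Tuple[str, float]:
    """Map an IRI reading (m/km) to its state and severity weight."""
    if iri_m_per_km is None:
        state = "unknown"  # segment not surveyed
    elif iri_m_per_km < 3.0:
        state = "good"
    elif iri_m_per_km < 5.0:
        state = "acceptable"
    elif iri_m_per_km < 7.0:
        state = "bad"
    else:
        state = "very_bad"
    return state, IRI_WEIGHTS[state]
```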
Why it matters: a pothole on an already-degraded road is more urgent (the whole surface is failing) than an isolated one on a smooth road.
Auscultation 2022
The City's 2022 road-condition survey — the dataset that gives each
surveyed segment its IRI. Not every street is surveyed every year (small
collectors/locals are often skipped), so many segments return unknown.
When that happens, the VLM's read of the surface fills the gap (see
worst-wins below).
Freeze-thaw cycles
The number of times water trapped in the asphalt freezes (expands +9%) then thaws — the main driver of Montréal's pothole crisis. Each cycle widens fissures until the surface collapses. We read this from the photo (snow / salt / ice cues), not a live weather API — because a photo can be weeks or months old, so its actual visible conditions matter more than today's weather.
| Cycles (7 d) | Severity effect |
|---|---|
| 0 | 0.0 |
| 1–2 | +0.1 |
| 3–4 | +0.2 |
| 5–7 | +0.3 |
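Putting the two freeze-thaw mappings together (winter cues to a cycle count of 0/2/4/6, then cycle count to a severity weight) gives a short lookup chain. This is a sketch with hypothetical names; treating "unknown" as 0 cycles and capping counts above 7 at +0.3 are assumptions, since the source does not cover those cases.

```python
# Stage 5 winter_signs value -> estimated freeze-thaw cycles (0/2/4/6).
WINTER_SIGNS_TO_CYCLES = {"none": 0, "some": 2, "moderate": 4, "heavy": 6}


def freeze_thaw_weight(cycles: int) -> float:
    """Severity weight for a 7-day freeze-thaw cycle count, per the table."""
    if cycles <= 0:
        return 0.0
    if cycles <= 2:
        return 0.1
    if cycles <= 4:
        return 0.2
    return 0.3  # 5-7 cycles; larger counts are capped here (assumption)


def weight_from_winter_signs(winter_signs: str) -> float:
    """Chain both mappings; unexpected values count as 0 cycles (assumption)."""
    return freeze_thaw_weight(WINTER_SIGNS_TO_CYCLES.get(winter_signs, 0))
```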
⚖️ How severity is computed
Three layers, each visible in the pipeline:
1. Visual severity (ASTM D6433 → S1-S4)
How bad the defect looks, from size + depth (MiDaS monocular depth). Mapped Low / Medium / High → base S2 / S3 / S4.
2. Context fusion (combine() — a pure, testable function)
base = {Low: 2, Medium: 3, High: 4}[visual]
context_score = road_weight + raav_bonus + iri_weight + freeze_thaw_weight
(clamped to [-1.0, +1.0])
delta = round(context_score) → -1, 0 or +1
final = clamp(base + delta, 1, 4) → S1..S4
So the same pothole can be S2 on a quiet street in summer, S3 on a degraded arterial, or S4 after a freezing week. Context can shift severity by at most ±1 level — it modulates, it doesn't override the visual.
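As a runnable illustration of the fusion rule, here is a minimal Python sketch. The signature is hypothetical and the real `combine()` in src/ may differ. Two assumptions are worth flagging: road classes not listed in the glossary table are treated as neutral, and `round` is Python's built-in, which rounds exact halves to the nearest even integer (a context score of exactly 0.5 yields a delta of 0).

```python
# Weights for the Géobase CLASSE codes listed in the glossary table.
ROAD_WEIGHTS = {0: -0.4, 5: 0.0, 6: 0.3, 7: 0.4, 8: 0.5}
BASE = {"Low": 2, "Medium": 3, "High": 4}


def combine(visual: str, road_class: int, is_raav: bool,
            iri_weight: float, freeze_thaw_weight: float) -> int:
    """Fuse visual severity with context; returns 1..4, meaning S1..S4."""
    base = BASE[visual]
    road_weight = ROAD_WEIGHTS.get(road_class, 0.0)  # unlisted codes: neutral (assumption)
    raav_bonus = 0.1 if is_raav else 0.0
    context_score = road_weight + raav_bonus + iri_weight + freeze_thaw_weight
    context_score = max(-1.0, min(1.0, context_score))  # clamp to [-1.0, +1.0]
    delta = round(context_score)  # -1, 0 or +1; exact halves round to even
    return max(1, min(4, base + delta))


# Medium pothole on a major arterial (class 7) on the RAAV, IRI unknown:
print(combine("Medium", 7, True, 0.0, 0.0))  # context 0.5 -> S3
print(combine("Medium", 7, True, 0.0, 0.2))  # 3-4 freeze-thaw cycles -> S4
```

Under these assumptions the sketch reproduces the Sherbrooke example: the same Medium pothole stays S3 in a mild week and becomes S4 once the freeze-thaw weight is added.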
3. Traffic exposure (from road segmentation)
Where is the pothole on the drivable surface?
- Center of the wheel path + blocking a notable width → +1 (every vehicle hits it)
- In the gutter / off the road → −1 (rarely hit)
- otherwise → 0
This is what road detection is for — not drawing lines, but measuring impact: position + footprint on the drivable surface.
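The three exposure rules can be sketched as a pure function over measurements taken from the segmentation mask. Everything numeric here is a hypothetical placeholder: the source gives the rules but not the thresholds, and `exposure_delta` with its three inputs is an invented interface.

```python
# All thresholds are hypothetical placeholders, not the app's values.
GUTTER_BAND = 0.10                 # outer 10% of the surface on each side
CENTER_LO, CENTER_HI = 0.30, 0.70  # band treated as the wheel-path centre
NOTABLE_WIDTH = 0.15               # pothole spans >= 15% of the surface width
MIN_ON_ROAD = 0.50                 # below this overlap, treated as off-road


def exposure_delta(lateral_pos: float, width_fraction: float,
                   on_road_fraction: float) -> int:
    """Return +1, 0 or -1 for one pothole.

    lateral_pos: pothole centre across the drivable surface (0 = left edge, 1 = right edge).
    width_fraction: pothole width divided by the surface width at that image row.
    on_road_fraction: share of the pothole box overlapping the road mask.
    """
    edge_distance = min(lateral_pos, 1.0 - lateral_pos)
    if on_road_fraction < MIN_ON_ROAD or edge_distance < GUTTER_BAND:
        return -1  # gutter or off the road: rarely hit
    if CENTER_LO <= lateral_pos <= CENTER_HI and width_fraction >= NOTABLE_WIDTH:
        return 1   # central and wide: most vehicles hit it
    return 0
```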
Final severity = visual × road context × traffic exposure.
Severity → operational action (S1-S4)
| S | Meaning | Priority | Max delay |
|---|---|---|---|
| 🔴 S4 | emergency | P1 | 24 h |
| 🟠 S3 | major | P2 | 1 week |
| 🟡 S2 | moderate | P3 | 1 month |
| 🟢 S1 | minor | P4 | next cycle |
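For downstream code, the action table is naturally a lookup keyed by severity level. A minimal sketch follows; the `Action` dataclass and `action_for` are hypothetical names, and only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    meaning: str
    priority: str
    max_delay: str


# Values copied from the S1-S4 table.
ACTIONS = {
    4: Action("emergency", "P1", "24 h"),
    3: Action("major", "P2", "1 week"),
    2: Action("moderate", "P3", "1 month"),
    1: Action("minor", "P4", "next cycle"),
}


def action_for(level: int) -> Action:
    """Look up the operational action for a severity level 1..4."""
    if level not in ACTIONS:
        raise ValueError(f"severity level must be 1..4, got {level}")
    return ACTIONS[level]
```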
🧠 Design decisions (and why)
Worst-wins IRI + confidence gate
When the Géobase IRI (2022 record) and the VLM's read of the surface disagree, we keep the worse of the two — a stale optimistic record shouldn't hide visible degradation. But the VLM can only worsen the IRI if it's confident (≥ 0.7), so a hesitant guess doesn't inflate severity. Conflicts are always shown with a ⚠️ badge.
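The worst-wins rule with its confidence gate might look like the sketch below. Several points are assumptions rather than documented behaviour: the one-to-one correspondence between the VLM's `surface_condition` values and IRI states, the handling of an unknown record (here the VLM read is adopted only when confident), and the names `fuse_iri` and `VLM_TO_IRI`.

```python
from typing import Tuple

IRI_ORDER = ["good", "acceptable", "bad", "very_bad"]  # best -> worst

# Assumed correspondence between surface_condition and IRI states.
VLM_TO_IRI = {
    "smooth": "good",
    "moderate_wear": "acceptable",
    "heavy_wear": "bad",
    "very_degraded": "very_bad",
}

MIN_CONFIDENCE = 0.7  # gate stated in the text


def fuse_iri(geobase_state: str, vlm_surface: str,
             vlm_confidence: float) -> Tuple[str, bool]:
    """Return (effective IRI state, conflict flag for the warning badge)."""
    vlm_state = VLM_TO_IRI.get(vlm_surface)  # None when the VLM says "unknown"
    if vlm_state is None:
        return geobase_state, False
    if geobase_state not in IRI_ORDER:
        # Record is "unknown": a confident VLM read fills the gap (assumption).
        if vlm_confidence >= MIN_CONFIDENCE:
            return vlm_state, False
        return geobase_state, False
    conflict = vlm_state != geobase_state
    vlm_is_worse = IRI_ORDER.index(vlm_state) > IRI_ORDER.index(geobase_state)
    if vlm_is_worse and vlm_confidence >= MIN_CONFIDENCE:
        return vlm_state, conflict  # confident and worse: the VLM wins
    return geobase_state, conflict  # otherwise keep the official record
```

Note that the conflict flag is raised even when the gate blocks the VLM, so a low-confidence disagreement is still visible to the operator.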
Weather read from the photo, not an API
A live "last 7 days" weather feed is anchored to today — wrong for an old photo. The VLM reads winter cues from the image itself, so the signal matches the actual scene.
Surface segmentation, not lane-line tracing
Lane-line detectors (and VLMs asked to trace lines) fail on close-up, unmarked 311 photos — there's simply no line to trace. Instead we segment the road surface (SegFormer-B0) — a region that always exists — and measure pothole exposure on it. The research literature backs this: road-surface segmentation beats line tracing for unmarked roads.
Outside Montréal → the VLM takes over
Géobase covers the island only. For coordinates outside it, there is no city data — we say so honestly and rely on the VLM's inference from the photo, instead of presenting neutral defaults as if they were real.
Géobase primes the VLM
For administrative facts (road class, direction, lane count) the city knows better than a single photo, so we pass them to the VLM as a strong prior — it aligns instead of guessing, cutting gratuitous divergence. The VLM may still override on a clear visual contradiction.
🔭 The 7 stages
- Metadata — GPS (EXIF / manual) + Montréal open-data lookup
- YOLO detection — potholes / cracks / manholes + visual severity
- VLM cross-validation — rejects false positives, with reasoning
- Road segmentation + exposure — drivable surface + per-pothole traffic exposure
- VLM context inference — road type, surface, winter cues (primed by Géobase)
- Severity fusion — visual × context × exposure → S1-S4
- Operational output — priority, delay, cost, crew, 311 ticket draft
⚠️ Honest limitations
- VLM bounding boxes are approximate (not pixel-precise masks).
- Road segmentation is dashcam-trained (Cityscapes): on an extreme oblique close-up the foreground curb may be over-included — it does not affect the severity score.
- VLM filtering is conservative — it can reject borderline real defects (false negatives), favouring precision over recall.
- IRI and visual condition are on different scales: IRI is segment-level roughness measured by instrument, while the VLM read is a local visual impression. They are related but not identical.
- A single photo yields relative impact (center vs edge), not metric measurements without scale calibration.
Everything here is implemented in src/ and visible live in the Analyze
tab. Prompts sent to the VLM are in the Prompts tab; models and datasets
in the Stack tab.
Prompts used (transparency)
All prompts sent to the active VLM during the pipeline (same prompts whatever
the backend — the pipeline is model-agnostic). Versioned in
src/vlm_hub/prompts/ (PROMPT_VERSION = "1.0").
Stage 3 — Detection validation
System prompt:
You are a road infrastructure inspector. You are reviewing detections from a YOLOv8 object detector running on road photos. Be strict but fair. Return ONLY valid JSON. No markdown fences. No preamble.
User prompt:
The image shows a road. A YOLOv8 detector flagged candidate road defects. For each candidate listed below, visually verify whether it really is what the detector claims. Common false positives to reject:
- Car parts, license plates, shadows labelled as potholes
- Manhole covers labelled as potholes
- Lane markings or asphalt joints labelled as cracks
- Wet spots or puddles without a real cavity
Detections to review:
{detections}
Return JSON in this EXACT format (no markdown):
{{
"validations": [
{{
"detection_id": 0,
"valid": true,
"vlm_confidence": 0.95,
"reasoning": "clear depth, asphalt cracking around indicates real pothole"
}},
{{
"detection_id": 1,
"valid": false,
"vlm_confidence": 0.85,
"reasoning": "this is a license plate, not a manhole"
}}
]
}}
Include ALL detection_ids from the input, even if you reject them.
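(End of the Stage 3 prompt.) On the receiving side, the reply has to be parsed and checked against the rule that every detection_id is present. The following is a hedged sketch of such a check, not the code in src/: `parse_validations` and `keep_valid` are hypothetical helpers, and indexing detections by list position is an assumption.

```python
import json
from typing import Dict, List


def parse_validations(raw: str, expected_ids: List[int]) -> Dict[int, dict]:
    """Parse the Stage 3 JSON reply and check every detection_id is covered."""
    data = json.loads(raw)
    by_id = {int(v["detection_id"]): v for v in data["validations"]}
    missing = sorted(set(expected_ids) - set(by_id))
    if missing:
        raise ValueError(f"VLM reply is missing detection_ids: {missing}")
    for v in by_id.values():
        if not isinstance(v["valid"], bool):
            raise ValueError("'valid' must be a JSON boolean")
        if not 0.0 <= float(v["vlm_confidence"]) <= 1.0:
            raise ValueError("'vlm_confidence' must be in [0, 1]")
    return by_id


def keep_valid(detections: List[dict], by_id: Dict[int, dict]) -> List[dict]:
    """Drop detections the VLM rejected (detections indexed by position)."""
    return [d for i, d in enumerate(detections) if by_id[i]["valid"]]
```

The doubled braces in the prompt template suggest it is filled with Python's `str.format`, where `{{` and `}}` produce literal braces and `{detections}` is the substitution slot; that reading is an inference from the text, not something the page states.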
Stage 4 — Drivable-surface segmentation (no VLM prompt)
Stage 4 does not call the VLM. We found that asking any VLM (or even a dedicated lane detector) to trace lane lines fails on close-up, unmarked 311 photos — there is simply no line to trace. Instead we run SegFormer-B0 (Cityscapes) to segment the road surface directly and derive its boundary. A road-surface region always exists; a lane line often doesn't. See the literature on road-surface segmentation vs line tracing for unmarked roads.
Stage 5 — Context inference
System prompt:
You are a road infrastructure analyst. You infer contextual variables from a single road photo. You may receive HINTS from the city's official open data — for administrative facts (road class, direction, lane count) align with them unless the photo clearly contradicts. Return ONLY valid JSON with EXACT enum values listed. No markdown fences. No preamble.
User prompt:
Analyze this road photo and infer the operational context using only visual cues (road width, marking type, surrounding buildings, signage, visible weather effects).
Return JSON in this EXACT format (use STRICTLY these enum values):
{
"road_type": "residential" | "collector" | "arterial_secondary" | "arterial" | "highway" | "pedestrian" | "private" | "business" | "unknown",
"direction": "one_way" | "two_way" | "unknown",
"num_lanes": integer 0 to 12,
"surface_condition": "smooth" | "moderate_wear" | "heavy_wear" | "very_degraded" | "unknown",
"winter_signs": "none" | "some" | "moderate" | "heavy" | "unknown",
"confidence": float 0.0 to 1.0 (your overall confidence in this inference),
"reasoning": "1-2 sentence justification citing the visual cues you noticed"
}
Quick reference for road_type:
- residential : small two-lane neighborhood streets, low buildings
- collector : 2-4 lanes connecting neighborhoods to arteries
- arterial : 4+ lanes, major boulevards, commercial/mixed strips
- highway : limited-access, no pedestrians, multiple lanes per direction
- pedestrian : closed to vehicles, paved walkways
- private : driveway, parking lot, gated road
- business : parking lot or service road in commercial area
- unknown : not enough cues
Quick reference for winter_signs:
- none : no snow/ice/salt visible
- some : minor salt residue or wet patches
- moderate : visible snow piles, slush, salt streaks
- heavy : snowpack on road, ice patches, deep slush
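(End of the Stage 5 prompt.) Because the prompt insists on exact enum values, the reply can be validated mechanically. This sketch shows one way to do that; `parse_context` is a hypothetical helper, and coercing out-of-range values to "unknown", 0 lanes or a clamped confidence is an assumed policy rather than the app's documented behaviour.

```python
import json

ROAD_TYPES = {"residential", "collector", "arterial_secondary", "arterial",
              "highway", "pedestrian", "private", "business", "unknown"}
DIRECTIONS = {"one_way", "two_way", "unknown"}
SURFACES = {"smooth", "moderate_wear", "heavy_wear", "very_degraded", "unknown"}
WINTER = {"none", "some", "moderate", "heavy", "unknown"}


def parse_context(raw: str) -> dict:
    """Parse the Stage 5 reply, coercing invalid fields to safe defaults."""
    data = json.loads(raw)

    def enum(key: str, allowed: set) -> str:
        value = data.get(key, "unknown")
        return value if isinstance(value, str) and value in allowed else "unknown"

    lanes = data.get("num_lanes", 0)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not (isinstance(lanes, int) and not isinstance(lanes, bool) and 0 <= lanes <= 12):
        lanes = 0
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    return {
        "road_type": enum("road_type", ROAD_TYPES),
        "direction": enum("direction", DIRECTIONS),
        "num_lanes": lanes,
        "surface_condition": enum("surface_condition", SURFACES),
        "winter_signs": enum("winter_signs", WINTER),
        "confidence": max(0.0, min(1.0, confidence)),
        "reasoning": str(data.get("reasoning", "")),
    }
```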
Tech stack
Models
| Component | Model | Params | Source |
|---|---|---|---|
| Detection | YOLOv8s V2b multi-class | 10M | Fine-tuned on 7 datasets (see below) |
| Reasoning (VLM) | Gemma 4 12B / Nemotron 8B (Modal GPU) · Qwen3-VL-32B (OpenRouter fallback) | ≤32B | Open weights, swappable live |
| Road surface seg. | SegFormer-B0 Cityscapes | 3.7M | NVIDIA (pre-trained) |
| Depth (severity) | MiDaS DPT-Hybrid | 123M | Intel (pre-trained) |
| Lane detection (V1 only) | YOLOPv2 | 80M | Open weights |
Per-model cap: every model is ≤32B (hackathon-legal). The VLM runs on a remote Modal GPU under your own account; the Space itself stays light.
Datasets fused (15,580 images train, 1,994 valid)
| Dataset | Source | Images | Classes |
|---|---|---|---|
| Roboflow Mtl | Roboflow Universe | 3,174 | pothole |
| IVCNZ Chitale 2020 | Open dataset | 1,243 | pothole |
| Chitholian Kaggle | Kaggle | 665 | pothole |
| Idanbaru severity | Kaggle | 717 | pothole + severity GT |
| Sabidrahman multi-class | Kaggle | 2,130 (clean) | pothole + crack + manhole |
| Lorenzoarcioni multi-class | Kaggle | 2,009 | pothole + crack + manhole |
| RDD2022 US/CZ/NO | IEEE BigData | 7,931 | longitudinal/transverse/alligator + pothole |
Performance (V2b validation, 1,994 images):
- mAP50 global: 0.67
- pothole: 0.68
- crack: 0.54
- manhole: 0.80
Severity scoring
ASTM D6433 cascade:
- astm_full: diameter + depth → strict ASTM matrix
- astm_partial_no_depth: diameter only → matrix with depth assumed medium
- heuristic + depth: no calibration → composite score (area + depth + position)
- heuristic: neither calibration nor depth → area + position
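The cascade amounts to picking the strictest method the available inputs allow. A minimal sketch, assuming that "calibration" means a known image scale (so a metric diameter can be computed) and using a hypothetical function name:

```python
def select_severity_method(has_calibration: bool, has_depth: bool) -> str:
    """Pick the scoring method, from strictest to loosest fallback."""
    if has_calibration and has_depth:
        return "astm_full"              # diameter + depth: strict ASTM matrix
    if has_calibration:
        return "astm_partial_no_depth"  # diameter only: depth assumed medium
    if has_depth:
        return "heuristic + depth"      # composite: area + depth + position
    return "heuristic"                  # area + position only
```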
Then combine() (pure function) merges visual severity with contextual variables:
- road_class (Mtl Géobase 0-9)
- historical_iri (Mtl auscultation 2022)
- freeze_thaw_7d (estimated by the VLM from the photo's visual winter cues, i.e. snow/salt/ice, not from a live weather API, so it stays valid for old photos)
- is_raav (Mtl Réseau Artériel Administratif)
Repository
- Code: huggingface.co/spaces/build-small-hackathon/montreal-road-inspector — see the Files tab
- Models: bundled in this Space via Git LFS
- License: MIT
Built for
HF Build Small Hackathon 2026 — Track 1 "Backyard AI"
Independent project, not affiliated with the Ville de Montréal.