Last updated: 2026-04-29 (structured miss classification added)
Every valuations row and every valuation_comparables row MUST have a stored subject_corelogic_id / corelogic_id BEFORE it can be used in any cohort. The pipeline's RPP address preflight is unreliable for strata, letter-prefix, slash-unit, Shed N/, and Unit X formats, and will fail ~75% of realistic addresses if we force it to resolve from the raw string. We have the Cotality Address Matcher for exactly this reason: use it at ingest, store the result, and never ask RPP to re-resolve at pipeline time.
Call GET /search/au/matcher/address?q=<address> via corelogic-client.js#matchAddress(). Recommended query format per Cotality docs (Address Match-corelogic.txt):
[unit] / [streetNumber] [streetName] [streetType] [suburb] [stateCode] [postcode]
e.g. 1A/10 Smith St Smithville QLD 4000.
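The recommended query format above can be sketched as a small builder. This is a minimal illustration, not the code in corelogic-client.js: the helper names `formatMatcherQuery` and `matcherPath` and the field names on `parts` are hypothetical, and whether `matchAddress()` already URL-encodes its input is not stated here.

```javascript
// Hypothetical helper: joins address parts in the recommended order
// [unit] / [streetNumber] [streetName] [streetType] [suburb] [stateCode] [postcode].
function formatMatcherQuery(parts) {
  const present = (v) => v !== undefined && v !== null && String(v).trim() !== '';
  const { unit, streetNumber, streetName, streetType, suburb, stateCode, postcode } = parts;
  const rest = [streetNumber, streetName, streetType, suburb, stateCode, postcode]
    .filter(present)
    .map((v) => String(v).trim())
    .join(' ');
  // Prefix the unit with a slash only when a unit is present.
  return present(unit) ? `${String(unit).trim()}/${rest}` : rest;
}

// Hypothetical helper: URL-encodes the query for the ?q= parameter.
function matcherPath(parts) {
  return `/search/au/matcher/address?q=${encodeURIComponent(formatMatcherQuery(parts))}`;
}
```

With the example address from the docs, `formatMatcherQuery({ unit: '1A', streetNumber: '10', streetName: 'Smith', streetType: 'St', suburb: 'Smithville', stateCode: 'QLD', postcode: '4000' })` yields `1A/10 Smith St Smithville QLD 4000`.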
Response shape:
{ "matchDetails": { "matchType": "E", "matchRule": "002", "propertyId": 1070537, "updateIndicator": "O", "updateDetail": "00000000" } }
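Given this response shape, the store-or-review rule set out in the match-type list that follows (store on E, P, or A with a propertyId; everything else goes to manual review) can be sketched as a single decision function. The function name `decideIngest` and the shape of its return value are assumptions for illustration only.

```javascript
// Match types whose propertyId may be stored automatically.
const STORE_TYPES = new Set(['E', 'P', 'A']);

// Hypothetical helper: decides what ingest should do with one matcher response.
function decideIngest(response) {
  const details = (response && response.matchDetails) || {};
  const { matchType, propertyId } = details;
  const hasId = propertyId !== undefined && propertyId !== null;

  if (STORE_TYPES.has(matchType) && hasId) {
    // Alias matches are stored but flagged so they can be logged for audit.
    return { action: 'store', corelogicId: propertyId, audit: matchType === 'A' };
  }
  // Multi-match, No-match, Fuzzy, Duplicate, Postal, or a missing propertyId.
  return { action: 'review', corelogicId: null, reason: matchType || 'missing matchType' };
}
```

Treating every unlisted match type as `review` keeps the default conservative: a new or unexpected code never writes an ID without a human looking at it.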
Write the propertyId to the corelogic_id / subject_corelogic_id column when:
matchType = E (Exact): best.
matchType = P (Partial): acceptable. Cotality NEVER returns E for altered addresses; P is normal on cleansed input.
matchType = A (Alias): accept, log for audit.
Reject / send to manual-curate queue:
M064 (Multi-match, ambiguous): no single propertyId returned. /search/au/suggestions can help pick one.
N (No-match).
F (Fuzzy): treat as manual review (don't auto-store).
D (Duplicate, rule 255): no propertyId returned by design.
X (Postal record).
scripts/corelogic-subjects-backfill.js (CLB-2388).
scripts/corelogic-comps-backfill.js.
set -a && source /home/jonbo/.claude/keys.env && set +a (CORELOGIC_CLIENT_ID / CORELOGIC_CLIENT_SECRET; per CLB-2389 the client now accepts CORELOGIC_*, preferred over legacy RPDATA_*).
Before any new valuation row enters valuations:
Call matchAddress(property_address_full).
If matchType ∈ {E, P, A} with non-null propertyId → insert with subject_corelogic_id = propertyId.
Otherwise → ingest_review_queue (proposed table). Never insert with null subject_corelogic_id into a cohort; it will fail preflight at pipeline time.
C:\Users\jonbo\OneDrive\Desktop\Address Match-corelogic.txt (canonical match-code table).
reports/subject-corelogic-id-backfill-2026-04-24.md (48.3% → 82.2% coverage; 511 M064 multi-match require manual curation).
Verify Edge CDP is reachable. RPP auth tier 3 drives Jon's real logged-in Edge on Windows via CDP on port 9226. It is NOT headless and has no fallback. From WSL:
curl -s --max-time 3 http://172.17.0.1:9226/json/version
If this times out, the pipeline will fail at stage 1 preflight (ETIMEDOUT) across all three auth tiers. Fix first: ensure Edge is running on Windows with --remote-debugging-port=9226 --remote-allow-origins=* --user-data-dir=C:\Users\jonbo\EdgeCDP and is logged into rpp.corelogic.com.au. See ~/.claude/rules/env-vars.md → "RPP preflight requires live Edge CDP".
Verify cohort membership — address must be in stratified_cohort_20260407.
SELECT * FROM stratified_cohort_20260407 WHERE address_full ILIKE '%<address>%';
Run preflight validation — confirm address resolves, ground truth exists, and RPP has candidates.
node scripts/preflight-scan.js --address "<address>"
Confirm search parameters — check search_parameters_master for correct params matching property type and density zone. Do not override defaults without a documented reason.
Check API budget — do not trigger batch runs without Jon's approval. Single-address runs for investigation are fine.
Use scripts/run-pilot.sh as the only canonical 50-subject pilot entrypoint. It preflights required env, fresh RPP cookies, DB connectivity, and a clean repo before dispatching scripts/phase3-pilot-50-2026-04-29.js, then writes config.json, pilot.log, and status.txt under runs/YYYY-MM-DD/<run-name>/.
Use scripts/analyze-pilot.sh runs/YYYY-MM-DD/<run-name> as the matching read-only analyser. It reads the run window from config.json, queries strict pool coverage from the DB, and writes analysis.md plus raw-report.json in the same run directory. The schema-compliant report.json is then produced by scripts/make-report.sh (CLB-2478) consuming raw-report.json. Use --dry-run on either wrapper only for tooling checks; do not use dry-run output as measurement.
POST /api/pipeline/start with clientName and purpose set.
Wait for status = 'review' or status = 'failed'; do not read intermediate stages.
Set skipEnrich: true for Pillar 1 evaluation. Do not enrich unless explicitly testing enrichment.
Check stage_3_raw_candidates.sourceStats. If any entry with source = 'rpp' has ok = false (e.g. RPP_AUTO_LOGIN_FAILED, cooldown, 4xx/5xx), the run is INVALID for Pillar 1 measurement. The pipeline may have fallen back to vector_text / pgvector and reached status = review with geographically irrelevant candidates; that is a fallback artefact, not a recall measurement. Do not cite the headline number; rerun once RPP auth is restored. (NORTH_STAR Rule 24.)
The live sidecar (port 3211) runs from a deployed copy at /opt/palermo/apps/truemarket/, NOT from any git working tree. The deployed copy can lag master by hours or days, and a sidecar started 22 hours ago will not pick up commits merged today. Before treating any pilot result as a measurement of code on master, verify:
# Confirm the deployed copy contains the change you expect
grep -c "<unique-string-from-your-fix>" /opt/palermo/apps/truemarket/<file>
# OR compare the deployed file against master HEAD
diff /opt/palermo/apps/truemarket/<file> <(git show origin/master:<file>)
If the deployed copy is stale, redeploy + restart before the run. A pilot against stale code is worse than no pilot — it produces a measurement that looks authoritative but tests a code state that is not what's on master.
# 1. From /home/jon/work/projects/truemarket on master:
git pull --ff-only origin master
# 2. Sync working tree → deployed copy (preserve runtime state files)
rsync -a --delete \
--exclude='.git' \
--exclude='node_modules' \
--exclude='*.log' \
--exclude='dashboard/data/' \
/home/jon/work/projects/truemarket/ \
/opt/palermo/apps/truemarket/
# 3. Restart the sidecar
# The process at PID <X> runs `node dashboard/server.js` from /opt/palermo/apps/truemarket
# Send SIGTERM, wait, restart from the same cwd
kill <pid-of-dashboard-server>
( cd /opt/palermo/apps/truemarket && \
set -a; source /etc/palermo/truemarket.env; set +a; \
nohup node dashboard/server.js > /var/log/truemarket-sidecar.log 2>&1 & disown )
# 4. Verify it's listening on 3211 AND on the expected commit
ss -tlnp | grep 3211
grep -c "<unique-string-from-your-fix>" /opt/palermo/apps/truemarket/<file>
If a pm2/systemd unit manages it, prefer pm2 restart <name> or systemctl restart <unit> — but still re-verify the file content; an unwatched deployment can drift.
Default (any box): cookie jar at ~/.rpp-session.json, written by scripts/rpp-login-once.js.
node scripts/rpp-login-once.js uses Playwright + creds from ~/.claude/keys.env to write the jar.
rpp-direct-client.getAuth(). Override path with RPP_COOKIE_JAR.
Other paths (opt-in via RPP_BROWSER):
chrome (default fallback): reads cookies from local Chrome's libsecret keyring via scripts/chrome-rpp-cookies.py.
edge: legacy Dell/WSL CDP path at ${CDP_HOST}:${CDP_PORT} (Edge with --remote-debugging-port).
Sidecar inherits env from the launching shell, so source /etc/palermo/truemarket.env before restart.
Verify API call logging — confirm a row exists in api_call_log with the correct pipeline_run_id and search_run_id.
SELECT id, provider, endpoint, created_at
FROM api_call_log
WHERE pipeline_run_id = '<run_id>'
ORDER BY created_at DESC LIMIT 10;
Check recall metrics — open stage_4_scored_candidates.recallMetrics. Key fields:
groundTruthCount: how many GT comps the valuer used
hits array length: how many we found
recall.R@10: primary metric for Pillar 1
If partial recall, inspect stage_4_scored_candidates.compLearnings for per-comp miss reasons.
Review dashboard — /pipeline/<runId> shows the Learnings section with each missed comp's verdict.
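Alongside these post-run checks, the run-validity rule (any stage_3_raw_candidates.sourceStats entry with source = 'rpp' and ok = false invalidates the run for Pillar 1) and the recall fields can be read together. The sketch below assumes a run row already loaded as a plain object, that sourceStats is an array of `{ source, ok }` entries, and that recallMetrics carries `groundTruthCount`, `hits`, and `recall`; those shapes and the name `assessRun` are assumptions, not confirmed schema.

```javascript
// Hypothetical helper: decides whether a finished run may be cited, then
// summarises its recall fields. Input shapes are assumed, see lead-in.
function assessRun(run) {
  // Only a run that reached 'review' carries a recall measurement.
  if (run.status !== 'review') {
    return { measurable: false, reason: `status is ${run.status}` };
  }
  const stats = (run.stage_3_raw_candidates || {}).sourceStats || [];
  // A failed rpp source means the pool may be a fallback artefact.
  const rppFailed = stats.some((s) => s.source === 'rpp' && s.ok === false);
  if (rppFailed) {
    return { measurable: false, reason: 'rpp source reported ok = false' };
  }
  const metrics = (run.stage_4_scored_candidates || {}).recallMetrics || {};
  const found = Array.isArray(metrics.hits) ? metrics.hits.length : 0;
  return {
    measurable: true,
    groundTruthCount: metrics.groundTruthCount,
    found,
    missed: metrics.groundTruthCount - found,
    recallAt10: metrics.recall ? metrics.recall['R@10'] : undefined,
  };
}
```

Checking validity before reading any metric means a fallback run can never leak a headline number into a report by accident.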
For each missed comp in recallMetrics.details.missedAddresses:
Verify in RPP by direct CoreLogic/property-ID lookup FIRST. Before any
address search or radius probing, call the property-ID endpoint using the
ground-truth corelogic_id (for example
/api/properties/<gt_corelogic_id>/commons). Record whether RPP returns the
property and confirm propertyType, lat/lon, isUnit, and isBodyCorporate. If
the ID lookup succeeds but stage_3 does not contain the ID, the miss is in
our pipeline, NOT a data gap.
If direct ID lookup is unresolved, try exact address/suggestions SECOND.
Use the exact GT address with /api/clapi/suggestions?q=<gt_address> and
follow only exact or clearly canonical suggestions to /api/properties/<id>/commons.
Record the suggested property IDs and why the selected suggestion is or is
not the GT comp.
If still unresolved, do tiny bounded radius probing ONLY. Probe a small,
documented radius around the GT lat/lon or subject-derived point only after
direct ID and exact-address/suggestions have failed. Keep the probe tiny and
bounded: record the radius, centre, filters, and hard call cap in rpp_calls;
do not broaden into exploratory suburb/postcode sweeps. This step is for
resolving address/index ambiguity, not for improving recall by fishing.
Check CoreLogic ID in raw candidates (stage_3) — is the property in the pool at all?
SELECT cand->>'corelogic_id', cand->>'address_full'
FROM pipeline_runs,
LATERAL jsonb_array_elements(stage_3_raw_candidates->'rawCandidates') AS cand
WHERE id = '<run_id>'
AND (
cand->>'corelogic_id' = '<gt_corelogic_id>'
OR lower(cand->>'address_full') LIKE '%<street>%'
);
Check sale_date vs search window — compare comp's sale_date against stage_2_strategy.params.date_from / date_to. Most misses on recent runs are sales 21–64 days after the window closes.
Check distance_km vs search radius — compare distance_from_subject_km in valuation_comparables against stage_2_strategy.params.radius_km.
Check property_type vs searched types — compare valuation_comparables.property_type against stage_2_strategy.params.propertyTypes. The most common miss pattern: RPP returns Detached Residence or Residential but search only includes [HOUSE, UNIT].
Classify pipeline vs RPP data gap. If RPP has the property by direct ID, exact-address/suggestions, or the tiny bounded probe, but our pool does not, classify the miss as pipeline-side. Common causes: pagination ceiling truncated the tail, type-code didn't translate, density misclassified, radius too small.
If all checks pass and RPP does not have the property -> RPP data gap. Log as such. No code fix available.
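The checks above can be condensed into a rough triage function that maps collected evidence to a single failure mode. This is a sketch under stated assumptions: the input field names (`rppFound`, `inRawPool`, `poolTruncated`) are hypothetical, "in the raw pool but not in hits" is assumed to mean a downstream filter dropped it, and the real scripts/classify-missed-comps.js may apply different rules.

```javascript
// Treat null/undefined as "unknown" rather than zero.
const num = (v) => (v === undefined || v === null ? NaN : Number(v));

// Hypothetical triage: returns one likely failure mode for a missed comp.
function classifyMiss(evidence) {
  const { rppFound, inRawPool, poolTruncated, comp = {}, params = {} } = evidence;

  // RPP could not produce the property by any lookup mode: data gap.
  if (!rppFound) return 'rpp_data_gap';
  // Present in stage_3 but absent from hits: assumed lost to a later filter.
  if (inRawPool) return 'pipeline_filter';

  // Sale date outside the searched date_from / date_to window.
  const sale = Date.parse(comp.sale_date);
  const from = Date.parse(params.date_from);
  const to = Date.parse(params.date_to);
  const beforeWindow = !Number.isNaN(from) && sale < from;
  const afterWindow = !Number.isNaN(to) && sale > to;
  if (!Number.isNaN(sale) && (beforeWindow || afterWindow)) return 'date_window';

  // Comp further from the subject than the searched radius.
  const distance = num(comp.distance_from_subject_km);
  const radius = num(params.radius_km);
  if (Number.isFinite(distance) && Number.isFinite(radius) && distance > radius) return 'radius';

  // Comp property type not among the searched types (case-insensitive).
  const searched = (params.propertyTypes || []).map((t) => String(t).toUpperCase());
  const type = comp.property_type ? String(comp.property_type).toUpperCase() : null;
  if (type && searched.length > 0 && !searched.includes(type)) return 'property_type';

  // Nothing explains the miss unless the pool is known to be truncated.
  return poolTruncated ? 'pagination' : 'insufficient_evidence';
}
```

The order of the checks is a design choice: it reports the first mismatch found, so a comp that is both outside the date window and outside the radius is labelled `date_window`.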
Every run with pool_coverage < 100% MUST write per-comp classifications to the
pipeline_run_miss_classifications table BEFORE the headline number is cited.
Free-text in stage_4.compLearnings is informational; the structured table is
authoritative.
Run via:
node scripts/classify-missed-comps.js --run-id <pipeline_run_id>
Use --dry-run to preview without writing to the DB (also works before the migration is applied).
The classifier walks each missed GT comp through the Investigation SOP above,
pulls evidence (raw-pool presence, RPP direct/suggestions/probe responses,
strategy params, rank if found), and writes a row with a miss_class enum
value. Requires the
pipeline_run_miss_classifications migration to be applied first
(see migrations/proposed-pipeline-run-miss-classifications.sql).
Each missed comp MUST also have a machine-readable evidence object attached to the structured classification row or adjacent report artifact. Free-text notes may summarize it, but they do not replace it. Required fields:
{
"subject_label": "<subject label used in the run/report>",
"valuation_id": "<valuation_id>",
"run_id": "<pipeline_run_id>",
"gt_corelogic_id": "<ground-truth comparable CoreLogic/property ID>",
"gt_address": "<ground-truth comparable address>",
"in_raw_pool": true,
"rpp_direct_found": true,
"found_by": "direct_id|exact_address|suggestion|bounded_radius|raw_pool|not_found",
"likely_failure_mode": "pipeline_filter|date_window|radius|property_type|pagination|rpp_data_gap|insufficient_evidence",
"evidence": {},
"rpp_calls": [],
"timestamp": "2026-05-05T00:00:00.000Z"
}
rpp_calls must list every RPP request made during the Investigation SOP in
order, including endpoint, normalized params, HTTP status/result summary, and
the reason for the call. The first RPP call must be the direct
CoreLogic/property-ID lookup; exact-address/suggestions may follow; bounded
radius probing may appear only after both earlier lookup modes are unresolved.
Use an ISO-8601 UTC timestamp.
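A validator for this evidence object might look like the sketch below. It checks the required fields, the two enumerated values, the timestamp format, and the call-ordering rule. One loud assumption: each rpp_calls entry is given a hypothetical `mode` field (`direct_id`, `suggestions`, or `bounded_radius`) to identify the lookup type, which the required-field list above does not define.

```javascript
const REQUIRED = ['subject_label', 'valuation_id', 'run_id', 'gt_corelogic_id', 'gt_address',
  'in_raw_pool', 'rpp_direct_found', 'found_by', 'likely_failure_mode', 'evidence',
  'rpp_calls', 'timestamp'];
const FOUND_BY = ['direct_id', 'exact_address', 'suggestion', 'bounded_radius', 'raw_pool', 'not_found'];
const FAILURE_MODES = ['pipeline_filter', 'date_window', 'radius', 'property_type', 'pagination',
  'rpp_data_gap', 'insufficient_evidence'];
// ISO-8601 with an explicit Z (UTC) suffix, e.g. 2026-05-05T00:00:00.000Z.
const ISO_UTC = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$/;

// Hypothetical validator: returns a list of problems (empty when valid).
function validateEvidence(obj) {
  const errors = [];
  for (const key of REQUIRED) {
    if (!(key in obj)) errors.push(`missing field: ${key}`);
  }
  if ('found_by' in obj && !FOUND_BY.includes(obj.found_by)) errors.push('invalid found_by');
  if ('likely_failure_mode' in obj && !FAILURE_MODES.includes(obj.likely_failure_mode)) {
    errors.push('invalid likely_failure_mode');
  }
  const ts = obj.timestamp;
  const tsOk = typeof ts === 'string' && ISO_UTC.test(ts) && !Number.isNaN(Date.parse(ts));
  if ('timestamp' in obj && !tsOk) errors.push('timestamp must be ISO-8601 UTC');

  // Ordering: direct ID first; radius probe only after both earlier modes.
  const calls = Array.isArray(obj.rpp_calls) ? obj.rpp_calls : [];
  if (calls.length > 0 && calls[0].mode !== 'direct_id') {
    errors.push('first RPP call must be the direct ID lookup');
  }
  const seen = new Set();
  for (const call of calls) {
    if (call.mode === 'bounded_radius' && !(seen.has('direct_id') && seen.has('suggestions'))) {
      errors.push('bounded radius probe before direct ID and suggestions lookups');
    }
    seen.add(call.mode);
  }
  return errors;
}
```

Returning a list of problems rather than throwing on the first one lets a report show every defect in an evidence object at once.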
Stale: pre-2026-04-26 percentages. The figures below predate CLB-2403 (cohort/walker valuation_id keying) and CLB-2404 #1. The 2026-04-26 Opus diagnostic + 2026-04-27 V2-overlap pre-check supersede these. Current breakdown: see docs/CURRENT_PILLAR.md "39% missing-comp gap" table + the 2026-04-27 morning section. Verdict as of 2026-04-27: V2 filter is NOT the leak point; recall loss is upstream (sidecar.log shows 0 GT-overlap with V2 rejects).
RPP property types (Detached Residence, Residential, Duplex) don't map to pipeline search types (HOUSE, UNIT). Fix: expand type mapping or search with broader type filter.
Extend date_to by 90 days or use assessment_date + buffer.
110-114 Collins Avenue Edge Hill (CoreLogic ID 15348655) returned 0 candidates repeatedly across 4 runs: confirmed RPP gap.
The same property can appear more than once in valuation_comparables with different sale dates (e.g. confirmed + unconditional). Recall check counts each row separately; a "miss" on the second entry is usually the earlier/later sale of the same property.
fetchCompDetails previously failed to join because valuations.property_address_full uses commas (e.g. 15 Fairway Avenue, Rocky Point QLD 4874) while stage_1_subject.addressFull is uppercase without commas. Fixed 2026-04-14: now strips commas before comparison.
If valuation_comparables has no entry for a missed comp, the investigator falls back to "Insufficient data". This means the comp address is not in the valuer's report DB, which is possible for commercial properties or addresses with unusual formatting.
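The comma mismatch behind the fetchCompDetails join failure can be illustrated with a small normaliser. This is a sketch of the idea, not the actual fix: the source only states that commas are now stripped before comparison, so the uppercasing and whitespace collapsing here, and the names `normalizeAddress` and `sameAddress`, are assumptions.

```javascript
// Hypothetical normaliser: removes commas, collapses whitespace, uppercases.
function normalizeAddress(address) {
  return String(address)
    .replace(/,/g, ' ')   // commas become spaces so tokens never fuse
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim()
    .toUpperCase();
}

// Hypothetical comparison of two address strings after normalisation.
function sameAddress(a, b) {
  return normalizeAddress(a) === normalizeAddress(b);
}
```

For example, `normalizeAddress('15 Fairway Avenue, Rocky Point QLD 4874')` gives `15 FAIRWAY AVENUE ROCKY POINT QLD 4874`, which matches an uppercase, comma-free subject address.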