Validation methodology
How we measure that RugGuard's risk scores actually correlate with rug outcomes. Live numbers at /v1/metrics.
Two complementary datasets
We don't trust any single dataset to validate the engine — each has structural biases we'd otherwise read as signal.
Dataset A — rug_census (auto-detected confirmed rugs)
What it is. Every Aerodrome v2 token paired against WETH or USDC whose quote balance is below 0.01 WETH (or 100 USDC) and whose pool is at least 7 days old, scanned through the heuristic engine and persisted with source='rug_census', status='rugged'.
Pipeline. A weekly systemd timer runs the bulk discovery and rug-check; the detection logic checks pool drainage from on-chain RPC reads.
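The drainage check at the core of that pipeline is compact. A minimal sketch, assuming web3.py and a discovery step that already yields the pool address, quote token, and creation time; is_drained and the threshold constants are illustrative names, not RugGuard internals:

```python
# Sketch of the census drainage check. Thresholds mirror the census
# definition: quote balance < 0.01 WETH or < 100 USDC, pool >= 7 days old.
import time
from web3 import Web3

ERC20_ABI = [{
    "name": "balanceOf", "type": "function", "stateMutability": "view",
    "inputs": [{"name": "owner", "type": "address"}],
    "outputs": [{"name": "", "type": "uint256"}],
}]

WETH_THRESHOLD = Web3.to_wei("0.01", "ether")  # 0.01 WETH, 18 decimals
USDC_THRESHOLD = 100 * 10 ** 6                 # 100 USDC, 6 decimals
MIN_POOL_AGE_S = 7 * 24 * 3600                 # census requires a 7-day-old pool

def is_drained(w3: Web3, pool: str, quote_token: str, is_usdc: bool, created_at: int) -> bool:
    """True when the quote side of a sufficiently old pool is below threshold."""
    if time.time() - created_at < MIN_POOL_AGE_S:
        return False  # too young for the census; re-checked next week
    token = w3.eth.contract(address=Web3.to_checksum_address(quote_token), abi=ERC20_ABI)
    balance = token.functions.balanceOf(Web3.to_checksum_address(pool)).call()
    return balance < (USDC_THRESHOLD if is_usdc else WETH_THRESHOLD)
```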
What it measures. Per-heuristic recall — for each flag, the fraction of known rugs on which it FAILed (decided rug-positive) out of all known rugs on which it decided at all. SKIPped results are excluded from both numerator and denominator.
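In code, that definition is just the decided ratio. A minimal sketch (recall_decided is a hypothetical helper; the counts in the assertion come from the Base snapshot further down):

```python
# Recall over decided samples only: FAIL / (FAIL + PASS); SKIPs are dropped
# from both numerator and denominator.
def recall_decided(fail: int, passed: int) -> float | None:
    decided = fail + passed
    return fail / decided if decided else None  # None: the flag never decided

# Matches the Base snapshot below: TOP10_CONCENTRATION_HIGH, 17 FAIL / 1 PASS.
assert round(recall_decided(fail=17, passed=1), 3) == 0.944
```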
Structural bias — read this before quoting numbers. Samples in this dataset are scanned after the rug already happened. By that point:
- Ownership has typically been renounced (the rug-puller covering tracks) → OWNER_NOT_RENOUNCED and OWNER_CAN_PAUSE PASS.
- The contract is dead, so GoPlus's honeypot simulator returns no data → HONEYPOT_DETECTED, HONEYPOT_ASYMMETRIC, and HONEYPOT_TAX_HIGH SKIP.
- Mint authority may have been revoked as part of the cover-up → MINT_AUTHORITY_ACTIVE PASSes more often than it would in real-time scanning.
Therefore: rug-census recall underestimates the heuristics' pre-trade power. Use it to confirm an engine is functional and that LP / holder / liquidity heuristics fire on real on-chain data — but never quote it as "recall an agent should expect at trade time."
Dataset B — forward_sampler + T+7/14/30 follow-up
What it is. Every newly-paired token on Aerodrome v2, scanned at t ≈ 0 (within hours of pool creation) and persisted with source='forward_sampler', status='pending'. Then a daily classifier runs the same drainage check at three checkpoints (T+7, T+14, T+30) and advances the status (the transition logic is sketched after the list):
- Drainage detected at T+7 / T+14 → status='rugged' (terminal).
- Drainage detected at T+30 → status='rugged' (terminal).
- No drainage at T+30 → status='survived' (terminal).
- No drainage at T+7 / T+14 → still pending, re-checked at the next checkpoint.
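A minimal sketch of those transitions, assuming each sample row carries its status and the pool-creation timestamp, and that the same drainage check as the census supplies the drained flag:

```python
# One checkpoint evaluation of a forward sample; terminal states never move.
import time

CHECKPOINTS = (7, 14, 30)  # days after pool creation

def advance(status: str, created_at: int, drained: bool) -> str:
    if status != "pending":
        return status                      # 'rugged' / 'survived' are terminal
    age_days = (time.time() - created_at) / 86400
    if age_days < CHECKPOINTS[0]:
        return "pending"                   # first checkpoint not reached yet
    if drained:
        return "rugged"                    # terminal at any checkpoint
    if age_days >= CHECKPOINTS[-1]:
        return "survived"                  # clean through T+30
    return "pending"                       # clean at T+7/T+14, re-check later
```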
What it measures. Precision per verdict band, the metric that actually matters for agent decision-making:
P(token rugged within 30d | RugGuard returned verdict X at t=0)
Plus the base rate of rug-vs-survived in our forward-sampled population, which lets us compare RugGuard against random.
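A sketch of that computation over terminal samples, assuming (verdict-at-t0, terminal-status) pairs; the function name is illustrative:

```python
# Precision per verdict band: P(rugged within 30d | verdict X at t=0).
from collections import Counter

def precision_by_verdict(samples: list[tuple[str, str]]) -> dict[str, float]:
    rugged: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for verdict, status in samples:
        if status == "pending":
            continue                       # not yet terminal, no outcome label
        total[verdict] += 1
        if status == "rugged":
            rugged[verdict] += 1
    return {v: rugged[v] / total[v] for v in total}
```

The pooled ratio over all verdicts is the base rate; a verdict band only carries information if its precision differs from that base rate.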
Why this is the trustworthy dataset. Scans happen at t=0, on the actual state an agent would see at trade time — no post-rug artifacts. The 30-day horizon isn't arbitrary: Aerodrome rugs typically materialize within a week, but a 30-day window also catches the slow-rolling exit scams.
Limitations.
- Sample size grows linearly with calendar time. Phase 0 launched mid-Apr 2026; the first samples reach T+30 in mid-May 2026. Until then, no precision-per-verdict number is statistically meaningful.
- Coverage is currently Aerodrome v2 only. Tokens that launched on a different DEX or were pre-paired off-chain are invisible to the discovery pipeline.
- "Survived" doesn't mean "good buy" — it just means LP wasn't drained within 30 days. Slow bleeds, soft rugs, and high-tax exits aren't captured.
Score-to-band calibration
The score is the sum of the weights of all FAILing heuristics, normalized to 0–100 against the total weight of the heuristics that decided (SKIPs excluded). The verdict is then derived from the score:
| Score range | Band | Verdict |
|---|---|---|
| 0–25 | low | safe |
| 26–50 | medium-low | low_risk |
| 51–70 | medium | medium_risk |
| 71–90 | medium-high | high_risk |
| 91–100 | high | critical |
Confidence override. When score_confidence is low or insufficient_data (too few heuristics decided to trust the numerator), the verdict is forced to uncertain regardless of the numeric score. This protects agents from a "score=100, but only 1/12 heuristics decided" trap — the verdict would otherwise read critical when really the engine just couldn't gather data.
Treat uncertain as "do not trade", not as "safe by default" — the absence of evidence of risk is not evidence of absence.
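Putting the normalization, the band table, and the override together, a minimal sketch; the minimum-decided threshold of 4 is an assumption for illustration, not the engine's actual cut-off:

```python
# Score = 100 * (weight of FAILing heuristics) / (weight of decided heuristics),
# then banded per the table above; too few deciders forces 'uncertain'.
def derive_verdict(results: list[tuple[str, float]], min_decided: int = 4) -> tuple[int | None, str]:
    decided = [(r, w) for r, w in results if r != "SKIP"]
    if len(decided) < min_decided:
        return None, "uncertain"           # confidence override wins over the score
    total = sum(w for _, w in decided)
    failing = sum(w for r, w in decided if r == "FAIL")
    score = round(100 * failing / total)
    for upper, verdict in ((25, "safe"), (50, "low_risk"),
                           (70, "medium_risk"), (90, "high_risk")):
        if score <= upper:
            return score, verdict
    return score, "critical"               # 91-100
```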
Live results
Always pull from /v1/metrics for current numbers — they update at request time from the latest sample data.
curl -s https://rugguard.redfleet.fr/v1/metrics | jq
The endpoint returns:
- sample_counts_by_source_status: how many samples we have, by source and status
- rug_census.heuristic_recall: per-flag recall on the rug census (with the post-rug bias caveat above)
- rug_census.verdict_distribution: what verdict band the engine assigned to confirmed rugs
- forward_sampler.status_breakdown: how many forward samples are pending vs rugged vs survived
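The same numbers from Python instead of curl | jq; field access assumes the nesting implied by the dotted names above:

```python
# Fetch the live metrics and print two of the documented sections.
import requests

metrics = requests.get("https://rugguard.redfleet.fr/v1/metrics", timeout=10).json()
print(metrics["sample_counts_by_source_status"])
print(metrics["forward_sampler"]["status_breakdown"])
```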
Base snapshot
Rug census, n=19, 14-day Aerodrome lookback:
| Heuristic | Fail | Pass | Skip | Recall (decided) |
|---|---|---|---|---|
| TOP10_CONCENTRATION_HIGH | 17 | 1 | 1 | 94.4% |
| LP_INSUFFICIENT_LIQUIDITY | 5 | 1 | 13 | 83.3% |
| LP_NOT_LOCKED | 5 | 0 | 14 | 100% |
| OWNER_NOT_RENOUNCED | 4 | 0 | 15 | 100% |
| SOURCE_NOT_VERIFIED | 4 | 14 | 1 | 22.2% |
| MINT_AUTHORITY_ACTIVE | 1 | 12 | 6 | 7.7% |
| HONEYPOT_* / OWNER_CAN_PAUSE / HIDDEN_OWNER | 0 | 0–13 | 6–18 | 0% (post-rug) |
Solana snapshot
Rug census, n=20, drained Raydium V4 pools (TVL < $50, recent 7d volume > 0):
| Heuristic | Fail | Pass | Skip | Recall (decided) |
|---|---|---|---|---|
| SOL_TOP10_CONCENTRATION_HIGH | 13 | 1 | 6 | 92.9% |
| SOL_MINT_AUTHORITY_ACTIVE | 1 | 15 | 4 | 6.2% |
| SOL_FREEZE_AUTHORITY_ACTIVE | 0–low | most | few | ~0% (post-rug) |
| SOL_DEPLOYER_RUG_HISTORY | — | — | most | n/a (insufficient_history during DB warm-up) |
| SOL_LP_NOT_LOCKED | varies | — | — | per-pool |
The cross-chain pattern is consistent: TOP10_CONCENTRATION_HIGH is the workhorse signal at ~93–94% recall on confirmed rugs on both Base and Solana. Authority-based heuristics (MINT_*, FREEZE_*) underperform on the rug census for the same post-rug-state reason — ruggers commonly renounce authorities right after the drain to cover their tracks.
Honest disclosure: signal concentration
Today, ~94% of our recall on confirmed rugs comes from a single heuristic: TOP10_CONCENTRATION_HIGH. The remaining heuristics all show recall below 23% on the rug census — partly a real effect (some signals genuinely matter less in isolation) and partly an artifact (the post-rug-state biases listed above). This is a fact, not a defect — concentration is, by construction, a leading indicator on memecoin launches. But the implication is that a competitor could replicate ~90% of our current value with a single RPC call to balanceOf the top holders.
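For concreteness, a sketch of that single-signal replica, assuming candidate holder addresses come from an indexer (the chain alone does not enumerate ERC-20 holders) and an illustrative ABI subset:

```python
# Top-10 holder concentration: share of total supply held by the ten
# largest candidate holders. Everything here is public on-chain state.
from web3 import Web3

ERC20_ABI = [
    {"name": "balanceOf", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "owner", "type": "address"}],
     "outputs": [{"name": "", "type": "uint256"}]},
    {"name": "totalSupply", "type": "function", "stateMutability": "view",
     "inputs": [], "outputs": [{"name": "", "type": "uint256"}]},
]

def top10_concentration(w3: Web3, token_addr: str, holders: list[str]) -> float:
    token = w3.eth.contract(address=Web3.to_checksum_address(token_addr), abi=ERC20_ABI)
    supply = token.functions.totalSupply().call()
    held = sum(token.functions.balanceOf(Web3.to_checksum_address(h)).call()
               for h in holders[:10])
    return held / supply if supply else 0.0
```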
The structural moat we are deliberately building (in priority order):
- Forward-sampler ground truth — precision-per-verdict numbers nobody else publishes, computed from our own T+30 outcome labels (first datapoint mid-May 2026).
- BYTECODE_SIMILAR_TO_RUG — MinHash on 4-byte shingles vs every contract we have ever labeled rugged. Recall here grows with our census, not with what the chain exposes; a competitor cannot reproduce it without our corpus (sketched after this list).
- DEPLOYER_RUG_HISTORY — same logic: queryable only because we accumulate scans + labels over time.
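A minimal sketch of the MinHash core named in the first bullet, assuming raw runtime bytecode; the 128-slot signature and keyed-blake2b construction are illustrative choices, not the production parameters:

```python
# MinHash over 4-byte shingles of contract bytecode; signature equality
# rate estimates Jaccard similarity between two contracts' shingle sets.
import hashlib

NUM_HASHES = 128  # illustrative signature length

def shingles(bytecode: bytes) -> set[bytes]:
    """All overlapping 4-byte windows of the runtime bytecode."""
    return {bytecode[i:i + 4] for i in range(max(len(bytecode) - 3, 0))}

def minhash(bytecode: bytes) -> list[int]:
    """Each slot keeps the smallest keyed 64-bit hash seen over all shingles."""
    sig = [2 ** 64] * NUM_HASHES
    for sh in shingles(bytecode):
        for i in range(NUM_HASHES):
            h = int.from_bytes(
                hashlib.blake2b(i.to_bytes(2, "big") + sh, digest_size=8).digest(),
                "big")
            if h < sig[i]:
                sig[i] = h
    return sig

def similarity(a: list[int], b: list[int]) -> float:
    """Estimated Jaccard similarity of the two underlying shingle sets."""
    return sum(x == y for x, y in zip(a, b)) / NUM_HASHES
```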
Until those compound, single-call concentration is what most of the score reflects. Agents who only need that signal can build it themselves; agents who want the audit trail, the cross-chain history, the bytecode similarity and the calibrated probability use RugGuard.
Always pull /v1/metrics for live numbers — the snapshots above are refreshed only periodically, while the API recomputes from the latest sample data on every request.
When we'll add what
| Phase | Validation milestone |
|---|---|
| Phase 0 (now) | Rug-census recall + first forward samples at T+0 |
| Phase 0.5 (mid-May 2026) | First T+30 forward samples → first precision per verdict band numbers |
| Phase 1 | Bulk B (Dune base-rate query — what % of all Base launches eventually rug?) |
| Phase 1 | Bytecode similarity to known rugs (pgvector) |
| Phase 1 | True positive rate tracking on high_risk verdicts that the agent acted on |