Validation methodology
How we measure whether RugGuard's risk scores actually correlate with rug outcomes. Live numbers at /v1/metrics.
Two complementary datasets
We don't trust one dataset to validate the other — each has structural biases we'd otherwise read as signal.
Dataset A — rug_census (auto-detected confirmed rugs)
What it is. Every Aerodrome v2 token paired against WETH or USDC whose quote balance is below 0.01 WETH (or 100 USDC) and whose pool is at least 7 days old, scanned through the heuristic engine and persisted with source='rug_census', status='rugged'.
Pipeline. A weekly systemd timer runs the bulk discovery and rug-check; the detection logic checks pool drainage from on-chain RPC reads.
What it measures. Per-heuristic recall — for every flag, the fraction of known rugs on which it FAILed (decided rug-positive) versus PASSed. SKIPped results are excluded from both numerator and denominator.
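The recall definition above can be written as a small helper. This is a minimal sketch, not RugGuard's code: the function name `recall_decided` is hypothetical, and the example counts are the TOP10_CONCENTRATION_HIGH row of the Base snapshot table further down.

```python
from typing import Optional


def recall_decided(fail: int, passed: int, skipped: int = 0) -> Optional[float]:
    """Recall over decided results only: FAIL / (FAIL + PASS).

    SKIP counts are accepted for bookkeeping but never enter the ratio.
    Returns None when nothing was decided, so callers can display "n/a".
    """
    decided = fail + passed
    if decided == 0:
        return None
    return fail / decided


# Counts from the Base snapshot: 17 FAIL, 1 PASS, 1 SKIP.
top10 = recall_decided(fail=17, passed=1, skipped=1)
print(f"{top10:.1%}")  # 17 / 18 decided samples
```

Returning `None` rather than 0 for an all-SKIP heuristic keeps "no data" distinct from "never fired".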
Structural bias — read this before quoting numbers. Samples in this dataset are scanned after the rug already happened. By that point:
- The owner has typically been renounced (the rug-puller covered tracks). → OWNER_NOT_RENOUNCED and OWNER_CAN_PAUSE PASS.
- The contract is dead, so GoPlus's honeypot simulator returns no data. → HONEYPOT_DETECTED, HONEYPOT_ASYMMETRIC, HONEYPOT_TAX_HIGH SKIP.
- Mint authority may have been revoked as part of the cover-up. → MINT_AUTHORITY_ACTIVE PASSes more often than in real-time scanning.
Therefore, rug-census recall underestimates each heuristic's pre-trade power. Use it to confirm that the engine is functional and that LP / holder / liquidity heuristics fire on real on-chain data, but never quote it as "recall an agent should expect at trade time."
Dataset B — forward_sampler + T+7/14/30 follow-up
What it is. Every newly-paired token on Aerodrome v2, scanned at t ≈ 0 (within hours of pool creation) and persisted with source='forward_sampler', status='pending'. Then a daily classifier runs the same drainage check at three checkpoints (T+7, T+14, T+30) and advances the status:
- Drainage detected at T+7 / T+14 → status='rugged' (terminal).
- Drainage detected at T+30 → status='rugged' (terminal).
- No drainage at T+30 → status='survived' (terminal).
- No drainage at T+7 / T+14 → still pending, re-checked at the next checkpoint.
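The checkpoint rules above amount to a small state machine. The sketch below is one way to express it; the names `Status` and `advance_status` are hypothetical, and the drainage check itself is assumed to be supplied by the caller as a boolean.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    RUGGED = "rugged"
    SURVIVED = "survived"


CHECKPOINT_DAYS = (7, 14, 30)
FINAL_CHECKPOINT = 30


def advance_status(current: Status, checkpoint_day: int, drained: bool) -> Status:
    """Apply one checkpoint result to a forward-sampler record."""
    if checkpoint_day not in CHECKPOINT_DAYS:
        raise ValueError(f"unknown checkpoint: T+{checkpoint_day}")
    if current is not Status.PENDING:
        return current  # rugged and survived are terminal
    if drained:
        return Status.RUGGED  # drainage at any checkpoint is terminal
    if checkpoint_day == FINAL_CHECKPOINT:
        return Status.SURVIVED  # no drainage by T+30
    return Status.PENDING  # re-checked at the next checkpoint
```

For example, `advance_status(Status.PENDING, 14, drained=False)` leaves the record pending, while the same call at day 30 marks it survived.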
What it measures. Precision per verdict band, the metric that actually matters for agent decision-making:
P(token rugged within 30d | RugGuard returned verdict X at t=0)
Plus the base rate of rug-vs-survived in our forward-sampled population, which lets us compare RugGuard against random.
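To make the two quantities concrete, here is a minimal sketch that computes precision per verdict and the base rate from labeled samples. The function name and the `(verdict, status)` tuple layout are assumptions for illustration, and the toy samples are invented, not RugGuard data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def precision_per_verdict(
    samples: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, float], float]:
    """Return ({verdict: P(rugged | verdict)}, base_rate).

    Each sample is (verdict_at_t0, status). Only terminal statuses
    ("rugged", "survived") are counted; pending samples are ignored.
    """
    totals: Counter = Counter()
    rugged: Counter = Counter()
    for verdict, status in samples:
        if status not in ("rugged", "survived"):
            continue  # no outcome label yet
        totals[verdict] += 1
        if status == "rugged":
            rugged[verdict] += 1
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no terminal samples to evaluate")
    precision = {v: rugged[v] / totals[v] for v in totals}
    base_rate = sum(rugged.values()) / n
    return precision, base_rate


# Invented toy samples, for illustration only.
toy = [
    ("high_risk", "rugged"), ("high_risk", "rugged"), ("high_risk", "survived"),
    ("safe", "survived"), ("safe", "survived"), ("safe", "survived"),
    ("safe", "rugged"), ("safe", "pending"),
]
precision, base_rate = precision_per_verdict(toy)
```

A verdict band is informative only to the extent that its precision differs from the base rate; in the toy data, high_risk sits at 2/3 against a base rate of 3/7.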
Why this is the trustworthy dataset. Scans happen at t=0, on the actual state an agent would see at trade time. No post-rug artifact. The 30-day horizon isn't arbitrary — Aerodrome rugs typically materialize within a week, but giving 30 days lets us catch the slow-rolling exit-scams too.
Limitations.
- Sample size grows linearly with calendar time. Phase 0 launched mid-Apr 2026; the first samples reach T+30 in mid-May 2026. Until then, all precision-per-verdict numbers are non-significant.
- Coverage is currently Aerodrome v2 only. Tokens that launched on a different DEX or were pre-paired off-chain are invisible to the discovery pipeline.
- "Survived" doesn't mean "good buy" — it just means LP wasn't drained within 30 days. Slow bleeds, soft rugs, and high-tax exits aren't captured.
Score-to-band calibration
The score is the sum of the weights of the failing heuristics, normalized to 0–100 over the heuristics that decided (SKIP excluded). The verdict is then derived from the score:
| Range | Band | Verdict |
|---|---|---|
| 0–25 | low | safe |
| 26–50 | medium-low | low_risk |
| 51–70 | medium | medium_risk |
| 71–90 | medium-high | high_risk |
| 91–100 | high | critical |
Confidence override. When score_confidence is low or insufficient_data (too few heuristics decided to trust the numerator), the verdict is forced to uncertain regardless of the numeric score. This protects agents from a "score=100, but only 1/12 heuristics decided" trap — the verdict would otherwise read critical when really the engine just couldn't gather data.
Treat uncertain as "do not trade", not as "safe by default" — the absence of evidence of risk is not evidence of absence.
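The scoring and override rules above can be sketched as follows. The weights, the function names, and the confidence value "high" are hypothetical; the range boundaries come from the table, with each upper bound treated as inclusive for non-integer scores (an assumption, since the source does not say how fractional scores are rounded).

```python
from typing import Dict, Optional

# Inclusive upper bound of each score range, per the table above.
VERDICT_BANDS = [(25, "safe"), (50, "low_risk"), (70, "medium_risk"),
                 (90, "high_risk"), (100, "critical")]
LOW_CONFIDENCE = {"low", "insufficient_data"}


def normalized_score(results: Dict[str, str],
                     weights: Dict[str, float]) -> Optional[float]:
    """100 * (weight of FAILs) / (weight of FAILs + PASSes); SKIPs are ignored."""
    decided = [name for name, outcome in results.items()
               if outcome in ("FAIL", "PASS")]
    total = sum(weights[name] for name in decided)
    if total == 0:
        return None  # nothing decided, so there is no score
    failing = sum(weights[name] for name in decided if results[name] == "FAIL")
    return 100.0 * failing / total


def verdict(score: Optional[float], score_confidence: str) -> str:
    """Map a score to a verdict, forcing 'uncertain' on low confidence."""
    if score is None or score_confidence in LOW_CONFIDENCE:
        return "uncertain"
    if not 0 <= score <= 100:
        raise ValueError("score must be within 0-100")
    for upper, name in VERDICT_BANDS:
        if score <= upper:
            return name
    raise AssertionError("unreachable")


# Hypothetical weights and results, for illustration only.
weights = {"TOP10_CONCENTRATION_HIGH": 3.0, "LP_NOT_LOCKED": 2.0,
           "HONEYPOT_DETECTED": 5.0}
results = {"TOP10_CONCENTRATION_HIGH": "FAIL", "LP_NOT_LOCKED": "PASS",
           "HONEYPOT_DETECTED": "SKIP"}
score = normalized_score(results, weights)  # 3.0 / (3.0 + 2.0) * 100
```

With these inputs the skipped honeypot check drops out of the denominator, and the same score yields "uncertain" as soon as the confidence is low.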
Live results
Always pull from /v1/metrics for current numbers — they update at request time from the latest sample data.
curl -s https://rugguard.redfleet.fr/v1/metrics | jq
The endpoint returns:
- sample_counts_by_source_status: how many samples we have, by source and status
- rug_census.heuristic_recall: per-flag recall on the rug census (with the post-rug bias caveat above)
- rug_census.verdict_distribution: what verdict band the engine assigned to confirmed rugs
- forward_sampler.status_breakdown: how many forward samples are pending vs rugged vs survived
Base snapshot
Rug census, n=19, 14-day Aerodrome lookback (snapshot 2026-05-17 — live numbers always at /v1/metrics):
| Heuristic | Fail | Pass | Skip | Recall (decided) |
|---|---|---|---|---|
| TOP10_CONCENTRATION_HIGH | 17 | 1 | 1 | 94.4% |
| LP_INSUFFICIENT_LIQUIDITY | 5 | 1 | 13 | 83.3% |
| LP_NOT_LOCKED | 5 | 0 | 14 | 100% |
| OWNER_NOT_RENOUNCED | 4 | 0 | 15 | 100% |
| SOURCE_NOT_VERIFIED | 4 | 14 | 1 | 22.2% |
| MINT_AUTHORITY_ACTIVE | 1 | 12 | 6 | 7.7% |
| HONEYPOT_* / OWNER_CAN_PAUSE / HIDDEN_OWNER | 0 | 0–13 | 6–18 | 0% (post-rug) |
Solana snapshot
Rug census, n=24, drained Raydium V4 pools (TVL < $50, recent 7d volume > 0) (snapshot 2026-05-17):
| Heuristic | Fail | Pass | Skip | Recall (decided) |
|---|---|---|---|---|
| SOL_TOP10_CONCENTRATION_HIGH | 16 | 2 | 6 | 88.9% |
| SOL_LP_NOT_LOCKED | 3 | 17 | 4 | 15.0% |
| SOL_MINT_AUTHORITY_ACTIVE | 1 | 19 | 4 | 5.0% |
| SOL_FREEZE_AUTHORITY_ACTIVE | 0 | 20 | 4 | 0.0% (post-rug) |
| SOL_DEPLOYER_RUG_HISTORY | 0 | 0 | 24 | n/a (insufficient_history during DB warm-up) |
The cross-chain pattern is consistent: TOP10_CONCENTRATION_HIGH is the workhorse signal, at 94.4% recall on confirmed Base rugs and 88.9% on Solana. Authority-based heuristics (MINT_*, FREEZE_*) underperform on the rug census for the same post-rug-state reason: ruggers commonly renounce authorities right after the drain to cover their tracks.
Honest disclosure: signal concentration
Today, ~94% of our recall on confirmed rugs comes from a single heuristic: TOP10_CONCENTRATION_HIGH. The remaining heuristics fire under 23% on the rug census, partly real (some signals genuinely matter less in isolation) and partly artefactual (post-rug-state biases listed above). This is a fact, not a defect — concentration is a leading indicator on memecoin launches by construction. But the implication is that a competitor can replicate ~90% of our current value with a single RPC call to balanceOf the top holders.
The structural moat we are deliberately building (in priority order):
- Forward-sampler ground truth: precision-per-verdict numbers nobody else publishes, computed from our own T+30 outcome labels (first datapoint mid-May 2026).
- BYTECODE_SIMILAR_TO_RUG: MinHash on 4-byte shingles vs every contract we have ever labeled rugged. Recall here grows with our census, not with what the chain exposes; a competitor cannot reproduce it without our corpus.
- DEPLOYER_RUG_HISTORY: same logic, queryable only because we accumulate scans + labels over time.
Until those compound, single-call concentration is what most of the score reflects. Agents who only need that signal can build it themselves; agents who want the audit trail, the cross-chain history, the bytecode similarity and the calibrated probability use RugGuard.
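For readers unfamiliar with the bytecode-similarity idea named in the list above, here is a generic MinHash sketch over 4-byte shingles. It illustrates the technique only and is not RugGuard's implementation: the signature length, the use of salted BLAKE2b as the hash family, and all function names are assumptions.

```python
import hashlib
from typing import List, Set

SHINGLE_SIZE = 4  # bytes per shingle, as stated in the text
NUM_HASHES = 64   # signature length (assumed value)


def shingles(bytecode: bytes, size: int = SHINGLE_SIZE) -> Set[bytes]:
    """All overlapping `size`-byte windows of the bytecode."""
    if len(bytecode) < size:
        return {bytecode} if bytecode else set()
    return {bytecode[i:i + size] for i in range(len(bytecode) - size + 1)}


def minhash_signature(bytecode: bytes, num_hashes: int = NUM_HASHES) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(bytecode)
    if not items:
        raise ValueError("empty bytecode")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of shingle-set Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing a new contract's signature against stored signatures of labeled rugs is cheap once the corpus exists, which is why the value of such a signal depends on the corpus rather than on the algorithm.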
Always pull /v1/metrics for live numbers: the snapshot above is refreshed only periodically, whereas the API recomputes its numbers on every request.
Forward sampler progress
The forward sampler started feeding the classifier in Apr 2026. As of 2026-05-17:
| Chain | Rugged (terminal) | Pending (T+0 / T+7 / T+14) |
|---|---|---|
| Base | 828 | 2 449 |
| Solana | 15 | 388 |
Numbers update each request at /v1/metrics — the snapshot above will drift the same day you read this page.
What we'll publish next, in order:
| Milestone | What you'll see |
|---|---|
| n ≥ 50 forward-rugged samples per chain | Precision per verdict band: P(token rugged within 30d \| RugGuard returned high_risk at t=0). Exposed at /v1/metrics under forward_sampler.precision_per_verdict. |
| n ≥ 100 forward-survived samples | False-positive estimate: P(RugGuard returned high_risk \| token survived 30d). Same endpoint, forward_sampler.fp_rate_per_verdict. |
| Dune base-rate integration | Bulk B — the population base rate: what % of all Base / Solana launches rug at all within 30d? Lets a reader compare RugGuard precision against the random baseline directly. |
| Bytecode similarity (MinHash) | Per-call BYTECODE_SIMILAR_TO_RUG recall against our growing labeled corpus. Already shipped in code — metric will surface once corpus > 500 confirmed-rugged contracts. |
| Agent-acted true positive rate | For agents that signal back via /v1/feedback (not built yet): of high_risk verdicts the agent acted on, how many were genuine rugs? |
Verify it yourself
Pick a token you already know rugged on Base, scan it free with the public endpoint, compare what RugGuard says to your independent assessment:
curl -s "https://rugguard.redfleet.fr/v1/explain?contract=0x...&chain=base" | jq
/v1/explain retrieves the cached scan with the per-heuristic breakdown if the token has been scanned before. If not, run a fresh scan first (paid):
# Requires an x402-capable client, e.g. rugguard-mcp:
pip install rugguard-mcp
python -m rugguard_mcp init
# Then from any MCP-aware agent (Claude Desktop, Cursor, etc):
# scan_token(chain="base", contract="0x...")
The free path is /v1/metrics: none of the validation data is gated behind payment, so you can audit recall and forward-sampler progress before deciding whether to install a client.