Validation methodology

How we measure whether RugGuard's risk scores actually correlate with rug outcomes. Live numbers are at /v1/metrics.

Two complementary datasets

We don't trust one dataset to validate the other — each has structural biases we'd otherwise read as signal.

Dataset A — rug_census (auto-detected confirmed rugs)

What it is. Every Aerodrome v2 token paired against WETH or USDC whose quote balance is below 0.01 WETH (or 100 USDC) and whose pool is at least 7 days old, scanned through the heuristic engine and persisted with source='rug_census', status='rugged'.
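The eligibility rule above can be sketched as a small predicate. This is an illustration only: the `Pool` fields, the function name, and the choice to express balances in whole tokens are assumptions made for the example, not RugGuard's actual code.

```python
from dataclasses import dataclass
from decimal import Decimal

# Thresholds taken from the text: 0.01 WETH or 100 USDC of remaining quote balance.
DRAIN_THRESHOLDS = {"WETH": Decimal("0.01"), "USDC": Decimal("100")}
MIN_POOL_AGE_DAYS = 7


@dataclass
class Pool:
    """Hypothetical pool snapshot; the field names are illustrative."""
    quote_symbol: str       # "WETH" or "USDC"
    quote_balance: Decimal  # quote-side balance, in whole tokens
    age_days: float         # days since pool creation


def is_census_rug(pool: Pool) -> bool:
    """True when the pool matches the rug_census rule described above."""
    threshold = DRAIN_THRESHOLDS.get(pool.quote_symbol)
    if threshold is None:
        return False  # only WETH and USDC pairs are in scope
    drained = pool.quote_balance < threshold
    old_enough = pool.age_days >= MIN_POOL_AGE_DAYS
    return drained and old_enough
```

For example, `is_census_rug(Pool("WETH", Decimal("0.004"), 12))` is true, while the same pool at 3 days old is not yet eligible.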

Pipeline. A weekly systemd timer runs the bulk discovery and rug-check; the detection logic checks pool drainage from on-chain RPC reads.

What it measures. Per-heuristic recall — for every flag, the fraction of known rugs on which it FAILed (decided rug-positive) versus PASSed. SKIPped results are excluded from both numerator and denominator.
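As a minimal sketch of that definition (the function name and signature are ours, not part of the API), recall over decided results looks like this:

```python
from typing import Optional


def recall_decided(fail: int, passed: int, skip: int) -> Optional[float]:
    """Recall over decided results: FAIL / (FAIL + PASS).

    `skip` is accepted only to make explicit that SKIPped results are
    left out of both the numerator and the denominator.
    """
    decided = fail + passed
    if decided == 0:
        return None  # nothing decided: recall is undefined, not 0%
    return fail / decided
```

With 17 FAIL, 1 PASS and 1 SKIP (the TOP10_CONCENTRATION_HIGH row of the Base snapshot further down), this gives 17/18, roughly 94.4%. A heuristic that SKIPped on every sample returns `None` rather than a misleading zero.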

Structural bias — read this before quoting numbers. Samples in this dataset are scanned after the rug already happened. By that point:

Therefore: rug-census recall underestimates the heuristics' pre-trade power. Use it to confirm that the engine is functional and that the LP, holder, and liquidity heuristics fire on real on-chain data, but never quote it as "recall an agent should expect at trade time."

Dataset B — forward_sampler + T+7/14/30 follow-up

What it is. Every newly-paired token on Aerodrome v2, scanned at t ≈ 0 (within hours of pool creation) and persisted with source='forward_sampler', status='pending'. Then a daily classifier runs the same drainage check at three checkpoints (T+7, T+14, T+30) and advances the status:

What it measures. Precision per verdict band, the metric that actually matters for agent decision-making:

P(token rugged within 30d | RugGuard returned verdict X at t=0)

Plus the base rate of rug-vs-survived in our forward-sampled population, which lets us compare RugGuard against random.
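The two quantities above can be sketched as follows. The input shape, a list of `(verdict_at_t0, rugged_within_30d)` pairs, and the function names are assumptions for illustration. Only samples with a terminal label belong in the input; pending samples have no outcome yet.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sample = Tuple[str, bool]  # (verdict at t=0, rugged within 30 days)


def precision_per_verdict(samples: List[Sample]) -> Dict[str, float]:
    """P(rugged within 30d | verdict at t=0), one entry per observed verdict."""
    total: Counter = Counter()
    rugged: Counter = Counter()
    for verdict, did_rug in samples:
        total[verdict] += 1
        if did_rug:
            rugged[verdict] += 1
    return {verdict: rugged[verdict] / total[verdict] for verdict in total}


def base_rate(samples: List[Sample]) -> float:
    """Share of all labeled samples that rugged, regardless of verdict."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    return sum(1 for _, did_rug in samples if did_rug) / len(samples)
```

A verdict band is only informative when its precision differs clearly from the base rate: if half of all sampled tokens rug anyway, a band with 50% precision is no better than random.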

Why this is the trustworthy dataset. Scans happen at t=0, on the actual state an agent would see at trade time. No post-rug artifact. The 30-day horizon isn't arbitrary — Aerodrome rugs typically materialize within a week, but giving 30 days lets us catch the slow-rolling exit-scams too.

Limitations.

Score-to-band calibration

The score is the sum of the weights of the failing heuristics, normalized to 0–100 over the heuristics that decided (SKIP excluded). The verdict is then derived from the score:

Range | Band | Verdict
0–25 | low | safe
26–50 | medium-low | low_risk
51–70 | medium | medium_risk
71–90 | medium-high | high_risk
91–100 | high | critical

Confidence override. When score_confidence is low or insufficient_data (too few heuristics decided to trust the numerator), the verdict is forced to uncertain regardless of the numeric score. This protects agents from a "score=100, but only 1/12 heuristics decided" trap — the verdict would otherwise read critical when really the engine just couldn't gather data.

Treat uncertain as "do not trade", not as "safe by default" — the absence of evidence of risk is not evidence of absence.
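Putting the score, the band table and the confidence override together, a minimal sketch might look like the following. The heuristic weights, the `"high"` confidence label, the handling of fractional scores (upper bounds treated as inclusive) and the choice to return uncertain when nothing decided are all assumptions of this sketch, not documented behavior.

```python
from typing import Dict, Optional

# (inclusive upper bound, verdict), taken from the band table above.
BANDS = [(25, "safe"), (50, "low_risk"), (70, "medium_risk"),
         (90, "high_risk"), (100, "critical")]
OVERRIDE_CONFIDENCES = {"low", "insufficient_data"}


def score(results: Dict[str, str], weights: Dict[str, float]) -> Optional[float]:
    """Failing weight as a 0-100 share of decided weight; None if nothing decided."""
    decided = {h: r for h, r in results.items() if r in ("FAIL", "PASS")}
    total = sum(weights[h] for h in decided)
    if total == 0:
        return None
    failing = sum(weights[h] for h, r in decided.items() if r == "FAIL")
    return 100.0 * failing / total


def verdict(score_value: Optional[float], score_confidence: str) -> str:
    """Map a score to a verdict, forcing 'uncertain' on weak confidence."""
    if score_value is None or score_confidence in OVERRIDE_CONFIDENCES:
        return "uncertain"
    for upper, name in BANDS:
        if score_value <= upper:
            return name
    raise ValueError(f"score out of range: {score_value}")
```

With hypothetical weights `{"A": 3, "B": 1, "C": 5}` and results `{"A": "FAIL", "B": "PASS", "C": "SKIP"}`, the score is 75.0 and the verdict is high_risk. The same score with `score_confidence="insufficient_data"` yields uncertain, which is the trap the override is meant to close.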

Live results

Always pull from /v1/metrics for current numbers — they update at request time from the latest sample data.

curl -s https://rugguard.redfleet.fr/v1/metrics | jq

The endpoint returns:

Base snapshot

Rug census, n=19, 14-day Aerodrome lookback (snapshot 2026-05-17 — live numbers always at /v1/metrics):

Heuristic | Fail | Pass | Skip | Recall (decided)
TOP10_CONCENTRATION_HIGH | 17 | 1 | 1 | 94.4%
LP_INSUFFICIENT_LIQUIDITY | 5 | 1 | 13 | 83.3%
LP_NOT_LOCKED | 5 | 0 | 14 | 100%
OWNER_NOT_RENOUNCED | 4 | 0 | 15 | 100%
SOURCE_NOT_VERIFIED | 4 | 14 | 1 | 22.2%
MINT_AUTHORITY_ACTIVE | 1 | 12 | 6 | 7.7%
HONEYPOT_* / OWNER_CAN_PAUSE / HIDDEN_OWNER | 0 | 0–13 | 6–18 | 0% (post-rug)

Solana snapshot

Rug census, n=24, drained Raydium V4 pools (TVL < $50, recent 7d volume > 0) (snapshot 2026-05-17):

Heuristic | Fail | Pass | Skip | Recall (decided)
SOL_TOP10_CONCENTRATION_HIGH | 16 | 2 | 6 | 88.9%
SOL_LP_NOT_LOCKED | 3 | 17 | 4 | 15.0%
SOL_MINT_AUTHORITY_ACTIVE | 1 | 19 | 4 | 5.0%
SOL_FREEZE_AUTHORITY_ACTIVE | 0 | 20 | 4 | 0.0% (post-rug)
SOL_DEPLOYER_RUG_HISTORY | 0 | 0 | 24 | n/a (insufficient_history during DB warm-up)

The cross-chain pattern is consistent: TOP10_CONCENTRATION_HIGH is the workhorse signal, at 94.4% recall on confirmed rugs on Base and 88.9% on Solana. Authority-based heuristics (MINT_*, FREEZE_*) underperform on the rug census for the same post-rug-state reason: ruggers commonly renounce authorities right after the drain to cover their tracks.

Honest disclosure: signal concentration

Today, ~94% of our recall on confirmed rugs comes from a single heuristic: TOP10_CONCENTRATION_HIGH. The remaining heuristics either mostly SKIP on the rug census or fire on fewer than 23% of decided samples. That is partly real (some signals genuinely matter less in isolation) and partly artefactual (the post-rug-state bias described above). This is a fact, not a defect: concentration is a leading indicator on memecoin launches by construction. But the implication is that a competitor can replicate ~90% of our current value with a single RPC call to balanceOf the top holders.

The structural moat we are deliberately building (in priority order):

  1. Forward-sampler ground truth — precision-per-verdict numbers nobody else publishes, computed from our own T+30 outcome labels (first datapoint mid-May 2026).
  2. BYTECODE_SIMILAR_TO_RUG — MinHash on 4-byte shingles vs every contract we have ever labeled rugged. Recall here grows with our census, not with what the chain exposes; a competitor cannot reproduce it without our corpus.
  3. DEPLOYER_RUG_HISTORY — same logic: queryable only because we accumulate scans + labels over time.
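To make item 2 concrete, here is a generic MinHash sketch over 4-byte shingles of contract bytecode. It uses the textbook construction (random affine hash functions modulo a prime); the signature length of 64, the seed, and every function name are assumptions, and RugGuard's own implementation may differ.

```python
import random
from typing import List, Set, Tuple

_PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 4-byte shingle value


def shingles(bytecode: bytes, width: int = 4) -> Set[int]:
    """Every overlapping `width`-byte window of the bytecode, as an integer."""
    return {
        int.from_bytes(bytecode[i:i + width], "big")
        for i in range(len(bytecode) - width + 1)
    }


def make_hash_params(num_hashes: int = 64, seed: int = 0) -> List[Tuple[int, int]]:
    """Deterministic (a, b) pairs for the hash family h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(bytecode: bytes, params: List[Tuple[int, int]]) -> List[int]:
    """One minimum per hash function, taken over the shingle set."""
    items = shingles(bytecode)
    if not items:
        raise ValueError("bytecode is shorter than one shingle")
    return [min((a * x + b) % _PRIME for x in items) for a, b in params]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, which estimates shingle-set Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)
```

A new contract would be compared against the stored signatures of every contract labeled rugged, flagging it when the estimated similarity exceeds some threshold. Both signatures must be built with the same `params`, which is why the corpus of labeled signatures, not the algorithm, is the hard part to replicate.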

Until those compound, single-call concentration is what most of the score reflects. Agents who only need that signal can build it themselves; agents who want the audit trail, the cross-chain history, the bytecode similarity and the calibrated probability use RugGuard.

Always pull /v1/metrics for live numbers — the snapshot above is updated periodically but the API is updated each request.

Forward sampler progress

The forward sampler started feeding the classifier in Apr 2026. As of 2026-05-17:

ChainRugged (terminal)Pending (T+0 / T+7 / T+14)
Base8282 449
Solana15388

Numbers are recomputed on each request to /v1/metrics, so the snapshot above may already be out of date when you read this page.

What we'll publish next, in order:

Milestone: What you'll see
n ≥ 50 forward-rugged samples per chain: Precision per verdict band, P(token rugged within 30d | RugGuard returned high_risk at t=0). Exposed at /v1/metrics under forward_sampler.precision_per_verdict.
n ≥ 100 forward-survived samples: False-positive estimate, P(RugGuard returned high_risk | token survived 30d). Same endpoint, forward_sampler.fp_rate_per_verdict.
Dune base-rate integration: Bulk B, the population base rate, i.e. what % of all Base / Solana launches rug at all within 30d? Lets a reader compare RugGuard precision against the random baseline directly.
Bytecode similarity (MinHash): Per-call BYTECODE_SIMILAR_TO_RUG recall against our growing labeled corpus. Already shipped in code; the metric will surface once the corpus exceeds 500 confirmed-rugged contracts.
Agent-acted true positive rate: For agents that signal back via /v1/feedback (not built yet): of the high_risk verdicts the agent acted on, how many were genuine rugs?

Verify it yourself

Pick a token you already know rugged on Base, look up its scan for free with the public endpoint, and compare what RugGuard says to your independent assessment:

curl -s "https://rugguard.redfleet.fr/v1/explain?contract=0x...&chain=base" | jq

/v1/explain retrieves the cached scan with the per-heuristic breakdown if the token has been scanned before. If not, run a fresh scan first (paid):

# Requires an x402-capable client, e.g. rugguard-mcp:
pip install rugguard-mcp
python -m rugguard_mcp init
# Then from any MCP-aware agent (Claude Desktop, Cursor, etc):
#   scan_token(chain="base", contract="0x...")

The free path is /v1/metrics: the validation numbers are not gated behind payment, so you can audit recall and forward-sampler progress before deciding whether to install a client.