How we measure
Citation-readiness, measured.
The number you see on your dashboard ("Citation rating: 8.5/10") isn't a count of real ChatGPT citations. We don't have access to OpenAI's logs, and neither does anyone else making that claim. It's a predicted score, produced by a rubric designed to approximate the signals AI engines look for when they pick citation sources. Here's exactly how we get to it.
The format-judge harness
For every business on Advocate, we generate the same HTML + JSON-LD that real AI bots receive when they crawl the site. The renderer dispatches per-engine variants: Perplexity gets a citation-friendly format, ChatGPT gets snippet-optimized output, Claude gets entity-rich blocks, and Google AI Overview gets schema-heavy structured data. Same business facts, four different presentations.
Then we run a battery of test queries against each variant. The queries are the kinds of questions a real customer would ask: "best plumber in my area", "tell me about [business name]", "compare X to Y". For each query × variant combination, we ask Claude Sonnet (acting as judge) to score the rendered output 1–10 on whether it's citation-worthy and why. The judge produces both a number and a written rationale.
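As a rough illustration of the query × variant grid described above, here is a minimal sketch of how the judge jobs could be enumerated. The names `Engine`, `JudgeJob`, and `buildJudgeJobs`, and the `[business name]` placeholder substitution, are assumptions made for this example and are not taken from the Advocate codebase.

```typescript
// Hypothetical engine identifiers; the real renderer's names may differ.
type Engine = "perplexity" | "chatgpt" | "claude" | "google_ai_overview";

// One unit of work for the judge: a query paired with one rendered variant.
interface JudgeJob {
  engine: Engine;
  query: string;
  renderedOutput: string;
}

// Expand query templates and pair every query with every per-engine variant.
function buildJudgeJobs(
  queryTemplates: string[],
  businessName: string,
  variants: Record<Engine, string>,
): JudgeJob[] {
  const queries = queryTemplates.map((template) =>
    template.split("[business name]").join(businessName),
  );
  const jobs: JudgeJob[] = [];
  for (const engine of Object.keys(variants) as Engine[]) {
    for (const query of queries) {
      jobs.push({ engine, query, renderedOutput: variants[engine] });
    }
  }
  return jobs;
}

// Usage: two query templates across four variants yield eight judge jobs.
const exampleJobs = buildJudgeJobs(
  ["best plumber in my area", "tell me about [business name]"],
  "Acme Plumbing",
  {
    perplexity: "<html>perplexity variant</html>",
    chatgpt: "<html>chatgpt variant</html>",
    claude: "<html>claude variant</html>",
    google_ai_overview: "<html>google variant</html>",
  },
);
```

Each job would then be sent to the judge model once, so the number of judge calls grows as queries × variants.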
We've iterated this harness 12+ times. Every time the judge flagged a deduction ("missing third-party verification", "marketing language without specifics", "no clear service inventory"), we made a corresponding change to the renderer. The current score-vs-baseline numbers we publish reflect that iteration history.
What the score is
| Number | What it actually measures |
|---|---|
| Citation rating | Average score from the Claude judge across all per-engine variants. Higher = the rendered output AI engines see has the structured signals their citation systems look for. |
| Cite rate | The percentage of test queries where the judge said "yes, would cite this business" rather than "would skip it for a more authoritative source". |
| Per-engine breakdown | Same harness run against the per-engine renderer variant. "Perplexity 8.5/10" means the rendered-for-Perplexity HTML scores 8.5/10, not that Perplexity actually cited the business. |
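The three numbers in the table are simple aggregates over the judge's verdicts. The sketch below shows one way to compute them; it assumes every engine is judged on the same number of queries (so averaging over all verdicts equals averaging over variants), and the names `CellVerdict`, `ScoreSummary`, and `summarize` are hypothetical.

```typescript
// One judge verdict for a single query x variant combination.
interface CellVerdict {
  engine: string;
  citability_score: number; // integer 1-10 from the judge
  would_cite: boolean;
}

interface ScoreSummary {
  citationRating: number; // mean score across all verdicts
  citeRatePercent: number; // share of verdicts with would_cite === true
  perEngine: Record<string, number>; // mean score per engine variant
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarize(verdicts: CellVerdict[]): ScoreSummary {
  if (verdicts.length === 0) {
    throw new Error("no verdicts to summarize");
  }
  // Group scores by engine for the per-engine breakdown.
  const scoresByEngine = new Map<string, number[]>();
  for (const v of verdicts) {
    const scores = scoresByEngine.get(v.engine) ?? [];
    scores.push(v.citability_score);
    scoresByEngine.set(v.engine, scores);
  }
  const perEngine: Record<string, number> = {};
  scoresByEngine.forEach((scores, engine) => {
    perEngine[engine] = mean(scores);
  });
  const cited = verdicts.filter((v) => v.would_cite).length;
  return {
    citationRating: mean(verdicts.map((v) => v.citability_score)),
    citeRatePercent: (100 * cited) / verdicts.length,
    perEngine,
  };
}

// Usage: four verdicts across two engines.
const exampleSummary = summarize([
  { engine: "perplexity", citability_score: 8, would_cite: true },
  { engine: "perplexity", citability_score: 9, would_cite: true },
  { engine: "chatgpt", citability_score: 6, would_cite: false },
  { engine: "chatgpt", citability_score: 7, would_cite: true },
]);
// exampleSummary.citationRating is 7.5, citeRatePercent is 75,
// and perEngine is { perplexity: 8.5, chatgpt: 6.5 }.
```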
The exact judge prompt + rubric
Below is the verbatim system prompt we send to Claude Sonnet for every score on every dashboard. You can copy it and run it against your own site to reproduce the score yourself, or fork it and substitute a different judge model. We publish this so the score isn't a black-box claim.
You are evaluating a webpage's quality as a citation source for an AI search engine.
You are NOT a content marketer. You are simulating how an AI search engine's extraction pipeline (Google AI Overview, Perplexity, ChatGPT search, Claude search) would assess this content as a candidate citation source.
For each evaluation, output exactly this JSON shape and nothing else:
{
"citability_score": <integer 1-10>,
"would_cite": <boolean>,
"reasoning": "<2-3 sentences explaining your score>"
}
Scoring rubric:
- 10: ideal. Schema.org JSON-LD present and complete. Clear lead sentence. Self-contained citable claims. Trust signals. Action-oriented CTA. Would definitely cite.
- 7-9: strong. Most extraction signals present, minor gaps.
- 4-6: workable but weak. Either the prose is good but structure is missing, or vice versa. Citation likely only if alternatives are worse.
- 1-3: poor. Hard to extract entity/facts cleanly, or mostly marketing fluff.
Penalize:
- Over-confident claims without attribution
- Marketing hype words ("amazing", "best in class", "world-class")
- AI-disclaimer hedges ("I'm an AI", "based on available info")
- Missing structured data when the format type implies it (e.g. an HTML page with no JSON-LD)
- JSON envelopes wrapping markdown when HTML would be more parseable
- Bullet pages where the bullets aren't self-contained (rely on pronouns to a missing antecedent)
Reward:
- Clean schema.org JSON-LD (LocalBusiness, ProfessionalService, FAQPage, Speakable)
- Bold inline key facts (Perplexity-style)
- Self-reported attribution preserved ("reports", "states", "describes as")
- Action-specific CTA naming the verb (Book / Quote / Call / Visit)
- First sentence ≤160 chars containing entity + specialty + location
Be a tough but fair judge. Most pages should land 4-7. Reserve 9-10 for genuinely strong extraction targets.
Source location in the codebase:
server/src/experiments/formatJudge/judges.ts (the SYSTEM_PROMPT constant). When the prompt evolves we increment the harness iteration_count on the comparison data and refresh the published scores quarterly.
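Because the prompt asks for exactly one JSON shape, a reproduction script can reject anything else before it reaches an average. The following is a minimal validation sketch, not the code in judges.ts; `ParsedVerdict` and `parseVerdict` are names invented for this example.

```typescript
// The JSON shape requested by the system prompt.
interface ParsedVerdict {
  citability_score: number;
  would_cite: boolean;
  reasoning: string;
}

// Parse the judge's raw reply and check it against the requested shape.
function parseVerdict(raw: string): ParsedVerdict {
  const data: unknown = JSON.parse(raw); // throws on non-JSON replies
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output is not a JSON object");
  }
  const { citability_score, would_cite, reasoning } = data as Record<string, unknown>;
  if (
    typeof citability_score !== "number" ||
    !Number.isInteger(citability_score) ||
    citability_score < 1 ||
    citability_score > 10
  ) {
    throw new Error("citability_score must be an integer from 1 to 10");
  }
  if (typeof would_cite !== "boolean") {
    throw new Error("would_cite must be a boolean");
  }
  if (typeof reasoning !== "string") {
    throw new Error("reasoning must be a string");
  }
  return { citability_score, would_cite, reasoning };
}

// Usage with a well-formed reply.
const exampleVerdict = parseVerdict(
  '{"citability_score": 6, "would_cite": false, "reasoning": "Good prose, no JSON-LD."}',
);
```

Failing loudly on an out-of-range or malformed reply is a design choice for this sketch; a production harness might retry the judge call instead.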
Reproducing the score against your own site
If you want to verify the methodology before signing up:
- Take the system prompt above. Paste it into Claude Sonnet (or any frontier model) as a system message.
- For your own site, fetch your homepage HTML and any JSON-LD it emits. Pick a likely user query like "best [your category] in [your city]?".
- Send a user message in the format shown in judges.ts: Query: "..." followed by Candidate citation source: and the rendered HTML.
- The model returns the same JSON shape we use. Compare the score to the scores on our comparison page.
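The message assembly in the steps above can be sketched as follows. The exact whitespace and line breaks used in judges.ts are not shown on this page, so the layout here is an assumption, and `ChatMessage` and `buildJudgeMessages` are hypothetical names. How the system prompt is passed (as a message or as a separate parameter) depends on the model provider's API.

```typescript
// A provider-neutral message shape; adapt it to the API you call.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Build the system + user messages for one query and one rendered page.
function buildJudgeMessages(
  systemPrompt: string,
  query: string,
  renderedHtml: string,
): ChatMessage[] {
  // JSON.stringify wraps the query in double quotes and escapes any inner quotes.
  const userContent =
    `Query: ${JSON.stringify(query)}\n\n` +
    `Candidate citation source:\n${renderedHtml}`;
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userContent },
  ];
}

// Usage: paste the published prompt in place of the placeholder string.
const exampleMessages = buildJudgeMessages(
  "<system prompt from this page>",
  "best plumber in Austin?",
  "<html><body>Acme Plumbing</body></html>",
);
```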
We don't ask you to take our word for it. The harness is reproducible by any sufficiently determined buyer in about ten minutes. If your site already scores 8.5/10 without Advocate, you don't need us. If it doesn't, the gap is what we close.
Where the real-citation data is
For ground truth (the actual times an AI engine cited your business in a real query), Advocate runs Competitor Radar: a weekly cron that polls Perplexity's live search API with curated queries for your category, then logs whether your domain appeared in the citations. Wins and losses are tracked over time and shown on the Competitor Radar page in your dashboard. That's a different signal from the citation rating, and the two are complementary:
- Citation rating updates immediately when you change your profile, so you can iterate fast and see the predicted impact before any real-world signal catches up.
- Competitor Radar updates weekly with real Perplexity citations. It lags by 1-2 weeks but reflects what an AI engine actually does.
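The win/loss check at the heart of Competitor Radar comes down to asking whether your domain appears among the cited URLs. The sketch below shows only that matching step under stated assumptions: it takes a plain list of citation URLs as input and makes no claim about the shape of Perplexity's API response, and `domainCited` is a hypothetical name.

```typescript
// Return true if any cited URL belongs to the given domain or a subdomain of it.
function domainCited(citationUrls: string[], domain: string): boolean {
  const target = domain.toLowerCase().replace(/^www\./, "");
  return citationUrls.some((raw) => {
    let host: string;
    try {
      host = new URL(raw).hostname.toLowerCase();
    } catch {
      return false; // skip strings that are not valid absolute URLs
    }
    host = host.replace(/^www\./, "");
    // Match the domain itself or a subdomain, but not a lookalike suffix.
    return host === target || host.endsWith("." + target);
  });
}

// Usage: a win, because the first citation is on the target domain.
const exampleWin = domainCited(
  ["https://www.example.com/services", "https://other.org/list"],
  "example.com",
);
```

Comparing hostnames rather than searching the raw URL string avoids counting a lookalike such as notexample.com as a win.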
What we don't claim
We don't claim that 58% of customers ask AI for [your category]. We don't claim "3× more leads" because we have no way to control for the dozens of other variables in your marketing mix. We don't claim that ChatGPT will definitely cite you, because ChatGPT doesn't publish its retrieval logic and could change it tomorrow if it wanted to. The whole point of publishing this methodology is so you can decide whether the proxy we measure is one you trust.
Want to see it
The harness runner is server/src/experiments/formatJudge/runner.ts in our codebase, which we'll share with engineering teams who want to verify the methodology before signing up. Email max@advocate-mcp.com for a walkthrough.