Share of Voice in AI: How to Measure Brand Visibility in LLMs
Share of Voice in AI is the fraction of LLM answers that cite your brand. Here is the formula, a 30-day measurement plan, and the three pitfalls that distort the number.
Share of Voice (SoV) in AI is the percentage of LLM answers across a fixed prompt set that cite or mention your brand. It is the operational equivalent of impression share in paid search, but the unit is a sentence inside a generated answer, not a slot on a results page. This post defines the metric, gives the formula we use with brands, and lays out a 30-day measurement plan you can run with or without tooling.
What "Share of Voice" means in the AI era
The classical SoV metric came from media buying. It was your brand's spend or impression count divided by the category's total — a clean denominator because impressions were billed and counted. Search SEO inherited a softer version: your share of organic clicks for a defined keyword basket, with rank-tracking tools like Ahrefs and Semrush as the meter.
AI-era SoV is different on three dimensions. The unit shifts from impression or click to citation or mention. The surface shifts from a single ranked list to many engine-specific answer formats. And the denominator stops being a finite, biddable pool of keywords — it becomes a sample of prompts you choose, because the population of possible AI queries is effectively infinite.
The strategic point: AI SoV is a sampling problem before it is a counting problem. Get the sample wrong and the number means nothing.
The math: how to actually measure it
The core formula is the same shape as classical SoV, but it is conditioned on three new variables — a prompt sample, an engine set, and a measurement window — plus a citation rule.
SoV_AI = ( answers_citing_brand / answers_total ) over prompt_sample P, engine_set E, window W
The four inputs you have to fix before the metric is meaningful:
| Variable | Definition | Common mistake |
|---|---|---|
| Prompt sample P | The fixed set of prompts you re-run each measurement window | Drifting the sample week to week makes trends meaningless |
| Engine set E | The answer engines you query (ChatGPT, Perplexity, Gemini, Copilot, ...) | Reporting a single number across engines instead of per-engine |
| Measurement window W | The time period the sample covers (week, fortnight, month) | Comparing windows with different prompt counts |
| Citation rule | What counts as a "cite": URL footnote, in-text brand mention, or both | Mixing citation and mention without labelling |
A second metric pairs with SoV: citation rate, which is the share of answers where your domain appears as a numbered or footnoted source (not just a brand-name mention). Stanford's HELM Lite benchmark is one of the few public references for evaluating LLM behavior across retrieval scenarios; in our Q1 2026 panel, citation and mention were measurably distinct outputs — a model can recite your brand from training data without retrieving your URL, and a retrieved URL can produce an answer that never names your brand. Track both, and report them separately.
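To make the two metrics concrete, here is a minimal sketch of the per-engine roll-up, assuming each answer has already been labelled for an in-text mention and a URL citation. The record fields and function name are illustrative, not a prescribed schema.

```python
# Minimal sketch: SoV and citation rate per engine for one measurement window.
# Assumes answers are already labelled; field names are assumptions.
from collections import defaultdict

answers = [
    {"engine": "perplexity", "prompt_id": "p001", "mentioned": True,  "cited": True},
    {"engine": "perplexity", "prompt_id": "p002", "mentioned": True,  "cited": False},
    {"engine": "chatgpt",    "prompt_id": "p001", "mentioned": False, "cited": False},
]

def sov_by_engine(rows):
    totals, present, cited = defaultdict(int), defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["engine"]] += 1
        present[r["engine"]] += r["mentioned"] or r["cited"]  # SoV counts any brand presence
        cited[r["engine"]] += r["cited"]                      # citation rate counts URL sources only
    return {
        e: {"sov": present[e] / totals[e], "citation_rate": cited[e] / totals[e], "n": totals[e]}
        for e in totals
    }

print(sov_by_engine(answers))
```

Keeping the two ratios in the same record makes it hard to accidentally report one as the other.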
Sampling: prompts as the new SERP queries
The prompt sample is the single most important design choice. Three properties decide whether the resulting SoV number reflects reality.
- Buyer-relevant. The sample should be questions your real audience actually asks an AI assistant, not the keyword universe a rank tracker would build. Start from sales call transcripts, support tickets, and chat logs. Search Engine Land's coverage of the Seer Interactive AI Overview CTR study shows that question-form queries trigger AI answer surfaces measurably more often than head-term keyword queries — your sample should reflect that.
- Mix of intents. Include category-defining questions ("what is X"), comparison questions ("X vs Y"), and recommendation questions ("best X for Y"). Brands consistently undercount recommendation prompts, which are the ones where citation translates to revenue.
- Stable size. A useful sample sits at 50 to 200 prompts per engine. Below 50 the variance is too high to detect movement; above 200 the labelling cost overwhelms most internal teams.
For most B2B brands we work with, the right starting set is 75 prompts: 25 category questions, 25 comparison questions, and 25 recommendation questions. Hold the set constant for at least eight weeks before adding or rotating.
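If it helps to see the lock written down, here is one way to pin the starting set as a small manifest. The field names, dates, and example prompts are invented for illustration; the point is that the counts and the rotation date are documented, not implied.

```python
# Illustrative manifest for the locked 75-prompt starting set (field names are assumptions).
PROMPT_SAMPLE = {
    "version": "2026-Q1",          # bump only when the set is deliberately rotated
    "locked_until": "2026-03-31",  # hold constant for at least eight weeks
    "intents": {
        "category":       {"count": 25, "example": "what is payment orchestration"},
        "comparison":     {"count": 25, "example": "stripe vs adyen for marketplaces"},
        "recommendation": {"count": 25, "example": "best payments platform for a B2B SaaS"},
    },
}

assert sum(i["count"] for i in PROMPT_SAMPLE["intents"].values()) == 75
```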
Three common pitfalls
Three measurement errors recur in almost every internal SoV dashboard we audit. Each one turns a useful trend line into noise.
Pitfall 1: too narrow a prompt set. A team tracks 10 high-value prompts and reports SoV weekly. The number swings 30 points week to week because two prompts shifted citation, and the team makes content decisions on signal that is mostly variance. Fix: 50-prompt minimum per engine, locked for at least a quarter.
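A quick back-of-envelope calculation shows why the 50-prompt floor matters. Treating SoV as a sampled proportion, the normal-approximation noise band at 10 prompts is roughly the 30-point swing described above. The numbers below are illustrative, not from a real panel.

```python
# Rough 95% range of week-to-week sampling noise for an observed SoV of p over n prompts.
import math

def swing_95(p, n):
    se = math.sqrt(p * (1 - p) / n)
    return 1.96 * se * 100  # in SoV points

for n in (10, 50, 75, 200):
    print(f"n={n:>3}: +/- {swing_95(0.4, n):.0f} points of pure sampling noise")
# n= 10: +/- 30 points of noise: a two-prompt flip looks like a trend
# n= 50: +/- 14 points
# n= 75: +/- 11 points
# n=200: +/-  7 points
```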
Pitfall 2: confusing brand with entity. "Stripe" can appear as a verbal mention of the company, a citation of stripe.com, or a passage retrieved from a third-party article that mentions Stripe in passing. These count as three different things. Fix: label each occurrence with mention (in-text brand name), citation (URL in the source list), or passthrough (third-party page that mentions the brand). Report citation rate as the primary metric; mention rate as the secondary.
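A minimal labelling sketch, assuming you have the answer text, its source list, and the fetched text of each source page in hand. The string matching is deliberately naive, the brand defaults mirror the Stripe example above, and real labelling usually still needs a human pass.

```python
# Sketch of the three-way label from Pitfall 2. All names and defaults are assumptions.
def label_occurrence(answer_text: str, source_urls: list[str], source_texts: dict[str, str],
                     brand_name: str = "Stripe", brand_domain: str = "stripe.com") -> set[str]:
    """source_texts maps each source URL to its fetched page text (fetched separately)."""
    labels = set()
    if brand_name.lower() in answer_text.lower():
        labels.add("mention")      # in-text brand name in the generated answer
    if any(brand_domain in url for url in source_urls):
        labels.add("citation")     # brand-owned URL in the numbered or footnoted source list
    if any(brand_domain not in url and brand_name.lower() in source_texts.get(url, "").lower()
           for url in source_urls):
        labels.add("passthrough")  # third-party source page that mentions the brand
    return labels
```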
Pitfall 3: citation versus mention conflation. A single answer can cite your domain in its footer while talking about a competitor in the body. Counting that as a "win" inflates SoV by 15 to 25 percent in our audits of brand-side dashboards. Fix: require both URL citation and an in-text mention of the brand within the same answer for the "fully cited" tier; track partial cases separately. Our post on how LLMs choose sources explains why these two paths diverge at the retrieval layer.
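Building on the labels above, the tiering rule can be as small as the sketch below. The tier names are assumptions; the design choice is that partial cases stay visible as their own bucket instead of being folded into the win column.

```python
# Sketch of the Pitfall 3 rule: "fully cited" requires both signals in the same answer.
def answer_tier(labels: set[str]) -> str:
    if "citation" in labels and "mention" in labels:
        return "fully_cited"   # URL in the source list AND brand named in the body
    if "citation" in labels or "mention" in labels:
        return "partial"       # one signal without the other: report separately
    return "absent"
```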
A practical 30-day measurement plan
A team with no AI SoV instrumentation can have a defensible weekly number in four weeks. The plan below is the one we walk through with new brands.
Week 1 — Define the sample and the engines. Pull 25 representative category prompts from sales transcripts and support logs. Add 25 comparison prompts ("X vs your category leader") and 25 recommendation prompts. Lock the engine set to ChatGPT, Perplexity, and Gemini for the first quarter; add Copilot and vertical engines later. Document the citation rule (URL footnote + in-text mention).
Week 2 — First baseline run. Run all 75 prompts through each engine manually or via a tool. Capture three artifacts per prompt: the full answer text, the source list, and a timestamp. Label each occurrence as mention, citation, or passthrough. The first run takes 6 to 10 hours of analyst time for 75 prompts on three engines.
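One lightweight way to hold the three artifacts per prompt is a small record type like the sketch below; the field names are assumptions and nothing here depends on a particular tool.

```python
# Illustrative capture record for one prompt run in the baseline week.
import datetime as dt
from dataclasses import dataclass, field

@dataclass
class PromptRun:
    prompt_id: str
    engine: str                                            # "chatgpt" | "perplexity" | "gemini"
    answer_text: str                                       # full answer, captured verbatim
    source_urls: list[str] = field(default_factory=list)   # numbered or footnoted sources
    captured_at: dt.datetime = field(default_factory=lambda: dt.datetime.now(dt.timezone.utc))
    labels: set[str] = field(default_factory=set)          # "mention" / "citation" / "passthrough"
```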
Week 3 — Define the cadence. Decide weekly or biweekly. Weekly catches Perplexity-driven movements (its re-ranker updates fastest, per BrightEdge's AI Search Visits 2025 research report). Biweekly is more sustainable for a one-person operation. Set a fixed weekday, fixed prompt order, and fixed engine order to control day-of-week noise.
Week 4 — First trend point and pitfall audit. Run the sample again. Compute SoV per engine, citation rate per engine, and a blended (size-weighted) number for the executive view. Audit the labels against the three pitfalls above. If any week-over-week swing exceeds 10 points, re-label by hand to confirm it is real.
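The week-4 roll-up and the 10-point sanity check fall straight out of the per-engine numbers. The size weighting and the threshold follow the plan above; the function names and input shapes are assumptions that match the earlier sketch.

```python
# Sketch of the executive roll-up: size-weighted blend plus a re-label flag.
def blended_sov(per_engine: dict[str, dict]) -> float:
    """per_engine maps engine -> {"sov": float, "n": int} for one window."""
    total_n = sum(v["n"] for v in per_engine.values())
    return sum(v["sov"] * v["n"] for v in per_engine.values()) / total_n

def needs_relabel(this_week: float, last_week: float, threshold_points: float = 10.0) -> bool:
    """Inputs are fractions (0-1); flag swings larger than the threshold in SoV points."""
    return abs(this_week - last_week) * 100 > threshold_points
```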
After week 4, the cadence runs itself, and the analyst time settles around 3 to 5 hours per measurement window for 75 prompts.
Tools and what to do without them
You can run AI SoV measurement without any specialized tool. A spreadsheet, the three engines' web interfaces, and a disciplined labelling protocol carry a team through the first quarter. The cost is the 3 to 5 analyst hours per measurement window described above once the workflow is set.
Above 100 prompts and three engines, the manual workload becomes unsustainable. Tooling earns its cost by automating the prompt execution, the citation parsing, and the deduplication of near-duplicate answers across runs. We built Prompt Architect for this exact workflow, but the metric design is what matters most — the right SoV definition is portable across any tool, and the wrong one is wrong in every tool.
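As one example of what that automation covers, near-duplicate detection across runs can start as plain text similarity; the 0.9 threshold and the whitespace normalisation below are assumptions to tune against your own data, not a recommendation from any specific tool.

```python
# Naive near-duplicate check between two captured answers (threshold is an assumption).
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    norm = lambda s: " ".join(s.lower().split())  # collapse whitespace, ignore case
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold
```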
A few non-PA references that publish AI SoV data publicly:
- Similarweb's GenAI tracking covers aggregate AI engine referral share, including its +28.6% YoY visit growth figure.
- BrightEdge's twelve-month AI Overviews analysis publishes category-level visibility shifts and citation-vs-organic overlap data.
- The HTTP Archive Web Almanac is the canonical reference for the structured-data adoption baseline that underpins schema-driven citation lift.
For the framework that connects AI SoV to the broader visibility discipline, see our AEO vs SEO 2026 framework, which positions SoV inside the answer-economy success metric set.
What good looks like
A mature AI SoV practice has three properties. The prompt sample is stable and documented. The metric is reported per engine with a blended view on top, not as a single conflated number. And the team treats Perplexity as a leading indicator, ChatGPT as a confirmatory signal, and Gemini as the lagging indicator that closes the loop.
Brands that miss any of the three end up with a chart that moves a lot, means little, and gets ignored at the next quarterly review. Brands that get all three see the same compounding curve that early SEO teams saw in 2010 to 2013: small, repeatable wins on content format that aggregate into category leadership inside the answer layer.
Related
AEO vs SEO: A 2026 Framework for Brand Visibility
AEO (Answer Engine Optimization) and SEO solve different problems in 2026. This framework maps the seven divergences, four overlaps, and a decision matrix you can apply this quarter.
How ChatGPT, Perplexity, and Gemini Choose Their Sources
ChatGPT, Perplexity, and Gemini retrieve and cite differently. Here is the engine-level breakdown of how each picks sources, and what it changes for content strategy in 2026.