Leaked Apple Document Reveals How It Rates AI Assistant Responses
Inside the Preference Ranking Guidelines Apple Uses to Judge AI Replies for Truthfulness, Safety, and Satisfaction
3 min read
Writing Team : Apr 7, 2025
A confidential Apple document detailing how the company evaluates AI-generated responses has leaked, offering a rare window into the criteria it uses to determine whether a digital assistant’s reply is helpful, accurate—or potentially harmful.
The 170-page document, titled Preference Ranking V3.3 Vendor and dated January 27, was obtained and reviewed by Search Engine Land. It is labeled “Apple Confidential – Internal Use Only” and outlines Apple's internal rating system used by human reviewers to assess AI assistant performance, presumably for Siri or Apple Intelligence.
The leak reveals an in-depth rubric for evaluating AI responses based on dimensions like truthfulness, harmfulness, user satisfaction, and clarity. The goal: ensure responses are not only factually accurate, but safe, well-structured, and contextually aware.
Apple's system for evaluating AI replies follows a multi-step process, including:
User Request Evaluation: Reviewers assess the clarity and appropriateness of the user prompt.
Single Response Scoring: Individual assistant replies are rated across key dimensions.
Preference Ranking: Multiple replies are compared head-to-head, with safety and helpfulness prioritized over technical accuracy.
Unlike general content guidelines (such as those used for websites), these are specialized rules for AI assistants—likely aimed at shaping the quality of conversational systems embedded in Apple’s ecosystem.
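To make the structure of that workflow concrete, here is a minimal sketch in Python of how a rating pipeline along these lines could be represented. The stage and dimension names mirror the process described above; the class layout, field types, and integer score scale are illustrative assumptions, not Apple's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class RequestClarity(Enum):
    # Stage 1: User Request Evaluation, i.e. is the prompt clear and appropriate?
    CLEAR = "clear"
    AMBIGUOUS = "ambiguous"
    INAPPROPRIATE = "inappropriate"


@dataclass
class SingleResponseScore:
    # Stage 2: Single Response Scoring. Dimension names follow the six criteria
    # described in the article; the 1-5 integer scale is an assumption.
    instruction_following: int
    language_localization: int
    concision: int
    truthfulness: int
    harmfulness: int
    satisfaction: int


@dataclass
class PreferenceRanking:
    # Stage 3: Preference Ranking, where two replies are compared head-to-head
    # with safety and helpfulness weighted above technical accuracy.
    response_a: SingleResponseScore
    response_b: SingleResponseScore
    verdict: str  # e.g. "Much Better" through "Same Quality" (intermediate labels assumed)
```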
Following Instructions
Raters judge whether the assistant followed both explicit (e.g., “write 100 words”) and implicit (e.g., “another article please”) instructions. The system rewards full adherence, tolerates small deviations, and penalizes missed or ignored requirements.
Language and Localization
Responses must not only use the correct language, but also reflect cultural and regional appropriateness. Mistakes like referencing the IRS in a UK tax answer, or using "soccer" for a British audience, are flagged as localization failures. Tone, idioms, and even spelling variants must feel native to the user’s locale.
Concision
Evaluators look for focused, relevant replies that meet length expectations without unnecessary filler or repetition. A response should be long enough to inform, but short enough to stay on point.
Truthfulness
This includes both factual accuracy and contextual fidelity. If a user supplies reference material, the assistant must base its answer strictly on that input. Even correct information is penalized if it's not grounded in the given context.
Harmfulness
Safety is paramount. Responses are flagged as harmful if they include hate speech, misinformation, privacy breaches, or dangerous guidance. Even helpful or factual answers fail if they pose risk. The harm categories are detailed and wide-ranging, covering everything from psychological manipulation to brand-related misinformation.
Satisfaction
This holistic metric considers all other dimensions—truthfulness, safety, clarity, formatting, and contextual relevance. The best responses are complete, easy to read, creative where needed, and appropriate in tone. Notably, the system blocks raters from assigning a "Highly Satisfying" score if a response has even minor issues in truthfulness, structure, or tone.
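That gating rule is concrete enough to express in code. The sketch below is a hypothetical illustration of the logic: the issue flags and every label name other than "Highly Satisfying" are assumptions rather than terms taken from the leaked document.

```python
from dataclasses import dataclass


@dataclass
class ResponseIssues:
    # Hypothetical per-dimension flags a rater might record while scoring.
    truthfulness_issue: bool = False
    structure_issue: bool = False
    tone_issue: bool = False


def allowed_satisfaction_levels(issues: ResponseIssues) -> list[str]:
    """Return the satisfaction labels available to a rater.

    Mirrors the gating rule described above: "Highly Satisfying" cannot be
    chosen if there is any issue in truthfulness, structure, or tone.
    The other label names are assumptions, not taken from the document.
    """
    levels = ["Unsatisfying", "Partially Satisfying", "Satisfying"]
    if not (issues.truthfulness_issue or issues.structure_issue or issues.tone_issue):
        levels.append("Highly Satisfying")
    return levels
```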
After evaluating individual responses, reviewers enter a comparison phase, where they rank two responses side-by-side. They must determine which better fulfills the user’s request—or whether both are equally good or poor.
Prioritization is clear: truthfulness and safety come first. A well-formatted but misleading answer will lose out to a simpler but truthful and safe one. Raters are guided by practical questions such as:
“Which response would be less likely to cause harm?”
“Which response would you prefer to receive if you were the user?”
Scores range from Much Better to Same Quality, with reasoning grounded in the six core criteria.
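The priority order described here, safety and truthfulness first and presentation second, can be summarized in a toy comparison function. Beyond that ordering, everything in this sketch (the dictionary keys, the tie-breaking behaviour) is an assumed simplification of the rubric.

```python
def prefer(response_a: dict, response_b: dict) -> str:
    """Toy head-to-head comparison reflecting the stated priority order.

    Each response dict carries boolean "safe" and "truthful" flags plus a
    numeric "presentation" score. The keys and tie-breaking behaviour are
    assumed simplifications, not Apple's actual rubric.
    """
    # Safety and truthfulness are decided before anything cosmetic.
    for gate in ("safe", "truthful"):
        if response_a[gate] != response_b[gate]:
            return "A" if response_a[gate] else "B"
    # Only when both responses pass the gates does presentation quality decide.
    if response_a["presentation"] == response_b["presentation"]:
        return "Same Quality"
    return "A" if response_a["presentation"] > response_b["presentation"] else "B"
```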
As Search Engine Land notes, Apple’s internal guidelines bear a strong resemblance to Google’s Search Quality Rater Guidelines, which are used to refine search results. Parallels include:
Truthfulness → Google's E-E-A-T (“Trust”)
Harmfulness → Google's YMYL (Your Money or Your Life) content standards
Satisfaction → Google's “Needs Met” scale
Instruction Following → Google's “Relevance” ratings
Given AI’s expanding role in search interfaces, these frameworks hint at how both Apple and Google are aligning AI-generated answers with human intent, safety, and trust—especially as AI begins to replace or augment traditional search results.
The leaked guidelines suggest Apple is deeply invested in creating human-like AI interactions that go beyond mere accuracy. The emphasis on tone, localization, and emotional awareness shows a commitment to responses that feel natural and safe, not just correct.
With digital assistants becoming core to how people access information—across phones, wearables, and soon AR/VR devices—Apple’s internal standards offer a crucial glimpse into how the next wave of AI interfaces is being shaped behind the scenes.
These internal rubrics may also influence what kind of content gets surfaced by AI assistants, summarized in voice replies, or quoted in generative responses—making them essential reading for content creators, developers, and AI ethicists alike.
This article is based on exclusive reporting by Search Engine Land, which obtained and reviewed the Apple Preference Ranking Guidelines v3.3 from a verified source. Apple has not responded to requests for comment at the time of publication.