title: An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them
url: https://sparktoro.com/blog/an-anonymous-source-shared-thousands-of-leaked-google-search-api-documents-with-me-everyone-in-seo-should-see-them/
hash_url: 2eba81418d
archive_date: 2024-05-31
og_image: https://sparktoro.com/blog/wp-content/uploads/2024/05/google-api-content-warehouse-leak-conversation-anonymized-1024x604.jpg
description: On Sunday, May 5th, I received an email from a person claiming to have access to a massive leak of API documentation from inside Google’s Search division.
favicon: https://sparktoro.com/favicon.ico
language: en_US
On Sunday, May 5th, I received an email from a person claiming to have access to a massive leak of API documentation from inside Google’s Search division. The email further claimed that these leaked documents were confirmed as authentic by ex-Google employees, and that those ex-employees and others had shared additional, private information about Google’s search operations.
Many of their claims directly contradict public statements made by Googlers over the years, in particular the company’s repeated denials that click-centric user signals are employed, that subdomains are considered separately in rankings, that a sandbox exists for newer websites, that a domain’s age is collected or considered, and more.
Naturally, I was skeptical. The claims made by this source (who asked to remain anonymous) seemed extraordinary–claims like:
And these are only the tip of the iceberg.
Extraordinary claims require extraordinary evidence. And while some of these overlap with information revealed during the Google/DOJ case (some of which you can read about on this thread from 2020), many are novel and suggest insider knowledge.
So, this past Friday, May 24th (following several emails), I had a video call with the anonymous source.
Update (5/28 at 10:00am Pacific): The anonymous source has decided to come forward. This video announces their identity, Erfan Azimi, an SEO practitioner and the founder of EA Eagle Digital.
Prior to the email and call, I had neither met nor heard of Erfan. He asked that his identity remain veiled, and that I merely include the quote below:
An eagle uses the storm to reach unimaginable heights.
– Matshona Dhliwayo
After the call I was able to confirm details of Erfan’s work history, people we both know from the marketing world, and several of his claims about being at particular events with industry insiders (including Googlers), though I cannot confirm details of the meetings nor the content of the discussions he claims to have had.
During our call, Erfan showed me the leak itself: more than 2,500 pages of API documentation containing 14,014 attributes (API features) that appear to come from Google’s internal “Content API Warehouse.” Based on the document’s commit history, this code was uploaded to GitHub on Mar 27, 2024 and not removed until May 7, 2024. (Note: because this piece was, post-publishing, edited to reflect Erfan’s identity, he’s referred to below as “the anonymous source”).
This documentation doesn’t show things like the weight of particular elements in the search ranking algorithm, nor does it prove which elements are used in the ranking systems. But, it does show incredible details about data Google collects. Here’s an example of the document format:
After walking me through a handful of these API modules, the source explained their motivations (around transparency, holding Google to account, etc.) and their hope: that I would publish an article sharing this leak, revealing some of the many interesting pieces of data it contained, and refuting some “lies” Googlers “had been spreading for years.”
A critical next step in the process was verifying the authenticity of the API Content Warehouse documents. So, I reached out to some ex-Googler friends, shared the leaked docs, and asked for their thoughts. Three ex-Googlers wrote back: one said they didn’t feel comfortable looking at or commenting on it. The other two shared the following (off the record and anonymously):
Next, I needed help analyzing and deciphering the naming conventions and more technical aspects of the documentation. I’ve worked with APIs a bit, but it’s been 20 years since I wrote code and 6 years since I practiced SEO professionally. So, I reached out to one of the world’s foremost technical SEOs: Mike King, founder of iPullRank.
During a 40-minute phone call on Friday afternoon, Mike reviewed the leak and confirmed my suspicions: this appears to be a legitimate set of documents from inside Google’s Search division, and contains an extraordinary amount of previously-unconfirmed information about Google’s inner workings.
2,500 technical documents is an unreasonable amount of material to ask one man (a dad, husband, and entrepreneur, no less) to review in a single weekend. But, that didn’t stop Mike from doing his best.
He’s put together an exceptionally detailed initial review of the Google API leak here, which I’ll reference more in the findings below. And he’s also agreed to join us at SparkTogether 2024 in Seattle, WA on Oct. 8, where he’ll present the fully transparent story of this leak in far greater detail, and with the benefit of the next few months of analysis.
Before we go further, a few disclaimers: I no longer work in the SEO field. My knowledge of and experience with SEO is 6+ years out of date. I don’t have the technical expertise or knowledge of Google’s internal operations to analyze an API documentation leak and confirm with certainty whether it’s authentic (hence getting Mike’s help and the input of ex-Googlers).
So why publish on this topic?
Because when I spoke to the party that sent me this information, I found them credible, thoughtful, and deeply knowledgeable. Despite going into the conversation deeply skeptical, I could identify no red flags, nor any malicious motivation. This person’s sole aim appeared quite aligned with my own: to hold Google accountable for public statements that conflict with private conversations and leaked documentation, and to bring greater transparency to the field of search marketing. And they believed that, despite my years removed from SEO, I was the best person to share this publicly.
These are goals I cared about deeply for almost two decades. And while my professional life has moved on (I now run two companies: SparkToro, which makes audience research software and Snackbar Studio, an indie video game developer), my interest in and connections to the world of Search Engine Optimization remain strong. I feel a deep obligation to share information about how the world’s dominant search engine works, especially information Google would prefer to keep quiet. And sadly, I’m not sure where else to send something this potentially groundbreaking.
Years ago, before he left journalism to become Google’s Search Liaison, Danny Sullivan would have been my go-to source for a leak of this magnitude. He had the gravitas, resume, knowledge, and experience to examine a claim like this and present it fairly in the court of public opinion. There have been so many times in the last few years I’ve wished for Danny’s calm, even-handed, tough-but-fair-on-Google approach to newsworthy pieces like this–pieces that could reach as far as the company’s statements on the witness stand (e.g. his eloquent writing on Google’s indefensible privacy claims about organic keyword data).
Whatever Google’s paying him, it isn’t nearly enough.
Apologies that instead of Danny, dear reader, you’re stuck with me. But since you are, I’m going to assume you may not be familiar with my background or credentials, and briefly share those.
OK. Back to the Google leak.
When looking through the massive trove of API documentation, the first reasonable set of questions might be: “What is this? What is it used for? Why does it exist in the first place?”
The leak appears to come from GitHub, and the most credible explanation for its exposure matches what my anonymous source told me on our call: these documents were inadvertently and briefly made public (many links in the documentation point to private GitHub repositories and internal pages on Google’s corporate site that require specific, Google-credentialed logins). During this probably-accidental, public period between March and May of 2024, the API documentation was spread to Hexdocs (which indexes public GitHub repos) and found/circulated by other sources (I’m certain that others have a copy, though it’s odd that I could find no public discourse until now).
According to my ex-Googler sources, documentation like this exists on almost every Google team, explaining various API attributes and modules to help familiarize those working on a project with the data elements available. This leak matches others in public GitHub repositories and on Google’s Cloud API documentation, using the same notation style, formatting, and even process/module/feature names and references.
If that all sounds like a technical mouthful, think of this as instructions for members of Google’s search engine team. It’s like an inventory of books in a library, a card catalogue of sorts, telling those employees who need to know what’s available and how they can get it.
But, whereas libraries are public, Google search is one of the most secretive, closely-guarded black boxes in the world. In the last quarter century, no leak of this magnitude or detail has ever been reported from Google’s search division.
That’s open to interpretation. Google could have retired some of these, used others exclusively for testing or internal projects, or may even have made API features available that were never employed.
However, there are references in the documentation to deprecated features and specific notes on others indicating they should no longer be used. That strongly suggests those not marked with such details were still in active use as of the March 2024 leak.
We also can’t say for certain whether the March leak is of the most recent version of this documentation. The most recent date I can find referenced in the API docs is August of 2023:
The relevant text reads:
“The domain-level display name of the website, such as “Google” for google.com. See go/site-display-name for more details. As of Aug 2023, this field is being deprecated in favor of info.[AlternativeTitlesResponse].site_display_name_response field, which also contains host-level site display names with additional information.”
A reasonable reader would conclude that the documentation was up-to-date as of last summer (references to other changes in 2023 and earlier years, all the way back to 2005, are also present), and possibly even up-to-date as of the March 2024 date of disclosure.
Google search obviously changes massively from year to year, and recent introductions, like their much-maligned AI Overviews, do not make an appearance in this leak. Which of the items mentioned are actively used today in Google’s ranking systems? That’s open to speculation. This trove contains fascinating references, many that will be entirely new to non-Google-search-engineers.
But, I would urge readers not to point to a particular API feature in this leak and say: “SEE! That’s proof Google uses XYZ in their rankings.” It’s not quite proof. It’s a strong indication, stronger than patent applications or public statements from Googlers, but still no guarantee.
That said, it’s as close to a smoking gun as anything since Google’s execs testified in the DOJ trial last year. And, speaking of that testimony, much of it is corroborated and expanded on in the document leak, as Mike details in his post. 👀
I expect that interesting and marketing-applicable insights will be mined from this massive file set for years to come. It’s simply too big and too dense to think that a weekend of browsing could unearth a comprehensive set of takeaways, or even come close.
However, I will share five of the most interesting early discoveries from my perusal: some that shed new light on things Google has long been assumed to be doing, and others that suggest the company’s public statements (especially those on what they “collect”) have been erroneous. Because showing side-by-sides of what Googlers said vs. what these documents insinuate would be tedious and could be perceived as airing personal grievances (given Google’s historic attacks on my work), I won’t bother. Besides, Mike did a great job of that in his post.
Instead, I’ll focus on interesting and/or useful takeaways, and my conclusions from the whole of the modules I’ve been able to review, Mike’s piece on the leak, and how this combines with other things we know to be true of Google.
A handful of modules in the documentation make reference to features like “goodClicks,” “badClicks,” “lastLongestClicks,” impressions, squashed, unsquashed, and unicorn clicks. These are tied to Navboost and Glue, two words that may be familiar to folks who reviewed Google’s DOJ testimony. Here’s a relevant excerpt from DOJ attorney Kenneth Dintzer’s cross-examination of Pandu Nayak, VP of Search on the Search Quality team:
Q. So remind me, is navboost all the way back to 2005?
A. It’s somewhere in that range. It might even be before that.
Q. And it’s been updated. It’s not the same old navboost that it was back then?
A. No.
Q. And another one is glue, right?
A. Glue is just another name for navboost that includes all of the other features on the page.
Q. Right. I was going to get there later, but we can do that now. Navboost does web results, just like we discussed, right?
A. Yes.
Q. And glue does everything else that’s on the page that’s not web results, right?
A. That is correct.
Q. Together they help find the stuff and rank the stuff that ultimately shows up on our SERP?
A. That is true. They’re both signals into that, yes.
A savvy reader of these API documents would find they support Mr. Nayak’s testimony (and align with Google’s patent on site quality):
Google appears to have ways to filter out clicks they don’t want to count in their ranking systems, and include ones they do. They also seem to measure length of clicks (i.e. pogo-sticking – when a searcher clicks a result and then quickly clicks the back button, unsatisfied by the answer they found) and impressions.
Plenty has already been written about Google’s use of click data, so I won’t belabor the point. What matters is that Google has named and described features for that measurement, adding even more evidence to the pile.
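To make the click-bucketing idea concrete, here’s a minimal Python sketch of how click events could be aggregated into counters with these names. The attribute names (goodClicks, badClicks, lastLongestClicks) come from the leaked modules; the dwell-time threshold and the aggregation logic are my assumptions for illustration, not anything the documentation specifies.

```python
# Illustrative sketch only: attribute names are from the leaked modules, but the
# thresholds and logic below are assumptions, not Google's actual implementation.
from dataclasses import dataclass

@dataclass
class Click:
    url: str
    dwell_seconds: float   # time on the clicked page before returning to the results
    was_last_click: bool   # the searcher made no further clicks in this session

def aggregate_click_signals(clicks: list[Click], short_click_cutoff: float = 10.0) -> dict:
    """Bucket raw clicks into Navboost-style counters (hypothetical thresholds/logic)."""
    signals = {"goodClicks": 0, "badClicks": 0, "lastLongestClicks": 0}
    for c in clicks:
        if c.dwell_seconds < short_click_cutoff:
            signals["badClicks"] += 1       # quick return to the SERP ("pogo-sticking")
        else:
            signals["goodClicks"] += 1      # longer engagement counts in the result's favor
    longest = max(clicks, key=lambda c: c.dwell_seconds, default=None)
    if longest is not None and longest.was_last_click:
        signals["lastLongestClicks"] += 1   # the session ended on the longest-dwell result
    return signals

print(aggregate_click_signals([Click("https://example.com/a", 4.0, False),
                               Click("https://example.com/b", 90.0, True)]))
```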
My anonymous source claimed that way back in 2005, Google wanted the full clickstream of billions of Internet users, and with Chrome, they’ve now got it. The API documents suggest Google calculates several types of metrics using Chrome browser views, related to both individual pages and entire domains.
This document, describing the features around how Google creates Sitelinks, is particularly interesting. It showcases a call named topUrl, which is “A list of top urls with highest two_level_score, i.e., chrome_trans_clicks.” My read is that Google likely counts clicks on a site’s pages in Chrome browsers and uses that to determine the most popular/important URLs, which feed into the calculation of which pages to include in the sitelinks feature.
E.g., in the above screenshot from Google’s results, pages like “Pricing,” “Blog,” and “Login” are our most-visited, and Google knows this through their tracking of billions of Chrome users’ clickstreams.
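As an illustration of that read (and only an illustration), here’s a small Python sketch of how per-URL Chrome click counts could drive sitelink selection. The chrome_trans_clicks and topUrl names come from the leak; the selection logic, cutoff, and example numbers below are assumed.

```python
# Hypothetical sketch: chrome_trans_clicks and topUrl are names from the leaked docs,
# but this selection logic is an assumption, not a confirmed Google implementation.
def pick_sitelink_candidates(chrome_clicks_by_url: dict[str, int], max_links: int = 4) -> list[str]:
    """Return a site's most-clicked URLs (per Chrome clickstream counts) as sitelink candidates."""
    ranked = sorted(chrome_clicks_by_url.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _clicks in ranked[:max_links]]

# Made-up example inputs, purely for demonstration:
site_clicks = {
    "https://sparktoro.com/pricing": 12_400,
    "https://sparktoro.com/blog": 9_800,
    "https://sparktoro.com/login": 7_650,
    "https://sparktoro.com/about": 2_100,
    "https://sparktoro.com/careers": 300,
}
print(pick_sitelink_candidates(site_clicks))  # Pricing, Blog, Login, About
```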
A module on “Good Quality Travel Sites” would lead reasonable readers to conclude that a whitelist exists for Google in the travel sector (unclear if this is exclusively for Google’s “Travel” search tab, or web search more broadly). References in several places to flags for “isCovidLocalAuthority” and “isElectionAuthority” further suggests that Google is whitelisting particular domains that are appropriate to show for highly controversial of potentially problematic queries.
For example, following the 2020 US Presidential election, one candidate claimed (without evidence) that the election had been stolen, and encouraged their followers to storm the Capitol and take potentially violent action against lawmakers, i.e. commit an insurrection.
Google would almost certainly be one of the first places people turned to for information about this event, and if their search engine returned propaganda websites that inaccurately portrayed the election evidence, that could directly lead to more contention, violence, or even the end of US democracy. Those of us who want free and fair elections to continue should be very grateful Google’s engineers are employing whitelists in this case.
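For readers who want the mechanics spelled out, here’s a minimal Python sketch of what whitelist gating could look like. The flag names isElectionAuthority and isCovidLocalAuthority appear in the leak; the query-topic mapping and gating logic here are assumptions for illustration.

```python
# Illustrative only: flag names come from the leaked docs; the gating logic is assumed.
def eligible_for_sensitive_query(doc_flags: dict[str, bool], query_topic: str) -> bool:
    """Allow only whitelisted authority domains to surface for certain sensitive topics."""
    if query_topic == "elections":
        return doc_flags.get("isElectionAuthority", False)
    if query_topic == "covid":
        return doc_flags.get("isCovidLocalAuthority", False)
    return True  # non-sensitive topics: no whitelist requirement in this sketch

print(eligible_for_sensitive_query({"isElectionAuthority": True}, "elections"))  # True
print(eligible_for_sensitive_query({}, "elections"))                             # False
```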
Google has long had a quality rating platform called EWOK (Cyrus Shepard, a notable leader in the SEO space, spent several years contributing to this and wrote about it here). We now have evidence that some elements from the quality raters are used in the search systems.
How influential these rater-based signals are, and what precisely they’re used for is unclear to me in an initial read, but I suspect some thoughtful SEO detectives will dig into the leak, learn, and publish more about it. What I find fascinating is that scores and data generated by EWOK’s quality raters may be directly involved in Google’s search system, rather than simply a training set for experiments. Of course, it’s possible these are “just for testing,” but as you browse through the leaked documents, you’ll find that when that’s true, it’s specifically called out in the notes and module details.
This one calls out a “per document relevance rating” sourced from evaluations done via EWOK. There’s no detailed notation, but it’s not much of a logic-leap to imagine how important those human evaluations of websites really are.
This one calls out “Human Ratings (e.g. ratings from EWOK)” and notes that they’re “typically only populated in the evaluation pipelines,” which suggests they may be primarily training data in this module (I’d argue that’s still a hugely important role, and marketers shouldn’t dismiss how important it is that quality raters perceive and rate their websites well).
This one’s fascinating, and comes directly from the anonymous source who first shared the leak. In their words: “Google has three buckets/tiers for classifying their link indexes (low, medium, high quality). Click data is used to determine which link graph index tier a document belongs to. See SourceType here, and TotalClicks here.” In summary:
Once the link becomes “trusted” because it belongs to a higher tier index, it can flow PageRank and anchors, or be filtered/demoted by link spam systems. Links from the low-quality link index won’t hurt a site’s ranking; they are merely ignored.
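Here’s a minimal Python sketch of that tiering description. The three tiers and the role of click data come from the source’s explanation; the thresholds and function names are hypothetical.

```python
# Sketch of the source's description with assumed thresholds: click data decides which
# link-index tier a linking document falls into; only higher tiers pass link value.
def link_index_tier(total_clicks: int) -> str:
    """Map a linking document's click volume to a tier (thresholds are hypothetical)."""
    if total_clicks >= 1_000:
        return "high"
    if total_clicks >= 50:
        return "medium"
    return "low"

def link_passes_value(total_clicks: int) -> bool:
    """Links from the low-quality tier neither help nor hurt; they're simply ignored."""
    return link_index_tier(total_clicks) != "low"

for clicks in (5, 300, 25_000):
    print(clicks, link_index_tier(clicks), link_passes_value(clicks))
```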
If you care strategically about the value of organic search traffic, but don’t have much use for the technical details of how Google works, this section’s for you. It’s my attempt to sum up much of Google’s evolution over the period this leak covers (2005–2023), and I won’t limit myself exclusively to confirmed elements of the leak.
I’m excited to see how practitioners with more recent experience and deeper technical knowledge go about analyzing this leak. I encourage anyone curious to dig into the documentation, attempt to connect it to other public documents, statements, testimony, and ranking experiments, then publish their findings.
Historically, some of the search industry’s loudest voices and most prolific publishers have been happy to uncritically repeat Google’s public statements. They write headlines like “Google says XYZ is true,” rather than “Google Claims XYZ; Evidence Suggests Otherwise.”
Please, do better. If this leak and the DOJ trial can create just one change, I hope this is it.
When those new to the field read Search Engine Roundtable, Search Engine Land, SE Journal, and the many agency blogs and websites that cover the SEO field’s news, they don’t necessarily know how seriously to take Google’s statements. Journalists and authors should not presume that readers are savvy enough to know that dozens or hundreds of past public comments by Google’s official representatives were later proven wrong.
This obligation isn’t just about helping the search industry—it’s about helping the whole world. Google is one of the most powerful, influential forces for the spread of information and commerce on this planet. Only recently have they been held to some account by governments and reporters. The work of journalists and writers in the search marketing field carries weight in the courts of public opinion, in the halls of elected officials, and in the hearts of Google employees, all of whom have the power to change things for the better or ignore them at our collective peril.
Thank you to Mike King for his invaluable help on this document leak story, to Amanda Natividad for editing help, and to the anonymous source who shared this leak with me. I expect that updates to this piece may arrive over the next few days and weeks as it reaches more eyeballs. If you have findings that support or contradict statements I’ve made here, please feel free to share them in the comments below.