
Google's search engine fully revealed! Nearly 100 documents leaked, and a blogger spends weeks reverse-engineering them

2024-08-23


New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】Following the document leak in May, Google's search engine has been turned inside out once again. Not only did DeepMind publish a paper explaining the mechanics of the Vizier system, but blogger Mario Fischer also conducted a thorough investigation and analysis of nearly 100 documents, reconstructing the full picture of this Internet behemoth for us.

The papers published by Google have begun to reveal the true colors of its own technology.

In a recent paper published by DeepMind senior research scientist Xingyou (Richard) Song and others, the algorithmic secrets behind Google's Vizier service are explained.

As a black-box optimizer that has run millions of times, Vizier has helped optimize many studies and systems within Google. At the same time, Google Cloud and Vertex have also launched Vizier services to help researchers and developers perform hyperparameter adjustment or black-box optimization.

Song said that compared with other industry baselines such as Ax/BoTorch, HEBO, Optuna, HyperOpt, SkOpt, etc., Vizier has more robust performance in many user scenarios, such as high dimensions, batch queries, and multi-objective problems.

Taking advantage of the release of the paper, Google veteran Jeff Dean also tweeted to praise the Vizier system.

The open source version of Vizier he mentioned is already hosted on the GitHub repository, has very detailed documentation, and has been continuously maintained and updated recently.

Repository address: https://github.com/google/vizier

OSS Vizier's distributed client-server system
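For readers who want to try it, the repository's README documents a simple client workflow. Below is a minimal sketch along those lines; the search space, the toy objective, and the study names are illustrative, and the exact API surface may differ between OSS Vizier versions.

```python
from vizier.service import clients
from vizier.service import pyvizier as vz

# Define the search space and the metric to optimize (illustrative parameters).
problem = vz.ProblemStatement()
problem.search_space.root.add_float_param('learning_rate', 1e-4, 1e-1)
problem.search_space.root.add_float_param('dropout', 0.0, 0.5)
problem.metric_information.append(
    vz.MetricInformation(name='accuracy', goal=vz.ObjectiveMetricGoal.MAXIMIZE))

study_config = vz.StudyConfig.from_problem(problem)
study_config.algorithm = 'DEFAULT'  # Vizier's default Bayesian optimizer.

study = clients.Study.from_study_config(
    study_config, owner='demo', study_id='toy_tuning')


def train_and_evaluate(learning_rate: float, dropout: float) -> float:
    """Stand-in for a real training run; returns a toy 'accuracy'."""
    return 1.0 - (learning_rate - 0.01) ** 2 - (dropout - 0.2) ** 2


# Classic suggest/complete loop: ask Vizier for trials, report results back.
for _ in range(20):
    for suggestion in study.suggest(count=1):
        params = suggestion.parameters
        acc = train_and_evaluate(params['learning_rate'], params['dropout'])
        suggestion.complete(vz.Measurement(metrics={'accuracy': acc}))
```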

Although Google Research published a paper discussing the entire Vizier system as early as 2017, the content was far less detailed than the latest one.

This technical report contains the results of a large amount of research work and user feedback. While describing the implementation details and design choices of the open source Vizier algorithm, experiments on standardized benchmarks demonstrate the robustness and versatility of Vizier in a variety of practical modes.

Among them, the lessons learned from iterating on the Vizier system are presented one by one, which is of great reference value to both academia and industry.

The core components of the Bayesian algorithm used by the Vizier system

The main contributions of this article are as follows:

- Formalized the default algorithm for the current version of Vizier and explained its functionality, design choices, and lessons learned throughout the iteration process

- Provides open source Python and JAX framework implementations based on the original C++ implementation

- Tested on industry-standard benchmarks, demonstrating Vizier’s robustness in high-dimensional, classification, batch, and multi-objective optimization modes

- Perform ablation experiments on the unconventional design choice of a zeroth-order evolutionary acquisition optimizer, demonstrating and discussing its key advantages

Incidentally, the first two authors of the paper both go by the name Richard.

Xingyou (Richard) Song worked as a researcher on reinforcement learning generalization at OpenAI. He joined Google Brain as a senior research scientist in 2019 and has been a senior research scientist at DeepMind since 2023, working on GenAI.

Qiuyi (Richard) Zhang currently works on the DeepMind Vizier team and is also the co-creator of the open source version of Vizier. His research focuses on hyperparameter optimization, Bayesian calibration, and theoretical machine learning. He also has interests in AI alignment, counterfactuals/fairness, etc.

In 2014, Zhang graduated cum laude with a bachelor's degree from Princeton University, and he later received his Ph.D. in applied mathematics and computer science from the University of California, Berkeley.

Search Engine Mechanisms

As an absolute industry giant, many of Google's undisclosed core technologies have long aroused the curiosity of the outside world, such as search engines.

With a market share of over 90% for more than a decade, Google Search has become perhaps the most influential system on the entire Internet. It determines the survival of websites and the presentation of online content.

But the specific details of how Google ranks websites have always been a "black box."

Unlike products such as Vizier, the search engine is both Google's cash cow and its signature technology, so an official paper disclosing its inner workings is out of the question.

Although the media, researchers, and search engine optimization practitioners have speculated extensively, they have largely been groping in the dark.

The verdict of the protracted Google antitrust lawsuit was recently announced, and prosecutors at all levels in the United States collected about 5 million pages of documents and turned them into public evidence in court.

However, leaked internal Google documents and public documents from antitrust hearings, among others, don’t really tell us much about how rankings actually work.

Moreover, because of the use of machine learning, the structure of organic search results is so complex that even Google employees involved in developing the ranking algorithm say they do not fully understand the interplay of the many signal weights that explains why a particular result ranks first rather than second.

On May 27, an anonymous source (later confirmed to be Erfan Azimi, a senior practitioner in the SEO industry) provided SparkToro CEO Rand Fishkin with a 2,500-page leaked Google Search API document, revealing detailed information about Google's search engine internal ranking algorithm.

But that’s not all.

Search Engine Land, a news site that specializes in covering the search engine industry, also recently published a blog that reverse-engineered thousands of leaked Google court documents to reveal for the first time the core technical principles of Google's web search rankings.

This blog post is the result of the original author repeatedly reviewing, analyzing, structuring, discarding, and reorganizing nearly 100 documents over several weeks. Although it is not necessarily strictly accurate or complete, it is arguably the most comprehensive and detailed public account of Google's search engine.

The author's simplified structural diagram is as follows:

There is no doubt that Google's search engine is a huge and complex system. From the crawler system and the Alexandria repository, through the coarse-ranking system Mustang and the filtering and fine-ranking system Superroot, to GWS, which renders the final page, every stage affects how a web page is ultimately presented and exposed.

New documents: waiting for a visit from Googlebot

When a new website is published, it will not be indexed by Google immediately. How does Google collect and update web page information?

The first step is crawling and data collection. Google first needs to learn that the URL exists; updating a sitemap or placing a link to the URL on an already-known page allows Google to crawl the new site.

In particular, links on frequently visited pages will attract Google's attention more quickly.

The crawler system fetches new content and keeps track of when to revisit URLs to check for site updates, which is managed by a component called the scheduler.

The storage server then decides whether to forward the URL or to put it in a sandbox.

Google has previously denied the existence of the sandbox, but recent leaks indicate that (suspected) spam and low-value sites are also put into the sandbox, and Google apparently forwards some spam sites, probably for further content analysis and algorithm training.

The image link is then transferred to ImageBot for subsequent search calls, sometimes with delays. ImageBot has a classification function that can place identical or similar pictures in an image container.

The crawler system seems to use its own PageRank to adjust the frequency of information crawling. If a website has more traffic, the crawling frequency will increase (ClientTrafficFraction).
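None of these internal components have public APIs, so the following is purely a hypothetical sketch of the idea just described: revisit frequency scaled up by a page's popularity and by a ClientTrafficFraction-style traffic share. All names and constants are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    url: str
    pagerank: float          # crawler-side popularity estimate, 0..1 (invented scale)
    traffic_fraction: float  # share of observed client traffic, 0..1 (invented scale)


def revisit_interval_hours(rec: UrlRecord,
                           base_hours: float = 168.0,
                           min_hours: float = 1.0) -> float:
    """Hypothetical scheduler rule: more popular pages are revisited more often."""
    boost = 1.0 + 10.0 * rec.pagerank + 20.0 * rec.traffic_fraction
    return max(min_hours, base_hours / boost)


print(revisit_interval_hours(UrlRecord('https://example.com/', 0.8, 0.3)))      # crawled often
print(revisit_interval_hours(UrlRecord('https://example.com/old', 0.01, 0.0)))  # crawled rarely
```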

Alexandria: Google indexing system

Google's indexing system is called Alexandria, and it assigns a unique DocID to each piece of web content. If the content is a duplicate, a new DocID is not created; instead, the URL is linked to the existing DocID.

Google makes a clear distinction between URLs and documents: a document can consist of multiple URLs containing similar content, including versions in different languages, all of which are called by the same DocID.

If there is duplicate content on different domains, Google will choose to display the canonical version in the search ranking. This also explains why other URLs may sometimes have similar rankings. Moreover, the so-called "canonical" version of the URL is not a one-time deal, but will change over time.

Alexandria collects the URLs belonging to a document
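To make the URL-versus-document distinction concrete, here is a hypothetical sketch (all identifiers invented): duplicate content maps to a single DocID, every URL carrying that content is attached to it, and the "canonical" choice can change as URLs come and go.

```python
import hashlib


class DocIndex:
    """Hypothetical sketch of the URL-vs-document distinction described above."""

    def __init__(self):
        self.docid_by_fingerprint = {}  # content fingerprint -> DocID
        self.urls_by_docid = {}         # DocID -> URLs sharing that content
        self.canonical_by_docid = {}    # DocID -> currently chosen canonical URL

    def add(self, url: str, content: str) -> str:
        fingerprint = hashlib.sha256(content.encode()).hexdigest()
        docid = self.docid_by_fingerprint.get(fingerprint)
        if docid is None:
            # New content: mint a new DocID.
            docid = f"doc_{len(self.docid_by_fingerprint):08d}"
            self.docid_by_fingerprint[fingerprint] = docid
            self.urls_by_docid[docid] = []
        # Duplicate content: just attach the URL to the existing DocID.
        self.urls_by_docid[docid].append(url)
        # The canonical pick is not fixed forever; here we naively prefer the shortest URL.
        self.canonical_by_docid[docid] = min(self.urls_by_docid[docid], key=len)
        return docid


index = DocIndex()
a = index.add("https://example.com/pencil", "all about pencils")
b = index.add("https://example.org/pencil-copy", "all about pencils")
assert a == b  # duplicate content shares one DocID
print(index.canonical_by_docid[a])
```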

The author's example document exists online in only one version, so the system gives it its own DocID.

With the DocID, keywords are searched for each part of the document and aggregated into the search index. The "hit list" aggregates keywords that appear multiple times on each page and is first sent to the direct index.

Taking the author's web page as an example, since the word "pencil" appears many times in it, the DocID is listed under the "pencil" entry in the word index.

The algorithm calculates the IR (Information Retrieval) score of the word "pencil" in the document based on various text features and assigns it to a DocID, which is later used in the Posting List.

For example, the word "pencil" in the document is bolded and included in the first-level heading (stored in AvrTermWeight), which are signals that increase the IR score.
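The exact text features and weights are unknown, but the mechanism described above can be sketched hypothetically: each term's hits in a document produce an IR score, and DocIDs are stored under the term in a score-sorted posting list. The scoring formula below is invented for illustration.

```python
from collections import defaultdict


def ir_score(term_count: int, in_h1: bool, bold_count: int) -> float:
    """Invented scoring rule: heading and bold occurrences raise the IR score."""
    score = float(term_count)
    if in_h1:
        score += 5.0          # occurrence in the first-level heading boosts the score
    score += 0.5 * bold_count  # bolded occurrences boost it a little
    return score


posting_list = defaultdict(list)  # term -> [(DocID, IR score), ...], best first


def index_document(docid: str, hits: dict) -> None:
    """hits: term -> (count, appears_in_h1, bold_count) extracted from the page."""
    for term, (count, in_h1, bold) in hits.items():
        posting_list[term].append((docid, ir_score(count, in_h1, bold)))
        posting_list[term].sort(key=lambda entry: entry[1], reverse=True)


index_document("doc_00000001", {"pencil": (14, True, 3)})
index_document("doc_00000002", {"pencil": (2, False, 0)})
print(posting_list["pencil"])  # the heavily "pencil"-focused page ranks first
```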

Google will move important documents to HiveMind, its main memory system, while using fast SSDs and traditional HDDs (called TeraGoogle) for long-term storage of information that doesn't need to be accessed quickly.

It’s worth noting that experts estimate that before the recent AI boom, Google controlled about half of the world’s web servers.

A massive network of interconnected clusters allows millions of main memory units to work together, and a Google engineer once pointed out at a conference that, in theory, Google's main memory could store the entire web.

Interestingly, links to important documents stored in HiveMind and their backlinks seem to have higher weight, while URL links in HDD (TeraGoogle) may have lower weight or may not even be considered.

Additional information and signals for each DocID are dynamically stored in PerDocData, a repository that holds the 20 most recent versions of each document (via CrawlerChangerateURLHistory), which is accessed by many systems when adjusting relevance.

Furthermore, Google has the ability to evaluate different versions over time. If you want to completely change the content or subject of a document, you would theoretically need to create 20 interim versions to completely overwrite the old version.

This is why reviving an expired domain (one that was once active but has since been abandoned or sold due to bankruptcy or other reasons) will not preserve the ranking benefits of the original domain.

If a domain's Admin-C and its subject content change at the same time, the machine can easily identify this.

At this time, Google will set all signals to zero, and the old domain name that once had traffic value will no longer provide any advantages, and is no different from a newly registered domain name. Taking over an old domain name does not mean taking over the original traffic and ranking.
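A hypothetical sketch of the version-history idea (structure and names invented): PerDocData-style storage keeps only the 20 most recent versions per DocID, so old signals only stop mattering once every stored version reflects the new content.

```python
from collections import deque


class PerDocDataSketch:
    """Invented stand-in for the per-document store described above."""
    MAX_VERSIONS = 20

    def __init__(self):
        self.history = {}  # DocID -> deque of (timestamp, content_fingerprint)

    def record_version(self, docid: str, timestamp: int, fingerprint: str) -> None:
        versions = self.history.setdefault(docid, deque(maxlen=self.MAX_VERSIONS))
        versions.append((timestamp, fingerprint))  # oldest version falls out automatically

    def fully_overwritten(self, docid: str, new_fingerprint: str) -> bool:
        """True once all stored versions show the new content - the '20 interim
        versions' the article says are needed before the old content stops mattering."""
        versions = self.history.get(docid, ())
        return len(versions) == self.MAX_VERSIONS and all(
            fp == new_fingerprint for _, fp in versions)


store = PerDocDataSketch()
for t in range(25):
    store.record_version("doc_00000001", t, "new-topic")
print(store.fully_overwritten("doc_00000001", "new-topic"))  # True after 20+ versions
```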

In addition to leaks, evidence documents from U.S. judicial hearings and trials against Google are also useful sources of research, even including internal emails.

QBST: Someone is searching for "pencil"

When someone types the search term "pencil" into Google, QBST (Query Based Salient Terms) starts working.

QBST is responsible for analyzing the search terms entered by the user, assigning different weights to the terms contained therein according to their importance and relevance, and querying the relevant DocIDs respectively.

The vocabulary weighting process is quite complex and involves systems such as RankBrain, DeepRank (formerly BERT), and RankEmbeddedBERT.

QBST is important for SEO because it affects how Google ranks your search results, which in turn affects how much traffic and visibility your website can get.

QBST will rank a website higher if it contains the most common terms that match the user query.

After QBST, related words such as "pencil" will be passed to Ascorer for further processing.
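What the weighting models actually compute is not public; the sketch below only illustrates the flow in a hypothetical way: query terms receive weights, and candidate DocIDs are pulled from the posting list per weighted term. The toy salience rule stands in for learned models such as RankBrain or DeepRank.

```python
STOPWORDS = {"a", "an", "the", "for", "of", "to"}


def weight_query_terms(query: str) -> dict:
    """Invented salience rule: longer (typically rarer) terms get slightly more weight."""
    terms = [t.lower() for t in query.split() if t.lower() not in STOPWORDS]
    return {t: 1.0 + 0.1 * len(t) for t in terms}


def candidate_scores(query: str, posting_list: dict) -> dict:
    """Combine per-term IR scores from the posting list, weighted by term salience."""
    scores: dict = {}
    for term, weight in weight_query_terms(query).items():
        for docid, ir in posting_list.get(term, []):
            scores[docid] = scores.get(docid, 0.0) + weight * ir
    return scores


print(weight_query_terms("buy a pencil for drawing"))
# {'buy': 1.3, 'pencil': 1.6, 'drawing': 1.7}
```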

Ascorer: Creating the “Green Ring”

Ascorer extracts the top 1,000 DocIDs under the "pencil" entry from the inverted index (i.e., the word index) and ranks them by IR score.

According to internal documents, this list is called the "green ring." In the industry, this is called the posting list.

In our pencil example, the author's document sits at position 132 in the posting list. If no other system intervened, that is where it would stay.

Superroot: "Ten out of a thousand"

Superroot is responsible for re-ranking the 1,000 candidate web pages just selected by Mustang, reducing the "green ring" of 1,000 DocIDs to the "blue ring" of 10 results.

This task is performed mainly by Twiddlers and NavBoost; other systems may also be involved, but the available information is too imprecise to say for sure.

Mustang generates 1000 potential results, Superroot filters them down to 10

Twiddlers: Layers of Filtering

Various documents indicate that Google uses hundreds of Twiddler systems, which we can think of as similar to the filters in WordPress plugins.

Each Twiddler has its own specific filtering target and can adjust the IR score or ranking.

The system was designed this way because Twiddlers are relatively easy to create and do not require modifying the complex ranking algorithms in Ascorer.

Modification of the ranking algorithm is very challenging because of the potential side effects involved, requiring extensive planning and programming. In contrast, multiple Twiddlers operate in parallel or sequentially, unaware of the activities of other Twiddlers.

Twiddler can be divided into two types:

- PreDoc Twiddlers can handle sets of several hundred DocIDs, as they require little additional information;

- By contrast, "Lazy" Twiddlers require more information, such as data from the PerDocData repository, and therefore take longer and are more complex to run.

Therefore, the PreDoc Twiddlers receive the posting list first and prune its entries, and only then do the slower "Lazy" filters run. Combining the two greatly saves computing power and time.
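As a hypothetical illustration of this two-stage pipeline (both filters and all names are invented), a cheap PreDoc-style Twiddler can shrink the candidate set before a more expensive "Lazy" Twiddler re-scores the survivors:

```python
def predoc_dedup_hosts(results, max_per_host=2):
    """Cheap PreDoc-style filter: needs no extra data, just caps results per host."""
    seen, kept = {}, []
    for docid, score, host in results:
        if seen.get(host, 0) < max_per_host:
            kept.append((docid, score, host))
            seen[host] = seen.get(host, 0) + 1
    return kept


def lazy_official_boost(results, official_docids, factor=1.5):
    """Expensive 'Lazy'-style filter: would need PerDocData-like lookups to know which
    documents count as official sources (cf. queriesForWhichOfficial, mentioned later)."""
    return [(d, s * factor if d in official_docids else s, h) for d, s, h in results]


def run_twiddlers(results, official_docids):
    results = predoc_dedup_hosts(results)                     # fast stage first
    results = lazy_official_boost(results, official_docids)   # slow stage on the survivors
    return sorted(results, key=lambda r: r[1], reverse=True)[:10]


candidates = [("doc_a", 9.0, "example.com"), ("doc_b", 8.5, "example.com"),
              ("doc_c", 8.0, "example.com"), ("doc_d", 7.0, "health.gov")]
print(run_twiddlers(candidates, official_docids={"doc_d"}))
```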

More than 100 Twiddlers of these two types are responsible for reducing the number of potential search results and re-ranking them.

Twiddlers serve a wide variety of purposes. Developers can test new filters, multipliers, or specific position restrictions, and can even manipulate rankings precisely enough to place one specific result in front of or behind another.

A leaked internal Google document shows that certain Twiddler features should only be used by experts in consultation with the core search team.

If you think you know how Twiddlers work, trust us: you don't. We're not sure we do either.

There are also Twiddlers that are used just to create comments, and add those comments to the DocID.

Why did your national health authority always rank first in COVID-19 searches during the pandemic?

That's because a Twiddler uses queriesForWhichOfficial to promote official resources precisely by language and region.

While developers cannot control the results of Twiddler's reranking, understanding its mechanism can better explain ranking fluctuations and those "unexplained rankings."

Quality Raters and RankLab

Thousands of quality raters around the world are responsible for evaluating search results for Google and testing new algorithms or filters before they go live.

Google says their ratings are for reference only and do not directly affect rankings.

This is essentially true, but their ratings and judgments do have a large indirect impact on rankings.

Assessors typically take the assessment on a mobile device, receive a URL or search phrase from the system, and answer preset questions.

For example, they are asked: "Is it clear who wrote the content and how it was produced? Does the author have expertise in the topic?"

The answers are stored and used to train machine learning algorithms, allowing them to better identify high-quality, trustworthy pages and less reliable ones.

In other words, the results provided by human evaluators became an important criterion for deep learning algorithms, while the ranking criteria created by the Google search team were less important.

Think about it: what kind of web page would a human evaluator find credible?

If a page includes a photo of the author, full name, and LinkedIn link, it generally appears to be credible, whereas pages lacking these features are judged as less credible.

The neural network will then identify this feature as a key factor, and after at least 30 days of active testing runs, the model may start automatically using this feature as a ranking criterion.

Thus, a page with an author photo, full name, and LinkedIn link might receive a ranking boost through the Twiddler mechanism, while a page lacking these features would see a ranking drop.
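A toy, purely hypothetical sketch of that training loop: rater judgments become labels, page features become inputs, and the learned weights could later drive a Twiddler-style boost or demotion. The features and the tiny perceptron-style trainer are illustrative only and stand in for Google's real ML pipeline.

```python
TRAINING = [
    # (has_author_photo, has_full_name, has_linkedin) -> rater judged trustworthy?
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((0, 0, 0), 0),
    ((0, 1, 0), 0),
]


def train_weights(data, lr=0.1, epochs=200):
    """Tiny perceptron-style trainer; a stand-in for the real learning system."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in data:
            pred = 1 if b + sum(wi * xi for wi, xi in zip(w, features)) > 0 else 0
            err = label - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, features)]
            b += lr * err
    return w, b


weights, bias = train_weights(TRAINING)
print(weights, bias)  # learned feature weights a Twiddler-style boost could reuse
```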

In addition, according to information leaked from Google, the isAuthor attribute and the AuthorVectors attribute (a kind of "author fingerprint") allow the system to identify and distinguish an author's unique vocabulary and expressions (i.e., personal language characteristics).

The raters' comments are aggregated into an "Information Satisfaction" (IS) score. Although many raters participate, IS scores are only available for a small number of URLs.

Google points out that many documents that are not clicked may also be important. When the system cannot make an inference, the document is automatically sent to the evaluator and a score is generated.

The mention of “gold” in the evaluator-related terminology suggests that there may be a “gold standard” for some documents and that meeting the expectations of human evaluators may help a document reach that “gold” standard.

Additionally, one or more Twiddler systems may push a "gold standard" DocID into the top ten.

Quality raters are typically not full-time Google employees; they are hired through external outsourcing firms.

In contrast, Google's own experts work in the RankLab lab, conducting experiments, developing new Twiddlers, and evaluating and improving them to see whether Twiddler improves the quality of results or simply filters out spam.

Twiddlers that prove effective are then integrated into the Mustang system, whose algorithms are complex, interconnected, and computationally intensive.

NavBoost: What do users like?

In Superroot, another core system, NavBoost, also plays an important role in ranking search results.

NavBoost is mainly used to collect data on users' interactions with search results, especially their clicks on the results for different queries.

Although Google officially denies using user click data for ranking, an internal email disclosed by the Federal Trade Commission (FTC) indicates that the processing of click data must be kept confidential.

Google's denial involves two reasons.

First, from the user's perspective, Google, as a search platform, monitors users' online activities all the time, which will cause media anger over privacy issues.

But from Google's perspective, the purpose of using click data is to obtain statistically significant data indicators, not to monitor individual users.

FTC documents confirm that click data will influence rankings, and frequently mention the NavBoost system (mentioned 54 times at the hearing on April 18, 2023), as evidenced by an official hearing in 2012.

An official document from as early as August 2012 makes clear that click data affects rankings

Various user behaviors on the search results page, including searches, clicks, repeated searches and repeated clicks, as well as website or web page traffic, will affect rankings.

Concerns about user privacy are just one reason. Another concern is that measuring click data and traffic could encourage spammers and scammers to use bots to fake traffic and manipulate rankings.

Google also has ways to counter this situation, such as distinguishing user clicks into bad clicks and good clicks through multi-faceted evaluations.

The indicators used include the time spent on the target page, the time period during which the web page is viewed, the starting page of the search, the record of the most recent "good click" in the user's search history, etc.

For each ranking in the search results pages (SERPs), there is an average expected click-through rate (CTR) as a baseline.

For example, according to an analysis by Johannes Beus at this year’s CAMPIXX conference in Berlin, the first organic search result received an average of 26.2% of clicks, and the second position received 15.5%.

If a CTR is significantly lower than the expected rate, the NavBoost system will note the difference and adjust the DocID's ranking accordingly.

If the expected_CTR deviates significantly from the actual value, the ranking is adjusted accordingly.
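The adjustment rule itself is unknown; the sketch below is a hypothetical illustration that uses the position-1 and position-2 averages cited above (26.2% and 15.5%) as the expected-CTR baseline, with the remaining values and the clamping invented for the example.

```python
# Expected CTR per SERP position. Positions 1 and 2 use the averages cited in
# the article; the rest are illustrative placeholders.
EXPECTED_CTR = {1: 0.262, 2: 0.155, 3: 0.10, 4: 0.08, 5: 0.06,
                6: 0.05, 7: 0.04, 8: 0.035, 9: 0.03, 10: 0.025}


def navboost_adjustment(position: int, clicks: int, impressions: int) -> float:
    """Hypothetical rule: return a multiplicative score adjustment; >1 promotes, <1 demotes."""
    actual_ctr = clicks / impressions
    ratio = actual_ctr / EXPECTED_CTR[position]
    # Clamp so a single query cannot swing the ranking too far.
    return max(0.5, min(1.5, ratio))


# A result at position 2 earning only a 5% CTR gets demoted...
print(navboost_adjustment(position=2, clicks=50, impressions=1000))   # 0.5
# ...while one that clearly over-performs gets promoted.
print(navboost_adjustment(position=2, clicks=250, impressions=1000))  # 1.5
```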

The user's click volume basically represents the user's opinion on the relevance of the results, including the title, description and domain name.

SEO experts and data analysts report that when they monitor click-through rates across the board, they notice the following:

If a document ranks in the top 10 for a search query and the CTR is significantly lower than expected, you can expect to see a drop in ranking within a few days (depending on search volume).

Conversely, if the CTR is much higher than expected for that position, the ranking will usually rise. If the CTR is poor, the site needs to quickly adjust and optimize its title and description to attract more clicks.

Calculating and updating PageRank is time-consuming and computationally intensive, which is why the PageRank_NS metric is used. NS stands for "Nearest Seed," a group of related pages that share a PageRank value that is applied to new pages either temporarily or permanently.
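How the "Nearest Seed" group is chosen is not public; the following is only a hypothetical sketch of the idea: a new page temporarily borrows the PageRank value of the closest already-scored group of related pages (here, naively, pages under the same site section).

```python
# Invented seed groups with precomputed PageRank-like values.
SEED_PAGERANK = {
    "example.com/blog": 0.42,
    "example.com/shop": 0.31,
}


def pagerank_ns(new_url: str) -> float:
    """Hypothetical 'Nearest Seed' lookup: borrow the seed group's value instead of
    recomputing PageRank for a brand-new page."""
    for prefix, pr in SEED_PAGERANK.items():
        if new_url.startswith("https://" + prefix):
            return pr   # provisional value inherited from the seed group
    return 0.05         # fallback for pages with no nearby seed


print(pagerank_ns("https://example.com/blog/new-post"))  # 0.42
```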

At a hearing, Google gave a good example of how it keeps results up to date: when a user searches for "Stanley Cup," the search results usually show a cup.

However, when the Stanley Cup hockey game is in progress, NavBoost adjusts the results to prioritize real-time information about the game.

According to the latest findings, the document's click metrics include 13 months of data, with one month of overlap to allow comparison with the previous year.

Surprisingly, Google doesn’t actually offer much personalized search results. Tests have shown that modeling and adjusting to user behavior is more effective than assessing individual user preferences.

However, personal preferences, such as preferences for search and video content, are still included in personalized results.

GWS: The end and beginning of the search

Google Web Servers (GWS) are responsible for rendering the search results page (SERP), which includes the 10 "blue links", as well as ads, images, Google Maps views, "People also ask" and other elements.

Components such as FreshnessNode, InstantGlue (reacts within 24 hours, with a delay of about 10 minutes) and InstantNavBoost can adjust the ranking at the last moment before the page is displayed.

FreshnessNode can monitor changes in user search behavior in real time and adjust rankings based on these changes to ensure that search results match the latest search intent.

InstantNavBoost and InstantGlue make final adjustments to search results before they are presented, such as adjusting rankings based on breaking news and hot topics.

Therefore, to achieve high rankings, excellent document content must be combined with the right SEO measures.

Rankings can be affected by a variety of factors, including changes in search behavior, the appearance of other documents, and real-time information updates. Therefore, it is important to recognize that having high-quality content and doing a good job of SEO are only part of the dynamic ranking landscape.

Google's John Mueller stressed that a drop in rankings does not usually mean that the content is of poor quality, and changes in user behavior or other factors may change the performance of results.

For example, if users start to prefer shorter texts, NavBoost will automatically adjust the rankings accordingly. However, the IR scores in the Alexandria system or Ascorer remain unchanged.

This tells us that SEO must be understood in a broader sense. If a document's content does not match the user's search intent, simply optimizing the title or copy is ineffective.