Unique Matters: Why Information Gain Is an Important Google SEO Signal

The words “original” and “unique” appear 131 times in Google’s Search Quality Rater Guidelines [link]. Originality is one of the four core concepts of main content quality, alongside effort, talent or skill, and accuracy.

If you read Google’s `SEO Starter Guide`, you’ll notice that making sure “the content is unique” is one of their four core principles for an interesting and useful site. Navigate to their `Get On Google` section and there it is again: they talk about asking yourself what makes your content unique to ensure it’s helpful, reliable and people-first.

It's important to understand how Google thinks about and measures "uniqueness" or "originality".

We need to dig. Surprisingly, there wasn't any mention in the Google API leaks. But in 2018, Google filed a patent titled "Contextual estimation of link information gain".

Now, the rest of this post goes pretty deep into Information Gain (IG) but here's a brief summary:

"An information gain score for a given document is indicative of additional information that is included in the given document beyond information contained in other documents that were already presented to the user"

And it's calculated using things like vector embeddings:

"In some implementations, information gain scores may be determined for one or more documents by applying data indicative of the documents, such as their entire contents, salient extracted information, a semantic representation (e.g., an embedding, a feature vector, a bag-of-words representation, a histogram generated from words/phrases in the document, etc.) across a machine learning model to generate an information gain score"

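To make the idea concrete, here is a minimal Python sketch. It is not the patent's method: the patent describes a trained machine learning model, whereas this sketch swaps in a simple heuristic (one minus the highest cosine similarity between a candidate's bag-of-words vector and any document the user has already seen). The function names and sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def information_gain_proxy(candidate: str, viewed: list[str]) -> float:
    """ASSUMPTION: heuristic stand-in, 1 - max similarity to anything viewed."""
    if not viewed:
        return 1.0  # nothing seen yet, so everything counts as new
    cand_vec = bag_of_words(candidate)
    best_match = max(cosine(cand_vec, bag_of_words(v)) for v in viewed)
    return max(0.0, 1.0 - best_match)  # clamp away float rounding noise


viewed_docs = ["restart the router to fix the connection error"]
duplicate = "restart the router to fix the connection error"
novel = "update the network driver and flush the dns cache"

print(round(information_gain_proxy(duplicate, viewed_docs), 2))
print(round(information_gain_proxy(novel, viewed_docs), 2))
```

The duplicate scores close to zero and the text that covers a different fix scores much higher. A real system would presumably use learned embeddings rather than raw word counts, but the shape of the comparison is the same.
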
The patent tells us a few incredibly tantalizing things about how this all works.

👉 For one, information gain is computed for every sentence and every phrase.

“For example, one or more sentences, phrases, or other passages or snippets in the third document that mention or relate to the third potential cause of the computer error may be identified. TTS processing may then be applied to generate, for instance, excerpts directly from the second document and/or natural language output that summarizes these sentences, phrases, or other passages or snippets.”

👉 Second: Google uses semantic representations and vector embeddings to help understand the relationships between sentences, phrases, and other snippets.

"For example, an autoencoder (e.g., word2vec) may be trained to receive text as input, encode a semantic representation using an encoder portion, and then reconstruct the original text input using a decoder portion. Once trained, the encoder portion alone may be used to generate semantic representations (e.g., semantic vectors) of documents that may then be applied as input across the aforementioned trained machine learning. Thus, for each of the documents identified by search engine 120, a semantic vector can be generated."

👉 What I think may be a bombshell: as the user views additional documents, the system updates the information gain scores and reranks the remaining unviewed documents.

"In some implementations, documents that have not been viewed by the user can be reranked based on the information gain scores of the documents… The reranking based on the information gain scores may result in one or more documents being promoted and/or demoted in a ranked list"
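
A rough sketch of that reranking step, under loud assumptions: the scoring function below is a hypothetical stand-in (the share of a document's words the user has not seen yet), not the trained model the patent describes, and the document IDs and texts are invented.

```python
import re


def words(text: str) -> set[str]:
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def gain(doc_text: str, seen_words: set[str]) -> float:
    """ASSUMPTION: hypothetical score, fraction of the document's words not yet seen."""
    doc_words = words(doc_text)
    if not doc_words:
        return 0.0
    return len(doc_words - seen_words) / len(doc_words)


def rerank(unviewed: dict[str, str], viewed_texts: list[str]) -> list[str]:
    """Order unviewed document IDs by descending gain, given the session so far."""
    seen: set[str] = set()
    for text in viewed_texts:
        seen |= words(text)
    return sorted(unviewed, key=lambda doc_id: gain(unviewed[doc_id], seen), reverse=True)


results = {
    "a": "reboot the router to fix the error",
    "b": "reboot the router and the modem to fix the error",
    "c": "update the driver to fix the error",
}

# The user opens document "a"; the remaining results are re-scored against it.
viewed = [results["a"]]
remaining = {doc_id: text for doc_id, text in results.items() if doc_id != "a"}
print(rerank(remaining, viewed))  # ['c', 'b']
```

Document "b" mostly repeats what the user just read, so it drops below "c", which introduces a different fix. That is the promote/demote behavior the quote describes, in miniature.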

👉 Information density is a thing. It’s not just that your article or content is providing unique information, it’s that it’s doing so in an efficient manner.

“In various implementations, information to be provided to the user can be extracted from a document, which is selected on the basis of the information gain score. In this way, information provided to the user can be streamlined by preferentially outputting new information. In some implementations, the unnecessary output of repeated or redundant information can be reduced by determining the information gain score”
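
One way to picture "preferentially outputting new information" is a sentence-level filter. The sketch below is an illustration only: the naive sentence splitter, the word-overlap rule, and the 0.5 threshold are all my assumptions, not details from the patent.

```python
import re


def tokens(text: str) -> set[str]:
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def novel_sentences(document: str, viewed: list[str], threshold: float = 0.5) -> list[str]:
    """ASSUMPTION: keep sentences where at least `threshold` of the words are new."""
    seen: set[str] = set()
    for text in viewed:
        seen |= tokens(text)
    kept = []
    for sentence in split_sentences(document):
        sentence_words = tokens(sentence)
        if not sentence_words:
            continue
        new_share = len(sentence_words - seen) / len(sentence_words)
        if new_share >= threshold:
            kept.append(sentence)
    return kept


viewed_docs = ["Restart the router. Check the cable."]
page = "Restart the router. A corrupted driver can also cause the error."

print(novel_sentences(page, viewed_docs))
# ['A corrupted driver can also cause the error.']
```

Only the sentence that adds something survives the filter. If most of your page would be filtered out this way against the pages a searcher has already read, that is a hint the page is low on information density.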

👉 And finally, human evaluators are determining uniqueness and originality. The models are trained on the output of humans who are guided by Google’s Search Quality Rater Guidelines. Read them!

"In some such implementations, the labels assigned to the training examples may be generated manually. For example, one or more individuals may read the documents and then provide a subjective information gain score representing how much additional (or novel) information they feel they gained in light of consuming d, after consuming d."

And: "As noted previously, in some implementations, an information gain annotation engine may generate annotated training data based on annotations of human curators of documents. For example, a curator may be provided with a first document and a second document ..., and the curator may assign a value to the second document that is indicative of information gain of the second document after viewing the first document. Also, for example, a user in a session of the search interface may be asked to rate and/or rank information gain of a document during the search session (e.g., a pop-up asking the user to rate information gained from a viewed document based on other documents viewed during the session). The annotations can be utilized to generate training data that indicates information gain as determined by human curators and the training data can be provided to the machine learning model as input to train the model to determine meaningful output (e.g., scores that are indicative of information gain for a given set of documents)."

⚒️ THE WORKFLOW 🥁

  1. Google compares two sets of documents on a similar topic: a first set the user has already viewed, and a second set of new documents
  2. For each new document in the second set, the system determines an information gain score. This score reflects the amount of new information the document provides beyond what is contained in the previously viewed documents.
  3. Based on the calculated information gain scores, the new documents are ranked
  4. **(now here’s where it gets pretty wild)** The documents can be re-ranked in subsequent searches based on what documents you’ve already viewed

📚 YOUR TAKEAWAYS

  1. Audit your content alongside the documents ranking above you. Are you providing more unique information about the search term?
  2. Find ways to add a unique input. Quote experts. Share statistics from your own business on the topic. Find sources of information your competitors aren’t using (hint: Google Scholar is a GOLD MINE).
  3. Be different. The tactic of scraping all the top ranking pages and summarizing will stop working. Be intentional about your position, and go find ways to provide value.
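
For the first takeaway, a crude first-pass audit can be scripted. This sketch assumes you already have the plain text of your page and of the pages ranking above you (fetching and cleaning HTML is out of scope), and it uses a tiny hand-picked stopword list. It only surfaces words, not ideas, so treat the output as a prompt for a manual review rather than a measure of anything Google computes.

```python
import re
from collections import Counter

# ASSUMPTION: a deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on", "with", "your"}


def terms(text: str) -> Counter:
    """Count non-stopword tokens in the text."""
    found = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in found if w not in STOPWORDS)


def unique_terms(my_page: str, competitor_pages: list[str]) -> list[tuple[str, int]]:
    """Terms on my page that appear on no competitor page, most frequent first."""
    competitor_vocab: set[str] = set()
    for page in competitor_pages:
        competitor_vocab |= set(terms(page))
    only_mine = {w: c for w, c in terms(my_page).items() if w not in competitor_vocab}
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(only_mine.items(), key=lambda item: (-item[1], item[0]))


my_page = "Our survey of 200 stores shows email beats ads for repeat sales."
competitors = [
    "Email marketing tips for stores.",
    "Ads and email for sales.",
]

for term, count in unique_terms(my_page, competitors):
    print(term, count)
```

In this toy example the terms that only your page uses ("survey", "200", "repeat") point at the first-party data, which is exactly the kind of unique input the second takeaway is about.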

To quote Wil Reynolds, “Do real company shit.”

Information Gain is one of the many inputs we’re using at Kixely to help predict the outcome of search engine rankings. If you’re interested in learning more, give us a shout – nicolas@kixely.com.

Also, shout out to Michael King, Bernard Huang and others who have shared quite a bit in this space. Here’s a podcast recorded last year where the two of them talk about this exact topic (https://ipullrank.com/topic-maturity-ft-bernard-huang)