Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

How search results may differ based on accented characters and interface languages

Thursday, August 31, 2006 at 5:22 PM

When a searcher enters a query that includes a word with accented characters, our algorithms consider web pages that contain versions of that word both with and without the accent. For instance, if a searcher enters [México], we'll return results for pages about both "Mexico" and "México."



Conversely, if a searcher enters a query without using accented characters, but a word in that query could be spelled with them, our algorithms consider web pages with both the accented and non-accented versions of the word. So if a searcher enters [Mexico], we'll return results for pages about both "Mexico" and "México."
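
As a rough illustration of the matching described above, here is a minimal Python sketch that treats accented and unaccented spellings as equivalent by stripping combining marks. The function names `fold_accents` and `matches` are made up for this example, and the approach (Unicode decomposition plus case folding) is an assumption chosen for illustration, not a description of the algorithms Google actually uses.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with accents removed (illustrative helper, not Google's method)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query_word: str, page_word: str) -> bool:
    """True if the two words are equal once accents and case are ignored."""
    return fold_accents(query_word).casefold() == fold_accents(page_word).casefold()


if __name__ == "__main__":
    # Both directions described in the post: accented query, unaccented query.
    print(matches("M\u00e9xico", "Mexico"))
    print(matches("Mexico", "M\u00e9xico"))
```

Because both the query and the page word are folded before comparison, the match is symmetric: it does not matter which side carries the accent.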



How the searcher's interface language comes into play
The searcher's interface language is taken into account during this process. For instance, the set of accented characters that are treated as equivalent to non-accented characters varies based on the searcher's interface language, as language-level rules for accenting differ.
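
One way to picture language-dependent equivalence is a per-language list of letters that are never folded into their unaccented form. The sketch below is purely illustrative: the table `DISTINCT_LETTERS`, its contents, and the function `fold_for_language` are assumptions made for this example, and the actual rules Google applies for each interface language are not documented in this post.

```python
import unicodedata

# Hypothetical table: letters assumed to count as distinct letters in a
# given interface language, so they are not folded to an unaccented form.
DISTINCT_LETTERS = {
    "es": {"\u00f1"},                      # ñ
    "sv": {"\u00e5", "\u00e4", "\u00f6"},  # å, ä, ö
    "en": set(),
}


def fold_for_language(text: str, lang: str) -> str:
    """Strip accents except on letters the given language keeps distinct."""
    keep = DISTINCT_LETTERS.get(lang, set())
    out = []
    # Work on precomposed (NFC) characters so each letter is one code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch.lower() in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(fold_for_language("a\u00f1o", "es"))  # ñ kept under the assumed Spanish rules
    print(fold_for_language("a\u00f1o", "en"))  # ñ folded under the assumed English rules
```

With a table like this, the same query can fold to different forms, and therefore match different pages, depending on the interface language.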

Also, documents in the chosen interface language tend to be considered more relevant. If a searcher's interface language is English, our algorithms assume that the queries are in English and that the searcher prefers English language documents returned.

This means that the search results for the same query can vary depending on the searcher's interface language. They can also vary depending on the location of the searcher (which is based on IP address) and on whether the searcher chooses to see results only from the specified language. If the searcher has personalized search enabled, that will also influence the search results.

The example below illustrates the results returned when a searcher queries [Mexico] with the interface language set to Spanish.



Note that when the interface language is set to Spanish, more results with accented characters are returned, even though the query didn't include the accented character.

How to restrict search results
To obtain search results for only a specific version of the word (with or without accented characters), you can place a + before the word. For instance, the search [+Mexico] returns only pages about "Mexico" (and not "México"). The search [+México] returns only pages about "México" (and not "Mexico"). Note that you may see some search results that don't appear to use the version of the word you specified in your query. In those cases, that version of the word may appear within the content of the page or in anchor text pointing to the page, rather than in the title or description listed in the results. (You can see the top anchor text used to link to your site by choosing Statistics > Page analysis in webmaster tools.)
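
The difference between a plain term and a + term can be sketched as follows. This is a toy model under stated assumptions, not Google's implementation: the helper names are invented for the example, and treating the + form as case-insensitive is an assumption the post does not confirm.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove accents by dropping combining marks (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def term_matches(term: str, page_word: str) -> bool:
    """Match a query term against a word found on a page or in anchor text.

    A leading + requires the exact accented or unaccented spelling;
    without it, accents are ignored. Case is ignored in both modes
    (an assumption made for this sketch).
    """
    if term.startswith("+"):
        wanted = unicodedata.normalize("NFC", term[1:]).casefold()
        found = unicodedata.normalize("NFC", page_word).casefold()
        return wanted == found
    return fold_accents(term).casefold() == fold_accents(page_word).casefold()


if __name__ == "__main__":
    words_on_pages = ["Mexico", "M\u00e9xico"]
    for term in ["Mexico", "+Mexico", "+M\u00e9xico"]:
        hits = [w for w in words_on_pages if term_matches(term, w)]
        print(term, hits)
```

In this model the word being checked can come from anywhere the engine indexes for a page, including anchor text, which is why a result may match even when its title or description shows the other spelling.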

The example below illustrates the results returned when a searcher queries [+Mexico].

The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

5 comments:

Marcelo Waldo said...

Hello,

Very useful information. Showing results with or without accents is an excellent idea, and now I understand why results may vary, although most queries in Spanish are typed without accents. Web sites, however, are full of them: Google shows 535,000,000 results for méxico and 532,000,000 for mexico.

In Spanish, an accent is used to change the pronunciation or because of a grammatical rule. For example:

The word “papa” can be interpreted in 3 different ways; it all depends on the accent when you are speaking with someone and on the subject being discussed.

Case “a”: quiero unas papas fritas (I want some French fries).
Case “b”: los papas viven en el vaticano (the Popes live in the Vatican).
Case “c”: donde estan mis papás? (Where are my parents?)

Google seems to understand well the concepts in Spanish; I made a search with these 3 different queries:

papas del vaticano - Vatican Popes
donde estan mis papas? - where are my parents?
papas fritas - French fries

The word “papa” is associated with 3 different subjects, and the results were pretty good. I didn’t use accented words.

A special case could arise in a different language. For example, in Italian, accents are also used to change the pronunciation, but the word “e” has a different meaning when it is accented. Another example:

é importante mangiare e dormire bene: it’s important to eat and sleep well.

Google does not seem to understand the difference between “e” and “é”, because it treats both words as the same thing.

Best regards.

secarica said...

Depending on the particular language, the search result may be totally wrong (i.e. a page is not displayed at all, which amounts to falsely reporting that it does not exist).

What I want to say is that today (July 2007) the language support in Google search is incomplete. I am referring here to my language, Romanian.

I made a simple test page in order to highlight the issue. Please read carefully the instructions provided therein:
http://www.secarica.ro/html/test_motor_de_cautare.html

The Romanian language includes five special characters: ăâîșț. The problem was (and is) that, because the international standard ISO/IEC 8859-2 (Latin 2) was wrongly associated with the Romanian language, the characters ş and ţ were used instead of ș and ț (in other words, cedilla below (wrong) instead of comma below (correct)). On Apple Mac OS this encoding problem has been solved for years. On Microsoft Windows it was addressed only recently, in Windows Vista. For "legacy" Windows systems, a font update has been published here:
http://www.microsoft.com/downloads/details.aspx?FamilyID=0ec6f335-c3de-44c5-a13d-a1e7cea5ddea&displaylang=en

But today the problem still exists: I cannot find words written with the correct Romanian diacritical marks by searching for them without the marks, or vice versa.

Cristi

Tony said...

First time, long time. My question is, how about Chinese characters? Do hex characters make Google less semantically able to recognize relationships between keywords and link URLs, or at any other points where ASCII and hex are mixed in keywords?

Thank you

Nick Name said...

So how do I construct the search if I want results that will contain @maildomain.com or .mail.domain.com?

I need it complete, without the @ sign or any periods stripped away.

Thanks.
DgB

Google Webmaster Central said...

Hi everyone,

Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.

Thanks and take care,
The Webmaster Central Team