


 




Online Magazine

Vol. 32 No. 3 — May/June 2008

on the net
Multilingual Searching: Search Engine Language Tools
By Greg R. Notess, Montana State University

The World Wide Web may not cover the entire globe, but it certainly has a presence in most populated places throughout the world. With such an international scope, the “multilinguality” of web content continues to increase. For savvy searchers, the multiple languages and content from distant countries create new opportunities for finding previously buried information resources.

To find these resources, even in languages that you can’t read, the search engines offer a variety of language aids. With language limits, machine translation, translated search, and multiple interface languages, searchers have a variety of increasingly sophisticated tools to harvest information content from many languages.

TOOL DIFFERENTIATION

To make some sense of the various language tools, most of the search engine-related ones fall into one of the following categories:

  • Interface language
  • Language limit
  • Machine translation
  • Translated search

The first noticed, most frequently talked about, and perhaps least useful, is the interface language. Click on Google’s “language tools,” and halfway down the page you’ll see a listing of 117 different languages. These are the interface language choices listed by Google as “Use the Google Interface in Your Language.” These language options merely change Google’s homepage text as well as the text on results pages, help files, button text, and other words used by Google within the interface itself. Its effect on searching and results is zero. Searching the same query with the English, Bihari, right-to-left Hebrew, or the oh-so-common Klingon or Elmer Fudd gets the same results (at least as far as two sequential searches in Google ever get the same results).

A search engine’s language limits, not its interface language, can be used to search for pages written primarily in a specific language or languages. The language limits are typically hidden away on the advanced search pages. The machine translation tools, useful to get very rough and inaccurate translations between languages, are often buried even further on completely separate pages. And lastly, Google’s new Translated Search feature aims to take a search query in one language and first translate the query and then search that translated query.

LANGUAGE LIMITS

The different search engines have different language limits. These are almost always found on the advanced search page (which is now sometimes hidden under “Options”). Google, Live Search, Yahoo!, Ask, Gigablast, and Exalead all have language limits, but they vary greatly as to how many and which languages they search. Want to limit a search to Dutch? All six can do that. But a limit to Ukrainian is only offered by Live Search and Google, while Yahoo! and Exalead are the only ones to offer a Tagalog limit.

So who has the most? My most recent count has Exalead far out in front with Google, Live Search, and Yahoo! fairly close, while Ask.com is trailing the pack.

  • Exalead - 54
  • Google - 43
  • Live Search - 41
  • Yahoo! - 40
  • Gigablast - 24
  • Ask.com - 6

The count is debatable, since Chinese can be listed once or twice (simplified or traditional), as can Portuguese (Brazil and Portugal usage differentiated at Live Search). All six of these search engines cover six main languages: Dutch, English, French, German, Italian, and Spanish. For a complete listing of these language limits and the other language tools, see www.searchengineshowdown.com/language.
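To see why the count is debatable, here is a minimal Python sketch that counts the same list of language limits two ways: once treating each variant as its own language and once merging variants. The sample labels and the function names are illustrative assumptions, not any search engine's actual list.

```python
def base_language(label):
    """Strip a parenthesised variant: 'Chinese (Simplified)' -> 'Chinese'."""
    return label.split("(", 1)[0].strip()


def count_languages(labels, merge_variants):
    """Count distinct language limits, optionally merging variants."""
    if merge_variants:
        return len({base_language(label) for label in labels})
    return len(set(labels))


# Hypothetical sample of language-limit labels (assumption for illustration).
sample_labels = [
    "Chinese (Simplified)",
    "Chinese (Traditional)",
    "Dutch",
    "English",
    "Portuguese (Brazil)",
    "Portuguese (Portugal)",
]

variants_counted_separately = count_languages(sample_labels, merge_variants=False)
variants_merged = count_languages(sample_labels, merge_variants=True)
print(variants_counted_separately, variants_merged)
```

Depending on which rule is applied, the same engine can gain or lose a few languages in any tally.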

When do language limits help a search? Much of the time, they don’t help at all. The simplest, and often most effective, way to search for webpages in a particular language is to use a query in that language. Just search for tradução to find Portuguese pages (or traduction for French or traduzione for Italian). Use the language limits when searching for a word that finds pages in multiple languages, especially names (brands, companies, people, products), scientific terms, or new technology topics.

While you may never need to search for Malayalam, Mongolian, or Georgian pages, Exalead’s inclusion of these languages has an additional search advantage. If you come across a term in an unknown language, just search that term in Exalead. Then look at the “narrow your search” box on the right. The listed language(s) can identify the source language of the word. Click the “more choices” button at the bottom of the box to see percentages of the retrieved results in each language.
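The percentage breakdown amounts to simple arithmetic. As a rough sketch, assuming you already had a language label for each retrieved result, the shares could be computed as below. The result data and the function name are hypothetical, and this is not a description of how Exalead computes its figures.

```python
from collections import Counter


def language_shares(result_languages):
    """Return (language, percent) pairs, largest share first."""
    counts = Counter(result_languages)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(lang, 100.0 * n / total) for lang, n in counts.most_common()]


# Hypothetical labels for ten retrieved results (assumption for illustration).
labels = ["Georgian"] * 7 + ["Russian"] * 2 + ["English"]

shares = language_shares(labels)
for lang, percent in shares:
    print(f"{lang}: {percent:.0f}%")
```

The language with the largest share is the most likely source language of the unknown term, though a term used in several languages will show a more even split.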

MACHINE TRANSLATION

So once you limit a search to pages from a particular language, how do you get a translation of those pages? Google and Yahoo! have some “Translate this page” links on non-English search results that are in a language available in their translation software. Or just go directly to one of the free, online translation sites. There are many available on the web ranging from word-by-word bilingual dictionaries to those that try to handle whole webpages and chunks of text. Several that do handle full text are offered by search engines:

  • Yahoo! Babel Fish
  • AltaVista Babel Fish
  • Google Translate
  • Live Search

Yahoo! Babel Fish is the newer version of AltaVista Babel Fish. It provides the same options, except that Yahoo! has added translation between traditional and simplified Chinese. All four give options for translating a block of text or an entire webpage.

Paste some text or a webpage URL from a non-English source and then translate it into English to get a sense of the quality of machine translation. For example, a Spanish book on a publisher’s site gives a description including the text of “Con la simulación de ensayos clínicos en el desarrollo de medicamentos se pueden llevar a cabo réplicas virtuales de los ensayos clínicos reales …” Paste that into Yahoo! Babel Fish to get an English translation of “With the simulation of clinical tests in the medicine development virtual retorts of real the clinical tests can be carried out …”

Machine translations often seem nonsensical. Yet they can still be informative. Had a request for a Spanish book and only found Spanish descriptions of it? A quick online translation should give the general sense of the book’s topic. Thus, it is easy to tell that the previous description is of a medical book. Translating the title of another book, Veinte años canción en España, yields Twenty Years Song in Spain. While not the most accurate translation, it certainly suggests that this book covers a music history topic.

The free availability of this translation tool is what makes it so useful. You get a rough sense of the text. Yet with the four translation options listed above, which ones are best to use? Among these services, a total of 40 language pairs are available. Yahoo! Babel Fish has the most with 38, followed by AltaVista Babel Fish at 36, Google Translate at 29, and Live with 25. Yet it is also important to note that, at first, all four of these were based on the same underlying technology, from Systran. Google has since moved some of its language pairs to its own technology. To get two differing (and potentially inaccurate) translations of the same text, I will often try it at both Yahoo! and Google.

TRANSLATED SEARCH

Another, newer approach to the multilingual web comes from Google. Translated Search (www.google.com/translate_s) is still in beta, but it combines several search and translation tools into one. Enter a search term in one language, and Google will translate the query into the target language. Translated Search supports 14 languages: Arabic, Chinese (Simplified), Chinese (Traditional), Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. Start with an English query, and it can be translated into any of the other 13.

After the search query has been translated and run, Google displays two columns of results. The right column has the results in the target language. The left column has links to translated copies of the webpages in the original language of the query. For example, when searching for Arabic pages about knowledge management, Google translates the query into Arabic and then pulls up results.

The display of the results in parallel includes the titles, keyword-in-context extract, URLs, and cache links for both. The cached copy of the translated page has the text translated as well. The cached copy of the page in the other language has a link label to the cache that is also in that language so that in an Arabic search, for example, the cached link appears in Arabic right next to the URL (the URLs are not translated).
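The translate-then-search flow and the parallel display can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the phrase table, the two sample pages, the example.org URLs, and every function name are hypothetical, and nothing here reflects how Google actually implements Translated Search. French stands in for the target language.

```python
# Hypothetical one-entry phrase table standing in for machine translation.
PHRASE_TABLE = {
    ("en", "fr"): {"knowledge management": "gestion des connaissances"},
}

# Hypothetical pages, each with an original and a translated title.
PAGES = [
    {"url": "http://example.org/a", "lang": "fr",
     "title": "Guide de gestion des connaissances",
     "title_translated": "Knowledge management guide"},
    {"url": "http://example.org/b", "lang": "fr",
     "title": "Recettes de cuisine",
     "title_translated": "Cooking recipes"},
]


def translate_query(query, source, target):
    """Translate a query; unknown terms such as names pass through unchanged."""
    table = PHRASE_TABLE.get((source, target), {})
    return table.get(query.lower(), query)


def translated_search(query, source, target):
    """Translate the query, search target-language pages, pair the results."""
    translated = translate_query(query, source, target)
    hits = [page for page in PAGES
            if page["lang"] == target
            and translated.lower() in page["title"].lower()]
    # Left column: translated copy. Right column: original. URL is untranslated.
    rows = [(p["title_translated"], p["title"], p["url"]) for p in hits]
    return translated, rows


translated, rows = translated_search("knowledge management", "en", "fr")
print("Translated query:", translated)
for left, right, url in rows:
    print(f"{left} | {right} | {url}")
```

Note that a query with no entry in the phrase table is searched as typed, which mirrors the behavior of untranslatable names.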

I find Translated Search especially useful for the non-Latin alphabet languages: Arabic, Chinese, Japanese, Korean, and Russian. To my own disappointment, I am not literate in any of these languages, but now I can search them and have some idea of what is being written on webpages in those languages. This works well for names (brand names, personal names, and organizational names). Wondering what the Korean web space might be saying about ProQuest GenderWatch? That query does not get translated, but it brings up nine results with translated webpages. While this is similar to just using a language limit, the display of translated and untranslated results in parallel presents an easier way to choose which pages to view.

This approach does not work with a name containing translatable words. In a search on ProQuest Historical Annual Reports, only proquest is left untranslated, while the remaining words are rendered in the target language. But this is where another Google innovation, which was introduced with Translated Search, can be used. Google is trying to harness the power of the collective mind, somewhat like Wikipedia’s approach, and asks for user feedback on the translated query and the translations. Each time Translated Search is used, the translated query is shown along with a “Not quite right? Edit” link that allows the searcher to change the query translation. For this example, a searcher can change the entire query string back to the English ProQuest Historical Annual Reports. Unfortunately, some of the results then end up being English language pages.

This user feedback on translations could eventually help improve the translations, assuming a sufficient number of multilingual users choose to contribute. In addition to the query translation correction, Google has enabled user corrections directly on translated pages. Simply mouse over a paragraph on the translated page, and a speech bubble pops up with the original text and another link to “Suggest a better translation.”

THE MULTILINGUAL SEARCHER

With translated search, language limits, and machine translation all available online for free, even the monolingual searcher can explore information content written in other languages. It takes some search flexibility and a heavy dose of skepticism about the accuracy of the translations. Even so, with multiple translation tools using differing underlying technologies, searchers can gain a general sense of the content on non-English pages. For those needing accurate translations, these tools can help decide which pages are worth the cost of buying a professional translation. For the rest of us, the tools can give an insight into conversations, opinions, and professional web content in otherwise inaccessible languages.


Greg R. Notess (greg@notess.com; www.notess.com) is reference team leader at Montana State University and founder of SearchEngineShowdown.com.

Comments? E-mail letters to the editor to marydee@infotoday.com.


 