Search Strategies for Large Document Searching

Google, Bing, and most of the remaining search engines focus their algorithms and engineering efforts on combating spam and optimizing results so that when we search for recipes, movies, restaurants, music, news, or a basic topic overview, we get helpful results. Newer artificial intelligence initiatives, such as Google’s RankBrain, increasingly focus on converting query words into concepts and related topics rather than matching the actual words. The objective is to deliver search results that are relevant for both common queries and unusual ones. However, expert information professionals (people like us who are trying to dig deeply into the huge textual index of web-accessible content) are a very small minority of web searchers, so search engines do not optimize their algorithms and engineering efforts for us.

How do some of these newer initiatives affect expert searchers who are still looking for documents based on text matches? With a growing number of very large documents appearing online, including full books and multi-volume tomes, understanding how the search engines handle such documents can inform search strategies. It is frustrating to see the strange and inconsistent results that occur so frequently, but knowing about and expecting these inconsistencies can help to eventually track down that needed nugget of information no one else seems able to find.

Large Documents

Many years ago, when Google covered primarily HTML webpages, it had a known cap on how much of a large HTML webpage would be indexed. Only the first 100KB or so of a document was indexed, which meant that any words beyond that limit were not searchable. Upon hearing this, one librarian from an international nonprofit was dismayed to discover that the long webpages her organization had been publishing were not being fully indexed by Google. Since the web documents in question included long lists of names, all of which were intended to be searchable, the only apparent option was to separate the content into more, and smaller, pages.
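
For anyone facing a similar problem with a long list of names, the splitting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_pages`, the 100KB limit (taken from the approximate cap described above), and the 2KB allowance for page markup are all hypothetical choices, not figures published by any search engine.

```python
def split_into_pages(names, limit_bytes=100 * 1024, overhead_bytes=2048):
    """Group names into pages whose UTF-8 size stays under a byte budget.

    limit_bytes and overhead_bytes are illustrative assumptions, not
    documented search engine limits.
    """
    budget = limit_bytes - overhead_bytes  # room left after page markup
    pages, current, size = [], [], 0
    for name in names:
        entry = len((name + "\n").encode("utf-8"))  # bytes this line adds
        # Start a new page when the next name would exceed the budget.
        if current and size + entry > budget:
            pages.append(current)
            current, size = [], 0
        # A single name larger than the budget still gets its own page.
        current.append(name)
        size += entry
    if current:
        pages.append(current)
    return pages


# Usage: a made-up list of 20,000 names, split into smaller pages.
sample_names = ["Name %05d" % i for i in range(20000)]
sample_pages = split_into_pages(sample_names)
```

Each resulting page can then be published as its own webpage, so that every name falls within the portion of the page that gets indexed.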

Online content has since exploded in terms of both scope and format. From Google Books to 500-page PDFs to extensive documents in many other formats, the range of items being indexed and made searchable creates different issues for searchers. Along with the expansion of document types indexed, different rules can apply depending upon the document type. Identifying how much of each large document is indexed, searchable, and findable proves to be quite complex, as Google is not consistent between formats or websites. Also, when large documents are separated into formatted pages, page breaks and even line breaks can prevent a phrase that spans a break from being indexed as a phrase.
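
The effect of line breaks on phrase matching can be illustrated with a small Python sketch. It does not describe how Google or any other search engine actually processes documents; the sample text and the function names (`phrase_in_lines`, `normalize`, `phrase_in_text`) are invented for illustration. It simply shows why a phrase that is split across lines, or hyphenated at a line end, can be missed when text is matched line by line, and found once the breaks are removed.

```python
import re


def phrase_in_lines(lines, phrase):
    """Naive per-line matching: a phrase split across lines is missed."""
    return any(phrase.lower() in line.lower() for line in lines)


def normalize(text):
    """Rejoin words hyphenated at line ends, then collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def phrase_in_text(text, phrase):
    """Match the phrase against the text with line breaks removed."""
    return phrase.lower() in normalize(text).lower()


# Invented sample: text as it might come out of a formatted page.
extracted = "annual report of the international\nnonprofit organi-\nzation for 1998"
phrase = "international nonprofit organization"

missed = phrase_in_lines(extracted.split("\n"), phrase)  # False
found = phrase_in_text(extracted, phrase)                # True
```

One limitation of this sketch: it also joins genuinely hyphenated compounds that happen to break at a line end, so "multi-" followed by "volume" on the next line becomes "multivolume". For searchers, the practical takeaway is that when a phrase search fails on a large formatted document, searching for the individual words, or for a shorter part of the phrase, may still find it.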