Search engine spiders will gather (crawl) every webpage on the internet by following links from page to page. They then extract all the content from the page and insert the content into their database in a specific location, where it can be easily located.
Each document is converted into a set of word occurrences called hits. The hits record the word, position(s) in document, an approximation of font size, and capitalization. Hit lists account for most of the space used in Google’s inverted index.
Let’s now look at a basic example of an inverted index. Say we’ve got three phrases as follows:
P0: “it is what it is”
P1: “what is it”
P2: “it is a banana”
This can be represented in an inverted index as follows:
where (x, y) can be read as (phrase number, word position in phrase).
| "a": | (2, 2) | |||
| "banana": | (2, 3) | |||
| "is": | (0, 1), | (0, 4), | (1, 1), | (2, 1) |
| "it": | (0, 0), | (0, 3), | (1, 2), | (2, 0) |
| "what": | (0, 2), | (1, 0) |
Note: word position starts at zero, so the word “what” occupies position 0 in phrase P1 (represented in bold above and diagrammatically below).
| Phrase (P1) | what | is | it |
| Position | 0 | 1 | 2 |
For more on inverted indices please visit the following:
Definition of Inverted IndexBuilding an Inverted Index
19 December 2008