When a user enters a query into a search engine (typically as keywords), the engine examines its index (or database) and returns a list of the best-matching web pages according to its own ranking criteria.
Step 1:
The query is tokenized and parsed, i.e. split into individual tokens such as words and numbers. The search phrase is then checked for special operators (AND, OR) and punctuation (quotation marks for exact phrases, e.g. “search engine”). Special characters with no meaning are dropped.
This is the point at which the majority of search engines perform the index search.
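The tokenizing and parsing described in Step 1 can be sketched roughly as follows. This is a minimal illustration, not how any particular engine does it: the names `parse_query`, `OPERATORS` and `TOKEN_RE` are made up for this example, and the operator set is limited to the two mentioned above.

```python
import re

# Hypothetical operator set; real engines support different (and more) operators.
OPERATORS = {"AND", "OR"}

# Matches either a quoted phrase or a run of letters/digits/underscores.
TOKEN_RE = re.compile(r'"([^"]+)"|(\w+)')

def parse_query(query):
    """Split a raw query into (kind, value) pairs, dropping other punctuation."""
    # Treat curly quotes like straight quotes so exact phrases are recognised.
    query = query.replace("\u201c", '"').replace("\u201d", '"')
    tokens = []
    for phrase, word in TOKEN_RE.findall(query):
        if phrase:
            tokens.append(("PHRASE", phrase.lower()))
        elif word in OPERATORS:
            tokens.append(("OP", word))
        else:
            tokens.append(("TERM", word.lower()))
    return tokens

print(parse_query('"search engine" AND index!! OR crawler'))
# [('PHRASE', 'search engine'), ('OP', 'AND'), ('TERM', 'index'), ('OP', 'OR'), ('TERM', 'crawler')]
```

Note that the stray "!!" is simply dropped, and operators are only recognised in upper case; both are simplifying assumptions of this sketch.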
Step 2:
Stop word removal and stemming.
Stop Words: This step used to matter much more than it does now, because memory is much cheaper and systems are much faster. However, since stop words are said to make up as much as 40 percent of the words in a document, removing them still has some significance. A stop word list typically consists of word classes known to convey little substantive meaning, such as articles (a, the), conjunctions (and, but), interjections (oh), prepositions (in, over), pronouns (he, it), and forms of the verb "to be" (is, are).
Stemming: Stemming reduces the number of unique words in the index, which saves storage space and speeds up lookups. For example, stemming reduces the words "fishing", "fished", "fisheries", "fisherman" and "fisher" to the root word, "fish". One drawback is a possible loss of precision, since words with different meanings can be reduced to the same stem.
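As a rough illustration of Step 2, the sketch below removes stop words and then strips a few common suffixes. The stop word list is just the examples given above, and the `stem` function is a deliberately naive suffix stripper written for this example; production systems typically use an established algorithm such as the Porter stemmer instead.

```python
# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it", "is", "are"}

# Suffixes tried in order; a hypothetical, very incomplete rule set.
SUFFIXES = ("ing", "ed", "er", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(terms):
    """Drop stop words, then stem whatever is left."""
    return [stem(term) for term in terms if term not in STOP_WORDS]

print(normalize(["the", "fisher", "is", "fishing", "in", "rivers"]))
# ['fish', 'fish', 'river']
```

The loss of precision is easy to reproduce with this stemmer: `stem("number")` returns "numb", so a search for "number" would also match documents about something quite different.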
Step 3:
Query expansion (or thesaurus lookup). Since the same concept can be expressed with different words (i.e. synonyms), it may be necessary to include several variations of a term. For example, someone who searches for ‘mobile phone’ would likely also be interested in documents that contain ‘cellphone’, so the user’s search query may be expanded to include ‘cellphone’.
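A minimal sketch of this kind of expansion, assuming a small hand-written synonym table (the `SYNONYMS` dictionary and `expand_query` function are invented for this example; a real engine would draw on a much larger thesaurus or on learned associations):

```python
# Hypothetical synonym table mapping a phrase to its alternatives.
SYNONYMS = {
    "mobile phone": ["cellphone"],
    "film": ["movie"],
}

def expand_query(query):
    """Return the query plus one variant for each known synonym it contains."""
    lowered = query.lower()
    variants = [lowered]
    for phrase, alternatives in SYNONYMS.items():
        if phrase in lowered:
            for alternative in alternatives:
                variants.append(lowered.replace(phrase, alternative))
    return variants

print(expand_query("cheap mobile phone"))
# ['cheap mobile phone', 'cheap cellphone']
```

The engine could then search for any of the variants and merge the results.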
Step 4:
Term Weighting: The final step in the process is to compute a weight for each term in the query (assuming the query is longer than one word). Each weight is a measure of how important that word is to the query, and the search engine computes it behind the scenes.
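Search engines do not publish exactly how they weight terms, so the following is only one plausible approach: inverse document frequency (IDF), which gives rarer terms a higher weight. The function name `idf_weights`, the smoothing used, and the document counts are all assumptions made for the sake of the example.

```python
import math

def idf_weights(query_terms, doc_freq, total_docs):
    """Weight each term by a smoothed inverse document frequency."""
    weights = {}
    for term in query_terms:
        df = doc_freq.get(term, 0)  # number of documents containing the term
        # Adding 1 to both sides avoids division by zero for unseen terms.
        weights[term] = math.log((total_docs + 1) / (df + 1))
    return weights

# Made-up document frequencies for a collection of 1000 documents.
doc_freq = {"search": 900, "engine": 400, "inverted": 12}
weights = idf_weights(["search", "engine", "inverted"], doc_freq, total_docs=1000)

# Print terms from most to least important.
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.2f}")
```

With these counts "inverted" receives the highest weight, because it appears in the fewest documents and therefore does the most to narrow down the results.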
After the query has been processed, the next step is to search the index (database) for all terms in the query. Search engines usually use what’s called an ‘inverted index’ to store all their web pages (documents).
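An inverted index maps each term to the set of documents that contain it, so looking up a query term does not require scanning every document. Here is a small sketch; the function names and sample documents are made up, and treating a multi-word query as an implicit AND is a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Return ids of documents containing every query term (implicit AND)."""
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)

documents = {
    1: "search engine index",
    2: "inverted index basics",
    3: "search the inverted index",
}
index = build_inverted_index(documents)
print(sorted(search(index, ["inverted", "index"])))
# [2, 3]
```

In practice the query terms would first go through the earlier steps (tokenizing, stop word removal, stemming), and the documents would be processed the same way when the index is built.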
Some further reading on query processing can be found here:
How a Search Engine Works
The Anatomy of a Large-Scale Hypertextual Web Search Engine by Larry Page and Sergey Brin.
Advanced Query Processing and
Search Engine Architecture
11 December 2008