What is LSI?
Latent Semantic Indexing (LSI) is, in a nutshell, word relationship technology. LSI is an important step in the document indexing process – it creates a result set by examining a document collection and producing results based on similarity between the documents.
Documents which have many words in common are said to be semantically close, while those with few words in common are semantically distant. By placing additional weight on related words in content, or words in similar positions in other related documents, LSI has a net effect of lowering the value of pages which only match the specific term and do not back it up with related terms.
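To make the idea of "semantically close" documents concrete, here is a minimal sketch in Python. It only measures word overlap between documents using cosine similarity on word counts. Full LSI goes further by applying a matrix decomposition to the term-document matrix, which this sketch does not attempt. The sample documents, the stopword list, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stopword list; a real system would use a far larger one.
STOPWORDS = {"the", "and", "of", "a"}

def term_counts(text):
    """Lowercase the text, split on whitespace, and count non-stopwords."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical three-document collection.
docs = {
    "dentist": "the dentist examined the tooth and cleaned the teeth",
    "oral": "oral care keeps each tooth and the teeth healthy",
    "cars": "the mechanic repaired the engine of the car",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

# Two dental documents share words; the car document shares none.
close = cosine_similarity(vectors["dentist"], vectors["oral"])
distant = cosine_similarity(vectors["dentist"], vectors["cars"])
print(close > distant)  # the dental pair scores higher than the unrelated pair
```

The two dental documents share 'tooth' and 'teeth', so they score above zero, while the dental and car documents share no content words and score zero.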
If you run a Google search for a word preceded by a tilde (~), Google will show you what it believes are related words (not necessarily synonyms). This data is collected during the 'thesaurus lookup' stage of query processing. For example, a search for ‘~dental’ will have ‘teeth’, ‘tooth’, ‘dentist’, ‘dentistry’, and ‘oral’ all highlighted in the SERPs. This is possible because all of these words appear in a multitude of semantically close documents relating to the dental industry.
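As a rough illustration of how related words can fall out of shared documents, the sketch below ranks words by how many documents they share with a given term. This is not Google's method, which is not described here. The function name `related_terms`, the stopword list, and the sample documents are all hypothetical.

```python
from collections import Counter

# Hypothetical stopword list for this toy example.
STOPWORDS = {"the", "and", "of", "a", "for"}

def related_terms(term, documents, top_n=3):
    """Rank words by the number of documents they share with `term`."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split()) - STOPWORDS
        if term in words:
            # Count every other word that appears alongside the term.
            counts.update(words - {term})
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical document collection.
documents = [
    "dental care for teeth and gums",
    "the dentist offers dental checkups for teeth",
    "dental dentist appointments",
    "car engine repair",
]
related = related_terms("dental", documents, top_n=2)
print(related)
```

In this toy collection, 'teeth' and 'dentist' each appear in two documents alongside 'dental', so they rank above words that co-occur only once.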
For further reading on latent semantic indexing, please visit the following links:
Wikipedia's LSI/LSA Overview Page
SEOBook's excellent 'Patterns in Unstructured Data'
Telcordia's LSI Papers
Using Latent Semantic Indexing for Information Filtering
3 December 2008