Full text search
From Wikipedia, the free encyclopedia

In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. Full text searching techniques became common in online bibliographic databases in the 1970s[verification needed]. Many web sites and application programs (such as word processing software) provide full text search capabilities. Some web search engines, such as AltaVista, employ full text search techniques, while others index only a portion of the web pages examined by their indexing systems.[1]

Indexing

When dealing with a small number of documents, the full text search engine can directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.
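
As a rough illustration of serial scanning, the sketch below reads every document in full on every query. The function name `serial_scan` and the sample documents are invented for this example, and matching on whitespace-separated lowercase words is a simplifying assumption.

```python
def serial_scan(documents, query):
    """Return the names of documents that contain every word of the query.

    No index is built: each document's text is scanned again for each query.
    """
    wanted = query.lower().split()
    hits = []
    for name, text in documents.items():
        words = set(text.lower().split())
        if all(word in words for word in wanted):
            hits.append(name)
    return hits


# Hypothetical sample collection
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog sleeps",
    "c.txt": "a quick dog runs",
}

print(serial_scan(docs, "quick dog"))
```

The cost of each query grows with the total amount of text, which is why this approach only suits small collections.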

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms, often called an index but more correctly named a concordance. In the search stage, a query is answered by consulting only the index rather than the text of the original documents.

The indexer makes an entry in the index for each term or word found in a document, possibly together with its relative position within the document. Usually the indexer ignores stop words, such as the English "the", which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming on the words being indexed, so that, for example, any of the words "drives", "drove", or "driven" is recorded in the index under a single concept word "drive".
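
A minimal sketch of such an indexer follows. The stop-word set and the `STEMS` lookup table are tiny stand-ins chosen for this example (a real system would use a full stop list and a language-specific stemmer), and the names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative stop words only; real stop lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}

# Toy lookup table standing in for a real stemmer (an assumption for this sketch).
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def build_index(documents):
    """Map each indexed term to {doc_id: [word positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue  # stop words get no entry, but still occupy a position
            term = STEMS.get(word, word)
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Look up a single word in the index only; the documents are not re-read."""
    term = STEMS.get(word.lower(), word.lower())
    return sorted(index.get(term, {}))


docs = {1: "She drove the car", 2: "He drives a truck", 3: "The car is red"}
index = build_index(docs)

print(search(index, "driven"))  # documents indexed under "drive"
print(index["car"])             # positions of "car" per document
```

Storing positions costs extra space, but it is what later makes phrase and proximity queries possible without rereading the documents.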
The precision vs. recall tradeoff

Due to the ambiguities of natural language, a full text search typically produces a retrieval list that has low precision: most of the items retrieved are irrelevant. Controlled-vocabulary searching solves this problem by tagging the documents in such a way that the ambiguities are eliminated. However, a controlled vocabulary search may have low recall: it may fail to retrieve some documents that are actually relevant to the search question. Despite the presence of many irrelevant documents in a free text search's retrieval list, a free text search may be able to locate a document that a controlled vocabulary search failed to retrieve.
See also: Precision and recall
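
The tradeoff can be made concrete with the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document identifiers below are invented purely to illustrate the two situations described above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieval list against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


relevant_docs = [1, 2, 9]  # hypothetical ground truth

# A free text search that returns many documents, most of them irrelevant.
free_text = precision_recall([1, 2, 3, 4, 5, 6, 7, 8], relevant_docs)

# A controlled vocabulary search that returns one document, which is relevant.
controlled = precision_recall([1], relevant_docs)

print(free_text)   # lower precision, higher recall
print(controlled)  # higher precision, lower recall
```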
The false positive problem

Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives. The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language.

Certain clustering techniques based on Bayesian algorithms (similar to the spam filter in Gmail[citation needed]) can help reduce false positive errors. If the search term is "football", for example, these techniques can divide the document universe into categories such as "American football", "corporate football", and so on. Depending on the occurrences of words in it, a document can fall into one or more of these categories. These techniques are being extensively deployed in the e-discovery domain.
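
The text does not say which Bayesian method is meant. As one possible illustration, the sketch below swaps in a small multinomial naive Bayes classifier with add-one smoothing, which assigns a document to a category based on the words it contains. The category names, the training texts, and the function names `train` and `classify` are all assumptions made for this example.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: list of (category, text). Returns word counts and document counts per category."""
    word_counts = {}
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the most probable category for the text under naive Bayes."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in word_counts.items():
        # Log prior plus the sum of smoothed log likelihoods of each word.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training data for two senses of "football"
training = [
    ("American football", "quarterback touchdown nfl football"),
    ("association football", "goalkeeper penalty fifa football"),
]
word_counts, doc_counts = train(training)

print(classify("football touchdown pass", word_counts, doc_counts))
```

A search front end could use such category labels to group or filter results for an ambiguous term, which is the effect the paragraph above describes.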
Improving the performance of full text searching

The deficiencies of free text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.
Improved querying tools

    * Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
    * Field-restricted search. Some search engines enable users to limit free text searches to a particular field within a stored data record, such as "Title" or "Author."
    * Boolean queries. Searches that use Boolean operators (for example, "encyclopedia" AND "online" NOT "Encarta") can dramatically increase the precision of a free text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the retrieval list contains too few documents, the OR operator can be used to increase recall; consider, for example, "encyclopedia" AND "online" OR "Internet" NOT "Encarta". This search will also retrieve documents about online encyclopedias that use the term "Internet" instead of "online." The increase in precision from Boolean operators is very commonly counter-productive, since it usually comes with a dramatic loss of recall.[2]
    * Phrase search. A phrase search matches only those documents that contain a specified phrase, such as "Wikipedia, the free encyclopedia."
    * Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
    * Proximity search. A proximity search matches only those documents that contain two or more words that are separated by no more than a specified number of words; a search for "Wikipedia" WITHIN2 "free" would retrieve only those documents in which the words "Wikipedia" and "free" occur within two words of each other.
    * Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
    * Wildcard search. A search that substitutes a wildcard character, such as an asterisk, for one or more characters in a search query. For example, using the asterisk in the search query "s*n" will find "sin", "son", "sun", etc. in a text.
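
The Boolean queries in the list above map naturally onto set operations over an index of terms. The sketch below is one way to express that, using invented sample documents. The example query with OR is written without parentheses in the text, so grouping "online" OR "Internet" together is an assumption about the intended meaning.

```python
# Hypothetical sample collection
docs = {
    1: "online encyclopedia edited by volunteers",
    2: "Encarta was an online encyclopedia",
    3: "an Internet encyclopedia of philosophy",
    4: "printed encyclopedia in twelve volumes",
}

# Build a simple index: term -> set of ids of documents containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)


def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term.lower(), set())


# "encyclopedia" AND "online" NOT "Encarta"
narrow = (postings("encyclopedia") & postings("online")) - postings("Encarta")

# "encyclopedia" AND ("online" OR "Internet") NOT "Encarta"
broad = (
    postings("encyclopedia") & (postings("online") | postings("Internet"))
) - postings("Encarta")

print(sorted(narrow))
print(sorted(broad))
```

AND corresponds to set intersection, OR to union, and NOT to set difference, so adding OR can only widen the result and adding AND or NOT can only narrow it.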

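Proximity and wildcard queries from the list above can be sketched in a few lines. The helper names `within` and `wildcard` are hypothetical, "within two words" is interpreted here as word positions differing by at most two, and the wildcard helper relies on the shell-style matching of Python's standard `fnmatch` module, in which an asterisk matches any run of characters.

```python
import fnmatch
import re


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def within(text, first, second, max_distance):
    """True if `first` and `second` occur at most `max_distance` positions apart."""
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


def wildcard(text, pattern):
    """Return the distinct words in `text` that match a pattern such as 's*n'."""
    return sorted({w for w in tokenize(text) if fnmatch.fnmatchcase(w, pattern)})


title = "Wikipedia, the free encyclopedia"
print(within(title, "Wikipedia", "free", 2))          # positions 0 and 2
print(within(title, "Wikipedia", "encyclopedia", 2))  # positions 0 and 3

print(wildcard("The sun and the son saw a sign of sin", "s*n"))
```

Because the asterisk can stand for several characters, the pattern "s*n" also matches longer words such as "sign", not only three-letter words.
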
Improved search algorithms

Google's PageRank algorithm gives more prominence to documents to which other Web pages have linked[citation needed]. See search engine for additional examples.
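
The idea that inbound links raise a document's prominence can be illustrated with a simplified power-iteration version of PageRank. This is a textbook-style sketch, not a description of Google's production system; the link graph is invented, and the damping factor of 0.85 and the fixed iteration count are conventional choices assumed for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to about 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to C, and C links to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ranks highest because it has the most inbound links, and A ranks above B and D because its single inbound link comes from the highly ranked C.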
Text retrieval software

The following is a partial list of available software products whose predominant purpose is to perform full text indexing and searching. Some of these are accompanied by detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full text search may be accomplished.
Open Source projects

    * DataparkSearch
    * ht://Dig
    * Lemur/Indri
    * Lucene
    * Ferret
    * Minion
    * mnoGoSearch
    * Sphinx
    * Swish-e
    * Xapian
    * Hibernate Search

Proprietary Solutions

    * Attivio
    * Autonomy Corporation
    * Brainware
    * BRS/Search
    * Dieselpoint
    * Endeca
    * Exalead
    * Fast Search & Transfer
    * Inktomi
    * JackalFish
    * Vivísimo
    * dtSearch

Notes

   1. ^ In practice it may be difficult to determine how a given search engine works. The search algorithms actually employed by web search services are seldom fully disclosed out of fear that web entrepreneurs will use search engine optimization techniques to improve their prominence in retrieval lists.
   2. ^ Studies have repeatedly shown that most users do not understand the negative impacts of Boolean queries.[1]

See also

    * Controlled vocabulary
    * Information retrieval
    * Search engine
    * Search engine indexing - how search engines generate indices to support full text searching
    * Subject Indexing

Retrieved from "http://en.wikipedia.org/wiki/Full_text_search"
    * This page was last modified on 18 October 2009 at 23:14.
    * Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of Use for details.
      Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.