Wednesday, September 18, 2013
Enterprise search starts with a user looking for information and submitting a search query. A search query is typically a list of keywords (terms) or a phrase. The search engine looks for all records that match the request and returns a list of results to the user, ranked from most relevant to least relevant for the request.
Let's look at search in more detail.
There are two performance measures for evaluating the quality of query results: precision and recall.
Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of all relevant documents in the collection that the search actually retrieved. Precision is often described as a measure of the usefulness of a result, while recall is a measure of its completeness.
Modern search engines aim to provide high recall with good precision. It is easy to achieve high recall by simply returning all documents in the collection for every query. However, the precision in this case would be poor. A key challenge is how to increase precision without sacrificing recall. For example, many web search engines provide reasonably good recall but poor precision. In other words, a user gets some relevant results, usually in the first 10 to 20 results, along with many non-relevant results.
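The two measures, and the trade-off between them, can be sketched in a few lines of Python. This is only an illustration: the function name precision_recall and the document IDs are made up for the example, and it assumes relevance is a simple yes/no judgment per document.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the engine returned.
    relevant:  IDs of all documents in the collection that are relevant.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of ten documents, three of which are relevant.
collection = [f"d{i}" for i in range(1, 11)]
relevant = ["d1", "d2", "d5"]

# A selective query result: two of the four returned documents are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], relevant)

# Returning the whole collection: every relevant document is found,
# but most of what comes back is noise.
p_all, r_all = precision_recall(collection, relevant)
```

In the first case precision is 2/4 and recall is 2/3. In the second case recall reaches 1.0 while precision drops to 3/10, which is the "return everything" situation described above.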
Relevancy is a numerical score assigned to a search result that represents how well the result meets the information need of the user who submitted the query. Relevancy is therefore a subjective measure of the quality of the results as defined by the user. The higher the score, the higher the relevance.
For every document in a result, a search engine calculates and assigns a relevancy score. TF-IDF is a standard relevancy heuristic used, in some form, by most search engines. It combines the TF and IDF variables to provide a ranking score for each document.
TF stands for Term Frequency. This is the number of times a word (or term) appears in a single document, expressed as a fraction of the total number of terms in the document. Term frequency assumes that when evaluating two documents, document A and document B, the one that contains more occurrences of the search term is probably also more "relevant" to the user. Dividing by the document length prevents a bias towards longer documents.
IDF stands for Inverse Document Frequency. This is a measure of the general importance of the term, based on the ratio of all documents in the set to the documents that contain the term (usually the logarithm of that ratio). IDF reduces the weight of terms that appear in many documents, so that common words contribute little to the score and rare terms contribute more.
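As a rough sketch of how the two variables combine, the following Python uses one common textbook formulation: TF as a length-normalized count and IDF as the logarithm of the document ratio. Real engines use many variants (smoothing, different normalizations), so the function names and the tiny corpus here are illustrative assumptions rather than any particular engine's formula.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of term in the document, divided by the document length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # term is absent from the collection; no useful weight
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """Score of one term for one document: the product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical corpus: each document is already split into terms.
corpus = [
    ["enterprise", "search", "index"],
    ["search", "query", "search", "results"],
    ["document", "management", "system"],
]

score_doc0 = tf_idf("search", corpus[0], corpus)
score_doc1 = tf_idf("search", corpus[1], corpus)
```

Here the second document scores higher for "search" because the term makes up a larger share of its text, and a term that occurs in only one document, such as "index", gets a higher IDF than "search", which occurs in two.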
Additional techniques may put more emphasis on other attributes to determine relevancy. One example is freshness (when the document was created or last updated). Another is the part of the document that matched the term: a match in the document title or author field may score higher than a match in the text body.
Modern search engines provide good relevancy scoring across a wide range of document formats, but more importantly, allow users to create and use their own relevancy scoring profiles optimized for their queries. These user-defined weights, also called boosting, can be set up and run for a user, group of users, or per query. This is extremely helpful for personalizing the search experience by roles or departments within the organization.
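One simple way to picture boosting is as a set of per-field weights applied to the match scores before they are summed. The sketch below assumes that model; the field names, the profile contents, and the boosted_score function are hypothetical and do not correspond to a specific product's configuration.

```python
def boosted_score(field_scores, boosts):
    """Weighted sum of per-field match scores.

    field_scores: match score per field for one document, e.g. {"title": 0.5}.
    boosts:       weight per field; fields without an entry get weight 1.0.
    """
    return sum(score * boosts.get(field, 1.0)
               for field, score in field_scores.items())


# Hypothetical relevancy profiles for two groups of users.
default_profile = {}
legal_profile = {"title": 3.0, "author": 2.0}

# Per-field scores for one document and one query (made-up values).
doc_scores = {"title": 0.5, "body": 0.2}

plain = boosted_score(doc_scores, default_profile)
boosted = boosted_score(doc_scores, legal_profile)
```

With the default profile the title and body matches count equally. With the second profile the same document scores higher because its title match is weighted three times as much, which is how a profile can be tuned per role or department.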
Linguistics is a vital component of any search solution. It refers to the processing and understanding of text in unstructured documents or text fields. There are two parts to linguistics: syntax and semantics.
Syntax is about breaking text into words and numbers, a step also called tokenization. Semantics is the process of finding the meaning behind text, from the level of words and phrases to the level of paragraphs, a document, or a set of documents. Semantic analysis often involves grammatical description and deconstruction, morphology, phonology, and pragmatics. One major challenge is the ambiguity of language.
Linguistics therefore improves relevancy and affects precision and recall. Common linguistic features in a search solution include stemming and lemmatization of words (reducing words to their root or stem form), phrasing (the recognition and grouping of idioms), removal of stop words (words that appear often in documents but contain little meaning, for example articles), spelling corrections, etc.
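A few of these features can be sketched as a small text-analysis pipeline. The version below is deliberately naive: the stop-word list is a tiny made-up sample, and naive_stem just strips a few English suffixes, which is far cruder than a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


terms = analyze("Searching the indexed documents")
# terms == ["search", "index", "document"]
```

Because both documents and queries pass through the same analysis, a query for "search" can match a document that only contains "searching", which raises recall at some risk to precision.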
One way search engines overcome the challenges of semantics and language ambiguity is navigation. In this case, the search engine uses linguistic features, such as extraction of entities (nouns and noun phrases, places, people, concepts, etc.), together with a predefined taxonomy to narrow the results. It does this by clustering related documents together or by providing useful dimensions, called facets, to slice the data. For example, facets such as price or name can be used to narrow down the search results.
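Faceted navigation can be pictured as counting the values of a field across the current results and then filtering on the value the user picks. The sketch below assumes each result is a plain dictionary of fields; the field names and sample records are invented for illustration.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)


def narrow(results, field, value):
    """Keep only the results whose field equals the selected facet value."""
    return [doc for doc in results if doc.get(field) == value]


# Hypothetical search results with a few extracted fields.
results = [
    {"id": 1, "department": "legal", "type": "contract"},
    {"id": 2, "department": "legal", "type": "memo"},
    {"id": 3, "department": "finance", "type": "memo"},
]

by_department = facet_counts(results, "department")
legal_only = narrow(results, "department", "legal")
```

The counts are what a search page would show next to each facet value, and selecting a value narrows the result list without the user having to rephrase the query.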
The Search Index
At the heart of every search engine is the search index. An index is a searchable catalog of documents created by the search engine. The search engine receives content from all source systems to place in the index. This process is called ingestion. The search engine then accepts search queries to match against the index. The index is used to quickly find relevant documents for a search query out of a collection of documents.
A common index structure is the inverted index, which maps every term in the collection to all of its locations in the collection. For example, a search for the term "A" would check the entry for "A" in the index, which contains links to all the documents that include "A".
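A minimal inverted index can be sketched as a dictionary from each term to the documents and positions where it occurs. The example below assumes whitespace tokenization and a query model in which every query term must be present; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One set of document IDs per query term; unknown terms give an empty set.
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings)


# Hypothetical documents keyed by ID.
docs = {
    1: "enterprise search index",
    2: "search query results",
    3: "document index",
}

index = build_inverted_index(docs)
single_term = search(index, "search")       # documents 1 and 2
both_terms = search(index, "search index")  # document 1 only
```

Answering a query is then a matter of looking up a few entries and intersecting them, rather than scanning every document in the collection.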