
Wednesday, September 18, 2013

The Mystery of How Enterprise Search Works (For You!)

Enterprise search starts with a user looking for information and submitting a search query. A search query is a list of keywords (terms) or a phrase. The search engine looks for all records that match the request and returns a list to the user, with results ranked from most relevant to least relevant for the request.

Let's look at search in more detail.
Performance Measures

There are two performance measures for evaluating the quality of query results: precision and recall.

Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of all relevant documents in the collection that the search retrieved. Put simply, precision measures the usefulness of a result, while recall measures its completeness.

Modern search engines aim to provide high recall with good precision. It is easy to achieve perfect recall by simply returning every document in the collection for every query, but precision would then be terrible. The key challenge is increasing precision without sacrificing recall. Most web search engines today, for example, provide reasonably good recall but poor precision: a user gets some relevant results, usually in the first 10 to 20, along with many non-relevant ones.
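
To make the two measures concrete, here is a minimal sketch in Python (the document IDs and relevance judgments are invented for illustration):

    # Precision and recall for a single query result list.
    # retrieved: document IDs returned by the engine; relevant: ground-truth set.
    def precision_recall(retrieved, relevant):
        retrieved_set = set(retrieved)
        hits = len(retrieved_set & relevant)
        precision = hits / len(retrieved_set) if retrieved_set else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 5 returned documents are relevant (precision 0.6),
    # but only 3 of the 6 relevant documents were found (recall 0.5).
    p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                            {"d1", "d3", "d5", "d7", "d8", "d9"})
    print(p, r)  # 0.6 0.5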

Relevancy

Relevancy is a numerical score assigned to a search result, representing how well that result satisfies the information need of the user who submitted the query. Relevancy is therefore a subjective measure of result quality as defined by the user. The higher the score, the higher the relevance.

For every document in a result set, the search engine calculates and assigns a relevancy score. TF-IDF is the standard relevancy heuristic used by most search engines. It combines two variables, TF and IDF, to produce a ranking score for each document.

TF stands for Term Frequency: the number of times a word (or term) appears in a document as a percentage of the total number of terms in that document. Term frequency assumes that when comparing two documents, the one that contains more occurrences of the search term is probably also more "relevant" to the user.

IDF stands for Inverse Document Frequency: a measure of the general importance of a term, typically the logarithm of the ratio of all documents in the collection to the documents that contain the term. IDF down-weights common terms that appear in many documents and therefore carry little discriminating power.
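
A worked TF-IDF sketch in Python, over an invented three-document collection, shows how a term that appears in many documents is down-weighted:

    import math

    # Toy collection: three "documents" as token lists.
    docs = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "dogs and cats are pets".split(),
    ]

    def tf(term, doc):
        # Term frequency: occurrences of the term as a fraction of all terms.
        return doc.count(term) / len(doc)

    def idf(term, docs):
        # Inverse document frequency: log(collection size / documents with term).
        df = sum(1 for d in docs if term in d)
        return math.log(len(docs) / df) if df else 0.0

    def tf_idf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    # "cat" appears in 2 of the 3 documents, so its IDF is modest; a term
    # unique to one document would score higher.
    for i, d in enumerate(docs):
        print(i, round(tf_idf("cat", d, docs), 4))  # 0.0676, 0.0811, 0.0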

Additional techniques may put more emphasis on other attributes to determine relevancy: for example, freshness (when the document was created or last updated), or which part of the document matched the term (a match in the title or author field may score higher than one in the body text).

Modern search engines provide good relevancy scoring across a wide range of document formats and, more importantly, allow users to create and apply their own relevancy scoring profiles optimized for their queries. These user-defined weights, also called boosting, can be set up for a single user, a group of users, or an individual query. This is extremely helpful for personalizing the search experience by role or department within the organization.
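
As a minimal sketch of how such a profile might work (the field names and weights below are hypothetical, not any vendor's actual configuration):

    # A hypothetical relevancy profile: per-field boost weights that a user,
    # a group, or a single query could override.
    DEFAULT_PROFILE = {"title": 3.0, "author": 2.0, "body": 1.0, "freshness": 0.5}

    def score(field_scores, profile=DEFAULT_PROFILE):
        # field_scores: raw per-field match scores, e.g. TF-IDF per field.
        return sum(profile.get(field, 1.0) * s for field, s in field_scores.items())

    # The same raw match counts three times as much in the title as in the body.
    print(score({"title": 0.4, "body": 0.4}))  # 1.6

    # A department that cares about recency could boost freshness per query.
    legal_profile = dict(DEFAULT_PROFILE, freshness=2.0)
    print(score({"body": 0.4, "freshness": 0.3}, legal_profile))  # 1.0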

Linguistics

Linguistics is a vital component of any search solution. It refers to the processing and understanding of text in unstructured documents or text fields. There are two parts to linguistics: syntax and semantics.

Syntax is about breaking text into words and numbers, which is also called tokenization. Semantics is the process of finding the meaning behind text, from the level of words and phrases up to paragraphs, a whole document, or a set of documents. Semantic analysis often involves grammatical analysis, morphology, phonology, and pragmatics. One major challenge is the ambiguity of language.

Linguistics therefore improves relevancy and affects precision and recall. Common linguistic features in a search solution include stemming and lemmatization (reducing words to their root or stem form), phrasing (the recognition and grouping of idioms), removal of stop words (words that appear often in documents but carry little meaning, such as articles), spelling correction, and so on.
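
A toy pipeline illustrating tokenization, stop-word removal, and stemming; the suffix-stripping rule below is deliberately crude (real engines use full stemmers such as Porter's):

    # Toy linguistic pipeline: tokenize, drop stop words, strip common suffixes.
    STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are"}

    def tokenize(text):
        return [t for t in text.lower().split() if t.isalnum()]

    def stem(token):
        # Naive suffix stripping; a stand-in for a real stemmer.
        for suffix in ("ing", "ers", "er", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def analyze(text):
        return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

    # "searchers", "searching" and "searches" all reduce to the stem "search".
    print(analyze("The searchers are searching the searches"))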

Navigation

One way search engines overcome the challenges of semantics and language ambiguity is navigation. Here the search engine uses linguistic features, such as extraction of entities (nouns and noun phrases, places, people, concepts, etc.) and a predefined taxonomy, to narrow the results: it clusters related documents together or provides useful dimensions, called facets, to slice the data, for example by price or name.

The Search Index

At the heart of every search engine is the search index. An index is a searchable catalog of documents created by the search engine. The search engine receives content from all source systems and places it in the index; this process is called ingestion. The search engine then accepts search queries and matches them against the index. The index is what allows relevant documents to be found quickly for a search query out of a large collection of documents.

A common index structure is the inverted index, which maps every term in the collection to all of its locations in the collection. For example, a search for the term "A" checks the entry for "A" in the index, which contains links to all the documents that include "A".
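
A minimal inverted index sketch in Python; the two sample documents are invented:

    from collections import defaultdict

    # Map every term to the (document, position) pairs where it occurs.
    def build_index(docs):
        index = defaultdict(list)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term].append((doc_id, pos))
        return index

    docs = {1: "enterprise search starts with a user",
            2: "the search engine matches the query against the index"}
    index = build_index(docs)

    # A lookup touches only the index entry, never the whole collection.
    print(index["search"])  # [(1, 1), (2, 1)]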

Thursday, June 20, 2013

Intelligent Search and Automated Metadata

The inability to identify the value in unstructured content is the primary challenge in any application that requires the use of metadata. Search cannot find and deliver relevant information in the right context, at the right time, without good-quality metadata.

An information governance approach is required: one that creates the infrastructure framework to encompass automated intelligent metadata generation, auto-classification, and the use of goal- and mission-aligned taxonomies. From this framework, intelligent metadata enabled solutions can be rapidly developed and implemented. Only then can organizations leverage their knowledge assets to support search, litigation, e-discovery, text mining, sentiment analysis and open source intelligence.

Manual tagging is still the primary approach used to describe content, and it often lacks any alignment with enterprise business goals. This subjectivity and ambiguity carries over into search, resulting in inaccuracy and an inability to find relevant information across the enterprise.

Metadata used by search engines may consist of end-user tags, pre-defined tags, or metadata generated using system-defined rules, keyword and proximity matching, extensive rule building, end-user ratings, or artificial intelligence. Typically, search engines provide no way to rapidly adapt to organizational needs or account for an organization’s unique nomenclature.

More effective is implementing an enterprise metadata infrastructure that consistently generates intelligent metadata using concept identification. With this profoundly different approach, relevant documents are retrieved regardless of where they reside, even if they don’t contain the exact search terms, because the concepts and relationships between similar content have been identified. Eliminating end-user tagging, and the organizational ambiguity that comes with it, enables the enriched metadata to be used by any search engine index, for example ConceptSearch, SharePoint, Solr, Autonomy or Google Search Appliance.

Only when metadata is consistently accurate and trusted by the organization can improvements be achieved in text analytics, e-discovery and litigation support.

In the exploding age of big data, and more specifically text analytics, sentiment analysis and even open source intelligence, the ability to harness the meaning of unstructured content in real time improves decision-making and enables organizations to proactively act with greater certainty on rapidly changing business complexities.

To achieve an effective information governance strategy for unstructured content, results are predicated on the ability to find information and eliminate inappropriate information. The core enterprise search component must be able to incorporate and digest content from any repository, including faxes, scanned content, social sites (blogs, wikis, communities of interest, Twitter), emails, and websites. This provides a 360-degree corporate view of unstructured content, regardless of where it resides or how it was acquired.

Ensuring that the right information is available to end users and decision makers is fundamental to trusting the accuracy of the information and is another key requirement in intelligent search. Organizations can then find the descriptive needles in the haystack to gain competitive advantage and increase business agility.

An intelligent metadata enabled solution for text analytics analyzes and extracts highly correlated concepts from very large document collections. This enables organizations to build an ecosystem of semantics that delivers understandable, trusted results and is continually updated in real time.

Applying the concept of intelligent search to e-discovery and litigation: traditional information retrieval systems use "keyword searches" of text and metadata to identify and filter documents. The challenges and escalating costs of e-discovery and litigation support continue to grow. Intelligent search reduces these costs and alleviates many of the challenges.

Content can be presented to knowledge professionals in a manner that enables them to identify relevant information more rapidly and more accurately. Significant benefits come from removing ambiguity in content and identifying the concepts within a large corpus of information. This methodology speeds up the work and reduces costs, offering an effective solution to many challenges that typically go unsolved in e-discovery and litigation support.

Organizations must incorporate an approach that addresses the lack of an intelligent metadata infrastructure. Intelligent search, a by-product of the infrastructure, must encourage, not hamper, the use and reuse of information and be rapidly extendable to address text mining, sentiment analysis, e-discovery, and litigation support.

The additional components of auto-classification and taxonomies complete the core infrastructure to deploy intelligent metadata enabled solutions, including records management, data privacy, and migration. Search can no longer be evaluated on features, but on proven results that deliver insight into all unstructured content.

Thursday, May 9, 2013

Search Engine Technology

Modern web search engines are highly intricate software systems built on technology that has evolved over the years. There are a few categories of search engines, each suited to specific search needs.

These include web search engines (e.g. Google), database or structured data search engines (e.g. Dieselpoint), and mixed search engines or enterprise search.

The most prevalent search engines, such as Google and Yahoo!, utilize hundreds of thousands of computers to process billions of web pages and return reasonably well-targeted results. Because of this high volume of queries and text processing, the software is required to run in a highly distributed environment with a high degree of redundancy.

Search Engine Categories

Web search engines

These search engines are specifically designed for searching web pages, and were developed to make searching a very large number of pages practical. They follow a multi-stage process: crawling pages and extracting the salient terms from their contents, indexing those terms in a semi-structured form (for example, a database), and answering user queries by returning mostly relevant links to the indexed documents or pages.

Crawl

In the case of a wholly textual search, the first step in classifying web pages is to find an "index item" that relates expressly to the "search term." Most search engines use sophisticated algorithms to decide when to revisit a particular page and re-check its relevance. These algorithms range from a constant visit interval (with higher priority for more frequently changing pages) to an adaptive visit interval based on several criteria, such as frequency of change, popularity, and overall quality of the site. The speed of the web server hosting the page, as well as resource constraints like the amount of hardware or bandwidth, also figure in.
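
A sketch of what an adaptive visit-interval policy might look like; the formula and weights are invented for illustration, not taken from any real crawler:

    # Revisit sooner when a page changes often or matters more; stay within bounds.
    MIN_INTERVAL_H, MAX_INTERVAL_H = 1, 24 * 30  # 1 hour to 30 days

    def next_visit_interval(change_rate, popularity, site_quality):
        # change_rate: observed changes per day; popularity, site_quality: 0..1.
        base = 24.0 / max(change_rate, 1e-3)        # hours between expected changes
        priority = 1.0 + popularity + site_quality  # important sites checked sooner
        return max(MIN_INTERVAL_H, min(MAX_INTERVAL_H, base / priority))

    # A popular page changing twice a day is revisited within hours;
    # a stale, obscure page drifts toward the 30-day cap.
    print(next_visit_interval(2.0, 0.9, 0.8))   # ~4.4 hours
    print(next_visit_interval(0.01, 0.1, 0.5))  # 720.0 (capped)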

Link map

The pages discovered by web crawls are fed into another computer that builds a map of the uncovered resources. This map looks like a graph, in which pages are represented as nodes connected by the links between them. The mass of data is stored in multiple data structures that allow quick access by algorithms that compute a popularity score for each page based on how many links point to it.
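
The best-known such algorithm is PageRank; here is a compact sketch over a tiny invented link graph (ignoring refinements such as dangling pages):

    # Each page repeatedly shares its score among the pages it links to.
    def pagerank(links, damping=0.85, iterations=50):
        # links: {page: [pages it links to]}
        pages = set(links) | {t for targets in links.values() for t in targets}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, targets in links.items():
                for t in targets:
                    new_rank[t] += damping * rank[page] / len(targets)
            rank = new_rank
        return rank

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))  # "c", with two inbound links, scores highest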

Database Search Engines

Searching for text-based content in databases presents a few special challenges, out of which a number of specialized search engines have developed. Databases are slow when resolving complex queries (with multiple logical or string-matching conditions). Databases support pseudo-logical queries that full-text searches do not use. No crawling is necessary for a database, since the data is already structured. However, it is often necessary to index the data in a more compact form designed to allow faster search.

Mixed Search Engines

Sometimes the data to be searched contains both database content and web pages or documents, and search engine technology has evolved to respond to both sets of requirements. Most mixed search engines are large web search engines, like Google, that search both structured and unstructured data sources. Pages and documents are crawled and indexed in one index, while databases are indexed from their various sources into others. Search results are then generated for users by querying these multiple indices in parallel and combining the results according to "rules".
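
A sketch of that merging step, with invented indices and scores; real engines apply far richer rules, but per-index normalization plus a global top-k captures the idea:

    import heapq

    # Two "indices" with different score scales: a web index and a database index.
    web_index = {"laptops": [("page7", 12.0), ("page3", 8.5)]}
    db_index = {"laptops": [("row42", 0.9), ("row17", 0.4)]}

    def federated_search(query, indices, k=10):
        merged = []
        for name, index in indices.items():
            results = index.get(query, [])  # [(doc_id, score), ...]
            top = max((s for _, s in results), default=1.0) or 1.0
            for doc_id, score in results:
                # Normalize per index so one scale cannot dominate the others.
                merged.append((score / top, name, doc_id))
        return heapq.nlargest(k, merged)

    print(federated_search("laptops", {"web": web_index, "db": db_index}))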

Saturday, March 30, 2013

Search Applications - Coveo - Advanced Website Search

In my last post, I described Coveo for advanced enterprise search. In this post, I will describe Coveo for advanced website search.

Coveo takes your existing Web Content Management to a new level of insight and productivity. Coveo creates a virtual integration layer between your WCM and all your company’s key information sources (knowledge bases, databases, cloud content) to provide powerful, consolidated insight.

Coveo recommends the most relevant content to visitors, powered by indexing and search across any set of diverse systems. You can increase customer satisfaction by surfacing the missing content your customers are looking for. Smart navigation and search make relevant content quickly apparent to your users. Website visitors are presented with relevant, related content without the administrator having to create the links and without the visitor having to search!

You can instantly correlate content from your WCM/CMS such as Sitecore with other key information sources, including your Salesforce, SharePoint, Exchange, Lithium, and more.

Features

WCM Features

Out-of-the-box usability - you can start getting insight immediately without any difficult set up or configuration.

Related Topics - automatically correlate site content to relevant content based on similar themes and attributes.

Composite Views - composite views that combine relevant site content with content from corporate systems outside the WCM (Communities, Intranet, CRM, Social, etc.)

Modular, flexible design - Template-based rendering for easy customization, reusable and extensible user controls for deep customization.

Computed Facets - configurable facets ideal for eCommerce websites, providing dynamic calculations of relevant product information such as average or summed prices, as well as filtering by price ranges.

Native integration with leading WCMs - API-Level Integration with Sitecore, SharePoint and SDL Tridion provides support for live indexing, security trimming and metadata search.

Faceted Search and Navigation - More intuitive, complementing traditional keyword searches with guided navigation and conversational search that leverages metadata for increased relevance and precision.

Search Analytics - Provides valuable information about visitor search behavior, content usage (top queries) and gaps in content (unsuccessful queries), offering unparalleled insights into key trends and more agile decision-making.

Indexing

Audio-video Indexing - The speech in audio or video files can be indexed with the optional Audio Video Search module. It creates an accurate transcript of speech content that is aware of the enterprise's vocabulary (i.e. proper names, employee names, domain terms), and allows users to effectively search audio and video content as easily as they search document content. When searching, the exact locations of the searched terms are highlighted in the timeline of the audio or video player.

Connector Framework - Connector APIs enable easy integration with most repositories, including a flexible security API to support the security models of the indexed repositories.

Converters - Tens of file formats are supported out of the box, including PDFs, Office documents, Lotus Notes, HTML, XML, text files, etc. Metadata contained in audio and image file formats is also indexed, while the text contained in images can be indexed with the optional OCR module.

Languages - Languages are automatically identified at indexing time, improving content processing and relevance algorithms.

Metadata mapping - Regardless of the actual naming for the metadata in the indexed repositories, the system supports configurable mapping to a specified internal field representation. For instance, an index containing both Exchange and Lotus Notes emails will merge the “From” and “To” and “Subject” metadata even if they use different names for these fields.

OCR - The Optical Character Recognition (OCR) module allows the indexing of text content from files such as scanned documents stored in image or PDF files.

Pre/Post conversion scripts - Conversion scripts are hooks in the indexing pipeline that allow administrators to fully customize the way documents are indexed. There are two types of scripts: those executed before and those executed after the conversion of the document from its binary representation to indexable metadata and text.

Push API - Provides a simple way to integrate with external systems. All the calls necessary to support all the advanced features of the indexing pipeline are available through this API.

Tagging - Metadata can be injected on documents at search time, enabling search and facets on these new metadata in real-time. An example of usage is the addition of user-created tags on documents.

Security

Document Level Security - Data sources can be configured to index document permissions with content, making Early-binding security possible, or permissions can be set directly for all documents of this source.

Index Security - Security is integrated directly in the index structures to ensure that users only see content they are entitled to see. Early and Late security binding are both handled at the index level to deliver the best performance and security.

Index Segmentation - In addition to the document level securities reflecting the underlying repository permissions, the index can be segmented into collections with their own access restrictions.

Security Freshness - Changes in the group/user structure are constantly monitored and refreshed in Coveo’s security cache. An administrator can also force a refresh of the cache if required.

Security Normalization - Securities from different systems are normalized within the index so that users are automatically assigned with all proper security identifiers when accessing Coveo. This ensures that users see all the content they are entitled to see.

Super User Access - The main system administrator can grant temporary and audited rights to a specified user to search and access content to which they normally do not have access. Typical uses are e-Discovery, forensics, etc.

Reporting and Analytics

3rd party analytics integration - The Coveo analytics database allows the use of third-party reporting tools for more complex or custom reporting. An administrator can also easily configure the search interfaces to integrate third-party Web Analytics systems such as Google Analytics.

Advanced Query Analytics - Captures data on all user interactions with the search interfaces including result click-through and the use of different search UI functions. Reporting interface allows administrators to analyze the captured data, to elevate the most popular results, or select the correct result, for given queries.

Query & Indexing Logs - Comprehensive reports and statistics with graphical views on system status, queries, content, history, etc. A live console gives administrators a real-time view of what is going on in the system.

Text Analytics

Configurable Text Analytics - An administrator can configure a workflow that will create new metadata based on content analysis, rules and context, such as Themes, Named-entities, Regular Expressions.

Incremental updates - An administrator can configure update schedules to capture recent changes in the index.

Interactive Fine Tuning - Extraction parameters, normalization and blacklisting can be refined and metadata regenerated without re-indexing the full document set.

Named Entity Extraction - Entities such as persons, locations, and organizations are automatically extracted from indexed content. Additional entities can be configured in the system.

Plug-ins - Additional, 3rd party, plugins can be added to the Text Analytics workflow. For example, domain/organization specific taxonomies can be used in the process.

Rule-based Extraction - Configurable rules can be used to add specific metadata to documents.

Theme Extraction - Themes are Topics and Concepts automatically extracted from indexed content.

Next time, I will complete describing Coveo products with Coveo for service and support.

Monday, March 11, 2013

Search Applications-Coveo-Advanced Enterprise Search-Part 2

Yesterday, I mentioned that Coveo offers three products - Coveo for advanced enterprise search, Coveo for advanced website search, and Coveo for service and support. I presented some features of the Coveo for advanced enterprise search product.

Today, I will complete presenting this product.

User Interface

Coveo InsightBox - different field values are suggested while a user types a query.

Facets: AND/OR mode - for specific facets, multi-selection can be switched from OR (default) to AND.

Facets: Computed Fields - computations can be applied to specific facets: sums, averages, minimums, maximums.

Facets: Search-as-you-type - search-as-you-type on all unique values of specified facets.

Facets: Sort-by - facet values can be sorted based on their label, count or computed value.

Mobile UI - user interface compatible with iOS browsers.

Relevance Ranking - results are ordered by default based on user profile, social and other context. This can be easily tuned and configured by an administrator.

Result Sort-by any Field - results can be ordered by any field. This is configured by the administrator.

Secure, Federated search - log in once in Coveo and search, navigate and consolidate dozens of different data sources simultaneously. Trim results based on user permissions.

Faceted Search and Navigation - relevant metadata and fields can be used to populate facets within a result set. User can select multiple facet values dynamically and get instant changes in the result list.

Advanced User Interface

Export Results to Excel - results can be downloaded and opened in Excel. Administrator can select which metadata to include in the export.

Floating Searchbar - Windows users can activate a floating Search bar and search Coveo without the need to start a Web browser.

Outlook Sidebar - Outlook users get contextual results based on their context and selection. They can also search for emails, files, people and SharePoint without leaving Outlook.

Tagging - users have the ability to add custom tags and annotations to results. Tags are searchable, are applied in the index in real-time and are available to other users.

Widgets - an administrator can configure widgets to display results in advanced visual representation.

Windows Desktop Indexing - Windows users can search their local files and email archives. A desktop agent is required to capture content and synchronize it with the centralized Coveo index.

Relevance

Configurable Ranking - administrator can assign weights to more than 20 ranking attributes, such as term proximity, TF-IDF, dates, terms in title, content reputation.

Query Correction - the spelling of the query is checked against the index content in order to suggest proper spelling even for words that are not normally part of general dictionaries (example: internal project codes, people names, etc.)

Query Ranking Expressions (QRE) - for each UI, an administrator can configure specific ranking rules, based on context and result set. This is an easy and flexible way to promote content based on profile attributes of the current user, such as locations, history, languages, roles.

Stemming - variations of a keyword with a similar basic meaning are treated as synonyms, broadening the search when required.

Thesaurus - an administrator can create a thesaurus and link it to the query. A thesaurus can be created from scratch or imported from existing enterprise content.

Top Results - an administrator can assign specific results to appear at the top of the list for specific queries.

Collaborative/Social ranking - click-through data and manual document ratings are used in the relevance calculation. These are automatically shared among colleagues based on their social proximity.

Administration and Configuration

Admin Roles - the main system administrator can delegate partial administration permissions based on roles (interface designer, system administrator, collection administrator, etc.)

APIs - administration APIs allow custom development and integration of the administration functions into external systems.

Audience Management - administrator can define multiple audiences and assign specific UIs to them.

Installation Kit - everything is installed and initially configured using an install kit for easy deployment.

Interface Editor - for each User Interface, an administrator can configure Result templates, CSS, Facets, Sort-keys and other parameters.

Monitoring/Email alerts - different system conditions are monitored and email alerts can be sent to report important system events, such as disk space running low.

Web-based Administration UI - simple, Web-based UI for easy administration.

I will describe Coveo for advanced website search in my next post.

Sunday, March 10, 2013

Search Applications-Coveo -Advanced Enterprise Search - Part 1


Coveo offers three products - Coveo for advanced enterprise search, Coveo for advanced website search, Coveo for service and support. Today, I am going to present Coveo for advanced enterprise search. This product has many features, so I will start presenting them today and will finish tomorrow.

Coveo for advanced enterprise search is the enterprise search solution that automatically organizes your company’s information into actionable, on-demand knowledge. Coveo's powerful enterprise search engine correlates and analyzes all your company’s information sources, wherever they reside. All the information in your SharePoint, CRM, email, cloud content, and file servers is now instantly accessible from one place.

Features

Access Real-time information from anywhere - federate searches on enterprise, social and cloud data securely and in real time—regardless of format or source.

Transform how your users access information - seamlessly integrate within existing applications and workflows to maximize impact and minimize disruption.

Digest, synthesize and utilize information faster - automatic metadata and entity extraction, themes and tagging combine to help users discover content and share findings.

Navigate content with ease - dynamic, searchable facets provide an ability to navigate to the most relevant content.

Simple to set-up and deploy with existing resources - as easy to use as any consumer web app, coupled with enterprise-grade robustness and scalability.

No hassle security integration - secure configuration out of the box is safe and easy.

Indexing

Audio-video Indexing - the speech in audio or video files can be indexed with the optional Audio Video Search module. It creates an accurate transcript of speech content that is aware of the enterprise's vocabulary (i.e. proper names, employee names, domain terms), and allows users to effectively search audio and video content as easily as they search document content. When searching, the exact locations of the searched terms are highlighted in the timeline of the audio or video player.

Connector Framework - connector APIs enable easy integration with most repositories, including a flexible security API to support the security models of the indexed repositories.

Converters - multiple file formats are supported out of the box, including PDFs, Office documents, Lotus Notes, HTML, XML, text files, etc. Metadata contained in audio and image file formats is also indexed, while the text contained in images can be indexed with the optional OCR module.

Languages - languages are automatically identified at indexing time, improving content processing and relevance algorithms.

Metadata mapping - regardless of the actual naming for the metadata in the indexed repositories, the system supports configurable mapping to a specified internal field representation. For instance, an index containing both Exchange and Lotus Notes emails will merge the "From", "To" and "Subject" metadata even if they use different names for these fields.

OCR - the Optical Character Recognition (OCR) module allows the indexing of text content from files such as scanned documents stored in image or PDF files.

Pre/Post conversion scripts - conversion scripts are hooks in the indexing pipeline that allow administrators to fully customize the way documents are indexed. There are two types of scripts: those executed before and those executed after the conversion of the document from its binary representation to indexable metadata and text.

Push API - provides a simple way to integrate with external systems. All the calls necessary to support all the advanced features of the indexing pipeline are available through this API.

Tagging - metadata can be injected on documents at search time, enabling search and facets on these new metadata in real-time. An example of usage is the addition of user-created tags on documents.

Reporting and Analytics

3rd party analytics integration - Coveo analytics database allows the use of third-party reporting tools for more complex or custom reporting. An administrator can also easily configure the search interfaces to integrate third-party web analytics systems such as Google Analytics.

Advanced Query Analytics - captures data on all user interactions with the search interfaces including result click-through and the use of different search UI functions. Reporting interface allows administrators to analyze the captured data, to elevate the most popular results, or select the correct result for given queries.

Query and Indexing Logs - comprehensive reports and statistics with graphical views on system status, queries, content, history, etc. A live console gives administrators a real-time view of what is going on in the system.

Scalability and Fault Tolerance

Distributed Indexing - the indexing process is distributed across many Index Slices, each one indexing part of the content. Slices can be hosted locally (on local drives or on a SAN) or on separate servers (through an IP connection), providing a highly scalable architecture.

Failover and Query Scalability - index mirroring system provides high availability (if one mirror fails, the others can continue serving queries). The number of queries that can be answered per second can be doubled by doubling the number of automatically synchronized mirrors.

Performance profiles - configurable performance profiles to balance indexing total throughput, query performance and time-to-index.

Query Federation/GDI - federate queries to other instances of Coveo and merge the results from all instances into a single result page while also leveraging the ranking algorithms from the different instances.

Security

Document Level Security - data sources can be configured to index document permissions with content, making early-binding security possible, or permissions can be set directly for all documents of this source.

Index Security - security is integrated directly in the index structures to ensure that users only see content they are entitled to see. Early and late security binding are both handled at the index level to deliver the best performance and security.

Index Segmentation - in addition to the document level securities reflecting the underlying repository permissions, the index can be segmented into collections with their own access restrictions.

Security Freshness - changes in the group/user structure are constantly monitored and refreshed in Coveo’s security cache. An administrator can also force a refresh of the cache if required.

Security Normalization - securities from different systems are normalized within the index so that users are automatically assigned with all proper security identifiers when accessing Coveo. This ensures that users see all the content they are entitled to see.

Super User Access - the main system administrator can grant temporary and audited rights to a specified user to search and access content to which they normally do not have access. Typical uses are e-Discovery, forensics, etc.

Text Analytics

Configurable Text Analytics - an administrator can configure a workflow that will create new metadata based on content analysis, rules and context, such as Themes, Named entities, Regular Expressions.

Incremental Updates - an administrator can configure update schedules to capture recent changes in the index.

Interactive Fine Tuning - extraction parameters, normalization and blacklisting can be refined and metadata regenerated without re-indexing the full document set.

Named Entity Extraction - entities such as persons, locations, and organizations are automatically extracted from indexed content. Additional entities can be configured in the system.

Plug-ins - additional, 3rd party, plugins can be added to the text analytics workflow. For example, domain/organization specific taxonomies can be used in the process.

Rule-based Extraction - configurable rules can be used to add specific metadata to documents.

Theme Extraction - themes are topics and concepts automatically extracted from indexed content.

More features of this product tomorrow...

Saturday, December 29, 2012

10 Signs Your Search Engine is Stalling

An integral part of your content management strategy is enterprise search. Your users need to be able to find the information and documents they are looking for. However, enterprise search does not always work properly. Users have to get more creative to compensate for the lack of good search, but is this what you really want?

Creative workarounds that users adopt to find information are a sign that your search engine is stalling. If you recognize any of the symptoms below in your content management system, take it as a call to action: your search engine needs a tune-up!

1. Querying With Magic Cookies: If your users are memorizing document IDs or using some other "magic" unrelated to natural search queries to find content, your search has a problem.

2. Off-Roading to Google: If your users are proudly (or surreptitiously) resorting to Google to get answers, take a hard look at your search tools. Google is good. But if your search engine can't outperform a generic web search engine, given that it has a much tighter domain of content and context and can be tuned to your goals, you can do better.

3. Gaming Learning: Today's search engines are "smart". No doubt, click-stream feedback is a powerful tool for improving search relevance. But if users are spending time repeatedly running the same search and clicking on the "right" document to force it to the top of the results list, your search engine isn't learning; it has a learning disability. Search engines should not have to be gamed.

4. Using Cliff Notes: If your users scribble titles and key phrases for frequently used content on everything from note cards to the backs of their hands, if they have taken to cliff-noting content to prime their searches, the search engine is not working.

5. Paper Chasing: Are your users printing out content, littering their cubes with hard copies? That is just another form of cliff-noting. Using a functioning search is easier than a paper chase, not to mention more reliable.

6. Doing the Link Tango: Badly tuned search engines tend to "fixate" on certain content, especially content with lots of links to other content. Smart users often take advantage of this tendency to click through on anchor articles and then ricochet through the link structure to find the actual content they need. If your users are doing the "link tango" for information, you know that your users are great, but your search does not work.

7. Lots of "Game Over" Search Sessions: When search strings bring back large amounts of content that earn no click-through (document views), your users are having the "game over" experience. Unable to identify what is relevant in this sea of material, they are forced to cheat to stay in the game. Smart search engines provide navigation, faceted guidance or clarifying questions to prevent "game over" interactions.

8. Dumbing it Down: Your users are verbally adept, and they would love to ask questions "naturally". If your analytics are telling you that most users' searches have disintegrated into one- or two-word queries, take it as a direct reflection of your search engine's lack of intelligence. A search engine competent in natural language typically receives about 20% of queries in seven words or more and about half in three or more, with fewer than 20% as one- or two-word queries.

9. Easter Eggs: If your users tell you they often find interesting new content by stumbling upon it, your search engine is delivering "Easter Eggs." Finding good content by accident, especially new content, signals a poorly tuned search engine. New content is usually highly relevant content and ought to be preferred by smart retrieval algorithms.

10. Taxi Driver Syndrome: Taxi drivers will tell you that they don't want a map, that they don't need a map, because they have memorized the map. Unlike a city's topography, a knowledge base changes frequently. So if your users are saying they don't want or need search, it's not because they have memorized all your content. What they are really saying is: we don't need search that doesn't work.

If your users have to get more and more creative to get the job done, thank them. And then reward them and your business by tuning or upgrading your search engine. It will pay off in efficiency, customer satisfaction, and employee and customer retention.

Thursday, September 20, 2012

Faceted Search

Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order.

Facets correspond to properties of the information elements. They are often derived by analysis of the text of an item using entity extraction techniques or from pre-existing fields in a database such as author, descriptor, language, and format. Thus, existing web-pages, product descriptions or online collections of articles can be augmented with navigational facets.

Faceted search has become the de facto standard for e-commerce and product-related websites, and other content-heavy sites use it as well. It has become so popular that users are used to it and even expect it.

Faceted search lets users refine or navigate a collection of information by using a number of discrete attributes – the so-called facets. A facet represents a specific perspective on content that is typically clearly bounded and mutually exclusive. The values within a facet can be a flat list that allows only one choice (e.g. a list of possible shoe sizes) or a hierarchical list that allows you to drill down through multiple levels (e.g. product types, Computers > Laptops). The combination of all facets and values is often called a faceted taxonomy. These faceted values can be added directly to content as metadata or extracted automatically using text mining software.

For example, a recipe site using faceted search can allow users to decide how they’d like to navigate to a specific recipe, offering multiple entry points and successive refinements.

As users combine facet values, the search engine is really launching a new search based on the selected values, which lets users see how many documents remain in the set for each remaining facet choice. So while users think they are navigating a site, they are really doing the dreaded advanced search.
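
A sketch of that re-count step, over an invented product catalog:

    from collections import Counter

    products = [
        {"type": "Laptops", "brand": "Acme", "price": 900},
        {"type": "Laptops", "brand": "Bolt", "price": 1400},
        {"type": "Desktops", "brand": "Acme", "price": 700},
    ]

    def refine(items, **selected):
        # Keep only items matching every selected facet value.
        return [p for p in items if all(p[f] == v for f, v in selected.items())]

    def facet_counts(items, facet):
        # How many results each remaining facet choice would leave.
        return Counter(p[facet] for p in items)

    laptops = refine(products, type="Laptops")
    print(facet_counts(laptops, "brand"))  # Counter({'Acme': 1, 'Bolt': 1})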

There are best practices in establishing facets. They are:

do not create too many facets - presenting users with 20 different facets will overwhelm them; users will generally not scroll far beyond the initial screen to locate your more obscure facets;

base facets on key use cases and known user access patterns - identify the key ways users search and navigate your site. Analyzing search logs, evaluating competitor sites, and user research and testing are great ways to figure out what key access points users are looking for. Interviewing as few as 10 users will often give you great insight into what the facet structure should be;

order facets and values based on importance - not all facets are equally important. Some access points matter more than others depending on what users are doing and where they are in the site. Present the most popular facets at the top. When determining order for navigation, again think about your users and why they are coming to your site.

leverage the tool to show and hide facets and values - while the free or low-cost faceted search tools don’t all offer these configuration options, more sophisticated faceted search solutions allow you to create rules to progressively disclose facets.

Think of a site offering online greeting cards. While the visual theme of the card – teddy bears, a sunset, golf – might eventually be important to a user, it probably isn’t the first place they will start their search. They will likely start with occasion (birthday, Christmas), or recipient (father, friend), and then become interested in themes further down the line. Accordingly, we might hide the “themes” facet until a user has selected an occasion or recipient. You can selectively present facets based on your understanding of your users and their typical search patterns (as mentioned in the previous “do”).

Also take advantage of the search engine’s clutter-reducing features, such as the “more...” link. This allows you to present only the most popular items and hide the rest until the user specifically requests to see them. You can also do this at the facet level, collapsing lesser-used facets to present just the category name and let users who are interested expand that facet.

facet display should be dependent on the area of the site. If you are in the first few layers of your site, you should show fewer facets with more values exposed, whereas if you are deeper into product information you should show more facets, some with values exposed and others hidden.

create your taxonomy with faceted search in mind - a good taxonomy goes a long way in making a successful faceted search interface.

There are some important guidelines to follow in taxonomy design. Facets need to be well defined, mutually exclusive and clearly labeled. For example, having one facet called “Training” and another “Events” is confusing: where do you put a seminar? Is it training or an event? If you have to wonder, your users will too. The taxonomy's depth (how many levels down it goes) and breadth (how many facets wide it is) are other important considerations. Faceted search works better with a broad taxonomy that is relatively shallow, as this lets users combine more perspectives rather than get stuck in an eternal drill-down, which causes fatigue. The facet configuration and display rules will help you create the optimal progressive presentation of these facets so as to not overwhelm users with the breadth.

If you are torn between two places in the taxonomy for a term, consider putting it in both places. This is called polyhierarchy, and it is a good way to ensure findability from multiple perspectives. Polyhierarchy is best served within a facet rather than across multiple facets. Since facets should be mutually exclusive, you shouldn’t have much need to repeat terms across facets, which can be more confusing than helpful.

The most important thing however, is to be prepared to break any of these rules in the name of usability. Building a faceted taxonomy involves understanding your users’ search behavior.

As the trend towards increased social computing continues, Web 2.0 concepts are entering the realm of faceted search. We are starting to see social tags being used in faceted search and browse interfaces. Buzzillions.com, a product-review site, is using social tag-based facets in its navigation, allowing users to refine results based on tags grouped as "Pros" or "Cons". This site uses a nice blend of free social tagging and control to ensure good user experience; when you type in a tag to add to a product review, type-ahead verifies existing tags and prompts you to select one from the existing list of matches to maximize consistency.

Ultimately, navigation and search are among the main interactions users have with your site, so getting them right is not just a matter of good design; it impacts the bottom line. Faceted search is a very popular and powerful solution when done well; it allows users to deconstruct a large set of results into bite-size pieces and navigate based on what’s important to them. But faceted search by itself is not necessarily going to make your users' lives easier. You need to understand your users’ mental models (how they seek information), test your assumptions about how they will interpret your terms and categories, and spend time refining your approach.

Faceted search can just add more complexity and frustrate your users if not considered from the user perspective and carefully thought through with sound usability principles in mind. Faceted search is raising the bar in terms of findability and how well you execute will determine whether your site meets the new standard.

Tuesday, July 17, 2012

Search Applications - Exalead

Exalead provides search platforms and search-based applications. Search-based applications (SBA) are software applications in which a search engine platform is used as the core infrastructure for information access and reporting. SBAs use semantic technologies to aggregate, normalize and classify unstructured, semi-structured and/or structured content across multiple repositories, and employ natural language technologies for accessing the aggregated information.

Exalead has a platform that uses advanced semantic technologies to bring structure, meaning, and accessibility to previously unused or under-utilized data in today's disparate, heterogeneous enterprise information overload.

The system collects data from virtually any source, in any format, and transforms it into structured, pervasive, contextualized building blocks of business information that can be directly searched and queried, or used as the foundation for a new breed of lean, innovative information access applications.

Exalead products include the CloudView platform and the ii Solutions Suite of packaged SBAs, all built on the same powerful CloudView platform.

Exalead CloudView

Exalead CloudView enables organizations to meet demands for real-time, in-context, accurately delivered information, accessed from diverse web and enterprise big data sources, yet delivered faster and at lower cost than with traditional application architectures. The platform is used for both online and enterprise SBAs as well as enterprise search.

Available for on-premises or cloud delivery, Exalead CloudView is the infrastructure that powers all Exalead solutions, including Exalead’s public web search engine, the company’s custom SBAs, and the Exalead ii Solution Suite of packaged, vertical SBAs.

Exalead ii Solutions Suite

Exalead ii ("information intelligence") applications are packaged, workflow-specific SBAs that transform large volumes of heterogeneous, multi-source data into meaningful, real-time information intelligence, and deliver that intelligence in context to users to improve business processes.

On the data side, the Exalead information infrastructure uses semantic technologies to non-intrusively aggregate, align and enhance multi-source data to create a powerful base of actionable knowledge (i.e., information intelligence).

Exalead Advanced Search options appear as a drop-down menu below the search form, where users select the search criteria that will be entered directly into the search form. A different set of advanced search options is available for each search type.

Registered users can select the "Bookmark" option below any search result to add it to a list of saved sites, accessible on the Exalead homepage as a collection of image thumbnails.

Strengths:
  • truncation, proximity, and many other advanced operators not available from other search engines;
  • includes thumbnails of pages;
  • provides excellent narrowing options on right side.

Exalead appears to support Boolean operators and nested searching with the operators AND, OR, and NOT. Searching can be nested using parentheses, and operators must be in upper case. Exalead can also use a minus sign (-) for NOT, but only when it is not used along with the Boolean operators. The Advanced Search also has drop-down choices for "containing," "not containing," and "preferably containing." There is also an OPT operator, which marks the word following it as optional.

Phrase searching is available by putting "double quotes" around a phrase. Exalead also supports a NEXT operator for ordered proximity of one word (in other words, the same thing as a phrase search), so a search for "double quotes" should get the same results as double NEXT quotes. Exalead also supports the NEAR operator for 16-word proximity; you can change it to NEAR/5 (or any other number) to specify a different proximity window.
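
Putting these operators together, some illustrative queries (invented examples, not taken from Exalead's documentation):

    cats AND (dogs OR ferrets)    Boolean operators, nested with parentheses
    "enterprise search"           phrase search with double quotes
    enterprise NEXT search        ordered one-word proximity; same as the phrase
    knowledge NEAR management     terms within the default 16-word window
    knowledge NEAR/5 management   terms within a 5-word window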

Exalead's Advanced Search also offers some unusual types of special searches:
  • phonetic spelling with the sounds like: operator;
  • approximate spelling with the spells like: operator;
  • regular expression using regex syntax.
On a search with two or more words, stemming is automatic. Exalead also supports truncation using the asterisk (*) symbol. Stemming can also be controlled on the preferences page. Exalead has no case-sensitive searching: lower, upper, or mixed case all return the same results. Exalead supports a title search.

Exalead has limits for language, country, file type, site, and date available on the Advanced Search page. The file type limit includes text, PDF, Word, Excel, PowerPoint, Rich Text Format, Corel WordPerfect, and Shockwave Macromedia Flash. You can place the file type search command directly into the search box. The Advanced Search page offers a site limit, which can be used to restrict results to a specified domain. The language limit is also available in the Advanced Search.

Some common words such as 'a,' 'the' and 'in' are ignored, but they can be searched with a + in front. Within a phrase search, all words are searched.

Results are sorted by a relevance algorithm. Pages are also clustered by site: only one page per site is displayed, with the others available via the yellow folder and domain name. The Advanced Search page used to include two date sort options, but those disappeared with the new interface in October 2006. They are still available via the sort: field prefix, with two options: sort:new and sort:old.

Wednesday, April 25, 2012

Enterprise Search vs Centralized Systems

It is a well-known claim that data is doubling every 18 months and that unstructured information volumes grow six times faster than structured ones. Employees spend too much time, about 20% of their time on average, looking for, not finding, and recreating information. And once they do find information, 42% of employees report having used the wrong information, according to a recent survey.

To combat this reality, companies have for years spent hundreds of thousands, even millions, to move data to centralized systems in an effort to better manage and access its growing volumes, only to be disappointed as data continues to proliferate outside those systems. In fact, in a recent survey by the Technology Services Industry Association, more than 90% of its members had a single support knowledgebase in place, yet they report decreases in critical customer service metrics due to the inability to quickly locate the right knowledge and information to serve customers.

Despite best efforts to move data to centralized platforms, companies are finding that their knowledgebase runs throughout enterprise systems, departments, divisions and newly acquired subsidiaries. Knowledge is stored offline on laptops, in emails and archives, intranets, file shares, CMS, CRM systems, ERPs, home-grown systems and many others, across departments and around the world.

Add to this the proliferation of enterprise application use (including social networks, wikis, blogs and more) throughout organizations and it is no wonder that efforts to consolidate data into a single knowledgebase, a single "version of the truth" have failed... and at a very high price.

The bottom line is, moving data into a single knowledgebase is a losing battle. There is a much more successful way to effectively manage your knowledge ecosystem, all without moving data: combine structured and unstructured information from virtually any enterprise system, including social networks, into a central, unified index. Think of it as an indexing layer that sits above all enterprise systems, from which services can be provided to multiple departments, each configured to that department’s specific needs.

This approach enables dashboards, focused on various business departments and processes, that contain just-in-time analytics and 360-degree information about, for example, a customer or a prospective customer. Such composite views of information provide new, actionable perspectives on many business processes, including overall corporate governance. The resulting exposure of key metrics and information improves decision making and operational efficiency.

This approach allows IT departments to leverage their existing technologies, and avoid significant costs associated with system integrations and data migration projects. It also helps companies avoid pushing their processes into a one-size-fits-all framework. With configurable dashboards, companies decide how, what, and where information and knowledge are presented, which workflows are enabled, and for what groups of employees.

Information monitoring and alerts facilitate compliance. There is virtually no limit to the type of information that can be pulled into the central, unified, and highly secure index: structured and unstructured content from corporate email, .PST files, archives, desktops, CRM systems, CMS platforms, knowledgebases, and more.

Enterprise applications have proliferated throughout organizations, becoming rich with content. And yet all of that knowledge and all of that content remain locked within the community, often not even easily available to the members themselves. It is possible to leverage the knowledge of communities in enterprise search efforts. User rankings, best bets and the ability to find people through the content they create are all social search elements that provide the context that employees and customers have come to expect from their interactions with online networks.

Once you have stopped moving data and created the central index, you can provide your employees with access to pertinent information and knowledge wherever they work. For many organizations, employees spend most of their time in Outlook. Other organizations with large sales teams need easy access to information on the road.

Also valuable is the ability to conduct secure searches within enterprise content directly from a BlackBerry, including guided navigation. Even when systems such as laptops are disconnected, users can still find information from them directly from their mobile devices. Again, without moving data, organizations can enjoy immediate access to pertinent knowledge and information, anywhere, anytime.

Tuesday, April 10, 2012

Content Management Systems Reviews - Documentum - Federated Search

Documentum Federated Search is a suite of products designed to solve the problem of finding information quickly. Federated Search Services comprises two server-based components, the Federated Search Server and Federated Search Adapter Packs, and two client-level options, Webtop Federated Search and Discovery Manager. These options enable organizations to quickly search for information stored in a myriad of sources and data formats.

Federated Search Server

Federated Search Server manages federated searches through a query broker and source adapters to provide relevant results in real time while leveraging the local index and security permissions of each source being queried. The result: relevant, secure, real-time search results organized in an intuitive manner.

Key Features:

  • Quickly access relevant information across countless sources with a single query executed from an easy-to-use, web-based interface.
  • Search multiple internal and external information sources right out of the box, including Documentum repositories, network file shares, Documentum eRoom and other Documentum products, Google, Yahoo, etc.
  • Ensure secure access to content by respecting the security permissions set at the information source being searched. This ensures that queries return only those search results that the user is authorized to see.
  • Retrieve information from sources whether or not they support the Documentum query language. The Federated Search Server adapts the query automatically and performs post-filtering to compensate for sources that do not support specific operators or metadata (a minimal sketch of this fan-out-and-filter pattern follows this list).
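To illustrate the fan-out-and-filter pattern referenced above, here is a rough Python sketch of a query broker, not Documentum's actual API; the adapters, scores, and filter rule are all invented. The broker sends the query to each source adapter, post-filters results from sources that cannot apply the full query themselves, and merges everything into one relevance-ordered list.

    # Hypothetical adapters: each takes a query string and returns
    # (title, score) results from its own source's native index.
    def repository_adapter(query):
        return [("Design spec v2", 0.91), ("Old design notes", 0.40)]

    def fileshare_adapter(query):
        # This source cannot apply the full query itself, so it returns
        # candidates that the broker must post-filter.
        return [("design-spec.pdf", 0.75), ("holiday-photos.zip", 0.10)]

    def broker(query, adapters, post_filter=None):
        merged = []
        for adapter in adapters:
            results = adapter(query)
            if post_filter:
                # Compensate for sources that ignore parts of the query.
                results = [r for r in results if post_filter(query, r)]
            merged.extend(results)
        # Merge by score so the user sees one relevance-ordered list.
        return sorted(merged, key=lambda r: r[1], reverse=True)

    hits = broker("design spec",
                  [repository_adapter, fileshare_adapter],
                  post_filter=lambda q, r: q.split()[0] in r[0].lower())
    print(hits)  # the .zip file is filtered out; the rest is merged by score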

Federated Search Adapter Packs and Federated Search SDK

Federated Search Adapter Packs are sets of out-of-the-box adapters that provide access from the Federated Search Server to information sources by leveraging the native data structure, metadata, or index of the information sources being searched. Each adapter gathers relevant search results, which are then filtered and presented to the end user in real time in an intuitive manner. Adapter packs are available as a three-pack, ten-pack, or unlimited pack.

The Federated Search SDK provides a powerful developer toolkit to customize or create adapters based on industry standards. It includes source code and a library of common APIs, including a Java API for customizing the Discovery Manager client for bespoke applications.

Key Features:

  • Adapter Library: out-of-the-box source adapters, including adapters for government and industry databases, local content archives, enterprise applications, web services, and bundles of adapters for Pharma and Science.
  • Enterprise repositories include: Documentum products, FileNet Panagon Content Services, IBM Lotus Domino/Notes, Open Text LiveLink, Oracle database, Symantec Enterprise Vault, and others.
  • Search engines include Autonomy, Google Search Appliance, Google.com, Yahoo, Microsoft Index Server, Open Directory, and others.
  • Technology standards include HTTP, JDBC/ODBC, SOAP, Web Services, Z39.50
  • Content providers include Factiva (news), IDRAC, Lexis Nexis (news and legal).

Intelligence Services

With Documentum Intelligence Services added to the Documentum Content Server, content metadata can be improved, helping to produce even more precise search results.

Monday, April 2, 2012

Using Enterprise Search

Enterprise search is used in a wide variety of applications and solutions. This means it has to be highly flexible, allowing you to tune it in multiple directions, because what is important to one application may not be important to another.

Embedded Search

Search is becoming a common requirement for many applications. In some applications, search is a critical aspect of the solution. For example, legal discovery (e-discovery) applications use embedded search technologies to search through a massive volume of e-mails and documents to obtain critical evidence in a legal investigation. Other examples are compliance monitoring, enterprise content management (ECM), e-mail archiving, supply chain systems, customer management, etc.

The unique requirements for embedded search are high flexibility, a small footprint, the ability to scale incrementally, and ease of integration and use.

Intranet Search

Intranet search refers to searching sources of content internal to the enterprise, such as shared file directories, e-mail systems, internal web sites and wikis, and, in the case of newer search solutions, databases. Search technology is used to improve productivity and to help employees locate the specific information they need as part of their daily jobs.

There is a wide array of highly specialized intranet implementations, each solving one problem or another in the enterprise, for example e-mail surveillance, privacy auditing, content management, etc.

The key requirements for intranet search are breadth and depth in handling various document formats, ease of integration with enterprise systems and information repositories, an enhanced user interface (such as navigated search or facets), and a security model for handling user authorization based on enterprise credentials. Some of the more specialized intranet applications also require enhanced features such as entity extraction, document clustering, text mining, and high performance with horizontal scalability.

Web Portals

Search is an essential element of every consumer-facing portal where fast and accurate access to information is the sole purpose of the application. Online catalogs for retailers, job search sites, government portals, and commercial information portals for researchers, marketers, and business managers all use search technology to increase revenue and enable self-service. The new revenue can come either from new customers paying subscriptions or from keeping visitors longer at a site that monetizes through online advertising.

The key requirements for web portals are an enhanced user interface (such as navigated search), the ability to easily manage and control search results with relevance tuning, high performance (sometimes with thousands of concurrent users), and incremental scalability with 24/7 availability.

Search in Analytics

Data mining applications are becoming a must-have component of the corporate business intelligence (BI) stack. But not all business content is nicely organized in relational databases, or even resides in enterprise systems. E-mails, free-text fields in customer surveys, voice recordings from customer service centers, online news, and competitors' web sites all contain important information about the business. In these applications, enterprise search can be used to bring together the content, extract concepts, perform sentiment analysis, or find new relationships in the data to improve analysis.
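As a taste of what that can look like, here is a minimal lexicon-based sentiment scorer over free-text survey fields, written in Python. The word lists are illustrative; production systems rely on far richer linguistic models.

    # A minimal lexicon-based sentiment scorer for free-text survey fields.
    # The word lists are illustrative; real systems use far richer models.
    POSITIVE = {"great", "fast", "helpful", "easy"}
    NEGATIVE = {"slow", "broken", "confusing", "unhelpful"}

    def sentiment(text):
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    surveys = [
        "Support was fast and helpful",
        "The portal is slow and the docs are confusing",
    ]
    for entry in surveys:
        print(sentiment(entry), "-", entry)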

Wednesday, February 8, 2012

Developing Enterprise Search Strategy

During the last ten years, the volume and diversity of digital content have grown at unprecedented rates. There is increased use of departmental network drives, collaboration tools, content management systems, messaging systems with file attachments, corporate blogs and wikis, and databases. Duplicate and untraceable documents crowd out the valuable information needed to get work done.

Unfortunately, not all content makes it into a managed content repository, like a portal or a content management system, and some companies have more than one content management system. Having a search solution that can search across all content repositories therefore becomes very important.

Expectations for quality search continue to rise. Many users put it simply: "we would like a search like Google". So, how do we formulate a search strategy?

Here are a few key points:

  • Security within an enterprise search strategy should be carefully designed. Information like employee pay rates, financial data, or confidential communications should not end up in general search results (a minimal filtering sketch follows this list).
  • Search results should deliver high quality, authoritative, up-to-date information. Obsolete information should not end up in the search results. 
  • Search results should be highly relevant to keywords entered in a search box. 
  • The ability to limit the search should be included.
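
Here is the filtering sketch promised in the first point: a minimal Python illustration of security trimming, where each indexed document carries an access control list and results are filtered against the searcher's groups before display. The documents and groups are hypothetical.

    # Each indexed document carries the groups allowed to see it (its ACL).
    documents = [
        {"id": 1, "title": "Cafeteria menu", "acl": {"all-employees"}},
        {"id": 2, "title": "Executive pay review", "acl": {"hr", "executives"}},
    ]

    def secure_search(query, user_groups):
        # Match on the title, then keep only documents whose ACL
        # intersects the searcher's groups.
        matches = [d for d in documents if query.lower() in d["title"].lower()]
        return [d for d in matches if d["acl"] & user_groups]

    print(secure_search("pay", {"all-employees"}))  # [] - not authorized
    print(secure_search("pay", {"hr"}))             # the pay review document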

Steps to Develop an Enterprise Search Strategy

Step 1: Define Specific Objectives for Your Search Strategy

People don’t search for the sake of searching. They search because they are looking to find and use information to get their jobs done. Answer these questions:

1. Who is searching? Which roles within the organization are using the search function, and what requirements do they have?

For example, a corporate librarian is likely familiar with Boolean search and advanced search forms, while a layperson likely prefers a simple search box. A sales professional may need instant access to past proposals for an upcoming meeting, while compliance professionals conducting investigations often run deep searches across massive message archiving and records management systems.

2. What categories of information are they looking for?

Define the big buckets of information that are the most relevant to different roles. Realize that not all roles need all information. Part of why desktop search tools are popular is that they inherently define a bucket called "stuff on my machine". Defining categories for searching project information, employee information, sales tools, and news helps searchers formulate the right query for the right type of search.

3. What are they likely to do with the information when they find it? After defining broad information categories, work to understand context and answer the question: why are people searching?

For example, if a marketer is collecting information on a particular competitor by searching on the company's name, it is often useful to expand that query to include related information, like other competitors in the industry, specific business units or product lines, pricing information, and past financial performance. Related information can be included in search experiences through a variety of methods, including the search results themselves or methods like faceted navigation.

It is impossible to account for every type of information that users may be looking for, but defining broad user roles, like sales professionals or market researchers, and identifying their most common search scenarios is a great way to scope a search project. Use methods such as personas and use cases, interview users to validate assumptions about the processes they are involved in, and identify the information that is most useful to support those processes.

Step 2: Define the Desired Scope and Inventory Repositories

When using the search function built into a particular content management system, the product itself limits the scope of the search to whatever is stored in this system. Search engines such as Autonomy, Endeca Technologies, Google, Vivisimo, and others will search across multiple content management systems and databases. Increasingly, portal products and collaboration platforms from companies like IBM, Microsoft, Oracle, and Open Text will also let you search content that is stored inside and outside of their systems.

Use search to reach outside the confines of a single repository. Cross-repository search becomes essential when companies use different content repositories for different purposes.

Match roles and search categories to relevant content sources. Search requirements often include multiple repositories, such as document libraries, file systems, databases, etc. These repositories usually consist of multiple technology products, such as Lotus Notes, EMC Documentum, Microsoft SharePoint, and others. Using the roles and types of searches you are looking to support, identify all of the relevant repositories necessary to achieve your desired search scope.

Create an inventory of required repositories. When creating your inventory, document the name of each repository, a repository owner, a description of its content, an assessment of the quality of this content, and the quantity and rate of growth of content in each repository. Also document the technology product used as well as any specific security access policies in place.
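One lightweight way to capture this inventory is as structured data that the team can review, extend, and even feed into prioritization scripts. The entries below are hypothetical and simply mirror the attributes just listed.

    # A hypothetical repository inventory capturing the attributes above.
    inventory = [
        {
            "name": "Sales SharePoint",
            "owner": "jdoe",
            "content": "Proposals, presentations, account plans",
            "quality": "medium",          # subjective assessment
            "documents": 120_000,
            "annual_growth": 0.25,        # 25% more documents per year
            "technology": "Microsoft SharePoint",
            "security": "AD group-based access",
        },
        {
            "name": "Support knowledgebase",
            "owner": "asmith",
            "content": "Resolved cases, troubleshooting articles",
            "quality": "high",
            "documents": 40_000,
            "annual_growth": 0.40,
            "technology": "home-grown",
            "security": "open to all employees",
        },
    ]

    # Prioritize by size-adjusted growth to pick phase-one candidates.
    for repo in sorted(inventory,
                       key=lambda r: r["documents"] * r["annual_growth"],
                       reverse=True):
        print(repo["name"], repo["documents"])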

Consider a phased rollout, and select simple but telling data source repositories for kick-off. When rolling out a project such as a search strategy that involves disparate sources and complex UIs, a phased rollout may be preferable, depending upon factors such as resource constraints and time-to-launch pressure. By approaching the project in phases, you can vet the process and workflow while familiarizing users with the objectives.

Inventory and prioritize the repositories at the start of your project so that you can identify and start with the repositories that will have a big impact. For example, basic queries into a CRM system can add a lot of value while remaining relatively straightforward. Throughout this process, it is important to set expectations with your users, since this approach may lengthen their involvement with the project.

Documenting your repositories lets software vendors effectively size and bid on your project. Most search software gets priced based on the number of documents (or data items) in the index plus additional fees for premium connectors that ingest content from repositories like enterprise content management systems.

For example, strategies that require a limited set of commodity connectors are priced altogether differently than those with premium connectors for content management systems and enterprise applications. Thus, knowing which repositories are relevant and understanding the rate of content growth within them can help avoid unnecessary overspending.
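Because pricing typically tracks index size, even a back-of-the-envelope projection helps. Here is a tiny Python sketch; the repository counts and the 30% growth rate are assumptions for illustration only.

    # Project total indexed documents over a three-year license term.
    repositories = {"file shares": 500_000, "CRM": 150_000, "ECM": 350_000}
    annual_growth = 0.30  # assumed 30% document growth per year

    total = sum(repositories.values())
    for year in range(1, 4):
        total *= 1 + annual_growth
        print(f"Year {year}: about {int(total):,} documents to license")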

Step 3: Evaluate and Select the Best Method for Enriching Content

When addressing content with very little descriptive text and metadata, evaluate several methods for enriching the content to improve the search experience. Methods range from manual application of metadata to automatic categorization. Some companies use a mix of both methods.
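As a simple illustration of the automatic end of that spectrum, here is a keyword-rule categorizer in Python that tags documents with metadata. Real categorization products use taxonomies and statistical models; the rules below are invented.

    # Keyword rules mapping trigger words to category tags (illustrative only).
    RULES = {
        "invoice": "finance",
        "contract": "legal",
        "resume": "hr",
    }

    def categorize(text):
        # Return every category tag whose trigger word appears in the text.
        words = set(text.lower().split())
        return {tag for word, tag in RULES.items() if word in words}

    print(categorize("Signed contract and final invoice attached"))
    # prints {'legal', 'finance'} (set order may vary)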

Step 4: Define Requirements and List Products and Vendors to Consider

After specifying a search scope, define requirements for users. The most important thing is not to get distracted by irrelevant features, but instead to focus on products that adequately meet the organization's requirements over a specified time period. Consider factors like ease of implementation, product strategy, and market presence in any product evaluation.

Score and select vendors on criteria that are relevant for your needs. There are many vendors to choose from. Search vendors include Autonomy, Coveo Solutions, Endeca Technologies, Exalead, Google Enterprise, ISYS Search Software, Recommind, Thunderstone Software, Vivisimo, and others. Also large software providers such as IBM, Microsoft, Oracle, and SAP have one or more search products on the market.

Product capabilities range from highly sophisticated, large-scale, secure searches that mix advanced navigation and filtering, to basic keyword searches across file systems. Products also differ depending on whether the content being searched consists primarily of structured data or unstructured documents. For example, high-end search companies like Endeca offer robust tools for searching structured data from databases, while small-scale, basic file system search needs can be met with products like the Google Mini or the IBM OmniFind Yahoo! Edition.

Step 5: Define a Taxonomy of Logical Types of Searches

While it is impossible to predict and account for everything people search for, it is possible to organize the search experience so it is intuitive to use. Start with defining logical types of searches. For example:

People Search. Searching for employees has gained acceptance as a valuable type of enterprise search for finding expertise on a subject. A search for people, whether it is a simple name look-up or a more advanced expertise search, requires attention to everything from how the query gets processed to how results appear in the interface. For example, searchers typically want to see an alphabetical list of names in people search results, as opposed to results ranked by relevance.

Product Search. A search for products frequently needs to include product brand names (e.g., Trek), concepts and terms related to the product (e.g., bike, bicycle, road race, touring), product description, and specific product attributes, like frame size, material, and color. Knowing where all of this information is stored and how it should be optimally presented to end users is essential.

Customer Search. It is now possible to search and return results for virtually any logical item in an enterprise, like orders, customers, products, and places. Look into sources like enterprise data warehouses, ERP systems, order histories, and others to create a full picture of the item being searched.

Documents Search. Documents usually reside in a few repositories, so be sure to include them all in your search sources. Users expect search results to be highly relevant, with the most relevant at the top of the results list.

By bucketing types of searches into logical categories, you can also improve the quality of those searches, for example by applying type-specific thesauri, taxonomies, and controlled vocabularies.

Administrators can also influence the relevance algorithm so it returns the right information the right way, like weighting hits in a product description more heavily than hits in a product attribute field.
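A minimal Python sketch of that kind of field weighting follows; the weights, products, and fields are made up for illustration.

    # Field weights: a hit in the description counts more than one in an attribute field.
    FIELD_WEIGHTS = {"description": 3.0, "attributes": 1.0}

    products = [
        {"name": "Touring bike",
         "description": "lightweight carbon touring bike",
         "attributes": "color: carbon black"},
        {"name": "Cable lock",
         "description": "braided steel lock",
         "attributes": "finish: carbon coated"},
    ]

    def score(product, term):
        # Sum the weight of every field in which the term appears.
        return sum(weight for field, weight in FIELD_WEIGHTS.items()
                   if term in product[field].lower())

    for p in sorted(products, key=lambda p: score(p, "carbon"), reverse=True):
        print(score(p, "carbon"), p["name"])  # the bike outranks the lock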

Step 6: Plan for a Relevant User Experience

Recognize that not all search experiences should be the same. The popularity of Google, Yahoo!, and MSN on the Web has generated strong interest in offering simple-to-use search boxes and tabbed interfaces within the enterprise. But in the enterprise, it is often helpful to use more advanced interface techniques to clarify what users are looking for, including:

Faceted navigation adds precision to search. It exposes attributes of the items returned to an end user directly into the interface. For example, a search through a product information database for "electrical cables" might return cables organized by gauge, casing materials, insulation, color, and length, giving an engineer clues to find exactly what he is looking for.
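Here is a sketch of how facet counts can be derived from a result set in Python; the cable data is invented.

    from collections import Counter

    # Hypothetical result set for the query "electrical cables".
    results = [
        {"gauge": "12 AWG", "casing": "PVC",    "length": "50 ft"},
        {"gauge": "12 AWG", "casing": "rubber", "length": "25 ft"},
        {"gauge": "14 AWG", "casing": "PVC",    "length": "50 ft"},
    ]

    def facets(results, fields):
        # Count how many results carry each value of each faceted attribute.
        return {f: Counter(r[f] for r in results) for f in fields}

    for field, counts in facets(results, ["gauge", "casing"]).items():
        print(field, dict(counts))  # e.g. gauge {'12 AWG': 2, '14 AWG': 1}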

Statistical clustering methods remove ambiguity. Methods like statistical clustering automatically organize search results by frequently occurring concepts. Clusters provide higher-level groupings of information than the individual results can provide, and can make lists of millions of documents easier to scan and navigate.
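A toy version of that idea in Python, grouping results by their most frequent informative term; real products use proper statistical clustering over full document vectors, so treat this only as an illustration of the grouping step.

    from collections import Counter, defaultdict

    STOPWORDS = {"the", "a", "of", "for", "and"}

    def dominant_term(title):
        # The most frequent non-stopword stands in for the document's "concept".
        words = [w for w in title.lower().split() if w not in STOPWORDS]
        return Counter(words).most_common(1)[0][0]

    def cluster(titles):
        groups = defaultdict(list)
        for title in titles:
            groups[dominant_term(title)].append(title)
        return groups

    titles = ["Pension plan overview", "Pension enrollment form", "Travel policy"]
    for concept, docs in cluster(titles).items():
        print(concept, "->", docs)  # the two pension documents group together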

Best bets guide users to specific information they need. Creating best bets is the process of writing a specific rule that says something like: "when a person enters the term "401K plan" into the search box on the corporate intranet, they should see a link to the "401K plan" page on the intranet".
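That rule is easy to picture as a lookup that runs before the regular query. A minimal Python sketch, with placeholder terms and URLs:

    # Best-bet rules: exact query strings mapped to a promoted link.
    BEST_BETS = {
        "401k plan": ("401K Plan", "https://intranet.example.com/benefits/401k"),
    }

    def search(query):
        results = []
        best = BEST_BETS.get(query.strip().lower())
        if best:
            title, url = best
            results.append({"title": title, "url": url, "best_bet": True})
        # ... regular engine results would be appended below the best bet ...
        return results

    print(search("401K plan"))  # the promoted link appears first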

Additionally, products like Google OneBox and SAP's Enterprise Search appliance enable retrieval of frequently searched facts, such as sales forecast data, dashboards, and partner information, from back-end ERP systems. Best bets help users avoid a lot of irrelevant results and are very effective for frequently executed queries.

Use basic interface mock-ups and pilot efforts to test, refine, and make these concepts useful for employees in your organization. Many companies use a "Google Labs" style page on their intranets to test out search user interface concepts and tools prior to exposing them more broadly to the enterprise.

Step 7: Implement, Monitor, and Improve

For large projects, allow a lot of time for change management. Teams should also plan to maintain the interfaces between the search engine and all of its back-end content sources.

It is essential to keep IT staff informed of product evaluation and selection plans so that the final implementation supports the security and regulatory policies in place for these systems.

Create a plan for ongoing maintenance of search indexing processes and exceptions. Create a monthly reporting plan that lists the most frequent searches performed, searches that did not retrieve results, and overall usage of the search function. This can help you troubleshoot existing implementations and drive future decisions on how to enhance the search experience over time.
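A starting point for that monthly report can be as simple as counting entries in the query log. A Python sketch, with an invented log format:

    from collections import Counter

    # Hypothetical query log: (query, number_of_results) pairs.
    log = [("vacation policy", 42), ("vacation policy", 42),
           ("expense report", 17), ("q3 roadmap", 0)]

    queries = Counter(q for q, _ in log)
    zero_hits = sorted({q for q, n in log if n == 0})

    print("Most frequent searches:", queries.most_common(3))
    print("Searches with no results:", zero_hits)  # candidates for best bets or synonyms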

Enhancements typically include adding types of searches to the experience, further enriching content assets for better retrieval, and incorporating new, valuable content into the overall experience.

In my future posts, I will describe search products such as Autonomy, Coveo Solutions, Endeca Technologies, Exalead, ISYS Search Software, Recommind, Thunderstone Software, Vivisimo, and others.