Tuesday, June 30, 2015

Search Applications - Concept Searching

Concept Searching Limited is a software company which specializes in information retrieval software. It has products for Enterprise search, Taxonomy Management and Statistical classification.

Concept Searching Technology Platform

The Concept Searching Technology Platform is based on our Smart Content Framework™ for information governance, and incorporates best practices for developing an enterprise framework to mitigate risk, automate processes, manage information, protect privacy, and address compliance issues. Underlying the framework is the technology to:
  • Automatically generate semantic metadata using Compound Term Processing.
  • Auto-classify content from diverse repositories.
  • Easily develop, deploy, and manage taxonomies.
The framework is being used to enable intelligent, metadata-enabled solutions to improve search, records management, enterprise metadata management, text analytics, migration, enterprise social networking, and data security.

Features
  • Compound terms are extracted when content is indexed from internal or external content sources, enabling the delivery of greater precision of relevant content at the top of search results.
  • Relevance ranking displays extracts from the documents based on the query.
  • Search refinement delivers to the end user highly correlated concepts that may be used to refine the search.
  • Taxonomy browse capabilities are standard.
  • Documents can be classified into one or more taxonomy nodes, enhancing the precision of documents returned.
  • In addition to static summaries, Dynamic Summarization, a modified weighting system, can be applied that will identify in real-time short extracts that are most relevant to the user’s query.
  • Related Topics will return results based on the conceptual meaning of the search terms used, using the ability to generate compound terms in a search. For example, ‘triple’ is a single word term but ‘triple heart bypass’ is a compound term that provides a more granular meaning.
  • Based on previous queries, or on extracts retrieved, end users can use the text to perform additional searches to retrieve more granular results.
  • The product is based on an open architecture with all APIs based on XML and Web Services. Transparent access to system internals, including the statistical profile of terms, is standard.
  • Highly scalable.
  • High performance, with classification occurring in real time.
  • Easily customized to achieve your organization’s objectives.
Base Components in the Concept Searching Technology Framework

Conceptual Search Platform

conceptSearch is Concept Searching’s enterprise search product and a key component in the Concept Searching Technology Platform. It is a unique, language-independent technology and is the first content retrieval solution to integrate relevance ranking based on the Bayesian Inference Probabilistic Model and concept identification based on Shannon’s Information Theory.

Unlike other enterprise search engines that require significant customization with marginal results, conceptSearch is delivered with an out-of-the-box application that demonstrates a simple search interface and indexing facilities for internal content, web sites, file systems, and XML documents. Application developers experience a minimal learning curve and the organization can look forward to a rapid return on investment.

Because of the innovative technology, conceptSearch delivers both high precision and high recall. Precision and recall are the two key performance measures for information retrieval. Precision is the retrieval of only those items that are relevant to the query. Recall is the retrieval of all items that are relevant to the query. Yet most information retrieval technologies are less than 22% accurate for both precision and recall. The ideal goal is to have these two measures balanced. Compound term processing has the ability to increase precision with no loss of recall.
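To make the two measures concrete, here is a minimal sketch of how precision and recall are commonly computed for a single query. The function name and the document IDs are hypothetical and purely illustrative; this is not part of any Concept Searching API.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs for a single query.
retrieved = ["d1", "d2", "d3", "d4"]   # what the engine returned
relevant = ["d1", "d3", "d5"]          # what a human judged relevant

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found, which shows how a system can score differently on each measure.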

conceptSearch is particularly important for organizations that need sophisticated search and retrieval solutions. By weighting multi-word phrases, instead of single words, or words in proximity, the retrieval experience is more accurate and relevant. The ability for the search engine to identify concepts enables organizations to improve the search experience for a variety of business requirements.

Search Engine Integration

This functionality is provided via the Concept Searching Technology platform to integrate with any search engine. The Concept Searching Technology platform can perform on-the-fly classification, with search engines calling the classify API. Search engine support includes SharePoint, the former FAST products, Office 365 Search, Solr, Google Search Appliance, Autonomy, and IBM Vivisimo. If the FAST Pipeline Stage is required, this is sold as a separate product.

conceptClassifier

conceptClassifier is a leading-edge rules based categorization module providing control of rules-based descriptors unique to an organization. conceptClassifier delivers a categorization descriptor table, which is easy to implement and maintain, through which all rules and terms can be defined and managed. This approach eliminates the error-prone results of ‘training’ algorithms typically found in other text retrieval solutions and enables human intervention to effectively tune classification results.

Functionality is provided via the Concept Searching Technology platform, to classify documents based upon concepts and multi-word terms that form a concept. Automatic and/or manual classification is included. Knowledge workers with the appropriate security rights can also classify content in real time. Content can be classified from diverse repositories including SharePoint, Office 365, file shares, Exchange public folders, and websites. All content can be classified on the fly and classified to one or more taxonomies.

conceptTaxonomyManager

This is an advanced enterprise class, easy-to-use taxonomy development and management tool, still unique in the industry. Developed on the premise that a taxonomy solution should be used by business professionals, and not the IT team or librarians, the end result is a highly interactive and powerful tool that has been proven to reduce taxonomy development by up to 80% (client source data).

conceptTaxonomyManager is simple to use, has an intuitive user interface designed for Subject Matter Experts, and does not require IT or Information Scientist expertise to build, maintain and validate taxonomies for the enterprise. conceptTaxonomyManager has the capability to automatically group unstructured content together based on an understanding of the concepts and ideas that share mutual attributes while separating dissimilar concepts.

This approach is instrumental in delivering relevant information via the taxonomy structure as well as using the semantic metadata in enterprise search to reduce time spent finding information, increase relevancy and accuracy of the search results, and enable the re-use and re-purposing of content. Using one or more taxonomies, unstructured content can be leveraged to improve any application that uses metadata. This flexibility extends to records management, information security, migration, text analytics, and collaboration.

Intelligent Migration

Using the Concept Searching Technology platform an intelligent approach to migration can be achieved. As content is migrated it is analyzed for organizationally defined descriptors and vocabularies, which will automatically classify the content to taxonomies, or in the SharePoint environment, the SharePoint Term Store, and automatically apply organizationally defined workflows to process the content to the appropriate repository for review and disposition.

conceptSQL

This product provides the ability to define a document structure based on information held in a Microsoft SQL Server. A document can include any number of text and metadata fields and can span multiple tables if required. conceptSQL supports SQL 2005, 2008, and 2012. A powerful but easy to use configuration tool is supplied eliminating the need for any programming. Templates are provided for out of the box support for Documentum, Hummingbird, and Worksite/Interwoven DMS.

SharePoint Feature Set

The SharePoint Feature Set includes the following components: farm solution with feature sets, Term Store integration, taxonomy tree control for editing, refinement panel integration, event handlers for notification of changes, management of classification status column, web service advanced functionality (implement system update or preserve GUIDS), automated site column creation.

Intelligent Records Management

The ability to intelligently identify, tag, and route documents of record to either a staging library and/or a records management solution is a key component in driving and managing an effective information governance strategy. Taxonomy management, automatic declaration of documents of record, auto-classification, and semantic metadata generation are provided via the Concept Searching Technology platform and conceptTaxonomyWorkflow.

Data Privacy

Fully customizable to identify unique or industry standard descriptors, content is automatically meta-tagged and classified to the appropriate node(s) in the taxonomy based upon the presence of the descriptors, phrases, or keywords from within the content. Once tagged and classified the content can be managed in accordance with regulatory or government guidelines.

The identification of potential information security exposures includes the proactive identification and protection of unknown privacy exposures before they occur, as well as real-time monitoring of organizationally defined vocabulary and descriptors in content as it is created or ingested. Taxonomy, classification, and metadata generation are provided via the Concept Searching Technology platform and conceptTaxonomyWorkflow.

eDiscovery and Litigation Support

Taxonomy, classification, and metadata generation are provided via the Concept Searching Technology platform. This is highly useful when relevance, identification of related concepts, and vocabulary normalization are required to reduce time and improve the quality of search results.

Text Analytics

Taxonomy, classification, and metadata generation are provided via the Concept Searching Technology platform. A third party business intelligence or reporting tool is required to view the data in the desired format. This is useful to cleanse the data sources before using text analytics to remove content noise, irrelevant content, and identify any unknown privacy exposures or records that were never processed.

Social Networking

Taxonomy, classification, and metadata generation are provided via the Concept Searching Technology platform. Integration with social networking tools can be accomplished if the tools are available in .NET or via SharePoint functionality. This is useful to provide structure to social networking applications and provide significantly more granularity in relevant information being retrieved.

Business Process Workflow

conceptTaxonomyWorkflow serves as a strategic tool managing migration activities and content type application across multiple SharePoint and non-SharePoint farms and is platform agnostic. This add-on component delivers value specifically in migration, data privacy, and records management, or in any application or business process that requires workflow capabilities.

conceptTaxonomyWorkflow is required to apply action on a document, optionally automatically apply a content type and route to the appropriate repository for disposition.

Wednesday, June 24, 2015

Thesaurus Principles

A thesaurus is necessary for effective information retrieval. A major purpose of a thesaurus is to match the terms brought to the system by an enquirer with the terms used by the indexer.

Whenever there are alternative names for a type of item, we have to choose one to use for indexing, and provide an entry under each of the others saying what the preferred term is. The goal of the thesaurus, and the index which is built by allocating thesaurus terms to objects, is to provide useful access points by which that record can be retrieved.

For example, if we index all full-length ladies' garments as dresses, then someone who searches for frocks must be told that they should look for dresses instead.

This is no problem if the two words are really synonyms, and even if they do differ slightly in meaning it may still be preferable to choose one and index everything under that. I do not know the difference between dresses and frocks but I am fairly sure that someone searching a modern clothing collection who was interested in the one would also want to see what had been indexed under the other. We would do this by linking the two terms with Use and Use For references, like this:

Dresses
USE FOR
Frocks

Frocks
USE
Dresses

This may be shown in a printed list, or it may be held in a computer system, which can make the substitution automatically. If an indexer assigns the term Frocks, the computer will change it to Dresses, and if someone searches for Frocks the computer will search for Dresses instead, so that the same items will be retrieved whichever term is used.
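The automatic substitution described above can be sketched with a simple lookup table. This is only an illustration: the `USE` table, the function names, and the catalogue items are assumptions made for the example, not the behaviour of any particular thesaurus package.

```python
# Non-preferred term -> preferred term (the USE references).
USE = {"Frocks": "Dresses"}


def preferred(term):
    """Return the preferred term, or the term itself if it is already preferred."""
    return USE.get(term, term)


def use_for(term):
    """Return the non-preferred terms that point at `term` (its USE FOR list)."""
    return sorted(alt for alt, pref in USE.items() if pref == term)


# Hypothetical catalogue: item id -> term typed in by the indexer.
assigned = {"item-1": "Frocks", "item-2": "Dresses", "item-3": "Jackets"}

# Substitute at indexing time, so only preferred terms are stored.
index = {item: preferred(term) for item, term in assigned.items()}


def search(term):
    """Substitute at search time too, then match against the index."""
    target = preferred(term)
    return sorted(item for item, t in index.items() if t == target)


print(search("Frocks"))   # same items as search("Dresses")
print(use_for("Dresses"))
```

Because the substitution is applied both when indexing and when searching, the enquirer and the indexer never need to agree in advance on which synonym to type.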

Use and Use For relationships are also used between synonyms or pairs of terms which are so nearly the same that they do not need to be distinguished in the context of a particular collection. For example:

Nuclear energy
USE
Nuclear power

Nuclear power
USE FOR
Nuclear energy

Hierarchical Relationships

If we have a hundred jackets, a list under a single term will be too long to look through easily, and we should use the more specific terms. In that case, we have to make sure that a user will know what terms there are. We do this by writing a list of them under the general heading. For example:

Jackets
NT (Narrower Terms)
Dinner Jackets
Flying Jackets
Sports Jackets

In the thesaurus, BT (Broader Terms)/NT relationships can be used for parts and wholes in only four special cases: parts of the body, places, disciplines and hierarchical social structures.

Good computer software should allow you to search for "Jackets and all its narrower terms" as a single operation, so that it will not be necessary to type in all the possibilities if you want to do a generic search.
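A generic search of this kind amounts to expanding a term into itself plus all of its narrower terms before matching. The following is a minimal sketch under that assumption; the `NT` table mirrors the Jackets example, while the catalogue items and function names are hypothetical.

```python
# Term -> list of its immediate narrower terms (NT).
NT = {"Jackets": ["Dinner Jackets", "Flying Jackets", "Sports Jackets"]}


def expand(term, nt=NT):
    """Return `term` together with its narrower terms at every depth."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(nt.get(current, []))
    return sorted(seen)


# Hypothetical catalogue: item id -> indexing term.
catalogue = {
    "item-1": "Dinner Jackets",
    "item-2": "Sports Jackets",
    "item-3": "Dresses",
}


def generic_search(term):
    """Search for a term and all of its narrower terms in one operation."""
    terms = set(expand(term))
    return sorted(item for item, t in catalogue.items() if t in terms)


print(expand("Jackets"))
print(generic_search("Jackets"))
```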

Related Terms

Related terms may be of several kinds:

1. Objects and the discipline in which they are studied, such as Animals and Zoology.
2. Processes and their products, such as Weaving and Cloth.
3. Tools and the processes in which they are used, such as Paint brushes and Painting.

It is also possible to use the Related Term relationship between terms which are of the same kind, not hierarchically related, but where someone looking for one ought also to consider searching under the other, e.g. Beds RT Bedding; Quilts RT Feathers; Floors RT Floor coverings.

Definitions and Scope Notes

Record information which is common to all objects to which a term might be applicable. Where there is any doubt about the meaning of a term, or the types of objects which it is to represent, attach a scope note. For example:

Fruit
SN
distinguish from Fruits as an anatomical term
BT
Foods

Preserves
SN
includes jams

Neonates
SN
covers children up to the age of about 4 weeks; includes premature infants

Form of Thesaurus

A list based on these relationships can be arranged in various ways; alphabetical and hierarchical sequences are usually required, and thesaurus software is generally designed to give both forms of output from a single input.
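As a rough illustration of producing both sequences from a single input, the sketch below derives an alphabetical list and an indented hierarchical display from one table of narrower-term relationships. The function names and the two-space indentation are assumptions for the example only.

```python
# Single input: term -> immediate narrower terms.
NT = {"Jackets": ["Dinner Jackets", "Flying Jackets", "Sports Jackets"]}


def alphabetical(nt):
    """Every term, broader or narrower, in one alphabetical sequence."""
    terms = set(nt)
    for children in nt.values():
        terms.update(children)
    return sorted(terms)


def hierarchical(nt):
    """Indented tree display, starting from terms that have no broader term."""
    all_children = {child for children in nt.values() for child in children}
    roots = sorted(term for term in nt if term not in all_children)
    lines = []

    def walk(term, depth):
        lines.append("  " * depth + term)
        for child in sorted(nt.get(term, [])):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


print("\n".join(alphabetical(NT)))
print("\n".join(hierarchical(NT)))
```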

Poly-hierarchies

A term can have several broader terms, if it belongs to several broader categories. The thesaurus is then said to be poly-hierarchical. Cardigans, for example, are simultaneously Knitwear and Jackets, and should be retrieved whenever either of these categories is being searched for.

With a poly-hierarchical thesaurus it would take more space to repeat full hierarchies under each of several broader terms in a printed version, but this can be overcome by using references, as Root does. There is no difficulty in displaying poly-hierarchies in a computerized version of a thesaurus.
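One way to see why a computerized thesaurus has no difficulty here: if each term simply stores a list of broader terms, the narrower-term lists for every category can be derived by inverting that table. The sketch below assumes this representation; the `invert` helper is hypothetical.

```python
from collections import defaultdict

# Term -> list of broader terms (BT). A term may have more than one.
BT = {
    "Cardigans": ["Knitwear", "Jackets"],
    "Dinner Jackets": ["Jackets"],
}


def invert(bt):
    """Derive broader term -> narrower terms (NT) from the BT table."""
    nt = defaultdict(list)
    for term, broader_terms in bt.items():
        for broader in broader_terms:
            nt[broader].append(term)
    return {term: sorted(children) for term, children in nt.items()}


NT = invert(BT)

# Cardigans appears under both of its broader categories.
print(NT["Knitwear"])
print(NT["Jackets"])
```

Nothing is duplicated in storage: Cardigans is recorded once, and it shows up under Knitwear and under Jackets only when the display or search is generated.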

Singular or Plural

Thesaurus creation standards prescribe the use of plural forms of nouns.

Use of Thesaurus

A thesaurus is an essential tool which must be at hand when indexing a collection of objects, whether by writing catalog cards by hand or by entering details directly into a computer. The general principles to be followed are:

1. Consider whether a searcher will be able to retrieve the item by a combination of the terms you allocate.
2. Use as many terms as are needed to provide required access points.
3. If you allocate a specific term, do not also allocate that term's broader terms.
4. Make sure that you include terms to express what the object is, irrespective of what it might have been used for.
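Principle 3 lends itself to an automatic check. The sketch below removes any allocated term that is already implied by a more specific allocated term. The terms "Clothing" and "Silk" and the helper names are hypothetical, added only to make the example complete.

```python
# Term -> list of broader terms. "Clothing" is a hypothetical top term.
BT = {
    "Dinner Jackets": ["Jackets"],
    "Jackets": ["Clothing"],
}


def ancestors(term, bt=BT):
    """Return every broader term of `term`, at any level."""
    found = set()
    stack = list(bt.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(bt.get(current, []))
    return found


def drop_redundant(terms):
    """Keep only terms that are not a broader term of another allocated term."""
    terms = list(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term)
    return [term for term in terms if term not in redundant]


# "Jackets" is implied by "Dinner Jackets", so it is dropped.
print(drop_redundant(["Dinner Jackets", "Jackets", "Silk"]))
```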

If you have a computerized thesaurus, with good software, this can give you a lot of direct help. Ideally it should provide pop-up windows displaying thesaurus terms which you can choose from and then "paste" directly into the catalog record without re-typing. It should be possible to browse around the thesaurus, following its chain of relationships or displaying tree structures, without having to exit the current catalog record, and non-preferred terms should automatically be replaced by their preferred equivalents.

You should be able to "force" new terms onto the thesaurus, flagged for review later by the thesaurus editor. When editing thesaurus relationships, reciprocals should be maintained automatically, and it should not be possible to create inconsistent structures.
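A minimal sketch of what "reciprocals maintained automatically" and "no inconsistent structures" could mean in practice follows. The `Thesaurus` class and its method names are assumptions for illustration, and the only inconsistency checked here is a circular broader-term chain.

```python
class Thesaurus:
    """Toy editor that keeps BT and NT in step and refuses circular hierarchies."""

    def __init__(self):
        self.bt = {}  # term -> set of broader terms
        self.nt = {}  # term -> set of narrower terms

    def _ancestors(self, term):
        found = set()
        stack = list(self.bt.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in found:
                found.add(current)
                stack.extend(self.bt.get(current, ()))
        return found

    def add_broader(self, term, broader):
        """Record `term BT broader` and its reciprocal `broader NT term`."""
        # Reject a link that would make a term its own ancestor.
        if term == broader or term in self._ancestors(broader):
            raise ValueError(f"{broader!r} cannot be a broader term of {term!r}")
        self.bt.setdefault(term, set()).add(broader)
        self.nt.setdefault(broader, set()).add(term)


thesaurus = Thesaurus()
thesaurus.add_broader("Dinner Jackets", "Jackets")
print(thesaurus.nt["Jackets"])  # reciprocal NT entry was created automatically

try:
    thesaurus.add_broader("Jackets", "Dinner Jackets")
except ValueError as error:
    print("rejected:", error)
```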

Thesaurus Maintenance

New terms can be suggested, and temporary terms "forced" into the thesaurus by users. Someone has to review these terms regularly and either accept them and build them into the thesaurus structure, or else decide that they are not appropriate for use as indexing terms.

In that case they should generally be retained as non-preferred terms with USE references to the preferred terms, so that users who seek them will not be frustrated. An encouraging thought is that once the initial work of setting up the thesaurus has been done, the number of new terms to be assessed each week should decrease.

When to Use a Thesaurus?

A thesaurus is particularly appropriate for fields which have a hierarchical structure, such as names of objects, subjects, places, materials and disciplines, and it might also be used for styles and periods. A thesaurus would not normally be used for names of people and organisations, but a similar tool, called an authority file, is usually used for these. The difference is that while an authority file has preferred and non-preferred relationships, it does not have hierarchies.

Authority files and thesauri are two examples of a generalized data structure which can allow the indication of any type of relationship between two entries, and modern computer software should allow different types of relationship to be included if needed.
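The generalized structure can be pictured as a list of typed links between two entries. In the sketch below, which is only one possible representation, an authority file is simply the subset of links of kind USE, and further relationship types could be added without changing the structure. The `Relation` tuple and helper names are hypothetical.

```python
from collections import namedtuple

# One link between two entries; `kind` can be any relationship type.
Relation = namedtuple("Relation", "source kind target")

relations = [
    Relation("Frocks", "USE", "Dresses"),
    Relation("Dinner Jackets", "BT", "Jackets"),
    Relation("Beds", "RT", "Bedding"),
]


def related(term, kind, rels=relations):
    """Return the targets linked from `term` by relationships of type `kind`."""
    return [r.target for r in rels if r.source == term and r.kind == kind]


def authority_file(rels=relations):
    """Keep only preferred/non-preferred links, with no hierarchy."""
    return [r for r in rels if r.kind == "USE"]


print(related("Beds", "RT"))
print(authority_file())
```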

Other Subject Retrieval Techniques

A thesaurus is an essential component for reliable information retrieval, but it can usefully be complemented by two other types of subject retrieval mechanism.

Classification Schemes

While a thesaurus inherently contains a classification of terms in its hierarchical relationships, it is intended for specific retrieval, and it is often useful to have another way of grouping objects. It is also often necessary to be able to classify a list of objects arranged by subject in a way which differs from the alphabetical order of thesaurus terms. Each subject group may be expressed as a compound phrase, and given a classification number or code to make sorting possible.

Free Text

It is highly desirable to be able to search for specific words or phrases which occur in object descriptions. These may identify individual items by unique words such as trade names which do not occur often enough to justify inclusion in the thesaurus. A computer system may "invert" some or all fields of the record, i.e. make all the words in them available for searching through a free-text index, or it may be possible to scan records by reading them sequentially while looking for particular words. The latter process is fairly slow, but is a useful way of refining a search once an initial group has been selected by using thesaurus terms.
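The two approaches can be contrasted in a short sketch: an inverted index maps each word to the records containing it, while a sequential scan reads records one by one. The records, the trade name "Acme", and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical object descriptions keyed by record number.
records = {
    1: "Blue dinner jacket with Acme label",
    2: "Knitted cardigan, hand made",
    3: "Dinner jacket, black",
}


def words(text):
    """Split a description into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(recs):
    """Invert the field: word -> set of record numbers containing it."""
    index = defaultdict(set)
    for rec_id, text in recs.items():
        for word in words(text):
            index[word].add(rec_id)
    return index


def scan(recs, word, candidates=None):
    """Read records sequentially; optionally only those in `candidates`."""
    ids = recs if candidates is None else candidates
    return sorted(i for i in ids if word.lower() in words(recs[i]))


index = build_index(records)

# Fast lookup through the inverted index.
print(sorted(index["dinner"]))

# Slower scan, used here only to refine an already selected group.
print(scan(records, "Acme", candidates=index["dinner"]))
```

In the refinement step the scan only has to read the records already selected, which is what keeps the slower method practical.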