
Sunday, August 31, 2014

Role of Automatic Classification in Information Governance

Defensible disposal of unstructured content is a key outcome of a sound information governance program. But establishing records management as part of the organization’s information governance strategy is rife with challenges.

These challenges include explosive content volumes; difficulty in accurately distinguishing business records from transient or non-business content; eroding IT budgets due to mounting storage costs; and the need to incorporate content from legacy systems or from merger and acquisition activity.

Managing the retention and disposition of information reduces litigation risk, cuts discovery and storage costs, and ensures that organizations maintain regulatory compliance. Before anyone can determine why content must be retained, how long it must be retained, and when it can be dispositioned, that content needs to be classified.

However, users see the process of sorting records from transient content as intrusive, complex, and counterproductive. On top of this, the popularity of mobile devices and social media applications has effectively fragmented the content authoring market and eliminated any chance of building consistent classification tools into end-user applications.

If classification is not carried out, there are serious implications when regulators or auditors ask the organization to provide reports defending its records and retention management program.

Records managers also struggle to enforce policies that rely on manual, human-based classification. When classification is left to users, accuracy and consistency are often inadequate and the productivity costs are high. These issues, in turn, increase business and legal risk and can quickly make the entire records management program unsustainable at scale.

A solution to overcome this challenge is automatic classification. It eliminates the need for users to manually identify records and apply necessary classifications. By taking the burden of classification off the end-user, records managers can improve consistency of classification and better enforce rules and policies.

Auto-classification also makes it possible for records managers to demonstrate a defensible approach to classification based on statistically relevant sampling and quality control, which in turn minimizes the risk of regulatory fines and eDiscovery sanctions.
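The statistically relevant sampling behind such a defensible approach can be sketched in a few lines. This is an illustrative example, not part of any vendor product; the function names and the 95% normal-approximation margin are my own choices:

```python
import math
import random

def sample_for_audit(doc_ids, sample_size, seed=42):
    """Draw a reproducible random sample of auto-classified documents
    for manual review by a records manager."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))

def accuracy_with_margin(num_correct, sample_size, z=1.96):
    """Observed accuracy on the audited sample, with a simple
    normal-approximation margin of error (z=1.96 for ~95%)."""
    p = num_correct / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin
```

If 45 of 50 sampled documents were classified correctly, the observed accuracy is 90% with a margin of roughly 8 percentage points, which is the kind of figure a records manager can put in front of an auditor.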

In short, it provides a non-intrusive solution that eliminates the need for business users to sort and classify a growing volume of low-touch content, such as email and social media, while offering records managers and the organization as a whole the ability to establish a highly defensible, completely transparent records management program as part of their broader information governance strategy.

Benefits of Automatic Classification for Information Governance

Apply records management classifications as part of a consistent, programmatic component of a sound information governance program to:

Reduce
  • Litigation risk
  • Storage costs
  • eDiscovery costs
Improve
  • Compliance
  • Security
  • Responsiveness
  • User productivity and satisfaction
Address
  • The fundamental difficulties in applying classifications to high volume, low touch content such as legacy content, email and social media content.
  • Records manager and compliance officer concerns about defensibility and transparency.
Features
  • Automated Classification: automate the classification of content in line with existing records management classifications.
  • Advanced Techniques: classification process based on a hybrid approach that combines machine learning, rules, and content analytics.
  • Flexible Classification: ability to define classification rules using keywords or metadata.
  • Policy-Driven Configuration: ability to configure and optimize the classification process with an easy "step-by-step" tuning guide.
  • Advanced Optimization Tools: reports make it easy to examine classification results, identify potential accuracy issues, and then fix those issues by leveraging the provided "optimization" hints.
  • Sophisticated Relevancy and Accuracy Assurance: automatic sampling and benchmarking with a complete set of metrics to assess the quality of the classification process.
  • Quality Assurance: advanced reports on a statistically relevant sample to review and code documents that have been automatically classified, to manually assess the quality of the classification results when desired.
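As a rough illustration of the "Flexible Classification" idea above, keyword- and metadata-based rules can be modeled as an ordered rule list. This is a hypothetical sketch of the concept, not the product's actual rule engine; the document and rule structures are invented for the example:

```python
def classify(document, rules):
    """Return the first classification whose keyword and metadata rules
    both match; fall through to manual review otherwise.

    `document` is a dict with 'text' and 'metadata' keys; `rules` is an
    ordered list of (classification, keywords, metadata_filter) tuples.
    Both structures are invented for this sketch.
    """
    text = document["text"].lower()
    for classification, keywords, metadata_filter in rules:
        keyword_hit = any(kw.lower() in text for kw in keywords)
        metadata_hit = all(document["metadata"].get(k) == v
                           for k, v in metadata_filter.items())
        if keyword_hit and metadata_hit:
            return classification
    return "Unclassified"  # routed to manual review
```

The first matching rule wins, and anything that matches no rule falls through to manual handling, mirroring the hybrid automatic/manual workflow described above.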

Monday, July 14, 2014

Benefits of Automatic Content Classification

I received a few questions about my posts on automatic content classification. I would like to thank my blog readers for these questions; this post follows up on them.

Organizations receive countless documents every day: postal mail, invoices, faxes, and email. Even after paper documents are scanned, they remain difficult to manage and organize.

To overcome the inefficiencies associated with paper and captured documents, companies should implement an intelligent classification system to organize captured documents.

With today’s document processing technology, organizations do not need to rely on manual classification or processing of documents. Organizations that replace manual sorting and classification with an automated document classification and processing system can significantly reduce manual entry costs and improve the speed and turnaround time of document processing.

Recent research has shown that two-thirds of organizations cannot access their information assets or find vital enterprise documents because of poor information classification or tagging. The survey suggests that much of the problem may be due to manual tagging of documents with metadata, which can be inconsistent and riddled with errors, if it has been done at all.

There are a few solutions for automated document classification and recognition, including SmartLogic's Semaphore, OpenText, Interwoven Metatagger, Documentum, CVISION Trapeze, and others. These solutions enable organizations to organize, access, and control their enterprise information.

They are cost effective and eliminate the inconsistency, mistakes, and heavy manpower costs associated with manual classification. Putting in place an effective and consistent automatic content classification system that ensures quick and easy retrieval of the right documents means better access to corporate knowledge, improved risk management and compliance, superior customer relationship management, enhanced findability for key audiences, and an improved ability to monetize information.

Specific benefits of automatic content classification are:

More consistency. Automatic classification produces the same unbiased results over and over. It might not always be 100% accurate or relevant, but if something goes wrong, it is at least easy to understand why.

Larger context. It enforces classification from the whole organization's perspective, not the individual's. For example, a person interested in sports might tag an article that mentions a specific player but forget, or never consider, the team and country topics.

Persistent. A person can only handle a certain number of incoming documents per day, while an automatic classifier works around the clock.

Cost effective. It can process thousands of documents much faster than a person can.

Automatic document classification can be divided into three types:
  • supervised document classification, where an external mechanism (such as human feedback) provides information on the correct classification of documents;
  • unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information;
  • semi-supervised document classification, where only part of the document set is labeled by the external mechanism.
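To make the supervised case concrete, here is a minimal toy classifier: it builds one bag-of-words centroid per label from human-labeled examples, then assigns new documents to the most similar label by cosine similarity. This is a sketch of the idea only, not a recommendation of a particular production technique:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_docs):
    """Supervised step: merge the human-labeled examples into one
    bag-of-words centroid per label."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def predict(centroids, text):
    """Assign the label whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))
```

The "external mechanism" here is simply the human who labeled the training examples; an unsupervised variant would cluster the vectors instead of comparing them to labeled centroids.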

Automate your content classification and consider using manual labor mainly for quality checking and approval of content.

In my next post on this topic, I will describe the role of automatic classification in records management and information governance.

Thursday, May 15, 2014

Content Categorization Role in Content Management

The ability to find content in a content management system is crucial. One of the main goals of a content management system is to make content easy to find, so that you can take action, make a business decision, do research and development work, and so on.

The main challenge to findability is anticipating how users might look for information. That's where categorization comes into play. The quality of the categorization of each piece of content makes or breaks its findability. Theoretically, good tagging will last the lifetime of the content. You would think that if you do it well initially, then you can forget about it until it is time to retire that content. But reality can be very different.

Durable Categorization

Many issues complicate content categorization. They include:
  • the sheer volume, velocity, and variety of internal and external-facing content which needs management;
  • evolving/emerging regulations and compliance issues, some of which need to be retroactively applied; 
  • the need to limit the company's exposure and to support the strength of its position in any legal activity.
Some organizations face the added challenge of integrating content from acquisitions or mergers, which most likely use content management structures, categorization, and methodologies that are incompatible and of inconsistent quality.

Considering these issues, the success factors for good content categorization are automatic categorization techniques and processes.

Traditionally, keywords, dictionaries, and thesauri are used to categorize content. This type of categorization model poses several problems:
  • taxonomy quality - it depends on the initial vision and attention to detail, and whether it has been kept current;
  • term creep - initial categorization will not always accommodate where and how the content will be used over time, or predict relevancy beyond its original focus;
  • policy evolution - it can't easily apply new or evolving policies, regulations, compliance requirements, etc.;
  • cost and complexity - it is difficult and costly, if not practically impossible, to retroactively expand the original categorization of existing content as large volumes of new content are added.
Automatic Categorization

Using technology to automatically categorize content is the solution. It applies the rules more consistently than people do, and it does so faster. It frees people from having to do the task, and therefore costs less. And it can actively or retroactively categorize batches or whole collections of documents.

You can experience these benefits by using concept-based categorization driven by an analytics engine integrated into the content management system. These systems mathematically analyze example documents you provide to derive concepts that can be used to categorize other documents. By identifying hundreds of keywords per concept, they can distinguish relevance in ways that escape keyword-based and other traditional taxonomy approaches, and they are likely to make connections that a person would miss.

Consider 3D printers as an example. They are also known as "materials printers", "fabbers", "3D fabbers", and "additive manufacturing" systems. If those terms are not in the taxonomy, then relevant documents that use one or more of them, but not "3D printer", will not be optimally categorized.

People looking for information about 3D printers who are not aware of the alternative terms would miss related documents of potential significance. This particularly affects external-facing websites that sell products: their business depends on fast and easy delivery of accurate and complete information, even when the customer does not know all of the terms used to describe the product they are looking for.

In contrast, through example-based mathematical analysis and comparison across multiple keywords, conceptual analytics systems understand that these documents are all related, and automatically categorize and tag them as relevant to 3D printing.
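A crude way to see how this differs from a fixed keyword list is to collapse the variant terms to one canonical concept before matching. The synonym table below is hand-written for this example; a conceptual analytics system would in effect learn such equivalences from example documents rather than from a hard-coded table:

```python
# Hand-written synonym table, longest phrases first so "3d fabber" is
# rewritten before the bare "fabber". A conceptual analytics system would
# in effect learn these equivalences from example documents.
SYNONYMS = [
    ("additive manufacturing", "3d printer"),
    ("materials printer", "3d printer"),
    ("3d fabber", "3d printer"),
    ("fabber", "3d printer"),
]

def normalize(text):
    """Collapse known variant terms to one canonical concept."""
    text = text.lower()
    for variant, canonical in SYNONYMS:
        text = text.replace(variant, canonical)
    return text

def same_concept(doc_a, doc_b, concept="3d printer"):
    """True when both documents mention the concept under any variant name."""
    return concept in normalize(doc_a) and concept in normalize(doc_b)
```

With this normalization, a document about "fabbers" and one about "additive manufacturing" are recognized as discussing the same topic even though they share no literal keyword.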

Another difference is that taxonomy systems require someone to enter the newly developed or discovered terms. In conceptual analytics, it is simply a matter of providing additional example documents that automatically add to the system's conceptual understanding.

The days of keeping everything "just in case" are long gone. Because of cost and risk-exposure concerns, organizations need to keep only what is necessary, particularly as the volume and variety of content continue to grow. Good categorization and tagging systems are essential to good content management and to controlling expense and exposure.

Outdated and draft documents unnecessarily expand every company's content repositories. Multiple copies of the same or very similar content are scattered throughout the organization. By some estimates, such content makes up 20% or more of a company's content.

Efficiently weeding out that content means 20% less active and backup storage, bandwidth, cloud storage for offsite disaster recovery, and archive volume. Effective and thorough tagging can identify such elements to reduce these costs, and simultaneously reduce the company's cost and exposure related to legal or regulatory requirements.
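Exact duplicates, the easiest slice of that redundant content, can be found by hashing case- and whitespace-normalized text. This is a minimal sketch with an invented data structure; near-duplicates (drafts, minor edits) need fuzzier techniques such as shingling, which a plain hash comparison cannot catch:

```python
import hashlib

def find_duplicates(documents):
    """Group document ids whose case- and whitespace-normalized text is
    byte-for-byte identical. Exact duplicates only.

    `documents` maps a document id to its text (invented structure)."""
    by_hash = {}
    for doc_id, text in documents.items():
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_hash.setdefault(digest, []).append(doc_id)
    return [ids for ids in by_hash.values() if len(ids) > 1]
```

Each returned group is a candidate set for disposition review: keep one copy, retire the rest.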

The Value Beyond Cost Savings

Effectively managed content lowers the cost of content management and reduces exposure to risk. While this alone is reason enough to improve categorization, there are other reasons.

Superior categorization through conceptual analysis also affects operational efficiency by enabling fast, accurate, and complete content gathering. A significant benefit for any enterprise is that it allows more time for actual work by reducing the time it takes to find necessary information. It is of critical importance for companies whose revenue depends on their customers quickly and easily finding quality information.

Conceptual analytics systems deliver two other advantages over traditional taxonomy methods and manual categorization. First, they create a mathematical index, which is useless to anyone trying to mine it for private information or clues about the company. Second, they are deterministic and repeatable: they give the same result every time, which is very valuable in legal and regulatory activities.

Concept-based analysis makes content findable and actionable, regardless of language, by automatically categorizing it based on understanding developed from example documents you provide. Both internally and externally, the company becomes more competitive with one of its most important assets: unstructured information.

Thursday, June 20, 2013

Intelligent Search and Automated Metadata

The inability to identify the value in unstructured content is the primary challenge in any application that requires the use of metadata. Search cannot find and deliver relevant information in the right context, at the right time without good quality metadata.

What is required is an information governance approach that creates an infrastructure framework encompassing automated intelligent metadata generation, auto-classification, and the use of goal- and mission-aligned taxonomies. From this framework, intelligent metadata enabled solutions can be rapidly developed and implemented. Only then can organizations leverage their knowledge assets to support search, litigation, e-discovery, text mining, sentiment analysis, and open source intelligence.

Manual tagging is still the primary approach used to describe content, and it often lacks any alignment with enterprise business goals. This subjectivity and ambiguity carry over into search, resulting in inaccuracy and the inability to find relevant information across the enterprise.

Metadata used by search engines may consist of end-user tags, pre-defined tags, or metadata generated through system-defined rules, keyword and proximity matching, extensive rule building, end-user ratings, or artificial intelligence. Typically, search engines provide no way to rapidly adapt to organizational needs or to account for an organization’s unique nomenclature.

More effective is an enterprise metadata infrastructure that consistently generates intelligent metadata using concept identification. This is a profoundly different approach: relevant documents, regardless of where they reside, are retrieved even if they don't contain the exact search terms, because the concepts and relationships between similar content have been identified. Eliminating end-user tagging, and the organizational ambiguity that comes with it, allows the enriched metadata to be used by any search engine index, for example ConceptSearch, SharePoint, Solr, Autonomy, or Google Search Appliance.
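The retrieval behavior described here, finding documents that never contain the exact search term, can be illustrated with a tiny inverted index over canonical concepts. The concept map below is hand-written for the example; in a real deployment those equivalences would come from the concept-identification layer, not from a static table:

```python
# Hand-written concept map for the example; a real system would derive
# these equivalences from concept identification, not a static table.
CONCEPT_MAP = {"car": "vehicle", "automobile": "vehicle", "truck": "vehicle"}

def canonical_terms(text):
    """Map each token to its canonical concept where one is known."""
    return {CONCEPT_MAP.get(token, token) for token in text.lower().split()}

def build_index(documents):
    """Inverted index from canonical concept to the ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in canonical_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return every document sharing a canonical concept with the query."""
    results = set()
    for term in canonical_terms(query):
        results |= index.get(term, set())
    return results
```

A query for "car" retrieves documents that only ever say "automobile" or "truck", because all three index under the same concept.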

Only when metadata is consistently accurate and trusted by the organization can improvements be achieved in text analytics, e-discovery and litigation support.

In the exploding age of big data, and more specifically text analytics, sentiment analysis and even open source intelligence, the ability to harness the meaning of unstructured content in real time improves decision-making and enables organizations to proactively act with greater certainty on rapidly changing business complexities.

To achieve an effective information governance strategy for unstructured content, results are predicated on the ability to find information and eliminate inappropriate information. The core enterprise search component must be able to incorporate and digest content from any repository, including faxes, scanned content, social sites (blogs, wikis, communities of interest, Twitter), emails, and websites. This provides a 360-degree corporate view of unstructured content, regardless of where it resides or how it was acquired.

Ensuring that the right information is available to end users and decision makers is fundamental to trusting the accuracy of the information and is another key requirement in intelligent search. Organizations can then find the descriptive needles in the haystack to gain competitive advantage and increase business agility.

An intelligent metadata enabled solution for text analytics analyzes and extracts highly correlated concepts from very large document collections. This enables organizations to build an ecosystem of semantics that delivers understandable, trusted results, continually updated in real time.

Applying the concept of intelligent search to e-discovery and litigation support: traditional information retrieval systems use "keyword searches" of text and metadata to identify and filter documents. The challenges and escalating costs of e-discovery and litigation support continue to increase, and intelligent search reduces those costs and alleviates many of the challenges.

Content can be presented to knowledge professionals in a manner that enables them to identify relevant information more rapidly and accurately. Significant benefits come from removing ambiguity in content and identifying concepts within a large corpus of information. This methodology delivers efficiencies and reduces costs, offering an effective solution to many challenges typically unsolved in e-discovery and litigation support.

Organizations must incorporate an approach that addresses the lack of an intelligent metadata infrastructure. Intelligent search, a by-product of the infrastructure, must encourage, not hamper, the use and reuse of information and be rapidly extendable to address text mining, sentiment analysis, e-discovery, and litigation support.

The additional components of auto-classification and taxonomies complete the core infrastructure to deploy intelligent metadata enabled solutions, including records management, data privacy, and migration. Search can no longer be evaluated on features, but on proven results that deliver insight into all unstructured content.

Monday, November 19, 2012

Content Management Systems Reviews - Open Text - ECM Suite - Auto Classification

For records managers and others responsible for building and enforcing classification policies, retention schedules, and other aspects of a records management plan, the problems with traditional, manual classification methods can be overwhelming.

Content needs to be classified or understood in order to determine why it must be retained, how long it must be retained, and when it can be dispositioned. Managing the retention and disposition of information reduces litigation risk, reduces discovery and storage costs, and ensures that organizations maintain regulatory compliance.

Classification is the last thing end-users want (or are able) to do. Users see the process of sorting records from transient content as intrusive, complex, and counterproductive. On top of this, the popularity of mobile devices and social media applications has effectively fragmented the content authoring market and has eliminated any chance of building consistent classification tools into end-user applications.

However, if classification is not carried out, there are serious implications when regulators or auditors ask the organization to provide reports defending its records and retention management program.

User concerns aside, records managers also struggle to enforce policies that rely on manual, human-based approaches. When classification is left to users, accuracy and consistency are often inadequate and the productivity costs are high. These issues, in turn, increase business and legal risk and can quickly make the entire records management program unsustainable at scale.

So what is the answer? How can organizations overcome the challenges posed by classification?

The answer is a solution that provides automatic identification, classification, retrieval, archival, and disposal capabilities for electronic records as required by the records management policy.

OpenText Auto-Classification is the solution that combines records management with cutting edge semantic capabilities for classification of content. It eliminates the need for users to manually identify records and apply requisite classification. By taking the burden of classification off users, records managers can improve consistency of classification and better enforce rules and policies.

OpenText Auto-Classification makes it possible for records managers to easily demonstrate a defensible approach to classification based on statistically relevant sampling and quality control. Consequently, this minimizes the risk of regulatory fines and eDiscovery sanctions.

It provides a solution that eliminates the need for users to sort and classify a growing volume of content, while offering records managers and the organization as a whole the ability to establish a completely transparent records management program as part of their broader information governance strategy.

Auto-Classification uses the OpenText analytics engine to go through documents, codifying language-specific nuances identified by linguistic experts.

Features

Automated Classification: Automate the classification of content in OpenText Content Server in line with existing records management classifications.

Advanced Techniques: Classification process based on a hybrid approach that combines machine learning, rules, and content analytics.

Flexible Classification: Ability to define classification rules using keywords or metadata.

Policy-Driven Configuration: Ability to configure and optimize the classification process with an easy step-by-step tuning guide.

Advanced Optimization Tools: Reports make it easy to examine classification results, identify potential accuracy issues, and then fix those issues by leveraging the provided optimization hints.

Sophisticated Relevancy and Accuracy Assurance: Automatic sampling and benchmarking with a complete set of metrics to assess the quality of the classification process.

Quality Assurance Workbench: Advanced reports on a statistically relevant sample to review and code documents that have been automatically classified to manually assess the quality of the classification results when desired.

OpenText Auto-Classification was developed in close partnership with customers using the OpenText ECM Suite, and works in conjunction with OpenText Records Management so that existing classifications and classified documents can be used during the tuning process.

Thursday, August 23, 2012

Content Management Systems Reviews - Documentum - Automatic Classification - Captiva Dispatcher

In my last post, I mentioned that Documentum has two tools for automatic classification: Content Intelligence Services (CIS) and EMC Captiva Dispatcher, and I described the Content Intelligence Services (CIS) tool. In this post, I am going to describe EMC Captiva Dispatcher.

EMC Captiva Dispatcher delivers high-speed automatic content classification, data extraction, and document routing. With Dispatcher, companies can scan multiple batches of structured, semi-structured, and unstructured content within a single flow, without the need for separator sheets, barcodes, or patch codes. By combining EMC Captiva Dispatcher with the Captiva InputAccel intelligent enterprise capture platform, you can scan, classify, extract, and deliver data from almost any kind of electronic or paper document, often without the need for manual sorting or data entry.

The result is cost reduction and business process optimization: measures that save time and money while increasing the ability to manage the flow of incoming documents.

One of the greatest strengths of Dispatcher lies in its ability to identify similar document types. It uses both text- and image-based analysis to determine document types, automatically captures business data for search and archiving or to drive transactional processes, and routes documents to the appropriate department for processing. The technology works by automatically learning the attributes of existing documents and using them as a basis for classifying new incoming documents.

Because it analyzes a document's layout, such as logos or other graphical elements, Dispatcher is completely language and format independent. For unstructured and semi-structured documents, the system uses full-text engine results, looking for keywords and text phrases contained in a document to determine the document type. Because document types are learned from visual layout, new document types can be added automatically.

Dispatcher performs automated data extraction and validation, reducing the need for manual data entry and ensuring that accurate information is passed to back-office systems. Dispatcher includes several recognition engines that allow you to extract machine printed and handwritten text, check marks, and barcode information.

For structured forms, Dispatcher extracts data using fast and accurate pre-defined zones. For less structured documents, like invoices or contracts, Dispatcher extracts data using more flexible, free form recognition, enabling data to be extracted regardless of where it exists on a page.
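The zone-versus-free-form distinction can be illustrated with plain regular expressions: rather than reading a fixed coordinate zone, a free-form approach searches the whole page text for labeled fields. This only illustrates the idea, not how the Captiva recognition engines are actually implemented; the field patterns are invented:

```python
import re

# Invented field patterns for illustration only.
PATTERNS = {
    "invoice_number": re.compile(r"invoice\s*(?:no\.?|number)[:\s]+(\S+)", re.I),
    "total": re.compile(r"total\s*(?:due)?[:\s]+\$?([\d,]+\.\d{2})", re.I),
}

def extract_fields(page_text):
    """Free-form extraction: search the whole page for labeled fields
    rather than reading fixed coordinate zones."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(page_text)
        if match:
            fields[name] = match.group(1)
    return fields
```

The same patterns work whether the invoice number appears in a header, a footer, or mid-page, which is exactly what fixed zones cannot handle.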

This broad set of recognition technologies and methods ensure that data is extracted with the highest performance from structured forms, while also providing maximum flexibility to extract data from all document types.

As part of the EMC Captiva intelligent enterprise capture solution, Dispatcher integrates seamlessly with InputAccel, providing a capture platform that supports both centralized and distributed environments. InputAccel's custom capture process flows manage the end-to-end process, ensuring that documents are classified, data is extracted and validated, and information is delivered to all relevant content repositories and business systems. Together, Dispatcher and InputAccel give organizations a complete solution capable of processing anywhere from a few thousand documents a day to several million.

Friday, August 17, 2012

Content Management Systems Reviews - Documentum - Automatic Classification

Documentum has two tools for automatic classification: Content Intelligence Services (CIS) and EMC Captiva Dispatcher. The subject of today's post is Content Intelligence Services (CIS). In my next post, I will describe EMC Captiva Dispatcher.

Content Intelligence Services (CIS) is an extension to the EMC Documentum content management platform that enables automatic classification and categorization of content in the Documentum repository. Its benefit is well-organized, classified, and categorized content. With CIS, content is parsed and analyzed, and classification rules are applied. The results of the classification can then be used for categorization and as keywords to populate content metadata.

The capability of automatically creating keywords to populate content metadata removes a burden from end users who would otherwise have to do it manually. Many users struggle to consistently populate metadata as content is created, which significantly limits its future use, since metadata is what enables processing of the content.

CIS eliminates this dependency on users. It can propose metadata to users, who can accept or modify it as needed. CIS also supports a combination of automatic and manual classification through a special user interface for category owners, who can make classification decisions manually in cases where the automatic rules cannot classify content with a preset certainty level. The user interface is built into every Documentum client, such as Webtop, and becomes active upon detection of CIS in the system.

With CIS, the results of the classification can be used for content categorization, which assigns content to appropriate categories. Typically, categories are represented by a folder structure to which content is linked. A category hierarchy, or taxonomy, is usually common to a department or an organization and allows all users to share the same navigation view for content in active projects or content that has been archived.

CIS comes with prepackaged taxonomies for various industries. These taxonomies can be used out of the box or as a starting point for customization. Users can add categories and sub-categories to them.

CIS supports the major European languages. This enables the classification of local content in its native language against an enterprise-wide or local taxonomy. Using these multilingual capabilities, companies can deploy CIS globally, enhancing the globalization capabilities of Documentum, which include pervasive Unicode compatibility and localized user interfaces.

Next time: EMC Captiva Dispatcher.

Wednesday, July 25, 2012

Automatic Classification

In my previous posts, I mentioned that a taxonomy is necessary to create navigation to content. If users know what they are looking for, they are going to search. If they don't know what they are looking for, they will look for ways to navigate to content, in other words, to browse through it. Taxonomies can also be used to filter search results so that results are restricted to a selected node in the hierarchy.

Once documents have been classified, users can browse the document collection, using an expanding tree-view to represent the taxonomy structure.

When many documents are involved, creating a taxonomy can be time-consuming. There are a few tools on the market that provide automatic classification. Another use of automatic classification is to automatically tag content with controlled metadata (also known as automatic metadata tagging) to increase the quality of search results.

The tools that provide automatic classification include Autonomy, ClearForest, Documentum, Interwoven, Inxight, Mohomine, Open Text, Oracle, and SmartLogic.

These tools can classify any type of text document. Classification is performed either on a document repository or on a stream of incoming documents.

Here is how this software works. Consider this example: "International Business Machines today announced that it would acquire Widget, Inc. A spokesperson for IBM said: 'Big Blue will move quickly to ensure a speedy transition.'"

The software classifies concepts rather than words. Words are first stemmed, that is, reduced to their root form. Next, stop words are eliminated; these include words such as a, an, in, and the, which add little semantic information. Then, words with similar meanings are equated using a thesaurus. For example, the terms IBM, International Business Machines, and Big Blue are treated as equivalent.

Next, the software uses statistical or language-processing techniques to identify noun phrases, or concepts, such as "red bicycle". These phrases are then reduced, again using the thesaurus, to the distinct concepts that will be associated with the document. In this example, there are 3 instances of IBM, 2 instances of acquisition (acquire, speedy transition), and 1 instance of Widget, Inc.
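The pipeline above can be sketched in a few lines of Python. The stop list and thesaurus below are tiny hand-made stand-ins for the full lexicons a real product would ship with, and stemming is omitted to keep the sketch short:

```python
import re
from collections import Counter

# Hypothetical, minimal lexicons for the IBM example.
STOP_WORDS = {"a", "an", "in", "the", "that", "it", "to", "for", "of",
              "would", "will", "today", "said"}
THESAURUS = {  # maps surface phrases to a canonical concept
    "international business machines": "IBM",
    "big blue": "IBM",
    "ibm": "IBM",
    "acquire": "acquisition",
    "speedy transition": "acquisition",
    "widget, inc": "Widget, Inc.",
}

def extract_concepts(text):
    """Reduce a document to a bag of canonical concepts with counts."""
    lowered = text.lower()
    counts = Counter()
    # Equate synonyms first, longest phrases first, so that
    # "international business machines" wins over its parts.
    for phrase in sorted(THESAURUS, key=len, reverse=True):
        hits = lowered.count(phrase)
        if hits:
            counts[THESAURUS[phrase]] += hits
            lowered = lowered.replace(phrase, " ")
    # Remaining single words: drop stop words (stemming omitted).
    for word in re.findall(r"[a-z]+", lowered):
        if word not in STOP_WORDS:
            counts[word] += 1
    return counts

doc = ('International Business Machines today announced that it would '
       'acquire Widget, Inc. A spokesperson for IBM said: "Big Blue '
       'will move quickly to ensure a speedy transition".')
# Yields 3 instances of IBM, 2 of acquisition, 1 of Widget, Inc.,
# plus the leftover content words.
print(extract_concepts(doc))
```

Running this on the example sentence reproduces the counts given above: the three surface forms of IBM collapse into one concept with a count of 3.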

Approaches to Classification

Manual - requires individuals to assign each document to one or more categories. It can achieve a high degree of accuracy, but it is labor intensive and therefore more costly than automatic classification in the long run.

Rule-based - keywords or Boolean expressions are used to categorize a document. This is typically used when a few words can adequately describe a category. For example, if a collection of medical papers is to be classified by disease, each disease's scientific, common, and alternative names can be used to define the keywords for its category.
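A minimal sketch of the rule-based approach, using a hypothetical two-disease rule set in place of a real medical vocabulary:

```python
# Each category is defined by a disease's scientific, common,
# and alternative names (a hypothetical rule set).
RULES = {
    "Varicella": {"varicella", "chickenpox", "chicken pox"},
    "Pertussis": {"pertussis", "whooping cough"},
}

def classify_by_rules(text):
    """Return every category whose keyword list matches the document."""
    lowered = text.lower()
    return [category for category, keywords in RULES.items()
            if any(kw in lowered for kw in keywords)]

# A document can match several categories, or none at all.
print(classify_by_rules("A pediatric study of whooping cough outbreaks."))
# → ['Pertussis']
```

Real rule engines also support Boolean operators (AND, OR, NOT) and proximity conditions, but the principle is the same: the category is triggered by matching terms, with no training phase.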

Supervised Learning - most approaches to automatic classification require a human expert to initiate a learning process by manually assigning a number of "training documents" to each category. The classification system first analyzes the statistical occurrences of each concept in the example documents and then constructs a model, or "classifier", for each category that is used to classify subsequent documents automatically. The system refines its model, in a sense "learning" the categories as documents are processed.
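The train-then-classify cycle can be illustrated with a small multinomial Naive Bayes classifier, one common statistical technique behind this kind of supervised learning. The training set below is invented for the example:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(training_docs):
    """training_docs: (text, category) pairs labeled by a human expert.
    Builds per-category word statistics: the 'classifier' for each category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in training_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    return word_counts, doc_counts

def classify(text, model):
    """Pick the category with the highest log-probability score."""
    word_counts, doc_counts = model
    total_docs = sum(doc_counts.values())
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, float("-inf")
    for category in doc_counts:
        # log prior + log likelihood with add-one smoothing
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical training set: two expert-labeled documents per category.
training = [
    ("merger acquisition shares stock", "finance"),
    ("quarterly earnings revenue stock", "finance"),
    ("server outage network router", "it"),
    ("database server patch upgrade", "it"),
]
model = train(training)
print(classify("the company reported strong earnings and revenue", model))
# → finance
```

Refining the model as more documents are confirmed or corrected amounts to appending those documents to the training set and retraining, which is the "learning" the paragraph above describes.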

Unsupervised Learning - these systems identify groups, or clusters, of related documents as well as the relationships between these clusters. Commonly referred to as clustering, this approach eliminates the need for training sets because it does not require a preexisting taxonomy or category structure. However, clustering algorithms are not always good at selecting categories that are intuitive to users. On the other hand, clustering will often expose useful relationships and themes implicit in the collection that might be missed by a manual process. For these reasons, clustering generally works hand-in-hand with supervised learning techniques.
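As an illustration, here is a simple single-pass clustering sketch that groups documents by word overlap with no preexisting categories. The similarity threshold is an arbitrary choice for the example; production systems use more sophisticated algorithms such as k-means or hierarchical clustering:

```python
def jaccard(a, b):
    """Word-overlap similarity between two word sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.25):
    """Single-pass clustering: no preexisting taxonomy is required.
    Each cluster is represented by the union of its documents' words."""
    clusters = []  # list of (representative word set, [doc indices])
    for i, text in enumerate(docs):
        words = set(text.lower().split())
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                rep |= words          # grow the cluster's vocabulary
                members.append(i)
                break
        else:
            clusters.append((words, [i]))  # start a new cluster
    return [members for _, members in clusters]

docs = [
    "stock market shares trading",
    "shares trading stock exchange",
    "football match season goals",
    "goals season football league",
]
print(cluster(docs))
# → [[0, 1], [2, 3]]
```

The two clusters emerge purely from vocabulary overlap; naming them ("Finance", "Sports") is exactly the step that, per the paragraph above, still benefits from a human or a supervised model.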

Each of these approaches is optimal for a different situation. As a result, classification vendors are moving to support multiple methods.

Most real-world implementations combine search, classification, and other techniques, such as identifying similar documents, to provide a complete information-retrieval solution. Organizations with document repositories will generally benefit from a customized taxonomy.

Once documents are clustered, an administrator can rearrange, expand, or collapse the auto-suggested clusters or categories and then give them intuitive names. The documents in each cluster serve as initial training sets for the supervised-learning algorithms that will subsequently refine the categories. The end result is a taxonomy and a set of topic models that are fully customized for an organization's needs.

Building an extensive custom taxonomy can be a large expense. However, automated classification tools can reduce the taxonomy development and maintenance cost.

Organizations with document collections that span complex areas such as medicine, biotechnology, or aerospace will have large taxonomies. However, there are ways to refine a taxonomy so that building it does not become an overwhelming task.

Together, enterprise search and classification provide an initial response to information overload.