Sunday, August 31, 2014

Role of Automatic Classification in Information Governance

Defensible disposal of unstructured content is a key outcome of sound information governance programs. A sound approach to records management as part of the organization’s information governance strategy is rife with challenges.

Among these challenges are exploding content volumes, the difficulty of accurately distinguishing business records from transient or non-business content, IT budgets eroded by mounting storage costs, and the need to incorporate content from legacy systems or from merger and acquisition activity.

Managing the retention and disposition of information reduces litigation risk, lowers discovery and storage costs, and helps organizations maintain regulatory compliance. To determine why content must be retained, for how long, and when it can be disposed of, the content first needs to be classified.

However, users see the process of sorting records from transient content as intrusive, complex, and counterproductive. On top of this, the popularity of mobile devices and social media applications has fragmented content authoring and eliminated any chance of building consistent classification tools into end-user applications.

If classification is not being carried out, the organization faces serious implications when regulators or auditors ask for reports to defend its records and retention management program.

Records managers also struggle to enforce policies that rely on manual, human-based classification. Accuracy and consistency are often inadequate when classification is left up to users, and the cost in lost productivity is high. These issues, in turn, increase business and legal risk and can quickly make the entire records management program unsustainable because it cannot scale.

One way to overcome this challenge is automatic classification. It eliminates the need for users to manually identify records and apply the necessary classifications. By taking the burden of classification off the end user, records managers can improve the consistency of classification and better enforce rules and policies.

Auto-Classification makes it possible for records managers to easily demonstrate a defensible approach to classification based on statistically relevant sampling and quality control. Consequently, this minimizes the risk of regulatory fines and eDiscovery sanctions.
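To make "statistically relevant sampling" more concrete, here is a minimal Python sketch of how a review sample could be sized and how the observed accuracy of the automatic classification could be reported. It assumes simple random sampling and uses Cochran's formula with a finite population correction; the function names and default values (95% confidence, 5% margin of error) are illustrative assumptions and are not taken from any particular auto-classification product.

```python
import math


def sample_size(population: int, margin: float = 0.05,
                z: float = 1.96, p: float = 0.5) -> int:
    """Number of documents to review for a given margin of error.

    Uses Cochran's formula with a finite population correction.
    z = 1.96 corresponds to roughly 95% confidence; p = 0.5 is the
    most conservative guess for the true accuracy.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    # Never ask for more documents than actually exist.
    return min(population, math.ceil(n))


def observed_accuracy(correct: int, reviewed: int,
                      z: float = 1.96) -> tuple[float, float]:
    """Return (accuracy, half-width of the normal-approximation interval)."""
    accuracy = correct / reviewed
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / reviewed)
    return accuracy, half_width
```

A records manager could, for example, call sample_size(10_000) to decide how many automatically classified documents to review by hand, and then pass the number of correct classifications to observed_accuracy to document the result for auditors.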

In short, it provides a non-intrusive solution that eliminates the need for business users to sort and classify a growing volume of low-touch content, such as email and social media, while offering records managers and the organization as a whole the ability to establish a highly defensible, completely transparent records management program as part of their broader information governance strategy.

Benefits of Automatic Classification for Information Governance

Apply records management classifications as part of a consistent, programmatic component of a sound information governance program to:

Reduce:

  • Litigation risk
  • Storage costs
  • eDiscovery costs

Improve:

  • Compliance
  • Security
  • Responsiveness
  • User productivity and satisfaction

Address:

  • The fundamental difficulties in applying classifications to high-volume, low-touch content such as legacy content, email, and social media content.
  • Records manager and compliance officer concerns about defensibility and transparency.

Key features include:

  • Automated Classification: automate the classification of content in line with existing records management classifications.
  • Advanced Techniques: classification process based on a hybrid approach that combines machine learning, rules, and content analytics.
  • Flexible Classification: ability to define classification rules using keywords or metadata.
  • Policy-Driven Configuration: ability to configure and optimize the classification process with an easy "step-by-step" tuning guide.
  • Advanced Optimization Tools: reports make it easy to examine classification results, identify potential accuracy issues, and then fix those issues by leveraging the provided "optimization" hints.
  • Sophisticated Relevancy and Accuracy Assurance: automatic sampling and benchmarking with a complete set of metrics to assess the quality of the classification process.
  • Quality Assurance: advanced reports on a statistically relevant sample that let reviewers code automatically classified documents and manually assess the quality of the classification results when desired.
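The idea of defining classification rules using keywords or metadata, mentioned in the list above, can be sketched in a few lines of Python. This is only an illustration of the rule-based part of a hybrid approach: the Rule class, the classify function, and the sample classifications ("Contracts", "HR Records", "Transient") are hypothetical names chosen for the example, not the API of any real product.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A hypothetical rule: matches on keywords in the text or on metadata."""
    classification: str
    keywords: tuple[str, ...] = ()
    metadata: dict[str, str] = field(default_factory=dict)

    def matches(self, text: str, meta: dict[str, str]) -> bool:
        text_lower = text.lower()
        # Keyword rule: any keyword appears in the document text.
        keyword_hit = any(k.lower() in text_lower for k in self.keywords)
        # Metadata rule: every required metadata value is present and equal.
        metadata_hit = bool(self.metadata) and all(
            meta.get(key) == value for key, value in self.metadata.items()
        )
        return keyword_hit or metadata_hit


def classify(text: str, meta: dict[str, str], rules: list[Rule],
             default: str = "Transient") -> str:
    """Return the classification of the first matching rule."""
    for rule in rules:
        if rule.matches(text, meta):
            return rule.classification
    return default


# Illustrative rules only; a real file plan would supply its own.
RULES = [
    Rule("Contracts", keywords=("agreement", "contract")),
    Rule("HR Records", metadata={"department": "HR"}),
]
```

With these rules, a document whose text mentions an agreement would be filed under "Contracts", a document tagged with the HR department would be filed under "HR Records", and anything else would fall back to the default.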

Wednesday, August 13, 2014

Dublin Core Metadata

The word "metadata" means "data about data". Metadata describes a context for objects of interest such as document files, images, audio and video files. It can also be called resource description. As a tradition, resource description dates back to the earliest archives and library catalogs. The modern "metadata" field that gave rise to Dublin Core and other recent standards emerged with the Web revolution of the mid-1990s.

The Dublin Core Schema is a small set of vocabulary terms that can be used to describe different resources.

Dublin Core Metadata may be used for multiple purposes, from simple resource description, to combining the vocabularies of different metadata standards, to providing interoperability for metadata vocabularies in the Linked Data cloud and Semantic Web implementations.

"Dublin" refers to Dublin, Ohio, USA, where the schema originated during the 1995 invitational OCLC/NCSA Metadata Workshop hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA). "Core" indicates that the metadata terms are broad and generic, and so usable for describing a wide range of resources. The semantics of Dublin Core were established and are maintained by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, museums, and other related fields of scholarship and practice.

The Dublin Core Metadata Initiative (DCMI) provides an open forum for the development of interoperable online metadata standards for a broad range of purposes and business models. DCMI's activities include consensus-driven working groups, global conferences and workshops, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.

In 2008, DCMI separated from OCLC and incorporated as an independent entity. All changes made to the Dublin Core standard are reviewed by a DCMI Usage Board within the context of a DCMI Namespace Policy. This policy describes how terms are assigned and also limits the extent of editorial changes allowed to the labels, definitions, and usage comments.

Levels of the Standard

The Dublin Core standard originally included two levels: Simple and Qualified. Simple Dublin Core comprised 15 elements; Qualified Dublin Core included three additional elements (Audience, Provenance and RightsHolder), as well as a group of element refinements (also called qualifiers) that could refine the semantics of the elements in ways useful for resource discovery. Since 2012 the two have been incorporated into the DCMI Metadata Terms as a single set of terms using the Resource Description Framework (RDF).

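The original Dublin Core Metadata Element Set, which is the Simple level, consists of 15 metadata elements:

  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Contributor
  • Date
  • Type
  • Format
  • Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights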

Each Dublin Core element is optional and may be repeated. The DCMI has established standard ways to refine elements and encourages the use of encoding and vocabulary schemes. There is no prescribed order in Dublin Core for presenting or using the elements. Dublin Core was first published as the ISO 15836 standard in 2003 and is used as a base-level data element set for the description of learning resources in ISO/IEC 19788-2.
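As a small illustration of the "optional and repeatable" rule, the following Python sketch builds a Simple Dublin Core description as XML using the standard library. The namespace URI http://purl.org/dc/elements/1.1/ is the one DCMI publishes for the 15-element set; the wrapper element name "record", the build_record function, and the sample values are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_record(fields: dict[str, list[str]]) -> ET.Element:
    """Build a record; every element is optional and may repeat."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Simple Dublin Core element: {name}")
        for value in values:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return record


# Only three of the 15 elements are used, and "creator" is repeated.
record = build_record({
    "title": ["Dublin Core Metadata"],
    "creator": ["First Author", "Second Author"],
    "date": ["2014-08-13"],
})
xml_text = ET.tostring(record, encoding="unicode")
```

Because no order is prescribed, the elements could be supplied in any sequence and the description would carry the same meaning.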

Qualified Dublin Core

After the original 15 elements were specified, an ongoing process began to develop additional terms that extend or refine the Dublin Core Metadata Element Set (DCMES). Element refinements make the meaning of an element narrower or more specific. A refined element shares the meaning of the unqualified element, but with a more restricted scope.

In addition to element refinements, Qualified Dublin Core includes a set of recommended encoding schemes, designed to aid in the interpretation of an element value. These schemes include controlled vocabularies and formal notations or parsing rules.
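The relationship between a refinement and its unqualified element can be sketched as a simple lookup: because a refined element shares the meaning of its parent, a qualified description can be reduced to Simple Dublin Core by replacing each refinement with the element it refines. The mapping below lists a few refinements from the DCMI Metadata Terms; the function names are hypothetical, and the date check is a deliberately simplified stand-in for the W3CDTF encoding scheme (it only accepts the YYYY, YYYY-MM, and YYYY-MM-DD forms and does not validate calendar ranges).

```python
import re

# A few refinements from DCMI Metadata Terms and the elements they refine.
REFINES = {
    "alternative": "title",
    "abstract": "description",
    "tableOfContents": "description",
    "created": "date",
    "issued": "date",
    "modified": "date",
    "extent": "format",
    "medium": "format",
    "spatial": "coverage",
    "temporal": "coverage",
}

# Simplified check for the date-only forms of W3CDTF.
_W3CDTF_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def simplify(statements: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Replace each refinement with the unqualified element it refines."""
    return [(REFINES.get(term, term), value) for term, value in statements]


def looks_like_w3cdtf_date(value: str) -> bool:
    """True if the value has the shape YYYY, YYYY-MM, or YYYY-MM-DD."""
    return _W3CDTF_DATE.fullmatch(value) is not None
```

For example, a statement using the "created" refinement with the value 2014-08-13 would be reduced to a plain "date" statement with the same value, and the encoding scheme tells a consuming application how to parse that value.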


Syntax choices for Dublin Core metadata depend on a number of variables, and "one size fits all" forms rarely apply. When considering an appropriate syntax, it is important to note that Dublin Core concepts and semantics are designed to be syntax independent and are equally applicable in a variety of contexts, as long as the metadata is in a form suitable for interpretation both by machines and by human beings.
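To illustrate syntax independence, here is a short Python sketch that writes a description as HTML meta tags, following the DCMI convention of a "schema.DC" link plus "DC."-prefixed meta names. The same element names and values could equally be serialized as XML or RDF without changing their meaning. The to_html_meta function and the sample values are assumptions made for this example.

```python
from html import escape

DC_NS = "http://purl.org/dc/elements/1.1/"


def to_html_meta(fields: dict[str, list[str]]) -> str:
    """Render element/value pairs as HTML <link> and <meta> tags."""
    lines = [f'<link rel="schema.DC" href="{DC_NS}">']
    for name, values in fields.items():
        for value in values:
            # Escape quotes and angle brackets so the attribute stays valid.
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{name}" content="{content}">')
    return "\n".join(lines)


html_head = to_html_meta({
    "title": ['Dublin "Core" Metadata'],
    "language": ["en"],
})
```

The resulting lines can be placed in the head section of a web page, where both people viewing the source and harvesting software can read them.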

The Dublin Core Abstract Model provides a reference model against which particular Dublin Core encoding guidelines can be compared, independent of any particular encoding syntax. Such a reference model allows users to gain a better understanding of the descriptions they are trying to encode and facilitates the development of better mappings and translations between different syntaxes.

I will describe some applications of Dublin Core in my future posts.