Galaxy Consulting Blog: Metadata

Thursday, July 30, 2020

Metadata Driven Solutions

Metadata is data that provides information about other data. Many distinct types of metadata exist, including descriptive metadata, structural metadata, administrative metadata, reference metadata, and statistical metadata.

Descriptive metadata is descriptive information about a resource. It is used for discovery and identification. It includes elements such as title, abstract, author, and keywords.
Structural metadata is metadata about containers of data and indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships and other characteristics of digital materials.
Administrative metadata is information to help manage a resource, like resource type, permissions, and when and how it was created.
Reference metadata is information about the contents and quality of statistical data.
Statistical metadata, also called process data, may describe processes that collect, process, or produce statistical data.

Metadata, properly managed, is a powerful tool for making things happen. Processes and solutions can be driven by metadata.

In a metadata-driven application building process, instead of building the desired application directly, we define the application's specifications. These specifications are fed to an engine which builds the application for us by using predefined rules.

Instead of building a package to create a dimension, for example, we can provide the dimension description (metadata) to a package generating engine. This engine is then responsible for creating the defined package. Once the package is executed, it will create and maintain the prescribed dimension.
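
The idea can be sketched in a few lines of Python. This is only an illustration, not the way any particular package generating engine works: the dimension description, the column names, and the two rules the function enforces are all assumptions made for the example.

```python
# Hypothetical dimension description: the "metadata" handed to the engine.
customer_dimension = {
    "name": "dim_customer",
    "business_key": "customer_id",
    "attributes": {"customer_name": "VARCHAR(100)", "country": "VARCHAR(50)"},
}


def build_create_table(definition: dict) -> str:
    """Generate a CREATE TABLE statement from a dimension description.

    The rules (surrogate key first, business key NOT NULL) live in the
    engine, so every generated dimension follows them.
    """
    name = definition["name"]
    columns = [
        f"{name}_key INTEGER PRIMARY KEY",                  # rule: surrogate key
        f"{definition['business_key']} INTEGER NOT NULL",   # rule: business key
    ]
    columns += [
        f"{column} {sql_type}"
        for column, sql_type in definition["attributes"].items()
    ]
    return f"CREATE TABLE {name} (\n  " + ",\n  ".join(columns) + "\n);"


ddl = build_create_table(customer_dimension)
print(ddl)
```

Changing the dimension then means editing the description and running the generator again, rather than editing the generated statement by hand.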

Why is a metadata-driven approach so much more efficient than traditional methods?

Creating a definition of a process is much faster than creating a process. A metadata driven approach results in building the same asset in less time as compared to traditional methods.
Quality standards are enforced. The rules engine becomes the gatekeeper by enforcing best practices.
The rules engine becomes a growing knowledge base which all processes benefit from.
Easily adapts to change & extension. Simply edit the definition and submit to the engine for a build. Need to inject a custom process? No problem, create a package the old fashioned way.
Enables agile data warehousing. Agile becomes possible due to greatly increased speed of development and reduced rework required by change.

The ongoing proliferation of devices joined with the distributed nature of data sources has created an indispensable role for metadata. Metadata provides knowledge such as location of the device and nature of the data, which facilitates integration of data regardless of its origin or structure.

Enterprises are incorporating data quality and data governance functions as part of data integration flows. Embedding these processes in the integration pipeline necessitates sharing of metadata between the integration tools, and the quality and governance tools.

Metadata also facilitates performance optimization in integration scenarios by providing information on the characteristics of underlying sources in support of dynamic optimization strategies.

In content management, a folder-less approach allows you to search for and access files however you want – by client, project type, date, status or other criteria. It's completely dynamic, enabling you to organize and display information how you need it, without the limitations of an antiquated, static folder structure.

All you do is save the file and tag it with the properties you need, and you are done. No more wandering through complex, hard-to-navigate folder structures, trying to guess where to save a file. With metadata you just quickly describe what it is you are looking for.
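
As a rough sketch of what "describe what you are looking for" means in practice, the Python below filters files by their properties instead of by location. The file names, property names, and values are made up for the example.

```python
# Hypothetical file records: each file carries properties instead of a folder path.
files = [
    {"name": "proposal.docx", "client": "Acme", "project_type": "audit", "status": "draft"},
    {"name": "report.pdf", "client": "Acme", "project_type": "audit", "status": "final"},
    {"name": "invoice.xlsx", "client": "Globex", "project_type": "migration", "status": "final"},
]


def find_files(records, **criteria):
    """Return names of files whose properties match every given criterion."""
    return [
        record["name"]
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


by_client = find_files(files, client="Acme")
final_docs = find_files(files, status="final")
```

The same three files show up under different "views" depending on which properties you ask for, which is what a fixed folder tree cannot do.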

Metadata management software provides context and information for data assets stored across the enterprise. ... Metadata management tools include data catalogs, or assemblages of data organized into datasets (e.g. searchable tables or other arrangements, facilitating exploration).

Friday, July 31, 2015

Dublin Core Metadata Applications - Web Ontology Language (OWL)

In my last post, I described one of the most used applications of Dublin Core Metadata - RDF. In today's post, I will describe the second most used application of Dublin Core Metadata - Web Ontology Language (OWL).

The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies. An ontology is a formal way to describe taxonomy and classification networks, essentially defining the structure of knowledge for various domains: the nouns represent classes of objects and the verbs represent relations between the objects.

An ontology defines the terms used to describe and represent an area of knowledge. Ontologies are used by people, databases, and applications that need to share domain information (a domain is just a specific subject area or area of knowledge, like medicine, tool manufacturing, real estate, automobile repair, financial management, etc.). Ontologies include computer-usable definitions of basic concepts in the domain and the relationships among them. They encode knowledge in a domain and also knowledge that spans domains. In this way, they make that knowledge reusable.

An ontology resembles a class hierarchy. Ontologies are meant to represent information on the Internet and are expected to evolve almost constantly. They are typically very flexible because they are derived from all sorts of data sources.

The OWL languages are characterized by formal semantics. They are built upon a W3C XML standard for RDF objects. I described RDF in my previous post.

The data described by an ontology in the OWL family is interpreted as a set of "individuals" and a set of "property assertions" which relate these individuals to each other. An ontology consists of a set of axioms which place constraints on sets of individuals (called "classes") and the types of relationships permitted between them. These axioms provide semantics by allowing systems to infer additional information based on the data explicitly provided.

OWL ontologies can import other ontologies, adding information from the imported ontology to the current ontology.

For example: an ontology describing families might include axioms stating that a "hasMother" property is only present between two individuals when "hasParent" is also present, and individuals of class "HasTypeOBlood" are never related via "hasParent" to members of "HasTypeABBlood" class. If it is stated that the individual Harriet is related via "hasMother" to the individual Sue, and that Harriet is a member of the "HasTypeOBlood" class, then it can be inferred that Sue is not a member of "HasTypeABBlood".
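
The inference in this example can be traced by hand with plain Python sets. This is only a sketch of the reasoning steps, not how an OWL reasoner is implemented; the variable names are my own.

```python
# Asserted facts from the example above.
has_mother = {("Harriet", "Sue")}   # (child, mother)
type_o = {"Harriet"}                # members of HasTypeOBlood

# Axiom 1: every hasMother relation is also a hasParent relation.
has_parent = set(has_mother)

# Axiom 2: a HasTypeOBlood individual is never related via hasParent to a
# HasTypeABBlood individual, so each parent of a type O individual is
# inferred NOT to be a member of HasTypeABBlood.
not_type_ab = {parent for child, parent in has_parent if child in type_o}
```

Neither conclusion about Sue was stated explicitly; both follow from the axioms applied to the asserted facts.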

The W3C-endorsed OWL specification includes the definition of three variants of OWL, with different levels of expressiveness. These are OWL Lite, OWL DL and OWL Full.

OWL Lite

OWL Lite was originally intended to support those users primarily needing a classification hierarchy and simple constraints. It is not widely used.

OWL DL

OWL DL includes all OWL language constructs, but they can be used only under certain restrictions (for example, number restrictions may not be placed upon properties which are declared to be transitive). OWL DL is so named due to its correspondence with description logic, a field of research that has studied the logics that form the formal foundation of OWL.

OWL Full

OWL Full is based on a different semantics from OWL Lite or OWL DL, and was designed to preserve some compatibility with RDF Schema. For example, in OWL Full a class can be treated simultaneously as a collection of individuals and as an individual in its own right; this is not permitted in OWL DL. OWL Full allows an ontology to augment the meaning of the pre-defined (RDF or OWL) vocabulary.

OWL Full is intended to be compatible with RDF Schema (RDFS), and to be capable of augmenting the meanings of existing Resource Description Framework (RDF) vocabulary. This interpretation provides the meaning of RDF and RDFS vocabulary. So, the meaning of OWL Full ontologies is defined by extension of the RDFS meaning, and OWL Full is a semantic extension of RDF.

Every OWL ontology must be identified by a URI. For example: Ontology(). The languages in the OWL family use the open world assumption. Under the open world assumption, if a statement cannot be proven to be true with current knowledge, we cannot draw the conclusion that the statement is false.

Languages in the OWL family support creating classes and properties, defining instances, and performing operations on them.
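
The difference between the open world assumption and the more familiar closed world assumption (typical of databases) can be shown with a small Python sketch. The statements and the hasSibling property below are invented for illustration.

```python
from typing import Optional

# Statements we know to be true and statements we know to be false.
known_true = {("Harriet", "hasMother", "Sue")}
known_false = {("Sue", "memberOf", "HasTypeABBlood")}


def open_world(statement) -> Optional[bool]:
    """True or False when known; None (unknown) otherwise."""
    if statement in known_true:
        return True
    if statement in known_false:
        return False
    return None


def closed_world(statement) -> bool:
    """Anything not known to be true is taken as false."""
    return statement in known_true


query = ("Harriet", "hasSibling", "Tom")  # nothing is recorded about this
```

For the unrecorded statement, open_world(query) gives None (unknown) while closed_world(query) gives False.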

Instances

An instance is an object. It corresponds to a description logic individual.

Classes

A class is a collection of objects. It corresponds to a description logic (DL) concept. A class may contain individuals, instances of the class. A class may have any number of instances. An instance may belong to none, one or more classes. A class may be a subclass of another, inheriting characteristics from its parent superclass.

Classes and their members can be defined in OWL either by extension or by intension. An individual can be explicitly assigned to a class by a class assertion, for example the statement that Queen Elizabeth is an instance of human. Alternatively, membership can follow from a class expression, for example the statement that every instance of the human class that has a female value is an instance of the woman class.

Properties

A property is a directed binary relation that specifies class characteristics. It corresponds to a description logic role. Properties are attributes of instances; they sometimes act as data values and sometimes link to other instances. Properties may have logical characteristics such as being transitive, symmetric, inverse, or functional. Properties may also have domains and ranges.

Datatype Properties

Datatype properties are relations between instances of classes and RDF literals or XML schema datatypes. For example, modelName (String datatype) is a property of the Manufacturer class. They are formulated using the owl:DatatypeProperty type.

Object Properties

Object properties are relations between instances of two classes. For example, ownedBy may be an object type property of the Vehicle class and may have a range which is the class Person. They are formulated using owl:ObjectProperty.

Operators

Languages in the OWL family support various operations on classes such as union, intersection and complement. They also allow class enumeration, cardinality, and disjointness.
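
A loose analogy for these operators is Python's set type, with each class modelled as the set of its known members. The individuals and class names below are invented, and the analogy is deliberately simplified: under the open world assumption an OWL class is not a closed, finite set, so the complement here is only relative to the individuals we happen to list.

```python
# Hypothetical individuals; each class is modelled as the set of its members.
universe = {"anna", "ben", "cara", "dan"}
woman = {"anna", "cara"}
parent = {"ben", "cara"}

mother = woman & parent            # intersection
woman_or_parent = woman | parent   # union
not_parent = universe - parent     # complement, relative to the listed individuals
man = universe - woman
disjoint = woman.isdisjoint(man)   # disjointness: no shared members
```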

Metaclasses

Metaclasses are classes of classes. They are allowed in OWL Full or with a feature called class/instance punning.

Syntax

The OWL family of languages supports a variety of syntaxes. It is useful to distinguish high level syntaxes aimed at specification from exchange syntaxes more suitable for general use.

High Level

These are close to the ontology structure of languages in the OWL family.

OWL Abstract Syntax

This high level syntax is used to specify the OWL ontology structure and semantics.

The OWL abstract syntax presents an ontology as a sequence of annotations, axioms and facts. Annotations carry machine and human oriented metadata. Information about the classes, properties and individuals that compose the ontology is contained in axioms and facts only. Each class, property and individual is either anonymous or identified by a URI reference. Facts state data either about an individual or about a pair of individual identifiers (that the objects identified are distinct or the same). Axioms specify the characteristics of classes and properties.

OWL2 Functional Syntax

This syntax closely follows the structure of an OWL2 ontology. It is used by OWL2 to specify semantics, mappings to exchange syntaxes, and profiles.

OWL2 XML Syntax

OWL2 specifies an XML serialization that closely models the structure of an OWL2 ontology.

Manchester Syntax

The Manchester Syntax is a compact, human readable syntax with a style close to frame languages. Variations are available for OWL and OWL2. Not all OWL and OWL2 ontologies can be expressed in this syntax.

OWL is playing an important role in an increasing number and range of applications, and is the focus of research into tools, reasoning techniques, formal foundations and language extensions.

Wednesday, July 8, 2015

Dublin Core Metadata Applications - RDF

The Dublin Core Schema is a small set of vocabulary terms that can be used to describe different resources.

Dublin Core Metadata may be used for multiple purposes, from simple resource description, to combining metadata vocabularies of different metadata standards, to providing inter-operability for metadata vocabularies in the Linked data cloud and Semantic web implementations.

The most used applications of Dublin Core Metadata are RDF and OWL. I will describe OWL in my next post.

RDF stands for Resource Description Framework. It is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.

RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications.

This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.
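
To make the graph view concrete, here is a small Python sketch that stores triples as tuples and indexes them as a directed, labeled graph. The ex: resources are made up; the prefixed names are kept as plain strings, so this is a mental-model aid and not a real RDF library.

```python
from collections import defaultdict

# Triples as (subject, predicate, object); the ex: names are illustrative.
triples = [
    ("ex:John", "rdf:type", "foaf:Person"),
    ("ex:John", "foaf:knows", "ex:Mary"),
    ("ex:Mary", "rdf:type", "foaf:Person"),
]

# Directed, labeled graph: node -> list of (edge label, target node).
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))


def objects(subject, predicate):
    """Follow edges with the given label out of a node."""
    return [target for label, target in graph[subject] if label == predicate]
```

Each triple contributes exactly one labeled edge, which is why merging two data sets is just a matter of taking the union of their triples.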

RDF Schema or RDFS is a set of classes with certain properties using the RDF extensible knowledge representation data model, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. These resources can be saved in a triplestore to reach them with the query language SPARQL.

The first version of RDFS was published by the World Wide Web Consortium (W3C) in April 1998, and the final W3C recommendation was released in February 2004. Many RDFS components are included in the more expressive Web Ontology Language (OWL).

Main RDFS constructs

RDFS constructs are the RDFS classes, associated properties, and utility properties built on the limited vocabulary of RDF.

Classes

Resource is the class of everything. All things described by RDF are resources.

Class declares a resource as a class for other resources.

A typical example of a Class is "Person" in the Friend of a Friend (FOAF) vocabulary. An instance of "Person" is a resource that is linked to the class "Person" using the type property, such as in the following formal expression of the natural language sentence: "John is a Person".

example: John rdf:type foaf:Person

The other classes described by the RDF and RDFS specifications are:

Literal – literal values such as strings and integers. Property values such as textual strings are examples of literals. Literals may be plain or typed.
Datatype – the class of datatypes. Datatype is both an instance of and a subclass of Class. Each instance of Datatype is a subclass of Literal.
XMLLiteral – the class of XML literal values. XMLLiteral is an instance of Datatype (and thus a subclass of Literal).
Property – the class of properties.

Properties

Properties are instances of the class Property and describe a relation between subject resources and object resources.

For example, the following declarations are used to express that the property "employer" relates a subject, which is of type "Person", to an object, which is of type "Organization":

ex:employer rdfs:domain foaf:Person

ex:employer rdfs:range foaf:Organization
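
One consequence of these two declarations is that a single statement using ex:employer lets us infer the types of both its subject and its object. The Python below is a minimal sketch of that inference; ex:John and ex:Acme are made-up resources, and the dictionaries stand in for the schema triples.

```python
# Schema triples from the declarations above.
domain = {"ex:employer": "foaf:Person"}
range_ = {"ex:employer": "foaf:Organization"}

# One instance-level statement; ex:John and ex:Acme are made-up resources.
data = [("ex:John", "ex:employer", "ex:Acme")]


def infer_types(statements):
    """Derive rdf:type triples implied by rdfs:domain and rdfs:range."""
    inferred = set()
    for subject, predicate, obj in statements:
        if predicate in domain:
            inferred.add((subject, "rdf:type", domain[predicate]))
        if predicate in range_:
            inferred.add((obj, "rdf:type", range_[predicate]))
    return inferred


types = infer_types(data)
```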

Hierarchies of classes support inheritance of a property domain and range from a class to its sub-classes.

RDFS also defines the following properties:

subPropertyOf is an instance of Property that is used to state that all resources related by one property are also related by another.
Label is an instance of Property that may be used to provide a human-readable version of a resource's name.
Comment is an instance of Property that may be used to provide a human-readable description of a resource.

Utility properties

seeAlso is an instance of Property that is used to indicate a resource that might provide additional information about the subject resource.

isDefinedBy is an instance of Property that is used to indicate a resource defining the subject resource. This property may be used to indicate an RDF vocabulary in which a resource is described.

Saturday, September 6, 2014

Managed Metadata in SharePoint - Part Two

In part one of this post, I described using metadata in SharePoint. In this part two, I will describe metadata management.

Managed metadata makes it easier for Term Store Administrators to maintain and adapt your metadata as business needs evolve. You can update a term set easily. And, new or updated terms automatically become available when you associate a Managed Metadata column with that term set. For example, if you merge multiple terms into one term, content that is tagged with these terms is automatically updated to reflect this change. You can specify multiple synonyms (or labels) for individual terms. If your site is multilingual, you can also specify multilingual labels for individual terms.

Managing metadata

Managing metadata effectively requires careful thought and planning. Think about the kind of information that you want to manage in lists and libraries, and think about the way that the information is used in the organization. You can create term sets of metadata terms for lots of different information.

For example, you might have a single content type for a document. Each document can have metadata that identifies many of the relevant facts about it, such as these examples:

Document purpose - is it a sales proposal? An engineering specification? A Human Resources procedure?
Document author, and names of people who changed it
Date of creation, date of approval, date of most recent modification
Department responsible for any budgetary implications of the document
Audience

Activities that are involved with managing metadata:

Planning and configuring
Managing terms, term sets, and groups
Specifying properties for metadata

Planning and configuring managed metadata

Your organization may want to do careful planning before you start to use managed metadata. The amount of planning that you must do depends on how formal your taxonomy is. It also depends on how much control you want to impose on metadata.

If you want to let users help develop your taxonomy, then you can just have users add keywords to items, and then organize these into term sets as necessary.

If your organization wants to use managed term sets to implement formal taxonomies, then it is important to involve key stakeholders in planning and development. After the key stakeholders in the organization agree upon the required term sets, you can use the Term Store Management Tool to import or create your term sets. You can also use the tool to manage the term sets as users start to work with the metadata. If your web application is configured correctly, and you have the appropriate permissions, you can go to the Term Store Management Tool by following these steps:

1. Select Settings and then choose Site Settings.

2. Select Term store management under Site Administration.

Managing terms, term sets, and groups

The Term Store Management Tool provides a tree control that you can use to perform most tasks. Your user role for this tool determines the tasks that you can perform. To work in the Term Store Management Tool, you must be a Farm Administrator or a Term Store Administrator. Or, you can be a designated Group Manager or Contributor for term sets.

To take actions on an item in the hierarchy, follow these steps:

1. Point to the name of the Managed Metadata Service application, group, term set, or term that you want to change, and then click the arrow that appears.

2. Select the actions that you want from the menu.

For example, if you are a Term Store Administrator or a Group Manager you can create, import, or delete term sets in a group. Term set contributors can create new term sets.

Properties for terms and term sets

At each level of the hierarchy, you can configure specific properties for a group, term set, or term by using the properties pane in the Term Store Management Tool. For example, if you are configuring a term set, you can specify information such as Name, Description, Owner, Contact, and Stakeholders in the pane available on the General tab. You can also specify whether you want a term set to be open or closed to new submissions from users. Or, you can choose the Intended Use tab, and specify whether the term set should be available for tagging or site navigation.

Managed Metadata in SharePoint - Part One

Using metadata in SharePoint makes it easier to find content items. Metadata can be managed centrally in SharePoint and can be organized in a way that makes sense in your business. When the content across sites in an organization has consistent metadata, it is easier to find business information and data by using search. Search features such as the refinement panel, which displays on the left-hand side of the search results page, enable users to filter search results based on metadata.

SharePoint metadata management supports a range of approaches to metadata, from formal taxonomies to user-driven folksonomies. You can implement formal taxonomies through managed terms and term sets. You can also use enterprise keywords and social tagging, which enable site users to tag content with keywords that they choose. SharePoint enables organizations to combine the advantages of formal, managed taxonomies with the dynamic benefits of social tagging in customized ways.

Metadata navigation enables users to create views of information dynamically, based on specific metadata fields. Then, users can locate libraries by using folders or by using metadata pivots, and refine the results by using additional key filters.

You can choose how much structure and control to use with metadata, and the scope of control and structure. For example:

You can apply control globally across sites, or make it local to specific sites.
You can configure term sets to be closed or open to user contributions.
You can choose to use enterprise keywords and social tagging with managed terms, or not.

The managed metadata features in SharePoint enable you to control how users add metadata to content. For example, by using term sets and managed terms, you can control which terms users can add to content, and who can add new terms. You can also limit enterprise keywords to a specific list by configuring the keywords term set as closed.

When the same terms are used consistently across sites, it is easier to build robust processes or solutions that rely on metadata. Additionally, it is easier for site users to apply metadata consistently to their content.

Metadata Terms

A term is a specific word or phrase that you associate with an item on a SharePoint site. A term has a unique ID and it can have many text labels (synonyms). If you work on a multilingual site, the term can have labels in different languages.
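
The relationship between a term's ID and its labels can be pictured with a small data structure. The Python below is an illustrative model only; it does not use or reflect the SharePoint API, and the class, field, and method names are my own.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Term:
    """A term: one stable ID, many labels keyed by language code."""
    default_label: str
    labels: dict = field(default_factory=dict)   # e.g. {"en": ["HR"]}
    term_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def add_label(self, language: str, label: str) -> None:
        """Attach a synonym or translation to this term."""
        self.labels.setdefault(language, []).append(label)

    def matches(self, text: str) -> bool:
        """True if text equals the default label or any other label."""
        candidates = [self.default_label]
        for language_labels in self.labels.values():
            candidates.extend(language_labels)
        return text.lower() in (candidate.lower() for candidate in candidates)


hr = Term("Human Resources")
hr.add_label("en", "HR")
hr.add_label("fr", "Ressources humaines")
```

Because content is tagged with the ID rather than the text, renaming a label or adding a synonym does not require touching the tagged items.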

There are two types of terms:

Managed terms are terms that are pre-defined. Term Store administrators organize managed terms into a hierarchical term set.

Enterprise keywords are words or phrases that users add to items on a SharePoint site. The collection of enterprise keywords is known as a keywords set. Typically, users can add any word or phrase to an item as a keyword. This means that you can use enterprise keywords for folksonomy-style tagging. Sometimes, Term Store administrators move enterprise keywords into a specific managed term set. When they are part of a managed term set, keywords become available in the context of that term set.

Term Set

A Term Set is a group of related terms. Term sets can have different scopes, depending on where you create the term set.

Local term sets are created within the context of a site collection, and are available for use (and visible) only to users of that site collection. For example, when you create a term set for a metadata column in a list or library, then the term set is local. It is available only in the site collection that contains this list or library. For example, a media library might have a metadata column that shows the kind of media (diagram, photograph, screen shot, video, etc.). The list of permitted terms is relevant only to this library, and available for use in the library.

Global term sets are available for use across all sites that subscribe to a specific Managed Metadata Service application. For example, an organization might create a term set that lists names of business units in the organization, such as Human Resources, Marketing, Information Technology, and so on.

In addition, you can configure a term set as closed or open. In a closed term set, users can't add new terms unless they have appropriate permissions. In an open term set, users can add new terms in a column that is mapped to the term set.
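
The open and closed behaviour can be sketched as a simple rule in Python. This is a conceptual model rather than SharePoint code; the class name, the can_manage flag, and the sample terms are assumptions for the example.

```python
class TermSet:
    """A named set of terms that is either open or closed to new submissions."""

    def __init__(self, name, terms, is_open):
        self.name = name
        self.terms = set(terms)
        self.is_open = is_open

    def add_term(self, term, can_manage=False):
        """Add a term; in a closed set only users who may manage it can do so."""
        if not self.is_open and not can_manage:
            raise PermissionError(f"term set '{self.name}' is closed")
        self.terms.add(term)


departments = TermSet("Departments", {"Human Resources", "Marketing"}, is_open=False)
media_kinds = TermSet("Media kinds", {"diagram", "photograph"}, is_open=True)

media_kinds.add_term("video")            # open set: any user may add
try:
    departments.add_term("Legal")        # closed set, ordinary user
    rejected = False
except PermissionError:
    rejected = True
departments.add_term("Legal", can_manage=True)   # user with permission
```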

Group

Group is a security term. With respect to managed metadata, a group is a set of term sets that share common security requirements. Only users who have contributor permissions for a specific group can manage term sets that belong to the group or create new term sets within it. Organizations should create groups for term sets that will have unique access or security needs.

Term Store Management Tool

The Term Store Management Tool is the tool that people who manage taxonomies use to create or manage term sets and the terms within them. The Term Store Management tool displays all the global term sets and any local term sets available for the site collection from which you access the Term Store Management Tool.

Managed Metadata column

A Managed Metadata column is a special kind of column that you can add to lists or libraries. It enables site users to select terms from a specific term set. A Managed Metadata column can be mapped to an existing term set, or you can create a local term set specifically for the column.

Enterprise Keywords column

The Enterprise Keywords column is a column that you can add to content types, lists, or libraries to enable users to tag items with words or phrases that they choose. By default, it is a multi-value column. When users type a word or phrase into the column, SharePoint presents type-ahead suggestions. Type-ahead suggestions might include items from managed term sets and the Keywords term set. Users can select an existing value, or enter a new term.

Social Tags

Social tags are words or phrases that site users can apply to content to help them categorize information in ways that are meaningful to them. Social tagging is useful because it helps site users to improve the discoverability of information on a site. Users can add social tags to information on a SharePoint site and to URLs outside a SharePoint site.

A social tag contains pointers to three types of information:

A user identity
An item URL
A term

When you add a social tag to an item, you can specify whether you want to make your identity and the item URL private. However, the term part of the social tag is always public, because it is stored in the Term Store.

When you create a social tag, you can choose from a set of existing terms or enter something new. If you select an existing term, your social tag contains a pointer to that term.

If, instead, you enter a new term, SharePoint creates a new keyword for it in the keywords term set. The new social tag points to this term. In this manner, social tags support folksonomy-based tagging. Additionally, when users update an Enterprise Keywords or Managed Metadata column, SharePoint can create social tags automatically. These terms then become visible as tags in newsfeeds, tag clouds, or My Site profiles.
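
The three-part structure of a social tag, and the rule that the term is always public, can be modelled roughly as follows. This is a conceptual Python sketch, not SharePoint's implementation; the names, the sample URL, and the set standing in for the keywords term set are all assumptions.

```python
from dataclasses import dataclass

keywords = {"budget"}  # stand-in for the keywords term set


@dataclass
class SocialTag:
    user: str          # user identity
    item_url: str      # item URL
    term: str          # points at a term in the term store
    private: bool = False

    def public_view(self) -> dict:
        """The term is always visible; user and URL are hidden when private."""
        if self.private:
            return {"term": self.term}
        return {"term": self.term, "user": self.user, "item_url": self.item_url}


def tag_item(user, item_url, term, private=False):
    """Create a tag; a new term becomes a keyword, an existing one is reused."""
    keywords.add(term)
    return SocialTag(user, item_url, term, private)


tag = tag_item("alice", "https://example.com/plan.docx", "roadmap", private=True)
```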

List or library owners can enable or disable metadata publishing by updating the Enterprise Metadata and Keywords Settings for a list or library.

In the second part of this post, I will describe managing SharePoint metadata.

Wednesday, August 13, 2014

Dublin Core Metadata

The word "metadata" means "data about data". Metadata describes a context for objects of interest such as document files, images, audio and video files. It can also be called resource description. As a tradition, resource description dates back to the earliest archives and library catalogs. The modern "metadata" field that gave rise to Dublin Core and other recent standards emerged with the Web revolution of the mid-1990s.

The Dublin Core Schema is a small set of vocabulary terms that can be used to describe different resources.

"Dublin" refers to Dublin, Ohio, USA where the schema originated during the 1995 invitational OCLC/NCSA Metadata Workshop hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA). "Core" refers to the metadata terms being broad and generic, and therefore usable for describing a wide range of resources. The semantics of Dublin Core were established and are maintained by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, museums, and other related fields of scholarship and practice.

The Dublin Core Metadata Initiative (DCMI) provides an open forum for the development of inter-operable online metadata standards for a broad range of purposes and of business models. DCMI's activities include consensus-driven working groups, global conferences and workshops, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.

In 2008, DCMI separated from OCLC and incorporated as an independent entity. Any and all changes that are made to the Dublin Core standard are reviewed by a DCMI Usage Board within the context of a DCMI Namespace Policy. This policy describes how terms are assigned and also sets limits on the amount of editorial changes allowed to the labels, definitions, and usage comments.

Levels of the Standard

The Dublin Core standard originally included two levels: Simple and Qualified. Simple Dublin Core comprised 15 elements; Qualified Dublin Core included three additional elements (Audience, Provenance and RightsHolder), as well as a group of element refinements (also called qualifiers) that could refine the semantics of the elements in ways that may be useful in resource discovery. Since 2012 the two have been incorporated into the DCMI Metadata Terms as a single set of terms using the Resource Description Framework (RDF).

The original Dublin Core Metadata Element Set which is the Simple level consists of 15 metadata elements:

Title

Creator

Subject

Description

Publisher

Contributor

Date

Type

Format

Identifier

Source

Language

Relation

Coverage

Rights

Each Dublin Core element is optional and may be repeated. The DCMI has established standard ways to refine elements and encourage the use of encoding and vocabulary schemes. There is no prescribed order in Dublin Core for presenting or using the elements. The Dublin Core became ISO 15836 standard in 2006 and is used as a base-level data element set for the description of learning resources in the ISO/IEC 19788-2.
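
Because every element is optional and repeatable, a Simple Dublin Core description maps naturally onto a dictionary of lists. The Python sketch below checks a record against the 15 element names listed above; the lowercase keys, the helper function, and the sample values are choices made for this example.

```python
# The 15 elements of the Dublin Core Metadata Element Set, lowercased.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def unknown_elements(record: dict) -> set:
    """Return keys that are not Simple Dublin Core elements."""
    return set(record) - DC_ELEMENTS


# Every element is optional and repeatable, so values are lists and most
# elements are simply absent. The sample values are made up.
record = {
    "title": ["Dublin Core Metadata"],
    "creator": ["A. Author", "B. Author"],
    "subject": ["metadata", "resource description"],
    "date": ["2014-08-13"],
}
```

A key such as "audience" would be reported by unknown_elements, since it belongs to Qualified Dublin Core rather than to the original 15 elements.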

Qualified Dublin Core

Subsequent to the specification of the original 15 elements, an ongoing process to develop terms extending or refining the Dublin Core Metadata Element Set (DCMES) began, and additional terms were identified. Element refinements make the meaning of an element narrower or more specific. A refined element shares the meaning of the unqualified element, but with a more restricted scope.

In addition to element refinements, Qualified Dublin Core includes a set of recommended encoding schemes, designed to aid in the interpretation of an element value. These schemes include controlled vocabularies and formal notations or parsing rules.
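A short Python sketch can illustrate both ideas. The REFINES table lists a few refinements and the elements they narrow (created and modified refine date, abstract refines description, alternative refines title). The function names dumb_down and is_calendar_date are my own, and the date check covers only the plain YYYY-MM-DD form as a stand-in for a formal date notation.

```python
from datetime import date

# A few refinements mapped to the broader element each one narrows
REFINES = {
    "created": "date",
    "modified": "date",
    "abstract": "description",
    "alternative": "title",
}

def dumb_down(statements):
    """Replace each refinement with the unqualified element it refines.

    A system that only understands the 15 simple elements can still use
    the value, because a refined element shares the broader meaning.
    """
    return [(REFINES.get(term, term), value) for term, value in statements]

def is_calendar_date(value):
    """Check a value against one formal notation: an ISO calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

qualified = [
    ("title", "Metadata Schemes"),
    ("alternative", "An Overview of Metadata Schemes"),
    ("created", "2011-11-30"),
]
print(dumb_down(qualified))
print(is_calendar_date("2011-11-30"), is_calendar_date("November 30, 2011"))
```

The encoding scheme is what lets a program interpret "2011-11-30" reliably, while a free-text date would need guessing.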

Syntax

Syntax choices for Dublin Core metadata depend on a number of variables, and "one size fits all" forms rarely apply. When considering an appropriate syntax, it is important to note that Dublin Core concepts and semantics are designed to be syntax independent and are equally applicable in a variety of contexts, as long as the metadata is in a form suitable for interpretation both by machines and by human beings.

The Dublin Core Abstract Model provides a reference model against which particular Dublin Core encoding guidelines can be compared, independent of any particular encoding syntax. Such a reference model allows users to gain a better understanding of the descriptions they are trying to encode and facilitates the development of better mappings and translations between different syntaxes.
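As a rough illustration of syntax independence, the Python sketch below keeps one description as a list of (element, value) statements and renders it in two different syntaxes. The "DC." prefix follows the common convention for Dublin Core in HTML meta tags; the function names and sample values are assumptions for this example.

```python
from html import escape

# One description, held independently of any output syntax
statements = [
    ("title", "Metadata & Taxonomy"),
    ("creator", "Example Author"),
    ("language", "en"),
]

def to_html_meta(statements):
    """Render the statements as HTML meta tags."""
    return "\n".join(
        f'<meta name="DC.{name}" content="{escape(value)}">'
        for name, value in statements
    )

def to_plain_text(statements):
    """Render the same statements as human-readable lines."""
    return "\n".join(f"{name}: {value}" for name, value in statements)

print(to_html_meta(statements))
print(to_plain_text(statements))
```

Both outputs carry the same three statements; only the encoding differs.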

I will describe some applications of Dublin Core in my future posts.

Thursday, June 20, 2013

Intelligent Search and Automated Metadata

The inability to identify the value in unstructured content is the primary challenge in any application that requires the use of metadata. Search cannot find and deliver relevant information in the right context, at the right time without good quality metadata.

An information governance approach that creates the infrastructure framework to encompass automated intelligent metadata generation, auto-classification, and the use of goal and mission-aligned taxonomies is required. From this framework, intelligent metadata enabled solutions can be rapidly developed and implemented. Only then can organizations leverage their knowledge assets to support search, litigation, e-discovery, text mining, sentiment analysis and open source intelligence.

Manual tagging is still the primary approach used to describe content, and it often lacks any alignment with enterprise business goals. The resulting subjectivity and ambiguity carry over into search, causing inaccurate results and an inability to find relevant information across the enterprise.

Metadata used by search engines may consist of end-user tags or pre-defined tags, or it may be generated using system-defined metadata, keyword and proximity matching, extensive rule building, end-user ratings, or artificial intelligence. Typically, search engines provide no way to rapidly adapt to meet organizational needs or account for an organization’s unique nomenclature.

More effective is implementing an enterprise metadata infrastructure that consistently generates intelligent metadata using concept identification. This is a profoundly different approach: relevant documents, regardless of where they reside, will be retrieved even if they don’t contain the exact search terms, because the concepts and relationships between similar content have been identified. The elimination of end-user tagging and the resulting organizational ambiguity enables the enriched metadata to be used by any search engine index, for example, ConceptSearch, SharePoint, Solr, Autonomy or Google Search Appliance.

Only when metadata is consistently accurate and trusted by the organization can improvements be achieved in text analytics, e-discovery and litigation support.

In the exploding age of big data, and more specifically text analytics, sentiment analysis and even open source intelligence, the ability to harness the meaning of unstructured content in real time improves decision-making and enables organizations to proactively act with greater certainty on rapidly changing business complexities.

An effective information governance strategy for unstructured content depends on the ability to find information and eliminate inappropriate information. The core enterprise search component must be able to incorporate and digest content from any repository, including faxes, scanned content, social sites (blogs, wikis, communities of interest, Twitter), emails, and websites. This provides a 360-degree corporate view of unstructured content, regardless of where it resides or how it was acquired.

Ensuring that the right information is available to end users and decision makers is fundamental to trusting the accuracy of the information and is another key requirement in intelligent search. Organizations can then find the descriptive needles in the haystack to gain competitive advantage and increase business agility.

An intelligent metadata enabled solution for text analytics analyzes and extracts highly correlated concepts from very large document collections. This enables organizations to attain an ecosystem of semantics that delivers understandable and trusted results that are continually updated in real time.

Consider intelligent search applied to e-discovery and litigation. Traditional information retrieval systems use "keyword searches" of text and metadata as a means of identifying and filtering documents, and the challenges and costs of e-discovery and litigation support continue to escalate. The use of intelligent search reduces costs and alleviates many of these challenges.

Content can be presented to knowledge professionals in a manner that enables them to identify relevant information more rapidly and more accurately. Significant benefits can be achieved by removing the ambiguity in content and by identifying concepts within a large corpus of information. This methodology delivers efficiencies and reduces costs, offering an effective solution that overcomes many of the challenges typically not solved in e-discovery and litigation support.

Organizations must incorporate an approach that addresses the lack of an intelligent metadata infrastructure. Intelligent search, a by-product of the infrastructure, must encourage, not hamper, the use and reuse of information and be rapidly extendable to address text mining, sentiment analysis, e-discovery, and litigation support.

The additional components of auto-classification and taxonomies complete the core infrastructure to deploy intelligent metadata enabled solutions, including records management, data privacy, and migration. Search can no longer be evaluated on features, but on proven results that deliver insight into all unstructured content.

Wednesday, July 25, 2012

Automatic Classification

In my previous posts, I mentioned that the taxonomy is necessary to create navigation to content. If users know what they are looking for, they are going to search. If they don't know what they are looking for, they will look for ways to navigate to content, in other words, browse through content. Taxonomies can also be used as a method of filtering search results so that results are restricted to a selected node on the hierarchy.

Once documents have been classified, users can browse the document collection, using an expanding tree-view to represent the taxonomy structure.
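A minimal Python sketch of both uses of a taxonomy, browsing it as a tree and restricting search results to a selected node, could look like this. The category names, the CHILDREN table, and the sample results are invented for illustration.

```python
# Hypothetical taxonomy: each node maps to its child nodes
CHILDREN = {
    "Science": ["Medicine", "Aerospace"],
    "Medicine": ["Cardiology", "Oncology"],
}

def subtree(node):
    """Return the node together with every category below it."""
    nodes = {node}
    for child in CHILDREN.get(node, []):
        nodes |= subtree(child)
    return nodes

def tree_lines(node, depth=0):
    """Render the taxonomy as indented lines, like an expanded tree view."""
    lines = ["  " * depth + node]
    for child in CHILDREN.get(node, []):
        lines.extend(tree_lines(child, depth + 1))
    return lines

def filter_results(results, node):
    """Keep only results classified at the selected node or below it."""
    allowed = subtree(node)
    return [doc for doc, category in results if category in allowed]

# Search results as (document, category) pairs
results = [
    ("heart-study.pdf", "Cardiology"),
    ("jet-engine.doc", "Aerospace"),
    ("tumor-review.pdf", "Oncology"),
]
print("\n".join(tree_lines("Science")))
print(filter_results(results, "Medicine"))
```

Selecting "Medicine" keeps the documents classified under Cardiology and Oncology, because both nodes sit below it in the hierarchy.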

When there are many documents involved, creating a taxonomy can be time consuming. There are a few tools on the market that provide automatic classification. Another use of automatic classification is to automatically tag content with controlled metadata (also known as Automatic Metadata Tagging) to increase the quality of the search results.

The tools that provide automatic classification are: Autonomy, ClearForest, Documentum, Interwoven, Inxight, Moxomine, Open Text, Oracle, SmartLogic.

These tools can classify any type of text document. Classification is performed either on a document repository or on a stream of incoming documents.

Here is how this software works. Example: "International Business Machines today announced that it would acquire Widget, Inc. A spokesperson for IBM said: 'Big Blue will move quickly to ensure a speedy transition'."

The software classifies concepts rather than words. Words are first stemmed, that is, they are reduced to their root form. Next, stop words are eliminated. These include words such as a, an, in, the - words that add little semantic information. Then, words with similar meanings are equated using a thesaurus. For example, the words IBM, International Business Machines, and Big Blue are treated as equivalent.

Next, the software will use statistical or language processing techniques to identify noun phrases or concepts such as "red bicycle". Further, using the thesaurus, these phrases are reduced to distinct concepts that will be associated with the document. In this example, there are 3 instances of IBM, 2 instances of acquisition (acquire, speedy transition), and 1 instance of Widget, Inc.
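The steps above can be sketched in a few lines of Python. This is a simplified illustration, not how any of the listed products work: stemming is left out, and the stop-word list and the small phrase thesaurus are assumptions chosen to fit the example sentence.

```python
import re
from collections import Counter

# Words that add little semantic information (small assumed list)
STOP_WORDS = {"a", "an", "in", "the", "that", "it", "for", "to", "will", "would"}

# Hypothetical thesaurus: a phrase (tuple of tokens) maps to one concept
THESAURUS = {
    ("international", "business", "machines"): "IBM",
    ("ibm",): "IBM",
    ("big", "blue"): "IBM",
    ("acquire",): "acquisition",
    ("speedy", "transition"): "acquisition",
    ("widget", "inc"): "Widget, Inc.",
}

def extract_concepts(text):
    """Count the concepts found in a text."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOP_WORDS]
    longest = max(len(phrase) for phrase in THESAURUS)
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so "big blue" wins over single words
        for n in range(longest, 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in THESAURUS:
                counts[THESAURUS[phrase]] += 1
                i += n
                break
        else:
            i += 1  # no known phrase starts at this token
    return counts

text = ('International Business Machines today announced that it would '
        'acquire Widget, Inc. A spokesperson for IBM said: "Big Blue will '
        'move quickly to ensure a speedy transition".')
concepts = extract_concepts(text)
print(concepts)
```

With this thesaurus the sketch finds IBM three times, acquisition twice, and Widget, Inc. once, which matches the counts given in the example.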

Approaches to Classification

Manual - requires individuals to assign each document to one or more categories. It can achieve a high degree of accuracy. However, it is labor intensive and therefore more costly than automatic classification in the long run.

Rule-based - keywords or Boolean expressions are used to categorize a document. This is typically used when a few words can adequately describe a category. For example, if a collection of medical papers is to be classified by disease, the name of each disease, together with its scientific, common, and alternative names, can be used to define the keywords for each category.

Supervised Learning - most approaches to automatic classification require a human expert to initiate a learning process by manually classifying or assigning a number of "training documents" to each category. The classification system first analyzes the statistical occurrences of each concept in the example documents and then constructs a model or "classifier" for each category that is used to classify subsequent documents automatically. The system refines its model, in a sense "learning" the categories as documents are processed.

Unsupervised Learning - these systems identify groups or clusters of related documents as well as the relationships between these clusters. Commonly referred to as clustering, this approach eliminates the need for training sets because it does not require a preexisting taxonomy or category structure. However, clustering algorithms are not always good at selecting categories that are intuitive to users. On the other hand, clustering will often expose useful relationships and themes implicit in the collection that might be missed by a manual process. For these reasons, clustering generally works hand-in-hand with supervised learning techniques.

Each of these approaches is optimal for a different situation. As a result, classification vendors are moving to support multiple methods.
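Of the approaches above, the rule-based one is the easiest to sketch. The Python below assigns a document to every disease category whose keywords appear in it as whole words. The categories and keyword lists are invented for illustration and are far smaller than a real rule set would be.

```python
import re

# Hypothetical rules: each category lists the names a disease goes by
RULES = {
    "Influenza": ["influenza", "flu", "grippe"],
    "Hypertension": ["hypertension", "high blood pressure"],
}

def classify(text, rules=RULES):
    """Return every category with at least one keyword present in the text."""
    lowered = text.lower()
    matched = []
    for category, keywords in rules.items():
        # \b keeps "flu" from matching inside words such as "influenced"
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            matched.append(category)
    return matched

print(classify("Seasonal flu vaccination in patients with high blood pressure"))
print(classify("The outcome was influenced by age"))
```

A document can land in more than one category, and a document that matches no rule stays unclassified, which is where the learning-based approaches become useful.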

Most real world implementations combine search, classification, and other techniques such as identifying similar documents to provide a complete information retrieval solution. Organizations having document repositories will generally benefit from a customized taxonomy.

Once documents are clustered, an administrator can first rearrange, expand or collapse the auto-suggested clusters or categories, and then give them intuitive names. The documents in each cluster serve as initial training sets for supervised-learning algorithms that will be used subsequently to refine the categories. The end result is a taxonomy and a set of topic models that are fully customized for an organization's needs.
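To show how named clusters can become training sets, here is a small Python sketch. It uses a nearest-centroid classifier over word counts with cosine similarity, which is one simple stand-in for the supervised-learning algorithms mentioned above and not the method of any particular vendor. The cluster names and documents are invented.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a text into word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(clusters):
    """Build one summed word-count vector per named cluster."""
    model = {}
    for name, docs in clusters.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        model[name] = total
    return model

def classify(model, text):
    """Assign the text to the category whose vector is most similar."""
    vec = vectorize(text)
    return max(model, key=lambda name: cosine(vec, model[name]))

# Clusters after an administrator has reviewed and named them
clusters = {
    "Cardiology": ["heart valve surgery", "heart rhythm and blood pressure"],
    "Aerospace": ["jet engine turbine design", "wing and engine aerodynamics"],
}
model = train(clusters)
print(classify(model, "new turbine engine test"))
print(classify(model, "blood pressure and heart disease"))
```

New documents are then classified without further manual work, and the model can be retrained as the administrator adjusts the clusters.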

Building an extensive custom taxonomy can be a large expense. However, automated classification tools can reduce the taxonomy development and maintenance cost.

Organizations with document collections that span complex areas such as medicine, biotechnology, and aerospace will have a large taxonomy. However, there are ways to refine a taxonomy so that maintaining it does not become an overwhelming task.

Together, enterprise search and classification provide an initial response to information overload.

Wednesday, February 22, 2012

DITA, Metadata, and Taxonomy

Component-oriented content creation enables more efficient content reuse and dynamic publishing in more languages at a lower cost. XML authoring is required for component content creation.

Research shows that organizations that use XML authoring are more mature than their peers with respect to the adoption of best practices for search and metadata. However, the use of native DITA (the Darwin Information Typing Architecture) metadata capabilities is rare, and many are also missing out on opportunities to use taxonomy for reuse and improved findability.

In this post, I am going to describe metadata capabilities within DITA, discuss two major benefits that can be achieved by using descriptive metadata and taxonomy, and recommend some best practices for getting started with metadata for component-oriented content.

Finding content in your file system or content repository is hard enough when you’ve got simple text documents to deal with. When you are using DITA and other component-oriented XML standards, you increase the difficulty by two or three orders of magnitude, because you’re looking for smaller needles in bigger haystacks. Having thousands of media-independent content objects that can be shared and reused across multiple deliverables allows you to create more sophisticated knowledge products, but it definitely poses a challenge in findability for content authors.

Among its many features for content reuse, DITA provides content creators with a facility for tagging content objects with metadata. Metadata (data about the data) lets content authors and others who manage content describe what the content is about ("descriptive metadata"), as well as assign properties like who created the content, when, in what language, and for which audience ("administrative metadata").

A taxonomy is a hierarchical structure that organizes concepts and controls vocabulary. Taxonomies allow organizations to create and centrally manage important terms that can be applied to content as metadata. For example, a telecommunications manufacturer might have a taxonomy that includes concepts such as product categories (Mobile Phones, Wireless Routers, and so on), industries (Healthcare, Utilities, Transportation, and so on), or product models.

Once applied, this metadata and taxonomy can be leveraged by a search application to help users find and use content. Search engines can use taxonomy to organize search results in meaningful ways, such as refining search based upon certain properties ("faceted search") and suggesting related searches based upon relationships between search terms and other concepts in the taxonomy.

It is a natural fit — DITA and taxonomy. DITA creates a multitude of reusable components, and taxonomy helps describe and organize the components so that they may be readily found and reused by content authors and users.

Taxonomies and descriptive metadata are also a natural fit, since metadata-based search improves the findability of content objects.

DITA Support for Metadata

Compared to other XML standards, DITA provides a relatively rich and extensible framework for embedding metadata directly within the XML objects themselves. The embedded metadata can be used by processing tools like the publishing tools in the DITA Open Toolkit (DOTK) to conditionally publish content or to create metadata in the final outputs, like HTML.

DITA objects, both topics and maps, have a prolog section in which metadata can be specified. Within the prolog, the metadata section can define metadata about the topic itself such as the intended audience, the platform (for defining the applicability of the topic to specific hardware or operating systems), and so on. This metadata can be used for conditional publishing. For example, you can automate the production of a Linux version of your documentation by only outputting topics and maps that set platform to "Linux" in the metadata.
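The Linux example can be sketched in Python as follows. This only illustrates the idea of selecting topics by the platform recorded in their prolog; it is not the filtering mechanism of the DITA Open Toolkit. The two sample topics, the product name, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

def make_topic(topic_id, platform):
    """Return a small sample topic whose prolog names one platform."""
    return f"""<topic id="{topic_id}">
  <title>Installing the product</title>
  <prolog>
    <metadata>
      <audience type="administrator"/>
      <prodinfo>
        <prodname>Example Product</prodname>
        <platform>{platform}</platform>
      </prodinfo>
    </metadata>
  </prolog>
  <body><p>Run the installer.</p></body>
</topic>"""

def platforms(topic_xml):
    """Collect the platform values found in a topic."""
    root = ET.fromstring(topic_xml)
    return {p.text.strip() for p in root.iter("platform") if p.text}

def select_for_platform(topics, platform):
    """Return the ids of the topics that apply to the given platform."""
    selected = []
    for topic_xml in topics:
        if platform in platforms(topic_xml):
            selected.append(ET.fromstring(topic_xml).get("id"))
    return selected

topics = [make_topic("install_linux", "Linux"),
          make_topic("install_windows", "Windows")]
print(select_for_platform(topics, "Linux"))
```

Only the selected topics would then be passed on to the publishing step for the Linux deliverable.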

DITA objects can also embed administrative metadata about the author, copyright holder, source, publisher, and so on. Metadata can also contain descriptive keywords for the topic or map. Keywords or index terms are output to HTML or XHTML as metadata keywords to support search engines. Authors can also define index terms for the generation of back-of-book indices.

DITA also enables users to define custom metadata fields within the "othermeta" element. Like keywords, metadata defined as "othermeta" are output as HTML metadata elements but ignored for other types of output like PDF. Metadata is a powerful tool in helping to manage and publish DITA content.
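As a rough sketch of what that HTML output step does, the Python below reads keywords and "othermeta" entries from a prolog and turns them into HTML meta tags. The sample prolog, the "review-status" field, and the function name are invented; real publishing tools produce this output through their own transforms.

```python
import xml.etree.ElementTree as ET
from html import escape

# Sample prolog with two keywords and one custom metadata field
PROLOG = """<prolog>
  <metadata>
    <keywords>
      <keyword>installation</keyword>
      <keyword>setup</keyword>
    </keywords>
    <othermeta name="review-status" content="approved"/>
  </metadata>
</prolog>"""

def html_meta_tags(prolog_xml):
    """Build HTML meta tags from keywords and othermeta elements."""
    root = ET.fromstring(prolog_xml)
    tags = []
    keywords = [k.text.strip() for k in root.iter("keyword") if k.text]
    if keywords:
        joined = escape(", ".join(keywords))
        tags.append(f'<meta name="keywords" content="{joined}">')
    for other in root.iter("othermeta"):
        name = escape(other.get("name", ""))
        content = escape(other.get("content", ""))
        tags.append(f'<meta name="{name}" content="{content}">')
    return tags

for tag in html_meta_tags(PROLOG):
    print(tag)
```

The custom field ends up next to the standard keywords tag, so a search application can read both in the same way.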

Dynamic Publishing of Content

A major benefit of DITA is creating content that is media-independent. It also enables content objects to be organized by DITA maps, so that content can be recombined and re-sequenced into different deliverables. DITA maps provide flexibility.

Dynamic publishing lets content be chosen and presented to meet the unique needs of a user or situation. To best illustrate dynamic publishing, let’s compare it with static publishing of a help system.

In a statically published help system, the hierarchy of topics is fixed by the author and the selection of content is limited to what is in the DITA map at publish time. All of the related topics are manually linked. If an author wants to add a related topic, the author needs to manually add the link (or update the related-links table) and republish. The publishing process creates a deliverable that—while interactive—is static with respect to its contents and the relationships among them.

To create the same help system with dynamic publishing, the author would publish his/her content to a server, but he/she would not create the structure and relationships between topics at publish-time. Instead, a taxonomy would specify the relationship between concepts and properties that are defined in metadata. The relationships among topics are generated at run-time, based upon metadata on the topics. The richer the metadata and the more complete the taxonomy, the more sophisticated the user experience.

You may have experienced faceted search on consumer web sites, where you can refine search results by selecting specific values for different attributes, such as the number of megapixels for a camera. This experience is driven by metadata. With rich metadata on DITA content, we can create very sophisticated electronic content browsers, where metadata-based search creates browser-like user experiences.
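A faceted interface over topic metadata can be sketched in Python with two small functions: one that narrows the result set by the values a user has selected, and one that counts the remaining values for a facet. The topic records and facet names below are invented for illustration.

```python
from collections import Counter

# Metadata for a few topics, as it might be held in a search index
TOPICS = [
    {"id": "t1", "platform": "Linux", "audience": "administrator"},
    {"id": "t2", "platform": "Linux", "audience": "user"},
    {"id": "t3", "platform": "Windows", "audience": "administrator"},
]

def refine(topics, **selected):
    """Keep the topics that match every selected facet value."""
    return [t for t in topics
            if all(t.get(facet) == value for facet, value in selected.items())]

def facet_counts(topics, facet):
    """Count how many topics carry each value of a facet."""
    return Counter(t[facet] for t in topics if facet in t)

linux_topics = refine(TOPICS, platform="Linux")
print([t["id"] for t in linux_topics])
print(facet_counts(linux_topics, "audience"))
```

After the user picks "Linux", the counts for the audience facet show how many topics each further selection would return.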

Best Practices

Start by identifying all your taxonomy use cases. You will be using taxonomy not only for authors to search content objects for reuse but also potentially for serving up content to users dynamically or in a faceted interface. These perspectives will provide you with the framework for your taxonomy.

Reuse existing vocabulary. Many organizations already use controlled vocabularies for some metadata fields such as organization, audience, platform, and product. Look to existing sources for tagging your content such as hierarchical product or system models (from engineering), or hierarchical task models (from instructional/task analysis from the training organization) as places to start building hierarchical descriptive taxonomies.

Authors are the best people to apply descriptive metadata. After all, they do the analysis to determine what content was required in the first place, so they have the best context for classifying it. However, don’t expect authors to tag a lot: automate tagging when possible, especially for administrative metadata (author, organization, creation date, language).

Leverage the technology. Many content management systems can integrate third-party classification servers for automating descriptive metadata. These servers can automatically apply metadata from a taxonomy or controlled vocabulary when content topics are checked-in, then automatically populate subject metadata fields in the CMS. The metadata can in turn be reviewed and manually adjusted by authors. This metadata can be embedded into your DITA content for use in conditional publishing or to generate HTML tags in the final output to support search or dynamic publishing.

The next frontier of DITA adoption is leveraging semantic technologies (taxonomies, ontologies and text analytics) to automate the delivery of targeted content. For example, a service incident from a customer is automatically matched with the appropriate response, which is authored and managed as a DITA topic.

Wednesday, November 30, 2011

Metadata Schemes

In my last post, I mentioned that there are three types of metadata: descriptive, structural, and administrative. Today, I am going to talk more about metadata schemes.

Many different metadata schemes are being developed in a variety of user environments and disciplines. I will discuss the most common ones in this post.

Dublin Core

The Dublin Core Metadata Element Set arose from discussions at a 1995 workshop sponsored by OCLC and the National Center for Supercomputing Applications (NCSA). As the workshop was held in Dublin, Ohio, the element set was named the Dublin Core. The continuing development of the Dublin Core and related specifications is managed by the Dublin Core Metadata Initiative (DCMI).

The original objective of the Dublin Core was to define a set of elements that could be used by authors to describe their own Web resources. Faced with a proliferation of electronic resources and the inability of the library profession to catalog all these resources, the goal was to define a few elements and some simple rules that could be applied by noncatalogers. The original 13 core elements were later increased to 15: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights.

Because of its simplicity, the Dublin Core element set is now used by many outside the library community - researchers, museums, music collectors to name only a few. There are hundreds of projects worldwide that use the Dublin Core either for cataloging or to collect data.

Meanwhile, the Dublin Core Metadata Initiative has expanded beyond simply maintaining the Dublin Core Metadata Element Set into an organization that describes itself as "dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for discovery systems."

The Text Encoding Initiative (TEI)

The Text Encoding Initiative is an international project to develop guidelines for marking up electronic texts such as novels, plays, and poetry, primarily to support research in the humanities.

The TEI also specifies a header portion, embedded in the resource, that consists of metadata about the work. The TEI header, like the rest of the TEI, is defined as an SGML DTD (Document Type Definition), a set of tags and rules defined in SGML syntax that describes the structure and elements of a document. This SGML mark-up becomes part of the electronic resource itself.

Metadata Encoding and Transmission Standard (METS)

The Metadata Encoding and Transmission Standard (METS) was developed to fill the need for a standard data structure for describing complex digital library objects. METS is an XML Schema for creating XML document instances that express the structure of digital library objects, the associated descriptive and administrative metadata, and the names and locations of the files that comprise the digital object.
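To give a feel for how these parts fit together, here is a hedged Python sketch that assembles a small METS-like document for a scanned book: a descriptive metadata section, a file section with file locations, and a structural map that orders the pages. The element and attribute names follow METS, but the helper names, IDs, title, and URLs are assumptions, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)

def q(ns, tag):
    """Build a namespace-qualified name for ElementTree."""
    return f"{{{ns}}}{tag}"

def build_mets(title, page_urls):
    root = ET.Element(q(METS, "mets"))

    # Descriptive metadata: a Dublin Core title wrapped inside the document
    dmd = ET.SubElement(root, q(METS, "dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q(METS, "mdWrap"), {"MDTYPE": "DC"})
    data = ET.SubElement(wrap, q(METS, "xmlData"))
    ET.SubElement(data, q(DC, "title")).text = title

    # File section: names and locations of the files
    group = ET.SubElement(ET.SubElement(root, q(METS, "fileSec")),
                          q(METS, "fileGrp"))

    # Structural map: the order of the pages within the book
    struct = ET.SubElement(root, q(METS, "structMap"), {"TYPE": "physical"})
    book = ET.SubElement(struct, q(METS, "div"),
                         {"TYPE": "book", "DMDID": "DMD1"})

    for order, url in enumerate(page_urls, start=1):
        file_id = f"FILE{order}"
        file_el = ET.SubElement(group, q(METS, "file"), {"ID": file_id})
        ET.SubElement(file_el, q(METS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK, "href"): url})
        page = ET.SubElement(book, q(METS, "div"),
                             {"TYPE": "page", "ORDER": str(order)})
        ET.SubElement(page, q(METS, "fptr"), {"FILEID": file_id})
    return root

mets = build_mets("Sample Book", ["http://example.org/p1.tif",
                                  "http://example.org/p2.tif"])
print(ET.tostring(mets, encoding="unicode"))
```

Each page in the structural map points back to a file by its ID, which is how one document ties together the structure, the descriptive metadata, and the file locations.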

Next time: architecture for authoring, producing, and delivering information.

Monday, November 28, 2011

Metadata

What is metadata? Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use, or manage information resources. Metadata is often called data about data or information about information. It is used to describe data.

For example, a digital image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document's metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.

There are three main types of metadata:

Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, keywords.

Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters.

Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets of administrative metadata; two of them are sometimes listed as separate metadata types:

Rights management metadata, which deals with intellectual property rights.

Preservation metadata, which contains information needed to archive and preserve a resource.

Metadata can describe resources at any level. It can describe a collection, a single resource, or a component which is a part of a larger resource (for example, a photograph in an article).

Metadata can be embedded in a digital object or it can be stored separately.

Metadata is often embedded in HTML documents and in the headers of image files.

Storing metadata with the object it describes ensures the metadata will not be lost, eliminates problems of linking between data and metadata, and helps to ensure that the metadata and object will be updated together.

However, it is impossible to embed metadata in some types of objects, for example, artifacts. Also, storing metadata separately can simplify the management of the metadata itself and facilitate search and retrieval. Therefore, metadata is commonly stored in a database system and linked to the objects described.
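Storing metadata separately and linking it to the objects it describes can be sketched with a small database in Python. The table names, column names, and sample records below are assumptions for illustration, and an in-memory SQLite database stands in for whatever database system an organization actually uses.

```python
import sqlite3

def open_store():
    """Create an in-memory store with objects and their linked metadata."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE objects (
            id INTEGER PRIMARY KEY,
            location TEXT NOT NULL        -- where the described object lives
        );
        CREATE TABLE metadata (
            object_id INTEGER NOT NULL REFERENCES objects(id),
            element TEXT NOT NULL,        -- e.g. title, creator
            value TEXT NOT NULL
        );
    """)
    return conn

def add_object(conn, location, fields):
    """Register an object and the metadata statements that describe it."""
    cur = conn.execute("INSERT INTO objects (location) VALUES (?)", (location,))
    object_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO metadata (object_id, element, value) VALUES (?, ?, ?)",
        [(object_id, element, value) for element, value in fields],
    )
    return object_id

def find_locations(conn, element, value):
    """Find objects by a metadata value and return where they are stored."""
    rows = conn.execute(
        "SELECT o.location FROM objects o "
        "JOIN metadata m ON m.object_id = o.id "
        "WHERE m.element = ? AND m.value = ? ORDER BY o.id",
        (element, value),
    )
    return [row[0] for row in rows]

conn = open_store()
add_object(conn, "shelf 12, box 3", [("title", "Bronze Vase"), ("type", "artifact")])
add_object(conn, "/scans/letter-0042.tif", [("title", "Letter"), ("type", "image")])
print(find_locations(conn, "type", "artifact"))
```

The physical artifact cannot carry embedded metadata, but its database record can still be searched and points to where the object is kept.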

More about metadata next time...
