Saturday, September 6, 2014

Managed Metadata in SharePoint - Part Two

In part one of this post, I described using metadata in SharePoint. In this part two, I will describe metadata management.

Managed metadata makes it easier for Term Store Administrators to maintain and adapt your metadata as business needs evolve. You can update a term set easily. And, new or updated terms automatically become available when you associate a Managed Metadata column with that term set. For example, if you merge multiple terms into one term, content that is tagged with these terms is automatically updated to reflect this change. You can specify multiple synonyms (or labels) for individual terms. If your site is multilingual, you can also specify multilingual labels for individual terms.

Managing metadata

Managing metadata effectively requires careful thought and planning. Think about the kind of information that you want to manage the content of lists and libraries, and think about the way that the information is used in the organization. You can create term sets of metadata terms for lots of different information.

For example, you might have a single content type for a document. Each document can have metadata that identifies many of the relevant facts about it, such as these examples:
  • Document purpose - is it a sales proposal? An engineering specification? A Human Resources procedure?
  • Document author, and names of people who changed it
  • Date of creation, date of approval, date of most recent modification
  • Department responsible for any budgetary implications of the document
  • Audience
Activities that are involved with managing metadata:
  • Planning and configuring
  • Managing terms, term sets, and groups
  • Specifying properties for metadata
Planning and configuring managed metadata

Your organization may want to do careful planning before you start to use managed metadata. The amount of planning that you must do depends on how formal your taxonomy is. It also depends on how much control that you want to impose on metadata.

If you want to let users help develop your taxonomy, then you can just have users add keywords to items, and then organize these into term sets as necessary.

If your organization wants to use managed term sets to implement formal taxonomies, then it is important to involve key stakeholders in planning and development. After the key stakeholders in the organization agree upon the required term sets, you can use the Term Store Management Tool to import or create your term sets. You can also use the tool to manage the term sets as users start to work with the metadata. If your web application is configured correctly, and you have the appropriate permissions, you can go to the Term Store Management Tool by following these steps:

1. Select Settings and then choose Site Settings.
2. Select Term store management under Site Administration.

Managing terms, term sets, and groups

The Term Store Management Tool provides a tree control that you can use to perform most tasks. Your user role for this tool determines the tasks that you can perform. To work in the Term Store Management Tool, you must be a Farm Administrator or a Term Store Administrator. Or, you can be a designated Group Manager or Contributor for term sets.

To take actions on an item in the hierarchy, follow these steps:

1. Point to the name of the Managed Metadata Service application, group, term set, or term that you want to change, and then click the arrow that appears.
2. Select the actions that you want from the menu.

For example, if you are a Term Store Administrator or a Group Manager you can create, import, or delete term sets in a group. Term set contributors can create new term sets.

Properties for terms and term sets

At each level of the hierarchy, you can configure specific properties for a group, term set, or term by using the properties pane in the Term Store Management Tool. For example, if you are configuring a term set, you can specify information such as Name, Description, Owner, Contact, and Stakeholders in pane available on the General tab. You can also specify whether you want a term set to be open or closed to new submissions from users. Or, you can choose the Intended Use tab, and specify whether the term set should be available for tagging or site navigation.

Managed Metadata in SharePoint - Part One

Using metadata in SharePoint makes it easier to find content items. Metadata can be managed centrally in SharePoint and can be organized in a way that makes sense in your business. When the content across sites in an organization has consistent metadata, it is easier to find business information and data by using search. Search features such as the refinement panel, which displays on the left-hand side of the search results page, enable users to filter search results based on metadata.

SharePoint metadata management supports a range of approaches to metadata, from formal taxonomies to user-driven folksonomies. You can implement formal taxonomies through managed terms and term sets. You can also use enterprise keywords and social tagging, which enable site users to tag content with keywords that they choose. SharePoint enable organizations to combine the advantages of formal, managed taxonomies with the dynamic benefits of social tagging in customized ways.

Metadata navigation enables users to create views of information dynamically, based on specific metadata fields. Then, users can locate libraries by using folders or by using metadata pivots, and refine the results by using additional key filters.

You can choose how much structure and control to use with metadata, and the scope of control and structure. For example:
  • You can apply control globally across sites, or make local to specific sites.
  • You can configure term sets to be closed or open to user contributions.
  • You can choose to use enterprise keywords and social tagging with managed terms, or not.
The managed metadata features in SharePoint enable you to control how users add metadata to content. For example, by using term sets and managed terms, you can control which terms users can add to content, and who can add new terms. You can also limit enterprise keywords to a specific list by configuring the keywords term set as closed.

When the same terms are used consistently across sites, it is easier to build robust processes or solutions that rely on metadata. Additionally, it is easier for site users to apply metadata consistently to their content.

Metadata Terms

A term is a specific word or phrase that you associated with an item on a SharePoint site. A term has a unique ID and it can have many text labels (synonyms). If you work on a multilingual site, the term can have labels in different languages.

There are two types of terms:

Managed terms are terms that are pre-defined. Term Store administrators organize managed terms into a hierarchical term set.

Enterprise keywords are words or phrases that users add to items on a SharePoint site. The collection of enterprise keywords is known as a keywords set. Typically, users can add any word or phrase to an item as a keyword. This means that you can use enterprise keywords for folksonomy-style tagging. Sometimes, Term Store administrators move enterprise keywords into a specific managed term set. When they are part of a managed term set, keywords become available in the context of that term set.

Term Set

A Term Set is a group of related terms. Terms sets can have different scope, depending on where you create the term set.

Local term sets are created within the context of a site collection, and are available for use (and visible) only to users of that site collection. For example, when you create a term set for a metadata column in a list or library, then the term set is local. It is available only in the site collection that contains this list or library. For example, a media library might have a metadata column that shows the kind of media (diagram, photograph, screen shot, video, etc.). The list of permitted terms is relevant only to this library, and available for use in the library.

Global term sets are available for use across all sites that subscribe to a specific Managed Metadata Service application. For example, an organization might create a term set that lists names of business units in the organization, such as Human Resources, Marketing, Information Technology, and so on.

In addition, you can configure a term set as closed or open. In a closed term set, users can't add new terms unless they have appropriate permissions. In an open term set, users can add new terms in a column that is mapped to the term set.

Group

Group is a security term. With respect to managed metadata, a group is a set of term sets that share common security requirements. Only users who have contributor permissions for a specific group can manage term sets that belong to the group or create new term sets within it. Organizations should create groups for term sets that will have unique access or security needs.

Term Store Management Tool

The Term Store Management Tool is the tool that people who manage taxonomies use to create or manage term sets and the terms within them. The Term Store Management tool displays all the global term sets and any local term sets available for the site collection from which you access the Term Store Management Tool.

Managed Metadata column

A Managed Metadata column is a special kind of column that you can add to lists or libraries. It enables site users to select terms from a specific term set. A Managed Metadata column can be mapped to an existing term set, or you can create a local term set specifically for the column.

Enterprise Keywords column

The enterprise Keywords column is a column that you can add to content types, lists, or libraries to enable users to tag items with words or phrases that they choose. By default, it is a multi-value column. When users type a word or phrase into the column, SharePoint presents type-ahead suggestions. Type-ahead suggestions might include items from managed term sets and the Keywords term set. Users can select an existing value, or enter a new term.

Social Tags

Social tags are words or phrases that site users can apply to content to help them categorize information in ways that are meaningful to them. Social tagging is useful because it helps site users to improve the discoverability of information on a site. Users can add social tags to information on a SharePoint site and to URLs outside a SharePoint site.

A social tag contains pointers to three types of information:
  • A user identity
  • An item URL
  • A term
When you add a social tag to an item, you can specify whether you want to make your identity and the item URL private. However, the term part of the social tag is always public, because it is stored in the Term Store.

When you create a social tag, you can choose from a set of existing terms or enter something new. If you select an existing term, your social tag contains a pointer to that term.

If, instead, you enter a new term, SharePoint creates a new keyword for it in the keywords term set. The new social tag points to this term. In in this manner, social tags support folksonomy-based tagging. Additionally, when users update an enterprise Keywords or Managed Metadata column, SharePoint can create social tags automatically. These terms then become visible as tags in newsfeeds, tag clouds, or My Site profiles.

List or library owners can enable or disable metadata publishing by updating the Enterprise Metadata and Keywords Settings for a list or library.

In the second part of this post, I will describe managing SharePoint metadata.

Sunday, August 31, 2014

Role of Automatic Classification in Information Governance

Defensible disposal of unstructured content is a key outcome of sound information governance programs. A sound approach to records management as part of the organization’s information governance strategy is rife with challenges.

Some of the challenges are explosive content volumes, difficulty with accurately determining what content is a business record comparing to transient or non-business related content, eroding IT budgets due to mounting storage costs, and the need to incorporate content from legacy systems or merger and acquisition activity.

Managing the retention and disposition of information reduces litigation risk, it reduces discovery and storage costs, and it ensures organizations maintain regulatory compliance. In order for content to be understood and determined why it must be retained, for how long it must be retained, and when it can be dispositioned, it needs to be classified.

However, users see the process of sorting records from transient content as intrusive, complex, and counterproductive. On top of this, the popularity of mobile devices and social media applications has effectively fragmented the content authoring and has eliminated any chance of building consistent classification tools into end-user applications.

If classification is not being carried out, there are serious implications when asked by regulators or auditors to provide reports to defend the organization’s records and retention management program.

Records managers also struggle with enforcing policies that rely on manual, human-based classification. Accuracy and consistency in applying classification is often inadequate when left up to users, the costs in terms of productivity loss are high, and these issues, in turn, result in increased business and legal risk as well as the potential for the entire records management program to quickly become unsustainable in terms of its ability to scale.

A solution to overcome this challenge is automatic classification. It eliminates the need for users to manually identify records and apply necessary classifications. By taking the burden of classification off the end-user, records managers can improve consistency of classification and better enforce rules and policies.

Auto-Classification makes it possible for records managers to easily demonstrate a defensible approach to classification based on statistically relevant sampling and quality control. Consequently, this minimizes the risk of regulatory fines and eDiscovery sanctions.

In short, it provides a non-intrusive solution that eliminates the need for business users to sort and classify a growing volume of low-touch content, such as email and social media, while offering records managers and the organization as a whole the ability to establish a highly defensible, completely transparent records management program as part of their broader information governance strategy.

Benefits of Automatic Classification for Information Governance

Apply records management classifications as part of a consistent, programmatic component of a sound information governance program to:

Reduce
  • Litigation risk
  • Storage costs
  • eDiscovery costs
Improve
  • Compliance
  • Security
  • Responsiveness
  • User productivity and satisfaction
Address
  • The fundamental difficulties in applying classifications to high volume, low touch content such as legacy content, email and social media content.
  • Records manager and compliance officer concerns about defensibility and transparency.
Features
  • Automated Classification: automate the classification of content in line with existing records management classifications.
  • Advanced Techniques: classification process based on a hybrid approach that combines machine learning, rules, and content analytics.
  • Flexible Classification: ability to define classification rules using keywords or metadata.
  • Policy-Driven Configuration: ability to configure and optimize the classification process with an easy "step-by-step" tuning guide.
  • Advanced Optimization Tools: reports make it easy to examine classification results, identify potential accuracy issues, and then fix those issues by leveraging the provided "optimization" hints.
  • Sophisticated Relevancy and Accuracy Assurance: automatic sampling and bench marking with a complete set of metrics to assess the quality of the classification process.
  • Quality Assurance : advanced reports on a statistically relevant sample to review and code documents that have been automatically classified to manually assess the quality of the classification results when desired.

Wednesday, August 13, 2014

Dublin Core Metadata

The word "metadata" means "data about data". Metadata describes a context for objects of interest such as document files, images, audio and video files. It can also be called resource description. As a tradition, resource description dates back to the earliest archives and library catalogs. The modern "metadata" field that gave rise to Dublin Core and other recent standards emerged with the Web revolution of the mid-1990s.

The Dublin Core Schema is a small set of vocabulary terms that can be used to describe different resources.

Dublin Core Metadata may be used for multiple purposes, from simple resource description, to combining metadata vocabularies of different metadata standards, to providing inter-operability for metadata vocabularies in the Linked data cloud and Semantic web implementations.

"Dublin" refers to Dublin, Ohio, USA where the schema originated during the 1995 invitational OCLC/NCSA Metadata Workshop hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA). "Core" refers to the metadata terms as broad and generic being usable for describing a wide range of resources. The semantics of Dublin Core were established and are maintained by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, museums, and other related fields of scholarship and practice.

The Dublin Core Metadata Initiative (DCMI) provides an open forum for the development of inter-operable online metadata standards for a broad range of purposes and of business models. DCMI's activities include consensus-driven working groups, global conferences and workshops, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.

In 2008, DCMI separated from OCLC and incorporated as an independent entity. Any and all changes that are made to the Dublin Core standard are reviewed by a DCMI Usage Board within the context of a DCMI Namespace Policy. This policy describes how terms are assigned and also sets limits on the amount of editorial changes allowed to the labels, definitions, and usage comments.

Levels of the Standard

The Dublin Core standard originally includes two levels: Simple and Qualified. Simple Dublin Core comprised 15 elements; Qualified Dublin Core included three additional elements (Audience, Provenance and RightsHolder), as well as a group of element refinements (also called qualifiers) that could refine the semantics of the elements in ways that may be useful in resource discovery. Since 2012 the two have been incorporated into the DCMI Metadata Terms as a single set of terms using the Resource Description Framework (RDF).

The original Dublin Core Metadata Element Set which is the Simple level consists of 15 metadata elements:

Title
Creator
Subject
Description
Publisher
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights

Each Dublin Core element is optional and may be repeated. The DCMI has established standard ways to refine elements and encourage the use of encoding and vocabulary schemes. There is no prescribed order in Dublin Core for presenting or using the elements. The Dublin Core became ISO 15836 standard in 2006 and is used as a base-level data element set for the description of learning resources in the ISO/IEC 19788-2.

Qualified Dublin Core

Subsequent to the specification of the original 15 elements, an ongoing process to develop terms extending or refining the Dublin Core Metadata Element Set (DCMES) began. The additional terms were identified. Elements refinements make the meaning of an element narrower or more specific. A refined element shares the meaning of the unqualified element, but with a more restricted scope.

In addition to element refinements, Qualified Dublin Core includes a set of recommended encoding schemes, designed to aid in the interpretation of an element value. These schemes include controlled vocabularies and formal notations or parsing rules.

Syntax

Syntax choices for Dublin Core metadata depends on a number of variables, and "one size fits all" forms rarely apply. When considering an appropriate syntax, it is important to note that Dublin Core concepts and semantics are designed to be syntax independent and are equally applicable in a variety of contexts, as long as the metadata is in a form suitable for interpretation both by machines and by human beings.

The Dublin Core Abstract Model provides a reference model against which particular Dublin Core encoding guidelines can be compared, independent of any particular encoding syntax. Such a reference model allows users to gain a better understanding of descriptions they are trying to encode and facilitates the development of better mappings and translations between different syntax.

I will describe some applications of Dublin Core in my future posts.

Tuesday, July 22, 2014

Hadoop and Big Data

During last ten years the volume and diversity of digital information grew at unprecedented rates. Amount of information is doubling every 18 months, and unstructured information volumes grow six times faster than structured.

Big data is the nowadays trend. Big data has been defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella to address the big data issue. Doug Cutting named it after his son's toy elephant It was originally developed for the Nutch search engine project. Nutch is an effort to build an open source web search engine based on Lucene and Java for the search and index component.

As of 2013, Hadoop adoption is widespread. A number of companies offer commercial implementations or support for Hadoop. For example, more than half of the Fortune 50 use Hadoop. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing.

Ventana Research, a benchmark research and advisory services firm published the results of its groundbreaking survey on enterprise adoption of Hadoop to manage big data. According to this survey:
  • More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
  • More than twice as many Hadoop users report being able to create new products and services and enjoy costs savings beyond those using other platforms; over 82% benefit from faster analysis and better utilization of computing resources.
  • 87% of Hadoop users are performing or planning new types of analysis with large scale data.
  • 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.
  • 63% of organizations use Hadoop in particular to work with unstructured data such as logs and event data.
  • More than two-thirds of Hadoop users perform advanced analysis such as data mining or algorithm development and testing.
Today, Hadoop is being used as a:
  • Staging layer: the most common use of Hadoop in enterprise environments is as “Hadoop ETL” — pre-processing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
  • Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
  • Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
  • Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce.
Hadoop is particularly useful when:
  • Complex information processing is needed.
  • Unstructured data needs to be turned into structured data.
  • Queries can’t be reasonably expressed using SQL.
  • Heavily recursive algorithms.
  • Complex but parallelizable algorithms needed, such as geo-spatial analysis or genome sequencing.
  • Machine learning.
  • Data sets are too large to fit into database RAM, discs, or require too many cores (10’s of TB up to PB).
  • Data value does not justify expense of constant real-time availability, such as archives or special interest information, which can be moved to Hadoop and remain available at lower cost.
  • Results are not needed in real time.
  • Fault tolerance is critical.
  • Significant custom coding would be required to handle job scheduling.
Does Hadoop and Big Data Solve All Our Data Problems?

Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.