Galaxy Consulting Blog: Taxonomy

Wednesday, February 22, 2012

DITA, Metadata, and Taxonomy

Component-oriented content creation enables more efficient content reuse and dynamic publishing in more languages at a lower cost. It requires XML authoring.

Research shows that organizations that use XML authoring are more mature than their peers with respect to the adoption of best practices for search and metadata. However, the use of native DITA (Darwin Information Typing Architecture) metadata capabilities is rare, and many organizations are also missing out on opportunities to use taxonomy for reuse and improved findability.

In this post, I am going to describe metadata capabilities within DITA, discuss two major benefits that can be achieved by using descriptive metadata and taxonomy, and recommend some best practices for getting started with metadata for component-oriented content.

Finding content in your file system or content repository is hard enough when you’ve got simple text documents to deal with. When you are using DITA and other component-oriented XML standards, you increase the difficulty by two or three orders of magnitude, because you’re looking for smaller needles in bigger haystacks. Having thousands of media-independent content objects that can be shared and reused across multiple deliverables allows you to create more sophisticated knowledge products, but it definitely poses a challenge in findability for content authors.

Among its many features for content reuse, DITA provides content creators with a facility for tagging content objects with metadata. Metadata (data about the data) lets content authors and others who manage content describe what the content is about ("descriptive metadata"), as well as assign properties like who created the content, when, in what language, and for which audience ("administrative metadata").

A taxonomy is a hierarchical structure that organizes concepts and controls vocabulary. Taxonomies allow organizations to create and centrally manage important terms that can be applied to content as metadata. For example, a telecommunications manufacturer might have a taxonomy that includes concepts such as product categories (Mobile Phones, Wireless Routers, and so on), industries (Healthcare, Utilities, Transportation, and so on), or product models.

Once applied, this metadata and taxonomy can be leveraged by a search application to help users find and use content. Search engines can use taxonomy to organize search results in meaningful ways, such as refining search based upon certain properties ("faceted search") and suggesting related searches based upon relationships between search terms and other concepts in the taxonomy.

DITA and taxonomy are a natural fit. DITA creates a multitude of reusable components, and taxonomy helps describe and organize the components so that they may be readily found and reused by content authors and users.

Taxonomies and descriptive metadata are also a natural fit, since metadata-based search improves the findability of content objects.

DITA Support for Metadata

Compared to other XML standards, DITA provides a relatively rich and extensible framework for embedding metadata directly within the XML objects themselves. The embedded metadata can be used by processing tools, such as the publishing tools in the DITA Open Toolkit (DITA-OT), to conditionally publish content or to create metadata in the final outputs, like HTML.

DITA objects, both topics and maps, have a prolog section in which metadata can be specified. Within the prolog, the metadata section can define metadata about the topic itself such as the intended audience, the platform (for defining the applicability of the topic to specific hardware or operating systems), and so on. This metadata can be used for conditional publishing. For example, you can automate the production of a Linux version of your documentation by only outputting topics and maps that set platform to "Linux" in the metadata.
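The Linux example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the DITA Open Toolkit implements filtering; the topic ids, titles, product name, and the helper select_for_platform are all invented for the example, and the sketch assumes that the platform is recorded in the prolog under metadata/prodinfo/platform.

```python
import xml.etree.ElementTree as ET

# Two hypothetical DITA topics; ids, titles, and product names are invented.
TOPICS = [
    """<topic id="install_linux">
  <title>Installing on Linux</title>
  <prolog>
    <metadata>
      <audience type="administrator"/>
      <prodinfo>
        <prodname>ExampleProduct</prodname>
        <vrmlist><vrm version="1"/></vrmlist>
        <platform>Linux</platform>
      </prodinfo>
    </metadata>
  </prolog>
  <body><p>...</p></body>
</topic>""",
    """<topic id="install_windows">
  <title>Installing on Windows</title>
  <prolog>
    <metadata>
      <prodinfo>
        <prodname>ExampleProduct</prodname>
        <vrmlist><vrm version="1"/></vrmlist>
        <platform>Windows</platform>
      </prodinfo>
    </metadata>
  </prolog>
  <body><p>...</p></body>
</topic>""",
]


def select_for_platform(topics, platform):
    """Return the ids of topics whose prolog metadata names the given platform."""
    selected = []
    for xml_text in topics:
        root = ET.fromstring(xml_text)
        # Collect every <platform> value found in the topic's prolog metadata.
        found = {
            p.text.strip()
            for p in root.findall("./prolog/metadata/prodinfo/platform")
            if p.text
        }
        if platform in found:
            selected.append(root.get("id"))
    return selected


print(select_for_platform(TOPICS, "Linux"))
```

Topics that do not set a platform at all are left out here, which matches the "only outputting topics that set platform to Linux" rule described above; a real publishing pipeline might instead treat them as applicable to every platform.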

DITA objects can also embed administrative metadata about the author, copyright holder, source, publisher, and so on. Metadata can also contain descriptive keywords for the topic or map. Keywords or index terms are output to HTML or XHTML as metadata keywords to support search engines. Authors can also define index terms for the generation of back-of-book indices.

DITA also enables users to define custom metadata fields within the "othermeta" element. Like keywords, metadata defined as "othermeta" are output as HTML metadata elements but ignored for other types of output like PDF. Metadata is a powerful tool in helping to manage and publish DITA content.
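As a rough sketch of what that HTML output amounts to, the following Python turns keywords and othermeta entries from a prolog into HTML meta tags. The metadata values and the function html_meta_tags are assumptions made up for illustration; a real publishing tool will have its own rules for naming and formatting the tags.

```python
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical prolog; the keyword and othermeta values are invented.
PROLOG = """<prolog>
  <metadata>
    <keywords>
      <keyword>wireless router</keyword>
      <keyword>setup</keyword>
    </keywords>
    <othermeta name="product-line" content="Home Networking"/>
  </metadata>
</prolog>"""


def html_meta_tags(prolog_xml):
    """Build HTML <meta> tags from the keywords and othermeta in a prolog."""
    root = ET.fromstring(prolog_xml)
    tags = []

    # All keywords are joined into a single "keywords" meta tag.
    keywords = [k.text.strip() for k in root.iter("keyword") if k.text]
    if keywords:
        content = escape(", ".join(keywords), quote=True)
        tags.append('<meta name="keywords" content="%s">' % content)

    # Each othermeta element becomes its own name/content meta tag.
    for other in root.iter("othermeta"):
        name = escape(other.get("name", ""), quote=True)
        content = escape(other.get("content", ""), quote=True)
        tags.append('<meta name="%s" content="%s">' % (name, content))
    return tags


for tag in html_meta_tags(PROLOG):
    print(tag)
```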

Dynamic Publishing of Content

A major benefit of DITA is creating content that is media-independent. It also enables content objects to be organized by DITA maps, so that content can be recombined and re-sequenced into different deliverables. DITA maps provide flexibility.

Dynamic publishing lets content be chosen and presented to meet the unique needs of a user or situation. To best illustrate dynamic publishing, let’s compare it with static publishing of a help system.

In a statically published help system, the hierarchy of topics is fixed by the author and the selection of content is limited to what is in the DITA map at publish time. All of the related topics are manually linked. If an author wants to add a related topic, the author needs to manually add the link (or update the related-links table) and republish. The publishing process creates a deliverable that—while interactive—is static with respect to its contents and the relationships among them.

To create the same help system with dynamic publishing, authors would publish their content to a server but would not create the structure and relationships between topics at publish time. Instead, a taxonomy would specify the relationships between concepts and properties that are defined in metadata. The relationships among topics are generated at run time, based upon metadata on the topics. The richer the metadata and the more complete the taxonomy, the more sophisticated the user experience.
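One simple way to picture run-time relationships is to compute "related topics" from shared metadata instead of hand-maintained links. The sketch below is an assumption-laden illustration: the topic ids, the tags, and the function related_topics are invented, and it only counts shared terms rather than using relationships defined in a taxonomy.

```python
# Hypothetical topics and the taxonomy terms applied to them as metadata.
TOPIC_TAGS = {
    "configure_wifi": {"Wireless Routers", "Setup"},
    "reset_router": {"Wireless Routers", "Troubleshooting"},
    "pair_headset": {"Mobile Phones", "Setup"},
    "battery_care": {"Mobile Phones"},
}


def related_topics(topic_id, topic_tags, limit=3):
    """Rank other topics by how many metadata terms they share with topic_id."""
    own = topic_tags[topic_id]
    scored = []
    for other, tags in topic_tags.items():
        if other == topic_id:
            continue
        shared = len(own & tags)
        if shared:
            scored.append((shared, other))
    # Most shared terms first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]


print(related_topics("configure_wifi", TOPIC_TAGS))
```

Because the links are computed when the page is requested, tagging a new topic is enough to make it show up as related content; nothing has to be republished.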

You may have experienced faceted search on consumer web sites, where you can refine search results by selecting specific values for different attributes, such as the number of megapixels for a camera. This experience is driven by metadata. With rich metadata on DITA content, we can create very sophisticated electronic content browsers, where metadata-based search creates browser-like user experiences.
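The mechanics of faceted refinement are straightforward once content carries metadata. Here is a minimal sketch, assuming each content object is represented as a dictionary of metadata fields; the catalog entries and the helper names refine and facet_counts are hypothetical.

```python
from collections import Counter

# Hypothetical content objects described by metadata fields (facets).
CATALOG = [
    {"id": "t1", "product": "Mobile Phones", "audience": "administrator", "platform": "Linux"},
    {"id": "t2", "product": "Wireless Routers", "audience": "end user", "platform": "Linux"},
    {"id": "t3", "product": "Wireless Routers", "audience": "administrator", "platform": "Windows"},
]


def refine(items, **selected):
    """Keep only the items whose metadata matches every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count how many items carry each value of a facet, for display next to results."""
    return Counter(item[facet] for item in items if facet in item)


linux_only = refine(CATALOG, platform="Linux")
print([item["id"] for item in linux_only])
print(facet_counts(linux_only, "audience"))
```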

Best Practices

Start by identifying all your taxonomy use cases. You will be using taxonomy not only for authors to search content objects for reuse but also potentially for serving up content to users dynamically or in a faceted interface. These perspectives will provide you with the framework for your taxonomy.

Reuse existing vocabulary. Many organizations already use controlled vocabularies for some metadata fields such as organization, audience, platform, and product. Look to existing sources for tagging your content such as hierarchical product or system models (from engineering), or hierarchical task models (from instructional/task analysis from the training organization) as places to start building hierarchical descriptive taxonomies.

Authors are the best people to apply descriptive metadata. After all, they do the analysis to determine what content is required in the first place, so they have the best context for classifying it. However, don’t expect authors to tag a lot: automate tagging when possible, especially for administrative metadata (author, organization, creation date, language).

Leverage the technology. Many content management systems can integrate third-party classification servers for automating descriptive metadata. These servers can automatically apply metadata from a taxonomy or controlled vocabulary when content topics are checked in, then automatically populate subject metadata fields in the CMS. The metadata can in turn be reviewed and manually adjusted by authors. This metadata can be embedded into your DITA content for use in conditional publishing or to generate HTML tags in the final output to support search or dynamic publishing.
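To make the idea of automated tagging concrete, here is a deliberately naive sketch that suggests taxonomy terms by matching phrases from a controlled vocabulary against the text of a topic. Commercial classification servers use far more sophisticated text analytics; the vocabulary, the trigger phrases, and the function suggest_terms are assumptions for illustration only.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> phrases that trigger it.
VOCABULARY = {
    "Mobile Phones": ["mobile phone", "smartphone", "handset"],
    "Wireless Routers": ["wireless router", "wi-fi router"],
    "Healthcare": ["hospital", "clinic", "patient"],
}


def suggest_terms(text, vocabulary):
    """Suggest preferred terms whose trigger phrases appear in the text."""
    lowered = text.lower()
    suggestions = []
    for term, phrases in vocabulary.items():
        for phrase in phrases:
            # Match whole words, allowing a simple plural "s".
            if re.search(r"\b" + re.escape(phrase) + r"s?\b", lowered):
                suggestions.append(term)
                break
    return sorted(suggestions)


print(suggest_terms("Configure the wireless router used in each hospital ward.", VOCABULARY))
```

The output is a list of suggestions, which fits the workflow described above: the system proposes subject metadata at check-in, and the author reviews and adjusts it.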

The next frontier of DITA adoption is leveraging semantic technologies (taxonomies, ontologies and text analytics) to automate the delivery of targeted content. For example, a service incident from a customer is automatically matched with the appropriate response, which is authored and managed as a DITA topic.

Tuesday, November 22, 2011

Taxonomy Development Process

Guided by the key factors, we can define and follow a taxonomy development process that addresses business context, content, and users. The steps in creating a taxonomy are: assemble a team, define the scope, create, implement, test, and maintain.

Assemble a team

Successful taxonomy development requires both taxonomy expertise and in-depth knowledge of the corporate culture and content. Therefore, a taxonomy team should include subject matter experts or content experts from the business community. For small projects, the group may simply be part of a user focus group that is concentrating on the taxonomy task. Taxonomy interrelates with several aspects of web development, including website design, content management, and web search, so these roles should also be represented on the team. When assembling the team, common considerations are overall project scope, target audience, existing organizational taxonomy initiatives, and corporate culture.

Define scope

Answering the following questions helps to define the scope of the taxonomy:

Business context

What is the purpose of the taxonomy?

How is the taxonomy going to be used?

Content

What is the content scope? (Possibilities include company-wide, within an organizational unit, etc.)

What content sources will the taxonomy be built upon? (Specifically, the locations of the content to be covered in the taxonomy.)

User

Who will be using the taxonomy? (Possibilities include employees, customers, partners, etc.)

What are the user profiles?

This step should also define metrics for measuring the value of the taxonomy. For websites, baselines should be established for later comparison with the new site. An example would be the number of clicks it takes a site visitor to locate certain information.

Create taxonomy

Taxonomy creation can be manual, automated, or a combination of both. It involves analyzing context, content, and users within the defined scope. The analysis results serve as input for the taxonomy design, including both taxonomy structure and taxonomy view. The taxonomy development team is responsible for the actual mechanics of taxonomy design, whereas the taxonomy interest group is responsible for providing consultation on content inclusion, nomenclature, and labeling.

The design of the taxonomy structure and taxonomy view may run in tandem, depending on the resources available and project time frame. All concepts presented through the taxonomy view need to be categorized properly according to the taxonomy structure. This will ensure that every content item is organized centrally through the same classification schema.

Along with taxonomy structure and taxonomy view, standards and guidelines must be defined. There should be a categorizing rule for each category in taxonomy view and taxonomy structure. In short, you must define what type of content should go under any given category. Content managers can then refer to these rules when categorizing content. If an automated tool is used for content tagging, these rules can be fed to the tagging application. Standards and guidelines help ensure classification consistency, an important attribute of a quality content management system and search engineering process.

Implement the taxonomy

The next step includes setting up the taxonomy and tagging content against it. This is often referred to as "populating" the taxonomy. Similar to taxonomy creation, implementation can be manual, automated, or a combination of both. The goal here is to implement the taxonomy into the website design, search engineering, and content management.

For website design, taxonomy view provides the initial design for the site structure and interface. The focus is on the concepts and groupings, not so much on nomenclature, labeling, or graphics. There may be a need to go through multiple iterations, moving from general to specific in defining levels of detail for the content. Types of taxonomy view include site diagrams, navigation maps, content schemes, and wire frames. The final site layout is built by applying graphical treatment to the last iteration of taxonomy view.

For search engineering, implementation can be accomplished in various ways. Taxonomy structure as a classification schema can be fed into a search engine for training purposes or integrated with the search engine for a combination of category browsing and searching. In the latter case, the exposed taxonomy structure is essentially a type of taxonomy view. One of the most challenging aspects of taxonomy implementation is the synchronization between the search engine and the taxonomy, especially for search engines that do not take taxonomic content tagging into account during indexing. In such cases, a site visitor may receive different results from searching and browsing the same category, which could prove confusing.

Taxonomy structure needs to be integrated within the content management process. Content categorization should be one of the steps within the content management workflow, just like review and approval. If a content management tool is available, the taxonomy structure is loaded into the tool, either through a manual setup process, or imported from a taxonomy created externally. Through the content management process, content is tagged manually or automatically against the taxonomy. In other words, the taxonomy is populated with content.

Test

The goal of testing is to identify errors and discrepancies. The test results are then used to refine the taxonomy design. The testing should be incorporated into the usability testing process for the entire web application, including back-end content management testing and front-end site visitor testing. Here is a sample checklist of testing topics:

Given specific information topics, can the site visitors find what they need easily, in terms of coverage and relevancy?

Given specific information topics, how many clicks does it take before a site visitor arrives at the desired information?

Given specific tasks, can the site visitors accomplish them within a reasonable time frame?

Do the labels convey the concepts clearly or is there ambiguity?

Are the content priorities in sync with the site visitors' needs?

Does the structure allow content managers to categorize content easily?

Testing results are recorded and can later be compared with the baseline statistics to derive the measurements of improvements.
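As a small worked example of that comparison, the sketch below computes the percentage reduction in clicks per task against a baseline. The task names and click counts are invented for illustration, and click_improvement is a hypothetical helper, not part of any tool.

```python
def click_improvement(baseline, current):
    """Return the percentage reduction in clicks for each task measured in both runs."""
    report = {}
    for task, before in baseline.items():
        # Skip tasks that were not re-tested or that have no usable baseline.
        if task not in current or before == 0:
            continue
        after = current[task]
        report[task] = round((before - after) / before * 100, 1)
    return report


# Hypothetical measurements: clicks needed to reach the information.
BASELINE = {"find warranty policy": 6, "download user guide": 4}
AFTER_REDESIGN = {"find warranty policy": 3, "download user guide": 3}

print(click_improvement(BASELINE, AFTER_REDESIGN))
```

A positive number means fewer clicks than the baseline; a negative number would flag a task that became harder after the redesign.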

Maintain

Taxonomy design and fine-tuning is an ongoing process similar to content management. As an organization grows or evolves, its business context, content, and users change. New concepts, nomenclature, and information need to be incorporated into the taxonomy. A change management process is critical to ensure consistency and currency.

Better structure equals better access

Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate website design, content management, and search engineering. If well done, taxonomy will allow for structured web content, leading to improved information access.

Next time: what is metadata?

Monday, November 21, 2011

Taxonomy and Enterprise Content Management

Taxonomy is a hierarchical structure for the classification and/or organization of data. In content management and information architecture, taxonomy is used as a tool for organizing content. Development of an enterprise taxonomy requires the careful coordination and cooperation of departments within your organization.

Once the taxonomy is created, it needs to be managed. There is no such thing as a "finished taxonomy". A taxonomy needs to be revisited and revised periodically. Why? The business changes, new content is created, and old content is archived.

The two key aspects of taxonomy are taxonomy structure and taxonomy view. Taxonomy structure provides a classification schema for categorizing content within the content management process. Taxonomy view is a conceptual model illustrating the types of information, ideas, and requirements to be presented on the Web. It represents the logical grouping of content visible to a site visitor and serves as input for Web site design and search engineering. Together, these concepts can guide your Web development efforts to maximize return on investment. Build it right, and they will come.

There are three key factors in taxonomy development: business context, users, and content.

These factors reflect the fundamental business requirements for most taxonomy projects. Strategically, they provide a "trinity compass" for the road of taxonomy development.

Here's a description of each factor:

"Business context" is the business environment for the taxonomy efforts in terms of business objectives, Web applications where taxonomy will be used, corporate culture, past or current taxonomy initiatives, and artifacts within the organization and across the industry.

"Users" refers to the target audience for the taxonomy, user profiles, and user characteristics in terms of information usage patterns.

"Content" is the type of information that will be covered by the taxonomy or that the taxonomy will be built upon.

There are two common techniques for taxonomy strategy.

Universal Taxonomy

A single taxonomy is used to store and deliver content. When content contributors utilize the content management system, they add, remove, and manage content in a structure that closely resembles the navigation and hierarchy of the delivery framework (your website or application). The navigation structure is the taxonomy.

This method is conceptually simple and makes it quite easy to dynamically build your navigation from knowledge of this hierarchy. However, this model does have drawbacks:

Every time you reorganize the website, the organization of content in your management application shifts. Admittedly, this isn’t much of a drawback if you’re managing content for one moderately sized site or if your team of contributors is small.

It is difficult to reuse content in this structure. If you hope to reuse assets throughout your website, where are they organized in this structure?

In an environment with many contributors and diverse security requirements, organizing content (in the management application) in another way, say by contributor or by department, may be more intuitive.
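In the universal model, the navigation can be derived directly from where content is stored. A minimal sketch of that idea, assuming content is addressed by slash-separated paths (the paths and the function build_navigation are hypothetical):

```python
def build_navigation(paths):
    """Turn storage paths into a nested dictionary that mirrors the site navigation."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            # Create the level if it does not exist yet, then descend into it.
            node = node.setdefault(part, {})
    return tree


# Hypothetical locations of content items in the management application.
CONTENT_PATHS = [
    "/products/mobile-phones/overview",
    "/products/wireless-routers/overview",
    "/support/contact",
]

print(build_navigation(CONTENT_PATHS))
```

The sketch also makes the first drawback visible: because the navigation is read straight from the storage paths, reorganizing the site means moving the content.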

Content Mapping

A more robust, albeit more complex, method of managing content is to maintain structures and metadata in the content management application that are independent of the delivery system’s organization (navigation).

Content is organized, at the source, as may be required by your security, workflow, or organizational needs. Perhaps your data lives in a content management system or database where different organizational mechanisms exist. Unfortunately, the navigation for your consuming application (the presentation framework) is often managed by some other means.

By some rule or algorithm, leveraging your content classification data, material gets “mapped” to the presentation framework.

Advantages of this model:

There may be more than one way to organize content (think: content reuse). Given the same set of content, same set of classification criteria, but multiple algorithms, we can now build a delivery framework that allows for many methods of organization.

You no longer need to reorganize your content management application to change the delivery application. Just the algorithms (mappings) change.
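To illustrate the first advantage, the sketch below organizes the same classified content in two different ways simply by changing the mapping. The content items, the field names, and the function map_by are assumptions for illustration; real mapping rules are usually richer than grouping on a single field.

```python
from collections import defaultdict

# Hypothetical content items with classification metadata.
CONTENT = [
    {"id": "c1", "product": "Mobile Phones", "industry": "Healthcare"},
    {"id": "c2", "product": "Wireless Routers", "industry": "Utilities"},
    {"id": "c3", "product": "Mobile Phones", "industry": "Utilities"},
]


def map_by(items, field):
    """Group content ids into delivery sections named after the values of one field."""
    sections = defaultdict(list)
    for item in items:
        sections[item[field]].append(item["id"])
    return dict(sections)


# Same content, same classification, two different delivery structures.
print(map_by(CONTENT, "product"))
print(map_by(CONTENT, "industry"))
```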

Drawbacks:

If you hope to build your navigation dynamically, often you’ll need to build a tool or alternate hierarchy. You may not find much value in the content’s taxonomy.

Content, in your management environment, may be orphaned in your presentation framework if there are no rules mapping to an accessible part of the site.

Parts of the site may only be sparsely populated. It may not be readily obvious that you are creating gaps (with little or no content) in your site.

While powerful, this technique can be difficult to administer without having a fairly comprehensive understanding of the site design and algorithms for "mapping".
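Two of these drawbacks, orphaned content and sparsely populated sections, can be checked for mechanically. The following is a sketch under stated assumptions: mapping rules are modeled as (section, predicate) pairs, and the section names, rules, content items, and the function audit_mapping are all invented for the example.

```python
# Hypothetical site sections and mapping rules (section name, predicate).
SECTIONS = ["Products", "Support", "Training"]

RULES = [
    ("Products", lambda item: item["type"] == "datasheet"),
    ("Support", lambda item: item["type"] == "troubleshooting"),
]

CONTENT = [
    {"id": "c1", "type": "datasheet"},
    {"id": "c2", "type": "troubleshooting"},
    {"id": "c3", "type": "press release"},
]


def audit_mapping(items, rules, sections, min_items=1):
    """Return (orphaned content ids, sections holding fewer than min_items items)."""
    placed = {section: [] for section in sections}
    orphans = []
    for item in items:
        targets = [section for section, matches in rules if matches(item)]
        if not targets:
            # No rule maps this item to any part of the site.
            orphans.append(item["id"])
        for section in targets:
            placed.setdefault(section, []).append(item["id"])
    sparse = [section for section in sections if len(placed[section]) < min_items]
    return orphans, sparse


print(audit_mapping(CONTENT, RULES, SECTIONS))
```

Running a check like this whenever the rules or the classification change is one way to notice gaps before site visitors do.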

Assuming there are hierarchical structures within your content classification system, there is a very good chance that valuable information exists in the hierarchy. By taking advantage of relationships within your hierarchical metadata structures, richer algorithms may be developed for your content delivery framework.
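One example of such a richer algorithm is to treat a request for a category as a request for that category and everything beneath it. A minimal sketch, assuming the hierarchy is stored as a parent-to-children mapping and each content item carries a single term (the terms, content ids, and function names are hypothetical):

```python
# Hypothetical hierarchy: each term maps to its direct children.
CHILDREN = {
    "Products": ["Mobile Phones", "Wireless Routers"],
    "Mobile Phones": ["Smartphones"],
}

# Hypothetical content, each item tagged with one taxonomy term.
TAGGED = {
    "c1": "Smartphones",
    "c2": "Wireless Routers",
    "c3": "Products",
    "c4": "Healthcare",
}


def descendants(term, children):
    """Return the term itself plus every term below it in the hierarchy."""
    found = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def content_under(term, tagged, children):
    """List content tagged with the term or with any of its descendants."""
    scope = descendants(term, children)
    return sorted(cid for cid, tag in tagged.items() if tag in scope)


print(content_under("Products", TAGGED, CHILDREN))
print(content_under("Mobile Phones", TAGGED, CHILDREN))
```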

Friday, November 18, 2011

Taxonomy

Taxonomy is very important in content management. It ensures that search and navigation work properly and that content is accessible and can be found via two access points: searching and browsing.

Taxonomy is the science and practice of classification. The word is derived from Greek words "taxis" meaning "arrangement" and "nomia" which means "method". Taxonomy uses taxonomic units, known as taxa (singular taxon). A taxonomy, or taxonomic scheme, is a particular classification ("the taxonomy of ..."), arranged in a hierarchical structure or classification scheme.

Taxonomy is organized by supertype-subtype relationships, also called generalization-specialization relationships, or less formally, parent-child relationships. Once a taxonomy tree has been created, all the items in the tree are tagged as belonging to one or more specific taxonomy categories. This process is typically referred to as "categorization", "tagging" or "profiling". Users can then browse and search within specific categories.

In such an inheritance relationship, the subtype by definition has the same properties, behaviours, and constraints as the supertype plus one or more additional properties, behaviours, or constraints. For example: a bicycle is a kind of vehicle, so any bicycle is also a vehicle, but not every vehicle is a bicycle. Therefore a subtype needs to satisfy more constraints than its supertype. Thus to be a bicycle is more constrained than to be a vehicle.
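The bicycle example can be expressed as a small parent-child lookup. This sketch assumes each term has at most one parent; the extra terms ("car", "mountain bike") and the function names are added purely for illustration.

```python
# Each term points to its supertype (parent). "vehicle" is the root.
PARENT = {
    "bicycle": "vehicle",
    "car": "vehicle",
    "mountain bike": "bicycle",
}


def ancestors(term, parent):
    """Return the chain of supertypes from the nearest parent up to the root."""
    chain = []
    seen = {term}
    while term in parent:
        term = parent[term]
        if term in seen:
            break  # guard against an accidental cycle in the data
        seen.add(term)
        chain.append(term)
    return chain


def is_a(term, candidate, parent):
    """True if term is the candidate itself or one of its subtypes."""
    return term == candidate or candidate in ancestors(term, parent)


print(is_a("bicycle", "vehicle", PARENT))   # every bicycle is a vehicle
print(is_a("vehicle", "bicycle", PARENT))   # but not every vehicle is a bicycle
print(ancestors("mountain bike", PARENT))
```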

Taxonomy was historically used by biologists to classify plants and animals according to a set of natural relationships. In content management and information architecture, taxonomy is used as a tool for organizing content. Creating a taxonomy is central to any enterprise content strategy as a means of organizing content so that it can be found by either searching or browsing.

Here is an example of food taxonomy:

Next time: more about taxonomy as it applies to content management and the best strategies to develop it.
