Tuesday, December 30, 2014
In my previous post, I described the future of enterprise search. In this post, I will describe a few new search applications that could be interesting.
Founded in 2002, Concept Searching provides software products that deliver automatic semantic metadata generation, auto-classification, and powerful taxonomy management tools. Concept Searching is the only platform independent statistical metadata generation and classification software company in the world that uses concept extraction and compound term processing to significantly improve access to unstructured information. The Concept Searching Microsoft suite of technologies runs in all versions of SharePoint, Office 365, and OneDrive for Business.
The technologies are being used to improve search outcomes, deploy an enterprise metadata repository, enable effective records management, identify and secure sensitive information, improve governance and compliance, support social tagging, collaboration, and text analytics, facilitate eDiscovery, and drive intelligent migration.
Concept Searching, developer of the Smart Content Framework™, provides organizations with a method to mitigate risk, automate processes, manage information, protect privacy, and address compliance issues. This infrastructure framework utilizes a set of technologies that encompasses the entire portfolio of unstructured information assets, resulting in increased organizational performance and agility.
Lexalytics provides enterprise and hosted text analytics software to transform unstructured text into structured data. The software extracts entities (people, places, companies, products, etc.), sentiment, quotes, opinions, and themes (generally noun phrases) from text. Text is considered unstructured data which comprises somewhere between 31% and 85% of what is stored in any given enterprise.
Lexalytics is an OEM vendor of text analytics and sentiment analysis technology for social media monitoring, brand management, and voice-of-customer industries. The software uses natural language processing technology to extract the above-mentioned items from social media and forums; the voice of the customer in surveys, emails, and call-center feedback, traditional media, pharmaceutical research and development, internal enterprise documents, and others.
Lexalytics provides a text mining engine that is used by a number of search partners like Coveo, Playence, and Oracle to add additional metadata to their search. This is additional intelligence around "just what do those words actually mean?" In other words, this engine boosts the value of search by putting more information into the index. This enables other applications, and helps search be "smarter".
MaxxCAT provides enterprise search solutions for corporate intranets, web sites, databases, file systems and applications, and other environments that require rapid document retrieval from multiple data sources. The flagship products offered by MaxxCAT are the SB-250 series and the EX-5000 series network search appliances. Also available is a series of cloud-enabled storage appliances.
Founded in 1995, this software company specializes in applying artificial intelligence techniques to understanding documents written in different languages. Their software enhances parsing tools by classifying the role of words and provides metadata on the role of words to other algorithms. Software from Basis Technology will, for instance, identify the language of an incoming stream of characters and then identify the parts of each sentence like the subject or the direct object.
The company is best known for its Rosette Linguistics Platform, which uses natural language processing techniques to improve information retrieval, text mining, search engines and other applications. The tool is used by major search engines and translators to create normalized forms of text. Basis Technology software is also used by forensic analysts to search through files for words, tokens, phrases or numbers that may be important to investigators.
Founded in 1991, this company specializes in text retrieval software. Its current range of software includes products for enterprise desktop search, intranet/Internet spidering and search, and search engines for developers (SDK) to integrate into other software applications.
Founded in 1999, this company is in the field of image recognition for commercial and government customers. The company provides technologies for image matching, similarity and color search for integration into applications for mobile, media intelligence and advertisement tracking, ecommerce and stock photography, brand and copyright protection, law enforcement and more.
Sematext Group, Inc.
This company's product SSA (Site Search Analytics) continuously monitors, measures, and improves the search experience. It identifies top queries, problematic zero-hit queries, common misspellings, etc. It measures and compares search relevance and improves conversion rates. It is available on-premises and in the cloud.
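As a rough illustration of what this kind of analysis involves, here is a minimal sketch that finds top queries, zero-hit queries, and likely misspellings in a query log. It is not Sematext's implementation: the log format, the sample data, and the function names are all assumptions made up for this example.

```python
from collections import Counter
import difflib

# Hypothetical query log: (query text, number of results returned).
QUERY_LOG = [
    ("vacation policy", 12),
    ("vacaton policy", 0),
    ("expense report", 7),
    ("vacation policy", 12),
    ("org chart 2014", 0),
    ("vacaton policy", 0),
    ("vacation policy", 12),
]

def top_queries(log, n=3):
    """Most frequent queries with their counts."""
    return Counter(query for query, _ in log).most_common(n)

def zero_hit_queries(log):
    """Queries that returned no results, with how often each was asked."""
    return Counter(query for query, hits in log if hits == 0)

def suggest_correction(query, log):
    """Closest successful query, or None; a crude misspelling detector."""
    successful = sorted({q for q, hits in log if hits > 0})
    matches = difflib.get_close_matches(query, successful, n=1)
    return matches[0] if matches else None

print(top_queries(QUERY_LOG, 1))
print(zero_hit_queries(QUERY_LOG))
print(suggest_correction("vacaton policy", QUERY_LOG))
```

A zero-hit query that has a close match among successful queries is probably a misspelling; one with no close match more likely points to missing content.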
This is a privately held software company which was founded in 2000 in Konstanz, Germany, with an additional office in the United Kingdom (Bristol). The company develops intelligent software for search and analysis of structured and semi-structured data.
Their product MatchMaker is the leading error-tolerant search & match platform for huge master data volumes. The multiple award-winning software technology thinks, searches and finds like a human – but dramatically faster, in much more complex configurations and with no serious data restriction using keys or similar methods. It is available on-premises and in the cloud.
Federal authorities, insurance agencies, ICT firms and more use this software for identity resolution in diverse, data-intensive business processes such as input management, enterprise search and data quality. It is easy to customize and integrate.
Founded in 2005, this company provides enterprise semantic search technology based on artificial intelligence and natural language processing. It offers intuitive search solutions and intelligent content support for websites and corporate intranets.
Content Analyst Company
This is a privately held software company which develops concept-aware text analytics software called CAAT, which is licensed to software product companies for use in eDiscovery. In 2013, five CAAT-powered products were named in the Gartner eDiscovery Magic Quadrant Report, and the analyst firm 451 Group referred to CAAT as The Hottest Product in eDiscovery.
Content Analyst's CAAT analytics software is a machine learning system based on latent semantic indexing technology. CAAT provides several text analytics capabilities using both supervised learning and unsupervised learning methods including concept search, categorization, conceptual clustering, email conversation threading, language identification, near-duplicate identification, auto summarization and difference highlighting.
With SearchYourCloud and its patented, federated search technology, a single search request in Outlook simultaneously and transparently searches your email, desktop and all of your cloud storage sources and delivers highly targeted results. You get exactly the information you need with just one query.
Docurated aggregates all your documents in one place, turning them into a searchable and customizable database. Docurated will now provide Dropbox integration as well. It accelerates sales in companies looking for fast growth by making the best marketing content readily available to Sales around the world. Docurated works with your existing content stores and uses machine learning to enable your team to find and re-use the most effective content with no manual tagging or uploading.
This is the next generation visual knowledge management platform which solves the information retrieval problem for leading companies like Clorox, Omnicom, Netflix, Weather Channel, and many others. Docurated enables sales, marketing, and technology teams to surface and use the exact chart or slide they need, no matter where it is stored, without slogging through folders and files. Docurated seamlessly integrates with existing folder-based repositories.
Apache Lucene is a free, open source information retrieval software library. It is supported by the Apache Software Foundation and is released under the Apache License. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching.
At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.
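To make the "document containing fields of text" idea concrete, here is a toy sketch in Python. It is not Lucene's API (Lucene is a Java library with analyzers, scoring, and much more); the `TinyIndex` class and its methods are hypothetical names used only to show how fielded text can be indexed once it has been extracted from any file format.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps (field, term) to the ids of matching documents."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, fields):
        """Index a document given as a dict of field name -> extracted text."""
        self.docs[doc_id] = fields
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return sorted ids of documents whose field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))

index = TinyIndex()
# The index only sees text; whether it came from a PDF or a Word file is irrelevant.
index.add(1, {"title": "Quarterly sales report", "body": "Revenue grew in Europe"})
index.add(2, {"title": "Engineering specification", "body": "The sales API returns JSON"})

print(index.search("title", "sales"))  # document 1 only
print(index.search("body", "sales"))   # document 2 only
```

Because the index stores only field names and terms, the same structure works for any source format whose text can be extracted.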
These are just a few search applications that are currently on the market. There are many others. Choosing the right application is based on your organization's requirements.
Enterprise search is a developing industry. In this post, I will describe the latest developments in enterprise search.
Effective enterprise search represents one of the most challenging areas in business today. The whole area of search has been revolutionized by Google. Employees now expect to be able to locate relevant data as easily as they navigate the web through Google. When this ease of search is not replicated in organizations' systems, it can be quite frustrating. As we create more content than ever before, the importance of effective search across the enterprise continues to grow.
Until recently, much of the enterprise search technology remained unchanged. The general purpose enterprise search offerings were fairly similar in technology and scope. There are now many software companies who direct their efforts towards enterprise search. The future will bring shorter innovation cycles, continuous user experience improvements, deeper integration with first- and third-party applications and more ETL-like (extract, transform and load) functionality to handle poor quality content.
In the late 2000s and early 2010s, the enterprise search companies were absorbed by the large software companies:
- Microsoft acquired FAST Search in 2008
- Adobe acquired Mercado in 2009
- Dassault Systèmes acquired Exalead in 2010
- Hewlett Packard acquired Autonomy in 2011
- Oracle acquired Endeca in 2011
- IBM acquired Vivisimo in 2012
User experience is a broad topic in itself, with active trends including:
- Richer information about the user to determine context, such as their business context, social context, mobile device sensors, location, speech recognition, preferences and historical usage.
- Advances in visualization, enabled by technologies such as HTML5.
- Natural language processing as in the trends seen with Wolfram Alpha and smart phone digital assistants, such as Apple’s Siri, Microsoft’s Cortana and Google Now.
- Richer results that look less like a page of links and more like answers to questions.
- Elements of knowledge management that add meaning to queries and results.
- Enterprise search products will become increasingly and more deeply integrated with existing platforms, allowing more types of content to be searchable and in more meaningful ways. It will also become less of a dark art and more of a platform for discovery and analysis.
The future of enterprise search seems destined to continue with simple keyword and Boolean searching, augmented by faceted navigation based on metadata. Virtually every e-commerce web site today offers guided navigation based on metadata.
This ubiquitous model now appears in most of the leading enterprise search products and users immediately understand how a simple text query can quickly be focused to a specific domain by clicking on a metadata filter. This updated search model is increasing demand for auto-classification products which can generate descriptive metadata automatically based on an analysis of the document’s unstructured content.
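The keyword-plus-filter model described above can be sketched in a few lines. This is an illustration only, not any vendor's implementation; the sample documents, the metadata fields (`department`, `type`), and the function names are assumptions.

```python
from collections import Counter

# Hypothetical documents: free text plus descriptive metadata.
DOCS = [
    {"id": 1, "text": "travel expense policy", "department": "Finance", "type": "Policy"},
    {"id": 2, "text": "expense report template", "department": "Finance", "type": "Template"},
    {"id": 3, "text": "travel booking procedure", "department": "HR", "type": "Procedure"},
    {"id": 4, "text": "laptop expense approval policy", "department": "IT", "type": "Policy"},
]

def search(query, filters=None):
    """Keyword match on the text, then narrow by exact metadata values."""
    terms = query.lower().split()
    hits = [d for d in DOCS if all(t in d["text"].split() for t in terms)]
    for field, value in (filters or {}).items():
        hits = [d for d in hits if d[field] == value]
    return hits

def facet_counts(hits, field):
    """Count hits per metadata value; these are the clickable facets."""
    return Counter(d[field] for d in hits)

hits = search("expense")
print(facet_counts(hits, "department"))  # choices shown beside the results
print([d["id"] for d in search("expense", {"type": "Policy"})])  # after one click
```

The facets are only as good as the metadata behind them, which is why this model increases demand for auto-classification.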
Open source software has made significant improvements, displacing many of the traditional search vendors. Lucene and its supporting companies like LucidWorks provide solid search functionality at a hard-to-beat price. Where vendors are seeing success is in four main areas:
- Providing functionality beyond typical "search" – extending to facets, true knowledge management, multimedia search, and other functionality.
- Focusing on vertical-specific applications like fraud and supply-chain management.
- Working with larger, more conservative enterprises.
- Providing a SaaS, one-stop-shop for zero (or low) touch functionality.
A few major factors are going to drive the industry going forward:
- Open source will continue to get better and drive out inefficiency in the market.
- More, better information about the searcher: location awareness, profile sharing, time dependence, deeper understanding of the context and content of the search. With this information, you can provide better, more relevant results.
- Lower tolerance for hassle: people expect search to "just work" – not understanding that it can be just as complicated as any other major IT initiative. By having low-touch solutions, SaaS providers will make major progress in the small/medium business world.
- Search all the things!: Integrated understanding of objects, video, speech, as well as traditional semantic sources like text will combine together better into a whole that allows for information retrieval no matter what the format.
Another area for future development is machine to machine consumption of information and sharing. Search providers are increasingly applying advanced analytics of text and other media so their users’ desires are more deeply satisfied through relevant search results. Search will be increasingly entity-centric and collaborative.
The future of search will include more semantic understanding of both content and queries. For example, Exorbyte is focused on searching in structured master data (people, products and places), and its ability to query this data for both lexicographical and semantic similarity without the use of restrictive match-keys is globally unique.
The future of search runs through natural language processing, and it will also entail the capability to provide advanced information analysis at indexing time.
The facility to search within the document itself is becoming vital. The Docurated platform caters for instant access to the most relevant page or slide without even having to open the document.
Effective enterprise search can eradicate inefficiency. Enterprise search will become instant and intuitive, paving the way for increased productivity across the enterprise.
In my next post, I will highlight a few search applications that could be worth looking into...
Sunday, November 30, 2014
In this post, I will describe a few improved features in SharePoint 2013.
SharePoint 2013 has cross-site publishing. In the previous versions of SharePoint, it was not possible to easily share content across sites. Using cross-site publishing, users can separate authoring and publishing into different site collections: authored content goes into an indexable "catalog", and you can then use FAST to index and deliver dynamic content on a loosely coupled front end.
This feature is required for services like personalization, localization, metadata-driven topic pages, etc. An example of its use is a product catalog in an e-commerce environment. It can be used more generally for all dynamic content. Note that cross-site publishing is not available in SharePoint Online.
Here is how it works. First, you designate a list or a library as a "catalog". FAST then indexes that content and makes it available to publishing site collections via a new content search web part (CSWP). There are a few good features for creating and customizing CSWP instances, including some browser-based configurations. Run-time queries should execute faster against the FAST index than against a SharePoint database.
The cross-site publishing feature could significantly improve your content reuse capabilities by enabling you to publish to multiple site collections.
Creating templates still begins with a master page which is an ASP.NET construction that defines the basic page structure such as headers and footers, navigation, logos, search box, etc. In previous versions, master pages tended to contain a lot of parts by default, and branding a SharePoint publishing site was somewhat tricky.
You can generate and propagate a design package to reuse designs across site collections. There are template snippets that enable you to apply layouts within a design package, but they are not reusable across design packages.
This process is more straightforward than in previous versions, but it would still likely involve a developer.
SharePoint 2013 enables contributors to add more complex, non-web part elements like embedded code and video that do not have to be based on a specific web part. This feature is called "embed code". Note that if you are using cross-site publishing with its search-based delivery, widget behavior may be tricky and could require IT support.
With respect to digital asset management, SharePoint has had the ability to store digital assets. However, once you got past uploading a FLV or PNG file, there was scant recourse to leverage it. SharePoint 2013 brings a new video content type, with automatic and manual thumbnailing.
The image renditions capability has also improved. It allows you to contribute a full-fidelity image to a library, and then render a derivative of that image when it is served through a web page.
Other added features include better mobile detection/mobile site development and an improved editing experience.
Metadata and Tagging Services
SharePoint 2013 has solid metadata and tagging services with an improved and simplified term store. However, there is still no versioning, version control or workflow for terms.
A big improvement is that using FAST, you can leverage metadata in the delivery environment much more readily than you could in previous versions. You can use metadata-based navigation structures (as opposed to folder hierarchies), and deploy automated category pages and link lists based on how items are tagged.
Saturday, November 15, 2014
With mobile devices becoming increasingly powerful, users want to access their documents while on the move. iPads and other tablets in particular have become very popular. Increasingly, employers allow employees to bring mobile devices of their choice to work.
"Bring Your Own Device" (BYOD) policies have become widespread in organizations, and users have started expecting and demanding new features that enable them to work on their documents from mobile devices. Therefore, the need for mobile access to content has greatly increased in recent years.
As with most technology, mobile and cloud applications are driving the next generation of capabilities in ECM tools. The key capabilities in ECM tools are the ability to access documents via mobile devices, ability to sync documents across multiple devices, and the ability to work on documents offline.
Most tools provide a mobile web-based application that allows users to access documents from a mobile device's web browser. That is handy when users have a device for which the tool provides no dedicated application.
The capabilities of mobile applications vary across different tools. In some cases, the mobile application is very basic, allowing users to perform only read-only operations. In other cases, users can perform more complex tasks such as creating workflows, editing documents, changing permissions or adding comments.
Solutions and Vendors
Solutions have emerged that specialize in cloud-based file sharing (CFS) capabilities. Dropbox, Google Drive, Box.com, and Syncplicity (acquired by EMC) provide services for cloud-based file sharing, sync, offline work, and some collaboration for enterprises.
There is considerable overlap of services between these CFS vendors and traditional document management (DM) vendors. CFS vendors build better document management capabilities (such as library services), and DM vendors build (or acquire) cloud-based file sharing, sync, and collaboration services. Customers invested in DM tools frequently consider deploying relevant technology for cloud file sharing and sync scenarios. Similarly, many customers want to extend their usage of CFS platforms for basic document management services.
DM vendors that are actively trying to address these needs include Alfresco (via Alfresco Cloud), EMC, Microsoft (via SkyDrive/Office 365), Nuxeo (via Nuxeo Connect), and OpenText (via Tempo Box). Collaboration/social vendors like Jive, Microsoft, and Salesforce have also entered the enterprise file sharing market. Other large platform vendors include Citrix, which acquired ShareFile. Oracle, IBM, and HP are about to enter this market as well.
Number of Devices - The number of devices for which the ECM vendor provides mobile applications is very important. Most tools provide specific native applications for Apple’s iPhone and iPad (based on the iOS operating system) and Android-based phones and tablets. Some also differentiate between the iPhone and iPad and provide separate applications for those two devices. Some provide applications for other devices such as those based on Windows and BlackBerry.
File sync and offline capabilities - Many users use more than one device to get work done. They might use a laptop in the office, a desktop at home, and a tablet and a phone while traveling. They need to access files from all of those devices, and it is important that an ECM tool can synchronize files across different devices.
Users increasingly expect capabilities for advanced file sharing, including cloud and hybrid cloud-based services. Most tools do that by providing a sync app for your desktop/laptop, which then syncs your files from the cloud-based storage to your local machine.
Most tools require users to create a dedicated folder and move files to that dedicated folder, which is then synced. A few tools like Syncplicity allow users to sync from any existing folder on your machine.
A dedicated folder can be better managed and seems to be a cleaner solution. However, it means that users need to move files around which can cause duplication. The other approach of using any folder as a sync folder allows users to keep working on files in their usual location. That is convenient, but if users reach a stage when they have too many folders scattered around on their laptop and other synced machines, they might have some manageability issues.
Some tools allow users to selectively sync. Rather than syncing the entire cloud drive, users can decide which folders to sync. That is useful when users are in a slow speed area or they have other bandwidth-related constraints. In some cases, they can also decide whether they want a one-way sync or a bi-directional sync. Once they have the files synced up and available locally, they typically can work offline as well. When they go online, their changes are synced back to the cloud.
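The selective and one-way versus bi-directional behavior described above can be sketched as a small planning function. This is a simplified illustration, not how any particular tool works: the function name, the "newest modification time wins" rule, and the sample paths are assumptions, and real sync clients also have to handle deletions and conflicting edits.

```python
def plan_sync(local, cloud, selected_folders, two_way=True):
    """Return (action, path) pairs for files inside the selected folders.

    local and cloud map file paths to modification times (larger is newer).
    With two_way=False, changes only flow from the cloud to the device.
    """
    actions = []
    for path in sorted(set(local) | set(cloud)):
        # Selective sync: skip anything outside the folders the user chose.
        if not any(path.startswith(folder + "/") for folder in selected_folders):
            continue
        local_time, cloud_time = local.get(path), cloud.get(path)
        if cloud_time is not None and (local_time is None or cloud_time > local_time):
            actions.append(("download", path))
        elif two_way and local_time is not None and (cloud_time is None or local_time > cloud_time):
            actions.append(("upload", path))
    return actions

local = {"Projects/plan.docx": 100, "Projects/notes.txt": 250}
cloud = {"Projects/plan.docx": 200, "Projects/notes.txt": 150, "Archive/old.pdf": 50}

print(plan_sync(local, cloud, ["Projects"]))
print(plan_sync(local, cloud, ["Projects"], two_way=False))
```

In this sketch an offline edit shows up as a newer local modification time, so it is uploaded the next time a two-way plan is computed.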
Most tools that provide a dedicated mobile application can also sync files on mobile devices. However, mobile syncing is usually tricky due to the closed nature of mobile device file systems.
While most ECM and DM vendors provide some varying capabilities for mobile access, not all of them can effectively offer file sync across multiple devices.
Your options should be based on your users' requirements. Assess them very carefully before deciding on a suitable solution for your organization.
Friday, October 31, 2014
A successful enterprise content management (ECM) implementation requires an ongoing partnership between IT, compliance, and business managers.
Strict top-down initiatives that leave little room for considering users' requirements result in ECM systems that users don’t want to use.
Similarly, an ad hoc, overly decentralized approach leads to inconsistent policies and procedures, which in turn leads to disorganized, ungoverned content that cannot be found. At both extremes, the ECM initiative ends in failure.
Whether your organization uses an agile, waterfall or mixed approach to ECM deployment, ECM leaders must think about program initiation, planning, deployment, and ongoing improvement as a process and not as isolated events. Team composition will change over the course of ECM project planning and roll-out, as different skill sets are needed.
For example, a business analyst is a key member of the team early in the project when developing a business case and projecting the total cost of the project, while the legal department will need to get involved when documenting e-discovery requirements.
But there is often no clear location in the org chart for fundamental content management responsibilities, and that can contribute to weakened strategy, governance and return on investment (ROI).
Approach to ECM
Successful ECM initiatives balance corporate governance needs with the desire of business units to be efficient and competitive, and to meet cost and revenue targets.
Organizations should determine the balance of centralized versus decentralized decision making authority by the level of industry regulation, jurisdiction, corporate culture and autonomy of business units or field offices.
A central ECM project team of content management, business process, and technology experts should define strategy and objectives and align with the technology vision. Local subject matter experts in business units or regional offices can then be responsible for the execution and translation of essential requirements into localized policies and procedures, along with the business unit’s content management goals.
Business managers can help to measure the current state of productivity, set goals for improvement, contribute to a business case or forecast the total cost of ownership of a CMS over a number of years. A trainer will be needed during pilot and roll-out to help with change management and system orientation. The legal department should approve updates to retention schedules and disposition policies as practices shift away from classification schemes designed for paper to a more automated, metadata-driven approach.
The following roles are essential for an ECM project:
- Steering committee is responsible for project accountability and vision. Their role is to define an overall vision for an ECM project and outline processes and procedures to ensure integrity of information.
- Project manager is responsible for the ECM project management during CMS deployment. The project manager's role is to create project plans and timetables, identify risks and dependencies, liaise with business units, executive sponsors, IT, and other teams.
- Business analyst is responsible for outlining the desired state of CMS implementation and success metrics. This role is to gather business and technical requirements by engaging with business, technical, and legal/compliance stakeholders. They need to identify the current state of operations and outline the desired future state by adopting a CMS system.
- Information architect's role is to define and communicate the standards to support the infrastructure, configuration, and development of the ECM application.
- System administrators' role is to define and implement an approach to on-premises, cloud, or hybrid infrastructure to support a CMS.
- CMS administrator is responsible for the operation of the CMS. This role is to define and implement processes and procedures to maintain the operation of the CMS.
- User experience specialist's role is to define standards for usability and consistency across devices and applications, and create reusable design and templates to drive users' adoption.
- Records and information managers' role is to define and deploy taxonomies, identify metadata requirements, and to develop retention, disposition, and preservation schedules.
Core competencies will be supplemented by developers, trainers, quality assurance, documentation, and other specialists at various phases of the ECM deployment project. It is important to provide leadership during the deployment of a CMS. The team should bring technical knowledge about repositories, standards and service-oriented architectures, combined with business process acumen and awareness of corporate compliance obligations.
Information architects will be important participants during both the planning and deployment phases of the project. Communication and process expertise are essential for ongoing success. IT staff, information architects, and information managers should learn the vocabulary, pain points, and needs of business units, and help translate users' requirements into technical solutions so that the deployed CMS helps to improve current processes.
Compliance subject matter experts should communicate the implications and rationale of any change in process or obligations to users responsible for creating or capturing content.
Project plans, budgets and timetables should include time for coaching, communication, and both formal and informal training. Even simple file sharing technology will require some investment in training and orientation when processes or policies are changed.
ECM is a long-term investment, not a one-time technology installation project. Enterprises can often realize short-term ROI by automating manual processes or addressing high-risk noncompliance issues, but the real payoff comes when an enterprise treats content as a strategic asset.
A strong ECM project team demonstrates leadership, communication skills and openness to iteration, setting the foundation for long-term value from the deployment efforts.
For example, one company aligned its deployment and continuous improvement work by adopting more agile approaches to project delivery and by measuring success with business metrics (faster time to market for new products) instead of technology management metrics (number of documents created per week). That change allowed the company to better serve the document sharing and collaboration needs of sales teams in the field.
The project team must engage directly with the user community to create systems that make work processes better. It is a good idea to include hands-on participation and validation with a pilot group.
Follow best practices from completed ECM projects. Review processes, applications, forms, and capture screens to identify areas of friction when people capture or share content. User experience professionals have design and testing experience, and they need to be included in the ECM deployment team.
User participation is valuable throughout the ECM deployment project. Direct input on process bottlenecks, tool usability and real-world challenges helps prioritize requirements, select technologies and create meaningful training materials.
Senior managers who participate on a steering committee, or are stakeholders in an information governance strategy, should allow their teams to allocate adequate time for participation. That might mean attending focus groups, holding interviews, attending demos and training, or experimenting with new tools.
A sustainable and successful ECM initiative will be responsive to the changing behavior of customers, partners and prospects, changing needs of users, and corporate and business unit objectives. Stay current with ECM and industry trends. ECM project team members should keep one eye on the future and be open to learning about industry best practices.
Businesses will continue to adopt mobile, cloud and social technologies for customer and employee communication. Anticipate new forms of digital content and incorporate them into the ECM program strategy proactively, not reactively.
Proactively push vendors for commitments and road maps to accommodate those emerging needs. Stay alert to emerging new vendors or alternative approaches if the needs of business stakeholders are shifting faster than current ECM technology. Aim for breadth as well as depth of knowledge, and encourage team members to explore adjacent areas to ECM to acquire related knowledge and think more holistically.
Saturday, September 6, 2014
In part one of this post, I described using metadata in SharePoint. In part two, I will describe metadata management.
Managed metadata makes it easier for Term Store Administrators to maintain and adapt your metadata as business needs evolve. You can update a term set easily. And, new or updated terms automatically become available when you associate a Managed Metadata column with that term set. For example, if you merge multiple terms into one term, content that is tagged with these terms is automatically updated to reflect this change. You can specify multiple synonyms (or labels) for individual terms. If your site is multilingual, you can also specify multilingual labels for individual terms.
Managing metadata effectively requires careful thought and planning. Think about the kind of information that you want to capture about the content of lists and libraries, and think about the way that the information is used in the organization. You can create term sets of metadata terms for many different kinds of information.
For example, you might have a single content type for a document. Each document can have metadata that identifies many of the relevant facts about it, such as these examples:
- Document purpose - is it a sales proposal? An engineering specification? A Human Resources procedure?
- Document author, and names of people who changed it
- Date of creation, date of approval, date of most recent modification
- Department responsible for any budgetary implications of the document
Activities that are involved with managing metadata:
- Planning and configuring
- Managing terms, term sets, and groups
- Specifying properties for metadata
Planning and configuring managed metadata
Your organization may want to do careful planning before you start to use managed metadata. The amount of planning that you must do depends on how formal your taxonomy is. It also depends on how much control you want to impose on metadata.
If you want to let users help develop your taxonomy, then you can just have users add keywords to items, and then organize these into term sets as necessary.
If your organization wants to use managed term sets to implement formal taxonomies, then it is important to involve key stakeholders in planning and development. After the key stakeholders in the organization agree upon the required term sets, you can use the Term Store Management Tool to import or create your term sets. You can also use the tool to manage the term sets as users start to work with the metadata. If your web application is configured correctly, and you have the appropriate permissions, you can go to the Term Store Management Tool by following these steps:
1. Select Settings and then choose Site Settings.
2. Select Term store management under Site Administration.
Managing terms, term sets, and groups
The Term Store Management Tool provides a tree control that you can use to perform most tasks. Your user role for this tool determines the tasks that you can perform. To work in the Term Store Management Tool, you must be a Farm Administrator or a Term Store Administrator. Or, you can be a designated Group Manager or Contributor for term sets.
To take actions on an item in the hierarchy, follow these steps:
1. Point to the name of the Managed Metadata Service application, group, term set, or term that you want to change, and then click the arrow that appears.
2. Select the actions that you want from the menu.
For example, if you are a Term Store Administrator or a Group Manager you can create, import, or delete term sets in a group. Term set contributors can create new term sets.
Properties for terms and term sets
At each level of the hierarchy, you can configure specific properties for a group, term set, or term by using the properties pane in the Term Store Management Tool. For example, if you are configuring a term set, you can specify information such as Name, Description, Owner, Contact, and Stakeholders in the pane available on the General tab. You can also specify whether you want a term set to be open or closed to new submissions from users. Or, you can choose the Intended Use tab, and specify whether the term set should be available for tagging or site navigation.
Using metadata in SharePoint makes it easier to find content items. Metadata can be managed centrally in SharePoint and can be organized in a way that makes sense in your business. When the content across sites in an organization has consistent metadata, it is easier to find business information and data by using search. Search features such as the refinement panel, which displays on the left-hand side of the search results page, enable users to filter search results based on metadata.
SharePoint metadata management supports a range of approaches to metadata, from formal taxonomies to user-driven folksonomies. You can implement formal taxonomies through managed terms and term sets. You can also use enterprise keywords and social tagging, which enable site users to tag content with keywords that they choose. SharePoint enables organizations to combine the advantages of formal, managed taxonomies with the dynamic benefits of social tagging in customized ways.
Metadata navigation enables users to create views of information dynamically, based on specific metadata fields. Then, users can navigate libraries by using folders or by using metadata pivots, and refine the results by using additional key filters.
You can choose how much structure and control to use with metadata, and the scope of control and structure. For example:
- You can apply control globally across sites, or make it local to specific sites.
- You can configure term sets to be closed or open to user contributions.
- You can choose to use enterprise keywords and social tagging with managed terms, or not.
The managed metadata features in SharePoint enable you to control how users add metadata to content. For example, by using term sets and managed terms, you can control which terms users can add to content, and who can add new terms. You can also limit enterprise keywords to a specific list by configuring the keywords term set as closed.
When the same terms are used consistently across sites, it is easier to build robust processes or solutions that rely on metadata. Additionally, it is easier for site users to apply metadata consistently to their content.
A term is a specific word or phrase that you associate with an item on a SharePoint site. A term has a unique ID and it can have many text labels (synonyms). If you work on a multilingual site, the term can have labels in different languages.
There are two types of terms:
Managed terms are terms that are pre-defined. Term Store administrators organize managed terms into a hierarchical term set.
Enterprise keywords are words or phrases that users add to items on a SharePoint site. The collection of enterprise keywords is known as a keywords set. Typically, users can add any word or phrase to an item as a keyword. This means that you can use enterprise keywords for folksonomy-style tagging. Sometimes, Term Store administrators move enterprise keywords into a specific managed term set. When they are part of a managed term set, keywords become available in the context of that term set.
A term set is a group of related terms. Term sets can have different scopes, depending on where you create the term set.
Local term sets are created within the context of a site collection, and are available for use (and visible) only to users of that site collection. For example, when you create a term set for a metadata column in a list or library, then the term set is local. It is available only in the site collection that contains this list or library. For example, a media library might have a metadata column that shows the kind of media (diagram, photograph, screen shot, video, etc.). The list of permitted terms is relevant only to this library, and available for use in the library.
Global term sets are available for use across all sites that subscribe to a specific Managed Metadata Service application. For example, an organization might create a term set that lists names of business units in the organization, such as Human Resources, Marketing, Information Technology, and so on.
In addition, you can configure a term set as closed or open. In a closed term set, users can't add new terms unless they have appropriate permissions. In an open term set, users can add new terms in a column that is mapped to the term set.
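To make the relationship between terms, labels, and open or closed term sets more concrete, here is a minimal sketch in Python. It is not SharePoint's API: the class names `Term` and `TermSet`, the `tag` method, and the `can_manage` flag are all hypothetical names that I chose for illustration. The sketch only models the behavior described above: a term has one unique ID and several labels, and a closed term set rejects new terms unless the user has appropriate permissions.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Term:
    """A term: one unique ID, a default label, and optional synonyms."""
    default_label: str
    synonyms: list = field(default_factory=list)
    term_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def matches(self, text: str) -> bool:
        # Any label (default or synonym) resolves to the same term.
        labels = [self.default_label, *self.synonyms]
        return text.lower() in (label.lower() for label in labels)


@dataclass
class TermSet:
    """A group of related terms that is either open or closed."""
    name: str
    is_open: bool = False
    terms: list = field(default_factory=list)

    def find(self, text: str):
        return next((t for t in self.terms if t.matches(text)), None)

    def tag(self, text: str, can_manage: bool = False) -> Term:
        """Return the term for `text`, adding it only if that is allowed."""
        term = self.find(text)
        if term is not None:
            return term
        if not (self.is_open or can_manage):
            raise PermissionError(f"'{text}' is not in closed term set '{self.name}'")
        term = Term(text)
        self.terms.append(term)
        return term


# Hypothetical example data, loosely based on the examples in this post.
business_units = TermSet("Business Units", is_open=False, terms=[
    Term("Human Resources", synonyms=["HR"]),
    Term("Information Technology", synonyms=["IT"]),
])
media_kinds = TermSet("Media Kind", is_open=True, terms=[Term("Photograph")])

# A synonym resolves to the same term (same ID) as the default label.
same_term = business_units.tag("HR") is business_units.tag("Human Resources")
# An open term set accepts a new term from any user.
added = media_kinds.tag("Diagram")
```

In this sketch, calling `business_units.tag("Legal")` raises `PermissionError` because the term set is closed, while the same call with `can_manage=True` adds the term.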
Group is a security term. With respect to managed metadata, a group is a set of term sets that share common security requirements. Only users who have contributor permissions for a specific group can manage term sets that belong to the group or create new term sets within it. Organizations should create groups for term sets that will have unique access or security needs.
Term Store Management Tool
The Term Store Management Tool is the tool that people who manage taxonomies use to create or manage term sets and the terms within them. The Term Store Management Tool displays all the global term sets and any local term sets available for the site collection from which you access the tool.
Managed Metadata column
A Managed Metadata column is a special kind of column that you can add to lists or libraries. It enables site users to select terms from a specific term set. A Managed Metadata column can be mapped to an existing term set, or you can create a local term set specifically for the column.
Enterprise Keywords column
The Enterprise Keywords column is a column that you can add to content types, lists, or libraries to enable users to tag items with words or phrases that they choose. By default, it is a multi-value column. When users type a word or phrase into the column, SharePoint presents type-ahead suggestions. Type-ahead suggestions might include items from managed term sets and the Keywords term set. Users can select an existing value, or enter a new term.
Social tags are words or phrases that site users can apply to content to help them categorize information in ways that are meaningful to them. Social tagging is useful because it helps site users to improve the discoverability of information on a site. Users can add social tags to information on a SharePoint site and to URLs outside a SharePoint site.
A social tag contains pointers to three types of information:
- A user identity
- An item URL
- A term
When you add a social tag to an item, you can specify whether you want to make your identity and the item URL private. However, the term part of the social tag is always public, because it is stored in the Term Store.
When you create a social tag, you can choose from a set of existing terms or enter something new. If you select an existing term, your social tag contains a pointer to that term.
If, instead, you enter a new term, SharePoint creates a new keyword for it in the keywords term set. The new social tag points to this term. In this manner, social tags support folksonomy-based tagging. Additionally, when users update an Enterprise Keywords or Managed Metadata column, SharePoint can create social tags automatically. These terms then become visible as tags in newsfeeds, tag clouds, or My Site profiles.
List or library owners can enable or disable metadata publishing by updating the Enterprise Metadata and Keywords Settings for a list or library.
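The structure of a social tag described above can be sketched as a small data type. This is only an illustration in Python, not how SharePoint stores tags: `SocialTag`, `public_view`, `create_social_tag`, and the in-memory `keywords_term_set` are hypothetical names, and the URL is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocialTag:
    """Illustrative only: the three pointers that a social tag holds."""
    user_id: str
    item_url: str
    term: str
    private: bool = False  # hides the user identity and item URL, never the term

    def public_view(self) -> dict:
        view = {"term": self.term}  # the term part is always public
        if not self.private:
            view["user_id"] = self.user_id
            view["item_url"] = self.item_url
        return view


keywords_term_set = {"budget", "proposal"}  # hypothetical existing keywords


def create_social_tag(user_id, item_url, text, private=False):
    # A new word or phrase becomes a keyword; an existing one is reused.
    keywords_term_set.add(text)
    return SocialTag(user_id, item_url, text, private)


tag = create_social_tag("alice", "https://example.com/docs/plan.docx", "roadmap", private=True)
```

Because the tag is private, `tag.public_view()` exposes only the term, and the new keyword "roadmap" is now part of the keywords set for other users to pick.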
In the second part of this post, I will describe managing SharePoint metadata.
Sunday, August 31, 2014
Defensible disposal of unstructured content is a key outcome of sound information governance programs. However, building a sound approach to records management as part of the organization’s information governance strategy is rife with challenges.
Some of the challenges are explosive content volumes, difficulty with accurately determining what content is a business record as opposed to transient or non-business related content, eroding IT budgets due to mounting storage costs, and the need to incorporate content from legacy systems or merger and acquisition activity.
Managing the retention and disposition of information reduces litigation risk, reduces discovery and storage costs, and helps organizations maintain regulatory compliance. To determine why content must be retained, for how long it must be retained, and when it can be disposed of, the content needs to be classified.
However, users see the process of sorting records from transient content as intrusive, complex, and counterproductive. On top of this, the popularity of mobile devices and social media applications has effectively fragmented content authoring and has eliminated any chance of building consistent classification tools into end-user applications.
If classification is not being carried out, there are serious implications when regulators or auditors ask the organization to provide reports to defend its records and retention management program.
Records managers also struggle to enforce policies that rely on manual, human-based classification. When classification is left to users, accuracy and consistency are often inadequate and the cost in lost productivity is high. These issues, in turn, increase business and legal risk and can quickly make the entire records management program unsustainable as it scales.
A solution to overcome this challenge is automatic classification. It eliminates the need for users to manually identify records and apply necessary classifications. By taking the burden of classification off the end-user, records managers can improve consistency of classification and better enforce rules and policies.
Auto-Classification makes it possible for records managers to easily demonstrate a defensible approach to classification based on statistically relevant sampling and quality control. Consequently, this minimizes the risk of regulatory fines and eDiscovery sanctions.
In short, it provides a non-intrusive solution that eliminates the need for business users to sort and classify a growing volume of low-touch content, such as email and social media, while offering records managers and the organization as a whole the ability to establish a highly defensible, completely transparent records management program as part of their broader information governance strategy.
Benefits of Automatic Classification for Information Governance
Applying records management classifications as a consistent, programmatic component of a sound information governance program helps to:
Reduce:
- Litigation risk
- Storage costs
- eDiscovery costs
Improve:
- User productivity and satisfaction
Address:
- The fundamental difficulties in applying classifications to high volume, low touch content such as legacy content, email and social media content.
- Records manager and compliance officer concerns about defensibility and transparency.
Features of an automatic classification solution typically include:
- Automated Classification: automate the classification of content in line with existing records management classifications.
- Advanced Techniques: classification process based on a hybrid approach that combines machine learning, rules, and content analytics.
- Flexible Classification: ability to define classification rules using keywords or metadata.
- Policy-Driven Configuration: ability to configure and optimize the classification process with an easy "step-by-step" tuning guide.
- Advanced Optimization Tools: reports make it easy to examine classification results, identify potential accuracy issues, and then fix those issues by leveraging the provided "optimization" hints.
- Sophisticated Relevancy and Accuracy Assurance: automatic sampling and benchmarking with a complete set of metrics to assess the quality of the classification process.
- Quality Assurance: advanced reports on a statistically relevant sample to review and code documents that have been automatically classified, so that the quality of the classification results can be assessed manually when desired.
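To illustrate two of the ideas above, keyword-based classification rules and sampling for quality assurance, here is a minimal Python sketch. It is not taken from any vendor product. The rule table `RULES`, the function names, and the sample documents are all assumptions of mine, and a real solution would combine such rules with machine learning and content analytics rather than rely on keyword counts alone.

```python
import random

# Hypothetical keyword rules: classification -> keywords that trigger it.
RULES = {
    "Contract": ["agreement", "indemnify", "termination clause"],
    "Invoice": ["invoice", "amount due", "remit"],
}


def classify(text: str) -> str:
    """Assign the classification whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unclassified"


def sample_for_review(results: list, sample_size: int, seed: int = 0) -> list:
    """Draw a random sample of classified items for manual quality review."""
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    return rng.sample(results, min(sample_size, len(results)))


def accuracy(reviewed: list) -> float:
    """Share of sampled items where the reviewer agreed with the classifier."""
    agreed = sum(1 for predicted, confirmed in reviewed if predicted == confirmed)
    return agreed / len(reviewed)


documents = [
    "Please remit the amount due on invoice 1042.",
    "This agreement includes a termination clause.",
    "Lunch menu for Friday.",
]
results = [(doc, classify(doc)) for doc in documents]
```

A records manager would review the items returned by `sample_for_review`, record the correct classification for each, and pass the (predicted, confirmed) pairs to `accuracy` to get a simple, reportable quality metric.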
Wednesday, August 13, 2014
The word "metadata" means "data about data". Metadata describes a context for objects of interest such as document files, images, audio and video files. It can also be called resource description. As a tradition, resource description dates back to the earliest archives and library catalogs. The modern "metadata" field that gave rise to Dublin Core and other recent standards emerged with the Web revolution of the mid-1990s.
The Dublin Core Schema is a small set of vocabulary terms that can be used to describe different resources.
Dublin Core Metadata may be used for multiple purposes, from simple resource description, to combining metadata vocabularies of different metadata standards, to providing inter-operability for metadata vocabularies in the Linked data cloud and Semantic web implementations.
"Dublin" refers to Dublin, Ohio, USA where the schema originated during the 1995 invitational OCLC/NCSA Metadata Workshop hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA). "Core" indicates that the metadata terms are broad and generic, and so usable for describing a wide range of resources. The semantics of Dublin Core were established and are maintained by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, museums, and other related fields of scholarship and practice.
The Dublin Core Metadata Initiative (DCMI) provides an open forum for the development of inter-operable online metadata standards for a broad range of purposes and of business models. DCMI's activities include consensus-driven working groups, global conferences and workshops, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.
In 2008, DCMI separated from OCLC and incorporated as an independent entity. Any and all changes that are made to the Dublin Core standard are reviewed by a DCMI Usage Board within the context of a DCMI Namespace Policy. This policy describes how terms are assigned and also sets limits on the amount of editorial changes allowed to the labels, definitions, and usage comments.
Levels of the Standard
The Dublin Core standard originally included two levels: Simple and Qualified. Simple Dublin Core comprised 15 elements; Qualified Dublin Core included three additional elements (Audience, Provenance and RightsHolder), as well as a group of element refinements (also called qualifiers) that could refine the semantics of the elements in ways that may be useful in resource discovery. Since 2012 the two have been incorporated into the DCMI Metadata Terms as a single set of terms using the Resource Description Framework (RDF).
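The original Dublin Core Metadata Element Set, which is the Simple level, consists of 15 metadata elements:
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights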
Each Dublin Core element is optional and may be repeated. The DCMI has established standard ways to refine elements and encourages the use of encoding and vocabulary schemes. There is no prescribed order in Dublin Core for presenting or using the elements. Dublin Core became the ISO 15836 standard in 2006 and is used as a base-level data element set for the description of learning resources in ISO/IEC 19788-2.
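As a small illustration of optional and repeatable elements, the following Python sketch serializes a few Dublin Core elements as XML using only the standard library. The namespace URI is the one published for the Dublin Core element set; the wrapper element name `metadata`, the function name `dublin_core_record`, and the sample values are my own assumptions, and XML is only one of several possible syntaxes.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Serialize {element: value or list of values} as a simple XML record."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for element, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:  # an element may be repeated
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record({
    "title": "Records Retention Policy",
    "creator": ["A. Author", "B. Author"],  # repeated element
    "date": "2014-08-13",
    "language": "en",
    # All other elements are simply omitted, because every element is optional.
})
```

The resulting string contains one `dc:title`, two `dc:creator` elements, and no entries for the elements that were left out.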
Qualified Dublin Core
Subsequent to the specification of the original 15 elements, an ongoing process began to develop terms extending or refining the Dublin Core Metadata Element Set (DCMES), and additional terms were identified. Element refinements make the meaning of an element narrower or more specific. A refined element shares the meaning of the unqualified element, but with a more restricted scope.
In addition to element refinements, Qualified Dublin Core includes a set of recommended encoding schemes, designed to aid in the interpretation of an element value. These schemes include controlled vocabularies and formal notations or parsing rules.
Syntax choices for Dublin Core metadata depend on a number of variables, and "one size fits all" forms rarely apply. When considering an appropriate syntax, it is important to note that Dublin Core concepts and semantics are designed to be syntax independent and are equally applicable in a variety of contexts, as long as the metadata is in a form suitable for interpretation both by machines and by human beings.
The Dublin Core Abstract Model provides a reference model against which particular Dublin Core encoding guidelines can be compared, independent of any particular encoding syntax. Such a reference model allows users to gain a better understanding of the descriptions they are trying to encode and facilitates the development of better mappings and translations between different syntaxes.
I will describe some applications of Dublin Core in my future posts.
Tuesday, July 22, 2014
During the last ten years the volume and diversity of digital information grew at unprecedented rates. The amount of information is doubling every 18 months, and unstructured information volumes grow six times faster than structured.
Big data is a major current trend. Big data has been defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Hadoop was created in 2005 by Doug Cutting and Mike Cafarella to address the big data issue. Doug Cutting named it after his son's toy elephant. It was originally developed for the Nutch search engine project. Nutch is an effort to build an open source web search engine based on Lucene and Java for the search and index component.
As of 2013, Hadoop adoption is widespread. For example, more than half of the Fortune 50 use Hadoop, and a number of companies offer commercial implementations or support for it. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing.
Ventana Research, a benchmark research and advisory services firm, published the results of its survey on enterprise adoption of Hadoop to manage big data. According to this survey:
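The idea of breaking a workload into blocks and processing them in parallel can be sketched on a single machine. The following Python example is not Hadoop code and does not use any Hadoop API. It is a simplified simulation of the map, shuffle, and reduce steps of the MapReduce model, with worker threads standing in for cluster nodes. All function names and the sample input are my own.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines: list, block_size: int) -> list:
    """Break a large input into smaller blocks, as a distributed file system would."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block: list) -> list:
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks: list) -> dict:
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped: dict) -> dict:
    """Reduce step: combine the values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


lines = ["big data", "Hadoop processes big data", "data blocks"]
blocks = split_into_blocks(lines, block_size=1)

# Worker threads stand in for cluster nodes; each one handles one block.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_block, blocks))

word_counts = reduce_counts(shuffle(mapped))
```

Each block is mapped independently, which is what allows a real cluster to spread the work across many machines and to rerun only the failed block if a node goes down.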
- More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
- More than twice as many Hadoop users as users of other platforms report being able to create new products and services and enjoy cost savings; over 82% benefit from faster analysis and better utilization of computing resources.
- 87% of Hadoop users are performing or planning new types of analysis with large scale data.
- 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.
- 63% of organizations use Hadoop in particular to work with unstructured data such as logs and event data.
- More than two-thirds of Hadoop users perform advanced analysis such as data mining or algorithm development and testing.
Today, Hadoop is being used as a:
- Staging layer: the most common use of Hadoop in enterprise environments is as “Hadoop ETL” — pre-processing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
- Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
- Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
In addition, most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce.
Hadoop is particularly useful when:
- Complex information processing is needed.
- Unstructured data needs to be turned into structured data.
- Queries can’t be reasonably expressed using SQL.
- Heavily recursive algorithms are involved.
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
- Machine learning is involved.
- Data sets are too large to fit into database RAM or disks, or require too many cores (tens of terabytes up to petabytes).
- Data value does not justify expense of constant real-time availability, such as archives or special interest information, which can be moved to Hadoop and remain available at lower cost.
- Results are not needed in real time.
- Fault tolerance is critical.
- Significant custom coding would be required to handle job scheduling.
Do Hadoop and Big Data Solve All Our Data Problems?
Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.
Monday, July 14, 2014
I received a few questions about my posts on automatic content classification. I would like to thank my blog readers for these questions. This post is a follow-up to them.
Organizations receive countless documents every day. These documents can be mail, invoices, faxes, or email. Even after organizations scan the paper documents, it is still difficult to manage and organize them.
To overcome the inefficiencies associated with paper and captured documents, companies should implement an intelligent classification system to organize captured documents.
With today’s document processing technology, organizations do not need to rely on manual classification or processing of documents. Organizations that replace manual sorting and classification with an automated document classification and processing system can realize a significant reduction in manual entry costs, and improve the speed and turnaround time for document processing.
Recent research has shown that two-thirds of organizations cannot access their information assets or find vital enterprise documents because of poor information classification or tagging. The survey suggests that much of the problem may be due to manual tagging of documents with metadata, which can be inconsistent and riddled with errors, if it has been done at all.
There are a few solutions for automated document classification and recognition. Some of them are: SmartLogic's Semaphore, OpenText, Interwoven Metatagger, Documentum, CVISION Trapeze, and others. These solutions enable organizations to organize, access, and control their enterprise information.
They are cost effective and eliminate the inconsistency, mistakes, and huge manpower costs associated with manual classification. Putting in place an effective and consistent automatic content classification system that ensures quick and easy retrieval of the right documents means better access to corporate knowledge, improved risk management and compliance, superior customer relationship management, enhanced findability for key audiences and an improved ability to monetize information.
Specific benefits of automatic content classification are:
More consistency. It produces the same unbiased results over and over. It might not always be 100% accurate or relevant, but if something goes wrong, it is at least easy to understand why.
Larger context. It enforces classification from the whole organization's perspective, not the individual's. For example, a person interested in sports might tag an article that mentions a specific player, but forget or not consider the team and country topics.
Persistent. A person can only handle a certain number of incoming documents per day, whilst an automatic classification works round the clock.
Cost effective. Possible to handle thousands of documents much faster than a person.
Automatic document classification can be divided into three types: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification, where only some of the documents are labeled by the external mechanism.
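To show what supervised document classification means in practice, here is a minimal Python sketch of a naive Bayes text classifier with add-one smoothing. Naive Bayes is my own choice of technique for the illustration and is not necessarily what the products mentioned above use. The class name, the training texts, and the labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text: str, label: str) -> None:
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text: str) -> str:
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayesClassifier()
# Labeled examples play the role of the external mechanism (human feedback).
classifier.train("invoice amount due payment", "Invoice")
classifier.train("payment overdue invoice number", "Invoice")
classifier.train("agreement parties termination clause", "Contract")
classifier.train("contract agreement signed parties", "Contract")

predicted = classifier.classify("payment for invoice")
```

With these four training examples, the text "payment for invoice" is assigned to "Invoice". In the semi-supervised case, a classifier trained this way on a small labeled set would then be used to label the remaining documents.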
Automate your content classification and consider using manual labor mainly for quality checking and approval of content.
In my next post on this topic, I will describe the role of automatic classification in records management and information governance.
Tuesday, June 10, 2014
In most organizations today, information is managed in isolated silos by independent teams using various tools for data quality, data integration, data governance, metadata and master data management, B2B data exchange, content management, database administration, information life-cycle management, and so on.
In response to this situation, some organizations are adopting Unified Data Management (UDM), a practice that holistically coordinates teams and integrates tools. Other common names for this practice include enterprise data management and enterprise information management.
Regardless of what you call it, the big picture that results from bringing diverse data disciplines together yields several benefits, such as cross-system data standards, cross-tool architectures, cross-team design and development synergies, leveraging data as an organizational asset, and assuring data’s integrity and lineage as it travels across multiple departments and technology platforms.
But unified data management is not purely an exercise in technology. Once data becomes an organizational asset, the ultimate goal of UDM is to achieve strategic, data-driven business objectives, such as fully informed operations and business intelligence, plus related goals in governance, compliance, business transformation, and business integration. In fact, the challenge of UDM is to balance its two important goals: uniting multiple data management practices and aligning them with business goals that depend on data for success.
What is UDM? It is the best practice for coordinating diverse data management disciplines, so that data is managed according to enterprise-wide goals that promote efficiency and support strategic, data-oriented business goals. UDM is the unification of both technology practices and business management.
For UDM to be considered successful, it should satisfy and balance the following requirements:
1. UDM must coordinate diverse data management areas. This is mostly about coordinating the development efforts of data management teams and enabling greater inter-operability among their participants. UDM may also involve the sharing or unifying of technical infrastructure and data architecture components that are relevant to data management. There are different ways to describe the resulting practice, and users who have achieved UDM call it a holistic, coordinated, collaborative, integrated, or unified practice. The point is that UDM practices must be inherently holistic if you are to improve and leverage data on a broad enterprise scale.
2. UDM must support strategic business objectives. For this to happen, business managers must first know their business goals, then communicate data-oriented requirements to data management professionals and their management. Ideally, the corporate business plan should include requirements and milestones for data management. Hence, although UDM is initially about coordinating data management functions, it should eventually lead to better alignment between data management work and information-driven business goals of the enterprise. When UDM supports strategic business goals, UDM itself becomes strategic.
UDM is largely about best practices from a user’s standpoint. Most UDM work involves collaboration among data management professionals of different specialties (such as data integration, quality, master data, etc.). The collaboration fosters cross-solution data and development standards, inter-operability of multiple data management solutions, and a bigger concept of data and data management architecture.
UDM is not a single type of vendor-supplied tool. Even so, a few leading software vendors are shaping their offerings into UDM platforms. Such a platform consists of a portfolio of multiple tools for multiple data management disciplines, the most common being BI, data quality, data integration, master data management, and data governance.
For the platform to be truly unified, all tools in the portfolio should share a common graphical user interface (GUI) for development and administration, servers should inter-operate in deployment, and all tools should share key development artifacts (such as metadata, master data, data profiles, data models, etc.). Having all these conditions is ideal, but not an absolute requirement.
UDM often starts with pairs of practices. It’s unlikely that any organization would want or need to coordinate 100% of its data management work via UDM or anything similar. Instead, organizations select combinations of data management practices whose coordination and collaboration will yield appreciable benefits.
The most common combinations are pairs, as with data integration and data quality or data governance and master data management. Over time, an organization may extend the reach of UDM by combining these pairs and adding in secondary, supporting data management disciplines, such as metadata management, data modeling, and data profiling. Hence, the scope of UDM tends to broaden over time into a more comprehensive enterprise practice.
A variety of organizational structures can support UDM. It can be a standalone program or a subset of larger programs for IT centralization and consolidation, IT-to-business alignment, data as an enterprise asset, and various types of business integration and business transformations. UDM can be overseen by a competency center, a data governance committee, a data stewardship program, or some other data-oriented organizational structure.
UDM is often executed purely within the scope of a program for BI and data warehousing (DW), but it may also reach into some or all operational data management disciplines (such as database administration, operational data integration, and enterprise data architecture).
UDM unifies many things. As its name suggests, it unifies disparate data disciplines and their technical solutions. On an organizational level, it also unifies the teams that design and deploy such solutions. The unification may simply involve greater collaboration among technical teams, or it may involve the consolidation of teams, perhaps into a data management competency center.
In terms of deployed solutions, unification means a certain amount of inter-operability among servers, and possibly integration of developer tool GUIs. Technology aside, UDM also forces a certain amount of unification among business people, as they come together to better define strategic business goals and their data requirements. When all goes well, a mature UDM effort unifies both technical and business teams through IT-to-business alignment.
Why Care About UDM?
There are many reasons why organizations need to step up their efforts with UDM:
Technology drivers. From a technology viewpoint, the lack of coordination among data management disciplines leads to redundant staffing and limited developer productivity. Even worse, competing data management solutions can inhibit data’s quality, integrity, consistency, standards, scalability, architecture, and so on. On the upside, UDM fosters greater developer productivity, cross-system data standards, cross-tool architectures, and cross-team design and development synergies, and it assures data’s integrity and lineage as data travels across multiple organizations and technology platforms.
Business drivers. From a business viewpoint, data-driven business initiatives (including BI, CRM, regulatory compliance, and business operations) suffer due to low data quality and incomplete information, inconsistent data definitions, non-compliant data, and uncontrolled data usage. UDM helps avoid these problems, plus it enables big picture data-driven business methods such as data governance, data security and privacy, operational excellence, better decision making, and leveraging data as an organizational asset.
To be successful, an organization needs a data strategy that integrates a multitude of sources, case studies, deployment technologies, regulatory and best practices guidelines, and any other operating parameters.
Many organizations have begun to think about the contributions of customer information in a holistic way rather than taking a fragmented approach. In the past, these efforts were undertaken mainly by organizations whose business models traditionally relied on data (like finance and publishing), but today, as more and more businesses recognize the benefits of a comprehensive approach, these strategies are gaining much wider deployment.
What should a good strategy look like?
First, identify all data assets according to their background and potential contribution to the enterprise.
Second, outline a set of data use cases that will show how information will support any of a variety of customer marketing functions.
Next, create rules and guidelines for responsible use and access, making sure that the process is flexible and transparent. Keep in mind that not all data should be treated the same way; rather, it should be managed according to its sensitivity and need.
Finally, make sure that this process is ongoing so that tactics can be evaluated and adjusted as needed.
Such a strategy combines best practices with responsible data governance and smart organization. Everyone wins: the employees who gain quick access to essential information, the enterprise that runs more smoothly, and of course the customers who are served by a resource-rich organization!
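To make the four strategy steps above a little more concrete, here is a minimal Python sketch of how they might be recorded: an asset inventory, a list of use cases, access rules based on sensitivity, and a simple check that could be rerun as part of an ongoing review. Every name in it (the asset names, use cases, roles, and the three sensitivity tiers) is hypothetical and chosen only for illustration; a real organization would define its own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Illustrative sensitivity tiers; higher values mean stricter handling."""
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class DataAsset:
    name: str
    source: str               # background: where the data comes from
    sensitivity: Sensitivity  # drives how the asset is managed


# Step 1: identify data assets (hypothetical examples).
ASSETS = {
    asset.name: asset
    for asset in [
        DataAsset("product_catalog", "merchandising", Sensitivity.PUBLIC),
        DataAsset("newsletter_signups", "web forms", Sensitivity.INTERNAL),
        DataAsset("purchase_history", "order system", Sensitivity.RESTRICTED),
    ]
}

# Step 2: outline use cases and the assets each one depends on.
USE_CASES = {
    "email_campaign": ["newsletter_signups", "product_catalog"],
    "loyalty_offers": ["purchase_history", "newsletter_signups"],
}

# Step 3: access rules, expressed as the highest tier each role may read.
ROLE_CLEARANCE = {
    "marketing_analyst": Sensitivity.INTERNAL,
    "data_steward": Sensitivity.RESTRICTED,
}


def can_access(role: str, asset_name: str) -> bool:
    """Allow access only when the role's clearance covers the asset's tier."""
    clearance = ROLE_CLEARANCE.get(role)
    asset = ASSETS.get(asset_name)
    if clearance is None or asset is None:
        return False  # unknown roles and unknown assets are denied
    return asset.sensitivity <= clearance


def blocked_assets(role: str, use_case: str) -> list[str]:
    """Step 4: list the assets a use case needs that the role cannot read."""
    return [name for name in USE_CASES[use_case] if not can_access(role, name)]
```

Rerunning a check such as `blocked_assets("marketing_analyst", "loyalty_offers")` whenever assets, use cases, or roles change is one simple way to keep the process ongoing: it shows where a use case asks for data that the current rules do not permit, so either the rules or the use case can be adjusted.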