
IBM Content Harvester

A tool that enables you to unleash the information in a collection of unstructured, formatted documents that follow a similar pattern and make that information available for publishing in any open format.

Date Posted: March 17, 2009


What is IBM Content Harvester?

We all have unstructured documents created using tools like Microsoft Word, whose content we'd like to retrieve in a seamless fashion. A large subset of such documents follows an underlying structure whether created by individuals, such as resumes and personnel evaluations, or by teams, such as project documentation.

Today's knowledge workers need to access, apply, and reuse content created with office productivity tools such as word processors, spreadsheets, and presentation software. While the productivity suite revolution of the 1990s freed individuals' creativity to generate content, it has become increasingly difficult to manage individuals' creations effectively and harvest them into knowledge of the whole. One cannot, for example, mine a project document repository to get a summary of the status of work items. Both keyword search and social tagging fall short of the functionality required to harvest and distill content for reuse.

A document created with a word processor is a collection of character sequences and embedded objects interspersed with formatting information, which makes the content difficult to access. Even so, a large subset of such documents follows an underlying structure, whether the document is a resume created by an individual or project documentation created by a team.

IBM Content Harvester™ allows you to harvest such unstructured, formatted documents by using rules. In the rules, you specify the regions of content that are of interest in terms of textual markers, the tag to assign to the extracted content, and the terms to cleanse from the extracted content. The extracted content is then cleansed and tagged. The resulting output is an XML file, which can be queried using XQuery for any assigned tag and published in any open format, such as HTML, using XSL transforms.
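
To make the idea of a rule concrete, here is a minimal sketch in Python. It is not the tool's actual rule syntax or API, which this page does not show. The Rule class, the harvest function, and the sample resume are all hypothetical, and the sketch assumes that a region of interest is simply the text between a start marker and an end marker.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: the text between two markers receives one tag."""
    start_marker: str
    end_marker: str  # empty string means "up to the end of the document"
    tag: str
    cleanse_terms: list = field(default_factory=list)


def harvest(text, rules):
    """Apply each rule to the text and return the result as an XML element."""
    root = ET.Element("document")
    for rule in rules:
        start = text.find(rule.start_marker)
        if start == -1:
            continue  # the marker does not occur in this document
        start += len(rule.start_marker)
        end = text.find(rule.end_marker, start) if rule.end_marker else -1
        content = text[start:end] if end != -1 else text[start:]
        # Cleansing: delete every listed term, ignoring case.
        for term in rule.cleanse_terms:
            content = re.sub(re.escape(term), "", content, flags=re.IGNORECASE)
        # Collapse the whitespace left behind by extraction and cleansing.
        content = " ".join(content.split())
        ET.SubElement(root, rule.tag).text = content
    return root


# Made-up sample document and rules, for illustration only.
resume = """Name: Jane Doe
Education: B.Sc. in Physics, Example University
Experience: Five years at Acme Corp (confidential) building data tools
"""
rules = [
    Rule("Education:", "Experience:", "education"),
    Rule("Experience:", "", "experience",
         cleanse_terms=["Acme Corp", "(confidential)"]),
]
xml_root = harvest(resume, rules)
print(ET.tostring(xml_root, encoding="unicode"))
```

Each rule carries the three pieces of information described above: the markers that bound a region, the tag to assign, and the terms to cleanse.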

How does it work?

IBM Content Harvester uses an innovative two-stage technique to extract structure and content from the document while overcoming formatting variations. First, the tool automatically reads a collection of documents and identifies text markers which may demarcate content of interest. Your specification of text markers is used to refine the content of interest and provide tag names. The structure of the content, whether a consecutive stream of words, a list of sentences, or a table, can be extracted along with the content.
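
The page does not say how the tool identifies text markers automatically. One plausible heuristic, shown here purely as an assumption, is to treat short lines that recur across most documents in a collection as candidate markers. The function name candidate_markers, its thresholds, and the sample documents are hypothetical.

```python
from collections import Counter


def candidate_markers(documents, min_fraction=0.8, max_words=4):
    """Return short lines that recur in at least min_fraction of the documents."""
    counts = Counter()
    for doc in documents:
        # Use a set so that each distinct line counts once per document.
        lines = {line.strip() for line in doc.splitlines()}
        counts.update(
            line for line in lines
            if line and len(line.split()) <= max_words
        )
    threshold = min_fraction * len(documents)
    return sorted(line for line, n in counts.items() if n >= threshold)


# Made-up project status documents that share a common layout.
docs = [
    "Project Status\nOn track\nOpen Issues\nNone",
    "Project Status\nDelayed by two weeks\nOpen Issues\nVendor contract pending",
    "Project Status\nOn track\nOpen Issues\nTest environment unavailable",
]
markers = candidate_markers(docs)
print(markers)
```

A user could then keep, rename, or discard the suggested markers, which corresponds to the refinement step described above.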

Then the content can be stripped of sensitive information using cleansing rules. The resulting content is tagged with default or user-provided tags and stored in XML. You can publish the XML in any open format, such as HTML, using XSL transforms, and query it using XQuery. There are also APIs available to work with the extracted content programmatically. The toolkit provides support for all these features.
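
As an illustration of working with the tagged XML programmatically, the sketch below queries harvested content by tag and renders the result as HTML. It does not use the toolkit's APIs, which are not documented on this page. The Python standard library has no XQuery or XSLT processor, so the sketch substitutes ElementTree's path lookups for XQuery and builds the HTML directly instead of applying an XSL transform. The XML layout and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the harvested output: one element per document,
# with one child element per assigned tag.
harvested = ET.fromstring(
    "<collection>"
    "<document id='d1'><status>On track</status><owner>Team A</owner></document>"
    "<document id='d2'><status>Delayed</status><owner>Team B</owner></document>"
    "</collection>"
)


def values_for_tag(root, tag):
    """Return (document id, text) pairs for every element with the given tag."""
    return [
        (doc.get("id"), element.text)
        for doc in root.findall("document")
        for element in doc.findall(tag)
    ]


def to_html_table(rows, header):
    """Render the rows as a small HTML table."""
    table = ET.Element("table")
    head = ET.SubElement(table, "tr")
    for name in header:
        ET.SubElement(head, "th").text = name
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for cell in row:
            ET.SubElement(tr, "td").text = cell
    return ET.tostring(table, encoding="unicode", method="html")


status_rows = values_for_tag(harvested, "status")
html = to_html_table(status_rows, ["Document", "Status"])
print(html)
```

A query of this kind yields the sort of summary of work item status across a project repository that keyword search alone cannot provide.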

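In summary, the process is: identify text markers across the document collection, refine them with your own marker specifications and tag names, extract the content along with its structure, cleanse it using rules, tag it and store it as XML, and then query or publish the XML.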

About the technology author(s)

The seed of the IBM Content Harvester technology was planted when the authors participated in a series of projects on improving the consumption of assets from previous engagements. The authors felt that separating content from presentation elements while maintaining inherent information about structure would make the assets highly consumable. The resulting information has to be cleansed and kept in a form that can be efficiently queried based on user-defined tags. This work was done while the authors were co-located at the IBM Thomas J. Watson Research Center. The authors continue to work on different types of assets.

Biplav Srivastava is a Research Staff Member at the IBM India Research Laboratory in New Delhi, which he joined in February 2001. His research interests are planning, scheduling, policies, learning, and information management, and their practical applications in services: packaged middleware like SAP and Oracle; general software (web services); infrastructure; semantic web, autonomic computing, and societal domains. He holds a Ph.D. degree from Arizona State University, USA, and has more than 60 papers and three patents issued worldwide. He is active in the AI and Services professional communities.

Yuan-Chi Chang is an IBM researcher working in the information management area. He contributed to the IBM Content Harvester effort to reclaim the unstructured content in IT service engagement documents for repurposing and reuse. His expertise includes data warehousing and data integration. He holds a Ph.D. degree from the University of California, Berkeley and has more than 20 patents issued worldwide.

Swaroop Chalasani is a software engineer at IBM India Software Labs, Bangalore. He has about nine years of experience developing various analytical applications using the IBM WebSphere stack. He has developed tools that extend the Eclipse platform for SAP. Most recently, he has been working on InfoSphere solutions.

Sridhar Maradugu is a principal software engineer in the IBM Research Model Driven Business Transformation team. He joined IBM in 2004 and worked on WebSphere Commerce Server and On Demand Electronics Contracting Systems in the IBM Software Laboratory. He is a Sun certified Java programmer.
