One of the world’s leading biotechnology companies was evaluating commercial enterprise search products. Auto-categorization was a key product requirement, but the company lacked the in-house expertise to evaluate the available commercial search products. It wanted to compare the auto-categorization results from the search products under consideration with those from a leading text analysis product. The company also wanted to reuse the auto-categorization methodology to classify new documents in the future.
Approach
Iknow used the SAP BusinessObjects Text Analysis Suite, with its superior entity extraction and categorization capabilities, to analyze the company’s content. Iknow was selected because of its deep technical knowledge and experience with many of the leading enterprise search and text analysis products.
Iknow was given more than 50,000 scientific and technical documents drawn from three sources: the Defense Technical Information Center (DTIC), the U.S. Department of Energy (DOE), and the U.S. National Library of Medicine’s PubMed database. Each of the three data sets included a source-specific taxonomy, the full-text documents, and the document abstracts. The input data exceeded 75 GB.
The processing and analysis were performed using SAP BusinessObjects Text Analysis XI 3.0, with the embedded Oracle XE database, Categorizer Workbench, and ThingFinder Workbench tools. The Categorizer Workbench provides an editorial environment for creating and maintaining taxonomies and contains both a learn-by-example (LBE) algorithm and a rules-based engine. The ThingFinder Workbench provides advanced text analysis that automatically identifies and extracts entities from any text data source.
Fifteen separate analyses were performed on the 50,000-plus document dataset, including various taxonomy creation and auto-categorization tasks. The Categorizer Workbench was able to classify the content into the PubMed taxonomy and a proprietary taxonomy with greater than 95 percent accuracy. The Categorizer LBE algorithm automatically generated a categorization rule set that could be reused, which met the company’s requirement for a reusable methodology.
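The comparison described above comes down to measuring how often two sets of category assignments agree, whether that is one product against a reference taxonomy (accuracy) or one product against another. The following is a minimal sketch of that measurement in Python, not the method Iknow or the SAP tools used. The function name, document IDs, and category labels are all hypothetical, and it assumes each document receives exactly one category.

```python
from typing import Dict


def agreement_rate(labels_a: Dict[str, str], labels_b: Dict[str, str]) -> float:
    """Return the fraction of shared documents given the same category in both sets.

    Assumes one category per document; only documents present in both
    sets are compared.
    """
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("the two label sets have no documents in common")
    matches = sum(1 for doc_id in shared if labels_a[doc_id] == labels_b[doc_id])
    return matches / len(shared)


# Hypothetical example data: document ID -> assigned category.
reference = {
    "doc1": "Oncology",
    "doc2": "Genetics",
    "doc3": "Immunology",
    "doc4": "Oncology",
}
text_analysis = {
    "doc1": "Oncology",
    "doc2": "Genetics",
    "doc3": "Immunology",
    "doc4": "Genetics",
}
search_engine = {
    "doc1": "Oncology",
    "doc2": "Immunology",
    "doc3": "Immunology",
    "doc4": "Genetics",
}

# Accuracy of each product against the reference taxonomy labels.
text_analysis_accuracy = agreement_rate(text_analysis, reference)
search_engine_accuracy = agreement_rate(search_engine, reference)

# Direct agreement between the two products.
product_agreement = agreement_rate(search_engine, text_analysis)

print(f"text analysis accuracy: {text_analysis_accuracy:.2f}")
print(f"search engine accuracy: {search_engine_accuracy:.2f}")
print(f"product agreement:      {product_agreement:.2f}")
```

With real data, the same function could be run once per taxonomy and per product, which gives a simple side-by-side view of where the products diverge.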
Results
Iknow provided all of the information the biotechnology company requested, and the company made an informed purchase of a new enterprise software product.
Iknow also recommended that the company purchase a text analysis product and integrate it with the enterprise search tool to create an end-to-end automated content acquisition, tagging, and indexing process. The text analysis software would enhance the enterprise search tool by providing entity extraction, automated summarization, and auto-categorization functionality.