Bioinformatics Preprint 03-014
litsift: Automated Text Categorization in Bibliographic Search
Lukas C. Faulstich, Peter F. Stadler, Caroline Thurner, Christina Witwer
Data Mining and Text Mining for Bioinformatics (ECML/PKDD 2003)
In bioinformatics, some research topics cannot be uniquely characterized by a set of keywords because relevant keywords are (i) also heavily used in other contexts and (ii) often omitted from relevant documents because the context is clear to the target audience. Information retrieval interfaces such as Entrez/PubMed yield either low precision or low recall in such cases. To achieve high recall at reasonable precision, the results of a broad information retrieval search must be filtered to remove irrelevant documents. We use automated text categorization for this purpose.
In this study we use the topic of conserved secondary RNA structures in viral genomes as a running example. PubMed result sets for two virus groups, Picornaviridae and Flaviviridae, were manually labeled by human experts. We evaluated various classifiers from the Weka toolkit together with different feature selection methods to assess whether classifiers trained on documents dedicated to one virus group can be successfully applied to filter documents on other virus groups.
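The cross-group evaluation described above can be illustrated with a minimal sketch. It is not the setup used in the study, which relied on classifiers from the Weka toolkit; here a small multinomial naive Bayes filter written with the Python standard library stands in for them. The function names and the toy documents are hypothetical and serve only to show the idea: train on labeled documents for one virus group, then measure precision and recall on documents for another group.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def train_nb(docs):
    # docs: list of (text, label) pairs; label True means "relevant".
    # Assumes both classes occur at least once in the training data.
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def predict(model, text):
    # Multinomial naive Bayes with add-one smoothing; words that were
    # never seen during training are ignored.
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return scores[True] > scores[False]


def precision_recall(model, docs):
    # Precision and recall for the "relevant" class.
    tp = fp = fn = 0
    for text, label in docs:
        predicted = predict(model, text)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy corpora, invented for illustration only.
train_docs = [  # stands in for a labeled Picornaviridae result set
    ("conserved RNA secondary structure in poliovirus genome", True),
    ("IRES element folding in picornavirus RNA", True),
    ("poliovirus vaccine trial in children", False),
    ("clinical outbreak of enterovirus infection", False),
]
test_docs = [  # stands in for a labeled Flaviviridae result set
    ("conserved RNA structure in hepatitis C virus genome", True),
    ("dengue vaccine clinical trial", False),
]

model = train_nb(train_docs)
precision, recall = precision_recall(model, test_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With real PubMed result sets, the same pattern applies: the filter is judged by how much of the relevant literature it retains (recall) and how much irrelevant material it lets through (precision) on a virus group it was not trained on.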
Our results indicate that in this domain a bibliographic search tool trained on a reference corpus may significantly reduce the amount of time needed for extensive literature searches.
Keywords: Automated Text Categorization, Document Filtering
Last modified: 2003-06-14 00:07:30 studla