Bioinformatics Preprint 03-014
Title:
litsift: Automated Text Categorization in Bibliographic Search
Author(s):
Lukas C. Faulstich,
Peter F. Stadler,
Caroline Thurner,
Christina Witwer
Accepted for:
Data Mining and Text Mining for Bioinformatics (ECML/PKDD 2003)
Abstract:
In bioinformatics, some research topics cannot be uniquely
characterized by a set of keywords, because the relevant keywords are (i)
also heavily used in other contexts and (ii) often omitted from relevant
documents, since the context is clear to the target
audience. For such topics, information retrieval interfaces such as
Entrez/PubMed yield either low precision or low
recall. To achieve high recall at reasonable precision,
the results of a broad information retrieval search have to be filtered
to remove irrelevant documents. We use automated text categorization for
this purpose.
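
As a minimal sketch of the precision/recall trade-off described above, the following Python snippet computes both measures for a broad search result and for a filtered one. The document IDs and the function name `precision_recall` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs, for illustration only.
relevant = {1, 2, 3, 4}
broad_search = {1, 2, 3, 4, 5, 6, 7, 8}  # broad query: all relevant hits plus noise
filtered = {1, 2, 3, 5}                  # the same result set after an assumed filter

print(precision_recall(broad_search, relevant))  # (0.5, 1.0)
print(precision_recall(filtered, relevant))      # (0.75, 0.75)
```

In this toy case the filter raises precision at the cost of some recall, which is the balance the filtering step has to strike.
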
In this study we use the topic of conserved RNA secondary structures in
viral genomes as a running example. PubMed result sets for two
virus groups, Picornaviridae and Flaviviridae, were
manually labeled by human experts. We evaluated various classifiers from
the Weka toolkit, combined with different feature selection
methods, to assess whether classifiers trained on documents about
one virus group can be successfully applied to filter documents on other
virus groups.
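
The cross-group setup described above (train on labeled documents for one virus group, filter documents for another) can be sketched as follows. This is not the Weka pipeline used in the study: it is a plain-Python multinomial naive Bayes classifier with Laplace smoothing, used here as a stand-in, and it omits feature selection. The example texts, labels, and all function names are invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(docs):
    """Train on a list of (text, label) pairs; return the model as a tuple."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in docs:
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words unseen during training are ignored
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training texts standing in for labeled documents on one virus group.
train_docs = [
    ("conserved rna secondary structure in genome", "relevant"),
    ("rna structure element cre predicted folding", "relevant"),
    ("vaccine trial clinical outcome patients", "irrelevant"),
    ("epidemiology outbreak patients survey", "irrelevant"),
]
# Invented texts standing in for search results on a different virus group.
test_docs = [
    ("conserved rna structure in untranslated region", "relevant"),
    ("clinical survey of infected patients", "irrelevant"),
]

model = train_nb(train_docs)
predictions = [predict(model, text) for text, _ in test_docs]
print(predictions)  # ['relevant', 'irrelevant']
```

Comparing `predictions` with the expert labels of the second group gives the kind of transfer evaluation the abstract refers to.
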
Our results indicate that, in this domain, a
bibliographic search tool trained on a reference corpus may
significantly reduce the time needed for extensive
literature searches.
Keywords:
Automated Text Categorization,
Document Filtering
Last modified: 2003-06-14 00:07:30 studla