UIMA-Based Content Search

Posted on: 2009-01-02  Degree: Master  Type: Thesis
Country: China  Candidate: C Yang  Full Text: PDF
GTID: 2208360245961347  Subject: Software engineering
Abstract/Summary:
With the rapid development and spread of the Internet, the volume of electronic text has grown enormously. Organizing and processing large amounts of document data, and finding the information a user is interested in quickly, accurately and completely, is a major challenge for information science and technology. Text classification is a key technology for organizing and processing large document collections: it largely resolves the problem of disordered information and makes it easier for users to find what they need quickly. This thesis addresses text classification in the context of search engines.

The thesis approaches text classification from a different angle, building on the Unstructured Information Management Architecture (UIMA). UIMA is a software architecture that specifies component interfaces, data representations, design patterns and development roles for creating, describing, discovering, composing and deploying multi-modal analysis capabilities. The Analysis Engine, Type System, Annotation, CAS (Common Analysis System) and JCas are its basic analysis concepts. Text analysis takes place at two levels: document level and collection level. During training, a set of analysis engines is invoked, and the extracted features and their weights are stored as annotations in the CAS, through which they can also be accessed.

Text categorization proceeds in two phases. In the training phase, the text is first pre-processed; features are then extracted and selected using cross entropy as the evaluation measure, and the classifier is built with a Naive Bayes model. In the testing phase, UIMA simplifies the development and deployment of the document analysis system and provides components for semantic search and text mining, such as the Annotator, CAS and Analysis Engine. The trained classifier is then used to categorize the text. Finally, a confusion matrix is used to evaluate the precision of the classification. In the experiment, the classification accuracy is around 85%.
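
The two-phase process described above can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use UIMA. It assumes plain Python, a common document-frequency form of expected cross entropy for scoring features, a multinomial Naive Bayes model with add-one smoothing, and a tiny hypothetical corpus (the `train` and `test` lists and all function names are invented for illustration).

```python
import math
from collections import Counter, defaultdict

def expected_cross_entropy(docs):
    """Score each term t as P(t) * sum_c P(c|t) * log(P(c|t) / P(c)),
    estimated from document frequencies (an assumed, common formulation)."""
    n = len(docs)
    class_docs = Counter(label for _, label in docs)
    df = Counter()
    df_class = defaultdict(Counter)
    for tokens, label in docs:
        for t in set(tokens):
            df[t] += 1
            df_class[t][label] += 1
    scores = {}
    for t, n_t in df.items():
        total = 0.0
        for c, n_c in class_docs.items():
            p_c_given_t = df_class[t][c] / n_t
            if p_c_given_t > 0:
                total += p_c_given_t * math.log(p_c_given_t / (n_c / n))
        scores[t] = (n_t / n) * total
    return scores

def train_naive_bayes(docs):
    """Collect class priors, per-class token counts and the vocabulary."""
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    priors = {c: k / n for c, k in class_docs.items()}
    return priors, token_counts, vocab

def classify(tokens, priors, token_counts, vocab):
    """Return the class with the highest log posterior (add-one smoothing)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        counts = token_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # tokens unseen in training are ignored
                score += math.log((counts[t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def confusion_matrix(pairs, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in pairs:
        matrix[index[actual]][index[predicted]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of documents on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Hypothetical toy corpus, already tokenized (stands in for pre-processing).
train = [
    ("goal match team score".split(), "sports"),
    ("team win league match".split(), "sports"),
    ("stock market price fall".split(), "finance"),
    ("bank market rate price".split(), "finance"),
]
test = [
    ("team score goal".split(), "sports"),
    ("market price rate".split(), "finance"),
]

scores = expected_cross_entropy(train)
priors, token_counts, vocab = train_naive_bayes(train)
labels = ["sports", "finance"]
pairs = [(label, classify(tokens, priors, token_counts, vocab))
         for tokens, label in test]
matrix = confusion_matrix(pairs, labels)
print(matrix, accuracy(matrix))
```

In a fuller pipeline the cross-entropy scores would be used to keep only the highest-scoring terms before training; here they are only computed, to keep the sketch short. The toy corpus is far too small to say anything about the roughly 85% accuracy reported in the abstract.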
Keywords/Search Tags: Text Classification, UIMA, Naïve Bayes Model, Analysis Engine, Annotation