UIMA-based Content Search

Posted on:2009-01-02

Degree:Master

Type:Thesis

Country:China

Candidate:C Yang

Full Text:PDF

GTID:2208360245961347

Subject:Software engineering

Abstract/Summary:

With the rapid development and spread of the Internet, the volume of electronic text has grown enormously. Organizing and processing large amounts of document data, and finding the information a user is interested in quickly, accurately and completely, is a major challenge for information science and technology. As a key technology for organizing and processing huge amounts of document data, text classification can largely solve the problem of information disorder and makes it easier for users to find the information they need quickly. In the context of search engines, this thesis focuses on text classification. It approaches text classification from a different angle, building on the Unstructured Information Management Architecture (UIMA). UIMA is a software architecture that specifies component interfaces, data representations, design patterns and development roles for creating, describing, discovering, composing and deploying multi-modal analysis capabilities. The Analysis Engine, Type System, Annotation, CAS (Common Analysis Structure) and JCas are its basic analysis concepts. Text analysis takes place at two levels: document level and collection level. During training, a set of analysis engines is invoked, and the features and their weights are stored in annotations through the CAS, through which they can also be accessed. Text categorization has two phases. In the training phase, the text is first pre-processed; feature extraction and selection follow, using cross entropy as the evaluation method, and the classifier is then built with the Naive Bayes model. In the testing phase, UIMA simplifies system development and deployment for document analysis and provides the related components for semantic search and text mining, such as the Annotator, CAS and Analysis Engine. The trained classifier is then used to categorize the text. Finally, a confusion matrix is used to evaluate the precision of the classification. In the experiments, the classification accuracy is around 85%.
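
The training and testing phases described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the thesis implementation and does not use UIMA: the function names, the toy documents and the choice to keep four features are all hypothetical. The scoring formula is one common form of expected cross entropy, P(t) * sum over classes of P(c|t) * log(P(c|t) / P(c)), and the classifier is a multinomial Naive Bayes with add-one smoothing. The abstract names the methods but not these details, so both are assumptions.

```python
import math
from collections import Counter, defaultdict


def expected_cross_entropy(docs, labels):
    """Score each term t by P(t) * sum_c P(c|t) * log(P(c|t) / P(c)).

    Probabilities are estimated from document frequencies (an assumption).
    """
    n_docs = len(docs)
    class_count = Counter(labels)
    doc_freq = Counter()                      # documents containing the term
    doc_freq_by_class = defaultdict(Counter)  # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            doc_freq_by_class[term][label] += 1
    scores = {}
    for term, df in doc_freq.items():
        total = 0.0
        for label, n_c in class_count.items():
            p_c_given_t = doc_freq_by_class[term][label] / df
            if p_c_given_t > 0:
                total += p_c_given_t * math.log(p_c_given_t / (n_c / n_docs))
        scores[term] = (df / n_docs) * total
    return scores


def train_naive_bayes(docs, labels, vocabulary):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    class_count = Counter(labels)
    term_count = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_count[label].update(t for t in tokens if t in vocabulary)
    model = {}
    for label, n_c in class_count.items():
        total = sum(term_count[label].values()) + len(vocabulary)
        log_likelihood = {
            t: math.log((term_count[label][t] + 1) / total) for t in vocabulary
        }
        model[label] = (math.log(n_c / len(docs)), log_likelihood)
    return model


def classify(model, tokens):
    """Return the class with the highest log posterior; unknown terms are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[t] for t in tokens if t in log_likelihood)
    return max(model, key=score)


def confusion_matrix(true_labels, predicted_labels):
    """matrix[truth][predicted] counts how often each pair occurs."""
    matrix = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[truth][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of documents on the diagonal of the confusion matrix."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Toy, already tokenized documents (hypothetical data, not from the thesis).
train_docs = [
    ["goal", "match", "team"],
    ["team", "win", "match"],
    ["stock", "market", "price"],
    ["market", "bank", "price"],
]
train_labels = ["sports", "sports", "finance", "finance"]
test_docs = [["team", "match", "goal"], ["bank", "price", "stock"]]
test_labels = ["sports", "finance"]

# Training phase: select the four highest-scoring terms, then fit the classifier.
scores = expected_cross_entropy(train_docs, train_labels)
vocabulary = set(sorted(scores, key=scores.get, reverse=True)[:4])
model = train_naive_bayes(train_docs, train_labels, vocabulary)

# Testing phase: classify unseen documents and evaluate with a confusion matrix.
predictions = [classify(model, doc) for doc in test_docs]
matrix = confusion_matrix(test_labels, predictions)
print(predictions, accuracy(matrix))
```

In a UIMA pipeline, tokenization and feature extraction would typically run inside annotators that write their results to the CAS. The sketch skips that layer and works on plain token lists to keep the classification logic visible.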

Keywords/Search Tags:

Text Classification, UIMA, Naïve Bayes Model, Analysis Engine, Annotation

Related items

1	Research On The Methods Of Chinese Text Classification Using Bayes And Language Model
2	A Research On Automatic Web Text Classification Technology
3	The Study Of Chinese Text Categorization Based On Naïve Bayes
4	The Study Of Key Technologies For Chinese Domain-Oriented Search Engine
5	Research On Text Classification Algorithm Based On Naive Bayes Method
6	Text Categorization Based On Naive Bayes Method
7	Research On Text Classification Algorithm Based On Map Reduce Model
8	Research On Improved Multinomial Naive Bayes Text Classification Algorithms
9	Correlation Between The Text Classification. Word
10	Research And Implementation Of Text Classification Technology Based On Bayesian Theory