Given Field Oriented Text Content Recognition And Categorization System

Posted on:2006-08-06

Degree:Master

Type:Thesis

Country:China

Candidate:J Z Chu

GTID:2168360155962596

Subject:Control theory and control engineering

Abstract/Summary:

With the rapid development of the Internet, more and more information is available on the network, but retrieving the useful part from this mass of information remains a problem. Most information on the Internet exists as text, so text content recognition is the primary step toward efficient information retrieval. This thesis designs a Given Field Oriented Text Content Recognition System. The system separates the texts of interest from a large collection of texts, then categorizes them and creates an abstract for each one. This can improve the efficiency of information extraction remarkably.

The features of texts in the given field must be taken into account, and it is equally important to take the features of texts from other fields into account. Considering the text features of all fields increases the distance between the feature patterns of different fields and optimizes their probability distributions. The Vector Space Model (VSM), which has proven to be an effective text representation model, is used in this thesis. Criteria based on geometric distance and posterior probability are introduced. According to these criteria, a feature extraction method based on the concept of entropy and direct selection methods for feature selection are presented.

Chinese word segmentation is the foundation of text content recognition technology. Word segmentation algorithms and their implementation approaches are analyzed in detail, and the advantages and disadvantages of these algorithms are given. Some difficult problems in word segmentation are also discussed. An enhanced full word index dictionary structure is introduced, which improves the speed of the segmentation algorithm dramatically. Two methods are introduced to extract and select text features based on a classifiability criterion. One is a suboptimal direct search algorithm. The other is the standard TF-IDF formula, which is in common use in Chinese information processing.

There are few theories on how to determine the weights of features. Usually the weights are determined from statistics of the training samples and from the characteristics of the Chinese language itself. Besides the common approaches, a new one is introduced for Chinese information processing, in which the weights of features are determined by word length. It rests on the property that longer words are composed of shorter words.

As an example, the whole process of recognizing communication-field texts is given.
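
The abstract names the Vector Space Model and the standard TF-IDF formula but does not give the exact weighting used in the thesis. The following is only a minimal sketch of one common textbook variant, assuming texts have already been segmented into words, tf taken as relative frequency, and idf taken as log(N / df) without smoothing. The function names `tf_idf_vectors` and `cosine` and the sample documents are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Map each segmented document (a list of words) to a sparse
    TF-IDF vector, represented as a dict from word to weight.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))

    vectors = []
    for words in docs:
        counts = Counter(words)
        total = len(words)
        vec = {}
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already-segmented sample texts.
docs = [
    ["network", "signal", "antenna"],
    ["signal", "modulation", "network"],
    ["stock", "market", "price"],
]
vectors = tf_idf_vectors(docs)
# The two communication-like texts share weighted terms; the third shares none.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

In this sketch, a text could be assigned to the given field by comparing its vector with a field prototype vector and applying a similarity threshold. That decision rule is an assumption made here for illustration, not something stated in the abstract.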

Keywords/Search Tags:

Text recognition, Communication, Text categorization, Chinese segmentation, Vector Space Model

Related items

1	Research On Chinese Text Categorization Algorithms Based On Technology Text
2	Study On Text Category Oriented Chinese Text Mining And Its Implementation
3	The Research And Implementation Of Chinese Text Categorization
4	Research And Implementation Of Text Categorization System Based On VSM
5	Modeling And Implementation Of Chinese Text Categorization System Based On SVM
6	Chinese Text Data Classification
7	The Studies On Chinese Text Categorization Based On PSO And SVM
8	Application Of Rough Set Theory In Chinese Text Categorization
9	The Research And Implementation Of Chinese Text Categorization System
10	Research Of Text Categorization Based On Vector Space Model