
Given Field Oriented Text Content Recognition And Categorization System

Posted on: 2006-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Chu
GTID: 2168360155962596
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid development of the Internet, more and more information is available online, but retrieving the useful part from this mass of information remains a problem. Most information on the Internet exists as text, so text content recognition is the first step toward efficient information retrieval. This thesis designs a given-field-oriented text content recognition system. The system separates the texts of interest from a large collection of texts. It can also categorize those texts and create an abstract for each one, which markedly improves the efficiency of information extraction.

The features of texts in the given field must be taken into account, and so must the features of texts from other fields. Considering text features from all fields increases the distance between the feature patterns of different fields and optimizes their probability distributions. The Vector Space Model (VSM), which has been shown to be a good text representation model, is used in this thesis. Criteria based on geometric distance and posterior probability are introduced. Following these criteria, a feature extraction method based on the concept of entropy and direct selection methods for feature selection are presented.

Chinese word segmentation is the foundation of text content recognition. Word segmentation algorithms and their implementation approaches are analyzed in detail, and the advantages and disadvantages of each are given. Some difficult problems in word segmentation are also discussed. An enhanced full-word index dictionary structure is introduced, which dramatically improves the speed of the segmentation algorithm. Two methods are introduced to extract and select text features based on the classifiability criterion. One is a suboptimal direct search algorithm. The other is the standard TF-IDF formula, which is in common use in Chinese information processing.

There are few theories on how to determine feature weights. Usually the weights are determined from statistics of the training samples and from characteristics of the Chinese language itself. Besides these common approaches, a new one is introduced for Chinese information processing: feature weights are determined by word length, based on the characteristic that longer words are composed of shorter words.

As an example, the whole process of recognizing communication-field texts is given.
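The abstract names the Vector Space Model and the standard TF-IDF formula but does not reproduce either. The following is a minimal Python sketch of that general technique, not the thesis's implementation. It assumes texts have already been segmented into word lists, uses the common variant tf × log(N/df), and compares vectors by cosine similarity; the function names `tfidf_vectors` and `cosine`, and the sample documents, are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document (a list of already-segmented words) to a sparse
    {term: weight} vector, using weight = tf * log(N / df).
    This TF-IDF variant is an assumption; the abstract does not specify one."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample: two communication-field texts and one unrelated text.
docs = [
    ["network", "signal", "antenna"],
    ["network", "signal", "router"],
    ["recipe", "flour", "oven"],
]
vecs = tfidf_vectors(docs)
```

In this sketch, a term that appears in every document receives weight zero, so only terms that help distinguish fields contribute to the distance between text vectors.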
Keywords/Search Tags:Text recognition, Communication, Text categorization, Chinese segmentation, Vector Space Model