The Research On Several Key Techniques In Text Information Processing

Posted on: 2007-08-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y B Xiong
Full Text: PDF
GTID: 1118360212484756
Subject: Computer software and theory
Abstract/Summary:
With the arrival of the information era and the spread of the Internet, the volume of text information is growing rapidly. The Internet holds billions of web pages and thousands upon thousands of terabytes of data, and millions of web updates occur every day. This makes information abundant but disorganized. Organizing and managing this information efficiently, and retrieving what users need quickly, completely and accurately, is a major challenge.

Text is a basic and common type of information. Building on text information retrieval models, the dissertation investigates key techniques of text information processing: text categorization, text clustering and approximate query processing. Text categorization and text clustering are two core techniques for organizing and managing text data. Approximate query processing is used to retrieve the needed information quickly and is an important technique for handling large-scale datasets.

The main investigations of the dissertation are as follows:

(1) Technical Basis of Text Information Processing. This covers document models, word segmentation, feature selection, text categorization and text clustering. The dissertation briefly introduces the Set Model, Algebraic Model, Probabilistic Model and Concept Model; analyzes the main problems and methods in Chinese word segmentation; describes document features and feature selection; and describes text categorization and text clustering in detail, summarizing some important typical algorithms for each.

(2) Construction of Hierarchical Structures Based on the Confusion Matrix. In the information era, the scale and complexity of document collections make it necessary to categorize documents hierarchically. The dissertation presents two strategies for constructing a hierarchical structure from the confusion matrix, which records the statistics of a flat classifier's error probabilities. One is hierarchical clustering.
The other is confusion classification. Hierarchical clustering uses an agglomerative algorithm: each sample initially forms its own class, and pairs of classes are then merged according to their similarity or distance until only one large class remains. Confusion classification builds the hierarchy according to whether the confusion probability between classes exceeds a given threshold t. Detailed algorithms for both techniques are presented. Finally, experiments compare the performance of the two techniques for hierarchical categorization. The results show that confusion classification outperforms hierarchical clustering and that it can improve the precision and recall of a flat document classifier.

(3) Document Genre Classification Based on Feature Sentiment. Document genre describes not the concrete content of a document but its style. Genre cuts across topic: documents on the same topic can differ in writing style, and documents of the same genre can cover different topics. Document genre classification is becoming increasingly important in information retrieval, information filtering, detection of reactionary information and monitoring of public opinion on the Internet. To classify documents as positive or negative, the dissertation presents a technique called sentiment categorization, which is based on the sentiment of document features. Sentiment categorization does not differ in essence from topic-based categorization and can be treated as an ordinary two-class document categorization problem. It is therefore vital to select sentiment features and determine their sentiment orientation.
The dissertation mainly investigates the selection of sentiment features, the determination of feature sentiment orientation and the computation of feature sentiment weights, and proposes several typical methods. Finally, a prototype system is developed and compared with traditional text categorization and with categorization based on semantic patterns. The experimental results show that sentiment categorization performs worse than both and that categorization based on semantic patterns performs best. However, sentiment categorization does not require labeled training samples and does not require building an independent classifier for each topic. It is therefore more general, and it classifies much faster than the other two methods.

(4) Approximate Query Processing Based on the Wavelet Transform. A conventional Decision Support System (DSS) returns an exact answer to the query a user submits, and executing the query can take a long time. This is a typical black-box pattern. However, today's DSS applications, OnLine Analytical Processing (OLAP) and online aggregation do not need an exact result but do demand fast response. Approximate query processing is one solution. Wavelets have proved highly efficient for hierarchical decomposition, and a wavelet transform can compress GB- or TB-scale data to the MB scale. Using this compression mechanism and building on previous work, the dissertation describes algorithms for operations such as Union, Difference and Update. These operations are carried out at the level of the wavelet synopsis, which is a compressed representation of the original data. Finally, experiments show that for union and difference operations, wavelet synopses are more accurate than random sampling, and that when the volume of updates is not too large, directly updating the wavelet synopsis is almost as good as using the optimally selected wavelet synopsis.
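
To make the wavelet synopsis idea in item (4) concrete, the following is a minimal Python sketch, not the dissertation's algorithms. It assumes a one-dimensional frequency vector whose length is a power of two, uses the unnormalized Haar transform (pairwise averages and differences), and keeps the coefficients with the largest energy as the synopsis. All function names are hypothetical. The `combine` function treats union and difference as adding or subtracting frequency vectors, which works at the coefficient level only because the Haar transform is linear; the abstract does not say how its Union and Difference operations are defined, so this reading is an assumption.

```python
import math


def haar_decompose(data):
    """Unnormalized Haar transform; len(data) must be a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    current = list(data)
    coeffs = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = details + coeffs  # coarser levels end up in front
        current = averages
    return current + coeffs


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, det in zip(current, details):
            expanded.extend([avg + det, avg - det])
        pos += len(current)
        current = expanded
    return current


def _support(index, n):
    """Number of data points that the coefficient at `index` influences."""
    level = 0 if index == 0 else index.bit_length() - 1
    return n >> level


def build_synopsis(data, budget):
    """Keep at most `budget` coefficients, ranked by their L2 energy."""
    coeffs = haar_decompose(data)
    n = len(coeffs)
    ranked = sorted(
        range(n),
        key=lambda i: abs(coeffs[i]) * math.sqrt(_support(i, n)),
        reverse=True,
    )
    return {i: coeffs[i] for i in ranked[:budget] if coeffs[i] != 0}


def expand(synopsis, n):
    """Approximate the original length-n vector from a synopsis."""
    coeffs = [0.0] * n
    for i, c in synopsis.items():
        coeffs[i] = c
    return haar_reconstruct(coeffs)


def combine(syn_a, syn_b, sign=1):
    """Add (sign=1) or subtract (sign=-1) two synopses coefficient-wise.

    Assumption: union/difference are modeled on frequency vectors.
    """
    out = dict(syn_a)
    for i, c in syn_b.items():
        out[i] = out.get(i, 0.0) + sign * c
    return out


def approx_range_sum(synopsis, n, lo, hi):
    """Approximate SUM over positions lo..hi-1 using only the synopsis."""
    return sum(expand(synopsis, n)[lo:hi])
```

With a budget equal to the vector length the synopsis is lossless, so `expand(build_synopsis(data, len(data)), len(data))` returns the original values. Smaller budgets trade accuracy for space, which is the trade-off that approximate query processing exploits.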
Keywords/Search Tags: text information retrieval model, text categorization, text clustering, query processing, confusion matrix, genre categorization, wavelet transformation