
Research On Classification Of Chinese Documents

Posted on: 2008-01-21  Degree: Master  Type: Thesis
Country: China  Candidate: J H Wang  Full Text: PDF
GTID: 2178360242960116  Subject: Software engineering
Abstract/Summary:
Information on the Internet is growing very quickly, and its total volume is already enormous. On the one hand this brings Internet users convenience and opportunity; on the other hand it leaves them adrift in an "information ocean", where it is difficult to find the resources they need quickly and efficiently. How to process this text efficiently has therefore become an important research topic. Because most information on the Internet takes the form of text, text recognition is the basis of efficient access to information. Text classification can turn a huge volume of text data that lacks organizational structure into standardized text data, helping people retrieve and use information more efficiently. Once texts have been identified automatically, they can be broken down further and a summary can be generated for each one, which greatly improves the efficiency of information access. A reliable, easy-to-use text classification system that identifies and classifies texts automatically by their content can greatly reduce the human cost of organizing text and help users find the information they need quickly.

This paper describes the steps of text categorization in detail, mainly covering the types of text classification, text representation, Chinese word segmentation, feature selection algorithms, feature weighting algorithms, the various text classification algorithms, and related technologies. Several of these key techniques are analyzed in depth, a series of experiments and tests is carried out, and the experimental results are analyzed and compared.

(1) Types of text classification
A text classification system can be built either with automatic statistical methods or through a knowledge engineering approach.
The knowledge engineering approach requires knowledge engineers to write a large number of inference rules by hand, so its development cost is high. By contrast, statistical methods based on machine learning are fast and automatic; they have gradually replaced the knowledge engineering approach, become the mainstream of text classification technology, and achieved good results. A computer does not have human intelligence and cannot readily "understand" an article. To process text by computer, the text must first be converted into a form the computer can handle. The existing text representation models are the Boolean model, the vector space model, and the probability model. This paper uses the popular vector space model.

(2) Chinese word segmentation
This study deals with Chinese text. Because Chinese and Western texts differ, research on classifying Chinese text has much in common with research on classifying Western text but also has its own characteristics. Written Chinese uses the character as its smallest unit: a text is a sequence of characters with no spaces marking the boundaries between words. Automatic Chinese word segmentation is therefore a basic technology of Chinese information processing. Automatic classification of Chinese text is built on automatic word segmentation, and segmenting a text is the process that determines its feature set. The system must run efficiently, yet Chinese word segmentation processes large amounts of data and takes a long time. To resolve this conflict, this paper adopts a two-tier segmentation dictionary consisting of an index table and a second table, which greatly reduces lookup time and improves segmentation efficiency.

(3) Feature selection algorithms
This paper presents document frequency (DF), mutual information (MI), information gain (IG), expected cross entropy (CE), and the chi-square statistic (CHI). Of these, DF, MI, and IG are the most commonly used, but all three have certain shortcomings.
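As a rough illustration of two of the scores named above, the following sketch computes DF and CHI over a tiny pre-segmented corpus. The corpus, the function names, and the use of the standard 2x2 contingency-table form of the chi-square statistic are assumptions made for this example; the thesis does not give its exact formulas here, and the sketch does not implement the proposed DFR method.

```python
from collections import Counter

# Hypothetical toy corpus: (already segmented tokens, category label).
TOY_DOCS = [
    (["match", "team", "goal"], "sports"),
    (["team", "coach", "match"], "sports"),
    (["stock", "market", "team"], "finance"),
    (["stock", "bank", "market"], "finance"),
]

def document_frequency(docs):
    """DF(t): number of documents that contain term t at least once."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set() so repeats inside one document count once
    return df

def chi_square(docs, term, category):
    """CHI(t, c) computed from the 2x2 term/category contingency table."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator

if __name__ == "__main__":
    df = document_frequency(TOY_DOCS)
    for term in ("team", "match"):
        print(term, df[term], chi_square(TOY_DOCS, term, "sports"))
```

In this toy data "team" has the highest document frequency (3 of 4 documents) but a lower CHI score for the "sports" category than "match", which appears only in sports documents.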
DF considers only how often a feature occurs and relies too heavily on high-frequency words. IG takes into account both the case where a word t is present and the case where it is absent. The absence of a word can help determine a text's category, but that contribution is often far smaller than the interference it introduces. MI tends to favor low-frequency words. This paper proposes a new feature selection method, DFR. Like DF, it is simple and has low complexity, but it removes DF's excessive reliance on high-frequency words and its suppression of low-frequency words.

(4) Feature weighting algorithms
This paper introduces four feature weighting algorithms: Boolean weight, TF weight, IDF weight, and TFIDF weight. The one in general use today is TFIDF (term frequency times inverse document frequency), which combines the TF and IDF weights and is currently the most widely used feature weighting algorithm. It takes into account both the frequency of a word within a text and the number of texts in which the word appears, which makes it more reasonable than the TF, IDF, and Boolean weights.

(5) Text classification algorithms and experimental analysis
This paper describes the principles of the classification algorithms commonly used in text classification, including the Bayesian algorithm, the vector distance measure algorithm, decision trees, KNN, support vector machines, neural networks, and the class-center method. The traditional KNN algorithm and the support vector machine algorithm are compared experimentally. The results show that the support vector machine classifies text better than traditional KNN. The experiments also show, however, that the support vector machine is slower and has higher time complexity.
Therefore, when the accuracy requirement is modest, a traditional classification algorithm can be used; when higher classification quality is required, a support vector machine can be used. When a support vector machine is used for text classification, feature extraction can combine word frequency with other methods to reduce the dimensionality of the feature space and the complexity of classification without greatly affecting classification quality.

(6) Experimental classification system
This paper uses a Chinese text classification system with training and classification functions. It includes word segmentation, feature selection, classification algorithms, and other components, and serves as a comprehensive test platform for text classification. Experiments show that the features chosen by the DFR feature selection algorithm distinguish categories better than those chosen by the existing DF, MI, IG, and CHI feature selection algorithms. To expand the training set with unlabeled texts, this paper proposes an iterative TFIDF algorithm, iTFIDF. The algorithm selects unlabeled texts batch by batch and adds the most representative samples to the classifier model so as to lower the model error; this gives the next iteration a good initial value and helps avoid local optima. Experiments show that when the training set is small, using a certain number of unlabeled texts with the incremental iTFIDF algorithm improves the classifier's performance more effectively.
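The TFIDF weight from section (4) can be sketched as below. This is a minimal illustration under stated assumptions: it uses one common variant, raw term count multiplied by log(N / df), and the abstract does not say which variant or normalization the thesis uses. The function name and the toy documents are hypothetical, and the sketch does not implement the iterative iTFIDF algorithm.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight(t, d) = tf(t, d) * log(N / df(t)), where
    tf is the raw count of t in d, N is the number of documents, and
    df(t) is the number of documents that contain t.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors

if __name__ == "__main__":
    # Hypothetical, already segmented documents.
    toy_docs = [
        ["stock", "market", "stock"],
        ["match", "team"],
        ["market", "team", "team"],
    ]
    for vector in tfidf_vectors(toy_docs):
        print(vector)
```

Under this variant a term that occurs in every document receives weight 0, while a term that is frequent in one document and rare elsewhere receives a high weight, which is the behavior that makes TFIDF more discriminating than TF or IDF alone.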
Keywords/Search Tags: Classification