Research On Core Technology Of The Chinese Text Classification

Posted on: 2012-05-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y C Xie
GTID: 2218330338473122  Subject: Computer software and theory
Abstract/Summary:
Over the past two decades, rapid advances in science and technology have greatly improved people's ability to generate and collect data, and data volumes have grown rapidly. People are lost when facing such masses of data: the limitations of traditional statistical analysis techniques, together with the complexity, heterogeneity and dynamic nature of the data, make the useful knowledge hidden in it difficult to find, giving rise to the phenomenon of "data explosion but knowledge scarcity". People hope for new tools that can automatically analyze and organize large, complex data and then find valuable information to give decision-makers the support they need. These factors led to the emergence of data mining techniques. Data mining, also known as Knowledge Discovery in Databases, is a process that extracts potential, valid, novel, useful, ultimately understandable and applicable knowledge from vast, incomplete, noisy and fuzzy data. It is an interdisciplinary subject involving database systems, computation theory, artificial intelligence, statistical theory, pattern recognition, machine learning and cognitive science, and it can perform association analysis, classification, clustering, forecasting, outlier detection and evolution analysis. Although many issues in data mining technology remain unresolved, its broad application prospects and great commercial appeal have attracted great enthusiasm and attention from scholars and industry.

Text classification is the process of assigning a text, based on its content, to one or more predefined categories by means of a text classification algorithm. It is a key technology for processing and organizing large amounts of text data.
Text classification addresses the problem of information clutter and supports efficient management and effective use of information, and it has become an important research field within data mining.

At present, research on text classification focuses mainly on text representation, feature selection and classification algorithms. Text representation is the first step and takes one of two forms: the vector space model or the statistical language model. Feature selection means selecting, from a high-dimensional feature space, the feature words that best represent the characteristics of a text. A good feature selection method reduces the dimension of the feature space, which improves the efficiency of text classification, and also improves classification accuracy by removing uninformative feature words. The classification algorithm then classifies texts using the selected feature words. This thesis studies text representation, feature selection and classification algorithms.

(1) Text representation. Text representation is the form in which a text is expressed to the computer before analysis, and it is a key technology of text preprocessing. We drew on existing text preprocessing work and studied word segmentation, in particular its bottleneck, disambiguation. This thesis gives a new algorithm for resolving combinational ambiguity based on co-occurrence support. Its basic idea is to examine the support that co-occurring words in the text give to the different candidate segmentations, construct a support formula, and then eliminate the ambiguity.

(2) Feature selection. Feature selection is a technique for selecting representative feature words from a high-dimensional feature space. This thesis studied seven feature selection techniques, including term frequency and inverse document frequency (TF-IDF), mutual information and information gain.
Because these methods rest on different selection rules, they may score the same feature very differently. To overcome the shortcomings of a single method, this thesis considers combining the seven feature selection methods. The experimental results show that the combination of the seven methods performs better than a single method. In addition, we introduce modified versions of TF-IDF and mutual information.

(3) Classification algorithm. The quality of the classification algorithm determines the ultimate effectiveness of text classification. After feature selection, we propose a new text classification algorithm based on soft set theory. It integrates EIBA and CHI to select features, places the selected feature vectors into a soft set table, constructs the soft set comparison table, and then assigns the text to a category. Comparing the new algorithm with the κNN and Naive Bayes algorithms, the experimental results show that the new algorithm is effective.
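The idea in (2) of combining feature selection methods that score the same word differently can be illustrated with a minimal sketch. The abstract does not say how the seven scores were combined, so everything below is an assumption made for illustration: only two of the named scorers are used (CHI and pointwise mutual information, in their textbook forms rather than the thesis's modified versions), each score is min-max normalised, and the combined score is a plain average. The function names and the toy documents are hypothetical.

```python
import math


def contingency(docs, term, label):
    """Count documents by (contains term?, belongs to label?)."""
    a = b = c = d = 0
    for tokens, doc_label in docs:
        has_term = term in tokens
        in_class = doc_label == label
        if has_term and in_class:
            a += 1          # term present, in category
        elif has_term:
            b += 1          # term present, other category
        elif in_class:
            c += 1          # term absent, in category
        else:
            d += 1          # term absent, other category
    return a, b, c, d


def chi_square(a, b, c, d):
    """Textbook CHI statistic for one term/category pair."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Pointwise mutual information between a term and a category."""
    if a == 0:
        return float("-inf")  # term never occurs in this category
    n = a + b + c + d
    return math.log(a * n / ((a + c) * (a + b)))


def score_terms(docs, score_fn):
    """Score every term by its best value over all categories."""
    labels = {label for _, label in docs}
    vocab = {t for tokens, _ in docs for t in tokens}
    return {
        t: max(score_fn(*contingency(docs, t, lab)) for lab in labels)
        for t in vocab
    }


def min_max(scores):
    """Rescale scores to [0, 1] so different scorers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {t: 0.0 for t in scores}
    return {t: (s - lo) / (hi - lo) for t, s in scores.items()}


def combine(docs, score_fns):
    """Assumed combination rule: average of the normalised scores."""
    normalised = [min_max(score_terms(docs, fn)) for fn in score_fns]
    return {
        t: sum(n[t] for n in normalised) / len(normalised)
        for t in normalised[0]
    }


# Toy corpus of already-segmented documents (hypothetical data).
docs = [
    ({"football", "match", "team"}, "sports"),
    ({"team", "goal", "match"}, "sports"),
    ({"stock", "market", "team"}, "finance"),
    ({"market", "bank", "stock"}, "finance"),
]
combined = combine(docs, [chi_square, mutual_information])
top_terms = sorted(combined, key=combined.get, reverse=True)[:3]
```

On this toy corpus the two scorers disagree about "football", which occurs in one document only: mutual information rates it as highly as "match", while CHI rates it lowest. Averaging the normalised scores places it between the words that both scorers favour and "team", which occurs in both categories and ends up with the lowest combined score.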
Keywords/Search Tags:Data Mining, Chinese Text Classification, Word Segmentation, Feature Selection, Classification Algorithm