
Research On Key Problems In Text Mining Based On Cloud Method

Posted on: 2012-01-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Dai
Full Text: PDF
GTID: 1488303389465914
Subject: Computer Science and Technology

Abstract/Summary:
Text mining (TM) is the process of discovering potentially valuable knowledge in text, such as its information structure, models, and patterns. It draws on data mining, pattern recognition, information retrieval, natural language processing, and other fields. Because text is the main medium for storing information, TM is becoming increasingly important.

Current TM research is still dominated by traditional data mining methods. As the research deepens, however, applying these methods becomes increasingly difficult. The high dimensionality and sparsity of text objects, the high complexity of the algorithms, the need for prior knowledge, and similar difficulties have seriously hampered the development of TM.

In the final analysis, these problems stem from the uncertainty of natural language. The uncertainty of natural language, and of text in particular, comes in essence from the uncertainty of human thinking. It gives people richer room for understanding and cognition, but it also creates a series of problems for TM. Therefore, if we can build on existing technologies to reduce the complexity of natural language and develop a novel uncertainty-oriented artificial intelligence approach for TM, it would greatly facilitate the development of TM.

The cloud model is an important tool in research on uncertain knowledge. Because it converts efficiently between qualitative concepts and quantitative data, this work introduces the cloud model to the key problems of TM. The main contributions are as follows.

(1) Extension of cloud model theory to TM. This part covers text knowledge representation, conversion of the corresponding model to a common physical space, and similarity measures for text concepts.
It comprises the following three aspects.

1) Text information table based on the VSM. The information table of a knowledge representation system is introduced into text representation. On this basis, a text system is expressed as a text information table based on the vector space model (VSM).

2) Conversion of the text information table based on the cloud model. When the cloud model is used to handle the uncertain relations between texts, the values of every attribute must lie in the same domain; that is, the different attribute values of a text must have the same physical meaning. The attributes of an existing text information table, however, have different intrinsic meanings, and their values differ widely, so they need to be converted to a unified physical space. A transformation algorithm for the text information table is proposed using probabilistic and statistical methods. The algorithm converts the attributes to a unified physical space, and the converted values reflect their probability distribution.

3) Text similarity measure based on cloud similarity. Cosine similarity is a commonly used measure of similarity between texts in text mining. Yet every such measure relies on strict matching of object attributes, so it fails to consider the text object as a whole. By combining the overall distribution with the individual characteristics of a text object, a novel cloud similarity is proposed. It is based on the vector of cloud digital characteristics, which describes the text as a whole. With cloud similarity, the similarity between texts is converted to the similarity between cloud vectors.
This measure not only improves mining performance and quickly identifies common features, but also fully accounts for the randomness and fuzziness of the attribute values.

(2) Automatic text feature selection algorithm based on the cloud model (FAS). Feature selection is an effective way to reduce the size of the text feature space, and several effective methods have been developed. To determine the optimal number of features, however, these methods rely mainly on observation or experience. In this dissertation, a high-performance automatic feature selection algorithm (FAS) is proposed that combines the overall and local distributions of features across categories. With FAS, the feature set is obtained automatically. In addition, FAS uses cloud model theory to correct the distribution of features. Analysis and experiments on open data show that the selected feature set contains fewer features and gives better classification performance than existing methods.

(3) Text classifier based on cloud concept jumping up (CCJU). Exploiting the cloud model's efficient conversion between qualitative concepts and quantitative data, the concept extraction method of the cloud model is applied to text classification. After the text collection is converted to a VSM-based text information table, the qualitative text concepts extracted from each category are raised to a higher level (concept jumping up). The cloud similarity between a test text and each category is then compared, and the test text is assigned to the most similar category.
Comparison of text classifiers under different feature selection methods shows that CCJU adapts well to different text features and also outperforms traditional classifiers.

(4) Rapid unsupervised text clustering based on cloud similarity (CS-Means). To address the shortcomings of existing text clustering algorithms, a rapid unsupervised text clustering algorithm based on cloud similarity is proposed. After the texts are preprocessed with the FAS algorithm, a gradual approximation strategy built on the k-Means clustering algorithm is used to obtain the optimal number of clusters k. The search for k is itself the automatic clustering process. During this process, the digital characteristics of each text cloud vector are extracted first, and the cloud similarity is then used to measure the similarity between texts. The algorithm avoids the difficulties caused by the high dimensionality and sparsity of text objects while retaining the high performance of k-Means. The gradual approximation strategy also solves the problem of how to choose the number of clusters, so the clustering results better match the distribution of the texts.
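The idea of comparing texts through cloud vectors, and of assigning a test text to the most similar category, can be illustrated with a minimal Python sketch. It assumes the normal cloud model, in which a concept is summarized by three digital characteristics (expectation Ex, entropy En, hyper-entropy He) estimated with the standard backward cloud generator. The cosine between (Ex, En, He) vectors is used here only as a stand-in similarity, since the abstract does not give the dissertation's actual formulas. The function names, the pooling of a category's attribute values, and the toy data are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A cloud is summarized by its digital characteristics (Ex, En, He).
Cloud = Tuple[float, float, float]


def backward_cloud(values: Sequence[float]) -> Cloud:
    """Estimate (Ex, En, He) from samples with the standard backward cloud generator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    ex = sum(values) / n
    # Entropy from the first absolute central moment.
    en = math.sqrt(math.pi / 2) * sum(abs(v - ex) for v in values) / n
    variance = sum((v - ex) ** 2 for v in values) / (n - 1)
    # Assumption: clamp at zero when the sample variance is below En^2.
    he = math.sqrt(max(variance - en ** 2, 0.0))
    return ex, en, he


def cloud_similarity(a: Cloud, b: Cloud) -> float:
    """Stand-in cloud similarity: cosine of the two (Ex, En, He) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text_row: Sequence[float],
             training: Dict[str, List[Sequence[float]]]) -> str:
    """Assign a text to the category whose cloud is most similar to the text's cloud."""
    text_cloud = backward_cloud(text_row)
    scores = {}
    for label, rows in training.items():
        # Assumption: a category concept is built by pooling the attribute
        # values of all its training texts (already in a common domain).
        pooled = [v for row in rows for v in row]
        scores[label] = cloud_similarity(text_cloud, backward_cloud(pooled))
    return max(scores, key=scores.get)


# Hypothetical rows of a converted text information table, grouped by category.
TRAINING = {
    "sports": [[0.9, 0.8, 0.1, 0.0], [0.8, 0.9, 0.0, 0.1]],
    "finance": [[0.3, 0.3, 0.3, 0.3], [0.25, 0.3, 0.3, 0.25]],
}

if __name__ == "__main__":
    print(backward_cloud([0.85, 0.9, 0.05, 0.1]))
    print(classify([0.85, 0.9, 0.05, 0.1], TRAINING))
```

Because each text is reduced to a three-component vector, the comparison does not depend on the dimensionality of the feature space, which is the property the abstract relies on for clustering sparse, high-dimensional text. The sketch deliberately ignores which attributes carry which values, so it should be read as an illustration of the cloud-vector idea rather than a usable classifier.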
Keywords/Search Tags: text mining, cloud model, text cloud similarity, text feature selection, text classification and clustering