Some Issues Of Text Mining For Network Information

Posted on:2016-06-20

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Q M Cao

Full Text:PDF

GTID:1108330503453414

Subject:Control Science and Engineering

Abstract/Summary:


Given the massive volume and high dimensionality of text information, building effective and scalable text mining algorithms is an important research direction in data mining. To address this, the dissertation studies several basic problems of text mining, covering five main aspects.

1. The traditional vector space model (VSM) is high-dimensional and cannot handle synonymy and polysemy, so a feature-cluster-based vector space model is proposed. Each feature is first represented as a vector, the features are then clustered, and each cluster is treated as a single feature. In addition, discontinuous proper-noun phrases are identified during text preprocessing, which makes the feature information in the vector space model richer and more accurate. The method both reduces the dimension effectively and highlights the semantic features of the text, and so can improve the quality of text mining. Experimental results show that it achieves a larger feature reduction rate and that clustering with it performs better than with the traditional VSM.

2. The traditional K-means algorithm selects its initial center points randomly, which tends to make text clustering results unstable. To address this problem, an improved K-means algorithm based on the similarity matrix is proposed. Instead of choosing initial center points at random, it selects them deliberately using the similarity matrix. This gives the clustering process a good starting point and reduces the fluctuation in results caused by their strong dependence on the initial points, yielding better clustering quality. Experimental results show that the F-measure of the improved K-means algorithm is substantially higher and that the clustering results are more stable.

3. A semi-supervised K-means algorithm is proposed to cope with the shortage of labeled data on some new social network sites. The method uses both labeled and unlabeled data, making full use of the labeled data to help tag the unlabeled data. As initial center points it selects the class centers of the labeled data from the different categories, together with some unlabeled data points far from those centers, which ensures that the initial center points belong to different clusters and so leads to better results. Experimental results show that the algorithm is effective and alleviates the problem of insufficient labeled data to a certain extent.

4. Imbalance in training datasets is common and decreases classification accuracy. To solve the class imbalance problem, a mixture weighted KNN algorithm is proposed. Based on the imbalance between classes, the algorithm assigns each training sample an inverse proportion weight, which makes the neighbors of a test sample independent of the class imbalance. It then combines this with a distance weight, which gives greater weight to training samples closer to the test sample. Experimental results show that the algorithm achieves better classification accuracy and is an effective way to handle training dataset imbalance.

5. To improve operating efficiency and handle massive datasets easily, the proposed text clustering and classification methods are parallelized based on the MapReduce model. In addition, each method is integrated as a module into a complete text mining system, which automates the whole text mining process. Experimental results demonstrate that this approach greatly improves operating efficiency without sacrificing accuracy.
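
The weighting idea in item 4 can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the function name `mixture_weighted_knn`, the class weight (total samples divided by class size), the distance weight (inverse Euclidean distance), and the choice to multiply the two. The dissertation's actual algorithm may differ.

```python
import math
from collections import Counter, defaultdict


def mixture_weighted_knn(train_X, train_y, query, k=3, eps=1e-9):
    """Classify `query` by KNN votes weighted by class rarity and closeness.

    Assumed for illustration (not taken from the dissertation):
    class weight = n_samples / class_size, distance weight = 1 / (d + eps),
    and the two weights are multiplied.
    """
    # Inverse-proportion weight: samples from rarer classes count for more.
    class_counts = Counter(train_y)
    n = len(train_y)
    class_weight = {c: n / count for c, count in class_counts.items()}

    # Euclidean distance from the query to every training sample,
    # sorted so the nearest samples come first.
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])

    # Each of the k nearest neighbours votes with its combined weight.
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += class_weight[label] / (d + eps)
    return max(votes, key=votes.get)


# Toy 1-D data: class "a" has six samples, class "b" only two.
X = [(0,), (1,), (2,), (3,), (4,), (5,), (8,), (9,)]
y = ["a", "a", "a", "a", "a", "a", "b", "b"]
label_boundary = mixture_weighted_knn(X, y, (6,), k=5)
label_inside = mixture_weighted_knn(X, y, (1,), k=3)
```

In this toy data, three of the five nearest neighbours of the query `(6,)` belong to the majority class, so an unweighted vote would pick "a". With the assumed weights, the two minority-class neighbours outweigh them, which is the kind of correction the inverse proportion weight is meant to provide.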

Keywords/Search Tags:

text mining, feature clusters, non-contiguous phrases, similarity matrix, semi-supervised K-means algorithm, KNN algorithm, inverse proportion weight


Related items

1	A Novel Labels And Similarity Reconstruction Based On K-means Algorithm Application On Text Clustering
2	Based On The Text Of The K-means Clustering Analysis
3	Research And Improvement For Semi-supervised K-means Clustering Algorithm In Data Mining
4	Semi-supervised Learning On Text Data
5	K-NN, K-means And The Application In Text Mining
6	Improvement Of Data Mining Algorithm And Application For Teacher-Student Interaction Platform
7	Research On K-means Clustering Algorithm Based On Semi-Supervised Good Point Set And Leader
8	Research On Semi-supervised Learning And Its Application
9	Research On Text Classification Algorithms Based On Semi-supervised Learning
10	Clustering Algorithm Research Based On Semi-supervised GN Algorithm