Some Issues Of Text Mining For Network Information

Posted on: 2016-06-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q M Cao
Full Text: PDF
GTID: 1108330503453414
Subject: Control Science and Engineering
Abstract/Summary:
Given the massive volume and high dimensionality of text information, building effective and scalable algorithms for text mining is an important research direction in data mining. To address this, the dissertation studies several basic problems of text mining, covering five aspects.

1. The traditional vector space model (VSM) has high dimensionality and cannot handle synonymy and polysemy, so a feature-cluster-based vector space model is proposed. Each feature is first represented as a vector, the features are then clustered, and each cluster is treated as a single feature. In addition, discontinuous proper-noun phrases are identified in the text preprocessing stage, which makes the feature information in the vector space model richer and more accurate. The method both reduces the dimension effectively and highlights the semantic features of the text, and so improves the quality of text mining. Experimental results show that it achieves a larger feature reduction rate and that clustering with it outperforms clustering with the traditional VSM.

2. The traditional K-means algorithm selects its initial center points randomly, which tends to make text clustering results unstable. To address this problem, an improved K-means algorithm based on the similarity matrix is proposed. Instead of choosing initial center points at random, it selects them purposefully using the similarity matrix. This gives the clustering process a good starting point and reduces the fluctuation of results caused by their strong dependence on the initial points, so better clustering quality is obtained. Experimental results show that the F-measure of the improved K-means algorithm increases substantially and that the clustering results are more stable.

3. A semi-supervised K-means algorithm is proposed for the shortage of labeled data on some new social network sites. The method uses both labeled and unlabeled data, making full use of the labeled data to help tag the unlabeled data. As initial center points it selects the class center points of the labeled data from the different categories, together with some unlabeled data points far from those selected points, which ensures that the initial center points belong to different clusters and leads to better results. Experimental results show that the algorithm is effective and that it alleviates the shortage of labeled data to a certain extent.

4. Imbalance in training datasets is common and decreases classification accuracy. To solve the class imbalance problem, a mixture weighted KNN algorithm is proposed. According to the imbalance between classes, the algorithm assigns each training sample an inverse proportion weight, which makes the neighbors of a test sample independent of the class imbalance. It then combines this with a distance weight, so that training samples closer to the test sample receive greater weight. Experimental results show that the algorithm achieves better classification accuracy and is an effective way to handle training dataset imbalance.

5. To improve operating efficiency and handle massive datasets easily, the proposed text clustering and classification methods are parallelized based on the MapReduce model. In addition, each method is integrated as a module into a complete text mining system, which automates the whole text mining process. Experimental results demonstrate that this approach greatly improves operating efficiency without sacrificing accuracy.
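
The weighting idea in item 4 can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so several choices here are assumptions rather than the dissertation's method: the class weight is taken as 1 / (number of training samples in that class), the distance weight as 1 / (distance + eps), the two are combined by multiplication, and distance is Euclidean. The function name `weighted_knn_predict` and its parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Predict a label for `query` by class- and distance-weighted voting.

    Assumed weighting (not taken from the source):
      class weight    = 1 / (number of training samples in the class)
      distance weight = 1 / (Euclidean distance + eps)
      vote of a neighbor = class weight * distance weight
    """
    class_counts = Counter(train_y)

    # Distance from the query to every training sample.
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]

    # Keep the k nearest training samples.
    neighbors = sorted(dists, key=lambda pair: pair[0])[:k]

    votes = defaultdict(float)
    for dist, label in neighbors:
        class_w = 1.0 / class_counts[label]   # rarer class, larger weight
        dist_w = 1.0 / (dist + eps)           # closer sample, larger weight
        votes[label] += class_w * dist_w

    # Label with the largest accumulated weight wins.
    return max(votes, key=votes.get)
```

With an unweighted majority vote, a test sample near a small class can be outvoted by more numerous but more distant samples of a large class; under the assumed weights above, both the class-size and distance factors work against that outcome.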
Keywords/Search Tags: text mining, feature clusters, non-contiguous phrases, similarity matrix, semi-supervised K-means algorithm, KNN algorithm, inverse proportion weight
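
The abstract says that the improved K-means algorithm (item 2) chooses its initial center points from the similarity matrix but does not state the selection rule. The Python sketch below shows one possible rule, purely as an assumption: build a cosine similarity matrix, take the document with the highest total similarity as the first seed, then repeatedly add the document that is least similar to the seeds chosen so far. The function names `cosine`, `similarity_matrix` and `pick_initial_centers` are hypothetical, and this greedy rule is a stand-in, not the dissertation's procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all document vectors."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def pick_initial_centers(sim, k):
    """Deterministically choose k seed indices from a similarity matrix.

    Assumed heuristic (not taken from the source):
      1. first seed = document with the highest total similarity;
      2. each further seed = document whose highest similarity to any
         already chosen seed is the lowest.
    """
    n = len(sim)
    seeds = [max(range(n), key=lambda i: sum(sim[i]))]
    while len(seeds) < k:
        candidates = [i for i in range(n) if i not in seeds]
        next_seed = min(candidates, key=lambda i: max(sim[i][s] for s in seeds))
        seeds.append(next_seed)
    return seeds
```

Because the choice depends only on the similarity matrix, repeated runs on the same data start K-means from the same centers, which is the stability property the abstract attributes to avoiding random initialization.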