
Research On Key Problems About Large-Scale Text Clustering

Posted on: 2011-05-25 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: M Liu | Full Text: PDF
GTID: 1118330338989467 | Subject: Computer application technology
Abstract/Summary:
Clustering has been studied for a long time and is a familiar tool in everyday life, with applications in many fields. With the development of the information industry and network technology, people are exposed to more and more information, and how to analyze such large-scale data has become a popular research area. Clustering is a good solution to this problem: it partitions similar data into the same cluster without any prior knowledge, and the number of clusters is far smaller than the number of data items. After large-scale data are clustered, users can therefore quickly find the clusters that contain the information they are interested in.

Because network information is usually organized as text, text clustering has become an increasingly popular and important research topic. However, traditional text clustering algorithms cannot handle the vector sparseness and semantic similarity problems raised by large-scale text clustering. This paper therefore studies the specific problems of large-scale text clustering and proposes solutions to them, in four aspects:

Firstly, the features selected by traditional statistics-based feature selection methods cannot fully cover the topics of a text, and they share much redundant information. As the scale of the input texts grows, these methods greatly increase the dimensionality of the feature space and decrease clustering efficiency. This paper therefore proposes a feature word selection method based on topic analysis. The method analyzes the topic information of a text from multiple aspects by constructing lexical chains and then selects, as clustering features, the feature words that best reflect the information each lexical chain describes. This greatly increases clustering efficiency.

Secondly, as the scale of the input texts grows, the feature space contains many texts that are semantically similar, but traditional similarity measures cannot detect these semantic similarities. To solve this problem, semantic similarity is introduced into clustering so that the clustering algorithm can find semantic similarities among texts and improve clustering precision. In addition, different features have different abilities to separate the input texts, yet traditional similarity measures treat all features as equally important. To solve this problem, this paper proposes a feature weighting method based on feature distribution. It computes the weight of each feature in the similarity computation between a neuron and a text from statistics derived from the distribution of the features, which strengthens the features that effectively reflect the similarities among the input texts.

Thirdly, as the scale of the input texts grows, the representative features of each cluster may occupy only a small part of the feature space. Traditional clustering algorithms, however, use all the features of the feature space to represent clusters, which inevitably brings irrelevant features into the cluster partition and decreases clustering precision. To solve this problem, this paper proposes a neuron clustering algorithm based on vector compression. The algorithm selects the features that can represent the clusters and uses them to partition the texts; a neuron model is then used to optimize the partition, yielding better feature representations and cluster partitions, which reduces running time and improves clustering precision.
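The abstract does not give the concrete procedure, so the following is only a minimal Python sketch of the general vector-compression idea: each cluster neuron (prototype) keeps only its k strongest features, and an incoming text is assigned to the cluster whose compressed neuron is most similar. The function names, the value of k, and the top-k selection criterion are illustrative assumptions, not the dissertation's actual algorithm.

    # Illustrative sketch only: compress each cluster neuron to its k strongest
    # features and assign texts by cosine similarity over the compressed vectors.
    # The selection criterion (largest weights) and k are assumptions.
    import numpy as np

    def compress_neuron(neuron, k):
        """Keep only the k largest-weight features of a prototype; zero the rest."""
        compressed = np.zeros_like(neuron)
        top = np.argsort(neuron)[-k:]        # indices of the k largest weights
        compressed[top] = neuron[top]
        return compressed

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0

    def assign(text, neurons, k=2):
        """Return the index of the compressed neuron most similar to the text vector."""
        sims = [cosine(compress_neuron(n, k), text) for n in neurons]
        return int(np.argmax(sims))

    # Toy usage: two cluster prototypes over six features, one incoming text.
    neurons = [np.array([5.0, 4.0, 0.1, 0.0, 0.2, 0.0]),
               np.array([0.0, 0.3, 0.1, 6.0, 5.0, 0.2])]
    text = np.array([1.0, 2.0, 0.0, 0.0, 1.0, 0.0])
    print(assign(text, neurons, k=2))        # 0: the text matches the first prototype

Restricting the similarity computation to a few representative features per cluster is what keeps irrelevant features out of the partition, which is the efficiency and precision argument made above.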
This paper also proposes a multi-stage clustering algorithm based on probability. The algorithm selects the features that carry information relevant to a cluster to construct that cluster's feature set, which filters out the disturbance that irrelevant features cause to the clustering results and achieves better clustering precision.

Finally, because information on the network is constantly updated, users cannot obtain the entire input data prepared for clustering at one time. To solve this problem, this paper proposes a sample-based incremental clustering algorithm that can cluster input texts in real time. In addition, this paper proposes a topology-adaptive neuron clustering algorithm that can simulate the distribution of the data at different time stages; it can also be applied to data evolvement analysis to analyze how the information among texts changes over time.
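As a rough illustration of clustering texts as they arrive, the following is a minimal single-pass (leader-follower style) Python sketch. The dissertation's sample-based incremental algorithm and its topology-adaptive variant are not specified in the abstract, so the class, the similarity threshold, and the running-mean prototype update are illustrative assumptions only.

    # Illustrative sketch only: a single-pass incremental clusterer that assigns
    # each arriving text vector to the nearest existing cluster prototype, or
    # opens a new cluster when no prototype is similar enough.
    import numpy as np

    class IncrementalClusterer:
        def __init__(self, threshold=0.5):
            self.threshold = threshold
            self.centroids = []   # running cluster prototypes
            self.counts = []      # number of texts absorbed by each cluster

        @staticmethod
        def _cosine(a, b):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom > 0 else 0.0

        def add(self, vector):
            """Place one incoming text vector and return its cluster index."""
            vector = np.asarray(vector, dtype=float)
            if self.centroids:
                sims = [self._cosine(c, vector) for c in self.centroids]
                best = int(np.argmax(sims))
                if sims[best] >= self.threshold:
                    n = self.counts[best]
                    # Update the prototype as a running mean of its members.
                    self.centroids[best] = (self.centroids[best] * n + vector) / (n + 1)
                    self.counts[best] += 1
                    return best
            self.centroids.append(vector)
            self.counts.append(1)
            return len(self.centroids) - 1

    # Toy usage: stream three text vectors one at a time.
    clusterer = IncrementalClusterer(threshold=0.3)
    for v in ([1, 0, 2, 0], [2, 0, 1, 0], [0, 3, 0, 1]):
        print(clusterer.add(v))   # prints 0, 0, 1

Because each text is placed as it arrives, the whole corpus never needs to be held or re-clustered at once, which is the point of the incremental setting described above.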
Keywords/Search Tags: Large-Scale Text Clustering, Selection of Features from Text, Semantic Similarity, Neuron Clustering, Incremental Clustering, Data Evolvement Analysis