
The Research And Application Of The Text Clustering That Combined With Weighting Factor And Feature Vector

Posted on: 2016-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: S C Guo
Full Text: PDF
GTID: 2308330464462441
Subject: Computer technology
Abstract/Summary:
Text clustering uses a clustering algorithm to group texts of the same kind into the same cluster; the process involves text pre-processing and the clustering algorithm itself. Text clustering methods are widely used in public opinion analysis, search engines, e-books, and other services. Text clustering is a typical unsupervised learning method, so no description of each category needs to be known in advance. This is also what makes text clustering difficult, and it has attracted many researchers, who have produced a wealth of results.

Before clustering, the text must be pre-processed and represented mathematically; the vector space model (VSM) is generally used for this. In the VSM, a text is represented as a vector composed of its feature words and their weights. However, feature-word weights calculated by the traditional method cannot fully reflect the differences between texts, so the traditional method is limited. The traditional model also ignores the order in which feature words appear, let alone whether a feature word carries the same significance at different positions in the text. Moreover, the choice of clustering algorithm strongly affects the clustering result, and many clustering algorithms do not combine well with the text encoding scheme. This thesis carries out the following research on improving and applying text clustering methods:

1. It analyzes the limitations of representing text with the traditional term weight calculation and improves that calculation by introducing a weighting factor. The approach emphasizes the importance of feature-word weights within the text collection rather than simply representing each text by its feature-word weights, which increases the similarity between similar texts. It also modifies the traditional VSM encoding scheme so that each text vector is composed of four feature vectors combined with position weight information. Finally, because the revised encoding scheme affects how similarity between texts is computed, the text similarity formula is reconstructed.

2. Building on the improved pre-processing and encoding scheme, it improves the genetic K-means document clustering algorithm with a genetic control factor (GCF). The GCF controls the genetic operators so that high-quality individuals are always carried into the next generation, which addresses the inefficiency of the genetic K-means operators; applied to the improved encoding scheme, this is intended to improve the clustering effect and accuracy. Experiments show that the improved method significantly improves text clustering accuracy.

3. The improved text clustering algorithm, which combines weighting factors and feature vectors, is applied to discovering public opinion hot spots, with a detailed analysis and process design for this specific application. The experimental results show that the improved method is very helpful for research on public opinion hot spots and early warning.
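For orientation, the traditional VSM baseline that the abstract takes as its starting point can be sketched as below. This is a minimal textbook illustration (TF-IDF weights and cosine similarity), not the thesis's improved weighting factor, four-part feature vector, or reconstructed similarity formula, none of which are specified in the abstract. The function names `tfidf_vectors` and `cosine_similarity` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = relative term frequency * log inverse document frequency.
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized into feature words.
docs = [
    ["stock", "market", "falls"],
    ["stock", "market", "rises"],
    ["football", "match", "tonight"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

In this baseline each text is a bag of weighted words: word order and word position play no role, which is the limitation the abstract points to. The two documents that share feature words score higher than the pair with no words in common.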
Keywords/Search Tags:text clustering, weighting factor, feature vector, genetic K-means, genetic control factor, public opinion