
Research And Implementation Of Text Clustering Algorithm For Internet News

Posted on: 2018-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Yan
GTID: 2348330518499097
Subject: Computer application technology
Abstract/Summary:
In recent years, as mobile Internet devices and news media in China have rapidly converged, readers' attention has been shifting from short texts on traditional social networks to real-time long-form news on public platforms, and many Internet news service providers are focusing on how to push news to users according to its content. Most existing solutions label news categories manually at collection time and then use deep learning to classify news automatically from a massive collection of text features. This approach has two drawbacks. First, collecting text features at that scale depends on a strong news content service. Second, because news is real-time in nature, the text feature library has to be maintained by professionals from time to time. Although the approach is very accurate, its cost is too high for it to be widely adopted.

This thesis first briefly describes the background of text clustering research in China and abroad. It then discusses how text clustering methods are classified according to the principles on which they are built. Finally, building on previous studies and the characteristics of news text, it proposes AHK-P, an improved hybrid clustering algorithm based on prior knowledge:
(1) The algorithm constructs a category mapping vector and a text representation vector for each text, using the prior knowledge in a classification lexicon and the lexical features of the text, respectively.
(2) The algorithm clusters the text dataset on the category mapping vectors by agglomerative hierarchical clustering.
(3) After this rough division, the initial centroid of each category is extracted from its category mapping vectors and text representation vectors.
(4) Using an improved text distance calculation based on the category mapping vector and the text representation vector, each subclass dataset is refined by the K-means algorithm to improve the accuracy of text partitioning.

The algorithm retains the clustering accuracy of the traditional H-K method while offering a more flexible way of extracting initial centroids and a faster clustering process. Experimental results of text clustering on Internet news show that the improved method significantly improves clustering quality.
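The four steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the lexicon, the vector definitions, or the improved distance formula. The example therefore assumes a toy two-category lexicon (`LEXICON`), a lexicon-hit count for the category mapping vector, a bag-of-words text representation vector, average-linkage agglomeration, and a distance that is a weighted sum of the two Euclidean distances with a hypothetical weight `alpha`.

```python
import math
from collections import Counter

# Hypothetical prior-knowledge lexicon (assumption): category -> indicative words.
LEXICON = {
    "sports": {"match", "team", "goal", "league"},
    "finance": {"stock", "market", "bank", "shares"},
}
CATEGORIES = sorted(LEXICON)

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def category_vector(tokens):
    # Step 1a: lexicon hits per category (assumed form of the category mapping vector).
    return normalize([sum(t in LEXICON[c] for t in tokens) for c in CATEGORIES])

def text_vector(tokens, vocab):
    # Step 1b: normalized term counts (assumed form of the text representation vector).
    counts = Counter(tokens)
    return normalize([counts[w] for w in vocab])

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def agglomerate(cat_vecs, k):
    # Step 2: average-linkage agglomerative clustering on category vectors only.
    clusters = [[i] for i in range(len(cat_vecs))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(euclid(cat_vecs[i], cat_vecs[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

def combined_distance(point, centroid, alpha):
    # Step 4 distance (assumption): weighted sum over both vector types.
    return (alpha * euclid(point[0], centroid[0])
            + (1 - alpha) * euclid(point[1], centroid[1]))

def centroid_of(members):
    return (mean([m[0] for m in members]), mean([m[1] for m in members]))

def ahkp_sketch(docs, k, alpha=0.5, max_iter=20):
    vocab = sorted({t for d in docs for t in d})
    points = [(category_vector(d), text_vector(d, vocab)) for d in docs]
    rough = agglomerate([p[0] for p in points], k)
    # Step 3: initial centroids come from the rough clusters, not random seeds.
    centroids = [centroid_of([points[i] for i in c]) for c in rough]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 4: K-means refinement with the combined distance.
        new_labels = [min(range(k),
                          key=lambda j: combined_distance(p, centroids[j], alpha))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = centroid_of(members)
    return labels
```

Calling `ahkp_sketch(docs, k=2)` on a list of tokenized news texts returns one cluster label per text. Seeding K-means from the hierarchical result is what distinguishes an H-K style hybrid from plain K-means with random initial centroids.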
Keywords/Search Tags:Internet, News Text, Text Clustering, AHK-P, Category Mapping Vector, Text Representation Vector