Research And Implementation Of Technology News Service Based On Information Aggregation

Posted on: 2017-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: D Jiang
Full Text: PDF
GTID: 2308330485453752
Subject: Control Science and Engineering
Abstract/Summary:
As Internet media outlets and user sharing channels multiply, the massive growth of information has caused a serious information overload problem. Traditional information aggregation focuses on providing richer resources; under overload, information screening and filtering become the more valuable technologies. Helping users find the information they are genuinely interested in, and thereby learn more efficiently, is the new challenge for information aggregation techniques.

To relieve information overload in technology news services, this dissertation explores information screening and filtering based on text mining. Building on methods for calculating sentence semantic similarity, it proposes duplication detection and text clustering algorithms that incorporate semantic features, and applies them to eliminating duplicate news, mining public hot spots, and locating users' topics of interest precisely. In detail, the work and its results include:

1. A semantics-based duplication detection method for short text. To address information redundancy in news aggregation, we propose a news duplication detection algorithm that detects not only literal duplicates and near-duplicates but also "topic-duplicate" news, that is, different reports of the same event. We first discuss general methods for calculating sentence semantic similarity and improve the calculation based on word embedding vectors. We then apply sentence semantic similarity to measure the topic similarity of news items. Experiments show that, while keeping precision high, our algorithm greatly improves recall compared with the traditional algorithm, which is purely syntax-based. The algorithm can therefore remove more of the redundancy in news aggregation.

2. A text clustering algorithm based on semantics and graphs. Traditional text clustering algorithms often use the bag-of-words model to construct document vectors, ignoring the semantic relations between words, and centroid-based partitioning methods tend to rigidly split clusters whose concepts are closely related. By integrating word-vector semantic models with graph clustering algorithms that can uncover strongly connected natural clusters, we propose a short-text clustering algorithm that makes up for these shortcomings. Human evaluation of 21 clusters in the experiment shows that the new algorithm captures topic information better and achieves higher clustering purity than the traditional k-means method, making it better suited to the task of mining news topics.

3. The "Technology Vision" news service system. Using the above algorithms, we build a system that compacts news aggregation results and improves the user experience. The system has been released on the Android Application Market and runs stably.
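
The abstract does not give the similarity formula used for duplication detection. As a minimal sketch of the general idea only, assuming a sentence is represented by the average of its word vectors and two headlines count as "topic-duplicate" when their cosine similarity reaches a threshold, the check might look like the following. The names `TOY_VECTORS`, `sentence_vector`, `is_topic_duplicate`, the 3-dimensional vectors, and the 0.95 threshold are all hypothetical and chosen for illustration; they are not taken from the thesis.

```python
import math

# Toy 3-dimensional word vectors. These are hand-written for illustration
# only; a real system would load trained word embeddings instead.
TOY_VECTORS = {
    "apple": [0.9, 0.1, 0.0],
    "iphone": [0.8, 0.2, 0.1],
    "launches": [0.1, 0.9, 0.0],
    "unveils": [0.2, 0.8, 0.1],
    "new": [0.3, 0.3, 0.3],
    "phone": [0.7, 0.1, 0.2],
    "stocks": [0.0, 0.1, 0.9],
    "fall": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence, vectors=TOY_VECTORS):
    """Average the vectors of the known words; None if no word is known."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_topic_duplicate(a, b, threshold=0.95):
    """Assumed rule: duplicates if sentence-vector cosine >= threshold."""
    va, vb = sentence_vector(a), sentence_vector(b)
    if va is None or vb is None:
        return False
    return cosine(va, vb) >= threshold


same_event = is_topic_duplicate("Apple launches new iPhone",
                                "Apple unveils new phone")
different_event = is_topic_duplicate("Apple launches new iPhone",
                                     "Stocks fall")
```

The two headlines about the same event share only two words, so a purely syntax-based comparison would likely miss the pair, while the averaged vectors point in nearly the same direction. This is one plausible reading of why a semantic method can raise recall; the thesis's improved similarity calculation may differ.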
Keywords/Search Tags: semantic similarity, word vector, text duplication detection, text clustering, complete subgraph, news aggregation, topic mining
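
The keyword "complete subgraph" suggests that clusters are read off a similarity graph as groups of mutually similar documents. The abstract does not describe the actual procedure, so the following is only a sketch under stated assumptions: documents are linked when their pairwise similarity reaches a threshold, and maximal complete subgraphs (cliques) are enumerated with the basic Bron-Kerbosch algorithm, which is a substitute chosen here for illustration. The similarity matrix, the 0.7 threshold, and the function names are hypothetical.

```python
def build_graph(sim, threshold):
    """Link documents i and j when their similarity is >= threshold."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal complete subgraphs (basic Bron-Kerbosch, no pivot)."""
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no vertex can extend it.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical pairwise similarities for five short texts:
# texts 0-2 are mutually similar, and texts 3-4 are similar to each other.
SIM = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.20, 0.10],
    [0.80, 0.85, 1.00, 0.30, 0.10],
    [0.10, 0.20, 0.30, 1.00, 0.90],
    [0.20, 0.10, 0.10, 0.90, 1.00],
]

clusters = maximal_cliques(build_graph(SIM, threshold=0.7))
```

Unlike k-means, this approach does not fix the number of clusters in advance and requires every member of a cluster to be similar to every other member. Note that maximal cliques may overlap, so a practical system would need an additional rule for documents that fall into more than one clique.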