
Discovery Of Urgent Hot Topics Based On Selection Of The Key Sentences And Effective Documents

Posted on: 2018-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Gu
Full Text: PDF
GTID: 2348330536463987
Subject: Computer Science and Technology
Abstract/Summary:
Hot topic discovery is a long-standing research subject in natural language processing. In public opinion monitoring, widely discussed news events must be discovered in real time, especially online emergencies that quickly turn into hot topics. It is important for the relevant government departments to discover the events behind hot topics so that they can track and handle them and prevent an emergency from getting worse. This thesis clusters documents with the classical K-means algorithm and uses the size of each cluster (that is, the number of documents it contains) to reflect the importance of the topic. To improve the quality of the document set and extract high-quality features, the thesis works on the data source before clustering: it selects key sentences and effective documents so that the data is clean and concise. First, it selects the key sentences that reflect the core content of each document and transforms the original document set into a new document set in which each document consists only of its key sentences. Key sentences are measured mainly from two aspects: the sum of the weights of the words in the sentence and the position of the sentence in the document. Second, many clickbait (“title party”) pages on the web use misleading titles to attract users' attention and raise click rates. Such noise documents, whose titles are inconsistent with their content, are excluded by computing the similarity between the key sentences and the title of each document, so that only effective documents are kept. Finally, the feature terms of the key-sentence document set are computed using term frequency-inverse document frequency (TF-IDF). The test corpus consists mainly of online emergency events from 2016, such as the “Lei-yang incident” and the “Shandong vaccine case”. A comparison of the clustering results shows that the new document set produced by key-sentence and effective-document selection reaches an F1 value of 80%, while the original document set reaches only 67%. The refined document set therefore not only improves document quality and reduces the dimension of the vectors used in clustering, but also substantially improves the clustering results.
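
The two selection steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the function names, the TF-IDF style word weight, the linear position bonus (`position_bonus`), the cosine similarity between title and key sentences, and the `threshold` value. The tokenizer handles only whitespace-delimited English text; a Chinese corpus like the one in the thesis would first need a word segmenter. The clustering step itself (TF-IDF vectors fed to K-means) is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase word tokens; Chinese text needs a segmenter.
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(doc):
    # Split on sentence-ending punctuation and drop empty pieces.
    return [s.strip() for s in re.split(r"[.!?]+", doc) if s.strip()]


def idf_table(docs):
    # Smoothed inverse document frequency for every word in the corpus.
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(tokenize(d)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def key_sentences(doc, idf, top_k=2, position_bonus=0.5):
    # Score = (sum of word weights in the sentence) * (position factor).
    # Hypothetical position factor: earlier sentences get a larger bonus.
    sents = split_sentences(doc)
    tf = Counter(tokenize(doc))
    scored = []
    for i, s in enumerate(sents):
        weight = sum(tf[w] * idf.get(w, 1.0) for w in set(tokenize(s)))
        pos = 1.0 + position_bonus * (1.0 - i / max(len(sents) - 1, 1))
        scored.append((weight * pos, i, s))
    top = sorted(scored, key=lambda t: -t[0])[:top_k]
    # Return the selected sentences in their original document order.
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]


def cosine(a, b):
    # Cosine similarity between bag-of-words count vectors of two texts.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def is_effective(title, key_sents, threshold=0.1):
    # A document is kept only if its title resembles its key sentences.
    return cosine(title, " ".join(key_sents)) >= threshold
```

In this sketch, a document whose title shares no words with its key sentences gets a similarity of zero and is dropped as a clickbait-style noise document; the remaining key-sentence documents would then be vectorized with TF-IDF and clustered.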
Keywords/Search Tags: Selection of Key Sentences, Selection of Effective Documents, TF-IDF, K-means, Hot Topic