
Research On Directional Acquisition And Automatic Summarization For Network Data

Posted on: 2019-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: C C Yang
Full Text: PDF
GTID: 2428330566999350
Subject: Software engineering
Abstract/Summary:
The rapid popularization of technologies such as the Internet and the Internet of Things has driven rapid growth of data on online platforms. Accurate, directional acquisition of network data is important for data mining, yet existing approaches suffer from low acquisition precision. Moreover, given the massive volume of collected content, extracting valuable information from it is itself a research problem. Traditional summarization algorithms consider only keyword frequency, without related semantic analysis, so the resulting document summaries suffer from low precision and low recall. This thesis therefore studies two aspects: improving the accuracy of data acquisition, and extracting key information as a summary. The main work is as follows:

(1) For directional data acquisition, this thesis proposes an Adaptive Crawling Algorithm (ACA) for network data. The algorithm introduces text weighting to assign weights to keywords and computes the relevance of web pages with a vector space model. A page's importance is judged from the relevance of its links to the topic; a fitness function filters the pages related to the topic, and the system model is adjusted dynamically according to the pages acquired in real time. The crawler is built on the Hadoop distributed platform and acquires pages in parallel, making full use of the computing resources of each node to improve the acquisition rate.

(2) For summary generation, this thesis proposes a Multi-Document Summarization algorithm based on Topic Clustering (MDSTC). First, the algorithm adds a sample density function to the clustering algorithm, so that the initial number of clusters and the cluster centers are determined automatically from statistical information; the system thereby discovers the number of potential subtopics in the document set. A convolutional neural network is then trained on the clustered topic texts to score and label sentences, and the central sentences with the highest relevance in the different subtopics are extracted as the summary.

(3) Finally, a prototype system is built that collects "Earthquake" information, displays the collected content through web pages, and uses the automatic summarization module to condense the massive collected content into a valuable summary.
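The vector-space relevance and fitness-function filtering used by the adaptive crawler can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the function names, the term-frequency weighting, and the `alpha` blend between page content and link context are all illustrative assumptions.

```python
import math
from collections import Counter

def relevance(page_tokens, topic_weights):
    """Cosine similarity between a page's term-frequency vector and a
    weighted topic-keyword vector (vector space model)."""
    tf = Counter(page_tokens)
    dot = sum(tf[w] * wt for w, wt in topic_weights.items())
    page_norm = math.sqrt(sum(v * v for v in tf.values()))
    topic_norm = math.sqrt(sum(wt * wt for wt in topic_weights.values()))
    if page_norm == 0 or topic_norm == 0:
        return 0.0
    return dot / (page_norm * topic_norm)

def fitness(page_tokens, link_relevances, topic_weights, alpha=0.7):
    """Fitness of a page: blend its own content relevance with the
    average relevance of the link context that led to it. Pages whose
    fitness falls below a threshold are filtered out of the crawl."""
    content = relevance(page_tokens, topic_weights)
    link = sum(link_relevances) / len(link_relevances) if link_relevances else 0.0
    return alpha * content + (1 - alpha) * link
```

In a running crawler, `topic_weights` would come from the text-weighting step, and the threshold on `fitness` (and the weights themselves) would be adjusted dynamically as pages are acquired.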
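The density-based initialization behind MDSTC's clustering step can be sketched as follows: repeatedly take the densest remaining sample as a cluster center and exclude its neighbourhood, so the number of centers (i.e. subtopics) emerges from the data rather than being fixed in advance. The 2-D points, the `radius`, and the `min_density` cutoff are illustrative assumptions standing in for document feature vectors.

```python
import math

def density(points, i, radius):
    """Sample density of point i: how many other points lie within
    `radius` of it."""
    xi, yi = points[i]
    return sum(1 for (x, y) in points
               if math.hypot(x - xi, y - yi) <= radius) - 1

def initial_centers(points, radius, min_density):
    """Greedy density-peak initialization: pick the densest remaining
    point as a center, drop its neighbourhood, repeat. Both the number
    of clusters and the initial centers are determined automatically."""
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        best = max(remaining, key=lambda i: density(points, i, radius))
        if density(points, best, radius) < min_density:
            break  # remaining points are too sparse to seed a subtopic
        centers.append(points[best])
        bx, by = points[best]
        remaining = [i for i in remaining
                     if math.hypot(points[i][0] - bx,
                                   points[i][1] - by) > radius]
    return centers
```

The centers found this way would seed an ordinary k-means pass over the document vectors; sentence scoring within each resulting subtopic is then handled by the trained convolutional network.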
Keywords/Search Tags: Adaptive Algorithm, Data Collecting, Nutch, Distributed Platform, Automatic Summary, Clustering Algorithm, Convolutional Neural Network