The Applied Research Of Complex Networks In Processing Of Web News Information

Posted on: 2013-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Tang
GTID: 2218330371995668
Subject: Electrical system control and information technology
Abstract/Summary:
As the Internet spreads ever more widely, web news has drawn increasing attention as a newer way to transmit information. It also brings problems, such as news items of widely varying quality. Helping people grasp the news quickly, find the latest hot topics, rank news items, and read specific categories has become an active research issue, since doing so saves readers time. As a new method for studying complex systems, complex network theory has gradually become familiar to and accepted by scholars. Applying complex networks to news information processing is therefore significant: it offers an effective way to clean up the network environment and to reduce the waste of both information resources and time.

This thesis systematically studies three levels of network information processing, and analyzes and verifies the feasibility and validity of the research by means of complex networks. The main research work is as follows.

Experimental platform: this thesis aims to build an automated, intelligent news information processing system that performs, in sequence and automatically, network news crawling, information extraction, keyword generation, news page clustering, and configuration of the subsequent research results. After comparing the advantages and disadvantages of programming languages, Java was chosen for the network-related tasks because of its strength in network processing. All algorithms, including complex network modeling, calculation of network characteristic values, and community detection, are implemented in Java with its data structures. Experimental results are displayed in a graphical interface built on Matlab and Pajek.
The related theory is analyzed in full and the experimental platform is developed in Java, providing an effective tool for studying the application of complex networks to network information technology.

News page collection: web crawler principles and the Heritrix framework are studied thoroughly, and an intelligent web crawler algorithm is improved and implemented. It can fetch specified kinds of network information (such as web pages, video, and images) from specified websites at specified times. This supports the subsequent steps of news keyword extraction, news page clustering, and public opinion discovery and monitoring.

News web content extraction and preprocessing: existing information extraction methods are analyzed and studied thoroughly. A wrapper-based information extraction method is chosen, and a series of wrapper libraries is constructed. Web pages from Sina, NetEase, the DongKou party branch construction website, and Southwest Jiaotong University are collected with the algorithm and then parsed accurately to obtain the news content, title, release time, source, and so on. Preprocessing such as Chinese word segmentation and part-of-speech tagging is also implemented. This part provides the necessary basis for the subsequent construction of the complex network.

News page keyword extraction: based on complex network modeling, this thesis builds a complex network from the words of the news content, with each word as a node and an edge between every two adjacent words. An improved complex network model based on node weights is proposed. By combining complex network statistical parameters, such as node weight, node degree, node clustering coefficient, and betweenness centrality, the words are ranked to obtain the top N keywords of the web news content.
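The word-network ranking described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation (which was written in Java). The function names, the use of term frequency as the node weight, and the equal-weight sum of the normalized measures are all assumptions made for this sketch. Betweenness centrality is omitted to keep it short.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_word_network(tokens):
    """Return (node_weight, adjacency) for a list of segmented words."""
    weight = Counter(tokens)              # assumed node weight: term frequency
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):  # two adjacent words share an edge
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    for w in weight:                      # keep words without edges as nodes
        adj.setdefault(w, set())
    return weight, adj


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def top_keywords(tokens, n=3):
    """Rank words by an assumed equal-weight sum of three node measures."""
    if not tokens:
        return []
    weight, adj = build_word_network(tokens)
    max_w = max(weight.values())
    max_deg = max(len(nb) for nb in adj.values()) or 1
    score = {
        w: weight[w] / max_w
        + len(adj[w]) / max_deg
        + clustering_coefficient(adj, w)
        for w in weight
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(score, key=lambda w: (-score[w], w))[:n]
```

Calling `top_keywords(tokens, n)` on the segmented words of one news article returns its n highest-scoring words. In practice, stop words would be removed before the network is built, and the relative weighting of the measures would need tuning against labelled data.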
Comparative experiments verify that keyword extraction with the improved, node-weight-based complex network performs considerably better.

News page clustering: the general process of text mining and the clustering algorithms of data mining are analyzed. This thesis proposes and builds a complex network of web news documents, using the keywords extracted in the previous step as the dimension reduction method. The content of the crawled news documents is reduced in dimension, the similarity between the reduced documents is computed, and a network is built with news documents as nodes and document similarities as edges. Complex network community detection algorithms are then implemented and improved, and compared at the implementation level with the k-means clustering algorithm from traditional data mining, pointing out the differences and connections between them. An improved community division method based on the voltage spectral algorithm is applied to the complex network to obtain the community division and thereby cluster the web pages. Experimental results show that the improved algorithm correctly classifies the nodes of the news page complex network and thus clusters the web news. This provides a new set of research tools for website clustering, automatic identification of poor-quality websites, dimension reduction and expansion, news discovery, identification of duplicate articles, and other research work.

In summary, this thesis studies news information processing at several levels: the experimental platform, experimental data acquisition, keyword extraction (dimension reduction of the text vectors), and news page clustering. The results show that the research work is feasible and valid.
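The document network used for clustering can be sketched as follows. This is a sketch under stated assumptions: cosine similarity over keyword-weight vectors, a fixed similarity threshold for creating edges, and connected components as a simple stand-in for community detection. It is not the improved voltage spectral algorithm of the thesis, whose details the abstract does not give. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def similarity_network(docs, threshold):
    """docs: {doc_id: {keyword: weight}}. Link documents that are similar enough."""
    adj = {d: set() for d in docs}
    for u, v in combinations(docs, 2):
        if cosine(docs[u], docs[v]) >= threshold:   # assumed edge rule
            adj[u].add(v)
            adj[v].add(u)
    return adj


def connected_groups(adj):
    """Connected components by depth-first search (stand-in for communities)."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adj[node] - seen)
        groups.append(group)
    return groups
```

Because each document is represented only by its extracted keywords, the vectors stay short, which is the dimension reduction the abstract refers to. A real community detection method would also split densely connected components that this stand-in leaves merged.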
Keywords/Search Tags: Web clustering, Web page keyword extraction, Complex network, Community structure