
Research And Application Of Distributed JS Analysis In Web Info Collection System

Posted on: 2016-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: X T Zheng
Full Text: PDF
GTID: 2308330479498798
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of Internet technology, network applications reach ever further into people's daily lives. On the one hand, the Internet generates huge amounts of data containing much valuable information, which creates a demand for web information collection; on the other hand, more and more new technologies are being adopted, and dynamic web technology, especially dynamic scripting, greatly enhances the user experience and the functionality and aesthetics of web pages. However, because the original web information collection system cannot parse scripts, it cannot collect information from dynamic web pages. To address this problem, this paper designs and implements a web page script extraction and analysis system based on distributed computing. Combined with the original information collection system, it enables the collection of dynamic web pages that the original system could not handle. First, through an analysis of the JavaScript language and common parsing engines, this paper designs the script extraction and analysis process, which mainly consists of script extraction and construction of the analysis environment. Second, through research on Hadoop scheduling algorithms, and taking into account the actual operating environment of the script extraction and analysis system, this paper designs a scheduling algorithm based on harmony search, integrating script extraction and analysis with Hadoop distributed computing. Third, to combine the script extraction and analysis system with the original information collection system, this paper designs the overall file structure and data storage format on the basis of the file storage structure of the original Nutch system. Finally, this paper implements the system with MapReduce programming and tests it on a real Hadoop platform.
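The script extraction step mentioned above can be illustrated with a minimal sketch. The abstract does not describe how the thesis implements extraction, so everything here is an assumption: the class name `ScriptExtractor`, the function `extract_scripts`, and the use of Python's standard `html.parser` module are chosen only for illustration. The sketch separates inline script bodies from external script URLs, which is the input a later analysis environment would need.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Hypothetical helper: collects inline script bodies and external script URLs."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []   # text found between <script> and </script>
        self.external_srcs = []    # values of the src attribute on <script> tags
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external_srcs.append(src)
        self._in_script = True
        self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._buffer).strip()
            if body:
                self.inline_scripts.append(body)
            self._in_script = False


def extract_scripts(html):
    """Return (inline_scripts, external_srcs) for one HTML document."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.inline_scripts, parser.external_srcs
```

In a crawler pipeline, the external URLs would still have to be fetched and the collected code executed in a JavaScript engine; neither step is shown here.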
Analysis of the test results verifies that the script extraction and analysis system enables the original information collection system to collect information from dynamic web pages, and that the harmony search based scheduling algorithm improves execution efficiency in a heterogeneous cluster environment. The scheme proposed in this paper achieves fast and accurate collection of dynamic web information and offers ideas for technical improvement in fields related to information acquisition.
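To make the idea of harmony search based scheduling concrete, the following is a minimal sketch under stated assumptions, not the algorithm from the thesis. The abstract does not give the objective function, the solution encoding, or the parameter values, so the sketch assumes a simple model: each task has a cost, each node has a relative speed, and the goal is to minimize the makespan (the finishing time of the most loaded node). The names `makespan` and `harmony_search_schedule` and all parameter defaults are hypothetical.

```python
import random


def makespan(assignment, task_costs, node_speeds):
    """Finishing time of the slowest node for a task-to-node assignment."""
    loads = [0.0] * len(node_speeds)
    for task, node in enumerate(assignment):
        loads[node] += task_costs[task] / node_speeds[node]
    return max(loads)


def harmony_search_schedule(task_costs, node_speeds, hms=10, hmcr=0.9,
                            par=0.3, iterations=2000, seed=0):
    """Assumed harmony search: hms = memory size, hmcr = memory
    consideration rate, par = pitch adjustment rate."""
    rng = random.Random(seed)
    n_tasks, n_nodes = len(task_costs), len(node_speeds)

    def cost(assignment):
        return makespan(assignment, task_costs, node_speeds)

    # Harmony memory: hms random assignments (one node index per task).
    memory = [[rng.randrange(n_nodes) for _ in range(n_tasks)]
              for _ in range(hms)]

    for _ in range(iterations):
        new = []
        for t in range(n_tasks):
            if rng.random() < hmcr:
                # Reuse the node chosen for this task by a remembered harmony.
                node = rng.choice(memory)[t]
                if rng.random() < par:
                    # Pitch adjustment: shift to a neighbouring node index.
                    node = (node + rng.choice((-1, 1))) % n_nodes
            else:
                # Random improvisation.
                node = rng.randrange(n_nodes)
            new.append(node)
        # Replace the worst harmony if the new one is better.
        worst = max(range(hms), key=lambda i: cost(memory[i]))
        if cost(new) < cost(memory[worst]):
            memory[worst] = new

    return min(memory, key=cost)
```

A call such as `harmony_search_schedule([4, 3, 2, 6, 5, 1], [1.0, 2.0])` returns a list giving the node index for each task. In a heterogeneous cluster the speeds would have to be measured or estimated, which this sketch takes as given.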
Keywords/Search Tags:Information collection, JavaScript analysis, Hadoop scheduling algorithm, Harmony search algorithm