Research Of The Internet Information Acquisition And Processing Platform

Posted on: 2010-06-21  Degree: Master  Type: Thesis
Country: China  Candidate: S C Li
GTID: 2178360275973564  Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the information society, the Internet has become a major source of information. However, the characteristics of Internet information, such as its massive volume, complexity, and lack of structure, create difficulties both for research on Internet information acquisition and for analysis and studies based on collected Internet information. The Internet Information Acquisition and Processing Platform integrates key technologies in the field of Internet information acquisition and processing; it collects complex, unstructured Internet information and stores it as structured data. This dissertation describes the design, construction procedure, and implementation technologies of a platform with a B/S (browser/server) architecture. It presents an innovative technical solution for information acquisition and processing, describes the implementation of the main components, and addresses the problems caused by the characteristics of Internet information. The major innovations and research work are as follows.

First, URL analysis is used in Internet information processing to assist webpage filtering and site identification, to select the parsing template, and to find relationships between web pages. This technique supports template-based webpage information parsing and extraction, optimizes the design of the information processing module, and improves the efficiency and accuracy of information processing.

Second, a hash-function-based approach, the Abstract Eigenvalue Comparison Approach, is designed to avoid the data duplication caused by parsing the same webpage a second time and to improve the efficiency of data storage. Experiments, together with an analysis of their results, demonstrate the improvement in data storage efficiency.

Third, based on a study of the Nutch crawler system, the dissertation extends the Nutch crawler with multi-threading and a configuration interface to implement a configurable, distributed information acquisition module.

Building on these findings, the dissertation presents the overall system design, the logical design of the functional modules, the database design, and the user interface design; applies GWT in developing the interactive user interface; and adopts multi-threading to improve the efficiency of information processing, which is verified experimentally. Finally, a stable and efficient Internet Information Acquisition and Processing Platform was established.
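The abstract does not spell out how the hash-function-based duplicate check works, so the following is only a minimal sketch of the general idea: compute a digest of each parsed record and skip storage when that digest has been seen before. The names `content_digest` and `DedupStore`, the choice of SHA-256, and the whitespace normalization are all assumptions made for illustration, not details taken from the dissertation.

```python
import hashlib


def content_digest(fields):
    """Return a hex digest for a parsed record (dict of field name -> text).

    Hypothetical helper: the canonical form and SHA-256 are assumptions.
    """
    # Sort keys so the digest does not depend on dict ordering, and
    # collapse runs of whitespace so trivial formatting changes are ignored.
    canonical = "\x1f".join(
        f"{key}={' '.join(str(fields[key]).split())}" for key in sorted(fields)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupStore:
    """In-memory stand-in for the platform's storage layer (assumption)."""

    def __init__(self):
        self._digests = set()  # digests of records already stored
        self.records = []      # the stored structured records

    def save(self, fields):
        """Store the record unless an identical one was stored before.

        Returns True if the record was stored, False if it was a duplicate.
        """
        digest = content_digest(fields)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        self.records.append(dict(fields))
        return True
```

In this sketch, parsing the same page twice yields the same digest, so the second `save` call returns `False` and nothing is written. A real system would presumably keep the digests in an indexed database column rather than an in-memory set.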
Keywords/Search Tags: Information Acquisition, Information Processing, Internet Crawler, Webpage Information Extraction