
Research On Hyperlinks Extraction Based On Hotspot Website Content Analysis

Posted on: 2011-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: F J Lang
Full Text: PDF
GTID: 2218330338965281
Subject: Computer application technology
Abstract/Summary:
The Internet holds a vast amount of information, and much of it is hotspot information that people follow closely; the content of hotspot websites represents the part of Internet information that interests users most. This paper analyzes the content of NBA hotspot websites, parses their hyperlinks and the corresponding text, and then uses the URLs and text as feedback to estimate the heat degree of a website. First, the paper gives an overview of web information extraction, including its history and current status, classifies the techniques and common algorithms of web information extraction, and describes the techniques and evaluation measures for hotspot web information extraction. Second, it analyzes the features of hotspot website page content: the characteristics of hotspot websites in general and of sports NBA hotspot websites in particular, a comparison of the Sohu and NetEase NBA pages, and an analysis of hotspot websites, hotspot content, and their internal relationships from the perspective of Web language features. This comparison summarizes the features of NBA hotspot website content, which are well suited to parsing with an HTML parser. By comparing mainstream HTML parsers, the paper explains the advantages of HTML Parser for analyzing hotspot web pages, and the implementation of hotspot web page collection further confirms the internal structure and composition of these pages. On this basis, the paper proposes a hotspot double-feedback URL and text extraction strategy based on HTML Parser: first, HTML Parser extracts the URLs from a web page; next, the text is extracted for each URL; the extracted text then feeds back the heat degree of each URL; and the URLs in turn feed back the heat degree of the whole web page. Finally, the paper implements website hyperlink information extraction based on HTML Parser.
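The double-feedback strategy can be illustrated with a minimal sketch. It is not the thesis implementation: Python's standard `html.parser` stands in for the HTML Parser library, the anchor text of each link is taken as the "corresponding text", and the heat measure (a count of hot-term occurrences) is a hypothetical placeholder, since the abstract does not define how heat degree is computed. The names `LinkExtractor`, `extract_links`, `text_heat`, and `double_feedback` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Step 1: parse the page and return its (URL, text) pairs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def text_heat(text, hot_terms):
    """Hypothetical heat measure: number of hot-term occurrences in the text."""
    lowered = text.lower()
    return sum(lowered.count(term.lower()) for term in hot_terms)


def double_feedback(html, base_url, hot_terms):
    """Text heat feeds back to each URL; URL heats feed back to the page."""
    url_heat = {}
    for url, text in extract_links(html, base_url):
        url_heat[url] = url_heat.get(url, 0) + text_heat(text, hot_terms)
    page_heat = sum(url_heat.values())
    return url_heat, page_heat
```

Calling `double_feedback(page_html, page_url, ["NBA", "Lakers"])` returns a per-URL heat dictionary and a page-level total, so links whose text contains no hot terms score zero and can be treated as noise. Summing the URL heats is only one possible way to aggregate them into a page score.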
The paper mainly describes the implementation of hyperlink and text information extraction for hotspot websites using two algorithms. The system is evaluated by querying the extraction results for Sohu sports NBA and NetEase sports NBA: two performance indicators, precision and recall, are used to compare link extraction on the two hotspot websites, and the URLs and text are used to feed back the heat degree of each website. The hyperlink extraction studied here already performs simple web page analysis and can filter some junk information and noise, but whether it truly satisfies user needs and makes the analyzed information more useful requires further study.
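The two indicators used in the evaluation follow the standard definitions: precision is the share of extracted links that are relevant, and recall is the share of relevant links that were extracted. A minimal sketch, assuming the relevant links for a page are available as a hand-labeled set (the function name `precision_recall` and the sample URLs in the usage note are hypothetical):

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted links.

    extracted: links the extractor returned
    relevant:  links a human judged to be hotspot links (assumed ground truth)
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)  # links that are both extracted and relevant
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if the extractor returns four links of which two appear in a labeled set of three relevant links, precision is 2/4 and recall is 2/3. Running the same function on the results for each site gives directly comparable figures.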
Keywords/Search Tags: HTML Parser, Information extraction, Web analysis, Double feedback