Research Of Dynamic Comment Extraction Based On Web

Posted on: 2015-03-24; Degree: Master; Type: Thesis
Country: China; Candidate: H Meng; Full Text: PDF
GTID: 2298330467970277; Subject: Computer application technology
Abstract/Summary:
The advent of the Web 2.0 era has transformed the Internet from an information dissemination platform into an information exchange platform, on which people can express their views and discuss any topic they are interested in, thereby shaping public opinion. Some people also exploit Internet public opinion with bad intentions. For this reason, public opinion analysis has received more and more attention, and research on Web information extraction is the basis for such analysis. Web information extraction is the technology of extracting specific structured information from unstructured or semi-structured web pages. This paper describes the current state of web information extraction technology. Existing technologies are structure-sensitive, and there is little research on extracting dynamic multi-level comments. To address these problems, a new semi-automatic information extraction system is designed, consisting of an information access module and a comment extraction module. The information access module is a tool that automatically obtains the full content of dynamic pages, based on the browser API, a message-sending mechanism and Chrome extension technology. In the comment extraction module, the concept of LFSU is proposed based on the visual, structural and semantic features of dynamic pages. Its location property is used to identify the comment area under different organizational models, and a method is given that can extract both single-level and multi-level comments. The method uses little information from the DOM tree and does not involve complex structural comparison or cluster analysis, so the algorithm is efficient. By analyzing the results of coverage experiments on real pages, this paper shows that the information extraction method can meet the actual demand for public opinion data in blogs, and that it performs especially well on pages that contain more than one comment. The recall ratio, precision ratio and F-value are all above 92%.
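For readers unfamiliar with the three evaluation measures named above, the following is a minimal sketch of how they are commonly computed for an extraction task. It assumes the standard definitions (precision = correct / extracted, recall = correct / expected, F-value = harmonic mean of the two); the abstract does not state the exact formulas used in the thesis, and the function name `extraction_scores` and the sample comment IDs are hypothetical.

```python
def extraction_scores(extracted, expected):
    """Return (precision, recall, f_value) for one page.

    Assumption: comments are compared by exact identity and the F-value
    is the balanced harmonic mean (F1). The thesis may define these
    measures differently.
    """
    extracted_set = set(extracted)
    expected_set = set(expected)
    # Comments that were both extracted and actually present on the page.
    correct = len(extracted_set & expected_set)

    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(expected_set) if expected_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical example: four real comments, four extracted, three correct.
expected = ["c1", "c2", "c3", "c4"]
extracted = ["c1", "c2", "c3", "c5"]
precision, recall, f_value = extraction_scores(extracted, expected)
print(precision, recall, f_value)
```

In this illustrative case three of the four extracted items are correct and three of the four real comments are found, so all three scores are equal.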
Keywords/Search Tags: information extraction, dynamic pages, Chrome, LFSU, DOM