
Research and Implementation of Content Extraction Based on JSSh

Posted on: 2011-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: J S Wan
GTID: 2178360308952591
Subject: Communication and Information System
Abstract/Summary:
As an emerging medium, the Internet is quite different from traditional newspapers, broadcasting, and television. On the Internet, anyone can post comments and spread opinions on different platforms: BBS, blogs, and personal Web sites. Furthermore, with the advent of Web 2.0 technologies, there is more and more User Generated Content [1] on the Web, so Web users are not only the audience for messages but also their presenters. Compared to newspapers and magazines, messages on Internet media circulate rapidly and reach a very large audience. The Internet is developing rapidly in China, but most users have little experience on the Web. Without regulation, the Web would be full of false and violent information, and society could be disturbed. Regulators now pay more attention to information trends and guide opinion in order to improve the Web environment.

A censoring system for daily management consists of three parts: information collection, information integration, and report presentation. However, information collection faces hurdles; for example, some Internet media release content that is hard for programs to read in order to evade the censoring system. Such content includes articles in vertical layout, text rendered as images, dynamic Web pages, and pages that cannot be accessed by unauthorized users. In particular, dynamic Web pages account for a considerable proportion of pages, and typical Web page collectors such as Wget and Pavuk cannot collect them.

To enlarge the censoring scope and improve the functionality of the censoring system, automatic Web authentication and the collection of dynamic pages are badly needed at the information collection stage. Inspired by automatic Web testing [4], we use JSSh [5] (JavaScript Shell Server), which provides a JavaScript interface, for the communication between a JSSh Server and a JSSh Client. To achieve automatic Web authentication, instructions given by the JSSh Client are transmitted to the JSSh Server, which manipulates Firefox to fill in the login form and exchange authentication cookies. In addition, mature Web browsers have capable layout engines that render Web pages, including HTML, CSS, and JavaScript, and present a good GUI to users. In this paper, we propose a scheme that uses the Gecko layout engine to interpret the dynamic scripts of Web pages. The JSSh Client extracts content and links from the HTML DOM constructed by Gecko. Experiments show that the scheme is practical and effective.
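The client side of the scheme described above can be sketched roughly as follows. This is not the thesis code, which the abstract does not show; it is a minimal illustration under stated assumptions: JSSh is taken to accept one line of JavaScript over a plain TCP connection and to end each reply with a `> ` prompt, the port 9997 is the commonly cited default, and the helper names (`js_fill_login`, `js_extract_links`, `send_command`), the form field ids, and the `doc_expr` expression used to reach the page document are all hypothetical.

```python
import json
import socket

JSSH_HOST = "127.0.0.1"  # assumption: JSSh Server runs inside a local Firefox
JSSH_PORT = 9997         # assumption: commonly cited JSSh default port


def js_fill_login(doc_expr, user_field, pass_field, user, password):
    """Build one line of JavaScript that fills in and submits a login form.

    doc_expr is a JavaScript expression that evaluates to the page document
    inside the JSSh context (hypothetical; it depends on the setup).
    json.dumps is used to quote values as JavaScript string literals.
    """
    return (
        f"var d = {doc_expr};"
        f"d.getElementById({json.dumps(user_field)}).value = {json.dumps(user)};"
        f"d.getElementById({json.dumps(pass_field)}).value = {json.dumps(password)};"
        "d.forms[0].submit();"
    )


def js_extract_links(doc_expr):
    """Build one line of JavaScript that lists link targets from the DOM.

    The DOM is the one the layout engine built after running page scripts,
    so links created dynamically are included.
    """
    return (
        f"var d = {doc_expr}; var out = [];"
        "for (var i = 0; i < d.links.length; i++) { out.push(d.links[i].href); }"
        "out.join('\\n');"
    )


def send_command(sock, script, prompt=b"\n> "):
    """Send one script line and read the reply up to the assumed prompt."""
    sock.sendall(script.encode("utf-8") + b"\n")
    buf = b""
    while not buf.endswith(prompt):
        chunk = sock.recv(4096)
        if not chunk:  # connection closed before a prompt arrived
            break
        buf += chunk
    if buf.endswith(prompt):
        buf = buf[: -len(prompt)]
    return buf.decode("utf-8", "replace")
```

With a running JSSh Server, a session might open a connection with `socket.create_connection((JSSH_HOST, JSSH_PORT))`, send `js_fill_login(...)` once, wait for the page to load, and then send `js_extract_links(...)` and split the reply on newlines. Building each command as a single line keeps the prompt-based framing simple, at the cost of breaking if a page's own output happens to end with the prompt sequence.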
Keywords/Search Tags: JSSh, Layout Engine, Dynamic Web Page, Authentication