
A Method For Extracting The Topic Information In Webpages Based On The DIV Tag-Trees

Posted on: 2011-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Yang
Full Text: PDF
GTID: 2178360308969652
Subject: Software engineering
Abstract/Summary:
In recent years, information resources on the Internet have been growing at an explosive rate. Faced with such a huge body of data, how to access all the relevant information on a topic quickly, efficiently and economically has become a hot research subject. As CSS+DIV is becoming the mainstream mode of webpage layout, efficient information extraction from such pages has attracted more and more attention. Targeting news webpages with a CSS+DIV layout, this thesis proposes a new web topic information extraction method based on the DIV tag tree. It mainly includes the following three processes:

HTML parsing process: after the HTML document is formed, each DIV tag is extracted from it; DIV tags can be nested. Since each DIV tag corresponds to a DIV tag tree, nested DIV tag trees are extracted as sub-trees of the trees that contain them, and the HTML document is thereby converted into a DIV forest.
Noise filtering process: the noise nodes of the DIV tag trees are filtered out.
Pruning algorithm: first, an STU-DIV model tree is established; then, after topic relevance analysis, the DIV tag trees irrelevant to the topic information are cut off.

Based on this method, the thesis designs and implements a model of a news web topic information extraction system and uses it to run topic information extraction experiments on pages from news websites. The results show that the topic information obtained from news pages by this method has good accuracy and completeness, and that the method extracts news topics effectively.
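The HTML parsing and noise filtering processes can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class names `DivNode` and `DivForestBuilder`, the function `filter_noise`, and the link-text-ratio rule with its 0.5 threshold are assumptions made for illustration, since the abstract does not say how noise nodes are identified. The STU-DIV model tree and the topic relevance pruning are not sketched, because the abstract gives no detail on them.

```python
from html.parser import HTMLParser


class DivNode:
    """One DIV tag tree: attributes, own text, and nested DIV sub-trees."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)
        self.text = []       # text chunks directly inside this DIV
        self.link_text = []  # the chunks that also sit inside an <a> tag
        self.children = []   # nested DIV tag trees


class DivForestBuilder(HTMLParser):
    """Converts an HTML document into a forest of DIV tag trees."""

    def __init__(self):
        super().__init__()
        self.forest = []     # top-level DIV tag trees
        self._stack = []     # currently open DIVs, innermost last
        self._in_anchor = 0
        self._in_skipped = 0  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            node = DivNode(attrs)
            # A nested DIV becomes a sub-tree of the DIV that contains it.
            parent = self._stack[-1].children if self._stack else self.forest
            parent.append(node)
            self._stack.append(node)
        elif tag == "a":
            self._in_anchor += 1
        elif tag in ("script", "style"):
            self._in_skipped += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()
        elif tag == "a" and self._in_anchor:
            self._in_anchor -= 1
        elif tag in ("script", "style") and self._in_skipped:
            self._in_skipped -= 1

    def handle_data(self, data):
        if self._in_skipped or not self._stack:
            return
        node = self._stack[-1]
        node.text.append(data)
        if self._in_anchor:
            node.link_text.append(data)


def _chars(node, field):
    """Non-whitespace-trimmed character count of a field over a whole tree."""
    own = sum(len(chunk.strip()) for chunk in getattr(node, field))
    return own + sum(_chars(child, field) for child in node.children)


def filter_noise(forest, max_link_ratio=0.5):
    """Hypothetical noise rule: drop DIV trees that are empty or mostly links."""
    kept = []
    for node in forest:
        total = _chars(node, "text")
        if total == 0 or _chars(node, "link_text") / total > max_link_ratio:
            continue
        node.children = filter_noise(node.children, max_link_ratio)
        kept.append(node)
    return kept


if __name__ == "__main__":
    page = (
        '<div id="nav"><a href="/">Home</a> <a href="/world">World</a></div>'
        '<div id="main"><div class="story">Flood waters receded on Monday.'
        '</div><div class="ad"></div></div>'
    )
    builder = DivForestBuilder()
    builder.feed(page)
    for tree in filter_noise(builder.forest):
        print(tree.attrs, [child.attrs for child in tree.children])
```

Keeping the forest as plain nested objects means a later pruning step could walk the same trees and discard whole sub-trees, which matches the tree-level granularity the abstract describes.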
Keywords/Search Tags: Topic Information Extraction, DIV tag tree, STU-DIV model tree, Topic information correlativity analysis, Pruning algorithm