A Method For Extracting The Topic Information In Webpages Based On The DIV Tag-Trees

Posted on:2011-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:Z Yang

Full Text:PDF

GTID:2178360308969652

Subject:Software engineering

Abstract/Summary:

In recent years, information resources on the Internet have grown at an explosive rate. Given such a huge body of data, how to access all the information relevant to a topic quickly, efficiently, and economically has become an active research subject. As CSS+DIV becomes the mainstream mode of webpage layout, efficient information extraction from such pages has attracted increasing attention.

For news webpages laid out with CSS+DIV, this thesis proposes a new method for extracting webpage topic information based on the DIV tag tree. It consists of three main processes:

HTML parsing process: after the HTML document is formed, each DIV tag is extracted from it. DIV tags can be nested, and each DIV tag corresponds to a DIV tag tree, so a nested DIV tag tree is extracted as a subtree of the DIV tag tree that encloses it. The HTML document is thereby converted into a DIV forest.

Noise filtering process: the noise nodes of the DIV tag trees are filtered out.

Pruning algorithm: first, an STU-DIV model tree is established; then, after topic relevance analysis, the DIV tag trees irrelevant to the topic information are cut off.

Based on this method, the thesis designs and implements a prototype news webpage topic information extraction system and uses it to run topic information extraction experiments on pages from news websites. The results show that the topic information obtained from news pages with this method has good accuracy and completeness, and that the method extracts news topics effectively.
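The HTML parsing process described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `DivNode`, `DivForestBuilder`, and `build_div_forest` are hypothetical, and the choice to attach text to the innermost enclosing DIV is an assumption made for illustration. It uses only the standard-library `html.parser` module, and it does not cover the noise filtering or STU-DIV pruning steps, which the abstract does not specify in detail.

```python
from html.parser import HTMLParser


class DivNode:
    """One node of a DIV tag tree (hypothetical structure for illustration)."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)  # attributes of the <div>, e.g. id or class
        self.children = []        # nested DIV tag trees (subtrees)
        self.text = []            # text fragments whose innermost DIV is this one


class DivForestBuilder(HTMLParser):
    """Collects every <div> into a forest, keeping the nesting structure."""

    def __init__(self):
        super().__init__()
        self.roots = []   # top-level DIV tag trees: the DIV forest
        self.stack = []   # currently open <div> elements, innermost last

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # non-DIV tags do not create nodes in this sketch
        node = DivNode(attrs)
        if self.stack:
            # A nested DIV becomes a subtree of the DIV that encloses it.
            self.stack[-1].children.append(node)
        else:
            self.roots.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Assumption: text outside any DIV is ignored.
        if text and self.stack:
            self.stack[-1].text.append(text)


def build_div_forest(html):
    """Convert an HTML string into a list of top-level DivNode trees."""
    builder = DivForestBuilder()
    builder.feed(html)
    builder.close()
    return builder.roots
```

Calling `build_div_forest(page_source)` on a page's HTML returns the top-level DIV trees; a later noise filtering or pruning pass could then walk each tree through its `children` lists.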

Keywords/Search Tags:

Topic Information Extraction, DIV tag tree, STU-DIV model tree, Topic information correlativity analysis, Pruning algorithm

Related items

1	Tag Tree Template In The Pages Of Critical Information Extraction And Topic Identification
2	Topic Chain-based Topic Information Extraction From Chinese Food Complaint Documents
3	Research On The Rating Prediction Based On Dynamic Topic Analysis Of User Reviews
4	Research On Model Of Hot Topic Opinion Mining In Virtual Communities
5	Topic Optimization Method Based On Pointwise Mutual Information
6	The Research And Application Of Wechat Official Accounts Information Mining In Enterprise Information Service
7	Topic Analysis And Recommendation System Based On Scientific Research Documents
8	Research On WEB Topic Information Extraction Based On DOM Tree Node Importance
9	Research On Short Text Topic Information Mining Technology
10	Improved Algorithm For Topic Detection, Topic Trend Analysis And Prediction In Social Network