WEB Page Theme Block Identification According To Combination Features

Posted on: 2018-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhang
GTID: 2348330512999460
Subject: Computer application technology
Abstract/Summary:
In the current Internet era, the web is one of the major media for information, and web pages have become an important carrier for transmitting messages. Although almost every kind of information can be found in web pages, part of it is noise, and such noise harms automated web information collection and mining. For this and other reasons, finding the topic information in a web page and rejecting the noise has become a hot topic in computer science research. After analyzing and summarizing existing techniques, which rely on visual features alone or content features alone and suffer from the drawbacks of doing so, we propose an algorithm for recognizing theme blocks in web pages based on combination features. Experiments indicate that the algorithm improves the accuracy and stability of locating topic information.

The main research contents and innovations of this thesis are summarized as follows:

1) Implementing and improving the VIPS algorithm. The page segmentation rules are updated, and the block-size threshold is adjusted in accordance with the web page structure to control block granularity. Both changes make the resulting blocks more semantically meaningful.

2) Proposing an algorithm (BBM25), inspired by the BM25 algorithm, for computing the relevance weight between web content and the topic. Taking the block as the basic unit, BBM25 mainly considers term weight, term frequency, the length of the block's text, and so on.

3) Proposing an algorithm for identifying the theme block in a web page based on combination features. First, after segmenting a web page into blocks, we use a support vector machine (SVM) to predict from each block's visual features whether it is a theme block. Second, we analyze the content features of the same block and calculate its relevance weight using BBM25. We then combine the two methods to learn whether a block is really a theme block based on both its visual features and its content features. Finally, experiments with our method are performed and compared with methods that identify theme blocks using only visual features or only content features. The results confirm that our algorithm is more accurate and stable.
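
The abstract does not give the BBM25 formula or the rule used to combine the two predictions, so the following is only a rough sketch of the general idea under stated assumptions. It applies the classic BM25 formula (with a non-negative IDF variant) to page blocks instead of documents, and mixes the result with a visual-feature probability through a simple weighted sum. The function names, the parameters `k1`, `b`, and `alpha`, the choice of topic terms, and the weighted-sum rule are all illustrative assumptions, not the thesis's method.

```python
import math
from collections import Counter


def bm25_block_scores(blocks, topic_terms, k1=1.5, b=0.75):
    """Score each block (a list of tokens) against topic_terms.

    Assumption: this is standard BM25 with blocks treated as documents;
    the thesis's BBM25 may weight terms differently.
    """
    n = len(blocks)
    avg_len = sum(len(blk) for blk in blocks) / n if n else 0.0

    # Block frequency: in how many blocks does each term occur?
    block_freq = Counter()
    for blk in blocks:
        block_freq.update(set(blk))

    scores = []
    for blk in blocks:
        tf = Counter(blk)
        score = 0.0
        for term in topic_terms:
            if tf[term] == 0:
                continue  # term absent from this block contributes nothing
            idf = math.log(
                1 + (n - block_freq[term] + 0.5) / (block_freq[term] + 0.5)
            )
            # Length normalization: long blocks are penalized relative to average.
            norm = tf[term] + k1 * (1 - b + b * len(blk) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def combine(visual_prob, content_score, max_content, alpha=0.5):
    """Hypothetical combination rule: weighted sum of a visual probability
    in [0, 1] (e.g. from an SVM) and a content score scaled to [0, 1]."""
    content = content_score / max_content if max_content > 0 else 0.0
    return alpha * visual_prob + (1 - alpha) * content
```

As a usage illustration, the topic terms could be taken from the page title, and the block with the highest combined score could be reported as the theme block. Whether the thesis thresholds, ranks, or feeds both signals into a second classifier is not stated in the abstract.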
Keywords/Search Tags: theme block, VIPS, BBM25, visual features, text features, combination features