
Research Of Text Extraction Algorithm Based On Visual Semantic Block

Posted on: 2014-04-03, Degree: Master, Type: Thesis
Country: China, Candidate: B Hu, Full Text: PDF
GTID: 2268330395989038, Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the number of webpages is growing quickly, and people increasingly rely on search engines to find the information they want on the vast Internet. However, a webpage usually contains not only its body text but also noise such as navigation bars, advertising links, and recommended links, which greatly reduces the efficiency and accuracy of search engines.

This thesis proposes a text extraction algorithm based on visual semantic blocks. It removes the reliance of existing text extraction algorithms on the webpage's text and instead divides the page into semantic blocks according to its semantic characteristics. The largest semantic block is found first, further blocks with a similar structure are then searched for, and the body text of the page is finally extracted by repeating this search. On the one hand, because the algorithm does not depend on the distribution density of the webpage text, it achieves good results even on pages that contain a considerable amount of noisy text. In addition, pictures and videos embedded in the body text can also be extracted, which improves the robustness of the algorithm. On the other hand, the algorithm does not need to traverse the whole DOM-Tree to obtain this information; only the leaf nodes of the DOM-Tree are processed, which saves time and improves search efficiency.

The algorithm is evaluated experimentally on 300 webpages from 15 portals, covering thematic websites such as news sites, blogs, forums, and BBS. The results show that both the accuracy and the recall rate of the text extraction algorithm based on visual semantic blocks exceed 94%. Moreover, because it approaches the problem from a different angle, the algorithm can be combined with traditional algorithms based on webpage text to obtain better results.
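
The block-based idea described in the abstract (find the largest semantic block, then look for blocks with a similar structure) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the thesis relies on VIPS, which segments a rendered page, whereas the sketch below substitutes a much simpler proxy and treats each block-level DOM element as a "block". The names `BlockCollector`, `jaccard`, `extract_main_text`, the `BLOCK_TAGS` set, the Jaccard similarity measure, and the 0.5 threshold are all assumptions introduced for illustration. Only text leaves are handled; pictures and videos are omitted.

```python
from html.parser import HTMLParser

# Assumption: these tags stand in for "semantic blocks". VIPS would instead
# derive blocks from the rendered layout of the page.
BLOCK_TAGS = {"body", "div", "section", "article", "aside", "nav", "td", "ul", "ol"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Assign every text leaf to its nearest block-level ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, block_id or None)
        self.blocks = {}   # block_id -> {"texts": [...], "shape": set()}
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        block_id = None
        if tag in BLOCK_TAGS:
            block_id = self.next_id
            self.next_id += 1
        self.stack.append((tag, block_id))

    def handle_endtag(self, tag):
        # Tolerate unclosed children: pop until the matching tag is removed.
        if any(t == tag for t, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t, _ in self.stack):
            return
        # Walk up from the leaf to the nearest block-level ancestor.
        for depth in range(len(self.stack) - 1, -1, -1):
            block_id = self.stack[depth][1]
            if block_id is not None:
                rel_path = tuple(t for t, _ in self.stack[depth + 1:])
                block = self.blocks.setdefault(
                    block_id, {"texts": [], "shape": set()})
                block["texts"].append(text)
                block["shape"].add(rel_path)  # structure seen from the block
                return


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def extract_main_text(html, threshold=0.5):
    """Return the text of the largest block and of structurally similar blocks."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return []
    biggest = max(parser.blocks.values(),
                  key=lambda b: sum(len(t) for t in b["texts"]))
    result = []
    for block_id in sorted(parser.blocks):  # ids follow document order
        block = parser.blocks[block_id]
        if jaccard(block["shape"], biggest["shape"]) >= threshold:
            result.extend(block["texts"])
    return result


SAMPLE_HTML = """<html><body>
<nav><a href="/">Home</a><a href="/news">News</a></nav>
<div class="post"><h2>First post</h2><p>The first post has the longest body text on this page.</p></div>
<div class="ad"><a href="/buy">Buy now</a></div>
<div class="post"><h2>Reply</h2><p>A shorter reply.</p></div>
</body></html>"""

if __name__ == "__main__":
    for line in extract_main_text(SAMPLE_HTML):
        print(line)
```

In this sketch the navigation bar and the advertising block are dropped because their leaf structure (only links) shares nothing with the largest block, while the second post is kept because it has the same heading-plus-paragraph shape. Note that selection here depends on block structure rather than on text density, which is the property the abstract emphasises; a real system would need a more careful block definition and similarity measure.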
Keywords/Search Tags: Extraction of webpage, DOM-Tree, VIPS algorithm, Effective semantic block, Structure similarity