
Manual Annotation Technology-based Web Content Extraction System Development

Posted on: 2011-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: H X Zhou
Full Text: PDF
GTID: 2208330335498047
Subject: Software engineering
Abstract/Summary:
Internet companies compete fiercely to return more accurate and more relevant search results to users, and the quality of web page content extraction is key to that effort. Many pages that look similar in features and structure have entirely different DOM (Document Object Model) trees because they are not written to strict specifications. This makes it difficult to extract valuable information accurately from visually similar pages, which in turn degrades the relevance of search results. Pages whose special content blocks must be extracted currently require considerable engineering effort. Providing operations staff with a management tool can reduce this labor cost effectively: the tool lets them specify, through manual annotation, which content to extract. In addition, the new methods use visual information and block information as extraction factors, which improves precision.

This thesis discusses the current state of web content extraction and the problems it faces, and establishes the need for a web content extraction system based on manual annotation technology at Shopping Search Co., Ltd. On this basis, the functions of the extraction system are discussed, and the core functions, including configuration management and result processing, are analyzed. The infrastructure of the system is then analyzed and the core subsystem is designed in detail, with the main focus on several core modules: manual annotation management, text blocks, extraction rule processing, and data export. Finally, extraction quality is compared against similar systems and the improvement in quality is analyzed.
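The idea of annotating a content block once and reusing that annotation on pages whose DOM trees differ can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual rule format or implementation, so everything here is an assumption made for illustration: the rule dictionary (`tag` plus `class`), the `AnnotatedBlockExtractor` class, the `extract` helper, and the two sample pages are all hypothetical. The sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AnnotatedBlockExtractor(HTMLParser):
    """Collects the text inside the first element that matches an annotated rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule      # hypothetical rule format: {"tag": ..., "class": ...}
        self.depth = 0        # nesting depth inside the matched element (0 = outside)
        self.done = False     # True once the first matching block has been closed
        self.chunks = []      # text fragments found inside the matched block

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1   # a nested element inside the matched block
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.rule["tag"] and self.rule["class"] in classes:
            self.depth = 1    # entered the annotated block

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # left the annotated block; ignore the rest of the page

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, rule):
    """Apply one annotated rule to one page and return the block's text."""
    parser = AnnotatedBlockExtractor(rule)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Two made-up pages that would render similarly but have different DOM trees.
page_a = '<html><body><div class="price">USD 19.99</div></body></html>'
page_b = ('<html><body><table><tr><td>'
          '<div class="item price"><b>USD</b> 24.50</div>'
          '</td></tr></table></body></html>')

# The rule an operator might record by annotating the price block once.
rule = {"tag": "div", "class": "price"}

print(extract(page_a, rule))
print(extract(page_b, rule))
```

Because the hypothetical rule matches on tag name and class rather than on an absolute path from the document root, the same annotation applies to both pages even though the target block sits at different depths. A real system would need richer rules, such as the visual and block information the abstract mentions.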
Keywords/Search Tags: Web Information Extraction, Manual Annotation, Visual Information, DOM, Wrapper