
Research and Implementation of Information Extraction Algorithms Based on Visual Similarity

Posted on: 2012-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: H Jiang
Full Text: PDF
GTID: 2178330335950399
Subject: Computer software and theory
Abstract/Summary:
This thesis studies how to automatically extract structured field information from web pages on the Internet and use it to build a knowledge base for intelligent querying. Most web pages today are written in HTML, and researchers generally convert them into DOM trees so that information can be extracted by pattern matching. The prevailing approach is to write a dedicated wrapper for each kind of web page, but wrappers scale poorly and are inflexible: even a slight revision of a page can severely reduce their accuracy.

Information resources grow every day, and it becomes harder every year for search engines to locate what users need. Faced with gigabytes of information, full-text keyword search clearly fails to meet the demand. Helping users accurately identify the material they truly need has become an urgent problem.

Information extraction technology can address this problem effectively. It quickly retrieves the information users truly need, which shortens the search process and reduces labor and time costs. It can also integrate information from distributed sources into a comprehensive whole, which avoids the inconsistencies introduced when material is compiled manually and improves the validity and practicality of the information. Information extraction has many applications; one of the most successful is the price comparison shopping guide. Such systems have become commercially available in the past two years, with Jango, MySimon, and Junglee among the better-known examples.

In web information extraction, pages are first preprocessed and converted into a model that machine learning methods can handle. There are currently three main extraction models: the DOM tree model, the understanding model, and the visual model.
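As a rough illustration of the DOM tree conversion mentioned above, the following sketch builds a simple tree from an HTML fragment using Python's standard `html.parser` module. It is not the thesis's implementation: the `Node` class, the `build_dom` and `tag_paths` helpers, and the sample markup are all hypothetical names chosen for this example, and the tree carries no visual (rendering) information.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot contain children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal DOM node: tag name, attributes, text, and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a Node tree while the parser walks the HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any,
        # and close it; unmatched end tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_dom(html):
    """Parse an HTML string and return the root of the Node tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Yield the tag path of every node in document order."""
    path = f"{prefix}/{node.tag}"
    yield path
    for child in node.children:
        yield from tag_paths(child, path)


if __name__ == "__main__":
    sample = ('<ul><li class="c"><h3>Great</h3><p>Nice phone</p></li>'
              '<li class="c"><h3>Bad</h3><p>Broke fast</p></li></ul>')
    for p in tag_paths(build_dom(sample).children[0]):
        print(p)
```

Pages generated from a template tend to produce repeated tag paths like these, which is the regularity that pattern-matching extraction relies on.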
Once a page has been converted into one of these models, a second step analyzes the model to mine structured information. There are four main approaches: ontology-based methods, Markov methods, dynamic Bayesian network methods, and conditional random field (CRF) methods.

The main objective of information extraction research is to build information extraction systems that transform pages containing large amounts of semi-structured or even unordered information into structured data. Such systems are built mainly by one of two approaches, knowledge engineering or machine learning, each with its own advantages and disadvantages.

Information on the Internet falls into three main categories: free text, structured text, and semi-structured text. Because we found that semi-structured text accounts for a large proportion, this thesis focuses on it. Current web pages are generally generated automatically by filling content into a specific template, so their structure and layout are highly similar.

The information extraction system designed in this thesis is mainly intended for review websites. It first extracts the list of comments, that is, the title, body text, publication time, and rating (degree of praise) of each comment. The system combines similarity algorithms, visual characteristics, and DOM tree analysis.

We first build a new DOM tree model annotated with visual characteristics, which we call the vision-DOM tree model. We then use several similarity algorithms to automatically find sets of similar substructures in the vision-DOM tree, in two basic steps:

1. Using a modified edit distance algorithm combined with a tri-gram algorithm, find all repeated substructures of the tree, and use the visual characteristics recorded in the vision-DOM tree to filter out interfering sets of repeated substructures.
2. Because the same attribute is generally displayed at similar positions in the vertical structure, use a projection method to find similar nodes. The similarity between nodes is determined with the cosine measure.

We implemented these similarity algorithms based on visual characteristics in the comment extraction system, which achieves automatic information extraction of this kind. The test set comprised 84 pages with 825 comment URLs, and the experimental results show that extraction works well. In a comparison with MDR, a system that likewise uses DOM tree technology and visual characteristics, our system performed better in both accuracy and extraction time. The experimental results show that the algorithm achieves high extraction precision and recall.

Some errors remain in the extraction of body text and titles. In future work, we will improve the projection algorithm to address this problem. The screening rules can also be optimized further to reduce time complexity. We may also consider additional similarity algorithms and select the best combination of methods, generalize the approach to broader web information retrieval techniques, and add machine learning algorithms to assist training and realize automatic extraction of web information.
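The three similarity measures named in the two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's method: the abstract does not say how the edit distance was modified, so the plain Levenshtein distance is used here; scoring tri-grams with Jaccard overlap is an assumption; and the function names and the idea of comparing child-tag sequences or numeric feature vectors are hypothetical.

```python
import math


def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences (e.g. tag lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def trigrams(seq):
    """Set of all windows of three consecutive items in seq."""
    return {tuple(seq[i:i + 3]) for i in range(len(seq) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of tri-gram sets (an assumed scoring choice)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        # Sequences shorter than three items: fall back to exact equality.
        return 1.0 if list(a) == list(b) else 0.0
    return len(ta & tb) / len(ta | tb)


def cosine(u, v):
    """Cosine measure between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical child-tag sequences of two candidate comment records.
    rec_a = ["li", "h3", "p", "span"]
    rec_b = ["li", "h3", "p", "div"]
    print(edit_similarity(rec_a, rec_b))
    print(trigram_similarity(rec_a, rec_b))
    # Hypothetical position features (left, top, width, height) of two nodes.
    print(cosine([10, 120, 300, 20], [10, 180, 300, 20]))
```

In a pipeline like the one described, the first two scores would flag sibling subtrees with near-identical tag structure as repeated records, while the cosine measure would compare layout features of nodes inside those records.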
Keywords/Search Tags:DOM Tree, Visual Characteristics, Edit Distance, Tri-gram Algorithm, Cosine