The method keeps the main, form-related marks and eliminates the decorative, non-main marks. Based on these, it computes the depth of each attribute from the variation of the main-mark depth, and takes the depth of each characteristic word as one of its structural features. Because the order of the attribute characteristic words may differ between the interfaces of different Web databases, word order is not used as a feature.

Next, this thesis builds on the classic Vector Space Model (VSM). This model maps the information extracted from each database interface to a point in a multi-dimensional space, represented as a vector (C1, W1; C2, W2; ...; Cm, Wm). It then measures the distance between a Web database interface vector and each model vector, and takes that distance as the similarity between the test database and the category the model represents. In the vector space model, the classical term-weighting scheme is tf-idf (Term Frequency-Inverse Document Frequency). In view of the shortcomings of tf-idf, this thesis defines the term-weight formula from a new angle. At the same time, a new vector model is used to express a database query interface: it treats the attribute information of the query interface as a whole, and also considers each attribute's father characteristic word and its sub-characteristic words. Two factors, frequency and concentration degree, are combined as the overall criterion for selecting characteristic words. Each characteristic word is assigned an importance rank according to its position in the HTML document and given a corresponding quantified weight; this weight is then multiplied by a domain correlation coefficient, determined by the word's degree of correlation with the domain, to obtain the final weight.
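The thesis replaces classical tf-idf with its own weight formula, which the abstract does not reproduce. As a baseline, the standard VSM pipeline it builds on can be sketched as follows; the token lists are invented toy data, and cosine similarity stands in for whatever distance measure the thesis actually uses:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Classical tf-idf weights: w(t, d) = tf(t, d) * log(N / df(t)).

    `docs` is a list of token lists, one per query interface.
    Returns one sparse {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy interfaces: two book-search forms and one flight-search form.
docs = [["title", "author", "isbn"],
        ["title", "author", "price"],
        ["departure", "arrival", "date"]]
vecs = tf_idf_vectors(docs)
```

With these vectors, the first book interface scores higher against the second book interface than against the flight interface, which is the behavior any reasonable weighting scheme, including the thesis's improved one, should preserve.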
By manually training on a pre-classified set of interface data, characteristic words are selected and a corresponding characteristic model is constructed for each category, which provides the foundation for automatic classification. In the model-construction algorithm, every model is represented as a vector that records each kind of characteristic word, its weight, and the count of its sub-characteristic words.

This thesis then organizes the model's characteristic-word vector as a tree: each father characteristic word is a parent node, and its sub-characteristic words are its child nodes. Using a depth-first traversal of the tree, the similarity between the test characteristic-word vector and each model characteristic-word vector is computed. For word-level similarity, the method relies on the structure and content of WordNet, using the is-a concept hierarchy or synonym relations between words to quantify the similarity between two words A and B. The similarity between two models is the sum, over all father and sub-characteristic words in the model vector, of their similarities with the corresponding father and sub-characteristic words of the test vector. The method then finds the interface model with the greatest similarity; if that similarity exceeds a threshold, the category corresponding to that model is assigned as the category of the test page.

Finally, this thesis takes several hundred Web pages as test data and, by means of tables and bar charts of recall, precision, and F1 value, makes a detailed statistical comparison to confirm the validity of the method. Although much research has been done in the area of the Deep Web, and a number of Deep Web data integration systems have been put forward, these are only research prototypes rather than practical applications.
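The tree-structured matching described above can be illustrated with a small sketch. The is-a hierarchy below is a tiny hand-built stand-in for WordNet, and the category models, words, and threshold are invented for the example; the thesis's real models come from the training step:

```python
# child -> parent links of a toy is-a hierarchy (stand-in for WordNet)
IS_A = {
    "author": "person", "writer": "person",
    "title": "name", "name": "entity", "person": "entity",
}

def ancestors(word):
    """The word followed by its chain of is-a ancestors."""
    chain = [word]
    while word in IS_A:
        word = IS_A[word]
        chain.append(word)
    return chain

def word_sim(a, b):
    """Path-based similarity: 1 / (1 + edges to the lowest common ancestor)."""
    if a == b:
        return 1.0
    ca, cb = ancestors(a), ancestors(b)
    for i, node in enumerate(ca):
        if node in cb:
            return 1.0 / (1.0 + i + cb.index(node))
    return 0.0

def model_sim(model, test_words):
    """Sum the best word similarity for every father and sub-characteristic
    word in the category model, mirroring the tree traversal above."""
    total = 0.0
    for father, children in model.items():
        for word in [father] + children:
            total += max((word_sim(word, t) for t in test_words), default=0.0)
    return total

def classify(models, test_words, threshold):
    """Pick the category with the largest similarity above the threshold."""
    best, score = max(((name, model_sim(m, test_words))
                       for name, m in models.items()), key=lambda x: x[1])
    return best if score >= threshold else None

# invented two-category model set: father word -> list of sub words
models = {"books": {"title": ["author", "isbn"]},
          "flights": {"departure": ["date"]}}
```

Here `classify(models, ["title", "writer"], 1.0)` returns `"books"`: "title" matches exactly and "writer" is close to "author" through their common ancestor "person" in the toy hierarchy.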
Following the framework of a Deep Web data integration system, some experts in China have classified and summarized existing work, but most of it is still exploratory: the types of Web databases handled are limited, automation is lacking, more adaptive training algorithms are needed to keep pace with the continuing development of the HTML standard, and some aspects of the work have only just begun or remain blank. Achieving a truly usable integrated system is therefore still an open issue.

Classification Of Deep Web Based On Model Matching

Posted on: 2011-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Li
Full Text: PDF
GTID: 2178360305955208
Subject: Computer application technology

Abstract/Summary:
With the rapid development of the World Wide Web, the massive amount of information contained in the Web has become available for use. According to the depth at which information is stored, the entire Web can be divided into the Surface Web and the Deep Web. In 2001, survey data from BrightPlanet indicated that there were approximately 200,000 existing Deep Web sites; that their resources held roughly 500 times the content of the Surface Web; that their average monthly traffic was more than 50% higher than that of the Surface Web; that more than 50% of Deep Web content was stored in domain-specific databases; and that 95% of Deep Web information was freely accessible. In 2004, UIUC investigated the scale of the entire Deep Web and estimated that more than 450,000 Web databases were accessible, provided by more than 307,000 websites, a growth of more than 1.5 times over the 200,000 database sites reported by BrightPlanet in 2001. It is therefore essential to research effective methods for using this Deep Web information. Large-scale Deep Web integrated search is one important approach: it provides a unified query interface through which many Deep Web sources can be queried conveniently at the same time. Large-scale Deep Web integrated search comprises: (1) Deep Web discovery; (2) query interface extraction; (3) source classification; (4) query transfer; (5) result merging. The research content of this thesis is to extract information from sample pages, construct Deep Web models by learning from this classified information, and thereby realize the classification of Deep Web databases.

Web page information is usually presented in HTML (Hypertext Markup Language), and a Deep Web database query interface is usually expressed as unstructured or semi-structured HTML text. To use the interface information effectively, it must be extracted from these HTML texts and turned into structured interface messages.
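The five stages listed above form a pipeline. The skeleton below sketches that architecture only; every function name is invented for illustration and every body is a stub standing in for the real subsystem the thesis or related work would supply:

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    source: str
    records: list

def discover_sources(seed_urls):          # (1) Deep Web discovery
    return seed_urls                      # stub: crawling would happen here

def extract_interface(source):            # (2) query interface extraction
    return {"source": source, "attributes": ["title", "author"]}

def classify_source(interface, models):   # (3) source classification
    return "books"                        # stub: model matching would decide

def transfer_query(query, interface):     # (4) query transfer
    # map the user's query onto the attributes this interface understands
    return {attr: query.get(attr) for attr in interface["attributes"]}

def merge_results(results):               # (5) result merging
    merged = []
    for r in results:
        merged.extend(r.records)
    return merged

def integrated_search(seed_urls, query, models):
    """Chain the five stages: only sources classified into the query's
    category receive a translated query, and their results are merged."""
    results = []
    for src in discover_sources(seed_urls):
        interface = extract_interface(src)
        if classify_source(interface, models) == query["category"]:
            native = transfer_query(query, interface)
            results.append(SearchResult(src, [native]))
    return merge_results(results)
```

The point of the sketch is the data flow between stages, not the stub logic: source classification, the subject of this thesis, sits between interface extraction and query transfer.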
An HTML document is composed of text and tag strings. In the information received at the client side, apart from video, audio, and other binary data, the remaining text can be divided into two parts. The first part consists of control identifiers belonging to HTML syntax, called tag strings (Tag String), each composed of "<", ">", and the string between them (for example, <html> or <form>). The second part is the text string: the real information we see when browsing the page, that is, the true content of the page. Because a Deep Web database query interface is displayed on the page through form elements, the interface is usually expressed by the code between <form> and </form>. However, not every form element is a database query interface: user registration, BBS discussion groups, and mail-composition modules, as well as search engines and meta-search engines, also appear as forms. To identify true Deep Web database query interfaces accurately, these form elements must be distinguished from one another. Because the advanced query page best reflects the situation of the underlying database, while the ordinary query interface generally covers only a few of its aspects, this thesis discusses only the advanced query interface (Advanced Search Interface).

First, this thesis describes a Deep Web query interface as a vector over certain characteristic words of the Web database query interface; these characteristic words are the information that must be extracted. The traditional approach to Web information extraction is based on the Document Object Model (DOM): the HTML text is parsed into a DOM tree, the position of the target information within the DOM hierarchy is taken as its coordinate, and this path information serves as the structural feature. That method is sensitive to the type of website and to the designer's choices, so it is difficult to guarantee reliable extraction of the characteristic words. This thesis improves on the traditional DOM-based extraction method and proposes a new method for extracting Deep Web query-interface features. The method defines a series of tag lists covering the form-related elements of the query interface.
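The extraction step described above can be sketched with the standard library alone: parse the page, collect the form controls between <form> and </form> together with their DOM depth, and filter out forms that are clearly not query interfaces. The password-field heuristic for rejecting login/registration forms is an invented example, not the thesis's actual criterion:

```python
from html.parser import HTMLParser

FORM_TAGS = {"input", "select", "textarea", "option", "label"}
VOID = {"input", "br", "img", "hr", "meta", "link"}   # tags with no close tag

class FormFeatureParser(HTMLParser):
    """Collects (tag, type attribute, DOM depth) for controls inside a form."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.in_form = False
        self.features = []

    def handle_starttag(self, tag, attrs):
        node_depth = self.depth + 1
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in FORM_TAGS:
            self.features.append((tag, dict(attrs).get("type", ""), node_depth))
        if tag not in VOID:               # void elements do not open a level
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        if tag not in VOID:
            self.depth -= 1

def looks_like_query_form(features):
    """Crude stand-in filter: login/registration forms carry password
    fields, while query interfaces use text boxes, selects, and labels."""
    kinds = {kind for _, kind, _ in features}
    return bool(features) and "password" not in kinds
```

For a page like `<html><body><form><label>Title</label><input type="text"></form></body></html>`, the parser records each control with its depth (here 4), giving exactly the kind of structural depth feature the extraction method uses.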
Keywords/Search Tags: Deep Web, Document Object Model, Vector Space Model, Similarity