Study And Implementation Of A Two-phase Query Interface Extraction Technique Based On Domain Features

Posted on: 2009-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: G A Li
Full Text: PDF
GTID: 2198360308479271
Subject: Computer application technology
Abstract/Summary:
In recent years, a growing share of the valuable data on the Internet has moved deeper into the web, stored in dynamically changing database systems that sit behind query interfaces. Compared with "surface web" data, which is based on static pages, "deep web" data covers a wider range of application fields, and researchers are paying close attention to how to integrate such large-scale data. The query interface is the gateway to a "deep web" back-end database. Because each interface is a form created independently, query interfaces have nonstandard attribute patterns and semantics that are hard to interpret, and extraction results tend to be polarized. The foremost challenge for "deep web" data integration is therefore to extract the pattern information of query interfaces. This thesis studies these problems.

Based on the similarity of interfaces within the same domain, the thesis proposes a two-phase query interface extraction algorithm. The algorithm is based on domain features and divides query interface extraction into two phases. In the first phase, it uses domain features to extract the elements and labels that can be extracted with high accuracy. In the second phase, automatic extraction is guided by the directionality reflected in the first-phase results. The thesis also builds a query interface extraction system based on the proposed algorithm. The system consists of two modules: a query interface classification module and a query interface extraction module. The first determines the domain of each query interface and classifies it; the second extracts the classified query forms using the proposed algorithm. The system removes the algorithm's restriction to a particular query form domain and improves its practicality and generality.

Experiments were conducted on real-world and synthetic data sets. The results show that the two-phase query interface extraction algorithm avoids the diffusion phenomenon in query interface extraction, resolves the polarization of extraction results found in existing extraction methods, and achieves high recall and precision.
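
The abstract does not give the algorithm's details, so the following Python sketch is only one possible reading of the two-phase idea, not the thesis's implementation. It assumes a form has already been reduced to labels and fields on a grid, that "domain features" can be approximated by a small vocabulary of known attribute names, and that "directionality" means the dominant position of a label relative to its field (left or above). The names `Item`, `DOMAIN_TERMS`, `direction`, and `extract`, and the sample form, are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    kind: str  # "label" or "field"
    text: str  # label text, or a field identifier
    row: int   # grid position (assumed to come from a layout step)
    col: int


# Hypothetical domain vocabulary for a "books" domain.
DOMAIN_TERMS = {"title", "author", "isbn"}


def direction(label, field):
    """Return where the label sits relative to the field, or None."""
    if label.row == field.row and label.col == field.col - 1:
        return "left"
    if label.col == field.col and label.row == field.row - 1:
        return "above"
    return None


def extract(items, domain_terms):
    labels = [i for i in items if i.kind == "label"]
    fields = [i for i in items if i.kind == "field"]
    paired = {}        # field identifier -> label text
    votes = Counter()  # how often each direction was seen in phase 1

    # Phase 1: pair only labels that are known domain terms and that
    # have exactly one adjacent unpaired field (high-confidence matches).
    for lab in labels:
        if lab.text.lower() not in domain_terms:
            continue
        candidates = [(f, direction(lab, f)) for f in fields
                      if f.text not in paired and direction(lab, f)]
        if len(candidates) == 1:
            field, d = candidates[0]
            paired[field.text] = lab.text
            votes[d] += 1

    if not votes:
        return paired
    dominant = votes.most_common(1)[0][0]

    # Phase 2: pair the remaining labels using the dominant direction
    # observed in phase 1 to resolve ambiguous neighbours.
    used = set(paired.values())
    for lab in labels:
        if lab.text in used:
            continue
        for f in fields:
            if f.text not in paired and direction(lab, f) == dominant:
                paired[f.text] = lab.text
                break
    return paired


form = [
    Item("label", "Title", 0, 0), Item("field", "q1", 0, 1),
    Item("label", "Author", 1, 0), Item("field", "q2", 1, 1),
    Item("label", "Binding", 2, 0), Item("field", "q3", 2, 1),
    Item("field", "q4", 3, 0),  # sits below "Binding": an ambiguous neighbour
]
result = extract(form, DOMAIN_TERMS)
print(result)  # {'q1': 'Title', 'q2': 'Author', 'q3': 'Binding'}
```

In this toy form, "Binding" is not in the vocabulary and is adjacent to two fields. The "left" direction learned from "Title" and "Author" in the first phase assigns it to `q3` rather than `q4`, which is the kind of guidance the second phase is described as providing.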
Keywords/Search Tags: Deep Web, Query Interface, Automatically Extract, Domain Feature, Pattern