Deep Web Form Schema Extraction And Schema Integration

Posted on: 2010-01-23  Degree: Master  Type: Thesis
Country: China  Candidate: Z Gong  Full Text: PDF
GTID: 2178360272496615  Subject: Software engineering
Abstract/Summary:
As the Web develops, the amount of information online is growing at an alarming rate. A great deal of this information is stored in myriad searchable online databases and is accessible only by filling out an HTML form to submit a query. The volume of such information is estimated at around 400 to 500 times the amount of data shown directly on web pages. Traditional search engines cannot find it, because there are no static links through which they can reach it. The websites that hold such information are called the Deep Web or Hidden Web.

Traditional search engines use spider robots that follow static links to build a page index. The pages reached this way are usually called the Publicly Indexable Web or Surface Web. The essential difference between the two is that the Surface Web is visible to traditional search engines and the Deep Web is not. Compared with the Surface Web, the Deep Web is information-rich, well structured, high in value, and topic-specific. It is attracting more and more attention and is becoming a hot research topic.

Obtaining data from the Deep Web raises many challenges. The main work of this thesis addresses two questions: understanding Deep Web query interfaces, and schema matching and integration.

First, we must know how a schema is expressed so that a computer can recognize the semantic information conveyed by a Deep Web entry point. Once the interface information has been accurately "read" through sufficient understanding of the schema, a user's query can be adequately translated for each Deep Web entry point.

Second, given a semantic understanding of the Deep Web entry points, how to match these different Deep Web interfaces is itself a question. Deep Web entry points and the unified interface (form) used by the user may cover similar content yet differ in layout or in semantic relationships. They may use different labels, different vocabulary, or different value ranges. How to take the information from the unified form filled in by the user and translate it for each Deep Web entry point is a further question.

Our research is based on domain-ontology-based matching and covers query interface schema extraction and schema integration. An ontology is a formal, explicit specification of a shared conceptualization. "Conceptualization" refers to an abstract model of some phenomenon in the world, obtained by identifying the relevant concepts of that phenomenon. "Explicit" means that the types of concepts used, and the constraints on their use, are explicitly defined. "Formal" means that the ontology can be processed by machines. "Shared" means that the ontology captures agreed knowledge; that is, it is not confined to a few individuals but is accepted by the whole community.

After many years of development, ontologies are now applied in more and more fields. One major reason is that an ontology provides a basis for semantic communication (dialogue, interoperation, processing, sharing, and so on) within a domain between different entities (people, machines, software systems, and so on); in other words, it provides a common view. Because of these characteristics, we modified the schema extraction algorithm. It no longer focuses on the relationships between query elements in the interface or on the structure of the schema; instead, it extracts the relevant attributes associated with entities and saves them as a list. The list is then matched semi-automatically against the ontology and an integrated query interface is generated. This turns complex 1:m and m:n schema matching into 1:1 matching, which is comparatively mature at present.

The main work and research results are as follows:

1. We briefly summarize the development of the Deep Web, present its concept and characteristics, and review the current state of research at home and abroad. We then propose the main framework and the corresponding difficulties.
2. We introduce the concepts and background knowledge related to the main work of this thesis, including the characteristics of HTML, domain ontology, the Deep Web global schema, and WordNet, and we construct the ontology that serves as the basis for the follow-up study.
3. We study Deep Web schema extraction. We analyze Deep Web query interfaces in detail, propose the basic concept of a Deep Web schema and a vision-based page segmentation method, and use that method to divide the interface and extract its schema.
4. We study Deep Web schema integration. We discuss the feasibility of applying ontology in a Deep Web search engine system, describe in detail the ontology-based schema matching method and the attribute similarity computation used in this thesis, and then generate the integrated interface automatically.

Finally, we carried out an experiment on the entire algorithm we presented. We tested forms from six domains, including books, airfares, and jobs. The results showed that our schema extraction method achieved a rate above 85%. Next, we used the domain ontology of the book domain for the integration test. The results showed that the integration had a success rate of 84%. In most cases, our method of Deep Web form schema extraction and schema integration is feasible.

In short, with the information in the Deep Web increasing at full speed, Deep Web search engines are becoming more and more popular. The study in this thesis shows that domain ontology technology can be applied in the field of Deep Web search engines. As domain ontology technology and similarity computation are further improved in the future, we believe that the accuracy rate and success rate of our method will improve further.
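The flow described in the abstract (extract a flat attribute list from a query form, then match each attribute to one ontology concept) can be illustrated with a minimal sketch. It is not the thesis's algorithm. It assumes attributes can be read from `<label>` tags rather than by vision-based page segmentation, and it substitutes plain string similarity from Python's `difflib` for the thesis's ontology- and WordNet-based similarity. The names `BOOK_ONTOLOGY`, `extract_attributes`, and `match_to_ontology`, the label variants, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical mini-ontology for the book domain: concept -> known label variants.
BOOK_ONTOLOGY = {
    "title": ["title", "book title", "name of book"],
    "author": ["author", "writer", "author name"],
    "isbn": ["isbn", "isbn number"],
    "price": ["price", "cost"],
}


class FormAttributeExtractor(HTMLParser):
    """Collect the text of every <label> element as a flat attribute list."""

    def __init__(self):
        super().__init__()
        self.attributes = []
        self._in_label = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self._in_label = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_label:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._in_label:
            # Normalize whitespace, drop trailing colons, lowercase.
            text = " ".join("".join(self._buffer).split()).strip(" :").lower()
            if text:
                self.attributes.append(text)
            self._in_label = False


def extract_attributes(html):
    """Return the normalized label texts found in the given HTML."""
    parser = FormAttributeExtractor()
    parser.feed(html)
    parser.close()
    return parser.attributes


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a semantic measure."""
    return SequenceMatcher(None, a, b).ratio()


def match_to_ontology(attributes, ontology, threshold=0.8):
    """Map each attribute to at most one concept, or None if below threshold."""
    mapping = {}
    for attr in attributes:
        best_concept, best_score = None, 0.0
        for concept, variants in ontology.items():
            score = max(similarity(attr, variant) for variant in variants)
            if score > best_score:
                best_concept, best_score = concept, score
        mapping[attr] = best_concept if best_score >= threshold else None
    return mapping
```

Attributes mapped to `None` are the ones a semi-automatic process would hand to a person for review. Attributes from different forms that map to the same concept can be merged into a single field of the integrated interface.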
Keywords/Search Tags: Deep Web, Ontology, Schema Extraction, Schema Integration