
Research On Expression And Extraction Of Web Database’s Characteristics

Posted on: 2013-11-11  Degree: Master  Type: Thesis
Country: China  Candidate: L Zhao  Full Text: PDF
GTID: 2248330395459957  Subject: Management Science and Engineering
Abstract/Summary:
With the development of the Internet, the Web is rapidly growing deeper. The Web can be divided into the Surface Web and the Deep Web. The former is the set of pages that traditional search engines can reach; the latter generally refers to online databases that are accessible through the Web. The Deep Web holds more information than the Surface Web and surpasses it in both the quantity and the quality of that information, and it has become one of the main means of obtaining information. Because much of this information is locked in databases, and many pages are generated dynamically in response to specific queries, retrieving content from web databases (WDBs) would not only greatly expand search capabilities but also give users a convenient way to find information.

Query interfaces are the only path to a WDB, and each query interface corresponds to a different query mode. To find the right information, users fill in the interface and submit a request. However, with the development of scripting and dynamic web technologies such as JavaScript and Ajax, query interfaces are becoming more complex, and the data held in WDBs is varied. To access WDBs automatically and improve search capabilities, we therefore need to quickly identify the characteristics of such dynamic query interfaces, find the constraint relations among their elements, and give a quantitative description of WDB data that can be extracted.

To solve these problems, this thesis studies the representation of WDB characteristics, web database sampling, and methods for extracting WDB characteristics. The specific studies are as follows.

(1) Expression method for the characteristics of WDB query interfaces and WDB data

The attributes of WDB data are divided into three categories: text attributes, numeric attributes, and categorical attributes. For text attributes, we use word frequency to represent the characteristic. For numeric attributes, because they are continuous and the normal distribution is widely applicable, we use the expectation and deviation to express the characteristic. For categorical attributes, we express the characteristic statistically. After obtaining the characteristics of each type, we combine them into the final feature vector. Finally, because ontologies offer good knowledge representation and reasoning ability, this study uses an ontology to represent the query interface.

(2) Web database sampling based on a probability and statistics model

To make the extraction of WDB characteristics possible, this thesis provides a method for sampling a web database based on a probability and statistics model. Sampling a WDB has five key steps:
① Construct the initial query Q and the characteristics vector;
② Execute query Q and get the query results from the WDB;
③ Add the results to the sample set S, analyze them, and calculate the probability and conditional probability of the various characteristics to prepare for the next query;
④ Judge whether the loop should be terminated;
⑤ Construct the next query.
Experiments indicate that the sampling method is reasonable and effective.

(3) Extraction method for the characteristics of WDB query interfaces and WDB data

Based on the above research, this thesis presents methods for extracting the characteristics of WDB query interfaces and of WDB data. First, it presents extraction methods suited to query interfaces: a method based on regular expressions for form information, and a method based on Watir and Ajax for relationships. Together they extract the context information, attribute information, and relationship information of the query interface well. Second, to extract the characteristics of WDB data, we give three methods. For text data, we use word frequency; for numeric data, we use the normal distribution; for categorical data, we use the ratio of the number of records.
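
As a rough illustration of the three data-side characteristics named above (word frequency for text data, normal-distribution parameters for numeric or "digital" data, and record ratios for categorical or "catalog" data), the following Python sketch computes one feature per attribute from a set of sampled records. It is only a sketch under stated assumptions: the function names, the `schema` mapping, the `top_k` cutoff, and the sample book records are hypothetical and do not come from the thesis, and the population standard deviation is used here as one possible reading of the deviation measure.

```python
import re
from collections import Counter
from statistics import fmean, pstdev


def text_feature(values, top_k=2):
    """Relative frequency of the top_k most common words in a text attribute."""
    words = [w for v in values for w in re.findall(r"[a-z0-9]+", v.lower())]
    if not words:
        return {}
    return {w: n / len(words) for w, n in Counter(words).most_common(top_k)}


def numeric_feature(values):
    """Mean and population standard deviation (normal-distribution parameters)."""
    if not values:
        return {}
    return {"mean": fmean(values), "std": pstdev(values)}


def category_feature(values):
    """Share of sampled records that fall into each category."""
    if not values:
        return {}
    return {k: n / len(values) for k, n in Counter(values).items()}


# Assumed mapping from attribute type to extractor (not from the thesis).
EXTRACTORS = {
    "text": text_feature,
    "numeric": numeric_feature,
    "category": category_feature,
}


def feature_vector(records, schema):
    """Apply the extractor that matches each attribute's declared type."""
    return {
        attr: EXTRACTORS[kind]([r[attr] for r in records if attr in r])
        for attr, kind in schema.items()
    }


# Hypothetical sample, as might be drawn from a book database.
records = [
    {"title": "Deep Web Search", "price": 10.0, "format": "paperback"},
    {"title": "Web Database Sampling", "price": 20.0, "format": "ebook"},
    {"title": "Deep Web Crawling", "price": 30.0, "format": "paperback"},
    {"title": "Query Interface Design", "price": 40.0, "format": "paperback"},
]
schema = {"title": "text", "price": "numeric", "format": "category"}

vector = feature_vector(records, schema)
print(vector)
```

Keeping one small extractor per attribute type means the per-type results can be combined into a single feature vector, in the spirit of part (1) of the abstract; a real implementation would also need to decide how each attribute's type is detected, which this sketch simply takes as given.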
Keywords/Search Tags: WDB, Query Interface, Characteristic Expression, Characteristic Extraction, Database Sampling