Font Size: a A A

Automatic Data Extraction Of The Query Answer From The Deep Web

Posted on:2010-08-15Degree:MasterType:Thesis
Country:ChinaCandidate:H B ZhengFull Text:PDF
GTID:2178360272497089Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In the modern world, the technique of WWW advanced quicker and quicker. TheInformation contained in the web with an Explosive growth, although we use the webspider to catch the information, and use the Search engine to help people to get theinformation they wanted. But we also think there is also much information on the net wecan't get just like the part of the iceberg under the sea. And the web spider of the classicsearch engine can't download the information. So we call this part of the web Deep Web.Now, the study of the Deep Web focus on these aspects: The discovery of the webdatabase, the extraction of the interface pattern of the query, the sort of the database, theintegration of the structure of query, the translation of the query, the extraction of the queryanswer, the annotation of the query answer and so on. In a word there are three mainmodules:1. the integration of the query interface module;2. the query processing module;3. the query answer processing module.My paper focuses on the third part that the query answer processing module. And inthis module also contains three small modules: the extraction of the query answer, theannotation of the query answer and the combination of the query answer.First, we introduce the technique of the information extraction. There are theinformation extraction based on the Dom-tree, the information extraction based on naturallanguage processing, the information extraction based on ontology, the informationextraction based on induction. Although these techniques give me a lot of help, after ouranalyzed the answer page of the deep web query, we find these techniques are very fit forthe situation. There are some flaws of them.The answer page of the Deep Web has two same points:1. In the answer page, our interest of the data block is the biggest one blocks of thepage and also is the only one.2. When you find the biggest one in your page, we always find there are some smallblocks in the block, and these small blocks are just what we want to extract fromthis page. But we find in the biggest block there are not only contained these small block we want to extract, there are also some other data blocks. Afterinvestigation we find these small block always have the same structure, and evenin this structure the CSS attitude or ID is the same. Because as the advancement ofthe Web technique, the Ajax frame based on the language JavaScript we can see itanywhere. One part of JavaScript code should band to the ID of one Dom node,for decrease the workload we construct the same structure.According to the two same point of the Deep Web, we define the definitions of thegeneralized data block and data region. Only the two definitions are not enough, I choosethe VIPS to help us to segment the pages and find the biggest part of the page. The VIPSalgorithm makes full use of page layout feature: it first extracts all the suitable blocks fromthe html DOM tree, then it tries to find the separators between these extracted blocks. Here,separators denote the horizontal or vertical lines in a web page that visually cross with noblocks. Finally, based on these separators, the semantic structure for the web page isconstructed. In this process , we use our generalized data block orientation algorithm to sitethe generalized data block. Our algorithm has three steps: first, we use the VIPS algorithmto construct the blocks of this level; second, we use the algorithm to judge the blockwhether or not is a generalized data block; third, if we find the generalized data block, thealgorithm finished, otherwise, we will return the first step, continue to the next level. In thisalgorithm, the algorithm to judge the block whether or not is a generalized data block is thecore of the whole algorithm:First step, we will judge the html tag of the root node of the current block, if its tag is
and the son node
contain node, the algorithm turn to MDR[16]algorithm, otherwise the algorithm turn to the second step.Second step, we compare each two nodes sequentially, because the generalized datablocks always connect together. First, we compare their son nodes'tags to judge whetherthey are the same.1. If they are the same, we will compare their attitudes(e.g. id, class or style), therewill be three situations:? If their attitudes are not the same, we will choose the second nodes tocompare the next nodes.? If their attitudes are the same, will add the nodes to the array (generalized datablocks), and pop all the nodes out of the stack(WaitingForNextCheck stack)? If their nodes don't have attitude, we will push them the first node withtags(e.g. the first node with 1, the fifth node with 5) to the stack(WaitingForNextCheck stack) waiting for the next comparison.2. If they are not the same, we will choose the second nodes to compare the nextnodeThird step, if the stack (WaitingForNextCheck stack) is not empty, the process will be:1. Pop two nodes;2. Compare the tags of the two nodes, if their tags are adjacent, turn to 3, or not wewill judge whether the stack is empty, if not delete the first node and pop the nextone turn to 2, otherwise, turn to 5.3. Compare with the tags of the two nodes, if they are the same, turn to 4, otherwise,we will judge whether the stack is empty, if not delete the first node and pop thenext one turn to 2, otherwise, turn to 5.4. Compare the attitudes of the two nodes, if they are the same we will add the twoblock to the array (generalized data blocks), and delete the first node and pop thenext one turn to 2, otherwise, turn to 5; or not, we will judge whether the stack isempty, if not delete the first node and pop the next one turn to 2, otherwise, turn to5; if they don't have the attitudes, first we will push the second nodes to the stack,the same time, we will judge whether they have son nodes. If all they have, turn to2; If only one of them have son nodes, we will judge whether the stack is empty, ifnot delete the first node and pop the next one turn to 2, otherwise, turn to 5; If allthey don't have son nodes, we will choose one of them to compare one in thearray(generalized data blocks) to judge if their structures are the same, if they aresame, we will add the two nodes to the array; otherwise, we will judge whetherthe stack is empty, if not delete the first node and pop the next one turn to 2,otherwise, turn to 5.5. The algorithm finished.Fourth step, we judge whether the length of the array is 0. If it is 0, we don't get thegeneralized data block; otherwise, we get them.The result of test shows us this algorithm has good effect and high performance.
Keywords/Search Tags:Deep Web, Information Extraction, Ajax, Generalized Data Block
PDF Full Text Request
Related items