The Research Of Key Technology About Auto Ticketing Deep Web Data Collection System

Posted on: 2017-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: X F Yang
GTID: 2348330485481661
Subject: Computer Science and Technology
Abstract/Summary:
Internet technology is widely used in online passenger ticketing for air, rail, and road transport, and online bus ticket booking in particular is developing rapidly. Building an online bus ticket booking system first requires collecting the relevant subject data from the Internet. Traditional search engine crawlers can fetch static web resources but cannot reach the Deep Web resources stored in Web databases. To address this problem, this thesis analyzes the structure and characteristics of the Deep Web resources of the auto ticketing network and, on that basis, designs a Deep Web data collection system for the auto ticketing network. The key technologies of the data collection system are studied and implemented: a query page recognition algorithm for the auto ticketing network is proposed; the recognized pages are classified to obtain a clean data set of pages that contain auto ticketing query interfaces; and the framework and core function modules of the prototype system are introduced in detail. In addition, a large number of experiments are performed to verify the proposed algorithms. The main research work is as follows:

(1) A query page recognition algorithm for the auto ticketing network is proposed. With the development of Web programming technology, pages on the same subject can use different HTML tags to present the same visual feature information. As a result, existing structure similarity algorithms, which must match HTML tag names to measure the structural similarity of web pages, cannot accurately determine the theme of a page. A subject page recognition algorithm based on the tag tree adjacency matrix is therefore proposed: it constructs the adjacency matrix of each page's tag tree and uses the structural characteristics of that matrix to compute the structural similarity between pages, thereby identifying pages on the same topic. The experimental results indicate that the algorithm reaches 100% recall with 96% precision at its best, and 97% recall with 89% precision on average.

(2) A decision tree classification model is used to classify the data recognized by the auto ticketing query page recognition algorithm, and the selection of the best decision attribute in the algorithm is improved. The data set produced by the recognition algorithm may contain query interfaces of other themes, and such erroneous query interfaces not only degrade the performance of the data collection system but also waste large amounts of storage and network bandwidth. The set output by the recognition algorithm therefore needs to be classified. In practical applications, the distribution of the training data cannot fully represent the test data, so an effective classification model cannot be obtained. This thesis combines information gain with the eigenvector method to determine the weight of the best decision attribute. The experimental results show that, while maintaining the same accuracy, the improved algorithm shows a significant improvement as the amount of test data increases.

(3) A prototype Deep Web data collection system for the auto ticketing network is designed. The framework and core function modules of the prototype system are described in detail, specifically the theme crawler module, the query interface classification module, the Deep Web data capture module, and the Deep Web data extraction module.
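
The idea in item (1), comparing pages through the adjacency matrix of their tag trees rather than through tag names, can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual similarity measure, so everything below is an assumption made for illustration: the names `TagTreeBuilder`, `adjacency_matrix`, and `structure_similarity` are hypothetical, the input is assumed to be well-formed HTML, and the score (cosine similarity of the matrix row sums, i.e. the number of children of each node in document order) is a stand-in rather than the author's method.

```python
import math
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Records the parent of every tag in document order, ignoring tag names."""

    def __init__(self):
        super().__init__()
        self.parents = []  # parents[i] is the parent index of node i, -1 for a root
        self.stack = []    # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        node = len(self.parents)
        self.parents.append(self.stack[-1] if self.stack else -1)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed markup: each end tag closes the innermost open element.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def adjacency_matrix(html):
    """Return the parent-to-child adjacency matrix of the page's tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    size = len(builder.parents)
    matrix = [[0] * size for _ in range(size)]
    for child, parent in enumerate(builder.parents):
        if parent >= 0:
            matrix[parent][child] = 1
    return matrix


def structure_similarity(html_a, html_b):
    """Illustrative score in [0, 1]: cosine similarity of the row-sum vectors."""
    deg_a = [sum(row) for row in adjacency_matrix(html_a)]
    deg_b = [sum(row) for row in adjacency_matrix(html_b)]
    size = max(len(deg_a), len(deg_b))
    deg_a += [0] * (size - len(deg_a))  # pad the shorter vector with zeros
    deg_b += [0] * (size - len(deg_b))
    if deg_a == deg_b:
        return 1.0
    dot = sum(x * y for x, y in zip(deg_a, deg_b))
    norm = math.sqrt(sum(x * x for x in deg_a)) * math.sqrt(sum(y * y for y in deg_b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Two pages with the same tree shape but entirely different tag names.
    page_a = "<form><div><input><input></div><button></button></form>"
    page_b = "<section><p><br><br></p><span></span></section>"
    print(structure_similarity(page_a, page_b))
```

Because tag names never enter the matrix, two query pages laid out with different HTML tags but the same tree shape receive the same representation, which is the property the abstract attributes to the adjacency-matrix approach. A page would then be accepted as a query page when its score against a known sample exceeds some threshold; that threshold is likewise not specified in the abstract.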
Keywords/Search Tags:Internet, Auto ticketing network, Web recognition, Classification model, System design