
Research On Automatic Extraction Algorithm Of Internet Web Technology Data

Posted on: 2022-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: H K Zhang
GTID: 2518306536991739
Subject: Computer Science and Technology
Abstract/Summary:
The proposal for the 14th Five-Year Plan explicitly names "building a national high-end exchange platform for scientific research papers and scientific and technological information" as one of the tasks for "strengthening the national strategic scientific and technological strength". The Internet has become the main source of information, largely in the form of web articles. In particular, web pages for papers, which contain both scholar information and paper content, play a key role in building knowledge graphs for academic fields. However, different data sources have different page structures, so programmers must write dedicated extraction code for each kind of page, which is time-consuming. Reducing manual effort in web data extraction and developing an automatic data collection method based on Internet technology is therefore the starting point of this thesis.

First, to address the slow removal of static template noise in pages, this thesis proposes a site noise removal algorithm based on a hybrid ternary tree, which compares two pages from the same data source from top to bottom and removes the parts they share. To address the incomplete removal of dynamic template data, it proposes a weighting method based on the similarity between nodes: each node's initial noise value is corrected according to the noise of other nodes, and noise and body text are then separated by clustering. The proposed method avoids incomplete noise removal, speeds up noise removal, reduces the number of nodes by 9/10, and yields pages about 1/14 to 1/10 the size of the original HTML document.

Second, to address the fragmentation that arises when constructing page visualization blocks, this thesis proposes an improved clustering algorithm based on BIRCH. It revises the definition of the traditional clustering feature tree, integrates the idea of DBSCAN density clustering to construct a density clustering feature forest, and dynamically adjusts the cluster radius according to a correction factor to complete clustering. The algorithm retains the low time complexity of BIRCH, avoids over-fragmentation of visualization blocks, and effectively reduces clustering error.

Third, to address the difficulty of matching visualization blocks to fields, this thesis takes Internet scientific paper data as an example and proposes three effective field matching methods. For a new data source, only the matching rules need to be modified according to the characteristics of the fields to be extracted.

Finally, the automatic extraction algorithm built on the proposed methods is tested on a data set of web paper pages. Four evaluation metrics, (?), (?), F1 and (?), are used to verify the accuracy and robustness of the method, and possible causes of errors are given. The (?) value of the experimental results on each data set exceeds 94.44%.
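
The top-down comparison of two pages from the same data source can be illustrated with a minimal sketch. This is not the thesis's hybrid-tree algorithm, whose details the abstract does not give; it only shows the general idea of dropping subtrees that two pages share. The names `Node`, `TreeBuilder`, `parse`, `strip_shared` and `texts` are hypothetical, and aligning children by position is a simplifying assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def signature(self) -> tuple:
        # Structural fingerprint of the whole subtree (attributes are ignored).
        return (self.tag, self.text, tuple(c.signature() for c in self.children))


class TreeBuilder(HTMLParser):
    """Builds a simplified tag/text tree from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.stack[-1].text = (self.stack[-1].text + " " + text).strip()


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def strip_shared(a: Node, b: Node) -> Node | None:
    """Return a copy of `a` without the parts it shares with `b` (None if identical)."""
    if a.signature() == b.signature():
        return None  # identical subtree: treated as template noise
    kept = []
    for i, child in enumerate(a.children):
        other = b.children[i] if i < len(b.children) else None
        if other is None or other.tag != child.tag:
            kept.append(child)  # no counterpart in b: keep as page-specific
        else:
            pruned = strip_shared(child, other)
            if pruned is not None:
                kept.append(pruned)
    return Node(a.tag, a.text if a.text != b.text else "", kept)


def texts(node: Node | None) -> list[str]:
    """Collect the remaining text in document order."""
    if node is None:
        return []
    out = [node.text] if node.text else []
    for child in node.children:
        out.extend(texts(child))
    return out
```

Calling `texts(strip_shared(parse(page_a), parse(page_b)))` on two pages that share navigation and footer markup would return only the text unique to `page_a`. Recomputing `signature()` at every level keeps the sketch short but repeats work; a practical version would presumably cache or hash subtree fingerprints.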
Keywords/Search Tags: Scientific paper data, Web data extraction, Web page noise elimination, Visualization block construction, Hybrid hash tree