
Research On APK Crawler With Automatic Pagination Detection And Search Results Extraction

Posted on: 2021-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhao
GTID: 2428330632462631
Subject: Computer technology
Abstract/Summary:
At present, a large number of APK sites of uneven quality are distributed across the Internet, and the proliferation of malicious APKs can easily expose users who download them by mistake to threats such as privacy leaks, malicious fee deductions, and telecommunications fraud. It is therefore necessary to crawl and monitor the APKs on these sites in real time to reduce the harm. However, the common existing algorithms for identifying page number links, such as GL&CSL, BERyL, and XPath longest path detection, suffer from long recognition times and low accuracy. The common search result extraction algorithms, such as ViPER, CTVS, and STEM, also suffer from long extraction times and from incorrectly extracting recommendation lists. In view of these problems, this thesis surveys the current state of research and the key technologies of crawlers, proposes a new page number link recognition algorithm and a new information extraction algorithm, and designs and implements an APK crawler system. The specific work is summarized as follows.

Firstly, to address the long running time and low accuracy of existing page number link recognition algorithms, this thesis proposes a page number link recognition algorithm based on the text of paginated elements and hyperlink features. The algorithm first selects the DOM tree that meets the requirements, and then filters based on the text, hyperlinks, and other characteristics in that tree. Experiments on a large number of web pages show that this algorithm improves the accuracy and reduces the time of identifying page numbers.

Secondly, to address the long extraction time and heavy noise of common extraction algorithms, this thesis presents a search result extraction algorithm based on path signatures. The algorithm generates signatures for the paths in the DOM of the page, aggregates paths with similar signatures, and then applies filters to remove noise. Experiments on a large number of web pages show that this algorithm effectively filters out recommendation list information and that, because the search results can be obtained without rendering the page, its extraction time is shorter and its performance better than those of other algorithms.

Thirdly, an automated APK crawler system is designed and implemented. To give the crawler system better scalability and reliability, this thesis builds a Redis cluster service and uses Conductor as the task scheduling framework to ensure efficiency. Experiments on multiple sites show that the system can automatically crawl the APKs distributed on the Internet without human intervention.
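
The abstract does not give the details of the path-signature algorithm, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes that a signature is simply the chain of tag names from the root to a text node, that "similar" means identical, and that the noise filter keeps the repeated group whose text matches the search keyword most often. The names PathSignatureParser, extract_results, min_repeats, and SAMPLE are all hypothetical. The page is parsed as raw HTML with the Python standard library, so no rendering is involved.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathSignatureParser(HTMLParser):
    """Groups text snippets by the tag path (assumed signature) leading to them."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from the root to the current node
        self.groups = defaultdict(list)  # signature -> text snippets in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.groups["/".join(self.stack)].append(text)


def extract_results(html, query, min_repeats=3):
    """Return the texts of the repeated path group that best matches the query.

    Assumed filter: a group must repeat at least min_repeats times, and the
    group with the most keyword hits wins; groups with no hits are noise.
    """
    parser = PathSignatureParser()
    parser.feed(html)
    parser.close()

    candidates = [texts for texts in parser.groups.values()
                  if len(texts) >= min_repeats]
    if not candidates:
        return []

    def hits(texts):
        return sum(query.lower() in t.lower() for t in texts)

    best = max(candidates, key=hits)
    return best if hits(best) > 0 else []


# Hypothetical search page: one result list and one recommendation list.
SAMPLE = """
<html><body>
<ul id="results">
  <li><a href="/app/1">Chess Master</a></li>
  <li><a href="/app/2">Chess Online</a></li>
  <li><a href="/app/3">Speed Chess</a></li>
</ul>
<div class="recommend">
  <p><a href="/app/7">Photo Editor</a></p>
  <p><a href="/app/8">Weather Now</a></p>
  <p><a href="/app/9">Music Box</a></p>
</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_results(SAMPLE, "chess"))
```

On the sample page, the three entries under html/body/ul/li/a share one signature and match the keyword, so they are returned, while the equally repetitive recommendation block under html/body/div/p/a is dropped because none of its entries match. A real filter would likely need more signals than a keyword count; this one is chosen only to keep the sketch short.
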
Keywords/Search Tags: Search result extraction, Identify page numbers, APK crawler system