Research and Implementation of a Theme-Oriented Webpage Resource Acquisition System

Posted on: 2016-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: J J Guo
Full Text: PDF
GTID: 2308330461998548
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, humanity has entered the Internet age, and resources of many kinds have been aggregated and integrated into a huge information repository carried by the Internet. Obtaining the required information quickly, accurately, and efficiently from this multitude of resources is an urgent problem. Search engines, as tools for retrieving information, have become the main way for users to find information. However, traditional search engines have several disadvantages: the webpage index is very large, updates are slow, and query results have low accuracy. Vertical search engines emerged to address these problems. A theme-oriented information acquisition system is an important part of a vertical search engine and plays a decisive role in it. As society and technology progress, its range of application will continue to widen, so studying it in depth is of great significance.

This thesis centers on building a theme-oriented webpage resource acquisition system and studies in depth the key technologies involved in theme-oriented information acquisition. We improve the theme relevance calculation model, optimize the URL crawling strategy, and propose a theme webpage information acquisition algorithm based on the dual constraints of webpage content and hyperlinks.

The main work of this thesis is as follows:

(1) The thesis studies web information extraction technology. It analyzes extraction methods based on natural language processing, wrappers, ontologies, web queries, and the DOM tree structure, and examines the advantages and disadvantages of each. Taking into account the structure and characteristics of HTML documents, it also analyzes the working principle of tree-based DOM document parsing, the relevant APIs, and the specific parsing process.

(2) The thesis discusses the theme relevance calculation models, namely the Boolean model, the vector space model, and the probabilistic retrieval model. We study the working principle and implementation mechanism of each model in depth and analyze their advantages and disadvantages, which lays the foundation for improving the theme relevance calculation model. In addition, for the vector space model, we analyze in detail the method of calculating the weights of theme feature words.

(3) The thesis studies in detail the crawling strategies used when acquiring information. It analyzes best-first search, the Fish algorithm, the Shark algorithm, and other heuristic algorithms based on text content, covering their principles, working processes, advantages, and disadvantages. It also analyzes the HITS, PageRank, and TPR algorithms, which are based on the directed graph structure of the Web, and points out the merits of each.

(4) Building on the analysis of the theme relevance calculation models and of the advantages and disadvantages of the crawling strategies, and taking the structure of HTML documents into account, we improve the vector space model. Considering the influence of webpage content, link anchor text, and the URL string on the theme relevance of a URL, we optimize the URL crawling strategy. Combining the improved theme relevance calculation model with the optimized URL crawling strategy, we propose the theme webpage information acquisition algorithm based on the dual constraints of webpage content and hyperlinks.

(5) Taking soybean as the theme, the thesis builds the webpage resource acquisition system on Nutch and tests and analyzes its performance. The experimental results show that the system runs stably and acquires information with high accuracy.
Keywords/Search Tags:Information acquisition, theme relevance, web information extraction, focused crawler
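
The abstract describes scoring a candidate URL from webpage content, link anchor text, and the URL string, and then crawling in relevance order. The Python sketch below illustrates that general idea only. It is not the thesis's algorithm: the function and class names are hypothetical, raw term frequency stands in for the improved feature-word weighting, the linear combination and its weights (0.5, 0.3, 0.2) are illustrative assumptions, and the actual system is built on Nutch rather than on standalone code like this.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_vector(text):
    """Build a term-frequency vector (assumption: raw counts as weights)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two term-weight mappings (vector space model)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_score(theme, page_text, anchor_text, url, weights=(0.5, 0.3, 0.2)):
    """Combine content, anchor-text, and URL-string relevance into one score.

    The weights are hypothetical placeholders, not values from the thesis.
    """
    w_page, w_anchor, w_url = weights
    return (
        w_page * cosine(theme, term_vector(page_text))
        + w_anchor * cosine(theme, term_vector(anchor_text))
        + w_url * cosine(theme, term_vector(url))
    )


class Frontier:
    """Best-first crawl frontier: the highest-scoring unseen URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Skip URLs that were already queued to avoid revisiting them.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Hypothetical theme vector for the "soybean" theme used in the thesis.
theme = {"soybean": 1.0, "crop": 0.5, "yield": 0.5}
page_text = "Soybean yield improved this season"

frontier = Frontier()
for anchor, url in [
    ("contact us", "http://example.com/contact"),
    ("soybean varieties", "http://example.com/soybean/varieties"),
]:
    frontier.push(url, link_score(theme, page_text, anchor, url))

next_url, next_score = frontier.pop()
print(next_url, round(next_score, 3))
```

In this sketch both links come from the same page, so the page-content term is identical and the anchor text and URL string decide the order. A real focused crawler would also apply a relevance threshold before queuing a link; any such threshold would be a further assumption here.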