
The Internet Public Document Search System Based On Vertical Search Technology

Posted on: 2017-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Dong
Full Text: PDF
GTID: 2428330542976838
Subject: Computer technology
Abstract/Summary:
With the development of cloud storage, virtual offices, and other Internet technologies, the way people obtain information has shifted from the era of traditional social media to the Information Age, in which Google, Baidu, and other general search engines are widely used to retrieve information from the Internet. General search engines rely mainly on traditional web crawler technology, which collects Internet data broadly but fails to meet users' needs accurately in areas such as professional retrieval, indexing of massive data sets, speed of information updates, and personalized services. As a result, topic-focused web crawlers and vertical search services, which target specific users, cover specific domains, and meet specific needs, have emerged and are becoming a vital part of the information search field. Focusing on vertical document retrieval services, which are now frequently used and growing rapidly, this thesis studies targeted search and personalized applications for the various types of documents on the Internet, in order to build a public Internet document collection system based on vertical search technology. First, vertical search and information extraction technologies are used to collect and store the document data and web information that users care about from designated professional websites. Second, combined with acquisition-type meta-search technology, existing general search engines are used to collect and store the various types of public electronic documents available on the Internet. Third, incremental indexing technology is used to support secondary search over the collected document data and to display the results.

The innovations of this thesis are as follows. First, the author analyzes algorithms for intelligent recognition and processing of web page URL links, together with a text-density extraction algorithm based on the DOM tree, to optimize the document data collection program. Second, building on the Lucene full-text search engine, a custom self-indexing module is presented that, combined with the mature Baidu hard disk search technology, provides indexing, keyword search, and document extraction for Word, Excel, PDF, PPT, and other public documents on the Internet.
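
The abstract does not spell out the DOM-tree text-density algorithm, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a simple score (non-link characters divided by the number of child tags plus one) computed per block-level element, and returns the text of the highest-scoring block. The names `DensityParser`, `extract_main_text`, `BLOCK_TAGS`, and the sample page are all hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "article", "section", "td", "body"}
SKIP_TAGS = {"script", "style"}


class DensityParser(HTMLParser):
    """Collects text, link-text length, and tag counts per block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks
        self.stack = []      # block elements that are currently open
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"text": [], "link_chars": 0, "tags": 0})
        else:
            if tag == "a":
                self.link_depth += 1
            if self.stack:
                # Inline tags count against the innermost open block.
                self.stack[-1]["tags"] += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.stack[-1]
        block["text"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Return the text of the block with the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    best_text, best_score = "", 0.0
    for block in parser.blocks:
        plain_chars = sum(len(t) for t in block["text"]) - block["link_chars"]
        score = plain_chars / (block["tags"] + 1)
        if score > best_score:
            best_text, best_score = " ".join(block["text"]), score
    return best_text


if __name__ == "__main__":
    sample = (
        '<html><body>'
        '<div><a href="/">Home</a> <a href="/docs">Docs</a></div>'
        '<div><p>Vertical search engines index a single domain in depth.</p>'
        '<p>They trade coverage for precision.</p></div>'
        '<div><a href="/terms">Terms</a></div>'
        '</body></html>'
    )
    print(extract_main_text(sample))
```

In this sketch, navigation and footer blocks score zero because all of their text sits inside links, while the article block keeps a high ratio of plain text to tags. A production crawler would likely need a more robust parser and a density measure tuned to the target sites.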
Keywords/Search Tags: meta-search, vertical search, topic-focused web crawler, information extraction, document collection