
Research And Implementation Of Vertical Search Engine On Book Subject

Posted on: 2015-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: J W You
GTID: 2298330452953251
Subject: Computer Science and Technology
Abstract/Summary:
The WWW has become an important repository of information with the emergence of the Internet. Internet users obtain information of interest from this repository by relying on the search services provided by search engines. Traditional general search engines can meet users' basic search needs, but because of their broad information coverage, the results returned include a large amount of information that users do not care about. Users must filter the search results further, and these additional filtering operations degrade the user experience. Vertical search engines make up for this weakness by narrowing the information domain they cover compared with general search engines. A vertical search engine indexes only information within a certain professional field or subject field, and can therefore ensure that the content retrieved is what users actually want. In addition, vertical search engines perform some information integration on cluttered network content. They help users quickly identify the most important information by directly showing the structured data extracted from cluttered network information.

The basic concepts and classifications of search engines were introduced, and then the working principle of a search engine was analyzed. By comparing the working principles of general and vertical search engine systems, key technologies of vertical search engines, such as the theme (topic-focused) web crawler algorithm and page similarity, were studied. The main work done in this thesis includes the following. Based on the characteristic that hyperlinks on the same subject have similar URL structures, the traditional Shark-search crawling algorithm was improved: when predicting the priority score of child links, the structural characteristics of the links were taken into account.
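The abstract does not give the formula used to fold URL structure into the Shark-search priority score, so the following is only a minimal sketch of the general idea. The pattern reduction (digits replaced by a placeholder), the position-wise similarity measure, the linear blend, and the names `url_pattern`, `structure_similarity`, `link_priority`, and `weight` are all assumptions made for illustration, not the thesis's method.

```python
import re
from urllib.parse import urlparse


def url_pattern(url):
    """Reduce a URL to a structural pattern (assumed representation):
    the host followed by its path segments, with every run of digits
    replaced by a placeholder so that /book/123.html and /book/456.html
    share one pattern."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc] + [re.sub(r"\d+", "<num>", s) for s in segments]


def structure_similarity(url_a, url_b):
    """Fraction of aligned positions at which two URL patterns agree,
    normalised by the longer pattern (an assumed measure)."""
    a, b = url_pattern(url_a), url_pattern(url_b)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def link_priority(content_score, child_url, relevant_urls, weight=0.3):
    """Blend a content-based score (for example, one produced by
    Shark-search) with the best structural match against URLs already
    judged relevant. `weight` is a hypothetical tuning parameter."""
    best = max(
        (structure_similarity(child_url, u) for u in relevant_urls),
        default=0.0,
    )
    return (1 - weight) * content_score + weight * best
```

With this sketch, a child link whose URL has the same shape as known book pages on the same site gets a higher priority than a link with an unrelated shape, even when the content-based scores of the two links are equal.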
The Vector Space Model was analyzed, and a method of secondary thematic evaluation was proposed to obtain more high-quality theme-related web pages. Based on the distribution characteristics of book metadata in a web page, a semi-automatic metadata extraction algorithm was designed using the HTMLParser parsing tool. A book-oriented vertical search engine prototype system was designed and implemented using Lucene, a full-text indexing development package, and Lucene's default method of sorting search results was customized. Finally, the improved crawling algorithm was analyzed experimentally. The results show that the algorithm performs better on specified websites, because the similarity between links on the same subject is relatively obvious there. In tests against a general search engine system, the search results of the book-oriented vertical search engine prototype system were more accurate. Moreover, the ordering of search results can be made more reasonable by customizing the default sort method.
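As a rough illustration of how the Vector Space Model can score a page against a theme, the sketch below computes cosine similarity between raw term-frequency vectors and applies a threshold. The tokenizer, the use of raw term counts instead of TF-IDF weights, the `threshold` value, and the helper names are assumptions; the abstract does not describe the details of the secondary thematic evaluation, so this shows only a single content-based relevance check.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector from lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.2):
    """Keep a page only if it is similar enough to the topic description.
    `threshold` is a hypothetical cut-off chosen for illustration."""
    score = cosine_similarity(term_vector(page_text), term_vector(topic_text))
    return score >= threshold
```

In a crawler, a check of this kind could be applied to a downloaded page after the link-based priority has already selected it, so that only pages whose content matches the theme are passed on to indexing.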
Keywords/Search Tags: Vertical Search Engine, Shark-Search, Information Extraction, Lucene