
Research And Implementation Of Vertical Information Search Technology And Algorithm Based On Web

Posted on: 2018-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: F L Li
Full Text: PDF
GTID: 2348330518465878
Subject: System theory
Abstract/Summary:
With the continuing development of computer hardware, the Internet has advanced at an unprecedented pace. In this era of exploding data, an enormous amount of information covers the whole world, and big data and related computing technologies have emerged in response. In the big data era, information retrieval systems help people find the data they need accurately. An information retrieval system takes keywords or a search strategy from the user, relies on crawler technology to collect information from the Internet, and then processes and presents the relevant results using techniques such as Chinese word segmentation, duplicate web page removal, and ranking optimization. Representative systems include Baidu and 360 in China, and Google and Yahoo abroad. Although all of them focus on retrieval, each has its own characteristics, and they have become necessary tools in people's daily lives. Because these systems cover such a wide range of content, they run into difficulties with very large volumes of information and with queries in a specific domain. Vertical information retrieval systems were introduced to overcome these difficulties. A vertical information retrieval system is an information retrieval system dedicated to a specific domain. Examples include document, tourism, and shopping vertical information retrieval systems.

This project studies a news vertical information retrieval system and optimizes several existing technologies. First, the Heritrix crawler is extended through secondary development to improve crawling efficiency. Then the crawled web pages are converted from HTML to plain text (TXT) with HTMLParser, and an optimized version of the IK Analyzer word segmenter is used to segment the text and filter out dirty data. Next, an improved TF-IDF weighting algorithm is used to remove duplicated web page content. Finally, on the Struts + Spring + Hibernate framework, with MySQL as the storage database, the PageRank algorithm is used to optimize Lucene's ranking algorithm, and the index is built and searched to complete the news vertical information retrieval system.
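The abstract does not spell out how the improved TF-IDF algorithm detects duplicated pages, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes pages have already been segmented into token lists (for example by a word segmenter such as IK Analyzer), uses a standard smoothed IDF, and compares pages by cosine similarity. The function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one {term: weight} dict per document from lists of tokens.

    Uses relative term frequency and a smoothed IDF (an assumption;
    the thesis's improved weighting is not described in the abstract).
    """
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(docs, threshold=0.9):
    """Return indices of documents to keep, skipping near-duplicates.

    A document is dropped when its similarity to any already-kept
    document reaches the (hypothetical) threshold.
    """
    vectors = tf_idf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Calling `drop_near_duplicates` on a list of tokenized pages returns the indices of the pages that would be passed on to indexing. The pairwise comparison is quadratic in the number of pages, so a real crawler would likely need a cheaper fingerprinting step first.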
Keywords/Search Tags:Vertical Search, Chinese Word Segmentation, Crawler, Lucene, HTMLParser