
Research On Techniques Of Real Estate Information Vertical Search Engine

Posted on: 2015-01-14  Degree: Master  Type: Thesis
Country: China  Candidate: Y Li  Full Text: PDF
GTID: 2268330428465052  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computers and networks, the amount of information on the Internet is growing exponentially. When general-purpose search engines are used to look for domain-specific information, accuracy is increasingly poor; sometimes users page through dozens of result pages without finding the desired content, which most people find unacceptable. Vertical search engines emerged in response to this problem. They make up for the shortcomings of general search engines in specific domains by offering more focused, deeper, and more accurate results. This work examines the current state of the real estate domain, and designs and implements a real estate vertical search engine. The thesis concentrates on two key technologies of the system: the web crawler and Chinese word segmentation. The main contents are as follows:

(1) Introduces the background and significance of the research, and gives an overview of vertical search engines and their development at home and abroad.

(2) Introduces the technologies related to vertical search engines: web crawling, web information extraction, Chinese word segmentation, information indexing, and search result ranking.

(3) Studies the Shark-Search algorithm in depth and analyzes two shortcomings: its insufficient use of anchor text context and its tendency to get stuck in local optima. Two improvements are proposed: adding link clustering and tunneling to Shark-Search. Link clustering addresses the anchor text context deficiency, while tunneling addresses the topic island problem and so avoids local optima. Experimental results show that the improved algorithm is significantly better at crawling topic-relevant pages.

(4) Studies a probabilistic statistical model, the hidden Markov model (HMM), and combines it with part-of-speech tagging, adding custom states and custom annotations; the state sequence with the maximum probability is then computed. This method identifies real estate named entities efficiently.

(5) Building on the theory presented in the preceding chapters, implements a real estate vertical search engine system consisting of five main modules: crawler, page processing, Chinese word segmentation, information indexing, and information search.

(6) Finally, summarizes the thesis and outlines future work.
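
Item (4) describes computing the maximum-probability state sequence of a hidden Markov model. A minimal sketch of that decoding step, using the standard Viterbi algorithm, might look like the following. The abstract does not name the decoding algorithm or give any model parameters, so the choice of Viterbi, the two states `O` (other) and `EST` (part of a real estate entity), the POS-like observation tags, and all probabilities below are illustrative assumptions, not the thesis's actual model.

```python
import math

# Hypothetical toy model: states, tags, and probabilities are assumptions
# made for illustration only; they do not come from the thesis.
STATES = ["O", "EST"]
START_P = {"O": 0.8, "EST": 0.2}
TRANS_P = {
    "O": {"O": 0.7, "EST": 0.3},
    "EST": {"O": 0.4, "EST": 0.6},
}
EMIT_P = {
    "O": {"v": 0.6, "n": 0.3, "ns": 0.1},
    "EST": {"v": 0.05, "n": 0.35, "ns": 0.6},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    if not observations:
        return []

    floor = 1e-12  # assumed floor so unseen events do not produce log(0)

    def log_p(p):
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s at step t,
    #               predecessor state on that path)
    first = observations[0]
    best = [{
        s: (log_p(start_p.get(s, 0.0)) + log_p(emit_p[s].get(first, 0.0)), None)
        for s in states
    }]

    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev, score = max(
                ((p, best[-1][p][0] + log_p(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + log_p(emit_p[s].get(obs, 0.0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(best) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    tags = ["v", "ns", "n"]  # e.g. verb, place name, noun
    print(viterbi(tags, STATES, START_P, TRANS_P, EMIT_P))
```

With these toy parameters, the tag sequence `v, ns, n` decodes to `O, EST, EST`: the verb is left outside the entity and the place name plus the following noun are grouped as one candidate entity. Log probabilities are used so that long sequences do not underflow.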
Keywords/Search Tags: Focused Crawler, Hidden Markov, Tunnel Technology, Inverted Index, Information Extraction