Design And Implementation Of Retrieve System Of Query Recommendation About Chinese News

Posted on: 2015-01-26; Degree: Master; Type: Thesis
Country: China; Candidate: J L Ji; Full Text: PDF
GTID: 2298330422982074; Subject: Computer application technology
Abstract/Summary:
Foreign users who search for information about Chinese news want a better query experience. Commissioned by the project team of "Research of Cross-cultural Influence of Confucius Institute", this thesis implements a simple retrieval system with query recommendation for Chinese news. The goal of the system is to help users clarify their query intention while they search and to suggest further terms that may interest them. With the help of query recommendation, users can then reach accurate and comprehensive webpages.

The thesis implements three main modules of the system: the webpage crawler, webpage preprocessing, and query recommendation. The complete system also contains a webpage ranking module, which was implemented by others.

The crawler module fetches webpages with multiple threads on top of HtmlUnit, and it uses a Bloom filter to detect URLs that have already been seen, which is an efficient way to do so.

Webpage preprocessing consists of content extraction, duplicate removal, webpage classification, and webpage storage. The extraction module uses features such as link density and word density to extract the main content. The Simhash algorithm is used to remove duplicated webpages.

The query recommendation module is the focus of the work. Before the system produces a recommendation, it must correct errors in the query words. The query correction method is based on a dual language model, which uses the Bayesian probability formula and dynamic programming to correct the query words.

To produce recommendations, the system must first extract the important terms. It uses Stanford's open-source POS tagger to help obtain these terms. The POS tagger is based on a maximum entropy model and runs fast.

The system represents each term by a vector formed from the term's context, so recommendations can be produced by computing the cosine similarity between term vectors.

Finally, the system implements an efficient index file for fast data access, and the index file supports batch updates.
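
The context-vector step of the query recommendation module can be illustrated with a minimal Java sketch. The abstract does not say how the context is defined or weighted, so everything below is an assumption made for illustration: a fixed co-occurrence window, raw counts as vector weights, and the hypothetical names `ContextSimilarity`, `buildContextVectors`, `cosine`, and `recommend`. The toy corpus uses English tokens in place of segmented Chinese text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: not the thesis implementation.
class ContextSimilarity {

    // Builds a sparse vector per term: counts of the words that occur
    // within `window` positions of the term in the same sentence.
    public static Map<String, Map<String, Integer>> buildContextVectors(
            List<List<String>> sentences, int window) {
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                Map<String, Integer> vec =
                        vectors.computeIfAbsent(sentence.get(i), k -> new HashMap<>());
                int from = Math.max(0, i - window);
                int to = Math.min(sentence.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) {
                        vec.merge(sentence.get(j), 1, Integer::sum);
                    }
                }
            }
        }
        return vectors;
    }

    // Euclidean length of a sparse count vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity of two sparse vectors; 0.0 if either is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Returns up to k terms whose context vectors are most similar to the
    // query term's vector; an unknown query term yields an empty list.
    public static List<String> recommend(
            String query, Map<String, Map<String, Integer>> vectors, int k) {
        Map<String, Integer> q = vectors.get(query);
        List<String> candidates = new ArrayList<>();
        if (q == null) {
            return candidates;
        }
        for (String term : vectors.keySet()) {
            if (!term.equals(query)) {
                candidates.add(term);
            }
        }
        // Sort by descending similarity to the query term.
        candidates.sort((x, y) ->
                Double.compare(cosine(q, vectors.get(y)), cosine(q, vectors.get(x))));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("beijing", "hosts", "summit"),
                List.of("shanghai", "hosts", "expo"),
                List.of("panda", "eats", "bamboo"));
        Map<String, Map<String, Integer>> vectors = buildContextVectors(corpus, 1);
        System.out.println(recommend("beijing", vectors, 2));
    }
}
```

In this sketch, terms that appear in similar contexts (here, next to "hosts") score high against each other, while terms from unrelated sentences score zero. A real system would likely restrict candidates to the extracted important terms and might weight the counts, but the abstract leaves those choices open.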
Keywords/Search Tags: query recommendation, web crawler, retrieval system