The Core Word Extraction Based On Important Degree And Affinity Degree

Posted on: 2015-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: F S Kong
Full Text: PDF
GTID: 2298330422992326
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, and of the mobile Internet in particular, electronic maps have become increasingly widely used in recent years, and electronic map search engines have emerged in response. Improving the service quality of an electronic map requires, on the one hand, data that are more precise, larger in volume, and more detailed for each data point, and, on the other hand, a search engine that understands user needs and returns more accurate results. Query analysis is an important stage of a search engine: it is the first component to handle the user's input, it interprets the user's intent, and it guides the subsequent recall and ranking of information. Extracting the core word of the user's search string with a core word extraction system is an important way to improve query results.

Against the background of current search engine development and natural language processing technology, this thesis analyzes how current search engines use query logs and natural language processing to handle search strings. Combining this with current business needs, it presents a requirements analysis for a core word extraction system for an electronic map search engine. On the technical side, it combines a Naive Bayes model with double-word coupling to improve the accuracy of the statistical machine learning approach.

The thesis defines the importance degree and the affinity degree (also called tightness) and gives a method for calculating each. For the importance degree, a Naive Bayes classifier is used to find texts similar to the original text, and the importance of a morpheme in the original text is derived from its probability in those similar texts. The affinity degree uses an approximate double-word coupling method: the tightness between two morphemes is the quotient of the frequency with which the two morphemes occur consecutively and the sum of the frequencies of the two morphemes.

The core word extraction system is developed in C++ and Python on the MapReduce platform. Its design is divided into two parts, offline mining and online processing. Offline mining includes an importance mining module and a tightness mining module. These modules apply the importance and tightness formulas on the MapReduce platform, which gives distributed processing of large data sets and improves the efficiency of data mining while preserving the accuracy of the calculation. The online part consists of the core word extraction module. It uses the importance and tightness dictionaries produced by offline mining, together with content words, a blacklist, a whitelist, and rules and strategies based on the composition of search strings, to extract the core word of a search string.

The thesis also optimizes the offline mining results for importance and tightness by enlarging the corpus and adjusting the convergence parameters, and it improves the accuracy of the core word extraction module by adding and adjusting extraction strategies. Together these steps optimize the core word extraction system as a whole.

Combining statistical machine learning with hand-written rules, this thesis designs and implements a core word extraction system and iteratively optimizes its results. In the final evaluation, the new system improves on the original system, which was based entirely on hand-written rules, by 30.9%, a clear improvement. The system has been launched successfully and serves a large number of users.
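The affinity degree described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so this assumes the tightness of two adjacent morphemes is their consecutive-occurrence count divided by the sum of their individual counts. The function name `affinity_table` and the segmented sample queries are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def affinity_table(queries):
    """Return an assumed affinity score for every adjacent morpheme pair.

    queries: list of queries, each already segmented into a list of morphemes.
    Assumed formula: count(a followed by b) / (count(a) + count(b)).
    """
    unigram = Counter()  # how often each morpheme occurs
    bigram = Counter()   # how often two morphemes occur consecutively
    for tokens in queries:
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))
    return {
        (a, b): pair_count / (unigram[a] + unigram[b])
        for (a, b), pair_count in bigram.items()
    }


# Hypothetical segmented search strings, for illustration only.
queries = [
    ["beijing", "railway", "station"],
    ["railway", "station", "parking"],
    ["beijing", "hotel"],
]
table = affinity_table(queries)
for pair, score in sorted(table.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy corpus, "railway" and "station" always occur together and so receive the highest score, which is the kind of signal an extraction module could use to keep the two morphemes together as one unit. In the system the abstract describes, the counting step would run as a distributed job over query logs rather than in memory.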
Keywords/Search Tags: Core Word Extraction, Naive Bayes Classifier, Coupling Degree of Double Character, Important Degree, Affinity Degree