Research On The Key Technology Of Short Message Text

Posted on: 2014-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: L L Yu
Full Text: PDF
GTID: 2268330425464246
Subject: Computer application technology
Abstract/Summary:
We are living in an era of information explosion. Although acquiring knowledge and information has become simple, finding the information we need remains inefficient because the volume of data grows geometrically. How to capture the needed information from huge amounts of data quickly and accurately is still an open problem. Text clustering technology can aggregate seemingly messy data into topic-based groups, so that information can be found quickly and accurately.
Short message text is short, semantically rich, full of wrongly written characters, and growing geometrically in volume, all of which make it considerably difficult to cluster. Short length can make extracted text features meaningless, while wrong characters and rich semantics make natural language recognition difficult. Furthermore, the geometric growth of short message text poses a huge challenge to the efficiency of clustering technology. In practice, search engines sometimes do not process short messages at all, and when processing efficiency is low it is difficult to find valuable and meaningful information. Yet short message text contains abundant information, and extracting meaningful knowledge from it has become increasingly important.
This thesis takes short message text as its research object and the comparative evaluation of short message text clustering algorithms as its main research content. It discusses and studies retrieval, extraction, denoising, word segmentation, stop-word removal, vectorization of short message text, feature selection, and text clustering algorithms. The work has two parts. The first part obtains short message text from the Web with crawler technology and segments the Chinese text with word segmentation technology.
The second part studies clustering algorithms after the text has been quantified and expressed as a model that a computer can process.
The thesis first systematically introduces the principle of Web crawler technology and how a Web crawler works, and then collects the data set with a Web crawler. It next introduces the principle of Chinese word segmentation and the problems it faces, and gives a detailed overview of the currently popular Chinese word segmentation systems. On this basis, it segments the short message text obtained from the Web by calling the word segmentation system of the Chinese Academy of Sciences and uses a stop-word set to remove high-frequency but meaningless words from the short messages, so that they do not affect the clustering results.
The thesis then systematically studies quantitative representation models for Chinese text. Text feature selection keeps high-dimensional data from bringing the curse of dimensionality to the text clustering algorithm. In the subsequent research, it adopts a text representation based on the vector space model (VSM) and, using feature selection based on word frequency, normalizes the text into the data structure required for clustering.
Afterwards, the thesis summarizes the basic principle of the affinity propagation (AP) algorithm and introduces the basic concepts involved and the algorithm's working process. It also discusses how parameter selection affects the clustering result and the running efficiency of the algorithm.
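The stop-word removal, word-frequency feature selection, and VSM normalization steps named above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list, the function names, the `max_features` cutoff, and the toy messages are hypothetical, English tokens stand in for already-segmented Chinese words, and the thesis's actual weighting and normalization scheme may differ.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the stop-word set used in the thesis is not given.
STOP_WORDS = {"the", "a", "of", "to", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def select_features(docs, max_features):
    """Keep the max_features most frequent terms across the corpus."""
    totals = Counter(t for doc in docs for t in doc)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]


def vectorize(doc, vocabulary):
    """Term-frequency vector over the vocabulary, scaled to unit length."""
    counts = Counter(doc)
    vec = [float(counts[term]) for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine_similarity(u, v):
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy, pre-segmented messages (assumed input; segmentation happens upstream).
messages = [
    ["the", "price", "of", "the", "phone", "is", "low"],
    ["phone", "price", "is", "a", "bargain"],
    ["the", "match", "score", "is", "a", "draw"],
]
docs = [remove_stop_words(m) for m in messages]
vocabulary = select_features(docs, max_features=4)
vectors = [vectorize(d, vocabulary) for d in docs]

print(vocabulary)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

The pairwise similarities computed this way are the kind of input a similarity-based clustering algorithm such as AP consumes; limiting the vocabulary size is one simple way to keep the vectors low-dimensional.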
After that, it briefly introduces the process, advantages, and disadvantages of the k-means algorithm. It also discusses a clustering algorithm based on word order, the suffix tree clustering algorithm, and spells out its procedure and how the suffix tree is constructed.
Finally, it runs simulation experiments on a set of short message texts classified in advance. The clustering results of the three algorithms are compared using the evaluation indices of accuracy, recall, and F-value. The comparison shows that text clustering based on the AP algorithm achieves better clustering accuracy than the other two algorithms, so the AP algorithm is carried forward into building the prototype system. In the end, building on the theoretical research, the thesis designs and implements a short message text clustering prototype system based on the affinity propagation (AP) algorithm. The system obtains Web data from a URL entered by the user and clusters the short message text, which helps users find the information they need quickly and accurately.
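Comparing clusterings against a pre-classified text set by precision, recall, and F-value could look roughly like the sketch below. It assumes one common definition of the clustering F-measure (each true class is matched to its best-scoring cluster, and the class scores are weighted by class size); the abstract does not state which definition the thesis uses, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Return per-class (precision, recall, F) and the size-weighted overall F.

    For class i and cluster j with overlap n_ij:
        precision = n_ij / |cluster j|, recall = n_ij / |class i|,
        F = 2 * precision * recall / (precision + recall).
    Each class is scored by its best-matching cluster (assumed definition).
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    per_class = {}
    overall_f = 0.0
    for cls, n_i in class_sizes.items():
        best = (0.0, 0.0, 0.0)  # (F, precision, recall)
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            f_value = 2 * precision * recall / (precision + recall)
            best = max(best, (f_value, precision, recall))
        per_class[cls] = {"precision": best[1], "recall": best[2], "f": best[0]}
        overall_f += (n_i / n) * best[0]
    return per_class, overall_f


# Toy example: six messages with known topics and the cluster each was assigned to.
true_labels = ["sports", "sports", "sports", "finance", "finance", "finance"]
cluster_ids = [0, 0, 1, 1, 1, 1]

per_class, overall_f = clustering_f_measure(true_labels, cluster_ids)
print(per_class)
print(overall_f)
```

Running the same function on the output of each clustering algorithm over the same labeled set gives directly comparable scores.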
Keywords/Search Tags: short message text, text clustering, affinity propagation algorithm, suffix tree algorithm, k-means algorithm