
Short Text Clustering Based On Frequent Word Co-occurrence Network

Posted on: 2017-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: W Li
GTID: 2180330482479360
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Web 2.0, text content on the Internet has changed tremendously. In the Web 1.0 era, the Internet consisted mainly of static pages, whose texts were long, standardized documents. In the Web 2.0 era, dynamic web technology has developed rapidly, and a large number of web applications such as microblog platforms, question-answering communities, forums, and instant messaging software have become increasingly popular. These applications are mostly based on short texts. Short texts are brief and fragmented, and they spread easily on the Internet, which makes them well suited to today’s fast-paced society. How to mine useful information and knowledge from large volumes of short texts has become a hot research topic.
Text mining is one of the traditional research areas of data mining and has accumulated a large body of theory and results. However, traditional text mining methods are designed for long texts and are not suitable for short texts. Compared with long texts, short texts generated by Web 2.0 platforms are sparse. They are also noisy: for example, short texts on microblog platforms use a lot of Internet slang and contain many spelling errors and typos. As a result, the existing, mature long-text techniques cannot process short texts effectively, and an efficient method suited to the characteristics of short text is urgently needed.
To address the data sparsity and nonstandard wording of short texts, we propose a short text clustering algorithm based on the frequent word co-occurrence network (FWN). The method first mines the frequent K-word itemsets (K >= 3) from the corpus. It then constructs the FWN, in which two words are connected by an edge if they appear together in the same frequent itemset.
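The network construction described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `frequent_triples` and `build_fwn`, the toy corpus, and the `min_support` value are assumptions made for illustration. The sketch mines only 3-word itemsets by brute-force counting. For building edges this loses nothing, because every pair of words inside a larger frequent itemset also lies inside a frequent 3-word subset of it.

```python
from collections import Counter
from itertools import combinations


def frequent_triples(docs, min_support):
    """Return the 3-word itemsets that occur in at least min_support texts."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc))  # each word counts once per text
        counts.update(combinations(words, 3))
    return {triple for triple, n in counts.items() if n >= min_support}


def build_fwn(itemsets):
    """Adjacency map: two words share an edge if they co-occur in an itemset."""
    graph = {}
    for itemset in itemsets:
        for a, b in combinations(sorted(itemset), 2):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


# Hypothetical tokenized short texts (not data from the thesis).
docs = [
    ["apple", "iphone", "launch", "price"],
    ["apple", "iphone", "launch"],
    ["rain", "flood", "city"],
    ["rain", "flood", "city", "rescue"],
    ["apple", "rain"],
]
triples = frequent_triples(docs, min_support=2)
fwn = build_fwn(triples)
print(sorted(triples))
print({word: sorted(neighbors) for word, neighbors in sorted(fwn.items())})
```

Brute-force counting is only practical because short texts contain few words; a real corpus would more likely use a dedicated frequent itemset miner such as Apriori or FP-growth.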
In the FWN, topics exist in the form of communities, because the feature words of a topic are closely linked to each other and form a topic community. We therefore use a complex network community discovery algorithm to identify communities in the FWN. Finally, we take the feature words of each topic as the prototype of that topic and adapt a single-pass clustering algorithm based on maximum-similarity assignment to achieve fast clustering of short texts. Experimental results on a microblog short text data set show that the method can quickly find hot topics in short texts and does not require the number of topics K to be specified in advance.
In addition, we found that our approach can also be used to cluster the search results of a query, which makes it easier to organize them into secondary groupings and subtopics. We therefore developed a prototype system for clustering Baidu news search results based on the FWN method. The system presents query results in a clear structure, resolves query ambiguity, and improves the diversity of query results.
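The last two steps, community discovery and single-pass assignment, can be sketched as follows. The abstract does not name the community discovery algorithm or the similarity measure, so this sketch substitutes connected components for community discovery and Jaccard overlap for similarity. Both are stand-ins chosen for simplicity, and the names `communities`, `jaccard`, `single_pass`, the `threshold` value, and the toy graph are hypothetical.

```python
def communities(graph):
    """Connected components, a simple stand-in for community discovery."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def jaccard(a, b):
    """Jaccard overlap of two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass(docs, prototypes, threshold=0.2):
    """Visit each text once; assign it to the most similar prototype, or None."""
    labels = []
    for doc in docs:
        words = set(doc)
        scores = [jaccard(words, proto) for proto in prototypes]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            labels.append(None)  # no topic is similar enough
        else:
            labels.append(best)
    return labels


# Hypothetical word co-occurrence network with two topic communities.
graph = {
    "apple": {"iphone", "launch"},
    "iphone": {"apple", "launch"},
    "launch": {"apple", "iphone"},
    "city": {"flood", "rain"},
    "flood": {"city", "rain"},
    "rain": {"city", "flood"},
}
topics = communities(graph)
texts = [
    ["apple", "iphone", "price"],
    ["flood", "rain", "rescue"],
    ["football", "match"],
]
labels = single_pass(texts, topics)
print(topics)
print(labels)
```

Because the number of prototypes comes from the communities found in the graph, the number of topics does not have to be fixed beforehand, which matches the property claimed in the abstract. Leaving low-similarity texts unassigned is a choice made for this sketch only.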
Keywords/Search Tags: Short text clustering, FWN, text mining, complex network, community discovery, clustering