Research On Feature Selection Methods And Its Applications In Text Clustering

Posted on: 2016-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Yu
GTID: 2308330461483503
Subject: Management Science and Engineering
Abstract/Summary:
Text data mining has become an important research area. Its object of study is text data from various sources, and it helps people mine and analyze text content and discover text patterns. Text clustering is a vital task in text mining that helps enterprises and users summarize text data. High-dimensional, sparse text features degrade clustering performance, so an effective feature selection method is key to improving clustering results. This thesis studies feature selection methods for text clustering and applies them to telecom customer complaint data. The main research content is as follows.

First, this thesis proposes FS-CR, a feature selection method based on text clustering results. The method first clusters the original text corpus to obtain an initial clustering result. Using the initial cluster assignments as category labels, it then calculates the information gain of every feature and selects the important features. Finally, it clusters the corpus again using only the important features to obtain a better clustering result. Three experiments compare FS-CR with existing feature selection methods such as document frequency and term contribution, using F-measure and feature compression ratio as evaluation measures. The results show that FS-CR achieves higher F-measure values with a small number of effective features, indicating that the method is feasible.

Second, traditional weight calculation methods consider only feature frequency and document frequency, although text contains a large amount of semantic information. This thesis introduces a location factor and a paragraph co-occurrence factor and proposes FS-SI-CR, a new feature selection method based on text semantic information and clustering results. Introducing semantic information strengthens the weight of the text theme, which optimizes the initial clustering result and in turn improves the final clustering. FS-SI-CR is compared with FS-CR and with term contribution combined with semantic information. Experimental results show that FS-SI-CR outperforms the other feature selection methods both in overall clustering quality and at the level of individual text categories.

Third, existing telecom customer complaint data is text data with no category information. Because customer complaints are short texts, the paragraph co-occurrence semantic information is replaced by sentence co-occurrence semantic information. This thesis first proposes a text mining framework for customer complaints in the telecommunications industry, then applies text preprocessing and the FS-SI-CR method to telecom customer complaint texts. The clustering results show that FS-SI-CR performs well on this data and selects a small number of effective features. Analyzing the features of the different categories reveals the issues behind customer complaints, which can improve complaint handling efficiency and reduce labor costs. Importantly, it also provides decision support for telecom enterprise managers.
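
The FS-CR procedure described in the abstract (cluster, treat the cluster assignments as category labels, rank features by information gain, then cluster again on the selected features) can be illustrated with a minimal sketch. This is not the thesis's implementation. The abstract does not name the clustering algorithm or the feature representation, so the tiny deterministic k-means, the binary term-presence vectors, the toy corpus, and all function names (`entropy`, `information_gain`, `kmeans`, `vectorize`, `fs_cr`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of cluster labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of a term's presence/absence w.r.t. the labels."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(docs)
    conditional = (len(with_t) / n) * entropy(with_t) \
        + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Toy k-means seeded with the first k vectors (assumption: any
    clustering algorithm could be used here; this one is deterministic)."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def vectorize(docs, vocab):
    """Binary term-presence vectors (assumed representation)."""
    return [[1.0 if t in d else 0.0 for t in vocab] for d in docs]


def fs_cr(docs, k, m):
    """Sketch of FS-CR: initial clustering -> information gain using the
    cluster labels as classes -> keep top-m features -> cluster again."""
    vocab = sorted(set().union(*docs))
    initial = kmeans(vectorize(docs, vocab), k)
    ranked = sorted(vocab,
                    key=lambda t: (-information_gain(docs, initial, t), t))
    selected = ranked[:m]
    final = kmeans(vectorize(docs, selected), k)
    return selected, final


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"bill", "charge", "the"},
    {"signal", "network", "the"},
    {"bill", "refund", "the"},
    {"network", "outage", "the"},
]
selected, final = fs_cr(docs, k=2, m=2)
print(selected, final)
```

In this toy corpus the token "the" occurs in every document, so its information gain with respect to any labeling is zero and it is never preferred over terms that separate the clusters. The second clustering pass therefore runs on a much smaller feature set, which is the effect the abstract reports for FS-CR.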
Keywords: Data Mining, Text Clustering, Feature Selection, Customer Complaints