Research On Text Representation And Feature Extraction Based On The Full Covering GrC Model

Posted on: 2017-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: H F Xu
Full Text: PDF
GTID: 2308330503957523
Subject: Electronics and Communications Engineering
Abstract/Summary:
The era of big data has produced vast amounts of text, and text mining faces the daunting task of finding valuable information in these fast-growing collections. Text representation models and text feature extraction are important research areas, and intelligent mining of textual information is an urgent need. Granular computing is a relatively new theory in artificial intelligence for mining massive amounts of information. This dissertation seeks a text representation model and feature extraction algorithms, based on granular computing theory, for mining large text corpora.

Latent Dirichlet Allocation (LDA) is a statistical topic model in which keywords, as the basic features, express the semantics of topics. LDA can extract keywords that have high probability, but these keywords are not necessarily important for the topics. This dissertation proposes a text representation model based on set theory, named the full covering granular computing model of texts (FCGMT). On the basis of FCGMT, an important-keyword selection algorithm is designed: it first obtains the candidate keywords generated by the LDA model, and then selects the important keywords from those candidates according to their significance degrees, using a knowledge reduction algorithm based on the full covering granular computing model. Experiments on the Fudan corpus, the Sogou corpus, and a real-time corpus collected by web crawlers show that this algorithm improves precision and recall compared with the TF-IDF method and the classic LDA method. The main contributions are as follows:

1. The full covering granular computing model of texts (FCGMT) is presented on the basis of the full covering granular computing model. Candidate words are obtained by training an LDA model. Then, following the theory of the full covering granular computing model, the corpus, the texts, the topics, and the candidate words are mapped to the domain, the points of the domain, the covering, and the parts of the covering, respectively. The resulting "topic-candidate words-document" model provides the theoretical basis for the important-keyword extraction algorithm based on FCGMT.

2. An improved granule reduction method for the full covering reduction algorithm is presented to optimize attribute reduction in the full covering granular computing model. Because text features are multidimensional, attribute significance is not simply 1 or 0, so attribute significance is redefined. Important-keyword extraction experiments demonstrate that the improved algorithm is effective.

3. An important-keyword extraction algorithm based on FCGMT is designed. Candidate words are first obtained through text pre-processing and the LDA model; the improved granule reduction method is then used to calculate the weight of each candidate word. A reasonable threshold on candidate-word importance is determined through experimental analysis, and the important keywords are extracted accordingly. Experiments on the three corpora show that the algorithm improves precision and recall compared with the TF-IDF method and the classic LDA method, so the selected keywords characterize the topics better.
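The selection step described above can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the abstract does not give the definition of attribute significance, so the measure used here (the fraction of texts that lose all coverage when a candidate word's block is removed) is an assumption chosen only to show a graded, non-0/1 significance and a threshold. The function names, the toy texts, and the threshold value are all hypothetical, and the candidate words are assumed to have already been produced by an LDA model.

```python
from typing import Dict, List, Set, Tuple


def build_covering(texts: List[Set[str]], candidates: List[str]) -> Dict[str, Set[int]]:
    """Map each candidate word to its block: the indices of the texts containing it."""
    return {w: {i for i, t in enumerate(texts) if w in t} for w in candidates}


def covered(covering: Dict[str, Set[int]], words: List[str]) -> Set[int]:
    """Union of the blocks of the given words."""
    result: Set[int] = set()
    for w in words:
        result |= covering[w]
    return result


def significance(covering: Dict[str, Set[int]], word: str, n_texts: int) -> float:
    """Assumed measure: share of texts left uncovered once `word` is removed."""
    all_words = list(covering)
    full = covered(covering, all_words)
    without = covered(covering, [w for w in all_words if w != word])
    return len(full - without) / n_texts


def select_keywords(
    texts: List[Set[str]], candidates: List[str], threshold: float
) -> Tuple[List[str], Dict[str, float]]:
    """Keep the candidates whose significance reaches the threshold."""
    covering = build_covering(texts, candidates)
    scores = {w: significance(covering, w, len(texts)) for w in candidates}
    return [w for w in candidates if scores[w] >= threshold], scores


# Toy corpus: each text is reduced to its set of words (hypothetical data).
texts = [
    {"covering", "granule"},
    {"covering", "reduction"},
    {"covering", "text"},
    {"topic", "model"},
    {"topic"},
    {"text", "model"},
]
# Candidate words, assumed to come from an LDA model trained beforehand.
candidates = ["covering", "topic", "text", "model"]

selected, scores = select_keywords(texts, candidates, threshold=0.15)
print(selected)  # ['covering', 'topic']
```

In this toy run, "text" and "model" score 0 because every text they cover is also covered by another candidate, which mirrors the idea that a high-probability word can still be redundant for describing the corpus.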
Keywords/Search Tags: text representation model, granular computing, full covering, important keywords extraction