Quality Phrase Mining Method Based On Statistic Features

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: H H Yang
Full Text: PDF
GTID: 2428330620961348
Subject: Software engineering
Abstract/Summary:
The development of Internet technology has produced a large amount of unstructured text data. How to represent this text in a form that a computer can process is the primary problem of text mining. The bag-of-words model has several shortcomings: the curse of dimensionality, the inability to express complete semantics, and the loss of word order. To overcome them, this thesis extends the study of text data from word granularity to phrase granularity and extracts Quality Phrases from text corpora to achieve a better representation. The thesis studies the Quality Phrase Mining Method Based on Statistic Features (QPMSF), proposes general evaluation criteria for Quality Phrases, addresses two problems (the poor quality of candidate phrases and the equal allocation of Quality Phrase feature weights), and provides support for Text Classification and Information Retrieval tasks. The main contents are as follows:

(1) Quality Phrase Evaluation Criteria Based on Statistic Features
The thesis proposes Quality Phrase Evaluation Criteria Based on Statistic Features. First, four evaluation criteria are defined: frequency, combination, information, and completeness. Second, formulas are derived and the criterion functions are determined using statistical knowledge. Finally, phrase mining experiments based on frequency and combination are designed to verify the validity of the criteria on text corpora. Experiments are conducted on six text corpora, including 5Conf, DBLP Abstracts, and AP News. The results show that rectified frequency is a better statistic for the frequency criterion than original frequency and significantly improves the quality of the mined Quality Phrases. After comparing the results of the chi-square test, pointwise mutual information, and the t-test, the thesis selects pointwise mutual information as the metric function for the combination criterion.

(2) Candidate Phrase Mining Method Based on Statistic Features
Candidate phrase mining is an important step in unsupervised phrase mining. To ensure the quality of candidate phrases, the thesis proposes the Candidate Phrase Mining Method Based on Statistic Features. First, the method applies the frequency criterion in the n-gram generation stage to limit the number of word sequences and exclude low-frequency phrases. Then, it checks whether multi-word phrases satisfy the combination criterion and extracts those that pass the statistical significance measure function. Finally, because single words may be central to an article, the method checks the spelling of single-word phrases using a trie structure, which improves both quality and efficiency. Experiments on text corpora show that frequent n-gram mining, combination constraints on multi-word phrases, and spell checking of single-word phrases all improve the quality of candidate phrases, and that the method keeps both precision and recall at a high level.

(3) Quality Phrase Selection Method Based on Statistic Features
Building on the candidate phrases, the thesis proposes the Quality Phrase Selection Method Based on Statistic Features to further improve phrase quality. First, the contribution of frequency, combination, information, and completeness to Quality Phrases is calculated from the category information of the phrases. Second, because features influence one another and can be redundant, the method measures the relevance between features with the Pearson correlation coefficient and introduces a penalty factor to improve the weight distribution. Finally, Quality Phrases are extracted according to the feature-weighted score function. Experiments on text corpora show that the method effectively extracts meaningful phrases. Compared with other methods, the Quality Phrase Mining Method Based on Statistic Features achieves a higher F1-Score and a shorter running time, and it represents documents better.
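
To make the candidate mining step concrete, the following is a minimal Python sketch of frequency-filtered n-gram counting followed by a pointwise mutual information (PMI) check on multi-word phrases. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the thresholds `min_freq` and `min_pmi`, and the toy corpus are hypothetical; raw frequency is used in place of the thesis's rectified frequency; the frequency filter is applied after counting rather than during n-gram generation; PMI for phrases longer than two words is taken against the product of the unigram probabilities, which is one common generalization; and the trie-based spell check is omitted.

```python
import math
from collections import Counter


def count_ngrams(docs, max_len=3):
    """Count every n-gram (as a tuple of tokens) up to max_len words."""
    counts = Counter()
    total_tokens = 0
    for tokens in docs:
        total_tokens += len(tokens)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts, total_tokens


def pmi(phrase, counts, total_tokens):
    """PMI of a phrase against the independence assumption over its words.

    Assumption: all probabilities are normalized by the total token count.
    """
    p_phrase = counts[phrase] / total_tokens
    p_independent = 1.0
    for word in phrase:
        p_independent *= counts[(word,)] / total_tokens
    return math.log(p_phrase / p_independent)


def mine_candidates(docs, min_freq=2, min_pmi=1.0, max_len=3):
    """Keep multi-word n-grams that pass both the frequency and PMI checks."""
    counts, total_tokens = count_ngrams(docs, max_len)
    candidates = {}
    for phrase, freq in counts.items():
        if len(phrase) < 2 or freq < min_freq:
            continue  # single words and low-frequency sequences are skipped
        score = pmi(phrase, counts, total_tokens)
        if score >= min_pmi:
            candidates[" ".join(phrase)] = score
    return candidates


# Toy corpus (hypothetical), already tokenized and lowercased.
docs = [
    ["support", "vector", "machine", "is", "a", "model"],
    ["a", "support", "vector", "machine", "model"],
    ["the", "model", "is", "a", "machine"],
]
candidates = mine_candidates(docs)
for phrase, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{phrase}\t{score:.3f}")
```

On a corpus this small, PMI alone also lets through function-word pairs that happen to repeat, which is consistent with the abstract's point that frequency and combination only produce candidates and that a further selection step using information and completeness is needed.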
Keywords/Search Tags: Quality Phrase, statistic features, candidate phrases, feature weighting, text mining