
Research on Topical Phrase Mining for Patents

Posted on: 2019-01-07; Degree: Master; Type: Thesis
Country: China; Candidate: S Ji; Full Text: PDF
GTID: 2428330623969000; Subject: Computer Science and Technology
Abstract/Summary:
Topic models can uncover latent semantic information in patent text and present it as probability distributions. The results have good mathematical properties and are simple and intuitive. They help patent analysts quickly grasp the overall state of a patent corpus in a given field, and they can also support further patent mining tasks such as patent classification and patent information extraction. In recent years, researchers in China and abroad have mainly discovered hidden topics from co-occurrence relations among words. The resulting topics are probability-weighted lists of words that lack deep semantic information and are hard to interpret. On the one hand, such topic models struggle to extract semantically rich low-frequency words; the results are biased toward high-frequency words, which weakens the expressive power of the topics. On the other hand, many semantically rich phrases are split apart, which makes the topics hard to read, and the segmentation introduces spurious co-occurrence relations. Generally speaking, phrases carry richer semantic information than single words, and human interpretation of topic results often depends on phrases. This thesis therefore proposes a patent-oriented topical phrase extraction method that builds topic models over phrase sets. The main work is as follows:

(1) Based on the characteristics of patent texts, an automatic patent-oriented phrase extraction method is proposed. First, a frequent phrase mining algorithm generates a candidate phrase set, which is then filtered with word rules. Second, four statistics of the candidate phrases are selected as features: concordance, informativeness, pc-value, and TermRank. A training set and a test set are then selected and their feature values computed. Finally, a random forest classifier is trained, and the trained model is used to filter the candidate set.

(2) A phrase-based topic model, GW_PhraseLDA, is proposed as an improvement on PhraseLDA. PhraseLDA takes phrases into account during modeling and highlights their role under certain circumstances, but it is still affected by the sparsity of phrase co-occurrence. Moreover, different phrases in patent text often express similar meanings, and PhraseLDA cannot effectively identify these relationships. This thesis introduces word vectors and the Generalized Pólya Urn model into PhraseLDA, which effectively addresses both problems. Experiments on Chinese patent text show that the proposed model effectively improves the quality of the generated patent topics, which are more interpretable and more discriminative than those of traditional topic models.
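The abstract does not spell out how word vectors and the Generalized Pólya Urn (GPU) scheme are combined, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes that each phrase has a vector, that two phrases count as related when their cosine similarity exceeds a threshold, and that assigning a phrase to a topic also adds a fractional count for its related phrases. The names `build_promotions` and `gpu_update`, the toy vectors, and the values `threshold=0.8` and `weight=0.3` are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_promotions(vectors, threshold=0.8, weight=0.3):
    """Map each phrase to the (related_phrase, weight) pairs it promotes.

    threshold and weight are illustrative assumptions, not values
    taken from the thesis.
    """
    promotions = {phrase: [] for phrase in vectors}
    phrases = list(vectors)
    for i, p in enumerate(phrases):
        for q in phrases[i + 1:]:
            if cosine(vectors[p], vectors[q]) >= threshold:
                promotions[p].append((q, weight))
                promotions[q].append((p, weight))
    return promotions


def gpu_update(topic_counts, topic, phrase, promotions, sign=1.0):
    """Add (sign=+1) or remove (sign=-1) one assignment of phrase to topic.

    Under the GPU scheme the assigned phrase gets a full count and each
    related phrase gets a fractional count in the same topic.
    """
    topic_counts[topic][phrase] += sign
    for related, w in promotions[phrase]:
        topic_counts[topic][related] += sign * w


# Toy phrase vectors (hypothetical; real ones would come from a
# word-vector model trained on the patent corpus).
vectors = {
    "neural network": [1.0, 0.0],
    "neural net": [0.9, 0.1],
    "brake pad": [0.0, 1.0],
}
promotions = build_promotions(vectors)

counts = defaultdict(lambda: defaultdict(float))
gpu_update(counts, 0, "neural network", promotions)
```

In a collapsed Gibbs sampler, a call with `sign=-1.0` would withdraw a phrase's current assignment before resampling, and a call with `sign=1.0` would record the new one. The fractional counts are what let a rare phrase benefit from the evidence gathered for its near-synonyms, which is one way to soften the co-occurrence sparsity the abstract describes.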
Keywords/Search Tags: Topic Model, Phrase Extraction, PhraseLDA, Random Forest, Word Vector, Patent Text