The Research On Topical Phrase Mining For Patent

Posted on:2019-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:S Ji

GTID:2428330623969000

Subject:Computer Science and Technology

Abstract/Summary:

Topic models can uncover the latent semantic information in patent text and present it as probability distributions. The results have good mathematical properties and are simple and intuitive. They help patent analysts quickly grasp the overall state of a patent corpus in a given field, and they can also support further patent mining tasks such as patent classification and patent information extraction.

In recent years, researchers in China and abroad have mainly discovered hidden topics from co-occurrence relations among words. The resulting topics are probability distributions over single words; they lack deeper semantic information and are hard to interpret. On the one hand, such topic models struggle to extract semantically rich low-frequency words, and their results are biased toward high-frequency words, which weakens the expressive power of the topics. On the other hand, many semantically rich phrases are split into separate words, which makes the topics hard to read, and the splitting introduces spurious co-occurrence relations. In general, phrases carry richer semantic information than words, and human interpretation of topic results often depends on phrases. This thesis therefore proposes a patent-oriented topical phrase extraction method that builds topic models on phrase sets. The main work is as follows:

(1) Based on the characteristics of patent text, an automatic patent-oriented phrase extraction method is proposed. First, a frequent phrase mining algorithm generates a candidate phrase set, which is then filtered with word rules. Second, four statistics of the candidate phrases are selected as features: concordance, informativeness, pc-value and TermRank. These features are computed for the candidates, and training and test sets are built from them. Finally, a random forest classifier is trained, and the trained model is used to filter the candidate set.

(2) A phrase-based topic model, GW_PhraseLDA, is proposed as an improvement on PhraseLDA. PhraseLDA takes phrases into account during modeling and highlights their role in some cases, but it is still affected by the sparsity of phrase co-occurrence. Moreover, different phrases in patent text often express similar meanings, and PhraseLDA cannot effectively identify such relationships. Introducing word vectors and the Generalized Pólya Urn model into PhraseLDA addresses both problems. Experiments on Chinese patent text show that the proposed model effectively improves the quality of the generated patent topics, which are more interpretable and more discriminative than those of traditional topic models.
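
The abstract does not give the update rules of GW_PhraseLDA, so the following is only a minimal sketch of the general Generalized Pólya Urn idea it mentions in point (2): when a phrase is assigned to a topic, phrases with similar vectors also receive a small count boost in that topic, which eases co-occurrence sparsity. The similarity threshold, the promotion weight, the toy phrase vectors and all function names are assumptions made for illustration, not values or identifiers from the thesis.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_promotion_table(vectors, threshold=0.8, weight=0.3):
    """For each phrase, list the count added to every phrase it promotes.

    A phrase always adds 1.0 to itself; phrases whose vectors have cosine
    similarity >= threshold (assumed value) receive `weight` (assumed value).
    """
    table = {}
    for p, vp in vectors.items():
        row = {p: 1.0}
        for q, vq in vectors.items():
            if q != p and cosine(vp, vq) >= threshold:
                row[q] = weight
        table[p] = row
    return table


def update_topic_counts(topic_counts, topic, phrase, table, sign=1):
    """Add (sign=1) or remove (sign=-1) a phrase assignment, with promotion."""
    for q, amount in table[phrase].items():
        topic_counts[topic][q] += sign * amount


# Hypothetical phrase vectors; in practice these would be derived from
# trained word vectors.
vectors = {
    "lithium battery": [0.9, 0.1, 0.0],
    "lithium ion cell": [0.85, 0.2, 0.0],
    "image sensor": [0.0, 0.1, 0.95],
}

table = build_promotion_table(vectors)
topic_counts = defaultdict(lambda: defaultdict(float))

# Assigning "lithium battery" to topic 0 also boosts "lithium ion cell".
update_topic_counts(topic_counts, 0, "lithium battery", table)
print(dict(topic_counts[0]))
```

In a Gibbs sampler, the same function would be called with `sign=-1` before resampling a phrase's topic and with `sign=1` afterwards, so that promoted counts are added and removed symmetrically.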

Keywords/Search Tags:

Topic Model, Phrase Extraction, PhraseLDA, Random Forest, Word Vector, Patent Text

Related items

1	Topic Extraction And Visualization Of Patent Text
2	Text Classification Based On Word Vector And Topic Vector
3	The Method Of Fine-Grained Topic Information Extraction And Text Clustering Based On Chinese Phrase
4	Study On Feature Extraction And Text Representation Technology In Topic Tracking
5	Topic Extraction Algorithm Based On NP-Chunking And Phrase Weight Calculation
6	The Design And Implementation Of Text Topic Key Word Processing System Based On Chinese Word Segmentation
7	Research On Topic Model Based Patent Mining And Its Applications
8	Methods For Phrase-based Text Mining And Analysis
9	Study On Text Clustering And Keyphrase Extraction Of Patent Document
10	The Research On Short Text Semantic Mining Based On Topic Model And Word Vector