Research On Feature Extraction Method Of Semi-structured Document

Posted on: 2022-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Huang
Full Text: PDF
GTID: 2518306527478094
Subject: Software engineering
Abstract/Summary:
Patent texts record a great deal of information about scientific and technological achievements and have therefore attracted wide attention. With the rapid development of the Internet, the number of patent texts keeps growing. How to efficiently extract key features from large numbers of patent texts is a basic research problem in natural language processing. Existing patent text feature extraction methods, however, have not achieved satisfactory results, so their accuracy needs to be improved further. To address this problem, this thesis proposes an unsupervised TextRank patent keyword extraction model that makes effective use of prior public knowledge. Two networks are built first: 1) a TextRank network for each patent text, and 2) a prior knowledge network based on public dictionary data, in which edges represent the prior interpretation relationships among the dictionary words in dictionary entries. An improved node rank evaluation formula is then designed for the TextRank networks of patent texts, introducing the prior interpretation features from the prior knowledge network. Finally, patent keywords are extracted by selecting the top-k node words with the highest rank values. In the experiments, the accuracy of the extracted keywords is verified in two ways: through the clustering performance on patent texts and through comparison with standard keywords. The results show that the new method performs markedly better than existing methods on the unsupervised patent keyword extraction task.

In addition, the thesis takes full account of the semi-structured semantic features of patent text, introduces a new representation of patent text, and proposes a method for analyzing patent technology evolution based on an association network. In this method, each patent claim is first treated as an independent paragraph. The semantic relationships between paragraphs are constructed from the citation relationships between claims and combined with the co-occurrence relationships within paragraphs to form a representation of patent texts, the patent association network. A time-space dimension TextRank model is then proposed to calculate the global node weights of the network, and technical words are filtered by node weight to reduce and reconstruct the association network. Next, the AP clustering algorithm is used to obtain representative technical words as a seed set, and the evolution paths between networks in adjacent time slices are obtained from the average association strength of the optimal paths. Finally, repeated nodes on the paths are merged to complete the technology evolution analysis. The experimental results verify the validity of the technical word evolution obtained by the new method, which provides new ideas for the analysis of patent technology evolution.
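The keyword extraction idea described in the abstract (a per-document TextRank network whose node ranks are adjusted by a dictionary-based prior, followed by top-k selection) can be illustrated with a minimal sketch. The abstract does not give the improved rank formula, so the code below swaps in a simple assumption: the prior replaces the uniform teleport term of standard TextRank, in the style of personalized PageRank. The function names, the `dictionary_links` structure, the window size, and the damping value are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=3):
    """Build an undirected weighted co-occurrence graph from a token list."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        # Link each word to the next (window - 1) tokens.
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return {w: dict(neighbours) for w, neighbours in graph.items()}


def prior_scores(graph, dictionary_links):
    """Assumed prior: count dictionary interpretation links touching each word.

    dictionary_links maps a headword to the words used to explain it
    (a hypothetical stand-in for the prior knowledge network).
    """
    raw = {w: 1.0 for w in graph}  # smoothing so every word keeps some mass
    for headword, explained_by in dictionary_links.items():
        for w in explained_by:
            if headword in raw and w in raw:
                raw[headword] += 1.0
                raw[w] += 1.0
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}


def textrank_with_prior(graph, prior, damping=0.85, iterations=50):
    """Power iteration where the teleport term follows the prior (assumption)."""
    rank = {w: 1.0 / len(graph) for w in graph}
    for _ in range(iterations):
        new_rank = {}
        for w in graph:
            # Each neighbour u passes on its rank in proportion to edge weight.
            incoming = sum(
                rank[u] * graph[u][w] / sum(graph[u].values())
                for u in graph[w]
            )
            new_rank[w] = (1.0 - damping) * prior[w] + damping * incoming
        rank = new_rank
    return rank


def top_k_keywords(tokens, dictionary_links, k=3):
    """Return the k words with the highest prior-adjusted rank."""
    graph = cooccurrence_graph(tokens)
    rank = textrank_with_prior(graph, prior_scores(graph, dictionary_links))
    return sorted(rank, key=rank.get, reverse=True)[:k]


if __name__ == "__main__":
    tokens = "battery electrode coating improves battery electrode life".split()
    links = {"electrode": ["battery"]}  # made-up dictionary entry
    print(top_k_keywords(tokens, links, k=2))
```

Using the prior as the teleport distribution keeps the ranks normalized and leaves the graph structure untouched, which makes the effect of the dictionary knowledge easy to isolate. The actual formula in the thesis may combine the two networks differently.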
Keywords/Search Tags: semi-structured, patent text, feature extraction, keywords, prior knowledge, associative memory, technological evolution