Research On Multi-granularity Patent Text Clustering Fusing Attribute Extraction

Posted on: 2016-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: D P Sun
GTID: 2308330461476537
Subject: Computer system architecture
Abstract/Summary:
With the development of science and technology, more and more countries and companies pay attention to protecting their research findings. As a form of intellectual property protection, patents are receiving increasing attention. Because China's economic development started late, most domestic enterprises, compared with foreign companies, lack an understanding of patents in the early stages of their development. Raising awareness of patent protection and strengthening the ability to analyze patents are therefore becoming more and more important for domestic enterprises. A patent text consists of structured and unstructured data, and this thesis analyzes the latter. The unstructured data has several parts, including the title, the abstract, and the patent claims. The title often describes the key technology of the patent, and the abstract summarizes its content. Both contain a wealth of patent information, so we choose the patent title and the patent abstract as the objects of study.

Taking patent abstracts as the study object, we propose an information extraction method based on conditional random fields (CRFs). First, treating attributes and attribute values as named entities, this thesis trains a CRF model on the training set and then uses the model to extract attributes and attribute values from the test set. Second, we use rules to predict the relations between attributes and attribute values. The precision, recall, and F-score of the prediction are 80.8%, 81.2%, and 81.0% respectively. Finally, using the extraction results, this thesis analyzes patents and compares patents of the same type.

For the patent title and patent abstract, this thesis proposes a patent text clustering method. Attributes and attribute values are extracted with sequence labeling. We represent the abstract information at two granularities: the text and the attribute-value pairs. Each kind of information is encoded with a distributed representation. The patent title, the abstract, and the attribute-value pairs are then fused linearly with different weights, and the patent texts are clustered with spectral clustering. The accuracy, recall, and F-score of the clustering results demonstrate the feasibility and effectiveness of the proposed method.
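The fusion step described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the weights, the toy three-dimensional vectors, and the function names `fuse` and `affinity_matrix` are hypothetical, and fusing at the vector level (rather than at the similarity level) is only one possible reading of "fused linearly".

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so no single view dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(title_vec, abstract_vec, attr_vec, weights=(0.3, 0.4, 0.3)):
    """Weighted linear combination of the three views (weights are assumed)."""
    views = [l2_normalize(v) for v in (title_vec, abstract_vec, attr_vec)]
    dim = len(views[0])
    return [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def affinity_matrix(fused):
    """Pairwise non-negative cosine similarities between fused patent vectors."""
    n = len(fused)
    return [[max(0.0, cosine(fused[i], fused[j])) for j in range(n)]
            for i in range(n)]


# Toy data (invented for illustration): patents 0 and 1 share one topic,
# patents 2 and 3 share another.
patents = [
    {"title": [1.0, 0.0, 0.0], "abstract": [0.9, 0.1, 0.0], "attr": [1.0, 0.2, 0.0]},
    {"title": [0.8, 0.2, 0.0], "abstract": [1.0, 0.0, 0.1], "attr": [0.9, 0.0, 0.0]},
    {"title": [0.0, 0.0, 1.0], "abstract": [0.0, 0.1, 0.9], "attr": [0.1, 0.0, 1.0]},
    {"title": [0.0, 0.2, 0.8], "abstract": [0.1, 0.0, 1.0], "attr": [0.0, 0.0, 0.9]},
]

fused = [fuse(p["title"], p["abstract"], p["attr"]) for p in patents]
affinity = affinity_matrix(fused)
```

A matrix of this shape could then be handed to a spectral clustering routine, for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`. In the toy data, patents on the same topic end up with a higher affinity to each other than to patents on the other topic.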
Keywords/Search Tags: Attribute Extraction, Conditional Random Fields, Multi-granularity, Distributed Representation