
Study On Text Clustering And Keyphrase Extraction Of Patent Document

Posted on: 2012-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: X M Xu
Full Text: PDF
GTID: 2248330395958165
Subject: Computer system architecture
Abstract/Summary:
In recent years, patent information resources have grown rapidly. How to utilize them so that they play an important role in scientific research and patent business has become a hot topic in the field of text processing. Patent information resources contain a large amount of specialized text, so an effective method for organizing and utilizing this text, and for helping users find the information they need, is increasingly important.

Text clustering is a good solution for organizing and utilizing text information resources; its task is to divide text data into different text clusters. Patent text clustering, namely text clustering applied to patents, divides patent text data into systematic and meaningful clusters, reduces the size of the text data, and improves the efficiency with which users use and query it. Keyphrase extraction is a good way to describe the result of patent text clustering. A keyphrase consists of more than one word and therefore carries more information than a single word; it can concisely summarize the theme of a text cluster, help users understand the content of a cluster quickly, and speed up patent processing.
Moreover, because keyphrases are highly condensed, representing patent text by its keyphrases has a very small computational cost, so this representation can be used for information retrieval, text clustering, and classification in patent processing.

Considering the characteristics of patent text, we propose an improved text clustering method for patent documents, together with its specific implementation. The pretreatment stage includes text preprocessing, text representation, Trie-tree-based optimized text representation, feature weight computation, and feature dimension reduction, among other steps. The clustering stage includes assistant-field-based text similarity calculation, an improved text clustering algorithm, and selection of the optimal number of categories, among other steps.

In addition, we propose a multi-method integrated keyphrase extraction method for patent documents, together with its specific implementation. Its steps include keyphrase candidate extraction based on Part-of-Speech patterns, phrase recognition based on a dictionary, phrase recognition based on context information, and keyphrase scoring based on TF-ICF-CDF, among others.

In summary, we propose an improved text clustering method and a multi-method integrated keyphrase extraction method for patent documents. Compared with traditional methods, they achieve better performance.
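To make the dictionary-based phrase recognition step more concrete, the following is a minimal sketch of one common way to do it: store the dictionary phrases in a token-level trie and scan the text with greedy longest-match lookup. The abstract does not describe the thesis's actual data structures or matching rules, so the names `TrieNode`, `PhraseTrie`, and `recognize_phrases`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration.

```python
from typing import Dict, List


class TrieNode:
    """One node of a token-level trie (hypothetical helper, not from the thesis)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_phrase_end = False


class PhraseTrie:
    """Stores dictionary phrases as paths of lowercased tokens."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, phrase: str) -> None:
        node = self.root
        for token in phrase.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.is_phrase_end = True

    def longest_match(self, tokens: List[str], start: int) -> int:
        """Return the token length of the longest phrase starting at `start`, or 0."""
        node = self.root
        best = 0
        for offset in range(start, len(tokens)):
            node = node.children.get(tokens[offset].lower())
            if node is None:
                break
            if node.is_phrase_end:
                best = offset - start + 1
        return best


def recognize_phrases(text: str, trie: PhraseTrie) -> List[str]:
    """Scan whitespace-separated tokens left to right, taking the longest match."""
    tokens = text.split()
    found: List[str] = []
    i = 0
    while i < len(tokens):
        length = trie.longest_match(tokens, i)
        if length:
            found.append(" ".join(tokens[i:i + length]).lower())
            i += length  # skip past the matched phrase
        else:
            i += 1
    return found


trie = PhraseTrie()
for entry in ["text clustering", "text clustering algorithm", "keyphrase extraction"]:
    trie.add(entry)

print(recognize_phrases(
    "an improved text clustering algorithm and keyphrase extraction", trie))
```

Greedy longest match prefers "text clustering algorithm" over its prefix "text clustering" when both are in the dictionary. A real system would likely need proper tokenization and punctuation handling, which this sketch leaves out.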
Keywords/Search Tags: patent, text clustering, multi-method integration, phrase recognition, keyphrase scoring, keyphrase extraction