Research And Application Of Patent Map Service System

Posted on: 2016-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Xu
GTID: 2308330461956045
Subject: Software engineering
Abstract/Summary:
Patented technology is an important bargaining chip in industrial competition between countries. A patent map, built by mining and analyzing unstructured information such as patent literature, can help businesses understand industry dynamics and improve existing technology. Patent literature is mostly unstructured and its volume is very large, and traditional processing methods are time-consuming, so this thesis uses Hadoop MapReduce to process this large body of unstructured information. After analyzing the requirements and the relevant techniques, the thesis establishes a three-layer architecture consisting of data collection, text classification, and patent information visualization, with the focus on text classification and visualization. Data collection includes manual and automatic collection. Automatic collection mainly uses web crawler technology to gather data from the data sources at regular intervals according to configured topics and keywords, and obtains the desired patent literature by removing duplicate web content and cleaning the data. Given the characteristics of patent literature, the thesis takes the title and abstract of each patent as its original text. Text classification has two primary steps: text preprocessing and classification proper. The thesis analyzes the stages of text preprocessing, namely Chinese word segmentation, stop-word filtering, feature selection by information gain, and text representation, together with the statistics these computations require (word frequency and document frequency). Drawing on the MapReduce computing model, it designs in detail how the whole preprocessing pipeline can be parallelized to cope with the unstructured nature of patent literature, and experiments show that processing time is greatly reduced.
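The word-frequency and document-frequency statistics mentioned above fit the map/shuffle/reduce pattern naturally. The following is a minimal single-process Python sketch of that pattern, not the thesis's Hadoop implementation: the function names (`map_doc`, `shuffle`, `reduce_term`, `term_statistics`) and the input layout are hypothetical, and documents are assumed to be already segmented into tokens.

```python
from collections import defaultdict
from itertools import chain


def map_doc(doc_id, tokens, stop_words):
    """Map step (hypothetical): for one document, emit (term, (tf, 1)).

    tf is the term's count in this document; the trailing 1 marks that
    the term occurs in this document, so summing it gives document frequency.
    """
    counts = defaultdict(int)
    for tok in tokens:
        if tok not in stop_words:  # stop-word filtering
            counts[tok] += 1
    return [(term, (tf, 1)) for term, tf in counts.items()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_term(term, values):
    """Reduce step: sum per-document counts into corpus-wide (tf, df)."""
    total_tf = sum(v[0] for v in values)
    doc_freq = sum(v[1] for v in values)
    return term, (total_tf, doc_freq)


def term_statistics(docs, stop_words=frozenset()):
    """Run map, shuffle, and reduce over {doc_id: [token, ...]}."""
    emitted = chain.from_iterable(
        map_doc(doc_id, tokens, stop_words) for doc_id, tokens in docs.items()
    )
    return dict(reduce_term(t, vs) for t, vs in shuffle(emitted).items())


if __name__ == "__main__":
    docs = {
        "p1": ["patent", "map", "patent"],
        "p2": ["patent", "the", "hadoop"],
    }
    print(term_statistics(docs, stop_words={"the"}))
```

Because each document is mapped independently, the map calls could run on separate nodes; only the grouping by term requires data exchange, which is the property a MapReduce job relies on.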
The KNN algorithm is optimized to suit the characteristics of the patent texts considered here: the thesis proposes combining class-center vectors with KNN. In the training phase, the center vector of each class is obtained by averaging the patent documents of that class, and these center vectors serve as a primary classifier. The similarity between the test text and each center vector is then computed to find the nearest M classes (M is a threshold entered manually into the system). Next, the test text is compared with the training texts of those M classes, and the K nearest patent documents are found with the KNN algorithm to determine the category of the test text. Restricting the comparison to fewer training texts reduces the amount of computation. For the patent map display, the classified literature is averaged so that each category is represented by a single vector, its category-center vector, and the similarity between each pair of categories is computed. On the map, each category is drawn as a circle and each similarity value as a line, so the relations between patents can be read from the relations between categories. Finally, an initial implementation of the patent map construction system is presented; by producing patent maps it gives a convenient and comprehensive view of the patent situation in the fields of interest and makes patent information in a technical area easier to understand.
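The two-stage classifier and the category similarities used for the map can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's code: the abstract does not name the similarity measure, so cosine similarity is assumed here, documents are assumed to be already represented as equal-length numeric vectors, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity (assumed measure; the abstract does not specify one)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_centers(train):
    """Average the vectors of each class; train is a list of (vector, label)."""
    grouped = defaultdict(list)
    for vec, label in train:
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(test_vec, train, centers, m, k):
    """Stage 1: keep the m classes whose centers are most similar to test_vec.
    Stage 2: run KNN only over training vectors from those m classes."""
    nearest = sorted(
        centers, key=lambda c: cosine(test_vec, centers[c]), reverse=True
    )[:m]
    candidates = [
        (cosine(test_vec, vec), label) for vec, label in train if label in nearest
    ]
    top_k = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]


def category_similarities(centers):
    """Pairwise center similarities: the values drawn as lines between circles."""
    labels = sorted(centers)
    return {
        (a, b): cosine(centers[a], centers[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }


if __name__ == "__main__":
    train = [
        ([1.0, 0.0], "A"), ([0.9, 0.1], "A"),
        ([0.0, 1.0], "B"), ([0.1, 0.9], "B"),
        ([0.7, 0.7], "C"),
    ]
    centers = class_centers(train)
    print(classify([1.0, 0.05], train, centers, m=2, k=3))
    print(category_similarities(centers))
```

With m smaller than the number of classes, stage 2 compares the test vector against only a subset of the training set, which is where the reduction in computation comes from.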
Keywords/Search Tags:Hadoop, Text Categorization, MapReduce, KNN, Patent Map