
Patent Training Sample Pruning Based On A Supervised Clustering Algorithm

Posted on: 2011-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: M Z Huang
GTID: 2178360308452400
Subject: Computer software and theory
Abstract/Summary:
We live in an era of information explosion; every field has accumulated large, even massive, amounts of data. According to WIPO statistics, patent documents contain 90% to 95% of the world's annual inventions. Patent applications worldwide increase by more than 100 million every year, and the total has reached nearly 4 billion. Making full use of these patent documents can save 60% of the research time and 40% of the research and capital investment for a technical innovation. Each patent is assigned to a specific category in the International Patent Classification (IPC) according to its content. In the past, patents were classified manually, which relies heavily on domain experts and is time-consuming and inefficient. Automatic patent classification is therefore of great importance, and a variety of methods have been studied, such as Naive Bayes, nearest neighbor, decision trees, and support vector machines. All of them have been applied to text classification with some success.

Patent classification is a large-scale, unbalanced, hierarchical, and multi-label text classification problem. Most traditional classification methods cannot handle such a complex problem. Even the support vector machine, the best-performing of these classifiers, cannot handle it, because training requires solving a quadratic programming problem, so the training time grows roughly with the square of the number of training samples. To address this, Bao-Liang Lu and his collaborators proposed the min-max modular network, whose most notable features are its parallel and modular structure. The basic idea of the network is "divide and conquer": a large-scale problem is divided into a number of independent small-scale problems, these are solved in parallel, and their solutions are then combined into a solution to the original large-scale problem.

The contribution of this thesis is a supervised clustering algorithm based on the min-max modular network. We use this algorithm to prune the training samples and successfully apply it to the classification of patent data. The main contributions are as follows:

1) We analyze the features of the min-max modular network: high modularity and incremental learning ability.
2) We analyze the receptive field of the min-max modular network and propose a supervised clustering method based on the receptive field to prune the training samples.
3) After clustering, some clusters may contain few samples, and some of these may be noise. We use a noise removal and cluster center combination algorithm to post-process the network.
4) We conduct a series of experiments on NTCIR-5 patent data and compare performance with and without clustering. The results indicate that the clustering algorithm can be used as a preprocessing method to prune the training samples while maintaining or even improving generalization ability.
5) We also conduct an experiment on the patent data to demonstrate the incremental learning ability of the min-max modular network.
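
The divide-and-conquer idea behind the min-max modular network can be illustrated with a minimal Python sketch for a two-class problem. This is not the thesis's implementation. The nearest-centroid base module, the round-robin split, and every function name below are assumptions chosen for brevity; the thesis itself refers to support vector machines and a Gaussian zero-crossing function. Only the decomposition into pairwise modules and the MIN-then-MAX combination are meant to reflect the general scheme.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def split(samples, parts):
    """Split samples into `parts` subsets (round-robin; an assumed strategy)."""
    return [samples[i::parts] for i in range(parts)]


def train_module(pos_subset, neg_subset):
    """Train one small two-class module on a pair of subsets.

    Stand-in base learner (assumption): a nearest-centroid rule. The score
    is positive when x is closer to the positive centroid than to the
    negative one.
    """
    cp, cn = centroid(pos_subset), centroid(neg_subset)
    return lambda x: dist(x, cn) - dist(x, cp)


def train_m3(pos, neg, n_pos, n_neg):
    """Train n_pos * n_neg independent modules; each could run in parallel."""
    pos_parts, neg_parts = split(pos, n_pos), split(neg, n_neg)
    return [[train_module(p, q) for q in neg_parts] for p in pos_parts]


def m3_score(modules, x):
    """MIN over modules sharing a positive subset, then MAX over subsets."""
    return max(min(m(x) for m in row) for row in modules)
```

With an XOR-like layout (positive samples near (0, 0) and (4, 4), negative samples near (0, 4) and (4, 0)), no single nearest-centroid module separates the classes, but four small modules combined by MIN and MAX do. A positive `m3_score` is read as the positive class.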
Keywords/Search Tags: Min-Max Modular Network, Gaussian Zero-crossing Function, Sample Pruning, Supervised Clustering, Patent Classification