Study On Data Mining Technique Of Pharmaceutical Patents

Posted on: 2008-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: J Liang
GTID: 2189360218455174
Subject: Physical chemistry
Abstract/Summary:
Pharmaceutical patents have become one of the most important sources of information, widely used in many fields, especially in innovative drug design. However, our techniques for the storage and retrieval of patent information by computer are far behind those of developed countries. Many commercial pharmaceutical patent databases have been built in several countries, e.g., Britain, the U.S.A. and France, and we have attached importance to this work only in recent years. A pharmaceutical patent differs from other kinds of patents in that its contents consist of both generic structures and the corresponding descriptive texts. In this paper, advanced data mining techniques are applied to handle the text information in order to facilitate the retrieval of patent information.

I first improve StruDraw, a piece of chemical software designed in our group specifically for the input and output of generic structures. The function of translating text into chemical structures may help front-end users who have little chemical background to index chemical structures directly and easily. It is worth mentioning that the software is written in C++ and that its component-based architecture makes it easy to add new functions with little modification.

As for text categorization, the first step in storing a chemical patent by computer is to determine the class to which the patent belongs. Data mining, or machine learning, algorithms are more competitive than traditional manual methods. The application of several machine learning methods to the categorization of pharmaceutical patents is presented in this paper. About 2000 pharmaceutical patents are categorized into five classes according to their curative effects and are selected as training instances. Features in text form are first extracted from each class and then expressed in numerical vector form. Three machine learning algorithms, i.e., Support Vector Machines (SVM), Naïve Bayes and the RBF Neural Network, are tested by 5-fold or 10-fold cross-validation. Their performances are compared in a series of experiments, and the results show that the SVM algorithm outperforms the other two algorithms. The methods proposed in this paper may be helpful for pharmaceutical patent categorization.
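
The categorization workflow described in the abstract (text turned into numerical word counts, a classifier trained on labelled patents, and evaluation by k-fold cross-validation) can be illustrated with a minimal Python sketch. This is not the implementation used in the thesis. The toy snippets, the two class names and all function names below are invented for illustration, and only a multinomial Naïve Bayes classifier is shown; the SVM and RBF network are omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: invented snippets standing in for patent texts.
# Classes alternate so that every fold holds one document of each class.
DOCS = [
    "compound inhibits bacterial growth infection",
    "compound relieves pain and inflammation",
    "antibiotic compound treats bacterial infection",
    "analgesic composition reduces chronic pain",
    "bacterial infection treated with antibiotic derivative",
    "derivative reduces inflammation and pain",
    "derivative inhibits growth of resistant bacterial strains",
    "analgesic compound for chronic inflammation relief",
    "antibiotic composition against resistant infection",
    "composition relieves pain without sedation",
]
LABELS = ["antibacterial", "analgesic"] * 5


def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def train_nb(docs, labels):
    """Collect per-class document counts, word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokenize(doc):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=5):
    """k-fold cross-validation; fold f tests on indices f, f+k, f+2k, ..."""
    n = len(docs)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_docs = [docs[i] for i in range(n) if i not in test_idx]
        train_labels = [labels[i] for i in range(n) if i not in test_idx]
        model = train_nb(train_docs, train_labels)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / n


print("5-fold accuracy on the toy corpus:", cross_validate(DOCS, LABELS, k=5))
```

On a real corpus the same loop would be run once per algorithm so that the accuracies can be compared on identical folds.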
Keywords/Search Tags: Data Mining, Machine Learning, Pharmaceutical Patent, Text Categorization, Translation from Character to Structure