
Knowledge Discovery Of Gene Ontology Based On Part-of-speech Tagging And Classification Algorithm

Posted on: 2016-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Dai
Full Text: PDF
GTID: 2298330467498922
Subject: Computer technology
Abstract/Summary:
With the continuing development of Internet and big data technologies, the volume of data generated every day is growing exponentially. How can useful information be mined from such massive data? Knowledge discovery provides a powerful way to address this problem, and it has considerable research value in the information age.

An ontology is an abstraction of the essential nature of the entities in a domain, expressed as a conceptual, structured representation. Knowledge discovery methods can be applied to an ontology to extract useful knowledge, and ontologies are widely used in many professional fields. The root is an important lexical unit of an ontology term: it carries the key semantic information and is as important to a term as the primary key is to a database tuple. Automatically identifying and mining the root of ontology terms helps us understand each term more deeply and find useful knowledge, which is of practical significance.

Part-of-speech (POS) tagging is also important for knowledge discovery because it aids the understanding of knowledge. It is a key step in natural language processing and plays a pivotal role in speech recognition, information retrieval, word sense disambiguation, and other fields.

Data mining is an important step in knowledge discovery, and classification is a very important data mining technique. Classification builds a model (a classifier) from existing data with known category labels and then uses that model to predict the category labels of unseen data. In the era of big data, classification techniques are widely used in big data analysis, and their application and study remain an active research topic.

It is thus clear that automatically identifying and mining roots from ontology terms, POS tagging, and classification all have practical significance for knowledge discovery.
This paper combines POS tagging from the field of natural language processing with classification algorithms from the field of data mining in order to mine roots from Gene Ontology terms. This composite knowledge discovery method lays a foundation for knowledge discovery in other ontologies, has reference value for the field of knowledge discovery, and can be readily transferred to other data, so it is broadly applicable.

The problem addressed in this paper is to automatically identify and mine the root of each Gene Ontology term. Each term is composed of more than one word, the root of a term is one of its words, and each term has exactly one root. To solve this problem, the paper analyzes and transforms it and ultimately establishes two models. Both models can be applied not only to Gene Ontology but also to ontologies in other fields, which is significant for knowledge discovery.

To solve the above problem, this paper first proposes a POS tagging method, RBUT, that suits the characteristics of the established models. RBUT is a rule-based, unsupervised POS tagging algorithm built on traditional rule-based tagging and transformation-based tagging. Experimental results show that RBUT is an effective and accurate tagging algorithm.

The paper then proposes two models to solve the problem. The first is a support vector machine (SVM) model. The data preprocessing method is introduced first, and the initial Gene Ontology data are obtained after preprocessing. Four features, Word, Tag, Size, and Index, are then extracted from the initial data as the classification features of the SVM model. The SVM model is built with the LIBSVM software tool, so the data must be in LIBSVM format. Because Word and Tag are non-numerical features, feature weights must be calculated for these two features.
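As a rough illustration of the four-feature layout, the sketch below writes one LIBSVM-format line per word of a term. It is not the thesis implementation. It assumes that Size is the number of words in the term and Index is the 1-based position of the word, which the abstract does not define. The function names, the example term, its root label, and the weight values are all hypothetical, and the Word and Tag weights are simply passed in as precomputed numbers.

```python
from typing import Dict, List


def to_libsvm_line(label: int, features: List[float]) -> str:
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' (LIBSVM text format)."""
    parts = [f"{i}:{v:g}" for i, v in enumerate(features, start=1)]
    return " ".join([f"{label:+d}"] + parts)


def term_to_samples(
    words: List[str],
    tags: List[str],
    root_index: int,               # 1-based position of the root (assumed convention)
    word_weight: Dict[str, float],  # precomputed numeric weight per word (hypothetical)
    tag_weight: Dict[str, float],   # precomputed numeric weight per tag (hypothetical)
) -> List[str]:
    """Emit one LIBSVM line per word: +1 if the word is the root, -1 otherwise."""
    size = len(words)  # assumed meaning of the Size feature
    lines = []
    for index, (word, tag) in enumerate(zip(words, tags), start=1):
        label = 1 if index == root_index else -1
        features = [
            word_weight.get(word, 0.0),  # feature 1: Word
            tag_weight.get(tag, 0.0),    # feature 2: Tag
            float(size),                 # feature 3: Size
            float(index),                # feature 4: Index
        ]
        lines.append(to_libsvm_line(label, features))
    return lines


# Made-up example term and weights, for illustration only.
example = term_to_samples(
    ["protein", "binding"], ["NN", "NN"], 2,
    {"protein": 0.2, "binding": 0.9}, {"NN": 0.5},
)
print("\n".join(example))
```

Treating each word as its own binary sample is one possible way to cast root identification as classification; the abstract does not say how the thesis frames the SVM samples.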
Because the TF-IDF weight calculation method has some flaws, this paper uses an improved weight calculation method, TFIPNDF. After weight calculation, each Gene Ontology term can be expressed as a vector.

The second model is a Naive Bayes model. Its data preprocessing and feature selection methods are the same as those of the SVM model. The Naive Bayes model is based on a horizontal comparison of all the words in the same term: it calculates, for each word in a term, the probability that the word is the root, and takes the word with the maximum probability as the root of the term. This approach ensures that each term has exactly one word as its root. Experimental results show that the average accuracy of the two models is more than 88%, indicating that the two proposed models are reliable and accurate.

In conclusion, this paper proposes two models for Gene Ontology: the SVM model and the Naive Bayes model. Both models achieve high accuracy and are suitable not only for Gene Ontology but also for ontologies in other fields, making them useful models for knowledge discovery. Future research will focus on adding more features to improve the accuracy of the two proposed models.
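The "horizontal comparison" idea, scoring every word in a term and keeping the single best one, can be sketched as follows. This is a minimal sketch under stated assumptions, not the model from the thesis. The class name `RootNaiveBayes`, the use of raw categorical features, Laplace smoothing, and ranking by log-odds are all choices made here for illustration, and the training terms and root labels are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Term = List[Tuple[str, str]]  # a term as a list of (word, POS tag) pairs


class RootNaiveBayes:
    """Toy Naive Bayes that picks exactly one root word per term."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                      # Laplace smoothing constant (assumed)
        self.class_counts: Counter = Counter()  # is_root -> number of words
        self.feature_counts: Counter = Counter()  # (is_root, name, value) -> count
        self.values: Dict[str, Set] = {}        # feature name -> values seen

    @staticmethod
    def _features(term: Term, i: int) -> Dict[str, object]:
        word, tag = term[i]
        # Word, Tag, Size, Index as named in the abstract; encoding is assumed.
        return {"word": word.lower(), "tag": tag, "size": len(term), "index": i}

    def fit(self, terms: List[Term], root_positions: List[int]) -> None:
        for term, root in zip(terms, root_positions):
            for i in range(len(term)):
                is_root = i == root
                self.class_counts[is_root] += 1
                for name, value in self._features(term, i).items():
                    self.feature_counts[(is_root, name, value)] += 1
                    self.values.setdefault(name, set()).add(value)

    def _log_score(self, is_root: bool, feats: Dict[str, object]) -> float:
        total = sum(self.class_counts.values())
        n_class = self.class_counts[is_root]
        score = math.log(n_class / total)  # log prior
        for name, value in feats.items():
            count = self.feature_counts[(is_root, name, value)]
            n_values = len(self.values.get(name, ())) + 1  # +1 slot for unseen values
            score += math.log((count + self.alpha) / (n_class + self.alpha * n_values))
        return score

    def predict_root(self, term: Term) -> int:
        """Return the 0-based position of the word most likely to be the root."""
        def log_odds(i: int) -> float:
            feats = self._features(term, i)
            return self._log_score(True, feats) - self._log_score(False, feats)
        # Comparing words within the same term guarantees a single root.
        return max(range(len(term)), key=log_odds)


# Invented training data: (term, 0-based root position). Labels are illustrative.
train_terms = [
    [("protein", "NN"), ("binding", "NN")],
    [("regulation", "NN"), ("of", "IN"), ("transport", "NN")],
    [("dna", "NN"), ("binding", "NN")],
    [("rna", "NN"), ("binding", "NN")],
]
train_roots = [1, 0, 1, 1]

model = RootNaiveBayes()
model.fit(train_terms, train_roots)
predicted = model.predict_root([("lipid", "NN"), ("binding", "NN")])
```

Ranking by log-odds gives the same ordering as ranking by the posterior probability of being the root, and taking the maximum over the words of one term is what enforces the one-root-per-term constraint.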
Keywords/Search Tags: Gene Ontology, Knowledge Discovery, Part-of-Speech Tagging, Naive Bayes Classification, Support Vector Machine