Knowledge Discovery Of Gene Ontology Based On Part-of-speech Tagging And Classification Algorithm

Posted on:2016-10-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y K Dai

Full Text:PDF

GTID:2298330467498922

Subject:Computer technology

Abstract/Summary:

With the continuous development of Internet and big data technology, the volume of data generated in daily life is growing exponentially. How can useful information be mined from such massive data? Knowledge discovery provides a powerful way to address this problem and therefore has great research value in the information age.

An ontology is an abstraction of the essence of the entities in a domain, expressed as a conceptual, structured representation. Knowledge discovery methods can be applied to an ontology to extract useful knowledge, and ontologies are widely used in many professional fields. The root is an important lexical unit of an ontology term: it carries key semantic information and acts as the primary key of the term, just as a primary key identifies a tuple in a database. Automatically identifying and mining the root of an ontology term helps us understand the term more deeply and find useful knowledge, which is of practical significance.

Part-of-speech tagging (POS tagging) is also important for knowledge discovery because it helps in understanding knowledge. It is a key step in natural language processing and plays a pivotal role in speech recognition, information retrieval, word sense disambiguation, and other fields.

Data mining is an important step in knowledge discovery, and classification is a very important data mining technique. Classification builds a classification model (a classifier) from existing data with known category labels, and then uses the model to predict the category labels of unknown data. In the era of big data, classification techniques are widely used in big data analysis, and their application and study remain a hot topic.

It is thus clear that automatically identifying and mining roots from ontology terms, POS tagging, and classification are all of practical significance for knowledge discovery.
This paper combines POS tagging from the field of natural language processing with classification algorithms from the field of data mining to mine roots from gene ontology terms. This complex knowledge discovery method lays a foundation for knowledge discovery in other ontologies, has reference value for the field of knowledge discovery, and can be readily transferred to other data, so it is broadly applicable.

The problem addressed in this paper is the automatic identification and mining of the root of each gene ontology term. Each term is composed of more than one word, the root of a term is one of its words, and each term has exactly one root. To solve this problem, the paper analyzes and transforms it, and eventually establishes two models. Both models can be applied not only to the gene ontology but also to ontologies in other fields, which is significant for knowledge discovery.

To solve the above problem, this paper first proposes a POS tagging method, RBUT, that suits the characteristics of the established models. RBUT builds on traditional rule-based tagging and transformation-based tagging, and is a rule-based, unsupervised POS tagging algorithm. Experimental results show that RBUT is an effective and accurate tagging algorithm.

The paper then proposes two models to solve the above problem. The first is a support vector machine (SVM) model. The data preprocessing method is introduced first, and the initial gene ontology data are obtained after preprocessing. Four features, Word, Tag, Size, and Index, are then extracted from the initial data as the classification features of the SVM model. The SVM model uses the software tool LIBSVM, so the data must be in the LIBSVM format. Because Word and Tag are non-numerical features, feature weights must be calculated for these two features.
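
As a rough illustration of the LIBSVM input described above, the following minimal sketch writes one line per word of a term in the LIBSVM text format `<label> <index>:<value> ...`, with 1-based feature indices in ascending order. Several details are assumptions made only for this example and are not taken from the thesis: Size is read as the number of words in the term, Index as the 1-based position of the word, the root is labeled 1 and every other word -1, and the Word and Tag weights are placeholder numbers rather than TFIPNDF values. The function names and the sample term are hypothetical.

```python
def to_libsvm_line(label, features):
    """Format one sample as a LIBSVM line: '<label> <index>:<value> ...'."""
    # LIBSVM expects 1-based feature indices in ascending order.
    parts = [f"{i}:{v:g}" for i, v in enumerate(features, start=1)]
    return " ".join([str(label)] + parts)


def term_to_libsvm_rows(words, tags, root_index, word_weight, tag_weight):
    """Turn one term into LIBSVM lines, one line per word.

    Assumed feature order: Word weight, Tag weight, Size, Index.
    root_index is the 1-based position of the root word (assumption).
    """
    size = len(words)  # Size: number of words in the term (assumption)
    rows = []
    for index, (word, tag) in enumerate(zip(words, tags), start=1):
        label = 1 if index == root_index else -1
        features = [
            word_weight.get(word, 0.0),  # placeholder weight, not TFIPNDF
            tag_weight.get(tag, 0.0),    # placeholder weight, not TFIPNDF
            size,
            index,
        ]
        rows.append(to_libsvm_line(label, features))
    return rows


# Illustrative two-word term with made-up weights and an assumed root.
rows = term_to_libsvm_rows(
    words=["mitochondrial", "membrane"],
    tags=["JJ", "NN"],
    root_index=2,
    word_weight={"mitochondrial": 0.4, "membrane": 0.9},
    tag_weight={"JJ": 0.2, "NN": 0.8},
)
for row in rows:
    print(row)
```

Each word becomes its own training sample here, so a classifier trained on such lines decides word by word whether a word is the root.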
Because the TF-IDF weight calculation method has some flaws, this paper uses an improved weight calculation method, TFIPNDF. After weight calculation, each gene ontology term can be expressed as a vector.

The second model is a Naive Bayes model. Its data preprocessing and feature selection methods are the same as those of the SVM model. The Naive Bayes model is based on a horizontal comparison of all the words in the same term: it calculates, for each word in the term, the probability that the word is the root, and takes the word with the maximum probability as the root of the term. This approach ensures that each term has exactly one word as its root.

Experimental results show that the average accuracy of the two models is more than 88%, indicating that the two models proposed in this paper are reliable and accurate.

In conclusion, this paper proposes two models for the gene ontology: the SVM model and the Naive Bayes model. Both models achieve high accuracy, are suitable not only for the gene ontology but also for ontologies in other fields, and are important models for knowledge discovery. Future research will focus on adding more features to improve the accuracy of the two proposed models.
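
The within-term comparison used by the Naive Bayes model can be illustrated with a small sketch. It is not the thesis implementation: it assumes categorical features (Tag, Size, Index) only, add-one smoothing, and a tiny made-up training set, and all function names are hypothetical. For each word it computes the log-odds of being the root, which ranks words in the same order as the posterior root probability, and returns the position of the highest-scoring word, so exactly one root is chosen per term.

```python
import math
from collections import Counter


def train(samples):
    """Count classes and feature values from (features, is_root) pairs."""
    class_counts = Counter()
    feat_counts = Counter()  # keyed by (is_root, feature position, value)
    values = {}              # feature position -> set of observed values
    for feats, is_root in samples:
        class_counts[is_root] += 1
        for pos, val in enumerate(feats):
            feat_counts[(is_root, pos, val)] += 1
            values.setdefault(pos, set()).add(val)
    return class_counts, feat_counts, values


def log_joint(feats, is_root, model):
    """log P(class) + sum of log P(feature | class), add-one smoothed."""
    class_counts, feat_counts, values = model
    total = sum(class_counts.values())
    score = math.log(class_counts[is_root] / total)
    for pos, val in enumerate(feats):
        num = feat_counts[(is_root, pos, val)] + 1
        den = class_counts[is_root] + len(values[pos])
        score += math.log(num / den)
    return score


def pick_root(term_feats, model):
    """Return the 0-based position of the word most likely to be the root."""
    # Log-odds of root vs. non-root is monotonic in the posterior probability.
    scores = [log_joint(f, True, model) - log_joint(f, False, model)
              for f in term_feats]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up training data: (Tag, Size, Index) per word, plus a root flag.
training = [
    (("JJ", 2, 1), False), (("NN", 2, 2), True),
    (("NN", 3, 1), False), (("NN", 3, 2), False), (("NN", 3, 3), True),
    (("JJ", 2, 1), False), (("NN", 2, 2), True),
]
model = train(training)
root_position = pick_root([("JJ", 2, 1), ("NN", 2, 2)], model)
print(root_position)  # 1: the second word is chosen as the root
```

Taking the maximum over the words of one term, rather than thresholding each word independently, is what guarantees a single root per term in this sketch.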

Keywords/Search Tags:

Gene Ontology, Knowledge Discovery, Part-of-Speech Tagging, Naive Bayes Classification, Support Vector Machine

Related items

1	Chinese Word Found Its Part Of Speech Tagging
2	Knowledge Discovery Based On Rough Sets And Applications To Chinese Processing
3	Research On Chinese Part-of-speech Tagging Based On Semi Hidden Markov Model
4	Research And Application Of Knowledge Warehouse System Of CRM Based On Knowledge
5	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
6	Research And Implementation Of Modify Chinese Part-of-Speech Tagging Based On FST Technology
7	Research On Lao Language Part-of-speech Tagging With Multiple Features
8	Research On Laodian Participle And Part-of-speech Tagging Method
9	Research On The Construction Method Of Burmese Part-of-speech Tagging Corpus
10	A Research On Lao Language Part-of-speech Tagging With Multi-feature Fusion