| Ontology construction is of great significance to the development of national applications, and the ontology must also accommodate new concepts that arise in realistic applications. Current ontology construction methods are time-consuming and the resulting ontologies are difficult to maintain, so automatic ontology construction is necessary. Because unstructured text makes up a large proportion of national information resources, extracting concepts and the relationships between them from unstructured text is important for automatic ontology construction. This paper studies the automatic construction of an ontology for national information resources. The research comprises the following four points: 1. We collect relevant text from the web and use a hidden Markov model (HMM) for word segmentation and part-of-speech (POS) tagging. To resolve the segmentation ambiguity caused by dictionary-matching methods, we use the HMM, a statistical learning method, to ensure the accuracy of word segmentation. The HMM also performs very well in POS tagging. 2. Because of the uniqueness of national culture, no dictionary contains the entire national vocabulary, so it is difficult to discover new words in national-domain text by dictionary matching. Because computing statistics over the corpus takes a long time, we use MapReduce to speed up the feature selection methods, which include mutual information, left and right entropy, and word frequency. We improve the mutual information and the left and right entropy feature selection methods respectively, and propose two methods: strengthened mutual information and relative entropy. 3. Because unsupervised learning methods do not perform well in relation mining, we use a supervised method based on a support vector machine (SVM) to study the relationship between the entity and the object. As classifier features, we choose the concept words before and after the word, the part of speech, and the distance from the entity to the core predicate. To address the scarcity of labeled data in the corpus, we use self-training to label additional data; the resulting trained model is then used to extract relationships. 4. We establish a core ontology of national information resources and then extend it with Jena to realize the automatic construction of the ontology. |
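
The feature selection statistics named in the second point can be illustrated with a minimal single-process sketch. It computes the standard pointwise mutual information and left and right neighbor entropy for adjacent token pairs; it is not the strengthened mutual information or relative entropy variants proposed in the paper, whose definitions are not given here, and it omits the MapReduce parallelization. The function names, the `min_freq` threshold, and the toy input are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information (bits) of two adjacent tokens.

    Approximation: unigram and bigram probabilities share the same
    denominator (the token count), which is common in sketches like this.
    """
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))


def entropy(neighbor_counts):
    """Shannon entropy (bits) of a neighbor-frequency distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())


def score_candidates(tokens, min_freq=2):
    """Score adjacent token pairs as candidate new words (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left_nb = defaultdict(Counter)   # tokens seen immediately before each pair
    right_nb = defaultdict(Counter)  # tokens seen immediately after each pair
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left_nb[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right_nb[pair][tokens[i + 2]] += 1

    total = len(tokens)
    scores = {}
    for pair, freq in bigrams.items():
        if freq < min_freq:
            continue  # word-frequency filter
        scores[pair] = {
            "freq": freq,
            "pmi": pmi(freq, unigrams[pair[0]], unigrams[pair[1]], total),
            "left_entropy": entropy(left_nb[pair]),
            "right_entropy": entropy(right_nb[pair]),
        }
    return scores


if __name__ == "__main__":
    # Toy character sequence: "ab" recurs with varied neighbors on both sides.
    for pair, stats in score_candidates(list("abxabyabz")).items():
        print(pair, stats)
```

A pair with high mutual information (its parts co-occur more often than chance) and high entropy on both sides (it appears in varied contexts) is a plausible new-word candidate. In a MapReduce setting, the counting loops would be the natural part to distribute, with the scoring done on the aggregated counts.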