Automatic Construction And Application Of Concept Taxonomy Based On Multi-source Heterogeneous Data

Posted on: 2018-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Zhang
GTID: 2348330512478498
Subject: Library and Information Science
Abstract/Summary:
A comprehensive domain concept hierarchy, as constructed by domain experts, classifies domain knowledge layer by layer in a top-down manner. Such a hierarchy facilitates users' search and retrieval and supports further research and tasks built on it, such as question answering and machine translation. Traditional manual construction is inefficient and costly. Recent automatic methods, on the other hand, usually rely on a single corpus, so the extracted concepts are often inaccurate and the hierarchical relations relatively simple. By contrast, this thesis starts from the semi-structured data of a domain-professional corpus, combines it with unstructured user-generated content (UGC) from online social media, and constructs a concept taxonomy based on multi-source heterogeneous data. For the domain-professional corpus, we construct the initial concept taxonomy from semi-structured data. For the unstructured UGC, the work is organized into three levels: keyword extraction, word similarity calculation, and concept taxonomy construction. The specific research contents are as follows.

First, for keyword extraction, we combine multiple extraction methods, namely Pattern Matching, Statistical Feature Ordering, and Sequence Tagging, and propose a keyword extraction method based on seed word expansion. We first obtain seed words through Pattern Matching and Statistical Feature Ordering, and then expand them into more keywords with a Conditional Random Fields (CRFs) model. In a comparison with TF*IDF, TextRank, NC-value, and a CRFs model, the evaluation suggests that our strategy removes the dependence on high frequency found in statistical feature extraction, relaxes the constraints imposed by syntactic templates, and achieves a higher recall.

Second, for word similarity measurement, we propose a measure based on multi-source knowledge fusion. We first compute similarities separately from word knowledge hierarchies (TYCCL & HowNet), large-scale corpora (news & microblog), and Web search engines (Baidu & Bing), each as a separate algorithm. All similarities are then combined through support vector regression to obtain the final similarity. The evaluation suggests that, if the training set is large and representative enough, the integrated model outperforms every separate algorithm in both performance and stability.

Finally, for concept taxonomy construction, we obtain the hierarchical relations among concepts with the K-means clustering algorithm. Because different clustering algorithms produce quite different results, we also compare the Affinity Propagation and Hierarchical Clustering algorithms and carry out a quantitative evaluation to choose a suitable algorithm.

Based on the above three levels of research, we construct a concept taxonomy for unstructured UGC. We then harmonize the two taxonomies according to a set of principles to obtain the final hierarchy. To evaluate the quality of the concept taxonomy, we use Application-Based Evaluation and apply the constructed taxonomy to sentiment analysis. The results show that extending the emotional vector space with the constructed taxonomy significantly improves the performance of the sentiment analysis system, which also demonstrates the effectiveness of our taxonomy.
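For orientation, the TF*IDF baseline named in the abstract can be sketched roughly as below. This is a minimal illustration and not the thesis's implementation: the function name `tfidf_keywords`, the smoothed IDF formula, the whitespace tokenization, and the toy English corpus are all assumptions of this sketch (the thesis works on UGC that would need proper word segmentation first).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Rank the terms of each tokenized document by TF*IDF.

    docs: list of token lists (tokenization is assumed to be done upstream).
    Returns one list of (term, score) pairs per document, best first.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    ranked = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        scores = {
            # Smoothed IDF (an assumption of this sketch):
            # log((1 + N) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic.
        ranked.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k])
    return ranked


# Hypothetical toy corpus, split on whitespace for illustration only.
docs = [
    "the screen is bright and the battery is weak".split(),
    "the battery drains fast".split(),
    "the camera is sharp and the screen is sharp".split(),
]
top_terms = [[term for term, _ in doc] for doc in tfidf_keywords(docs, top_k=2)]
```

Without stop-word filtering, a frequent function word such as "is" can still rank highly within a document, which illustrates the dependence on frequency that the abstract attributes to purely statistical extraction.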
Keywords/Search Tags: Concept Taxonomy, Keyword Extraction, Word Similarity Computing, Cluster Analysis, Sentiment Analysis