Font Size: a A A

Research On A Graph-based Method For Word Sense Induction

Posted on:2015-07-22Degree:MasterType:Thesis
Country:ChinaCandidate:B YinFull Text:PDF
GTID:2348330485996078Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Word Sense Induction(WSI) is an open problem of natural language processing, which concerns automatic identification of senses of a word. The senses of words are usually defined by a dictionary, but it might not compile all senses, especially a new sense or a sense in a specific domain. Therefore, WSI is important to recognize all the senses for a given word and indispensable for various NLP. In general, WSI is a task to construct clusters of contexts in which the target word occurs or to build cluster of words related to the sense of the target word. Each obtained cluster corresponds to one sense of the target word. In order to solve WSI problem, two major approaches have been proposed, clustering based method and graph based method. This thesis presents a graph based word sense induction method that is an extension of Hyper Lex. First, a co-occurrence graph is constructed. Vertices correspond to words appearing in the context of the target word, while edges between vertices represent co-occurrence relations between words. Weights of the edges indicate how two words correlated. In this method, the weights of the edges are determined based on both co-occurrence and syntactic relations of words. When two words co-occur frequently and they frequently appear under any kinds of syntactic relations, the weight of the edge between them becomes low. Once the co-occurrence graph is built, a simple iterative algorithm is applied to obtain the hubs by considering the connectivity of the graph. The hubs are detected by Key Player Problem(KPP) algorithm that measures a graph connectivity. Unlike Hyper Lex, a node that is connected from many other nodes is chosen as the hub. Next, an expanded graph is built. A dummy node representing the target word itself is added to the graph. Then each hub is linked to the target word with 0 weights. Finally, a minimum spanning tree(MST) is computed over the expanded graph by taking the target word as the starting node and making previously identified hubs as its first level children. The target word is removed from the MST, causing that the graph is subdivided into the sub-trees whose root nodes are hubs. The performance of the proposed WSI method was evaluated on a sense tagged corpus for 57 target words. We compared Purity, Inverse Purity and the harmonic mean of them(P-IP hereafter), that are well-known criteria for clustering, of the original Hyper Lex and our extended model. The average of P-IP for 57 target words of Hyper Lex was 0.482. By optimization of the parameters, P-IP was improved to 0.513. P-IP of the model where the hubs are chosen by KPP was 0.609. Finally, considering syntactic relations in the weights of the edges, P-IP was improved to 0.611. Furthermore, the number of induced senses in the proposed method was more similar to the genuine number of senses. These results proved the effectiveness of our proposed method.
Keywords/Search Tags:Word Sense Induction, Graph-based Method, Small World Graph, Key Player Problem, Syntactic Relation
PDF Full Text Request
Related items