
Research And Application Of Chinese Word Segmentation Method Based On Conditional Random Field

Posted on: 2022-04-05  Degree: Master  Type: Thesis
Country: China  Candidate: Z W Wang  Full Text: PDF
GTID: 2518306524471684  Subject: Master of Engineering
Abstract/Summary:
As society develops ever faster, both the government and the public place increasing demands on government work. The original document retrieval and management methods can no longer support that work, so a government document management system built on full-text retrieval is needed. At present, the key technology of full-text retrieval systems is Chinese word segmentation. This thesis uses an automatic word segmentation method based on a statistical model, the conditional random field (CRF), which has lower labor cost and higher overall efficiency than segmentation methods that are based on or rely on a vocabulary.

Applying CRF-based Chinese word segmentation to government official documents exposes two shortcomings. First, the features used by existing CRF models do not segment government official documents accurately. Second, CRF-based Chinese word segmentation focuses on segmentation accuracy and pays little attention to marking and resolving ambiguity after segmentation, so ambiguity resolution performs poorly and fails to improve segmentation accuracy effectively. This thesis therefore proposes a fusion feature method that effectively improves segmentation accuracy, measured by F-score. Once the F-score gain from the fusion feature method reaches a bottleneck, an ambiguity resolution method based on ambiguity markers is proposed, which further improves the segmentation F-score. Both methods are then applied to a full-text retrieval system to improve the accuracy of official document retrieval. The main work is as follows:

1. To address the low accuracy of existing CRF model features when segmenting a government official document corpus, a Chinese word segmentation fusion feature oriented to the field of government official documents is proposed. The fusion feature combines five-character unigram and three-character bigram feature templates, a four-position lexeme feature, a word length feature, and a word type feature. Experimental results show that the fusion feature achieves an F-score of 92.84%, which is higher than that of the other features compared.

2. After the fusion feature is adopted, the F-score gain reaches a bottleneck. To further improve the segmentation F-score, an ambiguity resolution method based on ambiguity markers is proposed. This method uses stable word strings and stable words, corpus balance, and feature balance to identify words that are prone to ambiguity, and then uses mutual information, boundary entropy, stable word strings, and stable words to resolve the ambiguity. Experimental results show that the ambiguity resolution method based on ambiguity markers raises the F-score to 94.42%, with higher segmentation accuracy, recall rate, and F-score than the other methods compared.

3. On this basis, the two Chinese word segmentation methods above are applied to the full-text retrieval system of the Chengdu Municipal Planning and Natural Resources Bureau. This thesis uses questionnaires to survey how the Bureau's officers use the government official document retrieval function in their daily work. Based on daily work needs, recall rate and manual retrieval cost indicators are designed, and the full-text retrieval system is compared with the original official document query system. Actual use shows that, compared with the original official document query system, the full-text retrieval system returns official documents that are highly relevant to the searched keywords, with a higher recall rate (93.42%) and a lower manual retrieval cost (3.38).
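The character-level view behind the feature templates and the F-score metric mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the B/M/E/S reading of the four-position lexeme feature, the reading of the templates as unigrams over a five-character window and bigrams over a three-character window, and the padding token are all assumptions made for illustration. The word length, word type, and ambiguity-marker features are not shown.

```python
from typing import Dict, List, Set, Tuple


def to_bmes(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character (assumed tag set)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(chars: List[str], i: int) -> Dict[str, str]:
    """Unigrams at offsets -2..2 and bigrams inside the -1..1 window (assumed templates)."""
    def at(k: int) -> str:
        return chars[k] if 0 <= k < len(chars) else "<PAD>"  # hypothetical padding token

    feats = {f"U[{d}]": at(i + d) for d in range(-2, 3)}
    feats["B[-1,0]"] = at(i - 1) + at(i)
    feats["B[0,1]"] = at(i) + at(i + 1)
    return feats


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list to a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-score; a word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = ["政府", "公文", "管理"]
pred = ["政府", "公", "文", "管理"]
print(to_bmes(gold))
print(char_features(list("".join(gold)), 0))
print(segmentation_prf(gold, pred))
```

In this toy example the predicted segmentation splits 公文 into two words, so two of its four predicted words match the three gold words: precision is 0.5, recall is 2/3, and the F-score is 4/7. In a real pipeline the per-character feature dictionaries and tags would be passed to a CRF trainer, which is outside the scope of this sketch.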
Keywords/Search Tags:Chinese word segmentation, conditional random field, fusion feature, likely ambiguity marker, ambiguity resolution