
The Study Of The Chinese Word Segmentation Algorithm In Medical Question Answering System

Posted on: 2020-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: W Xu
GTID: 2404330599958979
Subject: Control Engineering
Abstract/Summary:
The medical question answering system is both a vital field of smart healthcare development and a research hotspot in Chinese natural language processing. It must answer users' medical questions quickly and appropriately, in accurate and concise natural language, which demands high execution efficiency. The accuracy and speed of Chinese word segmentation directly affect that efficiency, in particular the system's answer accuracy and time consumption. The LSTM-CRF neural network can increase the accuracy and speed of Chinese word segmentation, performs segmentation automatically, and does not depend on manually annotated features. The research covers three parts: collection and storage of medical text data, design of an improved neural network segmentation structure, and testing of the effect of the Chinese word segmentation algorithm on the execution speed of the medical question answering system. The main research outcomes are as follows.

First, by quantifying and storing the collected disease and medical question answering information, a medical text database was established for this research. It contains 29,610 disease information entries and 23,632 question-answer pairs, covering 9,856 diseases in 39 departments.

Second, a Chinese word segmentation algorithm based on the LSTM and CRF models was designed. Two LSTM neural networks were connected in opposite directions, and the weight matrix of the two-layer network was adjusted. Different weights were then selected for the sequence information, and the output of the inference layer was predicted through a linear transformation of the context feature vector combined with the CRF. The inference layer of the combined BI-LSTM-CRF network was extended to a six-tag set. After the parameters of the fusion network were re-adjusted, a comparative word segmentation test on the medical and MSRA datasets showed that the accuracy of the BI-LSTM-CRF network reached 90.5% for medical text segmentation when the network layer weight value was 0.85, which confirmed the applicability of the BI-LSTM-CRF network to medical text segmentation.

Third, two modules were designed to address the defects of the BI-LSTM-CRF word segmentation network, which led to the improved LSTM-CRF model. Because the unconstrained linearity of the context can easily cause information loss and low segmentation accuracy, 1) an Importance layer was added between the BI-LSTM and CRF layers to calculate the correlation between the input and output and so acquire the characteristics of the overall text; 2) the text vectors fed into the network model were denoised to guarantee a certain probability for the word embeddings within fixed windows, in order to reduce the effects of the adjacent word embeddings on the left and right. A comparative test of the BI-LSTM-CRF word segmentation algorithm on a single corpus and on multiple mixed corpora found that the accuracy rose to 94.7% on the former and to 96.3% on the latter. Thus the BI-LSTM-CRF model achieves higher accuracy and better generalization ability on large-scale corpora.

Finally, the performance of different Chinese word segmentation algorithms in the medical question answering system was tested: the medical FAQ question answering system was trained with three common word segmentation algorithms and with the algorithm designed in this thesis, all on the same dataset. Analysis of time consumption, answer accuracy, and ROC curves shows that the time consumption for answering is 10 s (reduced by 4 s), the accuracy of choosing answers is 91% (increased by 5.2%), and the area under the ROC curve is larger. It can therefore be concluded that the improved Chinese word segmentation algorithm improves the accuracy and the execution efficiency of the medical question answering system.
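To make the role of the CRF inference layer and the six-tag set more concrete, the following is a minimal sketch in plain Python, not the thesis implementation. It assumes the six tags are B, B2, B3, M, E and S (a commonly used six-tag scheme; the abstract does not name its tags), and the function names `viterbi` and `tags_to_words` are hypothetical. The emission scores would come from the BI-LSTM and the transition scores from the CRF; here they are simply passed in as lists.

```python
from typing import List, Sequence

# Assumed six-tag scheme (the abstract does not list its tags):
# B, B2, B3 = first, second, third character of a word; M = any later
# middle character; E = last character; S = single-character word.
TAGS = ["B", "B2", "B3", "M", "E", "S"]


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t (e.g. BI-LSTM output)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])          # best score ending in each tag so far
    backptrs: List[List[int]] = []      # best previous tag for each step
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def tags_to_words(chars: str, tags: Sequence[str]) -> List[str]:
    """Cut a character sequence into words: a word closes on E or S."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:                             # tolerate an unfinished final word
        words.append(buf)
    return words
```

For example, under the assumed tag scheme, `tags_to_words("感冒发烧怎么办", ["B", "E", "B", "E", "B", "B2", "E"])` returns `["感冒", "发烧", "怎么办"]`. In a full system the tag sequence would be obtained by mapping the indices returned by `viterbi` through `TAGS`.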
Keywords/Search Tags:Chinese word segmentation, Data crawler, Long short-term memory, Conditional random field, Medical question answering system