Research on New Word Identification Based on SVM and Word Features

Posted on: 2013-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Xu
Full Text: PDF
GTID: 2248330395966493
Subject: Computer application technology
Abstract/Summary:
As China has entered a new era of rapid development, the Chinese language has developed along with it. Words are an important part of the language, and the vocabulary is where change is most active: large numbers of new Chinese words emerge in science and technology, the economy, culture, and everyday life. These new words enrich both everyday and online language, but they also pose a challenge for Chinese word segmentation. Unlike English and other alphabetic languages, in which spaces act as natural delimiters between words, Chinese has no explicit boundaries between words, so text must be segmented into words before a computer can process it. The emergence of new words causes segmentation to produce many hard-to-identify "strings" and "fragments", which lowers segmentation accuracy to a certain extent. According to statistics, about half of all Chinese word segmentation errors are caused by new words. If new words could be added to the segmentation dictionary in a timely way, the accuracy of Chinese word segmentation systems would improve considerably. New word detection has therefore become one of the difficulties and bottlenecks of automatic Chinese word segmentation, and how to identify Chinese new words has become an important research topic.
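The fragment problem described above can be illustrated with a minimal sketch. It assumes a forward maximum matching segmenter, a common dictionary-based method; the abstract does not say which segmenter the thesis uses. The function name, the toy dictionary, and the sample sentence are all hypothetical and chosen only for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary and sentence, purely illustrative (not from the thesis).
full_dictionary = {"我们", "喜欢", "使用", "微博"}
sentence = "我们喜欢使用微博"

# Simulate a new word by removing it from the dictionary.
reduced_dictionary = full_dictionary - {"微博"}

with_word = forward_max_match(sentence, full_dictionary)
without_word = forward_max_match(sentence, reduced_dictionary)

print(with_word)     # ['我们', '喜欢', '使用', '微博']
print(without_word)  # ['我们', '喜欢', '使用', '微', '博']
```

When the word is missing from the dictionary, it falls apart into single-character fragments, which is the kind of output a new word detector has to reassemble.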
The support vector machine (SVM) is a machine learning method that shows many distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition, and it can also be applied to function fitting and other machine learning problems. This thesis proposes combining word features with an SVM to identify and extract new words. First, new words are simulated by modifying the word segmentation dictionary, and the training corpus and test corpus are segmented with the modified dictionary. The proposed word features are then computed statistically from the training corpus, and positive and negative samples are extracted and quantified as feature vectors. SVMs with different kernel functions are trained on these samples to obtain the support vectors for new word classification, with slack variables added to improve classification accuracy. Finally, the candidate word vectors from the test corpus are tested with the SVM using the support vectors obtained from the training corpus: a value is calculated for each candidate word and compared with a threshold to obtain the final new word recognition result.

A new word identification program was designed to extract candidate new words from the training corpus and generate the recognition support vectors, and then, combined with the test corpus, to output the recognition results. A new word classification program calculates the recall and precision (correct rate) on the test corpus and generates a plot of the new word classification.
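For reference, the soft-margin SVM with slack variables and the threshold decision described above can be written in the standard textbook form below. The notation is an assumption and is not taken from the thesis: x_i are the quantified word feature vectors, y_i is +1 for positive samples and -1 for negative samples, xi_i are the slack variables, C is the penalty parameter, and theta is the decision threshold. The three kernels are the usual RBF, polynomial, and sigmoid forms with generic parameters gamma, r, and d.

```latex
% Soft-margin primal problem with slack variables \xi_i
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .

% Decision value for a candidate word x, compared with a threshold \theta
f(x) = \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b,
\qquad
x \text{ is accepted as a new word if } f(x) > \theta .

% Common kernel functions
K_{\mathrm{RBF}}(x, z) = \exp\bigl(-\gamma \lVert x - z\rVert^{2}\bigr),\qquad
K_{\mathrm{poly}}(x, z) = \bigl(\gamma\, x^{\top} z + r\bigr)^{d},\qquad
K_{\mathrm{sig}}(x, z) = \tanh\bigl(\gamma\, x^{\top} z + r\bigr).
```

Only the training samples with nonzero alpha_i (the support vectors) contribute to f(x), which is why the trained support vectors are all that is needed to score candidate words from the test corpus.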
The experiments use a People's Daily corpus of about 300,000 Chinese characters that has been integrated and processed. To simulate new words, 100 words were removed from the word segmentation dictionary. The new word identification program and the new word classification program were then used together to extract new words and to measure recall and precision (correct rate). The first experiments use the radial basis function (RBF) kernel with slack variables and vary the word features used. Analysis of the results shows that the selected word features have a positive effect on new word recognition, so the subsequent experiments use all of the proposed word features. With all other conditions held the same, the RBF kernel, the polynomial kernel, and the sigmoid kernel were then compared. The results show that the RBF kernel with all word features gives the best result, with a precision of 45.12% and a recall of 43%, while the other two kernels give lower recall and precision.

The experiments show that combining word features with an SVM makes it possible to recognize and extract new words with relatively good results, so the method can be extended to new word recognition applications.
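The recall and precision figures reported above follow the usual set-based definitions, sketched below. This is a minimal illustration that assumes "correct rate" means precision; the function name and the toy word sets are hypothetical and are not data from the thesis.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predicted new words against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Precision: share of predicted words that are real new words.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of real new words that were found.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, purely illustrative: words removed from the dictionary (gold)
# versus words the classifier accepted (predicted).
gold_new_words = {"微博", "团购", "给力", "山寨"}
predicted_new_words = {"微博", "给力", "用微"}

precision, recall = precision_recall(predicted_new_words, gold_new_words)
print(f"precision={precision:.2%} recall={recall:.2%}")  # precision=66.67% recall=50.00%
```

In the setup described in the abstract, the gold set would be the 100 words removed from the segmentation dictionary, and the predicted set would be the candidates whose SVM value passes the threshold.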
Keywords/Search Tags: New word identification, SVM, Chinese word segmentation, word feature information, kernel function