Word Segmentation And Term Extraction From Multi-language Texts

Posted on: 2020-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: J W Wang
Full Text: PDF
GTID: 2518306452971049
Subject: Information management and information systems
Abstract/Summary:
With the proposal of the Belt and Road Initiative and the continued development of economic globalization, cross-language exchange plays an increasingly prominent role in national diplomacy and in non-governmental economic exchange. Text data is the main channel for acquiring information and one of the main data sources for management decisions in the era of big data. Unlike other unstructured data such as images and video, text differs significantly from one language to another. To acquire fast-changing international information quickly and to support the international strategic decisions of various organizations, research on the automatic analysis of multi-language text data has become increasingly important. This thesis studies two foundational steps of multi-language text mining: word segmentation and term extraction. The main research contents are as follows.

(1) A review of multi-language word segmentation methods. The thesis analyzes the main word segmentation methods and tools for Chinese, Japanese, English, and Russian, together with their performance, and compares the methods and tools applicable to multi-language text. After classifying each method and tool, it compares their implementation principles, algorithms, performance, dictionaries, development languages, and operating systems.

(2) A multi-language term extraction method based on the by-step-of-atomic-word method. The method combines linguistic rules with statistical information. First, an existing multi-language word segmentation method is used to perform word segmentation and part-of-speech tagging, and atomic words that are unlikely to form terms are removed using part-of-speech and stop-word filters. Then, all atomic word strings are extracted with the atomic word as the step size, and strings with incomplete structure or poor independence are filtered out by a substring reduction algorithm, yielding a higher-quality set of candidate words. Finally, the degree to which each candidate forms a term is calculated by combining word unithood with word frequency distribution, and the candidates are ranked by this score to determine the final term set. The method can effectively extract low-frequency words with relatively complete structure, which improves the precision of term extraction to some extent.

(3) Application of the proposed term extraction method to the United Nations Parallel Corpus. On the Chinese and English corpora, the thesis tests how the reduction and misjudgment of substring reduction trend under different K values and compares the term extraction results. The experimental results show that, compared with the existing method, the precision and recall of the proposed method increased by 4.08% and 4.23% on the Chinese corpus, and by 8.19% and 8.91% on the English corpus.

In summary, this thesis studies multi-language word segmentation and term extraction, which are basic technical problems in the automatic analysis of multi-language text. Texts in different languages are first segmented with an existing multi-language word segmentation method, and terms are then extracted with the method based on the by-step-of-atomic-word method; the results can support information retrieval, public opinion analysis, and other text mining applications. The method is suitable for the automatic analysis of massive text and also supports text analysis after speech recognition.
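The candidate-generation and substring-reduction steps described in (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `max_len` limit, and the `min_independent` threshold are assumptions introduced here for illustration, part-of-speech filtering is omitted, and the final scoring step is left out because the abstract does not define it. The reduction rule shown (drop a string that rarely occurs outside a longer candidate containing it) is one common reading of "poor independence" and may differ from the rule used in the thesis.

```python
from collections import Counter


def candidate_strings(tokens, stop_words, max_len=3):
    """Count every run of 1..max_len consecutive atomic words.

    The token sequence is split at stop words, so no candidate spans one.
    """
    segments, current = [], []
    for tok in tokens:
        if tok in stop_words:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)

    counts = Counter()
    for seg in segments:
        for start in range(len(seg)):
            for n in range(1, max_len + 1):
                if start + n <= len(seg):
                    counts[tuple(seg[start:start + n])] += 1
    return counts


def reduce_substrings(counts, min_independent=1):
    """Drop strings that are not independent of a longer candidate.

    A string is dropped when it occurs fewer than `min_independent` times
    outside some candidate that extends it by one word (hypothetical rule).
    """
    kept = dict(counts)
    for longer, f_long in counts.items():
        if len(longer) < 2:
            continue
        for sub in (longer[:-1], longer[1:]):
            if sub in kept and counts[sub] - f_long < min_independent:
                del kept[sub]
    return kept


# Toy input, already segmented into atomic words (assumed data).
tokens = ("the security council met and the security council voted "
          "on the council budget").split()
stop_words = {"the", "and", "on"}

counts = candidate_strings(tokens, stop_words)
kept = reduce_substrings(counts)

# "security" only ever appears inside "security council", so it is dropped;
# "council" also appears in "council budget", so it survives.
for words, freq in sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" ".join(words), freq)
```

In this sketch the surviving candidates would then be ranked by a score combining unithood and frequency distribution, as the abstract describes; that scoring function is deliberately not reproduced here.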
Keywords/Search Tags: Text Mining, Multi-language, Word Segmentation, Term Extraction