Information Extraction For Semi-structured Chinese Resume

Posted on: 2019-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: W T Yan
Full Text: PDF
GTID: 2428330566986149
Subject: Control theory and control engineering
Abstract/Summary:
A resume is a common kind of semi-structured text and an important vehicle for job seekers to present their basic information and experience to employers. With the wide adoption of Internet technology, enterprises receive large numbers of Chinese resumes as electronic documents when recruiting. Classifying or processing them requires a user to open each file manually, read through its structure, pick out the information of interest, and close it. This manual work is time-consuming, and its cost grows linearly with the number of target fields. It is therefore necessary to study how a computer can automatically extract the main field contents from a resume and unify the results quickly and accurately according to the needs of the enterprise.

Based on an analysis of the structural features of semi-structured Chinese resumes, this thesis introduces an approach to classifying resumes, a set of constraints derived from resume features, and an overall extraction strategy. To build the dictionary used for information extraction while avoiding the complexity and time cost of traditional methods, a keyword extraction algorithm based on lexical analysis is studied. First, an approach to text splicing is proposed; the merged resume text is then segmented, and indicators such as word entropy and affinity are calculated. Second, selection conditions and thresholds are set according to the characteristics of semi-structured resumes. Third, the dictionary is expanded using an algorithm based on string edit distance and a string similarity measure based on the N-gram model.

For the information extraction process, an extraction scheme for semi-structured Chinese resumes is formulated by considering both the characteristics of semi-structured resume text and general extraction methods, and a text segmentation algorithm based on title keyword matching and text format matching is studied. For content recognition and extraction, we integrate methods based on dictionary matching, rule matching, and statistical models, exploiting the strong regularity of resume content and the correlations between content items. For the basic information in a resume, detailed extraction rules are built. For complex items, three main features of the resume are presented; based on these features, content is identified and extracted by combining dictionary matching with a hidden Markov model based on text blocks, and the data sparsity problem in model training is resolved.

Building on this work, we implement a Chinese resume information extraction system in Java. The system has a user-friendly interface and lets users dynamically manage extraction dictionaries, extraction rules, and resume information. It supports automatic information extraction from Chinese resumes in Word, PDF, and HTML formats, and it can update its data in a timely manner according to the latest records on the network. Finally, we use the system to parse resume samples and analyze the extracted results statistically in terms of accuracy and recall; the results are satisfactory.
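
The dictionary expansion step mentioned in the abstract (string edit distance combined with N-gram string similarity) can be illustrated with a minimal sketch. This is not the thesis's implementation, which is stated to be in Java; Python is used here only for brevity. The function names, the use of the Dice coefficient over character bigrams, the rule that a candidate is accepted if either measure passes, and the thresholds `max_dist` and `min_sim` are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams; a string shorter than n yields itself."""
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-gram sets, in [0, 1] (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def expand_dictionary(seeds, candidates, max_dist=1, min_sim=0.5):
    """Add each candidate that is close to some seed by either measure.

    max_dist and min_sim are hypothetical thresholds, not values from the thesis.
    """
    expanded = set(seeds)
    for cand in candidates:
        for seed in seeds:
            if (edit_distance(cand, seed) <= max_dist
                    or ngram_similarity(cand, seed) >= min_sim):
                expanded.add(cand)
                break
    return expanded


# Seed section titles and candidate strings are made-up examples.
seeds = ["工作经历", "教育背景"]
candidates = ["工作经验", "兴趣爱好"]
expanded = expand_dictionary(seeds, candidates)
```

With these made-up inputs, "工作经验" differs from the seed "工作经历" by a single character and is added to the dictionary, while "兴趣爱好" shares no characters with either seed and is left out. In practice the thresholds would need tuning against real resume titles.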
Keywords/Search Tags: Semi-structured resume, text segmentation, rules, statistical model, regular matches