
Research And Application Of Talent Job Online Matching Based On Text Feature Extraction Technology

Posted on: 2018-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: C M Li
GTID: 2348330512989108
Subject: Computer software and theory
Abstract/Summary:
With the development of technology and the spread of the Internet, the working patterns of the recruitment industry have changed greatly. Recruitment information is migrating from newspapers and television to the Internet, where it spreads faster and at a much larger scale. The traditional recruitment industry urgently needs technical means to locate suitable job seekers quickly, thereby reducing the time and manpower costs incurred in the hiring process.

Web crawlers, business cooperation, platform collection, and other channels can gather large volumes of resume text and of job text published by employers, but how to build a fast and accurate matching bridge between the two is the key question of this paper. Matching on word hits alone gives low accuracy, because the matched text contains many irrelevant interfering words and potential matching objects may be ignored. For example, a position may use the word “technology” while a resume uses “software development”. Although the two terms are not identical, their meanings overlap to a certain extent and they can be matched. In summary, how to match resumes and position texts accurately and efficiently by computer is an important research topic for the recruiting industry.

This paper uses an improved TF-IDF algorithm for resume text feature extraction, together with optimized inverted indexing technology. A feasible solution to the problem is proposed and applied in an actual production environment. The main contents of this paper are as follows:

(1) Optimization of the feature extraction algorithm. After analyzing the advantages and disadvantages of common text feature extraction algorithms, an improved TF-IDF resume text feature extraction algorithm is proposed. The traditional TF-IDF algorithm takes into account the contribution of a feature item to the whole corpus and can effectively describe the role of the feature in the global text. However, it ignores the fact that some words are important for certain categories of documents rather than for the corpus as a whole. To strengthen the contribution of a feature word to the global text, this paper adds an entropy calculation to the TF-IDF algorithm. To fully consider the impact on a specific text category, information entropy is also included when calculating the importance of a feature word to particular categories. Finally, the improved TF-IDF algorithm is used for feature extraction on resume text content.

(2) Position text classification. The application scenario of this paper involves a large amount of data, rapid iteration of data updates, and frequent renewal of text content. By analyzing the pros and cons of each classification algorithm and considering this scenario, the paper proposes a classification framework of “position-text real-time classification based on inverted indexing technology”. The framework puts the data into an index library using the inverted indexing technique, which allows the data to be updated in real time and adds a sorting computation at the search level. Through the three steps of word segmentation, feature extraction, and search, a position text obtains a batch of labeled matching results, each with a corresponding matching value. The classification result is finally obtained by weighted aggregation of the labeled data. The framework can update the classification model data in real time and can also classify text in real time.

(3) Feature similarity calculation. Potential matching objects cannot be found by simply matching identical keywords. In this paper, a vectorization technique is used to map each feature word to a point in a multidimensional space. The cosine similarity between each pair of feature words is then calculated to obtain the similarity between features. During matching, the text feature words are expanded so that potential matching objects are taken into account.

(4) Building the matching system. Based on inverted indexing technology, the system matches on multiple fields with configurable weights, analyzes the syntactic rules of the text, extracts important text passages separately, and integrates the related algorithms, which together improve matching accuracy. At the end of this paper, the matching system is built in an actual production environment. To accommodate future growth in data scale, the database in the data storage layer is clustered. The inverted index of the matching system also runs in cluster mode, which both improves matching speed and enhances horizontal scalability.

To date, the system has been used in an actual production environment and has served many customers. According to the system records, the average pass rate of talent resume-job matching is 69.60%. According to customer feedback, the system meets the needs of online resume-job matching, speeds up recruitment, and reduces recruitment costs.
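
The feature similarity step in (3), combined with an inverted-index lookup, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the word vectors are hand-made toy values, and the names `VECTORS`, `expand`, `build_inverted_index`, `match`, and the 0.95 threshold are assumptions chosen for illustration. It shows how a position word such as “technology” can reach a resume that only mentions “software development” once the query is expanded with words of high cosine similarity.

```python
import math
from collections import defaultdict

# Toy word vectors (hypothetical values; a real system would learn them from a corpus).
VECTORS = {
    "technology": [0.9, 0.8, 0.1],
    "software": [0.8, 0.9, 0.2],
    "development": [0.7, 0.9, 0.3],
    "accounting": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(word, threshold=0.95):
    """Return the word plus every known word whose similarity meets the threshold."""
    if word not in VECTORS:
        return {word}
    base = VECTORS[word]
    return {w for w, vec in VECTORS.items() if cosine(base, vec) >= threshold}


def build_inverted_index(docs):
    """Map each whitespace-separated token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def match(query_words, index):
    """Collect documents hit by any query word or any of its expansions."""
    hits = set()
    for word in query_words:
        for term in expand(word):
            hits |= index.get(term, set())
    return hits


# Hypothetical resume snippets used only for this example.
RESUMES = {
    "r1": "software development engineer",
    "r2": "accounting clerk",
}
INDEX = build_inverted_index(RESUMES)
```

With these toy values, an exact lookup of “technology” in `INDEX` finds no resume, while `match(["technology"], INDEX)` reaches `r1` through the expanded terms. The threshold trades recall against noise: lowering it admits more loosely related words into the query.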
Keywords/Search Tags: inverted index, feature extraction, text classification, text vectorization