Study On Some Chinese Text Classification Technology And Applications In Knowledge Extraction

Posted on: 2011-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z K Shi
GTID: 2178360305455053
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, people can access large amounts of information far more easily than before. However, it is often hard to find the truly useful items within it. Information classification is a good way to solve this problem, but traditional classification is manual and its efficiency is very low. Information grows much faster than people can process it, so handling it in the traditional way will only become more difficult. Text classification has therefore become a hot research topic.

In the 863 project, we proposed "Digital Agricultural Knowledge Grid Research and Application". The system provides users with a question-and-answer mechanism. Users can ask questions either through the system's standard question templates or in their own words. When a user uses a standard template, the existing answer is retrieved from the database. When a user asks a free-form question, the system analyzes it against the database and selects the closest answer. If the database holds no matching answer, the system uses a search engine to find all relevant information, and saves the highest-scoring information to its database for future reference.

Because of the excessive amount of information on the network, the results contain irrelevant information. If they are passed to users without any processing, users have to filter them themselves, which makes it harder to obtain useful information. Text classification is therefore needed to delete irrelevant or weakly related information.

In general, what a search engine returns from the network is only a link or a piece of text. On the one hand, users still have to do a lot of reading and analysis to get what they need; on the other hand, storing it in the knowledge base takes a lot of space.
Therefore, information obtained from the search engine needs to be summarized into key points before it is presented to users or saved to the knowledge base.

To address the issues raised above, I have done the following work:
1. Studied the development of text classification technology at home and abroad. International text classification research started much earlier than ours and already has very mature technology, with public corpora serving as standards. Text classification research in China started later than in Western countries. Because of the differences between Chinese and English, Western text classification technology cannot be copied directly, and problems remain to be solved.
2. Introduced the technologies related to text classification. First, text preprocessing methods are introduced, including word segmentation, stop word processing and repeated word processing. Second, term weighting and feature selection methods are introduced, including Document Frequency, Information Gain, Expected Cross Entropy, χ² statistics and Odds Ratio, among other commonly used feature selection methods. Evaluation methods for text classification are given as well.
3. Described the features of the Vector Space Model and the computation of support vector machines: the principle of support vector machines and their two commonly used mathematical models, the linearly separable and the linearly inseparable support vector machine.
4. Improved the FCSVM algorithm. Typically, classification speed decreases as the number of support vectors increases, which hinders the practical use of support vector machines. To solve this problem, FCSVM is improved so that the number of support vectors is reduced with almost no loss of accuracy.
Experiments show that the method improves the classification speed of SVM with almost no loss of accuracy, and that it is suitable for small-scale text classification problems.
5. Compared and studied multi-class text classification. Because multi-class text collections contain a large amount of text data, the traditional support vector machine is not suitable for them. Research has been done on the 1-a-1 (one-against-one) and 1-a-r (one-against-rest) methods. Experiments show that the 1-a-1 support vector machine algorithm suits text classification problems of small size with a larger number of categories, while the 1-a-r algorithm suits problems of large size with fewer categories.
6. According to the requirements of the semantic search tool framework, designed and implemented the knowledge extraction subsystem of the agricultural knowledge consulting system. The subsystem classifies the text returned by the search engine and summarizes its key points to provide as answers to users. The information is also added to the knowledge base so that the "consultation system" can retrieve it when a similar question arises, which improves system efficiency.
7. Carried out both evaluation tests and focused tests on the optimal result-sorting module of the text classification subsystem. Experimental results show that the classification accuracy of the text categorization module is 75% or more, which means the result-sorting subsystem can provide the functions that actual users require.

Through text classification, unrelated information can be isolated effectively. Extracting information from the classified text increases the efficiency of acquiring knowledge and makes the search engine more effective. The best answers recognized by users are put into the knowledge base so that the "consultation system" can retrieve related information when a similar question arises, which improves system efficiency.
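
The 1-a-1 and 1-a-r decompositions compared in item 5 can be illustrated with a small sketch. It is not the thesis's implementation: the category names and document counts are hypothetical, and the two helper functions are introduced here only for illustration. It lists the binary SVM training tasks each scheme would create and how many documents each task would see.

```python
from itertools import combinations


def one_against_one_tasks(class_sizes):
    """1-a-1: one binary task per unordered pair of categories.

    Each task trains only on the documents of its two categories.
    """
    return [
        ((a, b), class_sizes[a] + class_sizes[b])
        for a, b in combinations(sorted(class_sizes), 2)
    ]


def one_against_rest_tasks(class_sizes):
    """1-a-r: one binary task per category, trained on the whole corpus."""
    total = sum(class_sizes.values())
    return [((c, "rest"), total) for c in sorted(class_sizes)]


# Hypothetical agricultural corpus: category -> number of documents.
class_sizes = {"crops": 300, "livestock": 200, "pests": 250, "soil": 150}

ovo = one_against_one_tasks(class_sizes)
ovr = one_against_rest_tasks(class_sizes)

print("1-a-1 tasks:", len(ovo))
for pair, n_docs in ovo:
    print("  ", pair, n_docs, "documents")

print("1-a-r tasks:", len(ovr))
for pair, n_docs in ovr:
    print("  ", pair, n_docs, "documents")
```

For k categories, 1-a-1 creates k(k-1)/2 binary tasks that each see only two categories, while 1-a-r creates k tasks that each see the whole corpus.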
Keywords/Search Tags: Text Classification, Support Vector Machines (SVM), Vector Space Model, Feature Selection, Knowledge Extraction