The Research Of Directional Information Analysis Based On Text Mining

Posted on: 2013-02-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Cheng
GTID: 1268330398480101
Subject: Information management and information systems
Abstract/Summary:
With the rapid development of the Internet, more and more information circulates online. Extracting the right information from this sea of information has become a key issue for researchers, experts and ordinary Internet users. Helping users find relevant information is important for their publicity work, decision-making, development and crisis management. Directional information analysis is an effective solution to this problem. It refers to mining the domain or topic information that individual or organizational users follow over the long term: Internet information is analyzed and tracked with respect to those long-term points of interest. This dissertation studies the information selection, topic classification and text clustering problems of directional information analysis, and proposes corresponding methods and models to promote the development of the field.

In this dissertation, the documents to be analyzed are obtained from document source information, which is gathered through search-engine keyword queries and crawler technology. Then, according to the characteristics of directional information analysis tasks, the technologies and related algorithms of each stage are studied in depth. Finally, a series of effective and applicable models and algorithms is put forward, and an efficient and practical task framework for directional information analysis is constructed. This dissertation focuses on the following issues:

1. Heuristic information extraction model based on the text returned by the search engine

The retrieval results returned by a search engine contain the title, an abstract and other information. Taking these returned results alone as the object of analysis is far from enough.
To obtain a comprehensive set of document analysis elements, this dissertation builds an XML structure for each document that contains the document body, click count, release time and citation count, and gives specific methods for obtaining each element of the XML structure. The research focuses on body text extraction. Based on the DOM tree structure and on surveys, and using the cues that punctuation and links provide in text analysis, a method for calculating the weight of layout labels is put forward. The center label of the body is determined using the summary returned by the search engine; the aggregation of the center label and the weights is then described, and the label with the largest weight is chosen as the text extraction label.

2. Topic clustering and classification framework with user participation

The difficulties of topic classification are described. In view of the characteristics of directional information mining tasks, the necessity and feasibility of user participation in topic classification are analyzed. The supervised nature of text classification is introduced. A complete topic clustering and classification framework for directional information mining tasks is put forward.

3. Text classification model based on uncertain probability logic

Building on a comprehensive study of text classification technology, the characteristics of text categorization are analyzed in detail and the reasons for classifier deviation are discussed in depth. A text classification model based on uncertain probability logic is put forward by introducing subjective logic theory, which is based on uncertain probability logic, and text classification evidence, which is the trust relationship between model features and categories.
By constructing a significant distribution event space and an average distribution event space, and calculating the distribution weights of features in these two concept spaces, a simple linear classifier is obtained. Comparison experiments on a general-purpose classification evaluation corpus show that the proposed model performs significantly better than NB, KNN, LLSF and NNet, and performs close to SVM while classifying significantly faster. The model adapts well to different corpora, and the classifier maintains high performance without feature selection.

4. K-Means clustering model based on text classification and user participation

No single clustering algorithm is generally applicable to the variety of structures revealed by different multidimensional data sets. Different applications can draw on different information sources and often have specific requirements for clustering quality, efficiency and so on. The appropriate clustering algorithm therefore needs to be selected according to the application, making full use of the relevant information.

This dissertation describes and compares partitioning, hierarchical, density-based, grid-based and other clustering algorithms. The K-Means clustering model has shown strong vitality in terms of algorithmic simplicity and efficiency, so its advantages, disadvantages and improvement methods are discussed in depth. This dissertation introduces text classification and user-provided supervision information, combines the advantages of automatic system supervision and manual supervision, and constructs a K-Means clustering model based on text classification and user participation.
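As a rough illustration of how classification output could supply the initial centroids of K-Means (and so fix K), the following Python sketch seeds the clustering with the mean vector of each labelled group. It is not the dissertation's model: the function names, the toy 2-D vectors and the choice of per-class means as seeds are assumptions made for illustration, and the user confirmation step between iterations is not modelled.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def seed_centroids(labelled):
    """One initial centroid per class, so K equals the number of classes."""
    return {label: mean(vecs) for label, vecs in labelled.items()}


def seeded_kmeans(points, centroids, max_iter=20):
    """Standard Lloyd iterations starting from the given centroids."""
    centroids = dict(centroids)
    assignment = {}
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = {
            i: min(centroids, key=lambda c: dist(p, centroids[c]))
            for i, p in enumerate(points)
        }
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for c in centroids:
            members = [points[i] for i, a in assignment.items() if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return assignment, centroids


# Hypothetical document vectors already labelled by a classifier.
labelled = {
    "sports": [[0.0, 0.0], [0.0, 1.0]],
    "finance": [[10.0, 10.0], [10.0, 11.0]],
}
# Hypothetical unlabelled document vectors to cluster.
points = [[0.5, 0.5], [1.0, 0.0], [9.0, 10.0], [11.0, 10.5]]
assignment, centroids = seeded_kmeans(points, seed_centroids(labelled))
```

Starting from class-derived seeds rather than random points is one common way to reduce the sensitivity of K-Means to initialization; whether the dissertation uses exactly this seeding is not stated in the abstract.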
The resulting K-Means model effectively overcomes problems of K-Means clustering such as the difficulty of determining the initial K value and the tendency to fall into a local optimum. Through several iterations under user supervision and confirmation, more satisfactory and controllable topic classification results are produced.

To let users quickly grasp the topic content and the heat of a document, this dissertation takes the text classification label as the parent label, uses the TF-IDF value as the basis for selecting subsidiary labels, and takes words with high TF-IDF values as candidate labels. By introducing the HowNet dictionary network, sub-labels with broader concepts and scope are obtained. The calculation method for document heat is defined with reference to a model of website dissemination influence intensity.

This dissertation carries out a preliminary inquiry into directional information analysis tasks. In close connection with the characteristics of these tasks, more effective and applicable models and algorithms are studied, which lays a foundation for further study.
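The idea of taking words with high TF-IDF values as candidate sub-labels can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the abstract does not say which TF-IDF variant is used, so the sketch assumes relative term frequency multiplied by ln(N/df); the function name `tfidf_candidates` and the toy documents are hypothetical, and the HowNet generalization step is not shown.

```python
import math
from collections import Counter


def tfidf_candidates(docs, doc_index, top_n=3):
    """Return the top_n (word, score) pairs of one non-empty tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    length = len(docs[doc_index])
    # Assumed variant: relative term frequency times ln(N / df).
    scores = {
        word: (count / length) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical tokenized documents within one topic class.
docs = [
    ["stock", "market", "stock", "price"],
    ["football", "match", "price"],
    ["market", "football", "news"],
]
candidates = tfidf_candidates(docs, 0, top_n=2)
```

In this toy corpus, "stock" ranks first for the first document because it occurs twice there and in no other document, while words shared with other documents score lower.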
Keywords/Search Tags: Directional Information Analysis, Information Extraction, Topic Classification, Text Clustering, Participation of Users