
Research On Algorithms For Text Feature Selection

Posted on: 2011-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: N Lin
Full Text: PDF
GTID: 2178330332956551
Subject: Computer software and theory

Abstract/Summary:
Data mining is an interdisciplinary field that fuses database technology, artificial intelligence, machine learning, and other research areas. It is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of practical data that may be incomplete, noisy, vague, or random. Text classification is an important research topic within data mining, and feature selection is a key technology and core issue in text classification.

In text classification, the feature space is high-dimensional and contains many irrelevant and redundant features, so many researchers have tried various methods to remove these features and obtain a near-optimal feature subset. Most widely used feature selection algorithms, however, are limited to removing irrelevant features and give relatively little consideration to redundant ones; as a result, although they can significantly reduce the dimensionality of the feature space, the classification results are not accurate enough.

This thesis systematically analyzes and summarizes several classical feature selection algorithms for text classification, and on that basis proposes new feature selection algorithms to address the problems identified.

First, it presents a text feature selection algorithm based on the idea of dynamic programming (DPFS). The algorithm considers feature relevance and redundancy jointly: combining dynamic programming with irrelevance and redundancy analysis of candidate feature subsets, it arrives at a near-optimal feature set. Experimental results show that, by storing the computed C-relevance and R-redundancy values within the dynamic-programming framework, the algorithm avoids a large amount of repeated computation and improves runtime performance.
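The abstract does not give the exact DPFS recurrence, but the caching idea it describes — computing each C-relevance and R-redundancy value once and reusing it — can be sketched as follows. This is an illustrative greedy selector in the mRMR style, using mutual information as a stand-in relevance/redundancy measure; the function names, toy data, and scoring rule are assumptions, not the thesis's actual algorithm.

```python
from collections import Counter
from functools import lru_cache
from math import log2

# Toy binary term-presence matrix: rows are documents, columns are features.
X = [
    (1, 1, 0, 1),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
    (0, 1, 1, 0),
    (1, 0, 0, 1),
    (0, 0, 1, 0),
]
y = (1, 1, 0, 0, 1, 0)  # class labels

def mutual_info(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[u] / n) * (pb[v] / n)))
               for (u, v), c in pab.items())

@lru_cache(maxsize=None)
def c_relevance(j):
    """C-relevance of feature j to the class; cached, so computed only once."""
    return mutual_info(tuple(row[j] for row in X), y)

@lru_cache(maxsize=None)
def _r(j, k):
    return mutual_info(tuple(row[j] for row in X),
                       tuple(row[k] for row in X))

def r_redundancy(j, k):
    """R-redundancy between features j and k (symmetric, stored once)."""
    return _r(min(j, k), max(j, k))

def select_features(k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    remaining, chosen = set(range(len(X[0]))), []
    while len(chosen) < k and remaining:
        def score(j):
            if not chosen:
                return c_relevance(j)
            return c_relevance(j) - sum(r_redundancy(j, s)
                                        for s in chosen) / len(chosen)
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because the cached values are reused across every subset evaluation, each relevance or redundancy score is computed at most once, which is the double-counting savings the experiments attribute to DPFS.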
Moreover, because DPFS analyzes both feature relevance and feature redundancy, it improves the accuracy of feature selection, which in turn greatly improves the accuracy of text classification.

Second, the thesis proposes an improved LAM feature selection algorithm (ILAMFS). This algorithm likewise considers feature relevance and redundancy jointly: on top of the irrelevance and redundancy analysis of the feature set, it performs a secondary redundancy analysis, yielding a feature set that is closer to optimal. In addition, the algorithm uses a linear computation scheme, which greatly improves its speed, and to address the difficulty of choosing a threshold it incorporates techniques such as the golden-section method and weighted averaging. Experimental results show that ILAMFS is effective at reducing data dimensionality, selecting the threshold, cutting the amount of computation in the dimensionality-reduction process, and improving the accuracy of feature selection.
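The abstract names the golden-section method as one way ILAMFS picks its threshold but gives no details. As a hedged illustration of that technique only, the sketch below runs a standard golden-section search over [0, 1] on a stand-in quadratic "error" function; in practice the objective would be something like a validation error measured at each candidate threshold, which is an assumption here.

```python
from math import sqrt

INV_PHI = (sqrt(5) - 1) / 2  # 1/phi, the golden ratio conjugate, about 0.618

def golden_section_min(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden-section search."""
    c = b - (b - a) * INV_PHI  # left interior probe
    d = a + (b - a) * INV_PHI  # right interior probe
    while (b - a) > tol:
        if f(c) < f(d):
            # Minimum lies in [a, d]: shrink from the right.
            b, d = d, c
            c = b - (b - a) * INV_PHI
        else:
            # Minimum lies in [c, b]: shrink from the left.
            a, c = c, d
            d = a + (b - a) * INV_PHI
    return (a + b) / 2

# Stand-in objective: a threshold-dependent error minimized at t = 0.35.
err = lambda t: (t - 0.35) ** 2
best_threshold = golden_section_min(err, 0.0, 1.0)
```

Each iteration shrinks the search interval by the constant factor 1/phi, so the threshold is located to the desired tolerance with only logarithmically many objective evaluations, rather than scanning a grid of candidate thresholds.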
Keywords/Search Tags: Feature Selection, Redundancy, Weighted Average, Dynamic Programming