
Research Of Dimensionality Reduction And Its Application On Data Mining Of Large-Scale Text

Posted on: 2009-05-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R G Yu
Full Text: PDF
GTID: 1118360272485422
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, people in the age of "information explosion" often have to analyze and process massive amounts of data, and the volume of this data is still growing at a geometric rate. In the real world, massive data is typically high-dimensional, sparse, and often redundant. Compressing massive data while preserving its intrinsic properties has therefore become an important research topic in artificial intelligence, machine learning, data mining, and other fields. Efficient dimension reduction algorithms are one way to process high-dimensional massive data and have practical application value. This dissertation focuses on the research and application of fast dimension reduction algorithms that are applicable to massive data.

The dissertation proposes two new dimension reduction algorithms. The first is Direct Random Projection (DRP), which is analyzed in terms of its expected distortion bound. The second is Anchor points based Isometric Embedding under least square error criterion (AIE).

DRP has a time complexity of O(dn). Its performance is investigated through expected distortion analysis. We prove 1) an expected distortion bound for DRP; and 2) that, under moderate conditions, a DRP with appropriate expected distortion can be found in O(1) random time. Furthermore, we propose a simple heuristic to facilitate finding an appropriate DRP. In experiments, DRP appears to be more stable than the other two random projection algorithms it was compared with. Using an incremental strategy, the total time cost of DRP is O(d log d) in flow data (streaming) mode.

AIE has a time complexity of O(n log n). Once the geodesic distances have been obtained, embedding the points takes linear time and can be fully parallelized. Compared with nonlinear dimension reduction algorithms such as Isomap and LLE, AIE has better time complexity.

Current mainstream search engines generate search results by analyzing statistical information such as the frequency of query terms in web pages and the ranking of web pages. In many situations, search engines cannot determine what kind of information users want. This dissertation describes a method for mining web content relevance from large amounts of clickthrough data in web logs. Based on this method, we present the framework of a Feedback Search Engine (FSE) and its associated algorithms. Using page-to-page relevance, FSE generates search results dynamically and provides its users with more accurate and personalized information.
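
For readers unfamiliar with the random projection idea behind the first algorithm, the following is a minimal sketch of a generic Gaussian random projection together with a simple distortion measurement. It is not the DRP algorithm of the dissertation, whose construction is not given in this abstract. The function names, the sizes d, k, n, and the choice of the maximum relative change in squared pairwise distance as the distortion measure are all assumptions made for illustration.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """Return a k x d matrix with i.i.d. N(0, 1/k) entries (generic Gaussian
    random projection; an illustrative stand-in, not the dissertation's DRP)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector x to k dimensions: y = R x."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def max_distortion(points, projected):
    """Largest relative change in squared pairwise distance after projection
    (an assumed distortion measure, chosen for illustration)."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            original = squared_distance(points[i], points[j])
            if original == 0.0:
                continue  # identical points carry no distance information
            reduced = squared_distance(projected[i], projected[j])
            worst = max(worst, abs(reduced / original - 1.0))
    return worst


# Hypothetical sizes: n points in d dimensions reduced to k dimensions.
rng = random.Random(0)
d, k, n = 500, 100, 10
points = [[rng.random() for _ in range(d)] for _ in range(n)]

R = random_projection_matrix(d, k, rng)
projected = [project(R, x) for x in points]
distortion = max_distortion(points, projected)
print(f"max relative distortion over {n} points: {distortion:.3f}")
```

In this sketch, projecting n points costs on the order of d·k·n multiplications, and the measured distortion changes with each random draw of the matrix. The second point is why an analysis of expected distortion, like the one the abstract describes for DRP, is of interest.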
Keywords/Search Tags: Dimensionality Reduction, Data Mining, Random Projection, Isometric Embedding, Feedback Search Engine, Clickthrough Data