Data Mining For Light-weighted Vertical Search Engine Based On Pervasive Computing

Posted on:2014-01-17

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Guan

Full Text:PDF

GTID:1228330392960337

Subject:Computer architecture

Abstract/Summary:

The particular characteristics of a pervasive computing environment pose challenges to traditional vertical search services because of the limitations of mobile devices and the specialized services they require. Among the challenges facing a vertical search engine in a pervasive computing environment, we focus only on text categorization and collaborative filtering. Our goals are dimension reduction that supports high-performance categorization, and recommendation that is both accurate and fast. The contributions of this thesis are:

1) A regularization of LSI space mapping for improved accuracy. Latent Semantic Indexing (LSI) based on a sparse Singular Value Decomposition (SVD) algorithm has been used extensively for dimension reduction in information retrieval applications. Different mapping models, with different efficiency and different interpretations, may confuse a user, especially in text classification. With the help of Jacobi theory, we show that the main issue for these models is that they view a vector in LSI space with different forms of singular value weight. In text categorization, a positive weight should be more suitable than an inverse weight in a mapping model. We demonstrate that an inverse weight in the traditional mapping model causes a much earlier decline in performance, and poor performance in high dimensions, because of the excessive impact of small but imprecise singular values. To fix these defects in the traditional models, we suggest a regularization of the diagonal matrix to mitigate its excessive impact. A positive weight together with this regularization makes the suggested mapping model robust. Experimental results on three benchmarks substantiate that, compared with the latest mapping model for queries, Singular Value Rescaling (SVR), the suggested mapping model delivers more robust performance in classification tasks, especially in high dimensions.
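The effect of the singular value weight described in contribution 1 can be illustrated with a minimal sketch. It assumes a truncated SVD has already been computed elsewhere; the tiny basis, the singular values, and the `power` parameter are hypothetical and chosen only for illustration. A power of -1 corresponds to the traditional inverse-weight folding-in, 0 to a plain projection, and 1 to a positive weight. The regularization proposed in the thesis is not specified in this abstract and is not reproduced here.

```python
def map_to_lsi(doc, basis, sigma, power):
    """Project a term vector onto the LSI basis, scaling coordinate i by sigma[i] ** power.

    basis: list of left singular vectors (each a list over terms), assumed orthonormal.
    sigma: singular values matching the basis vectors.
    power: hypothetical knob; -1 = inverse weight, 0 = no weight, 1 = positive weight.
    """
    coords = []
    for u, s in zip(basis, sigma):
        projection = sum(ui * di for ui, di in zip(u, doc))
        coords.append((s ** power) * projection)
    return coords


# Hand-built orthonormal basis and singular values (illustrative, not from the thesis).
BASIS = [[0.6, 0.8, 0.0],
         [0.0, 0.0, 1.0]]
SIGMA = [4.0, 0.5]

doc = [3.0, 4.0, 2.0]  # a term-frequency vector over three terms

for power, label in [(-1, "inverse weight"), (0, "no weight"), (1, "positive weight")]:
    print(label, map_to_lsi(doc, BASIS, SIGMA, power))
```

With the inverse weight, the coordinate tied to the small singular value (0.5) ends up larger than the one tied to the dominant singular value, which is the kind of excessive impact the abstract attributes to small singular values. With the positive weight, the dominant direction stays dominant.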
2) A fast dimension reduction for document classification based on Imprecise Spectrum Analysis. The computational overhead of SVD is known to be a bottleneck in dealing with large data sets, and faster dimension reduction with competitive accuracy is desired in such a setting. Imprecise Spectrum Analysis (ISA) is presented to carry out fast dimension reduction for document classification. ISA follows the one-sided Jacobi method for computing the SVD and simplifies its intensive orthogonality computation. It uses a representative matrix composed of the top-k column vectors derived from the original feature vector space, and reduces the dimension of a feature vector by computing its product with this representative matrix. An analysis is provided to show the approximation error and the rationale behind this dimension reduction method. To further improve classification accuracy, we also present a feature selection method for building the initial feature matrix and augment the representative matrix with centroid vectors. Our extensive experimental results show that ISA is fast in handling large term-document feature matrices while delivering better or competitive classification accuracy on the tested benchmarks compared with LSI with SVD.

3) A user-based collaborative filtering recommendation algorithm based on error feedback. An economical and accurate collaborative filtering (CF) algorithm is key to designing a fast recommender system for a vertical search engine. The FPCC algorithm we offer not only achieves good accuracy but also has economical space complexity and good scalability. FPCC arose from the user-based Pearson Correlation Coefficient (UPCC) algorithm; however, the three distinctive approaches we provide revamp UPCC into a high-accuracy CF algorithm based on our error-feedback mechanism.
In this feedback mechanism, based on personal but refined prediction-bias elements, we predict an active user's habitual bias with an item-based PCC algorithm to compensate for the prediction error of a user-based CF algorithm. In this way, the accuracy of a user-based CF algorithm can be boosted without losing the sparsity of the rating matrix, which makes FPCC easy to scale. The results are encouraging: FPCC is fast and delivers highly accurate recommendations.

4) A skew amplification for refined item-based collaborative filtering algorithms. Case Amplification can improve the accuracy of a CF algorithm with no extra space overhead by amplifying the effect of close candidates in the prediction. However, in a cold-start scenario, traditional Case Amplification on an item-based prediction can reduce accuracy. Given a small known set, Case Amplification can give a mediocre candidate an unsuitable amplification by amplifying the numerator and the denominator of the prediction formula equally. We propose a skew amplification mechanism to address the problem: we amplify the numerator and the denominator differently. This reduces the effect of a mediocre but close item on the prediction. The balance between the different amplifications is kept automatically by a controller whose behavior depends on the size of the given set. Evaluation was carried out on four benchmarks, and the results show that, in a cold-start scenario, skew amplification outperforms Case Amplification in boosting an item-based CF algorithm, especially when the given set becomes small.

5) A semi-dense algorithm based on multi-layer optimization for recommendation systems. We propose a new semi-dense algorithm based on multi-layer optimization to speed up the basic Pearson Correlation Coefficient in collaborative filtering.
The semi-dense algorithm avoids redundant accesses to, and tests on, a selected sparse vector in order to accelerate a batch of similarity comparisons within one thread. We propose a reduce-vector in the thread pool to limit the use of locks on critical resources in the parallel implementation. The thread pool is a wrapper around Pthreads on a multi-core node, which makes parallelizing the semi-dense algorithm easier. A shared zip file is read to cut down the number of messages sent through the Message Passing Interface (MPI) package. The proposed semi-dense algorithm with the multi-layer framework achieved a substantial speedup.

We applied these algorithms to vertical search engines, and the results are encouraging.
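For orientation, the baseline operations that contributions 3 to 5 build on can be sketched as follows: a Pearson correlation computed over the co-rated entries of two sparse rating vectors, and the conventional Case Amplification transform applied to a similarity weight. This is a minimal sketch of the standard techniques only, not of FPCC, skew amplification, or the semi-dense algorithm. The dictionary representation, the function names, and the exponent value 2.5 are assumptions made for illustration.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the keys shared by two sparse rating dicts.

    Means are taken over the co-rated entries only; this is one common
    variant and an assumption here. Returns 0.0 when undefined.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[k] for k in common) / len(common)
    mean_b = sum(ratings_b[k] for k in common) / len(common)
    cov = sum((ratings_a[k] - mean_a) * (ratings_b[k] - mean_b) for k in common)
    var_a = sum((ratings_a[k] - mean_a) ** 2 for k in common)
    var_b = sum((ratings_b[k] - mean_b) ** 2 for k in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def case_amplify(weight, rho=2.5):
    """Conventional Case Amplification: keep the sign, raise the magnitude to rho.

    For |weight| < 1 and rho > 1 this shrinks weak similarities more than
    strong ones. The default rho is an illustrative choice.
    """
    return weight * abs(weight) ** (rho - 1)


# Hypothetical sparse rating vectors keyed by item id.
alice = {"i1": 5, "i2": 3, "i3": 4}
bob = {"i1": 4, "i2": 2, "i3": 3, "i4": 5}
carol = {"i1": 1, "i2": 5}

print(pearson(alice, bob), pearson(alice, carol))
print(case_amplify(0.5), case_amplify(-1.0))
```

Because only the shared keys are visited, the cost of one comparison depends on the number of stored ratings rather than on the full item count, which is why sparsity of the rating matrix matters for scalability.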

Keywords/Search Tags:

Feature Extraction, Text Categorization, Vertical Search, Semi-sparse Algorithm, Imprecise Spectrum Analysis, Skew Amplification

Related items

1	The Sparse Representation Coding Model And Its Application In Text Categorization
2	Research Of The Automatic Chinese WEB Text Categorization In Search Engine
3	Design Of Vertical Search Engine For Academic Resources Of Computer Science
4	Design And Implementation Of Commodity-Oriented Vertical Search System
5	Research On High Performance Chinese Text Classification Based On Machine Learning
6	Research On Key Problems In Text Mining
7	Text Classification Technology And Applied Research
8	Research And Implementation Of Information Extraction And Categorization Model In Vertical Search
9	The Text Categorization Algorithm Based On Nearest Subspace Search
10	Semi-supervised Text Categorization Technology Research Based On The Semantic Analysis