The rapid development of Android malware poses many threats to the security of the Android platform and exposes mobile users to serious risks of fraud and cyber attacks. Android malware detection has therefore been a key research topic in mobile security in recent years. However, machine learning-based malware detection has a significant weakness: the training data may contain noisy labels, which considerably degrades the performance of the detection model. This impact becomes more severe as datasets continue to grow. Label noise is not unique to Android malware; it is a common problem in machine learning datasets (e.g., image datasets), and a large body of research addresses it. However, existing techniques are difficult to migrate to the Android malware domain because the complex composition of apps is fundamentally different from that of images. As a result, the problem of noisy labels in Android malware detection has not yet been effectively solved. To address this problem, this paper proposes a novel noise detection algorithm and, based on it, designs and implements a noise filtering system for Android malware detection. The main work of this paper is as follows.

(1) We propose a novel and effective noise detection method for Android malware detection. We first conduct a large-scale empirical study that reveals the unreliability of the malware labelling method commonly used in our research community. In response, we propose a noise detection algorithm based on confident learning, ensemble learning and app relationships. We migrate confident learning, an advanced noise estimation technique, to the Android malware domain. To mitigate the bias introduced by any single model, we incorporate ensemble learning to obtain more robust results. We further leverage the relationships between apps to improve precision.

(2) To evaluate the performance of this method, we conduct a series of experiments from multiple perspectives. The experimental results show that our method achieves excellent and stable performance in pinpointing noisy labels, with an accuracy of over 94% and an F1 score of over 91% at noise ratios ranging from 5% to 30%. Compared with state-of-the-art approaches, our method achieves much better results (8% to 218% improvement) in significantly less time (70 to 249 times faster). We further show that the performance of existing malware detectors improves after noise is removed by our method. These results demonstrate that our approach can quickly and effectively reduce noisy samples in Android datasets. In addition, to create a reliable and up-to-date dataset for the experiments, we design and implement an automated malware collection tool that collects over 4K real malware samples from Android-related security reports. This tool not only saves considerable labor and time in data collection, but is also reusable and facilitates future updates and maintenance of the malware dataset.

(3) We design and implement a complete noise filtering system for Android malware detection. The system is a generic framework that reduces the noise level of the training data for any machine learning-based Android malware detector.
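To make the core idea concrete, the following is a minimal sketch of how per-class confidence thresholds and a simple model ensemble can be combined to flag suspicious labels. It is not the implementation described in this paper: the function names (`average_probs`, `class_thresholds`, `find_suspect_labels`), the averaging ensemble, and the toy probabilities are all assumptions made for illustration, and the app-relationship step is omitted because the abstract does not specify it. The probabilities are assumed to be out-of-fold predictions, with class 0 meaning benign and class 1 meaning malware.

```python
from statistics import mean


def average_probs(per_model_probs):
    """Average class-probability vectors across models (a simple ensemble)."""
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    n_classes = len(per_model_probs[0][0])
    return [
        [sum(m[i][c] for m in per_model_probs) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


def class_thresholds(probs, labels, n_classes):
    """Threshold for class c: mean predicted probability of c over samples labelled c.

    Assumes every class has at least one labelled sample.
    """
    return [
        mean(p[c] for p, y in zip(probs, labels) if y == c)
        for c in range(n_classes)
    ]


def find_suspect_labels(probs, labels):
    """Return indices whose most confident above-threshold class differs from the given label."""
    n_classes = len(probs[0])
    thresholds = class_thresholds(probs, labels, n_classes)
    suspects = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes the ensemble is "confident" about for this sample.
        confident = [c for c in range(n_classes) if p[c] >= thresholds[c]]
        if confident:
            best = max(confident, key=lambda c: p[c])
            if best != y:
                suspects.append(i)
    return suspects


# Toy data (hypothetical): two models, six apps; app 2 is labelled benign
# although both models consider it malware.
labels = [0, 0, 0, 1, 1, 1]
model_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
model_b = [[0.8, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.2, 0.8], [0.2, 0.8]]

ensemble = average_probs([model_a, model_b])
print(find_suspect_labels(ensemble, labels))  # prints [2]
```

Samples for which no class clears its threshold (app 5 in the toy data) are left unflagged, which keeps the filter conservative; a real system would obtain the probabilities from cross-validated detectors rather than hand-written values.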