| Analyzing malicious behavior is one of the effective measures for protecting the security of user information. As the mainstream detection method, instruction sequence-based malicious behavior analysis collects instruction control flow information while a program is running, restores the program's real execution trajectory, and identifies its behavioral characteristics. However, two main problems remain in acquiring the instruction stream and restoring the execution trajectory. On the one hand, an instruction stream acquisition system with high performance overhead shows a large timing difference from a program running in a real user environment, which malicious programs can easily detect and use to evade analysis. On the other hand, the invalid data collected during recording makes the restoration process difficult. To address these problems, a semantic extraction and data preprocessing system based on the instruction sequence is designed. Using the features of the Intel PT hardware component, the system collects the instructions and control information generated by a running program efficiently and at low cost, and accurately restores the execution process. Firstly, based on virtualization technology, the running data of programs inside the virtual machine is collected. Secondly, based on the Intel PT filtering mechanisms, only data generated within the target range is collected: with CPL filtering, only instructions that are not executed at the root privilege level are collected; with CR3 filtering, the collection scope is limited to the instructions executed by the target process; with IP filtering, only execution information within the address range of the loaded target image is collected. In addition, a driver in the virtual machine monitors process creation and deletion and image loading, and passes the captured target process and image information to the host through hypercalls. Finally, after the collected compressed data packets are parsed, the control flow of the program execution is restored with a decompilation tool. The function and performance of the system are tested and compared with a pure virtual machine system and with the Intel PT system-wide recording method. The experimental results show that the performance overhead of the semantic extraction and data preprocessing system based on the instruction sequence is less than 10%, and that it can accurately restore the control flow executed by the program. |