Identifying Malware through Static and Dynamic Analysis

In this project, files were classified as malware or benignware based on their properties, structure, and features, using techniques learnt in the course CS698M taught by Prof. Sandeep Shukla.

The dataset consisted of files from both the malware and benignware classes. The static analysis data consisted of opcode, structure, and string files, whereas the dynamic analysis data consisted of JSON reports generated with Cuckoo. The task was to engineer features from this data and use them to classify the files.

In static analysis, the feature extraction phase used the contents of specific significant headers and the top 400 APIs, selected by applying TF-IDF to thousands of APIs. I then performed multiclass classification into 4 classes (benign, packed benign, malware, packed malware), which were later combined into 2 classes (benign, malware) for better binary classification.
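The API selection and label merging described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `top_apis_by_tfidf` and `collapse_labels`, the label strings, the smoothed IDF formula, and the choice to rank APIs by summed TF-IDF score are all assumptions made for the example.

```python
import math
from collections import Counter


def top_apis_by_tfidf(docs, k):
    """Return the k API names with the highest summed TF-IDF score.

    docs is a list of files, each given as a list of API names.
    Ranking by summed score is an assumption for this sketch.
    """
    n_docs = len(docs)
    # Document frequency: number of files that use each API at least once.
    df = Counter(api for doc in docs for api in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for api, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + df[api])) + 1  # smoothed IDF
            scores[api] += tf * idf
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores, key=lambda api: (-scores[api], api))
    return ranked[:k]


def collapse_labels(labels):
    """Merge the 4 classes into 2 by ignoring whether a file is packed."""
    mapping = {
        "benign": "benign",
        "packed_benign": "benign",
        "malware": "malware",
        "packed_malware": "malware",
    }
    return [mapping[label] for label in labels]
```

With real data, `docs` would hold the API names extracted from each file and `k` would be 400.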

In dynamic analysis, the feature extraction phase initially drew on the Network and Summary sections, classes of APIs, and total APIs, for a total of 296 features. I then pruned these by removing columns with missing values, one column from each correlated pair, columns with a single unique value, and zero-importance and low-importance features. This reduced the feature count to 35 and improved classification.
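The first three pruning steps can be sketched as below. This is an illustration under assumptions rather than the project's code: the function name `prune_features`, the thresholds `max_missing` and `max_corr`, the mean-filling of missing values, and the column names in the usage note are all hypothetical. The importance-based steps are left out because they depend on a trained model.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def prune_features(columns, max_missing=0.5, max_corr=0.95):
    """Drop sparse, constant, and redundant columns.

    columns maps a feature name to a list of numbers, with None for
    missing values. Both thresholds are assumed values for this sketch.
    """
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # Step 1: drop columns where too many values are missing.
        if len(values) - len(present) > max_missing * len(values):
            continue
        # Step 2: drop columns that hold a single unique value.
        if len(set(present)) <= 1:
            continue
        # Fill remaining gaps with the column mean so correlation is defined.
        mean = sum(present) / len(present)
        filled = [mean if v is None else v for v in values]
        # Step 3: drop a column that is highly correlated with one already kept.
        if any(abs(pearson(filled, other)) > max_corr for other in kept.values()):
            continue
        kept[name] = filled
    return kept
```

For example, given columns `dns_requests = [1, 2, 3, 4]` and `udp_packets = [2, 4, 6, 8]`, only the first is kept because the two are perfectly correlated.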

In both static and dynamic analysis, some noise data was added to improve classification. Several algorithms were tested, including SVM, an MLP network, random forest, and decision tree. Random forest performed best in both cases.
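A comparison like the one described above might look like the following sketch, which uses scikit-learn and cross-validated accuracy. Everything specific here is an assumption: the data is synthetic (generated with `make_classification`, not the project dataset), the hyperparameters are library defaults or arbitrary, and the helper name `compare_classifiers` is hypothetical. The scores it prints say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, folds=5):
    """Return the mean cross-validated accuracy for each candidate model."""
    models = {
        # SVM and MLP are sensitive to feature scale, so they get a scaler.
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "MLP": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=500, random_state=0)
        ),
        "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "Decision tree": DecisionTreeClassifier(random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=folds).mean()
        for name, model in models.items()
    }


# Synthetic stand-in for the 35 dynamic-analysis features (assumption).
X, y = make_classification(
    n_samples=300, n_features=35, n_informative=10, random_state=0
)
scores = compare_classifiers(X, y)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here so that each model is scored on data it was not trained on. The source does not state which evaluation protocol the project used.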

The code for the project can be found here