Model Report - Malware Detection

Executive Summary

This report presents the development and evaluation of machine learning models for static malware detection using the Brazilian Malware Dataset. The objective was to build classifiers capable of distinguishing between malware and goodware based on Portable Executable (PE) file header features.

Best Model	XGBoost
Cross-Validation AUC	0.9980
Test Set AUC	0.9978

Dataset Overview

Total Samples	50,181
Features	27
Memory Usage	167.52 MB
Train/Test Split	80/20 (Stratified)

Goodware (0)	21,116 (42.08%)
Malware (1)	29,065 (57.92%)
Imbalance Ratio	1.38
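
The class shares and the imbalance ratio in the table can be reproduced from the two class counts. The sketch below assumes the ratio is defined as majority count divided by minority count; the report does not state the definition, but this one matches the reported 1.38.

```python
# Class counts as reported in the Dataset Overview table.
counts = {"goodware": 21_116, "malware": 29_065}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Assumption: imbalance ratio = majority class count / minority class count.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```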

Class Distribution

Exploratory Data Analysis

Feature Distributions

Distribution of numeric features in the dataset (log-scaled where appropriate).

Correlation Analysis

Feature correlation matrix showing relationships between features and with the target variable.

Top Features Correlated with Target:

Characteristics: 0.5172

DllCharacteristics: 0.3631

ImageBase: 0.3252

NumberOfSections: 0.2930

TimeDateStamp: 0.2267
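
The report does not say which correlation coefficient was used. A minimal sketch, assuming Pearson correlation between a numeric feature and the 0/1 target (equivalent to the point-biserial coefficient), could look like this. The sample values are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a header feature and the malware label (1 = malware).
number_of_sections = [3, 4, 5, 6, 3, 8]
label = [0, 0, 1, 1, 0, 1]

r = pearson(number_of_sections, label)
print(f"correlation with target: {r:.4f}")
```

A ranking such as the one above would then sort features by this value (or by its absolute value, if negative correlations should count as well).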

Entropy Analysis

Entropy is a key indicator of packed or encrypted malware. Byte-level entropy ranges from 0 to 8 bits per byte, and values above 7 often indicate compressed or encrypted content.

Mean Entropy	6.6945
High Entropy Files (>7)	18,744
Goodware Mean	6.4681
Malware Mean	6.8590
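
For reference, byte-level Shannon entropy can be computed as sketched below. How the dataset's entropy feature was actually derived is not described in this report, so the function and the threshold constant are illustrative assumptions only.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

HIGH_ENTROPY_THRESHOLD = 7.0  # the ">7" cut-off quoted in the report

uniform = bytes(range(256)) * 4  # every byte value equally frequent
constant = b"\x00" * 1024        # one byte value repeated

for name, blob in [("uniform", uniform), ("constant", constant)]:
    h = shannon_entropy(blob)
    print(name, round(h, 4), "high" if h > HIGH_ENTROPY_THRESHOLD else "low")
```

Uniformly distributed bytes reach the maximum of 8 bits per byte, which is why packed or encrypted sections tend to land above the threshold.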

Model Training & Cross-Validation

Experimental Setup

Data Split: 80% training, 20% hold-out test set (stratified)
Cross-Validation: 10-fold stratified CV for model selection
Primary Metric: AUC (Area Under ROC Curve)
Secondary Metric: Accuracy
Preprocessing: Median imputation, Standard scaling
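
The report does not include its training code. The following dependency-free sketch illustrates two ideas from the setup above: assigning samples to stratified folds, and fitting median imputation plus standard scaling on training data only before applying it to validation data. All function names and toy values are hypothetical.

```python
import random
import statistics
from collections import defaultdict

def stratified_folds(labels, n_splits=10, seed=0):
    """Assign each sample index to a fold so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_splits
    return fold_of

def fit_preprocessor(column):
    """Learn the median (imputation) and mean/std (scaling) from training values."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in column]
    mean = statistics.fmean(imputed)
    std = statistics.pstdev(imputed) or 1.0  # avoid dividing by zero on constant columns
    return median, mean, std

def transform(column, median, mean, std):
    """Apply previously learned imputation and scaling parameters."""
    return [((median if v is None else v) - mean) / std for v in column]

# Toy data: 40 goodware and 60 malware labels, one feature column with a gap.
labels = [0] * 40 + [1] * 60
fold_of = stratified_folds(labels, n_splits=10)

train_column = [1.0, None, 3.0, 5.0]
params = fit_preprocessor(train_column)
scaled_train = transform(train_column, *params)
scaled_val = transform([None, 7.0], *params)  # reuses training statistics
```

Fitting the imputer and scaler inside each training fold, rather than on the full dataset, keeps validation data from leaking into the preprocessing statistics.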

Cross-Validation Results (Sorted by AUC)

Rank	Model	AUC (Mean)	AUC (Std)	Accuracy (Mean)	Accuracy (Std)	F1 (Mean)
1	XGBoost	0.9980	0.0005	0.9868	0.0022	0.9886
2	LightGBM	0.9976	0.0004	0.9838	0.0016	0.9860
3	Random Forest	0.9976	0.0007	0.9881	0.0018	0.9897
4	CatBoost	0.9972	0.0005	0.9838	0.0018	0.9860
5	PyTorch MLP	0.9827	0.0019	0.9424	0.0045	0.9504
6	Decision Tree	0.9803	0.0014	0.9810	0.0014	0.9836
7	Logistic Regression	0.8775	0.0054	0.8195	0.0063	0.8315

The best model by mean AUC (XGBoost) is listed first. CV results report the mean and standard deviation across the 10 folds.

Feature Importance

Most predictive features from the best-performing model (XGBoost).

Top 5 Most Important Features:

num__DllCharacteristics: 0.5247
num__Characteristics: 0.1640
num__BaseOfCode: 0.0828
num__NumberOfSections: 0.0295
num__NumberOfSymbols: 0.0270

Final Test Evaluation

Performance Metrics on Hold-Out Test Set

ROC AUC	0.9978
Accuracy	0.9881
F1 Score	0.9898
Precision	0.9888
Recall	0.9907
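
As a consistency check, the F1 score is the harmonic mean of precision and recall, so the reported values can be compared directly. The second part of the sketch shows how all four metrics follow from confusion-matrix counts; those counts are hypothetical, since the report's confusion matrix values are not reproduced here.

```python
# Reported hold-out metrics.
precision = 0.9888
recall = 0.9907

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from reported precision/recall: {f1:.4f}")

def metrics_from_counts(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts for illustration only (malware is the positive class).
example = metrics_from_counts(tp=90, fp=10, fn=5, tn=95)
```

The value recomputed from the rounded precision and recall agrees with the reported F1 of 0.9898 to within rounding.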

Confusion Matrix

ROC Curve
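
The area under the ROC curve equals the probability that a randomly chosen malware sample receives a higher score than a randomly chosen goodware sample. A minimal sketch of that definition follows, with made-up scores; it is not the evaluation code used for this report.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = malware) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.2f}")
```

This pairwise form is quadratic in the number of samples; for a test set of roughly 10,000 files a rank-based implementation would be the more practical choice.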

Conclusions

Model Performance: The best model achieved an AUC of 0.9978 on the hold-out test set, demonstrating strong capability in distinguishing malware from goodware.
Feature Importance: PE header flags and executable metadata dominate the best model's importance ranking, led by DllCharacteristics (0.5247), Characteristics (0.1640), and BaseOfCode (0.0828).
Deployment: The model is packaged into a production-ready Flask web application with a CI/CD pipeline.
Recommendation: This system should be used as part of a comprehensive security strategy, not as the sole detection mechanism.

