Malware Detection using Collective Classification
Last Updated: December 15, 2025
This is a project focused on detecting malicious code regions within binary files using a graph-based collective classification approach, aiming to provide a more granular insight than traditional file-level malware detection.
Code: https://github.com/minlu21/collective-classification-malware-detection.git
Project Overview and Motivation
Traditional malware detectors often perform only file-level binary classification to determine if an entire binary is malicious or not. However, techniques like obfuscation can hide local malicious semantics, making file-level detection less effective.
To address this challenge, the primary goal of this project was to detect malicious code regions at a fine-grained level using control-flow graphs or program-dependence graphs. To do so, I utilize graph-based analysis through collective classification instead of relying on Graph Neural Networks (GNNs).
Why not GNNs? Collective classification offers several advantages over GNNs for this task, including:
- Simplicity and Interpretability.
- Low Computational Cost/Scalability.
- Explicit Handling of Edge Trust/Weight, resulting in more interpretable results.
The solution of this project is implemented in a two pass approach:
- We identify initial malicious regions using some algorithm. The starting malicious regions in this case are all basic blocks, as basic blocks are the fundamental units of our control-flow graph/program-dependence graph analysis.
- Using the initial identified basic blocks, we run a reputation propagation algorithm on the control-flow graph/program-dependence graph, enabling us to highlight further potentially relevant malicious regions.
Technologies
- ML Related: scikit-learn, PyTorch, SentenceTransformers, Huggingface
- General Binary/Graph Analysis: IDAPro, networkx
- Data Analysis: Matplotlib, seaborn
Implementation Overview
-
Dataset Construction and Preprocessing
- Dataset: 66 malicious PE binaries (from MalwareZoo) and 30 benign PE binaries (from Scoop) were used.
- Graph Creation: Headless IDAPro was used to construct Control Flow Graphs (CFGs), which were then converted to Program Dependence Graphs (PDGs).
- Code Embedding: A BERT-Base model was finetuned for one epoch using the SimCSE loss (Multiple Negatives Ranking Loss) on the assembly code contained in the basic blocks. This unsupervised process ensures that similar code segments have nearby embeddings.
-
Identifying Initial Suspicious Regions To identify the initial suspicious basic blocks, unsupervised anomaly detection was used. This is because we do not have a labelled training dataset (where each basic block were identified as malicious or not).
- Method 1 (Isolation Forest, aka IsoForest): This method leverages the fact that outliers (anomalies) are few and structurally different from normal instances. Anomalies are detected based on a short average path length from the root of a specialized random decision tree (iTree) to the terminal node. This is because normal data, being clustered, requires deeper cuts and thus has a long average path, while the opposite is true for anomalous data. The model was fitted using code samples from both benign and malicious binaries and ultimately basic blocks from both benign and malicious binaries may potentially be identified as anomalous (hence potentially malicious). However, ablation studies conducted at the end do show that basic blocks from malicious binaries are more frequently identified as anomalous.
- Method 2 (Sparse Autoencoders, aka SAEs): SAEs are effective for unsupervised anomaly detection using reconstruction error.The finetuned BERT-Base model mentioned in part 1 served as the encoder. Normal data can be efficiently encoded and accurately reconstructed, resulting in a low reconstruction error, while anomalous data cannot be represented well by the learned sparse features, leading to a high reconstruction error.The anomaly threshold is set as the x-th (where x = {90, 95, 97.5, 99}) percentile of the reconstruction error on normal data.
-
Guilt Propagation This step propagates the initial suspicion scores across the graph structure (CFG/PDG) using one of two reputation propagation algorithms:
- Method 1 (Positive Reputation Propagation): Basic blocks initially identified as anomalous (by IsoForest/SAE) are assigned a prior reputation of 1; otherwise, the prior is 0. The reputation is then shared with successors based on the outgoing edge weights (set to 1). The formula used is:
- Method 2 (Signed Reputation Propagation): The update rule is the same as the positive method, but the prior initialization is different.
- Two sets of anomalous blocks are used: Anomaly1 (higher percentile for anomalies, e.g., 97.5th) and Anomaly2 (lower percentile, e.g., 95th).
- This method uses a signed prior: blocks in the highest percentile (e.g., above 97.5th) are assigned a prior of 1.0, those below the lower percentile (e.g., 95th) are assigned -1.0, and the rest are 0. This explicitly handles both "guilt" and "innocence" propagation.
Evaluation Methodology
Three ablation studies were performed, with their key findings summarized below.
-
Isolation Forest vs. Sparse Autoencoders (SAE)
- Isolation Forest was found to be more discerning when identifying anomalous basic blocks compared to SAEs.
- Despite the difference in the number of identified suspicious blocks, both methods showed a similar pattern in the binaries with the highest peaks.
- The choice of initialization method (IsoForest vs. SAE) did not drastically affect the final pattern after reputation propagation, but it did impact the total number of malicious regions identified.
-
Positive Reputation vs. Signed Reputation Propagation
- The Positive Reputation Propagation algorithm could lead to a very large number of suspicious regions being identified when the initial detection (Step 1) found a high number of suspicious blocks.
- The results strongly suggest that the negative reputations introduced in the Signed Reputation Propagation algorithm have a large impact during propagation, helping to constrain the final number of identified suspicious regions in the malicious binary dataset.
-
CFG vs. PDG
- Using Control Flow Graphs (CFGs) resulted in a higher final number of suspicious regions being identified compared to Program Dependence Graphs (PDGs).
- This counter-intuitive result is likely because CFGs have fewer edges, which causes the reputation propagation algorithms to converge much slower
Conclusion
The project successfully demonstrated a graph-based collective classification approach for fine-grained, node-level malicious code region detection. Key takeaways include:
- Initialization: Isolation Forest proved more discerning than SAE for initial anomaly detection.
- Propagation: Signed Reputation Propagation is potentially superior, as it enables the propagation of both negative and positive reputations, which results in a more constraind set of positive (i.e., potentially malicious) basic blocks being identified. However, further manual analysis needs to be conducted to verify that this is indeed beneificial to solving the problem.
- Graph Type: PDGs, with their denser connectivity, lead to faster convergence and a lower, potentially more accurate, final count of suspicious regions compared to CFGs.