Prudent Practices for Designing Malware Experiments: Status Quo and Outlook


Let’s begin by defining some terminology. Malware is malicious software that deliberately harms people, systems, and networks. Related samples can be grouped into malware families based on their behaviour. Malware-execution-driven experiments, also known as dynamic analysis, analyse malware by executing samples in a controlled environment and observing their behaviour.

Rossow et al. propose a set of guidelines for malware-execution-driven experiments covering both the construction of datasets and the design of the experiments themselves. They do so because they identified a number of pitfalls in the research community that make it difficult to replicate experiments and generalise from results. For example, the presence of environment artefacts (such as username strings and IP addresses) within a malware dataset can degrade performance.
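To make the artefact problem concrete, here is a minimal sketch of checking a behaviour report for tell-tale analysis-environment strings. The artefact values (the username and IP range) are purely illustrative assumptions, not taken from the paper:

```python
import re

# Hypothetical sandbox artefacts that could leak into a dataset; the
# specific values below are illustrative placeholders.
ARTEFACT_PATTERNS = [
    re.compile(r"\bsandbox-user\b"),        # analysis-machine username
    re.compile(r"\b192\.0\.2\.\d{1,3}\b"),  # analysis network range (TEST-NET-1)
]

def find_artefacts(report_text):
    """Return every known environment artefact found in a behaviour report."""
    hits = []
    for pattern in ARTEFACT_PATTERNS:
        hits.extend(pattern.findall(report_text))
    return hits

report = "process started by sandbox-user, connected to 192.0.2.15"
print(find_artefacts(report))  # ['sandbox-user', '192.0.2.15']
```

A classifier trained on reports containing such strings may simply learn to recognise the analysis environment rather than malicious behaviour.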

These guidelines fall under four categories: correctness, transparency, realism, and safety.

Correctness is concerned with constructing malware datasets that limit the introduction of bias. This is achieved by ensuring relevant sampling, separation, and membership within the dataset, while also placing controls in the analysis environment to protect data sensors and mitigate the influence of system artefacts on observed malware behaviour.
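One separation concern is training and testing on near-identical samples from the same family. The sketch below shows one way to avoid that, splitting by family rather than by sample; the dataset and family names are made up for illustration:

```python
import random

# Toy dataset: sample id -> malware family label (illustrative values).
samples = {
    "s1": "zeus", "s2": "zeus", "s3": "conficker",
    "s4": "conficker", "s5": "sality", "s6": "sality",
}

def family_split(samples, test_fraction=0.5, seed=0):
    """Split samples so that no family appears in both sets, avoiding
    the bias of evaluating on near-duplicates of the training data."""
    families = sorted(set(samples.values()))
    random.Random(seed).shuffle(families)
    n_test = max(1, int(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [s for s, f in samples.items() if f not in test_families]
    test = [s for s, f in samples.items() if f in test_families]
    return train, test

train, test = family_split(samples)
# The two sets never share a family:
assert not {samples[s] for s in train} & {samples[s] for s in test}
```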

Transparency aims to provide more description and detail to the malware analysis process by encouraging authors to identify malware samples used, provide system/network configuration information, and provide interpretation and reasoning for observed results.
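In practice, samples are usually identified by a cryptographic hash, so that other researchers can verify they are working with exactly the same binaries. A minimal sketch, with placeholder file contents:

```python
import hashlib

def sample_id(data: bytes) -> str:
    """SHA-256 digest used as a stable, shareable sample identifier."""
    return hashlib.sha256(data).hexdigest()

# Placeholder bytes standing in for a real binary.
print(sample_id(b"example-binary-contents"))
```

Listing such hashes (e.g. in an appendix) lets others reproduce the dataset without the paper having to distribute the malware itself.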

Realism seeks to nudge authors towards designing experiments that reflect how malware behaves ‘in the wild’, allowing findings to be generalised.

Finally, safety promotes harm mitigation by highlighting the need to both implement and discuss containment policies within a malware experiment.
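To illustrate one containment decision, here is a minimal sketch of an allowlist filter for outbound connections from the sandbox. The endpoints are illustrative assumptions, not policies from the paper:

```python
# Only let an executing sample reach pre-approved endpoints,
# and log everything else. Endpoint values are placeholders.
ALLOWED_ENDPOINTS = {("198.51.100.7", 80)}  # e.g. a local sinkhole

def contain(dst_ip, dst_port, log):
    """Decide whether an outbound connection may leave the sandbox."""
    if (dst_ip, dst_port) in ALLOWED_ENDPOINTS:
        return True
    log.append(f"blocked {dst_ip}:{dst_port}")
    return False

log = []
contain("198.51.100.7", 80, log)  # permitted
contain("203.0.113.9", 25, log)   # e.g. an outbound spam attempt, blocked
print(log)  # ['blocked 203.0.113.9:25']
```

A real policy would be enforced at the network layer and documented in the paper, as the safety guidelines recommend.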

To assess the applicability of these guidelines, the authors survey 36 papers (40% from top-tier venues), interpreting the results in three stages.

They first performed a per-guideline analysis, investigating the extent to which each guideline was met, and found numerous violations across all four categories. For example, in the safety category, most papers did not deploy or adequately describe a containment policy.

A per-paper analysis discussed how many papers could have been significantly improved by applying these guidelines, observing a correlation between the number of violated criteria and the number of applicable criteria. This suggests that the guidelines become increasingly important as experiments grow more complex.

Finally, a top-venue analysis compared papers appearing in top-tier venues with those appearing elsewhere. Papers in top-tier venues tended to include real-world scenarios (though potentially based on biased datasets) and to interpret false positives. However, violations remained broadly comparable across the two groups, suggesting that papers throughout the community could benefit equally from the guidelines.

In conclusion, this paper identifies a number of pitfalls within the malware-execution research community that undermine the scientific method. The situation could be improved by putting more effort into the presentation (transparency) of research methodology and the interpretation of results. The correctness and realism guidelines are harder to enforce, as it is not always obvious which practices lead to incorrect datasets or unrealistic scenarios. That said, the guidelines presented here help to establish a common set of criteria for prudent future experimentation with malware datasets.