Abstract
Research in malware analysis faces significant challenges, especially with respect to datasets. Outdated malware samples and inaccurate tagging limit the utility of many malware datasets for research. This paper presents a comprehensive examination of malware datasets, aiming to improve our understanding of cyber threats and strengthen cybersecurity strategies. We identified 27 datasets that satisfied our criteria and selected three of them for further enumeration using VirusTotal's API for malware analysis. The method presented here systematically evaluates and categorizes these datasets according to factors such as the availability of raw samples, temporal relevance, and sample quantity. Examining the datasets quantitatively exposes nuanced biases associated with temporal factors, file types (e.g., .exe, .elf), hardware architectures (e.g., ARM, x86, x64), and distributions across malware categories (e.g., trojans, droppers, spam). These insights are crucial for researchers and cybersecurity professionals who intend to employ machine learning models, which may be susceptible to such biases.