On the Effectiveness of Automatic Code Generation for Synthetic Dataset Creation
Journal article   Open access   Peer reviewed

Josh Mitchell, Varghese Mathew Vaidyan and Yong Wang
IEEE Access, Vol.13, pp.142990-142997
2025

Abstract

Keywords: Assembly; Benchmark testing; Codes; Complexity theory; Filtering algorithms; Intermediate representation; Low-level virtual machine; Machine code; Synthetic data; Computer Security; Data Mining; Machine Learning; Optimization; Semantics
This paper compares synthetic and real-world code datasets for machine learning applications in cybersecurity by examining the relationships between machine code and Low-Level Virtual Machine Intermediate Representation (LLVM IR). This study analyzes 1000 randomly generated programs from a compiler fuzzer against 1000 randomly selected samples from AnghaBench to evaluate suitability for security analysis tasks. Statistical analysis revealed that the code generated with fuzzers consistently produces more complex instruction patterns and achieves broader coverage of the available instruction sets when compared to real-world samples, with statistically significant differences across all measured categories (p < 0.001). The research examines instruction distributions, coverage metrics, program complexity, and statistical properties to characterize synthetic and real-world code differences. Our findings have important implications for vulnerability detection and malware analysis systems, and the research shows that synthetic data generation can effectively complement or potentially surpass real-world samples. These insights help security researchers and practitioners select training datasets for machine learning applications in cybersecurity.
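
The abstract does not describe how instruction coverage was computed, so the following is only a minimal sketch of one way to compare two corpora. It assumes coverage means the fraction of a reference opcode set that appears at least once in a corpus. The opcode set, the simplified line parser, and the two toy corpora are all hypothetical illustrations, not the authors' tooling or data.

```python
from collections import Counter

# Hypothetical reference set of opcodes. A real study would use the full
# LLVM IR (or target machine) instruction list instead of this toy subset.
REFERENCE_OPCODES = {
    "add", "sub", "mul", "load", "store",
    "br", "icmp", "call", "ret", "xor",
}


def opcode_of(line):
    """Return the opcode of one simplified LLVM-IR-like line, or None."""
    text = line.split(";", 1)[0].strip()  # drop a trailing comment
    if not text:
        return None
    if "=" in text:  # "%x = add i32 1, 2" -> "add i32 1, 2"
        text = text.split("=", 1)[1].strip()
    tokens = text.split()
    return tokens[0] if tokens else None


def instruction_counts(programs):
    """Count occurrences of each reference opcode across a corpus."""
    counts = Counter()
    for program in programs:
        for line in program.splitlines():
            op = opcode_of(line)
            if op in REFERENCE_OPCODES:
                counts[op] += 1
    return counts


def coverage(counts):
    """Fraction of the reference opcode set seen at least once."""
    return len(set(counts) & REFERENCE_OPCODES) / len(REFERENCE_OPCODES)


# Toy corpora (made up for illustration; not taken from the paper).
synthetic = [
    "%a = add i32 1, 2\n"
    "%b = xor i32 %a, 3\n"
    "%c = icmp eq i32 %b, 0\n"
    "br i1 %c, label %x, label %y\n"
    "ret i32 %b"
]
real_world = [
    "%a = load i32, i32* %p\n"
    "%b = add i32 %a, 1\n"
    "store i32 %b, i32* %p\n"
    "ret void"
]

synthetic_cov = coverage(instruction_counts(synthetic))
real_cov = coverage(instruction_counts(real_world))
print("synthetic coverage:", synthetic_cov)
print("real-world coverage:", real_cov)
```

A per-program version of the same counts could feed a two-sample significance test. The abstract reports p < 0.001 but does not name the test used.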
URL: https://doi.org/10.1109/ACCESS.2025.3587104
Published (Version of record) Open
