On the Effectiveness of Automatic Code Generation for Synthetic Dataset Creation

Josh Mitchell; Varghese Mathew Vaidyan; Yong Wang

doi:10.1109/ACCESS.2025.3587104

Journal article

Open access

Peer reviewed

IEEE Access, Vol. 13, pp. 142990-142997

2025

DOI: https://doi.org/10.1109/ACCESS.2025.3587104

Appears in Artificial Intelligence and Machine Learning Research

Keywords: Assembly; Benchmark testing; Codes; Complexity theory; Filtering algorithms; Intermediate representation; low-level virtual machine; machine code; Synthetic data; Computer Security; Data Mining; Machine Learning; Optimization; Semantics

Abstract

This paper compares synthetic and real-world code datasets for machine learning applications in cybersecurity by examining the relationships between machine code and Low-Level Virtual Machine Intermediate Representation (LLVM IR). This study analyzes 1000 randomly generated programs from a compiler fuzzer against 1000 randomly selected samples from AnghaBench to evaluate suitability for security analysis tasks. Statistical analysis revealed that fuzzer-generated code consistently produces more complex instruction patterns and achieves broader coverage of the available instruction sets than real-world samples, with statistically significant differences across all measured categories (p < 0.001). The research examines instruction distributions, coverage metrics, program complexity, and statistical properties to characterize synthetic and real-world code differences. Our findings have important implications for vulnerability detection and malware analysis systems, and the research shows that synthetic data generation can effectively complement or potentially surpass real-world samples. These insights help security researchers and practitioners select training datasets for machine learning applications in cybersecurity.
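
The kind of comparison the abstract describes (instruction distributions, coverage of the instruction set, and a significance test between two corpora) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' pipeline: `KNOWN_OPCODES` is a small hypothetical subset of LLVM IR opcodes, the regex handles only simple textual IR lines, and the permutation test is a stand-in, since the abstract does not name the statistical test that was used.

```python
import random
import re
from collections import Counter

# Hypothetical reference set: a small subset of LLVM IR opcodes,
# chosen only for illustration (the real instruction set is larger).
KNOWN_OPCODES = frozenset({
    "add", "sub", "mul", "load", "store", "br", "ret",
    "call", "icmp", "alloca", "getelementptr", "phi", "select",
})

# Optional "%name = " prefix, then the first lowercase word on the line.
_OPCODE_RE = re.compile(r"^\s*(?:%[\w.]+\s*=\s*)?([a-z_]+)\b")


def opcode_histogram(ir_text):
    """Count occurrences of known opcodes in textual LLVM IR.

    Simplification: lines whose first word is a modifier (for example
    "tail call") or anything outside KNOWN_OPCODES are skipped.
    """
    counts = Counter()
    for line in ir_text.splitlines():
        match = _OPCODE_RE.match(line)
        if match and match.group(1) in KNOWN_OPCODES:
            counts[match.group(1)] += 1
    return counts


def coverage(counts):
    """Fraction of the reference opcode set seen at least once."""
    return len(counts) / len(KNOWN_OPCODES)


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumed stand-in for the paper's unnamed significance test.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)
```

Under these assumptions, one would compute `coverage(opcode_histogram(ir))` for each program in the synthetic and real-world corpora and pass the two lists of scores to `permutation_p_value`; a small returned value suggests the groups differ in coverage.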

Published (Version of record), open: https://doi.org/10.1109/ACCESS.2025.3587104

Details

Title: On the Effectiveness of Automatic Code Generation for Synthetic Dataset Creation
Creators: Josh Mitchell (Dakota State University); Varghese Mathew Vaidyan (Dakota State University); Yong Wang (Dakota State University)
Publication Details: IEEE Access, Vol. 13, pp. 142990-142997
Publisher: IEEE
Number of pages: 8
Identifiers: 996868491201851
Academic Unit: Computer Science
Language: English
Resource Type: Journal article