Abstract
In a constantly evolving cyber threat landscape, cybersecurity and threat intelligence analysts are tasked with producing reliable cybersecurity guidance and publications with speed and accuracy. This project investigates the quality of texts generated by two mainstream large language models (LLMs) and the process for evaluating these texts for integration into viable and timely cybersecurity guidance for dissemination and publication. The two models selected for study are ChatGPT4o and Gemini 1.5. These models are prompted to write security guidance for 20 common vulnerabilities and exposures known to have been exploited in the wild. These known exploited vulnerabilities are identified and described in the Cybersecurity and Infrastructure Security Agency’s Known Exploited Vulnerabilities Catalog, which provides vendor documentation for ground-truth analysis. The ground-truth analysis of the generated texts includes the calculation of ROUGE scores against this vendor-published cybersecurity guidance. To explore LLM-as-judge strategies, these ROUGE scores are calculated by ChatGPT5 functioning as judge. The questions driving this research are: (1) Which of the two chosen models appears to be the most viable for composing cybersecurity guidance and publications? (2) Is there a way to implement an automated evaluation tool in the workflow of cybersecurity analysts to reduce the time required to check the accuracy of LLM output? (3) What is the average ROUGE similarity score of responses relative to ground-truth text samples, and how can this score inform confidence in the validity and applicability of a text for cybersecurity analytic products? (4) How can LLMs be leveraged for composing cybersecurity guidance and publications?
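For reference, the ROUGE family of metrics measures n-gram overlap between a generated text and a reference text. The sketch below is a minimal, illustrative ROUGE-1 F1 computation in plain Python; it is not the evaluation pipeline used in this project (where ChatGPT5 served as judge), and the function name is a placeholder for illustration.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Illustrative ROUGE-1 F1: unigram overlap between a generated
    text and a ground-truth reference, after simple whitespace
    tokenization and lowercasing."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in the reference.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 indicates identical unigram content; scores near 0 indicate little lexical overlap with the vendor-published guidance.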
The ROUGE scores calculated by ChatGPT5 functioning as judge revealed that Gemini 1.5 and ChatGPT4o produced outputs of similar quality relative to the ground-truth texts; however, ChatGPT4o outperformed Gemini 1.5 when responding to content-specific, multiple-choice questions. In accordance with Intelligence Community Directives, the final product of this project is a proposed methodology for integrating generated texts into analytic workflows for the production of cybersecurity guidance and publications. The methodology comprises four domains: model selection, prompt injection, text evaluation, and text integration. By applying this methodology, analysts gain the vocabulary and tools necessary to integrate generated texts with analytic rigor and confidence. Statistical analysis further demonstrated that prompt design and data structure influenced output quality more than model selection did, supporting the viability of both LLMs for generating reliable cybersecurity guidance texts.