Abstract
The rapid advancement of Large Language Models (LLMs) has transformed various scientific domains, yet their effective application in specialized fields such as geoscience requires more than general pre-trained capabilities. This dissertation introduces a novel knowledge-infused, LLM-driven framework for geoscience data analytics, using the Mindat mineral database as a comprehensive example. The research focuses on an LLM-driven automated workflow for analyzing complex geoscience data. Such analysis often requires both interdisciplinary geoscience domain knowledge and computer science expertise, a combination that creates technical barriers for users outside the computer science community.
Through a systematic investigation of four research questions, this dissertation explores the integration of domain-specific knowledge into LLMs via prompt engineering and fine-tuning techniques. The research first establishes the current landscape of LLMs, neuro-symbolic AI, and knowledge graphs within geoscience through a comprehensive literature review, identifying the gaps and challenges in existing approaches. Building on these foundations, the study demonstrates how LLMs can enhance data preprocessing and cleansing tasks for geoscience datasets, particularly natural language request parsing and data structure transformation in the Mindat use cases. These preprocessing capabilities significantly improve data quality and consistency, enabling more reliable downstream analysis. The investigation then evaluates both commercial and open-source LLMs for streamlining geoscience workflows. Using ChatGPT-4o as a benchmark, the research assesses the capabilities of commercial LLMs in automating complex analytical tasks while providing natural language interfaces. The results demonstrate significant improvements in workflow efficiency and accessibility, particularly for users with limited programming expertise. A further investigation into open-source LLMs examines the viability of cost-effective, scalable solutions for domain-specific applications, evaluating the accessibility and scalability of the resulting geoscience data analysis workflows. This comparative analysis demonstrates the utility of open-source LLMs and the trade-offs between commercial and open-source solutions in specialized scientific domains.
This research contributes to the field by demonstrating how knowledge-infused LLMs can broaden access to advanced data analytics in geoscience while reducing the heavy manual labor required to curate fine-tuning datasets. The developed data analysis workflow is driven by LLM agents equipped with data retrieval and visualization tools that enable data processing and subsequent analysis. The framework, which integrates a data API with LLM agents, enables researchers, educators, and industry professionals to retrieve datasets and produce data analysis products with minimal programming expertise. This study uses the Mindat data portal as an example to construct an automated workflow driven by both commercial and open-source LLMs, showing how LLMs can enhance the efficiency of geoscience data analysis. The workflow built on a fine-tuned open-source LLM shows that an LLM-assisted data service can be deployed accessibly, facilitating transparency and reproducibility in research and practical use. The framework's effectiveness is demonstrated through qualitative evaluation experiments against advanced proprietary LLMs, which illustrate its practical applications in real-world geoscience scenarios.
Moreover, the proposed LLM-driven pipeline, which runs from augmenting the fine-tuning data to the final use of the fine-tuned open-source LLM, offers a template for implementing LLMs in other knowledge-intensive fields where embedding complex knowledge constraints into data interactions is challenging. The research establishes robust knowledge-infusion workflows for LLMs, setting an example for adapting similar models to different scientific domains with human-friendly interfaces while maintaining transparency and reliability. This dissertation thereby provides a blueprint for responsible AI integration in scientific research, balancing advanced computational capabilities with accessibility, interpretability, and transparency.
Future directions for expanding the workflow's capabilities include adding support for multiple data portals, equipping the LLM agents with more analytical tools, integrating knowledge graphs and neuro-symbolic AI approaches to improve the LLM fine-tuning process, and using knowledge graphs to encode sophisticated domain knowledge when deploying LLMs in domain-specific data analysis workflows.