Benchmarking and Adapting Open-Vocabulary Object Detection on Near-IR Camera-Trap Imagery

Qasem Alharbi

Open-vocabulary object detection (OVOD) extends traditional object detection by allowing models to detect and classify objects belonging to categories not seen during training, defined instead by user-specified text queries at inference time. Many recent OVOD methods achieve this by combining a detector's region proposals with a vision-language model: the detector first identifies candidate object regions in the image, and the vision-language model then aligns each region’s visual features with the provided text representations to assign the best-matching category label. Most OVOD research is evaluated on RGB benchmarks, whereas many wildlife monitoring systems rely on near-infrared (near-IR) camera-trap data. Near-IR images are typically single-channel and collected under active illumination, which changes appearance cues such as texture, contrast, and brightness relative to RGB imagery. These differences make it difficult for OVOD models trained on RGB images to perform well on near-IR data. This thesis presents the Near-Infrared Rodents Dataset (NIR-Rodents), a curated near-IR dataset with bounding-box annotations for three rodent species: mouse, chipmunk, and flying squirrel. NIR-Rodents includes two synchronized viewpoints, overhead and front-facing. The experiments use the overhead subset because it provides a more consistent viewpoint and clearer full-body visibility, while the front-facing images are included in the dataset for future study. Using NIR-Rodents, this thesis first benchmarks four representative OVOD frameworks (CORA, RegionCLIP, Detic, and GLIP) with their original released implementations and evaluation settings. For all four frameworks, the study compares class-name prompts and descriptive prompts to examine how prompt formulation affects detection performance. The cross-framework comparison uses mean Average Precision (mAP) as the main detection metric.
For RegionCLIP, Detic, and GLIP, the analysis also reports Average Precision (AP) for each class, following the default reporting used by these models. CORA is also included in this zero-shot comparison, but its results are reported using the model’s base-target setup, in which AP is summarized for the base classes and the target class. To study adaptation in more detail, this thesis then focuses on CORA, which is fine-tuned with class-name prompts. In this fine-tuning stage, one species is treated as the target class and the remaining two species are treated as base classes, and this process is repeated by rotating the target class across the three species. The adapted CORA model is then evaluated again with descriptive prompts. This design makes it possible to examine how fine-tuning changes CORA across different class splits. To examine how effective this adaptation remains as training data become more diverse, the thesis repeats the same fine-tuning process using an extended dataset that combines NIR-Rodents with additional images from public camera-trap datasets. The results show that zero-shot performance varies across frameworks and prompt types, and that descriptive prompts often reduce detection performance before adaptation. The CORA experiments also show that fine-tuning improves performance relative to zero-shot CORA and leads to more stable behavior when descriptive prompts are used on images with clearer visible traits. The extended-data setting changes the balance of performance across class groups and introduces additional challenges. Overall, the findings show that evaluating OVOD on near-IR data requires careful analysis of overall detection performance, per-class behavior, and prompt sensitivity. The results also suggest that region-prompt fine-tuning is a practical way to improve CORA for near-IR rodent detection without changing the detector architecture.
Keywords: open-vocabulary object detection; near-infrared imaging; camera traps; wildlife monitoring; vision-language models; CORA; region prompts.
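
The rotating base-target protocol described in the abstract can be illustrated with a minimal Python sketch. It only enumerates the three class splits (each species held out once as the target, the other two used as base classes); the function name rotating_splits and the variable names are hypothetical and are not taken from the thesis or from the CORA codebase.

```python
# Illustrative sketch only: names here are assumptions, not the thesis code.
SPECIES = ["mouse", "chipmunk", "flying squirrel"]


def rotating_splits(classes):
    """Yield (base_classes, target_class), holding out each class once."""
    for target in classes:
        # Every class other than the held-out target serves as a base class.
        base = [c for c in classes if c != target]
        yield base, target


splits = list(rotating_splits(SPECIES))

for base, target in splits:
    print(f"base classes: {base} | target class: {target}")
```

Each of the three resulting splits would correspond to one fine-tuning run with class-name prompts, followed by a second evaluation with descriptive prompts.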
