Abstract
Open-vocabulary object detection (OVOD) extends traditional object detection by allowing
models to detect and classify objects belonging to categories not seen during training, defined
instead by user-specified text queries at inference time. Many recent OVOD methods achieve
this by pairing a region-proposal detector with a vision-language model: the detector
first identifies candidate object regions in the image, and the vision-language model then aligns
each region’s visual features with the provided text representations to assign the best-matching
category label. Most OVOD research is evaluated on RGB benchmarks, whereas many wildlife
monitoring systems rely on near-infrared (near-IR) camera-trap data. Near-IR images are
typically single-channel and collected under active illumination, which changes appearance cues
such as texture, contrast, and brightness relative to RGB imagery. These differences make it
difficult for OVOD models trained on RGB images to perform well on near-IR data.
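The region-to-text matching step described above can be illustrated with a minimal sketch. It assumes that region features and text embeddings already live in a shared embedding space and uses cosine similarity for the matching; the function name `classify_regions` and the toy vectors are hypothetical and are not taken from any of the frameworks studied in this thesis.

```python
import numpy as np

def classify_regions(region_feats, text_embeds, class_names):
    """Assign each region the class whose text embedding is most similar.

    region_feats: (num_regions, dim) array of region visual features.
    text_embeds:  (num_classes, dim) array of text-query embeddings.
    Both are assumed to be in the same embedding space (an assumption of
    this sketch, not a description of a specific model).
    """
    # L2-normalize so the dot product equals cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                      # (num_regions, num_classes)
    best = sims.argmax(axis=1)          # best-matching class per region
    return [class_names[i] for i in best], sims

# Toy example with hand-made 3-dimensional vectors (illustrative only).
class_names = ["mouse", "chipmunk", "flying squirrel"]
text_embeds = np.eye(3)
region_feats = np.array([[0.9, 0.1, 0.0],
                         [0.1, 0.2, 0.8]])
labels, sims = classify_regions(region_feats, text_embeds, class_names)
```

In a real OVOD pipeline the two embedding sets would come from the vision-language model's image and text encoders, and the similarity scores would typically be thresholded or calibrated before a label is assigned.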
This thesis presents the Near-Infrared Rodents Dataset (NIR-Rodents), a curated near-IR
dataset with bounding-box annotations for three rodent species: mouse, chipmunk, and flying
squirrel. NIR-Rodents includes two synchronized viewpoints, overhead and front-facing. The
experiments use the overhead subset because it provides a more consistent viewpoint and clearer
full-body visibility, while the front-facing images are included in the dataset for future study.
Using NIR-Rodents, this thesis first benchmarks four representative OVOD frameworks,
CORA, RegionCLIP, Detic, and GLIP, using their original released implementations and
evaluation settings. For all four frameworks, the study compares class-name prompts and
descriptive prompts to examine how prompt formulation affects detection performance. The
cross-framework comparison uses mean Average Precision (mAP) as the main detection metric.
For RegionCLIP, Detic, and GLIP, the analysis also reports Average Precision (AP) for each
class, following the default reporting used by these models. CORA is also included in this
zero-shot comparison, but its results are reported using the model’s base-target setup, in which
AP is summarized for the base classes and the target class.
To study adaptation in more detail, this thesis then focuses on CORA. CORA is fine-tuned
with class-name prompts. In this fine-tuning stage, one species is treated as the target class and
the remaining two species are treated as base classes, and this process is repeated by rotating
the target class across the three species. The adapted CORA model is then evaluated again
with descriptive prompts. This design makes it possible to examine how fine-tuning changes
CORA across different class splits. To further examine how effective this adaptation remains as
training data become more diverse, the thesis repeats the same fine-tuning process using an
extended dataset that combines NIR-Rodents with additional images from public camera-trap
datasets.
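The rotation of the target class across the three species can be sketched as follows. This is a minimal illustration of how the base/target splits are enumerated, assuming only the three class names given above; the helper `rotate_splits` is hypothetical and does not correspond to code in CORA.

```python
SPECIES = ["mouse", "chipmunk", "flying squirrel"]

def rotate_splits(classes):
    """Yield (base_classes, target_class) pairs, one per class.

    Each class is held out once as the target; the remaining
    classes form the base set for that fine-tuning run.
    """
    for target in classes:
        base = [c for c in classes if c != target]
        yield base, target

splits = list(rotate_splits(SPECIES))
for base, target in splits:
    print(f"base={base} target={target}")
```

Each of the three resulting splits corresponds to one fine-tuning run, after which the adapted model is evaluated with both prompt types.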
The results show that zero-shot performance varies across frameworks and prompt types,
and that descriptive prompts often reduce detection performance before adaptation. The CORA
experiments further show that fine-tuning improves performance relative to zero-shot CORA
and leads to more stable behavior when descriptive prompts are used on images with more clearly
visible traits. The extended-data setting further changes the balance of performance across class
groups and introduces additional challenges.
Overall, the findings show that evaluating OVOD on near-IR data requires careful analysis
of overall detection performance, per-class behavior, and prompt sensitivity. The results also
suggest that region-prompt fine-tuning is a practical way to improve CORA for near-IR rodent
detection without changing the detector architecture.
Keywords: open-vocabulary object detection; near-infrared imaging; camera traps; wildlife
monitoring; vision-language models; CORA; region prompts.