VespAI: a deep learning-based system for the detection of invasive hornets

Bait station

Bait stations consisted of a Dragon Touch Vision 1 1080p camera suspended at a height of 210 mm above a featureless detection board, shielded by an opaque baffle (Fig. 4). This setup minimised background and lighting variability, thus reducing the computational complexity of hornet detection, while ensuring that only hornets and other insects visiting the station were captured in videos. A sponge cloth impregnated with a commercial vespid attractant, either VespaCatch (Véto-pharma) or Trappit (Agrisense), was placed in a 90 mm diameter Petri dish at the centre of the bait station, thus attracting hornets to land directly beneath the camera. We used these bait stations to collect and extract an extensive training dataset, comprising images of V. velutina, V. crabro, and other insects across locations in Jersey, Portugal, France, and the UK.

To ensure dataset fidelity, the resultant images of both V. velutina and V. crabro were visually identified via expert assessment of colouration, abdominal markings, and morphology. Additionally, the identity of each hornet species was confirmed using the appropriate taxonomic keys65,66.

Data were collected in 2021 and 2022, with selected images extracted from the raw video footage and divided into three subsets. All training images were collected in 2021, while the final validation images were collected in 2022, ensuring complete spatiotemporal and biological novelty. Images yielded a maximum simultaneous co-occurrence of six V. velutina (observed in Jersey) and five V. crabro (recorded in the UK). As a processing step prior to training, images were letterboxed, i.e., downsampled to 640 × 640 pixels for enhanced throughput performance while maintaining a 16:9 aspect ratio, with any residual image space filled with blank pixels. This then allowed for extensive image augmentation during training, producing three additional variations to supplement each original frame and thus increasing the total number of images by a factor of four. The specific details of each training data subset are outlined in the following sections.
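As an illustration of the letterboxing step described above, a minimal sketch is shown here, assuming OpenCV and NumPy; the grey pad value of 114 follows the YOLOv5 convention, and the function name is illustrative rather than the authors' code.

```python
import cv2
import numpy as np

def letterbox(image: np.ndarray, target: int = 640, pad_value: int = 114) -> np.ndarray:
    """Resize an image so its longest side fits `target`, preserving aspect ratio,
    then pad the remainder of the square canvas with blank pixels."""
    h, w = image.shape[:2]
    scale = target / max(h, w)                       # uniform scale preserves the aspect ratio
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)

    canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)  # blank padding
    top, left = (target - new_h) // 2, (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# A 1920 x 1080 (16:9) frame becomes a 640 x 360 image centred on a 640 x 640 canvas.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(letterbox(frame).shape)  # (640, 640, 3)
```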

Hornet training subset (HTS): a collection of 1717 images for training and 430 for initial validation metrics, totalling 8,588 after augmentation. This set contained hornet images with a 50:50 split between V. velutina and V. crabro, while the number of non-target insects was intentionally limited. Data were collected from bait stations at sites in the UK and Portugal.

Hornet/non-target training subset (H/NTS): a collection of 2196 images for training and 549 for initial validation metrics, totalling 10,980 after augmentation. This set contained all hornet images from the HTS, in addition to 598 images of non-target insects. Images of non-target insects included a representative selection of species attracted to the bait station, with a focus on visually similar genera such as Vespula, Dolichovespula, and Polistes. All insects were identified to the genus level, using a combination of expert assessment and the relevant taxonomic identification resources65,67. A full list of non-target taxa is provided in Table S1. These data were collected from bait stations at sites in the UK, Jersey, and Portugal.

Validation subset (VS): a collection of 557 images for final validation only, totalling 2228 after augmentation. Of these, 433 contained instances of V. velutina and V. crabro in a 50:50 split, including multiple co-occurrences of both species and non-target insects. The remaining images contained a combination of non-target species and empty bait stations under different lighting and climatic conditions. Validation data were collected from bait stations at sites in the UK, Jersey, France, and Portugal.

Annotation was performed using the Plainsight AI (Plainsight) software interface. This allowed for expedited labelling via automated polygon selection and AI-assisted predictive annotation. Two classes of annotation were generated, corresponding to V. velutina and V. crabro, and these were then manually applied to a random selection of training frames. Polygonal masks included hornet bodies and wings, and excluded legs and antennae, as we found these to be redundant during testing. Once ~500 frames had been annotated manually, we used these data to train an automated detection and segmentation model within the labelling interface, allowing us to generate further annotations for training more rapidly. Prior to data export, all annotations were reviewed manually, and corrections were made where required. Annotations were exported in COCO format, enabling full segmentation of hornet features from the background68.
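For readers unfamiliar with the export format, the snippet below shows how a single polygonal hornet mask is typically represented in COCO format; the file name, coordinates, and identifiers are invented for illustration and do not come from the published dataset.

```python
import json

coco_example = {
    "images": [
        {"id": 1, "file_name": "station_frame_0001.jpg", "width": 640, "height": 640}
    ],
    "categories": [
        {"id": 1, "name": "Vespa velutina"},
        {"id": 2, "name": "Vespa crabro"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # Flattened x, y vertices of the polygon enclosing the body and wings
            "segmentation": [[312.5, 198.0, 355.0, 201.5, 362.0, 250.0, 318.0, 247.5]],
            "bbox": [312.5, 198.0, 49.5, 52.0],  # [x, y, width, height]
            "area": 2574.0,
            "iscrowd": 0,
        }
    ],
}

print(json.dumps(coco_example, indent=2)[:200])
```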

To develop a hardware-specific hornet detection and classification model, we combined our extensive image dataset with bespoke augmentations to obtain high predictive confidence. The VespAI detection algorithm is built on the YOLOv5 family of machine vision models, specifically YOLOv5s, a variant optimised to run on portable processors such as the Raspberry Pi 4 (ref. 48). As a front-end pre-filter to this, we incorporated the lightweight ViBe50 background subtraction algorithm, allowing the system to remain passive in the absence of motion (Fig. 2a). Specifically, this pre-filter detects motion from the raw video input, extracts the contours of moving insects, and retains only objects within a reference size range generated from known hornet detections (Fig. 2a and Fig. S1). Consequently, energy is conserved, as only relevant candidate frames are passed on to the YOLOv5 detection algorithm itself. This then applies a single fully convolutional neural network (F-CNN) to images (Fig. 2b), providing superior speed, accuracy, and contextual awareness compared to traditional regional convolutional neural networks (R-CNNs)49,69.
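The sketch below illustrates the general shape of such a motion pre-filter. It substitutes OpenCV's MOG2 background subtractor for ViBe (which is not bundled with OpenCV), and the contour-area bounds are placeholders rather than the calibrated hornet reference range used by VespAI.

```python
import cv2

MIN_AREA, MAX_AREA = 800, 20_000          # placeholder contour-area bounds (pixels)
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def candidate_frame(frame) -> bool:
    """Return True if the frame contains a moving object within the reference size range."""
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # suppress single-pixel noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return any(MIN_AREA <= cv2.contourArea(c) <= MAX_AREA for c in contours)

# Only frames flagged by candidate_frame() would be handed to the YOLOv5 detector,
# keeping the system passive (and power use low) when nothing is moving.
```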

All models were built and optimised using the PyTorch70 machine learning environment, with the aim of generating an end-to-end software package that would run on a Raspberry Pi 4. This was achieved by testing models on a range of YOLOv5 architectures, specifically YOLOv5m, YOLOv5s, and YOLOv5n, thus optimising them to include the minimum number of parameters (~7 million) whilst maintaining their performance (Fig. S2b).
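As a hedged illustration, the snippet below shows how a trained YOLOv5s checkpoint might be run for inference via the Ultralytics torch.hub interface; the weights file name vespai.pt and the confidence threshold are hypothetical, and this is not the authors' packaged software.

```python
import torch

# Load a custom-trained YOLOv5s checkpoint via the Ultralytics YOLOv5 repository
model = torch.hub.load("ultralytics/yolov5", "custom", path="vespai.pt")
model.conf = 0.5                                     # illustrative confidence threshold

results = model("candidate_frame.jpg", size=640)     # 640-pixel letterboxed inference
detections = results.pandas().xyxy[0]                # class names, confidences, box coordinates
print(detections[["name", "confidence"]])
```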

Final models were trained and tested using an NVIDIA Tesla V100 Tensor Core GPU (NVIDIA), with a total of 200–300 epochs per model and a batch size of nine images. Model optimisation was evaluated via three loss functions: bounding box loss, the difference between the predicted and manually annotated bounding boxes; objectness loss, defined as the probability that bounding boxes contained a target object; and cross-entropy classification loss, the probability that image classes were correctly classified (Fig. S2). In all cases, training concluded when there was no improvement in these three loss functions for a period of 50 epochs.
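The stopping rule can be made concrete with a small standalone sketch: training halts once none of the three loss components has reached a new minimum within the last 50 epochs. This is one interpretation of the rule described above, not the authors' implementation (YOLOv5's own trainer exposes a comparable --patience option).

```python
PATIENCE = 50

def should_stop(loss_history: list[dict], patience: int = PATIENCE) -> bool:
    """loss_history: one dict per epoch with 'box', 'obj', and 'cls' loss values."""
    if len(loss_history) <= patience:
        return False
    earlier, recent = loss_history[:-patience], loss_history[-patience:]
    for key in ("box", "obj", "cls"):
        best_earlier = min(epoch[key] for epoch in earlier)
        if min(epoch[key] for epoch in recent) < best_earlier:
            return False                 # at least one loss is still improving
    return True                          # no improvement in any loss for `patience` epochs
```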

The prototype system was developed to provide proof of concept for remote detection under field-realistic conditions. The VespAI software was installed on a Raspberry Pi 4 running the Ubuntu Desktop 22.04.1 LTS 64-bit operating system. This was then connected via USB to a variety of 1080p cameras, and tested using both mains and battery power supplies. These components were mounted on top of a bait station in the standard camera position, and a remote device was connected to the Pi server via secure shell (SSH). This allowed the hardware to be controlled remotely and hornet detections to be viewed from a corresponding computer.

The setup was validated in Jersey during 2023, testing five candidate camera models and four prototype systems over a total of 55 trials at two field sites, yielding >5500 frames for analysis. Cameras were selected to test system robustness to differing lens and sensor options, while maintaining a standard resolution of 1080p across a range of cost-effective models (Fig. S5 and Table S2). Prior to testing, each camera was calibrated to a specific height, thus ensuring that the relative size of objects in frame remained constant across differences in lens angle and focal length (Table S2). Field sites were situated in Jersey to allow visits from both V. velutina and V. crabro workers, along with a variety of common non-target insects, thus providing a rigorous test of the system under representative conditions.

Each trial consisted of a 100-frame test, with the monitor capturing and analysing frames in real time at intervals of either 5 or 30 s, these intervals being based on known hornet visitation durations (Fig. S4). Specifically, in the first 38 trials the system was set to collect images at 5 s intervals, before optimising to 30 s intervals in the final 17 trials (Table S3), thus allowing for maximum power and data storage conservation in tandem with reliable hornet detection (Fig. S4). Results were then manually validated and compared to the corresponding model predictions to calculate evaluation metrics.
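A minimal sketch of one such trial loop is shown below, assuming an OpenCV camera device and a placeholder detect() function standing in for the YOLOv5 model; the file naming and structure are illustrative only.

```python
import time
import cv2

CAPTURE_INTERVAL = 30   # seconds between frames; 5 s was used in the first 38 trials
N_FRAMES = 100          # frames per trial

def detect(frame):
    """Placeholder for the YOLOv5 detection step; returns a list of detections."""
    return []

camera = cv2.VideoCapture(0)
for i in range(N_FRAMES):
    ok, frame = camera.read()
    if ok and detect(frame):
        # Save frames containing putative hornets, timestamped for later manual validation
        cv2.imwrite(f"detection_{time.strftime('%Y%m%d_%H%M%S')}_{i:03d}.jpg", frame)
    time.sleep(CAPTURE_INTERVAL)
camera.release()
```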

Following field testing, the system was configured to integrate a DS3231 Real-Time Clock module, thus ensuring accurate timestamps for detections in the absence of external calibration.

To train the detection models and enable customised image augmentation, we employed the Python packages PyTorch, Torchvision, and Albumentations. Models were then evaluated via k-fold cross-validation, specifically using the metrics of precision, recall, box loss, objectness loss, classification loss, mean average precision (mAP), and F1 score (Fig. S2 and Table 1). Cross-validation analyses employed k = 5 folds, as this proved sufficient to select an optimised detection classifier that balanced model size with performance. Resultant model rankings were based on mean cross-validation scores, calculated using the Python packages scikit-learn and PaddlePaddle, and the YOLOv5 integrated validation functionality. Additional performance visualisations were generated via the packages Seaborn, Matplotlib, and NumPy. All statistical analyses were performed in SPSS (release v. 28.0.1.1) and Python (release v. 3.9.12).
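The general shape of the k = 5 cross-validation and architecture ranking is sketched below; train_and_evaluate() is a hypothetical helper that abstracts away the actual YOLOv5 training run, so the snippet illustrates the fold and aggregation logic rather than the published pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.arange(2745)          # e.g. the H/NTS images (2196 training + 549 validation)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

def train_and_evaluate(train_idx, val_idx, architecture) -> float:
    """Placeholder: would train the given YOLOv5 architecture on train_idx and return mAP on val_idx."""
    return float(np.random.rand())

scores = {}
for arch in ("yolov5m", "yolov5s", "yolov5n"):
    fold_scores = [train_and_evaluate(tr, va, arch) for tr, va in kf.split(image_ids)]
    scores[arch] = float(np.mean(fold_scores))

print(max(scores, key=scores.get))   # architecture with the best mean cross-validation score
```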

Cross-validation of polygonal and box annotation techniques utilised precision, recall, box loss, objectness loss, classification loss, and mAP as response variables, and compared models with copy-paste augmentation levels of 0%, 30%, and 90%, with the 0% level corresponding to box annotations.

Visualisation of training data subsets to ensure sufficient image novelty utilised frequency distribution analyses of blur, area, brightness, colour, and object density between the HTS, H/NTS, and VS.

Cross-validation of model architectures employed precision, recall, box loss, objectness loss, classification loss, and mAP as response variables, and compared models using the YOLOv5m, YOLOv5s, and YOLOv5n architectures.

Cross-validation of models trained on the hornet training subset and hornet/non-target training subset used F1 score and mAP as response variables, and compared models trained on the HTS and H/NTS, validated against the VS.

The LRP (layer-wise relevance propagation) class classification model employed normalised contributions to classification decisions as a response variable, and compared same- and opposite-class pixel contributions. The LRP training subset classification model used normalised contributions to classification decisions as a response variable, and compared models trained on the HTS and H/NTS.

Precision and recall analyses were utilised to compare camera models, with comparisons based on median performance across test types for each metric.
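For clarity, the snippet below shows how per-camera precision and recall can be computed from validated counts and compared via the median across test types; the counts are placeholders and do not reproduce the published results.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical (true positive, false positive, false negative) counts per test type
camera_counts = {
    "camera_A": [(48, 2, 1), (45, 3, 4)],
    "camera_B": [(47, 1, 2), (46, 2, 3)],
}

for camera, tests in camera_counts.items():
    precisions, recalls = zip(*(precision_recall(*t) for t in tests))
    print(camera, np.median(precisions), np.median(recalls))
```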

Model development utilised a sample of 3302 images collected from a total of four countries, each consisting of multiple sampling sites. Data augmentation further expanded this sample to 13,208 images and provided additional variation to enhance model robustness. Analyses of the prototype system employed a sample of >5500 frames, collected across 55 field trials at two sites in Jersey. The source data underlying all figures and analyses are available within the supplementary data. Full details of statistical tests, subset sample sizes, and model selection procedures are provided in the results and statistical analyses sections.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
