Fusion CNN-Transformer Model for Target Counting in Complex Scenarios
Abstract
To overcome the shortcomings of traditional manual counting methods, which are labor-intensive, resource-consuming, and inefficient, this study introduces a computer-based counting model that integrates convolutional neural networks (CNNs) with Transformer networks to efficiently recognize and count specific target objects in large-scale data scenarios. The approach leverages CNNs for local feature extraction and Transformer networks for capturing long-range global information, achieving a synergistic effect; the methodology follows the key steps of “CNN for feature extraction, Transformer for global attention.” In target-counting experiments, the model achieves a mean absolute error of 10.13, a root mean square error of 12.08, an average counting accuracy of 98.6%, a peak signal-to-noise ratio of 23.75 dB, a structural similarity of 0.933, a coefficient of determination of 0.901, and an average counting time of about 6.58 ms per image, with a parameter count of 3.21. The model also recognizes and responds well to highly complex scenes while maintaining high accuracy. Compared with a pure CNN model, the proposed model reduces the error rate by 13.4%, indicating that fusing CNNs with self-attention-based Transformer networks is effective for object counting in computer vision and can be readily applied to computer recognition and object-counting tasks.
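The “CNN for feature extraction, Transformer for global attention” pipeline can be sketched in miniature. The sketch below is illustrative only: the convolution kernel, token dimension, and sum-pooled count head are assumptions for demonstration, not the paper’s actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernel):
    """Naive valid 2-D cross-correlation: the CNN's local feature extraction role."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention: the Transformer's global-context role."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

image = rng.random((16, 16))                          # toy single-channel image
feat = conv2d(image, rng.standard_normal((3, 3)))     # local features (14 x 14)
tokens = feat.reshape(-1, 1)                          # each spatial position -> a token
d = 4                                                 # assumed token embedding width
attended = self_attention(tokens,
                          rng.standard_normal((1, d)),
                          rng.standard_normal((1, d)),
                          rng.standard_normal((1, d)))
count = float(attended.sum())  # toy regression head: pool attended features into a count
print(attended.shape, round(count, 3))
```

In a real model the convolution would be a learned multi-channel backbone and the attention stacked into multiple Transformer layers, but the data flow (local features, then global attention over feature tokens, then a counting head) mirrors the abstract’s description.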
Full Text: PDF
DOI: https://doi.org/10.31449/inf.v49i12.7315

This work is licensed under a Creative Commons Attribution 3.0 License.