Joint Global-Local Feature Alignment With Fine-Tuned Pretrained Transformers for Text-Based Person Search
References
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition”, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
Y. Zhang and H. Lu, “Deep cross-modal projection learning for image-text matching”, in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 707–723. doi: 10.1007/978-3-030-01246-5_42.
C. Gao, G. Cai, X. Jiang, et al., “Contextual non-local alignment over full-scale representation for text-based person search”, Jan. 2021. doi: 10.48550/arXiv.2101.03036.
N. Sarafianos, X. Xu, and I. Kakadiaris, “Adversarial representation learning for text-to-image matching”, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5813–5823. doi: 10.1109/ICCV.2019.00591.
X. Han, S. He, L. Zhang, and T. Xiang, “Text-based person search with limited data”, arXiv preprint arXiv:2110.10807, p. 13, 2021. doi: 10.48550/arXiv.2404.18106.
Z. Wang, Z. Fang, J. Wang, and Y. Yang, “Vitaa: Visual-textual attributes alignment in person search by natural language”, in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, Springer, 2020, pp. 402–420. doi: 10.1007/978-3-030-58610-2_24.
S. Yan, H. Tang, L. Zhang, and J. Tang, “Image-specific information suppression and implicit local alignment for text-based person search”, IEEE transactions on neural networks and learning systems, 2023. doi: 10.1109/TNNLS.2023.3310118.
S. Li, M. Cao, and M. Zhang, “Learning semantic-aligned feature representation for text-based person search”, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 2724–2728. doi: 10.1109/ICASSP43922.2022.9746846.
X. Shu, W. Wen, H. Wu, et al., “See finer, see more: Implicit modality alignment for text-based person retrieval”, in European Conference on Computer Vision, Springer, 2022, pp. 624–641. doi: 10.1007/978-3-031-25072-9_42.
Y. Bai, M. Cao, D. Gao, et al., “Rasa: Relation and sensitivity aware representation learning for text-based person search”, 2023. doi: 10.24963/ijcai.2023/62.
A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision”, in International conference on machine learning, PMLR, 2021, pp. 8748–8763. doi: 10.48550/arXiv.2103.00020.
J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation”, in International conference on machine learning, PMLR, 2022, pp. 12 888–12 900. doi: 10.48550/arXiv.2201.12086.
B. Xiao, H. Wu, W. Xu, et al., “Florence-2: Advancing a unified representation for a variety of vision tasks”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829. doi: 10.1109/CVPR52733.2024.00461.
S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, “Person search with natural language description”, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5187–5196. doi: 10.1109/CVPR.2017.551.
Y. Chen, G. Zhang, Y. Lu, Z. Wang, and Y. Zheng, “Tipcb: A simple but effective part-based convolutional baseline for text-based person search”, Neurocomputing, vol. 494, pp. 171–181, 2022. doi: 10.1016/j.neucom.2022.04.081.
S. Yang, Y. Zhou, Z. Zheng, Y. Wang, L. Zhu, and Y. Wu, “Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark”, in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4492–4501. doi: 10.1145/3581783.3611709.
Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, and Y.-D. Shen, “Dual-path convolutional image-text embeddings with instance loss”, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 2, pp. 1–23, 2020. doi: 10.1145/3383184.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16x16 words: Transformers for image recognition at scale”, in International Conference on Learning Representations, 2021.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners”, OpenAI blog, vol. 1, no. 8, p. 9, 2019.
D. Q. Nguyen and A. T. Nguyen, “Phobert: Pre-trained language models for vietnamese”, pp. 1037–1042, 2020. doi: 10.18653/v1/2020.findings-emnlp.92.
T. T. T. Pham, V.-T. Nguyen, H.-Q. Nguyen, et al., “Person search by natural language description in vietnamese using pre-trained visual-textual attributes alignment model”, in 2021 13th International Conference on Knowledge and Systems Engineering (KSE), IEEE, 2021, pp. 1–6.
T. T. T. Pham, H.-Q. Nguyen, H. Phan, et al., “Towards a large-scale person search by vietnamese natural language: Dataset and methods”, Multimedia Tools and Applications, vol. 81, no. 19, pp. 27 569–27 600, 2022. doi: 10.1007/s11042-022-12138-1.
W. Suo, M. Sun, K. Niu, et al., “A simple and robust correlation filtering method for text-based person search”, in European conference on computer vision, Springer, 2022, pp. 726–742. doi: 10.1007/978-3-031-19833-5_42.
M. Cao, Y. Bai, Z. Zeng, M. Ye, and M. Zhang, “An empirical study of clip for text-based person search”, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 465–473. doi: 10.48550/arXiv.2308.10045.
J. Park, D. Kim, B. Jeong, and S. Kwak, “Plot: Text-based person search with part slot attention for corresponding part discovery”, in European Conference on Computer Vision, Springer, 2025, pp. 474–490. doi: 10.1007/978-3-031-72664-4_27.
H. P. T. Tran, T. H. P. Phan, T. B. N. Nguyen, et al., “M-irra: A multilingual model for text-based person search”, in The 3rd APSIPA Workshop on Signal and Information Processing (SIP) in Vietnam, 2024, pp. 1–6.
DOI: https://doi.org/10.31449/inf.v49i28.7953

This work is licensed under a Creative Commons Attribution 3.0 License.