Joint Global-Local Feature Alignment With Fine-Tuned Pretrained Transformers for Text-Based Person Search

Thi Thanh Thuy Pham; Huong-Giang Doan

doi:10.31449/inf.v49i28.7953

Abstract

Text-based person search (TBPS) aims to retrieve person images from a database using natural language descriptions. Although significant progress has been made, TBPS remains challenging due to the complexities of cross-modal understanding. This work proposes a novel framework named GLAlign that jointly aligns global and local features from both vision and text modalities using fine-tuned, large-scale pretrained transformers. Specifically, we utilize ViT-B/32 for visual encoding and GPT-2 (English) or PhoBERT (Vietnamese) for textual encoding. To enhance alignment, we perform human part parsing and noun phrase extraction, enabling fine-grained local feature correspondence between body regions and descriptive attributes. The proposed method is evaluated on four benchmark datasets: CUHK-PEDES, CUHK-PEDES-VN, 3000VnPerson-Search, and 3000Vn-V2E. On the CUHK-PEDES dataset, our model achieves a Rank-1 accuracy of 80.75%, outperforming state-of-the-art methods such as PLOT (75.28%) and RaSa (76.51%). On the 3000VnPerson-Search dataset, our model reaches a Rank-1 accuracy of 85.72% for Vietnamese descriptions, indicating its robustness across both high-resource and low-resource languages. These results demonstrate the effectiveness of combining global-local alignment with fine-tuned vision-language transformers for the TBPS task. The source code is available at: https://github.com/TextBasedPersonSearch/PersionSearch.
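For readers unfamiliar with the Rank-1 metric quoted above, the following is a minimal sketch of how Rank-k accuracy is commonly computed in text-to-image retrieval: a query counts as a hit if any of its k highest-scoring gallery images shares the query's identity. The function name `rank_k_accuracy`, the similarity scores, and the identity labels are illustrative assumptions and are not taken from the GLAlign code.

```python
def rank_k_accuracy(similarity, query_ids, gallery_ids, k=1):
    """Fraction of text queries whose top-k gallery images contain the correct identity.

    similarity[i][j] is the score between text query i and gallery image j
    (hypothetical values here; a real system would get them from the model).
    """
    hits = 0
    for scores, qid in zip(similarity, query_ids):
        # Gallery indices sorted by descending similarity to this query.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(gallery_ids[j] == qid for j in order[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy data (assumed for illustration): 3 text queries, 4 gallery images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.3, 0.2, 0.1, 0.7],
]
query_ids = [0, 1, 2]
gallery_ids = [0, 1, 2, 2]

rank1 = rank_k_accuracy(similarity, query_ids, gallery_ids, k=1)
rank2 = rank_k_accuracy(similarity, query_ids, gallery_ids, k=2)
print(f"Rank-1: {rank1:.4f}, Rank-2: {rank2:.4f}")
```

In this toy case the second query ranks a wrong identity first, so it only counts as a hit once k is raised to 2.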

Author Biography

Huong-Giang Doan, Electric Power University

Corresponding Author

References

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition”, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.

Y. Zhang and H. Lu, “Deep cross-modal projection learning for image-text matching”, in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 707–723. doi: 10.1007/978-3-030-01246-5_42.

C. Gao, G. Cai, X. Jiang, et al., “Contextual non-local alignment over full-scale representation for text-based person search”, Jan. 2021. doi: 10.48550/arXiv.2101.03036.

N. Sarafianos, X. Xu, and I. Kakadiaris, “Adversarial representation learning for text-to-image matching”, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5813–5823. doi: 10.1109/ICCV.2019.00591.

X. Han, S. He, L. Zhang, and T. Xiang, “Text-based person search with limited data”, arXiv preprint arXiv:2110.10807, p. 13, 2021. doi: 10.48550/arXiv.2404.18106.

Z. Wang, Z. Fang, J. Wang, and Y. Yang, “Vitaa: Visual-textual attributes alignment in person search by natural language”, in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, Springer, 2020, pp. 402–420. doi: 10.1007/978-3-030-58610-2_24.

S. Yan, H. Tang, L. Zhang, and J. Tang, “Image-specific information suppression and implicit local alignment for text-based person search”, IEEE transactions on neural networks and learning systems, 2023. doi: 10.1109/TNNLS.2023.3310118.

S. Li, M. Cao, and M. Zhang, “Learning semantic-aligned feature representation for text-based person search”, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 2724–2728. doi: 10.1109/ICASSP43922.2022.9746846.

X. Shu, W. Wen, H. Wu, et al., “See finer, see more: Implicit modality alignment for text-based person retrieval”, in European Conference on Computer Vision, Springer, 2022, pp. 624–641. doi: 10.1007/978-3-031-25072-9_42.

Y. Bai, M. Cao, D. Gao, et al., “Rasa: Relation and sensitivity aware representation learning for text-based person search”, 2023. doi: 10.24963/ijcai.2023/62.

A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision”, in International conference on machine learning, PMLR, 2021, pp. 8748–8763. doi: 10.48550/arXiv.2103.00020.

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation”, in International conference on machine learning, PMLR, 2022, pp. 12888–12900. doi: 10.48550/arXiv.2201.12086.

B. Xiao, H. Wu, W. Xu, et al., “Florence-2: Advancing a unified representation for a variety of vision tasks”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829. doi: 10.1109/CVPR52733.2024.00461.

S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, “Person search with natural language description”, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5187–5196. doi: 10.1109/CVPR.2017.551.

Y. Chen, G. Zhang, Y. Lu, Z. Wang, and Y. Zheng, “Tipcb: A simple but effective part-based convolutional baseline for text-based person search”, Neurocomputing, vol. 494, pp. 171–181, 2022. doi: 10.1016/j.neucom.2022.04.081.

S. Yang, Y. Zhou, Z. Zheng, Y. Wang, L. Zhu, and Y. Wu, “Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark”, in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4492–4501. doi: 10.1145/3581783.3611709.

Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, and Y.-D. Shen, “Dual-path convolutional image-text embeddings with instance loss”, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 2, pp. 1–23, 2020. doi: 10.1145/3383184.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16x16 words: Transformers for image recognition at scale”, in International Conference on Learning Representations, 2021.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners”, OpenAI blog, vol. 1, no. 8, p. 9, 2019.

D. Q. Nguyen and A. T. Nguyen, “Phobert: Pre-trained language models for vietnamese”, pp. 1037–1042, 2020. doi: 10.18653/v1/2020.findings-emnlp.92.

T. T. T. Pham, V.-T. Nguyen, H.-Q. Nguyen, et al., “Person search by natural language description in vietnamese using pre-trained visual-textual attributes alignment model”, in 2021 13th International Conference on Knowledge and Systems Engineering (KSE), IEEE, 2021, pp. 1–6.

T. T. T. Pham, H.-Q. Nguyen, H. Phan, et al., “Towards a large-scale person search by vietnamese natural language: Dataset and methods”, Multimedia Tools and Applications, vol. 81, no. 19, pp. 27569–27600, 2022. doi: 10.1007/s11042-022-12138-1.

W. Suo, M. Sun, K. Niu, et al., “A simple and robust correlation filtering method for text-based person search”, in European conference on computer vision, Springer, 2022, pp. 726–742. doi: 10.1007/978-3-031-19833-5_42.

M. Cao, Y. Bai, Z. Zeng, M. Ye, and M. Zhang, “An empirical study of clip for text-based person search”, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 465–473. doi: 10.48550/arXiv.2308.10045.

J. Park, D. Kim, B. Jeong, and S. Kwak, “Plot: Text-based person search with part slot attention for corresponding part discovery”, in European Conference on Computer Vision, Springer, 2025, pp. 474–490. doi: 10.1007/978-3-031-72664-4_27.

H. P. T. Tran, T. H. P. Phan, T. B. N. Nguyen, et al., “M-irra: A multilingual model for text-based person search”, in The 3rd APSIPA Workshop on Signal and Information Processing (SIP) in Vietnam, 2024, pp. 1–6.
