Multi‑modal CrossViT using 3D spatial information for visual localization