Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation

Pose Anything teaser.

Pose Anything - given only one example image and its skeleton, our method performs pose estimation on images from unseen categories.

Abstract

Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories using a single model, requiring only a minimal number of support images with annotated keypoints. This approach not only enables object pose estimation based on arbitrary keypoint definitions but also significantly reduces the associated annotation costs, paving the way for versatile and adaptable pose estimation applications. We present a novel approach to CAPE that leverages the inherent geometrical relations between keypoints through a newly designed Graph Transformer Decoder. By capturing and incorporating this crucial structural information, our method enhances the accuracy of keypoint localization, marking a significant departure from conventional CAPE techniques that treat keypoints as isolated entities. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning more than 100 categories. Our method outperforms the prior state-of-the-art by substantial margins, achieving remarkable improvements of 2.16% and 1.82% under 1-shot and 5-shot settings, respectively. Furthermore, our method's end-to-end training demonstrates both scalability and efficiency compared to previous CAPE approaches.
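To make the core idea concrete, the sketch below shows one way a decoder layer could combine transformer self-attention over keypoint queries with a graph aggregation step over the support skeleton, so each keypoint feature is refined by its skeletal neighbors rather than treated in isolation. This is a minimal illustrative sketch, not the paper's actual architecture; all names and dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class GraphTransformerDecoderLayer(nn.Module):
    """Illustrative decoder layer (hypothetical, not the paper's code):
    self-attention over keypoint queries, followed by aggregation of
    neighbor features along the support skeleton's adjacency."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.neighbor_proj = nn.Linear(dim, dim)  # projects aggregated neighbor features
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # queries: (B, K, dim) keypoint query features
        # adj:     (K, K) skeleton adjacency, 1 where two keypoints are linked
        attn_out, _ = self.self_attn(queries, queries, queries)
        queries = self.norm1(queries + attn_out)
        # row-normalize the adjacency, then average each keypoint's neighbors
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        neighbor = (adj / deg) @ queries  # (K, K) @ (B, K, dim) broadcasts over batch
        queries = self.norm2(queries + self.neighbor_proj(neighbor))
        return queries
```

Because the graph step only uses the adjacency structure (not keypoint indices), a layer like this stays agnostic to keypoint order and can accept any category's skeleton at test time.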

Results

Qualitative Results

Using our method, given a support image and skeleton we can perform structure-consistent pose estimation on images from unseen categories.
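The support skeleton mentioned above is just a list of keypoint-index pairs; at inference it can be turned into the adjacency matrix that the graph decoder consumes. A minimal sketch (the function name and self-loop convention are our own assumptions):

```python
import numpy as np

def skeleton_to_adjacency(edges, num_keypoints):
    """Build a symmetric adjacency matrix (with self-loops) from a support
    skeleton given as (i, j) keypoint-index pairs. Illustrative helper,
    not the paper's implementation."""
    adj = np.eye(num_keypoints, dtype=np.float32)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    return adj

# Toy 4-keypoint chain skeleton: 0-1-2-3
chain = skeleton_to_adjacency([(0, 1), (1, 2), (2, 3)], num_keypoints=4)
```

Since the matrix is built per support example, the same model handles any category's keypoint layout without retraining.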

Out-of-Distribution

Our model, which was trained on real images only, demonstrates its adaptability and effectiveness across varying data sources such as cartoons and imaginary animals created using a diffusion model. Furthermore, our model demonstrates satisfactory performance even when the support and query images are from different domains.

Quantitative Results

We compare our method with the previous CAPE methods CapeFormer and POMNet and three baselines: ProtoNet, MAML, and Fine-tuned. We report results on the MP-100 dataset under 1-shot and 5-shot settings. As can be seen, the enhanced baseline models, which are agnostic to keypoint order as opposed to CapeFormer, outperform previous methods and improve the average PCK by 0.94% under the 1-shot setting and 1.60% under the 5-shot setting. Our graph-based method further improves performance over the enhanced baseline by 1.22% under the 1-shot setting and 0.22% under the 5-shot setting, achieving new state-of-the-art results for both settings. We also show the scalability of our design: similar to DETR-based models, employing a larger backbone improves performance. Our graph decoder design also enhances the larger enhanced baseline, improving results by 1.02% and 0.34% under the 1-shot and 5-shot settings, respectively.
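The PCK numbers above follow the standard Percentage of Correct Keypoints metric: a predicted keypoint counts as correct when its distance to the ground truth is within a fraction of the object's bounding-box size. A minimal sketch of the metric (the threshold convention shown here, alpha times the box size, is the common PCK@alpha formulation, assumed rather than taken from the paper's evaluation code):

```python
import numpy as np

def pck(pred, gt, bbox_size, alpha=0.2):
    """Percentage of Correct Keypoints: fraction of keypoints whose
    prediction error is within alpha * bbox_size."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint Euclidean error
    return float((dist <= alpha * bbox_size).mean())

# Example: one keypoint within the threshold, one far outside it
pred = np.array([[0.0, 1.0], [10.0, 30.0]])
gt = np.array([[0.0, 0.0], [10.0, 10.0]])
score = pck(pred, gt, bbox_size=10.0)  # threshold = 0.2 * 10 = 2 pixels
```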

BibTeX

If you find this research useful, please cite the following:

@misc{hirschorn2023pose,
      title={Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation},
      author={Or Hirschorn and Shai Avidan},
      year={2023},
      eprint={2311.17891},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}