Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or a few annotated support images. Recent works have shown that using a pose graph (i.e., treating keypoints as nodes in a graph rather than isolated points) helps handle occlusions and break symmetry. However, these methods assume a static pose graph with equal-weight edges, leading to suboptimal results.
We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph's edge weights, which improves keypoint localization. To further leverage structural priors, we propose integrating a Markovian Structural Bias, which modulates the self-attention interaction between nodes according to the number of hops between them. We show that this improves the model's ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which comprises 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy.
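To make the hop-based bias concrete, below is a minimal, illustrative sketch (not the official EdgeCape implementation): shortest-path hop counts between skeleton nodes are mapped to learned scalar offsets that are added to the self-attention logits. All names, shapes, and hyperparameters here are our own assumptions.

```python
# Illustrative sketch of a hop-distance ("Markovian") structural bias added to
# self-attention between keypoint nodes. Names and shapes are assumptions,
# not the official EdgeCape code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def hop_distances(adj: torch.Tensor, max_hops: int) -> torch.Tensor:
    """Shortest-path hop counts between nodes of a binary adjacency matrix (K x K).
    Unreachable pairs are clamped to max_hops."""
    k = adj.shape[0]
    dist = torch.full((k, k), max_hops, dtype=torch.long, device=adj.device)
    dist.fill_diagonal_(0)
    reach = torch.eye(k, dtype=torch.bool, device=adj.device)  # pairs already assigned a distance
    frontier = adj.bool()                                      # pairs reachable by a walk of the current length
    for h in range(1, max_hops):
        newly = frontier & ~reach        # reached for the first time at exactly h hops
        dist[newly] = h
        reach |= frontier
        frontier = (frontier.float() @ adj.float()) > 0        # extend walks by one hop
    return dist


class BiasedSelfAttention(nn.Module):
    """Single-head self-attention whose logits are shifted by a learned scalar per hop count."""

    def __init__(self, dim: int, max_hops: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.hop_bias = nn.Embedding(max_hops + 1, 1)  # one learnable bias per hop distance
        self.max_hops = max_hops

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (K, dim) keypoint-node features, adj: (K, K) binary skeleton adjacency
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.t() / (x.shape[-1] ** 0.5)           # (K, K) attention scores
        hops = hop_distances(adj, self.max_hops)            # (K, K) hop counts
        logits = logits + self.hop_bias(hops).squeeze(-1)   # modulate attention by graph distance
        return F.softmax(logits, dim=-1) @ v
```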
Using our method, given a support image and skeleton, we can refine the structure for better pose estimation on images from unseen categories.
Our model predicts the best weighted graph structure for localization.
Edge weights are shown in the graph edges, with thicker edges indicating higher weights.
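For illustration only, a predicted weighted graph can be rendered in this style, with edge thickness proportional to the predicted weight. The toy skeleton, coordinates, and weights in the sketch below are placeholders, not actual model output.

```python
# Hypothetical visualization of a weighted pose graph: edge thickness is
# proportional to the predicted weight. All data here is placeholder.
import matplotlib.pyplot as plt
import networkx as nx

# (node_a, node_b, predicted_weight) for a toy 5-keypoint skeleton
weighted_edges = [(0, 1, 0.9), (1, 2, 0.4), (1, 3, 0.7), (3, 4, 0.2)]
positions = {0: (0.5, 1.0), 1: (0.5, 0.7), 2: (0.2, 0.4), 3: (0.8, 0.4), 4: (0.8, 0.1)}

g = nx.Graph()
g.add_weighted_edges_from(weighted_edges)

widths = [5.0 * g[u][v]["weight"] for u, v in g.edges()]  # thicker line = higher weight
nx.draw(g, pos=positions, width=widths, with_labels=True, node_color="lightblue")
plt.savefig("pose_graph.png")
```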
We evaluate our method on the MP-100 dataset under both 1-shot and 5-shot settings. For a fair comparison, we use backbones of similar sizes across methods. However, it is important to highlight that PPM differs significantly from the other approaches, as it leverages the much larger Stable Diffusion model and relies on test-time optimization. These factors give PPM a distinct advantage but also introduce additional complexity and computational overhead; nevertheless, we include its results for completeness. As the table shows, our method achieves state-of-the-art performance, outperforming the previous best method by an average margin of 1.21% in the 1-shot setting. In the 5-shot setting, our approach achieves the highest performance among similar-sized methods, with an improvement of 0.84%. These results underscore the robustness and efficiency of our method, particularly in challenging few-shot scenarios, where structural priors and adaptive graph refinement are critical.
Model | Backbone | 1-Shot Split 1 | Split 2 | Split 3 | Split 4 | Split 5 | 1-Shot Avg | 5-Shot Split 1 | Split 2 | Split 3 | Split 4 | Split 5 | 5-Shot Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
POMNet | ResNet-50 | 84.23 | 78.25 | 78.17 | 78.68 | 79.17 | 79.70 | 84.72 | 79.61 | 78.00 | 80.38 | 80.85 | 80.71 |
CapeFormer | ResNet-50 | 89.45 | 84.88 | 83.59 | 83.53 | 85.09 | 85.31 | 91.94 | 88.92 | 89.40 | 88.01 | 88.25 | 89.30 |
ESCAPE | ResNet-50 | 86.89 | 82.55 | 81.25 | 81.72 | 81.32 | 82.74 | 91.41 | 87.43 | 85.33 | 87.27 | 86.76 | 87.63 |
MetaPoint+ | ResNet-50 | 90.43 | 85.59 | 84.52 | 84.34 | 85.96 | 86.17 | 92.58 | 89.63 | 89.98 | 88.70 | 89.20 | 90.02 |
SDPNet | HRNet-32 | 91.54 | 86.72 | 85.49 | 85.77 | 87.26 | 87.36 | 93.68 | 90.23 | 89.67 | 89.08 | 89.46 | 90.42 |
X-Pose | Swin | 89.07 | 85.05 | 85.26 | 85.52 | 85.79 | 86.14 | - | - | - | - | - | - |
PPM | Stable Diffusion | 91.03 | 88.06 | 84.48 | 86.73 | 87.40 | 87.54 | 93.64 | 92.71 | 91.76 | 92.85 | 91.94 | 92.58 |
SCAPE | DinoV2 | 91.47 | 86.29 | 87.23 | 87.07 | 86.94 | 87.80 | 94.33 | 90.53 | 91.49 | 90.68 | 89.80 | 91.37 |
GraphCape | SwinV2 | 91.19 | 87.81 | 85.68 | 85.87 | 85.61 | 87.23 | 94.24 | 91.32 | 90.15 | 90.37 | 89.73 | 91.16 |
Ours | DinoV2 + DPT | 93.69 | 89.27 | 87.85 | 86.67 | 87.59 | 89.01 | 95.51 | 91.94 | 91.33 | 90.36 | 91.92 | 92.21 |
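For reference, accuracy on MP-100 is commonly reported as PCK (a predicted keypoint counts as correct if it falls within a threshold fraction of the object bounding-box size). Below is a minimal sketch of that metric, assuming the standard 0.2 threshold; it is not taken from our evaluation code.

```python
# Minimal sketch of the PCK metric commonly used on MP-100, assuming a 0.2
# threshold relative to the longest bounding-box side (an assumption, not our codebase).
import numpy as np


def pck(pred: np.ndarray, gt: np.ndarray, bbox_wh: tuple, thr: float = 0.2) -> float:
    """pred, gt: (K, 2) keypoint coordinates; bbox_wh: (width, height) of the object box."""
    norm = thr * max(bbox_wh)                      # distance threshold in pixels
    dists = np.linalg.norm(pred - gt, axis=-1)     # per-keypoint Euclidean error
    return float((dists <= norm).mean())           # fraction of correctly localized keypoints
```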
If you find this research useful, please cite the following:
@misc{hirschorn2024edgeweightpredictioncategoryagnostic,
  title={Edge Weight Prediction For Category-Agnostic Pose Estimation},
  author={Or Hirschorn and Shai Avidan},
  year={2024},
  eprint={2411.16665},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.16665},
}