Poster · 4 reviewers

Min 2 · Max 4 · Std 0.9

ICML 2025

EFDTR: Learnable Elliptical Fourier Descriptor Transformer for Instance Segmentation

Jiawei Cao, Chaochen Gu, Hao Cheng, Xiaofeng Zhang, Kaijie Wu, Changsheng Lu

Submitted: 2025-01-24 · Updated: 2025-07-24

TL;DR

EFDTR: A polygon-based instance segmentation framework using Elliptic Fourier Descriptors (EFDs) for precise contour modeling, outperforming polygon methods and rivaling mask-based approaches.

Abstract

Keywords

Instance Segmentation, Learnable Elliptical Fourier Descriptor, Contour Prediction

Reviews and Discussion

Review

Rating: 4 · 2025-03-13

This paper proposes EFDTR, an instance segmentation framework that leverages Elliptical Fourier Descriptors (EFDs) to represent object contours. The approach employs a two-stage Transformer decoder: the first stage predicts low-order (particularly first-order) elliptical Fourier coefficients to capture global shape information, and the second stage refines the polygon with many vertices aligned by phase matching. The authors argue that EFDTR retains the flexibility of polygon-based methods while using frequency-domain matching to resolve ambiguities in vertex assignment, achieving strong segmentation performance in a variety of challenging cases.

update after rebuttal

My concerns have been fully addressed.

Questions for Authors

When connecting multiple polygons to form a single loop, is there any risk of self-intersection or artifacts that degrade IoU? How do you handle such scenarios?
How well do higher-order EFDs cope with extreme non-convex boundaries or shapes containing holes? Could second or higher orders be beneficial in such situations?

Claims and Evidence

The paper presents quantitative comparisons with prior polygon-based methods (e.g., DeepSnake, PolarMask, DANCE, BoundaryFormer) and some mask-based methods (e.g., Mask R-CNN, Mask DINO). The reported AP values show an improvement in boundary precision.

While the results on COCO appear convincing, the experiments are predominantly limited to this single dataset. Additional evaluation on other datasets would further strengthen the claims regarding robustness to multi-polygon or highly complex shapes.

Methods and Evaluation Criteria

The approach is methodologically sound; however, broader testing on additional datasets or tasks would better illustrate its general applicability.

Theoretical Claims

There are no formal proofs of new theorems.

Experimental Designs or Analyses

The main experiments are on COCO, comparing the proposed method to polygon-based approaches and selected mask-based baselines, focusing on standard AP metrics.

Supplementary Material

Relation to Broader Scientific Literature

The paper compares to standard polygon-based methods like DeepSnake, PolarMask, PolyTransform, BoundaryFormer, as well as to conventional mask-based methods like Mask R-CNN and Mask DINO.

Essential References Not Discussed

One key contribution is a Multiple Polygon Connection strategy (via minimum spanning tree) that merges all polygons of an instance into a single closed contour.

Research on multi-polygon (PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph [https://dl.acm.org/doi/abs/10.1145/3637528.3671738]) and work where multiple surfaces are combined into a closed 3D manifold (PolyhedronNet: Representation Learning for Polyhedra with Surface-attributed Graph) are relevant.

Other Strengths and Weaknesses

Strengths

Introducing novel elliptical Fourier descriptors to segment polygons addresses the vertex-matching issue via the frequency-domain phase.
On COCO, it substantially outperforms prior polygon-based methods and approaches certain mask-based baselines.

Weaknesses

Experiments are primarily on COCO; more challenging or domain-specific datasets are not tested.
While the paper shows first-order EFD is most beneficial, there is limited discussion on whether certain shapes might need higher orders.
The two-stage design, along with multi-scale feature fusion, may increase computational load; speed comparisons to baseline mask-based methods are not thoroughly explored.

Other Comments or Suggestions

Evaluating on additional datasets would better demonstrate the method’s robustness.
Extending the discussion around the MST-based multi-polygon connection could strengthen the justification and highlight potential edge cases.
Including more analyses of the performance/accuracy trade-off with higher-order Fourier terms might clarify when they can be helpful.

Author Response

2025-04-01

Response to Reviewer GvJY

Thank you very much for your valuable and constructive feedback. We truly appreciate your time and effort. Below, we provide point-by-point responses to the concerns and suggestions you raised.

Response to More Dataset

We have conducted additional experiments on the SBD and Cityscapes datasets. The results are as follows:

SBD Results:

Method	Venue	AP$_{vol}$	AP$_{50}$
E2EC[1]	CVPR 2022	59.2	65.8
PolySnake[2]	TCSVT 2024	60.0	66.8
Ours (R50)	-	70.2	78.2

Cityscapes Results (Val):

Method	Venue	AP (Val)
E2EC[1]	CVPR 2022	39.0
PolySnake[2]	TCSVT 2024	40.2
Ours (R50)	-	43.4

Our method achieves the highest AP among the polygon-based instance segmentation methods compared on both datasets. Notably, PolySnake, which Papers with Code lists as the best contour prediction model, is surpassed by our approach by 10.2 AP$_{vol}$ on SBD (70.2 vs. 60.0) and by 3.2 AP on Cityscapes (43.4 vs. 40.2).

Response to EFD with Higher Order

We evaluated the impact of different EFD orders on SBD and Cityscapes:

Dataset	EFD Order	AP$_{vol}$ (SBD) / AP (Cityscapes Val)
SBD	1	70.2
SBD	2	66.2
SBD	4	58.6
Cityscapes	1	43.4
Cityscapes	4	32.6

We observe that the first-order EFD consistently delivers the best performance. We hypothesize that, within the current model architecture, higher-order EFDs introduce optimization challenges, resulting in unstable point sampling during the second stage. In contrast, the first-order EFD acts as a stable and learnable geometric representation, analogous to rotated bounding boxes. Although higher-order EFDs are capable of capturing more detailed shape distributions, they tend to be unstable, which adversely affects the performance of the downstream polygon decoder. Below, we present samples of the fitting results using first- and fourth-order EFDs.

[Figure: sample fitting results using first- and fourth-order EFDs]
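For readers less familiar with the descriptor orders compared in this response, here is a minimal sketch of the classical Kuhl-Giardina elliptic Fourier descriptors of a closed polygon, truncated at a chosen order. It is not the EFDTR implementation: in the paper the coefficients are predicted by a decoder rather than computed from a contour, and the function names `elliptic_fourier` and `reconstruct` are illustrative assumptions.

```python
import math

def elliptic_fourier(points, order):
    """Return (a0, c0, coeffs) for a closed polygon [(x, y), ...].

    coeffs[n - 1] holds (a_n, b_n, c_n, d_n) for harmonic n (Kuhl-Giardina form).
    """
    k = len(points)
    dx = [points[(i + 1) % k][0] - points[i][0] for i in range(k)]
    dy = [points[(i + 1) % k][1] - points[i][1] for i in range(k)]
    dt = [math.hypot(u, v) for u, v in zip(dx, dy)]
    t = [0.0]  # cumulative arc length at each vertex
    for step in dt:
        t.append(t[-1] + step)
    perimeter = t[-1]

    coeffs = []
    for n in range(1, order + 1):
        w = 2.0 * n * math.pi / perimeter
        scale = perimeter / (2.0 * n * n * math.pi ** 2)
        a = b = c = d = 0.0
        for i in range(k):
            if dt[i] == 0.0:  # skip repeated vertices
                continue
            dcos = math.cos(w * t[i + 1]) - math.cos(w * t[i])
            dsin = math.sin(w * t[i + 1]) - math.sin(w * t[i])
            a += dx[i] / dt[i] * dcos
            b += dx[i] / dt[i] * dsin
            c += dy[i] / dt[i] * dcos
            d += dy[i] / dt[i] * dsin
        coeffs.append((scale * a, scale * b, scale * c, scale * d))

    # DC terms: arc-length-weighted mean of the piecewise-linear contour.
    a0 = sum((points[i][0] + points[(i + 1) % k][0]) / 2.0 * dt[i] for i in range(k)) / perimeter
    c0 = sum((points[i][1] + points[(i + 1) % k][1]) / 2.0 * dt[i] for i in range(k)) / perimeter
    return a0, c0, coeffs

def reconstruct(a0, c0, coeffs, num_points):
    """Sample the truncated series at num_points uniformly spaced phases."""
    out = []
    for j in range(num_points):
        phase = 2.0 * math.pi * j / num_points
        x, y = a0, c0
        for n, (a, b, c, d) in enumerate(coeffs, start=1):
            x += a * math.cos(n * phase) + b * math.sin(n * phase)
            y += c * math.cos(n * phase) + d * math.sin(n * phase)
        out.append((x, y))
    return out
```

With `order=1` the reconstruction is an ellipse; each additional order adds four coefficients that the first stage would have to regress.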

Response to Self-intersection and Degraded IoU

On the COCO dataset, we apply a minimum spanning tree (MST) algorithm to connect multiple polygons into a single contour. Instances annotated with multiple polygons make up 9.71% of all COCO instances. We sample up to four polygons per instance, which fully covers 99.67% of all instances (90.28% with a single polygon plus 9.39% with two to four polygons). With a maximum of four polygons, the likelihood of self-intersection is extremely low. Given the small proportion of instances with more than four polygons (about 0.3%), we believe the impact on training is minimal.

Polygons per instance	Instances	Proportion
1	767315	90.28%
[2, 5)	79842	9.39%
[5, 10)	2589	0.30%
[10, ∞)	203	0.02%
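The MST-based connection described in this response can be sketched as follows. This is a hedged illustration only: the rebuttal does not state how inter-polygon distance is measured or how the bridges are stitched into one loop, so the closest-vertex distance and the names `closest_pair` and `polygon_mst` are assumptions.

```python
import math

def closest_pair(poly_a, poly_b):
    """Return (distance, i, j) for the closest vertex pair between two polygons."""
    best = (math.inf, -1, -1)
    for i, (ax, ay) in enumerate(poly_a):
        for j, (bx, by) in enumerate(poly_b):
            dist = math.hypot(ax - bx, ay - by)
            if dist < best[0]:
                best = (dist, i, j)
    return best

def polygon_mst(polygons):
    """Prim's algorithm over polygons.

    Returns a list of bridges (p, q, i, j): vertex i of polygon p is joined
    to vertex j of polygon q. Assumed edge weight: closest-vertex distance.
    """
    n = len(polygons)
    if n < 2:
        return []
    in_tree = {0}
    bridges = []
    while len(in_tree) < n:
        best = None
        for p in in_tree:
            for q in range(n):
                if q in in_tree:
                    continue
                dist, i, j = closest_pair(polygons[p], polygons[q])
                if best is None or dist < best[0]:
                    best = (dist, p, q, i, j)
        _, p, q, i, j = best
        in_tree.add(q)
        bridges.append((p, q, i, j))
    return bridges
```

An instance with k polygons yields k - 1 bridges, so capping k at four keeps the number of added connecting segments small. Stitching the bridges into a single closed vertex sequence is omitted here.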

Response to Holes and EFD Order

In COCO, instance masks are annotated using simple polygons without regard to orientation, and thus do not include holes. However, the SBD dataset does contain instances with holes. As discussed in the Response to EFD with Higher Order, we observed that higher-order EFDs do not lead to better performance. Visualization shows that the 1st-order model captures the outer contour more robustly.

Although some hole-containing instances exist in SBD, they represent a small portion of the dataset. Our analysis suggests that the model’s strength lies in accurate outer contour detection rather than specific handling of holes. This could be a limitation of the current model architecture, which we aim to address in future work.

[1] Zhang T, Wei S, Ji S. E2ec: An end-to-end contour-based method for high-quality high-speed instance segmentation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 4443-4452.

[2] Feng H, Zhou K, Zhou W, et al. Recurrent generic contour-based instance segmentation with progressive learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

Reviewer Comment

2025-04-02

Most of my concerns have been addressed. I recommend authors include a discussion on works mentioned in "Essential References Not Discussed."

Author Comment

2025-04-02

We greatly appreciate the two papers you recommended.

In the paper “PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph,” the authors propose a two-hop method to build a five-tuple heterogeneous geometric representation. PolygonGNN is validated through extensive experiments on both synthetic and real-world datasets, showcasing its robust performance across a variety of scenarios. Notably, the heterogeneous spanning tree sampling strategy in PolygonGNN bears similarities to our approach of using a minimum spanning tree (MST) to connect multiple polygons. This data augmentation technique has inspired us and provides valuable insights into improving performance.

The second paper, “PolyhedronNet: Representation Learning for Polyhedra with Surface-attributed Graph,” introduces the innovative Surface-Attributed Graph (SAG) and employs local rigid representations for accurate polyhedron representation and reconstruction. This 3D data representation work serves as an important reference for our future extensions, particularly when considering the application of surface-based approaches in geometric modeling.

Both papers suggested by the reviewer have greatly contributed to enhancing the depth of our work. We promise that we will include discussions of these works and cite them in our paper.

Our Learnable Elliptical Fourier Descriptor Transformer (EFDTR) uses a two-stage structure, an EFD decoder followed by a polygon decoder, together with a Fourier-phase-based regression target assignment strategy. This approach achieves state-of-the-art results among polygon-based methods across multiple datasets. On an NVIDIA RTX 3090, our method runs at 22.26 ± 0.30 FPS (see our response to Reviewer jXM9).

Finally, we kindly request you to consider revising your evaluation score if our response has addressed your concerns. Once again, we sincerely thank you for your time and efforts in reviewing our work. Your feedback has been instrumental in improving the quality of our paper.

Review

Rating: 4 · 2025-03-13

The paper presents EFDTR as an innovative solution for instance segmentation, combining the strengths of polygon-based representations with advanced deep learning techniques. The proposed framework not only enhances segmentation accuracy but also paves the way for future advancements in contour learning and applications in computer vision.

update after rebuttal

The rebuttal addressed my concerns, and I updated my score accordingly.

Questions for Authors

Claims and Evidence

Most claims are supported.

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Designs or Analyses

Yes

Supplementary Material

The authors provide no supplementary material.

Relation to Broader Scientific Literature

Essential References Not Discussed

Other Strengths and Weaknesses

Pros:

The idea is interesting.
The paper is well-written and well-structured.

Cons:

Hole Case: The proposed method demonstrates suboptimal performance in scenarios where holes are present in the mask. This is illustrated by the zebra example in Figure 5, where the foot of the right zebra appears to be merged.
Details: The reviewer noticed that some details are missed where the boundary changes sharply (e.g., the connection point between the motorcycle wheel and the rod), a case that some SAM-based methods handle well. The reviewer acknowledges that this is also common for mask-based methods. Is there a trade-off between capturing sharp boundary changes and other concerns?
Reconstruction performance: For the elliptical representation, the reviewer would like to see the reconstruction error introduced by downsampling, as well as the inference speed.

Other Comments or Suggestions

Author Response

2025-04-01

Response to Reviewer jXM9

Thank you for your constructive comments and insightful suggestions, which have helped us improve the quality and clarity of our manuscript. We address your points in detail below.

Response to Hole Case

The COCO dataset primarily uses polygon annotations for instance masks, with only 1.2% of annotations using RLE. Polygon annotations in COCO do not explicitly account for holes during the annotation process, meaning that hole structures are generally not represented in the ground truth masks. Therefore, during training on the original dataset, the model has limited exposure to true hole instances.

To further investigate this, we conducted experiments on the SBD dataset, which includes explicit annotations for hole regions. Upon qualitative analysis, we found that the current 1st-order EFD formulation struggles to accurately capture hole structures. However, using higher-order EFDs—while helpful for capturing fine-grained shapes—led to a drop in overall performance. This trade-off suggests a limitation in the current design, and we consider hole representation as a promising direction for future improvement.

Response to Balance between Sharp and Other Regions

This is an excellent point. Our current approach samples points uniformly from the EFD-predicted contours in the first stage. While this ensures consistent coverage, it may under-sample sharp corners or highly curved regions. A more adaptive sampling strategy—allocating denser points around such regions—could yield more accurate reconstructions. However, this would require an additional module to learn the sampling distribution dynamically. We greatly appreciate this suggestion and plan to explore it in future work.

Response to Reconstruction Performance

To evaluate the balance between reconstruction accuracy and inference speed, we tested different vertex group settings on the SBD dataset. All experiments were conducted at an input resolution of 512×512 using an NVIDIA RTX 3090 GPU, with CUDA 11.8, PyTorch backend, and FP32 precision. The results are as follows:

Model	Vertex Group	FPS (mean ± std)	AP$_{vol}$
Ours (R50)	16	22.26 ± 0.30	67.3
Ours (R50)	8	21.29 ± 0.28	69.8
Ours (R50)	4	18.19 ± 0.16	70.2
Ours (R50)	2	13.76 ± 0.08	70.9
MaskDINO	-	6.45 ± 0.03	-
E2EC$^*$	-	36	59.2

$^*$ E2EC was tested on an NVIDIA A6000.

These results demonstrate that our method maintains a favorable trade-off between accuracy and inference speed, especially compared to pixel-based methods like MaskDINO.

Review

Rating: 4 · 2025-03-14

This paper devises a method for regressing vertex positions using Elliptic Fourier Descriptors (EFDs). Furthermore, it proposes a learnable transformer architecture to incorporate these EFDs. The transformer pipeline consists of two stages: 1) a transformer predicting EFDs to get coarse instance regions and 2) a transformer to decode the EFDs into more precise polygon instance segmentations. In order to represent multiple polygons as a single EFD, this approach connects polygons together. To determine the connectivity, this method computes distances between polygons and uses a minimum spanning tree of the resulting graph. To supervise the EFD regression, an L1 loss is used whereas a smooth-L1 loss is used to supervise the polygons. The overall approach is compared with both polygon and pixel-based methods showing improved performance compared to polygon methods and comparable performance with pixel-based methods. Key components of the method (number of decoder layers, EFD prediction order, and others) are thoroughly ablated justifying their importance.

update after rebuttal

I have read the rebuttal and it addresses my main concerns. I am maintaining my score of accept.

Questions for Authors

N/A

Claims and Evidence

The claims made are supported by sufficient evidence.

Methods and Evaluation Criteria

Yes. The authors compare to both polygon-based methods which are most similar and pixel-based methods which exhibit state of the art performance.

Theoretical Claims

N/A

Experimental Designs or Analyses

The experimental design appears sound.

Supplementary Material

Supplementary material not provided.

Relation to Broader Scientific Literature

The findings in this paper advance the state of polygon-based instance segmentation, which provides a more lightweight representation than pixel-based methods. The authors demonstrate this advancement by showing that their method achieves superior performance compared to other polygon-based methods such as SharpContour and BoundaryFormer.

Essential References Not Discussed

None that I am aware of.

Other Strengths and Weaknesses

Strengths:

This paper presents a new method for polygon-based segmentation using EFDs.
The proposed method is evaluated against multiple baselines. Results show that this method leads to improved performance over existing polygon-based methods and comparable performance against pixel-based methods.
Paper is well written and the method is easy to follow.

Weaknesses:

The main weakness would be that this method still performs worse than some pixel-based approaches. While these polygon methods might be more efficient, it seems likely that for many applications, performance would be more important and pixel-based methods would be used instead.

Other Comments or Suggestions

N/A

Author Response

2025-04-01

Response to Reviewer 3kZq

Thank you for the time and effort you invested in your detailed review. We respond below to the weakness you pointed out.

Summary and Strengths

We are pleased that you found our method well motivated, the design sound, and the paper clearly written. We're also glad that you appreciated our comprehensive ablation studies and performance improvements over existing polygon-based methods.

Weakness: Comparison to Pixel-Based Methods

"The main weakness would be that this method still performs worse than some pixel-based approaches. While these polygon methods might be more efficient, it seems likely that for many applications, performance would be more important and pixel-based methods would be used instead."

Thank you for this valuable observation. We agree that in high-accuracy applications, pixel-based methods often remain the preferred choice. However, our goal is to provide a segmentation model that outputs structured data, which is particularly useful in tasks such as bird's-eye view vector map construction and remote sensing vector map generation. We believe our method can contribute to these areas by offering a novel solution.

Additionally, we provide inference performance comparisons in our submission, which highlight the efficiency of our method. For further context and elaboration, we also encourage referring to our responses to the other reviewers.

Once again, thank you for your valuable comments and support.

Review

Rating: 2 · 2025-03-17

The paper proposes an instance segmentation method for images, based on the prediction of a polygon as a series of connected points instead of the more commonly used pixel mask. The contour of a polygon capturing an instance is decomposed with a Fourier decomposition. The authors propose a transformer-like architecture that extracts multi-resolution features from input images, predicts coefficients of the Fourier decomposition, samples the obtained contour, and refines the sampled points (polygon vertices) with several additional transformer-like layers to predict the final positions of the vertices. The approach is evaluated on the COCO dataset and compared to other mask-based and polygon-based instance segmentation baselines. A series of ablation experiments empirically justifies the choice of some important hyperparameters.

Questions for Authors

Have the authors tried to remove the EFD decoder and just use a polygon decoder with more layers?

Claims and Evidence

The paper claims to obtain the best instance segmentation results among polygon-based methods. To support this claim they compare their method to a series of polygon-based baselines (the latest are from CVPR’22).
The paper claims that their approach is a competitive alternative to pixel mask-based methods. To verify, it is compared to some mask-based methods.

Methods and Evaluation Criteria

The proposed method that decomposes the contour of the polygon using Fourier decomposition and uses a transformer-based architecture to infer the coefficients of the decomposition looks valid from a theoretical point of view. However, I suspect the method does not work well in practice. In theory, high-order decompositions are desirable because the instance contours have non-trivial topology, and low-order approximations are not able to distinguish the contour point samples with the same phase but different amplitudes. However, the ablation in Table 3 shows that the method works best with the first-order approximation (which approximates the contour with an ellipsoid). That fact raises the question of whether the decomposition is required at all. Given that, the method could have just sampled the initial points on a circle and just used the polygon decoder with more layers to infer the final positions of the vertices.

Theoretical Claims

There are no theoretical claims in the paper.

Experimental Designs or Analyses

Overall, the experiments are designed well; the ablations make sense. However, the authors only used relatively old ResNet-50/101 backbones for their method, which makes the approach significantly weaker compared to the overall state-of-the-art methods.

Supplementary Material

There are no supplementary materials.

Relation to Broader Scientific Literature

I am not sure about the subfield of polygon-based instance segmentation, but overall the authors do not include several important prior works. For example, even on the simplest ResNet-50 backbone, Co-DINO-Deformable-DETR++ [1] achieves AP of 52.1 (vs 43.6 reported for the presented method) when trained for the same number of epochs. The corresponding paper also contains multiple references to other methods that produce better results.

Essential References Not Discussed

[1] Z. Zong, G. Song, Y. Liu. DETRs with Collaborative Hybrid Assignments Training. In ICCV’23.

Other Strengths and Weaknesses

The authors try to make a case for their method by carving a niche for polygon-based instance segmentation methods and trying to push their results, but it is hard to justify the niche if the difference between the state-of-the-art and the proposed methods is that large (AP = 45.1 vs 65.9 best for [1] with ViT-L backbone). The fact that the high-order decompositions do not work well does not help the case.

Other Comments or Suggestions

I suggest the authors consider using other backbones and try to make the high-order approximations work properly.

Update after rebuttal

First of all, I would like to apologize for my initial misunderstanding of the proposed method. Indeed, the segmentation contour is projected on the approximated decomposition in a bijective manner, so I withdraw this concern and, as a result, am willing to improve my rating.

After the rebuttal, my concerns mostly stayed the same:

The proposed method does provide a way to project segmentation contours on ellipsoids (1st order approximations), which is of some value and likely helps the method, but the fact that the use of any high-order approximations lowers the performance is concerning and might indicate that the method is either severely limited in its potential or is technically not correct.
The authors operate in a subfield of the polygon-based segmentation and only choose very weak mask-based baselines for comparison. As I mentioned in the review, there are mask-based instance segmentation methods that outperform the proposed method using the same backbones by a large margin (Co-DINO-Deformable-DETR++ achieves AP of 52.1 vs 43.6 reported for the presented method on ResNet50). The gap further widens with the use of the more up-to-date backbones. I agree that a direct comparison of these classes of methods is not entirely fair, but it is hard to ignore the existence of these related methods and not diminish the importance of the presented method. At the very least, the paper should contain a baseline that takes modern superior mask-based segmentations and converts them to the polygon-based format. This seems like a much fairer competition compared to outdated mask-based methods with worse performance. If the masks can be converted to polygons without a significant loss in quality, the presented method loses its significance.

Overall, it is still hard for me to recommend acceptance due to these concerns. There are some merits in the presented work, but to me the approach seems to be too far from the current state of the art (or, at the very least, it is hard to properly assess that).

Author Response

2025-04-01

Response to Reviewer tj3E

We sincerely appreciate your time and effort in reviewing our paper and are glad to provide detailed responses to your insightful questions and suggestions.

Clarification on Method

While higher-order EFDs can offer finer contour approximations, the concern regarding "same phase but different amplitudes" may stem from a misunderstanding of the concept of phase in the EFD frequency domain versus angular coordinates in polar representations.

In our approach, Elliptic Fourier Descriptors establish a bijective mapping between each contour point $p(\theta)$ and a unique phase value $\theta \in [0, 2\pi)$. This bijection guarantees a one-to-one correspondence between contour points and their associated phase values in the frequency domain.

Our point regression strategy exploits this property to avoid ambiguities present in Cartesian and polar systems. For example, in polar coordinates, a ray from the centroid may intersect the contour at multiple points sharing the same angle but differing in radius. In contrast, under EFD, such points are associated with distinct phase values, as illustrated in the figure below.

[Figure: polar-angle ambiguity versus unique EFD phase values for contour points]
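The difference between polar angle and contour phase can be illustrated with a small sketch. It assumes the phase of a contour point is its normalised arc length scaled to $[0, 2\pi)$, as in the classical EFD parameterisation; the exact target assignment used by EFDTR is not spelled out in this thread, and the helper names and the U-shaped test contour are hypothetical.

```python
import math

def arc_length_phases(contour):
    """Phase of each vertex: normalised arc length scaled to [0, 2*pi)."""
    cumulative = [0.0]
    for i, (x0, y0) in enumerate(contour):
        x1, y1 = contour[(i + 1) % len(contour)]
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    perimeter = cumulative[-1]
    return [2.0 * math.pi * s / perimeter for s in cumulative[:-1]]

def polar_angles(contour):
    """Angle of each vertex around the vertex centroid, in [0, 2*pi)."""
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    return [math.atan2(y - cy, x - cx) % (2.0 * math.pi) for x, y in contour]

def visit_order(keys):
    """Vertex indices sorted by the given key values."""
    return sorted(range(len(keys)), key=keys.__getitem__)

# A concave, U-shaped contour listed counter-clockwise (hypothetical example).
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
phase_order = visit_order(arc_length_phases(u_shape))
polar_order = visit_order(polar_angles(u_shape))
```

For this contour, sorting by phase returns the vertices in their original contour order, whereas sorting by polar angle interleaves the vertices of the notch with those of the outer boundary. That scrambling is the kind of ambiguity a phase-based assignment avoids.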

Response to Vanilla Ellipse

To further validate the effectiveness of our two-stage architecture and the role of the first-stage EFD prediction, we conducted additional experiments on the SBD dataset in response to the reviewer’s concerns:

Type 1: Identical to the original model, except the first stage predicts a naive inscribed ellipse.
Type 2: Based on Type 1, the number of layers in the first stage is reduced to 2, while the second-stage polygon decoder is deepened to 6 layers.

Model	1st Stage Layers	2nd Stage Layers	AP$_{vol}$
Original	6	3	70.2
Type 1	6	3	59.5
Type 2	2	6	62.7

The results demonstrate that the EFD-based first-stage prediction significantly outperforms the naive ellipse initialization. Moreover, increasing the depth of the second-stage decoder alone does not compensate for the lack of EFD-based guidance.

We attribute the superior performance of our design to the following:

Although both are ellipses, the 1st-order EFD behaves more like a PCA-based initialization, better capturing the object's principal orientation and shape distribution.
EFD provides a more principled framework for assigning regression targets than heuristic elliptical approximations.
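To make the PCA analogy concrete, the sketch below reads the semi-axes and orientation of the ellipse off the four first-order coefficients. It treats them as a 2×2 matrix $M = \begin{pmatrix} a_1 & b_1 \\ c_1 & d_1 \end{pmatrix}$ that maps the unit circle to the ellipse, so the singular values of $M$ are the semi-axes and the principal direction of $M M^\top$ is the major-axis orientation. The function name `first_order_ellipse` is illustrative, and this is not claimed to be how EFDTR parameterises its first stage.

```python
import math

def first_order_ellipse(a, b, c, d):
    """Return (semi_major, semi_minor, orientation) of the first-order EFD ellipse.

    The ellipse is x = a*cos(t) + b*sin(t), y = c*cos(t) + d*sin(t).
    orientation is the major-axis angle in radians, in (-pi/2, pi/2].
    """
    energy = a * a + b * b + c * c + d * d   # sum of squared singular values
    det = a * d - b * c                      # signed product of singular values
    spread = math.sqrt(max(energy * energy - 4.0 * det * det, 0.0))
    semi_major = math.sqrt((energy + spread) / 2.0)
    semi_minor = math.sqrt(max(energy - spread, 0.0) / 2.0)
    # Principal direction of the symmetric matrix M * M^T.
    orientation = 0.5 * math.atan2(2.0 * (a * c + b * d),
                                   (a * a + b * b) - (c * c + d * d))
    return semi_major, semi_minor, orientation
```

Unlike an axis-aligned inscribed ellipse, these four coefficients encode a free orientation and an independent starting phase, which is one plausible reading of why the learned first-order EFD is the stronger initialization in the table above.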

Response to Larger Backbone

You raised an important point regarding the potential benefit of a stronger backbone in enhancing the learning of high-order EFDs. We conducted experiments with a Swin-L backbone using a 1× training schedule on the COCO val2017 set:

EFD Order	Schedule	Backbone	mAP (val2017)
1	1×	Swin-L	44.1
4	1×	Swin-L	33.2
1	1×	ResNet-50	40.6

Interestingly, even with a more powerful backbone, the 1st-order EFD still achieves the highest mAP. We believe that, under the current architecture, high-order EFD parameters are difficult to optimize effectively, introducing instability in the second-stage point sampling. Conversely, the 1st-order EFD serves as a robust and easily learnable target, similar to rotated object detection, which helps stabilize the overall training.

Response to Model Design Question

"Have the authors tried to remove the EFD decoder and just use a polygon decoder with more layers?"

This experiment is reported in the Response to Vanilla Ellipse section above.

Additional Remarks

Our method is polygon-based, where both the output and the supervision signals are sequences of points. Within the domain of polygonal contour prediction, our method surpasses all prior work. Although in terms of AP, pixel-based methods outperform ours, we emphasize that our vector-based formulation offers a unified and end-to-end framework that holds practical value in applications such as vectorized HD map construction and remote sensing vectorization. We believe our contribution provides an effective approach for tasks requiring structured, vectorized outputs.

For further comparisons on additional datasets and inference efficiency, please refer to our responses to other reviewers. Once again, we sincerely thank you for your valuable feedback.

Final Decision: Accept (poster)

2025-05-01

The submission got 1 negative and 3 positive recommendations eventually. The reviewers were mainly concerned about the performance on real 3D data, evaluations, and technical details. The authors did a good job in their rebuttal and addressed most of these concerns. The reviewers did not actively engage in a discussion although the AC tried to initiate one. The AC read through the submission, all reviews, rebuttals, and the confidential comments to the AC. The AC agreed with most of the reviewers and values the idea. Per this, the AC made a decision of acceptance. The decision was also approved by the senior AC.