Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
Practical Solution for Estimating Object Orientation in Images
Abstract
Reviews and Discussion
This paper proposes Orient Anything, a foundation model for predicting object orientation from a single image in a zero-shot manner. The key contribution is curating a large-scale dataset for the orientation estimation task, rendered from Objaverse and comprising 2M images. The authors propose to deal with the ambiguity of low-dimensional poses by using probability distributions over object azimuth, elevation/polar angle, and camera rotation. The proposed method is evaluated on both synthetic images (on orientation error) and real images (on recognition tasks).
Questions for Authors
- Why not include quantitative results on real images? The authors claim that there is no data for performing such an evaluation. However, I don't think this is true. There are plenty of object pose estimation benchmarks that can be used for evaluation. Besides, the object 6D pose estimation works should also be discussed in detail -- how the orientation estimation task differs, and why using no reference mesh/image is important.
- Object orientation is not a well-defined task. The definition of orientation is closely tied to the semantic meaning of the specific object categories, i.e., the canonical pose space, where objects are aligned as forward-facing. Thus, for novel categories, the orientation of objects is not well-defined, and it is necessary to study how the model generalizes to categories unseen during training, as the categories in Objaverse are not guaranteed to cover every category in real life.
- The authors should also discuss why a model trained on synthetic data can generalize to real data.
- Besides, the authors should study the quality of the curated dataset and the influence of training data scale.
Claims and Evidence
The authors claim this paper presents the first foundation model for object orientation estimation. While it seems interesting, I am worried about the evaluation protocol on real data, where the orientation error is not studied.
Methods and Evaluation Criteria
The proposed methods include two parts: 1) data curation and 2) training an orientation prediction model. The data curation method includes filtering non-axis-aligned objects, identifying ambiguous symmetry objects, and rendering synthetic data. The orientation prediction model is based on a ViT encoder with several decoders for predicting the probabilistic distribution of rotation angles.
The trained model is evaluated on both real and synthetic data. On synthetic data, it is evaluated mainly with orientation error; while on real data, due to the lack of pose annotations, it is evaluated with higher-level recognition tasks.
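For concreteness, the described architecture could be sketched roughly as follows (an illustrative reconstruction from the paper's description, not the authors' code; the backbone choice, hidden sizes, and bin counts are my assumptions):

```python
# Illustrative sketch only: a ViT encoder followed by small decoders that output
# probability distributions over discretized azimuth, polar, and in-plane
# rotation angles, plus a front-face confidence score.
import torch.nn as nn

class AngleHead(nn.Module):
    def __init__(self, feat_dim=768, hidden=512, num_bins=360):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, num_bins))

    def forward(self, feat):
        # Softmax over bins yields the predicted angle distribution.
        return self.mlp(feat).softmax(dim=-1)

class OrientationModel(nn.Module):
    def __init__(self, vit_encoder, feat_dim=768):
        super().__init__()
        self.encoder = vit_encoder                    # e.g. a DINOv2 ViT
        self.azimuth = AngleHead(feat_dim, num_bins=360)
        self.polar = AngleHead(feat_dim, num_bins=180)
        self.rotation = AngleHead(feat_dim, num_bins=360)
        self.confidence = nn.Linear(feat_dim, 1)      # does the object have a front face?

    def forward(self, image):
        feat = self.encoder(image)                    # (B, feat_dim) global image feature
        return (self.azimuth(feat), self.polar(feat), self.rotation(feat),
                self.confidence(feat).sigmoid())
```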
Theoretical Claims
N/A
Experimental Designs or Analyses
I am not fully convinced by the evaluations, especially the lack of results on orientation error on real images. Showing recognition task performance alone is not enough to convince me that this is a foundation model. I am also not sure how a method trained on synthetic data with white backgrounds can generalize to real and even in-the-wild images.
Besides, one of the biggest contributions of this paper is the curated training data. However, there is no study of the dataset quality, not even a visualization of what the dataset looks like. It would also be better if the authors performed an experiment using different ratios of the curated dataset for training and explored whether there is any scaling effect.
Supplementary Material
Yes, the supplementary material includes category-level quantitative results and extra visualizations and examples.
Relation to Existing Literature
The proposed task can be useful for object 6D pose estimation and may have the potential to benefit robotic applications.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
Please see comments above.
Other Comments or Suggestions
N/A
Thank you for your valuable review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).
W1.1 & Q1.1: Quantitative orientation error on real images.
Currently, the 6 benchmarks in Table 2 are evaluated on real-world data with quantitative 3D orientation error.
Moreover, we have also tested on the latest and largest object 3D orientation benchmark, ImageNet3D [1], which covers 200 object categories. The Acc@30° results are as follows:
| Setting | Model | Avg. | Electronics | Furniture | Household | Music | Sports | Vehicles | Work |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | ImageNet3D-ResNet50 | 37.1 | 30.1 | 35.6 | 28.1 | 11.8 | 51.7 | 36.7 | 40.9 |
| Zero-shot | Orient Anything-B | 48.5 | 61.0 | 66.8 | 37.9 | 27.3 | 25.6 | 70.8 | 33.4 |
| Fine-tuning | ImageNet3D-ResNet50 | 53.6 | 49.2 | 52.4 | 45.8 | 26.0 | 65.2 | 56.5 | 58.5 |
| Fine-tuning | ImageNet3D-DINOv2-B | 64.0 | 75.3 | 47.9 | 32.9 | 23.5 | 74.7 | 38.1 | 64.0 |
| Fine-tuning | Orient Anything-B | 71.3 | 77.6 | 89.7 | 64.4 | 54.4 | 47.6 | 87.4 | 61.2 |
Note that we couldn't find detailed definitions for the major categories such as Electronics, Furniture, and Household in ImageNet3D, so we used GPT to map its 200 categories into these 7 broader groups. Therefore, the comparison results for each general category may vary, and the average score provides a more meaningful comparison.
[1] Ma W, Zhang G, Liu Q, et al. ImageNet3D: Towards general-purpose object-level 3D understanding. NeurIPS 2024.
W1.2: Generalization to real and even in-the-wild images.
We provide numerous visualizations on real and in-the-wild images in both our main text and supplementary materials. Furthermore, in response to your W1.1 & Q1.1, we present the results on the current largest 3D orientation estimation benchmark, ImageNet3D, which also demonstrates strong generalization to these real-world scenarios.
W2.1 & Q4.1: Study on the dataset quality.
In https://anonymous.4open.science/r/visualization-B728/Training_Samples/, we provide numerous visualization examples from our curated training dataset, showcasing both high-quality images and accurate orientation annotations.
W2.2 & Q4.2: Influence of training data.
We train the ViT-B version of the model using different ratios of the data. The results are below:
| Data Ratio | COCO(Acc) | ImageNet3D(Acc@30°) |
|---|---|---|
| 25% | 63.55 | 45.47 |
| 50% | 65.47 | 44.08 |
| 75% | 66.82 | 47.65 |
| 100% | 69.85 | 48.52 |
Q1.2: Difference and advantage over object 6D pose estimation.
Difference: Traditional 6D pose estimation methods focus on relative orientation with respect to a reference frame or template 3D model, while Orient Anything focuses on semantic orientation (e.g., the semantic “front face” of an object) without any reference. Therefore, previous benchmarks for object pose estimation are not suitable for our task.
Importance of no reference mesh/image: Monocular images are the most accessible and widely used form of visual input, and in many scenarios, references for the desired object are often unavailable. By not relying on reference meshes or images, Orient Anything enables broader applications, such as solving spatial reasoning questions and evaluating whether the generated image adheres to the desired spatial relationships, as discussed in Section 7.
Q2: Generalization to unseen categories.
In response to your W1.1 & Q1.1, we further provide evaluation results on the current largest single-view orientation estimation benchmark, ImageNet3D, which significantly surpasses existing methods and demonstrates strong generalization to real images and various categories.
Additionally, in response to Reviewer ev5K’s W1 & Q1, we discussed how to further scale the annotated data and expand the covered categories through synthetic 3D assets and a voting-based annotation strategy.
Q3: Why a model trained on synthetic data can generalize to real data.
The synthetic-to-real generalization ability is mainly obtained through the task-agnostic pre-trained model, which is trained on massive amounts of real images. A similar idea has been discussed and validated in Marigold [2] and Depth Anything V2 [3]. For further discussion, please refer to the response to Reviewer dtES’s W3.
[2] Ke B, Obukhov A, Huang S, et al. Repurposing diffusion-based image generators for monocular depth estimation. CVPR 2024.
[3] Yang L, Kang B, Huang Z, et al. Depth Anything V2. NeurIPS 2024.
Thanks for the detailed rebuttal and extra evaluation results. The rebuttal has addressed all of my questions, and I have updated my score.
We sincerely appreciate your kind support! In the final revision, we will further enhance the paper by incorporating the additional experimental results and the valuable insights from the reviews. Thank you again!
The paper introduces Orient Anything, a foundational model designed for zero-shot estimation of object orientation from monocular images. Due to the scarcity of orientation annotations for open-world objects, the authors develop an automated 3D object orientation annotation pipeline that effectively utilizes the extensive resources of 3D models to create a diverse dataset of 2 million images with precise orientation labels. To enhance training stability and improve generalization, the authors introduce a robust training objective that models the 3D orientation as a probability distribution. Additionally, they propose several strategies to improve synthetic-to-real transfer, achieving state-of-the-art accuracy in orientation estimation for both rendered and real images. Experimental results demonstrate the superiority of the proposed method and highlight the significant potential of Orient Anything for high-level applications, such as enhancing spatial understanding and scoring spatial generation.
Questions for Authors
No more questions.
Claims and Evidence
The claims made in the submission are well supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed method holds significant value for a variety of applications, including understanding spatial information from images, verifying the spatial accuracy of generated images, and more.
My main concern is the weak novelty of the method itself. Most of the designs are common in computer vision.
Nevertheless, I think the contribution of this article in terms of dataset and foundation model is still meaningful.
Theoretical Claims
I have checked the correctness of Equations (1-4) in Section 5.1 and have no issues with them.
Experimental Designs or Analyses
The experimental design and analysis are generally sound. Tables 2 and 3 show the superiority of the proposed method on both in-domain and out-of-domain datasets. Tables 4 and 5 fully verify the key designs of the proposed method.
Besides, I still have some questions:
- In figures 13 and 14, it appears that the input spatial context provided for 'Orient Anything+LLM' already contains the answer to the question. This raises a question about whether it is valid to assess the LLM's response in this scenario. Does the improvement of 'Orient Anything+LLM' shown in Table 1 primarily rely on the accuracy of Orient Anything itself?
- The example shown in Figure 2 is inappropriate. The authors do not provide a clear coordinate reference for the LLM, which makes the answer ambiguous. After trying the following two questions, GPT-4o got the correct answer: Q1: Does Falcon face me? & Q2: So, in Falcon's view, where is Captain America?
Supplementary Material
I reviewed the code provided in the supplementary material.
Relation to Existing Literature
The model proposed in this paper can evaluate the spatial correctness of the results of the generative model. As discussed in "Towards Foundation Models for 3D Vision: How Close Are We?", state-of-the-art Vision-Language Models still perform poorly for spatial understanding, which is consistent with the findings of this paper.
Essential References Not Discussed
Essential References have been well discussed in the paper.
Other Strengths and Weaknesses
Strengths:
- The ablation experiments in this paper are thorough and fully verify the main designs and contributions of this paper.
- This paper is well-written and easy to follow.
- The created large dataset is of great significance to the development of the community.
Main weaknesses:
- The experiments and analysis on LLMs are biased as discussed in 'Experimental Designs Or Analyses'. The authors should provide a more comprehensive comparison and analysis. For example, increasing the complexity of the question to weaken its connection with the given spatial context.
- The use of 'Synthetic-to-Real Transferring' in the paper lacks rigor. Firstly, employing a better pre-trained model can enhance the model's performance on its own, which is not directly related to synthetic-to-real transfer. Secondly, the crop operation is a standard procedure in deep learning.
- What is the speed and cost of the method?
- Compared with the 6D pose estimation methods (e.g., FoundationPose), does the proposed method have more advantages in orientation estimation?
Other Comments or Suggestions
Can the method proposed in this paper be applied to video data? Additionally, is it capable of achieving stable orientation estimation results from video content?
Thank you for your positive review recognizing the significance of our paper and for your invaluable suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).
Q1: Improvement of "Orient Anything+LLM".
"Orient Anything + LLM" is designed to demonstrate the accuracy of Orient Anything in real-world applications and highlight the importance of orientation perception for understanding spatial relationships. Since we use a text-based LLM that cannot perceive visual content, it relies on Orient Anything and the object detection model to understand spatial relationships in images. The improvements indeed come from the accuracy of these models.
Q2: Example shown in Figure 2.
First, we conducted the test in November 2024, and subsequent version updates of GPT-4o may lead to different results.
Second, we believe the phrase "In Falcon's view (or perspective)" provides a coordinate reference, making both our question and your Q2 unambiguous. Adding a prefix question (e.g., "Does Falcon face me?") may explicitly prompt the model to transform the perspective first, which should be a natural reasoning step within a single Q&A. Breaking down a question into sub-questions naturally lowers its difficulty, so comparing direct answers with manually split versions may not be entirely fair.
W1: Increasing the complexity of the question.
When constructing Ori-Bench, we created different subsets for distinct purposes. The basic "object direction" subset directly tasks the model with recognizing the orientation of individual objects, which aligns with the capabilities of Orient Anything. The more advanced "spatial part" subset requires understanding the pose of individual objects. Finally, the "spatial relation" subset involves open-domain questions that require further reasoning to solve.
We believe these three progressive subsets offer a more comprehensive evaluation of a model's ability to address orientation-related questions. The most challenging "spatial relation" subset contains the most test samples, and improvements in this subset highlight the importance of understanding orientation when handling high-level spatial reasoning tasks.
Thank you very much for your suggestion. In future updates, we will include more challenging questions and try to distill Orient Anything's knowledge into a VLM, thereby inherently integrating object orientation understanding capabilities.
W2: Discussion about “Synthetic-to-Real Transferring”.
Several works have discussed the relationship between pre-trained models and synthetic-to-real transfer. Marigold[1] transforms Stable Diffusion into a depth diffusion model using only synthetic data. Depth Anything v2[2] provides a more systematic analysis of how different pre-trained models impact "Synthetic-to-Real Transfer."
Additionally, while the crop operation is a standard processing technique, in our case, it is specifically aligned with the goal of transferring from (complete) synthetic training objects to (often occluded) real objects.
[1] Ke B, Obukhov A, Huang S, et al. Repurposing diffusion-based image generators for monocular depth estimation. CVPR 2024.
[2] Yang L, Kang B, Huang Z, et al. Depth Anything V2. NeurIPS 2024.
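As a concrete illustration of the crop-based training augmentation discussed above, here is a hypothetical sketch using standard torchvision transforms; the crop scale range is illustrative, not our exact setting.

```python
# Sketch of crop-based training augmentation: random crops of the rendered object
# imitate the partial/occluded object crops produced by detectors or segmenters
# on real images. The scale range here is illustrative.
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),  # keep 50-100% of the rendering
    transforms.ToTensor(),
])
```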
W3: Speed and cost of Orient Anything.
We only add four very lightweight MLP prediction heads (about 1M parameters) to the standard DINOv2 encoder. The total parameter count remains nearly identical to the original DINOv2.
| Model | DINOv2 Encoder | Prediction Heads |
|---|---|---|
| Orient-Anything-ViT-S | 22.06M | 1.17M |
| Orient-Anything-ViT-B | 86.58M | 1.51M |
| Orient-Anything-ViT-L | 304.37M | 1.74M |
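For reference, below is a quick sanity check showing that a few small MLP heads stay around ~1M parameters; the hidden width and bin counts are illustrative guesses rather than our released configuration.

```python
# Rough parameter-count check for lightweight MLP heads on top of a DINOv2 ViT-B
# feature (768-d). Hidden width and bin counts are illustrative only.
import torch.nn as nn

def make_head(feat_dim=768, hidden=256, num_bins=360):
    return nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, num_bins))

heads = {
    "azimuth": make_head(num_bins=360),
    "polar": make_head(num_bins=180),
    "rotation": make_head(num_bins=360),
    "confidence": make_head(num_bins=1),
}
total = sum(p.numel() for h in heads.values() for p in h.parameters())
print(f"total head parameters: {total / 1e6:.2f}M")   # about 1.0M with these sizes
```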
W4: Difference and Advantage over 6D pose estimation method (e.g., FoundationPose).
Difference: Traditional 6D pose estimation methods focus on relative orientation to a reference frame or template 3D model, while Orient Anything focuses on semantic orientation (e.g., the semantic “front face” of an object) without any reference.
Advantage: Orient Anything does not require multi-view reference images or a ground-truth 3D model during inference. The predicted orientation is inherently linked to the object's semantic front, enabling broader applications, such as enhancing spatial understanding in VLMs and image generation.
Other: Extension to video.
Please refer to the response to Reviewer MxHL’s Q4.
Thanks for the authors' rebuttal, which addressed most of my questions. Here, I still have some concerns:
- The so-called contribution of 'Synthetic-to-Real Transferring' is over-claimed! As discussed by the authors, this idea has already been verified in previous works.
- As shown in 'W2.2 & Q4.2: Influence of training data' of Reviewer 83jQ, the performance on ImageNet3D improves slightly as the amount of training data increases. This makes me worry about the room for improvement of the model's zero-shot performance.
Thanks for your response and further comments!
Concern 1: Over-claimed 'Synthetic-to-Real Transferring'.
Our strategy for “Synthetic-to-Real Transferring” involves two key components:
- Model initialization to inherit real-world knowledge, and
- Data augmentation to narrow the domain gap.
Regarding model initialization, we fully acknowledge that this idea has been explored in prior work, as stated in the second paragraph of Section 5.2. Besides, we explicitly clarify that we are “evaluating this idea in our orientation estimation tasks” in line 298. We sincerely appreciate your thoughtful reminder and will revise the description of this aspect in the final revision to ensure our claims are well-calibrated and clearly clarified.
As for data augmentation, our crop-based training and segment-based testing augmentations are carefully designed to match the specific distributional differences between rendered and real-world images. These augmentations are tightly integrated with our methods and data.
Concern 2: Slight improvement on ImageNet3D with data scaling.
This observation is largely due to differences in how object orientation is defined in our dataset versus ImageNet3D for certain object categories. For example, objects like tables or skateboards—which exhibit front-back symmetry—are treated differently across datasets. ImageNet3D assumes these objects are always viewed from a canonical “front” (as described in Section A.1 of their paper), whereas in our dataset, they are annotated as having no meaningful orientation using symmetry-based criteria. This definitional mismatch limits the observed improvement when scaling up our training data for evaluation on ImageNet3D.
However, as shown in our response to W1.1 & Q1.1 from Reviewer 83jQ, when we fine-tune Orient Anything directly on ImageNet3D’s training set, the model quickly adapts to its label definitions and significantly outperforms ImageNet3D-DINOv2-B. This result demonstrates both the transferability and adaptability of our method, indicating its potential as a foundational model for orientation estimation tasks.
Once again, we sincerely appreciate your feedback and will clarify these points in the final version of the paper.
This paper proposes Orient Anything, a method that obtains orientation through 3D assets and distilled VLM annotation. Although this paper is somewhat overclaimed, it is pioneering.
Questions for Authors
- Does Orient Anything generalize to the position (e.g., in a corner) and size (too large or too small) of the object in the image?
- Many items do not have a clear "front side", such as apples, footballs, light bulbs, etc. How should the orientation of such objects be defined?
Claims and Evidence
Yes
Methods and Evaluation Criteria
The paper is meaningful; most previous academic research has focused on location or spatial relationships, especially for robotic tasks.
But I think this paper overclaims: the authors only render images from 80K 3D objects to generate orientations, which is far fewer than the sample size used in other xxx-Anything works such as Segment Anything and Depth Anything, and even less than the COCO dataset. I am skeptical of its generalizability. In addition, the zero-shot performance in Table 3 appears not to be ideal.
Theoretical Claims
N.A.
Experimental Designs or Analyses
Yes
Supplementary Material
Yes
Relation to Existing Literature
The concept of orientation comes from object pose. This article directly obtains the orientation in the camera coordinate system from 2D images and does not depend on any template.
Essential References Not Discussed
None
Other Strengths and Weaknesses
This paper has many experiments and visualizations, and the combination of the model and the LLM has impressed me with its ability to enhance orientation understanding.
Other Comments or Suggestions
Recently, SoFar used a similar approach for orientation learning and understanding, and was filtered and trained on a larger dataset (the full set of Objaverse 1.0). I suggest that the authors use similar methods and data to scale up Orient Anything.
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Thank you for your positive review recognizing the significance of our paper and for your invaluable suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).
W1&W2: Scaling up Orient Anything.
Thank you for your suggestion. The SoFar dataset is really helpful, and we will incorporate it to further scale up.
In fact, we are also actively working on expanding our training data to the scale of 10+ million to further improve Orient Anything. Recent breakthroughs in open-source 3D asset generation models (e.g., Hunyuan-3D [1], TRELLIS [2]) have demonstrated impressive results, with outputs now reaching sufficiently high quality. On the other hand, as discussed in Section 7.3 of our manuscript, the initial Orient Anything model can serve as an annotator to robustly label the orientation of 3D assets through multi-view voting. This voting mechanism helps reduce prediction errors, achieving robustness beyond the model's original capabilities.
The above two observations show the potential to freely increase annotated 3D assets to any scale. In https://anonymous.4open.science/r/visualization-B728/Synthesized_Assets_and_Voting_Annotation/, we showcase examples of synthesized 3D assets and the voting-based orientation annotations using Orient Anything. We hope this further observation can address your concerns regarding scaling up.
[1] Zhao Z, Lai Z, Lin Q, et al. Hunyuan3D 2.0: Scaling diffusion models for high-resolution textured 3D assets generation. arXiv preprint arXiv:2501.12202, 2025.
[2] Xiang J, Lv Z, Xu S, et al. Structured 3D latents for scalable and versatile 3D generation. arXiv preprint arXiv:2412.01506, 2024.
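As a minimal sketch of the voting idea described above, assuming per-view azimuth predictions have already been mapped back into the asset's canonical frame; the circular-mean aggregation is one reasonable choice rather than our exact rule.

```python
# Sketch: aggregate azimuth predictions from multiple rendered views of the same
# asset with a circular mean, which is robust to the 0/360 wrap-around.
import numpy as np

def vote_azimuth(canonical_preds_deg):
    rad = np.deg2rad(np.asarray(canonical_preds_deg, dtype=float))
    mean = np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())
    return float(np.rad2deg(mean) % 360)

# Three views that agree around the 0/360 boundary vote to ~1.7 degrees,
# whereas a naive arithmetic mean would give ~121.7 degrees.
print(vote_azimuth([350.0, 10.0, 5.0]))
```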
Q1: Generalization to object position and size.
In https://anonymous.4open.science/r/visualization-B728/Corner_TooBig_TooSmall_Case/, we provide visualizations of objects that are too large, too small, or located in a corner. Overall, our model demonstrates generalization to these scenarios.
Q2: Items do not have a clear “front side”.
During annotation, 3D assets lacking a clear "front side" are explicitly labeled as having no front side, as shown in https://anonymous.4open.science/r/visualization-B728/Training_Samples/. We identify these cases using two methods: symmetry detection and VLM-based semantic understanding (as illustrated in Figure 3b, "Orientation Annotating").
During training, Orient Anything learns to predict a confidence score indicating the likelihood of an object having a clear front side.
During inference, Orient Anything can predict low confidence scores for objects lacking a clear "front side" to reflect their ambiguous orientation. Actually, the "Judgment" column in Table 3 is the accuracy of judging whether an object has a distinguishable front face.
The paper introduces Orient Anything, a foundation model for zero-shot object orientation estimation. The key contributions include: 1) Leveraging 3D models and VLMs to annotate front faces, generating 2M synthetic images with orientation labels; 2) Modeling orientation as Gaussian distributions over angles (azimuth, polar, rotation) to improve training stability; 3) Using DINOv2 initialization and data augmentation (random cropping) to bridge domain gaps; 4) A VQA benchmark revealing VLMs’ limitations in orientation understanding. Results show state-of-the-art zero-shot performance on real-world datasets (e.g., SUN RGB-D, KITTI) and significant improvements over VLMs (GPT-4o, Gemini) on orientation-related tasks.
Questions for Authors
- How are 3D angles (θ, φ, δ) mapped to 8 horizontal directions in COCO evaluation? Could this simplification misrepresent orientation?
- What is the quantitative effect of data augmentation on synthetic-to-real transfer?
- How were σ_θ, σ_φ, σ_δ chosen? Was sensitivity analysis performed?
- How to extend the proposed method to a video clip, where the object's orientation needs to be estimated in a temporal manner?
Claims and Evidence
Basically yes. 2M images from 55K 3D models across 7,204 categories (vs. 100 in ObjectNet3D). But there are no ablation studies for key components (distribution fitting, augmentation). The COCO evaluation’s 8-direction simplification lacks clarity on the mapping from 3D angles.
Methods and Evaluation Criteria
Synthesizing data via 3D rendering and distribution-based training is sensible. Probability distributions effectively handle angle periodicity. Real-world benchmarks (SUN RGB-D, KITTI) are appropriate but compared unfairly to supervised models.
Theoretical Claims
No theoretical proofs; methods are empirically validated.
Experimental Designs or Analyses
The COCO evaluation uses a simplified 8-direction task, which may not fully reflect 3D orientation. The mapping from predicted angles to directions needs clarification. The manual setting of variances (σ_θ, σ_φ, σ_δ) lacks sensitivity analysis. The impact of random cropping is asserted but not quantified.
Supplementary Material
I can run the code.
Relation to Existing Literature
Well-situated against 6DoF pose estimation and viewpoint detection. Connects to VLMs’ limitations via Ori-Bench. Missing discussion of recent synthetic-data approaches or self-supervised orientation methods.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
Pros: Novel use of synthetic data, the Ori-Bench benchmark, and strong empirical results.
Cons: Sometimes the results are not good in my experiments running the provided code and checkpoint; limited ablation studies; unclear real-world evaluation protocol.
Other Comments or Suggestions
N/A
Thank you for your positive review recognizing the significance of our paper and for your invaluable suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).
W1&Q2&Q3: Limited ablation studies for key components: 1. distribution fitting, 2. augmentation, 3. random cropping, 4. sensitivity for variances (σ_θ, σ_φ, σ_δ).
In Table 5 of the current manuscript, we compare different learning objectives (continuous regression vs. discrete classification vs. distribution fitting), inference augmentations (box vs. mask), and training augmentations (with vs. without random cropping).
In Figure 5, we analyze the impact of different selections for (σ_θ, σ_φ, σ_δ). Generally, our method is not sensitive to the hyper-parameter.
We respectfully inquire whether you are referring to other kinds of ablation studies. If so, could you kindly provide more specific settings, and we will make the necessary additions.
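For reference, here is a minimal sketch of how the Gaussian target distribution over discretized azimuth bins can be constructed, where `sigma` plays the role of σ_θ; the bin width and the wrap-around handling are illustrative rather than our exact formulation.

```python
# Sketch: turn a ground-truth azimuth into a (wrapped) Gaussian target over
# 360 one-degree bins; the predicted distribution can then be trained with a
# cross-entropy / KL objective. sigma plays the role of sigma_theta.
import numpy as np

def gaussian_target(gt_deg, num_bins=360, sigma=5.0):
    centers = np.arange(num_bins) * (360.0 / num_bins)
    # Circular distance so that 359 deg and 1 deg are treated as 2 deg apart.
    diff = np.abs(centers - gt_deg)
    diff = np.minimum(diff, 360.0 - diff)
    target = np.exp(-0.5 * (diff / sigma) ** 2)
    return target / target.sum()                 # normalize to a distribution

target = gaussian_target(gt_deg=358.0, sigma=5.0)
print(target.argmax())                           # 358: the peak sits on the ground-truth bin
```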
W2: Clear real-world evaluation protocol.
Currently, the 6 benchmarks in Table 2 are evaluated on real-world datasets and reported with quantitative 3D orientation error.
Moreover, we have also tested on the latest and largest 3D orientation benchmark, ImageNet3D [1], which covers 200 object categories. The Acc@30° results are as follows:
| Setting | Model | Avg. | Electronics | Furniture | Household | Music | Sports | Vehicles | Work |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | ImageNet3D-ResNet50 | 37.1 | 30.1 | 35.6 | 28.1 | 11.8 | 51.7 | 36.7 | 40.9 |
| Zero-shot | Orient Anything-B | 48.5 | 61.0 | 66.8 | 37.9 | 27.3 | 25.6 | 70.8 | 33.4 |
| Fine-tuning | ImageNet3D-ResNet50 | 53.6 | 49.2 | 52.4 | 45.8 | 26.0 | 65.2 | 56.5 | 58.5 |
| Fine-tuning | ImageNet3D-DINOv2-B | 64.0 | 75.3 | 47.9 | 32.9 | 23.5 | 74.7 | 38.1 | 64.0 |
| Fine-tuning | Orient Anything-B | 71.3 | 77.6 | 89.7 | 64.4 | 54.4 | 47.6 | 87.4 | 61.2 |
Note that we couldn't find detailed definitions for the major categories such as Electronics, Furniture, and Household in ImageNet3D, so we used GPT to map its 200 categories into these 7 broader groups. Therefore, the comparison results for each general category may vary, and the average score provides a more meaningful comparison.
[1] Ma W, Zhang G, Liu Q, et al. ImageNet3D: Towards general-purpose object-level 3D understanding. NeurIPS 2024.
W3: Missing discussion.
We will include more related works on synthetic-data approaches and self-supervised orientation methods. We also respectfully inquire if you could provide more specific related works.
Q1: Mapping from predicted angles to directions in COCO evaluation.
As discussed in Section 6.2, the COCO direction focuses only on the horizontal plane (e.g., azimuth angle). Specifically, we simply map the predicted azimuth angle (0-360°) to the 8 directions with 45-degree intervals. For example, 0±22.5° corresponds to the front, 45±22.5° to front-left, 90±22.5° to left, and so on.
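A minimal sketch of this mapping is given below; the direction label strings are illustrative.

```python
# Sketch of the azimuth-to-8-direction mapping described above: 45-degree
# sectors centered at 0, 45, 90, ... degrees.
DIRECTIONS = ["front", "front-left", "left", "back-left",
              "back", "back-right", "right", "front-right"]

def azimuth_to_direction(azimuth_deg: float) -> str:
    idx = int(((azimuth_deg % 360) + 22.5) // 45) % 8
    return DIRECTIONS[idx]

print(azimuth_to_direction(10))    # front
print(azimuth_to_direction(50))    # front-left
print(azimuth_to_direction(350))   # front
```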
Q4: Extension to video clips.
Simply performing per-frame predictions on video data, followed by cross-frame smoothing through simple averaging, can yield relatively consistent and accurate orientation estimation. Some examples are provided in https://anonymous.4open.science/r/visualization-B728/Video_Cases/.
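A minimal sketch of such per-frame prediction followed by cross-frame smoothing is given below, using a circular moving average so that the 0°/360° wrap-around does not corrupt the mean; the window size is illustrative.

```python
# Sketch: per-frame azimuth predictions smoothed with a circular moving average
# over a small temporal window.
import numpy as np

def smooth_azimuth(per_frame_deg, window=5):
    rad = np.deg2rad(np.asarray(per_frame_deg, dtype=float))
    smoothed = []
    for t in range(len(rad)):
        lo = max(0, t - window // 2)
        hi = min(len(rad), t + window // 2 + 1)
        s, c = np.sin(rad[lo:hi]).mean(), np.cos(rad[lo:hi]).mean()
        smoothed.append(np.rad2deg(np.arctan2(s, c)) % 360)
    return smoothed

print(smooth_azimuth([358, 2, 5, 359, 3]))   # stays near 0/360 instead of jumping to ~180
```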
The authors presented a foundation model for object orientation estimation. The main contributions are in scaling the size of the dataset through the use of 3D models and VLMs, and a loss function. While the reviewers noted that the modeling innovations are weak, e.g., the use of DINO pretrained weights, the use of a small head for the output, and supervised learning, they agreed that the proposed method produces strong empirical results. The value brought comes largely from laying the groundwork in curating the large-scale dataset and training a foundational model for the task.