Thank you for your comments and feedback. Please find below the answers to your questions:

Q1. Why some of the results are not shown in Table 1?

The results for the baseline models are from their respective papers which are cited in the table. These works have not provided the results for all the datasets and/or dataset splits. We will clarify this in the paper. For GPT-4, we did not run the model on the much larger PlotQA evaluation sets due to high cost and usage limits as pointed out in the footnote on page 7.

Q2: Why is the correct answer for arithmetic in Table 7 is -50752953286.0?

We apologize for not including the original charts for each example in the paper and will include them in the updated version. The correct answer is from the PlotQA dataset. The question-answer pairs in PlotQA are generated by templates automatically and PlotQA considers the “difference” between A and B just as instead of the absolute difference. That is why the answer is negative in Table 7.

Q3: What is the evaluation prompt used when there is no Describe step?

In this case, the reasoning process starts directly with task decomposition and not the Describe step. For example, the first demonstration in the prompt would be:

Q: In which year the private health expenditure per person in Oman is 210.69?
A: Let’s extract the data of Oman.
The data is 183.88 in 2008, 233.80 in 2009, 210.69 in 2010, 195.26 in 2011, 196.32 in 2012,
154.21 in 2013, 153.22 in 2014.
The value 210.69 is in year 2010. So the answer is 2010.

We have added the prompt used for this case to Appendix A-2.

Q4: It would be great if the authors could discuss the applicability of the proposed method on other VQA tasks, e.g. CLEVR.

Our proposed method can be applied to other VQA tasks given an appropriate vision module. We focused on visual language reasoning to necessitate the multiple interactions between language and vision modules since other VQA tasks either do not require multi-step reasoning or can be accomplished by language prior alone. We will highlight the flexibility of our method in the final draft.

Q5: Is there any examples of the model failed at decomposing the problem?

There are examples of incorrect decomposition due to ambiguity in the question. In some cases, the model fails to request the only information required for answering although the model still answers correctly finally:

Q: How many people use daily?
A: Let's describe the figure.
The figure shows the data of: Value (blue). The x-axis shows: Weekly | Daily | Stopped using.
Let's extract the data of Value.
The data is 41.2 in Monthly, 37.8 in Weekly, 18.2 in Daily, 2.8 in Stopped using.
Among [41.2, 37.8, 18.2, 2.8], the value that is 18.2 is the number of people who use daily. So the answer is 18.2.

In other cases, the incorrect decomposition leads to invalid queries for the vision module. The vision module then gives the wrong intermediate result which finally leads to an incorrect answer.

Q: What's the percentage of people who don't believe it can impact us and don't know much about covid?
A: Let's describe the figure.
The figure shows the data of: Value (dark blue). The x-axis shows: Staying alert and to the ground | Don't believe it can impact us or | Don't know much about.
Let's extract the data of Value.
The data is 1.0 in Staying alert and the 2019.
The percentage of people who don't believe it can impact us and don't know much about covid is 1.0. So the answer is 1.0.

We will extend our error analysis in the revised draft to include such examples.