TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Abstract
Reviews and Discussion
This paper introduces a novel large vision-language model, named TabPedia, aiming to perform comprehensive visual table understanding (VTU) in a unified framework. In order to tackle the dilemma of modal isolation and task exclusivity, it presents a concept synergy mechanism, in which diverse VTU tasks and multi-source visual embeddings are regarded as concepts. This framework integrates table detection (TD), table structure recognition (TSR), table querying (TQ) and table question answering (TQA) with the powerful capabilities of large language models (LLMs). Extensive experiments are conducted to validate the effectiveness of TabPedia. In addition, this paper establishes a new and comprehensive table VQA benchmark, ComTQA, featuring about 9,000 high-quality QA pairs.
Strengths
- This paper presents a unified approach that combines table perception and comprehension tasks with autoregressive language models for generating textual descriptions. Impressively, the design of the TD, TSR and TQ tasks to fit the serialized output format of LLMs also achieves performance comparable to previous task-specific methods.
- The proposed concept synergy mechanism effectively enables the table perception-related and comprehension-related tasks to work in harmony, and achieves impressive performance on various visual table tasks.
- The TQ task validates the powerful capabilities of LLMs for VTU. Previous methods suffer from the redundant operation of first cropping table-centric images from original documents and then recognizing table structure. In contrast, TabPedia can directly parse table structure information from original documents with no significant degradation in performance. These results could inspire more research exploring LLMs' potential for broad visual table understanding.
- The qualitative visualizations of various VTU tasks, including TD, TSR, and TQA, on real-world images showcase the broad applicability of TabPedia. These visualizations highlight TabPedia's ability to generalize across different scenarios, suggesting its potential real-world applications.
Weaknesses
- In object detection tasks, the outputs for different objects are unordered. However, large language models generally produce serialized outputs. Could you please explain in detail how TabPedia addresses the mismatch between the task and model architecture?
- In line 187, the authors provide one example of the user's question for each table task. Could you give more detailed descriptions of the instruction design and display the complete instructions for each task?
- In lines 206-207, the structure of the table is represented with five object classes. Are these objects independent of each other, or do they have implicit relationships among them? Please explain this part to aid understanding of this descriptive format.
- For ComTQA benchmark, please provide more detailed statistical information, such as the average question length, average answer length.
- The Broader Impact section is not comprehensive enough; more detailed discussion is necessary.
Questions
See the above Weaknesses.
Limitations
Although there are some discussions about the broader impact and limitations of the proposed framework in the main paper, some deeper discussion is missing, such as whether the techniques presented in the paper can be extended to other areas, or the potential challenges of visual table understanding in multilingual scenarios. More detailed discussion would improve the comprehensiveness of this paper.
We appreciate the detailed comments and acknowledgment of our contributions. We provide the responses as follows.
-
Q1: In object detection tasks, the outputs for different objects are unordered. However, large language models generally produce serialized outputs. Could you please explain in detail how TabPedia addresses the mismatch between the task and model architecture?
A1: When creating data annotations, for multiple unordered table boxes in a single image, we sort them in ascending order based on the coordinates of the top-left corner of each box. Specifically, we sort them first by the horizontal coordinate x, and then by the vertical coordinate y. This strategy ensures that the annotations for each image are unique and guides TabPedia to understand the image's structural information in an orderly manner and to generate answers sequentially.
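As a minimal illustration of this ordering rule (the box format and helper name below are hypothetical, not the actual annotation code):

```python
def sort_table_boxes(boxes):
    """Sort unordered table boxes before serializing them into the target text.

    Each box is assumed to be (x1, y1, x2, y2), with (x1, y1) the top-left corner.
    Boxes are ordered by the top-left corner in ascending order: first by the
    horizontal coordinate x1, then by the vertical coordinate y1.
    """
    return sorted(boxes, key=lambda b: (b[0], b[1]))

# Example: two tables near the top of the page (left and right) and one lower down.
boxes = [(0.52, 0.16, 0.81, 0.32), (0.09, 0.70, 0.47, 0.86), (0.09, 0.14, 0.42, 0.28)]
print(sort_table_boxes(boxes))
# [(0.09, 0.14, 0.42, 0.28), (0.09, 0.7, 0.47, 0.86), (0.52, 0.16, 0.81, 0.32)]
```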
-
Q2: In line 187, the authors provide one example of the user's question for each table task. Could you give more detailed descriptions of the instruction design and display the complete instructions for each task?
A2: Large language models inherently have powerful comprehension ability. Our instruction design follows two principles: 1) For different tasks, the instructions should be distinct to help the model better understand the differences between various visual tasks. 2) For a single task, the instructions should be diverse to enhance the model's ability to follow instructions. Based on these two principles, we adopt GPT-3.5 to expand the manually designed instruction set. The complete instructions for the different VTU tasks are shown in the uploaded pdf.
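For illustration only, these two principles could translate into task-specific template pools like the sketch below. These are not the actual instructions (those are provided in the uploaded pdf); the template strings and the `INSTRUCTION_POOLS` / `sample_instruction` names are purely hypothetical.

```python
import random

# Hypothetical instruction pools illustrating the two design principles:
# (1) distinct wording across tasks, (2) diverse wording within each task.
INSTRUCTION_POOLS = {
    "TD":  ["Detect all tables in the image and output their locations.",
            "Where are the tables located on this document page?"],
    "TSR": ["Recognize the structure of the table in the image.",
            "Parse the rows, columns and cells of this table."],
    "TQ":  ["Parse the structure of the table at the given position.",
            "Given the table region, output its structural elements."],
    "TQA": ["Answer the question based on the table in the image: {question}",
            "Read the table and answer: {question}"],
}

def sample_instruction(task, question=None):
    template = random.choice(INSTRUCTION_POOLS[task])
    return template.format(question=question) if "{question}" in template else template
```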
-
Q3: In lines 206-207, the structure of the table is represented with five object classes. Are these objects independent of each other, or do they have implicit relationships among them? Please explain this part to aid understanding of this descriptive format.
A3: Please refer to the official comment.
-
Q4: For the ComTQA benchmark, please provide more detailed statistical information, such as the average question length, average answer length.
A4: We calculate the average, maximum, and minimum lengths of the questions and answers in ComTQA based on the number of characters. It is observed that the longest answer reaches 1000 characters, attributable to the inclusion of all possible correct answers within the corresponding answer annotations, especially in cases where multiple answers are provided. More detailed statistical information can be found in the uploaded pdf.

| Statistic | Value |
| --- | --- |
| Min question length | 17 |
| Max question length | 273 |
| Avg question length | 67 |
| Min answer length | 1 |
| Max answer length | 1000 |
| Avg answer length | 13 |

-
Q5: Although there are some discussions about the broader impact and limitations of the proposed framework in the main paper, some deeper discussion is missing, such as whether the techniques presented in the paper can be extended to other areas, or the potential challenges of visual table understanding in multilingual scenarios. More detailed discussion would improve the comprehensiveness of this paper.
A5: Thanks for your constructive suggestions. We will add the following content to the Broader Impact section.
"In TabPedia, the collaboration of visual perceptive and comprehensive tasks via concept synergy mechanism shows impressive performance. For other document-related fields, this mechanism also could be applied or adapted to improve the reading capability of models. Exploring the transferability of our model to related tasks or domains could shed light on its versatility and applicability in various contexts. Furthermore, addressing the challenges of visual table understanding in multilingual scenarios is a crucial aspect that warrants more detailed discussion. Multilingual settings introduce complexities such as language variations, cultural differences, and diverse table formats that may impact the performance of our model. Investigating these challenges and proposing strategies to overcome them would enhance the robustness and generalizability of our approach."
I appreciate the authors' detailed explanations and visualizations in the rebuttal. After reviewing, all my concerns have been addressed. I recommend integrating these results into the final manuscript. I will keep my initial score.
Thanks for your thoughtful review and for your helpful suggestions. We will include extra citations, more detailed explanations of the TSR annotations, more statistics of the ComTQA benchmark, and a more detailed discussion of the broader impact in the final manuscript.
This paper proposes TabPedia, a novel large-scale vision-language model designed for comprehensive visual table understanding (VTU) within a unified framework. It addresses the challenge of modal isolation and task exclusivity by proposing a concept synergy mechanism that treats diverse VTU tasks and multi-source visual embeddings as interconnected concepts. This framework integrates various table tasks with the powerful capabilities of large language models. Extensive experiments are conducted to validate the effectiveness of TabPedia. A new TQA dataset ComTQA is established for better evaluating the VTU task in real-world scenarios.
Strengths
- This paper is the first to investigate the collaboration of table perception and comprehension tasks in a unified framework, achieving impressive performance on diverse VTU tasks.
- This paper proposes an efficient table detection strategy, without the need for a complex NMS algorithm to eliminate densely overlapped boxes. It directly predicts the positions of all table instances conveniently, inspiring a new way to solve detection-related tasks.
- The new TQA benchmark, ComTQA, comprises high-quality QA pairs extracted from real-world table images. This benchmark addresses the limitations of previous benchmarks by including more complex question types that were previously absent, making it a more challenging and suitable benchmark for community development.
Weaknesses
- Please add the references of all datasets in Tab.1.
- In table 5, the task "TD+TQ" is unclear. Please clarify the setting of this task.
- In the ComTQA dataset, there exist some samples with multiple answers. Since this paper utilizes the accuracy metric, how are multiple answers judged as correct or incorrect?
- In addition, I have a minor consideration about the inference efficiency of the proposed TabPedia due to the mechanism of autoregressive output. Despite some of the unfairness of such comparisons, I suggest that it needs to be properly discussed in the Limitation section.
Questions
Please refer to the Weaknesses.
Limitations
Please refer to the Weaknesses.
We appreciate the detailed comments and acknowledgment of our contributions. We provide the responses as follows.
-
Q1: Please add the references of all datasets in Tab.1.
A1: We will add the references of all datasets in Tab.1 in our revised manuscript following your suggestions.
-
Q2: In table 5, the task "TD+TQ" is unclear. Please clarify the setting of this task.
A2: In the "TD+TQ" setting, the document images are first fed into TabPedia with the prompt of the TD task to detect all possible table positions. Next, the detected table positions are fed into TabPedia with the prompt of the TQ task to parse the specific table structure in turn. Since the table positions are generated from TabPedia rather than directly adopting ground-truth coordinates, this setting brings more challenges. Impressively, TabPedia still achieves plausible performance with slight accuracy degradation under this setting compared with those under the TQ setting.
-
Q3: In the ComTQA dataset, there exist some samples with multiple answers. Since this paper utilizes the accuracy metric, how are multiple answers judged as correct or incorrect?
A3: For multiple answers in a single QA pair, we separate them with the special character '\n' in the ground-truth answer. During testing, we check whether each answer exists in the response of TabPedia. The response is considered correct only if all the answers are present; otherwise, it is considered incorrect.
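A minimal sketch of this scoring rule, assuming the ground-truth field stores all acceptable answers joined by '\n' (case-insensitive substring matching is an assumption made only for illustration):

```python
def is_correct(ground_truth: str, response: str) -> bool:
    r"""Judge a prediction as correct only if every ground-truth answer
    (multiple answers are joined with '\n') appears in the model response."""
    answers = [a.strip() for a in ground_truth.split("\n") if a.strip()]
    # Case-insensitive substring matching is assumed here for illustration.
    return all(ans.lower() in response.lower() for ans in answers)

def accuracy(gt_answers, responses):
    return sum(is_correct(g, r) for g, r in zip(gt_answers, responses)) / len(gt_answers)
```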
-
Q4: In addition, I have a minor consideration about the inference efficiency of the proposed TabPedia due to the mechanism of autoregressive output. Despite some of the unfairness of such comparisons, I suggest that it needs to be properly discussed in the Limitation section.
A4: Thanks for your constructive suggestions. We will add the following content to the Limitation section to discuss the inference efficiency of TabPedia.
“TabPedia, as a multimodal large model, requires autoregressive answering. Compared to parallel decoding algorithms such as DETR [1] and Faster R-CNN [2], it consumes more decoding time. Meanwhile, certain algorithmic designs such as KV cache and flash attention, together with hardware improvements, can effectively improve inference efficiency. We believe that with the iterative development of large model technology, the inference efficiency of TabPedia can be significantly improved.”
[1]: Carion, Nicolas, et al. "End-to-end object detection with transformers." In ECCV 2020.
[2]: Sun, Xudong, Pengcheng Wu, and Steven CH Hoi. "Face detection using deep learning: An improved faster RCNN approach." In Neurocomputing 2018.
If our responses solve your concerns, we sincerely hope that you could raise your score. It's important to us.
Thanks for your detailed response to my questions and concerns. After thoroughly reading the rebuttal, I find that most of my concerns have been adequately addressed. However, I still have some questions on the main contribution, the meditative tokens, on which I expect further discussion. I acknowledge that it is an interesting design exhibiting several properties, as shown in Figure D5. It would be better to give more detailed explanations, as also noted by other reviewers. My questions are:
-
In Figure D5, I find that the meditative tokens have different awareness patterns for different table perception and understanding tasks. It seems the meditative tokens can capture information from different vision sources according to the specific task. This is an interesting phenomenon; is it possible to give the importance of high- and low-resolution vision tokens when they are captured by the meditative tokens for different tasks?
-
To my knowledge, a recent work [1] also tries to address the visual shortcomings of MLLMs by combining dual vision encoders. I am wondering whether it is possible to generalize the meditative tokens to versatile MLLMs. If so, it may be another feasible alternative to solve the problem raised in work [1].
-
In table 7, I notice that introducing meditative tokens can bring significant performance improvements on both perception and understanding tasks. It would be better to explain or showcase what kinds of cases can be improved by introducing this design.
[1] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs.
Thank you for your affirmation of our contribution and your thoughtful review. For the further questions, we provide the responses as follows.
-
Q1: Is it possible to give the importance of high- and low-resolution vision tokens when they are captured by the meditative tokens for different tasks?
A1: Thanks for your enlightening suggestion. To answer this question, we have sampled 100 test cases for each task and report the averaged numeric importance of high- and low-resolution vision tokens when they are attended by the meditative tokens for different tasks in the following table. Specifically, for the various VTU tasks, we calculate the averaged attention scores (across all layers and attention heads) from the LLM decoder, which indicates the extent to which the meditative tokens focus on either high- or low-resolution visual tokens.

For the TSR and TQ tasks, the meditative tokens pay significantly more attention to the high-resolution visual encoder tokens. We attribute this to the fact that both tasks require more fine-grained visual information to be "deliberated" in order to construct the dense table structure. In contrast, for the TD and TQA tasks, the two visual encoders contribute almost equally to the information attended to by the meditative tokens, validating the importance of both vision encoders for these tasks.

| Task | High-res visual tokens | Low-res visual tokens |
| --- | --- | --- |
| TD | 0.49 | 0.51 |
| TSR | 0.71 | 0.29 |
| TQ | 0.73 | 0.27 |
| TQA | 0.51 | 0.49 |

-
Q2: I am wondering whether it is possible to generalize the meditative tokens to versatile MLLMs. If so, it may be another feasible alternative to solve the problem raised in work [1].
A2: Thanks for the constructive suggestion. The answer is yes. The meditative tokens are not a setting unique to the VTU domain. In general, the meditative tokens are a simple yet effective technique to help language models better leverage the different concepts involved. We believe the benefits of these tokens are not limited to the visual information or task types used in this work. However, exploring the broader applications of the meditative tokens in general MLLMs may deviate a bit from the purpose of this work, and we leave the investigation of these potential benefits to future work.
[1] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs.
-
Q3: In table 7, I notice that introducing meditative tokens can bring significant performance improvements on both perception and understanding tasks. It would be better to explain or showcase what kinds of cases can be improved by introducing this design.
A3: As shown in Tab.7, introducing meditative tokens brings promising performance across VTU tasks. We compare in detail the differences in the generated results before and after the introduction of meditative tokens on different VTU tasks. It is observed that introducing meditative tokens mainly improves the quality of long-form responses. For the perception tasks, including TD and TSR, introducing meditative tokens can alleviate meaningless or repetitive word generation. For the comprehension task, TQA, introducing meditative tokens leads to more elaborate and reasonable responses. As suggested, we showcase several samples for better understanding.

| Question | GT | TabPedia (w/o Meditative Tokens) | TabPedia |
| --- | --- | --- | --- |
| what is the last site's municipality? | Fukuyama | Fukuoka \nReason: The last site is Tachibana, and its municipality is Fukuoka. | Fukuyama \nReason: It is shown in the last row of the table that the last site's municipality is Fukuyama. So the answer is Fukuyama. |
| john briggs and james phillip both drove which type of car? | Honda Prelude Chevrolet | Mazda RX-7 \nReason: From the image, we can see that both John Briggs and James Phillipps drove Mazda RX-7. | Honda Prelude Chevrolet \nReason: From the image, we can see that both John Briggs and James Phillip drove the Honda Prelude Chevrolet. John Briggs drove the No.9 Honda Prelude Chevrolet, while James Phillip drove the No.10 Honda Prelude Chevrolet. So they both drove the same type of car, which is the Honda Prelude Chevrolet. |
| Output all the table element's locations in the photo. | Here is a list of all the locations of table element in the picture:\n [0.095,0.139,0.424,0.279]\n [0.095,0.375,0.458,0.620]\n [0.092,0.704,0.472,0.862]\n [0.518,0.155,0.807,0.321] | Here is a list of all the locations of table element in the picture:\n [0.095,0.139,0.426,0.329]\n [0.095,0.409,0.447,0.669]\n [0.095,0.699,0.459,0.859] | Here is a list of all the locations of table element in the picture:\n [0.096,0.140,0.422,0.281]\n [0.095,0.378,0.456,0.617]\n [0.094,0.707,0.474,0.862]\n [0.518,0.156,0.809,0.324] |
Thank you for the rebuttal. All my concerns have been solved properly. The proposed concept synergy through meditative tokens presents some intriguing properties, and I believe it could be a potential way to solve the problem of optimizing visual information usage. For these reasons, I believe this paper could make a valuable contribution to the conference, sparking new ideas among attendees, and I am inclined to raise my rating to ACCEPT.
We sincerely appreciate your recognition of our work and your constructive suggestions. We will improve our manuscript based on your suggestions.
This paper introduces TabPedia, a novel large vision-language model designed to address the challenges in visual table understanding (VTU) tasks. TabPedia incorporates a concept synergy mechanism that treats various VTU tasks and multi-source visual embeddings as concepts within a unified framework, allowing for the seamless integration of tasks such as table detection, structure recognition, querying, and question answering by utilizing the capabilities of large language models (LLMs). A new comprehensive table visual question answering benchmark called ComTQA is created, which includes approximately 9,000 question-answer pairs. Extensive experiments on both table perception and comprehension tasks across various public benchmarks demonstrate the effectiveness of TabPedia.
Strengths
This paper presents very detailed experiments on the VTU tasks and achieves good results.
It is meaningful to combine all VTU tasks into the same framework.
Weaknesses
The novelty of TabPedia seems to be minimal, as existing VLM models are almost similarly structured.
Although the authors present the Attention map of meditative tokens in the Appendix, I still don't understand the reason why meditative tokens work.
Different vision encoders are introduced. But I don’t know how they help each other and what they extract.
In addition, there are many places where the author doesn't explain things clearly. For example,
L177: Why is the low-resolution vision encoder not trained?
L207 “To better understanding, we display a representative sample in Appendix B.” After reading the appendix I still don't understand how the authors constructed the TSR data. By the way, is there any grammatical mistake in "To better understanding"?
L248 “temperature parameter” is also not explained.
Questions
See weaknesses.
Limitations
Yes.
We appreciate the detailed comments and acknowledgment of our contributions. We provide the responses as follows.
-
Q1: The novelty of TabPedia seems to be minimal, as existing VLM models are almost similarly structured.
A1: We would like to re-emphasize that the main novelty, as confirmed by Reviewers #bkeN and #fN3K, lies in the unified framework integrating various VTU tasks with "impressive" performance achieved. We do agree that our TabPedia is built upon the canonical "Vision Encoder + Projection + LLM" paradigm; however, our main focus is to explore the synergistic effects of diverse VTU tasks and multi-source visual embeddings through the proposed concept synergy mechanism, whose indispensability has been verified by the qualitative and quantitative experimental results given in the paper (see Tab.7 and Fig.D5). Actually, we have performed further explorative experiments by infusing the proposed concept synergy mechanism into another powerful VLM, QWEN VL, and surprisingly found better performance. To put it another way, our concept synergy mechanism is a versatile one that can be readily applied to most existing VLMs.
-
Q2: Although the authors present the Attention map of meditative tokens in the Appendix, I still don't understand the reason why meditative tokens work.
A2: The most intuitive motivation behind our meditative tokens comes from the success of "additional tokens" (see Sec. 2.3 of the main text), where the input sequence is extended with additional tokens for various purposes, such as extracting task-specific information, providing extra information, or improving model performance. Inspired by this, our proposed meditative tokens serve as an informative buffer that adaptively integrates the different partial visual tokens and understands the intention of the specific task question in visual table understanding. As illustrated in Figure D5, the meditative tokens can adaptively capture task-related visual features with respect to diverse tasks.
-
Q3: Different vision encoders are introduced. But I don’t know how they help each other and what they extract. Why the low-resolution vision encoder is not trained?
A3: We equip our TabPedia with dual vision encoders to effectively extract visual information. For the low-resolution vision encoder, we utilize the CLIP vision encoder (ViT-L), which has been pre-trained on 400 million image-text pairs sourced from open-world data, thereby embedding extensive world knowledge into its pretrained weights. To preserve its generalization ability, we keep it frozen during the whole training procedure. In comparative experiments (see the following table), unfreezing it brings no significant performance improvement but longer training time, which is in line with the conclusion of the pioneering work [1]. Besides, we suppose that keeping the encoder frozen serves as a regularization, facilitating the extraction of layout information, alleviating potential overfitting, and stabilizing training. However, ViT-L is constrained by its limited ability to capture nuanced visual representations from high-resolution document images containing intricate textual content and dense table structures. As various tasks may require different visual clues from either vision encoder, the dual vision encoders are expected to work flexibly for various tasks (TQA often requires detailed table information, while global layout matters for the TSR task), which also motivates our meditative tokens. To strike a trade-off between computational consumption and performance, we thus freeze the low-resolution vision encoder during training. This explanation will be further elaborated upon in our revised manuscript.

| Exp Setting | PubTab1M-Det | FinTabNet | WTQ |
| --- | --- | --- | --- |
| low-res enc (frozen) + high-res enc (unfrozen) | 98.5 | 95.11 | 47.8 |
| low-res enc (unfrozen) + high-res enc (unfrozen) | 98.4 | 95.62 | 46.4 |

[1] Huang, Xiaohu, et al. "Froster: Frozen CLIP is a strong teacher for open-vocabulary action recognition." In ICLR 2024.
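For illustration, a minimal PyTorch sketch of freezing the low-resolution branch (the encoder constructors in the comments are stand-ins, not the actual TabPedia code):

```python
import torch

def freeze(module: torch.nn.Module) -> torch.nn.Module:
    """Freeze the low-resolution CLIP ViT-L so its pretrained open-world
    knowledge is preserved and acts as a regularizer during training."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()

# Stand-in constructors (not the actual TabPedia code):
# low_res_encoder = freeze(build_clip_vit_l())   # kept frozen
# high_res_encoder = build_high_res_encoder()    # trained end-to-end
```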
-
Q4: “To better understanding, we display a representative sample in Appendix B.” After reading the appendix I still don't understand how the authors constructed the TSR data. By the way, is there any grammatical mistake in "To better understanding"?
A4: Please refer to the official comment. For all the typos, we will thoroughly revise them in the future version.
-
Q5: "temperature parameter" is also not explained.
A5: To the best of our knowledge, "temperature" is a basic concept in the machine learning field [1,2], which emerged long before the deep learning era. As a submission to a conference in this field, we thus assumed all readers have this background. In the context of language models, specifically Large Language Models (LLMs), the temperature parameter controls the randomness or uncertainty of the output generated by the model. A small temperature value sharpens the probability distribution: the most likely tokens are given even higher probabilities, while less likely tokens receive lower ones, leading to more deterministic and less diverse outputs.

[1] https://en.wikipedia.org/wiki/Boltzmann_distribution
[2] Bishop, Christopher M., and Nasser M. Nasrabadi. "Pattern Recognition and Machine Learning." Springer, 2006.
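For completeness, a minimal numeric illustration of temperature scaling during decoding (not tied to any specific implementation):

```python
import numpy as np

def temperature_softmax(logits, temperature=1.0):
    """Lower temperature sharpens the next-token distribution (more deterministic);
    higher temperature flattens it (more diverse outputs)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()              # for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.1]
print(temperature_softmax(logits, temperature=1.0))  # ~[0.66, 0.24, 0.10]
print(temperature_softmax(logits, temperature=0.2))  # sharpened, close to one-hot
```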
Thank you for your detailed response and further explanations. After carefully reviewing your answers, I still have a few questions that I would like to discuss further:
Regarding the Methodology: The "meditative tokens" you propose and prompt tuning both involve adding extra parameters. This raises some concerns about the novelty of the method. Additionally, I find the naming of "meditative tokens" somewhat inappropriate. Could you please elaborate on the differences between your approach and prompt tuning, and clarify the reasoning behind this particular naming?
Regarding Novelty: While you have introduced dual vision encoders to address the weaknesses in some visual encoders within LLMs, I am still unclear about how these two encoders interact with each other. Are they merely stacked, or is there a more intricate interaction at play? Understanding the underlying design principles here is crucial, and I would appreciate a more detailed explanation.
Regarding the Attention Map: Although Appendix D presents the Attention map of the meditative tokens, I am not convinced that these visualizations lead to any meaningful conclusions since discrepancies in attention are to be expected. Could you provide further clarification on how these Attention maps substantiate the effectiveness of your method?
I look forward to your further insights and appreciate the answers you have put into this rebuttal.
-
Q3: Could you provide further clarification on how these Attention maps substantiate the effectiveness of your method?
A3: Fig. D5 only visualizes the working pattern of meditative tokens on a single sample for each task. To be more clear, we have further sampled 100 test cases for each task and report the averaged numeric importance of high- and low-resolution vision tokens when they are attended by the meditative tokens for different tasks in the following table. Specifically, for the various VTU tasks, we calculate the averaged attention scores (across all layers and attention heads) from the LLM decoder, which indicates the extent to which the meditative tokens focus on either high- or low-resolution visual tokens.
For the TSR and TQ tasks, the meditative tokens pay significantly more attention to the high-resolution visual encoder tokens. We attribute this to the fact that both tasks require more fine-grained visual information to be "deliberated" in order to construct the dense table structure. In contrast, for the TD and TQA tasks, the two visual encoders contribute almost equally to the information attended to by the meditative tokens, validating the importance of both vision encoders for these tasks.
| Task | High-res visual tokens | Low-res visual tokens |
| --- | --- | --- |
| TD | 0.49 | 0.51 |
| TSR | 0.71 | 0.29 |
| TQ | 0.73 | 0.27 |
| TQA | 0.51 | 0.49 |

Furthermore, we also investigate the averaged contribution of meditative tokens, high-resolution visual tokens, and low-resolution visual tokens to the generated answers. It is worth noting that if the previous table showed the importance of different visual cues for the "thinking" process, then the following results demonstrate the importance of the "thinking" results. Specifically, we calculate the averaged scores of the TabPedia-generated answers with respect to these three types of tokens across all the attention maps from the LLM. One can observe that the meditative tokens contribute the most information to the generation of satisfactory answers, which demonstrates that the proposed meditative tokens are indispensable and effective. Please refer to the answers A1 and A3 to Reviewer bkeN for more explanations and examples.

| Task | Meditative tokens | High-res visual tokens | Low-res visual tokens |
| --- | --- | --- | --- |
| TD | 0.65 | 0.16 | 0.19 |
| TSR | 0.64 | 0.12 | 0.24 |
| TQ | 0.71 | 0.11 | 0.19 |
| TQA | 0.56 | 0.18 | 0.25 |
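A minimal sketch of how such per-task averages could be computed from decoder attention maps (the tensor shapes and index sets below are illustrative assumptions, not the actual analysis code):

```python
import torch

def visual_attention_share(attn_maps, med_idx, high_idx, low_idx):
    """Accumulate, over layers and heads, the attention that meditative tokens
    pay to high- vs. low-resolution visual tokens, then normalize to shares.

    attn_maps: list of per-layer attention tensors of shape [num_heads, seq_len, seq_len].
    med_idx / high_idx / low_idx: lists of token positions (illustrative).
    """
    high = low = 0.0
    for layer_attn in attn_maps:
        med_rows = layer_attn[:, med_idx, :]           # [heads, #meditative, seq_len]
        high += med_rows[:, :, high_idx].mean().item()
        low += med_rows[:, :, low_idx].mean().item()
    total = high + low
    return high / total, low / total
```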
Thank you for your thoughtful review and for highlighting key points. For the further question, we provide the responses as follows.
-
Q1: Could you please elaborate on the differences between your approach and prompt tuning, and clarify the reasoning behind this particular naming?
A1: Although both introduce extra parameters, our method and prompt tuning are two different things. Prompt tuning is a Parameter-Efficient Fine-Tuning (PEFT) [1] solution aimed at efficiently training large models by introducing task-specific input prompts, while our method is a framework in which the meditative tokens aim to adaptively integrate different concepts for different VTU tasks to generate plausible answers. That is, prompt tuning is a training strategy by which our method could also be trained. Comparing them is more akin to comparing VGG and LoRA.
The word "meditative" in the name is derived from its core purpose: "allow the model time to ponder before generating output". More concretely, our dual vision encoders yield around 1.5k visual tokens rich with visual information. To facilitate the model's deliberation, we propose appending several trainable tokens after these visual tokens. This buffer enables the decoder to thoughtfully contemplate the received visual data and associated question content while recursively generating coherent responses, which has already been explained in the first round A2. This process is analogous to the thoughtful deliberation of human beings when perceiving an image in reality. In such cases, people naturally reserve mental space and time to ponder the visual information before formulating a response. Inspired by it, we have designated the appended trainable tokens as "Meditative tokens".
-
Q2: More clarification about dual vision encoder.
A2: As shown in Fig.2 of the main text, our TabPedia simply concatenates the visual tokens extracted by the dual vision encoders. The effectiveness of this dual-encoder combination has already been verified by several previous works [2,3]. In the first-round rebuttal, we elaborated on the respective role of each encoder; please refer to the first-round A3. It is important to re-emphasize that we do not claim the dual-encoder design as the main contribution of our method. Compared to this "1+1=2" design, we place more attention on how to achieve a "1+1>2" effect, which is accomplished through the proposed meditative tokens. To take a further step, the "interaction" between the two encoders that you refer to happens in the LLM-like decoder. Rather than "interaction", we would call it "contribution": our aim is to drive each encoder to contribute to the LLM decoder through the meditative tokens for different tasks (refer to the following A3 for more details).
[1] Han Z, Gao C, Liu J, et al. "Parameter-efficient fine-tuning for large models: A comprehensive survey." In arXiv 2024.
[2] Wei, Haoran, et al. "Vary: Scaling up the vision vocabulary for large vision-language models." In arXiv 2023.
[3] Tong, Shengbang, et al. "Eyes wide shut? exploring the visual shortcomings of multimodal LLMs." In CVPR 2024.
Dear Reviewer Xi9r,
We would like to extend our appreciation for your time and valuable comments. Due to the rush in finalizing the writing, some aspects may cause confusion and misunderstanding. Ensuring that the rebuttal aligns with your suggestions is of utmost importance. We have responded to your concerns as quickly as possible. Considering that the discussion phase is nearing its end, we respectfully remind you to let us know if you have any other questions so that we can better address your concerns. We would greatly appreciate it if you could consider improving the evaluation after reviewing our responses.
Thank you very much for your consideration.
Sincerely,
The authors.
As mentioned in L204-L210 of the main paper, we propose a canonical table structure representation based on the detection format. We jointly adopt five object classes to model TSR, including table column, table row, table column header, table projected row header and table spanning cell. In Appendix B, we display the specific rectangular boxes of all objects in a table. Taking into account the serialized output of the LLM, we represent the table structure with a series of "[object] [x1, y1, x2, y2]" entries, which are separated by "\n".
A table is generally composed of five basic elements, i.e., column, row, spanning cell, column header and projected row header. "Row" denotes the rectangular box of each row's content in the table, while "Column" denotes the rectangular box of each column's content. The area where each row and each column intersect represents a table cell. Besides these two most common table elements, "Column header" refers to the area in the table that contains the data type or content for each column, usually occupying multiple rows at the top of the table. "Projected row header", as a special row, represents the area that contains a single non-blank cell in a row. "Spanning cell" refers to a cell in a table that spans multiple rows or columns. According to these definitions, these objects have implicit relationships and construct a table's hierarchical structure through physically overlapping rectangular boxes.
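A small sketch of how such a serialized TSR target could be assembled from box annotations (the function and annotation layout are hypothetical, for illustration only):

```python
def serialize_tsr(objects):
    """Turn TSR annotations into the '[object] [x1, y1, x2, y2]' text format
    described above, one element per line.

    objects: list of (class_name, (x1, y1, x2, y2)) pairs with normalized coordinates.
    """
    return "\n".join(
        f"[{cls}] [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for cls, (x1, y1, x2, y2) in objects
    )

example = [
    ("table column header", (0.05, 0.05, 0.95, 0.12)),
    ("table row", (0.05, 0.12, 0.95, 0.20)),
    ("table column", (0.10, 0.05, 0.30, 0.95)),
]
print(serialize_tsr(example))
```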
Thanks for the thorough reading and fruitful reviews, below we address the key concerns and suggestions case by case. For more clarity, we append the pdf containing Figures and Tables.
This paper got conflicting reviews, with two reviewers supporting acceptance and one supporting rejection.
The point of contention appears to be around the novelty of the contribution itself and whether it merits publication at NeurIPS. The authors claim that their novelty lies in (1) the unified formulation of different table tasks, (2) the meditative tokens.
I am fine with (1). But I remain unconvinced by (2). The inclusion of the meditative tokens feels orthogonal to the contribution around table comprehension. It is also not explored sufficiently. One reviewer asked if the performance improvements from meditative tokens are simply because they add more parameters. This is a reasonable question and one that was not answered adequately. There is also the question of extra computation that the meditative tokens require; could the extra computation explain the performance increase?
Regardless, the performance increases are promising. The paper is being accepted but I would urge the authors to include the following experiments:
(1) Train your model with random meditative tokens frozen. If your model performs better, then we know that the meditative tokens do more than just provide extra computation. (2) I would also like to see experiments where the meditative tokens are used with single vision encoders, to understand their role further.