PaperHub
Average rating: 5.0 / 10 (Poster, 4 reviewers; min 4, max 6, std 1.0)
Individual ratings: 6, 4, 4, 6
Confidence: 3.5 · Soundness: 2.8 · Contribution: 2.3 · Presentation: 2.8
NeurIPS 2024

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

OpenReview · PDF
Submitted: 2024-05-08 · Updated: 2024-12-19

Abstract

Keywords
video text spotting, efficient tuning, long and short term matching

Reviews and Discussion

Review
Rating: 6

The paper introduces GoMatching, a streamlined and efficient baseline for video text spotting that enhances tracking capabilities through a novel Long-Short Term Matching module, while also setting new performance benchmarks on multiple datasets and introducing the ArTVideo test set for arbitrary-shaped text evaluation.

Strengths

  1. This method leverages an off-the-shelf, query-based image text spotter and adapts it to video datasets through a focused training regimen on tracking while maintaining robust recognition performance.

  2. A key component of GoMatching is the LST-Matcher, which enhances tracking capabilities by integrating both long-term and short-term matching results through a Transformer architecture.

  3. The paper reports that GoMatching sets new records on multiple benchmark datasets, including ICDAR15-video, DSText, BOVText, and a newly proposed test set called ArTVideo, which features arbitrary-shaped text.

Weaknesses

  1. The paper does not provide a corresponding analysis of inference time; the reviewer recommends that such an analysis be presented and compared with previous work (TransDETR). It is not a problem if the inference speed does not reach state of the art; such an analysis would further support the paper.

  2. The ablation study only presents simple numerical comparisons of experimental results, which is overly simplistic. More detailed analysis is necessary, and it would be better if corresponding visual comparisons could be provided where possible.

Questions

See weaknesses.

Limitations

The reviewer did not see any discussion of limitations in the main text.

Author Response

We sincerely thank you for your positive and insightful comments. Below we address the key concerns and promise to incorporate all feedback in the revised version.

Q1: An analysis of inference time could be presented and compared with previous work (TransDETR).

A1: We included the comparison of inference FPS in Tab. 5 of the supplementary material, as shown in the following table. "Size" denotes the shorter side of the input image during inference. All FPS results are measured on a single 3090 GPU. With the same image size setting as TransDETR, GoMatching achieves a faster FPS while obtaining significant performance improvements on all metrics, particularly MOTA. When the shorter side is resized to 1000, GoMatching outperforms TransDETR by 11.08 MOTA at the cost of only 2.09 lower FPS. Overall, GoMatching achieves significantly better performance with faster FPS.

Method | MOTA | MOTP | IDF1 | FPS
TransDETR (Size: 800) | 60.96 | 74.61 | 72.80 | 12.69
GoMatching (Size: 800) | 68.51 | 77.52 | 76.59 | 14.41
GoMatching (Size: 1000) | 72.04 | 78.53 | 80.11 | 10.60
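For reference, end-to-end FPS of this kind is typically measured by timing the per-frame forward passes after a short warm-up. The snippet below is a rough illustrative sketch (not the actual measurement script), where `model` stands in for the full spotting-and-tracking pipeline.

```python
import time

def measure_fps(model, frames, warmup=5):
    """Rough end-to-end FPS over pre-loaded frames (illustrative only;
    a real measurement would also synchronize the GPU around the timers)."""
    for frame in frames[:warmup]:      # warm up caches / CUDA kernels
        model(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        model(frame)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```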

Q2: The related ablation study only presents simple numerical comparisons of experimental results, which appear overly simplistic. More detailed analysis is necessary. It would be better if corresponding visual results comparisons could be provided if possible.

A2: Thank you for your valuable advice. In the supplementary material, we provided a visual comparison between ST-Matcher and LST-Matcher in Fig. 8. We will add more analysis, as outlined below. With ST-Matcher, which only leverages short-term frames, image blur due to camera motion causes strong appearance changes and corresponding feature variation, which can lead to tracking candidate mismatches and ID switches. In contrast, by aggregating long-term information, which is overlooked by ST-Matcher, to alleviate the influence of short-term appearance changes, LST-Matcher handles the ID switch issue better.

Additionally, we add a visual comparison with and without the rescoring mechanism in the attached PDF. Without rescoring, the frozen text spotter from the static image domain tends to assign low confidence to small and blurry texts caused by camera motion. Adopting the rescoring mechanism enables better adaptation to unseen video datasets, distinguishing these text instances via confidence calibration and preventing them from being filtered out by the confidence threshold.
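To make this confidence-threshold filtering concrete, below is a minimal illustrative sketch (the MLP layout, the threshold value, and the selection rule are assumptions for illustration, not the exact implementation in the paper) of how a rescoring head recalibrates scores so that small or blurry instances are not discarded.

```python
import torch
import torch.nn as nn

class RescoringHead(nn.Module):
    """Hypothetical rescoring head: a small MLP that recalibrates the confidence
    of each text query embedding for the video domain (sizes are assumptions)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query_embeddings: torch.Tensor) -> torch.Tensor:
        # (num_queries, dim) -> recalibrated confidence scores in [0, 1]
        return self.mlp(query_embeddings).squeeze(-1).sigmoid()

def keep_candidates(frozen_scores, rescored_scores, threshold=0.3):
    """Keep a detection if either the frozen spotter's confidence or the rescored
    confidence clears the threshold (illustrative selection rule)."""
    return (frozen_scores >= threshold) | (rescored_scores >= threshold)

head = RescoringHead(dim=256)
_ = head(torch.randn(3, 256))  # shape (3,): recalibrated scores in practice

# Toy numbers: small/blurry instances receive low confidence from the frozen
# spotter but survive the threshold once their scores are recalibrated.
frozen = torch.tensor([0.82, 0.12, 0.18])
rescored = torch.tensor([0.85, 0.45, 0.40])
print(keep_candidates(frozen, rescored))  # tensor([True, True, True])
```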

Comment

Thank you for the authors' response, which addressed my concern. I will keep my score as it is.

Review
Rating: 4

This work uses a query-based image text spotter for video text spotting to solve the poor recognition issue. To achieve this, they add a rescoring head to restore the confidence of detected text instances and use transformers to enhance the tracking capability in videos.

Strengths

  1. Extend an image text-spotting mechanism to a video text-spotting scenario.
  2. Introduce a rescoring mechanism to mitigate the domain gap between image and video datasets.
  3. Long-short-term matching strategy to enhance the tracking capabilities.
  4. An extensive experimental study has been provided.

Weaknesses

  1. As a general comment, the recognized text in all the qualitative examples is not legible even when zooming in. Also, in the video provided in the supplementary material, I think some arbitrarily shaped text is misrecognized. The description of the methodology section is too confusing. Thanks to the authors for providing the code as supplementary material, which helped me understand the work.

  2. The input to the Image Text Spotter (i.e., DeepSolo) is a single frame, not multiple frames. Then what is the domain gap here? How does the proposed rescoring algorithm solve the domain gap? I think this rescoring mechanism is for highlighting small text.

  3. It is also not clear to me how this technique works in multi-language settings (i.e., English and Chinese). In DeepSolo this is completely dataset dependent, but here, in Figure 4, the model can detect English as well as Chinese text in a single frame. The mechanism of this multilingual OCR is not described in the paper.

  4. The motivation for expanding an image text spotter into a video text spotter is not clear to me, especially when working with a single frame. On the other hand, there is a recent work on "End-to-End multi-view scene text recognition" (Pattern Recognition, 2024) which considers three views together to enhance recognition performance.

Questions

  1. Lack of motivation.
  2. What is the domain gap between an image and a single video frame?
  3. How is multilingualism tackled?

Limitations

No, there is no explicit mention of the limitations.

Author Response

We sincerely thank you for your thoughtful and insightful comments. Below we address the key concerns and promise to incorporate all feedback in the revised version.

Q1: About the recognized text in the qualitative examples and the description of the methodology.

A1: Thank you for pointing this out. We will enlarge the recognized texts to make them clearly visible, as shown in the qualitative results provided in the rebuttal PDF. We will also endeavor to revise the methodology section and make it easier to understand, e.g., by adding more illustrations and more explanations for the equations.

Q2. What is the domain gap? How does the proposed rescoring algorithm solve it?

A2: The domain gap between video and image data arises from differences in form and data source. Specifically, video frames often contain more small and blurry text instances caused by camera distance and motion. In our method, DeepSolo trained on image data is employed and kept frozen. We found that the frozen DeepSolo tends to assign lower confidence to text instances in video data, leading to sub-optimal performance. Adopting the simple rescoring mechanism helps distinguish text instances through confidence calibration and prevents some instances from being filtered out by the confidence threshold, thus enabling better adaptation to unseen video data.

To further validate the above analysis and show how the rescoring head works, we provide the detection F-measure and AP results on ICDAR13-video. ICDAR13-video is selected because its testing GT is available for calculating AP. As can be seen, using rescoring achieves a 4.7% improvement in F-measure, mainly resulting from a large 9.1% enhancement in Recall. Considering the AP results, rescoring mainly improves the AP_S metric (AP for small instances with areas less than 32*32 pixels). This evidence validates that the rescoring mechanism bridges the domain gap by calibrating the confidence score and preventing small and blurry texts from being filtered out, leading to a better tracking candidate pool. We also provide qualitative comparisons in the attached PDF for visual support. We agree that the rescoring mechanism highlights small texts, but it also highlights instances of medium and large size, as demonstrated by the 2.6% and 2.1% enhancements on AP_M (AP for medium instances with areas between 32*32 and 96*96 pixels) and AP_L (AP for large instances with areas larger than 96*96 pixels).

Method | Precision | Recall | F-measure
TransDETR | 80.6 | 70.2 | 75.0
GoMatching w/o rescoring | 92.4 | 65.7 | 76.8
GoMatching | 89.5 (-2.9) | 74.8 (+9.1) | 81.5 (+4.7)

Method | AP | AP_S | AP_M | AP_L
w/o rescoring | 26.2 | 11.6 | 40.1 | 49.8
w/ rescoring | 29.3 (+3.1) | 15.5 (+3.9) | 42.7 (+2.6) | 51.9 (+2.1)
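For reference, the size buckets mentioned above follow the usual area-based convention. The following is a tiny illustrative helper (not part of the paper's code) that assigns an instance to AP_S / AP_M / AP_L by its pixel area.

```python
def size_bucket(area: float) -> str:
    """Assign a detection to an area-based size bucket:
    small < 32*32, medium in [32*32, 96*96), large >= 96*96 pixels."""
    if area < 32 * 32:
        return "AP_S"
    elif area < 96 * 96:
        return "AP_M"
    return "AP_L"

# Example: a 20x20 blurry text box counts toward AP_S, a 100x100 sign toward AP_L.
print(size_bucket(20 * 20))    # AP_S
print(size_bucket(100 * 100))  # AP_L
```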

Q3: It is also not clear to me how this technique works for multi-language settings (i.e. English and Chinese).

A3: DeepSolo released both English-only and bilingual (English and Chinese) model weights. For bilingual video text spotting on BOVText, we use the officially released bilingual DeepSolo weights instead of the English-only version. We do not introduce extra techniques in the rescoring mechanism or LST-Matcher. As detailed in Tab. 1(b), our GoMatching achieves a 41.5% improvement in MOTA over the previous SOTA method on the bilingual BOVText dataset, further underscoring the scalability and potential of our proposed baseline for multilingual text spotting. In the future, one could expand the character table of the text spotter or explore Mixture-of-Experts to assemble different recognizers for different languages. Our method has the potential to seamlessly transfer a multilingual image text spotter into a multilingual video text spotter at low cost.

Q4: The motivation for expanding an image text spotter to a video text spotter is not clear to me.

A4: In this paper, we focus on video text spotting (VTS), which requires text detection, recognition, and tracking across multiple frames. We observe that the SOTA VTS model exhibits inferior text recognition on oriented and curved text, limiting real-world applications. These bottlenecks have not been identified by previous VTS methods. In the static image realm, image text spotters (which cannot perform tracking) have significantly improved recognition ability. Comparing video methods with image methods, we summarize two key issues of existing VTS models: 1) the inferior model architecture, particularly the recognition part, and 2) the lack of diversity in training data, such as the inclusion of few arbitrarily shaped texts. However, collecting and annotating a large-scale video training set with abundant curved texts is extremely costly. Thus, we propose to efficiently turn an image text spotter into a video text spotter, leveraging the merits of the off-the-shelf architecture and the valuable knowledge learned from diverse image data. Subsequently, we design a simple rescoring mechanism to adapt the image text spotter to the video domain at low training cost, and introduce LST-Matcher to equip it with strong tracking ability. Please also refer to response A1 to Reviewer 1HqK.

E2EMVSTR [1] offers a novel and practical method for enhancing scene text recognition. It inspires us to improve video text spotting by mining semantic consistency across frames and better distinguishing different tracklets. We will cite this paper and add a discussion of potential improvements.

[1] An End-to-End Model for Multi-View Scene Text Recognition. PR, 2024.

Comment

I appreciate the authors' efforts in addressing my concerns, particularly in providing the new experimental results as requested. The explanation of the domain gap is not satisfactory: they are processing a single frame at a time, not a batch of frames, so there is no time factor, and both inputs are in the pixel domain. Maybe some text is a little blurry or occluded, but there are lots of recent works that have tackled these issues. So I would like to keep my rating the same as before.

Comment

We sincerely appreciate the reviewer's comments and would like to clarify the concept of domain gap in our work. It is widely recognized that the domain gap refers to differences in data distributions between the source and target domains, which can exist even when processing a single frame at a time. These differences can arise from variations in camera settings or other factors that affect the pixel distribution of the input data. Therefore, the absence of a time factor does not negate the presence of a domain gap. Besides, as mentioned in Sec. 4.2 (Training Setting) of the paper, DeepSolo processes multiple frames at a time, not a single frame. During inference, it is common practice that a tracking-by-detection method [1] or a tracking-by-query-propagation method [2] processes the video frame by frame, the same as we adopt in GoMatching.

Finally, the existence of recent works on enhancing performance on blurry text does not negate the value of our proposed simple baseline. We analyze the main bottleneck of existing VTS methods, not only the recognition ability on blurry text, and provide the first solution that turns an ITS model into a VTS model at low cost. The substantial improvements on extensive video benchmarks support our simple yet effective designs.

We hope the response can address your concern and would greatly appreciate it if the reviewer could provide further feedback.

[1] Global Tracking Transformers. CVPR, 2022.

[2] End-to-End Video Text Spotting with Transformer. IJCV, 2024.

Comment

Thanks for the response. However, it does not directly address my concern. The first paper they referred to is not related to video text spotting, and it uses multiple frames to analyze the connection to embedding learning and perform re-identification of frames (Section 4.5). The second paper is about video text spotting, and it also has a "Temporal Tracking Loss over Multiple Frames". There is no such thing in GoMatching, which makes me confused. I asked in my review whether the rescoring algorithm is used only for improvement on blurred texts. If yes, then which component keeps track of multiple frames? What is the tracking candidate pool? How is it created? Equations (2) and (3) in the paper for long- and short-term matching are not well constructed for prediction and tracking, and I did not find any motivation to use those equations for tracking. Keeping all this in mind, I would like to keep my rating the same and ask the authors to simplify the method and rewrite the motivation with proper evidence.

Comment

Our GoMatching adopts the tracking-by-detection paradigm. It involves two primary stages: 1) the detector (DeepSolo with the rescoring mechanism) first provides text spotting results for video frames; 2) the tracker (LST-Matcher) then associates these results across multiple frames to track each text instance detected by the detector and form trajectories. In the training phase, the LST-Matcher is exposed to multiple input frames in each batch, allowing it to learn the temporal relationships across frames. During inference, we use a memory bank to store the image text spotting results from multiple history frames provided by the detector, in the same way as [1]. The memory bank enables the LST-Matcher to associate the detected text instances in the current frame with the tracklets established in history frames (Sec. 3.3).
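A minimal sketch of this two-stage inference loop is given below (function names such as `detect_and_rescore` and `associate`, and the memory-bank size, are placeholders assumed for illustration, not the paper's exact implementation).

```python
from collections import deque

def run_video_text_spotting(frames, detector, matcher, memory_size=32):
    """Tracking-by-detection sketch: per-frame spotting followed by association
    of current detections with tracklets built from a memory bank of history frames."""
    memory_bank = deque(maxlen=memory_size)  # stores per-frame spotting results
    trajectories = {}                        # track_id -> list of (frame_idx, instance)

    for frame_idx, frame in enumerate(frames):
        # Stage 1: frozen image spotter + rescoring head produce per-frame instances.
        instances = detector.detect_and_rescore(frame)

        # Stage 2: LST-Matcher-style association of current instances with the
        # tracklets established over the frames kept in the memory bank.
        assignments = matcher.associate(instances, memory_bank)
        for instance, track_id in assignments:
            trajectories.setdefault(track_id, []).append((frame_idx, instance))

        memory_bank.append(instances)

    return trajectories
```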

As shown in the table provided in A2, the rescoring algorithm is used to adapt DeepSolo to the video data domain and enhance its text spotting ability on video data, not only for blurred texts. This enhancement improves the quality of the detected text instances in the memory bank (the tracking candidate pool), enabling the LST-Matcher to achieve better video text spotting results.

Our LST-Matcher is built upon similarity-based tracking with thresholded association scores. Equations (2) and (3) are employed to calculate the distributions of association scores (similarity scores) between the instances in the current frame and the trajectories across frames (two adjacent frames for short-term association in Eq. (2) and multiple frames for long-term association in Eq. (3)). These distributions are used to optimize the log-likelihood in the training phase, with the objective of maximizing the association scores of identical instances across frames while minimizing the association scores between different instances, as mentioned in Sec. 3.4. We will endeavor to revise the methodology section and make it easier to understand, as we committed in our first rebuttal.

[1] Global Tracking Transformers. CVPR, 2022.
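For illustration, the sketch below shows one common way to implement such similarity-based association training: a softmax over association scores followed by a negative log-likelihood loss. The dot-product similarity used here is an assumption for illustration and is not necessarily the exact form of Equations (2) and (3).

```python
import torch
import torch.nn.functional as F

def association_log_likelihood(current_feats, track_feats, gt_track_ids):
    """current_feats: (N, D) features of instances in the current frame.
    track_feats:   (T, D) features summarizing the existing trajectories.
    gt_track_ids:  (N,) index of the ground-truth trajectory for each instance.
    Returns the negative log-likelihood of assigning each instance to its
    ground-truth trajectory under a softmax over association (similarity) scores."""
    scores = current_feats @ track_feats.t()       # (N, T) association scores
    log_probs = F.log_softmax(scores, dim=-1)      # distribution over trajectories
    nll = -log_probs[torch.arange(len(gt_track_ids)), gt_track_ids].mean()
    return nll

# Toy usage: 3 detected instances, 4 candidate trajectories, 16-d features.
cur = torch.randn(3, 16)
trk = torch.randn(4, 16)
gt = torch.tensor([2, 0, 3])
print(association_log_likelihood(cur, trk, gt).item())
```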

Review
Rating: 4

The paper describes an approach for text spotting in videos. The method uses a text spotting method for still images to detect text instances frame by frame, and the output is further processed by a specific module to find associations between text instances across adjacent frames. In addition, a new dataset for video text spotting with curved text instances is presented. Experimental results are given for this new dataset and for several standard benchmarks for video text spotting.

Strengths

  • The paper proposes a simple approach to obtain the associations between text instances in different frames obtained with a standard text spotting method for single images. Although I am not very familiar with architectures for video text spotting, the contribution seems interesting and it seems to have the potential to be used with different image text spotting methods.
  • Experimental results show that the proposed method performs well and also that the different components of the approach contribute positively to the global performance of the method.

Weaknesses

The paper is based on two main claims: one about the gap between detection and recognition and potential problems with the optimization of association, and the other about the lack of curved text in training data. I do not see through the paper how the proposed method and dataset help to solve these problems. Experimental results are good, but they are not put in relation with the initial claims, and it is not shown how the proposed method helps to alleviate these specific problems. On the other hand, a new dataset with curved text is proposed, but this dataset is not used to enrich the training data in the general setting, and thus it is not shown how it alleviates the lack of training data.

Questions

  • The base DeepSolo method is kept frozen during training. What is the reason for that? Have you also tried training this part and seen what results are obtained? Some analysis and discussion about this would be interesting.

Limitations

There is no specific discussion of the limitations of the method.

Author Response

We sincerely thank you for your thoughtful and insightful comments. Below we address the key concerns and promise to incorporate all feedback in the revised version.

Q1. How do the proposed method and dataset help to solve the gap between detection and recognition, the optimization conflict, and the lack of curved text in training data?

A1: In this paper, we started by analyzing the bottleneck of the SOTA video text spotter (TransDETR) and found its inferior recognition ability, particularly on curved text. Comparing VTS to ITS methods, which have substantially promoted recognition ability, we summarize two key aspects of the detection-recognition gap: 1) the inferior model architecture, particularly the recognition part, and 2) the lack of diversity in video data. However, collecting large-scale video training data with diverse curved text is extremely time- and labor-consuming. Thus, we are motivated to resort to the off-the-shelf recognition model of the SOTA image text spotter (e.g., DeepSolo) and leverage its excellent capability in text recognition, especially for curved text. We find that DeepSolo already delivers superior recognition performance on video data in the zero-shot setting (Fig. 1). We propose to efficiently turn an image spotter (which cannot conduct tracking) into a video spotter. By leveraging a frozen DeepSolo, we not only promote video text recognition ability but also preserve the knowledge from diverse image data, indirectly alleviating the lack of curved text in video. Additionally, keeping DeepSolo frozen prevents potential optimization conflicts between text spotting and tracking, and helps focus training efforts more on tracking. Subsequently, we introduce a rescoring head to better adapt DeepSolo to the video domain at low cost, and design the LST-Matcher to equip it with strong tracking ability. Finally, the substantial improvements on extensive video benchmarks support these effective designs.

As for our proposed ArTVideo, it is used for evaluating video text spotting on curved text, filling this gap in the video field for the first time and facilitating subsequent research. In future work, we can try to leverage several foundation models and SOTA specialists to filter and label a high-quality, large-scale, and diverse video text dataset, directly addressing the lack of curved-text video training data.

Q2. What are the reasons for keeping DeepSolo frozen during training?

A2: There are three key reasons. 1) Preserving the valuable knowledge learned from diverse image data. As shown in Fig. 1(a), we found DeepSolo has superior recognition performance in zero-shot testing on video data, particularly for curved texts. When we unfroze DeepSolo on video data, we observed an unexpected performance drop, as demonstrated in the table in A3. This indicates that there are knowledge forgetting or overfitting issues when fine-tuning on video data with relatively low text shape and vocabulary diversity. Thus, to better preserve the knowledge learned from diverse image data, we keep it frozen. 2) Saving computational resources. Compared to image text spotting, video text spotting requires additional temporal-level learning. On each GPU, GoMatching needs 6 frames to learn temporal relationships. Unfreezing DeepSolo would lead to more computational overhead and higher GPU memory consumption. For example, unfreezing all parameters of DeepSolo would cause an unaffordable out-of-memory issue. In contrast, freezing DeepSolo only requires 3 GPU hours of training on ICDAR15-video using one 3090 GPU. 3) Focusing more on tracking optimization. As video text spotting involves three sub-tasks, keeping DeepSolo frozen can prevent potential optimization conflicts between text spotting (detection and recognition) and tracking.

Q3. Have you tried training also DeepSolo?

A3: Thank you for your suggestion. Actually, in Appendix F, we have discussed the influence of different training strategies (including unfreezing DeepSolo) on ICDAR15-video. The results are shown below:

Index | Training Setting | MOTA (↑) | MOTP (↑) | IDF1 (↑)
1 | 'Only LST-Matcher' (default) | 72.04 | 78.53 | 80.11
2 | First 'Only DeepSolo', then 'Only LST-Matcher' | 70.82 | 78.09 | 79.64
3 | 'End-to-End', DeepSolo's Decoder ('0.001') | 71.48 | 79.14 | 78.98
4 | 'End-to-End', DeepSolo's Decoder ('0.01') | 70.15 | 78.17 | 77.67
5 | 'End-to-End', DeepSolo's Decoder ('0.1') | 68.03 | 75.46 | 77.16
In the table, 'Only DeepSolo' and 'Only LST-Matcher' respectively refer to fine-tuning of DeepSolo and LST-Matcher while keeping other modules fixed. 'End-to-End' denotes training both DeepSolo and LST-Matcher in an end-to-end manner. '0.001', '0.01', and '0.1' represent the ratios of the DeepSolo decoder learning rate relative to the base learning rate. Due to constraints in training resources, for DeepSolo, we only unfroze its decoder during end-to-end training.

We observe that fine-tuning DeepSolo on the video dataset (row 2) leads to performance decline compared to the default setting (row 1), caused by the low diversity and quality of the video data. Moreover, when training GoMatching end-to-end (row 3 to 5), the performance gradually declines as the decoder's learning rate increases. This is likely due to conflicts among tasks. Therefore, it would be worthwhile to explore more effective multi-task optimization strategies in future works.
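For reference, per-module learning-rate ratios like '0.001', '0.01', and '0.1' above are typically realized via optimizer parameter groups. The sketch below is illustrative only, with assumed module names and base learning rate rather than the actual training configuration.

```python
import torch

def build_optimizer(model, base_lr=1e-4, decoder_lr_ratio=0.001):
    """Assign the (unfrozen) text-spotter decoder a reduced learning rate while the
    LST-Matcher uses the base learning rate (module names are placeholders)."""
    param_groups = [
        {"params": model.lst_matcher.parameters(), "lr": base_lr},
        {"params": model.deepsolo_decoder.parameters(), "lr": base_lr * decoder_lr_ratio},
    ]
    return torch.optim.AdamW(param_groups, lr=base_lr, weight_decay=1e-4)
```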

Comment

Dear Reviewer 1HqK,

Thanks again for your diligent effort in reviewing our submission! We have carefully addressed the concerns raised and conducted the requested experiments. As the discussion phase deadline is approaching, we sincerely hope you can consider positively recommending our work if your concerns are solved. If you still have further comments/suggestions, please don't hesitate to let us know.

Best regards,

Authors of Paper 2848

Comment

I appreciate the responses given by the authors to my comments. I do not have any further comments, but I am not yet convinced about the relation between the proposed work and the motivation claimed in the introduction.

Review
Rating: 6

This paper adopts the idea of tracking-by-detection and applies it to the task of video text spotting. The proposed algorithm is built on top of a SOTA image text spotting model, and the authors' contributions lie in the tracking part. They design LST-Matcher to integrate both short-term and long-term matching results. Experiments on multiple real datasets show that the proposed method outperforms state-of-the-art video text spotting approaches.

Strengths

S1. Applying the idea of tracking-by-detection in the task of video spotting is interesting and useful.

S2. Experimental results are impressive.

Weaknesses

W1. The authors adopted the powerful image text spotting model DeepSolo and compared it with TransDETR. This may cause an issue that requires further investigation, i.e., how much of the performance improvement was brought by DeepSolo? In multi-object tracking, the performance of the object detector plays an important role in the overall performance. Therefore, it becomes difficult to judge the contributions brought by LST-Matcher.

W2. Multi-object tracking is a very crowded research area, and the literature review in this paper misses a lot of recent works. My question is: since there are so many works in the tracking-by-detection category, why did the authors eventually choose the technical solution in LST-Matcher? Can we adopt the idea of other works such as ByteTrack or its successors? I noticed that TencentOCR incorporates ByteTrack and demonstrates better performance than GoMatching.

W3. Following W2, TencentOCR exhibits better video text spotting results. So what is the drawback of TencentOCR in real practice? Is it less efficient than GoMatching?

W4. In TransDETR, detection results are also reported. It would be better to report the quality of the detection task. Even though detection is not part of the contributions in this paper, it can still help explain W1 and show the contribution brought by DeepSolo.

Questions

Please address the concerns in W1, W2, W3 and W4.

Limitations

The authors failed to discuss the limitations of the work.

Author Response

We sincerely thank you for your positive and insightful comments. Below we address the key concerns and promise to incorporate all feedback in the revised version.

Q1. How much performance improvement was brought by DeepSolo? It would be better to report the quality of detection task.

A1: Thanks for your suggestion. Following TransDETR, we provide the detection performance on ICDAR13 as follows:

Method | Precision | Recall | F-measure
Free [1] | 79.7 | 68.4 | 73.6
TransDETR [2] | 80.6 | 70.2 | 75.0
GoMatching w/o rescoring (DeepSolo) | 92.4 (+11.8) | 65.7 (-4.5) | 76.8 (+1.8)
GoMatching | 89.5 (+8.9) | 74.8 (+4.6) | 81.5 (+6.5)

Without the rescoring mechanism, GoMatching outputs the original zero-shot results of DeepSolo, resulting in a 4.5% decrease in Recall and only a 1.8% improvement in F-measure compared to TransDETR. This is due to the domain gap between image data and video data. For detection, directly adopting a frozen image text spotter leads to low confidence and consequently a relatively low Recall on video data. DeepSolo mainly promotes the recognition performance, especially for curved text, as evidenced in Fig. 1(a) of the main paper. Using the rescoring mechanism, GoMatching achieves a 9.1% improvement in Recall compared to DeepSolo and a 6.5% F-measure enhancement compared to TransDETR. These improvements validate the effectiveness of the simple rescoring mechanism in alleviating the domain gap and leading to a better tracking candidate pool. The impressive results of GoMatching are not merely attributed to the introduction of a robust image text spotter.

[1] Free: A Fast and Robust End-to-End Video Text Spotter. IEEE TIP, 2020.

[2] End-to-End Video Text Spotting with Transformer. IJCV, 2024.

Q2. Why authors choose the technical solution as in LST-Matcher? Can we adopt the idea of ByteTrack?

A2: Thanks for your thoughtful question. There are two key aspects: 1) performance, and 2) simplicity. From the performance perspective, when replacing the LST-Matcher with ByteTrack, the MOTA significantly decreases from 72.04% to 65.05%, demonstrating the superior tracking capability of LST-Matcher. In addition, LST-Matcher only contains two Transformer encoder-decoder blocks, consuming little computational resources and inference time. We provide the proportion of inference time for each component on ICDAR15-video in the following table. Within the whole pipeline, LST-Matcher consumes only 5.43% of the inference time.

DeepSolo | Rescoring Mechanism | LST-Matcher | Other (pre/post-processing)
87.78% | <0.01% | 5.43% | 6.79%

Q3. What is the drawback of TencentOCR in real practice?

A3: TencentOCR is a solution with remarkable performance in the DSText competition. However, as described in Sec. 4.3 of the main paper, it ensembles the results of several models with multiple backbone architectures: DBNet and Cascade Mask R-CNN are employed for detection, while Parseq and ByteTrack are adopted for recognition and tracking, respectively. Ensembling the results of the integrated pipeline under different backbone settings requires more deployment and computational cost, limiting its use in real practice, particularly for real-time response. Besides, the model weights of TencentOCR are not released. In contrast, as shown in Tab. 1(c) of the main paper and Tab. 5 in the supplementary material, our GoMatching outperforms TencentOCR in terms of MOTA on DSText (22.83% vs. 22.44%) with a simpler structure and runs faster than TransDETR (14.41 FPS vs. 12.69 FPS on ICDAR15-video under the same image size setting). This demonstrates that GoMatching is a simple, strong, and efficient baseline for video text spotting.

Comment

I appreciate the efforts paid by the authors to address my concerns, especially providing new experimental results as required. However, in response A1, regarding the detection performance in the provided table, DeepSolo presents the highest precision and low recall. It looks like the performance was not well tuned toward a better F1-score. This result leaves my major concern unresolved, so I will keep my original rating.

Comment

We regret that our first response did not fully resolve your major concerns, and we try our best to address them as follows.

The F-measure, which considers both Precision and Recall, is defined by the formula:

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Due to the domain gap between image data and video data, the original DeepSolo tends to assign lower confidence to text instances in video data. These low-confidence instances are more likely to be filtered out by the confidence threshold, resulting in poor Recall and a sub-optimal F-measure. Our rescoring mechanism effectively adapts DeepSolo to the video domain at low cost, leading to a 9.1% improvement in Recall and a 4.7% F-measure enhancement compared to the original DeepSolo. These improvements allow DeepSolo with the rescoring mechanism to provide a better tracking candidate pool for subsequent tracking. Furthermore, when we replaced the LST-Matcher with ByteTrack, the MOTA significantly decreased by 6.99% (72.04% → 65.05%), highlighting the superior tracking capability of our proposed LST-Matcher. Together, the impressive results achieved by GoMatching are attributed not merely to DeepSolo, but to the synergistic effect of our proposed methods.
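As a quick numerical check of the formula with the figures from the detection table in A1 (illustrative only):

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of Precision and Recall."""
    return 2 * precision * recall / (precision + recall)

# Figures from the detection table on ICDAR13-video in A1:
print(round(f_measure(92.4, 65.7), 1))  # 76.8  (GoMatching w/o rescoring)
print(round(f_measure(89.5, 74.8), 1))  # 81.5  (GoMatching with rescoring)
```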

Author Response

We sincerely thank the reviewers for their insightful reviews and kind support for our work. We are encouraged that the reviewers appreciate the interesting contribution and idea (Reviewer 1HqK, Lmjj), the good and impressive results (Reviewer 1HqK, Lmjj, BgWx), and the potential for wider usage with different image text spotting methods (Reviewer 1HqK).

We provide detailed responses to each reviewer respectively, and we promise to incorporate all feedback in the revised version.

Explanation of supplementary PDF:

Regarding the supplementary PDF, we provide larger and clearer visual results of GoMatching and further analyze the effectiveness of our proposed method, following the suggestions from Reviewers xYhx and BgWx.

Final Decision

The manuscript received mixed reviews. The authors' response added new information and helped clarify many of the outstanding issues raised by the reviewers.

Overall, the reviewers agree that the contribution is technically sound. The experimental results, and the new results presented during the rebuttal, are good and convincing, although I agree with reviewer 1HqK that they could be better put in relation with the initial claims in the introduction.

I agree with the authors that there is typically a non-negligible domain gap between static scene text images and video frames due to encoding, motion blur and a number of other effects.

Overall, taking into account the reviews, and the lengthy discussion between the authors and reviewers, I consider that the paper is worth presenting in the conference, and encourage the authors to incorporate the key points of this discussion / reviews in the final manuscript.