A Benchmark Study For Limit Order Book (LOB) Models and Time Series Forecasting Models on LOB Data
Abstract
Reviews and Discussion
The paper provides a benchmark to evaluate the performance of deep learning models on Limit Order Book (LOB) data, covering two key tasks: Mid-Price Trend Prediction (MPTP) and Mid-Price Return Forecasting (MPRF). The study utilizes an open-source stock LOB dataset (FI-2010) and a proprietary futures LOB dataset (CHF-2023). Additionally, the paper introduces an innovative architecture, the Cross-Variate Mixing Layer (CVML), which significantly enhances the predictive performance of existing time series models on LOB data.
Strengths
- It is the first to benchmark existing LOB models on the Mid-Price Return Forecasting (MPRF) task.
- The paper proposes a novel architecture, the Convolutional Cross-Variate Mixing Layer (CVML), as an add-on to existing deep learning multivariate time series models to significantly improve performance on LOB data.
Weaknesses
The paper's contributions are somewhat limited in scope, and it does not fully address the significant challenges and value propositions within the Limit Order Book (LOB) scenario for time series prediction:
- The paper falls short in providing a detailed description of the unique challenges that LOB presents for time series prediction. In particular, it lacks an articulation of these challenges in a way that informs the design of new LOB benchmarks.
- The paper lacks comprehensive evaluation metrics tailored to LOB data. Beyond conventional statistical indicators such as Mean Squared Error (MSE), Correlation (Corr), and R-squared (R²), there is a need for evaluation metrics or testing methods that better align with the characteristics of LOB data. This includes testing the model's handling of LOB data features (e.g., high-frequency updates and noise), the complexity of market microstructure in LOB data, and the model's robustness in the face of extreme market events, which is currently insufficiently tested.
- Despite the inclusion of futures data, the paper does not convincingly demonstrate the breadth and representativeness of the datasets used to cover various LOB scenarios, including foreign exchange and cryptocurrencies. A more robust justification is needed to explain how the selected datasets capture the range of challenges encountered in diverse LOB environments.
- The paper lacks a comparative analysis with existing methods, which is essential for validating the proposed CVML approach within the LOB domain. Without such comparisons, it is challenging to determine whether CVML provides a significant advantage over alternative methods in tackling the specific challenges of LOB time series prediction.
Questions
- CVML enhances the ability of time series models to capture cross-variate and temporal correlations. This type of method seems not to be limited to the LOB scenario; is it also applicable to other scenarios? What specific advantages does CVML offer in the context of LOB that make it particularly effective in this domain? Additionally, the authors have not compared CVML with other related methods. Could a comparison be made with other similar methods to demonstrate the advantages of the CVML approach?
- While the paper utilizes stock and futures datasets, it is unclear whether these two types of datasets are sufficient to encapsulate the breadth of challenges present in various Limit Order Book (LOB) scenarios, including foreign exchange and cryptocurrencies. Can the authors provide a more detailed discussion on the specific challenges that LOB presents and how these compare to other LOB contexts? Additionally, can the authors demonstrate that the current datasets are comprehensive enough to address the challenges across all types of LOB scenarios?
C5: CVML enhances the ability of time series models to capture cross-variate and temporal correlations. This type of method seems not to be limited to the LOB scenario; is it also applicable to other scenarios?
answer: Yes, it is applicable to other scenarios. To clearly demonstrate its ability to capture cross-variate and temporal correlations and its applicability to other scenarios, we conduct an experiment on a synthetic dataset with well-defined cross-variate correlations specified in the data generation process. The target is a synthetic electricity price; the other variates are electricity load, electricity production, and temperature. We generate the data so that the temperature affects the electricity load (e.g., cold and hot temperatures increase the load), while the load and the production affect the electricity price (e.g., higher load increases the price and higher production lowers it). The following results are consistent with the conclusion of the paper and show the efficacy of the CVML architecture.
For more details including the background and detailed generation process of the synthetic data, please refer to the answer to C5 for Reviewer zxYU (Part 3 (Cont)).
| Model | MSE (↓) K=1 | MSE (↓) K=5 | MSE (↓) K=10 | Corr (↑) K=1 | Corr (↑) K=5 | Corr (↑) K=10 | R² (↑) K=1 | R² (↑) K=5 | R² (↑) K=10 |
|---|---|---|---|---|---|---|---|---|---|
| PatchTST-CVML | 0.807 | 0.7538 | 0.7589 | 0.4298 | 0.485 | 0.48 | 0.1818 | 0.2346 | 0.2296 |
| PatchTST | 1.0508 | 1.0441 | 1.0382 | 0.3907 | 0.3915 | 0.3991 | -0.0654 | -0.0602 | -0.054 |
| DLinear-CVML | 0.7825 | 0.7733 | 0.8743 | 0.4585 | 0.4639 | 0.3364 | 0.2067 | 0.2147 | 0.1124 |
| DLinear | 0.8861 | 0.8847 | 0.8868 | 0.3188 | 0.3209 | 0.3159 | 0.1016 | 0.1017 | 0.0997 |
| iTransformer-CVML | 0.8544 | 0.8183 | 0.8179 | 0.4011 | 0.4235 | 0.4317 | 0.1337 | 0.1691 | 0.1697 |
| iTransformer | 1.1667 | 1.161 | 1.1706 | 0.339 | 0.3627 | 0.3628 | -0.1829 | -0.1789 | -0.1884 |
| TimeMixer-CVML | 0.7469 | 0.7704 | 0.7539 | 0.493 | 0.4693 | 0.4865 | 0.2427 | 0.2177 | 0.2347 |
| TimeMixer | 0.7622 | 0.7596 | 0.7781 | 0.4774 | 0.4782 | 0.4586 | 0.2272 | 0.2287 | 0.2101 |
C6: What specific advantages does CVML offer in the context of LOB that make it particularly effective in this domain?
answer: CVML offers several advantages for handling the unique characteristics of LOB data that make it particularly effective in this domain. Unlike standard time series data, LOB data has a low signal-to-noise ratio and complex interdependencies across multiple variables. Traditional Transformer-based models like PatchTST and iTransformer, while powerful for many time series applications, often struggle with LOB data due to its high noise levels and intricate cross-variate correlations.
CVML addresses these challenges by acting as an end-to-end learnable pre-processing module specifically tailored to amplify relevant signals before they reach the core prediction layers. Through a series of Conv1D layers with progressively increasing dilation, CVML performs two key functions (see the sketch after the two points below):
Cross-Variate Feature Extraction: By mixing features across different variates, CVML extracts critical interdependencies within the data, enabling more effective modeling of cross-variate relationships. This is essential for LOB data, where the cross-correlation across levels and the cross-correlations between price and volume at the same level are strong due to the trading mechanism of the market.
Temporal Signal Enhancement: The increasing dilation in successive Conv1D layers allows CVML to capture temporal patterns across a broader range of time steps, which helps in distilling useful temporal signals amidst high noise. This layer-by-layer dilation ensures that relevant information is not overshadowed by noise, providing a cleaner signal for downstream modeling.
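To make this structure concrete, here is a minimal PyTorch sketch of such a cross-variate mixing front-end. The layer count, kernel size, activation, and absence of residual connections here are illustrative placeholders, not the exact CVML configuration, which is specified in the paper and the released code.

```python
import torch
import torch.nn as nn

class CrossVariateMixer(nn.Module):
    """Sketch: stacked Conv1d layers applied over the variate (channel)
    dimension, with dilation doubling at each layer to widen the temporal
    receptive field. With groups=1, every output channel mixes all input
    variates, realizing cross-variate feature extraction."""

    def __init__(self, n_variates: int, n_layers: int = 3, kernel_size: int = 3):
        super().__init__()
        blocks = []
        for i in range(n_layers):
            d = 2 ** i  # progressively increasing dilation: 1, 2, 4, ...
            blocks += [
                nn.Conv1d(n_variates, n_variates, kernel_size,
                          dilation=d, padding=(kernel_size - 1) * d // 2),
                nn.GELU(),
            ]
        self.mix = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_variates); Conv1d expects (batch, channels, seq_len)
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

# Usage: prepend to any forecaster, e.g. y_hat = backbone(CrossVariateMixer(40)(x))
```

Because the module preserves the input shape, it can be prepended to any multivariate forecasting backbone without changing that backbone's architecture.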
Q1: Additionally, the authors have not compared CVML with other related methods. Could a comparison be made with other similar methods to demonstrate the advantages of the CVML approach?
answer: Answered in C4.
Q2: While the paper utilizes stock and futures datasets, it is unclear whether these two types of datasets are sufficient to encapsulate the breadth of challenges present in various Limit Order Book (LOB) scenarios, including foreign exchange and cryptocurrencies. Can the authors provide a more detailed discussion of the specific challenges that LOB presents and how these compare to other LOB contexts?
answer: Answered in C3.
Q3: Additionally, can the authors demonstrate that the current datasets are comprehensive enough to address the challenges across all types of LOB scenarios?
answer: Answered in C3.
Dear authors,
Thank you for the efforts and responses provided. Regarding CVML, it appears to be a potentially more universally effective method that could transcend the LOB scenario. If the authors could provide a more comprehensive theoretical analysis and experiments, it might be more persuasive. In terms of the LOB context, the current description and datasets still seem somewhat limited. From the perspective of this paper being positioned as a benchmark for LOB, it does not appear to be a contribution significant enough for ICLR. Therefore, my score remains unchanged.
C3: Despite the inclusion of futures data, the paper does not convincingly demonstrate the breadth and representativeness of the datasets used to cover various LOB scenarios, including foreign exchange and cryptocurrencies. A more robust justification is needed to explain how the selected datasets capture the range of challenges encountered in diverse LOB environments.
answer: The reason we include the futures data is to use a LOB dataset of a different asset than FI-2010 to show the limited generalizability of current LOB model architectures; we do not aim to provide comprehensive coverage of LOB assets. It would be extremely costly (millions of dollars) and unnecessary to include a comprehensive set of LOB datasets for our paper's research focus. As for the challenges encountered in diverse LOB environments, we showed the extreme market events in the CHF-2023 dataset and demonstrated their impact on model performance in our analysis in Section 5 of the paper. This shows that in the futures LOB, the data is more volatile and exhibits extreme events.
To further strengthen our conclusion in the paper, we conducted more experiments on a public cryptocurrency LOB dataset (https://www.kaggle.com/datasets/siavashraz/bitcoin-perpetualbtcusdtp-limit-order-book-data/data). We attach the results below. They show that the ranking of model performance on the crypto dataset also differs from that on the FI-2010 dataset, which is consistent with the conclusion of the paper.
| Model | K=1 | K=2 | K=3 | K=5 | K=10 | Avg |
|---|---|---|---|---|---|---|
| MLP | 92.4 (0.2) | 93.2 (0.1) | 93.8 (0.1) | 94.3 (0.1) | 95.3 (0.0) | 93.8 |
| LSTM | 86.2 (2.8) | 94.7 (0.2) | 95.2 (0.5) | 96.3 (0.1) | 97.2 (0.1) | 94.0 |
| CNN1 | 94.5 (0.7) | 96.3 (0.2) | 96.9 (0.2) | 96.9 (0.1) | 97.7 (0.1) | 96.5 |
| CTABL | 53.9 (0.2) | 64.3 (0.0) | 72.1 (0.2) | 80.9 (0.2) | 92.3 (0.2) | 72.7 |
| DeepLOB | 97.9 (0.1) | 98.4 (0.0) | 98.5 (0.1) | 98.1 (0.0) | 98.3 (0.0) | 98.3 |
| DAIN | 34.7 (0.3) | 53.3 (0.4) | 66.9 (0.6) | 79.8 (0.1) | 87.1 (0.1) | 64.4 |
| CNN-LSTM | 97.2 (0.1) | 97.7 (0.2) | 97.9 (0.1) | 97.7 (0.1) | 98.1 (0.0) | 97.7 |
| CNN2 | 97.1 (0.1) | 98.1 (0.0) | 98.2 (0.1) | 97.9 (0.1) | 98.1 (0.0) | 97.9 |
| TransLOB | 95.9 (0.8) | 98.2 (0.2) | 98.3 (0.2) | 98.0 (0.2) | 98.4 (0.1) | 97.8 |
| TLONBOF | 56.5 (1.0) | 70.0 (0.9) | 78.1 (0.7) | 88.2 (0.3) | 94.8 (0.1) | 77.5 |
| BinCTABL | 50.5 (0.4) | 60.3 (0.5) | 65.1 (1.7) | 73.1 (0.2) | 90.4 (0.4) | 67.9 |
| DeepLOBAtt | 97.6 (0.2) | 98.1 (0.5) | 98.7 (0.2) | 98.4 (0.1) | 98.5 (0.1) | 98.3 |
| DLA | 57.1 (2.0) | 69.2 (0.3) | 74.4 (0.4) | 81.0 (0.2) | 89.0 (0.2) | 74.2 |
C4: The paper lacks a comparative analysis with existing methods, which is essential for validating the proposed CVML approach within the LOB domain. Without such comparisons, it is challenging to determine whether CVML provides a significant advantage over alternative methods in tackling the specific challenges of LOB time series prediction.
answer: CVML is the first architecture proposed as an add-on to existing time series forecasting models to enhance their forecasting performance on the MPRF task. Thus, we demonstrate its efficacy by comparing each time series model's forecasting performance with that of its counterpart equipped with the CVML add-on. To demonstrate its effectiveness across different types of time series models (MLP-based, linear-based, Transformer-based), we conduct such comparisons for four recent SOTA time series forecasting models.
C2: The paper lacks comprehensive evaluation metrics tailored for LOB data. For instance, beyond conventional statistical indicators such as Mean Squared Error (MSE), Correlation (Corr), and R-squared (R²), there is a need for evaluation metrics or testing methods that better align with the characteristics of LOB data. This includes how to test the model's handling of LOB data features (e.g., high-frequency updates and noise issues), the complexity of market microstructure in LOB data, and insufficient testing of the model's robustness in the face of extreme market events.
answer: The metrics we used (error measurement: MSE, linear correlation measurement: Corr, and explained variance measurement: R²) are widely used in the literature on LOB prediction and time series prediction [1-14]. Since MPRF, the task our paper focuses on, is a regression problem (forecasting future mid-price returns), we chose these metrics to evaluate the quality of the models' forecasts.
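For reference, with $y_i$ the realized mid-price return, $\hat{y}_i$ the forecast, and $N$ test samples, these are the standard definitions:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \quad \mathrm{Corr} = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_i (y_i - \bar{y})^2}\,\sqrt{\sum_i (\hat{y}_i - \bar{\hat{y}})^2}}, \quad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.$$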
The model's handling of the LOB data features, the complexity of market microstructure in the LOB data, and the model's robustness in the face of extreme market events are all reflected in the quality of the model's forecasts or predictions. For instance, in the MPTP task, the models' performance is worse on the CHF-2023 dataset than on the FI-2010 dataset. In our analysis in Section 5 of the paper, we specifically showed the extreme market events in the CHF-2023 dataset and demonstrated their impact on model performance, which tests the models' robustness in the face of extreme market events.
[1] Gu, Shihao, Bryan Kelly, and Dacheng Xiu. "Empirical asset pricing via machine learning." The Review of Financial Studies 33.5 (2020): 2223-2273.
[2] Petropoulos, Fotios, et al. "Forecasting: theory and practice." International Journal of Forecasting 38.3 (2022): 705-871.
[3] Kelly, Bryan, and Dacheng Xiu. "Financial machine learning." Foundations and Trends® in Finance 13.3-4 (2023): 205-363.
[4] Xiao, Chenglin, Weili Xia, and Jijiao Jiang. "Stock price forecast based on combined model of ARI-MA-LS-SVM." Neural Computing and Applications 32.10 (2020): 5379-5388.
[5] Maciejowska, Katarzyna, and Rafał Weron. "Short- and mid-term forecasting of baseload electricity prices in the UK: The impact of intra-day price relationships and market fundamentals." IEEE Transactions on Power Systems 31.2 (2015): 994-1005.
[6] Ntakaris, Adamantios, et al. "Feature engineering for mid-price prediction with deep learning." IEEE Access 7 (2019): 82390-82412.
[7] Shahi, Tej Bahadur, et al. "Stock price forecasting with deep learning: A comparative study." Mathematics 8.9 (2020): 1441.
[8] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
[9] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
[10] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp. 11121–11128, 2023.
[11] Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and Jun Zhou. Timemixer: Decomposable multiscale mixing for time series forecasting. arXiv preprint arXiv:2405.14616, 2024.
[12] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. ICLR, 2023.
[13] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. AAAI, 2021.
[14] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. ICML, 2022.
Thank you for your reviews. We appreciate your comments and suggestions. We address them as follows.
C1: The paper falls short in providing a detailed description of the unique challenges that LOB presents for time series prediction. It particularly lacks an articulation of these challenges in a way that informs the design of new LOB benchmarks.
answer: The unique challenges that LOB presents are two-fold. The first is that although the LOB is a universal data structure across financial assets such as stocks, futures, and options, the underlying process generating the LOB data differs greatly between assets due to both the nature of the assets and their different trading rules. However, the existing benchmark of LOB models focuses only on stock data, which we hypothesized biases it toward stocks. This motivated us to demonstrate the necessity of benchmarking LOB models on data covering different assets. The benchmark results for the MPTP task support our hypothesis and confirm the value of evaluating LOB models on data spanning various assets.
The second challenge is the difficulty of forecasting LOB data and the cross-variate correlations within it. LOB data is extremely noisy and hard to forecast because whenever a pattern or trend emerges in the data, profit opportunities appear and are quickly captured by traders in the market. In addition, there exist clear and stable cross-variate correlations in the LOB data. The variates include the bid and ask prices and volumes at different levels of the limit order book, which are intrinsically correlated due to the trading mechanism. For example, when the best bid price increases, the best ask price is very likely to increase; similarly, when the volume on the bid side becomes large, the mid-price is more likely to increase. This second challenge inspired us to create a benchmark of time series forecasting models on LOB data to test their forecasting performance in dealing with cross-variate correlations.
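For reference, a minimal formalization of the quantities discussed above (the notation here is ours): with $p^{b,1}_t$ and $p^{a,1}_t$ the best bid and ask prices at time $t$, the mid-price and the horizon-$K$ mid-price return are

$$m_t = \frac{p^{b,1}_t + p^{a,1}_t}{2}, \qquad r_{t,K} = \frac{m_{t+K} - m_t}{m_t}.$$

MPTP classifies the direction of the (possibly smoothed, as in FI-2010) future mid-price movement, while MPRF regresses the value of $r_{t,K}$.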
Dear Reviewer,
Thank you for the suggestions. Regarding CVML, could you please be more specific about what kinds of theoretical analysis and experiments would make the paper more persuasive?
Regarding the LOB context, we propose a benchmark for both the MPTP and MPRF tasks. In the paper, we cover stock and futures assets; in the rebuttal, we also include a cryptocurrency asset. For MPTP, we show the limited generalizability of current LOB model architectures and highlight the distinct underlying characteristics of LOB data from different assets, which has not been done before in the literature. Besides, our benchmark is the first on the MPRF task in the literature. We are also the first to benchmark state-of-the-art time series forecasting models on the MPRF task, bridging the gap between general-purpose and LOB-specific time series forecasting.
More importantly, our contribution is not limited to proposing a benchmark. A large part of it is a new architecture (CVML) that addresses the major issue demonstrated by the benchmark: recent SOTA time series prediction models show minimal prediction power on LOB datasets. Experiments show that our proposed CVML method substantially improves the performance of all major time series prediction architectures (Transformer-based, linear-based, MLP-based) on the LOB dataset. We also analyze the effect of CVML on the standard deviations of the inputs. Furthermore, we conduct ablation experiments to verify CVML's ability to capture cross-variate and temporal correlations.
We sincerely hope you can reconsider the significance of our work for ICLR.
Best regards,
Authors
Dear Authors,
I appreciate the effort put into expanding the datasets and evaluation metrics for the LOB scenario. However, I feel that the current benchmark lacks significant innovation, as it merely extends existing datasets and metrics without substantial novelty.
The CVML appears to be an innovative aspect of your work, but it does not seem to have specific considerations or designs tailored to the unique distribution characteristics of LOB time series data and the challenges of prediction within this domain. It looks like a general optimization method rather than one specifically crafted for LOB intricacies. Of course, if you could demonstrate that CVML provides significant improvements across various scenarios and is a sufficiently innovative approach, I believe it would constitute a contribution. However, the current work does not provide a thorough investigation of related methods, nor does it include comparisons with other scenario data, which I believe is necessary to establish its merit.
The paper proposes to evaluate existing LOB models on the mid-price return forecasting (MPRF) task using a proprietary futures LOB dataset and presents the first benchmark study evaluating SOTA time series forecasting models on the MPRF task. Moreover, the paper proposes an architecture of convolutional cross-variate mixing layers (CVML) as an add-on to any deep learning multivariate time series model to significantly enhance MPRF performance on LOB data. However, I think the contribution of the paper is not sufficient, the innovation is average, and the relevant descriptions are relatively weak. In addition, there are some serious issues with the writing, such as the omission of Sections 3 and 5 from the description of the paper's organization.
Strengths
The paper proposes to evaluate existing LOB models on the mid-price return forecasting (MPRF) task using a proprietary futures LOB dataset and presents the first benchmark study evaluating SOTA time series forecasting models on the MPRF task. Moreover, the paper proposes an architecture of convolutional cross-variate mixing layers (CVML) as an add-on to any deep learning multivariate time series model to significantly enhance MPRF performance on LOB data.
Weaknesses
I think the contribution of the paper is not sufficient, the innovation is average, and the relevant descriptions are relatively weak. In addition, there are some serious issues with the writing, such as the omission of Sections 3 and 5 from the description of the paper's organization.
Questions
How can the advantages of the benchmark proposed in this article be verified without comparison against relevant benchmarks?
Thank you for your reviews. We appreciate your comments and suggestions. We address them as follows.
C1: I think the contribution of the paper is not sufficient, the innovation is average, and the relevant descriptions are relatively weak.
answer: Thank you for raising this concern. Our major contributions are as follows.
- We evaluate existing LOB models on the MPTP task using a proprietary futures LOB dataset, CHF-2023, to assess the transferability of models designed for stock LOB data across asset classes.
- We present the first benchmark on the MPRF task in the literature.
- We pioneer benchmarking state-of-the-art time series forecasting models on the MPRF task, bridging the gap between general-purpose and LOB-specific time series forecasting.
- We propose a novel Cross-Variate Mixing Layer (CVML) as an add-on to existing time series models, enhancing their MPRF performance by an average of 244.9%.
Considering our contributions of both the first MPRF benchmark and a new architecture achieving strong performance, we believe our paper makes significant and innovative contributions. We kindly ask for more specific feedback and hope to discuss this further with you.
C2: There are some serious issues with the writing, such as the omission of Sections 3 and 5 from the description of the paper's organization.
answer: Thank you for pointing out the typos in the description of the paper organization. We have fixed them in the revision.
C3: How can the advantages of the benchmark proposed in this article be verified without comparison against relevant benchmarks?
answer: We can verify the advantages of the proposed benchmark in the following ways. Its major contributions are two-fold. The first is that we show the limited generalizability of current LOB model architectures and highlight the distinct underlying characteristics of LOB data from different assets. To verify this, we provide further results on a public crypto LOB dataset (https://www.kaggle.com/datasets/siavashraz/bitcoin-perpetualbtcusdtp-limit-order-book-data/data). As shown below, the ranking of the models differs from the rankings on the FI-2010 and CHF-2023 datasets, which is consistent with our conclusion in the paper.
| Model | K=1 | K=2 | K=3 | K=5 | K=10 | Avg |
|---|---|---|---|---|---|---|
| MLP | 92.4 (0.2) | 93.2 (0.1) | 93.8 (0.1) | 94.3 (0.1) | 95.3 (0.0) | 93.8 |
| LSTM | 86.2 (2.8) | 94.7 (0.2) | 95.2 (0.5) | 96.3 (0.1) | 97.2 (0.1) | 94.0 |
| CNN1 | 94.5 (0.7) | 96.3 (0.2) | 96.9 (0.2) | 96.9 (0.1) | 97.7 (0.1) | 96.5 |
| CTABL | 53.9 (0.2) | 64.3 (0.0) | 72.1 (0.2) | 80.9 (0.2) | 92.3 (0.2) | 72.7 |
| DeepLOB | 97.9 (0.1) | 98.4 (0.0) | 98.5 (0.1) | 98.1 (0.0) | 98.3 (0.0) | 98.3 |
| DAIN | 34.7 (0.3) | 53.3 (0.4) | 66.9 (0.6) | 79.8 (0.1) | 87.1 (0.1) | 64.4 |
| CNN-LSTM | 97.2 (0.1) | 97.7 (0.2) | 97.9 (0.1) | 97.7 (0.1) | 98.1 (0.0) | 97.7 |
| CNN2 | 97.1 (0.1) | 98.1 (0.0) | 98.2 (0.1) | 97.9 (0.1) | 98.1 (0.0) | 97.9 |
| TransLOB | 95.9 (0.8) | 98.2 (0.2) | 98.3 (0.2) | 98.0 (0.2) | 98.4 (0.1) | 97.8 |
| TLONBOF | 56.5 (1.0) | 70.0 (0.9) | 78.1 (0.7) | 88.2 (0.3) | 94.8 (0.1) | 77.5 |
| BinCTABL | 50.5 (0.4) | 60.3 (0.5) | 65.1 (1.7) | 73.1 (0.2) | 90.4 (0.4) | 67.9 |
| DeepLOBAtt | 97.6 (0.2) | 98.1 (0.5) | 98.7 (0.2) | 98.4 (0.1) | 98.5 (0.1) | 98.3 |
| DLA | 57.1 (2.0) | 69.2 (0.3) | 74.4 (0.4) | 81.0 (0.2) | 89.0 (0.2) | 74.2 |
The second contribution is that we provide the first benchmark of LOB models and time series forecasting models on the MPRF task. To verify this, we release our code at https://www.dropbox.com/scl/fo/micrv6zmy9kar99ktyywj/AFuHwsYVzO1TEHIwYnE59XU?rlkey=lo80px6jggx2ph4vizgw1eys8&st=woqoisnn&dl=0. The dataset we used, FI-2010, is also public (https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649).
This paper presents a comprehensive benchmark study evaluating deep learning models on limit order book (LOB) data, focusing on mid-price trend prediction (MPTP) and mid-price return forecasting (MPRF) tasks. The experiments present several interesting findings. To overcome noise in LOB data for the MPRF task, the paper introduces a novel architecture called convolutional cross-variate mixing layers (CVML) to enhance MPRF performance. The empirical results demonstrate the effectiveness of CVML.
Strengths
- The paper bridges gaps in the literature regarding MPRF
- The experiments are extensive and conducted across many facets, e.g., comprehensive datasets, multiple metrics, and different horizons.
- The proposed CVML module is effective and can be added to existing time series forecasters to enhance performance.
Weaknesses
- The importance of the LOB problem is not adequately emphasized. Also, the scope of the LOB problem considered is not very extensive.
- Why should we care about the MPTP and MPRF tasks? Are there any other tasks we need to emphasize about the LOB problem? The paper should make this clear.
Questions
Please refer to the weaknesses section.
Thank you for your reviews. We appreciate your comments and suggestions. We address them as follows.
C1: The importance of LOB problem is not adequately emphasized. Also, the scope of LOB problem is not very extensive.
answer: The importance of the limit order book (LOB) prediction problem lies in its capacity to advance understanding in both financial markets and broader time series modeling. For financial markets, the LOB is the most fundamental data structure; thus, LOB data is the most significant (and often the only) data we can use to analyze the financial market. Beyond financial analysis, the LOB problem also matters to the broader time series forecasting field. LOB data is a type of multivariate time series, and its unique value to the field is two-fold.
- The first is the difficulty. LOB data is extremely noisy and hard to predict because any easily detectable pattern or trend represents a profit opportunity and is quickly captured and digested by the market.
- The second is its obvious cross-correlation nature. The variates in the LOB data include the bid and ask prices and volumes at different levels of the limit order book. These variates are intrinsically correlated due to the trading mechanism. For example, when the best bid price increases, the best ask price is very likely to increase, because an increasing best bid price means more buying power is entering the market, so more buy market orders will consume the orders on the ask side, raising the best ask price and, in turn, the mid-price. Similarly, when the volume on the bid side becomes large, the mid-price is also more likely to increase. Compared to time series data from other domains, these correlations are more stable and obvious, and the ability to model such cross-variate correlations is useful across the time series forecasting field.
Thus, given these two properties, research results on LOB data are also very valuable to the whole time series forecasting field.
To further demonstrate our arguments, we attach the forecasting results on a synthetic dataset outside the finance field. The synthetic dataset has well-defined cross-variate correlations specified in the data generation process. Its target is a synthetic electricity price; the other variates are electricity load, electricity production, and temperature. We generate the data so that the temperature affects the electricity load (e.g., cold and hot temperatures increase the load), while the load and the production affect the electricity price (e.g., higher load increases the price and higher production lowers it). The following results are consistent with the conclusion of the paper and show the efficacy of the CVML architecture.
For more details, including the detailed generation process of the synthetic data, please refer to the answer to C5 for Reviewer zxYU (Part 3 (Cont.)).
| Model | MSE (↓) K=1 | MSE (↓) K=5 | MSE (↓) K=10 | Corr (↑) K=1 | Corr (↑) K=5 | Corr (↑) K=10 | R² (↑) K=1 | R² (↑) K=5 | R² (↑) K=10 |
|---|---|---|---|---|---|---|---|---|---|
| PatchTST-CVML | 0.807 | 0.7538 | 0.7589 | 0.4298 | 0.485 | 0.48 | 0.1818 | 0.2346 | 0.2296 |
| PatchTST | 1.0508 | 1.0441 | 1.0382 | 0.3907 | 0.3915 | 0.3991 | -0.0654 | -0.0602 | -0.054 |
| DLinear-CVML | 0.7825 | 0.7733 | 0.8743 | 0.4585 | 0.4639 | 0.3364 | 0.2067 | 0.2147 | 0.1124 |
| DLinear | 0.8861 | 0.8847 | 0.8868 | 0.3188 | 0.3209 | 0.3159 | 0.1016 | 0.1017 | 0.0997 |
| iTransformer-CVML | 0.8544 | 0.8183 | 0.8179 | 0.4011 | 0.4235 | 0.4317 | 0.1337 | 0.1691 | 0.1697 |
| iTransformer | 1.1667 | 1.161 | 1.1706 | 0.339 | 0.3627 | 0.3628 | -0.1829 | -0.1789 | -0.1884 |
| TimeMixer-CVML | 0.7469 | 0.7704 | 0.7539 | 0.493 | 0.4693 | 0.4865 | 0.2427 | 0.2177 | 0.2347 |
| TimeMixer | 0.7622 | 0.7596 | 0.7781 | 0.4774 | 0.4782 | 0.4586 | 0.2272 | 0.2287 | 0.2101 |
C2: Why should we care about the MPTP and MPRF tasks? Are there any other tasks we need to emphasize about the LOB problem?
answer: MPTP and MPRF are the most important machine learning problems on LOB data. In supervised learning, the two primary branches are classification and regression: MPTP corresponds to classification, while MPRF aligns with regression. Our paper focuses specifically on exploring supervised learning methods within these frameworks. However, there are certainly additional tasks that can be developed on LOB data, such as anomaly detection, which highlights the potential for even broader applications. We see significant opportunities in this dataset for expanding research beyond these two core tasks.
The paper presents a study of many popular time series models for predicting future prices of assets. The important differentiator is that the models compared in this benchmark use order book features (such as prices/quantities at different order book levels, per-level mid-price, etc.) in addition to standard features such as asset price.
Two main tasks are considered: Mid-Price Trend Prediction (a classification problem, where the model has to predict if the price will go up/down/stay) and Mid-Price Return Forecasting, where the model has to predict future mid-price return at some forecasting horizon (judged by Mean Square Error).
The study uses one open-source dataset (FI-2010 stock dataset) and one proprietary futures dataset (CHF-2023) which is not publicly available. Authors are providing the results for a variety of popular time-series models.
In addition, the authors propose an architecture of convolutional cross-variate mixing layers (CVML) that improves various time series models' mid-price return forecasting performance.
Strengths
- The topic of LOB models is very interesting, and this study (unlike other existing benchmarks) compares the models on futures data
- Good variety of the tested models and used features.
- Running experiments on a proper-size, realistic dataset (CHF-2023)
- The CVML architecture is a simple add-on that seems to perform well based on the experiments (however, it probably needs to be confirmed whether it still improves models that were tuned beyond just batch_size / learning rate).
Weaknesses
- No code provided to reproduce the results, even on the open-source dataset.
- One of the datasets is extremely small, almost toy-size (10 trading days, 5 companies)
- The other dataset is not available to the public, so it will be impossible to reproduce results for it
- Very limited hyperparameter search (either just keeping original values or searching only learning rate / batch size). The study would be much more convincing if the models were given some computational budget to tune hyperparameters for the CHF-2023 dataset.
- Overall, the paper and proposed techniques feel very specific to trading and would be better suited for a trading-related workshop or conference, rather than ICLR.
Questions
- The order book in the experiments was divided into 5 levels, what was the algorithm for determining the mapping layer -> value_range? (motivation: if I would like to apply your method for other order book, how to decide the split)
- Did you try dividing the order book into more than 5 layers?
- Table 1 says in its description "(...) a set of 26 Time-insensitive features", but looking at the table it seems there are 2 * 5 (u2) + 4 * 4 (u3) + 4 (u4) + 2 (u5) = 32 features in total?
- Why do some of the models (like DAIN) have a lookback size much smaller than all the other models?
- CVML results on CHF-2023 seem to be missing, did you run the equivalent of Table 6, but with CHF-2023?
Synthetic Time Series Data Generation Process
The synthetic dataset consists of multiple time series components with complex relationships and non-linear interactions. The data generation process follows a hierarchical structure in which intermediate variables influence the final target variable (electricity price). There are six variates in total: electricity price, load, production, temperature, supply_margin, and price_volatility. These variates have cross-variate correlations. For example, an increasing electricity load leads to an increasing electricity price, an increasing electricity production leads to a decreasing electricity price, and when the temperature is very high or very low, the load increases. We introduce the detailed generation process of each variate as follows.
Temperature Generation: Let $t$ denote the time index for hourly observations. The temperature time series is generated as
$$T_t = T^{\text{base}}_t + \varepsilon^{T}_t + H_t + C_t,$$
where:
- $T^{\text{base}}_t$ is a deterministic seasonal baseline
- $\varepsilon^{T}_t$ represents random fluctuations
- $H_t$ represents heat waves
- $C_t$ represents cold snaps

Load Generation: The electricity load is modeled as the sum of a base load, a daily pattern, AC usage, heating, a seasonal pattern, a weekday effect, and random noise:
$$L_t = L^{\text{base}} + D_t + A_t\,\mathbb{1}[T_t > \tau_{\text{hot}}] + G_t\,\mathbb{1}[T_t < \tau_{\text{cold}}] + S_t + W_t + \varepsilon^{L}_t,$$
where $\mathbb{1}[\cdot]$ represents the indicator function.

Production Generation: The production capacity is generated through multiple components:
$$P_t = P^{\text{base}}_t + M_t + O_t + \varepsilon^{P}_t,$$
where:
- $M_t$ represents maintenance periods
- $O_t$ represents outages

Electricity Price Generation (Target Variable): The final price is generated through a complex interaction of components:
$$Y_t = f(L_t, P_t) + R_t + \mathrm{Spike}_t + \mathrm{Drop}_t + \varepsilon^{Y}_t,$$
where:
- $R_t$ represents regime changes
- $\mathrm{Spike}_t$ represents price spikes, triggered if any of the following conditions are met:
  - $L_t / P_t$ is high (high demand relative to production)
  - $T_t > \tau_{\text{hot}}$ (very hot weather)
  - $T_t < \tau_{\text{cold}}$ (very cold weather)
  - $P_t$ drops sharply (significant production issues)
- $\mathrm{Drop}_t$ represents sudden price drops
- $\varepsilon^{Y}_t$ represents price noise

Derived Features
Additional features (supply_margin and price_volatility) are computed from the primary variables, e.g.,
$$\text{price\_volatility}_t = \sigma_{24}(Y_t),$$
where $\sigma_{24}(\cdot)$ represents the rolling standard deviation over a 24-hour window.
As shown above, this synthetic time series dataset, although defined in a specific domain (electricity) for interpretability, includes generic patterns found across many domains of multivariate time series. Specifically, it includes: multiple seasonalities (e.g., weekly/monthly sales cycles, weekday/weekend differences in web traffic), non-linear relationships, temporary anomalies and recoveries (e.g., electricity outages), periods of high volatility followed by calmer periods (e.g., viral content spread on social media), multi-factor interactions (e.g., inventory-price-demand relationships in supply chains), and stochastic components that mirror real-world randomness. A model that performs well on this synthetic dataset is therefore meaningful for the broader time series forecasting field.
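For concreteness, a compact, runnable sketch of such a generator follows. All coefficients, probabilities, and thresholds below are illustrative placeholders, not the exact values used for the reported results, and the regime-change and weekday-effect components are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 24 * 365  # one year of hourly observations
t = np.arange(T)

# Temperature: seasonal baseline + noise + sparse heat waves / cold snaps
temp = 15 + 10 * np.sin(2 * np.pi * t / (24 * 365)) + rng.normal(0, 2, T)
temp += 8 * (rng.random(T) < 0.01)   # heat waves
temp -= 8 * (rng.random(T) < 0.01)   # cold snaps

# Load: base + daily pattern + AC/heating responses to temperature + noise
load = 100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5, T)
load += 3 * np.maximum(temp - 25, 0)  # AC usage in hot weather
load += 3 * np.maximum(5 - temp, 0)   # heating in cold weather

# Production: base capacity with random outages + noise
prod = 120 + rng.normal(0, 3, T)
prod -= 30 * (rng.random(T) < 0.005)  # outages

# Price: supply/demand balance + condition-triggered spikes + sudden drops
price = 50 + 0.5 * load - 0.3 * prod + rng.normal(0, 3, T)
spike = (load / prod > 0.95) | (temp > 32) | (temp < -5)
price += 40 * spike                    # price spikes under market stress
price -= 30 * (rng.random(T) < 0.003)  # sudden price drops

# Derived features
supply_margin = prod - load
price_volatility = np.lib.stride_tricks.sliding_window_view(price, 24).std(axis=1)
# rolling 24-hour standard deviation (length T - 23)
```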
C5: The paper and proposed techniques feel very specific to trading
answer: The proposed technique is not specific to trading. High-frequency LOB data is a type of highly noisy multivariate time series that includes both temporal correlations (between the historical mid-price and the future mid-price) and cross-variate correlations (between the mid-price and the volumes at different levels). Modeling cross-variate and temporal correlations is important across many time series forecasting domains, so a technique that shows efficacy on MPRF is valuable to the entire field. Time series forecasting matters in various domains including electricity, traffic, weather, health, astronomy, retail, supply chain management, finance, etc.
To further demonstrate our proposed technique's value for time series forecasting in other domains, we run the same CVML experiments on a synthetic electricity dataset. The target is the electricity price, and the features include the electricity load, the electricity production, and the temperature. It is not hard to see that there exist both temporal and cross-variate correlations (the temperature affects the load, and the load and the production affect the price).
The following results on this dataset are consistent with the paper's results. It further shows the value of our proposed technique for general time series forecasting. We also attach a detailed explanation of the data generation process underneath the result table.
| Model | MSE (↓) K=1 | MSE (↓) K=5 | MSE (↓) K=10 | Corr (↑) K=1 | Corr (↑) K=5 | Corr (↑) K=10 | R² (↑) K=1 | R² (↑) K=5 | R² (↑) K=10 |
|---|---|---|---|---|---|---|---|---|---|
| PatchTST-CVML | 0.807 | 0.7538 | 0.7589 | 0.4298 | 0.485 | 0.48 | 0.1818 | 0.2346 | 0.2296 |
| PatchTST | 1.0508 | 1.0441 | 1.0382 | 0.3907 | 0.3915 | 0.3991 | -0.0654 | -0.0602 | -0.054 |
| DLinear-CVML | 0.7825 | 0.7733 | 0.8743 | 0.4585 | 0.4639 | 0.3364 | 0.2067 | 0.2147 | 0.1124 |
| DLinear | 0.8861 | 0.8847 | 0.8868 | 0.3188 | 0.3209 | 0.3159 | 0.1016 | 0.1017 | 0.0997 |
| iTransformer-CVML | 0.8544 | 0.8183 | 0.8179 | 0.4011 | 0.4235 | 0.4317 | 0.1337 | 0.1691 | 0.1697 |
| iTransformer | 1.1667 | 1.161 | 1.1706 | 0.339 | 0.3627 | 0.3628 | -0.1829 | -0.1789 | -0.1884 |
| TimeMixer-CVML | 0.7469 | 0.7704 | 0.7539 | 0.493 | 0.4693 | 0.4865 | 0.2427 | 0.2177 | 0.2347 |
| TimeMixer | 0.7622 | 0.7596 | 0.7781 | 0.4774 | 0.4782 | 0.4586 | 0.2272 | 0.2287 | 0.2101 |
Thank you for your reviews. We appreciate your comments and suggestions. We address them as follows.
C1: No code provided to reproduce the results, even on the open-source dataset.
answer: Please refer to this anonymous link to access our code to reproduce the results: https://www.dropbox.com/scl/fo/micrv6zmy9kar99ktyywj/AFuHwsYVzO1TEHIwYnE59XU?rlkey=lo80px6jggx2ph4vizgw1eys8&st=woqoisnn&dl=0
C2: One of the datasets is extremely small, almost toy-size (10 trading days, 5 companies)
answer: While the FI-2010 dataset covers only 5 companies over a span of 10 trading days, it captures high-frequency trading data of nanosecond resolution, amounting to approximately 400,000 time-series data points. LOB datasets are extremely valuable, and this dataset is substantial for high-frequency analysis; it is widely recognized and utilized in limit order book prediction studies [8-14].
In comparison, the time series benchmark datasets (as shown in Table 4 of [1]) used by most of the existing works [1-7] on time series prediction models are much smaller than this dataset. These datasets have no more than 40 thousand time-series data points, less than 1/100 of the size of the dataset we used.
[1] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
[2] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
[3] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp. 11121–11128, 2023.
[4] Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and Jun Zhou. Timemixer: Decomposable multiscale mixing for time series forecasting. arXiv preprint arXiv:2405.14616, 2024.
[5] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. ICLR, 2023.
[6] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. AAAI, 2021.
[7] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. ICML, 2022.
[8] Tran, Dat Thanh, et al. "Temporal attention-augmented bilinear network for financial time-series data analysis." IEEE Transactions on Neural Networks and Learning Systems 30.5 (2018): 1407-1418.
[9] Zhu, Bangzhu, et al. "Forecasting carbon price using a multi-objective least squares support vector machine with mixture kernels." Journal of Forecasting 41.1 (2022): 100-117.
[10] Passalis, Nikolaos, et al. "Deep adaptive input normalization for time series forecasting." IEEE Transactions on Neural Networks and Learning Systems 31.9 (2019): 3760-3765.
[11] Tsantekidis, Avraam, et al. "Using deep learning for price prediction by exploiting stationary limit order book features." Applied Soft Computing 93 (2020): 106401.
[12] Akyildirim, Erdinc, et al. "Forecasting mid-price movement of Bitcoin futures using machine learning." Annals of Operations Research 330.1 (2023): 553-584.
[13] Kolm, Petter N., Jeremy Turiel, and Nicholas Westray. "Deep order flow imbalance: Extracting alpha at multiple horizons from the limit order book." Mathematical Finance 33.4 (2023): 1044-1081.
[14] Ntakaris, Adamantios, et al. "Feature engineering for mid-price prediction with deep learning." IEEE Access 7 (2019): 82390-82412.
C3: The other dataset is not available to the public, so it will be impossible to reproduce results for it
answer: Thanks for raising this concern. It is true that we cannot release our proprietary futures dataset. However, there exist other public LOB datasets, such as this crypto dataset (https://www.kaggle.com/datasets/siavashraz/bitcoin-perpetualbtcusdtp-limit-order-book-data/data). Our conclusion from benchmarking on the FI and CHF datasets, namely that current LOB model architectures have limited generalizability and that the underlying characteristics of LOB data differ across assets, is still reproducible on the crypto dataset. In the results shown below, the ranking of LOB model performance is again different from the one on the FI-2010 dataset, which is consistent with the conclusion of the paper.
| Model | K=1 | K=2 | K=3 | K=5 | K=10 | Avg |
|---|---|---|---|---|---|---|
| MLP | 92.4 (0.2) | 93.2 (0.1) | 93.8 (0.1) | 94.3 (0.1) | 95.3 (0.0) | 93.8 |
| LSTM | 86.2 (2.8) | 94.7 (0.2) | 95.2 (0.5) | 96.3 (0.1) | 97.2 (0.1) | 94.0 |
| CNN1 | 94.5 (0.7) | 96.3 (0.2) | 96.9 (0.2) | 96.9 (0.1) | 97.7 (0.1) | 96.5 |
| CTABL | 53.9 (0.2) | 64.3 (0.0) | 72.1 (0.2) | 80.9 (0.2) | 92.3 (0.2) | 72.7 |
| DeepLOB | 97.9 (0.1) | 98.4 (0.0) | 98.5 (0.1) | 98.1 (0.0) | 98.3 (0.0) | 98.3 |
| DAIN | 34.7 (0.3) | 53.3 (0.4) | 66.9 (0.6) | 79.8 (0.1) | 87.1 (0.1) | 64.4 |
| CNN-LSTM | 97.2 (0.1) | 97.7 (0.2) | 97.9 (0.1) | 97.7 (0.1) | 98.1 (0.0) | 97.7 |
| CNN2 | 97.1 (0.1) | 98.1 (0.0) | 98.2 (0.1) | 97.9 (0.1) | 98.1 (0.0) | 97.9 |
| TransLOB | 95.9 (0.8) | 98.2 (0.2) | 98.3 (0.2) | 98.0 (0.2) | 98.4 (0.1) | 97.8 |
| TLONBOF | 56.5 (1.0) | 70.0 (0.9) | 78.1 (0.7) | 88.2 (0.3) | 94.8 (0.1) | 77.5 |
| BinCTABL | 50.5 (0.4) | 60.3 (0.5) | 65.1 (1.7) | 73.1 (0.2) | 90.4 (0.4) | 67.9 |
| DeepLOBAtt | 97.6 (0.2) | 98.1 (0.5) | 98.7 (0.2) | 98.4 (0.1) | 98.5 (0.1) | 98.3 |
| DLA | 57.1 (2.0) | 69.2 (0.3) | 74.4 (0.4) | 81.0 (0.2) | 89.0 (0.2) | 74.2 |
Performance comparison of different models across various prediction horizons (K). Values show percentage accuracy with standard deviation in parentheses.
C4: Very limited hyperparameter search (either just keeping original values or searching only learning rate / batch size).
answer: When doing hyperparameter tuning, we found that learning rate and batch size were the two major hyperparameters that significantly affected the performance. Thus, we only searched over learning rate and batch size. This choice is also consistent with another major related work [1] on the LOB benchmark topic.
Regarding hyperparameters that pertain to the model architecture, such as the number of layers or hidden dimension, we chose not to change them because a different setting of such hyperparameters produces a new model that is no longer the one proposed in the original paper. As a benchmark paper, we believe we should compare only the model architectures proposed in the original papers.
[1] Prata, Matteo, et al. "Lob-based deep learning models for stock price trend prediction: a benchmark study." Artificial Intelligence Review 57.5 (2024): 1-45.
C6: The order book in the experiments was divided into 5 levels, what was the algorithm for determining the mapping layer -> value_range? (motivation: if I would like to apply your method for other order book, how to decide the split)
answer: An order book's levels are not user-defined, and the price levels are not quantizations of a continuous range. The exchange defines the minimum price change (tick size) of a trading asset. For example, if the minimum price change is 0.25, then the price gap between each pair of consecutive levels is 0.25, and traders cannot place orders at a price that is not a multiple of 0.25. Thus, the total number of levels in the order book of an asset depends on the market; there can be more than 100 levels if orders sit at very low or very high prices. Price levels are simply a way to view traders' orders at different prices for an asset. The top-N levels are most relevant to changes in the mid-price and contain most of the orders. Because full LOB data is extremely large, data vendors usually provide only the top-N (N could be 5 or 10) levels of the LOB.
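As a concrete illustration with hypothetical numbers: for a tick size of 0.25 and a best bid/ask of 100.00/100.25, the top-5 levels on each side are simply consecutive multiples of the tick:

```python
tick = 0.25  # minimum price change defined by the exchange
best_bid, best_ask = 100.00, 100.25
bid_levels = [best_bid - i * tick for i in range(5)]  # [100.0, 99.75, 99.5, 99.25, 99.0]
ask_levels = [best_ask + i * tick for i in range(5)]  # [100.25, 100.5, 100.75, 101.0, 101.25]
```

There is no mapping algorithm to design: the levels are fixed by the tick size and the prices at which traders have placed orders.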
C7: Did you try dividing the order book into more than 5 layers?
answer: The data we have has 5 levels. As explained in the answer above, the price levels are intrinsically discrete, so we cannot subdivide them.
C8: Table 1 says in its description "(...) a set of 26 Time-insensitive features", but looking at the table it seems there are 2 * 5 (u2) + 4 * 4 (u3) + 4 (u4) + 2 (u5) = 32 features in total?
answer: The confusion might come from u3. The index i ranges from 1 to n−1, with n=5. The first two features do not change with i, so they count as 2 rather than 2 * 4. The remaining two features change with i, so they count as 2 * 4. Thus, 2 * 5 + (2 + 2 * 4) + 4 + 2 = 26.
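In equation form:
$$\underbrace{2 \times 5}_{u_2} + \underbrace{2 + 2 \times (n-1)}_{u_3,\ n=5} + \underbrace{4}_{u_4} + \underbrace{2}_{u_5} = 10 + 10 + 4 + 2 = 26.$$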
C9: Why do some of the models (like DAIN) have a lookback size much smaller than all the other models?
answer: The lookback size of the models is kept the same as the original papers of those models. We have also tested increasing the lookback size of DAIN but it turned out that the original one was the best.
C10: CVML results on CHF-2023 seem to be missing, did you run the equivalent of Table 6, but with CHF-2023?
answer: In the original submission, for the MPRF task, we thought it sufficient to use FI-2010 to show the efficacy of CVML. Since the MPRF benchmark has a different purpose than the MPTP benchmark and does not require comparing the results on FI-2010 with CHF-2023, we did not run the equivalent on CHF-2023. We are happy to also run the equivalent of Table 6 with CHF-2023. However, CHF-2023, comprising 5 years of time series at 500ms resolution, is much larger than the FI-2010 dataset, and we lack sufficient time during the rebuttal period. We will include the results in a future revision of the paper.
Dear Reviewer,
With the rebuttal period coming to a close, we wanted to follow up regarding our responses. We've carefully considered and responded to your reviews, and we hope our response has adequately addressed your concerns. If you have any outstanding questions, we are happy to address them further. If you find our response satisfactory, we kindly invite you to consider raising your score.
We're truly grateful for the considerable time you've invested in reviewing our paper. Your detailed feedback has been invaluable in enhancing our work.
Best regards,
Authors
The authors present a comprehensive benchmark to evaluate the performance of deep learning models on limit order book (LOB) data, with four contributions: evaluating existing LOB models on a proprietary futures LOB dataset; providing the first benchmark of existing LOB models on the mid-price return forecasting (MPRF) task; introducing the first benchmark study evaluating state-of-the-art (SOTA) time series forecasting models on the MPRF task; and proposing an architecture of convolutional cross-variate mixing layers (CVML) as an enhancement to any deep learning multivariate time series model.
Strengths
The paper establishes a clear purpose, aiming to benchmark deep learning models on limit order book (LOB) data across different asset types. It identifies the primary gap in research, particularly for mid-price return forecasting (MPRF) in LOB data, and articulates four main contributions, including the novel Cross-Variate Mixing Layer (CVML) architecture, which adds value to the paper. This paper is also easy to follow with enough details on the performance of the benchmarks.
Weaknesses
1: As a benchmark paper, although this paper provides many comparison methods and results, I feel the datasets (2 datasets) selected for this study are not enough. LOBs come from different markets, which normally produce different kinds of results. The 2 datasets are from the NASDAQ Nordic stock market and SC (Crude Oil), one of China's most liquid futures contracts. For papers mainly focused on technical achievements, these might be enough datasets and experiments to showcase superior performance, but for benchmark papers, two markets are not enough. I would recommend adding more market datasets, such as Bitcoin Limit Order Book (LOB) data, which has many public resources.
2: I have questions on CVML,
2.1 The authors claim that CVML can be added on any deep learning multivariate time series model to significantly enhance MPRF performance on LOB data. However, in Table 6, the performance variances are really different, from 3.1% to 958.3%. I'm not sure this can be called "can be added on any deep learning methods". Also, there are only 4 methods; I think it is not safe to claim "can be added on any deep learning methods" with experiments on only 4 methods.
2.2 "an average improvement of 244.9%" is the conclusion on CVML form the authors. I don't understand how the authors calculate this average. Can you explain me how you make such calculation?
Questions
Please see the weakness part for my questions.
Thank you for your reviews. We appreciate your comments and suggestions. We address them as follows.
C1: I would recommend adding more market datasets, such as Bitcoin Limit Order Book (LOB) Data, which has many public resources
answer: We further conducted the benchmark experiment on a Bitcoin LOB dataset (https://www.kaggle.com/datasets/siavashraz/bitcoin-perpetualbtcusdtp-limit-order-book-data/data). We attach the results below. The ranking of LOB model performance differs from the one on the FI-2010 dataset, which is consistent with the conclusion of the paper.
| Model | K=1 | K=2 | K=3 | K=5 | K=10 | Avg |
|---|---|---|---|---|---|---|
| MLP | 92.4 (0.2) | 93.2 (0.1) | 93.8 (0.1) | 94.3 (0.1) | 95.3 (0.0) | 93.8 |
| LSTM | 86.2 (2.8) | 94.7 (0.2) | 95.2 (0.5) | 96.3 (0.1) | 97.2 (0.1) | 94.0 |
| CNN1 | 94.5 (0.7) | 96.3 (0.2) | 96.9 (0.2) | 96.9 (0.1) | 97.7 (0.1) | 96.5 |
| CTABL | 53.9 (0.2) | 64.3 (0.0) | 72.1 (0.2) | 80.9 (0.2) | 92.3 (0.2) | 72.7 |
| DeepLOB | 97.9 (0.1) | 98.4 (0.0) | 98.5 (0.1) | 98.1 (0.0) | 98.3 (0.0) | 98.3 |
| DAIN | 34.7 (0.3) | 53.3 (0.4) | 66.9 (0.6) | 79.8 (0.1) | 87.1 (0.1) | 64.4 |
| CNN-LSTM | 97.2 (0.1) | 97.7 (0.2) | 97.9 (0.1) | 97.7 (0.1) | 98.1 (0.0) | 97.7 |
| CNN2 | 97.1 (0.1) | 98.1 (0.0) | 98.2 (0.1) | 97.9 (0.1) | 98.1 (0.0) | 97.9 |
| TransLOB | 95.9 (0.8) | 98.2 (0.2) | 98.3 (0.2) | 98.0 (0.2) | 98.4 (0.1) | 97.8 |
| TLONBOF | 56.5 (1.0) | 70.0 (0.9) | 78.1 (0.7) | 88.2 (0.3) | 94.8 (0.1) | 77.5 |
| BinCTABL | 50.5 (0.4) | 60.3 (0.5) | 65.1 (1.7) | 73.1 (0.2) | 90.4 (0.4) | 67.9 |
| DeepLOBAtt | 97.6 (0.2) | 98.1 (0.5) | 98.7 (0.2) | 98.4 (0.1) | 98.5 (0.1) | 98.3 |
| DLA | 57.1 (2.0) | 69.2 (0.3) | 74.4 (0.4) | 81.0 (0.2) | 89.0 (0.2) | 74.2 |
Performance comparison of different models across various prediction horizons (K). Values show percentage accuracy with standard deviation in parentheses.
C2: However, in Table 6, the performance variances are really different, from 3.1% to 958.3%. I'm not sure this can be called "can be added on any deep learning methods".
answer: The percentages in Table 6 refer to each metric's percentage improvement from adding CVML to each model. For each model and metric, we calculate the percentage difference between the sum of the model's performance with CVML across all 5 horizons and the sum of that model's performance without CVML across all 5 horizons. These numbers are improvements, not variances. The improvements are positive for all metrics and all models, which demonstrates the efficacy of CVML.
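Written out, one way to express the calculation described above: for a model $m$ and a given metric, with $s_{m,k}$ and $s^{\text{CVML}}_{m,k}$ the scores without and with CVML at horizon $k \in \{1, 2, 3, 5, 10\}$, each entry in Table 6 is

$$\Delta_m = \frac{\sum_k s^{\text{CVML}}_{m,k} - \sum_k s_{m,k}}{\bigl|\sum_k s_{m,k}\bigr|} \times 100\%,$$

where the absolute value in the denominator ensures a positive improvement is reported even when the baseline R² sum is negative.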
C3: There are only 4 methods. I think it is not safe to claim "can be added on any deep learning methods" with experiments on only 4 methods.
answer: Thanks for bringing this up. We selected these 4 methods because they are representative. They are the most recent time series models with SOTA performance, and they represent the most effective types of neural network architecture: Transformers (PatchTST, iTransformer), multi-layer perceptrons (TimeMixer), and linear models (DLinear). Besides, PatchTST and iTransformer represent the two possible ways of applying the self-attention mechanism in the time series domain: along the time dimension or along the variate/feature dimension.
We agree that the language, "on any deep learning methods", is too absolute and overstated, and we thank you for pointing it out. However, although 4 models do not cover the whole time series forecasting literature, we think they are sufficiently representative to show that CVML can improve Transformer-based, MLP-based, and linear time series forecasting models, which cover the architecture types of all recent SOTA time series forecasting models.
C4: "an average improvement of 244.9%" is the conclusion on CVML form the authors. I don't understand how the authors calculate this average.
answer: The improvement here refers to the overall improvement in R². We treated the overall improvement as a comparison between the "team" of four CVML models and the "team" of four non-CVML models. For the CVML team, we summed the R² scores of all 4 time series models with CVML across all 5 horizons (4 x 5 = 20 scores). Likewise, for the non-CVML team, we summed the R² scores of all 4 models without CVML across all 5 horizons (20 scores). Then we calculated the percentage change between these two sums. Thank you for pointing this out; we realize that calling it an average improvement can be confusing. Another way of measuring the average improvement is to average the percentage numbers in Table 6: each of those numbers represents the average percentage change across 5 horizons, and averaging the 4 percentage numbers for R² gives the average percentage improvement of R² across all 4 time series models and all 5 horizons, which is 517.9% ((958.3% + 814.3% + 104.8% + 194.2%) / 4).
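In the notation from C2 above, the two aggregations for R² are

$$\Delta_{\text{overall}} = \frac{\sum_{m}\sum_{k} s^{\text{CVML}}_{m,k} - \sum_{m}\sum_{k} s_{m,k}}{\bigl|\sum_{m}\sum_{k} s_{m,k}\bigr|} = 244.9\%, \qquad \bar{\Delta} = \frac{1}{4}\sum_{m} \Delta_m = \frac{958.3\% + 814.3\% + 104.8\% + 194.2\%}{4} = 517.9\%.$$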
Dear Reviewer,
With the rebuttal period coming to a close, we wanted to follow up regarding our responses. We've carefully considered and responded to your reviews, and we hope our response has adequately addressed your concerns. If you have any outstanding questions, we are happy to address them further. If you find our response satisfactory, we kindly invite you to consider raising your score.
We're truly grateful for the considerable time you've invested in reviewing our paper. Your detailed feedback has been invaluable in enhancing our work.
Best regards,
Authors
This paper has been evaluated by 5 knowledgeable reviewers. They have unanimously recommended its rejection (including 2 straight rejects and 3 marginal rejects). The authors have provided a rebuttal, but it has not improved the assessment of the submission at its current stage. The reviewers agreed that the work has potential, but the innovation of the proposed new benchmark is too limited. Ways to improve the appeal of this work for possible future submissions include a more comprehensive theoretical analysis and experiments with a broader collection of LOB-related data, as well as venturing beyond LOB-specific applications.
Additional Comments on Reviewer Discussion
The reviewers did not engage in extensive conversation among themselves, though they did check what the other reviewers had to say and considered it in their possible re-assessment. Even though the rebuttal was appreciated and the additional results were in line with what the reviewers expected, the conclusion remained unchanged: the current benchmark lacks significance.
Reject