Improved Learning via k-DTW: A Novel Dissimilarity Measure for Curves
Abstract
We introduce k-DTW, a novel distance for polygonal curves that generalizes and provides the best of the gold standards: the DTW and Fréchet distances.
Reviews and Discussion
The paper proposes k-DTW (k-Dynamic Time Warping) as a dissimilarity measure between polygonal curves. The motivation is that many diverse datasets can be thought of as curves, and appropriately measuring the distance between them is a fundamental problem. Researchers have usually studied the Fréchet distance and DTW for this purpose, and typically some transformation of one curve into the other is involved. However, both aforementioned measures have their disadvantages: the Fréchet distance is very sensitive to outliers, whereas DTW is not a metric since it does not satisfy the triangle inequality.
To alleviate these issues with existing distances, the paper proposes k-DTW, which combines the advantages of the standard measures, the Fréchet distance and standard DTW (essentially interpolating between the two), without their disadvantages.
At a high level, the parameter k controls the degree of accuracy of the transformation performed between the two curves. Oftentimes, carrying out the whole transformation may be expensive and may also overfit to noise, whereas k-DTW only cares about the most important parts of the transformation, as measured by a small subset of size k, and ignores the rest. Although this may seem lossy, it offers some advantages.
The main contributions of the paper are:
- The first point the authors make is that k-DTW satisfies a strengthened triangle inequality compared to DTW and is thus closer to a proper metric, while retaining some robustness of DTW.
- The second point is that there is an exact algorithm, as well as a (1 + eps)-approximation, for k-DTW, using a parametric search for the k-th largest matched distance with standard DTW on modified distance matrices as a subroutine.
- Next, they show the first dimension-free learning bounds for clustering under k-DTW and a separation result showing that k-DTW has strictly smaller Rademacher and Gaussian complexity than DTW for clustering curves.
- Finally, the authors provide experiments showing the benefits of k-DTW over the other two measures in the settings of clustering and nearest neighbor classification, on synthetic and real-world data.
Questions for Authors
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes, except for the learning section (Section 4).
Experimental Design and Analysis
Yes.
Supplementary Material
No.
Relation to Existing Literature
Important relation, since k-DTW could improve efficiency for comparing curves.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
The conceptual message of the paper is that if we don't care about the whole computation of DTW or Fréchet, and rather focus on the most important k parts of the computation, then we can have good algorithms. The idea is that curves can be approximated by polygonal curves if we select some vertices on them and connect them with affine segments, essentially interpolating between the points on the curves.
The main definition introduced in the paper is Definition 2.2 with a parameter k: as k increases we care about more and more pairs of large distances between the two curves; specifically, for k = 1 we recover the Frechet distance, and for k large enough we recover the DTW distance.
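For concreteness, one plausible formalization of this definition (our paraphrase; the exact notation of Definition 2.2 in the paper may differ): for curves P = (p_1, ..., p_m) and Q = (q_1, ..., q_n), let T(P, Q) be the set of monotone traversals (warping paths) of the two vertex sequences; then

```latex
\mathrm{dtw}_k(P,Q) \;=\; \min_{T \in \mathcal{T}(P,Q)} \;
  \max_{S \subseteq T,\, |S| \le k} \; \sum_{(i,j) \in S} \lVert p_i - q_j \rVert,
```

i.e., each traversal is charged only for its k largest matched distances. For k = 1 this reduces to the (discrete) Fréchet distance, and for k at least the length of a traversal it reduces to DTW.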
I find the definition here very natural, and it comes as no surprise that once we focus on k terms in the summation, the triangle inequality will be respected up to a multiplicative factor of k (Lemma 2.3).
After introducing their k-DTW distance, the authors rule out some perhaps straightforward approaches. One notable point is that the k-DTW distance between two curves is NOT equal to just taking the largest k distances from the sum that yields their DTW distance.
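To make this remark concrete, the following brute-force reference implementation (our illustration for intuition, not the authors' code; exponential time, so tiny curves only) evaluates k-DTW directly from the definition and, for comparison, the "top-k along a DTW-optimal path" quantity, which by definition can only upper-bound k-DTW and, per the paper, does not equal it in general.

```python
def warping_paths(m, n):
    """Enumerate all monotone warping paths from (0, 0) to (m-1, n-1)."""
    def rec(i, j, prefix):
        if i == m - 1 and j == n - 1:
            yield prefix
            return
        if i + 1 < m:
            yield from rec(i + 1, j, prefix + [(i + 1, j)])
        if j + 1 < n:
            yield from rec(i, j + 1, prefix + [(i, j + 1)])
        if i + 1 < m and j + 1 < n:
            yield from rec(i + 1, j + 1, prefix + [(i + 1, j + 1)])
    yield from rec(0, 0, [(0, 0)])

def top_k_sum(path, d, k):
    """Sum of the k largest matched distances along one warping path."""
    return sum(sorted((d[i][j] for i, j in path), reverse=True)[:k])

def k_dtw_brute_force(P, Q, k):
    d = [[abs(p - q) for q in Q] for p in P]  # 1-d curves for simplicity
    return min(top_k_sum(t, d, k) for t in warping_paths(len(P), len(Q)))

P, Q = [0.0, 4.0, 9.0, 1.0], [0.0, 1.0, 8.0, 1.0]
d = [[abs(p - q) for q in Q] for p in P]
dtw_path = min(warping_paths(len(P), len(Q)),
               key=lambda t: sum(d[i][j] for i, j in t))
print(k_dtw_brute_force(P, Q, 1))                # k = 1: discrete Frechet distance
print(k_dtw_brute_force(P, Q, len(P) + len(Q)))  # large k: classical DTW
print(k_dtw_brute_force(P, Q, 2), top_k_sum(dtw_path, d, 2))  # k-DTW <= top-2 of a DTW path
```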
Their first main result is an exact algorithm with runtime proportional to the description size of the curves times the number of distinct pairwise distances. The analysis is non-trivial, but it follows from the top-k framework of (Bertsimas & Sim, 2003).
The second main result is a (1 + eps)-approximation with a better runtime. The authors essentially shave off the factor counting the distinct pairwise distances and replace it with (1/eps) times the logarithm of k (treating the accuracy eps as a fixed small number).
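A compact sketch of this algorithmic idea, as we read it from the summary above together with the cited top-k identity of Bertsimas & Sim (the sum of the k largest x_i equals the minimum over thresholds t >= 0 of k*t + sum_i max(x_i - t, 0)); this is our reconstruction, not the paper's pseudocode. Since the optimal threshold equals the k-th largest matched distance, trying every distinct pairwise distance (plus 0) gives the exact algorithm, and the (1 + eps)-approximation would instead try roughly (log k)/eps geometrically spaced thresholds.

```python
def dtw(cost):
    """Plain O(m*n) dynamic program for DTW on a precomputed cost matrix."""
    m, n = len(cost), len(cost[0])
    INF = float("inf")
    dp = [[INF] * n for _ in range(m)]
    dp[0][0] = cost[0][0]
    for i in range(m):
        for j in range(n):
            if i == j == 0:
                continue
            best = min(dp[i - 1][j] if i > 0 else INF,
                       dp[i][j - 1] if j > 0 else INF,
                       dp[i - 1][j - 1] if i > 0 and j > 0 else INF)
            dp[i][j] = cost[i][j] + best
    return dp[m - 1][n - 1]

def k_dtw_parametric(P, Q, k):
    d = [[abs(p - q) for q in Q] for p in P]            # 1-d curves for simplicity
    thresholds = {0.0} | {v for row in d for v in row}  # candidate k-th largest values
    return min(k * t + dtw([[max(v - t, 0.0) for v in row] for row in d])
               for t in thresholds)

print(k_dtw_parametric([0.0, 4.0, 9.0, 1.0], [0.0, 1.0, 8.0, 1.0], 2))  # matches the brute force
```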
The third main result concerns learning the median of curves sampled from a distribution over curves with vertices in the unit Euclidean ball of dimension d. The authors calculate the Rademacher and Gaussian complexities both for DTW and for k-DTW, and they show how a strong dependence on the size of the polygonal curves can be replaced with their parameter k. Importantly, because k can be much smaller than the dimension d, this provides a dimension-independent bound for learning.
No major weaknesses could be found.
Other Comments or Suggestions
We thank the reviewer for the valuable feedback.
The paper proposes a new distance between curves, the k-DTW distance, and makes several compelling arguments for its use. These include better near-metric properties than DTW and better learning complexity than the Fréchet distance; k-DTW generalizes both. The paper provides an efficient approximation algorithm (with clear provable guarantees), heuristics for improving runtime in practice while maintaining guarantees, and learning complexity results for learning a median curve from a set. Each step involves non-trivial insights. Moreover, the paper shows empirical results on synthetic (for clustering) and real data sets (for classification) that demonstrate clear improvement over the standard DTW and Fréchet distances.
Data analysis on curve data is an active area, and this paper provides an inventive and comprehensively presented new method in this area. Moreover, I found it surprising that the new method works so well: it considers the k longest distances in a matched monotonic sequence (DTW uses all of them, and Fréchet uses only the single longest), and this has very different properties and seems to be more robust than either.
Moreover, I found the paper clearly written. It combines a variety of complicated theoretical and practical perspectives.
Questions for Authors
N/A
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes (for the most part): the properties proposed to justify the metric are clearly explained and stated formally. The main results are proven or sketched in the main paper, the proofs deferred to the appendix make sense, and I did not find any concerns.
Experimental Design and Analysis
They seem reasonable. There are three experiments: clustering (synthetic; it illustrates the advantages of k-DTW well), classification on learning analytics (theirs does best among DTW and Fréchet), and classification of letters/trajectories (theirs often does best).
I would have liked to see a more explicit parameter search over k on the training data, followed by a single application to the evaluation data.
Supplementary Material
I quickly reviewed the proofs and looked at the experiments.
Relation to Existing Literature
I think it is fine. It covers the important aspects of curve distances, although it may be useful to compare to measures like Edit Distance with Real Penalties: Chen and Ng, "On the marriage of Lp-norms and edit distance", VLDB 2004.
Essential References Not Discussed
No
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We thank the reviewer for the valuable feedback.
Question 1: Edit Distance with Real Penalties, as in: Chen and Ng, VLDB 2004.
Answer 1: We will include some references, discussions and baseline experiments for your suggestion along with another suggestion of reviewer "MvAs". Please see our reply to "MvAs" for details on the other distance measures.
The following table shows the results of our experiments, extended to the weak Fréchet distance, partial DTW, and Edit Distance with Real Penalties (ERP), together with the best performing k-DTW (cf. Table 4 in our submission).
The best performing measure for each case is highlighted, as is the worst. One can see that in the majority of cases, k-DTW still performs best or second best. The only exception is "Char0uw", where ERP excels and beats all others by a large margin; k-DTW is still second best in these cases. Notably, all competitors have some worst cases, while k-DTW is never worst.
| Dataset | Distance | AUC (std. err.) | Accuracy (std. err.) | F1-Score (std. err.) |
|---|---|---|---|---|
| Cars+Bus | PartialDTW | | | |
| | WeakFréchet | | | |
| | EditRealPenalty | | | |
| | k-DTW | | | |
| Sim.C+B | PartialDTW | | | |
| | WeakFréchet | | | |
| | EditRealPenalty | | | |
| | k-DTW | | | |
| Char0uw | PartialDTW | | | |
| | WeakFréchet | | | |
| | EditRealPenalty | | | |
| | k-DTW | | | |
| Char1nw | PartialDTW | | | |
| | WeakFréchet | | | |
| | EditRealPenalty | | | |
| | k-DTW | | | |
| Char2nu | PartialDTW | | | |
| | WeakFréchet | | | |
| | EditRealPenalty | | | |
| | k-DTW | | | |
| TwoPersons | PartialDTW | | | |
| | WeakFréchet | | | |
| | EditRealPenalty | | | |
| | k-DTW | | | |
Question 2: More explicit parameter search over k.
Answer 2: We refer to Table 2 in the appendix of our submission for a more extensive parameter search over k, cross-validated and independently repeated. We may add the suggested evaluation in the next revision.
Thanks for the updated experiments with Edit Distance with Real Penalties. I retain my score of 4: accept.
This paper proposes a new distance measure called k-DTW, which is positioned as an interpolation between the classical DTW distance and the Fréchet distance. The technical novelty is to consider only the top k matched distances in the alignment path, rather than summing all distances (as in DTW) or taking the maximum distance (as in Fréchet). The authors prove that k-DTW interpolates these two classical measures for k = 1 and for sufficiently large k, respectively. They show that k-DTW yields stronger triangle-like properties than DTW, while still being more robust to outliers than Fréchet. A parametric search algorithm is introduced for computing k-DTW exactly (as well as a (1+ε)-approximation). The paper demonstrates applications in clustering (via hierarchical agglomerative clustering) and in nearest neighbor classification on synthetic and real-world datasets. The main contribution is that k-DTW addresses key shortcomings of Fréchet (outlier-sensitivity) and DTW (lack of metric-like properties) and, from the experiments, appears beneficial in tasks like clustering and classification.
Questions for Authors
You primarily demonstrate robustness to outliers through experiments. Could you formalize the observed outlier-robustness into a dedicated theorem or formal analysis?
Are there specific heuristic strategies or data-structural optimizations you foresee to further enhance the runtime efficiency of the k-DTW algorithm?
Claims and Evidence
Claim 1: k-DTW is more robust to outliers than Fréchet distance. Evidence: The synthetic-clustering experiment shows how spikes in type-A curves heavily penalize Fréchet but do less harm under k-DTW. However, the paper does not provide a standalone "robustness theorem." The authors rely primarily on intuitive arguments and empirical demonstrations.
Claim 2: k-DTW has stronger metric-like properties than DTW. Evidence: They prove a relaxed triangle inequality with a factor of k, whereas DTW's violation can be worse by factors depending on curve length. The paper contains theoretical lemmas (especially Lemma 2.3) and the associated proofs. Empirically, the authors illustrate fewer pathological merges in clustering under k-DTW compared to DTW.
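For reference, one plausible formal reading of that relaxed triangle inequality (our paraphrase of Lemma 2.3; the paper's exact statement may differ): for all curves P, Q, R,

```latex
\mathrm{dtw}_k(P,R) \;\le\; k \cdot \big( \mathrm{dtw}_k(P,Q) + \mathrm{dtw}_k(Q,R) \big),
```

whereas for DTW the corresponding factor can grow with the number of vertices of the curves.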
Claim 3: The proposed k-DTW algorithm is feasible for practical use. Evidence: An exact algorithm and a (1+ε)-approximate algorithm are provided, though in the worst case they can be more expensive than classical DTW or Fréchet. The experiments show that it is slower than DTW and discrete Fréchet in practice.
Methods and Evaluation Criteria
The authors' methods include:
- Defining k-DTW and demonstrating how to compute it via parametric search.
- Evaluating clustering via HAC with single-linkage and complete-linkage.
- Conducting l-nearest neighbor classification on real-world data.
These choices are appropriate to showcase the value of k-DTW in both supervised and unsupervised tasks. The design of synthetic data for highlighting robustness to peaks/outliers is quite clear. However, adding additional baseline distances like partial DTW or weak Fréchet could further solidify the empirical evaluations.
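The following minimal sketch (our illustration, not the authors' code) shows how such an evaluation pipeline can be wired together once pairwise k-DTW dissimilarities are available. It assumes scikit-learn >= 1.2 (for metric="precomputed") and some k_dtw(P, Q, k) implementation, e.g., the brute-force sketch earlier in this discussion.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

k_dtw = k_dtw_brute_force  # any k-DTW implementation, e.g. the earlier sketch

def pairwise_k_dtw(curves, k):
    """Symmetric matrix of k-DTW dissimilarities between all pairs of curves."""
    n = len(curves)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = k_dtw(curves[i], curves[j], k)
    return D

# Toy 1-d curves: two flat curves and two curves with a spike.
train_curves = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 9.0, 0.0], [0.1, 8.5, 0.0]]
train_labels = [0, 0, 1, 1]
D = pairwise_k_dtw(train_curves, k=2)

# HAC on the precomputed dissimilarities (single linkage works analogously).
hac = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="complete")
print(hac.fit_predict(D))

# Nearest neighbor classification: rows are test curves, columns training curves.
clf = KNeighborsClassifier(n_neighbors=1, metric="precomputed").fit(D, train_labels)
test_curves = [[0.0, 8.0, 0.5]]
D_test = np.array([[k_dtw(t, s, 2) for s in train_curves] for t in test_curves])
print(clf.predict(D_test))
```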
Theoretical Claims
I checked the theoretical claims as laid out in Sections 2 and 3 of the paper:
- The proof of the relaxed triangle inequality (Lemma 2.3) is good.
- One point that is less formally addressed is the outlier-robustness: the paper does not contain a specialized "robustness theorem." Instead, it uses examples and partial discussions to argue that k-DTW dilutes spikes in the alignment cost.
Experimental Design and Analysis
- The authors constructed three curve types (A, B, C) to highlight the different shortcomings of Fréchet and DTW. This approach is sound: it clearly demonstrates how one or two spikes can dominate Fréchet, or how DTW can produce surprising alignments.
- They then used hierarchical clustering (single-linkage and complete-linkage) and visually showed the intra-/inter-cluster distances. The analysis effectively highlights differences among the measures.
- The real-world classification experiment (l-NN) is straightforward but adequate to underscore classification improvements. However, the paper could be enriched by including additional distance measures (partial DTW, weak Fréchet) for completeness.
Supplementary Material
I saw references to additional proofs and extended experiments in the appendix, but did not observe a separate outlier-robustness theorem in the supplementary material.
Relation to Existing Literature
The paper explicitly compares k-DTW to the two classic distances: DTW and Fréchet. That is a direct relationship to established literature in computational geometry and time series analysis. We note that variants like the weak Fréchet distance or partial DTW also exist, but they are not tested. The top-k approach has precedents in top-k optimization and the Ky Fan norm (summing the k largest singular values). I recommend referencing additional distances like partial DTW and weak Fréchet, which might be beneficial for the robustness comparison.
Essential References Not Discussed
I recommend referencing additional distances like partial DTW and weak Fréchet.
Other Strengths and Weaknesses
Strengths:
- Proposes a simple and new bridging distance between two classic curve distances.
- Demonstrates theoretical foundations such as the relaxed triangle inequality and dimension-free learning bounds, and provides practical benefits.
- Proposes an exact algorithm to compute k-DTW, plus a (1+ε)-approximation.
Weaknesses:
- No explicit "robustness theorem" about outliers, only partial arguments via examples.
- Time complexity is worse than that of classical DTW or the discrete Fréchet distance, both in theory and in the experimental results.
- The experimental comparisons could be further supported by including partial DTW, weak Fréchet, or other established distances for completeness.
Other Comments or Suggestions
N/A
We thank the reviewer for the valuable feedback.
Question 1: Formalize robustness.
Answer 1: The concept of robustness for curve distance measures could be formalized as follows: given two curves whose matched vertices are at constant distance, if we move one vertex away so as to increase its matched distance by a large value D, then the average distance contributed per matched vertex increases through this modification by D/k for k-DTW (and hence by D for Fréchet, the case k = 1). This means that Fréchet is largely dominated by the single outlier, while for k-DTW the increase averages out, so that up to a factor of k, one single large perturbation of order D is indistinguishable from tiny perturbations of order D/k of (all) single vertices. This supports and quantifies the intuition, and may be included as a lemma in our next revision.
To make the intuition more rigorous, we will examine the statistical breakdown point for k-DTW, which is defined to be the smallest fraction of vertices that must be perturbed to corrupt the distance. A larger breakdown point indicates more robustness to arbitrary perturbations. As all k-DTW variants are closely related to the geometric median (minimizing a sum of distances), we will adapt the breakdown point analysis of the median [1] to our setting. It seems provable that the breakdown point of k-DTW grows linearly with k (relative to the complexity of the curves). This would imply that as long as k grows with the input, the breakdown point is asymptotically larger than for Fréchet, and thus k-DTW is considerably more robust. However, to reach a constant breakdown point that provides robustness comparable to DTW for arbitrarily long curves, k would still need to be of the order of the curve complexity.
[1] Lopuhaä and Rousseeuw, "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices", The Annals of Statistics, 1991.
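For readers unfamiliar with the notion, one plausible formalization of the breakdown point in this setting (our paraphrase of the classical finite-sample definition, adapted to a curve dissimilarity rho; the authors' adaptation may differ): for an m-vertex curve P,

```latex
\varepsilon^{*}(\rho, P, Q) \;=\; \min\Big\{ \tfrac{j}{m} \;:\; \sup_{\tilde P \in \mathcal{P}_j(P)} \rho(\tilde P, Q) = \infty \Big\},
```

where P_j(P) denotes the set of curves obtained from P by arbitrarily relocating at most j of its vertices.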
Question 2: Heuristic strategies or data-structural optimizations to enhance the runtime.
Answer 2: To enhance the runtime of our k-DTW algorithm without losing theoretical guarantees, we currently see two options:
- The current DTW subroutine is the plain DP algorithm without any tweaks. It could potentially be enhanced using heuristic DTW speed-ups such as [2] as a black box.
- Our first heuristic (line 206, right) could be enhanced by first computing only a subset of the DTW paths in a narrow band along the main diagonal; see the sketch after this list. This works in linear time and may be promising for reducing the variable 'mincost' early on, so as to cut off many unnecessary iterations.
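A sketch of the second idea (our own illustration, not the authors' implementation): a Sakoe-Chiba-style band of half-width w around the main diagonal restricts the set of warping paths considered, so the banded value can only over-estimate the true DTW and is therefore a safe initial value for 'mincost'. It runs in O(w(m + n)) time.

```python
def banded_dtw(cost, w):
    """DTW upper bound restricted to a diagonal band of half-width w."""
    m, n = len(cost), len(cost[0])
    INF = float("inf")
    dp = [[INF] * n for _ in range(m)]
    for i in range(m):
        lo = max(0, i * n // m - w)      # band around the rescaled diagonal
        hi = min(n, i * n // m + w + 1)
        for j in range(lo, hi):
            if i == j == 0:
                dp[0][0] = cost[0][0]
                continue
            best = min(dp[i - 1][j] if i > 0 else INF,
                       dp[i][j - 1] if j > 0 else INF,
                       dp[i - 1][j - 1] if i > 0 and j > 0 else INF)
            dp[i][j] = cost[i][j] + best
    return dp[m - 1][n - 1]  # INF if no warping path fits inside the band

cost = [[abs(p - q) for q in [0.0, 1.0, 8.0, 1.0]] for p in [0.0, 4.0, 9.0, 1.0]]
print(banded_dtw(cost, w=1))  # >= the unrestricted DTW value (here both are 4.0)
```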
We will also add the challenging open problem of improving the top-k optimization framework that our algorithms build upon. This would hopefully yield provably faster exact algorithms for other top-k problems as well.
[2] Silva and Batista, "Speeding Up All-Pairwise Dynamic Time Warping Matrix Calculation", ICDM 2016.
Question 3: Additional baselines like partial DTW and weak Fréchet.
Answer 3: Our claim is that k-DTW interpolates between the two extreme cases, Fréchet and DTW. Comparisons to these two are thus most natural for supporting our claims. Please note that weak Fréchet has the same sensitivity to outlier vertices as standard Fréchet. Similarly, partial DTW admits only an approximate triangle inequality whose factor depends on the curve length. Hence, they do not contribute towards the goals of our paper.
However, we agree on adding a few more baselines for completeness. We will thus include references, discussions and baseline experiments for partial DTW or weak Fréchet distance in addition to the suggestion of reviewer "uWfM". Please see our reply to "uWfM" for details and preliminary experiments.
The paper introduces a new dissimilarity measure for polygonal curves called k-DTW, which improves on Dynamic Time Warping (DTW) by offering stronger metric properties and greater robustness to outliers. It provides both exact and approximate algorithms as well as theoretical learning bounds, and it shows improved performance in clustering and nearest neighbor tasks. k-DTW also needs fewer samples than DTW.
All reviewers are positive about the paper. No major issues were raised, while minor ones were largely addressed in the rebuttal.
The final scores are 3x Accept.
In addition to the theoretical contribution, the reviewers, overall, also appreciated the experiments (related to clustering and classification) and how they provided empirical evidence to support the authors' claims.
In the AC's view, this is a clear Accept.