PaperHub
ICLR 2024 · Poster
Average rating: 6.3/10 (4 reviewers; lowest 5, highest 8, standard deviation 1.1)
Ratings: 8, 6, 6, 5
Average confidence: 3.0

Entropy Coding of Unordered Data Structures

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-03-13
TL;DR

We present shuffle coding, a general method for optimal compression of unordered objects, achieving state-of-the-art compression rates on a range of graph datasets including molecular data.

Abstract

Keywords
graph compression, entropy coding, neural compression, bits-back coding, lossless compression, generative models, information theory, probabilistic models, graph neural networks, multiset compression, asymmetric numeral systems, compression, entropy, shuffle coding

Reviews and Discussion

Review (Rating: 8)

This paper proposes a general method called shuffle coding for compressing sequences of unordered objects using bits-back coding.
The proposed shuffle coding is applicable to multisets, graphs, hypergraphs etc.
SOTA compression rate is achieved on several graph datasets including molecular data.

Strengths

The problem of compressing unordered data optimally by considering the storing order as redundancy is novel to me.
The paper is a solid contribution with strong theoretical foundations. The mathematical structures from group theory are closely connected with the target unordered data.
The idea of decoding the ordering as part of an encode function, in the spirit of bits-back coding, is very interesting.
The background and proposed methods are clearly explained and relatively easy to follow.
Experiments are conducted on several different datasets and achieve SOTA compression rates.
Limitations of the proposed method are properly discussed.

Weaknesses

One major concern is the practical value of compressing such unordered data. Compared with images or videos, it seems to me that the data volume of such unordered data is very limited. However, I still appreciate the technical contribution of this paper.

Currently, the numbers in Table 4 are relatively hard to interpret. It would be better to report speed as GB/s or MB/s and compare this with previous methods, so that readers can better gauge the efficiency.

Questions

Definition 2.5 is described in a very concise way. It is not clear why the definition is specifically connected with rANS; I think it also applies to arithmetic coding.

It is not clear how eq. 9 is related to the real bitrate. In this definition, the data is compressed in an instance-wise manner and l(m) is the initial bit cost, which can be amortized later if we compress more instances. However, log(1/p(x)), which is defined as the rate of the codec, is a value averaged over the input symbols. I do not understand why l(m) can be added to log(1/p(x)).

Definition 2.3 (Automorphism group) is actually the definition of a stabilizer. If I remember correctly, the concept of automorphism in group theory is connected with the operation of conjugation. Did I miss anything?

Comment

We thank reviewer 72sp for their review.

Currently, the numbers in Table 4 are relatively hard to interpret. It would be better to report speed as GB/s or MB/s and compare this with previous methods, so that readers can better gauge the efficiency.

We fully agree with the reviewer, and updated Table 4 to show speeds in the more standard unit of kB/s. While we did not find information on speeds for PnC, we added a column for SZIP compression speeds for comparison (no decompression speeds were reported).


Definition 2.5 is described in a very concise way. It is not clear why the definition is specifically connected with rANS; I think it also applies to arithmetic coding.

Our method requires stack-like (LIFO) codecs, such as those based on the range variant of asymmetric numeral systems (rANS), to recover the bits corresponding to the redundant order using bits-back. Queue-like (FIFO) codecs such as arithmetic coding cannot be directly used to implement bits-back coding; see Townsend et al. (2019) for more detail. This was previously not explicitly stated. We revised section 2.2, including definition 2.5, to clarify this.
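As an illustration of why a stack is needed (a toy sketch only, not our actual implementation; the `push_uniform`/`pop_uniform` codec and the unordered-pair example below are purely hypothetical): bits-back first decodes an ordering choice from the message that has already been written, and then encodes the ordered object on top of it.

```python
# Toy sketch (hypothetical, not the paper's codec): a stack-like (LIFO) codec
# over a single big integer, used to demonstrate the bits-back step.

def push_uniform(m, symbol, size):
    # Encode a symbol that is uniform over {0, ..., size-1} onto message m.
    return m * size + symbol

def pop_uniform(m, size):
    # Decode the most recently pushed symbol (LIFO order).
    return m // size, m % size

def encode_unordered_pair(m, pair, alphabet_size):
    # Bits-back for an unordered pair {a, b} with a != b: decode one ordering
    # bit from the message (saving 1 bit), then encode the ordered pair.
    a, b = sorted(pair)                  # canonical order
    m, flip = pop_uniform(m, 2)          # "get bits back" from the message
    x, y = (a, b) if flip == 0 else (b, a)
    m = push_uniform(m, x, alphabet_size)
    m = push_uniform(m, y, alphabet_size)
    return m

def decode_unordered_pair(m, alphabet_size):
    # Exact inverse of encode_unordered_pair, run in reverse (LIFO) order.
    m, y = pop_uniform(m, alphabet_size)
    m, x = pop_uniform(m, alphabet_size)
    a, b = sorted((x, y))
    flip = 0 if (x, y) == (a, b) else 1
    m = push_uniform(m, flip, 2)         # return the borrowed ordering bit
    return m, (a, b)

m = 1                                    # some pre-existing message content
m = encode_unordered_pair(m, (3, 7), alphabet_size=10)
m, pair = decode_unordered_pair(m, alphabet_size=10)
assert (m, pair) == (1, (3, 7))
```

With a FIFO coder, decoding and encoding operate on opposite ends of the message, so this interleaving of a decode step inside an encode step does not compose into a single invertible codec; see Townsend et al. (2019) for the precise argument.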


It is not clear how eq. 9 is related to the real bitrate. In this definition, the data is compressed in an instance-wise manner and l(m) is the initial bit cost, which can be amortized later if we compress more instances. However, log(1/p(x)), which is defined as the rate of the codec, is a value averaged over the input symbols. I do not understand why l(m) can be added to log(1/p(x)).

If $m$ is the initial message, the left-hand side $l(\mathrm{encode}(m, x))$ of eq. 9 is the real bit rate for a specific input $x$, measuring how many bits are needed to store $x$ without encoding any other information. $\log(1/p(x))$ is the ideal (amortized) bit rate for that specific input $x$, bounding the real bit rate from below. This is not an average over all possible inputs $x$. Note that in general, $m$ can contain previously compressed objects, and $l(m)$ is therefore not necessarily just the initial bit cost.


Definition 2.3 (Automorphism group) is actually the definition of a stabilizer. If I remember correctly, the concept of automorphism in group theory is connected with the operation of conjugation. Did I miss anything?

That is all absolutely correct. The automorphism group of $f$ is the stabilizer subgroup of $f$ under the action of $\mathcal{S}_n$, as mentioned right after definition 2.3. The concepts of automorphism and conjugation are indeed closely related in group theory. For example, the function $f_h(g) = h^{-1}gh$ that conjugates any element $g$ of a group $G$ by a fixed element $h$ of $G$ is an automorphism of $G$ (called an inner automorphism).
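To make the stabilizer view concrete, here is a small illustrative example (hypothetical brute-force Python, not how nauty works internally): the automorphism group of a graph is exactly the set of vertex permutations that map its edge set to itself.

```python
# Illustrative only: the automorphism group of a small graph computed as the
# stabilizer of its edge set under S_n acting by relabelling vertices.
from itertools import permutations

def automorphisms(n, edges):
    edge_set = {frozenset(e) for e in edges}
    auts = []
    for perm in permutations(range(n)):            # all of S_n
        relabelled = {frozenset((perm[u], perm[v])) for u, v in edges}
        if relabelled == edge_set:                  # perm stabilizes the graph
            auts.append(perm)
    return auts

# Path graph 0-1-2: the only non-trivial automorphism swaps the endpoints.
print(automorphisms(3, [(0, 1), (1, 2)]))
# [(0, 1, 2), (2, 1, 0)]
```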

Comment

Thanks for the reply. I keep my score.

Review (Rating: 6)

Summary

  • This paper proposes shuffle coding, a BB-ANS approach to lossless compression of unordered sets. The scenario described by the paper seems to be related to the Birkhoff Polytope in variational inference over permutations [Linderman 2017, Reparameterizing the Birkhoff Polytope for Variational Permutation Inference], where each permutation's likelihood is the same. The authors demonstrate the effectiveness of their approach on various datasets.

Strengths

Strength

  • The unordered set / graph compression problem is of good practical value. The proposed approach is a neat extension of bits-back coding. It is simple, novel and works well.

Weaknesses

Weakness

  • As the authors have discussed, the number of initial bits currently required is quite large. This hinders the practical application of the proposed approach to one-shot object coding, though it is still possible to apply this approach to a dataset to amortize the initial bits. An alternative to the bit-swap approach mentioned by the authors is correlation communication [Harsha 2010, The Communication Complexity of Correlation] [Li 2018, Strong Functional Representation Lemma and Applications to Coding Theorems] [Theis 2021, Algorithms for the Communication of Samples], which has an extra overhead of approximately $\log \log N!$, but does not require any initial bits.

Questions

Questions

  • For practically large structures, what is the computational cost of finding and sampling from the isomorphism class $\hat{f}$? For a practical codec, metrics such as latency and throughput should also be discussed.
Comment

We thank reviewer NBHy for taking the time to review our paper.

The scenario described by the paper seems to be related to the Birkhoff Polytope in variational inference over permutations

Indeed, as the reviewer pointed out, our method is related to [Linderman 2017], where they reparameterize the Birkhoff Polytope to perform variational inference over permutation matrices. The vertices of the n-dimensional Birkhoff Polytope represent permutations over n elements. Our decoding algorithm can be seen as performing inference of the permutation applied to the input via canonicalization. Relaxing this inference to the inner hull of the Birkhoff Polytope, as done in [Linderman 2017], could be a way to perform lossy compression of combinatorial objects.


An alternative to the bit-swap approach mentioned by authors is correlation communication

While correlation communication methods could theoretically be an alternative to the future work of interleaving, these methods have computational complexity that scales exponentially with the entropy of the source, making them impractical. However, the reviewer is correct that these methods could in theory be used to avoid the initial bits problem, if we disregard computational complexity.

These methods relate to bits-back coding in the following way. Consider the source $X$ and another random variable $Y$ from which we can recover $X$ deterministically, i.e., $X = f(Y)$ for some function $f$. Note this implies that $H(X|Y) = 0$. As a concrete example from BB-ANS, $Y = (Z, X)$ where $Z$ is the latent variable of the VAE. Then, bits-back coding implements rate $R = H(Y) - H(Y|X) = I(X; Y) = H(X) - H(X|Y) = H(X)$, where $-H(Y|X)$ is the savings from the bits-back step.

Correlation communication methods can be used to communicate a sample $Y \mid X$, achieving a rate of $R + \log(R+1) + 5$ via the Poisson Functional Representation Lemma (in this case, the discrete version, which falls back onto exponential races) [Li & El Gamal 2017], and $X$ can then be recovered from that sample. Unfortunately, as mentioned, the computational complexity would be much larger than that of our method.

Bits-back can be viewed as an extremely efficient algorithm for "deterministic" correlation communication (i.e., when $X = f(Y)$ for some $f$).


For practically large structures, what is the computational cost of finding and sampling from the isomorphism class $\tilde{f}$?

We provided relative timings for nauty, the library we used to find the canonical labeling and automorphism group (these are the computational bottlenecks for sampling from the isomorphism class), in Appendix E. As mentioned in the paper, no polynomial-time algorithm for the graph isomorphism problem is known, but nauty and Traces solve this problem efficiently for various graph classes. Thorough benchmarks of nauty and Traces on different graph types and sizes can be found at https://pallini.di.uniroma1.it/.
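For intuition, here is a brute-force sketch of the object that a canonical labeling provides (illustrative Python only; nauty and Traces compute an equivalent canonical form without enumerating all permutations): isomorphic graphs map to the same representative, which is what our decoder relies on.

```python
# Illustrative brute force (not nauty): a canonical form of a small graph as
# the lexicographically smallest edge list over all vertex relabellings.
from itertools import permutations

def canonical_form(n, edges):
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabelled < best:
            best = relabelled
    return best

# Two isomorphic "triangle with a pendant vertex" graphs share a canonical form.
assert canonical_form(4, [(0, 1), (1, 2), (2, 0), (2, 3)]) == \
       canonical_form(4, [(3, 2), (2, 1), (1, 3), (1, 0)])
```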


For a practical codec, metrics such as latency and throughput should also be discussed.

We have updated Appendix E to now report compression throughput. We believe that latency is more appropriate for streaming applications, such as video decoding. We would envision our codec being used in a non-streaming setting, where a whole file would be downloaded and then decompressed.

Comment

Thanks for the feedback, I keep my rating as accept.

Review (Rating: 6)

In this work, the authors introduce shuffle coding, a method for compressing sequences of unordered objects based on bits-back coding. They provide an exposition of the group-theoretic fundamentals relevant for shuffle coding (Section 2 and Appendix A), define the desiderata required for an optimal-rate codec for unordered sequences (Section 3) and present shuffle coding, an algorithm that fulfils these desiderata (Section 3.1). They apply shuffle coding to unordered data structures (specifically graphs), showing strong performance compared to PnC and SZIP (Section 5).

Strengths

This paper presents a few key strengths which, in my view, are as follows:

Elegant unified framework: This paper provides a unified theoretical framework for compressing unordered objects, such as multisets and graphs. This approach is based on the elegant idea that, since the order of the parts of an object does not matter, one can reduce the cost of communicating the object by getting a certain number of bits back, namely the bits corresponding to a particular ordering of the parts. This generality is appealing because one does not have to devise a specialised method for each different type of unstructured object. However, it should be noted that this framework assumes access to a "canonicalisation" function, which determines a canonical representation as well as the automorphism group of the object (e.g. graph) in question (also see weaknesses section below).

Strong compression performance: The authors demonstrate that shuffle coding achieves strong compression rates in practice. In particular, shuffle coding outperforms SZIP and PnC on a range of graph datasets. This is an encouraging result because, while the proposed method is asymptotically optimal, it also involves a non-trivial overhead (which is amortised as more data are compressed). Therefore it is good that despite this cost, shuffle coding seems to perform well in practice.

Well written paper: The paper is generally clearly written and well motivated. The authors have made an effort to discuss the limitations of their work, specifically about the time complexity and initial rate overheads induced by their method. I appreciated the illustrations and pseudocode listings in the main text, which helped understand their method.

Weaknesses

The paper's main weaknesses, in my view, revolve around the practical applicability of shuffle coding:

Large runtime complexity: As the authors note, applying shuffle coding to a graph requires solving a graph isomorphism problem, for which no polynomial-time algorithm is known. This can be a significant hurdle when coding larger graphs. The authors brought up this issue in the paper, and suggested that approximately solving the isomorphism problem is a promising way to scale the method. However, in its current form, the method does not scale to larger graphs. Further, it is unclear what the tradeoff between the communication rate and computational complexity of the method would be, if one were to use an approximate scheme instead.

Initial bit overhead: Due to the requirement of initial bits, shuffle coding introduces a communication overhead to the transmitted message. This can be significant in the one-shot or few-shot case, making the method unsuitable for compressing small sequences of messages. I think the authors should specify the amount of extra bits used in the experiments (see Table 3) in the main text.

Contribution as a general solution seems somewhat exaggerated: While the shuffle coding approach is general from a theoretical point of view, it requires access to a "canonicalisation" function. The authors focus on graphs, for which existing libraries offering this functionality exist. For other permutable classes, the authors argue that one can "embed objects into graphs in such a way that the structure is preserved and the canonization remains valid." However, it is unclear whether, for example, this embedding might affect the compression rate, or how difficult it is to construct such an embedding. In light of this latter point, the statement that the authors' "implementation can easily be adapted to different data types" in the abstract may be regarded as exaggerated (if for example coming up with an algorithm that constructs such graph embeddings is challenging, or if the computational / memory complexity of such an algorithm is large). A more measured statement in the appendix and / or positioning of the main text might be beneficial in this regard.

Summary: In conclusion, while I find the paper to be well-motivated and elegant, I think the practical applicability of shuffle coding may be limited due to runtime issues, as well as issues pertaining to the embedding of other kinds of permuted classes on graphs, and the initial bit overhead. Therefore I have recommended a moderately positive score for the paper, but I am willing to adjust my score if the authors address my points of concern raised above, and the questions and recommendations made below.

Questions

Below are some questions for the authors and suggestions that I think could improve the paper.

Questions

Clarification on the discount factors for Table 3: The ER and PU models are specific probabilistic models for graphs and, as the authors explain, they can be swapped for any other exchangeable model of graphs. Therefore, while it is important to report the overall communication cost in the experiments, an equally important (in my view) quantity to report is the rate discount afforded by shuffle coding. What are the discounts corresponding to the results in Table 3? Perhaps the authors can report these as an extra column, since the discount is a function of the graph alone and not the modelling distribution $P$.

Extending the framework with approximate isomorphism solutions: The authors claim that while no polynomial-time algorithm for the exact graph isomorphism problem is known

this limitation can be overcome by approximating an object’s canonical ordering, instead of calculating it exactly. This introduces a trade-off between speed and compression rate in the method, and lowers runtime complexity to polynomial time.

Can the authors comment on the tradeoff between the speed and compression rate of this modified method, here and / or the main text?

Suggestions

Definition 2.5: I think that definition 2.5 is not totally accurate and / or could be improved. Presumably, what the authors mean by "inverse functions" is that `decode` is the inverse function of `encode`. In that case the `decode` function isn't really needed and / or can be defined as $\mathrm{decode} = \mathrm{encode}^{-1}$. Also, the statement "with respect to $P$" does not seem to make sense on its own. I think a clearer statement is to say that

an optimal codec for $P$ is an invertible function $\mathrm{encode}: M \times X \to M$, which is optimal with respect to $P$, in the sense that for any [...].

More importantly however, in shuffle coding, the `encode` function is not invertible, because for $f \in \mathcal{F}$, applying $\mathrm{decode}(\mathrm{encode}(M, f))$ returns the message $M$ together with the canonical representation $\bar{f}$ rather than $f$ itself (see for example the function signature and return statements in the code listing in Section 3.1). While this is a technicality that does not seem to affect the validity of the algorithm, I think the authors should modify their definition and / or exposition to account for this.

Framing of definition 2.5: I think definition 2.5 and the paragraph under it could be framed better. In particular, the authors define optimal codecs and explain that "definition 2.5 captures the abstract properties of codecs based on the range variant of asymmetric numeral systems." I think a clearer approach would be to say up-front that their plan is to set up a method that gets those bits which correspond to permutations of the graph in question back. This in turn necessitates using a last-in-first-out codec, such as (r)ANS, and that definition 2.5 specifies "Optimal last-in-first-out codecs", rather than "Optimal codecs".

Initial bit overhead statement: The authors argue that the constant bit overhead incurred by bits-back "exists for all entropy coding methods." While all entropy coding methods have a constant overhead that is amortised as more and more data are compressed, typical methods such as Arithmetic Coding (AC) have constant overheads of the order of a couple of bits. This is far smaller than the claimed overhead of "at most 64 bits" of bits-back for shuffle coding. I think the authors' phrasing somewhat misrepresents the overhead of other entropy coding methods. A more accurate statement would be welcome here, such as: "As in other entropy coding methods, which invariably have similar (though typically smaller) constant overheads, the constant 'initial bit cost' of bits-back is amortised as more data are compressed."

All units in bits? Are all reported rates, including for example Table 1, in bits? If so, it would be good to add a statement up front, early in the paper, that specifies this.

Comment

We thank reviewer aWcE for their thoughtful, thorough and constructive review.

Weaknesses

Large runtime complexity: We address this in our overall response.

Initial bit overhead: The average initial bit cost per TU dataset in Table 3 is 0.01 bits per edge for both ER and PU, demonstrating good amortization. We added this result to section 5.

Clarification of the discount factors for Table 3: We added a column to report the discounts for all datasets in Table 3. As in Table 2, the discounts are very significant for compression performance.

Questions

Extending the framework with approximate isomorphism solutions: The tradeoff between speed and compression rate for the approximate version of shuffle coding is promising: preliminary experiments show that on the SZIP graphs, we can achieve speedups of multiple orders of magnitude for a few-percent increase in compression rate compared to optimal-rate shuffle coding, e.g. a 1000x speedup for a 4% rate increase on the "geom" dataset. We implemented a version of shuffle coding that has pseudo-linear runtime and completes on graphs with more than 10 million edges. The approximate method saves the "easy-to-find" redundant bits, leaving out the "hard" bits that cause most of the computational complexity when calculating the canonical ordering in nauty and Traces. On all datasets we tested, there seems to be a favorable distribution between these, i.e. many easy bits vs. few hard bits. Characterizing and predicting this tradeoff would be an interesting avenue for future work.

Contribution as a general solution seems somewhat exaggerated: We updated section 2.1.1 to clarify that our method critically depends on the availability of a function to retrieve a canonical ordering (as well as a set of generators of the automorphism group) for elements of the permutable class. As mentioned in Anders and Schweitzer (2021), embedding into vertex-colored graphs, and then running nauty/traces, is the standard technique for computing the automorphism group (and canonization) of other objects. It would be great though to prove that this is always possible, and to have a general embedding method. We actually used an embedding of edge-colored graphs into vertex-colored graphs in order to canonize and compress graphs with edge attributes (which are not directly supported by nauty/traces) for our experiments. We have updated the paper to mention that we did this.

Suggestions

We agree with all of the reviewer’s suggestions and have implemented them in our updated paper version. This includes a simplified and better-framed definition 2.5, an updated statement regarding the initial bit overhead of rANS, and a missing clarification regarding units (all results are indeed in bits, this information was missing for Table 5).

Invertibility of encode function. The encode function for shuffle coding is invertible because it takes an unordered object of type $X = \tilde{F}$ (not $F$), so the codec signature is $M \times \tilde{F} \to M$. We updated section 3.1 to mention this explicitly. We merely use an ordered object with arbitrary order to represent the unordered object in our implementation, as described in section 3.1. Its order has no meaning, in the sense that equality of unordered objects is implemented by comparing canonized versions of their ordered representations. In practice, we use a wrapper data type called Unordered to define equality in this way (i.e. Unordered('abc') == Unordered('bca')), so that the universal invertibility property c.decode(c.encode(m, x)) == (m, x) of codecs is upheld. This is omitted from the listing in section 3.1 for simplicity.
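A minimal sketch of such a wrapper (illustrative Python, not our actual implementation; canonization here is just sorting, as for multisets of characters):

```python
# Hedged sketch of the Unordered wrapper described above. Equality is defined
# via a canonical form, so decode(encode(m, x)) == (m, x) can hold even though
# decode returns a canonically ordered representative.
class Unordered:
    def __init__(self, elements):
        self.elements = tuple(elements)       # arbitrary-order representation

    def canonical(self):
        return tuple(sorted(self.elements))   # canonization (sorting, for multisets)

    def __eq__(self, other):
        return self.canonical() == other.canonical()

    def __hash__(self):
        return hash(self.canonical())

assert Unordered('abc') == Unordered('bca')
```

For graphs, the canonical form would come from a canonical labeling rather than sorting.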

Comment

Thank you for your rebuttal. I am happy to hear you found my review thoughtful, thorough and constructive, and that some of the suggestions I made were useful for improving the paper itself. Your rebuttal has generally addressed most of the concerns that I had. I also think the preliminary exploration of extending this approach with approximate graph isomorphisms is a very interesting one, and that if the speed-up-to-rate trade-offs you are claiming are in fact consistent / robust across different graphs, this would make for a very compelling method. I am currently inclined to maintain my score of 6 (on the grounds that the current version of the method can be very slow). However, I think the possible improvements that could come from this extension could make this a very effective and practical compression scheme, and that this is a potentially important factor in favour of this method that the AC could consider when assessing the paper.

Review (Rating: 5)

This paper focuses on the problem of compressing unordered objects. It presents a general approach called “shuffle coding” to compress different data structures with bits-back coding. After introducing the background in Section 2, including data structures and the problem definition of compressing unordered objects, the authors derive Lemma 3.2 and then provide the pseudo-algorithm for the proposed method. The key idea is to decode an ordering as part of an encode function. The experimental results demonstrate that the proposed shuffle coding can compress different data structures with better lossless compression performance compared with the ordered ER baseline.

Strengths

It seems meaningful to reduce compression cost by removing the order information in data structures. The proposed shuffle coding can get a discount in lossless compression of such data structures, as illustrated by Equation 14.

Weaknesses

My major concern is about the significance of the problem studied in this paper: considering the complexity, will the proposed method have wide/potential applications in practice? To me, it seems somewhat intuitive to remove the order information so that we can reduce the coding cost when compressing graph data. Is bits-back coding necessary in this scheme? These concerns may partially be attributed to my lack of expertise in the field of graph compression. In addition, Appendix C describes the modifications compared with Severo et al., 2023b, which is also an important baseline in this paper. However, for a reader who is not very familiar with the area, Appendix C may be hard to understand.

Questions

As I am not entirely familiar with this field, the authors are welcome to address my concerns and provide more details. I will gladly consider increasing my initial rating.

Comment

We thank reviewer gbs3 for taking the time to review our paper.

We addressed the concern regarding runtime complexity in our overall response.

It is intractable to directly compress unordered objects without bits-back at the optimal rate. The role of bits-back coding is to enable us to convert the problem of compressing unordered objects into a problem of compressing ordered objects, where we can apply common sequential methods such as the ones shown in Listing 1 in the paper.
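Concretely, the accounting works as follows (a hedged sketch; we do not reproduce eq. 14 verbatim here): if $P$ is an exchangeable model over ordered graphs $f$ on $n$ vertices, the unordered object $\tilde{f}$ has $n!/|\mathrm{Aut}(f)|$ distinct orderings, so

```latex
\log \frac{1}{P(\tilde{f})}
  = \log \frac{1}{P(f)}
    - \underbrace{\left(\log n! \;-\; \log \lvert \operatorname{Aut}(f) \rvert\right)}_{\text{discount recovered via bits-back}}
```

Shuffle coding recovers the discount by decoding one of the $n!/|\mathrm{Aut}(f)|$ orderings from the message before encoding the resulting ordered graph, so the net rate approaches $\log(1/P(\tilde{f}))$ as the initial bits are amortized.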

Appendix C describes details of an ordered graph model that we plug into shuffle coding for our experiments. It is not a baseline for shuffle coding, since we are concerned with unordered graph compression. It is not core to our method either since shuffle coding can be combined with any ordered graph codec. We merely aimed to demonstrate that a few-parameter ordered model/codec can lead to competitive results in combination with shuffle coding, and Severo et al. 2023b is a good fit for that purpose. The minor modifications lead to rate improvements as shown in Appendix F, but are not core to our method.

Comment

Thanks for the feedback and confirmation on the concern regarding the limited practical value. I would like to keep the score.

Comment

We are delighted to read that all reviewers agree on the good soundness of our paper, and mostly agree that it represents a “good” contribution. Multiple reviewers highlighted the novelty, elegance and strong compression performance of the approach.

A concern regarding the practical value of our method due to its large runtime complexity was voiced multiple times. We clearly acknowledge this as a limitation in the paper. We show that shuffle coding is tractable for many practical datasets, even with the basic version presented in this paper. We leave a description of a version of shuffle coding with pseudo-linear runtime and similar rates, with wider potential for applications, for future work.

We released an updated version of the paper, following reviewers’ feedback. Most notably, the framing and content of definition 2.5 were revised for clarity, additional experimental results were added, and we now report compression speeds in Table 4.

AC Meta-Review

A new method for compression is valuable to the community. This paper provides a reasonable methodology. Almost all the reviewers are positive about the contribution. I recommend acceptance.

Why not a higher score

The reviewers have all raised the issue of practical implementation of entropy coding. That makes the impact somewhat lower. Also compression is a great application area, but the methodology does not belong to the core of machine learning.

Why not a lower score

It is a methodology paper, with deficiencies that can be addressed in follow-up work.

Final Decision

Accept (poster)