Factor Graph Optimization of Error-Correcting Codes for Belief Propagation Decoding
We learn new binary linear block codes by optimizing the factor graph of the Belief Propagation algorithm
Abstract
Reviews and Discussion
This paper presents a gradient-based, data-driven method for the design of sparse-graph codes tailored to belief propagation (BP) decoding. The main contribution of the paper is to learn the factor graph structure through a differentiable representation that facilitates backpropagation. Specifically, the authors start with a complete bipartite graph where the edges are learnable.
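Concretely, the learnable-edge idea can be pictured along the following lines (an illustrative sketch under my own assumptions, not the authors' code): each potential edge of the complete bipartite graph carries a continuous logit, which is binarized with a straight-through estimator so that gradients still flow to the relaxed weights.

```python
import torch

class LearnableFactorGraph(torch.nn.Module):
    """Complete bipartite graph over (n - k) check nodes and n variable nodes,
    with one continuous, learnable logit per potential edge (illustrative only)."""
    def __init__(self, n: int, k: int):
        super().__init__()
        self.edge_logits = torch.nn.Parameter(torch.zeros(n - k, n))

    def forward(self) -> torch.Tensor:
        soft = torch.sigmoid(self.edge_logits)      # relaxed edge weights in (0, 1)
        hard = (soft > 0.5).float()                 # binary parity-check matrix
        # Straight-through estimator: the forward pass uses the hard 0/1 matrix,
        # while gradients flow back through the soft relaxation.
        return hard + soft - soft.detach()

H = LearnableFactorGraph(n=32, k=16)()              # candidate (n-k) x n parity-check matrix
```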
Strengths
The proposed approach to design codes is interesting and, as the authors show, leads to codes that outperform some existing codes.
Weaknesses
Although the method proposed by the authors is interesting, I believe this paper should be rejected, primarily for reasons concerning its limited contribution scope and weak experimental comparisons. My arguments are detailed below:
-
Limited contribution: While the approach presented in the paper is conceptually interesting, the contribution of this paper is too narrow to merit publication at a major conference like ICLR. The work is more appropriate for a coding conference such as ISIT or ITW. I believe that this work is valuable for the coding community and deserves publication, but it lacks the depth and broader impact typically expected at a major machine learning venue or for a major coding journal.
-
Unconvincing experiments and comparisons: The experimental results are insufficient to answer the core question this paper seeks (or should seek) to address: Does this method make it possible to design codes that perform better than state-of-the-art codes? This question remains unanswered, which casts doubt on the interest of the proposed approach. The authors should compare the performance of the designed codes with that of the best available codes, as well as with relevant performance bounds. Without these comparisons, the results merely demonstrate the construction of codes that improve over the baselines but still underperform relative to the best codes. This weakens the impact of the method.
Please see my Comment 2 in the "Questions" section for some suggested comparisons.
- Non-standard metrics: The paper reports performance using the negative natural logarithm of the BER. This is very unconventional for a coding paper (and this IS a coding paper) and unnecessarily complicates the interpretation of the results. Standard practice in coding theory is to present BER or block error rate (BLER) results directly (the latter is more suitable!), which allows for clearer, more interpretable results. The authors should adhere to these standard metrics.
Please see my Comment 3 in the "Questions" section.
Questions
-
Sections 3 and 4 could be presented in a much clearer and more accessible manner. Currently, these sections make relatively straightforward concepts appear unnecessarily complex.
-
The experiments/results section should be thoroughly reworked. It is not unexpected that the authors' obtained codes outperform the baseline codes they started from. Indeed, most of them are not particularly good codes (compared to the best existing ones), and some are clearly poor codes for BP decoding. The authors should benchmark the performance of their newly designed codes against SOTA codes.
For example, the authors should consider benchmarking their (128,64) codes against the SOTA codes reported in Figure 10 of the paper "Efficient error-correcting codes in the short blocklength regime" (the details of the codes are given in that paper). Without such a comparison, and similar ones for other lengths, the relevance of the proposed approach is questionable.
Furthermore, the authors should also benchmark the performance of their codes against finite-length performance bounds, such as Gallager's random coding bound and the random coding union bound (see the same paper for details). In this respect, please report results in terms of block error rate rather than bit error rate, as it is more relevant.
-
Rather than reporting results in terms of the negative natural logarithm of the BER, please plot BLER curves (like the ones in the paper cited above). Only in this way can one fully understand how the proposed codes perform against SOTA codes and performance bounds. In other words, for the (128,64) codes you should reproduce Figure 10 of the paper above and include your own curves.
-
I do not see much value in the results in Figures 3, 4, 5, and 7. The results are unsurprising and provide limited insight. These results would be better suited for the appendix rather than the main body of the paper.
For example, in Figure 3 you show improvements with respect to random codes with different sparsity rates. For such lengths, random codes, particularly denser ones, are not good codes under BP decoding, so it is not surprising that your learned code works better! What do we learn from this figure that we don't already know?
The same applies to Figure 7. I quote the paper: "We can observe that for low-density codes the modifications remain small, since the code is already near local optimum, while for denser codes the change can be substantial". This is what any relatively knowledgeable coding expert would expect, so the figures do not bring any new insight.
Details of Ethics Concerns
none
We thank the reviewer for the feedback and insights.
Our work builds upon Judea Pearl’s original BP algorithm, published at AAAI 1982 as a marginalization method for Bayesian networks. Submitted to ICLR under the “probabilistic methods” area, our manuscript introduces a novel method for learning the structure of Bayesian networks (graph connectivity), as outlined in lines 64-70. Structure learning is a well-established subdomain of probabilistic graphical models. The error-correcting codes (ECC) application is chosen because it is one of the most prevalent applications of BP.
We address the remarks about ECC below. However, the work is best judged as a machine learning work submitted to a machine learning conference, with novel technical contributions including our graph-connectivity masking paradigm and a novel optimization framework.
Weaknesses
- Limited contribution: paper is too narrow for ICLR…deserves publication…more appropriate for a coding conference such as ISIT or ITW.
We respectfully disagree. As noted in our introduction, this work is the first to propose a gradient-based method for learning Bayesian networks/codes, as other reviewers also recognized.
- See question 2.
- See Question 3.
Questions
-
Sections 3 and 4 represent half of the paper. Can the reviewer point to specific passages whose presentation they find overly complex?
-
Our work introduces a new learning paradigm aimed at improving the performance of a given initialization of the factor graph/code. As with any first-order method applied to non-convex objectives, the method is sensitive to initialization and generally converges to local minima. Thus, the primary focus of this work is to present an efficient and empirically effective method, rather than exhaustively testing all sparse code variants.
In response to the reviewer’s recommendation, we conducted experiments with state-of-the-art (SOTA) 5G NR LDPC codes. The BER and BLER vs. SNR results, presented in Appendix L of the revised manuscript, demonstrate that our method can enhance these SOTA codes, highlighting the potential of our data-driven approach in modern ECC settings.
A comparison with SCL is now also included in Appendix I of the revised manuscript. Our results show that even in the challenging short-length setting, where achieving sparsity is difficult, our method significantly improves over existing low-density codes and approaches the ML bound with very few iterations, even under suboptimal initialization. With good initialization (using a sparse LDPC code), our method achieves state-of-the-art performance.
- The chosen metric and display method follow established norms in the field of ECC as published in the AI community. This metric emphasizes the exponent, is easily inverted (BER = exp(-x) for a reported value x), and enables a compact presentation across multiple codes and scenarios (18 codes on different channels).
Based on the reviewer’s suggestions, we have added BER and BLER vs. SNR curves in Appendix J of the revised manuscript. Additionally, Appendix L provides similar curves for short state-of-the-art 5G NR LDPC codes, demonstrating that our method outperforms these codes as well.
- Our optimization framework primarily optimizes BER rather than BLER using the cross-entropy loss, but this does not seem to significantly impact performance due to BP’s global marginalization effects. The figures are part of our ablation study, illustrating the impact of various design choices on our method.
- Figure 3 shows that initialization plays a critical role in convergence. Interestingly, as shown, increased initial sparsity does not necessarily improve performance.
- Figure 4: Although loopy belief propagation is widely known beyond coding theory, empirical validation is needed to show that stochastic gradient descent (SGD) enforces sparsity through its inductive bias [1]. Figure 6 further supports that sparsity regularization has minimal impact on convergence, suggesting that gradient descent optimization and BP have similar sparsity enforcement effects.
These points are clarified in the revised manuscript.
[1] Soudry et al., The implicit bias of gradient descent on separable data, JMLR18
The title of your paper, "Factor Graph Optimization of Error-Correcting Codes for Belief Propagation Decoding," clearly indicates that the focus is tied to error-correcting codes. Please rest assured that I have evaluated your work as a machine learning paper with an application to error-correcting coding.
Thank you for providing additional details and for clarifying the rationale behind the chosen metric. However, I would like to reiterate that the selected metric is one adopted by a relatively small subset of researchers. It is not the standard metric commonly used in the coding community--even for ML-based decoding methods--to evaluate decoder performance.
Standard metrics such as bit error rate (BER) and block error rate (BLER) plotted directly against SNR are universally recognized in the coding community. These metrics provide clearer, more interpretable comparisons between decoders. Departing from these well-established norms without a compelling reason introduces unnecessary complexity and makes it harder to evaluate your method's effectiveness relative to existing approaches. Furthermore, using a less common metric can obscure performance insights, especially for readers unfamiliar with this representation.
While I appreciate that you have included BER and BLER vs. SNR curves in your rebuttal, these results confirm my earlier impression: your method does not achieve better performance than state-of-the-art (SOTA) decoders.
Comparisons: Comparing your method to BCH and Polar codes with BP decoding is not meaningful because these are not SOTA methods for these code families. Such comparisons, while potentially making your decoder appear more favorable, are not fair benchmarks and fail to provide an accurate assessment of your approach.
SCL decoding for polar codes: It is unclear what list size you used for the successive cancellation list (SCL) decoding of polar codes. Based on the results, it appears that you may have used a small list size, which significantly weakens the performance of SCL and renders the comparison unfair.
Final remarks: While your work introduces some interesting ideas, it does not demonstrate broad impact or significant advancements relative to existing methods to merit acceptance at a top-tier conference like ICLR. Your contributions would be better suited for a more specialized venue, such as ISIT or ITW (but not for a major journal paper), where they would be appreciated within the context of error-correcting codes.
Dear Reviewer 7LZc,
As we near the conclusion of this year's discussion period, we would like to extend our gratitude for your thoughtful feedback, which has significantly enhanced our manuscript. We believe we have addressed the majority of the concerns raised by you and the other reviewers. In particular:
- We added BER and BLER vs. SNR curves in Appendices J and L. These curves allow a better visualization of our method’s impact.
- We added a performance comparison with SCL in Appendix I. The comparison shows our method can provide state-of-the-art sparse codes even on extremely short block lengths.
- We added performance on short 5G NR LDPC codes in Appendix L. The results demonstrate that our method can also outperform state-of-the-art short sparse codes.
- We added the codes’ column weight distribution in Appendix K. Our method introduces a distribution shift that adapts to BP's inductive bias in the short block length regime.
We would greatly value any further feedback you may have.
We thank the reviewer for the response.
I have evaluated your work as a machine learning paper with an application to error-correcting coding.
While we appreciate the reviewer's perspective, we note that their comments focus solely on classical ECC visualization and performance metrics for specific codes, without addressing the method itself. Our method generates codes on which BP performs significantly better than on the input codes (codes that required decades of research), using just a few hours of optimization on relatively weak GPUs.
Standard metrics plotted directly against SNR are universally recognized
We adopted the widely used tabular notation following the conventions established in numerous works across both IT and ML communities, including most if not all neural decoder contributions we are aware of.
This notation provides a clearer and more compact presentation of performance across multiple codes. Presenting results for 14 codes on three different channels in a single manuscript using traditional plots would be infeasible. Moreover, the numerical approach enables precise reconstruction of plot values, while extracting numerical values from plots for future reference is not exact.
While this practice is commonplace in much of the published work, we have included the requested plots to address the reviewer’s concern.
your method does not achieve better performance than state-of-the-art (SOTA) decoders.
- Our method focuses on optimizing codes for the BP decoder and not on developing or training decoders. This is the essence of the work – using a novel method for learning Bayesian networks such that the BP becomes more effective. Again, our method is not about training a decoder.
- We find the reviewer’s claim that the Appendix J figure does not demonstrate our method’s ability to improve upon the 5G NR LDPC SOTA difficult to reconcile with the evidence presented. We improve across all codes, except for the third one, for which we obtain identical results.
- We now also provide here the performance of the method on the 5G NR LDPC codes on other channels. For the (64,32) code, our optimized code shows a clear superiority across all three channel types (including the one already in Appendix J). For (128,64), the results are very similar for all three channels. However, see point 4 below.
- As extensively mentioned in the paper, the method focuses on short-length codes (Sec. 2), an important and challenging topic in modern telecommunications, particularly for BP decoding. From code lengths of around one hundred, the method requires perturbed gradients in order to escape local minima (good codes), e.g., see lines 430, 903.
Comparing your method to BCH and Polar codes with BP decoding is not meaningful
- These codes serve as a clear demonstration of the descent capabilities of our optimization framework. Inferring an effective direction and algorithm for stochastic, non-convex, and non-differentiable optimization problems is highly non-trivial. We show that we are able to improve those codes (for the BP decoder) even though they were engineered to have a specific structure with favorable properties.
- Neural BP manuscripts that use BCH and Polar codes have garnered thousands of citations from both the IT and ML communities, underscoring the widespread acceptance of benchmarks similar to those used in our manuscript.
- Of the 14 codes tested in our original manuscript, 10 are sparse. Why focus on the BCH and Polar codes, if you believe these are not meaningful? Following the reviews, there are also three additional SOTA sparse codes that were added to the revised manuscript (Appendix J).
It is unclear what list size you used for the successive cancellation list (SCL) decoding of polar codes. Based on the results, it appears that you may have used a small list size, which significantly weakens the performance of SCL and renders the comparison unfair.
We clearly wrote in line 976: “The first and the second row of the SCL algorithm denote performance with a list length of 1 and 32 respectively.”
While your work introduces some interesting ideas, it does not demonstrate broad impact or significant advancements relative to existing methods to merit acceptance at a top-tier conference like ICLR... better suited for a more specialized venue...
- Assuming the ICLR community is familiar with graphical models, our work is the first to provide a first-order optimization method for the structure of Bayesian networks.
- We do demonstrate the clear superiority of our method for the vast majority of factor graph initializations.
- The review and follow-up responses focus on highly technical aspects and very specific results, judged from a narrow context and covering only a small portion of our results.
- We believe IT conferences are not necessarily better suited than ICLR or other ML venues to judging graphical models and modern optimization methods.
The paper presents a novel gradient-based, data-driven approach for constructing low-density parity-check codes. The belief propagation algorithm on sparse-graph codes is reformulated into a differentiable matrix representation. The paper further relaxes the 0-1 constraints on the parity-check matrix bits into learnable continuous weights. Combined with an STE (straight-through estimator), the paper reformulates code design as a differentiable end-to-end optimization problem. The paper also proposes a binary line-search method for solving the optimization problem.
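Schematically, the resulting end-to-end optimization resembles the loop below (a heavily simplified sketch; `soft_bp_decode` is only a differentiable stand-in for the paper's tensor-form BP, and all names and hyperparameters are my own assumptions rather than the authors' implementation).

```python
import torch

def soft_bp_decode(llr, H, iters=5):
    # Stand-in for the paper's differentiable tensor-form BP decoder: it merely damps
    # each channel LLR by its variable-node degree under the learned mask so that
    # gradients flow through H. It is NOT an actual BP implementation.
    deg = H.sum(dim=0, keepdim=True)
    return torch.sigmoid(-llr / (1.0 + deg))        # soft estimates of the bits

n, k, sigma = 32, 16, 0.7
edge_logits = torch.nn.Parameter(torch.zeros(n - k, n))
opt = torch.optim.Adam([edge_logits], lr=1e-2)

for step in range(100):
    soft = torch.sigmoid(edge_logits)
    H = (soft > 0.5).float() + soft - soft.detach()     # straight-through estimator
    # BP symmetry allows training on the all-zero codeword (BPSK maps bit 0 to +1).
    y = torch.ones(64, n) + sigma * torch.randn(64, n)  # AWGN channel output
    llr = 2.0 * y / sigma ** 2                          # channel LLRs
    x_hat = soft_bp_decode(llr, H)
    loss = torch.nn.functional.binary_cross_entropy(x_hat, torch.zeros_like(x_hat))
    opt.zero_grad()
    loss.backward()
    opt.step()
```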
Strengths
- The paper contains a novel idea of constructing low-density parity-check codes using an end-to-end differentiable, data-driven approach. The idea of using an STE (an approach usually used in deep-learning quantization and sparsity design) to convert the original NP-complete discrete optimization into a differentiable optimization problem is novel.
- The approach in this paper opens up a new research direction of using end-to-end differentiable optimization for constructing low-density parity-check codes.
- The proposed binary line search method is novel.
- The experimental results in the paper show that the differentiable end-to-end approach results in improved coding performance.
- The paper is well-written and clear.
Weaknesses
- It seems that the computational complexity of the algorithm is quadratic with respect to the code length. For example, every element of the parity-check matrix is a learnable continuous variable. Also, in the binary line search, during each optimization step, the number of relevant grid samples can be as large as n(n−k), which is quadratic with respect to the code length n.
- In the experimental results, the optimizations are initialized from known sparse codes. The question is whether we can start the optimization from a randomly initialized code design and still end up with a code with superior decoding performance.
- It seems that the sparsity regularization term is an L1 regularization. The paper lacks an in-depth discussion of other possible sparsity regularizers and of whether L1 regularization is optimal.
- From Equation 8, it seems that the losses are equally weighted across the decoding iteration steps. Because the losses are large at the first several iteration steps, my opinion is that training the model with such equally weighted losses mainly optimizes the code for its performance in the first several decoding iterations.
Questions
- For the computational complexity issue, could the authors provide more discussion on how to lower the computational cost, or future directions for lowering it?
- The paper shows that initializing from known sparse codes already results in performance improvements. However, initializing from these known codes would probably result in codes that are very close to the existing codes. It would be interesting to see whether other types of initialization would give us more diverse or new codes that are totally different from all the hand-crafted codes and outperform the known codes. I think initializing the optimization from the all-zero parity-check matrix would pose a very challenging optimization problem because of symmetry-breaking issues. Could the authors try to initialize the optimization by randomly picking a code from the code ensemble satisfying a particular pair of degree distributions? Or the authors could provide a discussion on why initialization from known codes makes more sense.
- For the sparsity regularization, the authors provide very little discussion in the draft. In my opinion, the sparsity regularization may need more careful consideration. For example, what we may want to achieve is a sparse parity-check matrix satisfying a certain pair of degree distributions (or a regular LDPC code). Thus, each row (column) should be sparse with a designed number of non-zero elements. The authors could argue that a simple L1 regularization already guarantees that the optimization ends up with a regular code or a code satisfying a designed degree distribution, of course by providing more experimental results. Could the authors try to use a combined L1 and L2 regularization, where L1 is applied to each row and each column, and L2 is applied to the row weights and column weights?
- For the loss weighting, I think the authors could try equal weighting in the early epochs and weight the losses at large decoding iterations more heavily in later epochs. Or, the authors could provide more discussion on why equal weighting is desirable in certain senses.
In summary, it is suggested that the authors
- provide more discussion on how to lower the computational cost, either in this paper or as a future direction
- provide more experiments on different optimization initializations, or a discussion on why the current initialization makes more sense
- provide more experiments using a combined L1/L2 sparsity regularization, or a discussion on whether the L1 regularization already yields the desired sparse parity-check matrices
- provide more experiments using adaptive loss weighting, or a discussion justifying the current equal weighting.
We thank the reviewer for the positive feedback and insightful comments.
Computational Complexity
Regarding the gradient computation, we indeed require gradients over all n(n−k) parity-check matrix entries to perform the first-order optimization. Our line search reduces the infinite search space to a finite grid of at most n(n−k) candidates. However, due to the local proximity of the optimum to the working point, we limited the search to 50 steps in our experiments (see line 322 and Appendix C).
This complexity is minimal and manageable compared to training modern neural networks with up to trillions of parameters (as suggested by recent work in the field of ECC). Furthermore, our method is the first gradient-based optimization approach for designing codes and factor graphs, contrasting with prior heuristic or search-based methods. It requires an average of one hour of training for the tested codes (line 323) and is up to 140 times faster than existing search-based methods, such as the genetic algorithm in [Elkelesh et al., 2019] (line 409). Notably, all computational cost is incurred only during training, with no added cost to BP’s inference algorithm.
Impact of Initialization
The impact of initialization in our method stems from using first-order oracles to optimize non-convex objectives (cf. lines 20, 297). This challenge is common in first-order methods, including neural network optimization. We tested multiple initializations (sparse, dense, random codes) and observed that sparse initializations yielded the best results, positioning the algorithm near better local minima. However, for larger codes, the method can struggle to improve due to this pre-existing local minimum, as it may require escaping a potentially suboptimal extremum (line 425). This point is now emphasized in the paper’s conclusions.
L1 Regularization
If we misinterpreted the reviewer’s remark, please clarify. Given that the learned edge weights are binarized (taking values in {0, 1}), the effects of the L1 and (squared) L2 norms are similar. However, applying different regularization factors to the row and column weights could yield unique impacts.
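For concreteness, the identity presumably underlying this remark is that the two penalties coincide on binarized entries:

$$w \in \{0,1\} \;\Rightarrow\; |w| = w^2, \qquad \text{hence} \qquad \|W\|_1 = \sum_{i,j} |W_{ij}| = \sum_{i,j} W_{ij}^2 = \|W\|_F^2 .$$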
We note here that in Figure 4, we provide empirical evidence that SGD enforces sparsity through its inductive bias [1]. Figure 6 further suggests that sparsity regularization has minimal impact on convergence, implying that gradient descent and BP exert similar effects in promoting sparsity. Additionally, any constraint can be incorporated into the optimization problem.
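As an illustration of the last point, the reviewer’s combined scheme could be incorporated as an extra penalty roughly as follows (a sketch under assumed notation and placeholder hyperparameters, not an experiment we report):

```python
def degree_regularizer(H_soft, l1=1e-3, l2=1e-3, target_row=6.0, target_col=3.0):
    """Sketch of the reviewer's suggestion: L1 on the individual relaxed entries
    to promote sparsity, plus an L2 term pulling row/column weights toward target
    degrees (e.g., those of a regular LDPC ensemble). Hyperparameters are placeholders."""
    row_w = H_soft.sum(dim=1)                      # check-node degrees
    col_w = H_soft.sum(dim=0)                      # variable-node degrees
    sparsity = H_soft.abs().sum()                  # L1 over all relaxed entries
    degree = ((row_w - target_row) ** 2).sum() + ((col_w - target_col) ** 2).sum()
    return l1 * sparsity + l2 * degree
```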
Weighting of Iterations
We appreciate the reviewer’s recommendation regarding iteration weighting. Weighting the objective as suggested by Nachmani et al. (2016), using a multi-loss objective across iterations, may indeed improve performance. We have now added this reference. We note here that, due to computational constraints, we optimized our model with only five iterations (line 320).
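A sketch of the kind of iteration weighting the reviewer suggests (illustrative; `gamma` and the weighting scheme are assumptions, not something we evaluated) is given below; `gamma = 1` recovers the equal weighting of Eq. (8).

```python
import torch
import torch.nn.functional as F

def iteration_weighted_loss(soft_outputs, target_bits, gamma=1.5):
    """soft_outputs: list of per-iteration soft bit estimates from the decoder.
    gamma > 1 up-weights later BP iterations; gamma = 1 gives equal weighting."""
    T = len(soft_outputs)
    weights = torch.tensor([gamma ** t for t in range(T)])
    weights = weights / weights.sum()
    losses = torch.stack([F.binary_cross_entropy(x, target_bits) for x in soft_outputs])
    return (weights * losses).sum()
```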
[1] Soudry et al., The implicit bias of gradient descent on separable data, JMLR18
In this study, the authors propose a method to optimize the parity-check matrix (PCM) using deep learning techniques. They modified the BP decoding equations that rely on the PCM so that it could become trainable. To accelerate the training process, they utilized a line search method. The proposed code optimization method has the advantage of being applicable to codes with constraints or in arbitrary channels.
Strengths
In the field of model-based neural decoder research, previous work has focused on optimizing decoder weights for a given PCM. However, this study differentiates itself by optimizing the PCM itself, which is both novel and distinctive. The derivation of BP decoding equations in which the PCM is trainable is particularly commendable. The paper presents a significant amount of experimental results, demonstrating performance gains through optimization for various code types.
Weaknesses
In Table 1, the optimization is performed for a wide range of initial codes, but for high-density codes like BCH, Polar, and LTE Turbo, which have their own optimized decoders instead of BP decoders, the improvements in BP performance seem less meaningful. It seems more appropriate to compare LDPC codes, but there are concerns about the "representativeness of the LDPC codes" used in the comparison. Both the MacKay and CCSDS codes are quite outdated (and indeed seem to show limited performance improvement). Furthermore, the PEG construction used for the LDPC PEGX code appears to be an early-stage method. In addition, the CCSDS code is known to ensure efficient encoding, but as shown in Figure 7, modifying the PCM without constraints (as seems to have been done in Table 1) may no longer guarantee this property.
Therefore, it would be better to evaluate whether the performance can be improved for more representative LDPC codes (e.g., short-length ARJA code class or 5G LDPC codes) while maintaining the functionalities of these code types (efficient encoding, QC structure, and rate compatibility).
Additionally, a comparison with Elkelesh et al. (2019), another data-driven code construction method, does not seem fair. In Table 4, the authors compared their optimized matrix for iterations of 5 and 15, whereas the Elkelesh method was optimized at iterations of 75 and 150. Since the Elkelesh method is also a data-driven construction approach, it seems feasible to optimize it with a smaller iteration count, such as 15. A comparison under such conditions would be necessary.
The proposed method appears to be a local optimization method highly dependent on the initial matrix. As shown in Figure 7, only a small number of edges were changed in the sparse cases. As the goal of code construction is to create a globally optimized code for a given code length, the proposed method does not seem to align with this objective.
Questions
Additionally, I have a few questions:
-
The purpose of showing Figure 2 is unclear. I can observe that the variation increases significantly at high SNRs; a discussion on this would be helpful.
-
What exactly is the meaning of PEG X? In Table 1, there appear to be significant performance differences between PEG2, PEG5, and PEG10 (with PEG5 showing particularly superior performance). It would be helpful to clarify this.
-
In line 251, it is stated that the method can be applied regardless of modulation. If there are experimental results for modulations other than BPSK, it would be useful to include them.
-
In line 428, an experiment was proposed to optimize only the non-identity part of the parity-check matrix while keeping the identity submatrix I fixed in its systematic form. However, most LDPC standards use a dual-diagonal form rather than the systematic form. Therefore, an optimization experiment assuming the dual-diagonal form would be more appropriate.
-
Overall, the code lengths are very short, with a maximum length of 128. In this range, Polar codes are known to be superior, so the practicality of LDPC code optimization seems somewhat limited.
-
Further research on additional properties of the modified code, such as minimum distance or cycles, is necessary.
We thank the reviewer for the valuable feedback and constructive remarks.
Missing performance on 5G LDPC codes.
Our work introduces a new learning paradigm aimed at improving the performance of a given initialization of the factor graph/code. As with any first-order method applied to non-convex objectives, the method is sensitive to initialization and generally converges to local minima. Thus, the primary focus of this work is to present an efficient and empirically effective method, rather than exhaustively testing all sparse code variants.
In response to the reviewer’s recommendation, we conducted experiments with state-of-the-art (SOTA) 5G NR LDPC codes. The results, presented in Appendix L of the revised manuscript, demonstrate that our method can enhance these SOTA codes, highlighting the potential of our data-driven approach in modern ECC settings.
Comparison with Elkelesh et al.
We are unsure what the concern is.
We implemented the Genetic Algorithm (GA) method from [Elkelesh et al., 2019] (originally in MATLAB) and conducted a direct comparison under the same conditions as our method, training both methods on 15 iterations and testing on 5 and 15 iterations.
The iterations mentioned refer to the number of GA training iterations. As a genetic algorithm, this approach is slower and limited in its beam-search capacity.
We also note in line 900 that combining our method with GA could help avoid poor local minima (similar to the perturbed gradients in Appendix A), especially with good initialization.
The proposed method appears to be a local optimization method highly dependent on the initial matrix
The sensitivity to initialization stems from using first-order oracles for the optimization of non-convex objectives (see lines 20, 297, 422, 426, 803, and 901). This limitation is inherent to all first-order methods in non-convex optimization, with neural network optimization being the most ubiquitous illustration.
Experimental results show that local convergence can yield substantial improvements, even with SOTA sparse codes (Appendix L).
Questions
1. Figure 2 shows the performance improvement in dB across various sparse codes, providing a compact summary (14 codes tested) of the conventional BER vs. SNR plots. In response to the reviewer’s suggestion, we added BER and BLER vs. SNR plots for various codes and channel configurations in Appendices J and L of the revised manuscript.
2. The definition is provided in line 327.
4. This experiment is part of the ablation study (Section 6) and demonstrates the method’s ability to integrate constraints/regularizations and their potential beneficial impact. The optimization framework can accommodate any structural constraint or desired code property.
5. LDPC codes, known for asymptotic optimality and efficient decoding, are relevant in modern communication, where short codes are increasingly important ([Liva et al., 2016], [1], lines 82-85). Supporting efficient short LDPC codes would enable a single decoder (BP) to be used on a device instead of deploying multiple decoders, as required in modern standards like 5G (short Polar codes and longer LDPC codes [ETSI, 2021]).
A comparison with SCL is now included in Appendix I of the revised manuscript. Our results show that even in the challenging short-length setting, where achieving sparsity is difficult, our method significantly improves over existing low-density codes and approaches the ML bound with very few iterations, even under suboptimal initialization. With good initialization (using a sparse LDPC code), our method achieves state-of-the-art performance.
7. Sparsity statistics are provided in Figure 5. The optimization does not affect the code’s girth (line 458), and calculating the minimum distance remains infeasible for these codes.
Per other reviewers’ suggestions, we now include the column weight distribution in Appendix K. We welcome further feedback on important analyses to include.
[1] Coşkun, Mustafa Cemil, et al. "Efficient error-correcting codes in the short blocklength regime." Physical Communication, 34 (2019): 66-79.
Thank you for addressing the comments.
Although I agree that this work presents new contributions to the AI-based coding community, I find the manuscript unsuitable for publication in a high-quality conference for the following reasons:
-
Local Optimization Limitation: The proposed method is unavoidably limited to local optimization, which does not align with the ultimate goal of the code design problem—finding globally optimized codes.
-
Focus and Presentation Issues: The main body of the paper should emphasize LDPC codes, as the work primarily utilizes the BP decoder. Instead of focusing on polar/BCH codes with a BP decoder, it would suffice to compare polar codes using an SCL decoder, given the focus on short-length codes.
-
Insufficient Comparison: The comparison with other methods (e.g., the Genetic Algorithm or other) is weak. Currently, the manuscript simulates only a single code. A broader and more robust comparison is necessary to strengthen the validity of the results.
-
Consideration of Constraints: There needs to be a more detailed consideration of constraints, such as dual diagonal, QC structures, and raptor-like constraints, to demonstrate the effectiveness of the proposed methods within practical applications.
-
Most importantly, the current manuscript suffers from a lack of readability, and the focus appears scattered as it attempts to address multiple points simultaneously.
Remaining Questions:
-
I misunderstood the use of the term iteration (line 862), interpreting it as referring to BP decoder iterations. While this was my error, other readers might also misinterpret it. Since iteration is consistently used to describe BP decoding throughout the paper, I recommend adopting a different term for iterations in the GA algorithm to avoid confusion.
-
While I read the definition on line 327, it remains unclear. Specifically, the statement “the degree of each node is X%” is ambiguous. For example, if X=2 for LDPC PEG2(64, 32) in Table 1, what does “the degree of each node is 2%” mean? Although I understand the PEG algorithm well, this definition still needs clarification.
Dear Reviewer j4us,
As we near the conclusion of this year's discussion period, we would like to extend our gratitude for your thoughtful feedback, which has significantly enhanced our manuscript. We believe we have addressed the majority of the concerns raised by you and the other reviewers. In particular:
- We added BER and BLER vs. SNR curves in Appendices J and L. These curves allow a better visualization of our method’s impact.
- We added a performance comparison with SCL in Appendix I. The comparison shows our method can provide state-of-the-art sparse codes even on extremely short block lengths.
- We added performance on short 5G NR LDPC codes in Appendix L. The results demonstrate that our method can also outperform state-of-the-art short sparse codes.
- We added the codes’ column weight distribution in Appendix K. Our method introduces a distribution shift that adapts to BP's inductive bias in the short block length regime.
We would greatly value any further feedback you may have.
We thank the reviewer for the answer.
Local Optimization Limitation and not finding globally optimized codes.
Non-convex (stochastic and non-differentiable) global optimization is an NP-complete problem. The reviewer’s requirement for (finite time) global optimization is reducible to solving the most important open problem in computer science.
The main body of the paper should emphasize LDPC codes, as the work primarily utilizes the BP decoder. Instead of focusing on polar/BCH codes with a BP decoder, it would suffice to compare polar codes using an SCL decoder, given the focus on short-length codes.
-
These codes serve as a clear demonstration of the descent capabilities of our optimization framework. Inferring an effective direction and algorithm for stochastic, non-convex, and non-differentiable optimization problems is highly non-trivial. We show that we are able to improve those codes (for the BP decoder) even though they were engineered to have a specific structure with favorable properties.
-
Neural BP manuscripts that use BCH and Polar codes have garnered thousands of citations from both the IT and ML communities, underscoring the widespread acceptance of benchmarks similar to those used in our manuscript.
-
Of the 14 codes tested in our original manuscript, 10 are sparse. Why focus on the BCH and Polar codes, if you believe these are not meaningful? Following the reviews, there are also three additional SOTA sparse codes that were added to the revised manuscript (Appendix J).
-
The reviewer may have overlooked the revision. We have now included a comparison with SCL in Appendix I of the revised manuscript. Our results show that even in the challenging short-length setting, where achieving sparsity is difficult, our method significantly improves over existing low-density codes and approaches the ML bound with very few iterations, even under suboptimal initialization. With good initialization (e.g., using a sparse LDPC code), our method achieves state-of-the-art performance.
Consideration of Constraints
As an algorithmic framework targeting the ML community under the graphical models paradigm, our work is not designed to address every industrial constraint at once. Such a requirement deviates significantly from the established benchmarks and scope of neural coding/decoding research.
lack of readability
The claim about readability issues is new and did not appear before (!). We invite the reviewer to specify which sections, and which simultaneously addressed points, they find unclear so that we may address them effectively. If this refers to the two issues below, please consider them fixed.
I misunderstood the use of the term iteration (line 862).. While this was my error, other readers might also misinterpret it.
This line refers to the genetic algorithm baseline and not our method. Genetic algorithms are iterative by definition, and at each iteration a generation is created. We will refine the terminology in the revised manuscript so that it is clear that an iteration of the BP method is not the same as an iteration of the genetic algorithm.
definition on line 327 remains unclear
It refers to the degrees' sparsity level, i.e., X% with X referring to the percentage and n to the number of variable nodes. We will make this definition clearer.
The paper deals with factor graph (or parity-check matrix, PCM) optimization of error-correcting codes for Belief Propagation (BP) decoding. This optimization is difficult as it is an integer optimization problem: the elements of the PCM are either 0 or 1, and even small changes (e.g., the replacement of one element) can drastically worsen the performance. The authors propose a gradient-based, data-driven approach via a novel complete-graph tensor representation of the Belief Propagation algorithm. They demonstrate the efficiency of the resulting codes by simulations.
Strengths
The main strong points are as follows:
- the first gradient-based data-driven approach via a novel complete graph tensor representation of the Belief Propagation algorithm.
- automating the process of finding a good code.
These points are indeed important, as in practice people usually construct LDPC codes via heuristics or optimization algorithms like genetic algorithms or simulated annealing, which take a lot of time and computational resources.
Weaknesses
I list the main weaknesses below:
- Given that LDPC codes are known to perform poorly at short lengths due to short cycles in the Tanner graph (the length of the minimal cycle, i.e., the girth, of the Tanner graph is O(log n)), could the authors elaborate on their motivation for focusing on LDPC optimization for short codes? Are there specific applications or theoretical insights they hope to gain from this approach?
- Could the authors provide BER vs SNR (or Eb/N0) curves for their results, in addition to the current tabular format? This would allow for easier comparison with existing literature and help identify potential error floors.
- To better understand the performance of your constructed codes, could you include a comparison with polar codes under SCL decoding (L=8) for short lengths? Such results can be found, e.g., here: https://rptu.de/channel-codes/channel-codes-database.
- To provide theoretical context for your results, could you compare them to the finite length achievability and converse bounds from Polyanskiy et al. (2010)? This would help situate your method's performance relative to fundamental limits.
- Could you provide an analysis of the final Tanner graphs produced by your method, including metrics such as column weight distribution, minimum distance, and trapping sets? This would offer insights into the structural properties of the optimized codes.
- [Minor] You write about the generator matrix and the problems related to obtaining G from H. But for the BP decoder you can perform training on the zero codeword only (as you mention later). I suggest shifting to the zero codeword from the very beginning.
Questions
See weaknesses
We thank the reviewer for their valuable feedback and insights.
Motivations for Short Codes
The primary motivation for short codes is practical, as outlined in lines 82-85. Alongside the survey by [Liva et al., 2016], further motivations are detailed in [1]. We summarize here the main reasons:
- Larger blocklength codes approaching the capacity can be only marginally improved.
- Short and medium blocklength codes are increasingly important in emerging applications, including IoT, smart metering networks, remote command links, and messaging services.
- From a systemic perspective, larger computational resources are required to explore the optimization of larger codes.
Further explanations on these motivations are provided in the revised manuscript.
[1] Coşkun, Mustafa Cemil, et al. "Efficient error-correcting codes in the short blocklength regime." Physical Communication 34 (2019): 66-79.
BER vs. SNR Curves
The chosen metric and display method follow established norms in the field of ECC as published in the AI community. This metric emphasizes the exponent, is easily inverted (BER = exp(-x) for a reported value x), and enables a compact presentation across multiple codes and scenarios (18 codes on different channels).
Based on the reviewer’s suggestions, we have added BER and BLER vs. SNR curves in Appendix J of the revised manuscript. Additionally, Appendix L provides similar curves for short state-of-the-art 5G NR LDPC codes, demonstrating that our method outperforms these codes as well.
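For concreteness, a tabulated value x = -ln(BER) converts back to an error rate directly; with an illustrative value of 6.9 (not taken from our tables),

$$-\ln(\mathrm{BER}) = 6.9 \;\Longleftrightarrow\; \mathrm{BER} = e^{-6.9} \approx 1.0 \times 10^{-3}.$$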
SCL Decoding Performance
A comparison with SCL is now included in Appendix I of the revised manuscript. Our results show that even in the challenging short-length setting, where achieving sparsity is difficult, our method significantly improves over existing low-density codes and approaches the ML bound with very few iterations, even under suboptimal initialization.
With good initialization (using a sparse LDPC code), our method achieves state-of-the-art performance.
PCM Analysis
Sparsity statistics are provided in Figure 5. The optimization does not affect the code’s girth (line 458), and calculating the minimum distance remains infeasible for these codes.
Per the reviewer’s suggestion, we now include the column weight distribution in Appendix K. We welcome further feedback on important analyses to include.
[Minor] Generator Matrix
Our method preserves the symmetry property of BP, making it independent of the generator matrix. However, the initial formulation supports alternative optimization schemes that could break this symmetry and lead to non-invariance with respect to the generator matrix (e.g., by optimizing Trellis graph connectivity). This clarification is added to the revised manuscript.
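For completeness, the symmetry argument is the standard one (our paraphrase of well-known conditions, not a new claim of the manuscript): for a binary-input output-symmetric channel and a decoder satisfying the usual message-passing symmetry conditions,

$$\Pr[\text{decoding error} \mid x \text{ transmitted}] = \Pr[\text{decoding error} \mid \mathbf{0} \text{ transmitted}] \quad \text{for every codeword } x,$$

so training and evaluating on the all-zero codeword is sufficient.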
I would like to thank the authors for the answers! Let me provide several clarifications:
-
Motivations for Short Codes. I asked about the motivation for short LDPC codes. Short codes are definitely of extreme importance, but 5G standardization experience shows that short polar codes are much better. I would understand the problem statement if you changed the decoding algorithm, but you utilize classical BP.
-
SCL Decoding Performance. Only the case of (32, 16) code is considered. The results for L=1 and L=32 are practically identical, which says that you do not use CRC. Please compare to some 5G polar code (see the link to the results in my first review). If we look at Fig. 13, it is rather natural to add SCL results for (128, 86) polar code.
-
Please add Polyanskiy's bound (normal approximation). It is rather tight and shows the best possible results. ML is not sufficient, as it may be bad for bad codes.
I leave the current rating.
We thank the reviewer for the feedback.
-
Assuming a given standard, there is a need to integrate a dedicated (optimal) decoder for each family of codes. For example, 5G requires deploying a BP decoder and an SCL decoder for the adaptive decoding of various code lengths (Polar for shorter lengths). Enabling more efficient BP for short codes (or SCL for larger codes as well) may pave the way to a standard built upon a single family of codes, potentially accelerating decoding while saving die size and power.
-
CRC is an augmentation practice that, if applied to one method and type of codes, could and should be added to the other codes as well (CRC-aided LDPC codes). Also, SCL in general and CRC (with its hyperparameters) in particular are far beyond the scope of our work.
While SCL is not within the natural scope of our work, following the comment “Only the case of (32, 16) code is considered”, we ran SCL on more codes in order to compare them to the 5G LDPC SOTA (without unused rows and columns (puncturing)), and we provide the curves here. As can be seen, our method provides better performance than Polar codes for the additional codes as well. However, the performance of our method degrades as the codes get larger, since gradient-based solutions converge to local minima (near a good initialization).
-
We are unsure how Polyanskiy’s bound should be calculated and integrated into our results for an accurate comparison with our settings. We will be happy to add it following the reviewer’s advice, given more guidance. That being said, such a result is also highly ECC-oriented and, we believe, out of the scope of our ML paper.
Following the review, we have provided various comparisons with very specific codes and decoders developed and refined over decades. While these results support our claims, from our perspective their relevance is not extremely large, since our stated goal is to improve short codes for BP. We would like to reiterate that our work focuses on advancing the field of structure learning of graphical models. We believe that our novel method is a very significant step towards this goal.
Dear Reviewer GzCC,
As we near the conclusion of this year's discussion period, we would like to extend our gratitude for your thoughtful feedback, which has significantly enhanced our manuscript. We believe we have addressed the majority of the concerns raised by you and the other reviewers. In particular:
- We added BER and BLER vs. SNR curves in Appendices J and L. These curves allow a better visualization of our method’s impact.
- We added a performance comparison with SCL in Appendix I. The comparison shows our method can provide state-of-the-art sparse codes even on extremely short block lengths.
- We added performance on short 5G NR LDPC codes in Appendix L. The results demonstrate that our method can also outperform state-of-the-art short sparse codes.
- We added the codes’ column weight distribution in Appendix K. Our method introduces a distribution shift that adapts to BP's inductive bias in the short block length regime.
We would greatly value any further feedback you may have.
The paper proposes an optimization scheme for the Tanner graph of low-density parity-check (LDPC) codes. In particular, the scheme aims to improve decoding performance under a belief propagation (BP) decoder. Since the problem is a non-convex binary optimization problem whose cost function is implicitly defined, a gradient-based approach is applied by relaxing the variables and using a straight-through estimator. In addition, a grid search is proposed to find an optimal learning rate. Numerical results show that codes optimized by the proposed method outperform existing codes in decoding performance.
Strengths
- The proposed method is the first gradient-based optimization of a parity-check matrix for error-correcting codes.
- In addition, a grid-search approach for finding the learning rate is proposed, which is an alternative to the line-search approach.
- Numerical results show that optimized codes by the proposed method can be decoded more accurately than conventional codes.
Weaknesses
The reviewer understands the motivation of the work and the numerical effectiveness of the proposed method. However, the paper contains some flaws as follows:
- Technical contributions of the proposed method
The contributions that the authors claim are threefold: (i) formulation of the problem, (ii) reformulation of BP in tensor fashion, and (iii) differentiable and fast optimization method for the problem. Here, the reviewer wants to evaluate each contribution in detail.
-
(i) Formulation of the problem: In the paper, Eq. (4) (or (8)) is the formulation of the problem, whose novelty is claimed by the authors. However, a similar formulation (mainly focusing on "expectation w.r.t. random codewords and noise") has already been proposed in [Choukroun & Wolf, 2024a] cited in the paper. Eq. (4) is different from [Choukroun & Wolf, 2024a] in that Eq. (4) includes the regularization terms. However, the authors do not compare the differences explicitly and claim that the whole formulation in this paper is novel. In addition, the effect of regularization needs to be included in numerical experiments. The reviewer suggests that the author carefully clarify the novelty of the formulation.
-
(ii) Reformulation of BP in Tensor fashion: The reviewer does not consider this part technically novel because a similar "tensorization" technique has been used for the implementation of neural BP decoders.
-
(iii) Differentiable and fast optimization method for the problem: The reviewer agrees with this point. However, there are some flaws, as shown below.
Overall, the technical contributions that the authors claim seem partly insufficient. In addition, the paper does not contain any mathematical or analytical contributions, which is obviously a weak point.
- Contributions of the grid-search approach
Related to the above point, the authors claim the novelty of the grid-search approach for the learning rate. However, the reviewer wonders about its novelty for the following reasons.
-
No comparison with line-search methods: the authors claim that conventional line-search methods assume the convexity of the cost function and are therefore unsuitable for non-convex problems like the Tanner graph optimization in this paper. However, that assumption is only required to show the optimality of line-search methods; in practice, line-search methods are used for non-convex optimization without such assumptions. Therefore, a numerical comparison in terms of optimization performance and/or execution time is necessary to show the effectiveness of the grid-search method.
-
No theoretical guarantee: In contrast to the above discussion, there is no guarantee that the grid-search method finds the optimal learning rate. This is a weakness of the proposed method.
-
Effectiveness of the grid-search approach: It is reported in Sec. 6 that an optimized matrix depends on the initial matrix, suggesting that the proposed method finds a suboptimal solution, not an optimal one. This fact may weaken the effectiveness of the method. At least, a comparison with other learning-rate optimizations will be required.
- Missing advantage of the gradient-based optimization
The authors claim that the proposed gradient-based optimization is fast, but no evidence is provided in the paper. How fast is the method, and compared with what?
Questions
Questions are included in the "Weakness" section.
Suggestions:
- Around Line 131: Please clarify whether vectors are column or row vectors. In any case, the dimensions of the multiplication are incorrect.
- Line 157: should be because the first iterations is .
- Eq. (4): What does the expectation w.r.t. the number of decoding iterations mean? Is it a random variable rather than a constant number?
- Line 197: A generator matrix is not unique in general. Is the form of the matrix fixed in this paper?
- Line 204 etc.: should be according to Line 131.
- Eq. (6): "(" should be removed.
- Eq. (8): The cumulative loss function w.r.t. iteration in (8) has been proposed in previous studies on neural BP decoders. Please add some references.
- Line 291: What is the meaning of "sufficient statistics"? It is unclear in what sense the gradient is a sufficient statistic. It is recommended to rephrase the term correctly.
- Eq. (9): What does index stand for? Since represents a matrix, indices are given like .
- Line 310: What is the meaning of "line-search objectives"? Is it the cost function for the optimal learning rate?
- Line 373: The i.i.d. Gaussian mixture model is called bursty noise in this paper. It seems incorrect because bursty noise generally implies time-dependent noise.
- Line 457: "sparse code" should be "a sparser code" or "sparser codes."
We thank the reviewer for the feedback and remarks.
1.i. Novelty wrt [Choukroun & Wolf, 2024a]
The objectives in both papers describe the same standard channel coding pipeline. While [Choukroun & Wolf, 2024a] describes the end-to-end optimization of differentiable neural decoders (i.e., code and decoder), our method optimizes the code for the classical Belief Propagation decoder. Besides the regularization term, it can be observed that the expectation, the objective, and the optimization variables are different.
Our setting is fundamentally different (lines 201-207) since unlike [Choukroun & Wolf, 2024a] our decoder cannot be differentiably optimized as it requires the structure learning of the underlying graph/algorithm.
1.ii. Novelty wrt neural BP
Neural/weighted BP should never be implemented in a tensor form since it would lose all of the efficiency advantages of message-passing algorithms (sparsity).
We adopt the tensor form in order to learn the connectivity itself upon a complete factor graph. We are not learning the weighting of the variable nodes’ messages obtained upon a given fixed graph connectivity as in neural BP (line 48).
This is a fundamental difference from neural BP.
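To make the distinction concrete, a single sum-product check-node update in this dense, masked form could look as follows (an illustrative sketch, not our implementation); with a fixed sparse graph one would instead iterate only over the stored edges, which is exactly the efficiency that neural BP implementations preserve.

```python
import torch

def check_node_update(v2c: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Dense sum-product check-node update over the complete bipartite graph.
    v2c and H have shape (n - k, n); H is the (relaxed or binarized) connectivity mask."""
    eps = 1e-7
    t = torch.tanh(v2c / 2.0)
    # Edges absent from the graph contribute the multiplicative identity 1,
    # so they do not affect the product.
    t_masked = H * t + (1.0 - H)
    prod_all = torch.prod(t_masked, dim=1, keepdim=True)
    # Extrinsic (leave-one-out) product via division, guarded against division by ~0.
    safe = torch.where(t_masked.abs() < eps, torch.full_like(t_masked, eps), t_masked)
    c2v = 2.0 * torch.atanh((prod_all / safe).clamp(-1 + eps, 1 - eps))
    # Messages on non-edges are irrelevant downstream; the mask zeroes them out.
    return H * c2v
```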
- mathematical or analytical contributions
From the ML perspective, the method develops new tools of high interest, such as a new way to optimize graphs for BP (not just for codes, i.e., structure learning) and a novel line-search procedure tailored for quantized optimization. The information-theoretic aspects of codes are not the main emphasis of our work. As a gradient descent method, the proposed framework's performance is tied to the initial graph, and this is emphasized repeatedly in the paper (cf. lines 20, 297, 422, 426, 803, 901). Further analysis of the optimization over the cross-entropy loss can be done via classical convergence analysis of quantized networks [2], but this is beyond the scope of our work.
- “Comparison with line-search methods” … “LS methods are used for non-convex optimizations wo assumptions…
Existing methods:
Inexact line search methods can be used for non-convex optimizations to find local minima. The existing zero/first-order line-search methods operate on a predefined continuous range (generally small) on which the search is performed.
In our binary setting, a continuous range is impossible to set, as it can grow arbitrarily large depending on the parameter values. Moreover, a standard line search would only find local minima of the line-search objective because of the suboptimal range and the discretization approximation.
Our method:
As a binarization-aware method, we provide a provably optimal line-search procedure (line 304) whose computations are parallelizable.
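As an illustration of the idea, here is a simplified sketch (the function and variable names, the zero binarization threshold, and the Monte-Carlo loss handle are placeholders, not the exact procedure of the paper):

```python
import numpy as np

def binary_line_search(W, grad, loss_fn):
    """Along the direction -grad, the binarization of W - t * grad changes only at
    the finite set of step sizes where an entry crosses the threshold, so evaluating
    the loss on each distinct candidate yields the global minimum of the
    one-dimensional line-search objective."""
    with np.errstate(divide="ignore", invalid="ignore"):
        crossings = W / grad                        # step sizes at which single entries flip
    crossings = crossings[np.isfinite(crossings) & (crossings > 0)]
    candidates = np.concatenate(([0.0], np.unique(crossings) + 1e-9))
    best_t, best_loss = 0.0, np.inf
    for t in candidates:
        H = np.where(W - t * grad > 0, 1.0, 0.0)    # candidate binary parity-check matrix
        loss = loss_fn(H)                           # e.g., a Monte-Carlo estimate of the BP decoding loss
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss
```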
- Effectiveness of the grid-search approach … depends on the initial matrix
The impact of initialization is related to the use of first-order oracles for the optimization of non-convex objectives (cf. lines 20, 297, 422, 426, 803, 901).
This is, of course, not unique to our work: any first-order method on non-convex objectives would be impacted by initialization, learning neural networks being the most ubiquitous example.
- How fast is the method, and compared with what?
The method is the first gradient-based optimization for the design of codes/factor graphs. All other existing methods are heuristics or search-based methods.
Our method requires a single hour of training on average on the tested codes (line 323) and is up to 140 times faster (and better) than the existing search-based method (a genetic algorithm) [Elkelesh et al., 2019] (line 409).
Suggestions
The suggested corrections and typo fixes will be incorporated into the revision.
3. The expectation is taken with respect to the discrete number of BP iterations in order to ensure optimal decoding under any computational constraint (line 199). This is similar in spirit to the multi-loss of neural BP. A citation has been added.
4. No assumption on the generator matrix is made in the paper. Since the method is built to maintain the symmetry property of BP, the optimization framework does not depend on the generator.
7. We may be missing your point, but this is the discretization of the expectation discussed above (suggestion 3). We added a reference to [Nachmani et al., 2016] for the multi-loss; a small illustrative sketch is given after the references below.
10. Indeed, it is the cost function for the optimal learning rate. For example, since the BER metric is not differentiable, one may search for the best BER during the line search, i.e., optimize directly for the loss of interest (BER) instead of the Bayesian (uncertainty) cross-entropy surrogate.
11. A citation will be added, as this terminology is already in use, e.g., [1].
[1] Kurmukova and Gunduz. Friendly Attacks to Improve Channel Coding Reliability. AAAI 2024.
[2] Li et al. Training Quantized Nets: A Deeper Understanding. NeurIPS 2017.
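For clarity, a minimal sketch of the discretized objective over BP iterations (function and argument names are ours, for illustration only):

```python
import torch
import torch.nn.functional as F

def cumulative_bp_loss(per_iteration_soft_outputs, x_true):
    """Average the binary cross-entropy over the soft outputs produced after every
    BP iteration.  This discretizes an expectation over the number of decoding
    iterations and mirrors the multi-loss used for neural BP decoders
    (Nachmani et al., 2016)."""
    losses = [F.binary_cross_entropy(x_hat, x_true) for x_hat in per_iteration_soft_outputs]
    return torch.stack(losses).mean()
```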
Dear Reviewer 6Jjp,
As we near the conclusion of this year's discussion period, we would like to extend our gratitude for your thoughtful feedback, which has significantly enhanced our manuscript. We believe we have addressed the majority of the concerns raised by you and the other reviewers. In particular:
- We added BER and BLER vs. SNR curves in Appendices J and L. These curves allow a better visualization of our method’s impact.
- We added a performance comparison with SCL in Appendix I. The comparison shows our method can provide state-of-the-art sparse codes even on extremely short block lengths.
- We added performance on short 5G NR LDPC codes in Appendix L. The results demonstrate that our method can also outperform state-of-the-art short sparse codes.
- We added the codes’ column weight distribution in Appendix K. Our method introduces a distribution shift that adapts to BP's inductive bias in the short block length regime.
We would greatly value any further feedback you may have.
I appreciate the authors' response.
1.ii. Novelty wrt neural BP
Although the tensor form appears to lose the sparsity property, one can use sparse matrix operations, so this is not a critical point. I agree that the current formulation is defined on a complete graph, but the reviewer considers that this is not especially novel.
2.“Comparison with line-search methods” … “LS methods are used for non-convex optimizations wo assumptions…
Although the authors claim that the existing methods find local minima, this is also the case with the proposed method, as the authors repeatedly stated in the response.
The reviewer considers that some concerns remain unresolved in the paper. Thus, I keep my current score.
We thank the reviewer for their response. We respectfully disagree with the reviewer on both remaining points, which are fundamental to the understanding of our work.
Novelty wrt neural BP
- We have already clarified that neural BP fundamentally differs from our method. Neural BP learns a weighting of the messages, whereas our method learns the connectivity structure itself.
- Sparse tensors, by definition, do not store zero elements and therefore do not allow backpropagation over them.
- While dense tensors might be used for practical reasons (e.g., GPU programming), the optimization performed by our method operates in a completely different domain than neural BP (learned parameters vs. connectivity) and serves an entirely different purpose (improved decoding vs. code optimization).
Comparison with line-search methods
It appears that there is a fundamental misunderstanding of the paper (lines 294-308) and our rebuttal explanation (rebuttal item 2). As a first-order (i.e., gradient-based) optimization method, our entire approach can only identify a local minimum. Of course, global optimization of non-convex objectives is NP-hard.
However, our proposed method uniquely enables finding the global minimum of the line-search optimization objective. Classical LS methods (e.g., grid/random search, bisection, golden-section search, quadratic/cubic interpolation, Armijo backtracking, etc.) cannot achieve this because of the non-convexity, stochasticity, and non-differentiability of the problem.
We sincerely thank the reviewers for their thoughtful assessment of our manuscript.
We have addressed the reviewer’s feedback and performed the requested follow-up experiments. We have also integrated the reviewers' remarks and suggestions into the revised manuscript. Revisions and additions based on the reviewers’ comments are highlighted in blue in the updated PDF.
Specifically:
- We added BER and BLER vs. SNR curves in Appendices J and L (Reviewers GzCC,7LZc). These curves allow a better visualization of our method’s impact.
- We added a performance comparison with SCL in Appendix I (Reviewer GzCC). The comparison shows our method can provide state-of-the-art sparse codes even on extremely short block lengths.
- We added performance on short 5G NR LDPC codes in Appendix L (Reviewers j4us,7LZc). The results demonstrate that our method can also outperform state-of-the-art short sparse codes.
- We added the codes’ column weight distribution in Appendix K (Reviewers GzCC,j4us). Our method introduces a distribution shift that adapts to BP's inductive bias in the short block length regime.
We note here that our work builds upon Judea Pearl's original BP algorithm, published at AAAI 1982 as a marginalization method for Bayesian networks. Submitted to ICLR under the “probabilistic methods” area, our manuscript introduces a novel method for learning the structure of Bayesian networks (graph connectivity), as outlined in lines 64-70*. Structure learning is a well-established subdomain of probabilistic graphical models. The error-correcting codes (ECC) application is chosen because it is the most prevalent application of BP.
Beyond the important ECC-related questions that were the focus of the reviews, we would be happy to discuss the broader impact of our data-driven method for structure learning in probabilistic graphs, which, to the best of our knowledge, represents the first gradient-based approach in this domain.
(*) Line references pertain to the original manuscript; future updates incorporating additional requests may further change the line numbers. The referenced text already appeared in the original submission.
Summary: This paper proposes the optimization of parity-check matrices of short-blocklength linear codes under belief-propagation (BP) decoding. For this purpose, a continuously relaxed representation of the parity-check matrix is introduced, on the basis of which a gradient-based optimization is performed, where the gradient of the piecewise-constant objective function (equation (8)) is approximated using a shifted straight-through estimator (STE). To deal with the non-convexity of the optimization objective, the authors also propose a line-search-based heuristic.
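For readers unfamiliar with the construction, a generic STE can be sketched as follows (this is only a schematic illustration; the exact shifted variant used by the authors may differ):

```python
import torch

class ShiftedSTE(torch.autograd.Function):
    """Generic straight-through estimator: the forward pass binarizes a (possibly
    shifted) continuous parameter, while the backward pass lets the gradient flow
    through unchanged, so a piecewise-constant objective still yields a usable
    descent direction."""
    @staticmethod
    def forward(ctx, w, shift):
        return (w - shift > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # identity gradient w.r.t. w, none w.r.t. shift

# usage sketch: H_binary = ShiftedSTE.apply(w_continuous, torch.tensor(0.5))
```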
Strengths: According to the authors, this is the first proposal to learn the parity-check matrices of linear error-correcting codes. The effectiveness of the proposal has been demonstrated experimentally in various scenarios.
Weaknesses:
- Reviewer 6Jjp mentioned several suggestions, and Reviewer 7LZc raised concerns about clarity, which would imply that the quality of presentation of this paper is not satisfactory. Although Reviewer 1Q14 wrote that this paper is well written, upon my own reading my evaluation of the quality aligns with that of Reviewers 6Jjp and 7LZc. See the additional points below as well.
- Line search: Since the problem of optimizing the parity-check matrix is formulated as a non-convex optimization, one would, as the authors state, resort to some heuristic. On the other hand, why the proposed line search can be expected to work well for this particular problem formulation is not discussed.
Reasons: All the reviewers agree on the significance of the proposal as well as the demonstrated performance gains. On the other hand, several concerns have been raised by the reviewers, and the authors have not been successful in resolving many of them, resulting in ratings that are mostly below the acceptance threshold.
Additional points:
- The formula in lines 126 and 196 should be corrected to match the definitions in this paper.
- The sentence in lines 147-9 has broken grammar and does not make sense.
- The sets in Section 3 and the sets in Section 4 are equivalent. It is preferable to use the same symbols throughout.
- The authors claim in lines 241-2 that BP is differentiable, but at this point the parity-check matrix is still binary, so BP is not differentiable with respect to it. It is only after introducing the continuous relaxation that BP may be (piecewise) differentiable.
- Reviewer j4us questioned the claim that the proposal can be applied regardless of modulation. This claim has not been justified.
- The authors claim that they chose the BCE as the BER measure (lines 260-1), but these are different notions: the BCE is a cross-entropy between the soft decoder output and the transmitted bits, whereas the bit error is the indicator of a hard-decision mismatch, whose expectation is the BER (see the generic formulation after this list).
- I feel that the subdifferentiability (line 269) is irrelevant here. Subdifferentiability is a notion associated with convex functions, whereas the objective here is non-convex, as the authors mention on the same line.
- Berlekamp (1974): "(No Title)" should be removed.
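Regarding the BCE/BER point above, for concreteness in generic notation (ours, not necessarily the paper's): for transmitted bits $x_i \in \{0,1\}$, soft outputs $\hat{p}_i \in (0,1)$, and hard decisions $\hat{x}_i = \mathbb{1}\{\hat{p}_i > 1/2\}$,

```latex
\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[x_i \log \hat{p}_i + (1 - x_i)\log(1 - \hat{p}_i)\Bigr],
\qquad
\mathrm{BER} = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\{\hat{x}_i \neq x_i\}\right].
```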
Additional comments from the reviewer discussion
The authors claimed in the Introduction, as well as in their rebuttal, that their proposal is applicable beyond the design of error-correcting codes, to structure learning of general Bayesian networks. While it may be true that the idea of introducing a relaxation of the network connectivity could in principle be applied to general structure-learning problems, its usefulness has not been explored in this paper at all.
In spite of the comments of some reviewers, I do not think that the contents of this paper fall outside the scope of the conference. Rather, it is mainly the quality of the presentation that is judged not to be at the level of the conference. The authors have not been successful in convincing the reviewers of the significance of their potential contribution via their rebuttal (although the authors might have different views), and in my view this in turn signifies the importance of writing good-quality text in order to properly communicate the contents to the readers.
Reject