PaperHub
Overall: 8.2/10 · Poster · 4 reviewers
Ratings: 6, 5, 4, 5 (min 4, max 6, std. dev. 0.7)
Confidence: 3.5 · Novelty: 3.3 · Quality: 3.3 · Clarity: 3.3 · Significance: 3.8
NeurIPS 2025

CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization

OpenReview · PDF
Submitted: 2025-05-02 · Updated: 2025-10-29
TL;DR

We propose CATransformer, the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators.

Abstract

Keywords
Carbon Footprint, Sustainability, Hardware Accelerator, Architecture Search, Neural Architecture Search, Optimization

Reviews and Discussion

Review (Rating: 6)

The paper introduces CATransformers, a carbon-aware co-optimization framework that jointly searches Transformer architectures and custom accelerator designs to minimize total lifecycle carbon emissions (operational + embodied). The method couples (i) a multi-objective Bayesian optimizer (qNEHVI on Ax/BoTorch) balancing accuracy, latency, energy, and carbon, (ii) a lightweight prune-and-fine-tune evaluator that supplies high-rank-correlated proxy accuracies, and (iii) a hardware estimator that integrates Accelergy, Sunstone, ACT, and Electricity Maps to quantify area, energy, latency, and carbon for candidate accelerators early in design space exploration. Experiments span BERT-Base, Llama-3 (8B), ViT-B/16, and multi-modal CLIP variants. With a 20 TOPS edge budget, carbon-optimized solutions cut total emissions by up to 30% over latency-optimized and 8% over energy-optimized baselines, at comparable accuracy. A focused case study yields the CarbonCLIP family, which achieves up to 17% lower carbon than TinyCLIP while matching or improving accuracy/latency. The framework is to be open-sourced.
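For readers unfamiliar with this optimizer setup, here is a minimal sketch of how a multi-objective search of this kind can be configured with Ax's Service API (whose default multi-objective strategy uses qNEHVI). The parameter space, metric names, and evaluation function are illustrative placeholders, not the paper's actual search space or estimators.

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="carbon_aware_codesign_demo",
    parameters=[
        {"name": "num_layers", "type": "range", "bounds": [2, 12]},
        {"name": "pe_array_dim", "type": "range", "bounds": [8, 64]},
    ],
    objectives={
        "accuracy": ObjectiveProperties(minimize=False),
        "total_carbon": ObjectiveProperties(minimize=True),
    },
)

def evaluate(params):
    # Placeholder standing in for the model evaluator and hardware
    # estimator; returns (mean, SEM) per metric as Ax expects.
    acc = 0.9 - 0.01 * (12 - params["num_layers"])
    carbon = 0.1 * params["num_layers"] + 0.02 * params["pe_array_dim"]
    return {"accuracy": (acc, 0.0), "total_carbon": (carbon, 0.0)}

for _ in range(10):
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=evaluate(params))

print(ax_client.get_pareto_optimal_parameters())
```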

Strengths and Weaknesses

Strengths:

  1. Introduces a novel, holistic framework that co-optimizes both ML models and hardware accelerators to directly minimize total lifecycle carbon emissions (operational + embodied). The combination of Bayesian NAS with a joint embodied + operational carbon objective is novel, extending prior work (e.g., CE-NAS) from model-only to the full stack. Multi-modal CLIP optimization under carbon constraints is new.
  2. First study to treat carbon as a first-class search objective across model and hardware; it could influence sustainable ML practice on edge devices. The CarbonCLIP models are likely reusable by practitioners focusing on low-footprint deployments.
  3. The authors cross-check latency/energy estimates against real GPUs (≤9% error), show a Spearman correlation of 0.98 between proxy and full accuracy, and report low hypervolume variance (<3.5%) across search runs; this is evidence that both the estimator and optimizer are reliable.
  4. Empirical results across diverse architectures (BERT, Llama-3, ViT, CLIP) demonstrate up to 30% emission cuts over latency-optimized baselines, with the CarbonCLIP family delivering ~17% lower carbon while maintaining accuracy.
  5. Clear visualizations (Pareto fronts, iso-accuracy plots) effectively communicate trade-offs and design insights.

Weaknesses:

  1. Restrictive carbon assumptions (a fixed 1 inf/s duty cycle, 3-year device lifetime, California grid intensity, and Taiwan 22 nm fabrication) limit generalizability; alternative scenarios are treated only briefly in the appendix.
  2. Estimator details (e.g., specific Accelergy/Sunstone knobs, ACT parameters) are not fully documented, which may hinder independent reproduction.
  3. The framework targets edge-inference accelerators; training-time carbon and data-center/cloud inference remain unaddressed.
  4. Several tables are hard to parse, some abbreviations go undefined, and the embodied-carbon scaling equations (fundamental to the paper) are omitted from the main text.
  5. There are several works [1-5] that address sustainability-related issues in AI and estimate CO2 footprints. Listing them would benefit the paper and give a more holistic picture of how researchers have tackled this problem before.

[1] Bouza, L., Bugeau, A. and Lannelongue, L., 2023. How to estimate carbon footprint when training deep learning models? A guide and review. Environmental Research Communications, 5(11), p.115014.
[2] Luccioni, S., Jernite, Y. and Strubell, E., 2024. Power hungry processing: Watts driving the cost of AI deployment? In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (pp. 85-99).
[3] Luccioni, S., Gamazaychikov, B., Hooker, S., Pierrard, R., Strubell, E., Jernite, Y. and Wu, C.J., 2024. Light bulbs have energy ratings—so why can't AI chatbots? Nature, 632(8026), pp. 736-738.
[4] Gowda, S.N., Hao, X., Li, G., Gowda, S.N., Jin, X. and Sevilla-Lara, L., 2025. Watt for what: Rethinking deep learning's energy-performance relationship. In European Conference on Computer Vision (pp. 388-405). Springer, Cham.
[5] Strubell, E., Ganesh, A. and McCallum, A., 2020. Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 09, pp. 13693-13696).

Questions

  1. Results assume 1 inference/s for 3 years. How would conclusions change for higher-throughput mobile scenarios (e.g., 10 inf/s) or shorter device lifetimes? A sensitivity plot in the main paper would strengthen applicability.
  2. Appendix K touches on regional grids, but design choices might invert (energy vs. latency) under a zero-carbon grid. Can the optimizer ingest a distribution over grid intensities and return robust designs? Clarify in 3.3/4.5.
  3. You plan to open-source, but key configs (e.g., Cacti SRAM models, yield factors) are unspecified. Please list exact tool versions and parameter files to enable artifact evaluation.
  4. The evaluator fine-tunes each candidate for one epoch on MS-COCO/GLUE etc. What is the wall-clock overhead per candidate, and is it amortized across search iterations? A brief cost breakdown would help readers gauge scalability.
  5. Carbon is central, but other sustainability axes (e-waste, water, mineral use) may correlate with chip area. Could CATransformers be extended to multi-impact objectives? A short discussion in the conclusion would increase societal relevance.

Scores could rise with clearer reporting of (1)–(3) and an expanded robustness study for different duty cycles / grids.

Limitations

The authors acknowledge the lack of GPU/CNN support, the focus on edge inference, and the fixed carbon-intensity/lifetime assumptions. I suggest foregrounding these in Sec. 5 and adding an explicit paragraph on potential negative societal impacts beyond carbon (privacy, resource inequity).

Final Justification

The authors have responded to my queries and I have no significant concern remaining.

Formatting Concerns

No formatting concerns.

Author Response

We thank the reviewer for the detailed review of our work and thoughtful suggestions. Below, we answer the questions and comments in order.

Question 1, Weakness 1: Tuning lifetime and regional parameters

We fully agree that factors such as grid carbon intensity, device lifetime, and inference frequency significantly impact the total carbon footprint. To support diverse deployment scenarios, CATransformer exposes all of these parameters as configurable inputs. In addition, hardware design parameters, such as process technology, are also fully configurable.

These factors primarily influence the balance between operational and embodied carbon. For example, high inference frequency, long device lifetime, or deployment in regions with high grid carbon intensity increases operational energy consumption, making operational carbon the dominant contributor. Conversely, in scenarios with low inference frequency, short device lifetime, or low grid carbon intensity, embodied carbon plays a more significant role.
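To make this decomposition concrete, here is a minimal sketch of how the operational share of total carbon scales with these deployment parameters, following the operational-plus-embodied split described above. The function and all constants in the usage lines are illustrative assumptions, not the paper's estimator or its values.

```python
def operational_share(energy_per_inf_j, inf_per_sec, lifetime_years,
                      grid_gco2_per_kwh, embodied_kgco2):
    """Fraction of lifetime carbon that is operational (sketch)."""
    seconds = lifetime_years * 365 * 24 * 3600
    kwh = energy_per_inf_j * inf_per_sec * seconds / 3.6e6  # J -> kWh
    operational_kg = kwh * grid_gco2_per_kwh / 1000.0       # g -> kg
    return operational_kg / (operational_kg + embodied_kgco2)

# Raising the duty cycle shifts the balance toward operational carbon:
print(operational_share(0.5, 1, 3, 224, 5.0))   # ~0.37 at 1 inf/s
print(operational_share(0.5, 10, 3, 224, 5.0))  # ~0.85 at 10 inf/s
```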

We include a sensitivity analysis on inference frequency. The table below reports operational carbon as a percentage of the total carbon footprint across CarbonCLIP model variants. Increasing the inference rate from 1 to 10 inferences per second raises the operational carbon share by roughly 50 percentage points on average.

Ratio of operational carbon (%) by inference rate:

| Model | 1 inf/s | 3 inf/s | 5 inf/s | 10 inf/s |
| --- | --- | --- | --- | --- |
| CarbonCLIP-XS | 18 | 40 | 53 | 69 |
| CarbonCLIP-S | 27 | 52 | 65 | 78 |
| CarbonCLIP-M | 32 | 59 | 71 | 83 |
| CarbonCLIP-L | 36 | 63 | 74 | 85 |
| CarbonCLIP-XL | 44 | 71 | 80 | 89 |

Appendix K complements this analysis by demonstrating how external factors like grid carbon intensity influence whether CATransformer prioritizes reducing operational energy or minimizing hardware footprint. Importantly, the trends observed with varying grid intensity generalize to changes in inference frequency and device lifetime, as all induce similar shifts in the operational versus embodied carbon tradeoff. This confirms that CATransformer adapts its optimization strategy consistently across deployment scenarios—whether in high-throughput mobile environments or in systems with shorter lifespans.

We will clarify these insights in the final version of the paper to better highlight CATransformer’s flexibility and its nuanced handling of carbon-aware optimization.

Question 2: Zero-carbon regions and varying grid intensities

We thank the reviewer for important questions regarding carbon assumptions. On a zero-carbon grid, operational carbon becomes negligible, shifting optimization entirely to minimizing embodied carbon, potentially favoring smaller hardware. In such cases, the pruned submodel size and architecture are likely determined solely by latency constraints, since operational energy (i.e., the energy required to run the model) is no longer a factor. The goal then becomes to run the largest possible model (to maximize accuracy) on the smallest hardware that still meets the latency requirement. While CATransformer currently uses a single grid intensity, it can be readily extended to ingest a distribution of intensities, enabling robust designs for varying regional or fluctuating grid conditions. These clarifications will be added to sections 3.3 and 4.5.

Question 3, Weakness 2: Details of estimators and open source tools used

We agree with the reviewer that providing exact parameters and configurations for each toolchain is essential for reproducibility. We have submitted the full codebase as supplementary material, including all configuration files, tool versions, and experimental parameters. These details will also be made publicly available upon open-sourcing the code. Below is a summary of the key tools and configurations used:

  • CACTI: version 7, using 22 nm process technology node configurations from commit 1ffd8d
  • Accelergy: version 0.4
  • ACT: uses an IC yield factor of 0.875 and accounts for the carbon footprint of manufacturing gases per unit area, assuming 95% abatement
  • Carbon intensity: based on 2024 data from Electricity Maps; exact values are specified in the configuration files:
    • California: 224 gCO2eq/kWh
    • Taiwan: 524 gCO2eq/kWh

We will ensure that all configuration files and tool references are clearly documented and accessible in the final open-source release to support artifact evaluation and reproducibility.

Question 4: Cost of optimization process

Runtime per candidate varies with model size and available compute. In our experiments, optimization was conducted on a single node with 8 V100-SXM2-16GB GPUs. Under this setup, the full optimization takes ~5 hours for BERT-Base and up to 20 hours for CLIP, each with 100 trials. This translates to 3–12 minutes per trial, including ~2 minutes for hardware estimations. Fine-tuning adds another 1–10 minutes per candidate, depending on model size. With newer or more powerful hardware, per-trial costs would decrease proportionally. To reduce overhead, we implement result caching, avoiding redundant retraining of previously evaluated candidates across hardware settings and amortizing cost over search iterations.
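The caching described here can be pictured as simple memoization keyed on the candidate's model configuration. The sketch below is an illustrative assumption, not the framework's actual implementation; the function name and scoring formula are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # memoize by hashable candidate configuration
def proxy_accuracy(num_layers: int, hidden_dim: int) -> float:
    # Placeholder for the one-epoch prune-and-fine-tune evaluation;
    # repeated candidates across hardware settings hit the cache and
    # skip redundant fine-tuning.
    return 0.9 - 0.005 * (12 - num_layers) - 0.0001 * (768 - hidden_dim)

proxy_accuracy(6, 512)  # "fine-tunes" (simulated) on first call
proxy_accuracy(6, 512)  # served from the cache on subsequent calls
```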

Question 5: Extension to other sustainability axes

While CATransformer currently focuses on carbon footprint, we agree that sustainability extends to factors like electronic waste, water usage, and rare mineral consumption. Our optimizations, which often reduce chip area, can implicitly benefit these metrics. In principle, CATransformer could support multi-objective optimization over broader environmental impacts. The main challenge lies in the current lack of standardized, fine-grained data for these additional sustainability metrics. We will highlight these considerations in the conclusion to underscore the broader societal relevance and need for holistic, sustainable hardware design.

Weakness 3: Extensibility to training and datacenter accelerators

This work focuses on making model inference carbon-efficient through hardware-model co-design. In aggregate, the carbon footprint of inference is significant due to its much wider usage [1]. We also agree with the reviewer that large-scale pretraining of foundation models can result in substantial carbon emissions. As our GPU-based results in Appendices L and M show, our proposed design methodology can scale to datacenter-level accelerators. CATransformer inherently supports training by modeling forward and backward pass latency and energy at the operator level. It can be extended to incorporate parallelization strategies and training system design, enabling optimization of carbon efficiency in distributed training.

[1] Luccioni, S., Jernite, Y. and Strubell, E., 2024, June. Power hungry processing: Watts driving the cost of ai deployment?. In Proceedings of the 2024 ACM conference on fairness, accountability, and transparency (pp. 85-99).

Weakness 4: Formatting improvements for tables and abbreviations

In the final version of the paper, we will enhance the presentation of tables and ensure that all abbreviations and equations are clearly defined.

Weakness 5: Related works in Sustainability in AI

We thank the reviewer for highlighting these relevant works on AI sustainability. Broadly, they address the topic through three main approaches: (1) quantifying the carbon footprint of training: [1] and [5] estimate the emissions and energy use of training deep learning models or evaluate tools for doing so; (2) analyzing deployment energy use and emissions: [2] and [3] focus on energy consumption during inference and advocate for standardized environmental assessment frameworks; and (3) exploring energy-performance trade-offs: [4] explores the balance between model accuracy and electricity use, proposing metrics to encourage more energy-efficient architectures. We will incorporate a discussion of these works in Section 2 of the final version.

Limitations

We agree that broader societal impacts, such as privacy concerns in edge deployment and potential resource inequities from hardware specialization, are important considerations. We will include a discussion of these challenges, alongside the limitations of fixed carbon-intensity and lifetime assumptions, to provide a more balanced perspective on the societal implications of our work.

Comment

Thanks for the detailed response. I have raised my rating to the maximum possible. I believe this paper has great potential in a field that is constantly focusing on scale. I hope the paper gets the attention it deserves. Please do add the related papers as I feel that may help people gauge other ways of tackling the same problem and that this problem would receive much greater attention than it is currently.

Comment

Thank you for the kind words and support. We’ll make sure to include the related work as suggested to better frame the problem and highlight its broader relevance.

Review (Rating: 5)

The authors introduce CATransformers, a carbon-aware search framework for sustainability-driven co-optimization of ML models and hardware architectures. They tackle minimizing total carbon emissions, both embodied and operational, by jointly optimizing the model and hardware accelerator designs. Their approach includes three main parts: 1) a multi-objective Bayesian optimizer that maximizes accuracy while minimizing latency, energy, and total carbon (embodied + operational) by exploring different model and hardware parameters; 2) an ML model evaluator that estimates the accuracy of candidate model architectures during the search; and 3) a hardware estimator that provides unified, end-to-end analysis of inference latency and total carbon footprint for each model–hardware configuration. The final output is a sustainable model and hardware configuration.

The authors evaluated their proposed approach on various models like BERT, Llama, etc. They show their approach can reduce carbon emissions by up to 30% while maintaining accuracy and latency. They also showed that their optimized CLIP model outperformed models like TinyCLIP and baseline CLIP in accuracy and total carbon emissions.

Strengths and Weaknesses

Strengths:

  1. Comprehensive experiments that show the approach performs well in various scenarios
  2. The paper is well written, organized, and informative, with key takeaways for each experiment
  3. The authors also consider embodied carbon as a component of total emissions, which is a critical parameter
  4. Novel approach of jointly optimizing the model and hardware architecture from the ground up when the objective is carbon reduction, which shows that such optimization should start at the early design stages

Weaknesses:

  1. The current approach is tailored to edge devices and inference, whereas carbon optimization in large-scale clusters (e.g., HPC clusters) could have a greater impact
  2. There is no clear discussion of the challenges of extending the framework from transformer-based models to other architectures
  3. While the Pareto frontiers are informative, they could benefit from more intuitive interpretation; e.g., how practitioners should choose between latency and carbon optimization in real deployments remains a bit underspecified

Questions

  1. Can you please clarify what the challenges of using other model architectures, such as CNNs, could be?
  2. Since there is a trade-off between latency and carbon consumption, do you have an approach to help users select the configuration tailored to their needs?
  3. Is it possible to plug CATransformer into PyTorch? What are the challenges?

Limitations

yes

Final Justification

My questions and concerns were addressed properly.

Formatting Concerns

N/A

Author Response

We thank the reviewer for their thoughtful feedback and suggestions. Below, we answer the questions in order:

Weakness 1: Extensibility to training and datacenter accelerators

This work focuses on making model inference carbon-efficient through hardware-model co-design. In aggregate, the carbon footprint of inference is significant due to its much wider usage [1]. We also agree with the reviewer that large-scale pretraining of foundation models can result in substantial carbon emissions. As our GPU-based results in Appendices L and M show, our proposed design methodology can scale to datacenter-level accelerators. CATransformer inherently supports training by modeling forward and backward pass latency and energy at the operator level. It can be extended to incorporate parallelization strategies and training system design, enabling optimization of carbon efficiency in distributed training.

[1] Luccioni, S., Jernite, Y. and Strubell, E., 2024, June. Power hungry processing: Watts driving the cost of ai deployment?. In Proceedings of the 2024 ACM conference on fairness, accountability, and transparency (pp. 85-99).

Weakness 2, Question 1: Generalizability to non-Transformer architectures

CATransformer uses Torch.fx to extract operator graphs and estimate latency and energy, making it extensible to any model representable by Torch.fx, including CNNs like ResNet. However, Torch.fx cannot trace dynamic control flow, such as if statements or data-dependent loops, limiting support for models like state-space architectures (e.g., Mamba) that involve input-dependent state transitions. In such cases, the model must be modified or constrained to make the control flow more static for Torch.fx to function properly.
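As a concrete illustration of this tracing step, the following minimal sketch extracts an operator graph from a small module with torch.fx; the module is a stand-in, not one of the paper's candidate submodels.

```python
import torch.fx
from torch import nn

# Stand-in for a candidate submodel; any statically traceable module works.
block = nn.Sequential(
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
)

# Symbolically trace the module into a static operator graph.
graph_module = torch.fx.symbolic_trace(block)

# Each node is an operator whose latency and energy a hardware estimator
# could look up or profile for a candidate accelerator.
for node in graph_module.graph.nodes:
    print(node.op, node.target)
```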

Extending to other model types also requires latency and energy estimations for their unique operators, such as custom kernels or non-standard layers not found in Transformers. These must first be profiled or estimated for runtime and energy consumption on the target hardware to enable optimization within our framework. If such toolchains become available, CATransformer can integrate them with minimal changes. Additionally, different models may require customized pruning methods. Attention head pruning, for instance, does not apply to CNNs and Mamba-style architectures.

We will clarify these limitations and generalization challenges in the paper.

Weakness 3, Question 2: Choosing the optimal configuration

We agree that clearer guidance on interpreting the Pareto frontiers would enhance the practical relevance of our work. Choosing an optimal configuration requires balancing carbon emissions and latency based on deployment constraints. For instance, latency-critical applications (e.g., medical imaging, autonomous vehicles) may prioritize low-latency points, while non-critical tasks (e.g., captioning, search) may favor carbon savings. Interactive applications like augmented reality may benefit from a middle ground. Additionally, the trade-off between operational and embodied carbon depends on context. For example, regions with high grid intensity may prefer configurations that reduce operational energy. We will clarify these considerations in the final version.

Question 3: Challenges of plugging CATransformer into PyTorch

While components of CATransformer, such as the optimized Transformer models, can be used within PyTorch due to its open-source foundation and use of HuggingFace models, fully integrating the entire framework into PyTorch is challenging. CATransformer relies on external tools such as CACTI, Accelergy, Phaze, and ACT for detailed hardware and carbon analysis. These capabilities are not supported natively by PyTorch.

Comment

Thank you, authors, for your response. Regarding the optimal configuration, at what level do you intend to do the abstraction for the user? At what level is the user involved in determining their desired objective? Does the user have to explicitly state "I care X% about being carbon-aware and Y% about latency", or do you have a more abstract version in mind? I ask because it is critical for a system to be as abstract as possible from the user's perspective and not be a burden on the user side.

Comment

We thank the reviewer for the follow-up question. Currently, CATransformer uses two user-configurable parameters to guide the optimization process. The first is a latency constraint, which is especially important for latency-sensitive tasks. This ensures that the optimization only outputs configurations that satisfy the specified latency target, while minimizing carbon emissions within that constraint. The second set of parameters includes accuracy and carbon thresholds. These allow users to specify target thresholds, and the optimization then searches for configurations that meet or come close to those targets, minimizing carbon while maximizing accuracy.

These are the only knobs the user needs to adjust, making it easy to adapt the system to different workloads or use cases. This approach balances user control and abstraction, allowing users to express their preferences in terms of latency, carbon, and accuracy without needing to manage low-level details.
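As an illustration of how compact this interface is, a search configuration could look roughly like the sketch below; the parameter names and values are hypothetical, not CATransformer's actual API.

```python
# Hypothetical user-facing knobs mirroring the two kinds of parameters
# described above; names and values are illustrative assumptions.
search_config = {
    "latency_constraint_ms": 15.0,   # hard constraint: reject slower designs
    "accuracy_threshold": 0.68,      # soft target to meet or approach
    "carbon_threshold_kgco2e": 5.0,  # soft target to meet or approach
}
```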

Comment

Thank you for the response. I'll keep my ratings the same

Review (Rating: 4)

This paper introduces CATransformers, a novel co-design framework that jointly optimizes Transformer-based model architectures and hardware accelerators with carbon efficiency as a primary objective. It addresses the growing carbon footprint of AI systems by incorporating both operational and embodied carbon into the early-stage design space exploration. The framework integrates a multi-objective Bayesian optimizer, a structured pruning-based model evaluator with proxy fine-tuning, and a hardware estimator that quantifies carbon, latency, and energy. Experiments across language, vision, and multi-modal models demonstrate CATransformers’ ability to reduce total carbon emissions without sacrificing accuracy or latency, highlighting its effectiveness through the design of CarbonCLIP, a family of low-carbon CLIP variants.

Strengths and Weaknesses

Strengths:

  1. This paper is the first to systematically incorporate carbon emissions as a primary optimization objective in the co-design of Transformer architectures and hardware accelerators.

  2. The proposed CATransformers is designed to be reproducible and adaptable across a wide range of Transformer models.

  3. The paper systematically analyzes trade-offs across multiple objectives, model scales, and hardware configurations.

Weaknesses:

  1. Some components in the method section are described rather compactly, which may hinder reproducibility without the code release.

  2. The current framework exclusively targets inference optimization. It does not account for training-phase carbon costs, especially from large-scale pretraining of foundation models, which may dominate the total carbon footprint. This limits the general applicability of the framework.

  3. While the method aims to reduce model carbon impact, the carbon cost of the search process itself is not measured or discussed.

Questions

  1. The current framework focuses solely on inference-time optimization — does this neglect the potentially much larger carbon cost of training, especially for large-scale pretraining?

  2. While the method reduces model carbon footprint, could the carbon cost of the search process itself be substantial and offset the benefits?

Limitations

While the paper does offer some discussion of its limitations, I would suggest that it could be further strengthened by considering the following aspects.

  1. The authors should discuss how to balance the potential sustainability challenges introduced by custom accelerators, such as increased electronic waste and maintenance costs due to hardware heterogeneity.

  2. The authors could briefly discuss the risks of proxy-based estimations deviating from real-world outcomes.

Final Justification

The authors provide more details and make more comparisons to refine the article. This article will contribute to the research community working on carbon and the environment.

Formatting Concerns

No major formatting issues observed. The paper follows the NeurIPS 2025 template well.

Author Response

We thank the reviewer for their thoughtful feedback and suggestions. Below, we answer the questions in order:

Weakness 1: Details of the methods and key components

We will open-source the codebase upon acceptance and include detailed tutorials and examples to support reproducibility. In the final version of the paper, we will also expand the method section to provide additional implementation details.

Weakness 2, Question 1: Extensibility to training and datacenter accelerators

This work focuses on making model inference carbon-efficient through hardware-model co-design. In aggregate, the carbon footprint of inference is significant due to its much wider usage [1]. We also agree with the reviewer that large-scale pretraining of foundation models can result in substantial carbon emissions. As our GPU-based results in Appendices L and M show, our proposed design methodology can scale to datacenter-level accelerators. CATransformer inherently supports training by modeling forward and backward pass latency and energy at the operator level. It can be extended to incorporate parallelization strategies and training system design, enabling optimization of carbon efficiency in distributed training.

[1] Luccioni, S., Jernite, Y. and Strubell, E., 2024, June. Power hungry processing: Watts driving the cost of ai deployment?. In Proceedings of the 2024 ACM conference on fairness, accountability, and transparency (pp. 85-99).

Weakness 3, Question 2: Carbon Footprint of the optimization process

We quantify CATransformer's carbon footprint using CodeCarbon [1]. On average, the optimization process takes 5 hours for BERT-Base and up to 20 hours for CLIP. In the most resource-intensive case (CLIP), 100 optimization trials emit approximately 57 kgCO2e, while final model training emits 454 kgCO2e per model. This means the optimization process costs roughly 1/13th the carbon budget of training the final model. The CarbonCLIP models are trained for 2 epochs on the MetaCLIP dataset, roughly 40% of the training steps used in prior work (MetaCLIP, TinyCLIP, CLIP). Despite the one-time cost of optimization, CATransformer achieves overall efficiency gains through reduced training steps post-pruning, along with inference gains that scale with the number of devices.

[1] Benoit Courty, Victor Schmidt, Sasha Luccioni, Goyal-Kamal, MarionCoutarel, Boris Feld, Jérémy Lecourt, LiamConnell, Amine Saboni, Inimaz, supatomic, Mathilde Léval, Luis Blanche, Alexis Cruveiller, ouminasara, Franklin Zhao, Aditya Joshi, Alexis Bogroff, Hugues de Lavoreille, Niko Laskaris, Edoardo Abati, Douglas Blank, Ziyao Wang, Armin Catovic, Marc Alencon, Michał Stęchły, Christian Bauer, Lucas Otávio N. de Araújo, JPW, and MinervaBooks. mlco2/codecarbon: v2.4.1, May 2024.
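For reference, measuring a workload with CodeCarbon follows the pattern sketched below; the tracked function is a placeholder for the actual search, and the project name is illustrative.

```python
from codecarbon import EmissionsTracker

def run_optimization_trials():
    pass  # placeholder for the 100-trial search described above

tracker = EmissionsTracker(project_name="catransformers_search")
tracker.start()
run_optimization_trials()
emissions_kg = tracker.stop()  # estimated emissions in kgCO2e
print(f"Estimated emissions: {emissions_kg:.2f} kgCO2e")
```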

Extended Limitation Discussion

We agree that sustainability trade-offs from custom accelerators, such as increased electronic waste and maintenance complexity due to hardware heterogeneity, are important concerns. In the final version, we will discuss potential mitigations, including reusing existing hardware and co-optimizing models with commercially available accelerators.

We also acknowledge the limitations of proxy-based accuracy estimation. While such proxies enable scalable design space exploration, they can deviate from real-world outcomes due to overfitting to short fine-tuning runs or underestimating pruning’s impact on downstream performance. These approximations may miss interactions between model structure, data, and training dynamics. In future work, we aim to develop more reliable alternatives that improve accuracy estimation without sacrificing scalability.

Comment

Thank you for your thoughtful and detailed rebuttal. I appreciate the effort you've put into clarifying the aspects of code availability, optimization cost measurement, and the discussion of limitations. These responses have significantly improved my understanding of the work, particularly regarding reproducibility and environmental impact considerations. I would kindly suggest including a more in-depth discussion on how the framework could potentially extend to training-phase optimizations, as well as incorporating the extended limitation analysis into the final version of the paper. Addressing these points would further enhance the paper’s clarity, completeness, and broader applicability. Overall, I find the work to be promising and believe it is well on track for publication with these minor refinements. I will finally maintain my ratings.

Comment

Thank you for the thoughtful feedback and encouraging assessment. We will incorporate your suggestions on training-phase extensions and expanded limitation analysis in the final version to further strengthen the paper.

Review (Rating: 5)

The paper proposes a model-architecture and hardware co-design framework for jointly optimizing a priority metric such as carbon footprint (both embodied and operational), followed by latency, energy, and accuracy. A new open-source hardware evaluation toolchain for design space exploration is also showcased. The framework is evaluated across multiple directions, with joint hardware and model architecture improvement and ablations with the hardware or model architecture fixed. The work provides a basis for important decision-making in choosing co-optimized hardware and model architectures for sustainability-focused end-users. Co-optimized model–hardware configurations reduce total carbon by 30% over latency-optimized and 8% over energy-optimized baselines.

Strengths and Weaknesses

Strengths:

  • Good paper framing the groundwork for identifying design trade-offs for large models
  • Good use of lightweight fine-tuning as a proxy for accuracy during Bayesian optimization
  • The open-source toolchain is really useful for holistic design space exploration across latency, energy, and total carbon emissions
  • It shows a fast, high-volume evaluation rate (100M/100iters)
  • Pruned models require less training budget than full pretraining (60% reduction)
  • The ablations and the corresponding takeaways also give great insights into model architecture and hardware design choices
  • The work highlights complex trade-offs that were not considered in earlier works

Weaknesses:

  • None that I can identify

Questions

  • Based on the discussion in Section 3.1: does this mean that the carbon footprint of the pretrained model can't be optimized, since we start from it and run the pruning? Or is the suggestion that if we pretrain the pruned model from scratch, we will recover the same accuracy? (This is in relation to the discussion in the last line of Section 3.5.)

Limitations

None that I could identify, since I am not an expert in the co-design area.

Final Justification

I would retain my current score.

Formatting Concerns

None. Paper is well written

Author Response

We thank the reviewer for their thoughtful feedback and for recognizing the significance and quality of our work, noting that the paper “provides a basis for important decision-making…for sustainability-focused end-users” and “highlights the complex trade-offs that were not previously considered in earlier works.”

Question 1: Optimizing pretrained models

CATransformer takes a pretrained model as input and identifies pruned submodels and hardware architecture configurations that together minimize the total carbon footprint. We show that post-pruning training of the selected submodel can achieve accuracy comparable to the original model while significantly reducing carbon emissions (Section 4.5, Table 3). Alternatively, given a fixed pretrained model, CATransformer can perform a hardware architecture search to identify the configuration that minimizes carbon footprint under specified latency constraints (Section 4.3, Table 2).

Comment

Dear Reviewers, given the authors' response, if you have not done so, please raise any remaining questions and/or concerns in a timely fashion so the authors have a chance to reply.

I remind you that Reviewers must participate in discussions with authors before submitting the "Mandatory Acknowledgement".

Thank you for your work.

Final Decision

The authors introduce CATransformers, a unified optimization pipeline for carbon-efficient Transformer-based models and hardware accelerators. The approach uses a base pretrained Transformer to generate pruned variants and a hardware template. The algorithm uses lightweight fine-tuning of pruned models to ensure satisfactory accuracy, as untrained pruned models perform poorly. Since the objective cannot be optimized easily, CATransformers uses multi-objective Bayesian optimization to maximize accuracy while minimizing, in particular, total carbon emissions. With this contribution, the authors propose an end-to-end analysis of the total carbon footprint of the explored models and hardware configurations.

Reviewers were convinced by several key features of the proposed approach, in particular the use of carbon emissions as a primary optimization objective for Transformer-based models to foster sustainable ML practice, and the extensive experimental analysis across model objectives, scales, and hardware configurations. The experiments are well described and demonstrate significant emission cuts in various frameworks. Key details were provided during the discussion with reviewers, in particular on the supplementary material to ensure reproducibility. The authors also clarified all the parameters that can be used as configurable inputs to support deployment in many settings. The authors are encouraged to clarify the generalizability to architectures that are not transformer-based and to detail the challenges of extending the optimization to other models. They should also discuss possible limitations with respect to global deployment of sustainable ML models.

Overall, the rebuttal highlighted that the paper has a great potential and can be of interest for a large ML audience.