Efficiency Pentathlon: A Standardized Benchmark for Efficiency Evaluation
We propose a standardized platform for inference efficiency evaluation.
Abstract
Reviews and Discussion
Evaluating model efficiency is an important aspect of assessment for practical applications. The paper argues that existing metrics, such as FLOPs, often do not reflect the advantages of models in real-world applications. It therefore proposes Efficiency Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency, built around five elements: a standardized hardware environment, four distinct evaluation scenarios, diverse metrics for comprehensive efficiency evaluation, an evaluation software library, and flexible evaluations. The established benchmark contains three NLP tasks and the corresponding datasets (WMT14 DE-EN, GSM8K, RAFT). The evaluation results show that the proposed Pentathlon can drastically reduce the workload needed to make fair and reproducible efficiency comparisons.
Strengths
- Efficiency evaluation is very important for model evaluation and has seldom been addressed before.
- The proposed five evaluation aspects are interesting and novel.
Weaknesses
- The whole paper is not very clear. In Section 2, the rationale for considering the proposed five aspects for efficiency evaluation is not described clearly.
- The experimental part is not very sufficient. Only two tasks are selected, which makes the results not very convincing.
Questions
- Table 1 is not very clear. What do the abbreviations in Table 1 mean, such as Acc., TP., Latency, Mem., etc.? The authors should describe them in the table caption.
- Why are only three kinds of NLP tasks selected? I am concerned about what the results would be on other NLP tasks.
- Why can this benchmark be trusted and applied when evaluating real applications? The authors should demonstrate the advantages of the proposed benchmark and platform.
We thank the reviewer for their feedback. We are glad that the reviewer finds efficiency evaluation important and some of our design choices interesting and novel.
The whole paper is not very clear. In Section 2, the rationale for considering the proposed five aspects for efficiency evaluation is not described clearly.
We believe that using multiple metrics to evaluate a model's efficiency is crucial for a comprehensive understanding of its performance and suitability for various applications. It also helps characterize the trade-offs among these aspects and supports better decisions when deploying models. Each metric provides a different perspective (a minimal measurement sketch is included after this list for illustration):
- Latency is crucial for real-time applications, such as interactive systems like chatbots and real-time translation, where quick responses are critical to the user experience.
- Throughput is important in scenarios where the model needs to process large volumes of data, e.g., batched processing of many scientific papers or news articles.
- Memory overhead can reveal whether a model might not be suitable for deployment on devices with limited memory, such as mobile phones or IoT devices.
- Energy consumption is important for justifying the model’s suitability for battery-powered devices or in situations where energy efficiency is a priority.
- A model's size affects its portability and the speed of its deployment. Smaller models are easier to distribute and can be deployed on devices with limited storage capacity.
We will highlight these motivations in the revision.
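For illustration, below is a minimal sketch of how these metrics might be measured around a generic inference call. It is not the Pentathlon implementation: `run_model` and `batches` are hypothetical stand-ins, and the memory figure covers only the Python heap (GPU memory would need a framework-specific query such as `torch.cuda.max_memory_allocated()`).

```python
# Illustrative sketch only -- not the Pentathlon implementation.
# `run_model` is a hypothetical stand-in for any inference function that
# processes one batch; `batches` is an iterable of input batches.
import time
import tracemalloc

def measure(run_model, batches):
    """Measure per-batch latency, overall throughput, and peak Python-heap memory."""
    latencies = []
    n_items = 0
    tracemalloc.start()
    start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        run_model(batch)                          # the inference call under test
        latencies.append(time.perf_counter() - t0)
        n_items += len(batch)
    total_time = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_latency_ms": 1000 * sum(latencies) / len(latencies),
        "throughput_items_per_s": n_items / total_time,
        "peak_mem_mb": peak_bytes / 2**20,        # Python heap only
    }
```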
The experimental part is not very sufficient. Only two tasks are selected, which makes the results not very convincing.
Why are only three kinds of NLP tasks selected? I am concerned about what the results would be on other NLP tasks.
Our decision to experiment with three tasks (rather than two) aims to demonstrate Pentathlon's versatility in supporting a variety of architectures and settings. Please refer to the general response for further details.
Table 1 is not very clear. What do the abbreviations in Table 1 mean, such as Acc., TP., Latency, Mem., etc.? The authors should describe them in the table caption.
We chose to use abbreviations in the interest of space; they are described in the caption of Table 1. Acc.: accuracy; TP.: throughput; Mem.: memory overhead.
This paper presents Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. The benchmark offers a strictly controlled hardware platform and incorporates metrics to measure efficiency, including latency, throughput, memory overhead, number of parameters, and energy consumption. The authors also provide a software library that can seamlessly integrate into any codebase and enable evaluation.
Strengths
- The paper introduces a standardized and centralized evaluation platform, which can reduce the workload to make fair and reproducible efficiency comparisons and stimulate algorithmic innovations in building efficient models.
Weaknesses
- The paper is more like a technical report, which may suit a benchmark or industry track. It would be great if the authors could provide additional scientific findings and conclusions through the evaluations. This work has provided a relatively comprehensive and mature evaluation benchmark. With more inspiring and interesting findings and conclusions based on the evaluations, the paper would be more valuable to the community.
- There are many datasets designed for large language model evaluation. Using classical tasks (e.g., machine translation, mathematical reasoning, and classification) makes the experiments less convincing.
Questions
See weaknesses.
The paper is more like a technical report, which may suit a benchmark or industry track. It would be great if the authors could provide additional scientific findings and conclusions through the evaluations. This work has provided a relatively comprehensive and mature evaluation benchmark. With more inspiring and interesting findings and conclusions based on the evaluations, the paper would be more valuable to the community.
We agree with the reviewer that this submission fits the datasets and benchmarks track of ICLR 2024, which is exactly the track this paper was submitted to: https://iclr.cc/Conferences/2024/CallForPapers
We believe that the value of this work lies in providing the community with a platform that can help discover interesting findings and conclusions about the efficiency of machine learning models.
There are many datasets designed for large language model evaluation. Using classical tasks (e.g., machine translation, mathematical reasoning, and classification) makes the experiments less convincing.
Our decision to experiment with machine translation, classification, and math word problems aims to demonstrate Pentathlon's versatility in supporting a variety of architectures and settings. Please refer to the general response for further details.
This paper introduces a benchmark for evaluating model efficiency (as opposed to most benchmarks, which focus on performance), in particular for LLM inference. It compares different use-case scenarios for calling an LLM and uses several metrics to evaluate inference efficiency and environmental impact.
Strengths
- Proposes several use-case scenarios for model inference, like batching, streaming, offline, etc.
- Proposes several metrics to measure model inference efficiency as well as environmental impact
Weaknesses
- Benchmark selection is a bit limited; only a few tasks are chosen. For instance, there is no typical (monolingual) language generation task.
Questions
- Figure 3 (a) and (b) are exactly the same. Is this a coincidence or a mistake?
We appreciate reviewer HHde’s recognition of our efforts to include a diverse range of efficiency scenarios and metrics.
Benchmark selection is a bit limited; only a few tasks are chosen. For instance, there is no typical (monolingual) language generation task.
Our benchmark supports all tasks from Hugging Face, including many language generation tasks. Please see the general response.
Figure 3 (a) and (b) are exactly the same. Is this a coincidence or a mistake?
Thanks for catching this. This was due to a LaTeX typo and has been fixed.
This paper introduces Pentathlon, a benchmark created for the comprehensive and realistic evaluation of model inference efficiency. Pentathlon provides a strictly controlled hardware platform, including GPUs and CPUs, and incorporates a suite of metrics targeting different aspects of efficiency, including latency, throughput, memory overhead, parameter count, and energy consumption. As a standardized and centralized evaluation platform, Pentathlon aims to significantly reduce the workload required for fair and reproducible efficiency comparisons. While its initial focus is on natural language processing (NLP) models, Pentathlon is designed to be flexibly extended to other fields.
Strengths
- Clarity: The paper is exceptionally well-written, offering a comprehensive presentation of the Pentathlon benchmark suite. The authors provide a detailed explanation of its design, which emphasizes equitable model comparisons and incorporates testing settings for both CPUs and GPUs.
- Thoughtful Metric Selection: Pentathlon's use of five carefully chosen evaluation metrics addresses critical properties of models, ensuring that the benchmark accurately assesses key aspects of efficiency.
- Visual Aid: The inclusion of radar charts is a notable advantage, as they effectively illustrate the strengths and weaknesses of models, making it easier for readers to comprehend the benchmark's findings.
- Centralized Benchmarking: While the concept of centralized benchmarking isn't entirely novel, it remains highly valuable for gaining a deeper understanding of the diverse impacts algorithms have on model efficiency. Pentathlon offers a structured and standardized approach to this essential process.
- Realistic Workloads: The authors' meticulous design of workloads to mirror batching scenarios and real-world service loads enhances the reliability of the benchmark's results, ensuring they are more reflective of practical use cases.
Weaknesses
- Hardware Flexibility: While the authors have outlined CPU and GPU settings, it remains unclear whether the benchmark suite can easily accommodate other hardware platforms. Given the growing popularity of new platforms like Metal, ROCm, Mali, and Vulkan, it's essential to address the adaptability of Pentathlon to ensure it remains relevant and applicable to diverse hardware configurations. Moreover, certain models, such as Llama-70b, may require multiple high-end GPUs like A100 or H100 for distributed inference, highlighting the need for flexibility in hardware options.
- Software Environment Assumptions: The paper primarily focuses on a specific software environment, potentially overlooking the fact that various software stacks, such as TVM and Cutlass, may require an additional step called tuning. This tuning phase optimizes the compilation stack for the given hardware, which can significantly improve model performance. However, the tuning process itself may not always be efficient and can be time-consuming. It's crucial to consider these software-related aspects for a more comprehensive evaluation.
- Controlled Hardware vs. Cloud-Based Platforms: While the controlled hardware setting provides fairness and accuracy, it may not fully cater to researchers who heavily rely on cloud-based platforms like AWS, Azure, or Google Cloud. Many recent large language models (LLMs) are built and deployed on cloud platforms, and their efficiency and latency results may significantly differ from those obtained in a controlled environment. To make Pentathlon more applicable to a wider range of real-world scenarios, consideration could be given to extending the benchmarking to include cloud-based machines and their specific challenges.
Questions
- What level of effort is required to expand Pentathlon to accommodate a new hardware platform or incorporate a new model into the benchmark?
- Beyond the BLEU score, are there additional metrics available within Pentathlon to assess model quality, such as perplexity or other relevant NLP-specific metrics?
- How can you distinguish the impact of "algorithmic innovations" from other efficiency-related factors in the Pentathlon benchmark?
Details of Ethics Concerns
N/A
We thank reviewer 5UHy for their constructive feedback. It is encouraging to hear that they found our work well-written and appreciated many of our design choices.
Hardware Flexibility
We appreciate the reviewer raising the valid concern that controlling the hardware leads to less flexibility. In preliminary experiments, we found it possible to simulate mobile-phone platforms by connecting an Arm-chip device to our Linux host. Some on-device applications can be simulated by running our platform on the NVIDIA Jetson module: https://developer.nvidia.com/embedded/jetson-tx2
We acknowledge the challenge of continuously supporting new hardware platforms, which is beyond the authors' capacity. To address this, Efficiency Pentathlon is designed to be flexible and can be used as standalone software without our machine. It is compatible with any hardware platform that supports Linux, offering a versatile solution for users on their preferred hardware when centralized evaluation and controlling for the hardware are not primary concerns.
Software Environment Assumptions:
Maximizing software flexibility was something we considered carefully while designing Pentathlon. Our benchmark makes minimal assumptions about the evaluated system's software environment; the primary requirement is that it can run in a Docker container. As a quick verification, we confirmed that Cutlass, which the reviewer brought up, is compatible with our benchmark. Due to the limited time of the response period, we were not able to verify TVM at the time of this response. We plan to include this discussion in the revision.
Even when the evaluated system requires compilation, its efficiency will not be negatively affected. Specifically, Efficiency Pentathlon begins tracking efficiency metrics only once the system has completed its compilation phase and is operational. This is implemented through a 'handshake protocol': the benchmark sends a dummy input batch to the system and starts measuring efficiency only after it has received the system's outputs for this dummy batch, indicating that the model has finished its preparation stage and is ready to go.
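For concreteness, a generic sketch of this handshake idea is shown below; `send_batch` and `receive_outputs` are hypothetical placeholders for the benchmark's communication with the evaluated system, not the actual Pentathlon API.

```python
# Generic illustration of the handshake described above; the helper names are
# hypothetical placeholders, not the actual Pentathlon API.
import time

def timed_evaluation(send_batch, receive_outputs, batches, dummy_batch):
    # Handshake: the dummy batch pushes the system through any one-time setup
    # (compilation, weight loading, cache warm-up) before timing begins.
    send_batch(dummy_batch)
    receive_outputs()                    # block until the system answers the dummy batch

    start = time.perf_counter()          # efficiency tracking starts only now
    outputs = []
    for batch in batches:
        send_batch(batch)
        outputs.append(receive_outputs())
    elapsed = time.perf_counter() - start
    return outputs, elapsed
```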
Controlled hardware vs. cloud-based platforms
We fully agree with the reviewer's great point that controlling for the hardware is not always practical or desirable in practice. This consideration is integrated into the design of our benchmark. The benchmark is developed as standalone, open-source software, allowing users to employ it on their preferred hardware platforms, including many cloud machines that support Linux. We will emphasize this point more explicitly in the revision.
What level of effort is required to expand Pentathlon to accommodate a new hardware platform or incorporate a new model into the benchmark?
Expanding our centralized host machine to accommodate a new hardware platform involves physically connecting the hardware to our host machine and making necessary software adjustments. This process can be challenging and may take 6–10 hours based on the authors’ experience. For users that want to use our benchmark on their chosen hardware, we expect the process to involve minimal effort, typically just cloning our software and installing necessary dependencies, provided their hardware is Linux-compatible.
To make it easier to incorporate new models, we provide code templates and detailed documentation to guide the process. Based on the experience of an author who was initially unfamiliar with the benchmark, we estimate that an average researcher can adapt a model to our benchmark in under two hours. If the model is already compatible with Hugging Face APIs or is hosted on Hugging Face, much less effort is required.
Beyond the BLEU score, are there additional metrics available within Pentathlon to assess model quality, such as perplexity or other relevant NLP-specific metrics?
Efficiency Pentathlon is intentionally designed without built-in support for accuracy or quality metrics. Instead, it provides users with the models' outputs and defers quality evaluation to them. We chose to do this for two reasons.
First, Efficiency Pentathlon is designed to support a variety of tasks, and its current version supports anything available on Hugging Face. Anticipating and implementing accuracy metrics for all of these tasks is beyond our capabilities and the scope of this work. Second, for many tasks (e.g., MT, as the reviewer mentioned), post-processing decisions can significantly affect a model's measured accuracy. For instance, in XX to Romanian translation, the choice to include or exclude diacritics can greatly influence the model's BLEU score. Such nuances in post-processing are critical to the overall performance evaluation. We believe that users are better positioned to make decisions about, e.g., post-processing and the selection of quality metrics. Therefore, we decided not to impose specific evaluation metrics for model quality, instead leaving these decisions to the users.
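To illustrate this point, here is a small sketch (our illustration, using the open-source sacrebleu package and made-up Romanian sentences, not data from the paper) showing how a single post-processing choice, stripping diacritics, changes the computed BLEU score:

```python
# Illustration of how a post-processing choice affects a quality metric.
# Uses the open-source sacrebleu package; the sentences are made up.
import unicodedata
import sacrebleu

def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

refs = ["Aceasta este o țară frumoasă."]   # reference with diacritics
hyps = ["Aceasta este o tara frumoasa."]   # system output without them

raw = sacrebleu.corpus_bleu(hyps, [refs]).score
normalized = sacrebleu.corpus_bleu(
    [strip_diacritics(h) for h in hyps],
    [[strip_diacritics(r) for r in refs]],
).score
print(f"BLEU with diacritics kept: {raw:.1f}; after stripping them from both sides: {normalized:.1f}")
```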
How can you distinguish the impact of "algorithmic innovations" from other efficiency-related factors in the Pentathlon benchmark?
The reviewer raised a great point that algorithmic innovations can be hard to distinguish from many other factors in practice. To address this, we try to control as many potential “non-algorithmic” confounders as possible, including hardware platforms, software environment, data processing pipelines, IO frameworks, etc.
We thank all reviewers for their thoughtful comments and constructive feedback.
Reviewers HHde, uM5U, and pV3M raised this important concern about the tasks and datasets supported in our benchmark. To clarify, Efficiency Pentathlon supports all tasks that Hugging Face does. We will highlight this in the abstract and introduction in the revision.
We chose to experiment with machine translation, classification, and math word problems in order to demonstrate Efficiency Pentathlon's capability to support diverse model architectures and settings and to study their trade-offs, including encoder-decoder models, decoder-only causal language models, and even non-neural approaches. Additionally, including these varied paradigms lets us study trade-offs such as CoT versus direct prompting and zero-shot versus few-shot in-context learning.
This paper presents a benchmark for model efficiency. The authors provide a controlled hardware platform for comparing different models by hosting the evaluation themselves on an in-house server. Several types of workflows are explored, including a fully offline setting or fully online (single stream) setting. Systems are evaluated on a suite of metrics targeting 5 aspects of efficiency: throughput, latency, memory overhead, energy consumption, and model size.
The paper conducts evaluations on MT, math reasoning (GSM8k), and classification, showcasing the kinds of insights that can be gleaned from this analysis. Models that are better on task quality metrics like BLEU score are often worse on efficiency.
The reviewers praised the overall aims of the paper: centralized benchmarking around efficiency is a valuable goal for the field. The paper is very clearly written.
The main critiques of the paper chiefly revolve around the depth and nature of the contribution. The authors argue in response to uM5U that this paper falls into the datasets and benchmarks track. But this is more like an idea of a benchmark rather than a specific benchmark; the discussion shows that different applications, different hardware, etc. are needed. For example, 5UHy brings up other hardware and software platforms that may reflect more typical use cases of systems like LLMs. uM5U argues that different tasks would be useful, and the authors respond that their benchmark can support any tasks from Hugging Face.
As a result, this paper feels more like a demo paper than an enduring research contribution.
Why not a higher score
See review; I don't think this paper makes a research contribution suitable for ICLR.
Why not a lower score
N/A
Reject