Maia-2: A Unified Model for Human-AI Alignment in Chess
Summary
Review and Discussion
The paper introduces a unified modeling approach called Hermes for aligning human and AI performance in chess. Hermes effectively captures human chess styles across different skill levels, providing a coherent reflection of player improvement. The approach incorporates a skill-aware attention mechanism that dynamically combines player skill levels with encoded chess positions, allowing the model to adapt to varying player skills. Experimental results demonstrate that this unified framework significantly enhances the alignment between AI and human players, facilitating a deeper understanding of human decision-making and the development of AI-guided teaching tools.
Strengths
- This paper is easy to read and follow, with detailed descriptions and accompanying code.
- The method is simple yet effective, demonstrating remarkable results in various settings.
Weaknesses
- The method is actually straightforward, as it only employs the multi-head attention mechanism, a general technique applicable in various settings and applications.
- The motivation for using attention and the reasons for its effectiveness are unclear.
- The concept of skill used in this paper is not well-defined, which may limit its applicability to the chess environment only.
Questions
- How is skill defined in this paper? Is it manually designed or given by the chess setting? How does it handle this setting, and why can't we implicitly obtain different skill levels?
- How could this method be expanded to settings other than chess, such as those used in multi-agent systems [1]?
- How can this method be expanded to different chess variants or adapted for a large-scale chess model?
Ref: [1] Yuan, L., Zhang, Z., Li, L., et al. "A survey of progress on cooperative multi-agent reinforcement learning in open environment." arXiv preprint arXiv:2312.01058, 2023.
Limitations
N/A
Design choices (W2)
We agree that we didn’t sufficiently explain our architecture choice. The rationale behind our design is that each channel (feature map) of the ResNet output represents a different aspect of a chess position, and the attention blocks actively select and combine these features according to the given skill level. Evidence can be found in Figure 4: for skill-dependent concepts, the representation before the attention blocks captures them uniformly across all skill levels with high accuracy, whereas after the attention blocks the representation is modulated by skill, with higher skill levels understanding the concepts better than lower ones. This shows that the attention blocks effectively “pretend” not to know certain concepts in order to model the degree of imperfection in human moves; skill-independent concepts, by contrast, are understood similarly by the representations before and after the attention blocks. Table 3 (“w/o Att”) shows that simply concatenating the skill embeddings with the flattened ResNet outputs did not work well, so a more sophisticated form of conditioning is needed.
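The conditioning mechanism described above can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the authors' implementation: all sizes, weight matrices, and embeddings are made up. Each ResNet channel acts as a token, and a projected skill embedding is added to the queries so that which features get attended to depends on the skill level.

```python
import numpy as np

def skill_aware_attention(channels, skill_emb, W_q, W_k, W_v, W_s):
    """Single-head attention over channel tokens; the projected skill
    embedding is added to the queries (skill-conditioned selection)."""
    d = channels.shape[1]
    q = channels @ W_q + skill_emb @ W_s   # queries depend on skill
    k, v = channels @ W_k, channels @ W_v
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ v, w

rng = np.random.default_rng(0)
n_channels, d = 8, 16                        # hypothetical sizes
channels = rng.normal(size=(n_channels, d))  # one token per feature map
W_q, W_k, W_v, W_s = (0.1 * rng.normal(size=(d, d)) for _ in range(4))

low_skill = rng.normal(size=(d,))   # embedding for a low rating bucket
high_skill = rng.normal(size=(d,))  # embedding for a high rating bucket
out_low, w_low = skill_aware_attention(channels, low_skill, W_q, W_k, W_v, W_s)
out_high, _ = skill_aware_attention(channels, high_skill, W_q, W_k, W_v, W_s)
```

Because the skill embedding enters the queries, the same position representation is mixed differently for different skill levels, which is the behavior the Figure 4 probing results point to.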
Contribution (W1)
To the best of our knowledge, we are the first to emphasize coherence in human behavior modeling. In particular, our work introduces the first human behavior model that is coherent across various skill levels and even improves the move-matching accuracy.
The primary methodological contribution lies in the unified modeling approach for coherent human behavior modeling that is enabled by this specifically designed model architecture. While we believe that the architecture has conceptual benefits that help make these contributions possible, and we will include the rationale behind the architecture in the revisions, we do not claim that the architecture is optimal in any sense; we view the specifics of the architecture as secondary to the advances, i.e., move prediction coherence and accuracy, provided by the unified modeling approach. We will also mention other architecture choices as promising avenues for future work.
Definition of Skill (W3, Q1)
The skill level is defined by the Elo rating system, which was originally proposed for two-player zero-sum games [1] and is widely used in chess. Each player has an Elo rating before starting a game, and the rating is updated after the game based on its outcome. We use the Elo rating before the game as the annotated label for skill level.
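For readers unfamiliar with Elo, the standard update rule can be sketched in a few lines. This is illustrative only, not the paper's code; the K-factor of 20 is an assumption (real platforms tune it per player).

```python
# Minimal sketch of the standard Elo update rule.
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 20.0):
    """Return both players' updated ratings; score_a is 1, 0.5, or 0."""
    e_a = elo_expected(r_a, r_b)
    # Each rating moves toward the observed result by k * (score - expected).
    return (r_a + k * (score_a - e_a),
            r_b + k * ((1.0 - score_a) - (1.0 - e_a)))
```

For example, `elo_update(1500, 1500, 1.0)` gives the winner 1510 and the loser 1490, since equally rated players have an expected score of 0.5.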
Extracting skill embeddings from past moves is another research question, which gives individual skill embeddings instead of grouped skill embeddings. The performance can be promising if sufficient past moves are provided [2]. We see this as a promising direction for future work and will add it to the paper.
Generalization to other domains (Q2)
Elo rating systems are now widely used in many domains, such as ranking LLMs [3]. Beyond Elo, our categorical skill-level modeling can easily be adapted to any continuous rating or discrete grading system, for example, graded proficiency in math problem solving.
Unlike multi-agent systems, we treat this setting as effectively single-agent, where the chess engine itself will predict human moves without interacting with any other agents. Although Hermes itself is not designed as a multi-agent system, it could be extended to multi-agent systems in the broader context of studying interactions between human-like agents to achieve specific goals, in particular competitions among agents.
Adapting to Chess Variants (Q3)
It is fairly easy to apply our method to chess variants as long as sufficient human historical data is available and skill levels are provided. In particular, the Lichess database provides human historical data for Antichess (27.7M games), Atomic (21.6M games), and Chess960 (20.1M games); the only differences would be the data and the labels for the prediction heads.
[1] Wikipedia. 2024. "Elo Rating System." Wikimedia Foundation. Last modified July 22, 2024.
[2] McIlroy-Young, Reid, et al. "Learning models of individual behavior in chess." KDD 2022.
[3] Chiang, Wei-Lin, et al. "Chatbot arena: An open platform for evaluating llms by human preference." arXiv 2024.
Thank you for your response, I maintain my score at this stage.
This work explores developing a unified model to predict human moves in Chess. To address the coherence challenges, the authors propose to use skill-aware attention with channel-wise patching to encode skill levels and board positions into a neural network model. Experimental results show their proposed model achieves comparable or better performance in prediction accuracy and coherence compared to SOTA models.
Strengths
This work is overall very solid. The proposed model is clearly presented, with enough details for reproduction. Evaluation is also convincing and thorough.
Weaknesses
A minor problem concerns the skill level encoder. The proposed Hermes model encodes both players’ skills, but the Maia model encodes only one player’s, which gives Hermes an advantage and makes the comparisons in human move prediction potentially unfair.
Questions
- Regarding the volatile predictions of the Maia model: is this problem due more to prediction errors or to the volatile nature of human playing data? For example, a move that relies on several subsequent moves to be effective might be a bad choice for a middle-level player, but a good choice for both low-level players (where the opponent doesn’t know how to counter) and high-level players (where the active player can manage the subsequent changes). I’m not an expert in chess, just wondering whether this is a possible cause of the incoherence problem.
- In Figure 10, is Maia-2 actually the proposed Hermes model? This evaluation of win-rate seems indirect. Why not directly pair Hermes-1500 (by setting the active player level to 1500, for example) with Maia-1500 and evaluate the final outcome, to determine which model is better if the goal is to evaluate move quality?
Limitations
No significant limitation.
Skill level encoder (W1)
Maia implicitly encodes both players’ skill levels by only selecting the games between the same-strength players for model training. In Table 1, we have equated the active and opponent skill levels to ensure fair comparisons, and our model outperforms Maia in this setting. Note that Maia is restricted to training on games between players at the same skill level, which is a significant limitation on the distribution of data it considers.
Not only does Hermes outperform Maia in Maia’s setting, but our unified model is also capable of properly modeling situations where the player skill levels are different. This allows for better flexibility and improved performance on both matched and unmatched skill-level settings (Figure 2). In addition, having both ratings can help derive insights into human behavior. For example, improving the players' skills themselves affects human decisions more than the varied opponent skill levels (Figure 3B).
Maia’s Incoherence (Q1)
We strongly suspect that Maia’s incoherence is much more due to noise than to signal. While it’s possible that some positions may have non-monotonic relationships with the probability of choosing the “best” move, it is highly unlikely that the true relationships are as chaotic as Maia’s predictions (which often change direction 4–5 times throughout the 9-step skill range).
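The notion of "changing direction 4–5 times" can be made concrete with a small sketch. The probability sequences below are invented for illustration; the function simply counts how often the trend of P(best move) across the 9 skill buckets reverses direction (a monotone sequence has zero reversals).

```python
def direction_changes(probs):
    """Count how often the skill-vs-probability trend reverses direction."""
    signs = [1 if b > a else -1 for a, b in zip(probs, probs[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def is_monotone(probs):
    """A coherent prediction never reverses direction across skill levels."""
    return direction_changes(probs) == 0

# Invented P(best move) at the 9 skill buckets for one position:
volatile = [0.2, 0.6, 0.3, 0.7, 0.4, 0.8, 0.5, 0.9, 0.6]    # Maia-like
coherent = [0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85]  # Hermes-like
```

Here the volatile sequence reverses direction seven times over nine skill levels, while the coherent one is monotone, matching the qualitative contrast described above.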
Figure 10 (Q2)
Yes, it is Hermes, which we have already corrected; thank you for pointing out this typo, which we will fix. We did not aim to evaluate the predicted move quality of Hermes or Maia because our goal is to replicate human moves rather than maximize quality. Figure 10 shows human move prediction accuracy conditioned on move quality: for better or worse moves, how well can we predict them? The higher the win-rate loss, the lower the move quality. We find that human move prediction models are good at predicting higher-quality moves, which are more certain, and worse at predicting lower-quality moves, which are more random and thus harder to predict. Our unified model Hermes still outperforms Maia in this setting.
Thanks for the responses. As other reviewers have noted, there are some areas where the method could be clarified, along with some corrections needed for typos and figures.
Despite these minor revisions, I firmly believe that this work surpasses the acceptance threshold. Their focus on coherence in human behavior introduces an interesting research topic. The proposed unified model can be easily applied to modeling human behavior in other domains. This work has significant potential to inspire future research, and I do not find it limited by concerns regarding novelty, contribution or generalization (as mentioned by other reviewers).
I will maintain my current evaluation.
Thank you for your support.
The paper proposes a unified modeling approach named Hermes for aligning human and AI behaviors in chess. It addresses the limitations of previous models by integrating a skill-aware attention mechanism that dynamically adapts to various skill levels of players. The Hermes model aims to enhance AI-guided teaching tools and provide deeper insights into human decision-making in chess. The model is evaluated based on move prediction accuracy and coherence, showing improvements over existing models.
Strengths
The paper presents an approach to human-AI alignment in chess through the introduction of a skill-aware attention mechanism that dynamically adapts to players’ skill levels. This technique addresses the non-linear nature of human learning and significantly improves the coherence of AI models across different skill levels. The paper is well-structured and clearly written. The evaluation of the Hermes model is thorough, demonstrating notable improvements in move prediction accuracy and coherence over existing models like Maia and traditional chess engines such as Stockfish and AlphaZero.
Weaknesses
Despite its strengths, the paper has several limitations. While the skill-aware attention mechanism seems innovative, the overall novelty of the paper is somewhat limited, as it builds on existing models and techniques without introducing fundamentally new concepts in AI or chess modeling. Additionally, the paper does not adequately address potential biases introduced by relying heavily on data from a single online platform, which may affect the generalizability of the model. The experiments mainly report move prediction accuracy and move prediction coherence, which are not sufficient to support the model’s practical effectiveness in gameplay scenarios. Most importantly, the paper lacks human-AI experiments that would substantiate the claims of alignment.
Thus, I find the method lacks novelty and the experiments are insufficient to support the claims, leading me to recommend the paper for rejection.
Questions
See weaknesses.
The font in figures is too thin.
Limitations
The paper discusses its limitations in the appendix.
Contribution
To the best of our knowledge, we are the first to emphasize coherence in human behavior modeling. It is important to reconsider the assertion that there are "no fundamentally new concepts," as our work introduces the first human behavior model that is coherent across various skill levels and even improves upon the state-of-the-art move-matching accuracy.
The primary methodological contribution lies in the unified modeling approach for coherent human behavior modeling that is enabled by this specifically designed model architecture. While we believe that the architecture has conceptual benefits that help make these contributions possible, and we will include the rationale behind the architecture in the revisions, we do not claim that the architecture is optimal in any sense; we view the specifics of the architecture as secondary to the advances, i.e., move prediction coherence and accuracy, provided by the unified modeling approach. We will also mention other architecture choices as promising avenues for future work.
Generalization to other platforms
Given the universal and objective nature of chess rules and strategies, it is unlikely that human behavior in chess differs systematically from one platform to another. For example, given a recorded chess game, we strongly believe it would be extremely difficult to predict whether it was played on Lichess or some other platform. This observation mitigates concerns about Hermes’s generalizability to other platforms.
Human Studies
In an ideal human experiment, we would give a position to a human at a particular rating, and compare their chosen move to our model output. Our experiments do exactly this, and thus we view them as massive human studies that measure the move-matching accuracy and coherence with the recorded behaviors of real humans.
In addition to this main experiment, we have also performed additional experiments that address other dimensions of your question. In particular, we implemented a randomized experiment on Lichess: human players challenge our bots, and we randomize whether they play against Maia or Hermes. Our final result is that our higher move-matching accuracy and vastly improved coherence, across all skill levels, come at no cost to human subject engagement, and in fact slightly increase it: players rematch Hermes after the first game 1.5 percentage points more often than Maia (40.6% vs. 39.1%). Although engagement is not our main objective (move-matching and coherence are), this is further promising evidence that we have achieved our goal of a human-aligned AI model that coherently captures human style across different skill levels.
Typos/Minors
Thanks, we will modify them.
Thanks for your detailed replies.
In your reply, you claim that you are "the first to emphasize coherence in human behavior modeling" and "first human behavior model that is coherent across various skill levels and even improves upon the state-of-the-art move-matching accuracy". What are the differences between your paper and the paper titled "Aligning Superhuman AI with Human Behavior: Chess as a Model System" [1]? They claimed that "...a customized version of AlphaZero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines, and can achieve maximum accuracy when predicting decisions made by players at a specific skill level in a tuneable way."
Thus, from my limited knowledge and your reply, I don't think your paper is "the first". I will maintain my score as "Reject".
[1] https://www.cs.toronto.edu/~ashton/pubs/maia-kdd2020.pdf
Thank you for engaging with our rebuttal. The major difference between our work and the paper you mentioned ([1]) is coherence. Maia [1] is a set of independent models, one for each skill level, that each independently achieve a respectable accuracy on human move-matching at their targeted skill level. Hermes (our model), in contrast, is a single unified model that accurately predicts moves at all skill levels in a coherent way. The problem with Maia's approach is that its predictions are incoherent—its predicted moves in the same position p are unrelated to each other. It may (and often does) predict that, say, only 1100, 1400, and 1800 rated players will play the right move in position p, whereas 1200, 1300, 1500, 1600, 1700, and 1900 rated players will play the wrong move in position p. This runs counter to how people actually improve: as people progress up the skill levels, they learn concepts, and once they start playing the right move in position p they will tend to keep playing it as they get better. (There may be cases where the relationship between skill and correctness is non-monotonic, but anecdotally these are rare—much rarer than Maia predicts non-monotonicity.) Hermes, on the other hand, makes much more coherent predictions. We directly compare the coherence of Maia and Hermes's predictions in Table 4. Maia only treats around 1.5% of positions monotonically, but Hermes treats around 27% of the same positions monotonically, a huge improvement. These results, combined with the fact that [1] makes no mention of coherence, are what support the claims you quoted: "the first to emphasize coherence in human behavior modeling" and "first human behavior model that is coherent across various skill levels and even improves upon the state-of-the-art move-matching accuracy".
Does our response address your concern, or are there any others we can respond to in the few hours before the rebuttal window closes?
This work tackles the problem of modeling chess agents at varying skill levels. Prior work learns separate models for each skill level, so the authors introduce their method “Hermes” which uses skill-aware attention to adapt the predictions based on the skills i.e. chess ratings of both players in the game. Technically, Hermes uses a categorical embedding for different buckets of player ratings, and uses multi-head attention to project the game state to the player move, some auxiliary game information, and a value head. The active and opponent skill embeddings are projected and added to the query matrix in the attention step, to make the model skill-aware. The authors train and evaluate the model on large datasets of chess gameplay across varying ratings, and conduct in-depth behavior analysis of their models compared to the baselines. They demonstrate that Hermes slightly outperforms the baselines with higher confidence predictions. Further, it enables additional capabilities such as modeling chess players over the entire spectrum of skill levels, and monotonic improvement of predicted moves as the skill increases.
Strengths
- The paper tackles an important problem of skill-aware modeling of human behavior. The authors instantiate this in the setting of chess games between players of varying skill levels. The intuition of the overall method is well understood, the framework and the parameters are clearly mentioned, and the results are shown over the standard datasets and compared to recent baselines.
- The paper is well written. In the paper, the processing steps for the dataset, including filtering and balancing are explained clearly. The authors do not discard rare games between players at very different chess ratings. Is using these rare games helpful for the model? It would be nice to see some ablation showing the effect of this additional data on model performance.
- Hermes uses auxiliary information prediction to align the model better. This is very intuitive, as forcing the model to predict auxiliary information from the current state should improve its understanding of the game's concepts. I am curious whether this has been explored in past work. If not, it would be interesting to see whether it could improve the performance of the baseline agents.
- The paper presents an in-depth analysis of the performance and predictions of Hermes under different conditions. It shows that Hermes is more confident in its predictions than the baseline that learns individual models at individual skill levels. Further, Hermes performs similarly across all pairs of skills showing that it captures and models the players' ratings over the entire spectrum, while the baseline is accurate only around the particular skill level they are trained on. Finally, they also show that the model implicitly differentiates in its predictions on the moves that are skill-dependent and independent.
Weaknesses
- The paper focuses on improving the predictive model performance for chess games, but the improvement over the baseline is very small (2%). Further, the table doesn't include any error ranges and has no statistical testing to show if the performance improvement is statistically significant.
- The paper does not cite some recent work on modeling agents with varying skill levels [1]. I would recommend the authors further include and cite works on modeling diverse agents and partners, specifically in multi-agent learning literature.
- In Hermes, the players’ skill embeddings are projected and added to the query matrix in the attention step, to make the model skill-aware. However, there are no experiments or intuition to explain the significance of this architectural choice. Over simple conditional networks (or more recent works such as FiLM [2]), it would be nice to see what is the significance of this particular attention mechanism.
- Some corrections regarding the manuscript: Figure 3 refers to an Errors (middle) panel which is missing. I believe there is a typo in Figure 5, where the top row has Hermes labels which should be Maia.
[1] Jacob, Athul Paul, et al. "Modeling Boundedly Rational Agents with Latent Inference Budgets." arXiv preprint arXiv:2312.04030, 2023.
[2] Perez, Ethan, et al. "FiLM: Visual Reasoning with a General Conditioning Layer." AAAI Conference on Artificial Intelligence, 2018.
Questions
- How does the granularity of skill clusters affect the model's performance?
- What is the training and test split over the dataset?
- Could the model interpolate between varying skill levels? I would be curious to see what would be its performance on unseen skill levels.
- Does the Q-Q plot for Value head prediction only include accurate predictions?
- Is there a qualitative analysis of where the improvement in Hermes over the baseline comes from? Does it improve all skill levels or is it restricted to a low number of buckets?
Limitations
- The paper focuses primarily on modeling chess agents, so, a discussion on the broader impact of this method would be valuable. Here, the model assumes access to the ground truth skill embedding of the player. However, It would be interesting to see if it is possible to extract the skill embedding from the past moves i.e. implicit modeling of player/opponent skill.
- The model's accuracy is around 54%, which is still pretty low for the model to be deployed. The evaluation metrics are limited to the model's prediction accuracy. A human study or evaluation against skill-specific agents could further demonstrate Hermes' effectiveness in modeling skills at all levels.
- The paper has limited citations to past work on modeling diverse agents, so it is difficult to assess the novelty and the impact of the contribution. Further, one of the main contributions is the skill-aware module, so I would suggest the authors to include some baselines or discussion to explain the intuition and benefits of using this mechanism over other simpler baselines.
Skill level modeling (Q1, Q3, L1)
Interpolation (Q3): Interpolation between skill levels within our range is not meaningful, since the buckets already cover all involved ratings: the “1100” rating bucket contains games played by players rated 1100–1199, the “1200” bucket contains games played by 1200–1299 rated players, and so on.
Granularity (Q1): We group players by ranges of 100 ratings to balance the natural volatility of ratings, the data availability within each rating range, and the practical meanings of ratings (e.g. the difference between 1200 and 1205 is not humanly perceptible). Also, this is consistent with prior work, enabling us to make direct comparisons.
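As an illustration, the 100-point bucketing could be implemented as below. The exact range boundaries, the number of buckets, and the clamping behavior at the edges are assumptions for this sketch, not the paper's code.

```python
def rating_bucket(elo: int, lo: int = 1100, hi: int = 2000,
                  width: int = 100) -> int:
    """Map an Elo rating to a discrete skill index (here 0..8),
    clamping out-of-range ratings into the modeled range."""
    elo = max(lo, min(elo, hi - 1))  # clamp to [lo, hi)
    return (elo - lo) // width       # integer bucket index
```

For example, 1250 falls in bucket 1, anything below 1100 is clamped into bucket 0, and anything at or above the top of the range is clamped into bucket 8; the bucket index is then what selects the categorical skill embedding.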
Automation (L1): Extracting skill embeddings from past moves is another research question, which gives individual skill embeddings instead of grouped skill embeddings. The performance can be promising if sufficient past moves are provided [1]. We see this as a promising direction for future work, and will add it to the paper.
Multi-agent systems (W2, L3)
We treat this setting as effectively single-agent: the model is tasked with predicting a human move without interacting with any other agents. Since there is only one chess agent, common or conflicting goals among agents are not defined, so this setting does not match the definition of a multi-agent system: multiple decision-making agents that interact in a shared environment to achieve common or conflicting goals.
Although Hermes itself is not designed as a multi-agent system, it could be extended to multi-agent systems in the broader context of studying interactions between human-like agents to achieve specific goals, in particular competitions among agents. We will add more discussions on multi-agent learning literature and make it clear in revisions, especially the ones related to human chess plays like Section 6 in the mentioned paper.
Dataset (Q2)
Please refer to the beginning of Section 4 and Tables 6–10 for details. To ensure a fair comparison with Maia, which is tested on December 2019 data and trained on the data before it, we trained a version of Hermes on only the data that Maia was trained on. However, we now also have more data to train on; therefore, for the full Hermes model, we use games played in December 2019 and December 2023 for testing and the rest of the data before December 2023 for training.
Q-Q plot (Q4)
The positions used for plotting are not selected based on the correctness of the policy head predictions.
Typos/Minors (W4)
Thanks, we will modify them.
[1] McIlroy-Young, Reid, et al. "Learning models of individual behavior in chess." KDD 2022.
I thank the authors for their detailed responses. From the responses and comments, I understand that the main contribution of this work is coherence, however, it is still unclear how Hermes ensures better coherence than the baselines. It would be great if the authors could provide some intuition about that. It seems that the coherence trend is observed post-training and so, is not necessarily the main motivation behind the model. Or is there something specific to the unified modeling that I am misunderstanding here?
Given the current presentation of the contributions, I believe this paper needs more analysis and explanation to present a clear picture of the method and its benefits. Thus, I would maintain my score.
Thank you for your comment. Our response addresses two main points:
- Coherence is the central motivation of our work, not just a post-hoc observed outcome.
- Hermes ensures better coherence than the baselines through its unified, parameter-sharing approach.
Coherence as central motivation
We emphasize that coherence is not only observed post-training--it is the central motivation of our work. The third sentence of the Abstract states "Critical to achieving this goal, however, is coherently modeling human behavior at various skill levels." (L5). The central limitation of previous work, and the motivation for our present work, articulated in the Introduction is Maia's lack of coherence (paragraph starting on L48: "Maia models players at different skill levels completely independently...Viewed as a whole, [Maia's predictions] are volatile", "In order to serve as algorithmic teachers or learning aids, our models of human behavior must be coherent"). The first sentence of the Discussion summarizes our contribution as "Hermes is a unified model architecture that can accurately and coherently capture human decision-making in chess across a broad spectrum of skill levels." (L383). We hope that situating coherence as the central idea of our paper in the Abstract, Introduction, and Discussion makes it sufficiently clear that coherence is what we are aiming for, instead of something we stumbled upon, but we would also be happy to implement any suggestions you have in order to make this point clearer.
Architecture intuition
As for why Hermes ensures better coherence than the baselines, we are happy to provide some intuition (which we will certainly incorporate into our revision to make sure this is as clear as possible). In one sentence, Hermes uses a unified, parameter-sharing modeling approach instead of Maia’s independent parameters as a way to regularize across skill levels.
To explain more fully, the root cause of Maia's lack of coherence is that the Maia models learn 9 independent sets of parameters, one for each of the 9 skill levels. This means that there is no mechanism to encourage or enforce consistency (i.e. coherence) across skill levels. Maia 1400 and Maia 1500, for example, are distinct models with completely separate training data and zero parameter overlap. As a result, Maia often outputs dramatically different predictions for neighboring skill levels on the same position, which leads to a lack of coherence. In contrast, in Hermes we learn a unified set of parameters to predict human decisions conditioned on skill level. This ensures that the conditional prediction will always be based on the shared knowledge in the one and only parameter space that we learn; therefore, decisions made by 1500-rated players are partially informed by what 1400-rated and 1600-rated (etc.) players do. In other words, Hermes is implicitly regularized by the shared parameters across all skill levels, without over-optimizing towards any particular skill level. Neighboring skill levels will yield similar predictions, e.g., P(y|position, skill level_{i}) \approx P(y|position, skill level_{i+1}), unless Hermes recognizes that some condition for switching to another prediction is satisfied. This unified modeling approach ensures coherence by design.
In addition, our skill-aware attention module enables Hermes to learn non-trivial interactions between positions and skill levels. The skill-aware attention module plays a crucial role in maintaining coherence. Whereas the various Maia models learn different representations of the position for each skill level, Hermes first learns the same unified representation that it uses for all skill levels, and then the skill-aware attention module learns how different skill levels interact with the position to produce a move. By learning a unified representation of the position first, and then adjusting based on skill level, Hermes ensures that all skill levels are informed by a consistent understanding of the position. This decomposition---learning position representation separately from skill-level interaction---naturally encourages coherence across skill levels.
Summary
In summary, our central motivation is to develop a model capable of coherent and accurate human move prediction. Our unified modeling approach is deliberately chosen to solve Maia's parameter independence problem, and our skill-level attention module is specifically designed to maintain a shared position representation across all skill levels while better modeling the interactions between position and skill level. We hope this addresses your concern.
If this explanation clarifies that coherence is our central motivation and why Hermes's design encourages coherence, along with the other points we addressed in our first response, would you be open to considering raising your score?
Main focuses & Broader Impact (L1, L2, W1)
We contribute an approach to human move prediction that is not only the new state of the art for accuracy, but our model achieves coherence in its predictions. To power algorithmic teaching tools, we believe that it is not enough to treat different skill levels independently and make predictions that don’t make coherent sense. Instead, we need coherent move predictions to algorithmically capture the trajectory of human ability as we progress from beginner mistakes to expert decisions. This way, we enable the building of systems that guide people along efficient learning paths. To accomplish this, we design a skill-aware attention mechanism for the unified modeling of human behavior across various skill levels, instead of modeling each skill level independently as previous methods did.
Move prediction accuracy (W1, Q5, L2)
Thank you for pointing out that we didn’t sufficiently explain the significance of our performance gains. In the human move prediction problem for amateur players, the ceiling accuracy is far below 100% given the randomness and diversity of their decisions—even the same player won’t always make the same decision when faced with the same position. Our 2 percentage point gain is substantial considering that the difference between Maia and Leela, the previous state-of-the-art model for this task and a traditional chess engine not trained for this task at all, is only 6 percentage points. We will update the paper to make this clearer.
In response to Q5, Hermes demonstrates a performance improvement over Maia across virtually all combinations of the active player’s and the opponent’s skill levels (see Figure 2).
In response to W1, like other large models such as LLMs, it’s computationally infeasible to run the model repeatedly with different data splits due to the massive volume of data involved (9.1B positions). Nonetheless, besides discrete move matching accuracy, we also adopt a continuous and thus more stable metric, perplexity, in Table 2, which shows more significant improvement.
Skill-aware Attention (W3, L3)
We agree that we didn't sufficiently explain our architecture choice. The rationale behind our design is that each channel (feature map) of the ResNet output represents a different aspect of a chess position, and the attention blocks actively select and interact with these features according to the given skill level. Evidence can be found in Figure 4: for skill-dependent concepts, the representation before the attention blocks understands them uniformly across all skill levels with high accuracy, whereas after the attention blocks the representation becomes skill-dependent, with higher skill levels understanding the concepts better than lower ones. This shows that the attention blocks effectively "pretend" not to know the concepts in order to model the degree of imperfection in human moves, whereas skill-independent concepts are understood similarly by the representations before and after the attention blocks. Table 3 ("w/o Att") shows that simply concatenating the skill embeddings with the flattened ResNet outputs did not work well, so a more sophisticated way of conditioning is needed.
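For intuition on why the "w/o Att" ablation underperforms, the two conditioning schemes can be contrasted in a toy sketch (plain Python; dimensions and names are hypothetical, not the paper's code). Concatenation merely appends the skill embedding as extra input features, leaving downstream layers to discover the interaction, whereas attention builds a multiplicative skill-feature interaction directly into the architecture:

```python
import math
import random

random.seed(1)
DIM = 4  # per-channel feature dimension (hypothetical)
channels = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
skill_emb = [random.gauss(0, 1) for _ in range(DIM)]

# "w/o Att" baseline: the skill embedding is simply concatenated onto the
# flattened position features; any skill-position interaction must be
# learned from scratch by downstream layers.
flattened = [x for ch in channels for x in ch]
concat_conditioned = flattened + skill_emb

# Skill-aware attention: the skill embedding directly reweights the
# channels, so the interaction is built into the architecture.
scores = [sum(q * k for q, k in zip(skill_emb, ch)) / math.sqrt(DIM)
          for ch in channels]
m = max(scores)
exp_s = [math.exp(s - m) for s in scores]
weights = [e / sum(exp_s) for e in exp_s]
att_conditioned = [sum(w * ch[i] for w, ch in zip(weights, channels))
                   for i in range(DIM)]
```

In the concatenation case the skill signal is additive and position-independent; in the attention case the same skill embedding produces different feature mixtures for different positions, which matches the rationale above.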
The primary methodological contribution lies in the unified modeling approach for coherent human behavior modeling that is enabled by this specifically designed model architecture. While we believe that the architecture has conceptual benefits that help make these contributions possible, and we will include the rationale behind the architecture in the revisions, we do not claim that the architecture is optimal in any sense; we view the specifics of the architecture as secondary to the advances, i.e., move prediction coherence and accuracy, provided by the unified modeling approach. We will also mention other architecture choices such as FiLM as promising avenues for future work.
Human studies (L2)
In an ideal human experiment, we would give a position to a human at a particular rating, and compare their chosen move to our model output. Our experiments do exactly this, and thus we view them as massive human studies that measure the move-matching accuracy and coherence with the recorded behaviors of real humans.
In addition to this main experiment, we've also performed additional experiments that address other dimensions of your question. In particular, we've implemented a randomized experiment on Lichess: human players challenge our bots, and we randomize whether players play against Maia or Hermes. Our final result is that our higher move-matching accuracy and our vastly improved coherence, across all skill levels, come at no cost to human subject engagement, and in fact slightly increase it: players rematch Hermes after the first game 1.5 percentage points more often than they rematch Maia (40.6% vs. 39.1%). Although engagement is not our main objective (move-matching and coherence are), this is further promising evidence that we have achieved our goal of a human-aligned AI model that coherently captures human style across different skill levels.
Please see the comments for the rest of our response.
I would like to thank the authors for their explanation. While the unified modeling approach is a sensible (but not very strong) intuition for coherence, I am curious about two points:
- Intuitively, conditional models can still learn independent functions in an over-parameterized setting, so it would help if the authors could provide some details (model size, training-set size, etc.) showing that this does not hold in their setting. This would provide more evidence for the claim that parameter sharing, and the information bottleneck induced by this approach, is the key element that unlocks coherence.
- P(y|position, skill level_{i}) \approx P(y|position, skill level_{i+1}): I believe this is a very strong claim, as it effectively assumes that the learned network is Lipschitz continuous. Is there any theoretical evidence for this? Or is there any previous work that points to this for parameter-sharing + attention-based methods?
Again, thanks for the responses. I am still open to increasing the scores and would take the additional information into account during the AC-reviewer discussion phase as well.
Thank you for your continued engagement in the rebuttal phase!
Regarding point 1, Hermes has 23.3M parameters and was trained on 9.1B training positions, which is not the over-parameterized setting [1], where the number of parameters has to significantly exceed the number of training samples. Evidence that Hermes is indeed a conditional model can be found in Figures 2, 3 (B), and 4 in the paper, Figures 5, 6, and 7 in the Appendix, and the text accompanying these figures. [1] Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. "Learning and generalization in overparameterized neural networks, going beyond two layers." Advances in Neural Information Processing Systems 32 (2019).
Regarding point 2, it’s important to clarify that Hermes is deliberately designed to encourage coherence across skill levels without rigidly enforcing it. Our objective is not to impose coherence as a hard constraint, which might obscure legitimate differences in player behavior between skill levels, but to create a model architecture that naturally encourages coherence where the data supports it. (The informality of our previous explanation was intended to address your question in the response to our rebuttal, asking for intuition on how Hermes ensures better coherence than the baselines, but it is also appropriate given our goal of designing a model that encourages coherence when supported by the data but remains flexible enough to avoid enforcing it where it doesn’t belong.)
Hermes achieves this through a unified, parameter-sharing approach combined with a skill-aware attention mechanism. By sharing parameters across skill levels and allowing the model to adjust based on the skill-level input, we induce a form of regularization that naturally encourages smoother transitions between predictions at adjacent skill levels. Importantly, this regularization is soft: it doesn't impose artificial coherence where it doesn't naturally exist in the data.
The principle of parameter sharing has been well-documented in other contexts, such as multi-task learning and transfer learning, where it has been shown to promote smoother, more coherent outputs across related tasks. You can view Hermes predicting slightly different rating levels as a multi-task learning problem where the tasks are very similar. When tasks (or in our case, skill levels) share underlying structures, parameter sharing allows the model to generalize knowledge effectively, resulting in more coherent outputs across these tasks. This principle is well-supported in the literature and is directly applicable to the challenge of modeling adjacent skill levels in chess.
Moreover, the skill-aware attention mechanism in Hermes allows the model to adapt its focus based on skill level, ensuring that while the underlying position representation is consistent, the nuances of how different skill levels interact with that position are captured appropriately. This mechanism plays a critical role in maintaining coherence without compromising the model’s ability to capture genuine differences across skill levels.
In contrast, Maia learns totally independent parameter sets, and treats the problem of predicting moves made by 1400 and 1500-rated players as completely distinct. Given infinite data Maia may also learn coherent predictions, but even given massive training data it fails to cohere anywhere near as well as Hermes. Our empirical results show that unifying the prediction tasks and using skill-aware attention has major practical benefits in achieving coherence.
In our revised manuscript, we would be happy to clarify these points further and better explain that our design choices are intended to encourage, though not necessitate, smooth and coherent behavior across skill levels.
Thank you for your thoughtful reviews and constructive suggestions for our work. Our work has been recognized as addressing "an important problem of skill-aware modeling of human behavior" (Reviewer 1), introducing an "innovative skill-aware attention mechanism” (Reviewer 2), and being "very solid" with "convincing and thorough" evaluation (Reviewer 3). Reviewer 4 appreciates our "simple yet effective" methodology, yielding "remarkable results in various settings."
Here is a brief overview of the main concerns addressed.
Main focuses & Broader Impact (R1, R2, R4)
We contribute an approach to human move prediction that is not only the new state of the art for accuracy, but our model achieves coherence in its predictions. To power algorithmic teaching tools, we believe that it is not enough to treat different skill levels independently and make predictions that don’t make coherent sense. Instead, we need coherent move predictions to algorithmically capture the trajectory of human ability as we progress from beginner mistakes to expert decisions. This way, we enable the building of systems that can guide people along efficient learning paths. To accomplish this, we design a skill-aware attention mechanism for the unified modeling of human behavior across various skill levels, instead of modeling each skill level independently as previous methods did.
Design choices (R1, R3, R4)
We agree that we didn't sufficiently explain our architecture choice. The rationale behind our design is that each channel (feature map) of the ResNet output represents a different aspect of a chess position, and the attention blocks actively select and interact with these features according to the given skill level. Evidence can be found in Figure 4: for skill-dependent concepts, the representation before the attention blocks understands them uniformly across all skill levels with high accuracy, whereas after the attention blocks the representation becomes skill-dependent, with higher skill levels understanding the concepts better than lower ones. This shows that the attention blocks effectively "pretend" not to know the concepts in order to model the degree of imperfection in human moves, whereas skill-independent concepts are understood similarly by the representations before and after the attention blocks. Table 3 ("w/o Att") shows that simply concatenating the skill embeddings with the flattened ResNet outputs did not work well, so a more sophisticated way of conditioning is needed.
Our primary methodological contribution lies in the unified modeling approach for coherent human behavior modeling that is enabled by this specifically designed model architecture. While we believe that the architecture has conceptual benefits that help make these contributions possible, and we will include the rationale behind the architecture in the revisions, we do not claim that the architecture is optimal in any sense; we view the specifics of the architecture as secondary to the advances, i.e., move prediction coherence and accuracy, provided by the unified modeling approach. We will also mention other architecture choices as promising avenues for future work.
Human studies (R1, R2)
In an ideal human experiment, we would give a position to a human at a particular rating, and compare their chosen move to our model output. Our experiments do exactly this, and thus we view them as massive human studies that measure the move-matching accuracy and coherence with the recorded behaviors of real humans.
In addition to this main experiment, we've also performed additional experiments that address other dimensions of this question. In particular, we've implemented a randomized experiment on Lichess: human players challenge our bots, and we randomize whether players play against Maia or Hermes. Our final result is that our higher move-matching and our vastly improved coherence, across all skill levels, come at no cost to human subject engagement, and in fact slightly increase engagement: players rematch Hermes (the new system in this paper) after the first game 1.5 percentage points more than Maia (40.6% vs. 39.1%). Although engagement is not our main objective (move-matching and coherence are) this is further promising evidence that we have achieved our goal of a human-aligned AI model that coherently captures human style across different skill levels.
More detailed responses are provided to individual reviews with pointers to Limitations (L), Weaknesses (W), and Questions (Q).
We hope these clarifications have addressed your concerns and strengthened our paper.
This paper proposes a chess engine that predicts human moves at a specific Elo rating. It improves over a previous paper in this domain by using conditioning instead of bucketing.
On one hand, the results can be significant for the chess community and make an improvement over a previous baseline. On the other hand, it uses standard conditioning techniques, builds upon previous results, and presents incremental gains. As such, the results are somewhat incremental and will not have an impact on the wider machine learning community.
After discussing and weighing the pros and cons, I have decided to recommend accepting the paper. I understand that chess is a hard domain, and that progress in this domain can then be applied in other domains. I am also confident that this work will have an impact on the small "AI in chess" community and will most likely be picked up in future work.
I have a few suggestions for how to improve the paper.
Firstly and most importantly, the paper is dealing with modeling multiple policies with a single network. There has been a large body of work on the topic in the multi-agent RL community that is not covered well enough. As such, this work is not likely to be noticed by that community. I suggest discussing this work. For example, https://arxiv.org/abs/2308.09175 is even in chess.
Secondly, it would be useful to explain to readers the implications of an accuracy of 50%. Is this a "good number"? What I mean here is that we would usually like to reach 100%, but this is clearly not possible in chess when we model a distribution of players. Thus, we need some way to calibrate these results and understand how much headroom there is still for improvement.
In the discussion paragraph you propose to fine-tune the model on the historical data of a specific player. Is that something you can add to the paper? What would the accuracy be in this case?
It would also be interesting to do some error analysis. For example, you could show positions in which the current model makes the correct prediction and the previous one is wrong. There might be qualitative differences between the models, or it might be limited to blundering less.
Please try to address my suggestions in the final version of the paper. In particular, the literature review.