A Geometric Approach to Personalized Recommendation with Set-Theoretic Constraints Using Box Embeddings
Set-theoretic embeddings offer the appropriate inductive bias needed to effectively answer queries with set constraints.
Abstract
Reviews and Discussion
This work addresses the task of personalized recommendation using set-theoretic queries. The authors frame this problem as "set-theoretic matrix completion," highlighting that traditional approaches, such as logistic matrix factorization, do not align with the set-theoretic operations needed during inference.
Strengths
- The study models attribute-specific query recommendation as "set-theoretic matrix completion," treating attributes and users as item sets.
- The paper effectively demonstrates the limitations of existing vector embedding models for this specific task.
- Experimental results validate the effectiveness of the proposed model.
Weaknesses
- The approach presented in this paper conflicts with the widely used matrix factorization model, which effectively leverages collaborative filtering signals between users and items. It is unclear how the proposed model addresses these signals.
- The experimental baselines are not state-of-the-art; comparing the proposed method to more advanced recommendation models would better demonstrate its advantages.
- Some equations are difficult to follow due to unclear notation explanations.
Questions
It is suggested that an efficiency comparison between the traditional vector embedding methods and the proposed box embedding method be discussed.
Thank you for recognizing the strengths of our work and for your constructive feedback. While we address your specific questions and concerns in the following rebuttal, we encourage you to review our general response, which highlights common themes regarding the scope of our study and ambiguity in notations.
The approach presented in this paper conflicts with the widely used matrix factorization model, which effectively leverages collaborative filtering signals between users and items. It is unclear how the proposed model addresses these signals.
All the models, including our box-based method, train on the collaborative signal (the user-vs-item matrix $M$) and the attribute signal (the attribute-vs-item matrix $A$).
As you mentioned, the matrix factorization model effectively leverages collaborative filtering signals between users and items. It achieves this by training to increase the dot product between user and item representations when the user and item interact, and to decrease the dot-product similarity when they do not (on a negatively sampled version).
In the box embedding-based method, we train on $M$ in a similar manner. Instead of dot-product similarity, we use box containment as the similarity measure; i.e., if a user $u$ has watched an item $i$, then the collaborative signal encourages the item box to be inside the user box. We enforce this set containment by optimizing the following:
If the user $u$ has watched an item $i$, i.e., $M_{ui} = 1$, then $\frac{\mathrm{Vol}(\mathrm{Box}(u) \cap \mathrm{Box}(i))}{\mathrm{Vol}(\mathrm{Box}(i))} \rightarrow 1$.
If a user $u$ has not watched an item $i$, i.e., $M_{ui} = 0$ (we sample the negatives), then $\frac{\mathrm{Vol}(\mathrm{Box}(u) \cap \mathrm{Box}(i))}{\mathrm{Vol}(\mathrm{Box}(i))} \rightarrow 0$.
Here, $\rightarrow$ means "training to achieve the value". During training, the above expression takes values in $[0, 1]$; we train it to be as close to $1$ or $0$ as possible, according to the collaborative signal in $M$. We use binary cross-entropy to achieve this.
Note that by doing this optimization, we not only incorporate collaborative signals but also conceptualize each user as a set that contains all of their relevant items.
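To make this objective concrete, here is a minimal PyTorch-style sketch of the containment score (variable names are illustrative, and hard min/max is shown for readability; the actual model uses GumbelBox smoothing, as discussed in the general response):

```python
import torch

def containment_score(user_min, user_max, item_min, item_max, eps=1e-9):
    # Per-dimension intersection of the user and item boxes
    inter_min = torch.maximum(user_min, item_min)
    inter_max = torch.minimum(user_max, item_max)
    # Volumes factorize over dimensions, so this parallelizes trivially
    inter_vol = torch.clamp(inter_max - inter_min, min=0).prod(dim=-1)
    item_vol = torch.clamp(item_max - item_min, min=0).prod(dim=-1)
    # Vol(Box(u) ∩ Box(i)) / Vol(Box(i)) lies in [0, 1]
    return inter_vol / (item_vol + eps)

# Training: binary cross-entropy between this score and M_ui ∈ {0, 1},
# e.g., loss = F.binary_cross_entropy(score, labels)
```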
The experimental baselines are not state-of-the-art; comparing the proposed method to more advanced recommendation models would better demonstrate its advantages.
In response to reviewer Vvma's suggestion, we have implemented LightGCN, a graph convolution-based solution for recommendation. Hyperparameter tuning is currently underway, and due to GPU constraints, it may require an additional two days to complete. We will update the draft with the new results and respond to this thread once the tuning is finalized.
Some equations are difficult to follow due to unclear notation explanations.
We have crafted a general response above titled General Response - Notations, where we have simplified the notations and provided a more detailed explanation for clarity. This response addresses similar questions raised by other reviewers regarding the notations. We kindly request that you review this section, and please let us know if any of the notations remain unclear. We are eager to receive your feedback and will revise our current draft accordingly.
It is suggested that an efficiency comparison between the traditional vector embedding methods and the proposed box embedding method be discussed.
Box embeddings are generally quite fast because the computation of box volumes and their intersection volumes can be parallelized over dimensions.
We report the training time (mm:ss) for a single epoch, where we select different batch sizes with 5 negative samples on the Movielens-1M dataset. Experiments are conducted on an Nvidia GTX 1080Ti GPU.
| Batch Size | MF | NeuMF | LightGCN | Box |
|---|---|---|---|---|
| 64 | 08:37 | 17:00 | 70:30 | 19:32 |
| 128 | 04:32 | 09:46 | 38:40 | 11:40 |
| 256 | 02:29 | 04:40 | 20:55 | 05:28 |
| 512 | 01:18 | 02:23 | 10:47 | 02:54 |
| 1024 | 00:40 | 01:20 | 05:24 | 01:12 |
We observe that MF, being the simplest approach with minimal computational requirements, is consistently the fastest across all batch sizes. At the largest batch size (1024), it achieves the shortest training time of just 00:40. The box-based method exhibits training times comparable to NeuMF. However, it is significantly faster than LightGCN, which relies on graph convolutional computations. The iterative message-passing operations required by LightGCN result in considerably higher training times, particularly at smaller batch sizes (e.g., 70:30 at a batch size of 64). As the batch size increases, the training time for box embeddings becomes almost as efficient as MF. For instance, at a batch size of 1024, Box achieves a training time of 01:12, compared to 00:40 for MF. This demonstrates that the computational complexity of box embeddings is of the same order as that of MF.
Note that the training times above use GumbelBox embeddings, which involve log-sum-exp calculations. However, this could be improved even further at inference time by replacing these soft min and max approximations with hard operators. If such an optimized approach is desired, training can accommodate it by regularizing the temperature. For deployment in industrial settings, we could take additional steps with box embeddings as outlined in Mei et al. (2022b).
- Mei et al., (2022b) Learning Probabilistic Box Embeddings for Effective and Efficient Ranking.
The authors apply box embedding to attribute-specific query recommendation. They formulated the task of attribute-specific query recommendation and proposed a recommendation method based on box embedding for the task. The authors also tried establishing an evaluation protocol for this new task and provided detailed analyses based on generalization spectrum gap and compound error. On the other hand, the current manuscript severely lacks a discussion of existing recommendation fields (e.g., context-aware recommendation), and the technical novelty is unclear.
Strengths
- The authors tried to use the latest technique in the NLP field (i.e., box embedding) for a recommendation-related task.
- The authors have carefully designed an evaluation protocol for attribute-specific query recommendation based on traditional collaborative filtering.
Weaknesses
- The scope of this study is on a rather limited application, which the authors called "attribute-specific query recommendation".
- The current manuscript lacks related work on "attribute-specific query recommendation". In addition, the authors should discuss the relationship between this study and context-aware recommender systems [a].
- Some experimental settings are not convincing. See the following questions/comments for details.
References
[a] Adomavicius, Gediminas, and Alexander Tuzhilin. "Context-aware recommender systems." Handbook of Recommender Systems. Boston, MA: Springer US, 2010. 217-253.
Questions
- (Q1) The authors used nDCG@K for model selection, but evaluated the final performance based on HR@K. Why did the authors use inconsistent metrics for validation and testing? In my opinion, HR@K with 100 negative items is a rather insensitive measure for ranking evaluation. I would like to recommend using Recall@K without negative sampling.
- (Q2) In the current manuscript, the authors do not mention/discuss the result on MovieLens-20M (Table 7 in Appendix B.3). Also, Table 7 is not self-contained. What is the definition of VEC-*? (probably MF or NeuMF?)
Comments.
- (C1) To my understanding, "attribute-specific query recommendation" is the task where (1) item attribute values are partially observed and often missing, and (2) in the prediction phase, the positive items are conditioned not only on a user but also on a boolean query. The problem of (1) has been addressed in existing studies (e.g., [b,c]). I am not familiar with (2) in the context of item recommendation, but there might be existing research on it. A discussion/comparison of existing studies on these points would make it easier to understand the novelty of this work.
- (C2) In line 368, the authors report that they followed the standard sample scoring procedure described in Rendle et al. (2020). However, to my understanding, using this sampling technique is not recommended for a dataset with a small item catalog such as Last-FM, MovieLens-1M, NYC-R. It may just undermine the reliability of the reported results to reduce a small experimental cost.
References
[b] Wu, Le, et al. "Joint item recommendation and attribute inference: An adaptive graph convolutional network approach." Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 2020.
[c] Xian, Yikun, et al. "Ex3: Explainable attribute-aware item-set recommendations." Proceedings of the 15th ACM Conference on Recommender Systems. 2021.
Thank you for your constructive feedback! While we address your specific questions and concerns in the following rebuttal, we encourage you to review our general response, which highlights common themes regarding the scope of our study and ambiguity in notations.
The scope of this study is on a rather limited application, which the authors called "attribute-specific query recommendation".
While it is true that our study focuses on "attribute-specific query recommendation," this is a deliberate choice, as it addresses an unexplored and important problem within the broader landscape of recommendation systems. The ability to handle queries involving attributes and their logical combinations is critical for real-world applications, ranging from personalized e-commerce searches like “red winter coats under $100” to specific media preferences such as “dark comedies.” These scenarios highlight the need for systems that can handle both attribute-specific constraints and individual user preferences—an area where existing systems are limited. We see our focused contribution as the first step toward addressing an underexplored area with substantial real-world implications.
We also request the reviewer to refer to the “Scope of our Work” section in the “General Response.”
The current manuscript lacks related work on "attribute-specific query recommendation".
(C1) To my understanding, "attribute-specific query recommendation" is the task where (1) item attribute values are partially observed and often missing, and (2) in the prediction phase, the positive items are conditioned not only on a user but also on a boolean query. The problem of (1) has been addressed in existing studies (e.g., [b,c]). I am not familiar with (2) in the context of item recommendation, but there might be existing research on it. A discussion/comparison of existing studies on these points would make it easier to understand the novelty of this work.
We appreciate your feedback highlighting the lack of discussion around "context-aware" and "attribute-aware" recommendations. Including this discussion would indeed help situate our work more clearly within the broader context of related research in this domain.
Context-Aware Recommendation: The concept of context-aware recommendation, as introduced in Adomavicius et al. (2011), provides a general framework where “context” is broadly defined as any auxiliary information. This framework emphasizes that user preferences for items can vary based on the context in which interactions occur, reflecting a user-centric view of contextual information.
Building on this foundation, recent works have explored specific instances of context-aware recommendation, such as “attribute-aware recommendation.” These approaches often leverage item or user attributes as contextual information to address various goals, including improving user profiling (Adomavicius et al., 2011), predicting missing item attributes (Wu et al., 2020; Chen et al., 2022), enhancing recommendations for cold-start scenarios(Deldjoo et al., 2019), or providing attribute-based explanations for recommendations (Xian et al., 2021).
Our work differs significantly in its focus and objectives. We address what we term “attribute-specific query recommendation,” which involves generating recommendations explicitly constrained by logical combinations of attributes. Unlike attribute-aware approaches, which aim to improve recommendation quality by incorporating attribute information as auxiliary data, our work directly targets the task of satisfying explicit attribute-based constraints posed by users. We believe that explicit reasoning over attribute-constrained queries is underexplored and represents a meaningful contribution to the field.
We have made this distinction clear in the revised draft of the paper.
Additional Baselines: We observe that many recent "attribute-aware" recommendation approaches leverage Graph Convolutional Networks to better model attribute-item-user interactions. Also, in response to reviewer Vvma's suggestion, we have implemented LightGCN, a graph convolution-based solution for recommendation. Hyperparameter tuning is currently underway, and due to GPU constraints, it may require an additional two days to complete. We will update the draft with the new results and respond to this thread once the tuning is finalized.
- Adomavicius et al., (2011) Context-Aware Recommender Systems
- Wu et al., (2020) Joint Item Recommendation and Attribute Inference: An Adaptive Graph Convolutional Network Approach
- Chen et al., (2022) Multi-view Graph Attention Network for Travel Recommendation
- Deldjoo et al., (2019) Movie genome: alleviating new item cold start in movie recommendation
- Xian et al., (2021) EX3: Explainable Attribute-aware Item-set Recommendations
(Q1) The authors used nDCG@K for model selection, but evaluated the final performance based on HR@K. Why did the authors use inconsistent metrics for validation and testing? In my opinion, HR@K with 100 negative items is a rather insensitive measure for ranking evaluation. I would like to recommend using Recall@K without negative sampling. (C2) In line 368, the authors report that they followed the standard sample scoring procedure described in Rendle et al. (2020). However, to my understanding, using this sampling technique is not recommended for a dataset with a small item catalog such as Last-FM, MovieLens-1M, NYC-R. It may just undermine the reliability of the reported results to reduce a small experimental cost.
The main results presented in the paper (Table 4) actually align with your recommendation, so we will clarify a few points.
- Model checkpoint selection: We only use 100 negative samples for validation (tuning hyperparameters and model selection) because this needs to be fast enough to run during training. This is also why we used nDCG@k, as it is more sensitive for ranking, particularly in the restricted setting of 100 negative samples.
- Final Evaluation with Full Item Set: Our reported HR@k in the set-theoretic evaluation does not use 100 random negative samples, but rather the whole set of items. We report HR@k because this is more standard in the recommendation literature, and in this case it is synonymous with Recall@k, which you suggested (since this is a leave-one-out evaluation; see the sketch below).
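A minimal illustration of why HR@k and Recall@k coincide under leave-one-out evaluation (a hypothetical helper, not our evaluation code):

```python
def hit_rate_at_k(ranked_items, held_out_item, k=10):
    # With a single held-out positive, a "hit" in the top-k is exactly
    # Recall@k: recovered relevant items (0 or 1) over total relevant (1).
    return int(held_out_item in ranked_items[:k])
```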
We have highlighted this part in blue in Section 4.3. Please let us know if this alleviates your doubts about the experimental settings, or if you have any further concerns; we would be eager to address them.
(Q2) In the current manuscript, the authors do not mention/discuss the result on MovieLens-20M (Table 7 in Appendix B.3). Also, Table 7 is not self-contained. What is the definition of VEC-*? (probably MF or NeuMF?)
Thank you for pointing this out. This was an honest oversight on our part. This occurred while moving the ML20M results to the appendix to accommodate page restrictions. VEC-* is MF and we have updated Table 7 with the results from NeuMF. The general conclusions drawn from these results remain unchanged. Once we finish experiments with additional baselines, we will update the final results table and conclusions and will revert to this thread with the updated information.
Thanks to the authors for the detailed responses.
I appreciate the authors' responses, which indeed clarify some of my concerns. However, the current manuscript appears to be incomplete, as a significant amount of information was added during this discussion phase. Based on the discussions here, I recommend improving the paper further and resubmitting it at a later opportunity.
In my opinion, the current paper attempts to achieve two goals simultaneously: (1) proposing a new task and (2) introducing a non-trivial method that differs significantly from existing research. As a result, the discussions on each aspect seem to lack sufficient depth. If the goal is to propose a new task, it is crucial to carefully discuss its relationship with existing studies. On the other hand, if the focus is on proposing a new method, experimental comparisons with applicable existing methods should be conducted, or if such methods are not applicable to the proposed task, the reasons should be thoroughly discussed.
As a comment for future improvements of this paper, LightGCN, mentioned by the authors, is not an appropriate baseline. This is because the graph used by LightGCN is based only on user-item interactions and does not incorporate attributes, which are the most important aspect of this study. So, I will maintain my original score.
Thank you for engaging in the rebuttal phase. We deeply value your feedback and aim to address your concerns thoroughly. Please see the following points for our responses.
Applicability of LightGCN
Thank you for pointing out this concern. We would like to clarify that all models in our study, including LightGCN, are adapted to jointly model user-item and attribute-item interactions. Specifically, for LightGCN, we construct a joint graph where item nodes are connected both to users and item-attributes. This adaptation ensures that LightGCN incorporates attribute information, addressing the primary focus of our study.
In fact, any recommendation system model initially developed for user-item interactions can be extended for our setup by including attribute-item interactions in the training process. This involves sharing the item embeddings (or graph representations, in the case of LightGCN) across both user-item and attribute-item interactions.
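For illustration, a hedged sketch of this adaptation (the `model` interface here is hypothetical): the same item embeddings are scored against both user and attribute embeddings, and the two binary cross-entropy losses are combined.

```python
import torch.nn.functional as F

def joint_loss(model, user_item_batch, attr_item_batch, lam=0.5):
    # user_item_batch: (user ids, item ids, labels in {0,1}); likewise for attributes
    u, i_u, y_u = user_item_batch
    a, i_a, y_a = attr_item_batch
    # Shared item embeddings tie the collaborative and attribute signals together
    loss_ui = F.binary_cross_entropy(model.score(model.user_emb(u), model.item_emb(i_u)), y_u)
    loss_ai = F.binary_cross_entropy(model.score(model.attr_emb(a), model.item_emb(i_a)), y_a)
    return (1 - lam) * loss_ui + lam * loss_ai
```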
Applicability of Baselines in Experimental Comparisons
We carefully designed our study to be comprehensive and relevant to the primary focus of our work, ensuring that it addresses both standard practices and the unique challenges of our proposed task. These points are discussed in detail in the Baselines subsection (Sec. 4.4) and the background (Sec. 2.1), with a summary provided below:
Relevance of Selected Baselines: The baselines chosen—MF, NeuMF, and LightGCN—are standard, well-established methods in recommendation systems. They represent varying levels of complexity and address user-item and attribute-item interactions effectively.
- Matrix Factorization (MF) serves as the simplest and most direct analogue to our approach, making it foundational for comparisons.
- Neural Matrix Factorization (NeuMF) extends this by incorporating non-linear interactions, offering a more expressive benchmark.
- LightGCN, despite being primarily designed for user-item graphs, was adapted in our experiments to include attribute-item relationships through a joint graph. This ensures its applicability to our problem setting.
Focus of the Study: Our primary objective is to address set-theoretic compositionality within recommendation systems, a unique challenge not explicitly tackled by most existing advanced neural methods. While sophisticated approaches exist for matrix completion tasks, their design does not align with the explicit goal of handling compositional semantics. This positions our method as complementary rather than directly comparable to such baselines.
Thoroughness of the Evaluation: We believe the three selected baselines cover a diverse range of approaches, from simple to advanced techniques, and provide a comprehensive benchmark for evaluating our method across multiple domains. The inclusion of LightGCN, specifically, strengthens our experimental rigor and demonstrates the adaptability of standard models to incorporate attribute-item interactions.
By focusing on these baselines, our study emphasizes both robustness and relevance without diluting the central contribution of introducing a method tailored to set-theoretic compositionality.
Discussing the Relationship with Existing Studies
Thank you for raising this important point. We agree that it is crucial to situate a proposed task within the broader context of existing research.
We believe this concern aligns with the earlier suggestion about expanding discussions on "context-aware recommendation" and "attribute-aware recommendation." In response to that feedback, we included a detailed discussion in Section A.2 of the related work. Specifically, we clarified how our work differs from these approaches. For example:
Context-Aware Recommendation: These methods aim to personalize recommendations by incorporating contextual information (e.g., time, location). While they enhance prediction quality, they do not address the logical compositionality required to satisfy explicit user constraints.
Attribute-Aware Recommendation: These approaches use attribute data to improve recommendation quality or enhance explainability. However, attributes are typically treated as auxiliary features rather than central elements in the recommendation process. Consequently, the number of attributes considered is often small—for instance, only eight movie-related attributes are used in the Movielens 20M dataset. This limits the scope and depth of attribute utilization.
In contrast, our work fundamentally differs by focusing on attribute-constrained recommendation, where the system is explicitly designed to satisfy logical combinations of attributes as constraints specified by users. Moreover, we address a broader and richer set of attributes, which are carefully curated from diverse sources. This enables a more thorough and representative evaluation, ensuring that our approach aligns with the complexities of real-world scenarios.
In summary, our study introduces and formalizes the task of attribute-constrained recommendation, which stands apart from existing frameworks by addressing explicit user constraints based on attributes. This distinction is emphasized throughout the related work section to ensure clarity.
We hope this explanation highlights the careful consideration of related studies in positioning our contribution. Let us know if there are additional aspects you'd like us to elaborate on.
This paper proposes using box embeddings for matrix completion to improve personalized item recommendation with set-theoretic queries. Box embeddings are employed to bypass the limitation of commonly used vector embeddings, which might fail to recommend items for set-theoretic queries consisting of negation and intersection relationships. By representing users, items, and attributes as box embeddings, i.e., hyperrectangles in d-dimensional space, the proposed approach can jointly factorize the user-item interaction matrix and the item-attribute matrix. Then, users and attributes are regarded as boxes containing multiple items. As such, given a query containing set relationships between attributes, the model retrieves the top items having the largest box volume shared with those of users as the recommendation list. The whole model is trained to capture containment relationships, i.e., user and attribute boxes contain multiple item boxes. Experimental results on four datasets demonstrate the strong performance of the proposed method.
Strengths
- The paper studies an interesting yet not well explored problem: personalized item recommendation with set-theoretic queries. The set nature of queries, which consist of set relationships such as negation and intersection, leads to an interesting research question: how to capture such relationships to make more accurate recommendations.
- The employment of box embeddings to represent users, items, and attributes is intuitive and sensible for capturing set relationships between these three.
- Experimental results on four real-world datasets demonstrate the good performance of the proposed box embedding method, outperforming vector embedding approaches.
- The paper is well-structured and easy to follow.
Weaknesses
- Despite showing promising recommendation results, the proposed method seems to be a direct application of box embeddings to personalized item recommendation with set-theoretic queries, which is a somewhat limited contribution.
- The paper relies heavily on the theory of box embeddings, but some key concepts were not described sufficiently. For example, in line 191, how is $x^{\llcorner} < x^{\urcorner}$ guaranteed for all dimensions $d$? What is the definition of $\mathrm{VolInt}_{GB}$ in Equation 3? What are the parameters of the model to optimize?
- The baselines are somewhat limited. Although the authors already mention this in Section 2.1, representative baselines such as LightGCN [1] and MultiVAE [2] were not considered. Including more recent and advanced baselines would further ascertain the strength of the proposed method.
[1] He et al. LightGCN: Simplifying and powering graph convolution network for recommendation. SIGIR 2020.
[2] Liang et al. Variational autoencoders for collaborative filtering. WWW 2018.
- Lack of efficiency analysis. What is the advantage of using box embeddings over vector embeddings w.r.t. running time?
- Missing descriptions of some important experimental settings, e.g., how many negative samples are required to train the equations in lines 233 and 240. Moreover, an ablative analysis of key hyper-parameters is not presented. For instance, how does the number of negative samples affect model accuracy? The same question applies to the attribute loss constant in line 240.
Questions
The paper introduces the notion of a query in the context of recommendation systems. What is the difference between a query in the context of this paper and a query in the context of a search engine like Google?
Thank you for recognizing the strengths of our work and for your constructive feedback. While we address your specific questions and concerns in the following rebuttal, we encourage you to review our general response, which highlights common themes regarding the scope of our study and ambiguity in notations.
Despite showing promising recommendation results, the proposed method seems to be a direct application of box embeddings for personalized item recommendation with set-theoretic queries, which is somehow a limited contribution.
We acknowledge that this work indeed is an application of box embeddings, but it does so in a novel way—box embeddings have never before been tested empirically for their ability to generalize to arbitrary set-theoretic queries. As such, it was necessary to both (a) create a rigorously-justified set-theoretic task in multiple different domains, and (b) propose methods for leveraging the geometry to calculate scores from the boxes for these set-theoretic queries. Furthermore, understanding why vector-based embeddings fall short in faithfully representing set-theoretic relationships is an essential part of our contribution. Vector methods often struggle with maintaining set-theoretic consistency, a limitation that our approach with box embeddings directly addresses. By spotlighting these challenges and presenting an effective alternative, our work provides actionable insights that future researchers can build upon. We will reframe our contributions in the final version to emphasize these unique advancements and insights.
The paper relies heavily on the theory of box embeddings but some key concepts were not described sufficiently. For example, in line 191, how to guarantee $x^{\llcorner} < x^{\urcorner}$ for all dimensions $d$; what is the definition of $\mathrm{VolInt}_{GB}$ in Equation 3?; what are the parameters of the model to optimize?
The following clarifications will also be incorporated into the revised version of the manuscript.
Definition of $\mathrm{VolInt}_{GB}$?
$\mathrm{VolInt}$ is the volume of the intersection between two boxes. When the box embedding parameters are defined using Gumbel random variables (abbreviated GB), we replace $\mathrm{VolInt}$ with $\mathrm{VolInt}_{GB}$.
To elaborate, Dasgupta et al. (2020) propose GumbelBox, where the min and max corners are replaced with Gumbel random variables, solving the gradient issues that hard min and max operations pose. This modification boils down to using logsumexp, which is a smooth approximation of the max operation. logsumexp is defined as $\mathrm{logsumexp}_{t}(x_{1}, \dots, x_{n}) = t \log\left(\sum_{j} e^{x_{j}/t}\right)$, where $t$ is known as the temperature of the GumbelBox.
Under the GumbelBox, we change the notation from $\mathrm{VolInt}$ to $\mathrm{VolInt}_{GB}$.
Do we need to enforce $x^{\llcorner} < x^{\urcorner}$?
We do not enforce $x^{\llcorner} < x^{\urcorner}$ when the box embedding is defined as a Gumbel box.
In a Gumbel box, the corners $x^{\llcorner}$ and $x^{\urcorner}$ are replaced with max-Gumbel and min-Gumbel random variables with means $x^{\llcorner}$ and $x^{\urcorner}$, respectively. Under these Gumbel distributions, any point on the real line has a positive probability of being inside the box.
Thus, when the end points flip, i.e., $x^{\urcorner} < x^{\llcorner}$, the value of $\mathrm{VolInt}_{GB}$ becomes negligible, although it still remains positive. This is also reflected in the logsumexp formulation, which yields a positive volume even when $x^{\urcorner} - x^{\llcorner} < 0$. Thus we need not ensure $x^{\llcorner} < x^{\urcorner}$ when we use $\mathrm{VolInt}_{GB}$ instead of $\mathrm{VolInt}$. A minimal sketch follows.
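For clarity, a minimal sketch of this smoothing (after Dasgupta et al., 2020); the function names and temperature handling are illustrative rather than our exact implementation:

```python
import torch
import torch.nn.functional as F

def smooth_intersection(x_lo, x_hi, y_lo, y_hi, beta=1.0):
    # logsumexp acts as a smooth max over the min corners and,
    # via negation, a smooth min over the max corners
    lo = beta * torch.logsumexp(torch.stack([x_lo, y_lo]) / beta, dim=0)
    hi = -beta * torch.logsumexp(torch.stack([-x_hi, -y_hi]) / beta, dim=0)
    return lo, hi

def smooth_volume(lo, hi, tau=1.0):
    # softplus keeps every side length positive even when corners flip (hi < lo),
    # which is why x_lo < x_hi need not be enforced explicitly
    return (tau * F.softplus((hi - lo) / tau)).prod(dim=-1)
```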
Trainable parameters of box embeddings?
The upper and lower corners are both trainable parameters of the box embeddings. The training signal is provided by the volume of intersection ($\mathrm{VolInt}$) between two boxes. To ensure a fair comparison, we compare the performance of $d$-dimensional box embeddings with $2d$-dimensional vector-based embeddings (matching the parameter count). Note that the temperatures corresponding to the GumbelBox are treated as hyperparameters.
Representative baselines such as LightGCN [1] and MultiVAE [2] were not considered. Including more recent and advanced baselines will further ascertain the strength of the proposed method.
We have already implemented LightGCN and are currently running the hyperparameter tuning for all the datasets to obtain faithful final results. Due to GPU constraints, this process will take two more days. We will update the results in our revised manuscript and respond back in this thread.
Lack of efficiency analysis. What is the advantage of using box embeddings over vector embedding w.r.t. running time?
Box embeddings are generally quite fast because the computation of box volumes and their intersection volumes can be parallelized over dimensions.
We report the training time (mm:ss) for a single epoch, where we select different batch sizes with 5 negative samples on the Movielens-1M dataset. Experiments are conducted on an Nvidia GTX 1080Ti GPU.
| Batch Size | MF | NeuMF | LightGCN | Box |
|---|---|---|---|---|
| 64 | 08:37 | 17:00 | 70:30 | 19:32 |
| 128 | 04:32 | 09:46 | 38:40 | 11:40 |
| 256 | 02:29 | 04:40 | 20:55 | 05:28 |
| 512 | 01:18 | 02:23 | 10:47 | 02:54 |
| 1024 | 00:40 | 01:20 | 05:24 | 01:12 |
We observe that MF, being the simplest approach with minimal computational requirements, is consistently the fastest across all batch sizes. At the largest batch size (1024), it achieves the shortest training time of just 00:40. The box-based method exhibits training times comparable to NeuMF. However, it is significantly faster than LightGCN, which relies on graph convolutional computations. The iterative message-passing operations required by LightGCN result in considerably higher training times, particularly at smaller batch sizes (e.g., 70:30 at a batch size of 64). As the batch size increases, the training time for box embeddings becomes almost as efficient as MF. For instance, at a batch size of 1024, Box achieves a training time of 01:12, compared to 00:40 for MF. This demonstrates that the computational complexity of box embeddings is of the same order as that of MF.
Note that the training times above use GumbelBox embeddings, which involve log-sum-exp calculations. However, this could be improved even further at inference time by replacing these soft min and max approximations with hard operators. If such an optimized approach is desired, training can accommodate it by regularizing the temperature. For deployment in industrial settings, we could take additional steps with box embeddings as outlined in Mei et al. (2022b).
- Mei et al., (2022b) Learning Probabilistic Box Embeddings for Effective and Efficient Ranking.
The paper introduces the notion of a query in the context of recommendation systems. What is the difference between a query in the context of this paper and a query in the context of a search engine like Google?
The primary difference between queries in our work and those in search engines like Google lies in personalization and the use of explicit set-theoretic constraints. While search engines typically handle free-form natural language queries, our work defines queries as personalized, constraint-based inputs that are explicitly set-theoretic, allowing "query" and "constraint" to be treated synonymously. For instance, on Netflix, a user might select a dark-comedy filter. While Netflix could broadly suggest dark-comedy films, the recommendations need to align with the user's individual interpretation of the genre; for example, one user might consider Fargo a dark-comedy movie, while another may not.
We also request the reviewer to refer to the “Scope of our Work” section in the “General Response.”
Related work we included in our draft:
While set-theoretic queries are commonplace in search, popular question-answering (QA) benchmarks often do not include them. We found QUEST (Malaviya et al., 2023) to be the most closely related study, introducing a benchmark for entity-seeking queries with implicit set-based semantics. However, QUEST does not focus on explicit constraints or personalization, which are central to our work. On the other hand, we find related studies in the group recommendation systems (Amer-Yahia et al., 2009) where the preferences of multiple users are explicitly aggregated into a coherent recommendation.
We have ensured that the distinctions between our approach and search are clarified in the related work section of the revised manuscript. The new changes are marked in blue.
- Malaviya et al. (2023) QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations.
- Amer-Yahia et al. (2009) Group Recommendation: Semantics and Efficiency.
Thanks to the authors for the detailed responses providing additional context to the paper.
I have decided to keep my original score. There remain some questions about the proposed method that need further refinement.
First, while the box embedding approach has the potential to solve set-theoretic query-related problems, the proposed method (mostly) directly applies the prior work of Dasgupta et al., 2020. As such, the recommendation improvements come from the existing box embedding method. This is also mentioned by the authors in the rebuttal: "Vector methods often struggle with maintaining set-theoretic consistency, a limitation that our approach with box embeddings directly addresses." This means box embedding is the key solution, yet it has already been explored for the recommendation task.
Second, while the efficiency analysis has been explored, it is based on the ML-1M dataset, which is a very small dataset. Testing on larger datasets would provide further insights into the efficiency. A representative baseline such as Multi-VAE is also missing.
Third, as mentioned in the related work, a recent study, "Learning User Representations with Hypercuboids for Recommender Systems," also applied box embeddings to the recommendation task. However, there has been no clear discussion of or comparison to this work, making it difficult to understand the connection between the proposed method and the prior study.
In our work, we recognize a fundamental requirement to natively represent set-theoretic relationships in the embedding space for tasks involving complex concept intersections, unions, and differences. For instance, a user's preference for movies that are both romantic and comedic can be naturally represented using set-theoretic operators. More specifically, we want to learn a map $f$ from concepts to regions such that $f(A \cap B) = f(A) \wedge f(B)$, $f(A \cup B) = f(A) \vee f(B)$, and $f(A^{c}) = \neg f(A)$. This property is commonly known as a homomorphism of Boolean algebras.
Current vector-based methods struggle with preserving such relationships, as they do not inherently encode the geometric structure necessary for Boolean algebra. This limitation motivates our choice of box embeddings, which are uniquely suited to natively handle set operations.
We devise a training objective with box embeddings that ensures set containment, i.e., item embeddings are contained in the boxes of the corresponding attribute concepts and users. This ensures that set-theoretic queries (e.g., "find movies that are both romantic and comedic but not action") can be directly executed in the embedding space, offering a principled approach to modeling complex user preferences. Empirical results across multiple datasets demonstrate the efficacy of our approach, validating the utility of box embeddings for this task.
Related work on Box + Recommendation system:
The study "Learning User Representations with Hypercuboids for Recommender Systems" primarily focuses on enhancing recommendation efficacy by modeling the diversity of users' interests. It advocates for region-based embeddings, suggesting that such representations better capture diverse preferences. The choice of box embeddings in their work is incidental and not fundamental; the authors could have opted for any parameterized region, such as spheres or Gaussian density curves. Notably, their method employs multiple concentric hypercuboids to approximate arbitrary shapes, prioritizing flexibility over logical consistency.
In contrast, our work is grounded in the explicit need to represent set-theoretic relationships in the embedding space. We propose a framework that adheres to the homomorphism of Boolean algebras, ensuring that set-theoretic operations (e.g., intersections, unions) are natively supported. Arbitrary region-based embeddings, including the multiple-hypercuboid formulation described in the related paper, cannot guarantee these properties. Empirical results in our work demonstrate that preserving Boolean algebraic consistency leads to superior performance on set-theoretic query tasks, underscoring the significance of this distinction.
RE: Representative baseline such as Multi-VAE is also missing.
LightGCN (LGCN) was prioritized over Multi-VAE as it consistently outperforms Multi-VAE in recommendation benchmarks (e.g., He et al., Table 4). This made it a more relevant choice for evaluating our proposed approach.
Moreover, as emphasized in the background section, while many advanced neural architectures exist, they do not address set-theoretic dependencies. The baselines we selected—MF, NeuMF, and LightGCN—are standard, widely recognized models that represent varying levels of complexity and effectively capture user-item and attribute-item interactions.
Efficiency Analysis on ML20M:
Training time (mm:ss) for a single epoch, where we select different batch sizes with 5 negative samples on the Movielens-20M dataset. Experiments are conducted on an Nvidia GTX 1080Ti GPU.
| Batch Size | MF | NeuMF | Box |
|---|---|---|---|
| 2048 | 06:37 | 12:44 | 17:47 |
| 4096 | 03:45 | 07:32 | 08:47 |
| 8192 | 02:04 | 03:38 | 05:03 |
Missing descriptions of some important experimental settings, e.g., how many negative samples are required to train the equations in lines 233 and 240. Moreover, an ablative analysis of key hyper-parameters is not presented. For instance, how does the number of negative samples affect model accuracy? The same question applies to the attribute loss constant in line 240.
We perform extensive hyperparameter tuning for the learning rate, batch size, volume and intersection temperatures of GumbelBoxes, the number of negative samples for noise-contrastive training (equation in line 233), and the attribute loss constant (equation in line 240). Specifically, we vary the number of negative samples in {1, 5, 10, 20} and the attribute loss constant in the range {0.1, 0.3, 0.5, 0.7, 0.9}. For each method, 100 hyperparameter runs are conducted, with each run drawing a random hyperparameter combination from the grid of all possible combinations.
Due to the random-sampling nature of the hyperparameter tuning, different runs with varying numbers of negative samples are likely to differ in other hyperparameter settings as well. This makes it challenging to isolate the impact of specific hyperparameters, such as the number of negatives or the attribute loss constant, under the current scheme.
We have already provided this discussion of hyperparameters in Section 4.3 and Appendix B.2. We will update the writing around the loss equations to incorporate a discussion of the negative-sampling and attribute-constant hyperparameters; a sketch of the search procedure follows.
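For reference, the random-search scheme amounts to the following sketch (grid values taken from the text above; the training call is hypothetical):

```python
import random

grid = {
    "num_negatives": [1, 5, 10, 20],
    "attr_loss_const": [0.1, 0.3, 0.5, 0.7, 0.9],
    # plus learning rate, batch size, and GumbelBox temperatures (Appendix B.2)
}
for _ in range(100):
    cfg = {name: random.choice(values) for name, values in grid.items()}
    # train_and_validate(cfg)  # select the checkpoint by validation nDCG@k
```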
Edit (Sun Nov 24): Updated Appendix B.2 with a parallel coordinates plot of the hyperparameter space vs. the Box model's performance on ML20M.
This general response is intended to address the questions regarding the scope of our work raised by the reviewers, and we will respond to specific questions or concerns in their respective rebuttal sections separately. We kindly request the reviewers to read the general response before reviewing our answers to the specific questions.
Scope of the Work
The task of constrained recommendation is both significant and broadly applicable in real-life scenarios, as it enables recommendation systems to meet specific and nuanced user preferences. For instance, on Netflix, a user might search for dark-comedy movies. While Netflix could broadly suggest dark-comedy films, the recommendations need to align with the user's individual interpretation of the genre. For example, one user might consider Fargo a dark-comedy movie, while another may not. Similarly, on Spotify, a user might be interested in exploring jazz music but explicitly ask to exclude smooth jazz from the suggestions. On Yelp, two users might query "Italian place with free parking." A well-designed system should adapt to their distinct preferences, recommending a casual pizzeria to one user and a fine-dining Italian restaurant to the other, based on their past behavior.
At its core, even a simple attribute query, such as “Italian restaurants,” can be seen as a set-theoretic query—a conjunction between the restaurants aligning with the user's preferences and the set of Italian restaurants. As queries grow more complex, involving logical combinations of attributes, the recommendation task becomes increasingly challenging. Such scenarios are common in real-world applications, underscoring the need for recommendation systems that can interpret and satisfy these explicit set-theoretic constraints while still personalizing results to individual tastes.
While traditional recommendation systems are well-explored, they generally fall short in addressing queries that involve attributes and their logical combinations. This limitation restricts their ability to solve practical tasks where such constraints are often integral to user needs.
Our work introduces a novel framework that bridges this gap, enabling recommendation systems to effectively manage set-theoretic queries across multiple practical domains. By addressing this critical challenge, we aim to extend the functionality of recommendation systems to better meet nuanced user needs.
Training with the encoded similarity
- We train on a binary cross-entropy loss, where the cross-entropy is calculated between the observed data and the model's similarity output (incorporating both collaborative and attribute data).
- Parameters: For the MF method, the vectors corresponding to the users, items, and attributes are the trainable parameters. For the NeuMF method, the neural network is trained along with these parameters. For box embeddings, the vectors corresponding to the lower-left ($x^{\llcorner}$) and upper-right ($x^{\urcorner}$) corners are the trainable parameters.
- Hyperparameters: We perform extensive hyperparameter tuning for the learning rate, batch size, volume and intersection temperatures of GumbelBoxes, and the number of negative samples for noise-contrastive training. Please refer to Appendix B.2 for the detailed hyperparameter space. For each method, we conduct 100 hyperparameter runs, varying the values of the aforementioned hyperparameters.
Inference:
- During inference, we calculate the volume of the box embedding region corresponding to the set-theoretic query.
- Inferring on $u \wedge a$: we calculate the score as $\frac{\mathrm{Vol}\left(\mathrm{Box}(u) \cap \mathrm{Box}(a) \cap \mathrm{Box}(i)\right)}{\mathrm{Vol}\left(\mathrm{Box}(i)\right)}$.
- Inferring on $u \wedge a_{1} \wedge a_{2}$: we calculate the score as $\frac{\mathrm{Vol}\left(\mathrm{Box}(u) \cap \mathrm{Box}(a_{1}) \cap \mathrm{Box}(a_{2}) \cap \mathrm{Box}(i)\right)}{\mathrm{Vol}\left(\mathrm{Box}(i)\right)}$.
- Inferring on $u \wedge a_{1} \wedge \neg a_{2}$: we use inclusion-exclusion to calculate the score as $\frac{\mathrm{Vol}\left(\mathrm{Box}(u) \cap \mathrm{Box}(a_{1}) \cap \mathrm{Box}(i)\right) - \mathrm{Vol}\left(\mathrm{Box}(u) \cap \mathrm{Box}(a_{1}) \cap \mathrm{Box}(a_{2}) \cap \mathrm{Box}(i)\right)}{\mathrm{Vol}\left(\mathrm{Box}(i)\right)}$ (see the sketch after this list).
- For vectors, there is no prescribed way to calculate these set operations, so we explore the following options: (1) Filter: instead of representing the set operation in the embedding space, we use post-hoc aggregation; (2) Product: normalized prediction scores are multiplied to obtain the joint query score; (3) Geometric: vector addition and subtraction for intersection and negation, respectively.
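A hedged sketch of the box-based query scoring described above, using hard volumes for readability (the model itself uses GumbelBox smoothing; all names are illustrative):

```python
import torch

def intersect(*boxes):
    # Each box is a (min corner, max corner) pair of tensors
    lo = torch.stack([b[0] for b in boxes]).max(dim=0).values
    hi = torch.stack([b[1] for b in boxes]).min(dim=0).values
    return lo, hi

def vol(box):
    lo, hi = box
    return torch.clamp(hi - lo, min=0).prod(dim=-1)

def query_score(user, item, pos_attrs=(), neg_attr=None):
    base = intersect(user, item, *pos_attrs)
    v = vol(base)
    if neg_attr is not None:
        # inclusion-exclusion: Vol(q ∩ i) - Vol(q ∩ a_neg ∩ i)
        v = v - vol(intersect(base, neg_attr))
    return v / vol(item)
```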
- $\mathrm{Vol}$ and $\mathrm{VolInt}$ involve min and max operations, which hinder gradient flow, making it hard to learn the parameters of the box embeddings.
- Dasgupta et al. (2020) propose GumbelBox, where the min and max corners are replaced with Gumbel random variables, solving this gradient issue. This modification boils down to using logsumexp, which is a smooth approximation of the max operation. logsumexp is defined as $\mathrm{logsumexp}_{t}(x_{1}, \dots, x_{n}) = t \log\left(\sum_{j} e^{x_{j}/t}\right)$, where $t$ is known as the temperature of the GumbelBox.
- Under the GumbelBox (abbreviated GB), we change the notations $\mathrm{Vol}$ to $\mathrm{Vol}_{GB}$, $\mathrm{VolInt}$ to $\mathrm{VolInt}_{GB}$, and $\mathrm{sim}$ to $\mathrm{sim}_{GB}$.
- With the new formulation, the per-dimensional similarity score becomes $\mathrm{sim}_{GB}^{(k)}(u, i) = \frac{\mathrm{VolInt}_{GB}^{(k)}(u, i)}{\mathrm{Vol}_{GB}^{(k)}(i)}$.
- The volume and intersection temperatures of the GumbelBox control the smoothness of these approximations; as they approach zero, the GumbelBox recovers the originally proposed hard Box. We tune these as hyperparameters.
We appreciate the reviewers' observations regarding the clarity of our explanations of the notations. We acknowledge that some key concepts were not described as thoroughly as they could have been. In this general response, we provide an outline of the methodology of our work and clarify these concepts. These updates will be incorporated into the revised version of the manuscript. We kindly request the reviewers to review the general response before addressing our answers to the specific questions.
Problem Setup
- The user-item matrix $M$ is the collaborative signal.
- The attribute information is coded in the attribute-item matrix $A$.
- We want personalized recommendations with attribute constraints.
- Which comedy movies would user $u$ like? This is the same as completing the row for $u$ in $M$, restricted to the items marked as comedy in $A$.
- Which romantic-comedy movie would the user like? This is the same as completing the row restricted to the items marked as both romantic and comedy in $A$ (an intersection).
- Which comedy movie would the user like that is not romantic? This is the same as completing the row restricted to the comedy items minus the romantic ones in $A$ (a set difference).
Similarity Function
- We define similarity functions for encoding the collaborative and attribute signals.
- For vectors, we represent the user with a vector $u$ and the item with a vector $v$; the similarity is calculated as $u^{\top} v$, or as $f(u, v)$ for any neural method $f$.
- For box embeddings, we represent each user, item, and attribute as a hyper-rectangle, i.e., a Cartesian product of intervals: $\mathrm{Box}(x) = \prod_{k=1}^{d} \left[x^{\llcorner}_{k}, x^{\urcorner}_{k}\right]$, where $x^{\llcorner}_{k}, x^{\urcorner}_{k} \in \mathbb{R}$.
- The $k$-th dimension of $\mathrm{Box}(x)$ is thus an interval, denoted as $\left[x^{\llcorner}_{k}, x^{\urcorner}_{k}\right]$.
- Note that the total volume of a rectangle is simply the product of the side lengths across dimensions. So the rest of the discussion focuses on a single dimension $k$; we multiply the per-dimension scores to calculate the final score.
- The volume of the item box at dimension $k$ is given by the length of the interval: $\mathrm{Vol}^{(k)}(i) = i^{\urcorner}_{k} - i^{\llcorner}_{k}$.
- The volume of the intersection between the user and item boxes is given by $\mathrm{VolInt}^{(k)}(u, i) = \max\left(0,\; \min(u^{\urcorner}_{k}, i^{\urcorner}_{k}) - \max(u^{\llcorner}_{k}, i^{\llcorner}_{k})\right)$.
- To understand the above equation better: the intersection of two intervals is again an interval, whose min coordinate is the max of the two min coordinates, and whose max coordinate is the min of the two max coordinates.
- The volume of the intersection between the user and item boxes decides how related the item is to the user. We normalize the similarity score by the item box volume. The similarity value for dimension $k$ is given as $\mathrm{sim}^{(k)}(u, i) = \frac{\mathrm{VolInt}^{(k)}(u, i)}{\mathrm{Vol}^{(k)}(i)}$ (a worked example follows below).
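As a quick worked example of these per-dimension formulas (numbers chosen arbitrarily):

```python
def interval_sim(u_lo, u_hi, i_lo, i_hi):
    # Intersection length normalized by the item interval length
    inter = max(0.0, min(u_hi, i_hi) - max(u_lo, i_lo))
    return inter / (i_hi - i_lo)

# User interval [0.0, 1.0], item interval [0.5, 1.5]:
# intersection = [0.5, 1.0] with length 0.5; item length = 1.0
print(interval_sim(0.0, 1.0, 0.5, 1.5))  # 0.5
```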
We sincerely thank the reviewers for their time and thoughtful, constructive feedback. In this rebuttal, we humbly take this opportunity to address their concerns and questions regarding the related work and experimental setup, aiming to clarify the scope of our work and provide a more detailed explanation of the notations used.
We sincerely thank the reviewers for their valuable feedback, which has greatly helped us improve our draft by incorporating stronger baselines to support our claims, enhancing the clarity of notations and experimental setup, adding a detailed discussion on time efficiency, and better positioning our work within the context of related research.
The revisions are highlighted (in blue) throughout the paper. Key updates are as follows:
- Performance of Light-GCN (Table 4):
- The main results table (Table 4) has been updated to include the performance metrics for Light-GCN.
- Relative performance trends remain consistent with previous findings.
- Notation Section (Sections 3.1 and 3.2):
- Simplified and clarified the notations, making them more intuitive and aligned with the problem setup.
- Included missing definitions for terms such as $\mathrm{VolInt}$, $\mathrm{Vol}$, $\mathrm{VolInt}_{GB}$, and others to enhance readability.
- Updated Related Work (Appendix A.2 and A.4):
- Expanded and refined discussions in the related work sections to address reviewer concerns.
- Added sections on context-aware recommendation and set-theoretic queries in search.
- Time Efficiency Analysis (Appendix D):
- Discussed the computational efficiency of box embeddings, highlighting their scalability and parallelizability.
- Highlighted that box embeddings offer computational complexity similar to simple vector-based methods, significantly faster than complex methods like LightGCN.
- Highlighted Training Details:
- Explicitly detailed training procedures, including the number of negative samples used, model selection protocols, and the final scoring evaluation strategy.
- Added a parallel coordinates plot to visualize hyperparameter choices and their impact on performance (Appendix B.2, Figure 2.)
- Added missing experiments on NeuMF on the ML20M dataset (Table 7, Appendix B.3).
These revisions aim to address the specific concerns raised by reviewers and improve the overall clarity and completeness of the draft.
We humbly request further feedback from the reviewers!
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.