c:["$","div",null,{"className":"container py-8 max-w-6xl mx-auto","children":["$","$e",null,{"fallback":null,"children":["$","$L16",null,{"paper":{"id":"eS5zjXvxf8","title":"MultiIoT: Towards Large-scale Multisensory Learning for the Internet of Things","abstract":"$17","keywords":["multimodal learning","representation learning","internet of things","benchmarks"],"primary_area":"datasets and benchmarks","venue":"Submitted to ICLR 2024","conference":"ICLR","year":2024,"status":"rejected","is_accepted":false,"avg_rating":3,"avg_rating_normalized":3,"rating_min":1,"rating_max":5,"rating_std":1.63299,"review_count":3,"comment_count":5,"creation_date":"2023-09-17","modification_date":"2024-02-11","forum_link":"https://openreview.net/forum?id=eS5zjXvxf8","pdf_link":"https://openreview.net/pdf?id=eS5zjXvxf8","arxiv_id":null,"arxiv_url":null,"arxiv_match_method":null,"arxiv_matched_at":null,"tldr":"MultiIoT is the largest ML for IoT benchmark to date, bringing unique challenges in multisensory modeling, temporal interactions, and heterogeneous sensors to solve tasks of real-world practical impact.","created_at":"2026-01-21T10:30:13.069371+00:00","updated_at":"2026-04-22T06:43:41.454154+00:00","authors":[{"id":"~Shentong_Mo1","name":"Shentong Mo","openreview_id":"~Shentong_Mo1","position":0},{"id":"~Paul_Pu_Liang1","name":"Paul Pu Liang","openreview_id":"~Paul_Pu_Liang1","position":1},{"id":"~Russ_Salakhutdinov1","name":"Russ Salakhutdinov","openreview_id":"~Russ_Salakhutdinov1","position":2},{"id":"~Louis-Philippe_Morency1","name":"Louis-Philippe 
Morency","openreview_id":"~Louis-Philippe_Morency1","position":3}]},"stats":{"ratings":[{"id":"i7Tsnctb78","value":5,"confidence":5},{"id":"EXxmqfq36n","value":3,"confidence":4},{"id":"EtoTDI9JEk","value":1,"confidence":4}],"avg_rating":3,"rating_min":1,"rating_max":5,"rating_std":2,"detailed_scores":{"soundness":[],"contribution":[],"presentation":[],"originality":[],"quality":[],"clarity":[],"significance":[]}},"commentTree":[{"id":"i7Tsnctb78","paper_id":"eS5zjXvxf8","replyto":"eS5zjXvxf8","number":1,"type":"Official_Review","role":"reviewer","rating":5,"confidence":5,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":"5: marginally below the acceptance threshold","summary":"This paper proposes MultiIOT which includes over 1.15 million samples from 12 modalities and 8 tasks. This paper summarizes the recent developments and key challenges in the field. Then, the authors benchmark the different model architectures for processing multi-modal sensory signals and propose some insights.","questions":"1. Can the authors discuss if there is extra effort in consolidating the different datasets? e.g. how to unify the data format and make them really 'one' benchmark and convenient for the research community to benchmark their algorithms on all the tasks easily.\n2. What are the implementation details for each task? In Sec. B there is some brief explanation like 'Network Architecture: Distinct neural architectures optimized for each modality type, such as CNNs for images and RNNs for sequential data', but it is not enough. More experimental details are needed to understand and replicate the experiments.\n\nMinor Issues:\n- No qualitative results were provided. Authors could consider including data points and visualizations for the dataset, benchmark, and method. \n- Fig. 3 is of low visual quality. 
Authors could design more and better charts to illustrate the model comparisons.","soundness":"2 fair","strengths":"- I like the author's efforts in incorporating more modalities in understanding the scenes and human behaviors. This work is in general well-motivated and I believe this work would be interesting for future ML research from an application standpoint. \n- Discussions on current situations in the field and outstanding challenges are well-written and easy to follow.","confidence":"5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.","weaknesses":"The main weaknesses of this work are the technical contribution and experimental evaluation.\n- For dataset and benchmark, in Sec. 2.2 the authors claim 'We collected diverse data from IoT devices, such as Inertial Measurement Units (IMU), Thermal sensors, Global Positioning Systems (GPS), capacitance, depth, gaze, and pose.' However, from my understanding, it consists of solely existing datasets while most of them contain only several modalities.\n- The experiments section contains no quantitative comparison with existing methods. There are other methods proposed for these individual tasks, and it would be difficult to evaluate the performance of the evaluated model variants without comparing them with the existing baselines.","contribution":"2 fair","presentation":"2 fair","code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2023-10-29T00:00:00+00:00","modified_at":"2023-11-11T00:00:00+00:00","replies":[],"contentHtml":{"rating":"

5: marginally below the acceptance threshold

","summary":"

This paper proposes MultiIoT, which includes over 1.15 million samples from 12 modalities and 8 tasks. The paper summarizes recent developments and key challenges in the field. The authors then benchmark different model architectures for processing multimodal sensory signals and offer some insights.

","questions":"

1. Can the authors discuss whether consolidating the different datasets required extra effort, e.g., how the data format was unified so that the datasets form 'one' benchmark on which the research community can easily evaluate algorithms across all tasks?
2. What are the implementation details for each task? Sec. B gives a brief explanation such as 'Network Architecture: Distinct neural architectures optimized for each modality type, such as CNNs for images and RNNs for sequential data', but this is not enough. More experimental details are needed to understand and replicate the experiments.

Minor Issues:

- No qualitative results were provided. Authors could consider including data points and visualizations for the dataset, benchmark, and method.
- Fig. 3 is of low visual quality. Authors could design more and better charts to illustrate the model comparisons.

","soundness":"

2 fair

","strengths":"

- I like the authors' efforts to incorporate more modalities for understanding scenes and human behaviors. This work is in general well-motivated, and I believe it would be interesting for future ML research from an application standpoint.
- The discussions of the current state of the field and its outstanding challenges are well-written and easy to follow.

","confidence":"

5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

","weaknesses":"

The main weaknesses of this work are the technical contribution and experimental evaluation.

- For the dataset and benchmark, in Sec. 2.2 the authors claim 'We collected diverse data from IoT devices, such as Inertial Measurement Units (IMU), Thermal sensors, Global Positioning Systems (GPS), capacitance, depth, gaze, and pose.' However, from my understanding, the benchmark consists solely of existing datasets, most of which contain only a few modalities.
- The experiments section contains no quantitative comparison with existing methods. Other methods have been proposed for these individual tasks, and it is difficult to judge the performance of the evaluated model variants without comparing them against these existing baselines.

","contribution":"

2 fair

","presentation":"

2 fair

","code_of_conduct":"

Yes

"}},{"id":"EXxmqfq36n","paper_id":"eS5zjXvxf8","replyto":"eS5zjXvxf8","number":2,"type":"Official_Review","role":"reviewer","rating":3,"confidence":4,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":"3: reject, not good enough","summary":"This paper provides an extensive benchmark, MultiIoT, for machine learning of IoT applications, that contains a large amount of data samples from 12 modalities and 8 different downstream tasks. Experiments are provided to compare the performance of machine learning models trained on different learning objectives and sensory modalities on each task. The conclusion was made that multi-modal and multi-task learning is beneficial in learning useful semantics from each modality.","questions":"1. What is the main difference between the \"adapter models\" and \"unimodal multi-task models\"? Do they only differ in the training paradigm, where the adapter models use self-supervised pretraining, while the multi-task models simultaneously optimize for multiple downstream tasks? In my opinion, they are similar because they both utilize a shared encoder to extract the general semantics of a single sensory modality signal.\n\n2. In section 4.2, what are the scales, w.r.t the number of parameters, of each compared model? Do you guarantee that the comparison between different models is fair, by avoiding comparing the performance between models with significantly different scales?","soundness":"2 fair","strengths":"1. The coverage of the sensory modalities and downstream tasks in the paper is fairly comprehensive.\n\n2. Some of the observations made in the paper are interesting and can motivate future research in the IoT domain. For example, the authors found that the interaction of different tasks can facilitate single-task performance","confidence":"4: You are confident in your assessment, but not absolutely certain. 
It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.","weaknesses":"1. As a benchmark, the authors did not provide a new dataset with comprehensive modality and task coverage that can be used for general IoT machine-learning models. Instead, the datasets evaluated in the benchmark all come from public resources, which only contain a subset of sensory modalities. For this reason, I feel it is actually an overclaim to address that the benchmark consists of over 1.15M samples, which comes from the sum of different datasets.\n\n2. The paper lacks a thorough comparison of how different DNN architectures, e.g., CNN, RNN, and Transformer, differ in processing the IoT sensing tasks, which in my opinion, is also an important perspective in such a benchmark.\n\n3. As an important perspective of IoT applications, the benchmark did not mention any efficiency results or considerations.","contribution":"2 fair","presentation":"3 good","code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2023-10-30T00:00:00+00:00","modified_at":"2023-11-11T00:00:00+00:00","replies":[],"contentHtml":{"rating":"

3: reject, not good enough

","summary":"

This paper provides an extensive benchmark, MultiIoT, for machine learning in IoT applications, which contains a large number of data samples from 12 modalities and 8 different downstream tasks. Experiments compare the performance of machine learning models trained with different learning objectives and sensory modalities on each task. The paper concludes that multi-modal and multi-task learning is beneficial for learning useful semantics from each modality.

","questions":"

1. What is the main difference between the \"adapter models\" and \"unimodal multi-task models\"? Do they only differ in the training paradigm, where the adapter models use self-supervised pretraining, while the multi-task models simultaneously optimize for multiple downstream tasks? In my opinion, they are similar because they both utilize a shared encoder to extract the general semantics of a single sensory modality signal.

2. In section 4.2, what are the scales, w.r.t the number of parameters, of each compared model? Do you guarantee that the comparison between different models is fair, by avoiding comparing the performance between models with significantly different scales?

","soundness":"

2 fair

","strengths":"

1. The coverage of the sensory modalities and downstream tasks in the paper is fairly comprehensive.

2. Some of the observations made in the paper are interesting and can motivate future research in the IoT domain. For example, the authors found that the interaction of different tasks can facilitate single-task performance.

","confidence":"

4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

","weaknesses":"

1. As a benchmark, the authors did not provide a new dataset with comprehensive modality and task coverage that can be used for general IoT machine-learning models. Instead, the datasets evaluated in the benchmark all come from public resources, each of which contains only a subset of the sensory modalities. For this reason, I feel it is an overclaim to state that the benchmark consists of over 1.15M samples, a figure that is simply the sum over the different datasets.

2. The paper lacks a thorough comparison of how different DNN architectures, e.g., CNN, RNN, and Transformer, differ in handling the IoT sensing tasks, which in my opinion is also an important perspective in such a benchmark.

3. Although efficiency is an important aspect of IoT applications, the benchmark does not report any efficiency results or considerations.

","contribution":"

2 fair

","presentation":"

3 good

","code_of_conduct":"

Yes

"}},{"id":"EtoTDI9JEk","paper_id":"eS5zjXvxf8","replyto":"eS5zjXvxf8","number":3,"type":"Official_Review","role":"reviewer","rating":1,"confidence":4,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":"1: strong reject","summary":"This paper claims to present a large multi modality benchmark for Internet-of-things (IoT). There are data present from 12 modalities and 8 tasks are defined to be solved with models trained with this data. The paper further evaluates various different types of architectures to asses how best to combine the various modalities to attain the best accuracy for the various tasks. The motivation for proposing this dataset is because of the claimed need to address various challenges with multimodal data IoT data including \"High-modality multimodal learning\", \"Temporal interactions\",\"Heterogeneity\" and \"Real-time\". Overall the authors find that multi-modality multi-task networks result in the best accuracy on the tasks.","questions":"I would strongly recommend that the authors review existing published works in ICLR and other AI and computer vision conferences to understand how to improve their papers' presentation, experiments, contributions and style, etc. In its current form the paper is not acceptable as a scientific article.","soundness":"1 poor","strengths":"Multimodal IoT seems like a potentially interesting under-explored topic.","confidence":"4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.","weaknesses":"The paper is significantly below the acceptance level of ICLR for the following reasons.\n\n1. The paper is poorly written and lacks a clear structure, premise or narrative.\n2. 
It is unclear what the claimed contribution of the work is and how it advances scientific research. It reads more like an opinion piece on the topic of multimodal IoT, rather than offering any concrete scientific insights.\n3. Many of the datasets in the collection of 1.115M samples presented in this work are publicly available datasets from other research projects and not ones curated by the authors.\n4. The experiments are poorly described and simply not reproducible.\n5. The experiments have no clear conclusion or insights.","contribution":"1 poor","presentation":"1 poor","code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2023-11-08T00:00:00+00:00","modified_at":"2023-11-11T00:00:00+00:00","replies":[],"contentHtml":{"rating":"

1: strong reject

","summary":"

This paper claims to present a large multimodal benchmark for the Internet of Things (IoT). It contains data from 12 modalities and defines 8 tasks to be solved with models trained on this data. The paper further evaluates several types of architectures to assess how best to combine the modalities to attain the best accuracy on each task. The motivation for proposing this dataset is the claimed need to address various challenges with multimodal IoT data, including \"High-modality multimodal learning\", \"Temporal interactions\", \"Heterogeneity\" and \"Real-time\". Overall, the authors find that multi-modality multi-task networks result in the best accuracy on the tasks.

","questions":"

I would strongly recommend that the authors review existing published works in ICLR and other AI and computer vision conferences to understand how to improve their paper's presentation, experiments, contributions, style, etc. In its current form the paper is not acceptable as a scientific article.

","soundness":"

1 poor

","strengths":"

Multimodal IoT seems like a potentially interesting under-explored topic.

","confidence":"

4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

","weaknesses":"

The paper is significantly below the acceptance level of ICLR for the following reasons.

1. The paper is poorly written and lacks a clear structure, premise or narrative.
2. It is unclear what the claimed contribution of the work is and how it advances scientific research. It reads more like an opinion piece on the topic of multimodal IoT, rather than offering any concrete scientific insights.
3. Many of the datasets in the collection of 1.15M samples presented in this work are publicly available datasets from other research projects and not ones curated by the authors.
4. The experiments are poorly described and simply not reproducible.
5. The experiments have no clear conclusion or insights.

","contribution":"

1 poor

","presentation":"

1 poor

","code_of_conduct":"

Yes

"}},{"id":"dq93yDRFPq","paper_id":"eS5zjXvxf8","replyto":"eS5zjXvxf8","number":1,"type":"Meta_Review","role":"area_chair","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"metareview":"The paper presents a benchmark for multimodality in the context of IoT, with 12 modalities and 8 tasks. While the reviewers found some merit in the dataset, the main finding of the paper that multi-modal multi-task learning provides some gains on this dataset was considered to fall short of being publishable, as there were multiple issues/questions with the framing, presentation of the experiments, and explanation of the experiments. Unfortunately, the authors did not participate in the author/reviewer discussion, and as such the issues have not been clarified. While the paper is recommended to be rejected at this time, the authors are encouraged to take the reviewers' feedback into account for a future submission to make their dataset a valuable contribution to the IoT domain.","justification_for_why_not_lower_score":"The dataset seems interesting on its own right if it is fleshed out more in the future.","justification_for_why_not_higher_score":"The paper falls short on multiple fronts such as Writing, Experiments, Contributions, and is far from the acceptance bar."},"created_at":"2023-12-05T00:00:00+00:00","modified_at":"2024-02-17T00:00:00+00:00","replies":[],"contentHtml":{"metareview":"

The paper presents a benchmark for multimodality in the context of IoT, with 12 modalities and 8 tasks. While the reviewers found some merit in the dataset, the main finding of the paper that multi-modal multi-task learning provides some gains on this dataset was considered to fall short of being publishable, as there were multiple issues/questions with the framing, presentation of the experiments, and explanation of the experiments. Unfortunately, the authors did not participate in the author/reviewer discussion, and as such the issues have not been clarified. While the paper is recommended to be rejected at this time, the authors are encouraged to take the reviewers' feedback into account for a future submission to make their dataset a valuable contribution to the IoT domain.

","justification_for_why_not_lower_score":"

The dataset seems interesting in its own right if it is fleshed out more in the future.

","justification_for_why_not_higher_score":"

The paper falls short on multiple fronts, such as writing, experiments, and contributions, and is far from the acceptance bar.

"}},{"id":"Pr2NJ9MXV0","paper_id":"eS5zjXvxf8","replyto":"eS5zjXvxf8","number":1,"type":"Decision","role":"program_chair","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"title":"Paper Decision","comment":"","decision":"Reject"},"created_at":"2024-01-16T00:00:00+00:00","modified_at":"2024-02-17T00:00:00+00:00","replies":[],"contentHtml":{"title":"

Paper Decision

","decision":"

Reject

"}}],"submissionHistory":[]}]}]}]