PaperHub
Rating: 6.8/10 · Poster · 4 reviewers (min 4, max 5, std 0.4)
Individual ratings: 4, 4, 5, 4
Confidence: 3.5
Novelty: 3.5 · Quality: 3.5 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Can NeRFs "See" without Cameras?

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

Enabling NeRFs to "see" in wireless multipath settings.

Abstract

Keywords: NeRF, Indoor floorplan

Reviews and Discussion

Review (Rating: 4)

EchoNeRF demonstrates that a NeRF-style volumetric model can recover and predict 3D indoor structure using only sparse RF power measurements. By introducing virtual transmitters that cast RF power beams through a voxel grid, and by learning per-voxel reflectivity and density to match line-of-sight and single-bounce multipath signals, the method reconstructs floorplans and accurately predicts signals for unseen transmitter locations without any optical inputs. This work also clarifies a limitation of single-stage training (the LoS model dominates training due to gradient issues) and correspondingly proposes a two-stage training strategy. This work opens the door to camera-free mapping and localization driven by RF multipath.

Strengths and Weaknesses

Strengths:

  1. The paper’s simple yet effective free-space LoS model and its natural extension to first-bounce reflections are well justified.

  2. Discretizing voxel orientations offers a clear, efficient way to handle multiple reflections.

  3. The two-stage training schedule successfully addresses gradient instability in single-stage methods.

Weakness:

  1. The motivation is vague - what practical, efficiency and cost advantages do RF signals offer over camera-based 3D reconstruction?

  2. The voxel-selection strategy for reflections needs more details. Summing reflected power by iterating over every voxel scales poorly as the voxel count grows. Although the authors accelerate this with discrete orientation angles, further implementation details are needed.

  3. Lack of ablation study. Experiments isolating the reflection model's impact and comparing single-stage versus two-stage training are needed.

  4. The proposed method only models line-of-sight plus first-bounce reflections - which only account for less than 6% of total RF power - so it ignores high-order multipath that can be substantial in cluttered or non-planar environments.

Overall, the paper is well written and demonstrates a competitive improvement compared to other methods. However, I have several concerns - outlined in the weakness section. Although there are some limitations, the paper still has several commendable aspects. I would vote for a borderline accept, but I may adjust my score depending on the authors' rebuttal and the other reviewers' feedback.

Questions

  1. The voxel grid and discrete normal angles also impose coarse spatial and angular resolution, and iterating over every voxel (even with orientation pruning) can become expensive as scene size grows. Please clarify the impact of voxel grid density and normal-angle discretization.

  2. Please provide more details on the voxel-selection strategy for reflections.

  3. More ablation studies should be included, e.g., 1) with and without the reflection model and 2) single-stage versus two-stage training.

Limitations

  1. EchoNeRF models only line-of-sight and first-bounce reflections (≤6 % of total RF power), ignoring higher-order multipath that can dominate in cluttered or non-planar scenes.

  2. Its voxel grid and discrete normal sampling yield coarse spatial/angular resolution, and—even with orientation pruning—iterating over all voxels becomes costly as scene size grows.

  3. The paper lacks a thorough ablation study isolating the reflection model and comparing single-stage versus two-stage training.

Final Justification

The authors' rebuttal addressed most of my concerns. Based on the feedback from other reviewers, it is evident that this paper is of decent quality, despite some flaws. Therefore, I maintain my original score: borderline accept.

Formatting Concerns

The appendix details should be in the appendix instead of the main paper.

Author Response

We sincerely thank the reviewer for their constructive feedback. Below, we answer some concerns.


Q1. The motivation is vague - what practical, efficiency and cost advantages do RF signals offer over camera-based 3D reconstruction?

We think there are a number of advantages, and based on your comment, we realize we should add them to the introduction of the paper:

  1. Privacy: Using cameras to infer floorplans raises privacy concerns inside homes and offices. Apple’s ARKit (which uses the phone camera or LIDAR to perform localization and mapping at homes) has received pushback from customers. Amazon Alexa has not adopted vision techniques due to privacy. RF-based methods learn only the geometry of the environment, without “seeing” the details in rooms. Several companies have expressed enthusiasm to us for RF-based imaging solutions.
  2. See through: Extensions of EchoNeRF may be able to do X-ray vision, i.e., infer objects inside an opaque box using RF measurements from outside the box. This can have applications in medical imaging, airport security, or military applications (where a team of soldiers may want to infer the floorplan and people inside an enemy building).
  3. Fewer measurements: The RF modality can assist cameras by reducing the number of measurements. For instance, cameras may be used to image the lobbies and corridors of a hotel, while the RF modality can sense the rooms and un-visited areas. This lowers the overhead of visiting every location with the camera.

Lastly, we are curious about what can be achieved with RF-based NeRFs and perhaps more applications would emerge once such capabilities have matured.


Q2. The voxel-selection strategy for reflections needs more details. Summing reflected power by iterating over every voxel scale poorly as the voxel count grows. Although the authors accelerate this with discrete orientation angles, further implementation are needed.

Thanks for understanding our paper correctly and asking this question. We make 2 observations:

  1. For each discrete orientation, the manifold of reflection is sparse even in 3D. As a result, the number of optimization variables does not grow too fast. For instance, consider all voxels at 90° orientation, i.e., their surface normal is perpendicular to the horizontal floor. Now, say a Tx and Rx are on the floor and we want to identify which of the 90° voxels above the floor will result in a valid reflection from the Tx to the Rx. For this case, observe that the manifold is the vertical line bisecting Tx and Rx; only voxels on this vertical line are candidates for reflection. Similar manifolds will exist for other discrete orientations.
  2. Yes, even if the manifolds are thin/sparse, they may still add up as the number of discrete orientations grows. However, observe that we only consider voxels with high opacities (≈ 1), since they are the ones likely to reflect. Voxels that are transparent are ignored, implying that we discard many voxels even on the manifolds. This also helps the optimization.
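To make observation 1 concrete, here is a minimal sketch (our own naming, not the paper's code; `is_valid_reflector` is a hypothetical helper) of the image-method test for whether a voxel with a given discrete surface normal produces a valid single-bounce path between a Tx and an Rx:

```python
import numpy as np

def is_valid_reflector(tx, rx, voxel, normal, tol=1e-6):
    """Specular check for a single-bounce path Tx -> voxel -> Rx.

    Image method: mirror Rx across the voxel's tangent plane; the voxel is a
    valid reflector iff it lies on the segment from Tx to the mirrored Rx.
    (Illustrative sketch; the paper's actual implementation may differ.)
    """
    tx, rx, voxel, normal = (np.asarray(v, dtype=float)
                             for v in (tx, rx, voxel, normal))
    # Mirror Rx across the plane through `voxel` with unit normal `normal`.
    d = np.dot(rx - voxel, normal)
    rx_img = rx - 2.0 * d * normal
    # Collinearity test: voxel must lie strictly between Tx and mirrored Rx.
    a, b = voxel - tx, rx_img - tx
    off_line = np.linalg.norm(np.cross(a, b))
    t = np.dot(a, b) / (np.dot(b, b) + 1e-12)
    return bool(off_line < tol * (np.linalg.norm(b) + 1.0) and 0.0 < t < 1.0)
```

For the 90° example in the text (Tx and Rx on the floor, normal pointing straight up), only voxels on the vertical line through the Tx-Rx midpoint pass this test, matching the manifold described above.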

Q3. Lack of ablation study. Experiments isolating the reflection model's impact and comparing single-stage versus two-stage training are needed.

Perhaps this confusion arises from not explaining Fig. 5 in detail. Observe that row 4 in Fig. 5 is the output of the LoS model, and row 5 shows the improvements with the reflection model. Row 4 is also the result from the 1st stage of our model (where the LoS model dominates), while row 5 is after completing the full 2-stage training.

If you are curious about training the whole EchoNeRF model in a single stage, we tried that but the one-stage training did not converge and led to poor results. This prompted us to move to the 2-stage solution.


Q4. The proposed method only models line-of-sight plus first-bounce reflections - which only account for less than 6% of total RF power - so it ignores high-order multipath that can be substantial in cluttered or non-planar environments.

We should have been clearer here. It’s actually the opposite. The line of sight (LoS) and the first bounce together make up 94% of the total power (and 6% is the contribution from all higher-order reflections). The reason for this is that higher-order reflections travel longer distances and get absorbed by multiple surfaces; hence their power attenuates significantly by the time they reach the receiver.


Q5. More ablation studies should be included, e.g., 1) with and without the reflection model and 2) single-stage versus two-stage training.

Please see response to question 3.

Comment

The authors' rebuttal addressed most of my concerns. Based on the feedback from other reviewers, it is evident that this paper is of decent quality, despite some flaws. Therefore, I maintain my original score: borderline accept.

Review (Rating: 4)

This paper proposes a method to learn wall layouts from radio-frequency (RF) measurements of multiple transmitters (Txs) placed throughout the scene and emitting concurrently. The key to the approach is to learn a NeRF (MLP-based) that learns a mapping of locations to opacity and orientation such that the RF power is accurately reconstructed. The authors demonstrate through simulated results that this analysis-by-synthesis approach enables interpretable room layout reconstructions (with the assumption that any voxel orientation is at either 0, 45, 90, or 135 degrees, since walls tend to be perpendicular).

Strengths and Weaknesses

Strengths

  1. The paper's proposed EchoNeRF model seems reasonable, novel, and effective (at least in simulation). In particular, I was impressed the model is able to handle multiple transmitters simultaneously (akin to multiplexed imaging), since this would result in measurements mixing together, making it much more difficult to separate power for each Tx such that geometry can be learned. This is a hard problem and may be relevant in domains beyond RF.

  2. I found the ablations and comparison between EchoNeRF and EchoNeRF_LOS helpful and informative. In particular, it was interesting to see what happens when furniture is in the scene.

  3. The paper is very well written and the techniques, for the most part, are well explained. The figures are clean and informative.

Weaknesses

  1. The paper lacks any real-world validation, making it difficult to assess its practicality. This issue is exacerbated by a lack of comprehensive analysis or even discussion of the impact of real-world noise (aside from noise in receiver location in Sec 4.3 and mentioning the use of additive noise in the background on modelling in the appendix).

  2. The paper claims on L22 that "optical NeRF understandably fails since it is not equipped to handle multipath", which is a misleading generalization, and subsequently misses several works related to optical multipath NeRF that should be cited and/or discussed. Some of the multipath NeRF from lidar literature is similar to this work, but isn't discussed. In particular, PlatoNeRF [1] models first-order reflections from multiple "virtual Txs" and uses a similar two-stage training approach (however, it does not model multiple Txs at once as in this work). A more comprehensive list of papers that should be cited/discussed is below.

  3. One of the main advantages of modeling reflections is stated to be occlusion-awareness (L128-129), but this occlusion-awareness is minimally ablated. It's discussed on L263-266, but it's unclear to me why EchoNeRF_LOS cannot see the outer walls, as the Txs (red stars) are directly adjacent to them - maybe if the Rx locations were shown in the GT, it would become clearer.

  4. The authors discuss NeWRF, but don't offer quantitative comparison. Fig 1 is helpful and clearly shows this method isn't adequate for reconstruction (so this point is fairly minor), but it would be helpful to at least show qual. results on one of the same scenes as EchoNeRF for apples to apples comparison.

Multipath modeling with LiDAR:

[1] PlatoNeRF. https://doi.org/10.1109/CVPR52733.2024.01380. (models first-order reflections with NeRF)

[2] Non-Line-of-Sight Imaging via Neural Transient Fields. https://doi.org/10.1109/TPAMI.2021.3076062 (models second-order reflections with NeRF)

Multipath modeling with RGB:

[3] NeRF-Casting. https://doi.org/10.1145/3680528.3687585

[4] oRCA. https://doi.org/10.1109/CVPR52729.2023.01990

Questions

I enjoyed the paper and it's well written, but the first two weaknesses listed above make it difficult to recommend for acceptance in its current form. Thus, adding real-world results and sufficient discussion to past work would help mitigate my concerns.

A few other smaller comments:

  • L154 The nomenclature for ω_j is confusing - if ω_j is an angle, why is the set of discrete values = (1, 2, 3, …) instead of (45, 90, 135, …)? It did not become clear that these values should be multiplied by 45 degrees until later in the text.

  • While there is more thorough background on RF in the appendix, a brief background section in text would be very helpful for a broad audience like NeurIPS. This could briefly describe at a high level questions, such as how do we interpret the data? Is there a temporal dimension or is it just an intensity image? Do RF signals reflect like optical signals (scatters on diffuse objects, reflects in one direction on specular objects)?

  • In training stage 1, is there any way to filter out pixels that correspond to occlusions and only supervise other pixels?

  • An ablation with varying numbers of discretized voxel orientations would be informative (increasing K_w)

  • This paper calls itself a NeRF based method but there is no notion of radiance in the formulation - seems like more of a neural implicit representation (NIR). The authors may consider revising their terminology or motivating why they call this method a NeRF.

  • It would be helpful if the Fig 5 caption was a bit more descriptive (e.g. mention how the reader can interpret the red stars and lightly colored lines in the first row)

  • Why is voxel orientation used as an output of the NIR rather than using more traditional neural SDF approaches that represent scene geometry and can easily be used to compute reflected ray directions?

Limitations

Yes.

Final Justification

Although the paper is missing real-world evaluation, the rebuttal mitigated my main concerns. Based on the rebuttal, I would like to increase my rating to borderline accept. I encourage the authors to reword L22 to say "vanilla NeRF" rather than "optical NeRF" to be more clear, add the missing related works, and add the new ablation on noise to the supplementary materials in the next revision.

Formatting Concerns

No concerns.

Author Response

We sincerely thank the reviewer for their constructive feedback. Below, we answer some concerns.


Q1. The paper lacks any real-world validation, making it difficult to assess its practicality. This issue is exacerbated by a lack of comprehensive analysis or even discussion of the impact of real-world noise (aside from noise in receiver location in Sec 4.3 and mentioning the use of additive noise in the background on modeling in the appendix).

We acknowledge that EchoNeRF has a considerable simulation-to-real gap. However, even in simulation, this proved to be a challenging problem (in our opinion), mainly due to the core inverse problem that must be solved under an unknown number of multipath reflections, sparse signal measurements, and each measurement being a scalar (because RF signal power is a scalar).

We believe this paper has finally solved this inverse problem, arguably the most critical piece in the whole RF-NeRF puzzle. While this does not deliver a practical end-to-end system, the current version of EchoNeRF gives us (and others) a platform to build on with follow-up ideas. These ideas include, but are not limited to, fusing with RGB, audio, LIDAR or other sensors (as suggested by several reviewers); using channel impulse responses (CIR) to extract more information from each measurement; leveraging multiple frequencies in WiFi/6G signals; using floorplan priors; and even post-processing on the final results. These should all help in closing the sim2real gap.

Regarding real-world noise, we have run new experiments this week where noise is added at the receiver. We report graceful degradation in EchoNeRF with increasing noise levels, shown below:

Specifically, we added Gaussian noise to the RSSI measurements with a mean equal to the noise floor (in dB) and a variance of 4 dB. We tested five noise floor levels ranging from -80 dB to -130 dB across the 6 floorplans shown in Fig 5. The SNR at a receiver is computed as the difference between the received signal power and the noise floor (e.g., a received power of -70 dB with a noise floor of -80 dB results in an SNR of 10 dB). Our typical RSSI measurements range from -60dB (strong) to -130dB (weak). We report the mean IoU below:

SNR (in dB)      EchoNeRF_LoS IoU (↑)   EchoNeRF IoU (↑)   Qualitative
inf (no noise)   0.251                  0.371              Legible
60               0.246                  0.336              Legible
50               0.231                  0.292              Legible
40               0.226                  0.298              Legible
30               0.207                  0.241              Missing walls
10               0.09                   0.14               Illegible

As expected, performance drops with decreasing SNR; both EchoNeRF_LoS’s and EchoNeRF’s floorplans remain legible down to 30dB, but below that (when noise power becomes comparable to the signal power) both EchoNeRF_LoS and EchoNeRF break down.
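The noise-injection procedure above can be sketched as follows (our own function names; the exact experimental code may differ — note that the Gaussian noise sample is drawn in dB, while signal and noise powers add in the linear domain):

```python
import numpy as np

def add_noise_db(rssi_db, noise_floor_db, sigma_db=2.0, rng=None):
    """Add receiver noise to RSSI measurements: Gaussian in dB with mean equal
    to the noise floor and variance sigma_db**2 (4 dB for sigma_db = 2), as in
    the rebuttal experiment. Sketch only; the actual procedure may differ."""
    rng = np.random.default_rng() if rng is None else rng
    rssi_db = np.asarray(rssi_db, dtype=float)
    noise_db = rng.normal(noise_floor_db, sigma_db, size=rssi_db.shape)
    # Convert both terms to linear power, sum, and convert back to dB.
    total_lin = 10 ** (rssi_db / 10) + 10 ** (noise_db / 10)
    return 10 * np.log10(total_lin)

def snr_db(rssi_db, noise_floor_db):
    # SNR = received power minus noise floor, both in dB.
    return np.asarray(rssi_db) - noise_floor_db
```

For example, a received power of -70 dB over a -80 dB noise floor gives an SNR of 10 dB, matching the worked example in the text.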


Q2. The paper claims on L22 that "optical NeRF understandably fails since it is not equipped to handle multipath", which is a misleading generalization, and subsequently misses several works related to optical multipath NeRF that should be cited and/or discussed. Some of the multipath NeRF from lidar literature is similar to this work, but isn't discussed. In particular, PlatoNeRF [1] models first-order reflections from multiple "virtual Txs" and uses a similar two-stage training approach (however, it does not model multiple Txs at once as in this work). A more comprehensive list of papers that should be cited/discussed is below.

In L22, we were referring to vanilla NeRFs, which do not model multipath. In L95-L100, we discussed how follow-up works have modeled multipath as a 2-component decomposition problem, while EchoNeRF must handle a K-component decomposition, where K is unknown.

Thanks for pointing us to the papers. We will surely cite them in our revised version but the ideas from these papers don’t apply directly to ours. Specifically, the RGB based reflection-aware NeRFs (e.g., NeRFCasting and ORCa) are still focussed on efficiently solving the 2-component decomposition, typically for rendering glossy surfaces or mirrors.

The other line of work using LIDARs (e.g., PlatoNeRF, NLOS imaging) copes with many unknown reflections; however, LIDARs have very high time resolution (very high clock frequency) and are therefore able to receive the incoming rays in different time buckets. This temporal separation allows the NeRF to make separate measurements for different surfaces in the scene.

Since we use only signal power in EchoNeRF, we only have one single scalar measurement (named RSSI) that contains a mixture of the LoS and all the K reflections. This makes our inverse problem even more complex.


Q3. One of the main advantages of modeling reflections is stated to be occlusion-awareness (L128-129), but this occlusion-awareness is minimally ablated. it's discussed on L263-266, but it's unclear to me why EchoNeRF_LOS cannot see the outer walls as the Txs (red stars) are directly adjacent to them - maybe if the Rx location was shown in the GT, it would become more clear

We understand the confusion and let us briefly clarify it here. We will explain better in the revised version of the paper.

In row 1 of Fig. 5, the red stars denote the Tx locations, and the light gray dots denote the Rx locations (i.e., where measurements are made). Row 4 of the same figure shows the results from the EchoNeRF_LOS model. The outer walls are not detected in row 4 because there are no Rx locations, hence no measurements, outside the home; the gray dots in row 1 are all inside the home floorplan. However, when there are walls between a red Tx and a gray Rx inside the home, those occlusions show up in row 4 as inner walls.

In row 5, EchoNeRF “sees” the reflections from the outer walls, and hence the results in row 5 recover the outer walls as well.

We hope this helps to clarify the situation.


Q4. The authors discuss NeWRF, but don't offer quantitative comparison. Fig 1 is helpful and clearly shows this method isn't adequate for reconstruction (so this point is fairly minor), but it would be helpful to at least show qual. results on one of the same scenes as EchoNeRF for apples to apples comparison.

We understand that apples-to-apples comparison with NeWRF would be helpful, hence we have now run the following experiment for the 4th floorplan in Figure 5. We placed 500 receivers (Rx), spread across the floorplan, and one transmitter (Tx) located at the center. Results from training NeWRF using their official codebase do not show any floorplan. Instead, the learnt geometry has somewhat similar behavior as that of Fig 2(c), where black dots are scattered inside and outside the actual floorplan at virtual Tx locations, but not aligned with walls of the house.

This is not surprising. As authors of the NeWRF paper themselves acknowledged, their neural model fits the signal measurements well but does not solve the core inverse problem to learn the physics of signal propagation (which is in turn important to predict the floorplan).

We will include this qualitative comparison in our revised manuscript.


Q5. It would be helpful if the Fig 5 caption was a bit more descriptive (e.g. mention how the reader can interpret the red stars and lightly colored lines in the first row).

Of course. We realized this as we responded to Q3 above.


Q6. L154 The nomenclature for ω_j is confusing - if ω_j is an angle, why is the set of discrete values = (1, 2, 3, …) instead of (45, 90, 135, …)? It did not become clear that these values should be multiplied by 45° until later in the text.

Yes, we were sloppy here. We will make this precise.


Q7. Why is voxel orientation used as an output of the NIR rather than using more traditional neural SDF approaches that represent scene geometry and can easily be used to compute reflected ray directions?

During the optimization, one can indeed calculate a voxel’s orientation from the voxel opacities. However, we found empirically (as also observed in several papers [1,2]) that orientations inferred from opacities, or via an SDF, are noisy during the learning process. These intermediate noisy orientations destabilize EchoNeRF’s ability to learn the reflections. This motivated us to explicitly model the orientation per voxel.

[1] Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

[2] Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields


Q8. A brief background section on RF in text would be very helpful for a broad audience like NeurIPS. This could briefly describe at a high level questions, such as how do we interpret the data? Is there a temporal dimension or is it just an intensity image?

Thanks for this suggestion; it makes sense to add a brief review for the wider audience. We will add a background section in the revised version.


Q9. In training stage 1, is there any way to filter out pixels that correspond to occlusions and only supervise other pixels?

We understand your intuition; however, please note that an occluding pixel can be located anywhere along the line joining the Tx and the Rx. For all these possible locations, the received LoS power will be the same, hence the LoS model (in stage 1) has no way to pin down the correct occluding pixel. The second-stage training positions the occluding pixel correctly, by optimizing for the occlusion that best explains the reflections. This is why filtering out the occluding pixels is not suitable, and why the 2-stage training is needed.
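The ambiguity can be seen with a toy attenuation model (illustrative only; EchoNeRF's actual rendering integral differs): the LoS power depends on the product of per-voxel transmittances along the Tx-Rx ray, and that product is unchanged when the occluder slides along the ray.

```python
import numpy as np

def los_transmittance(opacities):
    """Toy LoS attenuation: product of per-voxel transmittances (1 - opacity)
    along the Tx-Rx ray. Illustrative sketch, not EchoNeRF's renderer."""
    return float(np.prod(1.0 - np.asarray(opacities, dtype=float)))

# Ten voxels along the Tx-Rx ray; one opaque occluder (opacity 0.9).
ray_near_tx = [0.0] * 10; ray_near_tx[2] = 0.9   # occluder close to the Tx
ray_near_rx = [0.0] * 10; ray_near_rx[7] = 0.9   # occluder close to the Rx
# Both placements attenuate the LoS power identically, so a stage-1 LoS loss
# alone cannot localize the occluder along the ray.
```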

Comment

I'd like to thank the authors for addressing my questions and concerns through their rebuttal. The majority of my concerns were mitigated. Unfortunately the lack of real-world results makes it challenging to be confident of the real-world impact of this work. However, as the authors point out, the problem they consider is challenging and this is a step in the right direction nonetheless. The added ablation on noise is helpful. Can the authors clarify roughly what SNR level is expected in real-world measurements (e.g. reference real-world SNR for the new table from Q1) - or is this what is meant by "typical RSSI measurements ranges" in the response?

Comment

Thanks for the clarification. Although the paper is missing real-world evaluation, the rebuttal mitigated my main concerns. Based on the rebuttal, I would like to increase my rating to borderline accept. I encourage the authors to reword L22 to say "vanilla NeRF" rather than "optical NeRF" to be more clear, add the missing related works, and add the new ablation on noise to the supplementary materials in the next revision.

Comment

Thank you for your careful consideration and the thoughtful questions. We are grateful.

Our SNR results are actually very conservative (i.e., a worst-case analysis), so the results reported in the rebuttal table have robustness built in to cope with noisy real-world conditions. Here is why.

Real-world WiFi SNR is typically between 30dB to 60dB depending on the distance of the device from the WiFi router. It is possible to check SNRs on our own laptops (e.g., on a Macbook, clicking (option + click) on the wifi icon on the top title bar showed us):

RSSI: -46dBm, Noise: -105dBm

This means the laptop’s SNR was 59dB inside a home at a distance of around 8 feet from the router. Another measurement showed 28dB at 51 feet away (farthest location from the home router).

In the rebuttal Table, we have added noise aggressively, so when we report, say, 40dB SNR, it is for the closest receiver that experienced the highest SNR among all receivers. All other receivers experienced lower SNR, and we evaluated EchoNeRF under such weak-SNR scenarios.

In real-world settings, say average SNR is around 45dB. Then the best SNR would be around 60+dB, which corresponds to the first row of our Table. EchoNeRF shows good results for such SNRs as visualized by the floorplans in Figure 5 (and in the Appendix).

Review (Rating: 5)

The paper introduces EchoNeRF - a method for reconstructing floorplans from WiFi measurements. In contrast to prior works, EchoNeRF specifically targets reconstructing geometry rather than just WiFi signals. To achieve this, the method proposes two key ideas: (1) a multi-path model for WiFi signals that represents the simulated signal as a sum of multi-bounce paths and (2) a floorplan representation which uses opacity and a discretized normal to describe walls. The multi-path signal is then simulated, for the first and second bounce, using a two-stage algorithm which first recovers the line-of-sight paths before recovering geometry with first-order reflections. In practice, the method recovers significantly more accurate floorplans than prior work and, as a result, can also simulate WiFi signals for new transmitter locations much more accurately.

Strengths and Weaknesses

strengths:

  1. The paper develops a principled technique for reconstructing floorplans with WiFi signals and in the process makes an interesting observation that existing WiFi-based NeRF methods really do not reconstruct the geometry in any sort of meaningful way. This is similar to the series of papers a few years ago, e.g. NeuS [Wang et al. 2021] and VolSDF [Yariv et al. 2021], which demonstrated that NeRF was failing to capture high-quality surfaces even though novel views were extremely high quality. In this case, the gap between the geometry recovered by EchoNeRF and methods like NeWRF and NeRF2 is even more drastic.

  2. The paper has extensive discussion of both the design decisions, implementation, and the limitations of the method. This should make the paper and the method extremely useful to the broader NeurIPS community as it improves reproducibility and motivates future work. The approach itself is very novel and provides a new framing of how to simulate WiFi with NeRF, both in terms of theory and practical approaches.

weaknesses:

  1. A very minor point, it would have been great to have an example with real world data.

references:

@article{wang2021neus,
  title={NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction}, 
  author={Peng Wang and Lingjie Liu and Yuan Liu and Christian Theobalt and Taku Komura and Wenping Wang},
  journal={NeurIPS},
  year={2021}
}
@inproceedings{yariv2021volume,
  title={Volume rendering of neural implicit surfaces},
  author={Yariv, Lior and Gu, Jiatao and Kasten, Yoni and Lipman, Yaron},
  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
  year={2021}
}

Questions

  1. Do the authors have a sense of how this method would work in 3D where scenes exhibit some variation in the added Z dimension? Separately, have the authors thought about how 2D floorplans may be useful in reconstructing components of indoor scenes with visible light?

  2. Is there any reason to expect that parameterizing the occupancy with a signed distance function (SDF) as in [Wang et al. 2021, Yariv et al. 2021, Miller et al. 2024] would improve the quality of the reconstructed floorplans?

references:

@article{wang2021neus,
  title={NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction}, 
  author={Peng Wang and Lingjie Liu and Yuan Liu and Christian Theobalt and Taku Komura and Wenping Wang},
  journal={NeurIPS},
  year={2021}
}
@inproceedings{yariv2021volume,
  title={Volume rendering of neural implicit surfaces},
  author={Yariv, Lior and Gu, Jiatao and Kasten, Yoni and Lipman, Yaron},
  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
  year={2021}
}
@InProceedings{Miller:VOS:2024,
  author    = {Miller, Bailey and Chen, Hanyu and Lai, Alice and Gkioulekas, Ioannis},
  title     = {Objects as Volumes: A Stochastic Geometry View of Opaque Solids},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {87-97}
}

Limitations

Yes.

Final Justification

The paper presents a novel and well-structured approach for reconstructing floor plans from WiFi measurements, introducing a principled multi-path signal model and a wall representation that jointly enable significant improvements over prior WiFi-based NeRF methods in both geometry recovery and signal simulation. The method is clearly motivated, thoroughly discussed, and experimentally validated. My questions about extending to 3D and the potential role of SDF parameterization were addressed in the rebuttal with concrete technical considerations and future plans. With these points clarified, I maintain my strong recommendation for acceptance.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for their constructive feedback. Below, we answer some concerns.


Q1. Do the authors have a sense of how this method would work in 3D where scenes exhibit some variation in the added Z dimension?

Yes, we have pondered on this question and see two possible approaches to extend to 3D:

(1) Re-compute the reflection manifolds in 3D, i.e., given the Tx and Rx locations, find which voxels will produce valid reflections. In 2D, these manifolds were curved lines on the 2D plane (as shown in Fig. 3); in 3D, the manifolds become curved lines in 3D space. Importantly, the number of voxels on these manifolds does not grow excessively, since the 2D manifolds are essentially projections of the 3D manifolds. Hence, we expect the optimization to remain stable.

(2) A more engineering approach could be as follows. When the Tx and Rx are at a similar height h, hardly any reflections occur on the vertical walls from higher or lower than h. Hence, we only need to estimate the ceiling and the floor, since the vertical walls can be extended upward and downward from a 2D floorplan. Since ceilings and floors are both horizontal (90° orientation), we need only add a vertical 90° manifold to our optimization to model the reflections from the ceiling and floor. Knowing the ceiling and floor heights, and the 2D floorplan, a closed 3D floorplan can be created using existing 3D software.

In follow-up work, we plan to attempt idea #1 first and fall back, if necessary, on idea #2.
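
To make idea #2 concrete, the final step of extruding the estimated 2D floorplan between the floor and ceiling heights can be sketched as below; this is a toy illustration of ours (the function name and quad representation are assumptions, not the paper's code):

```python
def extrude_floorplan(polygon_2d, floor_z, ceiling_z):
    """Lift a closed 2D floorplan polygon into 3D walls by extruding each
    wall segment vertically from floor_z to ceiling_z. Returns one quad
    (4 corners) per wall segment."""
    walls = []
    n = len(polygon_2d)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon_2d[i], polygon_2d[(i + 1) % n]
        walls.append([(x0, y0, floor_z), (x1, y1, floor_z),
                      (x1, y1, ceiling_z), (x0, y0, ceiling_z)])
    return walls
```

Adding two horizontal polygons at floor_z and ceiling_z then closes the volume, matching the "closed 3D floorplan" described above.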


Q2. Separately, have the authors thought about how 2D floorplans may be useful in reconstructing components of indoor scenes with visible light?

Yes, there is indeed a lot of room for innovation in multi-modal fusion. We focused only on RF in this paper to understand how far EchoNeRF alone can get us. However, once the RF mode matures, we see two immediate tracks for fusion:

  1. Camera helping RF: When cameras can be partially used to view parts of the environment (e.g., kitchen and living room but not the bedroom or bathroom), they can offer valuable initialization to EchoNeRF. This can boost EchoNeRF’s accuracy. Cameras may even be adaptively used to selectively repair parts of the floorplan that exhibit higher uncertainty.
  2. RF helping cameras: RF can assist cameras by reducing the number of measurements needed to infer the full scene. This is possible because RF signals can see through walls, which obviates the need for the user to walk to all corners of the house. EchoNeRF can also be used as a second information stream to add robustness to vision-based imaging.

Q3. Is there any reason to expect that parameterizing the occupancy with a signed distance function (SDF) as in [Wang et al. 2021, Yariv et al. 2021, Miller et al. 2024] would improve the quality of the reconstructed floorplans?

Thanks for bringing SDF into our chain of thought. SDF is indeed useful for predicting surface occupancy and may improve the smoothness of walls if we replace opacity with SDFs, as proposed in Wang et al. 2021. However, the wall orientation estimates may be less stable while SDFs are being trained. We found empirically (as also observed in several papers [1,2]) that the orientations inferred from opacities are noisy during the learning process. That said, we would like to revisit the potential benefits of plugging SDFs into our now-stable EchoNeRF pipeline.

[1] Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

[2] Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields
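
For concreteness, the SDF-to-opacity conversion of Wang et al. 2021 (NeuS) referred to above can be sketched as follows; this is a rough illustration of their idea, not a component of EchoNeRF:

```python
import math

def neus_alpha(sdf_near, sdf_far, s=50.0):
    """Opacity of a ray segment from two consecutive SDF samples, in the
    spirit of NeuS: Phi_s is a logistic CDF with learnable sharpness s,
    and alpha peaks where the SDF changes sign (i.e., at a surface)."""
    phi = lambda x: 1.0 / (1.0 + math.exp(-s * x))
    return max((phi(sdf_near) - phi(sdf_far)) / phi(sdf_near), 0.0)
```

A segment straddling a wall (SDF sign flip) receives opacity close to 1, while free-space segments stay transparent; in principle such an alpha could replace per-voxel opacity in the LoS renderer, which is the swap we would like to revisit.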


Q4. Minor weakness: A very minor point, but it would have been great to have an example with real-world data.

We thank you for recognizing that this problem is hard even in simulation. When we started this project, we envisioned demos of EchoNeRF in the same spirit as camera-based NeRFs. This paper falls short of that, and we acknowledge EchoNeRF has a simulation-to-real gap. However, the core inverse problem with RF proved to be hard (in our opinion) even in a simulation setting; we realized this hardness as we iterated through many failures before finally solving it.

Our next step is to add the necessary pieces to EchoNeRF so it can function in real scenarios. These pieces include, but are not limited to, fusing with RGB, audio, LIDAR or other sensors; using channel impulse responses (CIR) to extract more information from each measurement; leveraging multiple frequencies in WiFi/6G signals; using floorplan priors; and even post-processing on the final results. These should help in narrowing the sim2real gap.

Comment

The rebuttal addressed all of my questions and provided additional context regarding the challenges of using real-world data and extending to 3D. The other concerns raised by reviewers do not affect what I consider the core strengths of the paper, so I maintain my recommendation for acceptance.

Review
4

This paper addresses the problem of floor plan estimation from multipath signals (e.g., radio frequency (RF) signals such as WiFi, or audio). The paper proposes a novel and interesting approach, EchoNeRF, which redefines Neural Radiance Fields (NeRF) to implicitly learn the environment structure from many line-of-sight (LoS) and multi-reflection paths in RF measurements. The manuscript presents compelling evidence of the method's efficacy in predicting floor plan geometries solely from RF signals.

Strengths and Weaknesses

Strengths

  • The paper is well-written and organized. The manuscript clearly states the motivation, challenges, and implications of the proposed solution. It also presents a clear description of the details and assumptions behind the modification of a NeRF-like pipeline for RF signals.
  • The benchmarks and baselines are suitable for the task and relevant for evaluations.
  • The idea behind the proposed EchoNeRF is novel, and it sounds feasible and valid. Although the floor plan estimation results differ significantly from those of other solutions that use cameras or direct geometry sensors, the quality of the estimated geometry is impressive.

Weaknesses

  • One minor weakness in the presented manuscript is the lack of experiments using real RF signals, such as UWB sensors or RF IC boards, which are relatively easy to use and inexpensive.

  • One major weakness in the presented manuscript is that the evaluation is limited to the ZinD dataset [9], which primarily contains sparsely furnished and geometrically simple rectangular rooms. Expanding the experiments to include more complex datasets such as MVL-dataset or HM3D may enhance the discussion of the limitations of the proposed EchoNeRF.

Solarte, B., Wu, C. H., Jhang, J. C., Lee, J., Tsai, Y. H., & Sun, M. (2024, September). Self-training Room Layout Estimation via Geometry-Aware Ray-Casting. In European Conference on Computer Vision (pp. 253-269). Cham: Springer Nature Switzerland.

Ramakrishnan, S. K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., ... & Batra, D. (2021). Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. NeurIPS Datasets and Benchmarks Track, 2021.

Questions

Q1: Did the authors consider evaluating EchoNeRF in environments with walls exhibiting different RF absorption and reflection properties, for example, brick, wood, concrete, and glass windows? If so, how do the reconstruction results differ across such diverse environments?

Q2: Since multipath signals arise not only from walls but also from ceilings and floors, can the proposed EchoNeRF reconstruct full 3D room geometry, including the floor and ceiling? If so, I suggest adding those experiments to strengthen the novelty and the discussion of future work.

Q3: I recommend that the authors expand the discussion in Section 5 to address the challenges associated with multiple receivers (RX) and transmitters (TX), the open challenges in complementing previous solutions that use RGB cameras or depth/LiDAR sensors (multimodal approaches), and the open concerns related to data privacy.

Q4: Fig. 3 requires significant improvement. In its current form, it is really difficult to interpret and does not effectively convey the intended information. For instance, the purple shadow (representing the manifold of reflections at $\omega = 45^\circ$) presents incident and reflected lines that are not at $\omega = 45^\circ$ in the illustration. Can you elaborate on this?

Q5: In L154-L157, the paper states that $\omega_j \in \{1, \ldots, K_\omega\}$ with $K_\omega = 4$. However, in L158 and Fig. 3, $\omega = 45^\circ$. Can you elaborate on this? Is this related to the number of reflections of the signal and the shape of the room? If so, how does this affect more complex room shapes? What happens if $K_\omega$ is overestimated, e.g., 64? What about underestimated, e.g., 1?

Limitations

yes

Final Justification

The authors addressed my main concerns; however, evaluations on more complex scenes that challenge the limits of the proposed solution are still missing. Nevertheless, I sincerely consider the idea presented in this manuscript to be significantly novel, and it has the quality needed to meet the NeurIPS standard. Thus, I keep my rating of borderline accept.

Formatting Issues

No formatting issues were found.

Author Response

We sincerely thank the reviewer for their constructive feedback. Below, we answer some concerns.


Q1. One major weakness in the presented manuscript is that the evaluation is limited to the ZinD dataset [9], which primarily contains sparsely furnished and geometrically simple rectangular rooms. Expanding the experiments to include more complex datasets such as MVL-dataset or HM3D may enhance the discussion of the limitations of the proposed EchoNeRF.

Thank you for pointing to those datasets. Yes, greater floorplan complexity (complex wall patterns, furniture, clutter) will all help to bring out the deficiencies in EchoNeRF. The goal with this work was not so much to understand the limits of RF-based NeRFs, but to establish the feasibility of such frameworks. Given that past work could hardly infer any geometry, our goal was to explore if NeRFs can be taught RF reflections, and if that training can converge to a reasonable floorplan. Frankly, and perhaps naively, we have been surprised at how much NeRFs could infer even when the measurements are just signal power (not channel impulse responses (CIR)). We believe there are many untapped opportunities that can be pursued in follow-up work to close the sim2real gap. Examples of such opportunities are: using priors, fusion with other modalities such as camera or audio, utilizing multiple RF frequencies, using CIR measurements, etc.


Q2. Did the authors consider evaluating EchoNeRF in environments with walls exhibiting different RF absorption and reflection properties, for example, brick, wood, concrete, and glass windows? If so, how do the reconstruction results differ across such diverse environments?

We ran some experiments this week to observe the sensitivity to materials: we tested 5 materials across the 6 floorplans shown in Fig 5. We report the mean IoU for EchoNeRF_LoS and EchoNeRF below:

| Material | EchoNeRF_LoS IoU (↑) | EchoNeRF IoU (↑) |
|---|---|---|
| Concrete | 0.251 | 0.371 |
| Glass | 0.236 | 0.364 |
| Brick | 0.232 | 0.357 |
| Marble | 0.226 | 0.328 |
| Wood | 0.227 | 0.311 |

Materials with higher reflectivity, such as concrete and glass, yield better performance than absorptive materials like wood. This is because stronger reflections give EchoNeRF's reflection model more multipath signal to fit.
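
The ordering above can be reproduced with a toy single-bounce link budget; the reflection coefficients below are assumed values for illustration, not measurements from our experiments:

```python
def first_bounce_power(p_tx, gamma, d1, d2):
    """Toy received power for one Tx -> wall -> Rx bounce: free-space
    spreading over the total path d1 + d2, scaled by the power reflection
    coefficient gamma**2 (antenna and wavelength constants omitted)."""
    return p_tx * gamma ** 2 / (d1 + d2) ** 2

# Illustrative amplitude reflection coefficients (assumed, not measured).
GAMMA = {"concrete": 0.8, "glass": 0.7, "wood": 0.3}
power = {m: first_bounce_power(1.0, g, d1=3.0, d2=4.0) for m, g in GAMMA.items()}
```

Higher reflectivity leaves more first-bounce energy for the reflection model to fit, consistent with the IoU trend in the table.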


Q3. Since multipath signals arise not only from walls but also from ceilings and floors, can the proposed EchoNeRF reconstruct full 3D room geometry, including floor and ceiling. If so, I suggest you to add those experiments to enhance your novelty and the discussion for future works.

Thanks. We have pondered this question and see two possible approaches to extend to 3D:

(1) Re-compute the reflection manifolds in 3D, i.e., given the Tx and Rx locations, find which voxels will produce valid reflections. In 2D, these manifolds were curved lines on the 2D plane (as shown in Fig. 3); in 3D, they become curved lines in 3D space. Importantly, the number of voxels on these manifolds does not grow excessively, since the 2D manifolds are essentially projections of the 3D manifolds. Hence, we expect the optimization to remain stable.

(2) A more engineering approach could be as follows. When the Tx and Rx are at a similar height $h$, hardly any reflections occur on the vertical walls from higher or lower than $h$. Hence, we only need to estimate the ceiling and the floor, since the vertical walls can be extended upward and downward from a 2D floorplan. Since ceilings and floors are both horizontal ($90^\circ$ orientation), we need only add a vertical $90^\circ$ manifold to our optimization to model the reflections from the ceiling and floor. Knowing the ceiling and floor heights, and the 2D floorplan, a closed 3D floorplan can be created using existing 3D software.

We plan to add this discussion to the future work section of the revised version. In follow-up work, we plan to attempt idea #1 first and fall back, if necessary, on idea #2.


Q4. Fig. 3 requires significant improvement. In its current form, it is really difficult to interpret and does not effectively convey the intended information. For instance, the purple shadow (representing the manifold of reflections at $\omega = 45^\circ$) presents incident and reflected lines that are not at $45^\circ$ in the illustration. Can you elaborate on this?

We now see how this figure can be confusing; we will redraw it in the revised version. Let us briefly explain what the purple curve, labeled $45^\circ$, means.

Say a voxel has an orientation of $45^\circ$ with respect to a horizontal X axis. For a given Tx and Rx, at what location should we place this voxel so that the signal from the Tx would reflect on this voxel and reach the Rx? Observe that only certain locations will satisfy this requirement. The locus of all these locations is the purple line.
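
Numerically, membership on such a locus can be tested with a mirror-image check. The sketch below is our own helper (the tolerance is an assumption): a voxel at p lies on the manifold when the mirror image of the Tx across the wall line through p, the voxel itself, and the Rx line up along the reflected ray.

```python
import numpy as np

def on_manifold(p, tx, rx, omega_deg, tol=1e-3):
    """True if a wall voxel at p (2D) with wall direction omega_deg
    (degrees from the X axis) gives a valid specular Tx -> p -> Rx bounce."""
    w = np.deg2rad(omega_deg)
    normal = np.array([-np.sin(w), np.cos(w)])           # unit normal of the wall
    tx_img = tx - 2.0 * np.dot(tx - p, normal) * normal  # mirror Tx across the wall line
    d_in = (p - tx_img) / np.linalg.norm(p - tx_img)     # reflected-ray direction
    d_out = (rx - p) / np.linalg.norm(rx - p)            # direction toward the Rx
    return bool(np.linalg.norm(d_in - d_out) < tol)
```

Scanning this test over all grid voxels for each discrete orientation traces out curves like the purple line in Fig. 3.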


Q5. In L154-L157, the paper states that $\omega_j \in \{1, \ldots, K_\omega\}$ with $K_\omega = 4$. However, in L158 and Fig. 3, $\omega_j = 45^\circ$. Can you elaborate on this? Is this related to the number of reflections of the signal and the shape of the room? If so, how does this affect more complex room shapes? What happens if $K_\omega$ is overestimated, e.g., 64? What about underestimated, e.g., 1?

That was a typo and thank you for catching it. $K_\omega$ denotes the number of possible discrete orientations for wall voxels, where an orientation $\omega$ lies in $[0^\circ, 180^\circ)$. With $K_\omega = 4$, we can write $\omega_j = j \cdot \frac{180^\circ}{K_\omega}$ for $j \in \{0, \ldots, K_\omega - 1\}$.

In such settings, the EchoNeRF model will search for walls oriented at $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$. Higher values of $K_\omega$ will benefit the estimation of floorplans with more complex wall structures, while lower $K_\omega$ will need more investigation, an item for our future work.
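
As a worked example of this discretization (a toy helper of ours, not the paper's code):

```python
def discrete_orientations(k_omega):
    """Candidate wall orientations omega_j = j * 180 / K_omega, in degrees."""
    return [j * 180.0 / k_omega for j in range(k_omega)]

def nearest_gap(true_omega, k_omega):
    """Angular distance (with 180-degree wraparound) from a true wall
    orientation to its nearest discrete candidate."""
    return min(min(abs(true_omega - c), 180.0 - abs(true_omega - c))
               for c in discrete_orientations(k_omega))
```

With K_omega = 4 the candidates are 0, 45, 90, and 135 degrees, so a wall at 30 degrees is off by 15 degrees; K_omega = 64 shrinks the worst-case gap below 1.5 degrees, at the cost of evaluating more reflection manifolds per Tx-Rx pair.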


Q6. I recommend that the authors expand the discussion in Section 5 to address the challenges associated with multiple receivers (RX) and transmitters (TX), the open challenges in complementing previous solutions that use RGB cameras or depth/LiDAR sensors (multimodal approaches), and the open concerns related to data privacy.

Thanks for the thoughtful suggestion. We were unable to discuss several of these open questions due to the lack of space in the submission draft. If accepted, we will create space to discuss the open challenges to help facilitate follow-up research after EchoNeRF.

Comment

Although the authors addressed my main concerns, evaluations on more complex scenes that challenge the current limits of the proposed solution are still missing. Nevertheless, I sincerely consider that the current manuscript has the quality needed to meet the NeurIPS standard. Thus, I keep my rating of borderline accept.

Comment

We sincerely thank all reviewers for their valuable feedback and constructive comments. We will carefully address all suggestions in our revision. The paper has improved considerably in our opinion based on your insights and recommendations.

Fingers crossed.

 

Best regards,

Authors of #26612

Final Decision

The paper received positive reviews with scores of 4, 4, 4, 5. Reviewers emphasized the novelty of extending NeRFs to RF multipath signals (myAA, zf7b, ybbR, fth9), the principled two-stage training and reflection modeling (myAA, fth9), the clarity of presentation (zf7b, ybbR), and the strong improvements over prior RF-based NeRF approaches (myAA, ybbR). Concerns about real-world validation, dataset diversity, related work coverage, and ablation details were largely mitigated by additional experiments and clarifications in the rebuttal.

After thoroughly reviewing the paper, the reviews, and the authors’ rebuttal, the AC concurs with the reviewers’ overall positive consensus and therefore recommends the paper for acceptance.

For the camera-ready version, the authors should ensure all discussions from the rebuttal are incorporated into the main paper and supplementary materials. The specific changes that need to be implemented are:

1. Real-world validation and motivation (ybbR, zf7b, fth9): Integrate the new noise robustness and material sensitivity experiments, clarify expected SNR ranges, and expand the discussion on practical advantages of RF over camera-based methods (privacy, see-through capability, reduced measurement requirements).

2. Related work positioning (ybbR): Add missing citations to multipath NeRF literature (e.g., PlatoNeRF, NeRF-Casting, oRCA) and refine terminology (“vanilla NeRF”).

3. Experimental clarity (zf7b, fth9): Acknowledge the limitations of ZinD, point to HM3D/MVL as future directions, and highlight the role of two-stage training and reflection modeling in the ablations.

4. Presentation improvements (ybbR, zf7b): Revise unclear figures (e.g., Fig. 3 and Fig. 5) and ensure new results from the rebuttal are properly integrated.