This topic has traditionally been covered mainly in computer vision venues. While the contribution may also be of interest to the ML community, the writing needs to provide more background and context to be accessible to the NeurIPS audience. For example, NeurIPS readers may not be familiar with the broader task or with more specialized architectures such as SparseBEV.

The work contains numerous writing and grammar issues. Some examples are listed in the minor comments section below, but the manuscript should be carefully proofread in its entirety.

Some of the mathematical notation seems off. E.g. {} is denoted as a set containing two sets, while it is treated as a set containing tuples of point locations and semantic classes. Some more minor math notation issues are given in the comments below.
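
To illustrate the distinction, here is a minimal sketch using placeholder symbols of my own (\mathcal{G}, p_i, c_i, N and K are assumptions for illustration, not the paper's notation). The first form is a set whose two elements are themselves sets, so locations and classes are not paired; the second is a set of (location, class) tuples, which appears to be what the text intends.

```latex
% Hypothetical symbols, for illustration only; not the paper's notation.
% A set containing two sets: locations and classes are decoupled.
\[
  \mathcal{G} = \bigl\{ \{p_i\}_{i=1}^{N},\ \{c_i\}_{i=1}^{N} \bigr\}
\]
% A set of tuples: each point location is paired with its semantic class.
\[
  \mathcal{G} = \bigl\{ (p_i, c_i) \bigr\}_{i=1}^{N},
  \qquad p_i \in \mathbb{R}^3,\quad c_i \in \{1, \dots, K\}
\]
```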

The first bullet point in the contribution list is somewhat broad and focuses more on the properties of the method than on its technical content. It does not mention the new key technical ideas underlying the method, such as CPS. I would recommend fully rephrasing the contribution list to focus on what is technically new (or at least applied for the first time in the context of occupancy prediction). Simply stating that several strategies are introduced to boost performance leaves the reader wondering what these strategies are.

Minor Comments:

l.3 "discretize 3D environment" -> "discretize the 3D environment"
l.33/34 "Alternative sparse latent representations has been explored" -> "Alternative sparse latent representations have been explored"
l.37 What is meant by "necessitating complex intermediate designs and explicit"? Complex architecture?
l.42/43 "Our OPS eliminates" should be "Our method eliminates" or "OPS eliminates"
l.48 "unable to tacke tremendous voxels" maybe something like "unable to handle a very high number of voxels"
l.69/70 "all our model configurations easily surpass all prior arts" -> "all our model variants easily surpass all prior work".
l.83 "This task recently becomes a foundational perception task in autonomous driving" -> "This task has recently become a foundational perception task for autonomous driving"
Consider using instead of for the number of occupied voxels in ground truth. The latter looks like to the power of . Same for all other occurrences of in a superscript.
If denotes a semantic class, it should not be defined as element of . This is not necessarily wrong but may be confusing as it is likely represented as an integer if not a one-hot vector?
Using to represent the number of semantic classes may be confusing given that is used to represent the number of occupied ground truth voxels.