The methodology description and experiment setup are missing several relevant details (listed under "Questions"). In addition, the methodology has a series of weaknesses, outlined below, that only allow for partial or task-specific statements about scaling laws. This contrasts with the authors' tendency to make quite general statements about model behavior that the experiment setup does not support.

Methodological weaknesses:

In line 160 you state that the labels of LoveDA and BigEarthNet are unreliable. While that might be true, what supports your choice of the BVM dataset? From looking at the BVM publication, I cannot tell, for example, whether these labels were manually annotated or automatically generated, as is often the case in EO.
A main weakness is the choice of just a single, recently published dataset on which the authors base all of their conclusions, for a subselection of models. In your conclusion you state that "In this paper, we have explored the scaling laws governing neural networks applied to large-scale, multi-spectral remote sensing tasks", but you have only explored a single task on a single dataset. Even within EO segmentation, many specific and relevant tasks exist, such as flood segmentation, land-cover classification, and crop-type mapping. While you recognize this limitation in the subsequent paragraph, it stands in contrast to the general statements you make throughout the paper. For example, in line 353 you state that "the MS-Swin models offer additional advantages, making them a better choice for multi-spectral segmentation tasks", which as a general conclusion does not follow from your experiment setup.
Another main weakness is the evaluation scheme: you employ only Overall Accuracy as a metric for segmentation and, moreover, base all your conclusions on this metric. You state in line 284 that it is a "global measure of how well the model correctly classifies pixels across classes". But this is not necessarily true. For segmentation tasks with high class imbalance, as is also present in BVM, simply predicting the background class or another dominant class will often result in high Overall Accuracy, which does not in itself indicate a skillful image segmentation model. Other metrics, such as Intersection-over-Union (IoU) or F1 score, also need to be considered.
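To make this concern concrete, here is a minimal sketch on purely synthetic labels (not BVM data; the 95/5 class split, the two-class setup, and all function names are my own assumptions for illustration). It shows how a degenerate model that predicts only the dominant class can reach high Overall Accuracy while its IoU and F1 for the minority class are zero.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of all pixels that were classified correctly."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_iou_and_f1(cm):
    """Per-class IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN)."""
    n = len(cm)
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        # A class absent from both labels and predictions is undefined (None).
        ious.append(tp / (tp + fp + fn) if (tp + fp + fn) else None)
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else None)
    return ious, f1s


def mean_defined(values):
    """Average over classes for which the metric is defined."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined)


# Synthetic, imbalanced toy example: 950 pixels of class 0, 50 of class 1.
y_true = [0] * 950 + [1] * 50
# Degenerate "model" that always predicts the dominant class.
y_pred = [0] * 1000

cm = confusion_matrix(y_true, y_pred, num_classes=2)
oa = overall_accuracy(cm)
ious, f1s = per_class_iou_and_f1(cm)
mean_iou = mean_defined(ious)

print(f"Overall Accuracy: {oa:.3f}")
print(f"Per-class IoU:    {ious}")
print(f"Per-class F1:     {f1s}")
print(f"Mean IoU:         {mean_iou:.3f}")
```

In this toy setting Overall Accuracy is 0.95, while the minority-class IoU and F1 are both 0 and the mean IoU is 0.475. Reporting per-class and mean IoU or F1 alongside Overall Accuracy would make such failure modes visible in the scaling analysis.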
I am unfortunately confused by your "scaling efficiency metric", as I cannot find a clear definition of how the individual factors G, P, and C are computed. Additionally, I am not sure what motivates this metric, and I would be interested in a discussion of its edge cases and of where it is defined, for example if G = P x C.
Your introduction of a Spectral Dependency Module does not go into detail about relevant prior research in this domain. Ideas about this have been discussed for example here