I thank the authors for the effort they put into their responses, including updating the paper and adding experiments. I read through the other reviews when they were first released, and noticed quite a few common themes, including with the content of my review. I've also read the authors' response to each, as well as the global response.

In response to the authors’ comment on “several key works were overlooked “: To be clear, this isn't a case of missing a few references or baselines, which I often find common and relatively understandable. Rather, this paper completely failed to consider an entire (popular) class of solutions for incremental learning. Note that adapters have also commonly been used in image continual learning, not just text. It’s also not just line 085 that was problematic; many parts of the paper considerably overplay the novelty of the proposed method, and the experiments should be focused specifically around comparing with other expansion/adapter-based solutions, as opposed to the current focus on regularization. This paper still needs a considerable re-write on how its contributions are presented, even with the most recent changes.

Fixed vs frozen backbone:

I still stand by my comment in W2. Allowing the base network to co-train will lead to catastrophic forgetting of the base network features that allow earlier adapters to maintain performance on previous tasks. Instead, this method is relying on prior regularization methods to combat catastrophic forgetting, and is thus not technically an incremental learning method on its own. I would suggest an experiment using just the proposed adapters, without also pairing it with prior regularization methods; I suspect there will be high catastrophic forgetting. I thus don’t find “we co-train the backbone while others freeze it" to be a compelling contribution or way to distinguish from prior adapter-based incremental learning works. In my opinion, a more correct way of presenting the proposed approach is that the authors have found that adding adapters improves the performance of regularization methods, perhaps by providing added modeling capacity per task and thus reducing the load on the regularization to preserve knowledge.
I thank the authors for adding the comparison between freezing vs co-training the backbone; this ablation is much needed given the paper’s claims. On the other hand, while there does seem to be some improvement from co-training, it seems quite small. I suspect that the frozen approach requires different hyperparameter tuning, due to less weights being trained, and with a more thorough hyperparameter search, this gap would be closed.

Parameter counts: Something is not right here. The text says 256 while the math below shows 128. I assume the number is actually 256, given the paper says that worked best, and , matching the quoted 267K additional parameters. Note however, that there's a missing set of parentheses in the above calculation (unless all tasks share the same weight matrices in the adapter and instead there are only per-task biases, which seems unlikely). Assuming a separate adapter per task, it should actually be , which for is actually 2.6M. That means that for a relatively modest 10 tasks, the adapters are in fact ~12% the size of the original network, which is not insignificant (note that the rebuttal also miscalculates 267K as 0.1% of 21.8M--even if the adapter parameter count was correct, this should 1.2%, not 0.1%). In contrast, see [a] from my original review, with a more efficient adapter parametrization only 1-3% the size of ResNet-18.