The long generated stories don't seem to be very good in terms of storytelling. For example, Figure F does not really suggest any logical story (e.g. with a beginning, middle, end). It seems more like random captions were generated, especially near the end. The llm also starts repeating itself often (e.g. "A man in a hat and shirt looked surprised", "George, the cartoon monkey, stood in a grassy area". The authors in the paper say that this plot is "engaging", but IMO it is not engaging.

This is a difficult task, but a simple baseline is to have a text only LLM generate a curious george story. The story in Figure F looks quite poor compared to a pure text baseline. Another simple baseline would be to have a text-llm generate a story and then have a diffusion model generate an image for each paragraph. This might lack consistency between images, but overall might be better, especially if rated by humans. It feels like there is potential in this paper but I feel more iteration is needed to generate reasonable stories.

I'm also wondering how good the StoryStream dataset actually is for this task. From figure 5 it actually doesn't seem to have a very coherent story either, which might be why the trained model struggles. Figure 5 looks like a bunch of image captions with very little narrative storytelling. To be fair, it does seem that the other existing datasets have the same issue.

The main comparison with previous works is with MM-interleaved, but I am concerned about the fairness of these comparisons. MM-interleaved uses stable-diffusion-2-1 and Seed-story use SDXL. SDXL should be much better than stable-diffusion-2-1, and it is possible that the results in Figure 6a,b,d are strongly affected by this. Also, MM-interleaved use ViT-L and seed-story uses ViT-G as the image encoder which might also affect these numbers. MM-interleaved uses Vicuna-13b and seed-story uses llama2 which could affect Figure 6c. The authors do not mention these important differences between MM-interleaved and seed-story in the paper.

Nitpicks: References need to be cleaned up, there are multiple cases of duplicate references. E.g. for Qwen-vl, "Improved baselines with visual instruction tuning"