What is the impact of self-curation iterations? Is there a value to performing multiple iterations? Did you also consider inference of the document collection on an improved reverse model from the data extracted (making the augmentation step also iterative)?

Both are excellent questions. We observed improvements in data curation when moving from the seed model to the model obtained after the first iteration. However, from the first-iteration model to the second-iteration model, recall improved but precision dropped. Therefore, we did not use the second-iteration model to conduct a third iteration of data curation. We did not experiment with making the augmentation step iterative, but it sounds like a very interesting idea for future work.

Can you add some samples of extracted instruction dataset instances to the paper?

Below are a couple of examples:

Diamond engagement rings gained in popularity during the Art Deco era with the round old European cut diamond being the favourite.

### Asscher Cut

The Asscher cut is one of the first patented diamond cuts in the world and was invented by Dutch master diamond cutter, Joseph Asscher of the Royal Asscher Diamond Company in 1902.  Classic asscher cut diamonds are cut into squares and resemble emerald cuts, which are rectangular. Asscher cut diamonds are different to a square emerald cut in that they have larger step facets, a higher crown, smaller table and have more brilliance. The corners are cropped to give the shape an octagonal appearance.

### Baguette Cut

Although the baguette cut was invented sometime prior to the mid-1500s, it only gained popularity in 1912 when Cartier reintroduced the cut to the modern world. Its elongated, table cut, rectangular shape became highly fashionable in the geometric craze of the Art Deco period.

### Emerald Cut

The emerald diamond cut emerged as one of the first faceted diamond cuts, third in line after the point cut and the table cut. The cut has a dramatic hall of mirrors effect and was standardised in the 1940s.  

Generated instruction: List the most popular diamond cuts in the Art Deco era.

Inclusive Sports Coaching provides 1:1 Programs for individuals looking to develop their sporting skills, as well as improve their self confidence and opportunities for social and community inclusion.

We recommend an 8 or 12 Session program to identify areas for improvement and sporting skills, conduct drills and physical activities to work towards specific outcomes, while engaging with the client in areas such as listening, memory retention, cognitive processing, social interaction, encouraging conversations, accepting and giving constructive feedback, and other areas as needed.

At the halfway point we produce a status report on progress, and have found parents/carers often share this with OT's, Physios and Teachers as a way to share information on the individual and provide a strong network of support. At the end of the program we produce a final report, with recommendations for ongoing improvement, potential for progress along the person's chosen sport pathway where applicable, etc.

Generated instruction: I have a business called Inclusive Sports Coaching. We provide 1:1 sport coaching for people with disabilities. I want to have some materials on hand to give to parents when they enquire about our services. What do you recommend I include in these materials?

How are the segments selected from ClueWeb? Are entire documents chosen, or is there some filtering on criteria such as length? Are the inputs to the reverse models entire documents or smaller units like paragraphs?

We parse the ClueWeb WARC files in HTML format to extract segments. Each segment is a tree rooted at a header node, including subtrees from lower-level headers. We applied the following filters before sampling segments. Length: the total length of the text must be between 600 and 3000 characters. Duplication: we remove segments with repetitive sentences by computing the Jaccard similarity of n-grams between pairs of sentences in the segment. Header and text quality: we remove segments that have an empty header, whose text is all uppercase, or whose header contains navigation text such as “advertisement”, “forum”, “quick link”, “free newsletter”, etc.
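For illustration only, the segment filters described above could be sketched roughly as follows. This is not our actual pipeline code: the function names, the word-level trigram choice, the sentence splitter, and the 0.8 similarity threshold are all assumptions made for the example. Only the 600 to 3000 character range and the example navigation phrases come from the description above.

```python
import re
from itertools import combinations

# From the description above: length range and example navigation phrases.
MIN_CHARS, MAX_CHARS = 600, 3000
NAV_PHRASES = ("advertisement", "forum", "quick link", "free newsletter")

# Assumptions for this sketch (not reported values).
NGRAM_SIZE = 3          # word-level n-gram size
MAX_SENTENCE_SIM = 0.8  # Jaccard threshold above which two sentences count as repeats


def word_ngrams(sentence, n=NGRAM_SIZE):
    """Return the set of lowercase word n-grams in a sentence."""
    words = sentence.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def has_repetitive_sentences(text):
    """True if any pair of sentences has n-gram Jaccard similarity above the threshold."""
    # Naive sentence splitting on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    grams = [word_ngrams(s) for s in sentences]
    return any(jaccard(a, b) > MAX_SENTENCE_SIM for a, b in combinations(grams, 2))


def keep_segment(header, text):
    """Apply the header, length, uppercase, and duplication filters to one segment."""
    if not header.strip():
        return False  # empty header
    if any(phrase in header.lower() for phrase in NAV_PHRASES):
        return False  # navigation-like header
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        return False  # outside the length range
    if text.isupper():
        return False  # all-uppercase text
    return not has_repetitive_sentences(text)
```

In this sketch the cheap header and length checks run first, so the pairwise sentence comparison (quadratic in the number of sentences) is only computed for segments that survive them.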

Table 5: What is the model size used in this study? On which benchmark dataset? Does this observation hold over different model sizes? Which configuration has been used to report results in the rest of the paper?

These ablations were conducted with the 7B model and the win rates were evaluated on the 250 dev prompts sampled from the combined test prompts from multiple sources. We verified the same trend with the 65B model: combined system prompt: , only system prompt for OA seed data: . Results in the rest of the paper use the combined system prompt in both training and inference, i.e. the configuration corresponding to the first row of Table 5.