TaskBench: Benchmarking Large Language Models for Task Automation
Abstract
Reviews and Discussion
This work introduces a benchmark called TaskBench to evaluate LLMs for task automation. It includes three stages: task decomposition, tool invocation, and parameter prediction. In particular, it considers tools in a graph structure, which could motivate more complex applications of LLMs for task automation. It leverages back-instruct to automatically create the test cases. Extensive evaluations show the utility of this benchmark in evaluating LLMs' capability for task automation.
Strengths
- The introduction of an open-source benchmark with a tool graph is nice; it fills a gap in the field. The three scenarios included are interesting and realistic
- The back-instruct technique and the idea of sampling a subgraph from the whole task graph to build test cases are intuitive and sound
- The experiments cover a wide range of both black-box and open-source models with various metrics
Weaknesses
- I like the back-instruct idea, but LLM-generated queries can be biased and not well aligned with user behavior. In addition, for the benchmark to be faithful (which is very important given that a benchmark is what people will use to assess models over a relatively long period), the query and answer should match; I doubt whether the rule-based and LLM-based checks can always ensure the correctness of the generated instructions. I think human verification of all generated instructions would make the benchmark more faithful.
I took a close look at the Hugging Face dataset produced in this paper, and a large portion of the samples are incorrect. For example, in one sample (id: 10949228), the user query is "I have an image 'example.jpg' containing some information. I want to convert the image content into text and then answer the question: 'What is the main topic of the image?'", and the tool graph is Image-to-Text -> VQA. However, once you do image-to-text (assuming captioning), why VQA? It makes no sense; it could just be a textual QA, right? There are many similar cases in the dataset.
- I don't quite understand the goal of evaluating task decomposition as textual descriptions. I do think evaluating task decomposition is important, but as long as the model can build the correct tool graph and predict the correct parameters, that is good enough. So I don't think evaluating the textual description is necessary, and given how diverse the textual descriptions can be, the current automatic evaluation metric could be misleading.
Questions
See weakness
- When you build a DAG tool graph, do you consider one graph that contains multiple disconnected DAGs?
- Are tool graphs with more nodes more challenging? I think an analysis of the correlation between the difficulty of the test cases and their number of nodes would be interesting.
Thanks for your insightful comments. Below are our responses to your concerns:
[W1] Human verification on Back-Instruct Dataset
Response:
Regarding the benchmark's faithfulness and the correctness of generated instructions, we acknowledge your concern. Compared with previous methods, we further employ rule-based and LLM-based checks to filter out low-quality samples. Following your suggestions, we are exploring the integration of human verification to enhance accuracy and reliability; this combined approach will ensure a more robust and faithful benchmark. We conducted a human evaluation of the current version of the dataset and report the results in Appendix A.3; please refer to [G1] in the General Response.
[W2] Error Analysis on Back-Instruct Dataset
Response:
Thank you for pointing this out. We acknowledge the error in the specific instance you highlighted (id: 10949228). This example illustrates a mismatch of parameter types between VQA and the Image-to-Text tool: the text output of Image-to-Text is incorrectly linked to the image input of VQA. We conducted human verification on the constructed dataset, and the results are shown in [G2] of the General Response. In our examination, we found that the proportion of error cases is nearly 12%; after human verification, the quality of the dataset is further improved. We will continue to improve it in two ways:
- Improving the quality of tool graphs. Our analysis shows that these mismatches arise primarily when parameter types are not explicitly defined within the tool graph, leading to incorrect assumptions by the LLM about resource types. To rectify this, we propose introducing resource-type nodes in our tool graph. This will clearly delineate the types of resources each tool handles, ensuring that tool nodes are connected in a way that makes logical sense for the given query. This should significantly reduce such mismatches.
- Perform full human verification. We recognize the value of human verification in ensuring the accuracy and faithfulness of our benchmark. We are exploring ways to integrate this into our process, alongside our automated checks.
[W3] The importance of Evaluating task decomposition as textual descriptions.
Response:
Thank you for your comments and concerns regarding the necessity of evaluating task decomposition. We appreciate your perspective; it remains challenging to determine the most appropriate metric for task decomposition. Since prior work has not explored the evaluation of task decomposition in task automation, we wanted to explore a new perspective for analyzing task decomposition through textual descriptions. As shown in Figure 1 of our paper, the textual descriptions produced by task decomposition instruct LLMs to parse tools and predict parameters. Therefore, we believe that evaluating these textual descriptions can offer inspiration for enhancing LLMs in the future and reflects the planning capability of LLMs in producing decomposed sub-tasks. However, we also acknowledge your concern that current metrics based on textual descriptions are not sufficient to fully reflect the accuracy of task decomposition. We will continue to design better metrics for task decomposition, including subjective and objective evaluations.
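For concreteness, the snippet below sketches how such a text-level comparison could be computed with the `rouge_score` package; the step descriptions are hypothetical, and this is only an illustration of the kind of metric discussed above (Rouge-1/Rouge-2 over decomposed sub-task descriptions), not necessarily the exact evaluation code used in TaskBench.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Hypothetical gold and predicted decompositions for a two-step query.
reference = "Step 1: transcribe the audio file. Step 2: summarize the transcript."
prediction = "First convert the audio to text, then produce a summary of the text."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```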
[Q1] When you build a DAG tool graph, do you consider one graph that contains multiple disconnected DAG?
Response:
Thanks for your question. Yes, when we sample DAG-structured tool subgraphs from the complete tool graph, we also consider scenarios that include multiple disconnected DAGs. Such scenarios represent a category of practical commands where the potential tool invocation graph comprises several independent sets of tool invocations. In our approach, Back-Instruct is capable of understanding the dependencies within the tool subgraphs and can create separate sub-commands accordingly. For instance, in the example of Table 12, Audio Noise Reduction, Audio Effects, Audio Splicer, and Audio-to-Text form one independent group of tool combinations, while Image-to-Text, Text Sentiment Analysis, and Text Search form another. Correspondingly, Back-Instruct generates instructions covering two sub-commands: audio processing and image processing.
| sampled tool graph | Instruction | Tool Invocation Graph |
|---|---|---|
| {"sampling_method": "dag", "number_of_tools": 7, "sampled_nodes": [{"id": "Audio Effects", "input-type": ["audio", "text"], "output-type": ["audio"]}, {"id": "Audio Noise Reduction", "input-type": ["audio"], "output-type": ["audio"]}, {"id": "Audio Splicer", "input-type": ["audio", "audio"], "output-type": ["audio"]}, {"id": "Audio-to-Text", "input-type": ["audio"], "output-type": ["text"]}, {"id": "Image-to-Text", "input-type": ["image"], "output-type": ["text"]}, {"id": "Text Search", "input-type": ["text"], "output-type": ["text"]}, {"id": "Text Sentiment Analysis", "input-type": ["text"], "output-type": ["text"]}], "sampled_links": [{"source": "Audio Noise Reduction", "target": "Audio Effects"}, {"source": "Audio Splicer", "target": "Audio Noise Reduction"}, {"source": "Audio Splicer", "target": "Audio-to-Text"}, {"source": "Image-to-Text", "target": "Text Sentiment Analysis"}, {"source": "Text Sentiment Analysis", "target": "Text Search"}]} | I have an audio file 'example.wav' which has some background noise, please clean it up, transcribe it to text, and analyze its sentiment. Also, I need to extract text from an image 'example.jpg' and search for that text on the internet. Finally, please combine the cleaned audio file with another audio file 'example2.wav' and apply a reverb effect to create the final audio output. | {"nodes": [{"id": "Audio Effects", "input": ["Audio Splicer", "reverb"]}, {"id": "Audio Noise Reduction", "input": ["example.wav"]}, {"id": "Audio Splicer", "input": ["Audio Noise Reduction", "example2.wav"]}, {"id": "Audio-to-Text", "input": ["Audio Noise Reduction"]}, {"id": "Image-to-Text", "input": ["example.jpg"]}, {"id": "Text Search", "input": ["Image-to-Text"]}, {"id": "Text Sentiment Analysis", "input": ["Audio-to-Text"]}], "links": [{"source": "Audio Noise Reduction", "target": "Audio Splicer"}, {"source": "Audio Noise Reduction", "target": "Audio-to-Text"}, {"source": "Audio Splicer", "target": "Audio Effects"}, {"source": "Audio-to-Text", "target": "Text Sentiment Analysis"}, {"source": "Image-to-Text", "target": "Text Search"}]} |
Table 12. Back-Instructed on DAG with Multiple Disconnected DAGs
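As a concrete illustration of this sampling behavior, the sketch below draws an induced subgraph from a tool graph and counts its weakly connected components, each of which corresponds to one independent group of tool invocations. It is a minimal sketch assuming a `networkx` representation, not the paper's actual implementation; the tool names and edges are taken from the Table 12 example.

```python
import random
import networkx as nx

def sample_dag_subgraph(tool_graph: nx.DiGraph, num_tools: int, seed: int = 0):
    """Sample an induced subgraph of `num_tools` tools; if the full tool graph is
    a DAG, the subgraph is too, and it may split into several disconnected DAGs."""
    rng = random.Random(seed)
    nodes = rng.sample(list(tool_graph.nodes), num_tools)
    sub = tool_graph.subgraph(nodes).copy()
    # Each weakly connected component is one independent set of tool invocations,
    # i.e., one sub-command in the back-instructed instruction.
    components = list(nx.weakly_connected_components(sub))
    return sub, components

# Tool graph edges from the Table 12 example (source feeds the target).
g = nx.DiGraph([
    ("Audio Splicer", "Audio Noise Reduction"),
    ("Audio Noise Reduction", "Audio Effects"),
    ("Audio Splicer", "Audio-to-Text"),
    ("Image-to-Text", "Text Sentiment Analysis"),
    ("Text Sentiment Analysis", "Text Search"),
])
sub, comps = sample_dag_subgraph(g, num_tools=g.number_of_nodes())
print(len(comps), "independent tool group(s)")  # 2: the audio chain and the image/text chain
```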
[Q2] Are tool graphs with more nodes more challenging? I think the analysis of the correlation between the difficulty of the test cases and their numbers of nodes is interesting.
Response:
Yes. The difficulty LLMs face in task automation is positively correlated with the number of tool nodes in the tool graph. To clearly show the correlation between the number of nodes in the tool graph and the performance of LLMs in task automation, we conducted a detailed analysis, shown in Table 13. This analysis includes various metrics such as node set accuracy, edge set accuracy, and graph accuracy, which measure the exact-match accuracy of the node set, edge set, and the entire graph, respectively.
| # tool nodes | # supports | node set accuracy | edge set accuracy | graph accuracy |
|---|---|---|---|---|
| 1 (single) | 2059 | 96.16 | - | 96.16 |
| 2 | 278 | 86.33 | 84.53 | 84.53 |
| 3 | 1313 | 67.93 | 60.70 | 60.39 |
| 4 | 1280 | 64.29 | 75.62 | 54.37 |
| 5 | 731 | 54.03 | 70.53 | 41.58 |
| 6 | 290 | 50.34 | 39.31 | 39.31 |
| 7 | 151 | 49.66 | 36.42 | 36.42 |
| 8 | 60 | 35.00 | 25.00 | 25.00 |
| 9 | 55 | 38.18 | 21.81 | 21.81 |
| 10 | 64 | 39.06 | 31.25 | 31.25 |
| overall | 6281 | 73.52 | 67.55 | 67.25 |
Table 13. Task automation performance with different number of tools on gpt-4
From Table 13, we observed that as the number of tools in the tool graphs increases, there is a clear downward trend in various performance metrics such as node set accuracy, edge set accuracy, and graph accuracy. This trend confirms that tool graphs with a higher number of tools present more complex challenges for LLMs in task automation. Specifically, the result shows a significant drop in performance metrics when moving from simpler graphs (1-2 tools) to more complex ones (3 or more tools). For instance, while single-node graphs achieve a high graph accuracy of 96.16%, this metric falls to 39.31% for graphs with 6 tools and further decreases to 25.00% for 8-node graphs. This correlation between the number of tools and the difficulty of the test cases can be attributed to the increased complexity in understanding and processing more extensive and intricate links between tools. As the number of tools grows, LLMs must handle a larger set of possible dependencies, which significantly challenges their predictive and analytical capabilities. The results from this analysis underline the importance of continuous advancements in LLM capabilities to be consistent with the increasing complexity of tasks in various domains.
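For reference, the sketch below shows one plausible way these exact-match metrics could be computed from predicted and gold tool invocation graphs; the dictionary layout follows the `nodes`/`links` format shown in Table 12, and the paper's actual matching rules may differ in detail.

```python
def graph_sets(graph: dict):
    """Node set and edge set of a tool invocation graph given as
    {"nodes": [{"id": ...}, ...], "links": [{"source": ..., "target": ...}, ...]}."""
    nodes = {n["id"] for n in graph.get("nodes", [])}
    edges = {(l["source"], l["target"]) for l in graph.get("links", [])}
    return nodes, edges

def exact_match_metrics(predictions, references):
    """Node set / edge set / whole-graph exact-match accuracy over paired samples."""
    node_hits = edge_hits = graph_hits = 0
    for pred, gold in zip(predictions, references):
        pn, pe = graph_sets(pred)
        gn, ge = graph_sets(gold)
        node_hits += pn == gn
        edge_hits += pe == ge
        graph_hits += (pn == gn) and (pe == ge)
    total = len(references)
    return {"node set accuracy": node_hits / total,
            "edge set accuracy": edge_hits / total,
            "graph accuracy": graph_hits / total}
```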
I appreciate your response. Is the dataset in the supplementary materials updated? If so, I would do another pass and will consider raising the score based on the updated dataset.
Thanks for your constructive comments, which have significantly improved the quality of our datasets. We have completed human verification of the datasets and already updated them in the folder datasets_human_verified in our supplementary materials. Your insights are invaluable to us, and we are eager to address any remaining issues.
Thanks! I checked the updated dataset very quickly. There are still many errors, and even though they may only take up around 10% of the dataset, it is still unacceptable for a benchmark. (These errors can be detected simply by rule-checking.)
For example, many arguments are wrong in the Hugging Face dataset:
Image-to-Image requires 1 args only 2 are provided Required: ['image'] Predicted: ['example.jpg', 'target.png']
Text-to-Video requires 1 args only 2 are provided Required: ['text'] Predicted: ['The life cycle of a butterfly is truly fascinating. From a tiny egg to a beautiful butterfly, the journey is full of incredible transformations.', 'example.jpg']
Text-to-Video requires 1 args only 2 are provided Required: ['text'] Predicted: ['<node-3>', 'example.mp4']
Translation requires 1 args only 2 are provided Required: ['text'] Predicted: ['<node-0>', '<user_question>']
Document Question Answering requires 2 args only 1 are provided Required: ['image', 'text'] Predicted: ['<node-2>']
Thanks for your suggestions. We have further designed rules to filter samples based on the parameter-count mismatches you pointed out:
- On the DailyLife APIs domain, the ratio of error to total samples is: 30 / 4350 = 0.7 %.
- On the Hugging Face Tools domain, the ratio of error to total samples is: 558 / 8104 = 6.8%.
- On the Multimedia Tools domain, the ratio of error to total samples is: 303 / 5887 = 5.1%.
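A parameter-count rule of this kind can be written in a few lines. The sketch below assumes each sample stores its tool calls under `task_nodes` with an `arguments` list (as in the tool invocation graphs quoted elsewhere in this thread) and that each tool schema lists its required argument names; the `required` field name is illustrative, not necessarily the exact one used in TaskBench.

```python
def has_argument_count_mismatch(sample: dict, tool_schemas: dict) -> bool:
    """Return True if any tool call provides a different number of arguments
    than the tool's schema requires (illustrative field names)."""
    for node in sample.get("task_nodes", []):
        required = tool_schemas.get(node["task"], {}).get("required", [])
        if len(node.get("arguments", [])) != len(required):
            return True
    return False

# Example drawn from the reviewer's report: Image-to-Image requires one
# 'image' argument but two values are provided, so the sample is flagged.
schemas = {"Image-to-Image": {"required": ["image"]}}
sample = {"task_nodes": [{"task": "Image-to-Image",
                          "arguments": ["example.jpg", "target.png"]}]}
print(has_argument_count_mismatch(sample, schemas))  # True -> filter or fix this sample
```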
We have updated the dataset in the supplementary material. We appreciate the thorough examination you provided, which has significantly enhanced the quality of our dataset. Your insights are invaluable to us, and we are committed to addressing any remaining issues.
Thanks for your continued constructive comments on our paper. We have updated our dataset by further refining it with rule-based checks and human verification. Since the rebuttal deadline is approaching, we look forward to your feedback. We are also willing to address any remaining concerns.
Thank you again for your time and effort in reviewing.
The paper proposes a new benchmark for evaluating large language models' capabilities in completing user requests by utilizing external tools. The benchmark is constructed by first representing a collection of tools as a graph, sampling subgraphs representing valid tasks, and finally back-instructing GPT-4 in generating corresponding user requests based on the sampled subgraphs. The evaluation shows that current LLMs can still struggle in predicting correct tool-use plans for complex tasks.
Strengths
- The paper introduces a timely benchmark for evaluating LLMs' tool-use capabilities. It's a good effort to make the evaluation more standardized since there is an increasing volume of work in this area.
- Representing tools as graphs is interesting, since the graph structure allows more diverse and complex tasks. The authors also include 3 sets of tools for different scenarios.
- Back-instruct is a natural approach to construct the benchmark. However, I have concerns on the resulting data quality (see below).
Weaknesses
- One of my concerns is that, depending on the subgraph sampling procedure (which is not clearly described in the paper), a task might be unnatural (deviating from what users would ask in the real world) while still being valid. For example, one can always make a complex task by combining many tools "Tool 1" --> "Tool 2" --> ...."Tool N", even though such a task is rarely encountered in the real world. Does the sampling procedure take "task naturalness" into account? Also, I'd suggest including some actual examples in the paper.
- While back-instructing to generate user requests is an intuitive approach, this procedure relies on GPT-4 and there is no guarantee that the generated data is correct (e.g., many examples are filtered out simply by rule-based critics, as mentioned). Even if a more advanced LLM-based critic is used, there may still be wrong examples. Since the dataset is intended for evaluation, data quality is of the utmost importance. Without further (potentially manual) verification, the reliability of the benchmark is not very convincing.
- As discussed in the related work, there are several parallel efforts in creating benchmarks for LLM tool usage, and perhaps the most related is ToolBench [1]. Can the authors provide a more detailed discussion/comparison to that work and highlight the contributions of this work?
[1] Tool learning with foundation models. Qin et al. 2023.
Questions
- Tool graph construction: is there an example on the constructed tool graph before subsampling?
- In Figure 3, GPT-4 almost reaches 100% performance; will the benchmark be saturated soon, perhaps when the next generation of LLMs comes out?
Thanks for your insightful comments. Below are our responses to your concerns:
[W1] "task naturalness" in sampling procedure? And add more actual examples in the paper.
Response:
Thanks for your question. Here we mainly consider "task naturalness" in the instruction generation stage, since it is currently challenging to account for "task naturalness" in sampling without any human supervision. Therefore, to improve naturalness, we adopt LLM-based and rule-based critics to filter out instances that rarely occur in the real world. We will follow your suggestion to introduce "task naturalness" into the sampling stage by using LLM critics. In addition, following your and other reviewers' suggestions, we have conducted human verification on our datasets to detect errors (please see [G1] & [G2] of the General Response) and further improve their quality. We have also presented concrete samples in [G1] of the General Response and added them to Appendix A.4.1 of the revised version.
[W2] Further (manual) verification to verify the reliability of the benchmark.
Response:
Thanks for your suggestions. We agree that further human verification makes the benchmark more reliable. We have conducted human verification and analyzed error patterns in our datasets (please see [G1] and [G2] of the General Response). Besides, compared with previous datasets (e.g., ToolBench), which all rely on instruction-generation methods, we additionally applied LLM-based and rule-based critics (Table 1) to filter out low-quality samples and reported the agreement between LLM rankings under human evaluation and under TaskBench (Table 7). Due to the limited rebuttal period, we will continue to improve the reliability of our benchmark using both subjective and objective evaluations.
[W3] More discussions or comparisons to the concurrent work (ToolBench).
Response:
Thanks for the suggestion. Due to the page limit, we included some discussion of concurrent work in the appendix of the initial version. To better illustrate the differences between TaskBench and other concurrent work, we have added in-depth comparisons in Appendix A.2 of our latest version from the perspective of datasets and evaluations. Please see [G3] in the General Response for the differences and the advantages of our method.
[Q1] Is there an example of the constructed tool graph before subsampling?
Response:
Thanks for your question. In Appendix A.10 of the revised version, we have added a visualization of the constructed tool graph in the Multimedia domain for reference, where nodes denote tools and directed edges indicate that the output type of the source tool matches the input type of the target tool.
[Q2] In Figure 3, GPT-4 almost reaches 100% performance. Will the benchmark be saturated soon, perhaps when the next generation of LLMs comes out?
Response:
Thanks for your question. We suspect this is because our evaluation dataset is also generated by GPT-4, so GPT-4's performance is likely overestimated when it is also the model under test. Therefore, we also involve human evaluation to validate the dataset's quality. Moreover, we believe that when more powerful LLMs come out, the datasets generated by those LLMs will be better and demonstrate better alignment with human preferences compared with GPT-4.
Thank you to the authors for the response. I appreciate the authors' detailed response and efforts in further verifying the data quality. As data quality is my main concern and the manual verification is still ongoing, I will maintain my original score.
Thanks for your constructive comments, which have significantly improved the quality of our datasets. We have completed human verification of the datasets and already updated them in the folder datasets_human_verified in our supplementary materials. Your insights are invaluable to us, and we are eager to address any remaining issues.
Thanks for your constructive suggestions on our paper. Following your suggestions and other reviewers' comments, we have updated our dataset by further refining it with rule-based checks and human verification. Since the rebuttal deadline is approaching, we look forward to your feedback and examination of TaskBench. We are also willing to address any remaining concerns.
Thank you again for your time and effort in reviewing.
The paper generates a tool usage dataset for evaluating LLM-based autonomous agents. They introduce Tool Graph to represent the decomposed tasks, and adopt a back-instruct method to generate instructions. The evaluation is conducted from task decomposition, tool invocation, and parameter prediction.
Strengths
- The paper is well-written and easy to read.
- The paper studies an important question, which is very interesting in the era of autonomous AI agents.
Weaknesses
- The paper focuses on evaluating the task-solving (agent) capabilities of different LLMs. Although the experiments are sufficient, the analyses are partial. For instance, what contributes to the performance of different LLMs? What are the intrinsic differences between existing LLMs in performing agent tasks? Which findings can we derive to improve the capabilities of current open-source LLMs? Personally, I find the insights from the evaluation results somewhat shallow.
- For an evaluation paper, I think more diverse LLMs should be evaluated as well, such as Claude-2. I also expect a case study to show the performance gap among different LLMs.
- Missing discussion of a very relevant work, ToolLLM [1]. The workflow is quite similar: preparing high-quality tools (or APIs), back-generating instructions that involve these APIs, and annotating how LLMs solve these instructions. The main difference is that this paper introduces the concept of a Tool Graph to organize the structure of tools/APIs. Though ToolLLM can be considered concurrent work, I think the authors should discuss their core differences and the unique advantages of this submission (experiments, findings, etc.).
[1] Qin Y, Liang S, Ye Y, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis[J]. arXiv preprint arXiv:2307.16789, 2023.
Questions
Please respond to the weaknesses and provide more analysis of the experimental results.
[W2] Evaluating a Broader Range of LLMs, Including Claude-2.
Response:
Thank you for your valuable suggestion. In response, we have incorporated the evaluation results of Claude-2 into our analysis. From Tables 5-9, we observe that Claude-2 exhibits superior performance compared to gpt-3.5-turbo, yet it does not match the capabilities of GPT-4. Due to the limited time of the rebuttal period, we will add more diverse LLMs to our paper in the final version.
| | R1↑ | R2↑ | BsF↑ |
|---|---|---|---|
| Multimedia | 49.19 | 23.71 | 89.22 |
| Hugging Face | 44.49 | 21.37 | 88.71 |
| Daily APIs | 81.28 | 68.06 | 95.65 |
Table 5. Task Decomposition Performance of Claude-2.
| | n-f1↑ | t-f1↑ | v-f1↑ |
|---|---|---|---|
| Multimedia | 66.15 | 53.57 | 23.79 |
| Hugging Face | 69.77 | 48.29 | 32.33 |
| Daily APIs | 79.73 | 78.14 | 60.10 |
Table 6. Tool Invocation and Parameter Prediction Performance of Claude-2 in node-structure Tasks.
| | n-f1↑ | e-f1↑ | t-f1↑ | v-f1↑ |
|---|---|---|---|---|
| Multimedia | 82.69 | 60.76 | 75.83 | 59.59 |
| Hugging Face | 80.07 | 48.61 | 67.10 | 45.81 |
| Daily APIs | 95.62 | 62.17 | 95.13 | 64.95 |
Table 7. Tool Invocation and Parameter Prediction Performance of Claude-2 in chain-structure Tasks.
| | n-f1↑ | e-f1↑ | t-f1↑ | v-f1↑ |
|---|---|---|---|---|
| Multimedia | 85.47 | 63.42 | 77.70 | 60.61 |
| Hugging Face | 83.17 | 52.66 | 69.91 | 47.93 |
| Daily APIs | 93.76 | 55.96 | 92.49 | 64.69 |
Table 8. Tool Invocation and Parameter Prediction Performance of Claude-2 in DAG-structure Tasks.
| | n-f1↑ | e-f1↑ | t-f1↑ | v-f1↑ |
|---|---|---|---|---|
| Multimedia | 81.07 | 56.62 | 73.08 | 54.40 |
| Hugging Face | 79.15 | 45.19 | 64.52 | 43.87 |
| Daily APIs | 93.73 | 57.66 | 92.89 | 64.47 |
Table 9. Tool Invocation and Parameter Prediction Performance of Claude-2 in All Tasks.
[W3] A case study to show the performance gap among different LLMs.
Response:
Thanks for your suggestion. We have added a case study to show the performance differences between different LLMs; please refer to General Response [G2] and Appendix A.4.2 in the revised version.
[W4] Discussion with ToolLLM.
Response:
Thanks for the suggestion. Due to the page limit, we included some discussion of concurrent work in the appendix of the initial version. To better illustrate the differences between TaskBench and other concurrent work, we have conducted in-depth comparisons in Appendix A.2 of the latest version from the perspective of datasets and evaluations. Please see [G3] in the General Response for the differences and the advantages of our method.
Thanks for providing the additional experiments and analyses. But considering the data quality issues raised by two reviewers, I will maintain my score.
Thanks for your constructive comments, which have significantly improved the quality of our datasets. We have completed human verification of the datasets and already updated them in the folder datasets_human_verified in our supplementary materials. Your insights are invaluable to us, and we are eager to address any remaining issues.
Thanks for your constructive suggestions on our paper. Following your suggestions and other reviewers' comments, we have updated our datasets by further refining them with rule-based checks and human verification. Since the rebuttal deadline is approaching, we look forward to your feedback and examination of TaskBench. We are also willing to address any remaining concerns.
Thank you again for your time and effort in reviewing.
Thanks for your insightful comments. Below are our responses to your concerns:
[W1] More Analyses of evaluation results.
Response:
Thank you for your feedback. We acknowledge your concerns regarding the perceived partiality in our analyses. To resolve your concerns, we have conducted further analyses of the aspects you highlighted:
1) Factors Contributing to Task Automation Performance:
- Reasoning: The capacity for complex problem-solving and reasoning varies significantly among LLMs and is crucial for understanding and executing complex tasks. In mathematical and coding tasks, GPT-series models show stronger reasoning ability; accordingly, they also possess stronger capabilities in task planning and tool utilization.
- Instruction Following: The proficiency in comprehending and adhering to user instructions differs across models, influencing task execution efficiency. Our analysis reveals that models fine-tuned with instruction-following settings, such as Vicuna-13b, WizardLM-13b, and Nous-Hermes-13b, outperform the baseline Llama-2-13b model in task automation. Specifically, we observe that WizardLM-13b performs better than Vicuna-13b since it is fine-tuned on more complex instructions. These results also demonstrate the importance of instruction following.
2) Intrinsic Differences of LLMs in Performing Agent Tasks:
- Code Pre-training: We note that models with more code pre-training (e.g., Code-Llama) outperform other open-source LLMs in task automation. Experimental results show an average improvement of 4.45% in tool prediction (n-f1) and 12.76% in parameter prediction (v-f1) across various domain datasets. We conclude that since task automation usually involves multiple stages and tools, using code-style/structured text as the interface between stages is beneficial.
- Alignment Techniques: In addition, models with human alignment (e.g., RLHF), such as the GPT-series models, demonstrate stronger task automation capabilities than open-source LLMs. These results also indicate that RLHF enables large language models to develop more generalized reasoning abilities, reducing overfitting to specific instructions.
Based on our analysis, we propose the following improvements for open-source LLMs:
- Enhanced Code Pre-training: Strengthening code pre-training can significantly boost the performance of LLMs in task automation.
- Quality of Instructions: Developing high-quality (complex and diverse) instructions is crucial for crafting more efficient LLMs.
- Ongoing Knowledge Acquisition: Encouraging continuous learning and knowledge updating can make LLMs more adaptable and proficient.
We have incorporated these additional analyses in Section 3.5 of the revised version to offer a more comprehensive view of the capabilities and development of LLMs in task automation. This should provide a deeper understanding and address the concerns you have raised about the initial shallowness of our insights.
This work focuses on creating a benchmark that evaluates task automation for large language models. A common practice is to formulate it into three critical stages: task decomposition, tool invocation, and parameter prediction. The main contributions of the work are (1) the dataset creation, denoted TaskBench, based on the aforementioned formulation, and (2) the ability to evaluate performance on these different aspects effectively and quantitatively using the created dataset. For dataset creation, to facilitate construction, the authors introduce the concept of a Tool Graph to represent the connections/dependencies among the decomposed tasks. Three resources are leveraged for collecting tools: Hugging Face (e.g., Summarization), Multimedia (e.g., Text-to-Video), and Daily Life APIs (e.g., stock operations). With the pre-defined tools, the authors formulate three patterns for tool invocation: Node, Chain, and DAG (directed acyclic graph). From the diverse sampled subgraphs, the back-instruct method is used to inversely craft user instructions, task steps, and tool invocation graphs. For task evaluation, the different steps are evaluated: Rouge-* and BertScore are used for the textual descriptions in task decomposition, and F1 is used for tool invocation and tool parameter prediction. Experimental results demonstrate that TaskBench can be effectively utilized to evaluate the task automation ability of LLMs.
Strengths
This work covers more diverse tools than some related work. For example, ToolQA defined 13 tools, mainly for accessing external knowledge; in this work, the authors consider three tool resources, including Hugging Face, Multimedia, and Daily Life APIs, with 103 tools in total. Furthermore, designing an LLM-based critic and a rule-based critic to evaluate the consistency of the generated tool invocation graphs with the sampled tool subgraphs, without too much human effort, is a nice idea. The experimental results are also interesting: in the zero-shot setting, the OpenAI models significantly outperform the open-source LLMs, but in the few-shot setting, code-llama gets closer to the OpenAI models. To my understanding, this may indicate that the OpenAI models did pretty good SFT and RLHF to make the models understand the instructions/task better, which aligns with the finding from "In-Context Learning Creates Task Vectors".
Weaknesses
Although there is a section discussing the positive correlation of the proposed evaluation with human assessment, and a section about using LLM- and rule-based checks to verify the alignment between the generated data and the sampled tool subgraphs, in terms of data quality it would be more insightful to show the quality as measured by humans. This would also show the quality of the self-critics, whether LLM-based or rule-based. If the authors could provide more examples (predictions and gold answers) in the appendix, it would help readers understand the difficulty of each task, and some error cases/analysis in the main text would also be useful. Otherwise, the numbers alone cannot provide much information about this dataset.
Questions
Compared to node prediction, edge prediction is harder based on the evaluation results. What are the error types for edge prediction? Does the model make equivalent edge predictions but with different connectivity?
Thanks for your insightful comments. Below are our responses to your concerns:
[W1] In terms of data quality, it would be more insightful to show the quality measured by humans.
Response:
Thank you for your suggestion. In the latest version, we have conducted human evaluations on TaskBench to demonstrate its quality in terms of several aspects, including the Naturalness and Complexity of the instructions, and Alignment of the tool invocation graphs. Please see Appendix A.3. The results show that our Back-Instruct strategy produces high-quality user instructions with annotated and executable tool graphs. This approach generates data that is better aligned with human preferences. For more details about the evaluation results, please refer to [G1] in General Response.
[W2] More examples (prediction and gold answers) to understand the difficulty of each task and error cases.
Response:
Thanks for your suggestions. To resolve your concerns, we have added more examples in Appendix A.4 to illustrate the difficulty and error cases in conducting task automation based on TaskBench. Please also see our response in [G2] for case studies and error analysis.
[Q1] Compared to the node prediction, the edge prediction is harder based on the evaluation results. What are the error types for edge prediction?
Response:
Thanks for your question. Yes, edge prediction is more challenging than node prediction because it requires accurate prediction of both the source and target nodes of an edge. We summarize the main error types in edge prediction into two categories:
- Mismatch of resource types between source node output and target node input.
- Missing edge prediction between nodes where a resource or temporal dependency exists.
We analyzed 100 randomly sampled edge prediction failures and found that 27% and 38% of the errors belong to the above two types, respectively.
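To make these two categories concrete, the sketch below shows one way edge-prediction failures could be tallied automatically: missing edges are gold dependencies absent from the prediction, and type mismatches are spurious predicted edges whose source output types do not overlap the target input types. The per-tool type fields follow the `input-type`/`output-type` format of the sampled nodes in Table 12; the categorization itself is an illustration, not the paper's exact procedure.

```python
def categorize_edge_errors(pred_edges, gold_edges, tool_types):
    """Split edge-prediction failures into the two categories described above.

    pred_edges / gold_edges : sets of (source, target) tool-name pairs.
    tool_types              : {"Tool": {"input-type": [...], "output-type": [...]}}.
    """
    missing = gold_edges - pred_edges  # a real dependency the model failed to predict
    mismatched = {
        (s, t) for (s, t) in pred_edges - gold_edges
        if not set(tool_types[s]["output-type"]) & set(tool_types[t]["input-type"])
    }                                  # predicted link whose resource types are incompatible
    return {"missing_edge": missing, "type_mismatch": mismatched}

# Example: the model misses Image-to-Text -> Text Search and instead links
# Image-to-Text to Audio Noise Reduction, which only accepts audio input.
types = {"Image-to-Text": {"input-type": ["image"], "output-type": ["text"]},
         "Text Search": {"input-type": ["text"], "output-type": ["text"]},
         "Audio Noise Reduction": {"input-type": ["audio"], "output-type": ["audio"]}}
print(categorize_edge_errors({("Image-to-Text", "Audio Noise Reduction")},
                             {("Image-to-Text", "Text Search")}, types))
```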
[Q2] Does the model make equivalent edge prediction but different connectivity?
Response:
Generally, if the model makes an equivalent edge prediction, the nodes of that edge keep the same connectivity. Moreover, such a scenario is rare in our dataset since we do not involve repeated calls to a specific tool, so edge predictions over distinct tool nodes are unique.
We sincerely thank each reviewer for providing their constructive comments, which are very helpful in improving our paper. Below are our modifications to the latest version:
- Following suggestions from Reviewers LVvj, peHw, and CSek, we have added human evaluations of the TaskBench dataset in Appendix A.3.
- Following suggestions from Reviewers LVvj, BEYM, and CSek, we included a case study and error analysis of the TaskBench dataset construction in Appendix A.4.1.
- Following suggestions from Reviewers LVvj and BEYM, we have added a case study and error analysis of LLMs in task automation in Appendix A.4.2.
- We have added Appendix A.2 to provide a detailed comparison of our benchmark with ToolBench.
- Following the suggestion from Reviewer BEYM, we have expanded the experimental analysis in Section 3.5.
- The analysis in the few-shot setting has been moved to Appendix A.6.2.
- We have added a visualization of the tool graph in A.10.
- We have included experimental results in Appendix A.8.1 to demonstrate the relationship between task automation performance and the number of tools.
- We have added the task automation performance of Claude-2 in Appendix A.7.
Dear reviewers,
We sincerely thank each reviewer for providing their constructive comments, which are very helpful in improving our paper. Below are the common concerns of the reviewers and our responses to them:
[G1] Need more human evaluation and case studies on the dataset
Response:
Thanks for the constructive comments of each reviewer. Following your suggestions, we have conducted in-depth human evaluations based on our generated samples and put them in Appendix A.3 of the revised version. To better assess the quality of the datasets constructed by back-instruct, we design three metrics in our evaluation criteria, where two are used to measure the quality of instructions and one is used to evaluate tool invocation graphs:
- Instructions
- Naturalness: This metric measures the reasonableness of the instructions, including the commonality of dependencies between tools and their alignment with real-world needs.
- Complexity: This metric assesses the complexity of the instructions, considering factors such as task depth, the number of involved tools, and the relationships between these tools.
- Tool Invocation Graphs
- Alignment: Building upon the Feasibility metric, this measures how well the tool invocation graphs align with the instructions, i.e., whether the tool invocation graphs can effectively address the user's commands.
Each metric ranges from 1 to 5, and we design these metrics to assess the effectiveness and faithfulness of our TaskBench in task automation.
To make a fair comparison, we compare our Back-Instruct with two additional baselines:
- Back-Instruct (Ours): we sample tool subgraphs and then backtranslate to instructions and further refine the tool invocation graph.
- Back-Instruct w/o edges: compared with our back-instruct, we eliminated edges from our sampled tool subgraphs, preserving only the tool node information in the prompt.
- Self-Instruct: based on manually labeled demonstrations and all tools with descriptions, we directly employed GPT-4 to autonomously select tools and then generate the instructions with tool invocation graphs.
During the human evaluation, we randomly selected 50 samples from TaskBench and invited three domain experts to assess their quality. To ensure a fair and unbiased evaluation, all samples were anonymized. We provided canonical samples for the experts to calibrate their criteria during annotation and report the average of all experts' ratings as the final result. All results can be found in Table 1. We observed that all methods (self-instruct or back-instruct) can guarantee alignment. However, our method, Back-Instruct, scored highest in Naturalness and Complexity. We attribute this superiority to the realistic resource and temporal dependencies in the sampled tool subgraphs, which allow us to generate more natural instructions in complex scenarios (e.g., multi-tool utilization).
| Methods | Naturalness↑ | Complexity↑ | Alignment↑ | Overall↑ |
|---|---|---|---|---|
| Back-Instruct | 3.89 | 4.01 | 3.66 | 3.85 |
| Back-Instruct w/o edges | 3.44 | 3.27 | 3.62 | 3.44 |
| Self-Instruct | 2.18 | 2.01 | 3.64 | 2.61 |
Table 1. Human evaluation (rating from 1 to 5) on samples constructed by different methods. Average score rating from three human experts.
[G3] Comparison with ToolBench [1-2]
Response:
Following the suggestions of Reviewers BEYM and peHw, we also conducted a detailed analysis of the differences between ToolBench and TaskBench (please see Appendix A.2). We summarize these differences from the perspectives of datasets and evaluations:
- Datasets: Although both ToolBench and TaskBench adopt instruction-generation methods to synthesize samples, the mechanisms of the two methods are different. ToolBench randomly samples some APIs and then prompts LLMs to generate the final instructions. As a result, it cannot obtain ground-truth annotations directly and has to introduce Solution Path Annotation, which uses a depth-first-search-based decision tree to obtain answers. Such a strategy brings new issues, such as cost and hallucination or bias in the answer annotations. In contrast, due to the design of the Tool Graph, TaskBench demonstrates several advantages compared with ToolBench:
- Efficiency: Our method requires only one API call to generate a complete sample (creating instructions, generating a tool invocation graph, and checking). In contrast, ToolLLM requires one API call to generate instructions and an average of four inference steps during DFS for annotation. Additionally, ToolLLM uses few-shot learning, which consumes more tokens than our zero-shot approach.
- Higher Quality of Instructions: Our tool graph includes potential tool invocation relationships (resource and temporal dependencies). Based on the tool graph, the instructions generated by ChatGPT are more in line with actual human behavior. As we demonstrated in our [G1] human evaluation, the generation based on a tool graph with edges significantly improves Naturalness, Complexity, and Alignment compared to generation without edges.
- Stability: ToolLLM's generated instructions might not cover all of the sampled APIs. Our method not only covers all sampled tools but also strictly follows the dependencies between tools. We can control instruction generation through the sampling process, such as the number of tools and the topology of the tool graph.
- Reliable Annotations: In ToolLLM, instruction generation and tool invocation annotation are independent processes. In our approach, the sampled tool graph can be directly utilized as the annotation for the final assessment. Hence, our Back-Instruct directly generates instructions with better alignment to the annotations and does not need to generate answers separately. Besides, we also utilize the consistency between the sampled tool graph and the generated tool invocation graph to further filter out low-quality annotations, ensuring high-quality tool graph annotations.
- Evaluation: ToolBench introduces ToolEval in its paper. However, ToolEval directly follows AlpacaEval, using win rate and pass rate to evaluate the capability of LLMs, and these cannot reveal the specific characteristics of LLMs in tool utilization and planning. Our evaluation encompasses multiple aspects, including task decomposition, tool prediction, and parameter prediction. For each aspect, we have developed multiple well-designed metrics to assess the different capabilities required for task automation.
[1] Tool learning with foundation models. Qin et al. 2023.
[2] ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. Qin et al. 2023.
[G2] Need more error analysis and case studies on dataset construction and predictions
Response:
Thank you for the insightful comments. In response, we have conducted a detailed error analysis on both the TaskBench dataset and the predictions made by various LLMs. Our findings are outlined below:
- Error Analysis on TaskBench Dataset
Despite the advanced instruction generation and labeling capabilities of GPT-4, we admit that it is challenging to guarantee the correctness of all generated samples. To better understand our dataset and assess its accuracy, following the suggestions of Reviewers LVvj and CSek, we conducted human evaluations to provide a thorough error analysis. We first randomly sampled 148 samples, and our labeling team identified 18 erroneous samples (nearly 12%). We attribute these incorrect samples to five distinct error categories, and concrete examples of each are presented in Table 3:
- Incorrect Instructions
- Incomplete instructions: This error occurs when the instructions lack the necessary details or resources for successful completion.
- Impractical instructions: The instructions could be irrational or impossible to execute with the capabilities of current tools.
- Parameter Errors
- Mismatched parameter types: This error occurs when the parameters provided do not match the expected types for the used tool.
- Incorrect parameter value: This error is evident when the values provided for the parameters are incorrect or not suitable for the task.
- Incorrect Tool Dependency: This error type refers to the incorrect linking or sequencing of tools required for a task.
| Error Type | Example |
|---|---|
| Incomplete instructions: missing external resources | Instruction: I have a long text and I would like to get a summarized version of it, then generate an image that represents the main idea of the summarized text. |
| Impractical instructions: irrational tool invocation patterns | Instruction: I have a text: 'This training vid is amazing! Speed it up by 1.5x please!'. Analyze the sentiment, expand it, find the video URL and adjust the video speed. |
| Mismatched parameter types | Instruction: I want to find articles related to climate change and analyze their sentiment. Please translate non-English articles to English. Tool invocation graph: {"task_nodes": [{"task": "Text Search", "arguments": ["climate change"]}, {"task": "Text Sentiment Analysis", "arguments": ["<node-0>"]}, {"task": "Text Translator", "arguments": ["<node-1>"]}], "task_links": [{"source": "Text Search", "target": "Text Sentiment Analysis"}, {"source": "Text Sentiment Analysis", "target": "Text Translator"}]} |
| Incorrect parameter value | Instruction: I have two audio files from online lectures at the following URLs: 'example1.wav' and 'example2.wav'. I want them combined into a single audio file, transcribe the speech into text, and check the text for grammatical errors. Tool invocation graph: {"task_nodes": [{"task": "Audio Downloader", "arguments": ["example1.wav", "example2.wav"]}, {"task": "Audio Splicer", "arguments": ["<node-0>"]}, {"task": "Audio-to-Text", "arguments": ["<node-1>"]}, {"task": "Text Grammar Checker", "arguments": ["<node-2>"]}], "task_links": [{"source": "Audio Downloader", "target": "Audio Splicer"}, {"source": "Audio Splicer", "target": "Audio-to-Text"}, {"source": "Audio-to-Text", "target": "Text Grammar Checker"}]} |
| Incorrect Tool Dependency | Instruction: I want to create a more engaging version of this short text: 'Join us for a fun-filled evening!' and find some videos related to its sentiment. Tool invocation graph: {"task_nodes": [{"task": "Article Spinner", "arguments": ["<node-2>"]}, {"task": "Text Expander", "arguments": ["Join us for a fun-filled evening!"]}, {"task": "Text Sentiment Analysis", "arguments": ["<node-1>"]}, {"task": "Video Search", "arguments": ["<node-2>"]}], "task_links": [{"source": "Text Expander", "target": "Text Sentiment Analysis"}, {"source": "Text Sentiment Analysis", "target": "Article Spinner"}, {"source": "Text Sentiment Analysis", "target": "Video Search"}]} |
Table 3. Error Analysis on Back-Instruct Dataset.
Based on these observed errors, we conclude that it is necessary to build a more elaborate prompt (e.g., with more detailed tool-use specifications and demonstrations) to describe tool parameters and tool dependencies when generating the tool invocation graph. Besides, we will introduce additional quality criteria, beyond our rule-based and LLM-based critics, to continuously improve our dataset.
- Error Analysis of different LLMs in predicting tool invocation graph
Moreover, we also analyze the failures of different LLMs in tool invocation graph parsing that occur during task automation inference. These failures can be categorized into three main groups: incorrect tool names, incorrect tool dependencies, and mismatched tool parameters. For our analysis, we randomly selected 50 predictions, and the distribution of each error type across different LLMs is detailed in Table 4. We observed that:
- gpt-4 demonstrates the fewest errors in all categories, indicating a higher accuracy in tool invocation graph prediction.
- gpt-3.5-turbo and code-llama-13b show a progressively higher number of errors. Notably, the 'Tool Parameter Error' is the most common across all models, highlighting the challenge LLMs face in predicting accurate parameters for tools.
| | Required Tool Missing | Tool Dependency Error | Tool Parameter Error |
|---|---|---|---|
| gpt-4 | 0 | 2 | 3 |
| gpt-3.5-turbo | 2 | 8 | 11 |
| code-llama-13b | 4 | 9 | 13 |
Table 4. Error Distribution in Different LLMs
Further, we present specific cases to elucidate the nature of prediction errors in these LLMs. In the following example, gpt-4 correctly interpreted the task, underscoring its advanced task automation capabilities. Conversely, gpt-3.5-turbo and code-llama-13b omitted a critical tool ('Audio Downloader'), resulting in a 'Missing Required Tool' error. Additionally, code-llama-13b encountered compounded errors, including 'Tool Parameter Error' and 'Incorrect Tool Dependency'. We have added these cases to our revised version.
Instruction: I need an audio file downloaded from 'https://www.example.com/example.wav', then please reduce the background noise and apply a reverb effect according to my instruction 'reverb 50%'. Finally, combine it with the audio file 'example.wav'.
Gold Tool Invocation Graph: {"task_nodes": [{"task": "Audio Downloader", "arguments": ["https://www.example.com/example.wav"]}, {"task": "Audio Noise Reduction", "arguments": ["<node-0>"]}, {"task": "Audio Effects", "arguments": ["<node-1>", "reverb 50%"]}, {"task": "Audio Splicer", "arguments": ["<node-2>", "example.wav"]}]}
| LLM | Predicted Tool Invocation Graph | Error Type |
|---|---|---|
| gpt-4 | {"task_nodes": [{"task": "Audio Downloader", "arguments": ["https://www.example.com/example.wav"]}, {"task": "Audio Noise Reduction", "arguments": ["<node-0>"]}, {"task": "Audio Effects", "arguments": ["<node-1>", "reverb 50%"]}, {"task": "Audio Splicer", "arguments": ["<node-2>", "example.wav"]}]} | None |
| gpt-3.5-turbo | {"task_nodes": [{"task": "Audio Noise Reduction", "arguments": ["https://www.example.com/example.wav"]}, {"task": "Audio Effects", "arguments": ["<node-0>", "reverb 50%"]}, {"task": "Audio Splicer", "arguments": ["<node-1>", "example.wav"]}]} | Missing Required Tool |
| code-llama-13b | {"task_nodes": [{"task": "Audio Noise Reduction", "arguments": ["example.wav"]}, {"task": "Audio Effects", "arguments": ["<node-0>", "reverb 50%"]}, {"task": "Audio Splicer", "arguments": ["<node-1>", "<node-0>"]}]} | Missing Required Tool & Tool Parameter Error & Incorrect Tool Dependency |
Table 5. Case Studies of Prediction Errors in LLMs
Moreover, we present additional cases to intuitively show the differences among the three methods, as shown in Table 2. From these examples, we observe that our Back-Instruct generates examples with more comprehensive and interconnected tool usage, reflecting higher naturalness and complexity in instruction generation.
| Method | Tools | Instruction | Tool Invocation Graph |
|---|---|---|---|
| Back-Instruct | "sampled_nodes": [{"id": "Audio-to-Image", "input-type": ["audio"], "output-type": ["image"]}, {"id": "Image Colorizer", "input-type": ["image"], "output-type": ["image"]}, {"id": "Image Stitcher", "input-type": ["image", "image"], "output-type": ["image"]}, {"id": "Image Style Transfer", "input-type": ["image", "image"], "output-type": ["image"]}, {"id": "Image-to-Text", "input-type": ["image"], "output-type": ["text"]}], "sampled_links": [{"source": "Image Style Transfer", "target": "Image Stitcher"}, {"source": "Audio-to-Image", "target": "Image Colorizer"}, {"source": "Image Colorizer", "target": "Image Style Transfer"}] | I have an image (example.jpg) with some text on it and an audio file (example.wav). I'd like you to extract the text from the image, generate a visual representation of the audio file, colorize this visual representation using the colors from the original image, apply the visual style of the colorized image to the visual representation of the audio, and finally, stitch these two images together to create a panorama or collage. | {"nodes": [{"id": "Audio-to-Image", "input": ["example.wav"]}, {"id": "Image Colorizer", "input": ["Audio-to-Image"]}, {"id": "Image Stitcher", "input": ["example.jpg", "Image Style Transfer"]}, {"id": "Image Style Transfer", "input": ["Image Colorizer", "Audio-to-Image"]}, {"id": "Image-to-Text", "input": ["example.jpg"]}], "links": [{"source": "Audio-to-Image", "target": "Image Colorizer"}, {"source": "Image Colorizer", "target": "Image Style Transfer"}, {"source": "Image Style Transfer", "target": "Image Stitcher"}]} |
| Back-Instruct w/o edges | "sampled_nodes": [{"id": "Audio-to-Image", "input-type": ["audio"], "output-type": ["image"]}, {"id": "Image Colorizer", "input-type": ["image"], "output-type": ["image"]}, {"id": "Image Stitcher", "input-type": ["image", "image"], "output-type": ["image"]}, {"id": "Image Style Transfer", "input-type": ["image", "image"], "output-type": ["image"]}, {"id": "Image-to-Text", "input-type": ["image"], "output-type": ["text"]}], | I would like to create a stylized collage from 'example.jpg' and the sound of the 'example.wav' file. The visual style should be inspired by the similar image search result. The collage should be colorized, and I would like any text in the final image to be extracted. | {"nodes": [{"id": "Audio-to-Image", "input": ["example.wav"]}, {"id": "Image Colorizer", "input": ["Image Style Transfer"]}, {"id": "Image Search (by Image)", "input": ["example.jpg"]}, {"id": "Image Stitcher", "input": ["Audio-to-Image", "example.jpg"]}, {"id": "Image Style Transfer", "input": ["Image Stitcher", "Image Search (by Image)"]}, {"id": "Image-to-Text", "input": ["Image Colorizer"]}], "links": [{"source": "Audio-to-Image", "target": "Image Stitcher"}, {"source": "Image Colorizer", "target": "Image-to-Text"}, {"source": "Image Search (by Image)", "target": "Image Style Transfer"}, {"source": "Image Stitcher", "target": "Image Style Transfer"}, {"source": "Image Style Transfer", "target": "Image Colorizer"}]} |
| Self-Instruct | "sampled_nodes": [{"id": "Audio-to-Image", "input-type": ["audio"], "output-type": ["image"]}, {"id": "Image Colorizer", "input-type": ["image"], "output-type": ["image"]}, {"id": "Image Stitcher", "input-type": ["image", "image"], "output-type": ["image"]}, {"id": "Image Style Transfer", "input-type": ["image", "image"], "output-type": ["image"]}, {"id": "Image-to-Text", "input-type": ["image"], "output-type": ["text"]}], | I have a black and white image of an old newspaper (example.jpg) and I want to colorize it first, then apply the style of a modern newspaper (example.png) to it, and finally extract the text from the stylized image. | {"nodes": [{"id": "Image Colorizer", "input": ["example.jpg"]}, {"id": "Image Style Transfer", "input": ["Image Colorizer", "example.png"]}, {"id": "Image-to-Text", "input": ["Image Style Transfer"]}], "links": [{"source": "Image Colorizer", "target": "Image Style Transfer"}, {"source": "Image Style Transfer", "target": "Image-to-Text"}]} |
Table 2. Comparative analysis of Back-Instruct, Back-Instruct w/o edges, and Self-Instruct.
Thanks for all your constructive comments, which have significantly improved the quality of our datasets. We have completed the verification of our datasets, and updated them to the latest version (please see supplemental material).
Following the suggestions of each reviewer, we have conducted the following strategies to improve our datasets:
- We use LLM-based and Rule-based Critics to ensure the validity of the generated tool graphs.
- We invited 12 human annotators to evaluate the quality of each sample in our dataset, considering aspects such as grammatical correctness, completeness of instructions, executability of the tool invocation graphs, and their alignment. Based on these examinations, we first discarded cases that are severely incorrect or do not fit real-world scenarios; then, for cases with minor errors (e.g., syntax), annotators fixed them.
We report the statistics of the dataset processing in the following tables:
| Domain | #Samples | #Samples Checked by Critics (%) | #Samples Verified by Humans (%) |
|---|---|---|---|
| HuggingFace Tools | 12,217 | 8,457 (69.22%) | 8,104 (66.33%) |
| Multimedia Tools | 8,904 | 6,281 (70.54%) | 5,887 (66.13%) |
| Dailylife APIs | 7,150 | 5,432 (75.97%) | 4,350 (60.84%) |
| Domain | #Samples | #Checked by LLM-based Critics (%) | #Checked by Rule-based Critics (%) | #Checked by Both Critics (%) |
|---|---|---|---|---|
| Hugging Face Tools | 12,217 | 9,042 (74.01%) | 10,289 (84.22%) | 8,457 (69.22%) |
| Multimedia Tools | 8,904 | 6,959 (78.16%) | 7,363 (82.69%) | 6,281 (70.54%) |
| Dailylife APIs | 7,150 | 5,694 (79.63%) | 6,271 (87.70%) | 5,432 (75.97%) |
| Domain | #Samples Checked by Critics | #Correct Samples (%) | #Discarded (%) | #Fixed for Syntax (%) | #Fixed for Instructions (%) | #Fixed for Tool Invocation Graph (%) |
|---|---|---|---|---|---|---|
| Hugging Face Tools | 8,457 | 6,974 (82.46%) | 353 (4.17%) | 27 (0.32%) | 328 (3.87%) | 843 (9.96%) |
| Multimedia Tools | 6,281 | 5,262 (83.77%) | 394 (6.27%) | 11 (0.17%) | 107 (1.70%) | 526 (9.96%) |
| Dailylife APIs | 5,432 | 4,307 (79.29%) | 684 (12.59%) | 6 (0.11%) | 92 (1.68%) | 332 (6.11%) |
In the supplemental material, our TaskBench contains datasets in three areas: HuggingFace Tools, Multimedia Tools, and Dailylife APIs. Each dataset directory includes three types of files:
- `data_formulated.json` is the raw dataset generated by GPT-4 (the file we submitted in the previous version).
- `data_critics.json` is the dataset obtained after checking and filtering by rule-based and LLM-based critics, merged from the `data_formulated.json` and `alignment_ids.json` files submitted in the previous version.
- `data_human.json` is the latest human-verified version. We invited a dozen human annotators to closely check and fix the samples to ensure the quality of the dataset.
For more details, please refer to the folder datasets_human_verified in our supplementary material.
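For readers inspecting the release, the snippet below sketches how a split might be loaded; it assumes each data file contains either one JSON object per line or a single JSON array (the actual on-disk format is not specified in this thread), and the sub-directory name in the example path is hypothetical.

```python
import json
from pathlib import Path

def load_samples(path: str):
    """Yield samples from a TaskBench data file, accepting either
    JSON Lines or a single JSON array (format assumed, not confirmed)."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if text.startswith("["):
        yield from json.loads(text)
    else:
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)

# Hypothetical path; the top-level folder name follows the description above.
for sample in load_samples("datasets_human_verified/huggingface/data_human.json"):
    print(sample.keys())
    break
```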
We sincerely thank each reviewer for their continuous constructive comments, which have helped us improve the quality of TaskBench. Compared to previous works, TaskBench provides not only automatically generated instruction samples but also fine-grained annotations (e.g., decomposed sub-tasks, tool invocations, and their parameters) for each stage of task automation. Therefore, we acknowledge that more elaborate examination is required to polish the final dataset. To this end, and following the suggestions of each reviewer, we have designed multiple strategies, including LLM-based and rule-based critics as well as human verification, to iteratively improve our dataset:
- LLM-based and Rule-based Critics:
  - Evaluating the executability of the tool invocation graph using LLMs.
  - Evaluating the alignment of the tool invocation graph using LLMs.
  - Ensuring node set consistency through rule-based checks.
  - Verifying edge set consistency via rule-based methods.
  - Checking for consistency in the number of tool parameters*.
  - Maintaining consistency in tool parameter names*.
- Human Verifications:
  - Checking grammatical correctness.
  - Ensuring completeness of instructions.
  - Verifying executability of the tool invocation graphs.
  - Assessing alignment.
(* denotes the latest updates based on the feedback of Reviewer CSek.)
Based on the above strategies, we conducted multiple rounds of verification to iteratively refine our datasets, including:
- Discarding cases that are severely incorrect or do not fit real-world scenarios.
- Fixing Syntax.
- Revising Instructions.
- Correcting the Tool Invocation Graph.
- Adjusting parameters of tools.
We also thank Reviewer CSek for providing continuous feedback and examinations to polish our dataset. We will continue our research to develop more powerful and high-quality datasets in the future.
This paper introduces TaskBench, a new benchmark to evaluate the task automation capabilities of large language models. Solving TaskBench problems requires an LLM to perform three separate stages: task decomposition (decomposing complex, compositional tasks into individual steps), tool invocation (calling the right tool API to solve each step), and parameter prediction (parameterizing tool invocations). Execution of a multi-step task can be represented as a tool graph, with nodes representing tools and edges their dataflow dependencies. To construct TaskBench, tool graphs are synthetically generated and then back-translated into synthetic natural language instructions (so-called "back-instruct" in the paper). TaskBench comes with synthetic problems in three domains (HuggingFace model APIs, Multimedia data processing, and daily life services such as web search and shopping, with 103 tools in total). Those synthetic problems are filtered by self-consistency as well as a series of LLM- and rule-based filters. The authors evaluated up to 16 LLMs on TaskBench, and the results suggest that task automation capabilities could be attributed to the general reasoning and instruction-following skills of those LLMs.
Strengths
- Broad Coverage of Tools: TaskBench comes with 103 tools in total (by the total number of node types on task graphs), with much broader coverage of use cases than existing works (LVvj). Given the burgeoning of LLM tool-use research, TaskBench could become an important benchmark to evaluate task automation skills of LLMs (BEYM, peHw, CSek).
- Task Representation Formalisms: Using tool invocation graphs to capture the dependency and dataflow of tools can represent compositional task execution logics involving multiple tools with complex dependencies (peHw). The back-translation method to generate natural language instructions from tool graphs also improves data collection efficiency (peHw, CSek).
- Comprehensive Evaluations: up to 16 LLMs (CSek).
Weaknesses
All reviewers concurred that a major issue of this submission is the dataset quality. Since problems in TaskBench are synthetically generated, some of the tasks can be unnatural and may not align with real user intents (LVvj, peHw, CSek). In addition, there is no guarantee that the back-translated natural language instructions are aligned with the original task graphs (LVvj, peHw, CSek). In particular, upon manual inspection, reviewer CSek spotted several obvious errors in the dataset. While the authors devised a series of model- and rule-based filters to reject problematic examples, the coverage and accuracy of those filters remain unclear after the authors' response. Besides, while the authors conducted a manual analysis of dataset quality on a subset of TaskBench problems at the reviewers' request, this subset is tiny (50-148 examples, Appendix A.3-4) compared to the size of the entire dataset (~20K, Table 1).
Given that this is a benchmark paper and data quality still remains an issue, the decision is a “Reject”. To improve the paper, the reviewers strongly suggest conducting manual review over all the examples in the benchmark dataset.
Why not a higher score
Given that this is a benchmark paper and data quality still remains an issue, it would be risky to accept this submission.
Why not a lower score
N/A
Reject