Function Calling Test Suite for Fun and Profit


Introducing function-calling-test-suite

Function calling is the fundamental feature that powers our flagship project, GPTScript. That makes an LLM’s ability to call functions the primary consideration when determining its suitability as a drop-in replacement for OpenAI’s gpt-4o (the current default model used by GPTScript). To quantify that ability, we sank some time into building function-calling-test-suite (FCTS), a shiny new test framework!


We’ll drop another blog post that delves into the specifics of the design shortly, but for now, here’s a breakdown of the framework’s key features:

  • Simple YAML specs to describe test cases (a hypothetical example follows below)
  • Optionally use gpt-4o to judge test results
  • Configurable test run count (i.e. run each test N times to detect non-deterministic model responses)
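
To make the first bullet concrete, here’s roughly the shape a spec could take. This is a hypothetical sketch, not the actual FCTS schema (field names like prompt, tools, and expect are our own stand-ins; the real spec files live in the repo):

```python
import yaml  # PyYAML

# Hypothetical spec shape -- NOT the real FCTS schema, just the general idea:
# a prompt, the tools the model may call, and what we expect to happen.
SPEC_TEXT = """
name: basic-add
prompt: "What is 2 plus 3? Use the add tool and tell me the result."
tools:
  - name: add
    description: Adds two integers
    arguments:
      a: integer
      b: integer
expect:
  calls:
    - tool: add
      arguments: {a: 2, b: 3}
  answer_contains: "5"
"""

spec = yaml.safe_load(SPEC_TEXT)
print(spec["name"], spec["expect"]["calls"])
```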

Now that introductions are out of the way, here’s what we’ve found with FCTS so far:


Rankings

We tested six major models with function calling support across four platforms. To account for the non-deterministic nature of generative models, we ran every test case 10 times per model, then ranked the models by overall pass rate.
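
Concretely, the ranking is just passes over total runs. Here’s a minimal sketch of the bookkeeping, assuming a hypothetical run_test(model, test) helper that executes one test case against one model and returns True on a pass:

```python
from collections import defaultdict

RUNS_PER_TEST = 10  # every test case is executed 10 times per model

def rank_models(models, tests, run_test):
    """Rank models by overall pass rate across repeated runs.

    `run_test(model, test)` is a hypothetical callable that runs a single
    test case against a single model and returns True if it passed.
    """
    passes = defaultdict(int)
    for model in models:
        for test in tests:
            for _ in range(RUNS_PER_TEST):
                if run_test(model, test):
                    passes[model] += 1

    total_runs = len(tests) * RUNS_PER_TEST
    ranked = sorted(models, key=lambda m: passes[m], reverse=True)
    return [(model, 100 * passes[model] / total_runs) for model in ranked]
```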


| Rank | Pass Rate | Model | Platform |
|------|-----------|-------|----------|
| 1 | 98.24% | gpt-4o-2024-05-13 | OpenAI |
| 2 | 94.71% | gpt-4-turbo-2024-04-09 | OpenAI |
| 3 | 87.65% | claude-3-5-sonnet-20240620 | Anthropic |
| 4 | 72.94% | claude-3-opus-20240229 | Anthropic |
| 5 | 51.18% | mistral-large-2402 | La Plateforme (Mistral AI) |
| 6 | 48.82% | gemini-1.5-pro | Vertex AI (Google) |

A Quick Litmus Test

As mentioned earlier, GPTScript uses gpt-4o (which resolved to gpt-4o-2024-05-13 at the time these rankings were compiled) by default, so we were already confident in its ability to satisfy our use cases. But to get a rough idea of how well these results stack up against reality, we also ran GPTScript on a selection of example scripts and recorded the pass rate for each model.
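
The litmus test itself is nothing fancy: point the gptscript CLI at each example script with the model under test and note whether the run succeeds. The sketch below is illustrative only; the --default-model flag and the exit-code check are assumptions about how you might wire this up (our own runs were also checked for correct output, not just a clean exit):

```python
import subprocess

# A hypothetical subset of models and example scripts; the full results are in
# the tables below.
MODELS = ["gpt-4o-2024-05-13", "claude-3-5-sonnet-20240620"]
EXAMPLES = ["examples/bob.gpt", "examples/echo.gpt", "examples/search.gpt"]

results = {}
for model in MODELS:
    for script in EXAMPLES:
        # Assumes `gptscript` is on PATH and that --default-model selects the
        # backing model; a non-zero exit code or a timeout counts as a failure.
        try:
            proc = subprocess.run(
                ["gptscript", "--default-model", model, script],
                capture_output=True, text=True, timeout=300,
            )
            results[(model, script)] = "pass" if proc.returncode == 0 else "fail"
        except subprocess.TimeoutExpired:
            results[(model, script)] = "fail"

for (model, script), outcome in results.items():
    print(f"{model:30} {script:30} {outcome}")
```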


| Example | gpt-4o-2024-05-13 | gpt-4-turbo-2024-04-09 | claude-3-5-sonnet-20240620 | claude-3-opus-20240229 | mistral-large-2402 | gemini-1.5-pro |
|---------|-------------------|------------------------|----------------------------|------------------------|--------------------|----------------|
| bob-as-shell.gpt | pass | pass | pass | pass | pass | pass |
| bob.gpt | pass | pass | pass | pass | pass | pass |
| echo.gpt | pass | pass | pass | pass | pass | pass |
| fac.gpt | pass | pass | pass | pass | pass | fail |
| helloworld.gpt | pass | pass | pass | pass | pass | pass |
| describe-code.gpt | pass | pass | fail | fail | fail | fail |
| add-go-mod-dep.gpt | pass | pass | fail | fail | fail | fail |
| hacker-news-headlines.gpt | pass | pass | pass | fail | fail | fail |
| search.gpt | pass | pass | pass | pass | fail | pass |
| json-notebook | pass | pass | pass | fail | fail | fail |
| sqlite-download.gpt | pass | pass | pass | pass | fail | fail |
| syntax-from-code.gpt | pass | pass | pass | pass | fail | pass |
| git-commit.gpt | pass | pass | pass | fail | pass | fail |
| sentiments.gpt | pass | pass | pass | fail | pass | fail |

| Rank | Example Pass Rate | FCTS Pass Rate | Model |
|------|-------------------|----------------|-------|
| 1 | 100% | 98.24% | gpt-4o-2024-05-13 |
| 2 | 100% | 94.71% | gpt-4-turbo-2024-04-09 |
| 3 | 85.71% | 87.65% | claude-3-5-sonnet-20240620 |
| 4 | 57.14% | 72.94% | claude-3-opus-20240229 |
| 5 | 50.00% | 51.18% | mistral-large-2402 |
| 6 | 42.86% | 48.82% | gemini-1.5-pro |

With the exception of claude-3-opus-20240229, which differs by ~16 percentage points, each model’s example pass rate is within ~6 percentage points of its FCTS pass rate. Although this isn’t exactly an apples-to-apples comparison, we feel the congruence is enough to warrant some confidence that FCTS is a reasonable approximation of a model’s potential performance with GPTScript.


Huzzah!


Now that we’ve convinced ourselves that our results pass muster, let’s take a closer look at the test cases.


Test Case Overview

The initial test suite spans six categories and contains a relatively small number of test cases, but we feel they cover a wide mix of typical use cases without being too overwhelming.


| Category | Description |
|----------|-------------|
| basic | Tests that a model can make the most basic function calls |
| sequenced | Tests that a model can make function calls in a specific order |
| chained | Tests that a model can pass the result of a function call to another function (see the sketch after this table) |
| grouped | Tests that a model can identify and make groups of function calls |
| semantic | Tests that a model can infer and make the correct function calls given natural language prompts and descriptions |
| gptscript | Tests that a model can perform more complex tasks found in GPTScript’s example scripts |
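
To make the chained category a little more tangible: a chained case requires the model to call one function, take its actual result, and feed that result to a second function instead of guessing or inlining a placeholder. In OpenAI’s Chat Completions tool format, the tool definitions for such a case might look like the sketch below (the tool names and parameters are invented for illustration and aren’t taken from the suite):

```python
# Two hypothetical tools for a chained case: the model should call get_user_id
# first, then pass its real return value as the user_id argument of get_orders.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_user_id",
            "description": "Looks up the ID of a user by name",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_orders",
            "description": "Lists the orders belonging to a user ID",
            "parameters": {
                "type": "object",
                "properties": {"user_id": {"type": "string"}},
                "required": ["user_id"],
            },
        },
    },
]
```

A model that chains poorly will typically emit both calls in parallel with a made-up user_id, which is exactly the pathology several of the failure tables below call out.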

| Test ID | Description | Categories |
|---------|-------------|------------|
| 01_basic.yaml-0 | Asserts that the model can make a function call with a given argument and conveys the result to the user | basic |
| 01_basic.yaml-1 | Asserts that the model can make a function call with an ordered set of arguments and conveys the result to the user | basic |
| 03_sequenced.yaml-0 | Asserts that the model can make a sequence of function calls in the correct order and conveys the results to the user | sequenced |
| 03_sequenced.yaml-1 | Asserts that the model can make a mix of ordered and unordered function calls and conveys the result to the user | sequenced |
| 05_chained.yaml-0 | Asserts that the model can use the result of a function call as the argument for a specified function and conveys the result to the user | chained |
| 05_chained.yaml-1 | Asserts that the model can use the results of a group of function calls as arguments for a single function call and conveys the result to the user | chained, grouped |
| 05_chained.yaml-2 | Asserts that the model can use the results of a group of function calls as arguments for successive groups of function calls and conveys the result to the user | chained, grouped |
| 07_semantic.yaml-0 | Asserts that the model can derive and make a function call with one argument from a prompt and conveys the result to the user | semantic, basic |
| 07_semantic.yaml-1 | Asserts that the model can derive and make a function call with two arguments from a prompt and conveys the result to the user | semantic, basic |
| 07_semantic.yaml-2 | Asserts that the model can derive and make an ordered sequence of function calls from a prompt and conveys the results to the user | sequenced, semantic |
| 07_semantic.yaml-3 | Asserts that the model can derive and make two function calls from the prompt, using the result of the first call as the argument for the second, and conveys the result to the user | semantic, chained |
| 07_semantic.yaml-4 | Asserts that the model can derive and make a series of function calls from a prompt, where the results of an initial group of calls are used as arguments for a final function call, and conveys the result to the user | semantic, chained |
| 07_semantic.yaml-5 | Asserts that the model can interpret and execute a complex series of chained steps related to creating a database and creating entries in it | semantic, chained |
| 07_semantic.yaml-6 | Asserts that the model can parse a comma-delimited list from one function, pass each entry to a second function, and send the gathered results of those calls to a third function | chained, semantic, grouped |
| 07_semantic.yaml-7 | Asserts that the model can parse a large CSV-style response and make a series of chained calls for each row in the CSV | semantic, chained |
| 07_semantic.yaml-8 | Asserts that the model can parse and transform user input based on the instructions in its system prompt | sequenced, gptscript, semantic, chained |
| 07_semantic.yaml-9 | Asserts that the model can build a chain of grouped function calls | sequenced, semantic, chained, grouped, gptscript |

Note: Test ID refers to the spec file name and YAML stream index that a given spec originated from. The “gaps” in the indices above are because we’ve elided the nascent negative test category from our analysis; we’re not yet fully confident that the category is meaningful. The full spec files for the entire suite, including negatives, are available for review in the FCTS repo.


Comparing Performance

Plotting the number of passed runs for each test case as a heat map makes the major differences between models stand out.
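
(If you want to reproduce this kind of chart from your own FCTS runs, it’s only a few lines of matplotlib. The sketch below assumes a pass_counts matrix of shape models × test cases holding the number of passed runs out of 10; the random data is just a stand-in.)

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: rows are models, columns are test cases, values are the
# number of passed runs out of 10. Replace with the real FCTS results.
models = ["gpt-4o", "gpt-4-turbo", "claude-3.5-sonnet",
          "claude-3-opus", "mistral-large", "gemini-1.5-pro"]
test_ids = [f"test-{i}" for i in range(18)]
pass_counts = np.random.randint(0, 11, size=(len(models), len(test_ids)))

fig, ax = plt.subplots(figsize=(12, 4))
im = ax.imshow(pass_counts, vmin=0, vmax=10, cmap="RdYlGn", aspect="auto")
ax.set_xticks(range(len(test_ids)), labels=test_ids, rotation=90)
ax.set_yticks(range(len(models)), labels=models)
fig.colorbar(im, ax=ax, label="passed runs (out of 10)")
fig.tight_layout()
fig.savefig("fcts-heat-map.svg")
```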

[Figure: fcts-heat-map.svg, a heat map of passed runs per test case for each model]

Here we can see that the gulf in performance between OpenAI and the other providers is mostly caused by failing chained and semantic test cases. Interestingly, with the exception of claude-3.5-sonnet, the non-OpenAI providers fail the same two chained and two semantic test cases across the board (05_chained.yaml-0, 05_chained.yaml-1, 07_semantic.yaml-6, and 07_semantic.yaml-8). These failures represent a whopping ~66% and 20% of the total test runs in their respective categories!


But to compare the deficits of each model with any greater fidelity, we’ll need to understand why they failed on a test-by-test basis.


gpt-4o-2024-05-13

| Test ID | Fail Rate | Failure Pathology |
|---------|-----------|-------------------|
| 07_semantic.yaml-4 | 30% | Fails to properly chain groups of function calls; hallucinates function arguments |

gpt-4-turbo-2024-04-09

| Test ID | Fail Rate | Failure Pathology |
|---------|-----------|-------------------|
| 07_semantic.yaml-6 | 80% | Returns an incorrect argument after a large number of function calls |
| 07_semantic.yaml-9 | 10% | Makes an unnecessary duplicate function call |

claude-3-5-sonnet-20240620

| Test ID | Fail Rate | Failure Pathology |
|---------|-----------|-------------------|
| 05_chained.yaml-1 | 100% | Chains correctly; final answer enumerates the chain of function calls invoked instead of the final evaluated result |
| 05_chained.yaml-2 | 10% | Chains correctly; final answer enumerates the chain of function calls invoked instead of the final evaluated result |
| 07_semantic.yaml-6 | 100% | Halts after the first call; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |

claude-3-opus-20240229

| Test ID | Fail Rate | Failure Pathology |
|---------|-----------|-------------------|
| 05_chained.yaml-0 | 60% | Makes chained calls in parallel; passes a “placeholder” instead of a “real” argument |
| 05_chained.yaml-1 | 100% | Chains correctly; final answer enumerates the chain of function calls invoked instead of the final evaluated result |
| 05_chained.yaml-2 | 100% | Makes chained calls in parallel; passes a “placeholder” instead of a “real” argument |
| 07_semantic.yaml-6 | 100% | Halts after the first call; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
| 07_semantic.yaml-8 | 100% | Halts without making any calls; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |

mistral-large-2402

| Test ID | Fail Rate | Failure Pathology |
|---------|-----------|-------------------|
| 05_chained.yaml-1 | 100% | Halts without making any calls; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
| 05_chained.yaml-2 | 100% | Makes chained calls in parallel; passes a “placeholder” instead of a “real” argument |
| 07_semantic.yaml-2 | 30% | Halts after the first call; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
| 07_semantic.yaml-4 | 100% | Halts after the first call; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
| 07_semantic.yaml-5 | 100% | Halts without making any calls; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
| 07_semantic.yaml-6 | 100% | Makes chained calls in parallel; passes a “placeholder” instead of a “real” argument |
| 07_semantic.yaml-7 | 100% | Makes chained calls in parallel; hallucinates arguments instead of using the results of the initial call |
| 07_semantic.yaml-8 | 100% | Halts after the first call; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |
| 07_semantic.yaml-9 | 100% | Halts after the first call; responds with a (“correct”) plan to finish answering the prompt instead of actually executing that plan |

gemini-1.5-pro

| Test ID | Fail Rate | Failure Pathology |
|---------|-----------|-------------------|
| 01_basic.yaml-1 | 10% | Makes the correct tool call; returns the raw JSON of what looks like the internal “google representation” of the call result |
| 05_chained.yaml-0 | 100% | Fails to derive chain call order; passes the literal “unknown” as an argument |
| 05_chained.yaml-1 | 100% | Fails to derive chain call order; passes given arguments to the wrong function |
| 05_chained.yaml-2 | 100% | Makes no function calls; returns a 500 error |
| 07_semantic.yaml-2 | 100% | Chains correctly; final answer doesn’t contain the chain’s result |
| 07_semantic.yaml-3 | 60% | Chains correctly; final answer is missing required information |
| 07_semantic.yaml-4 | 100% | Chains correctly; final answer is missing required information |
| 07_semantic.yaml-6 | 100% | Returns an incorrect argument after a large number of function calls |
| 07_semantic.yaml-8 | 100% | Fails to derive the chain order; hallucinates the initial argument |
| 07_semantic.yaml-9 | 100% | Begins the chain correctly; adds extra escape characters to newlines |

Thumbing through the failure pathologies above reveals a few common threads between models:


Premature Halting

claude-3-5-sonnet-20240620, claude-3-opus-20240229, mistral-large-2402, and gemini-1.5-pro all frequently halt before completing their tasks. Instead of executing the plan, they just describe what should be done. For example, claude-3-opus-20240229 stops after the first step in test 07_semantic.yaml-6, while mistral-large-2402 exhibits similar behavior in several tests, like 07_semantic.yaml-4 and 07_semantic.yaml-5. gemini-1.5-pro also halts prematurely, particularly in 05_chained.yaml-1.


Poor Chaining

claude-3-opus-20240229 and mistral-large-2402 tend to make parallel calls when they should be sequential, leading to incorrect results. This problem is evident in tests like 05_chained.yaml-2 and 07_semantic.yaml-6. gemini-1.5-pro also encounters this issue, especially in 05_chained.yaml-0 and 05_chained.yaml-1, failing to derive the correct call order.


Argument Hallucination

Hallucinating function arguments is another prevalent issue. gpt-4o-2024-05-13, claude-3-opus-20240229, and gemini-1.5-pro all exhibit this behavior. In 07_semantic.yaml-4, gpt-4o-2024-05-13 generates arguments that were not part of the original input. Similarly, claude-3-opus-20240229 and gemini-1.5-pro show this issue in tests 07_semantic.yaml-7 and 07_semantic.yaml-8, respectively, making up inputs on the fly.


Potential Confounding

At the moment, one factor that could throw off our results is the use of GPTScript provider shims for model providers that don’t support OpenAI’s Chat Completion API (e.g. claude-3-opus-20240229 and gemini-1.5-pro). While we’re fairly confident in our shims, there’s always the potential for unknown bugs to skew our test results. However, since we’ve tested the provider shims pretty thoroughly, we expect confounding from this source to be minimal.
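
For context, a provider shim is just a small translation layer: it accepts OpenAI-style chat completion requests (including tool definitions), forwards them to the provider’s native API, and maps the native response back into OpenAI’s tool-call shape so GPTScript doesn’t have to care who is on the other end. The outline below is only a sketch of that idea; call_native_provider and the field mappings are hypothetical placeholders, not the actual shim code:

```python
def call_native_provider(native_request: dict) -> dict:
    """Hypothetical stand-in for the provider's real SDK or HTTP call."""
    raise NotImplementedError("replace with the provider's actual API client")


def handle_chat_completion(openai_request: dict) -> dict:
    """Outline of a provider shim: OpenAI-style request in, OpenAI-style response out."""
    native_request = {
        "model": openai_request["model"],
        "messages": openai_request["messages"],      # roles/content may need remapping
        "tools": openai_request.get("tools", []),    # translated into the native tool schema
    }

    native_response = call_native_provider(native_request)

    # Map the native tool-use output back into OpenAI's tool_calls shape so the
    # caller (GPTScript) sees an ordinary Chat Completions response.
    return {
        "choices": [{
            "message": {
                "role": "assistant",
                "content": native_response.get("text"),
                "tool_calls": native_response.get("tool_calls", []),
            },
            "finish_reason": native_response.get("stop_reason", "stop"),
        }],
    }
```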


Conclusion

The exercise of building a function calling test framework has been a fruitful one. It’s given us a much deeper grasp of the strengths and weaknesses of the current ecosystem’s top models. It’s also surfaced several real-world takeaways that we’ve already put to use in our other workstreams (e.g. using an LLM to test GPTScript). To us, the results indicate a real gap in performance between OpenAI and the other providers, which supports our initial decision to build GPTScript around OpenAI’s models. They’ve also made it clear which providers are best, and that those providers are still getting better (e.g. gpt-4o vs gpt-4-turbo and claude-3.5-sonnet vs claude-3-opus).


If you’ve found this post interesting, you may want to check out the FCTS repo and give it a spin for yourself. Feel free to join our Discord server to chat with us about it too!