Safely Introducing a New Model Into Production

There will always be a newer and better LLM model. This article discusses how to introduce one into production safely.


The LLM market is very dynamic. Shifts in pricing and capability mean that your applications must change models fairly often. Doing this with LLMAsAService is a snap.

What are the risks of model changes?

When new models are released, they bring significant changes, often improving responses to prompts. This constant evolution should excite you about the models' potential for your applications, and the upside is better performance and higher customer satisfaction. However, to realize these benefits, you need to take care to avoid regressions.

A regression we see constantly in our own applications is the formatting of responses. If your application (like ours) expects responses in a consistent format, downstream automation (in our case, an accept button for a list of ideas) can break when a new model formats its answers differently.

Here are some areas to consider:

  1. Formatting changes. Different indentation and emphasis.

  2. Preambles. Some models are more conversational and add phrases like "Sure, let me do that," while older models just give the answer.

  3. Quality. Moving to a less expensive model sometimes (not always) produces a less appropriate answer.

How to reduce new model risks in LLMAsAService.io

Connect the new model in draft mode

You can have any number of models in LLMAsAService. Create a new service using the proposed model, but leave it in Draft status. A draft service is excluded from the failover list of production models, so you can call it for testing without disrupting production.
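If you prefer to drive testing from code rather than the control panel, a small harness can pin requests to the draft service while production traffic keeps using the normal failover pool. The sketch below is purely illustrative: the endpoint URL, the serviceId parameter, and the response shape are assumptions made for this example, not the documented LLMAsAService API.

```ts
// Hypothetical sketch only: the URL, "serviceId" field, and response shape are
// assumptions, not the documented LLMAsAService API. The point is that a test
// harness can target the draft service explicitly while production keeps using
// the normal failover pool.

const PROJECT_ID = "your-project-id";       // assumption: your project identifier
const DRAFT_SERVICE_ID = "draft-gemini";    // assumption: id of the draft service

async function callDraftService(prompt: string): Promise<string> {
  const response = await fetch(
    `https://api.llmasaservice.io/projects/${PROJECT_ID}/chat`, // hypothetical endpoint
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        prompt,
        serviceId: DRAFT_SERVICE_ID, // pin this call to the draft service for testing
      }),
    }
  );
  if (!response.ok) {
    throw new Error(`Draft service call failed: ${response.status}`);
  }
  const data = (await response.json()) as { text: string }; // assumed response shape
  return data.text;
}
```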

Compare the responses against a sample of ACTUAL previous calls

On the Quality page of the control panel, you can randomly sample prior calls. You can also generate a response using any of the defined LLM Services. Find a few candidate prompts and generate responses using the proposed new model. This gives you a sense of what the customer would have seen if that new model had answered the prompt shown.

For example, Gemini has a cost advantage over GPT-4o (although GPT-4o mini is cheaper still). In this example, I compared a prior GPT-4o response with a response from Gemini. I might have to tune the prompts to reduce Gemini's wordiness and detail in my application.
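If you want to run this comparison over more than a handful of prompts, a short script can do the side-by-side for you. The sketch below assumes you have copied a few sampled calls from the Quality page into an array; the PriorCall shape and the generateCandidate parameter are illustrative assumptions, not part of the product.

```ts
// Hypothetical sketch: compare stored production responses with responses from a
// candidate model. "priorCalls" would come from the Quality page sample; the
// "generateCandidate" function is whatever calls the draft service (for example,
// the callDraftService sketch above). Names and shapes are assumptions.

interface PriorCall {
  prompt: string;
  productionResponse: string; // what the current production model answered
}

async function compareAgainstCandidate(
  priorCalls: PriorCall[],
  generateCandidate: (prompt: string) => Promise<string>
): Promise<void> {
  for (const call of priorCalls) {
    const candidateResponse = await generateCandidate(call.prompt);
    console.log("PROMPT:\n", call.prompt);
    console.log("PRODUCTION RESPONSE:\n", call.productionResponse);
    console.log("CANDIDATE RESPONSE:\n", candidateResponse);
    console.log("---");
    // Review by eye for the risk areas above: formatting drift, chatty preambles,
    // and overall answer quality and wordiness.
  }
}
```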

Gather CSAT analytics

If the new model is deemed adequate, switch its status to Active and put it into the pool of available services. Instrument your application (we will be offering this as a service soon) to capture "Good response" and "Poor response" data, and confirm that customers agree with your assessment that the new model performs adequately.

Our application (heycasey.io) collects this using thumbs-up and thumbs-down buttons. If the new model performs worse on these ratings, we set that service back to Draft and optimize the prompt until it performs better.
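Until that capture is offered as a service, the instrumentation lives in your own application. A minimal sketch, assuming a hypothetical /api/llm-feedback endpoint in your app, might look like this:

```ts
// Hypothetical sketch of application-side CSAT instrumentation: each thumbs-up or
// thumbs-down click records which service answered, so ratings can be compared
// between the old and the new model. The endpoint and payload are assumptions
// about your own application, not an LLMAsAService feature.

type Rating = "good" | "poor";

interface ResponseFeedback {
  serviceId: string;   // which LLM service produced the response
  callId: string;      // id of the specific call being rated
  rating: Rating;
  ratedAt: string;     // ISO timestamp
}

async function recordFeedback(serviceId: string, callId: string, rating: Rating) {
  const feedback: ResponseFeedback = {
    serviceId,
    callId,
    rating,
    ratedAt: new Date().toISOString(),
  };
  // Send to your own analytics store (hypothetical endpoint).
  await fetch("/api/llm-feedback", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(feedback),
  });
}

// Wire the same handler to both buttons:
// thumbsUpButton.onclick   = () => recordFeedback(serviceId, callId, "good");
// thumbsDownButton.onclick = () => recordFeedback(serviceId, callId, "poor");
```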
