How do you ensure performance when building AI services?

When building the FCDO’s first ever public-facing LLM service, CYB worked through a series of challenges to guarantee performance - our Head of Engineering Ali Salaman explains all

Ali Salaman

Head of Engineering

05 Sep 2024

Caution Your Blast Ltd’s (CYB) recent project for the Foreign, Commonwealth & Development Office (FCDO) - an AI-enabled Knowledge Management tool - had two main aims. Firstly, we wanted to make it easier and quicker for British nationals to get the information they need from the FCDO when abroad. Secondly, we wanted to help FCDO staff devote more time to the most vulnerable cases by allowing more users to self-serve online. This is especially important during a crisis, when the FCDO receives large volumes of enquiries. Upon its recent launch, it became the FCDO’s first ever AI-enabled public-facing service.

The proliferation of, and access to, Large Language Models (LLMs) over the past 18 months has opened up new capabilities for processing unstructured text. Our goal was to use an LLM to mimic how FCDO staff behave - first understanding the user’s enquiry, then matching it to the most appropriate FCDO-approved content (known as knowledge articles) accurately and efficiently.

This blog post delves deeper into the insights and lessons learnt from our journey, particularly focusing on our approach to performance tuning to reduce latency and optimise throughput.

The Hybrid Model Approach

We chose a hybrid model approach to maximise efficiency and accuracy. OpenAI was the market leader when we started the project in September 2023, and being able to access their models in a Microsoft Azure data centre meant data would never leave our environment, which also made it safe to use from a data privacy perspective. Here’s how we structured it, with a simplified code sketch after the list:

  • GPT-3.5

    Used for breaking down the user enquiries into their component questions. This model is efficient and performs well for initial processing. It’s also cheaper to use than the more capable 4-series models

  • GPT-4 Turbo

    A more capable model used for deeper understanding and matching enquiries to FCDO-approved articles, providing greater accuracy when handling more nuanced edge cases
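
As an illustration of how the two stages fit together, here is a minimal sketch using the openai Python SDK against Azure OpenAI. The endpoint, deployment names and prompts are hypothetical stand-ins rather than our production configuration:

```python
from openai import AzureOpenAI

# Hypothetical endpoint and credentials - stand-ins, not our production config
client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2024-02-01",
)

def decompose_enquiry(enquiry: str) -> list[str]:
    """Stage 1: the cheaper GPT-3.5 deployment splits an enquiry into
    its component questions, one per line."""
    resp = client.chat.completions.create(
        model="gpt-35-decompose",  # hypothetical GPT-3.5 deployment name
        messages=[
            {"role": "system",
             "content": "Split the user's enquiry into its component questions, one per line."},
            {"role": "user", "content": enquiry},
        ],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def match_article(question: str, candidate_titles: list[str]) -> str:
    """Stage 2: the more capable GPT-4 deployment matches a question to
    the most appropriate FCDO-approved knowledge article."""
    resp = client.chat.completions.create(
        model="gpt-4-match",  # hypothetical GPT-4 deployment name
        messages=[
            {"role": "system",
             "content": "Return the single best-matching article title from the list."},
            {"role": "user",
             "content": f"Question: {question}\nArticles: {candidate_titles}"},
        ],
    )
    return resp.choices[0].message.content
```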

While this setup seemed ideal, subsequent load testing revealed significant performance challenges. We observed frequent high-latency episodes: the average response time was around nine seconds - itself far from acceptable for our service - and spikes of 30 seconds or more were common. Such latency was unworkable for our use case, where timely responses are crucial.

When accessing the OpenAI models, you are given a token quota - think of it as roughly the maximum number of words you can send per minute. For context, a token in OpenAI's framework is roughly three-quarters of an English word. We regularly breached our deployment’s allocated token quota and request limits, which called for a deeper investigation into the root causes of these limitations.
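
Managing a quota starts with measuring usage. As an illustrative aside, OpenAI’s tiktoken library can count the tokens a prompt will consume before it is sent (cl100k_base is the encoding used by the GPT-3.5 and GPT-4 model families):

```python
import tiktoken

# cl100k_base is the tokeniser used by the GPT-3.5 and GPT-4 model families
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens a piece of text will consume."""
    return len(encoding.encode(text))

# Hypothetical enquiry - short English text averages roughly 3/4 of a word per token
print(count_tokens("I have lost my passport in France. What should I do?"))
```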

Performance Tuning Strategies

To address these challenges, we employed several strategies:

Parameter Tweaking

The API exposes parameters that we could adjust to change how each call behaved, and we experimented to observe the impact of different values; a configuration sketch follows the list. Key parameters included:

  • Max Tokens

    Limiting the maximum number of tokens in the response to control token usage

  • Max Retries

    Reducing the number of retries to minimise repeated requests to the API and the extra token consumption they cause

  • Timeout

    Setting appropriate timeout values so that slow calls fail fast rather than keeping users waiting
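
For illustration, these three knobs map onto options in the openai Python SDK. The endpoint, deployment name and values below are assumed for the sketch, not our production settings:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",  # hypothetical endpoint
    api_key="YOUR-KEY",
    api_version="2024-02-01",
    timeout=10.0,    # fail fast instead of waiting out a latency spike
    max_retries=1,   # each retry repeats the whole request, spending tokens again
)

resp = client.chat.completions.create(
    model="gpt-35-decompose",  # hypothetical deployment name
    max_tokens=256,            # cap the response length to control token usage
    messages=[{"role": "user", "content": "I need help with my visa application..."}],
)
```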

While these tweaks provided some improvements, they did not solve the high latency and token limit issues.

Fallback Model Approach

Next, we implemented a fallback model strategy: we deployed two instances of the GPT-3.5 model and two instances of the GPT-4 Turbo model. The first instance of each was the primary, accessed under normal processing conditions; the second was accessed whenever a timeout occurred during high latency. This method offered a temporary reprieve but was akin to "papering over the cracks" rather than addressing the root cause.
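
In code, the fallback can be as simple as a try/except around the primary call. This sketch assumes two deployments of the same model (names hypothetical) and uses the SDK’s APITimeoutError:

```python
from openai import APITimeoutError, AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example.openai.azure.com",  # hypothetical endpoint
    api_key="YOUR-KEY",
    api_version="2024-02-01",
    timeout=10.0,  # a timeout here is what triggers the fallback below
)

# Hypothetical deployment names: the same model deployed twice
PRIMARY, FALLBACK = "gpt-4-turbo-primary", "gpt-4-turbo-fallback"

def chat_with_fallback(messages: list[dict]) -> str:
    """Try the primary deployment first; on timeout, retry once on the fallback."""
    try:
        resp = client.chat.completions.create(model=PRIMARY, messages=messages)
    except APITimeoutError:
        # The primary is having a latency spike - "paper over the crack"
        resp = client.chat.completions.create(model=FALLBACK, messages=messages)
    return resp.choices[0].message.content
```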

Re-evaluating Model Choices

A critical realisation emerged regarding our model choices. We had initially selected GPT-4 Turbo, expecting it to become production-grade during our project timeline. Being categorised as “production-grade” means the API and the models behind it have resilience and optimisation built in, backed by hardware powerful enough to ensure consistently performant behaviour. However, GPT-4 Turbo had still not been upgraded to production-grade status, and we failed to revisit this assumption, leading to suboptimal performance.

Upon re-evaluation, we replaced GPT-4 Turbo with the older but more stable GPT-4 model. This change brought significant improvements: the average response time dropped to around three seconds, and performance remained consistent and reliable, with only occasional, manageable spikes.

Final Outcomes and Future Prospects

With these adjustments, the system's performance and consistency have vastly improved. This has enabled us to continue making further service improvements, and as we roll out the service we are excited to see the impact it will have on people around the world and within the FCDO itself.

Key Takeaways:

  • Hybrid Model Effectiveness

    Utilising different models for distinct tasks can optimise costs and performance but requires careful selection and ongoing evaluation.

  • Parameter Optimisation

    Adjusting API parameters can help to mitigate some performance issues but may not be sufficient alone.

  • Fallback Strategies

    Implementing fallback models provides temporary relief but should not replace addressing fundamental issues.

  • Regular Re-evaluation

    Continuously reassessing model choices is good practice for maintaining optimal performance.

Our journey underscores the importance of adaptability and continuous improvement in deploying AI solutions. Initial feedback is very positive and we look forward to seeing how our optimised service enhances user interactions with the FCDO and contributes to more efficient access to their information.

If you have any thoughts or questions about this article, feel free to get in touch! info@cautionyourblast.com