Recently, I've been messing around with generative artificial intelligence at work. In particular, I've been experimenting with large language models (LLMs). In my journey, I've found it surprisingly difficult to find good resources for interacting with these systems. Most of what I've found sits at one extreme of the spectrum or the other. Many blogs are geared towards non-technical folks, and are therefore pretty simple in their suggestions. At the other end are sources such as white papers, targeted towards highly technical machine learning engineers. Neither is what I'm looking for, and so here we are.
This article is meant for software engineers like me, who are curious to learn some more advanced tips for interacting with LLMs. I assume that most people reading this have a basic understanding of how LLMs work, such as token selection. I'm also not going to cover any user interface details, like system prompting or agent creation. I'll start with basic tips and observations, and then slowly increase in complexity.
The Bare Basics
Learn to Live With Limits
As a rule of thumb, think of LLMs as roughly equivalent to an intern. Say you took the exact same context and input you'd give the LLM, and handed it to an intern instead. Could they succeed? How long would it take them? You can then use that to set your expectations appropriately. For instance, if an intern might accomplish the task but take a while, you should consider how to narrow the scope or trim the context to make the task more well-defined.
In all honesty, you should probably just embrace and accept relative mediocrity. I'll take that a step further: you should build that expectation into your system. As it currently stands, LLMs will rapidly achieve pretty solid results, but taking them from there to production-ready performance is very difficult. More notably, it can actually be counterproductive!
In one study, “Shum et al. (2023) found that [...] complex examples can improve the accuracy of complex questions, but perform poorly in simple questions."1 Put simply, you might be able to shift the window of accuracy to cover some complex edge cases, but doing so could cause simpler cases to fail. Instead of fighting that reality, I've found it more effective to simply assume that an LLM integration will be ~80% accurate, and then design your UX and downstream integrations around that assumption.
Prescribe a Persona
You won't believe what scientists have discovered with this one simple trick!2 This is likely something most people have heard about by now, but the technique is so simple and the results so impactful that it needs to be called out.
At the beginning of your prompt (or as system context), simply tell the LLM to adopt a personality or role. For example, start your prompt with “you are a highly qualified QA analyst, tasked with finding bugs in web applications." I've heard a theory that this biases the model's internal weights towards QA-related tokens...but I'm unsure how true that is. In any case, it's a surprisingly effective strategy that has yet to backfire on me.
There's also a related technique, usually called role-playing. With this technique, you describe your own role in the system instead. For example, "I am an end-user of the software, looking to learn about known bugs." Both of these have been shown to be pretty effective, so experiment with both.
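As a concrete sketch, here's what both techniques might look like with the OpenAI Python SDK. The model name is a placeholder, and the prompt wording is just illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whichever model you have access to
    messages=[
        # Persona: tell the model who *it* is.
        {
            "role": "system",
            "content": "You are a highly qualified QA analyst, tasked "
                       "with finding bugs in web applications.",
        },
        # Role-playing: describe who *you* are.
        {
            "role": "user",
            "content": "I am an end-user of the software, looking to "
                       "learn about known bugs in the login flow.",
        },
    ],
)
print(response.choices[0].message.content)
```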
Stick to a Structure
LLMs work well when you give them a structured format, for example markdown or JSON. Asking for structured output from the model is especially useful when integrating with downstream systems, or creating evaluation (eval) tests. What's interesting is that different models work better with different structures, so make sure to experiment with your model(s) of choice!3
There are also well-established libraries, such as Instructor and Outlines, which help deal with structured output. They take care of some of the toil, such as validating the output and retrying when it doesn't conform to your desired data structure. Furthermore, your favorite API probably has parameters to specify your desired output type: Gemini has generation_config, and OpenAI has response_format.
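For instance, a minimal sketch using Instructor with Pydantic might look like the following. The model name and the BugReport schema are placeholders, and Instructor's API may differ slightly between versions:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Instructor wraps the client so responses come back as validated Pydantic
# objects, retrying automatically when the output doesn't conform.
client = instructor.from_openai(OpenAI())

class BugReport(BaseModel):
    title: str
    severity: str
    steps_to_reproduce: list[str]

report = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    response_model=BugReport,  # the desired data structure
    messages=[
        {"role": "user", "content": "Summarize this bug report: ..."},
    ],
)
print(report.title, report.severity)
```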
Intro to Intermediates
Alright, with some basics out of the way, we can dig a little deeper. This section has tips that are a little more obscure, and a lot more detailed.
Never Say Never
An interesting observation is that LLMs don’t respond well to negative-assertion guardrails, such as “don’t generate code comments if they already exist.” To work around this, you might instead ask the model to classify its response as showing or not showing the bad behavior. Since LLMs tend to be pretty good at classification, you can use that signal to handle and filter in downstream systems.4 To stick with the code comments example, you could first ask the LLM to classify whether any code comments already exist, and then only prompt for generation based on that classification output.
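Here's a rough sketch of that two-step flow. The helper and prompt wording are hypothetical, and the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    """Hypothetical helper wrapping a single chat completion."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

source_code = open("example.py").read()

# Step 1: classify, instead of saying "don't generate comments if...".
verdict = complete(
    "Does the following code already contain comments? "
    f"Answer with exactly YES or NO.\n\n{source_code}"
)

# Step 2: only prompt for generation when the classification allows it.
if verdict.strip().upper() == "NO":
    print(complete(f"Add explanatory comments to this code:\n\n{source_code}"))
```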
Plan for a Pivot
One flaw of LLMs is that they really love to generate responses, even when they probably shouldn’t. This is a major contributor to hallucinations. As such, make sure that you give them an escape hatch.5 For example, your prompt might literally say “if the information is not in the article, write ‘I could not find this info.’”
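For instance, a question-answering prompt with an escape hatch might be assembled like this (the template wording is just one way to phrase it):

```python
article_text = "..."  # the source document you want grounded answers from
question = "What year was the project open-sourced?"

# The final line gives the model a sanctioned way out, so it doesn't
# feel compelled to invent an answer that isn't in the source text.
prompt = f"""Answer the question using only the article below.

Article:
{article_text}

Question: {question}

If the information is not in the article, write "I could not find this info."
"""
```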
Experiment with Examples
This technique is commonly known as few-shot prompting. As opposed to zero-shot prompting, we give the model a few examples of the desired output to help guide it. As a rule of thumb, studies suggest we should aim for at least five examples.6
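As a small illustration, few-shot examples can be supplied as prior conversation turns. The sentiment-classification task here is just a stand-in, trimmed below the five-example guideline for space:

```python
# Each user/assistant pair is one "shot". Note the balanced labels and the
# deliberate ordering; both matter, as the biases discussed below explain.
messages = [
    {"role": "system", "content": "Classify the sentiment of each message "
                                  "as positive or negative."},
    {"role": "user", "content": "The checkout flow was seamless."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "The app crashes every time I log in."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Support resolved my issue in minutes."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Two weeks and still no refund."},
    {"role": "assistant", "content": "negative"},
    # ...the actual input to classify comes last.
    {"role": "user", "content": "Setup took thirty seconds. Impressive."},
]
```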
That said, proceed with caution! We don't want to just hand the model some examples blindly. “Many studies looked into [...] examples to maximize the performance, and observed that choice of prompt format, training examples, and the order of the examples can lead to dramatically different performance.”7
Aside from all of that, we also have to keep in mind the quality and type of examples we provide. Otherwise we risk biasing the LLM in an undesirable way. Here are some notable biases that can pop up.8
- Majority bias develops when we provide an unbalanced set of examples. The model will gravitate towards the over-represented label.
- Recency bias is the tendency to repeat the last label we provided.
- Common token bias describes LLMs producing common tokens more often than rare ones. The model may bias itself towards examples that use more commonly found words, even when they're less accurate.
All of the biases above can be overcome in fairly straightforward ways. However, some can actually be used to our advantage! When it comes to majority bias, we could steer the model in the right direction by making our examples representative of the production distribution. In this way, the model will be guided towards a realistic distribution when formulating its own responses.9
We can also use the recency bias to our advantage, by steering the LLM towards our escape hatch. When we have a reasonable default action, using that as the last example will bias the model to start falling back on that.10
The last note about providing examples concerns retrieval-augmented generation (RAG). I assume most engineers have heard of this technique by now, and it continues to be relevant even as context windows reach millions of tokens. First and foremost, RAG can help cut the costs of generation: fewer tokens mean less computation. More importantly, large context windows continue to suffer from “needle in the haystack” types of problems.11 By slimming down the context and examples provided, we avoid confusing the model with irrelevant information.
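A bare-bones retrieval sketch might look like this. I'm using OpenAI embeddings as an example provider; the embedding model name and the chunks are placeholders, and a production system would typically use a vector database instead of in-memory numpy:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # placeholder embedding model
        input=text,
    )
    return np.array(response.data[0].embedding)

# Embed your document chunks once, ahead of time.
chunks = ["LLMs work well with structured formats...", "Recency bias is..."]
chunk_vectors = np.array([embed(chunk) for chunk in chunks])

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = (chunk_vectors @ q) / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Only the relevant slices of context make it into the prompt.
context = "\n\n".join(retrieve("How do I counteract recency bias?"))
```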
Approaching the Advanced
Lastly, let's dive into some more advanced techniques. As a general rule of thumb, these should be explored only after a general proof-of-concept has been established. These techniques are unlikely to make an infeasible idea suddenly work, but rather can take a complex use case across the finish line.
Sequence the Steps
A relatively recent strategy to hit the LLM scene has been chain of thought prompting, which has been shown to greatly reduce hallucinations.
There are two main types to highlight, both sketched in code just below:
- Implicit (or zero-shot) chain of thought is laughably simple. In this method, we simply tell the model to think step-wise. For example, our prompt might literally begin with “let’s think through this step-by-step…”
- Explicit chain of thought involves giving the model direct logical steps to follow. Interestingly, it’s been found that separating the steps by newlines (as opposed to numbers, periods, etc.) increases efficacy.12
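Here's what each variant might look like (the math problem is just a stand-in):

```python
# Implicit chain of thought: the only addition is the trailing nudge.
implicit_prompt = (
    "A train departs at 9:15 and arrives at 11:40. How long is the trip?\n"
    "Let's think through this step-by-step."
)

# Explicit chain of thought: spell out the steps, separated by newlines
# (which reportedly works better than numbers or periods).
explicit_prompt = (
    "A train departs at 9:15 and arrives at 11:40. How long is the trip?\n"
    "Convert both times to minutes since midnight.\n"
    "Subtract the departure time from the arrival time.\n"
    "Express the difference in hours and minutes."
)
```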
Questioning for Questions
This technique goes by a few names, such as self-ask, decomposition, and prompt chaining. The idea is to generate chains of thought, each of which can be independently resolved via traditional means.13 For example, instead of asking “what is the height of the 44th President,” we can ask the model to generate the chains of thought that could answer this question. As a result, it might generate the steps (1) “who was the 44th President,” followed by (2) “what was Obama’s height.” Each of those questions could then be answered via traditional search.
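A rough sketch of the decomposition loop follows. The prompts and helper are hypothetical, and step 2 is where a traditional search API would slot in:

```python
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "What is the height of the 44th President?"

# Step 1: ask the model to decompose the question, one sub-question per line.
sub_questions = complete(
    "List the simpler sub-questions needed to answer this question, "
    f"one per line, and nothing else:\n{question}"
).splitlines()

# Step 2: resolve each sub-question independently; a search API could
# replace (or verify) these model calls.
facts = [complete(sub) for sub in sub_questions if sub.strip()]

# Step 3: compose the final answer from the gathered facts.
facts_text = "\n".join(facts)
answer = complete(f"Using these facts:\n{facts_text}\n\nAnswer: {question}")
```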
There are two main advantages to this approach. First, it grounds our system in more factual analysis, and lets us attach citations to the generated knowledge. Second, we get a chance to bail out if a chain of thought is inaccurate or not discoverable.
Caring about Consistency
Self-consistency is a fairly straightforward concept. It's a prompting technique that typically combines chain of thought with few-shot prompting. It further solidifies those results by prompting multiple times, and then taking the most frequent answer.14
The simplest example is math problems. For instance, you might prompt the model to solve a math problem using traditional chain of thought reasoning. If the LLM answers 35, 35, 105, and 35...well, we can assume the correct answer is likely 35.
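In code, that might be a simple sampling loop with a majority vote at the end. The prompt format and model name are placeholders:

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

prompt = (
    "A cafeteria had 23 apples, used 20 for lunch, then bought 6 more. "
    "How many apples are left? Think step-by-step, then put the final "
    "answer alone on the last line as 'Answer: <number>'."
)

# Sample several independent reasoning paths; temperature > 0 ensures
# the paths can actually differ.
answers = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    last_line = response.choices[0].message.content.strip().splitlines()[-1]
    answers.append(last_line.removeprefix("Answer:").strip())

# The most frequent answer wins (hopefully "9").
final_answer = Counter(answers).most_common(1)[0][0]
```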
Touching Trees
Tree of thoughts is an advanced (and resource-heavy) technique which uses chain of thought and self-evaluation to explore multiple branches of reasoning.15 It encourages the model to generate multiple potential chains of thought, and then uses a traditional tree traversal algorithm to start exploring those options. When we reach a bad point of reasoning, we can backtrack to an earlier state.
At the time of writing, I'm not sure if there are any developer-friendly resources to help implement this technique. The original paper included source code, and there's a fairly active repository that claims to provide plug-and-play access to the technique. But I've not used either, so I can't provide any concrete recommendations.
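That said, the core search loop is small enough to sketch conceptually. Everything below is hypothetical: generate_thoughts and score_state would each be LLM calls in practice (propose candidate next steps; self-evaluate a partial solution), and this is a simple beam search rather than the paper's exact traversal:

```python
from heapq import nlargest

def generate_thoughts(state: str) -> list[str]:
    """Hypothetical: prompt the model for a few candidate next steps."""
    raise NotImplementedError

def score_state(state: str) -> float:
    """Hypothetical: ask the model to rate a partial solution from 0 to 1."""
    raise NotImplementedError

def tree_of_thoughts(problem: str, depth: int = 3, beam_width: int = 2) -> str:
    frontier = [problem]
    for _ in range(depth):
        # Expand every surviving state with its candidate thoughts...
        candidates = [
            f"{state}\n{thought}"
            for state in frontier
            for thought in generate_thoughts(state)
        ]
        # ...then keep only the most promising ones. Abandoning weak
        # branches plays the role of backtracking in this simplified version.
        frontier = nlargest(beam_width, candidates, key=score_state)
    return max(frontier, key=score_state)
```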
Playing with Parameters
This article wouldn't be complete without at least a basic intro to model tuning. This is a subject worthy of an entire article in itself, which I'll hopefully find time to write soon. But here's a quick overview of some common parameters, followed by a sketch of setting them in code.16
- Max Tokens caps the length of the response the model gives.
- Temperature controls how random the token selection will be.
- Top P / Top K are both used to control the sampling pool for token selection. Top P limits by cumulative probability mass, while Top K limits to a fixed number of candidate tokens.
- Presence / Frequency Penalties both discourage repetitive tokens. Presence applies a flat penalty once a token has appeared at all, whereas frequency scales the penalty with how often the token repeats.
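Most SDKs expose these directly. A sketch with the OpenAI Python SDK, which offers top_p but (to my knowledge) no top-k parameter; Gemini's generation config has both. The model name and values are placeholders:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "Name three web testing frameworks."}],
    max_tokens=200,         # cap the length of the response
    temperature=0.2,        # low randomness suits a factual task
    top_p=0.9,              # sample only from the top 90% of probability mass
    presence_penalty=0.1,   # flat penalty once a token has appeared at all
    frequency_penalty=0.3,  # penalty that grows with each repetition
)
print(response.choices[0].message.content)
```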
Conclusion
There are some neat things we can do with LLMs, and new techniques are coming out all the time to help squeeze more and more out of these models. LLMs seem to be pretty good at semantic reasoning and classification problems, which both can be further boosted by the ideas discussed in this article. If you come across an idea that I haven't shared, please reach out! My contact info is on the About Me page.
Sources
- Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data
- Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization
- Well structured input data helps LLMs
- Don’t tell me what (not) to do!
- Prompt engineering [Strategy: Provide reference text]
- What We Learned from a Year of Building with LLMs (Part I)
- Prompt Engineering [Few Shot]
- Calibrate Before Use: Improving Few-Shot Performance of Language Models
- Applied LLMs [Focus on getting the most out of fundamental prompting techniques]
- The Pragmatic Prompter 🙃
- Applied LLMs [Long-context models won’t make RAG obsolete]
- Complexity-Based Prompting for Multi-Step Reasoning
- Generated Knowledge Prompting
- Self-Consistency
- Tree of Thoughts (ToT)
- LLM Settings