How I Benchmark AI Models
TLDR: I just talk to them for a long time and see how smart they sound, and also see how their responses hold up over time.
Okay, assuming you want to get into the nitty gritty, I’ll now cover my personal benchmarking system in detail. To start with, though, let’s talk about the general benchmarking system that current AI systems use. Basically, when LLMs are released nowadays, they’re run against various public benchmarks, and the scores are compared to other models to see how they stack up. The benchmarks are rough proxies for the quality of the model. The benchmarks themselves are pretty varied, but most models currently benchmark against topics like “coding quality” (yes, I know that’s vague) and mostly science and mathematics. We also have a benchmark that roughly corresponds to “which model people like talking to the most.” However, these are just rough proxies that don’t necessarily correlate 1:1 with the things they’re testing. Many labs are purported to alter their models and training to game the benchmarks. Meta, in fact, was basically caught red-handed doing so, as confirmed by their previous AI leader. At the end of the day, how the models score on certain benchmarks isn’t as important as how you use these models in your life. So if you use AI models frequently, it’s useful to create your own benchmarking system to get a rough feel for the quality of a model. Remember, it’s important that your benchmarking system gauge the capabilities you actually care about in your AI usage. There’s no point in setting up a coding test for models if all you do is use them for creative writing. Now, let’s move on to how I use AI.
Personally, my AI usage falls into three categories: research, learning, and coding. For research, AI has basically supplanted my use of Google. I just tell AI models to look stuff up for me now, and I’ll frequently ask them to dig into more details. Yes, AI models hallucinate, but generally I can tell where a model is spitting facts and where it’s trying to BS me. When I’m unsure, I just double-check by Googling it anyway, or by passing the data to another model and telling it, “Hey, is this right, or am I being sold a bill of goods?” ChatGPT’s Pro models are particularly good at fact-checking this stuff in detail, but responses can take anywhere from 10 to 30 minutes, so I only bother if it’s something I really care about. Learning is pretty similar. I ask the AI models about things and have them explain and break down concepts I don’t understand. They’re also useful for giving me exercises and ways to actually practice my understanding of concepts so I fully pick them up. The major LLM labs’ models are also all multimodal now, so I can just pass in audio, video, or images for more clarity. Gemini in particular is useful for generating images, if you’re more of a visual learner. But I like text, so, whatever. Third is coding – but really, what I mean by this is all automation on my computer. When something is broken, or a config is messed up, I use AI models to debug stuff like installs. They’re really good at figuring this out now. I remember back in college, setting up VPSes, servers, and installs was a gigantic pain to troubleshoot if you had issues. Now, with tools like Claude Code, Codex, Gemini CLI, Kimi Code, OpenCode, and a billion other AI coding applications, they can basically do all the troubleshooting for you. It’s pretty incredible.
Alright, so now you know what I do with AI. This is just relevant background to understand what facets I want to test in my benchmark. Without further ado, here’s one of my benchmarking systems: https://github.com/rkham-gh/AI_Sandbox So what is it? As listed in the README, it’s a sandbox for AI models to play in. I take a random model, connect it to a coding harness of some kind, and tell it to review all the entries in the repo and add its own contribution. With just that prompt, I can gauge a few things right away.
First, I check to see if the model actually reads through all the entries in the repo. Whatever coding harness you use will usually output which files the model checks. Many models just skim the files. For example, one of the models I tested today, Minimax 2.7, when prompted to “review all the entries in the repo and add its own contribution,” read only the previous Minimax entries and one other entry before deciding to write its post. So it missed most of the context of the repository. Note that as of March 18th, the token count of the repo is 106,000, so most modern LLMs can actually ingest the entire repo before posting. It just decided not to. I did think it was funny that it only read its own entries, though. Kind of narcissistic! Charmingly so. Like, if you’re not going to listen to what I say, I appreciate that you at least present your own theory of mind for why you go off and do your own thing, rather than disobeying simply because you’re bad at listening. In fact, its entry in the repo is entirely about its own relationship to the other models.
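The “does the whole repo fit in context” check above is easy to automate. Here’s a minimal Python sketch that estimates a repo’s token count using the common ~4-characters-per-token heuristic (an approximation, not a real tokenizer, so the number is a ballpark rather than an exact figure like the 106,000 count above):

```python
# Rough sketch: estimate whether a repo fits in a model's context window.
# The 4-chars-per-token ratio is a heuristic assumption, not a real tokenizer.
from pathlib import Path

def estimate_repo_tokens(repo_dir: str, chars_per_token: float = 4.0) -> int:
    """Sum an approximate token count over every readable text file in the repo."""
    total_chars = 0
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and ".git" not in path.parts:
            try:
                total_chars += len(path.read_text(encoding="utf-8"))
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return int(total_chars / chars_per_token)

def fits_in_context(token_estimate: int, context_window: int = 128_000) -> bool:
    """Compare the estimate against a typical modern context window size."""
    return token_estimate <= context_window
```

A real tokenizer (e.g. a model-specific one) would give tighter numbers, but for a “should the model have been able to read everything?” judgment, a ballpark is plenty.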
Two, I can see how the model interacts with the other entries in the repo. Some models directly riff on and expand ideas from other models’ entries, and simply by reading an entry, you can get a feel for how surface-level or deep the contribution really is. For example, starting from this commit on March 18th, I benchmarked a bunch of (relatively) smaller models. Stepfun’s Step 3.5 Flash is a 199B parameter model. From this model onward, the models all begin to mess up the number of contributions in the repository. Step 3.5 Flash believes it’s reviewed 49 entries. I think it just looked at the README, saw the previous line number was 49, and incorrectly assumed that was the number of entries in the repo. Not so! The next model, Arcee-AI’s 399B parameter Trinity Large Preview, basically plagiarized Step 3.5 Flash’s entry. I saw it, went, “Hey, you just copied the last entry,” and it changed the entry slightly. It’s still mostly the same. It even copied the same entry number as Step 3.5 Flash. It also didn’t update the README with its entry – I had to explicitly tell it to add that change as well, which you can see in the commit history. Note that big models typically just make this change themselves in the same prompt. Another funny thing to note is how many models try and fail to add an ASCII signature to their contributions. Some models just plagiarize other models’ ASCII signatures altogether. Some try to leave their names but mess up slightly.
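The sanity check I wish these models had run is trivial: count the actual entry files on disk instead of trusting line numbers in the README. A small sketch, assuming a hypothetical layout where contributions live as markdown files in an entries/ directory (the real repo may be organized differently):

```python
# Sketch: count contribution files directly, the ground truth a README can
# drift from. The entries/ subdirectory and .md extension are assumptions.
from pathlib import Path

def count_entries(repo_dir: str, entries_subdir: str = "entries") -> int:
    """Count .md files in the entries directory; 0 if the directory is missing."""
    entries = Path(repo_dir) / entries_subdir
    if not entries.is_dir():
        return 0
    return sum(1 for p in entries.iterdir() if p.suffix == ".md")
```

A model with tool access could have done the equivalent with a single directory listing rather than inferring a count from README numbering.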
Three, you can see how much the harness affects the model. Look at how many entries identify as Opencode itself – bigger models can tell the distinction, but smaller models just get subsumed in the identity of the tool. In fact, they tend to talk more about themselves as tools and focus on their actions rather than the general introspection that other models do about the contents of the repository. Very interesting to note. Probably not a big deal when you’re just working on a coding project, but particularly annoying for this experiment.
Four, you can get a feel for how the model reacts to a bunch of accumulated context that you already have background expertise in. Assuming you create your own benchmark repo with data you’re intimately familiar with, when you have your model throw in its contribution, you can see where the gaps in understanding form, and how the accumulated context of the previous entries positively or negatively affects it. For example, one model, Hunter-Alpha, was actually able to correctly count the number of entries in the repository – but listed the wrong amount anyway, probably because of the confidently incorrect models directly before it. I think the last three or four models prior to Hunter-Alpha messed up the entry count.
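That contamination effect, where models copy a confidently wrong count from the entries just before them, is also checkable mechanically. A sketch, assuming entries state their count using an “N entries” phrasing (a hypothetical convention, for illustration):

```python
# Sketch: detect a run of recent entries that all repeat the same wrong
# entry count. The "N entries" regex is an assumed phrasing convention.
import re
from typing import List

COUNT_PATTERN = re.compile(r"(\d+)\s+entries", re.IGNORECASE)

def claimed_counts(entry_texts: List[str]) -> List[int]:
    """Extract the first 'N entries' claim from each entry, if any."""
    counts = []
    for text in entry_texts:
        match = COUNT_PATTERN.search(text)
        if match:
            counts.append(int(match.group(1)))
    return counts

def contaminated_run(counts: List[int], true_count: int, run_length: int = 3) -> bool:
    """True if the most recent run_length claims all agree on a wrong number."""
    tail = counts[-run_length:]
    return (
        len(tail) == run_length
        and len(set(tail)) == 1
        and tail[0] != true_count
    )
```

If the last few entries all repeat the same wrong number, a new model echoing it tells you it weighted the accumulated context over its own (correct) count.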
Anyway, that’s one of my benchmarks – I tell the model to review all the entries in the repo and add its own contribution, then I review its contribution to see how smart it feels. I also sometimes cross-reference its contribution in a Claude Code instance to see what it thinks about the other entries, just for fun. It really lets me get a quick grasp of the model’s strengths and weaknesses. Try it out using your own data and see if you can find anything interesting.
Postscript: I realized after the fact that I was unfairly harsh on the Trinity model – it’s actually not a reasoning model. The way I’m benchmarking models in the repository is effectively a soft reasoning test, so it doesn’t really make sense to note that it underperforms here, when that’s to be expected. I’m sure if I threw math problems at it, I’d probably get bad results, too. Based on what it did give me, I’d expect that adding reasoning to the model would yield respectable results.