Recognizing the growing use of LLMs in image recognition, we decided to investigate their applicability to our task. This led us to compare an LLM approach with traditional machine learning methods.
Our recent project involved creating an image recognition model capable of determining the state of electrical fuses (installed or not installed) in electrical boxes. The image below shows two connected phases, each with an electrical fuse in the “installed” state; the model we trained identifies these as two phases, both “installed.”
We employed the Google Vision service for training, leveraging a large dataset provided by the customer and conducting the entire training process in the cloud.
When we tried using LLMs to recognize the fuses, we found that only a handful of the big-name models can handle images at all. We tested every one we could get our hands on at the time. For each model, we described what it was looking at in the picture, what we wanted it to do, and how we wanted the results laid out, then checked how accurately it answered. We also tried two common prompting techniques: zero-shot and few-shot. With zero-shot, we simply showed the LLM the picture along with the task description and output format. With few-shot, we first showed it a few example pictures together with the correct answers, and only then the new picture with the same task description and output format (see the sketch below).
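To make the two prompting styles concrete, here is a minimal sketch using the OpenAI Python SDK (openai>=1.0). The prompt wording, the JSON output format, and the example image paths are illustrative assumptions, not our exact production prompts:

```python
# Sketch of zero-shot vs. few-shot prompting for the fuse task.
# Assumes OPENAI_API_KEY is set in the environment.
import base64
from openai import OpenAI

client = OpenAI()

TASK = (
    "You are looking at a photo of an electrical box with one or more phases. "
    "For each phase, decide whether its fuse is 'installed' or 'not installed'."
)
OUTPUT_FORMAT = 'Reply as JSON: {"phases": [{"index": 1, "state": "installed"}]}'


def image_part(path: str) -> dict:
    """Encode a local image as a data URL that vision-capable models accept."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def zero_shot(image_path: str, model: str = "gpt-4o") -> str:
    """Single message: task description, output format, and the image."""
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": f"{TASK}\n{OUTPUT_FORMAT}"},
                    image_part(image_path)],
    }]
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content


def few_shot(examples: list[tuple[str, str]], image_path: str, model: str = "gpt-4o") -> str:
    """`examples` is a list of (example_image_path, correct_json_answer) pairs
    that are sent as prior user/assistant turns before the real question."""
    messages = []
    for example_path, answer in examples:
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": TASK}, image_part(example_path)]})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": f"{TASK}\n{OUTPUT_FORMAT}"},
                                 image_part(image_path)]})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```

The few-shot variant sends the example images and answers on every request, which is exactly why it consumes more tokens and costs more per call.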
It turns out the few-shot method worked noticeably better with every LLM we tried. The downside is that it sends more data (measured in tokens) to the LLM, so it costs a bit more per request than the zero-shot approach. Here is how the two methods compared, along with the models we used and their accuracy:
| Model | Zero-shot cost | Zero-shot precision [%] | Few-shot cost | Few-shot precision [%] |
|---|---|---|---|---|
| gpt-4o | $0.0135 | 58 | $0.04 | 85 |
| gpt-4o-mini | $0.0008 | 21 | $0.0024 | 58 |
| gemini-2.0-flash-lite-preview-02-05 | $0.0002 | 21 | $0.0006 | 75 |
Comparing these numbers with what we got from the original Google Vision setup, the dedicated machine learning model still comes out slightly ahead of the LLMs, but the gap isn’t huge.
Using LLMs shows real potential, but it depends on the use case. In our fuse example, building a dedicated machine learning model cost us around 2000€ in infrastructure alone, not counting what we paid the developers. If a customer only needs to run the model maybe 1000 times a year, it is far cheaper to just call an LLM and pay per use: 1000 * $0.04 = $40, roughly 40€ a year with few-shot gpt-4o. But there is clearly a break-even point: if you run the model heavily enough, training your own makes more sense (see the rough calculation below).
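A back-of-the-envelope sketch of that break-even point, using the numbers above (the ~2000€ one-off cost and the $0.04 few-shot gpt-4o request from our table, treating $0.04 ≈ 0.04€ and ignoring developer time, hosting, and retraining):

```python
# Rough break-even estimate: one-off custom model vs. pay-per-use LLM.
UPFRONT_CUSTOM_MODEL = 2000.0   # approximate one-off training/infrastructure cost (EUR)
COST_PER_LLM_REQUEST = 0.04     # few-shot gpt-4o cost per image (USD, treated as ~EUR)


def yearly_llm_cost(requests_per_year: int) -> float:
    """Pay-per-use cost for a given yearly request volume."""
    return requests_per_year * COST_PER_LLM_REQUEST


def break_even_requests() -> int:
    """Number of requests at which pay-per-use catches up with the upfront cost."""
    return int(UPFRONT_CUSTOM_MODEL / COST_PER_LLM_REQUEST)


print(yearly_llm_cost(1000))   # 40.0  -> pay-per-use wins at low volume
print(break_even_requests())   # 50000 -> beyond this, the custom model starts paying off
```

At roughly 50,000 requests the LLM bill matches the upfront cost of the custom model, so anything well past that volume favours training your own.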
It’s also worth pointing out that accuracy varies from one LLM to another. For our task, gemini-2.0-flash-lite gave us the best value for money in terms of price and precision.
At this point, it might seem like just using an LLM is the way to go for something like this. But we’ve got to ask ourselves a few questions. Do we want to host the LLM ourselves so we have control over our data, or are we okay with using a service like OpenAI? And how fast do we need to get our results? It’s no secret that LLMs take way longer to give you an answer compared to the old-school methods.
All these questions, and probably a few more, are going to decide whether an LLM or regular machine learning is the best fit for what a customer needs.