This blog post dives into the technical details of the topic discussed in our business-oriented counterpart. That post compared the application of Large Language Models (LLMs) for image recognition with more traditional machine learning techniques. The specific task we examined was detecting the state of electrical fuses within electrical boxes containing multiple phases; each fuse was classified as either “installed” or “not installed.”

For our implementation, we opted to utilize the LangChain framework, a widely adopted tool for interacting with LLMs. LangChain facilitates the efficient use of multiple models and enables the comparison of their respective outputs.

We employed APIs to access several LLMs, including those provided by OpenAI and Google. A primary step involved developing a parsing mechanism for the LLM output, as this output is typically unstructured text. To address this, we used LangChain's PydanticOutputParser, which allowed us to construct a prompt that instructs the LLM to format its output according to a predefined structure. The example below illustrates how an output parser can be created for our specific detection task.
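A minimal sketch of such a parser, assuming a hypothetical `FuseState` schema with one boolean field per phase (the field names are our illustration, not necessarily the exact schema used in the project):

```python
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field


# Hypothetical schema for the detection task; the real project may have used
# different field names, but the mechanism is the same.
class FuseState(BaseModel):
    """Detected state of the fuses in a multi-phase electrical box."""
    phase_1_installed: bool = Field(description="Whether the fuse for phase 1 is installed")
    phase_2_installed: bool = Field(description="Whether the fuse for phase 2 is installed")
    phase_3_installed: bool = Field(description="Whether the fuse for phase 3 is installed")


parser = PydanticOutputParser(pydantic_object=FuseState)

# The parser produces formatting instructions that we embed in the prompt,
# telling the LLM to answer as JSON matching the schema above.
format_instructions = parser.get_format_instructions()
```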

The next stage involved setting up a chat model suitable for our purpose. Given the nature of the task, we had to choose among chat models that support multimodal inputs, so that images could be sent as input data. The prompt used here was a text-based prompt that transmitted the input image to the LLM, along with the defined structure for the output information.
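A sketch of this setup, assuming gpt-4o as the multimodal model and a placeholder image path (a Google model via langchain-google-genai plugs into the same message format):

```python
import base64

from langchain_openai import ChatOpenAI

# One of the multimodal chat models we tested.
model = ChatOpenAI(model="gpt-4o", temperature=0)


def encode_image(path: str) -> str:
    """Return the base64 encoding of an image file."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# A multimodal message body: the task description and format instructions
# as text, followed by the image encoded as a data URL.
# 'fuse_box.jpg' is a placeholder path for illustration.
task_content = [
    {"type": "text",
     "text": "For each phase of the fuse box in the image, decide whether "
             "a fuse is installed.\n" + format_instructions},
    {"type": "image_url",
     "image_url": {"url": f"data:image/jpeg;base64,{encode_image('fuse_box.jpg')}"}},
]
```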

After the prompt was prepared, we chained all parts together and then invoked the model.
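With the pieces above, the chaining and invocation can look like this minimal sketch using the LangChain Expression Language:

```python
from langchain_core.messages import HumanMessage

# Pipe the chat model into the output parser, then invoke the chain
# with our multimodal message.
chain = model | parser

result = chain.invoke([HumanMessage(content=task_content)])
print(result)  # e.g. FuseState(phase_1_installed=True, phase_2_installed=False, ...)
```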

The methodology we have just described is referred to as zero-shot learning: the LLM is given a description of the problem, followed by the input data and the desired output format. To enhance the quality of the results, we also implemented a few-shot learning approach. This technique is similar to zero-shot learning, but it additionally gives the LLM a set of example “training” data. In our specific use case, this meant providing the LLM with input images of the fuse box alongside their corresponding expected outputs. The following example shows how we modified our prompt to the LLM.
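A sketch of how such examples can be built as alternating human/AI messages; the image paths and labels here are hypothetical stand-ins for the handful of images we labeled:

```python
from langchain_core.messages import AIMessage, HumanMessage

# Hypothetical labeled examples: an image of the fuse box paired with the
# expected, already-structured answer.
labeled_examples = [
    ("examples/all_fuses_installed.jpg",
     FuseState(phase_1_installed=True, phase_2_installed=True, phase_3_installed=True)),
    ("examples/phase_2_missing.jpg",
     FuseState(phase_1_installed=True, phase_2_installed=False, phase_3_installed=True)),
]

# Each example becomes a human message (the image) followed by an AI message
# (the expected answer, serialized in the format the parser expects).
few_shot_messages = []
for path, expected in labeled_examples:
    few_shot_messages.append(HumanMessage(content=[
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}},
    ]))
    few_shot_messages.append(AIMessage(content=expected.model_dump_json()))
```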

We then injected this few-shot prompt into our original prompt.
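A sketch of the combined prompt, again with a placeholder path for the new image:

```python
# The final message list: task description first, then the few-shot
# examples, and only then the "new" image we want labeled.
messages = [
    HumanMessage(content="For each phase of the fuse box in the image, decide "
                 "whether a fuse is installed.\n" + format_instructions),
    *few_shot_messages,
    HumanMessage(content=[
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image('new_fuse_box.jpg')}"}},
    ]),
]

result = chain.invoke(messages)
```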

Right before the “new” image to be labeled, we inserted the few-shot examples, which the LLM can use as a guideline for its next output. This approach required labeling a few images up front, but it drastically improved the LLM's precision. The following table compares the zero-shot and few-shot approaches in terms of output precision.

| Model                               | Zero-shot precision [%] | Few-shot precision [%] |
|-------------------------------------|-------------------------|------------------------|
| gpt-4o                              | 58                      | 85                     |
| gpt-4o-mini                         | 21                      | 58                     |
| gemini-2.0-flash-lite-preview-02-05 | 21                      | 75                     |

With the help of LangChain, it was quite exciting to try this LLM-based approach to image recognition. We believe there are many customer use cases where this approach could be the optimal choice.
