This blog post dives into the technical details behind our business-oriented counterpart, which you can read here. That post compared the use of Large Language Models (LLMs) for image recognition with more traditional machine learning techniques. The specific task we examined was detecting the state of electrical fuses within electrical boxes containing multiple phases, with the state classified as either “installed” or “not installed.”
For our implementation, we chose the LangChain framework, a widely adopted tool for interacting with LLMs. LangChain makes it easy to work with multiple models and compare their outputs.
We accessed several LLMs through their APIs, including those provided by OpenAI and Google. A first step was building a parsing mechanism for the LLM output, since that output is typically unstructured text. To address this, we used the PydanticOutputParser, which let us construct a prompt instructing the LLM to format its output according to a predefined structure. The following example illustrates how an output parser can be created for our detection task.
```python
from typing import List, Optional

from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field


class Fuse(BaseModel):
    """State of an electrical fuse in a fuse box."""
    is_installed: bool = Field(description="True when the electrical fuse is installed between two electrical terminals, otherwise False")


class ElectricalPhase(BaseModel):
    """Information about an electrical phase in a fuse box."""
    fuse: Optional[Fuse] = Field(description="The electrical fuse is a small object that sits between electrical terminals")


class FuseBox(BaseModel):
    """Information about the fuse box in the picture."""
    phases: List[ElectricalPhase] = Field(description="The list of electrical phases with their fuses in the picture")

    def get_fuses_state(self):
        return [phase.fuse.is_installed for phase in self.phases if phase.fuse is not None]


output_parser = PydanticOutputParser(pydantic_object=FuseBox)
```
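To illustrate what the parser produces, here is a minimal sketch that feeds it a hand-written JSON response mirroring the schema above; the sample string is ours, not actual model output.

```python
# Hypothetical LLM response, written by hand to match the FuseBox schema.
sample_response = '{"phases": [{"fuse": {"is_installed": true}}, {"fuse": {"is_installed": false}}, {"fuse": null}]}'

fuse_box = output_parser.parse(sample_response)
print(fuse_box.get_fuses_state())  # [True, False] – phases without a fuse are skipped
```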
The next stage was setting up a chat model suitable for our purpose. Given the nature of the task, we had to choose from chat models that support multimodal inputs, so that images can be sent as input data. The prompt used here was a text-based prompt that passed the input image to the LLM, along with the defined structure for the output information.
```python
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate


def create_prompt(output_parser: PydanticOutputParser, image: str):
    return ChatPromptTemplate([
        (
            "system",
            "You are a helpful assistant that checks electrical fuse box images. "
            "You can detect if fuses in boxes are installed or not.\n"
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}"
        ),
        HumanMessage(
            content=[
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            ]
        )
    ]).partial(format_instructions=output_parser.get_format_instructions())
```
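The `create_prompt` function expects a base64-encoded image, and the chain below also needs a model instance; neither is shown in this post. The following is one possible setup, where the helper name, the file path, and the choice of provider packages are our assumptions rather than the original code.

```python
import base64

from langchain_openai import ChatOpenAI
# from langchain_google_genai import ChatGoogleGenerativeAI


def image_to_base64(image_path: str) -> str:
    """Read an image file and return its base64-encoded contents as a string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Any multimodal chat model can be plugged in, e.g. one of the models compared below.
model = ChatOpenAI(model="gpt-4o")
# model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite-preview-02-05")

prompt = create_prompt(output_parser, image_to_base64("fuse_box.jpg"))
```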
After the prompt was prepared, we chained all parts together and then invoked the model.
```python
chain = prompt | model | output_parser
result = chain.invoke({})
```
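Because the chain ends with the PydanticOutputParser, `result` is already a FuseBox instance, so the helper method from the schema can be used directly; the printed values below are only an illustration.

```python
# result is a FuseBox, so the parsed fuse states are directly available.
print(result.get_fuses_state())  # e.g. [True, True, False]
```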
The methodology we have just described is referred to as zero-shot learning: we provide the LLM with a description of the problem, followed by the input data and the desired output format. To improve the quality of the results, we also implemented a few-shot learning approach. This technique is similar to zero-shot learning, but it additionally gives the LLM a set of example “training” data. In our use case, this meant providing the LLM with input images of fuse boxes alongside their corresponding expected outputs. The following example shows how we modified the prompt we send to the LLM.
```python
from typing import List

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate


def create_few_shot_prompt(labels: List[Label]):
    examples = []
    few_shot_template = ChatPromptTemplate.from_messages([
        ("human", [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{input}"}}]),
        ("ai", "{output}")
    ])

    for label in labels:
        # Build the expected structured output for this labeled image.
        electrical_phases = [ElectricalPhase(fuse=fuse) for fuse in label.fuses]
        fuse_box = FuseBox(phases=electrical_phases)
        examples.append({
            "input": image_to_base64(label.image_path),
            "output": fuse_box.model_dump_json()
        })

    return FewShotChatMessagePromptTemplate(
        example_prompt=few_shot_template,
        examples=examples,
    )
```
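The `Label` objects above come from our manual labeling step and their definition is not part of this post. A minimal stand-in, with its shape inferred purely from how `create_few_shot_prompt` uses it, might look like this (in a real script it would be defined before that function):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    """Hypothetical ground-truth annotation for one example image."""
    image_path: str     # path to the labeled fuse box photo
    fuses: List[Fuse]   # one Fuse per phase, marking whether it is installed


labels = [
    Label(image_path="examples/box_1.jpg", fuses=[Fuse(is_installed=True), Fuse(is_installed=False)]),
    Label(image_path="examples/box_2.jpg", fuses=[Fuse(is_installed=True), Fuse(is_installed=True)]),
]
```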
We then injected this few-shot prompt into our original prompt:
```python
def create_prompt(parser: PydanticOutputParser, image: str, few_shot_prompt: FewShotChatMessagePromptTemplate):
    return ChatPromptTemplate([
        (
            "system",
            "You are a helpful assistant that checks electrical fuse box images. "
            "You can detect if fuses in boxes are installed or not.\n"
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}"
        ),
        few_shot_prompt,
        HumanMessage(
            content=[
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            ]
        )
    ]).partial(format_instructions=parser.get_format_instructions())
```
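Putting the pieces together, the few-shot chain is assembled and invoked exactly like the zero-shot one; the image path here is a placeholder of our own.

```python
few_shot_prompt = create_few_shot_prompt(labels)
prompt = create_prompt(output_parser, image_to_base64("new_fuse_box.jpg"), few_shot_prompt)

chain = prompt | model | output_parser
result = chain.invoke({})
```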
Right before the “new” image to be labeled, we inserted the few-shot prompt, which the LLM can use as a guideline for its next output. This approach required labeling a few images, but it drastically improved the LLM's precision. The following table compares the zero-shot and few-shot approaches in terms of output precision.
| Model | Zero-shot precision [%] | Few-shot precision [%] |
| --- | --- | --- |
| gpt-4o | 58 | 85 |
| gpt-4o-mini | 21 | 58 |
| gemini-2.0-flash-lite-preview-02-05 | 21 | 75 |
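The post does not spell out exactly how precision was measured; one plausible way to score the predictions is to compare the predicted per-fuse states against the labels, along the lines sketched below. This is our interpretation, not the evaluation script behind the table.

```python
from typing import List


def installed_precision(predicted: List[List[bool]], expected: List[List[bool]]) -> float:
    """Precision for the 'installed' class: of all fuses predicted as installed,
    the fraction that really were installed according to the labels."""
    true_positives = 0
    predicted_positives = 0
    for pred_states, true_states in zip(predicted, expected):
        for pred, true in zip(pred_states, true_states):
            if pred:
                predicted_positives += 1
                if true:
                    true_positives += 1
    return true_positives / predicted_positives if predicted_positives else 0.0
```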
With the help of LangChain, it was quite exciting to try this LLM-based approach to image recognition. We believe there might be many customer use cases where this approach could be optimal.