This blog post dives into the technical details behind our business-oriented counterpart, which you can read here. That post compared the use of Large Language Models (LLMs) for image recognition with more traditional machine learning techniques. The specific task we examined was detecting the state of electrical fuses within electrical boxes containing multiple phases, with the state classified as either “installed” or “not installed.”
For our implementation, we chose the LangChain framework, a widely adopted tool for interacting with LLMs. LangChain makes it easy to work with multiple models and compare their outputs.
We accessed several LLMs through their APIs, including those provided by OpenAI and Google. A first step was building a parsing mechanism for the LLM output, since that output is typically unstructured text. To address this, we used the PydanticOutputParser, which let us construct a prompt instructing the LLM to format its output according to a predefined structure. The following example illustrates how an output parser can be created for our detection task.
```python
from typing import List, Optional

from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field


class Fuse(BaseModel):
    """State of an electrical fuse in a fuse box."""
    is_installed: bool = Field(description="True when the electrical fuse is installed between two electrical terminals, otherwise False")


class ElectricalPhase(BaseModel):
    """Information about an electrical phase in a fuse box."""
    fuse: Optional[Fuse] = Field(description="The electrical fuse is a small object that sits between electrical terminals")


class FuseBox(BaseModel):
    """Information about the fuse box in the picture."""
    phases: List[ElectricalPhase] = Field(description="The list of electrical phases with their fuses in the picture")

    def get_fuses_state(self):
        return [phase.fuse.is_installed for phase in self.phases if phase.fuse is not None]


output_parser = PydanticOutputParser(pydantic_object=FuseBox)
```
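To illustrate what the parser produces, here is a minimal sketch that feeds it a hand-written JSON response mirroring the schema above; the sample string is ours, not actual model output.

```python
# Hypothetical LLM response, written by hand to match the FuseBox schema.
sample_response = '{"phases": [{"fuse": {"is_installed": true}}, {"fuse": {"is_installed": false}}, {"fuse": null}]}'

fuse_box = output_parser.parse(sample_response)
print(fuse_box.get_fuses_state())  # [True, False] – phases without a fuse are skipped
```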
The next stage was setting up a chat model suitable for our purpose. Given the nature of the task, we had to choose from chat models that support multimodal inputs, so that images can be sent as input data. The prompt used here was a text-based prompt that passed the input image to the LLM, along with the defined structure for the output information.
```python
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate


def create_prompt(output_parser: PydanticOutputParser, image: str):
    return ChatPromptTemplate([
        (
            "system",
            "You are a helpful assistant that checks electrical fuse box images. "
            "You can detect if fuses in boxes are installed or not.\n"
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}"
        ),
        HumanMessage(
            content=[
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            ]
        )
    ]).partial(format_instructions=output_parser.get_format_instructions())
```
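The `create_prompt` function expects a base64-encoded image, and the chain below also needs a model instance; neither is shown in this post. The following is one possible setup, where the helper name, the file path, and the choice of provider packages are our assumptions rather than the original code.

```python
import base64

from langchain_openai import ChatOpenAI
# from langchain_google_genai import ChatGoogleGenerativeAI


def image_to_base64(image_path: str) -> str:
    """Read an image file and return its base64-encoded contents as a string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Any multimodal chat model can be plugged in, e.g. one of the models compared below.
model = ChatOpenAI(model="gpt-4o")
# model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite-preview-02-05")

prompt = create_prompt(output_parser, image_to_base64("fuse_box.jpg"))
```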
After the prompt was prepared, we chained all parts together and then invoked the model.
```python
chain = prompt | model | output_parser
result = chain.invoke({})
```
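Because the chain ends with the PydanticOutputParser, `result` is already a FuseBox instance, so the helper method from the schema can be used directly; the printed values below are only an illustration.

```python
# result is a FuseBox, so the parsed fuse states are directly available.
print(result.get_fuses_state())  # e.g. [True, True, False]
```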
The methodology we have just described is referred to as zero-shot learning: we provide the LLM with a description of the problem, followed by the input data and the desired output format. To improve the quality of the results, we also implemented a few-shot learning approach. This technique is similar to zero-shot learning, but it additionally gives the LLM a set of example “training” data. In our use case, this meant providing the LLM with input images of fuse boxes alongside their corresponding expected outputs. The following example shows how we modified the prompt we send to the LLM.
```python
from typing import List

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate


def create_few_shot_prompt(labels: List[Label]):
    examples = []
    few_shot_template = ChatPromptTemplate.from_messages([
        ("human", [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{input}"}}]),
        ("ai", "{output}")
    ])

    for label in labels:
        # Build the expected structured output for this labeled image.
        electrical_phases = [ElectricalPhase(fuse=fuse) for fuse in label.fuses]
        fuse_box = FuseBox(phases=electrical_phases)
        examples.append({
            "input": image_to_base64(label.image_path),
            "output": fuse_box.model_dump_json()
        })

    return FewShotChatMessagePromptTemplate(
        example_prompt=few_shot_template,
        examples=examples,
    )
```
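The `Label` objects above come from our manual labeling step and their definition is not part of this post. A minimal stand-in, with its shape inferred purely from how `create_few_shot_prompt` uses it, might look like this (in a real script it would be defined before that function):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    """Hypothetical ground-truth annotation for one example image."""
    image_path: str     # path to the labeled fuse box photo
    fuses: List[Fuse]   # one Fuse per phase, marking whether it is installed


labels = [
    Label(image_path="examples/box_1.jpg", fuses=[Fuse(is_installed=True), Fuse(is_installed=False)]),
    Label(image_path="examples/box_2.jpg", fuses=[Fuse(is_installed=True), Fuse(is_installed=True)]),
]
```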
We then injected this few-shot prompt into our original prompt:
```python
def create_prompt(parser: PydanticOutputParser, image: str, few_shot_prompt: FewShotChatMessagePromptTemplate):
    return ChatPromptTemplate([
        (
            "system",
            "You are a helpful assistant that checks electrical fuse box images. "
            "You can detect if fuses in boxes are installed or not.\n"
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}"
        ),
        few_shot_prompt,
        HumanMessage(
            content=[
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            ]
        )
    ]).partial(format_instructions=parser.get_format_instructions())
```
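Putting the pieces together, the few-shot chain is assembled and invoked exactly like the zero-shot one; the image path here is a placeholder of our own.

```python
few_shot_prompt = create_few_shot_prompt(labels)
prompt = create_prompt(output_parser, image_to_base64("new_fuse_box.jpg"), few_shot_prompt)

chain = prompt | model | output_parser
result = chain.invoke({})
```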
Right before the “new” image to be labeled, we inserted the few-shot prompt, which the LLM can use as a guideline for its next output. This approach required labeling a few images, but it drastically improved the LLM's precision. The following table compares the zero-shot and few-shot approaches in terms of output precision.
| Model | Zero-shot precision [%] | Few-shot precision [%] |
| --- | --- | --- |
| gpt-4o | 58 | 85 |
| gpt-4o-mini | 21 | 58 |
| gemini-2.0-flash-lite-preview-02-05 | 21 | 75 |
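The post does not spell out exactly how precision was measured; one plausible way to score the predictions is to compare the predicted per-fuse states against the labels, along the lines sketched below. This is our interpretation, not the evaluation script behind the table.

```python
from typing import List


def installed_precision(predicted: List[List[bool]], expected: List[List[bool]]) -> float:
    """Precision for the 'installed' class: of all fuses predicted as installed,
    the fraction that really were installed according to the labels."""
    true_positives = 0
    predicted_positives = 0
    for pred_states, true_states in zip(predicted, expected):
        for pred, true in zip(pred_states, true_states):
            if pred:
                predicted_positives += 1
                if true:
                    true_positives += 1
    return true_positives / predicted_positives if predicted_positives else 0.0
```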
With the help of LangChain, it was quite exciting to try this LLM-based approach to image recognition. We believe there might be many customer use cases where this approach could be optimal.