Image recognition with LLM: A dev's how-to

A technical walkthrough of implementing image recognition using LangChain and Large Language Models, comparing zero-shot and few-shot learning approaches for detecting electrical fuse states.

Adam Harnúšek · April 6, 2025

This article is the technical counterpart to our business-oriented overview of LLMs for image recognition. Here we dive into the implementation details using LangChain to interact with multiple LLMs via their APIs.

Defining the Output Schema with Pydantic

To get structured output from an LLM, we use PydanticOutputParser. First, we define the data models that represent what we want the model to return:

from pydantic import BaseModel, Field
from typing import List, Optional
from langchain_core.output_parsers import PydanticOutputParser
 
class Fuse(BaseModel):
    """State of electrical fuse in fuse box"""
    is_installed: bool = Field(description="True when an electrical fuse is installed between the two electrical terminals, otherwise False")
 
class ElectricalPhase(BaseModel):
    """Information about electrical phase in fuse box"""
    fuse: Optional[Fuse] = Field(description="An electrical fuse is a small object that sits between electrical terminals")
 
class FuseBox(BaseModel):
    """Information about fuse box in the picture"""
    phases: List[ElectricalPhase] = Field(description="The list of electrical phases in the picture, each with its fuse")
 
    def get_fuses_state(self):
        return [phase.fuse.is_installed for phase in self.phases if phase.fuse is not None]
 
output_parser = PydanticOutputParser(pydantic_object=FuseBox)

The three models capture the nested structure of a fuse box: a FuseBox contains multiple ElectricalPhase objects, each of which may contain a Fuse with a boolean is_installed field.
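To make the target schema concrete, here is a sketch (using only the standard library, with a hypothetical sample response) of the JSON shape the parser expects back from the model, and how the fuse states are read out of it, mirroring the `get_fuses_state` method above:

```python
import json

# Hypothetical example of the JSON the model is asked to return,
# matching the FuseBox schema: three phases, one with no fuse present.
sample = """
{
  "phases": [
    {"fuse": {"is_installed": true}},
    {"fuse": null},
    {"fuse": {"is_installed": false}}
  ]
}
"""

box = json.loads(sample)

# Same logic as FuseBox.get_fuses_state: skip phases whose fuse is null.
states = [phase["fuse"]["is_installed"]
          for phase in box["phases"]
          if phase["fuse"] is not None]

print(states)  # [True, False]
```

The phase with `"fuse": null` is simply skipped, which is why `fuse` is declared `Optional[Fuse]` in the Pydantic model.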

Building the Prompt

We construct a ChatPromptTemplate that passes a base64-encoded image to the model along with the system instructions and the output format specification from our parser:

from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate
 
def create_prompt(output_parser: PydanticOutputParser, image: str):
    return ChatPromptTemplate([
        (
            "system",
            "You are a helpful assistant that checks electrical fuse box images. You can detect if fuses in boxes are installed or not.\nAnswer the user query. Wrap the output in `json` tags\n{format_instructions}"
        ),
        HumanMessage(
            content = [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            ]
        )
    ]).partial(format_instructions=output_parser.get_format_instructions())
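The prompt embeds the image as a base64 data URL. A minimal helper for producing that string (the `image_to_base64` name used later in this article refers to the project's own utility, which isn't shown; this is one plausible implementation) could look like:

```python
import base64
from pathlib import Path

def image_to_base64(image_path: str) -> str:
    """Read an image file and return its contents base64-encoded as ASCII text,
    ready to be embedded in a data:image/jpeg;base64,... URL."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
```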

Running the Chain

Invoking the chain is straightforward with LangChain's LCEL syntax:

# `model` can be any multimodal chat model wired up via LangChain,
# e.g. ChatOpenAI(model="gpt-4o") from the langchain_openai package
chain = prompt | model | output_parser
result = chain.invoke({})  # returns a parsed FuseBox instance

Zero-Shot vs. Few-Shot Learning

Zero-Shot Learning

Zero-shot learning provides the LLM with only a description of the problem and the desired output format — no examples. It is the simplest approach and has no additional token cost beyond the query itself.

Few-Shot Learning

Few-shot learning enhances the prompt by including labeled example images alongside the query. This gives the model concrete reference points before it encounters the actual input, and consistently produces better results across all models we tested.

Here is how to build a few-shot prompt template:

from typing import List
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
 
def create_few_shot_prompt(labels: List[Label]):
    # `Label` is a project-specific record holding an example image path and
    # the known fuse states; `image_to_base64` encodes the image file.
    examples = []
    few_shot_template = ChatPromptTemplate.from_messages([
        ("human", [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{input}"}}]),
        ("ai", "{output}")
    ])
 
    for label in labels:
        fuses = label.fuses
        electrical_phases = [ElectricalPhase(fuse=fuse) for fuse in fuses]
        fuse_box = FuseBox(phases=electrical_phases)
        example = {
            "input": image_to_base64(label.image_path),
            "output": fuse_box.model_dump_json()
        }
        examples.append(example)
 
    return FewShotChatMessagePromptTemplate(
        example_prompt=few_shot_template,
        examples=examples,
    )
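The `Label` type above is part of the project's own tooling and isn't shown in the article. One plausible minimal shape, inferred purely from how `create_few_shot_prompt` consumes it (`label.image_path` and `label.fuses`), is:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Label:
    # Hypothetical shape, inferred from create_few_shot_prompt's usage:
    # `image_path` is fed to image_to_base64, and `fuses` holds one Fuse
    # (or None) per phase, matching the FuseBox schema defined earlier.
    image_path: str
    fuses: List[Optional["Fuse"]] = field(default_factory=list)

label = Label(image_path="examples/fuse_box_01.jpg", fuses=[None])
```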

And the updated prompt creation function that incorporates the few-shot template:

def create_prompt(parser: PydanticOutputParser, image: str, few_shot_prompt: FewShotChatMessagePromptTemplate):
    return ChatPromptTemplate([
        (
            "system",
            "You are a helpful assistant that checks electrical fuse box images. You can detect if fuses in boxes are installed or not.\nAnswer the user query. Wrap the output in `json` tags\n{format_instructions}"
        ),
        few_shot_prompt,
        HumanMessage(
            content = [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            ]
        )
    ]).partial(format_instructions=parser.get_format_instructions())

Performance Results

(Figure: electrical fuse example)

The few-shot approach produced significant precision improvements across every model we tested:

Model                                  Zero-shot Precision   Few-shot Precision
gpt-4o                                 58%                   85%
gpt-4o-mini                            21%                   58%
gemini-2.0-flash-lite-preview-02-05    21%                   75%

The jump from zero-shot to few-shot is substantial — for gpt-4o it went from 58% to 85%, and even the lighter gemini-2.0-flash-lite model reached 75% with few-shot examples. If you need to squeeze more accuracy out of a smaller or cheaper model, providing labeled examples is the most effective lever available.

For the business-side analysis including cost comparisons and a decision framework for choosing between LLMs and traditional ML, see Image recognition with LLM.
