Checking out Moondream

Moondream is an open “vision language model”, that is, a multimodal model that handles both images and text. It’s based on SigLIP for vision and Phi for language.

Getting ready is pretty straightforward: download one of the “.mf” files listed in their README and load it with the “moondream” Python package.

At the time of writing, they offer 0.5B and 2B parameter versions of the model. I’d stick to 2B if possible: 0.5B hallucinated way too easily in my testing.

I didn’t use any GPU acceleration and the response times were pretty acceptable. With the 2B model, it used between 9.5 GB and 10.2 GB of RAM during my usage.
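
To take this kind of measurement on another machine, a small helper can time each call and report the peak memory of the process. This is only a sketch: `timed` and `peak_rss_gib` are hypothetical names, not part of the moondream package, and the `resource` module exists only on Unix-like systems. It is not necessarily how the figures above were obtained.

```python
import sys
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def peak_rss_gib():
    """Peak resident memory of the current process in GiB (Unix only)."""
    import resource  # standard library, but not available on Windows

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux
    peak_bytes = peak if sys.platform == 'darwin' else peak * 1024
    return peak_bytes / 1024 ** 3
```

Wrapping a call such as `model.encode_image(image)` in `timed(...)` and printing `peak_rss_gib()` at the end of a session should give numbers comparable to the ones above.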

Captioning and prompting images

Let’s take the model for a spin. Here’s my script:

#!/usr/bin/env python3

from PIL import Image
from moondream import vl  # pip install moondream
from os.path import exists
from signal import signal, SIGINT
from sys import exit

# suppress the "KeyboardInterrupt" traceback on Ctrl+C
signal(SIGINT, lambda _, __: exit())

print('Loading model...')
model = vl(model='moondream-2b-int8.mf')

while True:

    print()
    filepath = input('Image filepath: ').strip()
    
    if not exists(filepath):
        print('File not found!')
        continue

    with Image.open(filepath) as image:

        # image preprocessing
        print('Preprocessing the image...')
        encoded_image = model.encode_image(image)

        # caption generation
        print('Generating a caption...')
        caption = model.caption(encoded_image)['caption']
        print('Caption:', caption)
        
        # prompting
        prompt = input('Prompt: ').strip()
        print('Generating an answer...')
        answer = model.query(encoded_image, prompt)['answer']
        print('Answer:', answer)
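
The same three calls can also be wrapped in a function for non-interactive use, for example to ask several questions about one image. This is a minimal sketch: `describe` is a hypothetical helper, and it assumes only the `encode_image`, `caption` and `query` calls used in the script above.

```python
def describe(model, image, prompts=()):
    """Caption an image and answer any number of prompts about it.

    The image is encoded once and the encoding is reused for every call,
    as in the interactive script.
    """
    encoded = model.encode_image(image)
    caption = model.caption(encoded)['caption']
    answers = {p: model.query(encoded, p)['answer'] for p in prompts}
    return caption, answers


# Usage sketch (needs the moondream package, Pillow and a model file):
#
#   from PIL import Image
#   from moondream import vl
#
#   model = vl(model='moondream-2b-int8.mf')
#   with Image.open('photo.jpg') as image:  # 'photo.jpg' is a placeholder
#       caption, answers = describe(model, image, ['What is in the picture?'])
```

Passing the model in as an argument keeps the slow loading step outside the function, so one loaded model can serve many images.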

And some of the results I got:

Caption: The image presents a vibrant illustration of a character with a dark, almost black, face. The character has a large, round, light blue body with a complex, swirling pattern of blue and white lines. The character’s eyes are large and round, with a white center and a black pupil, giving it a somewhat surprised or startled expression. The character’s arms are large and round, with a pinkish-orange hue. The character’s legs are also large and round, with a pinkish-orange hue. The character is set against a stark black background, which contrasts with the character’s colorful body. The character’s expression is neutral, with no visible emotion.

Prompt: What kind of character is this?

Answer: This is a cartoon character, specifically a blue and white character with a face.

Caption: The image depicts a young boy with brown hair and glasses, wearing a red and white striped shirt. He is smiling and looking directly at the viewer. The boy’s hair is styled in a somewhat messy manner. The background is black, providing a stark contrast to the boy’s colorful attire.

Prompt: What’s his name?

Answer: The image does not contain any text that would allow us to determine the name of the character.

Caption: The image presents a vibrant display of the Windows logo, a rectangular shape with rounded corners. The logo is composed of four distinct colors: red, green, yellow, and blue. The red is the largest and most prominent, followed by the green, then the yellow, and finally the blue. The logo is set against a stark black background, which accentuates its vivid colors. The logo is positioned in the center of the image, with the red and green colors on the left and right sides, respectively. The blue and yellow colors are located in the bottom right and top right corners, respectively. The logo is a simple, iconic design that is easily recognizable.

Prompt: Do you know if this is an older or recent logo of Windows?

Answer: The image shows a logo of Windows, which is an older version of the Windows logo. The logo features the iconic red and green squares, along with the blue square, all set against a black background. The design of the logo has remained consistent since its inception, making it a recognizable symbol of Windows computers.

Caption: The image shows a parking lot with a dark gray asphalt surface. Four cars are parked in the lot, each facing a different direction. The first car is a red sedan, the second is a white sedan, the third is an orange sedan, and the fourth is a red sedan. The cars are parked in a somewhat organized manner, with the red and orange cars positioned closer to the viewer than the white and red ones. The parking lot is marked with white lines, and the cars are parked in designated spaces. The image is taken from a high angle, providing a bird’s eye view of the parking lot.

Prompt: How many parking spots there are and how many are open?

Answer: There are five parking spots in the image, and three of them are currently open.

Caption: The image shows a man in a formal setting, possibly a library or study. He is wearing a dark suit jacket and a striped tie, along with round, dark-framed glasses. His hair is gray or silver, and he has a serious or contemplative expression. The background features a bookshelf filled with books, suggesting a library or study area.

Prompt: This is a famous mathematician, do you know his name?

Answer: Yes, the famous mathematician is known as “Euclid”.

Caption: The image depicts a vibrant carnival scene, with a large, ornate float dominating the center. The float is adorned with gold and white elements, including a large, white, possibly marble, bust of a female figure, and a large, gold, possibly metallic, structure. The float features a large, white, possibly marble, bust of a female figure, and a large, gold, possibly metallic, structure. The float also features a large, white, possibly marble, bust of a female figure, and a large, gold, possibly metallic, structure. The float is decorated with gold and white elements, including a large, white, possibly marble, bust of a female figure, and a large, gold, possibly metallic, structure. The float is surrounded by a crowd of people, some of whom are wearing costumes that resemble ancient Roman soldiers, and some of whom are wearing costumes that resemble modern-day dancers. The scene is set against a dark background, and the image is taken from a low angle, emphasizing the grandeur of the float.

Prompt: What’s the main theme of this carnival scene?

Answer: The main theme of this carnival scene is the presence of costumed performers, likely dancers or performers, who are part of a parade or a festive event. The image features several people in elaborate costumes, which suggests that the carnival is focused on showcasing artistic and creative performances. The costumes and the overall atmosphere of the scene evoke a sense of celebration and entertainment, making it a popular and enjoyable event for people to attend.

Transcribing text (OCR)

Caption: The image presents a collage of text elements arranged in a grid-like pattern. The top row features the phrase “Individuals and interactions over processes and tools”, followed by “Working software over comprehensive documentation over processes and tools”. The middle row includes “Customer collaboration over contract negotiation”, and the bottom row reads “Responding to change over responding to a plan”. The text elements are arranged in a visually appealing manner, with the text sizes varying to create a sense of depth and dimension. The background of the image is a blurred, out-of-focus image of a stone or rock surface, providing a neutral backdrop that allows the text to stand out.

Prompt: Can you transcribe the text in the image?

Answer: Individuals and interactions over processes and tools

Working software over comprehensive documentation Customer collaboration over contract negotiation Responding to change over responding to a plan

Caption: The image presents a handwritten text on a white background. The text is written in black ink and appears to be a mix of letters, numbers, and symbols. The text is not perfectly aligned, with some words appearing slightly off-center. The text is a mix of handwritten and typed elements, with some words appearing more handwritten than others. The text is not in a standard font, but rather a casual, handwritten style. The text is not in a clear, organized structure, and the content is not easily decipherable.

Prompt: Can you transcribe the text in the image?

Answer: In the long run, computing science will progress to transcend its parent disciplines, mathematics, and logic, by effectively realizing a significant part of Leibniz’s Dream.

Part of Leibniz’s Dream is to provide a symbolical calculation as an alternative to human reasoning.

(Please note the difference between “mimicking” and “providing an alternative to”)

Caption: The image depicts a cartoon drawing of a person with dark hair and a light-colored shirt, standing with arms outstretched. The person is wearing a light-colored vest. The drawing is done in a purple ink style. The background is a light color, and the person is surrounded by a scattering of small, dark dots. The text “America the beautiful is beautiful today!” is written in purple ink above the person. The drawing is dated “11-5-08” in the bottom left corner.

Prompt: Can you transcribe the text in the image?

Answer: AMERICA THE BEAUTIFUL IS TODAY! 11-5-08

Caption: The image shows a green highway sign with white text, indicating “Exit 20A”. The sign is mounted on a gray pole and is slightly tilted to the right. Below the main sign, there is a smaller orange sign with white text that reads “Old Exit 12 A”. The background features a clear blue sky and trees with a mix of green and brown/yellow foliage, suggesting it might be autumn.

Prompt: Can you transcribe the text in the image?

Answer: EXIT 20 OLD EXIT 12 A

Conclusions

While it’s a bit unfair to expect such a small model to know that Tangela is a Pokémon and that the man in the picture is Kurt Gödel, I did expect it to answer that there was one open parking spot (not three) in the picture. The captioning is pretty good, but the prompting not so much.

The OCR was pretty decent. The only disappointment is that in the Dijkstra handwriting example (the text about computing science) it didn’t transcribe the last sentence, despite it being fully visible.
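
A check like “did it drop a sentence?” can be automated when the expected text is known. The sketch below compares a transcription with a reference, word by word, using `difflib` from the standard library. The function names are hypothetical, and the reference text has to be supplied by hand; nothing here comes from the moondream package.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r'[a-z0-9]+', text.lower())


def similarity(reference, transcription):
    """Word-level similarity from 0.0 to 1.0 (1.0 means same word sequence)."""
    return SequenceMatcher(None, words(reference), words(transcription)).ratio()


def missing_words(reference, transcription):
    """Words of the reference that were dropped or replaced in the transcription."""
    ref = words(reference)
    matcher = SequenceMatcher(None, ref, words(transcription))
    missing = []
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ('delete', 'replace'):
            missing.extend(ref[i1:i2])
    return missing
```

Case and punctuation are ignored on purpose, so an all-caps answer is not penalized; a long run in `missing_words` points to a skipped sentence.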

Despite the negative points above, this thing is kind of amazing if you keep your expectations in check (it’s a small model, after all): great captioning, decent OCR, not too slow, doesn’t require a GPU, freely available, and everything runs locally.
