Scene understanding: where are my apartment keys?
🔑 Gemini multimodal capabilities and a visual "needle in the haystack" test
Where did I put my apartment keys? 🤔 As AI models get better and better at working with visual input, it may be time they helped us locate misplaced household items.
Test video
To see if a multimodal LLM can help me find “misplaced” apartment keys, I recorded a test video of my living room. Somewhere in the video, the keys appear. They are clearly visible, but in a somewhat surprising location.
Gemini(s)
We’re going to use Gemini 1.5 Pro. It’s a multimodal model, meaning that it’s been trained to process both text and visual input.
To experiment with the model, we’re using Google AI Studio. It’s a developer interface that lets us try different prompts in a chat-like UI. Importantly, it also lets us upload a video as input to the model 💫.
Scene understanding
After uploading the video to Google Drive, we can include it in the prompt alongside a text description of the request. Let’s start with a general scene understanding question: Can you describe this video?
The model gets it right! I like how it covered both the content of the video and the camera perspective (the video pans from left to right).
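As a side note for readers who would rather script this than click through the UI, here is a minimal sketch of the same experiment with the google-generativeai Python SDK. The file name, the API-key environment variable, and the exact model identifier are my own assumptions, not something prescribed by AI Studio.

```python
import os
import time

import google.generativeai as genai

# Configure the client with an API key created in Google AI Studio.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload the test video (placeholder file name) via the File API.
video = genai.upload_file(path="living_room.mp4")

# Wait until the uploaded video has finished processing.
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

# Ask the general scene-understanding question.
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([video, "Can you describe this video?"])
print(response.text)
```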
How many towels?
Let’s try some follow-up questions. Are there any plants? Can you estimate how many towels there are? (Classic “estimate the number of X in Y” interview question…)
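Follow-up questions map naturally onto a chat session, which keeps the video in the conversation history. A sketch continuing the snippet above (the `model` and `video` objects are the ones created there, and the exact prompt wording is just illustrative):

```python
# Start a chat session so follow-up questions share context with the video.
chat = model.start_chat(history=[])

# The first message includes the video; later ones are plain text.
chat.send_message([video, "Can you describe this video?"])
print(chat.send_message("Are there any plants?").text)
print(chat.send_message("Can you estimate how many towels there are?").text)
```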
So far so good! Let’s move to the main challenge…
The needle in the haystack
Time for the real test: I ask the model where my apartment keys are. Right on, there they are!
Before this, I had first tried a wider-angle video in which the keys appeared smaller, and the model did not find them.
Bottom line: this use case seems to already work… in controlled, favorable conditions :). Nevertheless, it’s impressive to see how much scene understanding these large models can already do. That’s even more impressive when we remember they run on an architecture originally designed for machine translation.
Looking forward to seeing visual-input models get better and become more widely deployed over time!
More on this
🎞️ Video is not actually one of the model’s native modalities: still images are. As explained in the Google Developers blog post, AI Studio extracts still frames from the video and passes them to the underlying model as a sequence of images.
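For intuition, here is a rough sketch of what such a frame-extraction step can look like, using OpenCV. The roughly one-frame-per-second sampling rate and the helper name are my own illustrative choices, not a description of AI Studio’s internals.

```python
import cv2  # OpenCV

def extract_frames(video_path: str, frames_per_second: float = 1.0):
    """Sample still frames from a video at roughly the given rate."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / frames_per_second)), 1)

    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # each frame is a NumPy array (BGR image)
        index += 1

    capture.release()
    return frames

# e.g. a 30-second clip yields roughly 30 still images at 1 frame per second.
stills = extract_frames("living_room.mp4")
```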
💫 The Project Astra demo featured a similar use case: Do you remember where I put my glasses?
📝 The killer app of Gemini Pro 1.5 is video: Simon Willison was the first to blog about these video capabilities. In his experiment he takes a video of a bookshelf and asks Gemini to identify the books.
In other news
📈 Evaluation scores of leading LLMs over the last 12 months, as an animated timeline
🫣 Elon Musk and Yann LeCun have a fight on Twitter.
One observer tallies the score (Elon Musk: 1; Yann LeCun: 1; Humanity: 0)
Postcard from Paris
Dramatic sky over Pont National on the Southeastern edge of Paris. The end of May is unusually cold and rainy, so every bit of blue sky is much appreciated :).
Keep looking up 💫,
– Przemek