Scene understanding: where are my apartment keys?
🔑 Gemini multimodal capabilities and a visual "needle in the haystack" test
Where did I put my apartment keys? 🤔 As AI models get better and better at working with visual input, it may be time they helped us locate misplaced household items.
Test video
To see if a multimodal LLM can help me find “misplaced” apartment keys, I recorded a test video of my living room. Somewhere in the video, the keys appear. They are clearly visible, but in a somewhat surprising location.
Gemini(s)
We’re going to use Gemini 1.5 Pro. It’s a multimodal model, meaning that it’s been trained to process both text and visual input.
To experiment with the model, we’re using Google AI Studio. It’s a developer interface that lets us try different prompts in a chat-like UI. Importantly, it also lets us upload a video as input to the model 💫.
Scene understanding
After uploading the video to Google Drive, we can include it in the prompt alongside a text description of the request. Let’s start with a general scene understanding question: Can you describe this video?
The model gets it right! I like how it covered both the content of the video and the camera perspective (the video pans from left to right).
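As a side note for readers who would rather script this than click through the UI, here is a minimal sketch of the same experiment with the google-generativeai Python SDK. The file name, the API-key environment variable, and the exact model identifier are my own assumptions, not something prescribed by AI Studio.

```python
import os
import time

import google.generativeai as genai

# Configure the client with an API key created in Google AI Studio.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload the test video (placeholder file name) via the File API.
video = genai.upload_file(path="living_room.mp4")

# Wait until the uploaded video has finished processing.
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

# Ask the general scene-understanding question.
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([video, "Can you describe this video?"])
print(response.text)
```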
How many towels?
Let’s try some follow-up questions. Are there any plants? Can you estimate how many towels there are? (Classic “estimate the number of X in Y” interview question…)
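Follow-up questions map naturally onto a chat session, which keeps the video in the conversation history. A sketch continuing the snippet above (the `model` and `video` objects are the ones created there, and the exact prompt wording is just illustrative):

```python
# Start a chat session so follow-up questions share context with the video.
chat = model.start_chat(history=[])

# The first message includes the video; later ones are plain text.
chat.send_message([video, "Can you describe this video?"])
print(chat.send_message("Are there any plants?").text)
print(chat.send_message("Can you estimate how many towels there are?").text)
```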
So far so good! Let’s move to the main challenge…
The needle in the haystack
Time for the real test: I ask the model where my apartment keys are. Right on, there they are!
Before this, I had first tried a wider-angle video in which the keys appeared smaller, and the model did not find them.
Bottom line: this use case seems to already work… in controlled, favorable conditions :). Nevertheless, it’s impressive to see how much scene understanding these large models can already do. That’s even more impressive when we remember they run on an architecture originally designed for machine translation.
Looking forward to seeing visual-input models get better and become more widely deployed over time!
More on this
🎞️ Video is not actually one of the model’s native modalities: still images are. As explained in the Google Developers blog post, AI Studio extracts still frames from the video and passes them to the underlying model as a sequence of images.
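For intuition, here is a rough sketch of what such a frame-extraction step can look like, using OpenCV. The roughly one-frame-per-second sampling rate and the helper name are my own illustrative choices, not a description of AI Studio’s internals.

```python
import cv2  # OpenCV

def extract_frames(video_path: str, frames_per_second: float = 1.0):
    """Sample still frames from a video at roughly the given rate."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / frames_per_second)), 1)

    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # each frame is a NumPy array (BGR image)
        index += 1

    capture.release()
    return frames

# e.g. a 30-second clip yields roughly 30 still images at 1 frame per second.
stills = extract_frames("living_room.mp4")
```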
💫 The Project Astra demo featured a similar use case: Do you remember where I put my glasses?
📝 The killer app of Gemini Pro 1.5 is video: Simon Willison was the first to blog about these video capabilities. In his experiment he takes a video of a bookshelf and asks Gemini to identify the books.
In other news
📈 Evaluation scores of leading LLMs over the last 12 months, as an animated timeline
🫣 Elon Musk and Yann LeCun have a fight on Twitter.
One observer tallies the score (Elon Musk: 1; Yann LeCun: 1; Humanity: 0)
Postcard from Paris
Dramatic sky over Pont National on the Southeastern edge of Paris. The end of May is unusually cold and rainy, so every bit of blue sky is much appreciated :).
Keep looking up 💫,
– Przemek