356 - Visual Gook

The clue is in the name: large language models. They are trained on words, and they don't have a visual language.

I haven't practised much with prompts for images because I'm busy job hunting. But I fell down a Flow rabbit hole the other day. Flow seems to be a video generator from (or within?) Google; I haven't got all the software names and relationships figured out yet. I gave it written instructions and it interpreted them. I ran out of time, but it got me thinking.

I've noticed that Gemini struggles with simple visual concepts like 'rotate' or 'upside down'. For example, I wanted Gemini to cut out part of an image and rotate just the cut-out part, but it didn't get it. When I tried again sometime later, it managed to rotate the whole image, which was a step forward.

It's hard to brainstorm or experiment visually with AI (give me suggestions for AI models to try if I'm wrong), because it has to interpret the words and then translate them into a visual. I'm assuming that the volume of text descriptions of visual art (including text or audio descriptions produced for blind or sight-impaired people) is much smaller than the volume of images an LLM's training draws on.

I'm interested to know what is on offer for blind and partially sighted people. Making a quick search, I found a guide from the National Gallery in the US on how to write descriptions of online artworks.

Artists create their own visual language. Having written that, I'm not entirely sure what it means. But I have half an idea that somewhere around this there's a gulf between what an artist can do and what an AI can do.

I suppose everything is gobbledygook to an AI. It just outputs information based on the probability of it making sense.

Deserves more thought.

Check out the research project Beyond The Visual at UAL.
