Chapter 62: Multimodal Convergence

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Alec Radford et al. (OpenAI CLIP team)	—	Established natural language as a supervisor for visual concepts.
Jean-Baptiste Alayrac et al. (Flamingo team)	—	Built a visual language model handling arbitrarily interleaved images, videos, and text with free-form text outputs.
OpenAI GPT-4 / GPT-4V teams	—	Shipped GPT-4 as a multimodal model with image and text inputs, then deployed vision through GPT-4V with documented risk surfaces.
Google Gemini team	—	Framed Gemini as trained jointly across image, audio, video, and text with interleaved multimodal inputs.
OpenAI GPT-4o team	—	Delivered the omni interface: text/audio/image/video inputs, text/audio/image outputs, end-to-end training, sub-second audio latency.
OpenAI Sora team / Google Veo team	—	Pushed text-conditional video generation to minute-scale clips while documenting simulator limits, safety filters, and provenance.

Timeline (2021–2024)

timeline
    title Multimodal convergence, 2021-2024
    2021-03 : CLIP paper (400M image-text pairs, language-supervised vision)
    2022-04 : Flamingo (interleaved image/video/text, free-form text output)
    2023-03 : GPT-4 Technical Report (image+text inputs, text outputs)
    2023-09 : GPT-4V System Card (vision deployed, risk surfaces documented)
    2023-12 : Gemini report (jointly trained across text/image/audio/video)
    2024-02 : Sora "world simulators" page (text-conditional video, up to 1 min)
    2024-05-13 : GPT-4o announced (real-time audio/vision/text, ~320 ms avg latency)
    2024-05-14 : Veo announced (1080p video longer than a minute)
    2024-08 : GPT-4o System Card (omni I/O, end-to-end, voice risk surfaces)
    2024-12 : Sora System Card (deployment, red teaming, multimodal moderation)

Plain-words glossary

Multimodal model. A system that accepts or produces more than one kind of media — for example, text and images, or text plus audio plus video — instead of being limited to a single modality.
Contrastive image-text pretraining. CLIP’s training setup: separate image and text encoders are trained so that matching pairs (a caption and its picture) land near each other in a shared representation space, while mismatched pairs are pushed apart.
Interleaved input. A prompt that mixes media in order — for example, a screenshot, then a question, then another image — rather than one isolated picture or one block of text. Flamingo and Gemini are described in their reports as supporting interleaved inputs.
Native multimodality. Used in the Gemini report to mean joint training across image, audio, video, and text in one model family, rather than a text model stitched together with separate specialist systems. Treat as a source-bound term, not a generic adjective.
Spacetime patches. Sora’s representation unit for pieces of video and image latent codes that carry both spatial and temporal information.
Three-model voice pipeline. The earlier ChatGPT Voice Mode setup that routed speech through separate transcription, language-model, and audio-generation steps.
Image-borne jailbreak. A prompt-injection attempt that hides instructions inside an image — text written on a sign, a screenshot of a UI, an embedded note — so the model has to decide whether visible text is content to describe, evidence to use, or a command to follow.

By the mid-2020s, “language model” had become both useful shorthand and an incomplete description. Language was still the control surface. Users typed or spoke requests, and systems often answered in words. But the frontier product was no longer only a text machine. It could inspect images, interpret screenshots, listen to audio, respond with speech, reason over mixed media, and generate video. The chat window had begun to absorb the senses.

This was not just an interface upgrade. It was a category break. A text-only model predicts and generates language. A multimodal system has to map images, audio, video, and text into a shared computational space where they can condition one another. A prompt can become a bundle: a question, a diagram, a screenshot, a photograph, an audio stream, a video clip, and a desired output. The model is asked not only to answer, but to connect media.

The transition from Chapter 61 matters here. Training scale made larger systems possible, but size alone did not explain the shift. The key change was what the model’s world could contain. Once the input is not just text, the model has to confront layout, pixels, sound, timing, visual ambiguity, speaker identity, image-borne instructions, motion, and media-specific safety rules. Multimodality made the assistant feel more natural. It also made the system harder to evaluate and govern.

The convergence was uneven. Vision, speech, and video did not become easy at the same time or for the same reasons. Image-text alignment benefited from enormous web-scale pairings of pictures and captions. Speech required preserving time, tone, and turn-taking instead of flattening everything into transcripts. Video had to carry visual coherence across frames. The historical pattern is therefore not one universal trick. It is a set of modality-specific bridges being pulled into the same product surface.

That unevenness is part of the story. A product could feel unified to the user while remaining internally plural: encoders, decoders, safety filters, latency constraints, and evaluation methods all had to be adapted to the medium in front of them.

The research bridge began before the product race.

CLIP showed one powerful path: use natural language itself to supervise vision. Radford and collaborators trained image and text encoders on 400 million image-text pairs, learning a shared space where captions and images could be compared. The point was not that CLIP became a chat assistant. It was that language could point at visual concepts at web scale. A phrase such as “a photo of a dog” could act as a handle on visual categories without requiring a manually labeled classifier for every task.

That idea changed the relationship between words and images. Traditional supervised vision often depended on fixed labels: cat, dog, car, tree. Natural language is more flexible. It can describe style, relation, texture, context, and intent. CLIP’s zero-shot transfer mattered because the model could use language to reference concepts it had learned through image-text pairing. Language became an interface for vision.

The architecture was also a sign of the future. CLIP did not merely attach a captioning model to an image model. It trained separate encoders for image and text so that matching image-text pairs landed near each other in representation space. That contrastive setup made the paired web data useful without reducing every image to a single fixed class label. The image could be compared with many possible text descriptions, and the text could become a query into visual concepts.

This is why CLIP belongs in a multimodal convergence chapter even though it was not a general assistant. It helped make language into a visual control surface. Later product systems would feel more conversational and more integrated, but the enabling idea was already visible: words and pixels could be aligned deeply enough that natural-language prompts could steer visual understanding.

Flamingo moved the bridge closer to assistant behavior. Alayrac and collaborators described a visual language model that handled arbitrarily interleaved visual and textual data, including images and videos, and produced free-form text. The interface pattern is the important part: a user or evaluator could provide a sequence where words and images appeared together, and the model would condition on that mixed context.

This is different from merely attaching an image classifier to a chatbot. A classifier maps an image to a label. A multimodal assistant must interpret an image in the context of a question, a prior image, a caption, or an example. Flamingo’s few-shot visual-language framing suggested that images and text could participate in the same prompting style that had become familiar in language models. Show examples. Ask a question. Let the model answer in free-form text.

Flamingo also matters because it included videos in the input story. A frame is not the same as a sequence. Even limited video handling brings ordering, motion, and temporal context into the prompt. The model has to treat visual information not as one isolated picture, but as media that can be interleaved with words. That was an early sign that “multimodal” would not stop at image captioning.

The research arc was therefore not “AI gets eyes” in one step. It was a sequence of alignments. Images and language had to share representation. Visual inputs had to be interleaved with text. The model had to treat media as context, not as a separate preprocessing step whose output was merely a label. Multimodal convergence began when the prompt itself stopped being a string.

GPT-4 brought that shift into the assistant frame. The GPT-4 Technical Report described GPT-4 as a large multimodal model that accepted image and text inputs and produced text outputs. Section 4.1 described prompts that could contain arbitrarily interlaced text and images, including documents, diagrams, screenshots, and photographs. The output remained text, but the input had changed.

That distinction is central. GPT-4 did not become an image generator in this framing. It became a text-generating assistant that could read visual inputs. A user could ask about a diagram, a handwritten note, a screenshot, a chart, or a picture. The model’s answer was still language, but the evidence was visual. The chat window expanded from typed questions to mixed media prompts.

GPT-4V made the deployment stakes explicit. OpenAI’s GPT-4V system card described the ability to instruct GPT-4 to analyze image inputs. It also described the new risk surfaces. The model could hallucinate about images. It could make visual errors. It could face image-borne jailbreak attempts. It could be asked about medical imagery, scientific diagrams, or people in photographs. Vision did not simply add capability. It added new ways to be wrong.

Documents and screenshots made the change especially practical. A user could show a receipt, a chart, a worksheet, a web page, a broken interface, or a diagram and ask what was happening. This shifted the assistant from text generation toward visual inspection. Many real tasks arrive visually: an error dialog, a graph, a form, a map, a table, a photo of a device. A text-only model forces the user to translate all of that into words. A visual model lets the user point.

Pointing is powerful because it reduces friction. Instead of describing a layout, upload the screenshot. Instead of transcribing a table, show the table. Instead of explaining a diagram, ask about the diagram. But pointing also transfers interpretive burden to the model. If the model misreads a label, confuses foreground and background, misses a small warning, or invents a relationship not present in the image, the user may not know which part failed.

Be My AI, the Be My Eyes beta discussed in the GPT-4V system card, made the usefulness tangible. For blind and low-vision users, image description could be valuable: describing scenes, reading visual details, or helping interpret surroundings. But the same context made error more serious. A hallucinated or mistaken visual description is not just an amusing failure when a user may rely on it to navigate the world. The system card’s discussion of hallucination and errors belongs in the center of the story, not the footnotes.

Medical and scientific images made the boundary sharper. The GPT-4V system card documented limitations and warned against treating the model as fit for medical function or professional medical advice. That is a crucial guardrail for multimodality. Image understanding feels authoritative when the answer is fluent. But a model can describe an image confidently while missing the visual feature that matters. In high-stakes domains, visual fluency is not enough.

Person-related visual tasks created another sensitive boundary. A model that can see images will be asked who someone is, what they are feeling, whether they belong to a category, or what can be inferred about them. Adding vision pulled identity, privacy, accessibility, medicine, and safety into the assistant surface. The same general chat interface now touches domains that earlier text-only models could avoid more easily.

Image-borne jailbreaks created another new surface. A text-only prompt injection arrives as text. A vision system can receive instructions embedded in an image, screenshot, sign, document, or interface. The model has to decide whether visible text is content to describe, evidence to use, or an instruction to follow. The old boundary between data and command becomes harder when the command can be inside the picture.

This also changed evaluation. A text answer can be checked against the text prompt. A visual answer has to be checked against the image, the question, and the model’s interpretation of both. The failure may be in object recognition, spatial relation, text reading, domain knowledge, or refusal behavior. When the same fluent paragraph hides all of those steps, a reviewer has to ask a harder question: did the model see the right thing before it reasoned about it?

That is why visual assistants made ordinary user tasks feel so impressive and so fragile at the same time. They could turn a screenshot into guidance, a chart into explanation, or a document image into a summary. But the evidence was no longer contained in the words of the prompt. It was partly in pixels. The assistant’s answer could be persuasive even when the visual grounding was wrong. Multimodality expanded usefulness by letting people point; it expanded risk by making the pointing itself something the model had to interpret.

Gemini then made “native multimodality” a product and architecture claim. Google’s Gemini report said the Gemini family exhibited capabilities across image, audio, video, and text, and that the models were trained jointly across those modalities. The architecture section described support for interleaved text, image, audio, and video inputs, with text and image outputs in the report’s framing.

“Natively multimodal” is best read narrowly here: the Gemini report framed the model as jointly trained across modalities rather than as a text model wrapped by separate specialist systems. It should not become a generic marketing adjective applied to every product. The historical point is that labs began competing not only on model size or chat quality, but on the category of model they claimed to be building.

Joint training mattered because the user’s task often crosses modalities. A user may show a chart and ask for the trend, provide a screenshot and ask where to click, upload an audio clip and ask what is happening, or combine an image with a written instruction. A system trained and presented as multimodal suggests that the modalities are part of one assistant interface, not separate apps stitched together after the fact.

The Gemini report’s interleaved-input framing is the important operational detail. Real-world tasks rarely arrive in clean single-modality packages. A student might ask about a photographed equation and a written hint. A developer might pair a screenshot with a console log. A user might combine a spoken question with an image. Interleaving lets the prompt become a small scene rather than a single text string. That is why multimodality was a product claim as much as a research claim.

The output side still needs precision. In the cited Gemini report, the source-supported outputs are text and image. The report should not be inflated into a claim that Gemini 1.0 generated every modality. Multimodal input and multimodal output are related but not identical. A system can read video without generating video. It can hear audio without speaking.

GPT-4o made the interface shift visceral through speech. OpenAI’s GPT-4o system card described a model that accepts any combination of text, audio, image, and video inputs and generates text, audio, and image outputs. The same document said it was trained end-to-end across text, vision, and audio, and could respond to audio with minimum latency of 232 milliseconds and an average of 320 milliseconds. The exact numbers matter because audio is temporal. A slow voice assistant feels like turn-taking through a wall.

OpenAI’s “Hello GPT-4o” page contrasted this with the prior ChatGPT Voice Mode pipeline, which used three separate models: one to transcribe audio to text, GPT-3.5 or GPT-4 to process the text, and another to convert text back into audio. That pipeline lost information such as tone, multiple speakers, background noise, laughter, singing, or emotion. GPT-4o’s promise was that those features no longer had to be discarded before reasoning.

Speech collapses the interface because it changes how a human experiences the model. Typing tolerates delay. Conversation does not. In speech, latency, interruption, rhythm, prosody, and emotional tone are part of the interaction. A voice assistant that can hear and answer quickly feels less like a form and more like a participant. That does not make it human. It makes the product boundary more intimate.

The three-model pipeline contrast explains why end-to-end training mattered. If speech is first flattened into text, then important acoustic information can vanish before the reasoning model sees it. Tone, hesitation, overlapping voices, laughter, singing, and background sound may carry meaning. A text transcript is useful, but it is a lossy interface. GPT-4o’s product story was that audio, vision, and text could be handled by one model rather than routed through a brittle relay.

The latency numbers also reveal a systems constraint. Real-time voice is not just about model intelligence. It is about response time. A model that answers well after several seconds may be acceptable for a written question and awkward in spoken dialogue. Low-latency audio turns inference into interaction. The user can interrupt, correct, laugh, hesitate, or change direction, and the system has to keep up. Multimodality therefore pushes on serving infrastructure even when this chapter leaves detailed economics to Ch63.

The safety boundary becomes more intimate too. Voice introduces risks around unauthorized voice generation, speaker identification, emotional interpretation, and unequal performance across accents, languages, and audio conditions. An error in text may be reread. An error in speech is heard in real time, often with social cues that make it feel more confident. Multimodal systems inherit the old hallucination problem and add modality-specific harms.

Voice also changes trust. A spoken answer can sound warm, certain, hesitant, amused, or concerned. Those cues affect how people interpret the system. The model may not “feel” anything, but the audio output can still produce social meaning. This is why voice safety cannot be reduced to transcript safety. The same words delivered with different timing, intonation, or speaker similarity can create different risks.

That social layer matters because audio makes the system harder to treat as a passive tool. A written answer sits on a screen. A spoken answer arrives in time, with pace and inflection. The model can interrupt less or more; it can pause; it can sound confident before the user has time to inspect the basis for that confidence. Real-time speech therefore made latency a capability, but it also made interaction design part of the safety problem. The assistant’s behavior was no longer only what it said, but how and when it said it.

The output side needs the same precision as Gemini’s: GPT-4o’s system card supports text, audio, and image outputs in the cited framing, not video output. That distinction matters because video belongs to a different technical and safety frontier.

Sora moved the public imagination there. OpenAI described Sora as a text-conditional diffusion model trained on videos and images, using spacetime patches of video and image latent codes. The technical move resembles tokenization for visual time: convert variable-duration, variable-resolution, variable-aspect-ratio media into patches the model can process. Sora was presented as capable of generating up to one minute of high-fidelity video.

The patch idea is important because video does not fit neatly into the same frame as text or a single image. Text can be tokenized into a sequence. Images can be represented through spatial patches or latents. Video needs space and time together. A “spacetime” patch carries a piece of visual sequence, not just a static crop. That lets a model work across media of different lengths and shapes, but it also makes the task harder: the representation has to carry motion and continuity.

Sora’s use of diffusion also connects to the earlier image-generation chapter without repeating it. Diffusion had already made image synthesis powerful by learning to reverse noise. Video generation extends the burden: the system must generate plausible denoising trajectories across time. Image diffusion’s history sits in Chapter 58; what matters here is that the same family of generative ideas moved into temporal media and exposed new failure modes.

Video is hard because it adds time. A single generated image can be plausible while hiding inconsistencies outside the frame. A video has to preserve identity, motion, geometry, object state, and temporal continuity across many frames. A cup should not morph. A hand should not forget its fingers. A bitten cookie should remain bitten. A character’s clothes should not randomly drift unless the scene calls for it. The model is no longer only arranging pixels. It is producing a sequence where the world is expected to persist.

OpenAI’s “world simulators” language was intentionally ambitious, but the same page listed limitations. Sora could struggle with physics, cause and effect, and object-state changes. Those caveats must travel with the claim. A plausible video is not proof of robust physical understanding. It is evidence that generative video models had become dramatically more capable, while still failing in ways that expose their lack of stable world modeling.

The Sora system card shifted the story from demo to deployment. It discussed visual patches, data categories, pretraining filtering, red teaming, evaluations, and multimodal moderation. Video generation requires more than a model that can synthesize frames. It needs safety filters, provenance thinking, content moderation, misuse testing, and policies for people, violence, sexual content, deception, and other media risks. The more realistic the media, the higher the burden.

Video also stresses evaluation because failure can be local, global, or temporal. A single frame may look good while the sequence fails. A subject may be consistent for three seconds and then drift. A physically impossible interaction may pass unnoticed unless the evaluator watches carefully. A safety issue may emerge only when frames, audio, prompt, and context are considered together. Video makes the evaluation object longer and more ambiguous.

The same ambiguity affects product claims. A generated clip can be stunning in a demo and still fail under mundane editing pressure: maintaining a character across shots, following a precise camera direction, preserving written text, or obeying a chain of physical causes. This does not make the advance unreal. It means the advance lives in a space where visual appeal, controllability, safety, and reliability are separate dimensions. Multimodal convergence repeatedly forces that distinction. A model can impress in one dimension while remaining weak in another.

Google’s Veo announcement added a competing product frame. Google introduced Veo as a video-generation model capable of 1080p videos longer than a minute, with natural-language and visual semantics, private preview access, and safety measures such as tests, filters, guardrails, SynthID, and watermarking. Without an independent technical evaluation, Veo stands as a product claim; its historical role is to show that video generation quickly became a frontier competition, not a one-lab curiosity.

Sora and Veo also show why multimodal convergence cannot be only a capability story. The same systems that make creative production easier make evaluation harder. Text errors can be quoted. Image errors can be inspected. Video errors unfold over time. Safety systems must reason across frames, audio, visual context, prompts, generated content, and provenance. The category of “AI output” becomes media infrastructure.

This is where “LLM” breaks as a literal description. Language remains central because it is the command surface. People ask in words. Systems explain in words. Prompts bind tasks together. But the frontier system is no longer exhausted by language modeling. It reads images, hears audio, watches video, speaks back, and generates media. The assistant becomes a multimodal interface to a stack of representations and safety layers.

The category break also changes what later chapters must cover. Multimodal inputs and outputs raise inference cost and latency questions, which Chapter 63 owns. They complicate evaluation and benchmark politics, which Chapter 66 owns. They intensify data-rights and copyright fights, which Chapter 68 owns. They increase pressure on chips, energy, and datacenters, which later chapters cover. Ch62’s narrower claim is that the interface changed first: text stopped being the whole world.

This is also why multimodality changed product expectations. Users no longer wanted separate systems for every medium. They expected one assistant to read the chart, explain the screenshot, hear the question, speak back, and eventually generate or edit media. That expectation may outrun reliability, but it became the product direction. The assistant became a front end for mixed media work.

The old separation between “computer vision,” “speech recognition,” “natural language processing,” and “graphics” did not disappear in the underlying research. But to the user, those boundaries became less visible. The product presented one surface. The model family or system stack handled the translation. Multimodal convergence is the name for that collapse of visible boundaries.

That collapse made the phrase “assistant” more literal. The system was no longer only a machine that completed strings. It became a machine people could ask to inspect, listen, narrate, translate, describe, generate, and revise across media. The promise was not that every modality was equally mature. The promise was that the user would no longer have to switch tools whenever the medium changed.

By the end of this arc, AI products no longer asked users to translate everything into a typed prompt before the machine could help. A user could show, speak, point, upload, ask, and watch. That made the systems feel more capable because they met more of the user’s world directly. It also made them harder to trust because every modality brought its own failure modes.

Multimodal convergence did not make AI understand the world the way humans do. It made the model’s inputs and outputs look more like the world humans inhabit: visual, auditory, temporal, messy, and mixed. That was enough to break the old category.