What Does Multimodality Truly Mean For AI?

July 2, 2025
Vishakha Gupta

In the words of Jeff Dean, “Multimodal models are the next frontier in AI. They allow us to build systems that understand the world more like humans do.” And Satya Nadella recently emphasized that “the future of AI is not just large, it’s richly multimodal.” The investment reflects this shift. According to PitchBook, over $2.5 billion has flowed into multimodal AI startups in the past 18 months alone. From OpenAI’s GPT-4V to Meta’s ImageBind and Google’s Gemini, the race is on to build systems that can reason across modalities, not just embed them. The push naturally stems from the desire to make AI seem more human, and humans typically reason across modalities to glean maximum information before making any decision.

Why It Matters: The New Frontier Of AI

Even if we ignore the hype and the words of the companies building these AI solutions, as we all get immersed in the world of AI agents and ask them to live our digital lives while we sip coffee or margaritas, a question I have to ask is: are these agents really like us? After all, our brains process many modalities of information, like audio, scenes, text, and prior knowledge, to help us decide what we decide. For any AI to represent us, it has to be as good as or better than us at that kind of cross-modal reasoning. No wonder multimodality is having a moment, and not just as a buzzword. It’s reshaping how we build, interact with, and extract value from AI systems. From enterprise search to agentic workflows, the ability to reason across text, images, video, audio, and structured data is no longer a futuristic ideal: it’s the new baseline.

Multimodality In Classic And Generative AI World

Back when I first started working on deep learning applications for computer vision data, it was impressive that you could train a model to understand images at all, but the answers could sometimes be very funky; AlexNet, for example, could easily classify a person’s brain scan as a telephone. Around the time we started user conversations for ApertureData, we had reached a point where even off-the-shelf models could detect complex things like what people were doing in video datasets. That was made possible by rapid improvements in compute and memory. The same progression meant that just last year, all we could talk about was how to use LLMs to answer questions over vast knowledge bases, and then improve them with various techniques, like the sixteen flavors of RAG, to generate relevant answers for us. We now have models generating images and videos for us. That ability to work with various data types, ranging from a few rows and columns to videos, is multimodal AI, whether classic or generative.

Multimodality In The Agentic World

In an agentic world, where autonomous systems plan, reason, and act, multimodality is not optional: it’s foundational. Agents must perceive the world through multiple lenses, reason across them, and act accordingly.

Without multimodal infrastructure, agents are blind to nuance. They can’t correlate a chart with its source, a video with its transcript, or a document with its embedded tables. They become brittle, hallucination-prone, and context-starved.

The Three Pillars Of Multimodal AI Development

Multimodal Models

Multimodal models are AI systems designed to process and integrate multiple types of data such as text, images, audio, and video, all within a unified architecture. This mimics human cognition, where we naturally combine sensory inputs to understand the world. Over the past decade, multimodal models have evolved from early fusion techniques to sophisticated architectures capable of integrating text, images, audio, video, and more within unified frameworks. Pioneering models like CLIP, DALL-E, and Flamingo demonstrated the power of vision/language alignment and few-shot learning, while recent advances such as ImageBind and Kosmos-2 enable any-to-any modality reasoning. Architectures have diversified into deep and early fusion strategies, with tokenized early-fusion gaining traction for its scalability. These models are transforming industries from healthcare diagnostics to personalized education and creative media, while also raising challenges around data alignment, bias, compute efficiency, and ethical dataset use. Looking ahead, the focus is shifting toward tool-augmented agents, multimodal generation, and real-time, efficient deployment.

TL;DR: models are now very capable of handling multimodal data in production, as a step toward reasoning across modalities.
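
To make this concrete, here is a minimal sketch of modality alignment in the CLIP family, assuming the Hugging Face transformers and Pillow packages are installed; the checkpoint is the public openai/clip-vit-base-patch32, and the image path and text prompts are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint that embeds images and text into a shared space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product_photo.jpg")  # placeholder path
candidate_labels = ["a red running shoe", "a leather office chair", "a coffee mug"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# a zero-shot classification over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```

The same shared embedding space is what lets text queries retrieve images (and vice versa) without task-specific training.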

Multimodal Processing Capabilities

Over the past decade, the evolution of CPUs, GPUs, and AI accelerators has been central to enabling multimodal AI. CPUs, once the default for AI workloads, gave way to GPUs like NVIDIA’s H100 and AMD’s MI300, which offer massive parallelism ideal for training large models like CLIP and DALL·E. Specialized accelerators have since emerged to push performance and efficiency even further. Google’s TPUs and Intel’s Gaudi chips are now widely used in cloud-scale AI, while startups like Cerebras (with its wafer-scale engine), SambaNova (dataflow-based systems), Groq (tensor-streaming processors), and Tenstorrent (RISC-V-based AI chips) are redefining the hardware stack for training and inference. These innovations have enabled real-time multimodal processing across edge and cloud environments, powering applications from autonomous vehicles to generative AI agents.

TL;DR: processing is not the bottleneck, as long as you have money to throw at it or some innovative ways to fine-tune better for less.

Multimodal Data Management For AI

With models and processing this advanced, how many of these AI applications are actually deployed in production and answering our questions today? The growth trajectory is cautious, because today’s AI systems still struggle, or underperform, due to data access challenges that remain unresolved. These challenges only get bigger when AI is acting for you, not just chatting with you. This is where the data infrastructure behind AI becomes crucial.

A lot of companies have always had multimodal data. For example, e-commerce companies have long had product images in their DAMs (Digital Asset Management systems), audio transcripts from customer support calls, and text or video testimonials from user feedback. Similarly, healthcare and life sciences companies typically have patient scans, PDF research documents, and doctors’ notes, all connected to each other. Transportation and logistics sits on a pile of PDF invoices, route information, and, increasingly, images of transferred packages. This list of examples of multimodality is very long.

With the realization of its importance, we have naturally seen the rise of data tools for multimodal AI: vector databases with support for multimodal embeddings, traditional databases with support for embeddings, data warehouses addressing access to various modalities of data, multimodal AI databases, multimodal ETL and data processing, and so on. Yet AI outcomes have been stuck on text and tabular data for the last few years. Data tooling is one challenge, but the data problem is actually much broader, and it becomes very specific to the company under consideration: its policies, the particular mix of data involved, its access patterns, and how willing its teams are to stray past known tools and adopt more AI-ready solutions. For AI to truly reach human quality, the data foundations have to change.

TL;DR: while models and processing capabilities mean we can reach the moon, data troubles will keep us in a hole in the ground unless we understand the real challenges and start chipping away at them.

Why Do We Need To Worry About Multimodal Data Management?

We have spent almost a decade immersed in this world of multimodality, starting with image data, graduating to video, then audio and documents more recently, with a brief stint in domain-specific data types like DICOMs (used for 3-D medical imaging), point clouds, time series, and structured data. Along the way, from the early ML days to agentic AI, we have come across some incorrect assumptions about what it means to address multimodality, and witnessed plenty of unreasonable expectations that AI would provide relevant responses even when its input was flawed or incomplete. In the rest of this article, we look at what multimodal data means, some right ways to think about it, and some common misconceptions.

Is Our Brain Tabular? 

When AI agents or chatbots base their responses purely on SQL queries, it makes me wonder: is my brain tabular? Do I always imagine an Excel sheet in my head in order to decide anything? The answer is no! Occasionally that is the case, when we are doing accounting or studying trends, but typically tabular data is just one of the inputs needed for any given situation. If that’s so, how can querying purely relational data be enough? It often isn’t, but it is part of the whole, and it might also help us understand the rest of the data. For example, we might have employee information in a relational database, which might be enough to know salary distributions but not enough to know performance over the years, because the reviews were submitted as PDFs. One important thing to keep in mind, though: it is part of the whole, not the whole!
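
A tiny sketch of that split, using an in-memory SQLite table with hypothetical columns and paths: the relational part answers the salary question directly, while the performance question only yields a pointer to a document that still has to be fetched and understood.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, review_pdf TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ada", 185000, "s3://hr-bucket/reviews/ada_2024.pdf"),
     (2, "Grace", 172000, "s3://hr-bucket/reviews/grace_2024.pdf")],
)

# Salary distribution: purely relational, SQL answers it on its own.
print(conn.execute("SELECT AVG(salary), MIN(salary), MAX(salary) FROM employees").fetchone())

# Performance over the years: the table only stores a pointer to a PDF.
# Answering that question means retrieving and interpreting the document itself.
print(conn.execute("SELECT name, review_pdf FROM employees").fetchall())
```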

Vector Search ≠ Multimodal Database

Let’s be clear: a vector database with multimodal embeddings is not a multimodal database; it’s still a vector database. Even if you embed images, PDFs, and videos into a shared vector space, you are still operating on a flattened representation. You lose structure, provenance, and the ability to query across relationships. You could, in principle, adapt your embeddings to capture some of this information, but certain attributes will remain a better fit for a contextual representation. And then there is the actual data: embedding it does not remove the need to access and format it for display or summarization.
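
A minimal sketch of that flattening, using FAISS as a stand-in for any vector index and random vectors as placeholders for real image, PDF, and video embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 512  # e.g., the dimensionality of CLIP ViT-B/32 embeddings
index = faiss.IndexFlatL2(dim)

# Placeholders for embeddings of images, PDF pages, and video frames.
embeddings = np.random.rand(1000, dim).astype("float32")
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)

# All the index can return is row ids and distances. Which video a match came
# from, which frame, which bounding box, which paragraph of which PDF, and how
# those pieces relate to each other must be tracked and joined somewhere else.
print(ids, distances)
```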

A true multimodal database, like ApertureDB, is designed to store, index, and retrieve native multimodal data with context intact. It understands that a bounding box in an image, a caption in a video, and a paragraph in a PDF are not just embeddings, they’re anchors for reasoning.

Is It Enough To Stuff Vector Search In Existing Databases?

A lot of apps just need simple semantic search coupled with keyword search, which can be addressed by introducing new types of indexes. Naturally, databases that have always offered various types of indexes can introduce a few more. But is that really the syntax you want for adding embeddings (SQL syntax for graph queries was bad; wait till you see it for embeddings)? As is the case with relational databases in general, what if you want to test several different embeddings extracted from various models? What if you have unimodal embeddings per data type and also their corresponding multimodal versions? How many joins are you willing to do to get to the data? Even if it works, does it scale? Is it efficient? I think we all know the answer to that, given the vast amount of money that has gone into query optimization so far. There is, of course, the option of adding a vector index to a graph database, but we first need scalable graph databases, and those embeddings cannot come just from the graph nodes. How about key-value stores? KV stores are notoriously bad at traversing relationships. A document database like Mongo comes close but has other challenges when it comes to actually managing multimodal data and scaling.
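
To see where the joins pile up, here is a hedged sketch assuming Postgres with the pgvector extension and the psycopg driver; the table and column names are hypothetical. Every embedding variant (per modality, per model) tends to become its own table, and getting back to the actual asset, or comparing variants, means another join each time.

```python
import psycopg  # assumes a running Postgres instance with pgvector installed

SETUP = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE IF NOT EXISTS assets (id BIGSERIAL PRIMARY KEY, uri TEXT, kind TEXT)",
    "CREATE TABLE IF NOT EXISTS emb_clip_image (asset_id BIGINT REFERENCES assets(id), emb vector(512))",
    "CREATE TABLE IF NOT EXISTS emb_clip_text  (asset_id BIGINT REFERENCES assets(id), emb vector(512))",
]

# One similarity search already needs a join back to the asset table; comparing
# two embedding variants for the same asset needs two joins, and so on for every
# model or modality you want to test.
QUERY = """
SELECT a.uri,
       ci.emb <-> %(q)s::vector AS image_distance,
       ct.emb <-> %(q)s::vector AS text_distance
FROM assets a
JOIN emb_clip_image ci ON ci.asset_id = a.id
JOIN emb_clip_text  ct ON ct.asset_id = a.id
ORDER BY image_distance
LIMIT 10
"""

query_vector = "[" + ",".join(["0"] * 512) + "]"  # placeholder query embedding

with psycopg.connect("dbname=multimodal") as conn:  # hypothetical connection string
    for stmt in SETUP:
        conn.execute(stmt)
    rows = conn.execute(QUERY, {"q": query_vector}).fetchall()
    print(rows[:3])
```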

Can’t You Just “Layer” On Object Stores?

Object stores, data warehouses, and lakehouse architectures have been the traditional ways of dumping large or heavy data types like audio and video and looking them up through some tabular structure. That was sufficient in the streaming world, but once we started using machine learning to understand this data and generative AI to create new outcomes from it, simply accessing it once, or streaming it, stopped being enough. Why should the burden of slicing a video or audio file fall on a data engineer who doesn’t know whether the resulting clips will be accessed by scene, by actions within scenes, or as fixed-size clips? Is it enough to embed an entire audio file and then replay it in its entirety, even if the interesting portion is just a fraction of the whole? Do we need access to a 500-page document, or just the section where our questions are answered? If accessing or even preprocessing “portions” of complex data types is valuable, and by now we know it is, you need a first-class understanding of these objects in the foundational, “truly” multimodal data layer.
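
To make the pre-chunking burden concrete, here is a small sketch that cuts a video into fixed-length clips with ffmpeg (assumed to be installed); the file names and clip length are arbitrary choices the data engineer has to make up front, long before anyone knows how the clips will actually be queried.

```python
import subprocess

def cut_fixed_clips(src: str, clip_seconds: int, total_seconds: int) -> list[str]:
    """Naively pre-chunk a video into fixed-length clips, with no knowledge of
    whether downstream access will be by scene, by action, or by timestamp."""
    clips = []
    for i, start in enumerate(range(0, total_seconds, clip_seconds)):
        out = f"clip_{i:04d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(clip_seconds),
             "-i", src, "-c", "copy", out],
            check=True,
        )
        clips.append(out)
    return clips

# Example: 30-second clips from a 10-minute recording (paths are placeholders).
cut_fixed_clips("warehouse_walkthrough.mp4", clip_seconds=30, total_seconds=600)
```

A data layer with first-class video support would instead let the application ask for the interval it actually needs at query time.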

The Role Of Knowledge Graphs In Multimodal Understanding

Multimodal data is inherently fragmented—images, videos, text, and documents all live in different formats, with complex relationships. Knowledge graphs serve as the semantic layer that stitches this all together.

By representing entities, events, and relationships across modalities, knowledge graphs provide a structured backbone for reasoning. For example, they can connect a speaker in a video to their transcript, link a chart to its source dataset, or associate a bounding box in an image with a description in a document.

This structure is especially powerful when paired with agentic systems. It allows agents to traverse concepts across modalities—not just search, but reason. Think of it as turning raw multimodal data into a navigable, queryable web of meaning.
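
A toy sketch of such a graph using networkx; the node ids, attributes, and relation names are made up for illustration, but the traversal shows how an agent can hop from a speaker to the videos they appear in and on to the transcripts.

```python
import networkx as nx

G = nx.MultiDiGraph()

# Nodes for different modalities, each carrying a pointer to the underlying asset.
G.add_node("speaker:jane_doe", modality="entity")
G.add_node("video:demo_42", modality="video", uri="s3://bucket/demo_42.mp4")
G.add_node("transcript:demo_42", modality="text", uri="s3://bucket/demo_42.txt")

# Relationships across modalities.
G.add_edge("speaker:jane_doe", "video:demo_42", relation="appears_in", start_s=12.5, end_s=80.0)
G.add_edge("video:demo_42", "transcript:demo_42", relation="has_transcript")

# Traversal: which transcripts cover videos this speaker appears in?
for _, video, e in G.out_edges("speaker:jane_doe", data=True):
    if e["relation"] == "appears_in":
        for _, transcript, e2 in G.out_edges(video, data=True):
            if e2["relation"] == "has_transcript":
                print(video, "->", transcript)
```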

Navigating The Data: Beyond Search

Multimodal systems should feel less like search engines and more like navigation engines. Imagine an agent that can traverse a product catalog, watch a demo video, read the spec sheet, and surface the most relevant insight—all in one flow.

This requires more than embeddings. It requires queryable structure, semantic joins, and modality-aware indexing. It’s not just about finding the right chunk—it’s about understanding how chunks relate.

Model Context Protocol And Multimodality 

MCP is designed to handle modular, agentic workflows where different servers specialize in specific tasks, such as image analysis, vector search, or structured data retrieval. However, multimodal data types are typically large blobs of binary data, which are awkward to embed in JSON, especially since Base64 encoding bloats them by roughly a third, so you have to resort to tricks like linking to signed URLs or providing resource templates. Multimodal databases can serve as the substrate for next-gen MCP: enabling agents to reason over raw data, track provenance, and adapt workflows dynamically. It’s not just about processing content, it’s about understanding it in context. A future evolution of MCP could address binary data more directly to support true multimodal communication.
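
A quick sketch of the Base64 problem and the usual workaround; the dictionary shapes below loosely follow MCP's content-block style but are illustrative only, so check the MCP specification for the exact schema.

```python
import base64
import json
import os

blob = os.urandom(3_000_000)  # stand-in for a ~3 MB image or audio clip

# Option 1: inline the binary as Base64 inside JSON. The payload grows by
# roughly a third before JSON framing is even counted.
inline = {
    "type": "image",
    "mimeType": "image/png",
    "data": base64.b64encode(blob).decode("ascii"),
}
print(len(blob), len(json.dumps(inline)))

# Option 2: hand back a reference the agent or a downstream tool resolves on
# demand, e.g., a signed URL issued by the data layer (fields are illustrative).
by_reference = {
    "type": "resource",
    "uri": "https://example.com/assets/demo_42.png?signature=...",
    "mimeType": "image/png",
    "expires_in_s": 900,
}
print(len(json.dumps(by_reference)))
```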

Multimodal AI Solutions Of The Future

In closing, I just want to say: it’s not easy for a machine to be a human. We keep learning from everything around us, connecting the dots, searching for patterns, and making decisions or creating content that expresses what we are thinking, based on our entire knowledge base. AI solutions have come a long way on that journey, but until we embrace the need to rethink how we deal with data, let go of patchwork solutions, and take a holistic approach, we will keep slowing down our own progress.

Improved with feedback from Gavin Matthews

