What Does Multimodality Truly Mean For AI?

July 2, 2025
Vishakha Gupta

In the words of Jeff Dean, “Multimodal models are the next frontier in AI. They allow us to build systems that understand the world more like humans do.” And Satya Nadella recently emphasized that “the future of AI is not just large, it’s richly multimodal.” The investment reflects this shift. According to PitchBook, over $2.5 billion has flowed into multimodal AI startups in the past 18 months alone. From OpenAI’s GPT-4V to Meta’s ImageBind and Google’s Gemini, the race is on to build systems that can reason across modalities, not just embed them. The push naturally stems from the desire to make AI seem more human, and humans typically reason across modalities to glean maximum information before making any decision.

Why It Matters: The New Frontier Of AI

Even if we ignore the hype and the words of the companies building these AI solutions, as we all get immersed in the world of AI agents and ask them to live our digital lives while we sip coffee or margaritas, a question I have to ask is: are these agents really like us? After all, our brains process many modalities of information, like audio, scenes, text, and prior knowledge, to help us decide what we decide. For any AI to represent us, it has to be as good as or better than us at that kind of cross-modal reasoning. No wonder multimodality is having a moment, and not just as a buzzword. It’s reshaping how we build, interact with, and extract value from AI systems. From enterprise search to agentic workflows, the ability to reason across text, images, video, audio, and structured data is no longer a futuristic ideal: it’s the new baseline.

Multimodality In Classic And Generative AI World

Back when I first started working on deep learning applications for computer vision data, it was impressive that you could train a model to understand images at all, but the answers could sometimes be very funky; AlexNet, for example, could easily classify a person’s brain scan as a telephone. Around the time we started user conversations for ApertureData, we had reached a point where even off-the-shelf models could detect complex things like what people were doing in video datasets. That was made possible by rapid improvements in compute and memory. The same progression meant that just last year, all we could talk about was how to use LLMs to answer questions over vast knowledge bases, and then improve them with various techniques, like the sixteen flavors of RAG, to generate relevant answers for us. We now have models generating images and videos for us. That ability to work with various data types, ranging from a few rows and columns to videos, is multimodal AI, whether classic or generative.

Multimodality In The Agentic World

In an agentic world, where autonomous systems plan, reason, and act, multimodality is not optional: it’s foundational. Agents must perceive the world through multiple lenses, reason across them, and act accordingly.

Without multimodal infrastructure, agents are blind to nuance. They can’t correlate a chart with its source, a video with its transcript, or a document with its embedded tables. They become brittle, hallucination-prone, and context-starved.

The Three Pillars Of Multimodal AI Development

Multimodal Models

Multimodal models are AI systems designed to process and integrate multiple types of data such as text, images, audio, and video, all within a unified architecture. This mimics human cognition, where we naturally combine sensory inputs to understand the world. Over the past decade, multimodal models have evolved from early fusion techniques to sophisticated architectures capable of integrating text, images, audio, video, and more within unified frameworks. Pioneering models like CLIP, DALL-E, and Flamingo demonstrated the power of vision/language alignment and few-shot learning, while recent advances such as ImageBind and Kosmos-2 enable any-to-any modality reasoning. Architectures have diversified into deep and early fusion strategies, with tokenized early-fusion gaining traction for its scalability. These models are transforming industries from healthcare diagnostics to personalized education and creative media, while also raising challenges around data alignment, bias, compute efficiency, and ethical dataset use. Looking ahead, the focus is shifting toward tool-augmented agents, multimodal generation, and real-time, efficient deployment.

TL;DR: models are now very capable of handling multimodal data in production, as a step toward reasoning across modalities.
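
To make this concrete, here is a minimal sketch of modality alignment in the CLIP family, assuming the Hugging Face transformers and Pillow packages are installed; the checkpoint is the public openai/clip-vit-base-patch32, and the image path and text prompts are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint that embeds images and text into a shared space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product_photo.jpg")  # placeholder path
candidate_labels = ["a red running shoe", "a leather office chair", "a coffee mug"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# a zero-shot classification over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```

The same shared embedding space is what lets text queries retrieve images (and vice versa) without task-specific training.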

Multimodal Processing Capabilities

Over the past decade, the evolution of CPUs, GPUs, and AI accelerators has been central to enabling multimodal AI. CPUs, once the default for AI workloads, gave way to GPUs like NVIDIA’s H100 and AMD’s MI300, which offer massive parallelism ideal for training large models like CLIP and DALL·E. Specialized accelerators have since emerged to push performance and efficiency even further. Google’s TPUs and Intel’s Gaudi chips are now widely used in cloud-scale AI, while startups like Cerebras (with its wafer-scale engine), SambaNova (dataflow-based systems), Groq (tensor-streaming processors), and Tenstorrent (RISC-V-based AI chips) are redefining the hardware stack for training and inference. These innovations have enabled real-time multimodal processing across edge and cloud environments, powering applications from autonomous vehicles to generative AI agents.

TL;DR: processing is not the bottleneck, as long as you have money to throw at it or some innovative ways to fine-tune better for less.

Multimodal Data Management For AI

With models and processing this advanced, how many of these AI applications are actually deployed in production and answering our questions today? The growth trajectory is cautious, because today’s AI systems still struggle, or underperform, due to data access challenges that remain unresolved. These challenges only get bigger when AI is acting for you, not just chatting with you. This is where the data infrastructure behind AI becomes crucial.

A lot of companies have always had multimodal data. For example, e-commerce companies have long had product images in their DAMs (Digital Asset Management systems), audio transcripts from customer support calls, and text or video testimonials from user feedback. Similarly, healthcare and life sciences companies typically have patient scans, PDF research documents, and doctors’ notes, all connected to each other. Transportation and logistics sits on a pile of PDF invoices, route information, and, increasingly, images of transferred packages. This list of examples of multimodality is very long.

With the realization of its importance, we have naturally seen the rise of data tools for multimodal AI: vector databases with support for multimodal embeddings, traditional databases with support for embeddings, data warehouses addressing access to various modalities of data, multimodal AI databases, multimodal ETL and data processing, and so on. Yet AI outcomes have been stuck on text and tabular data for the last few years. Data tooling is one challenge, but the data problem is actually much broader, and it becomes very specific to the company under consideration: its policies, the particular mix of data involved, its access patterns, and how willing its teams are to stray past known tools and adopt more AI-ready solutions. For AI to truly reach human quality, the data foundations have to change.

TL;DR: while models and processing capabilities mean we can reach the moon, data troubles will keep us in a hole in the ground unless we understand the real challenges and start chipping away at them.

Why Do We Need To Worry About Multimodal Data Management?

We have spent almost a decade immersed in this world of multimodality, starting with image data, graduating to video, then audio and documents more recently, with a brief stint in domain-specific data types like DICOMs (used for 3-D medical imaging), point clouds, time series, and structured data. Along the way, from the early ML days to agentic AI, we have come across some incorrect assumptions about what it means to address multimodality, and witnessed plenty of unreasonable expectations that AI would provide relevant responses even when its input was flawed or incomplete. In the rest of this article, we look at what multimodal data means, some right ways to think about it, and some common misconceptions.

Is Our Brain Tabular? 

When AI agents or chatbots base their responses purely on SQL queries, it makes me wonder: is my brain tabular? Do I always imagine an Excel sheet in my head in order to decide anything? The answer is no! Occasionally that is the case, when we are doing accounting or studying trends, but typically tabular data is just one of the inputs needed for any given situation. If that’s so, how can querying purely relational data be enough? It often isn’t, but it is part of the whole, and it might also help us understand the rest of the data. For example, we might have employee information in a relational database, which might be enough to know salary distributions but not enough to know performance over the years, because the reviews were submitted as PDFs. One important thing to keep in mind, though: it is part of the whole, not the whole!
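
A tiny sketch of that split, using an in-memory SQLite table with hypothetical columns and paths: the relational part answers the salary question directly, while the performance question only yields a pointer to a document that still has to be fetched and understood.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, review_pdf TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ada", 185000, "s3://hr-bucket/reviews/ada_2024.pdf"),
     (2, "Grace", 172000, "s3://hr-bucket/reviews/grace_2024.pdf")],
)

# Salary distribution: purely relational, SQL answers it on its own.
print(conn.execute("SELECT AVG(salary), MIN(salary), MAX(salary) FROM employees").fetchone())

# Performance over the years: the table only stores a pointer to a PDF.
# Answering that question means retrieving and interpreting the document itself.
print(conn.execute("SELECT name, review_pdf FROM employees").fetchall())
```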

Vector Search ≠ Multimodal Database

Let’s be clear: a vector database with multimodal embeddings is not a multimodal database; it’s still a vector database. Even if you embed images, PDFs, and videos into a shared vector space, you are still operating on a flattened representation. You lose structure, provenance, and the ability to query across relationships. You could, in principle, adapt your embeddings to capture some of this information, but certain attributes will remain a better fit for a contextual representation. And then there is the actual data: embedding it does not remove the need to access and format it for display or summarization.
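
A minimal sketch of that flattening, using FAISS as a stand-in for any vector index and random vectors as placeholders for real image, PDF, and video embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 512  # e.g., the dimensionality of CLIP ViT-B/32 embeddings
index = faiss.IndexFlatL2(dim)

# Placeholders for embeddings of images, PDF pages, and video frames.
embeddings = np.random.rand(1000, dim).astype("float32")
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)

# All the index can return is row ids and distances. Which video a match came
# from, which frame, which bounding box, which paragraph of which PDF, and how
# those pieces relate to each other must be tracked and joined somewhere else.
print(ids, distances)
```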

A true multimodal database, like ApertureDB, is designed to store, index, and retrieve native multimodal data with context intact. It understands that a bounding box in an image, a caption in a video, and a paragraph in a PDF are not just embeddings, they’re anchors for reasoning.

Is It Enough To Stuff Vector Search In Existing Databases?

A lot of apps just need simple semantic search coupled with keyword search, which can be addressed by introducing new types of indexes. Naturally, databases that have always offered various types of indexes can introduce a few more. But is that really the syntax you want for adding embeddings (SQL syntax for graph queries was bad; wait till you see it for embeddings)? As is the case with relational databases in general, what if you want to test several different embeddings extracted from various models? What if you have unimodal embeddings per data type and also their corresponding multimodal versions? How many joins are you willing to do to get to the data? Even if it works, does it scale? Is it efficient? I think we all know the answer to that, given the vast amount of money that has gone into query optimization so far. There is, of course, the option of adding a vector index to a graph database, but we first need scalable graph databases, and those embeddings cannot come just from the graph nodes. How about key-value stores? KV stores are notoriously bad at traversing relationships. A document database like Mongo comes close but has other challenges when it comes to actually managing multimodal data and scaling.
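
To see where the joins pile up, here is a hedged sketch assuming Postgres with the pgvector extension and the psycopg driver; the table and column names are hypothetical. Every embedding variant (per modality, per model) tends to become its own table, and getting back to the actual asset, or comparing variants, means another join each time.

```python
import psycopg  # assumes a running Postgres instance with pgvector installed

SETUP = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE IF NOT EXISTS assets (id BIGSERIAL PRIMARY KEY, uri TEXT, kind TEXT)",
    "CREATE TABLE IF NOT EXISTS emb_clip_image (asset_id BIGINT REFERENCES assets(id), emb vector(512))",
    "CREATE TABLE IF NOT EXISTS emb_clip_text  (asset_id BIGINT REFERENCES assets(id), emb vector(512))",
]

# One similarity search already needs a join back to the asset table; comparing
# two embedding variants for the same asset needs two joins, and so on for every
# model or modality you want to test.
QUERY = """
SELECT a.uri,
       ci.emb <-> %(q)s::vector AS image_distance,
       ct.emb <-> %(q)s::vector AS text_distance
FROM assets a
JOIN emb_clip_image ci ON ci.asset_id = a.id
JOIN emb_clip_text  ct ON ct.asset_id = a.id
ORDER BY image_distance
LIMIT 10
"""

query_vector = "[" + ",".join(["0"] * 512) + "]"  # placeholder query embedding

with psycopg.connect("dbname=multimodal") as conn:  # hypothetical connection string
    for stmt in SETUP:
        conn.execute(stmt)
    rows = conn.execute(QUERY, {"q": query_vector}).fetchall()
    print(rows[:3])
```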

Can’t You Just “Layer” On Object Stores?

Object stores, data warehouses, and lakehouse architectures have been the traditional ways of dumping large or heavy data types like audio and video and looking them up through some tabular structure. That was sufficient in the streaming world, but once we started using machine learning to understand this data and generative AI to create new outcomes from it, simply accessing it once, or streaming it, stopped being enough. Why should the burden of slicing a video or audio file fall on a data engineer who doesn’t know whether the resulting clips will be accessed by scene, by actions within scenes, or as fixed-size clips? Is it enough to embed an entire audio file and then replay it in its entirety, even if the interesting portion is just a fraction of the whole? Do we need access to a 500-page document, or just the section where our questions are answered? If accessing or even preprocessing “portions” of complex data types is valuable, and by now we know it is, you need a first-class understanding of these objects in the foundational, “truly” multimodal data layer.
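
To make the pre-chunking burden concrete, here is a small sketch that cuts a video into fixed-length clips with ffmpeg (assumed to be installed); the file names and clip length are arbitrary choices the data engineer has to make up front, long before anyone knows how the clips will actually be queried.

```python
import subprocess

def cut_fixed_clips(src: str, clip_seconds: int, total_seconds: int) -> list[str]:
    """Naively pre-chunk a video into fixed-length clips, with no knowledge of
    whether downstream access will be by scene, by action, or by timestamp."""
    clips = []
    for i, start in enumerate(range(0, total_seconds, clip_seconds)):
        out = f"clip_{i:04d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(clip_seconds),
             "-i", src, "-c", "copy", out],
            check=True,
        )
        clips.append(out)
    return clips

# Example: 30-second clips from a 10-minute recording (paths are placeholders).
cut_fixed_clips("warehouse_walkthrough.mp4", clip_seconds=30, total_seconds=600)
```

A data layer with first-class video support would instead let the application ask for the interval it actually needs at query time.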

The Role Of Knowledge Graphs In Multimodal Understanding

Multimodal data is inherently fragmented—images, videos, text, and documents all live in different formats, with complex relationships. Knowledge graphs serve as the semantic layer that stitches this all together.

By representing entities, events, and relationships across modalities, knowledge graphs provide a structured backbone for reasoning. For example, they can connect a speaker in a video to their transcript, link a chart to its source dataset, or associate a bounding box in an image with a description in a document.

This structure is especially powerful when paired with agentic systems. It allows agents to traverse concepts across modalities—not just search, but reason. Think of it as turning raw multimodal data into a navigable, queryable web of meaning.
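
A toy sketch of such a graph using networkx; the node ids, attributes, and relation names are made up for illustration, but the traversal shows how an agent can hop from a speaker to the videos they appear in and on to the transcripts.

```python
import networkx as nx

G = nx.MultiDiGraph()

# Nodes for different modalities, each carrying a pointer to the underlying asset.
G.add_node("speaker:jane_doe", modality="entity")
G.add_node("video:demo_42", modality="video", uri="s3://bucket/demo_42.mp4")
G.add_node("transcript:demo_42", modality="text", uri="s3://bucket/demo_42.txt")

# Relationships across modalities.
G.add_edge("speaker:jane_doe", "video:demo_42", relation="appears_in", start_s=12.5, end_s=80.0)
G.add_edge("video:demo_42", "transcript:demo_42", relation="has_transcript")

# Traversal: which transcripts cover videos this speaker appears in?
for _, video, e in G.out_edges("speaker:jane_doe", data=True):
    if e["relation"] == "appears_in":
        for _, transcript, e2 in G.out_edges(video, data=True):
            if e2["relation"] == "has_transcript":
                print(video, "->", transcript)
```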

Navigating The Data: Beyond Search

Multimodal systems should feel less like search engines and more like navigation engines. Imagine an agent that can traverse a product catalog, watch a demo video, read the spec sheet, and surface the most relevant insight—all in one flow.

This requires more than embeddings. It requires queryable structure, semantic joins, and modality-aware indexing. It’s not just about finding the right chunk—it’s about understanding how chunks relate.

Model Context Protocol And Multimodality 

MCP is designed to handle modular, agentic workflows where different servers specialize in specific tasks, such as image analysis, vector search, or structured data retrieval. However, multimodal data types are typically large blobs of binary data, which are awkward to embed in JSON, especially since Base64 encoding bloats them by roughly a third, so you have to resort to tricks like linking to signed URLs or providing resource templates. Multimodal databases can serve as the substrate for next-gen MCP: enabling agents to reason over raw data, track provenance, and adapt workflows dynamically. It’s not just about processing content, it’s about understanding it in context. A future evolution of MCP could address binary data more directly to support true multimodal communication.
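
A quick sketch of the Base64 problem and the usual workaround; the dictionary shapes below loosely follow MCP's content-block style but are illustrative only, so check the MCP specification for the exact schema.

```python
import base64
import json
import os

blob = os.urandom(3_000_000)  # stand-in for a ~3 MB image or audio clip

# Option 1: inline the binary as Base64 inside JSON. The payload grows by
# roughly a third before JSON framing is even counted.
inline = {
    "type": "image",
    "mimeType": "image/png",
    "data": base64.b64encode(blob).decode("ascii"),
}
print(len(blob), len(json.dumps(inline)))

# Option 2: hand back a reference the agent or a downstream tool resolves on
# demand, e.g., a signed URL issued by the data layer (fields are illustrative).
by_reference = {
    "type": "resource",
    "uri": "https://example.com/assets/demo_42.png?signature=...",
    "mimeType": "image/png",
    "expires_in_s": 900,
}
print(len(json.dumps(by_reference)))
```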

Multimodal AI Solutions Of The Future

In closing, I just want to say: it’s not easy for a machine to be a human. We keep learning from everything around us, connecting the dots, searching for patterns, and making decisions or creating content that expresses what we are thinking, based on our entire knowledge base. AI solutions have come a long way on that journey, but until we embrace the need to rethink how we deal with data, let go of patchwork solutions, and take a holistic approach, we will keep slowing down our own progress.

Improved with feedback from Gavin Matthews

