
Vector Databases and Beyond for Multimodal AI: A Beginner's Guide Part 1

October 15, 2024
7 min read
Vishakha Gupta

Multimodal AI, vector databases, large language models (LLMs), retrieval-augmented generation (RAG), and knowledge graphs are cutting-edge technologies making a significant impact across various industries today. These technologies have evolved rapidly in the past few years and seen unprecedented adoption in virtually every industry vertical, making them more of a need-to-know than a nice-to-know. For data practitioners and decision makers, it's essential to dig deeper, underneath the terms, and understand the challenges that stand in the way of successfully implementing an AI strategy.

Given our specific interest and research in data, which is in large part responsible for the quality of these AI methods, we have created a series of blogs that ranges from introducing the relevant terminology to an advanced cost analysis of getting things wrong. In this first blog of the series, in addition to introducing these terms, we explore multimodal data, how databases including vector and graph databases fit in, practical use cases, and how the generative AI wave is reshaping the landscape.

The Fundamentals

Multimodal Data

To diagnose complex symptoms, doctors often prescribe a variety of tests: think of the CT scans or X-rays that are commonly taken, as well as blood tests and all sorts of other tests used in health and life sciences to diagnose patients and come up with treatment plans. Multimodal data here refers to all the different types of data that the doctor combines to improve your health. And it is not just at the doctor's office: multimodal data (like video, audio, sensor readings, text, etc.) can be captured virtually everywhere. Most of this data also goes beyond the structured representations that fit in spreadsheets or relational databases, into unstructured formats that are not so easy to search through but are more representative of how humans understand the world.

Source: https://www.mdpi.com/1999-4893/16/4/186

Multimodal AI

To bridge the gap between human-like understanding and AI capabilities, Multimodal AI combines different types of data—such as text, images, videos, or audio—to generate more comprehensive and accurate outputs. Consider autonomous cars. To operate in a rich human environment, these vehicles collect many different types or modalities of data to enable them to travel without a driver. This data could be coming from cameras collecting videos from both inside and outside the vehicle, recorders capturing audio of the passengers as well as external environment, sensors collecting Radar or LiDAR (Light Detection and Ranging) data, weather information, etc. Multimodal AI combines all of this data and, with AI models, actually generates commands to drive and control the car, such as providing depth perception so the car doesn't hit anything or anyone.

Source: https://api.semanticscholar.org/CorpusID:218684826

Generative AI

This is probably a household term by now, ever since OpenAI’s ChatGPT gained in popularity. Generative AI (Gen AI) leverages machine learning techniques to generate data similar to or derived from the datasets it was trained on, in an attempt to mimic human responses to questions. This includes creating new content such as images, music, and text. A key aspect of Gen AI is its ability to work with both text and multimodal data. Text-based generative AI can create human-like text, while multimodal generative AI can understand and generate content that combines multiple data types, such as text, audio, video and images. This makes Gen AI a powerful method in a variety of fields, from content creation to data augmentation. Gen AI methods often use semantic searches combined with other techniques to find the right subset of supporting data in the context of the question asked and use large models to generate their responses.

Vector Embeddings

For this data to be useful, we want to be able to search through this vast collection of multimodal information to find what we are looking for (semantic search). We may know some keywords to guide the search, or we may know what the information should contain or look like.

Imagine how large and complicated all this unstructured data is and how hard it would be to search through everything. If trying to do facial recognition on a casino floor, for example, huge amounts of video and images would have to be compared pixel by pixel to find a specific face in the crowd. That would require a lot of manual visual inspection, making it extremely inefficient. It gets dramatically easier with embeddings.

Embeddings give a simpler, lower-dimensional representation of a specific piece of data so it is faster and easier to spot what we are looking for. Initially they may give approximate answers that can then be used to target more specific or relevant ones.

To extract embeddings, you can leverage different, potentially off-the-shelf models like FaceNet, YOLO, the large language models offered by OpenAI, Cohere, Google, and Anthropic, or the numerous open-source alternatives, by running inference on your data and extracting, most commonly, the penultimate layer's activations.
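
As a rough illustration, here is a minimal sketch, assuming PyTorch and torchvision, of one common way to do this: drop the final classification layer of a pretrained ResNet-50 so that a forward pass returns the penultimate-layer activations as the embedding. The model choice and image path are placeholders, not a prescribed setup.

```python
# Minimal sketch: extract an image embedding from the penultimate layer of an
# off-the-shelf pretrained model (ResNet-50 here; other backbones work similarly).
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained network and replace its classification head with Identity
# so the forward pass stops at the penultimate layer.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "face.jpg" is a placeholder path for whatever data you want to embed.
image = preprocess(Image.open("face.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = model(image)  # a 2048-dimensional vector for ResNet-50

print(embedding.shape)  # torch.Size([1, 2048])
```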

Source: https://partee.io/2022/08/11/vector-embeddings/

Multimodal Embeddings

Multimodal embeddings are a way of linking different types of information, like text and images, into a shared understanding. This means that a computer can look at a picture (for example of a dog) and understand its meaning in relation to a piece of text (like ‘cute golden retriever puppies’), or vice versa. By combining these different forms of data, multimodal embeddings help computers interpret and interact with the world in a more human-like way.

Contrastive Language-Image Pre-Training (CLIP), developed by OpenAI, is one of the earliest and most prominent models to use this approach. It learns by looking at many image-text pairs from the internet and figuring out which texts go with which images. This helps it understand what an image is showing just by looking at it, even if it hasn't seen that exact image before. Newer models like OpenAI's GPT-4o and Google's Gemini series extend multimodal capabilities to other types of data like audio, getting closer to human-like abilities of interpretation and search.
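
To make this concrete, here is a small sketch, assuming the Hugging Face transformers library, that loads a public CLIP checkpoint and scores how well two example captions match an image; the model name, image path, and captions are illustrative placeholders.

```python
# Minimal sketch: score text-image similarity in CLIP's shared embedding space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("puppy.jpg")  # placeholder image path
texts = ["cute golden retriever puppies", "a city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = the caption sits closer to the image in the shared space.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```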

Source: https://opensearch.org/blog/multimodal-semantic-search/

Why We Need Vector Databases

Regardless of the source of these embeddings or feature vectors, finding similar data leads to complex requirements due to their high-dimensional nature. As a data scientist or an entire data team uses their well-trained AI models to extract embeddings, they need to efficiently index them for search and classification. Vector databases, and now some traditional databases, offer special indexes to accommodate the high-dimensional nature of embeddings.

With these indexes, as well as some clustering and other algorithmic magic, a vector database can return a label for a given embedding based on its closest matches, which is classification. Picture an image of Michelle Obama: the vector database can return a 'label' that tells you it is a picture of Michelle Obama. A vector database can also return the other closest vectors in the given search space. Imagine starting with an image of Taylor Swift and finding similar images of people who look like her.
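
To illustrate what such an index does under the hood, here is a minimal sketch using the FAISS library on randomly generated stand-in embeddings and labels; a real vector database builds and manages such indexes for you (often approximate ones like HNSW or IVF) alongside the rest of your data.

```python
# Minimal sketch: index embeddings and run nearest-neighbor search/classification.
import numpy as np
import faiss

dim = 512  # example embedding dimensionality
embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in vectors
labels = [f"person_{i}" for i in range(len(embeddings))]    # stand-in labels

index = faiss.IndexFlatL2(dim)  # exact L2 index, for illustration only
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")  # embedding of a new image
distances, neighbors = index.search(query, 5)

# Classification by nearest neighbor: borrow the label of the closest match.
print("closest match:", labels[neighbors[0][0]])
# Similarity search: the other nearby vectors in the search space.
print("similar items:", [labels[i] for i in neighbors[0]])
```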

Vector databases are therefore used in applications that are looking for similar items (text or multimodal), looking to classify data that is missing labels, or helping GenAI applications as a step towards creating a response to a user’s queries.

LLMs, RAGs, Knowledge Graphs

Large Language Models (LLMs) are trained on a vast amount of text data (there are large vision models for visual data, and so on). LLMs are designed to generate human-like text based on the input they are given. LLMs can understand context, answer questions, write essays, summarize texts, translate languages, and even generate creative content like poems or stories. They are used in a variety of applications, including chatbots, text editors, and more. Examples of LLMs include OpenAI’s GPT-3 and GPT-4.

If a user asks a question that requires context the large model was not trained on, the model can hallucinate an answer. One way to correct that, or to have it say "I don't know the answer", is Retrieval-Augmented Generation (RAG): take the first set of matches for a query, order them based on relevant context or newer data that was not yet included in model training, and then generate an answer. The methods can range from applying a different sorting algorithm to the text or multimodal data that matched the first query, all the way to attaching richer context, for example as retrieved from knowledge graphs.
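
As a rough sketch of how these pieces fit together, the code below retrieves context for a question and hands it to an LLM; the `retrieve_top_k` helper is a hypothetical stand-in for a vector-database query (like the nearest-neighbor search shown earlier), and the OpenAI client call is just one example of a generation step.

```python
# Minimal RAG sketch: retrieve relevant chunks, then generate an answer with them.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve_top_k(question: str, k: int = 3) -> list[str]:
    # Hypothetical placeholder: embed the question, query your vector index,
    # and return the k most similar text chunks.
    return ["...retrieved chunk 1...", "...retrieved chunk 2...", "...retrieved chunk 3..."][:k]


def answer(question: str) -> str:
    context = "\n\n".join(retrieve_top_k(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content


print(answer("What sensors does an autonomous car rely on?"))
```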

A knowledge graph is a semantic data representation that describes real-world concepts and their relationships. In these graphs, nodes represent real-world entities (e.g. a person, a product) and edges represent their relationships (e.g. a person buys a product). The connections between nodes and edges provide rich semantic information, which helps infer knowledge about whichever domain the graph is constructed for. The core unit of a knowledge graph is the "Entity-Relationship-Entity" triplet, and graph databases can be used to represent this when building applications. There are now models that help you not only extract embeddings from a piece of text but also extract the relevant named entities and their relationships, which can then be used to construct these knowledge graphs. This makes for a close intertwining of vector databases, graph databases, various models, and access to the relevant data itself.
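
For illustration, the sketch below stores a few made-up "Entity-Relationship-Entity" triplets in a small in-memory graph using the networkx library; in a real application a graph database would play this role.

```python
# Minimal sketch: represent Entity-Relationship-Entity triplets as a graph.
import networkx as nx

triplets = [
    ("Alice", "buys", "Camera"),         # person buys product
    ("Camera", "made_by", "Acme Corp"),  # product made by company
    ("Alice", "lives_in", "Seattle"),    # person lives in city
]

graph = nx.MultiDiGraph()
for subject, relation, obj in triplets:
    graph.add_edge(subject, obj, relation=relation)

# Simple traversal: what is Alice connected to, and how?
for _, target, data in graph.out_edges("Alice", data=True):
    print(f"Alice --{data['relation']}--> {target}")
```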

Next Steps

Vector databases are useful for managing and analyzing data, but modern data needs are more complex. Databases for multimodal data, like ApertureDB, offer a unified platform that combines these different functions, giving businesses a strong solution for data management and analysis in today's fast-changing world.

Want to learn more? Continue reading the next blog in the series, where we look at real-life examples of how multimodal AI is used. These examples show why we need advanced systems that do more than basic data searches, and why specialized multimodal AI databases are needed to handle these complex tasks efficiently and reliably.

Last but not least, we will be documenting our journey and explaining all the components listed above on our blog; subscribe here.

I want to acknowledge the insights and valuable edits from Laura Horvath and Drew Ogle.
