Build the Semantic Layer (Entity Embeddings + Metadata Linking)
Introduction
In the previous parts (Part 1 and Part 2) of this series we used Google’s Gemini to auto-extract class schemas, concrete entities, and their properties from PDF source material, then batch-ingested everything into an ApertureDB instance. That gave us a fully structured symbolic knowledge graph. We then added explicit relationships and visualized the graph. Now we’re ready to turn that graph into a hybrid retrieval substrate by layering semantic vector embeddings directly onto the entities you already stored. This is the critical first step toward GraphRAG: dense vector similarity lets us find semantically related entities; the graph lets us traverse contextually rich neighborhoods around them. Together, they dramatically improve recall, grounding, and explanation in downstream RAG workloads.
ApertureDB is purpose-built for multimodal AI/graph workloads: it stores structured metadata (your entities + relationships) alongside blobs, embeddings, and indexes optimized for similarity search, which is exactly what we need to bind vectors to graph nodes and query across both signals. (ApertureDB Documentation)
In this segment we’ll do three things:
- Generate embeddings (Descriptors) for every entity in our existing graph using Gemini embeddings and package them with minimal metadata.
- Create an ApertureDB vector index (DescriptorSet) sized to the embedding dimensionality.
- Ingest + link: load embeddings at scale using ApertureDB’s ParallelLoader, store source entity metadata on each descriptor, and create has_embedding connections so we can hop from a vector hit back into the graph.
Notebook & Data: As before, code below is trimmed for readability. The full, executable Colab notebook (with setup cells, error handling, logging, etc.) and the complete Github repo are linked. The repo also contains some sample data you can use to run the notebooks. You will find that there are two versions of the notebook:
- The cloud version involves signing up on ApertureDB Cloud and configuring your instance there.
- The local udocker version involves setting up the DB instance locally using udocker (which allows running Docker containers in constrained environments like Colab notebooks) - no sign-up needed.

A Note on Scale & Inline Pipelines
In many real‑world projects you would likely embed while you are building the graph - ApertureDB’s loaders make that straightforward by letting you stream raw entities, blobs, and their embeddings in the same parallel ingest job, so a million‑node graph with tens of millions of vectors can be built end‑to‑end without a second pass (see docs).
But the opposite scenario is just as common: you might inherit an already‑constructed graph (or decide to experiment with a new embedding model) and only later realize you want to add or refresh vector data. Here too, the ParallelLoader uses exactly the same API: simply add an AddDescriptor entry for every entity row and run the ingest again.
In this tutorial series we decouple the steps on purpose: first to keep each post focused, and second to highlight ApertureDB’s native support for both graph and vector workloads. Whether you embed inline during graph creation or revisit the graph to try different models, the mechanics remain identical.
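To make the inline pattern concrete, here is a minimal sketch (not from the notebooks) of what one ParallelLoader ingest tuple could look like when the entity, its descriptor, and the has_embedding edge are added in a single transaction. The class name, property names, and helper function are illustrative placeholders; the individual commands are the same ones we use later in this post.

import numpy as np

def inline_entity_with_embedding(entity, embedding, set_name):
    """Illustrative sketch: build one (query, blobs) tuple that adds an entity,
    its descriptor, and the has_embedding edge in one transaction."""
    q = [
        {"AddEntity": {
            "_ref": 1,
            "class": entity["class"],  # placeholder class name
            "properties": {"id": entity["id"], "name": entity["name"]},
        }},
        {"AddDescriptor": {
            "_ref": 2,
            "set": set_name,
            "properties": {"source_entity_id": entity["id"]},
        }},
        {"AddConnection": {
            "class": "has_embedding",
            "src": 1,
            "dst": 2,
        }},
    ]
    blobs = [np.array(embedding, dtype=np.float32).tobytes()]  # the descriptor's vector
    return (q, blobs)

Each tuple produced this way can be fed straight into ParallelLoader.ingest, exactly as we do in the embedding-only pass below.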
What You Should Already Have (Prerequisites)
You should be starting from an ApertureDB instance populated with:
- Entity classes + properties auto-derived from your document(s).
- Deduplicated entity instances, each with a unique id property.
- (Optional but recommended) The original PDF stored as a Blob.
- Relationships between entities.
If you haven’t completed those steps, pause here and work through Knowledge Graph Automation Part 1 and Part 2 - they automate nearly everything: upload a PDF, extract structure via Gemini, clean it, and batch-ingest into ApertureDB.
GraphRAG Phase 1 Overview: From Entities → Semantic Embeddings
Here’s the high-level flow we’ll implement in this post:
- Fetch entities from ApertureDB (all classes, all properties).
- Synthesize a text “document” per entity by concatenating its name, class, and property key-values. These become embedding inputs.
- Batch embed the documents using Google’s gemini-embedding-001 model (we’ll use 768-dim vectors for a balance of quality vs. storage/search cost).
- Create an ApertureDB DescriptorSet (vector index), ingest embeddings at scale, and create has_embedding edges back to source entities so semantic hits are graph-navigable.
Environment Setup
Install required packages (ApertureDB client, Google GenAI SDK, utilities) and pull credentials from your Colab secrets store (or set up udocker and Google Drive if you’re not using ApertureDB Cloud).
%pip install -q aperturedb google-genai
import os, json, time
import numpy as np
from typing import Any, Dict, List
from google.colab import userdata # or your own secrets loader
from aperturedb.CommonLibrary import create_connector
from aperturedb.ParallelLoader import ParallelLoader # used later
from google import genai
from google.genai import types
# Credentials (stored in Colab "secrets" / userdata)
google_api_key = userdata.get("GOOGLE_API_KEY")
db_key = userdata.get("APERTUREDB_KEY")  # ApertureDB connection key; adjust the secret name to your setup
# Connect to ApertureDB
client = create_connector(key=db_key)
Preparing Entity Data for Embedding
Why synthesize entity documents? Embedding models expect text. Your graph entities are structured key-value rows; we need a reliable textual serialization that captures discriminative properties (name, class, key attributes) so semantically similar entities land near each other in vector space. Including the class label often helps cluster related concepts and improves downstream filtering/re-ranking.
Fetch all entities from ApertureDB
We first read the schema to discover valid (non-internal) classes, then issue a FindEntity query per class requesting all properties. Internal classes (like _Blob) are skipped to avoid errors and noise.
def fetch_entities(client):
    """Fetch all user-defined entities (skip internal classes)."""
    # 1. Get schema
    schema_query = [{"GetSchema": {}}]
    schema_response, _ = client.query(schema_query)
    if (not schema_response
            or "entities" not in schema_response[0]["GetSchema"]):
        print("Error: Could not retrieve schema.")
        return []
    all_class_names = schema_response[0]["GetSchema"]["entities"]["classes"].keys()
    valid_class_names = [c for c in all_class_names if not c.startswith("_")]
    all_entities = []
    for class_name in valid_class_names:
        q = [{
            "FindEntity": {
                "with_class": class_name,
                "results": {"all_properties": True}
            }
        }]
        resp, _ = client.query(q)
        ents = resp[0].get("FindEntity", {}).get("entities", [])
        for e in ents:
            e["class"] = class_name  # convenience
        all_entities.extend(ents)
    print(f"Fetched {len(all_entities)} total entities.")
    return all_entities
Serialize entities into embedding documents
We build a minimal but information-rich string: Entity: <name>. Class: <class>. key1: val1. key2: val2. You can enrich or curate which fields to include; just keep the format consistent so embeddings are comparable. Here we include all entity properties that were extracted by the LLM during knowledge graph creation.
def create_entity_documents(entities):
    """Return [{'entity_id', 'class', 'document'}, ...]"""
    docs = []
    for e in entities:
        parts = [
            f"Entity: {e.get('name', '')}.",
            f"Class: {e.get('class', 'N/A')}."
        ]
        for k, v in e.items():
            if k in ("_uniqueid", "name", "class"):
                continue
            if v is not None:
                parts.append(f"{k}: {v}.")
        docs.append({
            "entity_id": e.get("id"),
            "class": e.get("class"),
            "document": " ".join(parts)
        })
    print(f"Created {len(docs)} documents for embedding.")
    return docs
Run it.
all_entities = fetch_entities(client)
docs_to_embed = create_entity_documents(all_entities) if all_entities else []
print(docs_to_embed[0])
Sample (truncated):
{
  "entity_id": 208,
  "class": "Application Architecture",
  "document": "Entity: Cloud-Native Applications (CNA). Class: Application Architecture. characteristic: Highly Scalable..."
}
…and:
{
  "entity_id": 209,
  "class": "Application Architecture",
  "document": "Entity: Microservices Architecture. Class: Application Architecture. characteristic: Modular Approach..."
}
Including both the entity name and class in the serialized text tends to improve semantic grouping and downstream filtering when performing hybrid vector+metadata queries.
Generating Vector Embeddings with Gemini
We’ll embed each synthesized document using Google’s gemini-embedding-001 model. Gemini supports configurable output dimensionality; here we choose 768 dimensions to balance semantic fidelity with storage footprint and ANN search cost. You can scale up or down depending on retrieval precision needs and budget.
Throughput Note: Free-tier rate limits apply; batching + polite sleeps help avoid throttling. Adjust batch_size and time.sleep() to match your quota.
GEMINI_MODEL_NAME = "gemini-embedding-001"
EMBEDDING_DIMENSIONS = 768

gemini_client = genai.Client(api_key=google_api_key)

def generate_embeddings_with_gemini(docs,
                                    batch_size=96,
                                    model_name=GEMINI_MODEL_NAME,
                                    output_dim=EMBEDDING_DIMENSIONS,
                                    rpm_delay=60):
    """Add 'embedding' (list[float]) to each doc in-place."""
    texts = [d["document"] for d in docs]
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start+batch_size]
        embed_cfg = types.EmbedContentConfig(
            output_dimensionality=output_dim,
            task_type="RETRIEVAL_DOCUMENT",  # Recommended for corpus items
        )
        result = gemini_client.models.embed_content(
            model=model_name,
            contents=batch_texts,
            config=embed_cfg,
        )
        batch_embs = [e.values for e in result.embeddings]
        if len(batch_embs) != len(batch_texts):
            print("Embedding count mismatch; skipping batch.")
            continue
        for j, emb in enumerate(batch_embs):
            docs[start + j]["embedding"] = emb
        # simple rate-limit guard
        time.sleep(rpm_delay)
    docs_with_embs = [d for d in docs if "embedding" in d]
    print(f"Generated embeddings for {len(docs_with_embs)} docs.")
    return docs_with_embs
Run:
docs_with_embeddings = generate_embeddings_with_gemini(docs_to_embed)
print(len(docs_with_embeddings), "embedded docs")
print("Dim:", len(docs_with_embeddings[0]["embedding"]))
print("Head:", docs_with_embeddings[0]["embedding"][:5])
Example output:
Generated embeddings for 276 docs.
276 embedded docs
Dim: 768
Head: [-0.00747136 0.01111416 -0.00263643 -0.0615893 -0.00403898]
At this point we have an in-memory list of dicts: each entry carries (entity_id, class, document, embedding[]). Next we’ll create an ApertureDB DescriptorSet configured for 768-dim vectors (Cosine, HNSW), ingest the vectors in parallel, and link each descriptor back to its source entity, unlocking fast semantic recall with graph-native joins.
Creating the Vector Index (DescriptorSet)
ApertureDB stores vectors inside DescriptorSets. When you create a set you must lock in the vector dimensionality; the distance metric and search engine are configurable per set. For GraphRAG we set the dimensionality to 768 to match Gemini's output, use cosine similarity (CS), which is well suited for dense semantic embeddings, and pick the HNSW engine for fast, memory-efficient approximate nearest neighbor (ANN) search.
ApertureDB lets you mix-and-match metrics/engines and even store several per set, but a single HNSW-cosine index is plenty for our use-case. (ApertureDB Documentation)
DESCRIPTOR_SET_NAME = "entity_embeddings_gemini"

def create_vector_index(client, set_name, dims):
    """Create a DescriptorSet (vector index)."""
    add = [{
        "AddDescriptorSet": {
            "name": set_name,
            "dimensions": dims,
            "metric": "CS",   # cosine
            "engine": "HNSW"  # hierarchical NSW
        }
    }]
    resp, _ = client.query(add)
    assert resp[0]["AddDescriptorSet"]["status"] == 0
    print(f"Created DescriptorSet '{set_name}'.")

create_vector_index(client, DESCRIPTOR_SET_NAME, EMBEDDING_DIMENSIONS)
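If you want to confirm the set was created with the expected configuration, a quick FindDescriptorSet query works as an optional sanity check; the exact parameters below are our assumption, so cross-check them against the ApertureDB docs for your client version.

# Optional sanity check: confirm the DescriptorSet exists (parameter names assumed).
check = [{
    "FindDescriptorSet": {
        "with_name": DESCRIPTOR_SET_NAME,
        "results": {"all_properties": True}
    }
}]
resp, _ = client.query(check)
print(resp[0]["FindDescriptorSet"])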
Ingesting Embeddings at Scale
The ParallelLoader helper takes an iterator of (query, blobs) tuples, batches them, and writes to the database with a configurable thread-pool. This drives ingestion throughput into the hundreds of descriptors per second on modest hardware.
def ingest_embeddings(client, set_name, docs):
    """Convert each embedding to bytes and upload as Descriptor."""
    data = []
    for d in docs:
        q = [{
            "AddDescriptor": {
                "set": set_name,
                "properties": {
                    # Store source metadata *on* the descriptor
                    "source_entity_id": d["entity_id"],
                    "source_entity_class": d["class"],
                }
            }
        }]
        blob = [np.array(d["embedding"], dtype=np.float32).tobytes()]
        data.append((q, blob))
    loader = ParallelLoader(client)
    loader.ingest(generator=data, batchsize=64, numthreads=8, stats=True)

ingest_embeddings(client, DESCRIPTOR_SET_NAME, docs_with_embeddings)
Each element corresponds to one AddDescriptor call; we now have 276 vectors sitting in entity_embeddings_gemini.
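As an optional check from the database side, a FindDescriptor query scoped to the set and asking only for a count should report the same number; this is a small sketch that assumes the standard count result option.

# Optional sanity check: count descriptors stored in the set.
count_q = [{
    "FindDescriptor": {
        "set": DESCRIPTOR_SET_NAME,
        "results": {"count": True}
    }
}]
resp, _ = client.query(count_q)
print(resp[0]["FindDescriptor"].get("count"))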
Linking Descriptors Back to the Graph (has_embedding edges)
Vector search alone is not GraphRAG.
We need an explicit edge that lets us jump from a semantic hit → the originating entity → that entity’s neighbors. The pattern we’ll follow is: (Entity)-[:has_embedding]->(Descriptor)
def connect_embeddings(client, set_name, docs):
    """Create (Entity)-[:has_embedding]->(Descriptor) edges."""
    queries = []
    for d in docs:
        eid, eclass = d["entity_id"], d["class"]
        q = [
            {  # find the entity node
                "FindEntity": {
                    "_ref": 1,
                    "with_class": eclass,
                    "constraints": {"id": ["==", eid]},
                }
            },
            {  # find its descriptor (by metadata)
                "FindDescriptor": {
                    "_ref": 2,
                    "set": set_name,
                    "constraints": {"source_entity_id": ["==", eid]},
                }
            },
            {  # connect them
                "AddConnection": {
                    "class": "has_embedding",
                    "src": 1,
                    "dst": 2
                }
            }
        ]
        queries.append((q, []))
    loader = ParallelLoader(client)
    loader.ingest(generator=queries, batchsize=64, numthreads=8, stats=True)

connect_embeddings(client, DESCRIPTOR_SET_NAME, docs_with_embeddings)
Loader output confirms 276 AddConnection commands succeeded - the same number as our documents.
What We Achieved
- 768-dim Gemini embeddings for every entity (≈ 276 vectors).
- HNSW-cosine DescriptorSet in ApertureDB.
- Batch ingestion using ParallelLoader.
- has_embedding edges linking each descriptor to its source entity.
With these pieces in place we can issue hybrid queries:
- Vector similarity → returns a descriptor id.
- Traverse the incoming has_embedding edge → fetch the source entity.
- Graph hops → pull in first-ring neighbors (properties, relationships, provenance).
This enables Graph-aware RAG: answers are seeded by dense semantic recall and enriched with structured graph context, improving grounding and interpretability.
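As a small preview of Part 4, here is a hedged sketch of what a single hybrid query could look like: embed the question, run an ANN search against entity_embeddings_gemini, and hop across has_embedding to the source entities in one transaction. The k_neighbors and is_connected_to parameters follow ApertureDB's query conventions as we understand them, and the helper name is ours; treat this as a sketch to adapt rather than the final retriever.

def semantic_entity_lookup(question, k=5):
    """Sketch: embed a question, find the k nearest descriptors,
    then follow has_embedding back to the source entities."""
    emb = gemini_client.models.embed_content(
        model=GEMINI_MODEL_NAME,
        contents=[question],
        config=types.EmbedContentConfig(
            output_dimensionality=EMBEDDING_DIMENSIONS,
            task_type="RETRIEVAL_QUERY",  # query-side task type
        ),
    ).embeddings[0].values
    query = [
        {"FindDescriptor": {   # ANN search over the set
            "_ref": 1,
            "set": DESCRIPTOR_SET_NAME,
            "k_neighbors": k,
            "results": {"list": ["source_entity_id", "source_entity_class"]},
        }},
        {"FindEntity": {       # hop back to the source entities
            "is_connected_to": {"ref": 1, "connection_class": "has_embedding"},
            "results": {"all_properties": True},
        }},
    ]
    blob = np.array(emb, dtype=np.float32).tobytes()
    resp, _ = client.query(query, [blob])
    return resp

From the matched entities you can then issue further FindConnection / FindEntity hops to assemble the graph context that feeds the LLM.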
Next Up - Part 4: End-to-End GraphRAG Retrieval & Performance Validation
In the next post we will:
- Write a small retriever that:
  - Accepts a natural-language question.
  - Calls the embedding model (embed_query) to vectorize it.
  - Performs an ANN search on entity_embeddings_gemini.
  - Expands results via graph traversal (FindConnection, FindEntity).
- Validate the performance of GraphRAG against vector RAG.
Stay tuned!
References
- Colab Notebooks: Cloud Version and Local (udocker) Version
- Github Repo with all Notebooks
- Google’s Gemini Docs
- ApertureDB Docs
Blog Series
- Automating Knowledge Graph Creation with Gemini and ApertureDB - Part 1
- Automating Knowledge Graph Creation with Gemini and ApertureDB - Part 2
Ayesha Imran | LinkedIn | Github
I am a software engineer passionate about AI/ML, Generative AI, and secure full-stack app development. I’m experienced in RAG and Agentic AI systems, LLMOps, full-stack development, cloud computing, and deploying scalable AI solutions. My ambition lies in building and contributing to innovative projects with real-world impact. Endlessly learning, perpetually building.
Images by Volodymyr Shostakovych, Senior Graphic Designer