Automating Knowledge Graph Creation with Gemini and ApertureDB – Part 3

August 5, 2025
Ayesha Imran

Build the Semantic Layer (Entity Embeddings + Metadata Linking)

Introduction

In the previous parts (Part 1 and Part 2) of this series we used Google’s Gemini to auto-extract class schemas, concrete entities, and their properties from PDF source material, then batch-ingested everything into an ApertureDB instance. That gave us a fully structured symbolic knowledge graph. We then added explicit relationships and visualized the graph. Now we’re ready to turn that graph into a hybrid retrieval substrate by layering semantic vector embeddings directly onto the entities we already stored. This is the critical first step toward GraphRAG: dense vector similarity lets us find semantically related entities; the graph lets us traverse contextually rich neighborhoods around them. Together, they dramatically improve recall, grounding, and explanation in downstream RAG workloads.

ApertureDB is purpose-built for multimodal AI/graph workloads: it stores structured metadata (your entities + relationships) alongside blobs, embeddings, and indexes optimized for similarity search, which is exactly what we need to bind vectors to graph nodes and query across both signals. (ApertureDB Documentation)

In this segment we’ll do three things:

  1. Generate embeddings (Descriptors) for every entity in our existing graph using Gemini embeddings and package them with minimal metadata.
  2. Create an ApertureDB vector index (DescriptorSet) sized to the embedding dimensionality.
  3. Ingest + link: load embeddings at scale using ApertureDB’s ParallelLoader, store source entity metadata on each descriptor, and create has_embedding connections so we can hop from a vector hit back into the graph.

Notebook & Data: As before, the code below is trimmed for readability. The full, executable Colab notebook (with setup cells, error handling, logging, etc.) and the complete GitHub repo are linked. The repo also contains some sample data you can use to run the notebooks. You will find that there are two versions of the notebook:

  • The cloud version involves signing up on ApertureDB Cloud and configuring your instance there.
  • The local udocker version involves setting up the DB instance locally using udocker (which allows running Docker containers in constrained environments like Colab notebooks) - no sign-up needed.

A Note on Scale & Inline Pipelines 

In many real‑world projects you would likely embed while building the graph - ApertureDB’s loaders make that straightforward by letting you stream raw entities, blobs, and their embeddings in the same parallel ingest job, so a million‑node graph with tens of millions of vectors can be built end‑to‑end without a second pass (see docs).

But the opposite scenario is just as common: you might inherit an already‑constructed graph (or decide to experiment with a new embedding model) and only later realize you want to add or refresh vector data. Here too, the ParallelLoader uses exactly the same API: simply add an AddDescriptor entry for every entity row and run the ingest again.
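To make the inline option concrete, here is a minimal sketch of a single (query, blobs) entry you could feed to the ParallelLoader so the entity, its descriptor, and the linking edge are created in one transaction. The class name, property names, and set name are illustrative placeholders, not values taken from our graph.

import numpy as np

def inline_entry(entity, embedding, set_name="entity_embeddings_gemini"):
    """Sketch: add an entity, its descriptor, and the has_embedding
    edge in a single transaction during the initial graph build."""
    q = [
        {"AddEntity": {
            "_ref": 1,
            "class": entity["class"],
            "properties": {"id": entity["id"], "name": entity["name"]},
        }},
        {"AddDescriptor": {
            "_ref": 2,
            "set": set_name,
            "properties": {"source_entity_id": entity["id"]},
        }},
        {"AddConnection": {"class": "has_embedding", "src": 1, "dst": 2}},
    ]
    # The single blob in this transaction is consumed by AddDescriptor.
    blob = [np.array(embedding, dtype=np.float32).tobytes()]
    return (q, blob)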

In this tutorial series we decouple the steps on purpose: first to keep each post focused, and second to highlight ApertureDB’s native support for both graph and vector workloads. Whether you embed inline during graph creation or revisit the graph to try different models, the mechanics remain identical.

What You Should Already Have (Prerequisites)

You should be starting from an ApertureDB instance populated with:

  • Entity classes + properties auto-derived from your document(s).

  • Deduplicated entity instances, each with a unique id property.

  • (Optional but recommended) The original PDF stored as a Blob.

  • Relationships between entities.

If you haven’t completed those steps, pause here and work through Knowledge Graph Automation Part 1 and Part 2 - they automate nearly everything: upload a PDF, extract structure via Gemini, clean it, and batch-ingest into ApertureDB.


GraphRAG Phase 1 Overview: From Entities → Semantic Embeddings

Here’s the high-level flow we’ll implement in this post:

  1. Fetch entities from ApertureDB (all classes, all properties).

  2. Synthesize a text “document” per entity by concatenating its name, class, and property key-values. These become embedding inputs.

  3. Batch embed the documents using Google’s gemini-embedding-001 model (we’ll use 768-dim vectors for a balance of quality vs. storage/search cost).

  4. Create an ApertureDB DescriptorSet (vector index), ingest embeddings at scale, and create has_embedding edges back to source entities so semantic hits are graph-navigable.


Environment Setup

Install required packages (ApertureDB client, Google GenAI SDK, utilities) and pull credentials from your Colab secrets store (or set up udocker and Google Drive if you’re not using ApertureDB Cloud). 

%pip install -q aperturedb google-genai


import os, json, time
import numpy as np
from typing import Any, Dict, List


from google.colab import userdata  # or your own secrets loader
from aperturedb.CommonLibrary import create_connector
from aperturedb.ParallelLoader import ParallelLoader  # used later
from google import genai
from google.genai import types


# Credentials (stored in Colab "secrets" / userdata)
google_api_key = userdata.get("GOOGLE_API_KEY")
db_key = userdata.get("APERTUREDB_KEY")  # ApertureDB connection key; the secret name is illustrative, match your setup


# Connect to ApertureDB
client = create_connector(key=db_key)

Preparing Entity Data for Embedding

Why synthesize entity documents? Embedding models expect text. Your graph entities are structured key-value rows; we need a reliable textual serialization that captures discriminative properties (name, class, key attributes) so semantically similar entities land near each other in vector space. Including the class label often helps cluster related concepts and improves downstream filtering/re-ranking.

Fetch all entities from ApertureDB

We first read the schema to discover valid (non-internal) classes, then issue a FindEntity query per class requesting all properties. Internal classes (like _Blob) are skipped to avoid errors and noise.

def fetch_entities(client):
    """Fetch all user-defined entities (skip internal classes)."""
    # 1. Get schema
    schema_query = [{"GetSchema": {}}]
    schema_response, _ = client.query(schema_query)


    if (not schema_response
        or "entities" not in schema_response[0]["GetSchema"]):
        print("Error: Could not retrieve schema.")
        return []


    all_class_names = schema_response[0]["GetSchema"]["entities"]["classes"].keys()
    valid_class_names = [c for c in all_class_names if not c.startswith("_")]


    all_entities = []
    for class_name in valid_class_names:
        q = [{
            "FindEntity": {
                "with_class": class_name,
                "results": {"all_properties": True}
            }
        }]
        resp, _ = client.query(q)
        ents = resp[0].get("FindEntity", {}).get("entities", [])
        for e in ents:
            e["class"] = class_name  # convenience
        all_entities.extend(ents)


    print(f"Fetched {len(all_entities)} total entities.")
    return all_entities

Serialize entities into embedding documents

We build a minimal but information-rich string: Entity: <name>. Class: <class>. key1: val1. key2: val2. You can enrich or curate which fields to include; just keep the format consistent so embeddings remain comparable. Here we include every entity property that was extracted by the LLM during knowledge graph creation.

def create_entity_documents(entities):
    """Return [{'entity_id', 'class', 'document'}, ...]"""
    docs = []
    for e in entities:
        parts = [
            f"Entity: {e.get('name', '')}.",
            f"Class: {e.get('class', 'N/A')}."
        ]
        for k, v in e.items():
            if k in ("_uniqueid", "name", "class"): 
                continue
            if v is not None:
                parts.append(f"{k}: {v}.")
        docs.append({
            "entity_id": e.get("id"),
            "class": e.get("class"),
            "document": " ".join(parts)
        })
    print(f"Created {len(docs)} documents for embedding.")
    return docs

Run it.

all_entities = fetch_entities(client)
docs_to_embed = create_entity_documents(all_entities) if all_entities else []
print(docs_to_embed[0])

Sample (truncated):

{
  "entity_id": 208,
  "class": "Application Architecture",
  "document": "Entity: Cloud-Native Applications (CNA). Class: Application Architecture. characteristic: Highly Scalable..."
}


…and:
{
  "entity_id": 209,
  "class": "Application Architecture",
  "document": "Entity: Microservices Architecture. Class: Application Architecture. characteristic: Modular Approach..."
}

Including both the entity name and class in the serialized text tends to improve semantic grouping and downstream filtering when performing hybrid vector+metadata queries. 
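That metadata pays off at query time: because we will store source_entity_class on every descriptor later in this post, a nearest-neighbor search can be constrained to a single class. A hedged sketch follows (runnable only after the DescriptorSet and descriptors below exist; query_vector stands in for a 768-dim query embedding):

query_blob = np.array(query_vector, dtype=np.float32).tobytes()

q = [{
    "FindDescriptor": {
        "set": "entity_embeddings_gemini",
        "k_neighbors": 5,
        # Metadata filter: only consider descriptors from one entity class
        "constraints": {"source_entity_class": ["==", "Application Architecture"]},
        "results": {"list": ["source_entity_id", "source_entity_class"]},
    }
}]
resp, _ = client.query(q, [query_blob])
print(resp[0]["FindDescriptor"].get("entities", []))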

Generating Vector Embeddings with Gemini

We’ll embed each synthesized document using Google’s gemini-embedding-001 model. Gemini supports configurable output dimensionality; here we choose 768 dimensions to balance semantic fidelity with storage footprint and ANN search cost. You can scale up or down depending on retrieval precision needs and budget.

Throughput Note: Free-tier rate limits apply; batching + polite sleeps help avoid throttling. Adjust batch_size and time.sleep() to match your quota. 

GEMINI_MODEL_NAME      = "gemini-embedding-001"
EMBEDDING_DIMENSIONS   = 768


gemini_client = genai.Client(api_key=google_api_key)


def generate_embeddings_with_gemini(docs, 
                                    batch_size=96, 
                                    model_name=GEMINI_MODEL_NAME, 
                                    output_dim=EMBEDDING_DIMENSIONS,
                                    rpm_delay=60):
    """Add 'embedding' (list[float]) to each doc in-place."""
    texts = [d["document"] for d in docs]


    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start+batch_size]


        embed_cfg = types.EmbedContentConfig(
            output_dimensionality=output_dim,
            task_type="RETRIEVAL_DOCUMENT",  # Recommended for corpus items
        )


        result = gemini_client.models.embed_content(
            model=model_name,
            contents=batch_texts,
            config=embed_cfg,
        )


        batch_embs = [e.values for e in result.embeddings]
        if len(batch_embs) != len(batch_texts):
            print("Embedding count mismatch; skipping batch.")
            continue


        for j, emb in enumerate(batch_embs):
            docs[start + j]["embedding"] = emb


        # simple rate-limit guard
        time.sleep(rpm_delay)


    docs_with_embs = [d for d in docs if "embedding" in d]
    print(f"Generated embeddings for {len(docs_with_embs)} docs.")
    return docs_with_embs

Run:

docs_with_embeddings = generate_embeddings_with_gemini(docs_to_embed)
print(len(docs_with_embeddings), "embedded docs")
print("Dim:", len(docs_with_embeddings[0]["embedding"]))
print("Head:", docs_with_embeddings[0]["embedding"][:5])

Example output:

Generated embeddings for 276 docs.
276 embedded docs
Dim: 768
Head: [-0.00747136  0.01111416 -0.00263643 -0.0615893  -0.00403898]

At this point we have an in-memory list of dicts: each entry carries (entity_id, class, document, embedding[]). Next we’ll create an ApertureDB DescriptorSet configured for 768-dim vectors (Cosine, HNSW), ingest the vectors in parallel, and link each descriptor back to its source entity, unlocking fast semantic recall with graph-native joins. 

Creating the Vector Index (DescriptorSet)

ApertureDB stores vectors inside DescriptorSets.
When you create a set, you lock in the vector dimensionality; the distance metric and search engine are also declared at creation time, and a set can register more than one of each if needed.

For GraphRAG, the dimensions are set to 768 to match the output from Gemini. The metric used is cosine similarity (CS), which is well-suited for dense semantic embeddings. The engine chosen is HNSW because it provides fast and memory-efficient approximate nearest neighbor (ANN) search.

ApertureDB lets you mix-and-match metrics/engines and even store several per set, but a single HNSW-cosine index is plenty for our use-case. (ApertureDB Documentation)

DESCRIPTOR_SET_NAME = "entity_embeddings_gemini"


def create_vector_index(client, set_name, dims):
    """Create a DescriptorSet (vector index)."""
    add = [{
        "AddDescriptorSet": {
            "name": set_name,
            "dimensions": dims,
            "metric": "CS",   # cosine
            "engine": "HNSW"  # hierarchical NSW
        }
    }]
    resp, _ = client.query(add)
    assert resp[0]["AddDescriptorSet"]["status"] == 0
    print(f"Created DescriptorSet '{set_name}'.")


create_vector_index(client, DESCRIPTOR_SET_NAME, EMBEDDING_DIMENSIONS)


Ingesting Embeddings at Scale

The ParallelLoader helper takes an iterator of (query, blobs) tuples, batches them, and writes to the database with a configurable thread-pool. This drives ingestion throughput into the hundreds of descriptors per second on modest hardware.

def ingest_embeddings(client, set_name, docs):
    """Convert each embedding to bytes and upload as Descriptor."""
    data = []
    for d in docs:
        q = [{
            "AddDescriptor": {
                "set": set_name,
                "properties": {
                    # Store source metadata *on* the descriptor
                    "source_entity_id": d["entity_id"],
                    "source_entity_class": d["class"],
                }
            }
        }]
        blob = [np.array(d["embedding"], dtype=np.float32).tobytes()]
        data.append((q, blob))


    loader = ParallelLoader(client)
    loader.ingest(generator=data, batchsize=64, numthreads=8, stats=True)


ingest_embeddings(client, DESCRIPTOR_SET_NAME, docs_with_embeddings)

Each element corresponds to one AddDescriptor call; we now have 276 vectors sitting in entity_embeddings_gemini.
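A quick sanity check is to ask the set how many descriptors it now holds; a small sketch using a count-only FindDescriptor (the number will reflect your own data):

q = [{
    "FindDescriptor": {
        "set": DESCRIPTOR_SET_NAME,
        "results": {"count": True}
    }
}]
resp, _ = client.query(q)
print("Descriptors in set:", resp[0]["FindDescriptor"]["count"])  # 276 for our sample data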

Linking Descriptors Back to the Graph (has_embedding edges)

Vector search alone is not GraphRAG.
We need an explicit edge that lets us jump from a semantic hit → the originating entity → that entity’s neighbors. The pattern we’ll follow is: (Entity)-[:has_embedding]->(Descriptor)

def connect_embeddings(client, set_name, docs):
    """Create (Entity)<-[:has_embedding]-(Descriptor) edges."""
    queries = []
    for d in docs:
        eid, eclass = d["entity_id"], d["class"]


        q = [
            {  # find the entity node
              "FindEntity": {
                  "_ref": 1,
                  "with_class": eclass,
                  "constraints": {"id": ["==", eid]},
              }
            },
            {  # find its descriptor (by metadata)
              "FindDescriptor": {
                  "_ref": 2,
                  "set": set_name,
                  "constraints": {"source_entity_id": ["==", eid]},
              }
            },
            {  # connect them
              "AddConnection": {
                  "class": "has_embedding",
                  "src": 1,
                  "dst": 2
              }
            }
        ]
        queries.append((q, []))


    loader = ParallelLoader(client)
    loader.ingest(generator=queries, batchsize=64, numthreads=8, stats=True)


connect_embeddings(client, DESCRIPTOR_SET_NAME, docs_with_embeddings)

Loader output confirms 276 AddConnection commands succeeded - the same number as our documents.
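To spot-check a single link, we can start from one entity and follow the new edge to its descriptor. A minimal sketch, reusing the sample entity with id 208 shown earlier:

q = [
    {"FindEntity": {
        "_ref": 1,
        "with_class": "Application Architecture",
        "constraints": {"id": ["==", 208]},      # sample entity from earlier
    }},
    {"FindDescriptor": {
        "set": DESCRIPTOR_SET_NAME,
        "is_connected_to": {"ref": 1, "connection_class": "has_embedding"},
        "results": {"list": ["source_entity_id", "source_entity_class"]},
    }},
]
resp, _ = client.query(q)
print(resp[1]["FindDescriptor"].get("entities", []))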

What We Achieved

  1. 768-dim Gemini embeddings for every entity (≈ 276 vectors).

  2. HNSW-cosine DescriptorSet in ApertureDB.

  3. Batch ingestion using ParallelLoader.

  4. has_embedding edges linking each descriptor to its source entity.

With these pieces in place we can issue hybrid queries:

  1. Vector similarity → returns a descriptor id.

  2. Traverse the has_embedding edge back from the descriptor → fetch the source entity.

  3. Graph hops → pull in first-ring neighbors (properties, relationships, provenance).

This enables Graph-aware RAG: answers are seeded by dense semantic recall and enriched with structured graph context, improving grounding and interpretability.
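As a small preview of the retriever we’ll build in Part 4, here is a hedged sketch of that flow end-to-end: embed the question (queries use the RETRIEVAL_QUERY task type), run an ANN search, and hop across has_embedding back to the source entities. It reuses the client, model, and constants defined above; the full retriever will go on to expand each hit’s graph neighborhood.

def semantic_entity_search(question, k=5):
    """Sketch: question -> vector hit -> has_embedding hop -> entity."""
    cfg = types.EmbedContentConfig(
        output_dimensionality=EMBEDDING_DIMENSIONS,
        task_type="RETRIEVAL_QUERY",  # queries, not corpus documents
    )
    result = gemini_client.models.embed_content(
        model=GEMINI_MODEL_NAME, contents=[question], config=cfg)
    query_blob = np.array(result.embeddings[0].values, dtype=np.float32).tobytes()

    q = [
        {"FindDescriptor": {          # 1. vector similarity
            "_ref": 1,
            "set": DESCRIPTOR_SET_NAME,
            "k_neighbors": k,
            "results": {"list": ["source_entity_id", "source_entity_class"]},
        }},
        {"FindEntity": {              # 2. hop back into the graph
            "is_connected_to": {"ref": 1, "connection_class": "has_embedding"},
            "results": {"all_properties": True},
        }},
    ]
    resp, _ = client.query(q, [query_blob])
    return resp[1]["FindEntity"].get("entities", [])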

Next Up - Part 4: End-to-End GraphRAG Retrieval & Performance Validation

In the next post we will:

  1. Write a small retriever that:

    • Accepts a natural-language question.

    • Calls the embedding model (embed_query) to vectorize it.

    • Performs an ANN search on entity_embeddings_gemini.

    • Expands results via graph traversal (FindConnection, FindEntity).

  2. Validate the performance of GraphRAG against plain vector RAG.

Stay tuned!

References

Blog Series

Ayesha Imran | LinkedIn | GitHub

I am a software engineer passionate about AI/ML, Generative AI, and secure full-stack app development. I’m experienced in RAG and Agentic AI systems, LLMOps, full-stack development, cloud computing, and deploying scalable AI solutions. My ambition lies in building and contributing to innovative projects with real-world impact. Endlessly learning, perpetually building.

Images by Volodymyr Shostakovych, Senior Graphic Designer
