Introduction
Knowledge graphs are powerful tools for organizing data, connecting entities through explicit, meaningful relationships. They're especially helpful for making Retrieval-Augmented Generation (RAG) systems smarter, grounding responses in accurate, context-rich structure. Companies combining knowledge graphs with RAG have reported noticeable improvements, including nearly 30% faster customer issue resolution (https://arxiv.org/abs/2404.17723) and better transparency into how AI models generate answers.
In this tutorial, we'll combine ApertureDB, a robust multimodal database, with Google's advanced Gemini 2.5 Flash model to efficiently extract structured information and build a comprehensive knowledge graph. This integration utilizes the contextual understanding and scalability of Gemini, coupled with the versatile and rich-media capabilities of ApertureDB, making it ideal for creating powerful, intelligent systems.
The tutorial is divided into two parts. This first part covers extracting entities and storing them in ApertureDB. In the second part, we'll extract relationships between entities, define them in ApertureDB, and visualize the resulting knowledge graph by fetching entities and their relationships from the database.
Please note that the code snippets in this blog have been shortened for brevity. You can find the complete code in the Colab Notebook.
Practical Use Cases
The knowledge graph created with this approach isn't just theoretical; it forms a practical backbone for real-world systems. Because entities and their connections are explicitly organized, the graph constructed through this workflow has broad applications across multiple domains:
- Enhanced Information Retrieval: Query entities and their relationships with precision, enabling structured semantic search over large textual corpora.
- Customer Support Systems: Automatically extract and link entities such as products, services, issues, and solutions from support documents, forming the backbone of intelligent FAQ systems.
- Educational Tools: Organize and visualize learning materials by topics, concepts, and their interdependencies—ideal for building interactive curricula or study guides.
- Data Integration: Merge semi-structured data from disparate sources into a unified graph, simplifying analysis and downstream reasoning tasks.
Understanding the Components
To build our knowledge graph, we'll use:
- ApertureDB: A specialized multimodal database designed to handle structured metadata along with text, embeddings, images, videos, and other rich media. It excels in rapidly querying and managing complex inter-entity relationships.
- Gemini 2.5 Flash: Google's cutting-edge large language model (LLM) provides extensive contextual memory and speedy response times, ideal for detailed content extraction and analysis from lengthy documents.
- LangChain: A workflow orchestration framework whose RunnableLambda .batch() functionality enables parallel execution of tasks, speeding up our entire pipeline significantly.
Workflow Overview
Our knowledge graph creation follows a structured workflow:

1. Entity Class Schema Extraction: Identify general classes and their properties using Gemini.
2. Entity Instance Extraction: Extract specific instances of these classes in parallel.
3. Deduplication and ID Assignment: Clean up duplicates and assign a unique ID to each entity.
4. Entity Insertion into ApertureDB: Insert entities into the ApertureDB instance along with all their properties.
5. Relationship Extraction: Define explicit relationships between entities using Gemini.
6. Knowledge Graph Creation in ApertureDB: Create connections between the entities in ApertureDB.
7. Visualization: Interactively visualize the constructed knowledge graph using PyVis.
In this blog, we'll cover steps 1-4, providing detailed insights and practical code snippets to guide you through the creation of your knowledge graph.
Environment Setup and Dependency Installation
Getting started is simple with Google Colab. Just install the required libraries (langchain, pydantic, aperturedb, etc.) and set up your API keys. The Colab notebook handles all imports and initializes the Gemini LLM and ApertureDB client for you. Full setup instructions and code are included in the notebook.
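For orientation, here's a minimal sketch of what that setup amounts to. Package names and the placeholder values are assumptions; follow the notebook for the exact cells:
import os
from langchain_google_genai import ChatGoogleGenerativeAI

# !pip install -q langchain langchain-google-genai aperturedb pypdf pydantic

# Placeholders only; set these to your actual credentials
os.environ["GOOGLE_API_KEY"] = "YOUR_GEMINI_API_KEY"
db_host = "YOUR_APERTUREDB_HOST"
db_password = "YOUR_APERTUREDB_PASSWORD"

# Initialize the Gemini LLM used throughout the pipeline
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")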
Document Preparation and Loading
We load and parse the PDF using LangChain's PyPDFLoader, which extracts the text content for processing. For this tutorial, we'll use a practical example: a Cloud Computing lecture notes PDF from my own Cloud Computing course notes (because why not kill two birds with one stone?). The PDF runs to 42 (!) pages, which makes it a good showcase for the efficiency of this knowledge graph creation approach.
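Loading is a few lines with PyPDFLoader; a sketch of what the notebook does, assuming the joined page text is stored as document_content:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/Cloud Computing Lecture Notes.pdf")
pages = loader.load()  # one Document per PDF page
document_content = "\n".join(page.page_content for page in pages)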
Connect to ApertureDB & Optionally Store Full Document as a Blob
Now we’ll connect to our ApertureDB instance.
from aperturedb import Connector

def connect_db():
    """Create an ApertureDB connection."""
    try:
        client = Connector.Connector(
            host=db_host,        # set in the setup cell
            user="admin",
            password=db_password
        )
        print("Connected to ApertureDB")
        return client
    except Exception as e:
        print(f"Database connection failed: {e}")
        raise

client = connect_db()
This step is optional, but you can store the entire PDF in the DB instance as a blob and later connect it to its extracted entities. That way, multiple related documents can refer to the same entities, and you can always get back to the original content when needed (for validation or for more detail).
import os
import time

def upload_pdf_to_db(pdf_path, client):
    """Upload a PDF to ApertureDB as a blob with metadata."""
    # Extract the file name from the path
    file_name = os.path.basename(pdf_path)
    # Read the PDF as binary data
    with open(pdf_path, 'rb') as f:
        pdf_data = f.read()
    # Create the upload query
    query = [{
        "AddBlob": {
            "properties": {
                "id": 0,
                "type": "pdf",
                "name": file_name,  # dynamically set file name
                "description": "Source document for knowledge graph creation",
                "upload_timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
            }
        }
    }]
    # Upload with the binary data attached
    response, _ = client.query(query, [pdf_data])
    if response[0]["AddBlob"]["status"] == 0:
        print(f"PDF '{file_name}' uploaded to ApertureDB")
    else:
        print(f"PDF '{file_name}' upload failed")
    return response

upload_response = upload_pdf_to_db("/content/Cloud Computing Lecture Notes.pdf", client)
Step 1: Extracting Entity Class Schemas
To build an effective knowledge graph, the first step is identifying high-level entity types (or classes) and their potential properties. This foundational step leverages the contextual understanding capabilities of Google's Gemini 2.5 Flash model to interpret raw text and extract structured information.
We define a clear schema using Pydantic, ensuring the output from Gemini is consistent and correctly formatted:
from typing import Dict, List
from pydantic import BaseModel, Field

class ClassSchema(BaseModel):
    classes: Dict[str, List[str]] = Field(
        description="Dictionary mapping class types to their possible properties"
    )
The following prompt guides Gemini to extract generic class-property mappings without naming specific entities. We first build a JSON parser from the schema, since the prompt embeds its format instructions:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

parser = JsonOutputParser(pydantic_object=ClassSchema)

prompt = PromptTemplate(
    template="""
You are the first agent in a multi-step workflow to build a Knowledge Graph.
YOUR TASK:
- Identify general class types (e.g., Person, Company, Location).
- List possible properties for each class type.
- Keep output general; avoid specifics and examples.
FORMAT:
{{
"ClassType": ["property1", "property2"]
}}
Text: {input}
{format_instructions}
Response:
""",
    input_variables=["input"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
We chain the prompt with Gemini and the parser:
chain = prompt | llm | parser
def extract_class_schema(text):
    def attempt():
        result = chain.invoke({"input": text})
        return result if isinstance(result, dict) else {"classes": result}
    # retry_llm_call is a small retry helper defined in the notebook
    return retry_llm_call(attempt)
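The retry_llm_call helper lives in the notebook; a minimal sketch of the assumed behavior (retry with a short backoff) looks like this:
import time

def retry_llm_call(fn, max_retries=3, delay=2):
    """Retry a flaky LLM call a few times before giving up
    (sketch; the notebook's implementation may differ)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay * (attempt + 1))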
Run the extraction and preview results:
classes = extract_class_schema(document_content)
entity_classes = classes.get("classes", {})
Here's part of the output from this step, showing the first class with its possible properties:
{
  "classes": {
    "Computing System": [
      "definition",
      "purpose",
      "characteristics",
      "components",
      "use_cases",
      "limitations",
      "speed"
    ], ...
This initial extraction step sets the foundation for robust knowledge graph construction by clearly defining what information we seek from our data.
Step 2: Extracting Specific Entities
With the entity classes defined, the next step is to extract specific instances of these classes from the document. To manage this effectively, particularly with large documents, we split the text into manageable chunks and process them in parallel using Gemini and LangChain.
Splitting the Document
First, we use LangChain's RecursiveCharacterTextSplitter to divide large texts into smaller, overlapping chunks, balancing efficiency with contextual clarity. I used a chunk size of 5000 characters with a 500-character overlap, which works well with Gemini's long context window.
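The split_text helper used later isn't shown in the post; a minimal sketch, assuming LangChain's standard splitter API:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_text(text, chunk_size=5000, chunk_overlap=500):
    """Split raw text into overlapping LangChain Documents."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return splitter.create_documents([text])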
Again, we define a clear output format using Pydantic to ensure consistent and accurate output from Gemini:
# Define Pydantic models for entity extraction
from typing import Any

class ClassEntities(BaseModel):
    """Entities of a specific class type."""
    Class: str = Field(description="The class type name")
    Entities: Dict[str, Dict[str, Any]] = Field(
        description="Dictionary mapping entity names to their properties"
    )

class ChunkExtractionResult(BaseModel):
    """Result of entity extraction from a single chunk."""
    classes: List[ClassEntities] = Field(
        description="List of class types and their entities found in this chunk"
    )
Now we extract specific entities and their properties from the text in parallel using RunnableLambda.batch(), which is ideal for large documents.
Prompt for Entity Extraction
# Build a parser for the chunk-level schema; the prompt embeds its format instructions
parser = JsonOutputParser(pydantic_object=ChunkExtractionResult)

prompt = PromptTemplate(
    template="""
You are part of a multi-step workflow to build a Knowledge Graph from raw text.
Workflow Steps:
1. Extract class types and properties. [DONE]
2. Extract entity instances from chunks. [CURRENT]
3. Deduplicate entities, assign IDs.
4. Identify relationships.
5. Create the graph.
YOUR TASK:
- Extract concrete entities of known class types from this chunk.
- Only include properties explicitly mentioned.
- Omit properties not found; no nulls or guesses.
INPUT TEXT CHUNK:
{chunk}
CLASS TYPES AND THEIR PROPERTIES:
{class_schema}
FORMAT:
{{
"classes": [
{{
"Class": "ClassName",
"Entities": {{
"EntityName": {{"property1": "value", ...}}
}}
}}
]
}}
{format_instructions}
Begin your extraction now:
""",
    input_variables=["chunk", "class_schema"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
Parallel Execution
import json
from langchain_core.runnables import RunnableLambda

# Chain the prompt, model, and parser (same pattern as Step 1)
chain = prompt | llm | parser

def extract_entities_from_chunk(chunk_data):
    chunk_text, classes = chunk_data
    return retry_llm_call(lambda: chain.invoke({
        "chunk": chunk_text,
        "class_schema": json.dumps(classes, indent=2)
    }))

processor = RunnableLambda(extract_entities_from_chunk)
chunk_data = [(chunk.page_content, entity_classes) for chunk in split_text(document_content)]
results = processor.batch(chunk_data, config={"max_concurrency": 6})
extracted_entities = merge_entity_results(results)
save_json(extracted_entities, "step2_output.json")
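Both merge_entity_results and save_json are small helpers from the notebook. A simplified sketch of the merging logic, assuming it just flattens the per-chunk results and leaves duplicate handling to Step 3:
def merge_entity_results(results):
    """Flatten per-chunk results into one list of class objects
    (sketch; the notebook's version may differ).
    Duplicates across chunks survive until Step 3."""
    merged = []
    for result in results:
        merged.extend(result.get("classes", []))
    return merged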
Here’s a part of the output from the entity extraction step:
[
  {
    "Class": "Computing System",
    "Entities": {
      "Distributed computing": {
        "definition": "a system where computing resources are distributed across multiple locations rather than being centralized in a single system.",
        "purpose": "enables task distribution and efficient resource utilization.",
        "efficiency": "efficient resource utilization",
        "scalability": "allow for hardware scaling"
      },
      "Traditional computing": {
        "limitations": "faces bottlenecks due to hardware limitations"
      }, ...
Step 3: Entity Deduplication and ID Assignment
Deduplication is critical for a robust and reliable knowledge graph. Because the document was processed in chunks, the same entity may appear in multiple extraction results if it was mentioned in several places in the original text; the smaller the chunk size, the more duplicates you can expect. Duplicate entities introduce inconsistencies, confusion, and inaccuracies. By removing duplicates and assigning each entity a clear, unique identifier (ID), we maintain the integrity and usability of the graph. Simple incremental IDs are sufficient here and guarantee uniqueness; UUIDs could be used instead if global uniqueness were required. These IDs matter because we'll use them later to make connections between related entities in the database. Note that this step involves no LLM calls, just plain Python.
Here's how we efficiently deduplicate entities and assign unique IDs:
def deduplicate_entities(entities_data):
    """Deduplicate entities and assign unique integer IDs."""
    deduplicated = []
    id_counter = 1
    total_before = 0
    total_after = 0
    for class_obj in entities_data:
        class_type = class_obj["Class"]
        entities = class_obj["Entities"]
        total_before += len(entities)
        # Deduplicate by entity name
        unique_entities = {}
        for entity_name, props in entities.items():
            if entity_name in unique_entities:
                # Merge properties from the duplicate
                unique_entities[entity_name].update(props)
            else:
                unique_entities[entity_name] = props.copy()
        # Assign IDs to the unique entities
        for entity_name, props in unique_entities.items():
            props["id"] = id_counter
            id_counter += 1
        deduplicated.append({
            "Class": class_type,
            "Entities": unique_entities
        })
        total_after += len(unique_entities)
        print(f" • {class_type}: {len(entities)} → {len(unique_entities)} entities")
    duplicates_removed = total_before - total_after
    print(f"\nRemoved {duplicates_removed} duplicates, assigned IDs 1-{id_counter-1}")
    return deduplicated
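Running the function over the merged extraction results produces the deduplicated_entities list that Step 4 inserts into the database:
deduplicated_entities = deduplicate_entities(extracted_entities)
save_json(deduplicated_entities, "step3_output.json")  # notebook helper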
Here’s a part of the result of the deduplication and ID-assignment step:
[
  {
    "Class": "Computing System",
    "Entities": {
      "Distributed computing": {
        "definition": "a system where computing resources are distributed across multiple locations rather than being centralized in a single system.",
        "purpose": "enables task distribution and efficient resource utilization.",
        "efficiency": "efficient resource utilization",
        "scalability": "allow for hardware scaling",
        "id": 1
      },
      "Traditional computing": {
        "limitations": "faces bottlenecks due to hardware limitations",
        "id": 2
      }, ...
Step 4: Populating ApertureDB with Entities
ApertureDB is optimized for storing structured data alongside rich media, making it ideal for building knowledge graphs. We'll batch-insert the entities, leveraging parallel execution to handle large datasets efficiently (relationships get the same treatment in part two).
Batch Insertion of Entities Using ApertureDB's ParallelLoader
ApertureDB's ParallelLoader delivers fast ingestion, especially with large numbers of entities:
from aperturedb.ParallelLoader import ParallelLoader

def create_entity_indexes(client, entities):
    # Index the "id" property of each entity class for fast lookups
    queries = [{
        "AddIndex": {
            "class": obj["Class"],
            "property_key": "id",
            "metric": "L2"
        }
    } for obj in entities]
    client.query(queries)

def insert_entities_parallel_loader(client, entities):
    create_entity_indexes(client, entities)
    data = []
    for obj in entities:
        cls, instances = obj["Class"], obj["Entities"]
        for name, props in instances.items():
            # Stringify nested values so every property is a flat scalar
            props_str = {
                k: (json.dumps(v) if isinstance(v, dict) else ", ".join(v) if isinstance(v, list) else str(v))
                for k, v in props.items() if k != "id"
            }
            query = [{
                "AddEntity": {
                    "class": cls,
                    "properties": {
                        "name": name,
                        "id": props["id"],
                        **props_str
                    }
                }
            }]
            data.append((query, []))
    ParallelLoader(client).ingest(generator=data, batchsize=50, numthreads=4, stats=True)
    print(f"Inserted {len(data)} entities.")

# Run the insertion
insert_entities_parallel_loader(client, deduplicated_entities)
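As a quick sanity check from code, a FindEntity query (sketched here for one of the classes from our earlier output) returns a count of the inserted entities:
# Count inserted entities of one class
query = [{
    "FindEntity": {
        "with_class": "Computing System",
        "results": {"count": True}
    }
}]
response, _ = client.query(query)
print(response)  # e.g. [{"FindEntity": {"count": ..., "status": 0}}]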
We can check out the inserted entities in our ApertureDB dashboard:

Conclusion
In this first part, we've successfully laid the groundwork by extracting, deduplicating, and storing structured entities from documents into ApertureDB using Google's Gemini 2.5 Flash model. This sets the stage for relationship extraction and visualization, which we'll tackle in the next part of this series, bringing our knowledge graph fully to life.
Appendix
For your reference and further exploration:
- Complete Colab Notebook
- ApertureDB documentation
- GraphRAG with ApertureDB
- Google’s Gemini documentation
👉 ApertureDB is available on Google Cloud. Subscribe Now
Ayesha Imran | LinkedIn | Github
I am a software engineer passionate about AI/ML, Generative AI, and secure full-stack app development. I’m experienced in RAG and Agentic AI systems, LLMOps, full-stack development, cloud computing, and deploying scalable AI solutions. My ambition lies in building and contributing to innovative projects with real-world impact. Endlessly learning, perpetually building.