Introduction
Knowledge graphs are powerful tools for organizing data, connecting entities through explicit, meaningful relationships. They're especially helpful for making Retrieval-Augmented Generation (RAG) systems smarter, grounding responses in accurate, context-rich structure. Companies combining knowledge graphs with RAG have reported noticeable improvements, including nearly 30% faster customer issue resolution (https://arxiv.org/abs/2404.17723) and better transparency into how AI models generate answers.
In this tutorial, we'll combine ApertureDB, a robust multimodal database, with Google's advanced Gemini 2.5 Flash model to efficiently extract structured information and build a comprehensive knowledge graph. This integration utilizes the contextual understanding and scalability of Gemini, coupled with the versatile and rich-media capabilities of ApertureDB, making it ideal for creating powerful, intelligent systems.
The tutorial is divided into two parts. This first part covers extracting entities and storing them in ApertureDB. In the second part, we'll extract relationships between entities, define them in ApertureDB, and visualize the resulting knowledge graph by fetching entities and their relationships from the database.
Please note that the code snippets in this blog have been shortened for brevity. You can find the complete code in the Colab Notebook.
Practical Use Cases
The knowledge graph created with this approach isn't just theoretical; it forms a practical backbone for real-world systems. Because entities and their connections are explicitly organized, the graph constructed through this workflow has broad applications across multiple domains:
- Enhanced Information Retrieval: Query entities and their relationships with precision, enabling structured semantic search over large textual corpora.
- Customer Support Systems: Automatically extract and link entities such as products, services, issues, and solutions from support documents, forming the backbone of intelligent FAQ systems.
- Educational Tools: Organize and visualize learning materials by topics, concepts, and their interdependencies—ideal for building interactive curricula or study guides.
- Data Integration: Merge semi-structured data from disparate sources into a unified graph, simplifying analysis and downstream reasoning tasks.
Understanding the Components
To build our knowledge graph, we'll use:
- ApertureDB: A specialized multimodal database designed to handle structured metadata along with text, embeddings, images, videos, and other rich media. It excels in rapidly querying and managing complex inter-entity relationships.
- Gemini 2.5 Flash: Google's cutting-edge large language model (LLM) provides extensive contextual memory and speedy response times, ideal for detailed content extraction and analysis from lengthy documents.
- LangChain: A workflow orchestration framework whose RunnableLambda .batch() functionality enables parallel execution of tasks, speeding up our entire pipeline significantly.
Workflow Overview
Our knowledge graph creation follows a structured workflow:

1. Entity Class Schema Extraction: Identify general classes and their properties using Gemini.
2. Entity Instance Extraction: Extract specific instances of these classes in parallel.
3. Deduplication and ID Assignment: Clean up duplicates and assign a unique ID to each entity.
4. Entity Insertion into ApertureDB: Insert entities into the ApertureDB instance along with all their properties.
5. Relationship Extraction: Define explicit relationships between entities using Gemini.
6. Knowledge Graph Creation in ApertureDB: Create connections between the entities in ApertureDB.
7. Visualization: Interactively visualize the constructed knowledge graph using PyVis.
In this blog, we'll cover steps 1-4, providing detailed insights and practical code snippets to guide you through the creation of your knowledge graph.
Environment Setup and Dependency Installation
Getting started is simple with Google Colab. Just install the required libraries (langchain, pydantic, aperturedb, etc.) and set up your API keys. The Colab notebook handles all imports and initializes the Gemini LLM and ApertureDB client for you. Full setup instructions and code are included in the notebook.
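For orientation, here's a minimal sketch of what that setup amounts to. Package names and the placeholder values are assumptions; follow the notebook for the exact cells:
import os
from langchain_google_genai import ChatGoogleGenerativeAI

# !pip install -q langchain langchain-google-genai aperturedb pypdf pydantic

# Placeholders only; set these to your actual credentials
os.environ["GOOGLE_API_KEY"] = "YOUR_GEMINI_API_KEY"
db_host = "YOUR_APERTUREDB_HOST"
db_password = "YOUR_APERTUREDB_PASSWORD"

# Initialize the Gemini LLM used throughout the pipeline
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")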
Document Preparation and Loading
We load and parse the PDF using LangChain's PyPDFLoader, which extracts the text content for processing. For this tutorial, we'll use a practical example: a Cloud Computing lecture notes PDF from my own Cloud Computing course notes (because why not kill two birds with one stone?). The PDF runs to 42 (!) pages, which makes it a good showcase for the efficiency of this knowledge graph creation approach.
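Loading is a few lines with PyPDFLoader; a sketch of what the notebook does, assuming the joined page text is stored as document_content:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/Cloud Computing Lecture Notes.pdf")
pages = loader.load()  # one Document per PDF page
document_content = "\n".join(page.page_content for page in pages)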
Connect to ApertureDB & Optionally Store Full Document as a Blob
Now we’ll connect to our ApertureDB instance.
from aperturedb import Connector

def connect_db():
    """Create an ApertureDB connection."""
    try:
        client = Connector.Connector(
            host=db_host,        # set in the setup cell
            user="admin",
            password=db_password
        )
        print("Connected to ApertureDB")
        return client
    except Exception as e:
        print(f"Database connection failed: {e}")
        raise

client = connect_db()
This step is optional, but you can store the entire PDF in the DB instance as a blob and later connect it to its extracted entities. That way, multiple related documents can refer to the same entities, and you can always get back to the original content when needed (for validation or for more detail).
import os
import time

def upload_pdf_to_db(pdf_path, client):
    """Upload a PDF to ApertureDB as a blob with metadata."""
    # Extract the file name from the path
    file_name = os.path.basename(pdf_path)
    # Read the PDF as binary data
    with open(pdf_path, 'rb') as f:
        pdf_data = f.read()
    # Create the upload query
    query = [{
        "AddBlob": {
            "properties": {
                "id": 0,
                "type": "pdf",
                "name": file_name,  # dynamically set file name
                "description": "Source document for knowledge graph creation",
                "upload_timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
            }
        }
    }]
    # Upload with the binary data attached
    response, _ = client.query(query, [pdf_data])
    if response[0]["AddBlob"]["status"] == 0:
        print(f"PDF '{file_name}' uploaded to ApertureDB")
    else:
        print(f"PDF '{file_name}' upload failed")
    return response

upload_response = upload_pdf_to_db("/content/Cloud Computing Lecture Notes.pdf", client)
Step 1: Extracting Entity Class Schemas
To build an effective knowledge graph, the first step is identifying high-level entity types (or classes) and their potential properties. This foundational step leverages the contextual understanding capabilities of Google's Gemini 2.5 Flash model to interpret raw text and extract structured information.
We define a clear schema using Pydantic, ensuring the output from Gemini is consistent and correctly formatted:
from typing import Dict, List
from pydantic import BaseModel, Field

class ClassSchema(BaseModel):
    classes: Dict[str, List[str]] = Field(
        description="Dictionary mapping class types to their possible properties"
    )
The following prompt guides Gemini to extract generic class-property mappings without naming specific entities. We first build a JSON parser from the schema, since the prompt embeds its format instructions:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

parser = JsonOutputParser(pydantic_object=ClassSchema)

prompt = PromptTemplate(
    template="""
You are the first agent in a multi-step workflow to build a Knowledge Graph.
YOUR TASK:
- Identify general class types (e.g., Person, Company, Location).
- List possible properties for each class type.
- Keep output general; avoid specifics and examples.
FORMAT:
{{
"ClassType": ["property1", "property2"]
}}
Text: {input}
{format_instructions}
Response:
""",
    input_variables=["input"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
We chain the prompt with Gemini and the parser:
chain = prompt | llm | parser
def extract_class_schema(text):
    def attempt():
        result = chain.invoke({"input": text})
        return result if isinstance(result, dict) else {"classes": result}
    # retry_llm_call is a small retry helper defined in the notebook
    return retry_llm_call(attempt)
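The retry_llm_call helper lives in the notebook; a minimal sketch of the assumed behavior (retry with a short backoff) looks like this:
import time

def retry_llm_call(fn, max_retries=3, delay=2):
    """Retry a flaky LLM call a few times before giving up
    (sketch; the notebook's implementation may differ)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay * (attempt + 1))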
Run the extraction and preview results:
classes = extract_class_schema(document_content)
entity_classes = classes.get("classes", {})
Here's part of the output from this step, showing the first class with its possible properties:
{
  "classes": {
    "Computing System": [
      "definition",
      "purpose",
      "characteristics",
      "components",
      "use_cases",
      "limitations",
      "speed"
    ], ...
This initial extraction step sets the foundation for robust knowledge graph construction by clearly defining what information we seek from our data.
Step 2: Extracting Specific Entities
With the entity classes defined, the next step is to extract specific instances of these classes from the document. To manage this effectively, particularly with large documents, we split the text into manageable chunks and process them in parallel using Gemini and LangChain.
Splitting the Document
First, we use LangChain's RecursiveCharacterTextSplitter to divide large texts into smaller, overlapping chunks, balancing efficiency with contextual clarity. I used a chunk size of 5000 characters with a 500-character overlap, which works well with Gemini's long context window.
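The split_text helper used later isn't shown in the post; a minimal sketch, assuming LangChain's standard splitter API:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_text(text, chunk_size=5000, chunk_overlap=500):
    """Split raw text into overlapping LangChain Documents."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return splitter.create_documents([text])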
Again, we define a clear output format using Pydantic to ensure consistent and accurate output from Gemini:
# Define Pydantic models for entity extraction
from typing import Any

class ClassEntities(BaseModel):
    """Entities of a specific class type."""
    Class: str = Field(description="The class type name")
    Entities: Dict[str, Dict[str, Any]] = Field(
        description="Dictionary mapping entity names to their properties"
    )

class ChunkExtractionResult(BaseModel):
    """Result of entity extraction from a single chunk."""
    classes: List[ClassEntities] = Field(
        description="List of class types and their entities found in this chunk"
    )
Now we extract specific entities and their properties from the text in parallel using RunnableLambda.batch(), which is ideal for large documents.
Prompt for Entity Extraction
# Build a parser for the chunk-level schema; the prompt embeds its format instructions
parser = JsonOutputParser(pydantic_object=ChunkExtractionResult)

prompt = PromptTemplate(
    template="""
You are part of a multi-step workflow to build a Knowledge Graph from raw text.
Workflow Steps:
1. Extract class types and properties. [DONE]
2. Extract entity instances from chunks. [CURRENT]
3. Deduplicate entities, assign IDs.
4. Identify relationships.
5. Create the graph.
YOUR TASK:
- Extract concrete entities of known class types from this chunk.
- Only include properties explicitly mentioned.
- Omit properties not found; no nulls or guesses.
INPUT TEXT CHUNK:
{chunk}
CLASS TYPES AND THEIR PROPERTIES:
{class_schema}
FORMAT:
{{
"classes": [
{{
"Class": "ClassName",
"Entities": {{
"EntityName": {{"property1": "value", ...}}
}}
}}
]
}}
{format_instructions}
Begin your extraction now:
""",
    input_variables=["chunk", "class_schema"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
Parallel Execution
import json
from langchain_core.runnables import RunnableLambda

# Chain the prompt, model, and parser (same pattern as Step 1)
chain = prompt | llm | parser

def extract_entities_from_chunk(chunk_data):
    chunk_text, classes = chunk_data
    return retry_llm_call(lambda: chain.invoke({
        "chunk": chunk_text,
        "class_schema": json.dumps(classes, indent=2)
    }))

processor = RunnableLambda(extract_entities_from_chunk)
chunk_data = [(chunk.page_content, entity_classes) for chunk in split_text(document_content)]
results = processor.batch(chunk_data, config={"max_concurrency": 6})
extracted_entities = merge_entity_results(results)
save_json(extracted_entities, "step2_output.json")
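Both merge_entity_results and save_json are small helpers from the notebook. A simplified sketch of the merging logic, assuming it just flattens the per-chunk results and leaves duplicate handling to Step 3:
def merge_entity_results(results):
    """Flatten per-chunk results into one list of class objects
    (sketch; the notebook's version may differ).
    Duplicates across chunks survive until Step 3."""
    merged = []
    for result in results:
        merged.extend(result.get("classes", []))
    return merged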
Here’s a part of the output from the entity extraction step:
[
  {
    "Class": "Computing System",
    "Entities": {
      "Distributed computing": {
        "definition": "a system where computing resources are distributed across multiple locations rather than being centralized in a single system.",
        "purpose": "enables task distribution and efficient resource utilization.",
        "efficiency": "efficient resource utilization",
        "scalability": "allow for hardware scaling"
      },
      "Traditional computing": {
        "limitations": "faces bottlenecks due to hardware limitations"
      }, ...
Step 3: Entity Deduplication and ID Assignment
Deduplication is critical for a robust and reliable knowledge graph. Because the document was processed in chunks, the same entity may appear in multiple extraction results if it was mentioned in several places in the original text; the smaller the chunk size, the more duplicates you can expect. Duplicate entities introduce inconsistencies, confusion, and inaccuracies. By removing duplicates and assigning each entity a clear, unique identifier (ID), we maintain the integrity and usability of the graph. Simple incremental IDs are sufficient here and guarantee uniqueness; UUIDs could be used instead if global uniqueness were required. These IDs matter because we'll use them later to make connections between related entities in the database. Note that this step involves no LLM calls, just plain Python.
Here's how we efficiently deduplicate entities and assign unique IDs:
def deduplicate_entities(entities_data):
    """Deduplicate entities and assign unique integer IDs."""
    deduplicated = []
    id_counter = 1
    total_before = 0
    total_after = 0
    for class_obj in entities_data:
        class_type = class_obj["Class"]
        entities = class_obj["Entities"]
        total_before += len(entities)
        # Deduplicate by entity name
        unique_entities = {}
        for entity_name, props in entities.items():
            if entity_name in unique_entities:
                # Merge properties from the duplicate
                unique_entities[entity_name].update(props)
            else:
                unique_entities[entity_name] = props.copy()
        # Assign IDs to the unique entities
        for entity_name, props in unique_entities.items():
            props["id"] = id_counter
            id_counter += 1
        deduplicated.append({
            "Class": class_type,
            "Entities": unique_entities
        })
        total_after += len(unique_entities)
        print(f" • {class_type}: {len(entities)} → {len(unique_entities)} entities")
    duplicates_removed = total_before - total_after
    print(f"\nRemoved {duplicates_removed} duplicates, assigned IDs 1-{id_counter-1}")
    return deduplicated
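Running the function over the merged extraction results produces the deduplicated_entities list that Step 4 inserts into the database:
deduplicated_entities = deduplicate_entities(extracted_entities)
save_json(deduplicated_entities, "step3_output.json")  # notebook helper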
Here’s a part of the result of the deduplication and ID-assignment step:
[
  {
    "Class": "Computing System",
    "Entities": {
      "Distributed computing": {
        "definition": "a system where computing resources are distributed across multiple locations rather than being centralized in a single system.",
        "purpose": "enables task distribution and efficient resource utilization.",
        "efficiency": "efficient resource utilization",
        "scalability": "allow for hardware scaling",
        "id": 1
      },
      "Traditional computing": {
        "limitations": "faces bottlenecks due to hardware limitations",
        "id": 2
      }, ...
Step 4: Populating ApertureDB with Entities
ApertureDB is optimized for storing structured data alongside rich media, making it ideal for building knowledge graphs. We'll batch-insert the entities, leveraging parallel execution to handle large datasets efficiently (relationships get the same treatment in part two).
Batch Insertion of Entities Using ApertureDB's ParallelLoader
ApertureDB's ParallelLoader delivers fast ingestion, especially with large numbers of entities:
from aperturedb.ParallelLoader import ParallelLoader

def create_entity_indexes(client, entities):
    # Index the "id" property of each entity class for fast lookups
    queries = [{
        "AddIndex": {
            "class": obj["Class"],
            "property_key": "id",
            "metric": "L2"
        }
    } for obj in entities]
    client.query(queries)

def insert_entities_parallel_loader(client, entities):
    create_entity_indexes(client, entities)
    data = []
    for obj in entities:
        cls, instances = obj["Class"], obj["Entities"]
        for name, props in instances.items():
            # Stringify nested values so every property is a flat scalar
            props_str = {
                k: (json.dumps(v) if isinstance(v, dict) else ", ".join(v) if isinstance(v, list) else str(v))
                for k, v in props.items() if k != "id"
            }
            query = [{
                "AddEntity": {
                    "class": cls,
                    "properties": {
                        "name": name,
                        "id": props["id"],
                        **props_str
                    }
                }
            }]
            data.append((query, []))
    ParallelLoader(client).ingest(generator=data, batchsize=50, numthreads=4, stats=True)
    print(f"Inserted {len(data)} entities.")

# Run the insertion
insert_entities_parallel_loader(client, deduplicated_entities)
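As a quick sanity check from code, a FindEntity query (sketched here for one of the classes from our earlier output) returns a count of the inserted entities:
# Count inserted entities of one class
query = [{
    "FindEntity": {
        "with_class": "Computing System",
        "results": {"count": True}
    }
}]
response, _ = client.query(query)
print(response)  # e.g. [{"FindEntity": {"count": ..., "status": 0}}]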
We can check out the inserted entities in our ApertureDB dashboard:

Conclusion
In this first part, we've successfully laid the groundwork by extracting, deduplicating, and storing structured entities from documents into ApertureDB using Google's Gemini 2.5 Flash model. This sets the stage for relationship extraction and visualization, which we'll tackle in the next part of this series, bringing our knowledge graph fully to life.
Appendix
For your reference and further exploration:
- Complete Colab Notebook
- ApertureDB documentation
- GraphRAG with ApertureDB
- Google’s Gemini documentation
👉 ApertureDB is available on Google Cloud. Subscribe Now
Ayesha Imran | LinkedIn | Github
I am a software engineer passionate about AI/ML, Generative AI, and secure full-stack app development. I’m experienced in RAG and Agentic AI systems, LLMOps, full-stack development, cloud computing, and deploying scalable AI solutions. My ambition lies in building and contributing to innovative projects with real-world impact. Endlessly learning, perpetually building.