Building a Structured LLM Wiki from Scientific Sources
This article documents the practical implementation of a structured LLM knowledge ingestion workflow.
The objective was to transform scattered web pages and peer-reviewed articles into a queryable knowledge system capable of supporting structured reasoning.
Inspired by the LLM Wiki architecture, the workflow focuses on building a persistent knowledge layer before querying any model.
1. Creating a Dedicated Knowledge Workspace
The first step was to initialize a clean knowledge environment.
Create a New Obsidian Vault
This vault acts as the central repository for all knowledge artifacts.
Structure:
Vault/
├── Clipping/
├── PDFs/
├── Wiki/
└── Logs/
This separation ensures:
- traceability
- modular structure
- predictable ingestion
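The folder layout above can be scaffolded with a small script. This is an illustrative sketch, not part of the original workflow: the `init_vault` helper and its return value are assumptions, while the folder names come directly from the structure shown.

```python
from pathlib import Path

def init_vault(root: str) -> list[str]:
    """Scaffold the vault layout described above (hypothetical helper)."""
    base = Path(root)
    # Folder names taken from the article's vault structure.
    for name in ("Clipping", "PDFs", "Wiki", "Logs"):
        (base / name).mkdir(parents=True, exist_ok=True)
    # Return the created folder names, sorted, as a quick sanity check.
    return sorted(p.name for p in base.iterdir() if p.is_dir())
```

Keeping this as a script makes the workspace reproducible: a fresh vault can be re-created identically before each ingestion run.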
2. Setting Up the Agentic Environment
The ingestion workflow is executed inside an agent-assisted development environment.
Environment used:
- Antigravity
- Claude Code
- Obsidian Vault (mounted workspace)
This enables:
- repeatable prompts
- automated indexing
- structured content generation
3. Initializing the LLM Wiki Schema
The wiki structure is initialized using the Karpathy LLM Wiki prompt.
Reference:
https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
This schema defines:
- page organization
- indexing rules
- linking strategy
- maintenance logic
Example:
index.md
topics/
concepts/
references/
log.md
This creates the foundation of the knowledge graph.
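The schema skeleton above can likewise be laid down programmatically. A minimal sketch, assuming placeholder contents for the two Markdown files (the actual contents are produced by the LLM Wiki prompt, not by this script):

```python
from pathlib import Path

def init_wiki(wiki_root: str) -> list[str]:
    """Create the wiki skeleton shown above: folders plus two seed files."""
    base = Path(wiki_root)
    # Folder names from the schema example.
    for folder in ("topics", "concepts", "references"):
        (base / folder).mkdir(parents=True, exist_ok=True)
    # Placeholder headings; real content is generated by the wiki prompt.
    (base / "index.md").write_text("# Index\n", encoding="utf-8")
    (base / "log.md").write_text("# Maintenance Log\n", encoding="utf-8")
    return sorted(p.name for p in base.iterdir())
```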
4. Capturing Source Material
Two categories of sources were collected:
Web Sources
Captured using:
Obsidian Web Clipper
Each page was:
- reviewed
- saved as Markdown
- placed into:
Clipping/
Scientific PDFs
Peer-reviewed, open-access papers were manually collected and stored in:
PDFs/
Source curation included:
- publication validation
- relevance filtering
- topic alignment
This ensured a high signal-to-noise ratio.
5. Indexing the Clipping Folder
The first transformation step begins here.
Raw Markdown files are converted into structured entries.
Prompt Used
Index the Markdown files in the Clipping folder
and create initial wiki entries with topics,
concepts, and references.
This produces:
- topic pages
- concept definitions
- source references
At this stage:
the knowledge graph begins to form.
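The indexing step can be approximated in code. The sketch below scans the Clipping folder, takes each note's first heading as its title, and collects its `[[wikilinks]]` as candidate concept references. The entry shape is an assumption for illustration; the actual wiki entries are produced by the prompt above.

```python
import re
from pathlib import Path

# Matches the target of an Obsidian-style [[wikilink]] (stops at |, # or ]]).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def index_clippings(folder: str) -> list[dict]:
    """Build a simple index entry per Markdown file in the given folder."""
    entries = []
    for path in sorted(Path(folder).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        # Title: first level-1 heading, falling back to the filename.
        heading = next((line[2:].strip() for line in text.splitlines()
                        if line.startswith("# ")), path.stem)
        links = sorted({m.strip() for m in WIKILINK.findall(text)})
        entries.append({"title": heading, "file": path.name, "links": links})
    return entries
```

Even this crude index surfaces the link structure that the wiki prompt later formalizes into topic and concept pages.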
6. Ingesting Scientific PDF Articles
Scientific articles are processed using a structured ingestion prompt.
Reference:
https://gist.github.com/artemmelnyk-extern/e9e54d962284838d6c246a99caf04125
Each article is:
- renamed consistently
- processed using the prompt
- converted into structured wiki pages
Expected outputs:
- structured summaries
- extracted concepts
- cross-links
- references
This step significantly expands the knowledge network.
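The "renamed consistently" step benefits from a fixed convention. The article does not specify one, so the pattern below (`<year>-<first-author>-<slugified-title>.pdf`) is an assumption offered as one workable scheme:

```python
import re

def pdf_name(year: int, author: str, title: str) -> str:
    """Illustrative naming convention (an assumption, not the article's)."""
    # Slugify the title: lowercase, non-alphanumerics collapsed to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{year}-{author.lower()}-{slug}.pdf"
```

A deterministic name makes cross-links from wiki pages back to the source PDF stable across re-ingestion runs.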
7. Querying the Knowledge Base
After ingestion, the system becomes queryable.
Not searchable.
Queryable.
That distinction is fundamental.
Example interaction:
Query:
Explain the relationship between Topic A and Topic B
based on the ingested sources.
The answer now reflects:
- structured context
- linked knowledge
- curated sources
Not raw text fragments.
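The searchable/queryable distinction can be made concrete: search matches raw text, while a query walks the link graph to assemble structured context before the model answers. The graph representation and helper below are illustrative assumptions, not the system's actual retrieval code.

```python
def gather_context(graph: dict[str, list[str]],
                   topic: str, depth: int = 2) -> set[str]:
    """Collect pages reachable from a topic by following wiki links."""
    seen, frontier = {topic}, [topic]
    for _ in range(depth):
        # Expand one hop along outgoing links, skipping visited pages.
        frontier = [n for page in frontier for n in graph.get(page, [])
                    if n not in seen]
        seen.update(frontier)
    return seen
```

Asking about Topic A and Topic B then means feeding the model both pages plus the notes that connect them, rather than isolated text fragments.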
8. Visualizing the Knowledge Graph
The Obsidian graph provides a visual confirmation of structure.
In this graph:
- each node = a structured knowledge entry
- each edge = a contextual relationship
This visualization confirms:
the transformation from documents to knowledge.
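The graph Obsidian draws is derived entirely from the notes themselves: each note is a node and each `[[wikilink]]` an edge. A minimal sketch of that derivation, reading files directly (an assumption about layout, not Obsidian's internal API):

```python
import re
from pathlib import Path

# Matches the target of an Obsidian-style [[wikilink]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def build_graph(wiki_root: str) -> list[tuple[str, str]]:
    """Return (source note, linked page) edges found under wiki_root."""
    edges = []
    for path in sorted(Path(wiki_root).rglob("*.md")):
        for target in WIKILINK.findall(path.read_text(encoding="utf-8")):
            edges.append((path.stem, target.strip()))
    return edges
```

An empty edge list after ingestion is a useful health check: it means the pages were created but never linked, i.e. documents, not yet knowledge.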
Architecture Overview
Raw Sources
↓
Clipping
↓
Indexing
↓
PDF Ingestion
↓
Linked Wiki
↓
Queryable Knowledge
This architecture transforms:
Information → Knowledge → Reasoning
Result
After full ingestion:
- knowledge became modular
- relationships became visible
- querying became meaningful
The system evolved from:
file storage
to:
knowledge infrastructure
Key Takeaways
- Knowledge must be structured before reasoning
- Source quality determines reasoning quality
- Linking is as important as content
- Indexing creates the first real structure
- Querying is the final step — not the first
Conclusion
This workflow demonstrates how modern LLM systems benefit from structured knowledge ingestion rather than raw document retrieval.
The shift is not about smarter models.
It is about better knowledge architecture.