Upload files to "docs"

2025-11-10 10:26:40 +01:00 · 8bd8f7c0bb
commit 8bd8f7c0bb
parent 1c42cd1f78
1 changed files with 324 additions and 0 deletions
--- a/docs/mindnet_v2_implementation_playbook.md
+++ b/docs/mindnet_v2_implementation_playbook.md
@@ -0,0 +1,324 @@
+# mindnet v2 – Implementation Playbook & Handover (side-by-side with v1)
+*As of:* 2025-11-10 08:03  
+*Author:* ChatGPT (project handover)  
+*Purpose:* A complete, executable guide for evolving **mindnet** into a versioned, testable v2 architecture (alongside v1), including the prompt for the follow-up thread, acceptance criteria, file list, and test and rollback plans.
+
+---
+
+## 0) Context & Target State (brief)
+- **Mission:** A personal knowledge network that, over the long term, captures one's own personality, experiences, and decision logic; later it should provide traceable explanations for family and posterity.  
+- **Current state (v1):** `mindnet_notes` exists in Qdrant; **chunks** and **edges** are so far only partially populated or inconsistent; edge defaults ("1 Dot") exist, but some of their targets are not materialized. Import scripts: `import_markdown.py`, `edges.py` (legacy).  
+- **Goal (v2):**  
+  - **Collections:** `mindnet_notes_v2`, `mindnet_chunks_v2`, `mindnet_edges_v2` (clean payload schemas + indexes).  
+  - **Chunker v2:** block/heading-aware, target length 900–1200 characters, overlap 120–180 characters, deterministic `chunk_id`s.  
+  - **Edge builder v2:** resolution order **explicit → rule → default_resolved → default**; clean `provenance`.  
+  - **Importer pipeline v2:** parse → chunk → edge → embeddings → batch upsert + payload index maintenance + snapshots.  
+  - **Policies:** privacy/recency/type weights as **configuration** (YAML + JSON Schema), not hard-coded.  
+  - **Observability:** FastAPI + OpenTelemetry (tracing/metrics); import reports (CSV/JSON).  
+  - **Side-by-side:** v2 runs **in parallel** with v1; no big bang.
+
+> **Why this way?** Qdrant payload indexes and filters give fast, explainable selections; snapshots give an operationally safe rollback; OTel gives traceability; YAML with JSON Schema gives validatability and stability.
+
+---
+
+## 1) Data Models (schemas, v2)
+The actual JSON Schemas (draft 2020-12) are stored in the repo and validated in CI.
+
+### 1.1 note.schema.json
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://example.org/mindnet/note.schema.json",
+  "type": "object",
+  "required": ["id", "title", "type", "privacy", "created", "hash_body"],
+  "properties": {
+    "id":           { "type": "string", "pattern": "^[a-zA-Z0-9_-]+$" },
+    "title":        { "type": "string", "minLength": 1 },
+    "type":         { "type": "string" }, 
+    "privacy":      { "type": "string", "enum": ["public","internal","private"] },
+    "created":      { "type": "string", "format": "date-time" },
+    "modified":     { "type": "string", "format": "date-time" },
+    "tags":         { "type": "array", "items": { "type": "string" } },
+    "lang":         { "type": "string", "default": "de" },
+    "source_path":  { "type": "string" },
+    "source_collection": { "type": "string" },
+    "hash_body":    { "type": "string" },
+    "hash_frontmatter": { "type": "string" },
+    "token_count":  { "type": "integer", "minimum": 0 }
+  }
+}
+```
+
+### 1.2 chunk.schema.json
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://example.org/mindnet/chunk.schema.json",
+  "type": "object",
+  "required": ["chunk_id","note_id","text","ord"],
+  "properties": {
+    "chunk_id":      { "type": "string", "pattern": "^[a-zA-Z0-9_-]+#c\\d{4}$" },
+    "note_id":       { "type": "string" },
+    "text":          { "type": "string" },
+    "ord":           { "type": "integer", "minimum": 0 },
+    "span_char_start": { "type": "integer", "minimum": 0 },
+    "span_char_end":   { "type": "integer", "minimum": 0 },
+    "heading_path":  { "type": "array", "items": { "type": "string" } },
+    "section_title": { "type": "string" },
+    "tokens_start":  { "type": "integer", "minimum": 0 },
+    "tokens_end":    { "type": "integer", "minimum": 0 },
+    "embeddings_version": { "type": "string" }
+  }
+}
+```
+
+### 1.3 edge.schema.json
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://example.org/mindnet/edge.schema.json",
+  "type": "object",
+  "required": ["edge_id","src_note_id","relation","provenance"],
+  "properties": {
+    "edge_id":       { "type": "string" },
+    "src_note_id":   { "type": "string" },
+    "src_chunk_id":  { "type": "string" },
+    "dst_note_id":   { "type": "string" },
+    "dst_chunk_id":  { "type": "string" },
+    "relation":      { "type": "string" },
+    "evidence_spans":{ "type": "array", "items": { "type": "object", "properties": { "chunk_id": { "type": "string" }, "span": { "type": "array", "prefixItems": [{ "type": "integer" }, { "type": "integer" }], "minItems": 2, "maxItems": 2 } } } },
+    "provenance":    { "type": "string", "enum": ["explicit","rule","default_resolved","default"] },
+    "rule_id":       { "type": "string" },
+    "confidence":    { "type": "number", "minimum": 0, "maximum": 1 }
+  }
+}
+```
+
+### 1.4 default_edge.schema.json
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://example.org/mindnet/default_edge.schema.json",
+  "type": "object",
+  "required": ["src_note_id","relation","target_kind"],
+  "properties": {
+    "src_note_id":  { "type": "string" },
+    "relation":     { "type": "string" },
+    "target_kind":  { "type": "string" }, 
+    "when":         { "type": "string" }, 
+    "strength_hint":{ "type": "number", "minimum": 0, "maximum": 1 }
+  }
+}
+```
+
+---
+
+## 2) Qdrant: Collections & Indexes (v2)
+Create **alongside** v1:
+- `mindnet_notes_v2` (payload index: `type, privacy, created, modified, tags, source_path`)
+- `mindnet_chunks_v2` (payload index: `note_id, ord, heading_path`)
+- `mindnet_edges_v2`  (payload index: `src_note_id, dst_note_id, relation, provenance`)
+
+> Plan for snapshots for backup/restore (see the runbook below).
+
+---
+
+## 3) Import Pipeline v2 – Modules & Flags
+- **Chunker v2** (`chunking/block_chunker.py`): `target_len=1000`, `overlap=150`, `respect_headings=True`, `min_chunk_len=600`, `max_chunk_len=1400`, `chunk_id = f"{note_id}#c{ordinal:04d}"`  
+- **Edge builder v2** (`graph/edge_builder_v2.py`): resolution order **explicit → rule → default_resolved → default**; set `provenance` correctly.  
+- **Importer** (extend `scripts/import_markdown.py` or add `scripts/import_markdown_v2.py`):  
+  Flags: `--schema v2`, `--chunker v2`, `--edges v2`, `--dry-run`, `--apply`, `--prefix "$COLLECTION_PREFIX"`.
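A minimal sketch of the `chunk_id` format and the overlap parameters listed above. It is an illustration under stated assumptions, not the planned `chunking/block_chunker.py`: the names `make_chunk_id` and `chunk_text` are hypothetical, the fixed-width sliding window is a simplification, and heading awareness as well as `min_chunk_len`/`max_chunk_len` are left out. Only `target_len`, `overlap`, the ID format, and the payload field names come from the playbook.

```python
import re

# Pattern taken from chunk.schema.json in this playbook.
CHUNK_ID_RE = re.compile(r"^[a-zA-Z0-9_-]+#c\d{4}$")


def make_chunk_id(note_id: str, ordinal: int) -> str:
    """Build the deterministic chunk ID: note ID, '#c', zero-padded ordinal."""
    return f"{note_id}#c{ordinal:04d}"


def chunk_text(note_id: str, text: str, target_len: int = 1000, overlap: int = 150) -> list[dict]:
    """Split text into fixed-size windows that overlap by `overlap` characters.

    Assumption: a real block chunker would cut at headings and paragraph
    boundaries instead of fixed character offsets.
    """
    if not 0 <= overlap < target_len:
        raise ValueError("overlap must be non-negative and smaller than target_len")
    chunks = []
    start = 0
    ordinal = 0
    while start < len(text):
        end = min(start + target_len, len(text))
        chunks.append({
            "chunk_id": make_chunk_id(note_id, ordinal),
            "note_id": note_id,
            "ord": ordinal,
            "text": text[start:end],
            "span_char_start": start,
            "span_char_end": end,
        })
        if end == len(text):
            break
        # Step back by `overlap` so consecutive chunks share context.
        start = end - overlap
        ordinal += 1
    return chunks
```

Because the ID depends only on `note_id` and the ordinal, re-running the chunker on unchanged input yields the same IDs, which is what makes repeated upserts idempotent.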
+
+**Example run (dry run):**
+```bash
+python3 -m scripts.import_markdown_v2 --vault ./vault --schema v2 --chunker v2 --edges v2 --dry-run --prefix "$COLLECTION_PREFIX"
+```
+
+**Apply:**
+```bash
+python3 -m scripts.import_markdown_v2 --vault ./vault --schema v2 --chunker v2 --edges v2 --apply --prefix "$COLLECTION_PREFIX"
+```
+
+---
+
+## 4) Policies (configurable, not hard-coded)
+- `policies/retrieval.schema.json` (JSON Schema) and `policies/retrieval.yaml`, e.g.:
+```yaml
+privacy_order: [private, internal, public]
+recency_boost_half_life_days: 90
+type_priority:
+  person: 1.2
+  event: 1.1
+  concept: 1.0
+max_chunks_per_note: 8
+edge_provenance_weights:
+  explicit: 1.0
+  rule: 0.8
+  default_resolved: 0.6
+  default: 0.2
+rule_sets:
+  - id: "rules:types-v1"
+    enabled: true
+```
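The playbook does not define how these policy values enter the ranking, so the following is only one plausible reading: `recency_boost_half_life_days` is interpreted as an exponential half-life, and the type priority is applied as a multiplier. The functions `recency_weight` and `policy_score` and the multiplicative combination are assumptions made for illustration.

```python
def recency_weight(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: the weight halves every `half_life_days` days."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


# Values copied from the policies/retrieval.yaml example.
TYPE_PRIORITY = {"person": 1.2, "event": 1.1, "concept": 1.0}


def policy_score(similarity: float, note_type: str, age_days: float) -> float:
    """Hypothetical combination: similarity x type priority x recency weight."""
    type_factor = TYPE_PRIORITY.get(note_type, 1.0)  # unknown types stay neutral
    return similarity * type_factor * recency_weight(age_days)
```

Under this reading, a `person` note that is 90 days old with similarity 0.5 scores 0.5 × 1.2 × 0.5 = 0.3. Since the values live in the policy file, changing them would alter the ranking without a reimport.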
+
+---
+
+## 5) Observability & Reports
+- **FastAPI + OpenTelemetry**: traces around the phases *parse/chunk/edge/upsert*; metrics (counters/histograms) per phase.  
+- **Import report** per run (CSV/JSON): counts for notes/chunks/edges, errors, duration, embedding version.
+
+---
+
+## 6) Test Strategy
+- **Gold notes** (3–5 representative .md files) with expected *chunk counts* and *edge sets*.  
+- **Regression:** fixed expected count (e.g. 171 ± 15 %) across the whole vault; no duplicate `(src,relation,dst)` triples.  
+- **Dry run vs. apply:** identical counts, only without upserts.  
+- **Filter smoke tests:** Qdrant filters on `privacy`, `type`, `tags`, `created` work and perform well.
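The regression checks above could be expressed as two small helpers. This is a sketch, assuming edges are plain dicts with the field names from `edge.schema.json`; the helper names `find_duplicate_edges` and `within_tolerance` are hypothetical, and `refers_to` in the usage example is a made-up relation name.

```python
from collections import Counter


def find_duplicate_edges(edges: list[dict]) -> list[tuple]:
    """Return every (src, relation, dst) triple that occurs more than once."""
    counts = Counter(
        (e["src_note_id"], e["relation"], e.get("dst_note_id")) for e in edges
    )
    return [triple for triple, n in counts.items() if n > 1]


def within_tolerance(actual: int, expected: int = 171, tolerance: float = 0.15) -> bool:
    """Check that a count lies within expected +/- tolerance (given as a fraction)."""
    return abs(actual - expected) <= expected * tolerance
```

Usage, for example inside a test:

```python
edges = [
    {"src_note_id": "a", "relation": "refers_to", "dst_note_id": "b"},
    {"src_note_id": "a", "relation": "refers_to", "dst_note_id": "b"},
]
assert find_duplicate_edges(edges) == [("a", "refers_to", "b")]
assert within_tolerance(180)
```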
+
+---
+
+## 7) Acceptance Criteria (KPIs)
+1. **Chunk quality:** ≥ 90 % of chunks end at semantic boundaries; target length 900–1200 characters; deterministic IDs.  
+2. **Edge coherence:** no duplicates of the same `(src,relation,dst)`; ≥ 95 % of explicit links materialized; `default` only when no target exists.  
+3. **Filterability:** queries on `privacy/type/tags/created` perform well (payload index present).  
+4. **Operational robustness:** telemetry active; fault-tolerant import; Qdrant snapshot after a successful run.
+
+---
+
+## 8) Roadmap in Small, Testable Steps
+
+### Step 0 – Safeguards
+- Create a Qdrant snapshot of v1 and run a restore test.
+- ENV: `MINDNET_SCHEMA_VERSION=2` (the importer will write v2; v1 remains untouched).
+
+### Step 1 – Schemas (files only)
+- Location: `/schemas/note.schema.json`, `/schemas/chunk.schema.json`, `/schemas/edge.schema.json`, `/schemas/default_edge.schema.json`  
+- CI job: `make schema-validate` (jsonschema).  
+- **Acceptance:** validator green on 3 gold notes.
+
+### Step 2 – Qdrant v2 Collections
+- Create the 3 v2 collections + payload indexes.  
+- **Acceptance:** filter query returns the expected results.
+
+### Step 3 – Chunker v2
+- Implementation & flag `--chunker v2`.  
+- **Acceptance:** chunk counts close to the old ones (≈171 ± 15 %), semantic cuts.
+
+### Step 4 – Edge Builder v2
+- Strictly implement the order and `provenance`; integrate `default_resolved`.  
+- **Acceptance:** expected relation sets on the gold notes, no duplicates.
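One way to read the order explicit → rule → default_resolved → default is as a precedence: when several candidates describe the same `(src, relation, dst)` triple, the one with the strongest provenance wins. The playbook does not spell out the conflict rule, so the sketch below is an assumption, and `resolve_edges` is a hypothetical helper name.

```python
# Lower number = stronger provenance, following the order given in the playbook.
PRECEDENCE = {"explicit": 0, "rule": 1, "default_resolved": 2, "default": 3}


def resolve_edges(candidates: list[dict]) -> list[dict]:
    """Keep one edge per (src, relation, dst), preferring the strongest provenance."""
    best: dict[tuple, dict] = {}
    for edge in candidates:
        # Unresolved default edges have no dst_note_id; they key on None.
        key = (edge["src_note_id"], edge["relation"], edge.get("dst_note_id"))
        current = best.get(key)
        if current is None or PRECEDENCE[edge["provenance"]] < PRECEDENCE[current["provenance"]]:
            best[key] = edge
    return list(best.values())
```

Keying on the triple also enforces the "no duplicates" acceptance criterion as a side effect.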
+
+### Step 5 – Importer Pipeline v2
+- `--schema v2` side-by-side; batch upserts; complete payloads; recency boost configuration **only** in the policy.  
+- **Acceptance:** dry-run/apply parity, reports, telemetry events visible.
+
+### Step 6 – Observability
+- OTel instrumentation + `/import/status` + reports.  
+- **Acceptance:** traces & metrics present.
+
+### Step 7 – Snapshots & Runbook
+- Snapshot after success; restore document.  
+- **Acceptance:** restore test successful.
+
+### Step 8 – Policies
+- `policies/retrieval.yaml` + JSON Schema; A/B test (change the policy → retrieval changes without a reimport).  
+- **Acceptance:** visible effect according to the test case.
+
+### Step 9 – Switchover
+- Acceptance documentation, KPI delta documented; alias/flag switch to v2.  
+- **Acceptance:** functional equivalence + quality gain.
+
+---
+
+## 9) Files (new/changed)
+**New:**
+- `schemas/note.schema.json`  
+- `schemas/chunk.schema.json`  
+- `schemas/edge.schema.json`  
+- `schemas/default_edge.schema.json`  
+- `policies/retrieval.schema.json`  
+- `policies/retrieval.yaml`  
+- `chunking/block_chunker.py`  
+- `graph/edge_builder_v2.py`  
+- `docs/OPERATIONS.md` (snapshots/restore/runbook)  
+- `tests/gold_notes/manifest.yaml` (+ copies of 3–5 notes)
+
+**Changes:**
+- `scripts/import_markdown.py` **or** a new `scripts/import_markdown_v2.py` (flags `--schema/--chunker/--edges`).  
+- `edges.py` (if still used): migrate to the **v2 edge builder** or phase it out.  
+- `types.yaml`: clarify the rule definitions (IDs, conditions); **do not** hard-code them.
+
+---
+
+## 10) Example Commands
+```bash
+# Step 2: check collections
+curl -s http://127.0.0.1:6333/collections | jq
+
+# Step 3: Chunker v2 (dry-run)
+python3 -m scripts.import_markdown_v2 --vault ./vault --schema v2 --chunker v2 --edges v2 --dry-run --prefix "$COLLECTION_PREFIX"
+
+# Step 5: Apply + Report
+python3 -m scripts.import_markdown_v2 --vault ./vault --schema v2 --chunker v2 --edges v2 --apply --prefix "$COLLECTION_PREFIX" --report ./reports/import_$(date +%F_%H%M).json
+```
+
+---
+
+## 11) Rollback
+- Drop the v2 collections (if running side-by-side).  
+- Restore from snapshot (v1).  
+- Set the flags back to v1.
+
+---
+
+## 12) References (official sources)
+- Qdrant – Collections, Payload Index, Filtering: <https://qdrant.tech/documentation/>  
+- Qdrant – Snapshots/Backup: <https://qdrant.tech/documentation/guides/backup/>  
+- OpenTelemetry (Python/FastAPI): <https://opentelemetry.io/docs/instrumentation/python/>  
+- JSON‑Schema 2020‑12: <https://json-schema.org/>  
+- YAML 1.2: <https://yaml.org/spec/1.2.2/>  
+- Obsidian Frontmatter: <https://help.obsidian.md/Editing+and+formatting/Properties>  
+- Ollama CLI/API (lokale Modelle): <https://github.com/ollama/ollama>
+
+---
+
+## 13) **PROMPT for the new chat (please paste exactly as is)**
+```
+Role: You are my senior developer & architect for mindnet. Work strictly in small, testable steps (side-by-side, v2 next to v1). Deliver complete files as downloads. Whenever you need project files (e.g. import_markdown.py, edges.py, types.yaml), always ask for the **latest** versions first; do not change any system function as a "workaround" without checking the overall effect.
+
+Context (short version):
+- v1 only has mindnet_notes reliably. Chunks/edges are inconsistent or partly empty. Goal: v2 with three collections (notes/chunks/edges), a new chunker v2, edge builder v2 (explicit → rule → default_resolved → default), policies (YAML), observability (OTel), snapshots. No big bang; v2 in parallel with v1.
+- See the attached file **mindnet_v2_implementation_playbook.md** (path/download in the chat). Everything in it is binding.
+
+Your first tasks (step by step):
+1) **Create schemas (Step 1)**  
+   - Create the files:  
+     - schemas/note.schema.json  
+     - schemas/chunk.schema.json  
+     - schemas/edge.schema.json  
+     - schemas/default_edge.schema.json  
+   - Use the structures given in the playbook file as a basis.  
+   - Additionally provide a simple `Makefile` target `schema-validate` (jsonschema via Python), plus `requirements.txt`.  
+   - Output: complete files as downloads + short test instructions.
+
+2) **Qdrant v2 collections (Step 2)**  
+   - Create the collections + payload indexes.  
+   - Provide a small Python script `tools/qdrant_bootstrap_v2.py` that performs this setup idempotently (including verification output).
+
+3) **Chunker v2 (Step 3)**  
+   - Implement `chunking/block_chunker.py` with the parameters named in the playbook.  
+   - Test against 3 gold notes (provide a mini test harness in `tests/gold_notes/…`).
+
+Working method:
+- Each task individually, with clear acceptance criteria (from the playbook) and download artifacts.  
+- No parallel large-scale rewrites.  
+- Always deterministic IDs, idempotent upserts.
+
+Files I will provide initially:
+- **mindnet_v2_implementation_playbook.md** (this file)  
+- **import_markdown.py** (current version from the project)  
+- **edges.py** (current version from the project)  
+- **types.yaml** (current version)
+
+Tell me each time which file you need next. Start now with task 1 (schemas).
+```