scripts/import_markdown.py updated
Some checks failed
Deploy mindnet to llm-node / deploy (push) Failing after 1s

This commit is contained in:
Lars 2025-09-06 14:04:46 +02:00
parent df33293621
commit 47e6d56b21


@@ -1,261 +1,277 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Script: scripts/import_markdown.py
Version: 0.6.0 (2025-09-06)
Author: mindnet / Data Imports & Sync

Short description
-----------------
Imports Markdown notes from an Obsidian-like vault into Qdrant:
- Validates frontmatter / note payload.
- Chunking + embeddings.
- Derives edges directly at import time from [[Wikilinks]]:
  - 'references'    (Note -> Note)
  - 'references_at' (Chunk -> Note)
  - 'backlink'      (Note -> Note), only for Note -> Note edges.

New in 0.6.0
------------
- Option `--purge-before-upsert`: for each processed note, deletes all of its
  chunks and edges in Qdrant (selectively!) *before* the upsert, to avoid
  stale leftovers after re-chunking.
- Robust link resolution via the note index (ID / title slug / file slug),
  consistent with `derive_edges.py`.

Usage examples
--------------
Dry-run (no writes):
    python3 -m scripts.import_markdown --vault ./vault
Only one specific note:
    python3 -m scripts.import_markdown --vault ./vault --note-id 20250821-foo
Apply (write) with purge:
    python3 -m scripts.import_markdown --vault ./vault --apply --purge-before-upsert

Parameters
----------
--vault PATH          : Required. Root directory of the vault.
--apply               : If set, upserts are performed (otherwise dry-run).
--purge-before-upsert : If set, old chunks and edges of the note are deleted
                        in Qdrant before the upsert (only with --apply).
--note-id ID          : Optional, process only this one note.

Environment variables (.env)
----------------------------
QDRANT_URL, QDRANT_API_KEY, COLLECTION_PREFIX, VECTOR_DIM
Defaults: url=http://127.0.0.1:6333, prefix=mindnet, dim=384

Compatibility
-------------
- Uses the existing core modules:
  app.core.parser        (read_markdown, normalize_frontmatter, validate_required_frontmatter)
  app.core.validate_note (validate_note_payload)
  app.core.chunker       (assemble_chunks)
  app.core.chunk_payload (make_chunk_payloads)
  app.core.embed         (embed_texts)
  app.core.qdrant        (QdrantConfig, get_client, ensure_collections)
  app.core.qdrant_points (points_for_note, points_for_chunks, points_for_edges, upsert_batch)
  app.core.derive_edges  (build_note_index, derive_wikilink_edges)

Changes vs. the previous importer
---------------------------------
- The old global delete workarounds are gone. Selective purging is now optional and safe.
- Edges are only produced in the new, unified structure.
"""
from __future__ import annotations

import argparse
import glob
import json
import os
import sys
from typing import List, Dict

from dotenv import load_dotenv
from qdrant_client.http import models as rest

# Core building blocks (already present in the project)
from app.core.parser import (
    read_markdown,
    normalize_frontmatter,
    validate_required_frontmatter,
)
from app.core.validate_note import validate_note_payload
from app.core.chunker import assemble_chunks
from app.core.chunk_payload import make_chunk_payloads
from app.core.embed import embed_texts
from app.core.qdrant import QdrantConfig, ensure_collections, get_client, collection_names
from app.core.qdrant_points import (
    points_for_note,
    points_for_chunks,
    points_for_edges,
    upsert_batch,
)
from app.core.derive_edges import build_note_index, derive_wikilink_edges
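Qdrant point IDs must be UUIDs or unsigned integers, and the earlier importer's changelog mentions idempotency via stable UUIDv5 IDs. How `points_for_note` / `points_for_chunks` derive their IDs is not shown in this file; a hypothetical sketch of the idea (the namespace and key format are made up here):

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
_NS = uuid.uuid5(uuid.NAMESPACE_URL, "mindnet")

def stable_point_id(kind: str, logical_id: str) -> str:
    """Deterministic Qdrant point ID: same input -> same UUID,
    so re-importing a note upserts instead of duplicating points."""
    return str(uuid.uuid5(_NS, f"{kind}:{logical_id}"))
```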
# -------------------------------------------------
# Helpers
# -------------------------------------------------
def iter_md(root: str) -> List[str]:
    patterns = ["**/*.md", "*.md"]
    out: List[str] = []
    for p in patterns:
        out.extend(glob.glob(os.path.join(root, p), recursive=True))
    return sorted(dict.fromkeys(out))  # de-dupe + sort


def make_note_stub(abs_path: str, vault_root: str) -> Dict:
    """
    Builds a minimal note stub for the index (build_note_index):
    { note_id, title, path }
    """
    parsed = read_markdown(abs_path)
    fm = normalize_frontmatter(parsed.frontmatter or {})
    # Minimal validation: we need id (title is optional, used for slug resolution)
    if "id" not in fm or not fm["id"]:
        raise ValueError(f"Missing id in frontmatter: {abs_path}")
    rel = os.path.relpath(abs_path, vault_root)
    return {"note_id": fm["id"], "title": fm.get("title"), "path": rel}


def build_vault_index(vault_root: str) -> tuple[Dict, Dict, Dict]:
    """
    Reads all notes and builds the triple index for wikilink resolution.
    """
    files = iter_md(vault_root)
    stubs = []
    for p in files:
        try:
            stubs.append(make_note_stub(p, vault_root))
        except Exception:
            # Note without an id -> the importer will skip it later anyway
            continue
    return build_note_index(stubs)
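`build_note_index` (from `app.core.derive_edges`) resolves links by note ID, title slug, and file slug; its internals are not shown in this file. A simplified, hypothetical sketch of such an index, collapsed into a single lookup dict for illustration:

```python
import os
import re

def slugify(s: str) -> str:
    """Lowercase and reduce to a-z0-9 runs joined by '-' (illustrative)."""
    return re.sub(r"[^a-z0-9]+", "-", s.strip().lower()).strip("-")

def build_note_index_sketch(stubs: list[dict]) -> dict:
    """Map ID, title slug, and file slug to the owning note_id."""
    index: dict[str, str] = {}
    for st in stubs:
        index[st["note_id"]] = st["note_id"]
        if st.get("title"):
            index[slugify(st["title"])] = st["note_id"]
        stem = os.path.splitext(os.path.basename(st["path"]))[0]
        index[slugify(stem)] = st["note_id"]
    return index
```

With all three keys present, `[[20250821-foo]]`, `[[Foo Bar!]]`, and a link to the file name all resolve to the same note.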
def purge_for_note(client, prefix: str, note_id: str, chunk_ids: List[str]) -> Dict[str, int]:
    """
    Selective purge for the current note:
    - Chunks: everything with payload.note_id == note_id
    - Edges:  everything with payload.source_id == note_id OR == one of the chunk_ids
    - Notes:  are not deleted (the upsert overwrites payload/vector)
    """
    _, chunks_col, edges_col = collection_names(prefix)
    # Qdrant's delete-by-filter response carries no deletion count, so these
    # counters stay at 0; they are kept for interface stability.
    counts = {"chunks_deleted": 0, "edges_deleted": 0}

    # Delete chunks (filter must: note_id == X)
    f_chunks = rest.Filter(
        must=[rest.FieldCondition(key="note_id", match=rest.MatchValue(value=note_id))]
    )
    client.delete(collection_name=chunks_col, points_selector=f_chunks, wait=True)

    # Delete edges: OR over the note id and all chunk ids
    should_conds = [rest.FieldCondition(key="source_id", match=rest.MatchValue(value=note_id))]
    for cid in chunk_ids:
        should_conds.append(rest.FieldCondition(key="source_id", match=rest.MatchValue(value=cid)))

    # should_conds always contains at least the note-id condition
    f_edges = rest.Filter(should=should_conds)
    client.delete(collection_name=edges_col, points_selector=f_edges, wait=True)

    return counts
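On the wire, the OR-style edge purge built with `rest.Filter(should=...)` corresponds to a Qdrant `points/delete` request body like the following (a sketch of the JSON shape only, assuming the `source_id` payload field used above):

```python
def edge_purge_filter(note_id: str, chunk_ids: list[str]) -> dict:
    """Request body for POST /collections/<edges>/points/delete:
    match edges whose source_id is the note itself or any of its chunks
    ('should' = logical OR in Qdrant filters)."""
    return {
        "filter": {
            "should": [
                {"key": "source_id", "match": {"value": s}}
                for s in [note_id, *chunk_ids]
            ]
        }
    }
```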
# -------------------------------------------------
# Main
# -------------------------------------------------
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--vault", required=True, help="Path to the vault root")
    ap.add_argument("--apply", action="store_true", help="Write to Qdrant (otherwise dry-run)")
    ap.add_argument(
        "--purge-before-upsert",
        action="store_true",
        help="Delete the current note's old chunks/edges before the upsert (only effective with --apply).",
    )
    ap.add_argument("--note-id", help="Optional: process only this note")
    args = ap.parse_args()

    load_dotenv()
    cfg = QdrantConfig(
        url=os.getenv("QDRANT_URL", "http://127.0.0.1:6333"),
        api_key=os.getenv("QDRANT_API_KEY", None),
        prefix=os.getenv("COLLECTION_PREFIX", "mindnet"),
        dim=int(os.getenv("VECTOR_DIM", "384")),
    )
    client = get_client(cfg)
    ensure_collections(client, cfg.prefix, cfg.dim)

    vault_root = os.path.abspath(args.vault)
    files = iter_md(vault_root)
    if not files:
        print("No Markdown files found.", file=sys.stderr)
        sys.exit(2)

    # 1) Note index over the whole vault (for robust link resolution)
    note_index = build_vault_index(vault_root)

    processed = 0
    for abs_path in files:
        parsed = read_markdown(abs_path)
        fm = normalize_frontmatter(parsed.frontmatter or {})
        try:
            validate_required_frontmatter(fm)
        except Exception:
            continue  # skip incomplete notes
        if args.note_id and fm.get("id") != args.note_id:
            continue
        processed += 1

        # --- Note payload ---
        from app.core.note_payload import make_note_payload  # lazy import (existing function)
        note_pl = make_note_payload(parsed, vault_root=vault_root)
        validate_note_payload(note_pl)

        # --- Chunking & payloads ---
        chunks = assemble_chunks(fm["id"], parsed.body, fm.get("type", "concept"))
        chunk_pls = make_chunk_payloads(fm, note_pl["path"], chunks)

        # --- Embeddings ---
        texts = [c.get("text") or c.get("content") or "" for c in chunk_pls]
        vectors = embed_texts(texts)

        # --- Edge derivation (directly at import time) ---
        edges = derive_wikilink_edges(note_pl, chunk_pls, note_index)

        # --- Decision per note (for logging) ---
        decision = "apply" if args.apply else "dry-run"

        # --- Purge before upsert (only with --apply) ---
        if args.apply and args.purge_before_upsert:
            # Determine the (new) chunk ids -> for the edge purge by source_id
            chunk_ids = []
            for i, ch in enumerate(chunk_pls, start=1):
                cid = ch.get("chunk_id") or ch.get("id") or f"{fm['id']}#{i}"
                ch["chunk_id"] = cid  # make sure it is set
                chunk_ids.append(cid)

            purge_for_note(client, cfg.prefix, fm["id"], chunk_ids)

        # --- Upserts (only with --apply) ---
        if args.apply:
            # Note
            notes_col, note_pts = points_for_note(cfg.prefix, note_pl, note_vec=None, dim=cfg.dim)
            upsert_batch(client, notes_col, note_pts)
            # Chunks
            chunks_col, chunk_pts = points_for_chunks(cfg.prefix, chunk_pls, vectors)
            upsert_batch(client, chunks_col, chunk_pts)
            # Edges
            edges_col, edge_pts = points_for_edges(cfg.prefix, edges)
            upsert_batch(client, edges_col, edge_pts)

        # Per-note logging (one JSON line per note)
        print(json.dumps({
            "note_id": fm["id"],
            "title": fm.get("title"),
            "chunks": len(chunk_pls),
            "edges": len(edges),
            "changed": True,  # a hash/timestamp comparison could optionally be added here
            "decision": decision,
            "path": note_pl["path"],
        }, ensure_ascii=False))

    print(f"Done. Processed notes: {processed}")
if __name__ == "__main__": if __name__ == "__main__":
main() main()