scripts/import_markdown.py updated
All checks were successful
Deploy mindnet to llm-node / deploy (push) Successful in 5s

This commit is contained in:
Lars 2025-11-07 09:30:20 +01:00
parent e299a497a7
commit f66cdc70b2


@@ -1,449 +1,412 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
=====================================================================
scripts/import_markdown.py · mindnet · WP-03 (Version 3.9.0)
=====================================================================
Markdown → Qdrant (Notes, Chunks, Edges)

Highlights over the minimal variant:
- Hash Option C (body/frontmatter/full × parsed/raw × normalize)
- Baseline mode (write missing signatures on first run)
- Purge before upsert (changed note only: delete its old chunks/edges)
- UTF-8 error-tolerant parser (Latin-1 re-encode fallback)
- Type registry: dynamic chunk profiles (optional)
- Include/exclude filters, single-file import (--path), skip rules
- embedding_exclude respected
- NDJSON logging and final statistics

Invocation examples:
  # Dry run (shows decisions)
  python3 -m scripts.import_markdown --vault ./vault --prefix mindnet

  # Apply + purge for changed notes
  python3 -m scripts.import_markdown --vault ./vault --prefix mindnet --apply --purge-before-upsert

  # Additionally create note-scope refs
  python3 -m scripts.import_markdown --vault ./vault --apply --note-scope-refs

  # Enable embeddings (endpoint can be overridden via ENV)
  python3 -m scripts.import_markdown --vault ./vault --apply --with-embeddings

  # Schema validation (use the *.schema.json files)
  python3 -m scripts.import_markdown --vault ./vault --apply --validate-schemas \
      --note-schema ./schemas/note.schema.json \
      --chunk-schema ./schemas/chunk.schema.json \
      --edge-schema ./schemas/edge.schema.json

  # Import a single file only
  python3 -m scripts.import_markdown --path ./vault/40_concepts/concept-alpha.md --apply

  # Show version
  python3 -m scripts.import_markdown --version

ENV (excerpt):
  COLLECTION_PREFIX            Prefix of the Qdrant collections (default: mindnet)
  QDRANT_URL / QDRANT_API_KEY  Qdrant connection
  # Hash control
  MINDNET_HASH_COMPARE         body | frontmatter | full (default: body)
  MINDNET_HASH_SOURCE          parsed | raw (default: parsed)
  MINDNET_HASH_NORMALIZE       canonical | whitespace | none (default: canonical)
  # Embeddings (only with --with-embeddings)
  EMBED_URL                    e.g. http://127.0.0.1:8000/embed
  EMBED_MODEL                  free text (logging only)
  EMBED_BATCH                  batch size (default: 16)

Backward compatibility:
- Fields and flows from v3.7.x are preserved.
- New features are optional (default OFF).
- Existing IDs/signatures unchanged.

License: MIT (project-internal)
"""
__version__ = "3.9.0"

import os
import sys
import re
import json
import argparse
import pathlib
from typing import Any, Dict, List, Optional, Iterable, Tuple

# Core building blocks (existing)
from app.core.parser import read_markdown
from app.core.note_payload import make_note_payload
from app.core.chunk_payload import make_chunk_payloads
from app.core.derive_edges import build_edges_for_note
from app.core.qdrant import get_client, QdrantConfig
from app.core.qdrant_points import (
    ensure_collections_for_prefix,
    upsert_notes, upsert_chunks, upsert_edges,
    delete_chunks_of_note, delete_edges_of_note,
    fetch_note_hash_signature, store_note_hashes_signature,
)
from app.core.type_registry import load_type_registry  # optional
# ---------------------------
# Hash Option C controls
# ---------------------------
DEFAULT_COMPARE = os.environ.get("MINDNET_HASH_COMPARE", "body").lower()
DEFAULT_SOURCE = os.environ.get("MINDNET_HASH_SOURCE", "parsed").lower()
DEFAULT_NORM = os.environ.get("MINDNET_HASH_NORMALIZE", "canonical").lower()

VALID_COMPARE = {"body", "frontmatter", "full"}
VALID_SOURCE = {"parsed", "raw"}
VALID_NORM = {"canonical", "whitespace", "none"}


def _active_hash_key(compare: str, source: str, normalize: str) -> str:
    c = compare if compare in VALID_COMPARE else "body"
    s = source if source in VALID_SOURCE else "parsed"
    n = normalize if normalize in VALID_NORM else "canonical"
    return f"{c}:{s}:{n}"
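The active hash key selects which of the parallel Option-C signatures is compared; the actual hashing happens inside `make_note_payload`. A minimal standalone sketch of how such a key and a body signature could be derived (the `body_signature` normalization shown here is a hypothetical stand-in, not the project's real canonicalization):

```python
import hashlib

def hash_key(compare: str, source: str, normalize: str) -> str:
    # Mirrors _active_hash_key: "<mode>:<source>:<normalize>"
    return f"{compare}:{source}:{normalize}"

def body_signature(body: str, normalize: str = "canonical") -> str:
    # Hypothetical "canonical" normalization: strip trailing spaces per line,
    # trim outer blank lines, enforce a single trailing newline.
    if normalize == "canonical":
        text = "\n".join(ln.rstrip() for ln in body.splitlines()).strip() + "\n"
    elif normalize == "whitespace":
        text = " ".join(body.split())
    else:
        text = body
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

key = hash_key("body", "parsed", "canonical")
# Trailing whitespace does not change the canonical signature:
sig_a = body_signature("# Title\n\ntext   \n")
sig_b = body_signature("# Title\n\ntext\n")
```

Because only the key for the currently active mode is compared, switching modes leaves the other stored signatures untouched.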
# ---------------------------
# Schema validation (optional)
# ---------------------------
def _load_json(path: Optional[str]) -> Optional[Dict[str, Any]]:
    if not path:
        return None
    p = pathlib.Path(path)
    if not p.exists():
        return None
    with p.open("r", encoding="utf-8") as f:
        return json.load(f)


def _validate(obj: Dict[str, Any], schema: Optional[Dict[str, Any]], kind: str) -> List[str]:
    """Rough validation without a hard dependency on jsonschema; checks basic fields."""
    if not schema:
        return []
    errs: List[str] = []
    # Very simple check against 'required' only:
    req = schema.get("required", [])
    for k in req:
        if k not in obj:
            errs.append(f"{kind}: missing required '{k}'")
    # type=object etc. are deliberately skipped (lightweight by design).
    return errs
# ---------------------------
# Embedding (optional)
# ---------------------------
def _post_json(url: str, payload: Any, timeout: float = 60.0) -> Any:
    """Simple HTTP client without external dependencies."""
    import urllib.request
    import urllib.error
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.URLError as e:
        raise RuntimeError(f"embed http error: {e}")


def _embed_texts(url: str, texts: List[str], batch: int = 16) -> List[List[float]]:
    out: List[List[float]] = []
    for i in range(0, len(texts), batch):
        chunk = texts[i:i + batch]
        resp = _post_json(url, {"inputs": chunk})
        vectors = resp.get("embeddings") or resp.get("data") or resp  # tolerate either shape
        if not isinstance(vectors, list):
            raise RuntimeError("embed response malformed")
        out.extend(vectors)
    return out
# ---------------------------
# Skip rules & file selection
# ---------------------------
SILVERBULLET_BASENAMES = {"CONFIG.md", "index.md"}  # skipped explicitly


def _should_skip_md(path: str) -> bool:
    base = os.path.basename(path).lower()
    if base in {b.lower() for b in SILVERBULLET_BASENAMES}:
        return True
    return False


def _list_md_files(root: str, include: Optional[str] = None, exclude: Optional[str] = None) -> List[str]:
    files: List[str] = []
    inc_re = re.compile(include) if include else None
    exc_re = re.compile(exclude) if exclude else None
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            if not fn.lower().endswith(".md"):
                continue
            full = os.path.join(dirpath, fn)
            rel = os.path.relpath(full, root).replace("\\", "/")
            if _should_skip_md(full):
                continue
            if inc_re and not inc_re.search(rel):
                continue
            if exc_re and exc_re.search(rel):
                continue
            files.append(full)
    files.sort()
    return files
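The `--include`/`--exclude` regexes apply to the forward-slash relative path and use `search()`, so a pattern may match anywhere in the path. The filter, factored out over a plain list of relative paths for illustration:

```python
import re
from typing import List, Optional

def filter_paths(rels: List[str], include: Optional[str], exclude: Optional[str]) -> List[str]:
    # search(), not match(): patterns may hit anywhere in the relative path.
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    out = [r for r in rels
           if (inc is None or inc.search(r)) and not (exc and exc.search(r))]
    return sorted(out)

rels = ["30_projects/demo.md", "40_concepts/alpha.md", "_imported/old.md"]
kept = filter_paths(rels, include=r"concepts|projects", exclude=r"^_imported/")
```

Anchor the exclude pattern with `^` when it should only hit top-level directories, as in the `_imported/` example.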
# ---------------------------
# CLI
# ---------------------------
def _args() -> argparse.Namespace:
    ap = argparse.ArgumentParser(description="Import Obsidian Markdown → Qdrant (Notes/Chunks/Edges).")
    gsrc = ap.add_mutually_exclusive_group(required=True)
    gsrc.add_argument("--vault", help="Root directory of the vault")
    gsrc.add_argument("--path", help="Import a single Markdown file only")

    ap.add_argument("--prefix", help="Collection prefix (ENV: COLLECTION_PREFIX, default: mindnet)")
    ap.add_argument("--apply", action="store_true", help="Write changes to Qdrant (otherwise dry run)")
    ap.add_argument("--purge-before-upsert", action="store_true", help="For a changed note: delete its old chunks/edges first (this note only)")
    ap.add_argument("--note-scope-refs", action="store_true", help="Also create note-scope 'references' + 'backlink' edges")
    ap.add_argument("--baseline-modes", action="store_true", help="Store missing hash signatures initially")

    # Filters
    ap.add_argument("--include", help="Regex on the relative path (only matching files)")
    ap.add_argument("--exclude", help="Regex on the relative path (skip matching files)")

    # Validation
    ap.add_argument("--validate-schemas", action="store_true", help="Check JSON schemas (lightweight)")
    ap.add_argument("--note-schema", help="Path to note.schema.json")
    ap.add_argument("--chunk-schema", help="Path to chunk.schema.json")
    ap.add_argument("--edge-schema", help="Path to edge.schema.json")

    # Embeddings (optional)
    ap.add_argument("--with-embeddings", action="store_true", help="Generate embeddings for note & chunks")
    ap.add_argument("--embed-url", help="Override EMBED_URL (default from ENV)")
    ap.add_argument("--embed-batch", type=int, default=int(os.environ.get("EMBED_BATCH", "16")), help="Embedding batch size")

    ap.add_argument("--version", action="store_true", help="Print version and exit")
    return ap.parse_args()
# ---------------------------
# Main logic
# ---------------------------
def main() -> None:
    args = _args()
    if args.version:
        print(f"import_markdown.py {__version__}")
        sys.exit(0)

    # Qdrant
    prefix = args.prefix or os.environ.get("COLLECTION_PREFIX", "mindnet")
    qc = QdrantConfig.from_env_or_default()
    client = get_client(qc)
    notes_col, chunks_col, edges_col = ensure_collections_for_prefix(client, prefix)

    # Type registry (optional; falls back to the default)
    type_reg = load_type_registry(silent=True)

    # Active hash mode
    compare = DEFAULT_COMPARE
    source = DEFAULT_SOURCE
    norm = DEFAULT_NORM
    active_key = _active_hash_key(compare, source, norm)

    # Schemas (optional)
    note_schema = _load_json(args.note_schema) if args.validate_schemas else None
    chunk_schema = _load_json(args.chunk_schema) if args.validate_schemas else None
    edge_schema = _load_json(args.edge_schema) if args.validate_schemas else None

    # Embeddings (optional)
    embed_enabled = bool(args.with_embeddings)
    embed_url = args.embed_url or os.environ.get("EMBED_URL", "").strip()
    if embed_enabled and not embed_url:
        print(json.dumps({"warn": "with-embeddings active, but EMBED_URL not configured — embeddings skipped"}))
        embed_enabled = False

    # File list
    files: List[str] = []
    if args.path:
        if not os.path.isfile(args.path):
            print(json.dumps({"path": args.path, "error": "not a file"}))
            sys.exit(1)
        if _should_skip_md(args.path):
            print(json.dumps({"path": args.path, "skipped": "by rule"}))
            sys.exit(0)
        files = [os.path.abspath(args.path)]
        vault_root = os.path.dirname(os.path.abspath(args.path))
    else:
        if not os.path.isdir(args.vault):
            print(json.dumps({"vault": args.vault, "error": "not a directory"}))
            sys.exit(1)
        vault_root = os.path.abspath(args.vault)
        files = _list_md_files(vault_root, include=args.include, exclude=args.exclude)

    processed = 0
    stats = {"notes": 0, "chunks": 0, "edges": 0, "changed": 0, "skipped": 0, "embedded": 0}
    for path in files:
        rel_path = os.path.relpath(path, vault_root).replace("\\", "/")
        parsed = read_markdown(path)

        # Note payload (incl. fulltext, hashes[...], etc.)
        note_pl = make_note_payload(parsed, vault_root=vault_root)
        if not isinstance(note_pl, dict):
            print(json.dumps({
                "path": path, "note_id": getattr(parsed, "id", "<unknown>"),
                "error": "make_note_payload returned non-dict", "returned_type": type(note_pl).__name__
            }))
            stats["skipped"] += 1
            continue

        # Excluded via frontmatter?
        if str(note_pl.get("embedding_exclude", "false")).lower() in {"1", "true", "yes"}:
            # we still import note/chunks/edges, but WITHOUT embeddings
            embedding_allowed = False
        else:
            embedding_allowed = True

        # Type profile
        note_type = str(note_pl.get("type", "concept") or "concept")
        profile = type_reg.get("types", {}).get(note_type, {}).get("chunk_profile", None)

        # Build chunks
        chunks = make_chunk_payloads(
            note_id=note_pl["note_id"],
            body=note_pl.get("fulltext", ""),
            note_type=note_type,
            profile=profile
        )

        # Edges
        edges: List[Dict[str, Any]] = []
        try:
            edges = build_edges_for_note(note_payload=note_pl, chunks=chunks, add_note_scope_refs=args.note_scope_refs)
        except Exception as e:
            print(json.dumps({
                "path": path, "note_id": note_pl["note_id"],
                "error": f"build_edges_for_note failed: {getattr(e, 'args', [''])[0]}"
            }))
            edges = []

        # Schema checks (soft)
        if args.validate_schemas:
            n_err = _validate(note_pl, note_schema, "note")
            for c in chunks:
                n_err += _validate(c, chunk_schema, "chunk")
            for ed in edges:
                n_err += _validate(ed, edge_schema, "edge")
            if n_err:
                print(json.dumps({"note_id": note_pl["note_id"], "schema_warnings": n_err}, ensure_ascii=False))

        # Hash comparison
        prev_sig = fetch_note_hash_signature(client, notes_col, note_pl["note_id"], active_key)
        curr_sig = note_pl.get("hashes", {}).get(active_key, "")
        is_changed = (prev_sig != curr_sig)

        # Baseline: store a missing active signature
        if args.baseline_modes and not prev_sig and curr_sig and args.apply:
            store_note_hashes_signature(client, notes_col, note_pl["note_id"], active_key, curr_sig)

        # Embeddings (optional; only AFTER change detection, to avoid unnecessary calls)
        if embed_enabled and embedding_allowed:
            try:
                texts = [note_pl.get("fulltext", "")]
                note_vecs = _embed_texts(embed_url, texts, batch=max(1, int(args.embed_batch)))
                note_pl["embedding"] = note_vecs[0] if note_vecs else None
                # Chunk embeddings
                chunk_texts = [c.get("window") or c.get("text") or "" for c in chunks]
                if chunk_texts:
                    chunk_vecs = _embed_texts(embed_url, chunk_texts, batch=max(1, int(args.embed_batch)))
                    for c, v in zip(chunks, chunk_vecs):
                        c["embedding"] = v
                stats["embedded"] += 1
            except Exception as e:
                print(json.dumps({"note_id": note_pl["note_id"], "warn": f"embedding failed: {e}"}))

        # Apply/upsert
        decision = "dry-run"
        if args.apply:
            if is_changed and args.purge_before_upsert:
                delete_chunks_of_note(client, chunks_col, note_pl["note_id"])
                delete_edges_of_note(client, edges_col, note_pl["note_id"])
            upsert_notes(client, notes_col, [note_pl])
            if chunks:
                upsert_chunks(client, chunks_col, chunks)
            if edges:
                upsert_edges(client, edges_col, edges)
            if curr_sig:
                store_note_hashes_signature(client, notes_col, note_pl["note_id"], active_key, curr_sig)
            decision = ("apply" if is_changed else "apply-skip-unchanged")
        else:
            decision = "dry-run"

        # Log
        print(json.dumps({
            "note_id": note_pl["note_id"],
            "title": note_pl.get("title"),
            "chunks": len(chunks),
            "edges": len(edges),
            "changed": bool(is_changed),
            "decision": decision,
            "path": rel_path,
            "hash_mode": compare,
            "hash_normalize": norm,
            "hash_source": source,
            "prefix": prefix
        }, ensure_ascii=False))

        stats["notes"] += 1
        stats["chunks"] += len(chunks)
        stats["edges"] += len(edges)
        if is_changed:
            stats["changed"] += 1
        processed += 1
    print(f"Done. Processed notes: {processed}")
    print(json.dumps({"stats": stats}, ensure_ascii=False))
if __name__ == "__main__": if __name__ == "__main__":
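The importer emits one NDJSON line per note plus a final `{"stats": ...}` object, so downstream tooling can consume stdout line by line. A sketch of parsing such a log (the sample records use the field names of the summary dict above; the values are invented):

```python
import json

records = [
    {"note_id": "n-001", "chunks": 3, "edges": 5, "changed": True, "decision": "apply"},
    {"note_id": "n-002", "chunks": 1, "edges": 0, "changed": False, "decision": "apply-skip-unchanged"},
    {"stats": {"notes": 2, "chunks": 4, "edges": 5, "changed": 1, "skipped": 0, "embedded": 0}},
]
log = "\n".join(json.dumps(r) for r in records)  # what the importer prints

changed_ids = []
totals = None
for line in log.splitlines():
    rec = json.loads(line)
    if "stats" in rec:
        totals = rec["stats"]          # final summary object
    elif rec.get("changed"):
        changed_ids.append(rec["note_id"])
```

Because every line is a self-contained JSON object, the same loop works on a live `--apply` run piped through a file or a socket.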