docs/wp04_retriever_scoring.md added
This commit is contained in:
parent
9d9239b11e
commit
413dca770c
260
docs/wp04_retriever_scoring.md
Normal file
# WP-04 / Step 4a: Retriever Scoring & Configuration

This document describes the current state (2025-11-30) of the scoring logic of the mindnet retriever in WP-04 / Step 4a.
It serves as a reference for:

- traceability of the ranking
- configuration and fine-tuning (without code changes)
- later automation / "self-adjustment" of the retriever

---

## 1. Overview

The retriever combines three signal sources into a single score:

1. **Semantics**: vector search over `mindnet_chunks`
2. **Type weighting**: `retriever_weight` per note/chunk (from `types.yaml`)
3. **Graph bonuses**: `edge_bonus` and `centrality_bonus` from the subgraph (edges)

The computation is implemented in `app/core/retriever.py` and is configurable via `config/retriever.yaml`.
The API exposes the functionality through the `/query` endpoint (FastAPI).

---

## 2. Data Basis (Brief Overview)

### 2.1 Notes (`<prefix>_notes`)

Relevant fields:

- note_id: keyword
- title: text
- type: keyword
- retriever_weight: float
- chunk_profile: keyword
- edge_defaults: keyword[]
- tags: keyword[]
- path: text
- fulltext: text

### 2.2 Chunks (`<prefix>_chunks`)

Relevant fields:

- note_id: keyword
- chunk_id: keyword
- text: text
- window: text
- retriever_weight: float
- type: keyword
- path: text
- section: text
- neighbors_prev / neighbors_next
- chunk_profile: keyword

### 2.3 Edges (`<prefix>_edges`)

Relevant fields:

- kind: keyword
- source_id: keyword
- target_id: keyword
- note_id: keyword
- confidence: float
- rule_id, provenance, edge_id, scope, relation, ref_text

Edges always originate from the import/edge pipeline (WP-03) and form the graph-based foundation of the retriever.

---

## 3. Scoring Formula (Retriever)

### 3.1 Input Quantities

The following values are available for each hit:

- `semantic_score`: from the vector search (Qdrant)
- `retriever_weight`: type-dependent weight (from `types.yaml`)
- `edge_bonus`: aggregated graph bonus
- `centrality_bonus`: centrality bonus from the subgraph

### 3.2 Global Weights (from `retriever.yaml`)

File: `config/retriever.yaml`
Example:

    version: 1.0

    scoring:
      semantic_weight: 1.0    # W_sem
      edge_weight: 0.7        # W_edge
      centrality_weight: 0.5  # W_cent

If this file is missing, ENV-based defaults (`RETRIEVER_W_*`) apply.
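
The fallback order described above could look roughly like the following sketch. It is not the code in `app/core/retriever.py`: the concrete variable names (`RETRIEVER_W_SEM`, `RETRIEVER_W_EDGE`, `RETRIEVER_W_CENT`), the per-key fallback, and the hard-coded last-resort values are assumptions, since the document only states the `RETRIEVER_W_*` pattern. The already-parsed YAML content is passed in as a dict so that the example needs no YAML library.

```python
import os
from typing import Mapping, Optional, Tuple

# Last-resort values; assumed to match the documented example config.
FALLBACKS = {"semantic_weight": 1.0, "edge_weight": 0.7, "centrality_weight": 0.5}

# Hypothetical variable names: the document only gives the pattern RETRIEVER_W_*.
ENV_NAMES = {
    "semantic_weight": "RETRIEVER_W_SEM",
    "edge_weight": "RETRIEVER_W_EDGE",
    "centrality_weight": "RETRIEVER_W_CENT",
}


def resolve_weights(
    config: Optional[dict],
    env: Mapping[str, str] = os.environ,
) -> Tuple[float, float, float]:
    """Return (W_sem, W_edge, W_cent) from parsed YAML, else ENV, else fallback."""
    scoring = (config or {}).get("scoring") or {}
    values = []
    for key, env_name in ENV_NAMES.items():
        if key in scoring:
            # A value from retriever.yaml takes precedence.
            values.append(float(scoring[key]))
        else:
            # File or key missing: use the ENV value, else the fallback.
            values.append(float(env.get(env_name, FALLBACKS[key])))
    return tuple(values)


# File present: values come from the parsed YAML.
print(resolve_weights({"scoring": {"semantic_weight": 1.0, "edge_weight": 0.7, "centrality_weight": 0.5}}, env={}))
# File missing: ENV values are used where set.
print(resolve_weights(None, env={"RETRIEVER_W_EDGE": "0.3"}))
```

Passing `env` explicitly keeps the function testable without modifying the process environment.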

### 3.3 Formula

    total_score =
        W_sem  * semantic_score * max(retriever_weight, 0.0)
      + W_edge * edge_bonus
      + W_cent * centrality_bonus

Remarks:

- Negative `retriever_weight` values are clamped to 0 via `max(retriever_weight, 0.0)`.
- The raw `edge_bonus` and `centrality_bonus` values are shown in the API/smoke output.
- The weights from `retriever.yaml` affect **only the total_score**, not the displayed raw values.
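
As a minimal sketch, the formula can be written as a single Python function. The function name `total_score` and its keyword defaults (taken from the example `retriever.yaml`) are illustrative; the actual implementation in `app/core/retriever.py` may be structured differently.

```python
def total_score(
    semantic_score: float,
    retriever_weight: float,
    edge_bonus: float = 0.0,
    centrality_bonus: float = 0.0,
    w_sem: float = 1.0,
    w_edge: float = 0.7,
    w_cent: float = 0.5,
) -> float:
    """Combine semantics, type weight, and graph bonuses as in section 3.3."""
    # Negative type weights are clamped to 0, so they can remove the
    # semantic contribution but never turn it negative.
    type_weight = max(retriever_weight, 0.0)
    return (
        w_sem * semantic_score * type_weight
        + w_edge * edge_bonus
        + w_cent * centrality_bonus
    )


# 1.0 * 0.8 * 0.6 + 0.7 * 0.2 + 0.5 * 0.1
print(round(total_score(0.8, 0.6, edge_bonus=0.2, centrality_bonus=0.1), 4))
# A negative type weight leaves only the graph part: 0.7 * 0.2
print(round(total_score(0.8, -1.0, edge_bonus=0.2), 4))
```

Note that the type weight scales only the semantic term; the graph bonuses are added independently of the note type.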

---

## 4. Configuration Files

### 4.1 Type Registry (`types.yaml`)

Example excerpt:

    defaults:
      retriever_weight: 1.0
      chunk_profile: default

    types:
      concept:
        retriever_weight: 0.60
        edge_defaults: ["references", "related_to"]

      project:
        retriever_weight: 0.97
        edge_defaults: ["references", "depends_on"]

      journal:
        retriever_weight: 0.80
        edge_defaults: ["references", "related_to"]

      source:
        retriever_weight: 0.50
        edge_defaults: []

These values determine how important a note/chunk is **in principle** for the retriever.
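
To illustrate how such a registry might be consulted, the sketch below looks up the `retriever_weight` for a note type. The registry is given as an already-parsed dict mirroring the excerpt above; the helper name `retriever_weight_for` and the rule that unknown types fall back to the `defaults` block are assumptions, not confirmed behaviour of the import pipeline.

```python
# Parsed form of the types.yaml excerpt (edge_defaults omitted for brevity).
TYPES_REGISTRY = {
    "defaults": {"retriever_weight": 1.0, "chunk_profile": "default"},
    "types": {
        "concept": {"retriever_weight": 0.60},
        "project": {"retriever_weight": 0.97},
        "journal": {"retriever_weight": 0.80},
        "source": {"retriever_weight": 0.50},
    },
}


def retriever_weight_for(note_type: str, registry: dict = TYPES_REGISTRY) -> float:
    """Return the type-specific weight, or the registry default if absent."""
    default = registry.get("defaults", {}).get("retriever_weight", 1.0)
    entry = registry.get("types", {}).get(note_type) or {}
    return float(entry.get("retriever_weight", default))


print(retriever_weight_for("project"))   # type-specific value
print(retriever_weight_for("unknown"))   # assumed fallback to defaults
```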

### 4.2 Retriever Configuration (`retriever.yaml`)

Example:

    version: 1.0

    scoring:
      semantic_weight: 1.0
      edge_weight: 0.7
      centrality_weight: 0.5

    # Blueprint for later fine-grained edge control:
    edge_types:
      references: 0.20
      depends_on: 0.18
      related_to: 0.15
      similar_to: 0.12
      belongs_to: 0.10
      next: 0.06
      prev: 0.06

By default, the file is located at `config/retriever.yaml`.
Override via ENV:

    MINDNET_RETRIEVER_CONFIG=/path/to/file.yaml
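
The override can be pictured with a small path-resolution helper. This is only a sketch: the function name `retriever_config_path` is hypothetical, and it assumes that the default path is interpreted relative to the working directory of the service.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_CONFIG = Path("config/retriever.yaml")


def retriever_config_path(env: Mapping[str, str] = os.environ) -> Path:
    """Return the ENV override if set and non-empty, else the default path."""
    override = env.get("MINDNET_RETRIEVER_CONFIG")
    return Path(override) if override else DEFAULT_CONFIG


print(retriever_config_path({}))
print(retriever_config_path({"MINDNET_RETRIEVER_CONFIG": "/etc/mindnet/retriever.yaml"}))
```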

### 4.3 Reading the Active Weights

    python - << 'PY'
    from app.core.retriever import _get_scoring_weights
    print(_get_scoring_weights())
    PY

Example output:

    (1.0, 0.7, 0.5)

---

## 5. Operating Modes

### 5.1 Semantic Mode (`semantic`)

Flow:

- Query vector via `embed_text()`
- Qdrant search over `mindnet_chunks`
- No subgraph, so `edge_bonus = centrality_bonus = 0`
- `total_score = W_sem * semantic_score * max(retriever_weight, 0.0)`

### 5.2 Hybrid Mode (`hybrid`)

Flow:

- Semantics as above
- If `expand.depth > 0`:
  - Subgraph computation via `ga.expand()`
  - Determination of `edge_bonus` and `centrality_bonus`
- `total_score` = combined formula from semantics + graph
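
The difference between the two modes can be made concrete with a toy ranking. In the sketch below, the `Hit` class, the `rank` function, and the sample values are illustrative only; the graph bonuses are supplied as precomputed fields, whereas the real retriever derives them from the expanded subgraph.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    note_id: str
    semantic_score: float
    retriever_weight: float
    edge_bonus: float = 0.0
    centrality_bonus: float = 0.0


def rank(
    hits: List[Hit],
    mode: str,
    expand_depth: int = 0,
    weights: Tuple[float, float, float] = (1.0, 0.7, 0.5),
) -> List[str]:
    """Return note ids ordered by total score for the given mode."""
    w_sem, w_edge, w_cent = weights
    # Graph bonuses only count in hybrid mode with an expanded subgraph.
    use_graph = mode == "hybrid" and expand_depth > 0

    def score(hit: Hit) -> float:
        total = w_sem * hit.semantic_score * max(hit.retriever_weight, 0.0)
        if use_graph:
            total += w_edge * hit.edge_bonus + w_cent * hit.centrality_bonus
        return total

    return [h.note_id for h in sorted(hits, key=score, reverse=True)]


hits = [
    Hit("a", semantic_score=0.9, retriever_weight=0.5),                  # 0.45
    Hit("b", semantic_score=0.5, retriever_weight=0.8, edge_bonus=0.5),  # 0.40 (+0.35 with graph)
]
print(rank(hits, "semantic"))
print(rank(hits, "hybrid", expand_depth=1))
```

With these sample values, the well-connected note `b` overtakes `a` once the graph part is included, which is the intended effect of the hybrid mode.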

---

## 6. Tests

### Running the Tests (Unit & Integration)

    pytest tests/test_retriever_basic.py \
           tests/test_retriever_weight.py \
           tests/test_retriever_edges.py

    pytest tests/test_query_unit.py \
           tests/test_query_text_embed_unit.py

The tests cover:

- Semantic hits
- Type weights
- Edge bonuses
- Hybrid mode / FastAPI endpoint

### Smoke Test

    python tests/test_retriever_smoke.py \
      --url "http://127.0.0.1:8001/query" \
      --query "embeddings" \
      --mode hybrid \
      --expand-depth 1 \
      --top-k 5

Example output:

    [1] note_id=...retriever-design
        total=1.5054 (semantic=0.4038 edge=1.5910 centrality=0.0000)

Interpretation:

- `semantic`, `edge`, `centrality` = raw values
- `total` = scaled with the weights from `retriever.yaml`
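
Because the output shows raw values while `total` includes the weights, the type weight applied to a hit can be recovered by inverting the formula from section 3.3. The sketch below does this for the example line; it assumes the example weights `(1.0, 0.7, 0.5)` were active and that `semantic` is printed before type weighting. The helper name `implied_type_weight` is illustrative.

```python
import re

LINE = "total=1.5054 (semantic=0.4038 edge=1.5910 centrality=0.0000)"


def implied_type_weight(
    line: str, w_sem: float = 1.0, w_edge: float = 0.7, w_cent: float = 0.5
) -> float:
    """Solve the scoring formula for retriever_weight given one output line."""
    # Collect all key=value pairs, e.g. {"total": 1.5054, "semantic": 0.4038, ...}
    fields = {k: float(v) for k, v in re.findall(r"(\w+)=([-+]?\d+(?:\.\d+)?)", line)}
    if fields["semantic"] == 0.0:
        raise ValueError("type weight cannot be inferred when semantic is 0")
    graph_part = w_edge * fields["edge"] + w_cent * fields["centrality"]
    return (fields["total"] - graph_part) / (w_sem * fields["semantic"])


print(round(implied_type_weight(LINE), 2))  # 0.97
```

The result of about 0.97 matches the `retriever_weight` of the `project` type in the `types.yaml` excerpt, which suggests (but does not prove) that the example hit is a project note.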

---

## 7. Operational Notes

### 7.1 Changing the Weights

- Adjust `retriever.yaml`
- Restart FastAPI/uvicorn
- Check the result with the smoke test

### 7.2 Target State for WP-04

The current state enables:

- transparent, configurable score computation
- a combination of semantics, type knowledge, and graph structure
- later agent-based self-adjustment
- alternative profiles (e.g. "more semantics", "more graph")

This document serves as a stable reference for developers, tests, and later automation steps.