# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

### Start Applications

```bash
# Start the Chainlit conversational interface (primary UI)
chainlit run chainlit_app.py

# Start the Flask web interface (simple API)
python flask_app.py

# Start the advanced Flask application with full agent APIs
python citu_app.py
```

### Training and Data Management

```bash
# Run the training pipeline with data from the data_pipeline/training_data directory
python -m data_pipeline.trainer.run_training --data_path ./data_pipeline/training_data/

# Complete automated schema workflow (DDL generation → Q&A generation → SQL validation → training data loading)
python -m data_pipeline.schema_workflow \
  --db-connection "postgresql://user:pass@host:port/database_name" \
  --table-list tables.txt \
  --business-context "Business system description" \
  --output-dir ./data_pipeline/training_data/

# Generate only schema documentation, without validation
python -m data_pipeline.schema_workflow \
  --db-connection "postgresql://user:pass@host:port/db_name" \
  --table-list tables.txt \
  --business-context "System description" \
  --skip-validation
```
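The workflow invocations above can also be scripted. A minimal sketch, assuming a hypothetical `build_workflow_command` helper (not part of the repository) whose flag names simply mirror the CLI shown above:

```python
import sys

def build_workflow_command(db_connection, table_list, business_context,
                           output_dir=None, skip_validation=False):
    """Build the argv list for `python -m data_pipeline.schema_workflow`.

    Hypothetical convenience wrapper; pass the result to subprocess.run().
    """
    cmd = [sys.executable, "-m", "data_pipeline.schema_workflow",
           "--db-connection", db_connection,
           "--table-list", table_list,
           "--business-context", business_context]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if skip_validation:
        cmd.append("--skip-validation")
    return cmd

# Documentation-only variant (mirrors the second command above)
cmd = build_workflow_command(
    "postgresql://user:pass@host:port/db_name",
    "tables.txt", "System description", skip_validation=True)
# subprocess.run(cmd, check=True)  # uncomment to actually launch it
```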

### Testing

```bash
# Test QA feedback and conversation management APIs
python test_qa_apis.py

# Test training data management APIs
python test_training_data_apis.py
```

## Core Architecture

### Application Entry Points

- `chainlit_app.py` - Modern conversational UI with streaming responses, fallback mechanisms, and comprehensive error handling
- `citu_app.py` - Production Flask application with full REST APIs for agent queries, conversation management, QA feedback, and health monitoring
- `flask_app.py` - Simple REST API for basic database queries

### Central Configuration

`app_config.py` is the main configuration hub, controlling:

```python
# Multi-provider LLM selection
LLM_MODEL_TYPE = "api"  # "api" or "ollama"
API_LLM_MODEL = "qianwen"  # "qianwen" or "deepseek"

# Vector database selection
VECTOR_DB_TYPE = "pgvector"  # "chromadb" or "pgvector"

# Agent routing behavior
QUESTION_ROUTING_MODE = "hybrid"  # "hybrid", "database_direct", "chat_direct", or "llm_only"

# Feature toggles
ENABLE_RESULT_SUMMARY = True
ENABLE_CONVERSATION_CONTEXT = True
DISPLAY_RESULT_THINKING = False
```
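A misconfigured value in any of these settings only surfaces at runtime, so a small sanity check can catch it early. This is a hedged sketch, not an API the repository provides; the allowed value sets are taken from the comments above:

```python
# Hypothetical sanity check for app_config values; allowed sets mirror
# the documented options, this is not an official validation API.
ALLOWED = {
    "LLM_MODEL_TYPE": {"api", "ollama"},
    "API_LLM_MODEL": {"qianwen", "deepseek"},
    "VECTOR_DB_TYPE": {"chromadb", "pgvector"},
    "QUESTION_ROUTING_MODE": {"hybrid", "database_direct",
                              "chat_direct", "llm_only"},
}

def validate_config(cfg: dict) -> list:
    """Return human-readable errors for out-of-range settings."""
    errors = []
    for key, allowed in ALLOWED.items():
        value = cfg.get(key)
        if value not in allowed:
            errors.append(f"{key}={value!r} not in {sorted(allowed)}")
    return errors

print(validate_config({"LLM_MODEL_TYPE": "api", "API_LLM_MODEL": "qianwen",
                       "VECTOR_DB_TYPE": "pgvector",
                       "QUESTION_ROUTING_MODE": "hybrid"}))  # []
```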

### LLM and Vector Database Combinations

The system supports six LLM + vector database combinations via `common/vanna_combinations.py`:

- QianWen + ChromaDB/PgVector
- DeepSeek + ChromaDB/PgVector
- Ollama + ChromaDB/PgVector

All combinations are created through `core/vanna_llm_factory.py` using the factory pattern.
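The factory pattern here can be sketched as a lookup from the `(llm, vector_db)` pair to a combination class. The class names below are stand-ins for illustration; the real classes live in `common/vanna_combinations.py` and are instantiated by `core/vanna_llm_factory.py`:

```python
# Placeholder combination classes (stand-ins for the real ones).
class QianWenChromaDB: pass
class QianWenPgVector: pass
class DeepSeekChromaDB: pass
class DeepSeekPgVector: pass
class OllamaChromaDB: pass
class OllamaPgVector: pass

# One entry per supported combination -> six in total.
_COMBINATIONS = {
    ("qianwen", "chromadb"): QianWenChromaDB,
    ("qianwen", "pgvector"): QianWenPgVector,
    ("deepseek", "chromadb"): DeepSeekChromaDB,
    ("deepseek", "pgvector"): DeepSeekPgVector,
    ("ollama", "chromadb"): OllamaChromaDB,
    ("ollama", "pgvector"): OllamaPgVector,
}

def create_vanna_instance(llm: str, vector_db: str):
    """Factory entry point: map config strings to a concrete class."""
    try:
        cls = _COMBINATIONS[(llm, vector_db)]
    except KeyError:
        raise ValueError(f"Unsupported combination: {llm} + {vector_db}")
    return cls()
```

The dict-based dispatch keeps adding a new provider to a one-line change, which is the usual motivation for this pattern.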

### Agent System Architecture

`agent/citu_agent.py` implements a LangGraph-based workflow:

```
Question → Classify → [DATABASE path] → SQL Generation → SQL Validation → SQL Execution → Summary
                    → [CHAT path]     → General Chat
```

Routing modes:

- `hybrid` (default) - Intelligent classification between database and chat
- `database_direct` - Skip classification; go straight to SQL generation
- `chat_direct` - Skip classification; go straight to a chat response
- `llm_only` - Classification performed solely by the LLM
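The four routing modes above reduce to a small dispatch function. This is a toy sketch, assuming a keyword-based stand-in classifier; the real agent classifies with more sophisticated logic (and, in `llm_only` mode, delegates the decision to the LLM):

```python
def classify(question: str) -> str:
    """Toy classifier: data-flavored questions route to DATABASE."""
    keywords = ("select", "table", "count", "how many")
    return "DATABASE" if any(k in question.lower() for k in keywords) else "CHAT"

def route(question: str, mode: str = "hybrid") -> str:
    """Dispatch a question according to QUESTION_ROUTING_MODE."""
    if mode == "database_direct":
        return "DATABASE"   # skip classification entirely
    if mode == "chat_direct":
        return "CHAT"       # skip classification entirely
    # "hybrid" and "llm_only" both classify first; llm_only would ask
    # the LLM instead of using local heuristics.
    return classify(question)
```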

### Database Integration

Three-database architecture:

1. Business database (`APP_DB_CONFIG`) - Source data for queries
2. Vector database (`PGVECTOR_CONFIG`) - Training data and embeddings
3. Redis cache (`REDIS_*`) - Conversations, QA results, and the embedding cache

### Training Data Pipeline

Training data is managed in the `data_pipeline/training_data/` directory.

File format mapping:

- `.ddl` files → `train_ddl_statements()`
- `.md`/`.markdown` → `train_documentation_blocks()`
- `_pair.json`/`_pairs.json` → `train_json_question_sql_pairs()`
- `_pair.sql`/`_pairs.sql` → `train_formatted_question_sql_pairs()`
- `.sql` (other) → `train_sql_examples()`
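The mapping above is a suffix-based dispatch; note that the `_pair.sql` patterns must be checked before the catch-all `.sql`. A minimal sketch returning the trainer function name (the names come from the table above; the dispatch function itself is illustrative):

```python
def trainer_for(filename: str) -> str:
    """Map a training-data file to its trainer function name."""
    name = filename.lower()
    if name.endswith(".ddl"):
        return "train_ddl_statements"
    if name.endswith((".md", ".markdown")):
        return "train_documentation_blocks"
    if name.endswith(("_pair.json", "_pairs.json")):
        return "train_json_question_sql_pairs"
    # Must come before the generic .sql check below.
    if name.endswith(("_pair.sql", "_pairs.sql")):
        return "train_formatted_question_sql_pairs"
    if name.endswith(".sql"):
        return "train_sql_examples"
    raise ValueError(f"No trainer registered for {filename}")
```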

## Data Pipeline System

`data_pipeline/` provides automated database reverse engineering:

1. Database Inspector - Automatic schema discovery
2. DDL Generator - PostgreSQL DDL with intelligent comments
3. Documentation Generator - Detailed markdown documentation
4. Q&A Generator (`qa_generation/`) - LLM-generated question-SQL pairs
5. SQL Validator (`validators/`) - EXPLAIN-based validation with auto-repair
6. Training Pipeline (`trainer/`) - Vanna.ai training data ingestion
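The EXPLAIN-based validation in step 5 works by asking the database to *plan* a statement without executing it; planning fails on syntactically or semantically invalid SQL. The sketch below uses the stdlib `sqlite3` module purely for self-containment; the real validator runs `EXPLAIN` against PostgreSQL and attempts LLM-assisted repair on failure:

```python
import sqlite3

def is_valid_sql(conn, sql: str) -> bool:
    """Plan the statement without executing it; bad SQL fails to plan."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
print(is_valid_sql(conn, "SELECT id FROM orders"))  # True
print(is_valid_sql(conn, "SELEC id FRM orders"))    # False
```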

## Key Patterns

### Singleton Pattern

`common/vanna_instance.py` implements a thread-safe singleton for global Vanna instance management.
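A minimal sketch of the thread-safe singleton pattern (double-checked locking around a module-level slot); the real module's internals may differ:

```python
import threading

_instance = None
_lock = threading.Lock()

def get_vanna_instance(factory):
    """Return the global instance, creating it at most once."""
    global _instance
    if _instance is None:            # fast path: no lock once created
        with _lock:
            if _instance is None:    # re-check under the lock
                _instance = factory()
    return _instance
```

The second `is None` check matters: two threads can both pass the first check, but only one creates the instance while the other waits on the lock and then reuses it.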

### Caching Strategy

Multi-layer caching via `common/`:

- `session_aware_cache.py` - Web-session-aware caching
- `embedding_cache_manager.py` - High-performance embedding caching
- `redis_conversation_manager.py` - Conversation lifecycle management
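The embedding cache layer can be sketched as a content-addressed store: key the embedding by a hash of the input text so identical texts never hit the embedding model twice. A toy in-memory version (the real `embedding_cache_manager.py` backs this with Redis, and the `embed_fn` below is a stand-in):

```python
import hashlib

class EmbeddingCache:
    """Toy content-addressed embedding cache (dict-backed, not Redis)."""

    def __init__(self, embed_fn):
        self._embed = embed_fn
        self._store = {}
        self.hits = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1          # cache hit: skip the embedding model
        else:
            self._store[key] = self._embed(text)
        return self._store[key]

cache = EmbeddingCache(lambda t: [float(len(t))])  # stand-in embedder
cache.get("hello"); cache.get("hello")
print(cache.hits)  # 1
```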

### Error Handling

Fallback mechanisms run throughout the stack:

- SQL generation failures → general chat responses
- LLM timeouts → cached responses
- Database connection issues → surfaced via health check endpoints
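The first fallback above (failed SQL generation degrading to a chat response) can be sketched as a try/except chain; handler names and the response shape are illustrative, not the repository's actual API:

```python
def answer(question: str, generate_sql, chat_fallback):
    """Try the SQL path first; degrade gracefully to chat on failure."""
    try:
        sql = generate_sql(question)
        return {"type": "sql", "result": sql}
    except Exception as exc:
        # SQL generation failed -> fall back to a general chat response
        return {"type": "chat",
                "result": chat_fallback(question),
                "fallback_reason": str(exc)}

def broken_sql_gen(q):
    raise RuntimeError("no matching schema")

resp = answer("hi there", broken_sql_gen, lambda q: f"Let's chat about: {q}")
print(resp["type"])  # chat
```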

### Configuration Precedence

1. Environment variables (`.env` file)
2. `app_config.py` defaults
3. Module-specific configs (e.g., `data_pipeline/config.py`)

## Important Notes

- The system requires PostgreSQL for business data and, optionally, PgVector for vector storage
- Redis is essential for conversation management and caching
- Training data generation is resource-intensive and should be run with appropriate database permissions
- The agent system supports both streaming and non-streaming responses, depending on LLM provider capabilities
- Always test configuration changes against the health check endpoints before deploying to production