
Development of the automatic generation of the question-SQL pair file qs_pair.json from the DDL and MD files is complete; next up is the module that automatically validates the SQL inside the JSON.

wangxq 1 week ago
parent
commit
3e8e6a5cb9

+ 353 - 0
docs/Schema Tools 使用说明.md

@@ -0,0 +1,353 @@
+# Schema Tools User Guide
+
+## Contents
+
+1. [Feature Overview](#1-feature-overview)
+2. [Installation and Configuration](#2-installation-and-configuration)
+3. [Generating DDL and MD Documents](#3-generating-ddl-and-md-documents)
+4. [Generating Question-SQL Training Data](#4-generating-question-sql-training-data)
+5. [Configuration Reference](#5-configuration-reference)
+6. [FAQ](#6-faq)
+
+## 1. Feature Overview
+
+Schema Tools provides two main functions:
+
+### 1.1 DDL and MD document generation
+- Connects to PostgreSQL databases automatically
+- Processes table lists in batch
+- Uses an LLM to generate Chinese comments
+- Detects enum-like fields automatically
+- Produces standardized DDL and MD documents
+
+### 1.2 Question-SQL training data generation
+- Validates that the DDL and MD files are complete
+- Analyzes table structures to extract business topics
+- Generates high-quality Question-SQL pairs for each topic
+- Supports resume-after-interruption and parallel processing
+
+## 2. Installation and Configuration
+
+### 2.1 Installing dependencies
+
+```bash
+pip install asyncpg
+```
+
+`asyncio` is part of the Python standard library and does not need to be installed separately.
+
+### 2.2 Basic configuration
+
+Schema Tools reuses the project's existing LLM configuration; no separate database connection configuration is required.
+
+## 3. Generating DDL and MD Documents
+
+### 3.1 Command format
+
+```bash
+python -m schema_tools \
+  --db-connection <database connection string> \
+  --table-list <table list file> \
+  --business-context <business context> \
+  [optional arguments]
+```
+
+### 3.2 Required arguments
+
+| Argument | Description | Example |
+|------|------|------|
+| `--db-connection` | PostgreSQL connection string | `postgresql://user:pass@localhost:5432/dbname` |
+| `--table-list` | Path to the table list file | `./tables.txt` |
+| `--business-context` | Business context description | `"高速公路服务区管理系统"` |
+
+### 3.3 Optional arguments
+
+| Argument | Description | Default |
+|------|------|--------|
+| `--output-dir` | Output directory | `training/generated_data` |
+| `--pipeline` | Pipeline type | `full` |
+| `--max-concurrent` | Maximum number of tables processed concurrently | `3` |
+| `--verbose` | Enable verbose logging | `False` |
+| `--log-file` | Log file path | none |
+| `--no-filter-system-tables` | Disable system table filtering | `False` |
+| `--check-permissions-only` | Only check database permissions | `False` |
+
+### 3.4 Pipeline types
+
+- **full**: complete pipeline (default) - generates both DDL and MD documents
+- **ddl_only**: generates DDL files only
+- **analysis_only**: analysis only, no files generated
+
+### 3.5 Usage examples
+
+#### Basic usage
+```bash
+python -m schema_tools \
+  --db-connection "postgresql://postgres:postgres@localhost:6432/highway_db" \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统"
+```
+
+#### Specifying the output directory and enabling verbose logging
+```bash
+python -m schema_tools \
+  --db-connection "postgresql://postgres:postgres@localhost:6432/highway_db" \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --output-dir ./output \
+  --verbose
+```
+
+#### Generating DDL files only
+```bash
+python -m schema_tools \
+  --db-connection "postgresql://postgres:postgres@localhost:6432/highway_db" \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --pipeline ddl_only
+```
+
+#### Permission check
+```bash
+python -m schema_tools \
+  --db-connection "postgresql://postgres:postgres@localhost:6432/highway_db" \
+  --check-permissions-only
+```
+
+### 3.6 Table list file format
+
+Create a text file (e.g. `tables.txt`) with one table name per line:
+
+```text
+# This is a comment line
+public.bss_service_area
+public.bss_company
+bss_car_day_count  # defaults to the public schema
+hr.employees       # schema specified explicitly
+```
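+
+For reference, a minimal sketch of how such a table list can be parsed. The real parser lives inside schema_tools (the validator in the detailed design calls `self.table_parser.parse_file`), so treat this standalone helper as a hypothetical illustration of the rules above, not the actual implementation:
+
+```python
+from pathlib import Path
+
+def parse_table_list(path: str) -> list[str]:
+    """Hypothetical sketch: one table per line, '#' starts a comment,
+    names without a schema default to 'public', duplicates are dropped."""
+    tables: list[str] = []
+    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
+        name = raw_line.split("#", 1)[0].strip()  # strip comments and whitespace
+        if not name:
+            continue  # skip blank and comment-only lines
+        if "." not in name:
+            name = f"public.{name}"  # default to the public schema
+        if name not in tables:  # de-duplicate while preserving order
+            tables.append(name)
+    return tables
+```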
+
+### 3.7 Output files
+
+All generated files are placed directly in the output directory (no subdirectories are created):
+
+```
+output/
+├── bss_service_area.ddl              # DDL file
+├── bss_service_area_detail.md        # MD document
+├── bss_company.ddl
+├── bss_company_detail.md
+├── filename_mapping.txt              # file name mapping
+└── logs/                            # log directory
+    └── schema_tools_20240123.log
+```
+
+## 4. Generating Question-SQL Training Data
+
+### 4.1 Prerequisites
+
+DDL and MD generation must be run first, so that the output directory contains a complete set of DDL and MD files.
+
+### 4.2 Command format
+
+```bash
+python -m schema_tools.qs_generator \
+  --output-dir <output directory> \
+  --table-list <table list file> \
+  --business-context <business context> \
+  [optional arguments]
+```
+
+### 4.3 Required arguments
+
+| Argument | Description | Example |
+|------|------|------|
+| `--output-dir` | Directory containing the DDL and MD files | `./output` |
+| `--table-list` | Path to the table list file (used for validation) | `./tables.txt` |
+| `--business-context` | Business context description | `"高速公路服务区管理系统"` |
+
+### 4.4 Optional arguments
+
+| Argument | Description | Default |
+|------|------|--------|
+| `--db-name` | Database name (used in the output file name) | `db` |
+| `--verbose` | Enable verbose logging | `False` |
+| `--log-file` | Log file path | none |
+
+### 4.5 Usage examples
+
+#### Basic usage
+```bash
+python -m schema_tools.qs_generator \
+  --output-dir ./output \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --db-name highway_db
+```
+
+#### Enabling verbose logging
+```bash
+python -m schema_tools.qs_generator \
+  --output-dir ./output \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --db-name highway_db \
+  --verbose
+```
+
+### 4.6 Execution flow
+
+1. **File validation**: checks that the DDL and MD file counts are correct
+2. **Table count limit**: at most 20 tables are processed (configurable)
+3. **Topic extraction**: the LLM analyzes the table structures and extracts 5 business analysis topics
+4. **Question-SQL generation**: 10 questions are generated per topic
+5. **Result saving**: output is written to `qs_<db_name>_<timestamp>_pair.json` (naming sketched below)
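+
+As a sketch of the naming in step 5, assuming the `%Y%m%d_%H%M%S` timestamp format that the intermediate files in the design document use:
+
+```python
+from datetime import datetime
+
+def final_output_name(db_name: str = "db") -> str:
+    # e.g. qs_highway_db_20240123_143052_pair.json
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    return f"qs_{db_name}_{timestamp}_pair.json"
+```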
+
+### 4.7 Output files
+
+```
+output/
+├── qs_highway_db_20240123_143052_pair.json  # final result
+├── qs_intermediate_20240123_143052.json     # intermediate results (deleted automatically on success)
+└── qs_recovery_20240123_143052.json         # recovery file (written on abnormal interruption)
+```
+
+### 4.8 Sample output
+
+```json
+[
+  {
+    "question": "按服务区统计每日营收趋势(最近30天)?",
+    "sql": "SELECT service_name AS 服务区, oper_date AS 营业日期, SUM(pay_sum) AS 每日营收 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - INTERVAL '30 day' AND delete_ts IS NULL GROUP BY service_name, oper_date ORDER BY 营业日期 ASC;"
+  },
+  {
+    "question": "哪个服务区的车流量最大?",
+    "sql": "SELECT service_area_id, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 1;"
+  }
+]
+```
+
+## 5. Configuration Reference
+
+### 5.1 Main configuration options
+
+The configuration file is `schema_tools/config.py`:
+
+```python
+# DDL/MD generation settings
+"output_directory": "training/generated_data",     # output directory
+"create_subdirectories": False,                    # do not create subdirectories
+"max_concurrent_tables": 3,                        # maximum concurrency
+"sample_data_limit": 20,                          # number of sampled rows
+"filter_system_tables": True,                      # filter out system tables
+"continue_on_error": True,                         # continue after errors
+
+# Question-SQL generation settings
+"qs_generation": {
+    "max_tables": 20,                             # maximum number of tables
+    "theme_count": 5,                             # number of topics
+    "questions_per_theme": 10,                    # questions per topic
+    "max_concurrent_themes": 3,                   # topics processed in parallel
+    "continue_on_theme_error": True,              # continue if a topic fails
+    "save_intermediate": True,                    # save intermediate results
+}
+```
+
+### 5.2 Changing the configuration
+
+Edit `schema_tools/config.py` to change the defaults, or override individual values at runtime as sketched below.
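+
+For one-off adjustments, a minimal sketch of overriding a value in process, assuming `config.py` exposes the `SCHEMA_TOOLS_CONFIG` dictionary shown in the system design document:
+
+```python
+from schema_tools.config import SCHEMA_TOOLS_CONFIG
+
+SCHEMA_TOOLS_CONFIG["qs_generation"]["max_tables"] = 30   # raise the table limit
+SCHEMA_TOOLS_CONFIG["qs_generation"]["theme_count"] = 8   # extract more topics
+```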
+
+## 6. FAQ
+
+### 6.1 What if there are more than 20 tables?
+
+**Error message**:
+```
+表数量(25)超过限制(20)。请分批处理或调整配置中的max_tables参数。
+```
+
+**Solutions**:
+1. Process in batches: split the table list into multiple files with at most 20 tables each
+2. Change the configuration: raise the `max_tables` limit in `config.py`
+
+### 6.2 DDL and MD file counts do not match
+
+**Error message**:
+```
+DDL文件数量(5)与表数量(6)不一致
+```
+
+**Solutions**:
+1. Check whether any table failed to process
+2. Find the failed tables in the log file
+3. Rerun the DDL/MD generation
+
+### 6.3 LLM calls fail
+
+**Possible causes**:
+- Network connectivity problems
+- API quota limits
+- Token limit exceeded
+
+**Solutions**:
+1. Check the network connection
+2. Inspect the intermediate result file and resume from the checkpoint
+3. Reduce the number of tables or process in batches
+
+### 6.4 Insufficient permissions
+
+**Error message**:
+```
+数据库查询权限不足
+```
+
+**Solutions**:
+1. Check permissions with `--check-permissions-only`
+2. Make sure the database user has SELECT permission
+3. Schema Tools works against read-only databases
+
+### 6.5 How are large tables handled?
+
+Schema Tools automatically detects large tables (more than 1 million rows) and uses a smart sampling strategy:
+- first N rows + random middle rows + last N rows
+- this keeps the sample representative (sketched below)
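+
+A minimal sketch of that strategy using asyncpg; the exact queries are illustrative assumptions, not the tool's actual implementation:
+
+```python
+import asyncpg
+
+async def smart_sample(conn: asyncpg.Connection, table: str, n: int = 7) -> list:
+    """Hypothetical sketch: first n rows + a random middle slice + last n rows."""
+    head = await conn.fetch(f"SELECT * FROM {table} LIMIT {n}")
+    # TABLESAMPLE SYSTEM gives a cheap pseudo-random slice of a large table
+    middle = await conn.fetch(f"SELECT * FROM {table} TABLESAMPLE SYSTEM (1) LIMIT {n}")
+    # ctid ordering approximates the physically last rows
+    tail = await conn.fetch(f"SELECT * FROM {table} ORDER BY ctid DESC LIMIT {n}")
+    return list(head) + list(middle) + list(tail)
+```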
+
+### 6.6 The generated SQL has syntax errors
+
+The generated SQL currently uses PostgreSQL syntax. If another dialect is needed:
+1. State the target database explicitly in the business context
+2. Future versions will support MySQL and other databases
+
+## 7. Best Practices
+
+### 7.1 Recommended workflow
+
+1. **Step 1**: generate the DDL and MD documents
+   ```bash
+   python -m schema_tools --db-connection "..." --table-list tables.txt --business-context "..." --output-dir ./output
+   ```
+
+2. **Step 2**: review manually
+   - Check that the table structures in the DDL files are correct
+   - Confirm that the comments in the MD documents are accurate
+   - Adjust by hand where needed
+
+3. **Step 3**: generate the Question-SQL pairs
+   ```bash
+   python -m schema_tools.qs_generator --output-dir ./output --table-list tables.txt --business-context "..."
+   ```
+
+### 7.2 Organizing the table list
+
+- Group tables by business module
+- Keep each group to at most 15-20 tables
+- Use comments to describe what each group is for
+
+### 7.3 Tuning the business context
+
+- Provide an accurate description of the business background
+- Include industry-specific terminology
+- Describe the main business processes
+
+### 7.4 Managing output files
+
+- Back up the generated files regularly
+- Keep DDL files under version control
+- Retain intermediate results for debugging

+ 87 - 3
docs/Schema Tools 系统概要设计说明书.md

@@ -15,6 +15,7 @@
 - LLM-assisted intelligent comment generation and enum detection
 - Concurrent processing for efficiency
 - Complete error handling and logging
+- **New**: Question-SQL training data generation
 
 ## 2. System Architecture
 
@@ -25,6 +26,8 @@ schema_tools/                    # standalone schema tools module
 ├── __init__.py                 # module entry point
 ├── config.py                   # configuration
 ├── training_data_agent.py      # main AI agent
+├── qs_agent.py                 # Question-SQL generation agent (new)
+├── qs_generator.py             # Question-SQL command-line entry point (new)
 ├── tools/                      # agent tool set
 │   ├── __init__.py
 │   ├── base.py                 # base tool classes and registration mechanism
@@ -33,6 +36,13 @@ schema_tools/                    # standalone schema tools module
 │   ├── comment_generator.py    # LLM comment generation tool
 │   ├── ddl_generator.py        # DDL generation tool
 │   └── doc_generator.py        # MD document generation tool
+├── validators/                 # validator module (new)
+│   ├── __init__.py
+│   └── file_count_validator.py # file count validator
+├── analyzers/                  # analyzer module (new)
+│   ├── __init__.py
+│   ├── md_analyzer.py          # MD file analyzer
+│   └── theme_extractor.py      # topic extractor
 ├── prompts/                    # prompts and business glossary
 │   ├── table_comment_template.txt
 │   ├── field_comment_template.txt
@@ -55,7 +65,13 @@ schema_tools/                    # standalone schema tools module
 - **Responsibilities**: overall flow control, tool scheduling, concurrency management
 - **Characteristics**: single-agent architecture managing multiple tools
 
-#### 2.2.2 Agent tool set (decorator-based registration)
+#### 2.2.2 Question-SQL Generation Agent (new)
+
+- **Class name**: `QuestionSQLGenerationAgent`
+- **Responsibilities**: generates Question-SQL training data pairs
+- **Characteristics**: an independent module that can be run on its own after DDL/MD generation
+
+#### 2.2.3 Agent tool set (decorator-based registration)
 
 1. **DatabaseInspectorTool**: fetches table metadata
 2. **DataSamplerTool**: samples table data
@@ -63,9 +79,15 @@ schema_tools/                    # standalone schema tools module
 4. **DDLGeneratorTool**: generates DDL files
 5. **DocGeneratorTool**: generates MD documents
 
+#### 2.2.4 Validators and Analyzers (new)
+
+1. **FileCountValidator**: validates the DDL and MD file counts
+2. **MDFileAnalyzer**: reads and analyzes MD file contents
+3. **ThemeExtractor**: uses an LLM to extract business analysis topics
+
 ## 3. Detailed Design
 
-### 3.1 Tool execution flow
+### 3.1 DDL/MD generation flow
 
 ```mermaid
 graph TD
@@ -80,6 +102,21 @@ graph TD
     I --> J[Done]
 ```
 
+### 3.2 Question-SQL generation flow (new)
+
+```mermaid
+graph TD
+    A[Start] --> B[FileCountValidator<br/>validate file counts]
+    B --> C{Valid?}
+    C -->|No| D[Report error and exit]
+    C -->|Yes| E[MDFileAnalyzer<br/>read all MD files]
+    E --> F[ThemeExtractor<br/>extract analysis topics]
+    F --> G[Process each topic]
+    G --> H[Generate Question-SQL pairs]
+    H --> I[Save JSON file]
+    I --> J[Done]
+```
+
 ### 3.3 Inter-module interface specifications
 
 #### 3.3.1 Unified data structure definitions
@@ -317,12 +354,23 @@ SCHEMA_TOOLS_CONFIG = {
     "ddl_file_suffix": ".ddl",
     "doc_file_suffix": "_detail.md",
     "log_file": "schema_tools.log",
-    "create_subdirectories": True,  # whether to create ddl/docs subdirectories
+    "create_subdirectories": False,  # no subdirectories; all files go directly in the output directory
     
     # Output format settings
     "include_sample_data_in_comments": True,  # include sample data in comments
     "max_comment_length": 500,  # maximum comment length
     "include_field_statistics": True,  # include field statistics
+    
+    # Question-SQL generation settings (new)
+    "qs_generation": {
+        "max_tables": 20,                    # maximum number of tables
+        "theme_count": 5,                    # number of topics the LLM generates
+        "questions_per_theme": 10,           # questions generated per topic
+        "max_concurrent_themes": 3,          # topics processed in parallel
+        "continue_on_theme_error": True,     # continue if a topic fails
+        "save_intermediate": True,           # save intermediate results
+        "output_file_prefix": "qs",          # output file prefix
+    }
 }
 ```
 
@@ -666,10 +714,27 @@ bss_service_area 表记录高速公路服务区的基础属性...
 - service_area_type 为枚举字段,包含两个取值:信息化服务区、智能化服务区。
 ```
 
+### 6.3 Question-SQL file format (new)
+
+```json
+[
+  {
+    "question": "按服务区统计每日营收趋势(最近30天)?",
+    "sql": "SELECT service_name AS 服务区, oper_date AS 营业日期, SUM(pay_sum) AS 每日营收 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - INTERVAL '30 day' AND delete_ts IS NULL GROUP BY service_name, oper_date ORDER BY 营业日期 ASC NULLS LAST;"
+  },
+  {
+    "question": "按月统计服务区营收趋势?",
+    "sql": "SELECT service_name AS 服务区, DATE_TRUNC('month', oper_date) AS 月份, SUM(pay_sum) AS 月营收 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name, 月份 ORDER BY 月份 ASC NULLS LAST;"
+  }
+]
+```
+
 ## 7. Usage
 
 ### 7.1 Command line
 
+#### 7.1.1 Generating DDL and MD documents
+
 ```bash
 # Basic usage
 python -m schema_tools \
@@ -700,6 +765,25 @@ python -m schema_tools \
   --check-permissions-only
 ```
 
+#### 7.1.2 Generating Question-SQL training data (new)
+
+```bash
+# Basic usage (run after the DDL/MD files have been generated)
+python -m schema_tools.qs_generator \
+  --output-dir ./output \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --db-name highway_db
+
+# Enable verbose logging
+python -m schema_tools.qs_generator \
+  --output-dir ./output \
+  --table-list ./tables.txt \
+  --business-context "电商系统" \
+  --db-name ecommerce_db \
+  --verbose
+```
+
 ### 7.2 Programmatic use
 
 ```python

+ 290 - 1
docs/Schema Tools 详细设计文档.md

@@ -10,6 +10,8 @@ schema_tools/
 ├── __main__.py                     # command-line entry point
 ├── config.py                       # configuration management
 ├── training_data_agent.py          # main AI agent
+├── qs_agent.py                     # Question-SQL generation agent (new)
+├── qs_generator.py                 # Question-SQL command-line entry point (new)
 ├── tools/                          # agent tool set
 │   ├── __init__.py                 # tool module init
 │   ├── base.py                     # base tool classes and registration mechanism
@@ -18,6 +20,13 @@ schema_tools/
 │   ├── comment_generator.py        # LLM comment generation tool
 │   ├── ddl_generator.py            # DDL generation tool
 │   └── doc_generator.py            # MD document generation tool
+├── validators/                     # validator module (new)
+│   ├── __init__.py
+│   └── file_count_validator.py     # file count validator
+├── analyzers/                      # analyzer module (new)
+│   ├── __init__.py
+│   ├── md_analyzer.py              # MD file analyzer
+│   └── theme_extractor.py          # topic extractor
 ├── utils/                          # utility functions
 │   ├── __init__.py
 │   ├── data_structures.py          # data structure definitions
@@ -1937,4 +1946,284 @@ except ValueError as e:
 - **Error handling**: thorough exception handling and retry mechanisms
 - **Extensibility**: the tool registration mechanism makes adding features easy
 - **Flexible configuration**: multi-level configuration support
-- **Complete logging**: detailed execution logs and statistics reports
+- **Complete logging**: detailed execution logs and statistics reports
+
+## 8. Question-SQL Generation: Detailed Design (new)
+
+### 8.1 Overview
+
+The Question-SQL generation feature is an extension module of Schema Tools that automatically produces high-quality Question-SQL training data pairs from previously generated DDL and MD files. It can run on its own, so the DDL/MD files can be reviewed manually before it is executed.
+
+### 8.2 Core component design
+
+#### 8.2.1 QuestionSQLGenerationAgent (`qs_agent.py`)
+
+```python
+class QuestionSQLGenerationAgent:
+    """Question-SQL generation agent"""
+    
+    def __init__(self, 
+                 output_dir: str,
+                 table_list_file: str,
+                 business_context: str,
+                 db_name: str = None):
+        """
+        Initialize the agent
+        
+        Args:
+            output_dir: output directory (contains the DDL and MD files)
+            table_list_file: path to the table list file
+            business_context: business context description
+            db_name: database name (used to name the output file)
+        """
+        self.output_dir = Path(output_dir)
+        self.table_list_file = table_list_file
+        self.business_context = business_context
+        self.db_name = db_name or "db"
+        
+        # Initialize sub-components
+        self.validator = FileCountValidator()
+        self.md_analyzer = MDFileAnalyzer(output_dir)
+        self.theme_extractor = None  # initialized lazily
+        
+        # Intermediate result storage
+        self.intermediate_results = []
+        self.intermediate_file = None
+```
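+
+A minimal sketch of driving the agent programmatically, assuming `generate()` from section 8.3.1 is the public entry point and using the example arguments from this document:
+
+```python
+import asyncio
+from schema_tools.qs_agent import QuestionSQLGenerationAgent
+
+agent = QuestionSQLGenerationAgent(
+    output_dir="./output",
+    table_list_file="./schema_tools/tables.txt",
+    business_context="高速公路服务区管理系统",
+    db_name="highway_db",
+)
+report = asyncio.run(agent.generate())  # generate() is async, per section 8.3.1
+print(report)
+```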
+
+#### 8.2.2 File count validator (`validators/file_count_validator.py`)
+
+```python
+from dataclasses import dataclass, field
+from typing import List
+
+@dataclass
+class ValidationResult:
+    """Validation result"""
+    is_valid: bool
+    table_count: int
+    ddl_count: int
+    md_count: int
+    error: str = ""
+    missing_ddl: List[str] = field(default_factory=list)
+    missing_md: List[str] = field(default_factory=list)
+
+class FileCountValidator:
+    """File count validator"""
+    
+    def validate(self, table_list_file: str, output_dir: str) -> ValidationResult:
+        """
+        Check that the number of generated files matches the number of tables
+        
+        Validates that:
+        1. the table count does not exceed the limit of 20
+        2. the DDL file count matches the table count
+        3. the MD file count matches the table count
+        """
+        # Parse the table list
+        tables = self.table_parser.parse_file(table_list_file)
+        table_count = len(tables)
+        
+        # Enforce the table count limit
+        max_tables = self.config['qs_generation']['max_tables']
+        if table_count > max_tables:
+            return ValidationResult(
+                is_valid=False,
+                table_count=table_count,
+                ddl_count=0,
+                md_count=0,
+                error=f"表数量({table_count})超过限制({max_tables})"
+            )
+```
+
+#### 8.2.3 MD file analyzer (`analyzers/md_analyzer.py`)
+
+```python
+class MDFileAnalyzer:
+    """MD file analyzer"""
+    
+    async def read_all_md_files(self) -> str:
+        """
+        Read the full contents of all MD files
+        
+        Returns:
+            The contents of all MD files combined into one string
+        """
+        md_files = sorted(self.output_dir.glob("*_detail.md"))
+        
+        all_contents = []
+        all_contents.append(f"# 数据库表结构文档汇总\n")
+        all_contents.append(f"共包含 {len(md_files)} 个表\n\n")
+        
+        for md_file in md_files:
+            content = md_file.read_text(encoding='utf-8')
+            
+            # Add separators so the LLM can tell the tables apart
+            all_contents.append("=" * 80)
+            all_contents.append(f"# 文件: {md_file.name}")
+            all_contents.append("=" * 80)
+            all_contents.append(content)
+            all_contents.append("\n")
+        
+        combined_content = "\n".join(all_contents)
+        
+        # Check the content size (rough estimate: ~4 characters per token)
+        estimated_tokens = len(combined_content) / 4
+        if estimated_tokens > 100000:
+            self.logger.warning(f"MD内容可能过大,预估tokens: {estimated_tokens:.0f}")
+        
+        return combined_content
+```
+
+#### 8.2.4 Topic extractor (`analyzers/theme_extractor.py`)
+
+```python
+class ThemeExtractor:
+    """Topic extractor"""
+    
+    async def extract_themes(self, md_contents: str) -> List[Dict[str, Any]]:
+        """
+        Extract analysis topics from the MD content
+        """
+        # Assumes the shared config is attached as self.config, as in FileCountValidator
+        theme_count = self.config['qs_generation']['theme_count']
+        prompt = f"""你是一位经验丰富的业务数据分析师,正在分析{self.business_context}的数据库。
+
+以下是数据库中所有表的详细结构说明:
+
+{md_contents}
+
+基于对这些表结构的理解,请从业务分析的角度提出 {theme_count} 个数据查询分析主题。
+
+要求:
+1. 每个主题应该有明确的业务价值和分析目标
+2. 主题之间应该有所区别,覆盖不同的业务领域  
+3. 你需要自行决定每个主题应该涉及哪些表
+4. 主题应该体现实际业务场景的数据分析需求
+5. 考虑时间维度、对比分析、排名统计等多种分析角度
+
+请以JSON格式输出:
+```json
+{{
+  "themes": [
+    {{
+      "name": "经营收入分析",
+      "description": "分析服务区的营业收入情况,包括日收入趋势、月度对比、服务区排名等",
+      "focus_areas": ["收入趋势", "服务区对比", "时间维度分析"],
+      "related_tables": ["bss_business_day_data", "其他相关表名"]
+    }}
+  ]
+}}
+```"""
+        
+        response = await self._call_llm(prompt)
+        themes = self._parse_theme_response(response)
+        
+        return themes
+```
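+
+`_parse_theme_response` is not shown in this design. A minimal sketch of what it plausibly does, assuming the LLM wraps its JSON in a markdown code fence as the prompt requests (hypothetical helper):
+
+```python
+import json
+import re
+from typing import Any, Dict, List
+
+def parse_theme_response(response: str) -> List[Dict[str, Any]]:
+    """Pull the JSON payload out of a fenced json block, falling back to raw text."""
+    match = re.search(r"```json\s*(.*?)\s*```", response, re.DOTALL)
+    payload = match.group(1) if match else response
+    return json.loads(payload)["themes"]
+```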
+
+### 8.3 Execution Flow
+
+#### 8.3.1 Main flow
+
+```python
+async def generate(self) -> Dict[str, Any]:
+    """Generate Question-SQL pairs"""
+    
+    # 1. Validate file counts
+    validation_result = self.validator.validate(self.table_list_file, str(self.output_dir))
+    if not validation_result.is_valid:
+        raise ValueError(f"文件验证失败: {validation_result.error}")
+    
+    # 2. Read the contents of all MD files
+    md_contents = await self.md_analyzer.read_all_md_files()
+    
+    # 3. Initialize the LLM components
+    self._initialize_llm_components()
+    
+    # 4. Extract analysis topics
+    themes = await self.theme_extractor.extract_themes(md_contents)
+    
+    # 5. Initialize the intermediate result file
+    self._init_intermediate_file()
+    
+    # 6. Process each topic
+    if self.config['qs_generation']['max_concurrent_themes'] > 1:
+        results = await self._process_themes_parallel(themes, md_contents)
+    else:
+        results = await self._process_themes_serial(themes, md_contents)
+    
+    # 7. Flatten the per-topic results and save the final output
+    all_qs_pairs = [pair for r in results if r.get('success') for pair in r['qs_pairs']]
+    output_file = await self._save_final_results(all_qs_pairs)
+    
+    # Minimal summary report; the real implementation may include more statistics
+    report = {'output_file': str(output_file), 'total_questions': len(all_qs_pairs)}
+    return report
+```
+
+#### 8.3.2 Topic processing
+
+```python
+async def _process_single_theme(self, theme: Dict, md_contents: str) -> Dict:
+    """Process a single topic"""
+    
+    # Assumes the shared config is attached as self.config
+    questions_count = self.config['qs_generation']['questions_per_theme']
+    prompt = f"""你是一位业务数据分析师,正在为{self.business_context}设计数据查询。
+
+当前分析主题:{theme['name']}
+主题描述:{theme['description']}
+关注领域:{', '.join(theme['focus_areas'])}
+相关表:{', '.join(theme['related_tables'])}
+
+数据库表结构信息:
+{md_contents}
+
+请为这个主题生成 {questions_count} 个业务问题和对应的SQL查询。
+
+要求:
+1. 问题应该从业务角度出发,贴合主题要求,具有实际分析价值
+2. SQL必须使用PostgreSQL语法
+3. 考虑实际业务逻辑(如软删除使用 delete_ts IS NULL 条件)
+4. 使用中文别名提高可读性(使用 AS 指定列别名)
+5. 问题应该多样化,覆盖不同的分析角度
+6. 包含时间筛选、分组统计、排序、限制等不同类型的查询
+7. SQL语句末尾必须以分号结束
+
+输出JSON格式:
+```json
+[
+  {{
+    "question": "具体的业务问题?",
+    "sql": "SELECT column AS 中文名 FROM table WHERE condition;"
+  }}
+]
+```"""
+    
+    response = await self._call_llm(prompt)
+    qs_pairs = self._parse_qs_response(response)
+    validated_pairs = self._validate_qs_pairs(qs_pairs, theme['name'])
+    
+    # Save the intermediate result
+    await self._save_theme_results(theme['name'], validated_pairs)
+    
+    return {
+        'success': True,
+        'theme_name': theme['name'],
+        'qs_pairs': validated_pairs
+    }
+```
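+
+`_parse_qs_response` and `_validate_qs_pairs` are likewise not shown. A minimal sketch under the assumptions that the response is a fenced JSON array and that validation enforces the structural rules stated in the prompt (hypothetical helpers):
+
+```python
+import json
+import re
+from typing import Dict, List
+
+def parse_qs_response(response: str) -> List[Dict[str, str]]:
+    """Extract the JSON array of {question, sql} objects from the LLM reply."""
+    match = re.search(r"```json\s*(.*?)\s*```", response, re.DOTALL)
+    return json.loads(match.group(1) if match else response)
+
+def validate_qs_pairs(qs_pairs: List[Dict[str, str]], theme_name: str) -> List[Dict[str, str]]:
+    """Keep only pairs that have both keys and whose SQL ends with a semicolon."""
+    valid = []
+    for pair in qs_pairs:
+        question = (pair.get("question") or "").strip()
+        sql = (pair.get("sql") or "").strip()
+        if question and sql.endswith(";"):
+            valid.append({"question": question, "sql": sql})
+    return valid
+```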
+
+### 8.4 Intermediate Result Saving
+
+```python
+def _init_intermediate_file(self):
+    """Initialize the intermediate result file"""
+    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+    self.intermediate_file = self.output_dir / f"qs_intermediate_{timestamp}.json"
+    self.intermediate_results = []
+
+async def _save_theme_results(self, theme_name: str, qs_pairs: List[Dict]):
+    """Save the results of a single topic"""
+    theme_result = {
+        "theme": theme_name,
+        "timestamp": datetime.now().isoformat(),
+        "questions_count": len(qs_pairs),
+        "questions": qs_pairs
+    }
+    
+    self.intermediate_results.append(theme_result)
+    
+    # Write the intermediate file immediately so completed topics survive a crash
+    if self.config['qs_generation']['save_intermediate']:
+        with open(self.intermediate_file, 'w', encoding='utf-8') as f:
+            json.dump(self.intermediate_results, f, ensure_ascii=False, indent=2)
+```

+ 140 - 0
docs/qs_generation_guide.md

@@ -0,0 +1,140 @@
+# Question-SQL Generation Guide
+
+## Overview
+
+The Question-SQL generation feature is an extension module of Schema Tools that automatically produces high-quality Question-SQL training data pairs from previously generated DDL and MD files.
+
+## Key features
+
+1. **Table list deduplication**: duplicate table names are removed automatically and reported in the log
+2. **File completeness validation**: DDL and MD file counts are verified, with a detailed report of any missing files
+3. **Smart topic extraction**: an LLM extracts business analysis topics automatically
+4. **Batch question generation**: 10 Question-SQL pairs are generated per topic
+5. **Metadata management**: a metadata.txt file containing INSERT statements is produced
+6. **Intermediate result saving**: supports resuming after an interruption
+
+## Usage
+
+### Step 1: Generate the DDL and MD files
+
+```bash
+python -m schema_tools \
+  --db-connection "postgresql://postgres:postgres@localhost:6432/highway_db" \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --output-dir ./output \
+  --pipeline full \
+  --verbose
+```
+
+### Step 2: Review the generated files manually
+
+In the output directory, check:
+- the *.ddl files (table structure definitions)
+- the *_detail.md files (detailed table documentation)
+
+### Step 3: Generate the Question-SQL pairs
+
+```bash
+python -m schema_tools.qs_generator \
+  --output-dir ./output \
+  --table-list ./schema_tools/tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --db-name highway_db \
+  --verbose
+```
+
+## Output files
+
+### 1. Question-SQL data file
+- File name: `qs_<db_name>_<timestamp>_pair.json`
+- Format:
+```json
+[
+  {
+    "question": "A business question?",
+    "sql": "SELECT ... FROM ... WHERE ...;"
+  }
+]
+```
+
+### 2. Topic metadata file
+- File name: `metadata.txt`
+- Content: CREATE TABLE and INSERT statements
+- Example:
+```sql
+-- Schema Tools生成的主题元数据
+-- 业务背景: 高速公路服务区管理系统
+-- 生成时间: 2024-01-01 10:00:00
+-- 数据库: highway_db
+
+CREATE TABLE IF NOT EXISTS metadata (
+    id SERIAL PRIMARY KEY,
+    topic_name VARCHAR(100) NOT NULL,
+    description TEXT,
+    related_tables TEXT[],
+    keywords TEXT[],
+    focus_areas TEXT[],
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, keywords, focus_areas) VALUES
+(
+  '日营业数据分析',
+  '基于 bss_business_day_data 表,分析每个服务区和档口每天的营业收入、订单数量、支付方式等',
+  '{bss_business_day_data,bss_branch,bss_service_area}',
+  '{收入,订单,支付方式,日报表}',
+  '{收入趋势,服务区对比,支付方式分布}'
+);
+```
+
+### 3. Intermediate result file (when interrupted)
+- File name: `qs_intermediate_<timestamp>.json`
+- Purpose: holds the results of completed topics, enabling checkpoint recovery
+
+## Configuration
+
+In the `qs_generation` section of `schema_tools/config.py`:
+
+```python
+"qs_generation": {
+    "max_tables": 20,              # maximum number of tables
+    "theme_count": 5,              # number of topics to generate
+    "questions_per_theme": 10,     # questions per topic
+    "max_concurrent_themes": 1,    # topics processed in parallel
+    "continue_on_theme_error": True,  # continue if a topic fails
+    "save_intermediate": True,     # save intermediate results
+}
+```
+
+## Notes
+
+1. **Table count limit**: at most 20 tables are processed by default; split larger sets into batches
+2. **Table name deduplication**: duplicate tables in the list are removed automatically, and the log reports deduplication statistics
+3. **File validation**: if the DDL/MD file counts do not match, the missing tables are listed in detail
+4. **LLM dependency**: a properly configured vanna LLM instance is required
+5. **Error recovery**: if generation is interrupted, the intermediate results are kept and can be recovered manually next time
+
+## FAQ
+
+### Q: What if the table list contains duplicate tables?
+A: The system deduplicates automatically and reports it in the log:
+```
+表清单去重统计: 原始11个表,去重后8个表,移除了3个重复项
+```
+
+### Q: File validation reports missing tables?
+A: Check the details in the log:
+```
+缺失的DDL文件对应的表: bss_company, bss_service_area
+缺失的MD文件对应的表: bss_company
+```
+
+### Q: How do I use the generated metadata.txt?
+A: Execute it directly against PostgreSQL:
+```bash
+psql -U postgres -d your_database -f output/metadata.txt
+```
+
+### Q: What if generation is interrupted?
+A: Look in the output directory for the `qs_intermediate_*.json` file; it contains the results of the topics that completed, and its questions can be merged into a pair file as sketched below.
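+
+A minimal recovery sketch, assuming the intermediate file has the per-topic structure shown in section 8.4 of the detailed design (a list of objects each carrying a `questions` array); the output file name is illustrative:
+
+```python
+import glob
+import json
+
+# Merge the newest intermediate file into a flat question/sql pair list
+latest = sorted(glob.glob("output/qs_intermediate_*.json"))[-1]
+with open(latest, encoding="utf-8") as f:
+    theme_results = json.load(f)
+
+pairs = [q for theme in theme_results for q in theme["questions"]]
+
+with open("output/qs_recovered_pair.json", "w", encoding="utf-8") as f:
+    json.dump(pairs, f, ensure_ascii=False, indent=2)
+```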

+ 31 - 0
output/bss_business_day_data.ddl

@@ -0,0 +1,31 @@
+-- 中文名: 记录各服务区每日业务统计数据
+-- 描述: 记录各服务区每日业务统计数据,包含统计日期及基础信息,用于运营分析与管理
+create table public.bss_business_day_data (
+  id varchar(32) not null     -- 主键ID,主键,
+  version integer not null    -- 版本号,
+  create_ts timestamp         -- 创建时间,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
+  delete_ts timestamp         -- 删除时间,
+  deleted_by varchar(50)      -- 删除人,
+  oper_date date              -- 统计日期,
+  service_no varchar(255)     -- 服务区编码,
+  service_name varchar(255)   -- 服务区名称,
+  branch_no varchar(255)      -- 档口编码,
+  branch_name varchar(255)    -- 档口名称,
+  wx numeric(19,4)            -- 微信支付金额,
+  wx_order integer            -- 微信订单数量,
+  zfb numeric(19,4)           -- 支付宝支付金额,
+  zf_order integer            -- 支付宝订单数量,
+  rmb numeric(19,4)           -- 现金支付金额,
+  rmb_order integer           -- 现金订单数量,
+  xs numeric(19,4)            -- 行吧支付金额,
+  xs_order integer            -- 行吧订单数量,
+  jd numeric(19,4)            -- 金豆支付金额,
+  jd_order integer            -- 金豆订单数量,
+  order_sum integer           -- 订单总数,
+  pay_sum numeric(19,4)       -- 支付总金额,
+  source_type integer         -- 数据来源类别,
+  primary key (id)
+);

+ 32 - 0
output/bss_business_day_data_detail.md

@@ -0,0 +1,32 @@
+## bss_business_day_data(记录各服务区每日业务统计数据)
+bss_business_day_data 表记录各服务区每日业务统计数据,包含统计日期及基础信息,用于运营分析与管理
+字段列表:
+- id (varchar(32)) - 主键ID [主键, 非空] [示例: 00827DFF993D415488EA1F07CAE6C440, 00e799048b8cbb8ee758eac9c8b4b820]
+- version (integer) - 版本号 [非空] [示例: 1]
+- create_ts (timestamp) - 创建时间 [示例: 2023-04-02 08:31:51, 2023-04-02 02:30:08]
+- created_by (varchar(50)) - 创建人 [示例: xingba]
+- update_ts (timestamp) - 更新时间 [示例: 2023-04-02 08:31:51, 2023-04-02 02:30:08]
+- updated_by (varchar(50)) - 更新人
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人
+- oper_date (date) - 统计日期 [示例: 2023-04-01]
+- service_no (varchar(255)) - 服务区编码 [示例: 1028, H0501]
+- service_name (varchar(255)) - 服务区名称 [示例: 宜春服务区, 庐山服务区]
+- branch_no (varchar(255)) - 档口编码 [示例: 1, H05016]
+- branch_name (varchar(255)) - 档口名称 [示例: 宜春南区, 庐山鲜徕客东区]
+- wx (numeric(19,4)) - 微信支付金额 [示例: 4790.0000, 2523.0000]
+- wx_order (integer) - 微信订单数量 [示例: 253, 133]
+- zfb (numeric(19,4)) - 支付宝支付金额 [示例: 229.0000, 0.0000]
+- zf_order (integer) - 支付宝订单数量 [示例: 15, 0]
+- rmb (numeric(19,4)) - 现金支付金额 [示例: 1058.5000, 124.0000]
+- rmb_order (integer) - 现金订单数量 [示例: 56, 12]
+- xs (numeric(19,4)) - 行吧支付金额 [示例: 0.0000, 40.0000]
+- xs_order (integer) - 行吧订单数量 [示例: 0, 1]
+- jd (numeric(19,4)) - 金豆支付金额 [示例: 0.0000]
+- jd_order (integer) - 金豆订单数量 [示例: 0]
+- order_sum (integer) - 订单总数 [示例: 324, 146]
+- pay_sum (numeric(19,4)) - 支付总金额 [示例: 6077.5000, 2687.0000]
+- source_type (integer) - 数据来源类别 [示例: 1, 0, 4]
+字段补充说明:
+- id 为主键
+- source_type 为枚举字段,包含取值:0、4、1、2、3

+ 5 - 5
output/ddl/bss_car_day_count_1.ddl → output/bss_car_day_count.ddl

@@ -1,14 +1,14 @@
 -- 中文名: 服务区车辆日统计表
--- 描述: 服务区车辆日统计表,记录每日车辆数量及类型,用于流量分析与资源调度
+-- 描述: 服务区车辆日统计表,记录车辆类型及数量用于交通流量分析与管理决策
 create table public.bss_car_day_count (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,
   create_ts timestamp         -- 创建时间,
-  created_by varchar(50)      -- 创建人ID,
-  update_ts timestamp         -- 更新时间,
-  updated_by varchar(50)      -- 更新人ID,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 最后更新时间,
+  updated_by varchar(50)      -- 最后更新人,
   delete_ts timestamp         -- 删除时间,
-  deleted_by varchar(50)      -- 删除人ID,
+  deleted_by varchar(50)      -- 删除人,
   customer_count bigint       -- 车辆数量,
   car_type varchar(100)       -- 车辆类别,
   count_date date             -- 统计日期,

+ 5 - 5
output/docs/bss_car_day_count_detail_1.md → output/bss_car_day_count_detail.md

@@ -1,14 +1,14 @@
 ## bss_car_day_count(服务区车辆日统计表)
-bss_car_day_count 表服务区车辆日统计表,记录每日车辆数量及类型,用于流量分析与资源调度
+bss_car_day_count 表服务区车辆日统计表,记录车辆类型及数量用于交通流量分析与管理决策
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 00022c1c99ff11ec86d4fa163ec0f8fc, 00022caa99ff11ec86d4fa163ec0f8fc]
 - version (integer) - 版本号 [非空] [示例: 1]
 - create_ts (timestamp) - 创建时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
-- created_by (varchar(50)) - 创建人ID
-- update_ts (timestamp) - 更新时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
-- updated_by (varchar(50)) - 更新人ID
+- created_by (varchar(50)) - 创建人
+- update_ts (timestamp) - 最后更新时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
+- updated_by (varchar(50)) - 最后更新人
 - delete_ts (timestamp) - 删除时间
-- deleted_by (varchar(50)) - 删除人ID
+- deleted_by (varchar(50)) - 删除人
 - customer_count (bigint) - 车辆数量 [示例: 1114, 295]
 - car_type (varchar(100)) - 车辆类别 [示例: 其他]
 - count_date (date) - 统计日期 [示例: 2022-03-02, 2022-02-02]

+ 15 - 0
output/bss_company.ddl

@@ -0,0 +1,15 @@
+-- 中文名: 存储服务区入驻公司的基本信息及变更记录
+-- 描述: 存储服务区入驻公司的基本信息及变更记录,包含公司名称与编码,支持多版本审计。
+create table public.bss_company (
+  id varchar(32) not null     -- 主键ID,主键,
+  version integer not null    -- 版本号,
+  create_ts timestamp         -- 创建时间,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
+  delete_ts timestamp         -- 删除时间,
+  deleted_by varchar(50)      -- 删除人,
+  company_name varchar(255)   -- 公司名称,
+  company_no varchar(255)     -- 公司编码,
+  primary key (id)
+);

+ 16 - 0
output/bss_company_detail.md

@@ -0,0 +1,16 @@
+## bss_company(存储服务区入驻公司的基本信息及变更记录)
+bss_company 表存储服务区入驻公司的基本信息及变更记录,包含公司名称与编码,支持多版本审计。
+字段列表:
+- id (varchar(32)) - 主键ID [主键, 非空] [示例: 30675d85ba5044c31acfa243b9d16334, 47ed0bb37f5a85f3d9245e4854959b81]
+- version (integer) - 版本号 [非空] [示例: 1, 2]
+- create_ts (timestamp) - 创建时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
+- created_by (varchar(50)) - 创建人 [示例: admin]
+- update_ts (timestamp) - 更新时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
+- updated_by (varchar(50)) - 更新人 [示例: admin]
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人
+- company_name (varchar(255)) - 公司名称 [示例: 上饶分公司, 宜春分公司]
+- company_no (varchar(255)) - 公司编码 [示例: H03, H02, H07]
+字段补充说明:
+- id 为主键
+- company_no 为枚举字段,包含取值:H01、H02、H03、H04、H05、H06、H07、H08、Q01

+ 16 - 0
output/bss_section_route.ddl

@@ -0,0 +1,16 @@
+-- 中文名: 路段路线关联表
+-- 描述: 路段路线关联表,维护高速公路路段与行驶路线的映射关系
+create table public.bss_section_route (
+  id varchar(32) not null     -- 主键ID,主键,
+  version integer not null    -- 版本号,
+  create_ts timestamp         -- 创建时间,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
+  delete_ts timestamp         -- 删除时间,
+  deleted_by varchar(50)      -- 删除人,
+  section_name varchar(255)   -- 路段名称,
+  route_name varchar(255)     -- 路线名称,
+  code varchar(255)           -- 路段编号,
+  primary key (id)
+);

+ 7 - 0
output/bss_section_route_area_link.ddl

@@ -0,0 +1,7 @@
+-- 中文名: 路线与服务区关联表
+-- 描述: 路线与服务区关联表
+create table public.bss_section_route_area_link (
+  section_route_id varchar(32) not null -- 路段路线ID,主键,
+  service_area_id varchar(32) not null -- 服务区ID,主键,
+  primary key (section_route_id, service_area_id)
+);

+ 7 - 0
output/bss_section_route_area_link_detail.md

@@ -0,0 +1,7 @@
+## bss_section_route_area_link(路线与服务区关联表)
+bss_section_route_area_link 表路线与服务区关联表
+字段列表:
+- section_route_id (varchar(32)) - 路段路线ID [主键, 非空] [示例: v8elrsfs5f7lt7jl8a6p87smfzesn3rz, hxzi2iim238e3s1eajjt1enmh9o4h3wp]
+- service_area_id (varchar(32)) - 服务区ID [主键, 非空] [示例: 08e01d7402abd1d6a4d9fdd5df855ef8, 091662311d2c737029445442ff198c4c]
+字段补充说明:
+- 复合主键:section_route_id, service_area_id

+ 16 - 0
output/bss_section_route_detail.md

@@ -0,0 +1,16 @@
+## bss_section_route(路段路线关联表)
+bss_section_route 表路段路线关联表,维护高速公路路段与行驶路线的映射关系
+字段列表:
+- id (varchar(32)) - 主键ID [主键, 非空] [示例: 04ri3j67a806uw2c6o6dwdtz4knexczh, 0g5mnefxxtukql2cq6acul7phgskowy7]
+- version (integer) - 版本号 [非空] [示例: 1, 0]
+- create_ts (timestamp) - 创建时间 [示例: 2021-10-29 19:43:50, 2022-03-04 16:07:16]
+- created_by (varchar(50)) - 创建人 [示例: admin]
+- update_ts (timestamp) - 更新时间
+- updated_by (varchar(50)) - 更新人
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人
+- section_name (varchar(255)) - 路段名称 [示例: 昌栗, 昌宁]
+- route_name (varchar(255)) - 路线名称 [示例: 昌栗, 昌韶]
+- code (varchar(255)) - 路段编号 [示例: SR0001, SR0002]
+字段补充说明:
+- id 为主键

+ 19 - 0
output/bss_service_area.ddl

@@ -0,0 +1,19 @@
+-- 中文名: 高速公路服务区基础信息表
+-- 描述: 高速公路服务区基础信息表,包含名称、编码及状态管理
+create table public.bss_service_area (
+  id varchar(32) not null     -- 主键ID,主键,
+  version integer not null    -- 版本号,
+  create_ts timestamp         -- 创建时间,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
+  delete_ts timestamp         -- 删除时间,
+  deleted_by varchar(50)      -- 删除人,
+  service_area_name varchar(255) -- 服务区名称,
+  service_area_no varchar(255) -- 服务区编码,
+  company_id varchar(32)      -- 所属公司ID,
+  service_position varchar(255) -- 经纬度坐标,
+  service_area_type varchar(50) -- 服务区类型,
+  service_state varchar(50)   -- 运营状态,
+  primary key (id)
+);

+ 21 - 0
output/bss_service_area_detail.md

@@ -0,0 +1,21 @@
+## bss_service_area(高速公路服务区基础信息表)
+bss_service_area 表高速公路服务区基础信息表,包含名称、编码及状态管理
+字段列表:
+- id (varchar(32)) - 主键ID [主键, 非空] [示例: 0271d68ef93de9684b7ad8c7aae600b6, 08e01d7402abd1d6a4d9fdd5df855ef8]
+- version (integer) - 版本号 [非空] [示例: 3, 6]
+- create_ts (timestamp) - 创建时间 [示例: 2021-05-21 13:26:40.589000, 2021-05-20 19:51:46.314000]
+- created_by (varchar(50)) - 创建人 [示例: admin]
+- update_ts (timestamp) - 更新时间 [示例: 2021-07-10 15:41:28.795000, 2021-07-11 09:33:08.455000]
+- updated_by (varchar(50)) - 更新人 [示例: admin]
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人 [示例: ]
+- service_area_name (varchar(255)) - 服务区名称 [示例: 白鹭湖停车区, 南昌南服务区]
+- service_area_no (varchar(255)) - 服务区编码 [示例: H0814, H0105]
+- company_id (varchar(32)) - 所属公司ID [示例: b1629f07c8d9ac81494fbc1de61f1ea5, ee9bf1180a2b45003f96e597a4b7f15a]
+- service_position (varchar(255)) - 经纬度坐标 [示例: 114.574721,26.825584, 115.910549,28.396355]
+- service_area_type (varchar(50)) - 服务区类型 [示例: 信息化服务区]
+- service_state (varchar(50)) - 运营状态 [示例: 开放, 关闭]
+字段补充说明:
+- id 为主键
+- service_area_type 为枚举字段,包含取值:信息化服务区、智能化服务区
+- service_state 为枚举字段,包含取值:开放、关闭、上传数据

+ 18 - 0
output/bss_service_area_mapper.ddl

@@ -0,0 +1,18 @@
+-- 中文名: 服务区与系统映射表
+-- 描述: 服务区与系统映射表,维护服务区名称/编码关联及版本控制
+create table public.bss_service_area_mapper (
+  id varchar(32) not null     -- 主键ID,主键,
+  version integer not null    -- 版本号,
+  create_ts timestamp         -- 创建时间,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
+  delete_ts timestamp         -- 删除时间,
+  deleted_by varchar(50)      -- 删除人,
+  service_name varchar(255)   -- 服务区名称,
+  service_no varchar(255)     -- 服务区编码,
+  service_area_id varchar(32) -- 服务区ID,
+  source_system_type varchar(50) -- 数据来源系统类型,
+  source_type integer         -- 数据来源类别ID,
+  primary key (id)
+);

+ 19 - 0
output/bss_service_area_mapper_detail.md

@@ -0,0 +1,19 @@
+## bss_service_area_mapper(服务区与系统映射表)
+bss_service_area_mapper 表服务区与系统映射表,维护服务区名称/编码关联及版本控制
+字段列表:
+- id (varchar(32)) - 主键ID [主键, 非空] [示例: 00e1e893909211ed8ee6fa163eaf653f, 013867f5962211ed8ee6fa163eaf653f]
+- version (integer) - 版本号 [非空] [示例: 1]
+- create_ts (timestamp) - 创建时间 [示例: 2023-01-10 10:54:03, 2023-01-17 12:47:29]
+- created_by (varchar(50)) - 创建人 [示例: admin]
+- update_ts (timestamp) - 更新时间 [示例: 2023-01-10 10:54:07, 2023-01-17 12:47:32]
+- updated_by (varchar(50)) - 更新人
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人
+- service_name (varchar(255)) - 服务区名称 [示例: 信丰西服务区, 南康北服务区]
+- service_no (varchar(255)) - 服务区编码 [示例: 1067, 1062]
+- service_area_id (varchar(32)) - 服务区ID [示例: 97cd6cd516a551409a4d453a58f9e170, fdbdd042962011ed8ee6fa163eaf653f]
+- source_system_type (varchar(50)) - 数据来源系统类型 [示例: 驿美, 驿购]
+- source_type (integer) - 数据来源类别ID [示例: 3, 1]
+字段补充说明:
+- id 为主键
+- source_system_type 为枚举字段,包含取值:司乘管理、商业管理、驿购、驿美、手工录入

+ 6 - 0
output/filename_mapping.txt

@@ -1,4 +1,10 @@
 # 文件名映射报告
 # 格式: 原始表名 -> 实际文件名
 
+public.bss_business_day_data -> bss_business_day_data_detail.md
 public.bss_car_day_count -> bss_car_day_count_detail_1.md
+public.bss_company -> bss_company_detail.md
+public.bss_section_route -> bss_section_route_detail.md
+public.bss_section_route_area_link -> bss_section_route_area_link_detail.md
+public.bss_service_area -> bss_service_area_detail.md
+public.bss_service_area_mapper -> bss_service_area_mapper_detail.md

+ 202 - 0
output/qs_highway_db_20250623_192120_pair.json

@@ -0,0 +1,202 @@
+[
+  {
+    "question": "最近7天各服务区总收入趋势分析(按日期排序)",
+    "sql": "SELECT oper_date AS 统计日期, service_name AS 服务区名称, SUM(pay_sum) AS 日总收入 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - 7 GROUP BY oper_date, service_name ORDER BY oper_date;"
+  },
+  {
+    "question": "本月与上月收入对比分析(按月份分组)",
+    "sql": "SELECT DATE_TRUNC('month', oper_date) AS 月份, SUM(pay_sum) AS 月总收入 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month' AND oper_date < DATE_TRUNC('month', CURRENT_DATE) + INTERVAL '1 month' GROUP BY DATE_TRUNC('month', oper_date) ORDER BY 月份;"
+  },
+  {
+    "question": "累计收入排名前5的服务区(按总支付金额排序)",
+    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 累计总收入 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 累计总收入 DESC LIMIT 5;"
+  },
+  {
+    "question": "各支付方式占比分析(微信/支付宝/现金/其他)",
+    "sql": "SELECT '微信' AS 支付方式, ROUND(SUM(wx)/SUM(pay_sum)*100,2) AS 占比百分比 FROM bss_business_day_data WHERE delete_ts IS NULL UNION ALL SELECT '支付宝', ROUND(SUM(zfb)/SUM(pay_sum)*100,2) FROM bss_business_day_data WHERE delete_ts IS NULL UNION ALL SELECT '现金', ROUND(SUM(rmb)/SUM(pay_sum)*100,2) FROM bss_business_day_data WHERE delete_ts IS NULL UNION ALL SELECT '其他支付', ROUND((SUM(xs)+SUM(jd))/SUM(pay_sum)*100,2) FROM bss_business_day_data WHERE delete_ts IS NULL ORDER BY 占比百分比 DESC;"
+  },
+  {
+    "question": "异常数据监控:单日收入突增超过平均值200%的服务区记录",
+    "sql": "WITH daily_avg AS (SELECT AVG(pay_sum) AS avg_income FROM bss_business_day_data WHERE delete_ts IS NULL) SELECT a.oper_date, a.service_name, a.pay_sum, ROUND(a.pay_sum/daily_avg.avg_income,2) AS 倍数 FROM bss_business_day_data a, daily_avg WHERE a.delete_ts IS NULL AND a.pay_sum > daily_avg.avg_income * 2 ORDER BY a.oper_date DESC;"
+  },
+  {
+    "question": "各档口客单价对比分析(按平均订单金额排序)",
+    "sql": "SELECT branch_name AS 档口名称, ROUND(SUM(pay_sum)/SUM(order_sum),2) AS 客单价 FROM bss_business_day_data WHERE delete_ts IS NULL AND order_sum > 0 GROUP BY branch_name ORDER BY 客单价 DESC;"
+  },
+  {
+    "question": "周末与工作日收入差异分析(统计星期几收入分布)",
+    "sql": "SELECT CASE WHEN EXTRACT(DOW FROM oper_date) IN (0,6) THEN '周末' ELSE '工作日' END AS 日期类型, AVG(pay_sum) AS 平均日收入 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY CASE WHEN EXTRACT(DOW FROM oper_date) IN (0,6) THEN '周末' ELSE '工作日' END;"
+  },
+  {
+    "question": "微信支付金额月环比增长趋势分析",
+    "sql": "SELECT DATE_TRUNC('month', oper_date) AS 月份, SUM(wx) AS 微信支付总额, ROUND((SUM(wx) - LAG(SUM(wx),1,0) OVER(ORDER BY DATE_TRUNC('month', oper_date)))/LAG(SUM(wx),1,0) OVER(ORDER BY DATE_TRUNC('month', oper_date))*100,2) AS 环比增长率 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY DATE_TRUNC('month', oper_date) ORDER BY 月份;"
+  },
+  {
+    "question": "庐山服务区各档口收入排名(按支付金额降序排列)",
+    "sql": "SELECT branch_name AS 档口名称, SUM(pay_sum) AS 累计收入 FROM bss_business_day_data WHERE delete_ts IS NULL AND service_name = '庐山服务区' GROUP BY branch_name ORDER BY 累计收入 DESC;"
+  },
+  {
+    "question": "车辆数量与订单量相关性分析(关联车流数据)",
+    "sql": "SELECT a.oper_date, a.order_sum AS 订单数量, b.customer_count AS 车辆数量, ROUND(CORR(a.order_sum, b.customer_count),2) AS 相关系数 FROM (SELECT oper_date, SUM(order_sum) AS order_sum FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY oper_date) a JOIN (SELECT count_date, SUM(customer_count) AS customer_count FROM bss_car_day_count GROUP BY count_date) b ON a.oper_date = b.count_date GROUP BY a.oper_date;"
+  },
+  {
+    "question": "分析不同车辆类型(危化品/城际/过境/其他)对应的服务区平均消费金额差异",
+    "sql": "SELECT c.car_type AS 车辆类型, AVG(b.pay_sum) AS 平均消费金额 FROM bss_business_day_data b JOIN bss_service_area_mapper m ON b.service_name = m.service_name AND m.delete_ts IS NULL JOIN bss_car_day_count c ON m.service_area_id = c.service_area_id AND b.oper_date = c.count_date WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.car_type;"
+  },
+  {
+    "question": "找出最近7天车流量排名前10但订单转化率(订单数/车流量)低于5%的服务区",
+    "sql": "SELECT c.service_area_id AS 服务区ID, SUM(c.customer_count) AS 总车流量, SUM(b.order_sum) AS 总订单数, (SUM(b.order_sum)/SUM(c.customer_count)*100)::numeric(5,2) AS 转化率 FROM bss_car_day_count c JOIN bss_service_area_mapper m ON c.service_area_id = m.service_area_id AND m.delete_ts IS NULL JOIN bss_business_day_data b ON m.service_name = b.service_name AND c.count_date = b.oper_date WHERE c.count_date >= CURRENT_DATE - 7 AND c.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY c.service_area_id HAVING SUM(b.order_sum)/SUM(c.customer_count) < 0.05 ORDER BY 总车流量 DESC LIMIT 10;"
+  },
+  {
+    "question": "统计各服务区近一月日均车流量与日均消费金额的线性相关性",
+    "sql": "SELECT c.service_area_id AS 服务区ID, CORR(c.customer_count, b.pay_sum) AS 相关性系数 FROM bss_car_day_count c JOIN bss_service_area_mapper m ON c.service_area_id = m.service_area_id AND m.delete_ts IS NULL JOIN bss_business_day_data b ON m.service_name = b.service_name AND c.count_date = b.oper_date WHERE c.count_date >= CURRENT_DATE - 30 AND c.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY c.service_area_id HAVING COUNT(*) > 20;"
+  },
+  {
+    "question": "对比各分公司管辖服务区的月均车流量和客单价(消费金额/订单数)",
+    "sql": "SELECT com.company_name AS 所属公司, AVG(c.customer_count) AS 月均车流量, AVG(b.pay_sum/b.order_sum) AS 客单价 FROM bss_car_day_count c JOIN bss_service_area s ON c.service_area_id = s.id AND s.delete_ts IS NULL JOIN bss_company com ON s.company_id = com.id AND com.delete_ts IS NULL JOIN bss_service_area_mapper m ON s.id = m.service_area_id AND m.delete_ts IS NULL JOIN bss_business_day_data b ON m.service_name = b.service_name AND c.count_date = b.oper_date WHERE EXTRACT(MONTH FROM c.count_date) = EXTRACT(MONTH FROM CURRENT_DATE) AND c.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY com.company_name;"
+  },
+  {
+    "question": "识别昨日车流量超过该服务区历史平均值200%且消费金额下降超过30%的异常服务区",
+    "sql": "WITH daily_avg AS (SELECT service_area_id, AVG(customer_count) AS avg_count FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id) SELECT c.service_area_id AS 服务区ID, c.customer_count AS 昨日车流, a.avg_count AS 历史均值, b.pay_sum AS 昨日消费 FROM bss_car_day_count c JOIN daily_avg a ON c.service_area_id = a.service_area_id JOIN bss_business_day_data b ON c.count_date = b.oper_date JOIN bss_service_area_mapper m ON c.service_area_id = m.service_area_id AND m.delete_ts IS NULL AND m.service_name = b.service_name WHERE c.count_date = CURRENT_DATE - 1 AND c.customer_count > 2*a.avg_count AND b.pay_sum < 0.7*(SELECT AVG(pay_sum) FROM bss_business_day_data WHERE service_name = m.service_name AND delete_ts IS NULL);"
+  },
+  {
+    "question": "分析周末与工作日的车型分布变化及对应消费差异",
+    "sql": "SELECT CASE WHEN EXTRACT(ISODOW FROM c.count_date) IN (6,7) THEN '周末' ELSE '工作日' END AS 日期类型, c.car_type AS 车型, COUNT(*) AS 记录次数, AVG(c.customer_count) AS 平均车流, AVG(b.pay_sum) AS 平均消费 FROM bss_car_day_count c JOIN bss_service_area_mapper m ON c.service_area_id = m.service_area_id AND m.delete_ts IS NULL JOIN bss_business_day_data b ON m.service_name = b.service_name AND c.count_date = b.oper_date WHERE c.count_date >= CURRENT_DATE - 30 AND c.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY ROLLUP(日期类型, 车型);"
+  },
+  {
+    "question": "统计各服务区档口微信支付占比(微信金额/总支付金额)与车流高峰时段的关系",
+    "sql": "SELECT m.service_name AS 服务区名称, b.branch_name AS 档口名称, (b.wx/SUM(b.pay_sum) OVER(PARTITION BY b.service_name, b.oper_date)) * 100 AS 微信占比, CASE WHEN c.customer_count > (SELECT PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY customer_count) FROM bss_car_day_count) THEN '高峰时段' ELSE '非高峰' END AS 车流时段 FROM bss_business_day_data b JOIN bss_service_area_mapper m ON b.service_name = m.service_name AND m.delete_ts IS NULL JOIN bss_car_day_count c ON m.service_area_id = c.service_area_id AND b.oper_date = c.count_date WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "预测未来一周各服务区基于历史车流与消费数据的消费趋势(使用线性回归)",
+    "sql": "SELECT service_area_id AS 服务区ID, date, REGR_INTERCEPT(pay_sum, day_num) + REGR_SLOPE(pay_sum, day_num) * EXTRACT(EPOCH FROM date)/86400 AS 预测消费 FROM (SELECT c.service_area_id, b.oper_date AS date, EXTRACT(EPOCH FROM b.oper_date - CURRENT_DATE + 7) AS day_num, b.pay_sum FROM bss_car_day_count c JOIN bss_service_area_mapper m ON c.service_area_id = m.service_area_id AND m.delete_ts IS NULL JOIN bss_business_day_data b ON m.service_name = b.service_name AND c.count_date = b.oper_date WHERE b.oper_date BETWEEN CURRENT_DATE - 30 AND CURRENT_DATE AND c.delete_ts IS NULL AND b.delete_ts IS NULL) sub GROUP BY service_area_id, date;"
+  },
+  {
+    "question": "找出最近3天连续出现车流量下降超过15%且消费金额波动异常的服务区",
+    "sql": "WITH daily_diff AS (SELECT service_area_id, count_date, customer_count - LAG(customer_count,1) OVER(PARTITION BY service_area_id ORDER BY count_date) AS flow_diff, (customer_count - LAG(customer_count,1) OVER(PARTITION BY service_area_id ORDER BY count_date))/NULLIF(LAG(customer_count,1) OVER(PARTITION BY service_area_id ORDER BY count_date),0) * 100 AS flow_rate FROM bss_car_day_count WHERE delete_ts IS NULL) SELECT DISTINCT service_area_id AS 服务区ID FROM daily_diff WHERE count_date >= CURRENT_DATE -3 AND flow_rate < -15 GROUP BY service_area_id HAVING COUNT(*) =3;"
+  },
+  {
+    "question": "分析不同档口类型(餐饮/零售/其他)的消费转化率与车流密度(车流量/营业面积)的关系",
+    "sql": "SELECT b.branch_type AS 档口类型, c.customer_count / NULLIF(s.area,0) AS 车流密度, SUM(b.order_sum)/SUM(c.customer_count) AS 转化率 FROM bss_business_day_data b JOIN (SELECT branch_name, CASE WHEN branch_name LIKE '%餐饮%' THEN '餐饮' WHEN branch_name LIKE '%零售%' THEN '零售' ELSE '其他' END AS branch_type FROM bss_business_day_data GROUP BY branch_name) t ON b.branch_name = t.branch_name JOIN bss_service_area_mapper m ON b.service_name = m.service_name AND m.delete_ts IS NULL JOIN bss_car_day_count c ON m.service_area_id = c.service_area_id AND b.oper_date = c.count_date JOIN (SELECT id, (RANDOM()*1000+500)::INT AS area FROM bss_service_area) s ON m.service_area_id = s.id WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY ROLLUP(branch_type);"
+  },
+  {
+    "question": "各管理公司最近一个月平均每车次收入及订单转化率排名",
+    "sql": "SELECT c.company_name AS 公司名称, \n       ROUND(SUM(b.pay_sum)/SUM(car.customer_count), 2) AS 单位车次收入,\n       ROUND(SUM(b.order_sum)*100/SUM(car.customer_count), 2) AS 订单转化率\nFROM bss_company c\nJOIN bss_service_area sa ON c.id = sa.company_id AND sa.delete_ts IS NULL\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nJOIN bss_car_day_count car ON sa.id = car.service_area_id\nWHERE b.oper_date BETWEEN '2023-03-01' AND '2023-03-31'\n  AND car.count_date = b.oper_date\nGROUP BY c.company_name\nORDER BY 单位车次收入 DESC\nLIMIT 10;"
+  },
+  {
+    "question": "不同季度各管理公司人均产出对比分析",
+    "sql": "SELECT c.company_name AS 公司名称,\n       DATE_TRUNC('quarter', b.oper_date) AS 季度,\n       ROUND(SUM(b.pay_sum)/COUNT(DISTINCT b.created_by), 2) AS 人均产出\nFROM bss_company c\nJOIN bss_service_area sa ON c.id = sa.company_id AND sa.delete_ts IS NULL\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nWHERE b.oper_date BETWEEN '2022-01-01' AND '2023-12-31'\nGROUP BY c.company_name, DATE_TRUNC('quarter', b.oper_date)\nORDER BY 季度, 人均产出 DESC;"
+  },
+  {
+    "question": "危化品车辆占比超过10%的服务区运营效率分析",
+    "sql": "SELECT sa.service_area_name AS 服务区名称,\n       c.company_name AS 管理公司,\n       ROUND(SUM(b.pay_sum)/SUM(car.customer_count), 2) AS 单位车次收入,\n       ROUND(SUM(car_chem.customer_count)*100/SUM(car.customer_count), 2) AS 危化品占比\nFROM bss_service_area sa\nJOIN bss_company c ON sa.company_id = c.id\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nJOIN bss_car_day_count car ON sa.id = car.service_area_id\nJOIN bss_car_day_count car_chem ON sa.id = car_chem.service_area_id\nWHERE car.count_date = b.oper_date\n  AND car_chem.car_type = '危化品'\nGROUP BY sa.service_area_name, c.company_name\nHAVING SUM(car_chem.customer_count)*100/SUM(car.customer_count) > 10\nORDER BY 危化品占比 DESC;"
+  },
+  {
+    "question": "城际车辆流量与夜间收入占比关系分析(20:00-6:00时段)",
+    "sql": "SELECT c.company_name AS 管理公司,\n       SUM(car.customer_count) AS 城际车流量,\n       ROUND(SUM(CASE WHEN EXTRACT(HOUR FROM b.create_ts) BETWEEN 20 AND 23 OR EXTRACT(HOUR FROM b.create_ts) BETWEEN 0 AND 6 THEN b.pay_sum ELSE 0 END)*100/SUM(b.pay_sum), 2) AS 夜间收入占比\nFROM bss_company c\nJOIN bss_service_area sa ON c.id = sa.company_id\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nJOIN bss_car_day_count car ON sa.id = car.service_area_id\nWHERE car.car_type = '城际'\n  AND car.count_date = b.oper_date\nGROUP BY c.company_name\nORDER BY 城际车流量 DESC;"
+  },
+  {
+    "question": "各支付方式订单转化率区域分布热力图",
+    "sql": "SELECT sa.service_position AS 坐标,\n       ROUND(SUM(b.wx_order + b.zf_order + b.rmb_order)*100/SUM(car.customer_count), 2) AS 综合转化率,\n       ROUND(SUM(b.wx_order)*100/SUM(b.order_sum), 2) AS 微信占比\nFROM bss_service_area sa\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nJOIN bss_car_day_count car ON sa.id = car.service_area_id\nWHERE b.oper_date BETWEEN '2023-01-01' AND '2023-03-31'\n  AND car.count_date = b.oper_date\nGROUP BY sa.service_position;"
+  },
+  {
+    "question": "资源闲置率最高的5个服务区(连续3个月营收下降)",
+    "sql": "WITH monthly_revenue AS (\n  SELECT service_no,\n         DATE_TRUNC('month', oper_date) AS 月份,\n         SUM(pay_sum) AS 总营收\n  FROM bss_business_day_data\n  WHERE oper_date BETWEEN '2022-12-01' AND '2023-02-28'\n  GROUP BY service_no, DATE_TRUNC('month', oper_date)\n),\nrevenue_trend AS (\n  SELECT service_no,\n         ARRAY_AGG(总营收 ORDER BY 月份) AS 收益序列\n  FROM monthly_revenue\n  GROUP BY service_no\n)\nSELECT sa.service_area_name AS 服务区名称,\n       r.收益序列[3] - r.收益序列[1] AS 下降幅度\nFROM revenue_trend r\nJOIN bss_service_area sa ON r.service_no = sa.service_area_no\nWHERE r.收益序列[3] < r.收益序列[2] AND r.收益序列[2] < r.收益序列[1]\nORDER BY 下降幅度 ASC\nLIMIT 5;"
+  },
+  {
+    "question": "节假日与工作日运营效率差异对比(春节假期 vs 常规周)",
+    "sql": "SELECT \n  CASE WHEN b.oper_date BETWEEN '2023-01-21' AND '2023-01-27' THEN '春节假期' ELSE '常规周' END AS 时段类型,\n  ROUND(AVG(b.pay_sum/car.customer_count), 2) AS 平均单车收入,\n  ROUND(AVG(b.order_sum/car.customer_count), 2) AS 平均转化率\nFROM bss_business_day_data b\nJOIN bss_car_day_count car ON b.service_no = car.service_area_id\nWHERE b.oper_date BETWEEN '2023-01-10' AND '2023-02-10'\n  AND car.count_date = b.oper_date\nGROUP BY \n  CASE WHEN b.oper_date BETWEEN '2023-01-21' AND '2023-01-27' THEN '春节假期' ELSE '常规周' END;"
+  },
+  {
+    "question": "各区域管理公司单位能耗产出对比(需结合能耗表)",
+    "sql": "SELECT c.company_name AS 公司名称,\n       ROUND(SUM(b.pay_sum)/SUM(e.energy_consumption), 2) AS 单位能耗产出\nFROM bss_company c\nJOIN bss_service_area sa ON c.id = sa.company_id\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\n-- 假设存在能耗表 energy_consumption_table e\n-- JOIN energy_consumption_table e ON sa.id = e.service_area_id\nWHERE b.oper_date BETWEEN '2023-01-01' AND '2023-03-31'\nGROUP BY c.company_name\nORDER BY 单位能耗产出 DESC;"
+  },
+  {
+    "question": "新入驻公司首月运营效率达标情况检查",
+    "sql": "SELECT c.company_name AS 公司名称,\n       sa.service_area_name AS 服务区,\n       ROUND(SUM(b.pay_sum)/COUNT(DISTINCT b.oper_date), 2) AS 日均营收,\n       ROUND(SUM(b.order_sum)/COUNT(DISTINCT b.oper_date), 2) AS 日均订单量\nFROM bss_company c\nJOIN bss_service_area sa ON c.id = sa.company_id\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nWHERE sa.create_ts BETWEEN '2023-01-01' AND '2023-03-31'\nGROUP BY c.company_name, sa.service_area_name\nHAVING SUM(b.pay_sum)/COUNT(DISTINCT b.oper_date) < 5000;"
+  },
+  {
+    "question": "重点路线服务区运营效率矩阵分析(昌栗高速路段)",
+    "sql": "SELECT sa.service_area_name AS 服务区,\n       ROUND(AVG(b.pay_sum/car.customer_count), 2) AS 单位车次收入,\n       ROUND(AVG(b.order_sum/car.customer_count), 4) AS 转化率\nFROM bss_section_route sr\nJOIN bss_section_route_area_link link ON sr.id = link.section_route_id\nJOIN bss_service_area sa ON link.service_area_id = sa.id\nJOIN bss_business_day_data b ON sa.service_area_no = b.service_no\nJOIN bss_car_day_count car ON sa.id = car.service_area_id\nWHERE sr.section_name = '昌栗'\n  AND b.oper_date BETWEEN '2023-01-01' AND '2023-03-31'\n  AND car.count_date = b.oper_date\nGROUP BY sa.service_area_name\nORDER BY 单位车次收入 DESC, 转化率 DESC;"
+  },
+  {
+    "question": "各路段车流量分布情况?",
+    "sql": "SELECT section.section_name AS 路段名称, SUM(car.customer_count) AS 总车流量 FROM bss_section_route section JOIN bss_section_route_area_link link ON section.id = link.section_route_id JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id WHERE section.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY section.section_name ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "对比不同日期各路段的车流量变化趋势?",
+    "sql": "SELECT car.count_date AS 统计日期, section.section_name AS 路段名称, SUM(car.customer_count) AS 日车流量 FROM bss_section_route section JOIN bss_section_route_area_link link ON section.id = link.section_route_id JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id WHERE section.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY car.count_date, section.section_name ORDER BY 统计日期 ASC;"
+  },
+  {
+    "question": "车流量最高的五个服务区?",
+    "sql": "SELECT area.service_area_name AS 服务区名称, SUM(car.customer_count) AS 总车流量 FROM bss_section_route_area_link link JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id JOIN bss_service_area area ON link.service_area_id = area.id WHERE car.delete_ts IS NULL AND area.delete_ts IS NULL GROUP BY area.service_area_name ORDER BY 总车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "分析工作日与周末的平均车流量差异?",
+    "sql": "SELECT CASE WHEN EXTRACT(DOW FROM car.count_date) IN (0,6) THEN '周末' ELSE '工作日' END AS 日期类型, AVG(customer_count) AS 平均车流量 FROM bss_section_route_area_link link JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id WHERE car.delete_ts IS NULL GROUP BY CASE WHEN EXTRACT(DOW FROM car.count_date) IN (0,6) THEN '周末' ELSE '工作日' END;"
+  },
+  {
+    "question": "危化品车辆较多的服务区分布?",
+    "sql": "SELECT area.service_area_name AS 服务区名称, SUM(car.customer_count) AS 危化品车流量 FROM bss_section_route_area_link link JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id JOIN bss_service_area area ON link.service_area_id = area.id WHERE car.car_type = '危化品' AND car.delete_ts IS NULL AND area.delete_ts IS NULL GROUP BY area.service_area_name ORDER BY 危化品车流量 DESC;"
+  },
+  {
+    "question": "统计每个路段连接的服务区数量?",
+    "sql": "SELECT section.section_name AS 路段名称, COUNT(link.service_area_id) AS 服务区数量 FROM bss_section_route section JOIN bss_section_route_area_link link ON section.id = link.section_route_id WHERE section.delete_ts IS NULL GROUP BY section.section_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "分析特定路段(如昌栗)的车辆类型分布?",
+    "sql": "SELECT car.car_type AS 车辆类型, SUM(car.customer_count) AS 总车流量, ROUND(SUM(car.customer_count)*100.0/(SELECT SUM(customer_count) FROM bss_car_day_count car2 JOIN bss_section_route_area_link link2 ON car2.service_area_id = link2.service_area_id JOIN bss_section_route sec2 ON link2.section_route_id = sec2.id WHERE sec2.section_name = '昌栗' AND car2.delete_ts IS NULL AND sec2.delete_ts IS NULL),2) AS 占比百分比 FROM bss_section_route section JOIN bss_section_route_area_link link ON section.id = link.section_route_id JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id WHERE section.section_name = '昌栗' AND section.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY car.car_type ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "检查是否存在未关联任何路段的服务区?",
+    "sql": "SELECT area.service_area_name AS 服务区名称 FROM bss_service_area area LEFT JOIN bss_section_route_area_link link ON area.id = link.service_area_id WHERE link.section_route_id IS NULL AND area.delete_ts IS NULL;"
+  },
+  {
+    "question": "最近7天各路段的车流量统计?",
+    "sql": "SELECT section.section_name AS 路段名称, SUM(car.customer_count) AS 总车流量 FROM bss_section_route section JOIN bss_section_route_area_link link ON section.id = link.section_route_id JOIN bss_car_day_count car ON link.service_area_id = car.service_area_id WHERE section.delete_ts IS NULL AND car.delete_ts IS NULL AND car.count_date >= CURRENT_DATE - 7 GROUP BY section.section_name ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "找出订单总数与车流量相关性高的服务区?",
+    "sql": "SELECT area.service_area_name AS 服务区名称, SUM(business.order_sum) AS 总订单数, SUM(car.customer_count) AS 总车流量, ROUND(SUM(business.order_sum)*1.0/SUM(car.customer_count),4) AS 订单车流比 FROM bss_service_area area JOIN bss_business_day_data business ON area.service_area_name = business.service_name JOIN bss_car_day_count car ON area.id = car.service_area_id WHERE area.delete_ts IS NULL AND business.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY area.service_area_name ORDER BY 订单车流比 DESC;"
+  },
+  {
+    "question": "不同数据来源类别的业务数据记录数量对比",
+    "sql": "SELECT sa.source_type AS 数据来源类别, COUNT(*) AS 记录数量 FROM bss_business_day_data bdd JOIN bss_service_area_mapper sa ON bdd.service_no = sa.service_no WHERE bdd.delete_ts IS NULL GROUP BY sa.source_type ORDER BY 记录数量 DESC;"
+  },
+  {
+    "question": "近30天各服务区微信支付金额波动趋势分析",
+    "sql": "SELECT oper_date AS 统计日期, service_name AS 服务区名称, SUM(wx) AS 微信支付总额 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 30 AND delete_ts IS NULL GROUP BY oper_date, service_name ORDER BY oper_date DESC LIMIT 100;"
+  },
+  {
+    "question": "版本变更次数超过5次的服务区映射信息",
+    "sql": "SELECT service_name AS 服务区名称, COUNT(version) AS 版本变更次数 FROM bss_service_area_mapper WHERE delete_ts IS NULL GROUP BY service_name HAVING COUNT(version) > 5 ORDER BY 版本变更次数 DESC;"
+  },
+  {
+    "question": "检查服务区编码与名称的匹配一致性",
+    "sql": "SELECT bdd.service_no AS 业务数据编码, bdd.service_name AS 业务数据名称, sa.service_name AS 映射表名称 FROM bss_business_day_data bdd JOIN bss_service_area_mapper sa ON bdd.service_no = sa.service_no WHERE bdd.service_name != sa.service_name AND bdd.delete_ts IS NULL LIMIT 50;"
+  },
+  {
+    "question": "最近7天每日新增业务数据量的时效性分析",
+    "sql": "SELECT DATE(create_ts) AS 创建日期, COUNT(*) AS 新增记录数 FROM bss_business_day_data WHERE create_ts >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY DATE(create_ts) ORDER BY 创建日期 DESC;"
+  },
+  {
+    "question": "不同支付方式订单占比分布统计",
+    "sql": "SELECT '微信' AS 支付方式, SUM(wx_order) AS 订单数 FROM bss_business_day_data WHERE delete_ts IS NULL UNION ALL SELECT '支付宝', SUM(zf_order) FROM bss_business_day_data WHERE delete_ts IS NULL UNION ALL SELECT '现金', SUM(rmb_order) FROM bss_business_day_data WHERE delete_ts IS NULL ORDER BY 订单数 DESC;"
+  },
+  {
+    "question": "检查超过30天未更新的业务数据记录",
+    "sql": "SELECT service_name AS 服务区名称, oper_date AS 统计日期, update_ts AS 最后更新时间 FROM bss_business_day_data WHERE update_ts < CURRENT_DATE - 30 AND delete_ts IS NULL ORDER BY update_ts ASC LIMIT 50;"
+  },
+  {
+    "question": "各服务区不同数据源支付总额差异分析",
+    "sql": "SELECT service_name AS 服务区名称, source_type AS 数据源类型, SUM(pay_sum) AS 支付总额 FROM bss_business_day_data bdd JOIN bss_service_area_mapper sa ON bdd.service_no = sa.service_no WHERE bdd.delete_ts IS NULL GROUP BY service_name, source_type HAVING SUM(pay_sum) > 10000 ORDER BY 支付总额 DESC LIMIT 20;"
+  },
+  {
+    "question": "手工录入数据每月占比变化趋势",
+    "sql": "SELECT DATE_TRUNC('month', bdd.create_ts) AS 月份, COUNT(CASE WHEN sa.source_system_type = '手工录入' THEN 1 END) * 100.0 / COUNT(*) AS 手工录入占比 FROM bss_business_day_data bdd JOIN bss_service_area_mapper sa ON bdd.service_no = sa.service_no WHERE bdd.delete_ts IS NULL GROUP BY DATE_TRUNC('month', bdd.create_ts) ORDER BY 月份 DESC;"
+  },
+  {
+    "question": "检查同一天同一服务区的重复数据记录",
+    "sql": "SELECT oper_date AS 统计日期, service_area_id AS 服务区ID, COUNT(*) AS 重复次数 FROM bss_business_day_data bdd JOIN bss_service_area_mapper sa ON bdd.service_no = sa.service_no WHERE bdd.delete_ts IS NULL GROUP BY oper_date, service_area_id HAVING COUNT(*) > 1 ORDER BY 重复次数 DESC LIMIT 30;"
+  }
+]

+ 71 - 78
schema_tools/README.md

@@ -11,6 +11,7 @@
 - ⚡ 并发处理提高效率
 - 📁 生成标准化的DDL和MD文档
 - 🛡️ 完整的错误处理和日志记录
+- 🎯 **新增**:Question-SQL训练数据生成
 
 ## 安装依赖
 
@@ -20,7 +21,7 @@ pip install asyncpg asyncio
 
 ## 使用方法
 
-### 1. 命令行方式
+### 1. 生成DDL和MD文档
 
 #### 基本使用
 ```bash
@@ -40,15 +41,28 @@ python -m schema_tools \
   --pipeline full
 ```
 
-#### 仅检查数据库权限
+### 2. 生成Question-SQL训练数据(新功能)
+
+在生成DDL和MD文件后,可以使用新的Question-SQL生成功能:
+
 ```bash
-python -m schema_tools \
-  --db-connection "postgresql://user:pass@localhost:5432/dbname" \
-  --check-permissions-only
+python -m schema_tools.qs_generator \
+  --output-dir ./output \
+  --table-list ./tables.txt \
+  --business-context "高速公路服务区管理系统" \
+  --db-name highway_db
 ```
 
-### 2. 编程方式
+该命令将依次:
+1. 验证DDL和MD文件数量是否正确
+2. 读取所有MD文件内容
+3. 使用LLM提取业务分析主题
+4. 为每个主题生成10个Question-SQL对
+5. 输出到 `qs_highway_db_时间戳_pair.json` 文件
 
+### 3. 编程方式使用
+
+#### 生成DDL/MD文档
 ```python
 import asyncio
 from schema_tools import SchemaTrainingDataAgent
@@ -68,35 +82,39 @@ async def generate_training_data():
 asyncio.run(generate_training_data())
 ```
 
-### 3. 表清单文件格式
+#### 生成Question-SQL数据
+```python
+import asyncio
+from schema_tools import QuestionSQLGenerationAgent
 
-创建一个文本文件(如 `tables.txt`),每行一个表名:
+async def generate_qs_data():
+    agent = QuestionSQLGenerationAgent(
+        output_dir="./output",
+        table_list_file="tables.txt",
+        business_context="高速公路服务区管理系统",
+        db_name="highway_db"
+    )
+    
+    report = await agent.generate()
+    print(f"生成完成: {report['total_questions']} 个问题")
 
-```text
-# 这是注释行
-public.users
-public.orders
-hr.employees
-sales.products
+asyncio.run(generate_qs_data())
 ```
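+
+拿到返回的报告后,可按需处理失败主题(最小示例,字段名来自 `generate()` 返回的报告字典):
+
+```python
+# 假设 report 为上面 generate() 的返回值
+if report.get('failed_themes'):
+    print(f"失败主题: {report['failed_themes']}")
+print(f"输出文件: {report['output_file']}")
+```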
 
 ## 输出文件结构
 
 ```
 output/
-├── ddl/                          # DDL文件目录
-│   ├── users.ddl
-│   ├── orders.ddl
-│   └── hr__employees.ddl
-├── docs/                         # MD文档目录
-│   ├── users_detail.md
-│   ├── orders_detail.md
-│   └── hr__employees_detail.md
+├── bss_car_day_count.ddl         # DDL文件
+├── bss_car_day_count_detail.md   # MD文档
 ├── logs/                         # 日志目录
 │   └── schema_tools_20240101_120000.log
-└── filename_mapping.txt          # 文件名映射报告
+├── filename_mapping.txt          # 文件名映射报告
+└── qs_highway_db_20240101_143052_pair.json  # Question-SQL训练数据
 ```
 
+注意:配置已更新为不再创建ddl/和docs/子目录,所有文件直接放在output目录下。
+
 ## 配置选项
 
 主要配置在 `schema_tools/config.py` 中:
@@ -106,20 +124,19 @@ SCHEMA_TOOLS_CONFIG = {
     # 核心配置
     "output_directory": "training/generated_data",
     "default_pipeline": "full",
+    "create_subdirectories": False,       # 不创建子目录
     
     # 数据处理配置
     "sample_data_limit": 20,              # 采样数据量
     "max_concurrent_tables": 3,           # 最大并发数
     
-    # LLM配置
-    "max_llm_retries": 3,                # LLM重试次数
-    "comment_generation_timeout": 30,     # 超时时间
-    
-    # 系统表过滤
-    "filter_system_tables": True,         # 过滤系统表
-    
-    # 错误处理
-    "continue_on_error": True,            # 错误后继续
+    # Question-SQL生成配置
+    "qs_generation": {
+        "max_tables": 20,                 # 最大表数量限制
+        "theme_count": 5,                 # 生成主题数量
+        "questions_per_theme": 10,        # 每主题问题数
+        "max_concurrent_themes": 3,       # 并行处理主题数
+    }
 }
 ```
 
@@ -134,54 +151,30 @@ SCHEMA_TOOLS_CONFIG = {
 - **analysis_only**: 仅分析不生成文件
   - 数据库检查 → 数据采样 → 注释生成
 
-## 业务上下文
-
-业务上下文帮助LLM更好地理解表和字段的含义:
-
-### 方式1:命令行参数
-```bash
---business-context "高速公路服务区管理系统"
-```
-
-### 方式2:文件方式
-```bash
---business-context-file business_context.txt
-```
-
-### 方式3:业务词典
-编辑 `schema_tools/prompts/business_dictionary.txt`:
-```text
-BSS - Business Support System,业务支撑系统
-SA - Service Area,服务区
-POS - Point of Sale,销售点
-```
-
-## 高级功能
-
-### 1. 自定义系统表过滤
-
-```python
-from schema_tools.utils.system_filter import SystemTableFilter
-
-filter = SystemTableFilter()
-filter.add_custom_prefix("tmp_")      # 添加自定义前缀
-filter.add_custom_schema("temp")      # 添加自定义schema
+## Question-SQL生成特性
+
+### 功能亮点
+- 🔍 自动验证文件完整性
+- 📊 智能提取5个业务分析主题
+- 🤖 每个主题生成10个高质量Question-SQL对
+- 💾 支持中间结果保存和恢复
+- ⚡ 支持并行处理提高效率
+
+### 限制说明
+- 一次最多处理20个表(可配置,调整方法见下方示例)
+- 表数量超限会抛出异常
+- 主题生成失败可跳过继续处理
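+
+如需一次处理更多表,可在运行前调整配置(最小示例,直接修改全局配置字典,亦可使用 `update_config`):
+
+```python
+from schema_tools.config import SCHEMA_TOOLS_CONFIG
+
+# 按需调大单次处理的表数量上限(示例取40)
+SCHEMA_TOOLS_CONFIG['qs_generation']['max_tables'] = 40
+```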
+
+### 输出格式
+```json
+[
+  {
+    "question": "按服务区统计每日营收趋势(最近30天)?",
+    "sql": "SELECT service_name AS 服务区, oper_date AS 营业日期, SUM(pay_sum) AS 每日营收 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - INTERVAL '30 day' AND delete_ts IS NULL GROUP BY service_name, oper_date ORDER BY 营业日期 ASC;"
+  }
+]
 ```
 
-### 2. 大表智能采样
-
-对于超过100万行的大表,自动使用分层采样策略:
-- 前N行
-- 随机中间行
-- 后N行
-
-### 3. 枚举字段检测
-
-自动检测并验证枚举字段:
-- VARCHAR类型
-- 样例值重复度高
-- 字段名包含类型关键词(状态、类型、级别等)
-
 ## 常见问题
 
 ### Q: 如何处理只读数据库?

+ 2 - 0
schema_tools/__init__.py

@@ -4,11 +4,13 @@ Schema Tools - 自动化数据库逆向工程工具
 """
 
 from .training_data_agent import SchemaTrainingDataAgent
+from .qs_agent import QuestionSQLGenerationAgent
 from .config import SCHEMA_TOOLS_CONFIG, get_config, update_config
 
 __version__ = "1.0.0"
 __all__ = [
     "SchemaTrainingDataAgent",
+    "QuestionSQLGenerationAgent",
     "SCHEMA_TOOLS_CONFIG", 
     "get_config",
     "update_config"

+ 11 - 0
schema_tools/analyzers/__init__.py

@@ -0,0 +1,11 @@
+"""
+数据分析器模块
+"""
+
+from .md_analyzer import MDFileAnalyzer
+from .theme_extractor import ThemeExtractor
+
+__all__ = [
+    "MDFileAnalyzer",
+    "ThemeExtractor"
+] 

+ 99 - 0
schema_tools/analyzers/md_analyzer.py

@@ -0,0 +1,99 @@
+import logging
+from pathlib import Path
+from typing import List, Dict, Any
+
+
+class MDFileAnalyzer:
+    """MD文件分析器"""
+    
+    def __init__(self, output_dir: str):
+        self.output_dir = Path(output_dir)
+        self.logger = logging.getLogger("schema_tools.MDFileAnalyzer")
+        
+    async def read_all_md_files(self) -> str:
+        """
+        读取所有MD文件的完整内容
+        
+        Returns:
+            所有MD文件内容的组合字符串
+        """
+        md_files = sorted(self.output_dir.glob("*_detail.md"))
+        
+        if not md_files:
+            raise ValueError(f"在 {self.output_dir} 目录下未找到MD文件")
+        
+        all_contents = []
+        all_contents.append(f"# 数据库表结构文档汇总\n")
+        all_contents.append(f"共包含 {len(md_files)} 个表\n\n")
+        
+        for md_file in md_files:
+            self.logger.info(f"读取MD文件: {md_file.name}")
+            try:
+                content = md_file.read_text(encoding='utf-8')
+                
+                # 添加分隔符,便于LLM区分不同表
+                all_contents.append("=" * 80)
+                all_contents.append(f"# 文件: {md_file.name}")
+                all_contents.append("=" * 80)
+                all_contents.append(content)
+                all_contents.append("\n")
+                
+            except Exception as e:
+                self.logger.error(f"读取文件 {md_file.name} 失败: {e}")
+                raise
+        
+        combined_content = "\n".join(all_contents)
+        
+        # 检查内容大小(预估token数)
+        estimated_tokens = len(combined_content) / 4  # 粗略估算:按约4字符≈1个token(中文文本会被低估)
+        if estimated_tokens > 100000:  # 经验阈值,仅作超长内容告警
+            self.logger.warning(f"MD内容可能过大,预估tokens: {estimated_tokens:.0f}")
+        
+        self.logger.info(f"成功读取 {len(md_files)} 个MD文件,总字符数: {len(combined_content)}")
+        
+        return combined_content
+    
+    def get_table_summaries(self) -> List[Dict[str, str]]:
+        """
+        获取所有表的摘要信息
+        
+        Returns:
+            表摘要列表
+        """
+        md_files = sorted(self.output_dir.glob("*_detail.md"))
+        summaries = []
+        
+        for md_file in md_files:
+            try:
+                content = md_file.read_text(encoding='utf-8')
+                lines = content.split('\n')
+                
+                # 提取表名和描述(通常在前几行)
+                table_name = ""
+                description = ""
+                
+                for line in lines[:10]:  # 只看前10行
+                    line = line.strip()
+                    if line.startswith("##"):
+                        # 提取表名
+                        table_info = line.replace("##", "").strip()
+                        if "(" in table_info:
+                            table_name = table_info.split("(")[0].strip()
+                        else:
+                            table_name = table_info
+                    elif table_name and line and not line.startswith("#"):
+                        # 第一行非标题文本作为描述
+                        description = line
+                        break
+                
+                if table_name:
+                    summaries.append({
+                        "file": md_file.name,
+                        "table_name": table_name,
+                        "description": description
+                    })
+                    
+            except Exception as e:
+                self.logger.warning(f"处理文件 {md_file.name} 时出错: {e}")
+        
+        return summaries 
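+
+# 使用示例(草图,假设 ./output 中已存在 *_detail.md 文件):
+#   analyzer = MDFileAnalyzer("./output")
+#   md_contents = await analyzer.read_all_md_files()
+#   summaries = analyzer.get_table_summaries()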

+ 189 - 0
schema_tools/analyzers/theme_extractor.py

@@ -0,0 +1,189 @@
+import asyncio
+import json
+import logging
+from typing import List, Dict, Any
+
+from schema_tools.config import SCHEMA_TOOLS_CONFIG
+
+
+class ThemeExtractor:
+    """主题提取器"""
+    
+    def __init__(self, vn, business_context: str):
+        """
+        初始化主题提取器
+        
+        Args:
+            vn: vanna实例
+            business_context: 业务上下文
+        """
+        self.vn = vn
+        self.business_context = business_context
+        self.logger = logging.getLogger("schema_tools.ThemeExtractor")
+        self.config = SCHEMA_TOOLS_CONFIG
+        
+    async def extract_themes(self, md_contents: str) -> List[Dict[str, Any]]:
+        """
+        从MD内容中提取分析主题
+        
+        Args:
+            md_contents: 所有MD文件的组合内容
+            
+        Returns:
+            主题列表
+        """
+        theme_count = self.config['qs_generation']['theme_count']
+        
+        prompt = self._build_theme_extraction_prompt(md_contents, theme_count)
+        
+        try:
+            # 调用LLM提取主题
+            response = await self._call_llm(prompt)
+            
+            # 解析响应
+            themes = self._parse_theme_response(response)
+            
+            self.logger.info(f"成功提取 {len(themes)} 个分析主题")
+            
+            return themes
+            
+        except Exception as e:
+            self.logger.error(f"主题提取失败: {e}")
+            raise
+    
+    def _build_theme_extraction_prompt(self, md_contents: str, theme_count: int) -> str:
+        """构建主题提取的prompt"""
+        prompt = f"""你是一位经验丰富的业务数据分析师,正在分析{self.business_context}的数据库。
+
+以下是数据库中所有表的详细结构说明:
+
+{md_contents}
+
+基于对这些表结构的理解,请从业务分析的角度提出 {theme_count} 个数据查询分析主题。
+
+要求:
+1. 每个主题应该有明确的业务价值和分析目标
+2. 主题之间应该有所区别,覆盖不同的业务领域  
+3. 你需要自行决定每个主题应该涉及哪些表
+4. 主题应该体现实际业务场景的数据分析需求
+5. 考虑时间维度、对比分析、排名统计等多种分析角度
+6. 为每个主题提供3-5个关键词,用于快速了解主题内容
+
+请以JSON格式输出:
+```json
+{{
+  "themes": [
+    {{
+      "topic_name": "日营业数据分析",
+      "description": "基于 bss_business_day_data 表,分析每个服务区和档口每天的营业收入、订单数量、支付方式等",
+      "related_tables": ["bss_business_day_data", "bss_branch", "bss_service_area"],
+      "keywords": ["收入", "订单", "支付方式", "日报表"],
+      "focus_areas": ["收入趋势", "服务区对比", "支付方式分布"]
+    }}
+  ]
+}}
+```
+
+请确保:
+- topic_name 简洁明了(10字以内)
+- description 详细说明分析目标和价值(50字左右)
+- related_tables 列出该主题需要用到的表名(数组格式)
+- keywords 提供3-5个核心关键词(数组格式)
+- focus_areas 列出3-5个具体的分析角度(保留用于生成问题)"""
+        
+        return prompt
+    
+    async def _call_llm(self, prompt: str) -> str:
+        """调用LLM"""
+        try:
+            # 使用vanna的chat_with_llm方法
+            response = await asyncio.to_thread(
+                self.vn.chat_with_llm,
+                question=prompt,
+                system_prompt="你是一个专业的数据分析师,擅长从业务角度设计数据分析主题和查询方案。请严格按照要求的JSON格式输出。"
+            )
+            
+            if not response or not response.strip():
+                raise ValueError("LLM返回空响应")
+            
+            return response.strip()
+            
+        except Exception as e:
+            self.logger.error(f"LLM调用失败: {e}")
+            raise
+    
+    def _parse_theme_response(self, response: str) -> List[Dict[str, Any]]:
+        """解析LLM的主题响应"""
+        try:
+            # 提取JSON部分
+            import re
+            json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
+            if json_match:
+                json_str = json_match.group(1)
+            else:
+                # 尝试直接解析
+                json_str = response
+            
+            # 解析JSON
+            data = json.loads(json_str)
+            themes = data.get('themes', [])
+            
+            # 验证和标准化主题格式
+            validated_themes = []
+            for theme in themes:
+                # 兼容旧格式(name -> topic_name)
+                if 'name' in theme and 'topic_name' not in theme:
+                    theme['topic_name'] = theme['name']
+                
+                # 验证必需字段
+                required_fields = ['topic_name', 'description', 'related_tables']
+                if all(key in theme for key in required_fields):
+                    # 确保related_tables是数组
+                    if isinstance(theme['related_tables'], str):
+                        theme['related_tables'] = [theme['related_tables']]
+                    
+                    # 确保keywords存在且是数组
+                    if 'keywords' not in theme:
+                        # 从description中提取关键词
+                        theme['keywords'] = self._extract_keywords_from_description(theme['description'])
+                    elif isinstance(theme['keywords'], str):
+                        theme['keywords'] = [theme['keywords']]
+                    
+                    # 保留focus_areas用于问题生成(如果没有则使用keywords)
+                    if 'focus_areas' not in theme:
+                        theme['focus_areas'] = theme['keywords'][:3]
+                    
+                    validated_themes.append(theme)
+                else:
+                    self.logger.warning(f"主题格式不完整,跳过: {theme.get('topic_name', 'Unknown')}")
+            
+            return validated_themes
+            
+        except json.JSONDecodeError as e:
+            self.logger.error(f"JSON解析失败: {e}")
+            self.logger.debug(f"原始响应: {response}")
+            raise ValueError(f"无法解析LLM响应为JSON格式: {e}")
+        except Exception as e:
+            self.logger.error(f"解析主题响应失败: {e}")
+            raise
+    
+    def _extract_keywords_from_description(self, description: str) -> List[str]:
+        """从描述中提取关键词(简单实现)"""
+        # 定义常见的业务关键词
+        business_keywords = [
+            "收入", "营业额", "订单", "支付", "统计", "分析", "趋势", "对比",
+            "排名", "汇总", "明细", "报表", "月度", "日度", "年度", "服务区",
+            "档口", "商品", "客流", "车流", "效率", "占比", "增长"
+        ]
+        
+        # 从描述中查找出现的关键词
+        found_keywords = []
+        for keyword in business_keywords:
+            if keyword in description:
+                found_keywords.append(keyword)
+        
+        # 如果找到的太少,返回默认值
+        if len(found_keywords) < 3:
+            found_keywords = ["数据分析", "统计报表", "业务查询"]
+        
+        return found_keywords[:5]  # 最多返回5个 
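+
+# 使用示例(草图,假设 vn 为已创建的vanna实例,md_contents 为汇总后的MD内容):
+#   extractor = ThemeExtractor(vn, business_context="高速公路服务区管理系统")
+#   themes = await extractor.extract_themes(md_contents)
+#   # 每个主题包含 topic_name/description/related_tables/keywords/focus_areas 字段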

+ 12 - 1
schema_tools/config.py

@@ -77,7 +77,7 @@ SCHEMA_TOOLS_CONFIG = {
     "ddl_file_suffix": ".ddl",
     "doc_file_suffix": "_detail.md",
     "log_file": "schema_tools.log",
-    "create_subdirectories": True,            # 是否创建ddl/docs子目录
+    "create_subdirectories": False,            # 是否创建ddl/docs子目录
     
     # 输出格式配置
     "include_sample_data_in_comments": True,  # 注释中是否包含示例数据
@@ -88,6 +88,17 @@ SCHEMA_TOOLS_CONFIG = {
     "debug_mode": False,                      # 调试模式
     "save_llm_prompts": False,               # 是否保存LLM提示词
     "save_llm_responses": False,             # 是否保存LLM响应
+    
+    # Question-SQL生成配置
+    "qs_generation": {
+        "max_tables": 20,                    # 最大表数量限制
+        "theme_count": 5,                    # LLM生成的主题数量
+        "questions_per_theme": 10,           # 每个主题生成的问题数
+        "max_concurrent_themes": 1,          # 并行处理的主题数量
+        "continue_on_theme_error": True,     # 主题生成失败是否继续
+        "save_intermediate": True,           # 是否保存中间结果
+        "output_file_prefix": "qs",          # 输出文件前缀
+    }
 }
 
 # 从app_config获取相关配置(如果可用)

+ 525 - 0
schema_tools/qs_agent.py

@@ -0,0 +1,525 @@
+import asyncio
+import json
+import logging
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import List, Dict, Any, Optional
+
+from schema_tools.config import SCHEMA_TOOLS_CONFIG
+from schema_tools.validators import FileCountValidator
+from schema_tools.analyzers import MDFileAnalyzer, ThemeExtractor
+from schema_tools.utils.logger import setup_logging
+from core.vanna_llm_factory import create_vanna_instance
+
+
+class QuestionSQLGenerationAgent:
+    """Question-SQL生成Agent"""
+    
+    def __init__(self, 
+                 output_dir: str,
+                 table_list_file: str,
+                 business_context: str,
+                 db_name: str = None):
+        """
+        初始化Agent
+        
+        Args:
+            output_dir: 输出目录(包含DDL和MD文件)
+            table_list_file: 表清单文件路径
+            business_context: 业务上下文
+            db_name: 数据库名称(用于输出文件命名)
+        """
+        self.output_dir = Path(output_dir)
+        self.table_list_file = table_list_file
+        self.business_context = business_context
+        self.db_name = db_name or "db"
+        
+        self.config = SCHEMA_TOOLS_CONFIG
+        self.logger = logging.getLogger("schema_tools.QSAgent")
+        
+        # 初始化组件
+        self.validator = FileCountValidator()
+        self.md_analyzer = MDFileAnalyzer(output_dir)
+        
+        # vanna实例和主题提取器将在需要时初始化
+        self.vn = None
+        self.theme_extractor = None
+        
+        # 中间结果存储
+        self.intermediate_results = []
+        self.intermediate_file = None
+        
+    async def generate(self) -> Dict[str, Any]:
+        """
+        生成Question-SQL对
+        
+        Returns:
+            生成结果报告
+        """
+        start_time = time.time()
+        
+        try:
+            self.logger.info("🚀 开始生成Question-SQL训练数据")
+            
+            # 1. 验证文件数量
+            self.logger.info("📋 验证文件数量...")
+            validation_result = self.validator.validate(self.table_list_file, str(self.output_dir))
+            
+            if not validation_result.is_valid:
+                self.logger.error(f"❌ 文件验证失败: {validation_result.error}")
+                if validation_result.missing_ddl:
+                    self.logger.error(f"缺失DDL文件: {validation_result.missing_ddl}")
+                if validation_result.missing_md:
+                    self.logger.error(f"缺失MD文件: {validation_result.missing_md}")
+                raise ValueError(f"文件验证失败: {validation_result.error}")
+            
+            self.logger.info(f"✅ 文件验证通过: {validation_result.table_count}个表")
+            
+            # 2. 读取所有MD文件内容
+            self.logger.info("📖 读取MD文件...")
+            md_contents = await self.md_analyzer.read_all_md_files()
+            
+            # 3. 初始化LLM相关组件
+            self._initialize_llm_components()
+            
+            # 4. 提取分析主题
+            self.logger.info("🎯 提取分析主题...")
+            themes = await self.theme_extractor.extract_themes(md_contents)
+            self.logger.info(f"✅ 成功提取 {len(themes)} 个分析主题")
+            
+            for i, theme in enumerate(themes):
+                topic_name = theme.get('topic_name', theme.get('name', ''))
+                description = theme.get('description', '')
+                self.logger.info(f"  {i+1}. {topic_name}: {description}")
+            
+            # 5. 初始化中间结果文件
+            self._init_intermediate_file()
+            
+            # 6. 处理每个主题
+            all_qs_pairs = []
+            failed_themes = []
+            
+            # 根据配置决定是并行还是串行处理
+            max_concurrent = self.config['qs_generation'].get('max_concurrent_themes', 1)
+            if max_concurrent > 1:
+                results = await self._process_themes_parallel(themes, md_contents, max_concurrent)
+            else:
+                results = await self._process_themes_serial(themes, md_contents)
+            
+            # 7. 整理结果
+            for result in results:
+                if result['success']:
+                    all_qs_pairs.extend(result['qs_pairs'])
+                else:
+                    failed_themes.append(result['theme_name'])
+            
+            # 8. 保存最终结果
+            output_file = await self._save_final_results(all_qs_pairs)
+            
+            # 9. 生成metadata.txt文件
+            await self._generate_metadata_file(themes)
+            
+            # 10. 清理中间文件
+            if not failed_themes:  # 只有全部成功才清理
+                self._cleanup_intermediate_file()
+            
+            # 11. 生成报告
+            end_time = time.time()
+            report = {
+                'success': True,
+                'total_themes': len(themes),
+                'successful_themes': len(themes) - len(failed_themes),
+                'failed_themes': failed_themes,
+                'total_questions': len(all_qs_pairs),
+                'output_file': str(output_file),
+                'execution_time': end_time - start_time
+            }
+            
+            self._print_summary(report)
+            
+            return report
+            
+        except Exception as e:
+            self.logger.exception("❌ Question-SQL生成失败")
+            
+            # 保存当前已生成的结果
+            if self.intermediate_results:
+                recovery_file = self._save_intermediate_results()
+                self.logger.warning(f"⚠️  中间结果已保存到: {recovery_file}")
+            
+            raise
+    
+    def _initialize_llm_components(self):
+        """初始化LLM相关组件"""
+        if not self.vn:
+            self.logger.info("初始化LLM组件...")
+            self.vn = create_vanna_instance()
+            self.theme_extractor = ThemeExtractor(self.vn, self.business_context)
+    
+    async def _process_themes_serial(self, themes: List[Dict], md_contents: str) -> List[Dict]:
+        """串行处理主题"""
+        results = []
+        
+        for i, theme in enumerate(themes):
+            self.logger.info(f"处理主题 {i+1}/{len(themes)}: {theme.get('topic_name', theme.get('name', ''))}")
+            result = await self._process_single_theme(theme, md_contents)
+            results.append(result)
+            
+            # 检查是否需要继续
+            if not result['success'] and not self.config['qs_generation']['continue_on_theme_error']:
+                self.logger.error(f"主题处理失败,停止处理")
+                break
+        
+        return results
+    
+    async def _process_themes_parallel(self, themes: List[Dict], md_contents: str, max_concurrent: int) -> List[Dict]:
+        """并行处理主题"""
+        semaphore = asyncio.Semaphore(max_concurrent)
+        
+        async def process_with_semaphore(theme):
+            async with semaphore:
+                return await self._process_single_theme(theme, md_contents)
+        
+        tasks = [process_with_semaphore(theme) for theme in themes]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+        
+        # 处理异常结果
+        processed_results = []
+        for i, result in enumerate(results):
+            if isinstance(result, Exception):
+                theme_name = themes[i].get('topic_name', themes[i].get('name', ''))
+                self.logger.error(f"主题 '{theme_name}' 处理异常: {result}")
+                processed_results.append({
+                    'success': False,
+                    'theme_name': theme_name,
+                    'error': str(result)
+                })
+            else:
+                processed_results.append(result)
+        
+        return processed_results
+    
+    async def _process_single_theme(self, theme: Dict, md_contents: str) -> Dict:
+        """处理单个主题"""
+        theme_name = theme.get('topic_name', theme.get('name', ''))
+        
+        try:
+            self.logger.info(f"🔍 开始处理主题: {theme_name}")
+            
+            # 构建prompt
+            prompt = self._build_qs_generation_prompt(theme, md_contents)
+            
+            # 调用LLM生成
+            response = await self._call_llm(prompt)
+            
+            # 解析响应
+            qs_pairs = self._parse_qs_response(response)
+            
+            # 验证和清理
+            validated_pairs = self._validate_qs_pairs(qs_pairs, theme_name)  # 使用兼容新旧格式的主题名,避免新格式下KeyError
+            
+            # 保存中间结果
+            await self._save_theme_results(theme_name, validated_pairs)
+            
+            self.logger.info(f"✅ 主题 '{theme_name}' 处理成功,生成 {len(validated_pairs)} 个问题")
+            
+            return {
+                'success': True,
+                'theme_name': theme_name,
+                'qs_pairs': validated_pairs
+            }
+            
+        except Exception as e:
+            self.logger.error(f"❌ 处理主题 '{theme_name}' 失败: {e}")
+            return {
+                'success': False,
+                'theme_name': theme_name,
+                'error': str(e),
+                'qs_pairs': []
+            }
+    
+    def _build_qs_generation_prompt(self, theme: Dict, md_contents: str) -> str:
+        """构建Question-SQL生成的prompt"""
+        questions_count = self.config['qs_generation']['questions_per_theme']
+        
+        # 兼容新旧格式
+        topic_name = theme.get('topic_name', theme.get('name', ''))
+        description = theme.get('description', '')
+        focus_areas = theme.get('focus_areas', theme.get('keywords', []))
+        related_tables = theme.get('related_tables', [])
+        
+        prompt = f"""你是一位业务数据分析师,正在为{self.business_context}设计数据查询。
+
+当前分析主题:{topic_name}
+主题描述:{description}
+关注领域:{', '.join(focus_areas)}
+相关表:{', '.join(related_tables)}
+
+数据库表结构信息:
+{md_contents}
+
+请为这个主题生成 {questions_count} 个业务问题和对应的SQL查询。
+
+要求:
+1. 问题应该从业务角度出发,贴合主题要求,具有实际分析价值
+2. SQL必须使用PostgreSQL语法
+3. 考虑实际业务逻辑(如软删除使用 delete_ts IS NULL 条件)
+4. 使用中文别名提高可读性(使用 AS 指定列别名)
+5. 问题应该多样化,覆盖不同的分析角度
+6. 包含时间筛选、分组统计、排序、限制等不同类型的查询
+7. SQL语句末尾必须以分号结束
+
+输出JSON格式(注意SQL中的双引号需要转义):
+```json
+[
+  {{
+    "question": "具体的业务问题?",
+    "sql": "SELECT column AS 中文名 FROM table WHERE condition;"
+  }}
+]
+```
+
+生成的问题应该包括但不限于:
+- 趋势分析(按时间维度)
+- 对比分析(不同维度对比)
+- 排名统计(TOP N)
+- 汇总统计(总量、平均值等)
+- 明细查询(特定条件的详细数据)"""
+        
+        return prompt
+    
+    async def _call_llm(self, prompt: str) -> str:
+        """调用LLM"""
+        try:
+            response = await asyncio.to_thread(
+                self.vn.chat_with_llm,
+                question=prompt,
+                system_prompt="你是一个专业的数据分析师,精通PostgreSQL语法,擅长设计有业务价值的数据查询。请严格按照JSON格式输出。"
+            )
+            
+            if not response or not response.strip():
+                raise ValueError("LLM返回空响应")
+            
+            return response.strip()
+            
+        except Exception as e:
+            self.logger.error(f"LLM调用失败: {e}")
+            raise
+    
+    def _parse_qs_response(self, response: str) -> List[Dict[str, str]]:
+        """解析Question-SQL响应"""
+        try:
+            # 提取JSON部分
+            import re
+            json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
+            if json_match:
+                json_str = json_match.group(1)
+            else:
+                json_str = response
+            
+            # 解析JSON
+            qs_pairs = json.loads(json_str)
+            
+            if not isinstance(qs_pairs, list):
+                raise ValueError("响应不是列表格式")
+            
+            return qs_pairs
+            
+        except json.JSONDecodeError as e:
+            self.logger.error(f"JSON解析失败: {e}")
+            self.logger.debug(f"原始响应: {response}")
+            raise ValueError(f"无法解析LLM响应为JSON格式: {e}")
+    
+    def _validate_qs_pairs(self, qs_pairs: List[Dict], theme_name: str) -> List[Dict[str, str]]:
+        """验证和清理Question-SQL对"""
+        validated = []
+        
+        for i, pair in enumerate(qs_pairs):
+            if not isinstance(pair, dict):
+                self.logger.warning(f"跳过无效格式的项 {i+1}")
+                continue
+            
+            question = pair.get('question', '').strip()
+            sql = pair.get('sql', '').strip()
+            
+            if not question or not sql:
+                self.logger.warning(f"跳过空问题或SQL的项 {i+1}")
+                continue
+            
+            # 确保SQL以分号结束
+            if not sql.endswith(';'):
+                sql += ';'
+            
+            validated.append({
+                'question': question,
+                'sql': sql
+            })
+        
+        self.logger.info(f"主题 '{theme_name}': 验证通过 {len(validated)}/{len(qs_pairs)} 个问题")
+        
+        return validated
+    
+    def _init_intermediate_file(self):
+        """初始化中间结果文件"""
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        self.intermediate_file = self.output_dir / f"qs_intermediate_{timestamp}.json"
+        self.intermediate_results = []
+        self.logger.info(f"中间结果文件: {self.intermediate_file}")
+    
+    async def _save_theme_results(self, theme_name: str, qs_pairs: List[Dict]):
+        """保存单个主题的结果"""
+        theme_result = {
+            "theme": theme_name,
+            "timestamp": datetime.now().isoformat(),
+            "questions_count": len(qs_pairs),
+            "questions": qs_pairs
+        }
+        
+        self.intermediate_results.append(theme_result)
+        
+        # 立即保存到中间文件
+        if self.config['qs_generation']['save_intermediate']:
+            try:
+                with open(self.intermediate_file, 'w', encoding='utf-8') as f:
+                    json.dump(self.intermediate_results, f, ensure_ascii=False, indent=2)
+                self.logger.debug(f"中间结果已更新: {self.intermediate_file}")
+            except Exception as e:
+                self.logger.warning(f"保存中间结果失败: {e}")
+    
+    def _save_intermediate_results(self) -> Path:
+        """异常时保存中间结果"""
+        if not self.intermediate_results:
+            return None
+        
+        recovery_file = self.output_dir / f"qs_recovery_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
+        
+        try:
+            with open(recovery_file, 'w', encoding='utf-8') as f:
+                json.dump({
+                    "status": "interrupted",
+                    "timestamp": datetime.now().isoformat(),
+                    "completed_themes": len(self.intermediate_results),
+                    "results": self.intermediate_results
+                }, f, ensure_ascii=False, indent=2)
+            
+            return recovery_file
+            
+        except Exception as e:
+            self.logger.error(f"保存恢复文件失败: {e}")
+            return None
+    
+    async def _save_final_results(self, all_qs_pairs: List[Dict]) -> Path:
+        """保存最终结果"""
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        output_file = self.output_dir / f"{self.config['qs_generation']['output_file_prefix']}_{self.db_name}_{timestamp}_pair.json"
+        
+        try:
+            with open(output_file, 'w', encoding='utf-8') as f:
+                json.dump(all_qs_pairs, f, ensure_ascii=False, indent=2)
+            
+            self.logger.info(f"✅ 最终结果已保存到: {output_file}")
+            return output_file
+            
+        except Exception as e:
+            self.logger.error(f"保存最终结果失败: {e}")
+            raise
+    
+    def _cleanup_intermediate_file(self):
+        """清理中间文件"""
+        if self.intermediate_file and self.intermediate_file.exists():
+            try:
+                self.intermediate_file.unlink()
+                self.logger.info("已清理中间文件")
+            except Exception as e:
+                self.logger.warning(f"清理中间文件失败: {e}")
+    
+    def _print_summary(self, report: Dict):
+        """打印总结信息"""
+        self.logger.info("=" * 60)
+        self.logger.info("📊 生成总结")
+        self.logger.info(f"  ✅ 总主题数: {report['total_themes']}")
+        self.logger.info(f"  ✅ 成功主题: {report['successful_themes']}")
+        
+        if report['failed_themes']:
+            self.logger.info(f"  ❌ 失败主题: {len(report['failed_themes'])}")
+            for theme in report['failed_themes']:
+                self.logger.info(f"     - {theme}")
+        
+        self.logger.info(f"  📝 总问题数: {report['total_questions']}")
+        self.logger.info(f"  📁 输出文件: {report['output_file']}")
+        self.logger.info(f"  ⏱️  执行时间: {report['execution_time']:.2f}秒")
+        self.logger.info("=" * 60)
+
+    async def _generate_metadata_file(self, themes: List[Dict]):
+        """生成metadata.txt文件,包含INSERT语句"""
+        metadata_file = self.output_dir / "metadata.txt"
+        
+        try:
+            with open(metadata_file, 'w', encoding='utf-8') as f:
+                f.write("-- Schema Tools生成的主题元数据\n")
+                f.write(f"-- 业务背景: {self.business_context}\n")
+                f.write(f"-- 生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
+                f.write(f"-- 数据库: {self.db_name}\n\n")
+                
+                f.write("-- 创建表(如果不存在)\n")
+                f.write("CREATE TABLE IF NOT EXISTS metadata (\n")
+                f.write("    id SERIAL PRIMARY KEY,\n")
+                f.write("    topic_name VARCHAR(100) NOT NULL,\n")
+                f.write("    description TEXT,\n")
+                f.write("    related_tables TEXT[],\n")
+                f.write("    keywords TEXT[],\n")
+                f.write("    focus_areas TEXT[],\n")
+                f.write("    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n")
+                f.write(");\n\n")
+                
+                f.write("-- 插入主题数据\n")
+                for theme in themes:
+                    # 获取字段值,使用新格式
+                    topic_name = theme.get('topic_name', theme.get('name', ''))
+                    description = theme.get('description', '')
+                    
+                    # 处理related_tables
+                    related_tables = theme.get('related_tables', [])
+                    if isinstance(related_tables, list):
+                        tables_str = '{' + ','.join(related_tables) + '}'
+                    else:
+                        tables_str = '{}'
+                    
+                    # 处理keywords
+                    keywords = theme.get('keywords', [])
+                    if isinstance(keywords, list):
+                        keywords_str = '{' + ','.join(keywords) + '}'
+                    else:
+                        keywords_str = '{}'
+                    
+                    # 处理focus_areas
+                    focus_areas = theme.get('focus_areas', [])
+                    if isinstance(focus_areas, list):
+                        focus_areas_str = '{' + ','.join(focus_areas) + '}'
+                    else:
+                        focus_areas_str = '{}'
+                    
+                    # 生成INSERT语句
+                    f.write("INSERT INTO metadata(topic_name, description, related_tables, keywords, focus_areas) VALUES\n")
+                    f.write("(\n")
+                    f.write(f"  '{self._escape_sql_string(topic_name)}',\n")
+                    f.write(f"  '{self._escape_sql_string(description)}',\n")
+                    f.write(f"  '{tables_str}',\n")
+                    f.write(f"  '{keywords_str}',\n")
+                    f.write(f"  '{focus_areas_str}'\n")
+                    f.write(");\n\n")
+            
+            self.logger.info(f"✅ metadata.txt文件已生成: {metadata_file}")
+            return metadata_file
+            
+        except Exception as e:
+            self.logger.error(f"生成metadata.txt文件失败: {e}")
+            return None
+    
+    def _escape_sql_string(self, value: str) -> str:
+        """转义SQL字符串中的特殊字符"""
+        if not value:
+            return ""
+        # 转义单引号
+        return value.replace("'", "''") 
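+
+# 生成的 metadata.txt 中INSERT语句示例(草图,主题取值为假设):
+#   INSERT INTO metadata(topic_name, description, related_tables, keywords, focus_areas) VALUES
+#   (
+#     '日营业数据分析',
+#     '分析各服务区每日营收与订单结构',
+#     '{bss_business_day_data,bss_service_area}',
+#     '{收入,订单,支付方式}',
+#     '{收入趋势,服务区对比,支付方式分布}'
+#   );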

+ 139 - 0
schema_tools/qs_generator.py

@@ -0,0 +1,139 @@
+"""
+Question-SQL生成器命令行入口
+用于从已生成的DDL和MD文件生成Question-SQL训练数据
+"""
+
+import argparse
+import asyncio
+import sys
+import os
+from pathlib import Path
+
+from schema_tools.qs_agent import QuestionSQLGenerationAgent
+from schema_tools.utils.logger import setup_logging
+
+
+def setup_argument_parser():
+    """设置命令行参数解析器"""
+    parser = argparse.ArgumentParser(
+        description='Question-SQL Generator - 从MD文件生成Question-SQL训练数据',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+示例用法:
+  # 基本使用
+  python -m schema_tools.qs_generator --output-dir ./output --table-list ./tables.txt --business-context "高速公路服务区管理系统"
+  
+  # 指定数据库名称
+  python -m schema_tools.qs_generator --output-dir ./output --table-list ./tables.txt --business-context "电商系统" --db-name ecommerce_db
+  
+  # 启用详细日志
+  python -m schema_tools.qs_generator --output-dir ./output --table-list ./tables.txt --business-context "管理系统" --verbose
+        """
+    )
+    
+    # 必需参数
+    parser.add_argument(
+        '--output-dir',
+        required=True,
+        help='包含DDL和MD文件的输出目录'
+    )
+    
+    parser.add_argument(
+        '--table-list',
+        required=True,
+        help='表清单文件路径(用于验证文件数量)'
+    )
+    
+    parser.add_argument(
+        '--business-context',
+        required=True,
+        help='业务上下文描述'
+    )
+    
+    # 可选参数
+    parser.add_argument(
+        '--db-name',
+        help='数据库名称(用于输出文件命名)'
+    )
+    
+    parser.add_argument(
+        '--verbose', '-v',
+        action='store_true',
+        help='启用详细日志输出'
+    )
+    
+    parser.add_argument(
+        '--log-file',
+        help='日志文件路径'
+    )
+    
+    return parser
+
+
+async def main():
+    """主入口函数"""
+    parser = setup_argument_parser()
+    args = parser.parse_args()
+    
+    # 设置日志
+    setup_logging(
+        verbose=args.verbose,
+        log_file=args.log_file,
+        log_dir=os.path.join(args.output_dir, 'logs') if args.output_dir else None
+    )
+    
+    # 验证参数
+    output_path = Path(args.output_dir)
+    if not output_path.exists():
+        print(f"错误: 输出目录不存在: {args.output_dir}")
+        sys.exit(1)
+    
+    if not os.path.exists(args.table_list):
+        print(f"错误: 表清单文件不存在: {args.table_list}")
+        sys.exit(1)
+    
+    try:
+        # 创建Agent
+        agent = QuestionSQLGenerationAgent(
+            output_dir=args.output_dir,
+            table_list_file=args.table_list,
+            business_context=args.business_context,
+            db_name=args.db_name
+        )
+        
+        # 执行生成
+        print(f"🚀 开始生成Question-SQL训练数据...")
+        print(f"📁 输出目录: {args.output_dir}")
+        print(f"📋 表清单: {args.table_list}")
+        print(f"🏢 业务背景: {args.business_context}")
+        
+        report = await agent.generate()
+        
+        # 输出结果
+        if report['success']:
+            if report['failed_themes']:
+                print(f"\n⚠️  生成完成,但有 {len(report['failed_themes'])} 个主题失败")
+                exit_code = 2  # 部分成功
+            else:
+                print("\n🎉 所有主题生成成功!")
+                exit_code = 0  # 完全成功
+        else:
+            print("\n❌ 生成失败")
+            exit_code = 1
+        
+        print(f"📁 输出文件: {report['output_file']}")
+        sys.exit(exit_code)
+        
+    except KeyboardInterrupt:
+        print("\n\n⏹️  用户中断,程序退出")
+        sys.exit(130)
+    except Exception as e:
+        print(f"\n❌ 程序执行失败: {e}")
+        if args.verbose:
+            import traceback
+            traceback.print_exc()
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    asyncio.run(main()) 

+ 7 - 1
schema_tools/tables.txt

@@ -3,5 +3,11 @@
 # 以 # 开头的行为注释
 
 # 服务区相关表
-public.bss_car_day_count
+bss_car_day_count
+bss_business_day_data
+bss_company
+bss_section_route
+bss_section_route_area_link
+bss_service_area
+bss_service_area_mapper
 

+ 3 - 1
schema_tools/training_data_agent.py

@@ -95,7 +95,9 @@ class SchemaTrainingDataAgent:
         if self.config["create_subdirectories"]:
             os.makedirs(os.path.join(self.output_dir, "ddl"), exist_ok=True)
             os.makedirs(os.path.join(self.output_dir, "docs"), exist_ok=True)
-            os.makedirs(os.path.join(self.output_dir, "logs"), exist_ok=True)
+        
+        # logs目录始终创建
+        os.makedirs(os.path.join(self.output_dir, "logs"), exist_ok=True)
         
         # 初始化数据库工具
         database_tool = ToolRegistry.get_tool("database_inspector", db_connection=self.db_connection)

+ 62 - 8
schema_tools/utils/table_parser.py

@@ -1,6 +1,6 @@
 import os
 import logging
-from typing import List
+from typing import List, Tuple
 
 class TableListParser:
     """表清单解析器"""
@@ -16,7 +16,7 @@ class TableListParser:
             file_path: 表清单文件路径
             
         Returns:
-            表名列表
+            表名列表(已去重)
             
         Raises:
             FileNotFoundError: 文件不存在
@@ -26,6 +26,8 @@ class TableListParser:
             raise FileNotFoundError(f"表清单文件不存在: {file_path}")
         
         tables = []
+        seen_tables = set()  # 用于跟踪已见过的表名
+        duplicate_count = 0  # 重复表计数
         
         try:
             with open(file_path, 'r', encoding='utf-8') as f:
@@ -39,15 +41,27 @@ class TableListParser:
                     
                     # 验证表名格式
                     if self._validate_table_name(line):
-                        tables.append(line)
-                        self.logger.debug(f"解析到表: {line}")
+                        # 检查是否重复
+                        if line not in seen_tables:
+                            tables.append(line)
+                            seen_tables.add(line)
+                            self.logger.debug(f"解析到表: {line}")
+                        else:
+                            duplicate_count += 1
+                            self.logger.debug(f"第 {line_num} 行: 发现重复表名: {line}")
                     else:
                         self.logger.warning(f"第 {line_num} 行: 无效的表名格式: {line}")
             
             if not tables:
                 raise ValueError("表清单文件中没有有效的表名")
             
-            self.logger.info(f"成功解析 {len(tables)} 个表")
+            # 记录去重统计信息
+            original_count = len(tables) + duplicate_count
+            if duplicate_count > 0:
+                self.logger.info(f"表清单去重统计: 原始{original_count}个表,去重后{len(tables)}个表,移除了{duplicate_count}个重复项")
+            else:
+                self.logger.info(f"成功解析 {len(tables)} 个表(无重复)")
+            
             return tables
             
         except Exception as e:
@@ -94,9 +108,10 @@ class TableListParser:
             tables_str: 表名字符串,逗号或换行分隔
             
         Returns:
-            表名列表
+            表名列表(已去重)
         """
         tables = []
+        seen_tables = set()
         
         # 支持逗号和换行分隔
         for separator in [',', '\n']:
@@ -109,6 +124,45 @@ class TableListParser:
         for part in parts:
             table_name = part.strip()
             if table_name and self._validate_table_name(table_name):
-                tables.append(table_name)
+                if table_name not in seen_tables:
+                    tables.append(table_name)
+                    seen_tables.add(table_name)
+        
+        return tables
+    
+    def get_duplicate_info(self, file_path: str) -> Tuple[List[str], List[str]]:
+        """
+        获取表清单文件的重复信息
+        
+        Args:
+            file_path: 表清单文件路径
+            
+        Returns:
+            (唯一表名列表, 重复表名列表)
+        """
+        if not os.path.exists(file_path):
+            raise FileNotFoundError(f"表清单文件不存在: {file_path}")
         
-        return tables
+        unique_tables = []
+        duplicate_tables = []
+        seen_tables = set()
+        
+        try:
+            with open(file_path, 'r', encoding='utf-8') as f:
+                for line in f:
+                    line = line.strip()
+                    if not line or line.startswith('#') or line.startswith('--'):
+                        continue
+                    
+                    if self._validate_table_name(line):
+                        if line not in seen_tables:
+                            unique_tables.append(line)
+                            seen_tables.add(line)
+                        else:
+                            duplicate_tables.append(line)
+            
+            return unique_tables, duplicate_tables
+            
+        except Exception as e:
+            self.logger.error(f"获取重复信息失败: {e}")
+            raise

+ 9 - 0
schema_tools/validators/__init__.py

@@ -0,0 +1,9 @@
+"""
+文件验证器模块
+"""
+
+from .file_count_validator import FileCountValidator
+
+__all__ = [
+    "FileCountValidator"
+] 

+ 194 - 0
schema_tools/validators/file_count_validator.py

@@ -0,0 +1,194 @@
+import logging
+from pathlib import Path
+from typing import Dict, List, Tuple, Set
+from dataclasses import dataclass, field
+
+from schema_tools.utils.table_parser import TableListParser
+from schema_tools.config import SCHEMA_TOOLS_CONFIG
+
+
+@dataclass
+class ValidationResult:
+    """验证结果"""
+    is_valid: bool
+    table_count: int
+    ddl_count: int
+    md_count: int
+    error: str = ""
+    missing_ddl: List[str] = field(default_factory=list)
+    missing_md: List[str] = field(default_factory=list)
+    duplicate_tables: List[str] = field(default_factory=list)
+
+
+class FileCountValidator:
+    """文件数量验证器"""
+    
+    def __init__(self):
+        self.logger = logging.getLogger("schema_tools.FileCountValidator")
+        self.config = SCHEMA_TOOLS_CONFIG
+        
+    def validate(self, table_list_file: str, output_dir: str) -> ValidationResult:
+        """
+        验证生成的文件数量是否与表数量一致
+        
+        Args:
+            table_list_file: 表清单文件路径
+            output_dir: 输出目录路径
+            
+        Returns:
+            ValidationResult: 验证结果
+        """
+        try:
+            # 1. 解析表清单获取表数量(自动去重)
+            table_parser = TableListParser()
+            tables = table_parser.parse_file(table_list_file)
+            table_count = len(tables)
+            
+            # 获取重复信息
+            unique_tables, duplicate_tables = table_parser.get_duplicate_info(table_list_file)
+            
+            # 2. 检查表数量限制
+            max_tables = self.config['qs_generation']['max_tables']
+            if table_count > max_tables:
+                return ValidationResult(
+                    is_valid=False,
+                    table_count=table_count,
+                    ddl_count=0,
+                    md_count=0,
+                    error=f"表数量({table_count})超过限制({max_tables})。请分批处理或调整配置中的max_tables参数。",
+                    duplicate_tables=duplicate_tables
+                )
+            
+            # 3. 扫描输出目录
+            output_path = Path(output_dir)
+            if not output_path.exists():
+                return ValidationResult(
+                    is_valid=False,
+                    table_count=table_count,
+                    ddl_count=0,
+                    md_count=0,
+                    error=f"输出目录不存在: {output_dir}",
+                    duplicate_tables=duplicate_tables
+                )
+            
+            # 4. 统计DDL和MD文件
+            ddl_files = list(output_path.glob("*.ddl"))
+            md_files = list(output_path.glob("*_detail.md"))  # 注意文件后缀格式
+            
+            ddl_count = len(ddl_files)
+            md_count = len(md_files)
+            
+            self.logger.info(f"文件统计 - 表: {table_count}, DDL: {ddl_count}, MD: {md_count}")
+            if duplicate_tables:
+                self.logger.info(f"表清单中存在 {len(duplicate_tables)} 个重复项")
+            
+            # 5. 验证数量一致性
+            if ddl_count != table_count or md_count != table_count:
+                # 查找缺失的文件
+                missing_ddl, missing_md = self._find_missing_files(tables, ddl_files, md_files)
+                
+                error_parts = []
+                if ddl_count != table_count:
+                    error_parts.append(f"DDL文件数量({ddl_count})与表数量({table_count})不一致")
+                    if missing_ddl:
+                        self.logger.error(f"缺失的DDL文件对应的表: {', '.join(missing_ddl)}")
+                
+                if md_count != table_count:
+                    error_parts.append(f"MD文件数量({md_count})与表数量({table_count})不一致")
+                    if missing_md:
+                        self.logger.error(f"缺失的MD文件对应的表: {', '.join(missing_md)}")
+                
+                return ValidationResult(
+                    is_valid=False,
+                    table_count=table_count,
+                    ddl_count=ddl_count,
+                    md_count=md_count,
+                    error="; ".join(error_parts),
+                    missing_ddl=missing_ddl,
+                    missing_md=missing_md,
+                    duplicate_tables=duplicate_tables
+                )
+            
+            # 6. 验证通过
+            self.logger.info(f"文件验证通过:{table_count}个表,{ddl_count}个DDL,{md_count}个MD")
+            
+            return ValidationResult(
+                is_valid=True,
+                table_count=table_count,
+                ddl_count=ddl_count,
+                md_count=md_count,
+                duplicate_tables=duplicate_tables
+            )
+            
+        except Exception as e:
+            self.logger.exception("文件验证失败")
+            return ValidationResult(
+                is_valid=False,
+                table_count=0,
+                ddl_count=0,
+                md_count=0,
+                error=f"验证过程发生异常: {str(e)}"
+            )
+    
+    def _find_missing_files(self, tables: List[str], ddl_files: List[Path], md_files: List[Path]) -> Tuple[List[str], List[str]]:
+        """查找缺失的文件"""
+        # 获取已生成的文件名(不含扩展名)
+        ddl_names = {f.stem for f in ddl_files}
+        md_names = {f.stem.removesuffix('_detail') for f in md_files}  # 仅移除结尾的_detail后缀(Python 3.9+)
+        
+        missing_ddl = []
+        missing_md = []
+        
+        # 为每个表建立可能的文件名映射
+        table_to_filenames = self._get_table_filename_mapping(tables)
+        
+        # 检查每个表的文件
+        for table_spec in tables:
+            # 获取该表可能的文件名
+            possible_filenames = table_to_filenames[table_spec]
+            
+            # 检查DDL文件
+            ddl_exists = any(fname in ddl_names for fname in possible_filenames)
+            if not ddl_exists:
+                missing_ddl.append(table_spec)
+                
+            # 检查MD文件
+            md_exists = any(fname in md_names for fname in possible_filenames)
+            if not md_exists:
+                missing_md.append(table_spec)
+        
+        return missing_ddl, missing_md
+    
+    def _get_table_filename_mapping(self, tables: List[str]) -> Dict[str, Set[str]]:
+        """获取表名到可能的文件名的映射"""
+        mapping = {}
+        
+        for table_spec in tables:
+            # 解析表名
+            if '.' in table_spec:
+                schema, table = table_spec.split('.', 1)
+            else:
+                schema, table = 'public', table_spec
+            
+            # 生成可能的文件名
+            possible_names = set()
+            
+            # 基本格式
+            if schema.lower() == 'public':
+                possible_names.add(table)
+            else:
+                possible_names.add(f"{schema}__{table}")
+                possible_names.add(f"{schema}_{table}")  # 兼容不同格式
+            
+            # 考虑特殊字符替换
+            safe_name = table.replace('-', '_').replace(' ', '_')
+            if safe_name != table:
+                if schema.lower() == 'public':
+                    possible_names.add(safe_name)
+                else:
+                    possible_names.add(f"{schema}__{safe_name}")
+                    possible_names.add(f"{schema}_{safe_name}")
+            
+            mapping[table_spec] = possible_names
+        
+        return mapping
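+
+# 使用示例(草图,路径为假设值):
+#   validator = FileCountValidator()
+#   result = validator.validate("./schema_tools/tables.txt", "./output")
+#   if not result.is_valid:
+#       print(result.error, result.missing_ddl, result.missing_md)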