70 commits c793722717 ... d21d87cd3a

Author SHA1 Description Date
  wangxq d21d87cd3a Removed the related test code. 1 week ago
  wangxq 14cd8fe9de Fixed requirement.txt; preparing to deploy to the server 2 weeks ago
  wangxq 68c2b0475f Preparing to deploy to the server 2 weeks ago
  wangxq 0d8b9b469f Improved management of the Redis data written by the agent: added one API and enhanced the existing cleanup API. 2 weeks ago
  wangxq b39d23bd36 Simplified the main function of unified_api.py; start with uvicorn via asgi_app.py (see the sketch after this list). 2 weeks ago
  wangxq 05e3b2f182 Finished developing the ask_react_agent_stream API. 2 weeks ago
  wangxq 586bff9728 ask_react_agent_stream API created; now preparing to refine the step results it returns. 2 weeks ago
  wangxq 7d9f7a35a7 Successfully added the ask_agent_stream API; preparing to add the ask_react_agent_stream API. 2 weeks ago
  wangxq c777698d27 Preparing to develop the astream capability for the langgraph agent. 2 weeks ago
  wangxq e3d9eef93d Added checkpoint deletion and management features to react_agent. 3 weeks ago
  wangxq 443b9c7c06 react_agent: preparing to add checkpoint deletion. 3 weeks ago
  wangxq e5ef3966cf react_agent: added State trimming. 3 weeks ago
  wangxq 0b5b110d0d Settled on the State trimming approach; about to start changing the code. 3 weeks ago
  wangxq 1672d4ab47 Finished the step-polling query feature for the react mode; preparing to trim the State. 3 weeks ago
  wangxq 55f7a4198c Preparing to add step push notifications for the langgraph react agent. 3 weeks ago
  wangxq ee6129025e Moved the threshold configuration parameters from the agent config into a YAML file. 3 weeks ago
  wangxq 0266aae652 Streamlined the so-called progressive rule-based classification, reducing the influence of context when classifying the current question. 3 weeks ago
  wangxq 25437a7e22 Did not change forwarding to agent_chat when agent_sql_generation cannot generate SQL, but changed the rule-based check so that it judges only on [CURRENT] and skips [CONTEXT]. 3 weeks ago
  wangxq 964e5daa7a Preparing to change the flow so that agent_sql_generation forwards to agent_chat when it cannot generate SQL. 3 weeks ago
  wangxq 5b9e10fad4 Changed the State fields, removing four unused fields. 4 weeks ago
  wangxq 072f2d31ea Removed two nodes, unified the graph view, and fixed Windows logging. 4 weeks ago
  wangxq 68988d52d4 Modified unified_api.py to fix ask_agent results missing the conversation_id. 4 weeks ago
  wangxq 1317e42918 Fixed the Redis configuration issue in the react_agent config. 1 month ago
  wangxq eca80a93b7 Preparing to deploy to the 143 server. 1 month ago
  wangxq 560632a932 Renamed pgvector's pg_conn to db_connection; preparing to deploy to the company server. 1 month ago
  wangxq 9c50a68ead Added backup and restore APIs for pgvector; note that the command-line mode is not done. 1 month ago
  wangxq ef0a85b5d9 Fixed the data_pipeline API issue; it now supports the truncate_vector_tables and backup_vector_tables parameters. 1 month ago
  wangxq b5601c3dab Fixed the data_pipeline API issue; it now supports the truncate_vector_tables and backup_vector_tables parameters. 1 month ago
  wangxq 32e79e37cb Preparing changes to resolve the pgvector backup issue during API execution. 1 month ago
  wangxq 1946fe5ac3 Added the truncate/backup parameters to the ./data_pipeline module's command-line mode; now preparing to add these two parameters to the API. 1 month ago
  wangxq e927f8ce0f Finished the design for adding the truncate and backup parameters to the ./data_pipeline/ module; preparing to change the code. 1 month ago
  wangxq 246c3b61b0 Fixed issues when running the ./data_pipeline module scripts, added output logging, and changed the output directory to also create task subdirectories. 1 month ago
  wangxq 1faefca376 Changed the logging configuration to roll over by date. 1 month ago
  wangxq 69478355a5 Fixed the qa_feedback API being over-simplified after migration. 1 month ago
  wangxq 52c3eb150b Standardized the API for fetching ask_react_agent conversation history; results now conform to the standard format. 1 month ago
  wangxq 85928bf5a2 Finished the changes to the pair of chat-history APIs: /api/v0/react/users/wang11/conversations 1 month ago
  wangxq 27ee386b02 Added three data_training APIs and a counter for the langgraph recursion count. 1 month ago
  wangxq 707ca80f27 Fixed issues with the training_data APIs after their migration to unified_api.py; preparing to add APIs for training_data update, batch loading, and merging. 1 month ago
  wangxq e19d1a0975 Partially fixed the event loop error and simplified the prompts. 1 month ago
  wangxq 346d9bb862 Refactored the project-wide logging service; testing surfaced the "Event loop is closed" error again, which will be fixed next. 1 month ago
  wangxq 114b05bc09 Preparing to start refactoring the logging module. 1 month ago
  wangxq 989b3bf3ca Added the data_pipeline APIs that had been missed. 1 month ago
  wangxq f3e38aca1d Fixed the ask_react_agent response issue in unified_api.py; preparing to migrate the remaining APIs missed from api.py. 1 month ago
  wangxq 44fca2b272 Copied react_agent out of the /test/ directory, created a unified unified_api.py, and renamed the chat API to ask_react_api. 1 month ago
  wangxq 84d59c7150 Revised the integration plan; now ready to start the integration 1 month ago
  wangxq d3ae37134d Preparing to carry out the custom_react_agent integration. 1 month ago
  wangxq 698c852d73 Removed the api_data generation functions used during execution, keeping only the final chat api_data generation function. 1 month ago
  wangxq f7cd131223 To debug the LLM returning empty results in multi-turn conversations, I added logging on the agent to inspect the responses; oddly, the empty responses in multi-turn conversations then stopped occurring. 1 month ago
  wangxq 29ac9a1418 Removed the [Formatted Output] marker from the output; preparing to add detailed logging to investigate empty responses on the second turn. 1 month ago
  wangxq 76916624b8 Tried to stop the chat API from explaining why it does not query the database when answering non-technical questions; preparing to remove "Final Answer: [Formatted Output]". 1 month ago
  wangxq 54e550cafb Fixed the chat API emitting "Based on general knowledge" for non-database queries. 1 month ago
  wangxq b5555af32f Found that vanna sometimes answers non-technical questions, making the final answer redundant; this will be fixed last, because changing vanna's prompts would affect the first-version API. 1 month ago
  wangxq 29dab53747 Fixed the issue where a failed SQL generation was still sent to valid_sql(). 1 month ago
  wangxq 4dd770bd90 Formatted the chat API results and fixed some errors. 1 month ago
  wangxq d4ffb11686 Fixed the chat API response format and data-query errors. 1 month ago
  wangxq 2cae793303 Preparing to start modifying /aip/chat. 1 month ago
  wangxq 61304dcb0b ./custom_react_agent: fixed the error caused by LLM timeouts; shell.py tests mostly pass, preparing to test the API 1 month ago
  wangxq dfe57a4995 Replaced WSGI with WsgiToAsgi/uvicorn, switching to async operation; tests seem to have succeeded. 1 month ago
  wangxq a3dc6a7183 First async conversion done, but I hit the same problem while using it; now preparing a new approach to the async conversion. 1 month ago
  wangxq 4782e33fae The previous async conversion failed; making a second attempt now. 1 month ago
  wangxq 989d3b37bb Mostly finished debugging the React agent's Redis support; preparing for the async conversion. 1 month ago
  wangxq bb31ab62cc Continued work on fetching conversation records from Redis; preparing to add an API to fetch conversation details directly from Redis. 1 month ago
  wangxq 419b47f544 Fixed ./custom_react_agent's chat response extracting the wrong SQL; now preparing to fully format the output. 1 month ago
  wangxq 1d61b8d5c5 Preparing to modify the output-result code of ./custom_react_agent 1 month ago
  wangxq 106a60d0f7 ./custom_react_agent reverted to JSON with ASCII-escaped characters; preparing a fix for this. 1 month ago
  wangxq a415b7fe82 Preparing to refactor the code under /test/custom_react_agent; already replaced create_react_agent() with StateGraph and added output and intermediate processing nodes. 1 month ago
  wangxq b7977aa8cf Added tests for native Redis support under the test directory and fixed the task-deletion API in citu_app.py. 1 month ago
  wangxq 3b7110dde2 Finished the react prototype; now preparing to test the memory saver. 1 month ago
  wangxq 8fc13a21cd Added git tracking for the test directory. 1 month ago
  wangxq 6243095d72 Enabled git tracking of the test directory, including all test code files and Jupyter notebooks 1 month ago
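The uvicorn/asgi_app.py switch referenced above (commits b39d23bd36 and dfe57a4995) boils down to wrapping the existing Flask WSGI application in an ASGI adapter. A minimal sketch of that pattern follows; the import path of the app object and the host/port values are assumptions, not taken from this log:

# asgi_app.py -- minimal sketch of the WsgiToAsgi/uvicorn pattern; the
# `unified_api` import and the host/port are assumptions, not repository code.
from asgiref.wsgi import WsgiToAsgi

from unified_api import app  # assumed: the existing Flask (WSGI) application

# Wrap the WSGI app so an ASGI server (uvicorn) can drive it asynchronously.
asgi_app = WsgiToAsgi(app)

if __name__ == "__main__":
    import uvicorn
    # Serve the ASGI wrapper instead of calling app.run() (the old WSGI path).
    uvicorn.run("asgi_app:asgi_app", host="0.0.0.0", port=8084)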
100 changed files with 5,325 additions and 2,264 deletions
  1. +2 -1  .claude/settings.local.json
  2. +6 -2  .gitignore
  3. +257 -154  agent/citu_agent.py
  4. +122 -230  agent/classifier.py
  5. +468 -0  agent/classifier_dict.yaml
  6. +46 -36  agent/config.py
  7. +221 -0  agent/dict_loader.py
  8. +2 -8  agent/state.py
  9. +3 -3  app_config.py
  10. +27 -0  asgi_app.py
  11. +123 -51  citu_app.py
  12. +297 -8  common/redis_conversation_manager.py
  13. +7 -6  common/result.py
  14. +103 -128  common/session_aware_cache.py
  15. +37 -20  config/logging_config.yaml
  16. +110 -0  config/logging_config_backup_20250725_181936.yaml
  17. +104 -0  config/logging_config_windows.yaml
  18. +32 -2  core/logging/__init__.py
  19. +17 -7  core/logging/log_manager.py
  20. +15 -15  customllm/base_llm_chat.py
  21. +5 -3  custompgvector/pgvector.py
  22. +32 -6  data_pipeline/api/simple_db_manager.py
  23. +59 -26  data_pipeline/api/simple_file_manager.py
  24. +121 -4  data_pipeline/api/simple_workflow.py
  25. +551 -0  data_pipeline/api/vector_restore_manager.py
  26. +16 -1  data_pipeline/config.py
  27. +273 -0  data_pipeline/create_task_cli.py
  28. +33 -2  data_pipeline/ddl_generation/ddl_md_generator.py
  29. +66 -2  data_pipeline/ddl_generation/training_data_agent.py
  30. +43 -18  data_pipeline/dp_logging/__init__.py
  31. +0 -156  data_pipeline/dp_logging/manager.py
  32. +26 -1  data_pipeline/qa_generation/qs_agent.py
  33. +45 -7  data_pipeline/qa_generation/qs_generator.py
  34. +272 -19  data_pipeline/schema_workflow.py
  35. +5 -5  data_pipeline/tables.txt
  36. +18 -4  data_pipeline/task_executor.py
  37. +192 -26  data_pipeline/trainer/run_training.py
  38. +358 -0  data_pipeline/trainer/vector_table_manager.py
  39. +2 -2  data_pipeline/training_data/manual_20250720_134836/bss_business_day_data.ddl
  40. +2 -2  data_pipeline/training_data/manual_20250720_134836/bss_business_day_data_detail.md
  41. +2 -2  data_pipeline/training_data/manual_20250720_134836/bss_car_day_count.ddl
  42. +2 -2  data_pipeline/training_data/manual_20250720_134836/bss_car_day_count_detail.md
  43. +3 -3  data_pipeline/training_data/manual_20250720_134836/bss_company.ddl
  44. +3 -3  data_pipeline/training_data/manual_20250720_134836/bss_company_detail.md
  45. +3 -3  data_pipeline/training_data/manual_20250720_134836/bss_section_route.ddl
  46. +1 -1  data_pipeline/training_data/manual_20250720_134836/bss_section_route_area_link.ddl
  47. +1 -1  data_pipeline/training_data/manual_20250720_134836/bss_section_route_area_link_detail.md
  48. +3 -3  data_pipeline/training_data/manual_20250720_134836/bss_section_route_detail.md
  49. +6 -6  data_pipeline/training_data/manual_20250720_134836/bss_service_area.ddl
  50. +6 -6  data_pipeline/training_data/manual_20250720_134836/bss_service_area_detail.md
  51. +2 -2  data_pipeline/training_data/manual_20250720_134836/bss_service_area_mapper.ddl
  52. +3 -3  data_pipeline/training_data/manual_20250720_134836/bss_service_area_mapper_detail.md
  53. +70 -0  data_pipeline/training_data/manual_20250720_134836/db_query_decision_prompt.txt
  54. +0 -0  data_pipeline/training_data/manual_20250720_134836/filename_mapping.txt
  55. +62 -0  data_pipeline/training_data/manual_20250720_134836/metadata.txt
  56. +3 -3  data_pipeline/training_data/manual_20250720_134836/metadata_detail.md
  57. +198 -0  data_pipeline/training_data/manual_20250720_134836/qs_highway_db_20250720_135235_pair.json
  58. +202 -0  data_pipeline/training_data/manual_20250720_134836/qs_highway_db_20250720_135235_pair.json.backup
  59. +8 -8  data_pipeline/training_data/manual_20250722_164749/bss_business_day_data.ddl
  60. +32 -0  data_pipeline/training_data/manual_20250722_164749/bss_business_day_data_detail.md
  61. +5 -5  data_pipeline/training_data/manual_20250722_164749/bss_car_day_count.ddl
  62. +6 -6  data_pipeline/training_data/manual_20250722_164749/bss_car_day_count_detail.md
  63. +15 -0  data_pipeline/training_data/manual_20250722_164749/bss_company.ddl
  64. +16 -0  data_pipeline/training_data/manual_20250722_164749/bss_company_detail.md
  65. +3 -3  data_pipeline/training_data/manual_20250722_164749/bss_section_route.ddl
  66. +7 -0  data_pipeline/training_data/manual_20250722_164749/bss_section_route_area_link.ddl
  67. +7 -0  data_pipeline/training_data/manual_20250722_164749/bss_section_route_area_link_detail.md
  68. +5 -5  data_pipeline/training_data/manual_20250722_164749/bss_section_route_detail.md
  69. +4 -4  data_pipeline/training_data/manual_20250722_164749/bss_service_area.ddl
  70. +4 -4  data_pipeline/training_data/manual_20250722_164749/bss_service_area_detail.md
  71. +3 -3  data_pipeline/training_data/manual_20250722_164749/bss_service_area_mapper.ddl
  72. +5 -4  data_pipeline/training_data/manual_20250722_164749/bss_service_area_mapper_detail.md
  73. +35 -0  data_pipeline/training_data/manual_20250722_164749/db_query_decision_prompt.txt
  74. +0 -0  data_pipeline/training_data/manual_20250722_164749/filename_mapping.txt
  75. +62 -0  data_pipeline/training_data/manual_20250722_164749/metadata.txt
  76. +3 -3  data_pipeline/training_data/manual_20250722_164749/metadata_detail.md
  77. +198 -0  data_pipeline/training_data/manual_20250722_164749/qs_highway_db_20250722_165543_pair.json
  78. +202 -0  data_pipeline/training_data/manual_20250722_164749/qs_highway_db_20250722_165543_pair.json.backup
  79. +5 -0  data_pipeline/training_data/manual_20250722_164749/vector_bak/langchain_pg_collection_20250722_165619.csv
  80. +1 -0  data_pipeline/training_data/manual_20250722_164749/vector_bak/langchain_pg_embedding_20250722_165619.csv
  81. +11 -0  data_pipeline/training_data/manual_20250722_164749/vector_bak/vector_backup_log.txt
  82. +0 -31  data_pipeline/training_data/task_20250701_131627/bss_business_day_data.ddl
  83. +0 -32  data_pipeline/training_data/task_20250701_131627/bss_business_day_data_detail.md
  84. +0 -15  data_pipeline/training_data/task_20250701_131627/bss_company_detail.md
  85. +0 -10  data_pipeline/training_data/task_20250701_131627/db_query_decision_prompt.txt
  86. +0 -62  data_pipeline/training_data/task_20250701_131627/metadata.txt
  87. +0 -190  data_pipeline/training_data/task_20250701_131627/qs_highway_db_20250701_134736_pair.json
  88. +0 -202  data_pipeline/training_data/task_20250701_131627/qs_highway_db_20250701_134736_pair.json.backup
  89. +0 -14  data_pipeline/training_data/task_20250701_131627/task_config.json
  90. +0 -88  data_pipeline/training_data/task_20250701_131627/task_result.json
  91. +0 -17  data_pipeline/training_data/task_20250701_175640/bss_car_day_count.ddl
  92. +0 -18  data_pipeline/training_data/task_20250701_175640/bss_car_day_count_detail.md
  93. +0 -14  data_pipeline/training_data/task_20250701_175640/task_config.json
  94. +0 -14  data_pipeline/training_data/task_20250701_180014/task_config.json
  95. +0 -38  data_pipeline/training_data/task_20250701_184430/db_query_decision_prompt.txt
  96. +0 -5  data_pipeline/training_data/task_20250701_184430/filename_mapping.txt
  97. +0 -62  data_pipeline/training_data/task_20250701_184430/metadata.txt
  98. +0 -198  data_pipeline/training_data/task_20250701_184430/qs_highway_db_20250701_185822_pair.json
  99. +0 -202  data_pipeline/training_data/task_20250701_184430/qs_highway_db_20250701_185822_pair.json.backup
  100. +0 -14  data_pipeline/training_data/task_20250701_184430/task_config.json

+ 2 - 1
.claude/settings.local.json

@@ -23,7 +23,8 @@
       "Bash(\".venv/Scripts/python.exe\" -c \"import sys; sys.path.append('.'); from data_pipeline.schema_workflow import SchemaWorkflowOrchestrator; print('SchemaWorkflowOrchestrator导入成功')\")",
       "Bash(\".venv/Scripts/python.exe\" -c \"import sys; sys.path.append('.'); from data_pipeline.api.simple_workflow import SimpleWorkflowExecutor; print('SimpleWorkflowExecutor导入成功')\")",
       "Bash(\".venv/Scripts/python.exe\":*)",
-      "Bash(curl:*)"
+      "Bash(curl:*)",
+      "Bash(cat:*)"
     ],
     "deny": []
   }

+ 6 - 2
.gitignore

@@ -32,5 +32,9 @@ node_modules/
 # 忽略所有一级UUID目录
 /[0-9a-fA-F]*-[0-9a-fA-F]*-[0-9a-fA-F]*-[0-9a-fA-F]*-[0-9a-fA-F]*/
 
-
-test/
+# 忽略 test 目录中的临时文件和缓存,但跟踪所有代码文件
+test/__pycache__/
+test/.pytest_cache/
+test/.ipynb_checkpoints/
+test/*.pyc
+test/*.pyo

+ 257 - 154
agent/citu_agent.py

@@ -40,158 +40,63 @@ class CituLangGraphAgent:
         self.logger.info("LangGraph Agent with Direct Tools初始化完成")
     
     def _create_workflow(self, routing_mode: str = None) -> StateGraph:
-        """根据路由模式创建不同的工作流"""
-        # 确定使用的路由模式
-        if routing_mode:
-            QUESTION_ROUTING_MODE = routing_mode
-            self.logger.info(f"创建工作流,使用传入的路由模式: {QUESTION_ROUTING_MODE}")
-        else:
-            try:
-                from app_config import QUESTION_ROUTING_MODE
-                self.logger.info(f"创建工作流,使用配置文件路由模式: {QUESTION_ROUTING_MODE}")
-            except ImportError:
-                QUESTION_ROUTING_MODE = "hybrid"
-                self.logger.warning(f"配置导入失败,使用默认路由模式: {QUESTION_ROUTING_MODE}")
+        """创建统一的工作流,所有路由模式都通过classify_question进行分类"""
+        self.logger.info(f"🏗️ [WORKFLOW] 创建统一workflow")
         
         workflow = StateGraph(AgentState)
         
-        # 根据路由模式创建不同的工作流
-        if QUESTION_ROUTING_MODE == "database_direct":
-            # 直接数据库模式:跳过分类,直接进入数据库处理(使用新的拆分节点)
-            workflow.add_node("init_direct_database", self._init_direct_database_node)
-            workflow.add_node("agent_sql_generation", self._agent_sql_generation_node)
-            workflow.add_node("agent_sql_execution", self._agent_sql_execution_node)
-            workflow.add_node("format_response", self._format_response_node)
-            
-            workflow.set_entry_point("init_direct_database")
-            
-            # 添加条件路由
-            workflow.add_edge("init_direct_database", "agent_sql_generation")
-            workflow.add_conditional_edges(
-                "agent_sql_generation",
-                self._route_after_sql_generation,
-                {
-                    "continue_execution": "agent_sql_execution",
-                    "return_to_user": "format_response"
-                }
-            )
-            workflow.add_edge("agent_sql_execution", "format_response")
-            workflow.add_edge("format_response", END)
-            
-        elif QUESTION_ROUTING_MODE == "chat_direct":
-            # 直接聊天模式:跳过分类,直接进入聊天处理
-            workflow.add_node("init_direct_chat", self._init_direct_chat_node)
-            workflow.add_node("agent_chat", self._agent_chat_node)
-            workflow.add_node("format_response", self._format_response_node)
-            
-            workflow.set_entry_point("init_direct_chat")
-            workflow.add_edge("init_direct_chat", "agent_chat")
-            workflow.add_edge("agent_chat", "format_response")
-            workflow.add_edge("format_response", END)
-            
-        else:
-            # 其他模式(hybrid, llm_only):使用新的拆分工作流
-            workflow.add_node("classify_question", self._classify_question_node)
-            workflow.add_node("agent_chat", self._agent_chat_node)
-            workflow.add_node("agent_sql_generation", self._agent_sql_generation_node)
-            workflow.add_node("agent_sql_execution", self._agent_sql_execution_node)
-            workflow.add_node("format_response", self._format_response_node)
-            
-            workflow.set_entry_point("classify_question")
-            
-            # 添加条件边:分类后的路由
-            workflow.add_conditional_edges(
-                "classify_question",
-                self._route_after_classification,
-                {
-                    "DATABASE": "agent_sql_generation",
-                    "CHAT": "agent_chat"
-                }
-            )
-            
-            # 添加条件边:SQL生成后的路由
-            workflow.add_conditional_edges(
-                "agent_sql_generation",
-                self._route_after_sql_generation,
-                {
-                    "continue_execution": "agent_sql_execution",
-                    "return_to_user": "format_response"
-                }
-            )
-            
-            # 普通边
-            workflow.add_edge("agent_chat", "format_response")
-            workflow.add_edge("agent_sql_execution", "format_response")
-            workflow.add_edge("format_response", END)
+        # 统一的工作流结构 - 所有模式都使用相同的节点和路由
+        workflow.add_node("classify_question", self._classify_question_node)
+        workflow.add_node("agent_chat", self._agent_chat_node) 
+        workflow.add_node("agent_sql_generation", self._agent_sql_generation_node)
+        workflow.add_node("agent_sql_execution", self._agent_sql_execution_node)
+        workflow.add_node("format_response", self._format_response_node)
+        
+        # 统一入口点
+        workflow.set_entry_point("classify_question")
+        
+        # 添加条件边:分类后的路由
+        workflow.add_conditional_edges(
+            "classify_question",
+            self._route_after_classification,
+            {
+                "DATABASE": "agent_sql_generation",
+                "CHAT": "agent_chat"
+            }
+        )
+        
+        # 添加条件边:SQL生成后的路由
+        workflow.add_conditional_edges(
+            "agent_sql_generation", 
+            self._route_after_sql_generation,
+            {
+                "continue_execution": "agent_sql_execution",
+                "return_to_user": "format_response"
+            }
+        )
+        
+        # 普通边
+        workflow.add_edge("agent_chat", "format_response")
+        workflow.add_edge("agent_sql_execution", "format_response") 
+        workflow.add_edge("format_response", END)
         
         return workflow.compile()
-    
-    def _init_direct_database_node(self, state: AgentState) -> AgentState:
-        """初始化直接数据库模式的状态"""
-        try:
-            # 从state中获取路由模式,而不是从配置文件读取
-            routing_mode = state.get("routing_mode", "database_direct")
-            
-            # 设置直接数据库模式的分类状态
-            state["question_type"] = "DATABASE"
-            state["classification_confidence"] = 1.0
-            state["classification_reason"] = "配置为直接数据库查询模式"
-            state["classification_method"] = "direct_database"
-            state["routing_mode"] = routing_mode
-            state["current_step"] = "direct_database_init"
-            state["execution_path"].append("init_direct_database")
-            
-            self.logger.info("直接数据库模式初始化完成")
-            
-            return state
-            
-        except Exception as e:
-            self.logger.error(f"直接数据库模式初始化异常: {str(e)}")
-            state["error"] = f"直接数据库模式初始化失败: {str(e)}"
-            state["error_code"] = 500
-            state["execution_path"].append("init_direct_database_error")
-            return state
 
-    def _init_direct_chat_node(self, state: AgentState) -> AgentState:
-        """初始化直接聊天模式的状态"""
-        try:
-            # 从state中获取路由模式,而不是从配置文件读取
-            routing_mode = state.get("routing_mode", "chat_direct")
-            
-            # 设置直接聊天模式的分类状态
-            state["question_type"] = "CHAT"
-            state["classification_confidence"] = 1.0
-            state["classification_reason"] = "配置为直接聊天模式"
-            state["classification_method"] = "direct_chat"
-            state["routing_mode"] = routing_mode
-            state["current_step"] = "direct_chat_init"
-            state["execution_path"].append("init_direct_chat")
-            
-            self.logger.info("直接聊天模式初始化完成")
-            
-            return state
-            
-        except Exception as e:
-            self.logger.error(f"直接聊天模式初始化异常: {str(e)}")
-            state["error"] = f"直接聊天模式初始化失败: {str(e)}"
-            state["error_code"] = 500
-            state["execution_path"].append("init_direct_chat_error")
-            return state
     
     def _classify_question_node(self, state: AgentState) -> AgentState:
-        """问题分类节点 - 支持渐进式分类策略"""
+        """问题分类节点 - 使用混合分类策略(规则+LLM)"""
         try:
             # 从state中获取路由模式,而不是从配置文件读取
             routing_mode = state.get("routing_mode", "hybrid")
             
             self.logger.info(f"开始分类问题: {state['question']}")
             
-            # 获取上下文类型(如果有的话
+            # 获取上下文类型(保留兼容性,但不在分类中使用)
             context_type = state.get("context_type")
             if context_type:
                 self.logger.info(f"检测到上下文类型: {context_type}")
             
-            # 使用渐进式分类策略,传递路由模式
+            # 使用混合分类策略(规则+LLM),传递路由模式
             classification_result = self.classifier.classify(state["question"], context_type, routing_mode)
             
             # 更新状态
@@ -231,7 +136,6 @@ class CituLangGraphAgent:
                 error_message = sql_result.get("error", "")
                 error_type = sql_result.get("error_type", "")
                 
-                #print(f"[SQL_GENERATION] SQL生成失败: {error_message}")
                 self.logger.debug(f"error_type = '{error_type}'")
                 
                 # 根据错误类型生成用户提示
@@ -573,6 +477,15 @@ class CituLangGraphAgent:
     def _agent_chat_node(self, state: AgentState) -> AgentState:
         """聊天Agent节点 - 直接工具调用模式"""
         try:
+            # 🔹 添加State调试日志 - 打印agent_chat接收到的完整State内容
+            import json
+            try:
+                state_debug = dict(state)
+                self.logger.debug(f"agent_chat接收到的State内容: {json.dumps(state_debug, ensure_ascii=False, indent=2)}")
+            except Exception as debug_e:
+                self.logger.debug(f"State序列化失败: {debug_e}")
+                self.logger.debug(f"agent_chat接收到的State内容: {state}")
+            
             self.logger.info(f"开始处理聊天: {state['question']}")
             
             question = state["question"]
@@ -582,9 +495,8 @@ class CituLangGraphAgent:
             enable_context_injection = self.config.get("chat_agent", {}).get("enable_context_injection", True)
             context = None
             if enable_context_injection:
-                # TODO: 在这里可以添加真实的对话历史上下文
-                # 例如从Redis或其他存储中获取最近的对话记录
-                # context = get_conversation_history(state.get("session_id"))
+                # 实际上上下文已经在API层面处理,并合并到question中了
+                # 这里不需要再次获取Redis上下文
                 pass
             
             # 直接调用general_chat工具
@@ -742,6 +654,17 @@ class CituLangGraphAgent:
                 }
             
             self.logger.info("响应格式化完成")
+            
+            # 输出完整的 STATE 内容用于调试
+            import json
+            try:
+                # 创建一个可序列化的 state 副本
+                debug_state = dict(state)
+                self.logger.debug(f"format_response_node 完整 STATE 内容: {json.dumps(debug_state, ensure_ascii=False, indent=2)}")
+            except Exception as debug_e:
+                self.logger.debug(f"STATE 序列化失败,使用简单输出: {debug_e}")
+                self.logger.debug(f"format_response_node STATE 内容: {state}")
+            
             return state
             
         except Exception as e:
@@ -752,6 +675,16 @@ class CituLangGraphAgent:
                 "error_code": 500,
                 "execution_path": state["execution_path"]
             }
+            
+            # 即使在异常情况下也输出 STATE 内容用于调试
+            import json
+            try:
+                debug_state = dict(state)
+                self.logger.debug(f"format_response_node 异常情况下的完整 STATE 内容: {json.dumps(debug_state, ensure_ascii=False, indent=2)}")
+            except Exception as debug_e:
+                self.logger.debug(f"异常情况下 STATE 序列化失败: {debug_e}")
+                self.logger.debug(f"format_response_node 异常情况下的 STATE 内容: {state}")
+            
             return state
     
     def _route_after_sql_generation(self, state: AgentState) -> Literal["continue_execution", "return_to_user"]:
@@ -793,14 +726,14 @@ class CituLangGraphAgent:
             # 聊天Agent可以处理不确定的情况,并在必要时引导用户提供更多信息
             return "CHAT"
     
-    async def process_question(self, question: str, session_id: str = None, context_type: str = None, routing_mode: str = None) -> Dict[str, Any]:
+    async def process_question(self, question: str, conversation_id: str = None, context_type: str = None, routing_mode: str = None) -> Dict[str, Any]:
         """
         统一的问题处理入口
         
         Args:
             question: 用户问题
-            session_id: 会话ID
-            context_type: 上下文类型 ("DATABASE" 或 "CHAT"),用于渐进式分类
+            conversation_id: 对话ID
+            context_type: 上下文类型(保留兼容性参数,当前未使用)
             routing_mode: 路由模式,可选,用于覆盖配置文件设置
             
         Returns:
@@ -814,17 +747,18 @@ class CituLangGraphAgent:
                 self.logger.info(f"使用指定路由模式: {routing_mode}")
             
             # 动态创建workflow(基于路由模式)
+            self.logger.info(f"🔄 [PROCESS] 调用动态创建workflow")
             workflow = self._create_workflow(routing_mode)
             
             # 初始化状态
-            initial_state = self._create_initial_state(question, session_id, context_type, routing_mode)
+            initial_state = self._create_initial_state(question, conversation_id, context_type, routing_mode)
             
             # 执行工作流
             final_state = await workflow.ainvoke(
                 initial_state,
                 config={
-                    "configurable": {"session_id": session_id}
-                } if session_id else None
+                    "configurable": {"conversation_id": conversation_id}
+                } if conversation_id else None
             )
             
             # 提取最终结果
@@ -842,9 +776,83 @@ class CituLangGraphAgent:
                 "error_code": 500,
                 "execution_path": ["error"]
             }
+
+    async def process_question_stream(self, question: str, user_id: str, conversation_id: str = None, context_type: str = None, routing_mode: str = None):
+        """
+        流式处理用户问题 - 复用process_question()的所有逻辑
+        
+        Args:
+            question: 用户问题
+            user_id: 用户ID,用于生成conversation_id
+            conversation_id: 对话ID,可选,不提供则自动生成
+            context_type: 上下文类型(保留兼容性参数,当前未使用)
+            routing_mode: 路由模式,可选,用于覆盖配置文件设置
+            
+        Yields:
+            Dict: 流式状态更新,包含进度信息或最终结果
+        """
+        try:
+            self.logger.info(f"🌊 [STREAM] 开始流式处理问题: {question}")
+            if context_type:
+                self.logger.info(f"🌊 [STREAM] 上下文类型: {context_type}")
+            if routing_mode:
+                self.logger.info(f"🌊 [STREAM] 使用指定路由模式: {routing_mode}")
+            
+            # 生成conversation_id(如果未提供)
+            if not conversation_id:
+                conversation_id = self._generate_conversation_id(user_id)
+            
+            # 1. 复用现有的初始化逻辑
+            self.logger.info(f"🌊 [STREAM] 动态创建workflow")
+            workflow = self._create_workflow(routing_mode)
+            
+            # 2. 创建初始状态(复用现有逻辑)
+            initial_state = self._create_initial_state(question, conversation_id, context_type, routing_mode)
+            
+            # 3. 使用astream流式执行
+            self.logger.info(f"🌊 [STREAM] 开始流式执行workflow")
+            async for chunk in workflow.astream(
+                initial_state,
+                config={
+                    "configurable": {"conversation_id": conversation_id}
+                } if conversation_id else None
+            ):
+                # 处理每个节点的输出
+                for node_name, node_data in chunk.items():
+                    self.logger.debug(f"🌊 [STREAM] 收到节点输出: {node_name}")
+                    
+                    # 映射节点状态为用户友好的进度信息
+                    progress_info = self._map_node_to_progress(node_name, node_data)
+                    if progress_info:
+                        yield {
+                            "type": "progress",
+                            "node": node_name,
+                            "progress": progress_info,
+                            "state_data": self._extract_relevant_state(node_data),
+                            "conversation_id": conversation_id
+                        }
+            
+            # 4. 最终结果处理(复用现有的结果提取逻辑)
+            # 注意:由于astream的特性,最后一个chunk包含最终状态
+            final_result = node_data.get("final_response", {})
+            
+            self.logger.info(f"🌊 [STREAM] 流式处理完成: {final_result.get('success', False)}")
+            yield {
+                "type": "completed",
+                "result": final_result,
+                "conversation_id": conversation_id
+            }
+            
+        except Exception as e:
+            self.logger.error(f"🌊 [STREAM] Agent流式执行异常: {str(e)}")
+            yield {
+                "type": "error", 
+                "error": str(e),
+                "conversation_id": conversation_id
+            }
     
-    def _create_initial_state(self, question: str, session_id: str = None, context_type: str = None, routing_mode: str = None) -> AgentState:
-        """创建初始状态 - 支持渐进式分类"""
+    def _create_initial_state(self, question: str, conversation_id: str = None, context_type: str = None, routing_mode: str = None) -> AgentState:
+        """创建初始状态 - 支持兼容性参数"""
         # 确定使用的路由模式
         if routing_mode:
             effective_routing_mode = routing_mode
@@ -858,7 +866,7 @@ class CituLangGraphAgent:
         return AgentState(
             # 输入信息
             question=question,
-            session_id=session_id,
+            conversation_id=conversation_id,
             
             # 上下文信息
             context_type=context_type,
@@ -871,7 +879,6 @@ class CituLangGraphAgent:
             
             # 数据库查询流程状态
             sql=None,
-            sql_generation_attempts=0,
             query_result=None,
             summary=None,
             
@@ -896,11 +903,6 @@ class CituLangGraphAgent:
             # 流程控制
             current_step="initialized",
             execution_path=["start"],
-            retry_count=0,
-            max_retries=3,
-            
-            # 调试信息
-            debug_info={},
             
             # 路由模式
             routing_mode=effective_routing_mode
@@ -1106,6 +1108,107 @@ class CituLangGraphAgent:
                 "error": f"修复异常: {str(e)}"
             }
 
+    def _generate_conversation_id(self, user_id: str) -> str:
+        """生成对话ID - 使用与React Agent一致的格式"""
+        import pandas as pd
+        timestamp = pd.Timestamp.now().strftime('%Y%m%d%H%M%S%f')[:-3]  # 去掉最后3位微秒
+        return f"{user_id}:{timestamp}"
+
+    def _map_node_to_progress(self, node_name: str, node_data: dict) -> dict:
+        """将节点执行状态映射为用户友好的进度信息"""
+        
+        if node_name == "classify_question":
+            question_type = node_data.get("question_type", "UNCERTAIN")
+            confidence = node_data.get("classification_confidence", 0)
+            return {
+                "display_name": "分析问题类型",
+                "icon": "🤔",
+                "details": f"问题类型: {question_type} (置信度: {confidence:.2f})",
+                "sub_status": f"使用{node_data.get('classification_method', '未知')}方法分类"
+            }
+        
+        elif node_name == "agent_sql_generation":
+            if node_data.get("sql_generation_success"):
+                sql = node_data.get("sql", "")
+                sql_preview = sql[:50] + "..." if len(sql) > 50 else sql
+                return {
+                    "display_name": "SQL生成成功",
+                    "icon": "✅",
+                    "details": f"生成SQL: {sql_preview}",
+                    "sub_status": "验证通过,准备执行"
+                }
+            else:
+                error_type = node_data.get("validation_error_type", "unknown")
+                return {
+                    "display_name": "SQL生成处理中",
+                    "icon": "🔧",
+                    "details": f"验证状态: {error_type}",
+                    "sub_status": node_data.get("user_prompt", "正在处理")
+                }
+        
+        elif node_name == "agent_sql_execution":
+            query_result = node_data.get("query_result", {})
+            row_count = query_result.get("row_count", 0)
+            return {
+                "display_name": "执行数据查询", 
+                "icon": "⚙️",
+                "details": f"查询完成,返回 {row_count} 行数据",
+                "sub_status": "正在生成摘要" if row_count > 0 else "查询执行完成"
+            }
+        
+        elif node_name == "agent_chat":
+            return {
+                "display_name": "思考回答",
+                "icon": "💭", 
+                "details": "正在处理您的问题",
+                "sub_status": "使用智能对话模式"
+            }
+        
+        elif node_name == "format_response":
+            return {
+                "display_name": "整理结果",
+                "icon": "📝",
+                "details": "正在格式化响应结果",
+                "sub_status": "即将完成"
+            }
+        
+        return None
+
+    def _extract_relevant_state(self, node_data: dict) -> dict:
+        """从节点数据中提取相关的状态信息,过滤敏感信息"""
+        try:
+            relevant_keys = [
+                "current_step", "execution_path", "question_type",
+                "classification_confidence", "classification_method", 
+                "sql_generation_success", "sql_validation_success",
+                "routing_mode"
+            ]
+            
+            extracted = {}
+            for key in relevant_keys:
+                if key in node_data:
+                    extracted[key] = node_data[key]
+            
+            # 特殊处理SQL:只返回前100个字符避免过长
+            if "sql" in node_data and node_data["sql"]:
+                sql = str(node_data["sql"])
+                extracted["sql_preview"] = sql[:100] + "..." if len(sql) > 100 else sql
+            
+            # 特殊处理查询结果:只返回行数统计
+            if "query_result" in node_data and node_data["query_result"]:
+                query_result = node_data["query_result"]
+                if isinstance(query_result, dict):
+                    extracted["query_summary"] = {
+                        "row_count": query_result.get("row_count", 0),
+                        "column_count": len(query_result.get("columns", []))
+                    }
+            
+            return extracted
+            
+        except Exception as e:
+            self.logger.warning(f"提取状态信息失败: {str(e)}")
+            return {"error": "state_extraction_failed"}
+
     # ==================== 原有方法 ====================
     
     def _extract_original_question(self, question: str) -> str:
@@ -1144,7 +1247,7 @@ class CituLangGraphAgent:
             
             if enable_full_test:
                 # 完整流程测试
-                test_result = await self.process_question(test_question, "health_check")
+                test_result = await self.process_question(test_question, conversation_id="health_check")
                 
                 return {
                     "status": "healthy" if test_result.get("success") else "degraded",

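For reference, a hedged sketch of how a caller might consume the progress/completed/error events yielded by the new process_question_stream() shown in the diff above; the agent constructor call and the sample question are assumptions, not taken from the repository:

# Consumer sketch (assumed usage, not repository code).
import asyncio
from agent.citu_agent import CituLangGraphAgent  # class defined in the diff above

async def demo():
    agent = CituLangGraphAgent()  # assumed: constructor takes no required arguments
    async for event in agent.process_question_stream(
        question="Total revenue per service area this month",  # sample question (assumption)
        user_id="wang11",
    ):
        if event["type"] == "progress":
            p = event["progress"]
            print(f'{p["icon"]} {p["display_name"]} - {p["details"]}')
        elif event["type"] == "completed":
            print("final result:", event["result"])
        elif event["type"] == "error":
            print("stream error:", event["error"])

asyncio.run(demo())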
+ 122 - 230
agent/classifier.py

@@ -20,146 +20,73 @@ class QuestionClassifier:
         # 初始化日志
         self.logger = get_agent_logger("Classifier")
         
-        # 从配置文件加载阈值参数
+        # 初始化默认参数(作为后备)
+        self.high_confidence_threshold = 0.7
+        self.max_confidence = 0.9
+        self.llm_fallback_confidence = 0.5
+        self.uncertain_confidence = 0.2
+        
+        # 加载词典配置(新增逻辑)
+        self._load_dict_config()
+
+    def _load_dict_config(self):
+        """加载分类器词典配置"""
         try:
-            from agent.config import get_current_config, get_nested_config
-            config = get_current_config()
-            self.high_confidence_threshold = get_nested_config(config, "classification.high_confidence_threshold", 0.7)
-            self.low_confidence_threshold = get_nested_config(config, "classification.low_confidence_threshold", 0.4)
-            self.max_confidence = get_nested_config(config, "classification.max_confidence", 0.9)
-            self.base_confidence = get_nested_config(config, "classification.base_confidence", 0.4)
-            self.confidence_increment = get_nested_config(config, "classification.confidence_increment", 0.08)
-            self.llm_fallback_confidence = get_nested_config(config, "classification.llm_fallback_confidence", 0.5)
-            self.uncertain_confidence = get_nested_config(config, "classification.uncertain_confidence", 0.2)
-            self.medium_confidence_threshold = get_nested_config(config, "classification.medium_confidence_threshold", 0.6)
-            self.logger.info("从配置文件加载分类器参数完成")
-        except ImportError:
-            self.high_confidence_threshold = 0.7
-            self.low_confidence_threshold = 0.4
-            self.max_confidence = 0.9
-            self.base_confidence = 0.4
-            self.confidence_increment = 0.08
-            self.llm_fallback_confidence = 0.5
-            self.uncertain_confidence = 0.2
-            self.medium_confidence_threshold = 0.6
-            self.logger.warning("配置文件不可用,使用默认分类器参数")
-        
-        # 基于高速公路服务区业务的精准关键词
-        self.strong_business_keywords = {
-            "核心业务实体": [
-                "服务区", "档口", "商铺", "收费站", "高速公路",
-                "驿美", "驿购",  # 业务系统名称
-                "北区", "南区", "西区", "东区", "两区",  # 物理分区
-                "停车区", "公司", "管理公司", "运营公司", "驿美运营公司"  # 公司相关
-            ],
-            "支付业务": [
-                "微信支付", "支付宝支付", "现金支付", "行吧支付", "金豆支付",
-                "支付金额", "订单数量", "营业额", "收入", "营业收入",
-                "微信", "支付宝", "现金", "行吧", "金豆",  # 简化形式
-                "wx", "zfb", "rmb", "xs", "jd"  # 系统字段名
-            ],
-            "经营品类": [
-                "餐饮", "小吃", "便利店", "整体租赁",
-                "驿美餐饮", "品牌", "经营品类", "商业品类"
-            ],
-            "车流业务": [
-                "车流量", "车辆数量", "客车", "货车", 
-                "过境", "危化品", "城际", "车辆统计",
-                "流量统计", "车型分布"
-            ],
-            "地理路线": [
-                "大广", "昌金", "昌栗", "线路", "路段", "路线",
-                "高速线路", "公路线路"
-            ],
-            "系统查询指示词": [
-                "当前系统", "当前数据库", "当前数据", "数据库"
-                "本系统", "系统", "数据库中", "数据中",
-                "现有数据", "已有数据", "存储的数据",
-                "平台数据", "我们的数据库", "这个系统"
-            ]
-        }
-        
-        # 查询意图词(辅助判断)
-        self.query_intent_keywords = [
-            "统计", "查询", "分析", "排行", "排名",
-            "报表", "报告", "汇总", "计算", "对比",
-            "趋势", "占比", "百分比", "比例",
-            "最大", "最小", "最高", "最低", "平均",
-            "总计", "合计", "累计", "求和", "求平均",
-            "生成", "导出", "显示", "列出", "共有"
-        ]
-        
-        # 非业务实体词(包含则倾向CHAT)
-        self.non_business_keywords = [
-            # 农产品/食物
-            "荔枝", "苹果", "西瓜", "水果", "蔬菜", "大米", "小麦",
-            "橙子", "香蕉", "葡萄", "草莓", "樱桃", "桃子", "梨",
+            from agent.config import get_classifier_dict_config
+            dict_config = get_classifier_dict_config()
+            
+            # 加载关键词列表
+            self.strong_business_keywords = dict_config.strong_business_keywords
+            self.query_intent_keywords = dict_config.query_intent_keywords
+            self.non_business_keywords = dict_config.non_business_keywords
+            self.sql_patterns = dict_config.sql_patterns
+            self.chat_keywords = dict_config.chat_keywords
+            
+            # 加载权重配置
+            self.weights = dict_config.weights
             
-            # 技术概念  
-            "人工智能", "机器学习", "编程", "算法", "深度学习",
-            "AI", "神经网络", "模型训练", "数据挖掘",
+            # 从YAML权重配置中加载分类器参数(优先使用YAML配置)
+            self.high_confidence_threshold = self.weights.get('high_confidence_threshold', self.high_confidence_threshold)
+            self.max_confidence = self.weights.get('max_confidence', self.max_confidence)
+            self.llm_fallback_confidence = self.weights.get('llm_fallback_confidence', self.llm_fallback_confidence)
+            self.uncertain_confidence = self.weights.get('uncertain_confidence', self.uncertain_confidence)
             
-            # 身份询问
-            "你是谁", "你是什么", "你叫什么", "你的名字", "你是什么AI",
-            "什么模型", "大模型", "AI助手", "助手", "机器人",
+            # 加载其他配置
+            self.metadata = dict_config.metadata
             
-            # 天气相关
-            "天气", "气温", "下雨", "晴天", "阴天", "温度",
-            "天气预报", "气候", "降雨", "雪天",
+            total_keywords = (
+                sum(len(keywords) for keywords in self.strong_business_keywords.values()) +
+                len(self.query_intent_keywords) +
+                len(self.non_business_keywords) +
+                len(self.sql_patterns) +
+                len(self.chat_keywords)
+            )
+            
+            self.logger.info(f"从YAML配置文件加载词典完成,共加载 {total_keywords} 个关键词")
+            self.logger.info(f"从YAML配置文件加载分类器参数完成:高置信度阈值={self.high_confidence_threshold}")
+            self.logger.debug(f"所有分类器参数:high_threshold={self.high_confidence_threshold}, max_conf={self.max_confidence}, llm_fallback={self.llm_fallback_confidence}")
             
-            # 其他生活常识
-            "怎么做饭", "如何减肥", "健康", "医疗", "病症",
-            "历史", "地理", "文学", "电影", "音乐", "体育",
-            "娱乐", "游戏", "小说", "新闻", "政治", "战争",
-            "足球", "NBA", "篮球", "乒乓球", "冠军", "夺冠",
-            "高考",
+        except Exception as e:
+            self.logger.warning(f"加载YAML词典配置失败: {str(e)},使用代码中的备用配置")
+            self._load_default_dict()
 
-            # 旅游出行
-            "旅游","景点","门票","酒店","机票","航班","高铁","的士",
-            #情绪
-            "伤心","开心","无聊","生气","孤独","累了","烦恼","心情","难过","抑郁",
-            #商业
-            "股票","基金","理财","投资","经济","通货膨胀","上市",
-            #哲学
-            "人生意义","价值观","道德","信仰","宗教","爱情",
-            #地理
-            "全球","全国","亚洲","发展中","欧洲","美洲","东亚","东南亚","南美","非洲","大洋"
-        ]
+    def _load_default_dict(self):
+        """YAML配置加载失败时的处理"""
+        error_msg = "YAML词典配置文件加载失败,无法初始化分类器"
+        self.logger.error(error_msg)
         
-        # SQL关键词(技术层面的数据库操作)
-        # business_score +3
-        self.sql_patterns = [
-            r"\b(select|from|where|group by|order by|having|join|update)\b",
-            r"\b(数据库|表名|表|字段名|SQL|sql|database|table)\b"
-        ]
+        # 初始化空的weights字典,使用代码中的默认值
+        self.weights = {}
         
-        # 聊天关键词(平台功能和帮助)
-        self.chat_keywords = [
-            "你好啊", "谢谢", "再见", "怎么样", "如何", "为什么", "什么是",
-            "介绍", "解释", "说明", "帮助", "操作", "使用方法", "功能",
-            "教程", "指南", "手册","讲解"
-        ]
-        
-        # 追问关键词(用于检测追问型问题)
-        self.follow_up_keywords = [
-            "还有", "详细", "具体", "更多", "继续", "再", "也",
-            "那么", "另外", "其他", "以及", "还", "进一步",
-            "深入", "补充", "额外", "此外", "同时", "并且"
-        ]
-        
-        # 话题切换关键词(明显的话题转换)
-        self.topic_switch_keywords = [
-            "你好", "你是", "介绍", "功能", "帮助", "使用方法",
-            "平台", "系统", "AI", "助手", "谢谢", "再见"
-        ]
+        raise RuntimeError(error_msg)
 
     def classify(self, question: str, context_type: Optional[str] = None, routing_mode: Optional[str] = None) -> ClassificationResult:
         """
-        主分类方法:支持渐进式分类策略
+        主分类方法:简化为混合分类策略
         
         Args:
             question: 当前问题
-            context_type: 上下文类型 ("DATABASE" 或 "CHAT"),可选
+            context_type: 上下文类型(保留参数兼容性,但不使用)
             routing_mode: 路由模式,可选,用于覆盖配置文件设置
         """
         # 确定使用的路由模式
@@ -192,93 +119,8 @@ class QuestionClassifier:
         elif QUESTION_ROUTING_MODE == "llm_only":
             return self._enhanced_llm_classify(question)
         else:
-            # hybrid模式:使用渐进式分类策略
-            return self._progressive_classify(question, context_type)
-
-    def _progressive_classify(self, question: str, context_type: Optional[str] = None) -> ClassificationResult:
-        """
-        渐进式分类策略:
-        1. 首先只基于问题本身分类
-        2. 如果置信度不够且有上下文,考虑上下文辅助
-        3. 检测话题切换,避免错误继承
-        """
-        self.logger.info(f"渐进式分类 - 问题: {question}")
-        if context_type:
-            self.logger.info(f"上下文类型: {context_type}")
-        
-        # 第一步:只基于问题本身分类
-        primary_result = self._hybrid_classify(question)
-        self.logger.info(f"主分类结果: {primary_result.question_type}, 置信度: {primary_result.confidence}")
-        
-        # 如果没有上下文,直接返回主分类结果
-        if not context_type:
-            self.logger.debug("无上下文,使用主分类结果")
-            return primary_result
-        
-        # 如果置信度足够高,直接使用主分类结果
-        if primary_result.confidence >= self.high_confidence_threshold:
-            self.logger.info(f"高置信度({primary_result.confidence}≥{self.high_confidence_threshold}),使用主分类结果")
-            return primary_result
-        
-        # 检测明显的话题切换
-        if self._is_topic_switch(question):
-            self.logger.info("检测到话题切换,忽略上下文")
-            return primary_result
-        
-        # 如果置信度较低,考虑上下文辅助
-        if primary_result.confidence < self.medium_confidence_threshold:
-            self.logger.info(f"低置信度({primary_result.confidence}<{self.medium_confidence_threshold}),考虑上下文辅助")
-            
-            # 检测是否为追问型问题
-            if self._is_follow_up_question(question):
-                self.logger.info(f"检测到追问型问题,继承上下文类型: {context_type}")
-                return ClassificationResult(
-                    question_type=context_type,
-                    confidence=0.75,  # 给予中等置信度
-                    reason=f"追问型问题,继承上下文类型。原分类: {primary_result.reason}",
-                    method="progressive_context_inherit"
-                )
-        
-        # 中等置信度或其他情况,保持主分类结果
-        self.logger.debug("保持主分类结果")
-        return primary_result
-
-    def _is_follow_up_question(self, question: str) -> bool:
-        """检测是否为追问型问题"""
-        question_lower = question.lower()
-        
-        # 检查追问关键词
-        for keyword in self.follow_up_keywords:
-            if keyword in question_lower:
-                return True
-        
-        # 检查问号开头的短问题(通常是追问)
-        if question.strip().startswith(('还', '再', '那', '这', '有')) and len(question.strip()) < 15:
-            return True
-        
-        return False
-
-    def _is_topic_switch(self, question: str) -> bool:
-        """检测是否为明显的话题切换"""
-        question_lower = question.lower()
-        
-        # 检查话题切换关键词
-        for keyword in self.topic_switch_keywords:
-            if keyword in question_lower:
-                return True
-        
-        # 检查问候语模式
-        greeting_patterns = [
-            r"^(你好|您好|hi|hello)",
-            r"(你是|您是).*(什么|谁|哪)",
-            r"(介绍|说明).*(功能|平台|系统)"
-        ]
-        
-        for pattern in greeting_patterns:
-            if re.search(pattern, question_lower):
-                return True
-        
-        return False
+            # hybrid模式:直接使用混合分类策略(规则+LLM)
+            return self._hybrid_classify(question)
 
     def _hybrid_classify(self, question: str) -> ClassificationResult:
         """
@@ -292,7 +134,7 @@ class QuestionClassifier:
         if rule_result.confidence >= self.high_confidence_threshold:
             return rule_result
         
-        # 第二步:使用增强的LLM分类
+        # 否则:使用增强的LLM分类
         llm_result = self._enhanced_llm_classify(question)
         
         # 选择置信度更高的结果
@@ -301,9 +143,49 @@ class QuestionClassifier:
         else:
             return rule_result
     
+    def _extract_current_question_for_rule_classification(self, question: str) -> str:
+        """
+        从enhanced_question中提取[CURRENT]部分用于规则分类
+        如果没有[CURRENT]标签,返回原问题
+        
+        Args:
+            question: 可能包含上下文的完整问题
+            
+        Returns:
+            str: 用于规则分类的当前问题
+        """
+        try:
+            # 处理None或非字符串输入
+            if question is None:
+                self.logger.warning("输入问题为None,返回空字符串")
+                return ""
+            
+            if not isinstance(question, str):
+                self.logger.warning(f"输入问题类型错误: {type(question)},转换为字符串")
+                question = str(question)
+            
+            # 检查是否为enhanced_question格式
+            if "\n[CURRENT]\n" in question:
+                current_start = question.find("\n[CURRENT]\n")
+                if current_start != -1:
+                    current_question = question[current_start + len("\n[CURRENT]\n"):].strip()
+                    self.logger.info(f"规则分类从[CURRENT]标签提取到问题: {current_question}")
+                    return current_question
+            
+            # 如果不是enhanced_question格式,直接返回原问题
+            stripped_question = question.strip()
+            self.logger.info(f"规则分类未找到[CURRENT]标签,使用完整问题: {stripped_question}")
+            return stripped_question
+            
+        except Exception as e:
+            self.logger.warning(f"提取当前问题失败: {str(e)},返回空字符串")
+            return ""
+
     def _rule_based_classify(self, question: str) -> ClassificationResult:
         """基于规则的预分类"""
-        question_lower = question.lower()
+        # 提取当前问题用于规则判断,避免上下文干扰
+        current_question = self._extract_current_question_for_rule_classification(question)
+        question_lower = current_question.lower()
         
         # 检查非业务实体词
         non_business_matched = []
@@ -315,7 +197,7 @@ class QuestionClassifier:
         if non_business_matched:
             return ClassificationResult(
                 question_type="CHAT",
-                confidence=0.85,
+                confidence=self.weights.get('non_business_confidence', 0.85),  # 使用YAML配置的置信度
                 reason=f"包含非业务实体词: {non_business_matched}",
                 method="rule_based_non_business"
             )
@@ -329,7 +211,7 @@ class QuestionClassifier:
                 continue
             for keyword in keywords:
                 if keyword in question_lower:
-                    business_score += 2  # 业务实体词权重更高
+                    business_score += self.weights.get('business_entity', 2)  # 使用YAML配置的权重
                     business_matched.append(f"{category}:{keyword}")
         
         # 检查系统查询指示词
@@ -337,7 +219,7 @@ class QuestionClassifier:
         system_matched = []
         for keyword in self.strong_business_keywords.get("系统查询指示词", []):
             if keyword in question_lower:
-                system_indicator_score += 1
+                system_indicator_score += self.weights.get('system_indicator', 1)  # 使用YAML配置的权重
                 system_matched.append(f"系统查询指示词:{keyword}")
         
         # 检查查询意图词
@@ -345,14 +227,14 @@ class QuestionClassifier:
         intent_matched = []
         for keyword in self.query_intent_keywords:
             if keyword in question_lower:
-                intent_score += 1
+                intent_score += self.weights.get('query_intent', 1)  # 使用YAML配置的权重
                 intent_matched.append(keyword)
         
         # 检查SQL模式
         sql_patterns_matched = []
         for pattern in self.sql_patterns:
             if re.search(pattern, question_lower, re.IGNORECASE):
-                business_score += 3  # SQL模式权重最高
+                business_score += self.weights.get('sql_pattern', 3)  # 使用YAML配置的权重
                 sql_patterns_matched.append(pattern)
         
         # 检查聊天关键词
@@ -360,25 +242,29 @@ class QuestionClassifier:
         chat_matched = []
         for keyword in self.chat_keywords:
             if keyword in question_lower:
-                chat_score += 1
+                chat_score += self.weights.get('chat_keyword', 1)  # 使用YAML配置的权重
                 chat_matched.append(keyword)
         
         # 系统指示词组合评分逻辑
         if system_indicator_score > 0 and business_score > 0:
             # 系统指示词 + 业务实体 = 强组合效应
-            business_score += 3  # 组合加分
+            business_score += self.weights.get('combination_bonus', 3)  # 使用YAML配置的组合加分权重
             business_matched.extend(system_matched)
         elif system_indicator_score > 0:
             # 仅有系统指示词 = 中等业务倾向
-            business_score += 1
+            business_score += self.weights.get('system_indicator', 1)  # 使用YAML配置的权重
             business_matched.extend(system_matched)
         
         # 分类决策逻辑
         total_business_score = business_score + intent_score
         
         # 强业务特征:包含业务实体 + 查询意图
-        if business_score >= 2 and intent_score >= 1:
-            confidence = min(self.max_confidence, 0.8 + (total_business_score * 0.05))
+        min_business_score = self.weights.get('strong_business_min_score', 2)
+        min_intent_score = self.weights.get('strong_business_min_intent', 1)
+        if business_score >= min_business_score and intent_score >= min_intent_score:
+            base_conf = self.weights.get('strong_business_base', 0.8)
+            increment = self.weights.get('strong_business_increment', 0.05)
+            confidence = min(self.max_confidence, base_conf + (total_business_score * increment))
             return ClassificationResult(
                 question_type="DATABASE",
                 confidence=confidence,
@@ -387,8 +273,10 @@ class QuestionClassifier:
             )
         
         # 中等业务特征:包含多个业务实体词
-        elif business_score >= 4:
-            confidence = min(self.max_confidence, 0.7 + (business_score * 0.03))
+        elif business_score >= self.weights.get('medium_business_min_score', 4):
+            base_conf = self.weights.get('medium_business_base', 0.7)
+            increment = self.weights.get('medium_business_increment', 0.03)
+            confidence = min(self.max_confidence, base_conf + (business_score * increment))
             return ClassificationResult(
                 question_type="DATABASE", 
                 confidence=confidence,
@@ -397,8 +285,10 @@ class QuestionClassifier:
             )
         
         # 聊天特征
-        elif chat_score >= 1 and business_score == 0:
-            confidence = min(self.max_confidence, self.base_confidence + (chat_score * self.confidence_increment))
+        elif chat_score >= self.weights.get('chat_min_score', 1) and business_score == 0:
+            base_conf = self.weights.get('chat_base_confidence', 0.4)
+            increment = self.weights.get('chat_confidence_increment', 0.08)
+            confidence = min(self.max_confidence, base_conf + (chat_score * increment))
             return ClassificationResult(
                 question_type="CHAT",
                 confidence=confidence,
@@ -506,6 +396,8 @@ class QuestionClassifier:
                 question=classification_prompt,
                 system_prompt=system_prompt
             )
+
+            self.logger.debug(f"LLM原始分类响应信息: {response}")
             
             # 解析响应
             return self._parse_llm_response(response)
@@ -515,7 +407,7 @@ class QuestionClassifier:
             self.logger.error(f"LLM分类失败,业务上下文不可用: {str(e)}")
             return ClassificationResult(
                 question_type="CHAT",  # 失败时默认为CHAT,更安全
-                confidence=0.1,  # 很低的置信度表示分类不可靠
+                confidence=self.weights.get('llm_error_confidence', 0.1),  # 使用YAML配置的低置信度
                 reason=f"业务上下文加载失败,无法进行准确分类: {str(e)}",
                 method="llm_context_error"
             )

+ 468 - 0
agent/classifier_dict.yaml

@@ -0,0 +1,468 @@
+# agent/classifier_dict.yaml
+# 问题分类器词典配置文件
+# 版本: v1.0
+# 最后更新: 2024-12-21
+
+# ===========================================
+# 配置元信息
+# ===========================================
+metadata:
+  version: "1.0"
+  description: "Citu智能数据问答平台问题分类器关键词配置"
+  last_updated: "2024-12-21"
+  author: "系统管理员"
+
+# ===========================================
+# 权重配置
+# ===========================================
+weights:
+  # ===========================================
+  # 关键词权重配置
+  # ===========================================
+  
+  # 业务实体词权重(强业务关键词中除系统指示词外的部分)
+  business_entity: 2
+  
+  # 系统指示词权重(强业务关键词中的系统查询指示词)
+  system_indicator: 1
+  
+  # 查询意图词权重
+  query_intent: 1
+  
+  # SQL模式权重(最高权重)
+  sql_pattern: 3
+  
+  # 聊天关键词权重
+  chat_keyword: 1
+  
+  # 组合加分权重(系统指示词+业务实体词)
+  combination_bonus: 3
+
+  # ===========================================
+  # 置信度计算配置
+  # ===========================================
+  
+  # 非业务词固定置信度(匹配非业务关键词时直接返回此置信度)
+  non_business_confidence: 0.85
+  
+  # 强业务特征置信度配置(业务实体≥2分 且 查询意图≥1分)
+  strong_business_base: 0.8        # 强业务特征基础置信度
+  strong_business_increment: 0.05  # 每增加1分总分的置信度增量
+  
+  # 中等业务特征置信度配置(业务实体≥4分)
+  medium_business_base: 0.7        # 中等业务特征基础置信度
+  medium_business_increment: 0.03  # 每增加1分业务分的置信度增量
+  
+  # 聊天特征置信度配置(聊天分≥1 且 业务分=0)
+  chat_base_confidence: 0.4        # 聊天特征基础置信度(对应base_confidence)
+  chat_confidence_increment: 0.08  # 每增加1分聊天分的置信度增量
+  
+  # 分类阈值配置
+  strong_business_min_score: 2     # 强业务特征最低业务分要求
+  strong_business_min_intent: 1    # 强业务特征最低意图分要求
+  medium_business_min_score: 4     # 中等业务特征最低业务分要求
+  chat_min_score: 1               # 聊天特征最低聊天分要求
+  
+  # ===========================================
+  # 从config.py迁移的分类器配置
+  # ===========================================
+  
+  # 高置信度阈值:当规则分类的置信度 >= 此值时,直接使用规则分类结果,不再调用LLM
+  # 建议范围:0.7-0.9,过高可能错过需要LLM辅助的边界情况,过低会增加LLM调用成本
+  high_confidence_threshold: 0.7
+  
+  # 低置信度阈值:当规则分类的置信度 <= 此值时,启用LLM二次分类进行辅助判断
+  # 建议范围:0.2-0.5,过高会频繁调用LLM,过低可能错过需要LLM辅助的情况
+  # low_confidence_threshold: 0.4  # 未使用 - 已注释
+  
+  # 最大置信度上限:规则分类计算出的置信度不会超过此值,防止过度自信
+  # 建议范围:0.8-1.0,通常设为0.9以保留不确定性空间
+  max_confidence: 0.9
+  
+  # 基础置信度:规则分类的起始置信度,会根据匹配的关键词数量递增
+  # 建议范围:0.3-0.6,这是匹配到1个关键词时的基础置信度
+  # base_confidence: 0.4  # 未使用,实际使用chat_base_confidence - 已注释
+  
+  # 置信度增量步长:每匹配一个额外关键词,置信度增加的数值
+  # 建议范围:0.05-0.2,过大会导致置信度增长过快,过小则区分度不够
+  # confidence_increment: 0.08  # 未使用,实际使用chat_confidence_increment - 已注释
+  
+  # LLM分类失败时的默认置信度:当LLM调用异常或解析失败时使用
+  # 建议范围:0.3-0.6,通常设为中等水平,避免过高或过低的错误影响
+  llm_fallback_confidence: 0.5
+  
+  # 不确定分类的默认置信度:当规则分类无法明确判断时使用
+  # 建议范围:0.1-0.3,应设为较低值,表示确实不确定
+  uncertain_confidence: 0.2
+  
+  # LLM业务上下文加载失败时的置信度:用于混合分类模式的置信度比较
+  # 建议范围:0.05-0.2,设为极低值表示上下文加载失败的严重性
+  llm_error_confidence: 0.1
+  
+  # 中等置信度阈值:用于三级置信度判断的中间阈值
+  # 建议范围:0.5-0.7,位于low_confidence_threshold和high_confidence_threshold之间
+  # medium_confidence_threshold: 0.6  # 未使用 - 已注释
+
+# ===========================================
+# 强业务关键词(字典结构,保持原有层次)
+# ===========================================
+strong_business_keywords:
+  核心业务实体:
+    description: "高速公路服务区基础设施和业务系统"
+    keywords:
+      - 服务区
+      - 档口
+      - 商铺
+      - 收费站
+      - 高速公路
+      - 驿美          
+      - 驿购          
+      - 北区          # 物理分区
+      - 南区
+      - 西区
+      - 东区
+      - 两区
+      - 停车区
+      - 公司
+      - 管理公司
+      - 运营公司
+    
+  支付业务:
+    description: "支付方式、金额、订单等支付相关业务"
+    keywords:
+      # 支付方式全称
+      - 微信支付
+      - 支付宝支付
+      - 现金支付
+      - 行吧支付
+      - 金豆支付
+      
+      # 业务指标
+      - 支付金额
+      - 订单数量
+      - 营业额
+      - 收入
+      - 营业收入
+      
+      # 简化形式
+      - 微信
+      - 支付宝
+      - 现金
+      - 行吧
+      - 金豆
+      
+      # 系统字段名
+      - wx
+      - zfb
+      - rmb
+      - xs
+      - jd
+    
+  经营品类:
+    description: "经营类型、品牌、商业品类"
+    keywords:
+      - 餐饮
+      - 小吃
+      - 便利店
+      - 整体租赁
+      - 驿美餐饮
+      - 品牌
+      - 经营品类
+      - 商业品类
+    
+  车流业务:
+    description: "车辆流量、车型统计等车流相关业务"
+    keywords:
+      # 流量概念
+      - 车流量
+      - 车辆数量
+      - 客车
+      - 货车
+      - 过境
+      - 危化品
+      - 城际
+      - 车辆统计
+      - 流量统计
+      - 车型分布
+    
+  地理路线:
+    description: "高速线路、路段等地理位置信息"
+    keywords:
+      # 具体线路
+      - 大广
+      - 昌金
+      - 昌栗
+      
+      # 概念词
+      - 线路
+      - 路段
+      - 路线
+      - 高速线路
+      - 公路线路
+    
+  系统查询指示词:
+    description: "系统、数据库等查询指示词(特殊权重处理)"
+    weight: 1  # 特殊标记:权重低于其他业务实体词
+    keywords:
+      # 系统指示
+      - 当前系统
+      - 当前数据库
+      - 当前数据
+      - 数据库
+      - 本系统
+      - 系统
+      
+      # 数据指示
+      - 数据库中
+      - 数据中
+      - 现有数据
+      - 已有数据
+      - 存储的数据
+      
+      # 平台指示
+      - 平台数据
+      - 我们的数据库
+      - 这个系统
+
+# ===========================================
+# 查询意图关键词
+# ===========================================
+query_intent_keywords:
+  description: "用于识别数据查询意图的关键词"
+  keywords:
+    # 统计分析
+    - 统计
+    - 查询
+    - 分析
+    - 排行
+    - 排名
+    - 报表
+    - 报告
+    - 汇总
+    - 计算
+    - 对比
+    - 趋势
+    - 占比
+    - 百分比
+    - 比例
+    
+    # 聚合函数
+    - 最大
+    - 最小
+    - 最多
+    - 最高
+    - 最低
+    - 平均
+    - 总计
+    - 合计
+    - 累计
+    - 求和
+    - 求平均
+    - 数量
+    
+    # 输出动作
+    - 生成
+    - 导出
+    - 显示
+    - 列出
+    - 共有
+
+# ===========================================
+# 非业务实体词(一旦匹配立即分类为CHAT)
+# ===========================================
+non_business_keywords:
+  description: "明确的非业务领域问题,最高优先级直接分类"
+  
+  农产品食物:
+    - 荔枝
+    - 苹果
+    - 西瓜
+    - 水果
+    - 蔬菜
+    - 大米
+    - 小麦
+    - 橙子
+    - 香蕉
+    - 葡萄
+    - 草莓
+    - 樱桃
+    - 桃子
+    - 梨
+    
+  技术概念:
+    - 人工智能
+    - 机器学习
+    - 编程
+    - 算法
+    - 深度学习
+    - AI
+    - 神经网络
+    - 模型训练
+    - 数据挖掘
+    
+  身份询问:
+    - 你是谁
+    - 你是什么
+    - 你叫什么
+    - 你的名字
+    - 你是什么AI
+    - 什么模型
+    - 大模型
+    - AI助手
+    - 助手
+    - 机器人
+    
+  天气相关:
+    - 天气
+    - 气温
+    - 下雨
+    - 晴天
+    - 阴天
+    - 温度
+    - 天气预报
+    - 气候
+    - 降雨
+    - 雪天
+    
+  生活常识:
+    - 怎么做饭
+    - 如何减肥
+    - 健康
+    - 医疗
+    - 病症
+    - 历史
+    - 地理
+    - 文学
+    - 电影
+    - 音乐
+    - 体育
+    - 娱乐
+    - 游戏
+    - 小说
+    - 新闻
+    - 政治
+    - 战争
+    - 足球
+    - NBA
+    - 篮球
+    - 乒乓球
+    - 冠军
+    - 夺冠
+    - 高考
+    - 菜谱
+    - 食谱
+    - 烹饪
+    - 联赛
+    
+  旅游出行:
+    - 旅游
+    - 景点
+    - 门票
+    - 酒店
+    - 机票
+    - 航班
+    - 高铁
+    - 的士
+    
+  情绪表达:
+    - 伤心
+    - 开心
+    - 无聊
+    - 生气
+    - 孤独
+    - 累了
+    - 烦恼
+    - 心情
+    - 难过
+    - 抑郁
+    
+  商业金融:
+    - 股票
+    - 基金
+    - 理财
+    - 投资
+    - 经济
+    - 通货膨胀
+    - 上市
+    
+  哲学思考:
+    - 人生意义
+    - 价值观
+    - 道德
+    - 信仰
+    - 宗教
+    - 爱情
+    
+  地理范围:
+    - 全球
+    - 全国
+    - 亚洲
+    - 发展中
+    - 欧洲
+    - 美洲
+    - 东亚
+    - 东南亚
+    - 南美
+    - 非洲
+    - 大洋
+
+# ===========================================
+# SQL模式(正则表达式)
+# ===========================================
+sql_patterns:
+  description: "用于识别SQL语句特征的正则表达式"
+  patterns:
+    - pattern: "\\b(select|from|where|group by|order by|having|join|update)\\b"
+      description: "SQL关键字匹配"
+      case_sensitive: false
+      
+    - pattern: "\\b(数据库|表名|表|字段名|SQL|sql|database|table)\\b"
+      description: "数据库概念词匹配"
+      case_sensitive: false
+
+# ===========================================
+# 聊天关键词
+# ===========================================
+chat_keywords:
+  description: "倾向于聊天分类的关键词"
+  keywords:
+    # 问候语
+    - 你好啊
+    - 谢谢
+    - 再见
+    
+    # 疑问词
+    - 怎么样
+    - 如何
+    - 为什么
+    - 什么是
+    
+    # 帮助请求
+    - 介绍
+    - 解释
+    - 说明
+    - 帮助
+    - 操作
+    - 使用方法
+    - 功能
+    - 教程
+    - 指南
+    - 手册
+    - 讲解
+
+# ===========================================
+# 配置验证规则
+# ===========================================
+validation:
+  required_sections:
+    - strong_business_keywords
+    - query_intent_keywords
+    - non_business_keywords
+    - sql_patterns
+    - chat_keywords
+  
+  min_keywords_count:
+    strong_business_keywords: 50
+    query_intent_keywords: 20
+    non_business_keywords: 70
+    chat_keywords: 15 
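
For reference, a minimal sketch of reading this file directly with PyYAML (in the project itself loading goes through agent/dict_loader.py, shown further below):

import yaml

with open("agent/classifier_dict.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

weights = cfg["weights"]
print(weights["high_confidence_threshold"])          # 0.7
print(weights["max_confidence"])                     # 0.9
print(list(cfg["strong_business_keywords"].keys()))  # category names, e.g. 核心业务实体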

+ 46 - 36
agent/config.py

@@ -10,36 +10,8 @@ Agent配置文件
 """
 
 AGENT_CONFIG = {
-    # ==================== 问题分类器配置 ====================
-    "classification": {
-        # 高置信度阈值:当规则分类的置信度 >= 此值时,直接使用规则分类结果,不再调用LLM
-        # 建议范围:0.7-0.9,过高可能错过需要LLM辅助的边界情况,过低会增加LLM调用成本
-        "high_confidence_threshold": 0.7,
-        
-        # 低置信度阈值:当规则分类的置信度 <= 此值时,启用LLM二次分类进行辅助判断
-        # 建议范围:0.2-0.5,过高会频繁调用LLM,过低可能错过需要LLM辅助的情况
-        "low_confidence_threshold": 0.4,
-        
-        # 最大置信度上限:规则分类计算出的置信度不会超过此值,防止过度自信
-        # 建议范围:0.8-1.0,通常设为0.9以保留不确定性空间
-        "max_confidence": 0.9,
-        
-        # 基础置信度:规则分类的起始置信度,会根据匹配的关键词数量递增
-        # 建议范围:0.3-0.6,这是匹配到1个关键词时的基础置信度
-        "base_confidence": 0.4,
-        
-        # 置信度增量步长:每匹配一个额外关键词,置信度增加的数值
-        # 建议范围:0.05-0.2,过大会导致置信度增长过快,过小则区分度不够
-        "confidence_increment": 0.08,
-        
-        # LLM分类失败时的默认置信度:当LLM调用异常或解析失败时使用
-        # 建议范围:0.3-0.6,通常设为中等水平,避免过高或过低的错误影响
-        "llm_fallback_confidence": 0.5,
-        
-        # 不确定分类的默认置信度:当规则分类无法明确判断时使用
-        # 建议范围:0.1-0.3,应设为较低值,表示确实不确定
-        "uncertain_confidence": 0.2,
-    },
+    # ==================== 问题分类器配置已迁移到 classifier_dict.yaml ====================
+    # 注意:问题分类器的所有配置参数已迁移到 agent/classifier_dict.yaml 文件的 weights 部分
     
     # ==================== 数据库Agent配置 ====================
     "database_agent": {
@@ -133,11 +105,11 @@ def get_nested_config(config: dict, key_path: str, default=None):
         配置值或默认值
         
     Example:
-        >>> config = {"classification": {"high_confidence_threshold": 0.8}}
-        >>> get_nested_config(config, "classification.high_confidence_threshold", 0.5)
-        0.8
-        >>> get_nested_config(config, "classification.missing_key", 0.5)
-        0.5
+        >>> config = {"database_agent": {"max_iterations": 10}}
+        >>> get_nested_config(config, "database_agent.max_iterations", 5)
+        10
+        >>> get_nested_config(config, "database_agent.missing_key", 5)
+        5
     """
     keys = key_path.split('.')
     current = config
@@ -160,4 +132,42 @@ def get_current_config() -> dict:
         此函数返回的是配置的引用,修改返回值会影响全局配置
         如需修改配置,建议创建副本后再修改
     """
-    return AGENT_CONFIG 
+    return AGENT_CONFIG
+
+# ==================== 分类器词典配置加载 ====================
+
+try:
+    from .dict_loader import load_classifier_dict_config, get_dict_loader
+    
+    def get_classifier_dict_config(force_reload: bool = False):
+        """
+        获取分类器词典配置
+        
+        Args:
+            force_reload: 是否强制重新加载
+            
+        Returns:
+            ClassifierDictConfig: 词典配置对象
+        """
+        return load_classifier_dict_config(force_reload)
+    
+    def reload_classifier_dict_config():
+        """重新加载分类器词典配置"""
+        return load_classifier_dict_config(force_reload=True)
+    
+    # 导出词典配置函数
+    __all__ = [
+        'get_current_config', 
+        'get_nested_config', 
+        'AGENT_CONFIG',
+        'get_classifier_dict_config',
+        'reload_classifier_dict_config'
+    ]
+    
+except ImportError as e:
+    # 如果dict_loader模块不存在,提供空实现
+    def get_classifier_dict_config(force_reload: bool = False):
+        raise ImportError("词典加载器模块不可用,请检查dict_loader.py是否存在")
+    
+    def reload_classifier_dict_config():
+        raise ImportError("词典加载器模块不可用,请检查dict_loader.py是否存在") 
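
A hedged usage sketch of the new accessor added here (assuming `agent` is importable as a package); when dict_loader.py is absent the stub raises ImportError, so callers may want to guard the call:

from agent import config as agent_config

try:
    dict_cfg = agent_config.get_classifier_dict_config()
    threshold = dict_cfg.weights.get("high_confidence_threshold", 0.7)
except ImportError:
    # dict_loader.py missing: the stub above raises ImportError, fall back to a default
    threshold = 0.7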

+ 221 - 0
agent/dict_loader.py

@@ -0,0 +1,221 @@
+# agent/dict_loader.py
+"""
+分类器词典配置加载器
+负责从YAML文件加载分类器词典配置,并提供数据转换和验证功能
+"""
+
+import yaml
+import os
+import re
+from typing import Dict, Any, List, Optional
+from dataclasses import dataclass
+from core.logging import get_agent_logger
+
+# 初始化日志
+logger = get_agent_logger("DictLoader")
+
+@dataclass
+class ClassifierDictConfig:
+    """分类器词典配置数据类"""
+    strong_business_keywords: Dict[str, List[str]]
+    query_intent_keywords: List[str]
+    non_business_keywords: List[str]
+    sql_patterns: List[str]
+    chat_keywords: List[str]
+    weights: Dict[str, float]
+    metadata: Dict[str, Any]
+
+class DictLoader:
+    """分类器词典配置加载器"""
+    
+    def __init__(self, dict_file: str = None):
+        """
+        初始化加载器
+        
+        Args:
+            dict_file: 词典配置文件路径,默认为agent/classifier_dict.yaml
+        """
+        if dict_file is None:
+            current_dir = os.path.dirname(os.path.abspath(__file__))
+            dict_file = os.path.join(current_dir, "classifier_dict.yaml")
+        
+        self.dict_file = dict_file
+        self.config_cache = None
+    
+    def load_config(self, force_reload: bool = False) -> ClassifierDictConfig:
+        """
+        加载词典配置
+        
+        Args:
+            force_reload: 是否强制重新加载,默认使用缓存
+            
+        Returns:
+            ClassifierDictConfig: 词典配置对象
+            
+        Raises:
+            FileNotFoundError: 配置文件不存在
+            ValueError: 配置文件格式错误
+        """
+        if self.config_cache is not None and not force_reload:
+            return self.config_cache
+        
+        try:
+            logger.info(f"加载词典配置文件: {self.dict_file}")
+            
+            with open(self.dict_file, 'r', encoding='utf-8') as f:
+                yaml_data = yaml.safe_load(f)
+            
+            # 验证配置文件
+            self._validate_config(yaml_data)
+            
+            # 转换数据格式
+            config = self._convert_config(yaml_data)
+            
+            # 缓存配置
+            self.config_cache = config
+            
+            logger.info("词典配置加载成功")
+            return config
+            
+        except FileNotFoundError:
+            error_msg = f"词典配置文件不存在: {self.dict_file}"
+            logger.error(error_msg)
+            raise FileNotFoundError(error_msg)
+        except yaml.YAMLError as e:
+            error_msg = f"词典配置文件YAML格式错误: {str(e)}"
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+        except Exception as e:
+            error_msg = f"词典配置加载失败: {str(e)}"
+            logger.error(error_msg)
+            raise ValueError(error_msg)
+    
+    def _validate_config(self, yaml_data: Dict[str, Any]) -> None:
+        """验证配置文件格式和必要字段"""
+        required_sections = [
+            'strong_business_keywords',
+            'query_intent_keywords', 
+            'non_business_keywords',
+            'sql_patterns',
+            'chat_keywords',
+            'weights'
+        ]
+        
+        for section in required_sections:
+            if section not in yaml_data:
+                raise ValueError(f"配置文件缺少必要部分: {section}")
+        
+        # 验证权重配置
+        required_weights = [
+            'business_entity',
+            'system_indicator', 
+            'query_intent',
+            'sql_pattern',
+            'chat_keyword',
+            'non_business_confidence',
+            'high_confidence_threshold',
+            'max_confidence',
+            'llm_fallback_confidence',
+            'uncertain_confidence',
+            'llm_error_confidence'
+        ]
+        
+        for weight in required_weights:
+            if weight not in yaml_data['weights']:
+                raise ValueError(f"权重配置缺少: {weight}")
+        
+        logger.debug("配置文件验证通过")
+    
+    def _convert_config(self, yaml_data: Dict[str, Any]) -> ClassifierDictConfig:
+        """将YAML数据转换为ClassifierDictConfig对象"""
+        
+        # 转换强业务关键词(保持字典结构)
+        strong_business_keywords = {}
+        for category, data in yaml_data['strong_business_keywords'].items():
+            if isinstance(data, dict) and 'keywords' in data:
+                strong_business_keywords[category] = data['keywords']
+            else:
+                # 兼容简单格式
+                strong_business_keywords[category] = data
+        
+        # 转换查询意图关键词
+        query_intent_data = yaml_data['query_intent_keywords']
+        if isinstance(query_intent_data, dict) and 'keywords' in query_intent_data:
+            query_intent_keywords = query_intent_data['keywords']
+        else:
+            query_intent_keywords = query_intent_data
+        
+        # 转换非业务实体词(展平为列表)
+        non_business_keywords = self._flatten_non_business_keywords(
+            yaml_data['non_business_keywords']
+        )
+        
+        # 转换SQL模式
+        sql_patterns = []
+        patterns_data = yaml_data['sql_patterns']
+        if isinstance(patterns_data, dict) and 'patterns' in patterns_data:
+            for pattern_info in patterns_data['patterns']:
+                if isinstance(pattern_info, dict):
+                    sql_patterns.append(pattern_info['pattern'])
+                else:
+                    sql_patterns.append(pattern_info)
+        else:
+            sql_patterns = patterns_data
+        
+        # 转换其他关键词列表
+        chat_keywords = self._extract_keywords_list(yaml_data['chat_keywords'])
+        
+        return ClassifierDictConfig(
+            strong_business_keywords=strong_business_keywords,
+            query_intent_keywords=query_intent_keywords,
+            non_business_keywords=non_business_keywords,
+            sql_patterns=sql_patterns,
+            chat_keywords=chat_keywords,
+            weights=yaml_data['weights'],
+            metadata=yaml_data.get('metadata', {})
+        )
+    
+    def _flatten_non_business_keywords(self, non_business_data: Dict[str, Any]) -> List[str]:
+        """将分类的非业务词展平为列表"""
+        flattened = []
+        
+        # 跳过description字段
+        for category, keywords in non_business_data.items():
+            if category == 'description':
+                continue
+            if isinstance(keywords, list):
+                flattened.extend(keywords)
+        
+        return flattened
+    
+    def _extract_keywords_list(self, data: Any) -> List[str]:
+        """从可能包含description的数据中提取关键词列表"""
+        if isinstance(data, dict) and 'keywords' in data:
+            return data['keywords']
+        elif isinstance(data, list):
+            return data
+        else:
+            return []
+
+# 全局加载器实例
+_dict_loader = None
+
+def get_dict_loader() -> DictLoader:
+    """获取全局词典加载器实例"""
+    global _dict_loader
+    if _dict_loader is None:
+        _dict_loader = DictLoader()
+    return _dict_loader
+
+def load_classifier_dict_config(force_reload: bool = False) -> ClassifierDictConfig:
+    """
+    加载分类器词典配置(便捷函数)
+    
+    Args:
+        force_reload: 是否强制重新加载
+        
+    Returns:
+        ClassifierDictConfig: 词典配置对象
+    """
+    loader = get_dict_loader()
+    return loader.load_config(force_reload) 
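
A short usage sketch of the loader (field names taken from the ClassifierDictConfig dataclass above):

from agent.dict_loader import load_classifier_dict_config

cfg = load_classifier_dict_config()        # cached after the first read
print(cfg.metadata.get("version"))         # "1.0"
print(len(cfg.non_business_keywords))      # flattened list (validation expects >= 70)
print(cfg.sql_patterns[0])                 # first regex string

cfg = load_classifier_dict_config(force_reload=True)   # re-read after editing the YAML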

+ 2 - 8
agent/state.py

@@ -7,10 +7,10 @@ class AgentState(TypedDict):
     
     # 输入信息
     question: str
-    session_id: Optional[str]
+    conversation_id: Optional[str]
     
     # 上下文信息
-    context_type: Optional[str]  # 上下文类型 ("DATABASE" 或 "CHAT")
+    context_type: Optional[str]  # 上下文类型(保留兼容性字段,当前未使用)
     
     # 分类结果
     question_type: Literal["DATABASE", "CHAT", "UNCERTAIN"]
@@ -20,7 +20,6 @@ class AgentState(TypedDict):
     
     # 数据库查询流程状态
     sql: Optional[str]
-    sql_generation_attempts: int
     query_result: Optional[Dict[str, Any]]
     summary: Optional[str]
     
@@ -45,11 +44,6 @@ class AgentState(TypedDict):
     # 流程控制
     current_step: str
     execution_path: List[str]
-    retry_count: int
-    max_retries: int
-    
-    # 调试信息
-    debug_info: Dict[str, Any]
     
     # 路由模式相关
     routing_mode: Optional[str]  # 记录使用的路由模式
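
A hypothetical partial state after this change, showing only fields visible in the diff (a real AgentState also carries the remaining keys such as sql, query_result and summary):

state = {
    "question": "统计各服务区的营业收入",
    "conversation_id": "guest:20250701123456789",  # replaces the removed session_id
    "context_type": None,                          # compatibility field, currently unused
    "question_type": "UNCERTAIN",
    "current_step": "start",
    "execution_path": [],
    "routing_mode": None,
}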

+ 3 - 3
app_config.py

@@ -30,7 +30,7 @@ API_DEEPSEEK_CONFIG = {
     "n_results": 6,
     "language": "Chinese",
     "stream": True,  # 是否使用流式模式
-    "enable_thinking": True  # 自定义,是否支持流模式
+    "enable_thinking": False  # 自定义,是否支持流模式
 }
 
 # Qwen模型配置
@@ -43,7 +43,7 @@ API_QIANWEN_CONFIG = {
     "n_results": 6,
     "language": "Chinese",
     "stream": True,  # 是否使用流式模式
-    "enable_thinking": False  # 是否启用思考功能(要求stream=True)
+    "enable_thinking": True  # 是否启用思考功能(要求stream=True)
 }
 #qwen3-30b-a3b
 #qwen3-235b-a22b
@@ -164,7 +164,7 @@ DEFAULT_ANONYMOUS_USER = "guest"        # 匿名用户统一ID
 # Redis配置
 REDIS_HOST = "localhost"
 REDIS_PORT = 6379
-REDIS_DB = 0
+REDIS_DB = 1
 REDIS_PASSWORD = None
 
 # 缓存开关配置

+ 27 - 0
asgi_app.py

@@ -0,0 +1,27 @@
+"""
+ASGI应用启动文件
+将Flask WSGI应用转换为ASGI应用,支持异步路由
+
+启动方式:
+1. 开发环境:python unified_api.py (直接Flask)
+2. 生产环境:uvicorn asgi_app:asgi_app (ASGI服务器)
+"""
+from asgiref.wsgi import WsgiToAsgi
+from unified_api import app
+
+# 将Flask WSGI应用转换为ASGI应用
+asgi_app = WsgiToAsgi(app)
+
+# 启动方式示例:
+# 开发环境(单进程 + 重载):
+# uvicorn asgi_app:asgi_app --host 127.0.0.1 --port 8084 --reload
+
+# 生产环境(多进程 + 性能优化):
+# uvicorn asgi_app:asgi_app --host 0.0.0.0 --port 8084 --workers 4 --limit-concurrency 100 --limit-max-requests 1000 --access-log
+
+# uvicorn asgi_app:asgi_app --host 0.0.0.0 --port 8084 --workers 4 --limit-concurrency 100 --limit-max-requests 1000
+
+# 在 Ubuntu 上安装 uvloop、httptools 没有坏处,建议保留;不必显式加 --loop uvloop,Uvicorn 默认即为 --loop auto
+
+
+# uvicorn asgi_app:asgi_app --host 0.0.0.0 --port 8084 --workers 4 --limit-concurrency 12 --limit-max-requests 5000 --timeout-graceful-shutdown 30

+ 123 - 51
citu_app.py

@@ -9,7 +9,7 @@ from flask import request, jsonify
 import pandas as pd
 import common.result as result
 from datetime import datetime, timedelta
-from common.session_aware_cache import WebSessionAwareMemoryCache
+from common.session_aware_cache import ConversationAwareMemoryCache
 from app_config import API_MAX_RETURN_ROWS, ENABLE_RESULT_SUMMARY
 import re
 import chainlit as cl
@@ -41,13 +41,13 @@ MAX_RETURN_ROWS = API_MAX_RETURN_ROWS if API_MAX_RETURN_ROWS is not None else DE
 
 vn = create_vanna_instance()
 
-# 创建带时间戳的缓存
-timestamped_cache = WebSessionAwareMemoryCache()
+# 创建对话感知的缓存
+conversation_cache = ConversationAwareMemoryCache()
 
 # 实例化 VannaFlaskApp,使用自定义缓存
 app = VannaFlaskApp(
     vn,
-    cache=timestamped_cache,  # 使用带时间戳的缓存
+    cache=conversation_cache,  # 使用对话感知的缓存
     title="辞图智能数据问答平台",
     logo = "https://www.citupro.com/img/logo-black-2.png",
     subtitle="让 AI 为你写 SQL",
@@ -61,12 +61,13 @@ app = VannaFlaskApp(
 # 创建Redis对话管理器实例
 redis_conversation_manager = RedisConversationManager()
 
-# 修改ask接口,支持前端传递session_id
+# 修改ask接口,支持前端传递conversation_id
 @app.flask_app.route('/api/v0/ask', methods=['POST'])
 def ask_full():
     req = request.get_json(force=True)
     question = req.get("question", None)
-    browser_session_id = req.get("session_id", None)  # 前端传递的会话ID
+    conversation_id = req.get("conversation_id", None)  # 前端传递的对话ID
+    user_id = req.get("user_id", None)  # 前端传递的用户ID
     
     if not question:
         from common.result import bad_request_response
@@ -75,14 +76,13 @@ def ask_full():
             missing_params=["question"]
         )), 400
 
-    # 如果使用WebSessionAwareMemoryCache
-    if hasattr(app.cache, 'generate_id_with_browser_session') and browser_session_id:
-        # 这里需要修改vanna的ask方法来支持传递session_id
-        # 或者预先调用generate_id来建立会话关联
-        conversation_id = app.cache.generate_id_with_browser_session(
-            question=question, 
-            browser_session_id=browser_session_id
-        )
+    # 如果没有传递user_id,使用默认值guest
+    if not user_id:
+        user_id = "guest"
+
+    # 如果前端没有传递conversation_id,则生成新的
+    if not conversation_id:
+        conversation_id = app.cache.generate_id(question=question, user_id=user_id)
 
     try:
         sql, df, _ = vn.ask(
@@ -144,8 +144,7 @@ def ask_full():
         response_data = {
             "sql": sql,
             "query_result": query_result,
-            "conversation_id": conversation_id if 'conversation_id' in locals() else None,
-            "session_id": browser_session_id
+            "conversation_id": conversation_id
         }
         
         # 添加摘要(如果启用且生成成功)
@@ -237,7 +236,8 @@ def ask_cached():
     """
     req = request.get_json(force=True)
     question = req.get("question", None)
-    browser_session_id = req.get("session_id", None)
+    conversation_id = req.get("conversation_id", None)
+    user_id = req.get("user_id", None)
     
     if not question:
         from common.result import bad_request_response
@@ -246,15 +246,19 @@ def ask_cached():
             missing_params=["question"]
         )), 400
 
+    # 如果没有传递user_id,使用默认值guest
+    if not user_id:
+        user_id = "guest"
+
     try:
         # 生成conversation_id
         # 调试:查看generate_id的实际行为
         logger.debug(f"输入问题: '{question}'")
-        conversation_id = app.cache.generate_id(question=question)
+        conversation_id = app.cache.generate_id(question=question, user_id=user_id)
         logger.debug(f"生成的conversation_id: {conversation_id}")
         
         # 再次用相同问题测试
-        conversation_id2 = app.cache.generate_id(question=question)
+        conversation_id2 = app.cache.generate_id(question=question, user_id=user_id)
         logger.debug(f"再次生成的conversation_id: {conversation_id2}")
         logger.debug(f"两次ID是否相同: {conversation_id == conversation_id2}")
         
@@ -336,7 +340,6 @@ def ask_cached():
             "sql": sql,
             "query_result": query_result,
             "conversation_id": conversation_id,
-            "session_id": browser_session_id,
             "cached": cached_sql is not None  # 标识是否来自缓存
         }
         
@@ -449,11 +452,10 @@ def ask_agent():
     """
     req = request.get_json(force=True)
     question = req.get("question", None)
-    browser_session_id = req.get("session_id", None)
+    conversation_id_input = req.get("conversation_id", None)
     
     # 新增参数解析
     user_id_input = req.get("user_id", None)
-    conversation_id_input = req.get("conversation_id", None)
     continue_conversation = req.get("continue_conversation", False)
     
     # 新增:路由模式参数解析和验证
@@ -477,9 +479,38 @@ def ask_agent():
         # 1. 获取登录用户ID(修正:在函数中获取session信息)
         login_user_id = session.get('user_id') if 'user_id' in session else None
         
-        # 2. 智能ID解析(修正:传入登录用户ID)
+        # 2. 用户ID和对话ID一致性校验
+        from common.session_aware_cache import ConversationAwareMemoryCache
+        
+        # 2.1 如果传递了conversation_id,从中解析user_id
+        extracted_user_id = None
+        if conversation_id_input:
+            extracted_user_id = ConversationAwareMemoryCache.extract_user_id(conversation_id_input)
+            
+            # 如果同时传递了user_id和conversation_id,进行一致性校验
+            if user_id_input:
+                is_valid, error_msg = ConversationAwareMemoryCache.validate_user_id_consistency(
+                    conversation_id_input, user_id_input
+                )
+                if not is_valid:
+                    return jsonify(bad_request_response(
+                        response_text=error_msg,
+                        invalid_params=["user_id", "conversation_id"]
+                    )), 400
+            
+            # 如果没有传递user_id,但有conversation_id,则从conversation_id中解析
+            elif not user_id_input and extracted_user_id:
+                user_id_input = extracted_user_id
+                logger.info(f"从conversation_id解析出user_id: {user_id_input}")
+        
+        # 2.2 如果没有传递user_id,使用默认值guest
+        if not user_id_input:
+            user_id_input = "guest"
+            logger.info("未传递user_id,使用默认值: guest")
+        
+        # 3. 智能ID解析(修正:传入登录用户ID)
         user_id = redis_conversation_manager.resolve_user_id(
-            user_id_input, browser_session_id, request.remote_addr, login_user_id
+            user_id_input, None, request.remote_addr, login_user_id
         )
         conversation_id, conversation_status = redis_conversation_manager.resolve_conversation_id(
             user_id, conversation_id_input, continue_conversation
@@ -493,7 +524,7 @@ def ask_agent():
         if context:
             try:
                 # 获取最后一条助手消息的metadata
-                messages = redis_conversation_manager.get_messages(conversation_id, limit=10)
+                messages = redis_conversation_manager.get_conversation_messages(conversation_id, limit=10)
                 for message in reversed(messages):  # 从最新的开始找
                     if message.get("role") == "assistant":
                         metadata = message.get("metadata", {})
@@ -552,16 +583,13 @@ def ask_agent():
                 sql=cached_answer.get("sql"),
                 records=cached_answer.get("query_result"),  # 修改:query_result改为records
                 summary=cached_answer.get("summary"),
-                session_id=browser_session_id,
+                conversation_id=conversation_id,
                 execution_path=cached_answer.get("execution_path", []),
                 classification_info=cached_answer.get("classification_info", {}),
-                conversation_id=conversation_id,
                 user_id=user_id,
-                is_guest_user=(user_id == DEFAULT_ANONYMOUS_USER),
                 context_used=bool(context),
                 from_cache=True,
                 conversation_status=conversation_status["status"],
-                conversation_message=conversation_status["message"],
                 requested_conversation_id=conversation_status.get("requested_id")
             ))
         
@@ -605,7 +633,7 @@ def ask_agent():
         import asyncio
         agent_result = asyncio.run(agent.process_question(
             question=enhanced_question,  # 使用增强后的问题
-            session_id=browser_session_id,
+            conversation_id=conversation_id,
             context_type=context_type,  # 传递上下文类型
             routing_mode=effective_routing_mode  # 新增:传递路由模式
         ))
@@ -662,16 +690,13 @@ def ask_agent():
                 sql=sql,
                 records=query_result,  # 修改:query_result改为records
                 summary=summary,
-                session_id=browser_session_id,
+                conversation_id=conversation_id,
                 execution_path=execution_path,
                 classification_info=classification_info,
-                conversation_id=conversation_id,
                 user_id=user_id,
-                is_guest_user=(user_id == DEFAULT_ANONYMOUS_USER),
                 context_used=bool(context),
                 from_cache=False,
                 conversation_status=conversation_status["status"],
-                conversation_message=conversation_status["message"],
                 requested_conversation_id=conversation_status.get("requested_id"),
                 routing_mode_used=effective_routing_mode,  # 新增:实际使用的路由模式
                 routing_mode_source="api" if api_routing_mode else "config"  # 新增:路由模式来源
@@ -685,7 +710,6 @@ def ask_agent():
                 response_text=error_message,
                 error_type="agent_processing_failed",
                 code=error_code,
-                session_id=browser_session_id,
                 conversation_id=conversation_id,
                 user_id=user_id
             )), error_code
@@ -3847,6 +3871,7 @@ def get_table_list_info(task_id):
             "file_size_formatted": "1.0 KB",
             "uploaded_at": "2025-07-01T12:34:56",
             "table_count": 5,
+            "table_names": ["table_name_1", "table_name_2", "table_name_3", "table_name_4", "table_name_5"],
             "is_readable": true
         }
     }
@@ -4130,8 +4155,10 @@ def delete_task_directory_simple(task_id, delete_database_records=False):
             # 删除目录
             shutil.rmtree(task_dir)
             directory_deleted = True
+            operation_message = "目录删除成功"
         else:
             directory_deleted = False
+            operation_message = "目录不存在,无需删除"
         
         # 2. 更新数据库
         database_records_deleted = False
@@ -4180,7 +4207,8 @@ def delete_task_directory_simple(task_id, delete_database_records=False):
             "database_records_deleted": database_records_deleted,
             "deleted_files_count": deleted_files_count,
             "deleted_size": format_size(deleted_size),
-            "deleted_at": datetime.now().isoformat()
+            "deleted_at": datetime.now().isoformat(),
+            "operation_message": operation_message  # 新增:具体的操作消息
         }
         
     except Exception as e:
@@ -4189,19 +4217,45 @@ def delete_task_directory_simple(task_id, delete_database_records=False):
             "success": False,
             "task_id": task_id,
             "error": str(e),
-            "error_code": "DELETE_FAILED"
+            "error_code": "DELETE_FAILED",
+            "operation_message": f"删除操作失败: {str(e)}"  # 新增:失败消息
         }
 
 @app.flask_app.route('/api/v0/data_pipeline/tasks', methods=['DELETE'])
 def delete_tasks():
     """删除任务目录(支持单个和批量)"""
     try:
-        # 获取请求参数
-        req = request.get_json(force=True)
+        # 智能获取参数:支持JSON body和URL查询参数两种方式
+        def get_request_parameter(param_name, array_param_name=None):
+            """从JSON body或URL查询参数中获取参数值"""
+            # 1. 优先从JSON body获取
+            if request.is_json:
+                try:
+                    json_data = request.get_json()
+                    if json_data and param_name in json_data:
+                        return json_data[param_name]
+                except:
+                    pass
+            
+            # 2. 从URL查询参数获取
+            if param_name in request.args:
+                value = request.args.get(param_name)
+                # 处理布尔值
+                if value.lower() in ('true', '1', 'yes'):
+                    return True
+                elif value.lower() in ('false', '0', 'no'):
+                    return False
+                return value
+            
+            # 3. 处理数组参数(如 task_ids[])
+            if array_param_name and array_param_name in request.args:
+                return request.args.getlist(array_param_name)
+            
+            return None
         
-        # 验证必需参数
-        task_ids = req.get('task_ids')
-        confirm = req.get('confirm')
+        # 获取参数
+        task_ids = get_request_parameter('task_ids', 'task_ids[]')
+        confirm = get_request_parameter('confirm')
         
         if not task_ids:
             return jsonify(bad_request_response(
@@ -4226,8 +4280,10 @@ def delete_tasks():
             )), 400
         
         # 获取可选参数
-        delete_database_records = req.get('delete_database_records', False)
-        continue_on_error = req.get('continue_on_error', True)
+        delete_database_records = get_request_parameter('delete_database_records') or False
+        continue_on_error = get_request_parameter('continue_on_error')
+        if continue_on_error is None:
+            continue_on_error = True
         
         # 执行批量删除操作
         deleted_tasks = []
@@ -4264,20 +4320,36 @@ def delete_tasks():
             "deleted_at": datetime.now().isoformat()
         }
         
+        # 构建智能响应消息
         if len(task_ids) == 1:
-            # 单个删除
+            # 单个删除:使用具体的操作消息
             if summary["failed"] == 0:
-                message = "任务目录删除成功"
+                # 从deleted_tasks中获取具体的操作消息
+                operation_msg = deleted_tasks[0].get('operation_message', '任务处理完成')
+                message = operation_msg
             else:
-                message = "任务目录删除失败"
+                # 从failed_tasks中获取错误消息
+                error_msg = failed_tasks[0].get('error', '删除失败')
+                message = f"任务删除失败: {error_msg}"
         else:
-            # 批量删除
+            # 批量删除:统计各种操作结果
+            directory_deleted_count = sum(1 for task in deleted_tasks if task.get('directory_deleted', False))
+            directory_not_exist_count = sum(1 for task in deleted_tasks if not task.get('directory_deleted', False))
+            
             if summary["failed"] == 0:
-                message = "批量删除完成"
+                # 全部成功
+                if directory_deleted_count > 0 and directory_not_exist_count > 0:
+                    message = f"批量操作完成:{directory_deleted_count}个目录已删除,{directory_not_exist_count}个目录不存在"
+                elif directory_deleted_count > 0:
+                    message = f"批量删除完成:成功删除{directory_deleted_count}个目录"
+                elif directory_not_exist_count > 0:
+                    message = f"批量操作完成:{directory_not_exist_count}个目录不存在,无需删除"
+                else:
+                    message = "批量操作完成"
             elif summary["successfully_deleted"] == 0:
-                message = "批量删除失败"
+                message = f"批量删除失败:{summary['failed']}个任务处理失败"
             else:
-                message = "批量删除部分完成"
+                message = f"批量删除部分完成:成功{summary['successfully_deleted']}个,失败{summary['failed']}个"
         
         return jsonify(success_response(
             response_text=message,
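
An illustrative sketch (hypothetical host, port and task_id) of the two parameter styles the reworked delete_tasks endpoint now accepts, JSON body or URL query parameters:

import requests

BASE = "http://localhost:8084"   # assumed from the uvicorn startup examples above

# 1) JSON body (original style)
requests.delete(f"{BASE}/api/v0/data_pipeline/tasks",
                json={"task_ids": ["task_20250701_123456"], "confirm": True})

# 2) URL query parameters (newly supported via get_request_parameter)
requests.delete(f"{BASE}/api/v0/data_pipeline/tasks",
                params={"task_ids[]": "task_20250701_123456", "confirm": "true"})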

+ 297 - 8
common/redis_conversation_manager.py

@@ -81,7 +81,7 @@ class RedisConversationManager:
         Returns:
             tuple: (conversation_id, status_info)
             status_info包含:
-            - status: "existing" | "new" | "invalid_id_new"
+            - status: "continue" | "new" | "invalid_id_new"
             - message: 状态说明
             - requested_id: 原始请求的ID(如果有)
         """
@@ -91,7 +91,7 @@ class RedisConversationManager:
             if self._is_valid_conversation(conversation_id_input, user_id):
                 self.logger.debug(f"使用指定对话: {conversation_id_input}")
                 return conversation_id_input, {
-                    "status": "existing",
+                    "status": "continue",
                     "message": "继续已有对话"
                 }
             else:
@@ -109,7 +109,7 @@ class RedisConversationManager:
             if recent_conversation:
                 self.logger.debug(f"继续最近对话: {recent_conversation}")
                 return recent_conversation, {
-                    "status": "existing",
+                    "status": "continue",
                     "message": "继续最近对话"
                 }
         
@@ -155,9 +155,10 @@ class RedisConversationManager:
     
     def create_conversation(self, user_id: str) -> str:
         """创建新对话"""
-        # 生成包含时间戳的conversation_id
-        timestamp = int(datetime.now().timestamp())
-        conversation_id = f"conv_{timestamp}_{uuid.uuid4().hex[:8]}"
+        # 生成包含时间戳的conversation_id,格式:{user_id}:YYYYMMDDHHMMSSsss
+        now = datetime.now()
+        timestamp = now.strftime("%Y%m%d%H%M%S") + f"{now.microsecond // 1000:03d}"
+        conversation_id = f"{user_id}:{timestamp}"
         
         if not self.is_available():
             return conversation_id  # Redis不可用时返回ID,但不存储
@@ -518,12 +519,13 @@ class RedisConversationManager:
     def cleanup_expired_conversations(self):
         """清理过期对话(Redis TTL自动处理,这里可添加额外逻辑)"""
         if not self.is_available():
-            return
+            return {"processed_users": 0, "cleaned_references": 0}
         
         try:
             # 清理用户对话列表中的无效对话ID
             user_keys = self.redis_client.keys("user:*:conversations")
             cleaned_count = 0
+            processed_users = len(user_keys)
             
             for user_key in user_keys:
                 conversation_ids = self.redis_client.lrange(user_key, 0, -1)
@@ -542,12 +544,299 @@ class RedisConversationManager:
                     if valid_ids:
                         self.redis_client.lpush(user_key, *reversed(valid_ids))
                         # 重新设置TTL
-                        self.redis_client.expire(user_key, USER_CONVERSATIONS_TTL)
+                        if USER_CONVERSATIONS_TTL:
+                            self.redis_client.expire(user_key, USER_CONVERSATIONS_TTL)
             
             self.logger.info(f"清理完成,移除了 {cleaned_count} 个无效对话引用")
+            return {"processed_users": processed_users, "cleaned_references": cleaned_count}
             
         except Exception as e:
             self.logger.error(f"清理失败: {str(e)}")
+            raise e
+
+    def enforce_conversation_limits(self, user_id: Optional[str] = None, 
+                                  user_max_conversations: Optional[int] = None,
+                                  conversation_max_length: Optional[int] = None,
+                                  dry_run: bool = False) -> Dict[str, Any]:
+        """
+        执行对话限额策略
+        
+        Args:
+            user_id: 指定用户ID,如果为None则处理所有用户
+            user_max_conversations: 用户最大对话数,如果为None则使用配置值
+            conversation_max_length: 对话最大消息数,如果为None则使用配置值
+            dry_run: 是否为试运行模式
+            
+        Returns:
+            执行结果统计
+        """
+        if not self.is_available():
+            raise Exception("Redis连接不可用")
+        
+        # 使用传入参数或默认配置
+        max_conversations = user_max_conversations if user_max_conversations is not None else USER_MAX_CONVERSATIONS
+        max_length = conversation_max_length if conversation_max_length is not None else CONVERSATION_MAX_LENGTH
+        
+        try:
+            start_time = time.time()
+            
+            # 确定要处理的用户
+            if user_id:
+                user_keys = [f"user:{user_id}:conversations"]
+                mode = "user_specific"
+            else:
+                user_keys = self.redis_client.keys("user:*:conversations")
+                mode = "global"
+            
+            processed_users = 0
+            total_conversations_processed = 0
+            total_conversations_deleted = 0
+            total_messages_trimmed = 0
+            execution_summary = []
+            
+            for user_key in user_keys:
+                user_id_from_key = user_key.split(":")[1]
+                conversation_ids = self.redis_client.lrange(user_key, 0, -1)
+                
+                original_conversations = len(conversation_ids)
+                total_conversations_processed += original_conversations
+                
+                # 1. 检查用户对话数量限制
+                conversations_to_keep = []
+                conversations_to_delete = []
+                
+                if len(conversation_ids) > max_conversations:
+                    # 获取对话的创建时间并排序
+                    conversations_with_time = []
+                    for conv_id in conversation_ids:
+                        meta_key = f"conversation:{conv_id}:meta"
+                        if self.redis_client.exists(meta_key):
+                            meta_data = self.redis_client.hgetall(meta_key)
+                            created_at = meta_data.get('created_at', '0')
+                            conversations_with_time.append((conv_id, created_at))
+                    
+                    # 按创建时间降序排序,保留最新的
+                    conversations_with_time.sort(key=lambda x: x[1], reverse=True)
+                    conversations_to_keep = [conv_id for conv_id, _ in conversations_with_time[:max_conversations]]
+                    conversations_to_delete = [conv_id for conv_id, _ in conversations_with_time[max_conversations:]]
+                else:
+                    conversations_to_keep = conversation_ids
+                
+                # 2. 处理要删除的对话
+                if conversations_to_delete and not dry_run:
+                    for conv_id in conversations_to_delete:
+                        self.redis_client.delete(f"conversation:{conv_id}:meta")
+                        self.redis_client.delete(f"conversation:{conv_id}:messages")
+                    
+                    # 更新用户对话列表
+                    self.redis_client.delete(user_key)
+                    if conversations_to_keep:
+                        self.redis_client.lpush(user_key, *reversed(conversations_to_keep))
+                        if USER_CONVERSATIONS_TTL:
+                            self.redis_client.expire(user_key, USER_CONVERSATIONS_TTL)
+                
+                total_conversations_deleted += len(conversations_to_delete)
+                
+                # 3. 检查每个保留对话的消息数量限制
+                messages_trimmed_for_user = 0
+                for conv_id in conversations_to_keep:
+                    messages_key = f"conversation:{conv_id}:messages"
+                    current_length = self.redis_client.llen(messages_key)
+                    
+                    if current_length > max_length:
+                        messages_to_trim = current_length - max_length
+                        if not dry_run:
+                            self.redis_client.ltrim(messages_key, 0, max_length - 1)
+                        messages_trimmed_for_user += messages_to_trim
+                
+                total_messages_trimmed += messages_trimmed_for_user
+                processed_users += 1
+                
+                # 记录用户处理结果
+                execution_summary.append({
+                    "user_id": user_id_from_key,
+                    "original_conversations": original_conversations,
+                    "kept_conversations": len(conversations_to_keep),
+                    "deleted_conversations": len(conversations_to_delete),
+                    "messages_trimmed": messages_trimmed_for_user
+                })
+            
+            execution_time_ms = int((time.time() - start_time) * 1000)
+            
+            return {
+                "mode": mode,
+                "dry_run": dry_run,
+                "parameters": {
+                    "user_max_conversations": max_conversations,
+                    "conversation_max_length": max_length
+                },
+                "processed_users": processed_users,
+                "total_conversations_processed": total_conversations_processed,
+                "total_conversations_deleted": total_conversations_deleted,
+                "total_messages_trimmed": total_messages_trimmed,
+                "execution_summary": execution_summary,
+                "execution_time_ms": execution_time_ms
+            }
+            
+        except Exception as e:
+            self.logger.error(f"执行对话限额策略失败: {str(e)}")
+            raise e
+
+    def delete_user_conversations(self, user_id: str) -> Dict[str, Any]:
+        """
+        删除指定用户的所有对话数据
+        
+        Args:
+            user_id: 用户ID
+            
+        Returns:
+            删除结果统计
+        """
+        if not self.is_available():
+            raise Exception("Redis连接不可用")
+        
+        try:
+            start_time = time.time()
+            
+            user_key = f"user:{user_id}:conversations"
+            conversation_ids = self.redis_client.lrange(user_key, 0, -1)
+            
+            deleted_conversations = 0
+            deleted_messages = 0
+            
+            # 删除每个对话的数据
+            for conv_id in conversation_ids:
+                meta_key = f"conversation:{conv_id}:meta"
+                messages_key = f"conversation:{conv_id}:messages"
+                
+                if self.redis_client.exists(meta_key):
+                    self.redis_client.delete(meta_key)
+                    deleted_conversations += 1
+                
+                if self.redis_client.exists(messages_key):
+                    message_count = self.redis_client.llen(messages_key)
+                    self.redis_client.delete(messages_key)
+                    deleted_messages += message_count
+            
+            # 删除用户对话索引
+            self.redis_client.delete(user_key)
+            
+            execution_time_ms = int((time.time() - start_time) * 1000)
+            
+            return {
+                "user_id": user_id,
+                "deleted_conversations": deleted_conversations,
+                "deleted_messages": deleted_messages,
+                "execution_time_ms": execution_time_ms
+            }
+            
+        except Exception as e:
+            self.logger.error(f"删除用户对话失败: {str(e)}")
+            raise e
+
+    def delete_conversation(self, conversation_id: str) -> Dict[str, Any]:
+        """
+        删除指定的对话
+        
+        Args:
+            conversation_id: 对话ID
+            
+        Returns:
+            删除结果统计
+        """
+        if not self.is_available():
+            raise Exception("Redis连接不可用")
+        
+        try:
+            start_time = time.time()
+            
+            meta_key = f"conversation:{conversation_id}:meta"
+            messages_key = f"conversation:{conversation_id}:messages"
+            
+            # 检查对话是否存在
+            if not self.redis_client.exists(meta_key):
+                # 对话不存在,返回空结果(符合DELETE操作的幂等性原则)
+                return {
+                    "conversation_id": conversation_id,
+                    "user_id": None,
+                    "deleted_messages": 0,
+                    "execution_time_ms": int((time.time() - start_time) * 1000),
+                    "existed": False
+                }
+            
+            # 获取对话所属用户
+            meta_data = self.redis_client.hgetall(meta_key)
+            user_id = meta_data.get('user_id')
+            
+            # 统计要删除的消息数
+            deleted_messages = self.redis_client.llen(messages_key) if self.redis_client.exists(messages_key) else 0
+            
+            # 删除对话数据
+            self.redis_client.delete(meta_key)
+            self.redis_client.delete(messages_key)
+            
+            # 从用户对话列表中移除
+            if user_id:
+                user_key = f"user:{user_id}:conversations"
+                self.redis_client.lrem(user_key, 0, conversation_id)
+            
+            execution_time_ms = int((time.time() - start_time) * 1000)
+            
+            return {
+                "conversation_id": conversation_id,
+                "user_id": user_id,
+                "deleted_messages": deleted_messages,
+                "execution_time_ms": execution_time_ms,
+                "existed": True
+            }
+            
+        except Exception as e:
+            self.logger.error(f"删除对话失败: {str(e)}")
+            raise e
+
+    def clear_all_agent_data(self) -> Dict[str, Any]:
+        """
+        清空所有agent对话数据
+        
+        Returns:
+            删除结果统计
+        """
+        if not self.is_available():
+            raise Exception("Redis连接不可用")
+        
+        try:
+            start_time = time.time()
+            
+            # 扫描并删除所有相关键
+            meta_keys = self.redis_client.keys("conversation:*:meta")
+            messages_keys = self.redis_client.keys("conversation:*:messages")
+            user_keys = self.redis_client.keys("user:*:conversations")
+            
+            deleted_conversation_metas = len(meta_keys)
+            deleted_conversation_messages = len(messages_keys)
+            deleted_user_conversations = len(user_keys)
+            
+            # 批量删除
+            all_keys = meta_keys + messages_keys + user_keys
+            if all_keys:
+                self.redis_client.delete(*all_keys)
+            
+            total_keys_deleted = len(all_keys)
+            execution_time_ms = int((time.time() - start_time) * 1000)
+            
+            self.logger.warning(f"已清空所有agent对话数据,共删除 {total_keys_deleted} 个键")
+            
+            return {
+                "deleted_conversation_metas": deleted_conversation_metas,
+                "deleted_conversation_messages": deleted_conversation_messages,
+                "deleted_user_conversations": deleted_user_conversations,
+                "total_keys_deleted": total_keys_deleted,
+                "execution_time_ms": execution_time_ms
+            }
+            
+        except Exception as e:
+            self.logger.error(f"清空所有agent数据失败: {str(e)}")
+            raise e
     
     # ==================== 问答缓存管理方法 ====================
     

+ 7 - 6
common/result.py

@@ -58,6 +58,7 @@ def success_response(response_text=None, data=None, message=MessageTemplate.SUCC
     response_data = data or {}
     if response_text is not None:
         response_data["response"] = response_text
+    response_data["timestamp"] = datetime.now().isoformat()
     
     return {
         "code": code,
@@ -100,14 +101,14 @@ def error_response(response_text, error_type=None, message=MessageTemplate.PROCE
 
 # ===== Ask Agent API 专用响应方法 =====
 
-def agent_success_response(response_type, session_id=None, execution_path=None, 
+def agent_success_response(response_type, conversation_id=None, execution_path=None, 
                           classification_info=None, agent_version="langgraph_v1", **kwargs):
     """
     Ask Agent API 成功响应格式
     
     Args:
         response_type: 响应类型 ("DATABASE" 或 "CHAT")
-        session_id: 会话ID
+        conversation_id: 对话ID
         execution_path: 执行路径
         classification_info: 分类信息
         agent_version: Agent版本
@@ -118,7 +119,7 @@ def agent_success_response(response_type, session_id=None, execution_path=None,
     """
     data = {
         "type": response_type,
-        "session_id": session_id,
+        "conversation_id": conversation_id,
         "execution_path": execution_path or [],
         "classification_info": classification_info or {},
         "agent_version": agent_version,
@@ -138,7 +139,7 @@ def agent_success_response(response_type, session_id=None, execution_path=None,
     }
 
 def agent_error_response(response_text, error_type=None, message=MessageTemplate.PROCESSING_FAILED,
-                        code=500, session_id=None, execution_path=None, 
+                        code=500, conversation_id=None, execution_path=None, 
                         classification_info=None, agent_version="langgraph_v1", **kwargs):
     """
     Ask Agent API 错误响应格式
@@ -148,7 +149,7 @@ def agent_error_response(response_text, error_type=None, message=MessageTemplate
         error_type: 错误类型标识
         message: 高层级描述信息
         code: HTTP状态码
-        session_id: 会话ID
+        conversation_id: 对话ID
         execution_path: 执行路径
         classification_info: 分类信息
         agent_version: Agent版本
@@ -159,7 +160,7 @@ def agent_error_response(response_text, error_type=None, message=MessageTemplate
     """
     data = {
         "response": response_text,
-        "session_id": session_id,
+        "conversation_id": conversation_id,
         "execution_path": execution_path or [],
         "classification_info": classification_info or {},
         "agent_version": agent_version,

+ 103 - 128
common/session_aware_cache.py

@@ -1,164 +1,139 @@
-# 修正后的 custom_cache.py
+# 简化后的对话感知缓存
 from datetime import datetime
 from vanna.flask import MemoryCache
 import uuid
 
-class SessionAwareMemoryCache(MemoryCache):
-    """区分会话(Session)和对话(Conversation)的缓存实现"""
+class ConversationAwareMemoryCache(MemoryCache):
+    """基于对话ID的简单时间感知缓存实现"""
     
     def __init__(self):
         super().__init__()
-        self.conversation_start_times = {}  # 每个对话的开始时间
-        self.session_info = {}  # 会话信息: {session_id: {'start_time': datetime, 'conversations': []}}
-        self.conversation_to_session = {}  # 对话ID到会话ID的映射
+        self.conversation_start_times = {}  # 每个对话的开始时间: {conversation_id: datetime}
     
-    def create_or_get_session_id(self, user_identifier=None):
-        """
-        创建或获取会话ID
-        在实际应用中,这可以通过以下方式确定:
-        1. HTTP请求中的session cookie
-        2. JWT token中的session信息
-        3. 前端传递的session_id
-        4. IP地址 + User-Agent的组合
-        """
-        # 简化实现:使用时间窗口来判断是否为同一会话
-        # 实际应用中应该从HTTP请求中获取session信息
-        current_time = datetime.now()
+    def generate_id(self, question: str = None, user_id: str = None) -> str:
+        """生成对话ID并记录时间,格式为 {user_id}:YYYYMMDDHHMMSSsss"""
+        # 如果没有传递user_id,使用默认值
+        if not user_id:
+            user_id = "guest"
         
-        # 检查是否有近期的会话(比如30分钟内)
-        for session_id, session_data in self.session_info.items():
-            last_activity = session_data.get('last_activity', session_data['start_time'])
-            if (current_time - last_activity).total_seconds() < 1800:  # 30分钟内
-                # 更新最后活动时间
-                session_data['last_activity'] = current_time
-                return session_id
+        # 生成时间戳:年月日时分秒毫秒格式
+        now = datetime.now()
+        timestamp = now.strftime("%Y%m%d%H%M%S") + f"{now.microsecond // 1000:03d}"
         
-        # 创建新会话
-        new_session_id = str(uuid.uuid4())
-        self.session_info[new_session_id] = {
-            'start_time': current_time,
-            'last_activity': current_time,
-            'conversations': []
-        }
-        return new_session_id
-    
-    def generate_id(self, question: str = None, session_id: str = None) -> str:
-        """重载generate_id方法,关联会话和对话"""
-        conversation_id = super().generate_id(question=question)
-        
-        # 确定会话ID
-        if not session_id:
-            session_id = self.create_or_get_session_id()
+        # 生成对话ID:{user_id}:{timestamp}
+        conversation_id = f"{user_id}:{timestamp}"
         
         # 记录对话开始时间
-        conversation_start_time = datetime.now()
-        self.conversation_start_times[conversation_id] = conversation_start_time
-        
-        # 建立对话与会话的关联
-        self.conversation_to_session[conversation_id] = session_id
-        self.session_info[session_id]['conversations'].append(conversation_id)
-        self.session_info[session_id]['last_activity'] = conversation_start_time
+        self.conversation_start_times[conversation_id] = now
         
         return conversation_id
     
-    def set(self, id: str, field: str, value, session_id: str = None):
+    def set(self, id: str, field: str, value, **kwargs):
         """重载set方法,确保时间信息正确"""
         # 如果这是新对话,初始化时间信息
         if id not in self.conversation_start_times:
-            if not session_id:
-                session_id = self.create_or_get_session_id()
-            
-            conversation_start_time = datetime.now()
-            self.conversation_start_times[id] = conversation_start_time
-            self.conversation_to_session[id] = session_id
-            self.session_info[session_id]['conversations'].append(id)
-            self.session_info[session_id]['last_activity'] = conversation_start_time
+            self.conversation_start_times[id] = datetime.now()
         
         # 调用父类的set方法
         super().set(id=id, field=field, value=value)
         
-        # 设置时间相关字段
-        if field != 'conversation_start_time' and field != 'session_start_time':
-            # 设置对话开始时间
+        # 自动设置对话开始时间字段
+        if field != 'conversation_start_time':
             super().set(id=id, field='conversation_start_time', 
                        value=self.conversation_start_times[id])
-            
-            # 设置会话开始时间
-            session_id = self.conversation_to_session.get(id)
-            if session_id and session_id in self.session_info:
-                super().set(id=id, field='session_start_time', 
-                           value=self.session_info[session_id]['start_time'])
-                super().set(id=id, field='session_id', value=session_id)
     
     def get_conversation_start_time(self, conversation_id: str) -> datetime:
         """获取对话开始时间"""
         return self.conversation_start_times.get(conversation_id)
     
-    def get_session_start_time(self, conversation_id: str) -> datetime:
-        """获取会话开始时间"""
-        session_id = self.conversation_to_session.get(conversation_id)
-        if session_id and session_id in self.session_info:
-            return self.session_info[session_id]['start_time']
-        return None
-    
-    def get_session_info(self, session_id: str = None, conversation_id: str = None):
-        """获取会话信息"""
-        if conversation_id:
-            session_id = self.conversation_to_session.get(conversation_id)
-        
-        if session_id and session_id in self.session_info:
-            session_data = self.session_info[session_id].copy()
-            session_data['conversation_count'] = len(session_data['conversations'])
-            if session_data['conversations']:
-                # 计算会话持续时间
-                duration = datetime.now() - session_data['start_time']
-                session_data['session_duration_seconds'] = duration.total_seconds()
-                session_data['session_duration_formatted'] = str(duration)
-            return session_data
+    def get_conversation_info(self, conversation_id: str):
+        """获取对话信息"""
+        start_time = self.get_conversation_start_time(conversation_id)
+        if start_time:
+            duration = datetime.now() - start_time
+            
+            # 从conversation_id解析user_id
+            user_id = "unknown"
+            if ":" in conversation_id:
+                user_id = conversation_id.split(":")[0]
+            
+            return {
+                'conversation_id': conversation_id,
+                'user_id': user_id,
+                'start_time': start_time,
+                'duration_seconds': duration.total_seconds(),
+                'duration_formatted': str(duration)
+            }
         return None
     
-    def get_all_sessions(self):
-        """获取所有话信息"""
+    def get_all_conversations(self):
+        """获取所有话信息"""
         result = {}
-        for session_id, session_data in self.session_info.items():
-            session_info = session_data.copy()
-            session_info['conversation_count'] = len(session_data['conversations'])
-            if session_data['conversations']:
-                duration = datetime.now() - session_data['start_time']
-                session_info['session_duration_seconds'] = duration.total_seconds()
-                session_info['session_duration_formatted'] = str(duration)
-            result[session_id] = session_info
+        for conversation_id, start_time in self.conversation_start_times.items():
+            duration = datetime.now() - start_time
+            
+            # 从conversation_id解析user_id
+            user_id = "unknown"
+            if ":" in conversation_id:
+                user_id = conversation_id.split(":")[0]
+                
+            result[conversation_id] = {
+                'user_id': user_id,
+                'start_time': start_time,
+                'duration_seconds': duration.total_seconds(),
+                'duration_formatted': str(duration)
+            }
         return result
 
+    @staticmethod
+    def parse_conversation_id(conversation_id: str):
+        """解析conversation_id,返回user_id和timestamp"""
+        if ":" not in conversation_id:
+            return None, None
+        
+        parts = conversation_id.split(":", 1)
+        user_id = parts[0]
+        timestamp_str = parts[1]
+        
+        try:
+            # 解析时间戳:YYYYMMDDHHMMSSsss
+            if len(timestamp_str) == 17:  # 20250722204550155
+                timestamp = datetime.strptime(timestamp_str[:14], "%Y%m%d%H%M%S")
+                # 添加毫秒
+                milliseconds = int(timestamp_str[14:])
+                timestamp = timestamp.replace(microsecond=milliseconds * 1000)
+                return user_id, timestamp
+        except ValueError:
+            pass
+        
+        return user_id, None
 
-# 升级版:支持前端传递会话ID
-class WebSessionAwareMemoryCache(SessionAwareMemoryCache):
-    """支持从前端获取会话ID的版本"""
-    
-    def __init__(self):
-        super().__init__()
-        self.browser_sessions = {}  # browser_session_id -> our_session_id
-    
-    def register_browser_session(self, browser_session_id: str, user_info: dict = None):
-        """注册浏览器会话"""
-        if browser_session_id not in self.browser_sessions:
-            our_session_id = str(uuid.uuid4())
-            self.browser_sessions[browser_session_id] = our_session_id
-            
-            self.session_info[our_session_id] = {
-                'start_time': datetime.now(),
-                'last_activity': datetime.now(),
-                'conversations': [],
-                'browser_session_id': browser_session_id,
-                'user_info': user_info or {}
-            }
-        return self.browser_sessions[browser_session_id]
-    
-    def generate_id_with_browser_session(self, question: str = None, browser_session_id: str = None) -> str:
-        """使用浏览器会话ID生成对话ID"""
-        if browser_session_id:
-            our_session_id = self.register_browser_session(browser_session_id)
-        else:
-            our_session_id = self.create_or_get_session_id()
+    @staticmethod
+    def extract_user_id(conversation_id: str) -> str:
+        """从conversation_id中提取user_id"""
+        if ":" not in conversation_id:
+            return "unknown"
+        return conversation_id.split(":", 1)[0]
+
+    @staticmethod
+    def validate_user_id_consistency(conversation_id: str, provided_user_id: str) -> tuple[bool, str]:
+        """
+        校验conversation_id中的user_id与提供的user_id是否一致
+        
+        Returns:
+            tuple: (is_valid, error_message)
+        """
+        if not conversation_id or not provided_user_id:
+            return True, ""  # 如果任一为空,跳过校验
         
-        return super().generate_id(question=question, session_id=our_session_id)
+        extracted_user_id = ConversationAwareMemoryCache.extract_user_id(conversation_id)
+        
+        if extracted_user_id != provided_user_id:
+            return False, f"用户ID不匹配:conversation_id中的用户ID '{extracted_user_id}' 与提供的用户ID '{provided_user_id}' 不一致"
+        
+        return True, ""
+
+
+# 保持向后兼容的别名
+WebSessionAwareMemoryCache = ConversationAwareMemoryCache
+SessionAwareMemoryCache = ConversationAwareMemoryCache

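The simplified cache keys everything off a conversation_id of the form {user_id}:YYYYMMDDHHMMSSsss. A minimal usage sketch of the new class (the user id and question below are made up):

from common.session_aware_cache import ConversationAwareMemoryCache

cache = ConversationAwareMemoryCache()

# Produces something like "wang11:20250722204550155" and records the start time
conv_id = cache.generate_id(question="revenue for yesterday", user_id="wang11")

# Split the id back into user_id and a datetime
user_id, started_at = ConversationAwareMemoryCache.parse_conversation_id(conv_id)

# Reject requests whose user_id does not match the id prefix
ok, err = ConversationAwareMemoryCache.validate_user_id_consistency(conv_id, "wang11")
assert ok, err
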
+ 37 - 20
config/logging_config.yaml

@@ -15,11 +15,12 @@ default:
     enabled: true
     level: DEBUG
     filename: "app.log"
-    format: "%(asctime)s [%(levelname)s] [%(name)s] [user:%(user_id)s] [session:%(session_id)s] %(filename)s:%(lineno)d - %(message)s"
+    format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
     rotation:
       enabled: true
-      max_size: "50MB"
-      backup_count: 10
+      when: "midnight"
+      interval: 1
+      backup_count: 30
 
 # 模块特定配置
 modules:
@@ -33,15 +34,14 @@ modules:
       enabled: true
       level: DEBUG
       filename: "app.log"
-      format: "%(asctime)s [%(levelname)s] [%(name)s] [user:%(user_id)s] [session:%(session_id)s] %(filename)s:%(lineno)d - %(message)s"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
       rotation:
         enabled: true
-        max_size: "50MB"
-        backup_count: 10
+        when: "midnight"
+        interval: 1
+        backup_count: 30
   
   data_pipeline:
-    # 注意:data_pipeline的日志文件路径会在运行时动态设置到任务目录
-    # 这里的file配置主要用于格式和级别设置
     level: DEBUG
     console:
       enabled: true
@@ -50,15 +50,13 @@ modules:
     file:
       enabled: true
       level: DEBUG
-      # filename 将在运行时动态设置,不在这里指定
-      # filename: "data_pipeline.log"  # 移除固定路径
+      filename: "data_pipeline.log"
       format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
       rotation:
-        # 对于任务特定的日志,通常不需要rotation
-        # 但保留配置以防单个任务产生大量日志
-        enabled: false  # 禁用rotation,因为每个任务的日志是独立的
-        max_size: "10MB"    # 如果启用,限制为10MB
-        backup_count: 2     # 如果启用,只保留2个备份
+        enabled: true
+        when: "midnight"
+        interval: 1
+        backup_count: 30
   
   agent:
     level: DEBUG
@@ -70,11 +68,12 @@ modules:
       enabled: true
       level: DEBUG
       filename: "agent.log"
-      format: "%(asctime)s [%(levelname)s] [%(name)s] [user:%(user_id)s] [session:%(session_id)s] %(filename)s:%(lineno)d - %(message)s"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
       rotation:
         enabled: true
-        max_size: "30MB"
-        backup_count: 8
+        when: "midnight"
+        interval: 1
+        backup_count: 30
   
   vanna:
     level: DEBUG
@@ -89,5 +88,23 @@ modules:
       format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
       rotation:
         enabled: true
-        max_size: "20MB"
-        backup_count: 5 
+        when: "midnight"
+        interval: 1
+        backup_count: 30
+  
+  react_agent:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] ReactAgent: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "react_agent.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        when: "midnight"
+        interval: 1
+        backup_count: 30 

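The rotation block switches from size-based keys (max_size) to time-based ones (when / interval / backup_count): one log file per day, with 30 dated backups kept. A minimal sketch of how such a block maps onto the standard-library handler, assuming logs are written under a logs/ directory (this mirrors what core/logging/log_manager.py does further below):

import logging.handlers
import yaml

with open("config/logging_config.yaml", encoding="utf-8") as f:
    rotation = yaml.safe_load(f)["default"]["file"]["rotation"]

handler = logging.handlers.TimedRotatingFileHandler(
    "logs/app.log",
    when=rotation["when"],                 # "midnight": roll over at 00:00
    interval=rotation["interval"],         # every 1 day
    backupCount=rotation["backup_count"],  # keep 30 dated backups (app.log.2025-07-26, ...)
    encoding="utf-8",
)
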
+ 110 - 0
config/logging_config_backup_20250725_181936.yaml

@@ -0,0 +1,110 @@
+version: 1
+
+# 全局配置
+global:
+  base_level: INFO
+  
+# 默认配置(用于app.log)
+default:
+  level: INFO
+  console:
+    enabled: true
+    level: INFO
+    format: "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
+  file:
+    enabled: true
+    level: DEBUG
+    filename: "app.log"
+    format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+    rotation:
+      enabled: true
+      when: "midnight"
+      interval: 1
+      backup_count: 30
+
+# 模块特定配置
+modules:
+  app:
+    level: INFO
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "app.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        when: "midnight"
+        interval: 1
+        backup_count: 30
+  
+  data_pipeline:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] Pipeline: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "data_pipeline.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        when: "midnight"
+        interval: 1
+        backup_count: 30
+  
+  agent:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] Agent: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "agent.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        when: "H"
+        interval: 24
+        backup_count: 7
+  
+  vanna:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] Vanna: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "vanna.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        when: "midnight"
+        interval: 1
+        backup_count: 30
+  
+  react_agent:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] ReactAgent: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "react_agent.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        when: "midnight"
+        interval: 1
+        backup_count: 30 

+ 104 - 0
config/logging_config_windows.yaml

@@ -0,0 +1,104 @@
+version: 1
+
+# 全局配置
+global:
+  base_level: INFO
+  
+# 默认配置(用于app.log)
+default:
+  level: INFO
+  console:
+    enabled: true
+    level: INFO
+    format: "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
+  file:
+    enabled: true
+    level: DEBUG
+    filename: "app.log"
+    format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+    rotation:
+      enabled: true
+      max_size: "10MB"
+      backup_count: 5
+
+# 模块特定配置
+modules:
+  app:
+    level: INFO
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "app.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        max_size: "10MB"
+        backup_count: 5
+  
+  data_pipeline:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] Pipeline: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "data_pipeline.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        max_size: "10MB"
+        backup_count: 5
+  
+  agent:
+    level: DEBUG
+    console:
+      enabled: true
+      level: DEBUG
+      format: "%(asctime)s [%(levelname)s] Agent: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "agent.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        max_size: "10MB"
+        backup_count: 5
+  
+  vanna:
+    level: DEBUG
+    console:
+      enabled: true
+      level: DEBUG
+      format: "%(asctime)s [%(levelname)s] Vanna: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "vanna.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        max_size: "10MB"
+        backup_count: 5
+  
+  react_agent:
+    level: DEBUG
+    console:
+      enabled: true
+      level: INFO
+      format: "%(asctime)s [%(levelname)s] ReactAgent: %(message)s"
+    file:
+      enabled: true
+      level: DEBUG
+      filename: "react_agent.log"
+      format: "%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s"
+      rotation:
+        enabled: true
+        max_size: "10MB"
+        backup_count: 5 

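The Windows variant keeps size-based rotation (max_size / backup_count) instead of the timed rotation used in the other configs, presumably because TimedRotatingFileHandler renames the log file at rollover, which tends to fail on Windows while another process still holds the file open. A sketch of the handler this block maps to (path illustrative):

import logging.handlers

# 10 MB per file, keeping app.log.1 ... app.log.5
handler = logging.handlers.RotatingFileHandler(
    "logs/app.log",
    maxBytes=10 * 1024 * 1024,
    backupCount=5,
    encoding="utf-8",
)
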
+ 32 - 2
core/logging/__init__.py

@@ -1,11 +1,37 @@
 from .log_manager import LogManager
 import logging
+import platform
+import os
 
 # 全局日志管理器实例
 _log_manager = LogManager()
 
-def initialize_logging(config_path: str = "config/logging_config.yaml"):
-    """初始化项目日志系统"""
+def get_platform_specific_config_path() -> str:
+    """根据操作系统自动选择合适的日志配置文件"""
+    if platform.system() == "Windows":
+        config_path = "config/logging_config_windows.yaml"
+    else:
+        config_path = "config/logging_config.yaml"
+    
+    # 检查配置文件是否存在,如果不存在则回退到默认配置
+    if not os.path.exists(config_path):
+        fallback_path = "config/logging_config.yaml"
+        if os.path.exists(fallback_path):
+            return fallback_path
+        else:
+            raise FileNotFoundError(f"日志配置文件不存在: {config_path} 和 {fallback_path}")
+    
+    return config_path
+
+def initialize_logging(config_path: str = None):
+    """初始化项目日志系统
+    
+    Args:
+        config_path: 可选的配置文件路径。如果不提供,将根据操作系统自动选择
+    """
+    if config_path is None:
+        config_path = get_platform_specific_config_path()
+    
     _log_manager.initialize(config_path)
 
 def get_logger(name: str, module: str = "default") -> logging.Logger:
@@ -29,6 +55,10 @@ def get_app_logger(name: str) -> logging.Logger:
     """获取app模块logger"""
     return get_logger(name, "app")
 
+def get_react_agent_logger(name: str) -> logging.Logger:
+    """获取react_agent模块logger"""
+    return get_logger(name, "react_agent")
+
 # 上下文管理便捷方法
 def set_log_context(**kwargs):
     """设置日志上下文(可选)

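With the platform-aware default in place, callers can initialize logging without passing a path, and react_agent code gets its own convenience accessor. A minimal usage sketch (the logger name is illustrative):

from core.logging import initialize_logging, get_react_agent_logger

# Picks config/logging_config_windows.yaml on Windows and
# config/logging_config.yaml elsewhere, falling back to the latter if missing.
initialize_logging()

logger = get_react_agent_logger("ReactAgentAPI")
logger.info("react_agent messages now go to react_agent.log per the YAML config")
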
+ 17 - 7
core/logging/log_manager.py

@@ -180,15 +180,25 @@ class LogManager:
         """创建文件处理器(支持自动轮转)"""
         log_file = self.base_log_dir / file_config.get('filename', f'{module}.log')
         
-        # 使用RotatingFileHandler实现自动轮转和清理
+        # 使用RotatingFileHandler或TimedRotatingFileHandler实现自动轮转和清理
         rotation_config = file_config.get('rotation', {})
         if rotation_config.get('enabled', False):
-            handler = logging.handlers.RotatingFileHandler(
-                log_file,
-                maxBytes=self._parse_size(rotation_config.get('max_size', '50MB')),
-                backupCount=rotation_config.get('backup_count', 10),
-                encoding='utf-8'
-            )
+            # 检查是否配置了时间滚动
+            if 'when' in rotation_config:
+                handler = logging.handlers.TimedRotatingFileHandler(
+                    log_file,
+                    when=rotation_config.get('when', 'midnight'),
+                    interval=rotation_config.get('interval', 1),
+                    backupCount=rotation_config.get('backup_count', 30),
+                    encoding='utf-8'
+                )
+            else:
+                handler = logging.handlers.RotatingFileHandler(
+                    log_file,
+                    maxBytes=self._parse_size(rotation_config.get('max_size', '50MB')),
+                    backupCount=rotation_config.get('backup_count', 10),
+                    encoding='utf-8'
+                )
         else:
             handler = logging.FileHandler(log_file, encoding='utf-8')
         

+ 15 - 15
customllm/base_llm_chat.py

@@ -62,18 +62,18 @@ class BaseLLMChat(VannaBase, ABC):
         # 将Vanna的log输出转换为项目的日志格式
         if title == "SQL Prompt":
             # 对于SQL Prompt,使用debug级别,避免输出过长的内容
-            # 将列表格式转换为字符串,只显示前200个字符
+            # 将列表格式转换为字符串,只显示前500个字符
             if isinstance(message, list):
-                message_str = str(message)[:200] + "..." if len(str(message)) > 200 else str(message)
+                message_str = str(message)[:500] + "..." if len(str(message)) > 500 else str(message)
             else:
-                message_str = str(message)[:200] + "..." if len(str(message)) > 200 else str(message)
+                message_str = str(message)[:500] + "..." if len(str(message)) > 500 else str(message)
             self.logger.debug(f"[Vanna] {title}: {message_str}")
         elif title == "LLM Response":
             # 对于LLM响应,记录但不显示全部内容
             if isinstance(message, str):
-                message_str = message[:200] + "..." if len(message) > 200 else message
+                message_str = message[:500] + "..." if len(message) > 500 else message
             else:
-                message_str = str(message)[:200] + "..." if len(str(message)) > 200 else str(message)
+                message_str = str(message)[:500] + "..." if len(str(message)) > 500 else str(message)
             self.logger.debug(f"[Vanna] {title}: {message_str}")
         elif title == "Extracted SQL":
             # 对于提取的SQL,使用info级别
@@ -162,19 +162,19 @@ class BaseLLMChat(VannaBase, ABC):
 
         initial_prompt += self.prompt_loader.get_sql_response_guidelines(self.dialect)
 
-        message_log = [self.system_message(initial_prompt)]
+        sql_prompt_messages = [self.system_message(initial_prompt)]
 
         for example in question_sql_list:
             if example is None:
                 self.logger.warning("example is None")
             else:
                 if example is not None and "question" in example and "sql" in example:
-                    message_log.append(self.user_message(example["question"]))
-                    message_log.append(self.assistant_message(example["sql"]))
+                    sql_prompt_messages.append(self.user_message(example["question"]))
+                    sql_prompt_messages.append(self.assistant_message(example["sql"]))
 
-        message_log.append(self.user_message(question))
-        
-        return message_log
+        sql_prompt_messages.append(self.user_message(question))
+        # 实际发送给LLM的内容,当前做了格式化处理       
+        return sql_prompt_messages
 
     def generate_plotly_code(self, question: str = None, sql: str = None, df_metadata: str = None, **kwargs) -> str:
         """
@@ -190,13 +190,13 @@ class BaseLLMChat(VannaBase, ABC):
         # 构建用户消息
         user_msg = self.prompt_loader.get_chart_user_message()
 
-        message_log = [
+        chart_prompt_messages = [
             self.system_message(system_msg),
             self.user_message(user_msg),
         ]
 
         # 调用submit_prompt方法,并清理结果
-        plotly_code = self.submit_prompt(message_log, **kwargs)
+        plotly_code = self.submit_prompt(chart_prompt_messages, **kwargs)
         
         # 根据 DISPLAY_RESULT_THINKING 参数处理thinking内容
         if not DISPLAY_RESULT_THINKING:
@@ -485,12 +485,12 @@ class BaseLLMChat(VannaBase, ABC):
             # 构建用户消息,强调中文思考和回答
             user_content = self.prompt_loader.get_summary_user_instructions()
             
-            message_log = [
+            summary_prompt_messages = [
                 self.system_message(system_content),
                 self.user_message(user_content)
             ]
             
-            summary = self.submit_prompt(message_log, **kwargs)
+            summary = self.submit_prompt(summary_prompt_messages, **kwargs)
             
             # 检查是否需要隐藏 thinking 内容
             display_thinking = kwargs.get("display_result_thinking", DISPLAY_RESULT_THINKING)

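The prompt/response log truncation limit is raised from 200 to 500 characters, with the same expression repeated in four places. A small helper expressing that logic once, as a sketch rather than code that exists in the repository:

def truncate_for_log(message, limit: int = 500) -> str:
    """Render a prompt or response for logging, cut to `limit` characters."""
    text = message if isinstance(message, str) else str(message)
    return text[:limit] + "..." if len(text) > limit else text

# Equivalent to the branches above:
#   self.logger.debug(f"[Vanna] {title}: {truncate_for_log(message)}")
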
+ 5 - 3
custompgvector/pgvector.py

@@ -169,7 +169,8 @@ class PG_VectorStore(VannaBase):
 
         # 检查过滤后结果是否为空
         if results and not filtered_results:
-            self.logger.warning(f"向量查询找到了 {len(results)} 条SQL问答对,但全部被阈值过滤掉,问题: {question}")
+            self.logger.warning(f"向量查询找到了 {len(results)} 条SQL问答对,但全部被阈值过滤掉了.")
+           # self.logger.warning(f"问题: {question}")
 
         return filtered_results
 
@@ -662,14 +663,15 @@ class PG_VectorStore(VannaBase):
             
             # 检查原始查询结果是否为空
             if not results:
-                self.logger.warning(f"向量查询未找到任何相关的错误SQL示例,问题: {question}")
+                self.logger.warning(f"向量查询未找到任何相关的错误SQL示例")
 
             # 应用错误SQL特有的阈值过滤逻辑
             filtered_results = self._apply_error_sql_threshold_filter(results)
             
             # 检查过滤后结果是否为空
             if results and not filtered_results:
-                self.logger.warning(f"向量查询找到了 {len(results)} 条错误SQL示例,但全部被阈值过滤掉,问题: {question}")
+                self.logger.warning(f"向量查询找到了 {len(results)} 条错误SQL示例,但全部被阈值过滤掉.")
+                # self.logger.warning(f"问题: {question}")
 
             return filtered_results
             

+ 32 - 6
data_pipeline/api/simple_db_manager.py

@@ -754,8 +754,11 @@ class SimpleTaskManager:
             with open(log_file_path, 'r', encoding='utf-8') as f:
                 lines = f.readlines()
             
-            # 日志行格式: 2025-07-01 14:30:52 [INFO] SimpleWorkflowExecutor: 任务开始执行
-            log_pattern = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (.+?): (.+)$'
+            # 支持两种日志格式:
+            # 格式1: 2025-07-21 11:37:08 [INFO] TaskDir_task_20250721_113010: 任务开始执行
+            # 格式2: 2025-07-21 11:37:08 [INFO] [data_pipeline.TrainingDataLoader] run_training.py:367 - 处理DDL文件: 文件路径
+            log_pattern_1 = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] ([^:]+): (.+)$'
+            log_pattern_2 = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\[.+?\] [^:]+:\d+) - (.+)$'
             current_log = None
             line_number = 0
             
@@ -766,7 +769,8 @@ class SimpleTaskManager:
                 if not line.strip():
                     continue
                 
-                match = re.match(log_pattern, line)
+                # 先尝试格式2(带文件名行号的格式)
+                match = re.match(log_pattern_2, line)
                 if match:
                     # 如果有之前的日志,先保存
                     if current_log:
@@ -787,9 +791,31 @@ class SimpleTaskManager:
                         "line_number": line_number
                     }
                 else:
-                    # 多行日志(如异常堆栈),追加到当前日志的消息中
-                    if current_log:
-                        current_log["message"] += f"\n{line}"
+                    # 再尝试格式1(简单格式)
+                    match = re.match(log_pattern_1, line)
+                    if match:
+                        # 如果有之前的日志,先保存
+                        if current_log:
+                            logs.append(current_log)
+                        
+                        # 解析新的日志条目
+                        timestamp, level, logger_name, message = match.groups()
+                        
+                        # 尝试从日志记录器名称中提取步骤信息
+                        step_name = self._extract_step_from_logger(logger_name)
+                        
+                        current_log = {
+                            "timestamp": timestamp,
+                            "level": level,
+                            "logger": logger_name,
+                            "step": step_name,
+                            "message": message,
+                            "line_number": line_number
+                        }
+                    else:
+                        # 多行日志(如异常堆栈),追加到当前日志的消息中
+                        if current_log:
+                            current_log["message"] += f"\n{line}"
             
             # 保存最后一个日志条目
             if current_log:

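The parser now tries the richer format first and only falls back to the simple one. A quick check of the two patterns against sample lines (the messages are made up; the regexes are the ones added above):

import re

log_pattern_1 = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] ([^:]+): (.+)$'
log_pattern_2 = r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\[.+?\] [^:]+:\d+) - (.+)$'

simple = "2025-07-21 11:37:08 [INFO] TaskDir_task_20250721_113010: task started"
rich = "2025-07-21 11:37:08 [INFO] [data_pipeline.TrainingDataLoader] run_training.py:367 - processing DDL file"

assert re.match(log_pattern_2, rich)        # rich format matches pattern 2
assert not re.match(log_pattern_2, simple)  # simple format falls through...
assert re.match(log_pattern_1, simple)      # ...and is caught by pattern 1
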
+ 59 - 26
data_pipeline/api/simple_file_manager.py

@@ -315,24 +315,36 @@ class SimpleFileManager:
             if not content.strip():
                 raise ValueError("表清单文件为空")
             
-            # 简单验证:检查是否包含至少一个非空行
-            lines = [line.strip() for line in content.split('\n') if line.strip()]
-            if not lines:
+            # 解析表名,支持换行符和逗号分隔
+            all_tables = []
+            lines = content.split('\n')
+            
+            for line in lines:
+                line = line.strip()
+                # 跳过空行和注释行
+                if not line or line.startswith('#') or line.startswith('--'):
+                    continue
+                
+                # 如果行内包含逗号,按逗号分割;否则整行作为一个表名
+                if ',' in line:
+                    tables_in_line = [t.strip() for t in line.split(',') if t.strip()]
+                else:
+                    tables_in_line = [line]
+                
+                all_tables.extend(tables_in_line)
+            
+            if not all_tables:
                 raise ValueError("表清单文件不包含有效的表名")
             
-            # 可选:验证表名格式(避免SQL注入等安全问题)
+            # 验证表名格式(避免SQL注入等安全问题)
             import re
             table_name_pattern = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*(\.[a-zA-Z_][a-zA-Z0-9_]*)?$')
             invalid_tables = []
             
-            for line in lines[:10]:  # 只检查前10行以避免过度验证
-                # 忽略注释行
-                if line.startswith('#') or line.startswith('--'):
-                    continue
-                
-                # 检查表名格式
-                if not table_name_pattern.match(line):
-                    invalid_tables.append(line)
+            # 只检查前10个表名以避免过度验证
+            for table_name in all_tables[:10]:
+                if not table_name_pattern.match(table_name):
+                    invalid_tables.append(table_name)
             
             if invalid_tables:
                 raise ValueError(f"表清单文件包含无效的表名格式: {', '.join(invalid_tables[:3])}")
@@ -373,11 +385,11 @@ class SimpleFileManager:
                         "error": f"无法解码文件内容,请确保文件编码为 {encoding}"
                     }
             
-            # 分析文件内容
+            # 分析文件内容,支持换行符和逗号分隔
             lines = content.splitlines()
             total_lines = len(lines)
             
-            # 过滤空行和注释行
+            # 过滤空行和注释行,解析表名
             valid_lines = []
             comment_lines = 0
             empty_lines = 0
@@ -389,16 +401,23 @@ class SimpleFileManager:
                 elif stripped.startswith('#'):
                     comment_lines += 1
                 else:
-                    # 简单验证表名格式
-                    if self._is_valid_table_name(stripped):
-                        valid_lines.append(stripped)
+                    # 如果行内包含逗号,按逗号分割;否则整行作为一个表名
+                    if ',' in stripped:
+                        tables_in_line = [t.strip() for t in stripped.split(',') if t.strip()]
                     else:
-                        return {
-                            "valid": False,
-                            "error": f"第 {line_num} 行包含无效的表名: {stripped}",
-                            "details": {
-                                "line_number": line_num,
-                                "invalid_content": stripped
+                        tables_in_line = [stripped]
+                    
+                    # 验证每个表名格式
+                    for table_name in tables_in_line:
+                        if self._is_valid_table_name(table_name):
+                            valid_lines.append(table_name)
+                        else:
+                            return {
+                                "valid": False,
+                                "error": f"第 {line_num} 行包含无效的表名: {table_name}",
+                                "details": {
+                                    "line_number": line_num,
+                                    "invalid_content": table_name
                             }
                         }
             
@@ -486,13 +505,26 @@ class SimpleFileManager:
             
             file_stat = file_path.stat()
             
-            # 尝试读取文件内容进行分析
+            # 尝试读取文件内容进行分析,支持换行符和逗号分隔
             try:
                 with open(file_path, 'r', encoding='utf-8') as f:
                     content = f.read()
                     lines = content.splitlines()
-                    valid_tables = [line.strip() for line in lines 
-                                   if line.strip() and not line.strip().startswith('#')]
+                    valid_tables = []
+                    
+                    for line in lines:
+                        line = line.strip()
+                        # 跳过空行和注释行
+                        if not line or line.startswith('#') or line.startswith('--'):
+                            continue
+                        
+                        # 如果行内包含逗号,按逗号分割;否则整行作为一个表名
+                        if ',' in line:
+                            tables_in_line = [t.strip() for t in line.split(',') if t.strip()]
+                        else:
+                            tables_in_line = [line]
+                        
+                        valid_tables.extend(tables_in_line)
             except Exception:
                 valid_tables = []
             
@@ -505,6 +537,7 @@ class SimpleFileManager:
                 "uploaded_at": datetime.fromtimestamp(file_stat.st_mtime).isoformat(),
                 "created_at": datetime.fromtimestamp(file_stat.st_ctime).isoformat(),
                 "table_count": len(valid_tables),
+                "table_names": valid_tables,  # 新增:返回解析出的表名列表
                 "is_readable": os.access(file_path, os.R_OK)
             }
             

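All three code paths above now accept table names separated by newlines, by commas, or a mix, with # and -- comment lines skipped. A standalone sketch of that parsing rule (the sample table names are made up):

def parse_table_list(content: str) -> list[str]:
    tables = []
    for line in content.split('\n'):
        line = line.strip()
        if not line or line.startswith('#') or line.startswith('--'):
            continue  # skip blank lines and comments
        if ',' in line:
            tables.extend(t.strip() for t in line.split(',') if t.strip())
        else:
            tables.append(line)
    return tables

sample = "# core tables\nbss_company, bss_branch\npublic.bss_service_area\n"
print(parse_table_list(sample))  # ['bss_company', 'bss_branch', 'public.bss_service_area']
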
+ 121 - 4
data_pipeline/api/simple_workflow.py

@@ -8,6 +8,7 @@ import asyncio
 import json
 import os
 import logging
+import shutil
 from datetime import datetime
 from pathlib import Path
 from typing import Dict, Any, Optional, List
@@ -22,16 +23,31 @@ from data_pipeline.dp_logging import get_logger
 class SimpleWorkflowExecutor:
     """简化的任务工作流执行器"""
     
-    def __init__(self, task_id: str):
+    def __init__(self, task_id: str, backup_vector_tables: bool = False, truncate_vector_tables: bool = False, skip_training: bool = False):
         """
         初始化工作流执行器
         
         Args:
             task_id: 任务ID
+            backup_vector_tables: 是否备份vector表数据
+            truncate_vector_tables: 是否清空vector表数据(自动启用备份)
+            skip_training: 是否跳过训练文件处理,仅执行Vector表管理
         """
         self.task_id = task_id
+        self.backup_vector_tables = backup_vector_tables
+        self.truncate_vector_tables = truncate_vector_tables
+        self.skip_training = skip_training
+        
+        # 参数逻辑:truncate自动启用backup
+        if self.truncate_vector_tables:
+            self.backup_vector_tables = True
+        
         self.logger = get_logger("SimpleWorkflowExecutor", task_id)
         
+        # 记录Vector表管理参数状态
+        if self.backup_vector_tables or self.truncate_vector_tables:
+            self.logger.info(f"🗂️ Vector表管理已启用: backup={self.backup_vector_tables}, truncate={self.truncate_vector_tables}")
+        
         # 初始化管理器
         self.task_manager = SimpleTaskManager()
         self.file_manager = SimpleFileManager()
@@ -135,6 +151,81 @@ class SimpleWorkflowExecutor:
             except Exception as e:
                 self.logger.error(f"记录任务目录日志失败: {e}")
     
+    def _backup_existing_files_if_needed(self):
+        """如果需要,备份现有文件(仅备份文件,不包括子目录)"""
+        try:
+            task_dir = self.file_manager.get_task_directory(self.task_id)
+            
+            # 严格检查:只允许保留指定文件
+            allowed_files = {"table_list.txt", "data_pipeline.log"}
+            
+            # 扫描任务目录中的文件(排除子目录和允许的文件)
+            files_to_backup = []
+            for item in task_dir.iterdir():
+                if item.is_file() and item.name not in allowed_files:
+                    files_to_backup.append(item)
+            
+            # 如果没有文件需要备份,直接返回
+            if not files_to_backup:
+                self._log_to_task_directory("INFO", "任务目录中没有需要备份的文件")
+                return
+            
+            # 创建备份目录
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            backup_dir_name = f"file_bak_{timestamp}"
+            backup_dir = task_dir / backup_dir_name
+            
+            # 处理备份目录名冲突
+            counter = 1
+            while backup_dir.exists():
+                backup_dir = task_dir / f"{backup_dir_name}_{counter}"
+                counter += 1
+            
+            backup_dir.mkdir(parents=True)
+            
+            # 移动文件到备份目录
+            moved_files = []
+            failed_files = []
+            
+            for file_path in files_to_backup:
+                try:
+                    target_path = backup_dir / file_path.name
+                    shutil.move(str(file_path), str(target_path))
+                    moved_files.append(file_path.name)
+                    self._log_to_task_directory("DEBUG", f"文件已备份: {file_path.name}")
+                except Exception as e:
+                    failed_files.append({"file": file_path.name, "error": str(e)})
+                    self._log_to_task_directory("WARNING", f"文件备份失败: {file_path.name} - {e}")
+            
+            # 生成备份记录文件
+            backup_info = {
+                "backup_time": datetime.now().isoformat(),
+                "backup_directory": backup_dir.name,
+                "moved_files": moved_files,
+                "failed_files": failed_files,
+                "task_id": self.task_id
+            }
+            
+            backup_info_file = backup_dir / "backup_info.json"
+            with open(backup_info_file, 'w', encoding='utf-8') as f:
+                json.dump(backup_info, f, ensure_ascii=False, indent=2)
+            
+            # 记录备份完成
+            self._log_to_task_directory("INFO", 
+                f"文件备份完成: {len(moved_files)} 个文件已移动到 {backup_dir.name}")
+            
+            # 如果有文件备份失败,中断作业
+            if failed_files:
+                error_msg = f"❌ 无法清理工作目录,以下文件移动失败: {[f['file'] for f in failed_files]}"
+                self._log_to_task_directory("ERROR", error_msg)
+                raise Exception(error_msg)
+        
+        except Exception as e:
+            # 备份失败必须中断作业
+            error_msg = f"❌ 文件备份过程失败,作业中断: {e}"
+            self._log_to_task_directory("ERROR", error_msg)
+            raise Exception(error_msg)
+    
     def _resolve_table_list_file_path(self) -> str:
         """解析表清单文件路径"""
         table_list_file = self.task_params['table_list_file']
@@ -183,7 +274,11 @@ class SimpleWorkflowExecutor:
             enable_sql_validation=self.task_params.get('enable_sql_validation', True),
             enable_llm_repair=self.task_params.get('enable_llm_repair', True),
             modify_original_file=self.task_params.get('modify_original_file', True),
-            enable_training_data_load=self.task_params.get('enable_training_data_load', True)
+            enable_training_data_load=self.task_params.get('enable_training_data_load', True),
+            # 新增:Vector表管理参数
+            backup_vector_tables=self.backup_vector_tables,
+            truncate_vector_tables=self.truncate_vector_tables,
+            skip_training=self.skip_training
         )
     
     @contextmanager
@@ -219,7 +314,10 @@ class SimpleWorkflowExecutor:
     async def execute_complete_workflow(self) -> Dict[str, Any]:
         """执行完整工作流"""
         try:
-            # 确保任务目录存在
+            # 🆕 新增:先备份现有文件(清理环境)
+            self._backup_existing_files_if_needed()
+            
+            # 确保任务目录存在并写入新配置
             if not self._ensure_task_directory():
                 raise Exception("无法创建任务目录")
             
@@ -314,6 +412,19 @@ class SimpleWorkflowExecutor:
     async def execute_single_step(self, step_name: str) -> Dict[str, Any]:
         """执行单个步骤"""
         try:
+            # 新增:非training_load步骤的Vector表管理参数警告
+            if step_name != 'training_load' and (self.backup_vector_tables or self.truncate_vector_tables or self.skip_training):
+                self.logger.warning(
+                    f"⚠️ Vector表管理参数仅在training_load步骤有效,当前步骤: {step_name},忽略参数"
+                )
+                # 临时禁用Vector表管理参数
+                temp_backup = self.backup_vector_tables
+                temp_truncate = self.truncate_vector_tables
+                temp_skip = self.skip_training
+                self.backup_vector_tables = False
+                self.truncate_vector_tables = False
+                self.skip_training = False
+            
             # 确保任务目录存在
             if not self._ensure_task_directory():
                 raise Exception("无法创建任务目录")
@@ -321,7 +432,7 @@ class SimpleWorkflowExecutor:
             # 更新任务状态
             self.task_manager.update_task_status(self.task_id, 'in_progress')
             
-            # 创建工作流编排器
+            # 创建工作流编排器(会根据当前参数状态创建)
             orchestrator = self._create_orchestrator()
             
             # 重定向SchemaWorkflowOrchestrator的日志到任务目录
@@ -352,6 +463,12 @@ class SimpleWorkflowExecutor:
                 # 写入步骤结果文件
                 self._write_step_result_file(step_name, result)
             
+            # 恢复原始参数状态(如果被临时修改)
+            if step_name != 'training_load' and 'temp_backup' in locals():
+                self.backup_vector_tables = temp_backup
+                self.truncate_vector_tables = temp_truncate
+                self.skip_training = temp_skip
+            
             # 检查是否所有步骤都已完成
             self._update_overall_task_status()
             

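Two behaviours above are easy to miss: truncate_vector_tables silently forces backup_vector_tables on, and before a full run everything in the task directory except table_list.txt and data_pipeline.log is moved into a timestamped file_bak_* folder. A minimal sketch of the parameter side (the task id is illustrative, and constructing the executor assumes a reachable task database):

from data_pipeline.api.simple_workflow import SimpleWorkflowExecutor

executor = SimpleWorkflowExecutor(
    task_id="task_20250722_010318",   # illustrative
    truncate_vector_tables=True,      # implies backup_vector_tables=True
)
assert executor.backup_vector_tables is True

# execute_complete_workflow() first calls _backup_existing_files_if_needed(),
# which moves old outputs into <task_dir>/file_bak_YYYYMMDD_HHMMSS/ and aborts
# the job if any file cannot be moved.
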
+ 551 - 0
data_pipeline/api/vector_restore_manager.py

@@ -0,0 +1,551 @@
+"""
+Vector表恢复管理器
+
+提供pgvector表备份文件扫描和数据恢复功能,与VectorTableManager形成完整的备份恢复解决方案
+"""
+
+import os
+import re
+import time
+import glob
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, Any, List, Optional
+import psycopg2
+import logging
+
+
+class VectorRestoreManager:
+    """Vector表恢复管理器 - 仿照VectorTableManager设计"""
+    
+    def __init__(self, base_output_dir: str = None):
+        """
+        初始化恢复管理器,复用现有配置机制
+        
+        Args:
+            base_output_dir: 基础输出目录,默认从data_pipeline.config获取
+        """
+        if base_output_dir is None:
+            # 从配置文件获取默认目录
+            from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+            base_output_dir = SCHEMA_TOOLS_CONFIG.get("output_directory", "./data_pipeline/training_data/")
+        
+        self.base_output_dir = Path(base_output_dir)
+        
+        # 从data_pipeline.config获取配置
+        from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+        self.config = SCHEMA_TOOLS_CONFIG.get("vector_table_management", {})
+        
+        # 初始化日志
+        self.logger = logging.getLogger("VectorRestoreManager")
+        
+        # 支持的表名
+        self.supported_tables = self.config.get("supported_tables", [
+            "langchain_pg_collection",
+            "langchain_pg_embedding"
+        ])
+    
+    def scan_backup_files(self, global_only: bool = False, task_id: str = None) -> Dict[str, Any]:
+        """
+        扫描可用的备份文件
+        
+        Args:
+            global_only: 仅查询全局备份目录(training_data/vector_bak/)
+            task_id: 指定task_id,仅查询该任务下的备份文件
+            
+        Returns:
+            包含备份文件信息的字典
+        """
+        scan_start_time = datetime.now()
+        backup_locations = []
+        
+        try:
+            # 确定扫描范围
+            if task_id:
+                # 仅扫描指定任务
+                directories_to_scan = [self.base_output_dir / task_id / "vector_bak"]
+            elif global_only:
+                # 仅扫描全局目录
+                directories_to_scan = [self.base_output_dir / "vector_bak"]
+            else:
+                # 扫描所有目录
+                directories_to_scan = self._get_all_vector_bak_directories()
+            
+            # 扫描每个目录
+            for backup_dir in directories_to_scan:
+                if not backup_dir.exists():
+                    continue
+                    
+                # 查找有效的备份集
+                backup_sets = self._find_backup_sets(backup_dir)
+                if not backup_sets:
+                    continue
+                
+                # 构建备份位置信息
+                location_info = self._build_location_info(backup_dir, backup_sets)
+                if location_info:
+                    backup_locations.append(location_info)
+            
+            # 构建汇总信息
+            summary = self._build_summary(backup_locations, scan_start_time)
+            
+            return {
+                "backup_locations": backup_locations,
+                "summary": summary
+            }
+            
+        except Exception as e:
+            self.logger.error(f"扫描备份文件失败: {e}")
+            raise
+    
+    def restore_from_backup(self, backup_path: str, timestamp: str, 
+                          tables: List[str] = None, db_connection: str = None,
+                          truncate_before_restore: bool = False) -> Dict[str, Any]:
+        """
+        从备份文件恢复数据
+        
+        Args:
+            backup_path: 备份文件所在的目录路径(相对路径)
+            timestamp: 备份文件的时间戳
+            tables: 要恢复的表名列表,None表示恢复所有表
+            db_connection: PostgreSQL连接字符串,None则从config获取
+            truncate_before_restore: 恢复前是否清空目标表
+            
+        Returns:
+            恢复操作的详细结果
+        """
+        start_time = time.time()
+        
+        # 设置默认表列表
+        if tables is None:
+            tables = self.supported_tables.copy()
+        
+        # 验证表名
+        invalid_tables = [t for t in tables if t not in self.supported_tables]
+        if invalid_tables:
+            raise ValueError(f"不支持的表名: {invalid_tables}")
+        
+        # 解析备份路径
+        backup_dir = Path(backup_path)
+        if not backup_dir.is_absolute():
+            # 相对路径,相对于项目根目录
+            project_root = Path(__file__).parent.parent.parent
+            backup_dir = project_root / backup_path
+        
+        if not backup_dir.exists():
+            raise FileNotFoundError(f"备份目录不存在: {backup_path}")
+        
+        # 验证备份文件存在
+        missing_files = []
+        backup_files = {}
+        for table_name in tables:
+            csv_file = backup_dir / f"{table_name}_{timestamp}.csv"
+            if not csv_file.exists():
+                missing_files.append(csv_file.name)
+            else:
+                backup_files[table_name] = csv_file
+        
+        if missing_files:
+            raise FileNotFoundError(f"备份文件不存在: {', '.join(missing_files)}")
+        
+        # 初始化结果
+        result = {
+            "restore_performed": True,
+            "truncate_performed": truncate_before_restore,
+            "backup_info": {
+                "backup_path": backup_path,
+                "timestamp": timestamp,
+                "backup_date": self._parse_timestamp_to_date(timestamp)
+            },
+            "truncate_results": {},
+            "restore_results": {},
+            "errors": [],
+            "duration": 0
+        }
+        
+        # 临时修改数据库连接配置
+        original_config = None
+        if db_connection:
+            from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+            original_config = SCHEMA_TOOLS_CONFIG.get("default_db_connection")
+            SCHEMA_TOOLS_CONFIG["default_db_connection"] = db_connection
+        
+        try:
+            # 执行清空操作(如果需要)
+            if truncate_before_restore:
+                self.logger.info("🗑️ 开始清空目标表...")
+                for table_name in tables:
+                    truncate_result = self._truncate_table(table_name)
+                    result["truncate_results"][table_name] = truncate_result
+                    if not truncate_result.get("success", False):
+                        result["errors"].append(f"{table_name}表清空失败")
+            
+            # 执行恢复操作
+            self.logger.info("📥 开始恢复表数据...")
+            for table_name in tables:
+                csv_file = backup_files[table_name]
+                restore_result = self._restore_table_from_csv(table_name, csv_file)
+                result["restore_results"][table_name] = restore_result
+                if not restore_result.get("success", False):
+                    result["errors"].append(f"{table_name}表恢复失败")
+            
+            # 计算总耗时
+            result["duration"] = time.time() - start_time
+            
+            # 记录最终状态
+            if result["errors"]:
+                self.logger.warning(f"⚠️ Vector表恢复完成,但有错误: {'; '.join(result['errors'])}")
+            else:
+                self.logger.info(f"✅ Vector表恢复完成,耗时: {result['duration']:.2f}秒")
+            
+            return result
+            
+        finally:
+            # 恢复原始配置
+            if original_config is not None:
+                SCHEMA_TOOLS_CONFIG["default_db_connection"] = original_config
+    
+    def get_connection(self):
+        """获取数据库连接 - 完全复用VectorTableManager的连接逻辑"""
+        try:
+            # 方法1:如果SCHEMA_TOOLS_CONFIG中有连接字符串,直接使用
+            from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+            connection_string = SCHEMA_TOOLS_CONFIG.get("default_db_connection")
+            if connection_string:
+                conn = psycopg2.connect(connection_string)
+            else:
+                # 方法2:从app_config获取pgvector数据库配置
+                import app_config
+                pgvector_config = app_config.PGVECTOR_CONFIG
+                conn = psycopg2.connect(
+                    host=pgvector_config.get('host'),
+                    port=pgvector_config.get('port'),
+                    database=pgvector_config.get('dbname'),
+                    user=pgvector_config.get('user'),
+                    password=pgvector_config.get('password')
+                )
+            
+            # 设置自动提交
+            conn.autocommit = True
+            return conn
+            
+        except Exception as e:
+            self.logger.error(f"pgvector数据库连接失败: {e}")
+            raise
+    
+    def _get_all_vector_bak_directories(self) -> List[Path]:
+        """获取所有vector_bak目录"""
+        directories = []
+        
+        # 全局备份目录
+        global_backup_dir = self.base_output_dir / "vector_bak"
+        if global_backup_dir.exists():
+            directories.append(global_backup_dir)
+        
+        # 任务备份目录 (task_* 和 manual_*)
+        for pattern in ["task_*", "manual_*"]:
+            for task_dir in self.base_output_dir.glob(pattern):
+                if task_dir.is_dir():
+                    vector_bak_dir = task_dir / "vector_bak"
+                    if vector_bak_dir.exists():
+                        directories.append(vector_bak_dir)
+        
+        return directories
+    
+    def _find_backup_sets(self, backup_dir: Path) -> List[str]:
+        """查找备份目录中的有效备份集"""
+        # 查找所有CSV文件
+        collection_files = list(backup_dir.glob("langchain_pg_collection_*.csv"))
+        embedding_files = list(backup_dir.glob("langchain_pg_embedding_*.csv"))
+        
+        # 提取时间戳
+        collection_timestamps = set()
+        embedding_timestamps = set()
+        
+        for file in collection_files:
+            timestamp = self._extract_timestamp_from_filename(file.name)
+            if timestamp:
+                collection_timestamps.add(timestamp)
+        
+        for file in embedding_files:
+            timestamp = self._extract_timestamp_from_filename(file.name)
+            if timestamp:
+                embedding_timestamps.add(timestamp)
+        
+        # 找到同时存在两个文件的时间戳
+        valid_timestamps = collection_timestamps & embedding_timestamps
+        
+        # 按时间戳降序排列(最新的在前)
+        return sorted(valid_timestamps, reverse=True)
+    
+    def _extract_timestamp_from_filename(self, filename: str) -> Optional[str]:
+        """从文件名中提取时间戳"""
+        # 匹配格式:langchain_pg_collection_20250722_010318.csv
+        pattern = r'langchain_pg_(?:collection|embedding)_(\d{8}_\d{6})\.csv'
+        match = re.search(pattern, filename)
+        return match.group(1) if match else None
+    
+    def _build_location_info(self, backup_dir: Path, backup_sets: List[str]) -> Optional[Dict[str, Any]]:
+        """构建备份位置信息"""
+        if not backup_sets:
+            return None
+        
+        # 确定位置类型和相关信息
+        relative_path = self._get_relative_path(backup_dir)
+        location_type, task_id = self._determine_location_type(backup_dir)
+        
+        # 构建备份信息列表
+        backups = []
+        for timestamp in backup_sets:
+            backup_info = self._build_backup_info(backup_dir, timestamp)
+            if backup_info:
+                backups.append(backup_info)
+        
+        location_info = {
+            "type": location_type,
+            "relative_path": relative_path,
+            "backups": backups
+        }
+        
+        if task_id:
+            location_info["task_id"] = task_id
+        
+        return location_info
+    
+    def _get_relative_path(self, backup_dir: Path) -> str:
+        """获取相对路径(Unix风格)"""
+        try:
+            # 计算相对于项目根目录的路径
+            project_root = Path(__file__).parent.parent.parent
+            relative_path = backup_dir.relative_to(project_root)
+            # 转换为Unix风格路径
+            return "./" + str(relative_path).replace("\\", "/")
+        except ValueError:
+            # 如果无法计算相对路径,直接转换
+            return str(backup_dir).replace("\\", "/")
+    
+    def _determine_location_type(self, backup_dir: Path) -> tuple:
+        """确定位置类型和task_id"""
+        backup_dir_str = str(backup_dir)
+        
+        if "/vector_bak" in backup_dir_str.replace("\\", "/"):
+            parent = backup_dir.parent.name
+            if parent.startswith(("task_", "manual_")):
+                return "task", parent
+            else:
+                return "global", None
+        
+        return "unknown", None
+    
+    def _build_backup_info(self, backup_dir: Path, timestamp: str) -> Optional[Dict[str, Any]]:
+        """构建单个备份信息"""
+        try:
+            collection_file = backup_dir / f"langchain_pg_collection_{timestamp}.csv"
+            embedding_file = backup_dir / f"langchain_pg_embedding_{timestamp}.csv"
+            log_file = backup_dir / "vector_backup_log.txt"
+            
+            # 检查文件存在性
+            if not (collection_file.exists() and embedding_file.exists()):
+                return None
+            
+            # 获取文件大小
+            collection_size = self._format_file_size(collection_file.stat().st_size)
+            embedding_size = self._format_file_size(embedding_file.stat().st_size)
+            
+            # 解析备份日期
+            backup_date = self._parse_timestamp_to_date(timestamp)
+            
+            return {
+                "timestamp": timestamp,
+                "collection_file": collection_file.name,
+                "embedding_file": embedding_file.name,
+                "collection_size": collection_size,
+                "embedding_size": embedding_size,
+                "backup_date": backup_date,
+                "has_log": log_file.exists(),
+                "log_file": log_file.name if log_file.exists() else None
+            }
+            
+        except Exception as e:
+            self.logger.warning(f"构建备份信息失败: {e}")
+            return None
+    
+    def _parse_timestamp_to_date(self, timestamp: str) -> str:
+        """将时间戳转换为可读日期格式"""
+        try:
+            # 解析格式:20250722_010318
+            dt = datetime.strptime(timestamp, "%Y%m%d_%H%M%S")
+            return dt.strftime("%Y-%m-%d %H:%M:%S")
+        except Exception:
+            return timestamp
+    
+    def _build_summary(self, backup_locations: List[Dict], scan_start_time: datetime) -> Dict[str, Any]:
+        """构建汇总信息"""
+        total_backup_sets = sum(len(loc["backups"]) for loc in backup_locations)
+        global_backups = sum(len(loc["backups"]) for loc in backup_locations if loc["type"] == "global")
+        task_backups = total_backup_sets - global_backups
+        
+        return {
+            "total_locations": len(backup_locations),
+            "total_backup_sets": total_backup_sets,
+            "global_backups": global_backups,
+            "task_backups": task_backups,
+            "scan_time": scan_start_time.isoformat()
+        }
+    
+    def _restore_table_from_csv(self, table_name: str, csv_file: Path) -> Dict[str, Any]:
+        """从CSV文件恢复单个表 - 使用COPY FROM STDIN"""
+        try:
+            start_time = time.time()
+            
+            with self.get_connection() as conn:
+                with conn.cursor() as cursor:
+                    # 检查是否是embedding表,需要特殊处理JSON格式
+                    if table_name == "langchain_pg_embedding":
+                        self._restore_embedding_table_with_json_fix(cursor, csv_file)
+                    else:
+                        # 其他表直接使用COPY FROM STDIN
+                        with open(csv_file, 'r', encoding='utf-8') as f:
+                            # 使用CSV HEADER选项自动跳过表头,无需手动next(f)
+                            cursor.copy_expert(
+                                f"COPY {table_name} FROM STDIN WITH (FORMAT CSV, HEADER)",
+                                f
+                            )
+                    
+                    # 验证导入结果
+                    cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
+                    rows_restored = cursor.fetchone()[0]
+            
+            duration = time.time() - start_time
+            file_size = csv_file.stat().st_size
+            
+            return {
+                "success": True,
+                "source_file": csv_file.name,
+                "rows_restored": rows_restored,
+                "file_size": self._format_file_size(file_size),
+                "duration": duration
+            }
+            
+        except Exception as e:
+            return {
+                "success": False,
+                "source_file": csv_file.name,
+                "error": str(e)
+            }
+    
+    def _truncate_table(self, table_name: str) -> Dict[str, Any]:
+        """清空指定表"""
+        try:
+            start_time = time.time()
+            
+            with self.get_connection() as conn:
+                with conn.cursor() as cursor:
+                    # 获取清空前的行数
+                    cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
+                    rows_before = cursor.fetchone()[0]
+                    
+                    # 执行TRUNCATE
+                    cursor.execute(f"TRUNCATE TABLE {table_name}")
+                    
+                    # 验证清空结果
+                    cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
+                    rows_after = cursor.fetchone()[0]
+            
+            duration = time.time() - start_time
+            
+            if rows_after == 0:
+                return {
+                    "success": True,
+                    "rows_before": rows_before,
+                    "rows_after": rows_after,
+                    "duration": duration
+                }
+            else:
+                raise Exception(f"清空失败,表中仍有 {rows_after} 行数据")
+                
+        except Exception as e:
+            return {
+                "success": False,
+                "error": str(e)
+            }
+    
+    def _format_file_size(self, size_bytes: int) -> str:
+        """格式化文件大小显示"""
+        if size_bytes == 0:
+            return "0 B"
+        
+        size_names = ["B", "KB", "MB", "GB"]
+        i = 0
+        size = float(size_bytes)
+        
+        while size >= 1024.0 and i < len(size_names) - 1:
+            size /= 1024.0
+            i += 1
+        
+        return f"{size:.1f} {size_names[i]}" 
+    
+    def _restore_embedding_table_with_json_fix(self, cursor, csv_file: Path):
+        """恢复embedding表,修复cmetadata列的JSON格式问题"""
+        import csv
+        import json
+        import ast
+        import io
+        
+        # 读取CSV并修复JSON格式
+        corrected_data = io.StringIO()
+        
+        with open(csv_file, 'r', encoding='utf-8') as f:
+            reader = csv.reader(f)
+            writer = csv.writer(corrected_data)
+            
+            # 处理表头
+            header = next(reader)
+            writer.writerow(header)
+            
+            # 找到cmetadata列的索引
+            try:
+                cmetadata_index = header.index('cmetadata')
+            except ValueError:
+                # 如果没有cmetadata列,直接使用原始CSV
+                corrected_data.seek(0)
+                corrected_data.truncate(0)
+                f.seek(0)
+                corrected_data.write(f.read())
+                corrected_data.seek(0)
+                cursor.copy_expert(
+                    "COPY langchain_pg_embedding FROM STDIN WITH (FORMAT CSV, HEADER)",
+                    corrected_data
+                )
+                return
+            
+            # 处理数据行
+            for row in reader:
+                if len(row) > cmetadata_index and row[cmetadata_index]:
+                    try:
+                        # 尝试将Python字典格式转换为JSON格式
+                        # 如果已经是JSON格式,json.loads会成功
+                        if row[cmetadata_index].startswith('{') and row[cmetadata_index].endswith('}'):
+                            try:
+                                # 先尝试作为JSON解析
+                                json.loads(row[cmetadata_index])
+                                # 已经是有效JSON,不需要转换
+                            except json.JSONDecodeError:
+                                # 不是有效JSON,尝试作为Python字典解析并转换
+                                try:
+                                    python_dict = ast.literal_eval(row[cmetadata_index])
+                                    row[cmetadata_index] = json.dumps(python_dict, ensure_ascii=False)
+                                except (ValueError, SyntaxError):
+                                    # 如果都失败了,记录错误但继续处理
+                                    self.logger.warning(f"无法解析cmetadata: {row[cmetadata_index]}")
+                    except Exception as e:
+                        self.logger.warning(f"处理cmetadata时出错: {e}")
+                
+                writer.writerow(row)
+        
+        # 使用修复后的数据进行导入
+        corrected_data.seek(0)
+        cursor.copy_expert(
+            "COPY langchain_pg_embedding FROM STDIN WITH (FORMAT CSV, HEADER)",
+            corrected_data
+        )

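下面用一个独立的小片段说明 _restore_embedding_table_with_json_fix 中对 cmetadata 列的修复思路:CSV 里若存的是 Python 字典字面量(单引号),先尝试按 JSON 解析,失败则用 ast.literal_eval 解析后再转成合法 JSON。此片段仅为示意,不属于仓库代码:

import ast
import json

def normalize_cmetadata(raw: str) -> str:
    """若raw已是合法JSON则原样返回,否则按Python字典字面量解析后转为JSON"""
    try:
        json.loads(raw)
        return raw
    except json.JSONDecodeError:
        python_dict = ast.literal_eval(raw)  # 例如 "{'source': 'bss_company.ddl'}"
        return json.dumps(python_dict, ensure_ascii=False)

print(normalize_cmetadata("{'source': 'bss_company.ddl'}"))
# 输出: {"source": "bss_company.ddl"}
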
+ 16 - 1
data_pipeline/config.py

@@ -133,6 +133,21 @@ SCHEMA_TOOLS_CONFIG = {
         "max_lines": 1000,                   # 最大行数限制
         "encoding": "utf-8",                 # 文件编码
         "allow_overwrite": True,             # 是否允许覆盖已存在的文件
+    },
+    
+    # Vector表管理配置
+    "vector_table_management": {
+        "backup_enabled": True,
+        "backup_directory": "vector_bak",
+        "supported_tables": [
+            "langchain_pg_collection",
+            "langchain_pg_embedding"
+        ],
+        "truncate_tables": [
+            "langchain_pg_embedding"  # 只清空embedding表
+        ],
+        "timestamp_format": "%Y%m%d_%H%M%S",
+        "backup_temp_suffix": ".tmp"
     }
 }
 
@@ -184,4 +199,4 @@ try:
 except ValueError as e:
     # 在配置文件中使用stderr输出警告,避免依赖logging
     import sys
-    print(f"警告: {e}", file=sys.stderr)
+    sys.stderr.write(f"警告: {e}\n")

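上面新增的 vector_table_management 配置段可以在 Vector 表管理逻辑中直接读取。以下读取方式仅为示意(键名与默认值以上面的 diff 为准):

from data_pipeline.config import SCHEMA_TOOLS_CONFIG

vector_cfg = SCHEMA_TOOLS_CONFIG["vector_table_management"]

if vector_cfg["backup_enabled"]:
    backup_dir_name = vector_cfg["backup_directory"]      # "vector_bak"
    tables_to_backup = vector_cfg["supported_tables"]      # collection 与 embedding 两张表
    tables_to_truncate = vector_cfg["truncate_tables"]     # 只清空 langchain_pg_embedding
    timestamp_format = vector_cfg["timestamp_format"]      # "%Y%m%d_%H%M%S"
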
+ 273 - 0
data_pipeline/create_task_cli.py

@@ -0,0 +1,273 @@
+"""
+Data Pipeline 命令行任务创建工具
+
+专门用于手动创建任务,生成manual_前缀的task_id
+仅创建任务目录,不涉及数据库或配置文件
+"""
+
+import argparse
+import os
+import sys
+from datetime import datetime
+from pathlib import Path
+
+
+def generate_manual_task_id() -> str:
+    """生成手动任务ID,格式: manual_YYYYMMDD_HHMMSS"""
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    return f"manual_{timestamp}"
+
+
+def resolve_base_directory():
+    """解析基础输出目录"""
+    try:
+        from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+        base_dir = SCHEMA_TOOLS_CONFIG.get("output_directory", "./data_pipeline/training_data/")
+    except ImportError:
+        # 如果无法导入配置,使用默认路径
+        base_dir = "./data_pipeline/training_data/"
+    
+    # 处理相对路径
+    if not Path(base_dir).is_absolute():
+        # 相对于项目根目录解析
+        project_root = Path(__file__).parent.parent
+        base_dir = project_root / base_dir
+    
+    return Path(base_dir)
+
+
+def create_task_directory(task_id: str, logger) -> Path:
+    """创建任务目录"""
+    base_dir = resolve_base_directory()
+    task_dir = base_dir / task_id
+    
+    try:
+        task_dir.mkdir(parents=True, exist_ok=True)
+        logger.info(f"任务目录已创建: {task_dir}")
+        return task_dir
+    except Exception as e:
+        logger.error(f"创建任务目录失败: {e}")
+        raise
+
+
+def extract_db_name_from_connection(connection_string: str) -> str:
+    """从数据库连接字符串中提取数据库名称"""
+    try:
+        if '/' in connection_string:
+            db_name = connection_string.split('/')[-1]
+            if '?' in db_name:
+                db_name = db_name.split('?')[0]
+            return db_name if db_name else "database"
+        else:
+            return "database"
+    except Exception:
+        return "database"
+
+
+def setup_argument_parser():
+    """设置命令行参数解析器"""
+    parser = argparse.ArgumentParser(
+        description='Data Pipeline 任务创建工具 - 创建手动执行的训练任务',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+示例用法:
+  # 基本创建
+  python -m data_pipeline.create_task_cli --business-context "电商系统" --db-connection "postgresql://user:pass@localhost:5432/ecommerce_db"
+  
+  # 指定表清单文件
+  python -m data_pipeline.create_task_cli --table-list tables.txt --business-context "高速公路管理系统" --db-connection "postgresql://user:pass@localhost:5432/highway_db"
+  
+  # 指定任务名称
+  python -m data_pipeline.create_task_cli --task-name "电商数据训练" --business-context "电商系统" --db-connection "postgresql://user:pass@localhost:5432/ecommerce_db"
+
+创建成功后,可以使用返回的task_id进行分步执行:
+  python -m data_pipeline.ddl_generation.ddl_md_generator --task-id <task_id> --db-connection "..." --table-list tables.txt --business-context "..."
+        """
+    )
+    
+    # 必需参数
+    parser.add_argument(
+        '--business-context',
+        required=True,
+        help='业务上下文描述'
+    )
+    
+    parser.add_argument(
+        '--db-connection',
+        required=True,
+        help='数据库连接字符串 (postgresql://user:pass@host:port/dbname)'
+    )
+    
+    # 可选参数
+    parser.add_argument(
+        '--table-list',
+        help='表清单文件路径'
+    )
+    
+    parser.add_argument(
+        '--task-name',
+        help='任务名称'
+    )
+    
+    parser.add_argument(
+        '--db-name',
+        help='数据库名称(如果不提供,将从连接字符串中提取)'
+    )
+    
+    parser.add_argument(
+        '--verbose', '-v',
+        action='store_true',
+        help='启用详细输出和日志'
+    )
+    
+    return parser
+
+
+def print_usage_instructions(task_id: str, task_dir: Path, logger, **params):
+    """输出使用说明"""
+    # 总是向控制台输出结果,同时记录到日志
+    output_lines = [
+        "",
+        "=" * 60,
+        "🎉 任务创建成功!",
+        "=" * 60,
+        f"📋 任务ID: {task_id}",
+        f"📁 任务目录: {task_dir}"
+    ]
+    
+    if params.get('task_name'):
+        output_lines.append(f"🎯 任务名称: {params['task_name']}")
+    
+    if params.get('db_name'):
+        output_lines.append(f"🗄️  数据库: {params['db_name']}")
+    
+    output_lines.append(f"🏢 业务背景: {params['business_context']}")
+    
+    if params.get('table_list'):
+        output_lines.append(f"📋 表清单文件: {params['table_list']}")
+    
+    output_lines.extend([
+        "",
+        "💡 现在可以使用以下命令执行分步操作:",
+        "=" * 60
+    ])
+    
+    # 构建示例命令
+    db_conn = params['db_connection']
+    business_context = params['business_context']
+    table_list = params.get('table_list', 'tables.txt')
+    
+    command_lines = [
+        "# 步骤1: 生成DDL和MD文件",
+        f'python -m data_pipeline.ddl_generation.ddl_md_generator \\',
+        f'  --task-id {task_id} \\',
+        f'  --db-connection "{db_conn}" \\',
+        f'  --table-list {table_list} \\',
+        f'  --business-context "{business_context}"',
+        "",
+        "# 步骤2: 生成Question-SQL对",
+        f'python -m data_pipeline.qa_generation.qs_generator \\',
+        f'  --task-id {task_id} \\',
+        f'  --table-list {table_list} \\',
+        f'  --business-context "{business_context}"',
+        "",
+        "# 步骤3: 验证和修正SQL",
+        f'python -m data_pipeline.validators.sql_validate_cli \\',
+        f'  --task-id {task_id} \\',
+        f'  --db-connection "{db_conn}"',
+        "",
+        "# 步骤4: 训练数据加载",
+        f'python -m data_pipeline.trainer.run_training \\',
+        f'  --task-id {task_id}',
+        "",
+        "=" * 60
+    ]
+    
+    # 输出到控制台(总是显示)
+    for line in output_lines + command_lines:
+        print(line)
+    
+    # 记录到日志
+    logger.info("任务创建成功总结:")
+    for line in output_lines[2:]:  # 跳过装饰线
+        if line and not line.startswith("="):
+            logger.info(f"  {line}")
+    
+    logger.info("分步执行命令:")
+    for line in command_lines:
+        if line and not line.startswith("#") and line.strip():
+            logger.info(f"  {line}")
+
+
+def main():
+    """主入口函数"""
+    parser = setup_argument_parser()
+    args = parser.parse_args()
+    
+    # 生成任务ID
+    task_id = generate_manual_task_id()
+    
+    # 初始化统一日志服务
+    try:
+        from data_pipeline.dp_logging import get_logger
+        logger = get_logger("CreateTaskCLI", task_id)
+        logger.info(f"开始创建手动任务: {task_id}")
+    except ImportError:
+        # 如果无法导入统一日志服务,创建简单的logger
+        import logging
+        logger = logging.getLogger("CreateTaskCLI")
+        logger.setLevel(logging.INFO)
+        if not logger.handlers:
+            handler = logging.StreamHandler()
+            formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(name)s: %(message)s')
+            handler.setFormatter(formatter)
+            logger.addHandler(handler)
+        logger.warning("无法导入统一日志服务,使用简单日志")
+    
+    try:
+        logger.info(f"生成任务ID: {task_id}")
+        
+        # 提取数据库名称
+        db_name = args.db_name or extract_db_name_from_connection(args.db_connection)
+        logger.info(f"数据库名称: {db_name}")
+        
+        # 验证表清单文件(如果提供)
+        if args.table_list:
+            if not os.path.exists(args.table_list):
+                error_msg = f"表清单文件不存在: {args.table_list}"
+                logger.error(error_msg)
+                sys.exit(1)
+            else:
+                logger.info(f"表清单文件验证通过: {args.table_list}")
+        
+        # 创建任务目录
+        task_dir = create_task_directory(task_id, logger)
+        
+        logger.info(f"任务创建完成: {task_id}")
+        logger.info(f"参数信息: 业务背景='{args.business_context}', 数据库='{db_name}', 表清单='{args.table_list}'")
+        
+        # 输出使用说明
+        print_usage_instructions(
+            task_id=task_id,
+            task_dir=task_dir,
+            logger=logger,
+            task_name=args.task_name,
+            db_name=db_name,
+            business_context=args.business_context,
+            table_list=args.table_list,
+            db_connection=args.db_connection
+        )
+        
+        logger.info("任务创建工具执行完成")
+        sys.exit(0)
+        
+    except KeyboardInterrupt:
+        logger.warning("用户中断,程序退出")
+        sys.exit(130)
+    except Exception as e:
+        logger.error(f"任务创建失败: {e}", exc_info=args.verbose)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main() 

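create_task_cli 生成的任务ID为 manual_ 前缀加时间戳,数据库名取连接串最后一段并去掉查询参数。下面的小片段仅用于演示这两个约定(具体行为以上面的函数实现为准):

from datetime import datetime

# generate_manual_task_id() 产生形如 manual_20250720_130541 的ID
print("manual_" + datetime.now().strftime("%Y%m%d_%H%M%S"))

# extract_db_name_from_connection() 的提取逻辑示意
conn = "postgresql://user:pass@localhost:5432/highway_db?sslmode=disable"
print(conn.split('/')[-1].split('?')[0])  # highway_db
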
+ 33 - 2
data_pipeline/ddl_generation/ddl_md_generator.py

@@ -19,6 +19,9 @@ def setup_argument_parser():
   # 基本使用
   python -m data_pipeline.ddl_md_generator --db-connection "postgresql://user:pass@host:5432/db" --table-list tables.txt --business-context "电商系统"
   
+  # 使用task_id自动解析路径
+  python -m data_pipeline.ddl_md_generator --task-id manual_20250720_130541 --db-connection "..." --table-list tables.txt --business-context "电商系统"
+  
   # 指定输出目录
   python -m data_pipeline.ddl_md_generator --db-connection "..." --table-list tables.txt --business-context "电商系统" --output-dir ./data_pipeline/training_data/
   
@@ -38,6 +41,11 @@ def setup_argument_parser():
     )
     
     # 可选参数
+    parser.add_argument(
+        '--task-id',
+        help='任务ID,指定后将自动构建输出目录路径 (基础目录/task_id)'
+    )
+    
     parser.add_argument(
         '--table-list',
         help='表清单文件路径'
@@ -96,6 +104,29 @@ def setup_argument_parser():
     
     return parser
 
+def resolve_output_directory(args):
+    """解析输出目录路径"""
+    if args.output_dir:
+        # 用户明确指定了输出目录
+        return args.output_dir
+    elif args.task_id:
+        # 使用task_id构建输出目录
+        from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+        base_dir = SCHEMA_TOOLS_CONFIG.get("output_directory", "./data_pipeline/training_data/")
+        
+        # 处理相对路径
+        from pathlib import Path
+        if not Path(base_dir).is_absolute():
+            # 相对于项目根目录解析
+            project_root = Path(__file__).parent.parent.parent
+            base_dir = project_root / base_dir
+        
+        return str(Path(base_dir) / args.task_id)
+    else:
+        # 使用默认配置
+        from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+        return SCHEMA_TOOLS_CONFIG.get("output_directory", "./data_pipeline/training_data/")
+
 def load_config_with_overrides(args):
     """加载配置并应用命令行覆盖"""
     from data_pipeline.config import SCHEMA_TOOLS_CONFIG
@@ -103,8 +134,8 @@ def load_config_with_overrides(args):
     config = SCHEMA_TOOLS_CONFIG.copy()
     
     # 命令行参数覆盖配置
-    if args.output_dir:
-        config["output_directory"] = args.output_dir
+    output_dir = resolve_output_directory(args)
+    config["output_directory"] = output_dir
     
     if args.pipeline:
         config["default_pipeline"] = args.pipeline

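resolve_output_directory 的优先级为:显式 --output-dir 最高,其次按 --task-id 在基础目录下拼出子目录,最后才用配置默认值。以 --task-id 模式为例(路径仅为示意,实际基础目录取自 SCHEMA_TOOLS_CONFIG):

from pathlib import Path

base_dir = Path("./data_pipeline/training_data/")   # 配置中的 output_directory
task_id = "manual_20250720_130541"
print(base_dir / task_id)
# data_pipeline/training_data/manual_20250720_130541
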
+ 66 - 2
data_pipeline/ddl_generation/training_data_agent.py

@@ -125,6 +125,13 @@ class SchemaTrainingDataAgent:
         if not inspector.connection_pool:
             await inspector._create_connection_pool()
         
+        # 解析并打印数据库连接信息
+        try:
+            db_info = self._parse_db_connection(self.db_connection)
+            self.logger.info(f"🔗 数据库连接信息: 用户名={db_info['user']}, 密码={'*' * len(db_info['password'])}, 主机={db_info['host']}:{db_info['port']}, 数据库={db_info['dbname']}")
+        except Exception as e:
+            self.logger.warning(f"无法解析数据库连接字符串: {e}")
+        
         checker = DatabasePermissionChecker(inspector)
         
         permissions = await checker.check_permissions()
@@ -140,6 +147,35 @@ class SchemaTrainingDataAgent:
         
         self.logger.info(f"数据库权限检查完成: {permissions}")
     
+    def _parse_db_connection(self, db_connection: str) -> Dict[str, str]:
+        """
+        解析PostgreSQL连接字符串
+        
+        Args:
+            db_connection: PostgreSQL连接字符串,格式为 postgresql://user:password@host:port/dbname
+        
+        Returns:
+            包含数据库连接参数的字典
+        """
+        import re
+        
+        # 解析连接字符串的正则表达式
+        pattern = r'postgresql://([^:]+):([^@]+)@([^:]+):(\d+)/(.+)'
+        match = re.match(pattern, db_connection)
+        
+        if not match:
+            raise ValueError(f"无效的PostgreSQL连接字符串格式: {db_connection}")
+        
+        user, password, host, port, dbname = match.groups()
+        
+        return {
+            'user': user,
+            'password': password,
+            'host': host,
+            'port': port,
+            'dbname': dbname
+        }
+    
     async def _parse_table_list(self) -> List[str]:
         """解析表清单文件"""
         tables = self.table_parser.parse_file(self.table_list_file)
@@ -279,6 +315,25 @@ class SchemaTrainingDataAgent:
         
         avg_execution_time = sum(r.get('execution_time', 0) for r in results) / len(results) if results else 0
         
+        # 计算生成的文件数量
+        successful_count = len(successful_results)
+        if self.pipeline == 'full':
+            md_files_generated = successful_count
+            ddl_files_generated = successful_count
+            total_files_generated = successful_count * 2
+        elif self.pipeline == 'ddl_only':
+            md_files_generated = 0
+            ddl_files_generated = successful_count
+            total_files_generated = successful_count
+        elif self.pipeline == 'analysis_only':
+            md_files_generated = successful_count
+            ddl_files_generated = 0
+            total_files_generated = successful_count
+        else:
+            md_files_generated = successful_count
+            ddl_files_generated = 0
+            total_files_generated = successful_count
+        
         report = {
             'summary': {
                 'total_tables': self.stats['total_tables'],
@@ -291,7 +346,9 @@ class SchemaTrainingDataAgent:
             'statistics': {
                 'total_fields_processed': total_fields,
                 'enum_fields_detected': total_enum_fields,
-                'files_generated': len(successful_results) * (2 if self.pipeline == 'full' else 1)
+                'md_files_generated': md_files_generated,
+                'ddl_files_generated': ddl_files_generated,
+                'total_files_generated': total_files_generated
             },
             'failed_tables': self.failed_tables,
             'detailed_results': results,
@@ -308,7 +365,14 @@ class SchemaTrainingDataAgent:
         self.logger.info(f"  ✅ 成功: {report['summary']['processed_successfully']} 个表")
         self.logger.info(f"  ❌ 失败: {report['summary']['failed']} 个表")
         self.logger.info(f"  ⏭️  跳过: {report['summary']['skipped_system_tables']} 个系统表")
-        self.logger.info(f"  📁 生成文件: {report['statistics']['files_generated']} 个")
+        if md_files_generated > 0 and ddl_files_generated > 0:
+            self.logger.info(f"  📁 生成文件: {md_files_generated} 个MD文件,{ddl_files_generated} 个DDL文件")
+        elif md_files_generated > 0:
+            self.logger.info(f"  📁 生成文件: {md_files_generated} 个MD文件")
+        elif ddl_files_generated > 0:
+            self.logger.info(f"  📁 生成文件: {ddl_files_generated} 个DDL文件")
+        else:
+            self.logger.info(f"  📁 生成文件: 0 个")
         self.logger.info(f"  🕐 总耗时: {total_time:.2f} 秒")
         
         if self.failed_tables:

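_parse_db_connection 用一个正则把连接串拆成五个部分,打印时密码以等长的 * 号掩码。下面的示例使用虚构的连接串演示该正则的效果:

import re

pattern = r'postgresql://([^:]+):([^@]+)@([^:]+):(\d+)/(.+)'
conn = "postgresql://dbuser:secret@192.168.1.10:5432/highway_db"

user, password, host, port, dbname = re.match(pattern, conn).groups()
print(f"用户名={user}, 密码={'*' * len(password)}, 主机={host}:{port}, 数据库={dbname}")
# 用户名=dbuser, 密码=******, 主机=192.168.1.10:5432, 数据库=highway_db
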
+ 43 - 18
data_pipeline/dp_logging/__init__.py

@@ -1,29 +1,54 @@
 """
-Data Pipeline 独立日志管理系统
-
-完全脱离主项目的日志管理,专门为data_pipeline模块设计
-支持任务级别的日志文件管理,同时支持API调用和脚本调用
+Data Pipeline 统一日志管理
+支持API和命令行两种模式
 """
 
-from .manager import DataPipelineLogManager
+from core.logging import get_data_pipeline_logger, initialize_logging
+import os
+import logging
+import logging.handlers
+from pathlib import Path
+
+def init_data_pipeline_logging():
+    """初始化data_pipeline日志系统"""
+    # 确保日志系统已初始化
+    initialize_logging()
 
-# 对外接口
-def get_logger(name: str, task_id: str):
+def get_logger(name: str, task_id: str = None):
     """
     获取data_pipeline专用logger
     
     Args:
-        name: logger名称 (如: "SchemaWorkflowOrchestrator", "DDLGenerator")
-        task_id: 任务ID,必须提供
-                API模式: task_YYYYMMDD_HHMMSS
-                脚本模式: manual_YYYYMMDD_HHMMSS
+        name: logger名称
+        task_id: 任务ID(可选,用于任务特定日志)
     
     Returns:
-        配置好的logger,输出到 ./data_pipeline/training_data/{task_id}/data_pipeline.log
+        配置好的logger实例
     """
-    return DataPipelineLogManager.get_logger(name, task_id)
-
-# 便捷方法(保持接口一致性)
-def get_data_pipeline_logger(name: str, task_id: str):
-    """便捷方法,与get_logger功能相同"""
-    return get_logger(name, task_id)
+    logger = get_data_pipeline_logger(name)
+    
+    # 如果提供了task_id,添加任务特定的文件处理器
+    if task_id:
+        # 创建任务特定的日志文件
+        task_log_file = Path(f"data_pipeline/training_data/{task_id}/data_pipeline.log")
+        task_log_file.parent.mkdir(parents=True, exist_ok=True)
+        
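+        # 避免同一task_id重复调用时重复添加文件处理器,造成日志重复写入
+        task_log_path = os.path.abspath(str(task_log_file))
+        if any(getattr(h, 'baseFilename', None) == task_log_path for h in logger.handlers):
+            return logger
+        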
+        # 创建任务特定的文件处理器(支持滚动)
+        task_handler = logging.handlers.RotatingFileHandler(
+            task_log_file, 
+            maxBytes=10*1024*1024,  # 10MB
+            backupCount=3,
+            encoding='utf-8'
+        )
+        task_handler.setLevel(logging.DEBUG)
+        
+        formatter = logging.Formatter(
+            '%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s',
+            datefmt='%Y-%m-%d %H:%M:%S'
+        )
+        task_handler.setFormatter(formatter)
+        
+        # 添加到logger
+        logger.addHandler(task_handler)
+    
+    return logger

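重构后的 get_logger 在传入 task_id 时会额外挂一个按大小滚动的文件处理器,把日志同时写到任务目录下的 data_pipeline.log。典型用法示意如下(task_id 为示例值):

from data_pipeline.dp_logging import get_logger

# 带task_id:日志写入 data_pipeline/training_data/<task_id>/data_pipeline.log
logger = get_logger("SchemaWorkflowOrchestrator", "manual_20250720_130541")
logger.info("任务开始")

# 不带task_id:仅走全局data_pipeline日志配置
cli_logger = get_logger("CreateTaskCLI")
cli_logger.info("无任务上下文的日志")
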
+ 0 - 156
data_pipeline/dp_logging/manager.py

@@ -1,156 +0,0 @@
-"""
-Data Pipeline 独立日志管理器
-
-专门为data_pipeline模块设计的日志管理器,完全独立于主项目的日志系统
-"""
-
-import os
-from pathlib import Path
-from typing import Dict
-
-# 明确导入Python内置logging模块
-import logging as std_logging
-
-
-class DataPipelineLogManager:
-    """Data Pipeline 专用日志管理器"""
-    
-    _loggers: Dict[str, std_logging.Logger] = {}
-    _file_handlers: Dict[str, std_logging.FileHandler] = {}
-    
-    @classmethod
-    def get_logger(cls, name: str, task_id: str) -> std_logging.Logger:
-        """
-        获取或创建logger
-        
-        Args:
-            name: logger名称
-            task_id: 任务ID,用于确定日志文件位置
-        
-        Returns:
-            配置好的logger实例
-        """
-        logger_key = f"data_pipeline.{name}.{task_id}"
-        
-        if logger_key not in cls._loggers:
-            logger = cls._create_logger(name, task_id)
-            cls._loggers[logger_key] = logger
-        
-        return cls._loggers[logger_key]
-    
-    @classmethod
-    def _create_logger(cls, name: str, task_id: str) -> std_logging.Logger:
-        """创建新的logger实例"""
-        # 创建logger
-        logger_name = f"data_pipeline.{name}"
-        logger = std_logging.getLogger(logger_name)
-        
-        # 设置日志级别
-        logger.setLevel(std_logging.DEBUG)
-        
-        # 防止日志重复(清除已有处理器)
-        logger.handlers.clear()
-        logger.propagate = False
-        
-        # 添加控制台处理器
-        console_handler = cls._create_console_handler()
-        logger.addHandler(console_handler)
-        
-        # 添加文件处理器
-        file_handler = cls._create_file_handler(task_id)
-        if file_handler:
-            logger.addHandler(file_handler)
-        
-        return logger
-    
-    @classmethod
-    def _create_console_handler(cls) -> std_logging.StreamHandler:
-        """创建控制台处理器"""
-        handler = std_logging.StreamHandler()
-        handler.setLevel(std_logging.INFO)
-        
-        formatter = std_logging.Formatter(
-            '%(asctime)s [%(levelname)s] Pipeline: %(message)s',
-            datefmt='%Y-%m-%d %H:%M:%S'
-        )
-        handler.setFormatter(formatter)
-        
-        return handler
-    
-    @classmethod
-    def _create_file_handler(cls, task_id: str) -> std_logging.FileHandler:
-        """创建文件处理器"""
-        try:
-            # 获取项目根目录的绝对路径
-            project_root = Path(__file__).parent.parent.parent
-            task_dir = project_root / "data_pipeline" / "training_data" / task_id
-            
-            task_dir.mkdir(parents=True, exist_ok=True)
-            
-            log_file = task_dir / "data_pipeline.log"
-            
-            # 为每个任务创建独立的文件处理器
-            handler_key = f"file_handler_{task_id}"
-            
-            if handler_key not in cls._file_handlers:
-                handler = std_logging.FileHandler(log_file, encoding='utf-8')
-                handler.setLevel(std_logging.DEBUG)
-                
-                formatter = std_logging.Formatter(
-                    '%(asctime)s [%(levelname)s] [%(name)s] %(filename)s:%(lineno)d - %(message)s',
-                    datefmt='%Y-%m-%d %H:%M:%S'
-                )
-                handler.setFormatter(formatter)
-                
-                cls._file_handlers[handler_key] = handler
-            
-            return cls._file_handlers[handler_key]
-            
-        except Exception as e:
-            # 如果文件处理器创建失败,记录到stderr但不影响程序运行
-            import sys
-            sys.stderr.write(f"[WARNING] 无法创建data_pipeline日志文件处理器: {e}\n")
-            return None
-    
-    @classmethod
-    def cleanup_logger(cls, task_id: str):
-        """清理指定任务的logger和文件处理器"""
-        try:
-            # 关闭文件处理器
-            handler_key = f"file_handler_{task_id}"
-            if handler_key in cls._file_handlers:
-                cls._file_handlers[handler_key].close()
-                del cls._file_handlers[handler_key]
-            
-            # 清理相关的logger
-            keys_to_remove = [key for key in cls._loggers.keys() if task_id in key]
-            for key in keys_to_remove:
-                logger = cls._loggers[key]
-                for handler in logger.handlers:
-                    handler.close()
-                logger.handlers.clear()
-                del cls._loggers[key]
-                
-        except Exception as e:
-            import sys
-            sys.stderr.write(f"[WARNING] 清理data_pipeline日志资源失败: {e}\n")
-    
-    @classmethod
-    def cleanup_all(cls):
-        """清理所有logger和文件处理器"""
-        try:
-            # 关闭所有文件处理器
-            for handler in cls._file_handlers.values():
-                handler.close()
-            cls._file_handlers.clear()
-            
-            # 清理所有logger
-            for logger in cls._loggers.values():
-                for handler in logger.handlers:
-                    handler.close()
-                logger.handlers.clear()
-            cls._loggers.clear()
-            
-        except Exception as e:
-            import sys
-            sys.stderr.write(f"[WARNING] 清理所有data_pipeline日志资源失败: {e}\n")

+ 26 - 1
data_pipeline/qa_generation/qs_agent.py

@@ -73,7 +73,10 @@ class QuestionSQLGenerationAgent:
         try:
             self.logger.info("🚀 开始生成Question-SQL训练数据")
             
-            # 1. 验证文件数量
+            # 1. 重命名现有文件
+            await self._rename_existing_files()
+            
+            # 2. 验证文件数量
             self.logger.info("📋 验证文件数量...")
             validation_result = self.validator.validate(self.table_list_file, str(self.output_dir))
             
@@ -167,6 +170,28 @@ class QuestionSQLGenerationAgent:
             
             raise
     
+    async def _rename_existing_files(self):
+        """重命名现有的输出文件"""
+        try:
+            # 查找现有的 *_pair.json 文件
+            pair_files = list(self.output_dir.glob("*_pair.json"))
+            
+            for pair_file in pair_files:
+                old_name = f"{pair_file}_old"
+                pair_file.rename(old_name)
+                self.logger.info(f"重命名文件: {pair_file.name} → {Path(old_name).name}")
+            
+            # 查找现有的 backup 文件
+            backup_files = list(self.output_dir.glob("*_pair.json.backup"))
+            
+            for backup_file in backup_files:
+                old_name = f"{backup_file}_old"
+                backup_file.rename(old_name)
+                self.logger.info(f"重命名备份文件: {backup_file.name} → {Path(old_name).name}")
+                
+        except Exception as e:
+            self.logger.warning(f"重命名现有文件时出错: {e}")
+
     def _initialize_llm_components(self):
         """初始化LLM相关组件"""
         if not self.vn:

+ 45 - 7
data_pipeline/qa_generation/qs_generator.py

@@ -23,6 +23,9 @@ def setup_argument_parser():
   # 基本使用
   python -m data_pipeline.qa_generation.qs_generator --output-dir ./output --table-list ./tables.txt --business-context "高速公路服务区管理系统"
   
+  # 使用task_id自动解析路径
+  python -m data_pipeline.qa_generation.qs_generator --task-id manual_20250720_130541 --table-list ./tables.txt --business-context "高速公路服务区管理系统"
+  
   # 指定数据库名称
   python -m data_pipeline.qa_generation.qs_generator --output-dir ./output --table-list ./tables.txt --business-context "电商系统" --db-name ecommerce_db
   
@@ -31,10 +34,14 @@ def setup_argument_parser():
         """
     )
     
-    # 必需参数
+    # 可选参数(当使用task-id时,output-dir变为可选)
+    parser.add_argument(
+        '--task-id',
+        help='任务ID,指定后将自动构建输出目录路径 (基础目录/task_id)'
+    )
+    
     parser.add_argument(
         '--output-dir',
-        required=True,
         help='包含DDL和MD文件的输出目录'
     )
     
@@ -69,6 +76,28 @@ def setup_argument_parser():
     
     return parser
 
+def resolve_output_directory(args):
+    """解析输出目录路径"""
+    if args.output_dir:
+        # 用户明确指定了输出目录
+        return args.output_dir
+    elif args.task_id:
+        # 使用task_id构建输出目录
+        from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+        base_dir = SCHEMA_TOOLS_CONFIG.get("output_directory", "./data_pipeline/training_data/")
+        
+        # 处理相对路径
+        from pathlib import Path
+        if not Path(base_dir).is_absolute():
+            # 相对于项目根目录解析
+            project_root = Path(__file__).parent.parent.parent
+            base_dir = project_root / base_dir
+        
+        return str(Path(base_dir) / args.task_id)
+    else:
+        # 没有指定输出目录或task_id
+        return None
+
 
 async def main():
     """主入口函数"""
@@ -81,10 +110,18 @@ async def main():
         log_file=args.log_file
     )
     
+    # 解析输出目录
+    output_dir = resolve_output_directory(args)
+    
     # 验证参数
-    output_path = Path(args.output_dir)
+    if not output_dir:
+        print("错误: 需要指定 --output-dir 或 --task-id 参数")
+        parser.print_help()
+        sys.exit(1)
+    
+    output_path = Path(output_dir)
     if not output_path.exists():
-        print(f"错误: 输出目录不存在: {args.output_dir}")
+        print(f"错误: 输出目录不存在: {output_dir}")
         sys.exit(1)
     
     if not os.path.exists(args.table_list):
@@ -94,15 +131,16 @@ async def main():
     try:
         # 创建Agent
         agent = QuestionSQLGenerationAgent(
-            output_dir=args.output_dir,
+            output_dir=output_dir,
             table_list_file=args.table_list,
             business_context=args.business_context,
-            db_name=args.db_name
+            db_name=args.db_name,
+            task_id=args.task_id  # 传递task_id
         )
         
         # 执行生成
         print(f"🚀 开始生成Question-SQL训练数据...")
-        print(f"📁 输出目录: {args.output_dir}")
+        print(f"📁 输出目录: {output_dir}")
         print(f"📋 表清单: {args.table_list}")
         print(f"🏢 业务背景: {args.business_context}")
         

+ 272 - 19
data_pipeline/schema_workflow.py

@@ -15,6 +15,7 @@ from data_pipeline.qa_generation.qs_agent import QuestionSQLGenerationAgent
 from data_pipeline.validators.sql_validation_agent import SQLValidationAgent
 from data_pipeline.config import SCHEMA_TOOLS_CONFIG
 from data_pipeline.dp_logging import get_logger
+from data_pipeline.utils.logger import setup_logging
 
 
 class SchemaWorkflowOrchestrator:
@@ -29,7 +30,10 @@ class SchemaWorkflowOrchestrator:
                  enable_sql_validation: bool = True,
                  enable_llm_repair: bool = True,
                  modify_original_file: bool = True,
-                 enable_training_data_load: bool = True):
+                 enable_training_data_load: bool = True,
+                 backup_vector_tables: bool = False,
+                 truncate_vector_tables: bool = False,
+                 skip_training: bool = False):
         """
         初始化Schema工作流编排器
         
@@ -43,6 +47,9 @@ class SchemaWorkflowOrchestrator:
             enable_llm_repair: 是否启用LLM修复功能
             modify_original_file: 是否修改原始JSON文件
             enable_training_data_load: 是否启用训练数据加载
+            backup_vector_tables: 是否备份vector表数据
+            truncate_vector_tables: 是否清空vector表数据(自动启用备份)
+            skip_training: 是否跳过训练文件处理,仅执行Vector表管理
         """
         self.db_connection = db_connection
         self.table_list_file = table_list_file
@@ -53,6 +60,15 @@ class SchemaWorkflowOrchestrator:
         self.modify_original_file = modify_original_file
         self.enable_training_data_load = enable_training_data_load
         
+        # 处理vector表管理参数
+        # 参数验证和自动启用逻辑:如果启用truncate,自动启用backup
+        backup_auto_enabled = truncate_vector_tables and not backup_vector_tables
+        if truncate_vector_tables:
+            backup_vector_tables = True
+            
+        self.backup_vector_tables = backup_vector_tables
+        self.truncate_vector_tables = truncate_vector_tables
+        self.skip_training = skip_training
+        
         # 处理task_id
         if task_id is None:
             # 脚本模式:自动生成manual开头的task_id
@@ -63,20 +79,36 @@ class SchemaWorkflowOrchestrator:
         
         # 设置输出目录
         if output_dir is None:
-            # 脚本模式或未指定输出目录时,使用任务目录
+            # 脚本模式或未指定输出目录时,使用默认基础目录
             # 获取项目根目录的绝对路径
             project_root = Path(__file__).parent.parent
-            self.output_dir = project_root / "data_pipeline" / "training_data" / self.task_id
+            base_dir = project_root / "data_pipeline" / "training_data"
+            # 在基础目录下创建task子目录
+            self.output_dir = base_dir / self.task_id
         else:
-            # API模式或明确指定输出目录时,使用指定的目录
-            self.output_dir = Path(output_dir)
+            # 用户指定了输出目录时,检查是否为API模式
+            output_path = Path(output_dir)
             
+            # API模式判断:如果output_dir路径已经包含task_id,则直接使用,不再创建子目录
+            if self.task_id in str(output_path):
+                # API模式:直接使用传入的目录,这个目录已经是task专用目录
+                self.output_dir = output_path
+            else:
+                # 脚本模式:在指定目录下创建task子目录
+                self.output_dir = output_path / self.task_id
+        
         # 确保输出目录存在
         self.output_dir.mkdir(parents=True, exist_ok=True)
             
         # 初始化独立日志系统
         self.logger = get_logger("SchemaWorkflowOrchestrator", self.task_id)
         
+        # 记录Vector表管理参数状态
+        if backup_auto_enabled:
+            self.logger.info("🔄 启用truncate时自动启用backup")
+        if self.backup_vector_tables or self.truncate_vector_tables:
+            self.logger.info(f"🗂️ Vector表管理参数: backup={self.backup_vector_tables}, truncate={self.truncate_vector_tables}")
+        
         # 工作流程状态
         self.workflow_state = {
             "start_time": None,
@@ -138,6 +170,8 @@ class SchemaWorkflowOrchestrator:
             else:
                 self.logger.info("⏭️ 跳过SQL验证步骤")
             
+
+            
             # 步骤4: 训练数据加载(可选)
             if self.enable_training_data_load:
                 await self._execute_step_4_training_data_load()
@@ -192,13 +226,28 @@ class SchemaWorkflowOrchestrator:
                 "total_tables": ddl_md_result.get("summary", {}).get("total_tables", 0),
                 "processed_successfully": ddl_md_result.get("summary", {}).get("processed_successfully", 0),
                 "failed": ddl_md_result.get("summary", {}).get("failed", 0),
-                "files_generated": ddl_md_result.get("statistics", {}).get("files_generated", 0),
+                "files_generated": ddl_md_result.get("statistics", {}).get("total_files_generated", 0),
                 "duration": step_duration
             }
             self.workflow_state["statistics"]["step1_duration"] = step_duration
             
             processed_tables = ddl_md_result.get("summary", {}).get("processed_successfully", 0)
-            self.logger.info(f"✅ 步骤1完成: 成功处理 {processed_tables} 个表,耗时 {step_duration:.2f}秒")
+            
+            # 获取文件统计信息
+            statistics = ddl_md_result.get("statistics", {})
+            md_files = statistics.get("md_files_generated", 0)
+            ddl_files = statistics.get("ddl_files_generated", 0)
+            
+            if md_files > 0 and ddl_files > 0:
+                file_info = f"生成 {md_files} 个MD文件,{ddl_files} 个DDL文件"
+            elif md_files > 0:
+                file_info = f"生成 {md_files} 个MD文件"
+            elif ddl_files > 0:
+                file_info = f"生成 {ddl_files} 个DDL文件"
+            else:
+                file_info = "未生成文件"
+                
+            self.logger.info(f"✅ 步骤1完成: 成功处理 {processed_tables} 个表,{file_info},耗时 {step_duration:.2f}秒")
             
         except Exception as e:
             self.workflow_state["failed_steps"].append("ddl_md_generation")
@@ -335,6 +384,51 @@ class SchemaWorkflowOrchestrator:
             self.logger.error(f"❌ 步骤3失败: {str(e)}")
             raise
     
+    async def _execute_vector_table_management(self):
+        """独立执行Vector表管理"""
+        if not (self.backup_vector_tables or self.truncate_vector_tables):
+            return
+            
+        self.logger.info("=" * 60)
+        self.logger.info("🗂️ 开始执行Vector表管理")
+        self.logger.info("=" * 60)
+        
+        vector_stats = None
+        try:
+            from data_pipeline.trainer.vector_table_manager import VectorTableManager
+            
+            vector_manager = VectorTableManager(
+                task_output_dir=str(self.output_dir),
+                task_id=self.task_id
+            )
+            
+            # 执行vector表管理
+            vector_stats = vector_manager.execute_vector_management(
+                backup=self.backup_vector_tables,
+                truncate=self.truncate_vector_tables
+            )
+            
+            # 记录结果到工作流状态(无论成功失败都记录)
+            self.workflow_state["artifacts"]["vector_management"] = vector_stats
+            
+            if vector_stats.get("errors"):
+                self.logger.warning(f"⚠️ Vector表管理完成,但有错误: {'; '.join(vector_stats['errors'])}")
+            else:
+                self.logger.info("✅ Vector表管理完成")
+            
+        except Exception as e:
+            self.logger.error(f"❌ Vector表管理失败: {e}")
+            # 即使异常也要记录基本状态
+            if vector_stats is None:
+                vector_stats = {
+                    "backup_performed": self.backup_vector_tables,
+                    "truncate_performed": False,
+                    "errors": [f"执行异常: {str(e)}"],
+                    "duration": 0
+                }
+                self.workflow_state["artifacts"]["vector_management"] = vector_stats
+            raise
+    
     async def _execute_step_4_training_data_load(self):
         """步骤4: 训练数据加载"""
         self.workflow_state["current_step"] = "training_data_load"
@@ -358,10 +452,20 @@ class SchemaWorkflowOrchestrator:
             
             # 执行训练数据加载
             self.logger.info("🔄 开始处理训练文件...")
-            load_successful = process_training_files(training_data_dir, self.task_id)
+            # 传递Vector表管理参数到training步骤
+            load_successful, vector_stats = process_training_files(training_data_dir, self.task_id, 
+                                                                  backup_vector_tables=self.backup_vector_tables, 
+                                                                  truncate_vector_tables=self.truncate_vector_tables,
+                                                                  skip_training=self.skip_training)
             
             step_duration = time.time() - step_start_time
             
+            # 记录Vector表管理结果到工作流状态
+            if vector_stats:
+                if "artifacts" not in self.workflow_state:
+                    self.workflow_state["artifacts"] = {}
+                self.workflow_state["artifacts"]["vector_management"] = vector_stats
+            
             if load_successful:
                 # 获取统计信息
                 from data_pipeline.trainer.vanna_trainer import flush_training, shutdown_trainer
@@ -527,10 +631,42 @@ class SchemaWorkflowOrchestrator:
             self.logger.info(f"⏱️  总耗时: {summary['total_duration']} 秒")
             self.logger.info(f"📝 完成步骤: {len(summary['completed_steps'])}/{summary['total_steps']}")
             
-            # DDL/MD生成结果
+            # 获取并显示embedding模型信息
+            try:
+                from common.utils import get_current_model_info
+                model_info = get_current_model_info()
+                self.logger.info(f"🤖 使用的embedding模型: {model_info['embedding_model']} ({model_info['embedding_type']})")
+            except Exception as e:
+                self.logger.info(f"🤖 使用的embedding模型: 未知 (获取信息失败: {e})")
+            
+            # 解析并显示源库信息
+            try:
+                db_info = self._parse_db_connection(self.db_connection)
+                self.logger.info(f"🗄️  源库名: {db_info['dbname']}")
+                self.logger.info(f"🏠 源库Hostname: {db_info['host']}:{db_info['port']}")
+            except Exception as e:
+                self.logger.info(f"🗄️  源库名: {self.db_name}")
+                self.logger.info(f"🏠 源库Hostname: 未知 (解析失败: {e})")
+            
+            # DDL/MD生成结果 - 增加详细的文件统计
             if "ddl_md_generation" in results:
                 ddl_md = results["ddl_md_generation"]
                 self.logger.info(f"📋 DDL/MD生成: {ddl_md.get('processed_successfully', 0)} 个表成功处理")
+                
+                # 尝试获取详细的文件统计信息
+                try:
+                    # 从输出目录统计实际生成的文件
+                    output_path = Path(self.output_dir)
+                    if output_path.exists():
+                        md_files = list(output_path.glob("*.md"))
+                        ddl_files = list(output_path.glob("*.ddl"))
+                        md_count = len([f for f in md_files if not f.name.startswith('metadata')])  # 排除metadata.md
+                        ddl_count = len(ddl_files)
+                        self.logger.info(f"📁 生成文件: {md_count} 个MD文件,{ddl_count} 个DDL文件")
+                    else:
+                        self.logger.info(f"📁 生成文件: 统计信息不可用")
+                except Exception as e:
+                    self.logger.info(f"📁 生成文件: 统计失败 ({e})")
             
             # Question-SQL生成结果
             if "question_sql_generation" in results:
@@ -544,9 +680,44 @@ class SchemaWorkflowOrchestrator:
                 self.logger.info(f"🔍 SQL验证: {success_rate:.1%} 成功率 ({validation.get('valid_sql_count', 0)}/{validation.get('original_sql_count', 0)})")
             
             self.logger.info(f"📁 输出目录: {outputs['output_directory']}")
-            self.logger.info(f"📄 主要输出文件: {outputs['primary_output_file']}")
+            self.logger.info(f"📄 QUESTION/SQL键值对文件: {outputs['primary_output_file']}")
             self.logger.info(f"❓ 最终问题数量: {outputs['final_question_count']}")
             
+            # 配置参数反馈
+            self.logger.info("⚙️ 执行配置:")
+            self.logger.info(f"  🔍 SQL验证: {'启用' if self.enable_sql_validation else '禁用'}")
+            self.logger.info(f"  🔧 LLM修复: {'启用' if self.enable_llm_repair else '禁用'}")
+            self.logger.info(f"  📝 文件修改: {'启用' if self.modify_original_file else '禁用'}")
+            if not self.enable_training_data_load:
+                self.logger.info(f"  ⏭️ 训练数据加载: 已跳过")
+            else:
+                self.logger.info(f"  📚 训练数据加载: 启用")
+            
+            # Vector表管理总结
+            vector_stats = report.get("workflow_state", {}).get("artifacts", {}).get("vector_management")
+            if vector_stats:
+                self.logger.info("📊 Vector表管理:")
+                if vector_stats.get("backup_performed", False):
+                    tables_count = len(vector_stats.get("tables_backed_up", {}))
+                    total_size = sum(
+                        self._parse_file_size(info.get("file_size", "0 B")) 
+                        for info in vector_stats.get("tables_backed_up", {}).values() 
+                        if info.get("success", False)
+                    )
+                    self.logger.info(f"   ✅ 备份执行: {tables_count}个表,总大小: {self._format_size(total_size)}")
+                else:
+                    self.logger.info("   - 备份执行: 未执行")
+                    
+                if vector_stats.get("truncate_performed", False):
+                    self.logger.info("   ✅ 清空执行: langchain_pg_embedding表已清空")
+                else:
+                    self.logger.info("   - 清空执行: 未执行")
+                    
+                duration = vector_stats.get("duration", 0)
+                self.logger.info(f"   ⏱️  执行耗时: {duration:.1f}秒")
+            else:
+                self.logger.info("📊 Vector表管理: 未执行(未启用相关参数)")
+            
         else:
             error = report["error"]
             summary = report["workflow_summary"]
@@ -558,6 +729,72 @@ class SchemaWorkflowOrchestrator:
             self.logger.error(f"✅ 已完成步骤: {', '.join(summary['completed_steps']) if summary['completed_steps'] else '无'}")
         
         self.logger.info("=" * 80)
+    
+    def _parse_file_size(self, size_str: str) -> float:
+        """解析文件大小字符串为字节数"""
+        import re
+        
+        # 匹配数字和单位的正则表达式
+        match = re.match(r'(\d+\.?\d*)\s*([KMGT]?B)', size_str.upper())
+        if not match:
+            return 0.0
+            
+        size, unit = match.groups()
+        size = float(size)
+        
+        unit_multipliers = {
+            'B': 1,
+            'KB': 1024,
+            'MB': 1024**2,
+            'GB': 1024**3,
+            'TB': 1024**4
+        }
+        
+        return size * unit_multipliers.get(unit, 1)
+    
+    def _format_size(self, size_bytes: float) -> str:
+        """格式化字节数为可读的大小字符串"""
+        if size_bytes == 0:
+            return "0 B"
+        
+        size_names = ["B", "KB", "MB", "GB"]
+        i = 0
+        size = float(size_bytes)
+        
+        while size >= 1024.0 and i < len(size_names) - 1:
+            size /= 1024.0
+            i += 1
+        
+        return f"{size:.1f} {size_names[i]}"
+    
+    def _parse_db_connection(self, db_connection: str) -> Dict[str, str]:
+        """
+        解析PostgreSQL连接字符串
+        
+        Args:
+            db_connection: PostgreSQL连接字符串,格式为 postgresql://user:password@host:port/dbname
+        
+        Returns:
+            包含数据库连接参数的字典
+        """
+        import re
+        
+        # 解析连接字符串的正则表达式
+        pattern = r'postgresql://([^:]+):([^@]+)@([^:]+):(\d+)/(.+)'
+        match = re.match(pattern, db_connection)
+        
+        if not match:
+            raise ValueError(f"无效的PostgreSQL连接字符串格式: {db_connection}")
+        
+        user, password, host, port, dbname = match.groups()
+        
+        return {
+            'user': user,
+            'password': password,
+            'host': host,
+            'port': port,
+            'dbname': dbname
+        }
 
 
 # 便捷的命令行接口
@@ -570,7 +807,7 @@ def setup_argument_parser():
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""
 示例用法:
-  # 完整工作流程
+  # 完整工作流程(会在指定目录下创建任务子目录)
   python -m data_pipeline.schema_workflow \\
     --db-connection "postgresql://user:pass@localhost:5432/highway_db" \\
     --table-list tables.txt \\
@@ -623,8 +860,8 @@ def setup_argument_parser():
     # 可选参数
     parser.add_argument(
         "--output-dir",
-        default="./data_pipeline/training_data/",
-        help="输出目录(默认:./data_pipeline/training_data/)"
+        default=None,
+        help="基础输出目录,将在此目录下创建任务子目录(默认:./data_pipeline/training_data/)"
     )
     
     parser.add_argument(
@@ -645,10 +882,24 @@ def setup_argument_parser():
         help="不修改原始JSON文件(仅生成报告)"
     )
     
+
+    
+    parser.add_argument(
+        "--backup-vector-tables",
+        action="store_true",
+        help="备份vector表数据到任务目录"
+    )
+    
+    parser.add_argument(
+        "--truncate-vector-tables",
+        action="store_true",
+        help="清空vector表数据(自动启用备份)"
+    )
+    
     parser.add_argument(
-        "--skip-training-load",
+        "--skip-training",
         action="store_true",
-        help="跳过训练数据加载步骤"
+        help="跳过训练文件处理,仅执行Vector表管理"
     )
     
     parser.add_argument(
@@ -700,7 +951,10 @@ async def main():
             enable_sql_validation=not args.skip_validation,
             enable_llm_repair=not args.disable_llm_repair,
             modify_original_file=not args.no_modify_file,
-            enable_training_data_load=not args.skip_training_load
+            enable_training_data_load=True,
+            backup_vector_tables=args.backup_vector_tables,
+            truncate_vector_tables=args.truncate_vector_tables,
+            skip_training=args.skip_training
         )
         
         # 获取logger用于启动信息
@@ -711,13 +965,13 @@ async def main():
         from data_pipeline.dp_logging import get_logger
         logger = get_logger("SchemaWorkflow", script_task_id)
         logger.info(f"🚀 开始执行Schema工作流编排...")
-        logger.info(f"📁 输出目录: {args.output_dir}")
+        logger.info(f"📁 输出目录: {orchestrator.output_dir}")
         logger.info(f"📋 表清单: {args.table_list}")
         logger.info(f"🏢 业务背景: {args.business_context}")
         logger.info(f"💾 数据库: {orchestrator.db_name}")
         logger.info(f"🔍 SQL验证: {'启用' if not args.skip_validation else '禁用'}")
         logger.info(f"🔧 LLM修复: {'启用' if not args.disable_llm_repair else '禁用'}")
-        logger.info(f"🎯 训练数据加载: {'启用' if not args.skip_training_load else '禁用'}")
+        logger.info(f"🎯 训练数据加载: {'启用' if not args.skip_training else '禁用'}")
         
         # 执行完整工作流程
         report = await orchestrator.execute_complete_workflow()
@@ -737,7 +991,6 @@ async def main():
             logger.error(f"\n❌ 工作流程执行失败")
             exit_code = 2  # 失败
         
-        logger.info(f"📄 主要输出文件: {report['final_outputs']['primary_output_file']}")
         sys.exit(exit_code)
         
     except KeyboardInterrupt:

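以编程方式使用新增的 Vector 表管理参数时,truncate_vector_tables 会自动联动开启备份。下面是一个构造编排器的示意(除 diff 中出现的参数外,business_context 等其余参数名以源码实际定义为准):

from data_pipeline.schema_workflow import SchemaWorkflowOrchestrator

orchestrator = SchemaWorkflowOrchestrator(
    db_connection="postgresql://user:pass@localhost:5432/highway_db",
    table_list_file="tables.txt",
    business_context="高速公路服务区管理系统",
    truncate_vector_tables=True,   # 清空embedding表,自动启用备份
    skip_training=False,
)
assert orchestrator.backup_vector_tables is True  # truncate 联动 backup
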
+ 5 - 5
data_pipeline/tables.txt

@@ -5,9 +5,9 @@
 # 服务区相关表
 bss_car_day_count
 bss_business_day_data
-#bss_company
-#bss_section_route
-#bss_section_route_area_link
-#bss_service_area
-#bss_service_area_mapper
+bss_company
+bss_section_route
+bss_section_route_area_link
+bss_service_area
+bss_service_area_mapper
 

+ 18 - 4
data_pipeline/task_executor.py

@@ -24,6 +24,11 @@ def main():
     parser.add_argument('--execution-mode', default='complete', choices=['complete', 'step'], help='执行模式')
     parser.add_argument('--step-name', help='步骤名称(当execution-mode=step时必需)')
     
+    # 新增:Vector表管理参数
+    parser.add_argument('--backup-vector-tables', action='store_true', help='备份vector表数据')
+    parser.add_argument('--truncate-vector-tables', action='store_true', help='清空vector表数据(自动启用备份)')
+    parser.add_argument('--skip-training', action='store_true', help='跳过训练文件处理,仅执行Vector表管理')
+    
     args = parser.parse_args()
     
     # 初始化日志系统(不需要,使用独立的日志系统)
@@ -35,8 +40,15 @@ def main():
         sys.exit(1)
     
     try:
-        # 执行任务
-        result = asyncio.run(execute_task(args.task_id, args.execution_mode, args.step_name))
+        # 传递新参数到execute_task
+        result = asyncio.run(execute_task(
+            args.task_id, 
+            args.execution_mode, 
+            args.step_name,
+            args.backup_vector_tables,
+            args.truncate_vector_tables,
+            args.skip_training
+        ))
         
         # 输出结果到stdout(供父进程读取)
         print(json.dumps(result, ensure_ascii=False, default=str))
@@ -55,11 +67,13 @@ def main():
         sys.exit(1)
 
 
-async def execute_task(task_id: str, execution_mode: str, step_name: str = None):
+async def execute_task(task_id: str, execution_mode: str, step_name: str = None, 
+                      backup_vector_tables: bool = False, truncate_vector_tables: bool = False,
+                      skip_training: bool = False):
     """执行任务的异步函数"""
     executor = None
     try:
-        executor = SimpleWorkflowExecutor(task_id)
+        executor = SimpleWorkflowExecutor(task_id, backup_vector_tables, truncate_vector_tables, skip_training)
         
         if execution_mode == "complete":
             return await executor.execute_complete_workflow()

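execute_task 的新签名除了在命令行中使用,也可以在其他异步入口中直接调用,三个 Vector 表管理参数会被透传给 SimpleWorkflowExecutor。调用示意如下(task_id 为示例值):

import asyncio
from data_pipeline.task_executor import execute_task

result = asyncio.run(execute_task(
    "manual_20250720_130541",     # task_id
    "complete",                   # execution_mode
    None,                         # step_name,complete模式下不需要
    backup_vector_tables=False,
    truncate_vector_tables=True,  # 下游会自动联动备份
    skip_training=False,
))
print(result)
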
+ 192 - 26
data_pipeline/trainer/run_training.py

@@ -198,46 +198,53 @@ def train_formatted_question_sql_pairs(formatted_file):
     # 按双空行分割不同的问答对
     # 使用更精确的分隔符,避免误识别
     pairs = []
-    blocks = content.split("\n\nQuestion:")
+    # 使用大小写不敏感的正则表达式来分割
+    import re
+    blocks = re.split(r'\n\n(?=question\s*:)', content, flags=re.IGNORECASE)
     
     # 处理第一块(可能没有前导的"\n\nQuestion:")
     first_block = blocks[0]
-    if first_block.strip().startswith("Question:"):
+    if re.search(r'^\s*question\s*:', first_block.strip(), re.IGNORECASE):
         pairs.append(first_block.strip())
-    elif "Question:" in first_block:
+    elif re.search(r'question\s*:', first_block, re.IGNORECASE):
         # 处理文件开头没有Question:的情况
-        question_start = first_block.find("Question:")
-        pairs.append(first_block[question_start:].strip())
+        question_match = re.search(r'question\s*:', first_block, re.IGNORECASE)
+        pairs.append(first_block[question_match.start():].strip())
     
     # 处理其余块
     for block in blocks[1:]:
-        pairs.append("Question:" + block.strip())
+        pairs.append(block.strip())
     
     # 处理每个问答对
     successfully_processed = 0
     for idx, pair in enumerate(pairs, start=1):
         try:
-            if "Question:" not in pair or "SQL:" not in pair:
+            # 使用大小写不敏感的匹配
+            question_match = re.search(r'question\s*:', pair, re.IGNORECASE)
+            sql_match = re.search(r'sql\s*:', pair, re.IGNORECASE)
+            
+            if not question_match or not sql_match:
                 print(f" 跳过不符合格式的对 #{idx}")
                 continue
-                
-            # 提取问题部分
-            question_start = pair.find("Question:") + len("Question:")
-            sql_start = pair.find("SQL:", question_start)
             
-            if sql_start == -1:
+            # 确保SQL在Question之后
+            if sql_match.start() <= question_match.end():
                 print(f" SQL部分未找到,跳过对 #{idx}")
                 continue
                 
+            # 提取问题部分
+            question_start = question_match.end()
+            sql_start = sql_match.start()
+            
             question = pair[question_start:sql_start].strip()
             
             # 提取SQL部分(支持多行)
-            sql_part = pair[sql_start + len("SQL:"):].strip()
+            sql_part = pair[sql_match.end():].strip()
             
             # 检查是否存在下一个Question标记(防止解析错误)
-            next_question = pair.find("Question:", sql_start)
-            if next_question != -1:
-                sql_part = pair[sql_start + len("SQL:"):next_question].strip()
+            next_question_match = re.search(r'question\s*:', pair[sql_match.end():], re.IGNORECASE)
+            if next_question_match:
+                sql_part = pair[sql_match.end():sql_match.end() + next_question_match.start()].strip()
             
             if not question or not sql_part:
                 print(f" 问题或SQL为空,跳过对 #{idx}")
@@ -255,6 +262,25 @@ def train_formatted_question_sql_pairs(formatted_file):
     
     print(f"格式化问答训练完成,共成功处理 {successfully_processed} 对问答(总计 {len(pairs)} 对)")
 
+def _is_valid_training_file(filename: str) -> bool:
+    """判断是否为有效的训练文件"""
+    import re
+    filename_lower = filename.lower()
+    
+    # 排除带数字后缀的文件
+    if re.search(r'\.(ddl|md)_\d+$', filename_lower):
+        return False
+    
+    # 排除 _old 后缀的文件
+    if filename_lower.endswith('_old'):
+        return False
+    
+    # 排除 .backup 相关文件
+    if '.backup' in filename_lower:
+        return False
+    
+    return True
+
 def train_json_question_sql_pairs(json_file):
     """训练JSON格式的问答对
     
@@ -280,12 +306,30 @@ def train_json_question_sql_pairs(json_file):
         for idx, pair in enumerate(data, start=1):
             try:
                 # 检查问答对格式
-                if not isinstance(pair, dict) or "question" not in pair or "sql" not in pair:
+                if not isinstance(pair, dict):
+                    print(f" 跳过不符合格式的对 #{idx}")
+                    continue
+                
+                # 大小写不敏感地查找question和sql键
+                question_key = None
+                sql_key = None
+                question_value = None
+                sql_value = None
+                
+                for key, value in pair.items():
+                    if key.lower() == "question":
+                        question_key = key
+                        question_value = value
+                    elif key.lower() == "sql":
+                        sql_key = key
+                        sql_value = value
+                
+                if question_key is None or sql_key is None:
                     print(f" 跳过不符合格式的对 #{idx}")
                     continue
                 
-                question = pair["question"].strip()
-                sql = pair["sql"].strip()
+                question = str(question_value).strip()
+                sql = str(sql_value).strip()
                 
                 if not question or not sql:
                     print(f" 问题或SQL为空,跳过对 #{idx}")
@@ -308,12 +352,18 @@ def train_json_question_sql_pairs(json_file):
     except Exception as e:
         print(f" 错误:处理JSON问答训练 - {e}")
 
-def process_training_files(data_path, task_id=None):
+def process_training_files(data_path, task_id=None, backup_vector_tables=False, truncate_vector_tables=False, skip_training=False):
     """处理指定路径下的所有训练文件
     
     Args:
         data_path (str): 训练数据目录路径
         task_id (str): 任务ID,用于日志记录
+        backup_vector_tables (bool): 是否备份vector表数据
+        truncate_vector_tables (bool): 是否清空vector表数据
+        skip_training (bool): 是否跳过训练文件处理,仅执行Vector表管理
+    
+    Returns:
+        tuple: (处理成功标志, Vector表管理统计信息)
     """
     # 初始化日志
     if task_id:
@@ -341,6 +391,37 @@ def process_training_files(data_path, task_id=None):
         else:
             print(message)
     
+    # Vector表管理(前置步骤)
+    vector_stats = None
+    if backup_vector_tables or truncate_vector_tables:
+        # 参数验证和自动启用逻辑
+        if truncate_vector_tables:
+            backup_vector_tables = True
+        
+        try:
+            import asyncio
+            from data_pipeline.trainer.vector_table_manager import VectorTableManager
+            
+            log_message("🗂️ 开始执行Vector表管理...")
+            
+            vector_manager = VectorTableManager(data_path, task_id)
+            vector_stats = vector_manager.execute_vector_management(backup_vector_tables, truncate_vector_tables)
+            
+            log_message("✅ Vector表管理完成")
+            
+        except Exception as e:
+            log_message(f"❌ Vector表管理失败: {e}", "error")
+            return False, None
+        
+        # 如果是跳过训练模式,跳过训练文件处理
+        if skip_training:
+            log_message("✅ Vector表管理完成,跳过训练文件处理(skip_training=True)")
+            return True, vector_stats
+    elif skip_training:
+        # 如果设置了skip_training但没有Vector操作,记录警告并跳过
+        log_message("⚠️ 设置了skip_training=True但未指定Vector操作,跳过所有处理")
+        return True, None
+    
     # 初始化统计计数器
     stats = {
         "ddl": 0,
@@ -365,6 +446,11 @@ def process_training_files(data_path, task_id=None):
             
             # 根据文件类型调用相应的处理函数
             try:
+                # 检查是否为有效的训练文件
+                if not _is_valid_training_file(item):
+                    log_message(f"跳过无效训练文件: {item}")
+                    continue
+                    
                 if file_lower.endswith(".ddl"):
                     log_message(f"处理DDL文件: {item_path}")
                     train_ddl_statements(item_path)
@@ -396,7 +482,7 @@ def process_training_files(data_path, task_id=None):
                 
     except OSError as e:
         log_message(f"读取目录失败: {e}", "error")
-        return False
+        return False, vector_stats
     
     # 打印处理统计
     log_message("训练文件处理统计:")
@@ -409,9 +495,9 @@ def process_training_files(data_path, task_id=None):
     total_files = sum(stats.values())
     if total_files == 0:
         log_message(f"警告: 在目录 {data_path} 中未找到任何可训练的文件", "warning")
-        return False
+        return False, vector_stats
         
-    return True
+    return True, vector_stats
 
 def check_pgvector_connection():
     """检查 PgVector 数据库连接是否可用
@@ -508,14 +594,55 @@ def main():
         project_root = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
         return os.path.join(project_root, config_path)
     
+    def resolve_data_path_with_task_id(task_id):
+        """使用task_id构建训练数据路径"""
+        # 使用data_pipeline统一配置
+        try:
+            from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+            base_dir = SCHEMA_TOOLS_CONFIG.get("output_directory", './data_pipeline/training_data/')
+        except ImportError:
+            # 如果无法导入data_pipeline配置,使用默认路径
+            base_dir = './data_pipeline/training_data/'
+        
+        # 处理相对路径
+        from pathlib import Path
+        if not Path(base_dir).is_absolute():
+            # 相对于项目根目录解析
+            project_root = Path(__file__).parent.parent.parent
+            base_dir = project_root / base_dir
+        
+        return str(Path(base_dir) / task_id)
+    
     default_path = resolve_training_data_path()
     
+    # 参数定义
+    parser.add_argument(
+        '--task-id',
+        help='任务ID,指定后将自动构建训练数据目录路径 (基础目录/task_id)'
+    )
+    
     parser.add_argument('--data_path', type=str, default=default_path,
                         help='训练数据目录路径 (默认: 从data_pipeline.config.SCHEMA_TOOLS_CONFIG)')
+    
+    parser.add_argument('--backup-vector-tables', action='store_true',
+                        help='备份vector表数据')
+    
+    parser.add_argument('--truncate-vector-tables', action='store_true',
+                        help='清空vector表数据(自动启用备份)')
+    
+    parser.add_argument('--skip-training', action='store_true',
+                        help='跳过训练文件处理,仅执行Vector表管理')
+    
     args = parser.parse_args()
     
-    # 使用Path对象处理路径以确保跨平台兼容性
-    data_path = Path(args.data_path)
+    # 处理task_id和data_path的关系
+    if args.task_id:
+        # 如果指定了task_id,覆盖data_path
+        data_path = Path(resolve_data_path_with_task_id(args.task_id))
+        print(f"使用task_id构建路径: {args.task_id}")
+    else:
+        # 使用指定或默认的data_path
+        data_path = Path(args.data_path)
     
     # 显示路径解析结果
     print(f"\n===== 训练数据路径配置 =====")
@@ -525,6 +652,9 @@ def main():
         print(f"data_pipeline配置路径: {config_value}")
     except ImportError:
         print(f"data_pipeline配置: 无法导入")
+    
+    if args.task_id:
+        print(f"指定的task_id: {args.task_id}")
     print(f"解析后的绝对路径: {os.path.abspath(data_path)}")
     print("==============================")
     
@@ -580,7 +710,10 @@ def main():
         print(f"\n===== 未知的向量数据库类型: {vector_db_type} =====\n")
     
     # 处理训练文件
-    process_successful = process_training_files(data_path)
+    process_successful, vector_stats = process_training_files(data_path, args.task_id, 
+                                                             args.backup_vector_tables, 
+                                                             args.truncate_vector_tables,
+                                                             args.skip_training)
     
     if process_successful:
         # 训练结束,刷新和关闭批处理器
@@ -617,6 +750,39 @@ def main():
     else:
         print("\n===== 未能找到或处理任何训练文件,训练过程终止 =====")
     
+    # Vector表管理总结
+    print("\n===== Vector表管理统计 =====")
+    if vector_stats:
+        if vector_stats.get("backup_performed", False):
+            tables_info = vector_stats.get("tables_backed_up", {})
+            print(f"✓ 备份执行: 成功备份 {len(tables_info)} 个表")
+            for table_name, info in tables_info.items():
+                if info.get("success", False):
+                    print(f"  - {table_name}: {info['row_count']}行 -> {info['backup_file']} ({info['file_size']})")
+                else:
+                    print(f"  - {table_name}: 备份失败 - {info.get('error', '未知错误')}")
+        else:
+            print("- 备份执行: 未执行")
+            
+        if vector_stats.get("truncate_performed", False):
+            truncate_info = vector_stats.get("truncate_results", {})
+            print("✓ 清空执行: langchain_pg_embedding表已清空")
+            for table_name, info in truncate_info.items():
+                if info.get("success", False):
+                    print(f"  - {table_name}: {info['rows_before']}行 -> 0行")
+                else:
+                    print(f"  - {table_name}: 清空失败 - {info.get('error', '未知错误')}")
+        else:
+            print("- 清空执行: 未执行")
+            
+        print(f"✓ 总耗时: {vector_stats.get('duration', 0):.1f}秒")
+        
+        if vector_stats.get("errors"):
+            print(f"⚠ 错误: {'; '.join(vector_stats['errors'])}")
+    else:
+        print("- 未执行vector表管理操作")
+    print("===========================")
+    
     # 输出embedding模型信息
     print("\n===== Embedding模型信息 =====")
     try:

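Note: with these changes run_training.py can be driven per task from the command line (e.g. --task-id manual_20250720_134836 --backup-vector-tables --truncate-vector-tables, or --skip-training to perform only the vector-table maintenance), and process_training_files now returns a (success, vector_stats) tuple. A minimal programmatic sketch, assuming the module is importable from the project root and the task directory already exists:

    from data_pipeline.trainer.run_training import process_training_files

    # Placeholder path: <output_directory>/<task_id>, as built by resolve_data_path_with_task_id().
    data_path = "./data_pipeline/training_data/manual_20250720_134836"

    ok, vector_stats = process_training_files(
        data_path,
        task_id="manual_20250720_134836",
        backup_vector_tables=True,
        truncate_vector_tables=True,   # auto-enables backup inside the function
        skip_training=False,
    )

    if ok and vector_stats:
        print(vector_stats.get("backup_performed"), vector_stats.get("duration"))
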
+ 358 - 0
data_pipeline/trainer/vector_table_manager.py

@@ -0,0 +1,358 @@
+import asyncio
+import time
+import os
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, Any, List
+import psycopg2
+import logging
+
+
+class VectorTableManager:
+    """Vector表管理器,负责备份和清空操作"""
+    
+    def __init__(self, task_output_dir: str, task_id: str = None):
+        """
+        Args:
+            task_output_dir: 任务输出目录(用于存放备份文件)
+            task_id: 任务ID(用于日志记录)
+        Note:
+            数据库连接将从data_pipeline.config.SCHEMA_TOOLS_CONFIG自动获取
+        """
+        self.task_output_dir = task_output_dir
+        self.task_id = task_id
+        
+        # 从data_pipeline.config获取配置
+        from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+        self.config = SCHEMA_TOOLS_CONFIG.get("vector_table_management", {})
+        
+        # 初始化日志
+        if task_id:
+            from data_pipeline.dp_logging import get_logger
+            self.logger = get_logger("VectorTableManager", task_id)
+        else:
+            import logging
+            self.logger = logging.getLogger("VectorTableManager")
+    
+    def execute_vector_management(self, backup: bool, truncate: bool) -> Dict[str, Any]:
+        """执行vector表管理操作的主流程"""
+        
+        start_time = time.time()
+        
+        # 1. 参数验证和自动启用逻辑
+        if truncate and not backup:
+            backup = True
+            self.logger.info("🔄 启用truncate时自动启用backup")
+        
+        if not backup and not truncate:
+            self.logger.info("⏭️ 未启用vector表管理,跳过操作")
+            return {"backup_performed": False, "truncate_performed": False}
+        
+        # 2. 初始化结果统计
+        result = {
+            "backup_performed": backup,
+            "truncate_performed": truncate,
+            "tables_backed_up": {},
+            "truncate_results": {},
+            "errors": [],
+            "backup_directory": None,
+            "duration": 0
+        }
+        
+        try:
+            # 3. 创建备份目录
+            backup_dir = Path(self.task_output_dir) / self.config.get("backup_directory", "vector_bak")
+            if backup:
+                backup_dir.mkdir(parents=True, exist_ok=True)
+                result["backup_directory"] = str(backup_dir)
+                self.logger.info(f"📁 备份目录: {backup_dir}")
+            
+            # 4. 执行备份操作
+            if backup:
+                self.logger.info("🗂️ 开始备份vector表...")
+                backup_results = self.backup_vector_tables()
+                result["tables_backed_up"] = backup_results
+                
+                # 检查备份是否全部成功
+                backup_failed = any(not r.get("success", False) for r in backup_results.values())
+                if backup_failed:
+                    result["errors"].append("部分表备份失败")
+                    if truncate:
+                        self.logger.error("❌ 备份失败,取消清空操作")
+                        result["truncate_performed"] = False
+                        truncate = False
+            
+            # 5. 执行清空操作(仅在备份成功时)
+            if truncate:
+                self.logger.info("🗑️ 开始清空vector表...")
+                truncate_results = self.truncate_vector_tables()
+                result["truncate_results"] = truncate_results
+                
+                # 检查清空是否成功
+                truncate_failed = any(not r.get("success", False) for r in truncate_results.values())
+                if truncate_failed:
+                    result["errors"].append("部分表清空失败")
+            
+            # 6. 生成备份日志文件
+            if backup and backup_dir.exists():
+                self._write_backup_log(backup_dir, result)
+            
+            # 7. 计算总耗时
+            result["duration"] = time.time() - start_time
+            
+            # 8. 记录最终状态
+            if result["errors"]:
+                self.logger.warning(f"⚠️ Vector表管理完成,但有错误: {'; '.join(result['errors'])}")
+            else:
+                self.logger.info(f"✅ Vector表管理完成,耗时: {result['duration']:.2f}秒")
+            
+            return result
+            
+        except Exception as e:
+            result["duration"] = time.time() - start_time
+            result["errors"].append(f"执行失败: {str(e)}")
+            self.logger.error(f"❌ Vector表管理失败: {e}")
+            raise
+    
+    def backup_vector_tables(self) -> Dict[str, Any]:
+        """备份vector表数据"""
+        
+        # 1. 创建备份目录
+        backup_dir = Path(self.task_output_dir) / self.config.get("backup_directory", "vector_bak")
+        backup_dir.mkdir(parents=True, exist_ok=True)
+        
+        # 2. 生成时间戳
+        timestamp = datetime.now().strftime(self.config.get("timestamp_format", "%Y%m%d_%H%M%S"))
+        
+        # 3. 执行备份(每个表分别处理)
+        results = {}
+        supported_tables = self.config.get("supported_tables", ["langchain_pg_collection", "langchain_pg_embedding"])
+        
+        for table_name in supported_tables:
+            try:
+                # 3.1 定义文件路径(.tmp临时文件)
+                temp_file = backup_dir / f"{table_name}_{timestamp}.csv.tmp"
+                final_file = backup_dir / f"{table_name}_{timestamp}.csv"
+                
+                # 确保使用绝对路径(PostgreSQL COPY命令要求)
+                temp_file_abs = temp_file.resolve()
+                
+                # 3.2 通过psycopg2使用流式客户端导出(支持大数据量)
+                start_time = time.time()
+                row_count = 0
+                batch_size = 10000  # 每批处理1万条记录
+                
+                with self.get_connection() as conn:
+                    # 临时关闭autocommit以支持流式处理
+                    old_autocommit = conn.autocommit
+                    conn.autocommit = False
+                    
+                    try:
+                        with conn.cursor() as cursor:
+                            # 设置游标为流式模式
+                            cursor.itersize = batch_size
+                            
+                            # 执行编码设置
+                            cursor.execute("SET client_encoding TO 'UTF8'")
+                            
+                            # 执行查询
+                            cursor.execute(f"SELECT * FROM {table_name}")
+                            
+                            # 获取列名
+                            colnames = [desc[0] for desc in cursor.description]
+                            
+                            # 使用流式方式写入CSV文件
+                            import csv
+                            with open(temp_file_abs, 'w', newline='', encoding='utf-8') as csvfile:
+                                writer = csv.writer(csvfile)
+                                
+                                # 写入表头
+                                writer.writerow(colnames)
+                                
+                                # 流式读取和写入数据
+                                while True:
+                                    rows = cursor.fetchmany(batch_size)
+                                    if not rows:
+                                        break
+                                        
+                                    # 批量写入当前批次的数据
+                                    for row in rows:
+                                        writer.writerow(row)
+                                        row_count += 1
+                                    
+                                    # 记录进度(大数据量时有用)
+                                    if row_count % (batch_size * 5) == 0:  # 每5万条记录记录一次
+                                        self.logger.info(f"📊 {table_name} 已导出 {row_count} 行数据...")
+                        
+                        # 提交事务
+                        conn.commit()
+                        
+                    finally:
+                        # 恢复原来的autocommit设置
+                        conn.autocommit = old_autocommit
+                
+                self.logger.info(f"📊 {table_name} 流式导出完成,总计 {row_count} 行")
+                
+                # 3.3 导出完成后,重命名文件 (.tmp -> .csv)
+                if temp_file.exists():
+                    temp_file.rename(final_file)
+                    
+                    # 3.4 获取文件信息
+                    file_stat = final_file.stat()
+                    duration = time.time() - start_time
+                    
+                    results[table_name] = {
+                        "success": True,
+                        "row_count": row_count,
+                        "file_size": self._format_file_size(file_stat.st_size),
+                        "backup_file": final_file.name,
+                        "duration": duration
+                    }
+                    
+                    self.logger.info(f"✅ {table_name} 备份成功: {row_count}行 -> {final_file.name}")
+                else:
+                    raise Exception(f"临时文件 {temp_file} 未生成")
+                    
+            except Exception as e:
+                results[table_name] = {
+                    "success": False,
+                    "error": str(e)
+                }
+                self.logger.error(f"❌ {table_name} 备份失败: {e}")
+                
+                # 清理可能的临时文件
+                if temp_file.exists():
+                    temp_file.unlink()
+        
+        return results
+    
+    def truncate_vector_tables(self) -> Dict[str, Any]:
+        """清空vector表数据(只清空langchain_pg_embedding)"""
+        
+        results = {}
+        
+        # 只清空配置中指定的表(通常只有langchain_pg_embedding)
+        truncate_tables = self.config.get("truncate_tables", ["langchain_pg_embedding"])
+        
+        for table_name in truncate_tables:
+            try:
+                # 记录清空前的行数(用于统计)
+                count_sql = f"SELECT COUNT(*) FROM {table_name}"
+                
+                start_time = time.time()
+                with self.get_connection() as conn:
+                    with conn.cursor() as cursor:
+                        # 1. 获取清空前的行数
+                        cursor.execute(count_sql)
+                        rows_before = cursor.fetchone()[0]
+                        
+                        # 2. 执行TRUNCATE
+                        cursor.execute(f"TRUNCATE TABLE {table_name}")
+                        
+                        # 3. 验证清空结果
+                        cursor.execute(count_sql)
+                        rows_after = cursor.fetchone()[0]
+                
+                duration = time.time() - start_time
+                
+                if rows_after == 0:
+                    results[table_name] = {
+                        "success": True,
+                        "rows_before": rows_before,
+                        "rows_after": rows_after,
+                        "duration": duration
+                    }
+                    self.logger.info(f"✅ {table_name} 清空成功: {rows_before}行 -> 0行")
+                else:
+                    raise Exception(f"清空失败,表中仍有 {rows_after} 行数据")
+                    
+            except Exception as e:
+                results[table_name] = {
+                    "success": False,
+                    "error": str(e)
+                }
+                self.logger.error(f"❌ {table_name} 清空失败: {e}")
+        
+        return results
+    
+    def get_connection(self):
+        """获取pgvector数据库连接(从data_pipeline.config获取配置)"""
+        import psycopg2
+        
+        try:
+            # 方法1:如果SCHEMA_TOOLS_CONFIG中有连接字符串,直接使用
+            from data_pipeline.config import SCHEMA_TOOLS_CONFIG
+            connection_string = SCHEMA_TOOLS_CONFIG.get("default_db_connection")
+            if connection_string:
+                conn = psycopg2.connect(connection_string)
+            else:
+                # 方法2:从app_config获取pgvector数据库配置
+                import app_config
+                pgvector_config = app_config.PGVECTOR_CONFIG
+                conn = psycopg2.connect(
+                    host=pgvector_config.get('host'),
+                    port=pgvector_config.get('port'),
+                    database=pgvector_config.get('dbname'),
+                    user=pgvector_config.get('user'),
+                    password=pgvector_config.get('password')
+                )
+            
+            # 设置自动提交,避免事务问题
+            conn.autocommit = True
+            return conn
+            
+        except Exception as e:
+            self.logger.error(f"pgvector数据库连接失败: {e}")
+            raise
+
+    def _write_backup_log(self, backup_dir: Path, result: Dict[str, Any]):
+        """写入详细的备份日志"""
+        log_file = backup_dir / "vector_backup_log.txt"
+        
+        try:
+            with open(log_file, 'w', encoding='utf-8') as f:
+                f.write("=== Vector Table Backup Log ===\n")
+                f.write(f"Backup Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
+                f.write(f"Task ID: {self.task_id or 'Unknown'}\n")
+                f.write(f"Duration: {result.get('duration', 0):.2f}s\n\n")
+                
+                # 备份状态
+                f.write("Tables Backup Status:\n")
+                for table_name, info in result.get("tables_backed_up", {}).items():
+                    if info.get("success", False):
+                        f.write(f"✓ {table_name}: {info['row_count']} rows -> {info['backup_file']} ({info['file_size']})\n")
+                    else:
+                        f.write(f"✗ {table_name}: FAILED - {info.get('error', 'Unknown error')}\n")
+                
+                # 清空状态
+                if result.get("truncate_performed", False):
+                    f.write("\nTruncate Status:\n")
+                    for table_name, info in result.get("truncate_results", {}).items():
+                        if info.get("success", False):
+                            f.write(f"✓ {table_name}: TRUNCATED ({info['rows_before']} rows removed)\n")
+                        else:
+                            f.write(f"✗ {table_name}: FAILED - {info.get('error', 'Unknown error')}\n")
+                else:
+                    f.write("\nTruncate Status:\n- Not performed\n")
+                
+                # 错误汇总
+                if result.get("errors"):
+                    f.write(f"\nErrors: {'; '.join(result['errors'])}\n")
+                    
+        except Exception as e:
+            self.logger.warning(f"写入备份日志失败: {e}")
+    
+    def _format_file_size(self, size_bytes: int) -> str:
+        """格式化文件大小显示"""
+        if size_bytes == 0:
+            return "0 B"
+        
+        size_names = ["B", "KB", "MB", "GB"]
+        i = 0
+        size = float(size_bytes)
+        
+        while size >= 1024.0 and i < len(size_names) - 1:
+            size /= 1024.0
+            i += 1
+        
+        return f"{size:.1f} {size_names[i]}" 

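Note: a short usage sketch of the new VectorTableManager (illustrative only; the directory and task id are placeholders). Backups are written as <table>_<timestamp>.csv under <task_output_dir>/vector_bak, and truncation is skipped when any backup fails:

    from data_pipeline.trainer.vector_table_manager import VectorTableManager

    mgr = VectorTableManager(
        task_output_dir="./data_pipeline/training_data/manual_20250720_134836",
        task_id="manual_20250720_134836",
    )

    # truncate=True auto-enables backup; the result mirrors the keys summarised in main().
    result = mgr.execute_vector_management(backup=True, truncate=True)

    for table, info in result["tables_backed_up"].items():
        print(table, "ok" if info.get("success") else info.get("error"))
    print(f"duration: {result['duration']:.1f}s")
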
+ 2 - 2
data_pipeline/training_data/task_20250701_212426/bss_business_day_data.ddl → data_pipeline/training_data/manual_20250720_134836/bss_business_day_data.ddl

@@ -1,5 +1,5 @@
--- 中文名: 高速公路服务区每日经营数据记录表
--- 描述: 高速公路服务区每日经营数据记录表,存储交易流水、运营统计及状态变更信息,支撑业务分析与运营管理
+-- 中文名: `bss_business_day_data` 表用于记录高速公路服务区每日业务统计数据
+-- 描述: `bss_business_day_data` 表用于记录高速公路服务区每日业务统计数据,包括创建、更新、删除操作的时间戳及操作人信息,支持业务数据的版本管理和审计追溯
 create table public.bss_business_day_data (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,

+ 2 - 2
data_pipeline/training_data/task_20250701_212426/bss_business_day_data_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_business_day_data_detail.md

@@ -1,5 +1,5 @@
-## bss_business_day_data(高速公路服务区每日经营数据记录表
-bss_business_day_data 表高速公路服务区每日经营数据记录表,存储交易流水、运营统计及状态变更信息,支撑业务分析与运营管理
+## bss_business_day_data(`bss_business_day_data` 表用于记录高速公路服务区每日业务统计数据
+bss_business_day_data 表`bss_business_day_data` 表用于记录高速公路服务区每日业务统计数据,包括创建、更新、删除操作的时间戳及操作人信息,支持业务数据的版本管理和审计追溯
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 00827DFF993D415488EA1F07CAE6C440, 00e799048b8cbb8ee758eac9c8b4b820]
 - version (integer) - 版本号 [非空] [示例: 1]

+ 2 - 2
data_pipeline/training_data/task_20250701_131627/bss_car_day_count.ddl → data_pipeline/training_data/manual_20250720_134836/bss_car_day_count.ddl

@@ -1,5 +1,5 @@
--- 中文名: 服务区车辆日统计表
--- 描述: 服务区车辆日统计表,记录各类型车辆日通行量及操作信息,用于交通流量分析和运营管理
+-- 中文名: `bss_car_day_count` 表用于记录每日车辆统计信息
+-- 描述: `bss_car_day_count` 表用于记录每日车辆统计信息,包括车辆数量和类别等关键指标,支撑服务区车流分析与运营决策
 create table public.bss_car_day_count (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,

+ 2 - 2
data_pipeline/training_data/task_20250701_131627/bss_car_day_count_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_car_day_count_detail.md

@@ -1,5 +1,5 @@
-## bss_car_day_count(服务区车辆日统计表
-bss_car_day_count 表服务区车辆日统计表,记录各类型车辆日通行量及操作信息,用于交通流量分析和运营管理
+## bss_car_day_count(`bss_car_day_count` 表用于记录每日车辆统计信息
+bss_car_day_count 表`bss_car_day_count` 表用于记录每日车辆统计信息,包括车辆数量和类别等关键指标,支撑服务区车流分析与运营决策
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 00022c1c99ff11ec86d4fa163ec0f8fc, 00022caa99ff11ec86d4fa163ec0f8fc]
 - version (integer) - 版本号 [非空] [示例: 1]

+ 3 - 3
data_pipeline/training_data/task_20250703_012750/bss_company.ddl → data_pipeline/training_data/manual_20250720_134836/bss_company.ddl

@@ -1,7 +1,7 @@
--- 中文名: 公司信息
--- 描述: 公司信息表,存储BSS系统中的公司名称、编码及变更记录
+-- 中文名: `bss_company` 表用于存储高速公路服务区相关公司的基本信息
+-- 描述: `bss_company` 表用于存储高速公路服务区相关公司的基本信息,包括公司名称、编码及操作记录,为服务区运营管理提供组织数据支撑。
 create table public.bss_company (
-  id varchar(32) not null     -- 主键ID,主键,
+  id varchar(32) not null     -- 公司唯一标识,主键,
   version integer not null    -- 版本号,
   create_ts timestamp         -- 创建时间,
   created_by varchar(50)      -- 创建人,

+ 3 - 3
data_pipeline/training_data/task_20250703_012750/bss_company_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_company_detail.md

@@ -1,7 +1,7 @@
-## bss_company(公司信息
-bss_company 表公司信息表,存储BSS系统中的公司名称、编码及变更记录
+## bss_company(`bss_company` 表用于存储高速公路服务区相关公司的基本信息)
+bss_company 表`bss_company` 表用于存储高速公路服务区相关公司的基本信息,包括公司名称、编码及操作记录,为服务区运营管理提供组织数据支撑。
 字段列表:
-- id (varchar(32)) - 主键ID [主键, 非空] [示例: 30675d85ba5044c31acfa243b9d16334, 47ed0bb37f5a85f3d9245e4854959b81]
+- id (varchar(32)) - 公司唯一标识 [主键, 非空] [示例: 30675d85ba5044c31acfa243b9d16334, 47ed0bb37f5a85f3d9245e4854959b81]
 - version (integer) - 版本号 [非空] [示例: 1, 2]
 - create_ts (timestamp) - 创建时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
 - created_by (varchar(50)) - 创建人 [示例: admin]

+ 3 - 3
data_pipeline/training_data/task_20250701_131627/bss_section_route.ddl → data_pipeline/training_data/manual_20250720_134836/bss_section_route.ddl

@@ -1,5 +1,5 @@
--- 中文名: 存储高速公路路段与路线信息
--- 描述: 存储高速公路路段与路线信息,支持服务区路线关联管理
+-- 中文名: 路段与路线信息
+-- 描述: 路段与路线信息表,用于管理高速公路服务区所属路段及路线名称等基础信息
 create table public.bss_section_route (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,
@@ -11,6 +11,6 @@ create table public.bss_section_route (
   deleted_by varchar(50)      -- 删除人,
   section_name varchar(255)   -- 路段名称,
   route_name varchar(255)     -- 路线名称,
-  code varchar(255)           -- 路段编号,
+  code varchar(255)           -- 编号,
   primary key (id)
 );

+ 1 - 1
data_pipeline/training_data/task_20250703_012750/bss_section_route_area_link.ddl → data_pipeline/training_data/manual_20250720_134836/bss_section_route_area_link.ddl

@@ -1,5 +1,5 @@
 -- 中文名: 路线与服务区关联表
--- 描述: 路线与服务区关联表,记录路线经过的服务区信息
+-- 描述: 路线与服务区关联表,记录高速公路路线对应的服务区信息。
 create table public.bss_section_route_area_link (
   section_route_id varchar(32) not null -- 路段路线ID,主键,
   service_area_id varchar(32) not null -- 服务区ID,主键,

+ 1 - 1
data_pipeline/training_data/task_20250703_012750/bss_section_route_area_link_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_section_route_area_link_detail.md

@@ -1,5 +1,5 @@
 ## bss_section_route_area_link(路线与服务区关联表)
-bss_section_route_area_link 表路线与服务区关联表,记录路线经过的服务区信息
+bss_section_route_area_link 表路线与服务区关联表,记录高速公路路线对应的服务区信息。
 字段列表:
 - section_route_id (varchar(32)) - 路段路线ID [主键, 非空] [示例: v8elrsfs5f7lt7jl8a6p87smfzesn3rz, hxzi2iim238e3s1eajjt1enmh9o4h3wp]
 - service_area_id (varchar(32)) - 服务区ID [主键, 非空] [示例: 08e01d7402abd1d6a4d9fdd5df855ef8, 091662311d2c737029445442ff198c4c]

+ 3 - 3
data_pipeline/training_data/task_20250703_000820/bss_section_route_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_section_route_detail.md

@@ -1,5 +1,5 @@
-## bss_section_route(路段路线关联表)
-bss_section_route 表路段路线关联表,维护路段与路线的对应关系,支持高速公路路线规划与管理
+## bss_section_route(路段与路线信息表)
+bss_section_route 表路段与路线信息表,用于管理高速公路服务区所属路段及路线名称等基础信息
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 04ri3j67a806uw2c6o6dwdtz4knexczh, 0g5mnefxxtukql2cq6acul7phgskowy7]
 - version (integer) - 版本号 [非空] [示例: 1, 0]
@@ -11,6 +11,6 @@ bss_section_route 表路段路线关联表,维护路段与路线的对应关
 - deleted_by (varchar(50)) - 删除人
 - section_name (varchar(255)) - 路段名称 [示例: 昌栗, 昌宁, 昌九]
 - route_name (varchar(255)) - 路线名称 [示例: 昌栗, 昌韶, /]
-- code (varchar(255)) - 路段编号 [示例: SR0001, SR0002]
+- code (varchar(255)) - 编号 [示例: SR0001, SR0002, SR0147]
 字段补充说明:
 - id 为主键

+ 6 - 6
data_pipeline/training_data/task_20250703_000820/bss_service_area.ddl → data_pipeline/training_data/manual_20250720_134836/bss_service_area.ddl

@@ -1,18 +1,18 @@
--- 中文名: 存储高速公路服务区基本信息(名称、编码等)
--- 描述: 存储高速公路服务区基本信息(名称、编码等),支持服务区运营管理
+-- 中文名: `bss_service_area` 表用于存储高速公路服务区基本信息
+-- 描述: `bss_service_area` 表用于存储高速公路服务区的基本信息,包括服务区名称、编码及操作记录,为核心业务提供数据支撑
 create table public.bss_service_area (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,
   create_ts timestamp         -- 创建时间,
   created_by varchar(50)      -- 创建人,
-  update_ts timestamp         -- 最后更新时间,
-  updated_by varchar(50)      -- 最后更新人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
   delete_ts timestamp         -- 删除时间,
-  deleted_by varchar(50)      -- 删除操作人,
+  deleted_by varchar(50)      -- 删除人,
   service_area_name varchar(255) -- 服务区名称,
   service_area_no varchar(255) -- 服务区编码,
   company_id varchar(32)      -- 所属公司ID,
-  service_position varchar(255) -- 地理位置坐标,
+  service_position varchar(255) -- 服务区经纬度,
   service_area_type varchar(50) -- 服务区类型,
   service_state varchar(50)   -- 服务区状态,
   primary key (id)

+ 6 - 6
data_pipeline/training_data/task_20250703_000820/bss_service_area_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_service_area_detail.md

@@ -1,18 +1,18 @@
-## bss_service_area(存储高速公路服务区基本信息(名称、编码等)
-bss_service_area 表存储高速公路服务区基本信息(名称、编码等),支持服务区运营管理
+## bss_service_area(`bss_service_area` 表用于存储高速公路服务区基本信息)
+bss_service_area 表`bss_service_area` 表用于存储高速公路服务区的基本信息,包括服务区名称、编码及操作记录,为核心业务提供数据支撑
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 0271d68ef93de9684b7ad8c7aae600b6, 08e01d7402abd1d6a4d9fdd5df855ef8]
 - version (integer) - 版本号 [非空] [示例: 3, 6]
 - create_ts (timestamp) - 创建时间 [示例: 2021-05-21 13:26:40.589000, 2021-05-20 19:51:46.314000]
 - created_by (varchar(50)) - 创建人 [示例: admin]
-- update_ts (timestamp) - 最后更新时间 [示例: 2021-07-10 15:41:28.795000, 2021-07-11 09:33:08.455000]
-- updated_by (varchar(50)) - 最后更新人 [示例: admin]
+- update_ts (timestamp) - 更新时间 [示例: 2021-07-10 15:41:28.795000, 2021-07-11 09:33:08.455000]
+- updated_by (varchar(50)) - 更新人 [示例: admin]
 - delete_ts (timestamp) - 删除时间
-- deleted_by (varchar(50)) - 删除操作人 [示例: ]
+- deleted_by (varchar(50)) - 删除人 [示例: ]
 - service_area_name (varchar(255)) - 服务区名称 [示例: 白鹭湖停车区, 南昌南服务区]
 - service_area_no (varchar(255)) - 服务区编码 [示例: H0814, H0105]
 - company_id (varchar(32)) - 所属公司ID [示例: b1629f07c8d9ac81494fbc1de61f1ea5, ee9bf1180a2b45003f96e597a4b7f15a]
-- service_position (varchar(255)) - 地理位置坐标 [示例: 114.574721,26.825584, 115.910549,28.396355]
+- service_position (varchar(255)) - 服务区经纬度 [示例: 114.574721,26.825584, 115.910549,28.396355]
 - service_area_type (varchar(50)) - 服务区类型 [示例: 信息化服务区]
 - service_state (varchar(50)) - 服务区状态 [示例: 开放, 关闭]
 字段补充说明:

+ 2 - 2
data_pipeline/training_data/task_20250701_131627/bss_service_area_mapper.ddl → data_pipeline/training_data/manual_20250720_134836/bss_service_area_mapper.ddl

@@ -1,5 +1,5 @@
--- 中文名: BSS服务区基础信息映射表
--- 描述: BSS服务区基础信息映射表,记录服务区名称、编码及全生命周期操作日志
+-- 中文名: 服务区基础信息映射表
+-- 描述: 服务区基础信息映射表,用于统一管理全国高速服务区名称与编码的对应关系。
 create table public.bss_service_area_mapper (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,

+ 3 - 3
data_pipeline/training_data/task_20250703_012750/bss_service_area_mapper_detail.md → data_pipeline/training_data/manual_20250720_134836/bss_service_area_mapper_detail.md

@@ -1,5 +1,5 @@
-## bss_service_area_mapper(BSS系统服务区主数据表)
-bss_service_area_mapper 表BSS系统服务区主数据表,存储服务区名称、编码及版本生命周期信息
+## bss_service_area_mapper(服务区基础信息映射表)
+bss_service_area_mapper 表服务区基础信息映射表,用于统一管理全国高速服务区名称与编码的对应关系。
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 00e1e893909211ed8ee6fa163eaf653f, 013867f5962211ed8ee6fa163eaf653f]
 - version (integer) - 版本号 [非空] [示例: 1]
@@ -12,7 +12,7 @@ bss_service_area_mapper 表BSS系统服务区主数据表,存储服务区名
 - service_name (varchar(255)) - 服务区名称 [示例: 信丰西服务区, 南康北服务区]
 - service_no (varchar(255)) - 服务区编码 [示例: 1067, 1062]
 - service_area_id (varchar(32)) - 服务区ID [示例: 97cd6cd516a551409a4d453a58f9e170, fdbdd042962011ed8ee6fa163eaf653f]
-- source_system_type (varchar(50)) - 数据来源类别名称 [示例: 驿美, 驿购]
+- source_system_type (varchar(50)) - 数据来源系统类型 [示例: 驿美, 驿购]
 - source_type (integer) - 数据来源类别ID [示例: 3, 1]
 字段补充说明:
 - id 为主键

+ 70 - 0
data_pipeline/training_data/manual_20250720_134836/db_query_decision_prompt.txt

@@ -0,0 +1,70 @@
+{
+  "数据库业务范围": "当前数据库存储的是高速公路服务区运营管理的相关数据,主要涉及服务区业务统计、车辆流量、公司信息及路段路线关联数据,包含以下业务数据:",
+  "核心业务实体": [
+    {
+      "实体类型": "服务区",
+      "详细描述": "记录服务区基本信息及所属公司、状态、位置等,主要字段包括服务区名称、编码、类型、状态、经纬度、所属公司ID",
+      "主要字段": [
+        "service_area_name",
+        "service_area_no",
+        "service_area_type",
+        "service_state",
+        "company_id"
+      ]
+    },
+    {
+      "实体类型": "档口",
+      "详细描述": "记录服务区内的经营档口信息,包括档口名称、编码、所属服务区及业务来源,主要字段包括服务区编码、档口名称、档口编码、数据来源类别",
+      "主要字段": [
+        "service_no",
+        "branch_name",
+        "branch_no",
+        "source_type"
+      ]
+    },
+    {
+      "实体类型": "公司",
+      "详细描述": "记录服务区所属公司的基本信息,包括公司名称、编码,主要字段包括公司名称、公司编码",
+      "主要字段": [
+        "company_name",
+        "company_no"
+      ]
+    },
+    {
+      "实体类型": "路段与路线",
+      "详细描述": "记录高速公路路段与路线名称、编号,用于服务区所属路段管理,主要字段包括路段名称、路线名称、编号",
+      "主要字段": [
+        "section_name",
+        "route_name",
+        "code"
+      ]
+    },
+    {
+      "实体类型": "车辆",
+      "详细描述": "记录每日车辆统计信息,包括车辆数量、类别、统计日期,用于车流分析,主要字段包括车辆数量、车辆类别、统计日期",
+      "主要字段": [
+        "customer_count",
+        "car_type",
+        "count_date"
+      ]
+    }
+  ],
+  "关键业务指标": [
+    {
+      "指标类型": "支付金额",
+      "详细描述": "记录各服务区每日通过不同支付方式(微信、支付宝、现金、行吧、金豆)的支付金额,用于分析营收结构"
+    },
+    {
+      "指标类型": "订单数量",
+      "详细描述": "记录各服务区每日通过不同支付方式(微信、支付宝、现金、行吧、金豆)的订单数量,用于分析消费频次"
+    },
+    {
+      "指标类型": "支付总额与订单总数",
+      "详细描述": "记录每日总支付金额和订单总数,用于分析整体营收和消费趋势"
+    },
+    {
+      "指标类型": "车流统计",
+      "详细描述": "记录每日各服务区车辆数量和类别,用于分析车流分布和运营策略制定"
+    }
+  ]
+}

+ 0 - 0
data_pipeline/training_data/task_20250701_131627/filename_mapping.txt → data_pipeline/training_data/manual_20250720_134836/filename_mapping.txt


+ 62 - 0
data_pipeline/training_data/manual_20250720_134836/metadata.txt

@@ -0,0 +1,62 @@
+-- Schema Tools生成的主题元数据
+-- 业务背景: 高速公路服务区管理系统
+-- 生成时间: 2025-07-20 13:52:35
+-- 数据库: highway_db
+
+-- 创建表(如果不存在)
+CREATE TABLE IF NOT EXISTS metadata (
+    id SERIAL PRIMARY KEY,    -- 主键
+    topic_name VARCHAR(100) NOT NULL,  -- 业务主题名称
+    description TEXT,                  -- 业务主体说明
+    related_tables TEXT[],			  -- 相关表名
+    biz_entities TEXT[],               -- 主要业务实体名称
+    biz_metrics TEXT[],                -- 主要业务指标名称
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP    -- 插入时间
+);
+
+-- 插入主题数据
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '日营收分析',
+  '分析各服务区每日的营收情况,包括支付方式分布、收入趋势和订单数量,用于评估经营状况。',
+  'bss_business_day_data',
+  '服务区,档口,支付方式,统计日期',
+  '收入趋势,支付方式分布,订单总数,服务区对比'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '车辆流量分析',
+  '通过车辆统计数据,分析不同日期和类别下的车流量变化,用于优化服务区资源配置。',
+  'bss_car_day_count',
+  '服务区,车辆类别,统计日期',
+  '车流量趋势,车辆类别分布,服务区车流排名'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '公司管理分析',
+  '基于公司信息,分析不同公司下属服务区的数量与分布,支撑组织管理与资源分配决策。',
+  'bss_company,bss_service_area',
+  '公司,服务区,路段',
+  '公司服务区数量,服务区分布,路段关联分析'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '服务区关联分析',
+  '分析路段与服务区之间的关联关系,了解服务区在高速路网中的分布与连接情况。',
+  'bss_section_route,bss_section_route_area_link,bss_service_area',
+  '路段,路线,服务区',
+  '路段服务区数量,路线覆盖分析,服务区连接分布'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '数据来源分析',
+  '分析不同数据来源的服务区业务数据分布,评估数据采集系统的覆盖范围与使用情况。',
+  'bss_service_area_mapper,bss_business_day_data',
+  '数据来源系统,服务区,编码',
+  '来源系统分布,数据覆盖范围,编码一致性分析'
+);
+

+ 3 - 3
data_pipeline/training_data/task_20250701_131627/metadata_detail.md → data_pipeline/training_data/manual_20250720_134836/metadata_detail.md

@@ -7,9 +7,9 @@
 - `id` (serial) - 主键ID [主键, 非空]
 - `topic_name` (varchar(100)) - 业务主题名称 [非空]
 - `description` (text) - 业务主题说明
-- `related_tables` (text[]) - 涉及的数据表 [示例: bss_business_day_data, bss_section_route_area_link]
-- `biz_entities` (text[]) - 主要业务实体名称 [示例: 车辆类型, 节假日, 路线]
-- `biz_metrics` (text[]) - 主要业务指标名称 [示例: 总营收, 现金占比, 人均营收]
+- `related_tables` (text[]) - 涉及的数据表 [示例: bss_service_area, bss_service_area_mapper]
+- `biz_entities` (text[]) - 主要业务实体名称 [示例: 档口, 数据来源系统, 编码]
+- `biz_metrics` (text[]) - 主要业务指标名称 [示例: 订单总数, 服务区车流排名, 公司服务区数量]
 - `created_at` (timestamp) - 插入时间 [默认值: `CURRENT_TIMESTAMP`]
 
 字段补充说明:

+ 198 - 0
data_pipeline/training_data/manual_20250720_134836/qs_highway_db_20250720_135235_pair.json

@@ -0,0 +1,198 @@
+[
+  {
+    "question": "统计最近7天每个服务区的总营收金额和订单数量,按营收金额降序排列。",
+    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总营收金额, SUM(order_sum) AS 总订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - INTERVAL '7 days' GROUP BY service_name ORDER BY 总营收金额 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日各档口的现金支付金额及订单数量,按现金支付金额降序排列。",
+    "sql": "SELECT branch_name AS 档口名称, rmb AS 现金支付金额, rmb_order AS 现金订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' ORDER BY rmb DESC;"
+  },
+  {
+    "question": "查询各服务区不同支付方式的订单数量,按微信订单数量降序排列。",
+    "sql": "SELECT service_name AS 服务区名称, wx_order AS 微信订单数量, zf_order AS 支付宝订单数量, rmb_order AS 现金订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL ORDER BY wx_order DESC;"
+  },
+  {
+    "question": "统计2023年3月每个服务区的平均每日营收金额,并按平均金额降序显示前5名。",
+    "sql": "SELECT service_name AS 服务区名称, AVG(pay_sum) AS 平均每日营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(MONTH FROM oper_date) = 3 AND EXTRACT(YEAR FROM oper_date) = 2023 GROUP BY service_name ORDER BY 平均每日营收金额 DESC LIMIT 5;"
+  },
+  {
+    "question": "查询宜春服务区在2023年4月1日至2023年4月7日的每日营收金额,用于分析收入趋势。",
+    "sql": "SELECT oper_date AS 统计日期, pay_sum AS 营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND service_name = '宜春服务区' AND oper_date BETWEEN '2023-04-01' AND '2023-04-07' ORDER BY 统计日期;"
+  },
+  {
+    "question": "查询每个服务区的微信、支付宝、现金支付金额占比,分析支付方式分布。",
+    "sql": "SELECT service_name AS 服务区名称, (wx / pay_sum) * 100 AS 微信占比, (zfb / pay_sum) * 100 AS 支付宝占比, (rmb / pay_sum) * 100 AS 现金占比 FROM bss_business_day_data WHERE delete_ts IS NULL AND pay_sum > 0;"
+  },
+  {
+    "question": "统计2023年各月的总营收金额,分析全年营收趋势。",
+    "sql": "SELECT EXTRACT(MONTH FROM oper_date) AS 月份, SUM(pay_sum) AS 总营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(YEAR FROM oper_date) = 2023 GROUP BY 月份 ORDER BY 月份;"
+  },
+  {
+    "question": "查询2023年4月1日营收金额最高的前3个服务区,并显示其订单总数。",
+    "sql": "SELECT service_name AS 服务区名称, pay_sum AS 营收金额, order_sum AS 订单总数 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' ORDER BY pay_sum DESC LIMIT 3;"
+  },
+  {
+    "question": "查询2023年4月1日宜春服务区各档口的营收金额,按营收金额降序排列。",
+    "sql": "SELECT branch_name AS 档口名称, pay_sum AS 营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' AND service_name = '宜春服务区' ORDER BY pay_sum DESC;"
+  },
+  {
+    "question": "统计2023年4月1日各服务区的现金支付金额与订单数量,筛选现金支付金额大于1000元的数据。",
+    "sql": "SELECT service_name AS 服务区名称, SUM(rmb) AS 现金支付金额, SUM(rmb_order) AS 现金订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' GROUP BY service_name HAVING SUM(rmb) > 1000 ORDER BY 现金支付金额 DESC;"
+  },
+  {
+    "question": "统计2023年4月每天的总车流量,分析车流趋势。",
+    "sql": "SELECT count_date AS 统计日期, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY count_date ORDER BY count_date;"
+  },
+  {
+    "question": "按车辆类别统计2023年4月的总车流量,查看各类别占比。",
+    "sql": "SELECT car_type AS 车辆类别, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY car_type ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "找出2023年4月车流量最高的前5个服务区。",
+    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "分析2023年4月每周的平均车流量,观察周趋势变化。",
+    "sql": "SELECT EXTRACT(WEEK FROM count_date) AS 周数, AVG(customer_count) AS 平均车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY 周数 ORDER BY 周数;"
+  },
+  {
+    "question": "比较2023年4月与2022年4月的总车流量变化情况。",
+    "sql": "SELECT EXTRACT(MONTH FROM count_date) AS 月份, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE (count_date BETWEEN '2023-04-01' AND '2023-04-30') OR (count_date BETWEEN '2022-04-01' AND '2022-04-30') AND delete_ts IS NULL GROUP BY 月份 ORDER BY 月份;"
+  },
+  {
+    "question": "查询2023年4月每天的城际车辆流量,分析城际车流趋势。",
+    "sql": "SELECT count_date AS 统计日期, customer_count AS 城际车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car_type = '城际' AND delete_ts IS NULL ORDER BY count_date;"
+  },
+  {
+    "question": "找出2023年4月危化品车流量最少的后3个服务区。",
+    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 危化品车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car_type = '危化品' AND delete_ts IS NULL GROUP BY service_area_id ORDER BY 危化品车流量 ASC LIMIT 3;"
+  },
+  {
+    "question": "统计2023年4月每天的过境车辆流量,并按天排序。",
+    "sql": "SELECT count_date AS 统计日期, customer_count AS 过境车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car_type = '过境' AND delete_ts IS NULL ORDER BY count_date;"
+  },
+  {
+    "question": "对比2023年4月不同服务区的车辆类别分布情况。",
+    "sql": "SELECT service_area_id AS 服务区ID, car_type AS 车辆类别, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY 服务区ID, 车辆类别 ORDER BY 服务区ID, 总车流量 DESC;"
+  },
+  {
+    "question": "查询2023年4月车流量超过1000的日期和对应车流量。",
+    "sql": "SELECT count_date AS 统计日期, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY count_date HAVING SUM(customer_count) > 1000 ORDER BY count_date;"
+  },
+  {
+    "question": "统计各公司下属服务区的数量,并按数量降序排列。",
+    "sql": "SELECT company_name AS 公司名称, COUNT(*) AS 服务区数量 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY company_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "列出所有处于开放状态的服务区及其所属公司名称。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, c.company_name AS 公司名称 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.service_state = '开放' AND sa.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "找出2023年4月1日微信支付金额最高的前5个服务区。",
+    "sql": "SELECT service_name AS 服务区名称, wx AS 微信支付金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY wx DESC LIMIT 5;"
+  },
+  {
+    "question": "统计每个路段关联的服务区数量,并按数量降序排列。",
+    "sql": "SELECT sr.section_name AS 路段名称, COUNT(sral.service_area_id) AS 服务区数量 FROM bss_section_route sr JOIN bss_section_route_area_link sral ON sr.id = sral.section_route_id GROUP BY sr.section_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "查找2022年3月2日记录中车辆类别为'危化品'的服务区名称及车辆数量。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, cc.customer_count AS 车辆数量 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.count_date = '2022-03-02' AND cc.car_type = '危化品' AND cc.delete_ts IS NULL;"
+  },
+  {
+    "question": "列出所有关闭状态的服务区及其所属公司编码。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, c.company_no AS 公司编码 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.service_state = '关闭' AND sa.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "统计各公司2023年4月1日的微信支付总金额,并按金额降序排列。",
+    "sql": "SELECT sa.company_id AS 公司ID, SUM(bd.wx) AS 微信支付总金额 FROM bss_business_day_data bd JOIN bss_service_area sa ON bd.service_no = sa.service_area_no WHERE bd.oper_date = '2023-04-01' GROUP BY sa.company_id ORDER BY 微信支付总金额 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日所有服务区的支付总金额,并按金额升序排列。",
+    "sql": "SELECT service_name AS 服务区名称, pay_sum AS 支付总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY pay_sum ASC;"
+  },
+  {
+    "question": "列出每个公司下所有服务区的经纬度信息。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sa.service_position AS 经纬度, c.company_name AS 公司名称 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "统计2023年4月1日各服务区的现金支付订单数量,并按数量降序排列。",
+    "sql": "SELECT service_name AS 服务区名称, rmb_order AS 现金订单数量 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY rmb_order DESC;"
+  },
+  {
+    "question": "统计每个路段关联的服务区数量,并按数量降序排列。",
+    "sql": "SELECT section_name AS 路段名称, COUNT(service_area_id) AS 服务区数量 FROM bss_section_route JOIN bss_section_route_area_link ON id = section_route_id WHERE delete_ts IS NULL GROUP BY section_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "列出所有路线及其覆盖的服务区数量,并筛选出服务区数量大于5的路线。",
+    "sql": "SELECT route_name AS 路线名称, COUNT(service_area_id) AS 服务区数量 FROM bss_section_route JOIN bss_section_route_area_link ON id = section_route_id WHERE delete_ts IS NULL GROUP BY route_name HAVING COUNT(service_area_id) > 5;"
+  },
+  {
+    "question": "查询每个服务区所归属的路线和路段信息。",
+    "sql": "SELECT service_area_name AS 服务区名称, route_name AS 路线名称, section_name AS 路段名称 FROM bss_service_area JOIN bss_section_route_area_link ON id = service_area_id JOIN bss_section_route ON section_route_id = bss_section_route.id WHERE bss_service_area.delete_ts IS NULL AND bss_section_route.delete_ts IS NULL;"
+  },
+  {
+    "question": "找出没有关联任何服务区的路段。",
+    "sql": "SELECT section_name AS 路段名称 FROM bss_section_route LEFT JOIN bss_section_route_area_link ON id = section_route_id WHERE service_area_id IS NULL AND bss_section_route.delete_ts IS NULL;"
+  },
+  {
+    "question": "统计每个路段下服务区的经纬度分布,用于地图可视化。",
+    "sql": "SELECT section_name AS 路段名称, service_area_name AS 服务区名称, service_position AS 经纬度 FROM bss_section_route JOIN bss_section_route_area_link ON id = section_route_id JOIN bss_service_area ON service_area_id = bss_service_area.id WHERE bss_section_route.delete_ts IS NULL AND bss_service_area.delete_ts IS NULL;"
+  },
+  {
+    "question": "查询2023年4月1日所有服务区的微信支付总额,并按金额降序排列前10名。",
+    "sql": "SELECT service_name AS 服务区名称, SUM(wx) AS 微信支付总额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' GROUP BY service_name ORDER BY 微信支付总额 DESC LIMIT 10;"
+  },
+  {
+    "question": "查询2022年3月所有服务区的车辆数量,并按车辆数量降序排列。",
+    "sql": "SELECT service_area_name AS 服务区名称, SUM(customer_count) AS 车辆数量 FROM bss_car_day_count JOIN bss_service_area ON service_area_id = bss_service_area.id WHERE count_date BETWEEN '2022-03-01' AND '2022-03-31' GROUP BY service_area_name ORDER BY 车辆数量 DESC;"
+  },
+  {
+    "question": "找出2023年4月1日微信支付订单数量最多的前5个服务区。",
+    "sql": "SELECT service_name AS 服务区名称, wx_order AS 微信订单数量 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY wx_order DESC LIMIT 5;"
+  },
+  {
+    "question": "统计各路段2022年3月的总车流量,并按车流量降序排列。",
+    "sql": "SELECT section_name AS 路段名称, SUM(customer_count) AS 总车流量 FROM bss_car_day_count JOIN bss_service_area ON bss_car_day_count.service_area_id = bss_service_area.id JOIN bss_section_route_area_link ON bss_service_area.id = bss_section_route_area_link.service_area_id JOIN bss_section_route ON bss_section_route_area_link.section_route_id = bss_section_route.id WHERE count_date BETWEEN '2022-03-01' AND '2022-03-31' GROUP BY section_name ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "统计各数据来源系统类型对应的服务区数量,评估数据采集系统的覆盖范围。",
+    "sql": "SELECT source_system_type AS 数据来源系统类型, COUNT(*) AS 服务区数量 FROM bss_service_area_mapper WHERE delete_ts IS NULL GROUP BY source_system_type ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "分析不同数据来源类别ID(source_type)的服务区业务数据记录数量分布。",
+    "sql": "SELECT source_type AS 数据来源类别ID, COUNT(*) AS 记录数量 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY source_type ORDER BY 记录数量 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日各数据来源系统类型的服务区总支付金额汇总。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, SUM(data.pay_sum) AS 总支付金额 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type;"
+  },
+  {
+    "question": "列出最近一个月内无数据更新的数据来源系统类型及其服务区数量。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, COUNT(DISTINCT mapper.service_area_id) AS 无更新服务区数量 FROM bss_service_area_mapper mapper LEFT JOIN bss_business_day_data data ON mapper.service_no = data.service_no AND data.update_ts >= '2023-03-01' WHERE data.id IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type;"
+  },
+  {
+    "question": "查询2023年4月1日各数据来源系统类型的微信支付金额占比。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, SUM(data.wx) AS 微信支付总额, SUM(data.wx) / SUM(data.pay_sum) * 100 AS 支付占比 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type;"
+  },
+  {
+    "question": "列出数据来源系统类型为'驿购'且2023年4月1日订单总数排名前10的服务区名称。",
+    "sql": "SELECT mapper.service_name AS 服务区名称, data.order_sum AS 订单总数 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND mapper.source_system_type = '驿购' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL ORDER BY data.order_sum DESC LIMIT 10;"
+  },
+  {
+    "question": "对比2023年4月1日各数据来源系统类型的支付总金额与订单总数。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, SUM(data.pay_sum) AS 支付总金额, SUM(data.order_sum) AS 订单总数 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type ORDER BY 支付总金额 DESC;"
+  },
+  {
+    "question": "查询各数据来源系统类型中最近一次数据更新时间,并按时间排序。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, MAX(data.update_ts) AS 最近更新时间 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type ORDER BY 最近更新时间 DESC;"
+  },
+  {
+    "question": "查找2023年4月1日数据来源系统类型为'手工录入'的所有服务区的现金支付金额明细。",
+    "sql": "SELECT mapper.service_name AS 服务区名称, data.rmb AS 现金支付金额 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND mapper.source_system_type = '手工录入' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL ORDER BY data.rmb DESC;"
+  },
+  {
+    "question": "统计各数据来源系统类型在2023年4月1日的平均支付金额,并按平均值排序。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, AVG(data.pay_sum) AS 平均支付金额 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type ORDER BY 平均支付金额 DESC;"
+  }
+]

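Note: because train_json_question_sql_pairs now matches the "question"/"sql" keys case-insensitively, pair files like the one above are accepted regardless of key casing. A small illustration written for this note (not part of the repository), assuming it is run from the directory containing the file:

    import json

    with open("qs_highway_db_20250720_135235_pair.json", encoding="utf-8") as f:
        pairs = json.load(f)

    for idx, pair in enumerate(pairs[:3], start=1):
        # Mirror the trainer's lookup: compare lowered key names instead of exact keys.
        question = next((v for k, v in pair.items() if k.lower() == "question"), None)
        sql = next((v for k, v in pair.items() if k.lower() == "sql"), None)
        print(f"#{idx}: {question} -> {str(sql)[:40]}...")
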
+ 202 - 0
data_pipeline/training_data/manual_20250720_134836/qs_highway_db_20250720_135235_pair.json.backup

@@ -0,0 +1,202 @@
+[
+  {
+    "question": "统计最近7天每个服务区的总营收金额和订单数量,按营收金额降序排列。",
+    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总营收金额, SUM(order_sum) AS 总订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - INTERVAL '7 days' GROUP BY service_name ORDER BY 总营收金额 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日各档口的现金支付金额及订单数量,按现金支付金额降序排列。",
+    "sql": "SELECT branch_name AS 档口名称, rmb AS 现金支付金额, rmb_order AS 现金订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' ORDER BY rmb DESC;"
+  },
+  {
+    "question": "查询各服务区不同支付方式的订单数量,按微信订单数量降序排列。",
+    "sql": "SELECT service_name AS 服务区名称, wx_order AS 微信订单数量, zf_order AS 支付宝订单数量, rmb_order AS 现金订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL ORDER BY wx_order DESC;"
+  },
+  {
+    "question": "统计2023年3月每个服务区的平均每日营收金额,并按平均金额降序显示前5名。",
+    "sql": "SELECT service_name AS 服务区名称, AVG(pay_sum) AS 平均每日营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(MONTH FROM oper_date) = 3 AND EXTRACT(YEAR FROM oper_date) = 2023 GROUP BY service_name ORDER BY 平均每日营收金额 DESC LIMIT 5;"
+  },
+  {
+    "question": "查询宜春服务区在2023年4月1日至2023年4月7日的每日营收金额,用于分析收入趋势。",
+    "sql": "SELECT oper_date AS 统计日期, pay_sum AS 营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND service_name = '宜春服务区' AND oper_date BETWEEN '2023-04-01' AND '2023-04-07' ORDER BY 统计日期;"
+  },
+  {
+    "question": "查询每个服务区的微信、支付宝、现金支付金额占比,分析支付方式分布。",
+    "sql": "SELECT service_name AS 服务区名称, (wx / pay_sum) * 100 AS 微信占比, (zfb / pay_sum) * 100 AS 支付宝占比, (rmb / pay_sum) * 100 AS 现金占比 FROM bss_business_day_data WHERE delete_ts IS NULL AND pay_sum > 0;"
+  },
+  {
+    "question": "统计2023年各月的总营收金额,分析全年营收趋势。",
+    "sql": "SELECT EXTRACT(MONTH FROM oper_date) AS 月份, SUM(pay_sum) AS 总营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(YEAR FROM oper_date) = 2023 GROUP BY 月份 ORDER BY 月份;"
+  },
+  {
+    "question": "查询2023年4月1日营收金额最高的前3个服务区,并显示其订单总数。",
+    "sql": "SELECT service_name AS 服务区名称, pay_sum AS 营收金额, order_sum AS 订单总数 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' ORDER BY pay_sum DESC LIMIT 3;"
+  },
+  {
+    "question": "查询2023年4月1日宜春服务区各档口的营收金额,按营收金额降序排列。",
+    "sql": "SELECT branch_name AS 档口名称, pay_sum AS 营收金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' AND service_name = '宜春服务区' ORDER BY pay_sum DESC;"
+  },
+  {
+    "question": "统计2023年4月1日各服务区的现金支付金额与订单数量,筛选现金支付金额大于1000元的数据。",
+    "sql": "SELECT service_name AS 服务区名称, SUM(rmb) AS 现金支付金额, SUM(rmb_order) AS 现金订单数量 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' GROUP BY service_name HAVING SUM(rmb) > 1000 ORDER BY 现金支付金额 DESC;"
+  },
+  {
+    "question": "统计2023年4月每天的总车流量,分析车流趋势。",
+    "sql": "SELECT count_date AS 统计日期, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY count_date ORDER BY count_date;"
+  },
+  {
+    "question": "按车辆类别统计2023年4月的总车流量,查看各类别占比。",
+    "sql": "SELECT car_type AS 车辆类别, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY car_type ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "找出2023年4月车流量最高的前5个服务区。",
+    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "分析2023年4月每周的平均车流量,观察周趋势变化。",
+    "sql": "SELECT EXTRACT(WEEK FROM count_date) AS 周数, AVG(customer_count) AS 平均车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY 周数 ORDER BY 周数;"
+  },
+  {
+    "question": "比较2023年4月与2022年4月的总车流量变化情况。",
+    "sql": "SELECT EXTRACT(MONTH FROM count_date) AS 月份, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE (count_date BETWEEN '2023-04-01' AND '2023-04-30') OR (count_date BETWEEN '2022-04-01' AND '2022-04-30') AND delete_ts IS NULL GROUP BY 月份 ORDER BY 月份;"
+  },
+  {
+    "question": "查询2023年4月每天的城际车辆流量,分析城际车流趋势。",
+    "sql": "SELECT count_date AS 统计日期, customer_count AS 城际车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car_type = '城际' AND delete_ts IS NULL ORDER BY count_date;"
+  },
+  {
+    "question": "找出2023年4月危化品车流量最少的后3个服务区。",
+    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 危化品车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car_type = '危化品' AND delete_ts IS NULL GROUP BY service_area_id ORDER BY 危化品车流量 ASC LIMIT 3;"
+  },
+  {
+    "question": "统计2023年4月每天的过境车辆流量,并按天排序。",
+    "sql": "SELECT count_date AS 统计日期, customer_count AS 过境车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car_type = '过境' AND delete_ts IS NULL ORDER BY count_date;"
+  },
+  {
+    "question": "对比2023年4月不同服务区的车辆类别分布情况。",
+    "sql": "SELECT service_area_id AS 服务区ID, car_type AS 车辆类别, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY 服务区ID, 车辆类别 ORDER BY 服务区ID, 总车流量 DESC;"
+  },
+  {
+    "question": "查询2023年4月车流量超过1000的日期和对应车流量。",
+    "sql": "SELECT count_date AS 统计日期, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY count_date HAVING SUM(customer_count) > 1000 ORDER BY count_date;"
+  },
+  {
+    "question": "统计各公司下属服务区的数量,并按数量降序排列。",
+    "sql": "SELECT company_name AS 公司名称, COUNT(*) AS 服务区数量 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY company_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "列出所有处于开放状态的服务区及其所属公司名称。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, c.company_name AS 公司名称 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.service_state = '开放' AND sa.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "找出2023年4月1日微信支付金额最高的前5个服务区。",
+    "sql": "SELECT service_name AS 服务区名称, wx AS 微信支付金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY wx DESC LIMIT 5;"
+  },
+  {
+    "question": "统计每个路段关联的服务区数量,并按数量降序排列。",
+    "sql": "SELECT sr.section_name AS 路段名称, COUNT(sral.service_area_id) AS 服务区数量 FROM bss_section_route sr JOIN bss_section_route_area_link sral ON sr.id = sral.section_route_id GROUP BY sr.section_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "查找2022年3月2日记录中车辆类别为'危化品'的服务区名称及车辆数量。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, cc.customer_count AS 车辆数量 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.count_date = '2022-03-02' AND cc.car_type = '危化品' AND cc.delete_ts IS NULL;"
+  },
+  {
+    "question": "列出所有关闭状态的服务区及其所属公司编码。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, c.company_no AS 公司编码 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.service_state = '关闭' AND sa.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "统计各公司2023年4月1日的微信支付总金额,并按金额降序排列。",
+    "sql": "SELECT sa.company_id AS 公司ID, SUM(bd.wx) AS 微信支付总金额 FROM bss_business_day_data bd JOIN bss_service_area sa ON bd.service_no = sa.service_area_no WHERE bd.oper_date = '2023-04-01' GROUP BY sa.company_id ORDER BY 微信支付总金额 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日所有服务区的支付总金额,并按金额升序排列。",
+    "sql": "SELECT service_name AS 服务区名称, pay_sum AS 支付总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY pay_sum ASC;"
+  },
+  {
+    "question": "列出每个公司下所有服务区的经纬度信息。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sa.service_position AS 经纬度, c.company_name AS 公司名称 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL;"
+  },
+  {
+    "question": "统计2023年4月1日各服务区的现金支付订单数量,并按数量降序排列。",
+    "sql": "SELECT service_name AS 服务区名称, rmb_order AS 现金订单数量 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY rmb_order DESC;"
+  },
+  {
+    "question": "统计每个路段关联的服务区数量,并按数量降序排列。",
+    "sql": "SELECT section_name AS 路段名称, COUNT(service_area_id) AS 服务区数量 FROM bss_section_route JOIN bss_section_route_area_link ON id = section_route_id WHERE delete_ts IS NULL GROUP BY section_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "列出所有路线及其覆盖的服务区数量,并筛选出服务区数量大于5的路线。",
+    "sql": "SELECT route_name AS 路线名称, COUNT(service_area_id) AS 服务区数量 FROM bss_section_route JOIN bss_section_route_area_link ON id = section_route_id WHERE delete_ts IS NULL GROUP BY route_name HAVING COUNT(service_area_id) > 5;"
+  },
+  {
+    "question": "查询每个服务区所归属的路线和路段信息。",
+    "sql": "SELECT service_area_name AS 服务区名称, route_name AS 路线名称, section_name AS 路段名称 FROM bss_service_area JOIN bss_section_route_area_link ON id = service_area_id JOIN bss_section_route ON section_route_id = bss_section_route.id WHERE bss_service_area.delete_ts IS NULL AND bss_section_route.delete_ts IS NULL;"
+  },
+  {
+    "question": "找出没有关联任何服务区的路段。",
+    "sql": "SELECT section_name AS 路段名称 FROM bss_section_route LEFT JOIN bss_section_route_area_link ON id = section_route_id WHERE service_area_id IS NULL AND bss_section_route.delete_ts IS NULL;"
+  },
+  {
+    "question": "统计每个路段下服务区的经纬度分布,用于地图可视化。",
+    "sql": "SELECT section_name AS 路段名称, service_area_name AS 服务区名称, service_position AS 经纬度 FROM bss_section_route JOIN bss_section_route_area_link ON id = section_route_id JOIN bss_service_area ON service_area_id = bss_service_area.id WHERE bss_section_route.delete_ts IS NULL AND bss_service_area.delete_ts IS NULL;"
+  },
+  {
+    "question": "查询2023年4月1日所有服务区的微信支付总额,并按金额降序排列前10名。",
+    "sql": "SELECT service_name AS 服务区名称, SUM(wx) AS 微信支付总额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' GROUP BY service_name ORDER BY 微信支付总额 DESC LIMIT 10;"
+  },
+  {
+    "question": "统计每个路段服务区的总订单数和总支付金额,用于运营绩效分析。",
+    "sql": "SELECT section_name AS 路段名称, SUM(order_sum) AS 总订单数, SUM(pay_sum) AS 总支付金额 FROM bss_business_day_data JOIN bss_section_route_area_link ON service_area_id = service_area_id JOIN bss_section_route ON id = section_route_id GROUP BY section_name;"
+  },
+  {
+    "question": "查询2022年3月所有服务区的车辆数量,并按车辆数量降序排列。",
+    "sql": "SELECT service_area_name AS 服务区名称, SUM(customer_count) AS 车辆数量 FROM bss_car_day_count JOIN bss_service_area ON service_area_id = bss_service_area.id WHERE count_date BETWEEN '2022-03-01' AND '2022-03-31' GROUP BY service_area_name ORDER BY 车辆数量 DESC;"
+  },
+  {
+    "question": "找出2023年4月1日微信支付订单数量最多的前5个服务区。",
+    "sql": "SELECT service_name AS 服务区名称, wx_order AS 微信订单数量 FROM bss_business_day_data WHERE oper_date = '2023-04-01' ORDER BY wx_order DESC LIMIT 5;"
+  },
+  {
+    "question": "统计各路段2022年3月的总车流量,并按车流量降序排列。",
+    "sql": "SELECT section_name AS 路段名称, SUM(customer_count) AS 总车流量 FROM bss_car_day_count JOIN bss_service_area ON service_area_id = bss_service_area.id JOIN bss_section_route_area_link ON service_area_id = bss_service_area.id JOIN bss_section_route ON id = section_route_id WHERE count_date BETWEEN '2022-03-01' AND '2022-03-31' GROUP BY section_name ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "统计各数据来源系统类型对应的服务区数量,评估数据采集系统的覆盖范围。",
+    "sql": "SELECT source_system_type AS 数据来源系统类型, COUNT(*) AS 服务区数量 FROM bss_service_area_mapper WHERE delete_ts IS NULL GROUP BY source_system_type ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "分析不同数据来源类别ID(source_type)的服务区业务数据记录数量分布。",
+    "sql": "SELECT source_type AS 数据来源类别ID, COUNT(*) AS 记录数量 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY source_type ORDER BY 记录数量 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日各数据来源系统类型的服务区总支付金额汇总。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, SUM(data.pay_sum) AS 总支付金额 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type;"
+  },
+  {
+    "question": "列出最近一个月内无数据更新的数据来源系统类型及其服务区数量。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, COUNT(DISTINCT mapper.service_area_id) AS 无更新服务区数量 FROM bss_service_area_mapper mapper LEFT JOIN bss_business_day_data data ON mapper.service_no = data.service_no AND data.update_ts >= '2023-03-01' WHERE data.id IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type;"
+  },
+  {
+    "question": "查询2023年4月1日各数据来源系统类型的微信支付金额占比。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, SUM(data.wx) AS 微信支付总额, SUM(data.wx) / SUM(data.pay_sum) * 100 AS 支付占比 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type;"
+  },
+  {
+    "question": "列出数据来源系统类型为'驿购'且2023年4月1日订单总数排名前10的服务区名称。",
+    "sql": "SELECT mapper.service_name AS 服务区名称, data.order_sum AS 订单总数 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND mapper.source_system_type = '驿购' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL ORDER BY data.order_sum DESC LIMIT 10;"
+  },
+  {
+    "question": "对比2023年4月1日各数据来源系统类型的支付总金额与订单总数。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, SUM(data.pay_sum) AS 支付总金额, SUM(data.order_sum) AS 订单总数 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type ORDER BY 支付总金额 DESC;"
+  },
+  {
+    "question": "查询各数据来源系统类型中最近一次数据更新时间,并按时间排序。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, MAX(data.update_ts) AS 最近更新时间 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type ORDER BY 最近更新时间 DESC;"
+  },
+  {
+    "question": "查找2023年4月1日数据来源系统类型为'手工录入'的所有服务区的现金支付金额明细。",
+    "sql": "SELECT mapper.service_name AS 服务区名称, data.rmb AS 现金支付金额 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND mapper.source_system_type = '手工录入' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL ORDER BY data.rmb DESC;"
+  },
+  {
+    "question": "统计各数据来源系统类型在2023年4月1日的平均支付金额,并按平均值排序。",
+    "sql": "SELECT mapper.source_system_type AS 数据来源系统类型, AVG(data.pay_sum) AS 平均支付金额 FROM bss_business_day_data data JOIN bss_service_area_mapper mapper ON data.service_no = mapper.service_no WHERE data.oper_date = '2023-04-01' AND data.delete_ts IS NULL AND mapper.delete_ts IS NULL GROUP BY mapper.source_system_type ORDER BY 平均支付金额 DESC;"
+  }
+]
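
The qs_*_pair.json files above (and their .backup twins) are flat JSON arrays of {"question", "sql"} objects used as Text2SQL training pairs. As a quick pre-load sanity check, a minimal sketch along the following lines could be used; the file path and the specific checks are illustrative assumptions, not part of the data_pipeline code:

import json
from pathlib import Path

# Hypothetical path; point it at any generated pair file under data_pipeline/training_data/.
pair_file = Path("data_pipeline/training_data/manual_20250720_134836/qs_highway_db_20250720_135235_pair.json")

pairs = json.loads(pair_file.read_text(encoding="utf-8"))

bad = []
for i, item in enumerate(pairs):
    question = item.get("question", "").strip()
    sql = item.get("sql", "").strip()
    # Basic shape checks: both fields present, SQL is a single SELECT statement ending with ';'.
    if not question or not sql.upper().startswith("SELECT") or not sql.endswith(";"):
        bad.append(i)

print(f"{len(pairs)} pairs loaded, {len(bad)} failed the basic checks: {bad}")
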

+ 8 - 8
data_pipeline/training_data/task_20250703_000820/bss_business_day_data.ddl → data_pipeline/training_data/manual_20250722_164749/bss_business_day_data.ddl

@@ -1,13 +1,13 @@
--- 中文名: 记录各服务区每日经营数据
--- 描述: 记录各服务区每日经营数据,用于业务统计与运营分析
+-- 中文名: 高速公路服务区每日经营数据记录表
+-- 描述: 高速公路服务区每日经营数据记录表,用于统计各服务区按日维度的运营情况。
 create table public.bss_business_day_data (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 数据版本号,
-  create_ts timestamp         -- 创建时间,
+  create_ts timestamp         -- 创建时间,
   created_by varchar(50)      -- 创建人账号,
-  update_ts timestamp         -- 更新时间,
+  update_ts timestamp         -- 更新时间,
   updated_by varchar(50)      -- 更新人账号,
-  delete_ts timestamp         -- 删除时间,
+  delete_ts timestamp         -- 删除时间,
   deleted_by varchar(50)      -- 删除人账号,
   oper_date date              -- 统计日期,
   service_no varchar(255)     -- 服务区编码,
@@ -21,11 +21,11 @@ create table public.bss_business_day_data (
   rmb numeric(19,4)           -- 现金支付金额,
   rmb_order integer           -- 现金订单数量,
   xs numeric(19,4)            -- 行吧支付金额,
-  xs_order integer            -- 行吧订单数,
+  xs_order integer            -- 行吧支付订单数,
   jd numeric(19,4)            -- 金豆支付金额,
-  jd_order integer            -- 金豆订单数,
+  jd_order integer            -- 金豆支付订单数,
   order_sum integer           -- 订单总数,
-  pay_sum numeric(19,4)       -- 支付金额,
+  pay_sum numeric(19,4)       -- 支付金额,
   source_type integer         -- 数据来源类别,
   primary key (id)
 );

+ 32 - 0
data_pipeline/training_data/manual_20250722_164749/bss_business_day_data_detail.md

@@ -0,0 +1,32 @@
+## bss_business_day_data(高速公路服务区每日经营数据记录表)
+bss_business_day_data 表高速公路服务区每日经营数据记录表,用于统计各服务区按日维度的运营情况。
+字段列表:
+- id (varchar(32)) - 主键ID [主键, 非空] [示例: 00827DFF993D415488EA1F07CAE6C440, 00e799048b8cbb8ee758eac9c8b4b820]
+- version (integer) - 数据版本号 [非空] [示例: 1]
+- create_ts (timestamp) - 创建时间 [示例: 2023-04-02 08:31:51, 2023-04-02 02:30:08]
+- created_by (varchar(50)) - 创建人账号 [示例: xingba]
+- update_ts (timestamp) - 更新时间 [示例: 2023-04-02 08:31:51, 2023-04-02 02:30:08]
+- updated_by (varchar(50)) - 更新人账号
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人账号
+- oper_date (date) - 统计日期 [示例: 2023-04-01]
+- service_no (varchar(255)) - 服务区编码 [示例: 1028, H0501]
+- service_name (varchar(255)) - 服务区名称 [示例: 宜春服务区, 庐山服务区]
+- branch_no (varchar(255)) - 档口编码 [示例: 1, H05016]
+- branch_name (varchar(255)) - 档口名称 [示例: 宜春南区, 庐山鲜徕客东区]
+- wx (numeric(19,4)) - 微信支付金额 [示例: 4790.0000, 2523.0000]
+- wx_order (integer) - 微信订单数量 [示例: 253, 133]
+- zfb (numeric(19,4)) - 支付宝支付金额 [示例: 229.0000, 0.0000]
+- zf_order (integer) - 支付宝订单数量 [示例: 15, 0]
+- rmb (numeric(19,4)) - 现金支付金额 [示例: 1058.5000, 124.0000]
+- rmb_order (integer) - 现金订单数量 [示例: 56, 12]
+- xs (numeric(19,4)) - 行吧支付金额 [示例: 0.0000, 40.0000]
+- xs_order (integer) - 行吧支付订单数 [示例: 0, 1]
+- jd (numeric(19,4)) - 金豆支付金额 [示例: 0.0000]
+- jd_order (integer) - 金豆支付订单数 [示例: 0]
+- order_sum (integer) - 订单总数 [示例: 324, 146]
+- pay_sum (numeric(19,4)) - 总支付金额 [示例: 6077.5000, 2687.0000]
+- source_type (integer) - 数据来源类别 [示例: 1, 0, 4]
+字段补充说明:
+- id 为主键
+- source_type 为枚举字段,包含取值:0、4、1、2、3

+ 5 - 5
data_pipeline/training_data/task_20250703_000820/bss_car_day_count.ddl → data_pipeline/training_data/manual_20250722_164749/bss_car_day_count.ddl

@@ -1,12 +1,12 @@
--- 中文名: 抱歉
--- 描述: 抱歉,我暂时无法回答您的问题。请稍后再试
+-- 中文名: 高速公路服务区每日车辆流量统计表
+-- 描述: 高速公路服务区每日车辆流量统计表,记录各类型车辆进出数量及操作审计信息
 create table public.bss_car_day_count (
   id varchar(32) not null     -- 主键ID,主键,
-  version integer not null    -- 版本号,
+  version integer not null    -- 数据版本号,
   create_ts timestamp         -- 创建时间,
   created_by varchar(50)      -- 创建人,
-  update_ts timestamp         -- 最后更新时间,
-  updated_by varchar(50)      -- 最后更新人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
   delete_ts timestamp         -- 删除时间,
   deleted_by varchar(50)      -- 删除人,
   customer_count bigint       -- 车辆数量,

+ 6 - 6
data_pipeline/training_data/task_20250701_212426/bss_car_day_count_detail.md → data_pipeline/training_data/manual_20250722_164749/bss_car_day_count_detail.md

@@ -1,14 +1,14 @@
-## bss_car_day_count(记录高速公路服务区每日车辆类型及数量统计
-bss_car_day_count 表记录高速公路服务区每日车辆类型及数量统计,用于车流分析与资源调配
+## bss_car_day_count(高速公路服务区每日车辆流量统计表
+bss_car_day_count 表高速公路服务区每日车辆流量统计表,记录各类型车辆进出数量及操作审计信息。
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 00022c1c99ff11ec86d4fa163ec0f8fc, 00022caa99ff11ec86d4fa163ec0f8fc]
-- version (integer) - 版本号 [非空] [示例: 1]
+- version (integer) - 数据版本号 [非空] [示例: 1]
 - create_ts (timestamp) - 创建时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
-- created_by (varchar(50)) - 创建人ID
+- created_by (varchar(50)) - 创建人
 - update_ts (timestamp) - 更新时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
-- updated_by (varchar(50)) - 更新人ID
+- updated_by (varchar(50)) - 更新人
 - delete_ts (timestamp) - 删除时间
-- deleted_by (varchar(50)) - 删除人ID
+- deleted_by (varchar(50)) - 删除人
 - customer_count (bigint) - 车辆数量 [示例: 1114, 295]
 - car_type (varchar(100)) - 车辆类别 [示例: 其他]
 - count_date (date) - 统计日期 [示例: 2022-03-02, 2022-02-02]

+ 15 - 0
data_pipeline/training_data/manual_20250722_164749/bss_company.ddl

@@ -0,0 +1,15 @@
+-- 中文名: 高速公路服务区管理系统中的公司信息表
+-- 描述: 高速公路服务区管理系统中的公司信息表,存储服务区所属公司的基本信息及操作审计字段。
+create table public.bss_company (
+  id varchar(32) not null     -- 公司唯一标识,主键,
+  version integer not null    -- 数据版本号,
+  create_ts timestamp         -- 创建时间,
+  created_by varchar(50)      -- 创建人,
+  update_ts timestamp         -- 更新时间,
+  updated_by varchar(50)      -- 更新人,
+  delete_ts timestamp         -- 删除时间,
+  deleted_by varchar(50)      -- 删除人,
+  company_name varchar(255)   -- 公司名称,
+  company_no varchar(255)     -- 公司编码,
+  primary key (id)
+);

+ 16 - 0
data_pipeline/training_data/manual_20250722_164749/bss_company_detail.md

@@ -0,0 +1,16 @@
+## bss_company(高速公路服务区管理系统中的公司信息表)
+bss_company 表高速公路服务区管理系统中的公司信息表,存储服务区所属公司的基本信息及操作审计字段。
+字段列表:
+- id (varchar(32)) - 公司唯一标识 [主键, 非空] [示例: 30675d85ba5044c31acfa243b9d16334, 47ed0bb37f5a85f3d9245e4854959b81]
+- version (integer) - 数据版本号 [非空] [示例: 1, 2]
+- create_ts (timestamp) - 创建时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
+- created_by (varchar(50)) - 创建人 [示例: admin]
+- update_ts (timestamp) - 更新时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
+- updated_by (varchar(50)) - 更新人 [示例: admin]
+- delete_ts (timestamp) - 删除时间
+- deleted_by (varchar(50)) - 删除人
+- company_name (varchar(255)) - 公司名称 [示例: 上饶分公司, 宜春分公司]
+- company_no (varchar(255)) - 公司编码 [示例: H03, H02, H07]
+字段补充说明:
+- id 为主键
+- company_no 为枚举字段,包含取值:H01、H02、H03、H04、H05、H06、H07、H08、Q01

+ 3 - 3
data_pipeline/training_data/task_20250703_012750/bss_section_route.ddl → data_pipeline/training_data/manual_20250722_164749/bss_section_route.ddl

@@ -1,7 +1,7 @@
--- 中文名: 存储路段与路线信息
--- 描述: 存储路段与路线信息,支撑测试流程完整执行,记录操作日志
+-- 中文名: 高速公路路段与路线关联信息表
+-- 描述: 高速公路路段与路线关联信息表,用于管理服务区所属路段及路线关系。
 create table public.bss_section_route (
-  id varchar(32) not null     -- 主键标识符,主键,
+  id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,
   create_ts timestamp         -- 创建时间,
   created_by varchar(50)      -- 创建人,

+ 7 - 0
data_pipeline/training_data/manual_20250722_164749/bss_section_route_area_link.ddl

@@ -0,0 +1,7 @@
+-- 中文名: 高速公路服务区与路线关联表
+-- 描述: 高速公路服务区与路线关联表,记录服务区所属路段关系。
+create table public.bss_section_route_area_link (
+  section_route_id varchar(32) not null -- 路段路线唯一标识,主键,
+  service_area_id varchar(32) not null -- 服务区唯一标识,主键,
+  primary key (section_route_id, service_area_id)
+);

+ 7 - 0
data_pipeline/training_data/manual_20250722_164749/bss_section_route_area_link_detail.md

@@ -0,0 +1,7 @@
+## bss_section_route_area_link(高速公路服务区与路线关联表)
+bss_section_route_area_link 表高速公路服务区与路线关联表,记录服务区所属路段关系。
+字段列表:
+- section_route_id (varchar(32)) - 路段路线唯一标识 [主键, 非空] [示例: v8elrsfs5f7lt7jl8a6p87smfzesn3rz, hxzi2iim238e3s1eajjt1enmh9o4h3wp]
+- service_area_id (varchar(32)) - 服务区唯一标识 [主键, 非空] [示例: 08e01d7402abd1d6a4d9fdd5df855ef8, 091662311d2c737029445442ff198c4c]
+字段补充说明:
+- 复合主键:section_route_id, service_area_id

+ 5 - 5
data_pipeline/training_data/task_20250701_131627/bss_section_route_detail.md → data_pipeline/training_data/manual_20250722_164749/bss_section_route_detail.md

@@ -1,5 +1,5 @@
-## bss_section_route(存储高速公路路段与路线信息)
-bss_section_route 表存储高速公路路段与路线信息,支持服务区路线关联管理
+## bss_section_route(高速公路路段与路线关联信息
+bss_section_route 表高速公路路段与路线关联信息表,用于管理服务区所属路段及路线关系
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 04ri3j67a806uw2c6o6dwdtz4knexczh, 0g5mnefxxtukql2cq6acul7phgskowy7]
 - version (integer) - 版本号 [非空] [示例: 1, 0]
@@ -9,8 +9,8 @@ bss_section_route 表存储高速公路路段与路线信息,支持服务区
 - updated_by (varchar(50)) - 更新人
 - delete_ts (timestamp) - 删除时间
 - deleted_by (varchar(50)) - 删除人
-- section_name (varchar(255)) - 路段名称 [示例: 昌栗, 昌宁]
-- route_name (varchar(255)) - 路线名称 [示例: 昌栗, 昌韶]
-- code (varchar(255)) - 路段编号 [示例: SR0001, SR0002]
+- section_name (varchar(255)) - 路段名称 [示例: 昌栗, 昌宁, 昌九]
+- route_name (varchar(255)) - 路线名称 [示例: 昌栗, 昌韶, /]
+- code (varchar(255)) - 编号 [示例: SR0001, SR0002, SR0147]
 字段补充说明:
 - id 为主键

+ 4 - 4
data_pipeline/training_data/task_20250703_012750/bss_service_area.ddl → data_pipeline/training_data/manual_20250722_164749/bss_service_area.ddl

@@ -1,5 +1,5 @@
--- 中文名: 存储服务区基础信息
--- 描述: 存储服务区基础信息,包含名称、编码及操作记录,支撑业务区域管理
+-- 中文名: 高速公路服务区基础信息表
+-- 描述: 高速公路服务区基础信息表,存储服务区名称、编码及管理元数据。
 create table public.bss_service_area (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,
@@ -12,8 +12,8 @@ create table public.bss_service_area (
   service_area_name varchar(255) -- 服务区名称,
   service_area_no varchar(255) -- 服务区编码,
   company_id varchar(32)      -- 所属公司ID,
-  service_position varchar(255) -- 地理坐标,
+  service_position varchar(255) -- 服务区经纬度,
   service_area_type varchar(50) -- 服务区类型,
-  service_state varchar(50)   -- 运营状态,
+  service_state varchar(50)   -- 服务区状态,
   primary key (id)
 );

+ 4 - 4
data_pipeline/training_data/task_20250703_012750/bss_service_area_detail.md → data_pipeline/training_data/manual_20250722_164749/bss_service_area_detail.md

@@ -1,5 +1,5 @@
-## bss_service_area(存储服务区基础信息
-bss_service_area 表存储服务区基础信息,包含名称、编码及操作记录,支撑业务区域管理
+## bss_service_area(高速公路服务区基础信息表
+bss_service_area 表高速公路服务区基础信息表,存储服务区名称、编码及管理元数据。
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 0271d68ef93de9684b7ad8c7aae600b6, 08e01d7402abd1d6a4d9fdd5df855ef8]
 - version (integer) - 版本号 [非空] [示例: 3, 6]
@@ -12,9 +12,9 @@ bss_service_area 表存储服务区基础信息,包含名称、编码及操作
 - service_area_name (varchar(255)) - 服务区名称 [示例: 白鹭湖停车区, 南昌南服务区]
 - service_area_no (varchar(255)) - 服务区编码 [示例: H0814, H0105]
 - company_id (varchar(32)) - 所属公司ID [示例: b1629f07c8d9ac81494fbc1de61f1ea5, ee9bf1180a2b45003f96e597a4b7f15a]
-- service_position (varchar(255)) - 地理坐标 [示例: 114.574721,26.825584, 115.910549,28.396355]
+- service_position (varchar(255)) - 服务区经纬度 [示例: 114.574721,26.825584, 115.910549,28.396355]
 - service_area_type (varchar(50)) - 服务区类型 [示例: 信息化服务区]
-- service_state (varchar(50)) - 运营状态 [示例: 开放, 关闭]
+- service_state (varchar(50)) - 服务区状态 [示例: 开放, 关闭]
 字段补充说明:
 - id 为主键
 - service_area_type 为枚举字段,包含取值:信息化服务区、智能化服务区

+ 3 - 3
data_pipeline/training_data/task_20250703_000820/bss_service_area_mapper.ddl → data_pipeline/training_data/manual_20250722_164749/bss_service_area_mapper.ddl

@@ -1,5 +1,5 @@
--- 中文名: BSS服务区信息映射表
--- 描述: BSS服务区信息映射表,记录服务区基础信息及状态变更记录,支持服务区全生命周期管理
+-- 中文名: 服务区信息映射表
+-- 描述: 服务区信息映射表,用于管理高速公路服务区的基础信息及变更记录
 create table public.bss_service_area_mapper (
   id varchar(32) not null     -- 主键ID,主键,
   version integer not null    -- 版本号,
@@ -11,7 +11,7 @@ create table public.bss_service_area_mapper (
   deleted_by varchar(50)      -- 删除人,
   service_name varchar(255)   -- 服务区名称,
   service_no varchar(255)     -- 服务区编码,
-  service_area_id varchar(32) -- 服务区ID,
+  service_area_id varchar(32) -- 服务区系统ID,
   source_system_type varchar(50) -- 数据来源系统,
   source_type integer         -- 数据来源类别ID,
   primary key (id)

+ 5 - 4
data_pipeline/training_data/task_20250703_000820/bss_service_area_mapper_detail.md → data_pipeline/training_data/manual_20250722_164749/bss_service_area_mapper_detail.md

@@ -1,5 +1,5 @@
-## bss_service_area_mapper(BSS服务区信息映射表)
-bss_service_area_mapper 表BSS服务区信息映射表,记录服务区基础信息及状态变更记录,支持服务区全生命周期管理
+## bss_service_area_mapper(服务区信息映射表)
+bss_service_area_mapper 表服务区信息映射表,用于管理高速公路服务区的基础信息及变更记录
 字段列表:
 - id (varchar(32)) - 主键ID [主键, 非空] [示例: 00e1e893909211ed8ee6fa163eaf653f, 013867f5962211ed8ee6fa163eaf653f]
 - version (integer) - 版本号 [非空] [示例: 1]
@@ -11,9 +11,10 @@ bss_service_area_mapper 表BSS服务区信息映射表,记录服务区基础
 - deleted_by (varchar(50)) - 删除人
 - service_name (varchar(255)) - 服务区名称 [示例: 信丰西服务区, 南康北服务区]
 - service_no (varchar(255)) - 服务区编码 [示例: 1067, 1062]
-- service_area_id (varchar(32)) - 服务区ID [示例: 97cd6cd516a551409a4d453a58f9e170, fdbdd042962011ed8ee6fa163eaf653f]
+- service_area_id (varchar(32)) - 服务区系统ID [示例: 97cd6cd516a551409a4d453a58f9e170, fdbdd042962011ed8ee6fa163eaf653f]
 - source_system_type (varchar(50)) - 数据来源系统 [示例: 驿美, 驿购]
 - source_type (integer) - 数据来源类别ID [示例: 3, 1]
 字段补充说明:
 - id 为主键
-- source_system_type 为枚举字段,包含取值:司乘管理、商业管理、驿购、驿美、手工录入
+- source_system_type 为枚举字段,包含取值:司乘管理、商业管理、驿购、驿美、手工录入
+- source_type 为枚举字段,包含取值:5、0、1、3、4

+ 35 - 0
data_pipeline/training_data/manual_20250722_164749/db_query_decision_prompt.txt

@@ -0,0 +1,35 @@
+{
+  "业务范围": "当前数据库存储的是高速公路服务区运营管理的相关数据,主要涉及服务区经营收入、车辆流量统计及基础信息管理,包含以下业务数据:",
+  "数据范围": "交易类数据(微信/支付宝/现金等支付方式的金额与订单数)、车辆流量数据(按车类统计的数量)、服务区基础属性(类型、状态、所属公司)及路段关联关系",
+  "核心业务实体": [
+    {
+      "实体类型": "服务区",
+      "描述": "高速公路沿线提供休息、餐饮、购物等服务的物理区域,是业务统计的核心维度",
+      "主要字段": [
+        "service_area_name",
+        "service_area_no",
+        "service_state",
+        "company_id"
+      ]
+    },
+    {
+      "实体类型": "经营档口",
+      "描述": "服务区内的具体商户或功能分区(如南区、鲜徕客东区),用于细化收入来源分析",
+      "主要字段": [
+        "branch_name",
+        "branch_no",
+        "service_no"
+      ]
+    }
+  ],
+  "关键业务指标": [
+    {
+      "指标类型": "日均营业额",
+      "描述": "通过pay_sum字段计算各服务区或档口每日总交易额,可用于趋势分析和横向对比"
+    },
+    {
+      "指标类型": "车类分布结构",
+      "描述": "基于car_type和customer_count字段分析不同类型车辆(危化品、城际等)在服务区的占比情况"
+    }
+  ]
+}
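
db_query_decision_prompt.txt above is a generated JSON description of the database's business scope, core entities and key metrics, presumably meant to be folded into an LLM prompt when the agent decides whether and how to answer a database question. As a hedged illustration only (the project's real prompt assembly lives in its own code), it could be consumed like this:

import json
from pathlib import Path

# Hypothetical path to the generated decision-prompt artifact.
prompt_file = Path("data_pipeline/training_data/manual_20250722_164749/db_query_decision_prompt.txt")
scope = json.loads(prompt_file.read_text(encoding="utf-8"))

# Illustrative wording only; the project's actual prompt text is defined elsewhere.
system_prompt = (
    "业务范围: " + scope["业务范围"] + "\n"
    "数据范围: " + scope["数据范围"] + "\n"
    "核心业务实体: " + ", ".join(e["实体类型"] for e in scope["核心业务实体"]) + "\n"
    "关键业务指标: " + ", ".join(m["指标类型"] for m in scope["关键业务指标"])
)
print(system_prompt)
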

+ 0 - 0
data_pipeline/training_data/task_20250703_012750/filename_mapping.txt → data_pipeline/training_data/manual_20250722_164749/filename_mapping.txt


+ 62 - 0
data_pipeline/training_data/manual_20250722_164749/metadata.txt

@@ -0,0 +1,62 @@
+-- Schema Tools生成的主题元数据
+-- 业务背景: 高速公路服务区管理系统
+-- 生成时间: 2025-07-22 16:55:43
+-- 数据库: highway_db
+
+-- 创建表(如果不存在)
+CREATE TABLE IF NOT EXISTS metadata (
+    id SERIAL PRIMARY KEY,    -- 主键
+    topic_name VARCHAR(100) NOT NULL,  -- 业务主题名称
+    description TEXT,                  -- 业务主体说明
+    related_tables TEXT[],			  -- 相关表名
+    biz_entities TEXT[],               -- 主要业务实体名称
+    biz_metrics TEXT[],                -- 主要业务指标名称
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP    -- 插入时间
+);
+
+-- 插入主题数据
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '营收分析',
+  '基于每日经营数据,分析各服务区及档口的收入趋势、订单分布与支付方式结构,支撑经营决策优化。',
+  '{bss_business_day_data,bss_service_area,bss_company}',
+  '{服务区,档口,公司,支付方式}',
+  '{总支付金额,订单总数,收入趋势,支付方式占比,服务区对比}'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '车流分析',
+  '结合车辆类型与日流量数据,分析各服务区车流构成及时序变化,辅助服务资源配置与营销策略制定。',
+  '{bss_car_day_count,bss_service_area,bss_section_route,bss_company}',
+  '{服务区,车辆类别,路段,公司}',
+  '{日均车流量,车类分布,车流趋势,服务区排名,跨路段对比}'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '公司绩效',
+  '从所属公司维度汇总营收与车流数据,评估各分公司管理效能与市场表现,支持绩效考核与资源分配。',
+  '{bss_business_day_data,bss_car_day_count,bss_service_area,bss_company}',
+  '{公司,服务区,统计日期}',
+  '{公司总营收,平均单区产出,车流覆盖率,同比增长率,公司间对比}'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '路段关联',
+  '分析不同路段路线下的服务区运营与车流情况,识别高价值路线,优化路网服务布局与招商策略。',
+  '{bss_section_route,bss_section_route_area_link,bss_service_area,bss_business_day_data,bss_car_day_count}',
+  '{路段,路线,服务区,公司}',
+  '{路段总营收,单区平均车流,路线密度,服务区覆盖率,路段排名}'
+);
+
+INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
+(
+  '状态监控',
+  '统计不同类型与状态的服务区数量分布及其运营数据差异,掌握开放服务能力与系统覆盖情况。',
+  '{bss_service_area,bss_business_day_data,bss_car_day_count}',
+  '{服务区状态,服务区类型,服务区,公司}',
+  '{开放服务区数,状态分布,运营率,异常状态预警,类型对比}'
+);
+
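
metadata.txt above is a self-contained SQL script (CREATE TABLE IF NOT EXISTS plus five INSERTs). Note that related_tables, biz_entities and biz_metrics are declared text[], so the quoted values must be valid PostgreSQL array literals (e.g. '{a,b,c}') for the INSERTs to succeed. Below is a minimal sketch for applying the script with psycopg2; the connection settings are placeholders only:

from pathlib import Path

import psycopg2  # assumes psycopg2 is installed and highway_db is reachable

# Placeholder DSN; replace with the project's actual database settings.
DSN = "host=localhost dbname=highway_db user=postgres password=postgres"

sql_script = Path("data_pipeline/training_data/manual_20250722_164749/metadata.txt").read_text(encoding="utf-8")

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        # The whole script is sent in one call; PostgreSQL runs the
        # semicolon-separated statements in order and the context manager commits.
        cur.execute(sql_script)
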

+ 3 - 3
data_pipeline/training_data/task_20250703_012750/metadata_detail.md → data_pipeline/training_data/manual_20250722_164749/metadata_detail.md

@@ -7,9 +7,9 @@
 - `id` (serial) - 主键ID [主键, 非空]
 - `topic_name` (varchar(100)) - 业务主题名称 [非空]
 - `description` (text) - 业务主题说明
-- `related_tables` (text[]) - 涉及的数据表 [示例: bss_section_route, bss_company]
-- `biz_entities` (text[]) - 主要业务实体名称 [示例: 编码类型, 路线, 路段]
-- `biz_metrics` (text[]) - 主要业务指标名称 [示例: 人均消费, 路线覆盖率, 车流转化率]
+- `related_tables` (text[]) - 涉及的数据表 [示例: bss_service_area, bss_section_route]
+- `biz_entities` (text[]) - 主要业务实体名称 [示例: 统计日期, 服务区类型, 路段]
+- `biz_metrics` (text[]) - 主要业务指标名称 [示例: 收入趋势, 总支付金额, 车类分布]
 - `created_at` (timestamp) - 插入时间 [默认值: `CURRENT_TIMESTAMP`]
 
 字段补充说明:

+ 198 - 0
data_pipeline/training_data/manual_20250722_164749/qs_highway_db_20250722_165543_pair.json

@@ -0,0 +1,198 @@
+[
+  {
+    "question": "统计2023年4月1日各服务区的总支付金额和订单总数,并按收入降序排列?",
+    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总支付金额, SUM(order_sum) AS 订单总数 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name ORDER BY 总支付金额 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日微信支付金额占比超过50%的服务区及其占比?",
+    "sql": "SELECT service_name AS 服务区名称, (SUM(wx) / SUM(pay_sum)) AS 微信支付占比 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name HAVING (SUM(wx) / SUM(pay_sum)) > 0.5 ORDER BY 微信支付占比 DESC;"
+  },
+  {
+    "question": "列出2023年4月1日订单总数最多的前5个档口及其所属服务区?",
+    "sql": "SELECT branch_name AS 档口名称, service_name AS 服务区名称, order_sum AS 订单总数 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL ORDER BY order_sum DESC LIMIT 5;"
+  },
+  {
+    "question": "分析2023年4月1日各支付方式的总金额分布情况?",
+    "sql": "SELECT '微信' AS 支付方式, SUM(wx) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '支付宝' AS 支付方式, SUM(zfb) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '现金' AS 支付方式, SUM(rmb) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '行吧' AS 支付方式, SUM(xs) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '金豆' AS 支付方式, SUM(jd) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL ORDER BY 总金额 DESC;"
+  },
+  {
+    "question": "计算各公司在2023年4月1日的总营收并按公司名称排序?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.pay_sum) AS 总营收 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date = '2023-04-01' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 公司名称;"
+  },
+  {
+    "question": "找出2023年4月1日平均客单价最高的前3个服务区(总支付金额/订单总数)?",
+    "sql": "SELECT service_name AS 服务区名称, (SUM(pay_sum) / NULLIF(SUM(order_sum), 0)) AS 平均客单价 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name ORDER BY 平均客单价 DESC LIMIT 3;"
+  },
+  {
+    "question": "对比2023年4月1日各服务区现金支付与非现金支付的金额差异?",
+    "sql": "SELECT service_name AS 服务区名称, SUM(rmb) AS 现金支付总额, (SUM(wx) + SUM(zfb) + SUM(xs) + SUM(jd)) AS 非现金支付总额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name ORDER BY 现金支付总额 DESC;"
+  },
+  {
+    "question": "查询宜春分公司下所有服务区在2023年4月1日的营收汇总?",
+    "sql": "SELECT s.company_name AS 公司名称, SUM(b.pay_sum) AS 营收总额, SUM(b.order_sum) AS 订单总数 FROM bss_business_day_data b JOIN bss_service_area a ON b.service_no = a.service_area_no JOIN bss_company s ON a.company_id = s.id WHERE s.company_name = '宜春分公司' AND b.oper_date = '2023-04-01' AND b.delete_ts IS NULL AND a.delete_ts IS NULL AND s.delete_ts IS NULL GROUP BY s.company_name;"
+  },
+  {
+    "question": "统计2023年4月1日各服务区支付宝订单数量占总订单比例,并筛选高于10%的服务区?",
+    "sql": "SELECT service_name AS 服务区名称, (SUM(zf_order) * 1.0 / SUM(order_sum)) AS 支付宝订单占比 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name HAVING (SUM(zf_order) * 1.0 / SUM(order_sum)) > 0.1 ORDER BY 支付宝订单占比 DESC;"
+  },
+  {
+    "question": "获取2023年4月1日所有开放状态的服务区经营数据,包括总支付金额、订单数及支付方式明细?",
+    "sql": "SELECT b.service_name AS 服务区名称, b.branch_name AS 档口名称, b.pay_sum AS 总支付金额, b.order_sum AS 订单总数, b.wx AS 微信金额, b.zfb AS 支付宝金额, b.rmb AS 现金金额 FROM bss_business_day_data b JOIN bss_service_area s ON b.service_no = s.service_area_no WHERE b.oper_date = '2023-04-01' AND s.service_state = '开放' AND b.delete_ts IS NULL AND s.delete_ts IS NULL ORDER BY b.pay_sum DESC;"
+  },
+  {
+    "question": "各服务区2023年日均车流量排名(前10名)?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id WHERE cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC LIMIT 10;"
+  },
+  {
+    "question": "2023年各类车型在所有服务区的总流量分布占比?",
+    "sql": "SELECT car_type AS 车辆类别, SUM(customer_count) AS 总车流量, ROUND(SUM(customer_count)::numeric * 100 / (SELECT SUM(customer_count) FROM bss_car_day_count WHERE count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL), 2) AS 占比百分比 FROM bss_car_day_count WHERE count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL GROUP BY car_type ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "2023年每月总车流量趋势变化情况?",
+    "sql": "SELECT EXTRACT(YEAR FROM count_date) AS 年份, EXTRACT(MONTH FROM count_date) AS 月份, SUM(customer_count) AS 月总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL GROUP BY 年份, 月份 ORDER BY 年份, 月份;"
+  },
+  {
+    "question": "昌九路段下各服务区2023年日均车流量对比?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_section_route_area_link sral ON sa.id = sral.service_area_id JOIN bss_section_route sr ON sral.section_route_id = sr.id WHERE sr.section_name = '昌九' AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC;"
+  },
+  {
+    "question": "宜春分公司所属服务区2023年车流总量及平均值?",
+    "sql": "SELECT co.company_name AS 公司名称, COUNT(*) AS 统计天数, SUM(cdc.customer_count) AS 总车流量, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_company co ON sa.company_id = co.id WHERE co.company_name = '宜春分公司' AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND co.delete_ts IS NULL GROUP BY 公司名称;"
+  },
+  {
+    "question": "2023年危化品车辆通行量最高的前5个服务区?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, SUM(cdc.customer_count) AS 危化品车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id WHERE cdc.car_type = '危化品' AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 危化品车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "2023年每个季度各公司下属服务区的总车流量对比?",
+    "sql": "SELECT co.company_name AS 公司名称, EXTRACT(YEAR FROM cdc.count_date) AS 年份, EXTRACT(QUARTER FROM cdc.count_date) AS 季度, SUM(cdc.customer_count) AS 季度总车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_company co ON sa.company_id = co.id WHERE cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND co.delete_ts IS NULL GROUP BY 公司名称, 年份, 季度 ORDER BY 年份, 季度, 季度总车流量 DESC;"
+  },
+  {
+    "question": "2023年‘城际’类车辆日均车流量时间趋势(按月)?",
+    "sql": "SELECT EXTRACT(YEAR FROM count_date) AS 年份, EXTRACT(MONTH FROM count_date) AS 月, AVG(customer_count) AS 日均城际车流量 FROM bss_car_day_count WHERE car_type = '城际' AND count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL GROUP BY 年份, 月 ORDER BY 年份, 月;"
+  },
+  {
+    "question": "哪些服务区在2023年存在单日车流量超过10000的记录?列出其名称及最高单日流量。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, MAX(cdc.customer_count) AS 最高单日车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id WHERE cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name HAVING MAX(cdc.customer_count) > 10000 ORDER BY 最高单日车流量 DESC;"
+  },
+  {
+    "question": "2023年‘过境’与‘城际’车辆在各路段的日均车流对比分析?",
+    "sql": "SELECT sr.section_name AS 路段名称, cdc.car_type AS 车辆类型, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_section_route_area_link sral ON sa.id = sral.service_area_id JOIN bss_section_route sr ON sral.section_route_id = sr.id WHERE cdc.car_type IN ('过境', '城际') AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sr.section_name, cdc.car_type ORDER BY 路段名称, 车辆类型;"
+  },
+  {
+    "question": "各公司2023年4月总营收是多少?按营收降序排列。",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.pay_sum) AS 总营收 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND c.delete_ts IS NULL AND sa.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY c.company_name ORDER BY 总营收 DESC;"
+  },
+  {
+    "question": "各公司平均单个服务区的日均营收(2023年4月)排名如何?",
+    "sql": "SELECT c.company_name AS 公司名称, AVG(company_area_daily.avg_daily_revenue) AS 平均单区日均产出 FROM (SELECT sa.company_id, sa.service_area_no, AVG(b.pay_sum) AS avg_daily_revenue FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id, sa.service_area_no) AS company_area_daily JOIN bss_company c ON company_area_daily.company_id = c.id WHERE c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 平均单区日均产出 DESC;"
+  },
+  {
+    "question": "各公司在2023年4月的服务区车流覆盖率(有车流数据的服务区占比)是多少?",
+    "sql": "SELECT c.company_name AS 公司名称, COUNT(DISTINCT car.service_area_id) * 1.0 / COUNT(DISTINCT sa.id) AS 车流覆盖率 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id LEFT JOIN bss_car_day_count car ON sa.id = car.service_area_id AND car.count_date BETWEEN '2023-04-01' AND '2023-04-30' WHERE c.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY c.company_name ORDER BY 车流覆盖率 DESC;"
+  },
+  {
+    "question": "与2022年4月相比,各公司2023年4月营收的同比增长率是多少?",
+    "sql": "WITH revenue_2022 AS (SELECT sa.company_id, SUM(b.pay_sum) AS total_2022 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2022-04-01' AND '2022-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id), revenue_2023 AS (SELECT sa.company_id, SUM(b.pay_sum) AS total_2023 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id) SELECT c.company_name AS 公司名称, COALESCE((r2023.total_2023 - r2022.total_2022) * 100.0 / NULLIF(r2022.total_2022, 0), 0) AS 同比增长率 FROM bss_company c LEFT JOIN revenue_2022 r2022 ON c.id = r2022.company_id LEFT JOIN revenue_2023 r2023 ON c.id = r2023.company_id WHERE c.delete_ts IS NULL ORDER BY 同比增长率 DESC;"
+  },
+  {
+    "question": "哪些公司的平均单区日均营收高于整体平均水平(2023年4月)?",
+    "sql": "WITH area_avg AS (SELECT sa.company_id, sa.service_area_no, AVG(b.pay_sum) AS daily_avg FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id, sa.service_area_no), company_avg AS (SELECT company_id, AVG(daily_avg) AS company_daily_avg FROM area_avg GROUP BY company_id), overall_avg AS (SELECT AVG(company_daily_avg) AS global_avg FROM company_avg) SELECT c.company_name AS 公司名称, ca.company_daily_avg AS 平均单区日均产出 FROM company_avg ca JOIN bss_company c ON ca.company_id = c.id CROSS JOIN overall_avg o WHERE ca.company_daily_avg > o.global_avg AND c.delete_ts IS NULL ORDER BY company_daily_avg DESC;"
+  },
+  {
+    "question": "各公司2023年4月微信支付占总支付金额的比例是多少?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.wx) * 100.0 / SUM(b.pay_sum) AS 微信支付占比 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 微信支付占比 DESC;"
+  },
+  {
+    "question": "车流量最高的前5个服务区及其所属公司是哪些(2023年4月)?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, c.company_name AS 所属公司, SUM(car.customer_count) AS 总车流量 FROM bss_car_day_count car JOIN bss_service_area sa ON car.service_area_id = sa.id JOIN bss_company c ON sa.company_id = c.id WHERE car.count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY sa.service_area_name, c.company_name ORDER BY 总车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "各公司2023年4月每日平均订单总数是多少?按从高到低排序。",
+    "sql": "SELECT c.company_name AS 公司名称, AVG(b.order_sum) AS 日均订单总数 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 日均订单总数 DESC;"
+  },
+  {
+    "question": "宜春分公司在2023年4月每天的总营收趋势如何?",
+    "sql": "SELECT b.oper_date AS 统计日期, SUM(b.pay_sum) AS 日总营收 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE c.company_name = '宜春分公司' AND b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND c.delete_ts IS NULL AND sa.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY b.oper_date ORDER BY 统计日期;"
+  },
+  {
+    "question": "各公司在2023年4月的现金支付总额占比分布情况如何?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.rmb) * 100.0 / SUM(b.pay_sum) AS 现金支付占比 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 现金支付占比 DESC;"
+  },
+  {
+    "question": "各路段路线的总营收排名(近30天),用于识别高价值路线?",
+    "sql": "SELECT sr.route_name AS 路线名称, SUM(bdd.pay_sum) AS 总营收 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no WHERE bdd.oper_date >= CURRENT_DATE - 30 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND bdd.delete_ts IS NULL GROUP BY sr.route_name ORDER BY 总营收 DESC;"
+  },
+  {
+    "question": "每条路线下的平均单区车流量(近7天),用于评估路线吸引力?",
+    "sql": "SELECT sr.route_name AS 路线名称, AVG(car.customer_count) AS 单区平均车流 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE car.count_date >= CURRENT_DATE - 7 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY sr.route_name;"
+  },
+  {
+    "question": "各路段路线的服务区数量及覆盖率(开放状态),辅助招商布局决策?",
+    "sql": "SELECT sr.section_name AS 路段名称, sr.route_name AS 路线名称, COUNT(sa.id) AS 服务区数量, ROUND(COUNT(sa.id)::numeric / (SELECT COUNT(*) FROM bss_service_area WHERE service_state = '开放' AND delete_ts IS NULL), 4) AS 服务区覆盖率 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id WHERE sa.service_state = '开放' AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sr.section_name, sr.route_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "昌九路段下各服务区近一周日均车流量TOP5?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(car.customer_count) AS 日均车流量 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE sr.section_name = '昌九' AND car.count_date >= CURRENT_DATE - 7 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "不同公司管理的路段路线数量分布,用于资源均衡分析?",
+    "sql": "SELECT c.company_name AS 公司名称, COUNT(DISTINCT sr.id) AS 管辖路线数 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_section_route_area_link link ON sa.id = link.service_area_id JOIN bss_section_route sr ON link.section_route_id = sr.id WHERE c.delete_ts IS NULL AND sa.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY c.company_name ORDER BY 管辖路线数 DESC;"
+  },
+  {
+    "question": "近一个月微信支付金额最高的服务区TOP3及其所属路线?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sr.route_name AS 所属路线, SUM(bdd.wx) AS 微信总金额 FROM bss_service_area sa JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no JOIN bss_section_route_area_link link ON sa.id = link.service_area_id JOIN bss_section_route sr ON link.section_route_id = sr.id WHERE bdd.oper_date >= CURRENT_DATE - 30 AND bdd.delete_ts IS NULL AND sa.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sa.service_area_name, sr.route_name ORDER BY 微信总金额 DESC LIMIT 3;"
+  },
+  {
+    "question": "昌栗路段每日总营收趋势(最近7天),用于短期运营监控?",
+    "sql": "SELECT bdd.oper_date AS 统计日期, SUM(bdd.pay_sum) AS 日总营收 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no WHERE sr.section_name = '昌栗' AND bdd.oper_date >= CURRENT_DATE - 7 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND bdd.delete_ts IS NULL GROUP BY bdd.oper_date ORDER BY 统计日期;"
+  },
+  {
+    "question": "哪些路线没有关联任何服务区?用于数据完整性校验?",
+    "sql": "SELECT sr.route_name AS 无服务区路线 FROM bss_section_route sr LEFT JOIN bss_section_route_area_link link ON sr.id = link.section_route_id WHERE link.section_route_id IS NULL AND sr.delete_ts IS NULL;"
+  },
+  {
+    "question": "各路段路线的订单总数与平均客单价(近30天),综合评估消费活跃度?",
+    "sql": "SELECT sr.section_name AS 路段名称, sr.route_name AS 路线名称, SUM(bdd.order_sum) AS 订单总数, ROUND(SUM(bdd.pay_sum) / NULLIF(SUM(bdd.order_sum), 0), 2) AS 平均客单价 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no WHERE bdd.oper_date >= CURRENT_DATE - 30 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND bdd.delete_ts IS NULL GROUP BY sr.section_name, sr.route_name ORDER BY 订单总数 DESC;"
+  },
+  {
+    "question": "统计当前各服务区状态的分布情况,包括开放、关闭和上传数据的服务区数量?",
+    "sql": "SELECT service_state AS 服务区状态, COUNT(*) AS 服务区间数 FROM bss_service_area WHERE delete_ts IS NULL GROUP BY service_state ORDER BY 服务区间数 DESC;"
+  },
+  {
+    "question": "按服务区类型统计各类别下处于开放状态的服务区数量及占比?",
+    "sql": "SELECT service_area_type AS 服务区类型, COUNT(*) AS 开放数量, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS 占比百分比 FROM bss_service_area WHERE delete_ts IS NULL AND service_state = '开放' GROUP BY service_area_type;"
+  },
+  {
+    "question": "查询最近7天内有经营数据记录的开放服务区列表及其所属公司名称?",
+    "sql": "SELECT DISTINCT sa.service_area_name AS 服务区名称, c.company_name AS 所属公司 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL AND bd.oper_date >= CURRENT_DATE - INTERVAL '7 days' AND sa.service_state = '开放' ORDER BY 所属公司, 服务区名称;"
+  },
+  {
+    "question": "列出所有未产生任何车辆流量数据的服务区(可能异常)及其基本信息?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sa.service_area_no AS 服务区编码, sa.service_state AS 状态, c.company_name AS 所属公司 FROM bss_service_area sa LEFT JOIN bss_car_day_count cc ON sa.id = cc.service_area_id JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL AND cc.id IS NULL ORDER BY 所属公司;"
+  },
+  {
+    "question": "统计各公司下属服务区的总数、开放数量及运营率(开放/总数)?",
+    "sql": "SELECT c.company_name AS 公司名称, COUNT(sa.id) AS 总服务区数, COUNT(CASE WHEN sa.service_state = '开放' THEN 1 END) AS 开放服务区数, ROUND(COUNT(CASE WHEN sa.service_state = '开放' THEN 1 END) * 100.0 / COUNT(sa.id), 2) AS 运营率 FROM bss_company c LEFT JOIN bss_service_area sa ON c.id = sa.company_id AND sa.delete_ts IS NULL WHERE c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 运营率 DESC;"
+  },
+  {
+    "question": "找出过去30天日均支付总额最高的前5个开放服务区?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(bd.pay_sum) AS 日均支付金额 FROM bss_service_area sa JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE sa.delete_ts IS NULL AND sa.service_state = '开放' AND bd.oper_date >= CURRENT_DATE - INTERVAL '30 days' GROUP BY sa.service_area_name ORDER BY 日均支付金额 DESC LIMIT 5;"
+  },
+  {
+    "question": "分析不同类型服务区在最近一周的平均每日车辆流量差异?",
+    "sql": "SELECT sa.service_area_type AS 服务区类型, AVG(cd.customer_count) AS 平均每日车流量 FROM bss_service_area sa JOIN bss_car_day_count cd ON sa.id = cd.service_area_id WHERE sa.delete_ts IS NULL AND cd.count_date >= CURRENT_DATE - INTERVAL '7 days' GROUP BY sa.service_area_type ORDER BY 平均每日车流量 DESC;"
+  },
+  {
+    "question": "哪些服务区虽标记为‘开放’但近7天无任何经营数据记录(可能存在数据异常)?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sa.service_area_no AS 服务区编码, c.company_name AS 所属公司 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL AND sa.service_state = '开放' AND NOT EXISTS (SELECT 1 FROM bss_business_day_data bd WHERE bd.service_no = sa.service_area_no AND bd.oper_date >= CURRENT_DATE - INTERVAL '7 days') ORDER BY 所属公司;"
+  },
+  {
+    "question": "统计每种车辆类型在过去一个月中出现频率最高的服务区?",
+    "sql": "SELECT car_type AS 车辆类型, service_area_name AS 服务区名称, customer_count AS 车流量 FROM (SELECT cd.car_type, sa.service_area_name, cd.customer_count, ROW_NUMBER() OVER (PARTITION BY cd.car_type ORDER BY cd.customer_count DESC) AS rn FROM bss_car_day_count cd JOIN bss_service_area sa ON cd.service_area_id = sa.id WHERE cd.count_date >= CURRENT_DATE - INTERVAL '1 month' AND sa.delete_ts IS NULL) t WHERE rn = 1;"
+  },
+  {
+    "question": "汇总各公司在上一个自然月内的总订单量和总支付金额,并按金额排序?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(bd.order_sum) AS 总订单量, SUM(bd.pay_sum) AS 总支付金额 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE c.delete_ts IS NULL AND sa.delete_ts IS NULL AND bd.oper_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month') AND bd.oper_date < DATE_TRUNC('month', CURRENT_DATE) GROUP BY c.company_name ORDER BY 总支付金额 DESC;"
+  }
+]

+ 202 - 0
data_pipeline/training_data/manual_20250722_164749/qs_highway_db_20250722_165543_pair.json.backup

@@ -0,0 +1,202 @@
+[
+  {
+    "question": "统计2023年4月1日各服务区的总支付金额和订单总数,并按收入降序排列?",
+    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总支付金额, SUM(order_sum) AS 订单总数 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name ORDER BY 总支付金额 DESC;"
+  },
+  {
+    "question": "查询2023年4月1日微信支付金额占比超过50%的服务区及其占比?",
+    "sql": "SELECT service_name AS 服务区名称, (SUM(wx) / SUM(pay_sum)) AS 微信支付占比 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name HAVING (SUM(wx) / SUM(pay_sum)) > 0.5 ORDER BY 微信支付占比 DESC;"
+  },
+  {
+    "question": "列出2023年4月1日订单总数最多的前5个档口及其所属服务区?",
+    "sql": "SELECT branch_name AS 档口名称, service_name AS 服务区名称, order_sum AS 订单总数 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL ORDER BY order_sum DESC LIMIT 5;"
+  },
+  {
+    "question": "分析2023年4月1日各支付方式的总金额分布情况?",
+    "sql": "SELECT '微信' AS 支付方式, SUM(wx) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '支付宝' AS 支付方式, SUM(zfb) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '现金' AS 支付方式, SUM(rmb) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '行吧' AS 支付方式, SUM(xs) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL UNION ALL SELECT '金豆' AS 支付方式, SUM(jd) AS 总金额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL ORDER BY 总金额 DESC;"
+  },
+  {
+    "question": "计算各公司在2023年4月1日的总营收并按公司名称排序?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.pay_sum) AS 总营收 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date = '2023-04-01' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 公司名称;"
+  },
+  {
+    "question": "找出2023年4月1日平均客单价最高的前3个服务区(总支付金额/订单总数)?",
+    "sql": "SELECT service_name AS 服务区名称, (SUM(pay_sum) / NULLIF(SUM(order_sum), 0)) AS 平均客单价 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name ORDER BY 平均客单价 DESC LIMIT 3;"
+  },
+  {
+    "question": "对比2023年4月1日各服务区现金支付与非现金支付的金额差异?",
+    "sql": "SELECT service_name AS 服务区名称, SUM(rmb) AS 现金支付总额, (SUM(wx) + SUM(zfb) + SUM(xs) + SUM(jd)) AS 非现金支付总额 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name ORDER BY 现金支付总额 DESC;"
+  },
+  {
+    "question": "查询宜春分公司下所有服务区在2023年4月1日的营收汇总?",
+    "sql": "SELECT s.company_name AS 公司名称, SUM(b.pay_sum) AS 营收总额, SUM(b.order_sum) AS 订单总数 FROM bss_business_day_data b JOIN bss_service_area a ON b.service_no = a.service_area_no JOIN bss_company s ON a.company_id = s.id WHERE s.company_name = '宜春分公司' AND b.oper_date = '2023-04-01' AND b.delete_ts IS NULL AND a.delete_ts IS NULL AND s.delete_ts IS NULL GROUP BY s.company_name;"
+  },
+  {
+    "question": "统计2023年4月1日各服务区支付宝订单数量占总订单比例,并筛选高于10%的服务区?",
+    "sql": "SELECT service_name AS 服务区名称, (SUM(zf_order) * 1.0 / SUM(order_sum)) AS 支付宝订单占比 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name HAVING (SUM(zf_order) * 1.0 / SUM(order_sum)) > 0.1 ORDER BY 支付宝订单占比 DESC;"
+  },
+  {
+    "question": "获取2023年4月1日所有开放状态的服务区经营数据,包括总支付金额、订单数及支付方式明细?",
+    "sql": "SELECT b.service_name AS 服务区名称, b.branch_name AS 档口名称, b.pay_sum AS 总支付金额, b.order_sum AS 订单总数, b.wx AS 微信金额, b.zfb AS 支付宝金额, b.rmb AS 现金金额 FROM bss_business_day_data b JOIN bss_service_area s ON b.service_no = s.service_area_no WHERE b.oper_date = '2023-04-01' AND s.service_state = '开放' AND b.delete_ts IS NULL AND s.delete_ts IS NULL ORDER BY b.pay_sum DESC;"
+  },
+  {
+    "question": "各服务区2023年日均车流量排名(前10名)?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id WHERE cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC LIMIT 10;"
+  },
+  {
+    "question": "2023年各类车型在所有服务区的总流量分布占比?",
+    "sql": "SELECT car_type AS 车辆类别, SUM(customer_count) AS 总车流量, ROUND(SUM(customer_count)::numeric * 100 / (SELECT SUM(customer_count) FROM bss_car_day_count WHERE count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL), 2) AS 占比百分比 FROM bss_car_day_count WHERE count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL GROUP BY car_type ORDER BY 总车流量 DESC;"
+  },
+  {
+    "question": "2023年每月总车流量趋势变化情况?",
+    "sql": "SELECT EXTRACT(YEAR FROM count_date) AS 年份, EXTRACT(MONTH FROM count_date) AS 月份, SUM(customer_count) AS 月总车流量 FROM bss_car_day_count WHERE count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL GROUP BY 年份, 月份 ORDER BY 年份, 月份;"
+  },
+  {
+    "question": "昌九路段下各服务区2023年日均车流量对比?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_section_route_area_link sral ON sa.id = sral.service_area_id JOIN bss_section_route sr ON sral.section_route_id = sr.id WHERE sr.section_name = '昌九' AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC;"
+  },
+  {
+    "question": "宜春分公司所属服务区2023年车流总量及平均值?",
+    "sql": "SELECT co.company_name AS 公司名称, COUNT(*) AS 统计天数, SUM(cdc.customer_count) AS 总车流量, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_company co ON sa.company_id = co.id WHERE co.company_name = '宜春分公司' AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND co.delete_ts IS NULL GROUP BY 公司名称;"
+  },
+  {
+    "question": "2023年危化品车辆通行量最高的前5个服务区?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, SUM(cdc.customer_count) AS 危化品车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id WHERE cdc.car_type = '危化品' AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 危化品车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "2023年每个季度各公司下属服务区的总车流量对比?",
+    "sql": "SELECT co.company_name AS 公司名称, EXTRACT(YEAR FROM cdc.count_date) AS 年份, EXTRACT(QUARTER FROM cdc.count_date) AS 季度, SUM(cdc.customer_count) AS 季度总车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_company co ON sa.company_id = co.id WHERE cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND co.delete_ts IS NULL GROUP BY 公司名称, 年份, 季度 ORDER BY 年份, 季度, 季度总车流量 DESC;"
+  },
+  {
+    "question": "2023年‘城际’类车辆日均车流量时间趋势(按月)?",
+    "sql": "SELECT EXTRACT(YEAR FROM count_date) AS 年份, EXTRACT(MONTH FROM count_date) AS 月, AVG(customer_count) AS 日均城际车流量 FROM bss_car_day_count WHERE car_type = '城际' AND count_date BETWEEN '2023-01-01' AND '2023-12-31' AND delete_ts IS NULL GROUP BY 年份, 月 ORDER BY 年份, 月;"
+  },
+  {
+    "question": "哪些服务区在2023年存在单日车流量超过10000的记录?列出其名称及最高单日流量。",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, MAX(cdc.customer_count) AS 最高单日车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id WHERE cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name HAVING MAX(cdc.customer_count) > 10000 ORDER BY 最高单日车流量 DESC;"
+  },
+  {
+    "question": "2023年‘过境’与‘城际’车辆在各路段的日均车流对比分析?",
+    "sql": "SELECT sr.section_name AS 路段名称, cdc.car_type AS 车辆类型, AVG(cdc.customer_count) AS 日均车流量 FROM bss_car_day_count cdc JOIN bss_service_area sa ON cdc.service_area_id = sa.id JOIN bss_section_route_area_link sral ON sa.id = sral.service_area_id JOIN bss_section_route sr ON sral.section_route_id = sr.id WHERE cdc.car_type IN ('过境', '城际') AND cdc.count_date BETWEEN '2023-01-01' AND '2023-12-31' AND cdc.delete_ts IS NULL AND sa.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sr.section_name, cdc.car_type ORDER BY 路段名称, 车辆类型;"
+  },
+  {
+    "question": "各公司2023年4月总营收是多少?按营收降序排列。",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.pay_sum) AS 总营收 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND c.delete_ts IS NULL AND sa.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY c.company_name ORDER BY 总营收 DESC;"
+  },
+  {
+    "question": "各公司平均单个服务区的日均营收(2023年4月)排名如何?",
+    "sql": "SELECT c.company_name AS 公司名称, AVG(company_area_daily.avg_daily_revenue) AS 平均单区日均产出 FROM (SELECT sa.company_id, sa.service_area_no, AVG(b.pay_sum) AS avg_daily_revenue FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id, sa.service_area_no) AS company_area_daily JOIN bss_company c ON company_area_daily.company_id = c.id WHERE c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 平均单区日均产出 DESC;"
+  },
+  {
+    "question": "各公司在2023年4月的服务区车流覆盖率(有车流数据的服务区占比)是多少?",
+    "sql": "SELECT c.company_name AS 公司名称, COUNT(DISTINCT car.service_area_id) * 1.0 / COUNT(DISTINCT sa.id) AS 车流覆盖率 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id LEFT JOIN bss_car_day_count car ON sa.id = car.service_area_id AND car.count_date BETWEEN '2023-04-01' AND '2023-04-30' WHERE c.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY c.company_name ORDER BY 车流覆盖率 DESC;"
+  },
+  {
+    "question": "与2022年4月相比,各公司2023年4月营收的同比增长率是多少?",
+    "sql": "WITH revenue_2022 AS (SELECT sa.company_id, SUM(b.pay_sum) AS total_2022 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2022-04-01' AND '2022-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id), revenue_2023 AS (SELECT sa.company_id, SUM(b.pay_sum) AS total_2023 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id) SELECT c.company_name AS 公司名称, COALESCE((r2023.total_2023 - r2022.total_2022) * 100.0 / NULLIF(r2022.total_2022, 0), 0) AS 同比增长率 FROM bss_company c LEFT JOIN revenue_2022 r2022 ON c.id = r2022.company_id LEFT JOIN revenue_2023 r2023 ON c.id = r2023.company_id WHERE c.delete_ts IS NULL ORDER BY 同比增长率 DESC;"
+  },
+  {
+    "question": "哪些公司的平均单区日均营收高于整体平均水平(2023年4月)?",
+    "sql": "WITH area_avg AS (SELECT sa.company_id, sa.service_area_no, AVG(b.pay_sum) AS daily_avg FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.company_id, sa.service_area_no), company_avg AS (SELECT company_id, AVG(daily_avg) AS company_daily_avg FROM area_avg GROUP BY company_id), overall_avg AS (SELECT AVG(company_daily_avg) AS global_avg FROM company_avg) SELECT c.company_name AS 公司名称, ca.company_daily_avg AS 平均单区日均产出 FROM company_avg ca JOIN bss_company c ON ca.company_id = c.id CROSS JOIN overall_avg o WHERE ca.company_daily_avg > o.global_avg AND c.delete_ts IS NULL ORDER BY company_daily_avg DESC;"
+  },
+  {
+    "question": "各公司2023年4月微信支付占总支付金额的比例是多少?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.wx) * 100.0 / SUM(b.pay_sum) AS 微信支付占比 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 微信支付占比 DESC;"
+  },
+  {
+    "question": "车流量最高的前5个服务区及其所属公司是哪些(2023年4月)?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, c.company_name AS 所属公司, SUM(car.customer_count) AS 总车流量 FROM bss_car_day_count car JOIN bss_service_area sa ON car.service_area_id = sa.id JOIN bss_company c ON sa.company_id = c.id WHERE car.count_date BETWEEN '2023-04-01' AND '2023-04-30' AND car.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY sa.service_area_name, c.company_name ORDER BY 总车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "各公司2023年4月每日平均订单总数是多少?按从高到低排序。",
+    "sql": "SELECT c.company_name AS 公司名称, AVG(b.order_sum) AS 日均订单总数 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 日均订单总数 DESC;"
+  },
+  {
+    "question": "宜春分公司在2023年4月每天的总营收趋势如何?",
+    "sql": "SELECT b.oper_date AS 统计日期, SUM(b.pay_sum) AS 日总营收 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE c.company_name = '宜春分公司' AND b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND c.delete_ts IS NULL AND sa.delete_ts IS NULL AND b.delete_ts IS NULL GROUP BY b.oper_date ORDER BY 统计日期;"
+  },
+  {
+    "question": "各公司在2023年4月的现金支付总额占比分布情况如何?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(b.rmb) * 100.0 / SUM(b.pay_sum) AS 现金支付占比 FROM bss_business_day_data b JOIN bss_service_area sa ON b.service_no = sa.service_area_no JOIN bss_company c ON sa.company_id = c.id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND sa.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 现金支付占比 DESC;"
+  },
+  {
+    "question": "各路段路线的总营收排名(近30天),用于识别高价值路线?",
+    "sql": "SELECT sr.route_name AS 路线名称, SUM(bdd.pay_sum) AS 总营收 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no WHERE bdd.oper_date >= CURRENT_DATE - 30 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND bdd.delete_ts IS NULL GROUP BY sr.route_name ORDER BY 总营收 DESC;"
+  },
+  {
+    "question": "每条路线下的平均单区车流量(近7天),用于评估路线吸引力?",
+    "sql": "SELECT sr.route_name AS 路线名称, AVG(car.customer_count) AS 单区平均车流 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE car.count_date >= CURRENT_DATE - 7 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY sr.route_name;"
+  },
+  {
+    "question": "各路段路线的服务区数量及覆盖率(开放状态),辅助招商布局决策?",
+    "sql": "SELECT sr.section_name AS 路段名称, sr.route_name AS 路线名称, COUNT(sa.id) AS 服务区数量, ROUND(COUNT(sa.id)::numeric / (SELECT COUNT(*) FROM bss_service_area WHERE service_state = '开放' AND delete_ts IS NULL), 4) AS 服务区覆盖率 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id WHERE sa.service_state = '开放' AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sr.section_name, sr.route_name ORDER BY 服务区数量 DESC;"
+  },
+  {
+    "question": "昌九路段下各服务区近一周日均车流量TOP5?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(car.customer_count) AS 日均车流量 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE sr.section_name = '昌九' AND car.count_date >= CURRENT_DATE - 7 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND car.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC LIMIT 5;"
+  },
+  {
+    "question": "不同公司管理的路段路线数量分布,用于资源均衡分析?",
+    "sql": "SELECT c.company_name AS 公司名称, COUNT(DISTINCT sr.id) AS 管辖路线数 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_section_route_area_link link ON sa.id = link.service_area_id JOIN bss_section_route sr ON link.section_route_id = sr.id WHERE c.delete_ts IS NULL AND sa.delete_ts IS NULL AND link.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY c.company_name ORDER BY 管辖路线数 DESC;"
+  },
+  {
+    "question": "近一个月微信支付金额最高的服务区TOP3及其所属路线?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sr.route_name AS 所属路线, SUM(bdd.wx) AS 微信总金额 FROM bss_service_area sa JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no JOIN bss_section_route_area_link link ON sa.id = link.service_area_id JOIN bss_section_route sr ON link.section_route_id = sr.id WHERE bdd.oper_date >= CURRENT_DATE - 30 AND bdd.delete_ts IS NULL AND sa.delete_ts IS NULL AND link.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sa.service_area_name, sr.route_name ORDER BY 微信总金额 DESC LIMIT 3;"
+  },
+  {
+    "question": "各路线危化品车辆占比(近30天),用于安全与服务策略优化?",
+    "sql": "SELECT sr.route_name AS 路线名称, SUM(CASE WHEN car.car_type = '危化品' THEN car.customer_count ELSE 0 END)::numeric / SUM(car.customer_count) AS 危化品占比 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE car.count_date >= CURRENT_DATE - 30 AND car.delete_ts IS NULL AND sa.delete_ts IS NULL AND link.delete_ts IS NULL AND sr.delete_ts IS NULL GROUP BY sr.route_name ORDER BY 危化品占比 DESC;"
+  },
+  {
+    "question": "昌栗路段每日总营收趋势(最近7天),用于短期运营监控?",
+    "sql": "SELECT bdd.oper_date AS 统计日期, SUM(bdd.pay_sum) AS 日总营收 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no WHERE sr.section_name = '昌栗' AND bdd.oper_date >= CURRENT_DATE - 7 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND bdd.delete_ts IS NULL GROUP BY bdd.oper_date ORDER BY 统计日期;"
+  },
+  {
+    "question": "哪些路线没有关联任何服务区?用于数据完整性校验?",
+    "sql": "SELECT sr.route_name AS 无服务区路线 FROM bss_section_route sr LEFT JOIN bss_section_route_area_link link ON sr.id = link.section_route_id WHERE link.section_route_id IS NULL AND sr.delete_ts IS NULL;"
+  },
+  {
+    "question": "各路段路线的订单总数与平均客单价(近30天),综合评估消费活跃度?",
+    "sql": "SELECT sr.section_name AS 路段名称, sr.route_name AS 路线名称, SUM(bdd.order_sum) AS 订单总数, ROUND(SUM(bdd.pay_sum) / NULLIF(SUM(bdd.order_sum), 0), 2) AS 平均客单价 FROM bss_section_route sr JOIN bss_section_route_area_link link ON sr.id = link.section_route_id JOIN bss_service_area sa ON link.service_area_id = sa.id JOIN bss_business_day_data bdd ON sa.service_area_no = bdd.service_no WHERE bdd.oper_date >= CURRENT_DATE - 30 AND sr.delete_ts IS NULL AND sa.delete_ts IS NULL AND bdd.delete_ts IS NULL GROUP BY sr.section_name, sr.route_name ORDER BY 订单总数 DESC;"
+  },
+  {
+    "question": "统计当前各服务区状态的分布情况,包括开放、关闭和上传数据的服务区数量?",
+    "sql": "SELECT service_state AS 服务区状态, COUNT(*) AS 服务区间数 FROM bss_service_area WHERE delete_ts IS NULL GROUP BY service_state ORDER BY 服务区间数 DESC;"
+  },
+  {
+    "question": "按服务区类型统计各类别下处于开放状态的服务区数量及占比?",
+    "sql": "SELECT service_area_type AS 服务区类型, COUNT(*) AS 开放数量, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS 占比百分比 FROM bss_service_area WHERE delete_ts IS NULL AND service_state = '开放' GROUP BY service_area_type;"
+  },
+  {
+    "question": "查询最近7天内有经营数据记录的开放服务区列表及其所属公司名称?",
+    "sql": "SELECT DISTINCT sa.service_area_name AS 服务区名称, c.company_name AS 所属公司 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL AND bd.oper_date >= CURRENT_DATE - INTERVAL '7 days' AND sa.service_state = '开放' ORDER BY 所属公司, 服务区名称;"
+  },
+  {
+    "question": "列出所有未产生任何车辆流量数据的服务区(可能异常)及其基本信息?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sa.service_area_no AS 服务区编码, sa.service_state AS 状态, c.company_name AS 所属公司 FROM bss_service_area sa LEFT JOIN bss_car_day_count cc ON sa.id = cc.service_area_id JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL AND cc.id IS NULL ORDER BY 所属公司;"
+  },
+  {
+    "question": "统计各公司下属服务区的总数、开放数量及运营率(开放/总数)?",
+    "sql": "SELECT c.company_name AS 公司名称, COUNT(sa.id) AS 总服务区数, COUNT(CASE WHEN sa.service_state = '开放' THEN 1 END) AS 开放服务区数, ROUND(COUNT(CASE WHEN sa.service_state = '开放' THEN 1 END) * 100.0 / COUNT(sa.id), 2) AS 运营率 FROM bss_company c LEFT JOIN bss_service_area sa ON c.id = sa.company_id AND sa.delete_ts IS NULL WHERE c.delete_ts IS NULL GROUP BY c.company_name ORDER BY 运营率 DESC;"
+  },
+  {
+    "question": "找出过去30天日均支付总额最高的前5个开放服务区?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(bd.pay_sum) AS 日均支付金额 FROM bss_service_area sa JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE sa.delete_ts IS NULL AND sa.service_state = '开放' AND bd.oper_date >= CURRENT_DATE - INTERVAL '30 days' GROUP BY sa.service_area_name ORDER BY 日均支付金额 DESC LIMIT 5;"
+  },
+  {
+    "question": "分析不同类型服务区在最近一周的平均每日车辆流量差异?",
+    "sql": "SELECT sa.service_area_type AS 服务区类型, AVG(cd.customer_count) AS 平均每日车流量 FROM bss_service_area sa JOIN bss_car_day_count cd ON sa.id = cd.service_area_id WHERE sa.delete_ts IS NULL AND cd.count_date >= CURRENT_DATE - INTERVAL '7 days' GROUP BY sa.service_area_type ORDER BY 平均每日车流量 DESC;"
+  },
+  {
+    "question": "哪些服务区虽标记为‘开放’但近7天无任何经营数据记录(可能存在数据异常)?",
+    "sql": "SELECT sa.service_area_name AS 服务区名称, sa.service_area_no AS 服务区编码, c.company_name AS 所属公司 FROM bss_service_area sa JOIN bss_company c ON sa.company_id = c.id WHERE sa.delete_ts IS NULL AND c.delete_ts IS NULL AND sa.service_state = '开放' AND NOT EXISTS (SELECT 1 FROM bss_business_day_data bd WHERE bd.service_no = sa.service_area_no AND bd.oper_date >= CURRENT_DATE - INTERVAL '7 days') ORDER BY 所属公司;"
+  },
+  {
+    "question": "统计每种车辆类型在过去一个月中出现频率最高的服务区?",
+    "sql": "SELECT car_type AS 车辆类型, service_area_name AS 服务区名称, customer_count AS 车流量 FROM (SELECT cd.car_type, sa.service_area_name, cd.customer_count, ROW_NUMBER() OVER (PARTITION BY cd.car_type ORDER BY cd.customer_count DESC) AS rn FROM bss_car_day_count cd JOIN bss_service_area sa ON cd.service_area_id = sa.id WHERE cd.count_date >= CURRENT_DATE - INTERVAL '1 month' AND sa.delete_ts IS NULL) t WHERE rn = 1;"
+  },
+  {
+    "question": "汇总各公司在上一个自然月内的总订单量和总支付金额,并按金额排序?",
+    "sql": "SELECT c.company_name AS 公司名称, SUM(bd.order_sum) AS 总订单量, SUM(bd.pay_sum) AS 总支付金额 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE c.delete_ts IS NULL AND sa.delete_ts IS NULL AND bd.oper_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month') AND bd.oper_date < DATE_TRUNC('month', CURRENT_DATE) GROUP BY c.company_name ORDER BY 总支付金额 DESC;"
+  }
+]

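The block above adds a new set of question/SQL training pairs. Purely as an illustrative sketch (the path and the checks are assumptions, not the pipeline's actual loader), a pair file of this shape can be sanity-checked before it is fed into training:

# Hypothetical sketch: basic validation of a question/SQL pair file such as the
# one added above. The path and the checks are illustrative assumptions only.
import json
from pathlib import Path

def load_pairs(path: str) -> list:
    pairs = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(pairs, list):
        raise ValueError("pair file must contain a JSON array")
    for i, pair in enumerate(pairs):
        question = (pair.get("question") or "").strip()
        sql = (pair.get("sql") or "").strip()
        # every entry must carry a non-empty question and a terminated SQL statement
        if not question or not sql.endswith(";"):
            raise ValueError(f"entry {i} is not a valid question/SQL pair")
    return pairs

# Example (path is a placeholder):
# pairs = load_pairs("data_pipeline/training_data/manual_20250722_164749/qs_pairs.json")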
+ 5 - 0
data_pipeline/training_data/manual_20250722_164749/vector_bak/langchain_pg_collection_20250722_165619.csv

@@ -0,0 +1,5 @@
+uuid,name,cmetadata
+f4e11877-44e7-4741-b511-fa0e2e399395,sql,
+f0b714ca-44a9-433a-8768-390740bd1a18,ddl,
+98b97e3a-752d-4115-9667-7635687dbc1c,documentation,
+ab83ab0a-5649-4722-984d-b093227cdb02,error_sql,

File diff not shown because it is too large
+ 1 - 0
data_pipeline/training_data/manual_20250722_164749/vector_bak/langchain_pg_embedding_20250722_165619.csv


+ 11 - 0
data_pipeline/training_data/manual_20250722_164749/vector_bak/vector_backup_log.txt

@@ -0,0 +1,11 @@
+=== Vector Table Backup Log ===
+Backup Time: 2025-07-22 16:56:19
+Task ID: manual_20250722_164749
+Duration: 0.00s
+
+Tables Backup Status:
+✓ langchain_pg_collection: 4 rows -> langchain_pg_collection_20250722_165619.csv (209.0 B)
+✓ langchain_pg_embedding: 62 rows -> langchain_pg_embedding_20250722_165619.csv (818.9 KB)
+
+Truncate Status:
+- Not performed

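The backup log above records two pgvector backing tables exported to timestamped CSV files, with truncation skipped. As a rough illustration only, the following Python sketch shows one way such an export could be reproduced with psycopg2; the DSN, output directory, and file naming are assumptions, not the project's actual backup implementation.

# Hypothetical sketch: dump pgvector backing tables to timestamped CSV files,
# mirroring the layout recorded in vector_backup_log.txt. The DSN, directory
# layout, and table list are assumptions for illustration only.
from datetime import datetime
from pathlib import Path

import psycopg2  # assumed client library; the project may use another driver

TABLES = ("langchain_pg_collection", "langchain_pg_embedding")

def backup_vector_tables(dsn: str, out_dir: str = "vector_bak") -> None:
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for table in TABLES:
                csv_path = out / f"{table}_{ts}.csv"
                with open(csv_path, "w", encoding="utf-8") as f:
                    # COPY ... TO STDOUT streams the whole table as CSV with a header row
                    cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)

# Example call (connection string is a placeholder):
# backup_vector_tables("dbname=highway_pgvector_db user=postgres host=localhost")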
+ 0 - 31
data_pipeline/training_data/task_20250701_131627/bss_business_day_data.ddl

@@ -1,31 +0,0 @@
--- 中文名: 业务支撑系统每日营业数据表
--- 描述: 业务支撑系统每日营业数据表,记录各服务区运营统计信息,包含统计日期、服务区编码及版本控制字段。
-create table public.bss_business_day_data (
-  id varchar(32) not null     -- 主键标识符,主键,
-  version integer not null    -- 数据版本号,
-  create_ts timestamp         -- 创建时间,
-  created_by varchar(50)      -- 创建人账号,
-  update_ts timestamp         -- 更新时间,
-  updated_by varchar(50)      -- 最后更新人,
-  delete_ts timestamp         -- 删除时间,
-  deleted_by varchar(50)      -- 删除操作人,
-  oper_date date              -- 统计日期,
-  service_no varchar(255)     -- 服务区编码,
-  service_name varchar(255)   -- 服务区名称,
-  branch_no varchar(255)      -- 档口编码,
-  branch_name varchar(255)    -- 档口名称,
-  wx numeric(19,4)            -- 微信支付金额,
-  wx_order integer            -- 微信订单数量,
-  zfb numeric(19,4)           -- 支付宝支付金额,
-  zf_order integer            -- 支付宝订单数,
-  rmb numeric(19,4)           -- 现金支付金额,
-  rmb_order integer           -- 现金订单数量,
-  xs numeric(19,4)            -- 行吧支付金额,
-  xs_order integer            -- 行吧订单数量,
-  jd numeric(19,4)            -- 金豆支付金额,
-  jd_order integer            -- 金豆订单数量,
-  order_sum integer           -- 订单总数,
-  pay_sum numeric(19,4)       -- 支付总金额,
-  source_type integer         -- 数据来源类别,
-  primary key (id)
-);

+ 0 - 32
data_pipeline/training_data/task_20250701_131627/bss_business_day_data_detail.md

@@ -1,32 +0,0 @@
-## bss_business_day_data(业务支撑系统每日营业数据表)
-bss_business_day_data 表业务支撑系统每日营业数据表,记录各服务区运营统计信息,包含统计日期、服务区编码及版本控制字段。
-字段列表:
-- id (varchar(32)) - 主键标识符 [主键, 非空] [示例: 00827DFF993D415488EA1F07CAE6C440, 00e799048b8cbb8ee758eac9c8b4b820]
-- version (integer) - 数据版本号 [非空] [示例: 1]
-- create_ts (timestamp) - 创建时间 [示例: 2023-04-02 08:31:51, 2023-04-02 02:30:08]
-- created_by (varchar(50)) - 创建人账号 [示例: xingba]
-- update_ts (timestamp) - 更新时间 [示例: 2023-04-02 08:31:51, 2023-04-02 02:30:08]
-- updated_by (varchar(50)) - 最后更新人
-- delete_ts (timestamp) - 删除时间
-- deleted_by (varchar(50)) - 删除操作人
-- oper_date (date) - 统计日期 [示例: 2023-04-01]
-- service_no (varchar(255)) - 服务区编码 [示例: 1028, H0501]
-- service_name (varchar(255)) - 服务区名称 [示例: 宜春服务区, 庐山服务区]
-- branch_no (varchar(255)) - 档口编码 [示例: 1, H05016]
-- branch_name (varchar(255)) - 档口名称 [示例: 宜春南区, 庐山鲜徕客东区]
-- wx (numeric(19,4)) - 微信支付金额 [示例: 4790.0000, 2523.0000]
-- wx_order (integer) - 微信订单数量 [示例: 253, 133]
-- zfb (numeric(19,4)) - 支付宝支付金额 [示例: 229.0000, 0.0000]
-- zf_order (integer) - 支付宝订单数 [示例: 15, 0]
-- rmb (numeric(19,4)) - 现金支付金额 [示例: 1058.5000, 124.0000]
-- rmb_order (integer) - 现金订单数量 [示例: 56, 12]
-- xs (numeric(19,4)) - 行吧支付金额 [示例: 0.0000, 40.0000]
-- xs_order (integer) - 行吧订单数量 [示例: 0, 1]
-- jd (numeric(19,4)) - 金豆支付金额 [示例: 0.0000]
-- jd_order (integer) - 金豆订单数量 [示例: 0]
-- order_sum (integer) - 订单总数 [示例: 324, 146]
-- pay_sum (numeric(19,4)) - 支付总金额 [示例: 6077.5000, 2687.0000]
-- source_type (integer) - 数据来源类别 [示例: 1, 0, 4]
-字段补充说明:
-- id 为主键
-- source_type 为枚举字段,包含取值:0、4、1、2、3

+ 0 - 15
data_pipeline/training_data/task_20250701_131627/bss_company_detail.md

@@ -1,15 +0,0 @@
-## bss_company(存储高速公路服务区合作公司基础信息(含公司名称及唯一编码))
-bss_company 表存储高速公路服务区合作公司基础信息(含公司名称及唯一编码),用于业务支撑系统中企业信息管理与业务关联支撑。
-字段列表:
-- id (varchar(32)) - 主键ID [主键, 非空] [示例: 30675d85ba5044c31acfa243b9d16334, 47ed0bb37f5a85f3d9245e4854959b81]
-- version (integer) - 版本号 [非空] [示例: 1, 2]
-- create_ts (timestamp) - 创建时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
-- created_by (varchar(50)) - 创建人 [示例: admin]
-- update_ts (timestamp) - 更新时间 [示例: 2021-05-20 09:51:58.718000, 2021-05-20 09:42:03.341000]
-- updated_by (varchar(50)) - 更新人 [示例: admin]
-- delete_ts (timestamp) - 删除时间
-- deleted_by (varchar(50)) - 删除人
-- company_name (varchar(255)) - 分公司名称 [示例: 上饶分公司, 宜春分公司]
-- company_no (varchar(255)) - 公司编码 [示例: H03, H02]
-字段补充说明:
-- id 为主键

+ 0 - 10
data_pipeline/training_data/task_20250701_131627/db_query_decision_prompt.txt

@@ -1,10 +0,0 @@
-=== 数据库业务范围 ===
-当前数据库存储的是高速公路服务区运营管理的相关数据,主要涉及服务区运营统计、车辆通行量、基础信息管理及路段关联,包含以下业务数据:
-核心业务实体:
-- 服务区:描述高速公路服务区基础信息,主要字段:服务区名称、服务区编码、地理坐标、服务区类型、运营状态
-- 车辆类型:描述通行车辆分类维度,主要字段:车辆类别(其他、危化品、城际、过境)
-- 路段路线:描述高速公路路段与路线归属关系,主要字段:路段名称、路线名称、路段编号
-- 合作公司:描述服务区所属分公司信息,主要字段:分公司名称、公司编码
-关键业务指标:
-- 营收指标:包含微信/支付宝/现金/行吧/金豆支付金额及订单数、支付总金额、订单总数
-- 车辆流量:按类型统计的日通行车辆数量

+ 0 - 62
data_pipeline/training_data/task_20250701_131627/metadata.txt

@@ -1,62 +0,0 @@
--- Schema Tools生成的主题元数据
--- 业务背景: 高速公路服务区管理系统
--- 生成时间: 2025-07-01 13:47:36
--- 数据库: highway_db
-
--- 创建表(如果不存在)
-CREATE TABLE IF NOT EXISTS metadata (
-    id SERIAL PRIMARY KEY,    -- 主键
-    topic_name VARCHAR(100) NOT NULL,  -- 业务主题名称
-    description TEXT,                  -- 业务主体说明
-    related_tables TEXT[],			  -- 相关表名
-    biz_entities TEXT[],               -- 主要业务实体名称
-    biz_metrics TEXT[],                -- 主要业务指标名称
-    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP    -- 插入时间
-);
-
--- 插入主题数据
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '日营收结构',
-  '分析各服务区每日营收构成及支付方式占比,优化资金管理策略',
-  'bss_business_day_data',
-  '服务区,支付方式,档口',
-  '总营收,现金占比,移动支付比例'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '车流高峰分析',
-  '通过车辆统计表识别服务区高峰时段及车型分布,指导资源调度',
-  'bss_car_day_count,bss_service_area',
-  '服务区,车辆类型,统计日期',
-  '日均车流,高峰时段,危化品车辆占比'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '分公司对比',
-  '比较不同分公司的服务区运营效率及营收能力,发现管理差异',
-  'bss_company,bss_service_area,bss_business_day_data',
-  '分公司,服务区,运营指标',
-  '人均营收,客单价,订单密度'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '路线关联分析',
-  '研究路段路线与服务区的关联关系,优化路线规划和服务区配置',
-  'bss_section_route,bss_section_route_area_link,bss_car_day_count',
-  '路段,路线,服务区',
-  '路线车流,服务区覆盖率,路线营收贡献'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '节假日效应',
-  '分析节假日前后服务区营收和车流变化,制定营销和服务方案',
-  'bss_business_day_data,bss_car_day_count',
-  '服务区,节假日,支付方式',
-  '节前增幅,节假日营收占比,车流增长率'
-);
-

+ 0 - 190
data_pipeline/training_data/task_20250701_131627/qs_highway_db_20250701_134736_pair.json

@@ -1,190 +0,0 @@
-[
-  {
-    "question": "统计2023年4月1日各服务区的总营收及现金支付金额占比",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总营收, SUM(rmb)/SUM(pay_sum)*100 AS 现金支付占比 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name;"
-  },
-  {
-    "question": "分析2023年第一季度各支付方式在总营收中的占比变化趋势",
-    "sql": "SELECT oper_date AS 统计日期, SUM(wx)/SUM(pay_sum)*100 AS 微信占比, SUM(zfb)/SUM(pay_sum)*100 AS 支付宝占比, SUM(rmb)/SUM(pay_sum)*100 AS 现金占比 FROM bss_business_day_data WHERE oper_date BETWEEN '2023-01-01' AND '2023-03-31' AND delete_ts IS NULL GROUP BY oper_date ORDER BY 统计日期;"
-  },
-  {
-    "question": "查询最近7天总营收最高的前5个服务区及其移动支付比例",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总营收, (SUM(wx)+SUM(zfb))/SUM(pay_sum)*100 AS 移动支付比例 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 7 AND oper_date < CURRENT_DATE AND delete_ts IS NULL GROUP BY service_name ORDER BY 总营收 DESC LIMIT 5;"
-  },
-  {
-    "question": "对比不同档口的现金支付订单占比并按占比排序",
-    "sql": "SELECT branch_name AS 档口名称, SUM(rmb_order)/SUM(order_sum)*100 AS 现金订单占比 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY branch_name ORDER BY 现金订单占比 DESC;"
-  },
-  {
-    "question": "计算宜春服务区2023年各季度月均营收及最大单日营收",
-    "sql": "SELECT EXTRACT(QUARTER FROM oper_date) AS 季度, AVG(pay_sum) AS 月均营收, MAX(pay_sum) AS 最大单日营收 FROM bss_business_day_data WHERE service_name = '宜春服务区' AND EXTRACT(YEAR FROM oper_date) = 2023 AND delete_ts IS NULL GROUP BY 季度 ORDER BY 季度;"
-  },
-  {
-    "question": "统计2023年4月各服务区订单总数及总营收并按营收排名",
-    "sql": "SELECT service_name AS 服务区名称, SUM(order_sum) AS 订单总数, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY service_name ORDER BY 总营收 DESC;"
-  },
-  {
-    "question": "查询最近一天移动支付占比超过80%的服务区信息",
-    "sql": "SELECT service_name AS 服务区名称, (wx+zfb)/pay_sum*100 AS 移动支付比例 FROM bss_business_day_data WHERE oper_date = (SELECT MAX(oper_date) FROM bss_business_day_data WHERE delete_ts IS NULL) AND (wx+zfb)/pay_sum > 0.8 AND delete_ts IS NULL ORDER BY 移动支付比例 DESC;"
-  },
-  {
-    "question": "分析庐山服务区2023年各星期的营收分布情况",
-    "sql": "SELECT EXTRACT(ISODOW FROM oper_date) AS 星期, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE service_name = '庐山服务区' AND EXTRACT(YEAR FROM oper_date) = 2023 AND delete_ts IS NULL GROUP BY 星期 ORDER BY 星期;"
-  },
-  {
-    "question": "统计最近一天总营收超过1万元且现金占比低于10%的服务区",
-    "sql": "SELECT service_name AS 服务区名称, pay_sum AS 总营收, rmb/pay_sum*100 AS 现金占比 FROM bss_business_day_data WHERE oper_date = (SELECT MAX(oper_date) FROM bss_business_day_data WHERE delete_ts IS NULL) AND pay_sum > 10000 AND rmb/pay_sum < 0.1 AND delete_ts IS NULL ORDER BY 总营收 DESC;"
-  },
-  {
-    "question": "对比宜春和南昌南服务区最近30天各支付方式的平均日营收",
-    "sql": "SELECT service_name AS 服务区名称, AVG(wx) AS 日均微信营收, AVG(zfb) AS 日均支付宝营收, AVG(rmb) AS 日均现金营收 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 30 AND service_name IN ('宜春服务区','南昌南服务区') AND delete_ts IS NULL GROUP BY service_name ORDER BY 服务区名称;"
-  },
-  {
-    "question": "统计各服务区日均车流量并按车流由高到低排序",
-    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(cc.customer_count) AS 日均车流量 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC;"
-  },
-  {
-    "question": "查询危化品车辆占比超过5%的服务区信息",
-    "sql": "SELECT sa.service_area_name, ROUND((SUM(CASE WHEN cc.car_type='危化品' THEN cc.customer_count ELSE 0 END)*100.0/SUM(cc.customer_count))::numeric,2) AS 危化品占比 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name HAVING SUM(CASE WHEN cc.car_type='危化品' THEN cc.customer_count ELSE 0 END)*100.0/SUM(cc.customer_count) > 5 ORDER BY 危化品占比 DESC;"
-  },
-  {
-    "question": "分析最近30天各车型日均通行量变化趋势",
-    "sql": "SELECT count_date AS 统计日期, car_type AS 车型, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE count_date >= CURRENT_DATE - 30 AND delete_ts IS NULL GROUP BY count_date, car_type ORDER BY count_date;"
-  },
-  {
-    "question": "对比周末与工作日车流量差异",
-    "sql": "SELECT CASE WHEN EXTRACT(DOW FROM count_date) IN (0,6) THEN '周末' ELSE '工作日' END AS 时段类型, AVG(customer_count) AS 平均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY 时段类型;"
-  },
-  {
-    "question": "获取各服务区过境车辆占比TOP5",
-    "sql": "SELECT sa.service_area_name, ROUND((SUM(CASE WHEN cc.car_type='过境' THEN cc.customer_count ELSE 0 END)*100.0/SUM(cc.customer_count))::numeric,2) AS 过境占比 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 过境占比 DESC LIMIT 5;"
-  },
-  {
-    "question": "统计最近一周每日总车流量及环比增长率",
-    "sql": "WITH daily_total AS (SELECT count_date, SUM(customer_count) AS total FROM bss_car_day_count WHERE count_date >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY count_date) SELECT count_date, total, LAG(total) OVER(ORDER BY count_date) AS 前一日流量, ROUND(((total - LAG(total) OVER(ORDER BY count_date))*100.0/LAG(total) OVER(ORDER BY count_date))::numeric,2) AS 环比增长率 FROM daily_total;"
-  },
-  {
-    "question": "查询连续3天车流量增长的服务区",
-    "sql": "WITH daily_growth AS (SELECT service_area_id, count_date, SUM(customer_count) AS daily_count, LAG(SUM(customer_count),1) OVER(PARTITION BY service_area_id ORDER BY count_date) AS prev_count FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id, count_date) SELECT sa.service_area_name FROM (SELECT service_area_id FROM daily_growth WHERE daily_count > prev_count GROUP BY service_area_id, count_date - generate_series(0,2)) t JOIN bss_service_area sa ON t.service_area_id = sa.id;"
-  },
-  {
-    "question": "统计各车辆类型在不同时间段的分布比例",
-    "sql": "SELECT car_type AS 车型, EXTRACT(HOUR FROM create_ts)::integer AS 小时段, ROUND(AVG(customer_count)::numeric,0) AS 平均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type, 小时段 ORDER BY 小时段;"
-  },
-  {
-    "question": "获取昨日车流量最高的3个服务区及对应车型分布",
-    "sql": "SELECT sa.service_area_name, cc.car_type, cc.customer_count FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.count_date = CURRENT_DATE - 1 AND sa.delete_ts IS NULL ORDER BY cc.customer_count DESC LIMIT 3;"
-  },
-  {
-    "question": "分析各区域城际车辆通行量与服务区开放状态的关系",
-    "sql": "SELECT sa.service_state AS 开放状态, AVG(CASE WHEN cc.car_type='城际' THEN cc.customer_count ELSE 0 END) AS 平均城际车流量 FROM bss_car_day_count cc RIGHT JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE sa.delete_ts IS NULL GROUP BY sa.service_state;"
-  },
-  {
-    "question": "各分公司2023年4月人均营收TOP5(按支付总额/车流量计算)",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.pay_sum)/SUM(car.customer_count) AS 人均营收 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no JOIN bss_car_day_count car ON sa.id = car.service_area_id AND bd.oper_date = car.count_date WHERE bd.oper_date BETWEEN '2023-04-01' AND '2023-04-30' GROUP BY c.company_name ORDER BY 人均营收 DESC LIMIT 5;"
-  },
-  {
-    "question": "2023年Q2各分公司客单价对比分析",
-    "sql": "SELECT c.company_name AS 分公司名称, AVG(bd.pay_sum/bd.order_sum) AS 客单价 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE bd.oper_date BETWEEN '2023-04-01' AND '2023-06-30' GROUP BY c.company_name ORDER BY 客单价 DESC;"
-  },
-  {
-    "question": "最近一周订单密度(订单数/面积)最低的3个分公司",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.order_sum)/COUNT(DISTINCT sa.id) AS 订单密度 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE bd.oper_date >= CURRENT_DATE - 7 GROUP BY c.company_name ORDER BY 订单密度 ASC LIMIT 3;"
-  },
-  {
-    "question": "各分公司2023年节假日营收总额环比分析",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 1 THEN bd.pay_sum ELSE 0 END) AS 一月营收, SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 2 THEN bd.pay_sum ELSE 0 END) AS 二月营收 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE EXTRACT(YEAR FROM bd.oper_date) = 2023 GROUP BY c.company_name;"
-  },
-  {
-    "question": "2023-04-01当日各分公司运营指标对比(支付总额、订单数、车流量)",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.pay_sum) AS 支付总额, SUM(bd.order_sum) AS 订单总数, SUM(car.customer_count) AS 车流量 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE bd.oper_date = '2023-04-01' GROUP BY c.company_name ORDER BY 支付总额 DESC;"
-  },
-  {
-    "question": "各分公司微信支付占比分析(近30天)",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.wx) / SUM(bd.pay_sum) * 100 AS 微信占比百分比 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE bd.oper_date >= CURRENT_DATE - 30 GROUP BY c.company_name ORDER BY 微信占比百分比 DESC;"
-  },
-  {
-    "question": "各分公司服务区数量与营收能力关联分析",
-    "sql": "SELECT c.company_name AS 分公司名称, COUNT(sa.id) AS 服务区数量, SUM(bd.pay_sum) AS 总营收 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no GROUP BY c.company_name ORDER BY 服务区数量 DESC, 总营收 DESC;"
-  },
-  {
-    "question": "2023年各分公司月均订单密度趋势分析",
-    "sql": "SELECT c.company_name AS 分公司名称, EXTRACT(MONTH FROM bd.oper_date) AS 月份, AVG(bd.order_sum) AS 月均订单密度 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE EXTRACT(YEAR FROM bd.oper_date) = 2023 GROUP BY c.company_name, 月份 ORDER BY 分公司名称, 月份;"
-  },
-  {
-    "question": "各分公司不同支付方式订单数占比分析",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.wx_order)/SUM(bd.order_sum)*100 AS 微信占比, SUM(bd.zf_order)/SUM(bd.order_sum)*100 AS 支付宝占比 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no GROUP BY c.company_name ORDER BY 微信占比 DESC;"
-  },
-  {
-    "question": "2023年Q2各分公司营收增长率分析",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 4 THEN bd.pay_sum ELSE 0 END) / SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 5 THEN bd.pay_sum ELSE 0 END) - 1 AS 月增长率 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE EXTRACT(QUARTER FROM bd.oper_date) = 2 GROUP BY c.company_name ORDER BY 月增长率 DESC;"
-  },
-  {
-    "question": "统计各路线关联的服务区数量及平均车流量,按服务区数量降序排列",
-    "sql": "SELECT r.route_name AS 路线名称, COUNT(l.service_area_id) AS 服务区数量, AVG(c.customer_count) AS 平均车流量 FROM bss_section_route r LEFT JOIN bss_section_route_area_link l ON r.id = l.section_route_id LEFT JOIN bss_car_day_count c ON l.service_area_id = c.service_area_id WHERE r.delete_ts IS NULL GROUP BY r.route_name ORDER BY 服务区数量 DESC;"
-  },
-  {
-    "question": "计算2023年Q2各路段日均车流量,筛选出日均车流量>1000的路段",
-    "sql": "SELECT s.section_name AS 路段名称, COUNT(*) AS 天数, AVG(c.customer_count) AS 日均车流量 FROM bss_section_route s JOIN bss_section_route_area_link l ON s.id = l.section_route_id JOIN bss_car_day_count c ON l.service_area_id = c.service_area_id WHERE c.count_date BETWEEN '2023-04-01' AND '2023-06-30' AND s.delete_ts IS NULL GROUP BY s.section_name HAVING AVG(c.customer_count) > 1000;"
-  },
-  {
-    "question": "查询2023年车流量TOP5服务区及对应路线信息",
-    "sql": "SELECT a.service_area_name AS 服务区名称, r.route_name AS 路线名称, SUM(c.customer_count) AS 总车流量 FROM bss_service_area a JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_section_route r ON l.section_route_id = r.id JOIN bss_car_day_count c ON a.id = c.service_area_id WHERE c.count_date BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY a.service_area_name, r.route_name ORDER BY 总车流量 DESC LIMIT 5;"
-  },
-  {
-    "question": "统计未关联服务区的路段清单及创建时间",
-    "sql": "SELECT r.section_name AS 路段名称, r.create_ts AS 创建时间 FROM bss_section_route r LEFT JOIN bss_section_route_area_link l ON r.id = l.section_route_id WHERE l.service_area_id IS NULL AND r.delete_ts IS NULL;"
-  },
-  {
-    "question": "分析春运期间(2023-01-07至2023-02-16)各路线车流变化趋势",
-    "sql": "SELECT r.route_name AS 路线名称, c.count_date AS 日期, SUM(c.customer_count) AS 总车流量 FROM bss_section_route r JOIN bss_section_route_area_link l ON r.id = l.section_route_id JOIN bss_car_day_count c ON l.service_area_id = c.service_area_id WHERE c.count_date BETWEEN '2023-01-07' AND '2023-02-16' GROUP BY r.route_name, c.count_date ORDER BY 日期;"
-  },
-  {
-    "question": "计算各服务区车流覆盖率(关联路段车流/总车流)TOP10",
-    "sql": "SELECT a.service_area_name AS 服务区名称, SUM(c.customer_count) AS 关联车流, (SELECT SUM(customer_count) FROM bss_car_day_count WHERE service_area_id = a.id) AS 总车流, ROUND((SUM(c.customer_count)/(SELECT SUM(customer_count) FROM bss_car_day_count WHERE service_area_id = a.id)) * 100)::numeric(5,2) AS 覆盖率 FROM bss_service_area a JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_car_day_count c ON a.id = c.service_area_id GROUP BY a.id, a.service_area_name ORDER BY 覆盖率 DESC LIMIT 10;"
-  },
-  {
-    "question": "分析不同分公司管辖路段的服务区密度(服务区数/路段长度)",
-    "sql": "SELECT c.company_name AS 分公司名称, COUNT(a.id) AS 服务区数量, SUM(LENGTH(s.code)) AS 路段总长度, ROUND((COUNT(a.id)/SUM(LENGTH(s.code))) * 1000)::numeric(5,2) AS 密度_每千米 FROM bss_company c JOIN bss_service_area a ON c.id = a.company_id JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_section_route s ON l.section_route_id = s.id GROUP BY c.company_name;"
-  },
-  {
-    "question": "分析2023年国庆节期间各服务区营收总额及环比增长率",
-    "sql": "WITH holiday_revenue AS (SELECT service_name, SUM(pay_sum) AS holiday_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_name), pre_holiday_revenue AS (SELECT service_name, SUM(pay_sum) AS pre_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-09-24' AND '2023-09-30' AND delete_ts IS NULL GROUP BY service_name) SELECT h.service_name, h.holiday_amount, ROUND((h.holiday_amount - p.pre_amount)/p.pre_amount*100, 2) AS growth_rate FROM holiday_revenue h JOIN pre_holiday_revenue p ON h.service_name = p.service_name ORDER BY growth_rate DESC;"
-  },
-  {
-    "question": "统计2023年春节期间各服务区节假日营收占Q1季度总营收比例",
-    "sql": "WITH q1_revenue AS (SELECT service_name, SUM(pay_sum) AS q1_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-01-01' AND '2023-03-31' AND delete_ts IS NULL GROUP BY service_name), lunar_revenue AS (SELECT service_name, SUM(pay_sum) AS lunar_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-01-20' AND '2023-01-27' AND delete_ts IS NULL GROUP BY service_name) SELECT q.service_name, ROUND(l.lunar_amount/q.q1_amount*100, 2) AS ratio FROM q1_revenue q JOIN lunar_revenue l ON q.service_name = l.service_name ORDER BY ratio DESC;"
-  },
-  {
-    "question": "对比2023年国庆节期间不同支付方式金额占比",
-    "sql": "SELECT '微信' AS pay_type, ROUND(SUM(wx)/SUM(pay_sum)*100, 2) AS ratio FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '支付宝', ROUND(SUM(zfb)/SUM(pay_sum)*100, 2) FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '现金', ROUND(SUM(rmb)/SUM(pay_sum)*100, 2) FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "分析节假日与非节假日各服务区日均车流量增长率",
-    "sql": "WITH holiday_avg AS (SELECT service_area_id, AVG(customer_count) AS holiday_avg FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_area_id), non_holiday_avg AS (SELECT service_area_id, AVG(customer_count) AS non_holiday_avg FROM bss_car_day_count WHERE count_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_area_id) SELECT h.service_area_id, ROUND((h.holiday_avg - n.non_holiday_avg)/n.non_holiday_avg*100, 2) AS growth_rate FROM holiday_avg h JOIN non_holiday_avg n ON h.service_area_id = n.service_area_id ORDER BY growth_rate DESC LIMIT 10;"
-  },
-  {
-    "question": "统计节假日车流最高峰时段的车辆类型分布",
-    "sql": "SELECT car_type, SUM(customer_count) AS total_cars FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND EXTRACT(HOUR FROM create_ts) BETWEEN 8 AND 10 AND delete_ts IS NULL GROUP BY car_type ORDER BY total_cars DESC;"
-  },
-  {
-    "question": "对比2023年五一假期与清明假期营收增幅排名TOP5服务区",
-    "sql": "WITH may_revenue AS (SELECT service_name, SUM(pay_sum) AS may_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-04-29' AND '2023-05-03' AND delete_ts IS NULL GROUP BY service_name), qingming_revenue AS (SELECT service_name, SUM(pay_sum) AS qingming_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-04-05' AND '2023-04-07' AND delete_ts IS NULL GROUP BY service_name) SELECT m.service_name, ROUND((m.may_amount - q.qingming_amount)/q.qingming_amount*100, 2) AS growth_rate FROM may_revenue m JOIN qingming_revenue q ON m.service_name = q.service_name ORDER BY growth_rate DESC LIMIT 5;"
-  },
-  {
-    "question": "分析节假日现金支付比例变化趋势",
-    "sql": "SELECT oper_date, ROUND(SUM(rmb)/SUM(pay_sum)*100, 2) AS cash_ratio FROM bss_business_day_data WHERE oper_date BETWEEN '2023-09-24' AND '2023-10-07' AND delete_ts IS NULL GROUP BY oper_date ORDER BY oper_date;"
-  },
-  {
-    "question": "统计危化品车辆节假日期间通行量同比增幅",
-    "sql": "WITH holiday_2022 AS (SELECT COUNT(*) AS cnt_2022 FROM bss_car_day_count WHERE count_date BETWEEN '2022-10-01' AND '2022-10-07' AND car_type = '危化品' AND delete_ts IS NULL), holiday_2023 AS (SELECT COUNT(*) AS cnt_2023 FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND car_type = '危化品' AND delete_ts IS NULL) SELECT ROUND((cnt_2023 - cnt_2022)/cnt_2022*100, 2) AS growth_rate FROM holiday_2022, holiday_2023;"
-  },
-  {
-    "question": "查询2023年国庆节期间营收增幅超过50%的服务区清单",
-    "sql": "WITH pre_data AS (SELECT service_name, SUM(pay_sum) AS pre_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-09-24' AND '2023-09-30' AND delete_ts IS NULL GROUP BY service_name), holiday_data AS (SELECT service_name, SUM(pay_sum) AS holiday_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_name) SELECT h.service_name, ROUND((h.holiday_amount - p.pre_amount)/p.pre_amount*100, 2) AS growth_rate FROM holiday_data h JOIN pre_data p ON h.service_name = p.service_name WHERE (h.holiday_amount - p.pre_amount)/p.pre_amount > 0.5 ORDER BY growth_rate DESC;"
-  },
-  {
-    "question": "分析节假日期间城际车辆流量与服务区地理位置的关系",
-    "sql": "SELECT s.service_area_name, s.service_position, AVG(c.customer_count) AS avg_traffic FROM bss_car_day_count c JOIN bss_service_area s ON c.service_area_id = s.id WHERE c.car_type = '城际' AND c.count_date BETWEEN '2023-10-01' AND '2023-10-07' AND c.delete_ts IS NULL GROUP BY s.service_area_name, s.service_position ORDER BY avg_traffic DESC;"
-  }
-]

+ 0 - 202
data_pipeline/training_data/task_20250701_131627/qs_highway_db_20250701_134736_pair.json.backup

@@ -1,202 +0,0 @@
-[
-  {
-    "question": "统计2023年4月1日各服务区的总营收及现金支付金额占比",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总营收, SUM(rmb)/SUM(pay_sum)*100 AS 现金支付占比 FROM bss_business_day_data WHERE oper_date = '2023-04-01' AND delete_ts IS NULL GROUP BY service_name;"
-  },
-  {
-    "question": "分析2023年第一季度各支付方式在总营收中的占比变化趋势",
-    "sql": "SELECT oper_date AS 统计日期, SUM(wx)/SUM(pay_sum)*100 AS 微信占比, SUM(zfb)/SUM(pay_sum)*100 AS 支付宝占比, SUM(rmb)/SUM(pay_sum)*100 AS 现金占比 FROM bss_business_day_data WHERE oper_date BETWEEN '2023-01-01' AND '2023-03-31' AND delete_ts IS NULL GROUP BY oper_date ORDER BY 统计日期;"
-  },
-  {
-    "question": "查询最近7天总营收最高的前5个服务区及其移动支付比例",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总营收, (SUM(wx)+SUM(zfb))/SUM(pay_sum)*100 AS 移动支付比例 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 7 AND oper_date < CURRENT_DATE AND delete_ts IS NULL GROUP BY service_name ORDER BY 总营收 DESC LIMIT 5;"
-  },
-  {
-    "question": "对比不同档口的现金支付订单占比并按占比排序",
-    "sql": "SELECT branch_name AS 档口名称, SUM(rmb_order)/SUM(order_sum)*100 AS 现金订单占比 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY branch_name ORDER BY 现金订单占比 DESC;"
-  },
-  {
-    "question": "计算宜春服务区2023年各季度月均营收及最大单日营收",
-    "sql": "SELECT EXTRACT(QUARTER FROM oper_date) AS 季度, AVG(pay_sum) AS 月均营收, MAX(pay_sum) AS 最大单日营收 FROM bss_business_day_data WHERE service_name = '宜春服务区' AND EXTRACT(YEAR FROM oper_date) = 2023 AND delete_ts IS NULL GROUP BY 季度 ORDER BY 季度;"
-  },
-  {
-    "question": "统计2023年4月各服务区订单总数及总营收并按营收排名",
-    "sql": "SELECT service_name AS 服务区名称, SUM(order_sum) AS 订单总数, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND delete_ts IS NULL GROUP BY service_name ORDER BY 总营收 DESC;"
-  },
-  {
-    "question": "查询最近一天移动支付占比超过80%的服务区信息",
-    "sql": "SELECT service_name AS 服务区名称, (wx+zfb)/pay_sum*100 AS 移动支付比例 FROM bss_business_day_data WHERE oper_date = (SELECT MAX(oper_date) FROM bss_business_day_data WHERE delete_ts IS NULL) AND (wx+zfb)/pay_sum > 0.8 AND delete_ts IS NULL ORDER BY 移动支付比例 DESC;"
-  },
-  {
-    "question": "分析庐山服务区2023年各星期的营收分布情况",
-    "sql": "SELECT EXTRACT(ISODOW FROM oper_date) AS 星期, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE service_name = '庐山服务区' AND EXTRACT(YEAR FROM oper_date) = 2023 AND delete_ts IS NULL GROUP BY 星期 ORDER BY 星期;"
-  },
-  {
-    "question": "统计最近一天总营收超过1万元且现金占比低于10%的服务区",
-    "sql": "SELECT service_name AS 服务区名称, pay_sum AS 总营收, rmb/pay_sum*100 AS 现金占比 FROM bss_business_day_data WHERE oper_date = (SELECT MAX(oper_date) FROM bss_business_day_data WHERE delete_ts IS NULL) AND pay_sum > 10000 AND rmb/pay_sum < 0.1 AND delete_ts IS NULL ORDER BY 总营收 DESC;"
-  },
-  {
-    "question": "对比宜春和南昌南服务区最近30天各支付方式的平均日营收",
-    "sql": "SELECT service_name AS 服务区名称, AVG(wx) AS 日均微信营收, AVG(zfb) AS 日均支付宝营收, AVG(rmb) AS 日均现金营收 FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 30 AND service_name IN ('宜春服务区','南昌南服务区') AND delete_ts IS NULL GROUP BY service_name ORDER BY 服务区名称;"
-  },
-  {
-    "question": "统计各服务区日均车流量并按车流由高到低排序",
-    "sql": "SELECT sa.service_area_name AS 服务区名称, AVG(cc.customer_count) AS 日均车流量 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 日均车流量 DESC;"
-  },
-  {
-    "question": "查询危化品车辆占比超过5%的服务区信息",
-    "sql": "SELECT sa.service_area_name, ROUND((SUM(CASE WHEN cc.car_type='危化品' THEN cc.customer_count ELSE 0 END)*100.0/SUM(cc.customer_count))::numeric,2) AS 危化品占比 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name HAVING SUM(CASE WHEN cc.car_type='危化品' THEN cc.customer_count ELSE 0 END)*100.0/SUM(cc.customer_count) > 5 ORDER BY 危化品占比 DESC;"
-  },
-  {
-    "question": "分析最近30天各车型日均通行量变化趋势",
-    "sql": "SELECT count_date AS 统计日期, car_type AS 车型, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE count_date >= CURRENT_DATE - 30 AND delete_ts IS NULL GROUP BY count_date, car_type ORDER BY count_date;"
-  },
-  {
-    "question": "对比周末与工作日车流量差异",
-    "sql": "SELECT CASE WHEN EXTRACT(DOW FROM count_date) IN (0,6) THEN '周末' ELSE '工作日' END AS 时段类型, AVG(customer_count) AS 平均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY 时段类型;"
-  },
-  {
-    "question": "获取各服务区过境车辆占比TOP5",
-    "sql": "SELECT sa.service_area_name, ROUND((SUM(CASE WHEN cc.car_type='过境' THEN cc.customer_count ELSE 0 END)*100.0/SUM(cc.customer_count))::numeric,2) AS 过境占比 FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.delete_ts IS NULL AND sa.delete_ts IS NULL GROUP BY sa.service_area_name ORDER BY 过境占比 DESC LIMIT 5;"
-  },
-  {
-    "question": "统计最近一周每日总车流量及环比增长率",
-    "sql": "WITH daily_total AS (SELECT count_date, SUM(customer_count) AS total FROM bss_car_day_count WHERE count_date >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY count_date) SELECT count_date, total, LAG(total) OVER(ORDER BY count_date) AS 前一日流量, ROUND(((total - LAG(total) OVER(ORDER BY count_date))*100.0/LAG(total) OVER(ORDER BY count_date))::numeric,2) AS 环比增长率 FROM daily_total;"
-  },
-  {
-    "question": "查询连续3天车流量增长的服务区",
-    "sql": "WITH daily_growth AS (SELECT service_area_id, count_date, SUM(customer_count) AS daily_count, LAG(SUM(customer_count),1) OVER(PARTITION BY service_area_id ORDER BY count_date) AS prev_count FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id, count_date) SELECT sa.service_area_name FROM (SELECT service_area_id FROM daily_growth WHERE daily_count > prev_count GROUP BY service_area_id, count_date - generate_series(0,2)) t JOIN bss_service_area sa ON t.service_area_id = sa.id;"
-  },
-  {
-    "question": "统计各车辆类型在不同时间段的分布比例",
-    "sql": "SELECT car_type AS 车型, EXTRACT(HOUR FROM create_ts)::integer AS 小时段, ROUND(AVG(customer_count)::numeric,0) AS 平均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type, 小时段 ORDER BY 小时段;"
-  },
-  {
-    "question": "获取昨日车流量最高的3个服务区及对应车型分布",
-    "sql": "SELECT sa.service_area_name, cc.car_type, cc.customer_count FROM bss_car_day_count cc JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE cc.count_date = CURRENT_DATE - 1 AND sa.delete_ts IS NULL ORDER BY cc.customer_count DESC LIMIT 3;"
-  },
-  {
-    "question": "分析各区域城际车辆通行量与服务区开放状态的关系",
-    "sql": "SELECT sa.service_state AS 开放状态, AVG(CASE WHEN cc.car_type='城际' THEN cc.customer_count ELSE 0 END) AS 平均城际车流量 FROM bss_car_day_count cc RIGHT JOIN bss_service_area sa ON cc.service_area_id = sa.id WHERE sa.delete_ts IS NULL GROUP BY sa.service_state;"
-  },
-  {
-    "question": "各分公司2023年4月人均营收TOP5(按支付总额/车流量计算)",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.pay_sum)/SUM(car.customer_count) AS 人均营收 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no JOIN bss_car_day_count car ON sa.id = car.service_area_id AND bd.oper_date = car.count_date WHERE bd.oper_date BETWEEN '2023-04-01' AND '2023-04-30' GROUP BY c.company_name ORDER BY 人均营收 DESC LIMIT 5;"
-  },
-  {
-    "question": "2023年Q2各分公司客单价对比分析",
-    "sql": "SELECT c.company_name AS 分公司名称, AVG(bd.pay_sum/bd.order_sum) AS 客单价 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE bd.oper_date BETWEEN '2023-04-01' AND '2023-06-30' GROUP BY c.company_name ORDER BY 客单价 DESC;"
-  },
-  {
-    "question": "最近一周订单密度(订单数/面积)最低的3个分公司",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.order_sum)/COUNT(DISTINCT sa.id) AS 订单密度 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE bd.oper_date >= CURRENT_DATE - 7 GROUP BY c.company_name ORDER BY 订单密度 ASC LIMIT 3;"
-  },
-  {
-    "question": "各分公司2023年节假日营收总额环比分析",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 1 THEN bd.pay_sum ELSE 0 END) AS 一月营收, SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 2 THEN bd.pay_sum ELSE 0 END) AS 二月营收 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE EXTRACT(YEAR FROM bd.oper_date) = 2023 GROUP BY c.company_name;"
-  },
-  {
-    "question": "2023-04-01当日各分公司运营指标对比(支付总额、订单数、车流量)",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.pay_sum) AS 支付总额, SUM(bd.order_sum) AS 订单总数, SUM(car.customer_count) AS 车流量 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no JOIN bss_car_day_count car ON sa.id = car.service_area_id WHERE bd.oper_date = '2023-04-01' GROUP BY c.company_name ORDER BY 支付总额 DESC;"
-  },
-  {
-    "question": "各分公司微信支付占比分析(近30天)",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.wx) / SUM(bd.pay_sum) * 100 AS 微信占比百分比 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE bd.oper_date >= CURRENT_DATE - 30 GROUP BY c.company_name ORDER BY 微信占比百分比 DESC;"
-  },
-  {
-    "question": "各分公司服务区数量与营收能力关联分析",
-    "sql": "SELECT c.company_name AS 分公司名称, COUNT(sa.id) AS 服务区数量, SUM(bd.pay_sum) AS 总营收 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no GROUP BY c.company_name ORDER BY 服务区数量 DESC, 总营收 DESC;"
-  },
-  {
-    "question": "2023年各分公司月均订单密度趋势分析",
-    "sql": "SELECT c.company_name AS 分公司名称, EXTRACT(MONTH FROM bd.oper_date) AS 月份, AVG(bd.order_sum) AS 月均订单密度 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE EXTRACT(YEAR FROM bd.oper_date) = 2023 GROUP BY c.company_name, 月份 ORDER BY 分公司名称, 月份;"
-  },
-  {
-    "question": "各分公司不同支付方式订单数占比分析",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(bd.wx_order)/SUM(bd.order_sum)*100 AS 微信占比, SUM(bd.zf_order)/SUM(bd.order_sum)*100 AS 支付宝占比 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no GROUP BY c.company_name ORDER BY 微信占比 DESC;"
-  },
-  {
-    "question": "2023年Q2各分公司营收增长率分析",
-    "sql": "SELECT c.company_name AS 分公司名称, SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 4 THEN bd.pay_sum ELSE 0 END) / SUM(CASE WHEN EXTRACT(MONTH FROM bd.oper_date) = 5 THEN bd.pay_sum ELSE 0 END) - 1 AS 月增长率 FROM bss_company c JOIN bss_service_area sa ON c.id = sa.company_id JOIN bss_business_day_data bd ON sa.service_area_no = bd.service_no WHERE EXTRACT(QUARTER FROM bd.oper_date) = 2 GROUP BY c.company_name ORDER BY 月增长率 DESC;"
-  },
-  {
-    "question": "统计各路线关联的服务区数量及平均车流量,按服务区数量降序排列",
-    "sql": "SELECT r.route_name AS 路线名称, COUNT(l.service_area_id) AS 服务区数量, AVG(c.customer_count) AS 平均车流量 FROM bss_section_route r LEFT JOIN bss_section_route_area_link l ON r.id = l.section_route_id LEFT JOIN bss_car_day_count c ON l.service_area_id = c.service_area_id WHERE r.delete_ts IS NULL GROUP BY r.route_name ORDER BY 服务区数量 DESC;"
-  },
-  {
-    "question": "计算2023年Q2各路段日均车流量,筛选出日均车流量>1000的路段",
-    "sql": "SELECT s.section_name AS 路段名称, COUNT(*) AS 天数, AVG(c.customer_count) AS 日均车流量 FROM bss_section_route s JOIN bss_section_route_area_link l ON s.id = l.section_route_id JOIN bss_car_day_count c ON l.service_area_id = c.service_area_id WHERE c.count_date BETWEEN '2023-04-01' AND '2023-06-30' AND s.delete_ts IS NULL GROUP BY s.section_name HAVING AVG(c.customer_count) > 1000;"
-  },
-  {
-    "question": "查询2023年车流量TOP5服务区及对应路线信息",
-    "sql": "SELECT a.service_area_name AS 服务区名称, r.route_name AS 路线名称, SUM(c.customer_count) AS 总车流量 FROM bss_service_area a JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_section_route r ON l.section_route_id = r.id JOIN bss_car_day_count c ON a.id = c.service_area_id WHERE c.count_date BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY a.service_area_name, r.route_name ORDER BY 总车流量 DESC LIMIT 5;"
-  },
-  {
-    "question": "分析各路线服务区营收贡献占比,按微信支付金额排序",
-    "sql": "SELECT r.route_name AS 路线名称, SUM(b.wx) AS 微信支付总额, SUM(b.pay_sum) AS 总营收, ROUND((SUM(b.wx)/SUM(b.pay_sum))*100, 2) AS 微信占比 FROM bss_section_route r JOIN bss_section_route_area_link l ON r.id = l.section_route_id JOIN bss_business_day_data b ON l.service_area_id = b.service_area_id WHERE b.oper_date BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY r.route_name ORDER BY 微信支付总额 DESC;"
-  },
-  {
-    "question": "对比不同车辆类型在各路线的分布比例",
-    "sql": "SELECT r.route_name AS 路线名称, c.car_type AS 车辆类型, COUNT(*) AS 记录数, ROUND((COUNT(*)/(SELECT COUNT(*) FROM bss_car_day_count WHERE service_area_id IN (SELECT service_area_id FROM bss_section_route_area_link WHERE section_route_id = r.id))) * 100)::numeric(5,2) AS 占比百分比 FROM bss_car_day_count c JOIN bss_section_route_area_link l ON c.service_area_id = l.service_area_id JOIN bss_section_route r ON l.section_route_id = r.id GROUP BY r.route_name, c.car_type;"
-  },
-  {
-    "question": "统计未关联服务区的路段清单及创建时间",
-    "sql": "SELECT r.section_name AS 路段名称, r.create_ts AS 创建时间 FROM bss_section_route r LEFT JOIN bss_section_route_area_link l ON r.id = l.section_route_id WHERE l.service_area_id IS NULL AND r.delete_ts IS NULL;"
-  },
-  {
-    "question": "分析春运期间(2023-01-07至2023-02-16)各路线车流变化趋势",
-    "sql": "SELECT r.route_name AS 路线名称, c.count_date AS 日期, SUM(c.customer_count) AS 总车流量 FROM bss_section_route r JOIN bss_section_route_area_link l ON r.id = l.section_route_id JOIN bss_car_day_count c ON l.service_area_id = c.service_area_id WHERE c.count_date BETWEEN '2023-01-07' AND '2023-02-16' GROUP BY r.route_name, c.count_date ORDER BY 日期;"
-  },
-  {
-    "question": "计算各服务区车流覆盖率(关联路段车流/总车流)TOP10",
-    "sql": "SELECT a.service_area_name AS 服务区名称, SUM(c.customer_count) AS 关联车流, (SELECT SUM(customer_count) FROM bss_car_day_count WHERE service_area_id = a.id) AS 总车流, ROUND((SUM(c.customer_count)/(SELECT SUM(customer_count) FROM bss_car_day_count WHERE service_area_id = a.id)) * 100)::numeric(5,2) AS 覆盖率 FROM bss_service_area a JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_car_day_count c ON a.id = c.service_area_id GROUP BY a.service_area_name ORDER BY 覆盖率 DESC LIMIT 10;"
-  },
-  {
-    "question": "查询节假日(2023-10-01至2023-10-07)营收贡献最高的TOP3服务区及对应路线",
-    "sql": "SELECT a.service_area_name AS 服务区名称, r.route_name AS 路线名称, SUM(b.pay_sum) AS 总营收 FROM bss_service_area a JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_section_route r ON l.section_route_id = r.id JOIN bss_business_day_data b ON a.id = b.service_area_id WHERE b.oper_date BETWEEN '2023-10-01' AND '2023-10-07' GROUP BY a.service_area_name, r.route_name ORDER BY 总营收 DESC LIMIT 3;"
-  },
-  {
-    "question": "分析不同分公司管辖路段的服务区密度(服务区数/路段长度)",
-    "sql": "SELECT c.company_name AS 分公司名称, COUNT(a.id) AS 服务区数量, SUM(LENGTH(s.code)) AS 路段总长度, ROUND((COUNT(a.id)/SUM(LENGTH(s.code))) * 1000)::numeric(5,2) AS 密度_每千米 FROM bss_company c JOIN bss_service_area a ON c.id = a.company_id JOIN bss_section_route_area_link l ON a.id = l.service_area_id JOIN bss_section_route s ON l.section_route_id = s.id GROUP BY c.company_name;"
-  },
-  {
-    "question": "分析2023年国庆节期间各服务区营收总额及环比增长率",
-    "sql": "WITH holiday_revenue AS (SELECT service_name, SUM(pay_sum) AS holiday_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_name), pre_holiday_revenue AS (SELECT service_name, SUM(pay_sum) AS pre_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-09-24' AND '2023-09-30' AND delete_ts IS NULL GROUP BY service_name) SELECT h.service_name, h.holiday_amount, ROUND((h.holiday_amount - p.pre_amount)/p.pre_amount*100, 2) AS growth_rate FROM holiday_revenue h JOIN pre_holiday_revenue p ON h.service_name = p.service_name ORDER BY growth_rate DESC;"
-  },
-  {
-    "question": "统计2023年春节期间各服务区节假日营收占Q1季度总营收比例",
-    "sql": "WITH q1_revenue AS (SELECT service_name, SUM(pay_sum) AS q1_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-01-01' AND '2023-03-31' AND delete_ts IS NULL GROUP BY service_name), lunar_revenue AS (SELECT service_name, SUM(pay_sum) AS lunar_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-01-20' AND '2023-01-27' AND delete_ts IS NULL GROUP BY service_name) SELECT q.service_name, ROUND(l.lunar_amount/q.q1_amount*100, 2) AS ratio FROM q1_revenue q JOIN lunar_revenue l ON q.service_name = l.service_name ORDER BY ratio DESC;"
-  },
-  {
-    "question": "对比2023年国庆节期间不同支付方式金额占比",
-    "sql": "SELECT '微信' AS pay_type, ROUND(SUM(wx)/SUM(pay_sum)*100, 2) AS ratio FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '支付宝', ROUND(SUM(zfb)/SUM(pay_sum)*100, 2) FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '现金', ROUND(SUM(rmb)/SUM(pay_sum)*100, 2) FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "分析节假日与非节假日各服务区日均车流量增长率",
-    "sql": "WITH holiday_avg AS (SELECT service_area_id, AVG(customer_count) AS holiday_avg FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_area_id), non_holiday_avg AS (SELECT service_area_id, AVG(customer_count) AS non_holiday_avg FROM bss_car_day_count WHERE count_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_area_id) SELECT h.service_area_id, ROUND((h.holiday_avg - n.non_holiday_avg)/n.non_holiday_avg*100, 2) AS growth_rate FROM holiday_avg h JOIN non_holiday_avg n ON h.service_area_id = n.service_area_id ORDER BY growth_rate DESC LIMIT 10;"
-  },
-  {
-    "question": "统计节假日车流最高峰时段的车辆类型分布",
-    "sql": "SELECT car_type, SUM(customer_count) AS total_cars FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND EXTRACT(HOUR FROM create_ts) BETWEEN 8 AND 10 AND delete_ts IS NULL GROUP BY car_type ORDER BY total_cars DESC;"
-  },
-  {
-    "question": "对比2023年五一假期与清明假期营收增幅排名TOP5服务区",
-    "sql": "WITH may_revenue AS (SELECT service_name, SUM(pay_sum) AS may_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-04-29' AND '2023-05-03' AND delete_ts IS NULL GROUP BY service_name), qingming_revenue AS (SELECT service_name, SUM(pay_sum) AS qingming_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-04-05' AND '2023-04-07' AND delete_ts IS NULL GROUP BY service_name) SELECT m.service_name, ROUND((m.may_amount - q.qingming_amount)/q.qingming_amount*100, 2) AS growth_rate FROM may_revenue m JOIN qingming_revenue q ON m.service_name = q.service_name ORDER BY growth_rate DESC LIMIT 5;"
-  },
-  {
-    "question": "分析节假日现金支付比例变化趋势",
-    "sql": "SELECT oper_date, ROUND(SUM(rmb)/SUM(pay_sum)*100, 2) AS cash_ratio FROM bss_business_day_data WHERE oper_date BETWEEN '2023-09-24' AND '2023-10-07' AND delete_ts IS NULL GROUP BY oper_date ORDER BY oper_date;"
-  },
-  {
-    "question": "统计危化品车辆节假日期间通行量同比增幅",
-    "sql": "WITH holiday_2022 AS (SELECT COUNT(*) AS cnt_2022 FROM bss_car_day_count WHERE count_date BETWEEN '2022-10-01' AND '2022-10-07' AND car_type = '危化品' AND delete_ts IS NULL), holiday_2023 AS (SELECT COUNT(*) AS cnt_2023 FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND car_type = '危化品' AND delete_ts IS NULL) SELECT ROUND((cnt_2023 - cnt_2022)/cnt_2022*100, 2) AS growth_rate FROM holiday_2022, holiday_2023;"
-  },
-  {
-    "question": "查询2023年国庆节期间营收增幅超过50%的服务区清单",
-    "sql": "WITH pre_data AS (SELECT service_name, SUM(pay_sum) AS pre_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-09-24' AND '2023-09-30' AND delete_ts IS NULL GROUP BY service_name), holiday_data AS (SELECT service_name, SUM(pay_sum) AS holiday_amount FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_name) SELECT h.service_name, ROUND((h.holiday_amount - p.pre_amount)/p.pre_amount*100, 2) AS growth_rate FROM holiday_data h JOIN pre_data p ON h.service_name = p.service_name WHERE (h.holiday_amount - p.pre_amount)/p.pre_amount > 0.5 ORDER BY growth_rate DESC;"
-  },
-  {
-    "question": "分析节假日期间城际车辆流量与服务区地理位置的关系",
-    "sql": "SELECT s.service_area_name, s.service_position, AVG(c.customer_count) AS avg_traffic FROM bss_car_day_count c JOIN bss_service_area s ON c.service_area_id = s.id WHERE c.car_type = '城际' AND c.count_date BETWEEN '2023-10-01' AND '2023-10-07' AND c.delete_ts IS NULL GROUP BY s.service_area_name, s.service_position ORDER BY avg_traffic DESC;"
-  }
-]
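For reference, the deleted *_pair.json files in this commit are plain JSON arrays of objects with exactly two keys, "question" and "sql". A minimal illustrative Python sketch for loading one such file (the path and helper name are hypothetical, not part of the repository):

import json

def load_pairs(path: str) -> list[dict]:
    # Each entry holds a Chinese natural-language question and the SQL generated for it.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

pairs = load_pairs("qs_pairs.json")  # hypothetical local copy of a *_pair.json file
print(len(pairs), pairs[0]["question"])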

+ 0 - 14
data_pipeline/training_data/task_20250701_131627/task_config.json

@@ -1,14 +0,0 @@
-{
-  "task_id": "task_20250701_131627",
-  "created_at": "2025-07-01T05:16:27.671265",
-  "parameters": {
-    "db_connection": "postgresql://postgres:postgres@192.168.67.1:6432/highway_db",
-    "table_list_file": "data_pipeline/tables.txt",
-    "business_context": "高速公路服务区管理系统",
-    "enable_llm_repair": true,
-    "modify_original_file": true,
-    "enable_sql_validation": true,
-    "enable_training_data_load": true
-  },
-  "output_directory": "data_pipeline\\training_data\\task_20250701_131627"
-}
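The escaped backslashes in "output_directory" decode to single Windows-style path separators. A small illustrative sketch, assuming Python and a local copy of the file, for reading such a config portably (not part of the repository):

import json
from pathlib import PureWindowsPath

with open("task_config.json", encoding="utf-8") as f:  # hypothetical local copy
    cfg = json.load(f)

# "data_pipeline\\training_data\\task_20250701_131627" decodes to a Windows path;
# PureWindowsPath splits it into its parts on any platform.
out_dir = PureWindowsPath(cfg["output_directory"])
print(cfg["task_id"], out_dir.parts)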

+ 0 - 88
data_pipeline/training_data/task_20250701_131627/task_result.json

@@ -1,88 +0,0 @@
-{
-  "success": true,
-  "workflow_summary": {
-    "total_duration": 1283.84,
-    "completed_steps": [
-      "ddl_md_generation",
-      "question_sql_generation",
-      "sql_validation",
-      "training_data_load"
-    ],
-    "failed_steps": [],
-    "total_steps": 4,
-    "workflow_started": "2025-07-01T13:30:53.267230",
-    "workflow_completed": "2025-07-01T13:52:17.112211"
-  },
-  "input_parameters": {
-    "db_connection": "postgresql://postgres:***@192.168.67.1:6432/highway_db",
-    "table_list_file": "data_pipeline/tables.txt",
-    "business_context": "高速公路服务区管理系统",
-    "db_name": "highway_db",
-    "output_directory": "data_pipeline\\training_data\\task_20250701_131627",
-    "enable_sql_validation": true,
-    "enable_llm_repair": true,
-    "modify_original_file": true,
-    "enable_training_data_load": true
-  },
-  "processing_results": {
-    "ddl_md_generation": {
-      "total_tables": 7,
-      "processed_successfully": 7,
-      "failed": 0,
-      "files_generated": 14,
-      "duration": 422.30856490135193
-    },
-    "question_sql_generation": {
-      "output_file": "data_pipeline\\training_data\\task_20250701_131627\\qs_highway_db_20250701_134736_pair.json",
-      "total_questions": 50,
-      "total_themes": 5,
-      "successful_themes": 5,
-      "failed_themes": [],
-      "duration": 607.0530173778534
-    },
-    "sql_validation": {
-      "original_sql_count": 50,
-      "valid_sql_count": 47,
-      "invalid_sql_count": 3,
-      "success_rate": 0.94,
-      "repair_stats": {
-        "attempted": 4,
-        "successful": 1,
-        "failed": 3
-      },
-      "file_modification_stats": {
-        "modified": 1,
-        "deleted": 3,
-        "failed_modifications": 0
-      },
-      "average_execution_time": 0.02947342872619629,
-      "total_retries": 0,
-      "duration": 236.6604528427124
-    },
-    "training_data_load": {
-      "training_data_dir": "data_pipeline\\training_data\\task_20250701_131627",
-      "load_successful": true,
-      "total_records": 288,
-      "data_type_counts": {
-        "sql": 254,
-        "documentation": 17,
-        "ddl": 16,
-        "error_sql": 1
-      },
-      "duration": 17.167370080947876
-    }
-  },
-  "final_outputs": {
-    "primary_output_file": "data_pipeline\\training_data\\task_20250701_131627\\qs_highway_db_20250701_134736_pair.json",
-    "output_directory": "data_pipeline\\training_data\\task_20250701_131627",
-    "final_question_count": 47,
-    "backup_files_created": true
-  },
-  "performance_metrics": {
-    "step1_duration": 422.31,
-    "step2_duration": 607.05,
-    "step3_duration": 236.66,
-    "step4_duration": 17.17,
-    "total_duration": 1283.84
-  }
-}
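As a quick consistency check of the figures in this task_result.json (the numbers are copied from the deleted file; the check itself is illustrative only):

# sql_validation: 47 of the 50 generated pairs passed, matching success_rate = 0.94
valid, original = 47, 50
assert round(valid / original, 2) == 0.94

# The four step durations roughly account for the reported total of 1283.84 s;
# the small remainder is presumably workflow overhead between steps.
steps = [422.31, 607.05, 236.66, 17.17]
print(sum(steps))  # 1283.19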

+ 0 - 17
data_pipeline/training_data/task_20250701_175640/bss_car_day_count.ddl

@@ -1,17 +0,0 @@
--- 中文名: 服务区车辆日统计表
--- 描述: 服务区车辆日统计表,按车型统计每日车辆数量及类型,用于交通流量分析与资源调度。
-create table public.bss_car_day_count (
-  id varchar(32) not null     -- 主键ID,主键,
-  version integer not null    -- 版本号,
-  create_ts timestamp         -- 创建时间,
-  created_by varchar(50)      -- 创建人ID,
-  update_ts timestamp         -- 更新时间,
-  updated_by varchar(50)      -- 更新人ID,
-  delete_ts timestamp         -- 删除时间,
-  deleted_by varchar(50)      -- 删除人ID,
-  customer_count bigint       -- 车辆数量,
-  car_type varchar(100)       -- 车辆类别,
-  count_date date             -- 统计日期,
-  service_area_id varchar(32) -- 服务区ID,
-  primary key (id)
-);

+ 0 - 18
data_pipeline/training_data/task_20250701_175640/bss_car_day_count_detail.md

@@ -1,18 +0,0 @@
-## bss_car_day_count(服务区车辆日统计表)
-bss_car_day_count 表服务区车辆日统计表,按车型统计每日车辆数量及类型,用于交通流量分析与资源调度。
-字段列表:
-- id (varchar(32)) - 主键ID [主键, 非空] [示例: 00022c1c99ff11ec86d4fa163ec0f8fc, 00022caa99ff11ec86d4fa163ec0f8fc]
-- version (integer) - 版本号 [非空] [示例: 1]
-- create_ts (timestamp) - 创建时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
-- created_by (varchar(50)) - 创建人ID
-- update_ts (timestamp) - 更新时间 [示例: 2022-03-02 16:01:43, 2022-02-02 14:18:55]
-- updated_by (varchar(50)) - 更新人ID
-- delete_ts (timestamp) - 删除时间
-- deleted_by (varchar(50)) - 删除人ID
-- customer_count (bigint) - 车辆数量 [示例: 1114, 295]
-- car_type (varchar(100)) - 车辆类别 [示例: 其他]
-- count_date (date) - 统计日期 [示例: 2022-03-02, 2022-02-02]
-- service_area_id (varchar(32)) - 服务区ID [示例: 17461166e7fa3ecda03534a5795ce985, 81f4eb731fb0728aef17ae61f1f1daef]
-字段补充说明:
-- id 为主键
-- car_type 为枚举字段,包含取值:其他、危化品、城际、过境

+ 0 - 14
data_pipeline/training_data/task_20250701_175640/task_config.json

@@ -1,14 +0,0 @@
-{
-  "task_id": "task_20250701_175640",
-  "created_at": "2025-07-01T09:56:40.836065",
-  "parameters": {
-    "db_connection": "postgresql://postgres:postgres@192.168.67.1:6432/highway_db",
-    "table_list_file": "./data_pipeline/tables.txt",
-    "business_context": "高速公路服务区管理系统测试",
-    "enable_llm_repair": true,
-    "modify_original_file": true,
-    "enable_sql_validation": true,
-    "enable_training_data_load": true
-  },
-  "output_directory": "data_pipeline\\training_data\\task_20250701_175640"
-}

+ 0 - 14
data_pipeline/training_data/task_20250701_180014/task_config.json

@@ -1,14 +0,0 @@
-{
-  "task_id": "task_20250701_180014",
-  "created_at": "2025-07-01T10:00:14.816750",
-  "parameters": {
-    "db_connection": "postgresql://postgres:postgres@192.168.67.1:6432/highway_db",
-    "table_list_file": "data_pipeline/tables.txt",
-    "business_context": "高速公路服务区管理系统",
-    "enable_llm_repair": true,
-    "modify_original_file": true,
-    "enable_sql_validation": true,
-    "enable_training_data_load": true
-  },
-  "output_directory": "data_pipeline\\training_data\\task_20250701_180014"
-}

+ 0 - 38
data_pipeline/training_data/task_20250701_184430/db_query_decision_prompt.txt

@@ -1,38 +0,0 @@
-{
-  "数据库业务范围": "当前数据库存储的是高速公路服务区运营管理与车辆流量分析的相关数据,主要涉及运营交易数据与车辆通行数据,包含以下业务数据:",
-  "核心业务实体": [
-    {
-      "实体类型": "服务区",
-      "详细描述": "高速公路沿线提供停车休憩的场所,记录其每日运营数据与车辆流量统计",
-      "主要字段": "oper_date, service_no, service_name, service_area_id"
-    },
-    {
-      "实体类型": "档口",
-      "详细描述": "服务区内的商业经营单元,记录其每日交易明细",
-      "主要字段": "branch_no, branch_name"
-    },
-    {
-      "实体类型": "车辆类型",
-      "详细描述": "按车辆属性分类的通行记录,用于分析交通流量结构",
-      "主要字段": "car_type"
-    }
-  ],
-  "关键业务指标": [
-    {
-      "指标类型": "支付金额与订单数量",
-      "详细描述": "按支付渠道(微信/支付宝/现金/行吧/金豆)统计的交易金额与订单数,反映消费行为分布"
-    },
-    {
-      "指标类型": "车辆流量统计",
-      "详细描述": "按车辆类型分类的通行量统计,用于分析交通流量结构与高峰时段特征"
-    },
-    {
-      "指标类型": "运营总指标",
-      "详细描述": "订单总数与支付总额的时序变化,反映服务区整体运营态势"
-    },
-    {
-      "指标类型": "数据来源分布",
-      "详细描述": "通过source_type字段分析数据采集渠道的覆盖情况与可靠性"
-    }
-  ]
-}

+ 0 - 5
data_pipeline/training_data/task_20250701_184430/filename_mapping.txt

@@ -1,5 +0,0 @@
-# 文件名映射报告
-# 格式: 原始表名 -> 实际文件名
-
-public.bss_business_day_data -> bss_business_day_data_detail.md
-public.bss_car_day_count -> bss_car_day_count_detail.md

+ 0 - 62
data_pipeline/training_data/task_20250701_184430/metadata.txt

@@ -1,62 +0,0 @@
--- Schema Tools生成的主题元数据
--- 业务背景: 高速公路服务区管理系统
--- 生成时间: 2025-07-01 18:58:22
--- 数据库: highway_db
-
--- 创建表(如果不存在)
-CREATE TABLE IF NOT EXISTS metadata (
-    id SERIAL PRIMARY KEY,    -- 主键
-    topic_name VARCHAR(100) NOT NULL,  -- 业务主题名称
-    description TEXT,                  -- 业务主体说明
-    related_tables TEXT[],			  -- 相关表名
-    biz_entities TEXT[],               -- 主要业务实体名称
-    biz_metrics TEXT[],                -- 主要业务指标名称
-    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP    -- 插入时间
-);
-
--- 插入主题数据
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '日营收分析',
-  '基于bss_business_day_data表分析各服务区每日营收结构及支付方式占比',
-  'bss_business_day_data',
-  '服务区,档口,支付方式,统计日期',
-  '收入分布,订单构成,支付方式渗透率'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '车流统计分析',
-  '通过bss_car_day_count表分析不同车辆类型在各服务区的流量分布特征',
-  'bss_car_day_count',
-  '服务区,车辆类型,统计日期',
-  '车流趋势,车型占比,高峰时段流量'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '档口效能评估',
-  '结合两个表数据评估不同档口的单位车流营收产出及运营效率差异',
-  'bss_business_day_data,bss_car_day_count',
-  '档口,服务区,运营日期',
-  '坪效分析,客单价对比,时段效率曲线'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '节假日效应分析',
-  '对比法定节假日与平日的车流变化及消费行为差异,支撑资源调度决策',
-  'bss_business_day_data,bss_car_day_count',
-  '服务区,节假日类型,支付方式',
-  '节前节后对比,消费金额波动,车流峰值分析'
-);
-
-INSERT INTO metadata(topic_name, description, related_tables, biz_entities, biz_metrics) VALUES
-(
-  '区域对标分析',
-  '按地理区域划分统计各服务区营收能力和车流规模的topN排名对比',
-  'bss_business_day_data,bss_car_day_count',
-  '区域,服务区等级,运营指标',
-  '营收排名,车流密度,运营健康度评分'
-);
-
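The script above both creates the metadata table and inserts the five business topics. An illustrative Python sketch for reading the topics back, assuming psycopg2 and the connection string shown in the deleted task_config.json files (not part of the repository):

import psycopg2

# Connection string as it appears in the task_config.json files of this commit.
conn = psycopg2.connect("postgresql://postgres:postgres@192.168.67.1:6432/highway_db")
with conn.cursor() as cur:
    cur.execute("SELECT topic_name, related_tables, biz_metrics FROM metadata ORDER BY id;")
    for topic_name, related_tables, biz_metrics in cur.fetchall():
        print(topic_name, related_tables, biz_metrics)
conn.close()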

+ 0 - 198
data_pipeline/training_data/task_20250701_184430/qs_highway_db_20250701_185822_pair.json

@@ -1,198 +0,0 @@
-[
-  {
-    "question": "统计最近7天各服务区的总收入和总订单数,并按收入从高到低排序",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总收入, SUM(order_sum) AS 总订单数 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - 7 GROUP BY service_name ORDER BY 总收入 DESC;"
-  },
-  {
-    "question": "计算各服务区不同支付方式的订单占比(微信/支付宝/现金),展示前五名",
-    "sql": "SELECT service_name AS 服务区名称, ROUND(SUM(wx_order)*100.0/SUM(order_sum),2) AS 微信占比, ROUND(SUM(zf_order)*100.0/SUM(order_sum),2) AS 支付宝占比, ROUND(SUM(rmb_order)*100.0/SUM(order_sum),2) AS 现金占比 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY SUM(order_sum) DESC LIMIT 5;"
-  },
-  {
-    "question": "分析2023年4月每日总营收变化趋势",
-    "sql": "SELECT oper_date AS 统计日期, SUM(pay_sum) AS 日营收总额 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(YEAR FROM oper_date) = 2023 AND EXTRACT(MONTH FROM oper_date) = 4 GROUP BY oper_date ORDER BY oper_date;"
-  },
-  {
-    "question": "查询最近一天营收超过5万元的服务区及对应支付方式渗透率",
-    "sql": "SELECT service_name AS 服务区名称, wx_order AS 微信订单数, zf_order AS 支付宝订单数, rmb_order AS 现金订单数, pay_sum AS 日营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = (SELECT MAX(oper_date) FROM bss_business_day_data) AND pay_sum > 50000 ORDER BY pay_sum DESC;"
-  },
-  {
-    "question": "统计各档口平均客单价(日均)并排名",
-    "sql": "SELECT branch_name AS 档口名称, ROUND(AVG(pay_sum/order_sum),2) AS 平均客单价 FROM bss_business_day_data WHERE delete_ts IS NULL AND order_sum > 0 GROUP BY branch_name ORDER BY 平均客单价 DESC;"
-  },
-  {
-    "question": "对比不同服务区现金支付占比的分布情况",
-    "sql": "SELECT service_name AS 服务区名称, ROUND(SUM(rmb) * 100.0 / SUM(pay_sum), 2) AS 现金占比 FROM bss_business_day_data WHERE delete_ts IS NULL AND pay_sum > 0 GROUP BY service_name ORDER BY 现金占比 DESC;"
-  },
-  {
-    "question": "查询指定日期(2023-04-01)微信支付金额TOP5的服务区明细",
-    "sql": "SELECT service_name AS 服务区名称, wx AS 微信支付金额, wx_order AS 微信订单数, pay_sum AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' ORDER BY wx DESC LIMIT 5;"
-  },
-  {
-    "question": "分析各服务区支付宝订单占比与总营收的关系",
-    "sql": "SELECT service_name AS 服务区名称, ROUND(SUM(zf_order)*100.0/SUM(order_sum),2) AS 支付宝订单占比, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 支付宝订单占比 DESC;"
-  },
-  {
-    "question": "统计各服务区不同支付方式的订单数量分布",
-    "sql": "SELECT service_name AS 服务区名称, SUM(wx_order) AS 微信订单数, SUM(zf_order) AS 支付宝订单数, SUM(rmb_order) AS 现金订单数 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY SUM(wx_order + zf_order + rmb_order) DESC;"
-  },
-  {
-    "question": "查询最近3天庐山服务区每日营收及支付方式构成",
-    "sql": "SELECT oper_date AS 统计日期, wx AS 微信金额, zfb AS 支付宝金额, rmb AS 现金金额, pay_sum AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND service_name = '庐山服务区' AND oper_date >= CURRENT_DATE - 3 ORDER BY oper_date DESC;"
-  },
-  {
-    "question": "不同车辆类型的总车流量统计情况如何?",
-    "sql": "SELECT car_type AS 车辆类型, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type;"
-  },
-  {
-    "question": "哪些服务区的累计车流量位列前十?",
-    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 10;"
-  },
-  {
-    "question": "2022年3月2日各车型在服务区的流量分布是怎样的?",
-    "sql": "SELECT car_type AS 车辆类型, SUM(customer_count) AS 当日车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date = '2022-03-02' GROUP BY car_type;"
-  },
-  {
-    "question": "每个服务区每月平均车流量是多少?",
-    "sql": "SELECT service_area_id AS 服务区ID, DATE_TRUNC('month', count_date) AS 月份, AVG(daily_total) AS 月均车流量 FROM (SELECT service_area_id, count_date, SUM(customer_count) AS daily_total FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id, count_date) AS daily_counts GROUP BY service_area_id, 月份;"
-  },
-  {
-    "question": "最近一个月内,各服务区的日均车流量对比如何?",
-    "sql": "SELECT service_area_id AS 服务区ID, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_area_id ORDER BY 日均车流量 DESC;"
-  },
-  {
-    "question": "车流量最高的五个服务区是哪些?",
-    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 5;"
-  },
-  {
-    "question": "各车型在不同服务区的车流量分布情况如何?",
-    "sql": "SELECT car_type AS 车辆类型, service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type, service_area_id;"
-  },
-  {
-    "question": "某服务区(如service_area_id='17461166e7fa3ecda03534a5795ce985')各车型的日均车流量是多少?",
-    "sql": "SELECT car_type AS 车辆类型, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND service_area_id = '17461166e7fa3ecda03534a5795ce985' GROUP BY car_type;"
-  },
-  {
-    "question": "2022年1月至3月期间,总车流量的月度变化趋势是怎样的?",
-    "sql": "SELECT DATE_TRUNC('month', count_date) AS 月份, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date BETWEEN '2022-01-01' AND '2022-03-31' GROUP BY 月份 ORDER BY 月份;"
-  },
-  {
-    "question": "某服务区(如ID为'81f4eb731fb0728aef17ae61f1f1daef')中,哪种车型的累计车流量最多?",
-    "sql": "SELECT car_type AS 车辆类型, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND service_area_id = '81f4eb731fb0728aef17ae61f1f1daef' GROUP BY car_type ORDER BY 总车流量 DESC LIMIT 1;"
-  },
-  {
-    "question": "统计各档口单位车流营收产出(坪效)并按从高到低排序",
-    "sql": "SELECT b.branch_name AS 档口名称, SUM(b.pay_sum) / SUM(c.customer_count) AS 单位车流营收 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id AND b.oper_date = c.count_date WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY b.branch_name ORDER BY 单位车流营收 DESC;"
-  },
-  {
-    "question": "对比不同服务区客单价(支付金额/订单数)排名",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) / SUM(order_sum) AS 客单价 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 客单价 DESC LIMIT 10;"
-  },
-  {
-    "question": "查询最近7天车流最高的服务区对应坪效TOP5",
-    "sql": "SELECT s.service_name, SUM(s.pay_sum) / MAX(c.customer_count) AS 坪效 FROM (SELECT service_name, service_no, SUM(pay_sum) AS pay_sum FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY service_name, service_no) s JOIN (SELECT service_area_id, SUM(customer_count) AS customer_count FROM bss_car_day_count WHERE count_date >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY service_area_id) c ON s.service_no = c.service_area_id GROUP BY s.service_name ORDER BY 坪效 DESC LIMIT 5;"
-  },
-  {
-    "question": "分析各档口月度坪效趋势(2023年4月数据)",
-    "sql": "SELECT TO_CHAR(b.oper_date, 'YYYY-MM') AS 月份, b.branch_name, SUM(b.pay_sum) / SUM(c.customer_count) AS 坪效 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY 月份, b.branch_name ORDER BY 月份, 坪效 DESC;"
-  },
-  {
-    "question": "查询城际车辆占比超50%的服务区坪效对比",
-    "sql": "WITH car_ratio AS (SELECT service_area_id, SUM(CASE WHEN car_type = '城际' THEN customer_count ELSE 0 END) * 1.0 / SUM(customer_count) AS 城际占比 FROM bss_car_day_count GROUP BY service_area_id) SELECT b.service_name, SUM(b.pay_sum) / SUM(c.customer_count) AS 坪效 FROM bss_business_day_data b JOIN car_ratio r ON b.service_no = r.service_area_id JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE r.城际占比 > 0.5 AND b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY b.service_name ORDER BY 坪效 DESC;"
-  },
-  {
-    "question": "找出客单价最低的五个档口(客单价=金额/订单数)",
-    "sql": "SELECT branch_name, pay_sum / order_sum AS 客单价 FROM (SELECT branch_name, SUM(pay_sum) AS pay_sum, SUM(order_sum) AS order_sum FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY branch_name) t WHERE order_sum > 0 ORDER BY 客单价 ASC LIMIT 5;"
-  },
-  {
-    "question": "分析2023年Q2季度各服务区日均车流与营收关系",
-    "sql": "SELECT b.service_name, AVG(c.customer_count) AS 日均车流, AVG(b.pay_sum) AS 日均营收 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-06-30' GROUP BY b.service_name ORDER BY 日均车流 DESC;"
-  },
-  {
-    "question": "查询宜春服务区各档口微信支付占比TOP3",
-    "sql": "SELECT branch_name, SUM(wx) * 100.0 / SUM(pay_sum) AS 微信支付占比 FROM bss_business_day_data WHERE service_name = '宜春服务区' AND delete_ts IS NULL GROUP BY branch_name ORDER BY 微信支付占比 DESC LIMIT 3;"
-  },
-  {
-    "question": "统计各服务区坪效及车流排名差异(坪效排名与车流排名差值)",
-    "sql": "WITH rank_data AS (SELECT service_name, RANK() OVER (ORDER BY SUM(pay_sum)/SUM(customer_count) DESC) AS \"坪效排名\", RANK() OVER (ORDER BY SUM(customer_count) DESC) AS \"车流排名\" FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY service_name) SELECT service_name, \"坪效排名\", \"车流排名\", ABS(\"坪效排名\" -\"车流排名\") AS \"排名差异\" FROM rank_data ORDER BY \"排名差异\" DESC;"
-  },
-  {
-    "question": "分析周末与工作日营收差异(以2023-04为例)",
-    "sql": "SELECT CASE WHEN EXTRACT(ISODOW FROM oper_date) IN (6,7) THEN '周末' ELSE '工作日' END AS 日期类型, AVG(pay_sum) AS 平均营收, AVG(customer_count) AS 平均车流 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE oper_date BETWEEN '2023-04-01' AND '2023-04-30' GROUP BY 日期类型;"
-  },
-  {
-    "question": "节假日与平日平均消费金额对比分析",
-    "sql": "SELECT '节假日' AS \"分析类型\", AVG(pay_sum) AS \"平均消费金额\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '平日', AVG(pay_sum) FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节假日与平日各类型车辆平均流量对比分析",
-    "sql": "SELECT car_type AS \"车辆类型\", AVG(CASE WHEN count_date BETWEEN '2023-10-01' AND '2023-10-07' THEN customer_count END) AS \"节假日均值\", AVG(CASE WHEN count_date NOT BETWEEN '2023-10-01' AND '2023-10-07' THEN customer_count END) AS \"平日均值\" FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type;"
-  },
-  {
-    "question": "节假日与平日不同支付方式金额占比对比",
-    "sql": "SELECT '节假日' AS \"类型\", SUM(wx)/SUM(pay_sum) AS \"微信占比\", SUM(zfb)/SUM(pay_sum) AS \"支付宝占比\", SUM(rmb)/SUM(pay_sum) AS \"现金占比\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '平日', SUM(wx)/SUM(pay_sum), SUM(zfb)/SUM(pay_sum), SUM(rmb)/SUM(pay_sum) FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节假日总订单量Top10服务区",
-    "sql": "SELECT service_name AS \"服务区名称\", SUM(order_sum) AS \"总订单量\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_name ORDER BY \"总订单量\" DESC LIMIT 10;"
-  },
-  {
-    "question": "节假日车流峰值日期识别",
-    "sql": "SELECT count_date AS \"日期\", SUM(customer_count) AS \"总车流量\" FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY count_date ORDER BY \"总车流量\" DESC LIMIT 1;"
-  },
-  {
-    "question": "平日周消费金额波动趋势分析",
-    "sql": "SELECT EXTRACT(DOW FROM oper_date) AS \"星期\", AVG(pay_sum) AS \"平均消费\" FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY EXTRACT(DOW FROM oper_date) ORDER BY \"星期\";"
-  },
-  {
-    "question": "节假日与非节假日现金支付占比差异",
-    "sql": "SELECT '节假日' AS \"类型\", SUM(rmb)/SUM(pay_sum) AS \"现金占比\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '平日', SUM(rmb)/SUM(pay_sum) FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节前节后3日车流环比增长率计算",
-    "sql": "SELECT (AVG(CASE WHEN count_date BETWEEN '2023-10-08' AND '2023-10-10' THEN customer_count END) - AVG(CASE WHEN count_date BETWEEN '2023-09-28' AND '2023-09-30' THEN customer_count END))/AVG(CASE WHEN count_date BETWEEN '2023-09-28' AND '2023-09-30' THEN customer_count END) AS \"增长率\" FROM bss_car_day_count WHERE count_date BETWEEN '2023-09-28' AND '2023-10-10' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节假日各档口消费总额Top10排名",
-    "sql": "SELECT branch_name AS \"档口名称\", SUM(pay_sum) AS \"总消费额\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY branch_name ORDER BY \"总消费额\" DESC LIMIT 10;"
-  },
-  {
-    "question": "节假日车辆类型占比分布统计",
-    "sql": "SELECT car_type AS \"车辆类型\", SUM(customer_count) AS \"总量\", ROUND(100*SUM(customer_count)/(SELECT SUM(customer_count) FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL),2) AS \"占比百分比\" FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY car_type ORDER BY \"总量\" DESC;"
-  },
-  {
-    "question": "统计最近一个月各服务区总营收排名(按支付金额降序)Top10",
-    "sql": "SELECT service_name AS 服务区, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_name ORDER BY 总营收 DESC LIMIT 10;"
-  },
-  {
-    "question": "分析最近7天各区域(按服务区划分)日均车流密度Top5",
-    "sql": "SELECT service_area_id AS 服务区ID, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date >= CURRENT_DATE - INTERVAL '7 days' GROUP BY service_area_id ORDER BY 日均车流量 DESC LIMIT 5;"
-  },
-  {
-    "question": "对比营收Top10服务区与车流Top10服务区的重合率",
-    "sql": "WITH 营收排名 AS (SELECT service_name, SUM(pay_sum) AS 金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_name ORDER BY 金额 DESC LIMIT 10), 车流排名 AS (SELECT service_area_id, SUM(customer_count) AS 车流 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_area_id ORDER BY 车流 DESC LIMIT 10) SELECT COUNT(*) FILTER (WHERE r.service_name = c.service_area_id) * 100.0 / 10 AS 重合率 FROM 营收排名 r, 车流排名 c;"
-  },
-  {
-    "question": "计算各区域(按branch_name首字分组)客单价(支付金额/订单数)Top3",
-    "sql": "SELECT SUBSTRING(branch_name FROM 1 FOR 1) AS 区域, service_name AS 服务区, AVG(pay_sum / order_sum) AS 客单价 FROM bss_business_day_data WHERE delete_ts IS NULL AND order_sum > 0 AND oper_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY SUBSTRING(branch_name FROM 1 FOR 1), service_name ORDER BY 区域, 客单价 DESC LIMIT 3;"
-  },
-  {
-    "question": "查询2023年Q2季度各服务区运营健康度评分(支付金额环比增长率)",
-    "sql": "SELECT service_name AS 服务区, (SUM(CASE WHEN EXTRACT(QUARTER FROM oper_date)=2 THEN pay_sum ELSE 0 END) - SUM(CASE WHEN EXTRACT(QUARTER FROM oper_date)=1 THEN pay_sum ELSE 0 END)) / NULLIF(SUM(CASE WHEN EXTRACT(QUARTER FROM oper_date)=1 THEN pay_sum ELSE 0 END), 0) AS 增长率 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(YEAR FROM oper_date)=2023 GROUP BY service_name ORDER BY 增长率 DESC;"
-  },
-  {
-    "question": "统计周末与工作日车流量差异最大的Top5服务区",
-    "sql": "SELECT service_area_id AS 服务区ID, AVG(CASE WHEN EXTRACT(ISODOW FROM count_date) IN (6,7) THEN customer_count ELSE 0 END) - AVG(CASE WHEN EXTRACT(ISODOW FROM count_date) NOT IN (6,7) THEN customer_count ELSE 0 END) AS 差异值 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 差异值 DESC LIMIT 5;"
-  },
-  {
-    "question": "查询2023年节假日(五一假期)期间营收异常波动(超3倍均值)的服务区",
-    "sql": "WITH 日均基准 AS (SELECT service_name, AVG(pay_sum) AS 基准值 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date NOT BETWEEN '2023-04-29' AND '2023-05-03' GROUP BY service_name) SELECT b.service_name AS 服务区, b.pay_sum AS 节假日营收, d.基准值 FROM bss_business_day_data b JOIN 日均基准 d ON b.service_name = d.service_name WHERE b.delete_ts IS NULL AND b.oper_date BETWEEN '2023-04-29' AND '2023-05-03' AND b.pay_sum > d.基准值 * 3;"
-  },
-  {
-    "question": "分析不同车辆类型(过境/城际)对应服务区营收相关性",
-    "sql": "SELECT '过境车流' AS 类型, AVG(pay_sum) AS 平均营收 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_name = c.service_area_id WHERE c.car_type = '过境' AND b.delete_ts IS NULL AND c.delete_ts IS NULL UNION ALL SELECT '城际车流', AVG(pay_sum) FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_name = c.service_area_id WHERE c.car_type = '城际' AND b.delete_ts IS NULL AND c.delete_ts IS NULL;"
-  },
-  {
-    "question": "统计最近30天支付方式偏好(各服务区微信/支付宝占比分布)",
-    "sql": "SELECT service_name AS 服务区, SUM(wx) / SUM(pay_sum) * 100 AS 微信占比, SUM(zfb) / SUM(pay_sum) * 100 AS 支付宝占比 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - 30 GROUP BY service_name ORDER BY 微信占比 DESC LIMIT 10;"
-  }
-]

+ 0 - 202
data_pipeline/training_data/task_20250701_184430/qs_highway_db_20250701_185822_pair.json.backup

@@ -1,202 +0,0 @@
-[
-  {
-    "question": "统计最近7天各服务区的总收入和总订单数,并按收入从高到低排序",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) AS 总收入, SUM(order_sum) AS 总订单数 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - 7 GROUP BY service_name ORDER BY 总收入 DESC;"
-  },
-  {
-    "question": "计算各服务区不同支付方式的订单占比(微信/支付宝/现金),展示前五名",
-    "sql": "SELECT service_name AS 服务区名称, ROUND(SUM(wx_order)*100.0/SUM(order_sum),2) AS 微信占比, ROUND(SUM(zf_order)*100.0/SUM(order_sum),2) AS 支付宝占比, ROUND(SUM(rmb_order)*100.0/SUM(order_sum),2) AS 现金占比 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 总收入 DESC LIMIT 5;"
-  },
-  {
-    "question": "分析2023年4月每日总营收变化趋势",
-    "sql": "SELECT oper_date AS 统计日期, SUM(pay_sum) AS 日营收总额 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(YEAR FROM oper_date) = 2023 AND EXTRACT(MONTH FROM oper_date) = 4 GROUP BY oper_date ORDER BY oper_date;"
-  },
-  {
-    "question": "查询最近一天营收超过5万元的服务区及对应支付方式渗透率",
-    "sql": "SELECT service_name AS 服务区名称, wx_order AS 微信订单数, zf_order AS 支付宝订单数, rmb_order AS 现金订单数, pay_sum AS 日营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = (SELECT MAX(oper_date) FROM bss_business_day_data) AND pay_sum > 50000 ORDER BY pay_sum DESC;"
-  },
-  {
-    "question": "统计各档口平均客单价(日均)并排名",
-    "sql": "SELECT branch_name AS 档口名称, ROUND(AVG(pay_sum/order_sum),2) AS 平均客单价 FROM bss_business_day_data WHERE delete_ts IS NULL AND order_sum > 0 GROUP BY branch_name ORDER BY 平均客单价 DESC;"
-  },
-  {
-    "question": "对比不同服务区现金支付占比的分布情况",
-    "sql": "SELECT service_name AS 服务区名称, ROUND(SUM(rmb) * 100.0 / SUM(pay_sum), 2) AS 现金占比 FROM bss_business_day_data WHERE delete_ts IS NULL AND pay_sum > 0 GROUP BY service_name ORDER BY 现金占比 DESC;"
-  },
-  {
-    "question": "查询指定日期(2023-04-01)微信支付金额TOP5的服务区明细",
-    "sql": "SELECT service_name AS 服务区名称, wx AS 微信支付金额, wx_order AS 微信订单数, pay_sum AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date = '2023-04-01' ORDER BY wx DESC LIMIT 5;"
-  },
-  {
-    "question": "分析各服务区支付宝订单占比与总营收的关系",
-    "sql": "SELECT service_name AS 服务区名称, ROUND(SUM(zf_order)*100.0/SUM(order_sum),2) AS 支付宝订单占比, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 支付宝订单占比 DESC;"
-  },
-  {
-    "question": "统计各服务区不同支付方式的订单数量分布",
-    "sql": "SELECT service_name AS 服务区名称, SUM(wx_order) AS 微信订单数, SUM(zf_order) AS 支付宝订单数, SUM(rmb_order) AS 现金订单数 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 总营收 DESC;"
-  },
-  {
-    "question": "查询最近3天庐山服务区每日营收及支付方式构成",
-    "sql": "SELECT oper_date AS 统计日期, wx AS 微信金额, zfb AS 支付宝金额, rmb AS 现金金额, pay_sum AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND service_name = '庐山服务区' AND oper_date >= CURRENT_DATE - 3 ORDER BY oper_date DESC;"
-  },
-  {
-    "question": "不同车辆类型的总车流量统计情况如何?",
-    "sql": "SELECT car_type AS 车辆类型, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type;"
-  },
-  {
-    "question": "哪些服务区的累计车流量位列前十?",
-    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 10;"
-  },
-  {
-    "question": "2022年3月2日各车型在服务区的流量分布是怎样的?",
-    "sql": "SELECT car_type AS 车辆类型, SUM(customer_count) AS 当日车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date = '2022-03-02' GROUP BY car_type;"
-  },
-  {
-    "question": "每个服务区每月平均车流量是多少?",
-    "sql": "SELECT service_area_id AS 服务区ID, DATE_TRUNC('month', count_date) AS 月份, AVG(daily_total) AS 月均车流量 FROM (SELECT service_area_id, count_date, SUM(customer_count) AS daily_total FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id, count_date) AS daily_counts GROUP BY service_area_id, 月份;"
-  },
-  {
-    "question": "最近一个月内,各服务区的日均车流量对比如何?",
-    "sql": "SELECT service_area_id AS 服务区ID, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_area_id ORDER BY 日均车流量 DESC;"
-  },
-  {
-    "question": "车流量最高的五个服务区是哪些?",
-    "sql": "SELECT service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 总车流量 DESC LIMIT 5;"
-  },
-  {
-    "question": "各车型在不同服务区的车流量分布情况如何?",
-    "sql": "SELECT car_type AS 车辆类型, service_area_id AS 服务区ID, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type, service_area_id;"
-  },
-  {
-    "question": "某服务区(如service_area_id='17461166e7fa3ecda03534a5795ce985')各车型的日均车流量是多少?",
-    "sql": "SELECT car_type AS 车辆类型, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND service_area_id = '17461166e7fa3ecda03534a5795ce985' GROUP BY car_type;"
-  },
-  {
-    "question": "2022年1月至3月期间,总车流量的月度变化趋势是怎样的?",
-    "sql": "SELECT DATE_TRUNC('month', count_date) AS 月份, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date BETWEEN '2022-01-01' AND '2022-03-31' GROUP BY 月份 ORDER BY 月份;"
-  },
-  {
-    "question": "某服务区(如ID为'81f4eb731fb0728aef17ae61f1f1daef')中,哪种车型的累计车流量最多?",
-    "sql": "SELECT car_type AS 车辆类型, SUM(customer_count) AS 总车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND service_area_id = '81f4eb731fb0728aef17ae61f1f1daef' GROUP BY car_type ORDER BY 总车流量 DESC LIMIT 1;"
-  },
-  {
-    "question": "统计各档口单位车流营收产出(坪效)并按从高到低排序",
-    "sql": "SELECT b.branch_name AS 档口名称, SUM(b.pay_sum) / SUM(c.customer_count) AS 单位车流营收 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id AND b.oper_date = c.count_date WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY b.branch_name ORDER BY 单位车流营收 DESC;"
-  },
-  {
-    "question": "对比不同服务区客单价(支付金额/订单数)排名",
-    "sql": "SELECT service_name AS 服务区名称, SUM(pay_sum) / SUM(order_sum) AS 客单价 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name ORDER BY 客单价 DESC LIMIT 10;"
-  },
-  {
-    "question": "查询最近7天车流最高的服务区对应坪效TOP5",
-    "sql": "SELECT s.service_name, SUM(s.pay_sum) / MAX(c.customer_count) AS 坪效 FROM (SELECT service_name, service_no, SUM(pay_sum) AS pay_sum FROM bss_business_day_data WHERE oper_date >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY service_name, service_no) s JOIN (SELECT service_area_id, SUM(customer_count) AS customer_count FROM bss_car_day_count WHERE count_date >= CURRENT_DATE - 7 AND delete_ts IS NULL GROUP BY service_area_id) c ON s.service_no = c.service_area_id ORDER BY 坪效 DESC LIMIT 5;"
-  },
-  {
-    "question": "分析各档口月度坪效趋势(2023年4月数据)",
-    "sql": "SELECT TO_CHAR(b.oper_date, 'YYYY-MM') AS 月份, b.branch_name, SUM(b.pay_sum) / SUM(c.customer_count) AS 坪效 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-04-30' AND b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY 月份, b.branch_name ORDER BY 月份, 坪效 DESC;"
-  },
-  {
-    "question": "查询城际车辆占比超50%的服务区坪效对比",
-    "sql": "WITH car_ratio AS (SELECT service_area_id, SUM(CASE WHEN car_type = '城际' THEN customer_count ELSE 0 END) * 1.0 / SUM(customer_count) AS城际占比 FROM bss_car_day_count GROUP BY service_area_id) SELECT b.service_name, SUM(b.pay_sum) / SUM(c.customer_count) AS 坪效 FROM bss_business_day_data b JOIN car_ratio r ON b.service_no = r.service_area_id JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE r.城际占比 > 0.5 AND b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY b.service_name ORDER BY 坪效 DESC;"
-  },
-  {
-    "question": "找出客单价最低的五个档口(客单价=金额/订单数)",
-    "sql": "SELECT branch_name, pay_sum / order_sum AS 客单价 FROM (SELECT branch_name, SUM(pay_sum) AS pay_sum, SUM(order_sum) AS order_sum FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY branch_name) t WHERE order_sum > 0 ORDER BY 客单价 ASC LIMIT 5;"
-  },
-  {
-    "question": "分析2023年Q2季度各服务区日均车流与营收关系",
-    "sql": "SELECT b.service_name, AVG(c.customer_count) AS 日均车流, AVG(b.pay_sum) AS 日均营收 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE b.oper_date BETWEEN '2023-04-01' AND '2023-06-30' GROUP BY b.service_name ORDER BY 日均车流 DESC;"
-  },
-  {
-    "question": "查询宜春服务区各档口微信支付占比TOP3",
-    "sql": "SELECT branch_name, SUM(wx) * 100.0 / SUM(pay_sum) AS 微信支付占比 FROM bss_business_day_data WHERE service_name = '宜春服务区' AND delete_ts IS NULL GROUP BY branch_name ORDER BY 微信支付占比 DESC LIMIT 3;"
-  },
-  {
-    "question": "统计各服务区坪效及车流排名差异(坪效排名与车流排名差值)",
-    "sql": "WITH rank_data AS (SELECT service_name, RANK() OVER (ORDER BY SUM(pay_sum)/SUM(customer_count) DESC) AS坪效排名, RANK() OVER (ORDER BY SUM(customer_count) DESC) AS车流排名 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE b.delete_ts IS NULL AND c.delete_ts IS NULL GROUP BY service_name) SELECT service_name, 坪效排名, 车流排名, ABS(坪效排名 -车流排名) AS排名差异 FROM rank_data ORDER BY 排名差异 DESC;"
-  },
-  {
-    "question": "分析周末与工作日营收差异(以2023-04为例)",
-    "sql": "SELECT CASE WHEN EXTRACT(ISODOW FROM oper_date) IN (6,7) THEN '周末' ELSE '工作日' END AS 日期类型, AVG(pay_sum) AS 平均营收, AVG(customer_count) AS 平均车流 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_no = c.service_area_id WHERE oper_date BETWEEN '2023-04-01' AND '2023-04-30' GROUP BY 日期类型;"
-  },
-  {
-    "question": "节假日与平日平均消费金额对比分析",
-    "sql": "SELECT '节假日' AS \"分析类型\", AVG(pay_sum) AS \"平均消费金额\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '平日', AVG(pay_sum) FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节假日与平日各类型车辆平均流量对比分析",
-    "sql": "SELECT car_type AS \"车辆类型\", AVG(CASE WHEN count_date BETWEEN '2023-10-01' AND '2023-10-07' THEN customer_count END) AS \"节假日均值\", AVG(CASE WHEN count_date NOT BETWEEN '2023-10-01' AND '2023-10-07' THEN customer_count END) AS \"平日均值\" FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY car_type;"
-  },
-  {
-    "question": "节假日与平日不同支付方式金额占比对比",
-    "sql": "SELECT '节假日' AS \"类型\", SUM(wx)/SUM(pay_sum) AS \"微信占比\", SUM(zfb)/SUM(pay_sum) AS \"支付宝占比\", SUM(rmb)/SUM(pay_sum) AS \"现金占比\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '平日', SUM(wx)/SUM(pay_sum), SUM(zfb)/SUM(pay_sum), SUM(rmb)/SUM(pay_sum) FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节假日总订单量Top10服务区",
-    "sql": "SELECT service_name AS \"服务区名称\", SUM(order_sum) AS \"总订单量\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY service_name ORDER BY \"总订单量\" DESC LIMIT 10;"
-  },
-  {
-    "question": "节假日车流峰值日期识别",
-    "sql": "SELECT count_date AS \"日期\", SUM(customer_count) AS \"总车流量\" FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY count_date ORDER BY \"总车流量\" DESC LIMIT 1;"
-  },
-  {
-    "question": "平日周消费金额波动趋势分析",
-    "sql": "SELECT EXTRACT(DOW FROM oper_date) AS \"星期\", AVG(pay_sum) AS \"平均消费\" FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY EXTRACT(DOW FROM oper_date) ORDER BY \"星期\";"
-  },
-  {
-    "question": "节假日与非节假日现金支付占比差异",
-    "sql": "SELECT '节假日' AS \"类型\", SUM(rmb)/SUM(pay_sum) AS \"现金占比\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL UNION ALL SELECT '平日', SUM(rmb)/SUM(pay_sum) FROM bss_business_day_data WHERE oper_date NOT BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节前节后3日车流环比增长率计算",
-    "sql": "SELECT (AVG(CASE WHEN count_date BETWEEN '2023-10-08' AND '2023-10-10' THEN customer_count END) - AVG(CASE WHEN count_date BETWEEN '2023-09-28' AND '2023-09-30' THEN customer_count END))/AVG(CASE WHEN count_date BETWEEN '2023-09-28' AND '2023-09-30' THEN customer_count END) AS \"增长率\" FROM bss_car_day_count WHERE count_date BETWEEN '2023-09-28' AND '2023-10-10' AND delete_ts IS NULL;"
-  },
-  {
-    "question": "节假日各档口消费总额Top10排名",
-    "sql": "SELECT branch_name AS \"档口名称\", SUM(pay_sum) AS \"总消费额\" FROM bss_business_day_data WHERE oper_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY branch_name ORDER BY \"总消费额\" DESC LIMIT 10;"
-  },
-  {
-    "question": "节假日车辆类型占比分布统计",
-    "sql": "SELECT car_type AS \"车辆类型\", SUM(customer_count) AS \"总量\", ROUND(100*SUM(customer_count)/(SELECT SUM(customer_count) FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL),2) AS \"占比百分比\" FROM bss_car_day_count WHERE count_date BETWEEN '2023-10-01' AND '2023-10-07' AND delete_ts IS NULL GROUP BY car_type ORDER BY \"总量\" DESC;"
-  },
-  {
-    "question": "统计最近一个月各服务区总营收排名(按支付金额降序)Top10",
-    "sql": "SELECT service_name AS 服务区, SUM(pay_sum) AS 总营收 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_name ORDER BY 总营收 DESC LIMIT 10;"
-  },
-  {
-    "question": "分析最近7天各区域(按服务区划分)日均车流密度Top5",
-    "sql": "SELECT service_area_id AS 服务区ID, AVG(customer_count) AS 日均车流量 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date >= CURRENT_DATE - INTERVAL '7 days' GROUP BY service_area_id ORDER BY 日均车流量 DESC LIMIT 5;"
-  },
-  {
-    "question": "对比营收Top10服务区与车流Top10服务区的重合率",
-    "sql": "WITH 营收排名 AS (SELECT service_name, SUM(pay_sum) AS 金额 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_name ORDER BY 金额 DESC LIMIT 10), 车流排名 AS (SELECT service_area_id, SUM(customer_count) AS 车流 FROM bss_car_day_count WHERE delete_ts IS NULL AND count_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY service_area_id ORDER BY 车流 DESC LIMIT 10) SELECT COUNT(*) FILTER (WHERE r.service_name = c.service_area_id) * 100.0 / 10 AS 重合率 FROM 营收排名 r, 车流排名 c;"
-  },
-  {
-    "question": "计算各区域(按branch_name首字分组)客单价(支付金额/订单数)Top3",
-    "sql": "SELECT SUBSTRING(branch_name FROM 1 FOR 1) AS 区域, service_name AS 服务区, AVG(pay_sum / order_sum) AS 客单价 FROM bss_business_day_data WHERE delete_ts IS NULL AND order_sum > 0 AND oper_date >= CURRENT_DATE - INTERVAL '1 month' GROUP BY SUBSTRING(branch_name FROM 1 FOR 1), service_name ORDER BY 区域, 客单价 DESC LIMIT 3;"
-  },
-  {
-    "question": "查询2023年Q2季度各服务区运营健康度评分(支付金额环比增长率)",
-    "sql": "SELECT service_name AS 服务区, (SUM(CASE WHEN EXTRACT(QUARTER FROM oper_date)=2 THEN pay_sum ELSE 0 END) - SUM(CASE WHEN EXTRACT(QUARTER FROM oper_date)=1 THEN pay_sum ELSE 0 END)) / NULLIF(SUM(CASE WHEN EXTRACT(QUARTER FROM oper_date)=1 THEN pay_sum ELSE 0 END), 0) AS 增长率 FROM bss_business_day_data WHERE delete_ts IS NULL AND EXTRACT(YEAR FROM oper_date)=2023 GROUP BY service_name ORDER BY 增长率 DESC;"
-  },
-  {
-    "question": "统计周末与工作日车流量差异最大的Top5服务区",
-    "sql": "SELECT service_area_id AS 服务区ID, AVG(CASE WHEN EXTRACT(ISODOW FROM count_date) IN (6,7) THEN customer_count ELSE 0 END) - AVG(CASE WHEN EXTRACT(ISODOW FROM count_date) NOT IN (6,7) THEN customer_count ELSE 0 END) AS 差异值 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id ORDER BY 差异值 DESC LIMIT 5;"
-  },
-  {
-    "question": "查询2023年节假日(五一假期)期间营收异常波动(超3倍均值)的服务区",
-    "sql": "WITH 日均基准 AS (SELECT service_name, AVG(pay_sum) AS 基准值 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date NOT BETWEEN '2023-04-29' AND '2023-05-03' GROUP BY service_name) SELECT b.service_name AS 服务区, b.pay_sum AS 节假日营收, d.基准值 FROM bss_business_day_data b JOIN 日均基准 d ON b.service_name = d.service_name WHERE b.delete_ts IS NULL AND b.oper_date BETWEEN '2023-04-29' AND '2023-05-03' AND b.pay_sum > d.基准值 * 3;"
-  },
-  {
-    "question": "分析不同车辆类型(过境/城际)对应服务区营收相关性",
-    "sql": "SELECT '过境车流' AS 类型, AVG(pay_sum) AS 平均营收 FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_name = c.service_area_id WHERE c.car_type = '过境' AND b.delete_ts IS NULL AND c.delete_ts IS NULL UNION ALL SELECT '城际车流', AVG(pay_sum) FROM bss_business_day_data b JOIN bss_car_day_count c ON b.service_name = c.service_area_id WHERE c.car_type = '城际' AND b.delete_ts IS NULL AND c.delete_ts IS NULL;"
-  },
-  {
-    "question": "统计最近30天支付方式偏好(各服务区微信/支付宝占比分布)",
-    "sql": "SELECT service_name AS 服务区, SUM(wx) / SUM(pay_sum) * 100 AS 微信占比, SUM(zfb) / SUM(pay_sum) * 100 AS 支付宝占比 FROM bss_business_day_data WHERE delete_ts IS NULL AND oper_date >= CURRENT_DATE - 30 GROUP BY service_name ORDER BY 微信占比 DESC LIMIT 10;"
-  },
-  {
-    "question": "查询连续3天车流量增长且营收排名上升的服务区",
-    "sql": "WITH 车流趋势 AS (SELECT service_area_id, COUNT(*) FILTER (WHERE customer_count > LAG(customer_count,1,0) OVER (PARTITION BY service_area_id ORDER BY count_date)) AS 连续增长天数 FROM bss_car_day_count WHERE delete_ts IS NULL GROUP BY service_area_id HAVING COUNT(*) FILTER (WHERE customer_count > LAG(customer_count,1,0) OVER (PARTITION BY service_area_id ORDER BY count_date)) >=3), 营收趋势 AS (SELECT service_name, COUNT(*) FILTER (WHERE pay_sum > LAG(pay_sum,1,0) OVER (PARTITION BY service_name ORDER BY oper_date)) AS 排名上升次数 FROM bss_business_day_data WHERE delete_ts IS NULL GROUP BY service_name) SELECT c.service_area_id AS 服务区ID FROM 车流趋势 c JOIN 营收趋势 r ON c.service_area_id = r.service_name;"
-  }
-]

+ 0 - 14
data_pipeline/training_data/task_20250701_184430/task_config.json

@@ -1,14 +0,0 @@
-{
-  "task_id": "task_20250701_184430",
-  "created_at": "2025-07-01T10:44:30.782367",
-  "parameters": {
-    "db_connection": "postgresql://postgres:postgres@192.168.67.1:6432/highway_db",
-    "table_list_file": "data_pipeline/tables.txt",
-    "business_context": "高速公路服务区管理系统",
-    "enable_llm_repair": true,
-    "modify_original_file": true,
-    "enable_sql_validation": true,
-    "enable_training_data_load": true
-  },
-  "output_directory": "data_pipeline\\training_data\\task_20250701_184430"
-}

Some files were not shown because too many files were changed