DataOps Platform - Business Rules & Validation Standards
Overview
This document defines the core business rules, validation standards, and processing workflows for the DataOps platform. These rules ensure data integrity, consistent API behavior, and reliable business logic execution.
1. Data Validation Rules
1.1 Talent Data Validation (Business Cards)
Rule ID: TALENT_VALIDATION_001
Required Fields
name_zh (Chinese name) - MANDATORY
- Must be a non-empty string
- Maximum length: 100 characters
Recommended Fields
mobile - Mobile phone number
title_zh - Chinese job title
hotel_zh - Chinese hotel name
Format Validation Rules
# Mobile phone validation
- Strip all non-digit characters before matching
- Must match pattern: ^1[3-9]\d{9}$ (Chinese mobile format)
- Invalid format generates WARNING, not ERROR
# Email validation
- Must match pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- Invalid format generates ERROR
# Array fields validation
- affiliation: Must be array type if present
- career_path: Must be array type if present
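The format rules above can be sketched as a single validation function. This is a minimal illustration, not the platform's actual code: the function name `validate_talent_data`, the `(errors, warnings)` return shape, and the choice to treat a whitespace-only name as empty are all assumptions.

```python
import re

MOBILE_RE = re.compile(r"^1[3-9]\d{9}$")
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def validate_talent_data(talent):
    """Return (errors, warnings) for one talent record (hypothetical helper)."""
    errors, warnings = [], []

    # name_zh is the only mandatory field.
    # Assumption: a whitespace-only value counts as empty.
    name = talent.get("name_zh")
    if not isinstance(name, str) or not name.strip():
        errors.append("name_zh is required and must be a non-empty string")
    elif len(name) > 100:
        errors.append("name_zh exceeds 100 characters")

    # Mobile: strip non-digits, then match; a mismatch is only a warning.
    mobile = talent.get("mobile")
    if mobile:
        digits = re.sub(r"\D", "", str(mobile))
        if not MOBILE_RE.fullmatch(digits):
            warnings.append("mobile does not match the Chinese mobile format")

    # Email: a mismatch is an error.
    email = talent.get("email")
    if email and not EMAIL_RE.fullmatch(str(email)):
        errors.append("email format is invalid")

    # Array fields must be lists when present.
    for field in ("affiliation", "career_path"):
        value = talent.get(field)
        if value is not None and not isinstance(value, list):
            errors.append(f"{field} must be an array")

    return errors, warnings
```

Returning warnings separately from errors lets the caller continue processing a record whose only problem is a malformed mobile number.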
1.2 Parse Task Validation
Rule ID: PARSE_TASK_001
Task Type Validation
ALLOWED_TASK_TYPES = ['名片', '简历', '新任命', '招聘', '杂项']
# 名片 = business card, 简历 = resume, 新任命 = new appointment, 招聘 = recruitment, 杂项 = miscellaneous
File Upload Rules
- 招聘 (Recruitment) tasks: NO files required, data parameter mandatory
- Other task types: Files array mandatory and non-empty
- File format validation based on task type
- Allowed extensions are defined by BaseConfig.ALLOWED_EXTENSIONS; a maximum file size is also enforced
Parameter Requirements
task_type: Required, must be from ALLOWED_TASK_TYPES
created_by: Optional, defaults to 'system'
files: Required for non-recruitment tasks
data: Required for recruitment tasks
publish_time: Required for 新任命 (appointment) tasks
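A minimal sketch of how the parameter requirements above might be checked. The function name `validate_parse_task`, its signature, and the `(params, errors)` return shape are hypothetical; only the rules themselves come from this section.

```python
ALLOWED_TASK_TYPES = ['名片', '简历', '新任命', '招聘', '杂项']


def validate_parse_task(task_type, files=None, data=None,
                        created_by=None, publish_time=None):
    """Return (params, errors) for a parse-task request (hypothetical helper)."""
    errors = []
    if task_type not in ALLOWED_TASK_TYPES:
        errors.append(f"task_type must be one of {ALLOWED_TASK_TYPES}")
    elif task_type == '招聘':
        # Recruitment tasks carry their payload in `data`, not in files.
        if not data:
            errors.append("data is required for 招聘 tasks")
    else:
        # Every other task type needs a non-empty files array.
        if not files:
            errors.append("files must be a non-empty array")
        if task_type == '新任命' and not publish_time:
            errors.append("publish_time is required for 新任命 tasks")

    params = {
        "task_type": task_type,
        "created_by": created_by or "system",  # documented default
        "files": files or [],
        "data": data,
        "publish_time": publish_time,
    }
    return params, errors
```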
2. API Response Standards
2.1 Standard Response Format
Rule ID: API_RESPONSE_001
All API responses MUST follow this structure:
{
"success": boolean,
"message": string,
"data": any,
"code": number (optional)
}
Success Response Example
{
"success": true,
"message": "操作成功",
"data": { ... }
}
Error Response Example
{
"success": false,
"message": "详细错误描述",
"data": null,
"code": 400
}
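One way to keep every endpoint on this structure is to build responses through two small helpers. The sketch below is illustrative only: the names `success_response`, `error_response`, and `to_json` are assumptions, as is using the example message as the default.

```python
import json


def success_response(data=None, message="操作成功"):
    """Build a success body; `code` is optional and omitted here."""
    return {"success": True, "message": message, "data": data}


def error_response(message, code=400, data=None):
    """Build an error body carrying the HTTP-style code."""
    return {"success": False, "message": message, "data": data, "code": code}


def to_json(body):
    # ensure_ascii=False keeps Chinese messages readable in the UTF-8 payload.
    return json.dumps(body, ensure_ascii=False)
```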
2.2 HTTP Status Code Rules
Rule ID: API_STATUS_001
200: Successful operation
400: Bad request (validation errors, missing parameters)
404: Resource not found
500: Internal server error
2.3 Content-Type Headers
Rule ID: API_HEADERS_001
- All API responses: application/json; charset=utf-8
- File downloads: Preserve original content-type
- CORS headers automatically configured
3. Database Rules
3.1 Data Integrity Rules
Rule ID: DB_INTEGRITY_001
Duplicate Detection
- Business cards: Check for duplicates based on name_zh + mobile combination
- Create DuplicateBusinessCard record when duplicates detected
- Status tracking: 'pending' → 'processed' → 'ignored'
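The duplicate check can be illustrated with plain dictionaries standing in for database rows. This is a sketch under assumptions: the helper names, the record fields `card` and `matched_ids`, and the treatment of a missing mobile as an empty string are hypothetical, and the real check presumably runs as a database query.

```python
def duplicate_key(card):
    """Build the name_zh + mobile key used for duplicate comparison."""
    # Assumption: a missing value is compared as an empty string.
    return (card.get("name_zh") or "", card.get("mobile") or "")


def find_duplicates(new_card, existing_cards):
    """Return existing cards sharing the new card's name_zh + mobile."""
    key = duplicate_key(new_card)
    return [c for c in existing_cards if duplicate_key(c) == key]


def make_duplicate_record(new_card, matches):
    """Plain-dict stand-in for a DuplicateBusinessCard row, starting as 'pending'."""
    return {
        "card": new_card,
        "matched_ids": [m["id"] for m in matches],
        "status": "pending",
    }
```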
Timestamp Management
# Use East Asia timezone for all timestamps
created_at = get_east_asia_time_naive()
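The body of get_east_asia_time_naive() is not given in this document. A plausible sketch, assuming "East Asia time" means a fixed UTC+8 offset and "naive" means the tzinfo is dropped before storage:

```python
from datetime import datetime, timedelta, timezone

# Assumption: East Asia time is UTC+8 (for example, China Standard Time).
EAST_ASIA_TZ = timezone(timedelta(hours=8))


def get_east_asia_time_naive():
    """Current UTC+8 wall-clock time with tzinfo stripped (a naive datetime)."""
    return datetime.now(EAST_ASIA_TZ).replace(tzinfo=None)
```

Storing naive datetimes keeps the values comparable with database columns that carry no timezone, at the cost of relying on every writer to use the same helper.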
Required Relationships
- BusinessCard ↔ ParsedTalent (one-to-many)
- DuplicateBusinessCard → BusinessCard (foreign key)
3.2 Data Model Rules
Rule ID: DB_MODEL_001
Field Constraints
# String fields
name_zh: max_length=100, nullable=False
email: max_length=100, nullable=True
mobile: max_length=100, nullable=True
# JSON fields
career_path: JSON format for structured career data
origin_source: JSON format for source tracking
4. File Processing Rules
4.1 File Upload Rules
Rule ID: FILE_UPLOAD_001
Allowed Extensions
ALLOWED_EXTENSIONS = {
'txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif',
'xlsx', 'xls', 'csv', 'sql', 'dll'
}
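An extension check against this set might look like the following. The function name `allowed_file` and the case-insensitive comparison are assumptions made for illustration.

```python
import os

ALLOWED_EXTENSIONS = {
    'txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif',
    'xlsx', 'xls', 'csv', 'sql', 'dll',
}


def allowed_file(filename):
    """True when the filename's final extension is in ALLOWED_EXTENSIONS."""
    # splitext returns the last suffix only, e.g. '.gz' for 'a.tar.gz'.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext in ALLOWED_EXTENSIONS
```

Checking the extension alone does not confirm the file's content, so it complements rather than replaces the size and content-type checks listed under Security Rules.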
Storage Rules
- Development: Local filesystem (C:\tmp\upload, C:\tmp\archive)
- Production: MinIO object storage
- File path tracking in database
Processing Workflow
- Validate file extension
- Upload to storage (MinIO/filesystem)
- Create database record
- Process file content (OCR, parsing)
- Extract structured data
- Validate extracted data
- Store in appropriate tables
5. Business Logic Rules
5.1 Talent Processing Workflow
Rule ID: BUSINESS_LOGIC_001
Neo4j Graph Processing
- Create or get talent node
- Process career path relationships
- Create WORK_AS, BELONGS_TO, WORK_FOR relationships
- Maximum traversal depth: 10 levels
- Duplicate node prevention
Data Enrichment
- Automatic brand group mapping
- Hotel position standardization
- Career path timeline construction
5.2 Query Processing Rules
Rule ID: BUSINESS_LOGIC_002
Graph Query Optimization
# Use recursive traversal for label-based queries
# Pattern: (start_node)-[*1..10]->(end_node)
# Stop conditions: No outgoing relationships OR Talent node reached
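To make the depth limit and stop conditions concrete, here is a sketch over an in-memory adjacency dict rather than Neo4j. It uses an iterative breadth-first walk as a stand-in for the recursive traversal, and the `graph` and `labels` structures are hypothetical.

```python
MAX_DEPTH = 10  # matches the [*1..10] bound in the pattern


def traverse(graph, labels, start, max_depth=MAX_DEPTH):
    """Collect nodes reachable from `start` within `max_depth` hops.

    graph:  dict mapping node -> list of successor nodes
    labels: dict mapping node -> label string
    Talent nodes and nodes without outgoing edges are not expanded further.
    """
    visited = {start}      # prevents visiting the same node twice
    frontier = [start]
    reached = []
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for nxt in graph.get(node, []):
                if nxt in visited:
                    continue
                visited.add(nxt)
                reached.append(nxt)
                # Stop condition: do not walk past a Talent node.
                if labels.get(nxt) != "Talent":
                    next_frontier.append(nxt)
        if not next_frontier:
            break  # stop condition: nothing left to expand
        frontier = next_frontier
    return reached
```

Unlike a Cypher variable-length match, which returns paths, this sketch returns each reachable node once.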
6. Security Rules
6.1 Input Validation
Rule ID: SECURITY_001
Sanitization Requirements
- All user inputs MUST be validated
- SQL injection prevention through SQLAlchemy ORM
- XSS prevention through proper encoding
- File upload validation (extension, size, content-type)
Authentication & Authorization
- Environment variables for sensitive data
- API key validation for external services
- CORS configuration for cross-origin requests
6.2 Error Handling
Rule ID: SECURITY_002
Information Disclosure Prevention
- Generic error messages for production
- Detailed logging for debugging
- No sensitive data in error responses
- Stack traces only in development mode
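These rules can be sketched as a single function that always logs the full exception but puts details in the response body only when a debug flag is set. The name `build_error_body`, the `debug` flag, and the generic message text are assumptions.

```python
import logging
import traceback

logger = logging.getLogger(__name__)


def build_error_body(exc, debug=False):
    """Log full details; expose them to the client only in debug mode."""
    logger.error("Unhandled exception: %s", exc, exc_info=exc)
    body = {
        "success": False,
        "message": "Internal server error",  # generic text for production
        "data": None,
        "code": 500,
    }
    if debug:
        # Development only: include the message and the stack trace.
        body["message"] = str(exc)
        body["data"] = {
            "traceback": traceback.format_exception(
                type(exc), exc, exc.__traceback__
            )
        }
    return body
```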
7. Configuration Rules
7.1 Environment-Specific Rules
Rule ID: CONFIG_001
Development Environment
- Debug mode: ON
- Detailed logging: ON
- Local database connections
- Console logging: ON
Production Environment
- Debug mode: OFF
- Log level: INFO and above
- Remote database connections
- File logging only
- Security headers enforced
7.2 Service Integration Rules
Rule ID: CONFIG_002
External Service Configuration
# LLM Services (Qwen API)
- API key from environment variables
- Fallback to default for development
- Rate limiting and retry logic
# Database Services
- Connection pooling enabled
- Health check (pool_pre_ping: True)
- Connection recycling (300 seconds)
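The database settings above map onto keyword arguments of SQLAlchemy's `create_engine()`. The fragment below is a sketch; the constant name `ENGINE_OPTIONS` is an assumption, and the platform may pass these values through a framework setting instead.

```python
# Keyword arguments in the form accepted by sqlalchemy.create_engine().
ENGINE_OPTIONS = {
    "pool_pre_ping": True,  # test each pooled connection before handing it out
    "pool_recycle": 300,    # replace connections older than 300 seconds
}
```

With SQLAlchemy installed, these would be applied as `create_engine(database_url, **ENGINE_OPTIONS)`, where `database_url` is a placeholder for the real connection string.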
8. Logging & Monitoring Rules
8.1 Logging Standards
Rule ID: LOGGING_001
Log Format
LOG_FORMAT = '%(asctime)s - %(levelname)s - %(filename)s - %(funcName)s - %(lineno)s - %(message)s'
Log Levels
- DEBUG: Detailed information for development
- INFO: General operational information
- WARNING: Validation warnings, non-critical issues
- ERROR: Error conditions, exceptions
- CRITICAL: System failures
Log Output
- Development: Console + file logging
- Production: File logging only
- UTF-8 encoding for Chinese character support
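A minimal standard-library sketch of these logging rules. The function name `build_logger` and its parameters are hypothetical; the format string is the one defined under Log Format, and the levels follow the environment rules in section 7.1.

```python
import logging
import sys

LOG_FORMAT = '%(asctime)s - %(levelname)s - %(filename)s - %(funcName)s - %(lineno)s - %(message)s'


def build_logger(name, log_file, production):
    """File-only at INFO in production; console + file at DEBUG otherwise."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO if production else logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger
    formatter = logging.Formatter(LOG_FORMAT)

    # UTF-8 keeps Chinese characters intact in the log file.
    handlers = [logging.FileHandler(log_file, encoding="utf-8")]
    if not production:
        handlers.append(logging.StreamHandler(sys.stdout))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```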
9. Performance Rules
9.1 Database Performance
Rule ID: PERFORMANCE_001
Query Optimization
- Use proper indexing for frequently queried fields
- Batch processing for large datasets (batch_size: 1000)
- Connection pooling (pool_size: 10, max_overflow: 20)
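Batch processing with the stated batch size can be sketched as a generator that slices any iterable into fixed-size chunks. The name `iter_batches` is an assumption.

```python
from itertools import islice

BATCH_SIZE = 1000


def iter_batches(items, batch_size=BATCH_SIZE):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the generator pulls items lazily, a large dataset never has to be held in memory as a whole; each batch can be written and committed before the next one is read.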
Caching Strategy
- Session-based caching for Neo4j queries
- File processing result caching
- API response caching for static data
10. Compliance & Audit Rules
10.1 Data Tracking
Rule ID: AUDIT_001
Change Tracking
- All data modifications logged with timestamp
- User attribution for all operations
- Source tracking in origin_source field
Data Retention
- Archive processed files
- Maintain processing history
- Duplicate detection records retention
Rule Enforcement
Implementation Guidelines
- Validation: Implement validation functions following the patterns in parse_menduner.py
- Error Handling: Use standardized error response format
- Testing: Create unit tests for each business rule
- Documentation: Update API documentation when rules change
Rule Violation Handling
- Critical violations: Return HTTP 400 (validation failure) or 500 (server failure) with a descriptive error message that exposes no sensitive data
- Warning violations: Log warning, continue processing
- Data quality issues: Create audit records for manual review
Review Process
- Monthly review of business rules effectiveness
- Update rules based on operational feedback
- Version control for rule changes
- Impact assessment for rule modifications