17.1 Automatically Detecting Schema Breakage and robots.txt Changes at the PR Stage

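In a traditional SEO workflow, changes to Schema markup and robots.txt are usually discovered passively after release, through Search Console or a third-party crawler. This after-the-fact approach is very costly in the era of generative search: a single broken Schema block or an accidental crawler ban can drop your content out of a generative engine's answer pool, potentially within hours.

The core advantage of a full-stack engineer is the ability to move these checks forward into code review and the continuous integration/continuous deployment (CI/CD) pipeline. This section shows how to automatically detect Schema breakage and abnormal robots.txt changes at the Pull Request (PR) stage.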

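17.1.1 Why Detect at the PR Stage?

Lowest cost: a problem caught before the merge costs almost nothing to fix.
Risk blocking: keeps faulty Schema markup from polluting production data, and keeps a faulty robots.txt directive from blocking crawlers and costing the whole site its rankings.
Team collaboration: developers who are not SEO specialists get immediate feedback when they submit code, which raises SEO awareness across the team.
Traceability: every change is recorded, which makes root-cause analysis easier.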

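17.1.2 Detecting Schema Breakage

Schema markup (usually in JSON-LD format) is central to how generative engines understand your content. A small syntax error or logical contradiction can cause the entire structured data block to be ignored.

Detection strategy

Syntax validation: make sure the JSON-LD is valid JSON.
Schema.org conformance: check that the types used (such as Product, FAQPage) and their properties match the current Schema.org vocabulary.
Business logic validation: check that key fields exist and hold sensible values (for example, Product must have name and offers; FAQPage must have mainEntity, and every Question must have acceptedAnswer).
Breaking change detection: compare the Schema output of the current branch with that of the target branch (such as main or master) and identify which fields were added, removed, or modified.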
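
The first, third, and fourth checks above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the REQUIRED table and the function names validate_schema and removed_fields are hypothetical, and the rules cover only the examples named in the list, not the full Schema.org vocabulary.

```python
import json

# Hypothetical rule table: required properties per @type, taken from the
# examples in the strategy list (not the full Schema.org vocabulary).
REQUIRED = {
    "Product": ["name", "offers"],
    "FAQPage": ["mainEntity"],
}


def as_list(value):
    """JSON-LD allows a single object where a list is expected."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def validate_schema(raw: str) -> list[str]:
    """Return a list of problems found in one JSON-LD document."""
    try:
        data = json.loads(raw)  # syntax validation
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON-LD value is not an object"]

    problems = []
    types = [t for t in as_list(data.get("@type")) if isinstance(t, str)]
    for schema_type in types:
        for prop in REQUIRED.get(schema_type, []):
            if not data.get(prop):
                problems.append(f"{schema_type} is missing '{prop}'")

    if "FAQPage" in types:
        for i, question in enumerate(as_list(data.get("mainEntity"))):
            if not isinstance(question, dict) or not question.get("acceptedAnswer"):
                problems.append(f"Question {i} is missing 'acceptedAnswer'")
    return problems


def removed_fields(base: dict, current: dict, prefix: str = "") -> list[str]:
    """List dotted paths that exist in base but are absent from current."""
    removed = []
    for key, value in base.items():
        path = f"{prefix}{key}"
        if key not in current:
            removed.append(path)
        elif isinstance(value, dict) and isinstance(current[key], dict):
            removed.extend(removed_fields(value, current[key], path + "."))
    return removed
```

Keeping the rules in a plain table makes it easy to add a type without touching the validation logic. Conformance against the full Schema.org vocabulary is deliberately left out here and is better handled by a dedicated validator.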

Implementation (GitHub Actions + Node.js)

Step 1: Create the check script (scripts/check-schema.js)

// scripts/check-schema.js
const fs = require('fs');
const { diff } = require('deep-diff'); // deep object comparison

// 1. Schema files for the current (PR) branch and the base branch.
// In a real project you would extract these from build output or an API response.
const currentSchemaPath = process.argv[2];
const baseSchemaPath = process.argv[3];

if (!currentSchemaPath || !baseSchemaPath) {
  console.error('Please provide the Schema file paths for the current branch and the base branch');
  process.exit(1);
}

const errors = [];

// 2. Syntax validation: JSON.parse throws on malformed JSON, so record the
// error here instead of letting the script crash before it can report.
function loadSchema(filePath) {
  try {
    return JSON.parse(fs.readFileSync(filePath, 'utf8'));
  } catch (e) {
    errors.push(`Syntax error (${filePath}): ${e.message}`);
    return null;
  }
}

const currentSchema = loadSchema(currentSchemaPath);
const baseSchema = loadSchema(baseSchemaPath);

// 3. Required-field checks (simplified example; in production, prefer a dedicated validation library)
if (currentSchema && currentSchema['@type'] === 'Product') {
  if (!currentSchema.name) {
    errors.push('Product must include the name property');
  }
  // Compare against null/undefined so that a price of 0 is not reported as missing
  if (!currentSchema.offers || currentSchema.offers.price == null) {
    errors.push('Product must include the offers.price property');
  }
}

// 4. Breaking change detection (only when both files parsed successfully)
const differences = currentSchema && baseSchema ? diff(baseSchema, currentSchema) : null;
if (differences) {
  differences.forEach(d => {
    const fieldPath = (d.path || []).join('.');
    if (d.kind === 'D') { // deleted
      errors.push(`Breaking change: removed field ${fieldPath}`);
    }
    if (d.kind === 'E') { // edited
      errors.push(`Field value changed: ${fieldPath} from ${JSON.stringify(d.lhs)} to ${JSON.stringify(d.rhs)}`);
    }
  });
}

// 5. Report the result
if (errors.length > 0) {
  console.error('Schema check failed:');
  errors.forEach(e => console.error(`  - ${e}`));
  process.exit(1);
} else {
  console.log('Schema check passed');
  process.exit(0);
}

Step 2: Configure the GitHub Actions workflow (.github/workflows/seo-check.yml)

name: SEO Schema Check

on:
  pull_request:
    paths:
      - 'public/**/*.json'  # watch JSON file changes
      - 'components/**/*.tsx' # or watch components that may generate Schema
      - 'pages/**/*.tsx'

permissions:
  contents: read
  pull-requests: write # needed by the PR comment step

jobs:
  schema-check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # fetch full history so the base branch is available

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm install deep-diff

      - name: Extract Schema from PR branch
        run: |
          # Simulates extracting Schema from build output.
          # In a real project, run the build here and extract JSON-LD from the HTML.
          echo '{"@context":"https://schema.org","@type":"Product","name":"Test Product","offers":{"price":"99.99"}}' > pr-schema.json

      - name: Extract Schema from base branch
        run: |
          git checkout origin/${{ github.base_ref }} # switch to the PR's target branch
          echo '{"@context":"https://schema.org","@type":"Product","name":"Old Product","offers":{"price":"89.99"}}' > base-schema.json
          git checkout -

      - name: Run Schema Check
        run: node scripts/check-schema.js pr-schema.json base-schema.json

      - name: Comment on PR (if failed)
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '❌ **Schema check failed**: the structured data in this PR has potential problems. Please check the workflow logs.'
            })
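
The workflow above only simulates Schema extraction with echo. One way to do the real extraction is to parse the built HTML and collect every JSON-LD script tag. The following is a minimal sketch using only the Python standard library; the names JsonLdExtractor and extract_jsonld are hypothetical, and it assumes the JSON-LD is present in the static HTML rather than injected at runtime by client-side JavaScript.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. <script async>), hence the "or".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html: str) -> list:
    """Return the parsed JSON-LD documents found in one HTML page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    # json.loads raises on malformed JSON-LD, which is itself a useful CI signal.
    return [json.loads(block) for block in parser.blocks]
```

Running this against the same page built from the PR branch and from the base branch yields the two inputs that the check script compares.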

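17.1.3 Detecting robots.txt Changes

robots.txt is the first gate a crawler passes through on your site. A wrong directive can block important pages, or invite AI bots to crawl content you never meant to expose.

Detection strategy

Syntax validation: make sure the robots.txt file follows the standard syntax (User-agent, Disallow, Allow, Sitemap).
Breaking directive detection: compare the old and new versions and identify any newly added Disallow directives that target important crawlers (such as Googlebot, GPTBot, Bytespider).
Key path detection: check whether paths that carry structured data or core content (such as /products/, /faq/, /api/schema/) have been blocked by accident.
Sitemap change detection: make sure the URL in each Sitemap directive is valid.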
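
The breaking directive check compares two versions of the file rather than inspecting one. A minimal sketch of that comparison follows; the names WATCHED_BOTS, parse_rules, and new_disallows are hypothetical, and the parser handles only the User-agent, Allow, and Disallow fields.

```python
# Hypothetical watch list, lowercased because user-agent matching is case-insensitive.
WATCHED_BOTS = {"*", "googlebot", "gptbot", "bytespider"}


def parse_rules(text: str) -> set[tuple[str, str]]:
    """Return {(user_agent, disallowed_path)} for one robots.txt body."""
    rules = set()
    agents, seen_rule = [], False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                rules.update((agent, value) for agent in agents)
    return rules


def new_disallows(base_text: str, pr_text: str, watched=WATCHED_BOTS):
    """Disallow rules present in the PR version but not in the base version."""
    added = parse_rules(pr_text) - parse_rules(base_text)
    return sorted(rule for rule in added if rule[0] in watched)
```

In CI, the base version could be read with a command such as git show against the target branch, and the PR version from the working tree; both are assumptions about your repository layout.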

Implementation (Python script + GitHub Actions)

Step 1: Create the check script (scripts/check-robots.py)

# scripts/check-robots.py
import sys
from urllib.parse import urlparse

# Lowercased, because user-agent matching is case-insensitive
CRITICAL_BOTS = {'googlebot', 'gptbot', 'bytespider', 'ccbot', 'claudebot'}
RISKY_PREFIXES = ('/api', '/products')


def check_robots(file_path):
    errors = []
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()

    agents = []              # user agents of the group being parsed
    group_has_rules = False  # True once the current group has a rule line
    for i, line in enumerate(lines, 1):
        stripped = line.split('#', 1)[0].strip()  # drop comments
        if not stripped:
            continue

        # 1. Basic syntax check: every directive has the form "field: value"
        if ':' not in stripped:
            errors.append(f"Line {i}: syntax error, '{stripped}' is missing a colon")
            continue

        # Split on the first colon only, so URLs keep their "https://" part
        field, value = (part.strip() for part in stripped.split(':', 1))
        field = field.lower()  # field names are case-insensitive

        if field == 'user-agent':
            if group_has_rules:  # a User-agent line after rules starts a new group
                agents, group_has_rules = [], False
            agents.append(value.lower())
        elif field in ('allow', 'disallow'):
            group_has_rules = True
            # 2. Flag critical crawlers (or "*") that are blocked from key paths
            if field == 'disallow' and (value == '/' or value.startswith(RISKY_PREFIXES)):
                for agent in agents:
                    if agent == '*' or agent in CRITICAL_BOTS:
                        errors.append(f"Line {i}: dangerous rule, {agent} is disallowed from '{value}'")
        elif field == 'sitemap':
            # 3. The Sitemap value must be an absolute URL
            parsed = urlparse(value)
            if not parsed.scheme or not parsed.netloc:
                errors.append(f"Line {i}: invalid Sitemap URL: {value}")

    return errors


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python check-robots.py <path to robots.txt>")
        sys.exit(1)

    errors = check_robots(sys.argv[1])
    if errors:
        print("robots.txt check failed:")
        for e in errors:
            print(f"  - {e}")
        sys.exit(1)
    else:
        print("robots.txt check passed")
        sys.exit(0)
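
The script above matches Disallow values against a fixed prefix list. A complementary check is to ask whether specific URLs are still fetchable by specific crawlers, which the standard library's urllib.robotparser can answer. The sketch below is one possible approach; the function name blocked_paths and the sample bots and paths are hypothetical. Note that urllib.robotparser applies rules in file order, which can differ from the longest-match precedence some crawlers use, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_text: str, bots: list[str], paths: list[str]):
    """Return (bot, path) pairs that the given robots.txt body disallows."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        (bot, path)
        for bot in bots
        for path in paths
        if not parser.can_fetch(bot, path)
    ]


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "\n"
        "User-agent: GPTBot\n"
        "Disallow: /products/\n"
    )
    for bot, path in blocked_paths(sample, ["Googlebot", "GPTBot"], ["/products/1", "/faq/"]):
        print(f"{bot} cannot fetch {path}")
```

A CI step could fail the build whenever the returned list is non-empty for the pages you consider critical.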

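Step 2: Integrate with GitHub Actions

In the earlier workflow file, add one more job (and make sure public/robots.txt is covered by the workflow's paths filter, otherwise a PR that touches only robots.txt will not trigger the workflow):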

  robots-check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Check robots.txt
        run: python scripts/check-robots.py public/robots.txt

      - name: Comment on PR (if failed)
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '🚫 **robots.txt check failed**: the robots.txt in this PR carries potential risk. Please check the workflow logs.'
            })

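17.1.4 Advanced: Comprehensive Checks with Lighthouse CI

For broader coverage, you can integrate Lighthouse CI. It complements the scripts above with lab performance metrics related to Core Web Vitals and an SEO audit category that covers best practices such as crawlability and meta tags. Lighthouse does not fully validate structured data automatically, so keep the Schema check from 17.1.2 alongside it.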

  lighthouse-check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

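      - name: Install dependencies
        run: npm ci # assumes a committed package-lock.json; use npm install otherwise

      - name: Build and start server
        run: |
          npm run build
          npm start &
          sleep 5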

      - name: Run Lighthouse CI
        run: |
          npm install -g @lhci/cli
          lhci autorun --collect.url=http://localhost:3000 --collect.numberOfRuns=1 --assert.preset=lighthouse:no-pwa

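17.1.5 Best Practices

Gradual rollout: start with Schema checks on key pages only (home page, product pages, FAQ pages), then widen the scope step by step.
Custom assertions: write assertion rules for your own business scenario. An e-commerce site, for example, can assert that the Schema on every product page includes sku and brand.
Caching and performance: avoid a full site build on every PR. Extract only the Schema that corresponds to changed files, or use incremental builds.
Visual reports: post the check results in the PR comments as a Markdown table or an image so that team members can read them easily.
Allowlist mechanism: let known, harmless changes (such as product price updates) skip specific checks through an allowlist.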
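
The allowlist and report ideas above can be combined in a small post-processing step that runs on the list of changed field paths. This is a minimal sketch: the ALLOWED_CHANGES patterns and the function names filter_changes and to_markdown are hypothetical, and the dotted paths are assumed to come from whatever diff step you already run.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: dotted field paths whose value changes are routine.
ALLOWED_CHANGES = ["offers.price", "offers.priceValidUntil", "aggregateRating.*"]


def filter_changes(changed_paths, allowlist=ALLOWED_CHANGES):
    """Split changed field paths into (needs_review, allowed)."""
    needs_review, allowed = [], []
    for path in changed_paths:
        if any(fnmatchcase(path, pattern) for pattern in allowlist):
            allowed.append(path)
        else:
            needs_review.append(path)
    return needs_review, allowed


def to_markdown(needs_review, allowed):
    """Render the result as a Markdown table for a PR comment."""
    rows = ["| Field | Status |", "| --- | --- |"]
    rows += [f"| `{path}` | needs review |" for path in needs_review]
    rows += [f"| `{path}` | allowlisted |" for path in allowed]
    return "\n".join(rows)
```

Only the needs_review list should fail the check; the allowlisted entries still appear in the report so that routine changes stay visible to reviewers.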

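17.1.6 Summary

Automatically detecting Schema breakage and robots.txt changes at the PR stage is a key practice for full-stack engineers who want to shift SEO/GEO operations left. With simple scripts and CI/CD integration, you can eliminate potential risks before the code is merged and keep your content correct and accessible to generative engines. This improves team efficiency and gives your product a solid technical moat as search continues to evolve.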