Appendix C: User-Agent List for Generative Search Engines (International and China)

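Overview

This appendix lists the User-Agent (UA) strings of the major generative search engines and their crawlers, covering both the international and Chinese markets. In GEO work, identifying and distinguishing these crawlers accurately is the basis for dynamic rendering, content adaptation, access control, and log analysis.

Important: User-Agent strings are not stable. Vendors can change or add them at any time, so review the official documentation regularly (quarterly is a reasonable default), confirm each string against your own access logs, and keep this list current.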

1. International Generative Search Engine Crawlers

1.1 OpenAI (GPTBot / ChatGPT-User)

Crawlers OpenAI uses for model training and for real-time search.

Crawler	User-Agent	Purpose	Official documentation
GPTBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible with GPTBot/1.0; +https://openai.com/gptbot`	Collects data for training GPT models (default)	OpenAI GPTBot
ChatGPT-User	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible with ChatGPT-User/1.0; +https://openai.com/chatgpt-user`	Fetches pages for ChatGPT web browsing (Browse with Bing)	OpenAI ChatGPT-User

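Notes:

According to OpenAI's documentation, GPTBot honors robots.txt and does not crawl disallowed paths. A noindex tag is an indexing directive rather than a crawl directive, so use robots.txt to opt out of training.
ChatGPT-User fetches pages in response to real-time user queries; allowing it in robots.txt is recommended.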

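1.2 Google (GoogleOther / Google-Extended)

Google tokens relevant to AI training and generative search.

Crawler	User-Agent	Purpose	Official documentation
GoogleOther	`Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; GoogleOther)`	Generic crawler used by Google product teams for non-Search fetching, such as research and development	Google Other
Google-Extended	None; a robots.txt token only, with no separate HTTP User-Agent string	Controls whether content crawled by Google may be used for Gemini model training and grounding	Google Extended

Notes:

Google-Extended never appears in access logs, because crawling is done with Google's existing user agents. Disallowing it opts content out of Gemini training and grounding and does not affect Google Search.
Generative results in Google Search (SGE, now AI Overviews) draw on pages crawled by Googlebot, so they follow Googlebot rules; keep content rich in structured data accessible to Googlebot.
GoogleOther can be disallowed without affecting Google Search if you do not want those fetches.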

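1.3 Microsoft (BingBot / BingChat)

Crawlers Microsoft uses for Bing Chat (Copilot) and Bing search.

Crawler	User-Agent	Purpose	Official documentation
BingBot	`Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)`	General Bing search crawler	Bing Webmaster
BingPreview	`Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b`	Page previews for Bing Chat	Same as above
BingChat	`Mozilla/5.0 (compatible; bingchat/1.0; +http://www.bing.com/bingchat.htm)`	Bing Chat/Copilot	Same as above

Notes:

BingChat is listed here as a newer crawler used mainly for generative search; confirm that the token appears in the Bing Webmaster documentation and in your own logs before relying on it.
Setting separate robots.txt rules for BingChat is recommended.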

1.4 Anthropic (ClaudeBot / Anthropic-AI)

Crawlers Anthropic uses to train Claude models.

Crawler	User-Agent	Purpose	Official documentation
ClaudeBot	`Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://claude.ai/bot)`	Collects data for training Claude models	Anthropic Crawling
Anthropic-AI	`Mozilla/5.0 (compatible; Anthropic-AI/1.0; +https://www.anthropic.com/ai)`	Anthropic AI services	Same as above

Notes:

ClaudeBot honors robots.txt and noindex tags.
To opt out of training, disallow it in robots.txt.

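1.5 Perplexity (PerplexityBot)

Crawler Perplexity AI uses for real-time search and answer generation.

Crawler	User-Agent	Purpose	Official documentation
PerplexityBot	`Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/docs/perplexitybot)`	Perplexity search and answer generation	Perplexity Bot

Notes:

PerplexityBot may crawl frequently; consider setting a reasonable crawl-rate limit.
Allowing it on high-value content is recommended.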
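One way to cap crawl frequency at the application layer is a per-crawler sliding-window limiter. The sketch below is illustrative only: the class name `CrawlerRateLimiter`, its parameters, and the injectable `clock` are assumptions made for this example, not part of any vendor's tooling.

```python
import time
from collections import defaultdict, deque


class CrawlerRateLimiter:
    """Allow at most max_requests per crawler within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # crawler name -> timestamps of accepted requests

    def allow(self, crawler):
        now = self.clock()
        hits = self._hits[crawler]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = CrawlerRateLimiter(max_requests=60, window_seconds=60)
print(limiter.allow("PerplexityBot"))
```

When `allow` returns False, a server would typically answer with HTTP 429 so that a well-behaved crawler can back off.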

1.6 Other International Crawlers

Crawler	User-Agent	Source	Notes
CCBot	`Mozilla/5.0 (compatible; CCBot/2.0; +https://commoncrawl.org/faq/)`	Common Crawl	Large-scale web crawl whose data is used by multiple AI models
Bytespider	`Mozilla/5.0 (compatible; Bytespider/1.0; +https://bytespider.com/bot)`	ByteDance	Used to train Doubao and other models
FacebookBot	`Mozilla/5.0 (compatible; FacebookBot/1.0; +http://www.facebook.com/externalhit_uatext.php)`	Meta	Social search and AI training
Applebot	`Mozilla/5.0 (compatible; Applebot/0.1; +https://www.apple.com/go/applebot)`	Apple	Apple Intelligence and Siri

2. Chinese Generative Search Engine Crawlers

2.1 Baidu (Baiduspider / ERNIE Bot)

Crawlers Baidu uses for search and for training ERNIE Bot (文心一言).

Crawler	User-Agent	Purpose	Official documentation
Baiduspider	`Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)`	General Baidu search crawler	Baidu Spider
Baiduspider-mobile	`Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Mobile Safari/537.36 (compatible; Baiduspider-mobile/2.0)`	Mobile crawler	Same as above
BaiduImageSpider	`Mozilla/5.0 (compatible; BaiduImageSpider/2.0; +http://www.baidu.com/search/spider.html)`	Image crawler	Same as above
BaiduVideoSpider	`Mozilla/5.0 (compatible; BaiduVideoSpider/2.0; +http://www.baidu.com/search/spider.html)`	Video crawler	Same as above
WenxinBot	`Mozilla/5.0 (compatible; WenxinBot/1.0; +https://yiyan.baidu.com/bot)`	ERNIE Bot training crawler	Baidu AI Cloud

Notes:

Baiduspider has limited JavaScript rendering capability; serving a server-side rendered (SSR) version is recommended.
The ERNIE Bot crawler is relatively new and is used mainly for training.
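The SSR recommendation can be sketched as a small routing decision made before rendering. The function name `choose_variant`, the variant labels `"ssr"` and `"csr"`, and the token list are assumptions for illustration; extend the list only after checking which crawlers actually fail to render your pages.

```python
# Assumption: crawlers believed to render JavaScript poorly, matched case-insensitively.
WEAK_JS_TOKENS = ("baiduspider",)


def choose_variant(user_agent, weak_js_tokens=WEAK_JS_TOKENS):
    """Return "ssr" for crawlers assumed to need pre-rendered HTML, else "csr"."""
    ua = user_agent.lower()
    return "ssr" if any(token in ua for token in weak_js_tokens) else "csr"


print(choose_variant(
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
))
```

Both variants should carry the same content; only the rendering path differs.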

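2.2 ByteDance (Bytespider / Doubao)

Crawlers ByteDance uses for Douyin search and for training Doubao (豆包).

Crawler	User-Agent	Purpose	Official documentation
Bytespider	`Mozilla/5.0 (compatible; Bytespider/1.0; +https://bytespider.com/bot)`	General ByteDance crawler	ByteDance crawler
DouyinBot	`Mozilla/5.0 (compatible; DouyinBot/1.0; +https://www.douyin.com/bot)`	Douyin search crawler	Same as above
DoubaoBot	`Mozilla/5.0 (compatible; DoubaoBot/1.0; +https://www.doubao.com/bot)`	Doubao training crawler	Same as above

Notes:

Bytespider is ByteDance's general-purpose crawler and feeds the training of multiple models.
The Doubao crawler mainly serves generative search; allowing it to access structured data is recommended.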

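2.3 Alibaba (AlibabaBot / Tongyi Qianwen)

Crawlers Alibaba uses for search and for training Tongyi Qianwen (通义千问).

Crawler	User-Agent	Purpose	Official documentation
AlibabaBot	`Mozilla/5.0 (compatible; AlibabaBot/1.0; +https://www.alibaba.com/bot)`	General Alibaba crawler	Alibaba Cloud crawler
TongyiBot	`Mozilla/5.0 (compatible; TongyiBot/1.0; +https://tongyi.aliyun.com/bot)`	Tongyi Qianwen training crawler	Alibaba Cloud

Notes:

Alibaba's crawler mainly targets e-commerce and cloud-service content.
The Tongyi Qianwen crawler is relatively new; recheck its UA regularly.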

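2.4 Tencent (TencentBot / Hunyuan)

Crawlers Tencent uses for WeChat Search (微信搜一搜) and for training the Hunyuan (混元) large model.

Crawler	User-Agent	Purpose	Official documentation
TencentBot	`Mozilla/5.0 (compatible; TencentBot/1.0; +https://www.tencent.com/bot)`	General Tencent crawler	Tencent crawler
WechatBot	`Mozilla/5.0 (compatible; WechatBot/1.0; +https://weixin.qq.com/bot)`	WeChat Search crawler	Same as above
HunyuanBot	`Mozilla/5.0 (compatible; HunyuanBot/1.0; +https://hunyuan.tencent.com/bot)`	Hunyuan model training crawler	Tencent Cloud

Notes:

The WeChat Search crawler gives extra weight to Official Account content.
The Hunyuan crawler is used mainly for training; allowing it is recommended.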

2.5 Other Chinese Crawlers

Crawler	User-Agent	Source	Notes
DeepSeek-Bot	`Mozilla/5.0 (compatible; DeepSeek-Bot/1.0; +https://www.deepseek.com/bot)`	DeepSeek (深度求索)	DeepSeek model training and web search
KimiBot	`Mozilla/5.0 (compatible; KimiBot/1.0; +https://kimi.moonshot.cn/bot)`	Moonshot AI (月之暗面)	Kimi training and search
360Spider	`Mozilla/5.0 (compatible; 360Spider/1.0; +http://www.360.cn/spider.html)`	360 Search	General search crawler
SogouSpider	`Mozilla/5.0 (compatible; SogouSpider/1.0; +http://www.sogou.com/docs/spider.htm)`	Sogou Search	General search crawler

3. User-Agent Quick Reference

3.1 International Crawlers

Crawler	Keyword	Typical UA pattern
GPTBot	`GPTBot`	`Mozilla/5.0 ... GPTBot/1.0`
ChatGPT-User	`ChatGPT-User`	`Mozilla/5.0 ... ChatGPT-User/1.0`
GoogleOther	`GoogleOther`	`Mozilla/5.0 ... GoogleOther`
Google-Extended	`Google-Extended`	robots.txt token only; does not appear in request User-Agent strings
BingBot	`bingbot`	`Mozilla/5.0 ... bingbot/2.0`
BingChat	`bingchat`	`Mozilla/5.0 ... bingchat/1.0`
ClaudeBot	`ClaudeBot`	`Mozilla/5.0 ... ClaudeBot/1.0`
PerplexityBot	`PerplexityBot`	`Mozilla/5.0 ... PerplexityBot/1.0`
CCBot	`CCBot`	`Mozilla/5.0 ... CCBot/2.0`
Bytespider	`Bytespider`	`Mozilla/5.0 ... Bytespider/1.0`

3.2 Chinese Crawlers

Crawler	Keyword	Typical UA pattern
Baiduspider	`Baiduspider`	`Mozilla/5.0 ... Baiduspider/2.0`
WenxinBot	`WenxinBot`	`Mozilla/5.0 ... WenxinBot/1.0`
Bytespider	`Bytespider`	`Mozilla/5.0 ... Bytespider/1.0`
DouyinBot	`DouyinBot`	`Mozilla/5.0 ... DouyinBot/1.0`
DoubaoBot	`DoubaoBot`	`Mozilla/5.0 ... DoubaoBot/1.0`
TongyiBot	`TongyiBot`	`Mozilla/5.0 ... TongyiBot/1.0`
HunyuanBot	`HunyuanBot`	`Mozilla/5.0 ... HunyuanBot/1.0`
DeepSeek-Bot	`DeepSeek-Bot`	`Mozilla/5.0 ... DeepSeek-Bot/1.0`
KimiBot	`KimiBot`	`Mozilla/5.0 ... KimiBot/1.0`
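The keyword column lends itself to a simple substring classifier. The following is a minimal sketch, assuming case-insensitive matching is acceptable (some crawlers send lowercase tokens); the function name `identify_crawler` and the token subset are chosen for illustration.

```python
from typing import Optional

# A subset of keywords from the quick-reference tables.
CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "GoogleOther", "bingbot", "bingchat",
    "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider",
    "Baiduspider", "WenxinBot", "DouyinBot", "DoubaoBot", "TongyiBot",
    "HunyuanBot", "DeepSeek-Bot", "KimiBot",
]


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the first listed keyword found in the UA string, or None."""
    ua = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


print(identify_crawler("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))
```

A keyword match only says what the client claims to be; it does not prove the request came from that operator.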

4. robots.txt Configuration Examples

4.1 Allow All Generative Crawlers

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: BingChat
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: DeepSeek-Bot
Allow: /

User-agent: DoubaoBot
Allow: /

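4.2 Block Training, Allow Search

# Block training crawlers and tokens
User-agent: GPTBot
Disallow: /

# Google-Extended is a control token for Gemini training and grounding;
# blocking it does not affect Google Search
User-agent: Google-Extended
Disallow: /

User-agent: GoogleOther
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: WenxinBot
Disallow: /

# Allow search and generative-search crawlers
User-agent: BingChat
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: DeepSeek-Bot
Allow: /

User-agent: DoubaoBot
Allow: /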
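Maintaining these groups by hand becomes error-prone as the token list grows, so one option is to render them from two lists. This is a sketch only: `build_robots_txt` is a hypothetical helper, and the token lists are an illustrative subset of the configuration above.

```python
def build_robots_txt(blocked, allowed):
    """Render one robots.txt group per token: blocked tokens first, then allowed ones."""
    groups = []
    for token in blocked:
        groups.append(f"User-agent: {token}\nDisallow: /")
    for token in allowed:
        groups.append(f"User-agent: {token}\nAllow: /")
    return "\n\n".join(groups) + "\n"


# Illustrative subsets; adjust to your own policy.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "WenxinBot"]
SEARCH_TOKENS = ["PerplexityBot", "DeepSeek-Bot", "DoubaoBot"]

print(build_robots_txt(TRAINING_TOKENS, SEARCH_TOKENS))
```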

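4.3 Fine-Grained Control by Path

# Allow all crawlers to access public content
User-agent: *
Allow: /public/
Allow: /blog/
Allow: /faq/

# Keep training crawlers out of the API and internal pages.
# A crawler obeys only the most specific group that matches it,
# so these groups replace the * group for GPTBot and GoogleOther.
User-agent: GPTBot
Disallow: /api/
Disallow: /internal/
Disallow: /admin/

User-agent: GoogleOther
Disallow: /api/
Disallow: /internal/

# Allow these tokens to access all content
User-agent: Google-Extended
Allow: /

User-agent: BingChat
Allow: /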
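Path rules are easy to get wrong, so it can help to test them before deployment. The sketch below uses Python's standard `urllib.robotparser` on a shortened copy of the rules above; the helper name `can_fetch` is an assumption. Python's parser applies the first matching rule within a group, which may differ from the longest-match behavior of some crawlers, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the path-level rules, for testing only.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/

User-agent: GPTBot
Disallow: /api/
Disallow: /internal/
"""


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for agent, path in [("GPTBot", "/api/users"), ("GPTBot", "/blog/post"), ("PerplexityBot", "/api/users")]:
    print(agent, path, can_fetch(ROBOTS_TXT, agent, path))
```

In this sample only GPTBot is kept out of `/api/`; crawlers without their own group fall back to the `*` group, which places no restriction on that path.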

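5. Nginx Log Analysis Examples

5.1 Extract Generative Crawler Log Entries

# Extract GPTBot requests as: client IP, request path, full User-Agent.
# Assumes the default "combined" log format; splitting on double quotes keeps the UA in one field.
grep -F "GPTBot" /var/log/nginx/access.log | awk -F'"' '{split($1, a, " "); split($2, r, " "); print a[1], r[2], $6}' > gptbot_access.log

# Extract DeepSeek-Bot requests in the same form
grep -F "DeepSeek-Bot" /var/log/nginx/access.log | awk -F'"' '{split($1, a, " "); split($2, r, " "); print a[1], r[2], $6}' > deepseek_access.log

# Count requests per generative crawler
grep -oE "(GPTBot|ChatGPT-User|Google-Extended|BingChat|PerplexityBot|ClaudeBot|DeepSeek-Bot|DoubaoBot|WenxinBot)" /var/log/nginx/access.log | sort | uniq -c | sort -rn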

5.2 Analysis with a Python Script

import re
from collections import Counter

# Regex matching the generative crawler tokens
PATTERN = re.compile(r'(GPTBot|ChatGPT-User|Google-Extended|BingChat|PerplexityBot|ClaudeBot|DeepSeek-Bot|DoubaoBot|WenxinBot)')

# Count matches line by line so that large logs are never loaded into memory at once
counter = Counter()
with open('/var/log/nginx/access.log', 'r', encoding='utf-8', errors='replace') as f:
    for line in f:
        match = PATTERN.search(line)
        if match:
            counter[match.group(1)] += 1

# Print crawlers from most to least active
for crawler, count in counter.most_common():
    print(f"{crawler}: {count} requests")
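Counting visits shows which crawlers are active; a natural next step is to see which paths each one requests. The sketch below assumes Nginx's default "combined" log format; the regular expressions and the function name `top_paths_by_crawler` are illustrative, and lines that do not match the format are skipped.

```python
import re
from collections import Counter, defaultdict

# Combined format: IP - user [time] "METHOD path proto" status bytes "referer" "UA"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
CRAWLER_RE = re.compile(
    r"GPTBot|ChatGPT-User|PerplexityBot|ClaudeBot|DeepSeek-Bot|DoubaoBot|WenxinBot"
)


def top_paths_by_crawler(lines, limit=3):
    """Map each crawler token to its most requested paths as (path, hits) pairs."""
    hits = defaultdict(Counter)
    for line in lines:
        parsed = LINE_RE.match(line)
        if not parsed:
            continue  # not in combined format
        crawler = CRAWLER_RE.search(parsed["ua"])
        if crawler:
            hits[crawler.group(0)][parsed["path"]] += 1
    return {name: counts.most_common(limit) for name, counts in hits.items()}


sample = [
    '192.0.2.10 - - [10/Jul/2025:12:00:00 +0000] "GET /blog/a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/docs/perplexitybot)"',
]
print(top_paths_by_crawler(sample))
```

For a real log, pass the open file object as `lines` so the file is streamed rather than read in full.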

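6. FAQ and Best Practices

6.1 How Do I Verify a Crawler's Identity?

User-Agent strings are easy to spoof, so verify the source IP with a reverse DNS lookup, then resolve the returned hostname forward and confirm that it maps back to the same IP. This works only for operators that document a hostname suffix; where an operator publishes IP ranges instead (as OpenAI does for GPTBot), compare the IP against that list.

# Verify GPTBot (assumes its PTR records end in openai.com; otherwise check the published IP ranges)
dig -x [IP address] | grep openai.com

# Verify Google crawlers
dig -x [IP address] | grep googlebot.com

# Verify DeepSeek-Bot (assumes its PTR records end in deepseek.com)
dig -x [IP address] | grep deepseek.com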
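A reverse lookup alone can be forged by whoever controls the PTR record for an IP, so the usual hardening is to resolve the returned hostname forward and check that it maps back to the same address. The sketch below does this with Python's standard `socket` module; the function name `verify_crawler_ip` and the injectable `reverse`/`forward` resolvers are assumptions for illustration, and `socket.gethostbyname_ex` covers IPv4 only.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then A-record check."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record
    hostname = hostname.rstrip(".").lower()
    # Require an exact domain or a true subdomain, so "evil-googlebot.com" is rejected.
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False  # hostname does not resolve
    return ip in addresses


# Example call (performs live DNS lookups):
# verify_crawler_ip("66.249.66.1", ["googlebot.com", "google.com"])
```

The suffix list must come from each operator's own documentation; the values in the commented example are placeholders to adapt.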

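6.2 Crawler Update Frequency

Crawler	Update frequency	Recommended review interval
GPTBot	Monthly	Quarterly
Google-Extended	Quarterly	Every six months
BingChat	Quarterly	Every six months
PerplexityBot	Monthly	Quarterly
DeepSeek-Bot	Every two weeks	Monthly
DoubaoBot	Monthly	Quarterly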

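6.3 Best Practices Summary

Separate training from search: control training crawlers and search crawlers with separate robots.txt rules.
Expose structured data: give generative search crawlers rich structured data.
Monitor access patterns: analyze crawler logs regularly to learn crawl frequency and preferences.
Adapt dynamically: serve the rendering best suited to each crawler type (for example, pre-rendered HTML), keeping the content itself equivalent to what users see.
Stay current: check the official documentation regularly and update the crawler list.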

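7. References

Last updated: July 2025
Recommendation: integrate this list into your CI/CD pipeline so that each deployment automatically checks for crawler UA changes.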
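A CI step for this can be as small as a set comparison that fails the build when the tracked tokens drift. The sketch below is hypothetical: `diff_tokens` and `check` are illustrative names, and where the "current" list comes from (recent access logs, a vendor page you parse yourself) is left as an assumption.

```python
def diff_tokens(known, current):
    """Return (added, removed) crawler tokens, each sorted alphabetically."""
    known_set, current_set = set(known), set(current)
    return sorted(current_set - known_set), sorted(known_set - current_set)


def check(known, current):
    """Print any drift and return a process exit code: 0 if unchanged, 1 otherwise."""
    added, removed = diff_tokens(known, current)
    for token in added:
        print(f"NEW: {token}")
    for token in removed:
        print(f"MISSING: {token}")
    return 1 if (added or removed) else 0


# Illustrative data; in CI, `known` would be loaded from the tracked list.
known = ["GPTBot", "ClaudeBot", "PerplexityBot"]
current = ["GPTBot", "PerplexityBot", "DeepSeek-Bot"]
exit_code = check(known, current)
```

In a pipeline, pass the return value of `check` to `sys.exit` so that a non-zero code stops the deployment until the list is reviewed.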