How to compare two sitemap.xml files with a Python script

admin · Encyclopedia · 2025-12-22

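Comparing two sitemap.xml files in Python means parsing the XML to extract the URLs, normalizing them (lowercase host, no trailing slash, one protocol), recursively handling nested sitemapindex files, then using set operations to find added and missing URLs and printing the result in a readable format.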


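The core of comparing two sitemap.xml files in Python is to parse the XML, extract the URL lists, and compare them as sets or in order. The key points are handling nesting (such as a sitemapindex), duplicate URLs, and format normalization (trailing slashes, a unified protocol), and producing output that is easy to read.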

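Parse and normalize the URL list

A sitemap.xml is plain XML, so xml.etree.ElementTree (standard library, nothing to install) is a good choice for parsing. Note:
• Sitemaps wrap each URL in a <loc></loc> tag, and the elements live in the sitemaps.org namespace, so queries need a namespace mapping;
• In a sitemapindex (which lists several child sitemaps), each <loc></loc> points to a child sitemap rather than a page, so collect those links (typically ending in .xml) and parse each child recursively;
• Suggested URL normalization: lowercase the host, strip the trailing /, and use https:// throughout (if the business requires it).

Example snippet: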

import xml.etree.ElementTree as ET
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    parsed = urlparse(url)
    # Lowercase the host, strip the trailing /, keep params/query/fragment
    path = parsed.path.rstrip('/')
    return urlunparse((parsed.scheme, parsed.netloc.lower(), path,
                       parsed.params, parsed.query, parsed.fragment))

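The normalize_url function above leaves the scheme untouched, although the checklist also mentions unifying on https://. A minimal sketch of that optional step follows; the name normalize_url_https and the force_https parameter are hypothetical, introduced only for illustration, and rewriting the scheme is only appropriate when both protocols really serve the same pages.

```python
from urllib.parse import urlparse, urlunparse

def normalize_url_https(url, force_https=True):
    """Normalize a URL and optionally rewrite http:// to https://.

    Hypothetical variant of normalize_url, shown for illustration only.
    """
    parsed = urlparse(url.strip())
    scheme = parsed.scheme  # urlparse already lowercases the scheme
    if force_https and scheme == "http":
        scheme = "https"
    # Host names are case-insensitive, so lowercasing them is safe;
    # the path is left as-is because paths can be case-sensitive.
    netloc = parsed.netloc.lower()
    path = parsed.path.rstrip("/")
    return urlunparse((scheme, netloc, path,
                       parsed.params, parsed.query, parsed.fragment))

print(normalize_url_https("HTTP://Example.com/Blog/"))
```

Passing force_https=False keeps the original scheme, which is useful when the http and https versions of a site are tracked separately.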

def extract_urls_from_sitemap(file_path):
    urls = set()
    try:
        tree = ET.parse(file_path)
        root = tree.getroot()
        # Namespace defined by the sitemaps.org protocol
        namespaces = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        # First look for ordinary url entries
        for loc in root.findall('.//ns:loc', namespaces):
            if loc.text:
                urls.add(normalize_url(loc.text.strip()))
        # For a sitemapindex, process the child sitemaps recursively
        # (simplified here to reading local files only; real use needs HTTP requests)
    except Exception as e:
        print(f"Failed to parse {file_path}: {e}")
    return urls

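The recursive sitemapindex handling is only described in a comment above. One way it might look for local files is sketched below. It assumes that every child sitemap named in the index has already been downloaded into the same directory under the same file name; the function name collect_page_urls and that directory layout are assumptions made for this example, not part of the original script.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_page_urls(file_path, seen=None):
    """Return the set of page URLs reachable from a local sitemap file.

    Assumption: child sitemaps listed in a sitemapindex sit next to the
    index file and use the file name from their <loc> URL.
    """
    seen = set() if seen is None else seen
    real_path = os.path.abspath(file_path)
    if real_path in seen:  # guard against index files that reference each other
        return set()
    seen.add(real_path)

    root = ET.parse(real_path).getroot()
    urls = set()
    if root.tag.endswith("sitemapindex"):
        # Each <sitemap><loc> points to another sitemap, not to a page
        for loc in root.findall("ns:sitemap/ns:loc", SITEMAP_NS):
            if not loc.text:
                continue
            child_name = os.path.basename(urlparse(loc.text.strip()).path)
            child_path = os.path.join(os.path.dirname(real_path), child_name)
            if os.path.exists(child_path):
                urls |= collect_page_urls(child_path, seen)
            else:
                print(f"Child sitemap not found locally: {child_path}")
    else:
        # A regular urlset: each <url><loc> is a page URL
        for loc in root.findall("ns:url/ns:loc", SITEMAP_NS):
            if loc.text:
                urls.add(loc.text.strip())
    return urls
```

The function returns raw URLs, so each one can still be passed through normalize_url before comparison. Fetching child sitemaps over HTTP would replace the local path lookup and is left out here.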

Run three basic comparisons

Once you have the two normalized URL sets, the usual comparisons are:


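• Only in A (added, when A is the newer sitemap): use urls_a - urls_b
• Only in B (removed or dead, when B is the older sitemap): use urls_b - urls_a
• In both, but with different content (e.g. changed parameters): compare the original strings one by one, or their hashes (e.g. hashlib.md5(url.encode()).hexdigest()); in most cases the set difference after normalization is already enough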
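The set comparisons above, together with a readable report, can be sketched as follows. The helper names diff_url_sets and format_report and the example.com URLs are made up for illustration; in practice urls_a and urls_b would come from extract_urls_from_sitemap.

```python
def diff_url_sets(urls_a, urls_b):
    """Compare two sets of normalized URLs and return sorted differences."""
    return {
        "only_in_a": sorted(urls_a - urls_b),
        "only_in_b": sorted(urls_b - urls_a),
        "common": sorted(urls_a & urls_b),
    }

def format_report(diff, label_a="A", label_b="B"):
    """Render the diff as plain text, one URL per line."""
    lines = []
    sections = (("only_in_a", f"Only in {label_a}"),
                ("only_in_b", f"Only in {label_b}"))
    for key, title in sections:
        lines.append(f"{title} ({len(diff[key])}):")
        lines.extend(f"  {url}" for url in diff[key])
    lines.append(f"In both: {len(diff['common'])}")
    return "\n".join(lines)

# Sample data standing in for two parsed sitemaps
urls_a = {"https://example.com", "https://example.com/new"}
urls_b = {"https://example.com", "https://example.com/old"}
print(format_report(diff_url_sets(urls_a, urls_b), "new sitemap", "old sitemap"))
```

Sorting the differences keeps the report stable between runs, which makes it easier to compare reports over time.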