Finding Hidden External Links in Web Pages with a Python Crawler

2026-01-08 04:30:05
admin

Finding Hidden External Links in Web Pages with a Python Crawler

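Use case:

When a portal site publishes articles whose content was copied from other websites, some of the copied text still carries links back to the original site. The goal is to find those links, together with the URL of the article each one appears in.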
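To make the scenario concrete, here is a minimal sketch of what such a leftover link looks like and how it might be detected on a single page. It is an illustration rather than the script from this post: it uses only the standard library, and the names `AnchorCollector`, `external_anchors`, `portal.example` and `other-site.example` are all made up for the example. It also compares hostnames, which is an assumption that differs from the keyword test used in the script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def external_anchors(html, page_url, site_host):
    """Return anchors whose host is neither site_host nor one of its subdomains."""
    collector = AnchorCollector()
    collector.feed(html)
    found = []
    for href, text in collector.anchors:
        absolute = urljoin(page_url, href)          # resolve relative links
        host = urlparse(absolute).hostname or ""
        if host and host != site_host and not host.endswith("." + site_host):
            found.append((absolute, text))
    return found


# Hypothetical article fragment: the second link came along with the copied text.
ARTICLE_HTML = (
    '<p>See the <a href="/news/1.html">earlier report</a>. '
    'The <a href="http://other-site.example/a/42">final score</a> was close.</p>'
)

if __name__ == "__main__":
    for href, text in external_anchors(
        ARTICLE_HTML, "http://portal.example/news/2.html", "portal.example"
    ):
        print(href, "<-", text)
```

The link text is kept alongside the URL because, in this use case, it tells an editor which words in the article need to be cleaned up.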

Code:

import sys

import requests
from bs4 import BeautifulSoup


def get_url(domain, url, seen=None):
    # Initialize the seen set on the first call
    if seen is None:
        seen = set()
    # Prepend a scheme if the URL does not include one
    if not url.startswith('http://') and not url.startswith('https://'):
        url = 'http://' + url
    try:
        # Send the GET request (the timeout keeps a stalled server from hanging the crawl)
        response = requests.get(url, timeout=10)
        # Raise an exception if the request did not succeed
        response.raise_for_status()
        # Get the response body
        content = response.text
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(content, "html.parser")
        # Find all <a> tags
        links = soup.find_all("a")
        # Read the href attribute of each tag
        for link in links:
            href = link.get("href")
            # Only handle hrefs that exist and are not empty
            if href:
                # If the URL does not start with http or https, join it with the page URL
                if not href.startswith(('http://', 'https://')):
                    href = requests.compat.urljoin(url, href)
                # Skip links to ignore (add URL keywords here as needed)
                if all(substring not in href for substring in ['gov', 'edu', 'ncss']):
                    # Print off-site links (URL lacks the site's domain keyword) and the page they appear on
                    if domain not in href:
                        print(url)
                        print(r'-- ' + href)
                        # Append the link to the output file
                        with open(r'C:\Users\Administrator\Desktop\url_list.txt', 'a', encoding='utf-8') as file:
                            file.write(url + '\n' + r'--' + href + '\n')
                    # Follow on-site URLs (contain the domain keyword) that have not been visited yet
                    if domain in href and href not in seen:
                        # Record the URL in the seen set
                        seen.add(href)
                        # Recurse, passing the seen set along
                        get_url(domain, href, seen)
    except requests.exceptions.RequestException as e:
        print("Request error:", e)


# Entry point
def start():
    # Domain keyword (first command-line argument)
    domain = sys.argv[1]
    # Start page URL (second command-line argument)
    url = sys.argv[2]
    get_url(domain, url)


if __name__ == '__main__':
    start()
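Because get_url calls itself for every newly discovered on-site link, a long chain of pages may approach Python's default recursion limit on a large site. One possible alternative, sketched below under stated assumptions, replaces the recursion with a queue. The function `crawl`, the `fetch_links` parameter and the `FAKE_SITE` data are hypothetical names introduced for this example: `fetch_links(url)` stands in for the requests and BeautifulSoup steps above and is expected to return the absolute hrefs found on a page. The sketch keeps the script's keyword test (`domain in href`) so the two stay comparable, and it leaves out the ignore list and the file output.

```python
from collections import deque


def crawl(domain, start_url, fetch_links):
    """Breadth-first variant of the recursive crawl.

    Returns a list of (page_url, external_href) pairs.
    """
    seen = {start_url}            # the start page counts as visited too
    queue = deque([start_url])    # pages still waiting to be fetched
    external = []
    while queue:
        url = queue.popleft()
        for href in fetch_links(url):
            if domain not in href:
                # Off-site link: record it with the page it was found on
                external.append((url, href))
            elif href not in seen:
                # On-site link not visited yet: schedule it
                seen.add(href)
                queue.append(href)
    return external


# Tiny in-memory "site" standing in for real HTTP responses (made-up URLs).
FAKE_SITE = {
    "http://portal.example/": ["http://portal.example/a", "http://portal.example/b"],
    "http://portal.example/a": ["http://portal.example/", "http://other.example/x"],
    "http://portal.example/b": ["http://portal.example/a"],
}

if __name__ == "__main__":
    pairs = crawl("portal.example", "http://portal.example/",
                  lambda u: FAKE_SITE.get(u, []))
    for page, href in pairs:
        print(page)
        print("-- " + href)
```

Passing the page fetcher in as a parameter also makes it possible to try the traversal logic without touching the network, as the in-memory example does.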

Best regards, yunxi p deng 2024.10.07
