Fetching Page Data with Crawlers - 天翼云 Developer Community

This article walks through two common crawler approaches to make the typical crawling workflow clear.

Crawling API Data

This approach requires knowing in advance which API endpoint to crawl and which data fields to extract.

It uses the requests library, an HTTP library built on urllib3 and released under the Apache 2.0 open-source license.

import requests

After importing it, write the logic that builds the request.

The request fields can be copied directly from the request that the target endpoint receives in the browser.
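As a rough illustration of copying those fields, the sketch below turns header text pasted from the browser's developer tools into a dictionary that can be passed as `headers=`. The helper name `parse_raw_headers` and the sample header values are hypothetical and not part of the original article.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a headers dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        name, sep, value = line.partition(':')
        if sep:  # ignore lines that contain no colon at all
            headers[name.strip()] = value.strip()
    return headers


# Hypothetical header text, as it might be pasted from the Network panel.
raw = """
User-Agent: Mozilla/5.0
Referer: https://example.com/list
Accept: application/json
"""
headers = parse_raw_headers(raw)
print(headers)
```

The resulting dictionary can then be handed to `requests.get(..., headers=headers)`.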

The standard request logic looks like this:

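response = requests.get(
    url='request URL',
    headers={
        'xxx': 'yyy'  # placeholder header name and value
    },
    timeout=10  # avoid hanging forever if the server does not answer
)
region_list = []  # will hold the parsed records
if response.status_code == 200:
    data = response.json()
    print('Data retrieved successfully!')
else:
    print("Failed to retrieve data. Status code:", response.status_code)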

Once the response body arrives, parse out the fields you need; they can then be written to a CSV file.

import csv

# Write to a CSV file, adding rows as required
with open(file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['field1', 'field2', 'field...'])
    print('File written successfully!')
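To make the step from parsed JSON to CSV rows concrete, here is a minimal sketch using `csv.DictWriter`. The payload shape (`{"list": [...]}`), the field names, and the helper `write_records` are assumptions made for illustration; a real endpoint will return its own structure.

```python
import csv
import os
import tempfile


def write_records(records, fields, file_path):
    """Write one CSV row per record, keeping only the chosen fields."""
    with open(file_path, mode='w', newline='', encoding='utf-8') as file:
        # extrasaction='ignore' drops keys that are not listed in `fields`.
        writer = csv.DictWriter(file, fieldnames=fields, extrasaction='ignore')
        writer.writeheader()
        for record in records:
            # Keys missing from a record are written as empty cells.
            writer.writerow(record)
    return len(records)


# Hypothetical payload shaped like the result of response.json()
data = {"list": [
    {"id": 1, "name": "east", "extra": "ignored"},
    {"id": 2, "name": "west"},
]}

file_path = os.path.join(tempfile.gettempdir(), "regions_example.csv")
count = write_records(data["list"], ["id", "name"], file_path)

# Read the file back to check what was written.
with open(file_path, newline='', encoding='utf-8') as file:
    rows = list(csv.reader(file))
print(rows)
```

`DictWriter` is used here instead of `csv.writer` so that each row is selected by field name rather than by position, which tends to be less error-prone when the response carries more keys than are needed.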

To keep it clear which file holds which crawled data, the output file can also be renamed.
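One possible naming scheme is sketched below: a label for the data source plus the crawl date. The article does not prescribe a scheme, so the helpers `build_output_name` and `rename_output` and the label `region/list` are purely illustrative.

```python
import tempfile
from datetime import date
from pathlib import Path


def build_output_name(source, day, suffix=".csv"):
    """Build a name like 'region_list_20240105.csv' from a source label and a date."""
    # Replace characters that are awkward in file names with underscores.
    safe = "".join(ch if ch.isalnum() or ch in "-_" else "_" for ch in source)
    return f"{safe}_{day:%Y%m%d}{suffix}"


def rename_output(path, source, day):
    """Rename an existing output file in place and return the new path."""
    target = path.with_name(build_output_name(source, day, path.suffix))
    path.replace(target)  # overwrites an existing file with the same name
    return target


# Demonstration inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "output.csv"
    original.write_text("field1,field2\n", encoding="utf-8")
    renamed = rename_output(original, "region/list", date(2024, 1, 5))
    renamed_name = renamed.name
    renamed_exists = renamed.exists()
print(renamed_name)
```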

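Some endpoints, however, encrypt the content of the response. In that case, locate the corresponding decryption code (usually a script shipped with the page) and apply the same decryption to the response.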

Crawling DOM Data

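The data to crawl is not always returned by an API; some of it may be hard-coded in the front end. You could simply copy that data to a local file, but the page may change later. To avoid finding and copying the data from the page every time, use a crawler to scrape the relevant part of the DOM directly.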

DOM crawling uses the Selenium library, a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it.

# Provides methods for simulating user interaction with a web page
from selenium import webdriver
# High-level API for performing mouse and keyboard actions in WebDriver
from selenium.webdriver import ActionChains
# Provides the strategies used to locate page elements
from selenium.webdriver.common.by import By
# Provides explicit waits
from selenium.webdriver.support.ui import WebDriverWait
# A set of predefined wait conditions
from selenium.webdriver.support import expected_conditions as EC
# Service used to start the browser driver process
from selenium.webdriver.chrome.service import Service

Before using these packages, download the driver that matches your browser; it is the channel through which the browser and Selenium communicate.

The driver version that matches your Chrome version can be looked up at chromedriver.com.

Selenium needs some configuration up front, as shown below:

# Directory where downloaded files are saved
download_dir = "C:\\Users\\Administrator\\Desktop\\Data"
# Used to configure the behavior of the Chrome browser
chrome_options = webdriver.ChromeOptions()
# Add a set of preferences
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": download_dir, # save to the specified directory
    "download.prompt_for_download": False, # disable the prompt shown before downloading
    "safebrowsing.enabled": True # enable Safe Browsing to check downloaded files
})

chromedriver_path = 'C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe'
# Point the service at the ChromeDriver executable
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)
wait = WebDriverWait(driver, 10)  # explicit waits time out after 10 seconds

Next, launch the browser and crawl the target DOM:

import time

from bs4 import BeautifulSoup

try:
    driver.get(url)

    # Click the "yyy" button
    button_xpath = "//div[@class='xxx' and text()='yyy']"
    button = wait.until(EC.element_to_be_clickable((By.XPATH, button_xpath)))
    button.click()

    # Parse the current page source
    page_dom = driver.page_source
    soup = BeautifulSoup(page_dom, 'html.parser')
    iframe_id_element = soup.find('iframe', id='iframe')
    if iframe_id_element:
        # An iframe with id "iframe" was found: open its content directly.
        # The src is assumed to be protocol-relative (it starts with "//").
        src_value = iframe_id_element['src']
        full_url = 'https:' + src_value
        driver.get(full_url)
        wait = WebDriverWait(driver, 10)
        time.sleep(10)
        trigger_element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "ant-dropdown-trigger")))
        # Hover over the dropdown trigger
        action = ActionChains(driver)
        action.move_to_element(trigger_element).perform()

        # Wait a moment so the dropdown menu is fully expanded
        time.sleep(4)

        sub_menu_path = "//li[@class='aaa' and text()='bbb']"
        sub_menu_element = wait.until(EC.element_to_be_clickable((By.XPATH, sub_menu_path)))
        # Move from the trigger to the menu item and click it
        action.move_to_element(sub_menu_element).click().perform()
    else:
        print("No iframe element with id 'iframe' was found")

finally:
    # Wait a while so the file has finished downloading
    time.sleep(10)
    driver.quit()
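The snippet above builds the iframe address with `'https:' + src_value`, which only works when `src` is protocol-relative. As a hedged alternative, the standard library's `urljoin` resolves the `src` against the page URL and also copes with relative and absolute values. The helper name `resolve_iframe_src` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_iframe_src(page_url, src_value):
    """Resolve an iframe's src attribute against the URL of the page that embeds it."""
    return urljoin(page_url, src_value)


# Protocol-relative src: the scheme is taken from the page URL.
protocol_relative = resolve_iframe_src("https://example.com/report", "//cdn.example.com/frame")
# Root-relative src: scheme and host are taken from the page URL.
root_relative = resolve_iframe_src("https://example.com/a/report", "/embed/1")
# Absolute src: returned unchanged.
absolute = resolve_iframe_src("https://example.com/report", "https://other.example.com/x")

print(protocol_relative, root_relative, absolute, sep="\n")
```

In the Selenium flow this would be called as `resolve_iframe_src(driver.current_url, src_value)` before `driver.get(...)`.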

For details on locating DOM elements and interacting with them, consult the Selenium documentation; they are not repeated here.
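The fixed `time.sleep(10)` before `driver.quit()` may be too short for large files and wasteful for small ones. A minimal sketch of an alternative is to poll the download directory until no partial file remains. It relies on Chrome giving in-progress downloads a `.crdownload` suffix and assumes the directory is empty before the download starts; the helper name `wait_for_downloads` and its parameters are hypothetical.

```python
import tempfile
import time
from pathlib import Path


def wait_for_downloads(directory, timeout=30.0, poll=0.5):
    """Return True once the directory holds files and none is a partial download."""
    deadline = time.monotonic() + timeout
    folder = Path(directory)
    while time.monotonic() < deadline:
        files = [p for p in folder.iterdir() if p.is_file()]
        partial = [p for p in files if p.suffix == ".crdownload"]
        if files and not partial:
            return True
        time.sleep(poll)
    return False  # timed out: nothing arrived or a download is still running


# Simulate the three stages of a download in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = wait_for_downloads(tmp, timeout=0.2, poll=0.05)
    partial_file = Path(tmp) / "report.csv.crdownload"
    partial_file.touch()
    partial_result = wait_for_downloads(tmp, timeout=0.2, poll=0.05)
    partial_file.rename(Path(tmp) / "report.csv")
    done_result = wait_for_downloads(tmp, timeout=0.2, poll=0.05)
print(empty_result, partial_result, done_result)
```

In the crawler, `wait_for_downloads(download_dir)` could replace the final `time.sleep(10)` in the `finally` block.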

However, some websites block Selenium outright:

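A Chrome instance launched by Selenium exposes values that differ from those of a normal browser session, for example window.navigator.webdriver. In a normal Chrome session this property is undefined (or false in recent versions), while in a Selenium-driven Chrome it is true. The site serves JavaScript that reads this value and reports it back; if it is true, the site treats the visitor as a crawler and blocks access.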

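We therefore need to disguise Selenium by changing some parameters before the page is requested so that the check is bypassed. It is worth reading up on how the target site detects Selenium; any other values it checks can be overridden in the same way: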

# For Chrome versions before 79, use the following two lines
#chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
#chrome_options.add_experimental_option('useAutomationExtension', False)

# For Chrome 79 and later, use this instead
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})
driver.get("put the URL of the site that blocks you here")

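Which crawler approach fits best depends on the specific requirements and on the team's technology stack. The two approaches above are only a sample of what is possible; additions are welcome!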

