Scraping Website Data (Four Ways to Scrape Website Data with Python)

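Preface

First, the crawler's plan: from the first page (https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0) we collect the URLs of 500 celebrity pages, then crawl each of those 500 pages for the person's name and description. Pages with no description are skipped.

We then implement this crawler in four ways and weigh the strengths and weaknesses of each, in the hope of giving readers a better feel for web scraping. The four approaches are:

General method (synchronous, requests + BeautifulSoup)
Concurrency (the concurrent.futures module with requests + BeautifulSoup)
Asynchronous (aiohttp + asyncio + requests + BeautifulSoup)
The Scrapy framework

General method

The general method is the synchronous one: it mainly uses requests + BeautifulSoup and issues the requests one after another. The complete Python code is as follows:

import requests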
from bs4 import BeautifulSoup
import time
# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the list items that link to each name/description page
human_list = soup.find(id='mw-whatlinkshere-list')('li')

urls = []
# Collect the URLs
for human in human_list:
    url = human.find('a')['href']
    urls.append('https://www.wikidata.org'+url)

# Get the name and description from each page
def parser(url):
    req = requests.get(url)
    # Parse the fetched text as HTML with BeautifulSoup
    soup = BeautifulSoup(req.text, "lxml")
    # Extract name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    if name is not None and desc is not None:
        print('%-40s,\t%s'%(name.text, desc.text))

for url in urls:
    parser(url)

t2 = time.time() # End time
print('General method, total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle is omitted and replaced with ……):

##################################################
George Washington , first President of the United States
Douglas Adams , British author and humorist (1952–2001)
……
Willoughby Newton , Politician from Virginia, USA
Mack Wilberg , American conductor
General method, total time: 724.9654655456543
##################################################
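One detail in the code above is that relative hrefs are turned into absolute URLs by string concatenation. The sketch below shows the same step with urllib.parse.urljoin, which also copes with hrefs that are already absolute. To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup, and the class name FirstLinkPerItem and the sample markup are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.wikidata.org"  # site root used in the article


class FirstLinkPerItem(HTMLParser):
    """Collect the href of the first <a> inside each <li>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._in_li = False
        self._seen_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._seen_link = False
        elif tag == "a" and self._in_li and not self._seen_link:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
                self._seen_link = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False


# Hypothetical sample markup, shaped like the list the article parses.
SAMPLE_HTML = """
<ul id="mw-whatlinkshere-list">
  <li><a href="/wiki/Q23">Q23</a> <a href="/wiki/Special:Foo">links</a></li>
  <li><a href="/wiki/Q42">Q42</a></li>
</ul>
"""

collector = FirstLinkPerItem()
collector.feed(SAMPLE_HTML)
# urljoin resolves each relative href against the site root.
urls = [urljoin(BASE_URL, href) for href in collector.hrefs]
print(urls)
```

Only the first link of each list item is kept, which mirrors what human.find('a') does in the article's code.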
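The synchronous method took about 725 seconds in total, a little over 12 minutes. The general method is simple in concept and easy to implement, but it is inefficient and slow. So let's try concurrency.

Concurrent method

The concurrent method uses multiple threads to speed up the general method. We use the concurrent.futures module and set the thread pool size to 20 (the pool may not actually reach that many threads, depending on the machine). The complete Python code is as follows:

import requests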
from bs4 import BeautifulSoup
import time
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

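# Start time
t1 = time.time()
print('#' * 50)

# Fetch the list page, exactly as in the general method
url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, "lxml")
human_list = soup.find(id='mw-whatlinkshere-list')('li')

urls = []
# Collect the URLs
for human in human_list:
    url = human.find('a')['href']
    urls.append('https://www.wikidata.org'+url)

# Get the name and description from each page (same parser as in the general method)
def parser(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    if name is not None and desc is not None:
        print('%-40s,\t%s'%(name.text, desc.text))

# Use concurrency to speed up crawling
executor = ThreadPoolExecutor(max_workers=20)
# submit() arguments: the function first, then the arguments to pass to it (there may be several)
future_tasks = [executor.submit(parser, url) for url in urls]
# Wait for all threads to finish before continuing
wait(future_tasks, return_when=ALL_COMPLETED)

t2 = time.time() # End time
print('Concurrent method, total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle is omitted and replaced with ……):

##################################################
Larry Sanger , American former professor, co-founder of Wikipedia, founder of Citizendium and other projects
Ken Jennings , American game show contestant and writer
……
Antoine de Saint-Exupery , French writer and aviator
Michael Jackson , American singer, songwriter and dancer
Concurrent method, total time: 226.7499692440033
##################################################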
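Because the tasks above are handed to executor.submit() and each one prints as soon as it finishes, the order of the output depends on thread scheduling. If ordered results matter, one option is executor.map(), which returns results in input order regardless of which thread finishes first. The sketch below is a variation rather than the article's code, and fake_fetch is a hypothetical stand-in that sleeps instead of making an HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(item_id):
    """Hypothetical stand-in for parser(url): later items finish sooner."""
    time.sleep(0.05 * (5 - item_id))
    return "page-%d" % item_id


item_ids = [1, 2, 3, 4]

# Leaving the with-block waits for all workers, so no explicit wait() is needed.
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in the order of item_ids, not in completion order.
    results = list(executor.map(fake_fetch, item_ids))

print(results)
```

Returning values from the worker and printing them afterwards, instead of printing inside the worker, is what makes the ordering controllable.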
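With multithreading, the crawler ran in about 227 seconds, roughly one third of the general method's time, which is a clear speedup. Multithreading is noticeably faster, but the pages are processed in no particular order, and thread switching carries overhead that grows with the number of threads.

Asynchronous method

Asynchronous code is an effective way to speed up a crawler: aiohttp handles HTTP requests asynchronously, and asyncio provides asynchronous I/O. Note that, at the time of writing, aiohttp only supported Python 3.5.3 and later. The complete Python code for the asynchronous crawler is as follows:

import requests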
from bs4 import BeautifulSoup
import time
import aiohttp
import asyncio

# Start time
t1 = time.time()
print('#' * 50)

# Fetch the list page, exactly as in the general method
url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, "lxml")
human_list = soup.find(id='mw-whatlinkshere-list')('li')

urls = []
# Collect the URLs
for human in human_list:
    url = human.find('a')['href']
    urls.append('https://www.wikidata.org'+url)

# Asynchronous HTTP request
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

# Parse the page
async def parser(html):
    # Parse the fetched text as HTML with BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Extract name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    if name is not None and desc is not None:
        print('%-40s,\t%s'%(name.text, desc.text))

# Process a page: fetch it, then extract name and description
async def download(url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(session, url)
            await parser(html)
        except Exception as err:
            print(err)

# Run all downloads concurrently with asyncio
async def main():
    await asyncio.gather(*(download(url) for url in urls))

# asyncio.run() requires Python 3.7+
asyncio.run(main())

t2 = time.time() # End time
print('Asynchronous method, total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle is omitted and replaced with ……):

##################################################
Frédéric Taddeï , French journalist and TV host
Gabriel Gonzáles Videla , Chilean politician
……
Denmark , sovereign state and Scandinavian country in northern Europe
Usain Bolt , Jamaican sprinter and soccer player
Asynchronous method, total time: 126.9002583026886
##################################################
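The code above starts all 500 download tasks at once. A common refinement, which is not part of the original code, is to cap the number of requests in flight with asyncio.Semaphore. The sketch below illustrates the idea under stated assumptions: fake_download is a hypothetical stand-in that sleeps instead of calling aiohttp, and MAX_CONCURRENT is an arbitrary limit chosen for the demo.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed limit for the demo; tune it for the target site


async def fake_download(url, semaphore, state):
    """Hypothetical stand-in for download(url); sleeps instead of doing HTTP."""
    async with semaphore:  # at most MAX_CONCURRENT coroutines pass this point
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # placeholder for the network round trip
        state["active"] -= 1
        return url


async def main(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}
    # gather() returns results in the order the coroutines were passed in.
    results = await asyncio.gather(
        *(fake_download(url, semaphore, state) for url in urls)
    )
    return results, state["peak"]


urls = ["https://www.wikidata.org/wiki/Q%d" % i for i in range(1, 11)]
results, peak = asyncio.run(main(urls))
print(len(results), "pages, peak concurrency", peak)
```

In a real crawler the sleep would be replaced by the aiohttp request, and the semaphore would keep the load on the server bounded.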
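Clearly, the asynchronous method combines two speedup techniques, asynchrony and concurrency, so it is naturally much faster: about one sixth of the general method's time. It is efficient, but it requires asynchronous programming, which takes a while to learn.

If 127 seconds still seems slow, try the following asynchronous code. The only difference from the previous asynchronous version is that it uses regular expressions instead of BeautifulSoup to parse the page and extract its content:

import requests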
from bs4 import BeautifulSoup
import time
import aiohttp
import asyncio
import re

# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the list items that link to each name/description page
human_list = soup.find(id='mw-whatlinkshere-list')('li')

urls = []
# Collect the URLs
for human in human_list:
    url = human.find('a')['href']
    urls.append('https://www.wikidata.org' + url)

# Asynchronous HTTP request
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

# Parse the page
async def parser(html):
    # Extract name and description with regular expressions
    try:
        name = re.findall(r'<span class="wikibase-title-label">(.+?)</span>', html)[0]
        desc = re.findall(r'<span class="wikibase-descriptionview-text">(.+?)</span>', html)[0]
        print('%-40s,\t%s' % (name, desc))
    except IndexError:
        # Skip pages that have no name or no description
        pass

# Process a page (same download() as in the previous asynchronous version)
async def download(url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(session, url)
            await parser(html)
        except Exception as err:
            print(err)

# Run all downloads concurrently with asyncio
async def main():
    await asyncio.gather(*(download(url) for url in urls))

# asyncio.run() requires Python 3.7+
asyncio.run(main())

t2 = time.time() # End time
print('Asynchronous method (regular expressions), total time: %s' % (t2 - t1))
print('#' * 50)

The output is as follows (the middle is omitted and replaced with ……):

##################################################
Dejen Gebremeskel , Ethiopian long-distance runner
Erik Kynard , American high jumper
……
Buzz Aldrin , American astronaut
Egon Krenz , former General Secretary of the Socialist Unity Party of East Germany
Asynchronous method (regular expressions), total time: 16.521944999694824
##################################################
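The regular-expression step can be tried offline on a small string. The sketch below reuses the two patterns from the code above but swaps re.findall(...)[0] for re.search, so a missing field shows up as None rather than as an exception. The helper name extract and the sample snippets are hypothetical, made up for illustration.

```python
import re

# Same patterns as in the article; the non-greedy (.+?) stops at the first </span>.
NAME_RE = re.compile(r'<span class="wikibase-title-label">(.+?)</span>')
DESC_RE = re.compile(r'<span class="wikibase-descriptionview-text">(.+?)</span>')


def extract(html):
    """Return (name, desc), or None if either field is missing."""
    name = NAME_RE.search(html)
    desc = DESC_RE.search(html)
    if name is None or desc is None:
        return None
    return name.group(1), desc.group(1)


# Hypothetical snippets shaped like the spans the patterns target.
with_desc = (
    '<span class="wikibase-title-label">Douglas Adams</span>'
    '<span class="wikibase-descriptionview-text">British author</span>'
)
without_desc = '<span class="wikibase-title-label">Q0</span>'

print(extract(with_desc))
print(extract(without_desc))
```

Patterns like these depend on the exact attribute text in the page source, so they are more brittle than an HTML parser if the markup changes.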
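16.5 seconds is only about 1/44 of the general method's time, which is astonishingly fast.

Scrapy crawler framework

Finally, we solve the same task with Scrapy, the well-known Python crawler framework. We create a crawler project named wikiDataScrapy, with the following structure:

[Figure: structure of the wikiDataScrapy project]

In settings.py, set "ROBOTSTXT_OBEY = False". Then modify items.py as follows:

# -*- coding: utf-8 -*-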

import scrapy

class WikidatascrapyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    desc = scrapy.Field()
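For reference, the settings.py change mentioned above amounts to a single line. The sketch below also adds FEED_EXPORT_FIELDS, a standard Scrapy feed-export setting that fixes the order of exported columns. That second line is an optional addition and an assumption about what is wanted here; it is not part of the original project.

```python
# settings.py (sketch): only the lines relevant to this article

# From the article: do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Assumption, not in the original project: pin the column order of exported
# feeds to the two fields declared in WikidatascrapyItem.
FEED_EXPORT_FIELDS = ["name", "desc"]
```

With the field list in place, a CSV export would be expected to write the name column before the desc column.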
Next, create wikiSpider.py in the spiders folder with the following code:

import scrapy.cmdline
from wikiDataScrapy.items import WikidatascrapyItem
import requests
from bs4 import BeautifulSoup

# Get the 500 URLs to request, using requests + BeautifulSoup
def get_urls():
    url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    # Send the HTTP request
    req = requests.get(url, headers=headers)
    # Parse the page
    soup = BeautifulSoup(req.text, "lxml")
    # Find the list items that link to each name/description page
    human_list = soup.find(id='mw-whatlinkshere-list')('li')

    urls = []
    # Collect the URLs
    for human in human_list:
        url = human.find('a')['href']
        urls.append('https://www.wikidata.org' + url)

    # print(urls)
    return urls

# Crawl with the Scrapy framework
class bookSpider(scrapy.Spider):
    name = 'wikiScrapy' # Spider name
    start_urls = get_urls() # The 500 URLs to crawl

    def parse(self, response):
        item = WikidatascrapyItem()
        # name and description
        item['name'] = response.css('span.wikibase-title-label').xpath('text()').extract_first()
        item['desc'] = response.css('span.wikibase-descriptionview-text').xpath('text()').extract_first()

        yield item

# Run the spider and export the results to a CSV file.
# The guard keeps the command from running again when Scrapy imports this module.
if __name__ == '__main__':
    scrapy.cmdline.execute(['scrapy', 'crawl', 'wikiScrapy', '-o', 'wiki.csv', '-t', 'csv'])
The output is as follows (only the final Scrapy stats summary is shown):

{'downloader/request_bytes': 166187,
'downloader/request_count': 500,
'downloader/request_method_count/GET': 500,
'downloader/response_bytes': 18988798,
'downloader/response_count': 500,
'downloader/response_status_count/200': 500,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 16, 9, 49, 15, 761487),
'item_scraped_count': 500,
'log_count/DEBUG': 1001,
'log_count/INFO': 8,
'response_received_count': 500,
'scheduler/dequeued': 500,
'scheduler/dequeued/memory': 500,
'scheduler/enqueued': 500,
'scheduler/enqueued/memory': 500,
'start_time': datetime.datetime(2018, 10, 16, 9, 48, 44, 58673)}
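The elapsed time of the run can be read off these stats by subtracting start_time from finish_time. A small sketch of that arithmetic, using the two values exactly as they appear above:

```python
import datetime

# Values copied from the 'start_time' and 'finish_time' entries in the stats.
start_time = datetime.datetime(2018, 10, 16, 9, 48, 44, 58673)
finish_time = datetime.datetime(2018, 10, 16, 9, 49, 15, 761487)

# Subtracting two datetimes gives a timedelta.
elapsed = finish_time - start_time
print("elapsed: %.1f seconds" % elapsed.total_seconds())
```

This covers the whole crawl as Scrapy measured it, so it is not directly comparable to the time.time() measurements in the earlier scripts, which also include fetching the list page.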
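As the stats show, all 500 pages were crawled successfully in about 31.7 seconds, which is also quite fast. Now look at the generated wiki.csv file, which contains every name and description that was output:

[Figure: contents of wiki.csv]

Note that the columns of the output CSV file are not in a fixed order. How to fix the line-break problem in Scrapy's CSV output is not covered here.

The advantage of building a crawler with Scrapy is that it is a mature framework: it supports asynchronous and concurrent crawling and is fairly fault tolerant (for example, this code never handles the case where name or description is missing). If you need to modify the middleware frequently, however, writing your own crawler is still the better choice, and Scrapy was not faster here than our hand-written asynchronous crawler. Its ability to export CSV files automatically, though, is genuinely useful.

Summary

This article is fairly long and compares four ways of writing a crawler. Each has its own pros and cons, as discussed above. In practice, a more advanced tool or method is not automatically better; each problem should be analyzed on its own terms.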
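To put the measurements side by side, the reported timings can be converted into speedup factors relative to the general method. The sketch below only does that arithmetic. The first four numbers are copied from the outputs shown earlier, and the Scrapy figure is derived from the start_time and finish_time in its stats, so it is measured differently from the others.

```python
# Timings in seconds, copied from the outputs shown in the article.
timings = {
    "general (synchronous)": 724.9654655456543,
    "threads": 226.7499692440033,
    "async + BeautifulSoup": 126.9002583026886,
    "async + regex": 16.521944999694824,
    # Derived from Scrapy's start_time/finish_time, not from time.time().
    "Scrapy": 31.702814,
}

baseline = timings["general (synchronous)"]
# Speedup factor: how many times faster than the general method.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print("%-24s %6.1fx" % (name, factor))
```

These are single runs against a live site, so the factors are best read as rough indications rather than benchmarks.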
