URL handling library (urllib)

Use urllib.request.urlopen to open a given URL

import urllib.request  # a bare "import urllib" does not reliably load the submodules
import urllib.parse

url = "https://www.google.com/"

# Send a browser-like User-Agent so the request is less likely to be treated as a crawler

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

req = urllib.request.Request(url, headers = headers)

com = urllib.request.urlopen(req)

print(com)

Output: <http.client.HTTPResponse object at 0x000001730E6D6A00>

data = com.read()

print(data)

Output:

b'<!doctype html><html itemscope=... (rest omitted)

# This is the source of the page at url, returned as a bytes object
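
The bytes returned by com.read() usually have to be decoded before they can be handled as text. The following is a minimal sketch of one way to do that, assuming UTF-8 as the fallback when the server declares no charset. The helper names build_request, decode_body, and fetch_text, the short default User-Agent, and the 10-second timeout are hypothetical choices for illustration, not part of urllib.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(raw, charset=None):
    """Decode response bytes; fall back to UTF-8 (an assumption) if no charset is given."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url, user_agent="Mozilla/5.0", timeout=10):
    """Open the URL and return the page source as a str instead of bytes."""
    req = build_request(url, user_agent)
    # The with-statement closes the connection once the body has been read.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset()  # None if not declared
        return decode_body(resp.read(), charset)
```

Calling fetch_text("https://www.google.com/") would then return the page source as a string, provided the network request succeeds.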

Query string

https://www.google.com/search?q=%E7%B5%B1%E7%A5%9E&rlz=1C1FKPE_zh-TWTW994TW994&sxsrf=ALiCzsZiy-TNWmpg0UeiDM0lgRI4y3sCAA:1653720960520&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjC-qLTzoH4AhUIPXAKHQgsAaYQ_AUoAXoECAIQAw#imgrc=znqSniM0U9gtCM

This is the URL produced by searching Google for 統神; we will use it as the example to parse.

search_url="https://www.google.com/search?q=%E7%B5%B1%E7%A5%9E&rlz=1C1FKPE_zh-TWTW994TW994&sxsrf=ALiCzsZiy-TNWmpg0UeiDM0lgRI4y3sCAA:1653720960520&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjC-qLTzoH4AhUIPXAKHQgsAaYQ_AUoAXoECAIQAw#imgrc=znqSniM0U9gtCM"

print(search_url.split('&'))

Output:

['https://www.google.com/search?q=%E7%B5%B1%E7%A5%9E', 'rlz=1C1FKPE_zh-TWTW994TW994', 'sxsrf=ALiCzsZiy-TNWmpg0UeiDM0lgRI4y3sCAA:1653720960520', 'source=lnms', 'tbm=isch', 'sa=X', 'ved=2ahUKEwjC-qLTzoH4AhUIPXAKHQgsAaYQ_AUoAXoECAIQAw#imgrc=znqSniM0U9gtCM']

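These pieces make up the query string: in this example, the parameters Google set when it performed the search. Note that a plain split on '&' leaves the scheme, host, and path attached to the first item and the fragment (#imgrc=...) attached to the last.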

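The most important ones are

'tbm=isch', which selects image search (isch is short for image search)

and

'q=%E7%B5%B1%E7%A5%9E', which is the percent-encoded search keyword.

So entering https://www.google.com/search?q=keyword&tbm=isch

is enough to search for the images you want.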
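
A URL of that shape can be assembled with urllib.parse.urlencode, which percent-encodes the keyword. This is a small sketch, assuming only the q and tbm parameters are needed; the function name image_search_url is hypothetical.

```python
from urllib.parse import urlencode


def image_search_url(keyword):
    """Build a Google image-search URL for the given keyword."""
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
    query = urlencode({"q": keyword, "tbm": "isch"})
    return "https://www.google.com/search?" + query


print(image_search_url("統神"))
# https://www.google.com/search?q=%E7%B5%B1%E7%A5%9E&tbm=isch
```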

Parsing and decoding

u = urllib.parse.urlparse(search_url)  # returns a ParseResult object

print(u)

Output:

ParseResult(scheme='https', netloc='www.google.com', path='/search', params='', query='q=%E7%B5%B1%E7%A5%9E&rlz=1C1FKPE_zh-TWTW994TW994&sxsrf=ALiCzsZiy-TNWmpg0UeiDM0lgRI4y3sCAA:1653720960520&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjC-qLTzoH4AhUIPXAKHQgsAaYQ_AUoAXoECAIQAw', fragment='imgrc=znqSniM0U9gtCM')

The query string is clearly visible in the query field, which sits at index 4 of the result (counting from zero).

print(u[4])

Output:

q=%E7%B5%B1%E7%A5%9E&rlz=1C1FKPE_zh-TWTW994TW994&sxsrf=ALiCzsZiy-TNWmpg0UeiDM0lgRI4y3sCAA:1653720960520&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjC-qLTzoH4AhUIPXAKHQgsAaYQ_AUoAXoECAIQAw
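
The fields of a ParseResult can also be read by name, which tends to be easier to follow than numeric indexes. The sketch below uses a shortened form of the search URL (an assumption made for readability; only the q and tbm parameters and the fragment are kept).

```python
from urllib.parse import urlparse

# Shortened version of the example URL, kept small for illustration.
short_url = "https://www.google.com/search?q=%E7%B5%B1%E7%A5%9E&tbm=isch#imgrc=znqSniM0U9gtCM"

u = urlparse(short_url)

print(u.scheme)    # https
print(u.netloc)    # www.google.com
print(u.path)      # /search
print(u.query)     # q=%E7%B5%B1%E7%A5%9E&tbm=isch
print(u.fragment)  # imgrc=znqSniM0U9gtCM

# Named access and index access refer to the same field.
print(u.query == u[4])  # True
```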

We can parse it with urllib.parse.parse_qs, which also decodes the percent-encoded values:

print(urllib.parse.parse_qs(u[4]))

Output:

{'q': ['統神'], 'rlz': ['1C1FKPE_zh-TWTW994TW994'], 'sxsrf': ['ALiCzsZiy-TNWmpg0UeiDM0lgRI4y3sCAA:1653720960520'], 'source': ['lnms'], 'tbm': ['isch'], 'sa': ['X'], 'ved': ['2ahUKEwjC-qLTzoH4AhUIPXAKHQgsAaYQ_AUoAXoECAIQAw']}

The keyword "統神" now appears in decoded form.
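
Because parse_qs maps every key to a list of values, a single value has to be picked out explicitly. The sketch below shows one way to do that, along with urllib.parse.unquote and urllib.parse.quote for decoding and re-encoding a single percent-encoded string. The helper name first_value is hypothetical, and the query used here is a shortened form of the one above.

```python
from urllib.parse import parse_qs, quote, unquote


def first_value(query, key, default=None):
    """Return the first value stored under key in a query string, or default."""
    values = parse_qs(query).get(key)
    return values[0] if values else default


encoded = "%E7%B5%B1%E7%A5%9E"
keyword = unquote(encoded)        # percent-decoding, UTF-8 by default
print(keyword)                    # 統神
print(quote(keyword) == encoded)  # True: quote() reverses the decoding

short_query = "q=%E7%B5%B1%E7%A5%9E&tbm=isch"
print(first_value(short_query, "q"))    # 統神
print(first_value(short_query, "tbm"))  # isch
```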

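References: 如何使用urllib套件取得網路資源 (How to use the urllib package to fetch web resources)

通過 User-Agent 識別爬蟲的原理、實踐與對應的繞過方法 (How crawlers are identified through the User-Agent: principles, practice, and corresponding workarounds)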
