Fetching Google Search Results with Requests | Laplace's Lab

“We're believers that the best way to learn something is to do it.”

2018-09-03 | Data Science

First, take a look at the Google Search URL. Run a search with any keyword and you will see that the search URL looks like this:

http://www.google.com.tw/search?q=

The text after the "=" is the search keyword. Looking at the page source, each search result sits inside a div block with class="g".
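The two observations above can be tried out without touching the network. The following is only a minimal sketch using the standard library: `build_search_url`, `ResultLinkParser`, and `SAMPLE_HTML` are hypothetical names made up for illustration, and the sample markup is a hand-written stand-in that assumes results sit in `div class="g"`, not a copy of a real Google page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = 'http://www.google.com.tw/search'


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded after q=."""
    return BASE_URL + '?' + urlencode({'q': keyword})


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div class="g">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth while inside a class="g" block
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            classes = (attrs.get('class') or '').split()
            # Start counting at a class="g" div, keep counting inside it
            if self.depth or 'g' in classes:
                self.depth += 1
        elif tag == 'a' and self.depth and attrs.get('href'):
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


# Hand-written sample markup (an assumption, not real Google output)
SAMPLE_HTML = '''
<div class="g"><div class="r"><a href="https://example.com/a">A</a></div></div>
<div class="other"><a href="https://example.com/skip">skip</a></div>
'''

parser = ResultLinkParser()
parser.feed(SAMPLE_HTML)
print(build_search_url('python requests'))
print(parser.links)
```

Encoding the keyword matters as soon as it contains spaces or non-ASCII characters; plain string concatenation would leave them raw in the URL.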

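Since crawlers can roam around like this, there are naturally sites that do not welcome them; letting a large number of crawlers run wild on your site puts real strain on the server. Web crawlers therefore use a number of techniques to make their traffic look like a human visitor to the server: for example, adding a User-Agent header to the request so that it looks like a browser, or inserting random delays between requests, which both mimics human behavior and avoids overloading someone else's server.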
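As a rough illustration of those two techniques together, here is a minimal sketch. The names `random_headers` and `polite_fetch_all`, the delay bounds, and the short User-Agent pool are all assumptions chosen for the example. The actual HTTP call is passed in as `fetch`, so the sketch itself does not depend on any particular library.

```python
import random
import time

# Hypothetical pool of browser User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}


def polite_fetch_all(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing randomly in between.

    The delay bounds (in seconds) are arbitrary defaults, not recommendations.
    """
    results = []
    for i, url in enumerate(urls):
        if i:  # wait between requests, not before the first one
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url, random_headers()))
    return results
```

With Requests, `fetch` could be something like `lambda url, headers: requests.get(url, headers=headers, timeout=10)`. Injecting `sleep` also makes the pacing easy to check without actually waiting.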

Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import random
import requests as rq
from bs4 import BeautifulSoup as bs

user_agent = ["Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"]

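target = input('search:')
url = 'http://www.google.com.tw/search'
try:
    # Pass the keyword via params so that requests URL-encodes it
    res = rq.get(url=url, params={'q': target},
                 headers={'User-Agent': random.choice(user_agent)}, timeout=10)
    res.raise_for_status()
except rq.exceptions.RequestException as err:
    # Stop here: without a successful response there is nothing to parse
    raise SystemExit('[Request_Error] {}'.format(err))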

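soup = bs(res.text, 'html.parser')
# Selector matches Google's markup when this post was written; it may need updating
link = soup.select('.g .r a')

# Print the first two results, or fewer if the page returned fewer
for a in link[:2]:
    print(a.get_text())  # title
    print(a['href'])     # link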

Output:
