Python:正則式與re模組

什麼是正則式(Regex)？常見的說法有正規表示式、正則表達式…等等，維基百科是如此描述的：使用單個字串來描述、符合某個句法規則的字串。在很多文字編輯器裡，正則運算式通常被用來檢索、替換那些符合某個模式的文字，總之就是「描述某種規則的表達式」。

舉例來說，email address其格式具有一定規則，假設要在一堆密密麻麻的html中尋找Gmail郵件地址，就必須使用正則式來描述過濾規則，讓程式能準確判斷出符合Gmail郵件地址格式的字串。

Pythex

上述示範如何用正則式表達Gmail郵件地址規則的網站為pythex.org，可以用來驗證Regex正確與否，因為Regex的符號有點複雜，該頁面也相當貼心地放了速查表呢～只要按下regular expression cheatsheet按鈕即可查看。

re module

在Python中使用正則式需要import re模組，常用的方法有search()、match()、findall()等，可參考官方文件以取得更詳盡的說明。

那麼，search跟match有什麼不同呢？這是我一開始對這兩個方法的疑問，那就看看官方文件怎麼說，但實際編寫程式碼會幫助自己理解兩者間的差異。

根據官方文件的解釋：re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

match的搜尋方式為「從字串起始開始搜索，遇到不符合的字元便停止」
search的搜尋方式為「整個字串」

個人覺得最常用到的應該是findall()，比對所有符合規則的字串並返回串列：

import re

str_ =  '_.Aa123Bb456Cc789Dd3.14'

find_alphabet = re.findall(r'[A-Za-z]+', str_)
print(find_alphabet)  ＃ ['Aa', 'Bb', 'Cc', 'Dd']

find_rational = re.findall(r'[0-9]+\.?[0-9]*', str_)
print(find_rational)  ＃ ['123', '456', '789', '3.14']

Code

在Google首頁中尋找.jpg/.png，雖然只有一張圖片(笑)，另外，常用的Regex可以透過re.compile()轉換為Regex Object，以便於直接呼叫search()、match()、findall()，也可以在使用bs4模組解析網頁時使用:

#!usr/bin/env python3
# coding:utf-8

import re
import requests as rq
from bs4 import BeautifulSoup as bs

pattern = re.compile(r'.+\.jpg|.+\.png')
url = 'https://www.google.com'
try:
    res = rq.get(url)
    res.raise_for_status()
except rq.HTTPError:
    print('HTTP Error!')

soup = bs(res.text, 'html.parser')
imgs = soup.find_all("meta", {"content":pattern})  # find all images in attr 'content' of tag 'meta'
for img in imgs:
    print(img['content'])  # /images/branding/googleg/1x/googleg_standard_color_128dp.png

本部落格已搬遷, 3秒後跳轉...

“We're believers that the best way to learn something is to do it.”