Common Python Beginner Mistakes: Not Using re.compile() Properly with Regular Expressions

Last updated on Jul 4, 2024 in Python Programming - Beginner by Amo Chen, 3 min read


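As everyone knows, Python is not a language known for its performance, so it pays to watch out for mistakes that make programs slower.

This article covers one mistake beginners often make: not using re.compile() properly when working with regular expressions.

Environment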

Python 3

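The common mistake: not using re.compile()

First, we ask ChatGPT to generate a piece of Python code that checks the format of an email address. The prompt used in this article (given in Chinese, translated here) was:

Give me a Python program that can check the format of an email address. Please use regular expressions.

Here is the answer ChatGPT generated: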

import re

def is_valid_email(email):
    # Regular expression used to check the email format
    regex = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'

    # Use re.match to check whether the email matches the format
    if re.match(regex, email):
        return True
    else:
        return False

# Test examples (the example.com and example.org addresses are placeholders)
emails = ["user@example.com", "bad-email", "another.user@example.org", "wrong@domain,com"]

for email in emails:
    print(f"{email}: {is_valid_email(email)}")

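At first glance, nothing looks wrong.

But this answer is exactly one of the common mistakes Python beginners make: not using re.compile() properly!

What goes wrong when re.compile() is not used properly? It's slow!

Let's look directly at what happens without re.compile().

We can run a quick performance test on the ChatGPT-generated answer in Google Colab: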

%timeit is_valid_email("wrong@domain,com")

The test produces the following result:

The slowest run took 5.03 times longer than the fastest. This could mean that an intermediate result is being cached.
4.2 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

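Each call takes about 4.2 µs (microseconds).

Fixing the problem with re.compile()

Next, we compile the regular expression ahead of time with re.compile() and change the code to the following form: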

import re

EMAIL_RE = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')

def is_valid_email(email):
    if EMAIL_RE.match(email):
        return True
    else:
        return False

Now run the performance test again:

%timeit is_valid_email("wrong@domain,com")

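You may be surprised to find that the re.compile() version is considerably faster, at about 535 ns (nanoseconds) per call, or roughly 7.9 times faster (the first measurement has a wide spread, so treat the ratio as approximate):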

535 ns ± 133 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This is also a mistake that turns up often in real-world code reviews.
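For readers who are not working in Colab or IPython, the comparison above can be approximated with the standard library's timeit module. The following is only a rough sketch: the names is_valid_email_shortcut, is_valid_email_compiled, and time_per_call_ns are made up for this illustration, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

PATTERN = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
EMAIL_RE = re.compile(PATTERN)  # compiled once at module level


def is_valid_email_shortcut(email):
    # Goes through re.match(), which has to obtain the compiled pattern on every call.
    return re.match(PATTERN, email) is not None


def is_valid_email_compiled(email):
    # Reuses the pattern object compiled above.
    return EMAIL_RE.match(email) is not None


def time_per_call_ns(func, number=200_000):
    # Best of 5 repeats, converted to nanoseconds per call.
    # The lambda adds the same small overhead to both variants.
    best = min(timeit.repeat(lambda: func("wrong@domain,com"), number=number, repeat=5))
    return best / number * 1e9


if __name__ == "__main__":
    print(f"re.match():      {time_per_call_ns(is_valid_email_shortcut):.0f} ns per call")
    print(f"compiled object: {time_per_call_ns(is_valid_email_compiled):.0f} ns per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes. It is a choice made for this sketch, not something the original measurements did.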

The root cause is that Python offers many convenient shortcuts, and re.match() is one of them. What re.match() actually does is:

prog = re.compile(pattern)
result = prog.match(string)

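Yes: it first compiles the regular expression and only then starts matching.

So running re.match() inside a loop of n iterations means going through the compile step n times, and that is where the performance is lost (as discussed further below, a cache keeps this from being a full recompilation every time, but the step is still not free). This is why the regular expression should be compiled ahead of time and moved to module level: the compiled object can then be reused, and the redundant compile step disappears.

The official Python documentation says the same thing: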

using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
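The loop scenario described above can be sketched as follows. This is a minimal illustration, not code from the original article: count_valid_emails and the sample addresses are hypothetical, and binding the match method to a local name is an optional extra.

```python
import re

# Compiled once, outside the loop.
EMAIL_RE = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')


def count_valid_emails(candidates):
    # Bind the method to a local name so the loop body only calls match().
    match = EMAIL_RE.match
    return sum(1 for candidate in candidates if match(candidate))


# Hypothetical sample data, repeated to simulate a large input.
sample = ["user@example.com", "bad-email", "wrong@domain,com"] * 1000
print(count_valid_emails(sample))  # → 1000
```

Only the first address in each group of three matches the pattern, so the count equals the number of repetitions.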

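Developers who overlook these subtle differences can easily make Python code, which is not fast to begin with, even slower.

Reading the Python source does show that _compile(), which re.match() uses under the hood, keeps a cache. In other words, shortcuts such as re.match() and re.search() use the cached pattern when there is one, which cuts down the time spent compiling. But every call to _compile() is itself a cost, so the shortcuts are still much slower than using the regex object directly.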
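The caching behaviour can be observed without touching any private function. The sketch below assumes CPython, where the cache hands back the very same pattern object for a repeated pattern and flags. That identity is an implementation detail rather than a documented guarantee, so treat it as an observation, not a contract.

```python
import re

PATTERN = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'

re.purge()  # public API: clear the module's regular expression cache

first = re.compile(PATTERN)   # cache miss: the pattern is actually compiled
second = re.compile(PATTERN)  # cache hit: the cached object is handed back

# On CPython both names are expected to refer to the same pattern object.
print(first is second)

# re.match() goes through the same cache, so it does not recompile here either.
# What it still pays on every call is the lookup itself.
print(re.match(PATTERN, "user@example.com") is not None)
```

Keeping your own reference to the compiled object, as in the module-level example, skips that per-call lookup entirely.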

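Below is a timing of re._compile() on Google Colab (note that _compile() is a private helper, so its name and signature may change between Python versions). Just calling re._compile() costs roughly 595 ns: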

from re import _compile
%timeit _compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', 0)
595 ns ± 147 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

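Finally, getting ChatGPT to generate the correct answer

Now that we know where the devil hides in Python regular expressions, we can tell ChatGPT what we require and refine our prompt (again translated from Chinese):

Give me a Python program that can check the format of an email address. The requirements are:
1. Use regular expressions
2. Compile the regular expression with re.compile() at module level to improve performance

This time, ChatGPT gives us the correct answer: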

import re

# Compile the regular expression at module level
email_regex = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')

def is_valid_email(email):
    # Use the precompiled regular expression to check the email format
    if email_regex.match(email):
        return True
    else:
        return False

# Test examples (the example.com and example.org addresses are placeholders)
emails = ["user@example.com", "bad-email", "another.user@example.org", "wrong@domain,com"]

for email in emails:
    print(f"{email}: {is_valid_email(email)}")

Summary

"The devil is in the details, and so is performance."

That's all!

Enjoy!

References

re — Regular expression operations

How does Python’s regex pattern caching work?
