How to Scale Python Services

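Python is becoming an increasingly popular choice among developers for a wide range of applications. As with any language, however, scaling Python services effectively can be challenging. This article explains the concepts that help: the difference between CPU-bound and I/O-bound tasks, the implications of the Global Interpreter Lock (GIL), and the mechanics behind thread pools and asyncio.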

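CPU-Bound vs. I/O-Bound: The Basics

CPU-bound tasks: these involve heavy computation, data processing, and transformation, and they need a lot of CPU processing power.
I/O-bound tasks: these spend most of their time waiting on external resources, such as reading from or writing to a database, a file, or the network.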



Recognizing whether your service is primarily CPU-bound or I/O-bound is the foundation for scaling it effectively.
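One rough way to tell the two apart is to compare the CPU time a call consumes with the wall-clock time it takes. The sketch below is only an illustration under stated assumptions: `cpu_heavy`, `io_like`, and `cpu_fraction` are hypothetical helper names, and `time.sleep` stands in for a real network or disk wait. A ratio near 1 suggests CPU-bound work, while a ratio near 0 suggests the call mostly waits.

```python
import time

def cpu_heavy(n=2_000_000):
    """Pure computation: sums squares in a Python-level loop."""
    return sum(i * i for i in range(n))

def io_like(delay=0.2):
    """Stand-in for a network or disk wait; sleeping uses almost no CPU."""
    time.sleep(delay)

def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0

if __name__ == "__main__":
    print(f"cpu_heavy: {cpu_fraction(cpu_heavy):.2f}")
    print(f"io_like:   {cpu_fraction(io_like):.2f}")
```

In a real service the same idea applies to a whole request handler, although a profiler will give a more reliable picture than a single timed call.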

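Concurrency vs. Parallelism: A Simple Analogy

Think of multitasking on a computer:

Concurrency: you have several applications open. Even if only one is active at any given moment, you switch between them quickly, which creates the illusion that they all run at once.
Parallelism: several applications truly run at the same time, such as a spreadsheet recalculating while a file downloads.

On a single-core CPU, concurrency is achieved by switching rapidly between tasks. Parallelism means multiple tasks execute at the same instant, which requires more than one core.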


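The Global Interpreter Lock (GIL)

You might think that scaling a CPU-bound Python service is as simple as adding more CPU power. The Global Interpreter Lock (GIL) in the standard Python implementation complicates this. The GIL is a mutex that ensures only one thread executes Python bytecode at a time, even on a multi-core machine. Because of this restriction, CPU-bound tasks in Python cannot take full advantage of multithreading.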
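The effect can be observed with a small experiment. The sketch below uses hypothetical helper names (`count_down`, `run_sequential`, `run_threaded`) and times the same pure-Python busy loop run twice in a row and then in two threads. On a standard CPython build with the GIL enabled, the two timings are usually close to each other, because the threads take turns rather than running in parallel. Timings vary by machine and interpreter build, so no expected output is shown.

```python
import threading
import time

def count_down(n):
    """CPU-bound busy loop executed entirely as Python bytecode."""
    while n > 0:
        n -= 1

def run_sequential(n, repeats=2):
    """Run the loop `repeats` times, one after another, and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start

def run_threaded(n, repeats=2):
    """Run the loop in `repeats` threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    N = 5_000_000
    print(f"sequential: {run_sequential(N):.2f}s")
    print(f"2 threads:  {run_threaded(N):.2f}s")
```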

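Scaling Solutions: I/O-Bound and CPU-Bound

ThreadPoolExecutor

This class provides an interface for executing functions asynchronously using threads. Threads in Python work well for I/O-bound tasks, because the GIL is released while a thread waits on I/O. For CPU-bound tasks they are much less effective, again because of the GIL.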
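As a minimal sketch of that interface, the example below submits several tasks to a `ThreadPoolExecutor` and collects each result as soon as its future completes. The names `fake_io` and `run_pool` are hypothetical, and `time.sleep` stands in for a blocking I/O call so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_io(task_id, delay=0.1):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return task_id * 2

def run_pool(task_ids, max_workers=4):
    """Submit every task and gather results in completion order."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_id = {executor.submit(fake_io, t): t for t in task_ids}
        for future in as_completed(future_to_id):
            # future.result() re-raises any exception raised inside the worker.
            results[future_to_id[future]] = future.result()
    return results

if __name__ == "__main__":
    start = time.perf_counter()
    out = run_pool(range(8))
    print(out, f"{time.perf_counter() - start:.2f}s")
```

`executor.map` is shorter when results are wanted in input order, while `submit` with `as_completed` makes it easier to handle each result or failure as it arrives.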

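asyncio

asyncio suits I/O-bound tasks and provides an event-driven framework for asynchronous I/O. It uses a single-threaded model: while one task waits on I/O, control returns to the event loop, which runs other tasks. Compared with threads, asyncio is leaner and avoids overhead such as thread context switches.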
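To make the single-threaded model concrete, here is a small sketch built only on native coroutines. `fake_fetch` and `fetch_all` are hypothetical names, and `asyncio.sleep` stands in for a non-blocking network call. Each `await` hands control back to the event loop, and an `asyncio.Semaphore` caps how many tasks are in flight at once.

```python
import asyncio

async def fake_fetch(name, delay, limit):
    """Pretend to fetch a resource; awaiting sleep yields to the event loop."""
    async with limit:
        await asyncio.sleep(delay)
        return f"{name}: done"

async def fetch_all(names, delay=0.1, max_in_flight=10):
    """Run all fetches concurrently on one thread, at most max_in_flight at a time."""
    limit = asyncio.Semaphore(max_in_flight)
    tasks = [fake_fetch(n, delay, limit) for n in names]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    print(asyncio.run(fetch_all(["a", "b", "c"])))
```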

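Here is a practical comparison. We take fetching URL data (an I/O-bound task) as the example and do it three ways: sequentially without threads, with a thread pool, and with asyncio.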

import requests
import timeit
from concurrent.futures import ThreadPoolExecutor
import asyncio
URLS = [
    "https://www.example.com",
    "https://www.python.org",
    "https://www.openai.com",
    "https://www.github.com"
] * 50

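# Fetch the body of a URL (blocking call)
def fetch_url_data(url):
    # A timeout keeps one unresponsive host from stalling the whole benchmark.
    response = requests.get(url, timeout=10)
    return response.text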

# 1. Sequential
def main_sequential():
    return [fetch_url_data(url) for url in URLS]
  
# 2. Thread pool
def main_threadpool():
    with ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(fetch_url_data, URLS))
      
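# 3. asyncio with requests
async def main_asyncio():
    # requests is blocking, so each call is handed to the event loop's
    # default executor (a thread pool) and awaited from the coroutine.
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(None, fetch_url_data, url) for url in URLS]
    return await asyncio.gather(*futures)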

def run_all_methods_and_time():
    methods = [
        ("Sequential", main_sequential),
        ("Thread pool", main_threadpool),
        ("asyncio", lambda: asyncio.run(main_asyncio()))
    ]

    for name, method in methods:
        start_time = timeit.default_timer()
        method()
        elapsed_time = timeit.default_timer() - start_time
        print(f"{name} execution time: {elapsed_time:.4f} seconds")

if __name__ == "__main__":
    run_all_methods_and_time()

Results

Sequential execution time: 37.9845 seconds
Thread pool execution time: 13.5944 seconds
asyncio execution time: 3.4348 seconds

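The results show that the asyncio variant handled this I/O-bound workload far faster than the sequential one, since waiting on the network dominates the run time. Part of its lead over the thread pool has a simpler cause, though: loop.run_in_executor(None, ...) passes each blocking requests call to the event loop's default thread pool, which typically has more than the four workers used in main_threadpool. With a fully asynchronous HTTP client, asyncio's own advantages come into play: low per-task overhead and no need to synchronize data between threads.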
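It helps to check where the work in the asyncio variant above actually runs. The sketch below assumes nothing beyond the standard library (`blocking_call` and `where_does_it_run` are hypothetical names) and records the thread that executes each function passed to `loop.run_in_executor(None, ...)`. The calls land on worker threads of the loop's default executor rather than on the event loop thread, so that variant is effectively a second thread pool with a different worker count.

```python
import asyncio
import threading

def blocking_call():
    """Report which thread actually runs this synchronous function."""
    return threading.current_thread().name

async def where_does_it_run(calls=8):
    """Return the event loop's thread name and the set of worker thread names."""
    loop = asyncio.get_running_loop()
    loop_thread = threading.current_thread().name
    worker_threads = await asyncio.gather(
        *(loop.run_in_executor(None, blocking_call) for _ in range(calls))
    )
    return loop_thread, set(worker_threads)

if __name__ == "__main__":
    loop_thread, workers = asyncio.run(where_does_it_run())
    print("event loop thread:", loop_thread)
    print("worker threads:   ", sorted(workers))
```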

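For CPU-bound tasks, consider:

Multiprocessing: processes do not share a GIL, which makes this approach suitable for CPU-bound tasks. Make sure, however, that the overhead of spawning processes and of inter-process communication does not cancel out the performance gain.
PyPy: an alternative Python interpreter with a just-in-time (JIT) compiler. PyPy can improve performance, especially for CPU-bound tasks.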
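One knob that affects the inter-process communication cost mentioned above is how many items are sent to a worker per message. The sketch below is an illustration rather than a tuned implementation: `count_primes` and `count_primes_parallel` are hypothetical names, and the `chunksize` of 8 is an arbitrary assumption. It uses `ProcessPoolExecutor.map`, whose `chunksize` argument batches inputs so that many small tasks do not each pay a separate round trip.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    """CPU-bound work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def count_primes_parallel(limits, workers=4, chunksize=8):
    """Map count_primes over limits; chunksize batches items per IPC message."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits, chunksize=chunksize))

if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on start.
    limits = [20_000] * 64
    print(sum(count_primes_parallel(limits)))
```

Whether a larger `chunksize` helps depends on how long each item takes; measuring with the real workload is the only reliable guide.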

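Here is an example based on regular-expression matching (CPU-bound). We run it two ways: sequentially without any optimization, and with multiprocessing.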

import re
import timeit
from multiprocessing import Pool
import random
import string

# Complex regex: ten consecutive word characters, each of which does not occur again later in the string.
PATTERN_REGEX = r"(?:(\w)(?!.*\1)){10}"

def find_pattern(s):
    """Search for the pattern in a given string and return it, or None if not found."""
    match = re.search(PATTERN_REGEX, s)
    return match.group(0) if match else None

# Generate a dataset of random strings
data = [''.join(random.choices(string.ascii_letters + string.digits, k=1000)) for _ in range(1000)]

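def concurrent_execution():
    """Run find_pattern over the dataset in a pool of four worker processes."""
    with Pool(processes=4) as pool:
        return pool.map(find_pattern, data)

def sequential_execution():
    """Run find_pattern over the dataset in a single process."""
    return [find_pattern(s) for s in data]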

if __name__ == "__main__":
    # Timing both methods
    concurrent_time = timeit.timeit(concurrent_execution, number=10)
    sequential_time = timeit.timeit(sequential_execution, number=10)

    print(f"Concurrent execution time (multiprocessing): {concurrent_time:.4f} seconds")
    print(f"Sequential execution time: {sequential_time:.4f} seconds")