Python Internals: A Deep Dive into os.path.join() on Windows


References

Item	Description
Official Python documentation	https://docs.python.org/zh-cn/3/
Search engines	Google, Bing
CPython 3.6 interpreter source code	Official download page

Environment

Item	Description
Windows operating system	Windows 10 Pro
Unix-like operating system	Kali Linux 2023-04-18
PyCharm	2023.1 (Professional Edition)
Python	3.10.6

os.path

os.path is a module in the Python standard library for working with file paths, for example joining, splitting, and normalizing path strings.

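Path separators

A path separator is a special character that separates directory levels in a file path. Which character is used depends on the operating system's conventions.

There are two common path separators: the forward slash and the backslash.

Forward slash /
The forward slash is the path separator used on Unix-like operating systems.
Backslash \
The backslash is the primary path separator on Windows. Windows also accepts the forward slash / as a path separator.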
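
The separators described above can be inspected directly. The following is a minimal sketch that uses the standard library modules ntpath and posixpath (the Windows and POSIX implementations behind os.path), both of which can be imported on any platform. The values in the comments assume standard CPython.

```python
import ntpath
import posixpath

# ntpath implements Windows path rules: "\" is the primary separator
# and "/" is accepted as an alternative separator.
win_seps = (ntpath.sep, ntpath.altsep)

# posixpath implements POSIX path rules: "/" is the only separator.
posix_seps = (posixpath.sep, posixpath.altsep)

print(win_seps)    # ('\\', '/')
print(posix_seps)  # ('/', None)
```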

os.path.join()

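os.path.join() is a commonly used function in the os.path module that joins several path components into a single path. It inserts the path separator appropriate for the current operating system between the components.

An example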

import os


result = os.path.join('path', 'to', 'file')
print(result)

Output on Windows

path\to\file

Output on Linux

path/to/file

Platform-specific implementations

Which implementation os.path.join() uses depends on the operating system. On Windows, os.path is the standard library module ntpath (ntpath.py); on POSIX (Unix-like) systems, it is posixpath (posixpath.py).
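
The relationship between os.path and the two implementation modules can be checked with a short sketch. It assumes CPython, where os.name is 'nt' on Windows and 'posix' on Unix-like systems; the variable names are only illustrative.

```python
import ntpath
import os
import posixpath

# os.path is an alias for the implementation that matches the platform.
expected = ntpath if os.name == 'nt' else posixpath
same_module = os.path is expected
print(same_module)  # True

# Both implementations can also be called explicitly on any platform.
win_joined = ntpath.join('path', 'to', 'file')
posix_joined = posixpath.join('path', 'to', 'file')
print(win_joined)    # path\to\file
print(posix_joined)  # path/to/file
```

Calling ntpath.join() explicitly is a convenient way to study the Windows behavior from a non-Windows machine.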

os.path.join() on Windows

os.path.join() and ntpath.join()

On Windows, os.path.join() is provided by the ntpath module. This means that, besides importing os and calling os.path.join(), you can also import ntpath and call its join() function directly to join paths. For example:

Joining paths with os.path.join()

from os.path import join


result = join('path', 'to', 'file')
print(result)

Joining paths with ntpath.join()

from ntpath import join


result = join('path', 'to', 'file')
print(result)

Output

On Windows, both snippets produce the same output:

path\to\file

Notes:

Importing join() directly with from ... import join (where ... stands for os.path or ntpath) is not recommended. Python strings already have a join() method, which concatenates the string elements of an iterable using the given string as the separator, and a bare join() name is easily confused with it.
A better practice is to import the os or ntpath module and then call os.path.join() or ntpath.join().
Importing join() with from ... import join does not override the join() method of string objects. The imported join() is a name bound in the module's namespace, whereas str.join() is a method looked up as an attribute of a string object, so the two never collide. For example:
```
from os.path import join


# Use the join() method of a string object to concatenate
# the elements of an iterable with the given separator string.
arr = ['Hello', 'World']

result = ' '.join(arr)
print(result)

# Use the join() function from os.path to join
# several path components correctly.
result = join('path', 'to', 'file')
print(result)
```
Output
```
Hello World
path\to\file
```

ntpath.join()

On Windows, os.path.join() is ntpath.join(), so to study the behavior of os.path.join() in depth you need to read the source code of ntpath.join().

The source code of ntpath.join() is as follows:

# Join two (or more) paths.
def join(path, *paths):
    path = os.fspath(path)
    if isinstance(path, bytes):
        sep = b'\\'
        seps = b'\\/'
        colon = b':'
    else:
        sep = '\\'
        seps = '\\/'
        colon = ':'
    try:
        if not paths:
            path[:0] + sep  #23780: Ensure compatible data type even if p is null.
        result_drive, result_path = splitdrive(path)
        for p in map(os.fspath, paths):
            p_drive, p_path = splitdrive(p)
            if p_path and p_path[0] in seps:
                # Second path is absolute
                if p_drive or not result_drive:
                    result_drive = p_drive
                result_path = p_path
                continue
            elif p_drive and p_drive != result_drive:
                if p_drive.lower() != result_drive.lower():
                    # Different drives => ignore the first path entirely
                    result_drive = p_drive
                    result_path = p_path
                    continue
                # Same drive in different case
                result_drive = p_drive
            # Second path is relative to the first
            if result_path and result_path[-1] not in seps:
                result_path = result_path + sep
            result_path = result_path + p_path
        ## add separator between UNC and non-absolute path
        if (result_path and result_path[0] not in seps and
            result_drive and result_drive[-1:] != colon):
            return result_drive + sep + result_path
        return result_drive + result_path
    except (TypeError, AttributeError, BytesWarning):
        genericpath._check_arg_types('join', path, *paths)
        raise

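Preliminaries

os.fspath()

os.fspath() takes an object and tries to return its file system path representation as a str or bytes object. If the argument is a str or bytes object, it is returned unchanged. Otherwise the object's __fspath__() method is called and its result is returned, provided that result is a str or bytes object. If the object has no __fspath__() method, or the method returns something other than str or bytes, os.fspath() raises TypeError.

An example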

from os import fspath


class MyPath:
    def __fspath__(self):
        return '/path/to/file'


result = fspath(MyPath())
print(result)

print(fspath('Hello World'))
print(fspath(b'Hello World'))

Output

/path/to/file
Hello World
b'Hello World'

Note:

This function is available in Python 3.6 and later. Check your Python version before using it.
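
To connect os.fspath() with path joining, here is a minimal sketch of the success case and the two failure cases. BadPath is a hypothetical class invented for this illustration; pathlib.PureWindowsPath is used so that the sketch behaves the same on any platform.

```python
import ntpath
import os
from pathlib import PureWindowsPath

# A path-like object is converted by os.fspath() before joining.
joined = ntpath.join(PureWindowsPath('C:/Users'), 'demo')
print(joined)  # C:\Users\demo

# An object that is neither str, bytes, nor path-like is rejected.
try:
    os.fspath(42)
    rejected_int = False
except TypeError:
    rejected_int = True
print(rejected_int)  # True


class BadPath:
    """Hypothetical class whose __fspath__() returns the wrong type."""

    def __fspath__(self):
        return 42


# os.fspath() also rejects a __fspath__() result that is not str or bytes.
try:
    os.fspath(BadPath())
    rejected_bad = False
except TypeError:
    rejected_bad = True
print(rejected_bad)  # True
```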

isinstance()

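isinstance() is a built-in Python function that checks whether an object is an instance of a given class or of one of its subclasses. It returns True if the object is an instance of the given type and False otherwise.

isinstance(object, classinfo)

Where:

object
The object to check. isinstance() determines whether object is an instance of the given type or of a subclass of that type.
classinfo
A type object, a tuple of type objects, or a union type (Python 3.10 and later).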

# Check whether 1 is an instance of int or of a subclass of int
result = isinstance(1, int)
print(result)

# Check whether 1 is an instance of str or of a subclass of str
result = isinstance(1, str)
print(result)

# Check whether 1 is an instance of str or int, or of a subclass of either
result = isinstance(1, (str, int))
print(result)

# Check whether 1 is an instance of str, int, or bool, or of a subclass
result = isinstance(1, str | int | bool)
print(result)

# Check whether 1 is an instance of str, int, bool, list, or tuple,
# or of a subclass of any of them
result = isinstance(1, (str | int, bool | list, tuple | tuple, tuple))
print(result)

Output

True
False
True
True
True

Only tuples are accepted as containers

classinfo may be a tuple containing one or more type objects, but that does not mean other iterables, such as lists, are accepted. Passing one makes Python raise TypeError.

result = isinstance(1, [int, str])
print(result)
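
When the candidate types are only available as a list, one possible workaround is to convert the list to a tuple first. The sketch below is an illustration only; candidates is a hypothetical name.

```python
candidates = [int, str]  # hypothetical list of types built at runtime

# Passing the list directly raises TypeError.
try:
    isinstance(1, candidates)
    list_rejected = False
except TypeError:
    list_rejected = True
print(list_rejected)  # True

# Converting the list to a tuple makes the check valid.
result = isinstance(1, tuple(candidates))
print(result)  # True
```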

When TypeError is raised

isinstance() raises TypeError when classinfo is not valid, but there are exceptions. Consider the following example:

result = isinstance(1, (int, 1))
print(result)

Output

True

If the order of the second argument is changed from (int, 1) to (1, int), Python raises TypeError.
This is because isinstance() checks the elements of the tuple in order and returns True as soon as it finds a type that matches object. If it encounters an invalid element before finding a match, it raises TypeError.
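
The effect of element order can be reproduced side by side. This sketch reflects the behavior observed in CPython; it is shown to illustrate the point above, not as something to rely on in real code.

```python
# A match is found before the invalid element is reached.
first = isinstance(1, (int, 1))
print(first)  # True

# The invalid element is examined first, so TypeError is raised.
try:
    isinstance(1, (1, int))
    order_error = False
except TypeError:
    order_error = True
print(order_error)  # True
```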

Nested tuples

classinfo may be a tuple of type objects, and that tuple may itself contain nested tuples. For example:

result = isinstance(1, (list, (str, (bool, (tuple | int)))))
print(result)

result = isinstance(1, (list, (str, (bool, (tuple | set)))))
print(result)

Output

True
False

os.path.splitdrive()

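UNC paths

UNC (Universal Naming Convention) is a naming convention used on Windows to access shared network resources. It is mainly used to refer to files, folders, printers, and other resources on the local computer or on the network.

Components of a UNC path

A UNC path consists of four parts:

Backslashes (\)
A UNC path starts with two backslashes (\\), which mark the path as a UNC path.
Server identifier
The part immediately after the two backslashes is the server's name or IP address; it identifies the computer that hosts the shared resource.
Share name
After the server identifier and a single backslash comes the name of the share, which identifies the shared folder or shared printer.
Resource path
After the share name and a single backslash comes the path of the target resource relative to the shared folder.

An example

\\ServerName\ShareFolder\ResourcePath

Where:

ServerName is the name or IP address of the computer that hosts the shared resource; ShareFolder is the name of the shared folder; ResourcePath is the path of the target resource relative to the shared folder.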

os.path.splitdrive()

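In Python, os.path.splitdrive() splits a Windows file system path into a drive and the rest of the path. The drive is usually the Windows drive letter; on operating systems that do not use drive specifications, the drive is always an empty string.
On Windows, os.path.splitdrive() can also split a UNC path into two parts: the share point (\\ServerName\ShareFolder) and the rest of the path.

os.path.splitdrive(path)

os.path.splitdrive() returns a tuple of the form (drive, path).

Where:

drive is the drive letter of a Windows file system path, or the share point (server and share name) of a UNC path.
path is whatever remains of the Windows file system path after the drive letter, or of the UNC path after the share point.

An example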

from os.path import splitdrive


# Try to split a Unix-style file path with splitdrive
drive, path = splitdrive('/path/to/file')
print(f'【Drive】 {drive}')
print(f'【Path】 {path}')

# Try to split a Windows file path with splitdrive
drive, path = splitdrive(r'C:\path\to\file')
print(f'【Drive】 {drive}')
print(f'【Path】 {path}')

# Try to split a UNC path with splitdrive
drive, path = splitdrive(r'\\ServerName\ShareFolder\Path\To\File')
print(f'【Drive】 {drive}')
print(f'【Path】 {path}')

Output

Output on Windows

【Drive】 
【Path】 /path/to/file
【Drive】 C:
【Path】 \path\to\file
【Drive】 \\ServerName\ShareFolder
【Path】 \Path\To\File

Output on a Unix-like system

【Drive】
【Path】 /path/to/file
【Drive】
【Path】 C:\path\to\file
【Drive】
【Path】 \\ServerName\ShareFolder\Path\To\File
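
A few further cases may help round out the picture. The sketch below calls ntpath.splitdrive() directly so that the Windows rules apply on any platform; the sample paths are made up for illustration.

```python
import ntpath

# A drive letter without a following separator (a drive-relative path).
drive_relative = ntpath.splitdrive('C:file.txt')
print(drive_relative)  # ('C:', 'file.txt')

# A UNC path written with forward slashes.
unc_forward = ntpath.splitdrive('//ServerName/ShareFolder/Path/To/File')
print(unc_forward)  # ('//ServerName/ShareFolder', '/Path/To/File')

# A relative path has no drive at all.
relative = ntpath.splitdrive(r'path\to\file')
print(relative)  # ('', 'path\\to\\file')
```

In every case, concatenating the two parts gives back the original string unchanged.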

genericpath._check_arg_types()

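The genericpath module

genericpath is a module in the Python standard library (written in Python, not built into the interpreter) that provides general-purpose functions and helpers for working with paths.

The functions defined in genericpath cover path operations that are common to all platforms. They can be used on different operating systems because they do not depend on a particular path separator or on file system rules specific to one operating system.

genericpath._check_arg_types()

The source code of genericpath._check_arg_types() is as follows: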

def _check_arg_types(funcname, *args):
    hasstr = hasbytes = False
    for s in args:
        if isinstance(s, str):
            hasstr = True
        elif isinstance(s, bytes):
            hasbytes = True
        else:
            raise TypeError(f'{funcname}() argument must be str, bytes, or '
                            f'os.PathLike object, not {s.__class__.__name__!r}') from None
    if hasstr and hasbytes:
        raise TypeError("Can't mix strings and bytes in path components") from None

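Inside os.path, this function is used to check whether one or more arguments of a function are file system paths represented as bytes or str. If args contains an element that is neither bytes nor str, or contains both bytes and str elements, the function raises TypeError.

Note:

In Python, functions and methods whose names start with a single underscore are conventionally treated as internal implementation details rather than part of the public API. They are not officially supported, should not be used directly, and may change in future Python versions.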
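
The effect of this check can be observed through the public ntpath.join() function without calling the private helper. This is a minimal sketch; the exact wording of the error messages may differ between Python versions, so the messages are printed rather than assumed.

```python
import ntpath

# Mixing str and bytes components is rejected with TypeError.
try:
    ntpath.join('path', b'to')
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__, mixed_error)

# A component that is not str, bytes, or path-like is rejected as well.
try:
    ntpath.join('path', 1)
    type_error = None
except TypeError as exc:
    type_error = exc
print(type(type_error).__name__, type_error)
```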

Walking through the source code of ntpath.join()

The implementation of ntpath.join(), annotated

def join(path, *paths):
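    # Convert path to its file system representation
    # (a str or bytes object) with os.fspath().
    path = os.fspath(path)

    # If path is an instance of bytes (or a subclass),
    # use bytes values for sep, seps, and colon.
    if isinstance(path, bytes):
        sep = b'\\'
        seps = b'\\/'
        colon = b':'
    else:
        # Otherwise, use str values for sep, seps, and colon.
        sep = '\\'
        seps = '\\/'
        colon = ':'
    try:

        # The result of this expression is discarded; it is evaluated only
        # so that incompatible types raise TypeError (this statement is
        # discussed after the listing).
        if not paths:
            path[:0] + sep  #23780: Ensure compatible data type even if p is null.

        # Split the first path into its drive and the remainder.
        # result_drive holds the drive (drive letter or UNC share) of the result.
        # result_path holds everything in the result after the drive.
        result_drive, result_path = splitdrive(path)

        # Apply os.fspath() to every element of the variadic argument paths.
        for p in map(os.fspath, paths):

            # Split the current component into its drive and the remainder.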
            p_drive, p_path = splitdrive(p)

            # If p_path starts with \ or /, it replaces result_path entirely.
            # For example:
            #   os.path.join('C:\\', r'\path\to\file') -> C:\path\to\file
            #   os.path.join('C:\\', r'\path\to\file', r'\path\to\file') -> C:\path\to\file
            if p_path and p_path[0] in seps:
                # Keep the stored drive if p has no drive;
                # otherwise replace result_drive with the drive of p.
                if p_drive or not result_drive:
                    result_drive = p_drive
                result_path = p_path
                # Skip the rest of this iteration and move on to the next component.
                continue

            # Otherwise p_path does not start with \ or /. If p_drive is
            # non-empty and differs from result_drive, the two are compared
            # case-insensitively. If they are different drives, p_drive
            # replaces result_drive and p_path replaces result_path.
            # For example:
            #   os.path.join(r'C:\path\to\file', r'D:new\path') -> D:new\path
            elif p_drive and p_drive != result_drive:
                # The drives still differ after lowercasing, so p refers
                # to a different drive than the one stored in result_drive.
                if p_drive.lower() != result_drive.lower():
                    # Replace result_drive and result_path with the new drive and path.
                    result_drive = p_drive
                    result_path = p_path
                    # Skip the rest of this iteration and move on to the next component.
                    continue
                # The drives differ only in case, so only result_drive is
                # updated and p_path is appended below.
                # For example:
                #   os.path.join(r'd:\path\to\file', r'D:new\path') -> D:\path\to\file\new\path
                result_drive = p_drive

            # If result_path is non-empty and does not already end with
            # a separator, append \ before appending p_path.
            if result_path and result_path[-1] not in seps:
                result_path = result_path + sep
            result_path = result_path + p_path

        # The result is a UNC path without a separator after the share.
        # This is the case when result_path is non-empty and does not start
        # with a separator, and result_drive is non-empty and does not end
        # with a colon (it is a UNC share rather than a drive letter).
        if (result_path and result_path[0] not in seps and
            result_drive and result_drive[-1:] != colon):

            # Insert the missing separator (\ or b'\\') between
            # the UNC share and result_path.
            return result_drive + sep + result_path

        return result_drive + result_path
    except (TypeError, AttributeError, BytesWarning):
        # Call genericpath._check_arg_types() to identify the cause
        # of the error and raise a more helpful message.
        genericpath._check_arg_types('join', path, *paths)

        # If genericpath._check_arg_types() does not find the cause
        # and raise its own exception, re-raise the original one.
        raise
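
The branches annotated above can be exercised one by one. The following sketch calls ntpath.join() explicitly so that it runs on any platform; the sample paths are made up for illustration.

```python
import ntpath

# A component starting with a separator replaces the path but keeps the drive.
rooted = ntpath.join('C:\\', r'\path\to\file')
print(rooted)  # C:\path\to\file

# A component on a different drive discards everything joined so far.
other_drive = ntpath.join(r'C:\path\to\file', r'D:new\path')
print(other_drive)  # D:new\path

# The same drive in a different case only updates the drive's spelling.
same_drive = ntpath.join(r'd:\path\to\file', r'D:new\path')
print(same_drive)  # D:\path\to\file\new\path

# A separator is inserted between a UNC share and a relative component.
unc = ntpath.join(r'\\ServerName\ShareFolder', 'dir')
print(unc)  # \\ServerName\ShareFolder\dir

# No separator is inserted after a bare drive letter.
bare_drive = ntpath.join('C:', 'file')
print(bare_drive)  # C:file
```

The last case is easy to overlook: C:file is a path relative to the current directory on drive C:, not the same as C:\file.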

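A puzzling if statement

In the source code of ntpath.join(), the following if statement looks somewhat redundant.

if not paths:
    path[:0] + sep  #23780: Ensure compatible data type even if p is null.

path[:0] + sep

The statement path[:0] + sep does not store its result anywhere. path[:0] is an empty slice of path (an empty str or bytes object of the same type as path), and the concatenation is evaluated only for its side effect: if the two operands have incompatible types, Python raises TypeError. So what is the statement for?

The comment next to it, #23780: Ensure compatible data type even if p is null., says that the data type must be checked for compatibility even when there is nothing to join. In other words, path[:0] + sep verifies that path is a str or bytes object of the same type as sep, even when paths is empty and the loop body never runs. Let us verify this: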

string = 'Hello World'
bytes_string = b'Hello World'
arr = [1, 2, 3]

# Even though [:0] selects no elements from the sequence,
# it still returns an empty string, empty bytes object, or empty list.
print(string[:0])
print(bytes_string[:0])
print(arr[:0])
print(type(string[:0]))
print(type(bytes_string[:0]))
print(type(arr[:0]))

# The result of arr[:0] + '/' is not stored anywhere, but if the
# two operands do not support addition with each other,
# Python raises TypeError.
try:
    arr[:0] + '/'
except TypeError:
    print('TypeError')

Output


b''
[]
<class 'str'>
<class 'bytes'>
<class 'list'>
TypeError

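The result shows that path[:0] + sep raises TypeError when the two operands cannot be added, and that exception is caught by except (TypeError, AttributeError, BytesWarning) in ntpath.join(). This guarantees that path is a str or bytes object. What is puzzling is that os.fspath(path) alone is already enough to guarantee that path is either str or bytes. A plausible explanation is that the check predates the os.fspath() call, since os.fspath() was only added in Python 3.6, and was simply left in place.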
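
The claim that os.fspath() already guarantees the type can be checked with a short sketch. It mirrors the first lines of ntpath.join() outside the function; the variable names checks and reached_check are made up for this illustration.

```python
import os

# For str and bytes input, os.fspath() returns the object unchanged, so
# path[:0] and sep always have the same type and the addition succeeds.
checks = []
for value in ('abc', b'abc'):
    path = os.fspath(value)
    sep = b'\\' if isinstance(path, bytes) else '\\'
    checks.append(path[:0] + sep == sep)
print(checks)  # [True, True]

# Any other non-path-like input is rejected by os.fspath() itself,
# before the path[:0] + sep check could ever run.
try:
    os.fspath(['abc'])
    reached_check = True
except TypeError:
    reached_check = False
print(reached_check)  # False
```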

def join(path, *paths):
    path = os.fspath(path)
    
    if isinstance(path, bytes):
        sep = b'\\'
        seps = b'\\/'
        colon = b':'
    else:
        sep = '\\'
        seps = '\\/'
        colon = ':'
        
    try:
        if not paths:
            path[:0] + sep  #23780: Ensure compatible data type even if p is null.