[Web Scraping] 4.2 Finding HTML Elements in Scrapy

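Introduction to XPath

1. Introduction to XPath in Scrapy

(1) Finding HTML elements with XPath

2. Finding HTML elements with XPath

(2) Using "//" and "/"

(3) Chaining XPath calls with "."

(4) Using extract() and extract_first()

(5) Getting attribute values

(6) Getting the text of a node

(7) Multiple text nodes

(8) Restricting an element with a condition

(9) Selecting elements by position()

(10) Using "*" to match any element (not Text or Comment nodes)

(11) Using "@*" to match any attribute

(12) Finding an element's parent node

(13) Finding following siblings

(14) Finding preceding siblings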

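Introduction to XPath

XPath is a language for locating information in XML and HTML documents. A path expression selects a node or a set of nodes, so it can be used to traverse the elements and attributes of a document.
XPath node types: element, attribute, text, namespace, processing instruction, comment, and document (root) node.
XPath locates nodes by combining a path expression, an index, and an attribute test.
Format: /node1/node2/node3[1]/node4[@attribute="value"] (the ="value" part is optional; node4[@attribute] only tests that the attribute exists)
XPath indexes start at 1, not 0.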

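**Common XPath expressions**
Expression	Description	Example
nodename	Selects the child elements named nodename of the current node	body selects the <body> children of the current node
/	At the start of an expression, an absolute path from the document root; between two steps, the next level down (direct children)	/html selects the root <html> element; /html/body selects the <body> directly under it
//	Descendants at any depth; at the start of an expression, a search of the whole document	//title finds every <title> in the document; body//title finds every <title> among the descendants of <body>
.	The current node	.//title finds every <title> among the descendants of the current node
@	Selects an attribute	//node[@attribute] selects the node elements that have an attribute named attribute
*	Wildcard	/* matches any element directly under the root, //* matches any element in the document, @* matches any attribute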

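**Selector methods and the XPath expressions used with them**
Method / expression	Description
extract()	On a SelectorList, returns a list with the serialized string of each selected node; on a single Selector, returns one string
extract_first()	Returns the first string of that list, or None if the list is empty
"/@attrName"	Selects the attrName attribute node of an element; call extract() to get the attribute value
"/text()"	Selects the text nodes of an element; call extract() to get the text
"/tag[condition]"	Selects the tag elements that satisfy condition, a boolean expression computed from the element's attributes, text, and so on. Several conditions can be written as "tag[condition1][condition2]...[conditionN]" or as "tag[condition1 and condition2 and ... and conditionN]"
position()	The position of an element among the candidates, starting at 1. It can be combined with and, or, and similar operators into more complex expressions
"element/parent::*"	Selects the parent node of element
"element/following-sibling::*"	Selects all siblings that come after element
"element/preceding-sibling::*"	Selects all siblings that come before element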

1. Introduction to XPath in Scrapy

(1) Finding HTML elements with XPath

# Find elements in an HTML document with XPath
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
print(type(selector))  # <class 'scrapy.selector.unified.Selector'>
print(selector)  # <Selector query=None data='<html><body>\n<bookstore>\n    <book>\n ...'>
s = selector.xpath("//title")  # search the whole document for <title>; returns a list of Selectors
print(type(s))  # <class 'scrapy.selector.unified.SelectorList'>
print(s)  # [<Selector query='//title' data='<title lang="eng">Harry Potter</title>'>, <Selector query='//title' data='<title lang="eng">Learning XML</title>'>]

from scrapy.selector import Selector

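This imports the Selector class from scrapy; it is the class used to select and search for elements.

selector = Selector(text=htmlText)

This builds a Selector from the text in htmlText, that is, it loads the HTML document. Once the document is loaded, the resulting Selector object can search for elements with xpath.

print(type(selector))

The output shows that selector has the type scrapy.selector.unified.Selector, a type that provides the xpath method.

s = selector.xpath("//title")

This call finds every <title> element in the document; "//" means anywhere in the document. In general:

selector.xpath("//tagName")

searches the whole document for <tagName> elements and returns a list of Selector objects.

print(type(s))

The result is a scrapy.selector.unified.SelectorList, which behaves like a list of scrapy.selector.unified.Selector objects. Here it holds two items because the document has two <title> elements.

print(s)

s contains two Selector objects: one is <Selector query='//title' data='<title lang="eng">Harry Potter</title>'>,

and the other is <Selector query='//title' data='<title lang="eng">Learning XML</title>'>.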

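So the general way to search for <tagName> elements with a selector is:

selector.xpath("//tagName")

After the document is loaded with selector = Selector(text=htmlText), selector corresponds to the top-level <html> element of the document. "//" searches the whole document, and the result is always a list of Selectors, even when only one element matches. For example:

selector.xpath("//body") finds the <body> element; the result is a list containing one Selector.
selector.xpath("//title") finds the two <title> elements; the result is a list containing two Selectors.
selector.xpath("//book") finds the two <book> elements; the result is a list containing two Selectors.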

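2. Finding HTML elements with XPath

(2) Using "//" and "/"

"//" selects all descendant elements at any depth, while "/" selects only the elements one level below the current node.

# Using "//" and "/"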
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
print(type(selector))
print(selector)
print("====================s1====================")
s1 = selector.xpath("//bookstore/book")  # <book> elements directly under <bookstore>: 2 found
print(type(s1))
print(s1)
print("====================s2====================")
s2 = selector.xpath("//body/book")  # <book> elements directly under <body>: none
print(type(s2))
print(s2)
print("====================s3====================")
s3 = selector.xpath("//body//book")  # <book> elements anywhere under <body>: 2 found
print(type(s3))
print(s3)
print("====================s4====================")
s4 = selector.xpath("/body//book")  # empty: the element directly under the document root is <html>, not <body>
print(type(s4))
print(s4)
print("====================s5====================")
s5 = selector.xpath("/html/body//book")
# or: s5 = selector.xpath("/html//book")  # finds the 2 <book> elements
print(type(s5))
print(s5)
print("====================s6====================")
s6 = selector.xpath("//book/title")  # <title> elements directly under any <book>: 2 found
print(type(s6))
print(s6)  # same result as selector.xpath("//title") and selector.xpath("//bookstore//title")
print("====================s7====================")
s7 = selector.xpath("//book//price")  # <price> elements anywhere under any <book>: 2 found
print(type(s7))
print(s7)  # same result as selector.xpath("//price")

Output:
<class 'scrapy.selector.unified.Selector'>
<Selector query=None data='<html><body>\n<bookstore>\n <book>\n ...'>
====================s1====================
<class 'scrapy.selector.unified.SelectorList'>
[<Selector query='//bookstore/book' data='<book>\n <title lang="eng">Harr...'>, <Selector query='//bookstore/book' data='<book>\n <title lang="eng">Lear...'>]
====================s2====================
<class 'scrapy.selector.unified.SelectorList'>
[]
====================s3====================
<class 'scrapy.selector.unified.SelectorList'>
[<Selector query='//body//book' data='<book>\n <title lang="eng">Harr...'>, <Selector query='//body//book' data='<book>\n <title lang="eng">Lear...'>]
====================s4====================
<class 'scrapy.selector.unified.SelectorList'>
[]
====================s5====================
<class 'scrapy.selector.unified.SelectorList'>
[<Selector query='/html/body//book' data='<book>\n <title lang="eng">Harr...'>, <Selector query='/html/body//book' data='<book>\n <title lang="eng">Lear...'>]
====================s6====================
<class 'scrapy.selector.unified.SelectorList'>
[<Selector query='//book/title' data='<title lang="eng">Harry Potter</title>'>, <Selector query='//book/title' data='<title lang="eng">Learning XML</title>'>]
====================s7====================
<class 'scrapy.selector.unified.SelectorList'>
[<Selector query='//book//price' data='<price>29.99</price>'>, <Selector query='//book//price' data='<price>39.95</price>'>]

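(3) Chaining XPath calls with "."

"." stands for the current node. XPath calls can be chained: when one xpath call returns a list of Selectors, xpath can be called again on that list.

Effect: the second xpath is evaluated once for each item in the list, and the final result combines the results from all items.

# Chaining XPath calls with "."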
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <title>books</title>
    <book>
        <title>Novel</title>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title>TextBook</title>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book").xpath("./title")
# s = selector.xpath("//book").xpath("/title")  # empty, because xpath("/title") looks for <title> directly under the document root
# s = selector.xpath("//book").xpath("//title")  # 10 items, because each <book> triggers xpath("//title") over the whole document, which finds 5 <title> elements each time
for e in s:
    print(e)

Output:

<Selector query='./title' data='<title>Novel</title>'>
<Selector query='./title' data='<title lang="eng">Harry Potter</title>'>
<Selector query='./title' data='<title>TextBook</title>'>
<Selector query='./title' data='<title lang="eng">Learning XML</title>'>

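selector.xpath("//book") first finds all the <book> elements in the document, 2 in total. Calling xpath("./title") on that result then searches one level below each current <book> element for <title>. Each <book> contains 2 <title> elements, so the result has 4 <title> elements.

Note: in a chained call, an expression that does not start from the result of the previous xpath (with ".") is evaluated against the whole document, and the result is different. For example:

s = selector.xpath("//book").xpath("/title") is empty, because xpath("/title") looks for <title> directly under the document root.

s = selector.xpath("//book").xpath("//title") has 10 items, because each <book> triggers xpath("//title") over the whole document, which finds 5 <title> elements each time.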

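(4) Using extract() and extract_first()

When xpath returns a list of Selector objects (a SelectorList):

① extract() returns a list with the serialized string of each object (for an element, its full markup).

② extract_first() returns the first item of that list, or None if the list is empty.

For a single Selector object:

① extract() returns the serialized string of the element that the Selector represents.

② A single Selector object has no extract_first() method.

# Using extract() and extract_first()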
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book/price")
print(type(s), s)
s = selector.xpath("//book/price").extract()
print(type(s), s)
s = selector.xpath("//book/price").extract_first()
print(type(s), s)

Output:

<class 'scrapy.selector.unified.SelectorList'> [<Selector query='//book/price' data='<price>29.99</price>'>, <Selector query='//book/price' data='<price>39.95</price>'>]
<class 'list'> ['<price>29.99</price>', '<price>39.95</price>']
<class 'str'> <price>29.99</price>

This shows that:

s = selector.xpath("//book/price") returns a SelectorList.

s = selector.xpath("//book/price").extract() returns a list with the serialized string of each <price> element, that is:

['<price>29.99</price>', '<price>39.95</price>']

s = selector.xpath("//book/price").extract_first() returns the first item of that list as a single string, that is: <price>29.99</price>

(5) Getting attribute values

In xpath, "/@attrName" selects the attrName attribute node of an element. An attribute node is also a Selector object, and extract() returns the attribute value.

# Getting attribute values
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book").xpath("./@id")
print(s)  # [<Selector query='./@id' data='b1'>, <Selector query='./@id' data='b2'>]
print(s.extract())  # ['b1', 'b2']
for e in s:
    print(e.extract())  # b1 \n  b2

Output:

[<Selector query='./@id' data='b1'>, <Selector query='./@id' data='b2'>]
['b1', 'b2']
b1
b2

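This shows that:

s = selector.xpath("//book").xpath("./@id")

which selects the same attributes as selector.xpath("//book/@id"), returns a SelectorList with the id attributes of the 2 <book> elements, so an attribute is also a Selector object.

print(s.extract()) prints the list of attribute values of those two Selector objects, that is, ['b1', 'b2'].

for e in s:

    print(e.extract())

Each e is a Selector object, so extract() returns its attribute value.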

(6) Getting the text of a node

In xpath, "/text()" selects the text nodes contained in an element. A text node is also a Selector object, and extract() returns the text.

# Getting the text of a node
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book/title/text()")
print(s)
print(s.extract())  # ['Harry Potter', '学习 XML']
for e in s:
    print(e.extract())  # Harry Potter  \n  学习 XML

Output:

[<Selector query='//book/title/text()' data='Harry Potter'>, <Selector query='//book/title/text()' data='学习 XML'>]
['Harry Potter', '学习 XML']
Harry Potter
学习 XML

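This shows that:

s = selector.xpath("//book/title/text()") also returns a SelectorList, so text is a node as well.

print(s.extract()) prints the list of strings of the text nodes, that is, ['Harry Potter', '学习 XML'].

for e in s:

    print(e.extract())

Each e is a Selector object, so extract() returns its text. Note that when the content of an element is not a single run of text, "/text()" can yield several text values.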

(7) Multiple text nodes

# Multiple text nodes
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english"><b>H</b>ary <b>P</b>otter</title>
        <price>29.99</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book/title/text()")
print(s)
print(s.extract())  # ['ary ', 'otter']
for e in s:
    print(e.extract())

Output:

[<Selector query='//book/title/text()' data='ary '>, <Selector query='//book/title/text()' data='otter'>]
['ary ', 'otter']
ary
otter

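This shows that "/text()" returns two text values for <title>, 'ary ' and 'otter'. The letters inside <b> belong to the <b> elements, not directly to <title>, so they are not included.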

(8) Restricting an element with a condition

# Restricting an element with a condition
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book/title[@lang='chinese']/text()")
print(s.extract_first())  # 学习 XML
s = selector.xpath("//book[@id='b1']/title")
print(s.extract_first())  # <title lang="english">Harry Potter</title>

Output:

学习 XML
<title lang="english">Harry Potter</title>

This shows that:

s = selector.xpath("//book/title[@lang='chinese']/text()")

finds the text of the <title> with lang="chinese" under a <book>.

s = selector.xpath("//book[@id='b1']/title")

finds the <title> under the <book> with id="b1".

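(9) Selecting elements by position()

In xpath, position() restricts the selection to the element at a given position. Positions are numbered from 1, not from 0, and position() can be combined with and, or, and similar operators into more complex expressions.

# Selecting elements by position()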
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book[position()=1]/title")  # the <title> under the first <book>
print(s.extract_first())  # <title lang="english">Harry Potter</title>
s = selector.xpath("//book[position()=2]/title")  # the <title> under the second <book>
print(s.extract_first())  # <title lang="chinese">学习 XML</title>

Output:

<title lang="english">Harry Potter</title>
<title lang="chinese">学习 XML</title>

(10) Using "*" to match any element (not Text or Comment nodes)

In xpath, the asterisk "*" matches any Element node; it does not match Text or Comment nodes.

# Using "*" to match any element (not Text or Comment nodes)
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book id="b1">
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//bookstore/*/title")
print(s.extract())  # ['<title lang="english">Harry Potter</title>', '<title lang="chinese">学习 XML</title>']

Output:

['<title lang="english">Harry Potter</title>', '<title lang="chinese">学习 XML</title>']

Here s = selector.xpath("//bookstore/*/title") finds the <title> grandchildren of <bookstore>, with one level of any element in between.

(11) Using "@*" to match any attribute

In xpath, "@*" matches any attribute.

# Using "@*" to match any attribute
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book>
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//book[@*]/title")  # the <title> under any <book> that has at least one attribute
print(s.extract())  # ['<title lang="chinese">学习 XML</title>']
s = selector.xpath("//@*")  # every attribute node in the document
print(s.extract())  # ['english', 'b2', 'chinese']

Output:

['<title lang="chinese">学习 XML</title>']
['english', 'b2', 'chinese']

Here s = selector.xpath("//book[@*]/title") finds the <title> under any <book> that has at least one attribute, which matches only the second <book>.

s = selector.xpath("//@*") finds every attribute node in the document.

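(12) Finding an element's parent node

In xpath, "element/parent::*" selects the parent node of element; an element has only one parent. Writing element/parent::tag instead requires the parent to be a <tag> element: the result is the parent if it is a <tag>, and [] otherwise.

# Finding an element's parent node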
from scrapy.selector import Selector

htmlText = '''
<html><body>
<bookstore>
    <book>
        <title lang="english">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book id="b2">
        <title lang="chinese">学习 XML</title>
        <price>39.95</price>
    </book>
</bookstore>
</body></html>
'''
selector = Selector(text=htmlText)
s = selector.xpath("//title[@lang='chinese']/parent::*")  # equivalent here to /parent::book
print(s.extract())  # ['<book id="b2">\n        <title lang="chinese">学习 XML</title>\n        <price>39.95</price>\n    </book>']

Output:

['<book id="b2">\n <title lang="chinese">学习 XML</title>\n <price>39.95</price>\n </book>']

Here s = selector.xpath("//title[@lang='chinese']/parent::*") finds the parent of the <title> element with lang='chinese', which is the <book> element with id="b2".

(13) Finding following siblings

In xpath, "element/following-sibling::*" selects all the siblings that come after element at the same level, and "element/following-sibling::*[position()=1]" selects the first sibling after element.

# Finding following siblings
from scrapy.selector import Selector

htmlText = """<a>A1</a>
              <b>B1</b>
              <c>C1</c>
              <d>D<e>E</e></d>
              <b>B2</b>
              <c>C2</c>"""
selector = Selector(text=htmlText)
s = selector.xpath("//a/following-sibling::*")  # all siblings after <a>
print(s.extract())  # ['<b>B1</b>', '<c>C1</c>', '<d>D<e>E</e></d>', '<b>B2</b>', '<c>C2</c>']
s = selector.xpath("//a/following-sibling::*[position()=1]")  # the first sibling after <a>
print(s.extract())  # ['<b>B1</b>']
s = selector.xpath("//b[position()=1]/following-sibling::*")  # all siblings after the first <b>
print(s.extract())  # ['<c>C1</c>', '<d>D<e>E</e></d>', '<b>B2</b>', '<c>C2</c>']
s = selector.xpath("//b[position()=1]/following-sibling::*[position()=1]")  # the first sibling after the first <b>
print(s.extract())  # ['<c>C1</c>']

Output:

['<b>B1</b>', '<c>C1</c>', '<d>D<e>E</e></d>', '<b>B2</b>', '<c>C2</c>']
['<b>B1</b>']
['<c>C1</c>', '<d>D<e>E</e></d>', '<b>B2</b>', '<c>C2</c>']
['<c>C1</c>']

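(14) Finding preceding siblings

In xpath, "element/preceding-sibling::*" selects all the siblings that come before element at the same level, and "element/preceding-sibling::*[position()=1]" selects the nearest sibling before element, because positions on this axis are counted backwards from element.

# Finding preceding siblings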
from scrapy.selector import Selector

htmlText = """<a>A1</a>
              <b>B1</b>
              <c>C1</c>
              <d>D<e>E</e></d>
              <b>B2</b>
              <c>C2</c>"""
selector = Selector(text=htmlText)
s = selector.xpath("//a/preceding-sibling::*")
print(s.extract())  # []
s = selector.xpath("//b/preceding-sibling::*[position()=1]")  # the nearest preceding sibling of each <b>
print(s.extract())  # ['<a>A1</a>', '<d>D<e>E</e></d>']
s = selector.xpath("//b[position()=2]/preceding-sibling::*")  # all siblings before the second <b>
print(s.extract())  # ['<a>A1</a>', '<b>B1</b>', '<c>C1</c>', '<d>D<e>E</e></d>']
s = selector.xpath("//b[position()=2]/preceding-sibling::*[position()=1]")  # position()=1 here means the nearest preceding sibling
print(s.extract())  # ['<d>D<e>E</e></d>']