Python Web Scraping: XPath Syntax and Worked Examples
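What Is XPath?
XPath, short for XML Path Language, is a language for finding information in XML documents. It was designed for searching XML, but it works equally well on HTML, so it is a natural fit for extracting information in a scraper.
XPath offers powerful selection through concise path expressions. The XPath 2.0 function library also provides more than 100 built-in functions for matching strings, numbers, and dates and for processing nodes and sequences (lxml, used throughout this article, implements XPath 1.0, whose core function library is much smaller). Almost any node you want to locate can be selected with XPath.
The sections below cover the points used most often in practice. For full details, see the W3School tutorial: https://www.w3school.com.cn/xpath/index.asp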
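XPath Syntax
Common Path Rules

| Expression | Description | Example | Result |
| --- | --- | --- | --- |
| nodename | Selects the child elements of the current node that are named nodename | xpath('div') | Selects the div children of the current node |
| / | Selects from the root node | xpath('/div') | Selects the div node at the root |
| // | Selects matching nodes anywhere in the document, regardless of position | xpath('//div') | Selects all div nodes |
| . | Selects the current node | xpath('./div') | Selects the div nodes under the current node |
| .. | Selects the parent of the current node | xpath('..') | Moves up one level |
| @ | Selects attributes | xpath('//@class') | Selects all class attributes |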
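The path rules in the table can be tried out on a small document. The following is a minimal sketch using lxml; the markup and the ids `outer` and `inner` are made up for illustration and do not come from the article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.HTML("""
<div id="outer" class="box">
  <p class="note">hello</p>
  <div id="inner"><p>world</p></div>
</div>
""")

all_divs = doc.xpath("//div")             # every div, at any depth
root_divs = doc.xpath("/html/body/div")   # absolute path from the root
outer = root_divs[0]
child_ps = outer.xpath("./p")             # only the p children of the current node
parent = child_ps[0].xpath("..")[0]       # step up to the parent element
classes = doc.xpath("//@class")           # attribute values come back as strings

print([d.get("id") for d in all_divs])    # ['outer', 'inner']
print(len(root_divs))                     # 1
print([p.text for p in child_ps])         # ['hello']
print(parent.get("id"))                   # outer
print(classes)                            # ['box', 'note']
```

Note that `etree.HTML` wraps the fragment in html and body nodes, which is why the absolute path starts with `/html/body`.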
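Predicates
A predicate is written inside square brackets and is used to find a specific node or a node that contains a specified value.

| Expression | Result |
| --- | --- |
| xpath('/body/div[1]') | Selects the first div node under body |
| xpath('/body/div[last()]') | Selects the last div node under body |
| xpath('/body/div[last()-1]') | Selects the second-to-last div node under body |
| xpath('/body/div[position()<3]') | Selects the first two div nodes under body |
| xpath('/body/div[@class]') | Selects the div nodes under body that have a class attribute |
| xpath('/body/div[@class="main"]') | Selects the div nodes under body whose class attribute is main |
| xpath('/body/div[price>35.00]') | Selects the div nodes under body whose price child element has a value greater than 35 |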
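Each predicate in the table can be checked against a tiny XML document. This is a sketch only: the three div "products" and their price values are invented for the example, and the document is parsed as XML so that body is the root element, matching the paths in the table.

```python
from lxml import etree

# Hypothetical XML: three div nodes, each with a price child element.
doc = etree.fromstring("""
<body>
  <div class="main"><price>29.99</price></div>
  <div><price>39.95</price></div>
  <div class="side"><price>49.00</price></div>
</body>
""")

first = doc.xpath("/body/div[1]")                 # XPath positions start at 1
last = doc.xpath("/body/div[last()]")
second_last = doc.xpath("/body/div[last()-1]")
first_two = doc.xpath("/body/div[position()<3]")
with_class = doc.xpath("/body/div[@class]")
main = doc.xpath('/body/div[@class="main"]')
pricey = doc.xpath("/body/div[price>35.00]")

print(first[0].get("class"))                      # main
print(last[0].get("class"))                       # side
print(second_last[0].get("class"))                # None
print(len(first_two), len(with_class), len(main)) # 2 2 1
print([d.findtext("price") for d in pricey])      # ['39.95', '49.00']
```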
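Wildcards
Wildcards select XML elements whose names are not known in advance.

| Expression | Result |
| --- | --- |
| xpath('/div/*') | Selects all child elements of div |
| xpath('//div[@*]') | Selects all div nodes that have at least one attribute |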
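A short sketch of both wildcards, run against an invented XML document (the element names and the id value are assumptions made for the example):

```python
from lxml import etree

# Hypothetical XML used only to demonstrate * and @*.
doc = etree.fromstring("""
<root>
  <div id="a"><p>one</p><span>two</span></div>
  <div><p>three</p></div>
</root>
""")

children = doc.xpath("/root/div[1]/*")   # all child elements, whatever their names
with_attrs = doc.xpath("//div[@*]")      # div nodes carrying any attribute

print([c.tag for c in children])         # ['p', 'span']
print([d.get("id") for d in with_attrs]) # ['a']
```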
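Selecting Multiple Paths
The "|" operator combines several paths into one expression.

| Expression | Result |
| --- | --- |
| xpath('//div\|//table') | Selects all div and table nodes |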
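As a quick illustration of the "|" operator, the sketch below unions two paths over a made-up fragment and counts what comes back:

```python
from lxml import etree

# Hypothetical fragment with two div nodes and one table.
doc = etree.HTML("<div>a</div><table><tr><td>b</td></tr></table><div>c</div>")

nodes = doc.xpath("//div|//table")   # union of both node sets
tags = sorted(n.tag for n in nodes)

print(tags)                          # ['div', 'div', 'table']
```

The result is a single list, so one loop can handle nodes matched by either path.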
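Functions
Functions make fuzzy matching much easier.

| Function / operator | Usage | Explanation |
| --- | --- | --- |
| starts-with | xpath('//div[starts-with(@id,"ma")]') | Selects div nodes whose id starts with ma |
| contains | xpath('//div[contains(@id,"ma")]') | Selects div nodes whose id contains ma |
| and | xpath('//div[contains(@id,"ma") and contains(@id,"in")]') | Selects div nodes whose id contains both ma and in |
| text() | xpath('//div[contains(text(),"ma")]') | Selects div nodes whose text contains ma |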
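The four rows of the table can be compared side by side. The ids and texts below are invented so that each expression matches a different subset; treat it as a sketch, not as data from a real page.

```python
from lxml import etree

# Hypothetical div nodes whose ids and texts exercise each function.
doc = etree.HTML("""
<div id="main">main text</div>
<div id="mail">other</div>
<div id="domain">ma in the middle</div>
<div id="footer">end</div>
""")

def ids(expr):
    """Return the id attribute of every node matched by expr."""
    return [d.get("id") for d in doc.xpath(expr)]

starts = ids('//div[starts-with(@id,"ma")]')
has_ma = ids('//div[contains(@id,"ma")]')
both = ids('//div[contains(@id,"ma") and contains(@id,"in")]')
by_text = ids('//div[contains(text(),"ma")]')

print(starts)    # ['main', 'mail']
print(has_ma)    # ['main', 'mail', 'domain']
print(both)      # ['main', 'domain']
print(by_text)   # ['main', 'domain']
```

In XPath 1.0, `contains(text(), ...)` only looks at the first direct text node of each element, which matters once elements contain nested tags.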
Getting Familiar with the Syntax
Let's warm up on a piece of HTML. The code is as follows:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# time: 2022/8/8 0:05
# author: gangtie
# email: 648403020@qq.com

from lxml import etree

text = '''
<div>
    <ul id='ultest'>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html"><span>fourth item</span></a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
# Initialize with the HTML class; this builds an object that XPath can query.
# etree.HTML parses the string.
page = etree.HTML(text)
print(type(page))
```
The printed type shows that the string has become an XML element:

```
<class 'lxml.etree._Element'>
```
Converting a String to HTML
Serialize the tree that etree.HTML built from the string to see the resulting HTML:

```python
print(etree.tostring(page, encoding='utf-8').decode('utf-8'))
```

```
<html><body><div>
    <ul id="ultest">
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html"><span>fourth item</span></a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </li></ul>
</div>
</body></html>
```

After parsing, the missing </li> has been filled in automatically, and html and body nodes have been added as well.
Absolute Paths
Use an absolute path to get the text of every a tag:

```python
a = page.xpath("/html/body/div/ul/li/a")
for i in a:
    print(i.text)
```

```
first item
second item
third item
None
fifth item
```

The fourth result is None because the text of that a tag sits inside a nested span, and .text only returns text that appears directly inside the element before its first child.
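The `None` in the output above comes from the link whose text is wrapped in a span. One way to collect the nested text as well is sketched below, using lxml's `itertext()` method and the XPath `string()` function on a trimmed copy of the sample list (the variable name `snippet` is introduced here only so the example runs on its own).

```python
from lxml import etree

# A trimmed copy of the sample list, repeated so the snippet is self-contained.
snippet = '''
<ul>
  <li><a href="link3.html">third item</a></li>
  <li><a href="link4.html"><span>fourth item</span></a></li>
</ul>
'''
links = etree.HTML(snippet).xpath("//a")

direct = [a.text for a in links]                    # direct text only
full = ["".join(a.itertext()) for a in links]       # text of all descendants
via_xpath = [a.xpath("string(.)") for a in links]   # same idea, done in XPath

print(direct)     # ['third item', None]
print(full)       # ['third item', 'fourth item']
print(via_xpath)  # ['third item', 'fourth item']
```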
Relative Paths (Commonly Used)
Get the text of every a tag in the document, wherever it appears:

```python
html = etree.HTML(text)
a = html.xpath("//a/text()")
print(a)
```

```
['first item', 'second item', 'third item', 'fifth item']
```

'fourth item' is missing because text() only returns text nodes that are direct children of a, and that text lives inside a span.
The Current Node
. selects the current node.
We first locate the ul element, which returns a list. Querying "." on its first item returns that same ul. We then print its li child nodes and, finally, the link text via text().

```python
page = etree.HTML(text)
ul = page.xpath("//ul")
print(ul)
print(ul[0].xpath("."))
print(ul[0].xpath("./li"))
print(ul[0].xpath("./li/a/text()"))
```

```
[<Element ul at 0x234d16186c0>]
[<Element ul at 0x234d16186c0>]
[<Element li at 0x234d1618ac0>, <Element li at 0x234d1618b00>, <Element li at 0x234d1618b40>, <Element li at 0x234d1618b80>, <Element li at 0x234d1618bc0>]
['first item', 'second item', 'third item', 'fifth item']
```
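The leading dot matters when querying from an element. A path that starts with `//` searches the whole document even when it is called on a sub-element, while `./` stays inside the current node. The sketch below shows the difference on a made-up page with two lists (the ids `first` and `second` are assumptions for the example).

```python
from lxml import etree

# Hypothetical page with two separate lists.
doc = etree.HTML("""
<ul id="first"><li>a</li><li>b</li></ul>
<ul id="second"><li>c</li></ul>
""")
second = doc.xpath('//ul[@id="second"]')[0]

relative = second.xpath("./li/text()")   # stays inside the second ul
absolute = second.xpath("//li/text()")   # restarts from the document root

print(relative)   # ['c']
print(absolute)   # ['a', 'b', 'c']
```

This is why the case studies further down use expressions such as `.//a` inside their loops.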
The Parent Node
.. selects the parent of the current node.
Here it returns the div one level above the ul:

```python
page = etree.HTML(text)
ul = page.xpath("//ul")
print(ul[0].xpath("."))
print(ul[0].xpath(".."))
```

```
[<Element ul at 0x1d6d5cd8540>]
[<Element div at 0x1d6d5cd8940>]
```
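Besides `..`, the parent can also be reached with the XPath `parent::` axis or with lxml's `getparent()` method. A small sketch on an invented fragment (the id `box` is an assumption):

```python
from lxml import etree

# Hypothetical fragment: a ul nested in a div.
doc = etree.HTML('<div id="box"><ul><li>x</li></ul></div>')
ul = doc.xpath("//ul")[0]

via_dots = ul.xpath("..")[0]            # abbreviated syntax
via_axis = ul.xpath("parent::div")[0]   # explicit axis, with a name test
via_api = ul.getparent()                # lxml element API, no XPath involved

print(via_dots.get("id"), via_axis.get("id"), via_api.get("id"))  # box box box
```

The axis form is handy when the parent should only match if it has a particular tag name.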
Attribute Matching
Use the @ symbol to filter by attribute.
Get the text of the a tag whose href attribute is link2.html:

```python
html = etree.HTML(text)
a = html.xpath("//a[@href='link2.html']/text()")
print(a)
```

```
['second item']
```
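The @ symbol can also end a path, in which case the attribute values themselves are returned. The sketch below contrasts that with reading an attribute from an element through lxml's `get()`; the two links are a made-up excerpt in the style of the sample list.

```python
from lxml import etree

# Hypothetical excerpt with two links.
doc = etree.HTML('<div><a href="link1.html">first</a><a href="link2.html">second</a></div>')

hrefs = doc.xpath("//a/@href")                    # attribute values as strings
second = doc.xpath("//a[@href='link2.html']")[0]  # the element itself

print(hrefs)                 # ['link1.html', 'link2.html']
print(second.get("href"))    # link2.html
print(second.get("title"))   # None (attribute not present)
```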
Functions
last(): get the text of the a tag inside the last li tag

```python
html = etree.HTML(text)
a = html.xpath("//li[last()]/a/text()")
print(a)
```

```
['fifth item']
```
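One detail worth knowing about `last()`: in `//li[last()]` the predicate is evaluated per parent, so a page with several lists yields the last li of each one. Wrapping the path in parentheses applies the predicate to the combined result instead. A minimal sketch on an invented two-list fragment:

```python
from lxml import etree

# Hypothetical fragment with two lists.
doc = etree.HTML("<ul><li>a</li><li>b</li></ul><ul><li>c</li><li>d</li></ul>")

per_parent = doc.xpath("//li[last()]/text()")    # last li within each ul
overall = doc.xpath("(//li)[last()]/text()")     # last li in the whole document

print(per_parent)   # ['b', 'd']
print(overall)      # ['d']
```

The sample list in this article has a single ul, so both forms give the same answer there.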
contains: find the a tags whose href attribute contains link, and output their text

```python
html = etree.HTML(text)
a = html.xpath("//a[contains(@href, 'link')]/text()")
print(a)
```

```
['first item', 'second item', 'third item', 'fifth item']
```
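`contains` can test text as well as attributes, but the first argument decides which text is examined. `text()` refers to direct text nodes only, while `.` stands for the full string value of the element, nested tags included. The sketch below uses a made-up excerpt in which one link wraps its text in a span.

```python
from lxml import etree

# Hypothetical excerpt: the first link keeps its text inside a span.
doc = etree.HTML(
    '<div><a href="link4.html"><span>fourth item</span></a>'
    '<a href="other.html">fifth item</a></div>'
)

by_href = doc.xpath("//a[contains(@href, 'link')]/@href")
by_text = doc.xpath("//a[contains(text(), 'fourth')]/@href")   # direct text only
by_string = doc.xpath("//a[contains(., 'fourth')]/@href")      # whole string value

print(by_href)     # ['link4.html']
print(by_text)     # []
print(by_string)   # ['link4.html']
```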
Case Studies
With the basics covered, here are a few practical cases to practice on.
Case 1: Douban Books

```python
# -*-coding:utf8 -*-
# 1. Request the page and extract the required fields
# 2. Save the extracted data
import requests
from lxml import etree


class DoubanBook():
    def __init__(self):
        self.base_url = 'https://book.douban.com/chart?subcat=all&icn=index-topchart-popular'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/104.0.0.0 Safari/537.36'
        }

    # Request the page and extract the required fields
    def crawl(self):
        res = requests.get(self.base_url, headers=self.headers)
        lis = etree.HTML(res.text).xpath('//*[@id="content"]/div/div[1]/ul/li')
        # print(type(lis))
        books = []
        for li in lis:
            # print(etree.tostring(li,encoding='utf-8').decode('utf-8'))
            # print("==================================================")
            title = "".join(li.xpath(".//a[@class='fleft']/text()"))
            score = "".join(li.xpath(".//p[@class='clearfix w250']/span[2]/text()"))
            # The raw list looks like ['\n 刘瑜 / 2022-4 / 广西师范大学出版社 / 82.00元 / 精装\n '],
            # so join it and strip the surrounding whitespace
            publishing = "".join(li.xpath(".//p[@class='subject-abstract color-gray']/text()")).strip()
            book = {
                'title': title,
                'score': score,
                'publishing': publishing,
            }
            books.append(book)
        self.save_data(books)

    def save_data(self, datas):
        with open('books.txt', 'w', encoding='utf-8') as f:
            f.write(str(datas))

    def run(self):
        self.crawl()


if __name__ == '__main__':
    DoubanBook().run()
```
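The extraction logic in this case can be tested without touching the network by moving it into a function that takes an HTML string. The sketch below does that. `parse_books` and `SAMPLE` are hypothetical names, and the markup only mimics the class names used in the XPath expressions above; it is not a copy of the real Douban page, whose structure may differ or change.

```python
from lxml import etree

# Hypothetical markup that mimics the class names queried above.
SAMPLE = '''
<div id="content"><ul>
  <li>
    <a class="fleft" href="#">Book A</a>
    <p class="subject-abstract color-gray">
        Author A / 2022-4 / Press A / 82.00 / hardcover
    </p>
    <p class="clearfix w250"><span class="stars"></span><span>8.9</span></p>
  </li>
</ul></div>
'''


def parse_books(html):
    """Extract title, score and publishing info from every li under #content."""
    books = []
    for li in etree.HTML(html).xpath('//*[@id="content"]//li'):
        books.append({
            'title': "".join(li.xpath(".//a[@class='fleft']/text()")).strip(),
            'score': "".join(li.xpath(".//p[@class='clearfix w250']/span[2]/text()")).strip(),
            'publishing': "".join(
                li.xpath(".//p[@class='subject-abstract color-gray']/text()")
            ).strip(),
        })
    return books


books = parse_books(SAMPLE)
print(books[0]['title'], books[0]['score'])   # Book A 8.9
```

Keeping the parsing separate from the request makes it easier to adjust the XPath expressions when the page layout changes.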
Case 2: Downloading Images from Bian (彼岸, pic.netbian.com)

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: 钢铁知识库
# email: 648403020@qq.com
import os
import requests
from lxml import etree


# Download images from pic.netbian.com
class BiAn():
    def __init__(self):
        self.url = 'https://pic.netbian.com'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/104.0.0.0 Safari/537.36',
            'cookie': '__yjs_duid=1_cb922eedbda97280755010e53b2caca41659183144320; Hm_lvt_c59f2e992a863c2744e1ba985abaea6c=1649863747,1660203266; zkhanecookieclassrecord=%2C23%2C54%2C55%2C66%2C60%2C; Hm_lpvt_c59f2e992a863c2744e1ba985abaea6c=1660207771; yjs_js_security_passport=1225f36e8501b4d95592e5e7d5202f4081149e51_1630209607_js'
        }
        # Create the output directory first; writing into a missing directory raises an error
        if not os.path.exists('BianPicture'):
            os.mkdir('BianPicture')

    # Request the page and get the list of li nodes
    def crawl(self):
        res = requests.get(self.url, headers=self.headers)
        res.encoding = 'gbk'
        uls = etree.HTML(res.text).xpath('//div[@class="slist"]/ul[@class="clearfix"]/li')
        # print(etree.tostring(uls,encoding='gbk').decode('gbk'))
        # Loop over the items, read each image name and address,
        # build the full URL and request it again to download the image
        for ul in uls:
            img_name = ul.xpath('.//a/b/text()')[0]
            img_src = ul.xpath('.//a/span/img/@src')[0]
            # print(img_name + img_src)
            img_url = self.url + img_src
            # Download the image; .content returns the raw bytes
            img_res = requests.get(img_url, headers=self.headers).content
            img_path = "BianPicture/" + img_name + ".jpg"
            data = {
                'img_res': img_res,
                'img_path': img_path
            }
            self.save_data(data)

    # Save the data
    def save_data(self, data):
        with open(data['img_path'], 'wb') as f:
            f.write(data['img_res'])
            # print(data)

    def run(self):
        self.crawl()


if __name__ == '__main__':
    BiAn().run()
```
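Two small helpers can make the URL and path handling in this case more robust. This is a sketch under assumptions: `build_image_url` and `safe_filename` are hypothetical names, and the example paths are invented. It relies on `urllib.parse.urljoin` from the standard library, which resolves relative and absolute `src` values alike, and on a regular expression that replaces characters not allowed in Windows file names.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://pic.netbian.com'


def build_image_url(src):
    """Resolve an img src against the site root (handles absolute URLs too)."""
    return urljoin(BASE_URL, src)


def safe_filename(name, ext=".jpg"):
    """Replace characters that are invalid in file names and add an extension."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    return (cleaned or "unnamed") + ext


# Hypothetical inputs for illustration.
print(build_image_url('/uploads/a.jpg'))                 # https://pic.netbian.com/uploads/a.jpg
print(build_image_url('https://cdn.example.com/b.jpg'))  # https://cdn.example.com/b.jpg
print(safe_filename('a/b:c'))                            # a_b_c.jpg
```

In the crawler these could stand in for the string concatenations `self.url + img_src` and `"BianPicture/" + img_name + ".jpg"`.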
Case 3: Scraping City Names Nationwide

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: 钢铁知识库
# email: 648403020@qq.com
import os
import requests
from lxml import etree


class CityName():
    def __init__(self):
        self.url = 'https://www.aqistudy.cn/historydata/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
        }
        # Create the output directory if it does not exist
        if not os.path.exists('city_project'):
            os.mkdir('city_project')

    def crawl(self):
        res = requests.get(url=self.url, headers=self.headers).text
        uls = etree.HTML(res).xpath('//div[@class="all"]/div[2]/ul/div[2]/li')
        all_city_name = list()
        for ul in uls:
            city_name = ul.xpath('.//a/text()')[0]
            # print(type(city_name))
            all_city_name.append(city_name)
        self.save_data(all_city_name)

    def save_data(self, data):
        # An explicit encoding keeps the Chinese city names intact on every platform
        with open('./city_project/city.txt', 'w', encoding='utf-8') as f:
            f.write(str(data))

    def run(self):
        self.crawl()


if __name__ == '__main__':
    CityName().run()
```
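Indexing an XPath result with `[0]`, as the loops in these cases do, raises an IndexError when an item has no match. A small guard avoids that. In the sketch below, `first_or_none` is a hypothetical helper and the markup is invented; one li deliberately has no link.

```python
from lxml import etree

# Hypothetical list of city links; the middle li has no <a> at all.
SAMPLE = ('<ul><li><a href="#">Beijing</a></li><li></li>'
          '<li><a href="#">Shanghai</a></li></ul>')


def first_or_none(nodes):
    """Return the first XPath result, or None when the list is empty."""
    return nodes[0] if nodes else None


names = []
for li in etree.HTML(SAMPLE).xpath('//li'):
    name = first_or_none(li.xpath('.//a/text()'))
    if name is not None:          # skip items without a link
        names.append(name.strip())

print(names)   # ['Beijing', 'Shanghai']
```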
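XPath Tools
Generating XPath Expressions in Chrome
Regular Chrome users probably know this feature already. Open DevTools (Ctrl+Shift+I or F12), pick the element (Ctrl+Shift+C), then in the Elements tab right-click the element and choose Copy -> Copy XPath to get that element's XPath.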
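An XPath copied from the browser describes the DOM as the browser built it, which can differ from what lxml sees in the raw HTML. A common example is tables: browsers insert a tbody element when parsing, whereas lxml, as far as I know, leaves the table as written. The sketch below illustrates this on a made-up fragment and shows a more tolerant expression.

```python
from lxml import etree

# Hypothetical raw HTML with no tbody in the source.
doc = etree.HTML("<table><tr><td>cell</td></tr></table>")

copied_style = doc.xpath("//table/tbody/tr/td/text()")  # path as a browser might report it
as_written = doc.xpath("//table/tr/td/text()")          # path matching the raw source
tolerant = doc.xpath("//table//td/text()")              # works with or without tbody

print(copied_style)
print(as_written)
print(tolerant)
```

When a copied XPath returns nothing in lxml, loosening it with `//` or matching on attributes rather than exact positions is usually a good first step.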
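The XPath Helper Extension
With XPath Helper installed in Chrome, it is easy to check whether an XPath expression is correct. Installing the extension requires access to the Chrome Web Store. Once it is installed, click its icon in the top-right corner of Chrome to open the extension's black panel and type in your XPath expression; the elements the expression selects are highlighted in yellow.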
https://www.cnblogs.com/jiba/p/16589856.html