Python Web Scraping with Beautiful Soup - CJavaPy

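1. Introduction to Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding. Beautiful Soup works on top of Python parsers such as lxml and html5lib, so users can flexibly choose between different parsing strategies or trade flexibility for speed.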
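
The encoding behavior described above can be sketched roughly as follows. This is a minimal illustration, not the only way to handle encodings: the GBK byte string is a made-up input with no declared encoding, and `html.parser` is chosen only so the sketch needs no extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no <meta charset> declaration.
raw = "<p>你好，世界</p>".encode("gbk")

# from_encoding tells Beautiful Soup which encoding to try first.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # decoded to a Unicode str
utf8_bytes = soup.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(soup.original_encoding)      # the encoding Beautiful Soup settled on
print(utf8_bytes)
```

If the document does declare its encoding, or the bytes are plain UTF-8, the `from_encoding` argument can usually be left out.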

English documentation:

Beautiful Soup

Beautiful Soup Documentation

Chinese documentation:

Beautiful Soup 4.4.0 documentation

2. Beautiful Soup parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, as shown in the table below.

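Parser	Usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	1) Part of the Python standard library 2) Decent speed 3) Lenient with malformed documents	Poor tolerance of malformed documents in some older Python versions (before 2.7.3 and 3.2.2)
lxml HTML parser	BeautifulSoup(markup, "lxml")	1) Very fast 2) Lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") BeautifulSoup(markup, "xml")	1) Very fast 2) The only supported XML parser	Requires an external C library
html5lib parser	BeautifulSoup(markup, "html5lib")	1) Most lenient 2) Parses documents the same way a browser does 3) Produces valid HTML5	1) Very slow 2) Requires an external Python package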

Note: if performance matters, use the lxml HTML parser.
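
One possible way to act on this note is to prefer lxml when it is installed and fall back to the built-in parser otherwise. The sketch below is only an illustration: `make_soup` and its `preferred` parameter are hypothetical names, not part of Beautiful Soup. It relies on `FeatureNotFound`, which Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: build a soup with the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello<b>world")
print(soup.b.string)
```

Because different parsers can build different trees from invalid markup, a fallback like this is safest when the code only depends on structure that every parser produces.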

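3. Getting element tags

Basic element	Description
Tag	A tag, the basic unit of information organization; it opens with <> and closes with </>
Name	The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text in <>…</>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special type of NavigableString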
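
As a rough illustration of the element types in the table, the sketch below parses a small made-up snippet (the `id="intro"` attribute and the comment text are invented for the example) and inspects each kind of object.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<p class="title" id="intro"><b>cjavapy</b><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                  # a Tag object
tag_name = p.name           # the tag's name
css_classes = p["class"]    # class is multi-valued, so it is a list
element_id = p["id"]        # ordinary attributes are plain strings
inner_text = p.b.string     # a NavigableString
comment = p.contents[1]     # a Comment, which is also a NavigableString

print(type(p).__name__, tag_name, p.attrs)
print(inner_text, type(comment).__name__, repr(str(comment)))
```

Checking `isinstance(node, Comment)` is a simple way to skip comments when collecting text from a tag.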

1) Using CSS selector syntax

To find tags with CSS selector syntax, use the select() method, as in the following code:

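import bs4
import requests

# Fetch the page and raise an exception on an HTTP error status.
response = requests.get("https://en.wikipedia.org/wiki/Mathematics", timeout=10)
response.raise_for_status()

html = bs4.BeautifulSoup(response.text, 'html.parser')
# select() returns a list of every element matching the CSS selector.
title = html.select("#firstHeading")[0].text
print(title)
paragraphs = html.select("p")
for para in paragraphs:
    print(para.text)
# Join the first five paragraphs into an introduction.
intro = '\n'.join(para.text for para in paragraphs[0:5])
print(intro)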
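
For experimenting without network access, the same select() calls can be tried on an inline document. The sketch below reuses this article's sample HTML and shows a few common selector forms; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<body>
<p class="title"><b>cjavapy</b></p>
<p class="story">www.cjavapy.com
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;
cjavapy</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.sister")          # tag plus class
by_id = soup.select_one("#link3")           # first match, or None
by_child = soup.select("p.story > a")       # direct children of p.story
by_attr = soup.select('a[href$="lacie"]')   # attribute value ends with "lacie"
missing = soup.select_one("#nope")          # no match

print([a.text for a in by_class])
print(by_id.text, by_attr[0].text, missing)
```

select_one() is convenient when only the first match is needed, since it avoids indexing into an empty list when nothing matches.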

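2) Using the find_all() and find() methods

find() returns the first matching tag (or None if nothing matches), and find_all() returns all matching tags as a list. select() is more convenient when you already think in CSS selectors, while find_all() is more widely used and accepts more tunable parameters.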

from bs4 import BeautifulSoup  
  
html = '''
<html><head><title>cjavapy</title></head>  
<body>  
<p class="title" name="dromouse"><b>cjavapy</b></p>  
<p class="story">www.cjavapy.com 
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and  
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;  
cjavapy</p>  
</body>  
</html>  
'''
  
soup = BeautifulSoup(html, "lxml")
# find() returns the first matching tag.
body_tag = soup.find(name='body')
print(body_tag)
# find_all() returns every matching tag.
tags = soup.find_all(name='p')
for tag in tags:
    print(tag)
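
The tunable parameters mentioned above can be sketched as follows. This is a small sample rather than a full list, and it uses `html.parser` only to avoid the lxml dependency. Note that the HTML attribute called `name` has to go through `attrs`, because the first argument of find_all() is itself called `name`.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="title" name="dromouse"><b>cjavapy</b></p>
<p class="story">www.cjavapy.com
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;
cjavapy</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("a", class_="sister")               # filter by CSS class
by_id = soup.find("a", id="link3")                           # filter by attribute
by_attrs = soup.find_all("p", attrs={"name": "dromouse"})    # attribute named "name"
by_regex = soup.find_all("a", href=re.compile(r"tillie$"))   # regex on an attribute
by_string = soup.find_all(string="pandas")                   # match text nodes
limited = soup.find_all("a", limit=1)                        # stop after one match
missing = soup.find("table")                                 # None when nothing matches

print(len(by_class), by_id.text, by_attrs[0].b.string)
print(by_regex[0]["id"], by_string, limited[0]["id"], missing)
```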

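4. Traversing elements

1) Traversing parents

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used to loop over ancestors

Example code: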

from bs4 import BeautifulSoup  
  
html = '''
<html><head><title>cjavapy</title></head>  
<body>  
<p class="title" name="dromouse"><b>cjavapy</b></p>  
<p class="story">www.cjavapy.com 
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and  
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;  
cjavapy</p>  
</body>  
</html>  
'''
  
soup = BeautifulSoup(html, "lxml")  
for parent in soup.a.parents:
    # The last ancestor is the BeautifulSoup object, whose name is "[document]".
    print(parent.name)
# .parent returns only the direct parent.
print(soup.a.parent.name)

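2) Traversing siblings

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order

Example code: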

from bs4 import BeautifulSoup  
  
html = '''
<html><head><title>cjavapy</title></head>  
<body>  
<p class="title" name="dromouse"><b>cjavapy</b></p>  
<p class="story">www.cjavapy.com 
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and  
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;  
cjavapy</p>  
</body>  
</html>  
'''
  
soup = BeautifulSoup(html, "lxml")  
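# .next_siblings and .previous_siblings yield every sibling, text nodes included.
for sibling in soup.a.next_siblings:
    print(sibling)
for sibling in soup.a.previous_siblings:
    print(sibling)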
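
One detail worth keeping in mind with sibling traversal is that siblings include text nodes, so the next sibling of a tag is often a string rather than the next tag. The sketch below, using a fragment of this article's sample HTML, shows one possible way to keep only the tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<p class="story">www.cjavapy.com
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;
cjavapy</p>"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a", id="link2")

# The immediate next sibling is the text between the two links.
next_node = first_link.next_sibling

# Keep only the siblings that are tags.
tag_siblings = [s for s in first_link.next_siblings if isinstance(s, Tag)]

print(repr(next_node))
print([t["id"] for t in tag_siblings])
```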

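3) Traversing children

Attribute	Description
.contents	A list of the child nodes; all direct children are stored in the list
.children	An iterator over the child nodes; similar to .contents, used to loop over direct children
.descendants	An iterator over all descendant nodes, used to loop over every nested node

Example code: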

from bs4 import BeautifulSoup  
  
html = '''
<html><head><title>cjavapy</title></head>  
<body>  
<p class="title" name="dromouse"><b>cjavapy</b></p>  
<p class="story">www.cjavapy.com 
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and  
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;  
cjavapy</p>  
</body>  
</html>  
'''
  
soup = BeautifulSoup(html, "lxml")  
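for child in soup.body.contents:      # list of direct children
    print(child)
for child in soup.body.children:      # iterator over direct children
    print(child)
for child in soup.body.descendants:   # iterator over all nested nodes
    print(child)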
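
To make the difference between direct children and descendants concrete, the following sketch uses a compact, whitespace-free variant of the sample document (an assumption made so that no stray newline text nodes appear) and compares the two.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<body><p class="title"><b>cjavapy</b></p>'
    '<p class="story">www <a id="link2">pandas</a></p></body>'
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children only: the two <p> tags.
child_names = [child.name for child in body.children]

# All nested tags, in document order.
descendant_names = [node.name for node in body.descendants if isinstance(node, Tag)]

# All text nodes with surrounding whitespace stripped.
texts = list(body.stripped_strings)

print(child_names)
print(descendant_names)
print(texts)
```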

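Extended find() methods:

Method	Description
<>.find()	Searches and returns only one result; takes the same parameters as `.find_all()`
<>.find_parents()	Searches the ancestor nodes and returns a list; takes the same parameters as `.find_all()`
<>.find_parent()	Returns one result from the ancestor nodes; takes the same parameters as `.find()`
<>.find_next_siblings()	Searches the following sibling nodes and returns a list; takes the same parameters as `.find_all()`
<>.find_next_sibling()	Returns one result from the following sibling nodes; takes the same parameters as `.find()`
<>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list; takes the same parameters as `.find_all()`
<>.find_previous_sibling()	Returns one result from the preceding sibling nodes; takes the same parameters as `.find()`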
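
A few of the methods in this table can be tried on the sample document as follows. This is a minimal sketch using `html.parser`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>cjavapy</title></head>
<body>
<p class="title" name="dromouse"><b>cjavapy</b></p>
<p class="story">www.cjavapy.com
<a href="http://example.com/lacie" class="sister" id="link2">pandas</a> and
<a href="http://example.com/tillie" class="sister" id="link3">python</a>;
cjavapy</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
link3 = soup.find("a", id="link3")

parent_p = link3.find_parent("p")                   # nearest <p> ancestor
ancestors = link3.find_parents(["p", "body"])       # matching ancestors, nearest first
prev_link = link3.find_previous_sibling("a")        # nearest preceding <a> sibling
prev_links = link3.find_previous_siblings("a")      # all preceding <a> siblings
next_link = link3.find_next_sibling("a")            # None: no <a> follows link3

print(parent_p["class"], [t.name for t in ancestors])
print(prev_link["id"], len(prev_links), next_link)
```

Unlike the plain .next_sibling and .previous_sibling attributes, these methods skip text nodes when a tag name is given, which often makes them the simpler choice.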
