Beautiful Soup - Object Types

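When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the complex HTML page into a tree of Python objects. Below we discuss the four main object types defined in the bs4 package.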

Tag
NavigableString
BeautifulSoup
Comment

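Tag Object

HTML tags are used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document. Note that the lxml parser wraps a bare fragment in <html> and <body> tags, which is why soup.html exists in the example below even though the input contains only a <b> tag.

Example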

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (type(tag))

Output

<class 'bs4.element.Tag'>

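A tag has many attributes and methods. Its two most important features are its name and its attributes.

Name (tag.name)

Every tag has a name, which can be accessed through the ".name" suffix. tag.name returns the name of the tag, such as "html" or "b".

Example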

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (tag.name)

Output

html

If we change the name of a tag, the change is reflected in the HTML markup generated by Beautiful Soup.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print (tag)

Output

<strong><body><b class="boldest">TutorialsPoint</b></body></strong>
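The examples above inspect soup.html because the lxml parser wraps the fragment in <html> and <body> tags. As a minimal sketch, assuming the built-in html.parser is used instead (it adds no wrapper tags), the same lookup and rename can be applied directly to the <b> tag:

```python
from bs4 import BeautifulSoup

# html.parser is an assumption here; it parses the fragment without
# adding <html> or <body> wrapper tags.
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'html.parser')

tag = soup.b                 # the first <b> tag in the document
print(tag.name)              # b

tag.name = "strong"          # renaming changes the generated markup
print(soup)                  # <strong class="boldest">TutorialsPoint</strong>
```

The attributes and the text content stay attached to the tag; only the tag name changes.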

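Attributes (tag.attrs)

A tag object can have any number of attributes. In the examples above, the <b> tag has an attribute named "class" whose value is "boldest". Each attribute is a name/value pair written inside the opening tag. "attrs" returns a dictionary of the attributes and their values. You can also read a single attribute by using its name as a key, as in tag['class'].

In the following example, the string argument of the BeautifulSoup() constructor contains an HTML input tag. "attrs" returns the attributes of the input tag.

Example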

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print (tag.attrs)

Output

{'type': 'text', 'name': 'name', 'value': 'Raju'}
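Reading individual attributes works much like reading from a dictionary. The sketch below, which assumes html.parser and uses a made-up attribute name ("placeholder") that is not present in the markup, shows key access, get(), and has_attr():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'html.parser')
tag = soup.input

print(tag['type'])               # text
print(tag.get('value'))          # Raju
print(tag.has_attr('name'))      # True

# "placeholder" is a hypothetical attribute that the tag does not have.
print(tag.get('placeholder'))    # None

try:
    tag['placeholder']           # key access raises KeyError when missing
except KeyError:
    print("no placeholder attribute")
```

get() is the safer choice when an attribute may be absent, since key access raises KeyError.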

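We can add, remove, or modify the attributes of a tag by using dictionary operators or methods.

In the following example, the value attribute is updated. The HTML string printed afterwards shows the change.

Example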

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print (tag.attrs)
tag['value']='Ravi'
print (soup)

Output

{'type': 'text', 'name': 'name', 'value': 'Raju'}
<html><body><input name="name" type="text" value="Ravi"/></body></html>

Here we add a new id attribute and delete the value attribute.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

tag['id']='nm'
del tag['value']
print (soup)

Output

<html><body><input id="nm" name="name" type="text"/></body></html>
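Because tag.attrs is a dictionary, the usual dictionary methods are available as well. The following is a minimal sketch, assuming html.parser, that makes the same kind of changes with update() and pop() instead of the [] and del operators:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'html.parser')
tag = soup.input

# Add an id attribute and change the existing value attribute in one call.
tag.attrs.update({'id': 'nm', 'value': 'Ravi'})
print(tag['value'])                 # Ravi

# Remove an attribute; the default avoids a KeyError if it is missing.
tag.attrs.pop('value', None)
print(sorted(tag.attrs))            # ['id', 'name', 'type']
```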

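Multi-valued Attributes

Some HTML attributes can have multiple values. The most common one is the class attribute, which can hold several CSS class names. Others include "rel", "rev", "headers", "accesskey", and "accept-charset". Beautiful Soup presents the value of a multi-valued attribute as a list.

Example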

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])

Output

css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']
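The list representation also works in the other direction. As a small sketch, assuming html.parser, assigning a list to a multi-valued attribute produces a space-separated value when the tag is turned back into markup:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')

# Assign a list of class names; they are joined with spaces on output.
css_soup.p['class'] = ['body', 'bold', 'wide']
print(css_soup.p)        # <p class="body bold wide"></p>
```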

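However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone and returns it as a single string.

Example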

from bs4 import BeautifulSoup

id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print ("id_soup.p['id']:", id_soup.p['id'])
print ("type(id_soup.p['id']):", type(id_soup.p['id']))

Output

id_soup.p['id']: body bold
type(id_soup.p['id']): <class 'str'>
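When code should not care whether an attribute is multi-valued, two options may help: get_attribute_list() always returns a list, and the multi_valued_attributes=None constructor argument turns list splitting off. The sketch below assumes html.parser and a Beautiful Soup release recent enough to provide both:

```python
from bs4 import BeautifulSoup

id_soup = BeautifulSoup('<p id="body bold"></p>', 'html.parser')

# Always a list, even for an attribute that is not multi-valued.
print(id_soup.p.get_attribute_list('id'))     # ['body bold']

# With multi_valued_attributes=None, class is kept as a plain string too.
no_list_soup = BeautifulSoup('<p class="body bold"></p>', 'html.parser',
                             multi_valued_attributes=None)
print(no_list_soup.p['class'])                # body bold
```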

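NavigableString Object

Usually, a string is placed between the opening and closing tags of some element. The browser's HTML engine applies the intended effect to the string when it renders the element. For example, in <b>Hello World</b>, the string sits between the <b> and </b> tags, so it is rendered in bold.

A NavigableString object represents the text content of a tag. It is an object of the bs4.element.NavigableString class. To access the content, use tag.string. The example below calls .string on the soup object itself, which works because the document contains a single tag with a single string inside it.

Example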

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

print (soup.string)

print (type(soup.string))

Output

Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>
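tag.string only returns a value when the tag has exactly one string to offer. As an illustration with made-up markup and html.parser, a tag with mixed content returns None from .string, while .strings and get_text() still reach the text:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", 'html.parser')

# <p> has two children (a string and a <b> tag), so .string is ambiguous.
print(soup.p.string)                       # None

# .strings yields every NavigableString inside the tag.
print([str(s) for s in soup.p.strings])    # ['Hello ', 'World']

# get_text() joins all of them into one plain string.
print(soup.p.get_text())                   # Hello World
```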

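A NavigableString behaves like a Python Unicode string, with additional features that support traversing and searching the tree. A NavigableString can be converted to a plain Unicode string with the str() function.

Example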

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
string = str(tag.string)
print (string)

Output

Hello, Tutorialspoint!
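The difference between the two kinds of string can be sketched as follows: the NavigableString keeps a link back into the tree, while the result of str() is an ordinary string with no such link.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
text = soup.h2.string

print(isinstance(text, NavigableString))   # True
print(isinstance(text, str))               # True, it subclasses str

# Tree navigation: the string knows which tag contains it.
print(text.parent.name)                    # h2
print(text.parent['id'])                   # message

# str() gives a plain string that no longer refers to the tree.
plain = str(text)
print(type(plain) is str)                  # True
print(hasattr(plain, 'parent'))            # False
```

Converting with str() is useful when the text should outlive the parsed tree, since a NavigableString keeps a reference to the whole document.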

Like a Python string, a NavigableString is immutable and cannot be edited in place. However, replace_with() can be used to replace the inner string of a tag with another string.

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print (tag.string)

Output

OnLine Tutorials Library
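To make clear that the original string object is swapped out rather than edited, the sketch below keeps a reference to it. It also shows assignment to tag.string as an alternative way to set the content; the replacement text "Another title" is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
tag = soup.h2

old = tag.string
returned = old.replace_with("OnLine Tutorials Library")

print(returned is old)     # True, the replaced string is handed back
print(old)                 # Hello, Tutorialspoint! (unchanged)
print(old.parent)          # None, it is detached from the tree
print(tag.string)          # OnLine Tutorials Library

# Assigning to tag.string replaces the tag's contents as well.
tag.string = "Another title"
print(soup)                # <h2 id="message">Another title</h2>
```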

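BeautifulSoup Object

The BeautifulSoup object represents the parsed document as a whole. It is the object created when we parse a web resource that we want to scrape. For most purposes it can be treated like a Tag object, so it supports most of the methods needed to navigate and search the document tree.

Example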

from bs4 import BeautifulSoup

# Parse the file inside a with block so that it is closed afterwards
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print (soup)
print (soup.name)
print ('type:',type(soup))

Output

<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>

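The BeautifulSoup object does not correspond to an actual HTML or XML tag, so its name attribute always returns the special value "[document]".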
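The Tag-like behaviour can be checked directly. The following sketch uses a short inline list (an assumption, in place of index.html) to show that the BeautifulSoup object is an instance of Tag and can be searched like one:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A small stand-in for index.html, used only for illustration.
html = "<ul><li>Accounts</li><li>HR</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

print(isinstance(soup, Tag))        # True
print(soup.name)                    # [document]

# Search methods such as find_all() work on the whole document.
print([li.get_text() for li in soup.find_all('li')])   # ['Accounts', 'HR']
```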

Two parsed documents can be combined by passing a BeautifulSoup object as the argument to a method that modifies the tree, such as replace_with().

Example

from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print (obj2)

Output

<html><body><book><title>Python</title></book></body></html>
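The same idea can be tried without lxml. This is a minimal sketch, assuming both documents are parsed with html.parser and using made-up markup, in which one parsed fragment replaces a placeholder tag in another document:

```python
from bs4 import BeautifulSoup

main = BeautifulSoup("<div><p>placeholder</p></div>", 'html.parser')
fragment = BeautifulSoup("<span>inserted</span>", 'html.parser')

# Replace the <p> tag in the first document with the second document.
main.p.replace_with(fragment)
print(main)          # <div><span>inserted</span></div>
```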

Comment Object

In HTML and XML documents, any text written between <!-- and --> is treated as a comment. Beautiful Soup detects such comment text as a Comment object.

Example

from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))

Output

This is a comment text in HTML <class 'bs4.element.Comment'>

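A Comment object is a special type of NavigableString. When it appears as part of a document, the prettify() method displays the comment together with its <!-- and --> delimiters:

Example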

print (soup.b.prettify())

Output

<b>
 <!--This is a comment text in HTML-->
</b>
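Since a Comment is a NavigableString subclass, comments can be told apart from ordinary text with isinstance(). The sketch below uses made-up markup and html.parser to collect every comment in a document and then remove them with extract():

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<div><!--first--><p>text</p><!--second--></div>"
soup = BeautifulSoup(markup, 'html.parser')

# Keep only the strings that are Comment objects.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c) for c in comments])                 # ['first', 'second']
print(isinstance(comments[0], NavigableString))   # True

# Detach every comment from the tree.
for comment in comments:
    comment.extract()
print(soup)                                       # <div><p>text</p></div>
```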
