Beautiful Soup - Remove Child Elements

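An HTML document is a hierarchical arrangement of tags, in which a tag may have one or more tags nested inside it at several levels. How do we remove the child elements of a given tag? With BeautifulSoup, this is easy to do.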

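The BeautifulSoup library has two main methods for removing a tag: decompose() and extract(). The difference is that extract() returns the removed element, whereas decompose() simply destroys it.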
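The difference between the two methods can be seen in a small sketch. The HTML string and the variable names below are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from the tutorial's file)
html = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

# extract() detaches the tag from the tree and returns it for reuse.
removed = first.extract()
print(removed)            # <p>first</p>
print(removed is first)   # True

# decompose() detaches the tag, destroys it, and returns None.
result = second.decompose()
print(result)             # None
print(soup)               # <div></div>
```

Use extract() when the removed element is still needed, for example to insert it somewhere else, and decompose() when it can be thrown away.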

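Therefore, to remove child elements, call the findChildren() method on the given Tag object and then call extract() or decompose() on each child. Note that findChildren() is a legacy alias of find_all(), so by default it returns every descendant tag, not only the direct children.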

Consider the following code segment:

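from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
soup.decompose()   # called on the soup itself, not on a single tag
print(soup)        # do not rely on what this prints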

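This destroys the entire soup object itself, which is the parse tree of the document. Clearly, this is not what we want. The Beautiful Soup documentation also describes the behavior of a decomposed object as undefined, so it should not be used afterwards.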
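By contrast, calling decompose() on one specific tag leaves the rest of the tree intact. The following is a minimal sketch with made-up markup. It assumes Beautiful Soup 4.9.0 or later, where the .decomposed property is available.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the tutorial's index.html)
html = "<body><h2>Title</h2><p>Keep me</p></body>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h2")
heading.decompose()          # destroy only the <h2>, not the whole tree

# .decomposed reports whether an element has been destroyed
# (assumes Beautiful Soup 4.9.0 or later)
print(heading.decomposed)    # True
print(soup)                  # <body><p>Keep me</p></body>
```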

Now consider the following code:

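soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()   # every tag in the document, starting with <html>
for tag in tags:
    # findChildren() returns all descendant tags of the current tag
    for t in tag.findChildren():
        t.extract()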

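In the document tree, <html> is the first tag and every other tag is one of its descendants. The first iteration of the loop therefore already removes every tag except <html> itself, leaving only <html> and </html> (plus any text that sits directly inside it).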
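This effect can be checked on a small inline document. The markup below is an illustrative assumption, and the sketch uses find_all() in place of its alias findChildren().

```python
from bs4 import BeautifulSoup

# Illustrative markup without whitespace between tags (an assumption)
html = "<html><body><h2>Title</h2><p>Text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The list of tags is collected up front: <html>, <body>, <h2>, <p>
tags = soup.find_all()
for tag in tags:
    # With no arguments, find_all() returns every descendant tag
    for t in tag.find_all():
        t.extract()

# The first iteration (tag is <html>) already emptied the document;
# the later iterations find nothing left to remove.
print(soup)   # <html></html>
```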

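The same technique is more useful when applied to the children of one specific tag. For example, you may want to remove the header row of an HTML table.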

The following HTML document contains a table whose first <tr> element is the header row, with its cells marked up by <th> tags.

<html>
   <body>
      <h2>Beautiful Soup - Remove Child Elements</h2>
      <table border="1">
         <tr class='header'>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

We can use the following Python code to remove all child elements of the <tr> tag that holds the <th> cells.

Example

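from bs4 import BeautifulSoup

# Parse the HTML file shown above
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Select the header row(s) by class
tags = soup.find_all('tr', {'class':'header'})

for tag in tags:
    # findChildren() returns the <th> tags inside the row
    for t in tag.findChildren():
        t.extract()

print(soup)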

Output

<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class="header">

</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

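As the output shows, the <th> elements have been removed from the parse tree, while the now empty <tr class="header"> row remains.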
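When the removed cells are not needed afterwards, the same result can be obtained with decompose(). The following is a minimal sketch on a shortened, made-up version of the table. It also passes recursive=False to find_all(), which restricts the search to direct children of the row.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the table (an assumption)
html = """<table>
<tr class="header"><th>Name</th><th>Age</th></tr>
<tr><td>Ravi</td><td>23</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all("tr", class_="header"):
    # recursive=False limits the search to direct children of the row
    for cell in row.find_all(recursive=False):
        cell.decompose()

header = soup.find("tr", class_="header")
print(header)      # <tr class="header"></tr>

# The data rows are untouched
remaining = [td.get_text() for td in soup.find_all("td")]
print(remaining)   # ['Ravi', '23']
```

Beautiful Soup also provides Tag.clear(), which empties a tag completely, including its text nodes. The loop above removes only the child tags.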
