Beautiful Soup - diagnose() Method

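Method Description

The diagnose() function in Beautiful Soup is a diagnostic suite for isolating common problems. If you are having trouble understanding what Beautiful Soup does to a document, pass that document as the argument to diagnose(). The report shows how each available parser handles the document and tells you if a parser is missing.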

Syntax

diagnose(data)

Parameters

data − A string containing the markup to be diagnosed.

Return Value

diagnose() returns None. It prints to standard output the result of parsing the given document with every available parser.
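Because the report is printed rather than returned, it has to be captured from standard output if you want to keep it or search it. The following is a minimal sketch of one way to do that with the standard library. The names buffer, result and report, and the short markup string, are illustrative assumptions.

```python
import contextlib
import io

from bs4.diagnose import diagnose

markup = "<h1>Hello World<b>Welcome</b>"

# diagnose() writes its report with print(), so redirect stdout to capture it.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = diagnose(markup)

report = buffer.getvalue()

print(result)                  # the function itself returns None
print(report.splitlines()[0])  # first line names the Beautiful Soup version
```

The captured text can then be written to a file or attached to a bug report.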

Example

Let us take this simple document for our exercise -

<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>

The following code runs the diagnostics on the HTML markup above -

markup = '''
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''

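# diagnose is defined in the bs4.diagnose module
from bs4.diagnose import diagnose

# Prints one report per available parser to standard output
diagnose(markup)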

The diagnose() output begins with messages listing the available parsers -

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1

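If the document being diagnosed were well-formed HTML, all parsers would produce nearly identical results. Our example, however, contains several errors: an unclosed <h1>, an unclosed <b>, a stray </a> with no matching opening tag, and unclosed <p> tags.

The built-in html.parser is tried first. Its report looks like this -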

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <p>
      <b>
         Beautiful Soup
         <i>
            Tutorial
         </i>
         <p>
         </p>
      </b>
   </p>
</h1>

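You can see that Python's built-in parser does not insert the <html> and <body> tags. The unclosed <h1> tag is given a matching </h1> at the very end, so it ends up wrapping the rest of the document.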
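The same observations can be checked directly on the parsed tree instead of reading the report. This is a small sketch that parses the example markup with html.parser only. The variable names has_html, has_body and p_inside_h1 are illustrative.

```python
from bs4 import BeautifulSoup

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""

soup = BeautifulSoup(markup, "html.parser")

# html.parser builds the tree from the markup as given, with no wrapper tags.
has_html = soup.find("html") is not None
has_body = soup.find("body") is not None

# The unclosed <h1> is only closed at the end, so the <p> tags sit inside it.
p_inside_h1 = soup.h1.find("p") is not None

print(has_html, has_body, p_inside_h1)
```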

The html5lib and lxml parsers both complete the document by wrapping it in <html> and <body> tags. html5lib also adds an empty <head>.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
   <head>
   </head>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
         <p>
            <b>
               Beautiful Soup
               <i>
                  Tutorial
               </i>
            </b>
         </p>
         <p>
            <b>
            </b>
         </p>
      </h1>
   </body>
</html>

With the lxml parser, note where the closing </h1> is inserted. In addition, the unclosed <b> tag has been closed, and the dangling </a> has been removed.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
      </h1>
      <p>
         <b>
            Beautiful Soup
            <i>
               Tutorial
            </i>
         </b>
      </p>
      <p>
      </p>
   </body>
</html>
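The differences between the three HTML parsers can also be compared in code. The sketch below is an illustration under stated assumptions: it uses a shortened markup string, skips any parser whose library is not installed, and records two facts per parser in a dictionary named summary (a hypothetical name).

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<h1>Hello World<p>Beautiful Soup"

summary = {}
for parser in ("html.parser", "html5lib", "lxml"):
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # The parser library is not installed, so skip it.
        continue
    summary[parser] = {
        # Did the parser wrap the fragment in an <html> element?
        "adds_html": soup.find("html") is not None,
        # Did the <p> end up inside the unclosed <h1>?
        "p_inside_h1": soup.h1.find("p") is not None,
    }

for parser, facts in summary.items():
    print(parser, facts)
```

Running this on your own markup shows quickly which parser produces the tree you expect.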

diagnose() also parses the document as XML, using the lxml-xml parser, which is probably superfluous in our case.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <P>
      <b>
         Beautiful Soup
      </b>
      <i>
         Tutorial
      </i>
      <p/>
   </P>
</h1>

Let us now give the diagnose() function an XML document instead of an HTML document.

<?xml version="1.0" ?>
   <books>
      <book>
         <title>Python</title>
         <author>TutorialsPoint</author>
         <price>400</price>
      </book>
   </books>

Now, if we run the diagnostics, the HTML parsers are applied even though the document is XML.

Trying to parse your markup with html.parser

Warning (from warnings module):
  File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

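With html.parser, a warning message is displayed. With html5lib, the first line, which contains the XML version information, is turned into a comment, and the rest of the document is parsed as HTML.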

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
   <head>
   </head>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml HTML parser does not insert a comment; it parses the document as HTML.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml-xml parser parses the document as XML.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
   <books>
      <book>
         <title>
            Python
         </title>
         <author>
            TutorialsPoint
         </author>
         <price>
            400
         </price>
      </book>
   </books>
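As the warning text recommends, an XML document is best parsed by passing features="xml" to the BeautifulSoup constructor, which requires the lxml package. The sketch below is a hedged illustration: it falls back to None when lxml is not installed, and the variable name title is an assumption made for the example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xml_doc = """<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>"""

try:
    # features="xml" selects the lxml XML parser.
    soup = BeautifulSoup(xml_doc, features="xml")
    title = soup.find("title").get_text()
except FeatureNotFound:
    # lxml is not installed, so no XML parser is available.
    title = None

print(title)
```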

The diagnostic report can help you find errors in HTML/XML documents.
