Hypertext Markup Language Support in Python


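Python can process HTML files through the HTMLParser class in the html.parser module. The class reports each tag it encounters, the attributes of that tag, and (through getpos()) the line and offset at which it appears. It also provides handler methods for identifying and extracting the text data in an HTML file.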
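As a small illustration of tag attributes, the sketch below collects the href values of anchor tags together with their positions. The class name LinkCollector and the sample markup are made up for this illustration and are not part of the original example; the only library behavior relied on is that handle_starttag receives the attributes as a list of (name, value) pairs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    # getpos() returns (line number, offset) of the current tag
                    self.links.append((self.getpos(), value))


# Sample markup invented for this sketch
sample = '<p>See <a href="https://example.com/a">A</a>\nand <a href="/b">B</a></p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Storing the results on the instance instead of printing them makes the parser easier to reuse from other code.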

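In the example below, we subclass HTMLParser to create a custom parser class. The parser reacts only to the events whose handler methods we define in the class. Here we handle start tags, end tags, and data.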

Following is the HTML that the custom Python parser processes.

Example

<html>
<head>
<title>welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything !</h1>
</body>
</html>

The following program parses the file above and prints the results produced by the custom parser.

Example

from html.parser import HTMLParser


class Custom_Parser(HTMLParser):
    # Called for every opening tag, such as <html>
    def handle_starttag(self, tag, attrs):
        print("Line and Offset ==", self.getpos())
        print("Encountered a start tag:", tag)

    # Called for every closing tag, such as </html>
    def handle_endtag(self, tag):
        print("Line and Offset ==", self.getpos())
        print("Encountered an end tag :", tag)

    # Called for text between tags, including the newlines after each tag
    def handle_data(self, data):
        print("Line and Offset ==", self.getpos())
        print("Encountered some data :", data)


parser = Custom_Parser()

# Raw string so that "\t" in the Windows path is not read as a tab character
with open(r"E:\test.html", "r") as stream:
    parser.feed(stream.read())

Output

Running the above code gives us the following result:

Line and Offset == (1, 0)
Encountered a start tag: html
Line and Offset == (1, 6)
Encountered some data :

Line and Offset == (2, 0)
Encountered a start tag: head
Line and Offset == (2, 6)
Encountered some data :

Line and Offset == (3, 0)
Encountered a start tag: title
Line and Offset == (3, 7)
Encountered some data : welcome to Tutorials Point!
Line and Offset == (3, 34)
Encountered an end tag : title
Line and Offset == (3, 42)
Encountered some data :

Line and Offset == (4, 0)
Encountered an end tag : head
Line and Offset == (4, 7)
Encountered some data :

Line and Offset == (5, 0)
Encountered a start tag: body
Line and Offset == (5, 6)
Encountered some data :

Line and Offset == (6, 0)
Encountered a start tag: h1
Line and Offset == (6, 4)
Encountered some data : Learn anything !
Line and Offset == (6, 20)
Encountered an end tag : h1
Line and Offset == (6, 25)
Encountered some data :

Line and Offset == (7, 0)
Encountered an end tag : body
Line and Offset == (7, 7)
Encountered some data :

Line and Offset == (8, 0)
Encountered an end tag : html
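
The blank lines in this output come from the newline after each tag, which the parser reports as data. As a hedged variation on the same idea, the sketch below feeds the body part of the markup from an in-memory string, skips whitespace-only data, and stores the events in a list instead of printing them. The class name EventRecorder and the variable html_text are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical variant that stores events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((self.getpos(), "start", tag))

    def handle_endtag(self, tag):
        self.events.append((self.getpos(), "end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text such as the newlines between tags
        if data.strip():
            self.events.append((self.getpos(), "data", data))


# Body part of the example document, held in a string instead of a file
html_text = "<body>\n<h1>Learn anything !</h1>\n</body>\n"

recorder = EventRecorder()
recorder.feed(html_text)
recorder.close()  # process any text still buffered at the end of the input

for event in recorder.events:
    print(event)
```

Reading from a string avoids the dependency on a file path, and calling close() after the last feed() lets the parser handle any trailing text it was still holding back.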

Updated on: 12-1-2021