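Python Program to Extract Strings Between HTML Tags

HTML tags form the skeleton of a web page. The content itself is passed as strings enclosed within those tags: the tags tell the browser how to interpret and display an element, while the strings between them carry the information. Extracting these strings is therefore a common step in data processing and manipulation, and it also helps when analyzing the structure of an HTML document.

These strings can also reveal how a page is put together. In this article, we will work with such strings. Our task is to extract the strings enclosed between HTML tags.

Understanding the Problem

We have to extract all the strings that lie between HTML tags. The target strings are enclosed in different types of tags, and only the content should be retrieved, not the tags themselves. Let us understand this with an example.

Input Output Scenarios

Let us consider a string: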

Input:
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

The input string contains different HTML tags, and we have to extract the strings between them.

Output: [" This is a test string,  Let's code together "]

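As we can see, the "<h1>" and "<p>" tags have been removed and the strings have been extracted. Now that we understand the problem, let us discuss a few solutions.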

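Using Iteration and replace()

This approach focuses on removing the HTML tags by replacing them. We start with the input string and a list of the HTML tags to remove, and we store the string as the single element of a list.

We then iterate over each element of the tag list and check whether it is present in the string. A variable "pos" holds the index of the string within the list.

The "replace()" method replaces each tag with a space, which leaves a string free of the listed HTML tags.

Example

The following is an example of extracting the strings between HTML tags: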

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(f"This is the original string: {Inp_STR}")

# Store the string as the only element of a list; pos is its index
ExStr = [Inp_STR]
pos = 0

for tag in tags:
    # Replace each tag that occurs in the string with a space
    if tag in ExStr[pos]:
        ExStr[pos] = ExStr[pos].replace(tag, " ")

print(f"The extracted string is : {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is : [" This is a test string,  Let's code together "]
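
The same idea can be wrapped in a small function so that it is easier to reuse on other strings. The sketch below is an illustration rather than part of the original example: the function name strip_tags and the final whitespace clean-up with split() and join() are assumptions added here.

```python
def strip_tags(text, tags):
    """Replace every listed tag with a space, then collapse extra whitespace."""
    for tag in tags:
        # replace() leaves the string unchanged when the tag is absent,
        # so no membership check is needed first.
        text = text.replace(tag, " ")
    # split() with no arguments drops leading, trailing and repeated spaces.
    return " ".join(text.split())


tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
sample = "<h1>This is a test string,</h1><p>Let's code together</p>"
print(strip_tags(sample, tags))  # This is a test string, Let's code together
```

Because replace() only matches exact text, a tag that carries attributes, such as <p class="intro">, would need its own entry in the list.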

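Using the Regular Expression Module and findall()

In this approach, we use the regular expression module "re" to match a specific pattern. We build the regular expression "<"+tag+">(.*?)</"+tag+">", which matches an opening tag, the text that follows it, and the corresponding closing tag. The group "(.*?)" captures only the text between the two tags. Here, "tag" is a variable that takes its value from the tag list on each iteration.

The "findall()" function finds all matches of the pattern in the original string and, because the pattern contains one group, returns the captured text of each match. We use the "extend()" method to add all the matches to a new list. In this way, we extract the strings enclosed in the HTML tags.

Example

The following is an example: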

import re

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
# Tag names without angle brackets; the pattern adds them
tags = ["h1", "p", "b", "br"]
print(f"This is the original string: {Inp_STR}")
ExStr = []

for tag in tags:
    # (.*?) captures the text between <tag> and </tag>
    seq = "<"+tag+">(.*?)</"+tag+">"
    matches = re.findall(seq, Inp_STR)
    ExStr.extend(matches)

print(f"The extracted string is: {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
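
The "?" in "(.*?)" makes the match non-greedy, which matters as soon as the same tag appears more than once. The short sketch below, using a sample string chosen here for illustration, compares the non-greedy pattern with a greedy one.

```python
import re

text = "<p>first</p><p>second</p>"

# Non-greedy: (.*?) stops at the nearest closing tag.
lazy = re.findall("<p>(.*?)</p>", text)
# Greedy: (.*) runs on to the last closing tag in the string.
greedy = re.findall("<p>(.*)</p>", text)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</p><p>second']
```

By default "." does not match a newline, so text that spans several lines between two tags would need the re.DOTALL flag.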

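Using Iteration and find()

In this approach, we use the "find()" method to locate the first occurrence of an opening tag in the original string, and then the closing tag that follows it. We iterate over each element of the tag list and retrieve these positions.

A while loop continues searching the string for further occurrences of the same tag. A condition checks for an incomplete tag, that is, an opening tag with no closing tag after it, and stops the search in that case. On each iteration, the index is updated to find the next occurrence of the opening and closing tags.

Once the positions of an opening tag and its closing tag are known, string slicing extracts the text between them. Because the outer loop goes through the tag list, the results are grouped by tag rather than by their order in the document.

Example

The following is an example: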

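Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
ExStr = []
print(f"The original string is: {Inp_STR}")

for tag in tags:
    # Position of the first opening tag, or -1 if there is none
    tagpos1 = Inp_STR.find("<"+tag+">")
    while tagpos1 != -1:
        # Closing tag that follows the opening tag
        tagpos2 = Inp_STR.find("</"+tag+">", tagpos1)
        if tagpos2 == -1:
            # Incomplete tag: no closing tag found
            break
        # Skip "<", the tag name and ">" to reach the content
        ExStr.append(Inp_STR[tagpos1 + len(tag) + 2: tagpos2])
        # Look for the next opening tag after this closing tag
        tagpos1 = Inp_STR.find("<"+tag+">", tagpos2)

print(f"The extracted string is: {ExStr}")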

Output

The original string is: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
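
To see how the loop behaves with repeated tags and with an incomplete tag, the search for a single tag can be isolated in a helper. This is a minimal sketch: the function name extract_between and the sample string are assumptions made for illustration.

```python
def extract_between(text, tag):
    """Return the text inside every <tag>...</tag> pair, in order of appearance."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    found = []
    start = text.find(open_tag)
    while start != -1:
        end = text.find(close_tag, start)
        if end == -1:
            # Opening tag without a closing tag: stop searching.
            break
        found.append(text[start + len(open_tag):end])
        # Resume the search after the closing tag just consumed.
        start = text.find(open_tag, end + len(close_tag))
    return found


# The third <p> is never closed, so its text is not extracted.
print(extract_between("<p>one</p><p>two</p><p>three", "p"))  # ['one', 'two']
```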

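Conclusion

In this article, we discussed several ways to extract the strings between HTML tags. We started with the simplest solution, which finds the tags and replaces them with spaces. We then used the regular expression module and its findall() function to find matches for a pattern. Finally, we looked at the find() method combined with string slicing.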

Devesh Chauhan

Updated on: 12 July 2023
