Beautiful Soup - 将HTML转换为文本



像Beautiful Soup这样的网络爬虫库的一个重要且经常需要的应用是从HTML脚本中提取文本。您可能需要丢弃所有标签以及与每个标签关联的属性(如果有),并分离出文档中的原始文本。Beautiful Soup中的get_text()方法适用于此目的。

这是一个演示get_text()方法用法的基本示例。通过删除所有HTML标签,您可以获取HTML文档中的所有文本。

示例

html = '''
<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

输出

The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.

get_text()方法有一个可选的分隔符参数。在下面的示例中,我们将get_text()方法的分隔符参数指定为“#”。

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)

输出

#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#

get_text()方法还有一个参数strip,可以是True或False。让我们检查当strip参数设置为True时的效果。默认情况下它是False。

html = '''
   <p>The quick, brown fox jumps over a lazy dog.</p>
   <p>DJs flock by when MTV ax quiz prog.</p>
   <p>Junk MTV quiz graced by fox whelps.</p>
   <p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)

输出

The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.
广告
© . All rights reserved.