How to read the contents of a web page into a String in Java?

There are several ways to read the contents of a web page in Java. Here we discuss three of them.

Using the openStream() method

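The **URL** class of the java.net package represents a Uniform Resource Locator, which points to a resource (a file, a directory, or a reference) on the World Wide Web.

The **openStream()** method of this class opens a connection to the URL represented by the current object and returns an InputStream object, which you can use to read data from the URL.

Therefore, to read data from a web page using the URL class:

Instantiate the java.net.URL class by passing the URL of the desired web page as a parameter to its constructor.
Invoke the openStream() method and retrieve the InputStream object.
Instantiate the Scanner class by passing the InputStream object retrieved above as a parameter.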

Example

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class ReadingWebPage {
   public static void main(String[] args) throws IOException {
      //Instantiating the URL class
      URL url = new URL("http://www.something.com/");
      //Instantiating the StringBuilder class to hold the result
      StringBuilder sb = new StringBuilder();
      //Retrieving the contents of the specified page; try-with-resources closes the stream
      try (Scanner sc = new Scanner(url.openStream())) {
         //next() returns whitespace-delimited tokens, so whitespace between tokens is dropped
         while (sc.hasNext()) {
            sb.append(sc.next());
         }
      }
      //Retrieving the String from the StringBuilder object
      String result = sb.toString();
      System.out.println(result);
      //Removing the HTML tags
      result = result.replaceAll("<[^>]*>", "");
      System.out.println("Contents of the web page: "+result);
   }
}

Output

<html><body><h1>Itworks!</h1></body></html>
Contents of the web page: Itworks!
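The example above builds the result from sc.next() tokens, which discards whitespace (hence "Itworks!" in the output). If the whitespace matters, one possible variation is to let the Scanner return the whole stream as a single token. The sketch below is only an illustration: the class name StreamToString and the helper readAll are hypothetical, UTF-8 is an assumed encoding, and an in-memory stream stands in for url.openStream() so that it runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class StreamToString {
   // Hypothetical helper (not part of the original example):
   // reads the whole stream into one String, keeping all whitespace.
   public static String readAll(InputStream in) {
      // "\\A" matches only the beginning of the input, so the entire
      // stream is returned as a single token. UTF-8 is an assumption here.
      try (Scanner sc = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
         // An empty stream has no token at all, so fall back to ""
         return sc.hasNext() ? sc.next() : "";
      }
   }

   public static void main(String[] args) {
      // In-memory stand-in for url.openStream()
      InputStream in = new ByteArrayInputStream(
         "<h1>It works!</h1>".getBytes(StandardCharsets.UTF_8));
      System.out.println(readAll(in)); // prints <h1>It works!</h1>
   }
}
```

In a real program, the stream returned by url.openStream() could be passed to readAll() in place of the in-memory stream.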

Using HttpClient

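Apache HttpClient is a transfer library that resides on the client side and sends and receives HTTP messages. It provides an up-to-date, feature-rich, and efficient implementation of recent HTTP standards.

A GET request (HTTP protocol) retrieves information from the given server using the given URI. Requests that use GET should only retrieve data and should have no other effect on it.

The HttpClient API provides a class named HttpGet, which represents the GET request method. To execute a GET request and retrieve the contents of a web page:

The **createDefault()** method of the HttpClients class returns a CloseableHttpClient object, which is the base implementation of the HttpClient interface. Use this method to create an HttpClient object.
Create an HTTP GET request by instantiating the HttpGet class. The constructor of this class accepts a String value representing the URI of the web page to which the request is sent.
Execute the HttpGet request by invoking the **execute()** method.
Retrieve an InputStream object representing the contents of the web page from the response, as shown below: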

httpresponse.getEntity().getContent()

Example

import java.util.Scanner;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
public class HttpClientExample {
   public static void main(String[] args) throws Exception {
      //Creating a HttpClient object; try-with-resources closes it when done
      try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
         //Creating a HttpGet object
         HttpGet httpget = new HttpGet("http://www.something.com/");
         //Executing the Get request
         HttpResponse httpresponse = httpclient.execute(httpget);
         //Instantiating the StringBuilder class to hold the result
         StringBuilder sb = new StringBuilder();
         try (Scanner sc = new Scanner(httpresponse.getEntity().getContent())) {
            //next() returns whitespace-delimited tokens, so whitespace between tokens is dropped
            while (sc.hasNext()) {
               sb.append(sc.next());
            }
         }
         //Retrieving the String from the StringBuilder object
         String result = sb.toString();
         System.out.println(result);
         //Removing the HTML tags
         result = result.replaceAll("<[^>]*>", "");
         System.out.println("Contents of the web page: "+result);
      }
   }
}

Output

<html><body><h1>Itworks!</h1></body></html>
Contents of the web page: Itworks!
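Both examples so far remove the markup with replaceAll("<[^>]*>", ""). The following minimal sketch isolates that step on literal strings to show what the pattern does and where it stops. The class name StripTags and the helper stripTags are hypothetical names chosen for illustration.

```java
public class StripTags {
   // Hypothetical helper: deletes every substring that starts with '<'
   // and runs up to the next '>', i.e. anything that looks like a tag.
   public static String stripTags(String html) {
      return html.replaceAll("<[^>]*>", "");
   }

   public static void main(String[] args) {
      // The markup is removed and the text between the tags is kept
      System.out.println(stripTags("<html><body><h1>It works!</h1></body></html>")); // prints It works!
      // Character entities are left untouched; only the tags disappear
      System.out.println(stripTags("<p>a &amp; b</p>")); // prints a &amp; b
   }
}
```

The pattern is a plain text substitution and does not interpret HTML: it neither decodes entities nor drops the contents of script or style elements. For anything beyond simple pages, an HTML parser such as Jsoup is likely the more robust choice.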

Using the Jsoup library

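Jsoup is a Java-based library for working with HTML content. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

To retrieve the contents of a web page using the Jsoup library:

The **connect()** method of the Jsoup class accepts the URL of a web page and returns a Connection object for it; the request itself is sent only when a method such as get() is called. Use the **connect()** method to create a connection to the desired web page.
The get() method of the Connection interface executes a GET request and returns the HTML document as an object of the Document class. Send a GET request to the page by invoking the get() method.
Retrieve the text of the obtained document (without the HTML tags) into a String, as shown below: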

String result = doc.body().text();

Example

import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupExample {
   public static void main(String args[]) throws IOException {
      String page = "http://www.something.com/";
      //Connecting to the web page
      Connection conn = Jsoup.connect(page);
      //executing the get request
      Document doc = conn.get();
      //Retrieving the contents (body) of the web page
      String result = doc.body().text();
      System.out.println(result);
   }
}

Output

It works!

Maruthi Krishna

Updated on: October 10, 2019