Extracting URLs from a Given String


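In the information age, text strings that contain URLs are common. As part of data-cleaning or web-scraping tasks, we often need to extract these URLs for further processing. In this article, we explore how to do this in C++, a high-performance language that offers fine-grained control over system resources.

Understanding URLs

A URL (Uniform Resource Locator) is a reference to a web resource that specifies its location on a computer network and the mechanism for retrieving it. Put simply, a URL is a web address.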
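
To make the "location" and "retrieval mechanism" parts concrete, here is a minimal sketch that splits a simple URL into its scheme (the mechanism), host, and path (the location). The `UrlParts` struct and the `splitUrl` function are hypothetical names introduced only for this illustration, and the pattern assumes a plain http or https URL without ports, query strings, or user info.

```cpp
#include <regex>
#include <string>

// Hypothetical helper type for this sketch: the parts of a simple URL.
struct UrlParts {
    std::string scheme;  // retrieval mechanism, e.g. "https"
    std::string host;    // network location, e.g. "www.example.com"
    std::string path;    // resource on that host, may be empty
};

// Splits a simple http/https URL into scheme, host, and path.
// Returns false when the input does not fit this simplified shape.
bool splitUrl(const std::string& url, UrlParts& out) {
    // Raw string literal, so the backslashes reach the regex engine unchanged.
    static const std::regex pattern(R"((https?)://([^/\s]+)(/\S*)?)");
    std::smatch m;
    if (!std::regex_match(url, m, pattern)) {
        return false;
    }
    out.scheme = m[1].str();
    out.host = m[2].str();
    out.path = m[3].str();  // empty string when the optional group did not match
    return true;
}
```

For example, calling `splitUrl` on "https://www.example.com/docs/index.html" fills `scheme` with "https", `host` with "www.example.com", and `path` with "/docs/index.html".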

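Problem Statement

Given a string that contains multiple URLs, our task is to extract every URL present in the string.

Solution

To solve this problem, we will use the regular expression (regex) support in C++, provided by the <regex> header of the standard library. A regular expression is a sequence of characters that defines a search pattern, used mainly for pattern matching in strings.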
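
As a small illustration of what "search pattern" means in practice, the sketch below contrasts the two basic standard-library calls: `std::regex_match`, which requires the whole string to fit the pattern, and `std::regex_search`, which looks for the pattern anywhere in the string. The function names `isSchemeOnly` and `mentionsScheme` are made up for this example, and the pattern covers only the "http://" or "https://" prefix.

```cpp
#include <regex>
#include <string>

// Pattern for the scheme prefix only: "http://" or "https://".
const std::regex& schemePattern() {
    static const std::regex pattern(R"(https?://)");
    return pattern;
}

// regex_match: the WHOLE string must fit the pattern.
bool isSchemeOnly(const std::string& s) {
    return std::regex_match(s, schemePattern());
}

// regex_search: the pattern may occur ANYWHERE in the string.
bool mentionsScheme(const std::string& s) {
    return std::regex_search(s, schemePattern());
}
```

Extracting URLs from running text is a search problem rather than a whole-string match, which is why an iterator built on the search behaviour is the natural fit for this task.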

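Our approach involves the following steps:

Define the regex pattern: define a regular expression pattern that matches the general structure of a URL.

Match and extract: use the pattern to match and extract every URL present in the given string.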
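
The two steps can be sketched separately. One detail worth noting for the first step is that a regex backslash inside an ordinary C++ string literal must itself be escaped (`\\.`), whereas a raw string literal (`R"(...)"`) lets you write the pattern as is. The sketch below builds the same simplified pattern both ways and then counts matches with `std::sregex_iterator`. The function names are hypothetical, and the pattern is a deliberately reduced one (scheme, host, top-level domain) chosen for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Step 1, variant A: ordinary string literal, regex backslash doubled.
std::regex makeEscapedPattern() {
    return std::regex("https?://[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
}

// Step 1, variant B: raw string literal, pattern written as is.
std::regex makeRawPattern() {
    return std::regex(R"(https?://[A-Za-z0-9.-]+\.[A-Za-z]{2,})");
}

// Step 2: walk over every non-overlapping match and count them.
std::size_t countMatches(const std::string& text, const std::regex& pattern) {
    auto first = std::sregex_iterator(text.begin(), text.end(), pattern);
    auto last = std::sregex_iterator();  // default-constructed = end of matches
    return static_cast<std::size_t>(std::distance(first, last));
}
```

Both variants describe the same pattern, so they find the same matches. The raw form is usually easier to read and harder to get wrong.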

C++ Implementation

Example

The following C++ code implements our solution:

#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

// Function to extract all http/https URLs from a string
vector<string> extractURLs(const string& str) {
   vector<string> urls;
   // Raw string literal: regex backslashes are not consumed as C++ escapes.
   // Pattern: scheme, host, top-level domain, optional path without whitespace.
   regex urlPattern(R"(https?://[A-Za-z0-9.-]+\.[A-Za-z]{2,}(?:/\S*)?)");

   auto words_begin = sregex_iterator(str.begin(), str.end(), urlPattern);
   auto words_end = sregex_iterator();

   // Each iterator position is one non-overlapping match
   for (sregex_iterator i = words_begin; i != words_end; ++i) {
      urls.push_back(i->str());
   }

   return urls;
}

int main() {
   string str = "Visit https://tutorialspoint.com and http://www.tutorix.com for more information.";

   vector<string> urls = extractURLs(str);
   cout << "URLs found in the string:" << endl;
   for (const string& url : urls)
      cout << url << endl;

   return 0;
}

Output

URLs found in the string:
https://tutorialspoint.com
http://www.tutorix.com

Explanation

Consider the following string:

str = "Visit https://tutorialspoint.com and http://www.tutorix.com for more information."

Applying our function to this string matches both URLs and extracts them into a vector:

urls = ["https://tutorialspoint.com", "http://www.tutorix.com"]

This vector is the output of our program.
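
In running text, a URL is often followed directly by a comma or a full stop, and a permissive pattern can pull that punctuation into the match. The sketch below shows one possible way to handle this, assuming a very loose pattern (`https?://` followed by any non-whitespace characters) and a cleanup pass afterwards. The names `trimTrailingPunctuation` and `extractTrimmedURLs` are hypothetical, and the set of characters treated as trailing punctuation is an assumption that may clip URLs which legitimately end in one of them.

```cpp
#include <regex>
#include <string>
#include <vector>

// Removes punctuation that commonly trails a URL in prose.
// Assumption: a real URL does not end with any of these characters.
std::string trimTrailingPunctuation(std::string url) {
    const std::string trailing = ".,;:!?)\"'";
    while (!url.empty() && trailing.find(url.back()) != std::string::npos) {
        url.pop_back();
    }
    return url;
}

// Extracts http/https URLs with a loose pattern, then trims each match.
std::vector<std::string> extractTrimmedURLs(const std::string& text) {
    static const std::regex pattern(R"(https?://\S+)");
    std::vector<std::string> urls;
    for (auto it = std::sregex_iterator(text.begin(), text.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        urls.push_back(trimTrailingPunctuation(it->str()));
    }
    return urls;
}
```

For the input "See https://example.com/docs, or http://example.org/a.html." this yields "https://example.com/docs" and "http://example.org/a.html", with the comma and the final period removed.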

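Conclusion

Extracting URLs from a string is a practical exercise in text processing and in using regular expressions. The approach shown here, along with the C++ skills it requires, is useful in data analysis, web scraping, and software development.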

Siva Sai

Updated on: May 17, 2023
