How to extract website names from website links in R?


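If we have a list of website links and want the website name from each one, copying the names by hand, one at a time, is slow. It is quicker to let an R function do the extraction. For this we can use the suffix_extract function of the urltools package. It works on host names rather than full URLs, so the host is first taken from each link (for example with the package's domain function). For every host it returns the host, subdomain, domain and suffix, and in most cases the domain value is the website name.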

Loading the urltools package -

library(urltools)

Website links stored in a vector -

Web_Links <- c(
  "https://www.grammarly.com/grammar-check",
  "https://sceptermarketing.com/comma-separated-lists-of-us-states-abbreviations-select-options-etc/",
  "https://tutorialspoint.com/machine_learning/index.htm",
  "https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sort",
  "https://www-islaah-in.cdn.ampproject.org/v/s/www.islaah.in/masail/13977/?amp=&usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16016175660203&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Fwww.islaah.in%2Fmasail%2F13977%2F",
  "http://qoitrat.org/Qa/searchtopic.php?Main=76&MainTopc=245",
  "https://theislamicinformation-com.cdn.ampproject.org/v/s/theislamicinformation.com/aqeeqah-for-baby-boy-and-girl/amp/?usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16015741096047&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Ftheislamicinformation.com%2Faqeeqah-for-baby-boy-and-girl%2F",
  "https://parenting.firstcry.com/articles/50-popular-turkish-baby-names-for-girls/",
  "https://www.amazon.in/SELF-CHEF-Delhi-Aloo-Tikki/dp/B089GW5ZPL/ref=asc_df_B089GW5ZPL/?tag=googleshopmob-21&linkCode=df0&hvadid=397060787211&hvpos=&hvnetw=g&hvrand=3239398407570685332&hvpone=&hvptwo=&hvqmt=&hvdev=m&hvdvcmdl=&hvlocint=&hvlocphy=9040189&hvtargid=pla-923173707999&psc=1&ext_vrnc=hi",
  "http://ridenow.co.in/?From=Bareilly&To=Delhi&submit=",
  "https://www.savaari.com/delhi/delhi-to-bareilly-cabs",
  "https://www.olxgroup.com/search/operations/delhi-ncr/all-brands",
  "https://unbelievable-facts.com/work-with-us",
  "https://www.tataaiginsurance.in/taig/taig/tata_aig/CorporateCustomerPortal/login.jsp",
  "https://www.dummies.com/programming/r/how-to-change-plot-options-in-r/",
  "http://www.sthda.com/english/wiki/add-titles-to-a-plot-in-r-software"
)

Printing the vector of website links -

Web_Links

 [1] "https://www.grammarly.com/grammar-check"
 [2] "https://sceptermarketing.com/comma-separated-lists-of-us-states-abbreviations-select-options-etc/"
 [3] "https://tutorialspoint.com/machine_learning/index.htm"
 [4] "https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sort"
 [5] "https://www-islaah-in.cdn.ampproject.org/v/s/www.islaah.in/masail/13977/?amp=&usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16016175660203&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Fwww.islaah.in%2Fmasail%2F13977%2F"
 [6] "http://qoitrat.org/Qa/searchtopic.php?Main=76&MainTopc=245"
 [7] "https://theislamicinformation-com.cdn.ampproject.org/v/s/theislamicinformation.com/aqeeqah-for-baby-boy-and-girl/amp/?usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16015741096047&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Ftheislamicinformation.com%2Faqeeqah-for-baby-boy-and-girl%2F"
 [8] "https://parenting.firstcry.com/articles/50-popular-turkish-baby-names-for-girls/"
 [9] "https://www.amazon.in/SELF-CHEF-Delhi-Aloo-Tikki/dp/B089GW5ZPL/ref=asc_df_B089GW5ZPL/?tag=googleshopmob-21&linkCode=df0&hvadid=397060787211&hvpos=&hvnetw=g&hvrand=3239398407570685332&hvpone=&hvptwo=&hvqmt=&hvdev=m&hvdvcmdl=&hvlocint=&hvlocphy=9040189&hvtargid=pla-923173707999&psc=1&ext_vrnc=hi"
[10] "http://ridenow.co.in/?From=Bareilly&To=Delhi&submit="
[11] "https://www.savaari.com/delhi/delhi-to-bareilly-cabs"
[12] "https://www.olxgroup.com/search/operations/delhi-ncr/all-brands"
[13] "https://unbelievable-facts.com/work-with-us"
[14] "https://www.tataaiginsurance.in/taig/taig/tata_aig/CorporateCustomerPortal/login.jsp"
[15] "https://www.dummies.com/programming/r/how-to-change-plot-options-in-r/"
[16] "http://www.sthda.com/english/wiki/add-titles-to-a-plot-in-r-software"

Extracting the website names -
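One way to produce a table like the one below is sketched here. It assumes the host of each link is first isolated with the urltools function domain() and the result is then passed to suffix_extract(). The vector sample_links is an illustrative name holding three of the links above, so that the block runs on its own.

```r
library(urltools)

# Three of the links from Web_Links, copied so this block is self-contained.
# sample_links is an illustrative name, not part of the original example.
sample_links <- c(
  "https://www.grammarly.com/grammar-check",
  "https://parenting.firstcry.com/articles/50-popular-turkish-baby-names-for-girls/",
  "http://ridenow.co.in/?From=Bareilly&To=Delhi&submit="
)

# domain() drops the scheme, path and query, leaving only the host name.
hosts <- domain(sample_links)

# suffix_extract() splits each host into subdomain, domain and public suffix.
parts <- suffix_extract(hosts)
parts
```

For the full vector, the same call would be suffix_extract(domain(Web_Links)).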

   host                                          subdomain                      domain              suffix
1  www.grammarly.com                             www                            grammarly           com
2  sceptermarketing.com                          <NA>                           sceptermarketing    com
3  tutorialspoint.com                            <NA>                           tutorialspoint      com
4  www.rdocumentation.org                        www                            rdocumentation      org
5  www-islaah-in.cdn.ampproject.org              www-islaah-in.cdn              ampproject          org
6  qoitrat.org                                   <NA>                           qoitrat             org
7  theislamicinformation-com.cdn.ampproject.org  theislamicinformation-com.cdn  ampproject          org
8  parenting.firstcry.com                        parenting                      firstcry            com
9  www.amazon.in                                 www                            amazon              in
10 ridenow.co.in                                 <NA>                           ridenow             co.in
11 www.savaari.com                               www                            savaari             com
12 www.olxgroup.com                              www                            olxgroup            com
13 unbelievable-facts.com                        <NA>                           unbelievable-facts  com
14 www.tataaiginsurance.in                       www                            tataaiginsurance    in
15 www.dummies.com                               www                            dummies             com
16 www.sthda.com                                 www                            sthda               com
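
If only the names are needed, the domain column can be pulled out of the result as a character vector. The following is a minimal sketch, again assuming the hosts are obtained with domain(). The helper name site_name and the shortened AMP link are illustrative only. Note that AMP cache links, such as rows 5 and 7 above, resolve to ampproject (the cache host) rather than to the original site.

```r
library(urltools)

# Hypothetical helper: return the domain label for each URL.
site_name <- function(urls) {
  parts <- suffix_extract(domain(urls))
  as.character(parts$domain)
}

links <- c(
  "https://www.savaari.com/delhi/delhi-to-bareilly-cabs",
  "https://unbelievable-facts.com/work-with-us",
  # Link 5 from the vector above, with its query string removed for brevity.
  "https://www-islaah-in.cdn.ampproject.org/v/s/www.islaah.in/masail/13977/"
)

names_only <- site_name(links)
names_only
```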

Updated on: 16-Oct-2020
