R shorts: webpage tables

Importing tables embedded in webpages.

R, data, scraping

Author: wayward teachR

Published: August 22, 2022

Modified: January 4, 2025

A few weeks ago, a TikTok video was doing the rounds on Twitter. It showed a person importing a table from a Wikipedia page into Excel without having to copy and paste.

Though I replied to the tweet to show how it can be done in R, I thought I should explain it a little more here.

Programmatically extracting information from webpages is known as scraping. To scrape, you will need the {rvest} package.[1] {rvest} is installed as part of the {tidyverse} set of packages, so if you already have the {tidyverse}, you are good to go. If you don't, install {rvest} now.

# Install rvest
install.packages("rvest")
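If you are not sure whether {rvest} is already on your machine, a small check avoids reinstalling it. This is a minimal sketch; the CRAN mirror URL is an assumption, and any mirror you normally use will do.

```r
# Install rvest only if it is not already available
# (the repos value is an assumed mirror; swap in your usual one)
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "https://cloud.r-project.org")
}
```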

In order to make this as easy as possible, let’s break the whole thing down into a series of steps.

Step 1: import package

To use a package in R, you first need to import it into your session. This is done using the library() function. Let's import the {rvest} package.

# Import rvest
library(rvest)

Step 2: the URL

Now we need to tell R the URL of the page that contains our table. The one we are interested in today is the main table on the Dow Jones Industrial Average Wikipedia page.

# Define url
my_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"

Step 3: read

Use the read_html() function to get R to read the page at our URL.

# Read the page stored in my_url
my_website <- read_html(my_url)
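Reading a page can fail, for example when the address is mistyped or the connection drops. Here is a minimal sketch of a guard around read_html(). Note that safe_read() is a hypothetical helper written for this example and is not part of {rvest}.

```r
library(rvest)

# Hypothetical helper: return the parsed page, or NULL with a message
# if read_html() throws an error
safe_read <- function(x) {
  tryCatch(
    read_html(x),
    error = function(e) {
      message("Could not read the page: ", conditionMessage(e))
      NULL
    }
  )
}

# A path that does not exist triggers the error branch
result <- safe_read("this-file-does-not-exist.html")
is.null(result)
```

With a guard like this, the later steps can check for NULL before trying to extract anything.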

Step 4: identify the element

Webpages are made up of lots of different elements. What we need now is a way to identify the element that contains our table. To find that out:

  1. right-click the table heading on the Wikipedia page;
  2. click Inspect in the context menu that pops up;
  3. make a note of the table's class name in the inspection panel.

[Visual demonstration: inspecting the table on the Wikipedia page]

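In the inspection panel, the table's class attribute reads wikitable sortable, which means it carries two classes: wikitable and sortable. Written as a CSS selector, that becomes table.wikitable.sortable (the tag name followed by each class, joined with dots). Armed with this information, we are ready to extract our table.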
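If you would rather not dig through the inspection panel, the class names can also be listed from R. This is a minimal sketch that uses a small stand-in page built with minimal_html() instead of the live Wikipedia page; the page content is made up for illustration.

```r
library(rvest)

# Stand-in page with two tables (made up for illustration)
page <- minimal_html('
  <table class="infobox"><tr><th>Key</th></tr><tr><td>Value</td></tr></table>
  <table class="wikitable sortable">
    <tr><th>Company</th><th>Symbol</th></tr>
    <tr><td>3M</td><td>MMM</td></tr>
  </table>
')

# Collect every <table> on the page and read its class attribute
table_classes <- page |>
  html_elements("table") |>
  html_attr("class")

table_classes
```

Running the same pipeline on my_website in place of page should list the classes of every table on the Wikipedia page, so you can pick out the one you want.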

Step 5: extract table

# Extract the first element matching the selector and convert it to a data frame
my_table <- my_website |>
  html_element("table.wikitable.sortable") |>
  html_table()

There you go. Done.

| Company          | Exchange | Symbol | Industry                | Date added | Notes                                 | Index weighting |
|------------------|----------|--------|-------------------------|------------|---------------------------------------|-----------------|
| 3M               | NYSE     | MMM    | Conglomerate            | 1976-08-09 | As Minnesota Mining and Manufacturing | 1.83%           |
| American Express | NYSE     | AXP    | Financial services      | 1982-08-30 |                                       | 4.12%           |
| Amgen            | NASDAQ   | AMGN   | Biopharmaceutical       | 2020-08-31 |                                       | 3.76%           |
| Amazon           | NASDAQ   | AMZN   | Retailing               | 2024-02-26 |                                       | 3.02%           |
| Apple            | NASDAQ   | AAPL   | Information technology  | 2015-03-19 |                                       | 3.33%           |
| Boeing           | NYSE     | BA     | Aerospace and defense   | 1987-03-12 |                                       | 2.15%           |
| Caterpillar      | NYSE     | CAT    | Construction and mining | 1991-05-06 |                                       | 5.41%           |
| Chevron          | NYSE     | CVX    | Petroleum industry      | 2008-02-19 | Also 1930-07-18 to 1999-11-01         | 2.18%           |
| Cisco            | NASDAQ   | CSCO   | Information technology  | 2009-06-08 |                                       | 0.82%           |
| Coca-Cola        | NYSE     | KO     | Drink industry          | 1987-03-12 | Also 1932-05-26 to 1935-11-20         | 0.86%           |

1–10 of 30 rows
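The Index weighting values carry a % sign, so they should arrive in R as text, not numbers. Here is a sketch of one way to convert them, using a three-row stand-in data frame copied from the rows above instead of the scraped my_table. The column name weight_pct is made up for this example.

```r
# Stand-in for the first rows of my_table (values copied from the table above)
dow <- data.frame(
  Company = c("3M", "American Express", "Amgen"),
  `Index weighting` = c("1.83%", "4.12%", "3.76%"),
  check.names = FALSE
)

# Strip the percent sign and convert the remaining text to numeric
dow$weight_pct <- as.numeric(sub("%", "", dow[["Index weighting"]], fixed = TRUE))

dow$weight_pct
```

The same two lines of conversion should work on my_table, as long as the column is still named Index weighting on the live page.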

Full code
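Putting the steps together, the whole job fits in a few lines. The sketch below wraps them in a helper so the same code can be reused on other pages. Note that scrape_table() and its arguments are hypothetical names, not part of {rvest}, and the call on the Dow Jones page is guarded by interactive() so that it only runs in a live session with internet access.

```r
# Import rvest
library(rvest)

# Hypothetical helper (not an rvest function): read a page and return
# the first table matching a CSS selector as a data frame
scrape_table <- function(page_url, selector = "table.wikitable.sortable") {
  page_url |>
    read_html() |>
    html_element(selector) |>
    html_table()
}

# Define url
my_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"

# Needs internet access, so only run it in an interactive session
if (interactive()) {
  my_table <- scrape_table(my_url)
}
```

In a script run non-interactively, call scrape_table(my_url) directly once you know the machine is online.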

Corrections

If you spot any mistakes or want to suggest changes, please let me know through the usual channels.

Footnotes

  1. There are other packages for more complex webpages, but {rvest} is all we need for this task.

Citation

BibTeX citation:
@online{teachr2022,
  author = {teachR, wayward},
  title = {R Shorts: Webpage Tables},
  date = {2022-08-22},
  url = {https://thewaywardteachr.netlify.app/posts/2022-08-22-r-shorts-web-tables/r-shorts-web-tables.html},
  langid = {en}
}
For attribution, please cite this work as:
teachR, wayward. 2022. “R Shorts: Webpage Tables.” August 22, 2022. https://thewaywardteachr.netlify.app/posts/2022-08-22-r-shorts-web-tables/r-shorts-web-tables.html.