Scraping is just helping a machine read data intended for humans.
It comes in a few forms:
Screen
Report
Web
Structured – typical data formats
Semi-structured – modern sites
Unstructured – varying levels of doom
JSON
HTML/XML
CSV
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"reviewText": "I bought this for my husband who plays the piano.
He is having a wonderful time playing these old hymns.
The music is at times hard to read because we think the
book was published for singing from more than playing from.
Great purchase though!",
"overall": 5.0,
}
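A record like this parses directly with Python's standard `json` module. A minimal sketch, using the review shown above (with the multi-line text collapsed onto one line, since JSON strings cannot contain raw line breaks):

```python
import json

# The Amazon-style review record from above, as a valid JSON string
record = """{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "reviewText": "I bought this for my husband who plays the piano. Great purchase though!",
  "overall": 5.0
}"""

review = json.loads(record)   # Parse the JSON string into a Python dict
print(review["overall"])      # → 5.0
print(review["reviewerID"])   # → A2SUAM1J3GNN3B
```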
An API will make life much easier on everyone.
Grabbing parts out of the HTML is sometimes necessary.
Sometimes, sites will give us an API.
Always check – don’t bet on it.
Glassdoor
BLS
U.S. Census
Most modern sites have one (or something resembling one).
Many (if not most) APIs limit your queries.
APIs usually do not give you all of the data that you want either.
This is where a combination approach becomes necessary.
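One way to live with a query limit is to throttle paginated requests. This is a hedged sketch: `fetch_page` stands in for whatever API call you are actually making, and the delay is a placeholder for whatever limit the API documents.

```python
import time

def fetch_all(fetch_page, max_pages=5, delay=1.0):
    """Pull successive pages from an API, pausing between calls.

    fetch_page(page) should return a list of records,
    or an empty list once the API runs out of data.
    """
    records = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:          # No more data -- stop early
            break
        records.extend(batch)
        time.sleep(delay)      # Respect the (assumed) rate limit
    return records

# A stub fetcher standing in for a real API client
pages = [[{"id": 1}, {"id": 2}], [{"id": 3}], []]
results = fetch_all(lambda p: pages[p], delay=0.0)
print(len(results))  # → 3
```

Injecting the fetch function keeps the throttling logic separate from any one API, which is handy when you end up combining several sources.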
Tables are the easiest thing to scrape.
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_banks_in_the_United_States"
banklist = pd.read_html(url)[0]  # Grab the first table on the page
library(rvest)
library(magrittr)

url = "https://en.wikipedia.org/wiki/List_of_largest_banks_in_the_United_States"

bankList = read_html(url) %>%  # Read the html
  html_nodes("table") %>%      # Grab "table" nodes
  extract2(1) %>%              # Extract the first table
  html_table()                 # Save the table as a data frame
CSS selectors and XPath make scraping certain objects easier.
The CSS Diner is a favorite!
No matter your browser, it will have an Inspect tool.
This tool will quickly become your friend.
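As a quick taste of selector-style extraction, Python's standard library supports a small XPath subset through `xml.etree.ElementTree`. This is a sketch on a made-up, well-formed snippet; real pages are usually messy enough that you want rvest, BeautifulSoup, or lxml instead.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet standing in for a real page
html = """
<div>
  <p class="review">Loved it</p>
  <p class="review">Hated it</p>
  <p class="other">Ignore me</p>
</div>
"""

root = ET.fromstring(html)
# ElementTree's XPath subset handles tag names, // descendant
# searches, and [@attr='value'] predicates
reviews = [p.text for p in root.findall(".//p[@class='review']")]
print(reviews)  # → ['Loved it', 'Hated it']
```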
Excel Web Query is handy for grabbing tables.
It can even be refreshed.
If so, let’s give Excel a quick try.
We should be able to do it in under two minutes.
Whether it is 1 table or 1,000, tables are pretty easy to scrape.
But…not everything comes in tabular form.
This is where things become fun.
Most modern sites are pretty well constructed.
url = "https://www.yelp.com/biz/capri-granger"
yelpHTML = read_html(url)
ratings = yelpHTML %>%
html_nodes(".review-wrapper .review-content .i-stars") %>%
html_attr("title") %>%
stringr::str_extract("[0-5]")
reviews = yelpHTML %>%
html_nodes(".review-wrapper .review-content p") %>%
html_text()
For these easy tables, we might be able to use a service.
Many free ones are available.
Let’s play a little game.
Pattern | Matches |
---|---|
`Ph.? ?D` | Ph.D , Ph. D , PhD , Ph D |
`\d{3}-\d{3}-\d{4}` | Phone numbers |
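These patterns can be checked with Python's `re` module. A small sketch; the sample names and numbers below are made up:

```python
import re

text = "Jane Ph.D called 574-555-0123; John PhD called 574-555-0199."

# Ph.? ?D -- "Ph", an optional character, an optional space, then "D"
degrees = re.findall(r"Ph.? ?D", text)
print(degrees)  # → ['Ph.D', 'PhD']

# \d{3}-\d{3}-\d{4} -- a North American phone number shape
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(phones)   # → ['574-555-0123', '574-555-0199']
```

Note that the unescaped `.` in the first pattern matches any character, not just a period, which is exactly why it also picks up "PhD" and "Ph D".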
Scraping can get many things.
However, it is not magic.
We can’t scrape all of Google.
The laws of physics cannot be bent.
Some sites explicitly prohibit it.