Web Scraping




Table of Contents

In this tutorial, we first provide an overview of some foundational concepts about the World Wide Web. We then lay out some common approaches to web scraping and compare their usage. With this background, we introduce several applications that use the Selenium Python package to scrape websites.

This tutorial is organized into the following parts:

  1. Basic concepts of the World Wide Web.
  2. Comparison of some common approaches to web scraping.
  3. Use-cases for when to use the Selenium WebDriver.
  4. Illustration of how to find web elements using Selenium WebDriver.
  5. Illustration of how to fill in web forms using Selenium WebDriver.

All code samples are available on GitHub for viewing and downloading.

What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar.

We plan to add more applications in the near future. The content of this tutorial is a work in progress, and we are happy to receive feedback! If you find anything confusing or think the guide misses important content, please email: help@iq.harvard.edu.

Custom Websites

We decided to build custom websites for many of the examples used in this tutorial instead of scraping live websites, so that we have full control over the web environment. This gives us stability: live websites are updated more often than books, and by the time you try a scraping example, it may no longer work. A custom website also allows us to craft examples that illustrate specific skills and avoid distractions. Finally, the maintainers of a live website may not appreciate us using them to learn about web scraping and could try to block our scrapers. Using our own custom websites avoids these risks; however, the skills learned in these examples can certainly still be applied to live websites.

Below we list each of the custom websites we have built for this tutorial:

Scraping
  • static student profile webpage
  • dynamic search form webpage
  • dynamic table webpage
  • dynamic search load webpage
  • dynamic complete search form webpage

Authors and Sources

Jinjie Liu at IQSS designed the structure of the guide and created the content. Steve Worthington at IQSS helped design the structure of the guide and edited the content. We referenced the following sources when we wrote this guide:

  • Web Scraping with Python: Scrape data from any website with the power of Python, by Richard Lawson (ISBN: 978-1782164364)
  • Web Scraping with Python: Collecting Data From the Modern Web, by Ryan Mitchell (ISBN: 978-1491910276)
  • Hands-on Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others, by Anish Chapagain (ISBN: 978-1789533392)
  • Learning Selenium Testing Tools with Python: A practical guide on automated web testing with Selenium using Python, by Unmesh Gundecha (ISBN: 978-1783983506)

Scraping Text with rvest

A vast amount of information exists across the innumerable webpages online. Much of this information is “unstructured” text that may be useful in our analyses. This section covers the basics of scraping this text from online sources. Throughout this section I will illustrate how to extract different text components of webpages by dissecting the Wikipedia page on web scraping. However, it’s important to first cover one of the basic components of HTML elements, as we will leverage this information to pull the desired content. I offer only enough insight to begin scraping; I highly recommend XML and Web Technologies for Data Sciences with R and Automated Data Collection with R to learn more about HTML and XML element structures.

HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>. The tags that typically contain the textual content we wish to scrape, and that we will leverage in the next two sections, include:

  • <h1>, <h2>,…,<h6>: Largest heading, second largest heading, etc.
  • <p>: Paragraph elements
  • <ul>: Unordered bulleted list
  • <ol>: Ordered list
  • <li>: Individual List item
  • <div>: Division or section
  • <table>: Table

For example, text in paragraph form that you see online is wrapped with the HTML paragraph tag <p> as in:
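
<p>This paragraph of text is wrapped in an HTML paragraph element, which is what a scraper would target.</p>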

It is through these tags that we can start to extract textual components (also referred to as nodes) of HTML webpages.

Scraping HTML Nodes

To scrape online text we’ll make use of the relatively new rvest package. rvest was created by the RStudio team, inspired by libraries such as Beautiful Soup, and has greatly simplified web scraping. rvest provides multiple functionalities; however, in this section we will focus only on extracting HTML text with rvest. It’s important to note that rvest makes use of the pipe operator (%>%) developed through the magrittr package. If you are not familiar with the functionality of %>%, I recommend you jump to the section on Simplifying Your Code with %>% so that you have a better understanding of what’s going on with the code.
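
As a minimal sketch of the setup (the object name scraping_wiki is just an illustrative choice), we load rvest and parse the Wikipedia page:

```r
# load rvest; it also re-exports the %>% pipe from magrittr
library(rvest)

# parse the Web Scraping Wikipedia page; "scraping_wiki" is an illustrative name
scraping_wiki <- read_html("https://en.wikipedia.org/wiki/Web_scraping")
```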

To extract text from a webpage of interest, we specify what HTML elements we want to select by using html_nodes(). For instance, if we want to scrape the primary heading for the Web Scraping Wikipedia webpage we simply identify the <h1> node as the node we want to select. html_nodes() will identify all <h1> nodes on the webpage and return the HTML element. In our example we see there is only one <h1> node on this webpage.
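
A sketch of that selection, assuming the parsed page is stored in the scraping_wiki object from above:

```r
# identify all <h1> nodes on the page; this page has only one
scraping_wiki %>%
  html_nodes("h1")
```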

To extract only the heading text for this <h1> node, and not all the HTML syntax, we use html_text(), which returns the heading text we see at the top of the Web Scraping Wikipedia page.
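
For example:

```r
# drop the HTML syntax and keep only the heading text ("Web scraping")
scraping_wiki %>%
  html_nodes("h1") %>%
  html_text()
```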

If we want to identify all the second level headings on the webpage we follow the same process but instead select the <h2> nodes. In this example we see there are 10 second level headings on the Web Scraping Wikipedia page.
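
The only change is the node we pass to html_nodes():

```r
# identify all second-level heading (<h2>) nodes and extract their text
scraping_wiki %>%
  html_nodes("h2") %>%
  html_text()
```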


Next, we can move on to extracting much of the text on this webpage, which is in paragraph form. We can follow the same process illustrated above but instead select all <p> nodes. This selects the 17 paragraph elements from the web page, which we can examine by subsetting the list p_nodes to see the first line of each paragraph along with the HTML syntax. Just as before, to extract the text from these nodes and coerce them to a character string we simply apply html_text().
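
A sketch of those steps, using the p_nodes and p_text names referred to above:

```r
# select all paragraph (<p>) nodes
p_nodes <- scraping_wiki %>%
  html_nodes("p")

# subset the node list to inspect the first few paragraphs with their HTML syntax
p_nodes[1:3]

# coerce the paragraph nodes to a character vector of plain text
p_text <- scraping_wiki %>%
  html_nodes("p") %>%
  html_text()
```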

Not too bad; however, we may not have captured all the text that we were hoping for. Since we extracted text for all <p> nodes, we collected all identified paragraph text, but this does not capture the text in the bulleted lists. For example, when you look at the Web Scraping Wikipedia page you will notice a significant amount of text in bulleted list format following the third paragraph under the Techniques heading. If we look at our data we’ll see that the text in this list format is not captured between the two paragraphs:


This is because the text in this list format is contained in <ul> nodes. To capture the text in lists, we can use the same steps as above but select the specific nodes that represent HTML list components. We can approach extracting list text in two ways.

First, we can pull all list elements (<ul>). When scraping all <ul> text, the resulting data structure will be a character string vector with each element representing a single list consisting of all list items in that list. In our running example there are 21 list elements as shown in the example that follows. You can see the first list scraped is the table of contents and the second list scraped is the list in the Techniques section.
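
A sketch of the first approach, using the ul_text name referred to above:

```r
# pull all unordered list (<ul>) nodes; each element of the resulting
# character vector holds the full text of one list
ul_text <- scraping_wiki %>%
  html_nodes("ul") %>%
  html_text()

length(ul_text)   # 21 list elements at the time this was written
```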

An alternative approach is to pull all <li> nodes. This will pull the text contained in each list item for all the lists. In our running example there are 146 list items that we can extract from this Wikipedia page. The first eight list items are the contents listing we see towards the top of the page. List items 9-17 are the list elements contained in the “Techniques” section, list items 18-44 are the items listed under the “Notable Tools” section, and so on.
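
And a sketch of the second approach, using the li_text name referred to above:

```r
# pull the text of every individual list item (<li>)
li_text <- scraping_wiki %>%
  html_nodes("li") %>%
  html_text()

length(li_text)   # 146 list items at the time this was written

li_text[1:8]      # the contents listing near the top of the page
```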

At this point we may believe we have all the text desired and proceed with joining the paragraph (p_text) and list (ul_text or li_text) character strings and then performing the desired textual analysis. However, we may now have captured more text than we were hoping for. For example, by scraping all lists we are also capturing the listed links in the left margin of the webpage. If we look at list items 104-136, we’ll see that this text corresponds to the left-margin links.
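
For instance (all_text is just an illustrative name for the combined vector):

```r
# join the paragraph text and the list-item text into one character vector
all_text <- c(p_text, li_text)

# list items 104-136 correspond to the links in the page's left margin
li_text[104:136]
```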

If we desire to scrape every piece of text on the webpage then this won’t be a concern. In fact, if we want to scrape all the text regardless of the content it represents, there is an easier approach. We can capture all the content, including text in paragraphs (<p>), lists (<ul>, <ol>, and <li>), and even data in tables (<table>), by using <div>. This is because these other elements are usually a subsidiary of an HTML division or section, so pulling all <div> nodes will extract all text contained in that division or section regardless of whether it is also contained in a paragraph or list.
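
A sketch of that broader approach:

```r
# pulling all <div> nodes captures text in paragraphs, lists, and tables alike,
# since those elements usually sit inside a division or section
div_text <- scraping_wiki %>%
  html_nodes("div") %>%
  html_text()
```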

Scraping Specific HTML Nodes

However, if we are concerned only with specific content on the webpage then we need to make our HTML node selection process a little more focused. To do this, we can use our browser’s developer tools to examine the webpage we are scraping and get more details on the specific nodes of interest. If you are using Chrome or Firefox you can open the developer tools by pressing F12 (Cmd + Opt + I on a Mac); for Safari you would use Command-Option-I. An additional option, recommended by Hadley Wickham, is to use selectorgadget.com, a Chrome extension, to help identify the web page elements you need [1].

Once the developer tools are open, your primary concern is the element selector, which is located in the top left-hand corner of the developer tools window.

Once you’ve selected the element selector you can scroll over the elements of the webpage, and each element you hover over will be highlighted. Once you’ve identified the element you want to focus on, select it; this will cause the element to be identified in the developer tools window. For example, if I am only interested in the main body of the Web Scraping content on the Wikipedia page, I would select the element that highlights the entire center component of the webpage, which highlights the corresponding <div> element in the developer tools window.


I can now use this information to select and scrape all the text from this specific <div> node by calling the ID name (“#mw-content-text”) in html_nodes() [2]. As you can see below, the text that is scraped begins with the first line in the main body of the Web Scraping content and ends with the text in the See Also section, which is the last bit of text directly pertaining to Web Scraping on the webpage. In other words, we have pulled only the specific text associated with the web content we desire.
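
A sketch of that selection, using the body_text name referred to above:

```r
# select the division that holds the main body of the article by its ID
body_text <- scraping_wiki %>%
  html_nodes("#mw-content-text") %>%
  html_text()

# glance at the beginning of the scraped text
substr(body_text, 1, 200)
```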

Using the developer tools approach allows us to be as specific as we desire. We can identify the class name for a specific HTML element and scrape the text for only that node rather than all the other elements with similar tags. This allows us to scrape the main body of content as we just illustrated, or to identify specific headings, paragraphs, lists, and list components if we desire to scrape only these specific pieces of text:
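
(The CSS selectors below are illustrative placeholders for whatever IDs or classes your browser’s developer tools reveal on the page you are scraping.)

```r
# scrape a single heading by its ID (illustrative selector)
scraping_wiki %>%
  html_nodes("#Techniques") %>%
  html_text()

# scrape only the paragraphs nested inside the main body division
scraping_wiki %>%
  html_nodes("#mw-content-text p") %>%
  html_text()
```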

Cleaning up

With any web scraping activity, especially involving text, there is likely to be some clean-up involved. For example, in the previous example we saw that we can specifically pull the list of Notable Tools; however, you can see that in between each list item, rather than a space, there are one or more \n characters, which are used to specify a new line. We can clean this up quickly with a little character string manipulation.
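
A minimal sketch of that clean-up, assuming the scraped Notable Tools text is stored in a character vector called tools_text (an illustrative name):

```r
# replace the embedded newline characters (\n) with spaces
tools_text <- gsub("\n", " ", tools_text)

# trim any leading/trailing whitespace left behind
tools_text <- trimws(tools_text)
```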

Similarly, as we saw in our example above with scraping the main body content (body_text), there are extra characters (i.e. \n, \, ^) in the text that we may not want. Using a little regex we can clean this up so that our character string consists of only the text that we see on the screen and no additional code embedded throughout it.
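
A sketch of that clean-up with base R regex functions:

```r
# remove newlines, stray backslashes, and caret characters from the body text
body_text <- gsub("\n", " ", body_text)
body_text <- gsub("\\\\", "", body_text)
body_text <- gsub("\\^", "", body_text)

# collapse runs of whitespace into single spaces and trim the ends
body_text <- gsub("[[:space:]]+", " ", body_text)
body_text <- trimws(body_text)
```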

So there we have it, text scraping in a nutshell. Although not all-encompassing, this section covered the basics of scraping text from HTML documents. Whether you want to scrape text from all common text-containing nodes such as <div>, <p>, <ul> and the like, or you want to scrape from a specific node using its ID, this section provides the fundamentals of using rvest to scrape the text you need. In the next section we move on to scraping data from HTML tables.


  1. You can learn more about selectors at flukeout.github.io

  2. You can simply read the name of the ID in the highlighted element, or you can right-click the highlighted element in the developer tools window and select Copy selector. You can then paste directly into html_nodes() as it will give you the exact ID name that you need for that element.