Web scraping means downloading one or more web pages to capture specific fields or pieces of information. It is usually applied to multiple pages that share the same format, or to websites that are formatted differently but carry the same set of information. What matters is that the goal is to capture specific information common to all of these sites.
Web scraping may harvest information such as a definition, location, URL, title, and so on. The information can be in a table or in an enumerated list marked by a keyword or a section or paragraph heading. The full information set might be spread across different pages of the same website, or all the entries might be on a single page. The scraping results can be stored in a table or listed in a text file for further research or processing.
Extracting Formatted Information from a Web Page
Scraping is not a single step; it entails downloading and saving the web page as an HTML file and then processing it on the local computer. The easiest way is to batch-download all the files to be processed, extract the data, and save it in a single text file.
Python can be used for the whole process of web scraping, or it can be used specifically to parse the downloaded HTML files. In truth, any language that can manipulate text strings can be used for web page scraping.
Understanding the Page
Before you download a page, you must understand what information you want to capture. Preview the pages that contain the information and study both the page itself and its URL. Nowadays, most pages with a backend database are dynamically generated on every request. If you harvest from several such pages, it is best to understand how the page request URL is built.
In most instances, website searches use the URL to pass the search keyword to the web server. If you harvest several items or run several searches, you can build the page requests yourself and download the pages from the terminal or command line. If you scrape multiple pages, it is best to create a batch file that downloads these pages directly without using a browser.
You can also add the Python code to the batch file. This method is much faster than repeatedly downloading different pages in a browser and scraping them one at a time.
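As a rough illustration of building page requests from search keywords, the sketch below assembles one URL per keyword. The endpoint https://example.com/search and the parameter name q are placeholders, not taken from any real site; substitute whatever appears in the browser's address bar when you run a search by hand.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; replace it with the one you observed.
BASE_URL = "https://example.com/search"

def build_search_urls(keywords, param="q"):
    """Return one request URL per keyword, with the keyword URL-encoded."""
    return [f"{BASE_URL}?{urlencode({param: kw})}" for kw in keywords]

urls = build_search_urls(["web scraping", "python"])
for url in urls:
    print(url)
```

The resulting list can be written to a file and fed to a batch download, or looped over directly in Python.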
Downloading the Web Page
The easiest way to download the web page is to use a batch file from the command line. Typically, the “curl” command does the trick. You can also download the page from within Python, using either the third-party “requests” library or the standard library's “urllib”.
These functions require that you know the URL to download, because the URL is the argument you pass to the download function.
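A minimal sketch of the download step using only the standard library's urllib.request follows. The helper name download_page and the output file name are illustrative choices, not part of any library.

```python
import urllib.request
from pathlib import Path

def download_page(url, dest):
    """Fetch url and save the raw response bytes to dest.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    Path(dest).write_bytes(data)
    return len(data)
```

Calling download_page("https://example.com/", "example.html") would save that page next to the script. Saving the raw bytes leaves decoding decisions to the parsing step.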
Using Python to Parse the Page
Parsing text is one of the strengths of Python. Study the web page for the strings you want to capture, and pay close attention to the substrings that precede them. Usually, some keyword designates a specific table or row, or the data is marked by specific formatting within the page, such as a unique heading. In most cases, you have to search for these marker strings first before you start capturing the information you are interested in.
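The idea of searching for a marker first and then capturing what follows might look like the sketch below. The function name text_after and the sample HTML are made up for illustration.

```python
def text_after(page, marker, terminator):
    """Return the text between marker and the next terminator, or None."""
    start = page.find(marker)
    if start == -1:
        return None              # marker not present on this page
    start += len(marker)
    end = page.find(terminator, start)
    if end == -1:
        return None              # marker found, but nothing closes it
    return page[start:end].strip()

# Invented sample; a real page would be read from the downloaded file.
sample = '<h2 class="title">Scraping Basics</h2><p>Intro</p>'
print(text_after(sample, '<h2 class="title">', "</h2>"))
```

Returning None when a marker is missing makes it easy to skip pages that do not follow the expected layout.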
Python's string method split() separates a string into parts. It looks for a separator within the string and returns a list of the substrings found between separators. Because the number of substrings depends on how many separators the string contains, it helps to print each substring, or write it to a file, while you work out which pieces you need.
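Here is a small example of split() on an invented table row, plus the optional maxsplit argument for cases where the separator also appears inside the value.

```python
row = "<td>Alice</td><td>2021-03-04</td><td>Editor</td>"

# Splitting on the closing tag leaves an empty string after the last cell,
# so empty parts are filtered out before the opening tag is removed.
cells = [part.replace("<td>", "").strip() for part in row.split("</td>") if part]
print(cells)

# maxsplit=1 splits only at the first separator.
label, value = "Title: Web Scraping: A Primer".split(": ", 1)
print(label, "|", value)
```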
Depending on the complexity of the string you want to capture, there are some intricacies involved. For instance, the information may not be in a table but in a list, where each item carries several attributes. The web page may have a list of articles, each with a title, date, author, and URL. To capture this information, you iterate through the list and, for each item, go through each attribute. Every time you parse an attribute, write it to the destination file together with its attribute name.
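Iterating over list items and pulling out each attribute could be sketched as follows. The HTML fragment, the class names, and the helper between are all hypothetical; a real page will need its own markers.

```python
# Invented fragment standing in for a downloaded page.
page = """
<ul>
<li><a href="/a1">First Post</a> <span class="date">2021-01-05</span> <span class="author">Ann</span></li>
<li><a href="/a2">Second Post</a> <span class="date">2021-02-10</span> <span class="author">Ben</span></li>
</ul>
"""

def between(text, start, end):
    """Return the substring between the start and end markers ('' if missing)."""
    _, found, rest = text.partition(start)
    if not found:
        return ""
    return rest.partition(end)[0].strip()

articles = []
# Element 0 of the split is whatever precedes the first item, so skip it.
for item in page.split("<li>")[1:]:
    articles.append({
        "url": between(item, 'href="', '"'),
        "title": between(item, '">', "</a>"),
        "date": between(item, 'class="date">', "</span>"),
        "author": between(item, 'class="author">', "</span>"),
    })

for article in articles:
    for name, value in article.items():
        print(f"{name}: {value}")
```

Plain string matching like this is brittle when the markup varies; for irregular pages, a real HTML parser is likely a safer choice.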
Formatting and Saving the Information
When parsing information from a file, it is best to write the parsed information immediately to a new file. While doing so, you can add simple formatting, such as a separator string between fields. The output file does not need any special formatting beyond that.
Open a new file with the open() function in write mode. Write the information to the opened file using the write() method. After iterating through the parsed text in the source file, close the destination file with the close() method.
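A short sketch of the writing step follows, with made-up records and a tab as the separator. It uses a with block, which calls close() automatically when the block ends, instead of an explicit close() call; the file name articles.txt is arbitrary.

```python
# Invented records; in practice these come from the parsing step.
records = [
    ("First Post", "2021-01-05", "Ann"),
    ("Second Post", "2021-02-10", "Ben"),
]

def save_records(records, path, sep="\t"):
    """Write one record per line, joining its fields with sep."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(sep.join(record) + "\n")

save_records(records, "articles.txt")
```

A tab-separated file like this opens directly in most spreadsheet programs, which is convenient for further processing.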
Parsing through files to capture information in lists is a common task. Create your web page parsing program, and store it in your library for revision and reuse whenever necessary.