Python Web Scraping Study Notes (1): Scraping Simple Static Web Pages

Published 2021-02-03 15:53 · 2,115 characters · 11 min read


This article covers making HTTP requests with the urllib3 and requests libraries, including setting request headers, timeouts, and retries; parsing page content with regular expressions and XPath; using the Chrome developer tools to analyze page structure and network requests; detecting encodings with chardet; and storing scraped data as JSON, with worked examples illustrating the key steps and techniques of scraper development.

This article has been machine-translated from Chinese. The translation may contain inaccuracies or awkward phrasing. If in doubt, please refer to the original Chinese version.

I. Using urllib3 to Make HTTP Requests

1. Generating Requests

  • Generate requests through the request method with the following prototype:

urllib3.request(method, url, fields=None, headers=None, **urlopen_kw)

  • method: Accepts string. The request type, such as "GET" (commonly used), "HEAD", "DELETE", etc. No default value.
  • url: Accepts string. The URL to request. No default value.
  • fields: Accepts dict. The parameters sent with the request. Defaults to None.
  • headers: Accepts dict. Request header parameters. Defaults to None.
  • **urlopen_kw: Accepts dict and other Python types. Additional parameters depending on the specific requirements and request type.

Code:

import urllib3
http = urllib3.PoolManager()
rq = http.request('GET',url='https://www.pythonscraping.com/pages/page3.html')
print('Server response code:', rq.status)
print('Response body:', rq.data)
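
The fields parameter does not appear in the example above. As a rough sketch (the target URL and parameter names here are only for illustration), query-string parameters for a GET request can be passed through fields:

import urllib3
http = urllib3.PoolManager()
#For GET/HEAD/DELETE requests the fields dict is encoded into the URL query string;
#for POST/PUT it is sent as form data instead
rq = http.request('GET', 'https://httpbin.org/get', fields={'q': 'python', 'page': '1'})
print('Server response code:', rq.status)
print('Response body:', rq.data.decode('utf-8'))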

2. Handling Request Headers

Request headers are passed through the headers parameter as a dictionary. For example, define a dictionary containing User-Agent information that identifies a browser running on Windows NT 6.1 (Win64; x64), and send a GET request that carries these headers:

import urllib3
http = urllib3.PoolManager()
head = {'User-Agent':'Windows NT 6.1; Win64; x64'}
http.request('GET',url='https://www.pythonscraping.com/pages/page3.html',headers=head)

3. Timeout Settings

To keep requests from hanging or failing when the network or server is unstable, add a timeout parameter, usually a floating-point number of seconds. You can set it directly in the request call so it applies to the whole request, or set the connect and read timeouts separately. Setting the timeout parameter on the PoolManager instance applies it to all requests made through that instance.

Direct setting:

http.request('GET',url='https://www.pythonscraping.com/pages/page3.html',headers=head,timeout=3.0)
#Timeout and abort after 3s
http.request('GET',url='https://www.pythonscraping.com/pages/page3.html',headers=head,timeout=urllib3.Timeout(connect=1.0,read=2.0))
#Connection timeout after 1s, read timeout after 2s

Applied to all requests of the instance:

import urllib3
http = urllib3.PoolManager(timeout=4.0)
head = {'User-Agent':'Windows NT 6.1; Win64; x64'}
http.request('GET',url='https://www.pythonscraping.com/pages/page3.html',headers=head)
#Timeout after 4s

4. Request Retry Settings

The urllib3 library controls retries through the retries parameter. By default, it performs 3 request retries and 3 redirects. A custom retry count can be set by passing an integer to retries, or a Retry instance can be used to configure retry and redirect counts separately (see the sketch after the next example). To disable both retries and redirects, set retries to False. To disable only redirects, set redirect to False. As with Timeout, setting the retries parameter on the PoolManager instance controls the retry strategy for all requests made through that instance.

Applied to all requests of the instance:

import urllib3
http = urllib3.PoolManager(timeout=4.0,retries=10)
head = {'User-Agent':'Windows NT 6.1; Win64; x64'}
http.request('GET',url='https://www.pythonscraping.com/pages/page3.html',headers=head)
#Timeout after 4s, retry 10 times
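
Below is a minimal sketch of the Retry instance mentioned above (it uses urllib3.Retry; the counts chosen are arbitrary):

import urllib3
http = urllib3.PoolManager()
url = 'https://www.pythonscraping.com/pages/page3.html'
#At most 5 retries in total, of which at most 2 may be redirects
retry = urllib3.Retry(total=5, redirect=2)
rq = http.request('GET', url=url, retries=retry)
#Disable retries and redirects for a single request
rq1 = http.request('GET', url=url, retries=False)
#Keep retries but disable redirects only
rq2 = http.request('GET', url=url, redirect=False)
print('Server response code:', rq.status)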

5. Generating a Complete HTTP Request

Use the urllib3 library to generate a complete request to https://www.pythonscraping.com/pages/page3.html. The request should include a URL, request headers, timeout, and retry settings.

Note the encoding: the response body is decoded as utf-8.

import urllib3
#Request instance
http = urllib3.PoolManager()
#URL
url = 'https://www.pythonscraping.com/pages/page3.html'
#Request headers
head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'}
#Timeout
tm = urllib3.Timeout(connect=1.0,read=3.0)
#Retry and redirect settings, generate request
rq = http.request('GET',url=url,headers=head,timeout=tm,redirect=4)
print('Server response code:', rq.status)
print('Response body:', rq.data.decode('utf-8'))
Server response code: 200
Response body: <html>
<head>
<style>
img{
 width:75px;
}
table{
 width:50%;
}
td{
 margin:10px;
 padding:10px;
}
.wrapper{
 width:800px;
}
.excitingNote{
 font-style:italic;
 font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;">
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br>
123 Main St.<br>
Abuja, Nigeria
</br>We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr id="gift1" class="gift"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</td></tr>

<tr id="gift2" class="gift"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</td></tr>

<tr id="gift3" class="gift"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg">
</td></tr>

<tr id="gift4" class="gift"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg">
</td></tr>

<tr id="gift5" class="gift"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg">
</td></tr>
</table>
</p>
<div id="footer">
&copy; Totally Normal Gifts, Inc. <br>
+234 (617) 863-0736
</div>

</div>
</body>
</html>

II. Using the requests Library for HTTP Requests

import requests
url = 'https://www.pythonscraping.com/pages/page3.html'
rq2 = requests.get(url)
rq2.encoding = 'utf-8'
print('Response code:', rq2.status_code)
print('Encoding:', rq2.encoding)
print('Response headers:', rq2.headers)
print('Body:', rq2.text)

Solving Character Encoding Issues

Note that when the requests library guesses incorrectly, you need to manually specify the encoding to avoid garbled characters in the parsed page content. Manual specification is not flexible and cannot adaptively handle different encodings during scraping. Using the chardet library is more convenient and flexible — it is an excellent string/file encoding detection module. The chardet library uses the detect method to detect the encoding of a given string. Common parameters:

  • byte_str: Accepts bytes. The byte string whose encoding needs to be detected. No default value.

import chardet
chardet.detect(rq2.content)

Output: 100% probability of being encoded in ASCII.

Complete code:

import requests
import chardet
url = 'https://www.pythonscraping.com/pages/page3.html'
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'}
rq2 = requests.get(url,headers=head,timeout=2.0)
rq2.encoding = chardet.detect(rq2.content)['encoding']
print('Body:',rq2.text)
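
Alternatively (a brief aside, not a replacement for chardet), requests exposes its own guess through the apparent_encoding attribute, which can be assigned in the same way:

import requests
url = 'https://www.pythonscraping.com/pages/page3.html'
rq2 = requests.get(url, timeout=2.0)
#apparent_encoding is the encoding requests detects from the raw response bytes
rq2.encoding = rq2.apparent_encoding
print('Detected encoding:', rq2.encoding)
print('Body:', rq2.text)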

III. Parsing Web Pages

Chrome Developer Tools panel functions:

[Image: overview of the developer tools panels]

1. Elements Panel

In web scraping development, the Elements panel is mainly used to find the positions of page elements, such as image locations or text link positions. The left side of the panel shows the current page structure in a tree format; click the triangle to expand branches.


2. Sources Panel

Switch to the Sources panel.

Click the "tipdm" folder on the left and the "index.html" file; the complete code will be displayed in the middle.

3. Network Panel

Switch to the Network panel. You need to reload the page first. Click a resource and the header information, preview, response, Cookies, and timing details will be displayed in the middle.


IV. Parsing Web Pages with Regular Expressions

1. Python Regular Expressions: Finding Names and Phone Numbers in a String

Regular expressions are tools for pattern matching and replacement. They allow users to construct matching patterns using a series of special characters, then compare these patterns against target strings or files, executing corresponding actions based on whether the target contains the pattern.

rawdata = '555-1239Moe Szyslak(636) 555-0113Burns, C.Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson,Homer5553642Dr. Julius Hibbert'

Let’s try:

import re
string = '1. A small sentence - 2.Another tiny sentence. '
print('re.findall:',re.findall('sentence',string))
print('re.search:',re.search('sentence',string))
print('re.match:',re.match('sentence',string))
print('re.match:',re.match('1. A small sentence',string))
print('re.sub:',re.sub('small','large',string))
print('re.sub:',re.sub('small','',string))

Output:

re.findall: ['sentence', 'sentence']
re.search: <re.Match object; span=(11, 19), match='sentence'>
re.match: None
re.match: <re.Match object; span=(0, 19), match='1. A small sentence'>
re.sub: 1. A large sentence - 2.Another tiny sentence. 
re.sub: 1. A  sentence - 2.Another tiny sentence. 

Common generalized symbols:

  1. Period ".": Can represent any single character except the newline "\n";
string = '1. A small sentence - 2.Another tiny sentence. '
re.findall('A.',string)

Output: ['A ', 'An']

  2. Character class "[]": Matches any single character listed inside the brackets;
string = 'small smell smll smsmll sm3ll sm.ll sm?ll sm\nll sm\tll'
print('re.findall:',re.findall('sm.ll',string))
print('re.findall:',re.findall('sm[asdfg]ll',string))
print('re.findall:',re.findall('sm[a-zA-Z0-9]ll',string))
print('re.findall:',re.findall('sm\.ll',string))
print('re.findall:',re.findall('sm[.?]ll',string))

Output:

re.findall: ['small', 'smell', 'sm3ll', 'sm.ll', 'sm?ll', 'sm\tll']
re.findall: ['small']
re.findall: ['small', 'smell', 'sm3ll']
re.findall: ['sm.ll']
re.findall: ['sm.ll', 'sm?ll']
  3. Quantifier "{}": Specifies how many times the preceding pattern can be matched.
print('re.findall:',re.findall('sm..ll',string))
print('re.findall:',re.findall('sm.{2}ll',string))
print('re.findall:',re.findall('sm.{1,2}ll',string))
print('re.findall:',re.findall('sm.{1,}ll',string))
print('re.findall:',re.findall('sm.?ll',string)) # {0,1}
print('re.findall:',re.findall('sm.+ll',string)) # {1,}
print('re.findall:',re.findall('sm.*ll',string)) # {0,}

Output:

re.findall: ['smsmll']
re.findall: ['smsmll']
re.findall: ['small', 'smell', 'smsmll', 'sm3ll', 'sm.ll', 'sm?ll', 'sm\tll']
re.findall: ['small smell smll smsmll sm3ll sm.ll sm?ll', 'sm\tll']
re.findall: ['small', 'smell', 'smll', 'smll', 'sm3ll', 'sm.ll', 'sm?ll', 'sm\tll']
re.findall: ['small smell smll smsmll sm3ll sm.ll sm?ll', 'sm\tll']
re.findall: ['small smell smll smsmll sm3ll sm.ll sm?ll', 'sm\tll']

PS: Greedy rule: by default, quantifiers match as many characters as possible.
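
A short sketch contrasting greedy and non-greedy matching (the non-greedy form simply appends ? to the quantifier):

import re
string = 'small smell smll smsmll sm3ll sm.ll sm?ll sm\nll sm\tll'
#Greedy: .* grabs as many characters as possible, so one long match swallows several words
print('greedy:', re.findall('sm.*ll', string))
#Non-greedy: .*? stops at the earliest possible 'll'
print('non-greedy:', re.findall('sm.*?ll', string))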

Complete Code

import re
import pandas as pd
rawdata = '555-1239Moe Szyslak(636) 555-0113Burns, C.Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson,Homer5553642Dr. Julius Hibbert'
#Names: a capital letter followed by letters, commas, periods and spaces
names = re.findall('[A-Z][A-Za-z,. ]*',rawdata)
print(names)
#Phone numbers: an optional 3-digit area code (possibly in parentheses),
#then 3 digits, then 4 digits, each separated by an optional space or hyphen
number = re.findall(r'\(?(?:[0-9]{3})?\)?[ \-]?[0-9]{3}[ \-]?[0-9]{4}',rawdata)
print(number)
pd.DataFrame({'Name':names,'TelPhone':number})

Output:

[Image: regex matching result as a DataFrame]

V. Parsing Web Pages with XPath

XML Path Language (XPath) is a tree-structured language based on XML for finding nodes in the data structure tree and determining positions of parts of an XML document. Using XPath requires importing the etree module from the lxml library, and using the HTML class to initialize the HTML object to be matched (XPath can only process the DOM representation of documents). The basic syntax of the HTML class is as follows:

1. Basic Syntax

lxml.etree.HTML(text, parser=None, *, base_url=None)

  • text: Accepts str. The string to convert to HTML. No default value.
  • parser: Accepts str. The HTML parser to use. No default value.
  • base_url: Accepts str. Sets the original URL of the document, used to resolve relative paths of external entities. Defaults to None.

If HTML nodes are not properly closed, the etree module auto-completes them. Call the tostring method to output the corrected HTML code; note that the result is of type bytes and needs the decode method to convert it to str.
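
A minimal sketch of this auto-completion (the unclosed fragment below is made up for illustration):

from lxml import etree
#A fragment with unclosed <li> and <ul> tags
broken = '<div><ul><li>first item<li>second item'
html = etree.HTML(broken)
#tostring returns bytes, so decode it to obtain a str
print(etree.tostring(html).decode('utf-8'))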

XPath uses regex-like expressions to match content in HTML files. Common matching expressions are as follows:

  • nodename: selects all child nodes of the nodename node
  • /: selects direct child nodes from the current node
  • //: selects descendant nodes from the current node
  • .: selects the current node
  • ..: selects the parent node of the current node
  • @: selects attributes

2. Predicates

XPath predicates are used to find specific nodes or nodes containing specified values. Predicates are embedded in brackets after the path, as follows:

ExpressionDescription
/html/body/div[1]Select the first div child node under body
/html/body/div[last()]Select the last div child node under body
/html/body/div[last()-1]Select the second-to-last div child node under body
/html/body/div[positon()<3]Select the first two div child nodes under body
/html/body/div[@id]Select div child nodes under body with an id attribute
/html/body/div[@id=“content”]Select the div child node under body with id value “content”
/html/body/div[xx>10.00]Select child nodes under body where element xx > 10
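
A small sketch applying these predicates to the page3.html document fetched in Section I (the expressions below are illustrative and based on the table markup shown there):

import requests
from lxml import etree
url = 'https://www.pythonscraping.com/pages/page3.html'
html = etree.HTML(requests.get(url).content)
#The first <td> (the item title) of the row whose id is "gift1"
print(html.xpath('//tr[@id="gift1"]/td[1]/text()'))
#The div whose id attribute equals "content"
print(html.xpath('//div[@id="content"]/text()'))
#The first <td> of the last row in the table with id "giftList"
print(html.xpath('//table[@id="giftList"]/tr[last()]/td[1]/text()'))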

3. Utility Functions

XPath also provides some utility functions for fuzzy searching. Sometimes you only know partial characteristics of the target; these functions enable fuzzy matching:

  • starts-with: //div[starts-with(@id,"co")] selects div nodes whose id starts with "co"
  • contains: //div[contains(@id,"co")] selects div nodes whose id contains "co"
  • and: //div[contains(@id,"co") and contains(@id,"en")] selects div nodes whose id contains both "co" and "en"
  • text(): //li[contains(text(),"first")] selects li nodes whose text contains "first"
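
And a corresponding sketch of the fuzzy-matching functions against the same page (again, the expressions are illustrative):

import requests
from lxml import etree
url = 'https://www.pythonscraping.com/pages/page3.html'
html = etree.HTML(requests.get(url).content)
#All rows whose id starts with "gift" (gift1 ... gift5)
print(len(html.xpath('//tr[starts-with(@id, "gift")]')))
#Spans whose class attribute contains "excitingNote"
print(html.xpath('//span[contains(@class, "excitingNote")]/text()'))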

4. Using Google Developer Tools

Google Developer Tools provides a very convenient way to copy XPath paths.

[Image: copying an element's XPath in developer tools]
Example: scraping Zhihu trending topics (complete code below). Zhihu requires login, so log in in your browser and copy your own cookie into the request headers.

import requests
from lxml import etree
url = "https://www.zhihu.com/hot"
hd = { 'Cookie':'your Cookie', #'Host':'www.zhihu.com',
        'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

response = requests.get(url, headers=hd)
html_str = response.content.decode()
html = etree.HTML(html_str)
title = html.xpath("//section[@class='HotItem']/div[@class='HotItem-content']/a/@title")
href = html.xpath("//section[@class='HotItem']/div[@class='HotItem-content']/a/@href")
f = open("zhihu.txt", 'w', encoding='utf-8')
for i in range(len(title)):
    print(str(i + 1) + '.' + title[i])
    print('Link: ' + href[i])
    print('-' * 50)
    f.write(str(i + 1) + '.' + title[i] + '\n')
    f.write('Link: ' + href[i] + '\n')
    f.write('-' * 50 + '\n')
f.close()

Scraping result:

[Image: scraping result]

VI. Data Storage

1. Storing as JSON Format

import requests
from lxml import etree
import json
#Code above omitted
with open('zhihu.json', 'w', encoding='utf-8') as j:
    json.dump({'title': title, 'href': href}, j, ensure_ascii=False)

Storage result (PS: after file formatting):

[Image: JSON storage result]
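
As a quick sanity check (a small sketch), the stored file can be read back with json.load:

import json
with open('zhihu.json', 'r', encoding='utf-8') as j:
    data = json.load(j)
print(len(data['title']), 'titles loaded')
print(data['title'][0], data['href'][0])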

