카테고리
coding

Different ways to parse Python BeautifulSoup

While studying Python, I tried to organize various ways to get specific tags from html documents.

HTML example code

Assuming that there is an HTML document as follows, let's see how to parse specific tags using the BeautifulSoup module one by one.

from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <span class='test1'>Content1</span>
        <div class='test1'>Content2</div>
        <div class='test1' id='target' name='sangminem'>Goal</div>
        <div class='test2'>Content3</div>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

The goal is the tag below.

<div class='test1' id='target' name='sangminem'>Goal</div>

Use the find() method

print(soup.find('div',id='target')) #tag, id
print(soup.find('div',attrs={'id':'target'})) #tag, id as attribute value
print(soup.find('div',attrs={'name':'sangminem'})) #tag, name as attribute value
print(soup.find(attrs={'name':'sangminem'})) #name as attribute value
print(soup.find(attrs={'id':'target'})) #id as attribute value

Here is the output.

<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>

Use the find_all() method

print(soup.find_all('div')[1]) #tag
print(soup.find_all('div',class_=['test1'])[1]) #tag, class
print(soup.find_all('div',id='target')[0]) #tag, id
print(soup.find_all('div',attrs={'class':'test1'})[1]) #tag, class as attribute value
print(soup.find_all('div',attrs={'id':'target'})[0]) #tag, id as attribute value
print(soup.find_all('div',attrs={'name':'sangminem'})[0]) #name as attribute value
print(soup.find_all(class_=['test1'])[2]) #class
print(soup.find_all(id='target')[0]) #id
print(soup.find_all(attrs={'class':'test1'})[2]) #class as attribute value
print(soup.find_all(attrs={'id':'target'})[0]) #id as attribute value
print(soup.find_all(attrs={'name':'sangminem'})[0]) #name as attribute value

Here is the output.

<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>

Use the select_one() method

print(soup.select_one('div.test1#target')) #tag, class, id
print(soup.select_one('div#target')) #tag, id
print(soup.select_one('.test1#target')) #class, id
print(soup.select_one('#target')) #id

Here is the output.

<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>

Use the select() method

print(soup.select('.test1')[2]) #class
print(soup.select('div.test1')[1]) #tag, class
print(soup.select('div.test1#target')[0]) #tag, class, id
print(soup.select('div#target')[0]) #tag, id
print(soup.select('.test1#target')[0]) #class, id
print(soup.select('#target')[0]) #id

Here is the output.

<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>
<div class="test1" id="target" name="sangminem">Goal</div>

The find() method and the select_one() method get a single result.

If there are multiple results that satisfy the condition, only the first value is retrieved.

The find_all() and select() methods get all the results that satisfy the condition in the form of an array.

Therefore, in order to finally get the desired value, you must not omit the array index.

If it is difficult to decide which one to use because there are various methods, we recommend using the select() method.

It is easy to use and actually parses faster.

Leave a Reply

Your email address will not be published. Required fields are marked *

en_USEnglish