Asked  7 Months ago    Answers:  5   Viewed   30 times

I'm having trouble parsing HTML elements that have a "class" attribute using BeautifulSoup. The code looks like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["class"] == "stylelistrow"):
        print div

I get an error on the same line "after" the script finishes.

File "./beautifulcoding.py", line 130, in getlanguage
  if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
   return self._getAttrMap()[key]
KeyError: 'class'

How do I get rid of this error?

 Answers

58

You can refine your search to find only those divs with a given class. The method is find_all in BeautifulSoup 4 (BeautifulSoup 3 spells it findAll):

mydivs = soup.find_all("div", {"class": "stylelistrow"})
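
To make the cause of the KeyError concrete, here is a minimal sketch assuming BeautifulSoup 4 (the bs4 package) and a made-up three-div snippet, neither of which comes from the question. It shows that indexing a tag without a class attribute raises KeyError, and two ways to avoid it: filtering in the search itself, or reading the attribute with .get().

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: the middle div has no class attribute.
html = """
<div class="stylelistrow">first</div>
<div>no class here</div>
<div class="stylelistrow other">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Indexing a tag raises KeyError when the attribute is absent,
# which is what the traceback in the question shows.
try:
    soup.find_all("div")[1]["class"]
    missing_class_raises = False
except KeyError:
    missing_class_raises = True

# Option 1: let the search do the filtering.
rows = soup.find_all("div", class_="stylelistrow")

# Option 2: filter by hand. .get() returns a default instead of raising.
# In bs4, "class" is multi-valued, so its value is a list of class names.
rows_by_hand = [
    div for div in soup.find_all("div")
    if "stylelistrow" in div.get("class", [])
]

print([div.get_text() for div in rows])
```

Note that in bs4 a comparison such as div["class"] == "stylelistrow" would not match even when the attribute exists, because the value is a list rather than a string.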
Tuesday, June 1, 2021
 
Novalirium
answered 7 Months ago
43

Use:

//*[namespace-uri()='yourNamespaceURI-here'
   or
    @*[namespace-uri()='yourNamespaceURI-here']
   ]

The two conditions in the predicate are combined with the XPath "or" operator.

The XPath expression thus selects any element that either:

  • belongs to the specified namespace.
  • has attributes that belong to the specified namespace.
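
The same selection can be sketched in Python with only the standard library. This is a hand-written equivalent, not the XPath itself: xml.etree.ElementTree supports only a small XPath subset without namespace-uri(), so the check is done on the "{uri}name" form that ElementTree uses for namespaced tags and attributes. The namespace URI and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and document, for illustration only.
NS_URI = "http://example.com/ns"
XML_DOC = """<root xmlns:x="http://example.com/ns">
  <x:a/>
  <b x:flag="1"/>
  <c plain="1"/>
</root>"""


def in_namespace(qualified_name, uri):
    # ElementTree stores namespaced names as "{uri}local".
    return qualified_name.startswith("{" + uri + "}")


def select_in_namespace(root, uri):
    # Keep elements that are in the namespace themselves,
    # or that carry at least one attribute in the namespace.
    return [
        el for el in root.iter()
        if in_namespace(el.tag, uri)
        or any(in_namespace(name, uri) for name in el.attrib)
    ]


root = ET.fromstring(XML_DOC)
matches = select_in_namespace(root, NS_URI)
print([el.tag for el in matches])
```

With a library that implements full XPath 1.0, the original expression could be evaluated directly instead.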
Thursday, July 15, 2021
 
Gil
answered 5 Months ago
84

If the value is hardcoded in the page source through the value attribute, you can use:

$('#attached_docs :input[value="123"]').remove();

If you want to target elements whose value of 123 was set by the user or programmatically, filter on the current value property instead (this works in both cases):

$('#attached_docs :input').filter(function(){return this.value=='123'}).remove();

demo http://jsfiddle.net/gaby/RcwXh/2/

Wednesday, July 28, 2021
 
twk
answered 5 Months ago
88

From the docs' summary table of advantages and disadvantages:

  1. html.parser - BeautifulSoup(markup, "html.parser")

    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)

    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

  2. lxml - BeautifulSoup(markup, "lxml")

    • Advantages: Very fast, Lenient

    • Disadvantages: External C dependency

  3. html5lib - BeautifulSoup(markup, "html5lib")

    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

    • Disadvantages: Very slow, External Python dependency
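
To see how the parsers differ in practice, the sketch below feeds the same invalid fragment to each one and records the resulting document. It assumes BeautifulSoup 4; lxml and html5lib are optional third-party packages, so a parser that is not installed is recorded as None rather than treated as an error. The fragment itself is just an illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A deliberately invalid fragment: a closing </p> with no opening <p>.
markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # The requested parser library is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

html.parser simply drops the stray closing tag, while lxml and html5lib, when available, wrap the fragment in a full document, each repairing the markup in its own way.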

Thursday, July 29, 2021
 
mcography
answered 5 Months ago
60

Follow the pagination with an endless loop that keeps following the "Next" link until it is no longer found.

In other words, start from the first page and keep following the "Next" link until you reach the last page, where the link is absent.

Working code (Python 3):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.chess.com/'
game_ids = []

next_page = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
while True:
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(requests.get(next_page).content, 'html.parser')

    # collect the game ids (the attribute value is quoted because it contains "?" and "=")
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        game_ids.append(int(gameid))

    try:
        next_page = urljoin(base_url, soup.select('ul.pagination li.next-on a')[0].get('href'))
    except IndexError:
        break  # exit the loop once the "Next" link is not found

print(game_ids)

For the URL you've provided (Hikaru GM), it prints a list of 224 game ids collected from all pages.
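
The loop can be tried offline with a small self-contained sketch. Everything site-specific here is an assumption made for illustration: the example.test URLs, the in-memory PAGES dictionary, and the fetch and collect_ids helpers are hypothetical stand-ins for the real requests. The selectors are the ones from the answer, and BeautifulSoup 4 is assumed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://example.test/"  # hypothetical site

# Hypothetical in-memory pages standing in for real HTTP responses.
PAGES = {
    "http://example.test/archive?page=1": (
        '<a href="/livechess/game?id=11">game</a>'
        '<a href="/livechess/game?id=12">game</a>'
        '<ul class="pagination"><li class="next-on">'
        '<a href="/archive?page=2">Next</a></li></ul>'
    ),
    "http://example.test/archive?page=2": (
        '<a href="/livechess/game?id=13">game</a>'
        '<ul class="pagination"><li class="next-off">Next</li></ul>'
    ),
}


def fetch(url):
    # Stand-in for requests.get(url).content.
    return PAGES[url]


def collect_ids(start_url, fetch_page):
    ids = []
    seen = set()  # guards against a "Next" link that loops back
    url = start_url
    while url and url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch_page(url), "html.parser")
        for link in soup.select('a[href^="/livechess/game?id="]'):
            ids.append(int(link["href"].split("?id=")[1]))
        # select_one returns None when there is no "Next" link.
        next_link = soup.select_one("ul.pagination li.next-on a")
        url = urljoin(BASE_URL, next_link["href"]) if next_link else None
    return ids


print(collect_ids("http://example.test/archive?page=1", fetch))
```

Using select_one and a None check replaces the try/except IndexError of the original loop, and the seen set is an extra safeguard so a misbehaving site cannot keep the loop running forever.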

Friday, November 12, 2021
 
JB Nizet
answered 4 Weeks ago