Asked  6 Months ago    Answers:  5   Viewed   65 times

I want to use BeautifulSoup to extract only the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article) and perhaps a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments to pass to findAll() to get just the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?

 Answers

27

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    # Skip text whose parent tag is never rendered as page content.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # Collect every text node, then keep only the visible ones.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
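
For a quick offline check of the filtering idea above, here is a minimal sketch that runs on an inline HTML string instead of a live URL. SAMPLE_HTML and visible_text are made-up names for illustration, and unlike the code above this version also drops whitespace-only strings so the result has no runs of spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page: a title, a stylesheet, a comment, a script, and real content.
SAMPLE_HTML = """<html><head><title>Storm report</title>
<style>p { color: red; }</style></head>
<body><!-- tracking comment --><script>var x = 1;</script>
<h1>Headline</h1><p>First paragraph.</p></body></html>"""

# Parent tags whose text a browser does not render as page content.
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for node in soup.find_all(string=True):
        # Drop comments and anything nested directly in a non-rendered tag.
        if isinstance(node, Comment) or node.parent.name in HIDDEN_PARENTS:
            continue
        stripped = node.strip()
        if stripped:  # ignore whitespace-only text nodes
            chunks.append(stripped)
    return " ".join(chunks)


print(visible_text(SAMPLE_HTML))
```

On this sample it prints "Headline First paragraph.", leaving out the title, stylesheet, script and comment.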
Tuesday, June 1, 2021
 
zhartaunik
answered 6 Months ago
31
from bs4 import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, 'html.parser')

# Match any text node containing " #" followed by at least 11 non-whitespace characters.
for elem in soup(string=re.compile(r' #\S{11}')):
    print(elem.parent)

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
Thursday, June 10, 2021
 
codingb
answered 6 Months ago
61

I've used this:

def textOf(soup):
    return u''.join(soup.findAll(text=True))

So...

texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]
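
In an href pattern like this one, `.` and `?` are regex metacharacters, so they need a backslash to match literally, and the digits need `\d`. Here is a small sketch of the difference using only the `re` module. The sample href values are made up, on the assumption that the target links look like notizia.php?idn=123.

```python
import re

# Unescaped: "." matches any character and "p?" makes the last "p" optional.
unescaped = re.compile(r'^notizia.php?idn=\d+')
# Escaped: "." and "?" are matched as literal characters.
escaped = re.compile(r'^notizia\.php\?idn=\d+')

real_href = 'notizia.php?idn=42'   # hypothetical link of the intended form
odd_href = 'notiziaXphidn=7'       # not a real link, but fits the unescaped pattern

print(bool(escaped.search(real_href)))    # True
print(bool(unescaped.search(real_href)))  # False: the literal "?" is never matched
print(bool(unescaped.search(odd_href)))   # True
print(bool(escaped.search(odd_href)))     # False
```

The escaped pattern is the one to pass as href=... to findAll().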
Wednesday, August 11, 2021
 
vuliad
answered 4 Months ago
32

Beautiful Soup is a Python library for parsing web pages. Combined with urllib2 (urllib.request in Python 3) to fetch the page, it should cover what you need.

Friday, August 13, 2021
 
tedders
answered 4 Months ago
92

You can use the find_all() method with the limit argument to get the third p tag in your HTML. Then use .find() to get the first br tag in that third paragraph. From there, the .next_siblings attribute returns a generator over the nodes that follow the br tag, which you can concatenate with str.join().

>>> third_p = soup.find_all('p', limit=3)[-1]
>>> ''.join(third_p.find('br').next_siblings)
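
Note that ''.join() only works when every sibling after the br tag is a plain string. Below is a minimal sketch that also copes with nested tags. The sample markup and the name text_after_first_br are assumptions for illustration, not taken from the original question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: the third paragraph has a <br/> followed by text and a nested tag.
SAMPLE = "<p>one</p><p>two</p><p>intro<br/>after the break <b>bold</b> tail</p>"


def text_after_first_br(html):
    soup = BeautifulSoup(html, "html.parser")
    third_p = soup.find_all("p", limit=3)[-1]
    first_br = third_p.find("br")
    parts = []
    for sibling in first_br.next_siblings:
        if isinstance(sibling, Tag):
            # Nested tags are not strings, so take their text content.
            parts.append(sibling.get_text())
        else:
            parts.append(str(sibling))
    return "".join(parts)


print(text_after_first_br(SAMPLE))
```

For this sample the result is "after the break bold tail".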
Wednesday, August 25, 2021
 
diegoiglesias
answered 3 Months ago