Asked  6 Months ago    Answers:  5   Viewed   65 times

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article) and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments to pass to findAll() to get just the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?



Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# Put the URL of the page to scrape between the quotes.
html = urllib.request.urlopen('').read()
print(text_from_html(html))
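
Since the snippet above needs a live URL, here is a minimal offline sketch of the same filter applied to a small inline document. The sample HTML, the `HIDDEN_PARENTS` set name, and the `visible_text` function name are made up for illustration; `find_all(string=True)` is the newer spelling of `findAll(text=True)` in bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is never rendered on the page (same list as above).
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}

def visible_text(html):
    """Return the rendered text of an HTML string, joined by single spaces."""
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []
    for node in soup.find_all(string=True):
        # Skip HTML comments and anything living under a non-rendered tag.
        if isinstance(node, Comment) or node.parent.name in HIDDEN_PARENTS:
            continue
        stripped = node.strip()
        if stripped:  # drop whitespace-only nodes between tags
            chunks.append(stripped)
    return ' '.join(chunks)

# Hypothetical sample page: a title, a stylesheet, a comment and a script
# that should all be dropped, plus two visible paragraphs.
SAMPLE = """<html><head><title>Storm</title><style>p {color: red}</style></head>
<body><!-- hidden note --><p>Snow fell overnight.</p>
<p>Roads closed.</p><script>var x = 1;</script></body></html>"""

result = visible_text(SAMPLE)
print(result)
```

Dropping whitespace-only nodes before joining avoids the runs of spaces that the one-line join above can produce on real pages.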
Tuesday, June 1, 2021
answered 6 Months ago
from bs4 import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, 'html.parser')

# Match any text node containing " #" followed by 11 non-whitespace characters.
for elem in soup(text=re.compile(r' #\S{11}')):
    print(elem.parent)


Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
Thursday, June 10, 2021
answered 6 Months ago

I've used this:

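import re

def textOf(soup):
    # Concatenate every text node found under the given tag.
    return u''.join(soup.findAll(text=True))


# Link text of every <a> whose href starts with notizia.php?idn=<digits>.
texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]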
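
A small self-contained sketch of the same idea, run against made-up sample links, shows why the escaping in the pattern matters. The sample HTML and the `text_of` name are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: two matching links and one that should be ignored.
SAMPLE = """
<a href="notizia.php?idn=42">First <b>story</b></a>
<a href="about.php">About</a>
<a href="notizia.php?idn=7">Second story</a>
"""

def text_of(tag):
    """Concatenate every text node under the given tag."""
    return ''.join(tag.find_all(string=True))

soup = BeautifulSoup(SAMPLE, 'html.parser')

# Escaped: "." and "?" are matched literally, \d+ matches the numeric id.
escaped = re.compile(r'^notizia\.php\?idn=\d+')
texts = [text_of(a) for a in soup.find_all('a', href=escaped)]

# Unescaped: "p?" means "an optional p", so the literal "?" in the href
# is never consumed and no link matches.
unescaped = re.compile(r'^notizia.php?idn=\d+')
unescaped_hits = soup.find_all('a', href=unescaped)

print(texts)
print(unescaped_hits)
```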
Wednesday, August 11, 2021
answered 4 Months ago

Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need.
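
As a rough sketch of how the two fit together, assuming Python 3 and bs4: `urllib.request` downloads the bytes and `BeautifulSoup` parses them. The helper names `fetch_soup` and `page_title` and the inline demo document are hypothetical; the demo parses a string so it runs without network access.

```python
import urllib.request
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return BeautifulSoup(response.read(), 'html.parser')

def page_title(soup):
    """Return the stripped <title> text, or None if the page has no title."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)

# Offline demo on an inline document; for a real page you would call
# fetch_soup() with a URL of your own instead.
offline = BeautifulSoup('<html><head><title> Demo </title></head></html>',
                        'html.parser')
print(page_title(offline))
```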

Friday, August 13, 2021
answered 4 Months ago

You can use the find_all() method with the limit argument to get the third p tag in your HTML. Next, use .find(), which returns the first br tag in that paragraph. From there, .next_siblings returns a generator over the nodes that follow the br, which you can concatenate with str.join().

>>> third_p = soup.find_all('p', limit=3)[-1]
>>> ''.join(third_p.find('br').next_siblings)
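
Here is a self-contained sketch of those steps on made-up sample HTML. One caveat: a plain `''.join(...)` raises a TypeError if any sibling after the br is a tag rather than a string, so this variant calls `get_text()` on tags. Also note that `[-1]` picks the last paragraph found, which is the third only when the document has at least three.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample: the third paragraph has a <br/> followed by mixed content.
SAMPLE = """
<p>one</p>
<p>two</p>
<p>Label:<br/>first line, <i>styled</i> and plain</p>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')

# limit=3 stops the search after three <p> tags; [-1] takes the last one found.
third_p = soup.find_all('p', limit=3)[-1]

# Everything after the first <br> inside that paragraph: strings are used
# as they are, tags are reduced to their text.
siblings = third_p.find('br').next_siblings
after_br = ''.join(
    s if isinstance(s, NavigableString) else s.get_text()
    for s in siblings
)
print(after_br)
```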
Wednesday, August 25, 2021
answered 3 Months ago