Asked  7 Months ago    Answers:  2   Viewed   73 times

I'm trying to scrape a website, but it gives me an error.

I'm using the following code:

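import urllib.request
from bs4 import BeautifulSoup

# Fetch the page; read() returns the raw response body as bytes.
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(html, "html.parser")

# This is the line that fails: the console codec cannot encode some characters.
print(soup)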

And I'm getting the following error:

File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>

What can I do to fix this?

 Answers

68

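I fixed it by adding .encode("utf-8") to soup.

That means that print(soup) becomes print(soup.encode("utf-8")). Note that in Python 3 this prints a bytes literal (b'...') rather than rendered text, but it avoids the UnicodeEncodeError.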
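
If readable text is preferred over a bytes literal, one possible alternative is to replace the characters the console cannot encode before printing. The following is only a minimal sketch: `printable_for` is a hypothetical helper name, not part of BeautifulSoup, and the cp1252 console is assumed from the traceback in the question.

```python
import sys


def printable_for(text, encoding):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Encoding with errors="replace" substitutes '?' for unencodable characters;
    # decoding again yields a str that the target codec can always encode.
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    # Fall back to UTF-8 if the stream does not report an encoding.
    console_encoding = sys.stdout.encoding or "utf-8"
    print(printable_for("nairātmyayoḥ", console_encoding))
```

With the code from the question, this would be called as print(printable_for(str(soup), sys.stdout.encoding)). The trade-off is that unsupported characters are lost in the output.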

Tuesday, June 1, 2021
 
Dail
answered 7 Months ago
80

Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.

This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.

You can decode such bytestrings like this:

>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām

To read such data from a file:

with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
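
The same codec can be applied to bytes that come from somewhere other than a file, such as the result of get.read() in the question. The sketch below is illustrative only: `decode_utf8_any` is a hypothetical helper name, and it assumes the payload really is UTF-8. It relies on the fact that 'utf-8-sig' skips a leading byte order mark when one is present and otherwise behaves like plain UTF-8.

```python
import codecs


def decode_utf8_any(raw):
    """Decode UTF-8 bytes, dropping a leading byte order mark if present."""
    return raw.decode("utf-8-sig")


# The same text encoded with and without the UTF-8 byte order mark.
with_bom = codecs.BOM_UTF8 + "nairātmyayoḥ".encode("utf-8")
without_bom = "nairātmyayoḥ".encode("utf-8")

# Both inputs decode to the same string.
same = decode_utf8_any(with_bom) == decode_utf8_any(without_bom)
```

Decoding with plain 'utf-8' instead would keep the mark as a leading U+FEFF character in the resulting string.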

Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode some of the non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
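
Besides changing the console settings, it may be possible to change the encoding Python uses for the output stream itself. This is a hedged sketch, not a guaranteed fix: `use_utf8` is a hypothetical helper name, it assumes Python 3.7 or later (where text streams have a reconfigure method), and the console still has to be able to display the resulting UTF-8 bytes.

```python
import sys


def use_utf8(stream):
    """Switch a text stream to UTF-8 output if it supports reconfiguration."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        # Older Python versions or replaced streams may lack reconfigure().
        return False
    reconfigure(encoding="utf-8")
    return True
```

A script would call use_utf8(sys.stdout) once at startup, before the first print.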

Tuesday, August 31, 2021
 
Otiel
answered 3 Months ago