Asked 7 months ago · Answers: 5 · Viewed 32 times

There are already similar questions on Stack Overflow, but none of their solutions has worked for me. I'm trying to fetch a page on LoveIt.com with cURL, but it returns a 404 error, while the URL works fine in the browser:

        $url = 'http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV';

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
        curl_setopt($curl, CURLOPT_HEADER, false);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_REFERER, 'http://loveit.com/');

        $html = curl_exec($curl);
        print_r(curl_getinfo($curl)); // this is the array shown below
        curl_close($curl);

Here's the request info I receive from curl_getinfo():

        Array
        (
            [url] => http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV
            [content_type] => text/html; charset=utf-8
            [http_code] => 404
            [header_size] => 667
            [request_size] => 172
            [filetime] => -1
            [ssl_verify_result] => 0
            [redirect_count] => 0
            [total_time] => 0.320466
            [namelookup_time] => 0.000326
            [connect_time] => 0.119046
            [pretransfer_time] => 0.119089
            [size_upload] => 0
            [size_download] => 499
            [speed_download] => 1557
            [speed_upload] => 0
            [download_content_length] => 499
            [upload_content_length] => 0
            [starttransfer_time] => 0.320438
            [redirect_time] => 0
            [certinfo] => Array ( )
            [primary_ip] => ---
            [primary_port] => 80
            [local_ip] => ---
            [local_port] => 53837
            [redirect_url] =>
        )

I read that some websites have protections against this kind of script, and I did test some of the proposed solutions (CURLOPT_USERAGENT, CURLOPT_REFERER...), but none of them worked for me.

Any idea what's happening here?

I would like to back up my LoveIt account, which is why I'm writing this script (there is no export function, and no reply from LoveIt.com about the health of the website).

 Answers

65

I quickly checked the page in question with Live HTTP Headers enabled and noticed a bunch of cookies being set. I suspect that, since it's not a "normal" URL, you need to carry those cookies along while being redirected; otherwise you end up being kicked out with a 404. Use CURLOPT_COOKIEJAR on your cURL handle from the start. See: http://php.net/manual/pl/function.curl-setopt.php
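For example, a minimal sketch of that approach, assuming any writable path for the cookie file (the path below is just an illustration):

<?php
// Minimal sketch: same request as in the question, but with a cookie jar so
// that cookies set during the redirect chain are stored and sent back.
$url = 'http://loveit.com/loves/P0D1jlFaIOzzZfZqj_bY3KV';
$cookieFile = '/tmp/loveit_cookies.txt'; // illustrative path; any writable file works

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieFile);  // write cookies here when the handle is closed
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieFile); // read cookies from here for each request
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; backup-script)');

$html = curl_exec($curl);
echo curl_getinfo($curl, CURLINFO_HTTP_CODE), "\n";
curl_close($curl);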

Wednesday, March 31, 2021
 
Gerardo
answered 7 Months ago
88

You should only call curl_close() when you know you're done with that particular handle, or when switching from its current state to a new one (i.e. if you would be changing a ton of options via curl_setopt(), it can be faster to start from a clean new handle than from your current "dirty" one).

The cookiejar/cookiefile options are only strictly necessary for maintaining cookies between separate cURL handles/invocations. Each handle is independent of the others, so cookie files are the only way to share cookies between them, as in the sketch below.
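Here is an illustration of that point (the URLs and the jar path are hypothetical): two independent handles can only see each other's cookies through a common cookie file.

<?php
// Sketch: two independent cURL handles sharing cookies via a file.
$jar = '/tmp/shared_cookies.txt'; // hypothetical path, for illustration only

// First handle: receives cookies and writes them to the jar on close.
$first = curl_init('http://example.com/step-one');
curl_setopt($first, CURLOPT_RETURNTRANSFER, true);
curl_setopt($first, CURLOPT_COOKIEJAR, $jar);
curl_exec($first);
curl_close($first); // cookies are flushed to $jar here

// Second handle: a separate invocation that reads the same jar.
$second = curl_init('http://example.com/step-two');
curl_setopt($second, CURLOPT_RETURNTRANSFER, true);
curl_setopt($second, CURLOPT_COOKIEFILE, $jar); // send the previously stored cookies
echo curl_exec($second);
curl_close($second);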

Wednesday, March 31, 2021
 
huhushow
answered 7 Months ago
23

You need to enable the cURL extension in the php.ini used by the command-line PHP.

You can probably just copy the file C:\wamp\bin\apache\Apache2.2.x\bin\php.ini to C:\wamp\bin\php\php5.3.10\php.ini (adjust for the actual directories on your system).
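To check which php.ini the command-line PHP is actually reading, and whether cURL is available there, a small sketch like this can help (run it with the CLI php binary):

<?php
// Sketch: run this from the command line to see which php.ini is in use
// and whether the cURL extension is loaded there.
echo 'Loaded php.ini: ', php_ini_loaded_file(), PHP_EOL;
echo 'cURL loaded:    ', extension_loaded('curl') ? 'yes' : 'no', PHP_EOL;
// If it prints "no", enable the extension in that php.ini, e.g. on Windows:
//   extension=php_curl.dll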

Wednesday, March 31, 2021
 
Fanda
answered 7 Months ago
99

This was a bug in cURL and was fixed in 7.28.1 (according to the changelog: http://curl.haxx.se/changes.html).

Note: modern browsers don't send the fragment part of the URL with the request, but your cURL does, and that is what makes the difference.
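If upgrading cURL is not an option, one workaround is to strip the fragment yourself before handing the URL to cURL, which mirrors what a browser does. A sketch with a hypothetical URL:

<?php
// Sketch: drop the fragment ("#..." part) before passing the URL to cURL.
$url = 'http://example.com/page#some-fragment'; // hypothetical URL with a fragment
$urlWithoutFragment = strtok($url, '#');        // everything before the first '#'

$curl = curl_init($urlWithoutFragment);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($curl);
curl_close($curl);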

Saturday, May 29, 2021
 
Oshrib
answered 5 Months ago
94

Well, that's because the default User-Agent of requests is python-requests/2.13.0, and in your case that website doesn't like traffic from "non-browsers", so it tries to block such traffic.

>>> import requests
>>> session = requests.Session()
>>> session.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}

All you need to do is make the request look like it's coming from a browser, so just add an extra headers parameter:

import requests

# This is Chrome; you can set whatever browser UA string you like
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
response = requests.get('http://www.rlsnet.ru/search_result.htm?word=%D6%E5%F0%E5%E1%F0%EE%EB%E8%E7%E8%ED', headers=headers)

print(response.status_code)
print(response.url)

# Output:
# 200
# http://www.rlsnet.ru/search_result.htm?word=%D6%E5%F0%E5%E1%F0%EE%EB%E8%E7%E8%ED
Saturday, July 3, 2021
 
Whakkee
answered 4 Months ago