
I'm using the following code, based on the loadspeed.js example, to open an https:// site that also requires HTTP server authentication.

var page = require('webpage').create(),
    system = require('system'),
    t, address;

// credentials for the HTTP server authentication
page.settings.userName = 'myusername';
page.settings.password = 'mypassword';

if (system.args.length === 1) {
    console.log('Usage: scrape.js <some URL>');
    phantom.exit();
} else {
    t = Date.now();
    address = system.args[1];
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            t = Date.now() - t;
            console.log('Page title is ' + page.evaluate(function () {
                return document.title;
            }));
            console.log('Loading time ' + t + ' msec');
        }
        phantom.exit();
    });
}

It fails to load the page every time. What could be wrong here? Do secured (https) sites need to be handled differently? The site can be accessed successfully from a browser, though.

I'm just starting out with PhantomJS and find it too good to stop playing with, even though I'm not moving forward on this issue.

Answers

I tried Fred's and Cameron Tinker's answers, but only the --ssl-protocol=any option seemed to help me:

phantomjs --ssl-protocol=any test.js

Using --ssl-protocol=any should also be considerably safer, because the connection is still encrypted, whereas --ignore-ssl-errors=true ignores all SSL errors, including malicious ones. (The likely reason the page fails to load in the first place is that older PhantomJS builds default to SSLv3, which many servers have disabled; --ssl-protocol=any lets PhantomJS negotiate TLS instead.)
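
For comparison, the riskier invocation with the ignore flag would look like this (test.js as in the example above); use it only if you are willing to accept unverified certificates:

phantomjs --ignore-ssl-errors=true test.js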

Answered by Alix

The main problem seems to be that you're exiting too early. You're creating multiple page instances in a loop, and since PhantomJS is asynchronous, the call to page.open() returns immediately and the next for-loop iteration is executed.

A for-loop is fast, but web requests are slow, so your loop finishes before even the first page has loaded. It also means that the first page to finish loading shuts down PhantomJS, because you're calling phantom.exit() in each of those page.open() callbacks. I suspect the second URL is faster for some reason and is therefore the only one that ever gets written. You can fix this by counting the finished pages and exiting only when all of them are done:

var countFinished = 0,
    maxFinished = len; // len: the number of URLs, as in your loop

function checkFinish() {
    countFinished++;
    if (countFinished === maxFinished) {
        phantom.exit();
    }
}

for (i = 1; i <= len; i++) {
    country = countries[i];
    name = country.concat(name1);
    add = add1.concat(country);

    var webPage = require('webpage');
    var page = webPage.create();

    var fs = require('fs');
    var path = name;

    // note: page and path here still suffer from the closure problem
    // that is fixed in the next snippet
    page.open(add, function (status) {
        var content = page.content;
        fs.write(path, content, 'w');
        checkFinish();
    });
}

The other problem is that you're creating a lot of page instances without cleaning them up. You should close each one as soon as you're done with it:

for (i = 1; i <= len; i++) {
    (function(i){
        var country = countries[i];
        var name = country.concat(name1);
        var add = add1.concat(country);

        var webPage = require('webpage');
        var page = webPage.create();

        var fs = require('fs');
        var path = name;

        page.open(add, function (status) {
            var content = page.content;
            fs.write(path, content, 'w');
            page.close(); // free the page's memory once it has been saved
            checkFinish();
        });
    })(i);
}

Since JavaScript has function-level scope, you need an IIFE (immediately invoked function expression) to retain a reference to the correct page instance in the page.open() callback. See this question for more information about that: Q: JavaScript closure inside loops – simple practical example

If you don't want to clean up afterwards, then you should use the same page instance over all of those URLs. I already have an answer about doing that here: A: Looping over urls to do the same thing
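
A minimal sketch of that sequential, single-page approach, assuming the same countries, name1, add1, and len variables as in your script:

var webPage = require('webpage');
var fs = require('fs');

var page = webPage.create();
var i = 1;

function processNext() {
    if (i > len) {
        phantom.exit();
        return;
    }
    var country = countries[i];
    var path = country.concat(name1); // output file, as in your script
    var add = add1.concat(country);   // URL to fetch
    i++;

    page.open(add, function (status) {
        fs.write(path, page.content, 'w');
        processNext(); // start the next request only after this one finished
    });
}

processNext();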

Answered by KeK0

There are several measures you can take to decrease processing time.

1. Get a more powerful server/computer (as Mathieu rightly noted).

Yes, you could argue this is irrelevant to the question, but when it comes to scraping it very much is: on a budget $8 VPS, without any optimization, your initial script ran in 9589 ms, which is already a ~30% improvement.

2. Turn off image loading. It helps... a bit: 8160 ms load time.

page.settings.loadImages = false;  

3. Analyze the page; find and cancel unnecessary network requests.

Even in a normal browser like Google Chrome the site loads slowly: 129 requests and an 8.79 s load time with Adblock Plus. There are a lot of requests (including a 1 Mb gif), and many of them go to third-party sites like Facebook and Twitter (to fetch widgets) and to ad networks.

We can cancel them too:

var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];

page.onResourceRequested = function(requestData, request) {
    // abort any request whose URL matches one of the blocked domains
    for (var i = 0; i < block_urls.length; i++) {
        if (requestData.url.indexOf(block_urls[i]) !== -1) {
            request.abort();
            console.log(requestData.url + " aborted");
            return;
        }
    }
};

The load time for me is now just 4393 ms, while the page still loads in a usable state, as a PhantomJS screenshot confirms.

I don't think much more can be done without tinkering with the page's own code, because judging by the page source it is quite script-heavy.

The whole code:

var page = require('webpage').create();
var fs = require("fs");

// console.time polyfill from https://github.com/callmehiphop/console-time
;(function( console ) {
  var timers;
  if ( !console ) {
    return;
  }
  timers = {};
  console.time = function( name ) {
    if ( name ) {
      timers[ name ] = Date.now();
    }
  };
  console.timeEnd = function( name ) {
    if ( timers[ name ] ) {
      console.log( name + ': ' + (Date.now() - timers[ name ]) + 'ms' );
      delete timers[ name ];
    }
  };
}( window.console ));

console.time("open");

page.settings.loadImages = false;
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';
page.viewportSize = {
  width: 1280,
  height: 800
};

var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
page.onResourceRequested = function(requestData, request) {
    for (var i = 0; i < block_urls.length; i++) {
        if (requestData.url.indexOf(block_urls[i]) !== -1) {
            request.abort();
            console.log(requestData.url + " aborted");
            return;
        }
    }
};

page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
    fs.write("longload.html", page.content, 'w');

    console.timeEnd("open");

    setTimeout(function(){
        page.render('longload.png');
        phantom.exit();
    }, 3000); // give the remaining scripts 3 seconds to settle before rendering

});
Answered by danjah

Thanks to @CommonsWare, I was able to achieve what I was trying to do by using an InputStream and IOUtils to read everything into a List.

try {
    // read the asset line by line (IOUtils is from Apache Commons IO)
    InputStream iS = this.getAssets().open("passwords.txt");
    List<String> user_password = IOUtils.readLines(iS);

    // turn each "user,password" line into a Credentials object
    // (CollectionUtils/Transformer/Predicate are from Apache Commons Collections)
    @SuppressWarnings("unchecked")
    List<Credentials> credentials = (List<Credentials>) CollectionUtils.collect(user_password, new Transformer() {
        @Override
        public Object transform(Object input) {
            String cred = (String) input;
            String[] parsed = cred.split(",");
            return new Credentials(parsed[0], parsed[1]);
        }
    });

    // find the entry matching the user name we're looking for
    user = (Credentials) CollectionUtils.find(credentials, new Predicate() {
        @Override
        public boolean evaluate(Object object) {
            Credentials c = (Credentials) object;
            return c.getUserName().equals(userName);
        }
    });
} catch (IOException e) {
    System.out.print(e);
}
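
For reference, here is a minimal sketch of the Credentials class the snippet above assumes; the actual class may differ:

public class Credentials {
    private final String userName;
    private final String password;

    public Credentials(String userName, String password) {
        this.userName = userName;
        this.password = password;
    }

    public String getUserName() {
        return userName;
    }

    public String getPassword() {
        return password;
    }
}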
Answered by Xavio

It depends on the server's certificate.

  • If it is a publicly trusted certificate, you can load a CA certs file into the SSL_CTX.

code:

SSL_CTX *ctx = SSL_CTX_new(SSLv23_client_method());
// Load the trusted CA certs into the SSL_CTX
SSL_CTX_load_verify_locations(ctx, cafile, NULL); // cafile: path to a CA PEM certs file

You can download the public CA certs file from the cURL website: CA Certs from mozilla.org

  • If it is a private certificate and you have the certificate file, you can use SSL_CTX_use_certificate_file instead of SSL_CTX_load_verify_locations, as sketched below. (Note: SSL_CTX_use_certificate_file loads the certificate your own side presents to the peer; if the goal is to trust a self-signed server certificate, passing that certificate file to SSL_CTX_load_verify_locations as the CA file also works.)
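
A rough sketch of that call, assuming certfile is the path to a PEM-encoded certificate file:

SSL_CTX_use_certificate_file(ctx, certfile, SSL_FILETYPE_PEM);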
Answered by Taha