
I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:

  • starts with a product_list page with 10 products
  • a click on the "next" button loads the next 10 products (the URL doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page, and get all the information I need

I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where shall I put the Selenium part in my Scrapy spider?

My spider is pretty standard, like the following:

class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'),
        ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows

Any idea is appreciated. Thank you!

Answers


It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # the locator raises NoSuchElementException on the last page,
                # which is what ends the loop
                next_page = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_page.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                break

        self.driver.close()
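
A minimal sketch of the "get the data" step above, assuming you want to reuse Scrapy's selectors on the Selenium-rendered HTML (the CSS selectors and item fields are hypothetical placeholders, not eBay's real markup):

from scrapy.selector import Selector

def extract_items(driver):
    # parse the browser's current DOM with a Scrapy selector
    sel = Selector(text=driver.page_source)
    for product in sel.css('li.product'):
        yield {
            'title': product.css('h3::text').get(),
            'price': product.css('span.price::text').get(),
        }

You would call extract_items(self.driver) inside the while loop and yield each dict as a Scrapy item.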

Here are some examples of "selenium spiders":

  • Executing Javascript Submit form functions using scrapy in python
  • https://gist.github.com/cheekybastard/4944914
  • https://gist.github.com/irfani/1045108
  • http://snipplr.com/view/66998/

There is also an alternative to having to use Selenium with Scrapy. In some cases, the ScrapyJS middleware (now maintained as scrapy-splash) is enough to handle the dynamic parts of a page; a minimal setup sketch follows the link below. Sample real-world usage:

  • Scraping dynamic content using python-Scrapy
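
If you go the ScrapyJS/scrapy-splash route, a minimal setup sketch (assuming a Splash instance is listening on localhost:8050; the spider name and URL are placeholders):

# settings.py (abridged; scrapy-splash also wants SPIDER_MIDDLEWARES
# and DUPEFILTER_CLASS entries, see its README)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# spider.py
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        # Splash renders the page (scripts included) before Scrapy sees it
        yield SplashRequest('http://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        # response here contains the rendered HTML
        self.log("page title: %s" % response.css('title::text').get())
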
answered by muffe

If anyone actually knew a general and always-applicable answer, it would have been implemented everywhere ages ago and would make our lives SO much easier.

There are many things you can do, but every single one of them has a problem:

  1. As Ashwin Prabhu said, if you know the script well, you can observe its behaviour and track some of its variables on window or document etc. This solution, however, is not for everyone and can be used only by you and only on a limited set of pages.

  2. Your solution of observing the HTML code and checking whether it has changed for some time is not bad (also, there is a method to get the original, unmodified HTML directly from WebDriver; a rough sketch of this polling approach follows the list), but:

    • It takes a long time to actually assert a page and could prolong the test significantly.
    • You never know what the right interval is. The script might be downloading something big that takes more than 500 ms. There are several scripts on our company's internal page that take several seconds in IE. Your computer may be temporarily short on resources - say an antivirus is making your CPU work at full capacity - and then 500 ms may be too short even for noncomplex scripts.
    • Some scripts are never done. They call themselves with some delay (setTimeout()) and work again and again and could possibly change the HTML every time they run. Seriously, every "Web 2.0" page does it. Even Stack Overflow. You could overwrite the most common methods used and consider the scripts that use them as completed, but ... you can't be sure.
    • What if the script does something other than changing the HTML? It could do thousands of things, not just some innerHTML fun.
  3. There are tools to help you with this, namely Progress Listeners together with nsIWebProgressListener and some others. Browser support, however, is horrible: Firefox began trying to support it from FF4 onwards (still evolving), and IE has basic support in IE9.
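
A rough sketch of the polling approach from point 2 (the interval and timeout values are arbitrary, and all of the caveats above still apply):

import time

def wait_for_stable_html(driver, interval=0.5, timeout=10):
    # return True once page_source survives one full interval unchanged,
    # False if the timeout expires first
    deadline = time.time() + timeout
    snapshot = driver.page_source
    while time.time() < deadline:
        time.sleep(interval)
        current = driver.page_source
        if current == snapshot:
            return True
        snapshot = current
    return False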

And I guess I could come up with another flawed solution soon. The fact is - there's no definite answer on when to say "now the page is complete" because of the everlasting scripts doing their work. Pick the one that serves you best, but beware of its shortcomings.

answered by aslum

If you intend to use Selenium in a Grid configuration through a Hub and Node setup, I would suggest using the most recent selenium-server-standalone-3.6.0 jar as follows:

  1. Start the Selenium Grid Hub (by default on port 4444):

    java -jar selenium-server-standalone-3.6.0.jar -role hub
    
  2. Confirm the Selenium Grid Hub is started:

    16:06:29.891 INFO - Nodes should register to http://192.168.1.48:4444/grid/register/
    16:06:29.891 INFO - Selenium Grid hub is up and running
    
  3. Access the Selenium Grid Hub Console and ensure Selenium Grid Hub is up and running:

    http://localhost:4444/grid/console
    
  4. Start the Selenium Grid Node (by default on port 5555) for Mozilla/GeckoDriver:

    java -Dwebdriver.gecko.driver=geckodriver.exe -jar selenium-server-standalone-3.6.0.jar -role node -hub http://localhost:4444/grid/register
    
  5. Confirm the Selenium Grid Node is registered and started:

    16:15:54.696 INFO - Selenium Grid node is up and ready to register to the hub
    16:15:54.742 INFO - Starting auto registration thread. Will try to register every 5000 ms.
    16:15:54.742 INFO - Registering the node to the hub: http://localhost:4444/grid/register
    16:15:54.975 INFO - The node is registered to the hub and ready to use
    
  6. Execute the Testcase with DesiredCapabilities as follows (a complete minimal client is sketched after this list):

    caps = webdriver.DesiredCapabilities.FIREFOX.copy()  # matches the GeckoDriver node above
    self.driver = webdriver.Remote(command_executor='http://127.0.0.1:4444/wd/hub', desired_capabilities=caps)
    
  7. Observe the console logs ending with the following on successful execution of your Testcase:

    16:23:50.590 INFO - Found handler: org.openqa.selenium.remote.server.ServicedSession@37ff9771
    16:23:50.590 INFO - Handler thread for session 31a1dcb0-8bed-40fb-acdb-d5be19f03ba2 (firefox): Executing DELETE on /session/31a1dcb0-8bed-40fb-acdb-d5be19f03ba2
     (handler: ServicedSession)
    1506941630595   Marionette      INFO    New connections will no longer be accepted
    
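For reference, a complete minimal client for the grid above might look like this (a sketch only; the URL under test is a placeholder):

from selenium import webdriver

caps = webdriver.DesiredCapabilities.FIREFOX.copy()
driver = webdriver.Remote(command_executor='http://127.0.0.1:4444/wd/hub',
                          desired_capabilities=caps)
try:
    # the hub forwards this session to the registered GeckoDriver node
    driver.get('http://example.com')
    print(driver.title)
finally:
    driver.quit()
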
answered by penpen

To scrape data from the ScienceDirect website https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues you can perform the following steps:

  • First open all the accordions.

  • Then open each issue in an adjacent tab using Ctrl + click.

  • Next switch to the newly opened tab and scrape the required contents.

  • Code Block:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.common.action_chains import ActionChains
      from selenium.webdriver.common.keys import Keys

      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get('https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues')
      # open all the accordion panels so every issue link is visible
      accordions = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.accordion-panel.js-accordion-panel>button.accordion-panel-title>span")))
      for accordion in accordions:
          ActionChains(driver).move_to_element(accordion).click(accordion).perform()
      issues = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.anchor.js-issue-item-link.text-m span.anchor-text")))
      windows_before = driver.current_window_handle
      for issue in issues:
          # Ctrl+click opens the issue in a new tab
          ActionChains(driver).key_down(Keys.CONTROL).click(issue).key_up(Keys.CONTROL).perform()
          WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
          windows_after = driver.window_handles
          new_window = [x for x in windows_after if x != windows_before][0]
          driver.switch_to.window(new_window)
          WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a#journal-title>span")))
          print(WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//h2"))).get_attribute("innerHTML"))
          driver.close()
          driver.switch_to.window(windows_before)
      driver.quit()
    
  • Console Output:

      Institutions, Governance and Finance in a Globally Connected Environment
      Volume 58
      Corporate Governance in Multinational Enterprises
      .
      .
      .
    

References

You can find a couple of relevant detailed discussions in:

  • How to open a link embeded in a webelement with in the main tab, in a new tab of the same window using Control + Click of Selenium Webdriver
  • How to open multiple hrefs within a webtable to scrape through selenium
  • WebScraping JavaScript-Rendered Content using Selenium in Python
  • StaleElementReferenceException even after adding the wait while collecting the data from the wikipedia using web-scraping
  • How to open each product within a website in a new tab for scraping using Selenium through Python
answered by akosch

The reason for this behavior is how the PhantomJS driver's Service class is implemented.

There is a __del__ method defined that calls the self.stop() method:

def __del__(self):
    # subprocess.Popen doesn't send signal on __del__;
    # we have to try to stop the launched process.
    self.stop()

And self.stop() assumes the service instance is still alive and tries to access its attributes:

def stop(self):
    """
    Cleans up the process
    """
    if self._log:
        self._log.close()
        self._log = None
    #If its dead dont worry
    if self.process is None:
        return

    ...

The exact same problem is described in this thread:

  • Python attributeError on __del__

What you can do is silently ignore the AttributeError raised while quitting the driver instance:

try:
    driver.quit()
except AttributeError:
    pass

The problem was introduced by this revision, which means that downgrading to 2.40.0 would also help.
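
If you want to package that workaround, one option is a small context manager that always attempts quit() and swallows the teardown error (a sketch, assuming a Selenium version that still ships webdriver.PhantomJS):

from contextlib import contextmanager
from selenium import webdriver

@contextmanager
def phantomjs(*args, **kwargs):
    driver = webdriver.PhantomJS(*args, **kwargs)
    try:
        yield driver
    finally:
        try:
            driver.quit()
        except AttributeError:
            # Service.stop() racing with __del__ on teardown; see above
            pass

# usage:
# with phantomjs() as driver:
#     driver.get('http://example.com')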

answered by Keat