
I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Answers


Google will eventually block your IP once you exceed a certain number of requests.
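There is no published threshold, so treat any concrete numbers as guesswork. Purely as an illustration, here is a minimal Python sketch of a throttled fetcher that spaces requests out and backs off on HTTP 429; the delay values are assumptions, not documented limits, and note that scraping Google's result pages is against their Terms of Service (the supported route is the Custom Search JSON API):

import time
import requests

# Hypothetical header values -- identify yourself honestly if you do this at all.
HEADERS = {"User-Agent": "duplicate-content-checker/0.1 (contact: you@example.com)"}

def fetch_results(queries, delay_seconds=30):
    """Fetch result pages one query at a time, pausing between requests."""
    results = {}
    for q in queries:
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": q},
            headers=HEADERS,
            timeout=10,
        )
        if resp.status_code == 429:
            # Already being throttled: back off hard before continuing.
            time.sleep(10 * delay_seconds)
            continue
        results[q] = resp.text
        time.sleep(delay_seconds)  # assumed "polite" delay, not a documented limit
    return results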

revive, answered Tuesday, June 1, 2021

BigTable, which is the database back end for App Engine, will scale to millions of records. Because of this, App Engine will not allow any query that would result in a table scan, as performance would be dreadful on a well-populated table.

In other words, every query must use an index. This is why you can only do =, > and < queries. (In fact you can also do !=, but the API implements this as a combination of > and < queries.) This is also why the development environment monitors all the queries you run and automatically adds any missing indexes to your index.yaml file.

There is no way to build an index for a LIKE query, so it's simply not available.
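You can, however, emulate a prefix search (LIKE 'foo%') with the inequality operators alone, because the index is sorted. A minimal sketch using the legacy Python db API, with a hypothetical Person model; full substring matching is still not possible:

from google.appengine.ext import db

class Person(db.Model):
    name = db.StringProperty()

def names_starting_with(prefix, limit=20):
    # Everything >= prefix and < prefix + u'\ufffd' falls inside the
    # 'prefix...' range of the sorted index, so no table scan is needed.
    q = Person.all()
    q.filter('name >=', prefix)
    q.filter('name <', prefix + u'\ufffd')
    return q.fetch(limit)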

Have a watch of this Google I/O session for a much better and more detailed explanation.

Fanda, answered Monday, June 7, 2021

You pretty much have to do this:

google.setOnLoadCallback(function() {
  $(function() {
    // init my stuff
  });
});

You can't call $(document).ready() until $ (the jQuery object) is available, so that call needs to go inside the callback. And you can't be sure the document is ready inside the callback, so you have to use ready() too.

Gordnfreeman, answered Tuesday, June 22, 2021

Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')  # every element nested inside the parent

Then sort them using their location:

import numpy as np

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
index = np.lexsort((x, y))
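If it's unclear why this yields reading order: np.lexsort treats its last key as the primary sort key, so elements are ordered top-to-bottom by y, then left-to-right by x within each row. A tiny self-contained illustration:

import numpy as np

x = [10, 0, 5]   # horizontal positions
y = [0, 1, 0]    # vertical positions (two elements share the top row)
# Last key (y) is primary: top row first, ties broken by x.
print(np.lexsort((x, y)))  # -> [2 0 1]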

To group the sorted elements into rows (one list of titles per visual line), this code may help:

rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line == y[i]:
        # Same vertical position: still on the same visual row.
        row.append(children[i].get_attribute('title'))
    else:
        # New vertical position: store the finished row and start a new one.
        rows.append(row)
        row = [children[i].get_attribute('title')]
        last_line = y[i]
rows.append(row)
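If you want to inspect the result, a quick follow-up sketch (pandas is an extra dependency here, not part of the original answer):

import pandas as pd

df = pd.DataFrame(rows)  # one DataFrame row per visual line of the table
print(df.head())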
Guesser, answered Friday, June 25, 2021
library(XML)
library(relenium)

## download the website
website <- firefoxClass$new()
website$get("http://www.machinerytrader.com/list/list.aspx?pg=1&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
doc <- htmlParse(website$getPageSource())

## read the tables and bind the information
tables <- readHTMLTable(doc, stringsAsFactors=FALSE)
data <- do.call("rbind", tables[seq(from=8, to=56, by=2)])
data <- cbind(data, sapply(lapply(tables[seq(from=9, to=57, by=2)], '[[', i=2), '[', 1))
rownames(data) <- NULL
names(data) <- c("year.man.model", "s.n", "price", "location", "auction")

This will give you what you want for the first page (showing just the first two lines here):

head(data,2)
      year.man.model      s.n      price location                                               auction
1 1972 AMERICAN 5530 GS14745W US $50,100       MI                   Auction: 1/9/2013; 4,796 Hours;  ..
2 AUSTIN-WESTERN 307      307  US $3,400       MT Auction: 12/18/2013;  AUSTIN-WESTERN track excavator.

To get all pages, just loop over the page numbers, pasting pg=i into the address (e.g. with paste0) before each website$get() call.

Tom Lilletveit, answered Thursday, September 23, 2021