Trying to automatically grab the search results from a public search, but running into some trouble. The URL is of the form
http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting
As I click through the pages, after visiting this page, it changes slightly to
http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2
The problem is that if I then try to visit the second link directly, without first visiting the first link, I am redirected back to the first link. My current attempt is to define a long list of start_urls in Scrapy:
class websiteSpider(BaseSpider):
    name = "website"
    allowed_domains = ["website.com"]

    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    start_urls = [(baseUrl + str(i)) for i in range(1, 1000)]
Currently this code simply ends up visiting the first page over and over again. I feel like this is probably straightforward, but I don't quite know how to get around this.
UPDATE:
Made some progress investigating this and found that the site updates each page by sending a POST request to the previous page using __doPostBack(arg1, arg2). My question now is how exactly do I mimic this POST request using scrapy. I know how to make a POST request, but not exactly how to pass it the arguments I want.
SECOND UPDATE:
I've been making a lot of progress! I think... I looked through examples and documentation and eventually slapped together this version of what I think should do the trick:
def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        argument = str(i + 1)
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
        currentPage = FormRequest(url, data)
        requests.append(currentPage)
    return requests
The idea is that this treats the POST request just like a form and updates accordingly. However, when I actually try to run this I get the following traceback(s) (Condensed for brevity):
2013-03-22 04:03:03-0400 [guru] ERROR: Unhandled error on engine.crawl()
dfd.addCallbacks(request.callback or spider.parse, request.errback)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 280, in addCallbacks
assert callable(callback)
exceptions.AssertionError:
2013-03-22 04:03:03-0400 [-] ERROR: Unhandled error in Deferred:
2013-03-22 04:03:03-0400 [-] Unhandled Error
Traceback (most recent call last):
Failure: scrapy.exceptions.IgnoreRequest: Skipped (request already seen)
Changing question to be more directed at what this post has turned into.
Thoughts?
P.S. When the second error happens, Scrapy is unable to cleanly shut down and I have to send SIGINT twice to get things to actually wrap up.
FormRequest doesn't take formdata as a positional argument in its constructor:
class FormRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
so you actually have to say formdata=:
requests.append(FormRequest(url, formdata=data))
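Putting that together, a minimal sketch of a corrected start_requests could look like the following. It is based on the snippet above; callback=self.parse and dont_filter=True are my assumptions, the latter to work around the "Skipped (request already seen)" dupefilter message:

from scrapy.http import FormRequest

def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': str(i + 1)}
        # formdata must be a keyword argument; passed positionally, the dict is
        # treated as the callback and triggers "assert callable(callback)".
        # dont_filter=True is an assumption to stop the dupefilter from
        # dropping requests it considers already seen.
        requests.append(FormRequest(url, formdata=data,
                                    callback=self.parse, dont_filter=True))
    return requests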
Related
I'm almost there with my first try at using Scrapy and Selenium to collect data from a website with JavaScript-loaded content.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.webdriver.common.by import By
import time


class FreePlayersSpider(scrapy.Spider):
    name = 'free_players'
    allowed_domains = ['www.forge-db.com']
    start_urls = ['https://www.forge-db.com/fr/fr11/players/?server=fr11']

    driver = {}

    def __init__(self):
        self.driver = webdriver.Chrome('/home/alain/Documents/repository/web/foe-python/chromedriver')
        self.driver.get('https://forge-db.com/fr/fr11/players/?server=fr11')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # time.sleep(1)
        sel = Selector(text=self.driver.page_source)
        players = sel.xpath('.//table/tbody/tr')
        for player in players:
            joueur = player.xpath('.//td[3]/a/text()').get()
            guilde = player.xpath('.//td[4]/a/text()').get()
            yield {
                'player': joueur,
                'guild': guilde
            }

        next_page_btn = self.driver.find_element_by_xpath('//a[@class="paginate_button next"]')
        if next_page_btn:
            time.sleep(2)
            next_page_btn.click()
            yield scrapy.Request(url=self.start_urls, callback=self.parse)

        # Close the selenium driver, so in fact it closes the testing browser
        self.driver.quit()

    def parse_players(self):
        pass
I want to collect user names and their guilds and output them to a CSV file.
For now my issue is how to proceed to the NEXT PAGE and parse the JavaScript-loaded content again.
Even if I'm able to simulate a click on the NEXT tag, I'm not 100% sure the code will go through all the pages, and I'm not able to parse the new content using the same function.
Any idea how I could solve this issue?
Thanks.
Instead of using Selenium, you should try to recreate the request that updates the table. If you look closely at the HTML under Chrome DevTools, you can see that the request is made with parameters and a response is sent back with the data in a nice structured format.
Please see here with regard to dynamic content in Scrapy. As it explains, the first step is to ask: is it necessary to recreate browser activity, or can I get the information I need by reverse engineering the HTTP requests? Sometimes the information is hidden within <script></script> tags and you can use a regex or some string methods to get what you want. Rendering the page and recreating browser activity should be thought of as a last resort.
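As a quick illustration of the <script> case, a rough sketch could look like this; the URL, the players variable name and the regex are made up for the example, not taken from this site:

import json
import re

import scrapy

class ScriptDataSpider(scrapy.Spider):
    # Hypothetical spider, purely to illustrate pulling JSON out of a <script> tag.
    name = 'script_data'
    start_urls = ['https://example.com/page-with-inline-json']

    def parse(self, response):
        # Collect the inline script text and look for something like
        # "var players = {...};" assigned inside it.
        script_text = ' '.join(response.xpath('//script/text()').getall())
        match = re.search(r'var\s+players\s*=\s*(\{.*?\});', script_text, re.S)
        if match:
            yield json.loads(match.group(1))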
Before I go into some background on reverse engineering the requests: the website you're trying to get information from only requires reverse engineering the HTTP requests.
Reverse Engineering HTTP requests in Scrapy
Now, for the website itself, we can use Chrome DevTools by right-clicking a page and choosing Inspect. Clicking the Network tab lets you see all the requests the browser makes to render the page. In this case you want to see what happens when you click Next.
Image1: here
Here you can see all the requests made when you click Next on the page. I always look for the biggest response, as that will most likely have your data.
Image2: here
Here you can see the request headers, params, etc.: the things you need to make a proper HTTP request. We can see that the request URL is actually getplayers.php with all the params needed to get the next page added on. If you scroll down you can see the parameters it sends to getplayers.php. Keep this in mind: sometimes we need to send headers, cookies and parameters.
Image3: here
Here is a preview of the data we would get back from the server if we make the correct request; it's in a nice, neat format which is great for scraping.
Now, you could copy the headers, parameters and cookies here into Scrapy, but it's always worth checking this first: if simply making an HTTP request to the URL gets you the data you want, then that is the simplest way.
In this case it's true, and in fact you get a nice, neat format with all the data.
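If the plain URL had not been enough, a rough sketch of copying headers, cookies and parameters into the request could look like this; the header and cookie values below are placeholders, not the real ones from DevTools:

import scrapy

class PlayersWithHeadersSpider(scrapy.Spider):
    # Hypothetical spider; every header/cookie value here is a placeholder.
    name = 'players_with_headers'

    def start_requests(self):
        # Any parameters copied from DevTools would be appended to the query string.
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        headers = {
            'X-Requested-With': 'XMLHttpRequest',  # placeholder header
            'Referer': 'https://www.forge-db.com/fr/fr11/players/?server=fr11',
        }
        cookies = {'PHPSESSID': 'copied-from-devtools'}  # placeholder cookie
        yield scrapy.Request(url=url, headers=headers, cookies=cookies,
                             callback=self.parse)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}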
Code example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        yield scrapy.Request(url=url)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}
Settings
In settings.py, you need to set ROBOTSTXT_OBEY = False. The site doesn't want you to access this data, so we need to set it to False. Be careful: you could end up getting banned from the server.
I would also suggest a couple of other settings to be respectful and to cache the results, so that if you want to play around with this large dataset you don't hammer the server.
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
Comments on the code
We make a request to https://www.forge-db.com/fr/fr11/getPlayers.php?, and if you were to print the response you would get all the data from the table; it's quite a lot... It looks like it's in JSON format, so we use Scrapy's newer response.json() feature to convert it into a Python dictionary. Be sure that you have an up-to-date Scrapy to take advantage of this. Otherwise you could use the json library that Python provides to do the same thing.
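For that fallback, a minimal sketch using the standard json module instead of response.json() (assuming an older Scrapy where response.json() is not available):

import json

def parse(self, response):
    # Equivalent of response.json() on older Scrapy versions.
    data = json.loads(response.text)
    for row in data['data']:
        yield {'name': row[2], 'guild': row[3]}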
Now you have to look at the preview data a bit here, but the individual rows are within response.json()['data'][i], where i is the row index. The name and guild are within response.json()['data'][i][2] and response.json()['data'][i][3]. So we loop over every row in response.json()['data'] and grab the name and guild.
If the data weren't as structured as it is here and it needed modifying, I would strongly urge you to use Items or ItemLoaders to create the fields that you then output. You can modify the extracted data more easily with ItemLoaders, and you can deal with duplicate items etc. using a pipeline. These are just some thoughts for the future; I almost never yield a plain dictionary when extracting data, particularly for large datasets.
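As a rough sketch of what that could look like for this data (the item, loader and field names are made up for the example, and it assumes a reasonably recent Scrapy where the itemloaders package is available):

import scrapy
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader

class PlayerItem(scrapy.Item):
    # Hypothetical item holding one name/guild pair.
    name = scrapy.Field()
    guild = scrapy.Field()

class PlayerLoader(ItemLoader):
    default_item_class = PlayerItem
    # Keep only the first value added to each field.
    default_output_processor = TakeFirst()

def parse(self, response):
    for row in response.json()['data']:
        loader = PlayerLoader()
        loader.add_value('name', row[2])
        loader.add_value('guild', row[3])
        yield loader.load_item()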
I have a website built using the Django framework that takes in an input CSV file to do some data processing. I would like to use an HTML text box as a console log to let the users know that the data processing is underway. The data processing is done using a Python function. Is it possible to change/add text in the text box at certain intervals from my Python function?
Sorry if I am not specific enough with my question, I'm still learning how to use these tools!
Edit - Thanks for all the help, but I am still quite new at this and there are lots of things that I do not really understand. Here is an example of my Python function, not sure if it helps:
def query_result(request, job_id):
    info_dict = request.session['info_dict']
    machines = lt.trace_machine(inputFile.LOT.tolist())

    return render(request, 'tools/result.html', {'dict': json.dumps(info_dict),
                                                 'job_id': job_id})
Actually my main objective is to let the user know that the data processing has started and that the site is working. I was thinking maybe I could display an output log in an HTML textbox to achieve this purpose.
No, you cannot do that directly, because by that point you are on the server side and therefore cannot touch anything in the HTML page.
You have two ways to do it:
You can set up an interval function that calls the server, asks for the progress, and updates the page as you want in its callback (see the sketch after this list).
You can open a socket connection between your server and the browser to push updates instantly.
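For the first option, a minimal sketch of the Django side could look like this; the cache key, the do_the_actual_work helper and the progress view are all made up for the example, and the browser would simply poll the endpoint with setInterval and write the returned message into the textbox:

from django.core.cache import cache
from django.http import JsonResponse

def process_csv(csv_file, job_id):
    # Long-running processing function: record progress as it goes.
    rows = list(csv_file)
    for i, row in enumerate(rows, start=1):
        do_the_actual_work(row)  # hypothetical helper
        cache.set('progress:%s' % job_id, 'Processed row %d of %d' % (i, len(rows)))
    cache.set('progress:%s' % job_id, 'Done')

def progress(request, job_id):
    # Polled by the browser every few seconds; returns the latest message.
    message = cache.get('progress:%s' % job_id, 'Not started yet')
    return JsonResponse({'status': message})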
While it is impossible for the server (Django) to directly update the client (browser), you can use JavaScript to make the request, and Django can return a StreamingHttpResponse. As each part of the response is received, you can update the textbox using JavaScript.
Here is a sample with pseudocode:
def process_csv_request(request):
    csv_file = get_csv_file(request)
    return StreamingHttpResponse(process_file(csv_file))

def process_file(csv_file):
    for row in csv_file:
        yield progress
        actual_processing(row)
    return "Done"
Alternatively, you could write the progress to the DB or some cache, and repeatedly call an API from the frontend that returns the progress.
You can achieve this with websockets using Django Channels.
Here's a sample consumer:
import json

from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer

class Consumer(WebsocketConsumer):

    def connect(self):
        self.group_name = self.scope['user']
        print(self.group_name)  # use this for debugging; not sure what the scope returns

        # Join group
        async_to_sync(self.channel_layer.group_add)(
            self.group_name,
            self.channel_name
        )
        self.accept()

    def disconnect(self, close_code):
        # Leave group
        async_to_sync(self.channel_layer.group_discard)(
            self.group_name,
            self.channel_name
        )

    def update_html(self, event):
        status = event['status']

        # Send message to WebSocket
        self.send(text_data=json.dumps({
            'status': status
        }))
Running through the Channels 2.0 tutorial, you will learn that by putting some JavaScript on your page, each time the page loads it connects you to a websocket consumer. On connect(), the consumer adds the user to a group. This group name is used by your CSV-processing function to send a message to the browser of any user connected to that group (in this case just one user) and update the HTML on your page.
import csv

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

def send_update(channel_layer, group_name, message):
    async_to_sync(channel_layer.group_send)(
        group_name,
        {
            'type': 'update_html',
            'status': message
        }
    )

def process_csv(file):
    channel_layer = get_channel_layer()
    group_name = get_user_name()  # function to get the same group name as in connect()

    with open(file) as f:
        reader = csv.reader(f)
        send_update(channel_layer, group_name, 'Opened file')
        for row in reader:
            send_update(channel_layer, group_name, 'Processing Row#: %s' % row)
You would include JavaScript on your page as outlined in the Channels documentation, then have an extra onmessage function for updating the HTML:
var webSocket = new ReconnectingWebSocket(...);

webSocket.onmessage = function(e) {
    var data = JSON.parse(e.data);
    $('#htmlToReplace').html(data['status']);
}
So I've found like a gazillion StackOverflow questions and answers on all the subtopics in my title, linked them together, and got a 404.
I am using the MEAN stack to build a simple API searching app.
This is the echo of 'together.php' (the function is working just fine):
$variable1 = $_POST['JavaScriptButtonVariable1'];
$variable2 = $_POST['JavaScriptButtonVariable2'];
echo json_encode(inst_search(myfunction($variable1, $variable2)));
Here is the relevant part of my JSON object:
{"Target hashtag searched":"pizza","Additional keyword searched":"italia","Number of instagram submitters this session":20,"Total number of tags submitted":242,"Score of this hashtag\/keyword pair this session":6,"Date and time (YYYY\/MM\/DD HH:MM:SS)":"2015\/07\/23 09:34:55am","0":[{"Location":null,"Tags":["bandung","jakarta","pizzaitalia","pizza"]},{"Location":{"Location":"Catania","Area":"Catania","Region":"Provincia di Catania"},"Tags":["casa","famiglia","food","sicily","pizza
The keys interesting to me are 'Number of instagram submitters this session', 'Total number of tags submitted', 'Score of this hashtag/keyword pair this session' and 'Date and time (YYYY/MM/DD HH:MM:SS)', thus I am using [2], [3], [4] and [5] (visible in the code below). There is only one value assigned to each of those keys every time.
This is the part of the function in my 'global.js' file which calls the php file:
var parsedData = [];

var eins = $('#variable1').val();
var zwei = $('#variable2').val();

$.post('together.php', {variable1: eins, variable2: zwei}, function(omg) {
    var parsedData = JSON.parse(omg);
});

var php_no_sub = parsedData[2];
var php_tags_sub = parsedData[3];
var php_score = parsedData[4];
var php_datetime = parsedData[5];

var newTag = {
    'searchrecords': php_no_sub.val(),
    'tagsfound': php_tags_sub.val(),
    'datetime': php_score.val(),
    'score': php_datetime.val()
}
Now, I simplified the code to the parts which may cause the problem, including the 404 error while calling and possibly my extreme inability to correctly construct JS objects from arrays extracted from JSON.
When clicking on the link http://localhost... my together.php gets downloaded, so it is definitely in the right directory.
Failed to load resource: the server responded with a status of 404 (Not Found) http://localhost:3000/together.php
Questions: Why is my console reporting a 404? Is the code for the array and the object written correctly? EDIT: How do I allow POSTing to my PHP file by routing?
PS: Here's the full text of my console error:
POST http://localhost:3000/together.php 404 (Not Found)x.ajaxTransport.x.support.cors.e.crossDomain.send # jquery.min.js:6x.extend.ajax # jquery.min.js:6x.each.x.(anonymous function) # jquery.min.js:6addTagAutoTogether # global.js:190x.event.dispatch # jquery.min.js:5x.event.add.y.handle # jquery.min.js:5
Clicking the link creates a GET request, but your code creates a POST request. Your server's API apparently doesn't know about any POST route for that URL, so it returns a 404.
I don't know about the backend of your API, but perhaps you have to set up a (POST) route first.
I am trying to notify the user's browser of a change in the status of a model. I am trying to use the Live module of Rails for that. Here is what I have got so far:
require 'json'

class Admin::NotificationsController < ActionController::Base
  include ActionController::Live

  def index
    puts "sending message"
    videos = Video.all
    response.headers['Content-Type'] = 'text/event-stream'
    begin
      if params[:id].present?
        response.stream.write(sse({id: params[:id]}, {event: "video_encoded"}))
      end
    rescue IOError
    ensure
      response.stream.close
    end
  end

  private

  def sse(object, options = {})
    (options.map { |k, v| "#{k}: #{v}" } << "data: #{JSON.dump object}").join("\n") + "\n\n"
  end
end
The idea behind the above controller is that when its URL gets called with a parameter, it sends this parameter (in this case the id) to the user. Here is how I am trying to call the controller:
notifications_path(id: video.id)
Unfortunately though, the following event-listener in the browser does not fire, even if I use curl to provoke an event:
var source = new EventSource('/notifications');

source.addEventListener("video_encoded", function(event) {
    console.log("message")
    console.log(event)
});
The goal is that I want to add a DOM element to a certain page (later on) if there is a change. There may be a better way, but Rails Live (ActionController::Live) seemed like a suitable solution. Any tips or proposals for a different approach are appreciated.
Your use case does not seem like a valid use case for ActionController::Live. You are not sending streaming output to the browser; you do a one-time check on the ID and send the JSON output.
Use a regular controller and make the request via AJAX instead of EventSource.
I'm using jQuery with Django on the server side. What I'm trying to do is get some text from the user through a form while simultaneously displaying the text in a canvas area, like about.me and flavors.me do. The user then drags the text in the canvas area to the desired position, and when they click the next button, the data must be stored in the database and the user redirected to the homepage. Everything is working perfectly (the data is stored in the database) except the redirect: I set window.location to "http://127.0.0.1:8000" when the button is clicked, but I'm not getting to that page.
I'm getting some errors in the Django server:
error: [Errno 32] Broken pipe
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 51161)
Traceback (most recent call last):
File "/usr/lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
Here is my html:
https://gist.github.com/2359541
Django views.py:
from cover.models import CoverModel
from django.http import HttpResponseRedirect

def coverview(request):
    if request.is_ajax():
        t = request.POST.get('top')
        l = request.POST.get('left')
        n = request.POST.get('name')
        h = request.POST.get('headline')
        try:
            g = CoverModel.objects.get(user=request.user)
        except CoverModel.DoesNotExist:
            co = CoverModel(top=t, left=l, name=n, headline=h)
            co.user = request.user
            co.save()
        else:
            g.top = t
            g.left = l
            g.name = n
            g.headline = h
            g.save()
    return HttpResponseRedirect("/")
urls.py:
url(r'^cover/check/$', 'cover.views.coverview'),
url(r'^cover/$', login_required(direct_to_template), {'template': 'cover.html'}),
Could anyone help me?
Thanks!
There's really not enough information in your question to properly diagnose this, but you can try this:
It's always a bad idea to hard-code a domain name in your JS. What happens when you take this to production, for example? If you want to send the user to the homepage (presumed from the location being set to http://127.0.0.1:8000/), then set the location simply to /. That will ensure that it will always go to the site root regardless of the IP address, domain name or port.
Part of the problem is that you're trying to post data and then immediately leaving the page via window.location. You should only change window.location when you get the response back from the $.post().
$.post("check/", { top: t, left: l, name: n, headline: h}, function(data) {
window.location.href = "/";
});
Notice also that I removed the hardcoded URL. Use a relative one here, like Chris said.
If it still isn't working, you need to check for Javascript errors in the lines above. Use Firebug, Chrome Dev Tools, Opera Dragonfly, something. Check to make sure your POST is actually going through, and post more data about that back here.