I'm relatively new to CasperJS, have wrote simple scraping scripts, and now I'm in a kind of more difficult task: I want to scrape some sort of data from a list of urls, but some pages some times "fail", I've a captcha solving service because a few of this pages have captcha by default, but phantomjs is rather inconsistent in rendering some captchas, sometimes they load, sometimes they don't.
The solution I thought was to rerun the script with the pages that failed to load the captcha in order to get the amount of data I need. But I don't seem to get it running, I thought of creating a function with the whole thing and then inside the casper.run() method invoke it and check if the amount of data scraped fulfills the minimum I need if not rerun, But I don't really know how to accomplish it, as for what I've seen casperjs adds the steps to the stack before calling the function (correct me if I'm wrong). Also I'm thinking of something using the run.complete event but not so sure how to do it. My script is something like this:
// This variable stores the amount of data collected
pCount = 0;
urls = ["http://page1.com","http://page2.com"];
// Create casperjs instance...
casper.start();
casper.eachThen(urls, function(response) {
if (pCount < casper.cli.options.number) {
casper.thenOpen(response.data, function(response) {
// Here is where the magic goes on
})
}
})
casper.run();
Is there anyway I can wrap the casper.eachThen() block in a function and do something like this?
casper.start();
function sample () {
casper.eachThen(urls, function(response) {
if (pCount < casper.cli.options.number) {
casper.thenOpen(response.data, function(response) {
// Here is where the magic goes on
})
}
})
}
casper.run(sample);
Also, I tried using slimerjs as engine to avoid the "inconsistencies", but I couldn't manage to get working the __utils__.sendAjax() method inside a casper.evaluate() I have, so it's a deal-breaker. Or is there a way to do a GET request asynchronously in a separate instance? if so, I would appreciate your advise
Update: I never managed to solve it with casperjs, I nonetheless found a workaround for my particular use case, check my answer for more info
Maybe with the back function, so something like that :
casper.start()
.thenOpen('your url');
.then(function(){
var count = 0;
if (this.exists("selector contening the captcha")){
//continue the script
}
else if (count==3){
this.echo("in 3 attempts, it failed each time");
this.exit();
}
else{
count++;
casper.back();//back to the previous step, so will re-open the url
}
.run();
I never found a way to do this from casper, this is how I solved it:
There's a program A, that manages user input (in my case written in C#). This program A is the one that executes the casperjs script, and read it's output. If I need to rerun the script, I just output a message with some specifications so that I catch it in the program A.
It may not be the best way, but it worked for me. Hope it helps
Related
I'm writing E2E tests in Cypress (version 12.3.0). I have a page with a table in a multi-step creation process that requires some data from back-end application. In some cases (rarely, but it occurs) the request gets stuck and the loader never disappears. The solution is simple: go back to the previous step and return to the "stuck" table. The request is sent anew and most likely receives a response - the process and the tests can proceed further. If the loader is not present, then going back and forth should be skipped (most of the times).
I managed to work around that with the code below, but I'm wondering if it could be done with some built-in Cypress functions and without explicit waiting. Unfortunately, I didn't find anything in the Cypress docs and on StackOverflow. I thought that maybe I could use the then function to work on a "conditionally present" element, but it fails on get, that's why I've used find on the jQuery ancestor element.
waitForTableData() {
return cy.get('.data-table')
.should('exist')
.then(table => {
if (this.loaderNotPresent(table)) {
return;
}
cy.wait(200)
.then(() => {
if (this.loaderNotPresent(table)) {
return;
}
cy.get('button')
.contains('Back')
.click()
.get('button')
.contains('Next')
.click()
.then(() => this.waitForTableData());
});
});
}
loaderNotPresent(table: JQuery) {
return !table.find('.loader')?.length;
}
Your code looks to me to be the best you could do at present.
The cy.wait(200) is about the right size, maybe a bit smaller would be better - 50 - 100 ms. The recursive call is going to give you similar behaviour to Cypress retry (which also waits internally, in order not to hammer the test runner).
Another approach would be to cy.intercept() and mock the backend, presuming it's the backend that gets stuck.
Also worth trying a simple test retry, if the loading only fails on a small percentage of times.
tl;dr: When I run my test case, steps executed seem to work, but the test bails out early on a failure to find an element that hasn't even loaded yet. It seems like the waits I have around locating certain elements are loaded/launched as soon as the test is launched, not when the lines should actually be executed in the test case. I think this is happening because the page is barely (correctly) loaded before the "search" for the element to verify the page has loaded bails out. How do I wrangle the event loop?
This is probably a promise question, which is fine, but I don't understand what's going on. How do I implement my below code to work as expected? I'm working on creating automated E2E test cases using Jasmine2 and Protractor 5.3.0 in an Angular2 web app.
describe('hardware sets', () => {
it('TC3780:My_Test', async function() {
const testLogger = new CustomLogger('TC3780');
const PROJECT_ID = '65';
// Test Setup
browser.waitForAngularEnabled(false); // due to nature of angular project, the app never leaves zones, leaving a macrotask constantly running, thus protractor's niceness with angular is not working on our web app
// Navigate via URL to planviewer page for PROJECT_ID
await planListingPage.navigateTo(PROJECT_ID); // go to listing page for particular project
await planListingPage.clickIntoFirstRowPlans(); // go to first plan on listing page
await planViewerPage.clickOnSetItem('100'); // click on item id 100 in the plan
});
});
planViewerPage.po.ts function:
clickOnSetItem(id: string) {
element(by.id(id)).click();
browser.wait(until.visibilityOf(element(by.css('app-side-bar .card .info-content'))), 30000); // verify element I want to verify is present and visible
return expect(element(by.css('app-side-bar .card .info-content')).getText).toEqual(id); //Verify values match, This line specifically is failing.
}
This is the test case so far. I need more verification, but it is mostly done. I switched to using async function and awaits instead of the typical (done) and '.then(()=>{' statement chaining because I prefer not having to do a bunch of nesting to get things to execute in the right order. I come from a java background, so this insanity of having to force things to run in the order you write them is a bit much for me sometimes. I've been pointed to information like Mozilla's on event loop, but this line just confuses me more:
whenever a function runs, it cannot be pre-empted and will run entirely before any other code
runs (and can modify data the function manipulates).
Thus, why does it seem like test case is pre-evaluated and the timer's set off before any of the pages have been clicked on/loaded? I've implemented the solution here: tell Protractor to wait for the page before executing expect pretty much verbatim and it still doesn't wait.
Bonus question: Is there a way to output the event-loop's expected event execution and timestamps? Maybe then I could understand what it's doing.
The behavior
The code in your function is running asynchronously
clickOnSetItem(id: string) {
element(by.id(id)).click().then(function(){
return browser.wait(until.visibilityOf(element(by.css('app-side-bar .card .info-content'))), 30000);
}).then(function(){
expect(element(by.css('app-side-bar .card .info-content')).getText).toEqual(id);
}).catch(function(err){
console.log('Error: ' + err);
})
}
I'm working on a closed system web application to aid companies in their everyday online commerce chores. That means on the one hand that it won't be open to the public, on the other: it will have to deal with large amounts of data while maintaining a fluent work experience.
This is why I turned to web workers in JS to run all sorts of database access and data loading in the background.
My understanding is, that not only the main UI/main JS remains uninterrupted but also the different web workers run without hindering each other.
I now have the following setup:
mainJS: function statusCheck which runs on pageload:
function statusCheck() {
if(typeof(w__statusCheck) == "undefined") {
var w__statusCheck = new Worker("...statusCheck.js");
w__statusCheck.postMessage("go");
w__statusCheck.onmessage = function(e) {
var message = JSON.parse(e.data);
if(message.text!=undefined) displayMessage(message.text);
}
}
statusCheck.js which is the worker simply goes like this:
function checkStatus() {
console.log("statusCheck started");
// I will leave standard parts out:
// creating and testing the ajax variable against different browsers
ajaxRequest.onreadystatechange = function() {
if(ajaxRequest.readyState == 4) {
self.postMessage(ajaxRequest.responseText);
var timer;
timer = self.setTimeout(function(){
checkStatus();
}, 1000);
}
}
ajaxRequest.open("GET", "...worker_statusCheck.php", true);
ajaxRequest.send(null);
}
this.onmessage = function(e){
checkStatus();
};
As you can see, this restarts itself every second (for now). The intervall might be longer in production.
worker_statusCheck.php simply gets different things from the database and knits them into a JSON object which gives me the system status.
This works beautifully.
Now I have another worker which is supposed to get initiated by a click on a link to effectively call some php to perform actions:
mainJS loadWorker
function loadWorker(url="") {
console.log("loadWorker started");
if(url!="") {
var uniqueID = "XXX" // creating a random ID based on timestamp and Math.random()
if(typeof(window[uniqueID]) == "undefined") {
var variables = { ajaxURL: url };
window[uniqueID] = new Worker("....loadWorker.js");
window[uniqueID].postMessage(JSON.stringify(variables));
window[uniqueID].onmessage = function(e) {
var message = JSON.parse(e.data);
if(message["success"]!=undefined) {
variables["close"] = "yes";
window[uniqueID].postMessage(JSON.stringify(variables));
}
}
}
With every click on a certain link this gets called, creates a uniquely named worker, runs it, receives the data and tells the worker to close().
The php again does its thing and writes a progress update in the DB after each step of the lengthy procedure. These progress updates I fetch from the DB with the above repeating statusCheck.
Now, I can see the entries in the DB with timestamp, so I know they get written each at their time.
So, both workers do their job and run reliably. But I have noticed, that whenever I initiate the manual (randomly named) worker the statusCheck actually stops performing. It just gets stuck... I was able to confirm this with console output from both workers. So it's not the main JS that seems stuck, but the statusCheck actually pauses... and resumes when loadWorker is done.
Am I missing something fundamental here? Any insight would be appreciated since I'm new to this concept of web workers.
Thanx :)
Your question lacks resources to truly figure out what exactly goes wrong. I can concur that two web workers can operate at the same time, even with synchronous operations. I tested this for both for loops and sync XHR requests.
There are multiple things I would recommend though.
First - unless you're processing the data with some CPU heavy algorithm, web workers are waste of time. XHR requests do not block main thread (unless you explicitly ask them to).
In statusCheck() you declare var w__statusCheck which means a local variable. Therefore it will always be null as seen from outer scope. It might get garbage-collected once no code is running in the worker.
Do not use XMLHttpRequest.onreadystatechange. Use onload and onerror.
Random unique ID's for variables are almost always wrong. If you need to store the worker refference at all, either give it a reasonable name (eg. the url it's supposed to load) or use incremental id.
Do NOT stringify data that you post to web worker. It's already done for you by the browser, possibly in more optimal manner. Converting the data to something is a single most common stupid thing people do with web workers.
Also when posting question, at least make sure the code makes some sense. In your post curly braces do not match.
Alright.. I figured it out:
I was looking in all the wrong places. Turns out, I had initialized my php session in all the php scripts which are called by the workers. And my two parallel workers both called one. So the session file was locked by the first php script and the second had to wait until it was back open again. It was not the workers or the JS being hindered, it was the php.
I now took out the session initialization from my statusCheck.php and it works like a charm. I will keep it in those others that handle the user input responses because there it actually makes sense: user clicks on button "compile data XY" which is run by the worker and takes a while. Impatient as he is he already clicks the next button "show this data"... and due to the locked session file I have sort of a neat queue for those actions. :)
I still will take above recommendations to heart and see to it to improve my code. :)
I process thousands of points asynchronously in ArcGIS JS API. In the main function, I call functions processing individual features, but I need to finalize the processing when all the features are processed. There should be an event for this, though I didn't find any and I'm afraid it even doesn't exist - it would be hard to state that the last item processed was the last of all. .ajaxStop() should do this, but I don't use jQuery, just Dojo. Closest what I found in Dojo was Fetch and its OnComplete, but as far as I know it's about fetching data from AJAX, not from other JS function.
The only workaround idea I have now is to measure how many features are to be processed and then fire when the output points array reaches desired length, but I need to count the desired number at first. But how to do it at loading? Tracking the data to the point where they are read from server would mean modifying functions I'm not supposed to even know, which is not possible.
EDIT - some of my code:
addData: function (data) {
dojo.addOnLoad(
this.allData = data,
this._myFunction()
);
},
Some comments:
data is an array of graphics
when I view data in debugger, its count is 2000, then 3000, then 4000...
without dojo.addOnLoad, the count started near zero, now it's around 2000, but still a fraction of the real number
_myFunction() processes all the 2000...3000...4000... graphics in this._allData, and returns wrong results because it needs them all to work correctly
I need to delay execution of _myFunction() until all data load, perhaps by some other event instead of dojo.addOnLoad.
Workarounds I already though of:
a) setTimeout()
This is clearly a wrong option - any magic number of miliseconds to wait for would fail to save me if the data contains too much items, and it would delay even cases of a single point in the array.
b) length-based delay
I could replace the event with something like this:
if(data.length == allDataCount) {
this._myFunction();
}
setTimeout(this._thisFunction, someDelay);
or some other implementation of the same, through a loop or a counter incremented in asynchronously called functions. Problem is how to make sure the allDataCount variable is definitive and not just the number of features leaded until now.
EDIT2: pointing to deferreds and promises by #tik27 definitely helped me, but the best I found on converting synchronous code to a deferred was this simple example. I probably misunderstood something, because it doesn't work any better than the original, synchronous code, the this.allData still can't be guaranteed to hold all the data. The loading function now looks like this:
addData: function (data) {
var deferred = new Deferred();
this._addDataSync(data, function (error, result) {
if (error) {
deferred.reject(error);
}
else {
deferred.resolve(result);
}
});
deferred.promise.then(this._myFunction());
},
_addDataSync: function (data, callback) {
callback(this.allData = data);
},
I know most use cases of deferred suppose requesting data from some server. But this is the first time where I can work with data without breaking functions I shouldn't change, so tracking the data back to the request is not an option.
addonload is to wait for the dom.
If you are waiting for a function to complete to run another function deferred/promises are what is used.
Would need more info on your program to give you more specific answers..
I sort of solved my problem, delaying the call of my layer's constructor until the map loads completely and the "onUpdateEnd" event triggers. This is probably the way how it should be properly done, so I post this as an answer and not as an edit of my question. On the other hand, I have no control over other calls of my class and I would prefer to have another line of defense against incomplete inputs, or at least a way to tell whether I should complain about incomplete data or not, so I keep the answer unaccepted and the question open for more answers.
This didn't work when I reloaded the page, but then I figured out how to properly chain event listeners together, so I now can combine "onUpdateEnd" with extent change or any other event. That's perfectly enough for my needs.
I am porting an old game from C to Javascript. I have run into an issue with display code where I would like to have the main game code call display methods without having to worry about how those status messages are displayed.
In the original code, if the message is too long, the program just waits for the player to toggle through the messages with the spacebar and then continues. This doesn't work in javascript, because while I wait for an event, all of the other program code continues. I had thought to use a callback so that further code can execute when the player hits the designated key, but I can't see how that will be viable with a lot of calls to display.update(msg) scattered throughout the code.
Can I architect things differently so the event-based, asynchronous model works, or is there some other solution that would allow me to implement a more traditional event loop?
Am I making sense?
Example:
// this is what the original code does, but obviously doesn't work in Javascript
display = {
update : function(msg) {
// if msg is too long
// wait for user input
// ok, we've got input, continue
}
};
// this is more javascript-y...
display = {
update : function(msg, when_finished) {
// show part of the message
$(document).addEvent('keydown', function(e) {
// display the rest of the message
when_finished();
});
}
};
// but makes for amazingly nasty game code
do_something(param, function() {
// in case do_something calls display I have to
// provide a callback for everything afterwards
// this happens next, but what if do_the_next_thing needs to call display?
// I have to wait again
do_the_next_thing(param, function() {
// now I have to do this again, ad infinitum
}
}
The short answer is "no."
The longer answer is that, with "web workers" (part of HTML5), you may be able to do it, because it allows you to put the game logic on a separate thread, and use messaging to push keys from the user input into the game thread. However, you'd then need to use messaging the other way, too, to be able to actually display the output, which probably won't perform all that well.
Have a flag that you are waiting for user input.
var isWaiting = false;
and then check the value of that flag in do_something (obviously set it where necessary as well :) ).
if (isWaiting) return;
You might want to implement this higher up the call stack (what calls do_something()?), but this is the approach you need.