Get a JavaScript variable from a web page without interaction/headlessly - javascript

Good afternoon!
We're looking to get a JavaScript variable from a web page, one that we can usually retrieve by typing app in the Chrome DevTools console.
However, we want to do this headlessly, as it has to be performed on numerous apps.
Our ideas:
Using a Puppeteer instance to open the page, type the command and return the variable. This works, but it is very resource-intensive.
Using a GET/POST request to the page and trying to inject the JS command, but we didn't succeed.
We're wondering whether there is an easier solution, such as a dedicated API that could extract the variable.
The goal would be to automate this process with no human interaction.
Thanks for your help!

Your question is not so much about a JS API (since the webpage is not yours to edit, you can only request it) as it is about webcrawling / browser automation.
You have to add details to get a definitive answer, but I see two scenarios:
The website actively checks for evidence of human browsing (for example, it sits behind Cloudflare with that option enabled), or the scripts depend heavily on a browser execution environment being available. In this case, the simplest option is to automate a real browser, because a headless option has to get many things right to fool the server or the scripts. I would use Karate, which is easier than, say, Selenium and can execute in-browser scripts. It is written in Java, but you can run it externally and just read its reports.
The website does not check for such evidence and the scripts do not really require a browser execution environment. Then you can simply download everything required locally and attempt to jury-rig the JS into executing in any JS environment. According to your post, this fails, but it is impossible to help unless you describe how it fails. This option can be headless.
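For the second scenario, one way to jury-rig a downloaded script into running without a browser is Node's built-in vm module. This is only a minimal sketch under assumptions: the script text, the stubbed window object and its APP_VERSION property are made up for illustration, and the variable name app is taken from the question. Real pages usually need far more of the browser environment than this.

```javascript
const vm = require("vm");

// Hypothetical stand-in for a script downloaded from the target page.
const downloadedScript = `
  var app = { name: "demo", version: window.APP_VERSION };
`;

// Run the script in an isolated context and read one global back out.
function extractGlobal(scriptText, variableName, stubs = {}) {
  // Expose only the stubbed browser globals the script is known to touch.
  const sandbox = { ...stubs };
  // If no window stub was supplied, let window point at the global object itself.
  sandbox.window = sandbox.window || sandbox;
  vm.createContext(sandbox);
  // The timeout guards against scripts that loop forever.
  vm.runInContext(scriptText, sandbox, { timeout: 1000 });
  return sandbox[variableName];
}

const extractedApp = extractGlobal(downloadedScript, "app", {
  window: { APP_VERSION: "1.2.3" },
});
console.log(extractedApp.name, extractedApp.version);
```

Only globals declared with var or assigned to the global object show up on the sandbox; top-level let and const declarations do not, so this approach depends on how the page defines the variable.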

You can embed Chrome into your application and instrument it. It will be headless.
We've used this approach in the past to copy content from PowerPoint Online.
We were using .NET to do this and therefore used CefSharp.

Related

Load Testing Javascript on a Webpage

I am currently load testing my company's new web page and have used JMeter for this task. We have an assessment which pulls JavaScript down to the user's machine, allowing them to take a test; once completed, tests are uploaded back to the database.
The issue I'm having is that, as is well documented, JMeter is not a browser and does not execute JavaScript. We need a way to measure how long it takes for requests to reach the assessment page and how long it takes to pull it down. We also need to increase the number of requests over a specific period of time so we can determine at what point the server falls over.
I have also tried using Gatling, but I am running into the same issue. Has anyone else run into these problems, and how did they get around them?
Thanks in advance!
Very few tools run downloaded JS code in the client/VU threads when executing a load test, for performance reasons mainly. You can try Selenium Grid or some online service based on it, like https://www.loadbooster.com/ or Blazemeter, but if you want to run tests in your own environment, Selenium Grid may be your only choice.
The alternative is to emulate the client-side JS code when you script the load test scenario. Many tools can do that at some level, but to make translating existing JS code easier, I would choose a tool that offers a real scripting language, such as Grinder, Locust, Wrk or k6 (or possibly Gatling). k6 may be the simplest, as you script it in JavaScript, so translating the client-side JS code should be somewhat less work there.
https://k6.io
https://gatling.io/
https://locust.io
https://github.com/wg/wrk
http://grinder.sourceforge.net/
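To give a feel for what "scripting the scenario" means, here is a toy sketch in plain JavaScript of a stepped ramp: each stage fires a number of concurrent emulated client steps and records their timings. It is not k6 or any of the tools above, and fakeRequest, runRamp and the stage sizes are hypothetical names and values chosen for illustration; a real test would replace fakeRequest with the HTTP calls the assessment page makes.

```javascript
// Hypothetical stand-in for one emulated client step (e.g. pulling the assessment script).
// A real test would issue an HTTP request to the system under test here.
async function fakeRequest() {
  await new Promise((resolve) => setTimeout(resolve, 5));
}

// Run the stages in order; each stage fires `users` concurrent requests
// and records how long each one took, in milliseconds.
async function runRamp(stages, sendRequest) {
  const results = [];
  for (const users of stages) {
    const timings = await Promise.all(
      Array.from({ length: users }, async () => {
        const start = Date.now();
        await sendRequest();
        return Date.now() - start;
      })
    );
    const average = timings.reduce((sum, t) => sum + t, 0) / timings.length;
    results.push({ users, average, max: Math.max(...timings) });
  }
  return results;
}

runRamp([1, 5, 10], fakeRequest).then((results) => console.table(results));
```

Watching how the average and max change from stage to stage is the basic signal for finding the point where the server starts to struggle; dedicated tools add pacing, reporting and distributed execution on top of this idea.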
You need to split your test into 2 major areas:
JavaScript processing happens solely on the client side, so you need to test it separately. Most modern web browsers come with developer tools that can measure the performance of JavaScript execution:
Firefox Developer Tools - Performance
Chrome DevTools - Analyze Runtime Performance
Profiling JavaScript performance
Server-side impact comes from downloading the JavaScript and uploading the results. You should be able to mimic the corresponding calls using JMeter's HTTP Request sampler; check out the article Performance Testing: Upload and Download Scenarios with Apache JMeter for more details.
See the built-in JavaScript profiler in the developer tools of every browser. JavaScript execution time is tied to one particular combination of browser, OS, machine, running apps and browser extensions; it is not a question of multi-user performance.

What's universal javascript?

I've been reading articles online about what universal JavaScript is, but I'm still not comfortable with the definition each site gives, which is "code that can run on the client and server." Does this mean that a Node.js app is inherently universal JavaScript because it will have JavaScript running on the client side and the server side? Or does universal JavaScript have to do with server-side rendering followed by client-side rendering?
Preface: I cannot find any highly authoritative source (e.g. ECMA, Microsoft, Mozilla or Google) that provides a strict definition of "universal JavaScript" or "isomorphic JavaScript" - at most I've found a few blog posts (albeit by influential personalities), so I can see why a newcomer might be confused.
It seems there are two definitions going around which are similar, but with crucial differences:
1. To refer to JavaScript which runs anywhere
This definition refers to JavaScript which does not take a dependency on any specific client-side or server-side API; instead it only makes use of features present in JavaScript's built-in library (String, Array, Date, Function, Math etc.) or of other libraries that similarly restrict their dependencies (a transitive relation).
Remember that "JavaScript" does not mean that the DOM API, AJAX, HTML5 <canvas> (and so on) are available - it just means the JavaScript scripting language is being used - that's it. JavaScript has been available outside of web browsers for over 20 years now (Windows supports JavaScript as a shell-scripting language in cscript.exe/wscript.exe, ASP 3.0 supported server-side JScript as an alternative to VBScript, and the .NET Framework has "JScript.NET" too).
So in this case, if you wrote a library that adds some useful string functions, which only references String, then that script would work without issue in a Node.js server environment or an in-browser environment.
But if your script ever used the window object (only present in browsers) or express (a library only for Node) then it loses "universal" status because it cannot "run everywhere".
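As a small illustration of this first definition, here is a sketch of such a string helper. The function name titleCase is made up for the example; the point is only that the function body touches nothing beyond String and Array.

```javascript
// Depends only on built-in String and Array methods, so the function itself
// has no tie to a browser or to Node.js.
function titleCase(text) {
  return text
    .split(" ")
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}

// console is a host-provided API, used here only to show the result.
console.log(titleCase("universal javascript")); // prints "Universal Javascript"
```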
2. To refer to JavaScript which renders the same HTML whether on the server or on the client
e.g. http://isomorphic.net/
This definition is actually a strict subset of the first definition, as the same script must (by definition) run inside both a server/Node.js context and a browser DOM context. When it runs it generates content (typically HTML) that is then displayed in the user's browser. By doing this it must take a dependency on both a Node API and the W3C DOM, so it cannot strictly run "anywhere", because neither is available in a cscript.exe environment, for example.
Note: There is debate over whether use of XMLHttpRequest or fetch makes a script universal or not, as their presence is not guaranteed (technically they're browser-provided web APIs, not part of JavaScript's built-in library).
In this 2015 blog post ( https://medium.com/#ghengeveld/isomorphism-vs-universal-javascript-4b47fb481beb ) the author argues that only the term "isomorphic JavaScript" should be used to refer to rendering code that runs in both browser and server environments, while "universal JavaScript" should refer to truly portable, environment-agnostic, JavaScript (i.e. my first definition).
Nowadays Single Page Applications have become very popular, but they have problems: SEO, for example.
So, how does an SPA work? JavaScript loads in the browser and fetches data from an API. Most of the rendering is done on the client side. But search engine bots have a hard time indexing the page, because without JS it contains very little.
This is where a Universal/Isomorphic App comes to the rescue. On the initial page load, the page is rendered on the server. After that, the app works like an SPA. SEO is better because when a search engine bot asks for a page, the server returns the whole rendered HTML page, with content and meta tags.
Edit
An isomorphic app can be built with JavaScript (Node.js), PHP or some other language, but if the app is written with Node.js, then we can call it universal, as both the backend and frontend are in JavaScript.
I'll try to explain it with examples, even if other answers seem already accurate.
A basic example
Imagine you develop an SPA that renders a Hello World message. This means that your browser loads an HTML file with a <script> tag (or a reference to a JS file) that actually makes this happen. You can prove that "Hello World" is generated by JavaScript in the client browser, because if you deactivate JavaScript you won't see any message.
Now isolate the code that prints the string "Hello World". It doesn't need much adaptation to work on the server side. In fact, the server just needs to send an HTML string that contains <h1>Hello World</h1> inside its body.
So what makes it universal/isomorphic? The fact that the code can tell which environment it runs in (the browser, the server or possibly another environment) and keeps functioning. Remember: a given piece of code usually runs in only one of the two environments; the point is that you wrote some common code that can run in both (universal).
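A rough sketch of that split between shared code and environment-aware glue could look like the following. The names renderGreeting, isBrowser and deliver are hypothetical, and the server branch simply returns a label instead of writing an HTTP response, to keep the example self-contained.

```javascript
// Shared code: builds the markup and knows nothing about its environment.
function renderGreeting(name) {
  return "<h1>Hello " + name + "</h1>";
}

// Environment check: browsers expose window and document, a plain server runtime does not.
function isBrowser() {
  return typeof window !== "undefined" && typeof document !== "undefined";
}

// Environment-specific glue: deliver the markup in whichever way fits.
function deliver(html) {
  if (isBrowser()) {
    // In a browser, put the markup into the page.
    document.body.innerHTML = html;
    return "client";
  }
  // On the server, the string would be sent as the HTTP response body;
  // here we only report which branch ran.
  return "server";
}

const greetingHtml = renderGreeting("World");
console.log(deliver(greetingHtml), greetingHtml);
```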
The behavior of a more complex Universal App
Imagine that you set out to develop a new Universal website. The code can tell which environment it is running in and works just fine. So, let's say, 80% of your code is shared and doesn't even need to know the environment, and the rest of your code is there to manage the fact that your app can run on the client or on the server.
How does this work?
The client first contacts the server, which returns HTML with all the content of the page, generated on the server. So the server renders the application. In the meantime the browser downloads the script file that lets your single-page app work in the client. The client then renders the same page again. You won't see any change, because if it is done properly the output is identical (of course all the animations and real-time features have to work client-side, so you will eventually see your animations starting).
When the user clicks an internal link, uses an interactive feature, or fills out and submits a form, the client-side code handles it. The server doesn't get any request, assuming that all the interactions are abstracted behind an API that is not part of our isomorphic app.
If the user goes crazy and wants to deactivate JavaScript, how do you ensure that, for example, forms still work? Here is a trick you can use:
<form
  method="post"
  action="/api/fakeBackendRoute"
  onSubmit={this.handleSubmit}
>
  [input fields here]
</form>
When the client JS is available, handleSubmit is executed and the default submission of the form is prevented. This way the server-side route is never called.
If the client JS is disabled, then handleSubmit is never executed and the browser posts the form to /api/fakeBackendRoute, so you have to make sure that this route handles the data exactly as the client would.
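The handler itself is not shown above, so here is one way it might look, as a sketch only. makeHandleSubmit and sendToApi are hypothetical names, and the submit event is simulated with a plain object so the example also runs outside a browser; in a real page the browser supplies the event and event.target is the form element.

```javascript
// Builds a submit handler; `sendToApi` is an assumed helper that posts the data from the client.
function makeHandleSubmit(sendToApi) {
  return function handleSubmit(event) {
    // Stop the browser from performing the native POST to the action URL.
    event.preventDefault();
    // Collect the named fields from the form's controls.
    const data = {};
    for (const field of Array.from(event.target.elements)) {
      if (field.name) data[field.name] = field.value;
    }
    sendToApi(data);
  };
}

// Simulated submit event with one made-up field.
const fakeEvent = {
  defaultPrevented: false,
  preventDefault() {
    this.defaultPrevented = true;
  },
  target: { elements: [{ name: "email", value: "a@example.com" }] },
};
const sentPayloads = [];
makeHandleSubmit((data) => sentPayloads.push(data))(fakeEvent);
console.log(fakeEvent.defaultPrevented, sentPayloads);
```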
Why do people use it?
In my opinion the difficulty of developing a Universal App is often underestimated. Good reasons to use one are:
Be more SEO friendly
Support very old browsers. For example, if you want to support IE8 with the server-rendered HTML alone, you could keep the script away from it like this:
<!--[if gt IE 8]><!-->
<script src="yourfile.js"></script>
<!--<![endif]-->
Be more accessible for people that don't want to use JavaScript
Other reasons could be:
Performance, if it matters to your application. You can improve your response time by, for example, using Node's streaming capabilities to stream the HTML string for the first request, and later do more of the work in the client, where things will likely be faster. You can also decide whether it is faster to render on the client or on the server, depending on the content and how you build your assets.
If someone knows other good reasons, just comment below and I will add them.
Some good reference links:
https://medium.com/airbnb-engineering/isomorphic-javascript-the-future-of-web-apps-10882b7a2ebc
https://medium.com/front-end-developers/handcrafting-an-isomorphic-redux-application-with-love-40ada4468af4
https://github.com/xgrommx/awesome-redux

Python Scraping from JavaScript table on PGA Website

I'm just getting into Python and have been working mostly with BeautifulSoup to scrape sports data from the web. I have run into an issue with a table on the PGA website that is generated by JavaScript, and was hoping someone could walk me through the process in the context of the specific website I am working with. Here is a sample link: "http://www.pgatour.com/content/pgatour/players/player.29745.tyler-aldridge.html/statistics". The tables are all of the player statistics tables. Thanks!
When a web page uses JavaScript to build or get its content, you are out of luck with tools that just download HTML from the web. You need something which mimics a web browser more thoroughly and interprets JavaScript. In other words, a so-called headless browser. There are some out there, even some with good Python integration. You may want to start your journey by searching for PhantomJS or Selenium. Once you've chosen your tool, you can let the browser do its retrieving and rendering work and then browse the DOM in much the same way as you did with BeautifulSoup on static pages.
I would, however, also take a look at the Network tab of your browser's debugger first. Sometimes you can identify the GET which actually fetches the table data from the server. In this case it might be easier to GET the data yourself (e.g. via requests) than to employ complex technology to do it for you. It is also very likely that you get the information you want in plain JSON, which will make it even simpler to use. The PGA site makes GETs for hundreds of resources to build the page, but browsing through them will still be a good trade.
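The idea of calling the data endpoint directly can be sketched as follows. Everything specific here is an assumption: the URL is a placeholder, the payload shape { stats: [{ name, value }] } is invented, and the stub stands in for the real endpoint so the example runs offline. In Python the same pattern is requests.get(url).json().

```javascript
// `fetchImpl` defaults to the global fetch (browsers, Node 18+);
// it is injectable so the sketch can be exercised without a network.
async function fetchStatRows(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  const payload = await response.json();
  // Assumed payload shape: { stats: [{ name, value }, ...] }.
  return payload.stats.map((stat) => [stat.name, stat.value]);
}

// Offline stand-in for the real endpoint, returning made-up data in the assumed shape.
const stubFetch = async () => ({
  ok: true,
  status: 200,
  json: async () => ({ stats: [{ name: "Example Stat", value: "1.0" }] }),
});

fetchStatRows("https://example.com/hypothetical/stats.json", stubFetch)
  .then((rows) => console.log(rows));
```

The real work is finding the right request in the Network tab and learning its actual response structure; once that is known, the mapping step is usually this small.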
You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
Also, consider using this:
http://www.seleniumhq.org/docs/03_webdriver.jsp
Selenium-WebDriver makes direct calls to the browser using each browser’s native support for automation. How these direct calls are made, and the features they support depends on the browser you are using. Information on each ‘browser driver’ is provided later in this chapter.
For those familiar with Selenium-RC, this is quite different from what you are used to. Selenium-RC worked the same way for each supported browser. It ‘injected’ javascript functions into the browser when the browser was loaded and then used its javascript to drive the AUT within the browser. WebDriver does not use this technique. Again, it drives the browser directly using the browser’s built in support for automation.

Windowless container for Google App Engine channel API client

I would like to write a command-line tool that receives notifications from Google App Engine's Channel API. This seems to be quite straightforward thanks to open JavaScript VMs such as V8 and js. One problem with this approach, though, is that these VMs do not provide standard JS objects such as window and document, which the Channel API references. Running such code therefore gives you window/document/... not found errors.
There seem to be two ways of circumventing this obstacle:
To write a lightweight header in javascript to emulate the behavior of the required objects.
To edit Google's javascript (/_ah/channel/jsapi) and eliminate references to such objects.
Does anyone know if there are existing implementations of these approaches, or know of a better idea? Furthermore, is there a clean, uncompressed version of the channel API client side javascript code available somewhere?
You can't edit the script used by /_ah/channel/jsapi -- it's only used when the channel is running against the dev app server. When running in production, that script redirects to https://talkgadget.google.com/talkgadget/channel.js
So you're left with emulating the required objects, or just using a hidden browser window. I would opt for the latter, since I think emulating all the DOM calls is going to get very difficult very quickly.
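Before committing to emulation, it can help to measure how much of window and document a script actually touches. The sketch below uses a Proxy that records every property path it is asked for. recordingStub and sampleClientCode are hypothetical names, and the sample code is a made-up stand-in, not the Channel API client.

```javascript
// Returns a stub that accepts any property read, assignment or call,
// and records each property path that was read.
function recordingStub(path, accessed) {
  // An arrow function target lets the stub be both read from and called.
  return new Proxy(() => {}, {
    get(_target, prop) {
      if (typeof prop === "symbol") return undefined;
      const childPath = path + "." + prop;
      accessed.add(childPath);
      return recordingStub(childPath, accessed);
    },
    set() {
      return true; // accept and ignore assignments
    },
    apply() {
      return recordingStub(path + "()", accessed);
    },
  });
}

// Made-up stand-in for third-party client code that expects a browser.
function sampleClientCode(window, document) {
  const frame = document.createElement("iframe");
  frame.src = window.location.href;
  document.body.appendChild(frame);
}

const touched = new Set();
sampleClientCode(recordingStub("window", touched), recordingStub("document", touched));
console.log([...touched]);
```

A long list of touched members from the real script would support the point above that emulation gets difficult quickly; a short one would suggest a small hand-written shim is realistic.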

Using Python to Automatically Login to a Website with a JavaScript Form

I'm attempting to write a script that logs into a website. This website contains a JavaScript form, so I had little to no luck using "mechanize".
I'm curious if there exist other solutions that I may be unaware of that would help me in my situation. If this particular question or some related variant has been asked here before, please excuse me, and I would prefer the link to this particular query. Otherwise, what are some common techniques/approaches for dealing with this issue?
Thanks.
I've recently been using PhantomJS for this kind of work - it's a command-line tool that allows you to run JavaScript in a browser environment (based on WebKit). This allows you to do scraping and online interactions that require JavaScript-enabled interfaces. There's a Python-based implementation here that's fully compatible with the API of the C++ version, or you could run either version from Python via subprocess.
Depending on what you're trying to do, another good option might be Selenium, which has a client driver implementation in Python - it's meant for integration testing, but can do a lot of automation as long as you're okay with running the Java-based Selenium Server and having the automation happen in an open browser rather than as a background process.
