Python Web Scraping with JavaScript Do Postback - javascript

I have been trying to:
1. Go to mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx
2. Enter a certificate number (for the sake of illustration, you can just search for "Davidson" as the last name).
3. Click on a link corresponding to "Professional Teaching Certificate".
4. Copy and paste the resulting table.
The rub seems to be with the JavaScript __doPostBack() part, as I believe it requires rendering to get the data.
When viewing the source code, you can see how the href attribute identifies an individual link like this (for the 6th link down):
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1','')"
From this:
<td class="MOECSNormal" style="border-color:Black;border-width:1px;border-style:Solid;">Professional Teaching Certificate Renewal</td><td class="MOECSNormal" style="border-color:Black;border-width:1px;border-style:Solid;">
<a id="ContentPlaceHolder1_gViewCredentialSearchList_link1_5" ItemStyle-BorderColor="Black" ItemStyle-BorderStyle="Solid" ItemStyle-BorderWidth="1px" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1','')">CC-XWT990004102</a>
</td>
I'm looking for a way (via Python) to get the data I need into a table, given a certification number and certificate name (i.e. "Professional Teaching Certificate").
I have tried following a tutorial using PyQt4, but installing it alone was traumatic.
Thanks in advance!

You can open the page in a browser, e.g. Chrome, and study how the interaction between the page and the server is done; normally this can be seen in the Network tab of the developer tools. This way you can write a Python script that reproduces those steps, perhaps using the requests library,
or
you can use selenium-python to simulate your browser interaction (including the JavaScript calls) until you get to the page that contains the data you are interested in.
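For the MOECS page specifically, a minimal selenium-python sketch of that second option might look like the following. The link ID is taken from the HTML quoted above; the last-name field and search-button IDs are guesses and would need to be confirmed by inspecting the page source.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx")
wait = WebDriverWait(driver, 30)

# Hypothetical IDs for the search form; check the real page source before using them.
driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_txtLastName").send_keys("Davidson")
driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_btnSearch").click()

# This ID comes from the question; clicking the link lets the browser run __doPostBack for us.
link = wait.until(EC.element_to_be_clickable(
    (By.ID, "ContentPlaceHolder1_gViewCredentialSearchList_link1_5")))
link.click()

# Once the detail page has rendered, the table can be parsed out of
# driver.page_source, e.g. with BeautifulSoup or pandas.read_html.
print(driver.page_source)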

Related

Getting initiator of XXX-xsrfstatemanager.js file using Chrome Developer Tools

In order to triage a problem with a web browser I am trying to determine the initiator of the XXX-xsrfstatemanager.js file (the XXX part seems to be something dynamic like a nonce) that occurs as part of a Google Authentication flow (using OAuth).
When I use Chrome developer tools, it says the below URL is the initiator:
https://accounts.google.com/o/oauth2/v2/auth?approval_state=%21Ch[REDACTED]Q%E2%88%99AJ[REDACTED]xq&as=-aBk[REDACTED]
Looking at the result of the above page, I see a lot of JavaScript, but the string "xsrfstatemanager" is nowhere to be found, nor do I see any other JavaScript pages being included. Unless there is some really cryptic code that is somehow building this URL, the call is actually coming from some other page.
Does anyone know how I can get the 'real' initiator? Or if the above URL might be correct, if I can get more information like what exact line number of the file initiated the call?
By the way, while I edited the above URL for security reasons, if you go to (for example) www.quora.com and click "Continue with Google", it is easy to see the flow in question.
The flow includes a redirection, which is why you cannot see the source code that initiates/references that script.
If you view the source of the original URL that is opened when you click on "Continue with Google", you will see the <script src> that references it. This works in Chrome and probably Safari -
view-source:https://accounts.google.com/o/oauth2/auth?redirect_uri=storagerelay%3A%2F%2Fhttps%2Fwww.quora.com%3Fid%3Dauth488109&response_type=code%20permission%20id_token&scope=email%20profile%20openid&openid.realm=&client_id=917071888555.apps.googleusercontent.com&ss_domain=https%3A%2F%2Fwww.quora.com&access_type=offline&include_granted_scopes=true&prompt=select_account&origin=https%3A%2F%2Fwww.quora.com&gsiwebsdk=2
From the source code -
<script src='https://ssl.gstatic.com/accounts/o/532969778-xsrfstatemanager.js' nonce="IgiKmQiLZIHDwGvce7/q6Q"></script>
You can also use tools like Fiddler to see the source code of the redirect, or check "Preserve log" in the Network panel of the Developer Tools feature of Chrome, or by going to the original URL with JavaScript disabled.

Selenium Python: Cannot find element after javascript runs

I am trying to automate some SAP job monitoring with Python. I want to create a script that should do the following:
Connect and login the SAP environment -> Open SM37 transaction -> Send job parameters (name-user-from-to) -> Read the output and store it into a database.
I don't know of any module or library that would allow me to do that, so I checked that the WEBGUI is already enabled. I am able to open the environment through a browser, so a browsing module should allow me to do everything I need.
I tried Mechanize and RoboBrowser. They work, but the WEBGUI runs a lot of JavaScript for rendering, and those modules don't handle JavaScript.
There is one more shot: Selenium.
I was able to connect and log in to the environment, but when trying to select an element from the new page (the main menu), Selenium cannot locate the element.
Printing the source code, I realized that the main menu page is rendered with JavaScript. The source code doesn't contain the element at all, only the title ("Welcome "), which means the login was successful.
I read a lot of posts asking about this, and everybody recommends using WebDriverWait with explicit conditions.
Tried this, didn't work:
driver.get("http://mysapserver.domain:8000/sap/bc/gui/sap/its/webgui?sap-client=300&sap-language=ES")
wait = WebDriverWait(driver, 30)
element = wait.until(EC.presence_of_element_located((By.ID, 'ToolbarOkCode')))
EDIT:
There are two source codes: SC-1 is the one that Selenium reads; SC-2 is the one that appears once the JavaScript renders the site (the one from "Inspect Element").
The full SC-1 is this:
https://pastebin.com/5xURA0Dc
The SC-2 for the element itself is the following:
<input id="ToolbarOkCode" ct="I" lsdata="{0:'ToolbarOkCode',1:'Comando',4:200,13:'150px',23:true}" lsevents="{Change:[{ClientAction:'none'},{type:'TOOLBARINPUTFIELD'}],Enter:[{ClientAction:'submit',PrepareScript:'return\x20its.XControlSubmit\x28\x29\x3b',ResponseData:'delta',TransportMethod:'partial'},{Submit:'X',type:'TOOLBARINPUTFIELD'}]}" type="text" maxlength="200" tabindex="0" ti="0" title="Comando" class="urEdf2TxtRadius urEdf2TxtEnbl urEdfVAlign" value="" autocomplete="on" autocorrect="off" name="ToolbarOkCode" style="width:150px;">
Still can't locate the element. How can I solve it?
Thanks in advance.
The solution was to switch into the iframe that contains the rendered HTML (with the control).
from selenium.webdriver.common.keys import Keys

driver2.get("http://mysapserver.domain:8000/sap/bc/gui/sap/its/webgui?sap-client=300&sap-language=ES")
# Switch into the first iframe, where the WebGUI renders its controls.
iframe = driver2.find_elements_by_tag_name('iframe')[0]
driver2.switch_to_default_content()
driver2.switch_to_frame(iframe)
driver2.find_element_by_id("ToolbarOkCode").send_keys("SM37")
driver2.find_element_by_id("ToolbarOkCode").send_keys(Keys.ENTER)

Starting a vbs script from a HTML file

I've been wrapping my head around this problem for a couple of days searching for all possible solutions on the forums and online but can't seem to get it working.
I'm calling a script by a link on a "button" to start a script on a server (in HTML):
<a href="#" onClick="RunScript();">
The script code is:
<script type="text/javascript" language="javascript">
function RunScript() {
    var objShell = new ActiveXObject("WScript.Shell");
    objShell.Run("%comspec% /k my_projects_EN.vbs", 1, false);
}
</script>
So why am I using a vbs? What I'm trying to do is create custom pages for each employee. The vbs checks the computer name, and an If clause directs the employee to a custom page. With my basic knowledge of programming and a lot of hours of searching, I have not found a better solution for this yet, so I'm trying to make this one work.
And it does work, but only if I'm running the script locally (from the desktop). As the webpage will be used in an intranet location, this script will be on a server, and this is where it became a bit hairy, as I can't seem to find the right combination of commands. I already tried pushd to create a mounted volume and currentDir to set the location of the script, but nothing seems to work completely.
I assume that I'm missing a subroutine for the function, as adding anything there just stops the script, but how to go about it is beyond me.
All help is appreciated, even if it means I have to bury myself in another programming language (not preferred, of course).
I am certain that there is a way to solve this other than sending a script to each employee to put on their desktop (each time a new employee comes to work).
Thanks
Edit: I see an additional clarification is in order:
We're creating an intranet webpage to help our employees work more efficiently. We're on the same level as everyone else, so we're not IT and have no admin rights; we're on our own.
The point is to have a personal page for each employee which can be accessed via the same interface, so a link has to send each person to a different page; that is why I've created the vbs code, which handles that. Having checked several other options, this seemed to be the simplest and best one, and it works at least partially. I don't see any security risks, as everything will be done on each client computer; the files themselves will be located on the server. The script itself does not represent any risk, at least not that I can see, but of course I'm not a specialist.
So in short this is what we're trying to do:
Main page -> link to My_projects button -> start script (located on the same server as the main page) -> determine the client computer name -> redirect to the right webpage.
Sorry for a lack of details, I see that it's sometimes hard to explain exactly what you want if you're not a pro in these things.
Thanks again.
If those computers are physically located at your workplace and you have control over the system, it would be better to tweak DNS redirections on those computers. Otherwise, a more general and OS-independent solution would be a session, cookie, or token on the employee's computer. Still, some kind of authentication other than merely possessing a particular machine could be more versatile and secure (unless your PCs are 1000 feet underground :-) ).
Edit: What kind of info/data is sent to the server script? A server script runs on the server, and everything related to "this computer" (e.g. its name) actually refers to the server itself. Thus the script needs some data from the client to recognise the client's computer.
Thanks for the effort.
Everything is actually located on the server, so the client computer only runs the page or interface, which is in \Server\folder\folder for example.
In your browser you open the start page, which contains a button with a link to this script (located on the same server).
When the script executes, it looks up the computer name and sends the user to his personal page:
Set wshShell = CreateObject( "WScript.Shell" )
strComputerName = wshShell.ExpandEnvironmentStrings( "%COMPUTERNAME%" )
On Error Resume Next
'#01 name_surname
If strComputerName = "XXXXXXXX" Then
CreateObject("WScript.Shell").Run """name_surname.html"""
and so on.
And this is all there is. As mentioned before we don't have admin rights to change anything on the client computer. So nothing is being done on the client side other that executing a script located on the server.

Clicking a Javascript link to make a post request in Python

I'm writing a webscraper/automation tool. This tool needs to use POST requests to submit form data. The final action uses this link:
<a id="linkSaveDestination" href='javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("linkSaveDestination", "", true, "", "", false, true))'>Save URL on All Search Engines</a>
to submit data from this form:
<input name="sem_ad_group__destination_url" type="text" maxlength="1024" id="sem_ad_group__destination_url" class="TextValueStyle" style="width:800px;">
I've been using requests and BeautifulSoup. I understand that these libraries can't interact with JavaScript, and people recommend Selenium, but as I understand it Selenium can't do POSTs. How can I handle this? Is it possible to do this without opening an actual browser like Selenium does?
Yes. You can absolutely duplicate what the link is doing by submitting a POST to the proper URL (in reality, this is eventually the same thing that the JavaScript fired by the link click does).
You'll find the relevant section in the requests docs here: http://docs.python-requests.org/en/latest/user/quickstart/#more-complicated-post-requests
So, that'll look something like this for your particular case:
import requests

payload = {'sem_ad_group__destination_url': 'yourTextValueHere'}
r = requests.post("theActionUrlForTheFormHere", data=payload)
If you're having trouble figuring out what URL the form is actually being posted to, just monitor the Network tab (in Chrome dev tools) while you manually click the link yourself; you should be able to find the right request and pull any information you need off of it.
Good Luck!
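One caveat worth noting: since this link is an ASP.NET WebForms postback (WebForm_DoPostBackWithOptions), the server usually also expects the page's hidden state fields (__VIEWSTATE, __EVENTVALIDATION, ...) plus an __EVENTTARGET naming the control. A rough requests + BeautifulSoup sketch along those lines, where the URL is a placeholder:

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/TheFormPage.aspx"  # placeholder; use the real form URL

with requests.Session() as session:
    page = session.get(URL)
    soup = BeautifulSoup(page.text, "html.parser")

    # Copy every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) into the payload.
    payload = {tag["name"]: tag.get("value", "")
               for tag in soup.select("input[type=hidden]") if tag.get("name")}

    payload["__EVENTTARGET"] = "linkSaveDestination"   # the control named in the JS call
    payload["__EVENTARGUMENT"] = ""
    payload["sem_ad_group__destination_url"] = "yourTextValueHere"

    response = session.post(URL, data=payload)
    print(response.status_code)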
With Selenium you mimic real-user interactions in a real browser: tell it to locate an input, type text into it, click a button, and so on. It is a high-level approach; you don't even need to know what is under the hood, you see what a real user sees. The downside is that a real browser is involved, which, at the very least, slows things down. You can, though, automate a headless browser (PhantomJS), or use an Xvfb virtual framebuffer if you can't open a browser with a UI. Example:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('url here')
button = driver.find_element_by_id('linkSaveDestination')
button.click()
With requests + BeautifulSoup, you are going down to the bare metal: using the browser developer tools, you research and analyze what requests are made to the server and mimic them in your code. Sometimes the way a page is constructed and its requests are made is too complicated to automate, or anti-web-scraping techniques are used.
There are pros & cons about both approaches - which option to choose depends on many things.

Is there a way to mitigate downloading of resources (images/css and js files) with Javascript?

I have a html page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"/>
When the button is clicked, the src of the iframe is set to the URL entered in the textbox. The pages fetched this way are very big, with lots of linked files. What I am interested in on the page is a block of text contained in a <div id="description"> element.
Is there a way to mitigate downloading of resources linked in the page that loads into the iframe?
I don't want to use curl because the data is only available to logged-in users and the steps to take with curl to get the content are too complicated. The iframe is simple, as I use this on a box which sends the right cookies to identify the request as coming from a logged-in user, but the problem is that it is very wasteful to fetch nearly 1 MB of data just to keep 1 KB of it and throw out the rest.
Edit
If the proposed method just works in Firefox it is fine, so I added Firefox tag. Also, it is possible that the answer actually is from the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for; rather, the problem is that the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U you are taken to the 'view page source' window. There, links behave as normal and are clickable; if you click on a link in source view, the source of the new page is loaded into the view-source window without the linked resources being downloaded, which is exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on. It somehow kills elements before they get loaded. With plain JavaScript this is not possible, because it is triggered too late to intervene in time.
The Same Origin Policy forbids any web page from accessing the contents of any other web page in a different domain, so basically you cannot do that.
However, it seems that some browsers allow access to other pages' content if you are doing it from a local web page, which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to use one of those browsers to do what you need (note that future versions of those browsers may not allow this anymore).
Apart from this solution I only see two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages get the content you need (for example, by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server-side, logging in from the server directly (I know that you don't want to do so, but I don't see any other possibility if the ones I mentioned previously are not practicable).
Hope it helps.
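As a rough illustration of that second possibility, a server-side fetch in Python could look something like this; the login URL and form field names are placeholders, and only the div id comes from the question:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # Log in once so the session cookie is reused for later requests (placeholder URL and fields).
    session.post("https://example.com/login",
                 data={"username": "user", "password": "secret"})

    page = session.get("https://example.com/the/big/page")
    soup = BeautifulSoup(page.text, "html.parser")

    # Keep only the block of text the question is after.
    description = soup.find("div", id="description")
    print(description.get_text(strip=True) if description else "description not found")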
Actually I've seen Cross Domain jQuery .load request before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that code like this, found on that page,
$('#container').load('http://google.com'); // SERIOUSLY!
$.ajax({
    url: 'http://news.bbc.co.uk',
    type: 'GET',
    success: function(res) {
        var headline = $(res.responseText).find('a.tsh').text();
        alert(headline);
    }
});
// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea)
Apparently it is using YQL wrapped into a jQuery plugin to do the trick. Now I cannot say I fully understand what he is doing there but it appears to work, and fits the bill. Once you load the data I suppose it is a simple matter of filtering out the data that you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentations in its entirety but it should contain the APIs needed for this to work.
There are various ways to go about this with AJAX; I'm going to show the jQuery way for brevity as one option, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
    $("#get_description_button").click(function() {
        $("#description_container").load($("input").val() + " #description");
    });
});
This uses the .load() method, which takes a string in the format .load("url selector"), then takes that element from the fetched page and places its content inside the container you're loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want, but you don't have to do it exactly like this; the concept is getting what you want from an AJAX request rather than from an <iframe>.
Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried to use an async request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.
