I'm trying to scrape a site in Node.js. I followed a great tutorial, but I've realized it might not be what I'm looking for; that is, I may need to scrape the JavaScript portion of the page instead of the HTML one.
Is that possible?
The reason is that I want to load the content of the portion of code below, which I found by inspecting a kayak.com page in Safari (it does not show in Chrome; see the URL below) and which seems to sit in a script section.
reducer: {"reducerPath":"flights\/results\/react\/reducers\/
https://www.kayak.com/flights/TYO-PAR/2019-07-05-flexible/2019-07-14-flexible/1adults/children-11?fs=cfc=1;legdur=-960;stops=~0;bfc=1&sort=bestflight_a&attempt=2&lastms=1550392662619
UPDATE: Unfortunately, this site uses bot/scrape protection: tools like curl get a page with a bot warning, and headless-browser tools like puppeteer get a page with a CAPTCHA.
===============
Since this line is present in the HTML source code and is not added dynamically by JavaScript execution, you can use something like the following with the appropriate library API:
// Collect the text of every <script> tag, find the one that contains a
// marker substring, then pull the data out with a regular expression.
const extractedString = [...document.querySelectorAll('script')]
    .map(({ textContent }) => textContent)
    .find(txt => txt.includes('string')) // marker, e.g. 'reducerPath'
    .match(/regexp/); // replace with the pattern for the data you need
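If you end up doing the extraction outside the browser instead, the same script-scanning idea can be sketched in Python with requests and a regular expression. This is only a sketch: the "reducerPath" marker comes from the snippet in the question, and per the UPDATE above, the site's bot protection may serve a CAPTCHA page rather than the real markup.
import re
import requests

# Fetch the raw HTML (use the full Kayak URL from the question).
html = requests.get("https://www.kayak.com/flights/...").text

# Find the inline-script marker quoted in the question and capture what
# follows it; adjust the pattern to the piece of data you actually need.
match = re.search(r'"reducerPath":"(.*?)"', html)
if match:
    print(match.group(1))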
Related
I am trying to create a script to download an ebook as a PDF. When I try to use BeautifulSoup to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled JavaScript in Chrome, and this same piece of code works for a page like a Stack Overflow answer page. What could be blocking JavaScript on this page, and how can I bypass it?
My code for reference:
import bs4
import requests

# Fetch the page and parse its HTML.
url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page actually contains no content; it needs to run some JS code to load it. The requests.get method does not run JS; it just loads the basic HTML.
What you need to do is to emulate a browser, i.e. 'open' the page, run JS, and then scrape content. One way to do it is to use a browser driver as described here - https://stackoverflow.com/a/57912823/9805867
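As a minimal sketch of that approach with selenium (one common browser-driver option), assuming you are logged in within the driven browser; the wait condition is illustrative and may need adjusting for this reader:
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://platform.virdocs.com/r/s/0/doc/350551/...")  # full reader URL from the question

# Wait until the JS-rendered paragraphs actually exist in the DOM.
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "p")))

# Hand the rendered HTML to BeautifulSoup exactly as before.
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
print(soup.select("p")[0].getText())
driver.quit()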
In order to triage a problem with a web browser, I am trying to determine the initiator of the XXX-xsrfstatemanager.js file (the XXX part seems to be something dynamic, like a nonce) that occurs as part of a Google authentication flow (using OAuth).
When I use Chrome developer tools, it says the below URL is the initiator:
https://accounts.google.com/o/oauth2/v2/auth?approval_state=%21Ch[REDACTED]Q%E2%88%99AJ[REDACTED]xq&as=-aBk[REDACTED]
Looking at the result of the above page, I see a lot of JavaScript, but the string "xsrfstatemanager" is nowhere to be found, nor do I see any other JavaScript files being included. Unless there is some really cryptic code that somehow builds this URL, the call is actually coming from some other page.
Does anyone know how I can get the 'real' initiator? Or, if the above URL is correct, can I get more information, such as the exact line number in the file that initiated the call?
By the way, while I redacted the above URL for security reasons, if you go to (for example) www.quora.com and click "Continue with Google", it is easy to see the flow in question.
The flow includes a redirection, which is why you cannot see the source code that initiates/references that script.
If you view the source of the original URL that is opened when you click on "Continue with Google", you will see the <script src> that references it. This works in Chrome and probably Safari -
view-source:https://accounts.google.com/o/oauth2/auth?redirect_uri=storagerelay%3A%2F%2Fhttps%2Fwww.quora.com%3Fid%3Dauth488109&response_type=code%20permission%20id_token&scope=email%20profile%20openid&openid.realm=&client_id=917071888555.apps.googleusercontent.com&ss_domain=https%3A%2F%2Fwww.quora.com&access_type=offline&include_granted_scopes=true&prompt=select_account&origin=https%3A%2F%2Fwww.quora.com&gsiwebsdk=2
From the source code -
<script src='https://ssl.gstatic.com/accounts/o/532969778-xsrfstatemanager.js' nonce="IgiKmQiLZIHDwGvce7/q6Q"></script>
You can also use tools like Fiddler to see the source code of the redirect, check "Preserve log" in the Network panel of Chrome's developer tools, or go to the original URL with JavaScript disabled.
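For the last option outside the browser, here is a hedged Python sketch: requests records intermediate redirect responses in r.history, so you can check each hop for the script reference (assuming the relevant redirects are HTTP redirects rather than JS-driven navigations):
import requests

# Follow the OAuth URL (the full one from above) and keep every redirect hop.
r = requests.get("https://accounts.google.com/o/oauth2/auth?...", allow_redirects=True)

# Search each intermediate response, plus the final page, for the script.
for resp in r.history + [r]:
    if "xsrfstatemanager" in resp.text:
        print("referenced by:", resp.url)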
I have been trying to:
1. Go to mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx
2. Enter a certificate number (for the sake of illustration, you can just search for "Davidson" as the last name).
3. Click on a link corresponding to "Professional Teaching Certificate".
4. Copy and paste the resulting table.
The rub seems to be with the JavaScript __doPostBack() part, as I believe it requires rendering to get the data.
When viewing the source code, you can see how the href attribute identifies an individual link like this (for the 6th link down):
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1','')"
From this:
<td class="MOECSNormal" style="border-color:Black;border-width:1px;border-style:Solid;">Professional Teaching Certificate Renewal</td><td class="MOECSNormal" style="border-color:Black;border-width:1px;border-style:Solid;">
<a id="ContentPlaceHolder1_gViewCredentialSearchList_link1_5" ItemStyle-BorderColor="Black" ItemStyle-BorderStyle="Solid" ItemStyle-BorderWidth="1px" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gViewCredentialSearchList$ctl07$link1','')">CC-XWT990004102</a>
</td>
I'm looking for a way (via Python) to get the data I need into a table, given a certification number and a certificate name (i.e. "Professional Teaching Certificate").
I have tried following a tutorial using PyQt4, but installing it alone was traumatic.
Thanks in advance!
You can open the page in a browser (e.g. Chrome) and study how the interaction between the page and the server is done; this can normally be seen in the Network tab of the developer tools. This way you can formulate a Python script that reproduces the steps, perhaps using the requests library.
or
You can use selenium-python to simulate your browser interaction (including the JavaScript calls) until you get to the page that contains the data you are interested in.
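A minimal selenium-python sketch of the second option. The link ID is taken from the source snippet in the question; the search-form element IDs are assumptions, so inspect the live page to confirm them:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx")

# Fill in the last name and submit (these two IDs are hypothetical).
driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_txtLastName").send_keys("Davidson")
driver.find_element(By.ID, "ctl00_ContentPlaceHolder1_btnSearch").click()

# Clicking the link fires the JavaScript __doPostBack for us
# (this ID comes from the question's source snippet).
driver.find_element(By.ID, "ContentPlaceHolder1_gViewCredentialSearchList_link1_5").click()

# Grab the rendered results table.
print(driver.find_element(By.TAG_NAME, "table").text)
driver.quit()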
On this Yelp page:
http://www.yelp.com/search?find_desc=auto+repair&find_loc=70163&ns=1#l=g:-90.1266860962,29.9067341681,-90.0243759155,29.9959757119
The first result is GR Automotive. But when I do View Page Source and Ctrl+F for GR Automotive I get no results.
I believe this is because the text I want is generated by javascript.
How can I view the new page source which is generated by javascript?
I need to be able to manipulate the data on the page, but it's not in the HTML source, and I don't want to use the API since the main portion of my code is in AutoHotkey. The URL version of the Yelp API also doesn't seem to work with the sample code.
Answer based on your question title:
This question does not appear to be about programming as such, but you need to view the information a different way in order to see the live DOM. Instead of "View Page Source", use "Inspect Element".
Answer based on your edited question:
In order to manipulate Yelp listings, you will need the Yelp API.
General documentation
Business API
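As a rough sketch of what an API call looks like, here is the current Yelp Fusion (v3) search endpoint in Python; this assumes a v3 API key, and the endpoint and auth scheme have changed between Yelp API versions, so check the documentation linked above:
import requests

# Hypothetical key; obtain a real one from Yelp's developer site.
headers = {"Authorization": "Bearer YOUR_API_KEY"}
params = {"term": "auto repair", "location": "70163"}

r = requests.get("https://api.yelp.com/v3/businesses/search",
                 headers=headers, params=params)
for business in r.json()["businesses"]:
    print(business["name"])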
After your post on http://ahkscript.org:
Remember that View Page Source does not give you the live source.
I still took a look at it... and it can be done just fine with a normal IE browser COM object from AHK.
Example:
url := "http://www.yelp.com/search?find_desc=auto+repair&find_loc=70163&ns=1#l=g:-90.1266860962,29.9067341681,-90.0243759155,29.9959757119"

; Create an IE COM object and load the page.
wb := ComObjCreate("InternetExplorer.Application")
wb.visible := true
wb.Navigate(url)

; Wait for the initial document to finish loading.
while wb.readyState != 4 || wb.document.readyState != "complete" || wb.busy
    continue
sleep 100

; Wait until Yelp's loading overlay is hidden, i.e. the JS results are in.
while (wb.document.getElementsByClassName("throbber-overlay")[0].style.display != "none")
    continue

msgbox % wb.document.getElementsByClassName("natural-search-result")[0].innertext
return
I don't really know what you tried before, but with an IE COM object you can access the DHTML without much hassle.
You just always need to wait long enough for the elements you need to fully load before trying to access them this way.
I have an HTML page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"/>
When the button is clicked, the src of the iframe is set to the URL entered in the textbox. The pages fetched this way are very big, with lots of linked files. What I am interested in on the page is a block of text contained in a <div id="description"> element.
Is there a way to prevent the downloading of resources linked from the page that loads into the iframe?
I don't want to use curl because the data is only available to logged-in users and the steps needed with curl to get the content are too complicated. The iframe is simple, as I use this on a box which sends the right cookies to identify the request as coming from a logged-in user, but the problem is that it is very wasteful to fetch nearly 1 MB of data only to keep 1 KB of it and throw out the rest.
Edit
If the proposed method works only in Firefox, that is fine, so I added the Firefox tag. Also, it is possible that the answer actually lies in the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for; rather, the problem is that the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U you are taken to the 'view page source' window. There, links behave as normal and are clickable; if you click on a link in source view, the source of the new page is loaded into the view-source window without the linked resources being downloaded, which is exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on: it somehow kills elements before they get loaded. With plain JavaScript this is not possible, because a script is triggered too late to intervene in good time.
The Same Origin Policy forbids a web page from accessing the contents of any web page on a different domain, so basically you cannot do that.
However, it seems that some browsers allow access to another page's content if you are doing so from a local web page, which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to use one of those browsers for this (note that future versions of those browsers may no longer allow it).
Apart from this solution, I only see two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages get the content you need (for example, by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server-side, logging in from the server directly (I know that you don't want to do this, but I don't see any other possibility if the options above are not practicable).
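For the second possibility, a hedged Python sketch: a requests.Session keeps the login cookies across calls, and only the one HTML document is downloaded, never its linked resources. The login URL and form field names here are placeholders for whatever the real site uses:
import bs4
import requests

session = requests.Session()

# Log in once; the URL and field names are placeholders.
session.post("https://example.com/login", data={"user": "...", "pass": "..."})

# Fetch just the HTML document; linked images/CSS/JS are never requested.
html = session.get("https://example.com/some/page").text
description = bs4.BeautifulSoup(html, "html.parser").find("div", id="description")
print(description.get_text())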
Hope it helps.
Actually, I've seen cross-domain jQuery .load requests before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that code like the following, found on that page, would work:
$('#container').load('http://google.com'); // SERIOUSLY!

$.ajax({
    url: 'http://news.bbc.co.uk',
    type: 'GET',
    success: function(res) {
        var headline = $(res.responseText).find('a.tsh').text();
        alert(headline);
    }
});

// Works with $.get too!
(The BBC code might not work because of the recent redesign, but you get the idea.)
Apparently it uses YQL wrapped in a jQuery plugin to do the trick. Now, I cannot say I fully understand what he is doing there, but it appears to work and fits the bill. Once you load the data, I suppose it is a simple matter of filtering out what you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentation in its entirety, but it should contain the APIs needed for this to work.
There are various ways to go about this with AJAX; I'm going to show the jQuery way for brevity as one option, though you could do it in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, say a <div>, like this:
<div id="description_container"></div>
Then to load it:
$(function() {
    $("#get_description_button").click(function() {
        // Load only the #description element from the entered URL.
        $("#description_container").load($("#url").val() + " #description");
    });
});
This uses the .load() method, which takes a string in the format .load("url selector"); it takes that element from the fetched page and places its content inside the container you are loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want, but you don't have to do it exactly like this; the concept is getting what you want from an AJAX request rather than from an <iframe>.
Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried an async request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.