I went to Google with Firebug open. I started typing "in" and then checked Firebug's "Net" tab: a couple of new GET requests had been sent to fetch the list of search autocomplete suggestions.
For example:
GET http://clients1.google.com/complete/search?hl=en&client=hp&expIds=17259,17315,23628,24549,26637,26761,26849,26869,27386,27404&q=i&cp=1
But they were classified under the "JS" section rather than as "XHR". Why is this? Isn't Google making an AJAX GET request behind the scenes?
This is almost certainly a JSONP request, used to get around the cross-domain restrictions on XHRs. Essentially, they dynamically insert <script> tags into the page, each pointing at that URL. The browser fetches the response as a script, which is why it shows up under JS in Firebug.
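To make the mechanism concrete, here is a minimal Python sketch of what a JSONP response looks like to a non-browser client: the server wraps a JSON payload in a call to a callback function, and the payload can be recovered by stripping that wrapper. The callback name `handleSuggestions` and the sample payload are made up for illustration; they are not what Google actually returns.

```python
import json
import re

# Matches "callbackName( ...payload... )" with an optional trailing semicolon.
_JSONP_RE = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Return (callback_name, decoded_payload) for a JSONP response body."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    callback, payload = match.groups()
    return callback, json.loads(payload)


# Hypothetical response body, shaped like a script the browser would execute.
sample = 'handleSuggestions(["in", ["instagram", "indeed"]]);'
callback, data = unwrap_jsonp(sample)
print(callback, data)
```

In the browser no unwrapping is needed: the inserted script simply runs and calls the named callback with the data, which is what makes the technique work across domains.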
I'm currently working on a project to track products from several websites. I use a Python scraper to retrieve all the URLs related to the listed products, and later regularly check whether these URLs are still active.
To do so I use the Python requests module, run a get request and look at the response's status code. Usually I get 200, 301, 302 or 404 as expected, except in the following case:
http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006
This product has been removed and while opening the link (sorry it's in French), I am briefly shown a placeholder page saying the product is not available anymore and then redirected to the home page (www.sephora.fr).
Oddly, Python still returns a 200 status code and so do various redirect tracers such as wheregoes.com or redirectdetective.com. The worst part is that the response URL still is the original, so I can't even trace it that way.
When analyzing with Chrome DevTools and preserving the logs, I see that at some point the page is reloaded. However I'm unable to find out where.
I'm guessing this is done client-side via JavaScript, but I'm not quite sure how. Furthermore, I really need to be able to detect this change from within Python.
As a reference, here's a link to a working product:
http://www.sephora.fr/Parfum/Parfum-Femme/Kenzo-Jeu-d-Amour-Eau-de-Parfum/P1894014
Any leads?
Thank you !
Ludwig
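The page has a meta refresh tag that redirects the browser to the root URL. This is an HTML-level redirect that the browser performs after loading the page, not an HTTP redirect, which is why the status code is still 200 and the response URL is unchanged: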
<meta http-equiv="refresh" content="0; URL=/" />
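Since the question asks how to detect this from within Python, here is a minimal sketch using only the standard library. It looks for a `<meta http-equiv="refresh">` tag in the HTML and resolves its target against the page URL. The names `MetaRefreshFinder` and `find_meta_refresh` are my own, and the sketch assumes the `content` attribute has the usual `delay; URL=target` shape.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Collects the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; URL=/": a delay, then an optional URL part.
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, url = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = url.strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute redirect target, or None if there is no meta refresh."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target is None:
        return None
    return urljoin(base_url, finder.target)


page = '<html><head><meta http-equiv="refresh" content="0; URL=/" /></head></html>'
print(find_meta_refresh(page, "http://www.sephora.fr/Parfum/P2232006"))
```

With the requests module you would pass `response.text` and `response.url` to `find_meta_refresh`; a non-None result means the product page is really a redirect placeholder.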
I am trying to include a widget in my webpage. The code for the widget is loaded dynamically with ajax (because it changes often and I need to update it from the server) and it looks like this ...
<a class="e-widget" href="https://gleam.io/0oIpw/contest-widget" rel="nofollow">This is a Widget!</a>
<script type="text/javascript" src="https://js.gleam.io/e.js" async="true"></script>
on load, I get the following errors in the console...
OPTIONS https://js.gleam.io/e.js 404 (Not Found)
XMLHttpRequest cannot load https://js.gleam.io/e.js. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'https://localhost:8443' is therefore not allowed access. The response had HTTP status code 404.
If I remove the ajax call that loads the data for the widget, and instead insert the widget directly, I do not get the same errors and the widget works fine.
I have read into this and figure that it is due to the Same-Origin-Policy (SOP), so I am now wondering the best way to circumvent the policy.
I have read the post Ways to circumvent the same-origin policy but unfortunately did not find it helpful in this case.
Since CORS is done on the server side (I think?) and JSONP is insecure, is creating a proxy the best option?
Thanks so much for the help. I have spent quite a few hours researching this and I am still confused.
Edited to add code for more info :
The information for the page is loaded via ajax when a command link is clicked as follows :
<h:commandLink action="#{redeemPerk.getDisplay(display.displayId)}" >
<h:graphicImage value="#{display.imgUrl}" styleClass="display-icon"/>
<f:ajax event="click" execute="#form" render="redeem-display-data-reveal" listener="#{redeemPerk.getDisplay(display.displayId)}" onevent="handleAjax"/>
</h:commandLink>
this renders the area that displays the widget, which looks like ...
<div class="reveal-modal-background hidden">
<h:form id="redeem-display-data-reveal">
<h:panelGroup rendered="#{display.type == 'WIDGET'}">
<a class="e-widget" href="https://gleam.io/0oIpw/contest-widget" rel="nofollow">This is a Widget!</a>
<script type="text/javascript" src="https://js.gleam.io/e.js" async="true"></script>
</h:panelGroup>
</h:form>
</div>
The second chunk of code is in a separate file from the first. To reiterate, if I remove the ajax call and load the data directly the widget works fine.
I see two things in your console output that could be causing the issue.
First, the request received a 404 response, which means the JavaScript has probably not been uploaded properly.
Second, the origin of the request is localhost:8443, which leads me to believe that you are running the code locally rather than from the Internet.
If you are loading a plugin from the Internet while testing your code locally, you will still get an SOP error. To fix this, upload your code to your web server and then load the page from there instead of from your local copy. That should fix the SOP error.
Our static content server serves images for various web portals (it is a black box to me). On the web portals all images load fine even though they come from a different domain (I assume the static server sets the HTTP headers accordingly). However, if I try to access the same image from the browser console via an Ajax call (using jQuery or XMLHttpRequest), I get a cross-domain failure: the request succeeds, but the browser denies access to the response. Below is a simple JSFiddle that shows the problem.
JSfiddle for image coming in dom but ajax call failing
/*Image tag works fine*/
<img src='https://casinogames.bwin.com/htmllobby/images/gameicon/melonmadness.jpg' />
/*ajax call fails*/
var a = $.ajax('https://casinogames.bwin.com/htmllobby/images/gameicon/melonmadness.jpg');
I verified the request and response headers and they are exactly the same in both scenarios. I want to know if there is any specific difference between a request from an image tag and one from Ajax. I tried both the IE console and the Chrome console, with the same results.
I want to know if there is any specific difference between request from image tag and ajax?
There won't be anything significantly different about the request, but either way you do it, the browser will not make the image data available to JavaScript when the request is a cross-origin one.
Displaying an image to the user from an img element is not a security risk.
Giving JavaScript written by a third party access to data from another server is a security risk.
You cannot make Ajax calls to a different domain in the normal way.
Here is a discussion about it.
You can look it up as "cross domain ajax calls".
Edit
Show a remote image via jQuery like this:
var a = $('img').prop('src', 'http://placehold.it/10x10');
I am scraping profiles on ask.fm for a research question. The problem is that only the top most recent questions are viewable and I have to click "view more" to see the next 15.
The source code for clicking view more looks like this:
<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />
What is an easy way of calling this 4 times before scraping the page? I want the most recent 60 posts on the site. Python is preferable.
Without using Headless Browser
Open the chrome debugger tools -> Network Tab.
Now click on the View More button.
Inspect the request being fired in the Network dialog when you click on View More.
In most cases the data is loaded from an external API, so check whether the request to the API is a GET or a POST request and what kind of response it returns.
There may be a limit or similar query parameter passed to that URL to cap the number of response objects. In your case, it would be 15.
Try making the request to the same URL from your script with the limit increased to, say, 60, and check the response.
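The last step can be sketched in Python as follows. This only shows how to rewrite a query parameter on a URL copied from the Network tab; the endpoint `https://example.com/api/answers` and the parameter name `limit` are placeholders, since the real URL and parameter names have to be read from the request the button actually fires.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url, name, value):
    """Return url with query parameter `name` set to `value` (added if absent)."""
    parts = urlsplit(url)
    params = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    params.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical URL, standing in for the one observed in the Network tab.
observed = "https://example.com/api/answers?user=someone&limit=15"
print(with_query_param(observed, "limit", 60))
```

The resulting URL can then be fetched with whatever HTTP client the scraper already uses. If the server ignores the larger limit, look for a page or offset parameter instead and request four pages of 15.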
The above technique works in most cases. But if this doesn't work for you try the following steps.
Using Headless Browser
Use a headless browser, which loads the dynamic content and gives you methods to scroll down, click, and so on.
Examples of tools for this are Selenium (which drives a real or headless browser), Splash, PhantomJS, SlimerJS, etc.
You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:
https://pypi.python.org/pypi/selenium
Or you might be able to do it with mechanize:
http://wwwsearch.sourceforge.net/mechanize/
I have also heard good things about twill, but never used it myself:
http://twill.idyll.org/
I have a html page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"></iframe>
When the button is clicked, the src of the iframe is set to the URL entered in the text box. The pages fetched this way are very big, with lots of linked files. What I am interested in on the page is a block of text contained in a <div id="description"> element.
Is there a way to mitigate downloading of resources linked in the page that loads into the iframe?
I don't want to use curl because the data is only available to logged-in users and the steps needed to get the content with curl are too complicated. The iframe is simple because I use it on a box which sends the right cookies to identify the request as coming from a logged-in user, but the problem is that it is very wasteful to fetch nearly 1 MB of data, keep 1 KB of it and throw out the rest.
Edit
If the proposed method just works in Firefox it is fine, so I added Firefox tag. Also, it is possible that the answer actually is from the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for, rather, the problem is the easy iframe method is wasteful.
I know that Firefox does allow loading only the text of a page. If you open a page and press Ctrl+U, you are taken to the 'view page source' window. There, links behave as normal and are clickable: if you click a link in the source view, the source of the new page is loaded into the view source window without the linked resources being downloaded, which is exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on. It somehow kills elements before they get loaded. With plain JavaScript this is not possible, because it is triggered too late to intervene in time.
The Same Origin Policy forbids any web page to access contents of any other web page in a different domain so basically you cannot do that.
However it seems that with some browsers it is allowed to access web pages content if you are trying to access it from a local web page which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to use one of those browsers to do what you need (note that future versions of those browsers may no longer allow this).
Apart from this solution, I see only two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface that lets other web pages get the content you need (for example, by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server-side, logging in from the server directly (I know that you don't want to do so, but I don't see any other possibility if the previous options are not practicable).
Hope it helps.
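For the server-side option, once the raw HTML has been fetched (which by itself does not download any linked resources), the block of interest can be pulled out with the standard library. This is a minimal sketch assuming the target is a `<div>` with `id="description"` as in the question; the names `DivTextExtractor` and `extract_div_text` are my own, and the login and fetch steps are left out.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collects the text inside the first <div> with the given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0  # nesting depth of <div> tags, counted from the target
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1  # entering the target <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # left the target; ignore everything after

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, div_id="description"):
    """Return the whitespace-normalised text of the div with the given id."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())


sample = '<div id="other">x</div><div id="description">Hello <div>nested</div> world</div>'
print(extract_div_text(sample))
```

Tracking the depth of nested `<div>` tags is what lets this cope with a description block that itself contains other divs, which a simple regex would get wrong.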
Actually, I've seen cross-domain jQuery .load requests before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that code like this, found on that page,
$('#container').load('http://google.com'); // SERIOUSLY!
$.ajax({
url: 'http://news.bbc.co.uk',
type: 'GET',
success: function(res) {
var headline = $(res.responseText).find('a.tsh').text();
alert(headline);
}
});
// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea)
Apparently it is using YQL wrapped into a jQuery plugin to do the trick. Now I cannot say I fully understand what he is doing there but it appears to work, and fits the bill. Once you load the data I suppose it is a simple matter of filtering out the data that you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentation in its entirety, but it should contain the APIs needed for this to work.
There are various ways to go about this in AJAX, I'm going to show the jQuery way for brevity as one option, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
$("#get_description_button").click(function() {
$("#description_container").load($("#url").val() + " #description");
});
});
This uses the .load() method, which takes a string in the format .load("url selector"), fetches the page, takes the matching element, and places its content inside the container you're loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want. You don't have to do it exactly like this; the concept is to get what you want from an AJAX request rather than from an <iframe>.
Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried an asynchronous request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.