I'm currently working on a project to track products from several websites. I use a Python scraper to retrieve all the URLs for the listed products and then regularly check whether those URLs are still active.
To do so I use the Python requests module: I send a GET request and look at the response's status code. Usually I get 200, 301, 302 or 404 as expected, except in the following case:
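For reference, here is roughly what my check looks like (a minimal sketch of my actual script; the timeout value is arbitrary):
import requests

def check_url(url):
    # requests follows HTTP redirects by default; any intermediate
    # 301/302 responses end up in response.history.
    response = requests.get(url, timeout=10)
    for hop in response.history:
        print(hop.status_code, hop.url)
    print(response.status_code, response.url)

check_url("http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006")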
http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006
This product has been removed, and when opening the link (sorry, it's in French) I am briefly shown a placeholder page saying the product is no longer available, then redirected to the home page (www.sephora.fr).
Oddly, Python still returns a 200 status code, and so do various redirect tracers such as wheregoes.com or redirectdetective.com. The worst part is that the response URL is still the original one, so I can't even trace it that way.
When analyzing with Chrome DevTools with logs preserved, I see that at some point the page is reloaded; however, I'm unable to find out where.
I'm guessing this is done client-side via JavaScript, but I'm not quite sure how. Furthermore, I'd really need to be able to detect this change from within Python.
As a reference, here's a link to a working product:
http://www.sephora.fr/Parfum/Parfum-Femme/Kenzo-Jeu-d-Amour-Eau-de-Parfum/P1894014
Any leads?
Thank you!
Ludwig
The page has a meta refresh tag that redirects the browser to the root URL:
<meta http-equiv="refresh" content="0; URL=/" />
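requests only follows HTTP-level redirects, so a meta refresh leaves the status code at 200. You can detect it from Python by scanning the response body for that tag; here is a minimal sketch using the standard library (the parser class name is my own):
import requests
from html.parser import HTMLParser

class MetaRefreshParser(HTMLParser):
    # Records the content attribute of any <meta http-equiv="refresh"> tag.
    def __init__(self):
        super().__init__()
        self.refresh_target = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and (attrs.get('http-equiv') or '').lower() == 'refresh':
            self.refresh_target = attrs.get('content')

response = requests.get('http://www.sephora.fr/Parfum/Parfum-Femme/Totem-Orange-Eau-de-Toilette/P2232006')
parser = MetaRefreshParser()
parser.feed(response.text)
if parser.refresh_target is not None:
    print('meta refresh found:', parser.refresh_target)  # e.g. "0; URL=/"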
I have a certification and a badge provided by Acclaim. I want to embed it in my personal website, but it's not working. Here's the code they provided:
<div data-iframe-width="150" data-iframe-height="270" data-share-badge-id="60615e70-6409-4752-9d77-3553a43d13d2" data-share-badge-host="https://www.youracclaim.com"></div>
<script type="text/javascript" async src="//cdn.youracclaim.com/assets/utilities/embed.js"></script>
but even when it's simply put onto an empty html:5 page, I get the error: Loading failed for the <script> with source “file:///assets/utilities/embed.js”.
What's the problem here? I'm not sure how Acclaim can provide a ready-to-paste script that simply doesn't work; nothing shows up on the website. I'm guessing the problem is in the src... part, but I don't know how to fix it.
If you're loading your page via the file: protocol, then protocol-relative URLs aren't going to work. The script tag has:
src="//cdn.youracclaim.com/assets/utilities/embed.js"
This should be changed to:
src="https://cdn.youracclaim.com/assets/utilities/embed.js"
You'll find, though, that when you're using an actual web server, this is a non-issue. The reason for protocol-relative URLs was so that HTTP pages would load the HTTP version of a resource and HTTPS pages the HTTPS version. This method is outdated anyway: HTTPS should be used everywhere, and it's fine to load HTTPS JavaScript even from an HTTP page.
I'm working on SEO and trying to see how Googlebot renders my pages (an Angular application). When I check 'Fetch and Render' on various URLs, the images (the "this is how a user sees your site / this is how Googlebot sees your site" previews) are never returned and the status says 'Partial'. Further, when I check the "Downloaded HTTP Response", every URL has the same index.html-looking format to it, i.e. the pages aren't being rendered.
I've been reading a lot on SO as well as other sites about Googlebot being able to render JS, as well as the old-school
<meta name="fragment" content="!">
technique (discussed in this SO post).
I'm definitely willing to try the 'older technique' even if Google advises against it, but I'd like to know why my site isn't being crawled correctly in the first place. Is it potentially a timing issue? Should I be using prerender.io?
Or, have I been overlooking something simpler? (fingers crossed)
Say I typed a bad hostname into the address bar.
For example, say I wasn't running a local web server, and I load:
http://localhost/callback_url
In Chrome, this gives me a "This webpage is not available" message.
Is there any way I can find out from the JavaScript console what URL is in the address bar, even though the page failed to load?
I know I can normally use window.location.href to get this, but that returns "data:text/html,chromewebdata" in this instance.
So in this example, I'd like to know if there's some JavaScript that returns http://localhost/callback_url
EDIT: The main reason I'd like to do this is to know whether a server-side redirect failed when using ChromeDriver with Selenium, so I'd prefer to avoid extensions if possible and am open to Chrome- and ChromeDriver-specific solutions. The callback_url may have extra info in it, added by the server, and I'd like to see what that info is. I'd also like to avoid running another server to get this data if possible.
The loadTimeData object included in the ERR_CONNECTION_REFUSED page has the failed URL:
> loadTimeData.data_.summary.failedUrl
"http://localhost/foo?request_url=bar"
You can get it from the title of the page: by typing document.title and doing some regex you can extract the URL.
Another way I found is to use the following:
var data = loadTimeData.createJsEvalContext();
console.log(data.a.$top.summary.failedUrl);
If you open the developer tools and search for part of the URL in the source code, you will see that Chrome creates the loadTimeData object in the "not available" page.
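Since the question mentions ChromeDriver and Selenium, you can evaluate the same expression directly from the driver. A minimal sketch (note that loadTimeData is an internal of Chrome's error page, so this may break between Chrome versions):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://localhost/callback_url')  # no server running, so the error page loads

# Same expression as above, guarded in case the error page changes.
failed_url = driver.execute_script(
    "return window.loadTimeData ? loadTimeData.data_.summary.failedUrl : null;")
print(failed_url)  # e.g. "http://localhost/callback_url"
driver.quit()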
I've read several of the questions on this but am still a little confused. (I can't post example links because of hyperlink limitations.)
Here is my exact situation.
I have a site at mydomain.com
One of the pages has an iframe to another page at sub.mydomain.com
I am trying to prepare an onload script so that if the page is not in an iframe, or the parent domain of the page containing the iframe is not mydomain.com, it redirects to mydomain.com.
After the initial permission issues, I realised the problem with subdomains counting as separate domains.
One of the posts above says that "could each use either foo.mydomain.com or just mydomain.com"
So I tried (for testing):
onload="document.domain='mydomain.com';alert(parent.location.href);"
This produced the error (http replaced to get around the hyperlink limitations):
Error: Permission denied for <http://sub.mydomain.net> (document.domain=<http://mydomain.net>) to get property Location.href from <http://mydomain.net> (document.domain has not been set).
Source File: http://sub.mydomain.net/?pageID=1&framed=1
Line: 1
Removing the alert produces no errors.
Maybe I am going about this the wrong way, since I do not need to interact with the parent, just read its domain if there is one. A nice simple top.domain. For read-only access there must be a way, so that people can prevent their own pages from being used within other people's sites.
You can't (easily) do this because of security restrictions.
This answer from #2771397 might point you in the right direction.
OK, while looking at the error console I still had open when I got home, a wee lightbulb lit up. I am pretty new to JavaScript (can you tell? ;)) but I thought, "If it has try/catch..."
Well, here is a hack to at least get the name of the top domain, and an example of how I will use it in my site to show content only if the page is a frame in the correct domain.
Firstly, the header will have the following partially PHP-generated function:
function getParentDomain()
{
    try
    {
        // Accessing top.location.href throws a security exception
        // when the top window is on a different domain.
        var wibble = top.location.href;
    }
    catch (err)
    {
        // Firefox includes the blocked URL in the exception message,
        // so we can check which domain the parent belongs to.
        if (err.message.indexOf('http://mydomain.com') != -1)
        {
            createCookie('IAmAWomble', 'value'); // helper defined elsewhere
        }
    }
}
Basically, the value will be something based on the PHP session, I think. This will be executed at page load.
If the page is not within the proper site, or if JavaScript is not enabled, then the cookie will not be created.
PHP will then attempt to read the correct value from the cookie and show the content or an error message as appropriate.
I do see a slight flaw in this for the first visit, since the onload script runs after PHP has generated the content, but I'm sure I can work around this somehow. I thought I'd post it because it is at least what I was initially asking for: a way to read the URL of a parent site when it is in a different domain to the site in the frame.
IIUC you want to use the window.parent attribute: “A reference to the parent of the current window or subframe.”
Presumably, window.parent.document.location.host contains the domain name of the containing page.
I have an HTML page on my localhost - get_description.html.
The snippet below is part of the code:
<input type="text" id="url"/>
<button id="get_description_button">Get description</button>
<iframe id="description_container" src="#"></iframe>
When the button is clicked, the src of the iframe is set to the URL entered in the text box. The pages fetched this way are very big, with lots of linked files. All I'm interested in on the page is a block of text contained in a <div id="description"> element.
Is there a way to prevent the resources linked from the page from being downloaded when it loads into the iframe?
I don't want to use curl because the data is only available to logged-in users, and the steps needed to get the content with curl are too complicated. The iframe is simple, as I use this on a box which sends the right cookies to identify the request as coming from a logged-in user; the problem is that it is very wasteful to fetch nearly 1 MB of data just to keep 1 KB of it and throw out the rest.
Edit
If the proposed method only works in Firefox, that is fine, so I added the Firefox tag. Also, it is possible that the answer lies in the realm of Firefox add-on techniques, so I added that tag as well.
The problem is not that I cannot get at what I'm looking for; rather, the problem is that the easy iframe method is wasteful.
I know that Firefox allows loading only the text of a page. If you open a page and press Ctrl+U you are taken to the 'view page source' window. There, links behave as normal and are clickable; if you click on a link in source view, the source of the new page is loaded into the view-source window without the linked resources being downloaded, which is exactly what I'm trying to get. But I don't know how to access this behaviour.
Another example is the Adblock add-on: it somehow kills elements before they get loaded. With plain JavaScript this is not possible, because it is triggered too late to intervene in good time.
The Same Origin Policy forbids any web page from accessing the contents of a web page in a different domain, so basically you cannot do that.
However, it seems that some browsers allow access to another page's content if you are trying to access it from a local web page, which seems to be your case.
Safari and IE 6/7/8 are browsers that allow a local web page to do so via XMLHttpRequest (source: Google Browser Security Handbook), so you may want to use one of those browsers for this (note that future versions of those browsers may no longer allow it).
Apart from this solution, I only see two possibilities:
If the web pages you need to fetch content from are somehow controlled by you, you can create a simpler interface to let other web pages get the content you need (for example, by allowing JSONP requests).
If the web pages you need to fetch content from are not controlled by you, the only solution I see is to fetch the content server-side, logging in from the server directly (I know that you don't want to do so, but I don't see any other possibility if the previous ones are not practicable).
Hope it helps.
Actually, I've seen cross-domain jQuery .load() requests before, here: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
The author claims that code like this, found on that page,
$('#container').load('http://google.com'); // SERIOUSLY!

$.ajax({
    url: 'http://news.bbc.co.uk',
    type: 'GET',
    success: function(res) {
        var headline = $(res.responseText).find('a.tsh').text();
        alert(headline);
    }
});

// Works with $.get too!
would work. (The BBC code might not work because of the recent redesign, but you get the idea.)
Apparently it uses YQL wrapped in a jQuery plugin to do the trick. Now, I cannot say I fully understand what he is doing there, but it appears to work and fits the bill. Once you load the data, I suppose it is a simple matter of filtering out the parts you need.
If you prefer something that works at the browser level, may I suggest Mozilla's Jetpack framework for lightweight extensions. I've not yet read the documentation in its entirety, but it should contain the APIs needed for this to work.
There are various ways to go about this with AJAX; I'm going to show the jQuery way for brevity, though you could do this in vanilla JavaScript as well.
Instead of an <iframe> you can just use a container, let's say a <div> like this:
<div id="description_container"></div>
Then to load it:
$(function() {
    $("#get_description_button").click(function() {
        // Use the value of the #url input from the question's markup.
        $("#description_container").load($("#url").val() + " #description");
    });
});
This uses the .load() method, which takes a string in the format .load("url selector"): it fetches the URL, finds the matching element in the returned page, and places its content inside the container you're loading into, in this case #description_container.
This is just the jQuery route, mainly to illustrate that yes, you can do what you want, but you don't have to do it exactly like this; the concept is getting what you want from an AJAX request rather than from an <iframe>.
Your description sounds like you are fetching pages from the same domain (you said that you need to be logged in and have session credentials), so have you tried using an async request via XMLHttpRequest? It might complain if the HTML on a page is particularly messed up, but you should still be able to get the raw text via .responseText and extract what you need with a regex.