Casper/Phantomjs unable to retrieve highest resolution src image - javascript

I am trying to make a basic Instagram web scraper, both to collect art inspiration pictures and to boost my general programming knowledge and experience.
Currently the issue that I am having is that Casper/Phantomjs can't detect higher res images from the srcset, and I can't figure out a way around this. Instagram has their srcsets provide 640x640, 750x750, and 1080x1080 images. I would obviously like to retrieve the 1080, but it seems to be undetectable by any method I've tried so far. Setting the viewport larger does nothing, and I can't retrieve the entire source set through just getting the HTML and splitting it where I need it. And as far as I can tell, there is no other way to retrieve said image than to get it from this srcset.
Edit
As I was asked for more details, here I go. This is the code I used to get the attributes from the page:
function getImages() {
var scripts = document.querySelectorAll('._2di5p');
return Array.prototype.map.call(scripts, function (e) {
return e.getAttribute('src');
});
}
Then I do the standard:
casper.waitForSelector('div._4rbun', function() {
this.echo('...found selector ...try getting image srcs now...');
imagesArray = this.evaluate(getImages);
imagesArray.forEach(function (item) {
console.log(item);
});
});
However, all that is returned is the lowest resolution of the srcset. Using this url, for example, (https://www.instagram.com/p/BhWS4csAIPS/?taken-by=kasabianofficial) all that is returned is https://instagram.flcy1-1.fna.fbcdn.net/vp/b282bb23f82318697f0b9b85279ab32e/5B5CE6F2/t51.2885-15/s640x640/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg, which is the lowest resolution (640x640) image in the srcset. Ideally, I'd like to retrieve the https://instagram.flcy1-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg which is the 1080x1080 image in the srcset. But I can't. There's no way to get that item as far as I can tell. It's completely hidden.

I found a way around it in Instagram's case. Instagram puts the source picture in a meta tag within the head. So, using the code I'll paste below, you can call all of the meta tags and then sort out which one is the source picture by checking if "og:image" is retrieved.
function getImages() {
var scripts = document.querySelectorAll('meta[content]');
return Array.prototype.map.call(scripts, function (e) {
return e.getAttribute('property') + " " + e.getAttribute('content');
});
}
And this is the way to sort the meta tags into only having the original image in its native resolution.
this.echo('...found selector ...try getting image srcs now...');
imagesArray = this.evaluate(getImages);
imagesArray.forEach(function (item) {
if (typeof item == "string" && item.indexOf('og:image') > -1) {
console.log(item); // assumed: print the matching tag, as in the earlier snippet
}
});
Edit: Unfortunately this only works for single-image posts on Instagram (the site I'm trying to scrape), so it does me no good. The values within the meta tags don't change even if you load the next image in the post. I'm leaving this up in case anyone else can use it, but it's not ideal for my own use case.

Yes, PhantomJS doesn't seem to support srcset; its WebKit engine is very old.
But to be fair, all the metadata related to the page is out in the open in the HTML, as JSON in the window._sharedData variable.
If you want to use a headless browser (and not parse it with any server-side language) you can do this:
var imgUrl = page.evaluate(function(){
return window._sharedData.entry_data.PostPage[0].graphql.shortcode_media.display_resources[2].src;
});
https://instagram.fhen2-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg
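Rather than hard-coding index 2, the widest entry could be picked by comparing widths. This is only a sketch: pickLargestResource is a hypothetical helper name, and it assumes each entry in display_resources carries a numeric config_width next to src. That field name is an assumption about the JSON Instagram served at the time, and the structure may well have changed since.

```javascript
// Hypothetical helper: returns the src of the widest entry in a
// display_resources-style array, or null if the array is empty or missing.
// Assumes each entry looks like { src: '...', config_width: 1080 }.
function pickLargestResource(resources) {
  if (!Array.isArray(resources) || resources.length === 0) {
    return null;
  }
  var best = resources[0];
  for (var i = 1; i < resources.length; i++) {
    if (resources[i].config_width > best.config_width) {
      best = resources[i];
    }
  }
  return best.src;
}
```

When used with page.evaluate, the helper has to be defined inside the evaluated function, since that code runs in the page context rather than in the PhantomJS script.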

Solution: So my solution was to use slimerjs. If I run the js file through "casperjs --engine=slimerjs fileName.js", I can retrieve srcsets in full. So if I say use this code:
function getImgSrc() {
var scripts = document.querySelectorAll("._2di5p");
return Array.prototype.map.call(scripts, function (e) {
return e.getAttribute("srcset");
});
}
on this url (https://www.instagram.com/p/BhWS4csAIPS/?taken-by=kasabianofficial) I will get (https://instagram.flcy1-1.fna.fbcdn.net/vp/b282bb23f82318697f0b9b85279ab32e/5B5CE6F2/t51.2885-15/s640x640/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg 640w,https://instagram.flcy1-1.fna.fbcdn.net/vp/b4eebf94247af02c63d20320f6535ab4/5B6258DF/t51.2885-15/s750x750/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg 750w,https://instagram.flcy1-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg 1080w) as the result.
This is what I wanted, as it means I can scrape those 1080 images. Sorry for the messy page, but I wanted to leave my trail of steps for anyone else attempting the same thing.
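Once the full srcset string is available, the largest candidate still has to be picked out of it. A minimal sketch of that step, assuming the string uses width descriptors ("640w", "1080w") as in the result above and that the URLs themselves contain no commas; largestFromSrcset is a hypothetical helper name, not part of CasperJS or SlimerJS.

```javascript
// Hypothetical helper: parse a srcset string with width descriptors
// ("url 640w,url 750w,...") and return the URL with the largest width.
// Candidates without a "w" descriptor are ignored.
function largestFromSrcset(srcset) {
  var best = null;
  (srcset || '').split(',').forEach(function (candidate) {
    var parts = candidate.trim().split(/\s+/);
    var match = /^(\d+)w$/.exec(parts[1] || '');
    if (match) {
      var width = parseInt(match[1], 10);
      if (best === null || width > best.width) {
        best = { url: parts[0], width: width };
      }
    }
  });
  return best ? best.url : null;
}
```

Each string returned by getImgSrc could then be passed through this helper to keep only the 1080w URL.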

Related

Using image dimensions from a URL without loading it onto page - Javascript

I'm currently learning JavaScript in a bootcamp and I'm having a hard time figuring out how to solve a problem. I have googled it extensively, but I feel like I'm missing a core piece of functionality in what I'm trying to do.
I'm getting an array of returned images from an API based on a date specified by a date selector. I want to measure the dimensions of the natural height and width of each item in the array and filter them into another array which I will use to display a gallery. Here is a link to what the API returns:
API returned object array index[0] example
Here is the relevant piece of code, including the API call.
spaceApp.getRoverPhotos = (date) => {
$.ajax({
url: spaceApp.roverUrl,
method: 'GET',
data: {
api_key: spaceApp.key,
earth_date: date,
camera: spaceApp.cameras,
page: 1,
}
}).then((res) => {
//functioning code but doesn't wait for stuff to load.
let displayArray = res.photos.filter( (each) => {
let imageDim = new Image();
imageDim.src = each.img_src;
if(imageDim.naturalWidth >= 1000 && imageDim.naturalHeight >= 1000) {
return true;
}
else {
return false;
}
});
console.log(displayArray);
})
}
The results of this are not consistent, and I think it's the way that the images are being loaded. The filter method is used after a successful promise, so I know that the array consistently returns the information I need, but I don't always get the natural height and width because of the way something is loading. I used a forEach to console log the natural height and width when I submit the date twice in a row. The first returns me no values, the second has them all loaded:
First submit, 25 results of 0 0, then all show upon re-submitting the form
My best guess is that this has something to do with caching the images. I've tried to use image.onload in creative ways, but I'm not sure that makes sense, given that I haven't used the images on the page anywhere (similarly, I tried imageDim.addEventHandler('load', ()=>{})). I have tried using map to create a new array of images using the new Image() approach above, and then filtering the new array, but I ran into the same issue. I have considered using a promise, but I can't figure out how to "load" images without displaying them on the page, and how to use that as a promise inside a filter.
I have seen, and tried without success, a few variations of this:
var img = new Image();
img.onload = function() { alert("Height: " + this.height); }
img.src = "http://path/to/image.jpg";
I'm sorry if I'm missing a step here, but I've been working on this for several hours without success. I've asked my only senior developer friend for help, but he wasn't able to provide me much guidance since he happened to not be at home at the time (he suggested I "load" the images into a new array, then filter, but I have the same problem of how to make an image load - I did attempt his suggestion).
If anyone could spare the time to help me I would really appreciate it! I feel I'm at a roadblock because I don't quite know the nature of the true problem I'm experiencing. If someone can tell me what I'm missing, I can continue googling it! Thank you in advance!
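One possible way to approach this, offered as a sketch rather than something tested against the rover API: naturalWidth and naturalHeight stay at 0 until the image has finished loading, and assigning src only starts the download, so the filter above runs before any dimensions exist (on the second submit the images come from cache, which would explain why they appear then). An Image object does not have to be attached to the page for its load event to fire, and the listener method is addEventListener, not addEventHandler. The names loadDimensions and filterBySize are hypothetical, and the ImageCtor parameter exists only so the code can be exercised outside a browser.

```javascript
// Hypothetical helper: resolves with the photo and its natural dimensions once
// the image has loaded. A failed download resolves with 0 x 0 so that one
// broken image does not reject the whole batch.
// ImageCtor defaults to the browser's Image constructor.
function loadDimensions(photo, ImageCtor) {
  var Ctor = ImageCtor || Image;
  return new Promise(function (resolve) {
    var img = new Ctor();
    img.onload = function () {
      resolve({ photo: photo, width: img.naturalWidth, height: img.naturalHeight });
    };
    img.onerror = function () {
      resolve({ photo: photo, width: 0, height: 0 });
    };
    img.src = photo.img_src; // set src after the handlers are attached
  });
}

// Load every image first, then filter on the measured sizes.
function filterBySize(photos, minSize, ImageCtor) {
  return Promise.all(photos.map(function (p) {
    return loadDimensions(p, ImageCtor);
  })).then(function (results) {
    return results.filter(function (r) {
      return r.width >= minSize && r.height >= minSize;
    }).map(function (r) {
      return r.photo;
    });
  });
}
```

Inside the existing .then((res) => { ... }) callback this could be called as filterBySize(res.photos, 1000).then((displayArray) => console.log(displayArray)), so the gallery is built only after every image has reported its size.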

My JSLink script will not work

I am attempting to use JSLink ..finally.. and I am having some trouble that I cannot seem to straighten out. For my first venture down the rabbit hole I chose something super simple for use as proof of concept. So I looked up a tutorial and came up with a simple script to draw a box around the Title field of each entry and style the text. I cannot get this to work. Is there any chance you can take a look at this code for me? I used the following tokens in the JSLink box.
~sitecollection/site/folder/folder/file.js
And
~site/folder/folder/file.js
The .js file is stored on the same site as the List View WebPart I am attempting to modify. The list only has the default “Title” column.
(function () {
var overrideContext = {};
overrideContext.Templates = {};
overrideContext.Templates.Item = overrideTemplate;
SPClientTemplates.TemplateManager.RegisterTemplateOverrides(overrideContext);
}) ();
function overrideTemplate(ctx) {
return “<div style=’font-size:40px;border:solid 3px black;margin-bottom:6px;padding:4px;width:200px;’>” + ctx.CurrentItem.Title + “</div>”;
}
It looks as though you are attempting to override the context (ctx) item itself, where you actually just want to override the list field and the list view in which the field is displayed. Make sense?
Firstly, change overrideContext.Templates.Item to overrideContext.Templates.Fields :
(function () {
var overrideContext = {};
overrideContext.Templates = {};
overrideContext.Templates.Fields = {
// Add field and point it to your rendering function
"Title": { "View": overrideTemplate },
};
SPClientTemplates.TemplateManager.RegisterTemplateOverrides(overrideContext);
}) ();
Then when the JSLink runs the renderer looks for the Title field in the List view, and applies your overrideTemplate function.
function overrideTemplate(ctx) {
return "<div style='font-size:40px;border:solid 3px black;margin-bottom:6px;padding:4px;width:200px;'>" + ctx.CurrentItem.Title + "</div>";
}
In terms of running multiple JSLinks on a SharePoint page, it is quite possible: the scripts just need to be separated by the pipe '|' symbol. I use SharePoint Online a lot and I see the following formatting working all the time (sorry Sascha!).
~site/yourassetfolder/yourfilename.js | ~site/yourassetfolder/anotherfilename.js
You can run as many scripts concurrently as you want; just keep separating them with the pipe. I've seen this on-premises also, although you might want to swap out '~site' for '~sitecollection' and, if you do, make sure the js files you are accessing are at the top-level site in the site collection.
I have noticed that because every JSLink does client-side rendering, running too many on a list or page will slow it down. If this happens, you might want to consider combining them into one JSLink script, so that the server only has to return one file to the client to do all the rendering needed for your list.
Hope this helps.

Detecting net::ERR_FILE_NOT_FOUND with only JS

I have recreated a blueprint, which has 60+ rooms, as an inline SVG.
There are functions that display information, such as pictures, when you select or hover a room. I'm using one div container to display the pictures by setting its background property to url('path-of-image.ext'), as can be seen below.
var cla = document.getElementsByClassName('cla');
for (i = 0; i < cla.length; i++) {
cla[i].addEventListener('mouseenter', fun);
}
function fun(){
var str = 'url("media/' + this.id.slice(4) + '.jpg")';
pictureFrame.style.background = str;
pictureFrame.style.backgroundSize = 'cover';
pictureFrame.style.backgroundPosition = 'center'
}
The reason I'm not using the background property's shorthand is because I plan on animating the background-position property with a transition.
However, not all rooms have pictures. Hence console throws the following error, GET ... net::ERR_FILE_NOT_FOUND, when you select or hover said rooms. The error doesn't cause the script to break, but I would prefer not to run that code every single time a room is hovered, even when a given room doesn't have pictures.
Even though I know this can be done imperatively with if/else statements, I'm trying to do this programmatically since there are so many individual rooms.
I've tried using try/catch, but this doesn't seem to detect this sort of error.
Any ideas?
Is it even possible to detect this kind of error?
You could attempt to read it using FileReader and catch/handle NotFoundError error.
If it were to error, you could assign it to an object or array which you would first check upon hover. If the file was in that array, you could avoid attempting to read it again and just handle however you like.
Here is a good article by Nicholas Zakas on using FileReader
First off, I would see if there is a way of checking whether the file exists before the document even loads, so that you don't make unnecessary requests. If you have a database on the backend that can manage this, it would serve you very well in the long term.
Since you make it sound like the only way you know a file exists is by requesting it, here's a method that will allow you to try this:
function UrlExists(url)
{
var http = new XMLHttpRequest();
http.open('HEAD', url, false);
http.send();
return http.status!=404;
}
This won't request the image twice because of browser caching. Note that this method (a synchronous XMLHttpRequest) is itself deprecated, and overall the best remedy is to check before the page even loads: if you have a database or data structure of any sort, add a class or property to the element depending on whether the image exists. Then, in your existing code, you can call something like document.getElementsByClassName('cla-with-image') to get only the elements that you've determined have an image (much more efficient than trying to load images that don't exist).
If you end up using that UrlExists method, then you can just modify your existing method to be
function fun(){
var url = 'media/' + this.id.slice(4) + '.jpg';
if (UrlExists(url)) {
var str = 'url("' + url + '")';
pictureFrame.style.background = str;
pictureFrame.style.backgroundSize = 'cover';
pictureFrame.style.backgroundPosition = 'center';
}
}
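An alternative that avoids the synchronous request is to let an off-screen Image object attempt the download and remember which URLs failed, so a room without a picture is only tried once. This is a sketch under assumptions: createImageChecker and imageExists are hypothetical names, the ImageCtor parameter exists only so the code can be exercised outside a browser, and the browser will probably still log the failed request the first time a missing file is tried.

```javascript
// Hypothetical helper: remembers which image URLs loaded and which failed,
// so each URL is requested at most once by this code.
function createImageChecker(ImageCtor) {
  var cache = Object.create(null); // url -> Promise of true (loaded) or false (failed)
  return function imageExists(url) {
    if (!cache[url]) {
      cache[url] = new Promise(function (resolve) {
        var img = new (ImageCtor || Image)();
        img.onload = function () { resolve(true); };
        img.onerror = function () { resolve(false); };
        img.src = url; // starts the download without touching the page
      });
    }
    return cache[url];
  };
}

// Possible use inside the hover handler (browser only):
//   var imageExists = createImageChecker();
//   imageExists(url).then(function (ok) {
//     if (ok) { pictureFrame.style.backgroundImage = 'url("' + url + '")'; }
//   });
```

Because the check is asynchronous, the background is set a moment after the hover rather than immediately; repeat hovers reuse the cached result.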

Jump to page in PDF.js with javascript

I'm trying to use PDF.js' viewer to display pdf files on a page.
I've gotten everything working, but I would like to be able to 'jump to' a specific page in the pdf. I know you can set the page with the url, but I would like to do this in javascript if it's possible.
I have noticed that there is a PDFJS object in the global scope, and it seems that I should be able to get access to things like page setting there, but it's a rather massive object. Anyone know how to do this?
You can set the page via JavaScript with:
var desiredPage = [the page you want];
PDFViewerApplication.page = desiredPage;
There is an event handler on this, and the UI will be adjusted accordingly. You may want to ensure this is not out of bounds:
function goToPage(desiredPage){
var numPages = PDFViewerApplication.pagesCount;
if((desiredPage > numPages) || (desiredPage < 1)){
return;
}
PDFViewerApplication.page = desiredPage;
}
In my case I was loading the PDF file inside an iframe, so I had to do it the other way around.
function goToPage(desiredPage){
var frameObject = document.getElementById("iframe-id").contentWindow;
frameObject.PDFViewerApplication.page = desiredPage;
}
If the PDF is shown in an iframe and you want to navigate to a page, use the code below. 'docIframe' is the iframe tag's id.
document.getElementById("docIframe").contentWindow.PDFViewerApplication.page=2
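The bounds check and the iframe case can be combined by passing the viewer object in explicitly. A small sketch, using only the page and pagesCount properties mentioned above; goToPageClamped is a hypothetical name, and clamping to the nearest valid page (instead of ignoring an out-of-range request) is a design choice made for this example.

```javascript
// Hypothetical helper: clamps the requested page to the valid range before
// assigning it. "app" is the viewer's PDFViewerApplication object, either
// the global one or the one found on an iframe's contentWindow.
function goToPageClamped(app, desiredPage) {
  var page = parseInt(desiredPage, 10);
  if (isNaN(page) || !app || !app.pagesCount) {
    return false; // not a number, or no document loaded yet
  }
  app.page = Math.min(Math.max(page, 1), app.pagesCount);
  return true;
}
```

For the top-level viewer this would be called as goToPageClamped(PDFViewerApplication, 5); for the iframe case, pass document.getElementById("docIframe").contentWindow.PDFViewerApplication as the first argument.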

onClick replace /segment/ of img src path with one of number of values

No idea what I'm doing or why it isn't working. Clearly not using the right method and probably won't use the right language to explain the problem..
Photogallery... Trying to have a single html page... it has links to images... buttons on the page 'aim to' modify the path to the images by finding the name currently in the path and replacing it with the name of the gallery corresponding to the button the user clicked on...
example:
GALLERY2go : function(e) {
if(GalleryID!="landscapes")
{
var find = ''+ findGalleryID()+'';
var repl = "landscapes";
var page = document.body.innerHTML;
while (page.indexOf(find) >= 0) {
var i = page.indexOf(find);
var j = find.length;
page = page.substr(0,i) + repl + page.substr(i+j);
document.body.innerHTML = page;
var GalleryID = "landscapes";
}
}
},
There's a function higher up the page to get var find to take the value of var GalleryID:
var GalleryID = "portfolio";
function findGalleryID() {
return GalleryID
}
Clearly the first var GalleryID is global (t'was there to set a default value should I have been able to find a way of referring to it onLoad) and the one inside the function is cleared at the end of the function (I've read that much). But I don't know what any of this means.
The code, given its frailties or otherwise ridiculousness, actually does change all of the image links (and absolutely everything else called "portfolio") in the html page - hence "portfolio" becomes "landscapes"... the path to the images changes and they all update... As a JavaScript beginner I was pretty chuffed to see it worked. But you can't click on another gallery button because it's stuck in a loop of some sort. In fact, after you click the button you can't click on anything else and all of the rest of the JavaScript functionality is buggered. Perhaps I've introduced some kind of loop it never exits. If you click on portfolio when you're in portfolio you crash the browser! Anyway I'm well aware that 'my cobbled together solution' is not how it would be done by someone with any experience in writing code. They'd probably use something else with a different name that takes another lifetime to learn. I don't think I can use getElement by and refer to the class/id name and parse the filename [using lots of words I don't at all understand] because of the implications on the other parts of the script. I've tried using a div wrapper and code to launch a child html doc and that come in without disposing of the existing content or talking to the stylesheet. I'm bloody lost and don't even know where to start looking next.
The point is... And here's a plea... If any of you do reply, I fear you will reply without making the assumption that you're talking to someone who really hasn't got a clue what AJAX and JQuery and PHP are... I have searched forums; I don't understand them. Please bear that in mind.
I'll take a stab at updating your function a bit. I recognize that a critique of the code as it stands probably won't help you solve your problem.
var currentGallery = 'landscape';
function ChangeGallery(name) {
var imgs = document.getElementsByTagName("img") // get all the img tags on the page
for (var i = 0; i < imgs.length; i++) { // loop through them
if (imgs[i].src.indexOf(currentGallery) >= 0) { // if this img tag's src contains the current gallery
imgs[i].src = imgs[i].src.replace(currentGallery, name);
}
}
currentGallery = name;
}
As to why I've done what I've done: you're correct that variable scope (whether the whole page, or only the given function, knows about a variable) is mixed up in your code. Another potential problem is that if you replace everything in the HTML that says 'landscape' with 'portfolio', you could change things that aren't images. This code finds only the images, and replaces the src only if it contains the current gallery name.
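As far as I can tell, two things explain the symptoms in the question. The while loop can never finish when the text being searched for and its replacement are the same (clicking portfolio while in portfolio), because every replacement puts the searched text straight back. And assigning document.body.innerHTML rebuilds every element on the page, so click handlers attached from script are likely lost, which would explain why nothing responds afterwards. If only the gallery folder in the path should change, a stricter replacement could swap whole path segments. A sketch, where swapGallerySegment is a hypothetical name and the example paths are made up:

```javascript
// Hypothetical helper: swap one whole path segment ("/from/") for another
// ("/to/") in an image path. Returns the path unchanged when the segment is
// absent or when from and to are the same.
function swapGallerySegment(src, from, to) {
  if (from === to) {
    return src;
  }
  return src.split('/').map(function (segment) {
    return segment === from ? to : segment;
  }).join('/');
}
```

Inside ChangeGallery, the line that calls replace could use swapGallerySegment(imgs[i].src, currentGallery, name) instead, so that a file name that merely contains the gallery name is left alone.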
