How can I programmatically scrape an image from another website? - javascript

A few years ago I helped someone put together a webpage (for local personal use only, not served to the world) that aggregates outdoor webcam photos from several of his favorite websites. It's a time-saver for viewing multiple websites at once. We had it easy when the images on those websites had fixed URLs. And we were able to write some JavaScript code when the URLs changed predictably (e.g., when the URL had a date in it). But now he'd like to add an image whose filename changes seemingly at random, and I don't know how to handle that. Basically, I'd like to:
Programmatically visit another website to find the URL of a particular image.
Insert that URL into my webpage with an <img> tag.
I realize this is probably a confusing and unusual question. I'm willing to help clarify as much as possible. I'm just not sure how to ask for what this guy wants to do.
Update: David Dorward mentioned that doing this with JavaScript violates the Same Origin Policy. I'm open to suggestions for other ways to approach this problem.

Fetch the HTML of the remote page using cross-domain AJAX.
Then parse it to get the URLs of the images of interest.
Then, for each URL, do <img src=url />
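
The parsing step above could look roughly like the sketch below. It assumes the remote page's HTML is already in a string (the fetch itself is not shown, and a browser will block it unless the remote site allows cross-origin requests). The function name findImageUrls and the sample markup are made up for illustration, and a regex like this is fragile on messy HTML.

```javascript
// Sketch: pull image URLs out of an HTML string (plain Node.js, no dependencies).
// findImageUrls and the sample markup are hypothetical, for illustration only.
function findImageUrls(html, pageUrl) {
  const urls = [];
  // Match the src attribute of each <img> tag.
  const imgSrc = /<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = imgSrc.exec(html)) !== null) {
    // Resolve relative paths against the URL of the page they came from.
    urls.push(new URL(match[1], pageUrl).href);
  }
  return urls;
}

const sampleHtml =
  '<p><img class="cam" src="/cams/north_8f3a.jpg"></p>' +
  '<img src="https://cdn.example.com/logo.png" alt="logo">';
const imageUrls = findImageUrls(sampleHtml, 'https://example.com/webcams/index.html');
console.log(imageUrls);
```

Each returned URL can then be dropped into an <img> tag on the aggregating page.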

It's probably a big fat violation of copyright.
The picture is most likely contained within a page: just regularly visit that page and parse the img tag. Make sure that the random bit you commented on is not just a random parameter to force browsers to fetch the fresh image instead of retrieving a cached version.
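
One way to check the "random parameter" possibility is to compare two observed URLs with the query string ignored. This is only a sketch; sameImageIgnoringQuery and the example URLs are hypothetical.

```javascript
// Sketch: decide whether two image URLs differ only in their query string.
// sameImageIgnoringQuery is a hypothetical helper name.
function sameImageIgnoringQuery(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  // origin + pathname drops "?..." and "#...", where cache-busters usually live.
  return a.origin + a.pathname === b.origin + b.pathname;
}

// Only the ?t= value changes: the file itself has a fixed URL.
const onlyQueryDiffers = sameImageIgnoringQuery(
  'https://example.com/cam/latest.jpg?t=1700000000',
  'https://example.com/cam/latest.jpg?t=1700000600');

// The filename itself changes: the page has to be parsed each time.
const filenameDiffers = sameImageIgnoringQuery(
  'https://example.com/cam/img_8f3a.jpg',
  'https://example.com/cam/img_c21d.jpg');

console.log(onlyQueryDiffers, filenameDiffers);
```

If the first case applies, the fixed part of the URL can be used directly in the <img> tag.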

If you use PHP in your project, you can use the cURL library to fetch the other website's content and parse it with a regex to get the image URL from the source code.

You have a Python question in your profile, so I'll just say if I were trying to do this, I'd go with Python & Beautiful Soup. Has the added advantage of being able to handle invalid HTML.

Related

Parse the contents of a webpage

I'm working on an html page for my department at work. Just html and css nothing fancy. Now, we are trying to get data from another webpage to be displayed in the new one we are working on. I assume that I would need to use JavaScript and a parser of some sort but I'm not sure how to do this or what really to search for.
The solution I assume would exist is to have a function, feed it a link of the webpage we want to mine, and it would return (for example) the number of times a certain word was repeated in that webpage.
The best way to go about it is to use Node.js and install the cheerio (parser) and request (HTTP request) modules. There are many detailed tutorials showing how to do this (e.g., one at DigitalOcean).
But if you don't want to set up Node.js and want to work with a plain web setup, download the cheerio and request JS libraries, include them in your HTML page in a <script> tag, and then follow the above example. I hope it helps.
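
The word-counting function the question describes can be sketched without any library, assuming the page's HTML has already been fetched into a string. This swaps a crude regex tag-stripper in for a real parser such as cheerio; countWord and the sample markup are hypothetical.

```javascript
// Sketch: count how often a word appears in the visible text of an HTML string.
// countWord is a hypothetical name; tag stripping by regex is a rough approximation.
function countWord(html, word) {
  const text = html
    // Drop script and style blocks so their contents are not counted.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Drop the remaining tags, keeping the text between them.
    .replace(/<[^>]+>/g, ' ');
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp('\\b' + escaped + '\\b', 'gi'));
  return matches ? matches.length : 0;
}

const pageHtml =
  '<h1>Budget report</h1><p>The budget grew; see the <b>Budget</b> table.</p>' +
  '<script>var budget = 1;</script>';
const budgetCount = countWord(pageHtml, 'budget');
console.log(budgetCount);
```

The match is case-insensitive and whole-word only, so "budgets" would not be counted.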

JavaScript to replace server used for all images?

Given a static HTML file and its accompanying static CSS file that have references to a static server holding images, would it be possible to use JavaScript to prevent the page from loading all images from their src attribute and url() definition and instead replace them with another server?
Trying to do something along the lines of
$('body').html($('body').html().replaceAll('src.example.com', 'target.example.com'));
Doesn't prevent the images from being loaded in the first place. Any ideas?
Background story, in case it helps you suggest something different: I'd like my website to display images inside China from a different local Chinese server. That might help with the throttling that non-Chinese sites experience in China. Choosing a Chinese CDN for serving the whole world seems like a bad idea and relying on a service resolving DNS differently per country seems somewhat risky.
So currently we have the .com and .cn domains and want the .cn domain to be served from the same server but have its images served from a local Chinese CDN. While the website does generate dynamic pages, they are cached statically, and trying to generate different pages by domain would mean more effort generating the pages. That's why I thought that perhaps JavaScript could help out by replacing all images.
With a combination of tricks like using the base tag and onclick and onsubmit event handlers you could achieve your objective, but I'm almost sure it isn't a nice solution.
Using a script in the head, or as the first thing in your body, you could insert a base tag, which would change the default server for all resources linked by your site (form actions, link hrefs, stylesheets, images), even the ones requested from CSS.
Then, to avoid the server change for links and forms, you would have to handle click and submit events to fix the server according to the current website URL, or modify them all at once when the page is ready.
Might this work? Yes, but I believe the best solution would be to change the URLs on the server side or let the CDN handle the geotargeting.
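
A rough sketch of the server-side option: rewrite the image host in the cached markup before it is sent for the .cn domain. The function name rewriteImageHost is hypothetical, and the regex only covers src attributes and CSS url() references that name the host explicitly.

```javascript
// Sketch: swap the image host in cached HTML/CSS before serving it.
// rewriteImageHost is a hypothetical helper; hosts are the ones from the question.
function rewriteImageHost(markup, fromHost, toHost) {
  // Escape regex metacharacters (the dots) in the host name.
  const escaped = fromHost.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Only touch the host when it follows src=" or url( so links and forms keep their server.
  const pattern = new RegExp(
    '(\\bsrc\\s*=\\s*["\']|\\burl\\(\\s*["\']?)((?:https?:)?//)' + escaped, 'gi');
  return markup.replace(pattern, (whole, prefix, scheme) => prefix + scheme + toHost);
}

const cachedPage =
  '<a href="https://src.example.com/about">About</a>' +
  '<img src="https://src.example.com/a.png">' +
  '<div style="background:url(//src.example.com/bg.jpg)"></div>';
const rewrittenPage = rewriteImageHost(cachedPage, 'src.example.com', 'target.example.com');
console.log(rewrittenPage);
```

Because the rewrite happens before the browser sees the page, no image is ever requested from the original server.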

Load webpage inside iframe but replace a part of url inside iframe page

My first question here :)
I want a way to load a page inside iframe while changing/replacing a part of the urls of any links present in the webpage with alternate text.
For example, suppose we load a website such as "mywebsite.com" in an iframe, and the loaded page has links to another site:
https://www.facebook.com/abcd?id=text
https://www.facebook.com/efgh?id=text
Then I want the website inside the iframe to be loaded with custom URLs like:
https://www.facebook.com/abcd?id=alternatetext
https://www.facebook.com/efgh?id=alternatetext
Basically, I need a way to replace "text" with "alternatetext" on the fly while rendering the webpage inside the iframe.
How do I do it?
Help me people..
Thanks.
This is completely possible. But I think you may be far off on this. Since you did not include any JavaScript, I assume that you have not made any headway on that. This is going to be deep and take some fine-tuning; it's not just some code snippet that someone can give you. It can totally be done with a scripting language. I recommend you take the time to learn a server-side language. I personally use VB.NET at work. You will be amazed by the possibilities.
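
As a very small part of that server-side work, the link rewriting itself could be sketched as below. It assumes the server has already fetched the page's HTML and will serve the modified copy to the iframe (a page cannot edit a cross-origin iframe's content from the outside). replaceParamInLinks is a hypothetical name, and the regex only handles double-quoted absolute hrefs.

```javascript
// Sketch: change one query parameter in every absolute link of an HTML string.
// replaceParamInLinks is a hypothetical helper name.
function replaceParamInLinks(html, paramName, newValue) {
  const anchorHref = /(<a\b[^>]*?\bhref\s*=\s*")([^"]+)(")/gi;
  return html.replace(anchorHref, (whole, before, href, after) => {
    let url;
    try {
      url = new URL(href);
    } catch (e) {
      return whole; // relative or malformed link: leave it alone
    }
    if (!url.searchParams.has(paramName)) return whole;
    url.searchParams.set(paramName, newValue);
    return before + url.href + after;
  });
}

const fetchedPage =
  '<a href="https://www.facebook.com/abcd?id=text">one</a> ' +
  '<a href="https://www.facebook.com/efgh?id=text">two</a> ' +
  '<a href="/local">three</a>';
const modifiedPage = replaceParamInLinks(fetchedPage, 'id', 'alternatetext');
console.log(modifiedPage);
```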
On another note, if Facebook found out you were displaying their pages online and modifying their links they would surely take some action.
I recommend this question be closed.

Scrape website and insert a table into my own HTML document

I want to be able to extract a table from a website and put it into my own HTML page. For example, I want the information contained in the table class 'tbbox' on this website: http://www.timeanddate.com/worldclock/astronomy.html?n=24 inserted into my own HTML page. I want to avoid executing any kind of server side code like PHP. Perhaps use JavaScript for this? All the examples I have come across so far only provide details on how to extract the information into a CSV or text file.
Sorry if this question seems a bit vague, but I know very little about how JavaScript is run on webpages and am not a web developer. I am just trying to set up a dashboard for personal use that will extract astronomical information from various websites into a single page, which I can open to find information at a glance.
Thanks for taking the time.
What you want to do has nothing to do with web-scraping. The problem you have can easily be solved with an <iframe> on your page that fetches the desired info from another source on load. Here is a reference that might help you with that: http://www.w3.org/TR/html401/present/frames.html
NOTE! Only display this information on your site if you are allowed to do so!
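
If an iframe shows more of the page than just the table, an alternative is to cut the table markup out of the page's HTML. The sketch below assumes that HTML has already been obtained as a string (which a browser will not allow cross-origin without the remote site's cooperation) and that the table contains no nested tables. extractTableByClass and the sample markup are hypothetical.

```javascript
// Sketch: cut the first <table> with a given class out of an HTML string.
// extractTableByClass is a hypothetical helper; assumes no nested tables
// and a class name made only of letters, digits, or underscores.
function extractTableByClass(html, className) {
  const openTag = new RegExp(
    '<table\\b[^>]*\\bclass\\s*=\\s*["\'][^"\']*\\b' + className + '\\b[^"\']*["\'][^>]*>', 'i');
  const start = html.search(openTag);
  if (start === -1) return null;
  // Find the first closing tag after the opening tag.
  const closeTag = /<\/table\s*>/gi;
  closeTag.lastIndex = start;
  const close = closeTag.exec(html);
  if (!close) return null;
  return html.slice(start, close.index + close[0].length);
}

const astronomyPage =
  '<div><table class="other"><tr><td>x</td></tr></table>' +
  '<table class="tbbox wide"><tr><td>Sunrise</td><td>06:12</td></tr></table></div>';
const tableHtml = extractTableByClass(astronomyPage, 'tbbox');
console.log(tableHtml);
```

The returned string can be inserted into the dashboard page as-is, for example by assigning it to an element's innerHTML.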

Screen scrape a web page that uses javaScript and frames

I want to scrape data from www.marktplaats.nl . I want to analyze the scraped description, price, date and views in Excel/Access.
I tried to scrape data with Ruby (nokogiri, scrapi) but nothing worked. (on other sites it worked well) The main problem is that for example selectorgadget and the add-on firebug (Firefox) don’t find any css I can use to scrape the page. On other sites I can extract the css with selectorgadget or firebug and use it with nokogiri or scrapi.
Due to lack of experience it is difficult to identify the problem and therefore searching for a solution isn’t easy.
Can you tell me where to start solving this problem and where I maybe can find more info about a similar scraping process?
Thanks in advance!
I used an Excel web query and it works perfectly. You can find a lot about scraping with Excel on YouTube if you search for mrexcel.
Thanks, Mello
You can try IRobotSoft web scraper. It has good frame support and is free.
Iframes aren't a problem - just access the embedded iframe URL directly. You will find that it redirects in the browser unless you disable JavaScript.
Description and date can be extracted straight from HTML source. However prices are images which will make scraping them more cumbersome.
