I'm using Python Selenium to scrape images in Chrome. To download the images, I create download links using the following code:
script_js = 'var imageURL = document.getElementsByTagName("{select_tag}")[{num}].getAttribute("src");' \
            'var link = document.createElement("a");' \
            'link.download = "{image_name}";' \
            'link.href = imageURL;' \
            'link.innerHTML = "download";' \
            'document.body.appendChild(link);' \
            'link.click();' \
            'document.body.removeChild(link);' \
            'delete link;'.format(select_tag="img", num=0, image_name=f"{order+1}.jpg")
browser.execute_script(script_js)
I have successfully downloaded images with this method on other sites before, but this time it doesn't work.
When I create the download link and click it, the browser opens the image in the current tab instead of downloading it.
I tried getting the URL of a single image on the page and opening it in a new tab. When I created the download link there in the same way, it worked.
I am curious why the image cannot be downloaded on the original page. Since these images require login to view, is this an anti-scraping measure?
Is there any way to create a download link that downloads successfully on the original page?
I'm sorry that I can't provide the original site because it requires login.
Oh, I forgot to mention that the src attribute of the image looks like this: "img/a175/321F2061A9895…". So I think the image is served from the same origin.
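One workaround sketch, assuming the login is cookie-based (an assumption on my part; browser and order are the variables from the snippet above): reuse the browser's cookies in a requests session and fetch the image outside the page entirely, sidestepping the download-link behavior.

from urllib.parse import urljoin
import requests

# Read the image's src and resolve it against the page URL,
# since the src here is relative ("img/a175/...").
src = browser.execute_script(
    'return document.getElementsByTagName("img")[0].getAttribute("src");')
image_url = urljoin(browser.current_url, src)

# Copy the logged-in browser session's cookies into requests.
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

# Fetch and save the image directly.
response = session.get(image_url)
with open(f"{order+1}.jpg", "wb") as f:
    f.write(response.content)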
If you can see it, you can download it... either directly, or via a screenshot of the image itself. You can use SeleniumBase (a Python framework) to do it: pip install seleniumbase, then run the script below with python or pytest.
The first test downloads images directly from a website. The second test takes a mini-screenshot of just the image and then saves it.
"""Use SeleniumBase to download images and verify."""
import os
from seleniumbase import BaseCase
class DownloadImages(BaseCase):
def test_download_images_directly(self):
self.open("https://seleniumbase.io/help_docs/chart_maker/")
img_elements_with_src = self.find_elements("img[src]")
unique_src_values = []
for img in img_elements_with_src:
src = img.get_attribute("src")
if src not in unique_src_values:
unique_src_values.append(src)
print()
for src in unique_src_values:
if src.split(".")[-1] not in ["png", "jpg", "jpeg"]:
continue
self.download_file(src) # Goes to downloaded_files/
filename = src.split("/")[-1]
self.assert_downloaded_file(filename)
folder = "downloaded_files"
file_path = os.path.join(folder, filename)
self._print(file_path)
def test_download_images_via_screenshot(self):
self.open("seleniumbase.io/error_page/")
img_elements_with_src = self.find_elements("img[src]")
unique_src_values = []
for img in img_elements_with_src:
src = img.get_attribute("src")
if src not in unique_src_values:
unique_src_values.append(src)
print()
count = 0
for src in unique_src_values:
self.open(src)
if not self.headless and not self.headless2:
self.highlight("img", loops=1)
image = self.find_element("img")
if src.startswith("data:") or ";base64" in src:
# Special Cases: SVGs, etc. Convert to PNG.
count += 1
filename = "svg_image_%s.png" % count
else:
filename = src.split("/")[-1]
folder = "downloaded_files"
file_path = os.path.join(folder, filename)
image.screenshot(file_path)
self.assert_downloaded_file(filename)
self._print(file_path)
if __name__ == "__main__":
from pytest import main
main([__file__])
Here's the current output of that: (If those websites change, the images downloaded may change.)
downloaded_files/logo3c.png
downloaded_files/logo6.png
downloaded_files/sample_pie_chart.png
downloaded_files/sample_column_chart.png
downloaded_files/sample_bar_chart.png
downloaded_files/sample_line_chart.png
downloaded_files/sample_area_chart.png
downloaded_files/multi_series_chart.png
.
downloaded_files/svg_image_1.png
downloaded_files/svg_image_2.png
downloaded_files/svg_image_3.png
downloaded_files/svg_image_4.png
downloaded_files/svg_image_5.png
downloaded_files/svg_image_6.png
downloaded_files/svg_image_7.png
downloaded_files/svg_image_8.png
downloaded_files/svg_image_9.png
downloaded_files/svg_image_10.png
downloaded_files/svg_image_11.png
Suppose this is the website page: "https://www.dior.com/en_us/products/couture-943C105A4655_C679-technical-fabric-cargo-pants-covered-in-tulle", from which I want to download all the images of the product showcased (4 images in this case).
I am using Selenium and extracting image links.
The problem is that if I click the images, they are as large as 2000x3000 pixels, but I am only able to get versions of them at around 480 pixels. Where are these images stored? How do I extract them? (Basically, I want to download the maximum possible size of those images.)
Within the source code of the page you provided, there is JSON data that provides the links and content for the page. Once the data is stripped from the script in the source code, it is easy to retrieve the high-resolution links and download the images. If you have not already, pip install requests and pip install bs4.
import requests, re, json
from bs4 import BeautifulSoup

url = 'https://www.dior.com/en_us/products/couture-943C105A4655_C679-technical-fabric-cargo-pants-covered-in-tulle'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
script = [script.text for script in soup.find_all('script') if 'window.initialState' in script.text][0]
json_data_s = re.search(r'{.+}', script).group(0)
json_data = json.loads(json_data_s)

for holder in json_data['CONTENT']['cmsContent']['elements']:
    if holder.get('type') == 'PRODUCTMEDIAS':
        for image in holder['items']:
            name = image['galleryImages']['imageZoom']['viewCode']
            img_src = image['galleryImages']['imageZoom']['uri']
            image_page = requests.get(img_src)
            with open(name + '.jpg', 'wb') as img:
                img.write(image_page.content)
*The images you were downloading before were the thumbnail photos.
I am using this common method to download a file in JavaScript:
var URI = "..."; // some URI
var dl = document.createElement('a');
dl.href = URI;
dl.download = 'file name';
document.body.appendChild(dl);
dl.click();
document.body.removeChild(dl);
When I execute it the first time it works, but it fails for the next downloads. Do you know why this is? Thanks
Trying to run your code for the second time, Chrome shows this message:
My browser is in pt-BR, the translation is as follows:
http://stackoverflow.com would like to:
* Download multiple files
[Allow] [Block]
If you block it, it will not download the next files. You can check your current permission setting by clicking the (i) icon before the URL and looking for automatic downloads. The default only allows a single download.
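If the downloads are being driven by automation, a minimal sketch of pre-answering that prompt (assuming Chrome with the Selenium Python bindings, as in the first question above):

from selenium import webdriver

options = webdriver.ChromeOptions()
# 1 = allow, 2 = block: pre-answers the "Download multiple files" prompt.
options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.automatic_downloads": 1,
})
browser = webdriver.Chrome(options=options)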
This is my scenario:
I have my web page in folder:
http://www.example.com/example/index.html
I have media files in folder (one level up):
http://www.example.com/media/
and these files are linked in index.html like so: '../song1.mp3'
So when I read window.location.href from my web page I get this:
http://www.example.com/example/
But my media files are in location http://www.example.com/media/
Now I want to construct a download path for this media, but if I join window.location.href and the media URL I get this:
http://www.example.com/example/../song1.mp3
and I need to get this:
http://www.example.com/media/song1.mp3
What is the easiest way to manage this?
I am using JavaScript.
How about this:
var filename = "../song1.mp3",
domain = "http://example.com/", // may be static or made by some black magic
url = domain + "media/" + filename.split("/").pop();
So you just split your path on "/", take the last element (which would be "song1.mp3"), and put it together to get http://example.com/media/song1.mp3.
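As a side note, standard URL resolution would collapse the ../ segment anyway; a quick check with Python's urllib (just to illustrate the resolution rules, not part of the page code):

from urllib.parse import urljoin

base = "http://www.example.com/example/index.html"
# '../song1.mp3' resolves to the site root, not to /media/:
print(urljoin(base, "../song1.mp3"))        # http://www.example.com/song1.mp3
# Linking the files as '../media/song1.mp3' would give the desired URL:
print(urljoin(base, "../media/song1.mp3"))  # http://www.example.com/media/song1.mp3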
Complete and utter JavaScript newbie here, with a problem fetching .pdf files from a web server based on a partial match. I have made a program that outputs data to a web server, and one of the components is a folder of .pdf files. I want to be able to click on a link that pulls up the corresponding .pdf file based on a value in the data table that's generated (I'm using SlickGrid for this). Each of the .pdf files contains the value that's in the data table, which serves as a good query to the .pdf folder, and I've been successful at getting the .pdfs I want with the following code:
var value = grid.getData().getItem(row)['data'];
var locpath = window.location.pathname.substring(0,window.location.pathname.lastIndexOf('/'));
var plotsFolder = window.location.protocol + "//" + window.location.host + locpath + "/CovPlots/";
var href = plotsFolder + value + ".pdf";
return "<a href='" + href + "'>" + value + "</a>";
The catch here is that sometimes the .pdf file that's generated is a concatenation of two or more (I've seen up to 4 so far) of the 'data' strings, separated by '_' as a delimiter for reasons not worth getting into. So, if the .pdf file is 'somestring.pdf', I can get it without problem. However, if the .pdf file is 'somestring_anotherstring.pdf', I can't figure out how to get that .pdf file if I have either 'somestring' or 'anotherstring' as the value of 'data'.
I've tried a ton of different things to get some kind of lookup that I can use to pull down the correct file based on a partial match. The latest attempt is with the FilenameFilter object in JavaScript, but without any knowledge of JavaScript, I'm having a hard time getting it working. I tried to create a new function that I could call as a lookup for the .pdf URL:
function lookup() {
    File directory = new File(plotsFolder);
    String[] myFiles = directory.list(new FilenameFilter() {
        public boolean accept(File directory, String fileName) {
            return fileName.match(value);
        }
    });
}
That only seems to throw an error. Can anyone point me in the right direction to be able to download the correct .pdf file based on a partial match? I also tried to see if there was a jQuery way to do it, but couldn't find anything that works. Thanks in advance!
Without support from the server, JavaScript cannot find a file from a partial filename. What you can do, however, is have a little script on the server that does the partial-filename-matching for JavaScript, and then JavaScript can ask the server to do the match, and then when it gets the match back, it can use that filename.
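For example, a minimal sketch of such a server-side matcher (assuming Flask and a CovPlots folder next to it; both names are assumptions, and any server-side stack works the same way):

from flask import Flask, abort, jsonify
import os

app = Flask(__name__)
PDF_DIR = "CovPlots"  # hypothetical folder holding the .pdf files

@app.route("/match/<value>")
def match(value):
    # Return the first PDF whose underscore-separated name parts
    # contain the requested value, e.g. 'somestring_anotherstring.pdf'.
    for filename in os.listdir(PDF_DIR):
        base, ext = os.path.splitext(filename)
        if ext.lower() == ".pdf" and value in base.split("_"):
            return jsonify(filename=filename)
    abort(404)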
If you don't mind loading a whole index of all the PDFs at once, you could use this little Python script to generate an index in a nice, JavaScript-friendly JSON format:
#!/usr/bin/env python
# Create an index of a bunch of PDF files.
# Usage: python make_index.py directory_with_pdf_files
import os
import sys
import json

def index(directory):
    index = {}
    for filename in os.listdir(directory):
        base, ext = os.path.splitext(filename)
        if ext.lower() != '.pdf':
            continue
        for keyword in base.split('_'):
            index[keyword] = filename
    with open(os.path.join(directory, 'index.json'), 'w') as f:
        f.write(json.dumps(index))

if __name__ == '__main__':
    index(sys.argv[1])
Then you can just load index.json with jQuery or what-have-you. When you need to find a particular PDF's filename, you can do something like this (assuming the object loaded from index.json is in the indexOfPDFs variable):
var href = plotsFolder + indexOfPDFs[value];
How can I write a function that triggers the download of a file stored in C:\?
For example, when the link is clicked, trigger a JavaScript function that will open the file or download it.
I'm trying this, but it executes the program, and I need it to start a download!
if (url == "Supremo") {
var oShell = new ActiveXObject("WScript.Shell");
var prog = "C:\\inetpub\\wwwroot\\admin.redenetimoveis.com\\San\\up\\arquivos\\Supremo.exe";
oShell.run('"' + prog + '"', 1);
return false;
}
To get a user to download an exe file, you simply need to point them to the URL with something like
window.location = 'http://admin.redenetimoveis.com/Supremo.exe';
As per the comment below, simply make an anchor tag that points to the file:
<a href="http://admin.redenetimoveis.com/Supremo.exe">Download Executable</a>
Create a hidden iframe with a src attribute pointing to the file that needs to be downloaded. Keep in mind, however, that the src value needs to be a file that is accessible to the client, such as a URL on a public domain: http://www.somedomain.com/thefile.exe
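For instance (a sketch, reusing the example URL above):

<iframe src="http://www.somedomain.com/thefile.exe" style="display: none;"></iframe>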