I'm doing web scraping with Python. I need to get links for all the search result pages. However, I found the href value is not a regular html link but something as below. How could I get the right page link? Thanks!
You need to find the showDocumentSearchResult function in the JS code (it might be in a separate file though). Then knowing what that function does, you might simulate such an action by Python if it's ever possible.
See the following example: https://webscraping.pro/download-a-file-from-a-link-in-python/
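As a rough starting point, the function name and its arguments can be pulled out of a `javascript:` href before working out what the function does. This is a minimal sketch: the href format `javascript:name('arg', ...)` and the helper name `parse_js_href` are assumptions for illustration, since the actual href is not shown in the question.

```python
import re

# Matches hrefs of the form: javascript:functionName(arg1, arg2, ...)
JS_HREF = re.compile(
    r"^\s*javascript:\s*(?P<name>[A-Za-z_$][\w$]*)\((?P<args>.*)\)\s*;?\s*$"
)
# Matches one single-quoted, double-quoted, or bare (unquoted) argument.
JS_ARG = re.compile(r"""'([^']*)'|"([^"]*)"|([^,\s]+)""")


def parse_js_href(href):
    """Return (function_name, [args]) for a javascript: href, or None.

    Hypothetical helper: it only splits simple literal arguments and does
    not evaluate any JavaScript.
    """
    match = JS_HREF.match(href)
    if match is None:
        return None
    args = []
    for arg in JS_ARG.finditer(match.group("args")):
        # lastindex tells us which alternative matched, so an empty
        # quoted string ('') is kept as an empty argument.
        args.append(arg.group(arg.lastindex))
    return match.group("name"), args


if __name__ == "__main__":
    # The argument value '2' is made up for the demonstration.
    print(parse_js_href("javascript:showDocumentSearchResult('2')"))
```

The extracted arguments are only half of the job: you still need to read the JavaScript function to see which request it sends, and then reproduce that request from Python.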
Related
I'm not really proficient in JavaScript or HTML, and I have to use them to get information from an RSS feed. The RSS feed can be found here.
I need to get the company names found in each article. I have noticed that they're often at the 5th or 6th place in the <link> tags as shown below:
and I've managed to get the link tags from the whole document by doing the following:
Array.from(document.querySelectorAll("link")).forEach(function (ele) {
console.log(ele);
});
The picture below shows the result:
The problem is that whenever I try to do something with it (apply a regexp with /^https*$/.exec(ele) for example) it does not work.
Does anybody know how I could access this information?
Also after that, I'll need to put the different company names into an excel sheet so if anyone has a better solution that can be directly written to an excel sheet I'll take it.
Try something like this, and then run your regex on the string instead of calling console.log:
Array.from(document.querySelectorAll("link")).forEach(function (ele) { console.log(ele.textContent); });
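Since the question also asks for a way to get the results into an Excel sheet, here is a minimal sketch of an alternative outside the browser, using only the Python standard library. It assumes the feed is plain RSS 2.0 and that the company name is one path segment of each item's link URL. The sample feed, the segment position, and the helper names (`item_links`, `path_segment`, `write_csv`) are all made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def item_links(rss_text):
    """Return the text of the <link> child of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


def path_segment(url, index):
    """Return the URL path segment at position `index` (0-based), or '' if absent."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[index] if index < len(segments) else ""


def write_csv(rows, out):
    """Write (link, name) rows as CSV, a format Excel can open directly."""
    writer = csv.writer(out)
    writer.writerow(["link", "name"])
    writer.writerows(rows)


# Hypothetical sample feed; the real feed's URL layout may differ.
SAMPLE = """<rss version="2.0"><channel>
<item><title>A</title><link>https://example.com/news/2020/acme-corp/story</link></item>
<item><title>B</title><link>https://example.com/news/2020/globex/story</link></item>
</channel></rss>"""

links = item_links(SAMPLE)
# Segment position 2 is an assumption; adjust it to the real URL layout.
rows = [(link, path_segment(link, 2)) for link in links]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To produce a file rather than a string, pass a file opened with `open("companies.csv", "w", newline="")` to `write_csv` in place of the `StringIO` buffer.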
I am writing an automation script using AHK and have already gone through their forums and live chat to no avail.
My issue is that I am using a COM object to navigate and click things on a webpage, but the navigation menu on the webpage does not change URLs when going to another part of the website. Instead, the site uses a "main controller", so the URL in the address bar never changes but the webpage does.
I do not have access to the source code but from the element inspector in the web browser I know the name of the javascript function and the arguments it calls to go to the page I want.
I am wondering if there is a way, through the com object or other method, to call the javascript function even though I do not have direct access to the source code?
Thanks for any input.
Yes, of course there is. Just have your script access the address bar and paste in the script. E.g.:
javascript:alert("Hello World");
And note, some browsers may strip out the first part and give you back a search result, so you may have to have AHK, after typing/pasting in the command, go back to the beginning and re-type the javascript: part.
Now, whether this works when you call a function up without referencing back to the source, I can't say, but then again, you could have given more details in your post.
I've been given permission to scrape a website to build up a database of products. When a button is pressed, a javascript function is called and then altered information is presented to the user (change in colour, price etc..). When trying to scrape the website, I want to be able to predict the changes as if the button was pressed. The element in question is:
<a id="anId" title="title" class="class" data-code="code" href="javascript:aFunction('ctl00$MainContent$ctl00$ctl00$FabricGroups$ctl00$FabricOptions$ctl00$FabricButton','')"></a>
Within mojolicious (I imagine the userAgent class?), how do I print the output of what calling the javascript function would do? Is it possible?
It certainly isn't easy. Perl does not interpret javascript (at least not usually and almost certainly not with a DOM).
That said, I have been working on a project to help this, WHICH IS DEFINITELY NOT READY FOR PRODUCTION, which tests javascript actions by spawning an instance of PhantomJS. Once complete the api intends to be as easy to use as Test::Mojo already is. I will be presenting it at YAPC::NA later in the year (2015).
Update: The module is now on CPAN, called Test::Mojo::Role::Phantom.
I'm trying to rationalize a website that has many links to the same document, so I want to create a JavaScript function that returns the URL of this document. This way, I could update the document and only have to change the URL in the function, not in every occurrence of the link. (It's a professional, internal website with many links to official documents that get updated often, out of my control. Each time I update the links, I realize a while later that I forgot some, even after searching all the HTML files. The site is messy and was poorly written by many people, which is why I'm trying to simplify it.)
My first idea was to use link, but everyone says it's a bad practice. I still don't see why, and I don't like to use onclick as it doesn't work with middle click, and I want to let users decide how they open the doc.
Also, I want to use link to redirect to a specific page.
On top of this, what I have tried so far is not working as I intend, so I need some help, whether to come up with a better solution or to make this work!
Here is my JS, with different versions:
function F_link_PDF() {
// i was pretty sure this would work
return "http://www.example.com/presentation.pdf" ;
}
function F_link_PDF_2() {
document.write("http://www.example.com/presentation.pdf");
}
function F_link_PDF_3() {
// i don't like this solution, as it doesn't open as user intended to
location.href = "http://www.example.com/presentation.pdf" ;
}
This example is for a PDF document, but I could also need this for HTML, DOC, PPT...
Finally, I started with JS because I'm used to it, but I could also use other languages (PHP, ASP) if someone says that's a better option.
Thanks in advance!
The hack way: Go about using JavaScript, however you run into potential issues with browsers not running it.
The better way: Use mod_rewrite / .htaccess to redirect previous (expired) requests to the new location of the resource. You could also use FallbackResource and provide a .php file that could provide the new resource based on criteria (you now have the power of PHP to decide where the Location header should go).
The best way1: Place those document references in a database table somewhere and reference them in the page using the table's current value. This creates a single place of "truth" and allows you to update the site from a global perspective. You could also, at a later date, provide search, tag, display a list, etc.
1 Not implying it's the absolute best, but it is certainly a better way than updating hard-coded references.
A server side programming language like php is a better option.
Here's example code that helps:
<?php
// Single place to change the document URL.
$link = "http://www.example.com/files/document.pdf";
// Default to an empty string so a missing PAGE parameter does not raise a notice.
$page = isset($_GET['PAGE']) ? $_GET['PAGE'] : "";
if ($page == "downloads")
{
?>
This is a download page where you can download our flyer.
<?php
echo "<a href=\"$link\">Download PDF</a>";
}
if ($page == "specials")
{
?>
This is our store specials page. Check them out. A link to the flyer is below.
<?php
echo "<a href=\"$link\">Download PDF</a>";
}
?>
The code isn't 100% perfect since some text needs adjusting, but it takes a parameter PAGE, checks whether it is "downloads" or "specials", and if so loads the appropriate page and adds the link to the download file. If you try both pages, the link to the download is exactly the same.
If the above PHP script is saved as index.php, then you can call each page with:
index.php?PAGE=specials for the specials page
index.php?PAGE=downloads for the download page
Once that works, you can add another "if" section for each new page, but the most important line in each section is the last line of...
echo "<a href=\"$link\">Download PDF</a>";
...because it uses a variable that's available in every case in the script.
An advantage of using a server-side method is that people can view the site even with JavaScript disabled.
I'm trying to retrieve some information from Gmail but have been unsuccessful after many attempts. This is the line of code that I'm trying to extract using javascript.
Inbox (182)
I'm trying to get the text "Inbox (182)". To do that, I'm using this piece of code:
NSString *js_result = [webview1 stringByEvaluatingJavaScriptFromString:@"document.getElementsByClassName('J-Ke n0').innerText"];
This however does not work, my result being nothing at all, and I've tried many alternatives but none have worked. All I need to do here is extract the "Inbox (182)" text in any way possible. Thanks.
I think your javascript is incorrect, since there are multiple elements with that class. If I login to gmail, this works:
document.getElementsByClassName('J-Ke n0')[0].innerText
I would be wary of using this in a production environment, though. It seems very brittle; that class or the order of elements could be changed by Google at any time.
You also need to make sure that the page has loaded before trying to execute javascript. Typically this is implemented in a webViewDidFinishLoad: callback. If you're not getting a result and your JS is valid, this is probably the issue.