I am working on a web crawler and I found that some websites populate their content via JSON. This makes it hard for me to get the data using Simple HTML DOM. Is there any way to get the final HTML code, the one I can see in Inspect Element?
This is not a trivial task. You'll need to use a "headless browser" and actually execute the JavaScript on the page. There are several headless browser implementations out there to choose from (just search on the term); then, of course, you'll have to drive them from PHP.
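To make that concrete, here is a minimal sketch of the idea in Python with Selenium driving headless Chrome (the question is about PHP, where the same WebDriver API is available through bindings such as php-webdriver; the URL below is a placeholder):

# Sketch: load a JS-driven page in headless Chrome and grab the final DOM.
# Assumes the selenium package plus a local Chrome/chromedriver install.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    html = driver.page_source          # the HTML after JavaScript has run
    print(html)
finally:
    driver.quit()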
Is there any way to get the full HTML source code of a website with the content fetched by (third-party) JavaScript included in it?
I understand we can get it using document.getElementsByTagName('html')[0].innerHTML, but that will not include the third-party JS content in the HTML, will it?
The reason for this question (based on my knowledge) is that when the browser renders the page and creates the DOM, it must already have fetched all the required files. So does anyone know where and how we can get the full HTML source code of any website?
Yes, when the browser renders the page it does indeed fetch those resources. You can use the save option provided by the browser: right-click, choose "Save as", and save it as a complete webpage.
If that does not work, there are many tools, extensions, and applications that provide this functionality, and they are easy to search for.
If you want to save the HTML and its resources in a single HTML file, you can save it as an MHTML file, using this extension for example: https://chrome.google.com/webstore/detail/save-webpages-offline-as/nfbcfginnecenjncdjhaminfcienmehn.
I am trying to crawl some websites. While I use headless Chrome with Selenium to render HTML that has embedded JS, I would also like to simply use requests for the cases where there is no need for JS rendering.
Is there a way to know whether the HTML needs to be rendered by a browser, or whether a simple requests.get() would give me the complete HTML content?
Any HTML generated by <script> tags won't be retrieved by requests.
The only way to know whether a page needs a browser to generate its whole content is to check whether its HTML contains <script> tags.
Still, if the information you are interested in is not generated by JS, requests.get() will serve you well.
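A rough sketch of that check in Python (requests is assumed, and the marker argument is a hypothetical string you know only appears once the page has finished building):

import requests

def needs_browser(url, marker):
    # Fetch the raw HTML with requests and see whether the content we care
    # about is already there. If the marker is missing but <script> tags are
    # present, the page is probably assembled by JavaScript and needs a
    # real (headless) browser.
    html = requests.get(url, timeout=10).text
    if marker in html:
        return False  # plain requests is enough
    return "<script" in html.lower()

# Hypothetical usage:
# if needs_browser("https://example.com", "expected-text"):
#     ...fall back to headless Chrome with Selenium...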
I'm working on an HTML page for my department at work. Just HTML and CSS, nothing fancy. Now we are trying to get data from another webpage to be displayed in the new one we are working on. I assume that I would need to use JavaScript and a parser of some sort, but I'm not sure how to do this or what to search for.
The solution I imagine is a function that you feed a link to the webpage we want to mine, and it returns (for example) the number of times a certain word is repeated on that page.
The best way to go about it is to use Node.js, then install the cheerio (parser) and request (HTTP request) modules. There are many detailed tutorials showing how to do this (e.g. this one at DigitalOcean).
But if you don't want a Node.js setup and would rather work with a plain web page, then download the cheerio and request JS libraries, include them in your HTML page with <script> tags, and follow the example above. I hope it helps.
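For comparison, the word-count function the question describes can be sketched in Python with requests and BeautifulSoup instead of the Node stack above (the libraries and URL here are my assumptions, not part of the original answer):

# Sketch: fetch a page and count how often a word appears in its visible text.
# Assumes the requests and beautifulsoup4 packages.
import requests
from bs4 import BeautifulSoup

def count_word(url, word):
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text()  # strip the tags
    return text.lower().split().count(word.lower())

# print(count_word("https://example.com", "example"))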
As part of a job I'm doing on a web site I have to copy a few thousand lines of text from several pages of the old site and paste them into the HTML for the new site. The long and painstaking way of going to the old page and copying the many lines of text and then going to my editor and pasting it there line by line is getting really old. I thought of using injected JavaScript to do this but I'm not quite sure where to start. Thanks in advance for any help.
Here are links to a page of the old site and a page of the new site. As you can see in the tables on each page it would take a ton of time to copy it all manually.
Old site: http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx
New Site: http://ezwebsites.us/delridge/macorporation1.html
In order to do this type of work, you need two things: a way of injecting or executing your script on that page, and a good working knowledge of the Document Object Model for the target site.
I highly recommend using the Firefox plugin Firebug, or some equivalent tool in your browser of choice. Firebug lets you execute commands from a JavaScript console, which will help. Hopefully the old site does not have a bunch of <FONT>, <OBJECT> or <IFRAME> tags, which would make this even more tedious.
Using a library like Prototype or jQuery will also help with selecting the parts of the website you need. You can submit results using jQuery like this:
$(function() {
  // .html() is a method call; without the parentheses you get the function
  // object itself rather than the element's rendered HTML
  var snippet = $('#content-id').html();
  // post the captured markup to a server endpoint that stores it
  $.post('http://myserver/page', {content: snippet});
});
A problem you will very likely run into is the same-origin policy many browsers enforce for JavaScript: a script may only make requests back to the origin it was loaded from. So if your JavaScript was loaded from http://myserver, as in this example, you would be OK.
Perhaps another route you can take is to use a scripting language like Ruby, Python, or (if you really have patience) VBA. The script can work through the list of pages to scrape and a target location for the information. It can just as easily package the data up as a request to the new server, if that's how pages get updated. This way you don't have to worry about injecting the JavaScript and hoping it all works without problems.
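As a sketch of that scripted route in Python (requests and beautifulsoup4 assumed; pulling every table row is a guess at where the text lives on the old pages):

# Sketch: pull the text out of the tables on each old page so it can be
# pasted or templated into the new site.
import requests
from bs4 import BeautifulSoup

pages = [
    "http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx",
    # ...add the other old pages here...
]

for url in pages:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
            print("\t".join(cells))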
I think you need Greasemonkey: http://www.greasespot.net/
This is kind of tricky. There is a webpage which, I am guessing, uses some kind of AJAX to pull in content based on the search query. When I fetch the page using get in Perl, it fetches the script code behind the PHP/HTML, but not the results which are displayed when the query is searched manually. I need to be able to fetch the content of the results page. Is there any way to do this in Perl?
Take a look at Selenium RC and the WWW::Selenium module in Perl. With them you can control a real web browser.
Another option is WWW::HtmlUnit which uses the HtmlUnit Java library to execute the JavaScript without a web browser. WWW::HtmlUnit uses Inline::Java to give Perl access to the library. I have found that when installing, it is best to say No to the question "Do you wish to build the JNI extension?".
If you are writing tests that need to check the rendered page, you can have a look at Schwern's javascript-tap-harness, which works with Selenium and handles all the scaffolding.
I also found the article "Using WWW::Selenium To Test Or Automate An Ajax Website" pretty useful.