I am trying to crawl some websites. While I am using a headless Chrome browser with Selenium to render HTML that has embedded JS, I would also like to simply use requests for the cases where there is no need for JS rendering.
Is there a way to know if the HTML needs to be rendered by a browser or if a simple requests.get() would give me the complete HTML content?
Any HTML code generated by <script> tags won't be retrieved by requests.
The only way to know if a page needs to be rendered by a browser to generate its whole content is to check whether its HTML code has <script> tags.
Still, if the information you are interested in is not generated by JS, requests.get() will serve you well.
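For illustration, a minimal sketch of that check in Python with requests and BeautifulSoup (the URL is a placeholder and the check is only a rough guide):

```python
import requests
from bs4 import BeautifulSoup

def fetch_and_inspect(url):
    """Fetch a page with plain requests and see what is there before any JS runs."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    script_count = len(soup.find_all("script"))

    # Drop script/style tags so only human-visible text is counted.
    for tag in soup(["script", "style"]):
        tag.extract()
    visible_text = soup.get_text(separator=" ", strip=True)

    print(script_count, "script tags,", len(visible_text), "characters of visible text")
    # If the text you need is already in visible_text, requests is enough;
    # otherwise the content is most likely generated by JS in the browser.

fetch_and_inspect("https://example.com")  # placeholder URL
```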
I am building a web scraper that has to quickly retrieve the text of a web page from its HTML only. I'm using Python, requests and BeautifulSoup.
I would like to detect if the web page content is pure HTML or if it is rendered by Javascript. In the latter case, I would just return an error message saying that this cannot be done.
I know about headless browsers for rendering the Javascript, but in this case I really just need to detect it as fast as possible without having to render it.
Simply detecting a script tag is not enough, as there are many on every web page and their presence doesn't necessarily mean the text content is rendered by Javascript.
Is there something I could check in the HTML that tells me accurately that the body content will be rendered by Javascript?
Thank you
There is nothing in the initial DOM that tells you beforehand that the site is rendered with JS. Here are some things you could try (a rough sketch of these checks follows the list):
- Analyze several websites and make a guess on whether a site is rendered with JS based on the page's content size.
- Get the HTML of different pages of the site and compare the content lengths (for a JS-rendered site, the contents of different pages are likely to be the same or similar before any code is executed).
- Check the content size of the scripts, or look for the script names of well-known frameworks like React, Vue and Angular.
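As a rough Python sketch of those heuristics: the framework signatures and the text-length threshold below are my own placeholder values, so treat the result as a guess rather than a reliable answer.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical framework signatures; incomplete and only a heuristic.
FRAMEWORK_HINTS = ("react", "vue", "angular", "__next_data__", "ng-app")

def looks_js_rendered(url, min_text_chars=200):
    """Guess whether a page builds its main content with Javascript."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Heuristic 1: very little visible text in the raw HTML body.
    body_text = soup.body.get_text(strip=True) if soup.body else ""

    # Heuristic 2: script sources or inline code mention a known framework.
    scripts = " ".join(
        (tag.get("src") or "") + " " + tag.get_text() for tag in soup.find_all("script")
    ).lower()
    mentions_framework = any(hint in scripts for hint in FRAMEWORK_HINTS)

    return len(body_text) < min_text_chars or mentions_framework

print(looks_js_rendered("https://example.com"))  # placeholder URL
```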
Is there any way to get the full HTML source code of a website with the JavaScript code (fetched from third parties) also included in it?
I understand we can get it using document.getElementsByTagName('html')[0].innerHTML, but that will not include the third-party JS in the HTML code.
The reason for this question is that (to my knowledge) when the browser renders the page and creates the DOM it must already have all the required files, so does anyone know where and how we can get that full HTML source code of any website?
Yes, when the browser renders the page it does indeed fetch these resources. You can use the save option provided by the browser: right-click, choose "Save as", and save it as a complete webpage.
If that does not work, there are many tools, extensions and applications that provide this functionality, and it's easy to search for them.
If you want to save the HTML and its resources in one file, you can save it as an MHTML file, for example using this extension: https://chrome.google.com/webstore/detail/save-webpages-offline-as/nfbcfginnecenjncdjhaminfcienmehn.
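If you want to do the same thing from code rather than by hand, here is a minimal sketch using Selenium with headless Chrome; it assumes a Chromium-based driver, since the MHTML capture goes through the Chrome DevTools Protocol, and the URL is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")  # placeholder URL

# Chromium-only DevTools command: snapshot the page and its resources as MHTML.
snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
with open("page.mhtml", "w", newline="") as f:
    f.write(snapshot["data"])

driver.quit()
```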
I'm making a Tomcat and Java EE web app, and for the front end I am using a free JavaScript + HTML template. I am using a JSP with JSTL in place of the HTML files, and including the JavaScript and other resources in Tomcat 8.5/webapp.
However, when I call the JSP, I have this weird problem where the text that I display does not show up unless I open inspect element.
This happens on all browsers.
I think it might be some problem with the way that JSP/Java EE works with JavaScript, because the template works fine when it is not used as a Java EE project.
Does anyone know common problems that cause this?
I will try to ask my question as precisely as I can. At the moment I cannot post any example code, as I'm not allowed to post code from my work. When I next have some free time I will put together an example to be more precise.
Here is my question:
I'm developing a project where I build HTML and Javascript code dynamically with Java and load it into a JavaFX WebView. The HTML has to be generated dynamically because it contains a lot of information that comes from a central database. I then use a ChangeListener set on the LoadWorker of the WebView's WebEngine to make sure the web page is loaded. The page loads correctly, but when I try to call a Javascript function the application crashes.
The whole thing works with a locally written HTML document. The HTML includes lots of Javascript libraries like jQuery and others, but my project crashes because the libraries and their functions cannot be used. I think that the HTTP request made when loading a locally written HTML file sets up this Javascript when the web page is loaded.
Am I correct about this?
And is there any way I can fix this problem with dynamically generated HTML loaded into a WebView?
Thanks for any help!
I am working on a web crawler and I found that some websites populate their content from JSON. This makes it hard for me to get the data using Simple HTML DOM. Is there any way to get the final HTML code that I can see in inspect element?
This is not a trivial task. You'll need to use a "headless browser" and actually execute the JavaScript on the page. There are several headless browser implementations out there to choose from (just search on the term); then, of course, you'll have to drive them from PHP.
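The answer above is about PHP, where a WebDriver client such as php-webdriver plays this role; purely to illustrate the approach, here is a minimal sketch in Python with Selenium and headless Chrome (the URL and CSS selector are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")  # placeholder URL

# Wait until the JSON-driven content has been inserted into the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#content"))  # placeholder selector
)

# This is the final HTML you would see in inspect element.
final_html = driver.page_source
print(len(final_html))

driver.quit()
```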