I have a setup where a web page on a local server (localhost:8080) is changed dynamically by socket messages that load some scripts (d3 code, mainly).
In Chrome I can inspect the "rendered HTML status" of the page, i.e., the resulting HTML after the d3/JavaScript code has run. Now, I need to save that "full HTML snapshot" of the rendered web page so I can view it later, in a "static" way.
I have tried many solutions in Python, which work well to load a page and save its "on-load" d3/JavaScript-processed content, but DO NOT capture the code generated "after" the load.
I could also use JavaScript for this if no Python solution is found.
Remember that I need to retrieve the full rendered HTML that has been "dynamically" modified, at a chosen moment in time.
Here is a list of questions found on Stack Overflow that are related but do not answer this question.
Not answered:
How to save dynamically changed HTML?
Answered, but not for dynamically changed HTML:
Using PyQt4 to return Javascript generated HTML
Not answered:
How to save dynamically added data to update the page (using jQuery)
Not dynamic:
Python to Save Web Pages
The question could be solved using selenium-python (thanks to #Juca's suggestion to use Selenium).
Once installed (pip install selenium), this code does the trick:
from selenium import webdriver

# Start the browser. It will open the url,
# and we can access all its content and act on it.
browser = webdriver.Firefox()

url = 'http://localhost:8080/test.html'
# The page test.html is constantly changing its content as it receives
# socket messages, so we save its "status" whenever we decide,
# for further retrieval.
browser.get(url)

# Wait until we want to save the content (this could be a UI button action, etc.):
input("Press Enter to capture the web page")

# Save the HTML rendered at that moment:
html_source = browser.page_source

# Display it to check:
print(html_source)
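Since the goal is to save the snapshot rather than just print it, the captured source can be written to a timestamped file. A minimal sketch, assuming a `save_snapshot` helper name and `snapshots/` directory of my own choosing:

```python
import datetime
import pathlib

def save_snapshot(html_source, directory="snapshots"):
    """Write the rendered HTML to a timestamped file and return its path."""
    out_dir = pathlib.Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = out_dir / "snapshot-{}.html".format(stamp)
    path.write_text(html_source, encoding="utf-8")
    return path
```

After `html_source = browser.page_source`, calling `save_snapshot(html_source)` leaves a static copy on disk.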
I am trying to achieve the following in an ASP.NET MVC3 web application that uses Razor.
1) In my Index.cshtml file, I have the below reference.
<script src="/MySite/Scripts/Main.js"></script>
2) I load my home page for the first time and an HTTP request is made to fetch this file, which returns 200.
3) Then, I made some changes to the Main.js and saved it.
4) Now I just reload the home page (please note that I am not refreshing the page) by going to the address bar, typing the home page URL, and pressing Enter. At this point, I want the browser to fetch the updated Main.js file by making an HTTP request again.
How can I achieve this? I don't want to use the System.Web.Optimization bundling approach. I know that we can achieve this by changing the URL (appending a version or some random number) every time the file changes.
But the challenge here is that the URL is hardcoded in my Index.cshtml file. Every time there is a change in the Main.js file, how can I change that hardcoded URL in the Index.cshtml file?
Thanks,
Sathya.
What I was trying to achieve is to invalidate the browser cache as soon as my application's JavaScript file (which is already cached in the browser) gets modified at its physical location. I understand now that this is simply not achievable, as no browser currently provides that support. To work around it, these are the only two ways:
1) Use MVC bundling.
2) Every time the file is modified, modify the URL by appending a version or any random number to it through the query string. This method is explained in the following URL - force browsers to get latest js and css files in asp.net application
But the disadvantage of the second method is that if any external application refers to your application's JavaScript file, its browser cache will still not be invalidated without refreshing the external application in the browser.
Just add a timestamp as a query-string parameter:
@{
    var timestamp = System.DateTime.Now.ToString("yyyyMMddHHmmssfff");
}
<script src="/MySite/Scripts/Main.js?TimeStamp=@timestamp"></script>
Note: only update the TimeStamp parameter value when the file is actually updated/modified.
It's not possible without either using bundling (which handles versioning internally) or manually appending a version. You can create a single-file bundle as well if you want.
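One way to automate the "append a version" approach is to derive the version from the file's contents, so the URL changes exactly when the file does. A sketch of the idea in Python (the `versioned_url` helper is hypothetical; the same logic ports directly to a Razor helper in the view):

```python
import hashlib
import pathlib

def versioned_url(path, url):
    """Append a short content hash so the URL changes only when the file does."""
    digest = hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()[:8]
    return "{}?v={}".format(url, digest)

# e.g. versioned_url("Scripts/Main.js", "/MySite/Scripts/Main.js")
# yields "/MySite/Scripts/Main.js?v=<hash of the current file>"
```

A browser then caches each version of the file under a distinct URL, and edits to Main.js produce a new URL automatically.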
I need to get a bit of data from an HTML tag that only appears when you're signed into a site. I need to do it in either Python or JavaScript. JavaScript has the browser's cross-origin policy (CORS) as an obstacle.
I can't use server-side code.
I can't use iframes.
The data is readily available if you open the page URL in Chrome or Firefox because it keeps you signed in, much like Facebook, so we'll use that as an example. We'll say I want to get the data from the first element of my Facebook news feed.
I've tried scraping the web page and passing in the User-Agent value with Python's urllib module. I've tried using Yahoo's YQL tool with JavaScript. Both returned the HTML I wanted, but without the values I need in it. This is because neither uses my browser, which has the stored cookies required to populate those values.
So is there a way to scrape a webpage that's already open? Say I had Facebook open and I ran some code that got my news feed data from the browser.
Is there some other method I haven't mentioned to accomplish this?
Background: I'm creating an autobumper for a forum (within the site rules) and need some generated values from the site's HTML, but will get no cooperation towards that end from the owner.
You can try the following with the Python Selenium webdriver, as it allows you to log in and get the HTML source.
You will have to pip install selenium first and download chromedriver.exe from the Selenium website: http://docs.seleniumhq.org/
Here is a sample code I use on Gmail:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# you have to download chromedriver from the Selenium homepage
chromedriver_path = r'your chromedriver.exe path here'

# create the webdriver object and get the url
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.implicitly_wait(1)
driver.get('https://www.google.com/gmail')

# log in
driver.find_element(By.CSS_SELECTOR, '#Email').send_keys('email@gmail.com')
driver.find_element(By.CSS_SELECTOR, '#next').click()
driver.find_element(By.CSS_SELECTOR, '#Passwd').send_keys('1234')
driver.find_element(By.CSS_SELECTOR, '#signIn').click()

# get the html
html = driver.page_source
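If you later want to reuse the logged-in session outside the browser, `driver.get_cookies()` returns a list of dicts with `name` and `value` keys that can be folded into a `Cookie` header for plain HTTP requests. A sketch of just that step (the `cookie_header` helper is my own):

```python
def cookie_header(cookies):
    """Build a Cookie header value from selenium-style cookie dicts."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)

# e.g. urllib.request.Request(url,
#          headers={"Cookie": cookie_header(driver.get_cookies())})
```

This lets a lightweight script fetch the signed-in page without driving the browser for every request.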
I have this website and I want to download the content of the page.
I tried Selenium, including clicking the button with it, but with no success.
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox
import time
# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
# setting the url
browser.get("http://bonusbagging.co.uk/oddsmatching.php#")
# finding and clicking the button
button = browser.find_element_by_id('select_button')
button.click()
page = browser.page_source
time.sleep(5)
print(page.encode("utf8"))
This code only downloads the source code, where the data are hidden.
Can someone show me the right way to do that? Or tell me how the hidden data can be downloaded?
Thanks in advance!
I always try to avoid Selenium like the plague when scraping; it's very slow and is almost never the best way to go about things. You should dig into the source more before scraping: it is clear on this page that the HTML comes in first and then a separate call is made to fetch the table's data. Why not make the same call as the page? It's lightning fast and requires no HTML parsing; it just returns raw data, which seems to be what you're looking for. The Python requests library is perfect for this. Happy scraping!
import requests
table_data = requests.get('http://bonusbagging.co.uk/odds-server/getdata_slow.php').content
PS: The best way to find these calls is to open the dev console and check out the Network tab; you can see what calls are being made there. Another way is to go to the Sources tab, look for some JavaScript, and search for AJAX calls (that's where I got the URL I'm calling above; the path was: top/odds-server.com/odds-server/js/table_slow.js). The latter option is sometimes easier, sometimes nearly impossible (if the file is minified/uglified). Do whatever works for you!
Check out the Network tab in Chrome Dev tools. Nab the URL out of there.
What you're looking at is a DataTable. You can use their API to fetch what you need.
Adjust the "start" and/or "length" parameters to fetch the data page-by-page.
It's JSON data, so it'll be super easy to parse.
But be nice and don't hammer this poor guy's server.
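Fetching the data page-by-page with the "start" and "length" parameters can be scripted; this sketch only builds the paged URLs (the parameter names follow the DataTables convention mentioned above, and the base URL would be whatever you grabbed from the Network tab):

```python
from urllib.parse import urlencode

def paged_urls(base_url, total, page_size):
    """Yield URLs that fetch `total` rows in chunks of `page_size`."""
    for start in range(0, total, page_size):
        length = min(page_size, total - start)
        yield "{}?{}".format(base_url, urlencode({"start": start, "length": length}))

# e.g. for url in paged_urls("http://example.com/getdata_slow.php", 250, 100): ...
```

Spacing the requests out (e.g. a short sleep between pages) keeps the load on the server reasonable.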
On my webpage you can read books in PDF format. The problem is that some books have around 1000 pages and the PDF is really big, so even if the user reads just 10 pages the server sends the full PDF. This is awful for my hosting account because I have a transfer limit.
What could I do to display the PDF without loading the full file?
I use pdf.js.
Greetings.
ORIGINAL POST:
PDF files are designed in a way that forces the client side to download the whole file just to get the first page.
The last line of the PDF file tells the PDF reader where the root dictionary for the PDF file is located (the root dictionary tells the reader about the page catalog - order of pages - and other data used by the reader).
So, as you can see, the limitations of the PDF design require that you use a server side solution that will create a new PDF with only the page(s) you want to display.
The best solution (in my opinion) is to create a "reader" page (as opposed to a download page) that requests a specific page from the server and allows the user to advance page by page (using AJAX).
The server will need to create a new PDF (file or stream) that contains only the requested page and return it to the reader.
If you are running your server with Ruby (Ruby on Rails), you can use the combine_pdf gem to load the PDF and send just one page.
You can define a controller method that will look something like this:
def get_page
  # read the book
  book = CombinePDF.parse IO.read("book.pdf")
  # create an empty PDF
  pdf_with_one_page = CombinePDF.new
  # add the page you want.
  # notice that the pages array is indexed from 0 and that params
  # values are strings, so an adjustment to user input is needed...
  pdf_with_one_page << book.pages[params[:page_number].to_i - 1]
  # no need to create a file, just stream the data to the client.
  send_data pdf_with_one_page.to_pdf, type: 'application/pdf', disposition: 'inline'
end
if you are running PHP or node.js, you will need to find a different server-side solution.
Good luck!
EDIT:
I was looking over the PDF.js project (which looks very nice) and noticed the limited-support statement for Safari:
"Safari (desktop and mobile) lacks a number of features or has defects, e.g. in typed arrays or HTTP range requests"...
I understand from this statement that on some browsers you can manage a client-side solution based on the HTTP Byte Serving protocol.
This will NOT work with all browsers, but it will keep you from having to use a server-side solution.
I couldn't find the documentation for the PDF.js feature (maybe it defaults to ranges and you just need to set the range...?), but I would go with a server-side solution that I know to work on all browsers.
EDIT 2:
Ignore Edit 1, as iPDFdev pointed out (thank you iPDFdev), this requires a special layout of the PDF file and will not resolve the issue of the browser downloading the whole file.
You can take the following approach, governed by functionality:
Add a configuration flag for whether you want to display the entire PDF or not.
While rendering your response, read the above-mentioned configuration: if the flag is set, generate a minimal 20-page PDF with a hyperlink to download the entire PDF; otherwise, generate the minimal 20-page PDF only.
When you prepare the initial response of your web page, add only the PDF that contains, say, 20 pages (the minimal PDF) and process the response.
On my page, JavaScript adds a lot of classes on page load (depending on the page).
How can I wait until JavaScript has added those classes, and then get the HTML, using either JavaScript or PHP from a different file?
When the page has finished loading, POST the rendered source back to a PHP script using Ajax.
$(function()
{
var data = $('body').html();
$.post('/path/to/php/script', data);
});
(This example assumes you're using jQuery)
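If the receiving script happened to be Python rather than PHP, the server side could be sketched with the standard library (`CatchHandler` is a hypothetical name; it simply keeps whatever body is POSTed to it, where a real script would write it to disk):

```python
import http.server

class CatchHandler(http.server.BaseHTTPRequestHandler):
    """Store each POSTed body, like the PHP catch script would."""
    saved = []  # collected request bodies

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        CatchHandler.saved.append(body)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        # keep the demo quiet
        pass
```

Served with `http.server.HTTPServer(('', 8000), CatchHandler)`, the Ajax POST above would land in `CatchHandler.saved`.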
It looks like what you need is Firebug. If you are using Google Chrome, you could also use the Google Chrome Developer Tools.
These tools will allow you to view the live DOM of the page as well as track any changes made by your javascript. Tools like these are essential to us as developers.
You cannot obtain the rendered HTML source from anywhere other than the JavaScript on the page itself. After JS has finished all the content changes in the HTML, you can post the HTML source to a PHP script on the server and save it.
Pseudo code:
// JavaScript using jQuery
setTimeout(function () {
    jQuery.post('/catch.php', { html: document.documentElement.outerHTML });
}, 2000);
// on the server side, create a catch.php file
<?php
file_put_contents('./tmp.txt', $_POST['html']);
You can't, easily.
JavaScript modifies the DOM in memory. This is a completely separate entity from the "source" you originally sent to the browser.
The closest thing you can do is build an XML representation of the DOM via JS and send it back to the server via AJAX. Why you would want/need to do this is beyond me.
Open your Bookmarks/Favorites and create a new one with this and then click it after your page loads:
javascript:IHtml=document.documentElement.innerHTML;
LThan=String.fromCharCode(60);
LT=new RegExp(LThan,'g');
IHtml=IHtml.replace(LT,'&lt;');
IHtml=IHtml.replace(/ /g,'&nbsp;');
Out='';
Out+='<!DOCTYPE html PUBLIC "-\/\/W3C\/\/DTD XHTML 1.0 Transitional\/\/EN"';
Out+=' "http:\/\/www.w3.org\/TR\/xhtml1\/DTD\/xhtml1-transitional.dtd">';
Out+='<html xmlns="http:\/\/www.w3.org\/1999\/xhtml" xml:lang="en-US" lang="en-US">';
Out+='<head><title>Inner HTML<\/title><\/head>';
Out+='<body style="color:black;background-color:#ffffee;">';
Out+='Body HTML:<br \/><ul>';
NLine=String.fromCharCode(10);
ILines=IHtml.split(NLine);
for (ix1=0; ix1<ILines.length; ix1++) {
  Out+='<li>'+ILines[ix1]+'<\/li>';
}
Out+='<\/ul>';
Out+=' [<a href="javascript:void(0);" onclick="window.close();" title="close">Close<\/a>]';
Out+='<\/body><\/html>\n';
PopUp1=window.open('','IHTML');
PopUp1.document.write(Out);
PopUp1.document.close();