python and javascript with html generation - javascript

On a website that I manage, we have JSON files which contain page data. We then create the page using this JSON.
The data looks roughly like this (except a lot more complex).
[
{"title": "Hello world", "content": "World, hello to you!"},
{"title": "Hello world Part II", "content": "The sequel to hello world."},
...
]
This data is then parsed into HTML. Now, here lies the issue: we need two versions of the HTML.
One needs to be static, outputted in the format of file-0.html which would be formatted with a title of Hello World and content of World, hello to you! and file-1.html (title=Hello World Part II, content=The sequel to hello world).
The second needs to be just a plain page, file-all.html, which includes JavaScript that pulls the JSON via AJAX when it's needed and creates a container for each page, with subpages that hold the content and titles for everything in the JSON.
Right now, we use Python to generate the HTML for the file-0.html static pages, and then JavaScript for the AJAX pages. While this works, it means there is a lot of code duplication for a pretty small project: every time we want to change the class of the <h1> that the title is wrapped in, we have to change two places with slightly different syntax.
Is there any good way of fixing this issue, so that all the code for generating page (or as much as possible) is in one language? (This would probably have to be JavaScript, since bandwidth is an issue—we'd like to avoid transferring HTML via AJAX if possible.)

You have two good options:
Write the page generation logic in a template language like Mustache (http://mustache.github.com/). Then you can compile those templates down to Python (for the server side) and Javascript (for the client). The data consumed by both versions is the same and you only have to maintain one template definition.
Write everything in Javascript and execute that JS on the server. There are at least two good server-side JS engines: V8 and Mozilla Rhino.
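As a rough illustration of the single-template idea, here is a minimal sketch that renders the static file-N.html pages from the JSON with one template definition. It uses Python's standard string.Template as a stand-in for Mustache; the template markup, the class names, and the render_pages helper are assumptions made for the example, not part of the original setup.

```python
import html
import json
from string import Template

# Hypothetical shared template. With Mustache, the same template file would
# also be rendered by the client-side JavaScript.
PAGE_TEMPLATE = Template(
    '<article class="page">'
    '<h1 class="title">$title</h1>'
    '<div class="content">$content</div>'
    '</article>'
)


def render_pages(json_text):
    """Return a dict mapping file names (file-0.html, ...) to rendered HTML."""
    pages = json.loads(json_text)
    rendered = {}
    for index, page in enumerate(pages):
        # Escape values so that text from the JSON cannot inject markup.
        values = {key: html.escape(str(value)) for key, value in page.items()}
        rendered[f"file-{index}.html"] = PAGE_TEMPLATE.substitute(values)
    return rendered


if __name__ == "__main__":
    data = json.dumps([
        {"title": "Hello world", "content": "World, hello to you!"},
        {"title": "Hello world Part II", "content": "The sequel to hello world."},
    ])
    for name, markup in render_pages(data).items():
        print(name, markup)
```

Changing the class of the title then means editing only PAGE_TEMPLATE, provided the client renders from the same template source.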

You can use the server to render the page and grab that with an AJAX response. While this avoids code duplication, it might be less efficient since you have to query the server to render each page, rather than making the client do it themselves (this might not be too much of an issue, though). It shouldn't take up too much bandwidth, since it's just HTML (unless you're throwing in templates from all sorts of places). This approach, of course, only works if you're using a dynamic web site.
Alternatively, you can implement all the rendering logic in JavaScript and use something like PyV8 to run it from within Python. I question the efficiency (and sanity) of this, though.


How to know if web content cannot be handled by Scrapy?

I apologize if my question sounds too basic or general, but it has puzzled me for quite a while. I am a political scientist with little IT background. My own research on this question does not solve the puzzle.
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category? I once came across some texts that show in Chrome Inspect, but could not be extracted by Xpath (I am 99.9% certain my Xpath expression was correct). Someone mentioned that the texts might be hidden behind some JavaScript. But this is still speculation, I can't be totally sure that it wasn't due to wrong Xpath expressions. Are there any signs that can make me certain that this is something beyond Scrapy and can only be dealt with programs such as Selenium? Any help appreciated.
-=-=-=-=-=
Edit (1/18/15): The webpage I'm working with is http://yhfx.beijing.gov.cn/webdig.js?z=5. The specific piece of information I want to scrape is circled in red ink (see screenshot below. Sorry, it's in Chinese).
I can see the desired text in Chrome's Inspect, which indicates that the Xpath expression to extract it should be response.xpath("//table/tr[13]/td[2]/text()").extract(). However, the expression doesn't work.
I examined response.body in Scrapy shell. The desired text is not in it. I suspect that it is JavaScript or AJAX here, but in the html, I did not see signs of JavaScript or AJAX. Any idea what it is?
It is said that Scrapy cannot scrape web content generated by JavaScript or AJAX. But how can we know if certain content falls in this category?
A browser does a lot of things when you open a web page. I will oversimplify the process here. It:
1. Performs an HTTP request to the server hosting the web page.
2. Parses the response, which in most cases is HTML content (a text-based format). We will assume we get an HTML response.
3. Starts rendering the HTML, executes the Javascript code, and retrieves external resources (images, CSS files, JS files, fonts, etc.), not necessarily in this order.
4. Listens to events that may trigger more requests to inject more content into the page.
Scrapy provides tools to do 1. and 2. Selenium and other tools like Splash do 3., allow you to do 4., and give you access to the rendered HTML.
Now, I think there are three basic cases you face when you want to extract text content from a web page:
1. The text is in plain HTML format, for example, as a text node or HTML attribute: <a>foo</a>, <a href="foo" />. The content could be visually hidden by CSS or Javascript, but as long as it is part of the HTML tree we can extract it via XPath/CSS rules.
2. The content is located in Javascript code. For example: <script>var cfg = {code: "foo"};</script>. We can locate the <script> node with an XPath rule and then use regular expressions to extract the string we want. There are also libraries that allow us to parse pieces of Javascript so we can load objects easily. A more complex solution here is executing the Javascript code via a Javascript engine.
3. The content is located in an external resource and is loaded via Ajax/XHR. Here you can emulate the XHR request with Scrapy and then parse the output, which can be a nice JSON object, arbitrary Javascript code or simply HTML content. If it gets tricky to reverse engineer how the content is retrieved/parsed then you can use Selenium or Splash as a proxy for Scrapy so you can access the rendered content and still be able to use Scrapy for your crawler.
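For the second case, a minimal sketch of the regular-expression approach could look like this. It works on the raw page source rather than on a Scrapy selector, and the pattern is written only for the cfg example above; the extract_code name is made up for the illustration.

```python
import re

# Pattern tailored to the example `var cfg = {code: "foo"};`.
# A real page will need its own pattern.
CFG_CODE = re.compile(r'var\s+cfg\s*=\s*\{\s*code\s*:\s*"([^"]*)"')


def extract_code(page_source):
    """Return the string assigned to cfg.code in an inline script, or None."""
    match = CFG_CODE.search(page_source)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_code('<script>var cfg = {code: "foo"};</script>'))  # prints foo
```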
How do you know which case you have? You can simply look up the content in the response body:
$ scrapy shell http://example.com/page
...
>>> b'foo' in response.body.lower()
True
If you see foo in the web page via the browser but the test above returns False, then it's likely the content is loaded via Ajax/XHR. Check the network activity in the browser to see which requests are made and what the responses are. Otherwise you are in case 1. or 2. You can simply view the source in the browser and search for the content to figure out where it is located.
Let's say the content you want is located in HTML tags. How do you know if your XPath expression is correct? (By correct here we mean that it gives you the output you expect.)
Well, if you do scrapy shell and response.xpath(expression) returns nothing, then your XPath is not correct. You should reduce the specificity of your expression until you get an output that includes the content you want, and then narrow it down.
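The check for which of the three cases applies can be sketched without Scrapy, using only the standard library on the raw response text. The classify function and the returned labels are hypothetical names for this illustration, and the substring matching is deliberately simplistic.

```python
from html.parser import HTMLParser


class _TextLocator(HTMLParser):
    """Record whether a needle occurs in ordinary markup or inside a script."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle.lower()
        self.in_script = False
        self.found_in_html = False
        self.found_in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        # Attribute values count as plain HTML content (case 1).
        for _name, value in attrs:
            if value and self.needle in value.lower():
                self.found_in_html = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.needle not in data.lower():
            return
        if self.in_script:
            self.found_in_script = True
        else:
            self.found_in_html = True


def classify(raw_html, needle):
    """Guess which of the three cases applies to `needle` in `raw_html`."""
    locator = _TextLocator(needle)
    locator.feed(raw_html)
    locator.close()
    if locator.found_in_html:
        return "case 1: plain HTML"
    if locator.found_in_script:
        return "case 2: inside a script"
    return "case 3: not in the response, probably loaded later"
```

In a Scrapy shell session the raw_html argument would be the decoded response body.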

Strategy for making React image gallery SEO-friendly

I wrote a React image gallery or slideshow. I need to make the alt text indexable by search engines, but because my server is in PHP, React.renderToString is of limited use.
The server is in PHP + MySQL. The PHP uses Smarty, a decent PHP template engine, to render the HTML. The rest of the PHP framework is my own. The Smarty template has this single ungainly line:
<script>
var GalleryData = {$gallery};
</script>
which is rendered by the PHP's controller function as follows:
return array(
'gallery' => json_encode($gallery),
);
($gallery being the result table of a MySQL query).
And my .js:
React.render(<Gallery gallery={GalleryData} />, $('.gallery').get(0));
Not the most elegant setup, but given that my server is in PHP there doesn't seem to be much of a better way to do it (?)
I did a super quick hack to fix this at first shot - I copied the rendered HTML from Firebug, and manually inserted it into a new table in the DB. I then simply render this blob of code from PHP and we're good to go on the browser.
There was one complication which is that because React components are only inserted into the DOM as they're first rendered (as I understand it), and because the gallery only shows one image slide at a time, I had to manually click through all slides once before saving the HTML code out.
Now however the alt text is editable by CMS and so I need to automate this process, or come up with a better solution.
Rewriting the server in Node.js is out of the question.
My first guess is that I need to install Node, and write a script that creates the same React component. Because the input data (including the alt text) has to come from MySQL, I have a few choices:
connect to the MySQL DB from Node, and replicate the query
create a response URL on the PHP side that returns only the JSON (putting the SQL query into a common function)
fetch the entire page in Node but extracting GalleryData will be a mess
I then have to ensure that all components are rendered into the DOM, so I can script that by manually calling the nextSlide() method as many times as there are slides (less one).
Finally I'll save the rendered DOM into the DB again (so the Node script will require a MySQL connection after all - maybe the 1st option is the best).
This whole process seems very complicated for such a basic requirement. Am I missing something?
I'm completely new to Node and the whole idea of building a DOM outside of the browser is basically new to me. I don't mind introducing Node into my architecture but it will only be to support React being used on the front-end.
Note that the website has about 15,000 pageviews a month, so massive scalability isn't a consideration - I don't use any page caching as it simply isn't needed for this volume of traffic.
I'm likely to have a few React components that need to be rendered statically like this, but maintaining a small technical overhead (e.g. maintaining a set of parallel SQL queries in Node) won't be a big problem.
Can anyone guide me here?
I think you should try rendering React components on server-side using PHP. Here is a PHP lib to do that.
But, yes, you'll basically need to use V8js from your PHP code. However, it's kind of experimental and you may need to go the other way around. (And this "other way around" may be using Node/Express to render your component. Here are some thoughts on how to do it.)

How to completely separate DOM manipulation from PHP? [closed]

I have to create a website for a friend of mine using PHP. It is basically an online store.
I want to use new features of HTML5, CSS3, jQuery and other JS libraries. And I want to keep all the document generation and manipulation separate from PHP.
I have done a lot of searching on Google. People come up with MVC architecture. And that's all good. But the problem is that in all the examples or tutorials I found, people retrieve the data from SQL-based databases, and then echo or print it to generate HTML, or use some ORM classes to display it. I don't know much about ORM or PHP frameworks.
I have always made pet projects, small websites, nothing like this medium-sized Store.
The way I understand the MVC architecture is this:
**Model:** Basic purpose is to save and retrieve data from the databases.
**Controller:** Do some operations on the data to either store them using Model, or do some operations on the retrieved data (again using the Model), to be passed to the View.
**View:** This is used to display the user the content.
What I want to do is do almost ZERO HTML generation using PHP. Instead I was thinking of an approach in which the Model is used for database handling only. The Controller is used to convert that data into JSON objects, and make those JSON objects available to the appropriate Views. Then, using JavaScript, I will do the DOM manipulations according to the JSON objects.
Is this approach any good ? If yes how to do it (especially the part of converting the data retrieved from database to JSON objects).
If not, can you provide me with a better approach where I won't have to generate HTML using PHP, and can use PHP for the front end as little as possible?
I am doing all the front-end stuff which the user is gonna see. My friend will be doing all the database handling. I don't wanna get involved in the PHP part, and if it is mandatory (i.e. there is no way-out) then as little as possible.
Please provide me with some solution. In desperate need here.
EDIT: I am especially talking about echo and print commands. I would like to have a fresh slate to work on instead of getting the html creation mixed with PHP and JavaScript.
If NOT using these commands is not suggested based on the fact that the user may be on mobile device, or have JavaScript turned off. Then is it possible to have a simple looking website with all the data displayed if JavaScript is turned off; and if it's not turned off then remove all those elements from the DOM and make a fresh DOM with JavaScript. However the main hindrance to this is converting the data retrieved from database to JSON object so that it can be used by the JavaScript.
I don't think this is possible, but is there some way in which PHP variables can be directly used by JavaScript ?
PHP never manipulates the DOM. The DOM is purely client-side, while PHP is purely server-side. PHP can generate HTML, which will be sent to the client and processed into a DOM by the client's browser.
If you want to (nearly) completely split it in two parts, you could split it into an API server (php & database) which will provide a RESTful JSON-API and a content server, which will provide your static HTML, CSS and Javascript files.
The Javascript on the content server will connect to the API server with AJAX get and post requests to retrieve and send data to the database.
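The part the question asks about, turning database rows into JSON for the client, is small in most languages. Here is a minimal sketch in Python with an in-memory SQLite database, used purely for illustration; the products table, its columns, and the products_as_json name are assumptions, and a PHP API server would do the same step with json_encode.

```python
import json
import sqlite3


def products_as_json(connection):
    """Return all rows of the hypothetical `products` table as a JSON array."""
    # sqlite3.Row makes each row behave like a mapping of column name to value.
    connection.row_factory = sqlite3.Row
    rows = connection.execute(
        "SELECT id, name, price FROM products ORDER BY id"
    ).fetchall()
    return json.dumps([dict(row) for row in rows])


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    db.executemany(
        "INSERT INTO products (name, price) VALUES (?, ?)",
        [("Mug", 7.5), ("Poster", 12.0)],
    )
    print(products_as_json(db))
```

The JavaScript on the content server would then parse this response and build the DOM from it.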
Yes, it's entirely possible to do what you're describing. You'd use static HTML files for the basic page setup, the usual CSS and images and such, and your PHP would only be used to generate JSON to return to the client and get used by JavaScript. So for instance:
index.html:
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8 />
<title>Example</title>
</head>
<body>
<table id="theTable">
<tbody>
<tr><td><em>Loading...</em></td></tr>
</tbody>
</table>
<script src="yourscript.js"></script>
</body>
</html>
yourscript.js:
getDataViaAjax("data.php", function(data) {
var table = document.getElementById("theTable");
// ...fill in the table using the information from `data`...
});
data.php:
<?php
// Get data from somewhere
// ...
// Output it
echo json_encode($theData);
?>
Obviously the table there is just an example. You'd probably have much more static content, and a few places where you wanted to add dynamic content.
This is a perfectly feasible approach, and the separation of concerns helps as the team expands.
However, note that if you do this, any page that has content from the DB will result in two HTTP requests to the server (one to load the HTML, the other — which won't start until after the first one is at least partially finished — to load the data) rather than one. In general, the goal is to minimize HTTP requests. So there are trade-offs.
I don't think this is possible, but is there some way in which PHP variables can be directly used by JavaScript ?
Correct, that's not possible. There are frameworks like Meteor (that one isn't PHP-based) that handle the middle layer for you, though, and make it seem a lot like that's happening.
You can also look at tools like AngularJS and KnockoutJS that can bind your JavaScript data objects to DOM elements, saving you a huge amount of manual update code, or even just things like Handlebars that render templated stuff for you.
I think what you are looking for is a client-side template engine, where the document is built client-side using ajax queries. The ones I have heard good things about are Handlebars and Mustache, though I'm sure there are others to choose from.
But even with such a solution, I imagine that some amount of server-side HTML needs to be output to "prime the pump", in which case, you would want to consider a server-side template engine like Smarty or whatever the latest-and-greatest equivalent is. With a server-side template engine, you would write the templates as standalone files (like .tpl for Smarty) and PHP would consume the template as an object and then pass in any unique variables for the template via the template-engine's methods and then you would call the display method for the template.
In either scenario (or a combination of both) you are separating your final HTML output from PHP so that PHP is interacting with the templates rather than doing plain echo "<div>This looks so Web 1.0</div>"; which I think is what you are trying to avoid.

Generating HTML in JavaScript vs loading HTML file

Currently I am creating a website which is completely JS driven. I don't use any HTML pages at all (except index page). Every query returns JSON and then I generate HTML inside JavaScript and insert into the DOM. Are there any disadvantages of doing this instead of creating HTML file with layout structure, then loading this file into the DOM and changing elements with new data from JSON?
EDIT:
All of my pages are loaded with AJAX calls. But I have a structure like this:
<nav></nav>
<div id="content"></div>
<footer></footer>
Basically, I never change nav or footer elements, they are only loaded once, when loading index.html file. Then on every page click I send an AJAX call to the server, it returns data in JSON and I generate HTML code with jQuery and insert like this $('#content').html(content);
Creating separate HTML files, and then for example using $('#someID').html(newContent) to change every element with JSON data, will use even more code and I will need 1 more request to server to load this file, so I thought I could just generate it in browser.
EDIT2:
SEO is not very important, because my website requires logging in so I will create all meta tags in index.html file.
In general, it's a nice way of doing things. I assume that you're updating the page with AJAX each time (although you didn't say that).
There are some things to look out for. If you always have the same URL, then your users can't come back to the same page. And they can't send links to their friends. To deal with this, you can use history.pushState() to update the URL without reloading the page.
Also, if you're sending more than one request per page and you don't have an HTML structure waiting for them, you may get them back in a different order each time. It's not a problem, just something to be aware of.
Returning HTML from the AJAX is a bad idea. It means that when you want to change the layout of the page, you need to edit all of your files. If you're returning JSON, it's much easier to make changes in one place.
One thing that definitely matters:
How long will it take you to develop a new system that will send data as JSON, plus code the JS required to inject it as HTML into the page?
How long will it take to just return HTML? And how long if you can re-use some of your already existing server-side code?
Also check how much server-side interaction your pages have.
also some advantages of creating pure HTML :
1) It's simple markup, and often just as compact or actually more compact than JSON.
2) It's less error prone cause all you're getting is markup, and no code.
3) It will be faster to program in most cases cause you won't have to write code separately for the client end.
4) The HTML is the content, the JavaScript is the behavior. You're mixing both for absolutely no compelling reason.
In JavaScript or any other scripting language, if you hit an error partway through, the rest of the code will not run.
It is also easier to debug pure HTML pages.
My opinion: use scripting code wherever necessary; the rest you can do in HTML.
That will save the round trip of going to the server, fetching the data, and then displaying it again.
Keep point No. 4 in your mind while coding.
I think that you can consider 3 methods:
Sending only JSON to the client and rendering according to a template (e.g. Handlebars.js)
Creating the pages from the server-side, usually faster rendering also you can cache the page.
Or a mixture of the two: generate partial views on the server and send them to the client. It's like having a Handlebars template on the client and applying the JSON data to it, except that the template lives on the server, is rendered there, and is sent to the client in its final format; on the client you just replace the partial views.
Also, some things to think about depend on the use case of the application: if you are targeting SEO you should consider ColBeseder's advice, or if you are targeting mobile users, you would probably be better off with the JSON-only response, as it is more lightweight.
EDIT:
According to what you said, you are creating a single-page application. If this is correct, then you can probably go with either the JSON or partial views like AngularJS has. But if your server-side logic is written to handle only JSON responses, then you would probably be better off using a template engine on the client like Handlebars.js, Underscore, or jQuery Templates, so that you can define reusable portions of your HTML and apply the data from the JSON to them.
If you cared about SEO you'd want the HTML there at page load, which is closer to your second strategy than your first.
Update May 2014: Google claims to be getting better at executing Javascript: http://googlewebmastercentral.blogspot.com/2014/05/understanding-web-pages-better.html Still unclear what works and what does not.
Further updates probably belong here: Do Google or other search engines execute JavaScript?

HTML that's both server-side and javascript generated - how to combine?

I'm usually a creative gal, but right now I just can't find any good solution. There's HTML (say form rows or table rows) that's both generated javascript-based and server-sided, it's exactly the same in both cases. It's generated server-sided when you open the page (and it has to stay server-sided for Google) and it's generated by AJAX, to show live updates or to extend the form by new, empty rows.
Problem is: The HTML generation routines are existing twice now, and you know DRY (don't repeat yourself), aye? Each time something's changed I have to edit 2 places and this just doesn't fit my idea of good software.
What's your best strategy to combine the javascript-based and server-sided HTML generation?
PS: Server-sided language is always different (PHP, RoR, C++).
PPS: Please don't give me an answer for Node.JS, I could figure that out on my own ;-)
Here's the Ruby on Rails solution:
Every model has its own partial. For example, if you have models Post and Comment, you would have _post.html.erb and _comment.html.erb
When you call "render @post" or "render @comment", RoR will look at the type of the object and decide which partial to use.
This means that you can render an object in the same way in many different views.
I.e. in a normal response or in an AJAX response you'd always just call "render @post"
Edit:
If you would like to render things in JS without connecting to the server (e.g. you get your data from a different server or whatever), you can make a JS template with the method I mentioned, send it to the client and then have the client render new objects using that template.
See this link for a JS templating plugin: http://api.jquery.com/category/plugins/templates/
Make a server handler to generate the HTML. Call that code from the server when you open the page, and when you need to do a live update, do an AJAX request to that handler so you don't have to repeat the code in the client.
What's your best strategy to combine the javascript-based and server-sided HTML generation?
If you want to stay DRY, don't try to combine them. Stick with generating the HTML only on the server (clearly the preferable option for SEO), or only on the client.
Make a page which generates the HTML on the server and returns it, e.g.:
http://example.com/serverstuff/generaterows?x=0&y=foo
If you need it on the server, access that link, or call the subroutine that accessing the link calls. If you need it on the client, access that link with AJAX, which will end up calling the same server code.
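A minimal sketch of this idea, in Python for illustration only: one function owns the row markup, and both the initial page render and the AJAX endpoint call it. The meaning given to x (first row index) and y (a label), the fixed count of three rows, and all function names are assumptions made for the example.

```python
import html
from urllib.parse import parse_qs, urlparse


def generate_rows(x, y):
    """The single place where the row markup is defined."""
    label = html.escape(y)
    return "".join(
        f"<tr><td>{index}</td><td>{label}</td></tr>"
        for index in range(x, x + 3)  # three rows per call, chosen arbitrarily
    )


def handle_request(url):
    """What the AJAX endpoint would return: only the rows fragment."""
    query = parse_qs(urlparse(url).query)
    return generate_rows(int(query["x"][0]), query["y"][0])


def render_full_page(x, y):
    """Initial page load: the same rows, embedded in the surrounding table."""
    return f"<table><tbody>{generate_rows(x, y)}</tbody></table>"


if __name__ == "__main__":
    print(handle_request("http://example.com/serverstuff/generaterows?x=0&y=foo"))
    print(render_full_page(0, "foo"))
```

The client-side code then only inserts the returned fragment and contains no markup of its own.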
Or am I missing something? (I'm not sure what you mean by "generated by AJAX").
I don't see another solution if you have two different languages. Either you have a PHP/RoR/whatever to JavaScript compiler (so you have source written in one language and automatically generated in the others), or you have one generate output that the other reads in.
Load the page without any rows/data.
And then run your Ajax routines to fetch the data first time on page load
and then subsequently fetch updates/new records as and when required/as decided by your code.
