Altering a page from another site - javascript

Sorry for the vague question name - didn't know how to phrase it.
I have built a PHP engine to parse web pages and extract phone numbers, addresses etc.
This is going to be used by clients to populate an address book by simply entering a new contact's web address.
The problem I am having is usability:
At the moment the script just adds each item (landline number, fax, etc.) to a different list box and the user picks the correct one. From a usability standpoint this is hard work (how do you know which is the correct contact number without looking at the site?).
so my question (finally!)
How would I achieve the functionality of
http://bartaz.github.io/sandbox.js/jquery.highlight.html
on someone else's website (I have no problem writing this functionality)?
FOR CLARITY
I want to show someone else's site (their contact page, for example) on my site, BUT I want to highlight items I have found (so, for example, add a tag around a phone number my PHP script has found).
I am aware that to display a website not on your domain an iFrame would be used - but as I need to alter the page content this is useless.
I also contemplated writing a bookmarklet that could be run on that page - but that means re-writing my parsing engine in javascript and exposing some of my tricks to make it accurate.
So I am left with pulling the page by cURL and then trying to match up javascript files, css files etc. that have relative URLs
Does anyone know how best to achieve this - and any pitfalls that might befall me.
I have tried using simple html dom parser - but it is tricky to get consistency, and I also don't know how having two sets of html tags, body tags etc. would affect sites.
If anyone has managed this before and could point me to the tools / general methods they used I would be eternally grateful!
PLEASE NOTE - I am very proficient with google and stack-overflow and have looked there first!

The ideal HTML solution
The easiest way to work around the relative paths for an arbitrary site would be to use the base tag with an href attribute to specify the default relative location (just use the URL up to the filename, such as <base href="http://www.example.com/path/to/" /> for the URL http://www.example.com/path/to/page). This should go at the top of the head block.
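As a rough illustration of the base tag idea, here is a minimal JavaScript sketch. The function name injectBase is hypothetical, and it assumes the fetched page is already available as a string and that the page does not define its own base tag.

```javascript
// Hypothetical helper (not from any library): add a <base> tag so that
// relative URLs in a fetched page resolve against the original site.
function injectBase(html, pageUrl) {
  // Resolving "." against the page URL yields the URL up to the filename.
  const baseHref = new URL('.', pageUrl).href;
  const baseTag = '<base href="' + baseHref + '">';
  // Match <head> or <head ...>, but not <header>.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    // Insert right after the opening head tag.
    return html.replace(headOpen, (match) => match + baseTag);
  }
  // No head block: prepend the tag and rely on the browser's error recovery.
  return baseTag + html;
}
```

A page that already contains a base tag would need that tag respected or rewritten instead; this sketch does not handle that case.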
Then you can alter the site simply by finding the relevant parts (the phone numbers and addresses your parser found) and wrapping them in your own tag, such as a span. For the formatting of these tags, the easiest way would be to add a style attribute, but you could also try to insert a <style> tag in the <head>.
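The wrapping step might look something like the following sketch. It assumes the parser returns the found items as exact strings; the names highlightTerms and escapeRegExp and the yellow background are illustrative choices. It only touches text between tags, so attribute values are left alone, but it will miss a number that is split across tags and it does not skip script or style contents.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap each occurrence of the given strings in a
// styled span, leaving the tags themselves untouched.
function highlightTerms(html, terms) {
  const usable = terms.filter(Boolean).sort((a, b) => b.length - a.length);
  if (usable.length === 0) return html;
  const pattern = new RegExp(usable.map(escapeRegExp).join('|'), 'g');
  // Splitting on a capturing group keeps the tags in the array:
  // odd indexes are tags, even indexes are the text between them.
  return html
    .split(/(<[^>]*>)/)
    .map((part, i) => (i % 2 === 1
      ? part
      : part.replace(pattern, (m) => '<span style="background:yellow">' + m + '</span>')))
    .join('');
}
```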
Of course, you'll also need to account for badly made webpages without <html>, <head> or <body> tags. You could either wrap the source in a new set of these tags, or just put in your base and style tags, hoping that the browser will work out what to do.
You probably also want to make this interactive, so you should also wrap them with some kind of link, and ideally you'll insert some javascript to handle their actions by ajax. You should also insert your own header at the top of the page, probably floating at the top, so that they know they're using your tool. Just keep in mind that some advanced pages might then conflict with your alterations (though for those cases you could have a link saying 'is this page not displaying correctly?' to take the user to your original basic listbox page as a backup).
The more robust solution
Clearly there are a lot of potential problems with the above, even though it is ideal. If you want to ensure robustness and avoid any problems with custom javascript and css on the page you're trying to alter, you could instead use a similar algorithm to that used in text based browsers such as lynx to reformat the page consistently. Then you can apply your algorithm to highlight the relevant parts of the page, and you can apply your own formatting as well without risk of it not displaying correctly. This way you can frame it really well and maintain your interface.
The problem with this is that you lose the actual look of the original page, but you should keep the context around the numbers and addresses which is the important thing. You would also then be able to use some dynamic javascript to take the user to each number and address consecutively to improve the user experience. Basically, this is rigorous and gives you complete control over the user experience, but you lose the original look of the website which may or may not confuse your users.
Personally, I'd go for the second option, but I'm not sure if anyone's created such a parser before. If not, the simplest thing you could do would be to strip the tags to get it as plain text. The next simplest would be to convert it into some simple text markup format like markdown, then convert it back into html. That way, you'd keep some basic layout such as headings, italicised and bold text, etc.
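For the simplest variant (plain text only), a naive sketch could look like this. The name toPlainText is made up, the regular expressions are a simplification rather than a real HTML parser, and character entities such as &amp; are not decoded.

```javascript
// Hypothetical helper: reduce a page to plain text with regular expressions.
function toPlainText(html) {
  return html
    // Drop script and style blocks together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Replace every remaining tag with a space so words do not run together.
    .replace(/<[^>]*>/g, ' ')
    // Collapse runs of whitespace.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Badly formed markup (for example an unclosed script tag) can defeat this, which is one reason to prefer an existing converter where one fits.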
You definitely don't want to have nested body tags. It might work, but it'll probably mess up your formatting and be inconsistent across browsers.
Here's a resource I found after a quick Google search:
https://github.com/nickcernis/html-to-markdown
There are other html to markdown scripts, but this was the most robust of the few I found. I'm still not sure whether it can handle badly formatted pages or ones with advanced formatting, so try it out yourself.
There are quite a few markdown to html converters though, in fact you could probably make a custom converter yourself quite easily to accommodate your personal needs.

Related

Should I inject style tags into the head dynamically or include style tags in the body?

I have some html content that gets embedded into a page via a server side call. So, when the page's html is being compiled on the server, a call is made to another server to return some html, which is then embedded within a div somewhere in the body. The problem is, this content contains its own css. So, I wrote a script to inject style tags into the HEAD on ready, which works fine on desktop browsers. However, on mobile devices there's a fairly significant flash of unstyled content. I know that you're technically not supposed to include style tags in the body, but in this case would it yield better results to just include them in the body instead of injecting them into the head?
In this case, it sounds like the right solution is to fix up your architecture so that the server-side compiler can include CSS for the remote page in the page head. This probably involves separating the CSS of the remote page(s) out of the markup there and then grabbing it as a separate file to be included in the page head during compilation.
Since the right solution is not always feasible, for any number of reasons, compromise is often required. Leaving the CSS in the remote markup, if it produces the result you desire, could be the best solution for you. Or perhaps some other hack to get the CSS into the head server-side could be appropriate. You need to decide whether any of these things is worth the effort and whether it is possible for you to accomplish given your constraints.
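One possible server-side hack, sketched here under the assumption that the compile step can run JavaScript (for example on Node.js), is to pull the style blocks out of the remote fragment before embedding it. The function name splitStyles is hypothetical.

```javascript
// Hypothetical helper: separate <style> blocks from a remote HTML fragment
// so the CSS can be emitted in the page head and the markup in the body.
function splitStyles(fragment) {
  const css = [];
  const markup = fragment.replace(
    /<style\b[^>]*>([\s\S]*?)<\/style\s*>/gi,
    (match, body) => {
      css.push(body.trim());
      return ''; // remove the block from the markup
    }
  );
  return { css: css.join('\n'), markup };
}
```

The returned css string would go inside a single style tag in the head, and markup into the target div.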
Some discussion here. In my experience a lot of enterprise content does it. Does that mean it's the RIGHT thing to do? I don't know. But it's certainly not frowned upon in my experience.
Source: https://www.w3.org/wiki/The_web_standards_model_-_HTML_CSS_and_JavaScript
Why separate?
Efficiency of code: The larger your files are, the longer they will take to download, and the more they will cost some people to view (some people still pay for downloads by the megabyte.) You therefore don’t want to waste your bandwidth on large pages cluttered up with styling and layout information in every HTML file. A much better alternative is to make the HTML files stripped down and neat, and include the styling and layout information just once in a separate CSS file. To see an actual case of this in action, check out the A List Apart Slashdot rewrite article where the author took a very popular web site and re-wrote it in XHTML/CSS.
Ease of maintenance: Following on from the last point, if your styling and layout information is only specified in one place, it means you only have to make updates in one place if you want to change your site’s appearance. Would you prefer to update this information on every page of your site? I didn’t think so.
Accessibility: Web users who are visually impaired can use a piece of software known as a “screen reader” to access the information through sound rather than sight — it literally reads the page out to them, and it can do a much better job of helping people to find their way around your web page if it has a proper semantic structure, such as headings and paragraphs. In addition keyboard controls on web pages (important for those with mobility impairments that can't use a mouse) work much better if they are built using best practices. As a final example, screen readers can’t access text locked away in images, and find some uses of JavaScript confusing. Make sure that your critical content is available to everyone.
Device compatibility: Because your HTML/XHTML page is just plain markup, with no style information, it can be reformatted for different devices with vastly differing attributes (eg screen size) by simply applying an alternative style sheet — you can do this in a few different ways (look at the [mobile articles on dev.opera.com] for resources on this). CSS also natively allows you to specify different style sheets for different presentation methods/media types (eg viewing on the screen, printing out, viewing on a mobile device.)
Web crawlers/search engines: Chances are you will want your pages to be easy to find by searching on Google, or other search engines. A search engine uses a “crawler”, which is a specialized piece of software, to read through your pages. If that crawler has trouble finding the content of your pages, or mis-interprets what’s important because you haven’t defined headings as headings and so on, then your rankings in relevant search results will probably suffer.
It’s just good practice: This is a bit of a “because I said so” reason, but talk to any professional standards-aware web developer or designer, and they’ll tell you that separating content, style, and behaviour is the best way to develop a web application.
Additional stackoverflow articles:
Using <style> tags in the <body> with other HTML
Will it be a wrong idea to have <style> in <body>?

dynamically generated invalid html

I know that invalid html can cause some seo-issues. But that only applies on the html-source right?
what about html that is served validly to the client, but then via some fancy js it gets manipulated into a new structure with markup violations?
Example. I have an unordered list with several li-elements, but want them separated in clusters to be displayed on a row. So once the user performs a certain action the ul includes several divs (class="liCluster") that contains the original list.
I know it's not a really swanky way to do it, but is there actually some serious problem with that, that I might not see yet?
At least it looks fine so far from a client's point of view...
It looks fine because most browsers are smart enough to know what you're trying to do. However, this takes the browser more time to do.
Most search engines are also smart enough to know what you're trying to do. Although search engines don't care much for markup, rather they care about content and how that content relates to different parts of your website. A poorly formatted ul of lis is not likely to make a massive impact in SEO.
That being said, why would you put DIVs in a ul if they're not wrapped in lis? Most browsers will remove the offending markup, and you're going to make more trouble for yourself. Just learn how to use CSS and you'll realise that you can do everything you need to do, and do it the correct way.

Maintain height of the website

I have a client who wants to do a website with specific height for the content part.
The Question:
Is there any way that when the text is long / reach the maximum height of the content part, then a new page is created for the next text.
Within my knowledge, somehow I know this can't be done.
Thanks for helping guys!
You will probably want to look into something like jQuery paging with tabs
http://code.google.com/p/jquery-ui-tabs-paging/
Unfortunately you would need to figure out the maximum number of characters you want to allow in the content pane and anything after that would need to be put into another tab. You can hide the tab and use just a link instead.
Without more knowledge of what your development is, this is a difficult question to answer. Are you looking to create a different page entirely, or just different sections on a page?
The former can be done using server-side code (e.g. Rails), and dynamically serving out pages (e.g. Google results are split across many pages).
The latter can be done with Javascript and/or CSS. A simple example is:
<div id="the_content" style="overflow:auto;width:200px;height:100px">
Some really long text...
</div>
This would create a scroll bar when the text overflows, so it does not disrupt the flow of the page. In Javascript (e.g. jQuery), you'll be able to split the content into "tabs".
Does this help?
(Almost) everything is possible, but your intuitions are right in that this can't be done easily or in a way that makes any sense.
If I were in your position, I would go up to the client and present advantages and disadvantages to breaking it up. Advantages include the fact that you'd be able to avoid long pages and that with some solutions to this problem, the page will load faster. Disadvantages include the increased effort (i.e., billable hours) it would take to accomplish this, the lack of precedent for it resulting in users being confused, and losses to SEO (you're splitting keywords amongst n pages).
This way, you're not shooting down the client's idea, and in the likely case the client retreats from his position, he will go away thinking that he's just made a smart choice by himself and everyone goes away happy.
If you're intent on splitting it up into pages, you can do it on the backend by either literally structuring your content into pages or applying some rule (e.g., cut a page off at the first whole paragraph after 1000 characters) to paginate the results. On the frontend, you could use URL hash fragments (e.g. #page2) to allow Javascript to paginate the results. You could even write an extensible library that "paginates" any text node. In fact, I wouldn't be surprised if one already existed.
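The "first whole paragraph after N characters" rule could be sketched like this. It assumes the content is already split into an array of paragraph strings; the name paginate and the threshold parameter are illustrative.

```javascript
// Hypothetical helper: group paragraphs into pages, closing a page at the
// end of the paragraph in which the running length reaches minChars.
function paginate(paragraphs, minChars) {
  const pages = [];
  let current = [];
  let length = 0;
  for (const paragraph of paragraphs) {
    current.push(paragraph);
    length += paragraph.length;
    if (length >= minChars) {
      pages.push(current);
      current = [];
      length = 0;
    }
  }
  // Whatever is left over becomes the last (shorter) page.
  if (current.length > 0) pages.push(current);
  return pages;
}
```

Because pages always end on a paragraph boundary, their heights will vary; a strictly fixed content height would still need a CSS overflow rule on top.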

Ensure javascript badge / widget html is not changed

One of my clients wants to distribute a javascript widget that people can put on their websites. However he wants to ensure that the backlink is left intact (for SEO purposes and part of the price of using the widget). So the javascript he's going to distribute might look like this:
<script id="my-script" src="http://example.com/widget-script.js"></script>
<div style='font-size:10px'><a href='http://www.example.com/backlinkpage.html'>
Visit Example.com</a></div>
widget-script.js would display some html on the page. But what we want to ensure is that some wily webmaster doesn't strip out the back link. If they do, we might display a message like "widget installed incorrectly" or something. Any ideas / thoughts?
Some code taken from this question.
There's no 100% way of preventing this, I'm afraid.
You could insert the link yourself with Javascript, but then it'd be for naught as far as PageRank goes.
You could give them the HTML with the link having an ID like mycompanybacklink and check with Javascript whether the element exists or not. If it doesn't, don't display the badge or whatever. If it does, you can verify that the link's href is your website and its text is what you want. You would have to edit the HTML you posted as sample so that the link comes before the script, not after. The element could still exist, however, but be blocked by some other element or simply hidden with CSS. You could then also do something akin to what jQuery does now with its :hidden selector: Instead of looking at the CSS property by itself (which is what a webmaster is most likely to try) you can just see whether the element itself or its parents take up any space in the document. I think this is done with offsetWidth and offsetHeight but I am not sure. Worth looking into, though....
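A minimal sketch of that check follows. The function name isBacklinkIntact and its parameters are hypothetical; the id mycompanybacklink is the one suggested above. It treats the link as hidden when it takes up no space, which covers display:none on the link or an ancestor but not tricks such as visibility:hidden, zero opacity, or off-screen positioning.

```javascript
// Hypothetical check, meant to run inside the widget script after the
// backlink markup has been parsed. "doc" is the page's document object.
function isBacklinkIntact(doc, expectedHref, expectedText) {
  const link = doc.getElementById('mycompanybacklink');
  if (!link) return false;                        // link was removed
  if (link.href !== expectedHref) return false;   // link points elsewhere
  if (link.textContent.trim() !== expectedText) return false;
  // An element that is not rendered has zero offset dimensions.
  return link.offsetWidth > 0 && link.offsetHeight > 0;
}

// In the browser this might be called as:
//   isBacklinkIntact(document,
//     'http://www.example.com/backlinkpage.html', 'Visit Example.com');
```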
If you wanted to ensure that the link is always there with the widget, you could just have it printed via JavaScript. However, I don't think search engines would pick it up as a backlink.
I think you're just going to have to trust that your users will act in good faith and show you the courtesy of not modifying/removing the link. You also need to accept that no matter what you do, a determined webmaster will be able to use your widget without displaying the link, and some inevitably won't, but they are likely to be in the minority (unless your backlink is just really intrusive or obnoxiously distracting).
Any JavaScript/HTML solution could simply be edited out by the webmaster. You'd have to make your widget in Flash if you really want to prevent tampering.

window.location and SEO

I'm trying to use something like jQuery biggerlink or just simple window.location for making bigger and more accessible links. What I'm wondering is what happens with SEO in these cases: I have an anchor link in the containing element, but does Google penalize such actions since I'm not really clicking on a link? Also, are there any other solutions (besides CSS positioning) which could be better than this one? Thanks.
Setting window.location from script will not be spotted by search engines (Google has detection for simple document.write additions but this won't catch any of the more advanced DOM scripting stuff). It's also bad for usability: all the usual browser controls you get for links, like middle-click-for-new-tab, right-click-copy-location or bookmark stop working.
biggerlink avoids the SEO issue by keeping the correct <a href> markup in the HTML, and adding extra click handling over the top of that. (The ‘bigger’ parts of the biggerlinks still don't respond to eg. middle-click, but the ‘native’ parts do.) As long as you keep <a href> in an appropriate place you don't have to worry about search engines.
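The general pattern (not biggerlink's actual code, which is not shown here) can be sketched as follows. The function name enlargeLink and the navigate callback are assumptions made so the sketch stays independent of the browser's window object.

```javascript
// Hypothetical sketch of the pattern: the real <a href> stays in the markup
// for search engines, and a click handler on the container forwards clicks
// that land outside the link.
function enlargeLink(container, navigate) {
  const link = container.querySelector('a[href]');
  if (!link) return false; // nothing to enlarge
  container.addEventListener('click', (event) => {
    // Clicks on the real link keep their native behaviour.
    if (event.target.closest('a')) return;
    navigate(link.href);
  });
  return true;
}

// In the browser this might be called as:
//   enlargeLink(document.getElementById('teaser'),
//     (url) => { window.location = url; });
```

The element id teaser in the usage comment is made up for illustration.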
I'm not at all sure this stuff is necessary. The effects I've seen biggerlink do could easily be done using links with ‘display: block;’ and occasional workarounds like multiple links when you want to do things like headings inside the links. Sure it's a little more markup, but it's a lot less scripting and then all links respond in the expected way links usually do.
A JavaScript redirect does not share the Meta Refresh tag's syntax, but the two behave alike: both run on the client side, in the web browser.
<script type="text/javascript"> window.location = "http://www.example.com/path/file.html" </script>
This can be placed anywhere inside the HTML source and is probably used more often than the Meta Refresh tag for delayed redirects, since in JavaScript you can combine window.location with other scripting. It isn't the best choice for SEO, though, as search engines usually ignore JavaScript code. In recent years, Google has started reading JavaScript and has talked about its headless browser technology, including for Googlebot crawling.
Search engines generally don’t interpret JavaScript, they just read what your HTML markup says. So your SEO attempts will be overlooked.
