I'm using a Crawler4j and Jsoup to crawl a website and it works fine for the HTML text, but there are some important contents, which default values are hard coded in CSS and then dynamically set with JavaScript.
For example, I have the
and I need the width value, which in CSS is hard coded as 10px, but modified in JavaScript to, let's say, 5px.
Is there a way to get this value without using another crawler? Or a simple alternative?
I have already quite a lot of code, so I don't want to rewrite everything if there is a possibility to do that with the Crawler4j.
I hope my question is clear enough and thank you in advance for your help!
This is not possible with crawler4j nor with jsoup. They both handle only static HTML content.
There are several open issues related dynamic JavaScript execution on the official GitHub Repository: #49, #197 and #220.
To achieve your objectives, you would need to build a stack based on Selenium, CasperJS and/or PhantomJS, which could then be used for advanced crawling including JavaScript execution.
Related
I'm redesigning a Squarespace site for a client who owns a cooking school. She uses an external booking site (Fareharbor) to manage her classes which are then embedded into the site using JavaScript widget. The embedded listings don't match the style of the rest of the site, which looks really unprofessional, so I want to update the styles to make them look better. I tried modifying it in the CSS stylesheet (both in page and for the whole site) and it did nothing (clearly the script takes priority). I've looked around and people have said there's a way to do it using JQuery or straight JavaScript, but while I'm pretty adept at following code instructions and have a little JS experience, I'm pretty new to this so I don't actually know where I would put whatever code I use to change it.
So my question is:
1) What is the best way to override the default appearance of the embedded script?
2) If the answer involves using JavaScript/JQuery to modify the appearance of the original script, how do I do that? (What is the code, and where do I put it).
The page I'm trying to modify is figcookingschool.com/classes (currently working on the first embed, with the pictures – assuming the answer to this will help me with the second!) and the code I was given to embed the JavaScript is:
<script src="https://fareharbor.com/embeds/script/items/figcookingschool/?full-items=yes&fallback=simple"></script>
One of the many things I want to do is change the border-radius to 0 on each box so if you want something to use as an example, that's a good one ;) Thanks in advance!!
I'm working on a project that's basically a web form that the user fills in. Once complete they can save a screenshot of the entire, complete form to use as a sort of pamphlet.
I'm trying to work out the best way to approach this programatically.
I have a simple prototype working using the canvas tag, but the text formatting options aren't good enough for what I need so I'm wondering if there's some other way to generate a screenshot of a HTML element.
If I may suggest, post the result to a webserver and create a nice PDF document from it using some reporting tool available.
In my opinion, this is the best, maintainable solution. No need for hacking it in HTML or Javascript. You can use your own logo, print layout, fonts, etc.
Sorry for the vague question name - didn't know how to phrase it.
I have built a PHP engine to parse web pages and extract phone numbers, addresses etc.
This is going to be used by clients to populate an address book by simply entering a new contacts web address.
The problem I am having is useability:
At the moment the script just adds each item (landline number, fax etc) to a different list box and the user picks the correct one - from a useability standpoint this is hard work (how do you know which is the correct contact number without looking at the site)
so my question (finally!)
How would achieve the functionality of
http://bartaz.github.io/sandbox.js/jquery.highlight.html
On someone else website (I have no problem writing this functionality).
FOR CLARITY**
I want to show someone elses site (their contact page for example) on my site BUT I want to highlight items I have found (so for example add a tag around a phone number my php script has found)
I am aware that to display a website not on your domain an iFrame would be used - but as I need to alter the page content this is useless.
I also contemplated writing a bookmarklet that could be run on that page - but that means re-writing my parsing engine in javascript and exposing some of my tricks to make it accurate.
So I am left with pulling the page by cURL and then trying to match up javascript files, css files etc. that have relative URLs
Does anyone know how best to achieve this - and any pitfalls that might befall me.
I have tried using simple html dom parser - but it is tricky to get consistency and I also dont know how having two sets of tags, body tags etc. would affect sites.
If anyone has managed this before and could point me to the tools / general methods they used I would be eternally grateful!
PLEASE NOTE - I am very proficient with google and stack-overflow and have looked there first!
The ideal HTML solution
The easiest way to work around the relative paths for an arbitrary site would be to use the base href tag to specify the default relative location (just use the url up to the filename, such as <base href="http://www.example.com/path/to/" /> for the URL http://www.example.com/path/to/page. This should go at the top of the head block.
Then you can alter the site simply by finding the relative parts and wrapping them in your own tag, such as a span. For the formatting of these tags, the easiest way would be to add a style attribute, but you could also try to insert a <style> tag in the <head>.
Of course, you'll also need to account for badly made webpages without <html>, <head> or <body> tags. You could either wrap the source in a new set of these tags, or just put in your base and style tags, hoping that the browser will work out what to do.
You probably also want to make this interactive, so you should also wrap them with some kind of link, and ideally you'll insert some javascript to handle their actions by ajax. You should also insert your own header at the top of the page, probably floating at the top, so that they know they're using your tool. Just keep in mind that some advanced pages might then conflict with your alterations (though for those cases you could have a link saying 'is this page not displaying correctly?' to take the user to your original basic listbox page as a backup).
The more robust solution
Clearly there are a lot of potential problems with the above, even though it is ideal. If you want to ensure robustness and avoid any problems with custom javascript and css on the page you're trying to alter, you could instead use a similar algorithm to that used in text based browsers such as lynx to reformat the page consistently. Then you can apply your algorithm to highlight the relevant parts of the page, and you can apply your own formatting as well without risk of it not displaying correctly. This way you can frame it really well and maintain your interface.
The problem with this is that you lose the actual look of the original page, but you should keep the context around the numbers and addresses which is the important thing. You would also then be able to use some dynamic javascript to take the user to each number and address consecutively to improve the user experience. Basically, this is rigorous and gives you complete control over the user experience, but you lose the original look of the website which may or may not confuse your users.
Personally, I'd go for the second option, but I'm not sure if anyone's created such a parser before. If not, the simplest thing you could do would be to strip the tags to get it as plain text. The next simplest would be to convert it into some simple text markup format like markdown, then convert it back into html. That way, you'd keep some basic layout such as headings, italicised and bold text, etc.
You definitely don't want to have nested body tags. It might work, but it'll probably mess up your formatting and be inconsistent across browsers.
Here's a resource I found after a quick Google search:
https://github.com/nickcernis/html-to-markdown
There are other html to markdown scripts, but this was the more robust from the few I found. I'm still not sure though whether it can handle badly formatted pages or ones with advanced formatting, try it out yourself.
There are quite a few markdown to html converters though, in fact you could probably make a custom converter yourself quite easily to accommodate your personal needs.
Is there is any way to hide asp.net page view source?
If you mean, can you hide your ASP.NET code: it's not visible in View Source.
If you mean can you hide your HTML: you can discourage casual peeking by creating your HTML on the fly via Javascript or AJAX, but a developer will always be able to see what you are doing, using simple tools like Firebug and Fiddler.
Edited to add:
I wasn't thinking of obfuscation (though that also discourages casual peeking), I was thinking of using javascript to pull down HTML. Doing a View Source will only show a bunch of <SCRIPT> tags.
But it appears his question has been revised to go in a different direction anyway, to can I keep people from downloading my images, and the answer to that is a simple no. Making money from small numbers of images is not a viable business model. (If you have thousands of images, that's another story.)
Edited to add:
The conventional way of making a catalog of photographs is to [a] show low-resolution previews, [b] put a watermark on each image (here's an example), or both.
Are you talking about ASP.NET or the result? Since ASP.NET is server-sided, it simply returns HTML. Basically, your ASP.NET file is processed by the server and variables and functions are converted into HTML. Your users can view the HTML but not the ASP.NET as it resides on server.
No, there is no way to hide the html source of a page. It's just not possible. There are tools that will promise the ability to do this, but don't believe them. Consider that it might not even be a traditional web browser that downloads the html.
What you can do is obfuscate it a bit, but even that is trivial to reverse.
No, you can't hide HTML, and there's no point either. There's nothing of value in the HTML. It would take maybe a couple hours for a skilled developer to replicate the look and feel of a website without even glancing at the HTML. In fact, it would probably be easier for him to do it his way.
The ASP/code-behind, however, already isn't visible. It's processed on the server and outputs HTML. Only the HTML (and CSS etc.) makes it to the client.
Reading the comments, it appears you want to prevent users from downloading your images. You can't really do that either. You can make it a lot more difficult for users to download them by embedding the images in Flash, or a Java applet, or something like that, but a determined thief could still decompile it and nab your image. Easier yet, he could just take a screenshot and save it out.
The best you can do is restrict access to the image to only certain users by making the image source point to a script instead that runs some validation before outputting the image.
This is not true you can hide source code. One way would be to write a loop that puts a 100k /n in the source code at the top. So it will push it so far down with white space that you can see it :-)
Where there is a problem there is a way.
And for all those who dont like this. Amazon used to hide there code somehow until sometime back.
I'm learning jQuery and am about to write some pages using intensively that library. I just learned that some user disable JavaScript on their browser (I didn't even know that was possible and/or necessary).
Now, here's my question: What happens to my web application if a user disable JavaScript? For instance, I'd like to display some screens using AJAX and commands such as 'InsertBefore' to bring in live a DIV that will display the result.
So, if JavaScript is disabled, I wonder what going to happen to all this work that relies on JavaScript?
I'm kind of lost.
Thanks for helping
You may want to start by reading on Progressive Enhancement and Unobtrusive JavaScript.
I would also suggest to investigate how popular rich web applications like GMail, Google Maps and others, handle these situations.
I just learned that some user disable javascript on their browser
I do. The "NoScript" plugin for FireFox does the trick.
So, if Javascript is disabled, I wonder what going to happen to all this work that relies on Javascript?
It won't be functional.
A good practice suggests designing a site not to rely on JavaScript for major functionality. At least, accessing its content (in read-mode) should be possible. JavaScipt should only add interface enhancements like Ajax techniques etc. But the fallback version should always work.
I feel really sad when I see a site which is completely broken without JavaScript. Why can't people use CSS to put elements in proper places? Why do they try to align elements with JavaScript even if there is no dynamics involved?
The same goes for Flash sites. Once in a while a land upon a "web-design-agency" site which makes picky comments about me not allowing JavaScript. When I do I only see a basic primitive site with a few menus and that's it. What was the point of using Flash when the work is so primitive it can be done with raw HTML and CSS in an hour? For me it's a sign of unprofessional work.
All what's done in JavaScript won't work. Some users disable it for security reasons, NoScript is an excellent example. You can try it yourself by removing the scripts from your page or installing the NoScript-plugin for Firefox.
As a rule of thumb:
Make the website working with only semantic HTML
add the CSS
add the JS
But the website should be (almost) fully functional in stage 1.
If you disable Javascript in Safari things like Lexulous in Facebook won't work properly, the mouse letter carry function doesn't work.