I found this cool feature on digg.com, where you can input a news URL and it will nearly instantaneously give you the title, the summary, and the image from the news story.
I don't need all of these features, but I would like to extract just the title.
I don't have the resources to download the entire website and, say, parse it for this information, but I was wondering if there was a way to get just the title ... using the client's machine, i.e. the browser.
Is there an API available that might help with this?
A similar feature is found at digg.com/news after hitting the add button at the top.
I don't have the resources to download the entire website and, say, parse it for this information
That would be the reliable way to do it.
You could get a performance boost by downloading only the first N bytes of the page (by making a range request), but you risk missing the <title> element if it sits beyond those bytes.
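As a rough illustration of that approach, here is a minimal Python sketch, assuming the Requests library and a server that honours Range headers; the URL and byte count are placeholders:

```python
import re
import requests  # pip install requests

def fetch_title(url, max_bytes=4096):
    # Ask for only the first few KB; servers that honour the Range header
    # reply with 206 Partial Content, others just send the whole page.
    resp = requests.get(url, headers={"Range": "bytes=0-%d" % (max_bytes - 1)}, timeout=10)
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None  # None if <title> wasn't in the chunk

print(fetch_title("https://example.com/some-news-story"))  # placeholder URL
```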
if there was a way to get just the title ... using the client's machine, i.e. browser.
No. The same origin policy prevents this.
My problem is a bit complex and hard to explain in words, so I broke it down into steps, with pictures at each step.
1) Select a single date from these boxes and hit submit.
2) I will land on a page with a table. Copy the <tbody> element from the developer console.
3) Paste it into a text file. Save the text file with the date that was selected.
4) Repeat steps 1-3 as many times as needed, selecting a new date each time (01-15-2018, 01-14-2018, 01-13-2018, and so on...)
Is it even possible to build a bot that does this? If yes, what tools would I use?
I know a fair amount of JavaScript and Python, so I'd prefer to use those 2 if possible.
We would need to know the URL you're looking at and to look at the page source. If the date is supplied as part of a request, and the response contains the data you're looking for, it should be simple to farm and analyze that data from a Python script.
Walk through your clicks with the network tab of your browser's developer tools and you should see a request go out when you hit submit. Expedia just uses query parameters, and so the entire URL that you'll need pops up in the URL bar of your browser after hitting submit...
Tools (if request-based): Python and the Requests module.
If something cached or more complicated is involved, there are tools for automating clicks and saving the results... I would guess that this won't be necessary, though.
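For the request-based case, a minimal sketch along these lines; the endpoint and the `date` parameter name are placeholders, so use whatever actually shows up in the network tab:

```python
import requests  # pip install requests

BASE_URL = "https://example.com/report"  # placeholder endpoint
dates = ["01-15-2018", "01-14-2018", "01-13-2018"]

for date in dates:
    # Mimic the request the browser sends when you hit submit.
    resp = requests.get(BASE_URL, params={"date": date}, timeout=30)
    resp.raise_for_status()
    # Save the raw response under the selected date for later parsing.
    with open("%s.html" % date, "w", encoding="utf-8") as f:
        f.write(resp.text)
```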
Update:
AJAX calls are HTTP requests and responses, so you should be able to observe them in the network tab of your browser's developer tools and then mimic that request from a script rather than from your browser.
The readability of the requests/responses, and any measures the organization has implemented to stop applications other than a browser from getting the same response, would be potential impediments, but even those should be imitable. If your browser can make the request, there is no reason your Python script can't make the same one.
The method you seem to be interested in, although it sounds more complicated to me, is possible with automation tools like Selenium, as the other poster answered. Best of luck.
It is possible:
Take a look at the Selenium library for Python (it's commonly used for automated testing). It can select single dates, hit the submit button, then go through the HTML and grab the data in the <tbody> tag. After that you can use Python by itself to store this data in a text file with a name of your choice in a location of your choice.
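A minimal Selenium sketch of those steps; the URL, field name, and selectors are placeholders since the actual page isn't shown, and slow pages may also need explicit waits:

```python
from selenium import webdriver  # pip install selenium
from selenium.webdriver.common.by import By

dates = ["01-15-2018", "01-14-2018", "01-13-2018"]
driver = webdriver.Chrome()  # or webdriver.Firefox()

for date in dates:
    driver.get("https://example.com/form")                  # placeholder URL
    box = driver.find_element(By.NAME, "date")               # placeholder field name
    box.clear()
    box.send_keys(date)
    driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()

    # Grab the <tbody> markup from the results table and save it,
    # named after the date that was selected.
    tbody = driver.find_element(By.TAG_NAME, "tbody")
    with open("%s.txt" % date, "w", encoding="utf-8") as f:
        f.write(tbody.get_attribute("outerHTML"))

driver.quit()
```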
I am trying to launch a website for myself which people might be using in the future. Currently I allow users to post iframes for YouTube, Google Maps, etc.: copy the entire <iframe> from Google Maps or YouTube and paste it in the post box, just to keep it simple.
I then store it in a MySQL database and display the post on some page. I am a little worried: even though I have asked users to paste only YouTube or Maps iframes, a devious mind might set the src to malicious code.
What are all the possible ways to prevent this?
I think there are multiple risks, some that come to mind are:
Cross-site scripting. There are too many ways to achieve this if you allow the full <iframe> tag to be displayed as entered. This is probably the main risk, and the showstopper. It would be really hard to prevent XSS if you just write the full iframe tag (as entered by an attacker) into subsequent pages. If you really want to do this, you should look into HTML sanitization with Google Caja, HTMLPurifier, or similar (a rough sketch of the whitelist idea follows this risk list), but it is a can of worms that you'd better avoid if possible.
Information leak to a malicious website. This very much depends on the browser (and the exact version of that browser), but some information (for example the window size) does leak to the website in an iframe, even if it's from a different origin.
Information / control leak from a malicious website. Even worse than the previous: the embedded website would have some control over the window; for example it can redirect it (again, I think this depends on the browser, I'm not quite sure), or change the URL hash fragment. Also, if postMessage is used, the iframe can send messages to your application, which can be exploited if your application is not properly secured (not necessarily right now, but at any time in the future, like 5 years from now, after much development).
Arbitrary text injection, possibly leading to social engineering. Say an adversary includes a frame that says something like "You are the winner of this month's super-prize! Call 1-800-ATTACKER to provide your details and get your reward!"... You get the idea. The message would look like a legitimate one from your website, when it's not.
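If you do go down the sanitization road mentioned under the XSS point, the idea is a strict whitelist. As an illustration only (using Python's bleach rather than the libraries named above, and assuming you only ever want YouTube embeds):

```python
import bleach  # pip install bleach

def allow_iframe_attr(tag, name, value):
    # Keep src only when it points at YouTube's embed path; allow a few
    # harmless layout attributes and drop everything else.
    if name == "src":
        return value.startswith("https://www.youtube.com/embed/")
    return name in {"width", "height", "frameborder", "allowfullscreen"}

def sanitize(user_html):
    return bleach.clean(
        user_html,
        tags={"iframe"},
        attributes={"iframe": allow_iframe_attr},
        strip=True,  # remove disallowed tags instead of escaping them
    )
```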
So you'd better not allow people to enter full tags as copied from Google Maps or anywhere else. There appears to be a finite set of things you want to allow (YouTube videos and Google Maps links, for example), for which you should have customized controls. The user would only enter the video id/slug (the part after ?v=...), or would paste the full link, from which you would take the id, and you would build the actual tag for your page on the server side. The same goes for Google Maps: if the user navigates to wherever they want in a Maps window and pastes the URL, you can build your own iframe, I think, because everything is in the URL in Google Maps.
So in short, you should not allow people to enter tags. XSS can be mitigated by sanitizers, but the other risks listed above cannot.
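A rough sketch of that server-side approach for the YouTube case: parse the pasted link, pull out the video id, and build the iframe yourself (the width and height here are arbitrary):

```python
from urllib.parse import urlparse, parse_qs

def youtube_embed(pasted_url):
    # Returns our own iframe for a pasted YouTube link, or None so the
    # caller can reject the post instead of echoing untrusted markup.
    parsed = urlparse(pasted_url)
    if parsed.hostname == "youtu.be":
        video_id = parsed.path.lstrip("/")
    elif parsed.hostname in ("www.youtube.com", "youtube.com"):
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    if not video_id or not video_id.replace("-", "").replace("_", "").isalnum():
        return None
    return ('<iframe width="560" height="315" '
            'src="https://www.youtube.com/embed/%s" '
            'frameborder="0" allowfullscreen></iframe>' % video_id)
```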
I'm developing a web service that allows users to generate content. I need to provide a method for printing this content (html + css + images + javascript) without the information in the print options margins (page title, url, timestamp, page number).
I have considered the following:
1) Write a detailed tutorial informing users how to disable these options in their browser. I can even "detect" the browser and automatically load the tutorial that fits it. Unfortunately, my customers know nothing about computers and I'd like to make the whole experience as simple and smooth as possible. I don't even seem to be able to supply a button to easily access these print options (see link below).
2) After browser detection, offer them a registry file to download and run that will/might/should disable these options. I don't like this method: not only may it present a security issue, but it will not work on Macs or Linux.
3) I have tried the hack below, "Disabling browser print options (headers, footers, margins) from page?", without success. It doesn't work at all in IE, and in FF it only works if the printed content is less than a page.
4) Generate PDFs on the server. This might work, but since the design relies heavily on styles and the content on JavaScript, I doubt it will. http://phptopdf.com/ is an option; however, adding their code to my server would allow its creators to hack in and change anything at will. Also, I dislike relying on external services and scripts. (See the sketch at the end of this question.)
5) I may be able to set up my own server that accepts the files to be printed, opens them in a browser, automatically prints them to a PDF, and finally returns that file to the customer... but this requires separate hardware that's on 24/7. It's messy.
How can Print Preview be called from Javascript?
Disabling browser print options (headers, footers, margins) from page?
So, what is the easiest way to disable/bypass printing margin information issue?
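Regarding option 4, one possible sketch, assuming the wkhtmltopdf binary and the pdfkit wrapper are acceptable dependencies: wkhtmltopdf renders the page with a WebKit engine, so CSS and most JavaScript run, and the resulting PDF carries no browser-added headers or footers. The URL here is a placeholder.

```python
import pdfkit  # pip install pdfkit -- also requires the wkhtmltopdf binary

options = {
    "margin-top": "0mm",
    "margin-bottom": "0mm",
    "margin-left": "0mm",
    "margin-right": "0mm",
    "javascript-delay": "2000",  # milliseconds to let scripts finish rendering
}

# Render the (placeholder) page to a PDF on the server and hand that file back.
pdfkit.from_url("https://example.com/printable-content", "output.pdf", options=options)
```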
I want to create a web crawler/spider to iteratively fetch all the links on a webpage, including JavaScript-based (AJAX) links, catalog all of the objects on the page, and build and maintain a site hierarchy. My questions are:
Which language/technology would be better for fetching JavaScript-based links?
Are there any open-source tools for this?
Thanks
Brajesh
You can automate the browser. For example, have a look at http://watir.com/
Fetching AJAX links is something that even the search giants haven't fully accomplished yet. That is because AJAX links are dynamic, and the request and response both vary greatly depending on the user's actions. That's probably why SEF-AJAX (Search Engine Friendly AJAX) is now being developed: a technique that makes a website completely indexable to search engines while, when visited by a web browser, it acts as a web application. For reference, you may check this link: http://nixova.com
No offence, but I don't see any way of tracking AJAX links. That's where my knowledge ends. :)
You can do it with PHP, simple_html_dom, and Java. Let the PHP crawler copy the pages onto your local machine or web server, open them with a Java application (a JPane or something), mark all the text as focused, and grab it. Send it to your database or wherever you want to store it. Track all <a> tags, and tags with an onclick or mouseover attribute, and check what happens when you call them again. If the source HTML (the document returned from the server) differs in size or MD5 hash, you know it's an effective link and can grab it. I hope you can understand my bad English. :D
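The same idea can be sketched in Python instead of PHP and Java (this is only an illustration of the approach described above, not the poster's exact setup): collect the href, onclick, and mouseover candidates from a page and hash each response, so you can tell when following a candidate actually changes the returned document.

```python
import hashlib
import requests                 # pip install requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

def page_fingerprint(url):
    """Return (md5 of the returned document, list of candidate links/handlers)."""
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = [a["href"] for a in soup.find_all("a", href=True)]
    # Elements wired up with JavaScript handlers are candidates too, although
    # actually triggering them needs a real browser (e.g. Selenium or Watir).
    candidates += [str(el) for el in soup.find_all(attrs={"onclick": True})]
    candidates += [str(el) for el in soup.find_all(attrs={"onmouseover": True})]
    return hashlib.md5(html.encode("utf-8")).hexdigest(), candidates

digest, links = page_fingerprint("https://example.com/")  # placeholder start page
```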
First-time poster here, so I hope I am doing this correctly. I have been contracted by my church to redesign their website. (They've been dealing with a table layout for years.)
I am looking to display an RSS feed (with an audio file) on my website. I am unable to use PHP or any other server-side language, it has to be done in JavaScript.
Due to the way our hosting is set up, all images and xml files are hosted on
images.(mydomainname).com
and the page on which I am looking to display the podcast is
(mydomainname).com/sermons
As such, I have run into the problem of being unable to access the XML file with JavaScript. For all the Googling I've done, it seems that my Google-fu has failed me. Any tips would be greatly appreciated.
If you have the ability to drop another static file on your images domain, then I'd suggest EasyXDM. It's a cross-browser library which provides a way to communicate (using only client script) between different domains. Caveat: you need to have control over both domains in order to make it work (where "control" means you can place static files on both of them).