Is there a name for the technique that consists of exploring a page open in the browser to find specific content and modify it?
Some examples:
Skype finds phone numbers on a page, and attaches a call menu
a script finds percentages in a page and replaces them with a small pie
an advertising engine finds keywords in the page and converts them into hyperlinks
add an icon next to all the hyperlinks on the page that point to another domain
etc.
I understand that it is a kind of progressive enhancement. But I am specifically interested in the first step, the content discovery process. I'd be interested in articles that offer best practices, or explain the shortcomings of this technique.
Edit: I added an example to show that this technique is not just for text nodes, but can apply to any kind of html content.
For example, execute this code for this web-page (from the console), and all numbers on the page will be replaced with "X":
function walkTheDOM( node, func ) {
func( node );
node = node.firstChild;
while ( node ) {
walkTheDOM( node, func );
node = node.nextSibling;
}
}
walkTheDOM( document.body, function ( node ) {
if ( node.nodeType === 3 ) {
node.data = node.data.replace( /\d/g, 'X' );
}
});
This functionality is provided by add-ons, and the technique they use is DOM traversal.
The cases you describe are not specific to one site but appear on every site you visit, so some extra functionality must have been added to your browser. This often happens when the installer for other software, such as Skype, offers to install toolbars and the box is left checked.
The technique can be called recognition (as in PNR, Skype's Phone Number Recognition); what these add-ons do is traverse the site's DOM.
The add-ons described above probably run only on page load, so content added later via Ajax will not be affected.
If it's your own add-on, there is a way to access it with JavaScript, as described here: how to call a function in Firefox extension from a html button.
Also take a look at GreaseMonkey and jQuery traversing.
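To make the discovery step concrete, here is a minimal sketch that separates finding the content from modifying it. It is only an illustration: findMatches is a hypothetical helper name, and the percentage pattern in the usage note is just one of the examples from the question.

```javascript
// Discovery step only: collect matches without touching the page.
// Works on any object exposing nodeType, data, firstChild and nextSibling,
// so it can be tried on document.body or on a plain-object stand-in.
var TEXT_NODE = 3;

function findMatches(root, pattern) {
  var matches = [];
  // Force the global flag so exec() advances through each text node.
  var flags = pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags;
  var re = new RegExp(pattern.source, flags);

  (function walk(node) {
    if (node.nodeType === TEXT_NODE) {
      re.lastIndex = 0;
      var m;
      while ((m = re.exec(node.data)) !== null) {
        matches.push({ node: node, index: m.index, text: m[0] });
        // Guard against zero-length matches looping forever.
        if (m[0].length === 0) re.lastIndex++;
      }
    }
    for (var child = node.firstChild; child; child = child.nextSibling) {
      walk(child);
    }
  })(root);

  return matches;
}
```

In the console, findMatches(document.body, /\d+%/) would list every percentage together with the text node that holds it. Note that this sketch also visits text inside script and style elements, which a real add-on would probably want to skip.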
So the conclusion for now is that there doesn't seem to be a name or established practices for this technique.
Thanks to those who have mentioned search engines, it makes sense to see it as a local search, with an effort to interpret the content and structure.
As has already been said, it is called summarization, but you can find out more by searching for the term "web crawling bot/technique/robot". Here are some starting documents you might find useful:
Crawling the Web
Summarization
It is the technique used in all web crawlers. Please have a look at the open source, well-documented web crawler/search engine Yioop!
I am using a WebRequest() function within the MetaTrader Terminal 4 codebase (MQL4) that allows one to download a HTML-response from a website.
Example site: http://www.forexfactory.com/docphoenix66#acct.57-tab.list
Here is an example how it is used in the MQL4 function call:
res = WebRequest( "GET",
"http://www.forexfactory.com/docphoenix66#acct.57-tab.list",
cookie,
NULL,
timeout,
post,
0,
result,
headers
);
and the documentation for the function WebRequest()
However, if I compare what is downloaded using a WebRequest() call with what you see when you right click and inspect element using Chrome or Safari, the bits I want available are missing!
In particular I want the trade information from below the following columns:
Instrument Price Open/Close Date Open/Close Lots Return
Profit Pips Chart Balance Swap Duration
Below is an example of what is missing from the HTML file downloaded using the MQL4 function.
<td class="slidetable__cell slidetable__cell--fixed" style="width: 62px; min-width: 62px;"> <a id="snap_48205_trade_109309333" class="explorer__anchor explorer__anchor--trade"></a>
EUR/USD
</td>
If you download the HTML file, turn off your wifi and then open the file to see what was downloaded, you see everything in the trade explorer still loading. Am I clear on what my problem is?
Short version: Yes, there is.
Long version: TL;DR;
Well, first, welcome to the Wild Worlds of MQL4
Given the intention is clear, and given you were "promised" that there is "a possible way to read a HTML-page", I have to tell you it is not possible in all cases you will meet in the real world.
One may spend ages in the MQL4-code domain re-designing HTML-like mark-up syntax (b)LOBs, suffering from all the constraints of the MQL4 code-execution engine.
Nevertheless, a much faster, joyful, sure and future-proof approach exists ( read other posts on the historically painful creeping of the language syntax and the man*decades of crippled API-integration code-base efforts ).
Integrate MQL4-side via a professional fast & low-latency SIG/MSG-infrastructure with external, distributed processes, that can provide high-performance & robust services to MetaTrader Terminal ecosystem.
Using this approach we have prototyped and operate fast Mixed-Technical-and-Fundamental AI/ML-Inputs, including Web-page-feeds of Fundamental Data and News Announcements into the FX-trading realm 24/7/365 and it works blessingly well, independently of the limits the common MQL4 execution has.
If still in doubt, just try to read a page at the rss.provider.com:6322/FED_actuals URL via a call to WebRequest() and you will know where the dog is buried.
I want to improve an existing website (I have no access to) by using my own javascript.
This means I have to add a <script> to the <head>. I am currently doing this by clicking on a bookmark which has something like this as its target:
javascript: if(document.createElement){
void(head=document.getElementsByTagName('head').item(0));
void(script=document.createElement('script'));
void(script.src='http://local/script.js');
void(script.type='text/javascript');
void(head.appendChild(script));
} // i added some spacer to make it readable
This works fine, but since I've got a lot of different scripts it gets complicated to organize them.
I am now looking for a way to automatically insert a defined script URI if the top.location.href matches a given string.
Is there a way Firefox can do something like this, maybe with the help of an add-on?
Userscripts are what you're looking for. Use Greasemonkey for Firefox to run your own script on specific sites.
https://addons.mozilla.org/en-US/developers/docs/sdk/1.12/modules/sdk/page-mod.html uses the add-on sdk, and provides precisely what you're looking for (i.e. an easy way to attach a script to a webpage if it matches a given domain). That gives you the option of making your addon a standalone one easily; moreover, if you were to extend its functionality significantly, the addon sdk would certainly be the way to go!
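As a rough illustration of the userscript route, a Greasemonkey script could look something like the sketch below. The @name and the @match pattern (https://example.com/*) are placeholders, and buildScriptElement is a hypothetical helper; only the script URL is taken from the question.

```javascript
// ==UserScript==
// @name        Load my local script (hypothetical example)
// @match       https://example.com/*
// @grant       none
// ==/UserScript==

// Build the <script> element; taking the document as a parameter
// keeps the helper usable outside a browser as well.
function buildScriptElement(doc, src) {
  var script = doc.createElement('script');
  script.src = src;
  script.type = 'text/javascript';
  return script;
}

// Only touch the page when a DOM is actually available.
if (typeof document !== 'undefined') {
  document.head.appendChild(
    buildScriptElement(document, 'http://local/script.js')
  );
}
```

The @match line replaces the manual check against top.location.href: the script only runs on URLs matching that pattern. One thing to watch for is that a script served over plain http will be blocked as mixed content when the target page is served over https.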
I manage a secured PHP/MySQL web app with extensive jQuery use. Today, a strange error popped up in our app's logs:
JS Error: Error loading script:
https://d15gt9gwxw5wu0.cloudfront.net/js/_MY_WEB_APP_DOMAIN_/r.js
We are not using Amazon's Cloudfront CDN in our app. When I go to the URL that failed to load, these are the only contents:
if(typeof _GPL.ri=='function'&&!_GPL.isIE6){_GPL.ri('_GPL_r')}_GPL.rl=true;
The user's user agent string is:
Mozilla/5.0 (Windows NT 6.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Please note: I am not the user who triggered this error. It was one of our thousands of users who triggered it. I do not have control over the client machine.
Does anyone know what's going on here? Is this some sort of XSS attack?
** Update **
It appears I'm not the only one who has discovered this anomaly on their website. I found this report of the same exact behavior, which seems to indicate the code is harmless, but still no answers as to where it came from.
In addition, I found this pastebin with similar code, that appears to be some sort of advertising script. Again, not terribly helpful.
** Update 2 **
More context: The webapp uses several third party jQuery plugins but no third party analytics of any kind. All scripts are hosted on our own server, and an audit of all our code provides no matches for "cloudfront".
This app has been in production for about 4 years, and this is the first and only instance of any activity like this. It has not happened before or since, so I doubt I'll be able to reproduce it.
What I'm interested in is if this is some sort of attack. If it is, I want to know how to plug the hole it's trying to exploit if it's not plugged already.
Disclaimer: I'm not a security analyst/expert, your issue simply sparked my interest ;)
Warning: While I share the initial conclusion that the code itself is probably harmless, the underlying technology can most certainly be (ab)used for malicious intents as well, so please take care when investigating this yourself.
Analysis
You already found the relevant evidence yourself - searching further I found another pastebin drop, which is more readable, so I'm using this for the explanation (though at first sight the other one would allow this as well after formatting).
The snippet features JavaScript fragments with the following major functionality:
Line 13 initializes the variable _GPL with all sorts of items for later use, e.g. various constants, helper functions, browser compatibility stuff and actual payloads, for example:
line 20 defines an empty basdeCDN, line 21 defines a fCDN, which happens to be the one in question (d15gt9gwxw5wu0.cloudfront.net)
line 261 defines a function removeScripts(), which in turn uses findScript() from line 266, further accompanied by insertJS() on line 277 - their respective intent is obvious
line 270 defines function loadDomainRules(), which seems to be the one generating the URL you have found in your logs - see appendix below for the code snippet
Deduction: Even without the further evidence gathered below, the naming and functionality strongly hint at r.js being a JavaScript file serving custom JavaScript specifically assembled/generated for the domain at hand
line 100 defines a function loadGeo(), which references some kind of an ad server indeed (ads2srv.com) - see appendix below for the code snippet
line 368 finally defines a function i(), which provides the most definite clues regarding the likely origin of all this, namely the notion of some Yontoo Client and Yontoo API - see appendix below for the code snippet
Corollary
What's it all about?
The extracted clues Yontoo Client and Yontoo API easily lead to Yontoo, an Application Platform that allows you to control the websites you visit everyday, i.e. it sounds like a commercialized version of Userscripts.org, see What is a Yontoo App?:
Yontoo is a browser add-on that customizes and
enhances the underlying website
Where Can I Use It?
Yontoo works on any site on the Web, although the
functionality comes from separate applications called Yontoo Apps
which provide specific functionalities depending on what site you are
on.
[emphasis mine]
Now, looking at the current listings in their App Market easily demonstrates why this might be used for questionable, nontransparent advertising as well, all the trust signs and seals in their footer notwithstanding.
How did it end up in your logs?
Another quote provides more insight into the functionality and how it might have yielded the issue you've encountered:
Yontoo [...] is a
browser add-on that creates virtual layers that can be edited to
create the appearance of having made changes to the underlying
website. [...]
If you see a need for an application or tool over a website, then you
are free to create!
So somebody apparently has visited your site and created some custom domain rules for it by means of the Yontoo client (if it actually allows this for end users) or one of the available apps (the snippet used for analysis references the Drop Down Deals app in line 379 for example), which triggered the creation of d15gt9gwxw5wu0.cloudfront.net/js/_MY_WEB_APP_DOMAIN_/r.js to store these rules for reuse on next site visit in turn.
Due to some security flaw somewhere (see conclusion below) this URL or a respective JavaScript snippet must have been injected into JavaScript code of your application (e.g. by means of Cross-site scripting (XSS) indeed), and triggered the log entry error at some point in turn.
Conclusion
As mentioned upfront already, I share the initial conclusion that the code itself is probably harmless, although the underlying technology can most certainly be (ab)used for malicious intent as well due to its very nature of messing with client side JavaScript, i.e. a user allows code from a 3rd party service to interact with sites (and especially data) he uses and trusts every day - your case is the apparent evidence for something gone wrong already in this regard.
I haven't investigated the security architecture (if any) of Yontoo, but wasn't able to find any information regarding this important topic immediately on their website either (e.g. in their Support section), which is pretty much unacceptable for a technology like this IMHO, all the trust signs and seals in their footer notwithstanding.
On the other hand, users do install 3rd party scripts from e.g. Userscripts.org all the time of course, not the least for fine tuning the user experience on Stack Exchange as well ;)
Please make your own judgment accordingly!
Appendix
Below you can find the code snippets referenced in the analysis (I've been unable to inline them within the lists without breaking the layout or syntax highlighting):
loadDomainRules()
function () {
if (location.host != "") {
var a = location.host.replace(RegExp(/^www\./i), "");
this.insertJS(this.proto + this.fCDN + "/js/" + a + "/r.js")
}
this.loaded_domain_rules = true
}
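To see how this produces the URL from the logs, the same string construction can be tried in isolation. This is only a sketch: buildRulesUrl is a hypothetical stand-alone helper, and the protocol and CDN host are passed in explicitly instead of being read from this.proto and this.fCDN.

```javascript
// Stand-alone re-creation of the URL construction in loadDomainRules().
// buildRulesUrl is a hypothetical name, not part of the analysed snippet.
function buildRulesUrl(proto, cdnHost, pageHost) {
  if (pageHost === '') {
    return null; // the original does nothing when location.host is empty
  }
  // Strip a leading "www." exactly as the original regular expression does.
  var domain = pageHost.replace(/^www\./i, '');
  return proto + cdnHost + '/js/' + domain + '/r.js';
}
```

Calling buildRulesUrl('https://', 'd15gt9gwxw5wu0.cloudfront.net', 'www.example.com') gives https://d15gt9gwxw5wu0.cloudfront.net/js/example.com/r.js, which has the same shape as the URL in the error log.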
loadGeo()
function () {
var cid = this.items.e6a00.get("geo.cid");
var updatetime = this.items.e6a00.get("geo.updatetime");
if (!cid || (cid && updatetime && (Math.floor((new Date()).getTime() / 1000) - parseInt(updatetime)) >= 259200)) {
this.insertJS(((this.proto == 'https://') ? 'https://s.' : 'http://') + 'ads2srv.com/tb/gc.php?json&cb=_GPL.setGeoAndGo')
} else {
this.vars.cid = this.items.e6a00.get("geo.cid");
this.vars.rid = this.items.e6a00.get("geo.rid");
this.vars.ccid = this.items.e6a00.get("geo.ccid");
this.vars.ip = this.items.e6a00.get("geo.ip");
this.loadCC();
this.loadDomainRules()
}
}
i()
function () {
if (typeof YontooClient != 'undefined') YontooClient = {};
if (typeof yontooAPI != 'undefined') yontooAPI = {};
if (typeof DealPlyConfig != 'undefined') {
DealPlyConfig.getBaseUrl = function () {
return "https://d3lvr7yuk4uaui.cloudfront.net/items/blank.js?"
};
DealPlyConfig.getCrownUrl = function () {
return "https://d3lvr7yuk4uaui.cloudfront.net/items/blank.js?"
}
}
this.rm(this.ri, ['dropdowndeals', 'Y2LeftFixedCurtain', 'gbdho', 'bdca', 'dealply-toast-1', 'pricegong_offers_iframe', 'SF_VISUAL_SEARCH', 'batAdRight', 'batAdBottom', 'batAdMiddle_0', 'batAdMiddleExt1_0', 'batAdRight2', 'invisiblehand-iframe', 'scTopOfPageRefinementLinks', 'sf_coupon_obj']);
this.rm(this.rc, ['yontoolayerwidget', 'dealply-toast', 'imb-ad']);
this.rm(this.ric, [
['productbox', 'g'],
['related-searches', 'related-searches-bing']
]);
this.rm(this.rtn, ['MIVA_AdLink', 'itxtrst', 'kLink', 'FAAdLink', 'IL_AD', 'skimwords-link'])
}
I found an iframe as well in my Drupal 7 website. It was loaded into the site by enabling the Shareaholic module.
I need to be able to raise an event such that every time a user loads a new page or closes Firefox, it calls a method in my C# application that takes care of maintaining the user model. I know for sure I need to create some type of Firefox extension where I use JavaScript to check for such an event. However, I have no idea how I am going to integrate my C# application with the Firefox extension. Can someone provide me with some guidance?
I'll help you out with the parts of the question that I'm familiar with (Javascript based add-ons), and offer some suggestions for the other parts. Here goes nothing!
Add-ons
Firefox add-ons easily provide the tools you need to detect page loads and opening / closing Firefox.
To detect page loads you can register a listener to the DOMContentLoaded event in window.
window.addEventListener("DOMContentLoaded", function(event){
var url = event.originalTarget.location.href;
alert("Oh yeah, a document is loading: " + url);
}, false);
Alternatively, you can register an nsIWebProgressListener to listen for location changes. This is probably closer to what you want, since DOMContentLoaded is also triggered for iframes.
var listener = {
//unimplemented methods (just give functions which do nothing)
onLocationChange: function(aWebProgress, aRequest, aLocation){
var url = aLocation.asciiSpec;
alert("Oh yeah, the location changed: " + url);
}
};
gBrowser.addTabsProgressListener(listener);
To detect Firefox opening / closing you first need to understand how Firefox add-ons work with respect to multiple windows. When a new Firefox window is launched, you basically have 2 separate copies of your code running. So, if you care about Firefox windows being opened and closed you can simply do:
window.addEventListener("load", function(event){
alert("Looks like you just opened up a new window");
}, false);
window.addEventListener("unload", function(event){
alert("Awh, you closed a window");
}, false);
But, most likely you want to detect opening / closing Firefox as an entire application. This is achieved using a code-sharing mechanism called JavaScript Modules. JavaScript modules are loaded just once for the lifetime of the application, so they enable you to share information between windows. Simply counting the number of windows opened and closed should be sufficient for this functionality.
var EXPORTED_SYMBOLS = ["windowOpened", "windowClosed"];
var windowsOpened = 0;
function windowOpened(){
if( windowsOpened === 0) {
alert("The first window has been opened!");
}
windowsOpened++;
}
function windowClosed(){
windowsOpened--; // a window was closed, so count down
if( windowsOpened === 0) {
alert("The last window has been closed!");
}
}
Then you can simply attach the aforementioned event handlers to call these 2 methods from their corresponding load and unload events.
So, this is all great and everything, but now you have to twiddle with the details of getting a baseline Firefox add-on set up. Fortunately, Mozilla has provided a handy Addon Builder to ease this. All the code above (except the JavaScript module) should be placed in the ff-overlay.js file (assuming you use the linked builder).
C# communication
I'm a little less knowledgeable about the interprocess communication with C#. However, maybe I can point you in the right direction and let the smart people at SO fill in the rest.
I believe COM Objects are a method of communication between processes on Windows. So, you could build in a Binary Component to your add-on to perform the communication. However, as far as I understand it, setting up binary components is much more difficult than a standard javascript-based add-on. Either way, Mozilla provides a guide for setting it up in Visual Studio.
If you want to stay away from binary components you are left with the javascript enabled components of the SDK. This includes socket communication, files, pipes, a sqlite database etc. This SO question addresses exactly the question you're asking. If it were me, I would choose them in this order.
Sqlite Database
Named Pipes
Sockets
(1) because there are a lot of code samples available for this, and it would be easy to implement on both sides. (2) because this would be the way I'd implement IPC if I were given full control of both sides of the application. (3) is last because I hate that crap (maybe I'm biased from Distributed Systems in college).
tl;dr
The page load stuff should be pretty simple. Check out the Addon Builder to get going with a FF addon, and here to see about detecting page loads.
The C# communication is doable, and addressed in this SO Question. I'd do it with a sqlite database for ease if it were me.
I spent a bunch of time looking at this. It seems that what little info there is about accessing a Google Apps spreadsheet is not very well maintained.
At Google I/O this year there was an announcement of enhanced Google Apps Script, including UI elements.
That got me thinking about creating a widget based on data in Google spreadsheets: no data writing, just simple reading/lookup and display of calculations. Then I realized the UI feature was only available for Premier accounts. Not a huge deal at only $50/yr and some free trial time up front. It seems that the UI feature may be somewhat restrictive.
But then I began to think about all the little things I might have to do, so I started to investigate how to just access the spreadsheets from JavaScript, in which case I think the widget could be a plain iGoogle gadget. An iGoogle gadget is quite powerful and flexible in what it can do, and this could allow a lot more flexibility. In short, I've come up short. Anyone else out there? This looked like a clue: http://almaer.com/blog/gspreadsheet-javascript-helper-for-google-spreadsheets and so did this one: http://code.google.com/apis/gdata/samples/spreadsheet_sample.html but it has not been touched for a long time and I could not make it work on a current spreadsheet.
Here is a current "public" read only spreadsheet. http://spreadsheets1.google.com/ccc?key=tzbvU7NnAnWkabYmGo4VeXQ&hl=en
This is in what Google now refers to as its old format. I've tried both (old and new); I don't know if that makes any difference.
Google provides a documented way to access a Google spreadsheet via JSONP that works for normal gmail.com accounts. In short:
Create a spreadsheet
Click on the dropdown next to "Share" and select "Publish as a web page"
Copy and paste out the key from the URL that shows (i.e. the bit after &key=)
Go to https://spreadsheets.google.com/feeds/cells/0AmHYWnFLY1F-dG1oTHQ5SS1uUzhvTnZTSHNzMjdDaVE/od6/public/values?alt=json-in-script&callback=myCallback replacing "0AmHYWnFLY1F-dG1oTHQ5SS1uUzhvTnZTSHNzMjdDaVE" with whatever key you cut out of the url
To access this from within JavaScript you'll have to insert a HTML script tag into your document:
<script src="https://spreadsheets.google.com/feeds/cells/0AmHYWnFLY1F-dG1oTHQ5SS1uUzhvTnZTSHNzMjdDaVE/od6/public/values?alt=json-in-script&callback=myCallback"></script>
And you'll need to implement the callback function in your webpage:
function myCallback(spreadsheetdata) {
// do something with spreadsheet data here
console.log(spreadsheetdata);
}
You can simplify this with jQuery:
var url = "https://spreadsheets.google.com/feeds/cells/0AmHYWnFLY1F-dG1oTHQ5SS1uUzhvTnZTSHNzMjdDaVE/od6/public/values?alt=json-in-script&callback=?";
$.getJSON(url,{}, function (d) { console.log(d); });
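Once the callback fires you still have to dig the cell values out of the response. Below is a minimal sketch of that step. It assumes the cells feed returns its entries under feed.entry, each with a gs$cell object carrying 1-based row and col plus the displayed value in $t; cellsToRows is a hypothetical helper name.

```javascript
// Turn a cells-feed response into a 2D array of strings.
// Assumes each entry carries gs$cell with 1-based row/col and the value in $t.
function cellsToRows(spreadsheetdata) {
  var entries = (spreadsheetdata.feed && spreadsheetdata.feed.entry) || [];
  var rows = [];
  entries.forEach(function (entry) {
    var cell = entry.gs$cell;
    var r = parseInt(cell.row, 10) - 1; // convert to 0-based index
    var c = parseInt(cell.col, 10) - 1;
    if (!rows[r]) {
      rows[r] = [];
    }
    rows[r][c] = cell.$t;
  });
  return rows;
}
```

Inside myCallback you could then write var rows = cellsToRows(spreadsheetdata); and read rows[0] as the header row. Cells that are missing from the response stay undefined in the result, so check for gaps before using a value.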
I have implemented a fairly complete example and the code is at
https://bitbucket.org/tbrander/ggadget/wiki/Home
Code is BSD license (except for Trademarks and institutional markings which are all rights reserved)
It is reasonably well commented...
It is in operation at
http://acre.cba.ua.edu/ (bottom of page)
Stand alone at :
http://acre.cba.ua.edu/mobiletool/res.html
It functions across IE, Chrome, FF, iPhone and Android
Your hints above are close, but I was looking for yet more, as you can now see. But I will explore the jQuery syntax, as the current implementation is pure JS