Hiding information in HTML for JavaScript to find

I want to add information to an HTML page that will be visible to JavaScript but not to the end-user. I want to keep the original HTML page as simple as possible. One solution would be to use non-standard tags, such as <custom> ... </custom>
I am aware of the official way of adding custom elements, but my purpose is not to show anything on-screen, so using the CustomElementRegistry seems overkill.
Here's my use case. I am creating an "aided reader" web application for people who are learning English. My JavaScript code adds <span> elements to an ordinary HTML page, enclosing words which are new to the reader. For example, JavaScript code will change the plain HTML ...
<p>The word "thought" may be new to elementary learners.</p>
... to:
<p>The word "<span data-info="think">thought</span>" may be new to elementary learners.</p>
(The span's data-info attribute provides information which is used later — when the user hovers the mouse over the word — to display images, definitions and examples, but that is not important here.)
The text comes from non-web-developer authors, and it contains no mark-up at all at the beginning. I am writing two tools: one offline and one online. The offline tool compares the vocabulary with lists of words that students are expected to know at different levels, and allows a non-tech-savvy editor to collect different inflections of the same word (for example: lose|loses|lost|losing) that appear in the given text, so that they can be treated as the same root word. This generates an array of terms that the student might want to learn more about. Each term is stored as a string that can be converted to a regular expression. For example: "los(?:e|es|t|ing)".
The online web page will receive:
The raw text from the author
The array of search terms
Some more information about what reference sites to use. This information will be added to the data-info attribute of the enclosing span, but it is not important here.
The online code will work through the array, looking for matches for each regular expression (/thought|thinks?/, for example) in the raw text and add the same span to all the occurrences it finds. It will also add <p> tags where necessary.
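A minimal sketch of this wrapping step, assuming each term is an object with a root and a pattern string that contains no capturing groups (as in "los(?:e|es|t|ing)"). The names wrapTerms, root and pattern are made up for illustration. The terms are combined into one regular expression so the text is scanned only once; applying them one after another could match inside the data-info attribute of a span added by an earlier term.

```javascript
// Wrap every match of every term in a <span data-info="root">.
// Assumption: term patterns use only non-capturing groups, so capture
// group N of the combined regex always belongs to term N.
function wrapTerms(text, terms) {
  const combined = new RegExp(
    terms.map((term) => `\\b(${term.pattern})\\b`).join("|"),
    "g"
  );
  return text.replace(combined, (match, ...rest) => {
    // rest holds one capture per term, followed by the offset and the input.
    const index = rest
      .slice(0, terms.length)
      .findIndex((group) => group !== undefined);
    return `<span data-info="${terms[index].root}">${match}</span>`;
  });
}

const exampleTerms = [
  { root: "think", pattern: "thought|thinks?" },
  { root: "lose", pattern: "los(?:e|es|t|ing)" },
];
console.log(wrapTerms("Yesterday I thought I had lost it.", exampleTerms));
```

This only covers the span wrapping; adding the <p> tags would be a separate step.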
However, the word "thought" can be either a verb or a noun: "Yesterday I thought..." (verb) or "Yesterday I had a thought... " (noun). In the second case, I need to use a different regular expression: /thoughts?/, to allow for both singular and plural forms.
However, both these regular expressions will find a match for "thought", which is the problem I need to solve.
This is where the "information hiding" comes in. One solution would be for my offline tool to add tags to the raw text like this...
Yesterday I <verb>thought</verb> ... Yesterday I had a <noun>thought</noun> ...
I can then use different regular expressions for each case, and there would be no clash.
/<verb>(thought|thinks?)<\/verb>/
/<noun>(thoughts?)<\/noun>/
Since these tags will not be displayed in the browser, they can remain in place ... or can they?
Is there any danger in using non-standard and non-declared tags in this way?

Why not wrap it in a span like you did at the beginning and add an attribute called "data-type"?
That would give:
<p>Yesterday I <span data-info="think" data-type="verb">thought</span> ... Yesterday I had a <span data-info="think" data-type="noun">thought</span> ... </p>
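If the offline tool does emit markers such as <verb>thought</verb> and <noun>thought</noun>, the online code could turn them into spans of this shape before the page is displayed. This is only a sketch: markersToSpans and the rootOf lookup are hypothetical names, and the lookup is assumed to return the root word for a given form and part of speech.

```javascript
// Replace <verb>word</verb> / <noun>word</noun> markers with standard spans.
// rootOf(word, type) is a caller-supplied lookup (hypothetical) that
// returns the value to store in data-info.
function markersToSpans(text, rootOf) {
  return text.replace(
    /<(verb|noun)>([^<]+)<\/\1>/g,
    (whole, type, word) =>
      `<span data-info="${rootOf(word, type)}" data-type="${type}">${word}</span>`
  );
}

const marked = "Yesterday I <verb>thought</verb> ... I had a <noun>thought</noun> ...";
console.log(markersToSpans(marked, () => "think"));
```

After the conversion the page contains only standard elements and data-* attributes, so the question of undeclared tags no longer arises.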

Related

Get the substring From the Html file without considering the tags only text

I am new to web development. I have an HTML document (think of it as a resume), for example:
<p>Mobile: 12345678891
E-mail: abc@gmail.com</p>
<p><b>Career Objective</b></p>
<p>Employment that fully utilizes my experience in Web Site Design & Development and offers the opportunity for career advancement along with the further expansion of IT skills.
</p><p><b>Career Summary
</b></p><p><b>2.1 years</b> of experience in analysis, design and development of client/server, web based application.
From this HTML string I want to extract a substring, for example "my experience in Web Site". I tried the substring method with the start and end offsets, but the offsets I get from innerText do not match the HTML string. How can I get the substring exactly as it is displayed? And if I want to give an id to each and every word in the HTML, including special characters and spaces, how can I do this?

First, you can't assign an ID to anything that is not an HTML element. So individual words or characters cannot have an ID unless you wrap them in an element.
Second, try the innerHTML property, which contains the exact content. innerText contains only the text, so indices can change.
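To illustrate why the indices differ, here is a rough sketch that strips the tags from an HTML string while remembering where each text character came from, so that an offset in the text can be mapped back to an offset in the HTML. The function name findTextInHtml is made up, and the sketch assumes no entities such as &amp; and no "<" inside attribute values. Note that a browser's innerText also normalizes whitespace, which this does not try to imitate.

```javascript
// Find `needle` in the visible text of `html` and report both the text
// offset and the corresponding offsets in the original HTML string.
function findTextInHtml(html, needle) {
  let text = "";
  const htmlIndexOf = []; // htmlIndexOf[i] = position in html of text[i]
  let inTag = false;
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (ch === "<") {
      inTag = true;
    } else if (ch === ">" && inTag) {
      inTag = false;
    } else if (!inTag) {
      text += ch;
      htmlIndexOf.push(i);
    }
  }
  const textStart = text.indexOf(needle);
  if (textStart === -1) return null;
  return {
    textStart,
    htmlStart: htmlIndexOf[textStart],
    htmlEnd: htmlIndexOf[textStart + needle.length - 1] + 1,
  };
}

const resumeHtml = "<p><b>Career</b> Objective</p>";
console.log(findTextInHtml(resumeHtml, "Career Objective"));
```

The HTML slice between htmlStart and htmlEnd can contain tags (here the closing </b>), which is exactly why text offsets cannot be reused on the HTML string directly.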

how to prevent scripts from being run

SO kept preventing me from posting the title I wanted so finally got a title that let me post though it kind of sucks so feel free to edit/change it.
I have fields a user can fill in and in the javascript we have
'${chart.title}'
and stuff like that. Is it sufficient to just strip out the single quote character such that they cannot escape it back to javascript? or are there other ways to close out the string that started with the single quote character.
${chart.title} inserts the title a user typed in on a previous page so naturally they could type something like "Title'+callMethod()+'RestOfTitle" injecting a callMethod into my javascript.
thanks,
Dean

The best way would be to restrict the input to alphanumerical and space characters.
If you want to allow anything inside the title, you can use an escaping function.
http://xkr.us/articles/javascript/encode-compare/
Just stripping the string of single quote characters is definitely not enough. Think of newlines, for one.
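As a rough illustration of what such an escaping function might look like, the sketch below leans on JSON.stringify, which already escapes quotes, backslashes and newlines, and then additionally escapes a few characters that are risky inside an inline script. The name toJsStringLiteral is made up, and the example is JavaScript only for illustration; the same idea would have to be implemented in whatever language renders the page. The result already includes its own double quotes, so the template would emit it without the surrounding single quotes.

```javascript
// Turn an arbitrary user-supplied value into a safe JavaScript string literal.
function toJsStringLiteral(value) {
  return JSON.stringify(String(value))
    // Keep "</script>" or "<!--" from ending or confusing an inline script.
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026")
    // Line/paragraph separators were line terminators in older JS engines.
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const userTitle = "Title'+callMethod()+'RestOfTitle";
console.log("var title = " + toJsStringLiteral(userTitle) + ";");
```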

There are a couple of options.
First, go the very restrictive way and do both: apply white-list validation to the input field for your title, and always encode the text that you output to the page. That filters out all unwanted (and potentially dangerous) characters, and makes sure that if some of them pass the filter (or somebody updates the text to contain some JS code after the filters were applied), the encoding makes any malicious script non-runnable (it turns it into plain text).
Second, let your users input whatever they want (which is highly discouraged, but sometimes developers are asked to do it), but always encode the text that you output to the page.
You can implement white-list validation by yourself using regular expression or you can use one of the libraries.
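A minimal sketch of both halves, with assumptions stated up front: the white-list allows only letters, digits and spaces, the 100-character limit is arbitrary, and the names isValidTitle and escapeHtml are made up. The encoder shown here targets HTML output; a value placed inside a JavaScript string needs JavaScript string escaping instead.

```javascript
// White-list validation: accept only letters, digits and spaces.
// The 1..100 length limit is an arbitrary assumption.
const TITLE_PATTERN = /^[A-Za-z0-9 ]{1,100}$/;

function isValidTitle(title) {
  return TITLE_PATTERN.test(title);
}

// Output encoding for HTML text and attribute values.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(isValidTitle("Quarterly sales 2012"));
console.log(escapeHtml("<script>alert('x')</script>"));
```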

Can prettify.js be extended to support Mathematica?

The mathematica.SE is currently in private beta and will open to the public in a few days. Stack Overflow and related sites use prettify.js, however Mathematica is not a supported language. It would be pretty awesome to have a custom highlighting script for our site, and I request the JavaScript and CSS community's help in developing such a script and the accompanying CSS.
I've listed below a few basic requirements such that it captures most of the features of Mathematica's default highlighting scheme (ignoring stuff that only the internal parser would know). I've also named the colours generically – hexadecimal colour codes can be picked from the screenshots I've provided (further below). I've also added code samples to accompany the screenshots so that folks can test it out.
Basic requirements
Comments
These are entered as (* comment *). So anything between these should be highlighted in gray.
Strings
These are entered as "string" (single quotes are not supported), and should be highlighted in pink.
Operators/short hand notations
Apart from the standard +, -, *, /, ^, ==, etc., Mathematica has several other operators and short hand notations. The most commonly encountered ones are:
@, @@, @@@, /@, //@, //, ~, /., //., ->, :>, /:, /;, :=, :^=, =.,
&, |, ||, &&, _, __, ___, ;;, [[, ]], <<, >>, ~~, <>
These, and parentheses, brackets and braces should all be highlighted in black.
Pattern objects and slots
Pattern objects start with a letter and have either _ or __ or ___ attached, like for example, x_, x__ and x___. These can also have additional letters following the underscore, as x_abc, etc. All of these should be highlighted in green.
Slots are # and ## and can also be followed by an integer as #1, ##4, etc., and should also be in green.
Both of these (pattern objects and slots) are usually terminated by an operator/bracket/shortform from point 3 above.
Functions/variables
Functions and variables is a rather loose terminology here, but serves for the purposes of this post. Anything not falling in the above 4 can be highlighted in black. Mathematica often uses backticks ` in code and should be considered part of the function/variable name. E.g., abcd`defg. Dollar signs $ anywhere in a variable name is to be treated just like a letter (i.e., nothing special).
For all of the above, if they appear inside strings, they should be treated as such, i.e. "#~# should be highlighted in pink.
Additional nice to haves:
In the pattern objects in point 4 above, if the underscore(s) is followed by a ? and then some letters, then the part following the _ should be in black. E.g., in x__?abc, the x__ part must be in green and the ?abc in black.
if a function/variable starts with a capital letter, then it is highlighted in black. If it starts with a small letter, it is highlighted in blue. Internally, this differentiates built-in functions vs. user defined functions. However, the mathematica community (pretty much everywhere) sticks to this naming convention fairly well, so distinguishing the two would serve some purpose.
Screenshots & code samples:
1. Simple examples
Here's a small example set, with a screenshot at the end showing how it looks in Mathematica:
(*simple pattern objects & operators*)
f[x_, y__] := x Times @@ y
(*pattern objects with chars at the end and strings*)
f[x_String] := x <> "hello#world"
(*pattern objects with ?xxx at the end*)
f[x_?MatrixQ] := x + Transpose@x
<< Combinatorica` (*example with backticks and inline comment*)
(*Slightly more complicated example with a mix of stuff*)
Developer`PartitionMap[Total, Range@1000, 3][[3 ;; -3]]~Partition~2 //
Times @@@ # &
2. A real world example
Here's an example from this answer of mine that also indicates my point 2 in the "Additional nice to haves" section, i.e., lowercase stuff being highlighted in blue.
Also, you might notice some of the variables highlighted in orange – I purposefully didn't include that as a requirement, as I think that's going to be a lot harder to do without a parser that knows Mathematica.
prob = MapIndexed[#1/#2 &,
Accumulate[
EuclideanDistance[{0, 0}, #] < 1 & /@ arrows // Boole]]~N~4;
Manipulate[
Graphics[{White, Rectangle[{-5, -5}, {5, 5}], Red, Disk[{0, 0}, 1],
Black, Point[arrows[[;; i]]],
Text[Style[First@prob[[i]], Bold, 18, "Helvetica"], {-4.5, 4.5}]},
ImageSize -> 200], {i, Range[2, 20000, 1]},
ControlType -> Manipulator, SaveDefinitions -> True]
Is this feasible? Too much? Too hard? Impossible?
Quite frankly, I don't know the answer to any of those. I just listed some basic features that everyone on mathematica.SE would love to have and some additional stuff that would be a cherry on the top. However, do let me know if these are too difficult to implement. We can work out a smaller subset of features.
In recognition of this help, you all have the Mathematica community's eternal gratitude and in addition, I'll award a 500 bounty to each person that contributes significantly to this (if it's done in parts by different folks) – I'll rely on your votes/comments/output on the answers to decide what's significant (perhaps more than one bounty to one person if they do all the work). Implementing the "Additional nice to haves" gets an automatic +500 regardless of previous bounties, so you can also build upon the work of others even if you don't do the first half. I might also periodically place smaller bounties to attract users who might not have seen this question, so if you happen to earn those bounties, they'll be in addition to the "bounty to reward an existing answer" which will be decided towards the end.
Lastly, I'm not in a hurry. So please take your time with this question. The bounty is always an option until it is implemented by SE (or if it has been determined that existing answers satisfy the requirements completely). Ideally, I'm hoping to get this implemented 2/3rds of our way into the beta, which is 2 months from now.

Preface
Since the Mathematica support for google-code-prettify was mainly developed for the new Mathematica.Stackexchange site, please see also the discussion here.
Introduction
I have no deep knowledge of all of this, but I once wrote a cweb plugin for Idea to have my code highlighted there. In an IDE this is not a one-step process. It is divided into several steps, and each step adds more highlighting abilities. Let me explain this a bit, so that I can later give some reasons why some things are (imho) not possible for the code highlighter we need here.
At first the code is split into tokens, which are the single parts of a programming language. After this lexing step you can categorize intervals of your code into e.g. whitespace, literal, string, comment, and so on. The lexer eats the source code by testing regular expressions, storing the token type for a text span and stepping forward in the code.
After this lexical scan the source code can be parsed by using the rules of the programming language, the tokens and the underlying code. For instance, if we have a token Plus which is of type Keyword, then we know that the brackets and the parameters should follow. If not, the syntax is not correct. What you can build with this parsing is called an AST, an abstract syntax tree, and it looks basically like the TreeForm of Mathematica syntax.
With a nicely designed language, like Java for instance, it is possible to check the code while typing and make it almost impossible to write syntactically wrong code.
prettify.js and Mathematica Code
First, prettify.js implements only a lexical scanner, but no parser. I'm pretty sure that a parser would be impossible anyway given the time constraints for displaying a web page. So let me explain what features are not possible/feasible with prettify.js:
Also, you might notice some of the variables highlighted in orange – I
purposefully didn't include that as a requirement, as I think that's
going to be a lot harder to do without a parser that knows
Mathematica.
Right, because the highlighting of these variables depends on the context. You have to know, that you are inside a Table construct or something like that.
Hacking prettify.js
I think hacking an extension for prettify.js is not so hard. I'm an absolute regular expression noob, so be prepared for what follows.
We don't need so much stuff for a simple Mathematica lexer.
We have whitespace, comments, string-literals, braces, a lot of operators, usual literals like variables and a giant list of keywords.
Let's start with the keywords in JavaScript regexp form:
Export["google-code-prettify/keywordsmma.txt",
StringJoin @@ Riffle[Apply[StringJoin,
Partition[Riffle[Names[RegularExpression["[A-Z].*"]],
"|"], 100], {1}], "'+ \n '"], "TEXT"]
The regular expression for whitespace and string-literals can be copied from another language. Comments are matched by something like
/^\(\*[\s\S]*?\*\)/
This goes wrong if we have comments inside comments, but for the moment I don't care. We have braces and brackets
/^(?:\[|\]|{|}|\(|\))/
We have something like blub_boing which should be matched separately.
/^[a-zA-Z$]+[a-zA-Z0-9$]*_+([a-zA-Z$]+[a-zA-Z0-9$]*)*/
We have the slots #, ##, #1, ##9 (currently only one digit can follow)
/^#+[0-9]?/
We have variable names and other literals. They need to start with either a letter or $ and then can follow letters, numbers and $. Currently \[Gamma] is not matched as one literal but for the moment it's ok.
/^[a-zA-Z$]+[a-zA-Z0-9$]*/
And we have operators (I'm not sure this list is complete).
/^(?:\+|\-|\*|\/|,|;|\.|:|@|~|=|\>|\<|&|\||_|`|\^)/
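To show how rules like these fit together, here is a small standalone lexer sketch. It is not the prettify.js handler itself: the tokenize function, the token type names and the string-literal regex are my own assumptions for illustration, and the operator list is assumed to include @. It tries each anchored regex at the current position, stores the token type for the matched span and steps forward. The order matters: comments are tried before brackets, and patterns before plain symbols.

```javascript
// Ordered lexer rules: the first regex that matches at the current
// position wins. Type names are illustrative, not prettify's own.
const RULES = [
  ["whitespace", /^\s+/],
  ["comment", /^\(\*[\s\S]*?\*\)/], // does not handle nested comments
  ["string", /^"(?:[^"\\]|\\[\s\S])*"/], // assumed string rule
  ["pattern", /^[a-zA-Z$]+[a-zA-Z0-9$]*_+(?:[a-zA-Z$]+[a-zA-Z0-9$]*)*/],
  ["slot", /^#+[0-9]?/],
  ["symbol", /^[a-zA-Z$]+[a-zA-Z0-9$]*/],
  ["bracket", /^(?:\[|\]|\{|\}|\(|\))/],
  ["operator", /^(?:\+|-|\*|\/|,|;|\.|:|@|~|=|>|<|&|\||_|`|\^)/],
];

function tokenize(code) {
  const tokens = [];
  let rest = code;
  while (rest.length > 0) {
    let matched = false;
    for (const [type, regex] of RULES) {
      const m = regex.exec(rest);
      if (m) {
        tokens.push({ type, text: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Anything no rule recognizes (digits, for example) becomes plain text.
      tokens.push({ type: "plain", text: rest[0] });
      rest = rest.slice(1);
    }
  }
  return tokens;
}

console.log(tokenize('f[x_String] := x <> "hello" (* greet *)'));
```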
Update
I cleaned the stuff a bit up, did some debugging and created a color-style which looks beautiful to me. The following stuff works as far as I can see correctly:
All system symbols which can be found through Names[RegularExpression["[A-Z].*"]] are matched and highlighted in blue
Braces and brackets are black but with a bold font-weight. This was a suggestion from Szabolcs and I like it very much since it definitely adds some energy to the appearance of the code
Patterns, as they appear in function definitions, and the slots of pure functions are highlighted in green. This was suggested by Yoda and goes along with the highlighter in the Mathematica frontend. Patterns are only green in combination with a variable, as in blub__Integer, a1_ or b34_Integer32. For test functions on a pattern, as in num_?NumericQ, only the part in front of the question mark is green.
Comments and Strings have the same color. Comments and strings can go over several lines. Strings can include backslashed quotes. Comments cannot be nested.
For the coloring I used consistently the ColorData[1] scheme to ensure colors look nice side by side.
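On the nested-comment limitation: a single non-greedy regex stops at the first closing "*)", so matching nested comments needs a depth counter. The sketch below is standalone and only illustrates the idea; the function name matchNestedComment is made up, and as far as I know prettify.js lexer rules are plain regular expressions, so this could not be dropped into the language handler as it is.

```javascript
// Return the full comment starting at `start`, honoring nesting,
// or null if there is no comment there or it is never closed.
function matchNestedComment(code, start = 0) {
  if (!code.startsWith("(*", start)) return null;
  let depth = 0;
  let i = start;
  while (i < code.length) {
    if (code.startsWith("(*", i)) {
      depth += 1;
      i += 2;
    } else if (code.startsWith("*)", i)) {
      depth -= 1;
      i += 2;
      if (depth === 0) return code.slice(start, i);
    } else {
      i += 1;
    }
  }
  return null; // unterminated comment
}

const nested = "(* outer (* inner *) still outer *) f[x]";
console.log(matchNestedComment(nested));
console.log(/^\(\*[\s\S]*?\*\)/.exec(nested)[0]); // the regex stops too early
```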
Currently it looks like that:
Testing and debugging
Szabolcs asked whether and how it is possible to test this. This is easy: You need my google-code-prettify source (Where can I put this, so that everyone has access?). Unpack the sources and open the file tests/mathematica_test.html in a webbrowser. This file loads by itself the files src/prettify.js, src/lang-mma.js and src/prettify-mma-1.css.
in lang-mma.js you find the regular expression the lexer is using when splitting the code into tokens.
in prettify-mma-1.css you find the style definitions I use
To test your own code, simply open mathematica_test.html in an editor and paste your stuff between the pre tags. Reload the page and your code should appear.
Debugging: If the highlighter is not working correctly, you can debug with an IDE or with Google Chrome. In Chrome, mark the word where the highlighter starts to fail, right-click and choose Inspect Element. What you see then is the underlying HTML highlight code. There you can see every single token and which type the token is. This looks like
<span class="tag">[</span>
You see the open bracket is of type tag. This matches with the regexp definition I made in lang-mma.js. In Chrome it is even possible to browse the JS code, set breakpoints and debug it while reloading your page.
Local installation for Google Chrome and Firefox
Tim Stone was so kind to write a script which injects the highlighter during the loading of sites under http://stackoverflow.com/questions/. As soon as google-code-prettify is turned on for mathematica.stackexchange.com it should work there too.
I adapted this script to use my lexical scanning rules and colors. I heard that in Firefox the script is not always working, but this is how to install it:
Chrome: Follow this link https://github.com/halirutan/Mathematica-Source-Highlighting/raw/master/mathematica-source-highlighter.user.js and you should be prompted whether you want to install this extension.
Firefox: ensure you have the Greasemonkey plugin installed. Then download the same link as for Chrome.
Now you are set up and when you reload this page, comments, kernel-functions, strings and patterns should be highlighted correctly.
Versions
Under https://github.com/halirutan/Mathematica-Source-Highlighting/raw/master/mathematica-source-highlighter.user.js you will always find the most recent version. Here is some change history.
02/23/2013 Updated the lists of symbols and keywords to Mathematica version 9.0.1
09/02/2012 Some minor issues with the coloring of Mathematica patterns were fixed. For a detailed overview of features with the Pattern operator : see also the discussion here
02/02/2012 support of many number input formats like .123`10.2 or 1.2`100.3*^-12, highlighting of In[23] and Out[4], ::usage or other messages like blub::boing, highlighting of patterns like ProblemTest[prob:(findp_[pfun_, pvars_, {popts___}, ___]), opts___], bug-fixes (I checked the parser against 3500 lines of package code from the AddOns directory. It took about 3-4 sec to run, which should be more than fast enough for our purposes.)
01/30/2012 Fixed missing '?' in the operator list. Included named characters like \\[Gamma] to give a complete match for such symbols. Added $variables in the keyword list. Improved the matching of patterns. Added matching of context constructions like Developer`PackedArrayQ. Switched the color scheme due to many requests. Now it's like in the Mathematica frontend: keywords black, variables blue.
01/29/2012 Tim hacked to injecting code. Now the highlighting works on mathematica.stackexchange too.
01/25/2012 Added the recognition of Mathematica-numbers. This should now highlight things like {1, 1.0, 1., .12, 16^^1.34f, ...}. Additionally it should recognize the backtick behind a number. I switched comments and strings to gray and use a dark red for the numbers.
01/23/2012 Initial version. Capabilities are described under section Update.

Not exactly what you are asking for, but I created a similar extension for MATLAB (based on the excellent work already done here). The project is hosted on github.
The script should solve some of the issues common for MATLAB code on Stack Overflow:
comments (no need to use tricks like %# ..)
transpose operator (single quote) is correctly recognized as such (confused with quoted strings by the default prettifier)
highlighting of popular built-in functions
Keep in mind the syntax highlighting is not perfect; among other things, it fails on nested block comments (I can live with that for now). As always, comments/fixes/issues are welcome.
A separate userscript is included, it allows switching the language used as seen in the screenshot below:
--- before ---
--- after ---
For those interested, a third userscript is provided, adapted to work on "MATLAB Answers" website.
TL;DR
Install the userscript for SO directly from:
https://github.com/amroamroamro/prettify-matlab/raw/master/js/prettify-matlab.user.js

Fast algorithm to find keywords in an HTML page using Javascript

I have a JS specialty dictionary that find certain keywords on a page and add explanatory tooltips to them. Right now I'm using RegEx to find the keywords, but I suspect it will get slow very soon, when my dictionary grows bigger. I store dictionary entries in an array so I think that can be improved as well. My site language is Vietnamese and my keywords will all be English.
Any idea on improving performance will be much appreciated. Thanks.

You could process your dictionary server side (checks output against keywords), then add a handler to each matched item (a class or other html element to identify the definition to use..). then use javascript to bind each element to your dictionary. This way your server is doing the heavy lifting.
1) Server loads your dictionary file and compares against text you are about to output
2) Where a match is found add
<span class="definition">yourword</span>
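3) Generic JavaScript event handler (this is written in jQuery, but of course you can do it any way you like)
// Bind a hover handler to every word the server marked as a definition.
$('.definition').mouseover(function () {
  // The element's content is the keyword to look up.
  var keyword = $(this).html();
  // load your definition using the keyword...
});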

See my answer to a related question: javascript: find strings in dom and emphasize it
Also see the accepted answer to that question which is a jQuery plugin to do just what you want.
The problem with doing this with regexp is not speed, since some people claim that the DOM parsing method can actually be slower. The problem is avoiding crazy corner cases like: you don't want to replace a JavaScript string that happens to contain the keyword, you don't want to replace a CSS class name or id that happens to contain the keyword, etc.
In my experience, the DOM way is fast enough. In fact, my website has a list of more than 100 keywords and it manages to install tooltips on all of them in under half a second (certainly faster than my eye can see).
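On the dictionary side of the question, one option is to keep the entries in a Map keyed by the lower-cased keyword and scan the text once, looking up each word as it goes. The cost then depends on the length of the text rather than on the size of the dictionary. This is only a sketch: the annotate function and the sample entry are made up, it handles single-word keywords only, and it works on a plain text string (for a real page it would be applied to text nodes, not to the HTML source).

```javascript
// Wrap every word that has a dictionary entry in a marker span.
// Words are runs of Unicode letters (plus combining marks), so Vietnamese
// words are treated as whole words and never partially matched.
function annotate(text, dictionary) {
  return text.replace(/[\p{L}\p{M}]+/gu, (word) =>
    dictionary.has(word.toLowerCase())
      ? `<span class="definition">${word}</span>`
      : word
  );
}

const sampleDictionary = new Map([
  ["router", "a device that forwards packets between networks"],
]);
console.log(annotate("Cấu hình Router cho mạng nội bộ", sampleDictionary));
```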

Fast method to find regex matches in a large document using javascript?

I need to search the text in an HTML document for regexes (emails, phone numbers, etc.) and words. The matches need to be highlighted and made anchorable so that a link can be generated to jump to the location of each match. So not only does it need to find matches using patterns, it also needs to do a replace to add the proper HTML code.
I am currently using jquery but I am not very happy with the speed. In a 1.5mb file it takes about 5 seconds to match 2 regexes and it increases when I add more search criteria.
Does anyone know of a fast method to find regex matches in a large document using javascript?

You say you're "using jQuery" but you don't say how. Have you tried a "highlight" plugin (or, as it sounds like you'd need, a derivation of one)? I've used this one: http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html and it doesn't seem slow to me. Again, you'd have to work on it to make it add the markup you need, but that should be pretty clear - it's not very big.
It seems like what you'd want to do for performance is take your regular expressions and combine them into what amounts to a "token grammar". In other words, you don't want to start from scratch looking for each regex individually throughout the entire document. Instead, you'd want to proceed through it with a regex that matches each possible target (one at a time of course), and each time it finds one you'd replace it with whatever's appropriate. That way you could make just one pass over the document, no matter how big it is and no matter how many patterns you're looking for.
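A small sketch of that single-pass idea, using named capture groups so the replacement callback can tell which pattern matched. The two patterns are deliberately simplistic placeholders, and highlightAll and the match-N ids are made-up names. It works on a plain text string; on a real page it would have to be applied to text nodes so that tags and attributes are not touched.

```javascript
// Simplistic example patterns (assumptions, not production-grade).
const PATTERNS = {
  email: "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+",
  phone: "\\d{3}-\\d{3}-\\d{4}",
};

// Combine all patterns into one regex and replace in a single pass.
function highlightAll(text, patterns) {
  const names = Object.keys(patterns);
  const combined = new RegExp(
    names.map((name) => `(?<${name}>${patterns[name]})`).join("|"),
    "g"
  );
  let count = 0;
  return text.replace(combined, (...args) => {
    // With named groups, the last callback argument is the groups object.
    const groups = args[args.length - 1];
    const kind = names.find((name) => groups[name] !== undefined);
    count += 1;
    // The id makes each match a jump target for a generated link.
    return `<a id="match-${count}" class="${kind}">${args[0]}</a>`;
  });
}

console.log(highlightAll("Mail a@b.com or call 555-123-4567.", PATTERNS));
```

Adding another search criterion then means adding one more entry to the patterns object rather than one more pass over the document.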
edit Mr. Burkard's plugin doesn't let you search with regexes; it uses "indexOf" internally. Hmm.
