Mysterious garbage character - IE 8 only - javascript

I am building a table, with content pulled from other elements in the page (page scraping).
I am using innerText or textContent to pull the text, then a regular expression to trim it:
string.replace(/^\s+|\s+$/g,"");
This works fine in IE 9 and Chrome, but in IE 8 I am getting a garbage character that I cannot identify. I was able to reproduce the behavior with alerts in jsfiddle:
http://jsfiddle.net/Te4FQ/
What is this extra character, and how can I get rid of it?
Update: thanks for the helpful replies! It seems that the character in question is U+200E (left-to-right mark). So the second part of my question remains: how can I get rid of such characters with regular expressions, and just keep regular text?

Both the "At Risk" and "Complete" <th> tags in your jsFiddle snippet have a U+200E (Left-to-Right Mark, aka LRM) code point at the end of their content. That is not a whitespace character, so it cannot be matched by \s.
One way to get rid of this character is to use the XRegExp library, so that you can replace all matches of \p{C} with the empty string (i.e., delete them). \p{C} matches any code point in Unicode's "Other" category, which includes control, format, private use, surrogate, and unassigned code points. U+200E, specifically, is within the \p{Cf} "Other, Format" subcategory.
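If pulling in XRegExp is not an option, a narrower approach is to delete specific code points by their \u escapes before trimming. The sketch below is only that: the helper name cleanText and the list of marks are assumptions, and the list covers a few bidirectional formatting marks rather than the whole \p{C} category.

```javascript
// Bidirectional formatting marks to delete. This list is an assumption and
// is not exhaustive: it is NOT equivalent to XRegExp's \p{C} or \p{Cf}.
var BIDI_MARKS = /[\u200E\u200F\u202A-\u202E]/g;

// Hypothetical helper: strip the marks first, then trim whitespace,
// so that whitespace next to a removed mark is trimmed as well.
function cleanText(text) {
  return text
    .replace(BIDI_MARKS, "")
    .replace(/^\s+|\s+$/g, "");
}
```

It could be called as cleanText(cell.innerText || cell.textContent), where cell is whatever element is being scraped.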

Try printing to the page the result of
escape(string.replace(/^\s+|\s+$/g,""));
Your garbage character should show up as an escape code.
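A similar diagnostic that does not rely on escape() is to rewrite every character outside printable ASCII as a \uXXXX sequence. This is a minimal sketch; the helper name revealNonAscii is an assumption, not something from the question.

```javascript
// Hypothetical helper: make invisible or unusual characters visible by
// replacing anything outside printable ASCII with its \uXXXX escape.
function revealNonAscii(text) {
  return text.replace(/[^\x20-\x7E]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    // Left-pad to four hex digits.
    return "\\u" + ("0000" + hex).slice(-4);
  });
}
```

Alerting revealNonAscii(string) instead of string should then show the stray character as a visible escape rather than as a garbage glyph.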


Markup regular expression help, double vs single symbols

Background
I have burned myself out looking for this answer. The closest working code I could find is from Stack Edit, specifically the Markdown.Converter.js script, copied below. It is a pretty heavy regular expression, though: my own regex for finding **, for example, runs in about 1/5 of the steps, and I don't need this much extra support.
function _DoItalicsAndBold(text) {
// <strong> must go first:
text = text.replace(/([\W_]|^)(\*\*|__)(?=\S)([^\r]*?\S[\*_]*)\2([\W_]|$)/g,"$1<strong>$3</strong>$4");
text = text.replace(/([\W_]|^)(\*|_)(?=\S)([^\r\*_]*?\S)\2([\W_]|$)/g,"$1<em>$3</em>$4");
return text;
}
Question
I'm trying to make my own very simple markdown script that makes these transformations:
* ---> Italics
** ---> Bold
__ ---> Underline
So far I can find all uses of ** (two stars, bold text) with this regex:
/(\*\*)(?:(?=(\\?))\2.)*?\1/g
However I can not for the life of me figure out how to match only * (single star, italicized text) with one regular expression. If I decide to go further I may have to distinguish between _ and __ as well.
Can someone point me in the right direction on how to properly write the regular expressions that will do this?
Update / Clarification of OP's Question
I am aware of parsers, and I am afraid that this question is going to be derailed. I am not asking for parser help (though I do welcome and appreciate it); I am looking specifically for regular expression help. If it helps people get away from parser answers, here is another example. Let's say I have an app that looks for strings inside double quotes and pulls them out to make tags or something. I want to avoid troll users trying to mess things up or sneak things by me, so if they use doubled double quotes I should just ignore it and not bother making a tag out of it. Example:
In this "sentence" my regex would match "sentence" and use other code I'm not showing you to pull out only the word: sentence.
Now if someone does double double quotes I just ignore it because no match was found. Meaning the inner word should not be found as a match in this instance.
In this ""sentence"" I have two double quotes around the word sentence and it should be completely ignored now. I don't even care about ignoring the outer double quotes and matching on the inner ones. I want no match in this case.
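One way the behaviour described here could be sketched is to require that the opening delimiter is not preceded by the same character, that the content contains no delimiter, and that the closing delimiter is not followed by one. This is not taken from any answer; the function names are hypothetical, and the pattern captures the preceding character instead of using lookbehind so that it also runs in older engines.

```javascript
// Hypothetical helper: collect words wrapped in exactly one pair of double quotes.
function singleQuoted(text) {
  // (^|[^"]) : start of string or a non-quote character before the opening quote
  // ([^"]+)  : the quoted content, with no quotes inside
  // (?!")    : the closing quote must not be followed by another quote
  var pattern = /(^|[^"])"([^"]+)"(?!")/g;
  var found = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[2]);
  }
  return found;
}

// The same shape applied to a single star, leaving ** untouched.
function italics(text) {
  return text.replace(/(^|[^*])\*([^*]+)\*(?!\*)/g, "$1<em>$2</em>");
}
```

With these, 'In this "sentence" my regex' yields the word sentence, while the doubled-quote version yields no match at all. Escaped delimiters and nesting are not handled.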

string.replace not working in Firefox?

I have a number of Dynamic Actions in my Oracle Apex 4.2 page with action "Execute Javascript Code" on a phone number entry field:
$s("P40_MOBILE_PHONE", $v("P40_MOBILE_PHONE").replace(/[()-\s]+/g, ''));
This works in IE and Chrome. In Firefox, however, it not only doesn't work, but it causes all other dynamic actions on the page to stop working entirely.
The only difference between this and the other dynamic actions seems to be the use of string.replace(/[()-\s]+/g, ''). This is supposed to strip any spaces, (, ) and - characters from the phone number.
As @dandavis said in a comment, escaping the dash works (there is no need to escape the parentheses, though).
If you try to run the code
/[()-\s]+/
you get
SyntaxError: invalid range in character class
That's because Firefox treats the dash as a range operator (a range from ) to \s, which is invalid), not as a literal dash.
To fix it, you can:
Escape the dash: /[()\-\s]+/
Place the dash at the beginning or end: /[-()\s]+/, /[()\s-]+/
For future reference, changing the regex as follows fixed the problem:
replace(/[\(\)\-\s]+/g, '')
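As a small sketch of the fix in isolation, the stripping can be moved into a function with the dash escaped. The name stripPhoneFormatting is an assumption; in the Apex page its result would be what gets passed to $s.

```javascript
// Hypothetical helper: remove spaces, parentheses and dashes from a phone number.
// The dash is escaped so that it can never be read as a range operator.
function stripPhoneFormatting(value) {
  return value.replace(/[()\-\s]+/g, "");
}
```

For example, stripPhoneFormatting("(555) 123-4567") gives "5551234567".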

RegExp Expression any multiple characters with linebreaks and whitespaces

My regex is for finding certain words in text, but not words inside an element's text.
REGEXP
RegExp('\\b([^<(.*?)>(.?+)<\/(.*?)>])(' + wregex.join('|') + ')\\b(?=\\W)
EXAMPLE
This is some text that should be looked through
though this text <code>Should not be looked at </code> and this text is ok to
look at
So I'll explain the part of my regex that I am having trouble with:
([^<(.*?)>(.?+)<\/(.*?)>]) is meant to say: do not match any text that starts with <element>, nor anything inside it, until </element>.
That's the most important part. I've tried multiple methods and am not sure whether this is possible with a regex. I don't want to match anything from an opening HTML tag until its closing tag appears, and then resume searching after it.
EDIT
I know that RegEx shouldn't be used to parse HTML this is looking through TEXT
Testing Example HERE
Assuming that the text you are searching over is correctly formed (as in, no tag mismatches) the following regex should work:
^([^<]*<([^>]*)>[^<]*</\2>)*[^<]Your Text
This ensures that your text is outside of any opening and closing pair of tags, by matching all such pairs before getting to your text.
It won't work for nested tags. Regex is incapable of parsing arbitrarily nested tags.
However, please remember, you should not parse HTML with regex.
Why cram everything into a single regex? It can be as simple as this. Notice that I'm using [^] instead of ., to also match newlines.
string.replace(/<[^]+?<\/[^]+?>/, '').match(/what i really want to find/gi)
And yes, this is prone to breakage, as any regex solution would be.
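A slightly fuller sketch of the same two-step idea follows: remove every element together with its content, then search what is left. The function name and the words parameter are assumptions; it uses [\s\S] rather than [^], adds the g flag so that all elements are removed, and assumes the words contain no regex metacharacters. Like the approaches above, it does not handle nested elements of the same tag name.

```javascript
// Hypothetical helper: find the given words only in text outside of elements.
function findOutsideElements(text, words) {
  // Replace each <tag ...>...</tag> pair with a space; \1 ties the closing
  // tag name to the opening one, and [\s\S] also matches line breaks.
  var stripped = text.replace(/<([a-zA-Z][\w-]*)\b[^>]*>[\s\S]*?<\/\1>/g, " ");
  // Assumes every entry in words is already safe to embed in a regex.
  var pattern = new RegExp("\\b(?:" + words.join("|") + ")\\b", "gi");
  return stripped.match(pattern) || [];
}
```

Applied to the example text in the question, a search for "looked" would find the occurrence in the first line but not the one inside <code>.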

Regex wordwrap with UTF8 characters in JS

I've already read all the articles here which touch on a similar problem, but I still can't get any solution working. In my case I want to wrap each word of a string in a span. The words contain special characters like 'äüö...'
What i am doing at the moment is:
var textWrap = text.replace(/\b([a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+)\b/g, "<span>$1</span>");
But what happens is that if an ä, ü, ñ or other non-ASCII character is at the beginning or end of a word, it also acts like a boundary. Within a word, these characters don't act as a boundary.
'Ärmelkanal' becomes Ä<span>rmelkanal</span> but should be <span>Ärmelkanal</span>
'Käse' works fine... it becomes <span>Käse</span>
'diré' becomes <span>dir</span>é but should be <span>diré</span>
Any advice would be much appreciated. I need to do this on the client side :-( BTW, did I mention that I hate regular expressions ;-)
Thank you very much!
The problem is that JavaScript recognizes word boundaries only before/after ASCII letters (and numbers/underscore). Just drop the \b anchors and it should work.
result = subject.replace(/[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g, "<span>$&</span>");
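To illustrate the effect of dropping \b, here is a minimal sketch that swaps the hand-written letter list for a Unicode range. Both the helper name wrapWords and the range \u00C0-\u024F are assumptions: the range covers the Latin-1 Supplement letters and Latin Extended-A/B, but it also includes the × and ÷ signs, so the explicit list above may be the safer choice.

```javascript
// ASCII letters and digits, accented Latin letters (range is an assumption),
// and the hyphen. No \b anchors: the character class alone delimits words.
var WORD = /[a-zA-Z0-9\u00C0-\u024F-]+/g;

// Hypothetical helper: wrap every run of word characters in a span.
function wrapWords(text) {
  return text.replace(WORD, "<span>$&</span>");
}
```

With this, 'Ärmelkanal' and 'diré' are each wrapped as a whole word.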

IE innerHTML chops sentence if the last word contains '&' (ampersand)

I am trying to populate a DOM element with ID 'myElement'. The content which I'm populating is a mix of text and HTML elements.
Assume following is the content I wish to populate in my DOM element.
var x = "<b>Success</b> is a matter of hard work &luck";
I tried using innerHTML as follows,
document.getElementById("myElement").innerHTML=x;
This resulted in chopping off of the last word in my sentence.
Apparently, the problem is due to the '&' character present in the last word. I played around with the '&' and innerHTML and following are my observations.
If the last word of the content is less than 10 characters and if it has a '&' character present in it, innerHTML chops off the sentence at '&'.
This problem does not happen in firefox.
If I use innerText, the last word is intact, but then all the HTML tags which are part of the content become plain text.
I tried populating through jQuery's #html method,
$("#myElement").html(x);
This approach solves the problem in IE but not in chrome.
How can I insert HTML content whose last word contains '&' without it being chopped off, in all browsers?
Update: I tried HTML-encoding the content which I am trying to insert into the DOM. When I encode the content, the HTML tags which are part of the content become plain strings.
For the above mentioned content, I expect the result to be rendered as,
Success is a matter of hard work &luck
but when I encode what I actually get in the rendered page is,
<b>Success</b> is a matter of hard work &luck
You should replace your & with &amp;.
The & (ampersand) character is used within HTML to represent various special characters. For example, &quot; = ", &lt; = <, etcetera. Now, &luck clearly is not a valid HTML entity (for one it is missing the semicolon). However, various browsers may, due to combinations of error correcting (the semicolon), and the fact that it looks somewhat like an HTML entity (& followed by four characters) try to parse it as such.
Because &luck; is not a valid HTML entity, the original text is lost. Because of this, when using an ampersand in your HTML, always use &amp;.
Update: When this text is entered by a user, it is up to you to escape this character properly. In PHP for example, you would call htmlentities on the text before displaying it to the user. This has the added benefit of filtering out malicious user code such as <script> tags.
The ampersand is a special character in HTML that indicates the start of a character entity reference or numeric character reference. You need to escape it like so:
var x = "<b>Success</b> is a matter of hard work &amp;luck";
Try using this instead:
var x = "<b>Success</b> is a matter of hard work &amp;luck";
By HTML encoding the ampersand, you are ensuring that there is no ambiguity in what you mean when you write "&luck".
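When the content is a mix of markup and text that cannot be edited by hand, one possible sketch is to escape only the ampersands that do not already start something shaped like an entity, leaving the tags alone. The function name is an assumption, and the pattern only checks the shape of a reference (name or number followed by a semicolon); it does not check that the name is a real entity, so &luck; would be left as is.

```javascript
// Hypothetical helper: turn bare ampersands into &amp; while leaving
// existing references such as &amp;, &lt; or &#38; untouched.
function escapeBareAmpersands(html) {
  return html.replace(
    /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g,
    "&amp;"
  );
}
```

The result could then be assigned to innerHTML, for example document.getElementById("myElement").innerHTML = escapeBareAmpersands(x). It is not a sanitizer and does nothing about untrusted markup.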
