Regex wordwrap with UTF8 characters in JS


I've already read all the articles here that touch on a similar problem, but I still can't get any solution working. In my case I want to wrap each word of a string in a span. The words contain special characters like 'äüö...'
What I am doing at the moment is:
var textWrap = text.replace(/\b([a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+)\b/g, "<span>$1</span>");
But what happens is that if an ä, ü, ñ, or any other non-ASCII character is at the beginning or end of a word, it acts like a word boundary. Within a word these characters don't act as a boundary.
'Ärmelkanal' becomes Ä<span>rmelkanal</span> but should be <span>Ärmelkanal</span>
'Käse' works fine and becomes <span>Käse</span>
'diré' becomes <span>dir</span>é but should be <span>diré</span>
Any advice would be much appreciated. I need to do this on the client side :-( BTW, did I mention that I hate regular expressions? ;-)
Thank you very much!

The problem is that JavaScript's \b treats only ASCII letters, digits, and the underscore as word characters, so it sees a word boundary between 'Ä' and 'r' or between 'r' and 'é'. Just drop the \b anchors and it should work.
result = subject.replace(/[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g, "<span>$&</span>");
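As an alternative to listing every accented letter by hand, engines that support ES2018 Unicode property escapes can match any letter directly. This is a minimal sketch under that assumption; wrapWords is a hypothetical helper name, not part of the answer above.

```javascript
// Sketch, assuming an ES2018+ engine (Unicode property escapes need the "u" flag).
// \p{L} = any letter, \p{M} = combining marks (for decomposed accents),
// \p{N} = any number; the hyphen is kept as in the original character class.
function wrapWords(text) {
  return text.replace(/[\p{L}\p{M}\p{N}-]+/gu, "<span>$&</span>");
}

console.log(wrapWords("Ärmelkanal Käse diré"));
```

Because no \b anchors are involved, a leading or trailing non-ASCII letter is treated like any other letter.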

Related

How to match everything except specified characters and strings? Regex

I am building a graph drawer and currently working on the math expression parser. I'm done with most parts but I'm stuck at clearing the input text before parsing it. What I'm trying to achieve now is getting rid of unpermitted characters.
For example, in this text:
5ax+4asxxv+sdflog10aloga(132*43)sin(132)
I want to match everything that is not +,-,*,/,^,(,),ln,log,sin,cos,tan,cot,arcsin,arccos,...
and replace them with "".
so that the output is
5x+4xx+log10log(132*43)sin(132)
I need help with the regex.
Spaces don't matter since I clear them out beforehand.

A little bit tricky - at least I couldn't think of a simple way to do what you ask. The regex would get monstrous.
So I did it the other way around - match what you want to keep, and put it back together.
The regex:
[\d+*/^()x-]|ln|log|(?:arc)?(?:sin|cos)|tan|cot
The code:
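// Match every token we want to keep...
var re = /[\d+*/^()x-]|ln|log|(?:arc)?(?:sin|cos)|tan|cot/g,
    text = '5ax+4asxxv+sdflog10aloga(132*43)sin(132)arccos(1)';
// ...and join the matches back together.
console.log(text.match(re).join(''));
// prints 5x+4xx+log10log(132*43)sin(132)arccos(1)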

Markup regular expression help, double vs single symbols

Background
I have burned myself out looking for this answer. The closest working code I could find was from StackEdit, specifically the Markdown.Converter.js script, copied below. This is a pretty heavy-hitting regular expression, though; my regex for finding **, for example, takes almost 1/5 of the steps, and I don't need this much extra support.
function _DoItalicsAndBold(text) {
    // <strong> must go first:
    text = text.replace(/([\W_]|^)(\*\*|__)(?=\S)([^\r]*?\S[\*_]*)\2([\W_]|$)/g, "$1<strong>$3</strong>$4");
    text = text.replace(/([\W_]|^)(\*|_)(?=\S)([^\r\*_]*?\S)\2([\W_]|$)/g, "$1<em>$3</em>$4");
    return text;
}
Question
I'm trying to make my own very simple markdown script that makes these transformations:
* ---> Italics
** ---> Bold
__ ---> Underline
So far I can find all uses of ** (two stars, bold text) with this regex:
/(\*\*)(?:(?=(\\?))\2.)*?\1/g
However, I cannot for the life of me figure out how to match only * (single star, italicized text) with one regular expression. If I decide to go further I may have to distinguish between _ and __ as well.
Can someone point me in the right direction on how to properly write the regular expressions that will do this?
Update / Clarification of OP's Question
I am aware of parsers, and I am afraid that this question is going to be derailed. I am not asking for parser help (though I do welcome and appreciate it); I am looking specifically for regular expression help. If it helps people get away from parser answers, here is another example. Let's say I have an app that looks for strings inside double quotes and pulls them out to make tags or something. I want to stop troll users from messing things up or sneaking things past me, so if they use doubled double quotes I should just ignore them and not bother making a tag. Example:
In this "sentence" my regex would match "sentence" and use other code I'm not showing you to pull out only the word: sentence.
Now if someone does double double quotes I just ignore it because no match was found. Meaning the inner word should not be found as a match in this instance.
In this ""sentence"" I have two double quotes around the word sentence and it should be completely ignored now. I don't even care about ignoring the outer double quotes and matching on the inner ones. I want no match in this case.
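One way to express "a single delimiter, but not a doubled one" is with lookarounds on both sides of each delimiter. The sketch below is an assumption-laden illustration, not the approach used in Markdown.Converter.js: it needs lookbehind support (ES2018+) and matchAll (ES2020+), it ignores backslash escapes, and renderItalics and findQuoted are hypothetical helper names.

```javascript
// Sketch: a delimiter counts only if it is not adjacent to another copy of itself.
// (?<!\*) and (?!\*) reject a star that is part of "**".

// Turn *text* into <em>text</em>, leaving **text** untouched.
function renderItalics(text) {
  return text.replace(/(?<!\*)\*(?!\*)([^*]+)\*(?!\*)/g, "<em>$1</em>");
}

// Return the contents of "text", ignoring ""text"".
function findQuoted(text) {
  const pattern = /(?<!")"(?!")([^"]+)"(?!")/g;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(renderItalics("*it* and **bold**"));
console.log(findQuoted('In this "sentence" there is one tag'));
console.log(findQuoted('In this ""sentence"" there is none'));
```

The same shape should work for _ versus __ by swapping the delimiter in all five places.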

replacing special characters with a specific special character

I have a string which is exactly like this...
R%26B,Alternative,Rock,Classic Rock,Heavy Metal,Classical,Reggae%2fSka,
I have tried hard to remove the special characters before they reach the browser, but I'm not getting anywhere, so I'm planning to rely on my old and trusted friend JavaScript. I want it to read
R&B,Alternative,Rock,Classic Rock,Heavy Metal,Classical,Reggae&Ska,
I know this can be done with a regular expression, but I just can't figure it out. How would I write the expression?
Any help would be highly appreciated.

You may try using decodeURIComponent. Note that %2f decodes to "/", not "&", so this example uses %26 in both places:
decodeURIComponent("R%26B,Alternative,Rock,Classic Rock,Heavy Metal,Classical,Reggae%26Ska")
// returns "R&B,Alternative,Rock,Classic Rock,Heavy Metal,Classical,Reggae&Ska"

Look at these answers:
Regex to remove all special characters from string?
They lay out a regex that removes everything EXCEPT the characters you want to allow, which is safer than removing a list of %26, %2f, etc.
For example...
[^0-9a-zA-Z, ]+ would allow all letters, numbers, commas and spaces.
[^0-9a-zA-Z]+ would allow only letters and numbers.
The other answers are probably pointing you in a better direction... if it means fixing the string before it gets to the client.

You probably need the decodeURIComponent() function.
<script>
var decodedString = decodeURIComponent('R%26B,Alternative,Rock,Classic Rock,Heavy Metal,Classical,Reggae%2fSka');
</script>
http://www.w3schools.com/jsref/jsref_decodeuricomponent.asp
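For the exact string in the question, decoding alone yields "Reggae/Ska", because %2f is an encoded slash. If the slash really should become an ampersand, a second replace can follow the decode. A small sketch, assuming that substitution is what is wanted; the variable names are only illustrative:

```javascript
// Sketch: decode the percent-escapes first, then map "/" to "&".
const raw = "R%26B,Alternative,Rock,Classic Rock,Heavy Metal,Classical,Reggae%2fSka,";

// %26 becomes "&" and %2f becomes "/".
const decoded = decodeURIComponent(raw);

// Assumption: every "/" in the decoded list should be shown as "&".
const withAmpersands = decoded.replace(/\//g, "&");

console.log(decoded);
console.log(withAmpersands);
```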

Extending an existing regex to drop punctuation after URL links

I have an existing replace that matches http within a text string and creates a working URL from the text.
Working Example:
var Text = "Visit Gmail at http://gmail.com";
var linkText = Text.replace(/http:\/\/\S+/gi, '<a href="$&">$&</a>');
document.write(linkText);
Output:
Visit Gmail at http://gmail.com
Problem:
The problem arises when the link appears at the end of a sentence and the punctuation incorrectly becomes appended to the end of the URL.
Can someone advise on a way of extending my regex (or maybe adding a second replacement after this has been transformed) to overcome this?
I think the right answer will include adding something along the lines of /\W$/g to my original regex, but I can't see how this can be applied to just one word within the whole string.
As always, very grateful for any help.
Thanks,
Pete
Examples of problem links
http://gmail.com/.
http://gmail.com,
http://gmail.com/?
http://gmail.com!
All of these should resolve the link to http://gmail.com
Note how some could end in a slash then punctuation and others with punctuation directly after the domain name.

Try
/http:\/\/(.(?![.?] |$))*/
My logic is, if the last char is a dot, or question mark followed by either a space or end of string, you don't need it.
var Text = "Visit Gmail at http://gmail.com";
var linkText = Text.replace(/http:\/\/(.(?![.?](?:\s|$)))*./gi, '<a href="$&">$&</a>');
document.write(linkText);
Gives
"Visit Gmail at http://gmail.com"
Edit:
This may be better (it doesn't match white space now)
http:\/\/(.(?!(?:[.?](?: |$))))*.

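Why not just use a negated character class? The slashes must be escaped inside a regex literal, and excluding whitespace as well keeps the match from running past the punctuation into the following space:
/http:\/\/\S+[^\s.,?!]/gi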

You could account for trailing unwanted characters, whether stripping them or not.
The replacement for both is capture buffer 1: <a href="$1">$1<\/a>
This also assumes lookbehind is available. JavaScript only gained lookbehind assertions in ES2018, so older client-side engines will reject the two lookbehind patterns below.
Strip unwanted chars
/(http:\/\/\S+)(?<![\/.,?!])[\/.,?!]*/
Or, leave unwanted characters
/(http:\/\/\S+)(?<![\/.,?!])/
Alternate, using lookahead
Strip
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))[\/.,?!]*/
Leave
/(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))/
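To see the lookahead "strip" variant in action, the sketch below wraps it in a function and runs it on the problem links from the question. The g and i flags and the linkify name are additions for illustration. Note that the stripped punctuation is dropped from the output text entirely.

```javascript
// Sketch: lookahead-based "strip" pattern from the answer above, applied globally.
// Group 1 is the URL without trailing / . , ? ! characters.
function linkify(text) {
  const urlPattern = /(http:\/\/\S+?(?=[\/.,?!]+(?:\s|$)|\s|$))[\/.,?!]*/gi;
  return text.replace(urlPattern, '<a href="$1">$1</a>');
}

const samples = [
  "http://gmail.com/.",
  "http://gmail.com,",
  "http://gmail.com/?",
  "http://gmail.com!",
];
for (const sample of samples) {
  console.log(linkify("Visit Gmail at " + sample));
}
```

Because this variant uses no lookbehind, it should also run on engines that predate ES2018.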

regular expressions - finding the position of a number and removing brackets around it

I'm stuck. I tried it with regular expressions, but I guess I'm missing something. I'm working with JavaScript.
I have an input like:
(text [number]) the text that follows...
I want an output like:
[number] the text that follows...
I tried it with substr, but my problem is that I do not know the length of the text or number in the brackets. I guess I need the position of the beginning and ending of the number to work with a regEx.
Have you got an idea?

Regexes are the way to go — using JavaScript’s replace function, you don’t need to fiddle with the position of the number in the string.
Try this:
var geoff = '(text 694) the text that follows...';
var geoff_replaced = geoff.replace(/\([^0-9]* ([0-9]*)\)/, '$1');
// geoff_replaced will be "694 the text that follows..."
I don’t do much JavaScript regex stuff, so I totally looked up the above on this guide to JavaScript regexes:
http://www.evolt.org/node/36435

It'd help to have a real example but I made one up...
Text:
(Some text 1234) some more text.
Regex:
^.+?(?<Number>\d+)\)(?<Text>.+)$
Replacement (JavaScript refers to named groups as $<Name>, and named groups need ES2018 or later):
$<Number>$<Text>
Full example:
var fixedText = "(Some text 1234) some more text.".replace(/^.+?(?<Number>\d+)\)(?<Text>.+)$/, "$<Number>$<Text>");

A regex that matches (text [number]) the text that follows... could look like this:
"^\(.*?([0-9]*)\)(.*)$"
Or you can match just the opening (text and the closing ) and remove them:
"^(\(.*?)[0-9]*(\)).*$"

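The first pattern from the last answer can be used with replace by putting the two captured groups back together. A minimal sketch, with one deliberate change: [0-9]+ instead of [0-9]*, so input without a number in the brackets is left untouched. unwrapNumber is a hypothetical helper name.

```javascript
// Sketch: keep the number from "(text number)" plus everything after the ")".
// Group 1 = the digits, group 2 = the rest of the string.
function unwrapNumber(input) {
  return input.replace(/^\(.*?([0-9]+)\)(.*)$/, "$1$2");
}

console.log(unwrapNumber("(Some text 1234) some more text."));
console.log(unwrapNumber("(text 694) the text that follows..."));
```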