How to match everything except specified characters and strings? Regex - javascript

I am building a graph drawer and currently working on the math expression parser. I'm done with most parts but I'm stuck at clearing the input text before parsing it. What I'm trying to achieve now is getting rid of unpermitted characters.
For example, in this text:
5ax+4asxxv+sdflog10aloga(132*43)sin(132)
I want to match everything that is not +,-,*,/,^,(,),ln,log,sin,cos,tan,cot,arcsin,arccos,...
and replace them with "".
so that the output is
5x+4xx+log10log(132*43)sin(132)
I need help with the regex.
Spaces don't matter since I clear them out beforehand.

A little bit tricky - at least I couldn't think of a simple way to do what you ask. The regex would get monstrous.
So I did it the other way around - match what you want to keep, and put it back together.
The regex:
[\d+*/^()x-]|ln|log|(?:arc)?(?:sin|cos)|tan|cot
The code:
var re = /[\d+*/^()x-]|ln|log|(?:arc)?(?:sin|cos)|tan|cot/g,
    text = '5ax+4asxxv+sdflog10aloga(132*43)sin(132)arccos(1)';
// match() returns null when nothing matches, so fall back to an empty array
console.log((text.match(re) || []).join(''));
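The same idea can also be written as a single replace call. This is only a sketch: the helper name keepPermitted is made up for illustration, and the token list is the one from the regex above, so functions such as arctan are not covered. Permitted tokens go into a capture group, and a catch-all alternative consumes one unwanted character at a time.

```javascript
// Hypothetical helper: keeps permitted tokens, drops every other character.
// Group 1 holds a permitted token; the trailing [\s\S] alternative matches
// any single character that did not start a permitted token.
function keepPermitted(text) {
  var re = /([\d+*\/^()x-]|ln|log|(?:arc)?(?:sin|cos)|tan|cot)|[\s\S]/g;
  // When the catch-all alternative matches, group 1 is unset and "$1"
  // expands to an empty string, which deletes the character.
  return text.replace(re, '$1');
}

var cleaned = keepPermitted('5ax+4asxxv+sdflog10aloga(132*43)sin(132)');
console.log(cleaned); // 5x+4xx+log10log(132*43)sin(132)
```

Unlike match(), replace() always returns a string, so no null check is needed when nothing in the input is permitted.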

Related

javascript regex - splitting a text into sentences on . if no number is before it

I am trying to parse a text into sentences, by using:
str.replace(/(\.+|:|!|\?)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|");
Which works great, but if a sentence starts with a list number (e.g. "1. some words"), I get: ['1.', 'some words'].
It's my first time using regex, and while I know a lookbehind can do this, I was not able to get one working.
How can I change my regex to only split at . if there's no number character before it?
Ended up using str.replace(/(?<!:)(\n)\s*/g, "$1|").replace(/(?<![0-9])(\.+)\s*/g, "$1|").replace(/(\?+|!+)\s*/g, "$1|").split("|")
I am sure there's a prettier way to write this regex, but as a noob - I don't yet know how. This also covers:
1. Not splitting if there's a new line after :
2. Multiple dots, question and exclamation marks
This code is meant to split a text into "ideas", which is why I used the conditions I did. It might not be the right logic for a simple "split into sentences" need.
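To make the behaviour of that chain easier to check, here is a small sketch that wraps it in a function. The name splitIntoIdeas and the final filter step are additions for illustration, not part of the original code; the three replace calls are unchanged. It needs an engine with lookbehind support (ES2018).

```javascript
// Hypothetical wrapper around the replace chain shown above.
function splitIntoIdeas(str) {
  return str
    .replace(/(?<!:)(\n)\s*/g, "$1|")       // break after a newline unless a colon precedes it
    .replace(/(?<![0-9])(\.+)\s*/g, "$1|")  // break after dots that do not follow a digit
    .replace(/(\?+|!+)\s*/g, "$1|")         // break after runs of ? or !
    .split("|")
    // Added step (assumption): drop the empty piece left after the last marker.
    .filter(function (part) { return part.length > 0; });
}

var ideas = splitIntoIdeas("1. First idea. Second idea! Third?");
console.log(ideas); // [ '1. First idea.', 'Second idea!', 'Third?' ]
```

Two caveats follow from how the chain works: a "|" already present in the text also causes a split, and a sentence that ends in a digit right before the dot (such as "It costs 5.") is not split there.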

Get all the WORDS except one specific word

I want to get all the words, except one, from a string using JS regex match function. For example, for a string testhello123worldtestWTF, excluding the word test, the result would be helloworldWTF.
I realize that I have to do it using look-ahead functions, but I can't figure out how exactly. I came up with the following regex (?!test)[a-zA-Z]+(?=.*test), however, it works only partially.
http://refiddle.com/refiddles/59511c2075622d324c090000
IMHO, I would try to replace the offending word with an empty string, no?
Lookarounds seem to be overkill for this. You can just replace test with nothing:
var str = 'testhello123worldtestWTF';
var res = str.replace(/test/g, '');
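Note that this replace keeps the digits. If only letter runs are wanted, as the expected result helloworldWTF in the question suggests, a second step can collect them. The following is a sketch under that assumption; the variable names are made up for illustration.

```javascript
var source = 'testhello123worldtestWTF';

// Step 1: remove every occurrence of the excluded word.
var withoutWord = source.replace(/test/g, '');
console.log(withoutWord); // hello123worldWTF

// Step 2 (assumption: only letters count as "words"): collect the letter runs.
// match() returns null when there are no letters, hence the fallback array.
var lettersOnly = (withoutWord.match(/[a-zA-Z]+/g) || []).join('');
console.log(lettersOnly); // helloworldWTF
```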
Plugging this into your refiddle produces the results you're looking for:
/(test)/g
It matches all occurrences of the word "test" without picking up unwanted words/letters. You can set this to whatever variable you need to hold these.
WORDS OF CAUTION
Since your input string has no delimiters between words, you cannot reliably exclude a specific word.
For example, if you want to exclude test, this might create a problem if the input was protester or rotatestreet. You don't have clear demarcations of what a word is, thus leading you to exclude test when you might not have meant to.
On the other hand, if you just want to ignore the string test regardless, just replace test with an empty string and you are good to go.
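The difference can be seen in a short sketch. It assumes an input where words are separated by spaces, which is not the case in the original question, so that \b has something to anchor on.

```javascript
var input = 'protester test rotatestreet';

// Plain substring removal also cuts "test" out of longer words.
var removedEverywhere = input.replace(/test/g, '');
console.log(removedEverywhere); // "proer  rotareet"

// With \b on both sides, only a standalone "test" is removed.
var removedStandalone = input.replace(/\btest\b/g, '');
console.log(removedStandalone); // "protester  rotatestreet"
```

In both results the two spaces that surrounded the removed word remain.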

Extracting both the full match, and the last token match in a regexp

I have a little interesting issue here. I have a plaintext URL coming from Excel and I need to change it to an HTML URL with a unique body. Here is the regex code for javascript:
text = text.toString().replace(/=hyperlink\(([#\\\w\s\(\)-\.\/]+)\)/g, "<a href='file:///$1'>$1</a>");
This works perfectly fine for what it does. Example, text is:
=hyperlink("\\share\folder\log\2013\13-05-13\13-05-13.txt")
regex turns it into
\\share\folder\log\2013\13-05-13\13-05-13.txt
However, I need the inner HTML to be just the text file name:
13-05-13.txt
To further complicate the matter, the original text the regex is going through is not a single occurrence. It is an entire spreadsheet with hundreds of rows that contain this. So the regex will be matching and replacing hundreds of these strings in one operation.
Hopefully it is possible to get this all done in one regexp on the entire string, but I suppose I could loop through each line of the string first...
If there is no way to do this with one regex, what do you think the best approach is? (No PHP/Python/server side. Just JavaScript, HTML, jQuery, etc.)
I guess you could use this regex:
=hyperlink\("([#\\\w\s\(\)\-\.\/]+\\([^"]+))"\)
And this new replace:
<a href="file:///$1">$2</a>
I'm not sure how your regex was working, but I added the quotes in the regex and replaced the single quotes by double quotes in the replace. Revert those if need be.
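As a rough sketch of how this regex behaves on the sample row: group 1 captures the full path between the quotes and group 2 captures what follows the last backslash, i.e. the file name. The helper name excelLinksToHtml is hypothetical, and the replacement string assumes the link format from the question with double quotes.

```javascript
// Group 1: the full path between the quotes.
// Group 2: the part after the last backslash (the file name).
var linkPattern = /=hyperlink\("([#\\\w\s\(\)\-\.\/]+\\([^"]+))"\)/g;

// Hypothetical helper; the g flag converts every =hyperlink("...") in the text.
function excelLinksToHtml(text) {
  return String(text).replace(linkPattern, '<a href="file:///$1">$2</a>');
}

// Backslashes are doubled here only because this is a JS string literal.
var row = '=hyperlink("\\\\share\\folder\\log\\2013\\13-05-13\\13-05-13.txt")';
var converted = excelLinksToHtml(row);
console.log(converted);
```

Because the character class is greedy, the engine backtracks to the last backslash in the path, which is what makes group 2 the bare file name. With the g flag, the same call handles a string containing many rows.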

Trying to remove trailing text

I have the following code. I want to extract the last text (hello64) from it.
<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*
I used the code below but it removes all the integers
questionText = questionText.replace(/<span\b.*?>/ig, "");
questionText=questionText.replace(/<\/span>/ig, "");
questionText = questionText.replace(/\d+/g,"");
questionText = questionText.replace("*","");
questionText = questionText.replace(". ","");
I want to remove the first integer and need to keep the rest of the integers.
It's the third line .replace(/\d+/g,"") which is replacing the integers. If you want to keep the integers, then don't replace \d+, because that matches one or more digits.
You could achieve most of that all on one line, by the way - there's no need to have multiple replaces there:
var questionText = questionText.replace(/((<span\b.*?>)|(<\/span>)|(\d+))/ig, "");
That would do the same as the first three lines of your code. (Of course, you'd need to drop the |(\d+) as per the first part of the answer if you didn't want to get rid of the digits.)
[EDIT]
Re your comment that you want to replace the first integer but not the subsequent ones:
The regex string to do this would depend very heavily on what the possible input looks like. The problem is that you've given us a bit of random HTML code; we don't know from that whether you're expecting it to always be in this precise format (ie a couple of spans with contents, followed by a bit at the end to keep). I'll assume that this is the case.
In this case, a much simpler regex for the whole thing would be to replace everything within <span....</span> with blank:
var questionText = questionText.replace(/(<span\b.*?>.*?<\/span>)/ig, "");
This will eliminate the whole of the <span> tags plus their contents, but leave anything outside of them alone.
In the case of your example this would provide the desired effect, but as I say, it's hard to know if this will work for you in all cases without knowing more about your expected input.
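Applied to the snippet from the question, that looks roughly like the sketch below. The final cleanup of the "?" and "*" characters is an assumption about what counts as unwanted, and the variable names are made up for illustration.

```javascript
var html = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';

// Remove each <span ...>...</span> pair together with its contents.
var withoutSpans = html.replace(/<span\b.*?>.*?<\/span>/ig, '');
console.log(JSON.stringify(withoutSpans)); // " hello64 ?*"

// Assumption: the trailing "?" and "*" markers are unwanted as well.
var cleanedText = withoutSpans.replace(/[?*]/g, '').trim();
console.log(cleanedText); // hello64
```

The digits in hello64 survive because nothing in this version matches \d+; the leading 4 disappears only because it sits inside a span.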
In general it's considered difficult to parse arbitrary HTML code with regex. Regex is a contraction of "Regular Expressions", which is a way of saying that they are good at handling strings which have 'regular' syntax. Arbitrary HTML is not a 'regular' syntax due to its unlimited possible levels of nesting. What I'm trying to say here is that if you have anything more complex than the simple HTML snippets you've supplied, then you may be better off using an HTML parser to extract your data.
This will match the complete string and put the part after the last </span>, up to the next word boundary \b, into capturing group 1. You then just need to replace the match with group 1, i.e. $1.
searched_string = string.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");
The captured word can consist of [A-Za-z0-9]. If you want to have anything else there just add it into that group.
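A variant of the same regex with match() instead of replace() may be handy when the input might not contain a span at all: replace() returns the input unchanged in that case, while match() returns null. This is a sketch; the helper name wordAfterLastSpan is hypothetical.

```javascript
// Hypothetical helper: returns the first word after the last </span>,
// or null when the input does not have that shape.
function wordAfterLastSpan(html) {
  var m = html.match(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/);
  return m ? m[1] : null;
}

var sample = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';
console.log(wordAfterLastSpan(sample));         // hello64
console.log(wordAfterLastSpan('no span here')); // null
```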

Regex wordwrap with UTF8 characters in JS

I've already read all the articles here which touch on a similar problem, but I still can't get any solution working. In my case I want to wrap each word of a string with a span. The words contain special characters like 'äüö...'
What i am doing at the moment is:
var textWrap = text.replace(/\b([a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+)\b/g, "<span>$1</span>");
But what happens is that if an äüñ or any other non-ASCII character is at the end or at the beginning of a word, it also acts like a boundary. Within a word these characters don't act as a boundary.
'Ärmelkanal' becomes Ä<span>rmelkanal</span> but should be <span>Ärmelkanal</span>
'Käse' works fine... becomes <span>Käse</span>
'diré' becomes <span>dir</span>é but should be <span>diré</span>
Any advice would be very much appreciated. I need to do that on the client side :-( BTW did I mention that I hate regular expressions ;-)
Thank You very much!
The problem is that JavaScript recognizes word boundaries only before/after ASCII letters (and numbers/underscore). Just drop the \b anchors and it should work.
result = subject.replace(/[a-zA-Z0-9ßÄÖÜäöüÑñÉéÈèÁáÀàÂâŶĈĉĜĝŷÊêÔôÛûŴŵ-]+/g, "<span>$&</span>");
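In engines that support Unicode property escapes (ES2018, with the u flag), the hand-written letter list can be swapped for character classes. This is an alternative sketch, not the approach from the answer above; the helper name wrapWords is hypothetical.

```javascript
// Hypothetical helper.
// \p{L} = any letter, \p{M} = combining marks, \p{N} = any digit.
// The "u" flag is required for \p{...} to be recognised.
function wrapWords(text) {
  return text.replace(/[\p{L}\p{M}\p{N}-]+/gu, '<span>$&</span>');
}

var wrapped = wrapWords('Ärmelkanal Käse diré');
console.log(wrapped);
// <span>Ärmelkanal</span> <span>Käse</span> <span>diré</span>
```

This covers letters that are missing from the explicit list, at the cost of requiring a newer engine.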
