Javascript - how to use regex process the following complicated string

Javascript - how to use regex process the following complicated string - javascript

I have the following string that will occur repeatedly in a larger string:
[SM_g]word[SM_h].[SM_l] "
Notice in this string after the phrase "[SM_g]word[Sm_h]" there are three components:
A period (.) This could also be a comma (,)
[SM_l]
"
Zero to all three of these components will always appear after "[SM_g]word[SM_h]". However, they can also appear in any order after "[SM_g]word[SM_h]". For example, the string could also be:
[SM_g]word[SM_h][SM_l]"
or
[SM_g]word[SM_h]"[SM_l].
or
[SM_g]word[SM_h]".
or
[SM_g]word[SM_h][SM_1].
or
[SM_g]word[SM_h].
or simply just
[SM_g]word[SM_h]
These are just some of the examples. The point is that there are three different components (more if you consider the period can also be a comma) that can appear after "[SM_h]word[SM_g]" where these three components can be in any order and sometimes one, two, or all three of the components will be missing.
Not only that, sometimes there will be up to one space before " and the previous component/[SM_g]word[SM_h].
For example:
[SM_g]word[SM_h] ".
or
[SM_g]word[SM_h][SM_l] ".
etc. etc.
I am trying to process this string by moving each of the three components inside of the core string (and preserving the space, in case there is a space before &\quot; and the previous component/[SM_g]word[SM_h]).
For example, [SM_g]word[SM_h].[SM_l]" would turn into
[SM_g]word.[SM_l]"[SM_h]
or
[SM_g]word[SM_h]"[SM_l]. would turn into
[SM_g]word"[SM_l].[SM_h]
or, to simulate having a space before "
[SM_g]word[SM_h] ".
would turn into
[SM_g]word ".[SM_h]
and so on.
I've tried several combinations of regex expressions, and none of them have worked.
Does anyone have advice?

You need to put each component within an alternation in a grouping construct with maximum match try of 3 if it is necessary:
\[SM_g]word(\[SM_h])((?:\.|\[SM_l]| ?"){0,3})
You may replace word with .*? if it is not a constant or specific keyword.
Then in replacement string you should do:
$1$3$2
var re = /(\[SM_g]word)(\[SM_h])((?:\.|\[SM_l]| ?"){0,3})/g;
var str = `[SM_g]word[SM_h][SM_l] ".`;
console.log(str.replace(re, `$1$3$2`));

This seems applicable for your process, in other word, changing sub-string position.
(\[SM_g])([^[]*)(\[SM_h])((?=([,\.])|(\[SM_l])|( ?&\\?quot;)).*)?
Demo,,, in which all sub-strings are captured to each capture group respectively for your post processing.
[SM_g] is captured to group1, word to group2, [SM_h] to group3, and string of all trailing part is to group4, [,\.] to group5, [SM_l] to group6, " ?&\\?quot;" to group7.
Thus, group1~3 are core part, group4 is trailing part for checking if trailing part exists, and group5~7 are sub-parts of group4 for your post processing.
Therefore, you can get easily matched string's position changed output string in the order of what you want by replacing with captured groups like follows.
\1\2\7\3 or $1$2$7$3 etc..
For replacing in Javascript, please refer to this post. JS Regex, how to replace the captured groups only?
But above regex is not sufficiently precise because it may allow any repeatitions of the sub-part of the trailing string, for example, \1\2\3\5\5\5\5 or \1\2\3\6\7\7\7\7\5\5\5, etc..
To avoid this situation, it needs to adopt condition which accepts only the possible combinations of the sub-parts of the trailing string. Please refer to this example. https://regex101.com/r/6aM4Pv/1/ for the possible combinations in the order.
But if the regex adopts the condition of allowing only possible combinations, the regex will be more complicated so I leave the above simplified regex to help you understand about it. Thank you:-)

Related

Regex for finding element tagname and attributes "skips" attributes

I'm trying to make a regular expression that finds the tagnames and attributes of elements. For example, if I have this:
<div id="anId" class="aClass">
I want to be able to get an array that looks like this:
["(full match)", "div", "id", "anId", "class", "aClass"]
Currently, I have the regex /<(\S*?)(?: ?(.*?)="(.*?)")*>/, but for some reason it skips over every attribute except for the last one.
var str = '<div id="anId" class="aClass">'
console.log(str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/));
Regex101: https://regex101.com/r/G0ncwF/2
Another odd thing: if I remove the * after the non-capture group, the capture group in quotes seems to somehow "forget" that it's lazy. (Regex101: https://regex101.com/r/C0UwI8/2)
Why does this happen, and how can I avoid it? I couldn't find any questions/answers that helped me (Python re.finditer match.groups() does not contain all groups from match looked promising, but didn't seem help me at all)
(note: I know there are better ways to get the attributes, I'm just experimenting with regex)
UPDATE:
I've figured out at least why the quantifiers seem to "forget" that they're lazy. It's actually just that the regex is trying to match all the way to the angle brackets. I suppose I must have been thinking that the non-capturing group was "insulating" everything and preventing that from happening, and I didn't see it was still lazy because there was only one angle bracket for it to find.
var str = '"foo" "bar"> "baz>"'
console.log("/\".*?\"/ produces ", str.match(/".*?"/), ", finds first quote, finds text, lazily stops at second quote");
console.log("/\".*?\">/ produces ", str.match(/".*?">/), ", finds first quote, finds text, sees second quote but doesn't see angle bracket, keeps going until it sees \">, lazily stops");
So at least that's solved. But I still don't understand why it skips over every attribute but the last one.
And note: Other regexes using different tricks to find the attributes are nice and all, but I'm mostly looking to learn why my regex skips over the attributes, so I can maybe understand regex a bit better.

Playing along with your experimentation you could do this: Instead of scanning for what you want, you can scan for what you don't want, and then filter it out:
const html = '<div id="anId" class="aClass">';
const regex = /[<> ="]/;
let result = html.split(regex).filter(Boolean);
console.log('result: '+JSON.stringify(result));
Output:
result: ["div","id","anId","class","aClass"]
Explanation:
regex /[<> ="]/ lists all chars you don't want
.split(regex) splits your text along the unwanted chars
.filter(Boolean) gets rid of the unwanted chars
Mind you this has flaws, for example it will split incorrectly for html <div id="anId" class="aClass anotherClass">, e.g a space in an attribute value. To support that you could preprocess the html with another regex to escape spaces in quotes, then postprocess with another regex to restore the spaces...
Yes, an HTML parser is more reliable for these kind of tasks.

How can I write a regular expression that matches everything between the first and the last quote?

I try to match multiple values between quotes
(these values can be anything but spaces)
the best I can achieve is to match everything between the first and the last quote
I already checked many SO answers, yet I cannot make it work
here is the regex
\[\[\[(\w*img\w*)\s(\w*id|url\w*)+="([^"]|.*)"\]\]\]
here is the string I try to match (values are numbers but I could have urls or anything similar)
[[[img id="37" w="100" h="70"]]]
I should get all parameters and their respecting values, but I get only one parameter with the value beeing 37" w="100" h="70
I know I am close, but this one is tricky
regards

I don't think you need all the \w.
And I also would suggest splitting the task in two parts as suggested in a comment.
However, I also see an option in doing it in just one step:
\[\[\[img(?:\s(\w+)="([^"]+)")?(?:\s(\w+)="([^"]+)")?(?:\s(\w+)="([^"]+)")?\]\]\]
This is basically the wrapper [[[]]], a normal character part img and then (?:\s(\w+)="([^"]+)")? repeated as many times as you expect attributes to appear. (\w+) matches the name of the attribute and ([^"]+) its value.

Splitting a string at question mark, exclamation mark, or period in javascript and retain those marks?

I was a bit surprised, that actually no one had the exact same issue in javascript...
I tried several different solutions none of them parse the content correctly.
The closest one I tried : (I stole its regex query from a PHP solution)
const test = `abc?aaa.abcd?.aabbccc!`;
const sentencesList = test.split("/(\?|\.|!)/");
But result just going to be
["abc?aaa.abcd?.aabbccc!"]
What I want to get is
['abc?', 'aaa.', 'abcd?','.', 'aabbccc!']
I am so confused.. what exactly is wrong?

/[a-z]*[?!.]/g) will do what you want:
const test = `abc?aaa.abcd?.aabbccc!`;
console.log(test.match(/[a-z]*[?!.]/g))

To help you out, what you write is not a regex. test.split("/(\?|\.|!)/"); is simply an 11 character string. A regex would be, for example, test.split(/(\?|\.|!)/);. This still would not be the regex you're looking for.
The problem with this regex is that it's looking for a ?, ., or ! character only, and capturing that lone character. What you want to do is find any number of characters, followed by one of those three characters.
Next, String.split does not accept regexes as arguments. You'll want to use a function that does accept them (such as String.match).
Putting this all together, you'll want to start out your regex with something like this: /.*?/. The dot means any character matches, the asterisk means 0 or more, and the questionmark means "non-greedy", or try to match as few characters as possible, while keeping a valid match.
To search for your three characters, you would follow this up with /[?!.]/ to indicate you want one of these three characters (so far we have /.*?[?!.]/). Lastly, you want to add the g flag so it searches for every instance, rather than only the first. /.*?[?!.]/g. Now we can use it in match:
const rawText = `abc?aaa.abcd?.aabbccc!`;
const matchedArray = rawText.match(/.*?[?!.]/g);
console.log(matchedArray);

The following code works, I do not think we need pattern match. I take that back, I have been answering in Java.
final String S = "An sentence may end with period. Does it end any other way? Ofcourse!";
final String[] simpleSentences = S.split("[?!.]");
//now simpleSentences array has three elements in it.

What Regex would capture both the beginning and end from of a string?

I am trying to edit a DateTime string in typescript file.
The string in question is 02T13:18:43.000Z.
I want to trim the first three characters including the letter T from the beginning of a string AND also all 5 characters from the end of the string, that is Z000., including the dot character. Essentialy I want the result to look like this: 13:18:43.
From what I found the following pattern (^(.*?)T) can accomplish only the first part of the trim I require, that leaves the initial result like this: 13:18:43.000Z.
What kind of Regex pattern must I use to include the second part of the trim I have mentioned? I have tried to include the following block in the same pattern (Z000.)$ but of course it failed.
Thanks.
Any help would be appreciated.

There is no need to use regular expression in order to achieve that. You can simply use:
let value = '02T13:18:43.000Z';
let newValue = value.slice(3, -5);
console.log(newValue);
it will return 13:18:43, assumming that your string will always have the same pattern. According to the documentation slice method will substring from beginIndex to endIndex. endIndex is optional.

as I see you only need regex solution so does this pattern work?
(\d{2}:)+\d{2} or simply \d{2}:\d{2}:\d{2}
it searches much times for digit-digit-doubleDot combos and digit-digit-doubleDot at the end
the only disadvange is that it doesn't check whether say there are no minutes>59 and etc.
The main reason why I didn't include checking just because I kept in mind that you get your dates from sources where data that are stored are already valid, ex. database.

Solution
This should suffice to remove both the prefix from beginning to T and postfix from . to end:
/^.*T|\..*$/g
console.log(new Date().toISOString().replace(/^.*T|\..*$/g, ''))
See the visualization on debuggex
Explanation
The section ^.*T removes all characters up to and including the last encountered T in the string.
The section \..*$ removes all characters from the first encountered . to the end of the string.
The | in between coupled with the global g flag allows the regular expression to match both sections in the string, allowing .replace(..., '') to trim both simultaneously.

Trying to remove trailing text

I having the following code. I want to extract the last text (hello64) from it.
<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*
I used the code below but it removes all the integers
questionText = questionText.replace(/<span\b.*?>/ig, "");
questionText=questionText.replace(/<\/span>/ig, "");
questionText = questionText.replace(/\d+/g,"");
questionText = questionText.replace("*","");
questionText = questionText.replace(". ",""); i want to remove the first integer, and need to keep the rest of the integers

It's the third line .replace(/\d+/g,"") which is replacing the integers. If you want to keep the integers, then don't replace \d+, because that matches one or more digits.
You could achieve most of that all on one line, by the way - there's no need to have multiple replaces there:
var questionText = questionText.replace(/((<span\b.*?>)|(<\/span>)|(\d+))/ig, "");
That would do the same as the first three lines of your code. (of course, you'd need to drop the |(\d+) as per the first part of the answer if you didn't want to get rid of the digits.
[EDIT]
Re your comment that you want to replace the first integer but not the subsequent ones:
The regex string to do this would depend very heavily on what the possible input looks like. The problem is that you've given us a bit of random HTML code; we don't know from that whether you're expecting it to always be in this precise format (ie a couple of spans with contents, followed by a bit at the end to keep). I'll assume that this is the case.
In this case, a much simpler regex for the whole thing would be to replace eveything within <span....</span> with blank:
var questionText = questionText.replace(/(<span\b.*?>.*?<\/span>)/ig, "");
This will eliminate the whole of the <span> tags plus their contents, but leave anything outside of them alone.
In the case of your example this would provide the desired effect, but as I say, it's hard to know if this will work for you in all cases without knowing more about your expected input.
In general it's considered difficult to parse arbitrary HTML code with regex. Regex is a contraction of "Regular Expressions", which is a way of saying that they are good at handling strings which have 'regular' syntax. Abitrary HTML is not a 'regular' syntax due to it's unlimited possible levels of nesting. What I'm trying to say here is that if you have anything more complex than the simple HTML snippets you've supplied, then you may be better off using a HTML parser to extract your data.

This will match the complete string and put the part after the last </span> till the next word boundary \b into the capturing group 1. You just need to replace then with the group 1, i.e. $1.
searched_string = string.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");
The captured word can consist of [A-Za-z0-9]. If you want to have anything else there just add it into that group.

We Keep Coding

JavaScript is the programming language of the Web.

Javascript - how to use regex process the following complicated string - javascript

Related

Regex for finding element tagname and attributes "skips" attributes

How can I write a regular expression that matches everything between the first and the last quote?

Splitting a string at question mark, exclamation mark, or period in javascript and retain those marks?

What Regex would capture both the beginning and end from of a string?

Trying to remove trailing text

Categories

Resources