Javascript Regex capture multiline content from each newline entry - javascript

I am using a JavaScript regex to process some raw data and transform it into a 2D array.
Task Briefing (JS only):
Transforming raw string data to 2D array.
Raw Data Input :
Here is a sample with 4 entries; each new entry starts on a new line. Entry 3 has multiline content.
2012/12/1, AM12:21 - user1‬: entry1_wasehhjdsaj
2012/12/2, AM9:42 - user2‬: entry2_bahbahbah_dsdeead
2012/12/2, AM9:44 - user3‬: entry3_Line1_ContdWithFollowingLine_bahbahbah
entry3_Line2_ContdWithABoveLine_bahbahbah_erererw
entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
2012/12/4, AM11:48 - user7‬: entry4_bahbahbah_fggf
(Raw string data, with no empty lines.)
Updated: Sorry for the confusion. The content does not necessarily end with a fixed END pattern, just a line break.
How does the pattern actually end? (Thanks to @Tim Pietzcker's comment.)
The content ends with a line break that is followed by the timestamp of the next entry. (You can assume the entry contents do not contain any similar timestamp pattern.)
I understand this may be a troublesome regex question, so ANY OTHER JS METHOD THAT ACHIEVES THE SAME GOAL WILL ALSO BE ACCEPTED.
My current regex with capture group:
/^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)/gm
Desired capture result:
MATCH 1
2012
12
1
A
12
21
user1‬
entry1_wasehhjdsaj
MATCH 2
2012
12
2
A
9
42
user2‬
entry2_bahbahbah_dsdeead
MATCH 3
2012
12
2
A
9
44
user3‬
entry3_Line1_ContdWithFollowingLine_bahbahbah entry3_Line2_ContdWithABoveLine_bahbahbah_erererw entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
MATCH 4
(to be skipped...)
Problem:
When I capture Entry 3, I can't capture its 2nd and 3rd lines of content. If an entry contains only ONE line of content, the regex works fine.
How can I capture Entry 3 with its multiline content? I tried the m modifier, but I have no idea how to deal with multiline contents and entries separated by newlines at the same time.
If this is impossible with a JS regex, please suggest another JS approach to transform the raw data into a 2D array, which is the ultimate goal.
THANKS!
Testing: https://regex101.com/r/eS9pY5/1

The m (multiline) flag doesn't work that way in JavaScript: it only changes where ^ and $ match and does not let . match newlines. You can work around this with [\s\S], a class that matches every character, \n included. Note the *? instead of * after it, which makes it lazy so that it only goes as far as the first END:
^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ([\s\S]*?END)$
See: https://regex101.com/r/mT8rI4/3

Dots (.) don't match newline characters. There is a character class that matches everything ([\S\s]), but you don't want to use that without precautions - otherwise [\S\s]* would match all the entries at once.
So you need to tell the regex engine to stop matching when the next match begins. We can use a negative lookahead assertion for that, and we'll just feed the timestamp pattern into that:
/^([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - ([^:]*): ((?:(?!^([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d))[\S\s])*)/gm
Test it live on regex101.com.
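To show how the lookahead idea could feed the 2D array the question asks for, here is a minimal sketch. The function name toRows is hypothetical, the timestamp inside the lookahead is a simplified, non-capturing version of the one above (an assumption, so that only 8 groups are captured), and joining continuation lines with a space follows the desired output in the question.

```javascript
// Sketch only: the header groups are the ones from the answer above; the
// lookahead uses a looser, non-capturing timestamp check (an assumption).
const ENTRY = /^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - ([^:]*): ((?:(?!^\d{2,4}\/\d{1,2}\/\d{1,2}, [AP]M\d{1,2}:\d{2} - )[\s\S])*)/gm;

// Hypothetical helper: returns one row of 8 strings per entry.
function toRows(raw) {
  return Array.from(raw.matchAll(ENTRY), (m) => {
    const fields = m.slice(1, 8); // year, month, day, A/P, hour, minute, user
    // The content group keeps the line breaks (and the trailing one),
    // so trim it and join the lines with single spaces.
    const content = m[8].trim().split(/\r?\n/).join(" ");
    return [...fields, content];
  });
}
```

Calling toRows on the raw string gives an array of arrays, one inner array per entry, with the multiline content of an entry collapsed into its last element.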

Here is a single regex that will match the strings you have the way you need:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)
See demo
The last capturing group is no longer a greedy dot matching .* but a tempered greedy token (?:(?!([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]))[\s\S])* matching everything up to the end of string or the date pattern.
If we unroll it to make it more efficient:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (\D*(?:\d(?!(?:\d{3}|\d)\/(?:0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]))\D*)*)(?=\n|$)
See another demo
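Since the question also accepts non-regex-only approaches, here is a hedged alternative sketch that is not taken from any of the answers: split the raw string into lines, start a new row whenever a line matches the entry header, and otherwise append the line to the content of the previous row. The name parseEntries is hypothetical.

```javascript
// Header pattern adapted from the question (single line, no g/m flags needed).
const ENTRY_START = /^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)$/;

// Hypothetical helper: builds the 2D array one line at a time.
function parseEntries(raw) {
  const rows = [];
  for (const line of raw.split(/\r?\n/)) {
    const m = ENTRY_START.exec(line);
    if (m) {
      // A line that starts with a timestamp opens a new entry.
      rows.push(m.slice(1));
    } else if (rows.length > 0) {
      // Any other line continues the content of the previous entry.
      const last = rows[rows.length - 1];
      last[last.length - 1] += " " + line;
    }
  }
  return rows;
}
```

This relies on the assumption stated in the question that entry contents never contain a line starting with a similar timestamp.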

Related

Javascript - how to use regex process the following complicated string

I have the following string that will occur repeatedly in a larger string:
[SM_g]word[SM_h].[SM_l] "
Notice in this string after the phrase "[SM_g]word[SM_h]" there are three components:
A period (.) This could also be a comma (,)
[SM_l]
"
Zero to all three of these components will always appear after "[SM_g]word[SM_h]". However, they can also appear in any order after "[SM_g]word[SM_h]". For example, the string could also be:
[SM_g]word[SM_h][SM_l]"
or
[SM_g]word[SM_h]"[SM_l].
or
[SM_g]word[SM_h]".
or
[SM_g]word[SM_h][SM_l].
or
[SM_g]word[SM_h].
or simply just
[SM_g]word[SM_h]
These are just some of the examples. The point is that there are three different components (more if you consider the period can also be a comma) that can appear after "[SM_g]word[SM_h]", where these three components can be in any order and sometimes one, two, or all three of the components will be missing.
Not only that, sometimes there will be up to one space between " and the previous component/[SM_g]word[SM_h].
For example:
[SM_g]word[SM_h] ".
or
[SM_g]word[SM_h][SM_l] ".
etc. etc.
I am trying to process this string by moving each of the three components inside of the core string (and preserving the space, in case there is a space between " and the previous component/[SM_g]word[SM_h]).
For example, [SM_g]word[SM_h].[SM_l]" would turn into
[SM_g]word.[SM_l]"[SM_h]
or
[SM_g]word[SM_h]"[SM_l]. would turn into
[SM_g]word"[SM_l].[SM_h]
or, to simulate having a space before "
[SM_g]word[SM_h] ".
would turn into
[SM_g]word ".[SM_h]
and so on.
I've tried several combinations of regex expressions, and none of them have worked.
Does anyone have advice?
Put each component in an alternation inside a group and, if necessary, cap the number of repetitions at 3. The core parts get their own capturing groups so that the replacement below can refer to them:
(\[SM_g]word)(\[SM_h])((?:\.|\[SM_l]| ?"){0,3})
You may replace word with .*? if it is not a constant or specific keyword.
Then in replacement string you should do:
$1$3$2
var re = /(\[SM_g]word)(\[SM_h])((?:\.|\[SM_l]| ?"){0,3})/g;
var str = `[SM_g]word[SM_h][SM_l] ".`;
console.log(str.replace(re, `$1$3$2`));
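As a hedged extension of the snippet above, the sketch below also accepts a comma in place of the period, which the question mentions but the regex above does not cover. The name moveTrailing is hypothetical, and the quote is assumed to be a literal " character.

```javascript
// Same idea as the answer's regex, with [.,] instead of \. (an assumption
// based on the question's remark that the period may also be a comma).
const TRAILING = /(\[SM_g\]word)(\[SM_h\])((?:[.,]|\[SM_l\]| ?"){0,3})/g;

// Hypothetical helper: moves the trailing components in front of [SM_h].
function moveTrailing(s) {
  return s.replace(TRAILING, "$1$3$2");
}

console.log(moveTrailing('[SM_g]word[SM_h].[SM_l]"')); // [SM_g]word.[SM_l]"[SM_h]
console.log(moveTrailing('[SM_g]word[SM_h] ".'));      // [SM_g]word ".[SM_h]
```

When no component follows [SM_h], the third group is empty and the string comes back unchanged.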
This seems applicable to your task, in other words, changing the position of a substring:
(\[SM_g])([^[]*)(\[SM_h])((?=([,\.])|(\[SM_l])|( ?&\\?quot;)).*)?
Demo, in which every substring is captured in its own group for your post-processing.
[SM_g] is captured in group 1, word in group 2, [SM_h] in group 3, the whole trailing part in group 4, [,\.] in group 5, [SM_l] in group 6, and " ?&\\?quot;" in group 7.
Thus groups 1 to 3 are the core part, group 4 is the trailing part (useful for checking whether a trailing part exists), and groups 5 to 7 are the sub-parts of group 4.
You can therefore reorder the matched string by replacing it with the captured groups in whatever order you want, for example:
\1\2\7\3 or $1$2$7$3 etc..
For replacing in JavaScript, please refer to this post: JS Regex, how to replace the captured groups only?
However, the regex above is not precise enough, because it allows any number of repetitions of the trailing sub-parts, for example \1\2\3\5\5\5\5 or \1\2\3\6\7\7\7\7\5\5\5.
To avoid this, the regex needs a condition that accepts only the possible combinations of the trailing sub-parts. See https://regex101.com/r/6aM4Pv/1/ for the possible combinations in order.
Allowing only the possible combinations makes the regex more complicated, so I leave the simplified regex above to help you understand the idea. Thank you :-)

What Regex would capture both the beginning and end from of a string?

I am trying to edit a DateTime string in a TypeScript file.
The string in question is 02T13:18:43.000Z.
I want to trim the first three characters, including the letter T, from the beginning of the string AND also the last 5 characters from the end of the string, that is .000Z, including the dot character. Essentially I want the result to look like this: 13:18:43.
From what I found the following pattern (^(.*?)T) can accomplish only the first part of the trim I require, that leaves the initial result like this: 13:18:43.000Z.
What kind of Regex pattern must I use to include the second part of the trim I have mentioned? I have tried to include the following block in the same pattern (Z000.)$ but of course it failed.
Thanks.
Any help would be appreciated.
There is no need to use regular expression in order to achieve that. You can simply use:
let value = '02T13:18:43.000Z';
let newValue = value.slice(3, -5);
console.log(newValue);
It will return 13:18:43, assuming that your string always has the same pattern. According to the documentation, slice extracts the substring from beginIndex up to (but not including) endIndex; endIndex is optional, and a negative index counts from the end of the string.
If you specifically need a regex solution, does this pattern work?
(\d{2}:)+\d{2} or simply \d{2}:\d{2}:\d{2}
It matches one or more groups of two digits followed by a colon, and then a final group of two digits.
The only disadvantage is that it doesn't validate the values, so it would accept, say, minutes greater than 59.
I left that check out because I assumed the dates come from a source where the stored data is already valid, e.g. a database.
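A minimal sketch of how that pattern could be used to pull the time out instead of trimming around it. The function name extractTime is hypothetical.

```javascript
// Hypothetical helper: returns the first hh:mm:ss group found, or null.
function extractTime(value) {
  const match = value.match(/\d{2}:\d{2}:\d{2}/);
  return match ? match[0] : null;
}

console.log(extractTime("02T13:18:43.000Z")); // 13:18:43
```

As noted above, this does not validate ranges, so a string such as 99:99:99 would be returned as is.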
Solution
This should suffice to remove both the prefix from beginning to T and postfix from . to end:
/^.*T|\..*$/g
console.log(new Date().toISOString().replace(/^.*T|\..*$/g, ''))
See the visualization on debuggex
Explanation
The section ^.*T removes all characters up to and including the last encountered T in the string.
The section \..*$ removes all characters from the first encountered . to the end of the string.
The | in between coupled with the global g flag allows the regular expression to match both sections in the string, allowing .replace(..., '') to trim both simultaneously.
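To make the role of the g flag concrete, here is a small sketch that runs the same pattern with and without it on the string from the question. The constant and function names are hypothetical.

```javascript
const TRIM_BOTH = /^.*T|\..*$/g;      // with g: both alternatives get replaced
const TRIM_FIRST_ONLY = /^.*T|\..*$/; // without g: only the first match is replaced

// Hypothetical helper wrapping the answer's replace call.
function trimTimestamp(value) {
  return value.replace(TRIM_BOTH, "");
}

console.log(trimTimestamp("02T13:18:43.000Z"));                // 13:18:43
console.log("02T13:18:43.000Z".replace(TRIM_FIRST_ONLY, "")); // 13:18:43.000Z
```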

Get content with regex in javascript

::head
line 1
line 2
line 3
::content
content 1
content 2
content 3
How do I get "head" paragraph(first part) text with regex? This is from txt file.
Unfortunately, the regex below doesn't work in JavaScript because of this: Javascript regex multiline flag doesn't work. So we have to tweak things a bit. A line break in a file shows up in JavaScript strings as \n. On Windows this includes \r, but not on Linux, so our \s* becomes more important now that we're doing this without the line-ending anchor ($). I also noticed that you don't need to specifically gather the other lines, since line breaks are being ignored anyway.
/(::head[^]*?)\n\s*\n/m
This works in testing in Chrome, so it should work for your needs.
This is a little fancy, but it should fit if it is used in conjunction with many similar properties.
/(::head.*?$^.*?$)^\s*$/m
Note that you need the /m multiline flag.
Here it is tested against your sample data http://rubular.com/r/vtflEgDdkY
First, we check for the ::head marker. That's where we start collecting information in a group with (). Then we match anything with .*, made lazy with the ? modifier. Next we find the end of the line with $ and look for more lines of data: the line start ^, then anything .*?, then the line end $. This grabs multiple lines because of the multiline flag, so it's important to use the lazy ? so that we don't grab too much data. Then we look for an empty line. Normally you just need ^$ for that, but I wanted to make sure this would work if someone had left a stray space or tab on the lines between sections, so we use \s* to grab whitespace; the * accepts "0 or more" whitespace characters. Notice we didn't include the empty line in the group () because that's not the data you care about.
For further reading on regex, I recommend http://www.regular-expressions.info/tutorial.html It's where I learned everything I know about regex.
You can use [\s\S]+::content to match everything until ::content:
const text = ...
const matches = text.match(/^([\s\S]+)::content/m)
const content = matches[1]
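A runnable variant of the same idea, as a sketch. It assumes the section markers are exactly ::head and ::content, uses a lazy [\s\S]*? so that it stops at the first ::content, and returns only the lines between the two markers. The name headSection is hypothetical.

```javascript
// Hypothetical helper: returns the text between "::head" and "::content",
// or null when the markers are not found.
function headSection(text) {
  const match = text.match(/::head\r?\n([\s\S]*?)\r?\n::content/);
  // trim() drops a blank separator line if one precedes "::content".
  return match ? match[1].trim() : null;
}

const sample = "::head\nline 1\nline 2\nline 3\n::content\ncontent 1\ncontent 2\ncontent 3";
console.log(headSection(sample));
```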

Replace Numbers with dots

I am trying to replace some ID numbers in my system with clickable numbers that open the related record. The problem is that they are sometimes in this format: 123.456.789.
When I use my regex, I can replace them and it works fine. The problem occurs when I also have IP addresses, which the regex also matches: 123.[123.123.123] (the [] indicates where it matches).
How can I prevent this behavior?
I tried something like this: /^(?!\.)([0-9]{3}\.[0-9]{3}\.[0-9]{3})(?!\.)/
I am working on "notes" in a ticket system. When the note contains only the ID or an IP, the regexp is working. When it contains more text like:
Affected IDs:
641.298.855 (this, lead)
213.794.868
948.895.285
Then it is not matching anymore on my IDs. Could you help me with this issue and explain what I am doing wrong?
Add gm modifier:
/^(?!\.)([0-9]{3}\.[0-9]{3}\.[0-9]{3})(?!\.)/gm
https://regex101.com/r/pK1fV4/2
You don't need the negative lookahead at the start: with the m modifier, ^ matches at the start of each line, so the pattern can only match an ID at the beginning of a line and there can never be two matches on a single line. The m modifier alone is enough to find the first such ID; keep g as well if every matching line in the note should be replaced, because without g only the first match in the whole string is used.
/^([0-9]{3}\.[0-9]{3}\.[0-9]{3})(?!\.)/m
For the sake of performance, you can also drop the capturing group.
/^[0-9]{3}\.[0-9]{3}\.[0-9]{3}(?!\.)/m
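A short sketch of the replacement step on a multi-line note. It uses the gm flags so that every line-initial ID is replaced; the link format /records/<id> and the name linkIds are made up for illustration and are not from the question.

```javascript
// IDs at the start of a line, not followed by another dot (which would be an IP).
const ID_AT_LINE_START = /^[0-9]{3}\.[0-9]{3}\.[0-9]{3}(?!\.)/gm;

// Hypothetical helper: wraps each ID in a link; the URL scheme is an assumption.
function linkIds(note) {
  return note.replace(ID_AT_LINE_START, (id) => `<a href="/records/${id}">${id}</a>`);
}

const note = "Affected IDs:\n641.298.855 (this, lead)\n213.794.868\n123.123.123.123";
console.log(linkIds(note));
```

The last line of the sample note is an IP address and is left untouched, because its first three groups are followed by another dot.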

Regex: Match string with substrings with the same pattern

I'm trying to match a string against a pattern, where the string can contain substrings with the same pattern.
Here's a example string:
Nicaragua [[NOTE|s|note|Congo was a member of ICCROM from 1999 and Nicaragua from 1971. Both were suspended by the ICCROM General Assembly in November 2013 having omitted to pay contributions for six consecutive calendar years (ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9).]]. Another [[link|s|url|google.com]] that might appear.
and here's the pattern:
[[display_text|code|type|content]]
So I want to get the string within the brackets, and then look for more strings matching the pattern inside the top-level one.
What I want to match is this:
[[NOTE|s|note|Congo was a member of ICCROM from 1999 and Nicaragua from 1971. Both were suspended by the ICCROM General Assembly in November 2013 having omitted to pay contributions for six consecutive calendar years (ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9).]]
1.1 [[Statutes|s|url|www.iccrom.org/about/statutes/]]
[[link|s|url|google.com]]
I was using /(\[\[.*]])/ but it matches everything up to the last ]].
The goal is to identify the matched strings and convert them to HTML elements, where |note| becomes a blockquote tag and |url| an a tag. So a blockquote tag can have a link tag inside it.
BTW, I'm using CoffeeScript to do that.
Thanks in advance.
In general, regex is not good at dealing with nested expressions. If you use greedy patterns, they'll match too much, and if you use non-greedy patterns, as @bjfletcher suggests, they'll match too little, stopping inside the outer content. The "traditional" approach here is a token-based parser, where you step through characters one by one and build an abstract syntax tree (AST) which you then reformat as desired.
One slightly hacky approach I've used here is to convert the string to a JSON string, and let the JSON parser do the hard work of converting into nested objects: http://jsfiddle.net/t09q783d/1/
function toPoorMansAST(s) {
  // Escape double quotes, as they'll cause problems otherwise. This converts them
  // to the JSON escape \u0022; the backslash is doubled so the replacement is the
  // six characters \u0022 rather than a literal double quote.
  // Backslashes and raw line breaks in s are not handled here.
  s = s.replace(/"/g, "\\u0022");
  // Transform to a JSON string: wrap in array delimiters first
  s = ('["' + s + '"]')
    // replace token starts
    .replace(/\[\[([^\|]+)\|([^\|]+)\|([^\|]+)\|/g,
      '",{"display_text":"$1","code":"$2","type":"$3","content":["')
    // replace token ends
    .replace(/\]\]/g, '"]},"');
  return JSON.parse(s);
}
This gives you an array of strings and structured objects, which you can then run through a formatter to spit out the HTML you'd like. The formatter is left as an exercise for the user :).
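For illustration, one possible shape of such a formatter is sketched below. It assumes the node layout produced above (plain strings mixed with objects carrying display_text, code, type and content), maps note to a blockquote and url to an a tag as the question describes, and uses display_text as the link text, which is an assumption. The names escapeHtml and toHtml are hypothetical.

```javascript
// Minimal HTML escaping for text nodes and attribute values.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };
  return text.replace(/[&<>"]/g, (ch) => entities[ch]);
}

// Hypothetical formatter: walks the nested array recursively.
function toHtml(nodes) {
  return nodes.map((node) => {
    if (typeof node === "string") return escapeHtml(node);
    const inner = toHtml(node.content);
    if (node.type === "note") return `<blockquote>${inner}</blockquote>`;
    // For url nodes the content is the address; display_text is the label.
    if (node.type === "url") return `<a href="${inner}">${escapeHtml(node.display_text)}</a>`;
    return inner; // unknown types fall back to their content
  }).join("");
}
```

Because a note's content is formatted by the same function, a link nested inside a note ends up inside the blockquote.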
