Split By 2 String Delimiters via RegEx Javascript - javascript

I am relatively new to RegEx and need help splitting up a section of text by two delimiters, and then capturing what's between them. For instance, I have a JSON object that has a "description" property, and it essentially contains this text:
This is yet another test product to test for all sorts of styling/formatting issues.
PLACEHOLDER
PLACEHOLDER
PLACEHOLDER
******
Not so expensive
Works all day
Works throughout the night
No risk
100% Guarantee
------
This would be where ingredients would go for Test Product 4.
What I am trying to do is to capture the "bulleted" list between the six-asterisk delimiter and the 6-dash delimiter. So in the above example, I want to be able to capture these lines:
Not so expensive
Works all day
Works throughout the night
No risk
100% Guarantee
Thus, I need to split by two delimiters:
1) ******
2) ------
After searching around this and other forums, I still can't seem to get any form of .split() method and RegEx to get precisely what I want. Is anyone able to help me out here? The ultimate goal is to be able to capture that text and then append it to a <div>

No? Really?
string.split(/------|\*\*\*\*\*\*/);

Related

Regex: How to (better) optimize text from/for messages

Since this question does not contain a specific question on regex but more on it's design/approach, it might take a while to understand the requirements and their dependencies. I have done everything I can to make it as easy as possible with this fully working yet not elegant solution(deadlink).
I need to optimize text in a messaging platform that is being created/edited by others and may have to be sanitized with regex. All optimizations need to be done with one single regex, since these happen often and are quite expensive (or am I wrong on this?). Furthermore the regex needs to be language-agnostic (at least compatible with Javascript and Php). Last but not least, the optimized text must not contain (additional) Html as it is used in a text-only environment.
Requirements
Optimize lines
Remove single lines
Do not remove single lines that end with two|no spaces (thus allow editors to force a newline)
Do not remove empty lines (double line breaks)
Do not remove single lines that start with symbol|char|digit|entity+space (raw lists)
Condense multiple consecutive empty lines (double line breaks) to one double line break
Optimize spaces
Remove excess spaces
Do not remove spaces at the end of a sentence
Optimize comments
Remove single line comments
Do not remove trailing comments
Overall
Preserve Html and do not add Html
Intermediate solution
So far, my solution is to combine 4 regexes which 'match' my requirements and get replaced by a single space:
Matches single lines while leaving empty lines intact and preserving raw lists: \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][\.|\)|\:|\*]) (the length is due to several list-style-types I want to support)
Matches excess empty lines: (\n+)(?=\n\n)
Matches excess spaces: +
Matches single line comments (while ignoring trailing comments): ^\n?\/\/ .+\n
To make the optimization rather inexpensive, I concatenate them with | to one single regex which I can use in Javascript (as well as Php).
r = new RegExp(" \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n", "gm");
i = document.getElementById("input").innerHTML;
p = " ";
o = i.replace(r, p);
document.getElementById("output").innerHTML = o;
#input, #output { width: 100%; height: 88vh; }
#input { display: none; } #output { border: none; }
<textarea id="input">
MAKE PARAGRAPHS
This is the first paragraph.
Some sentences end with newlines.
Some don't. We need to cope with that.
This is the second paragraph.
It contains some unnecessary spaces.
Even at the end of a line.
This is the third paragraph.
Some sentences end with question- and exclamation-marks.
I hope that is ok for you. Is it? That's great! Really.
KEEP LISTS
This is an unordered list, starting with a minus+space:
- This is the first item.
- This is the second item.
- This is the third item.
Here is an unordered list, starting with entity|symbol+space:
• This is the second item.
> This is the third item. // Works in php only
* This is the fifth item.
This is a (manually) ordered lists, starting with char|digit+entity+space:
1. This is the first item.
b) This is the second item.
3: This is the third item.
Here is a mathematical list, starting with operators:
+ Plus
- Minus
% Percentage
/ Division
* Multiply
~ Like
= Equal
These are (manually) ordered lists, which are not summed up because they do not end with a space:
1 This is the first item.
b This is the second item.
I like the third item.
First: This works.
Second: It works great.
Third: That is nice!
KEEP HTML
The input text may contain Html.
The output text must simply keep it for further processing.
The output must not add Html as it is processed in a text-only environment.
I know this sounds stupid, but it isn't.
REMOVE COMMENTS
Single/whole line comments are being removed.
// Sources
// Removing single lines: https://regex101.com/r/qU1eP8/5
// Removing comments: https://www.perlmonks.org/?node_id=996552
// Tests
// Dialog: https://api.sefzig.net/dialog/test/regex/
// Jsbin: https://jsbin.com/goromad/edit?output
// Regex101: https://regex101.com/r/Xz5atA/2
// Regexr: https://regexr.com/45svm
Thank you, regex ♥ // Problem solved
~Fin~
</textarea>
<textarea id="output"><!-- Press "Run" --></textarea>
My request
Since I am not a regex-expert and my approach feels cumbersome, I'd like to hear your suggestions. I know Regex is expensive and everything can be done better.
You might wonder about a few details I haven't mentioned here for the sake of clarity. You also might want to test my Regexes. This is why I have set up a sandbox, isolating the requirements (Regexes), containing an example text with all use-cases as well as a detailed description:
https://api.sefzig.net/dialog/test/regex/(deadlink)
In case you want to use the features of great tools out there, here you go:
Regexr: https://regexr.com/45svm
Regex101: https://regex101.com/r/Xz5atA/2
Jsbin: https://jsbin.com/goromad/edit?output
Thank you
for helping me straighten this important feature of my messaging platform! Please feel free to enhance my approach, suggest an alternative or use the results in your own project ♥
This is my first question on stack overflow. I have researched a lot. Please bear with me if I have done anything wrong and help me fix that.

How to remove all of string up to and including hyphen

I am using javascript in a Mirth transformer. I apologize for my ignorance but I have no javascript training and have a hard time successfully utilizing info from similar threads here or elsewhere. I am trying to trim a string from 'Room-Bed' to be just 'Bed'. The "Room" and "Bed" values will fluctuate. All associated data is coming in from an ADT interface where our client is sending both the room and bed values, separated by a hyphen, in the bed field creating unnecessary redundancy. Please help me with the code needed to produce the result of 'Bed' from the received 'Room-Bed'.
There are many ways to reduce the string you have to the string you want. Which you choose will depend on your inputs and the output you want. For your simple example, all will work. But if you have strings come in with multiple hyphens, they'll render different results. They'll also have different performance characteristics. Balance the performance of it with how often it will be called, and whichever you find to be most readable.
// Turns the string in to an array, then pops the last instance out: 'Bed'!
'Room-Bed'.split('-').pop() === 'Bed'
// Looks for the given pattern,
// replacing the match with everything after the hyphen.
'Room-Bed'.replace(/.+-(.+)/, '$1') === 'Bed'
// Finds the first index of -,
// and creates a substring based on it.
'Room-Bed'.substr('Room-Bed'.indexOf('-') + 1) === 'Bed'

How to detect and remove unwanted lines from a string?

I am working on a project in which i have to extract text data from a PDF.
I am able to extract text from the PDF, but extracted text sometimes contains lines which i would like to strip off from it.
Here's and example of unwanted lines -
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And, here's an example of good line (which i'd like to keep) -
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can give out different unwanted lines.
How can i detect them ?
Option 1 - Give the computer a rule: If you are able to narrow down what content it is that you would like to keep, the obvious criteria that sticks out to me is the exclusion of special characters, then you can filter your results based on this.
So let's say you agree that all "good lines" will be without special characters ('/', '-', and '=') for example, if a line DOES contain one of these items, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if-then condition that looks something like this..
var lineArray = //code needed to make each line of the file an element of the array
For (cnt = 0; cnt < totalLines; cnt++)
{
var line = lineArray[cnt];
if (line.contains("/") || line.contains("-") || line.contains("="))
lineArray[cnt] = "";
}
At the end of this code you could simply get all the text within the array and it would no longer contain the unwanted lines. If there are unwanted lines however, that are virtually indistinguishable by characters, length, positioning etc. the previous approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
Option 2- Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation based on what you have shown us, an open source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
Side note- As it appears your sample text is not just sentences but literature, sometimes characters will speak in sentence fragments as part of their nature given by the author. In this case, you could add a separate condition that if the text is contained within two quotations and has no special characters, you want to keep the text regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.

RegEx to get ALL Strings between two Strings

I seem to have a love/hate relationship with RegEx in that I love how incredibly powerful it is, but at the same time, I don't quite understand all of the nuances of it yet.
I've got rather lengthy JSON feed that I need to parse and capture ALL of the matches between two specific strings. I've included a link to the regex101.com example with a few of the JSON results.
regex101.com Example
I'm trying to match every string between each /content/usergenerated and /jcr:content
...
I guess what I should really be trying to match is a string that starts with /content/webAppName/en/home and ends before /jcr:content
The path that I care about will always start with /content/webAppName/en/home
you have to use "positive look-ahead" that match a sequence of digits if they are followed by something
https://regex101.com/r/fU1iD1/4
Just wrap the two things you're looking to remove in parenthesis, and then remove them from the output. So...
(\/content\/usergenerated)(.*)(\/jcr\:content)
replaced by
/2
Which is everything in the middle of those two.
edit: Sorry, didn't look at your example :) - there was a deleted answer that said to add the g modifier, which looks like it works.
/content/usergenerated/content/webAppName/en/home([a-zA-Z/-]+)/jcr:content
This should work. It matches 3 out of 4 don't know why it doesn't match one of em. You could use exec() in a loop till it returns null and get hold of the object[1] which contains data for the first and only capture group.
all the best.
PS: I used gmi in options for the regex.

match text between two html custom tags but not other custom tags

I have something like the following;-
<--customMarker>Test1<--/customMarker>
<--customMarker key='myKEY'>Test2<--/customMarker>
<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>
I need to be able to replace text between the customMarker tags, I tried the following;-
str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, 'item Replaced')
which works ok. I would like to also ignore custom inner tags and not match or replace them with text.
Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Text2.
Many thanks
EDIT
actually I am trying to find things between comment tags but the comment tags were not displaying correctly so I had to remove the '!'. There's a unique situation that required comment tags... in anycase if anyone knows enough regex to help, it would be great. thank u.
In the end, I did something like the following (incase anyone else needs this. enjoy!!! But note: Word about town is that using regex with html tags is not ideal, so do your own research and make up your mind. For me, it had to be done this way, mostly bcos i wanted to, but also bcos it simplified the job in this instance);-
var retVal = str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, function(token, match){
//question 1: I would like to also ignore custom inner tags and not match or replace them with text.
//answer:
var replacePattern = /<--customInnerMarker*?(.*?)<--\/customInnerMarker-->/g;
//remove inner tags from match
match = $.trim(match.replace(replacePattern, ''));
//replace and return what is left with a required value
return token.replace(match, objParams[match]);
//question 2: Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Text2.
//answer
var attrPattern = /\w+\s*=\s*".*?"/g;
attrMatches = token.match(attrPattern);//returns a list of attributes as name/value pairs in an array
})
Can't you use <customMarker> instead? Then you can just use getElementsByTagName('customMarker') and get the inner text and child elements from it.
A regex merely matches an item. Once you have said match, it is up to you what you do with it. This is part of the problem most people have with using regular expressions, they try and combine the three different steps. The regex match is just the first step.
What you are asking for will not be possible with a single regex. You're going to need a mini state machine if you want to use regular expressions. That is, a logic wrapper around the matches such that it moves through each logical portion.
I would advise you look in the standard api for a prebuilt engine to parse html, rather than rolling your own. If you do need to do so, read the flex manual to get a basic understanding of how regular expressions work, and the state machines you build with them. The best example would be the section on matching multiline c comments.

Categories