Easy one: REGEX to select everything after ' - ' in multi line text - javascript

Apparently this is not easy! There's got to be a way using pure regex?? I just know there is....
I have found a way to select the text before the first occurrence of a hyphen on each line of a text file:
Unique Thing - Some Text
Another Thing - Some Text again
Some Thing - Some more text
But I only want the right side of the hyphen..
Anyone know a quick regex to accomplish this?
To be clear, given text above i want
Some Text
Some Text again
Some more text
Thanks ya'll
UPDATE:
Maybe it would help with an actual chunk of text. This is from the most recent live stream chat for the whitehouse press briefing Aug 2, 2017.
Hernando Arce - build the wall with solar panels,
Christmas Girl - Let's do our own quick internet poll on live chat. Ready........Good with new immigrating into the US policy he is talking about. YES or NO,
ART - AMEN,
coffeefish - Stop H1B visa corruption!,
CarollDelMuro .Arbonne - Red,
Legion - BUILD THE WALL!,
wass sabi - MAGA,
Yokoshima - I live in Florida. Speaking English isn't racist. If you've ever been to Miami, you would know why it's needed.,
Home O'DFree - NO the campaign was BUILD THE WALL,
Melissa Renee - is he on benzos,
Paid Observer - kim jung un vs Trump in basketball,
Selina Serrano - polling data,
zonnekat - aliens....,
Farrah - NFL ,
Selina Serrano - massive,
Glenda Greene - MAGA,
Christoph Schneider - who would ever come to USA when they get lower pays? Russians?,
Carolyn Hall - MAGA MAGA MAGA ,
Sandra Honeyman - Isn't limiting immigration to skilled workers going to displace more skilled American workers?,
Mike Hancock - AMERICA FIRST,
Adnan Khan - Send them back to Mars,
Paid Observer - wtf is that,
GDotcom - THIS BETTER PASS OR THERE SHALL BE HANGINGS,
Null_Mage - This man is more attractive than Sarah,
monkeygraborange - FUCK CONGRESS,
Selina Serrano - personal,
This is the text i'm testing in regex101.
^[^-]*[^ -] does not seem to work here.
I do like the few suggestions about splitting line by line and then matching; however, the chat stream is many thousands of lines. The end result of all this is counting occurrences of words. For anyone who's interested, the repo is https://github.com/archae0pteryx/yt-live-chat-scraper (I just pushed the logs from the latest press briefing).

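/[\s\S]- (.*)/g - Should do it; the text you want is in capture group 1
[\s\S] - Matches any single character, including a newline (here it consumes the one character before the hyphen)
/g - Global flag, so matching continues after the first match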
🍻

You can use a capturing group if you want to use a regex:
const r = /- (.*)/
console.log('Unique Thing - Some Text'.match(r)[1]) //'Some Text'

Try this:
.*-[ ]*
This matches everything before the hyphen, the hyphen itself, and any spaces after it.
With this pattern you can remove all the text that matches and keep the right side that you want.
UPDATE:
But if you want to capture the right side directly, you can use:
-(.*)
and take group 1 (it includes the space after the hyphen, so trim it).
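For the multi-line case in the question, the capture-group idea can be applied to the whole text in one pass. This is a minimal sketch, not the asker's code: the helper name rightSides is made up for illustration, and it assumes the separator is always a hyphen with one space on each side.

```javascript
// Hypothetical helper: return the text to the right of the first ' - ' on each line.
function rightSides(text) {
  // With the m flag, ^ and $ anchor to each line. The lazy .*? stops at the
  // first ' - ' on the line, so later hyphens stay in the captured text.
  // Lines without ' - ' are simply skipped.
  const re = /^.*? - (.*)$/gm;
  return Array.from(text.matchAll(re), (m) => m[1]);
}

const chat = [
  'Unique Thing - Some Text',
  'Another Thing - Some Text again',
  'Some Thing - Some more text',
].join('\n');

console.log(rightSides(chat));
```

Because the whole text is scanned by a single global regex, there is no need to split thousands of lines first.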

Related

Replace words in a paragraph using Javascript

I have a paragraph of some texts. I want to replace some words in that using wildcard.
The below is my Paragraph.
0.7% lower on the prospect of fresh restrictions that would deal a blow to hopes of a swift economic
recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases reported on Sunday was a shocker ,said Nicholas Mapa, ING
In this Para, I want to remove <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n
There are multiple paragraphs. But only the uncommon one is nL4N2EC04Z
All other words are common in those paragraphs .
<Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:(need_to_use_wild_card_here)</Origin>\n
I tried to replace one half.
My code
storyRef="<Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:";
storyRef.replace(storyRef," ")
But am stuck in replacing other parts.
It looks like an escaping problem. Try JSON.stringify first, so that characters such as the double quotes and the newline become literal escape sequences (\" and \n) that the regex can match.
You can also improve the regex to something like this 👇
storyRef.replace(/<Origin.*Origin>\\n/gm, ' ');
This pattern starts at <Origin and matches everything up to and including Origin>\n.
const p = JSON.stringify('0.7% lower on the prospect of fresh restrictions that would deal a blow to hopes of a swift economic recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases reported on Sunday was a shocker ,said Nicholas Mapa, ING');
const storyRef = JSON.parse(p.replace(/(<Origin.*Origin>\\n)+/gm, ' '));
console.log(storyRef);
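If the paragraph is already an ordinary string (real double quotes and a real newline character, as in the string literal above), the JSON round trip may not be needed. The sketch below is one possible alternative, not the answer's method; the helper name stripStoryRefs is made up, and it assumes the variable part of the tag never contains a < character.

```javascript
// Hypothetical helper: strip every <Origin Href="StoryRef">...</Origin> tag
// (plus one trailing newline, if present) from a paragraph.
function stripStoryRefs(paragraph) {
  // [^<]* stands in for the wildcard part (e.g. nL4N2EC04Z). It cannot run
  // past the closing tag, so several tags in one paragraph are removed one by one.
  const re = /<Origin Href="StoryRef">urn:newsml:reuters\.com:\*:[^<]*<\/Origin>\n?/g;
  return paragraph.replace(re, '');
}

const para =
  'swift economic recovery. ' +
  '<Origin Href="StoryRef">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n' +
  ' The 2,000-plus cases reported on Sunday';

console.log(stripStoryRefs(para));
```

Using a negated class instead of .* keeps the match from swallowing text between two separate tags.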

Regex javascript tuning

First steps in regex and been trying to get some values out of emails that are being fetched. So far I've achieved some (even if it's not with the best approach) but in need of some more values and can't figure out how to get them.
This is the email template that is more or less always the same:
London, 06/20/09
Mr. Tom Waits
Process ref.: CR // 1943061
Your reference: 338256
Clients' names: Mary Lamb, John Snow
We return to your contact regarding the complaint on behalf of the clients mentioned above.
We inform you that the refund process has already started, so you should receive the respective amount (375EUR) within 4/6 weeks.
Payment ref.: 2500062960.
Our compliments,(...)
WHAT I NEED:
Date after "London,"
Process ref. (2 letters + // + digits)
Your ref. number
Clients' names
Amount
Payment reference number
Note: the amount does not always come between "( )"; sometimes it's preceded by "amount", other times by "amount of", and sometimes "EUR" is separated by a space, but the needed value is always the first combination of digits in the paragraph.
WHAT I HAVE SO FAR:
(?:London,)(.*)\s+(?:Mr. Tom Waits)\s+(?:Process ref.:)\s+(....\d+)\s+(?:Your reference:)\s+(\d+)\s+(?:Clients' names:)\s+(.*)\s+
WHAT IT RETRIEVES:
Date after "London,"
Process ref. (2 letters + // + digits)
Your reference number
Clients' names
WHAT'S MISSING:
Amount
Payment reference number
Other Concerns:
I tried to exclude the "Mr" paragraph, but I think there may be a problem when they write it a little differently, for example with an extra space.
The same problem may arise if they also write the items a little differently, for example "Process reference" instead of "Process ref."
Thanks in advance.
This one's gonna be one hefty regex. I do hope you don't try this regex on a massive file but rather one singular entry (like the one you've shown in your post).
Anyway, here's the regex-
London, ([\d\/]+)[\n\w\W]*?Process ref(?:\.|erence): ([A-Z \/\d]+)[\n\w\W]*?Your ref(?:\.|erence): (\d+)[\n\w\W]*?Client(?:s'|'s) name(?:s)?: ([\w ,]+)[\n\w\W]*?amount.*?(\d+)[\n\w\W]*?Payment ref(?:\.|erence): (\d+)
This should be quite permissive, it doesn't depend on many variable (seemingly) things, apart from London, that one's a bit of a hard coding but I'm assuming it's always London.
Now, let's walk through this-
London, ([\d\/]+) - This basically matches London, DATE - where date is, well, a date, where each element of the date is separated by a /.
In this case, it matches 06/20/09 from London, 06/20/09
[\n\w\W]*? - Try to keep up with this one - I'm using it A LOT.
This matches any character at all, newlines included (\w\W alone already covers every character, so the \n is redundant), in a non-greedy way. It is used to skip over anything and everything until we reach the desired spot.
Process ref(?:\.|erence): ([A-Z \/\d]+) - Captures the process reference, which can consist of capital letters (I assume, you can change that), spaces, slashes (/), and digits
Works with both ref. and reference
In this case, it matches CR // 1943061, from Process ref.: CR // 1943061
[\n\w\W]*? - Ignore everything up until the next token
Your ref(?:\.|erence): (\d+) - Captures "your reference", which can consist of digits
Works with both ref. and reference
In this case, it matches 338256
[\n\w\W]*? - Ignore everything up until the next token
Client(?:s'|'s) name(?:s)?: ([\w ,]+) - Captures the client name(s) - modified so it supports single client names too. (check the demo). The name list can consist of word characters, spaces and a comma.
In this case, it captures Mary Lamb, John Snow, from Clients' names: Mary Lamb, John Snow
(\d+) - Capture the digits - this is a big assumption, I'm assuming that the only digits that appear after client name list, are the ones for the amount. If they aren't
[\n\w\W]*? - Ignore everything up until the next token
amount.*?(\d+) - Captures the first group of digits that appear after amount. This is a bit of an assumption, I'm assuming that amount word is actually present in that paragraph.
In this case, it captures 375, from amount (375EUR)
[\n\w\W]*? - Ignore everything up until the next token
Payment ref(?:\.|erence): (\d+) - Capture the Payment reference number, which can consist of digits
Works with both ref. and reference
In this case, it captures 2500062960, from Payment ref.: 2500062960.
Check out the demo!
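The walkthrough above can be tried directly in JavaScript. This is a minimal sketch that uses the regex from the answer unchanged; the function name parseEmail and the field names in the returned object are made up for illustration, and the email text is the sample from the question.

```javascript
const email = [
  'London, 06/20/09',
  'Mr. Tom Waits',
  'Process ref.: CR // 1943061',
  'Your reference: 338256',
  "Clients' names: Mary Lamb, John Snow",
  'We return to your contact regarding the complaint on behalf of the clients mentioned above.',
  'We inform you that the refund process has already started, so you should receive the respective amount (375EUR) within 4/6 weeks.',
  'Payment ref.: 2500062960.',
].join('\n');

// The regex from the answer, one capture group per wanted field.
const emailRe = /London, ([\d\/]+)[\n\w\W]*?Process ref(?:\.|erence): ([A-Z \/\d]+)[\n\w\W]*?Your ref(?:\.|erence): (\d+)[\n\w\W]*?Client(?:s'|'s) name(?:s)?: ([\w ,]+)[\n\w\W]*?amount.*?(\d+)[\n\w\W]*?Payment ref(?:\.|erence): (\d+)/;

// Hypothetical helper: returns null when the email does not follow the template.
function parseEmail(text) {
  const m = text.match(emailRe);
  if (!m) return null;
  const [, date, processRef, yourRef, clients, amount, paymentRef] = m;
  return {
    date,
    processRef,
    yourRef,
    clients: clients.split(/\s*,\s*/), // "Mary Lamb, John Snow" -> two names
    amount: Number(amount),
    paymentRef,
  };
}

console.log(parseEmail(email));
```

Returning null on a failed match makes it easy to log emails that deviate from the template instead of crashing on them.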

Matching multiple quotes in a sentence

I am trying to match multiple quotes inside of a single sentence, for example the line:
Hello "this" is a "test" example.
This is the regex that I am using, but I am having some problems with it:
/[^\.\?\!\'\"]{1,}[\"\'\“][^\"\'\“\”]{1,}[\"\'\“\”][^\.\?\!]{1,}[\.\?\!]/g
What I am trying to achieve with this regex is to find everything from the start of the last sentence until I hit quotes, then find the closing set and continue until one of . ? !
The sample text that I am using to test with is from Call of Cthulhu:
What seemed to be the main document was headed “CTHULHU CULT” in characters painstakingly printed to avoid the erroneous reading of a word so unheard-of. The manuscript was divided into two sections, the first of which was headed “1925—Dream and Dream Work of H. A. Wilcox, 7 Thomas St., Providence, R.I.”, and the second, “Narrative of Inspector John R. Legrasse, 121 Bienville St., New Orleans, La., at 1908 A. A. S. Mtg.—Notes on Same, & Prof. Webb’s Acct.” The other manuscript papers were all brief notes, some of them accounts of the queer dreams of different persons, some of them citations from theosophical books and magazines.
The issue comes on the line The manuscript was.... Does anyone know how to account for repeats like this? Or is there a better way?
This one ignores [.?!] inside quotes. However, in cases like Acct.” The nth, the text on both sides of the closing quote is treated as a single sentence. Probably a . is missing after the quote there.
var r = 'What seemed to be the main document was headed “CTHULHU.?! CULT” in characters painstakingly printed to avoid the erroneous reading of a word so unheard-of. The manuscript was divided into two sections, the first of which was headed “1925—Dream and Dream Work of H. A. Wilcox, 7 Thomas St., Providence, R.I.”, and the second, “Narrative of Inspector John R. Legrasse, 121 Bienville St., New Orleans, La., at 1908 A. A. S. Mtg.—Notes on Same, & Prof. Webb’s Acct.” The other manuscript papers were all brief notes, some of them accounts of the queer dreams of different persons, some of them citations from theosophical books and magazines.'
.split(/[“”]/g)
.map((x,i)=>(i%2)?x.replace(/[.?!]/g,''):x)
.join("'")
.split(/[.?!]/g)
.filter(x => x.trim()).map(x => ({
sentence: x,
quotescount: x.split("'").length - 1
}));
console.log(r);
You can use this naive pattern:
/[^"'“.!?]*(?:"[^"]*"[^"'“.!?]*|'[^']*'[^"'“.!?]*|“[^”]*”[^"'“.!?]*)*[.!?]/
details:
/
[^"'“.!?]* # all that isn't a quote or a punct that ends the sentence
(?:
"[^"]*" [^"'“.!?]*
|
'[^']*' [^"'“.!?]*
|
“[^”]*” [^"'“.!?]*
)*
[.!?]
/
If you want something more robust, you can emulate the "atomic grouping" feature, in particular if you are not sure that each opening quote has a closing quote (to prevent catastrophic backtracking):
/(?=([^"'“.!?]*))\1(?:"(?=([^"]*))\2"[^"'“.!?]*|'(?=([^']*))\3'[^"'“.!?]*|“(?=([^”]*))\4”[^"'“.!?]*)*[.!?]/
An atomic group forbids backtracking once closed. Unfortunately this feature doesn't exist in Javascript. But there's a way to emulate it using a lookahead that is naturally atomic, a capture group and a backreference:
(?>expr) => (?=(expr))\1
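As a rough illustration of the emulated atomic groups, the sketch below runs the pattern over a short two-sentence sample. The double-quoted branch is written as "[^"]*" (zero or more non-quote characters between the quotes). The name splitSentences and the sample text are made up for this example; note that an apostrophe inside a word would be read as an opening single quote, which this sketch does not handle.

```javascript
// Each (?=(...))\N pair captures inside a lookahead and then consumes the
// capture with a backreference, so the engine cannot backtrack into it.
const sentenceRe = /(?=([^"'“.!?]*))\1(?:"(?=([^"]*))\2"[^"'“.!?]*|'(?=([^']*))\3'[^"'“.!?]*|“(?=([^”]*))\4”[^"'“.!?]*)*[.!?]/g;

// Hypothetical helper: one array entry per sentence, punctuation inside
// quotes does not end a sentence.
function splitSentences(text) {
  return Array.from(text.matchAll(sentenceRe), (m) => m[0].trim());
}

const sample = 'Hello "this" is a "test" example. She asked “Why? How!” and left.';
console.log(splitSentences(sample));
```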

Ignoring carriage returns in regular expressions

I am currently attempting to parse a conversation file in Javascript. Here is an example of such a conversation.
09/05/2016, 13:11 - Joe Bloggs: Hey Jane how're you doing? 😊 what dates are you in London again? I realise that June isn't actually that far away so might book my trains down sooner than later!
09/05/2016, 13:47 - Jane Doe: Hey! I'm in london from the 12th-16th of june! Hope you can make it down :) sorry it's a bit annoying i couldn't make it there til a sunday!
09/05/2016, 14:03 - Joe Bloggs: Right I'll speak to my boss! I've just requested 5 weeks off in November/December to visit Aus so I'll see if I can negotiate some other days!
When does your uni term end in November? I'm thinking of visiting perth first then going to the east coast!
09/05/2016, 22:32 - Jane Doe: Oh that'll be awesome if you come to aus! Totally understand if it's too hard for you to request more days off in june.
I finish uni early November! So should definitely be done by then if you came here
09/05/2016, 23:20 - Joe Bloggs: I could maybe get a couple of days 😊 when do you fly into London on the Sunday?
Perfect! I need to speak to everyone else to make sure they're about. I can't wait to visit but it's so far away!
09/05/2016, 23:30 - Jane Doe: I fly in at like 7.30am so I'll have that whole day!
I'm sure the year will fly since it's may already haha
09/05/2016, 23:34 - Joe Bloggs: Aw nice one! Even if I can get just Monday off I can get an early train on Sunday 😊
My current regular expression looks like this
/(\d{2}\/\d{2}\/\d{4}),\s(\d(?:\d)?:\d{2})\s-\s([^:]*):\s(.*?)(?=\s*\d{2}\/|$)/gm
My approach is almost there and gives me 4 groups as expected
{
"group": 1,
"value": "09/05/2016"
},
{
"group": 2,
"value": "13:11"
},
{
"group": 3,
"value": "Joe Bloggs"
},
{
"group": 4,
"value": "Hey Jane how're you doing? 😊 what dates are you in London again? I realise that June isn't actually that far away so might book my trains down sooner than later!"
}
The problem arises when a message (group 4) contains a carriage return. (see the message at line 3 in the example snippet).
I've done some research and using [\s\S] does not solve my issue. The pattern simply stops and moves onto the next occurrence.
For the third conversation the message is cut off at the carriage return.
Any help would be appreciated!
Try
(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/).*)*)
(https://regex101.com/r/sA3sB8/2) which scans to the end of the line, then uses a repeated group to first check that the new line doesn't start with \d\d/ (which is the start of a date on the next line(s)), and if it doesn't, to capture that entire line as well.
You can make the negative look-ahead a little more specific if you fear that two digits followed by a forward slash could hit any edge cases. It increases the number of steps, but would make it slightly safer.
If a user actually entered a newline followed by a date in that syntax, you might have problems as it would stop matching at that point. I doubt they would also include a comma and a 24-hour time, though, so that could be one way to handle that scenario.
Example:
09/05/2016, 23:36 - Jane Doe: Great! Let me give you my travel details:
10/01/2016 # 6am - Arrive at the station
10/01/2016 # 7am - Get run over by a drunk horse carriage (the driver and the horse were both sober; the carriage stayed up a bit late to drink)
10/01/2016 # 7:15am - Pull myself out from under the carriage and kick at its wheels vehemently.
09/05/2016, 23:40 - Joe Bloggs: Haha, sounds great.
This is just an example (with the corresponding fix of adding more specifics to the look-ahead to handle it) just to show how a user might add text that could break that particular revision of the regex.
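One way to make the look-ahead more specific is to require the full "date, time - " header before a line counts as a new message. The sketch below is an assumption about how that could look, not the exact revision from the regex101 link; the name parseChat, the field names, and the shortened chat log are made up for illustration.

```javascript
// The answer's regex with a stricter negative look-ahead: a continuation
// line is rejected only if it starts with "dd/dd/dddd, h:mm - ".
const messageRe = /(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/\d{2}\/\d{4},\s\d{1,2}:\d{2}\s-\s).*)*)/g;

// Hypothetical helper: one object per message, multi-line text kept intact.
function parseChat(log) {
  return Array.from(log.matchAll(messageRe), ([, date, time, author, text]) => ({
    date,
    time,
    author,
    text,
  }));
}

const log = [
  "09/05/2016, 14:03 - Joe Bloggs: Right I'll speak to my boss!",
  'When does your uni term end?',
  '09/05/2016, 23:36 - Jane Doe: My travel details:',
  '10/01/2016 # 6am - Arrive at the station',
  '09/05/2016, 23:40 - Joe Bloggs: Haha, sounds great.',
].join('\n');

console.log(parseChat(log));
```

The line starting with 10/01/2016 stays part of Jane's message because it lacks the comma and time that a real header carries.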

Regular Expression to get link and link description

I'm attempting to develop my own linking syntax as such:
[this is google|google.com]
I know how to get the text between the square brackets (\[(.*?)\]) but I'm not sure how to extract the individual pieces. Also, if someone wanted to simply add square brackets without a link (eg. [this is google]), it wouldn't be detected as a link.
Can anyone provide me with some direction on this? I need to access both pieces.
Here you go:
\[(.*?)\|(.*?)\]
Basically, just add another capture group. The \| escapes the | in your marvelous syntax. :)
http://regex101.com/r/dS3uJ9
Not this? Try this regex
(\[(.*?)\|?(.*?)\])
You can use;
s='[this is google|google.com]';
m = s.match(/\[([^|]+)\|([^\]]+)\]/);
Then use m[1] and m[2] for description and link.
UPDATE: To show the performance difference between the two regex approaches, I created 2 regex101 links:
http://regex101.com/r/xZ6nB7 (Using negation)
http://regex101.com/r/rI0eH8 (Using lazy quantifier .*?)
To see the performance difference, click the Launch regex debugger link at the top-left of each page.
You will notice my regex displays:
+Match 1 - finished in 11 steps
But .*? link shows this:
+Match 1 - finished in 59 steps
That shows that the negation-based regex is much more efficient than the .*? approach.
PS: If you click on +Match 1 - finished in 59 steps you will see red color BACKTRACK messages in 2nd link.
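To pull every link out of a longer text, the negation-based pattern can be run globally. This is a sketch under one assumption that differs from the answer: the first class is written as [^|\]]+ rather than [^|]+, so a plain [bracketed] phrase earlier in the text cannot be merged into a later link. The name extractLinks and the field names are made up for illustration.

```javascript
// Description: anything except | and ]. URL: anything except ].
const linkRe = /\[([^|\]]+)\|([^\]]+)\]/g;

// Hypothetical helper: returns one { description, url } object per link.
function extractLinks(text) {
  return Array.from(text.matchAll(linkRe), (m) => ({
    description: m[1],
    url: m[2],
  }));
}

console.log(extractLinks('See [plain] and [this is google|google.com].'));
```

Square brackets without a | are left alone, which matches the requirement that [this is google] should not be detected as a link.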
