So, I've got a remotely retrieved text document that I want to split by two newline (\n) characters. For example, the popular hymn All Things Bright And Beautiful.
All Things Bright And Beautiful
Cecil F. Alexander
All things bright and beautiful,
All creatures great and small,
All things wise and wonderful,
The Lord God made them all.
Each little flower that opens,
Each little bird that sings,
He made their glowing colours,
He made their tiny wings.
The purple-headed mountain,
The river running by,
The sunset and the morning,
That brightens up the sky;
The cold wind in the winter,
The pleasant summer sun,
The ripe fruits in the garden,
He made them every one;
The tall trees in the greenwood,
The meadows for our play,
The rushes by the water,
To gather every day;
He gave us eyes to see them,
And lips that we might tell
How great is God Almighty,
Who has made all things well.
Taking a look in TextMate with invisibles on shows me two newline characters in between verses. So far so good. When I copy and paste this into irb everything works absolutely fine:
ruby-1.8.7-p334 :034 > content.split "\n\n"
=> ["All Things Bright And Beautiful\nCecil F. Alexander", "All things bright and beautiful,\nAll creatures great and small,\nAll things wise and wonderful,\nThe Lord God made them all. ", "Each little flower that opens,\nEach little bird that sings,\nHe made their glowing colours,\nHe made their tiny wings.", "The purple-headed mountain,\nThe river running by,\nThe sunset and the morning,\nThat brightens up the sky;", "The cold wind in the winter,\nThe pleasant summer sun,\nThe ripe fruits in the garden,\nHe made them every one;", "The tall trees in the greenwood,\nThe meadows for our play,\nThe rushes by the water,\nTo gather every day;", "He gave us eyes to see them,\nAnd lips that we might tell\nHow great is God Almighty,\nWho has made all things well."]
ruby-1.8.7-p334 :035 >
The problem comes the moment I go to JavaScript. In the console:
If I split by one \n it works absolutely fine. I've tried using carriage returns, newlines, passing a regex through as the delimiter and nothing seems to be working. I just cannot demystify this behaviour.
Any help would be thoroughly appreciated.
JavaScript's split works the same as Ruby's. Most likely the \n characters were changed into something else by the way you imported the text into your script.
Use this.element_content.split('') or a loop like
var s = this.element_content;
var chars = [];
for (var i = 0; i < s.length; i++) { chars.push(s.charAt(i)); }
console.log(chars);
to convert your string into an array of characters and check what is actually being used for the line breaks.
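Try this.element_content.split(/[\n\r]{1,2}/). Note that this also splits on a single line break, and the m (multiline) flag is not needed because the pattern has no ^ or $ anchors.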
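If the remotely retrieved text turns out to use Windows (\r\n) or old Mac (\r) line endings, one option is to normalise them first and then split on blank lines. This is only a sketch: splitVerses is a hypothetical helper name, and the sample string is made up rather than taken from the real document.

```javascript
// Hypothetical helper: split text into verses on blank lines,
// tolerating \n, \r\n and \r line endings.
function splitVerses(text) {
  return text
    .replace(/\r\n?/g, "\n")      // normalise CRLF and lone CR to LF
    .split(/\n\s*\n/)             // a blank (or whitespace-only) line ends a verse
    .map((verse) => verse.trim())
    .filter((verse) => verse.length > 0);
}

// Made-up sample with Windows line endings.
const sample = "Title\r\nAuthor\r\n\r\nLine one,\r\nLine two.\r\n\r\nLine three.";
console.log(splitVerses(sample));
```

Logging JSON.stringify(text) before splitting is a quick way to see which line-break characters the string really contains.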
I have a paragraph of text, and I want to replace part of it using a wildcard.
Below is my paragraph.
0.7% lower on the prospect of fresh restrictions that would deal a blow to hopes of a swift economic
recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases reported on Sunday was a shocker ,said Nicholas Mapa, ING
In this paragraph, I want to remove <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n
There are multiple paragraphs, and the only part that differs between them is nL4N2EC04Z.
Everything else in the tag is the same in every paragraph.
<Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:(need_to_use_wild_card_here)</Origin>\n
I tried to replace one half.
My code
storyRef="<Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:";
storyRef.replace(storyRef," ")
But I am stuck on replacing the other parts.
It looks like an encoding problem. Try using JSON.stringify to make sure that characters like < and some others are not read in decoded form.
You can also improve the regex to something like this 👇
storyRef.replace(/<Origin.*Origin>\\n/gm, ' ');
This pattern starts at <Origin and matches everything up to and including Origin>\n.
const p = JSON.stringify('0.7% lower on the prospect of fresh restrictions that would deal a blow to hopes of a swift economic recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases reported on Sunday was a shocker ,said Nicholas Mapa, ING');
const storyRef = JSON.parse(p.replace(/(<Origin.*Origin>\\n)+/gm, ' '));
console.log(storyRef);
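As an alternative to the JSON.stringify round trip, a single pattern can stand in for the wildcard by matching the fixed prefix and then anything up to the closing tag. The sketch below assumes the tag always has exactly the shape shown in the question; stripStoryRefs is a hypothetical helper name. The pattern accepts both plain quotes and backslash-escaped quotes, and both a real newline and a literal \n after the tag, since it is not clear from the question which form the data arrives in.

```javascript
// Hypothetical helper (not from the original post): strips every
// <Origin Href="StoryRef">...</Origin> tag, whatever story id it carries.
function stripStoryRefs(text) {
  const pattern =
    /<Origin Href=\\?"StoryRef\\?">urn:newsml:reuters\.com:\*:[^<]*<\/Origin>(?:\\n|\n)?/g;
  return text.replace(pattern, "");
}

// Already-decoded form: plain quotes and a real newline.
const decoded =
  'economic recovery. <Origin Href="StoryRef">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases';
// Raw form: backslash-escaped quotes and a literal backslash + n.
const raw = String.raw`economic recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases`;

console.log(stripStoryRefs(decoded));
console.log(stripStoryRefs(raw));
```

The [^<]* part plays the role of the wildcard: it matches any story id because it stops at the < that opens the closing tag.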
The problem is that the string value is dynamic. Sometimes the data contains a double quote character, and then JavaScript can't read the full string value from the data.
As a workaround, I put the value in an HTML input and set the variable from the input's value.
But I ran into the same problem: if the value contains a double quote character, the variable does not receive the whole value.
The element looks like this:
<input type="hidden" id="soal7" value="Boyolali regency is located in north of Solo and east of Merapi and Merbabu Mountains. This regency has been known for its production of fresh milk for a long time. No wonder, the cow statues adorn Boyolali town.<br>
There are six main cow statues in Boyolali. They are displayed in different places.However, the size is made bigger to catch the eye.<br>
Besides decorating the town, the statues also turn out to be helpful for people from out of town to find places they are seeking in Boyolali. By mentioning the position of the statue, people can get their way easily.<br><br>" ...="" <u=""> they are seeking in Boyolali" (paragraph 3)<br>What does the underlined word refer to?">
The value should be
Boyolali regency is located in north of Solo and east of Merapi and Merbabu Mountains. This regency has been known for its production of fresh milk for a long time. No wonder, the cow statues adorn Boyolali town.
There are six main cow statues in Boyolali. They are displayed in different places.However, the size is made bigger to catch the eye.
Besides decorating the town, the statues also turn out to be helpful for people from out of town to find places they are seeking in Boyolali. By mentioning the position of the statue, people can get their way easily.
"they are seeking in Boyolali" (paragraph 3)What does the underlined word refer to?
The last part of the text contains double quotes, so the input cannot hold the whole string.
Is there any way to work around this?
You can have double quotes in strings, you just need to escape them with the \ char. Keep in mind, if the string is incoming from an api or something like that, you may need to write some code to escape them manually.
const myString = '\"Hello\"';
console.log(myString);
You can use &quot; for this:
<input type="text" value="single ' and double &quot;"/>
Here's the full snippet for your code. I changed it from type="hidden" to type="text" so you could read it.
<input type="text" id="soal7" value="Boyolali regency is located in north of Solo and east of Merapi and Merbabu Mountains. This regency has been known for its production of fresh milk for a long time. No wonder, the cow statues adorn Boyolali town.<br>There are six main cow statues in Boyolali. They are displayed in different places.However, the size is made bigger to catch the eye.<br>Besides decorating the town, the statues also turn out to be helpful for people from out of town to find places they are seeking in Boyolali. By mentioning the position of the statue, people can get their way easily.<br><br>" ...="" <u=""> they are seeking in Boyolali" (paragraph 3)<br>What does the underlined word refer to?"/>
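When the value is dynamic, the escaping can be done in code before the markup is built. The following is a minimal sketch, assuming the HTML is assembled as a string in JavaScript; escapeAttribute is a hypothetical helper name, and the sample text is shortened from the question.

```javascript
// Hypothetical helper: makes a string safe to place inside a
// double-quoted HTML attribute by replacing the special characters
// with character references.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")   // must run first so later entities are not re-escaped
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Shortened sample text containing double quotes and markup.
const question =
  '"they are seeking in Boyolali" (paragraph 3)<br>What does the underlined word refer to?';
const html = `<input type="hidden" id="soal7" value="${escapeAttribute(question)}">`;
console.log(html);
```

The browser decodes the character references again when it parses the attribute, so reading the input's value in script gives back the original text, quotes included.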
I want the user of my node.js application to write down ideas, which then get stored in a database.
So far so good, but I don't want redundant entries in that table, so I decided to check for similarity, using this one:
https://www.npmjs.com/package/string-similarity-js
Do you know a way to compare two strings by meaning? For example, I would like a high similarity score for "using public transport" vs "driving by train", a pair on which the package above performs very poorly.
To compare two strings by meaning, the strings first need to be converted to tensors (embedding vectors), and then the distance or similarity between those tensors is evaluated. Many algorithms can convert strings to tensors, and the right choice depends on the domain of interest. The Universal Sentence Encoder is a general-purpose sentence encoder that projects text into a single shared vector space. Cosine similarity can then be used to see how close two pieces of text are in meaning.
Example
Though king and kind are close in Hamming distance (they differ by only one character), they are very different in meaning. Queen and king, on the other hand, look unrelated character by character but are close in meaning. Therefore the distance (in meaning) between king and queen should be smaller than between king and kind, as demonstrated in the following snippet.
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/universal-sentence-encoder"></script>
<script>
(async() => {
const model = await use.load();
const embeddings = (await model.embed(['queen', 'king', 'kind'])).unstack()
tf.losses.cosineDistance(embeddings[0], embeddings[1], 0).print() // 0.39812755584716797
tf.losses.cosineDistance(embeddings[1], embeddings[2], 0).print() // 0.5585797429084778
})()
</script>
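To make the comparison step concrete, here is a plain JavaScript sketch of cosine similarity and the matching cosine distance (1 minus the similarity). The function names are hypothetical, and the vectors in the usage lines are made-up toy values, not real embeddings from the model.

```javascript
// Hypothetical helper: cosine similarity of two equal-length numeric arrays.
// Returns NaN if either vector is all zeros.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal ones.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}

// Toy vectors for illustration only.
console.log(cosineDistance([1, 0], [1, 0]));
console.log(cosineDistance([1, 0], [0, 1]));
```

A smaller distance means the two vectors, and so the two pieces of text they encode, are closer in meaning.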
Comparing the meaning of two strings is still an area of ongoing research. If you really want to solve the problem (or get really good performance from your language model), you should consider getting a PhD.
For an out-of-the-box solution at the time of writing: I found this GitHub repo that implements Google's BERT model and uses it to get the embeddings of two sentences. In theory, two sentences share the same meaning if their embeddings are similar.
https://github.com/UKPLab/sentence-transformers
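# the following is simplified from their README.md
import scipy.spatial.distance
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
# Corpus with example sentences
S1 = ['A man is eating a food.']
S2 = ['A man is eating pasta.']
# encode() takes a list of sentences and returns one embedding per sentence
s1_embedding = embedder.encode(S1)
s2_embedding = embedder.encode(S2)
# cosine distance between the two embeddings (smaller means more similar)
dist = scipy.spatial.distance.cdist(s1_embedding, s2_embedding, "cosine")[0]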
Example output (copied from their README.md)
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating a piece of bread. (Score: 0.8518)
A man is eating a food. (Score: 0.8020)
A monkey is playing drums. (Score: 0.4167)
A man is riding a horse. (Score: 0.2621)
A man is riding a white horse on an enclosed ground. (Score: 0.2379)
Apparently this is not easy! There's got to be a way using pure regex?? I just know there is....
I have found a way to select the text after the first occurrence of a hyphen in a text file
Unique Thing - Some Text
Another Thing - Some Text again
Some Thing - Some more text
But I only want the right side of the hyphen..
Anyone know a quick regex to accomplish this?
To be clear, given text above i want
Some Text
Some Text again
Some more text
Thanks ya'll
UPDATE:
Maybe it would help with an actual chunk of text. This is from the most recent live stream chat for the whitehouse press briefing Aug 2, 2017.
Hernando Arce - build the wall with solar panels,
Christmas Girl - Let's do our own quick internet poll on live chat. Ready........Good with new immigrating into the US policy he is talking about. YES or NO,
ART - AMEN,
coffeefish - Stop H1B visa corruption!,
CarollDelMuro .Arbonne - Red,
Legion - BUILD THE WALL!,
wass sabi - MAGA,
Yokoshima - I live in Florida. Speaking English isn't racist. If you've ever been to Miami, you would know why it's needed.,
Home O'DFree - NO the campaign was BUILD THE WALL,
Melissa Renee - is he on benzos,
Paid Observer - kim jung un vs Trump in basketball,
Selina Serrano - polling data,
zonnekat - aliens....,
Farrah - NFL ,
Selina Serrano - massive,
Glenda Greene - MAGA,
Christoph Schneider - who would ever come to USA when they get lower pays? Russians?,
Carolyn Hall - MAGA MAGA MAGA ,
Sandra Honeyman - Isn't limiting immigration to skilled workers going to displace more skilled American workers?,
Mike Hancock - AMERICA FIRST,
Adnan Khan - Send them back to Mars,
Paid Observer - wtf is that,
GDotcom - THIS BETTER PASS OR THERE SHALL BE HANGINGS,
Null_Mage - This man is more attractive than Sarah,
monkeygraborange - FUCK CONGRESS,
Selina Serrano - personal,
This is the text i'm testing in regex101.
^[^-]*[^ -] does not seem to work here.
I do like the few suggestions about splitting line by line and then matching; however, the chat stream is many thousands of lines. The end result of all this is counting occurrences of words. For anyone who's interested, the repo is https://github.com/archae0pteryx/yt-live-chat-scraper and I just pushed the logs from the latest press briefing.
/[\s\S]- (.*)/g - Should do it
[\s\S] - For matching new lines
/g - Continues matching
🍻
You can use a capturing group if you want to use a regex:
const r = /- (.*)/
console.log('Unique Thing - Some Text'.match(r)[1]) //'Some Text'
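For a whole log rather than a single line, the same capturing-group idea can be combined with the g and m flags so the pattern is applied once per line. This is a sketch only: extractMessages and countWords are hypothetical helper names, the three-line log is a small excerpt of the chat above, and the word-splitting rule (lowercase letters, digits and apostrophes) is an assumption about what should count as a word.

```javascript
// Hypothetical helper: returns the text to the right of the first " - "
// on each line, with the trailing comma from the log format removed.
function extractMessages(log) {
  const pattern = /^.*? - (.*)$/gm; // lazy left side stops at the first " - "
  return Array.from(log.matchAll(pattern), (m) => m[1].replace(/,\s*$/, ""));
}

// Hypothetical helper: counts case-insensitive word occurrences.
function countWords(messages) {
  const counts = new Map();
  for (const message of messages) {
    for (const word of message.toLowerCase().match(/[a-z0-9']+/g) || []) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

const log = [
  "Legion - BUILD THE WALL!,",
  "wass sabi - MAGA,",
  "Glenda Greene - MAGA,",
].join("\n");

const messages = extractMessages(log);
console.log(messages);
console.log(countWords(messages));
```

Lines that contain no " - " separator are simply skipped, and one pass of matchAll over the whole string avoids splitting thousands of lines by hand.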
Try this:
.*-[ ]*
Match everything before the hyphen, plus any white space after the hyphen.
With this pattern you can remove all the text that matches, leaving the right side that you want.
UPDATE:
But if you want to capture the right side directly, you can use this:
-(.*)
and take capture group 1.
I am currently attempting to parse a conversation file in Javascript. Here is an example of such a conversation.
09/05/2016, 13:11 - Joe Bloggs: Hey Jane how're you doing? 😊 what dates are you in London again? I realise that June isn't actually that far away so might book my trains down sooner than later!
09/05/2016, 13:47 - Jane Doe: Hey! I'm in london from the 12th-16th of june! Hope you can make it down :) sorry it's a bit annoying i couldn't make it there til a sunday!
09/05/2016, 14:03 - Joe Bloggs: Right I'll speak to my boss! I've just requested 5 weeks off in November/December to visit Aus so I'll see if I can negotiate some other days!
When does your uni term end in November? I'm thinking of visiting perth first then going to the east coast!
09/05/2016, 22:32 - Jane Doe: Oh that'll be awesome if you come to aus! Totally understand if it's too hard for you to request more days off in june.
I finish uni early November! So should definitely be done by then if you came here
09/05/2016, 23:20 - Joe Bloggs: I could maybe get a couple of days 😊 when do you fly into London on the Sunday?
Perfect! I need to speak to everyone else to make sure they're about. I can't wait to visit but it's so far away!
09/05/2016, 23:30 - Jane Doe: I fly in at like 7.30am so I'll have that whole day!
I'm sure the year will fly since it's may already haha
09/05/2016, 23:34 - Joe Bloggs: Aw nice one! Even if I can get just Monday off I can get an early train on Sunday 😊
My current regular expression looks like this
/(\d{2}\/\d{2}\/\d{4}),\s(\d(?:\d)?:\d{2})\s-\s([^:]*):\s(.*?)(?=\s*\d{2}\/|$)/gm
My approach is almost there and gives me 4 groups as expected
{
"group": 1,
"value": "09/05/2016"
},
{
"group": 2,
"value": "13:11"
},
{
"group": 3,
"value": "Joe Bloggs"
},
{
"group": 4,
"value": "Hey Jane how're you doing? 😊 what dates are you in London again? I realise that June isn't actually that far away so might book my trains down sooner than later!"
}
The problem arises when a message (group 4) contains a carriage return. (see the message at line 3 in the example snippet).
I've done some research and using [\s\S] does not solve my issue. The pattern simply stops and moves onto the next occurrence.
For the third conversation the message is cut off at the carriage return.
Any help would be appreciated!
Try
(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/).*)*)
(https://regex101.com/r/sA3sB8/2) which scans to the end of the line, then uses a repeated group to first check that the new line doesn't start with \d\d/ (which is the start of a date on the next line(s)), and if it doesn't, to capture that entire line as well.
You can make the negative look-ahead a little more specific if you fear that two digits followed by a forward slash could hit any edge cases. It increases the number of steps, but would make it slightly safer.
If a user actually entered a newline followed by a date in that syntax, you might have problems as it would stop matching at that point. I doubt they would also include a comma and a 24-hour time, though, so that could be one way to handle that scenario.
Example:
09/05/2016, 23:36 - Jane Doe: Great! Let me give you my travel details:
10/01/2016 @ 6am - Arrive at the station
10/01/2016 @ 7am - Get run over by a drunk horse carriage (the driver and the horse were both sober; the carriage stayed up a bit late to drink)
10/01/2016 @ 7:15am - Pull myself out from under the carriage and kick at its wheels vehemently.
09/05/2016, 23:40 - Joe Bloggs: Haha, sounds great.
This is just an example (with the corresponding fix of adding more specifics to the look-ahead to handle it) just to show how a user might add text that could break that particular revision of the regex.
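One way the more specific look-ahead could look in JavaScript is sketched below. It only treats a new line as the start of the next message when the line begins with the full "dd/mm/yyyy, hh:mm - " header, so a message line that merely starts with a date is kept. parseChat is a hypothetical helper name, and the sample chat is shortened from the example above.

```javascript
// Hypothetical helper: parses an exported chat into message objects.
// Continuation lines are folded into the message unless they start with
// a complete "dd/mm/yyyy, hh:mm - " header.
function parseChat(text) {
  const pattern =
    /(\d{2}\/\d{2}\/\d{4}),\s(\d{1,2}:\d{2})\s-\s([^:]*):\s+(.*(?:\n+(?!\n|\d{2}\/\d{2}\/\d{4},\s\d{1,2}:\d{2}\s-\s).*)*)/g;
  return Array.from(text.matchAll(pattern), (m) => ({
    date: m[1],
    time: m[2],
    sender: m[3],
    message: m[4],
  }));
}

const chat = [
  "09/05/2016, 23:36 - Jane Doe: Great! Let me give you my travel details:",
  "10/01/2016 @ 6am - Arrive at the station",
  "09/05/2016, 23:40 - Joe Bloggs: Haha, sounds great.",
].join("\n");

const parsed = parseChat(chat);
console.log(parsed.length);
console.log(parsed[0].message);
```

The stricter look-ahead costs a few more matching steps per line, in exchange for not cutting a message short at a line that happens to begin with two digits and a slash.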