The Details
I have a simple textarea <textarea></textarea>
The value of this textarea is sent through ajax and stored in a database.
The value in this database is viewed on an iPad (or iPad mini or iPhone, etc)
The Problem
When someone pastes text copied from elsewhere (potentially anywhere on the internet), I want to remove any weird characters, such as “windows-1252 quotes”, from the text before storing it in a utf8_unicode_ci column in a database. The column stores these quotes, but they are not recognized on certain devices (like the iPad).
The Question
How can I remove these characters in Javascript or PHP?
string.replace has been tried from various examples to remove these characters.
htmlentities($sample) has been tried in order to convert these characters but still no luck.
Any help would be appreciated! Thanks!
Regular expressions will do this; PHP's function for this is preg_replace, and JavaScript's is simply .replace(). You can find usage snippets everywhere ;)
There are two ways to approach this using regex:
1. Define an allowed character range and strip anything that isn't in that range.
[^\w-=+()!##$%^*(] matches any character that is NOT in this set (the ^ at the beginning of the character class negates it). You can then replace the matched characters with an empty string.
Working example: http://regex101.com/r/zK2qW6
2. Define a non-allowed character range and strip anything that is in that range.
[“”] will match anything in this character range. You can then take the resulting matched characters, and again replace with an empty string. You could also use a regex unicode range here too.
Working example: http://regex101.com/r/yG4qJ4
In the end, you should choose the path which requires the smallest expression. If there's only a handful of characters to replace, use option #2. If you only want to allow a handful of characters, use option #1.
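A minimal JavaScript sketch of option #2, assuming the characters of interest are the curly single and double quotes. The helper names are hypothetical, and mapping the quotes to their ASCII equivalents (rather than deleting them) is a variation on the answer, not something it prescribes. The g flag matters: without it, .replace() only touches the first match.

```javascript
// Hypothetical helper: map curly quotes to their plain ASCII equivalents.
function normalizeQuotes(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'")  // curly single quotes -> '
    .replace(/[\u201C\u201D]/g, '"'); // curly double quotes -> "
}

// Option #2 exactly as described: strip the listed characters outright.
function stripCurlyDoubleQuotes(text) {
  return text.replace(/[\u201C\u201D]/g, "");
}

console.log(normalizeQuotes("\u201Cwindows-1252 quotes\u201D"));
console.log(stripCurlyDoubleQuotes("\u201Cwindows-1252 quotes\u201D"));
```

Writing the characters as \u escapes keeps the script itself independent of the encoding the file is saved in.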
I'm trying to match a single quote (and other characters) in a regex like this:
/^[a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð ,.'-]+$/u
The single quote sits at the end of the character class, just before the "-" character, but I cannot get the regex to match a string that contains a "'" single quote, no matter what I try. So I tried to remove the single quote with myString.replace("'", ''), but even replace() doesn't remove it.
I don't understand, because when I test this in my browser it works as intended, but not in my React Native code. Thanks if you have an answer for this weird "bug"!
Found the answer: the problem was the difference between the apostrophe character U+0027 (') and the right single quotation mark U+2019 (’), which the iOS simulator uses as its apostrophe.
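One way to handle this, sketched under the assumption that typographic apostrophes should simply be treated as U+0027 before validation. The helper name and the shortened pattern are illustrative, not taken from the original code.

```javascript
// Hypothetical helper: fold typographic single quotes into U+0027.
function normalizeApostrophes(text) {
  return text.replace(/[\u2018\u2019]/g, "'");
}

// Shortened stand-in for the long pattern in the question.
const namePattern = /^[a-zA-Z ,.'-]+$/;

const typed = "O\u2019Brien"; // as typed on the iOS keyboard
console.log(namePattern.test(typed));                       // false
console.log(namePattern.test(normalizeApostrophes(typed))); // true
```

An alternative would be to add \u2019 to the character class itself, which accepts the character without rewriting the input.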
I am parsing a series of strings with various formats. The last edge case encountered has me stumped. I'm not a great regexer, believe me it was a challenge to get to this point.
Here are critical snippets from the strings I'm trying to parse. The second example is the current edge case I'm stuck on.
LBP824NW2-58.07789x43.0-207C72
LBP824WW1-77.6875 in. x 3.00 in. 24VDC
I am trying to grab all of the digits (including the decimal) that make up the width part of the dimension in the string (this would be the first number in the dimension). What works in every other case has been grabbing all digits from the "-" to the "x" using the following expression:
/-(\d+\.?\d+?)x\B/
However, this does not handle the cases that have inches included in the dimension. I thought about using "look-aheads" or "look-behinds", but I got confused. Any suggestions would be appreciated.
RegEx can be told to look for "zero or one" of things, using (...)? syntax, so if your pattern already works but it gets confused by a new pattern that simply has "more string data embedded in what is otherwise the same pattern" you can add in zero-or-one checks and you should be good to go.
In this case, putting something like (\s*in\.?\s*)? in a few tactical places to either match "any number of spaces (including none) followed by in followed by an optional full stop followed by any number of spaces (including none)" or nothing should work.
That said, "I cannot change the formatting" is almost never an argument, because while you can't change the formatting, you can almost always change what parses it. RegEx might be adequate, but some code that checks what kind of general pattern it is, and then calls the appropriate function for tokenizing and inspecting that specific string pattern, should be quite possible. Unless you've been hired to literally update some predefined CLI script that has a grep in it and you're not allowed to touch anything except the pattern...
This is the working solution using regex: -(\d+\.?\d+?)(\s*in\.?\s*|x)
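As a quick check, the pattern above can be run against both sample strings from the question. The function name is hypothetical; the pattern is copied unchanged.

```javascript
// Pattern from the solution above; group 1 holds the width.
const widthPattern = /-(\d+\.?\d+?)(\s*in\.?\s*|x)/;

// Hypothetical helper: returns the width as a string, or null if none is found.
function extractWidth(partNumber) {
  const match = widthPattern.exec(partNumber);
  return match ? match[1] : null;
}

console.log(extractWidth("LBP824NW2-58.07789x43.0-207C72"));         // "58.07789"
console.log(extractWidth("LBP824WW1-77.6875 in. x 3.00 in. 24VDC")); // "77.6875"
```

Note that \d+\.?\d+? requires at least two digits, so a single-digit width such as "-5x" would not match.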
I have a textarea meant for plain text that users sometimes copy and paste special characters into. It becomes a problem when emoticons are used, because it's material we then need to include in PDF files.
For instance: ❤️
❤
Now my question is, how could I go about identifying such characters and removing them with Javascript as the form is validated? I don't want to be too restrictive, as many languages are allowed (Russian, Arabic, etc.). Only those symbols would need to be excluded.
Thank you
See http://crocodillon.com/blog/parsing-emoji-unicode-in-javascript. The problem is that most emoticons are in a supplementary plane (outside the Basic Multilingual Plane). JavaScript strings are sequences of UTF-16 code units, so you cannot use a normal character range for them; instead you need to work with "surrogate pairs", along the lines of
/\ud83d[\ude00-\ude4f]/
The link above has additional information on how to find and treat emoticon characters in other Unicode ranges.
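A minimal sketch of using that surrogate-pair range during validation. The function name is hypothetical, and the extra alternative for the heart symbol is an assumption added here: U+2764 (optionally followed by the variation selector U+FE0F) sits in the Basic Multilingual Plane, so the surrogate-pair range alone does not cover the example from the question.

```javascript
// Surrogate-pair range from the answer (U+1F600..U+1F64F), plus an assumed
// extra alternative for the heart example: U+2764 with optional U+FE0F.
const emoticonPattern = /\ud83d[\ude00-\ude4f]|\u2764\ufe0f?/g;

// Hypothetical helper: remove the matched symbols, leave everything else alone.
function stripEmoticons(text) {
  return text.replace(emoticonPattern, "");
}

console.log(stripEmoticons("\u041f\u0440\u0438\u0432\u0435\u0442 \ud83d\ude00!")); // Cyrillic text is kept
console.log(stripEmoticons("I \u2764\ufe0f PDF"));
```

Other symbol blocks would need their own ranges, so this list is a starting point rather than a complete filter.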
IE (<9?) doesn't tolerate trailing commas at the end of an object or list.
I learned this too late, after developing for a few months in Chrome.
Now I have to search for every place where I left a trailing comma, which is really painful.
Is there any way (preferably automatic) to do this? Like an editor plugin, or a script that searches for these commas and replaces them with nothing?
JSLint is an option, but it throws a lot of other warnings, and I have to paste in the scripts (which sometimes contain server-side template tags...).
Some examples would have been good, as would knowing which editor you use.
Notepad++ and UltraEdit both support Perl regular expression replaces with back-references.
So you could try a Perl regular expression replace, searching with the expression ,(\s*?[)\]]) and using \1 as the replace string.
This expression finds a comma before a closing round or square bracket with 0 or more spaces/tabs/line terminators in between, and the replace keeps everything except the comma.
You should run this replace manually on your JavaScript code, checking each match before replacing it. You may need to run the replace several times in case of multiple commas at the end of a list.
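If a script is preferred over an editor, the same expression can be applied from JavaScript, where the back-reference in the replacement is written $1 instead of \1. This is a rough sketch with a hypothetical function name; it repeats the replace until nothing changes, which covers the "run it several times" case.

```javascript
// Same expression as above: a comma followed by optional whitespace
// and a closing round or square bracket.
const trailingComma = /,(\s*?[)\]])/g;

// Hypothetical helper: repeat the replace until the text stops changing.
function removeTrailingCommas(source) {
  let previous;
  let current = source;
  do {
    previous = current;
    current = current.replace(trailingComma, "$1");
  } while (current !== previous);
  return current;
}

console.log(removeTrailingCommas("[1, 2, 3,\n]"));
console.log(removeTrailingCommas("[1,,]"));
```

As written, the expression does not touch a comma before a closing curly brace; adding } to the character class would presumably cover object literals too. The script also cannot tell code from string literals or comments, so the advice to review each match still applies.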
This is my string:
<link href="/post?page=4&tags=example" rel="last" title="Last Page">
From there I am trying to obtain the 4 out of that page parameter, using this regular expression:
link href="/post?page=(.*?)&tags=(.*?)" rel="last"
I will then collect the 4 from the first group; the tags parameter has a wildcard because its contents can change. However, I don't seem to be getting a match with this. Can anyone help?
And I know I shouldn't be using regex to parse HTML, but this is just a small thing and it would be a waste to import a huge module for this.
Assuming you are using a /regex literal/, you will need to escape the / in that path as \/.
Alternatively, it depends on how you are getting this string. Is it really typed that way, or is it part of an innerHTML that you are then reading out again? If that's the case, then the innerHTML won't be what you expect it to be, because the browser will "normalise" it.
If it is an innerHTML, then it'd be far easier to get the tag, then get the tag's href attribute, then regex that.
link href="/post\?page=(.*?)&tags=(.*?)" rel="last"
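You forgot the backslash before the ?. Unescaped, ? is a quantifier that makes the preceding t optional, so the pattern never matches a literal question mark.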
I think it might be better to change your capture groups to something slightly different that still catches everything up to the terminating character:
link href="/post\?page=([^&]+)&tags=([^\"]+)" rel="last"
Placing the negation character (^) first in a character class tells the regex engine "match any character EXCEPT the ones listed here". This makes it very easy to capture everything up until it hits a terminating character, such as the ampersand or the double quote. Assuming you're using PHP or Java, this should also slightly improve regex performance.
If the page parameter always comes first, try the PCRE /\?page=(\d+)/. Match group 1 will contain the page number.
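A short JavaScript sketch that tries both suggestions against the string from the question, assuming a regex literal (so the / in the path is escaped as \/). The variable and function names are illustrative.

```javascript
const link = '<link href="/post?page=4&tags=example" rel="last" title="Last Page">';

// Full pattern with the ? escaped and negated character classes as capture groups.
const fullPattern = /link href="\/post\?page=([^&]+)&tags=([^"]+)" rel="last"/;

// Short pattern, assuming only the page number is needed.
const shortPattern = /\?page=(\d+)/;

// Hypothetical helper: returns the page as a number, or null if it is absent.
function extractPage(html) {
  const match = shortPattern.exec(html);
  return match ? Number(match[1]) : null;
}

const full = fullPattern.exec(link);
console.log(full[1], full[2]);  // "4" "example"
console.log(extractPage(link)); // 4
```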