Javascript Fetch - Interpret header as UTF-8? [duplicate] - javascript

I'm using a "fun" HTML special-character (✰)(see http://html5boilerplate.com/ for more info) for a Server HTTP-header and am wondering if it is "allowed" per spec.
Using the Network Tab in the dev tools in Chrome on Windows Xp Pro SP 3 I see the ✰ just fine.
In IE8 the ✰ is not rendered correctly.
The w3.org HTML validator does not render it correctly (displays "â°" instead).
Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)
Is the disparity caused by bugs in the different parsers/browsers/engines/(whatever-they-are-called)?
Is there a spec for this or maybe a list of allowed characters for an HTTP-header "value"?

In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but are not supposed to be displayable.
HTTPbis gave up and specified that in the headers there is no useful encoding besides ASCII:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
Previously, RFC 2616 from 1999 defined this:
Words of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
and RFC 2047 is the MIME encoding, so it'd be:
=?UTF-8?Q?=E2=9C=B0?=
but I don't think that many (if any) clients support it.
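For illustration only, here is a minimal sketch of how such an RFC 2047 "Q" encoded-word could be produced in JavaScript. The function name qEncodeUtf8 is made up for this example, every byte is escaped (which the encoding permits but does not require), and the 75-character limit on an encoded-word is not handled:

```javascript
// Hypothetical helper: encode a string as one RFC 2047 "Q" encoded-word.
// Assumption: every UTF-8 byte is written as =XX, even plain ASCII ones.
function qEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the input
  const escaped = Array.from(
    bytes,
    (b) => "=" + b.toString(16).toUpperCase().padStart(2, "0")
  );
  return "=?UTF-8?Q?" + escaped.join("") + "?=";
}

console.log(qEncodeUtf8("\u2730")); // the star character from the question
```

Whether a given client decodes this is a separate matter, as noted above.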

Please read comments first, this answer likely draws wrong conclusions from the right sources, needs edit.
You can use any printable ASCII chars, and no special chars like ✰ (Which is not ASCII)
Tip: you can encode anything in JSON.
Edit: may not be obvious at first, the character encoding defined in the header only applies for the response body, not for the header itself. (As it would cause a chicken-&-egg problem.)
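If a server does send UTF-8 bytes in a header anyway, a client that exposes each header byte as one character (Latin-1 style) will show mojibake like the validator output in the question. A rough sketch of undoing that in JavaScript, assuming that is what happened to the value; reinterpretLatin1AsUtf8 is a name invented for this example:

```javascript
// Hypothetical helper: treat each character of a byte string as one byte
// and decode those bytes as UTF-8.
function reinterpretLatin1AsUtf8(value) {
  const bytes = Uint8Array.from(value, (ch) => {
    const code = ch.charCodeAt(0);
    if (code > 0xff) {
      throw new RangeError("value is not a byte string");
    }
    return code;
  });
  // fatal: true makes invalid UTF-8 throw instead of yielding U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// The three bytes E2 9C B0 seen as three separate characters:
console.log(reinterpretLatin1AsUtf8("\u00e2\u009c\u00b0"));
```

This only helps when you control both ends; it does not make the header any more "allowed".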
I'd like to sum up all the relevant definitions as per the spec linked by Penchant.
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
So, we are after field-value.
LWS = [CRLF] 1*( SP | HT )
CRLF = CR LF
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.
Let's simplify it to this:
field-value = <any field-content or Space or Tab>
Now we are after field-content.
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs,
but including LWS>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
TEXT is the most general and includes all the rest -so forget about the rest-.
Here is the US-ASCII charset (= ASCII)
As you can see, all printable ASCII chars are allowed.
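As a quick sanity check before setting a header, something like the following could be used. It is only a sketch of the "printable ASCII" rule of thumb from this answer (plus tab), not a full implementation of the grammar above, and isPortableHeaderValue is a made-up name:

```javascript
// Hypothetical check: true when the value contains only printable
// US-ASCII characters (32..126) or horizontal tab.
function isPortableHeaderValue(value) {
  return /^[\t\x20-\x7E]*$/.test(value);
}

console.log(isPortableHeaderValue("nginx/1.2 (Ubuntu)"));
console.log(isPortableHeaderValue("\u2730"));
```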

Related

JSON parse a decodedURI failed? [duplicate]

What is the difference between the JavaScript functions decodeURIComponent and decodeURI?
To explain the difference between these two let me explain the difference between encodeURI and encodeURIComponent.
The main difference is that:
The encodeURI function is intended for use on the full URI.
The encodeURIComponent function is intended to be used on .. well .. URI components that is
any part that lies between separators (; / ? : @ & = + $ , #).
So, in encodeURIComponent these separators are encoded also because they are regarded as text and not special characters.
Now back to the difference between the decode functions, each function decodes strings generated by its corresponding encode counterpart taking care of the semantics of the special characters and their handling.
encodeURIComponent/decodeURIComponent() is almost always the pair you want to use, for concatenating together and splitting apart text strings in URI parts.
encodeURI is less common, and misleadingly named: it should really be called fixBrokenURI. It takes something that's nearly a URI, but has invalid characters such as spaces in it, and turns it into a real URI. It has a valid use in fixing up invalid URIs from user input, and it can also be used to turn an IRI (URI with bare Unicode characters in) into a plain URI (using %-escaped UTF-8 to encode the non-ASCII).
Where encodeURI should really be named fixBrokenURI(), decodeURI() could equally be called potentiallyBreakMyPreviouslyWorkingURI(). I can think of no valid use for it anywhere; avoid.
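To make the "concatenating together and splitting apart" point concrete, here is a small sketch. The buildUrl helper and the example URL are invented for illustration:

```javascript
// Hypothetical helper: append a query string, encoding each key and value
// with encodeURIComponent so that &, =, + and ? inside them stay data.
function buildUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) =>
      encodeURIComponent(key) + "=" + encodeURIComponent(String(value)))
    .join("&");
  return query ? base + "?" + query : base;
}

const url = buildUrl("http://www.example.com/search", {
  q: "string with + and ? and &",
  lang: "en",
});
console.log(url);

// Splitting apart again: cut on the separators first, then decode each part.
const firstValue = url.split("?")[1].split("&")[0].split("=")[1];
console.log(decodeURIComponent(firstValue));
```

The order matters when splitting: decode only after the string has been cut on the separators.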
js> s = "http://www.example.com/string with + and ? and & and spaces";
http://www.example.com/string with + and ? and & and spaces
js> encodeURI(s)
http://www.example.com/string%20with%20+%20and%20?%20and%20&%20and%20spaces
js> encodeURIComponent(s)
http%3A%2F%2Fwww.example.com%2Fstring%20with%20%2B%20and%20%3F%20and%20%26%20and%20spaces
Looks like encodeURI produces a "safe" URI by encoding spaces and some other (e.g. nonprintable) characters, whereas encodeURIComponent additionally encodes the colon and slash and plus characters, and is meant to be used in query strings. The encoding of + and ? and & is of particular importance here, as these are special chars in query strings.
As I had the same question, but didn't find the answer here, I made some tests in order to figure out what the difference actually is.
I did this, since I need the encoding for something, which is not URL/URI related.
encodeURIComponent("A") returns "A", it does not encode "A" to "%41"
decodeURIComponent("%41") returns "A".
encodeURI("A") returns "A", it does not encode "A" to "%41"
decodeURI("%41") returns "A".
-That means both can decode alphanumeric characters, even though they did not encode them. However...
encodeURIComponent("&") returns "%26".
decodeURIComponent("%26") returns "&".
encodeURI("&") returns "&".
decodeURI("%26") returns "%26".
Even though encodeURIComponent does not encode all characters, decodeURIComponent can decode any value between %00 and %7F.
Note: It appears that if you try to decode a value above %7F (unless it's a unicode value), then your script will fail with an "URI error".
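If you cannot be sure the input is valid percent-encoded UTF-8, one option is to catch the error. A minimal sketch; tryDecodeURIComponent is a name made up here and returning null is just one possible convention:

```javascript
// Hypothetical wrapper: returns the decoded string, or null when the input
// is not valid percent-encoded UTF-8 (decodeURIComponent throws URIError).
function tryDecodeURIComponent(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    if (err instanceof URIError) {
      return null;
    }
    throw err; // anything else is unexpected
  }
}

console.log(tryDecodeURIComponent("%41"));
console.log(tryDecodeURIComponent("%E4"));    // lone byte above %7F
console.log(tryDecodeURIComponent("%C3%A4")); // the same letter as UTF-8
```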
encodeURIComponent()
Converts the input into a URL-encoded string.
encodeURI()
URL-encodes the input, but assumes a full URL is given, so returns a valid URL by not encoding the protocol (e.g. http://) and host name (e.g. www.stackoverflow.com).
decodeURIComponent() and decodeURI() are the opposite of the above
decodeURIComponent will decode URI special markers such as &, ?, #, etc, decodeURI will not.
encodeURIComponent
Not Escaped:
A-Z a-z 0-9 - _ . ! ~ * ' ( )
encodeURI()
Not Escaped:
A-Z a-z 0-9 ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI
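The two lists can be checked empirically. This sketch walks the printable ASCII range and records which characters each function leaves alone; the helper name unescapedAscii is made up for the example:

```javascript
// Hypothetical helper: collect the printable ASCII characters that the
// given encoder returns unchanged.
function unescapedAscii(encode) {
  let kept = "";
  for (let code = 0x20; code <= 0x7e; code++) {
    const ch = String.fromCharCode(code);
    if (encode(ch) === ch) {
      kept += ch;
    }
  }
  return kept;
}

const keptByEncodeURI = unescapedAscii(encodeURI);
const keptByEncodeURIComponent = unescapedAscii(encodeURIComponent);

// Characters that only encodeURI leaves unescaped (the URI separators).
const onlyEncodeURIKeeps = [...keptByEncodeURI]
  .filter((ch) => !keptByEncodeURIComponent.includes(ch))
  .join("");
console.log(onlyEncodeURIKeeps);
```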
Encode URI:
The encodeURI() method does not encode:
, / ? : @ & = + $ * #
Example
URI: https://my test.asp?name=ståle&car=saab
Encoded URI: https://my%20test.asp?name=st%C3%A5le&car=saab
Encode URI Component:
The encodeURIComponent() method also encodes:
, / ? : @ & = + $ #
Example
URI: https://my test.asp?name=ståle&car=saab
Encoded URI: https%3A%2F%2Fmy%20test.asp%3Fname%3Dst%C3%A5le%26car%3Dsaab
For more: W3Schools.com

Convert Unicode To ASCII in JavaScript

In C#, I have ö as ASCII character 148, but when I convert it like this:
char convstr = (char)148;
it returns \u0094, which is Unicode. It's "”".
If I want the ASCII code back, I just use Strings.Asc("”"), which returns 148.
So how can I get from "”" to 148 in JavaScript?
I tried this:
"”".charCodeAt(0), but it returns 8221. I think that's Unicode.
And if you don't mind, please also explain why char convstr = (char)148; returns \u0094. I am stuck there too.
=================================
148 isn't the value for ö in ASCII (ie the 7-bit US ASCII encoding that goes only up to 127) nor the commonly-referred-to-as-ASCII codepages 1252 (Windows Latin 1) and ISO/IEC 8859-1. 1252 has ” in that location while the ISO codepage has nothing. That value is used for ö only in the old DOS codepages, 437 and 865.
Windows, .NET and C# strings are Unicode natively. This page proves it: StackOverflow is an ASP.NET site. You can convert data in non-Unicode encodings easily, either through the Encoding class, or by specifying the encoding when loading data from streams with a StreamReader.
For example, this will convert the byte value 148 to ö using the 437 codepage :
var result=Encoding.GetEncoding(437).GetString(new byte[]{148});
Debug.Assert(result=="ö");
While this returns ”:
var result=Encoding.GetEncoding(1252).GetString(new byte[]{148});
The StreamReader(string,Encoding) overload and its variants can load data from files using the specified encoding, eg :
using(var reader=new StreamReader(path,Encoding.GetEncoding(437)))
{
var line=reader.ReadLine();
....
}
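On the JavaScript side of the question, a similar lookup can be sketched with TextDecoder. This assumes the runtime supports the "windows-1252" label (browsers do, and so do Node.js builds with full ICU), and toCp1252Byte is a name invented for this example:

```javascript
// Build a reverse table: decode every byte 0..255 as Windows-1252 and
// remember which byte produced which character.
const cp1252 = new TextDecoder("windows-1252");
const byteOf = new Map();
for (let b = 0; b <= 0xff; b++) {
  byteOf.set(cp1252.decode(Uint8Array.of(b)), b);
}

// Hypothetical helper: the Windows-1252 byte for a character, or undefined
// when the character has no mapping in that codepage.
function toCp1252Byte(ch) {
  return byteOf.get(ch);
}

console.log("\u201d".charCodeAt(0));  // the Unicode code unit of the quote
console.log(toCp1252Byte("\u201d"));  // its byte in codepage 1252
```

The same pattern should work for other codepages only if the runtime's TextDecoder knows their labels, which is not guaranteed for the old DOS codepages.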

Converting multiple emojis into a single character [duplicate]

I'm not sure what this is called so I'm having trouble searching for it. How can I decode a string with unicode from http\u00253A\u00252F\u00252Fexample.com to http://example.com with JavaScript? I tried unescape, decodeURI, and decodeURIComponent so I guess the only thing left is string replace.
EDIT: The string is not typed, but rather a substring from another piece of code. So to solve the problem you have to start with something like this:
var s = 'http\\u00253A\\u00252F\\u00252Fexample.com';
I hope that shows why unescape() doesn't work.
Edit (2017-10-12):
@MechaLynx and @Kevin-Weber note that unescape() is deprecated from non-browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use the below instead:
decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'
Original answer:
unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
> 'http://example.com'
You can offload all the work to JSON.parse
UPDATE: Please note that this is a solution that should apply to older browsers or non-browser platforms, and is kept alive for instructional purposes. Please refer to @radicand's answer below for a more up to date answer.
This is a unicode, escaped string. First the string was escaped, then encoded with unicode. To convert back to normal:
var x = "http\\u00253A\\u00252F\\u00252Fexample.com";
var r = /\\u([0-9a-f]{4})/gi; // \u followed by exactly four hex digits
x = x.replace(r, function (match, grp) {
  return String.fromCharCode(parseInt(grp, 16));
});
console.log(x); // http%3A%2F%2Fexample.com
x = unescape(x);
console.log(x); // http://example.com
To explain: I use a regular expression to look for \u0025. However, since I need only a part of this string for my replace operation, I use parentheses to isolate the part I'm going to reuse, 0025. This isolated part is called a group.
The gi part at the end of the expression denotes it should match all instances in the string, not just the first one, and that the matching should be case insensitive. This might look unnecessary given the example, but it adds versatility.
Now, to convert from one string to the next, I need to execute some steps on each group of each match, and I can't do that by simply transforming the string. Helpfully, the String.replace operation can accept a function, which will be executed for each match. The return of that function will replace the match itself in the string.
I use the second parameter this function accepts, which is the group I need to use, and transform it into the character with that hexadecimal code, then use the built-in unescape function to decode the string to its proper form.
Note that the use of unescape() is deprecated and doesn't work with the TypeScript compiler, for example.
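A variant of the same idea that avoids unescape() could look like the sketch below. It assumes the percent-encoded part is valid UTF-8, since decodeURIComponent throws otherwise; decodeUnicodeEscapes is a made-up name:

```javascript
// Hypothetical helper: replace each \uXXXX escape (exactly four hex digits)
// with the character it denotes.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

const raw = "http\\u00253A\\u00252F\\u00252Fexample.com";
const percentEncoded = decodeUnicodeEscapes(raw); // \u0025 becomes %
const decoded = decodeURIComponent(percentEncoded);
console.log(percentEncoded);
console.log(decoded);
```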
Based on radicand's answer and the comments section below, here's an updated solution:
var string = "http\\u00253A\\u00252F\\u00252Fexample.com";
decodeURIComponent(JSON.parse('"' + string.replace(/\"/g, '\\"') + '"'));
http://example.com
Using JSON.parse for this comes with significant drawbacks that you must be aware of:
You must wrap the string in double quotes
Many characters are not supported and must be escaped themselves. For example, passing any of the following to JSON.parse (after wrapping them in double quotes) will error even though these are all valid: \\n, \n, \\0, a"a
It does not support hexadecimal escapes: \\x45
It does not support Unicode code point sequences: \\u{045}
There are other caveats as well. Essentially, using JSON.parse for this purpose is a hack and doesn't work the way you might always expect. You should stick with using the JSON library to handle JSON, not for string operations.
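A few of these caveats can be reproduced with a small wrapper. This is only an illustration; parseAsJsonString is a name made up for it and returning null on failure is an arbitrary choice:

```javascript
// Hypothetical wrapper: wrap the raw text in double quotes and let
// JSON.parse interpret the escapes; null means JSON rejected it.
function parseAsJsonString(raw) {
  try {
    return JSON.parse('"' + raw + '"');
  } catch (err) {
    if (err instanceof SyntaxError) {
      return null;
    }
    throw err;
  }
}

console.log(parseAsJsonString("\\u0041"));  // four-digit escape: accepted
console.log(parseAsJsonString("\\x45"));    // hexadecimal escape: rejected
console.log(parseAsJsonString("\\u{45}"));  // code point escape: rejected
console.log(parseAsJsonString('a"a'));      // unescaped quote: rejected
```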
I recently ran into this issue myself and wanted a robust decoder, so I ended up writing one myself. It's complete and thoroughly tested and is available here: https://github.com/iansan5653/unraw. It mimics the JavaScript standard as closely as possible.
Explanation:
The source is about 250 lines so I won't include it all here, but essentially it uses the following Regex to find all escape sequences and then parses them using parseInt(string, 16) to decode the base-16 numbers and then String.fromCodePoint(number) to get the corresponding character:
/\\(?:(\\)|x([\s\S]{0,2})|u(\{[^}]*\}?)|u([\s\S]{4})\\u([^{][\s\S]{0,3})|u([\s\S]{0,4})|([0-3]?[0-7]{1,2})|([\s\S])|$)/g
Commented (NOTE: This regex matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library [ie, '\x!!' will error]):
/
\\ # All escape sequences start with a backslash
(?: # Starts a group of 'or' statements
(\\) # If a second backslash is encountered, stop there (it's an escaped slash)
| # or
x([\s\S]{0,2}) # Match valid hexadecimal sequences
| # or
u(\{[^}]*\}?) # Match valid code point sequences
| # or
u([\s\S]{4})\\u([^{][\s\S]{0,3}) # Match surrogate code points which get parsed together
| # or
u([\s\S]{0,4}) # Match non-surrogate Unicode sequences
| # or
([0-3]?[0-7]{1,2}) # Match deprecated octal sequences
| # or
([\s\S]) # Match anything else ('.' doesn't match newlines)
| # or
$ # Match the end of the string
) # End the group of 'or' statements
/g # Match as many instances as there are
Example
Using that library:
import unraw from "unraw";
let step1 = unraw('http\\u00253A\\u00252F\\u00252Fexample.com');
// yields "http%3A%2F%2Fexample.com"
// Then you can use decodeURIComponent to further decode it:
let step2 = decodeURIComponent(step1);
// yields http://example.com
I don't have enough rep to put this under comments to the existing answers:
unescape is only deprecated for working with URIs (or any encoded utf-8) which is probably the case for most people's needs. encodeURIComponent converts a js string to escaped UTF-8 and decodeURIComponent only works on escaped UTF-8 bytes. It throws an error for something like decodeURIComponent('%a9'); // error because extended ascii isn't valid utf-8 (even though that's still a unicode value), whereas unescape('%a9'); // © So you need to know your data when using decodeURIComponent.
decodeURIComponent won't work on "%C2" or any lone byte over 0x7f, because in UTF-8 such a byte is only valid as part of a multi-byte sequence. However, decodeURIComponent("%C2%A9") // gives you ©. unescape wouldn't work properly on that (unescape("%C2%A9") // Â©) AND it wouldn't throw an error, so unescape can lead to buggy code if you don't know your data.
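If you do know your data's encoding, one way to be explicit about it is to turn the %XX escapes into bytes yourself and hand them to TextDecoder. A rough sketch, assuming the input is ASCII apart from its escapes and that the runtime supports the requested encoding label; percentDecode is a made-up name:

```javascript
// Hypothetical helper: percent-decode into bytes, then decode the bytes
// with an explicitly chosen encoding (UTF-8 unless told otherwise).
function percentDecode(input, encoding = "utf-8") {
  const bytes = [];
  for (let i = 0; i < input.length; i++) {
    const hex = input.slice(i + 1, i + 3);
    if (input[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      const code = input.charCodeAt(i);
      if (code > 0x7f) {
        throw new RangeError("input must be ASCII apart from %XX escapes");
      }
      bytes.push(code);
    }
  }
  return new TextDecoder(encoding).decode(Uint8Array.from(bytes));
}

console.log(percentDecode("%C2%A9"));              // UTF-8 bytes
console.log(percentDecode("%a9", "windows-1252")); // single-byte codepage
```

Unlike decodeURIComponent, the default (non-fatal) TextDecoder substitutes U+FFFD for invalid bytes instead of throwing.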
This is not an answer to this exact question, but for those who are hitting this page via a search result and who are trying to (like I was) construct a single Unicode character given a sequence of escaped codepoints, note that you can pass multiple arguments to String.fromCodePoint() like so:
String.fromCodePoint(parseInt("1F469", 16), parseInt("200D", 16), parseInt("1F4BC", 16)) // 👩‍💼
You can of course parse your string to extract the hex codepoint strings and then do something like:
let codePoints = hexCodePointStrings.map(s => parseInt(s, 16));
let str = String.fromCodePoint(...codePoints);
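Putting the two steps together for a string of \u{...} escapes might look like this. It is a sketch that assumes every escape uses the braces form; fromEscapedCodePoints is a name invented here:

```javascript
// Hypothetical helper: pull the hex code points out of \u{...} escapes and
// build a single string from them.
function fromEscapedCodePoints(text) {
  const hexStrings = Array.from(
    text.matchAll(/\\u\{([0-9a-fA-F]{1,6})\}/g),
    (match) => match[1]
  );
  const codePoints = hexStrings.map((s) => parseInt(s, 16));
  return String.fromCodePoint(...codePoints);
}

const worker = fromEscapedCodePoints("\\u{1F469}\\u{200D}\\u{1F4BC}");
console.log(worker);
console.log([...worker].length); // number of code points in the sequence
```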
In my case, I was trying to unescape an HTML file, something like
"\u003Cdiv id=\u0022app\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022menu\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022faux_column\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022row\u0022\u003E\r\n \u003Cdiv data-v-269b6c0d class=\u0022col-md-12\u0022\u003E\r\n"
to
<div id="app">
<div data-v-269b6c0d>
<div data-v-269b6c0d class="menu">
<div data-v-269b6c0d class="faux_column">
<div data-v-269b6c0d class="row">
<div data-v-269b6c0d class="col-md-12">
Here below works in my case:
const jsEscape = (str: string) => {
return str.replace(new RegExp("'", 'g'),"\\'");
}
export const decodeUnicodeEntities = (data: any) => {
return unescape(jsEscape(data));
}
// Use it
const data = ".....";
const unescaped = decodeUnicodeEntities(data); // Unescaped html

jquery html() method does not render emoticons properly [duplicate]


Regular expression to match generic URL

I've looked all over and have yet to find a single solution to address my need for a regular expression pattern that will match a generic URL. I need to support multiple protocols (with verification), localhost and/or IP addressing, ports and query strings. Some examples:
http://localhost/mysite
https://localhost:55000
ftp://192.1.1.1
telnet://somesite/page.htm?a=1&b=2
Ideally, I'd like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.) but this is not a requirement.
(Also, for the purposes of myself and future readers, if you could explain the pattern, it would be helpful.)
Appendix B of RFC 3986/STD 0066 (Uniform Resource Identifier (URI): Generic Syntax) provides the regular expression you need:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is
the case for the query component in the above example. Therefore, we
can determine the value of the five components as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
Going in the opposite direction, we can recreate a URI reference from
its components by using the algorithm of Section 5.3.
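As a sketch of how the Appendix B expression can be used from JavaScript (the splitUri name and the returned object shape are just for illustration; the only change to the pattern is escaping the slashes for a regex literal):

```javascript
// The RFC 3986 Appendix B pattern.
const URI_PARTS =
  /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: map the numbered groups to the five components.
// Components that are absent come back as undefined.
function splitUri(uri) {
  const m = URI_PARTS.exec(uri); // every part is optional, so this always matches
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

console.log(splitUri("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
console.log(splitUri("telnet://somesite/page.htm?a=1&b=2"));
```

Note that the authority still contains any port (for example "localhost:55000"), so splitting host and port is a further step.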
As for validating a URI against a particular scheme goes, you'll need to look at the RFC(s) describing the scheme(s) in which you are interested to get the detail required to validate that a URI is valid for the scheme it purports to be. The URI scheme registry is located at http://www.iana.org/assignments/uri-schemes.html.
And even then, you're doomed to some sort of failure. Consider the file: scheme. You can't validate that it represents a valid path in the file system of the authority (unless you are the authority). The best that you can do is validate that it represents something that looks like a valid path. And even then, a windows file: url like file:///C:/foo/bar/baz/bat.txt is (would be) invalid for anything but a server running some flavor of Windows. Any server running *nix would likely choke on it (what's a drive letter anyway?).
Nicholas Carey is correct to steer you towards RFC-3986. The regex he points out will match a generic URI, but it will not validate it (and this regex is not good for picking URLs out of "the wild" - it is too loose and matches just about any string including an empty string).
Regarding the validation requirement, you may want to take a look at an article I wrote on the subject, which takes from Appendix A all the ABNF syntax definitions of all the various components and provides regex equivalents:
Regular Expression URI Validation
Regarding the subject of picking out URLs from the "wild", take a look at Jeff Atwood's "The Problem With URLs" and John Gruber's "An Improved Liberal, Accurate Regex Pattern for Matching URLs" blog posts to get a glimpse as to some of the subtle problems which can arise. Also, you may want to take a look at a project I started last year: URL Linkification - this picks out unlinked HTTP and FTP URLs from text which may already have some links.
That said, the following is a PHP function which uses a slightly modified version of the RFC-3986 "Absolute URI" regex to validate HTTP and FTP URL's (with this regex, the named host portion must not be empty). All the various components of the URI are isolated and captured into named groups which allows for easy manipulation and validation of the parts within the program code:
function url_valid($url)
{
if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
^
(?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
(?P<authority>
(?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
(?P<host>
(?P<IP_literal>
\[
(?:
(?P<IPV6address>
(?: (?:[0-9A-Fa-f]{1,4}:){6}
| ::(?:[0-9A-Fa-f]{1,4}:){5}
| (?: [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?:: [0-9A-Fa-f]{1,4}:
| (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
)
(?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
| (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
)
| (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?:: [0-9A-Fa-f]{1,4}
| (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
)
| (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
)
\]
)
| (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
| (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
)
(?::(?P<port>[0-9]*))?
)
(?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
(?:\?(?P<query> (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
(?:\#(?P<fragment> (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
$
/mx', $url, $m)) return FALSE;
switch ($m['scheme'])
{
case 'https':
case 'http':
if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
break;
case 'ftps':
case 'ftp':
break;
default:
return FALSE; // Unrecognised URI scheme. Default to FALSE.
}
// Validate host name conforms to DNS "dot-separated-parts".
if ($m['regname']) // If host regname specified, check for DNS conformance.
{
if (!preg_match('/# HTTP DNS host name.
^ # Anchor to beginning of string.
(?!.{256}) # Overall host length is less than 256 chars.
(?: # Group dot separated host part alternatives.
[0-9A-Za-z]\. # Either a single alphanum followed by dot
| # or... part has more than one char (63 chars max).
[0-9A-Za-z] # Part first char is alphanum (no dash).
[\-0-9A-Za-z]{0,61} # Internal chars are alphanum plus dash.
[0-9A-Za-z] # Part last char is alphanum (no dash).
\. # Each part followed by literal dot.
)* # Zero or more parts before top level domain.
(?: # Explicitly specify top level domains.
com|edu|gov|int|mil|net|org|biz|
info|name|pro|aero|coop|museum|
asia|cat|jobs|mobi|tel|travel|
[A-Za-z]{2}) # Country codes are exactly two alpha chars.
$ # Anchor to end of string.
/ix', $m['host'])) return FALSE;
}
$m['url'] = $url;
for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
return $m; // return TRUE == array of useful named $matches plus the valid $url.
}
The first regex validates the string as an absolute (has a non-empty host portion) generic URI. A second regex is used to validate the (named) host portion (when it is not an IP literal or IPv4 address) with regard to the DNS lookup system (where each dot-separated subdomain is 63 chars or less consisting of digits, letters and dashes, with an overall length of at most 255 chars.)
Note that the structure of this function allows easy expansion to include other schemes.
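For comparison, in JavaScript a similar "parse, then check the scheme" structure can be sketched with the built-in URL class instead of a hand-written regex. This is a different technique from the PHP function above: the URL parser follows the WHATWG URL rules and is more lenient than RFC 3986. The parseAllowedUrl name and the allow-list are assumptions for the example:

```javascript
// Assumed allow-list, mirroring the schemes the PHP function accepts.
const ALLOWED_SCHEMES = new Set(["http", "https", "ftp", "ftps"]);

// Hypothetical helper: returns the URL's parts, or null when the text does
// not parse as an absolute URL or uses a scheme outside the allow-list.
function parseAllowedUrl(text) {
  let url;
  try {
    url = new URL(text);
  } catch {
    return null; // not parseable as an absolute URL
  }
  const scheme = url.protocol.slice(0, -1); // drop the trailing ":"
  if (!ALLOWED_SCHEMES.has(scheme)) {
    return null;
  }
  // Same policy as the PHP code: reject userinfo for HTTP(S).
  if ((scheme === "http" || scheme === "https") &&
      (url.username || url.password)) {
    return null;
  }
  return {
    scheme,
    host: url.hostname,
    port: url.port,
    path: url.pathname,
    query: url.search.slice(1),
    fragment: url.hash.slice(1),
  };
}

console.log(parseAllowedUrl("https://localhost:55000"));
console.log(parseAllowedUrl("telnet://somesite/page.htm?a=1&b=2"));
```

Adding another scheme is then a matter of extending the set, plus any scheme-specific checks.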
Would this be in Perl by any chance?
Try:
use strict;
my $url = "http://localhost/test";
if ($url =~ m/^(.+):\/\/(.+)\/(.+)/) {
my $protocol = $1;
my $domain = $2;
my $dir = $3;
print "$protocol $domain $dir \n";
}
