I am trying to parse what I originally suspected was a JSON config file from a server.
After some attempts I was able to navigate and collapse the sections within Notepad++ when I selected the formatter as JavaScript.
However, I am stuck on how to convert or parse this data to JSON or another format; no online tool has been able to help.
How can I parse this text? Ideally I would use PowerShell, but Python is also an option if I can figure out how to even begin the conversion.
For example, I am trying to parse out each of the servers, i.e. test1, test2, test3, and get the data listed within each block.
Here is a sample of the config file format:
servername {
store {
servers {
* {
value<>
port<>
folder<C:\windows>
monitor<yes>
args<-T -H>
xrg<store>
wysargs<-t -g -b>
accept_any<yes>
pdu_length<23622>
}
test1 {
name<test1>
port<123>
root<c:\test>
monitor<yes>
}
test2 {
name<test2>
port<124>
root<c:\test>
monitor<yes>
}
test3 {
name<test3>
port<125>
root<c:\test>
monitor<yes>
}
}
senders
timeout<30>
}
}
Here's something that converts the above config file into a dict/JSON in Python. I'm just doing some regex substitutions, as @zett42 suggested.
import re
import json
lines = open('configfile', 'r').read()
# Put quotes around keys that open a block: name { becomes "name": {
lines2 = re.sub(r'([a-zA-Z\d_*]+)\s?{', r'"\1": {', lines)
# Process k<v> as Key, Value pairs
lines3 = re.sub(r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"', lines2)
# Process single key word on the line as Key, value pair with empty value
lines4 = re.sub(r'^\s*([a-zA-Z\d_*]+)\s*$', r'"\1": ""', lines3, flags=re.MULTILINE)
# Replace the newline after a line ending in " with a comma
lines5 = re.sub(r'"\n', '",', lines4)
# Remove the comma before the closing bracket
lines6 = re.sub(r',\s*}', '}', lines5)
# Remove quotes from numerical values
lines7 = re.sub(r'"(\d+)"', r'\1', lines6)
# Strip spaces and tabs, keeping a space that directly precedes '-' (as in "-T -H")
lines8 = re.sub(r'[ \t\r\f]+(?!-)', '', lines7)
# Add commas after closing brackets when needed
lines9 = re.sub(r'(?<=})\n(?=")', r",\n", lines8)
# Enclose in brackets and escape backslash for json parsing
lines10 = '{' + lines9.replace('\\', '\\\\') + '}'
j = json.JSONDecoder().decode(lines10)
Edit:
Here's an alternative that may be a little cleaner
# Replace line with just key with key<>
lines2 = re.sub(r'^([^{<>}]+)$', r'\1<>', lines, flags=re.MULTILINE)
# Remove spaces not within <>
lines3 = re.sub(r'\s(?!.*?>)|\s(?![^<]+>)', '', lines2, flags=re.MULTILINE)
# Quotations
lines4 = re.sub(r'([^{<>}]+)(?={)', r'"\1":', lines3)
lines5 = re.sub(r'([^:{<>}]+)<([^{<>}]*)>', r'"\1":"\2"', lines4)
# Add commas
lines6 = re.sub(r'(?<=")"(?!")', ',"', lines5)
lines7 = re.sub(r'}(?!}|$)', '},', lines6)
# Remove quotes from numbers
lines8 = re.sub(r'"(\d+)"', r'\1', lines7)
# Escape \
lines9 = '{' + re.sub(r'\\', r'\\\\', lines8) + '}'
I've come up with an even simpler solution than my previous one which uses PowerShell code only.
Using the RegEx alternation operator | we combine all token patterns into a single pattern and use named subexpressions to determine which one has actually matched.
The rest of the code is structurally similar to the C#/PS version.
using namespace System.Text.RegularExpressions
$ErrorActionPreference = 'Stop'
Function ConvertFrom-ServerData {
[CmdletBinding()]
param (
[Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
)
begin {
# Key can consist of anything except whitespace and < > { }
$keyPattern = '[^\s<>{}]+'
# Order of the patterns is important
$pattern = (
"(?<IntKey>$keyPattern)\s*<(?<IntValue>\d+)>",
"(?<TrueKey>$keyPattern)\s*<yes>",
"(?<FalseKey>$keyPattern)\s*<no>",
"(?<StrKey>$keyPattern)\s*<(?<StrValue>.*?)>",
"(?<ObjectBegin>$keyPattern)\s*{",
"(?<ObjectEnd>})",
"(?<KeyOnly>$keyPattern)",
"(?<Invalid>\S+)" # any non-whitespace sequence that didn't match the valid patterns
) -join '|'
}
process {
# Output is an ordered hashtable
$curObject = $outputObject = [ordered] @{}
# A stack is used to keep track of nested objects.
$stack = [Collections.Stack]::new()
# For each pattern match
foreach( $match in [RegEx]::Matches( $InputObject, $pattern, [RegexOptions]::Multiline ) ) {
# Get the RegEx groups that have actually matched.
$matchGroups = $match.Groups.Where{ $_.Success -and $_.Name.Length -gt 1 }
$key = $matchGroups[ 0 ].Value
switch( $matchGroups[ 0 ].Name ) {
'ObjectBegin' {
$child = [ordered] @{}
$curObject[ $key ] = $child
$stack.Push( $curObject )
$curObject = $child
break
}
'ObjectEnd' {
if( $stack.Count -eq 0 ) {
Write-Error -EA Stop "Parse error: Curly braces are unbalanced. There are more '}' than '{' in config data."
}
$curObject = $stack.Pop()
break
}
'IntKey' {
$value = $matchGroups[ 1 ].Value
$intValue = 0
$curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
break
}
'TrueKey' {
$curObject[ $key ] = $true
break
}
'FalseKey' {
$curObject[ $key ] = $false
break
}
'StrKey' {
$value = $matchGroups[ 1 ].Value
$curObject[ $key ] = $value
break
}
'KeyOnly' {
$curObject[ $key ] = $null
break
}
'Invalid' {
Write-Warning "Invalid token at index $($match.Index): $key"
break
}
}
}
if( $stack.Count -gt 0 ) {
Write-Error "Parse error: Curly braces are unbalanced. There are more '{' than '}' in config data."
}
$outputObject # Implicit output
}
}
Usage example:
$sampleData = @'
test-server {
store {
servers {
* {
value<>
port<>
folder<C:\windows> monitor<yes>
args<-T -H>
xrg<store>
wysargs<-t -g -b>
accept_any<yes>
pdu_length<23622>
}
test1 {
name<test1>
port<123>
root<c:\test>
monitor<yes>
}
test2 {
name<test2>
port<124>
root<c:\test>
monitor<yes>
}
test3 {
name<test3>
port<125>
root<c:\test>
monitor<yes>
}
}
senders
timeout<30>
}
}
'@
# Call the parser
$objects = $sampleData | ConvertFrom-ServerData
# Uncomment to verify the whole result
#$objects | ConvertTo-Json -Depth 10
# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.'test-server'.store.servers.GetEnumerator().ForEach{
"[ SERVER: $($_.Key) ]"
# Convert server values hashtable to PSCustomObject for better output formatting
[PSCustomObject] $_.Value | Format-List
}
Output:
[ SERVER: * ]
value :
port :
folder : C:\windows
monitor : True
args : -T -H
xrg : store
wysargs : -t -g -b
accept_any : True
pdu_length : 23622
[ SERVER: test1 ]
name : test1
port : 123
root : c:\test
monitor : True
[ SERVER: test2 ]
name : test2
port : 124
root : c:\test
monitor : True
[ SERVER: test3 ]
name : test3
port : 125
root : c:\test
monitor : True
Notes:
I have further relaxed the regular expressions. Keys may now consist of any character except whitespace, <, >, { and }.
Line breaks are no longer required. This is more flexible but you can't have strings with embedded > characters. Let me know if this is a problem.
I have added detection of invalid tokens, which are output as warnings. Remove the "(?<Invalid>\S+)" line, if you want to ignore invalid tokens instead.
Unbalanced curly braces are detected and reported as an error.
You can see how the RegEx works and get explanations at RegEx101.
C# version:
I've since created a faster C# version of the parser.
It requires PowerShell 7+. It can be imported using Add-Type -Path FileName.cs and called like [zett42.ServerDataParser]::Parse($text).
It doesn't use RegEx. It is based on ReadOnlySpan<char> and uses only simple string operations. In the benchmarks I did it was about 10x faster than a C# version that used RegEx and about 60x faster than the PS-only version.
Edit: I've since come up with a much simpler, PowerShell-only solution, which I recommend using.
I'll keep this answer alive as it might still be useful for other scenarios. Also, there are probably differences in performance (I haven't measured).
MYousefi already posted a helpful answer with a Python implementation.
For PowerShell, I've come up with a solution that works without a convert-to-JSON step. Instead, I've adapted and generalized the RegEx-based tokenizer code from Jack Vanlightly (also see related blog post). A tokenizer (aka lexer) splits and categorizes the elements of the input text and outputs a flat stream of tokens (categories) and related data. A parser can use these as input to create a structured representation of the input text.
The tokenizer is written in generic C# and can be used for any input that can be split using RegEx. The C# code is included in PowerShell using the Add-Type command, so no C# compiler is required.
The parser function ConvertFrom-ServerData is written in PowerShell for simplicity. You only use the parser directly, so you don't have to know anything about the tokenizer C# code. If you want to adapt the code to different input, you should only have to modify the PowerShell parser code.
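To make the tokenizer/parser split concrete, here is a minimal, language-neutral sketch written in JavaScript. It is not the C#/PowerShell code from this answer: the names tokenize and parse, the token type names, and the single combined pattern are assumptions made for illustration, and all values are kept as strings (no int or bool conversion).

```javascript
// Hypothetical sketch: tokenize() emits a flat token stream, parse() builds nested objects.
function tokenize(text) {
  // Alternatives, in order: key<value> | key { | } | bare key
  const pattern = /([^\s<>{}]+)\s*<([^>\n]*)>|([^\s<>{}]+)\s*\{|(\})|([^\s<>{}]+)/g;
  const tokens = [];
  for (const m of text.matchAll(pattern)) {
    if (m[1] !== undefined) tokens.push({ type: 'pair', key: m[1], value: m[2] });
    else if (m[3] !== undefined) tokens.push({ type: 'begin', key: m[3] });
    else if (m[4] !== undefined) tokens.push({ type: 'end' });
    else tokens.push({ type: 'keyOnly', key: m[5] });
  }
  return tokens;
}

function parse(tokens) {
  const root = {};
  const stack = []; // parents of the object currently being filled
  let current = root;
  for (const token of tokens) {
    switch (token.type) {
      case 'begin': {
        const child = {};
        current[token.key] = child;
        stack.push(current);
        current = child;
        break;
      }
      case 'end':
        if (stack.length === 0) throw new Error("Unbalanced braces: unexpected '}'");
        current = stack.pop();
        break;
      case 'pair':
        current[token.key] = token.value;
        break;
      case 'keyOnly':
        current[token.key] = null;
        break;
    }
  }
  if (stack.length > 0) throw new Error("Unbalanced braces: missing '}'");
  return root;
}

const sample = 'servers {\n  test1 {\n    port<123>\n  }\n  senders\n}';
console.log(JSON.stringify(parse(tokenize(sample))));
// {"servers":{"test1":{"port":"123"},"senders":null}}
```

The parser never looks at raw text; it only reacts to token types, which is why adapting to a different input format mostly means changing the token patterns.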
Save the following file in the same directory as the PowerShell script:
"RegExTokenizer.cs":
// Generic, precedence-based RegEx tokenizer.
// This code is based on https://github.com/Vanlightly/DslParser
// from Jack Vanlightly (https://jack-vanlightly.com).
// Modifications:
// - Interface improved for ease-of-use from PowerShell.
// - Return all groups from the RegEx match instead of just the value. This simplifies parsing of key/value pairs by requiring only a single token definition.
// - Some code simplifications, e. g. replacing "for" loops by "foreach".
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
namespace DslTokenizer {
public class DslToken<TokenType> {
public TokenType Token { get; set; }
public GroupCollection Groups { get; set; }
}
public class TokenMatch<TokenType> {
public TokenType Token { get; set; }
public GroupCollection Groups { get; set; }
public int StartIndex { get; set; }
public int EndIndex { get; set; }
public int Precedence { get; set; }
}
public class TokenDefinition<TokenType> {
private Regex _regex;
private readonly TokenType _returnsToken;
private readonly int _precedence;
public TokenDefinition( TokenType returnsToken, string regexPattern, int precedence ) {
_regex = new Regex( regexPattern, RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.Compiled );
_returnsToken = returnsToken;
_precedence = precedence;
}
public IEnumerable<TokenMatch<TokenType>> FindMatches( string inputString ) {
foreach( Match match in _regex.Matches( inputString ) ) {
yield return new TokenMatch<TokenType>() {
StartIndex = match.Index,
EndIndex = match.Index + match.Length,
Token = _returnsToken,
Groups = match.Groups,
Precedence = _precedence
};
}
}
}
public class PrecedenceBasedRegexTokenizer<TokenType> {
private List<TokenDefinition<TokenType>> _tokenDefinitions = new List<TokenDefinition<TokenType>>();
public PrecedenceBasedRegexTokenizer() {}
public PrecedenceBasedRegexTokenizer( IEnumerable<TokenDefinition<TokenType>> tokenDefinitions ) {
_tokenDefinitions = tokenDefinitions.ToList();
}
// Easy-to-use interface as alternative to constructor that takes an IEnumerable.
public void AddTokenDef( TokenType returnsToken, string regexPattern, int precedence = 0 ) {
_tokenDefinitions.Add( new TokenDefinition<TokenType>( returnsToken, regexPattern, precedence ) );
}
public IEnumerable<DslToken<TokenType>> Tokenize( string lqlText ) {
var tokenMatches = FindTokenMatches( lqlText );
var groupedByIndex = tokenMatches.GroupBy( x => x.StartIndex )
.OrderBy( x => x.Key )
.ToList();
TokenMatch<TokenType> lastMatch = null;
foreach( var match in groupedByIndex ) {
var bestMatch = match.OrderBy( x => x.Precedence ).First();
if( lastMatch != null && bestMatch.StartIndex < lastMatch.EndIndex ) {
continue;
}
yield return new DslToken<TokenType>(){ Token = bestMatch.Token, Groups = bestMatch.Groups };
lastMatch = bestMatch;
}
}
private List<TokenMatch<TokenType>> FindTokenMatches( string lqlText ) {
var tokenMatches = new List<TokenMatch<TokenType>>();
foreach( var tokenDefinition in _tokenDefinitions ) {
tokenMatches.AddRange( tokenDefinition.FindMatches( lqlText ).ToList() );
}
return tokenMatches;
}
}
}
Parser function written in PowerShell:
$ErrorActionPreference = 'Stop'
Add-Type -TypeDefinition (Get-Content $PSScriptRoot\RegExTokenizer.cs -Raw)
Function ConvertFrom-ServerData {
[CmdletBinding()]
param (
[Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
)
begin {
# Define the kind of possible tokens.
enum ServerDataTokens {
ObjectBegin
ObjectEnd
ValueInt
ValueBool
ValueString
KeyOnly
}
# Create an instance of the tokenizer from "RegExTokenizer.cs".
$tokenizer = [DslTokenizer.PrecedenceBasedRegexTokenizer[ServerDataTokens]]::new()
# Define a RegEx for each token where 1st group matches key and 2nd matches value (if any).
# To resolve ambiguities, most specific RegEx must come first
# (e. g. ValueInt line must come before ValueString line).
# Alternatively pass a 3rd integer parameter that defines the precedence.
$tokenizer.AddTokenDef( [ServerDataTokens]::ObjectBegin, '^\s*([\w*]+)\s*{' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ObjectEnd, '^\s*}\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ValueInt, '^\s*(\w+)\s*<([+-]?\d+)>\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool, '^\s*(\w+)\s*<(yes|no)>\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ValueString, '^\s*(\w+)\s*<(.*)>\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::KeyOnly, '^\s*(\w+)\s*$' )
}
process {
# Output is an ordered hashtable
$outputObject = [ordered] @{}
$curObject = $outputObject
# A stack is used to keep track of nested objects.
$stack = [Collections.Stack]::new()
# For each token produced by the tokenizer
$tokenizer.Tokenize( $InputObject ).ForEach{
# $_.Groups[0] is the full match, which we discard by assigning to $null
$null, $key, $value = $_.Groups.Value
switch( $_.Token ) {
([ServerDataTokens]::ObjectBegin) {
$child = [ordered] @{}
$curObject[ $key ] = $child
$stack.Push( $curObject )
$curObject = $child
break
}
([ServerDataTokens]::ObjectEnd) {
$curObject = $stack.Pop()
break
}
([ServerDataTokens]::ValueInt) {
$intValue = 0
$curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
break
}
([ServerDataTokens]::ValueBool) {
$curObject[ $key ] = $value -eq 'yes'
break
}
([ServerDataTokens]::ValueString) {
$curObject[ $key ] = $value
break
}
([ServerDataTokens]::KeyOnly) {
$curObject[ $key ] = $null
break
}
}
}
$outputObject # Implicit output
}
}
Usage example:
$sampleData = @'
servername {
store {
servers {
* {
value<>
port<>
folder<C:\windows>
monitor<yes>
args<-T -H>
xrg<store>
wysargs<-t -g -b>
accept_any<yes>
pdu_length<23622>
}
test1 {
name<test1>
port<123>
root<c:\test>
monitor<yes>
}
test2 {
name<test2>
port<124>
root<c:\test>
monitor<yes>
}
test3 {
name<test3>
port<125>
root<c:\test>
monitor<yes>
}
}
senders
timeout<30>
}
}
'@
# Call the parser
$objects = $sampleData | ConvertFrom-ServerData
# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.servername.store.servers.GetEnumerator().ForEach{
"[ SERVER: $($_.Key) ]"
# Convert server values hashtable to PSCustomObject for better output formatting
[PSCustomObject] $_.Value | Format-List
}
Output:
[ SERVER: * ]
value :
port :
folder : C:\windows
monitor : True
args : -T -H
xrg : store
wysargs : -t -g -b
accept_any : True
pdu_length : 23622
[ SERVER: test1 ]
name : test1
port : 123
root : c:\test
monitor : True
[ SERVER: test2 ]
name : test2
port : 124
root : c:\test
monitor : True
[ SERVER: test3 ]
name : test3
port : 125
root : c:\test
monitor : True
Notes:
If you pass input from Get-Content to the parser, make sure to use parameter -Raw, e. g. $objects = Get-Content input.cfg -Raw | ConvertFrom-ServerData. Otherwise the parser would try to parse each input line on its own.
I've opted to convert "yes"/"no" values to bool, so they output as "True"/"False". Remove the line $tokenizer.AddTokenDef( 'ValueBool', ... to parse them as string instead and output as-is.
Keys without values <> (the "senders" in the example) are stored as keys with value $null.
The regular expressions require each value to be on a single line (as the sample data suggests). This allows values to contain embedded > characters without the need to escape them.
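The trade-off between line-anchored and free-form value matching can be seen in a small JavaScript sketch. The key filter and its value are made up for illustration; the first pattern mirrors the shape of the ValueString definition above, the second uses a lazy, unanchored value like the PowerShell-only version.

```javascript
// Hypothetical input line with a '>' inside the value.
const line = 'filter<size > 10>';

// Line-anchored and greedy: the value runs to the last '>' on the line.
const anchored = /^\s*(\w+)\s*<(.*)>\s*$/m.exec(line);

// Lazy and unanchored: the value stops at the first '>'.
const lazy = /(\w+)\s*<(.*?)>/.exec(line);

console.log(anchored[2]); // "size > 10"
console.log(lazy[2]);     // "size "
```

Anchoring to the end of the line is what lets the greedy pattern keep embedded > characters, at the cost of requiring one key/value pair per line.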
Someone recently asked if there was a simple way to transform custom markup as follows, including nested markings. Examples included...
for \k[hello] the output will be <b>hello</b>
for \i[world], the output will be <em>world</em>
for hello \k[dear \i[world]], the output will be hello <b>dear <em>world</em></b>
for \b[some text](url), the output will be <a href="url">some text</a>
for \r[some text](url), the output will be <img alt="some text" src="url" />
Interestingly enough, transpiling the above to javascript, including consideration for nesting, is remarkably straightforward, particularly if the markup grammar is consistent.
//
// Define the syntax and translation to javascript.
//
const grammar = {
syntax: {
k: {markUp: `\k[`, javascript: `"+grammar.oneArg("k","`, pre: `<b>`, post: `</b>`},
i: {markUp: `\i[`, javascript: `"+grammar.oneArg("i","`, pre: `<em>`, post: `</em>`},
b: {markUp: `\b[`, javascript: `"+grammar.twoArgs("b","`, pattern: `<a href="$2">$1</a>`},
r: {markUp: `\r[`, javascript: `"+grammar.twoArgs("r","`, pattern: `<img alt="$1" src="$2"/>`},
close0: {markUp: `](`, javascript: `","`},
close1: {markUp: `)`, javascript: `")+"`},
close2: {markUp: `]`, javascript: `")+"`}
},
oneArg: function( command, arg1 ) {
return grammar.syntax[ command ].pre + arg1 + grammar.syntax[ command ].post;
},
twoArgs: function( command, arg1, arg2 ) {
return grammar.syntax[ command ].pattern.split( `$1` ).join( arg1 ).split( `$2` ).join( arg2 );
}
}
function transpileAndExecute( markUpString ) {
// Convert the markUp to javascript.
for ( const command in grammar.syntax ) {
markUpString = markUpString.split( grammar.syntax[ command ].markUp ).join( grammar.syntax[ command ].javascript );
}
// With the markUp now converted to javascript, let's execute it!
return new Function( `return "${markUpString}"` )();
}
var markUpTest = `Hello \k[dear \i[world!]] \b[\i[Search:] \k[Engine 1]](http://www.google.com) \r[\i[Search:] \k[Engine 2]](http://www.yahoo.com)`;
console.log( transpileAndExecute( markUpTest ) );
Note that there are obviously pre-processing issues that must also be addressed, such as how to handle the inclusion of tokens in normal text. Eg, including a ']' as part of a text string will throw the transpiler a curve ball, so enforcing a rule such as using '\]' to represent a ']', and then replacing all such occurrences of '\]' with innocuous text before transpiling and then re-replacing afterwards solves this problem simply...
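The replace-then-restore idea for escaped brackets could be sketched as follows. The helper names and the placeholder character are assumptions for illustration, not part of the transpiler above, and the sketch assumes the markup string still contains its backslashes (for example because it was read from a file or a text field).

```javascript
// Hypothetical helpers. PLACEHOLDER is an assumed private-use character that
// must never occur in real markup.
const PLACEHOLDER = '\uE000';

// Swap every escaped "\]" for the placeholder before transpiling.
function protectEscapes(markUp) {
  return markUp.split('\\]').join(PLACEHOLDER);
}

// Put a literal "]" back once the transpiled code has produced its output.
function restoreEscapes(output) {
  return output.split(PLACEHOLDER).join(']');
}

const source = 'closing bracket: \\] and \\k[bold]';
const safe = protectEscapes(source);
console.log(safe.includes('\\]'));  // false
console.log(restoreEscapes(safe)); // closing bracket: ] and \k[bold]
```

In use, the transpile step would sit between the two calls, so the grammar only ever sees structural ] characters.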
In terms of transpiling, using the grammar defined above, the following markup...
Hello \k[dear \i[world!]] \b[\i[Search:] \k[Engine 1]](http://www.google.com) \r[\i[Search:] \k[Engine 2]](http://www.yahoo.com)
...is transpiled to...
"Hello "+grammar.oneArg("k","dear "+grammar.oneArg("i","world!")+"")+" "+grammar.twoArgs("b",""+grammar.oneArg("i","Search:")+" "+grammar.oneArg("k","Engine 1")+"","http://www.google.com")+" "+grammar.twoArgs("r",""+grammar.oneArg("i","Search:")+" "+grammar.oneArg("k","Engine 2")+"","http://www.yahoo.com")+""
...and once executed as a javascript Function, results in...
Hello <b>dear <em>world!</em></b> <a href="http://www.google.com"><em>Search:</em> <b>Engine 1</b></a> <img alt="<em>Search:</em> <b>Engine 2</b>" src="http://www.yahoo.com"/>
The real challenge, though, is the handling of syntax errors, particularly if one has large amounts of markup to transpile. The crystal clear answer by CertainPerformance (see Find details of SyntaxError thrown by javascript new Function() constructor ) provides a means of capturing the line number and character number of a syntax error from a dynamically compiled javascript function, but I am not quite sure of the best means of mapping a syntax error in the transpiled code back to the original markup.
Eg, if an extra ']' is out of place (after "Goodbye")...
Hello World! \b[\i[Goodbye]]] \k[World!]]
...this transpiles to...
"Hello World! "+grammar.twoArgs("b",""+grammar.oneArg("i","Goodbye")+"")+"")+" "+grammar.oneArg("k","World!")+"")+""
^
...and CertainPerformance's checkSyntax function returns "Error thrown at: 1:76", as expected, marked above with the "^".
The question is, how to map this back to the original markup to aid in narrowing down the error in the markup? (Obviously in this case, it's simple to see the error in the markup, but if one has pages of markup being transpiled, then assistance in narrowing down the syntax error is a must.) Maintaining a map between the markup and the transpiled code seems tricky, as the transpiler is mutating the markup to javascript code step-by-step as it walks the grammar transformation matrix. My gut tells me there's a simpler way... Thanks for looking.
I would suggest you write a syntax checker, kind of like jsonlint or jslint, that checks whether everything is opened and closed properly before actually compiling the text to human-readable text.
This allows for debugging, prevents malformed code from running haywire, and lets you provide an error-highlighting document editor for people editing the text.
Below is a proof of concept which just checks whether brackets are closed properly.
var grammarLint = function(text) {
var nestingCounter = 0;
var isCommand = char => char == '\\';
var isOpen = char => char == '[';
var isClose = char => char == ']';
var lines = text.split('\n');
for(var i = 0; i < lines.length; i++) {
text = lines[i];
for(var c = 0; c < text.length; c++) {
var char = text.charAt(c);
if(isCommand(char) && isOpen(text.charAt(c+2))) {
c += 2;
nestingCounter++;
continue;
}
if(isClose(char)) {
nestingCounter--;
if(nestingCounter < 0) {
throw new Error('Command closed but not opened on line '+(i+1)+' char '+(c+1));
}
}
}
}
if(nestingCounter > 0) {
throw new Error(nestingCounter + ' Unclosed command brackets found');
}
}
text = 'Hello World! \\b[\\i[Goodbye]]] \\k[World!]]';
try {
grammarLint(text);
}
catch(e) {
console.error(e.message);
}
text = 'Hello World! \\b[\\i[Goodbye \\k[World!]]';
try {
grammarLint(text);
}
catch(e) {
console.error(e.message);
}
I chased down a way to leverage the javascript compiler to capture syntax errors in the transpiled code and map them back to the original markup. In short, the scheme embeds comments in the transpiled code that reference positions in the markup, which provides the means of narrowing down the markup error. (One shortcoming is that the error message is really a syntax error in the transpiled code and doesn't necessarily correspond exactly to the markup error, but it gives one a fighting chance of figuring out where the markup issue lies.)
This algorithm also leverages the concepts of CertainPerformance's technique ( Find details of SyntaxError thrown by javascript new Function() constructor ) of using setTimeout to capture the syntax errors of the transpiled code. I have interspersed a javascript Promise to smooth the flow.
"use strict";
//
// Define the syntax and translation to javascript.
//
class Transpiler {
static _syntaxCheckCounter = 0;
static _syntaxCheck = {};
static _currentSyntaxCheck = null;
constructor() {
this.grammar = {
syntax: {
k: {markUp: `\k[`, javascript: `"►+grammar.oneArg("k",◄"`, pre: `<b>`, post: `</b>`},
i: {markUp: `\i[`, javascript: `"►+grammar.oneArg("i",◄"`, pre: `<em>`, post: `</em>`},
b: {markUp: `\b[`, javascript: `"►+grammar.twoArgs("b",◄"`, pattern: `<a href="$2">$1</a>`},
r: {markUp: `\r[`, javascript: `"►+grammar.twoArgs("r",◄"`, pattern: `<img alt="$1" src="$2"/>`},
close0: {markUp: `](`, javascript: `"►,◄"`},
close1: {markUp: `)`, javascript: `"►)+◄"`},
close2: {markUp: `]`, javascript: `"►)+◄"`}
},
marker: { // https://www.w3schools.com/charsets/ref_utf_geometric.asp
begMarker: `►`, // 25ba
endMarker: `◄`, // 25c4
begComment: `◆`, // 25c6
endComment: `◇`, // 25c7
fillerChar: `●` // 25cf
},
oneArg: function( command, arg1 ) {
return this.syntax[ command ].pre + arg1 + this.syntax[ command ].post;
},
twoArgs: function( command, arg1, arg2 ) {
return this.syntax[ command ].pattern.split( `$1` ).join( arg1 ).split( `$2` ).join( arg2 );
}
};
};
static transpilerSyntaxChecker(err) {
// Uncomment the following line to disable default console error message.
//err.preventDefault();
let transpiledLine = Transpiler._syntaxCheck[ Transpiler._currentSyntaxCheck ].transpiledFunction.split(`\n`)[1];
let lo = parseInt( transpiledLine.substr( transpiledLine.substr( 0, err.colno ).lastIndexOf( `●` ) + 1 ) );
let hi = parseInt( transpiledLine.substr( transpiledLine.substr( err.colno ).indexOf( `●` ) + err.colno + 1 ) );
let markUpLine = Transpiler._syntaxCheck[ Transpiler._currentSyntaxCheck ].markUp;
let errString = markUpLine.substring( lo - 40, hi + 40 ).split(`\n`).join(`↵`) + `\n`;
errString += ( `.`.repeat( lo ) + `^`.repeat( hi - lo ) ).substring( lo - 40, hi + 40 );
Transpiler._syntaxCheck[Transpiler._currentSyntaxCheck].rejectFunction( new Error(`'${ err.message }' in transpiled code, corresponding to character range ${ lo }:${ hi } in the markup.\n${ errString }`) );
window.removeEventListener('error', Transpiler.transpilerSyntaxChecker);
delete Transpiler._syntaxCheck[Transpiler._currentSyntaxCheck];
};
async transpileAndExecute( markUpString ) {
// Convert the markUp to javascript.
console.log( markUpString );
let gm = this.grammar.marker;
let markUpIndex = markUpString;
let transpiled = markUpString;
for ( let n in this.grammar.syntax ) {
let command = this.grammar.syntax[ n ];
let markUpIndexSplit = markUpIndex.split( command.markUp );
let transpiledSplit = transpiled.split( command.markUp );
if ( markUpIndexSplit.length !== transpiledSplit.length ) {
throw `Ambiguous grammar when searching for "${ command.markUp }" to replace with "${ command.javascript }".`;
}
for ( let i = 0; i < markUpIndexSplit.length; i++ ) {
if ( i === 0 ) {
markUpIndex = markUpIndexSplit[ 0 ];
transpiled = transpiledSplit[ 0 ];
} else {
let js = command.javascript.replace( gm.begMarker, gm.begComment + gm.fillerChar + markUpIndex.length + gm.endComment );
markUpIndex += gm.fillerChar.repeat( command.markUp.length );
js = js.replace( gm.endMarker, gm.begComment + gm.fillerChar + markUpIndex.length + gm.endComment );
markUpIndex += markUpIndexSplit[ i ];
transpiled += js + transpiledSplit[ i ];
}
}
};
transpiled = transpiled.split( gm.begComment ).join( `/*` );
transpiled = transpiled.split( gm.endComment ).join( `*/` );
transpiled = `/*${ gm.fillerChar }0*/"${ transpiled }"/*${ gm.fillerChar }${ markUpIndex.length + 1 }*/`;
console.log( markUpIndex );
console.log( transpiled );
let self = this;
var id = ++Transpiler._syntaxCheckCounter;
Transpiler._syntaxCheck[id] = {};
let transpiledFunction = `"use strict"; if ( run ) return\n${ transpiled.split(`\n`).join(` `) }`;
Transpiler._syntaxCheck[id].markUp = markUpString;
Transpiler._syntaxCheck[id].transpiledFunction = transpiledFunction;
//
// Here's where it gets tricky. (See "CertainPerformance's" post at
// https://stackoverflow.com/questions/35252731
// for details behind the concept.) In this implementation a Promise
// is created, which on success of the JS compiler syntax check, is resolved
// immediately. Otherwise, if there is a syntax error, the transpilerSyntaxChecker
// routine, which has access to a reference to the Promise reject function,
// calls the reject function to resolve the promise, returning the error back
// to the calling process.
//
let checkSyntaxPromise = new Promise((resolve, reject) => {
setTimeout( () => {
Transpiler._currentSyntaxCheck = id;
window.addEventListener('error', Transpiler.transpilerSyntaxChecker);
// Perform the syntax check by attempting to compile the transpiled function.
new Function( `grammar`, `run`, transpiledFunction )( self.grammar );
resolve( null );
window.removeEventListener('error', Transpiler.transpilerSyntaxChecker);
delete Transpiler._syntaxCheck[id];
});
Transpiler._syntaxCheck[id].rejectFunction = reject;
});
let result = await checkSyntaxPromise;
// With the markUp now converted to javascript and syntax checked, let's execute it!
return ( new Function( `grammar`, `run`, transpiledFunction.replace(`return\n`,`return `) )( this.grammar, true ) );
};
}
Here are some sample runs with botched markup, and the corresponding console output. The following markup has an extra ]...
let markUp = `Hello World \k[Goodbye]] World`;
new Transpiler().transpileAndExecute( markUp ).then(result => console.log( result )).catch( err => console.log( err ));
...resulting in transpiled code of...
/*●0*/"Hello World "/*●12*/+grammar.oneArg("k",/*●14*/"Goodbye"/*●21*/)+/*●22*/""/*●22*/)+/*●23*/" World"/*●30*/
Note the interspersed comments, which point back to the character position in the original markup. Then, when the javascript compiler throws an error, it is trapped by transpilerSyntaxChecker which uses the embedded comments to identify the location in the markup, dumping the following results to the console...
Uncaught SyntaxError: Unexpected token )
at new Function (<anonymous>)
at markUp.html:127
Error: 'Uncaught SyntaxError: Unexpected token )' in transpiled code, corresponding to character range 22:23 in the markup.
Hello World k[Goodbye]] World
......................^
at transpilerSyntaxChecker (markUp.html:59)
Note that the Unexpected token ) message refers to the transpiled code, not the markup script, but the output points to the offending ].
Here's another sample run, in this case missing a close ]...
let markUp = `\i[Hello World] \k[\i[Goodbye] World`;
new Transpiler().transpileAndExecute( markUp ).then(result => console.log( result )).catch(err => console.log( err ));
...which produces the following transpiled code...
/*●0*/""/*●0*/+grammar.oneArg("i",/*●2*/"Hello World"/*●13*/)+/*●14*/" "/*●15*/+grammar.oneArg("k",/*●17*/""/*●17*/+grammar.oneArg("i",/*●19*/"Goodbye"/*●26*/)+/*●27*/" World"/*●34*/
...throwing the following error...
Uncaught SyntaxError: missing ) after argument list
at new Function (<anonymous>)
at markUp.html:127
Error: 'Uncaught SyntaxError: missing ) after argument list' in transpiled code, corresponding to character range 27:34 in the markup.
i[Hello World] k[i[Goodbye] World
...........................^^^^^^^
at transpilerSyntaxChecker (markUp.html:59)
Maybe not the best solution, but a lazy man's solution. Tschallacka's response has merit (ie, a custom syntax checker or using something like Jison) in performing a true syntax check against the markup, without the setTimeout / Promise complexities nor the somewhat imprecise method of using the transpiler error messages to refer to the original markup...
I was wondering if it is possible to format numbers in Javascript template strings, for example something like:
var n = 5.1234;
console.log(`This is a number: $.2d{n}`);
// -> 5.12
Or possibly
var n = 5.1234;
console.log(`This is a number: ${n.toString('.2d')}`);
// -> 5.12
That syntax obviously doesn't work, it is just an illustration of the type of thing I'm looking for.
I am aware of tools like sprintf from underscore.string, but this seems like something that JS should be able to do out the box, especially given the power of template strings.
EDIT
As stated above, I am already aware of 3rd party tools (e.g. sprintf) and customised functions to do this. Similar questions (e.g. JavaScript equivalent to printf/String.Format) don't mention template strings at all, probably because they were asked before the ES6 template strings were around. My question is specific to ES6, and is independent of implementation. I am quite happy to accept an answer of "No, this is not possible" if that is case, but what would be great is either info about a new ES6 feature that provides this, or some insight into whether such a feature is on its way.
No, ES6 does not introduce any new number formatting functions, you will have to live with the existing .toExponential(fractionDigits), .toFixed(fractionDigits), .toPrecision(precision), .toString([radix]) and toLocaleString(…) (which has been updated to optionally support the ECMA-402 Standard, though).
Template strings have nothing to do with number formatting, they just desugar to a function call (if tagged) or string concatenation (default).
If those Number methods are not sufficient for you, you will have to roll your own. You can of course write your formatting function as a template string tag if you wish to do so.
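As a rough illustration of the two points above (the variable names and the `show` tag are made up for this sketch), the existing Number methods can be called inside a placeholder like any other expression, and a tagged template is simply a function call that receives the literal pieces and the evaluated values:

```javascript
const n = 5.1234;

// Existing Number methods work inside a placeholder like any other expression.
const fixed = `fixed: ${n.toFixed(2)}`;             // 'fixed: 5.12'
const precision = `precision: ${n.toPrecision(3)}`; // 'precision: 5.12'
const exponential = `exp: ${n.toExponential(1)}`;   // 'exp: 5.1e+0'
const binary = `radix 2: ${(5).toString(2)}`;       // 'radix 2: 101'

// A tagged template is just a function call: the tag gets the literal
// string pieces first, then the already-evaluated substitutions.
function show(strings, ...values) {
  return JSON.stringify({ strings: Array.from(strings), values });
}
const tagged = show`a ${n} b`;
// '{"strings":["a "," b"],"values":[5.1234]}'

console.log(fixed, precision, exponential, binary, tagged);
```

Nothing here is specific to number formatting; the template string only decides where the already-formatted text ends up.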
You should be able to use the toFixed() method of a number:
var num = 5.1234;
var n = num.toFixed(2);
If you want to use ES6 tag functions here's how such a tag function would look,
function d2(pieces) {
var result = pieces[0];
var substitutions = [].slice.call(arguments, 1);
for (var i = 0; i < substitutions.length; ++i) {
var n = substitutions[i];
if (Number(n) == n) {
result += Number(substitutions[i]).toFixed(2);
} else {
result += substitutions[i];
}
result += pieces[i + 1];
}
return result;
}
which can then be applied to a template string thusly,
d2`${some_float} (you can interpolate as many floats as you want) of ${some_string}`;
that will format the float and leave the string alone.
Here's a fully ES6 version of Filip Allberg's solution above, using ES6 "rest" params. The only thing missing is being able to vary the precision; that could be done by making a factory function. Left as an exercise for the reader.
function d2(strs, ...args) {
var result = strs[0];
for (var i = 0; i < args.length; ++i) {
var n = args[i];
if (Number(n) == n) {
result += Number(args[i]).toFixed(2);
} else {
result += args[i];
}
result += strs[i+1];
}
return result;
}
const f = 1.2345678;
const s = "a string";
console.log(d2`template: ${f} ${f*100} and ${s} (literal:${9.0001})`);
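One possible way to do the exercise mentioned above is a small factory that returns a tag for a given precision. This is only a sketch: the name `fixed` is made up here, and it uses a stricter `typeof arg === 'number'` check than the `Number(n) == n` test in the answer, so numeric strings are left untouched.

```javascript
// Hypothetical factory: fixed(digits) returns a tag function that formats
// every substitution of type number with toFixed(digits).
function fixed(digits) {
  return (strs, ...args) => {
    let result = strs[0];
    args.forEach((arg, i) => {
      // Only real numbers are formatted; everything else is stringified as-is.
      result += typeof arg === 'number' ? arg.toFixed(digits) : String(arg);
      result += strs[i + 1];
    });
    return result;
  };
}

const d2 = fixed(2);
const d4 = fixed(4);
const f = 1.2345678;

console.log(d2`value: ${f} (${'text'})`); // value: 1.23 (text)
console.log(d4`value: ${f}`);             // value: 1.2346
```

Because the factory returns an ordinary function, the result can be stored once (as with `d2` and `d4`) or used inline as `fixed(3)` followed directly by a template literal.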
While template-string interpolation formatting is not available as a built-in, you can get equivalent behavior with Intl.NumberFormat:
const format = (num, fraction = 2) => new Intl.NumberFormat([], {
minimumFractionDigits: fraction,
maximumFractionDigits: fraction,
}).format(num);
format(5.1234); // -> '5.12'
Note that regardless of your implementation of choice, you might get bitten by rounding errors:
(9.999).toFixed(2) // -> '10.00'
new Intl.NumberFormat([], {
minimumFractionDigits: 2,
maximumFractionDigits: 2, // <- implicit rounding!
}).format(9.999) // -> '10.00'
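The 9.999 case above is ordinary rounding up. A different kind of surprise comes from binary floating point, where the stored value is slightly off from the literal you typed. A minimal sketch using the commonly cited value 1.005 (the variable names are only for illustration):

```javascript
// Ordinary rounding: 9.999 is closer to 10.00 than to 9.99.
const roundedUp = (9.999).toFixed(2); // '10.00'

// Binary floating point: 1.005 is stored slightly below 1.005,
// so it rounds down instead of up.
const roundedDown = (1.005).toFixed(2); // '1.00'

// Printing more digits shows the value that is actually stored.
const stored = (1.005).toPrecision(20);

console.log(roundedUp, roundedDown, stored);
```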
Based on ES6 tagged templates (credit to https://stackoverflow.com/a/51680250/711085), this emulates the format-spec syntax found in other languages (it is loosely based on Python f-strings; I avoid calling it f in case of name overlaps):
Demo:
> F`${(Math.sqrt(2))**2}{.0f}` // normally 2.0000000000000004
"2"
> F`${1/3}{%} ~ ${1/3}{.2%} ~ ${1/3}{d} ~ ${1/3}{.2f} ~ ${1/3}`
"33% ~ 33.33% ~ 0 ~ 0.33 ~ 0.3333333333333333"
> F`${[1/3,1/3]}{.2f} ~ ${{a:1/3, b:1/3}}{.2f} ~ ${"someStr"}`
"[0.33,0.33] ~ {\"a\":\"0.33\",\"b\":\"0.33\"} ~ someStr"
The code is fairly simple:
var FORMATTER = function(obj,fmt) {
/* implements things using (Number).toFixed:
${1/3}{.2f} -> 0.33
${1/3}{.0f} -> 0
${1/3}{%} -> 33%
${1/3}{.3%} -> 33.333%
${1/3}{d} -> 0
${{a:1/3,b:1/3}}{.2f} -> {"a":0.33, "b":0.33}
${{a:1/3,b:1/3}}{*:'.2f',b:'%'} -> {"a":0.33, "b":'33%'} //TODO not implemented
${[1/3,1/3]}{.2f} -> [0.33, 0.33]
${someObj} -> if the object/class defines a method [Symbol.FTemplate](){...},
it will be evaluated; alternatively if a method [Symbol.FTemplateKey](key){...}
that can be evaluated to a fmt string; alternatively in the future
once decorators exist, metadata may be appended to object properties to derive
formats //TODO not implemented
*/
try {
let fracDigits=0,percent,matches,_; // declare matches and _ so they are not implicit globals
if (fmt===undefined) {
if (typeof obj === 'string')
return obj;
else
return JSON.stringify(obj);
} else if (obj instanceof Array)
return '['+obj.map(x=> FORMATTER(x,fmt))+']'
else if (typeof obj==='object' && obj!==null /*&&!Array.isArray(obj)*/)
return JSON.stringify(Object.fromEntries(Object.entries(obj).map(([k,v])=> [k,FORMATTER(v,fmt)])));
else if (matches = fmt.match(/^\.(\d+)f$/))
[_,fracDigits] = matches;
else if (matches = fmt.match(/^(?:\.(\d+))?(%)$/))
[_,fracDigits,percent] = matches;
else if (matches = fmt.match(/^d$/))
fracDigits = 0;
else
throw 'format not recognized';
if (obj===null)
return 'null';
if (obj===undefined) {
// one might extend the above syntax to
// allow for example for .3f? -> "undefined"|"0.123"
return 'undefined';
}
if (percent)
obj *= 100;
fracDigits = parseFloat(fracDigits);
return obj.toFixed(fracDigits) + (percent? '%':'');
} catch(err) {
throw `error executing F\`$\{${obj}\}{${fmt}}\` specification: ${err}`
}
}
function F(strs, ...args) {
/* usage: F`Demo: 1+1.5 = ${1+1.5}{.2f}`
--> "Demo: 1+1.5 = 2.50"
*/
let R = strs[0];
args.forEach((arg,i)=> {
let [_,fmt,str] = strs[i+1].match(/(?:\{(.*)(?<!\\)\})?(.*)/);
R += FORMATTER(arg,fmt) + str;
});
return R;
}
Side note: the core of the code is the loop below; the heavy lifting is done by FORMATTER. The negative lookbehind is optional; it is there so that actual curly braces can be escaped with a backslash.
let R = strs[0];
args.forEach((arg,i)=> {
let [_,fmt,str] = strs[i+1].match(/(?:\{(.*)(?<!\\)\})?(.*)/);
R += FORMATTER(arg,fmt) + str;
});
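To see what that pattern does with each string piece, here is a small sketch that applies it on its own. The helper name `splitSpec` is made up for illustration; the regular expression is the one from F above, and the lookbehind needs an engine with ES2018 support.

```javascript
// Same pattern as in F above: an optional leading {fmt}, then the rest.
const SPEC = /(?:\{(.*)(?<!\\)\})?(.*)/;

// Hypothetical helper: split one template piece into its format spec
// (undefined when there is none) and the remaining literal text.
function splitSpec(piece) {
  const [, fmt, rest] = piece.match(SPEC);
  return { fmt, rest };
}

console.log(splitSpec('{.2f} apples'));     // fmt: '.2f', rest: ' apples'
console.log(splitSpec(' no spec here'));    // fmt: undefined, rest: ' no spec here'
// The closing brace is escaped, so the lookbehind rejects it and
// the whole piece is treated as literal text.
console.log(splitSpec('{literal\\} tail')); // fmt: undefined
// The capture is greedy: with two brace groups it runs to the last '}'.
console.log(splitSpec('{.2f} and {x}'));    // fmt: '.2f} and {x', rest: ''
```

The last case suggests that a literal `{...}` later in the same piece gets swallowed into the format spec, so escaping the braces (or a non-greedy capture) may be needed there.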
You can use ES6 tag functions. I don't know of a ready-made one for this.
It might look like this:
num`This is a number: $.2d{n}`
Learn more:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals
https://developers.google.com/web/updates/2015/01/ES6-Template-Strings
For example, if you have an expression like this:
Expression<Func<int, int>> fn = x => x * x;
Is there anything that will traverse the expression tree and generate this?
"function(x) { return x * x; }"
It's probably not easy, but yes, it's absolutely feasible. ORMs like Entity Framework or Linq to SQL do it to translate Linq queries into SQL, but you can actually generate anything you want from the expression tree...
You should implement an ExpressionVisitor to analyse and transform the expression.
EDIT: here's a very basic implementation that works for your example:
Expression<Func<int, int>> fn = x => x * x;
var visitor = new JsExpressionVisitor();
visitor.Visit(fn);
Console.WriteLine(visitor.JavaScriptCode);
...
class JsExpressionVisitor : ExpressionVisitor
{
private readonly StringBuilder _builder;
public JsExpressionVisitor()
{
_builder = new StringBuilder();
}
public string JavaScriptCode
{
get { return _builder.ToString(); }
}
public override Expression Visit(Expression node)
{
_builder.Clear();
return base.Visit(node);
}
protected override Expression VisitParameter(ParameterExpression node)
{
_builder.Append(node.Name);
base.VisitParameter(node);
return node;
}
protected override Expression VisitBinary(BinaryExpression node)
{
base.Visit(node.Left);
_builder.Append(GetOperator(node.NodeType));
base.Visit(node.Right);
return node;
}
protected override Expression VisitLambda<T>(Expression<T> node)
{
_builder.Append("function(");
for (int i = 0; i < node.Parameters.Count; i++)
{
if (i > 0)
_builder.Append(", ");
_builder.Append(node.Parameters[i].Name);
}
_builder.Append(") {");
if (node.Body.Type != typeof(void))
{
_builder.Append("return ");
}
base.Visit(node.Body);
_builder.Append("; }");
return node;
}
private static string GetOperator(ExpressionType nodeType)
{
switch (nodeType)
{
case ExpressionType.Add:
return " + ";
case ExpressionType.Multiply:
return " * ";
case ExpressionType.Subtract:
return " - ";
case ExpressionType.Divide:
return " / ";
case ExpressionType.Assign:
return " = ";
case ExpressionType.Equal:
return " == ";
case ExpressionType.NotEqual:
return " != ";
// TODO: Add other operators...
}
throw new NotImplementedException("Operator not implemented");
}
}
It only handles lambdas with a single instruction, but anyway the C# compiler can't generate an expression tree for a block lambda.
There's still a lot of work to do of course, this is a very minimal implementation... you probably need to add method calls (VisitMethodCall), property and field access (VisitMember), etc.
Script# is used by Microsoft internal developers to do exactly this.
Take a look at Lambda2Js, a library created by Miguel Angelo for this exact purpose.
It adds a CompileToJavascript extension method to any Expression.
Example 1:
Expression<Func<MyClass, object>> expr = x => x.PhonesByName["Miguel"].DDD == 32 | x.Phones.Length != 1;
var js = expr.CompileToJavascript();
Assert.AreEqual("PhonesByName[\"Miguel\"].DDD==32|Phones.length!=1", js);
Example 2:
Expression<Func<MyClass, object>> expr = x => x.Phones.FirstOrDefault(p => p.DDD > 10);
var js = expr.CompileToJavascript();
Assert.AreEqual("System.Linq.Enumerable.FirstOrDefault(Phones,function(p){return p.DDD>10;})", js);
More examples here.
The expression has already been parsed for you by the C# compiler; all that remains is for you to traverse the expression tree and generate the code. Traversing the tree can be done recursively, and each node could be handled by checking what type it is (there are several subclasses of Expression, representing e.g. functions, operators, and member lookup). The handler for each type can generate the appropriate code and traverse the node's children (which will be available in different properties depending on which expression type it is). For instance, a function node could be processed by first outputting "function(" followed by the parameter name followed by ") {". Then, the body could be processed recursively, and finally, you output "}".
A few people have developed open source libraries seeking to solve this problem. The one I have been looking at is Linq2CodeDom, which converts expressions into a CodeDom graph, which can then be compiled to JavaScript as long as the code is compatible.
Script# leverages the original C# source code and the compiled assembly, not an expression tree.
I made some minor edits to Linq2CodeDom to add JScript as a supported language--essentially just adding a reference to Microsoft.JScript, updating an enum, and adding one more case in GenerateCode. Here is the code to convert an expression:
var c = new CodeDomGenerator();
c.AddNamespace("Example")
.AddClass("Container")
.AddMethod(
MemberAttributes.Public | MemberAttributes.Static,
(int x) => "Square",
Emit.@return<int, int>(x => x * x)
);
Console.WriteLine(c.GenerateCode(CodeDomGenerator.Language.JScript));
And here is the result:
package Example
{
public class Container
{
public static function Square(x : int)
{
return (x * x);
}
}
}
The method signature reflects the more strongly-typed nature of JScript. It may be better to use Linq2CodeDom to generate C# and then pass this to Script# to convert this to JavaScript. I believe the first answer is the most correct, but as you can see by reviewing the Linq2CodeDom source, there is a lot of effort involved on handling every case to generate the code correctly.
I need a short basename function (one-liner ?) for Javascript:
basename("/a/folder/file.a.ext") -> "file.a"
basename("/a/folder/file.ext") -> "file"
basename("/a/folder/file") -> "file"
That should strip the path and any extension.
Update:
For names with a dot at the beginning, it would be nice to treat them as "special" files:
basename("/a/folder/.file.a.ext") -> ".file.a"
basename("/a/folder/.file.ext") -> ".file"
basename("/a/folder/.file") -> ".file" # empty is Ok
basename("/a/folder/.fil") -> ".fil" # empty is Ok
basename("/a/folder/.file..a..") -> # doesn't matter
function basename(path) {
return path.split('/').reverse()[0];
}
Breaks up the path into component directories and filename then returns the last piece (the filename) which is the last element of the array.
function baseName(str)
{
var base = new String(str).substring(str.lastIndexOf('/') + 1);
if(base.lastIndexOf(".") != -1)
base = base.substring(0, base.lastIndexOf("."));
return base;
}
If both / and \ can appear as separators, the code needs one more line to handle the second one.
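One way to handle both separators, as a sketch rather than the answer's own code, is to cut at whichever of the two occurs last. The rest follows the `baseName` function above.

```javascript
function baseName(str) {
  // Cut after whichever separator occurs last; lastIndexOf returns -1
  // when a separator is absent, so Math.max still picks the right one.
  const cut = Math.max(str.lastIndexOf('/'), str.lastIndexOf('\\'));
  let base = str.substring(cut + 1);

  // Strip the last extension, if there is one.
  const dot = base.lastIndexOf('.');
  if (dot !== -1) {
    base = base.substring(0, dot);
  }
  return base;
}

console.log(baseName('/a/folder/file.a.ext'));    // file.a
console.log(baseName('C:\\a\\folder\\file.ext')); // file
console.log(baseName('a/b\\file'));               // file
```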
Any of the above works although they have no respect for speed/memory :-).
A faster, simpler implementation should use as few functions/operations as possible. RegExp is a bad choice because it consumes a lot of resources when the same result can be achieved more easily.
An implementation for when you want the filename including the extension (which is in fact the genuine definition of basename):
function basename(str, sep) {
return str.substr(str.lastIndexOf(sep) + 1);
}
If you need a custom basename implementation that also strips the extension, I would instead recommend a dedicated extension-stripping function that you can call whenever you like.
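function strip_extension(str) {
// lastIndexOf returns -1 when there is no dot; return the name unchanged then
var dot = str.lastIndexOf('.');
return dot === -1 ? str : str.substr(0, dot);
}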
Usage example:
basename('file.txt','/'); // real basename
strip_extension(basename('file.txt','/')); // custom basename
They are kept separate so that you can combine them to obtain three different things: stripping the extension, getting the real basename, and getting your custom basename. I regard this as more elegant than the other approaches.
Maybe try to use existing packages if you can. http://nodejs.org/api/path.html
var path = require('path');
var str = '/path/to/file/test.html'
var fileNameStringWithoutExtention = path.basename(str, '.html');
// returns 'test'
// let path determine the extension
var fileNameStringWithoutExtention = path.basename(str, path.extname(str));
// returns 'test'
Other examples:
var pathString = path.dirname(str);
var fileNameStringWithExtention = path.basename(str);
var fullPathAndFileNameString = path.join(pathString, fileNameStringWithExtention);
basename = function(path) {
return path.replace(/.*\/|\.[^.]*$/g, '');
}
This replaces two things with the empty string: everything up to and including the last slash (.*\/), and a final dot followed by non-dot characters up to the end of the string (\.[^.]*$).
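Running the same expression against the paths from the question shows what each alternative removes (written as an arrow function here purely for brevity):

```javascript
// Same regular expression as above.
const basename = (path) => path.replace(/.*\/|\.[^.]*$/g, '');

console.log(basename('/a/folder/file.a.ext')); // file.a
console.log(basename('/a/folder/file.ext'));   // file
console.log(basename('/a/folder/file'));       // file
// A name that starts with a dot is treated entirely as an extension,
// so the result is the empty string.
console.log(basename('/a/folder/.file'));      // (empty string)
```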
Just like @3DFace has commented:
path.split(/[\\/]/).pop(); // works with both separators
Or if you like prototypes:
String.prototype.basename = function(sep) {
sep = sep || '\\\\/'; // default: backslash (escaped for the character class) and forward slash
return this.split(new RegExp("["+sep+"]")).pop();
}
Testing:
var str = "http://stackoverflow.com/questions/3820381/need-a-basename-function-in-javascript";
alert(str.basename());
Will return "need-a-basename-function-in-javascript".
Enjoy!
Using modern (2020) js code:
function basename (path) {
return path.substring(path.lastIndexOf('/') + 1)
}
console.log(basename('/home/user/file.txt'))
Contrary to misinformation provided above, regular expressions are extremely efficient. The caveat is that, when possible, they should be in a position so that they are compiled exactly once in the life of the program.
Here is a solution that gives both dirname and basename.
const rx1 = /(.*?)\/+([^/]*)$/; // (dir/) (optional_file); lazy, so repeated slashes are not kept in dir
const rx2 = /()(.*)$/; // () (file)
function dir_and_file(path) {
// result is array with
// [0] original string
// [1] dirname
// [2] filename
return rx1.exec(path) || rx2.exec(path);
}
// Single purpose versions.
function dirname(path) {
return (rx1.exec(path) || rx2.exec(path))[1];
}
function basename(path) {
return (rx1.exec(path) || rx2.exec(path))[2];
}
As for performance, I have not measured it, but I expect this solution to be in the same range as the fastest of the others on this page, but this solution does more. Helping the real-world performance is the fact that rx1 will match most actual paths, so rx2 is rarely executed.
Here is some test code.
function show_dir(parts) {
console.log("Original str :"+parts[0]);
console.log("Directory nm :"+parts[1]);
console.log("File nm :"+parts[2]);
console.log();
}
show_dir(dir_and_file('/absolute_dir/file.txt'));
show_dir(dir_and_file('relative_dir////file.txt'));
show_dir(dir_and_file('dir_no_file/'));
show_dir(dir_and_file('just_one_word'));
show_dir(dir_and_file('')); // empty string
show_dir(dir_and_file(null));
And here is what the test code yields:
# Original str :/absolute_dir/file.txt
# Directory nm :/absolute_dir
# File nm :file.txt
#
# Original str :relative_dir////file.txt
# Directory nm :relative_dir
# File nm :file.txt
#
# Original str :dir_no_file/
# Directory nm :dir_no_file
# File nm :
#
# Original str :just_one_word
# Directory nm :
# File nm :just_one_word
#
# Original str :
# Directory nm :
# File nm :
#
# Original str :null
# Directory nm :
# File nm :null
By the way, "node" has a built in module called "path" that has "dirname" and "basename". Node's "path.dirname()" function accurately imitates the behavior of the "bash" shell's "dirname," but is that good? Here's what it does:
Produces '.' (dot) when path=="" (empty string).
Produces '.' (dot) when path=="just_one_word".
Produces '.' (dot) when path=="dir_no_file/".
I prefer the results of the function defined above.
Another good solution:
function basename (path, suffix) {
// discuss at: http://locutus.io/php/basename/
// original by: Kevin van Zonneveld (http://kvz.io)
// improved by: Ash Searle (http://hexmen.com/blog/)
// improved by: Lincoln Ramsay
// improved by: djmix
// improved by: Dmitry Gorelenkov
// example 1: basename('/www/site/home.htm', '.htm')
// returns 1: 'home'
// example 2: basename('ecra.php?p=1')
// returns 2: 'ecra.php?p=1'
// example 3: basename('/some/path/')
// returns 3: 'path'
// example 4: basename('/some/path_ext.ext/','.ext')
// returns 4: 'path_ext'
var b = path
var lastChar = b.charAt(b.length - 1)
if (lastChar === '/' || lastChar === '\\') {
b = b.slice(0, -1)
}
b = b.replace(/^.*[\/\\]/g, '')
if (typeof suffix === 'string' && b.substr(b.length - suffix.length) === suffix) {
b = b.substr(0, b.length - suffix.length)
}
return b
}
from: http://locutus.io/php/filesystem/basename/
Fast, without regular expressions, and suitable for both path types '/' and '\'. The result keeps the extension:
function baseName(str)
{
let li = Math.max(str.lastIndexOf('/'), str.lastIndexOf('\\'));
return new String(str).substring(li + 1);
}
This is my implementation which I use in my fundamental js file.
// BASENAME
window.basename = function() {
var basename = window.location.pathname.split(/[\\/]/);
return basename.pop() || basename.pop();
}
JavaScript Functions for basename and also dirname:
function basename(path) {
return path.replace(/.*\//, '');
}
function dirname(path) {
var match = path.match(/.*\//);
return match ? match[0] : null; // null when the path contains no slash
}
Sample:
Input                        dirname()            basename()
/folder/subfolder/file.ext   /folder/subfolder/   file.ext
/folder/subfolder            /folder/             subfolder
/file.ext                    /                    file.ext
file.ext                     null                 file.ext
See reference.
Defining a flexible basename implementation
Despite all the answers, I still had to produce my own solution which fits the following criteria:
Is fully portable and works in any environment (thus Node's path.basename won't do)
Works with both kinds of separators (/ and \)
Allows for mixing separators - e.g. a/b\c (this is different from Node's implementation which respects the underlying system's separator instead)
Does not return an empty path if the path ends with a separator (i.e. getBasename('a/b/c/') === 'c')
Code
/**
* Flexible `basename` implementation
* @see https://stackoverflow.com/a/59907288/2228771
*/
function getBasename(path) {
// make sure the basename is not empty, if string ends with separator
let end = path.length-1;
while (path[end] === '/' || path[end] === '\\') {
--end;
}
// support mixing of Win + Unix path separators
const i1 = path.lastIndexOf('/', end);
const i2 = path.lastIndexOf('\\', end);
let start;
if (i1 === -1) {
if (i2 === -1) {
// no separator in the whole thing
return path;
}
start = i2;
}
else if (i2 === -1) {
start = i1;
}
else {
start = Math.max(i1, i2);
}
return path.substring(start+1, end+1);
}
// tests
console.table([
['a/b/c', 'c'],
['a/b/c//', 'c'],
['a\\b\\c', 'c'],
['a\\b\\c\\', 'c'],
['a\\b\\c/', 'c'],
['a/b/c\\', 'c'],
['c', 'c']
].map(([input, expected]) => {
const result = getBasename(input);
return {
input,
result,
expected,
good: result === expected ? '✅' : '❌'
};
}));
A nice one line, using ES6 arrow functions:
var basename = name => /([^\/\\]*|\.[^\/\\]*)\..*$/gm.exec(name)[1];
// In response to @IAM_AL_X's comments, even shorter and it
// works with files that don't have extensions:
var basename = name => /([^\/\\\.]*)(\..*)?$/.exec(name)[1];
function basename(url){
return ((url=/(([^\/\\\.#\? ]+)(\.\w+)*)([?#].+)?$/.exec(url))!= null)? url[2]: '';
}
Fairly simple using regex:
function basename(input) {
return input.split(/\.[^.]+$/)[0];
}
Explanation:
Matches a single dot character, followed by any character except a dot ([^.]), one or more times (+), tied to the end of the string ($).
It then splits the string based on this matching criteria, and returns the first result (ie everything before the match).
[EDIT]
D'oh. Misread the question -- he wants to strip off the path as well. Oh well, this answers half the question anyway.
my_basename('http://www.example.com/dir/file.php?param1=blabla#cprod', '/', '?');
// returns: file.php
CODE:
function my_basename(str, DirSeparator, FileSeparator) {
var x = str.split(DirSeparator);
return x[x.length-1].split(FileSeparator)[0];
}
UPDATE
An improved version that works with forward slashes / and backslashes \, single or doubled, i.e. any of the following:
\\path\\to\\file
\path\to\file
//path//to//file
/path/to/file
http://url/path/file.ext
http://url/path/file
See a working demo below
let urlHelper = {};
urlHelper.basename = (path) => {
let isForwardSlash = path.match(/\/{1,2}/g) !== null;
let isBackSlash = path.match(/\\{1,2}/g) !== null;
if (isForwardSlash) {
return path.split('/').reverse().filter(function(el) {
return el !== '';
})[0];
} else if (isBackSlash) {
return path.split('\\').reverse().filter(function(el) {
return el !== '';
})[0];
}
return path;
};
$('em').each(function() {
var text = $(this).text();
$(this).after(' --> <strong>' + urlHelper.basename(text) + '</strong><br>');
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<em>http://folder/subfolder/file.ext</em><br>
<em>http://folder/subfolder/subfolder2</em><br>
<em>/folder/subfolder</em><br>
<em>/file.ext</em><br>
<em>file.ext</em><br>
<em>/folder/subfolder/</em><br>
<em>//folder//subfolder//</em><br>
<em>//folder//subfolder</em><br>
<em>\\folder\\subfolder\\</em><br>
<em>\\folder\\subfolder\\file.ext</em><br>
<em>\folder\subfolder\</em><br>
<em>\\folder\\subfolder</em><br>
<em>\\folder\\subfolder\\file.ext</em><br>
<em>\folder\subfolder</em><br>
A simpler solution could be
function basename(path) {
return path.replace(/\/+$/, "").replace( /.*\//, "" );
}
Input basename()
/folder/subfolder/file.ext --> file.ext
/folder/subfolder --> subfolder
/file.ext --> file.ext
file.ext --> file.ext
/folder/subfolder/ --> subfolder
Working example: https://jsfiddle.net/Smartik/2c20q0ak/1/
If your original string or text file contains a single backslash character, you could locate it by using '\\'.
In my case, I am using JavaScript to find the index of "\N" in a text file, and str.indexOf('\\N'); helped me locate the \N in the original string, which is read from the source file.
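A small sketch of the difference, using a made-up sample line: in JavaScript source, '\\N' is a two-character string (a backslash followed by N), whereas '\N' is just the single character N because the backslash is consumed as an escape.

```javascript
// Hypothetical line as it would be read from a file: alice,\N,42
const line = 'alice,\\N,42';

const escaped = '\\N';  // two characters: backslash + N
const unescaped = '\N'; // one character: N (the backslash is dropped)

console.log(escaped.length);        // 2
console.log(unescaped.length);      // 1
console.log(line.indexOf(escaped)); // 6
```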