I am trying to find and extract an assignment to a property of the product_images object from JavaScript code that I extracted with BeautifulSoup. I have tried the following:
re.findall(r"product_images\['top_lg'] = .*;", txt)
Unfortunately it does not extract anything from my text below.
product_images['top_lg'] = {
"tn": '//image.test.com/media/cache/04/0a/040a1e61f5edc387d8c8e40d3ea0e0ca.jpg',
"md": '//image.test.com/media/cache/b7/f3/b7f3cb1da267d7e8ac0412bdc522c862.jpg',
"lg": '//image.test.com/media/shape_images/011f7f24ae4cbbef191cff1a711df9e1_a3c9ca71b7d85d87085955f8d1c4bfc3_0_.jpg',
"alt": 'test ',
"data-zoomable": 'True',
"text_line": 'teest'
};
The scripts that I am parsing are taken from https://www.brilliantearth.com/Petite-Twisted-Vine-Diamond-Ring-White-Gold-BE1D54-3821855/
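By default, the dot does not match a newline, so .*; cannot span the multi-line object literal. If, like me, you find regex flags confusing and hard to remember, use a "not semicolon" expression instead of the dot: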
re.findall(r"product_images\['top_lg'] = [^;]*;", txt)
Note: alternatively, you can add a flag as Thierry suggests (re.DOTALL makes the dot match newlines), though you would also need to add the non-greedy modifier ? after * to indicate that you are interested in the first semicolon rather than the last.
I am working on a project that needs to check whether the user has written a valid condition in a text field. So I'd like to know whether anyone knows a regex for an 'if' statement. For example, if the user writes if ((k <= 5 && k>0)|| x>8), I will return true.
Keith's reference looks good (pegjs.org). You're not looking for a regex (although it would be possible with one) but for a lexer + yacc combo (see flex and bison). Note that what you are really looking for is an "expression" parser: a "simple calculator" expression.
One way to test would be to use the "eval()" function. However, it is considered to be a dangerous function. Yet, in your case you let the end user enter an expression and then execute that expression. If they write dangerous stuff, they probably know what they are doing and they will be doing it to themselves.
There is documentation about it:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/eval
In your case you would do something like the following with a try/catch to know whether it is valid:
var valid = true;
try
{
// where input.val would reference your if() expression
eval(input.val);
}
catch(e)
{
valid = false;
}
Note that this is a poor man's solution, since it won't tell you whether other code is included in the expression. There are probably some workarounds for that, such as wrapping the input in a larger expression:
eval("ignore = (1 + " + input.val + ")");
Now input.val pretty much has to be an expression. You'll have to test and see whether that's acceptable or not.
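As a rough sketch of the same idea, the check can be wrapped in a small function. This variant swaps eval for the Function constructor, which parses the code without running it; the name isValidExpression is hypothetical and not part of any library. Like the eval trick above, the wrapping is not airtight (an input such as 1); alert(1); (2 would still parse), so treat it as a syntax check only.

```javascript
// Hypothetical helper: checks syntax only and never runs the user's input.
// Assumption: the environment allows the Function constructor (no strict CSP).
function isValidExpression(source) {
  try {
    // "return ( ... )" makes the input parse as an expression;
    // the newline guards against a trailing // comment in the input.
    new Function("return (" + source + "\n);");
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidExpression("(k <= 5 && k > 0) || x > 8")); // true
console.log(isValidExpression("(k <= 5 && k > 0 || x > 8"));  // false
```

Because nothing is executed, undefined variables such as k and x do not matter here; only the syntax is checked.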
I am implementing jQuery chaining - using Mika Tuupola's Chained plugin - in my rails project (using nested form_for partials) and need to dynamically change the chaining attribute:
The code that works without substitution:
$(".employee_title_2").remoteChained({
parents : ".employee_title_1",
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
The attributes being substituted are .employee_title_1 and .employee_title_2:
var t2 = new Date().getTime();
var A1 = ".employee_title_1A_" + t2;
var B2 = ".employee_title_2B_" + t2;
In ruby speak, I'm namespacing the variables by adding datetime.
Here's the code I'm using for on-the-fly substitution:
$(`"${B2}"`).remoteChained({
parents : `"${A1}"`,
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
Which throws this error:
Uncaught Error: Syntax error, unrecognized expression:
".employee_title_2B_1462463848339"
The issue appears to be the leading '.' How do I escape it, assuming that's the issue? Researching the error message Syntax error, unrecognized expression led to SO question #14347611, which suggests "a string is only considered to be HTML if it starts with a less-than ('<') character". Unfortunately, I don't understand how to implement the solution. My JavaScript skills are weak!
Incidentally, while new Date().getTime(); isn't in date format, it works for my purpose, i.e., it increments as new nested form fields are added to the page
Thanks in advance for your assistance.
$(`"${B2}"`).remoteChained({
// ^     ^
// These quotes should not be here
As it is evaluated to a string containing something like:
".my_class"
and to tie it together:
$('".my_class"')...
Same goes for the other place you use backtick notation. In your case you could simply use:
$(B2).remoteChained({
parents : A1,
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
The back tick (``) syntax is new for Javascript, and provides a templating feature, similar to the way that Ruby provides interpolated strings. For instance, this Javascript code:
var who = "men";
var what = "country";
var famous_quote = `Now is the time for all good ${who} to come to the aid of their ${what}`;
is interpolated in exactly the same way as this Ruby code:
who = "men"
what = "country"
famous_quote = "Now is the time for all good #{who} to come to the aid of their #{what}"
In both cases, the quote ends up reading, "Now is the time for all good men to come to the aid of their country". Similar feature, slightly different syntax.
Moving on to jQuery selectors, you have some flexibility in how you specify them. For instance, this code:
$(".my_class").show();
is functionally equivalent to this code:
var my_class_name = ".my_class";
$(my_class_name).show();
This is a great thing, because that means that you can store the name of jQuery selectors in variables and use them instead of requiring string literals. You can also build them from components, as you will find in this example:
var mine_or_yours = (user_selection == "me") ? "my" : "your";
var my_class_name = "." + mine_or_yours + "_class";
$(my_class_name).show();
This is essentially the behavior that you're trying to get working. Using the two features together (interpolation and dynamic jQuery selectors), you have this:
$(`"${B2}"`).remoteChained(...);
which produces this code through string interpolation:
$("\".employee_title_2B_1462463848339\"").remoteChained(...);
which is not correct, and is actually the cause of the error message from jQuery: the value of the string contains embedded double quotes. jQuery is specifically complaining about the extra double quotes surrounding the value that you're passing as the selector.
What you actually want is the equivalent of this:
$(".employee_title_2B_1462463848339").remoteChained(...);
which could either be written this way:
$(`${B2}`).remoteChained(...);
or, much more simply and portably, like so:
$(B2).remoteChained(...);
Try this little sample code to prove the equivalence to yourself:
if (`${B2}` == B2) {
alert("The world continues to spin on its axis...");
} else if (`"${B2}"` == B2) {
alert("Lucy, you've got some 'splain' to do!");
} else {
alert("Well, back to the drawing board...");
}
So, we've established the equivalency of interpolation to the original strings. We've also established the equivalency of literal jQuery selectors to dynamic selectors. Now, it's time to put the techniques together in the original code context.
Try this instead of the interpolation version:
$(B2).remoteChained({
parents : A1,
url : "titles/employee_title_2",
loading : "Loading...",
clear : true
});
We already know that $(B2) is a perfectly acceptable dynamic jQuery selector, so that works. The value passed to the parents key in the remoteChained hash simply requires a string, and A1 already fits the bill, so there's no need to introduce interpolation in that case, either.
Realistically, nothing about this issue is related to Chained; it just happens to be included in the statement that's failing. So, that means that you can easily isolate the failing code (building and using the jQuery selectors), which makes it far easier to debug.
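One way to isolate the selector-building step, without jQuery or Chained on the page, might look like the following sketch. The helper name buildSelectors is hypothetical; the class prefixes and the timestamp are taken from the question.

```javascript
// Hypothetical helper: builds the two class selectors from a timestamp.
function buildSelectors(timestamp) {
  return {
    parent: ".employee_title_1A_" + timestamp,
    child: ".employee_title_2B_" + timestamp
  };
}

const selectors = buildSelectors(1462463848339);
const interpolated = `${selectors.child}`; // same string as selectors.child
const wrapped = `"${selectors.child}"`;    // adds literal double quotes

console.log(interpolated === selectors.child);        // true
console.log(wrapped === selectors.child);             // false
console.log(wrapped.length - selectors.child.length); // 2
```

The two extra characters in wrapped are the double quotes that jQuery rejects.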
Note that the template literal (backtick) syntax was standardized only recently, in ECMAScript 6 (ES2015), so support for it is still a mixed bag. Check your browser support to make sure that you can use it reliably.
I have a very specific problem concerning a regular expression matching in Javascript. I'm trying to match a piece of source code, more specifically a portion here:
<TD WIDTH=100% ALIGN=right>World Boards | Olympa - Trade | <b>Bump when Yasir...</b></TD>
The part I'm trying to match is boardid=106121">Olympa - Trade</a>, the part I actually need is "Olympa". So I use the following line of JS code to get a match and have "Olympa" returned:
var world = document.documentElement.innerHTML.match('/boardid=[0-9]+">([A-Z][a-z]+)( - Trade){0,1}<\/a>/i')[1];
the ( - Trade) part is optional in my problem, hence the {0,1} in the regex.
There's also no easier way to narrow down the code by e.g. getElementsByTagName, so searching the complete source code is my only option.
Now here's the funny thing. I have used two online regex matchers (of which one was for JS-regex specifically) to test my regex against the complete source code. Both times, it had a match and returned "Olympa" exactly as it should have. However, when I have Chrome include the script on the actual page, it gives the following error:
Error in event handler for 'undefined': Cannot read property '1' of null TypeError: Cannot read property '1' of null
Obviously, the first part of my line returns "null" because it does not find a match, and taking [1] of "null" doesn't work.
I figured I might not be doing the match on the source code, but when I let the script output document.documentElement.innerHTML to the console, it outputs the complete source code.
I see no reason why this regex fails, so I must be overlooking something very silly. Does anyone else see the problem?
All help appreciated,
Kenneth
You're putting your regular expression inside a string. It should not be inside a string.
var world = document.documentElement.innerHTML.match(/boardid=[0-9]+">([A-Z][a-z]+)( - Trade){0,1}<\/a>/i)[1];
Another thing — it appears you have a document object, in which case all this HTML is already parsed for you, and you can take advantage of that instead of reinventing a fragile wheel.
var element = document.querySelector('a[href*="boardid="]');
var world = element.textContent;
(This assumes that you don't need <=IE8 support. If you do, there remains a better way, though.)
(P.S. ? is shorthand for {0,1}.)
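The difference between the string and the regex literal can be seen in a small sketch. The html value below is a made-up stand-in for the page source (the real markup is not shown in the question), and the null check is an addition to avoid the "Cannot read property '1' of null" error. When match receives a string, it builds a RegExp from it, so the slashes and the trailing /i become literal characters to search for.

```javascript
// Assumption: a simplified stand-in for the real page source.
const html = '<a href="index.php?boardid=106121">Olympa - Trade</a>';

// String argument: the pattern now requires a literal "/" before "boardid"
// and a literal "/i" after "</a>", so nothing matches.
const asString = html.match('/boardid=[0-9]+">([A-Z][a-z]+)( - Trade){0,1}<\/a>/i');

// Regex literal: the slashes are delimiters and i is a flag.
const asLiteral = html.match(/boardid=[0-9]+">([A-Z][a-z]+)( - Trade)?<\/a>/i);

// Guard against a failed match before indexing into the result.
const world = asLiteral ? asLiteral[1] : null;

console.log(asString); // null
console.log(world);    // Olympa
```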
I have a bbcode -> html converter that responds to the change event in a textarea. Currently, this is done using a series of regular expressions, and there are a number of pathological cases. I've always wanted to sharpen the pencil on this grammar, but didn't want to get into yak shaving. But... recently I became aware of pegjs, which seems a pretty complete implementation of PEG parser generation. I have most of the grammar specified, but am now left wondering whether this is an appropriate use of a full-blown parser.
My specific questions are:
As my application relies on translating what I can to HTML and leaving the rest as raw text, does implementing bbcode using a parser that can fail on a syntax error make sense? For example: [url=/foo/bar]click me![/url] would certainly be expected to succeed once the closing bracket on the close tag is entered. But what would the user see in the meantime? With regex, I can just ignore non-matching stuff and treat it as normal text for preview purposes. With a formal grammar, I don't know whether this is possible because I am relying on creating the HTML from a parse tree and what fails a parse is ... what?
I am unclear where the transformations should be done. In a formal lex/yacc-based parser, I would have header files and symbols that denoted the node type. In pegjs, I get nested arrays with the node text. I can emit the translated code as an action of the pegjs generated parser, but it seems like a code smell to combine a parser and an emitter. However, if I call PEG.parse.parse(), I get back something like this:
[
[
"[",
"img",
"",
[
"/",
"f",
"o",
"o",
"/",
"b",
"a",
"r"
],
"",
"]"
],
[
"[/",
"img",
"]"
]
]
given a grammar like:
document
= (open_tag / close_tag / new_line / text)*
open_tag
= ("[" tag_name "="? tag_data? tag_attributes? "]")
close_tag
= ("[/" tag_name "]")
text
= non_tag+
non_tag
= [\n\[\]]
new_line
= ("\r\n" / "\n")
I'm abbreviating the grammar, of course, but you get the idea. So, if you notice, there is no contextual information in the array of arrays that tells me what kind of node I have, and I'm left to do the string comparisons again even though the parser has already done this. I expect it's possible to define callbacks and use actions to run them during a parse, but there is scant information available on the Web about how one might do that.
Am I barking up the wrong tree? Should I fall back to regex scanning and forget about parsing?
Thanks
First question (grammar for incomplete texts):
You can add
incomplete_tag = ("[" tag_name "="? tag_data? tag_attributes?)
// the closing bracket is omitted ---^
after open_tag and change document to include an incomplete tag at the end. The trick is that you provide the parser with all needed productions to always parse, but the valid ones come first. You then can ignore incomplete_tag during the live preview.
Second question (how to include actions):
You write so-called actions after expressions. An action is JavaScript code enclosed in braces and is allowed after any pegjs expression, i.e. also in the middle of a production!
In practice, actions like { return result.join("") } are almost always necessary because pegjs returns matched text as arrays of single characters. Complicated nested arrays can also be returned. Therefore I usually write helper functions in the pegjs initializer at the head of the grammar to keep actions small. If you choose the function names carefully, the action is self-documenting.
For an example see PEG for Python style indentation. Disclaimer: this is an answer of mine.
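A minimal sketch of such helper functions, written as plain JavaScript so it can be tried outside a grammar. All names here (joinChars, textNode, openTagNode) are hypothetical and not part of pegjs; the idea is only that each action returns an object carrying its node type, so the caller does not have to compare strings again.

```javascript
// Hypothetical helpers of the kind that could live in a grammar initializer.
function joinChars(chars) {
  return chars.join("");
}

// Tags the joined text with a node type.
function textNode(chars) {
  return { type: "text", value: joinChars(chars) };
}

// name is assumed to be a string already; dataChars is an array of single
// characters, or an empty string when the optional part did not match.
function openTagNode(name, dataChars) {
  return {
    type: "open_tag",
    name: name,
    data: dataChars ? joinChars(dataChars) : null
  };
}

console.log(textNode(["f", "o", "o"]));
console.log(openTagNode("img", ["/", "f", "o", "o"]));
```

Inside the grammar, an action could then be as short as { return textNode(result); }.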
Regarding your first question, I have to say that a live preview is going to be difficult. You are right that the parser won't understand that the input is a "work in progress". Peg.js tells you at which point the error is, so maybe you could take that information, go a few words back, and parse again, or, if an end tag is missing, try adding it at the end.
The second part of your question is easier, but your grammar won't look as nice afterwards. Basically, you put callbacks on every rule, for example:
text
= text:non_tag+ {
// we captured the text in an array and can manipulate it now
return text.join("");
}
At the moment you have to write these callbacks inline in your grammar. I'm doing a lot of this stuff at work right now, so I might make a pull request to peg.js to fix that. But I'm not sure when I will find the time to do this.
Try something like this replacement rule. You're on the right track; you just have to tell it to assemble the results.
text
= result:non_tag+ { return result.join(''); }
I want to find anything that comes after s= and before & or the end of the string. For example, if the string is
t=qwerty&s=hello&p=3
I want to get hello. And if the string is
t=qwerty&s=hello
I also want to get hello
Thank you!
\bs=([^&]+) and grabbing $1 should be good enough, no?
edit: added word anchor! Otherwise it would also match for herpies, dongles...
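In JavaScript, that pattern could be used roughly as follows. The function name getS is hypothetical; the regex is the one given above, and the null return for a missing key is an added convention.

```javascript
// Hypothetical helper: returns the value of s, or null if s is absent.
function getS(query) {
  const match = /\bs=([^&]+)/.exec(query);
  return match ? match[1] : null;
}

console.log(getS("t=qwerty&s=hello&p=3")); // hello
console.log(getS("t=qwerty&s=hello"));     // hello
console.log(getS("dongles=1&s=hi"));       // hi
```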
Why don't you try something that was generically aimed at parsing query strings? That way, you can assume you won't run into the obvious next hurdle while reinventing the wheel.
jQuery has the query object for that (see JavaScript query string)
Or you can google a bit:
function getQuerystring(key, default_)
{
if (default_==null) default_="";
key = key.replace(/[\[]/,"\\\[").replace(/[\]]/,"\\\]");
var regex = new RegExp("[\\?&]"+key+"=([^&#]*)");
var qs = regex.exec(window.location.href);
if(qs == null)
return default_;
else
return qs[1];
}
looks useful; for example with
http://www.bloggingdeveloper.com?author=bloggingdeveloper
you want to get the "author" querystring's value:
var author_value = getQuerystring('author');
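Where it is available, the standard URLSearchParams API is another general-purpose option. This is a different technique from the function above, shown only as a sketch; the wrapper name getParam and its fallback parameter are hypothetical.

```javascript
// Hypothetical wrapper around the standard URLSearchParams API.
function getParam(query, key, fallback = "") {
  const value = new URLSearchParams(query).get(key);
  // get() returns null when the key is missing.
  return value === null ? fallback : value;
}

console.log(getParam("t=qwerty&s=hello&p=3", "s")); // hello
console.log(getParam("t=qwerty&s=hello", "s"));     // hello
```

Unlike a hand-written regex, this also decodes percent-encoded values.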
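The simplest way to do this is with the pattern s=([^&]*). The [^&] inside the parentheses stops the match at the next &, which prevents it from grabbing hello&p=3 if there is another field after s. Do not add a trailing & to the pattern, or it will fail to match when s is the last field in the string.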
You can also use the following expression, based on the solution provided here, which finds all characters between the two given strings:
(?<=s=)(.*)(?=&)
In your case you may need to slightly modify it to account for the "end of the string" option (there are several ways to do it, especially when you can use simple code manipulations such as manually adding a & character to the end of the string before running the regex).
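A sketch of the "append an &" idea, assuming a JavaScript engine that supports lookbehind. Two things here are my own adjustments rather than part of the expression above: the function name extractS is hypothetical, and the quantifier is made lazy (.*?) so that the match stops at the first & instead of the last.

```javascript
// Hypothetical helper: appends "&" so the lookahead also works at the end.
function extractS(query) {
  const match = (query + "&").match(/(?<=s=)(.*?)(?=&)/);
  return match ? match[1] : null;
}

console.log(extractS("t=qwerty&s=hello&p=3")); // hello
console.log(extractS("t=qwerty&s=hello"));     // hello
```

Note that (?<=s=) would also match inside a longer key that ends in s, so a word boundary may be needed for real query strings.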