How can I scrape values from embedded Javascript in HTML?

I need to parse some values out of embedded Javascript in a webpage.
I tried to tokenize the HTML with something like this but it doesn't tokenize the Javascript part.
func CheckSitegroup(httpBody io.Reader) []string {
	sitegroups := make([]string, 0)
	page := html.NewTokenizer(httpBody)
	for {
		tokenType := page.Next()
		fmt.Println("TokenType:", tokenType)
		// check if HTML file has ended
		if tokenType == html.ErrorToken {
			return sitegroups
		}
		token := page.Token()
		fmt.Println("Token:", token)
		if tokenType == html.StartTagToken && token.DataAtom.String() == "script" {
			for _, attr := range token.Attr {
				fmt.Println("ATTR.KEY:", attr.Key)
				sitegroups = append(sitegroups, attr.Val)
			}
		}
	}
}
The script in the HTML body looks like this, and I need the campaign number (nil / "" if there is no number, or if there is no test.campaign = at all; the same goes for the sitegroup).
Is there an easy way to get this information? I thought about regular expressions, but maybe there is something else? I have never worked with regex.
<script type="text/javascript" >
var test = {};
test.campaign = "8d26113ba";
test.isTest = "false";
test.sitegroup = "Homepage";
</script>

First you need to get the JS code safely. The easiest way would be the goquery library: https://github.com/PuerkitoBio/goquery
After that you need to extract the variables safely. Depending on how complicated it gets, you could parse the real JS abstract syntax tree and look for the right variables, for example with the parser of otto, the excellent JS interpreter written in Go: http://godoc.org/github.com/robertkrimen/otto/parser
Or, as you mentioned, for the case above a regex would be really easy. There is a really nice tutorial on regexes in Go: https://github.com/StefanSchroeder/Golang-Regex-Tutorial
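For the regex route, a minimal sketch using only Go's standard regexp package could look like the following. It assumes the assignments always have the form test.<name> = "<value>" with double quotes, as in the question; the function name extractTestVars and the pattern itself are illustrative and not part of any library.

```go
package main

import (
	"fmt"
	"regexp"
)

// assignRe matches assignments such as: test.campaign = "8d26113ba";
// Group 1 is the property name, group 2 the quoted value.
var assignRe = regexp.MustCompile(`\btest\.(\w+)\s*=\s*"([^"]*)"`)

// extractTestVars returns every test.<name> = "<value>" pair found in src.
// A property that never appears is simply absent from the map.
func extractTestVars(src string) map[string]string {
	vars := make(map[string]string)
	for _, m := range assignRe.FindAllStringSubmatch(src, -1) {
		vars[m[1]] = m[2]
	}
	return vars
}

func main() {
	const page = `
<script type="text/javascript" >
var test = {};
test.campaign = "8d26113ba";
test.isTest = "false";
test.sitegroup = "Homepage";
</script>
`
	vars := extractTestVars(page)
	campaign, ok := vars["campaign"]
	fmt.Printf("campaign=%q found=%t\n", campaign, ok)
	fmt.Printf("sitegroup=%q\n", vars["sitegroup"])
}
```

Looking a key up with the two-value form (value, ok) distinguishes "no test.campaign at all" from "test.campaign is an empty string", which is the distinction the question asks about.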

The Go standard strings library comes with a lot of useful functions that you can use to parse the JavaScript code and get the campaign number you need.
The following code gets the campaign number from the JS code provided in your question (Run code on Go Playground):
package main
import (
	"bufio"
	"fmt"
	"os"
	"strings"
)
const js = `
<script type="text/javascript" >
var test = {};
test.campaign = "8d26113ba";
test.isTest = "false";
test.sitegroup = "Homepage";
</script>
`
// StringToLines splits s into its individual lines.
func StringToLines(s string) []string {
	var lines []string
	scanner := bufio.NewScanner(strings.NewReader(s))
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading input:", err)
	}
	return lines
}
// getCampaignNumber returns the quoted value on the right-hand side of a
// line such as: test.campaign = "8d26113ba";
func getCampaignNumber(line string) string {
	tmp := strings.Split(line, "=")[1]
	tmp = strings.TrimSpace(tmp)
	// drop the leading quote and the trailing quote plus semicolon
	tmp = tmp[1 : len(tmp)-2]
	return tmp
}
func main() {
	lines := StringToLines(js)
	for _, line := range lines {
		if strings.Contains(line, "campaign") {
			result := getCampaignNumber(line)
			fmt.Println(result)
		}
	}
}

Related

CryptoJS.HmacSHA256 vs Indy10 TIdHMACSHA256.HashValue

I've been spending a day or two on this issue - I've cut the code to the bare bones minimum that's needed. Both functions are returning a different output, even though the input is the same.
The delphi code generates a result similar to: https://www.freeformatter.com/hmac-generator.html
Is there any known issue with either the CryptoJS or Delphi's Indy10 sha256 hashing code that might explain the different results in this case?
JS:
function SetHashMessage(message_str)
{
    var APIKey = "test";
    var signature_hash_obj = CryptoJS.HmacSHA256(message_str.toString(), APIKey);
    var signature_str = signature_hash_obj.toString(CryptoJS.enc.Base64);
    return signature_str;
}
Delphi:
function FDoHashMessageString(MessageString: String): String;
var
  hmac: TIdHMACSHA256;
  hash: TIdBytes;
begin
  LoadOpenSSLLibrary;
  if not TIdHashSHA256.IsAvailable then
    raise Exception.Create('SHA256 hashing is not available: ' + WhichFailedToLoad());
  hmac := TIdHMACSHA256.Create;
  try
    hmac.Key := ToBytes('test');
    hash := hmac.HashValue(ToBytes(MessageString));
    Result := TIdEncoderMIME.EncodeBytes(hash); // EncodeBytes returns base64
  finally
    hmac.Free;
  end;
end;
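When two implementations disagree, a third, independent one can help show which side deviates. The sketch below computes the same Base64-encoded HMAC-SHA256 with Go's standard library; the helper name signBase64 and the sample message "hello" are illustrative. Because an HMAC is computed over bytes rather than characters, the sketch also shows that the same text in two different byte encodings produces different signatures. Whether that is what happens in the code above is an assumption worth checking, not a diagnosis.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// signBase64 returns the Base64-encoded HMAC-SHA256 of message under key.
func signBase64(key, message []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write(message) // Write on a hash.Hash never returns an error
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	key := []byte("test") // same key as in the question

	// Reference value to compare against the JS and Delphi outputs
	// for the same message.
	fmt.Println(signBase64(key, []byte("hello")))

	// The text "é" encoded as UTF-8 (two bytes) and as Latin-1 (one byte).
	utf8Bytes := []byte("é")
	latin1Bytes := []byte{0xE9}
	fmt.Println(signBase64(key, utf8Bytes) == signBase64(key, latin1Bytes)) // false
}
```

Feeding the exact same message to all three implementations, starting with a plain ASCII string, narrows down whether the difference comes from the hashing itself or from how the input string is turned into bytes.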

How to run javascript in android and pass in a map as argument

I'm trying to run JavaScript on Android and found out that Rhino and Duktape provide the functionality to run it without a WebView.
But it seems like neither of them has a clear way of passing a variable number of key-value pairs as an argument to my JS function.
The argument would look like:
{"device":"android", "version":"4.4", "country":"US",...}
and the js side would look like
function calculate(param) {
    var country = 'country';
    var device = 'device';
    if (country in param && param[country]=='US') {
        return "a";
    };
    if (device in param && param[device]=="android") {
        return "b";
    } else {
        return "c";
    }
}
Is there any workaround?

I just tried this and I get the expected results:
#include <stdio.h>
#include "src/duktape.h"
char code[] = "function calculate(param) {"
    "    var country = 'country';"
    "    var device = 'device';"
    "    if (country in param && param[country]=='US') {"
    "        return \"a\";"
    "    }; "
    "    if (device in param && param[device]==\"android\") {"
    "        return \"b\";"
    "    } else {"
    "        return \"c\";"
    "    }"
    "}"
    "calculate({\"device\":\"android\", \"version\":\"4.4\", \"country\":\"US\"});";
int main(int argc, char *argv[]) {
    duk_context *ctx = duk_create_heap_default();
    /* evaluate the script; the result of the last expression stays on the stack top */
    duk_eval_string(ctx, code);
    printf("result is: %s\n", duk_get_string(ctx, -1));
    duk_destroy_heap(ctx);
    return 0;
}
Compile and run:
$ gcc duktest.c duktape.c -lm
$ ./a.out
result is: a
Maybe your problem is not in duktape?

If the input is a JSON encoded string you get from elsewhere in your program, you can convert it to a parsed object simply as:
duk_push_string(ctx, my_json_argument);
duk_json_decode(ctx, -1);
The decoded value will be left on the value stack top. The decode call is not "protected" so it will throw on invalid inputs - if that matters, you should wrap the whole argument parsing and call e.g. into a duk_safe_call().
This is faster (and safer) than doing a duk_eval_string() especially if the input isn't fully trusted.

c# HtmlAgilityPack and Yahoo's HTML

So I am goofing off and wrote something that first queries other websites to load all of the pertinent stock exchange symbols, then tries to iterate through the symbols and query Yahoo's server for the latest stock price. For some reason, no matter what I search for using HtmlAgilityPack, I can't seem to get any pertinent info back. I don't think there's an issue with some JavaScript running after the page is loaded (but I could be wrong).
The following is a generalized routine that can be tinkered with to try and get the percentage change of the stock symbol back from Yahoo:
string symbol = "stock symbol";
string link = @"http://finance.yahoo.com/q?uhb=uh3_finance_vert&s=" + symbol;
string data = String.Empty;
try
{
    // keep in this scope so webget and doc are released when the block ends
    var webget = new HtmlWeb();
    var doc = webget.Load(link);
    string percGrn = doc.FindInnerHtml("//span[@id='yfs_pp0_" + symbol + "']//b[@class='yfi-price-change-up']");
    string percRed = doc.FindInnerHtml("//span[@id='yfs_pp0_" + symbol + "']//b[@class='yfi-price-change-down']");
    // create something we'll understand later
    if ((String.IsNullOrEmpty(percGrn) && String.IsNullOrEmpty(percRed)) ||
        (!String.IsNullOrEmpty(percGrn) && !String.IsNullOrEmpty(percRed)))
        throw new Exception();
    // concatenating a string with an empty string gives the string
    string perc = percGrn + percRed;
    bool isNegative = String.IsNullOrEmpty(percGrn);
    double percDouble;
    if (double.TryParse(Regex.Match(perc, @"\d+([.])?(\d+)?").Value, out percDouble))
        data = (isNegative ? 0 - percDouble : percDouble).ToString();
}
catch (Exception ex) { }
finally
{
    // now we need to check what we have and load it into the DataGridView
    if (!newData_d.ContainsKey(symbol)) newData_d.Add(symbol, data);
    else MessageBox.Show("ERROR: Duplicate stock Symbols for: " + symbol);
}
And here is the extended method for FindInnerHtml:
// this is for the html agility class
public static string FindInnerHtml(this HtmlAgilityPack.HtmlDocument _doc, string _options)
{
    var node = _doc.DocumentNode.SelectSingleNode(_options);
    return (node != null ? node.InnerText.ToString() : String.Empty);
}
Any help with getting something back would be appreciated, thanks!
///////////////////////////////////////
EDIT:
///////////////////////////////////////
I highlighted the span id and then check out line 239 for where I saw 'yfi-price-change-up' reference:

The following XPath successfully finds the target <span>, which contains the percentage increase (or decrease):
var doc = new HtmlWeb().Load("http://finance.yahoo.com/q?uhb=uh3_finance_vert&s=msft");
var xpath = "//span[contains(@class,'time_rtq_content')]/span[2]";
var span = doc.DocumentNode.SelectSingleNode(xpath);
Console.WriteLine(span.InnerText);
output :
(0.60%)

Chrome's Regexes - Can I view them?

HTML5 introduces many new input types (range, email, date, etc.).
For example:
<input type="url" >
I know that IE used to have a regex store (in one of its internal folders).
Question:
Can I see which regexes Chrome uses to validate the input?
Are they in a viewable file or something? How can I see those regexes?

I looked up the source code of Blink. Keep in mind I never saw it before today, so I might be completely off.
Assuming I found the right place -
For type="url" fields there is URLInputType, with the code:
bool URLInputType::typeMismatchFor(const String& value) const
{
    return !value.isEmpty() && !KURL(KURL(), value).isValid();
}
typeMismatchFor is called from HTMLInputElement::isValidValue:
bool HTMLInputElement::isValidValue(const String& value) const
{
    if (!m_inputType->canSetStringValue()) {
        ASSERT_NOT_REACHED();
        return false;
    }
    return !m_inputType->typeMismatchFor(value) // <-- here
        && !m_inputType->stepMismatch(value)
        && !m_inputType->rangeUnderflow(value)
        && !m_inputType->rangeOverflow(value)
        && !tooLong(value, IgnoreDirtyFlag)
        && !m_inputType->patternMismatch(value)
        && !m_inputType->valueMissing(value);
}
KURL seems like a proper implementation of a URL, used everywhere in Blink.
In comparison, EmailInputType's typeMismatchFor calls isValidEmailAddress, which does use a regex:
static const char emailPattern[] =
    "[a-z0-9!#$%&'*+/=?^_`{|}~.-]+" // local part
    "@"
    "[a-z0-9-]+(\\.[a-z0-9-]+)*"; // domain part
static bool isValidEmailAddress(const String& address)
{
    int addressLength = address.length();
    if (!addressLength)
        return false;
    DEFINE_STATIC_LOCAL(const RegularExpression, regExp,
        (emailPattern, TextCaseInsensitive));
    int matchLength;
    int matchOffset = regExp.match(address, 0, &matchLength);
    return !matchOffset && matchLength == addressLength;
}
These elements and more can be found in the /html folder. It seems most of them use proper parsing and checking of the input, not regular expressions.
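To get a feel for what the email pattern accepts, here is a small Go sketch that applies it with the standard regexp package. It assumes the separator between the local part and the domain part is "@", adds ^ and $ anchors to approximate the "match starts at offset 0 and spans the whole string" check in isValidEmailAddress, and uses (?i) in place of TextCaseInsensitive. The Go names are illustrative; this is not Blink code.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailRe is the quoted pattern, anchored at both ends and made
// case-insensitive so that only whole-string matches count.
var emailRe = regexp.MustCompile(
	"(?i)^[a-z0-9!#$%&'*+/=?^_`{|}~.-]+" + // local part
		"@" +
		"[a-z0-9-]+(\\.[a-z0-9-]+)*$") // domain part

// isValidEmailAddress mirrors the shape of the C++ function: empty input
// is rejected, everything else must match the pattern in full.
func isValidEmailAddress(address string) bool {
	return address != "" && emailRe.MatchString(address)
}

func main() {
	samples := []string{"user@example.com", "User@Example.COM", "a@b", "no-at-sign", "a@b..c", ""}
	for _, a := range samples {
		fmt.Printf("%-20q %t\n", a, isValidEmailAddress(a))
	}
}
```

Note that the pattern is deliberately permissive: "a@b" passes because the dotted suffix of the domain part is optional, while "a@b..c" fails because every dot must be followed by at least one domain character.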

Is it possible to interpret a C# expression tree to emit JavaScript?

For example, if you have an expression like this:
Expression<Func<int, int>> fn = x => x * x;
Is there anything that will traverse the expression tree and generate this?
"function(x) { return x * x; }"

It's probably not easy, but yes, it's absolutely feasible. ORMs like Entity Framework or Linq to SQL do it to translate Linq queries into SQL, but you can actually generate anything you want from the expression tree...
You should implement an ExpressionVisitor to analyse and transform the expression.
EDIT: here's a very basic implementation that works for your example:
Expression<Func<int, int>> fn = x => x * x;
var visitor = new JsExpressionVisitor();
visitor.Visit(fn);
Console.WriteLine(visitor.JavaScriptCode);
...
class JsExpressionVisitor : ExpressionVisitor
{
    private readonly StringBuilder _builder;
    public JsExpressionVisitor()
    {
        _builder = new StringBuilder();
    }
    public string JavaScriptCode
    {
        get { return _builder.ToString(); }
    }
    public override Expression Visit(Expression node)
    {
        _builder.Clear();
        return base.Visit(node);
    }
    protected override Expression VisitParameter(ParameterExpression node)
    {
        _builder.Append(node.Name);
        base.VisitParameter(node);
        return node;
    }
    protected override Expression VisitBinary(BinaryExpression node)
    {
        base.Visit(node.Left);
        _builder.Append(GetOperator(node.NodeType));
        base.Visit(node.Right);
        return node;
    }
    protected override Expression VisitLambda<T>(Expression<T> node)
    {
        _builder.Append("function(");
        for (int i = 0; i < node.Parameters.Count; i++)
        {
            if (i > 0)
                _builder.Append(", ");
            _builder.Append(node.Parameters[i].Name);
        }
        _builder.Append(") {");
        if (node.Body.Type != typeof(void))
        {
            _builder.Append("return ");
        }
        base.Visit(node.Body);
        _builder.Append("; }");
        return node;
    }
    private static string GetOperator(ExpressionType nodeType)
    {
        switch (nodeType)
        {
            case ExpressionType.Add:
                return " + ";
            case ExpressionType.Multiply:
                return " * ";
            case ExpressionType.Subtract:
                return " - ";
            case ExpressionType.Divide:
                return " / ";
            case ExpressionType.Assign:
                return " = ";
            case ExpressionType.Equal:
                return " == ";
            case ExpressionType.NotEqual:
                return " != ";
            // TODO: Add other operators...
        }
        throw new NotImplementedException("Operator not implemented");
    }
}
It only handles lambdas with a single instruction, but anyway the C# compiler can't generate an expression tree for a block lambda.
There's still a lot of work to do of course, this is a very minimal implementation... you probably need to add method calls (VisitMethodCall), property and field access (VisitMember), etc.

Script# is used by Microsoft internal developers to do exactly this.

Take a look at Lambda2Js, a library created by Miguel Angelo for this exact purpose.
It adds a CompileToJavascript extension method to any Expression.
Example 1:
Expression<Func<MyClass, object>> expr = x => x.PhonesByName["Miguel"].DDD == 32 | x.Phones.Length != 1;
var js = expr.CompileToJavascript();
Assert.AreEqual("PhonesByName[\"Miguel\"].DDD==32|Phones.length!=1", js);
Example 2:
Expression<Func<MyClass, object>> expr = x => x.Phones.FirstOrDefault(p => p.DDD > 10);
var js = expr.CompileToJavascript();
Assert.AreEqual("System.Linq.Enumerable.FirstOrDefault(Phones,function(p){return p.DDD>10;})", js);
More examples here.

The expression has already been parsed for you by the C# compiler; all that remains is for you to traverse the expression tree and generate the code. Traversing the tree can be done recursively, and each node could be handled by checking what type it is (there are several subclasses of Expression, representing e.g. functions, operators, and member lookup). The handler for each type can generate the appropriate code and traverse the node's children (which will be available in different properties depending on which expression type it is). For instance, a function node could be processed by first outputting "function(" followed by the parameter name followed by ") {". Then, the body could be processed recursively, and finally, you output "}".
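The recursive traversal described above is not specific to C#. As a language-neutral illustration, here is a Go sketch that uses the standard go/parser and go/ast packages in place of C# expression trees: it parses a small Go expression and emits a JavaScript function by switching on the node type and recursing into the children. It is an analogue under stated assumptions, not a C# solution; the names toJS, emit and jsOps are hypothetical, and only identifiers, literals, parentheses and a handful of binary operators are handled.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// jsOps lists the binary operators this sketch knows how to emit.
var jsOps = map[token.Token]string{
	token.ADD: "+", token.SUB: "-", token.MUL: "*", token.QUO: "/",
	token.EQL: "==", token.NEQ: "!=",
}

// emit appends a JavaScript rendering of e to b, recursing into child nodes.
func emit(b *strings.Builder, e ast.Expr) error {
	switch n := e.(type) {
	case *ast.Ident:
		b.WriteString(n.Name)
	case *ast.BasicLit:
		b.WriteString(n.Value)
	case *ast.ParenExpr:
		b.WriteString("(")
		if err := emit(b, n.X); err != nil {
			return err
		}
		b.WriteString(")")
	case *ast.BinaryExpr:
		op, ok := jsOps[n.Op]
		if !ok {
			return fmt.Errorf("operator %s not implemented", n.Op)
		}
		if err := emit(b, n.X); err != nil {
			return err
		}
		b.WriteString(" " + op + " ")
		if err := emit(b, n.Y); err != nil {
			return err
		}
	default:
		return fmt.Errorf("node type %T not implemented", e)
	}
	return nil
}

// toJS wraps the translated body in a one-parameter JavaScript function.
func toJS(param, body string) (string, error) {
	expr, err := parser.ParseExpr(body)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	b.WriteString("function(" + param + ") { return ")
	if err := emit(&b, expr); err != nil {
		return "", err
	}
	b.WriteString("; }")
	return b.String(), nil
}

func main() {
	js, err := toJS("x", "x * x")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(js) // function(x) { return x * x; }
}
```

Returning an error for unknown node types, rather than emitting something approximate, keeps the generator honest about which constructs it supports; each new node type then becomes one more case in the switch.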

A few people have developed open source libraries seeking to solve this problem. The one I have been looking at is Linq2CodeDom, which converts expressions into a CodeDom graph, which can then be compiled to JavaScript as long as the code is compatible.
Script# leverages the original C# source code and the compiled assembly, not an expression tree.
I made some minor edits to Linq2CodeDom to add JScript as a supported language--essentially just adding a reference to Microsoft.JScript, updating an enum, and adding one more case in GenerateCode. Here is the code to convert an expression:
var c = new CodeDomGenerator();
c.AddNamespace("Example")
    .AddClass("Container")
    .AddMethod(
        MemberAttributes.Public | MemberAttributes.Static,
        (int x) => "Square",
        Emit.@return<int, int>(x => x * x)
    );
Console.WriteLine(c.GenerateCode(CodeDomGenerator.Language.JScript));
And here is the result:
package Example
{
    public class Container
    {
        public static function Square(x : int)
        {
            return (x * x);
        }
    }
}
The method signature reflects the more strongly-typed nature of JScript. It may be better to use Linq2CodeDom to generate C# and then pass this to Script# to convert this to JavaScript. I believe the first answer is the most correct, but as you can see by reviewing the Linq2CodeDom source, there is a lot of effort involved on handling every case to generate the code correctly.
