DOMDocument - get script text from within body - javascript

I am trying to collect the scripts inside the body tag, but only inline scripts that contain code, not scripts that link to a file,
e.g. <script type="text/javascript">console.log("for a test run");</script>
and not scripts that have a src attribute.
I want to move those scripts to the end of the page, just before </body>.
So far I have:
echo "<pre>";
echo "reaches 1 <br />";
//work for inpage scripts
$mainBody = @$dom->getElementsByTagName('body')->item(0);
foreach (@$dom->getElementsByTagName('body') as $head) {
echo "reaches 2";
foreach (@$head->childNodes as $node) {
echo "reaches 3";
var_dump($node);
if ($node instanceof DOMComment) {
if (preg_match('/<script/i', $node->nodeValue)){
$src = $node->nodeValue;
echo "its a node";
var_dump($node);
}
}
if ($node->nodeName == 'script' && $node->attributes->getNamedItem('type')->nodeValue == 'text/javascript') {
if (@$src = $node->attributes->getNamedItem('src')->nodeValue) {
// yay - $src was true, so we don't do anything here
} else {
$src = $node->nodeValue;
}
echo "its a node2";
var_dump($node);
}
if (isset($src)) {
$move = ($this->params->get('exclude')) ? true : false;
foreach ($omit as $omitit) {
if (preg_match($omitit, $src) == 1) {
$move = ($this->params->get('exclude')) ? false : true;
break;
}
}
if ($move)
$moveme[] = $node;
unset($src);
}
}
}
foreach ($moveme as $moveit) {
echo "Moving";
print_r($moveit);
$mainBody->appendChild($moveit->cloneNode(true));
if ($pretty) {
$mainBody->appendChild($newline->cloneNode(false));
}
$moveit->parentNode->removeChild($moveit);
}
$mainBody = $xhtml ? $dom->saveXML() : $dom->saveHTML();
JResponse::setBody($sanitize?preg_replace($this->sanitizews['search'],$this->sanitizews['replace'],$mainBody):$mainBody);
Update 1
The problem is that <script type="text/javascript"> can also be inside a div, or inside nested divs. Iterating over @$head->childNodes only visits the top-level children of the body and does not descend into inner tags that may contain <script> tags. I don't understand how to get all the script tags I need.
There is no error, but there are also no script tags among the top-level nodes.
Update 2
Thanks for the XPath answer; it gave me some progress. But now, after moving the scripts to the footer, I can't delete the original script tags.
Here is the updated code I have so far:
echo "<pre>3";
// echo "reaches 1 <br />";
//work for inpage scripts
$xpath = new DOMXPath($dom);
$script_tags = $xpath->query('//body//script[not(@src)]');
foreach ($script_tags as $tag) {
// var_dump($tag->nodeValue);
$moveme[] = $tag;
}
$mainBody = @$dom->getElementsByTagName('body')->item(0);
foreach ($moveme as $moveItScript) {
print_r($moveItScript->cloneNode(true));
$mainBody->appendChild($moveItScript->cloneNode(true));
// var_dump($moveItScript->parentNode);
// $moveItScript->parentNode->removeChild($moveItScript);
/* try{
$mainBody->appendChild($moveit->cloneNode(true));
if ($pretty) {
$body->appendChild($newline->cloneNode(false));
}
$moveit->parentNode->removeChild($moveit);
}catch (Exception $ex){
var_dump($ex);
}*/
}
echo "</pre>";
Update 3
I was working on a Joomla site and trying to move scripts to the footer of the page. I had used the scriptsdown plugin, which moves the scripts from the head tag to the bottom, but scripts in the middle of the page were not moved, and that was causing the in-page scripts to not respond properly.
My problem is now solved. I am posting my solution code in case it helps someone in the future.
function onAfterRender() {
$app = JFactory::getApplication();
$doc = JFactory::getDocument();
/* test that the page is not administrator && test that the document is HTML output */
if ($app->isAdmin() || $doc->getType() != 'html')
return;
$pretty = (int)$this->params->get('pretty', 0);
$stripcomments = (int)$this->params->get('stripcomments', 0);
$sanitize = (int)$this->params->get('sanitize',0);
$debug = (int)$app->getCfg('debug',0);
if($debug) $pretty = true;
$omit = array();
/* now we know this is a frontend page and it is html - begin processing */
/* first - prepare the omit array */
if (strlen(trim($this->params->get('omit'))) > 0) {
foreach (explode("\n", $this->params->get('omit')) as $omitme) {
$omit[] = '/' . str_replace(array('/', '\''), array('\/', '\\\''), trim($omitme)) . '/i';
}
unset($omitme);
}
$moveme = array();
$dom = new DOMDocument();
$dom->recover = true;
$dom->substituteEntities = true;
if ($pretty) {
$dom->formatOutput = true;
} else {
$dom->preserveWhiteSpace = false;
}
$source = JResponse::getBody();
/* DOMDocument can get quite vocal when malformed HTML/XHTML is loaded.
* First we grab the current level, and set the error reporting level
* to zero, afterwards, we return it to the original value. This trickery
* is used to keep the logs clear of DOMDocument protests while loading the source.
* I promise to set the level back as soon as I'm done loading source...
*/
if(!$debug) $erlevel = error_reporting(0);
$xhtml = (preg_match('/XHTML/', $source)) ? true : false;
switch ($xhtml) {
case true:
$dom->loadXML($source);
break;
case false:
$dom->loadHTML($source);
break;
}
if(!$debug) error_reporting($erlevel); /* You see, error_reporting is back to normal - just like I promised */
if ($pretty) {
$newline = $dom->createTextNode("\n");
}
if($sanitize && !$debug && !$pretty) {
$this->_sanitizeCSS($dom->getElementsByTagName('style'));
}
if ($stripcomments && !$debug) {
$comments = $this->_domComments($dom);
foreach ($comments as $node)
if (!preg_match('/\[endif]/i', $node->nodeValue)) // we don't remove IE conditionals
if ($node->parentNode->nodeName != 'script') // we also don't remove comments in javascript because some developers write JS inside of a comment
$node->parentNode->removeChild($node);
}
$body = @$dom->getElementsByTagName('footer')->item(0);
foreach (@$dom->getElementsByTagName('head') as $head) {
foreach (@$head->childNodes as $node) {
if ($node instanceof DOMComment) {
if (preg_match('/<script/i', $node->nodeValue))
$src = $node->nodeValue;
}
if ($node->nodeName == 'script' && $node->attributes->getNamedItem('type')->nodeValue == 'text/javascript') {
if (@$src = $node->attributes->getNamedItem('src')->nodeValue) {
// yay - $src was true, so we don't do anything here
} else {
$src = $node->nodeValue;
}
}
if (isset($src)) {
$move = ($this->params->get('exclude')) ? true : false;
foreach ($omit as $omitit) {
if (preg_match($omitit, $src) == 1) {
$move = ($this->params->get('exclude')) ? false : true;
break;
}
}
if ($move)
$moveme[] = $node;
unset($src);
}
}
}
foreach ($moveme as $moveit) {
$body->appendChild($moveit->cloneNode(true));
if ($pretty) {
$body->appendChild($newline->cloneNode(false));
}
$moveit->parentNode->removeChild($moveit);
}
//work for inpage scripts
$xpath = new DOMXPath($dom);
$script_tags = $xpath->query('//body//script[not(@src)]');
$mainBody = @$dom->getElementsByTagName('body')->item(0);
foreach ($script_tags as $tag) {
$mainBody->appendChild($tag->cloneNode(true));
$tag->parentNode->removeChild($tag);
}
$body = $xhtml ? $dom->saveXML() : $dom->saveHTML();
JResponse::setBody($sanitize?preg_replace($this->sanitizews['search'],$this->sanitizews['replace'],$body):$body);
}

To get only the <script> nodes that don't have a src attribute, it is easier to use DOMXPath:
$xpath = new DOMXPath($dom);
$script_tags = $xpath->query('//body//script[not(@src)]');
The variable $script_tags is now a DOMNodeList object that contains all of the matching script tags.
You can now loop over the DOMNodeList to get all the nodes and do whatever you would like to do with them:
foreach ($script_tags as $tag) {
var_dump($tag->nodeValue);
$moveme[] = $tag;
}
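On the removal problem from Update 2, here is a minimal sketch, assuming the page is already available as a string. Calling appendChild() with a node that is already part of the document moves it rather than copying it, so a separate cloneNode()/removeChild() pair should not be needed. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample page: one inline script nested in a div, one external script.
$html = '<html><head></head><body><div><script>var a = 1;</script><p>Hi</p></div>'
      . '<script src="lib.js"></script></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$body  = $dom->getElementsByTagName('body')->item(0);

// Select every script under <body> that has no src attribute, at any depth.
foreach ($xpath->query('//body//script[not(@src)]') as $script) {
    // appendChild() detaches the node from its old parent and re-attaches it
    // as the last child of <body>, so the original is gone automatically.
    $body->appendChild($script);
}

$result = $dom->saveHTML();
```

The XPath result is a static list, so moving nodes while iterating over it does not skip entries.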

Related

PHP 5.4.16 DOMDocument removes parts of Javascript

I am trying to load an HTML page from a remote server into a PHP script that manipulates the HTML with the DOMDocument class. But I have noticed that DOMDocument removes parts of the JavaScript that comes with the HTML page. The page contains things like:
<script type="text/javascript">
//...
function printJSPage() {
var printwin=window.open('','haha','top=100,left=100,width=800,height=600');
printwin.document.writeln(' <table border="0" cellspacing="5" cellpadding="0" width="100%">');
printwin.document.writeln(' <tr>');
printwin.document.writeln(' <td align="left" valign="bottom">');
//...
printwin.document.writeln('</td>');
//...
}
</script>
But DOMDocument changes, for example, the line
printwin.document.writeln('</td>');
to
printwin.document.writeln(' ');
and also a lot of other things (e.g. the last script tag is no longer there). As a result I get a completely broken page, which I cannot pass on.
So I think DOMDocument has problems with the HTML tags within the JavaScript code and tries to correct the code to produce a well-formed document. Can I prevent DOMDocument from parsing the JavaScript?
The PHP code fragment is:
$stdin = file_get_contents('php://stdin');
$dom = new \DOMDocument();
@$dom->loadHTML($stdin);
return $dom->saveHTML(); // will produce wrong HTML
//return $stdin; // will produce correct HTML
I have stored both HTML versions and have compared both with Meld.
I also have tested
@$dom->loadXML($stdin);
return $dom->saveHTML();
but I don't get anything back from the object.
Here's a hack that might be helpful. The idea is to replace the script contents with a string that's guaranteed to be valid HTML and unique then replace it back.
It replaces all contents inside script tags with the MD5 of those contents and then replaces them back.
$scriptContainer = [];
$str = preg_replace_callback ("#<script([^>]*)>(.*?)</script>#s", function ($matches) use (&$scriptContainer) {
$scriptContainer[md5($matches[2])] = $matches[2];
return "<script".$matches[1].">".md5($matches[2])."</script>";
}, $str);
$dom = new \DOMDocument();
@$dom->loadHTML($str);
$final = strtr($dom->saveHTML(), $scriptContainer);
Here strtr is just convenient due to the way the array is formatted, using str_replace(array_keys($scriptContainer), $scriptContainer, $dom->saveHTML()) would also work.
I find it very surprising that PHP does not properly parse HTML content. It seems to be parsing it as XML instead (and wrongly at that, because CDATA content is parsed instead of being treated literally). However, it is what it is, and if you want a real document parser then you should probably look into a Node.js solution with jsdom.
If you have a <script> within a <script>, the following (not so smart) solution will handle that. There is still a problem: if the <script> tags are not balanced, the solution will not work. This could occur if your JavaScript uses String.fromCharCode to print the string </script>.
$scriptContainer = array();
function getPosition($tag) {
return $tag[0][1];
}
function getContent($tag) {
return $tag[0][0];
}
function isStart($tag) {
$x = getContent($tag);
return ($x[0].$x[1] === "<s");
}
function isEnd($tag) {
$x = getContent($tag);
return ($x[0].$x[1] === "</");
}
function mask($str, $scripts) {
global $scriptContainer;
$res = "";
$start = null;
$stop = null;
$idx = 0;
$count = 0;
foreach ($scripts as $tag) {
if (isStart($tag)) {
$count++;
$start = ($start === null) ? $tag : $start;
}
if (isEnd($tag)) {
$count--;
$stop = ($count == 0) ? $tag : $stop;
}
if ($start !== null && $stop !== null) {
$res .= substr($str, $idx, getPosition($start) - $idx);
$res .= getContent($start);
$code = substr($str, getPosition($start) + strlen(getContent($start)), getPosition($stop) - getPosition($start) - strlen(getContent($start)));
$hash = md5($code);
$res .= $hash;
$res .= getContent($stop);
$scriptContainer[$hash] = $code;
$idx = getPosition($stop) + strlen(getContent($stop));
$start = null;
$stop = null;
}
}
$res .= substr($str, $idx);
return $res;
}
preg_match_all("#\<script[^\>]*\>|\<\/script\>#s", $html, $scripts, PREG_OFFSET_CAPTURE|PREG_SET_ORDER);
$html = mask($html, $scripts);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_use_internal_errors(false);
// handle some things within DOM
echo strtr($dom->saveHTML(), $scriptContainer);
If you replace the "script" String within the preg_match_all with "style" you can also mask the CSS styles, which can contain tag names too (i.e. within comments).
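As a rough illustration of masking both tags, here is a sketch that parameterizes the simpler regex-based hack from the first answer by tag name. It does not handle nested or unbalanced tags, and the helper name maskTagContents and the sample markup are hypothetical. It assumes PHP 7 or later.

```php
<?php
// Replace the contents of every <$tag>...</$tag> pair with an MD5 placeholder
// and remember the original contents in $container (hash => original text).
function maskTagContents(string $html, string $tag, array &$container): string
{
    $name    = preg_quote($tag, '#');
    $pattern = '#(<' . $name . '\b[^>]*>)(.*?)(</' . $name . '>)#is';

    return preg_replace_callback($pattern, function ($m) use (&$container) {
        $hash = md5($m[2]);
        $container[$hash] = $m[2];
        return $m[1] . $hash . $m[3];
    }, $html);
}

// Hypothetical page with markup inside both a CSS comment and a JS string.
$html = '<html><head><style>/* <b>bold</b> */ p { color: red; }</style></head>'
      . '<body><script>document.writeln("</td>");</script><p>Text</p></body></html>';

$container = [];
$masked = maskTagContents($html, 'style', $container);
$masked = maskTagContents($masked, 'script', $container);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($masked);
libxml_clear_errors();

// ... manipulate $dom here ...

// Put the original script and style contents back in place of the placeholders.
$final = strtr($dom->saveHTML(), $container);
```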

Screen-scraping JavaScript in PHP

I can successfully scrape all the items on this page using this script:
$html = file_get_contents($list_url);
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
if(!empty($html))
{
$doc->loadHTML($html);
libxml_clear_errors(); // remove errors for yucky html
$xpath = new DOMXPath($doc);
/* FIND LINK TO PRODUCT PAGE */
$products = array();
$row = $xpath->query($product_location);
/* Create an array containing products */
if ($row->length > 0)
{
foreach ($row as $location)
{
$product_urls[] = $product_url_root . $location->getAttribute('href');
}
}
else { echo "product location is wrong<br>";}
$imgs = $xpath->query($photo_location);
/* Create an array containing the image links */
if ($imgs->length > 0)
{
foreach ($imgs as $img)
{
$photo_url[] = $photo_url_root . $img->getAttribute('src');
}
}
else { echo "photo location is wrong<br>";}
$was = $xpath->query($was_price_location);
/* Create an array containing the was price */
if ($was->length > 0)
{
foreach ($was as $price)
{
$stripped = preg_replace("/[^0-9,.]/", "", $price->nodeValue);
$was_price[] = "£".$stripped;
}
}
else { echo "was price location is wrong<br>";}
$now = $xpath->query($now_price_location);
/* Create an array containing the sale price */
if ($now->length > 0)
{
foreach ($now as $price)
{
$stripped = preg_replace("/[^0-9,.]/", "", $price->nodeValue);
$stripped = number_format((float)$stripped, 2, '.', '');
$now_price[] = "£".$stripped;
}
}
else { echo "now price location is wrong<br>";}
$result = array();
/* Create an associative array containing all the above values */
foreach ($product_urls as $i => $product_url)
{
$result[] = array(
'product_url' => $product_url,
'shop_name' => $shop_name,
'photo_url' => $photo_url[$i],
'was_price' => $was_price[$i],
'now_price' => $now_price[$i]
);
}
}
However, a problem arises if I want to get page two, or if I view 100 items per page: file_get_contents($list_url) always returns page one with its 24 values.
I presume that the page changes are being handled via AJAX request (though I can't find any evidence of this in the source). Is there a way to scrape exactly what I see on the screen?
I've seen talk of PhantomJS in previous answers, but I'm not sure it would be appropriate here given that I'm working in PHP.
It's because of a hashtag in the link which is generated by some js script. Turn off javascript for that site and check the output links it generates.
For example for page two it is http://www.hm.com/gb/subdepartment/sale?page=1
// file_get_html() comes from the Simple HTML DOM Parser library, not from core PHP
// Create DOM from URL or file
$file = file_get_html('http://stackoverflow.com/');
// Find your links
foreach ($file->find('a') as $yourElement) {
echo $yourElement->href . '<br>';
}
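For comparison, a similar link listing can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. This is only an illustration: the helper name extractLinks and the sample markup are assumptions, and it works on an HTML string you have already fetched (for example with file_get_contents on the non-JavaScript page URL).

```php
<?php
// Return the href of every <a> element that actually has one.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // silence warnings for messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ((new DOMXPath($doc))->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Hypothetical pagination fragment.
$links = extractLinks(
    '<p><a href="/sale?page=1">2</a> <a name="x">no href</a> <a href="/sale?page=2">3</a></p>'
);
```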

Could someone clarify this code snippet for me?

This piece should create a csv file. The method that is calling to the nonAjaxPost is:
function exportCSV()
{
nonAjaxPost('getExport', 'post', {action: '/getView', 'view': current_pi, 'parameters': encodeURIComponent(JSON.stringify(current_parameters))});
}
function nonAjaxPost(action, method, input) {
"use strict";
var form;
form = $('<form />', {
action: action,
method: method,
style: 'display: none;'
});
if (typeof input !== 'undefined') {
$.each(input, function (name, value) {
$('<input />', {
type: 'hidden',
name: name,
value: value
}).appendTo(form);
});
}
form.appendTo('body').submit();
}
My problem is that I just can't seem to understand how this is going to create a CSV file for me. I'm probably missing something that I just can't see.
I really hope someone can help me out.
Update:
This is the getExport function:
$databundle = $this->_getData();
$data = $databundle['rows'];
$columns_all = $databundle['columns'];
$columns = array("Id");
foreach($data[0] as $key => $column) {
$column = "";
$found = false;
foreach($columns_all as $col_search) {
if($col_search['key'] == @$key) {
$found = true;
$column = $col_search['title'];
break;
}
}
if($found) {
//echo $key . ",";
$columns[] = $column;
}
}
$contents = putcsv($columns, ';', '"');
foreach($data as $key => $vals) {
if(isset($vals['expand'])) {
unset($vals['expand']);
}
array_walk($vals, '__decode');
$contents .= putcsv($vals,';', '"');
}
$response = Response::make($contents, 200);
$response->header("Last-Modified",gmdate("D, d M Y H:i:s") . " GMT");
$response->header("Content-type","text/x-csv");
$response->header("Content-Disposition","attachment; filename=".str_replace(" ","_",$databundle['title'])."_".date("Y-m-d_H:i").".csv");
return $response;
It also calls the getData function which is this:
$viewClass = str_replace('/', '', (isset($_POST['view']) ? $_POST['view'] : $_GET['view']));
$fileView = '../app/classes/view.'.$viewClass.'.php';
if(file_exists($fileView))
{
require_once($fileView);
$className = 'view_'.$viewClass;
if(class_exists($className))
{
$view = new $className();
//Seek for parameters
if(isset($_REQUEST['parameters']))
{
//Decode parameters into array
$parameters = json_decode(urldecode((isset($_POST['parameters']) ? $_POST['parameters'] : $_GET['parameters'])),true);
//Get supported parameters
$parameterTypes = $view->getVars();
$vars = array();
foreach($parameterTypes as $key => $type)
{
//If a value is found for a supported parameter in $_GET
if(isset($parameters[$key]))
{
switch($type)
{
case 'int':
$vars[$key] = intval($parameters[$key]);
break;
case 'float':
$vars[$key] = floatval($parameters[$key]);
break;
case 'filterdata':
// todo: date validation
$vars[$key] = $parameters[$key];
break;
}
}
}
$view->setVars($vars);
}
return $view->getData();
}
else {
/*
header('HTTP/1.1 500 Internal Server Error');
echo 'Class ' . $className . ' does not exist.';
*/
return false;
}
}
else {
/*
header('HTTP/1.0 404 Not Found');
die('Cannot locate view (' . $fileView . ').');
*/
return false;
I hope this is sufficient.
In short, what I am trying to find out is why the CSV it produces has more columns than column headers, and where the difference comes from.
My guess would be that the page you are calling (on the server) is generating the CSV file.
You would need to write code on the server to do the conversion.
This method is making a post request to getView page. Your csv create code would be present on getView page.
This is the front end code that creates an invisible form with your data: current_parameters.
See the content of current_parameters in the current file.
Review back-end code and look for the "getExport" function (it should be the current php file loaded)
If you just copied this function from some example... you have to add also the back-end code on your own.
Update:
look at the getExport code:
$contents = putcsv($columns, ';', '"');
$contents .= putcsv($vals,';', '"');
The first call inserts the row of titles, and the second, inside the loop, inserts the data rows.
Print the contents of $columns and $vals and see what is happening.
There are some strange conditions for filtering the columns, but I can't help you further if you don't show the data you are trying to parse.
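One possible reading of the posted getExport code is that the header loop only keeps keys that have a matching entry in $columns_all, while each data row is written with all of its values, so any row key without a column definition adds an unlabeled column. The sketch below is not the original putcsv helper; it uses the built-in fputcsv with made-up data and drives both the header and the rows from the same key list so the counts cannot drift apart.

```php
<?php
// Format one CSV line in memory using the built-in fputcsv().
function toCsvLine(array $fields, string $delimiter = ';', string $enclosure = '"'): string
{
    $handle = fopen('php://temp', 'r+');
    fputcsv($handle, $fields, $delimiter, $enclosure, '\\');
    rewind($handle);
    $line = stream_get_contents($handle);
    fclose($handle);
    return $line;
}

// Hypothetical column definitions and data; 'internal_flag' has no definition.
$columnsAll = [
    ['key' => 'name', 'title' => 'Name'],
    ['key' => 'city', 'title' => 'City'],
];
$rows = [
    ['name' => 'Ann', 'city' => 'Oslo', 'internal_flag' => 1],
];

$keys = array_column($columnsAll, 'key');
$csv  = toCsvLine(array_column($columnsAll, 'title'));

foreach ($rows as $row) {
    $values = [];
    foreach ($keys as $key) {
        // Only emit values that have a header; missing ones become empty cells.
        $values[] = $row[$key] ?? '';
    }
    $csv .= toCsvLine($values);
}
```

With this layout, a value such as internal_flag is dropped instead of silently widening the row.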

PHP variable appears to be null while it is not

I have a small PHP class, which I have edited a little to ask my question.
The class is shown below:
class Register
{
public $notification = null;
public function __construct()
{
create_connection();
$this->validate_register();
}
public function validate_register()
{
//edit: missing double quote close
$select_register = "SELECT * FROM `student_reg`";
if($select_register_run = mysql_query($select_register))
{
$rows_returned = mysql_num_rows($select_register_run);
if($rows_returned >= 1)
{
$this->notification = 'error';
}else if($rows_returned == 0){
$this->notification = 'success';
}
}else{
$this->notification = 'error';
}
if($this->notification != null)
{
echo 'not null';
}else{ echo 'null';}
}
}
$new_register = new Register();
?>
It is clear from the class that, on every possible path, a value is assigned to $this->notification. But for some reason, the class echoes 'null'.
The create_connection() function I built works perfectly, but I have omitted it for the purpose of this question.
Why is this the case?
Actually, if $rows_returned is less than 1 and not equal to 0, the code will echo 'null', so I suggest you echo $rows_returned.
Try like this...
$select_register_run = mysql_query($select_register);
if($select_register_run){
//rest code
instead of
if($select_register_run = mysql_query($select_register))

ajax if statement not working properly

Hi, I have some AJAX code in which the if condition never works: whenever the program runs, only the else branch executes, even when the response should satisfy the if condition.
<script type="text/javascript">
function CheckDetails()
{
var http = false;
if (navigator.appName == "Microsoft Internet Explorer") {
http = new ActiveXObject("Microsoft.XMLHTTP");
} else {
http = new XMLHttpRequest();
}
var rid = document.new_booking.ph_number.value;
http.onreadystatechange = function() {
if (http.readyState == 4) {
var str_d = http.responseText;
if (str_d == "no") {
document.getElementById('cus_name').focus();
} else {
var str_details = http.responseText;
var arr_details = str_details.split("~");
document.new_booking.cus_name.value = arr_details[0];
document.new_booking.pick_aline1.value = arr_details[1];
document.new_booking.pick_aline2.value = arr_details[2];
document.new_booking.pick_area.value = arr_details[3];
document.new_booking.pick_pincode.value = arr_details[4];
document.new_booking.drop_aline1.focus();
}
}
}
http.open("GET", "ajax.php?evnt=det&rid=" + rid);
http.send();
}
</script>
and its ajax.php file is given below
<?php
if ($_GET['evnt'] == 'det') {
$rid = $_GET['rid'];
include("configure.php");
$select = mysql_query("select * from new_booking where ph_number = '$rid'");
$count = mysql_num_rows($select);
if ($count > 0) {
$row = mysql_fetch_array($select);
echo $row['cus_name']
. "~" . $row['pick_aline1']
. "~" . $row['pick_aline2']
. "~" . $row['pick_area']
. "~" . $row['pick_pincode']
. "~" . $row['drop_aline1']
. "~" . $row['drop_aline2']
. "~" . $row['drop_area']
. "~" . $row['drop_pincode'];
} else {
echo "no";
}
}
?>
You can open your page with Chrome (or Chromium) and then debug your javascript code using builtin debugger (Ctrl+Shift+I, "Console" tab). I guess you will see some JS errors there.
Basically, your code works OK (at least when I removed all database access from it, since I don't have your DB).
If you don't like Chrome, use Firefox and FireBug extension. On 'Network' page you can see that your ajax request was executed (or not executed).
If I had to guess, I would say that some whitespace is sneaking into your AJAX response somewhere. Because "no " isn't equal to "no", it is always hitting your else branch.
You might consider sending back a whitespace-agnostic value. You could rewrite the whole mess to use JSON, which would require a lot less work on both ends:
// PHP:
if ($_GET['evnt'] == 'det') {
$rid = $_GET['rid'];
include("configure.php");
$select = mysql_query("select * from new_booking where ph_number = '$rid'");
$count = mysql_num_rows($select);
if ($count > 0) {
$row = mysql_fetch_array($select);
// YOU PROBABLY WANT TO WHITELIST WHAT YOU PASS TO JSON_ENCODE
echo json_encode($row);
} else {
echo json_encode([ "error" => "no" ]);
}
}
// js:
http.onreadystatechange = function() {
if (http.readyState == 4) {
var str_d;
try {
str_d = JSON.parse(http.responseText);
} catch(e) {
str_d = false;
}
if (str_d && str_d.error === "no") {
document.getElementById('cus_name').focus();
} else {
document.new_booking.cus_name.value = str_d.cus_name;
document.new_booking.pick_aline1.value = str_d.pick_aline1;
document.new_booking.pick_aline2.value = str_d.pick_aline2;
// etc...
}
}
}
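The whitelisting hinted at in the PHP comment above could look roughly like this. It is a sketch under assumptions: the helper name bookingResponse is hypothetical, the allowed field list is taken from the form fields the question fills in, and the row is passed in as a plain array rather than fetched from the database.

```php
<?php
// Build the JSON reply: only whitelisted columns are exposed to the browser.
function bookingResponse(?array $row): string
{
    if ($row === null) {
        return json_encode(['error' => 'no']);
    }

    $allowed = ['cus_name', 'pick_aline1', 'pick_aline2', 'pick_area', 'pick_pincode'];

    // array_intersect_key() keeps only the entries of $row whose keys are allowed,
    // which also drops the numeric duplicates that mysql_fetch_array() adds.
    return json_encode(array_intersect_key($row, array_flip($allowed)));
}

// Hypothetical row with a numeric duplicate and an internal column.
$json = bookingResponse(['cus_name' => 'Ann', 0 => 'Ann', 'pick_area' => 'North', 'internal_id' => 7]);
```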
I was having similar problem. This worked for me.
if (str_d.replace(/^\s+|\s+$/g, '') == "no")
