I am writing a macro in VBA for Excel that examines a page and I need to extract a variable that is set within a script in the HTML. In the macro, I create an IE object thusly:
Set objIE = CreateObject("InternetExplorer.Application")
Pages in the target website all have the following script which defines a variable called digitalData.page.pageName. Here is a snippet of HTML:
<script>
var digitalData = '';
function init() {
digitalData = {
"user": {
"userLoginState": "guest",
"userCountry": "",
"userEmployeeName": "",
"userBirthday": "",
"userGender": "",
"userState": "",
"userID": "",
"LRUserID": "",
"userEmployeeID": "",
"userDWID": "",
"userSessionId": "BYTEzHFAdLrPoPPOlTPGWvlBjCx54jjEyB8="
},
"page": {
"pageName": "en_us:plp:men:clothing:Casual Shirts",
"pageType": "plp",
"pageGender": "men",
"pageLocale": "us",
"pageRedirected": "no",
"pageJSErrorCount": "3",
"pageLevel1": "men",
"pageLevel2": "men/clothing",
"pageLevel3": "men/clothing/Casual Shirts",
"pageLevel4": "men/clothing/Casual Shirts",
"pageHierarchy": "men/clothing/Casual Shirts"
},
If I open a page from this site in Chrome and inspect it, I can type the variable name in the console and it will return the value but I can't seem to access the variable from IE using VBA like this:
inspectLink(i, 1) = objIE.digitalData.page.pageName
In this case, I would like to find en_us:plp:men:clothing:Casual Shirts in inspectLink(i, 1) but instead I get Runtime error '438' Object doesn't support the property or method.
Dim inspectCat(4) As String
inspectCat(0) = "webcat=men"
inspectCat(1) = "webcat=women"
inspectCat(2) = "webcat=kids"
inspectCat(3) = "webcat=baby"
inspectCat(4) = "webcat=home"
Dim targetSearchCount as Integer
Dim failedSearchCount as Integer
targetSearchCount=0
failedSearchCount=0
REM New Code - DOES NOT WORK cannot access pageName this way
REM if digitalData.page.pageName has en_us: in it, then it's our target
REM if it has failedSearchResult in it, then report to web dev team
REM syntax might be objIE.Document.digitalData.page.pageName
REM inspectLink(i, 1) = objIE.digitalData.page.pageName
REM MsgBox inspectLink(i, 1)
REM if inStr(objIE.digitalData.page.pageName, "en_us:") then targetSearchCount=targetSearchCount+1 endif
REM if inStr(objIE.digitalData.page.pageName, "failed_Search_Result") then failedSearchCount=failedSearchCount+1 endif
REM End New Code
REM Begin Old Code - WORKS BUT "dublicate" MAY NOT BE RELIABLE OVER TIME
REM
Set pageNameDubs = objIE.Document.GetElementsByClassName("page-Name-dublicate")
'MsgBox pageNameDubs(0).Value
For Each pageName In pageNameDubs
' If InStr(pageName.innertext, "en_us:") > 0 Then
inspectLink(i, 1) = pageName.Value
' End If
Next
REM End Old Code
This was an interesting one. This answer is specifically about accessing digitalData.page.pageName. In the code below, everything between VVV and ^^^ does that, based on a document that's already loaded. You can integrate that part into your existing code.
In the Developer Console, this is as simple as digitalData.page.pageName, or, equivalently, document.defaultView.digitalData.page.pageName (source). You can get document.defaultView in Excel VBA, but I can't figure out how to access JavaScript globals from that object. Instead, I did it through the DOM. The below sample works for me.
Code
First, make sure you have added references to Microsoft Internet Controls and to Microsoft HTML Object Library.
Option Explicit
Option Base 0
Public Sub GetResult()
Dim objIE As SHDocVw.InternetExplorer
Set objIE = New SHDocVw.InternetExplorer
' Load the page with the target data
With objIE
.navigate "http://cxw42.github.io/49290039.html?buster=1"
' Cache buster thanks to https://stackoverflow.com/questions/24851824/how-long-does-it-take-for-github-page-to-show-changes-after-changing-index-html#comment69647442_24871850
' by https://stackoverflow.com/users/185973/joel-glovier
.Visible = True
End With
Do While objIE.Busy
DoEvents
Loop
Dim doc As MSHTML.HTMLDocument
Set doc = objIE.document
' VVVVVVVVVVVVVV
Dim win As MSHTML.HTMLWindow2
Set win = doc.defaultView
' Should be able to directly access win.digitalData, but I can't get that to work.
' Instead, access the data indirectly
Dim uniqueid As String
uniqueid = "id_" & Format(Now, "%yyyy%mm%dd%hh%nn%ss")
Dim code As String
code = "(function(){var x = document.createElement('p'); x.id='" & uniqueid & "'; x.innerText=digitalData.page.pageName; document.body.appendChild(x); })()"
' Copy digitalData.page.pageName into the DOM
win.execScript code, "JavaScript"
Dim pageName As String
Dim node
Set node = doc.getElementById(uniqueid) ' Get the new DOM node
pageName = node.innerText
' Clean up
doc.getElementsByTagName("body").Item(0).RemoveChild node
' ^^^^^^^^^^^^^^
' Now do whatever you want with pageName.
Debug.Print pageName
End Sub
Explanation
The magic is in code and win.execScript. code is a JavaScript oneliner that creates a new <p> element and copies digitalData.page.pageName into it. win.execScript runs that JavaScript in the context of the page, thereby creating the new node. The node has a uniqueid (well, probably unique) that we can use to find it once it's been created. Then we can pull the result out of the innerText of that paragraph.
Thanks to
document.defaultView
Scraping using VBA
Github Pages cache-busting
The importance of early binding (read it!)
Related
I need to get the page number in order to extract text from that specific page in a .PDF document. I am using Excel VBA function that makes use of the JSObject from the Acrobat Type Library 10.0
Here is the code snippet and the code hicks up on when I am trying to reference the pageNum property from Doc object. I am trying to avoid the AV Layer and use the PD Layer only, so my macro runs in the background only and doesn't invoke Acrobat Application.
Function getTextFromPDF_JS(ByVal strFilename As String) As String
Dim pdDoc As New AcroPDDoc
Dim pdfPage As Acrobat.AcroPDPage
Dim pdfBookmark As Acrobat.AcroPDBookmark
Dim jso As Object
Dim BookMarkRoot As Object
Dim vBookmark As Variant
Dim objSelection As AcroPDTextSelect
Dim objHighlight As AcroHiliteList
Dim currPage As Integer
Dim strText As String
Dim BM_flag As Boolean
Dim count As Integer
Dim word As Variant
strText = ""
If (pdDoc.Open(strFilename)) Then
Set jso = pdDoc.GetJSObject
Set BookMarkRoot = jso.BookMarkRoot
vBookmark = jso.BookMarkRoot.Children
'Add a function call to see if a particular bookmark exists within the .PDF
Set pdfBookmark = CreateObject("AcroExch.PDBookmark")
BM_flag = pdfBookmark.GetByTitle(pdDoc, "Title Page")
If (BM_flag) Then
For i = 0 To UBound(vBookmark)
If vBookmark(i).Name = "Title Page" Then
vBookmark(i).Execute
jso.pageNum
Set pdfPage = pdDoc.AcquirePage(pageNum)
Set objHighlight = New AcroHiliteList
objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page
Set objSelection = pdfPage.CreatePageHilite(objHighlight)
If Not objSelection Is Nothing Then
For tCount = 0 To objSelection.GetNumText - 1
strText = strText & objSelection.GetText(tCount)
Next tCount
End If
Exit For
End If
pdDoc.Close
End If
End If
getTextFromPDF_JS = strText
End Function
jso.pageNum = 0; set a page number
pageNo = jso.pageNum; get a page number
edit: 3.3.19
Mmmh, it seems you have to work with AVDoc in order to get the current actual page via jso.pageNum . Also if you work with AVdoc the Acobat window stay hidden in the background. Example:
strFilename = "d:\Test2.pdf"
set avDoc = CreateObject("AcroExch.AVDoc")
If (avDoc.Open(strFilename,"")) Then
Set pdDoc = avDoc.getPDDoc()
Set jso = pdDoc.GetJSObject
pageNo = jso.pageNum
msgbox(pageNo)
end if
I'm trying to fix an old macro working with a function GetData that retrieve the whole page in the response string. (Can't give the real URL here)
Function getDataFor(tva As String) As String
' Early binding, set Tools - Reference - Microsoft XML,v6.0
Dim oDOM As New MSXML2.DOMDocument30
Dim oNodeList As IXMLDOMNodeList
Dim oReq As MSXML2.XMLHTTP
Set oReq = New MSXML2.XMLHTTP
oReq.Open "GET", "https://xxxxxxx.asp?name=" & tva, False
oReq.send
While oReq.readyState <> 4
Wend
Dim response As String
Dim name As String
Dim hasLoginBelspo As String
response = oReq.responseText
Dim token1 As Long
Dim token2 As Long
token1 = InStr(response, "company.name=")
token2 = InStr(response, "company.address=")
On Error GoTo Proceed
name = Mid(response, token1 + 14, token2 - (token1 + 16))
On Error GoTo 0
token1 = InStr(response, "parent.SetLoginInfo")
hasLoginBelspo = Replace(Mid(response, token1 + 28, 5), ",", "")
GoTo Proceed2
Proceed:
name = ""
hasLoginBelspo = ""
Proceed2:
getDataFor = name & "," & hasLoginBelspo
End Function
The concerned page is loaded and the needed data is written between javascript tags:
Here is a copy of what I can find in the source page when I paste the URL in a browser.
<script language="javascript">
var company=new Object();company.number="xxx";company.name="xxx";company.address="xxx";company.zip="xxx";company.city="xxx";parent.SetLoginInfo(company,false,true,"xxx","");
</script>
All the data replaced by xxx is what I need. In the browser, I get it, but with the macro, sometimes those xxx data are empty. It should be because the library used to load the javascript may not be the good one (MSXML2).
Do you have an idea of what I need the change, or what kind of library should I use to be sure the javascript data will be catch?
Thanks a lot
I have a vbscript that I use in Illustrator. In order to save my file as a TIFF I need to use JavaScript. With complete paths the JS script section works fine but when I try to change them to variables('NewState" and 'NewSection' via user input), it's a no go. Nubie to JS so how to insert those variables?
Here's is my latest attempt (FYI - in Illustrator, to use JS inside VBS one must execute it as a String as shown below).
Set App = CreateObject("Illustrator.Application")
Set FSO = CreateObject("Scripting.FileSystemObject")
Dim SourceFolder, DestFolder, NewState, NewSection
Call GetNewInputs()
Sub GetNewInputs()
NewState = UCase(InputBox("Current STATE or REGION to be processed.", _
"STATE or REGION", "SOCAL"))
NewSection = ("Section_" & InputBox("INPUT SECTION NUMBER: Section_YY", _
"Input Section", "32"))
Set SourceFolder = FSO.GetFolder("S:\" & NewState & "\" & NewSection & "\Road DXFs")
DestFolder = "S:\" & NewState & "\" & NewSection & "\Light TIFFs\"
End Sub
App.DoJavaScript("function test(){var doc = app.activeDocument;var destFile = new File(""/s/" + <%=NewState%> + "/" + <%=NewSection%> + "/Light TIFFs/SOCAL_CN68_resx.tif"");var type = ExportType.TIFF;var opts = new ExportOptionsTIFF();opts.imageColorSpace=ImageColorSpace.GrayScale;opts.resolution=72;opts.antiAliasing=AntiAliasingMethod.ARTOPTIMIZED;opts.IZWCompression=false;opts.saveMultipleArtboards=true;opts.artboardRange=""1"";doc.exportFile(destFile, type, opts);}test();")
MyDoc.Close(2)
Going out on a limb here. Is there a way for a Photoshop VBScript to call a JavaScript file?
Or at least pass some user input (variable or return from function) from one script to another.
My reason for this? I've been having similar issues asked in this question and considered a VBScript UI to drive a photoshop-script. Re-writing the existing jsx into VBS isn't really an option.
Here's what I have. This simple VBScript asks for the user to type in their name which is then created as text in the second script.
VBScript
' Ask User for input
Dim appRef
Set appRef = CreateObject( "Photoshop.Application" )
Dim askName : askName = InputBox("Enter name: ")
JavaScript
// create a document to work with
var docRef = app.documents.add(200, 100, 72, "Hello");
// Create a new art layer containing text
var artLayerRef = docRef.artLayers.add();
artLayerRef.kind = LayerKind.TEXT;
// Set the contents of the text layer.
var textItemRef = artLayerRef.textItem
textItemRef.contents = "Hello " + askName
What do I need to connect the two up?
I have no experience in scripting with Photoshop, did some research.
The following code has been tested with Adobe Photoshop® CS6.
PsJavaScriptExecutionMode enum constants are extracted from scriptingsupport.8li (Adobe Photoshop CS6 Object Library) by using Microsoft OLE/COM Object Viewer.
VBScript:
'PsJavaScriptExecutionMode Enums
Const psNeverShowDebugger = 1, psDebuggerOnError = 2, psBeforeRunning = 3
Dim appRef
Set appRef = CreateObject("Photoshop.Application")
Dim askName
askName = InputBox("Enter name: ")
appRef.DoJavaScriptFile "C:\scripts\myPSscript.jsx", Array(askName), psNeverShowDebugger
JavaScript (myPSscript.jsx):
// create a document to work with
var docRef = app.documents.add(200, 100, 72, "Hello");
// Create a new art layer containing text
var artLayerRef = docRef.artLayers.add();
artLayerRef.kind = LayerKind.TEXT;
// Set the contents of the text layer.
var textItemRef = artLayerRef.textItem
var askName = arguments[0]; // first argument passed from VBScript
textItemRef.contents = "Hello " + askName;
Hope it helps.
Adobe® Creative Suite® 5 Photoshop® Scripting Guide
I'm trying to loop through a bunch of PDFs and, for each one, flatten it using the Acrobat API to access the PDF's JS Object. Because VBA doesn't have a code-accessible stack trace, I have a wrapper class AcroExchWrapper that wraps each Acrobat API function to identify the function if it fails, e.g.
Public Sub ClosePDF()
On Error GoTo errHandler
'AcroDoc is created with CreateObject("AcroExch.PDDoc")
AcroDoc.Close
Exit Sub
errHandler:
Err.Raise Err.Number, "AcroExchWrapper.ClosePDF (" & Err.source & ")", Err.Description, Err.HelpFile, Err.HelpContext
End Sub
Each PDF is flattened using the following code:
Private Function flattenPDF(pdfPath As String, savePath As String, deleteOld As Boolean) As String
Dim pageCount As Integer
Dim jso As Object
Dim saveResult As Long
Dim openResult As Long
Dim creatingApp As String
On Error GoTo errHandler
creatingApp = ""
If deleteOld Then
If fso.FileExists(savePath) Then
fso.DeleteFile savePath
End If
End If
acroExch.ClosePDF
openResult = acroExch.OpenPDF(pdfPath)
If openResult = 0 Then
Err.Raise "-1", "Flattener.flattenPDF", "unable to open PDF"
End If
DoEvents
Set jso = acroExch.GetJSObject
DoEvents
If jso Is Nothing Then
Err.Raise "-1", "Flattener.flattenPDF", "unable to get JS object"
Else
pageCount = acroExch.GetNumPages
creatingApp = acroExch.GetInfo("Creator")
If pageCount <> -1 Then
flattenPages jso, 0, pageCount - 1
Else
flattenPages jso
End If
DoEvents
saveResult = acroExch.SavePDF(savePath)
If saveResult = 0 Then
Err.Raise "-1", "Flattener.flattenPDF", "unable to save PDF"
End If
acroExch.ClosePDF
If pdfPath <> savePath Then
loggerObj.addEntry pdfPath, savePath, "flattened", creatingApp
End If
End If
flattenPDF = savePath
Exit Function
errHandler:
loggerObj.addErrorEntryFromErrorObj pdfPath, "", "flattenPDF", Err, creatingApp
flattenPDF = ""
End Function
'wrapper for jso.flattenPages to allow for stack trace
Private Sub flattenPages(jso As Object, Optional startPage As Long = -1, Optional endPage As Long = -1)
On Error GoTo errHandler
If startPage = -1 Then
jso.flattenPages
Else
jso.flattenPages startPage, endPage
End If
Exit Sub
errHandler:
Err.Raise Err.Number, "flattenPages (" & Err.source & ")", Err.Description, Err.HelpFile, Err.HelpContext
End Sub
After going through several thousand files--the number is different each time I run the script--flattenPages() raises the following error:
Automation error
The remote procedure call failed.
After that, for all remaining files, when flattenPDF() runs, the first call to acroExch.ClosePDF() raises this error:
The remote server machine does not exist or is unavailable
I haven't been able to find any documentation on why these errors occur when using Acrobat API, whether with the Javascript API (jso.flattenPages()) or the IAC API (PDDoc.Close()). More mysterious is that the application runs fine until it gets to some varying number of files and only then starts throwing these exceptions.
EDIT: I added the following function to AcroExchWrapper to reset Acrobat, to be performed every 100 files:
Public Sub Reset()
ClosePDF
AcroApp.Exit
Set AcroApp = CreateObject("AcroExch.App")
Set AcroDoc = CreateObject("AcroExch.PDDoc")
Exit Sub
However, the same exceptions are still being thrown. This seems to be occurring once 1500-2500 files have been processed.