How would I scrape the JS-generated data on this webpage? - javascript

This past week, there was the launch of a new tool called #Homescreen that allows people to share a screenshot of the apps that they have on their iPhone home screen. For example: https://homescreen.is/iamfinnym
I'd like to build a scraper that extracts the names of all the apps on a given user's page (along with their positions on the screen). How would I do this? I know how to build a normal HTML scraper, but it looks like the apps are rendered onto the page by some kind of React.js JavaScript call, and I'm not sure how to go about figuring this out. (I can write basic JavaScript, but I have never used React.js before and don't know how to get started.)

This is how you can get the data through Chrome's dev console:
If you open the Chrome dev tools (Ctrl / Cmd+Shift+C) and head to the Network tab, you will see the request the page makes for the app data.
If you click on it, you will see that the API is set up so that when you make a GET request to https://homescreen.is/api/user/{USERNAME}, you get the data for their apps as responseData.apps. Click on Preview to see a preview of the data the API sends back.
Now you can use any language's http library to make GET requests to the API.
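For example, a minimal Python sketch using the requests library (assuming the response shape described above, with the apps under responseData.apps; check the Preview tab if it differs):

```python
import requests

def fetch_apps(username):
    # Same endpoint the page itself calls (visible in the Network tab)
    url = f"https://homescreen.is/api/user/{username}"
    resp = requests.get(url)
    resp.raise_for_status()
    data = resp.json()
    # The apps (names, screen positions, etc.) are reported to sit under responseData.apps
    return data["responseData"]["apps"]

if __name__ == "__main__":
    for app in fetch_apps("iamfinnym"):
        print(app)
```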

Related

Building a bot that fetches data from browser and saves it as text file

My problem is a bit complex. Hard to explain with words, so I broke it down into steps with pictures at each step.
1. Select a single date from these boxes. Hit submit.
2. I will land on a page with a table. Copy the <tbody> element from the developer console.
3. Paste it into a text file. Save the text file with the date that was selected.
4. Repeat steps 1-3 as many times as needed, selecting a new date each time (01-15-2018, 01-14-2018, 01-13-2018, and so on...)
Is it even possible to build a bot that does this? If yes, what tools would I use?
I know a fair amount of JavaScript and Python, so I'd prefer to use those 2 if possible.
Would need to know the URL you're looking at and to see the page source. If the date is supplied as any part of a request, and the response contains the data you're looking for, it should be simple to farm and analyze that data from a Python script.
Walk through your clicks with the network tab of your browser's developer tools and you should see a request go out when you hit submit. Expedia just uses query parameters, and so the entire URL that you'll need pops up in the URL bar of your browser after hitting submit...
Tools:
If request-based:
Python
Requests module
If something cached/more complicated, there are tools for automating clicks and saving the results...I would guess that this won't be necessary though...
Update:
AJAX calls are HTTP requests and responses, so you should be able to observe them in the network tab of your web browser's developer tools and then mimic that request from a script, rather than from your browser.
The readability of the requests/responses and/or any measures the organization has implemented to stop anything other than a browser from getting the same response would be potential impediments, but even those should be imitable. If your browser is making the request, then there is no reason your Python script can't make the same one.
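As a rough sketch of that request-based approach in Python (the URL, the date parameter name, and the table handling below are placeholders; the real ones have to come from watching the network tab when you hit submit):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder endpoint and parameter name: replace them with whatever
# you observe in the network tab after hitting submit.
BASE_URL = "https://example.com/report"
DATES = ["01-15-2018", "01-14-2018", "01-13-2018"]

for date in DATES:
    resp = requests.get(BASE_URL, params={"date": date})
    resp.raise_for_status()

    # Pull the <tbody> out of the returned HTML
    soup = BeautifulSoup(resp.text, "html.parser")
    tbody = soup.find("tbody")

    # Save the raw <tbody> markup to a file named after the selected date
    with open(f"{date}.txt", "w", encoding="utf-8") as f:
        f.write(str(tbody))
```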
The method you seem to be interested in, although it sounds more complicated to me, is possible with automation tools like Selenium, as the other poster answered. Best of luck.
It is possible:
Take a look at the Selenium library for Python (it's commonly used for automated testing). It should be able to select single dates, hit the submit button, then go through the HTML and grab the data in the <tbody> tag. After that you can use Python by itself to store this data in a text file with a name of your choice in a location of your choice.
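A minimal Selenium sketch of that flow (the URL and element IDs below are made up; inspect the actual page to find the real selectors):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

DATES = ["01-15-2018", "01-14-2018", "01-13-2018"]

# Assumes a matching chromedriver is available (recent Selenium can fetch one for you)
driver = webdriver.Chrome()

for date in DATES:
    # Reload the form for each date (placeholder URL)
    driver.get("https://example.com/report")

    # Fill in the date box and submit (placeholder element IDs)
    date_box = driver.find_element(By.ID, "date-input")
    date_box.clear()
    date_box.send_keys(date)
    driver.find_element(By.ID, "submit-button").click()

    # In practice you may need an explicit wait here for the table to render.
    # Grab the rendered <tbody> and save it under the selected date.
    tbody_html = driver.find_element(By.TAG_NAME, "tbody").get_attribute("outerHTML")
    with open(f"{date}.txt", "w", encoding="utf-8") as f:
        f.write(tbody_html)

driver.quit()
```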

Get live html feed from website

When a webpage like https://poloniex.com/exchange#btc_eth is opened in the browser, we see that the browser constantly shows updated buy and sell orders. Also, in the Elements panel of the Chrome dev tools, these updates are visible in the HTML tables.
Is there a way I can use a Node.js script run on my PC (so not in the browser console) to get these live HTML table updates from that website, without having to do a GET request every time?
If the Chrome browser is able to do it, Node.js / jQuery / AJAX should be able to do it as well. I tried the XMLHttpRequest npm module but no luck yet.
It's possible they are using token authentication which means you wouldn't be able to get all the connection info you need just from their client-side code. Have you downloaded it and looked at it yet?
If you find it's not possible to call their services, there are other free products designed for webscraping. AutoHotKey is one that can open a web page and traverse its DOM. I believe it has the ability to run in the background, but don't quote me.

Questions about turning the HTML code that uses Cytoscape.js into a web link accessible by other people

I am able to use the Cytoscape.js library to display a network graph in my own web browser. I wrote an HTML file containing the JavaScript code that takes in the graph JSON and style JSON files from my laptop and calls cytoscape(). When I open the HTML file on my laptop, the network graph is displayed in my web browser and I can play with the graph.
Now I need to run the HTML code on our Linux server and then send a web link to the user, so that the user can click on that web link to view the displayed network graph on their own web browser, and the user should also be able to move nodes & edges around just as I did on my own web browser.
I am not a web developer so I am missing some very basic knowledge. I think I probably need to link the HTML code to a web domain (deploying the HTML code on a hosting server with domain name). I was just wondering if you could offer me some advice on how to do this?
Another question (which is more important) is: Assume I am able to link the HTML code to a web domain. When the user clicks on the web link to view the displayed network graph on their own web browser, is the user still able to move nodes & edges around?
The graph JSON and style JSON files, as well as some additional JavaScript code the HTML loads in, reside on our server. I am not sure if there are any issues with this when the user accesses the web link.
Any advice would be greatly appreciated.
Thank you very much in advance!
The question is too broad. You'd be best off searching for some books to read regarding web dev.
You might find using Github pages a bit easier than managing your own server, but you really should do some reading either way.
Basic resources to get started
https://developer.mozilla.org/en-US/docs/Web/Guide/Introduction_to_Web_development
https://developer.mozilla.org/en-US/docs/Learn/Server-side/First_steps/Introduction
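If you just want a quick shareable link while you read up on proper hosting, Python's built-in web server can serve the HTML, graph JSON, and style JSON as static files; run it from the directory that contains them (the port is arbitrary):

```python
# Serves the current directory over HTTP so that
# http://<your-server>:8000/yourpage.html is reachable from other machines.
# For anything long-lived, put a real web server (nginx, Apache) or a host
# like GitHub Pages in front instead.
import http.server
import socketserver

PORT = 8000  # pick any open port

handler = http.server.SimpleHTTPRequestHandler
with socketserver.TCPServer(("", PORT), handler) as httpd:
    print(f"Serving on port {PORT}")
    httpd.serve_forever()
```

Since Cytoscape.js runs entirely in the visitor's browser, the graph should stay interactive (dragging nodes and edges) as long as the JSON files are served from the same place as the HTML.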

Display PDF from chatter files in Visualforce Page

I am storing PDFs as Chatter files in our SF org, which is working well except for the fact that displaying these PDFs to the users is very challenging, especially on a mobile device (e.g. an iPad).
I have tracked down some good JavaScript PDF viewers which behave fairly well on an iPad.
The challenge is delivering the PDF file to these viewers.
Most of them require a local PDF file to view, but there are some, like the Google viewer, which will take a URL to the PDF, e.g.
https://docs.google.com/viewer?url=https://urltopdf
The PDF is available via a Chatter GET request as follows:
https://cs2.salesforce.com/services/data/v35.0/chatter/files/{docId}/content?versionNumber=1
The problem, of course, is that I need to pass in an authentication header. If I just pass that URL to the Google viewer it fails on authentication, since it's not passing in the Auth header.
I tried a few things:
1. Built a proxy API in Salesforce which the Google viewer calls; that API calls Chatter and then returns the file to the Google Docs viewer.
PROBLEM: Custom SF APIs have a 6 MB limit, which means that if the PDF is bigger than 6 MB it won't work!
2. Built a proxy API external to SF (to get around the 6 MB limit), including some interesting ways to persist the authentication (sketched below).
PROBLEM: There are too many hops and the Google viewer is not getting the data back in time... it calls the external proxy API, which then calls the Chatter API, which then has to return the PDF data back to the external proxy API and then back to Salesforce (ridiculous, I know).
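For reference, that external-proxy attempt looks roughly like this in Python/Flask (how the Salesforce access token is obtained and refreshed is stubbed out here, and the latency problem described above still applies):

```python
from flask import Flask, Response
import requests

app = Flask(__name__)

SF_BASE = "https://cs2.salesforce.com"
ACCESS_TOKEN = "REPLACE_WITH_A_VALID_SESSION_OR_OAUTH_TOKEN"  # stubbed out

@app.route("/pdf/<doc_id>")
def pdf(doc_id):
    # Call the Chatter file endpoint with the auth header the viewer can't send
    url = f"{SF_BASE}/services/data/v35.0/chatter/files/{doc_id}/content"
    resp = requests.get(
        url,
        params={"versionNumber": "1"},
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        stream=True,
    )
    resp.raise_for_status()
    # Stream the PDF bytes back to whatever requested them (e.g. the Google viewer)
    return Response(resp.iter_content(chunk_size=8192),
                    content_type="application/pdf")

if __name__ == "__main__":
    app.run(port=5000)
```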
So I am stuck.....
I thought that as of Spring '13 the Chatter API is accessible without any special authentication from JavaScript on a VF page.
Is that true?
Will this url work without any auth header when called from javascript on a VF page? https://cs2.salesforce.com/services/data/v35.0/chatter/files/{docId}/content?versionNumber=1
It doesn't seem to work for me, and it definitely won't work when going via the Google viewer.
I would really appreciate any suggestions on how to do this.
Thanks
Please don't hate me, I am trying to get some rep.
You can query the static resources for the docId and load the file via an AJAX request on the VF page.
After that it's a matter of prepending data:image/png;base64, to the base64 string so you can add it to a canvas. You can also add a download tag to the canvas so it will download, or open it in another window, which will give the user the browser's PDF viewer.
If you use canvas, be careful with Safari when you resize. If you get a really long number it will crap out on the canvas. I spent hours on that one.
I hope that helps.

How does Instagram's javascript work?

I'm trying to build a Python web crawler that can get Instagram photos, like those on Instagram's official account: instagram.com/instagram. The initial page source only contains the latest 20 photos; the others only load when you scroll down. As a JavaScript freshman, I can't figure out how Instagram loads these via JavaScript.
From my observation, there are two JS scripts that might relate to the load action: webpack-common.js and UserProfile.js. They seem to check when you hit the bottom of the page and then make an AJAX call to get new data.
But how do I do that via a crawler? I've downloaded these JS files and loaded them using PyV8 but always got errors. Do I need to do more than just execute the JS files?
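Rather than executing Instagram's JS with PyV8, the usual trick is to watch the network tab while you scroll, find the XHR that fetches the next batch of photos, and replay it with requests. The endpoint, parameters, and response layout below are pure placeholders to show the pattern; the real ones have to come from your own inspection:

```python
import requests

session = requests.Session()
# Browsers always send a User-Agent; some sites reject requests without one
session.headers["User-Agent"] = "Mozilla/5.0"

# Placeholder pagination endpoint: copy the real URL and parameters from the
# XHR you see in the network tab when the next 20 photos load.
PAGINATION_URL = "https://example.com/next-page-endpoint"
cursor = None

for _ in range(3):  # fetch a few pages as a demo
    params = {"after": cursor} if cursor else {}
    resp = session.get(PAGINATION_URL, params=params)
    resp.raise_for_status()
    page = resp.json()

    for item in page.get("items", []):   # placeholder response layout
        print(item.get("image_url"))

    cursor = page.get("next_cursor")     # placeholder pagination token
    if not cursor:
        break
```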
