How to scrape website data into an Excel worksheet? - javascript

I'm a novice programmer trying to compile an Excel list of all the inc5000 companies and their industry, location, revenue, and CEO. Is there any way for me to automate this so that I don't have to manually input all 5000?
Some issues:
- The inc5000 list only displays 50 companies per page, and moving to the next page does not change the URL. I tried fetching the page's HTML, but none of the company data actually shows up in the HTML source (I used https://try.jsoup.org/~LGB7rk_atM2roavV0d-czMt3J_g).
- All of the information I need is on this one scrolling page (https://www.inc.com/profile/loot-crate), but the URL changes for each company as you progress down the page. Is there any way to grab the data from this site without manually changing 5000 URLs?
I'm really new to programming and I know next to nothing about HTML/JavaScript/Web design-- I only know basic Java. I would really appreciate any help or potential leads into a solution.

Here's the easy way:
Go to the page, press F12, go to the "Network" tab of the debug tools, select XHR (to filter to only the data calls), then scroll to the bottom of the page. The page makes a query for each company, which you can access in the debug tools.
Once you have all the pages loaded, you can highlight all the rows in the file-name list on the left, right-click, and save them to a .har file.
From there, just write a script to pull out the JSON and you're set.
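For example, once you've saved the .har file, a short Node.js script can walk its entries and pull out the JSON responses. This is a minimal sketch: the file name is a placeholder, and the exact shape of the Inc. payload is something you'd inspect yourself before flattening it for Excel.

const fs = require("fs");

// A .har file is itself JSON: an array of request/response entries
// under log.entries.
const har = JSON.parse(fs.readFileSync("inc5000.har", "utf8"));

for (const entry of har.log.entries) {
  const content = entry.response.content;
  if (!content.text || !(content.mimeType || "").includes("json")) continue;
  // Response bodies may be stored base64-encoded in the HAR.
  const text = content.encoding === "base64"
    ? Buffer.from(content.text, "base64").toString("utf8")
    : content.text;
  console.log(JSON.stringify(JSON.parse(text))); // one payload per line
}

From there, picking out the fields you need (company, industry, location, revenue, CEO) and writing them out as comma-separated lines gives you a file Excel opens directly.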

Related

Capture the state of a web page in a URL

I find myself having to interact with a web page that hides state in various places so that one cannot easily share it as a URL, for example this page which allows users to look up information from city zoning applications:
https://aca.cityofberkeley.info/community/Default.aspx
You can interact with the page all you want, but the URL in the location bar will remain the same as the above.
Currently, city staff provide users with instructions like "Load this URL, click on the 'Zoning' tab, enter DRCP2020-0010 under the 'Permit Number' field, click 'Search', then when the records come up, click 'Record Info' and then select 'Attachments' from the dropdown menu, then click on the PDF document that says '2020-10-21_DRCP_APP_PCKT_2801 Adeline.pdf'". I would like to be able to replace these instructions with a URL.
Another example is the website where video from city council meetings is archived:
http://berkeley.granicus.com/MediaPlayer.php?publish_id=cbebb4e6-5b83-11eb-920e-0050569183fa
It would be nice to be able to produce a link which brings up one of the meeting videos, and seeks to a certain timestamp like 53:40, so that I can refer to something specific that was said at a meeting.
Looking at the pages that are loaded when I follow the instructions in each case, I can see that there are some POST forms, cookies, hidden input fields, and so on.
Is there some kind of tool that I can use to create "deep links" to pages like these, that were generated using non-URL hidden state, which will allow me to quickly share what I'm looking at with another user?
What I'm seeking is similar to the frmget "bookmarklet", which changes the forms on a page to use GET instead of POST. Sometimes this succeeds in producing a URL which captures form submission query parameters. However, it doesn't work for these applications, for whatever reason.
This question is possibly related to the idea of capturing a web page's DOM state using "browser screenshots" and a script called html2canvas. A possible solution might involve getting and setting cookies in a bookmarklet. Something that produces a normal "https://" URL would be ideal, but if the problem can only be solved by outputting a "javascript:" URL (bookmarklet), that is acceptable to me (in spite of the security implications). Thanks.
That doesn't really seem like a programming matter, and the site appears to have some security issues as well.
QUESTION A: About Zoning
Here are some links you can use
Direct link to Zoning (I found it via the Advanced Search on the site):
https://aca.cityofberkeley.info/CitizenAccess/Cap/CapHome.aspx?module=Planning&TabName=Planning&TabList=Home%7C0%7CBuilding%7C1%7CHousing%7C2%7CPlanning%7C3%7CFire%7C4%7CLicenses%7C5%7CPublicWorks%7C6%7CCurrentTabIndex%7C3
A strange link to the list of files (I found it by downloading a file, then going to chrome://downloads and right-clicking the file I had downloaded; the link was the following):
https://aca.cityofberkeley.info/CitizenAccess/FileUpload/AttachmentsList.aspx?iframeid=ctl00_PlaceHolderMain_attachmentEdit&module=Planning&isInConfirm=False&isdetail=True&isaccountmanager=False&isAdmin=True&isPeopleDocument=&agencyCode=BERKELEY&isForConditionDocument=N
It still doesn't give a direct link to the file, but it does give the list of attachments of the previously opened Zoning record.
Currently I have no idea which file is triggered by javascript:__doPostBack('attachmentList$gdvAttachmentList$ctl02$lnkFileName','').
In any case, based on what we have, the next step would be to minimize the number of steps needed to download the file. I guess there could be a way to download the file directly, but I currently don't see an easy one. Maybe someone else can figure it out.
QUESTION B: About video
I've used an embed link that shows all the attributes that can be used.
There is a pretty strange but working way to link to an exact timestamp: change the starttime parameter in the link below:
https://berkeley.granicus.com/MediaPlayer.php?publish_id=cbebb4e6-5b83-11eb-920e-0050569183fa&starttime=0&stoptime=undefined&autostart=1
So replacing 0 with 3600 will skip the video forward by one hour (3600 seconds):
https://berkeley.granicus.com/MediaPlayer.php?publish_id=cbebb4e6-5b83-11eb-920e-0050569183fa&starttime=3600&stoptime=undefined&autostart=1
The problem here is that you cannot manually rewind back into that skipped hour (it effectively gets cropped out), but it works for pointing at an exact moment.
That's a pretty strange site.
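If you want to build these links from a human-readable timestamp like 53:40 (the example from the question), a few lines of JavaScript can do the conversion. A minimal sketch; the publish_id is just the one from the question:

function granicusLink(publishId, timestamp) {
  // Convert "hh:mm:ss" or "mm:ss" into a total number of seconds.
  const seconds = timestamp
    .split(":")
    .reduce((total, part) => total * 60 + Number(part), 0);
  return "https://berkeley.granicus.com/MediaPlayer.php" +
    "?publish_id=" + publishId +
    "&starttime=" + seconds +
    "&autostart=1";
}

// "53:40" -> ...&starttime=3220&autostart=1
granicusLink("cbebb4e6-5b83-11eb-920e-0050569183fa", "53:40");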

Access input forms of an existing page using Javascript?

I'm playing around with Google Chrome extensions and wanted to make one where you fill out a form beforehand. Then, whenever a certain URL is opened, it fills in the information you provided. I can save the information and track the tab URL with one of Google's packages. However, when the URL is loaded, how can I tell which form to put the saved strings into? I know how to use document.getElementById(""), and I can see the IDs when I inspect the element, but since it's not my web page I can't link it to my JavaScript file, so that doesn't help. I've seen this done before but just can't find the right tools. Any guidance toward an answer would be appreciated.
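What's being described here is a content script: a script declared in the extension's manifest that Chrome injects into pages whose URL matches a pattern, running with direct access to that page's DOM, so document.getElementById works on someone else's page. A minimal sketch, assuming Manifest V3; the match pattern, field IDs, and storage keys are placeholders:

manifest.json:

{
  "manifest_version": 3,
  "name": "Form filler",
  "version": "1.0",
  "permissions": ["storage"],
  "content_scripts": [
    { "matches": ["https://example.com/*"], "js": ["fill.js"] }
  ]
}

fill.js:

// Runs inside the matched page, so it can reach that page's DOM.
chrome.storage.sync.get(["name", "email"], (saved) => {
  document.getElementById("name").value = saved.name || "";
  document.getElementById("email").value = saved.email || "";
});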

Site Scraping for JavaScript rendered site

I'm trying to get the part numbers of the items in their respective categories from http://www.dynacorn.com/ListItems.aspx, but, for me at least, the catch is that the pages are paginated, and therefore I cannot figure out how to move on to the next page. I've been reading about JSoup but am unsure how to implement this. Can someone show me an example of how to go from page to page when scraping?
Thanks!
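JSoup itself is a Java library, but the pattern is the same in any language: find out how the site requests the next page (often a query parameter; on ASP.NET sites like this one it may instead be a __doPostBack form post you have to replicate), then loop until nothing is left. A sketch in Node.js, with a hypothetical ?page= parameter standing in for whatever the real paging mechanism turns out to be:

// Fetch successive pages until a request fails or the safety limit
// is reached. The "?page=" parameter is a placeholder -- inspect the
// site's actual paging requests in the Network tab first.
async function fetchAllPages(baseUrl, maxPages = 100) {
  const pages = [];
  for (let page = 1; page <= maxPages; page++) {
    const res = await fetch(`${baseUrl}?page=${page}`);
    if (!res.ok) break;
    pages.push(await res.text()); // parse each page afterwards
  }
  return pages;
}

fetchAllPages("http://www.dynacorn.com/ListItems.aspx").then((pages) => {
  console.log(`fetched ${pages.length} pages`);
});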

Trace clicking behavior of visitors on a web page

I am writing my own home page in HTML and JavaScript.
I have many hyperlinks on the home page, and what interests me is how many times visitors to my page click on them.
For instance, PDF is a hyperlink that leads to downloading a PDF file. I would like to set up a mechanism to count clicks on it, for instance with the information automatically recorded in a file so that I can check it from time to time.
Besides the counter, other information such as the time of the click and the IP of the visitor who clicked interests me too. It would be great if I could record that as well.
I don't know JavaScript; could anyone suggest an efficient way to realize this, with details (or a piece of code)?
As per your code:
<a href="your-file.pdf" id="uniqueID">PDF</a>
You can see I have added an ID to the PDF link; now:
$("#uniqueID").click(function () {
  // Write an AJAX call here to a server-side endpoint that
  // calculates and stores the count, e.g.:
  $.post("/track-click", { link: "pdf" });
});
I have not written the entire code, but I think this is sufficient for you to understand the logic.
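To also record the time of the click and the visitor's IP, as asked above, the server side could be something like this minimal Node/Express sketch (the /track-click route and clicks.log file are assumptions carried over from the snippet above, not part of the original answer):

const express = require("express");
const fs = require("fs");

const app = express();
app.use(express.urlencoded({ extended: false })); // matches jQuery's $.post encoding

// Append one line per click: timestamp, visitor IP, which link.
app.post("/track-click", (req, res) => {
  const line = new Date().toISOString() + "\t" + req.ip + "\t" +
    (req.body.link || "") + "\n";
  fs.appendFile("clicks.log", line, (err) => {
    if (err) console.error(err);
  });
  res.sendStatus(204);
});

app.listen(3000);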

Is there any way to copy contents (text) of a pop up window from a web page,automatically in an Excel cell?

There is a web page with products (descriptions and prices). If someone wants more details, they have to click on a row to open a pop-up window with more details about that product. Is there any way to automatically copy the contents of the pop-up window into a cell next to the cells where I get the data from my web query? The pop-up window is like (....http://www.apage.com/product_info.asp?node_serial=&node_id=&ITEMID=0011262)
Thanks a lot in advance.
I doubt very much that JavaScript (your tag) can access both the browser and Excel without user intervention; that would thoroughly break security.
If you know what the url is, I'm fairly sure in Excel you can do:
File > Open > FileType=HTML > {paste URL}
If you don't have the URL, right-click the popup, choose Properties, then highlight the URL (it is selectable).
Windows will fetch the file for you and dump the HTML into a new spreadsheet.
Note, you will get all the junk that the page contains too, not just the data. If you want a more elaborate system, you're likely going to need a screen-scraping application, or to use something like PHP's cURL to pull data out of a URL.
