I'm currently working on a project that scrapes grocery store pages for data given a search query (e.g., cereal) and displays the results in a Spinner view. However, I'm having some difficulty finding a way to scrape the data off the pages. I tried using Jsoup, as that was the consensus online, but it doesn't support JavaScript.
The issue is that most, if not all, sites like these rely on JavaScript and DOM storage for up-to-date stock listings and prices. That's why libraries like Jsoup won't work: they return the HTML served to clients without JavaScript. I currently have a prototype that displays the page via a WebView, but I see no way of getting the data out of it.
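Since a WebView already runs the page's JavaScript, one way out is to pull the data from the rendered DOM yourself. A minimal sketch of a script you could pass to WebView.evaluateJavascript() (API 19+) after onPageFinished; the .product-tile, .name, and .price selectors are placeholders you'd replace after inspecting the real page:

    // Passed as a string to WebView.evaluateJavascript(); the value of the
    // last expression comes back to your Java callback as a JSON string.
    // ".product-tile", ".name", ".price" are assumed selectors.
    JSON.stringify(
      Array.from(document.querySelectorAll('.product-tile')).map(function (el) {
        var name = el.querySelector('.name');
        var price = el.querySelector('.price');
        return {
          name: name ? name.textContent.trim() : '',
          price: price ? price.textContent.trim() : ''
        };
      })
    );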
I've tried to research how to get around this, but to be honest it's quite confusing, and I haven't found an elegant solution, if one even exists.
If anyone can help, or at the very least point me in the right direction, that would be most appreciated! Thanks ^_^
Selenium (https://www.selenium.dev/) would be a good option for web scraping: it drives a real browser, so it has access to the website's fully rendered DOM. In my experience, dynamically generated web pages can be difficult to scrape, and regular expressions (https://regexone.com/) will be your friend for pulling values out of whatever text you extract.
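A minimal sketch using Selenium's JavaScript bindings (the selenium-webdriver npm package); the search URL and the .product-tile selector are made-up placeholders for whatever the real grocery site uses:

    // npm install selenium-webdriver (plus a chromedriver on your PATH)
    const { Builder, By, until } = require('selenium-webdriver');

    (async function scrape() {
      const driver = await new Builder().forBrowser('chrome').build();
      try {
        await driver.get('https://grocer.example.com/search?q=cereal'); // placeholder URL
        // Wait until the JavaScript-rendered product list actually exists.
        await driver.wait(until.elementsLocated(By.css('.product-tile')), 10000);
        for (const tile of await driver.findElements(By.css('.product-tile'))) {
          console.log(await tile.getText());
        }
      } finally {
        await driver.quit();
      }
    })();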
I am relatively new to scraping and am trying to scrape this site (and many, many like it): http://www.superiorcourt.maricopa.gov/docket/CriminalCourtCases/caseInfo.asp?caseNumber=CR1999-012267
I'm using Python and Scrapy. My problem is that when I start up a Scrapy shell and point it at this URL, the response body is full of code I can't read, e.g.:
c%*u9u\\'! (vy!}vyO"9u#$"v/!!!"yJZ*9u!##v/!"*!%y\\_9u\\')"v/\\'!#myJOu9u$)}vy}vy9CCVe^SdY_^uvkT_Se]U^dKju"&#$)\\')&vMK9u)}&vy}MKju!\\'$#)(# (!#vMuvmy\\:*Ve^SdY_^uCy\\y
The information I actually want to scrape does not appear to be accessible.
I believe this is a JavaScript problem, and I have confirmed that tools others have suggested before, like Selenium, render the page correctly. My problem is that I will need to scrape several million of these pages, and I don't believe a browser-based solution is going to be fast enough.
Is there a better way to do this? I do not need to click any links on the page (I have a long list of all the URLs I want to scrape) or interact with it in any other way. Is it possible that the response body contains JSON I could parse?
If you just want to wait for the JavaScript data to load, I'd use ScrapyJS.
If you need to interact with JavaScript elements on the website, use Scrapy + Selenium + PhantomJS. That combination is usually the more popular choice because it's easier to learn and can do more, but it's slower.
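ScrapyJS lives on the Python side, but PhantomJS itself is scripted in plain JavaScript, so here is a minimal sketch of just the rendering step, which you could pair with any downstream parser:

    // Run as: phantomjs render.js <url>
    // Loads the page, lets its scripts run, then prints the rendered HTML.
    var page = require('webpage').create();
    var url = require('system').args[1];

    page.open(url, function (status) {
      if (status === 'success') {
        // Give late-running scripts a moment to populate the DOM;
        // tune the delay (or poll for a known element) for your pages.
        window.setTimeout(function () {
          console.log(page.content);
          phantom.exit();
        }, 1000);
      } else {
        phantom.exit(1);
      }
    });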
I know this has been done many times before, but I am (relatively) new to the coding scene and love to fiddle around with things. I've never managed to make anything really functional, as in useful to me. I'm trying to make a Chrome extension that shows a list of the Counter-Strike streamers that are currently streaming. I have no idea how to go about this. Is there a way, through jQuery, to go through the page and take the first ~10 usernames it finds?
I already know how to make the extension and the HTML and all that stuff; I'm just looking for the functionality. I have a list on an HTML page at the moment. There's nothing in the list yet, but I want to fill it with the online streamers. I'm doing this solely as a "fun" project to get some practice in, so I'm not exactly looking for full answers, just someone to point me in the right direction :D.
Completely lost on how to do this. Would regular expressions work?
A mediocre way to do this would be to go to a Twitch page, look for the elements that repeat for each streamer, and grab their inner HTML, that is, the information on who's streaming.
The proper way to do this would be to go through Twitch's dev API and fetch the information using their dedicated web services and hooks. Think of it as Twitch's way of getting people invested in their platform by providing easy methods to retrieve things like who's streaming, how many viewers they have, etc.
I've never used it, but I'm sure it's simple and it's made for this situation.
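For what it's worth, a hedged sketch of what that API call might look like with Twitch's Helix streams endpoint and fetch; the client ID, token, and game_id are all placeholders you'd obtain by registering an app at dev.twitch.tv and querying the Get Games endpoint:

    // All credentials below are placeholders from registering an app
    // at dev.twitch.tv; GAME_ID must be looked up via the Get Games endpoint.
    const CLIENT_ID = 'your-client-id';
    const TOKEN = 'your-app-access-token';

    async function onlineCounterStrikeStreamers(limit = 10) {
      const res = await fetch(
        'https://api.twitch.tv/helix/streams?game_id=GAME_ID&first=' + limit,
        { headers: { 'Client-Id': CLIENT_ID, 'Authorization': 'Bearer ' + TOKEN } }
      );
      const body = await res.json();
      // Each entry in body.data describes one live stream.
      return body.data.map(function (s) { return s.user_name; });
    }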
I am a young software engineer working on a mobile view for SharePoint 2013. For this, I need to access SharePoint web part objects with JavaScript.
The JavaScript should be linked in the master page and run after the page finishes loading. It should then modify the web parts.
For example, I want to resize web parts to fit the maximum available screen resolution.
I want to turn the standard navigation into a drop-down.
I want to fetch individual pieces of information out of different web parts and work with them.
Basically, I want to be able to change everything you can see on a standard page. :D If I'm on the wrong track, please tell me so.
I do not want anybody to post a link to a finished script or anything like that - I want to do the work myself ;)
But if you could give me a good hint or anything like that, I would highly appreciate it.
I've already done a lot of research, but due to the complexity of Microsoft's documentation I haven't found a proper solution. It's kind of a sad state of affairs, because SharePoint is a great tool you could do so much with, if only there were a document telling you how and where.
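As a hint for the "run after the page is done loading" part: SharePoint exposes _spBodyOnLoadFunctionNames for exactly that. A minimal sketch of the resizing idea; the .ms-webpartzone selector is an assumption, so inspect your own master page markup for the real class names:

    // SharePoint calls everything pushed onto this array once the
    // page body has finished loading.
    _spBodyOnLoadFunctionNames.push("fitWebPartsToScreen");

    function fitWebPartsToScreen() {
      // ".ms-webpartzone" is an assumed class name; verify it against
      // your own page markup.
      var zones = document.querySelectorAll(".ms-webpartzone");
      for (var i = 0; i < zones.length; i++) {
        zones[i].style.maxWidth = window.screen.availWidth + "px";
      }
    }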
Have you thought of creating a separate master page and layouts for mobile devices, with a server-side redirect that detects the user agents you're interested in and points them to the mobile site? You wouldn't need to do the whole thing in JavaScript/jQuery, since the master page and layouts could have the sizes you want from the start. You could also limit the width of the web part zones with some custom CSS.
Good luck!
I am a noob with Angular and with making JavaScript crawlable. I've been searching for information, but I don't really get it so far.
I am working on an AngularJS thingie that uses client-side JSON.
There is paginated navigation, but each link uses a function getPage(n) to slice a chunk of the JSON, which Angular then renders.
Is it OK to put href="#!page=n" on each link? If I add that #! hash to the URL and press Enter, and a function renders the right items, is that enough to make the page crawlable?
I've read something about snapshots, but doesn't that require Java? My web host is not very flexible: it supports neither Tomcat nor Node.js.
I think it's much better practice these days to use HTML5 history.pushState, and thus provide a unique URL for every page.
More information here.
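For AngularJS specifically, that mostly boils down to enabling HTML5 mode. A minimal sketch, assuming a module named 'app', a <base href="/"> tag in your page, and a server that rewrites unknown paths to index.html:

    // Give every page a real URL (e.g. /page/2) instead of a #! fragment.
    // 'app' is a placeholder module name.
    angular.module('app').config(['$locationProvider', function ($locationProvider) {
      $locationProvider.html5Mode(true);
    }]);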
Check out this older Stack Overflow question - Making angular crawlable - Beginning of Project
A friend of mine uses - https://prerender.io/
Both of these solutions essentially cache versions of your rendered views so the crawler can index your site.
I'm new to this whole web design business, and I am beginning to think I am going against good practices, so I had some questions.
I am making a website for a family company. We have a great many products that change often, and I need to build the site so that someone less tech-savvy can edit it after I leave the company. My plan was to keep each product in an XML file loaded via JavaScript on each page. Later, I might write another script to make editing these XML files easy.
I am worried about two things. First, I am getting the sense that this is bad practice, because some users disable JavaScript. Second, I am worried that search engines will not be able to find the content on my site, because all they will see is some template HTML and some JavaScript. I would like the site to be searchable and to follow good practice, but I have no idea how else to handle dynamically changing content that stays easy to edit.
I would really appreciate it if someone could point me in the right direction so that I know what I should be researching.
Thanks
RShom
There are many good free, open-source software products that let you create a customizable content management system (CMS). Drupal and Joomla are very popular ones.
Try searching for "free cms" and see what you find.
I bet you have heard of PHP, even if you are not using it. I suggest you use PHP to parse your XML into HTML and present that. Your content will then be searchable, because PHP runs server-side. You will still use JavaScript too, but it shouldn't be for loading the XML; to me, JavaScript is for enhancing a web page (AJAX aside).
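The answer suggests PHP; to keep these sketches in one language, here is the same server-side idea in Node: the crawler gets finished HTML, while the product data still lives in an easy-to-edit XML file. products.xml and its <name> elements are assumptions:

    // Same idea as the suggested PHP approach, sketched in Node:
    // render the XML into HTML on the server so crawlers see real content.
    // "products.xml" and its <name> elements are assumed.
    const http = require('http');
    const fs = require('fs');

    http.createServer(function (req, res) {
      const xml = fs.readFileSync('products.xml', 'utf8');
      // Naive extraction; a real site would use a proper XML parser.
      const names = [...xml.matchAll(/<name>(.*?)<\/name>/g)].map(m => m[1]);
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end('<ul>' + names.map(n => '<li>' + n + '</li>').join('') + '</ul>');
    }).listen(8080);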