I have scraped the following website: https://www.eex-transparency.com/homepage/power/czech-republic/production/availability/non-usability/non-usability using Selenium. I am scraping all of the table data. It works well, but the script takes a long time to run. So I started looking for an alternative and came across several topics here on StackOverflow that use an API to send requests to the server directly. After hours of trying and searching for examples I gave up, because there are several things I don't understand:
How to reverse engineer API to send the right request?
Which url link should I use?
This is what I came up with:
import json
import requests

url = "https://www.eex-transparency.com/ajax/en/navigation/ajaxGetNavi/12"
data = {
    "id": "16",
    "title": "Czech Republic",
    "url": "https:\\/\\/www.eex-transparency.com\\/homepage\\/power\\/czech-republic",
    "class": "country",
    "description": "",
    "children": [
        {
            "id": "649",
            "title": "Production",
            "url": False,
            "class": "",
            "description": "",
            "children": [
                {
                    "id": "650",
                    "title": "Capacity",
                    "url": False,
                    "class": "",
                    "description": "",
                    "children": [
                        {
                            "id": "651",
                            "title": "Installed Capacity",
                            "url": "https:\\/\\/www.eex-transparency.com\\/homepage\\/power\\/czech-republic\\/production\\/capacity\\/installed-capacity",
                            "class": "",
                            "description": ""
                        }
                    ]
                }
            ]
        }
    ]
}
response = requests.get(url, data=data)
file = response.json()
In general, could someone explain what steps I should take to scrape this webpage? I am particularly interested in how to find the right information in Chrome (Inspect -> Network -> XHR) and how to build the data variable (the one I pass to requests) from that information.
You can use Scrapy, a fast, high-level web crawling and scraping framework for Python:
https://github.com/scrapy/scrapy/
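For the DevTools route the question asks about, here is a minimal sketch. It assumes the table is filled by a call that shows up under Network -> XHR; the general idea is to copy that call's Request URL, query string parameters and request headers, then replay them. The nested dictionary in the question looks like the JSON that the navigation endpoint returns rather than something to send, so it probably should not be passed as `data`. The helper names and the `X-Requested-With` header are illustrative assumptions, not something the site is known to require. The sketch uses only the standard library; `requests.get(url, params=..., headers=...)` would be the equivalent call.

```python
import json
import urllib.parse
import urllib.request


def build_xhr_request(base_url, params=None, headers=None):
    """Build (but do not send) a GET request that mirrors one seen in DevTools.

    base_url: the "Request URL" from the Headers tab, without its query string.
    params:   the "Query String Parameters" shown for that call, as a dict.
    headers:  any request headers worth copying (hypothetical defaults below).
    """
    query = urllib.parse.urlencode(params or {})
    full_url = f"{base_url}?{query}" if query else base_url
    # Assumed defaults: many AJAX endpoints check these, but that is not
    # confirmed for this particular site.
    request_headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
    }
    request_headers.update(headers or {})
    return urllib.request.Request(full_url, headers=request_headers, method="GET")


def fetch_json(request, timeout=30):
    """Send the prepared request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be something like `fetch_json(build_xhr_request(url))` with the `url` from the question, followed by inspecting the returned structure to see which further calls carry the table data.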
I have a JSON file that I used to create my database in MongoDB, which I am using to build a website. Here is a snippet, before I explain what I'm trying to do.
[
{
"name": "Field 1",
"description": "",
"completed": false,
"category": "Field",
"resources": [],
"items": [
{
"name": "Field 2",
"description": "",
"completed": false,
"category": "Field",
"resources": [],
"items": [
{
"name": "Field 3",
"description": "",
"completed": false,
"category": "Field",
"resources": [],
"items": [
{
"name": "Topic 1",
"description": "",
"completed": false,
"category": "Topic",
"resources": [],
"items": []
},
{
"name": "Topic 2",
"description": "",
"completed": false,
"category": "Topic",
"resources": [],
"items": []
}
My data (and the way I'd like to use it) depends heavily on the parent and children of each individual object. My database has 8 documents, but there are many more embedded objects in total, all embedded at different depths in the 'items' array of each object.
I just created my first draft of the site, which loads the original 8 and, upon clicking any of them, lists the immediate children in its 'items' array. It continues doing that until the end of that specific path. To do this, I rely on the index of each object in its array and keep track of a path.
But I would also like my users to be able to navigate directly to the page for a certain object without starting at the top and navigating through. If I wanted to directly access the object for Field 3 in the above example, what's the best way to do this with a single function or piece of code that could work with any of them?
The JSON file that I've used to create my database is 22,000+ lines long, and I'd love to not have to go back and change it more than is absolutely necessary. I was thinking I could add an ID somehow, and if that's unique, I could use that. The names of some objects will be the same, depending on where they are in the data.
EDIT: Bonus question - Would this sort of data be best stored in a relational database? I thought non-relational would be best because of the nested functionality, but I suppose I could make it work with either.
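One way to get direct access without hand-editing the source file is to walk the tree once, give every object a unique ID, and record the index path that leads to it. The sketch below is written in Python purely to illustrate the idea; the function names and the `id` key are hypothetical, and it assumes every object keeps its children in an `items` list as in the snippet above.

```python
from itertools import count


def assign_ids(documents, id_key="id"):
    """Give every nested object a unique ID and return {id: index_path}.

    The index path is the tuple of array positions followed through
    successive 'items' lists, which is the path the site already tracks.
    """
    counter = count(1)
    index = {}

    def walk(nodes, path):
        for position, node in enumerate(nodes):
            node_path = path + (position,)
            node[id_key] = next(counter)
            index[node[id_key]] = node_path
            walk(node.get("items", []), node_path)

    walk(documents, ())
    return index


def get_by_path(documents, path):
    """Follow an index path down through the nested 'items' lists."""
    nodes = documents
    node = None
    for position in path:
        node = nodes[position]
        nodes = node.get("items", [])
    return node
```

With the index in hand, a page for "Field 3" only needs its ID: look up the path, then call `get_by_path`. Because the ID is unique, objects that share a name in different branches stay distinguishable.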
I am working on a script that creates a Jira ticket along with several sub-tasks. I have figured out how to create the issue and the sub-tasks in separate API calls. The issue is created with the following payload:
{
"fields": {
"project":
{
"key": "TEST"
},
"summary": "TEST summary",
"description": "TEST Description",
"issuetype": {
"name": "Bug"
}
}
}
A sub-task is then created and attached to the issue from the API call above:
{
"fields":
{
"project":
{
"key": "TEST"
},
"parent":
{
"key": "TEST-1"
},
"summary": "Sub-task of TEST-1",
"description": "TEST-1 desc",
"issuetype":
{
"id": "5"
}
}
}
However, I want to do both in a single API call. Can this be done?
The Jira REST API does not offer this kind of operation. It does offer a bulk endpoint for creating multiple issues, but you can't define something like "issue one is the parent issue of issue two which is declared further down in the JSON file".
You have to use two different API calls:
Create your parent issue by using POST /rest/api/2/issue and save the issue key from the response.
Create the sub tasks with a bulk operation using POST /rest/api/2/issue/bulk.
The endpoints above are those of the REST API for Jira Server, but the same is possible with the REST API in Jira Cloud. Only the authentication method is different.
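The two-step flow can be sketched as two payload builders. This is only an illustration: the function names are hypothetical, the sub-task issue type ID "5" is taken from the question and may differ per Jira instance, and the HTTP calls themselves (with authentication) are left out. The bulk endpoint expects the individual issues wrapped in an `issueUpdates` list.

```python
def build_parent_payload(project_key, summary, description, issue_type="Bug"):
    """Body for POST /rest/api/2/issue (step 1)."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": issue_type},
        }
    }


def build_subtask_bulk_payload(project_key, parent_key, summaries, subtask_type_id="5"):
    """Body for POST /rest/api/2/issue/bulk (step 2).

    parent_key is the issue key read from the response of step 1.
    """
    return {
        "issueUpdates": [
            {
                "fields": {
                    "project": {"key": project_key},
                    "parent": {"key": parent_key},
                    "summary": summary,
                    "issuetype": {"id": subtask_type_id},
                }
            }
            for summary in summaries
        ]
    }
```

So the script posts `build_parent_payload(...)`, reads `key` from the response, and then posts a single `build_subtask_bulk_payload(...)` body for all sub-tasks, which keeps the total at two calls regardless of how many sub-tasks there are.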
The Project
I'm currently using FullCalendar to display a constantly changing calendar containing various information. When the user clicks on a day, a modal appears displaying a title, description, and links to files.
My current JSON object looks like:
{
"title": "myTitle",
"start": "2015-12-18T09:00:00",
"end": "2015-12-18T10:00:00",
"item_number": "1",
"description": "Test Document",
"items":[
{
"docName":"Document 1",
"docUrl":"docName.pdf"
},
{
"docName":"Document 2",
"docUrl":"docName-2.pdf"
},
{
"docName":"Document 3",
"docUrl":"docName-3.pdf"
}
],
"id": 0
}
The setup:
There are three teams:
Schedulers - This team modifies the schedule, then notifies me to edit the JSON file, updating the calendar.
Editors - This team creates then sends the documents to me to upload to the server and modify the JSON file.
Developers - This team puts everything together. Some days, you might have 60-90 edits throughout the day.
Currently, the JSON document is manually modified by the developers while we test.
My Plan:
Since the schedulers are not very tech-savvy, what I'm doing is having them modify a Google Spreadsheet that is published as a CSV and converted to JSON through PHP. This creates the following JSON object:
{
"title": "myTitle",
"start": "2015-12-18T09:00:00",
"end": "2015-12-18T10:00:00",
"item_number": "1",
"description": "Test Document"
}
The editors will create their documents and upload using Dropzone. A JSON object is created referencing the file(s):
"items":[
{
"docName":"Document 1",
"docUrl":"docName.pdf"
},
{
"docName":"Document 2",
"docUrl":"docName-2.pdf"
},
{
"docName":"Document 3",
"docUrl":"docName-3.pdf"
}
]
The two JSON objects are combined and an ID is assigned:
{
"title": "myTitle",
"start": "2015-12-18T09:00:00",
"end": "2015-12-18T10:00:00",
"item_number": "1",
"description": "Test Document",
"items":[
{
"docName":"Document 1",
"docUrl":"docName.pdf"
},
{
"docName":"Document 2",
"docUrl":"docName-2.pdf"
},
{
"docName":"Document 3",
"docUrl":"docName-3.pdf"
}
],
"id": 0
}
When changes occur – maybe there's a change to a document's name, documents are added or removed, or dates change – the individual JSON objects are re-created and re-combined.
The JSON file has hundreds of objects, and what I'm having trouble with is inserting the "items" key into the correct object. For instance, suppose objects with IDs 0-5 have been created and the "items" of ID=0 is modified: I need to update the "items" of ID=0 and leave 1-5 untouched.
The Question(s)
Using PHP or JavaScript, how can I link these two JSON objects correctly?
Would it be better to feed all of this information into a database (MySQL) and then construct the JSON file?
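The linking logic itself is small once both sides share a key. The sketch below is in Python only to show the shape of it (it ports directly to PHP or JavaScript), and it rests on an assumption the question does not state: that each upload is recorded under the same `item_number` the schedulers use. The function names are hypothetical.

```python
def combine(schedule_rows, items_by_item_number):
    """Attach each event's documents and assign a sequential ID."""
    events = []
    for event_id, row in enumerate(schedule_rows):
        event = dict(row)  # copy so the schedule data is left untouched
        event["items"] = list(items_by_item_number.get(row["item_number"], []))
        event["id"] = event_id
        events.append(event)
    return events


def update_items(events, event_id, new_items):
    """Replace the documents of one event only; return False if it is missing."""
    for event in events:
        if event["id"] == event_id:
            event["items"] = list(new_items)
            return True
    return False
```

Looking the event up by its ID rather than by its position in the file is what keeps a change to ID=0 from touching IDs 1-5.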
I'm still trying to write a function in JavaScript where the user can type in an artist, and it will return a link to that artist's SoundCloud page.
For example,
/artist beyonce --> https://soundcloud.com/beyoncemusic
But SoundCloud URLs don't all follow the same pattern. For example,
/artist dave matthews band --> https://soundcloud.com/dave-matthews-band.
For this reason, I can't simply output scLink/artistName, because the URLs are all built differently. I'm using Node.js, so I looked through a lot of npm packages, but couldn't figure out how to use any of them for this purpose. Perhaps Soundclouder will work somehow (though I couldn't figure it out myself). Does anyone know how I could write a command like this?
You are using the SoundCloud API, right?
A simple HTTP request to the right API should return the data you want. For example:
http://api.soundcloud.com/users.json?q=beyonce
[
{
"id": 4293843,
"kind": "user",
"permalink": "beyoncemusic",
"username": "Beyoncé",
"uri": "http://api.soundcloud.com/users/4293843",
"permalink_url": "http://soundcloud.com/beyoncemusic",
"avatar_url": "http://i1.sndcdn.com/avatars-000036935308-a2acxy-large.jpg?435a760",
"country": "United States",
"full_name": "Beyoncé",
"description": "",
"city": "New York",
"discogs_name": null,
"myspace_name": "beyonce",
"website": "http://www.beyonceonline.com",
"website_title": "",
"online": false,
"track_count": 33,
"playlist_count": 2,
"plan": "Pro Plus",
"public_favorites_count": 0,
"followers_count": 478783,
"followings_count": 0,
"subscriptions": [
{
"product": {
"id": "creator-pro-unlimited",
"name": "Pro Unlimited"
}
}
]
},
...
]
...so you could just do results[0].permalink_url.
You can use the request module to make the HTTP request manually, or use soundclouder to handle SoundCloud API's authentication details.
Most of the above does not apply if you want to make the actual requests from a browser. (The question is tagged node.js, but it sounds like you want to do this from a web page.)
If you're doing this from a webpage, use the SoundCloud JS SDK. The data you get back will look like the example above.
I don't think you'd be able to get an exact match reliably. Your best bet would be to search for users matching the string you are looking for (for example, "beyonce"), then show the results and let the user pick the correct link. You may be able to narrow down the likely results by follower count (a high follower count) or something similar after you've pulled the initial list from SoundCloud.
Search code:
SC.get('/users', { q: 'beyonce' }).then(function (users) { /* iterate over users here */ });
Then iterate over users and display the permalink url. Hope this helps.
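The follower-count filter can be sketched like this. It is shown in Python just to illustrate the selection logic; it works on the parsed search results and relies only on the `followers_count` and `permalink_url` fields visible in the sample response above. The function name and the `min_followers` threshold are hypothetical.

```python
def pick_artist_url(users, min_followers=0):
    """Return the permalink URL of the most-followed matching user, or None."""
    candidates = [
        user for user in users
        if user.get("followers_count", 0) >= min_followers
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda user: user.get("followers_count", 0))
    return best.get("permalink_url")
```

This is a heuristic, not an exact match: a popular unrelated account with a similar name would still win, which is why showing a short list and letting the user choose remains the safer option.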
I am tackling frontend development (AngularJS) and rather than pull data from the backend (which isn't complete but renders everything to JSON), I am looking to just use hardcoded JSON.
However, I am new to this and can't seem to find anything about complex JSON structure. In a basic sense, my web app has users and the content they create. So, in my mind, there will be two databases, but I'm not sure if I'm approaching it correctly.
Users - username, location, created content, comments, etc.
"user": [
{
"userID": "12",
"userUserName": "My Username",
"userRealName": "My Real Name",
"mainInterests": [
{
"interest": "Guitar"
},
{
"interest": "Web Design"
},
{
"interest": "Hiking"
}
],
"userAbout": "All about me...",
"userComments": [
{
"comment": "this is a comment", "contentID" : "12"
},
{
"comment": "this is another comment", "contentID" : "123"
}
]
}
]
Created Content - title, description, username, comments, etc.
"mainItem": [
{
"mainID": "1",
"mainTitle": "Guitar Lessons",
"mainCreatorUserName": "My Username",
"mainCreatorRealName": "My Real Name",
"mainTags": [
{
"tag": "Intermediate"
},
{
"tag": "Acoustic"
},
{
"tag": "Guitar"
}
],
"mainAbout": "Learn guitar!",
"mainSeries": [
{
"videoFile": "file.mov",
"thumbnail": "url.jpg",
"time": "9:37",
"seriesNumber": "1",
"title": "Learn Scales"
},
{
"videoFile": "file.mov",
"thumbnail": "url.jpg",
"time": "8:12",
"seriesNumber": "2",
"title": "Learn Chords"
}
],
"userComments": [
{
"comment": "this is a comment", "userID" : "12"
},
{
"comment": "this is another comment", "userID" : "123"
}
]
}
]
There is more complexity than that, but I would just like to understand whether I'm approaching this right. Maybe I'm approaching it entirely incorrectly (for instance, CRUD vs. REST: does it matter here? As I understand it, REST implies that each of the objects above is a resource with its own unique URI, so would the rendered JSON be affected?). I really am not sure. But ultimately, I need to use the JSON structure to properly pull data into my frontend. Presumably, whatever that structure is will be mirrored and rendered by the backend.
Edit: thank you for the replies. I think the part of my question where I clarify "complex" is missing, so I'd like to explain. More than the JSON itself, I really mean the structure of the data. For instance, in my example, I am structuring all of my data beneath two unique objects (user and content). Is this correct? Or should I split my data up further? For instance, technically I could have a comments database (where each comment is the main object). Or is that still implied in my dataset? Perhaps my question isn't even about JSON as much as it is about the data structure that will happen to get rendered in JSON. Hopefully this clarifies what I mean by complex.
Any and all help is appreciated.
I'm not sure why you're making what seems to be objects into single-item arrays (as evidenced by the opening square brackets). Other than that, it looks fine to me. Generally speaking single items (like "User") are structured as an object and multiples are arrays of objects.
As for the Angular stuff, if you want to pull direct from a JSON file as a test, take a look here:
var services = angular.module('app.services', []);
services.factory('User', function($http) {
  var User = function(data) {
    return data;
  };
  User.findOne = function(id) {
    return $http.get('/test_user.json').then(function(response) {
      return new User(response.data);
    });
  };
  return User;
});
I also recommend looking into Deployed for doing development without access to live data services.
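On the "comments database" idea from the edit: in the question's structure the same comment would have to appear twice, once under the user and once under the content. A possible alternative, sketched here in Python only to show the data shape, is a flat list of comments that carries both IDs and is filtered on demand. The `commentID` field and the function name are hypothetical; `userID` and `contentID` come from the question.

```python
# Each comment is stored once and points at its author and its target.
comments = [
    {"commentID": "1", "userID": "12", "contentID": "1", "comment": "this is a comment"},
    {"commentID": "2", "userID": "123", "contentID": "1", "comment": "this is another comment"},
    {"commentID": "3", "userID": "12", "contentID": "7", "comment": "a comment elsewhere"},
]


def comments_where(all_comments, key, value):
    """Return the comments whose `key` field equals `value`."""
    return [comment for comment in all_comments if comment.get(key) == value]


# Comments written by user 12, and comments left on content item 1.
by_user = comments_where(comments, "userID", "12")
on_content = comments_where(comments, "contentID", "1")
```

Whether this is worth it depends on how often comments change; if they are edited or deleted, storing each one in a single place avoids the two copies drifting apart.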