In order to integrate the PoC-style Content Scraper and Content Processor, an HTTP API is needed that provides the following features:

- trigger an async scraping -> processing task, which has the content-scraper scrape content from the input URL, store the results, and forward the task to the content-processor
- check the status of a triggered task and eventually retrieve the results of the processor stages/handlers
Here is a proposal for amending the HTTP API spec:
```yaml
openapi: 3.0.0
info:
  title: Web Scraping and Processing API
  version: 1.0.0
paths:
  /scrape:
    post:
      summary: Initiates a web scraping job.
      description: Triggers a new scraping job for the given URL and returns a task ID for status polling.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                url:
                  type: string
                  format: uri
                  description: The URL to be scraped.
              required:
                - url
            examples:
              example-1:
                value: { "url": "https://example.com" }
      responses:
        202:
          description: Accepted. The scraping job is initiated, and a task ID is returned.
          content:
            application/json:
              schema:
                type: object
                properties:
                  taskId:
                    type: string
                    description: The unique identifier for the scraping task.
              examples:
                example-1:
                  value: { "taskId": "12345" }
        400:
          description: Bad Request. The URL is invalid or missing.
        429:
          description: Too Many Requests. Rate limit exceeded.
        500:
          description: Internal Server Error.
  /scrape/{taskId}:
    get:
      summary: Polls the status and results of a scraping job.
      description: Retrieves the status and, if available, the results of a scraping job by task ID.
      parameters:
        - in: path
          name: taskId
          required: true
          description: The unique identifier for the scraping task.
          schema:
            type: string
      responses:
        200:
          description: OK. Returns the status of the scraping job and results if completed.
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    description: The current status of the job ('pending', 'in_progress', 'completed', 'failed').
                  results:
                    type: object
                    properties:
                      translation:
                        type: string
                        description: URL or location of the translation result.
                      seoKeywords:
                        type: string
                        description: URL or location of the SEO keyword extraction result.
                      sentimentAnalysis:
                        type: string
                        description: URL or location of the sentiment analysis result.
                required: []
              examples:
                pending:
                  value:
                    status: "pending"
                completed:
                  value:
                    status: "completed"
                    results:
                      translation: "https://results.example.com/translation/12345"
                      seoKeywords: "https://results.example.com/seo/12345"
                      sentimentAnalysis: "https://results.example.com/sentiment/12345"
        404:
          description: Not Found. The task ID does not exist.
        429:
          description: Too Many Requests. Rate limit exceeded.
        500:
          description: Internal Server Error.
```
For my understanding: the url is a page URL that has nothing to do with the sites we have in StarCatalogue?
The `Location` header in the POST response could contain the URL to poll for the status of the task.
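For illustration, a minimal sketch of how the 202 response could declare such a header (the header description wording is my own, not part of the proposal above):

```yaml
# Sketch: advertise the polling URL via a Location header on the 202 response.
202:
  description: Accepted. The scraping job is initiated, and a task ID is returned.
  headers:
    Location:
      description: URL to poll for the task status, e.g. /scrape/{taskId}.
      schema:
        type: string
        format: uri
```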
> The current status of the job ('pending', 'in_progress', 'completed', 'failed').
If the possible status values are known, we should use an enum.
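A minimal sketch of what that could look like in the schema, using the values listed in the description above:

```yaml
# Sketch: restrict status to the documented values instead of a free-form string.
status:
  type: string
  enum: [pending, in_progress, completed, failed]
  description: The current status of the job.
```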
What's the difference between 'pending' and 'in_progress'?
When is the task completed, after all the subtasks are completed? Would you show partial results as the subtasks complete, or just the final results?
It would be good to have an example of the response body for the `failed` status as well.
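Purely as an illustration, such an example could look like this (the `error` field is hypothetical and not part of the schema above):

```yaml
failed:
  value:
    status: "failed"
    # "error" is a hypothetical field that would have to be added to the schema
    error: "Scraping of https://example.com timed out"
```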
Generally, I think you'd have a different schema for the response body depending on status (or state), with different required properties.
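A rough sketch of how that could be expressed with `oneOf` (the schema names and the `error` property are made up for illustration):

```yaml
# Sketch: one schema per job state, selected via the status property.
schema:
  oneOf:
    - $ref: '#/components/schemas/PendingTask'     # requires: status
    - $ref: '#/components/schemas/CompletedTask'   # requires: status, results
    - $ref: '#/components/schemas/FailedTask'      # requires: status, error (hypothetical)
  discriminator:
    propertyName: status
    mapping:
      pending: '#/components/schemas/PendingTask'
      in_progress: '#/components/schemas/PendingTask'
      completed: '#/components/schemas/CompletedTask'
      failed: '#/components/schemas/FailedTask'
```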
For 429, it should include the Retry-After header.
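For example, something along these lines (the description text is my own wording):

```yaml
# Sketch: 429 response telling the client how long to back off.
429:
  description: Too Many Requests. Rate limit exceeded.
  headers:
    Retry-After:
      description: Number of seconds to wait before retrying the request.
      schema:
        type: integer
```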