Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add background processing jobs #5432

Merged
merged 35 commits into from
Sep 9, 2024
Merged

feat: add background processing jobs #5432

merged 35 commits into from
Sep 9, 2024

Conversation

jfcalvo
Copy link
Contributor

@jfcalvo jfcalvo commented Aug 27, 2024

Description

This PR add the following changes:

  • Add rq to help us execute background jobs.
  • Add a background job to update all records for a dataset when the dataset distribution strategy is updated.
  • Change HuggingFace Dockerfile to install Redis and run rq workers inside honcho Procfile.
  • Add documentation about new ARGILLA_REDIS_URL environment variable.
  • Add ping to Redis so Argilla server is not started if Redis is not ready.
  • Change Argilla docker compose file to include a container with Redis and rq workers.
  • Update Argilla server README.md file adding Redis as dependency to install.
  • Add documentation about Redis being a new Argilla server dependency.
  • Add BACKGROUND_NUM_WORKERS environment variable to specify the number of workers in the HF Space container.
  • Modify Dockerfile template on HF to include the environment variable [TASK] Modify HF Space template to include a comment about BACKGROUND_NUM_WORKERS environment variable #5443:
# (since: v2.2.0) Uncomment the next line to specify the number of background job workers to run (default: 2).
# ENV BACKGROUND_NUM_WORKERS=2
  • Remove some TODO sections before merging.
  • Review K8s documentation (maybe delete it?).
  • If we want to persist Redis data on HF Spaces we can change our Procfile Redis process to the following:
redis: /usr/bin/redis-server --dbfilename argilla-redis.rdb --dir ${ARGILLA_HOME_PATH}

Closes #5431

Benchmarks

The following timings were obtained updating the distribution strategy of a dataset with 100 and 10.000 records, using a basic and an upgraded CPU on HF Spaces, with and without persistent storage and measuring how much time the background job takes to complete:

CPU basic: 2 vCPU, 16GB RAM
CPU upgrade: 8 vCPU, 32GB RAM

  • CPU basic (with persistent storage):
    • 100 records dataset: ~8 seconds.
    • 10.000 records dataset: ~9 minutes.
  • CPU upgrade (with persistent storage):
    • 100 records dataset: ~5 seconds.
    • 10.000 records dataset: ~6 minutes.
  • CPU basic (no persistent storage):
    • 10.000 records dataset: ~101 seconds.
  • CPU upgrade (no persistent storage):
    • 10.000 records dataset: ~62 seconds.

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

  • Testing it on HF Spaces.

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • I confirm My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
Siteproxy
jfcalvo added 4 commits August 1, 2024 16:50

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@jfcalvo jfcalvo requested a review from frascuchon August 27, 2024 09:09
Copy link

codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 63.63636% with 20 lines in your changes missing coverage. Please review.

Project coverage is 91.00%. Comparing base (fee1f5a) to head (96c96dc).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
argilla-server/src/argilla_server/_app.py 27.27% 8 Missing ⚠️
...lla-server/src/argilla_server/jobs/dataset_jobs.py 63.63% 8 Missing ⚠️
argilla-server/src/argilla_server/cli/worker.py 55.55% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5432      +/-   ##
===========================================
- Coverage    91.32%   91.00%   -0.33%     
===========================================
  Files          141      144       +3     
  Lines         5869     5915      +46     
===========================================
+ Hits          5360     5383      +23     
- Misses         509      532      +23     
Flag Coverage Δ
argilla-server 91.00% <63.63%> (-0.33%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jfcalvo jfcalvo changed the title [DRAFT] feat: add Python rq to execute background jobs feat: add Python rq to execute background jobs Aug 27, 2024
jfcalvo and others added 9 commits August 27, 2024 17:40
…kers to use on HF docker image
jfcalvo and others added 3 commits August 30, 2024 10:58
…d rename environment variable
@jfcalvo jfcalvo changed the title feat: add Python rq to execute background jobs feat: add background processing jobs Aug 30, 2024
@frascuchon frascuchon added this to the v2.2.0 milestone Sep 3, 2024
jfcalvo and others added 2 commits September 4, 2024 12:39
Copy link

github-actions bot commented Sep 4, 2024

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-5432-ki24f765kq-no.a.run.app

…when dataset distribution strategy is update
@frascuchon frascuchon self-requested a review September 9, 2024 12:14

Redis is used by Argilla to store information about jobs to be processed on background. The following environment variables are useful to config how Argilla connects to Redis:

- `ARGILLA_REDIS_URL`: A URL string that contains the necessary information to connect to a Redis instance (Default: `redis://localhost:6379/0`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `ARGILLA_REDIS_URL`: A URL string that contains the necessary information to connect to a Redis instance (Default: `redis://localhost:6379/0`).
- `ARGILLA_REDIS_URL`: A URL string that contains the necessary information to connect to a Redis instance (Default: `redis://localhost:6379`).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 0 it's the Redis database. We can remove it and it will use by default the db 0 but I though it was better to be explicit.

@jfcalvo jfcalvo merged commit 84c8aa7 into develop Sep 9, 2024
12 of 13 checks passed
@jfcalvo jfcalvo deleted the feat/add-rq branch September 9, 2024 12:48
frascuchon added a commit that referenced this pull request Sep 11, 2024
# Description
<!-- Please include a summary of the changes and the related issue.
Please also include relevant motivation and context. List any
dependencies that are required for this change. -->

After merging changes included in
#5432, the SDK tests are
failing since the argilla test server requires Redis and workers to
work. This PR uses the hf-spaces standalone image which includes and
manages all the server deps.

**Type of change**
<!-- Please delete options that are not relevant. Remember to title the
PR according to the type of change -->

- Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**
<!-- Please add some reference about how your feature has been tested.
-->

**Checklist**
<!-- Please go over the list and make sure you've taken everything into
account -->

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add background processing
3 participants