[ENH] V1 → V2 API Migration - datasets#1608
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1608      +/-   ##
==========================================
- Coverage   54.68%   53.40%   -1.28%
==========================================
  Files          63       63
  Lines        5128     5520     +392
==========================================
+ Hits         2804     2948     +144
- Misses       2324     2572     +248
|
|
FYI @geetu040 Currently the
Issues:
Example (current):
def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    return _features
Or by updating the Dataset class to use the underlying interface method from api_context directly:
def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)
Another option is to add |
| bool | ||
| True if the deletion was successful. False otherwise. | ||
| """ | ||
| return openml.utils._delete_entity("data", dataset_id) |
if you implement the delete logic yourself instead of openml.utils._delete_entity, how would that look? I think it would be better.
Makes sense. It would look like a delete request from the client, along with exception handling.
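A minimal sketch of what "a delete request from the client along with exception handling" could look like. Everything here is an assumption for illustration: `FakeHTTPClient`, `FakeResponse`, `ServerError`, and the `data/<id>` path are hypothetical stand-ins, not the SDK's actual client or error types.

```python
from dataclasses import dataclass


class ServerError(Exception):
    """Hypothetical error raised when the server rejects a request."""


@dataclass
class FakeResponse:
    status_code: int
    text: str = ""


class FakeHTTPClient:
    """Stand-in for the real HTTP client; only `delete` is modelled."""

    def __init__(self, existing_ids):
        self._ids = set(existing_ids)

    def delete(self, path: str) -> FakeResponse:
        dataset_id = int(path.rsplit("/", 1)[-1])
        if dataset_id in self._ids:
            self._ids.remove(dataset_id)
            return FakeResponse(200, "deleted")
        return FakeResponse(404, "unknown dataset")


def delete_dataset(client, dataset_id: int) -> bool:
    """Return True if the deletion was successful, False otherwise."""
    response = client.delete(f"data/{dataset_id}")
    if response.status_code == 200:
        return True
    if response.status_code == 404:
        # Surface a missing dataset as an explicit error instead of a silent False.
        raise ServerError(f"Dataset {dataset_id} does not exist: {response.text}")
    return False
```

The point of the sketch is only that the resource method owns both the request and the error mapping, instead of delegating to a shared utility.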
| def list( | ||
| self, | ||
| limit: int, | ||
| offset: int, | ||
| **kwargs: Any, | ||
| ) -> pd.DataFrame: |
same as above, it can use private helper methods
| # Minimalistic check if the XML is useful | ||
| if "oml:data_qualities_list" not in qualities: | ||
| raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"') | ||
| from openml._api import api_context |
can't we have this import at the very top? does it create circular import error? if not, should be moved to top from all functions.
It does raise a circular import error.
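For context, a self-contained sketch of why moving the import inside the function avoids the cycle. The module names `mod_a` and `mod_b` are throwaway, hypothetical modules written to a temporary directory; they are not the SDK's modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# mod_a needs mod_b at import time (top-level import).
MOD_A = '''
from mod_b import helper

def backend():
    return "backend"
'''

# Variant 1: mod_b imports back into mod_a at the top, which closes the cycle.
MOD_B_TOP_LEVEL = '''
from mod_a import backend

def helper():
    return backend()
'''

# Variant 2: the import is deferred until helper() is actually called.
MOD_B_DEFERRED = '''
def helper():
    from mod_a import backend
    return backend()
'''


def try_import(mod_b_source: str) -> str:
    """Import mod_a with the given mod_b variant and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "mod_a.py").write_text(MOD_A)
        Path(tmp, "mod_b.py").write_text(mod_b_source)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            mod_a = importlib.import_module("mod_a")
            return mod_a.helper()
        except ImportError:
            return "ImportError"
        finally:
            sys.path.remove(tmp)
            for name in ("mod_a", "mod_b"):
                sys.modules.pop(name, None)
```

With the top-level variant, `mod_b` asks for `backend` while `mod_a` is still only partially initialized; with the deferred variant, both modules have finished loading by the time `helper()` runs.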
Thanks for a detailed explanation, I now have good understanding of the download mechanism.
minio can be handled easily, we will use a separate client along with
these are actually different objects in both apis, v1 uses xml and v2 keeps them in json
yes you are right, they are the same files, which are not required to be downloaded again for both versions, but isn't this true for almost all the http objects? they may have different format
I don't understand this point
agreed, should be handled by
agreed, adding in conclusion, I may ask, if we ignore the fact that it downloads the |
|
@geetu040 making a new client for
FYI: the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only part left is to handle
|
|
From the standup discussion and earlier conversations, I think we can agree on a few points:
Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way. I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs. |
|
The points do make sense to me. I will propose the design along with how it would be used in the resource. |
|
@geetu040 I have a design implemented which needs reviews
Question:
|
I have taken a quick look, the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But this in general fixes all our blockers without damaging the original design.
Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions, we can completely remove these lines and always rely on the
|
This makes sense now. Having an independent download method, as I have set up, is better than updating requests/caching to return a path, right?
Yes that would work, but the function definition would be changed i.e. tests etc corresponding to them |
I am not sure about that, would require a detailed review
yes, that is expected |
| parsed_url = urllib.parse.urlparse(source) | ||
|
|
||
| # expect path format: /BUCKET/path/to/file.ext | ||
| _, bucket, *prefixes, _file = parsed_url.path.split("/") |
_file should be renamed to _ given it is never used; surprised ruff does not call it out.
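A small sketch of the unpacking with the suggested rename applied. The function name `split_bucket_url` and the example URL are hypothetical; only the unpacking line mirrors the snippet under review.

```python
from urllib.parse import urlparse


def split_bucket_url(source: str) -> tuple[str, str]:
    """Split a URL of the form /BUCKET/path/to/file.ext into (bucket, prefix)."""
    parsed_url = urlparse(source)
    # expect path format: /BUCKET/path/to/file.ext
    # The leading empty segment and the file name are both discarded as `_`.
    _, bucket, *prefixes, _ = parsed_url.path.split("/")
    return bucket, "/".join(prefixes)
```

Assigning to `_` twice in one unpacking is valid Python, so the rename does not change behaviour.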
| Parameters | ||
| ---------- | ||
| data_id : list, optional | ||
| dataset_id : list, optional |
| def _get_dataset_parquet( | ||
| description: dict | OpenMLDataset, | ||
| cache_directory: Path | None = None, | ||
| cache_directory: Path | None = None, # noqa: ARG001 |
Why the ARG001 addition, I assume ruff checks were passing before this was added too?
Edit1: Just saw that this is not being used, why not remove it then?
Edit2: This is happening in more than 1 place, I won't mention it in all of them so we can just discuss it here.
Still in progress. This method is now updated to not use the param; I am working on removing it here and will update the corresponding test when I start writing tests for this PR, as mentioned in the meeting.
satvshr
left a comment
Left a few comments, will look at this again once it is actually ready for review with all implementations complete.
|
Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code but I could not given I have no API_KEY for it. |
If you are talking about the invalid api_key regex match in v2 you can set this in the config |
Um, why are we discussing API keys here? Even if it's just for testing. |
geetu040
left a comment
sdk code looks good so far, please take a look at #1575 (comment) and make changes accordingly where needed.
all tests (existing and new) should pass to make sure we are retaining the original functionality of the sdk
| ) -> pd.DataFrame: ... | ||
|
|
||
| @abstractmethod | ||
| def delete(self, dataset_id: int) -> bool: ... |
you can remove it from here as well
see point 5 in #1575 (comment)
|
|
||
| did_cache_dir = _create_cache_directory_for_id( | ||
| DATASETS_CACHE_DIR_NAME, | ||
| return api_context.backend.datasets.get( |
to keep the functionality of force_refresh_cache, you can use reset_cache for http.get
see point 1 in #1575 (comment)
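A rough sketch of how force_refresh_cache could be forwarded as reset_cache. The `CachingGetter` class and the `data/<id>` path are hypothetical; the real http.get signature may differ, so treat the keyword name as an assumption taken from the comment above.

```python
class CachingGetter:
    """Hypothetical stand-in for an HTTP client with a simple in-memory cache."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that performs the real request
        self._cache = {}

    def get(self, path, *, reset_cache=False):
        # Refetch when asked to reset, or when nothing is cached yet.
        if reset_cache or path not in self._cache:
            self._cache[path] = self._fetch(path)
        return self._cache[path]


def get_dataset(http, dataset_id, force_refresh_cache=False):
    # The public flag keeps its name; it is only translated at the call site.
    return http.get(f"data/{dataset_id}", reset_cache=force_refresh_cache)
```

This keeps the user-facing argument unchanged while the cache handling stays inside the client.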
|
@JATAYU000 The recent changes in the base branch fix the test failure in the Windows test job. Please update the base branch. |
| original_data_url: str | None = None, | ||
| paper_url: str | None = None, | ||
| ) -> int: | ||
| raise NotImplementedError(self._not_supported(method="edit")) |
You can just use self._not_supported(method="edit")
No need for raise NotImplementedError()
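A sketch of the pattern this suggestion implies, assuming that _not_supported raises on its own. The class `ResourceV2Sketch`, its attributes, and the local `OpenMLNotSupportedError` are stand-ins for illustration, not the SDK's definitions.

```python
from typing import NoReturn


class OpenMLNotSupportedError(Exception):
    """Local stand-in for the SDK's not-supported error."""


class ResourceV2Sketch:
    api_version = "v2"
    resource = "datasets"

    def _not_supported(self, *, method: str) -> NoReturn:
        # The helper raises itself, so callers do not wrap it in another `raise`.
        raise OpenMLNotSupportedError(
            f"{method!r} is not supported for {self.resource} in API {self.api_version}"
        )

    def edit(self, dataset_id: int) -> int:
        self._not_supported(method="edit")
```

Annotating the helper as NoReturn lets type checkers accept `edit` without an explicit return statement.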
| class TestDatasetV1API(TestAPIBase): | ||
| def setUp(self): | ||
| super().setUp() | ||
| self.client = self._get_http_client( |
since this is V1, using
self.client = self.http_client will do.
since the recent change, it can be used from self.http_clients[APIVersion.V1]
Alright, I have made the change
Pull request overview
Copilot reviewed 15 out of 25 changed files in this pull request and generated 12 comments.
Comments suppressed due to low confidence (1)
tests/test_datasets/test_dataset.py:16
pytest is imported twice in this module. The second import pytest is redundant and may be flagged by linters; please remove the duplicate import.
import pytest
import openml
from openml.datasets import OpenMLDataFeature, OpenMLDataset
from openml.testing import TestBase
import pytest
| return any(f.suffix in (".pq", ".arff") for f in cache_directory.iterdir()) | ||
| def _dataset_data_file_is_downloaded(dataset: OpenMLDataset): | ||
| if dataset._parquet_url is not None: | ||
| pq_directory = Path(openml.config.get_cache_directory()) / Path(openml.config.get_minio_download_path(dataset._parquet_url)).parent |
|
|
||
| features = _parse_features_xml(features_xml_string) | ||
|
|
||
| except FileNotFoundError: |
| try: | ||
| arff.loads(response.text) | ||
| return "body.arff" | ||
| except arff.ArffException: | ||
| pass |
| candidates = [] | ||
| for p in path.iterdir(): | ||
| if p.name.startswith("body.") and len(p.suffixes) == 1: | ||
| candidates.append(p) | ||
|
|
||
| if (path / "body.xml").exists(): | ||
| return "body.xml" | ||
| if not candidates: | ||
| raise FileNotFoundError(f"No body file found in path: {path}") | ||
|
|
||
| return "body.txt" | ||
| if len(candidates) > 1: | ||
| raise FileNotFoundError( | ||
| f"Multiple body files found in path: {path} ({[p.name for p in candidates]})" | ||
| ) |
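Read on its own, the new side of the diff above appears to resolve a single body.* file in a cache directory. A sketch under that reading; the function name `find_body_file` and the final return value are assumptions, since the diff does not show them.

```python
from pathlib import Path


def find_body_file(path: Path) -> str:
    """Return the name of the single body.<ext> file in `path`."""
    # Only single-suffix names such as body.xml or body.arff qualify.
    candidates = [
        p for p in path.iterdir()
        if p.name.startswith("body.") and len(p.suffixes) == 1
    ]
    if not candidates:
        raise FileNotFoundError(f"No body file found in path: {path}")
    if len(candidates) > 1:
        raise FileNotFoundError(
            f"Multiple body files found in path: {path} ({[p.name for p in candidates]})"
        )
    # Assumption: the caller wants the file name rather than the full path.
    return candidates[0].name
```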
geetu040
left a comment
Really nice, we are close to finishing this, just left a few minor comments.
| class DatasetV1API(ResourceV1API, DatasetAPI): | ||
| """Version 1 API implementation for dataset resources.""" | ||
|
|
||
| @openml.utils.thread_safe_if_oslo_installed |
I think this is redundant, we already apply lock on get_dataset in datasets/functions.py
| class DatasetV2API(ResourceV2API, DatasetAPI): | ||
| """Version 2 API implementation for dataset resources.""" | ||
|
|
||
| @openml.utils.thread_safe_if_oslo_installed |
| return openml._backend.dataset.delete_topic(data_id, topic) | ||
|
|
||
|
|
||
| def _get_dataset_description(did_cache_dir: Path, dataset_id: int) -> dict[str, Any]: |
this function still uses old api calls, though it's never really used in the code base, except for the tests. I'd suggest removing it entirely from everywhere
|
|
||
|
|
||
| @contextlib.contextmanager | ||
| def file_lock(lock_path: str) -> Iterator[None]: |
- Did you have to add this because of the failing tests in CI?
- We already have a lock in the codebase: thread_safe_if_oslo_installed. This is different because it applies the lock at file level instead of function level.
- Do you think this file_lock could be avoided inside functions and replaced by thread_safe_if_oslo_installed at some function level?
- If we are keeping it, can we align this function more with thread_safe_if_oslo_installed, which would mean giving a similar name and, if possible, keeping file_lock (resource/file-scoped) and thread_safe_if_oslo_installed (function/id-scoped) as separate wrappers over one shared external-lock primitive.
| output = dataset_v1.get_qualities(2) | ||
|
|
||
| assert isinstance(output, dict) | ||
| assert len(output.keys()) == 107 |
quite hardcoded, should be avoided; maybe >= 100 would be my suggestion, similarly in other places for other v1/v2 and get/list_quantiles.
| assert output | ||
|
|
||
| output = dataset_v1.feature_remove_ontology(did, fid, ontology) | ||
| assert output |
doesn't really check if the functions actually work; can you check the output, or whether the ontologies have been updated after each call?
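A sketch of a state-checking round trip along these lines, run against an in-memory fake. `InMemoryDatasetAPI`, `feature_add_ontology`, and `get_feature_ontologies` are hypothetical here; only `feature_remove_ontology` appears in the test under review, and the real way to read ontologies back may differ.

```python
class InMemoryDatasetAPI:
    """Hypothetical fake that stores ontologies per (dataset id, feature id)."""

    def __init__(self):
        self._ontologies = {}

    def feature_add_ontology(self, did, fid, ontology):
        self._ontologies.setdefault((did, fid), []).append(ontology)
        return True

    def feature_remove_ontology(self, did, fid, ontology):
        self._ontologies[(did, fid)].remove(ontology)
        return True

    def get_feature_ontologies(self, did, fid):
        return list(self._ontologies.get((did, fid), []))


def check_ontology_round_trip(api, did, fid, ontology):
    # Assert on the stored state after each call, not just on the return value.
    assert api.feature_add_ontology(did, fid, ontology)
    assert ontology in api.get_feature_ontologies(did, fid)
    assert api.feature_remove_ontology(did, fid, ontology)
    assert ontology not in api.get_feature_ontologies(did, fid)
```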
| openml.config.apikey = TestBase.admin_key | ||
| topic = f"test_topic_{str(time.time())}" | ||
| dataset_v1.add_topic(31, topic) | ||
| dataset_v1.delete_topic(31, topic) |
| @pytest.mark.test_server() | ||
| def test_v2_add_delete_topic(dataset_v2): | ||
| with pytest.raises(OpenMLNotSupportedError): | ||
| dataset_v2.add_topic(2, 'test_topic_' + str(time.time())) |
probably update for delete as well
Metadata
Fixes [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions