Version 2.0 development #476


Draft: wants to merge 27 commits into main

Conversation

@leouieda (Member) commented Nov 29, 2024

This is a PR to keep track of development in the 2.0 branch. It won't be merged through this PR when 2.0 is ready.

PRs that add things meant for 2.0 should be made against this branch and not main. This includes documentation overhauls and breaking changes.

The docs for this branch are deployed to https://www.fatiando.org/verde/dev2.0

See #331 and the 2.0 milestone

DO NOT MERGE THIS PR. The default "squash and merge" is no good here; we'll have to merge this branch in a way that preserves its history.

@leouieda added this to the v2.0.0 milestone Nov 29, 2024
@santisoler (Member) commented:

Awesome @leouieda! I configured a branch protection rule for 2.0-dev that forbids pushing directly to it (a PR must be opened) and requires branches to be up to date before merging into 2.0-dev.

leouieda and others added 17 commits May 8, 2025 10:47

Add the first tutorial of the new documentation structure (see #433). It tries to be as simple as possible about how to generate a grid from some data: no cross-validation or other fancy things if it can be avoided.

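As a rough sketch of the kind of minimal workflow such a tutorial could cover (using the 1.x-style `Spline` API; names and defaults may change in 2.0):

```python
import numpy as np
import verde as vd

# Synthetic scattered data, just for illustration.
rng = np.random.default_rng(42)
coordinates = (rng.uniform(0, 10, 300), rng.uniform(-5, 5, 300))
data = np.sin(coordinates[0]) * np.cos(coordinates[1])

# Fit an interpolator to the data and generate a regular grid from it.
spline = vd.Spline()
spline.fit(coordinates, data)
grid = spline.grid(spacing=0.5)
```
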
Talk about how to handle map projections using pyproj and the ways Verde can generate grids and profiles using them.

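For instance, a sketch of the projection workflow (the `projection` argument to `grid` exists in Verde 1.x; the 2.0 interface may differ):

```python
import numpy as np
import pyproj
import verde as vd

# Synthetic geographic data for illustration.
rng = np.random.default_rng(0)
longitude = rng.uniform(-60, -50, 300)
latitude = rng.uniform(-30, -20, 300)
data = np.cos(np.radians(latitude))

# A Mercator projection converts geographic to Cartesian coordinates.
projection = pyproj.Proj(proj="merc", lat_ts=-25)
easting, northing = projection(longitude, latitude)

# Fit on projected coordinates, then pass the projection to grid() so
# the output grid is generated in geographic coordinates.
spline = vd.Spline().fit((easting, northing), data)
grid = spline.grid(spacing=0.5, projection=projection)
```
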
Create a "dev2" folder in the gh-pages branch for the 2.0-dev branch
docs. This is useful since a lot of the work on 2.0 is a complete
rewrite of the docs. Need to undo this before the 2.0 release.
The datasets module was deprecated and replaced by the Ensaio package. Remove the data folder and the datasets module from Verde, along with the gallery and old tutorial which used the removed module. Remove the tests and baseline images used to test the datasets, and drop pytest-mpl as a testing requirement.

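The replacement workflow looks roughly like this, assuming one of Ensaio's fetch functions (`fetch_southern_africa_gravity` is used here as an example; the exact names are Ensaio's, not Verde's):

```python
import ensaio
import pandas as pd

# Ensaio downloads the sample data and caches it locally,
# returning the path to the downloaded file.
fname = ensaio.fetch_southern_africa_gravity(version=1)
data = pd.read_csv(fname)
print(data.head())
```
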
The ScipyGridder class was deprecated and replaced by Linear, Cubic, and KNearest. Remove it from the codebase along with its tests and docs.

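A sketch of the migration, assuming the replacements keep the standard gridder interface (in released Verde the classes are `Linear`, `Cubic`, and `KNeighbors`):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(0)
coordinates = (rng.uniform(0, 10, 100), rng.uniform(0, 10, 100))
data = coordinates[0] + coordinates[1]

# Before (deprecated): gridder = vd.ScipyGridder(method="linear")
# After: dedicated classes with the same fit/grid interface.
gridder = vd.Linear()
gridder.fit(coordinates, data)
grid = gridder.grid(spacing=1)
```
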
The client argument was deprecated in favor of Dask's delayed interface. Remove it from the few places where it was still present.

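For reference, a sketch of the Dask-based pattern that replaces it (the `delayed` argument to `cross_val_score` is in Verde 1.x and is assumed to carry over):

```python
import dask
import numpy as np
import verde as vd

rng = np.random.default_rng(0)
coordinates = (rng.uniform(0, 10, 200), rng.uniform(0, 10, 200))
data = np.hypot(*coordinates)

# Instead of passing a dask.distributed Client, get lazy results
# and compute them with whichever scheduler is active.
scores = vd.cross_val_score(vd.Spline(), coordinates, data, delayed=True)
scores = dask.compute(*scores)
```
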
Numba will be required from 2.0 on, so the slower numpy engine isn't needed anymore. Remove the engine argument from Spline and VectorSpline2D, remove the utils functions that are no longer used, and remove the numpy-based code from both classes.

We're using numba's parallel execution in Spline and VectorSpline2D and it's great. But it turns out that using it in a `ThreadPoolExecutor` (which dask uses) when numba was installed with pip causes a crash. This is because the threading backend that pip-installed numba uses on Mac (called "workqueue") is not thread-safe and aborts the process when it's run inside another threading layer. There are workarounds in the numbagg package but they're complicated and involve caching of jit-compiled functions. We'll use a simpler solution: allow users to turn off parallelism and add a note in the docs about avoiding it when it would cause a crash.

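A minimal sketch of what that opt-out could look like, assuming a hypothetical `parallel` keyword on `Spline` (the actual 2.0 argument name may differ):

```python
import verde as vd

# Hypothetical flag: disable numba's parallel execution when fitting
# inside another threading layer (like dask's ThreadPoolExecutor) to
# avoid aborts from pip's "workqueue" threading backend on Mac.
spline = vd.Spline(parallel=False)
```
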
Use caching of pip and conda as much as possible. Use pip instead of conda for the tests now that the datasets module was removed. Remove the unused dependencies sphinx-gallery, cartopy, and pooch. Bump the minimum version of Python to 3.10 and update dependencies.

Co-authored-by: Santiago Soler <[email protected]>

It's no longer needed since we found a workaround for the case where the distance is 0. It was already deprecated and can now be removed.

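A sketch of handling the zero-distance case directly, assuming the biharmonic spline Green's function r²(ln r − 1), whose limit as r → 0 is 0 (not necessarily the exact code in this branch):

```python
import numpy as np

def greens_function(distance):
    # r**2 * (log(r) - 1) tends to 0 as r -> 0, so set those entries
    # explicitly instead of adding a small "fudge" to the distances.
    result = np.zeros_like(distance, dtype=float)
    nonzero = distance > 0
    r = distance[nonzero]
    result[nonzero] = r**2 * (np.log(r) - 1)
    return result
```
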
The newer KDTree in SciPy is already very fast and the difference isn't as big anymore. It's not worth the effort of maintaining an optional dependency in terms of CI runs and code complexity. Since Verde doesn't have optional dependencies now, we don't need the extra CI run.

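The SciPy replacement is straightforward to adopt, along these lines:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(1000, 2))

# Build the tree once, then query the 3 nearest neighbors of each point.
tree = KDTree(points)
distances, indices = tree.query(points, k=3)
```
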
This function was deprecated and should be removed in 2.0. The tests can easily be run with pytest and don't need it.

On second thought, that's not really necessary since the R² is good in most cases. If RMSE is wanted, we can still get it through SplineCV and cross_val_score.

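For example, a sketch using the scikit-learn style `scoring` argument (which Verde's `cross_val_score` accepts in 1.x and is assumed to carry over):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(5)
coordinates = (rng.uniform(0, 10, 200), rng.uniform(0, 10, 200))
data = np.hypot(*coordinates)

# Score with RMSE instead of the default R².
scores = vd.cross_val_score(
    vd.Spline(), coordinates, data, scoring="neg_root_mean_squared_error"
)
```
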
It was missing from the docstring.

We never want to predict on random points. And if we did, we could still do it with the predict method.

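If predicting on arbitrary scattered points is ever needed, the `predict` method covers it, along these lines:

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(1)
coordinates = (rng.uniform(0, 10, 200), rng.uniform(0, 10, 200))
data = np.sin(coordinates[0])

spline = vd.Spline().fit(coordinates, data)

# Predict on any set of points, random or otherwise.
random_points = (rng.uniform(0, 10, 50), rng.uniform(0, 10, 50))
predicted = spline.predict(random_points)
```
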
For security, it's better to pin versions of third-party Actions to commit hashes. The tags used for versions can be changed to point to a different (possibly malicious) commit, while a hash can't. Dependabot can still update the pinned hashes, so maintenance isn't a huge issue.

This paper no longer describes current Verde and can be removed. The published version is on JOSS and archived in several places.

leouieda and others added 7 commits May 8, 2025 11:00

When nodes aren't given and are the same as the data coordinates, the Jacobian is symmetric and we can calculate only the upper or lower triangular part. In serial computation, exploiting this leads to a ~50% decrease in computation time. In parallel, numba does some kind of optimization that doesn't work on the symmetric implementation, and the symmetric code is actually a little bit slower than the naive implementation. So only use the new code when building the Jacobian in serial. This doesn't have a huge impact on overall performance but it's significant enough.

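A sketch of the idea (with a placeholder Green's function for illustration; the actual implementation in this branch will differ):

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def greens_function(east, north):
    # Placeholder biharmonic spline Green's function.
    distance = np.sqrt(east**2 + north**2)
    if distance == 0:
        return 0.0
    return distance**2 * (np.log(distance) - 1)

@numba.jit(nopython=True)
def jacobian_symmetric(east, north, jacobian):
    # When forces coincide with the data points, jacobian[i, j] equals
    # jacobian[j, i], so fill only the upper triangle and mirror it,
    # roughly halving the work in serial execution.
    for i in range(east.size):
        for j in range(i, east.size):
            value = greens_function(east[i] - east[j], north[i] - north[j])
            jacobian[i, j] = value
            jacobian[j, i] = value
    return jacobian
```
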
Ditch the `setup.cfg` file and use only `pyproject.toml`. Move the flake8 configuration to its own `.flake8` file.

Co-authored-by: Leonardo Uieda <[email protected]>

The community has been moving toward putting the package in a src folder. This has some advantages: for example, we don't need to make a temp folder to run the tests in, since the module isn't importable from the repository root without being installed. Also move the tests to a folder outside of the package to avoid distributing them; they aren't used outside of development. This makes the tarballs and wheels a bit smaller, which is good.

Remove the `block_split` function, which moved to Bordado. Rename the `spacing` argument in `BlockReduce`, `BlockMean`, `BlockKFold`, `BlockShuffleSplit`, and `train_test_split` to `block_size`, and `shape` to `block_shape`. These are the argument names in `bordado.block_split` and they make much more sense. Update tests and examples to match and add Bordado as a dependency.

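From the user's side, the rename looks roughly like this (assuming `BlockReduce` keeps its 1.x `filter` interface here, only with the new argument name):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(2)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(0, 10, 500))
data = rng.normal(size=500)

# 1.x: vd.BlockReduce(np.median, spacing=1)
# 2.0-dev: the argument is block_size instead of spacing.
reducer = vd.BlockReduce(np.median, block_size=1)
block_coordinates, block_data = reducer.filter(coordinates, data)
```
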
The `line_coordinates` function was moved to Bordado. Remove it from the codebase and use the Bordado version. Also remove the `spacing_to_size` function, which was only used by `line_coordinates`, along with the associated tests and the API entry.

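Usage then comes from Bordado directly, along these lines (assuming `bordado.line_coordinates` keeps the size/spacing interface of the Verde version):

```python
import bordado as bd

# Evenly spaced points along a line, by number of points...
values = bd.line_coordinates(0, 10, size=11)

# ...or by the spacing between points.
values = bd.line_coordinates(0, 10, spacing=2.5)
```
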
This file is created by setuptools_scm and shouldn't have been in the repository.

The combination of autosummary and numpydoc was causing warnings about missing stub files: numpydoc was building a toctree for methods, but the pages didn't exist (they are all in the class's main page). Disable this option and remove the autogenerated attribute pages, since we already describe the important attributes in the docstrings.

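The relevant knob is numpydoc's `numpydoc_class_members_toctree` option in the Sphinx `conf.py` (a real numpydoc setting, shown here as a sketch of the fix described above):

```python
# conf.py: stop numpydoc from building a toctree of class methods,
# which pointed to autosummary stub pages that don't exist.
numpydoc_class_members_toctree = False
```
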
leouieda added 3 commits May 14, 2025 10:34

The Chain class was introduced to help perform trend estimation and spline interpolation in a single step, preventing data leakage from the trend estimation step. In practice, it's rarely used since the trend ended up not being that necessary. The existence of Chain makes it necessary to have the `filter` method, which is awkward. It also forces BlockMean and BlockReduce to be classes with a single method (filter) instead of functions, as would be more natural. Removing Chain fixes all of these issues and won't have a large impact on usability. With Chain removed, the filter method of BaseGridder can also be removed.

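Without Chain, the same two-step workflow is a few explicit lines (a sketch using the 1.x `Trend` and `Spline` APIs):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(3)
coordinates = (rng.uniform(0, 10, 300), rng.uniform(0, 10, 300))
data = 2 * coordinates[0] + rng.normal(size=300)

# Instead of vd.Chain([("trend", vd.Trend(1)), ("spline", vd.Spline())]):
trend = vd.Trend(degree=1).fit(coordinates, data)
residuals = data - trend.predict(coordinates)
spline = vd.Spline().fit(coordinates, residuals)

# Predictions are the trend plus the interpolated residuals.
predicted = trend.predict(coordinates) + spline.predict(coordinates)
```
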
This avoids clashes with the Python standard library io module.

These are all of the `check_*` functions that were scattered throughout the code. Also take the opportunity to modernize them and combine some that could be made into a single function. The `check_data` and `check_coordinates` functions were combined into `check_tuple_of_arrays`. The `check_data_names` and `check_extra_coords_names` functions were combined into `check_names`. `check_fit_input` was given more functionality, like ravelling the input arrays and better checking of compatibility between data and weights. The option to unpack the data and weights arguments was removed, since we're moving towards having estimators always deal with multiple data components.

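As a rough sketch of what a combined check like `check_tuple_of_arrays` might do (hypothetical body; only the function name comes from the commit message):

```python
import numpy as np

def check_tuple_of_arrays(arrays, name):
    # Hypothetical sketch: convert the inputs to arrays and make sure
    # they all share the same shape, raising a helpful error if not.
    arrays = tuple(np.asarray(array) for array in arrays)
    shapes = {array.shape for array in arrays}
    if len(shapes) > 1:
        raise ValueError(f"Invalid {name}. Shapes must all match: {shapes}")
    return arrays
```
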