Datasets included in library distributions are very practical for explaining concepts and for tutorials, as of course no extra download is required. A while ago, I posted a list of biodiversity datasets that come with R-core. Here I continue along the same line and list datasets coming with popular Python libraries.
Scikit-learn
Scikit-learn comes with the ubiquitous iris dataset, which includes sepal and petal length and width for 3 species of irises.
Another interesting dataset: the distribution of 2 species of South-American mammals with modelling examples.
There are many visualization examples of the iris and other datasets on this page.
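As a minimal sketch of how the bundled iris data can be loaded without any download, the snippet below uses `load_iris` with `as_frame=True` (this assumes a reasonably recent scikit-learn and an installed pandas). The `species` column and the `summary` table are illustrative additions, not part of the dataset itself.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus an integer "target" column

# Map the integer class labels to the species names shipped with the dataset
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Mean of each measurement per species
summary = df.drop(columns="target").groupby("species").mean()
print(summary)
```

The same pattern should work for the other `load_*` functions in `sklearn.datasets`, which read files shipped with the package, whereas the `fetch_*` functions download their data on first use.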
Seaborn
The plotting library Seaborn comes with a variety of datasets, including a dataset on penguin sizes that includes plot examples. Here’s an example of applying principal component analysis to this dataset.
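To give a rough idea of what such an analysis involves, here is a small sketch using scikit-learn's `PCA`. It is not the code from the linked example. The helper `pca_scores` is a hypothetical name, and the column list assumes the column names used in Seaborn's `penguins` dataset.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric measurement columns, as named in Seaborn's penguins dataset
PENGUIN_MEASUREMENTS = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
]


def pca_scores(df, columns=PENGUIN_MEASUREMENTS, n_components=2):
    """Return PCA scores and explained-variance ratios (hypothetical helper)."""
    cols = list(columns)
    # Drop rows with missing measurements, since PCA cannot handle NaN
    clean = df.dropna(subset=cols)
    # Standardize so that body mass (in grams) does not dominate the
    # columns measured in millimetres
    scaled = StandardScaler().fit_transform(clean[cols])
    pca = PCA(n_components=n_components)
    scores = pd.DataFrame(
        pca.fit_transform(scaled),
        index=clean.index,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    return scores, pca.explained_variance_ratio_
```

With Seaborn installed, calling `pca_scores(seaborn.load_dataset("penguins"))` should return the first two principal components for every penguin with complete measurements. Note that `load_dataset` fetches the file over the network the first time it is called and caches it afterwards.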
TensorFlow
TensorFlow provides a tutorial on how to use the datasets shipped with this library, as well as example notebooks. People interested in biodiversity should definitely check the Fine tuning models for plant disease detection example, which is based on data from i_naturalist. Other datasets for biodiversity:
- Beans
- Bees
- North American birds
- Healthy and unhealthy cassava plants
- Healthy and unhealthy citrus tree leaves
- Australian wild plants
- Common flowers in the UK
- Healthy and unhealthy plant leaves (three separate datasets)
- Flowers
- “Historical phenological data for cherry tree flowering at Kyoto City”
- Forest fires in Northeast Portugal
- Iris
- Penguin sizes
Not directly related to biodiversity, but interesting nonetheless: the EuroSAT dataset of satellite images.