Update data-analysis materials for "Python for Data Analysis" refresh #779
Closed
realpython-bot wants to merge 1 commit into
Sync the materials with the refreshed "Python for Data Analysis" tutorial
and its updated dependencies (pandas 3.0.3, matplotlib 3.10.9,
scikit-learn 1.9.0, openpyxl 3.1.5, pyarrow 24.0.0, lxml 6.1.1, Python 3.14).
Code changes in both notebooks:
- Currency cleanup regex now uses a raw string and strips whitespace:
.replace("[$,]", ...) -> .replace(r"[$,\s]", ...), matching the
source data, which has spaces inside the quoted figures (" $1,000.00 ").
- film_length suffix removal now strips the leading space too:
.str.removesuffix("mins") -> .str.removesuffix(" mins").
- read_html() now sends a browser User-Agent via storage_options, since
Wikipedia returns HTTP 403 without it.
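
The effect of the two string changes above can be sketched with the standard library alone. This is only an illustration: `re.sub` and `str.removesuffix` stand in for the pandas `.replace(..., regex=True)` and `.str.removesuffix(...)` calls, the currency sample is the one quoted above, and the `"110 mins"` value is a hypothetical example rather than a row from the dataset.

```python
import re

# Sample quoted in the commit message; the surrounding spaces are the point.
raw_income = " $1,000.00 "

old_pattern = "[$,]"      # previous pattern: removes "$" and "," only
new_pattern = r"[$,\s]"   # updated pattern: also removes whitespace

old_result = re.sub(old_pattern, "", raw_income)
new_result = re.sub(new_pattern, "", raw_income)

# Hypothetical film_length value, for illustration only.
raw_length = "110 mins"
old_length = raw_length.removesuffix("mins")    # leaves a trailing space
new_length = raw_length.removesuffix(" mins")   # removes the space as well

print(repr(old_result), repr(new_result))
print(repr(old_length), repr(new_length))
```

Both currency results convert to the same number, because numeric conversion tolerates surrounding whitespace; the new pattern simply makes that cleanup explicit instead of leaving it to the conversion step.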
Add data-analysis/requirements.txt pinning the tutorial's dependencies.
Regenerate james_bond_data_cleansed.csv: under pandas 3.0, .combine_first()
no longer alphabetically sorts the result columns, so the cleansed file now
preserves the logical source column order. Data values are unchanged.
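
One way to confirm that only the column order changed is to compare the old and new files row by row as dictionaries, which ignores column position. The sketch below is a hypothetical check, not part of the PR: the helper name, the column names, and the miniature CSV contents are all made up for illustration.

```python
import csv
import io


def rows_ignoring_column_order(csv_text):
    """Parse CSV text into a list of dicts, so column position no longer matters."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical miniature files: same data, different column order.
old_csv = "film_length,title,year\n110,Example Film,1962\n"
new_csv = "title,year,film_length\nExample Film,1962,110\n"

old_rows = rows_ignoring_column_order(old_csv)
new_rows = rows_ignoring_column_order(new_csv)

# Dict equality compares keys and values, not insertion order.
same_values = old_rows == new_rows
# Key order reflects the header order of each file.
same_order = list(old_rows[0]) == list(new_rows[0])

print(same_values, same_order)
```

For the real files, the two strings would be replaced by the contents of the previous and regenerated james_bond_data_cleansed.csv.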
Verified end-to-end on the pinned versions (Python 3.14): cleansing
reproduces the dataset and the regression analysis still yields R-squared
0.79 with film-length stats min 106 / max 163 / mean 128.28 / std 12.94.
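
For readers who want to see what these figures measure, here is a minimal sketch of a least-squares fit, its R-squared, and the four summary statistics, using only the standard library. The data points are hypothetical and are not the tutorial's data, so the numbers will not match the ones reported above. `statistics.stdev` is the sample standard deviation, which is also what pandas `.std()` computes by default.

```python
import statistics

# Hypothetical (x, y) pairs, for illustration only; not the tutorial's data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

mean_x = statistics.fmean(xs)
mean_y = statistics.fmean(ys)

# Ordinary least-squares slope and intercept for y = slope * x + intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# R-squared: share of the variance in y explained by the fitted line.
predicted = [slope * x + intercept for x in xs]
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

# Summary statistics in the same shape as the film-length figures above.
summary = {
    "min": min(ys),
    "max": max(ys),
    "mean": statistics.fmean(ys),
    "std": statistics.stdev(ys),  # sample standard deviation (n - 1)
}

print(f"y = {slope:.4f}x + {intercept:.4f}, R-squared = {r_squared:.2f}")
print(summary)
```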
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closed in favor of #780.
Syncs the data-analysis/ materials with the refreshed Python for Data Analysis tutorial and its updated dependency stack.

Dependencies

Added data-analysis/requirements.txt, pinning the versions from the tutorial update. (The tutorial targets Python 3.14.)

Code changes (both notebooks)

- .replace("[$,]", "", regex=True) → .replace(r"[$,\s]", "", regex=True). The source CSV stores figures with surrounding spaces (" $1,000.00 "), so this makes the cleanup explicit rather than relying on astype() to trim.
- film_length suffix removal now removes the leading space too: .str.removesuffix("mins") → .str.removesuffix(" mins").
- read_html() now passes a browser User-Agent via storage_options={"User-Agent": "Mozilla/5.0"}, since Wikipedia now returns HTTP 403 Forbidden without one (findings notebook).

Regenerated artifact

james_bond_data_cleansed.csv was regenerated under pandas 3.0. In pandas 3.0, .combine_first() no longer alphabetically sorts the result columns, so the cleansed file now preserves the logical source column order. The data values are unchanged; only the column order differs.

Verification

Ran the full pipeline end-to-end on the pinned versions (Python 3.14, pandas 3.0.3, scikit-learn 1.9.0, matplotlib 3.10.9):

- Regression line: y = 1.6637x - 4.9276.
- film_length stats: min 106, max 163, mean 128.28, std 12.94, matching the tutorial output.

🤖 Generated with Claude Code