commit b0c3b982d100e96b1bbf6760bd9274608a4ee2ff
parent 247fad46c7c38049c7a41c44e4c789885402c19c
Author: Steve Gattuso <steve@stevegattuso.me>
Date: Mon, 13 Nov 2023 21:07:59 +0100
update README
Diffstat:
| M | README.md | | | 43 | +++++++++++++++++++++++-------------------- |
1 file changed, 23 insertions(+), 20 deletions(-)
diff --git a/README.md b/README.md
@@ -1,29 +1,32 @@
# Forerad
-Trying to learn about statistics and forecasting via doing stuff with Citibike data. Also trying to prevent myself from going down another Factorio wormhole by essentially doing the same thing but a bit more useful.
+This repository is a collection of utilities for working with Citibike data. It allows you to easily download all of Citibike's ride history archives, transform them as you see fit, and throw them into a SQLite database for easy querying.
-Goal: come up with a somewhat decent 3-to-7 day Citibike ride forecast.
+This repository is what I use to build the SQLite database behind [Citibike Explorer](https://citibike.stevegattuso.me). It may also be useful if you don't feel like writing your own scraper to download, unzip, and load trip history archives into a `pd.DataFrame`.
-## Planning
-Data collection is the hardest part. I need to:
+## Installation and usage
+Clone the repository, `cd` into the directory, and run:
-1. Build a scraper job that fetches, cleans, and stores historical ride data.
-2. Build a scraper job that fetches, cleans, and stores any relevant real-time data.
-3. Build a forecasting job that combines these datasets into a 3 day prediction.
-4. Build a job that calculates MAPE for previous forecasts as realtime data comes out.
-5. Build some kind of dashboard that displays the predictions and actuals.
+```bash
+$ python -m virtualenv .venv
+$ source .venv/bin/activate
+$ pip install -r ./requirements.txt
+```
-Data to incorporate into a forecast:
+Once the requirements are installed, you can use `./bin/scraper` to download the trip archives individually or all in one go. See `./bin/scraper --help` for details.
-* Historical ridership per day
-* Federal holiday status
-* Weather
- * Temperature min/max
- * Precipitation probability
+There is also `./bin/hourly-volume-rollup`, which parses all available archives and rolls up the trip data into an hourly time series. Note that this requires provisioning a SQLite database first, which can be done by running `yoyo apply`.
-2023-11-05: Need to figure out why the populate script missed a bunch of random days.
+If you're just looking to load an archive into pandas, here's the code snippet you're looking for:
-## Open questions
-* Can I estimate electric citibike revenue with the v2 dataset?
+```python
+import forerad.scrapers.historical as historical
-## Prior art
-* [This blog post](https://medium.com/@kumbharniraj1/citi-bike-trips-analysis-and-prediction-63fcb557354d) details somebody trying to do this with out-of-the-box machine learning models.
+archives = historical.HistoricalTripArchive.list_cached()
+df = archives[0].fetch_df()
+
+print(df)
+```
+
+## FAQ
+### What's with the stupid name?
+I originally wanted to build a forecast of daily trip volume but ended up scaling back my ambitions (maybe just for now). `Fore` is for forecast, `rad` is for das Fahrrad, the German word for bike.
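
The hourly rollup that the README attributes to `./bin/hourly-volume-rollup` can be sketched roughly as follows. This is not the repository's implementation: the `started_at` column name, the `hourly_volume_rollup` table name, and both helper functions are assumptions made for illustration, and the sketch lets pandas create the table instead of relying on the `yoyo` migrations.

```python
import sqlite3

import pandas as pd


def hourly_volume(trips: pd.DataFrame, started_col: str = "started_at") -> pd.DataFrame:
    """Count trips per hour, keyed on each trip's start timestamp.

    `started_col` is an assumed column name; adjust it to match the archive.
    Hours with no trips are simply absent from the result.
    """
    # Truncate each start time to the top of its hour, as an ISO-like string.
    hour = pd.to_datetime(trips[started_col]).dt.strftime("%Y-%m-%d %H:00:00")
    counts = hour.groupby(hour).size()
    return counts.rename_axis("hour").reset_index(name="trip_count")


def store_rollup(rollup: pd.DataFrame, conn: sqlite3.Connection,
                 table: str = "hourly_volume_rollup") -> None:
    """Write the rollup to SQLite (hypothetical table name), replacing old rows."""
    rollup.to_sql(table, conn, if_exists="replace", index=False)


# Tiny hand-made sample standing in for a real trip archive.
trips = pd.DataFrame(
    {
        "started_at": [
            "2023-11-01 08:05:00",
            "2023-11-01 08:40:00",
            "2023-11-01 09:10:00",
        ]
    }
)

rollup = hourly_volume(trips)
conn = sqlite3.connect(":memory:")
store_rollup(rollup, conn)
rows = conn.execute(
    "SELECT hour, trip_count FROM hourly_volume_rollup ORDER BY hour"
).fetchall()
print(rows)
```

With a real archive, the `trips` frame would come from `fetch_df()` as in the README snippet, provided its start-time column matches `started_col`.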