yun's attic

Blogs

I took part in the recently finished ML Reproducibility Challenge 2020, where together with a great team I attempted to reproduce one of the papers as part of a course assignment. This is just my attempt to get back to blogging, as well as writing down some lessons learned. It’s by no means a criticism of anything, but lately I’ve realised if I don’t write things down I’ll most likely forget all about them in two or three months.
I was very excited to go back to school to once again bask in the glory of education and inch closer to becoming a virtuous woman. But the gods have not been kind, and more importantly, the humans have not been able enough to trust and be trusted and keep the virus at bay through concerted efforts. So even though I really miss the chance to meet people and talk to them once a week, I’ve decided to stay at home and avoid university campuses for a while.
I have a couple of scripts that I’d like to run at night, but I also want to leave my machine suspended to RAM overnight to conserve energy (and reduce noise, the fan is a beast!). So this is just a note for myself about how I went about it this time, using rtcwake and crontab. Here’s the crontab I originally had, which is triggered once a day in the afternoon:
I never seem to have got into the habit of writing tests as I code. That’s bad, I know. But there are so many excuses that prevent it: “oh this is just an exploratory thing”, or “Karen and Chad really need the report/tool soon, no time for tests”, or whatever else might get in your way. Plus there’s a tendency to just use the million open-source projects out of the box, and expect them to do what you think they do.
Time zones are such a messy subject, and I don’t even know where to start. At my previous job I had a photo saved from a Stack Overflow answer, so that every time I needed to do some time zone conversion magic, I had a quick reference to go to. Because seriously, how do you expect anyone to memorize all that? Today I’m playing with some time zone stuff again for the API with my autotrader database, and I encountered a new problem/feature that somehow entirely evaded my attention in the past.
When handling timeseries data, quite often you may want to resample the data at a different frequency and use it that way. One way to achieve this is to load all data with Python, and resample or reindex it with Pandas. An alternative is to query directly in SQL by using a pattern like the one below. This allows you to only get the most recent data at each sample point you’re interested in.
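For the load-it-in-Python route, here is a minimal sketch of the "most recent data at each sample point" idea. It uses only the standard library (`bisect`) rather than Pandas, and the function name `latest_at`, the tick times, and the prices are all made up for illustration. It is not the SQL pattern this post refers to.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def latest_at(timestamps, values, sample_points):
    """For each sample point, return the most recent value at or before it.

    `timestamps` must be sorted in ascending order. Sample points that fall
    before the first observation get None.
    """
    result = []
    for point in sample_points:
        # bisect_right gives the number of observations with timestamp <= point
        idx = bisect_right(timestamps, point)
        result.append(values[idx - 1] if idx > 0 else None)
    return result


# Hypothetical, irregularly spaced price ticks (illustrative data only)
start = datetime(2020, 1, 1, 9, 0)
ticks = [start + timedelta(minutes=m) for m in (1, 4, 7, 16)]
prices = [10.0, 10.5, 10.2, 11.0]

# Regular 5-minute grid: 09:00, 09:05, 09:10, 09:15, 09:20
grid = [start + timedelta(minutes=5 * i) for i in range(5)]

resampled = latest_at(ticks, prices, grid)
print(resampled)
```

With Pandas, roughly the same result should come from reindexing the series onto the new grid with forward fill.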
I recently started working on my own autotrader. There’s still much to be done, but I’ve finished the first step: collecting data and putting it in a database. I’ve got a PostgreSQL server running on Docker, and a script that reads data using the AlphaVantage API and writes to my database. The next step would be to write my own Python API to query data from the database. The easy way for me would be to stick a bunch of SQL queries in some Python functions, but why do that when you can make life more complicated!