So a slow weekend means working on some of my non-bioinformatics projects. This time it was writing an R script to scrape historical stock data from Yahoo Finance. This comes after Yahoo broke everyone’s scripts (including one I had written in bash) by changing their API to require a cookie/crumb pair. I won’t go into detail about solving that problem (regex) or much about the script itself, but feel free to have a look at it here. The important thing is that it works.
I should note that another R package I like very much, quantmod, includes the similar function getSymbols(). There are a few things I dislike about getSymbols(), namely that when downloading multiple stocks it loads each one as a separate variable in the environment. This sucks if you’re downloading an entire exchange like the NYSE. The other downside is not being able to limit the date range, which again is useful when dealing with large numbers of stocks.
With that let’s take the StockScraper for a spin. First we need to source it since I’m too lazy to package it:
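A single source() call does the trick. Note that the file name here is my assumption — point it at wherever you saved the script:

```r
# load the scraper functions into the current session
# (file name is assumed; adjust the path to wherever you saved the script)
source("StockScraper.R")
```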
As you can see, the script includes two functions. The primary function, stockhistoricals(), downloads the data. The helper function get_stocklists() retrieves the stock lists for the NYSE, AMEX, and NASDAQ exchanges. It also pulls down a lot of good stock metadata, which could be useful later for building correlation analyses.
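For instance, you could grab the ticker lists and metadata up front before scraping. This is a sketch — I’m assuming get_stocklists() takes no arguments and returns the lists with their metadata; check the script itself for the actual signature:

```r
# pull ticker symbols and company metadata for NYSE, AMEX, and NASDAQ
# (calling convention and return structure are assumptions; see the script)
stocklists <- get_stocklists()
```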
Let’s bulk-download NASDAQ as an example. Here I’m being explicit with the arguments, but you can run the function with the defaults listed at the top of its definition. I usually run it with verbose = TRUE to monitor progress, but that would look like crap in this markdown!
```r
NASDAQ <- stockhistoricals(stocklist = "NASDAQ",
                           start_date = "2016-09-11",
                           end_date = "2017-09-11",
                           verbose = FALSE)
```
So now we have a year’s worth of stock price historicals for the entire NASDAQ exchange. The data is stored as a list of data frames named for the stock tickers. Check it out:
```r
# list the first ten stocks
names(NASDAQ)[1:10]
```

```
##  "PIH"  "TURN" "FLWS" "FCCY" "SRCE" "VNET" "TWOU" "JOBS" "CAFD" "EGHT"
```
We can retrieve individual stock data using standard R notation:
```r
# check out GOOG
head(NASDAQ$GOOG)
```

```
##         Date   Open   High     Low  Close Adj.Close  Volume
## 1 2016-09-12 755.13 770.29 754.000 769.02    769.02 1311000
## 2 2016-09-13 764.48 766.22 755.800 759.69    759.69 1395000
## 3 2016-09-14 759.61 767.68 759.110 762.49    762.49 1087400
## 4 2016-09-15 762.89 773.80 759.960 771.76    771.76 1305100
## 5 2016-09-16 769.75 769.75 764.660 768.88    768.88 2049300
## 6 2016-09-19 772.42 774.00 764.441 765.70    765.70 1172800
```
If you’re comfortable with lists, you can work with the list directly for simple analyses:
```r
# get the average adjusted close price for GOOG
mean(NASDAQ$GOOG$Adj.Close)
```

```
##  852.4343
```
Or we can extract stocks and do fun things like plot them.
```r
# extract GOOG
GOOG <- data.frame(NASDAQ$GOOG)
names(GOOG) <- c("Date", "Open", "High", "Low", "Close", "Adj.Close", "Volume")

# plot it out!
library(ggplot2)
ggplot(GOOG, aes(x = Date, y = Close)) +
  geom_line() +
  labs(title = "GOOG Price", y = "Closing Price", x = "")
```
In the future I might bundle StockScraper into an R package along with some of my favorite plotting and clustering wrappers. But for now I’ll leave it at that.
Good luck and happy data-mining!
```
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10 (Yosemite)
##
## locale:
##  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
##  stats graphics grDevices utils datasets methods base
##
## other attached packages:
##  ggplot2_2.2.1 readr_1.1.1 httr_1.2.1 RCurl_1.95-4.8
##  bitops_1.0-6 XML_3.98-1.9
##
## loaded via a namespace (and not attached):
##  Rcpp_0.12.11 knitr_1.16 magrittr_1.5 hms_0.3
##  munsell_0.4.3 colorspace_1.3-2 R6_2.2.2 rlang_0.1.2
##  plyr_1.8.4 stringr_1.2.0 tools_3.3.0 grid_3.3.0
##  gtable_0.2.0 htmltools_0.3.6 lazyeval_0.2.0 yaml_2.1.14
##  rprojroot_1.2 digest_0.6.12 tibble_1.3.3 curl_2.6
##  evaluate_0.10 mime_0.5 rmarkdown_1.6 labeling_0.3
##  stringi_1.1.5 scales_0.4.1 backports_1.1.0
```