Logo Icon

Importing Google News Data to R

Note: This post is NOT financial advice! This is just a fun way to explore some of the capabilities R has for importing and manipulating data.

Note: Google changed their URL structure for financial news, so this code no longer works.

I’ve been playing around lately with the stock market data available from google finance, through quantmod in R. Here’s a function I’ve written (which depends on the R Data Science Toolkit), to pull news stories related to a stock from google, parse them, and save them as a data frame. Let me know what you think!

devtools::install_github("rtelmore/RDSTK")
getNews <- function(symbol, number){

    # Warn about length
    if (number>300) {
        warning("May only get 300 stories from google")
    }

    # load libraries
    library(XML); library(plyr); library(stringr); library(lubridate);
    library(xts); library(RDSTK)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
                "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    pubDate = strptime(mydf$pubDate,
                        format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    pubDate = with_tz(pubDate, tz = 'America/New_york')
    mydf$pubDate = NULL

    #Parse the description field
    mydf$description <- as.character(mydf$description)
    parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out,'\n|--')[[1]]

        #Find Lead
        TextLength <- sapply(out,nchar)
        Lead <- out[TextLength==max(TextLength)]

        #Find Site
        Site <- out[3]

        #Return cleaned fields
        out <- c(Site,Lead)
        names(out) <- c('Site','Lead')
        out
    }
    description <- lapply(mydf$description,parseDescription)
    description <- do.call(rbind,description)
    mydf <- cbind(mydf,description)

    #Format as XTS object
    mydf = xts(mydf,order.by=pubDate)

    # drop Extra attributes that we don't use yet
    mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
    return(mydf)

}
set.seed(42)
news <- getNews('SPY',10)
news

stay in touch