Wednesday, November 26, 2014

Slightly Advanced rvest with Help from htmltools + XML + pipeR

Hadley Wickham’s post “rvest: easy web scraping with R” introduces the fine new package rvest very well.  For those now yearning for a slightly more advanced example with a little help from pipeR + htmltools + XML, I thought this might satisfy that yearning.  The code scrapes the CSS information that the fancy new site cssstats.com reports for my blog.  From the background colors, it builds and labels some swatches and outputs them to the RStudio Viewer if you are running RStudio, or to your browser if not.

library(pipeR)
library(htmltools)
library(rvest)
library(XML)

# some slightly more advanced exercises
# using rvest, XML, and htmltools

# this one takes all the svg nodes in the section
# with id unique-background-colors from the
# site cssstats.com run on timelyportfolio.blogspot.com
# 1) removes attributes
# 2) sizes them at 85px x 64px
# 3) adds a new text node with the fill value
# 4) combines them into a single div
# 5) with some meta information
"http://cssstats.com/stats?url=http%3A%2F%2Ftimelyportfolio.blogspot.com" %>>%
  html %>>%
  html_nodes( "#unique-background-colors svg" ) %>>%
  xmlApply( function(x){
    removeAttributes(x)
    addAttributes(x,style="display:inline-block;height:85px;width:64px")
    fillNode = newXMLNode(
      "text"
      ,html_attr(html_node(x,"rect"),"fill")
      ,attrs=c(x=0,y=75,style="font-size:70%")
    )
    addChildren(x,fillNode)
    saveXML(x) %>>% HTML
  } ) %>>%
  (tags$div(
    style="display:inline-block;height:100%;width:100%"
    ,list(
      tags$h3(
        "Colors of TimelyPortfolio from "
        ,tags$a(href="http://cssstats.com","cssstats")
      )
      ,.
    )
  )) %>>%
  tagList %>>%
  html_print
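
If the parenthesised tags$div( ... ,. ) step looks odd, here is a tiny sketch of the two pipeR forms the chain above relies on, as I understand them.  The vector and the names total and wrapped are made up for illustration and are not part of the scraping code.

```r
library(pipeR)

# plain function name: the left-hand value becomes the first argument
total <- c(1, 2, 3) %>>% sum

# parenthesised expression: `.` stands for the left-hand value,
# which is how the scraped swatches end up inside tags$div(...) above
wrapped <- c(1, 2, 3) %>>% (list(label = "values", .))
```

The parentheses are what let the piped value land somewhere other than the first argument, here as the second element of a list.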

I just pasted the div output below using Windows Live Writer (notably built by JJ Allaire and Joe Cheng, both now of RStudio).



Colors of TimelyPortfolio from cssstats

transparent #fff #ffffff #fcfcfc #eeeeee #fcf8e3 #f2dede #dff0d8 #d9edf7 #f5f5f5 #a9dba9 #f9f9f9 #d0e9c6 #ebcccc #faf2cc #c4e3f3 #e5e5e5 #0081c2 #e6e6e6 #cccccc \9 #006dcc #0044cc #003399 \9 #faa732 #f89406 #c67605 \9 #da4f49 #bd362f #942a25 \9 #5bb75b #51a351 #408140 \9 #49afcd #2f96b4 #24748c \9 #363636 #222222 #080808 \9 #0088cc #999999 #fafafa #ededed #1b1b1b #111111 #515151 #0e0e0e #040404 #000000 \9 #000000 #f7f7f7 #b94a48 #953b39 #c67605 #468847 #356635 #3a87ad #2d6987 #333333 #1a1a1a #0e90d2 #149bdf #dd514c #ee5f5b #5eb95e #62c462 #4bb1cf #5bc0de #fbb450 #ccc rgba(255, 255, 255, 0.25) #2288bb #ffff00
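
To play with the swatch layout without hitting cssstats.com, a rough offline sketch might look like the following.  It builds the SVG with htmltools alone rather than modifying scraped nodes with XML, so it is a stand-in for the technique and not the code that produced the output above.  The helper make_swatch, the 64 x 64 rect size, and the three sample colors (picked from the list above) are my own assumptions.

```r
library(htmltools)

# hypothetical helper: one labelled swatch for a CSS color
# (rect size is an assumption; the style and text attributes mirror the post)
make_swatch <- function(color) {
  tag("svg", list(
    style = "display:inline-block;height:85px;width:64px",
    tag("rect", list(width = 64, height = 64, fill = color)),
    tag("text", list(x = 0, y = 75, style = "font-size:70%", color))
  ))
}

# a few sample values taken from the colors listed above
colors <- c("#fcf8e3", "#f2dede", "#dff0d8")

swatches <- tags$div(
  style = "display:inline-block;height:100%;width:100%",
  tags$h3("Sample swatches"),
  lapply(colors, make_swatch)
)

# render to an HTML string; only open the viewer in an interactive session
swatch_html <- as.character(swatches)
if (interactive()) html_print(swatches)
```

Swapping colors for any character vector of CSS colors should give one labelled swatch per value.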
