Put me in front of a Mac and it’s almost as if I never learned to use a computer. Put me in front of Google Spreadsheets and all of the time I’ve spent working with Excel feels a little like time wasted, and not in a good way. I’m just not very used to a spreadsheet that isn’t Excel.
Unafraid of a challenge, I recently decided to give Google’s (exceptional) importXML, importFEED and importHTML functions a try – the ability to fetch information from the web to retrieve the data you need. Mostly to make an interesting blog post, but partly out of envy that Excel doesn’t have this function.
It’s frustrating trying to get XML data into Microsoft Excel – unless you’ve got the time and patience to build some basic Macros or VBscript for your requirements. With Google Docs, it’s really easy.
A few resources
If you want to use Google Docs to extract data from the web, it would be a good idea for you to learn a little xPath. “XPath is used to navigate through elements and attributes in an XML document”, or, in simple terms, you can use xPath to fetch little bits of data contained in structured elements like <span>, <div> or links or pretty much anything, really.
Also, there are a few people who have been doing this a while, and probably have sample spreadsheets that blow some of the examples below away – but you have to start somewhere, right? If you’re already an importXML / Google Docs Ninja, maybe go and find something else to do instead of reading this post.
If you’re interested, I made a Google Docs Spreadsheet with all of the examples below:
Does anyone know?
“Does anyone know” is such an interesting search on Twitter – just combine that query with a keyword, like “restaurant” and a location for everyone on Twitter looking for a very specific, thing. Great if you happen to be trading in that thing.
Try a query like this to pull through results from the Twitter search RSS feed:
A nod to Steven Foskett for this one, and particular kudos for the mention of vCard, the query for LinkedIn connections, Klout score and Alexa Rank. Nice!
Try this query: =importXML(“http://twitter.com/[your-username]“,”//span[@id='follower_count']“)
Which will give you the number of followers you have on your Twitter profile. I added together the total followers that my SEO team have (that’s the three of us) for kicks. I wonder how long it will be before someone totals up all followers counts for all UK agencies? I wonder if there’s a correlation between that data and turnover
Pull price data from the web
I think that, after some mild haranguing, Will might have purchased himself a pair of Etymotic headphones. Perhaps my pitch would have gone slightly more efficiently with a little xPath and Google Product search:
For something like this, a way smarter approach to get pricing data from Amazon would be to use their API – but you get the point with this brief example.
Get all of your (competitors) URLs from their sitemap
Try something like this: =ImportXML(“http://www.yourcompetitordomain.com/sitemap.xml”,”//url/loc”)
I mentioned doing this with Excel to find orphaned pages, but you can have a lot more fun with importXML. For one, theoretically you could go off and fetch all keywords contained in the <title> tag of each of the URLs – an instant keyword strategy!
Pull link data from Blekko
With a query like this: =ImportXML(“http://blekko.com/ws/http://seogadget.com/+/links+/rss”,”//link”)
Blekko is everyone’s favourite new SEO tool, and fair enough, it is quite cool. As Blekko are happy to push their data out via RSS, we’re able to pull this data into our spreadsheets with ImportXML (to be fair this is really easy with Excel, unless you’d like to create multiple columns with different domain queries.
More Blekko – link data tables
Blekko have a feature that allows for a pretty insightful breakdown of their SEO data on your domain. If you want to pull some of that through in to Google Docs, no problem:
Try this query: =importhtml(“http://blekko.com/ws/www.smashingmagazine.com+/seo”,”table”,7)
This wasn’t a particularly “advanced” post – I did quite enjoy the thought of what to do next with this data, though. Fetch IP addresses, WHOIS details, root domain links or keyword research data with Google Suggest, the Alchemy API, or plain scraping your competitor home pages. If you’re using importXML, I’d really like to hear how.
Anyway, as I mentioned earlier, please feel free to take the queries from here: http://bit.ly/9Fs7aF – improve them, and let me know what you did.
A little update
I got in touch with my friend Tom from Distilled to see if he wanted to contribute. He’s been out in Vegas, but came back with a tip to solve the problem of Google caching a result for around two hours at a time:
Google docs will cache a URL for ~2 hours and so if you want to crawl a URL more often than that then you need to add a modifier to the URL.
I use int(now()*1000) to generate a unique timestamp and then add that into the URL in a dummy query string. E.g.
The search results won’t change when you change the time value but Google docs will treat it as a fresh URL and crawl it again.
Also – you can do lots of amazingly fancy things using Google Scripts (kind of like macros for google docs) but don’t have a huge amount of time to go into detail about that now!
Well, hopefully Tom will have time soon – thanks for contributing!