Estimate PDF Views on Your Site

Probably old hat to many astute RSS4Lib readers, but this was new to me, so I thought I’d share this way of getting a reasonable minimum count of PDF downloads from your site, assuming that you use Google Search Console (formerly called Google Webmaster Tools) and that you’re willing to assume that Google drives a significant portion of traffic to your site. Hat tip to Kenning Arlitsch for this method (citation below).

It is notoriously difficult to count PDF (or other binary file) downloads from your site, for several reasons:

  • PDFs, documents, images, etc., cannot include embedded JavaScript tracking, meaning that many analytics tools cannot accurately know when a file has been downloaded.
  • Web server logs can be used to count downloads, but these counts are often way too high because it is hard to figure out which downloads are triggered by humans and which are from robots, spiders, or other automated processes.
  • Event tracking in Google Analytics works well enough, assuming that the user is already on your website — but significantly undercounts accesses direct from search engines or links from other websites directly to your PDF content.
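To make the server-log problem concrete, here is a minimal Python sketch that counts successful PDF requests in an access log and splits them by a naive user-agent check. It assumes the common "combined" log format; the sample lines and the `BOT_HINTS` list are made up for illustration, and real robot detection is much harder than a keyword match, which is exactly why log-based counts run high.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical and deliberately incomplete list of robot hints.
BOT_HINTS = ("bot", "spider", "crawl", "slurp")


def count_pdf_downloads(lines):
    """Return (human_counts, robot_counts) of successful PDF GETs per path."""
    humans, robots = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if match["method"] != "GET" or match["status"] != "200":
            continue
        path = match["path"].split("?", 1)[0]  # drop any query string
        if not path.lower().endswith(".pdf"):
            continue
        agent = match["agent"].lower()
        bucket = robots if any(hint in agent for hint in BOT_HINTS) else humans
        bucket[path] += 1
    return humans, robots


# Made-up sample lines using documentation-reserved IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2016:13:55:36 +0000] "GET /docs/report.pdf HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '198.51.100.7 - - [10/Oct/2016:13:56:01 +0000] "GET /docs/report.pdf HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Oct/2016:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Macintosh)"',
]

humans, robots = count_pdf_downloads(SAMPLE_LOG)
print("Likely human downloads:", dict(humans))
print("Likely robot downloads:", dict(robots))
```

Any robot that sends a browser-like user agent lands in the "human" bucket here, so a count like this is best treated as an upper bound.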

Google Search Console provides a much better estimate of use, assuming you are willing to postulate that a great deal of your traffic comes from one Google platform or another (Search, Scholar, Images, etc.).

Here’s what to do. First, if you haven’t set up Google Search Console for your site, you need to do so. You can find out how to get started on Google’s Product Forums for Search Console.

Second, go to the Search Console, expand the “Search Traffic” menu on the left, and then select “Search Analytics.”

Search Analytics Menu in Google Search Console

You will then see the search analytics console, which gives you several ways to look at your results:

Search Analytics Options

The one we’re going to look at is Pages — this is where you can specify all pages with a file extension PDF. Select the button to the left of Page and click the filter. In the box, type “pdf” and click the blue “Filter” button.

Entering PDF into the search filter

You will then see a list of all the PDFs for which Google has seen traffic, whether via a link on your site, a direct-to-PDF link from some other site that employs Google Analytics, or from a Google search interface. You’ll see something like this:

Search Analytics PDF list

From any PDF in the list, you can click the double arrow on the right to see more detailed statistics for that particular PDF by selecting one of the other search filters (Countries, Devices, Queries, etc.). Here, we see the devices (desktop, mobile, or tablet) that users employed to see the “SPOTextbookBackground.pdf” document:

Devices used to access a specific PDF


Why is this count useful? Largely because Google has its hand on vast swathes of the Internet through its analytics code, and Google is very good at eliminating robots. They have to be, to prevent unscrupulous folks from inflating advertising revenue through automated clicks on ads. The count does not include traffic that originates on websites that do not use Google Analytics or that originates from other search indexes (Bing, Yahoo, Baidu, etc.).

The same technique can be used to see who is accessing Microsoft Office (.docx/.doc, .xlsx/.xls, .pptx/.ppt) files, images (png, jpg, gif), or any other binary file on your server.
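If you export the page list from Search Console, a short script can total the clicks for each file type. The sketch below is only an illustration: the column names `Page` and `Clicks`, the sample rows, and the click numbers are assumptions, so adjust them to match whatever your export actually contains.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical export; the "Page" and "Clicks" column names are assumptions.
SAMPLE_EXPORT = """Page,Clicks
https://example.org/files/SPOTextbookBackground.pdf,120
https://example.org/files/budget.xlsx,15
https://example.org/files/annual-report.PDF,30
https://example.org/about.html,400
"""

# File types mentioned in the post.
BINARY_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".png", ".jpg", ".gif",
}


def clicks_by_extension(csv_text):
    """Sum clicks per file extension, ignoring ordinary web pages."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Take the extension from the URL path, case-insensitively.
        ext = PurePosixPath(urlparse(row["Page"]).path).suffix.lower()
        if ext in BINARY_EXTENSIONS:
            totals[ext] += int(row["Clicks"])
    return dict(totals)


totals = clicks_by_extension(SAMPLE_EXPORT)
print(totals)
```

Matching on the parsed URL path rather than on the substring “pdf” avoids counting pages that merely have “pdf” somewhere in their address.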

I learned of this technique from a presentation by Kenning Arlitsch, based on the following article:

Patrick OBrien, Kenning Arlitsch, Leila Sterman, Jeff Mixter, Jonathan Wheeler, and Susan Borda. “Undercounting File Downloads from Institutional Repositories,” Journal of Library Administration, vol. 56, no. 7, 2016.

A Year in Reading (2015)

I did a bit better reading (and perhaps worse blogging) this year, upping my total to 26 books for the year, or one every other week. My top 5 books of the year are in bold text; interestingly, even though I read a lot more fiction than non-fiction, the majority of my favorites are non-fiction.

  1. Quicksilver, by Neal Stephenson
  2. Insurgent, by Veronica Roth
  3. Super Sad True Love Story, by Gary Shteyngart
  4. Many-Colored Land, by Julian May
  5. Doomsday Equation, by Matt Richtel
  6. Lock In, by John Scalzi
  7. Kitten Clone, by Douglas Coupland
  8. One Summer, by Bill Bryson
  9. Voices of Heaven, by Frederik Pohl
  10. Midnight in Siberia, by David Greene
  11. Robopocalypse, by Daniel Wilson
  12. The Golden Torc, by Julian May
  13. Ghost Brigades, by John Scalzi
  14. Bone Church, by Victoria Dougherty
  15. Robogenesis, by Daniel Wilson
  16. End of Absence, by Michael Harris
  17. The Last Colony, by John Scalzi
  18. The Strangler Vine, by M.J. Carter
  19. Windswept, by Adam Rakunas
  20. Ancillary Justice, by Ann Leckie
  21. The Gift of Failure: How the Best Parents Learn To Let Go so Their Children Can Succeed, by Jessica Lahey
  22. Armada, by Ernest Cline
  23. Cain at Gettysburg, by Ralph Peters
  24. Seveneves, by Neal Stephenson
  25. The Swimmer, by Joakim Zander
  26. The Invention of Fire, by Bruce Holsinger

A Decade of Blogging

Do you remember May 2005? I do in some respects, but not in others. Being the parent of a six-month-old at the time, I have to admit a lot of non-family things are lost in a blur. But I do remember getting started with this blog, at the time when absolutely everyone had a blog, very much as today absolutely everyone has a Twitter feed, or an Instagram account, or an Ello page. (Just kidding about the last one. I think I’m the only one. Ello strives to be as successful as Google Plus when it grows up.)

For me, the month is notable because it was May ten years ago that I started RSS4Lib. I do remember coming back from a daylong conference (which was the subject of the first post on this blog, RSS and Libraries), not long after starting my then position at the Ginn Library at Tufts University. The conference was all about emerging and new technologies, and blogging caught my attention. It seemed there was so much interest in blogging and RSS, and so little information, that I grabbed the theme of RSS and libraries, came up with a name in the model of the venerable web4lib listserv, and soon snagged a domain name. Thus was born RSS4Lib.

Much as blogging was all the rage, so was RSS. Not just how data gets from place to place, RSS was itself a value-added and well-advertised service that any reasonably well architected website would provide. “Visit Oursite-dot-com! We have an RSS feed.” It was as much a sign of currency and hipness as being on Twitter, Instagram, Snapchat, or whatever it is the kids are using this afternoon.

Things have changed. Today, if RSS is having a good day, it’s been reduced to being the plumbing of data exchange for APIs and for people who use the dwindling number of feed aggregators like Feedly.

Back in the day, there was actually plenty of RSS-related news to talk about. Over the years, though, my fervor for blogging declined, in pace with the rest of the library world’s, and my posts became less frequent and more broadly focused on technology in libraries. My favorite posts are those that have allowed me to imagine futures that aren’t now possible, such as the recent “In Today’s Internet of Things, YOU Are the Thing,” and the much older “Serendipity at Risk,” “Perspective on Discovery,” and “The Paradox of RSS and Web Scale Discovery.” The most popular posts, though, tend to be “how-to” posts (see below).

Ironically, the most popular post (see below for the top 10) is one that describes a tool I wrote to parse server log files to tell you how many unique subscribers your site’s feed has. Ironic because despite that page’s popularity, the reliance on RSS as an access mechanism has been steadily declining.

As the Internet and this blog have evolved, my career has grown, too. Rather than being the guy who works on the web site, though not as the main part of my job, I now oversee the maintenance and development of a vastly larger and more complex library website, and deal with an entirely new range of issues. Over the ten years of blogging, it went from something fun, to something I felt obligated to do, to something I felt guilty about not doing, to … whatever it is now. I still write, though at the extremes of length: occasional long-form article- and chapter-length publications at one end, and 140-character-long tweets on Twitter at the other. I do like the freedom, even if it is seldom exercised, of having a space available to share my thoughts. I expect to post here from time to time going forward…. Maybe RSS4Lib will last another ten years. Who knows? Despite the time I spend on it, as Yogi Berra is credited with saying, it’s hard to make predictions, especially about the future.

— Ken

Just in case you’re curious, the most popular posts ever (where “ever” is since November 2009, when I added Analytics to the site):

  1. Counting RSS Subscribers (8039 views)
  2. RSS for Kindle Readers (5216 views)
  3. Google Has an RSS Embedding Tool (3997 views)
  4. YourStats: Estimate Feed Readership for Your Blog (3466 views)
  5. Google Chrome and RSS (2851 views)
  6. Flog Blog: RSS to Facebook (2252 views)
  7. JavaScript RSS Box Viewer (1813 views)
  8. Feedbooks: RSS to PDF for Offline Reading (1506 views)
  9. RSS to Twitter Tools (1341 views)
  10. Facebook Notes Redirects Your Feeds (1223 views)


In Today’s Internet of Things, YOU Are the Thing

The Internet of Things was a hot item at the Consumer Electronics Show in Las Vegas earlier this month. A vast array of Internet-enabled devices was on display, everything from the Baby Glgl smart bottle holder (it tells your smartphone if your baby’s bottle isn’t at the optimal angle) and the Belty (it automatically loosens your belt if you overindulge at a meal) to Whirlpool’s “detergent assistant,” a feature on one of its high-end washers that can order detergent when your stock is running low.

Of course, a slew of more useful network-enabled devices was also on display, including gizmos for monitoring health, home-monitoring items, and more. But the theme here is that the Internet of Things seems to be at that stage in the adoption cycle where manufacturers and inventors are hell-bent on network-enabling ALL THE THINGS in the hope that, someday, the market will tell us what actually makes sense, according to the ancient adage: “Network it all! Let the market sort it out.”

While the market is busy sorting out just what makes sense in the Internet of Things via the proxy of what we consumers will actually buy, I keep thinking that the Internet of Things, as it exists today, is really the Internet of You. Much as with Facebook or Google, you, the consumer, are the “thing” being networked. (The CEO of Jawbone claims this proudly in a January 5 column in the Huffington Post; I am not quite as sanguine.)

Here’s what I mean. The ubiquitous smartphone that so many of us carry around is giving off endless data about you, harvested by smart retailers and others. Your phone, the most networked thing in any of our lives, is a proxy for you. Here’s an example. Several years ago, I went to my local Kohl’s in search of some shoes. On entering the store, I saw an advertisement that I could text a phrase to a certain number to receive 15% off that day’s purchase. I did so, of course, not thinking through the reason that Kohl’s — a consummate retailer — would be offering a surprise additional discount to someone who was already in the store.

Later, it dawned on me. My desire to save a few bucks on that shopping trip gave Kohl’s the ability to connect my location in the store, my cell phone number, my cell phone’s MAC (which the wireless network in the store could pick up), and my purchases (when I used the coupon at checkout). If I paid by debit or credit card (which I did), Kohl’s had opportunity to capture my name, and by extension, all sorts of additional information about me. Not only that, but thanks to the cell phone metadata, assuming they had installed inexpensive wireless devices around the store in each department to gather data, they would know pretty well where I was in the store and how much time I spent in each section. As it turns out, this is almost certainly what was happening. As early as summer 2013, The New York Times was reporting on this sort of technology.

This sort of user tracking is common across many retailers. If your cell phone is turned on in a store, you can be certain that information about where your phone goes and where it spends time is being tracked, even if it’s anonymous. If you pull up a coupon on your phone to be scanned at checkout, all your in-store behavior is suddenly directly connected to you, the individual — to be used across time and space.

This does not even scratch the surface of what could be done by legally empowered law enforcement or other, less legally grounded, agencies.

I do not suggest you leave your smartphone at home, or even put it in airplane mode when you walk into a store. But I do want to highlight that the Internet of Things, as described in the media, is really two approaches. One is using your smartphone as a proxy for you: the Internet of You. The other is using the network and a computer to interact with and learn from your environment: the Internet of Things. Don’t confuse one for the other and be discouraged about the entire concept based on the former.

A Year in Reading (2014)

My personal reading for 2014 has been mostly for entertainment. The list is shown in chronological order. My favorite five from the list below are noted with bold text.

  1. Arsenals of Folly: The Making of the Nuclear Arms Race, by Richard Rhodes
  2. The Ocean at the End of the Lane: A Novel, by Neil Gaiman
  3. The Martian, by Andy Weir
  4. Like a Mighty Army (Safehold), by David Weber
  5. Redshirts, by John Scalzi
  6. Bad Monkey, by Carl Hiaasen
  7. Old Man’s War, by John Scalzi
  8. 2312, by Kim Stanley Robinson
  9. Trojan Horse: A Novel, by Mark Russinovich
  10. Quarter Share, by Nathan Lowell
  11. Ready Player One: A Novel, by Ernest Cline
  12. A Delicate Truth: A Novel, by John le Carré
  13. Reamde: A Novel, by Neal Stephenson
  14. Good Faith, by Jane Smiley
  15. Existence, by David Brin
  16. Luna Park, by Kevin Baker
  17. Mr. Penumbra’s 24-Hour Bookstore: A Novel, by Robin Sloan
  18. Divergent (Divergent Trilogy, Book 1), by Veronica Roth
  19. Half Share, by Nathan Lowell
  20. LEGO: A Love Story, by Jonathan Bender
  21. World War Z: An Oral History of the Zombie War, by Max Brooks

What Could the “Internet of Library Things” Be?

At the recent ALA Annual Conference, I attended the OCLC Symposium on the Internet of Things, hosted by Lisa Carlucci Thomas and presented by Daniel Obodovski, co-author of The Silent Intelligence: The Internet of Things. (I wrote up my notes in an earlier post.) At the end of the talk, Mr. Obodovski asked the audience what they thought libraries should do if/when the Internet of Things came into being. The responses were varied, but were more like “RFID on steroids”: better circulation of materials, availability of equipment, and the like. These are mostly evolutionary steps, but the last one or two are more revolutionary.

So, I’ve been trying to think of less evolutionary and more revolutionary ideas. I have not, frankly, been particularly successful. But here are some of the things I can see happening:

  • Help library visitors find a space suitable to their needs (quiet study areas, low noise, full-on conversation) by installing noise-level monitors in each study space and simple sensors in each chair. This way, someone looking for a deserted, quiet area can easily find the available table off in the back corner, while a small group looking to conduct a group study session can find a free table in an area where there is already light conversation.
  • Put motion sensors in study rooms so that a list of available and in-use study rooms can be shown to library visitors. Library visitors will know which correctly sized study area is free, and they can then let their study group know where to come. Bonus points for tying these sensors into the study room lights for energy savings — the lights go off when the room is empty.
  • Show library visitors newly purchased books they are likely to enjoy when they enter the library. Combine information about books on your new-book shelf with each visitor’s checkout history to send a list of books that are on the new-book shelf to their device as they enter the building.
  • Impromptu book discussion clubs. (This is bordering on the creepy, but I wanted to see if you are paying attention.) Identify other people in the library who have similar reading interests and offer to introduce them to each other.
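The first idea on this list, matching visitors to spaces by noise level and free seats, can be sketched in a few lines of Python. Everything here is hypothetical: the `SpaceReading` fields, the decibel thresholds, and the sample readings stand in for whatever real sensors would report.

```python
from dataclasses import dataclass


@dataclass
class SpaceReading:
    """Latest (hypothetical) sensor snapshot for one study space."""
    name: str
    noise_db: float      # from the noise-level monitor
    seats_total: int
    seats_occupied: int  # from the chair sensors

    @property
    def seats_free(self):
        return self.seats_total - self.seats_occupied


def find_spaces(readings, max_noise_db, seats_needed, min_noise_db=0.0):
    """Names of spaces in the noise range with enough free seats, quietest first."""
    matches = [
        r for r in readings
        if min_noise_db <= r.noise_db <= max_noise_db
        and r.seats_free >= seats_needed
    ]
    return [r.name for r in sorted(matches, key=lambda r: r.noise_db)]


# Made-up readings for illustration.
READINGS = [
    SpaceReading("Back corner table", 32.0, 4, 0),
    SpaceReading("Reading room", 38.0, 20, 19),
    SpaceReading("Cafe tables", 58.0, 12, 6),
    SpaceReading("Lobby", 65.0, 8, 8),
]

# Someone looking for a quiet seat.
quiet = find_spaces(READINGS, max_noise_db=40.0, seats_needed=1)
# A group of four that wants some background conversation.
group = find_spaces(READINGS, max_noise_db=60.0, seats_needed=4, min_noise_db=45.0)
print(quiet)
print(group)
```

The minimum-noise bound is there for the study-group case, where a space that is too quiet is as unsuitable as one that is too loud.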

What would you do with pervasive connectivity of everything within your library? Let me know in the comments.

Incidentally, Jason Griffey talks about a number of other things libraries could do with cheap sensors in his chapter, “The Case for Open Hardware in Libraries,” in the recently-published LITA Guide, Top Technologies Every Librarian Needs to Know.

Technology Priorities for the New Library Reality

These are lightly edited notes from Sarah Houghton’s talk at ALA Annual 2014. Tweets from this presentation may be found at #alaac14.

Starts off with results of a survey: ‘Why are we talking about this now?’ Now that budgets are starting to recover from the Great Recession, libraries have the option to think about where to allocate restored funds. Do we spend on the things we did 10 years ago, or do we choose new priorities?

About half of libraries are losing money; half are gaining. Everyone feels that they don’t have enough and cannot keep up. No matter what kind of library responded, we all wanted the same things.

Libraries that thought they would get an increase were spending on staffing (27%), digital materials (26%), information technology (22%), and facilities (17%) (137 respondents). Facilities were a smaller set, but the things that were wanted were often building safety and maintenance, not technology.

How is technology support managed? About 42% of respondents had libraries that ran their own IT. 28% by a parent organization, 24% some combination thereof, and 6% outsourced.

How much spending control does library staff have over the IT budget? 50% had none or “a wee bit”.

Your web services librarian doesn’t have to be a librarian. Get someone qualified, and have a librarian advisory group to advise.

Fewer people made collection decisions based on usage statistics for digital materials than for physical materials. Seems odd because it is so much easier to gather statistics on the digital materials.

If libraries had $1k, 42% chose non-tech things to spend it on. One said “actually pay the visiting clown.” If libraries had $100K, non-tech was still 42%, but answers were much more diverse. Hardware, digital content, software & staff, and other stuff are the big desiderata in technical areas.

If libraries could get one extra staff position of any kind, 42% said tech-oriented NON-librarian. 23% said tech librarian and non-tech librarian (each).

What concerns do people have? Staff capacity is biggest: 47%. Training (23%), outdated mindsets (14%), outdated technology (12%).

Libraries see using hosted services as a good way to get around IT’s rules (33%). Simply breaking the rules is also popular: 39%.

As technology integrates more and more into our jobs and lives, everyone has an opinion on how we should focus our technology spending. Few know what the hell they’re talking about.

How do you develop a budget? Establish priorities first. Determine needs for each. Draft a budget, revise with broad feedback. Make mid-year adjustments.

The Internet of Things

These are the notes I took during today’s OCLC Symposium on “The Internet of Things” at ALA Annual 2014. For tweets from the presentation, please see the Tweets at #oclciot.

The presentation was by Daniel Obodovski, co-author of The Silent Intelligence: The Internet of Things.

How do humans and machines communicate and connect? This is the Internet of Things [IoT]. But what is that? It’s all kinds of things today: smart thermostats, medical sensors and alert systems, smart electric meters… And more. Package and person tracking is enabled through scannable codes or RFID tags for low-value things, and GPS devices for high-value (people, pets, valuable items). What are the privacy concerns around this? How to ensure that data are used as intended, by whom intended?

The IoT allows us to connect to the broader analog world around us in a digital way, to integrate, interpolate, and benefit us all. Relates to a new digital nervous system connecting us with our environment?

How big will this be? There could be as many as 50 billion connected devices by 2020. We have a lot more “smart” technology in our homes already than we might think. Up to 7% of the U.S. population already has some sort of wearable technology (exercise trackers, medical monitors, etc.). By the end of this year, it is forecast that 10% of the U.S. population will have a wearable, internet-connected device on their person. And today, 45% of fleet vehicles in the U.S. have some form of monitoring: for vehicle maintenance, for driver compliance, for vehicle location, etc.

This is, all together, what we call “The Silent Intelligence.” And it is, ironically, very verbose.

We think of the future as rocket cars and jetpacks. But the reality is, it’s already here, slowly emerging, out of these interconnected devices. The most exciting area is healthcare — with immediate feedback for how treatment is working, or if there is an emergent situation before the individual even knows something is wrong.

What we have seen in social media — where the user is the source of data that the social media company then sells — is already emerging in the Internet of Things. Your car’s data is being sold to third parties. (I wonder, if it’s so easy to get the vehicle’s diagnostic reporting codes out of the vehicle, why it costs so much at a dealer to read the code and translate it into a fixable problem.)

The Internet of Things is very complex. Requires that many individual device manufacturers talk to each other and interplay. Need standards not just for communication, but for data itself. All of these data will be collected, analyzed, resold — after being anonymized. A new range of services will emerge around this data collection and processing. This opens up a new world of services, but also opens up a huge range of data privacy and security concerns.

We are currently missing a clear set of rules about privacy of data — who can have access, and what do they do with it? We are generally very bad about understanding the terms of service when we click through to use some online service.

This technological revolution has an uncertain impact on the nature of jobs. We have gone through one technological revolution, in which technology replaced many manufacturing jobs, leading those workers to move into service jobs. What happens if many services can be automated; what is the next kind of job that current service workers can move into?

What will Internet of Things mean for libraries? What will interconnections enable? Combined with knowledge of other things than where physical items are located, and what rooms are being used, or aisles in the stacks, etc., you can customize and improve services. Without data, you can’t improve your services in the optimal way.

We should think about how we can understand the patterns, and the data that generate them. Connecting patrons to their needs, more effectively and efficiently, is the goal. Let needs drive the technology.

How the Feed Changed the Web

Mashable published an interesting post and infographic about how the “feed” changed the way we consume information. The author notes: “The feed now dominates online content consumption, from the news we read on our mobile devices to the social networks we check constantly throughout the day…” (emphasis mine).

Just another indication that RSS has become plumbing, or infrastructure. It’s no longer the goal in itself; it’s the mechanism.

Discovering Discovery at LITA Forum

Notes from a talk by Annette Bailey of Virginia Tech at the LITA National Forum, “Discovering Discovery.”

Virginia Tech has been a Summon customer since 2010. They have leveraged Summon to change cataloging practices locally. Still using original Summon (1.0) interface.

Library users are shifting behaviors. Increasing usage of online resources and physical spaces, but not physical resources. Discovery largely happens through Summon. How can VT know what its users are doing? COUNTER provides some information, but it’s delayed and hard to process. Summon provides aggregate data on search terms and click data. How can we know what users are doing in real time? And share it with other members of the community, show visually what research is happening, live?

Discovery Visualization

That is the heart of Discovering Discovery: what are users clicking on in Summon, in real time. Can’t tell if they use the item, but can tell that they accessed it.

This tool helps everyone — librarians, the public, students — to understand what is being done in the library. User does a search. There’s some custom JavaScript in the Summon interface that sends a record of the click to the visualization server, which stores it in a database. A visualization tool then makes a display on demand. It grabs the Summon record ID, unique for each item. They then use the Summon API to grab the metadata for that query — because Summon IDs are not persistent over the long term. All of that is stored in an SQLite database.

As a side note, they can tell how many unique items were clicked on over time — hard to do otherwise.
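To illustrate the storage step described above, here is a minimal Python sketch of a click log kept in SQLite, with a query for the number of unique items clicked in a time window. The table layout, function names, and sample record IDs are my own assumptions, not VT’s actual schema; the Summon API lookup is left out, with the title simply passed in at click time.

```python
import sqlite3


def open_click_store(path=":memory:"):
    """Open (or create) a hypothetical click log database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clicks (
               clicked_at TEXT NOT NULL,  -- ISO 8601 timestamp
               record_id  TEXT NOT NULL,  -- discovery-service record ID
               title      TEXT            -- metadata captured at click time
           )"""
    )
    return conn


def record_click(conn, clicked_at, record_id, title=None):
    """Store one click; metadata is saved now because IDs may not persist."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO clicks (clicked_at, record_id, title) VALUES (?, ?, ?)",
            (clicked_at, record_id, title),
        )


def unique_items_clicked(conn, start, end):
    """Count distinct records clicked in the half-open window [start, end)."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT record_id) FROM clicks "
        "WHERE clicked_at >= ? AND clicked_at < ?",
        (start, end),
    ).fetchone()
    return row[0]


# Made-up clicks for illustration.
conn = open_click_store()
record_click(conn, "2013-11-08T10:00:05", "rec-001", "Example article")
record_click(conn, "2013-11-08T10:00:40", "rec-002", "Another article")
record_click(conn, "2013-11-08T10:02:10", "rec-001", "Example article")

unique_count = unique_items_clicked(conn, "2013-11-08T10:00:00", "2013-11-08T11:00:00")
print(unique_count)
```

Storing timestamps as ISO 8601 text keeps the range comparison a plain string comparison, which SQLite handles without any date functions.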

Current log analysis extracts and tabulates data at 1 minute, 5 minute, 1 day, 1 week intervals. Tabulates by discipline, content type, source of record, publication year. All comes from Summon, which means data are problematic. Does word frequencies for abstract, title, and abstract & title combined, and keywords & subject terms.
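The tabulation described here can be sketched roughly as follows: floor each click’s timestamp to an interval, count clicks per interval and category, and tally word frequencies across titles. The record fields (`time`, `content_type`, `title`) and the sample clicks are assumptions for illustration, not the talk’s actual data model, and the word split is deliberately simplistic.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_start(ts, minutes):
    """Floor a timestamp to the start of its N-minute bucket within the day."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds() // 60)
    return midnight + timedelta(minutes=(elapsed // minutes) * minutes)


def tabulate(clicks, minutes, field):
    """Count clicks per (bucket start, field value) pair."""
    counts = Counter()
    for click in clicks:
        counts[(bucket_start(click["time"], minutes), click[field])] += 1
    return counts


def word_frequencies(clicks, field="title"):
    """Lowercased whitespace-split word counts for one text field."""
    words = Counter()
    for click in clicks:
        words.update(click[field].lower().split())
    return words


# Made-up click records for illustration.
CLICKS = [
    {"time": datetime(2013, 11, 8, 10, 0, 5), "content_type": "Journal Article",
     "title": "Library discovery systems"},
    {"time": datetime(2013, 11, 8, 10, 3, 40), "content_type": "Book",
     "title": "Discovery in practice"},
    {"time": datetime(2013, 11, 8, 10, 7, 10), "content_type": "Journal Article",
     "title": "Measuring discovery"},
]

five_minute = tabulate(CLICKS, 5, "content_type")
title_words = word_frequencies(CLICKS)
for (start, content_type), count in sorted(five_minute.items()):
    print(start.isoformat(), content_type, count)
print(title_words.most_common(3))
```

The same `tabulate` call works for the other groupings mentioned (discipline, source of record, publication year) by changing the `field` argument, provided the records carry those fields.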

Use the d3.js library to do visualizations. It’s a powerful tool, but hard to work with. Follows jQuery in style. Also uses a variety of server-side technologies.

Summon 2.0 — not there yet. Unlike Summon 1.0, there is now an officially sanctioned way to include JavaScript (it’s a hack in 1.0). It now includes d3.js in Summon — they do not appear to be using it yet, but it’s there. Look out for visualizations at some point…. But they need to reverse engineer Summon 2.0 to achieve the same effect as in Summon 1.0.

Using this with other discovery services. You need to be able to record clicks, in real time. You need an API to get the machine data. If you use a different discovery service and want to try adapting this code, VT would like to work with you.

The visualization is the hard part; getting the data was the relatively easy part. The code needs to be consolidated into a cloud solution so that you can make your own version for your own use (like the LibX edition builder).