When SUNY decided to move all of its libraries to the same ILS, I learned that the more data you have, the more data you have to protect--or scrub away.
As the electronic services librarian at SUNY Cortland, I am responsible for the electronic presence of the library; I also provide cohesion between the library's various Web-based systems and electronic information. While SUNY Cortland is a medium-sized, 4-year college within the State University of New York system, it is at the forefront of technology. When management decided to move all of SUNY's libraries to the same ILS, we were in the second wave of libraries to migrate.
The topic of privacy regulations is complex at best, but my research yielded three key insights. First, there is a host of legislation at both the federal and state level that applies to our users' privacy issues. Forty-eight of 50 states have some laws protecting the privacy of library users. At the federal level, the Privacy Act of 1974, the Family Educational Rights and Privacy Act (FERPA), and the Children's Online Privacy Protection Act (COPPA) are of interest to libraries. The Privacy Act of 1974 was passed in order to control the federal government's collection, use, and dissemination of sensitive personal information. The act contained five fair information principles, which remain the foundation for many of today's privacy regulations. FERPA is a law that requires educational institutions to protect students' privacy with regard to educational records, while COPPA prohibits Web sites from collecting personal information from children under the age of 13 without the consent of their parents. Other pieces of privacy legislation exist for specific types of organizations, such as financial and health institutions. However, my research suggested that the three laws named above are the most pertinent to libraries.
Second, the Library Bill of Rights and the ALA Code of Ethics both have strong statements regarding libraries and a librarian's responsibilities to user privacy. The Library Bill of Rights implies that privacy needs to be protected by libraries. This is because users lose their freedom of inquiry if their privacy is not guaranteed. Furthermore, the ALA Code of Ethics makes a more direct statement regarding libraries and privacy: "We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted." Based on these principles, ALA recommends the following to protect user privacy:
• Limit the degree to which personally identifiable information is monitored, collected, disclosed, and distributed.
• Avoid creating unnecessary records.
• Avoid retaining records that are not needed for efficient operation of the library, including data-related logs, digital records, vendor-collected data, and system backups.
• Avoid library practices and procedures that place personally identifiable information in public view.
Third, beyond the legislative realm, libraries need to pay attention to privacy issues because they are part of the greater landscape of the Internet, and most commercial Web sites have some sort of privacy statement and policies. These policies typically contain five privacy principles recommended by the Federal Trade Commission: Notice/Awareness, Choice/Consent, Access/Participation, Integrity/Security, and Enforcement/Redress. Users of the Internet see the privacy and security of their information as an important issue; they expect some level of privacy protection. Furthermore, users trust that their privacy is being protected even when libraries are unable to accomplish this. All of these facts make privacy a critical issue today. [Editor's Note: For more information on privacy, see Marshall Breeding's "Protecting Personal Information" column in the April 2004 issue of CIL.]
Determining What User Data Our Library Collects
Having evaluated the current privacy landscape as a whole, I next surveyed my library's parcel. I did this by assessing the numerous places in which the library collected data about the users and by determining what type of data each system contained. I started this process by making a list of all the library systems in which I believed user data could be located. Some places were obvious, such as the ILS and the ILL systems. Other locations were subtler. In order to find them, I used the ALA Privacy Toolkit's list of places to look for privacy data and consulted with other systems librarians within SUNY. I found several more sites that had user data, including server log files for Web servers, the proxy server, the Web-based catalog, and the OpenURL resolver.
Each of these locations contained different types and amounts of user data. The ILS and ILL systems contained the most: user ID, name, address, e-mail address, phone number, and borrowed items. In contrast, the Web server and proxy server held slightly different information, including the user's IP address and the resource that the user accessed. The log files for the library's catalog and OpenURL resolver store different types of information as well. The catalog log files can contain a user's IP address, her ID if she is logged into the catalog, and the search query, including the index searched and the search terms. The log files for the OpenURL resolver retain a user's IP address and the OpenURL request. This may not seem sinister on the surface. However, OpenURL requests include the complete citation for the item being requested. Therefore, these log files could be used to identify the materials a given user was seeking.
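To make the OpenURL concern concrete, here is a minimal sketch of how much a single logged request can reveal. The resolver host and the citation are invented for illustration, and the query keys (genre, atitle, title, aulast, and so on) are assumed to follow the common key/value style of OpenURL requests; a real resolver's log may look different.

```python
from urllib.parse import urlsplit, parse_qs


def citation_fields(openurl: str) -> dict:
    """Return the citation metadata carried in an OpenURL query string."""
    query = urlsplit(openurl).query
    # parse_qs returns a list of values per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical log entry: the host name and the citation are made up.
logged_request = (
    "http://resolver.example.edu/openurl?genre=article"
    "&atitle=Privacy+in+Libraries&title=Example+Journal"
    "&aulast=Doe&date=2004&volume=12&spage=34"
)

fields = citation_fields(logged_request)
for key, value in fields.items():
    print(f"{key}: {value}")
```

Paired with the IP address stored on the same log line, these fields are enough to say which article a given workstation asked for.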
Our public computers also collect data about the people who use them. The Web browser's history file, cache, and cookies can be used to determine what resources a patron has accessed. Currently, this is difficult because all these systems are entered by using a library guest account. However, in the fall we will begin requiring all patrons to authenticate to these machines using their SUNY Cortland computing accounts. As a result, this information will more easily be tied to an individual user, and this data will need to be protected as well.
Once I had determined the different types of user information our systems were collecting, I needed to decide which pieces of data were necessary to maintain the library's mission and which were being captured without a purpose. My final list included the following uses for patron data: daily operations such as circulation and interlibrary loan; security such as authenticating users so they could access their accounts or resources from off campus; troubleshooting; collection development; Web site usability; and assessment.
During this process, I found several pieces of data being collected that the library did not need in order to fulfill its mission. In particular, we did not need user birth dates and Social Security numbers (SSNs). Because our campus has moved away from using SSNs to identify individuals, we are able to use a patron's Cortland ID number in order to determine his or her identity. As a result, SSNs are unnecessary information for us. The issue with having information about library users that you don't need is that the more data you have, the more data you need to protect. If your system were compromised, how damaging would the data it holds be? Removing small pieces of information that you do not need may make it more difficult for a person who compromises that system to commit identity theft or use the data in a malicious way.
Developing Procedures to Protect User Privacy
My final step was to develop strategies for removing user-related data. They would include IP washing, delinking borrower records, using data archiving routines that aggregate data and remove personally identifying information about users, and evaluating and selecting possible tools for protecting user privacy.
This was complicated by the fact that library systems are often designed to capture data about users for daily operations, personalization, or assessment purposes. Additionally, some Web-based library systems incidentally capture user-related data as part of their design. However, they do not have built-in mechanisms for purging or archiving this data on a regular schedule. With such a monumental task in front of me, I chose the only course of action that seemed rational--tackle one system's privacy issues at a time.
ILS and OPAC: De-Link and Wait
I immediately eliminated the ILS from my list of systems to deal with because, while I was conducting my survey of the privacy landscape, a script had been put in place to de-link users from the items they borrowed. This left the log files for the Web-based catalog as the other issue with this system. Since this system is on a server that is controlled by the SUNY Office of Library and Information Services (OLIS), I did not have the authority to deal with it on my own. I could have requested that OLIS address my concerns about privacy at that time. However, I felt that any proposal about privacy would be stronger if I incorporated my experience with handling these issues.
ILL System: Writing Routines
I chose to deal with the library's ILL system first because the amount of user data held there was greater than in any other system except the ILS. Our ILL is set up to store data in a relational database that has a table for users and a table for transactions (i.e., items requested). A link is created between the user table and the transaction table via the username; this link is stored in both tables. Typically, this link is never broken because this structure also allows us to generate reports about which types of people (faculty, students, staff) or departments are using the ILL service. My library wanted to retain this demographic information about the people but still be able to break the link between a user and the item borrowed. In order to do this, I developed a routine that writes the transaction number and the relevant demographic information from the user table into a new table in the database before the link between the transaction and the user is broken. This allows us to keep important statistical information about interlibrary loan but still protect our users' privacy.
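The routine described above might be sketched as follows. This is not the library's actual code: the table and column names (users, transactions, transaction_demographics, status, department, closed) are hypothetical, SQLite stands in for whatever database the ILL system uses, and the rule that only closed requests are de-linked is an added assumption.

```python
import sqlite3


def delink_transactions(conn: sqlite3.Connection) -> int:
    """Copy demographics for closed transactions, then break the user link.

    Returns the number of transactions that were de-linked.
    """
    with conn:  # one database transaction: both steps commit, or neither does
        conn.execute(
            """
            INSERT INTO transaction_demographics (transaction_id, status, department)
            SELECT t.transaction_id, u.status, u.department
            FROM transactions AS t
            JOIN users AS u ON u.username = t.username
            WHERE t.closed = 1
            """
        )
        cursor = conn.execute(
            "UPDATE transactions SET username = NULL "
            "WHERE username IS NOT NULL AND closed = 1"
        )
    return cursor.rowcount


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (username TEXT PRIMARY KEY, status TEXT, department TEXT);
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY, username TEXT, title TEXT, closed INTEGER);
    CREATE TABLE transaction_demographics (
        transaction_id INTEGER PRIMARY KEY, status TEXT, department TEXT);
    INSERT INTO users VALUES ('jdoe', 'faculty', 'Biology');
    INSERT INTO transactions VALUES (1, 'jdoe', 'Example Title', 1);
    INSERT INTO transactions VALUES (2, 'jdoe', 'Another Title', 0);
    """
)
delinked = delink_transactions(conn)
```

Running both statements inside one database transaction means a failure partway through cannot leave a request de-linked without its demographic row having been saved.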
Web Sites: Selective Scrubbing
The second set of systems that I chose to work on was the library's Web sites. Currently, we maintain three Web sites that are accessed by users. Each of these servers stores log files that contain a user's IP address, whether the user has logged in, his or her campus computing username, and the page he or she visited. In order to protect a person's privacy, we did not want to keep information about who visited which of our Web pages. However, we did want to know if users were accessing library resources from on or off campus (and where on campus). Because of that, removing or washing the entire IP address for each line of the log files was not an option. Instead, we chose to develop a system that washed only the last octet of the IP addresses and the username (if present) from the server log files. Without those two pieces of data, it would be impossible to tie the page visit to a specific library user.
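A washing step of this kind could look something like the sketch below. It assumes the servers write the widely used Common Log Format (host, ident, username, timestamp, request, status, bytes), and it assumes that "washing" means replacing the last octet with 0 and the username with a hyphen; both choices are illustrative, not a description of the scripts in use at SUNY Cortland.

```python
import re

# Matches a dotted-quad IPv4 address and captures its first three octets.
IPV4 = re.compile(r"\b(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\b")


def wash_line(line: str) -> str:
    """Zero the last octet of the client IP and blank the username field."""
    parts = line.split(" ", 3)
    if len(parts) < 4:
        return line  # not a recognizable log entry; leave it untouched
    host, ident, _username, rest = parts
    host = IPV4.sub(r"\1.0", host)
    # "-" is the conventional placeholder for an absent username.
    return f"{host} {ident} - {rest}"


# Invented sample entry using a documentation-only address range.
sample = (
    '192.0.2.57 - jdoe [10/Oct/2004:13:55:36 -0500] '
    '"GET /library/hours.html HTTP/1.0" 200 2326'
)
washed = wash_line(sample)
print(washed)
```

Zeroing the octet rather than deleting it keeps each line in a shape that standard log analysis tools can still parse.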
Currently, one server's log files are collected over the course of a day. At 12:10 a.m. the next day, when the previous day's logs are complete, they are washed of the last octet of IP addresses and of usernames. As a result, only the current day's log files contain the complete data. This decision allowed us to protect patron privacy and still gather the data we needed for assessment purposes. Additionally, we had the side benefit of more easily distinguishing between on- and off-campus use of our Web site in the reports generated by our Web server log analysis tool.
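The reason a washed address still supports on- and off-campus reporting can be shown with a short sketch. The network ranges below are hypothetical placeholders, not SUNY Cortland's address plan, and the approach assumes each campus location is assigned a block of at least 256 addresses (a /24 or larger), so that zeroing the last octet never moves an address out of its block.

```python
import ipaddress

# Hypothetical address plan; a real campus would substitute its own ranges.
CAMPUS = ipaddress.ip_network("10.1.0.0/16")
CAMPUS_SUBNETS = {
    "library": ipaddress.ip_network("10.1.20.0/24"),
    "residence halls": ipaddress.ip_network("10.1.30.0/24"),
}


def locate(washed_ip: str) -> str:
    """Classify a washed IP address as on campus (and where) or off campus."""
    addr = ipaddress.ip_address(washed_ip)
    for name, network in CAMPUS_SUBNETS.items():
        if addr in network:
            return f"on campus ({name})"
    return "on campus (other)" if addr in CAMPUS else "off campus"


for ip in ("10.1.20.0", "10.1.99.0", "203.0.113.0"):
    print(ip, "->", locate(ip))
```

The washed address identifies a building-sized block of machines rather than a single workstation, which is the level of detail the assessment reports need.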
Proxy Server: IP Scrubbing
The last internal system for which I am developing procedures to protect user privacy is the proxy server. Currently, we use the EZproxy system in order to allow off-campus users to access our resources. EZproxy has a log file that tracks every time someone passes through the proxy server. In the case of SUNY Cortland, all users accessing electronic resources are passed through the proxy server in order to track resource usage. Like the Web server log files, EZproxy log files store a user's IP address as well as the resource that was being accessed. Depending on the resource, the log files can contain the search string as well. Additionally, the log files can include some sort of user ID; whether or not the log contains a meaningful user ID depends on how the library chooses to configure EZproxy. In our case, a user ID is not present. This makes the task of protecting privacy slightly easier.
Currently, we collect monthly proxy server logs. As with our Web server logs, we use the IP address to determine where (on or off campus) a resource was accessed from. Our current EZproxy log analysis system deals with full IP addresses and monthly log files. In order to reduce the amount of user-specific data we have available, we are currently transitioning this system to collect daily log files. These files will be scrubbed of the last octet of the IP addresses and then analyzed by our in-house tool for tracking electronic resource usage. We estimate that this transition will be complete by the time this article is published and that the data archiving procedures for EZproxy logs will mirror those of the Web servers.
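As a rough illustration of the scrub-then-analyze flow for a day's proxy log, the sketch below zeroes the last octet of each client address and counts requests per resource host. It is not the in-house tool mentioned above. It assumes the proxy log uses a Common Log Format style line in which the request field holds the full URL of the remote resource; the sample lines and host names are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches a dotted-quad IPv4 address and captures its first three octets.
IPV4 = re.compile(r"\b(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\b")
# Captures the URL inside a quoted request such as "GET http://... HTTP/1.1".
REQUEST = re.compile(r'"[A-Z]+ (\S+) [^"]*"')


def scrub_and_tally(lines):
    """Return (scrubbed lines, Counter of requests per resource host)."""
    scrubbed, usage = [], Counter()
    for line in lines:
        client, sep, rest = line.partition(" ")
        scrubbed.append(IPV4.sub(r"\1.0", client) + sep + rest)
        match = REQUEST.search(rest)
        if match:
            resource_host = urlsplit(match.group(1)).hostname
            if resource_host:
                usage[resource_host] += 1
    return scrubbed, usage


# Invented sample lines; addresses and hosts are placeholders.
sample = [
    '198.51.100.23 - - [01/Mar/2005:09:15:02 -0500] '
    '"GET http://db.example.com:80/search?q=privacy HTTP/1.1" 200 5120',
    '10.1.20.77 - - [01/Mar/2005:09:16:40 -0500] '
    '"GET http://db.example.com:80/record/42 HTTP/1.1" 200 2048',
    '10.1.20.78 - - [01/Mar/2005:09:17:05 -0500] '
    '"GET http://journals.example.org:80/toc HTTP/1.1" 200 900',
]
scrubbed, usage = scrub_and_tally(sample)
```

Note that this sketch leaves any search string in the request untouched; it addresses only the IP address, which is the step the text describes.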
Public Computers: Reboot, Restore
In addition to our Web-based systems, our public computers also collect data about patrons. In order to minimize the amount of data they collect, the library will institute a set of practices in the fall 2005 semester. All installed Web browsers will have the following settings: temporary Internet files set to 2 MB; history set to 0 days; form memory turned off; password memory turned off; and downloads turned off. All computers will have the Deep Freeze product installed to restore them to a standard template when rebooted. Also, they will be set up to reboot after a set time of inactivity. This will log off any individual who forgot to do so and delete his or her activities from the computer.
Learning That We Can't Control Everything
These procedures will allow us to more effectively protect the privacy of our patrons in relation to the systems that we control. However, there are two systems that we do not control that raise privacy concerns. One is the ILS; the other is the OpenURL resolver. Both of these systems are handled by the SUNY Office of Library and Information Services (OLIS) and have not been systematically analyzed for user data that they may collect. These systems' server log files, which contain data that can link a specific user to the information he or she accessed, are of particular concern. In order to address this issue, practices such as server log washing need to be implemented. Because these systems affect many different SUNY institutions, a formal request for this change will need to be submitted to OLIS. I am currently working on a request that outlines the systems and data that I feel are privacy concerns.
Overall, addressing user privacy issues at my library has been a learning experience on a number of levels. In addition to the legal and ethical aspects of privacy, we must be aware of data collection procedures and assessment goals in order to develop effective privacy policies and procedures. I learned five key points while working on this project:
American Library Association (1996). "Library Bill of Rights," http://www.ala.org/ala/oif/statementspols/statementsif/librarybillrights.htm.
American Library Association (1995). "Code of Ethics of the American Library Association," http://www.ala.org/ala/oif/statementspols/codeofethics/codeethics.htm.
American Library Association (2004). "Privacy Toolkit," http://www.ala.org/ala/oif/iftoolkit/toolkitsprivacy/Default4517.htm.
Consumer Privacy Guide. "Privacy Act of 1974," http://www.consumerprivacyguide.org/law/pas.html.
Coyle, Karen (2002). "Make Sure You Are Privacy Literate," Library Journal 27(16). http://libraryjournalreviewsnews.com/index.asp?layout=article&articleid=CA245045.
Fifarek, Aimee (2002). "Technology and Privacy in the Academic Library," Online Information Review 25, 366-374.
Sturges, Paul; Davies, Eric; Dearnley, James; Iliffe, Ursula; Oppenheim, Charles; and Hardy, Rachel (2003). "User Privacy in the Digital Library Environment: An Investigation of Policies and Preparedness," Library Management 24(1), 44-50.
By Karen A. Coombs
Karen A. Coombs is the electronic services librarian at SUNY Cortland in New York. She holds an M.S. in information management and an M.L.S., both from Syracuse University. Karen has written articles (Computers in Libraries and The Journal of Academic Librarianship) and made presentations (Computers in Libraries and Internet Librarian conferences) on a variety of topics, including Web-based technologies, privacy, and Web accessibility. She is the author of The Library Web Chic (http://www.librarywebchic.net), a Web-log that focuses on the use of Web-based technologies in libraries.