Availability and accessibility in an open access institutional repository: a case study
Jongwook Lee and Gary Burnett
School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, Florida, 32306-2100, USA
Florida State University Libraries, 116 Honors Way, Tallahassee, Florida, 32306, USA
Jung Hoon Baeg
School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, Florida, 32306-2100, USA
School of Communication Science & Disorder, Florida State University, 201 W. Bloxham, Warren Building, Tallahassee, Florida, 32306-1200, USA
Open access refers to a variety of approaches for making the products of scholarly research freely available for others to access and, in some cases, reuse. Some authors argue open access increases the accessibility of research papers by reducing or eliminating restrictions on access caused by the licensing requirements and copyright agreements common in traditional subscription journals (Chan, 2004; Harnad et al., 2008). Arguments have been forwarded for a move to open access on financial grounds, as rapidly increasing prices for journal subscriptions may result in restrictions to readership if libraries reduce the number of their subscriptions (see, for example, Parks, 2002). Further, many researchers have claimed that open access provides a citation advantage over traditional publication models (Antelman, 2004; Eysenbach, 2006; Harnad and Brody, 2004; Norris, Oppenheim and Rowland, 2008), because it typically makes research papers available via the Web, increasing their citeability (Gargouri et al., 2010; Lawrence, 2001).
Generally speaking, there are two primary open access models currently in use:
- The gold open access model, in which papers are published in open access journals, often (though not always) with associated article charges paid by authors, and
- The green open access model, in which authors themselves archive their work in repositories, personal Websites or elsewhere (Bailey, 2010).
A third type, sometimes called the hybrid journal, allows authors to pay fees to make individual articles, originally published in traditional subscription journals, freely available (Miguel, Chinchilla-Rodriguez and Moya-Anegon, 2011). Of the models, green open access has typically been considered to be the most effective as a result of an increasing number of repositories and journals that allow author self-archiving (Harnad et al., 2008). For instance, Miguel et al. (2011) found that, of journals indexed in SCOPUS, 32% followed the green model by contractually allowing authors to self-archive, whereas only 9% could be considered to be true gold open access journals.
Authors can self-archive their papers by uploading them to their own personal Websites or by depositing them in discipline-specific, government-sponsored, or institutional repositories (IRs). A tradition of exchanging preprints among researchers in several science fields has given rise to numerous disciplinary repositories (such as arXiv and PubMed Central), which have often been developed and maintained by the communities of researchers who use them (Björk, 2004). Institutional repositories, by contrast, can be defined as 'digital collections capturing and preserving the intellectual output of a single or multi-university community' (Crow, 2002: 1). Typically, such institutional repositories are housed by university libraries, which, by adapting traditional collection development practices, can systematically manage and maintain institutional repository materials for the long term (Björk, 2004); as a result, the green approach of author self-archiving in institutional repositories has been widely recommended (Gargouri et al., 2010). This practice may have several advantages; it may, for instance, help to:
- facilitate scholarly communication,
- increase the visibility of institutions,
- mitigate the monopoly power of publishers, and
- support the teaching, research and administrative missions of universities (Crow, 2002; Markland, 2006; McCord, 2003).
Claims about the value of institutional repositories, like those about open access in general, are rooted in an assumption that they inherently enhance the accessibility of the materials they house, and, as a result, help to increase citation counts. Consequently, much effort has been expended to increase faculty participation, which is typically low, in institutions' repositories (Chan, 2004; Davis and Connolly, 2007; Oguz and Assefa, 2014; Swan and Brown, 2005). However, few researchers have tested this basic assumption. This case study addresses this dearth of research, exploring the extent to which open access makes articles accessible on the Web by examination of a particular institutional repository. The authors proposed the following research questions:
- How effective is an institutional repository in making articles accessible?
- What, if any, are the potential impediments to the effectiveness of an institutional repository in furthering open access goals?
Many researchers have associated the citation advantage of open access with its function of improving the accessibility of papers on the Web (Antelman, 2004; Eysenbach, 2006; Harnad and Brody, 2004; Lawrence, 2001). However, such studies typically do not provide empirical support for this claim. In this study, the accessibility of materials housed in an institutional repository is tested using Google and Google Scholar searches. Since the mere availability of a paper (i.e., the mere fact that it is present within an institutional repository) does not necessarily guarantee that it is easily accessible, the study differentiates the concepts of availability and accessibility as dimensions of physical access: we examine availability as the ability of search engines to retrieve clear links to an individual paper within the first two pages of results, and, further, measure accessibility as the number of clicks required for a user to navigate from those results to the full text of the paper itself. The study provides empirical findings for scholarly communication researchers and librarians who are interested in promoting the success of open access and institutional repositories. The results of this study can be used as a source for encouraging the open access movement and for enhancing the performance of institutional repository software.
Open access citation advantage
Numerous previous studies have attempted to explain the causes of the open access citation advantage through three postulates (Craig, Plume, McVeigh, Pringle and Amin, 2007; Davis and Fromerth, 2007; Kurtz et al., 2005; Koler-Povh, Južnič and Turk, 2014; Xia and Nakanishi, 2012). The first (the open access postulate) suggests that open access increases citation count by directly improving the accessibility of papers. On the other hand, the second and third postulates (early access and selection bias) explicitly reject the assumptions inherent in the open access postulate. The early access postulate proposes that papers are more likely to be cited because open access papers are often made public in early pre-print versions and are thus accessible for a longer time than non-open access papers. Similarly, the selection bias postulate argues that authors tend to favour their highest quality, and, thus, most likely to be cited, work when choosing materials to make available in institutional repositories.
Lawrence (2001) examined the correlation between paper availability on the Web and citation count, analysing 119,924 conference papers in computer science and related disciplines, finding that papers on the Web are more likely to be cited. Harnad and Brody (2004) compared the citation counts of articles published in a non-open access journal that had been placed by authors into institutional repositories with those of articles from the same journal that had not been so placed. They found that, in the fields of computer science, astronomy, and physics, the citation rates of open access articles were 2.5-2.8 times higher than those of non-open access articles.
Antelman (2004) tested a hypothesis that citation counts of open access articles were higher than those of non-open access articles, choosing ten journals each from four disciplines whose practitioners are known to be heavy users of pre-prints (mathematics, electrical and electronic engineering, political science and philosophy). Using Google to distinguish freely available full-text open access articles from non-open access articles, the study found that citation counts of open access articles were 51% to 91% higher than those of non-open access articles. Eysenbach (2006), in a longitudinal study examining the impact of open access as well as article and author characteristics on citation rates, found open access to be a significant independent predictor. In addition, open access articles tend to be cited earlier and more often than non-open access articles, even when there is no significant difference in the quality of articles.
Some researchers have considered all three postulates. For example, Kurtz et al. (2005) measured the effects of open access, early access and self-selection bias on citations in seven astronomy journals. Comparing citation changes for older and newer articles to test the open access and early access postulates, and testing selection bias by using the Monte Carlo simulation to analyse the 'probability that a particular number of non-arXiv submitted papers [would] be [among] the top 100 or 200 most cited papers' (Kurtz et al., 2005: 1398), they found that, while early access and selection bias strongly influence citations, the effect of open access itself is unobservable, perhaps because the astronomy research community typically has easy access to core journals. Davis and Fromerth (2007) analysed 2,765 articles published in four mathematics journals, finding that articles deposited in arXiv tend to have more citations than non-deposited articles. However, while they reported a positive impact on citations due to selection bias, they detected an open access effect only among highly cited articles and no impact from early access.
Moed (2007) compared the citation impact of articles deposited in arXiv with that of articles not deposited, using citation time windows to measure the effects of early access and analysing the proportion of prominent authors in arXiv to test selection bias. Although the study found an increase in citation counts for papers available in arXiv, this was due to early access and selection bias rather than to the impact of open access per se; as Moed (2007) put it, arXiv increases citation counts not because it makes papers freely accessible, but because it makes them 'available earlier' (Moed, 2007, p. 2054). Davis, Lewenstein, Simon, Booth and Connolly (2008) carried out a randomized controlled experiment to measure the open access effect on downloads and citations, finding that open access articles are downloaded more often than non-open access articles, with a strong impact from article characteristics such as article type and length. While their study suggests that open access augments readership through increased downloads, it found no evidence of a true open access citation advantage.
Status of institutional repositories
Early work by Crow (2002) suggested that institutional repositories could be seen as contributing factors in 'a new disaggregated model' (Crow, 2002, p. 6) of scholarly publishing, one that may help to weaken the monopolistic power of the traditional academic journal system over scholarly communication. He argues that, by developing and maintaining 'institutionally defined', 'scholarly', 'cumulative and perpetual', and 'open and interoperable' repositories (Crow, 2002, p. 16), institutions can increase their visibility and prestige by centralising the intellectual work of their members, thus enabling researchers to find relevant materials more easily. Shearer (2002) identified potential factors that need to be considered for repositories to be successful, including 'input activity', 'disciplines', 'advocacy activities', 'archiving policies', 'copyright policies', 'content type', 'staff support', 'quality control policies', 'software' and 'use' (Shearer, 2002, pp. 98-99). Shearer assumed that input activity, that is, submission of papers by researchers, would be one of the most important factors and wanted to examine the relationship between it and the other factors. This 2002 study, however, did not report results based on an analysis of data.
Markland (2006) examined the effectiveness of Google in retrieving papers deposited in institutional repositories, choosing one item each from twenty-six UK institutional repositories, checking their availability and investigating the ease of finding them through five search strategies: 'a search at the repository interface', 'a Google search using a keyword or phrase from the title', 'a Google search using the complete title', 'a Google Scholar search using a keyword or phrase from the title' and 'a Google Scholar search using the complete title' (Markland, 2006, p. 224). The study showed that three of the items could not be retrieved through the repository interface. For searches using keyword phrases from titles, 17 of 26 items in repositories were retrieved from Google, and 8 of 26 from Google Scholar. When using a complete title search, 25 of 26 were retrieved through Google and 17 of 26 through Google Scholar, suggesting that a simple title search via Google was the most effective means of retrieving repository items.
Some researchers have reported low awareness and usage of institutional repositories. Swan and Brown (2005) examined the perceptions of open access and self-archiving in a survey of 1,296 researchers. While 49% of respondents had self-archived their papers in repositories or Websites, the remainder had not. Of those who had not yet self-archived, 71% were unaware of open access and self-archiving. In an evaluative study of institutional repositories, Davis and Connolly (2007) collected data from Cornell's DSpace in order to calculate descriptive statistics and interviewed eleven faculty members for a deeper understanding of their attitudes and behaviours. DSpace had 2,646 items as of October 2006, categorized into 196 collections, of which almost 30% contained no materials. Further, of 519 unique contributors, nearly 50% uploaded only a single item, reinforcing the interview finding that faculty members lacked both knowledge and motivation to use institutional repositories. In a study of attitudes and behaviours, Watson (2007) interviewed twenty-one researchers from Cranfield University. Interviewees considered it important to share their work, but most were not aware of the potential of institutional repositories as a way to do so; even among those who knew of the existence of institutional repositories, many were not using them. Xia (2010) found researchers to be increasingly aware of open access but only at a very basic level, with insufficient understanding to enable them to participate in open access initiatives, suggesting that increased awareness alone may not be sufficient to increase faculty use of institutional repositories.
More recently, Nicholas, Rowlands, Watkinson, Brown and Jamali (2012) investigated scientific researchers' perceptions of digital repositories. They analysed 1,685 survey responses obtained from faculty members and students who had been registered with the Institute of Physics Publishing. They found that 1,079 (63.7%) of survey respondents had deposited their research outcomes in some kind of repository and that 44.1% had specifically used institutional repositories. Oguz and Assefa (2014) surveyed faculty members at a medium-sized university to investigate their perceptions and attitudes toward institutional repositories and found positive perceptions among 52.9% and negative perceptions among 47.1%. In general, although there are some variations across disciplines and institutions (Cullen and Chawner, 2011; Oguz and Assefa, 2014), there appears to be a growing rate of author participation in institutional repositories, but there is still plenty of room for further growth (Björk, Laako, Welling and Paetau, 2014).
Dimensions of physical access: availability and accessibility
In a traditional brick-and-mortar library, the mere presence of an item in the collection does not guarantee full accessibility; for instance, a book may be available, but may be shelved on a top shelf, with the result that there are important barriers to the accessibility of that item for users in wheelchairs; in online settings, such as institutional repositories, there may be comparable impediments limiting the accessibility of items that are present (and, thus, available) in a collection. Therefore, in this study we treat availability as a necessary, but not sufficient, element of accessibility, because the mere presence of papers within an institutional repository does not guarantee their accessibility (Hargittai and Hinnant, 2006); this allows us to identify possible impediments to accessing documents housed in institutional repositories. Like Fidel and Green (2004), we consider availability to be a dimension of accessibility, arguing that availability, while a necessary precondition for users to gain access to and 'use a source at a particular time' (Fidel and Green, 2004, p. 577), does not ensure that users will easily be able to put their hands on that source. Ugah (2008), similarly, defines availability as the presence and readiness for use of materials in libraries or virtually; a source is unavailable if it lacks either physical presence or readiness for use. The accessibility of materials, following such a definition, depends upon prior availability, simply because unavailable sources are also inaccessible.
Many open access studies have used the terms availability and accessibility interchangeably or have reported a positive relationship between them, suggesting that increased availability can help to improve accessibility. The Budapest Open Access Initiative (2002) defines open access as 'free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of … articles…'. Similarly, Bullinger et al. (2003), in the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, describe open access as 'granting to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly…'. That is, open access may increase accessibility for users through making sources available.
However, given the diverse facets of access, and even physical access, mere availability may not always enhance accessibility. Lawrence (2001), for instance, argues that other variables such as search tools, search engines and indexing can alter the physical accessibility of a document even if it is, strictly speaking, available. Culnan (1985) also notes that individuals' prior knowledge and the context of use can affect accessibility. This implies that open access, even if it does improve availability, may not improve every aspect of accessibility; for one thing, while the relationship between availability and true physical accessibility may be enhanced by open access, open access itself may not be directly related to increasing intellectual and social accessibility. While the current study acknowledges the importance of both intellectual and social access, it focuses on the prerequisite concepts of availability and physical accessibility to explore physical barriers (or, strictly speaking, their virtual analogues) in accessing institutional repositories; that is, the potential of institutional repositories to make materials both available and easily accessible. As noted above, we investigate availability as the ability of search engines to retrieve clear links to an individual paper within the first two pages of results; that is, availability refers to the simple presence of an item in a set of search results, an indication that the item exists. Further, we examine accessibility as the number of clicks required for a user to navigate from those results to the full text of the paper itself; thus, accessibility, in this study, refers to the amount of labour required of a user to actually obtain the item after having determined that it is available.
In this study we explored the extent to which open access supports physical accessibility by conducting a case study of the institutional repository at Florida State University (FSU). Florida State University launched its institutional repository, DigiNole Commons (http://diginole.lib.fsu.edu/), in mid-2011 in order to provide a common, openly accessible repository for scholarly and creative works of the university's faculty. To date, this repository has been hosted through Digital Commons institutional repository software, provided by Berkeley Electronic Press (bepress) and managed by Florida State University Libraries. As of the end of 2012, DigiNole Commons hosted a total of 5,020 items: 4,600 electronic theses and dissertations, 146 honours undergraduate theses and 214 works by faculty. The dataset used for this study is a sub-set of the latter and includes 170 faculty publications found in the institutional repository that have also been published in peer-reviewed journals.
To analyse the institutional repository's impact on physical accessibility, we conducted independent known-item title searches on both Google and Google Scholar (GS) to search for the faculty publications housed in DigiNole Commons. Numerous prior studies have provided a rationale for using Google and Google Scholar in collecting research materials. Jacso (2005) reported that Google Scholar is a powerful tool for searching scholarly information because its crawlers run 'databases of the largest and most well-known scholarly publishers and university presses; their digital hosts/facilitators; societies and other scholarly organizations and government agencies, and preprint/reprint servers' (Jacso, 2005, p. 209). Markland (2006) searched Google and Google Scholar to assess their efficiency in finding papers deposited in institutional repositories, based on the assumption that users tend to use Google because of its simplicity and ease of use. Björk, Roos and Lauri (2009) used title and author name searches on Google to estimate the proportion of green open access papers, treating any papers not found through this approach as unavailable, since most users assume papers are not available if they do not turn up in the first few results of a Google search. Norris, Oppenheim and Rowland (2008) used both Google and Google Scholar to check the open access status of papers and compared search results from Google and Google Scholar with two centralized institutional repository access points, OAIster and OpenDOAR. While OAIster and OpenDOAR retrieved 14% of 2,280 open access papers, Google and Google Scholar were able to retrieve the other 86%, although there was little overlap in search results between Google and Google Scholar. 
Xia, Myer and Wilhoite (2011) conducted title searches (sometimes supplemented with author names) on Google, Google Scholar and Yahoo to look into the availability of papers, finding a similar difference in search results between Google and Google Scholar.
Open access institutional repositories do not exist in isolation, but are just one possible place where the outcomes of faculty research can be found. Papers may be deposited by their authors or publishers in multiple repositories, for example, or may exist in multiple versions in varying file formats (e.g. HTML, PDF, etc.) or even with differing contents, since works made available as preprints are often subject to revision between the preprint stage and publication. To account for this phenomenon of multi-locations and multi-versions, Xia et al. (2011) operationalize open access availability as 'the number of web search engines that can return a link to the free full text of an article' (Xia et al., 2011, p. 22).
In the current study, however, we were concerned with the ability of a single institutional repository to enhance physical access and thus focused on only one (out of many possible) version of a publication: that made available through DigiNole Commons at Florida State University via Google and Google Scholar searches. Because of this more limited focus, availability is treated more narrowly in this study, as described above; we consider a paper to be available if a link to it is found on the first two pages of search results and we then measure access by counting the number of clicks required to move from the search results to the full text of the item itself. If getting to a full text requires first moving to the second page of search results and then following a link from that page to the full text within DigiNole Commons, access to that article is considered to require two clicks; if finding the full text requires moving to the second page, following a link to DigiNole Commons metadata, and then following a link to the full text, access requires three clicks.
Google Scholar, possibly since it is more narrowly focused on scholarship, adds a particular complication for our measurement of access: rather than presenting unique links to specific items, it collapses multiple links into a single link covering multiple versions, both initially hiding specific search results and requiring an additional click in order to gain access to institutional repository materials (see Figure 1, below). Thus, if an item housed in DigiNole Commons is one of several items retrieved in a Google Scholar search, an extra click is required to gain access to that item over what is required in a Google search.
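The click-counting rule described above can be written out as a short sketch. The function and parameter names below are hypothetical and are not drawn from the study; the sketch only encodes the rule as stated in the text: one click to follow a result link, plus one for moving to the second results page, one for expanding a grouped Google Scholar versions link, and one for passing through a DigiNole metadata page.

```python
def clicks_to_reach(on_second_page=False, grouped_versions=False,
                    via_metadata_page=False):
    """Count the clicks needed to get from a results list to the target.

    All parameter names are illustrative assumptions:
    on_second_page    -- the link appears on the second page of results
    grouped_versions  -- Google Scholar hides the link behind a versions link
    via_metadata_page -- the full text is reached through a metadata page
    """
    clicks = 1  # following the link that leads toward the item
    if on_second_page:
        clicks += 1  # moving from the first to the second results page
    if grouped_versions:
        clicks += 1  # expanding the collapsed list of versions
    if via_metadata_page:
        clicks += 1  # following the full-text link on the metadata page
    return clicks


# The two worked examples given in the text:
print(clicks_to_reach(on_second_page=True))                          # 2
print(clicks_to_reach(on_second_page=True, via_metadata_page=True))  # 3
```

Under this reading, the largest possible count is four clicks, which arises only in Google Scholar when all three conditions apply.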
Researchers first checked for the presence of metadata and full texts for 170 faculty publications in DigiNole Commons itself. In contrast to the complete availability of metadata, only 100 (58.82%) full texts out of 170 were available within DigiNole Commons. In the other 70 cases, links to 18 items external to DigiNole Commons were provided (7 to openly accessible sites such as author or departmental Websites, and 11 to non-open access subscription-based sites such as JSTOR, publisher sites, etc.); an additional three items had links that appeared to no longer be current, and two items were under publisher embargo. For 47 items, neither full-text copies nor links to full texts were available at DigiNole Commons. Accordingly, while this study analyses metadata availability via Google and Google Scholar for 170 items, the analysis of full text availability and accessibility is limited to those 100 items for which full texts exist in DigiNole Commons. The researchers first conducted full title searches on Google and Google Scholar in March 2013; although some scholars (e.g. Vaughan, 2004; Vaughan and Shaw, 2005) have suggested that the consistency of Google search results over time is typically fairly high, identical title searches were repeated in May 2013. Fairly high consistency (ranging from 84.7% to 99.4%) was found in both availability and accessibility between the two search results, with most discrepancies found in Google Scholar searches, most commonly as a result of changes in Google Scholar links to multiple copies of an item, as discussed above, or small changes in search result rankings (e.g., items moving from the first page of results to the second or vice versa). In what follows, we report the findings based on the March 2013 searches, and note the possible impact of search inconsistencies on availability and accessibility in the discussion.
In Google searches, links to DigiNole metadata were found in the first two pages of search results for 78 (45.9%) out of 170 items; 74 (74.0%) out of 100 full texts housed in DigiNole Commons were available either directly (from a Google link to the item itself) or indirectly (through a Google link to DigiNole metadata). Searches in Google Scholar, by comparison, turned up links to DigiNole metadata in 127 (74.7%) cases out of 170, and to full texts in 78 (78.0%) out of 100 cases. A chi-square test for comparing two proportions shows a statistically significant difference (χ² = 28.306, df = 1, p < 0.001) in metadata availability between Google (45.9%) and Google Scholar (74.7%) at the 0.05 alpha level. However, the difference in full-text availability is not statistically significant.
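The reported statistic can be checked by hand from the counts given above. The sketch below assumes, since the text does not say so, that the test was a Pearson chi-square on a 2×2 table with Yates' continuity correction and that the Google and Google Scholar searches are treated as independent samples; the function name is illustrative only. Under those assumptions the metadata counts reproduce the reported value.

```python
def yates_chi_square(hits_a, hits_b, n_a, n_b):
    """Chi-square statistic (df = 1) for two proportions, with Yates' correction.

    Assumption: samples are treated as independent, which the source
    does not state explicitly.
    """
    observed = [[hits_a, n_a - hits_a], [hits_b, n_b - hits_b]]
    row_totals = [n_a, n_b]
    col_totals = [hits_a + hits_b, (n_a - hits_a) + (n_b - hits_b)]
    total = n_a + n_b
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            # Yates' correction shrinks each deviation by 0.5 (never below 0)
            deviation = max(0.0, abs(observed[i][j] - expected) - 0.5)
            statistic += deviation ** 2 / expected
    return statistic


CRITICAL_VALUE = 3.841  # chi-square critical value for df = 1, alpha = 0.05

metadata = yates_chi_square(78, 127, 170, 170)
full_text = yates_chi_square(74, 78, 100, 100)
print(round(metadata, 3), metadata > CRITICAL_VALUE)    # 28.306 True
print(round(full_text, 3), full_text > CRITICAL_VALUE)  # 0.247 False
```

The same calculation applied to the full-text counts falls well below the critical value, in line with the non-significant result reported for full texts.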
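|Item type||Metadata||Metadata||Full text||Full text|
|Search engine||Google||Google Scholar||Google||Google Scholar|
|Available links||78 (45.9%)||127 (74.7%)||74 (74.0%)||78 (78.0%)|
|Unique links||18 (10.6%)||67 (39.4%)||18 (18.0%)||22 (22.0%)|
|Shared links (metadata; full text)||60 (35.3%)||56 (56.0%)|
|Total combined (metadata; full text)||145 (85.3%)||96 (96.0%)|
|Total DigiNole items (metadata; full text)||170 (100%)||100 (100%)|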
As summarized in Table 1, Google Scholar searches uncovered 67 unique links to DigiNole metadata that did not turn up in Google searches, compared to 18 unique links that were found via Google but not Google Scholar. For full texts, Google Scholar retrieved 22 unique items not found by Google, while Google retrieved 18 unique items. Considered together, Google Scholar and Google searches provided links to DigiNole metadata for a total of 145 (85.3%) of 170 items and to full texts for 96 (96.0%) of 100 items.
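The unique, shared and combined figures follow from simple arithmetic on the per-engine counts. The sketch below, with hypothetical function and key names, recomputes them from the number of links found by each engine and the number found by both; it is a consistency check on the reported figures, not a description of how the authors tabulated their data.

```python
def overlap_summary(found_google, found_scholar, shared, total_items):
    """Derive unique and combined counts from per-engine and shared counts."""
    unique_google = found_google - shared    # found by Google only
    unique_scholar = found_scholar - shared  # found by Google Scholar only
    combined = unique_google + unique_scholar + shared
    return {
        "unique_google": unique_google,
        "unique_scholar": unique_scholar,
        "combined": combined,
        "combined_pct": round(100 * combined / total_items, 1),
    }


metadata = overlap_summary(78, 127, shared=60, total_items=170)
full_text = overlap_summary(74, 78, shared=56, total_items=100)
print(metadata)   # unique 18 and 67, combined 145 (85.3%)
print(full_text)  # unique 18 and 22, combined 96 (96.0%)
```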
The previous section describes the degree to which items housed in DigiNole Commons are made available – that is, the degree to which links appear in search results – in Google and Google Scholar. This section presents findings related to the accessibility of those items in terms of the number of clicks required to navigate to them from the search results. The examination of the search results for the 170 DigiNole Commons items revealed five different scenarios, illustrated in Figure 2.
Of these, only the first two (the first leading directly from a search to an item in the institutional repository, and the second to a copy of an item not housed in the institutional repository but still freely available at an author's Website or some other location) can be considered to be true open access scenarios. The three other scenarios, because they do not result in access to free copies of materials, do not mesh with the fundamental objectives of open access initiatives. Scenario three does, ultimately, lead a user to the item sought, but only externally to the institutional repository itself and at a cost via a subscription-based vendor; for DigiNole Commons and the Florida State University community, this means that only on-campus users (and off-campus users officially logged in) can still freely access materials, but others cannot; thus, access is limited. The final two scenarios (one retrieving metadata while failing to provide access to a full text from DigiNole Commons, and the other neither turning up DigiNole metadata nor leading a searcher to the full text of an item) cannot be considered to support open access.
Figure 3 shows the distribution of DigiNole Commons' 170 items into these five scenarios. Given the number of available full text items shown in Table 1, it would be reasonable to expect scenario one to include 96 items. However, curiously, two additional DigiNole Commons full texts, neither of which shows up in an examination of the institutional repository itself, were retrieved by a Google Scholar search (for instance, DigiNole Commons provides an external link to one of these, 'Evaluation of dynamically downscaled reanalysis precipitation data for hydrological application in the southeast United States'; however, an internal DigiNole version turns up among Google Scholar search results). In addition, four out of the 100 items with full texts available in DigiNole Commons are not retrieved in either Google or Google Scholar searches.
Six items, without full texts available in DigiNole Commons but with links to freely available external copies, fall into scenario two. Ten items, with links to publisher sites or other subscription-based vendors, fall into scenario three. Thirty-nine fall into scenario four, because their DigiNole Commons metadata pages provide no access at all to full text copies. Finally, scenario five includes 17 items, none of which can be retrieved from DigiNole Commons through Google or Google Scholar searches.
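The five-way classification and the resulting distribution can be sketched in a few lines of Python. This is only an illustration: the `SearchOutcome` record, its field names and the precedence order in `classify` are assumptions made for the example, not the coding scheme actually used in the study. The counts for scenarios two to five are those reported above; the count for scenario one is inferred by subtracting them from the 170-item total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchOutcome:
    """Hypothetical record of what one known-item title search led to."""
    repo_full_text: bool       # full text served by the repository itself
    free_external_copy: bool   # free copy elsewhere, e.g. an author's Website
    paid_external_copy: bool   # copy behind a subscription-based vendor
    repo_metadata: bool        # repository metadata page was retrieved


def classify(outcome: SearchOutcome) -> int:
    """Return the scenario number (1-5); the precedence order is an assumption."""
    if outcome.repo_full_text:
        return 1
    if outcome.free_external_copy:
        return 2
    if outcome.paid_external_copy:
        return 3
    if outcome.repo_metadata:
        return 4
    return 5


def shares(counts: dict[int, int]) -> dict[int, float]:
    """Convert scenario counts into percentages of all items (one decimal)."""
    total = sum(counts.values())
    return {scenario: round(100 * n / total, 1) for scenario, n in counts.items()}


# Scenarios two to five as reported in the text; scenario one is inferred
# as the remainder of the 170 items.
REPORTED = {2: 6, 3: 10, 4: 39, 5: 17}
COUNTS = {1: 170 - sum(REPORTED.values()), **REPORTED}

if __name__ == "__main__":
    for scenario, pct in sorted(shares(COUNTS).items()):
        print(f"scenario {scenario}: {COUNTS[scenario]} items ({pct}%)")
```

Ordering the checks from the most open outcome to the least open one means that an item reachable in several ways is credited with its best available route, which is one plausible reading of the scenarios rather than a documented rule.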
As noted earlier, this study measures accessibility as the number of mouse clicks required to reach either DigiNole metadata or full text copies from a set of Google or Google Scholar search results. Results differed somewhat between the two, with Google Scholar requiring more clicks, largely because of the way Google Scholar groups similar items into a single link. From Google, an average of 1.12 clicks was required to access 78 DigiNole metadata records and an average of 1.33 clicks was required to access 75 DigiNole full text items. All available metadata could be obtained with either one or two clicks from Google; the same is true for full texts, with the exception of one item requiring three clicks. In Google Scholar, access to 127 metadata records required an average of 1.69 clicks; access to 80 full texts required an average of 1.78 clicks. Again, with one exception, access to either metadata or full text required no more than three clicks; the one exception required four clicks to access the full text. The Mann-Whitney U test, a non-parametric test for comparing two independent samples that are not normally distributed, was applied to compare accessibility across the two search engines; Google and Google Scholar are significantly different in both metadata (p < 0.001) and full-text accessibility (p < 0.006) at the 0.05 alpha level.
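For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Mann-Whitney U test using only the Python standard library. It uses the normal approximation with a correction for ties and no continuity correction; whether the study's statistical software made the same choices is not known, so the function should be read as an illustration of the technique rather than a reproduction of the analysis. The click counts in the usage portion are invented for the example and are not the study's data.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    counts = Counter(list(x) + list(y))

    # Assign midranks: tied values share the average of the ranks they span.
    rank_of = {}
    next_rank = 1
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = next_rank + (ties - 1) / 2
        next_rank += ties

    rank_sum_x = sum(rank_of[v] for v in x)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1

    # Variance of U under the null hypothesis, reduced to account for ties.
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every observation identical: no evidence of a difference
        return min(u1, u2), 1.0

    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return min(u1, u2), p_value


if __name__ == "__main__":
    # Invented click counts, for illustration only.
    google_clicks = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
    scholar_clicks = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3]
    u, p = mann_whitney_u(google_clicks, scholar_clicks)
    print(f"U = {u}, p = {p:.4f}")
```

Because click counts take only a handful of integer values, ties are very common, which is why the tie correction matters here more than it would for continuous measurements.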
The analysis above provides an overview of the availability and accessibility of metadata and full texts housed in Florida State University's DigiNole Commons. However, several other issues arose during the analysis of both the materials themselves and the Google and Google Scholar search results that cannot be summarized statistically. These anomalies themselves have implications for the ability of the institutional repository to provide access to the materials in its collection and are briefly discussed here.
Title or authorship issues
Institutional repositories are often used to house articles in versions other than the final published version, such as pre-prints; a small handful of items housed in DigiNole Commons fall into this category, most of which display minor differences from their published counterparts and a couple of which are radically different. For example, metadata for an article by Hart, Taylor and Schatschneider (2013) is present in the institutional repository under the title 'There is a world outside of experimental designs: using twins to explore causation'. This article (unavailable in full text in the institutional repository because of an embargo) was published online in 2012 and in print (in the journal Assessment for Effective Intervention) under the similar, but not identical, title 'There is a world outside of experimental designs: using twins to investigate causation'. Such a discrepancy, however, does not necessarily impede an article's accessibility and, in this case, it does not: a title search in Google Scholar turns up both the DigiNole and the published versions.
Other changes, however, are neither so minor nor always so innocuous in terms of their impact on accessibility. The most extreme example of this can be seen in an article by Coutts (2009) published in the Journal of Urban Planning and Development under the title 'Multiple case studies of the influence of land-use type on the distribution of uses along urban river greenways', but appearing in the institutional repository under the title 'Locational influence of land use type on the distribution of uses along urban river greenways'. In another case, an article not only shows different titles between the DigiNole pre-print and the published versions, but the pre-print metadata gives the name of only one of seven authors listed in the publication. Clearly, such extreme variations can have important implications for open access, since they may serve to make the open access copies of the work inaccessible from Google or Google Scholar, as they did in the first of these two cases.
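The difference between a minor and a major title change can be made concrete with a simple string-similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; this is not how Google or Google Scholar match titles (their ranking methods are not public), and the helper name `title_similarity` is introduced here purely for illustration. The two title pairs are the ones discussed above.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score after lowercasing and collapsing spaces."""
    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


# Minor change: a single word differs between pre-print and published titles.
MINOR = (
    "There is a world outside of experimental designs: using twins to explore causation",
    "There is a world outside of experimental designs: using twins to investigate causation",
)

# Major change: the opening of the title was rewritten.
MAJOR = (
    "Locational influence of land use type on the distribution of uses along urban river greenways",
    "Multiple case studies of the influence of land-use type on the distribution of uses along urban river greenways",
)

minor_score = title_similarity(*MINOR)
major_score = title_similarity(*MAJOR)

if __name__ == "__main__":
    print(f"minor change: {minor_score:.2f}")
    print(f"major change: {major_score:.2f}")
```

A repository could, under these assumptions, flag deposited items whose title falls below some similarity threshold against the published record, so that staff can add the published title to the metadata by hand.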
The algorithms governing how Google Scholar derives author names may also have an impact, although the three instances in which this was seen in the current study did not influence either availability or accessibility, given that the study used known-item title searches. In these three instances, Google Scholar mistakes the letters MD after an author's name as his initials, rather than as an indication that he is a medical doctor: thus, in Google Scholar, Jose E. Rodriguez, MD appears as 'MD Rodriguez', with 'E Jose' appearing as a separate author.
Item visibility and other anomalies
On initial examination, one case appeared to be a true anomaly: an article by Falk, Lepore and Noe (2013) housed in the institutional repository and titled 'The cerebral cortex of Albert Einstein: a description and preliminary analysis of unpublished photographs' fell into scenario five (that is, it could not be retrieved) in a Google search, but was easily retrieved by Google Scholar. Initially, the researchers found the inability of Google to uncover this item curious, since the article, upon publication, received considerable attention in both scholarly circles and the mass media; further, although the article was published in a non-open access journal, full open access to the published version and the right to archive that version in the institutional repository immediately upon publication were secured through the payment of an article processing charge (with assistance from one of the current article's authors). Upon reflection, however, it became clear that, while scenario five typically is a sign that the goals of open access are not being met (if items cannot be easily retrieved, access cannot be considered to be truly open), in this case the most likely explanation is the opposite: the lack of access via Google to the DigiNole Commons copy does not constitute a failure of open access but, rather, reflects the success of open access in a much broader sense. Because of the level of attention accorded the article, and its open accessibility on the journal's Website, the article's presence in DigiNole Commons is not as important as it may be for lesser-known works; Google's algorithms appear to treat the DigiNole Commons copy as simply one more among many available copies, resulting in a lower relevance ranking.
One other item displays a similar pattern: 'Development of a new academic digital library: a study of usage data of a core medical electronic journal collection' (Shearer, Klatt and Nagy, 2009) turns up in a Google Scholar search, but not in a Google search. In this case, the likely reason is similar, though not identical: an examination of Google search results shows that the article is easily available in numerous other open access repositories, which, in turn, appears to have negatively impacted the relevance ranking for the DigiNole Commons copy while enhancing the overall availability of the work through other outlets.
One final anomaly must be noted, although strictly speaking, it falls outside of the parameters of this study: the search engine built into DigiNole Commons itself fails to retrieve certain items that are clearly present in the institutional repository. In each case, this appears to be due to the fact that the articles' titles include non-alphanumeric characters such as parentheses, question marks and asterisks. Google and Google Scholar both successfully retrieve these articles, even though the institutional repository's search engine cannot do so. While this failure has no impact on the findings of this study, since it is internal to the institutional repository rather than related to the availability and accessibility of the items via Google and Google Scholar searches, it does constitute an impediment to the success of DigiNole Commons' goal of providing open access to the materials in its collection.
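One common way of making a title search tolerant of such characters is to normalise both the query and the stored title before comparing them. The sketch below illustrates the idea; it is not a description of how the DigiNole Commons search engine works or of why it fails, neither of which was examined in this study. The function names and the sample title are invented for the example.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title, replace punctuation with spaces, collapse whitespace."""
    lowered = title.casefold()
    # Anything that is neither a word character nor whitespace becomes a space.
    without_punctuation = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(without_punctuation.split())


def title_matches(query: str, stored_title: str) -> bool:
    """True when query and stored title are equal after normalisation."""
    return normalize_title(query) == normalize_title(stored_title)


if __name__ == "__main__":
    # Invented title containing parentheses, a question mark and an asterisk.
    stored = "Who Benefits? (A Case Study)*"
    print(normalize_title(stored))
    print(title_matches("who benefits a case study", stored))
```

Applying the same normalisation at indexing time and at query time keeps the two sides consistent, so that a searcher does not need to know which punctuation marks the original title contained.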
This case study confirms that institutional repositories, at least overall, can contribute to making papers available and accessible on the open Web. Nevertheless, it also uncovers some potential impediments to the success of institutional repositories. As pointed out in the previous section, some situations either do not satisfy the goals of institutional repositories at all or satisfy those goals only in part. In some cases, for instance, access to full texts requires the payment of fees to vendors, whether through subscription agreements between vendors and libraries hosting institutional repositories, or through access fees paid by individual searchers. For many – in this case, for users conducting their searches from computers on the Florida State University campus or users logged in as authorized off-campus users – such costs may be hidden because they are covered by the institution with which they are affiliated; however, the very fact that such fees exist violates the spirit of open access for all but a defined set of authorized users. Google or Google Scholar searches that retrieve metadata from DigiNole Commons but fail to retrieve full texts may occur for several reasons, including:
- Contractual embargos, in which authors must withhold their work from open access for a contractually-determined period of time;
- Undetected file upload errors;
- Institutional policy or procedural issues; and
- Erroneous or outdated links.
Instances in which neither metadata nor full texts were retrieved by Google or Google Scholar searches despite their presence in the institutional repository can most likely be attributed to issues related to the algorithms used by search engine crawlers either in searching or in determining relevance rankings. Moreover, although they were not a serious concern in this study, and are, in any event, largely beyond the ability of institutional repository administrators to mitigate, inconsistencies in hits across multiple searches are a potential impediment to full open access implementation, whether because of the implementation of search and relevance algorithms in Google and Google Scholar or because of the ways in which links are updated over time. In addition, as noted above, title changes between pre-prints and final published versions can cause retrieval problems if the published versions of titles are used in known-item searches (the same may be true when authorship changes between versions of a work); if the title changes are minor, open retrieval may still be nearly seamless for users searching via Google or Google Scholar, but more significant title changes may make retrieval nearly impossible when the open access pre-print title is not used (Björk, Roos and Lauri, 2009).
Some potential impediments are clearly beyond the means of libraries managing open access institutional repositories to address; little, for instance, can be done about the ways in which search engine algorithms or relevance rankings cause existing items to go missing from Google or Google Scholar search results. However, there may be ways to mitigate some of the other potential impediments, edging institutional repositories closer to full implementation of open access goals. Since some items become inaccessible because of differences between pre-print and published versions, metadata records, following Dublin Core or other metadata standards, can make linkages and connections between these multiple versions explicit, and, thus, searchable. Also, although it is not possible to alter third-party search algorithms, libraries can increase the availability of institutional repository papers by representing and organizing them using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), so that service providers, such as OAIster, can help make those papers accessible (Norris et al., 2008).
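As a rough illustration of both suggestions, the sketch below builds a small Dublin Core record in which the pre-print title, the published title and a pointer to the published version sit side by side, and then forms an OAI-PMH `ListRecords` request for harvesting such records. The element choices (`dcterms:alternative` for the published title, `dcterms:isVersionOf` for the link) are one possible mapping rather than a prescribed one, and the wrapper element `record`, the function names and both URLs are placeholders invented for the example. The titles are those of the Coutts (2009) article discussed earlier.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)


def build_record(preprint_title: str, published_title: str, published_url: str) -> ET.Element:
    """Build a record exposing both titles and a link to the published version."""
    record = ET.Element("record")  # placeholder wrapper element
    ET.SubElement(record, f"{{{DC}}}title").text = preprint_title
    ET.SubElement(record, f"{{{DCTERMS}}}alternative").text = published_title
    ET.SubElement(record, f"{{{DCTERMS}}}isVersionOf").text = published_url
    return record


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Form an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    record = build_record(
        "Locational influence of land use type on the distribution of uses "
        "along urban river greenways",
        "Multiple case studies of the influence of land-use type on the "
        "distribution of uses along urban river greenways",
        "https://example.org/published-version",  # placeholder, not a real link
    )
    print(ET.tostring(record, encoding="unicode"))
    print(list_records_url("https://repository.example.edu/oai"))  # placeholder endpoint
```

Recording the published title as searchable metadata means that a known-item search on either title has a chance of reaching the open access copy, whatever a given search engine does with the version link itself.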
When metadata is retrieved but full texts are missing, copies of materials, when available, can be obtained and uploaded, and regular checking of links, particularly for materials stored in author Websites or other external open access sites, can help to ensure continued access. It may also be useful to take advantage of redundancy and maintain full text copies of materials within the institutional repository even if they are also freely available elsewhere. When full texts can be retrieved but come with a subscription or other cost, librarians may be able to find ways of including alternative versions of full texts within their institutional repositories or may be able to negotiate with vendors, publishers, faculty members or academic societies so as to make them accessible. Finally, ongoing initiatives to educate faculty about their rights related to their own work may help to increase the frequency of open access clauses in copyright agreements with non-open access publishers.
This study, although limited to a single case at a specific point in time, suggests that relying on either Google or Google Scholar individually cannot ensure full access to scholarly works housed in open access institutional repositories. Counting only those 104 instances that can be fully considered to be open access (i.e., scenarios one and two), there is an overlap in results between the two search engines of only 57.5%, with Google providing links to 20 items not found via Google Scholar, and Google Scholar providing links to 25 items that are inaccessible via Google. Thus, it is necessary to use the two search tools together to find materials deposited in institutional repositories. In terms of general availability, Google Scholar appears to have the edge, especially for metadata records. On the other hand, Google does a better job of supporting access, requiring fewer mouse clicks overall to get to DigiNole materials than Google Scholar because of the way Google Scholar clusters multiple copies of an item into a single initial link.
While availability is a dimension of accessibility, making a distinction between them enabled us to identify possible impediments to the success of institutional repositories. In this case study we examined the degree of availability and physical accessibility of a collection of limited size housed in a particular open access institutional repository, via known-item title searches in Google and Google Scholar. Overall, the findings suggest that items in the collection are, for the most part, both available and accessible, although slightly more than 30% of items, falling into scenarios four and five, could not be retrieved at all; further, although an additional 6% of items could be retrieved, that retrieval came with the cost of subscription or other charges to either the institution or individual searchers, limiting the degree to which their accessibility could truly be considered to be open. Considering those items, impediments to open access generally fall into the following two broad categories:
- Impediments related to contractual agreements between authors, publishers and vendors, including costs related to institutional subscriptions, item embargo, etc.
- Impediments related to the policies, practices and technologies governing the institutional repository itself, including outdated links, file upload errors and internal search engine shortcomings.
The Florida State University Library system is continuing development of services related to its institutional repository, including efforts to enhance and clean up metadata records of items housed there. The authors of the current study plan further study in collaboration with the library and will initiate additional outreach efforts, outlining the benefits of archiving full-text versions of articles in the repository. The campus Office of Scholarly Communication is also working with several new offices on campus, including the Office of Proposal Development and the Office of Sponsored Research, to make more effective connections between open access scholarly objects online and their metadata records as presented in DigiNole Commons. This study is an essential part of the monitoring, testing, assessing and adapting of the repository platform as one facet of the scholarly communication initiative at Florida State University.
As noted earlier, this current study is part of an ongoing investigation into open access issues and institutional repository effectiveness; future work will look at several issues beyond simple questions of availability and physical accessibility, investigating other types of accessibility (intellectual and social) in relation to open access institutional repositories and will include projects designed to increase faculty awareness and participation.
About the authors
Jongwook Lee is a Doctoral Candidate in the School of Information at Florida State University. He received his bachelor's degree in library and information science from Kyungpook National University, South Korea, and his master's degree in information science from Indiana University Bloomington. His research interests cover scholarly communication, mentoring, bibliometrics, and information behaviour. He has been working on his dissertation research about information types exchanged in mentoring between faculty advisors and their doctoral students. He can be contacted at: firstname.lastname@example.org
Gary Burnett is a Professor at the College of Communication and Information at Florida State University. He earned a BA in English from the University of California, San Diego, an MLS from Rutgers University, and a Ph.D. in English from Princeton University. Before coming to FSU, he worked as a bookseller, a librarian, and a small press publisher. He has also been a research associate at the ERIC Clearinghouse on Urban Education and an adjunct faculty member at Princeton University and at the School of Communication, Information and Library Studies at Rutgers University. His research focuses on information theory and on the intersection between information exchange, social norms, and social interaction in online settings, with a particular focus on textuality and interpretive practices. His book, Information Worlds: Behavior, Technology, and Social Context in the Age of the Internet (Routledge), coauthored with Paul Jaeger, was published in 2010. His work has appeared in numerous journals, including The Journal of the American Society of Information Science and Technology, Library Quarterly, and Library and Information Science Research. He can be contacted at: email@example.com
Micah Vandegrift is the Scholarly Communication Librarian at Florida State University. His main role is outreach and research support for new forms of digital scholarship, including open access archiving and publishing. He blogs often about scholarly communication topics, and as an early career librarian is just beginning to publish in LIS journals. He is interested in altmetrics, issues around the future of academic publishing and digital humanities. Library Journal, a leading trade publication, named Mr. Vandegrift a “2013 Mover and Shaker.” He can be contacted at firstname.lastname@example.org
Jung Hoon Baeg is a Ph.D. candidate at the School of Information, College of Communication & Information, Florida State University. His interests are in consumer health information and health informatics. In particular, his focus is on consumer health information services in public libraries, health and ehealth literacy, and health information resources. Currently, he is completing his dissertation, which investigates the intentions of individuals to use public libraries as a primary health information seeking resource. He can be contacted at email@example.com
Richard Morris is a Professor in the School of Communication Science and Disorders at Florida State University. His main research interests are acoustic and physiological phonetics. He has presented and published papers on age-related changes in speech and voice, speech acoustics, and the acoustics of the singing voice. He can be contacted at firstname.lastname@example.org