METRICS MASHUP
Can Gen AI Solve the Problems With Evaluative Bibliometrics?
by Elaine M. Lasda
How can you tell if research is good research? I’ve always read and agreed with the precept that the only way to know if research is “good” is to read the dang paper. The entire dang paper. Nonetheless, bibliometric indicators are fascinating to me. In my experience, they effectively shine a light on how researchers perform certain aspects of information gathering. Unfortunately, as I’ve discussed many times, bibliometric indicators are often inappropriately used to evaluate the quality of essentially any research-related entity: paper, researcher, lab, organization, research field, and so on.
When Big Data analytics were all the rage about a decade ago, we saw shifts in how research was measured and evaluated: network analysis, natural language processing, sentiment analysis, etc. Leiden University’s Centre for Science and Technology Studies developed VOSviewer (vosviewer.com), for example, which is now on version 1.6.20. Linked data tools, such as Dimensions AI (dimensions.ai), came on the scene, and it became easier to tease out relationships and patterns among research-related entities. Generative AI (gen AI) was in its infancy; it existed but had not yet stolen the show.
Approximately a decade after the Big Data hype hit its peak, gen AI is ascending. Gen AI tools offer next-level data processing and analysis that can be applied in many situations to help solve many human problems. Can gen AI help with the thornier issues in the use and misuse of bibliometric indicators?
Gen AI as we know it came to light around 2014 (Keith D. Foote wrote a brief, interesting history of AI development in March 2024; dataversity.net/a-brief-history-of-generative-ai). It wasn’t until OpenAI and ChatGPT came on the scene that the public became widely aware of gen AI tools and gained easy access to them. If you are familiar with the Gartner Hype Cycle (en.wikipedia.org/wiki/Gartner_hype_cycle), you may think that we are in the “trough of disillusionment” with regard to enthusiasm for gen AI, largely because gen AI tools will sometimes fabricate responses, a phenomenon known as hallucinating. Hallucinations sound like they may indeed be problematic to the point of disillusionment.
PATTERN MATCHING, NOT FACT DETERMINATION
The key thing to keep in mind, though, is that AI relies on word and phrase patterns, not facts. In gen AI, responses are built according to the likelihood of each word following the words that came before it. This is problematic when the most likely next words form phrases that are not merely inaccurate but entirely fabricated. Thus, text-based gen AI is not really generating new ideas or creating anything other than sentences assembled from the probability and frequency patterns of co-occurring words on the topic being queried. Gen AI is a probabilistic tool. For a quick rundown on the various ways generative AI tools can work, check out this blog post by Jacob Zweig, co-founder and principal consultant at Strong, a company that works in data, machine learning, and AI: strong.io/blog/applications-of-generative-ai-a-deep-dive-into-models-and-techniques.
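To make the probabilistic point concrete, here is a minimal sketch in Python: a toy bigram model that picks each next word in proportion to how often it followed the previous word in a tiny, made-up corpus. Real LLMs use neural networks trained on vast corpora, but the sketch illustrates why fluent-sounding output and factual accuracy are two different things.

```python
# Toy illustration of next-word prediction: a bigram model over an invented corpus.
# The corpus, counts, and output are made up for illustration only.
import random
from collections import Counter, defaultdict

corpus = (
    "citation counts measure attention . citation counts do not measure quality . "
    "impact factors measure journals . impact factors do not measure papers ."
).split()

# Count how often each word follows each preceding word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev: str) -> str:
    """Sample the next word in proportion to how often it followed `prev`."""
    candidates = bigrams[prev]
    return random.choices(list(candidates), weights=list(candidates.values()))[0]

# Generate a short "sentence" one probable word at a time.
word, output = "citation", ["citation"]
for _ in range(6):
    word = next_word(word)
    output.append(word)
print(" ".join(output))
```

The output is plausible-looking word sequences, not verified statements, which is exactly the hallucination risk in miniature.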
In a December 2023 Forbes blog post, Peter Bendor-Samuel poses the question of when a probabilistic solution is a good fit for a problem (“Key Issues Affecting the Effectiveness of Generative AI”; forbes.com/sites/peterbendorsamuel/2023/12/05/key-issues-affecting-the-effectiveness-of-generative-ai). AI is extremely adept at functions such as classification and summarization, but not so much at functions that require a decision. For that, we still need actual humans to verify and quality-check the output. Gen AI, through large language models (LLMs), may be able to surface patterns that we otherwise would not have the means to derive and analyze. This is a good thing. The sheer quantity of material a gen AI tool can distill in a short period of time is astounding.
Another good feature, in my opinion, of gen AI in a research context is that many of the LLMs that are tailored to research and academic information rely on OA and open source content. The OA movement should be getting a boost from this new demand for open content.
The traction gained by Big Data and then AI in the 2010s rendered the fundamentals of traditional bibliometric indicators rather quaint. The early metrics conceptualized by Eugene Garfield and his team are easy to understand. Citation counting is, well, counting. The journal impact factor (JIF) is a more-or-less simple ratio. I’ve talked a lot in this column about how Garfield originally saw bibliometrics as tools to trace the evolution of a line of research inquiry. We in libraryland sometimes talk about research as a discussion. Citation analysis can help us follow that discussion.
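To see just how simple that ratio is, here is a minimal sketch with invented numbers: a JIF-style calculation divides the citations a journal received in the target year to its items from the previous two years by the number of citable items it published in those two years. The journal and figures below are hypothetical.

```python
# JIF-style ratio with invented numbers: citations received in the target year
# to items published in the two preceding years, divided by the count of
# citable items published in those two years.
def impact_factor(citations_to_prior_two_years: int, citable_items_prior_two_years: int) -> float:
    return citations_to_prior_two_years / citable_items_prior_two_years

# Hypothetical journal: 1,200 citations in 2024 to its 2022-2023 papers,
# of which there were 400 citable items.
print(impact_factor(1200, 400))  # 3.0
```

That is the whole calculation: one journal-level average, which is part of why it travels so poorly as a judgment about any individual paper or researcher.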
But those quaint metrics are fraught with misapplication, oversimplification, lack of context, and other problems. Mario Biagioli describes the JIF being used as the “currency” that will, hopefully, predict the future success of an aspiring researcher (“Fraud by Numbers: Metrics and the New Academic Misconduct,” UCLA, 2020; escholarship.org/uc/item/8xv4c2d3). Biagioli frames citations and research impact as an economic system. While he mainly focuses on concerns related to gaming the metrics, he points to the root cause of that gaming: The research evaluation economy creates perverse incentives by misusing impact metrics as quality proxies for predicting the value and success of a researcher. Gaming the metrics is the inevitable response to treating a quantitative metric as a quality indicator.
Another interesting point Biagioli makes pertains to the increased focus on metrics, rankings, and impact. The merits of the authors’ methods, analyses, and findings are divorced from the metadata and metrics related to impact. The content of articles is almost irrelevant, as evaluators only care about the impact of the paper, not the actual findings.
TECH SUPPORT FOR READING RESEARCH PAPERS
On the one hand, evaluators’ increasing reliance on impact metrics (and, more recently, on AI to speed up literature reviews) has me mulling why people don’t seem to want to actually read the research anymore. On the other hand, with more than 3 million research articles being published every year (ncses.nsf.gov/pubs/nsb202333/executive-summary), it is becoming increasingly difficult to stay on top of the literature in many fields. Why not use some technological support?
What questions do we want to answer? What questions do bibliometric indicators answer? What new questions could AI help us answer? Is evaluation the main attribute of bibliometric indicators driving their use?
Deep Kumar Kirtania, from Bankura Sammilani College, seeking, perhaps facetiously, an answer to how gen AI can be used to support bibliometric analysis, simply posed three prompts to ChatGPT about the role of AI in bibliometric research (“ChatGPT as a Tool for Bibliometrics Analysis: Interview With ChatGPT,” March 17, 2023: ssrn.com/abstract=4391794 or dx.doi.org/10.2139/ssrn.4391794). The suggestions are not all that surprising—data retrieval, preprocessing, analysis, and visualization; conducting the literature review; and recommending other sources. Once again, the pattern dominates, not the creative exploration of solutions.
Let’s step back and think about this again: What problems are bibliometric indicators being used to solve? How appropriate is the use of a given indicator (e.g., JIF) in solving said problem? Could gen AI offer a more appropriate solution?
The biggest concern about irresponsible use of bibliometrics arises when decisions about an individual researcher’s career are predicated on metrics that do not reliably measure the quality and/or impact of that researcher’s oeuvre. In fact, because it takes time for research ideas and progress to gain traction, there is no metric that can instantaneously tell us whether up-and-coming scientists are worth their salt. Biagioli is spot on when he looks at the situation in terms of economic concepts. And because citation counts within a journal are highly skewed, the JIF is not, by nature, predictive of any individual paper’s impact.
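A minimal sketch with invented citation counts shows why that skew matters: a couple of highly cited papers can pull a journal’s JIF-like mean well above what a typical paper in the same journal actually receives, so the journal-level number tells you little about any one article.

```python
# Invented citation counts for 20 papers in a hypothetical journal: most papers
# are cited a handful of times, while two "hits" collect most of the citations.
from statistics import mean, median

citations = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 8, 95, 148]

print(f"JIF-like mean: {mean(citations):.1f}")  # 15.0, driven by the two hits
print(f"Median paper:  {median(citations)}")    # 3.0, what a typical paper receives
```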
In April 2024, DORA (the Declaration on Research Assessment), a group that advocates for the responsible use of the JIF and other metrics, released an interesting report giving an overview of metrics beyond the journal level, their appropriate use, and the shortcomings and strengths of each indicator (“Guidance on the Responsible Use of Quantitative Indicators in Research Assessment”; dx.doi.org/10.5281/zenodo.10979644). Its recommendations for responsible use are not surprising; the report speaks to values such as clarity, fairness, transparency, contextualization, and specificity. Gen AI may be able to lend fairness, contextualization, and specificity, but clarity and transparency are not hallmarks of AI systems. There is some documentation to indicate that even AI experts don’t really know what makes the algorithms work, as Noam Hassenfeld reported in Vox on July 15, 2023: vox.com/unexplainable/2023/7/15/23793840/chat-gpt-ai-science-mystery-unexplainable-podcast.
MEASURING RESEARCH QUALITY
Measuring quality, or whether research is “good,” in many respects remains elusive, no matter what tools or indicators the statisticians apply. With gen AI, sentiment analysis and predictors of quality output may get more accurate, because patterns related to the characteristics of quality research can be teased out. This column has neither the space nor the scope to really dig into gen AI for evaluating research through peer review, the complement to bibliometric indicators. I will say that at least one gen AI tool has been proposed to review the peer reviewers, although it is described in a preprint and so is not itself peer-reviewed (Yanheng Xu et al., “Spider Matrix: Towards Research Paper Evaluation and Innovation for Everyone”; chemrxiv.org/engage/chemrxiv/article-details/66084f099138d23161c43fed).
There are real opportunities with gen AI for recognizing patterns in research output, and the implications could be far-reaching. Identifying the intent of citations, as Scite aspires to do, flagging a publication when it cites retracted research, classifying the sentiments of citing references, and seeking to understand citation patterns are all worthy and helpful pursuits that marry gen AI with bibliometrics. My concern is for the research that is so innovative that it breaks the pattern. Will gen AI be able to spot the game changers?
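As a concrete, if deliberately crude, illustration of what citation-intent classification involves, here is a toy sketch. It uses keyword matching as a stand-in for the language-model classifiers that tools in this space actually train; the categories, keyword lists, and example sentences are invented and do not represent Scite’s or any vendor’s method.

```python
# Toy stand-in for citation-intent classification. Real tools apply trained
# language models to the full citing sentence and its surrounding context;
# the categories and keywords here are invented for illustration only.
SUPPORTING = {"confirms", "consistent", "replicates", "supports", "extends"}
CONTRASTING = {"contradicts", "disputes", "fails", "challenges", "contrary"}

def classify_citation(citing_sentence: str) -> str:
    """Label a citing sentence as supporting, contrasting, or merely mentioning."""
    words = set(citing_sentence.lower().split())
    if words & SUPPORTING:
        return "supporting"
    if words & CONTRASTING:
        return "contrasting"
    return "mentioning"

print(classify_citation("This replicates the findings of Smith et al."))  # supporting
print(classify_citation("This contradicts earlier reports by Jones."))    # contrasting
print(classify_citation("Several methods exist (see Lee, 2021)."))        # mentioning
```

The genuinely novel paper, of course, is the one whose citing language does not fit any of the patterns a classifier was trained on, which is exactly the worry about spotting the game changers.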
Recently, I saw a meme about content written using gen AI: “Why should I bother reading something nobody bothered to write?” Likewise, why should I bother reading and using research that only machines have evaluated? Right now, the consensus seems to be that gen AI can never replace human discernment and judgment. Humans, however, are busy. They have competing priorities. Also, much to the chagrin of many humans I know, we can’t be experts in everything. It seems to me that gen AI-driven bibliometrics will do an adequate evaluative job that, in most cases, is an improvement over the JIF and other simple metrics. The effect, though, will be the same: The gen AI analyses will be relied upon in much the same way the JIF is relied upon today.