2024-10-07
Compiled by Machine Heart
Author: Omar Khattab
Editors: Egg Sauce, Zenan
Writing a paper? That's only one small step.
During graduate school, many people struggle with how to structure their research. How should we conduct research that makes a difference in the already crowded field of artificial intelligence?
Too many people treat long-term projects, proper code releases, and well-designed benchmarks as poorly incentivized: at best something you do quickly and guiltily before returning to "real" research.
Recently, Omar Khattab, a PhD candidate in the NLP group at Stanford University, published a blog post sharing his thoughts on doing impactful AI research.
Let's see what he has to say:
Research impact comes in many forms; here I will focus only on research impact achieved through open-source work (e.g., models, systems, frameworks, or benchmarks). Because part of my goal is to refine my ideas, record concrete advice, and gather feedback, I will keep the statements fairly terse. If you have other ideas, please share them in the comments.
First, here are the guiding principles:
1. Focus on projects, not papers.
2. Pick a timely problem with large fan-out and plenty of headroom.
3. Think two steps ahead and iterate quickly.
4. Release your work and promote your ideas.
5. Gather excitement: tips for growing your open-source research.
6. Keep investing in your project with new papers.
The fifth point, tips for growing open-source research, deserves a longer article of its own; I may cover it in a future post.
Focus on projects, not papers
This is the crucial piece of thinking on which everything else rests.
Beginning students place great emphasis on publishing their first few papers. That makes sense: it is how you learn to conduct research, explore initial directions, and demonstrate early progress. But it is a stage you must eventually leave: in the long run, your achievements and growth will depend less on the sheer number of papers and more on your impact and the overall research narrative you convey.
Unfortunately, too many PhD students dismiss most potentially impactful activities as "poorly incentivized." This confused me until I realized what they meant: these activities might slow down their next paper. But how quickly you can publish your next paper matters far less than they think.
I suggest that you stop thinking of your work as a series of isolated papers and instead ask yourself: what larger vision do you want to lead, and which subfields or paradigms sit within it? What difference do you want your work to make? You will still publish individual papers to explore directions and establish milestones, but the larger vision should be something you deliberately iterate on. It needs to be much bigger than any single paper can carry, and it must certainly be a problem that is far from fully solved.
One way to do this is to organize some of your papers around coherent artifacts (such as models, systems, frameworks, or benchmarks) that you maintain in the open. This strategy costs more than "run some experiments and push a throwaway repository," but it forces you to find a problem with real impact and helps ensure that your new research is actually coherent and useful: you will not spend effort on a small feature or trick that does nothing for the artifacts you have been building and maintaining.
Pick timely problems with large fan-out and plenty of headroom
Not every paper you write is worth investing in indefinitely; many are one-off exploratory papers. To find directions that can grow into larger projects, use the following criteria.
First, the problem must be timely. You can define this in many ways, but an effective strategy in AI is to find a problem space that will be "hot" in two to three years but has not yet become mainstream.
Second, the problem must have large fan-out, that is, potential impact on many downstream problems. Put simply, progress on the problem should benefit or interest enough people. Researchers and practitioners care about whatever helps them reach their goals, so your problem should help others build things or achieve their research or production goals. You can apply this filter to theoretical foundations, systems infrastructure, new benchmarks, new models, and many other things.
Third, the problem must have plenty of headroom left. If you tell people you can make their system 1.5x faster or 5% more efficient, that is probably not interesting. In my view, you need problems where, after a few years of hard work, you have a non-zero hope of making something dramatically better, say 20x faster or 30% more efficient. Of course, you do not have to get all the way there to succeed, and you should not wait until you do before publishing your first paper or shipping your first release.
Rather than stay abstract, let me use ColBERT as an example. At the end of 2019, applying BERT to retrieval was a very popular research direction, but the resulting methods were extremely expensive. A natural question arises: can we make this approach dramatically more efficient? What makes this a good problem?
First, it was timely. We could correctly predict that by 2021 (about 1.5 years later), many researchers would be seeking efficient BERT-based retrieval architectures. Second, it had plenty of headroom; new ML paradigms tend to, because most early work ignores efficiency. Indeed, where the original approach might take 30 seconds to answer a query, it can now complete higher-quality retrieval in 30 milliseconds, roughly 1,000x faster. Third, it had large fan-out: scalable retrieval is a good "foundation" problem, since everyone needs to build on top of retrievers but few people want to build the retrievers themselves.
Think two steps ahead and iterate quickly
Now that you have a good problem, do not rush to pick the low-hanging fruit in front of you as your approach! Sooner or later, plenty of people will consider the "obvious" approach.
Instead, think at least two steps ahead. Identify the path most people are likely to take once this timely problem finally becomes mainstream. Then identify the limitations of that path itself, and work to understand and address those limitations.
What does this look like in practice? Let's revisit the ColBERT case. The obvious way to build an efficient retriever with BERT is to encode each document into a single vector. Interestingly, by the end of 2019 only limited IR work had done this; for example, the most cited work in this category (DPR) did not release its first preprint until April 2020.
Given this, you might think that the right move in 2019 was to build a great single-vector IR model with BERT. Thinking just two steps ahead instead prompts the question: sooner or later everyone will build single-vector approaches, so where do single-vector approaches fundamentally get stuck? In fact, this question led to the late interaction paradigm and the widely used models built on it.
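To make the contrast concrete, here is a minimal, illustrative sketch (my own toy code, not taken from the ColBERT codebase) of the difference between single-vector scoring and ColBERT-style late interaction ("MaxSim") over token-level embeddings; the function names and array shapes are assumptions for illustration only.

```python
import numpy as np

def single_vector_score(q_vec, d_vec):
    # One pooled embedding per query and per document: a single dot product.
    return float(np.dot(q_vec, d_vec))

def late_interaction_score(q_tokens, d_tokens):
    # q_tokens: (num_query_tokens, dim); d_tokens: (num_doc_tokens, dim).
    # For each query token, take its maximum similarity over all document
    # tokens, then sum those maxima. This preserves fine-grained, term-level
    # matching that a single pooled vector cannot express.
    sims = q_tokens @ d_tokens.T          # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())
```

The single-vector route is the "one step ahead" design; asking where the single pooled representation gets stuck is the second step.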
As another example, consider DSPy. By February 2022, as prompting grew more and more powerful, it was clear that people would want to build retrieval-based question answering via prompting rather than fine-tuning, as before. To do this, one naturally has to establish a method. Going two steps further, we ask: where does such an approach get stuck? After all, the "retrieve then generate" (RAG) formulation is just about the simplest pipeline involving an LM.
For the same reasons people are interested in it now, they will clearly become increasingly interested in (i) expressing more complex compositions of modules, and (ii) figuring out how to supervise or optimize the resulting complex pipelines via automatic prompting or fine-tuning of the underlying LMs. That is DSPy.
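As an illustration of what "composing modules into a pipeline" means here, below is a minimal retrieve-then-generate program written in the style of DSPy's documented examples. Treat it as a sketch: the exact class names and signatures vary across DSPy versions, and the retriever and LM still have to be configured separately.

```python
import dspy

class RAG(dspy.Module):
    """A simple retrieve-then-generate pipeline expressed as composable modules."""

    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)                      # retrieval module
        self.generate = dspy.ChainOfThought("context, question -> answer") # declarative LM step

    def forward(self, question):
        context = self.retrieve(question).passages   # fetch supporting passages
        return self.generate(context=context, question=question)

# Because the pipeline is declared as modules rather than hand-written prompts,
# DSPy-style optimizers can later tune the prompts (or fine-tune the underlying
# LM) against a metric, which is the "two steps ahead" part of the design.
```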
The second half of this rule is "iterate quickly." This was perhaps the first piece of research advice my advisor Matei Zaharia (a Sloan award winner and the creator of Apache Spark) gave me during my first week as a PhD student: finding a version of your problem on which you can iterate quickly and get fast feedback (for example, latency or validation scores) greatly improves your chances of cracking it. This matters even more when you are thinking two steps ahead, which is hard and uncertain enough as it is.
Release your work and promote your ideas
At this point, you have found a good problem, kept iterating until you discovered something cool, and written an insightful paper. Don't just move on to the next paper. Instead, focus on getting your results out into the world and seek real interactions with people, not only around the one paper you released but around the larger agenda you are actively pursuing. Better yet, let people know about a useful open-source tool you are building and maintaining that captures your key ideas.
A common first step is to post a preprint on arXiv and then publish a post announcing the paper. When you do, make sure the post opens with a specific, substantive, and accessible claim. The goal is not to tell people you published a paper, which has no intrinsic value; the goal is to convey your key arguments in a direct and engaging way. (Yes, I know this is hard, but it is necessary.)
Perhaps more importantly, the process does not end with this first "launch"; it is only the beginning. Given that you are now invested in projects, not just papers, your ideas and your communication of them should continue throughout the year, far beyond isolated paper releases.
When I help graduate students tweet about their work, it is not uncommon for the initial posts to receive less attention than they hoped. Students often read this as validating their fear of publicizing their research and take it as another sign that they should move on to the next paper. This reading is wrong.
A wealth of first-hand experience, second-hand experience, and observation shows that persevering here pays off (and, incidentally, that not many people do persevere). With rare exceptions, getting traction for a good idea requires telling people the key things several times in different contexts, and continually refining both the idea and your delivery of it, until the community gradually assimilates it, or until the field reaches a stage of development where the idea is more easily understood.
Gather excitement: tips for growing your open-source research
Getting people excited about your research is good, but delivering your ideas into relevant downstream applications by releasing, contributing to, and developing open-source tools can often have far greater impact.
This is not easy: simply uploading code files and a README to GitHub is not enough. A good repository will be the "home" of your project, and more important than any single paper you publish.
Good open-source research requires two almost independent qualities. First, it must be good research: novel, timely, well-scoped, and sound. Second, it must have clear downstream utility and low friction.
This is the most important part: people will repeatedly avoid your OSS work, and others will repeatedly adopt it, for all the "wrong" reasons. For example, your research may be objectively "state of the art," yet people will often prefer lower-friction alternatives. On the flip side, graduate students are often puzzled by why people do use their tool, for instance without ever exploiting its most inventive parts. This is not something to resist; it is something to understand and improve upon.
With that in mind, here are a few milestones to aim for when open-sourcing your research.
Milestone 0: make your release available
There is no point in releasing code that nobody can run. The narrowest audience is researchers in your own niche who want to reproduce your results, and who may build on and cite your work. These people are more patient than other kinds of users, but even so, you will find huge differences in academic impact depending on how easy your code is to run and patch.
Milestone 1: make your release useful
Beyond people in your niche, you should make sure your release is useful to an audience that wants to actually use the project to build other things. In AI research, this milestone rarely happens by accident. You should allocate real time to thinking about the problems people are trying to solve (in research, in production, and so on) and where your AI work can help. Getting this right pays off everywhere, from the project's design to the APIs you expose and the documentation and examples you present.
Milestone 2: make your release approachable
This one is hard for AI researchers, but we should recognize that a useful release, in which everything is technically available and somewhat documented, does not mean that most of your potential users will find it easy enough to understand, or compelling enough to keep them learning about it or trying it out.
The well-known AI researcher Andrej Karpathy has written about this: "you build something, and then you need to build ramps to get to it." Ben Clavié has also written extensively on this topic, and he has played a large role in taking our ColBERT work and making it more approachable.
Milestone 3: figure out why the obvious alternatives fail, and be patient
We began by talking about thinking two steps ahead. In my view this is crucial, but it also means that most people will not yet understand why they need a solution to a problem they cannot clearly see. Part of your job over time is to build the case: gather evidence and explain, in an accessible way, why the obvious (one-step-ahead) alternatives will fail.
Milestone 4: understand your types of users and use that for growth
When I started ColBERT and DSPy, my initial audience was researchers and professional ML engineers. Over time I learned to let go of that framing and to understand that you can reach a much larger audience, but that audience wants different things. Before anything else, do not block out, even indirectly, whole categories of potential users; this happens far more often than people think.
Second, when looking for users, you need to balance two kinds. On one hand, expert builders with advanced use cases may demand a lot of your effort, but they tend to push certain use cases forward in a research sense, which can pay off. On the other hand, builders in public are typically not ML experts, but they often build and share what they learn openly, account for a larger share of large-scale growth, and will force you to re-examine your initial assumptions. You need both.
Milestone 5: turn interest into a growing community
The true success of OSS work is the presence of a community that keeps growing independently of your own efforts. Generally speaking, a good community should be organic, but you need to work actively to help it form, for example by welcoming contributions and discussions and by looking for opportunities to turn interest into contributions or into some kind of forum (such as Discord or GitHub).
Milestone 6: turn interest into active, collaborative, modular downstream projects
Chances are, your OSS project in its early stages does not solve everything in your original vision. A well-designed project will usually have multiple modular parts, which lets you start research collaborations (or other efforts) and lets new team members not only advance the project but also own significant pieces of it, giving their own ideas faster or larger impact while substantially improving the project. For example, DSPy currently has separate teams leading R&D on prompt optimization, programming abstractions, and reinforcement learning. ColBERT's components, such as its external APIs, underlying retrieval infrastructure, and core modeling, are driven mainly by different people across different projects.
To summarize: adoption of open-source research requires both good research and a good open-source release. The balance is hard to strike, but once you get it right it is very rewarding. Personally, it took me a long time to grasp and internalize this, thanks to repeated feedback from my PhD advisors, Chris Potts and Matei Zaharia, as well as valuable input from Heather Miller and Jeremy Howard.
Research is judged by its "delta" over prior knowledge, but before anyone can meaningfully benefit from that delta, the software itself has to work. And for the software to work, the documentation has to work too: people will not see all the downstream ways they are supposed to use the software unless you show them, at least until an independent community can carry those tasks forward.
Having said all that, the most important skill in this section is to release: really release, release often, and learn from it.
Keep investing in your project with new papers
When you read the fifth rule, a natural question arises: where do graduate students find all this time for open-source software? When does the real research get done?
The practical answer is that much of the time you spend on open source is, in effect, time spent finding and doing new, exciting research; the two are not as separate as they seem. Why?
First, being on the front lines of this kind of open-source work lets you spot new problems very early and understand them more instinctively than you otherwise would. In addition, the community you build often provides direct feedback on your method prototypes and gives you access to talented collaborators who understand why the problems matter. You also gain useful "distribution channels" that ensure each new paper you publish in the area reaches its audience and strengthens the platform you have already built.
For example, ColBERT is not just an early-2020 paper. It now has on the order of ten follow-up papers, investing in improved training, lower memory footprint, faster retrieval infrastructure, better domain adaptation, and better alignment with downstream NLP tasks. Likewise, DSPy is not one paper but a collection of papers on programming abstractions, prompt optimization, and downstream programs. Many of these papers are led by other excellent authors, and their work has had outsized impact, partly because the open-source channel created a large audience for it.
In short, a good open-source tool creates modular pieces of work that new researchers and contributors can explore, own, and develop.
Original post: https://github.com/okhat/blog/blob/main/2024.09.impact.md