Data Science Environment Summit
This week, I participated in the Moore-Sloan Data Science Environment Summit, a collaborative meeting between three data science centers at UC Berkeley, NYU, and the University of Washington. By virtue of being a fellow with the Berkeley Institute for Data Science, I was invited to participate in this meeting along with the PIs, Co-PIs, Senior Fellows, and other researchers who are part of BIDS, NYU’s Center for Data Science, and the University of Washington’s eScience Institute.
Our director, Kevin Koy, spearheaded the organization of this first annual meeting and made the wise choice to host it at the Asilomar Conference Grounds near Monterey, CA.
The Monterey Bay Aquarium
The first thing on the agenda was the Monterey Bay Aquarium. I’d always wanted to visit, and it was everything I could have imagined an aquarium might be. There’s an excellent jellyfish exhibit as well as a brand new cephalopod exhibit. The octopuses, squids, and cuttlefish were amazing creatures.
The Conference Grounds
I learned from Professor Henry Brady and his lovely wife that Asilomar was designed by architect Julia Morgan in a beautiful Arts and Crafts style. The grounds are simply lovely and have a fun, if cold, beach nearby.
Bonding and Community Building
To solidify the collaborative bonds between the diverse scientists at the geographically dispersed campuses involved in the collaboration, we did various bonding and collaboration-building activities. While that’s not usually my jam, I did learn a few things.
Dav Clark FTW
The most important thing I learned was to have Dav Clark on your team. If possible, maximize the experience by having an extra David Clark on your team as well. Out of five people, our team had two Dav* Clarks and we were the winners. Coincidence? I think not.
The other lesson from the bonding activities was straight out of agile workflows: fail fast. We were all separated into teams to make towers out of spaghetti and tape that could hold up a marshmallow. We had 18 minutes. By prototyping quickly and racing ahead with the first reasonable-sounding idea that was suggested, we started actually building before a lot of other teams and had more time for quality control. As an analogy to the collaboration, this lesson suggests that rather than spending much more time getting a plan together, we should just start moving with the ideas that we already have. Ideally, by allowing ourselves room to fail fast, we’ll actually get some fast successes along the way.
Rather than allow a planning committee to dictate the meeting subject matter, the attendees collaborated on a conference schedule for Monday and Wednesday mornings. It resulted in a set of topics I had trouble choosing between (being only one person, I can only attend one session at a time, after all).
The first session I attended was the visualization session, where we mentioned (and cheered for) a few interesting tools.
- d3.js (data driven documents)
- Topic modeling in R using LDAvis, related blog post
- yt, good for zoomable views on large dense volumetric data sets (including meshes, lots of astro examples of data cube data)
- DS9 (astronomy)
- dv3d (Tom Maxwell, NASA, for climate)
- IPython interactive charts using the interactive widget system
- Max (this is for artists!)
I ended up (by virtue of being outspoken and often standing within earshot of Kevin Koy) leading the tools session. It was a great series of lightning talks (7 thrilling minutes per person!):
- Brian McFee showed off LibROSA, some sweet tools for audio analysis
- Dan Halperin showed off Myria, an abstracted, accessible interface for relational database exploration
- Zhao Zhang showed off the AMPStack
- Kyle Cranmer demonstrated the collaborative statistics modeling framework behind the LHC (RooFit, RooPlot, RooStats)
- Mike Blanton gave a description of typical astronomer workflows
- John Canny gave a BIDMach and BIDMat demo
- And Saul Perlmutter framed a discussion around goals, dreams, and the needs for workflow tools.
Tuesday, working group sessions took place simultaneously, in many different rooms, and covered most of the scientific computing topics I’ve been interested in for the last five years, including:
- Software and Tools,
- Reproducibility and Open Science,
- Education and Training, and
- Alternative Metrics and Career Paths.
Most Influential Work
A fascinating presentation by Mark Stalzer described what the Moore Foundation learned about the most impactful works in data science. Each of the more than 1100 pre-applications for the Data Driven Discovery Investigators program included up to five references that, in the opinion of the author, were some of the most influential in data science. The resulting bibliography:
- named IPython the most important tool (well deserved!),
- had an h-index of 16,
- and had enormous representation from Google research papers (in particular, MapReduce).
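For anyone unfamiliar with the h-index mentioned in the list above, here is a minimal sketch of the usual definition: the largest h such that at least h items each have at least h citations. How exactly it was applied to this bibliography is my assumption (presumably counting how many pre-applications cited each reference), and the counts below are made up purely for illustration; they are not the Moore Foundation's data.

```python
def h_index(citation_counts):
    """Return the largest h such that at least h items have >= h citations each."""
    # Rank the items from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The top `rank` items all have at least `rank` citations.
            h = rank
        else:
            break
    return h


# Hypothetical counts: how many times each of six references was cited.
example_counts = [10, 8, 5, 4, 3, 0]
print(h_index(example_counts))  # prints 4
```

In this toy example, four references were cited at least four times each, but there are not five references with at least five citations, so the h-index is 4.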
Finally, I learned this morning that David Hogg, who was an exceptional leader of this event, blogs five times a week about research! Amazing, IMO. The rules he has instituted for himself are:
I must post five days per week (I choose which five), except when traveling, no matter what I have done.
I must write only about research; no committees, no refereeing, no teaching, no excuses.
It seems overly ambitious and questionably helpful, but I guess he’s been doing it for years, so it must be serving him well somehow. In the same way that Titus' blog is an impressive body of work, so too is Hogg’s blog. I’m inspired.
He also has a teaching blog.