The New Directions in Text Analysis conference at Harvard was an eclectic gathering that mixed social scientists and computer scientists interested in tools for mining or otherwise taming large amounts of text. I will describe only a handful of the papers here.
The conference began with a technical paper by Hanna Wallach of UMass on statistical tools for data-driven science policy decisions, focusing on Latent Dirichlet Allocation for topic modeling. A topic is operationalized as a probability distribution over all the words in the vocabulary. Documents can be considered the output of an underlying generative model that may have hidden structure, and whose parameters must be learned from the observables by statistical inference.
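The generative story behind LDA can be sketched in a few lines. This is a toy illustration, not code from the talk; the vocabulary, topic count, and hyperparameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy vocabulary and sizes for this sketch.
vocab = ["policy", "science", "data", "model", "budget", "grant"]
n_topics, n_words = 2, 10

# Each topic is a probability distribution over the whole vocabulary,
# drawn here from a Dirichlet prior.
topics = rng.dirichlet(alpha=[0.5] * len(vocab), size=n_topics)

# Each document has its own mixture over topics, also Dirichlet-distributed.
doc_topic_mix = rng.dirichlet(alpha=[1.0] * n_topics)

# Generate a document: pick a hidden topic per word, then a word from it.
document = []
for _ in range(n_words):
    z = rng.choice(n_topics, p=doc_topic_mix)   # hidden topic assignment
    w = rng.choice(vocab, p=topics[z])          # observed word
    document.append(w)

print(document)
```

Inference runs this story in reverse: given only the observed words, estimate the topic distributions and per-document mixtures that most plausibly produced them.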
Lars Backstrom of Facebook gave an interesting paper about memes: topical text fragments that persist through many articles. He described a graphical model that represents quotes and misquotes as nodes linked by weighted edges, the weights depending upon repetition and edit distance between the corresponding strings. The graph can be partitioned into memes by deleting low-weight edges (an NP-hard problem, but manageable with heuristics). See the MemeTracker web site for more details.
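The core of the partitioning heuristic can be illustrated with a small sketch: drop edges whose weight falls below a threshold, then read off the connected components as candidate memes. The quotes and weights below are invented, and this is a simplification of whatever heuristics MemeTracker actually uses.

```python
from collections import defaultdict

# Invented quote variants with similarity weights (higher = more alike).
edges = [
    ("the economy is sound", "economy is sound", 0.9),
    ("economy is sound", "the economy is fundamentally sound", 0.8),
    ("yes we can", "yes we can can", 0.85),
    ("the economy is sound", "yes we can", 0.05),  # low weight: unrelated
]

def memes(edges, threshold):
    """Connected components after deleting edges below `threshold`."""
    adj = defaultdict(set)
    nodes = set()
    for a, b, w in edges:
        nodes.update([a, b])
        if w >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                     # depth-first traversal
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

print(memes(edges, threshold=0.5))  # two clusters: "economy" and "yes we can"
```

With the 0.05 edge removed, the "economy" variants and the "yes we can" variants fall into separate components.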
The paper by Jure Leskovec of Stanford looked at influence and dynamics in online media, especially the interplay between mainstream media (MSM) and the blogosphere. Different memes seem to have different temporal signatures, depending on where they originate and who adopts them. Most big stories break in MSM and are then taken up by blogs; a lesser number break in the blogosphere and are taken up by MSM. The time/volume graphs of these events look quite different when you control for scale and duration.
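Controlling for scale and duration can be done by resampling each volume curve to a common length and rescaling it to unit peak, so only the shape remains. A minimal sketch, with invented curves standing in for real story data:

```python
import numpy as np

def normalize_curve(volume, n_points=50):
    """Resample a volume curve to n_points and rescale to unit peak."""
    volume = np.asarray(volume, dtype=float)
    t_old = np.linspace(0, 1, len(volume))
    t_new = np.linspace(0, 1, n_points)
    resampled = np.interp(t_new, t_old, volume)
    return resampled / resampled.max()

# Invented examples: a sharp MSM-driven spike vs. a slower blog build-up.
msm_story = [1, 5, 40, 90, 60, 20, 5, 2]
blog_story = [2, 6, 10, 18, 25, 30, 22, 15, 9, 4]

a = normalize_curve(msm_story)
b = normalize_curve(blog_story)
```

After normalization the two curves have the same length and the same peak height, so any remaining difference is purely in shape.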
Does bad news travel fast, or at any rate faster than good news? Previous research by Berger and Milkman on NYT articles suggested not. The paper by Michael Macy of Cornell presented a data mining study of Twitter showing that positive affect dominates across all forms of tweet. Furthermore, a study of half a billion US tweets showed that certain words have a pronounced diurnal rhythm when you adjust for time zones, including some words you might expect ("coffee") and some you might not ("random").
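The time-zone adjustment is the key step in that kind of analysis: tweets must be shifted to the poster's local clock before binning by hour, or rhythms in different zones smear each other out. A minimal sketch with invented sample tweets:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Invented sample data: (UTC timestamp, poster's UTC offset in hours, text).
tweets = [
    (datetime(2011, 6, 1, 12, 30, tzinfo=timezone.utc), -5, "need coffee now"),
    (datetime(2011, 6, 1, 13, 0, tzinfo=timezone.utc), -8, "so random lol"),
    (datetime(2011, 6, 1, 14, 15, tzinfo=timezone.utc), -5, "more coffee please"),
]

def hourly_counts(tweets, word):
    """Count occurrences of `word` per local hour, adjusting for time zone."""
    counts = Counter()
    for utc_time, offset_hours, text in tweets:
        local = utc_time + timedelta(hours=offset_hours)  # shift to local time
        if word in text.lower().split():
            counts[local.hour] += 1
    return counts

print(hourly_counts(tweets, "coffee"))
```

Aggregated over enough tweets, counts like these trace out the diurnal curve for each word.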
All in all, a worthwhile meeting that also showcased some Harvard research into such topics as data visualization, Chinese history and Japanese politics. I gave a talk on 15 years of R&D at Thomson Reuters that shared some of the things we have learned along the way about text analytics, machine learning, and user data.