Research Blog

January 5, 2016

A couple papers I'm involved with were just accepted/published. The first paper (with Yuan Huang, Diansheng Guo, and Alice Kasakoff, all from the University of South Carolina) replicates my 2011 study of American letters to the editor on Twitter data, finding similar patterns of regional variation. The second paper (with Martijn Wieling and Gosse Bouma from Groningen, Josef Fruehwald from Edinburgh, John Coleman from Oxford, and Mark Liberman from Pennsylvania) looks at the rise of the hesitation marker um over uh in Germanic languages, identifying a surprising cross-linguistic change in progress.

October 28, 2015

I also aggregated the swearing data. Basically I took about 40 of the smoothed swear word maps, like the ones I posted below, and used a factor analysis to find and map common patterns of regional variation. I extracted three factors that are mapped below, which represent the three most common patterns of regional variation in the data set. I've also annotated the maps to show, in order, the swear words that are most strongly associated with these three patterns. Finally, I combined the three factor maps to create a single overall map of American swearing regions using RGB mapping. For more information on the method see a number of my recent dialect studies (2011, 2013, 2015).
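For readers curious about the RGB step, here is a minimal sketch in Python. It is only an illustration and not the code used for the maps: it assumes each location has three factor scores, rescales each factor to 0-255 with simple min-max scaling (my assumption; other scalings are possible), and treats the three rescaled values as red, green and blue. The function name and the toy scores are hypothetical.

```python
def factor_scores_to_rgb(scores):
    """Map three factor scores per location to an (R, G, B) colour.

    scores: dict of location -> (f1, f2, f3) factor scores.
    Each factor is min-max rescaled to 0-255 across locations
    (an assumed scaling, chosen for simplicity).
    """
    columns = list(zip(*scores.values()))  # one tuple of values per factor
    ranges = [(min(col), max(col)) for col in columns]

    def rescale(value, low, high):
        if high == low:  # constant factor: use mid-grey for that channel
            return 128
        return round(255 * (value - low) / (high - low))

    return {
        loc: tuple(rescale(v, lo, hi) for v, (lo, hi) in zip(vals, ranges))
        for loc, vals in scores.items()
    }


# Toy factor scores for three hypothetical counties.
toy = {"A": (-1.0, 0.5, 2.0), "B": (1.0, 0.0, 0.0), "C": (0.0, 1.0, 1.0)}
colours = factor_scores_to_rgb(toy)
```

Locations with similar factor profiles end up with similar colours, which is what makes the combined map readable as a set of regions.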

July 29, 2015

Some more great coverage and visualization of our work was posted today by Nikhil Sonnad. Nikhil animated our maps for a number of the emerging words that we identified to show how these new words spread across the U.S. in 2014. Please check out the article for more information, as well as some of our recent presentations.

July 16, 2015

        A few swear word maps...

July 6, 2015
In addition to the Twitter data, I've also been working lately with Bert Vaux's Harvard Dialect Survey data, which got a lot of attention a few years ago when Josh Katz mapped the results using a combination of "kernel density estimation and non-parametric smoothing" to help visualize the data. We also used the data a couple years ago to validate a method we designed for creating dialect maps using search engine results. Right now, I'm in the process of identifying common patterns of regional variation based on the complete dataset using a multivariate spatial analysis, as I've done in previous analyses of regional variation in letters to the editor and vowel formants. I'll post up the overall results when I'm done, but for now here are a few raw maps for the pronunciation of the vowel in been and the alternation between crawfish/crayfish/crawdad. These maps were created by first pooling all informants by county based on their zip codes and by then calculating the percentage of informants from each county that selected a particular answer to each of the 122 questions on the survey. Grey areas correspond to counties for which no data is available.
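The pooling step described above can be sketched roughly as follows. This is a toy illustration rather than the actual processing code: the zip-to-county lookup, the county names and the responses are all made up, and informants with an unknown zip code are simply skipped (an assumption).

```python
from collections import Counter, defaultdict


def county_percentages(responses, zip_to_county):
    """Percentage of informants per county choosing each answer.

    responses: iterable of (zip_code, answer) pairs for one question.
    zip_to_county: dict mapping zip code -> county identifier.
    Counties with no informants are absent from the result
    (these would be the grey areas on a map).
    """
    counts = defaultdict(Counter)
    for zip_code, answer in responses:
        county = zip_to_county.get(zip_code)
        if county is None:  # unknown zip code: skip the informant
            continue
        counts[county][answer] += 1

    return {
        county: {a: 100 * n / sum(c.values()) for a, n in c.items()}
        for county, c in counts.items()
    }


# Toy data; zip codes, counties and answers are invented for illustration.
zip_to_county = {"29201": "Richland", "29205": "Richland", "70112": "Orleans"}
responses = [
    ("29201", "crawfish"),
    ("29205", "crawdad"),
    ("29205", "crawfish"),
    ("70112", "crawfish"),
    ("99999", "crayfish"),
]
result = county_percentages(responses, zip_to_county)
```

In a real analysis the same function would be run once per survey question.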


April 30, 2015

Here's a map of the alternation between LOL and haha on Twitter, ignoring all of the other ways of signalling laughter, including variations on these two forms. This map clearly shows that regions with high LOL percentages also tend to have large African American populations. A lot of the emerging words and non-standard forms that we've been mapping show similar regional patterns, with African Americans being more likely to use the incoming form. This map, however, is flipped the other way: LOL is the more common form. So it appears that white speakers are taking the lead here and that African American speakers are resisting the incoming form haha.

April 22, 2015

A quick post showing the rise of (on) fleek over 2014 on Twitter. However, unlike the scatterplots posted below, which show that (on) fleek was one of the fastest rising emerging words of 2014, this graph plots the percentage of tweets that contain the word fleek that do not also contain some reference to eyebrows. Clearly the meaning of (on) fleek has been broadening over time, as the examples provided in the table below show, although this semantic change appears to be slowing down.

June 26, 2014
eyebrows on what? on fleek???
eyebrows on fleek, the fuck?
Eyebrows on fleek
Eyebrows on "fleek" lmmfao.
Eyebrows on fleek. da fuq
what the hell is fleek
Lmfao on fleek ?
eyebrows on fleek

November 22, 2014
so all ya bitches got eyebrows?” Yeahh and my shits on fleek
"I find my paradise when you look me in the eyes. Jobros on fleek
Makeup was on fleek
When your brows be on fleek but you ain't going no where
Apparently my eyebrows are on fleek
Today in autocorrect the shade queen: "on fleek" became "on fleet"
Braids on fleek
I fleek a leek a week
I'm on fleek
After winter break everything gone be on fleek
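The fleek measure plotted in this post can be approximated with a few lines of Python. This is a minimal sketch under assumptions: the list of eyebrow references is not given in the post, so the pattern below (eyebrow, eyebrows, brow, brows) is my guess, and the function name is hypothetical.

```python
import re

FLEEK = re.compile(r"\bfleek\b", re.IGNORECASE)
# Assumed eyebrow references; the actual search terms are not given.
EYEBROW = re.compile(r"\b(eye)?brows?\b", re.IGNORECASE)


def percent_fleek_without_eyebrows(tweets):
    """Percentage of fleek tweets with no reference to eyebrows."""
    fleek_tweets = [t for t in tweets if FLEEK.search(t)]
    if not fleek_tweets:
        return None  # undefined when there are no fleek tweets
    broadened = [t for t in fleek_tweets if not EYEBROW.search(t)]
    return 100 * len(broadened) / len(fleek_tweets)


# A few of the example tweets from the table, plus one unrelated tweet.
sample = [
    "Eyebrows on fleek",
    "Makeup was on fleek",
    "Braids on fleek",
    "When your brows be on fleek but you ain't going no where",
    "what a day",
]
share = percent_fleek_without_eyebrows(sample)
```

Computing this percentage per day or per month gives a time series like the one in the graph.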

March 11, 2015

Some more double modal maps. First we've got the very uncommon must can, which unlike the very uncommon would could, has a much less singular distribution, being concentrated in the Southeast, as one would expect. I've also included a few examples below. I also mapped might could/might can as an alternation (i.e. as the percentage of might could relative to might can), based on the assumption that these forms are roughly interchangeable (I'm not sure how true that is, but both seem to me to mean something like possibly could and possibly can, which have the same basic meaning for me at least). The maps I posted on February 17th showed that these two most common double modals are Southeastern forms, as one would expect, but that they had slightly different distributions. This map allows for these patterns to be compared a bit more clearly. There is no simple pattern, but might could (in orange) is concentrated in parts of Louisiana, Northern Mississippi and Alabama, and Tennessee, while might can (in blue) is concentrated in South Carolina and Southern Mississippi and Alabama, as well as in some larger Midwestern cities. This distribution suggests that might could is primarily a white Southern term, while might can is a black Southern term.

        My kid must can tell I'm upset, he keeps moving around like crazy and has been since this morning!!
        Reggie must can read minds or something
        lol dogs must can fly now if I'm broke
        If you talking shit you must can back it up . If not your gonna get beat up
        "It's a song" bullshit if you singing that part of the song you must can relate
        they must can tell lol
        AD must can handle you on the course now
        you seem to still be entertaining this shit .. You must can smell the fame I'm bringing you
        I must can get in free
        lol she must can cook

February 26, 2015

I gave a presentation on Tuesday as part of the Digital History Seminar series at the Institute of Historical Research at the University of London's School of Advanced Study. Slides and a video of the talk are available here.

February 23, 2015

Our research on emerging words got some media coverage this weekend. It was the basis for a quiz on new words posted online by the New York Times and made the front page of their webpage. We'd like to thank Josh Katz and Wilson Andrews of the New York Times for putting this together and hopefully we'll be collaborating again in the near future, so stay tuned. Also, this coverage generated some interesting questions, especially regarding unbothered, which as others have noted in the past isn't really all that new of a term. Regardless, it is clear that this term has really shot up in usage over the past year (not only on Twitter), especially with young African Americans in the South, so it's a "newly emerging word" at least in that sense of the term. But I also think the word's meaning has expanded recently, now being used to mean something like "happily or purposely oblivious" in addition to the more standard and general "not bothered" meaning.

February 20, 2015

There are double modals like might can and might could, which are very uncommon, and then there are double modals like would could, which are very uncommon even for double modals (occurring only once every 68 million words or so in our corpus!). As the examples below illustrate, would could appears to have a range of possible meanings along the lines of “would or/and could”, “really could”, and “would if that could ever happen”.

I'm sure Danny and I would could make some very incredible offspring with those genes
I wish I never grew up. Life would could still be simple
I wish we would could take off and go anywhere but here baby you know the deal
I didn't buy a movie ticket to do what we would could do at the crib for free
If you would could see me now would you pat my back or would you criticize me?
I wish things would could be fixed... I wish we didn't fight
And the award for the best lie goes to you: for making me believe that you would could be faithful to me
Wishing u would could cuddle with me
we would could chill later when you gett off
This fresh seafood everyday would could never get old.

The relative frequency of would could is also mapped below. This map shows that would could is remarkably rare in the Deep South, where both might can and might could were relatively common. Instead weak would could hotspots appear to be located in North Carolina, Texas, and the Midwest, although overall there isn't much of a regional pattern here. Interestingly, the Multimo database only has two American examples of this form and neither is from the Deep South either, with one example coming from New York and one from Texas. Multimo, however, contains a number of additional examples of would could from the UK.

It’s also important to note that I went through the 437 tweets containing the sequence would could and only 142 of them appear to be actual instances of double modals, which were the only tokens retained. Rather, the majority of these sequences appear to be typos or auto-corrections (all of our data comes from smart phones). Examples of bad hits are provided below along with what is presumed to be the intended form. There were also a few would could sequences that looked like they should be separated by a punctuation mark (e.g. Thought OKC could pull it off but its not playing out how I thought it would could be a sweep honestly). I don’t think these types of issues are a big deal when looking at more frequent forms in our corpus or even rare forms that aren't composed of sequences of very common forms like modal verbs, but this is a potential issue that should be kept in mind when analyzing rare forms using smart phone data or other types of CMC.

I wasted my time and lost would/what could have been mines
I told mom would/we could watch Bracketology or she could go get me ice cream.
damn! I was hoping maybe would/we could burn some cars or something
didn't think this feeling would could/come back
How much would/wood could a wood chuck really chuck?
I'm sure someone would could/come up with a great alternative
God bless the girl would/who could hold Patron
I hate the word sexy but in this case I had to use it cause no other would/word could describe it.
I wish someone would could/cook breakfast for me naked
To buy everything off of my sephora wish list it would could/cost $285 and my wallet is looking sadder by the minute
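For what it's worth, the first pass of this kind of search is easy to automate, but the second pass is not. Below is a minimal sketch: a regular expression pulls out every tweet containing the sequence would could, and the retained share is computed from the two counts reported above. The function names are hypothetical, and deciding which hits are genuine double modals still has to be done by hand.

```python
import re

# Matches the literal two-word sequence, case-insensitively.
WOULD_COULD = re.compile(r"\bwould\s+could\b", re.IGNORECASE)


def candidate_hits(tweets):
    """Return tweets containing the sequence 'would could' for manual review."""
    return [t for t in tweets if WOULD_COULD.search(t)]


def retained_share(n_candidates, n_retained):
    """Percentage of candidate hits kept after manual checking."""
    return 100 * n_retained / n_candidates


hits = candidate_hits([
    "we would could chill later when you gett off",
    "I would have if I could",
    "How much would could a wood chuck really chuck?",
])
share = retained_share(437, 142)  # counts reported in the post
```

With the counts from this post, only about a third of the candidate sequences survive manual checking, which is why the pattern match alone cannot be trusted for forms this rare.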

February 17, 2015

One type of form we are interested in is double modals, which involve sequences of modal verbs. These highly non-standard linguistic forms are difficult to analyze because they are very rare. For this reason they have never been properly mapped. Our Twitter corpus, however, has thousands of examples of dozens of different double modal pairings, so we are now able for the first time to generate highly detailed maps plotting their usage. I'll provide more information on how we extracted double modals from our corpus and the range and relative frequencies of the different combinations in a later post, but for now here are the maps for the two most common and probably the most well known double modals might could and might can (e.g. I might could/might can do that). As one would expect, they are both Southeastern forms, but we can see there are subtle differences in the maps, with might could being more strongly associated with the Upper South and with usage of might can being more widely distributed throughout the Southeast. As we map more double modals, it will be very interesting to compare their distributions.

February 9, 2015

We've also been looking at some other rare verb-infinitive constructions. A particularly interesting one is the alternation between finna and fixing to, which shows a very clear regional pattern. Looking at this alternation also allows for an important point to be made about the measurement of linguistic variation. When it comes to analyzing the frequency of linguistic forms across different samples of language, it is always important to normalize in some way to control for the fact that language samples are usually of different sizes (and, even if they aren't, to facilitate comparisons across studies). Two types of normalized measurements are most common: frequency variables and alternation variables.

Frequency variables are especially common in corpus linguistics and involve measuring the relative frequency of a form in a language sample based on the size of that sample, generally the total number of words. For example, the two figures below plot the relative frequency per million words of finna and fixing to across the counties of the US in our 9 billion word Twitter corpus, calculated by dividing the frequency of each form in each county by the total number of words in that county and multiplying this value by 1 million.
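As a concrete illustration of a frequency variable, here is a minimal Python sketch of the per-million calculation. The county names and counts are invented, and the function names are hypothetical; counties with no words at all are left out because their rate is undefined.

```python
def per_million(form_count, total_words):
    """Relative frequency of a form per million words."""
    if total_words == 0:
        return None  # undefined for an empty sample
    return form_count * 1_000_000 / total_words


def county_rates(form_counts, word_totals):
    """Per-million rate of a form for every county with any words."""
    return {
        county: per_million(form_counts.get(county, 0), total)
        for county, total in word_totals.items()
        if total > 0
    }


# Toy counts; county names and numbers are made up.
word_totals = {"County A": 2_000_000, "County B": 500_000, "County C": 0}
finna_counts = {"County A": 30, "County B": 1}
rates = county_rates(finna_counts, word_totals)
```

Here County A comes out at 15 occurrences per million words and County B at 2, even though their raw counts and sample sizes are very different.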

These two maps show quite clearly that both finna and fixing to are more common in the South, although the fixing to cluster is shifted to the west, including the South Central States and the Deep South, and the finna cluster is shifted to the east, including the Deep South and the South Atlantic States, with some smaller secondary clusters in major Midwestern cities, especially Chicago, Detroit and St. Louis. Based on the finna map, we can also see that the distribution of finna corresponds closely to the distribution of African Americans in the United States, as mapped below, although fixing to is relatively common in most of this region as well, aside from Virginia, the Carolinas, and the Midwestern cities.

Alternatively, alternation variables are especially common in sociolinguistics and dialectology and involve measuring the relative frequency of two or more interchangeable forms (variants) relative to each other, without regard to the total number of words in each sample. For example, the figure below plots the percentage of finna and fixing to across the counties of the US relative to each other, which was calculated by dividing the frequency of finna in each county by the frequency of finna plus fixing to in that county and multiplying this value by 100 (i.e. literally by combining the two relative frequency maps presented above). Given that there are only two variants involved in this alternation, one measurement/map is sufficient to reflect the distribution of both variants.
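The alternation calculation can be sketched in the same way. The snippet below is a toy example with invented counts and a hypothetical function name; it works for two or more variants and returns nothing for a county where none of the variants occur.

```python
def alternation_percentages(variant_counts):
    """Percentage of each variant relative to all variants combined.

    variant_counts: dict of variant -> frequency in one county.
    Returns None when no variant occurs, since the percentage is undefined.
    """
    total = sum(variant_counts.values())
    if total == 0:
        return None
    return {v: 100 * n / total for v, n in variant_counts.items()}


# Two made-up counties with very different overall usage.
heavy_users = alternation_percentages({"finna": 300, "fixing to": 100})
light_users = alternation_percentages({"finna": 3, "fixing to": 1})
```

Both toy counties come out at 75% finna, although one uses these forms a hundred times more often than the other. The percentage says nothing about how common the forms are overall.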

This alternation map clearly shows that the use of fixing to is concentrated in the Upper South and South Central States, largely outside the areas with the highest density of African Americans in the United States, something that was not entirely clear in the relative frequency maps. As any sociolinguist or dialectologist will tell you, this is why it is important to analyze variation in the usage of a linguistic form relative to other equivalent linguistic forms. This is true, but given that this is the standard practice in these fields, the more important point is that the alternation map does not tell us the whole story either. Specifically, the alternation map obscures the fact that both forms are considerably more common in the South compared to the rest of the United States. In fact, analyzed on its own, the alternation map risks misrepresenting reality, as it implies that finna is most strongly associated with the North and the West, as opposed to the Southeast. When analyzing alternation variables in dialectology and sociolinguistics more generally, it is therefore important to consider the relative frequency of the individual variants.

February 6, 2015

Having such a large amount of data makes it possible to analyze and map extremely rare linguistic forms. In addition to emerging words, using the Twitter corpus therefore allows for various other rare linguistic constructions to be mapped for the first time. For example, we looked at where the form "wantna" is used, which is a particularly rare contracted form of "want to" that is much less common than "wanna". Since these three forms are in variation with each other, we measured them as an alternation variable, i.e. by computing the percentage of each of the three forms relative to the other forms. So, for example, the percentage of "wantna" is calculated for each county by dividing its frequency by the combined frequency of all three forms and multiplying by 100. You can see we don't find much of a regional pattern for "wantna", which looks like it is too rare for even 9 billion words of tweets, but the results are still informative and we get very interesting patterns for "want to" vs. "wanna". We'll be working on mapping double modals next, which are very rare but much more frequent than "wantna" and show clearer regional patterns.

January 10, 2015

Slides are up from our talk at the American Dialect Society Annual Meeting on tracking the spread of new words. Thanks to everyone for a good crowd. We also got "baeless" on to the Word of the Year ballot for most useless, which it won, and "unbothered" on for most useful (although I nominated it for most likely to succeed), which it didn't win, beat out by "even" as in "I can't even".

January 6, 2015

We've defined an emerging word as a word that is very rare at the start of some period of time that then quickly rises in relative frequency over that period of time. So to extract emerging words from a corpus we correlate the relative frequency of a word to the day count based on a Spearman correlation coefficient. Basically we are looking for time charts that look something like this one for unbothered, our candidate for word of the year.

So to find emerging words, we extract all the words from a time-stamped corpus that occur with a minimum relative frequency (e.g. the 67,000 words that occur at least 1,000 times in our 8 billion word Twitter corpus, which contains tweets from October 2013-November 2014) and then test each for a positive correlation to find words on the rise (or if we look for negative correlations, words on the decline). We then extract those rising words that are particularly rare at the start of the corpus (e.g. those that occur fewer than 1,000 times per billion words per day from October-December 2013) to find emerging words.
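The two-step procedure described above can be sketched in plain Python. This is a simplified illustration, not our actual pipeline: Spearman's coefficient is computed as the Pearson correlation of average ranks, rarity at the start is measured as the mean relative frequency over the first few days (an assumption), and the thresholds and toy series are made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant series: correlation undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def emerging_words(daily_freq, start_days, max_start_freq, min_rho):
    """Words that rise over time and are rare in the first start_days days.

    daily_freq: dict of word -> list of relative frequencies, one per day.
    """
    result = {}
    for word, series in daily_freq.items():
        rho = spearman(list(range(len(series))), series)
        start_mean = sum(series[:start_days]) / start_days
        if rho >= min_rho and start_mean <= max_start_freq:
            result[word] = rho
    return result


# Toy daily relative frequencies (per billion words) for three made-up words.
toy = {
    "new": [0, 1, 2, 5, 9, 20],
    "common": [500, 510, 520, 530, 540, 550],
    "fading": [9, 7, 5, 3, 2, 1],
}
found = emerging_words(toy, start_days=2, max_start_freq=100, min_rho=0.9)
```

Only the first toy word passes both tests: the second is rising but already common at the start, and the third is declining.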

These two measurements (time correlation and relative frequency) can also be used to visualize changes in vocabulary on a larger scale. Below are a series of scatter plots, each of which shows relative frequency against time correlation for each of the 67,000 words: the higher the word is on the y-axis the more frequent it is, the farther to the left the word is on the x-axis the faster it is decreasing in frequency, and the farther to the right the word is on the x-axis the more quickly it is rising. We are specifically interested in those words on the bottom right corner of this space (i.e. rare words that are rising quickly like unbothered), which is what this series of plots zooms in on, but the rest of these graphs are also informative. For example, they show that although I is by far the most common word on Twitter, its frequency has been falling over the past year, whereas preposition usage is on the rise, probably reflecting increasing information density in tweets.

January 5, 2015

We've been thinking a lot lately about how to map the spread of a word on one map. Previously, we've been mapping month of first occurrence, which is okay, but doesn't control for sample size, so counties with big populations (and hence lots of tweets) tend to pop up as early adopters no matter what. For example, here is the month of first occurrence map for unbothered, our word of the year for 2014, where you can see that it appears to be primarily a southeastern form, although there are lots of early adopters in red counties spread across the US, especially those containing big cities.

Another possibility, which controls for sample size, is just to map relative frequency by county, in the standard corpus linguistics type of way, as we have done here for unbothered, per million words, which lets us see more clearly that it is a Southeastern form.

Still this map doesn't give us any direct information on date of use, and could therefore obscure spread patterns. A third possibility then is to measure the number of words seen in a date-sorted corpus before the first occurrence of the target word. This approach takes date of first occurrence and size of corpus into consideration, essentially combining the information from the two previous maps. Here is the map for unbothered, which looks a lot like the relative frequency map.

For a more robust measure, we can also map the number of words until the second, third, fourth, etc. occurrence of the word. For example, here is the map for the number of words until the fifth occurrence of unbothered, which successfully removes the outliers (for example in North Dakota), where unbothered happened to be used relatively early but is otherwise quite uncommon. This map lets us see that unbothered is a Southern term--the most informative map yet in my opinion.
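Here is a minimal sketch of the words-until-nth-occurrence measure for a single county. It assumes whitespace tokenization and dates stored as sortable strings, both of which are simplifications, and the function name and toy tweets are hypothetical.

```python
def words_until_nth(tweets, target, n=1):
    """Number of words seen before the nth occurrence of target.

    tweets: iterable of (date, text) pairs for one county.
    Returns None if the target occurs fewer than n times.
    """
    seen = 0
    hits = 0
    for _, text in sorted(tweets, key=lambda pair: pair[0]):
        for token in text.lower().split():
            if token == target:
                hits += 1
                if hits == n:
                    return seen
            seen += 1
    return None


# Toy county sub-corpus, deliberately out of date order.
county_tweets = [
    ("2014-02-01", "so unbothered today"),
    ("2014-01-01", "good morning everyone"),
    ("2014-03-01", "still unbothered"),
]
first = words_until_nth(county_tweets, "unbothered", n=1)
second = words_until_nth(county_tweets, "unbothered", n=2)
```

A smaller value means the word turned up earlier relative to how much the county tweets, and raising n makes the measure less sensitive to one-off early uses.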

There are also some other possibilities. Instead of measuring the number of words until the nth occurrence of the target form, the measure could be flipped around, so that one maps the date when the target word reaches a specific relative frequency. Finally, in a great paper using similar data, Jacob Eisenstein et al. measure the spread of the word based on the percentage of users in a particular metropolitan area who have used that term, which controls for the fact that one user might be responsible for all the uses of a particular word in a given regional sub-corpus.
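The user-based measure can be illustrated with a short sketch as well. To be clear, this is my own toy version of the idea and not Eisenstein et al.'s implementation: it simply counts the share of distinct users in an area who have used the word at least once, with whitespace tokenization assumed.

```python
def percent_users_using(tweets, target):
    """Percentage of distinct users in an area who have used target.

    tweets: iterable of (user_id, text) pairs for one area.
    """
    users = set()
    adopters = set()
    for user_id, text in tweets:
        users.add(user_id)
        if target in text.lower().split():
            adopters.add(user_id)
    if not users:
        return None  # undefined for an area with no users
    return 100 * len(adopters) / len(users)


# One prolific user accounts for every occurrence in this toy area.
area_tweets = [
    ("u1", "unbothered"),
    ("u1", "so unbothered"),
    ("u1", "unbothered again"),
    ("u2", "hello"),
    ("u3", "hi"),
    ("u4", "yo"),
]
adoption = percent_users_using(area_tweets, "unbothered")
```

In this toy area the word makes up a large share of all tokens, but only one user in four has adopted it, which is exactly the situation this measure is designed to catch.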

January 3, 2015

We are primarily interested in tracking new words, especially their regional spread, but we can also look at the top rising and declining words of 2014 based on our data by just looking at the correlation between day of year and relative frequency, i.e. without filtering for new words, like we did below. Here are the top ten rising (but not necessarily new) words with time charts, including rn, which was our top riser for 2013 (see below), and fuckboy in its base form, which is the top rising word of 2014 but which didn't quite make the cutoff for our emerging word list, although some related forms did. Aside from fuckboy, as well as unbothered and gmfu, which made our emerging word list, most of these words look like they're leveling off.

Top 10 Rising Words and Acronyms on American Twitter 2014




loser, wimp, poser, etc.
Lmfao this 20 yr old fuckboy is still on me I'm flattered but get a life dirty rat
right now
Lmao my mom hella called me out rn wow..
happy birthday
congrats on turning 11, idk what I'd do without you happy birthday Tyler. you're my spirit animal hbd bye
fuck with
I fw any and everybody on social networks and chats, it's just to see how you are
not bothered
I'm always unbothered I have no need to worry about the next person.
face time, mostly
I need to ft justin back. but I barely have signal
get me fucked up
This cold weather gmfu big time... Good thing got garage so car be little warmer
so much
our one of the sweetest guys ik & I have sm respect for you. We have to chill soon!
squad, especially a group of friends
Excited for the trip to the bay tomorrow with the squad
as fuck
I think there cute asf but I don't have my nose pierced

And the top ten declining words, excluding lemmas (the top ten is dominated by forms of haha), and including fdb, which was a top riser of 2013.

Top 10 Declining Words and Acronyms on American Twitter 2014




I've been making some bad decisions.. haha but they feel so damn good.
fuck dem bitches
Don't let your ex feel like she has a lock on you because the relationship was intense .. FDB .
You don't even know
Tenzin just shit on your life and UOENO
he was all like ooh my bad and I told him he needed to leave or else I was gonna call the cops
Hey twitter I'm in gym blasting migos they got me being rachet this morning
I hate that shit ohh my gosh
you only live once
I was up bc yolo jk idk I just sleep late
kill them
hey said your nothing without Justin Smith I toldem fuck off cause you gone Killem this Year
I'll, mostly, also good
That's why I neva ask fa help ill do it for you niggas and do it for myself
swinging, driving, etc.
Just swangin until a ranger came and wasn't feeling us

January 2, 2015

The next ten...

Top 11-20 Newly Emerging Words and Acronyms on American Twitter 2014


Correlations are given in parentheses; frequencies are per billion words in Q4 2013. Each entry is followed by an example.

selfie, possibly but probably not taken while incarcerated
If you say "Celfie" instead of Selfie, you're a dumb, annoying fuck
impresses, succeeds (at), nails, etc.
I love his music. He slays all of his features
family and friends
My uncle fly Ty been down since day one preciate that  famo
fuckboi (.838), 241 per billion words: fuckboy (see below)
I'm still at home. You have time. Get ready fuckboi
(on) fleek (.838): on point, especially eyebrows
we in dis bitch finna get crunk eyebrows on fleek dafuq
to favorite something
if u faved my concert tweet  does that mean youll go with me ????
earnings
Would you rather be super fit and yoked or in a relationship and lose all your gainz?
bro
Lol bruuh we not even gon be in school when it rains
am I right?
It's the 21st century n I still have to pay taxes n obey the law whatthefuck amirite?
notifications, especially online
I've not been getting half my notifs. My SIM card is jacked up

And the time charts for all twenty words showing word relative frequency by day, with notifs looking like it's ready to take off for 2015...

January 1, 2015

Here are the top 10 newly emerging words of 2014 according to our analysis of approximately 8 billion words of geo-coded Tweets from the US. First we measured the degree to which every word occurring at least 1,000 times in our corpus rose over the course of the year by correlating day of year against relative frequency of the word by day using a Spearman correlation coefficient. Then we cross-referenced this list of rising words against the relative frequency of those words from October-December 2013 (excluding proper nouns) to find rising words that didn't occur often at the end of 2013. Graphs and maps to follow.

Top 10 Newly Emerging Words and Acronyms on American Twitter 2014


Correlations are given in parentheses; frequencies are per billion words in Q4 2013. Each entry is followed by an example.

unbothered (.926), 159 per billion words: not bothered
I'm always unbothered I have no need to worry about the next person.
gmfu (.924), 247 per billion words: get me fucked up
This cold weather gmfu big time... Good thing got garage so car be little warmer
joggers (.908), 453 per billion words: jogging pants
It's gonna be warm tomorrow But I'm wearing joggers n a hoody.
fuckboys (.902), 508 per billion words: losers, wimps, posers, etc.
I went harder and looked cooler than every single one of you fuckboys
Would Michael J Fox be the GOAT actor if parkinsons hadn't rekt his shit?
tfw (.879), 235 per billion words: that feel when
Tfw you take your socks off afterschool
xans (.878), 320 per billion words: Xanax pills
Cant wait to swallow dez xans im feelin that type of night
125 per billion words: to be without a bae, i.e. a significant other
Everyone is like bae this, bae that and I'm over here baeless
hanging out, especially with a group of young men
I'm go be boolin with yo ass then instead of being at the house
223 per billion words: lord, as an exclamation
Ooooo lordt these tweets at night

December 29, 2014

Guess I should have set this up a while ago, but I've finally got a Twitter account now @JWGrieve.

December 28, 2014

We've recently generated another series of maps with Nikhil Sonnad, this time looking at informal terms that are used for referring to people, words like dude and bro. I've included our maps below, including a couple that didn't make it into the qz article. Also, to clarify, because I've seen quite a few comments on this, we aren't claiming that people only use these terms in highlighted regions. Obviously, all of these terms are used across the United States, as the raw maps show. Rather, what we are doing is finding regional hotspots for each term. Furthermore, these terms are not all equally common, with dude and buddy in particular being more common than the rest of these words. So although pal, for example, has a hot spot in the North Central States, that doesn't mean it is the most common term there, just that it is relatively more common there compared to the rest of the United States, in our Twitter corpus at least.

November 19, 2014

Another article was published on um/uh alternation, this time in The Atlantic

October 6, 2014

A couple new interjection maps, following up on the um/uh stuff, this time for oh oh/uh oh and oops/whoops, both raw and smoothed. Regional patterns are present again, but they are different and not quite as strong.

September 30, 2014

Jack will be giving a presentation at the American Dialect Society Annual Meeting at LSA in Portland in January on Mapping Lexical Spread in American English.

September 28, 2014

We gave a presentation today on Big Data Dialectology at the American Association of Corpus Linguistics conference in Flagstaff, Arizona. The presentation includes a bunch of new maps and temporal distributions, including for schleep and schleepy, two of the top rising new words in 2013 according to our analyses, which appear to have been introduced by young African American women from the Southeast.

September 16, 2014

We've had a bunch of coverage over the last couple days for the um/uh maps we made at the request of Mark Liberman, starting with a great article and map by Nikhil Sonnad that briefly made the front page.

September 12, 2014

Here are two new maps showing the first occurrences by month of thotty and unmeet in American Tweets from 2013, which are two words that we've been looking at that show strong increases over time in the corpus (rho = .725 and .747 respectively). The distribution of thotty is particularly interesting, starting in Chicago and then moving to major urban centers in the Midwest and Central States and eventually to the East Coast, the Deep South and California after the release of Chicago native Chief Keef's popular "Love no Thotties" single in mid-September.

September 9, 2014

After extracting all 60,000 words that occur at least 1,000 times in the 6 billion word corpus of American Tweets (January-September, 2013), we identified the words showing the largest increases and decreases in usage over the course of 2013 by testing for a relationship between relative frequency and day of the year using Spearman rank-order correlation. The top 10 increasing and decreasing words are listed below, including what appear to be many new forms. In addition, scatter plots are reproduced showing the change in frequency of each form over the days of the corpus. A clear trend that emerges immediately is the increasing popularity of acronyms and the decreasing popularity of creative spellings. In addition, looking at the scatter plots, we can see the famous s-curve in the plots for each of the increasing forms and linear patterns for each of the decreasing forms.

Increasing          Decreasing
rn (.978)           wat (-.976)
selfies (.965)      nf (-.963)
selfie (.965)       swerve (-.956)
tbh (.960)          p (-.956)
fdb (.952)          shrugs (-.956)
literally (.948)    ase (-.955)
bc (.943)           dnt (-.955)
ily (.940)          wen (-.948)
bae (.934)          rite (-.947)
shleep (.932)       yu (-.946)
sweg (.932)         wats (-.946)
vibes (.925)        yeahh (-.945)

August 18, 2014

Here are the updated maps for um/uh alternation, which are based on the complete 6 billion word corpus of tweets from 2013, with the non-interchangeable usages of um and uh removed from the data set (e.g. uh oh, uh huh, uh uhh, foreign language um), leaving 430,000 um tokens and 350,000 uh tokens. Overall the maps are very similar, although the raw map in particular is a bit clearer now.
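To give a rough idea of what removing the non-interchangeable usages involves, here is a minimal sketch. The exclusion pattern only covers the fixed expressions named above (uh oh, uh huh, uh uhh) and is my own guess at a reasonable rule; it does nothing about foreign language um, which needs a different kind of check. The function name is hypothetical.

```python
import re

# Assumed exclusion pattern based on the examples named in the post.
NON_INTERCHANGEABLE = re.compile(r"\buh+[\s-]+(oh+|huh+|uh+)\b", re.IGNORECASE)
# Hesitation markers, allowing lengthened spellings such as umm and uhh.
TOKEN = re.compile(r"\b(um+|uh+)\b", re.IGNORECASE)


def count_um_uh(text):
    """Count um and uh tokens after removing fixed expressions like 'uh oh'."""
    cleaned = NON_INTERCHANGEABLE.sub(" ", text)
    counts = {"um": 0, "uh": 0}
    for match in TOKEN.finditer(cleaned):
        counts[match.group(1).lower()[:2]] += 1
    return counts


example = count_um_uh("uh oh, um I think uh maybe")

# Overall um percentage from the token counts reported in the post.
um_share = 100 * 430_000 / (430_000 + 350_000)
```

With the reported token counts, um accounts for roughly 55% of the alternation overall, so the maps show where counties fall above or below that baseline.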

August 13, 2014

At the Methods in Dialectology conference, Jack gave a talk on the use of spatial and geostatistical analysis for regional dialectology using the data from the Trees and Tweets corpus.

August 12, 2014

As requested by Mark Liberman, who was attending the Methods in Dialectology conference at the University of Groningen in the Netherlands, we've run an analysis showing the distribution of the interjections um and uh in American English, based on a portion of our American tweets corpus, which surprisingly shows a relatively clear Midland pattern. See Mark's post on the Language Log for more information. Here are the raw and smoothed maps.

May 8, 2014

Diansheng and Alice recently gave an interview on our project for the University of South Carolina's IT Minute Podcast.

April 18, 2014

The group's paper "Big Data Dialectology: Analyzing Lexical Spread in a Multi-billion Word Corpus of American English" was just accepted for presentation at the American Association of Corpus Linguistics 2014 Conference being hosted by Northern Arizona University in Flagstaff from September 26-28, 2014.

March 6, 2014

The launch of the Trees and Tweets project has generated some media attention, including from the Daily Telegraph.

March 1, 2014

I've started this blog primarily to report the results of the Trees and Tweets project, which is part of the Digging into Data Challenge, but I'll also be reporting on my research in general, including on dialectology, corpus linguistics, and authorship attribution. For the Trees and Tweets project, our team consists of Andrea Nini and me at Aston University and Diansheng Guo and Alice Kasakoff at the University of South Carolina. We are interested in analyzing regional linguistic variation in a multi-billion word corpus of geo-coded British and American Tweets and in analyzing migration patterns based on a database consisting of millions of family trees harvested from online genealogy websites. Ultimately our goal is to link these two data sources in order to better understand the relationship between dialect variation and human migration patterns. The British side of the project is funded by JISC/ESRC/AHRC, while the American side is funded by the IMLS.