3. Word Analysis
Now that we’ve analyzed the broader trends of the texts, we’ll look at specific words and compare how their usage changes over time across the two languages. This provides a more detailed view of the cultural differences between them.
3.1 Words Unique to Each Corpus
The simplest place to start is with words that appear among the top words (topWordsTotal) of the Finnosvenka texts but not of the Swedish texts, and vice versa. These are a good indicator of topics that one linguistic group focuses on and the other doesn’t. Many place names from each country (Sweden vs. Finland) show up in the following lists. Some examples include:
Cities: Helsingfors and Malmö are cities in Finland and Sweden, respectively, and each appears only in its own country’s list.
Historical Ties: The Finnosvenka texts mention the Russian city of St. Petersburg, which makes sense considering Finland’s historical ties to Russia.
Currency: Kronor (the plural of krona, the Swedish currency) is mentioned only in the Swedish texts.
fin.topWordsTotal = [s.lower() for s in fin.topWordsTotal]
print("In Finnosvenka but not Swedish: \n\t", set(fin.topWordsTotal).difference(swe.topWordsTotal))
print("\nIn Swedish but not Finnosvenka: \n\t", set(swe.topWordsTotal).difference(fin.topWordsTotal))
In Finnosvenka but not Swedish:
{'gjort', 'stad', 'mark', 'gar', 'hvilket', 'helsingfors', 'huru', 'namn', 'regeringen', 'denne', 'land', 'afseende', 'böra', 'emedan', 'åter', 'finland', 'kop', 'finlands', 'måste', 'vill', 'bet', 'london', 'ganska', 'manad', 'åbo', 'alltid', 'voro', 'sådant', 'sta', 'ifrån', 'gärden', 'wasa', 'mcd', 'sade', 'vore', 'densamma', 'län', 'for', 'penni', 'finska', 'emellan', 'landet', 'också', 'amp', 'fom', 'wiborg', 'äbo', 'petersburg', 'abo', 'johan', 'dig', 'först', 'lör', 'borde', 'fall'}
In Swedish but not Finnosvenka:
{'parti', 'herrar', 'goda', 'kronor', 'fru', 'stadens', 'salu', 'billiga', 'dito', 'mäste', 'december', 'härmed', 'godt', 'finnas', 'hwilket', 'nied', 'andersson', 'ocb', 'carlskrona', 'norra', 'oktober', 'rdr', 'härstädes', 'öre', 'son', 'flera', 'undertecknad', 'kök', 'januari', 'priser', 'emellertid', 'månad', 'hyra', 'kapten', 'derefter', 'göteborg', 'ooh', 'huset', 'nästkommande', 'norrköpings', 'nägon', 'februari', 'afton', 'lund', 'linköping', 'oell', 'afgår', 'sorn', 'lördagen', 'meddelar', 'ali', 'stort', 'norrköping', 'auktion', 'örn', 'sör', 'nästa', 'par', 'lager', 'kongl', 'billigt', 'kontor', 'boktryckeriet', 'ester', 'arbete', 'går', 'plats', 'torget', 'hus', 'malmö', 'tiden'}
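The comparison above boils down to a set difference between two top-word lists. The following is a minimal, self-contained sketch of that idea; the helper `top_words` and the toy token lists are hypothetical stand-ins for the corpus objects, and unlike the cell above it lowercases both sides.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent lowercased tokens as a set."""
    counts = Counter(t.lower() for t in tokens)
    return {word for word, _ in counts.most_common(n)}

# Hypothetical toy token lists standing in for the two corpora.
fin_tokens = ["Helsingfors", "stad", "och", "och", "penni", "stad"]
swe_tokens = ["Malmö", "stad", "och", "och", "kronor", "stad"]

fin_top = top_words(fin_tokens, 4)
swe_top = top_words(swe_tokens, 4)

# Words in one top list but not the other.
only_fin = fin_top - swe_top
only_swe = swe_top - fin_top
print(sorted(only_fin))
print(sorted(only_swe))
```

Lowercasing both lists before taking the difference avoids reporting a word as unique only because of capitalization.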
3.2 Finnish Loanwords
To dig deeper into the linguistic differences, we look at Finnish words that have been adopted into Swedish. We hypothesize that these Finnish loanwords see greater use in Finnosvenka texts because of the group’s geographic proximity to Finnish speakers (in fact, many Swedish-speaking Finns are likely fluent in Finnish as well). To measure this, we compute each loanword’s relative frequency per 100,000 words, averaged across all years, for both languages. To check whether the difference in frequency between the Finnosvenka and Swedish texts is statistically significant, we perform a G-test. Under the null hypothesis that a loanword is equally frequent in both languages, the test gives the probability p of observing a difference at least as large as the one in our data. A value of p < 0.05 allows us to reject the null hypothesis and indicates a statistically significant difference in how the two languages use a given loanword.
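For reference, the G statistic computed in the code for this section can be written as follows, where a and b are the counts of the loanword in the Finnosvenka and Swedish corpora and c and d are the total word counts of each corpus. The symbols mirror the variable names in the code; the notation for the chi-squared CDF with one degree of freedom is introduced here for convenience.

```latex
\begin{align*}
E_1 &= \frac{c\,(a+b)}{c+d}, \qquad E_2 = \frac{d\,(a+b)}{c+d},\\
G^2 &= 2\left(a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2}\right),\\
p &= 1 - F_{\chi^2_1}\!\left(G^2\right).
\end{align*}
```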
from math import log

import numpy as np
from scipy import stats

# Total token counts per corpus, summed over every year's Counter
sweWordCount = sum([sum(c.values()) for c in swe.allCounters.values()])
finWordCount = sum([sum(c.values()) for c in fin.allCounters.values()])

def getFrequenciesLog(candidates):
    # Print the log relative frequency of each candidate (count + 1 avoids log(0))
    for cand in candidates:
        finCount = sum([c[cand] for c in fin.allCounters.values()])
        sweCount = sum([c[cand] for c in swe.allCounters.values()])
        finTotal = sum([sum(c.values()) for c in fin.allCounters.values()])
        sweTotal = sum([sum(c.values()) for c in swe.allCounters.values()])
        finFreq = (log(finCount+1)-log(finTotal))
        sweFreq = (log(sweCount+1)-log(sweTotal))
        print("Finnish loanword freq {:s}: \tFinnish: {:4f} \tSwedish: {:4f} \t [{:s}]".format(
            cand, finFreq, sweFreq, str(finFreq > sweFreq)
        ))

def getFreqsDict(candidates):
    candidates_d = {k: [] for k in candidates}
    for i, cand in enumerate(candidates):
        # Raw counts of the word and total tokens, one entry per year
        finCount = [c[cand] for c in fin.allCounters.values()]
        sweCount = [c[cand] for c in swe.allCounters.values()]
        finTotal = [sum(c.values()) for c in fin.allCounters.values()]
        sweTotal = [sum(c.values()) for c in swe.allCounters.values()]
        # Relative frequency per 100,000 words for each year
        finFreqs = [(c[cand])/(sum(c.values()))*100000 for c in fin.allCounters.values()]
        sweFreqs = [(c[cand])/(sum(c.values()))*100000 for c in swe.allCounters.values()]
        # Average these frequencies across all years
        finFreq = sum(finFreqs) / len(fin.allCounters)
        sweFreq = sum(sweFreqs) / len(swe.allCounters)
        candidates_d[cand] = (finFreqs, sweFreqs)
        # G-test (log-likelihood ratio), following
        # https://wordhoard.northwestern.edu/userman/analysis-comparewords.html
        a, b = sum(finCount), sum(sweCount)  # word counts in each corpus
        c, d = sum(finTotal), sum(sweTotal)  # total tokens in each corpus
        e1 = c * (a + b) / (c + d)  # expected count in the Finnosvenka corpus
        e2 = d * (a + b) / (c + d)  # expected count in the Swedish corpus
        # If a or b is 0, np.log(0) makes G2 nan, so that word is never flagged
        G2 = 2* (a * np.log(a / e1) + b * np.log(b / e2))
        p = 1 - stats.chi2.cdf(G2, 1)
        # print("G^2 Stat: ", G2)
        print("{}. Finnish loanword {:s}: \tFinnish: {:4f} \tSwedish: {:4f} \t{}".format(
            i+1, cand, finFreq, sweFreq, "Signfinicant!" if p< 0.05 else ""
        ))
    return candidates_d

cand_freqs = getFreqsDict(["kola", "kova", "pulka", "memma"])
1. Finnish loanword kola: Finnish: 0.620426 Swedish: 0.469006 Signfinicant!
2. Finnish loanword kova: Finnish: 0.011458 Swedish: 0.010739 Signfinicant!
3. Finnish loanword pulka: Finnish: 0.013246 Swedish: 0.005780 Signfinicant!
4. Finnish loanword memma: Finnish: 0.036777 Swedish: 0.000000
/share/apps/anaconda3/2020.02/lib/python3.7/site-packages/ipykernel_launcher.py:37: RuntimeWarning: divide by zero encountered in log
/share/apps/anaconda3/2020.02/lib/python3.7/site-packages/ipykernel_launcher.py:37: RuntimeWarning: invalid value encountered in double_scalars
Kola: translated as “to die.” Adapted from the Finnish word “kuolla,” this is the most common loanword in both corpora.
Kova: translated as “money.” Originally from the Finnish expression “kova raha,” meaning coins.
Pulka: translated as “sled.” Originally from the word “pulkka,” found in Finnish and Sami (the language of an indigenous group that lives in the north of both countries).
Memma: a traditional Finnish Easter pudding, originally called “mämmi” in Finnish. The dessert has supposedly begun to be exported to Sweden, although the word never appears in the Swedish corpus, so our frequencies can’t back up that statement.
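The RuntimeWarning lines in the output above arise when a word has a zero count in one corpus, since the statistic then evaluates log(0). One common convention treats 0 · ln 0 as 0, which is what `scipy.special.xlogy` implements. The sketch below applies that convention; the function name `g_test` and the counts in the usage line are hypothetical, and this is an alternative to the notebook’s behavior, so it could change which words get flagged.

```python
from scipy.special import xlogy
from scipy.stats import chi2

def g_test(a, b, c, d):
    """Two-cell log-likelihood ratio (G) test with one degree of freedom.

    a, b: counts of the word in corpus 1 and corpus 2
    c, d: total word counts of corpus 1 and corpus 2
    Returns (G2, p).
    """
    if a + b == 0:
        return 0.0, 1.0  # the word occurs nowhere, so there is nothing to compare
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    # xlogy(x, y) computes x * log(y) and returns 0 when x == 0.
    g2 = 2 * (xlogy(a, a / e1) + xlogy(b, b / e2))
    p = chi2.sf(g2, df=1)  # survival function, equal to 1 - cdf
    return float(g2), float(p)

# Hypothetical counts: 30 occurrences in one corpus, none in the other.
print(g_test(30, 0, 1_000_000, 1_000_000))
```

With this convention a word that is absent from one corpus still yields a finite statistic instead of nan.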
3.3 Tracking the Usage of Loanwords
As section 3.2 shows, several Finnish loanwords differ significantly in frequency between the Finnosvenka and Swedish texts. We now pick a few loanwords and track their usage over time to identify trends in each corpus.
3.3.1 Kola
The word “kola” (to die) is the most common loanword in both corpora. Other than a curious spike right at the beginning, Swedish texts seem to use the word less frequently than Finnosvenka texts. Around the 1850s, usage of the word peaks in Finnosvenka texts. A few decades later, its usage in Swedish texts starts to increase as well. The word has other meanings in Swedish, including “toffee,” so a more detailed analysis of its context would be necessary to tease apart the different senses. (wiktionary)
candidates_d = getFreqsDict(["kola"])
plt.plot(finYears, candidates_d['kola'][0], label="Finnish-kola")
plt.plot(sweYears, candidates_d['kola'][1], label="Swedish-kola")
plt.legend()
1. Finnish loanword kola: Finnish: 0.620426 Swedish: 0.469006 Signfinicant!
[Figure: frequency of “kola” per 100,000 words by year, Finnosvenka vs. Swedish texts]

3.3.2 Kova vs. Raha
Kova is the loanword that means “money” (“raha” in Finnish), derived from the Finnish phrase “kova raha,” which originally meant “coins” (quora). To test the hypothesis that Finnosvenka texts borrow Finnish words directly in addition to using the loanwords adapted into Swedish, we compare the usage of the words raha and kova.
candidates_d = getFreqsDict(["raha", "kova"])
plt.plot(finYears, candidates_d['raha'][0], label="Finnish-raha")
plt.plot(sweYears, candidates_d['raha'][1], label="Swedish-raha")
plt.plot(finYears, candidates_d['kova'][0], label="Finnish-kova")
plt.plot(sweYears, candidates_d['kova'][1], label="Swedish-kova")
plt.legend()
/share/apps/anaconda3/2020.02/lib/python3.7/site-packages/ipykernel_launcher.py:37: RuntimeWarning: divide by zero encountered in log
/share/apps/anaconda3/2020.02/lib/python3.7/site-packages/ipykernel_launcher.py:37: RuntimeWarning: invalid value encountered in double_scalars
1. Finnish loanword raha: Finnish: 0.012326 Swedish: 0.000000
2. Finnish loanword kova: Finnish: 0.011458 Swedish: 0.010739 Signfinicant!
[Figure: frequency of “raha” and “kova” per 100,000 words by year, Finnosvenka vs. Swedish texts]

The above graph shows the use of raha and kova in both languages over time. Swedish texts never use the term raha, which makes sense because it is exclusively a Finnish word. In the Finnosvenka texts, usage of both words peaks at almost exactly the same time. This could reflect use of the full Finnish phrase “kova raha,” or the two words could be used as synonyms in a certain context. Because raha has a zero count in the Swedish corpus, the test yields no result for it, so we can’t draw a conclusion about raha. There is, however, a statistically significant difference in the usage of kova between the two corpora, with the Finnosvenka texts using it slightly more often.
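One way to put a number on the observation that both words peak together is to correlate the two per-year frequency series and compare where each reaches its maximum. The sketch below uses hypothetical values; in the notebook the inputs would be candidates_d['raha'][0] and candidates_d['kova'][0].

```python
import numpy as np

# Hypothetical per-year frequencies (per 100,000 words) for illustration only.
raha_freqs = [0.0, 0.0, 0.1, 0.4, 0.1, 0.0]
kova_freqs = [0.0, 0.1, 0.1, 0.3, 0.1, 0.0]

# Pearson correlation between the two yearly series.
r = np.corrcoef(raha_freqs, kova_freqs)[0, 1]

# Index (year position) at which each series is highest.
peak_raha = int(np.argmax(raha_freqs))
peak_kova = int(np.argmax(kova_freqs))
print(f"Pearson r = {r:.3f}; peak indices: {peak_raha}, {peak_kova}")
```

A high correlation would be consistent with either explanation (the full phrase or synonym use), so it complements rather than replaces a look at the surrounding context.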
3.3.3 Pojke vs. Poika
Pojke is the Swedish loanword derived from the Finnish word poika and is an everyday word for “boy” (wiktionary). This is a second test of whether Finnosvenka texts lean towards the Swedish or the Finnish version of a word.
candidates_d = getFreqsDict(["pojke", "poika"])
plt.plot(finYears, candidates_d['pojke'][0], label="Finnish-pojke")
plt.plot(sweYears, candidates_d['pojke'][1], label="Swedish-pojke")
plt.plot(finYears, candidates_d['poika'][0], label="Finnish-poika")
plt.plot(sweYears, candidates_d['poika'][1], label="Swedish-poika")
plt.legend()
1. Finnish loanword pojke: Finnish: 0.290092 Swedish: 0.458642 Signfinicant!
2. Finnish loanword poika: Finnish: 0.111841 Swedish: 0.001719 Signfinicant!
[Figure: frequency of “pojke” and “poika” per 100,000 words by year, Finnosvenka vs. Swedish texts]

Here we see that Finnosvenka texts use the direct Finnish version ‘poika’ far more than Swedish texts do, although both corpora use the Swedish version ‘pojke’ more often. There is a notable increase in the usage of pojke in Swedish texts just as Finnosvenka texts begin to use it in place of poika. In other words, as Swedish texts adopt the Swedish version of the word, Finnosvenka texts significantly decrease their usage of ‘poika.’ It’s hard to establish any causal relationship without more etymological detail, but this example suggests that there is a marked relationship between a word’s language of origin and its usage.
4. Future Work
Comparing loanwords provides a good calibration for comparing word usage patterns across languages. To go further with this analysis, we could search for the words with the greatest difference in frequency between the corpora and use them to investigate dialectal differences. We could also build more detailed representations of content words by training word embeddings on subsets of the corpora and comparing them.
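As a rough sketch of the first idea, words can be ranked by the log-ratio of their smoothed relative frequencies in the two corpora. The function name `rank_by_log_ratio`, the `min_count` threshold, and the toy counts are all hypothetical; real input would be Counters aggregated over all years.

```python
from collections import Counter
from math import log

def rank_by_log_ratio(counts_a, counts_b, min_count=2):
    """Rank words by the log-ratio of their smoothed relative frequencies."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    scores = {}
    for word in vocab:
        if counts_a[word] + counts_b[word] < min_count:
            continue  # skip very rare words, whose ratios are unreliable
        # Add-one smoothing keeps the ratio finite when a word is absent from one corpus.
        freq_a = (counts_a[word] + 1) / (total_a + len(vocab))
        freq_b = (counts_b[word] + 1) / (total_b + len(vocab))
        scores[word] = log(freq_a / freq_b)
    # Largest absolute log-ratio first; the sign shows which corpus favors the word.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical toy counts for illustration only.
fin_counts = Counter({"och": 50, "penni": 8, "stad": 10})
swe_counts = Counter({"och": 50, "kronor": 9, "stad": 11})
ranked = rank_by_log_ratio(fin_counts, swe_counts)
print(ranked[:2])
```

A positive score means the word is relatively more frequent in the first corpus, a negative score in the second.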
For every experiment detailed above, we look only at word frequencies in newspaper publications. The next step could be to look at syntactic differences between the publications. The original dataset contains Part-of-Speech and Named Entity tags, along with other annotations, that could be analyzed as well. However, the tagging systems used on the Språkbanken Text platform are rather outdated. Given enough time and resources, one ambitious project would be to fine-tune current state-of-the-art tagging systems on the Finnosvenka and Swedish texts.
5. Conclusion
In conclusion, this project looked at the evolutionary trends of Finnosvenka and Swedish through both quantitative and qualitative analyses of word frequency in newspaper publications. We found that Finnosvenka texts exhibited less evolution than their Swedish counterparts, but also adopted more words from the majority Finnish-speaking population. Cultural and geographic contexts evidently create differences between the two varieties, with many more avenues of comparison yet to explore.