Scandinavian Languages Project

This project analyzes how a language can evolve differently depending on its demographic context. More specifically, it analyzes the differences between the Swedish spoken in Sweden and the Swedish spoken by the ethnic minority group in Finland, also called Finnosvenka. There has been a Swedish-speaking minority in Finland since roughly the 16th century, and because this community has been relatively isolated, its language has hypothetically changed less than the “mainland” version of Swedish spoken in Sweden.

To test the hypothesis that Finnosvenka has changed less over time than mainstream Swedish, we compare newspaper publications from Sweden and Finland from the 18th and 19th centuries. While newspapers might not be the largest drivers of language change, they are linguistically more stable than less formal data sources. The SpråkbankenText service hosted by the University of Gothenburg has an extensive collection of newspapers from both Sweden and Finland, which we use for this analysis.

The analysis is divided into two sections. First, we take a high-level look at the trends across decades for both Swedish and Finnosvenka newspapers. This part looks at the overall patterns of word usage for each linguistic group to see whether they really have evolved in different ways. The second part takes a more granular approach by looking at specific words and seeing how they are used over time in both the Swedish and Finnish text. By examining specific words, we can begin to hunt down potential cultural influences behind the differences between Swedish and Finnosvenka over time.

import json
import os
import pickle
from collections import Counter
from math import log

import matplotlib.pyplot as plt
import numpy as np
from scipy import spatial, stats
from sklearn.decomposition import PCA

from utils import LanguageCounter

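# Decades covered by each corpus. `fin` and `swe` are the LanguageCounter
# objects built in Section 1, so that cell must be run first.
finYears = list(fin.allCounters.keys())
sweYears = list(swe.allCounters.keys())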

1. The Dataset

The Swedish dataset comes from the Kubhist 2 corpus hosted on SpråkbankenText. The newspapers span from 1750 to 1890 and come from various regions in southern Sweden, where the majority of the population is located. Due to size constraints and the much larger number of newspapers available from Sweden, we use only a sample of the available newspapers.

The Finnish dataset comes from the Nationalbibliotekets svenskspråkiga tidningar corpus, a comprehensive collection of Finnosvenka newspapers published in Finland from 1770 to 1900.

We represent each corpus as an instance of the LanguageCounter class, which is defined in utils.py. Each instance stores the words of one corpus and their frequencies per decade. This representation keeps only word counts, so word order and sentence structure are discarded. Later analysis could look at the grammatical structure of the texts, but for now we look at the different trends in word usage across languages.
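The decade-to-counter mapping described above can be sketched as follows. This is a minimal illustration, not the code in utils.py: the function names, the `(year, text)` input shape, the regex tokenizer, and the use of integer decades (the real `allCounters` keys are `np.datetime64` values) are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def decade_of(year):
    """Round a year down to the start of its decade, e.g. 1857 -> 1850."""
    return (year // 10) * 10


def count_by_decade(documents):
    """Map each decade to a Counter of lowercased word tokens.

    `documents` is an iterable of (year, text) pairs. This input shape is
    an assumption for the sketch, not the pickle format read by utils.py.
    """
    counters = defaultdict(Counter)
    for year, text in documents:
        # \w+ matches runs of letters and digits, including å, ä and ö.
        tokens = re.findall(r"\w+", text.lower())
        counters[decade_of(year)].update(tokens)
    return dict(counters)


# Hypothetical toy documents, not taken from either corpus.
toy_documents = [
    (1851, "Staden och landet"),
    (1857, "Landet och regeringen"),
    (1863, "Regeringen i staden"),
]
toy_counters = count_by_decade(toy_documents)
print(toy_counters[1850].most_common(2))
```

Keeping one Counter per decade makes every later step (relative frequencies, overlaps, feature vectors) a simple lookup, at the cost of discarding word order.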

print(LanguageCounter.__doc__)
 Class used to represent each corpus. 
    ...
    Attributes
    ----------
    dataPath: str
        Path to the directory that contains the pickled raw text for each newspaper. 
    allCounters: Dict[np.datetime64, Counter]
        A mapping from decade to a counter of each word in the newspaper publishings for that decade. 
    commonWords: Dict[np.datetime64, list[str]]
        A mapping from decade to the 100 most commonly used words in that decade.
    topWordsTotal: Counter
        The top 250 words used across all time by the newspapers and their overall frequencies.
    allFeatized: List[List[float]]
        A list of 250-dimensional vectors, one for each decade of the corpus. Vectors show the frequencies of each of the topWordsTotal within that decade.
    
    Methods
    -------
    buildCommonCounters(dataPath)
        For every pickle file in dataPath, adds a counter for the given decade to self.allCounters
    getOverlaps()
        Get the amount of word overlap across decades. 
    buildFeatures()
        Builds feature vectors that describe each decade by the frequency of each of the overall top 250 most commonly used words by the language. 
    
finCountSource = "data/Finnish/temp_txt"
sweCountSource = "data/Swedish/"

fin = LanguageCounter(finCountSource)
fin.buildCommonCounters(fin.dataPath)

swe = LanguageCounter(sweCountSource)
swe.buildCommonCounters(swe.dataPath)
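The docstring describes `buildFeatures()` as turning each decade into a vector over the corpus-wide top 250 words. One way such vectors could be computed is sketched below. The function name `build_features`, the choice of relative (rather than raw) frequencies, and the toy counts are assumptions; the actual implementation in utils.py may differ.

```python
from collections import Counter


def build_features(decade_counters, top_k=250):
    """Return (vocabulary, {decade: vector}) for a dict of per-decade Counters.

    Each vector holds the relative frequency of every vocabulary word within
    that decade. Using relative frequencies is an assumption of this sketch.
    """
    # Pool all decades to find the overall most common words.
    total = Counter()
    for counter in decade_counters.values():
        total.update(counter)
    vocabulary = [word for word, _ in total.most_common(top_k)]

    features = {}
    for decade, counter in sorted(decade_counters.items()):
        n_tokens = sum(counter.values())
        features[decade] = [
            counter[word] / n_tokens if n_tokens else 0.0 for word in vocabulary
        ]
    return vocabulary, features


# Hypothetical toy counts, not taken from either corpus.
toy = {
    1850: Counter({"och": 4, "stad": 1}),
    1860: Counter({"och": 2, "land": 2}),
}
vocab, feats = build_features(toy, top_k=2)
print(vocab, feats)
```

Normalizing by the decade's token count keeps decades with more surviving newspapers from dominating any distance or PCA computed on the vectors.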

3. Word Analysis

Now that we’ve analyzed the broader trends of the texts, we’ll look at specific words and see how their usage over time compares across languages. This will provide a more detailed analysis of the cultural differences between the languages.

3.1 Words Unique to Each Corpus

The simplest place to start is to look at words that occur in the Finnosvenka text but not the Swedish text, and vice versa. This is a good indicator of the topics each linguistic group focuses on that the other doesn’t. Many cities from each country (Sweden vs. Finland) show up in the following lists. Some examples include:

  • Cities: Helsingfors and Malmö are cities in Finland and Sweden, respectively, and don’t show up in the other corpus.

  • Historical Ties: The Finnosvenka texts mention the Russian city of St. Petersburg, which makes sense considering Finland’s historical ties to Russia.

  • Currency: The Swedish kronor (the name of the Swedish currency) are mentioned only in the Swedish text.

# Lowercase both vocabularies so the comparison is case-insensitive,
# without overwriting the topWordsTotal counters themselves.
finTop = {w.lower() for w in fin.topWordsTotal}
sweTop = {w.lower() for w in swe.topWordsTotal}
print("In Finnosvenka but not Swedish: \n\t", finTop - sweTop)
print("\nIn Swedish but not Finnosvenka: \n\t", sweTop - finTop)
In Finnosvenka but not Swedish: 
	 {'gjort', 'stad', 'mark', 'gar', 'hvilket', 'helsingfors', 'huru', 'namn', 'regeringen', 'denne', 'land', 'afseende', 'böra', 'emedan', 'åter', 'finland', 'kop', 'finlands', 'måste', 'vill', 'bet', 'london', 'ganska', 'manad', 'åbo', 'alltid', 'voro', 'sådant', 'sta', 'ifrån', 'gärden', 'wasa', 'mcd', 'sade', 'vore', 'densamma', 'län', 'for', 'penni', 'finska', 'emellan', 'landet', 'också', 'amp', 'fom', 'wiborg', 'äbo', 'petersburg', 'abo', 'johan', 'dig', 'först', 'lör', 'borde', 'fall'}

In Swedish but not Finnosvenka: 
	 {'parti', 'herrar', 'goda', 'kronor', 'fru', 'stadens', 'salu', 'billiga', 'dito', 'mäste', 'december', 'härmed', 'godt', 'finnas', 'hwilket', 'nied', 'andersson', 'ocb', 'carlskrona', 'norra', 'oktober', 'rdr', 'härstädes', 'öre', 'son', 'flera', 'undertecknad', 'kök', 'januari', 'priser', 'emellertid', 'månad', 'hyra', 'kapten', 'derefter', 'göteborg', 'ooh', 'huset', 'nästkommande', 'norrköpings', 'nägon', 'februari', 'afton', 'lund', 'linköping', 'oell', 'afgår', 'sorn', 'lördagen', 'meddelar', 'ali', 'stort', 'norrköping', 'auktion', 'örn', 'sör', 'nästa', 'par', 'lager', 'kongl', 'billigt', 'kontor', 'boktryckeriet', 'ester', 'arbete', 'går', 'plats', 'torget', 'hus', 'malmö', 'tiden'}

3.2 Finnish Loanwords

To get a little deeper into the linguistic differences, we look at Finnish words that have been adopted by the Swedish language. We hypothesize that these Finnish loanwords see greater use in Finnosvenka texts due to the geographic proximity of the group to Finnish speakers (in fact, many Swedish-speaking Finns are likely fluent in Finnish as well). To measure this trend, we look at the relative frequency per 100,000 words of these loanwords across all time for both languages. To see whether the difference in a loanword’s frequency between Finnosvenka and Swedish texts is statistically significant, we perform a G-test. Under the null hypothesis that the word is equally frequent in both corpora, the test gives the probability p of observing a difference at least as large as the one in the data. A value of p < 0.05 allows us to reject the null hypothesis and indicates a statistically significant difference in how often the two languages use a given loanword.
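As a rough, self-contained illustration of the test described above, the sketch below computes the two-term G statistic (the form given in the WordHoard manual) and its p-value using only the standard library. The function name `g_test` and all counts are hypothetical. Treating a zero count as contributing nothing to the statistic (the usual 0·ln 0 = 0 convention) is a choice made for this sketch.

```python
from math import erfc, log, sqrt


def g_test(count_a, count_b, total_a, total_b):
    """Two-term log-likelihood ratio (G) test for one word in two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total number of tokens in each corpus.
    Returns (G2, p_value).
    """
    # Expected counts if the word were equally frequent in both corpora.
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled_rate
    expected_b = total_b * pooled_rate

    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing (0 * ln 0 = 0)
            g2 += observed * log(observed / expected)
    g2 = max(2.0 * g2, 0.0)  # guard against tiny negative rounding errors

    # Survival function of chi-square with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(g2 / 2.0))
    return g2, p_value


# Hypothetical counts: 120 vs. 60 occurrences in two corpora of 1M tokens each.
g2, p = g_test(120, 60, 1_000_000, 1_000_000)
print("G2 = {:.3f}, p = {:.2e}".format(g2, p))
```

Because the statistic depends on both corpus totals, the two totals must come from their own corpora; if both are taken from the same corpus, the expected counts and therefore the p-value are skewed.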

sweWordCount = sum([sum(c.values()) for c in swe.allCounters.values()])
finWordCount = sum([sum(c.values()) for c in fin.allCounters.values()]) 

def getFrequenciesLog(candidates):
    # Corpus sizes do not depend on the candidate word, so compute them once.
    finTotal = sum([sum(c.values()) for c in fin.allCounters.values()])
    sweTotal = sum([sum(c.values()) for c in swe.allCounters.values()])
    for cand in candidates:
        finCount = sum([c[cand] for c in fin.allCounters.values()])
        sweCount = sum([c[cand] for c in swe.allCounters.values()])
        # Log relative frequency; the +1 avoids log(0) for unseen words.
        finFreq = log(finCount + 1) - log(finTotal)
        sweFreq = log(sweCount + 1) - log(sweTotal)

        print("Finnish loanword freq {:s}: \tFinnish: {:4f} \tSwedish: {:4f} \t [{:s}]".format(
            cand, finFreq, sweFreq, str(finFreq > sweFreq)
        ))
        
def getFreqsDict(candidates):
    candidates_d = {k: [] for k in candidates}
    # Total number of tokens per decade in each corpus.
    finTotal = [sum(c.values()) for c in fin.allCounters.values()]
    sweTotal = [sum(c.values()) for c in swe.allCounters.values()]
    for i, cand in enumerate(candidates):
        finCount = [c[cand] for c in fin.allCounters.values()]
        sweCount = [c[cand] for c in swe.allCounters.values()]
        # Relative frequency per 100,000 words in each decade.
        finFreqs = [n / t * 100000 for n, t in zip(finCount, finTotal)]
        sweFreqs = [n / t * 100000 for n, t in zip(sweCount, sweTotal)]

        # Average these frequencies across all decades
        finFreq = sum(finFreqs) / len(fin.allCounters)
        sweFreq = sum(sweFreqs) / len(swe.allCounters)
        candidates_d[cand] = (finFreqs, sweFreqs)

        # G-test, following
        # https://wordhoard.northwestern.edu/userman/analysis-comparewords.html
        a, b = sum(finCount), sum(sweCount)
        c, d = sum(finTotal), sum(sweTotal)
        if a > 0 and b > 0:
            e1 = c * (a + b) / (c + d)
            e2 = d * (a + b) / (c + d)
            G2 = 2 * (a * np.log(a / e1) + b * np.log(b / e2))
            p = 1 - stats.chi2.cdf(G2, 1)
        else:
            # A word missing from one corpus would put log(0) in the
            # statistic, so no significance is reported for it.
            p = float("nan")

        print("{}. Finnish loanword {:s}: \tFinnish: {:4f} \tSwedish: {:4f} \t{}".format(
            i+1, cand, finFreq, sweFreq, "Significant!" if p < 0.05 else ""
        ))
    return candidates_d
cand_freqs = getFreqsDict(["kola", "kova", "pulka", "memma"])
1. Finnish loanword kola: 	Finnish: 0.620426 	Swedish: 0.469006 	Significant!
2. Finnish loanword kova: 	Finnish: 0.011458 	Swedish: 0.010739 	Significant!
3. Finnish loanword pulka: 	Finnish: 0.013246 	Swedish: 0.005780 	Significant!
4. Finnish loanword memma: 	Finnish: 0.036777 	Swedish: 0.000000 	
  1. Kola: translated as “to die.” Adapted from the Finnish word “kuolla,” this is the most common of the four loanwords in both corpora.

  2. Kova: translated as “money.” Originally from the Finnish expression “kova raha,” meaning coins.

  3. Pulka: translated as “sled.” Originally from the word “pulkka,” found in both Finnish and Sami (the Sami are an indigenous group living in the north of both countries).

  4. Memma: a traditional Finnish Easter pudding, originally called “mämmi” in Finnish. The dessert has supposedly begun to be exported to Sweden, although the word’s frequencies in the two corpora can’t back up that statement.

3.3 Tracking the Usage of Loanwords

As section 3.2 shows, there are Finnish loanwords whose usage differs to a statistically significant degree between Finnosvenka and Swedish text. We now pick a few loanwords and track their usage over time to identify trends in how Finnosvenka and Swedish texts use them.
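Tracking a word over time amounts to computing its relative frequency separately in each decade. A minimal, self-contained sketch of that computation follows; the function name `frequency_series`, the integer decade keys, and the toy counts are assumptions for illustration only.

```python
from collections import Counter


def frequency_series(decade_counters, word, per=100_000):
    """Return (decades, frequencies) for `word`, with decades in ascending order.

    Each frequency is the number of occurrences per `per` tokens in that decade.
    """
    decades = sorted(decade_counters)
    freqs = []
    for decade in decades:
        counter = decade_counters[decade]
        n_tokens = sum(counter.values())
        # Counter returns 0 for unseen words, so absent words give 0.0.
        freqs.append(counter[word] / n_tokens * per if n_tokens else 0.0)
    return decades, freqs


# Hypothetical toy counts, deliberately inserted out of chronological order.
toy = {
    1860: Counter({"kola": 1, "och": 99_999}),
    1850: Counter({"kola": 3, "och": 99_997}),
}
decades, freqs = frequency_series(toy, "kola")
print(decades, freqs)
```

Sorting the decades explicitly means the x-axis of a later plot does not depend on the order in which the per-decade counters happened to be loaded.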

3.3.1 Kola

The word “kola” (to die) is the most common loanword across both texts. Other than a curious spike right at the beginning, Swedish texts seem to use the word less frequently than Finnosvenka texts. Around the 1850s, usage of the word peaks in Finnosvenka texts. A few decades later, usage in Swedish texts starts to increase as well. The word has other meanings in Swedish, including “toffee,” so a more detailed analysis of its context would be necessary to tease apart which meaning is being used. (wiktionary)

candidates_d = getFreqsDict(["kola"])

plt.plot(finYears, candidates_d['kola'][0], label="Finnish-kola") 
plt.plot(sweYears, candidates_d['kola'][1], label="Swedish-kola")
plt.legend()
1. Finnish loanword kola: 	Finnish: 0.620426 	Swedish: 0.469006 	Significant!
[Figure: frequency of “kola” per 100,000 words by decade in Finnosvenka and Swedish texts]

3.3.2 Kova vs. Raha

Kova is a loanword that means “money” (“raha” in Finnish), derived from the Finnish phrase “kova raha,” which originally means “coins” (quora). To test the hypothesis that Finnosvenka texts borrow Finnish words directly in addition to using established Swedish loanwords, we compare the usage of the words raha and kova.

candidates_d = getFreqsDict(["raha", "kova"])
plt.plot(finYears, candidates_d['raha'][0], label="Finnish-raha")
plt.plot(sweYears, candidates_d['raha'][1], label="Swedish-raha")
plt.plot(finYears, candidates_d['kova'][0], label="Finnish-kova")
plt.plot(sweYears, candidates_d['kova'][1], label="Swedish-kova")
plt.legend()
1. Finnish loanword raha: 	Finnish: 0.012326 	Swedish: 0.000000 	
2. Finnish loanword kova: 	Finnish: 0.011458 	Swedish: 0.010739 	Significant!
[Figure: frequency of “raha” and “kova” per 100,000 words by decade in Finnosvenka and Swedish texts]

The above graph shows the use of raha and kova in both languages over time. Swedish texts never use the term raha, which makes sense because it is exclusively a Finnish word. In the Finnosvenka texts, usage of both words peaks at almost exactly the same time. This could be due to use of the full Finnish phrase “kova raha,” or the two words could be used as synonyms in certain contexts. There isn’t enough data to draw a conclusion about the usage of raha in either corpus, but there is a statistically significant difference in the usage of kova between the two, with the Finnosvenka texts using it more often.

3.3.3 Pojke vs. Poika

Pojke is a Swedish loanword from the Finnish word poika and is a colloquial word for “boy” (wiktionary). This is a second example of testing whether Finnosvenka texts lean towards the Swedish or the Finnish version of a word.

candidates_d = getFreqsDict(["pojke", "poika"])
plt.plot(finYears, candidates_d['pojke'][0], label="Finnish-pojke")
plt.plot(sweYears, candidates_d['pojke'][1], label="Swedish-pojke")
plt.plot(finYears, candidates_d['poika'][0], label="Finnish-poika")
plt.plot(sweYears, candidates_d['poika'][1], label="Swedish-poika")
plt.legend()
1. Finnish loanword pojke: 	Finnish: 0.290092 	Swedish: 0.458642 	Significant!
2. Finnish loanword poika: 	Finnish: 0.111841 	Swedish: 0.001719 	Significant!
[Figure: frequency of “pojke” and “poika” per 100,000 words by decade in Finnosvenka and Swedish texts]

Here we see that Finnosvenka texts use the direct Finnish version ‘poika’ far more than Swedish texts do, although both use the Swedish version ‘pojke’ more often. There is a notable increase in the usage of pojke in Swedish texts just as Finnosvenka texts begin to use it in place of poika. So, as Swedish texts begin to use the Swedish version of the word, Finnosvenka texts significantly decrease their usage of ‘poika.’ It’s hard to determine any sort of causal relationship without more etymological details, but this example suggests that there is a marked relationship between the language of origin and the word’s usage.

4. Future Work

Comparing loanwords provides a good calibration for comparing word usage patterns across languages. To go further with this analysis, we could search for the words with the greatest difference in frequency between the two corpora and use them to investigate dialectal differences. We could also create more detailed representations of content words by training word embeddings on subsets of the corpora and comparing them.
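One possible way to search for the words with the greatest frequency difference is to rank them by the log ratio of their smoothed relative frequencies. The sketch below is only an illustration of that idea: the function name `rank_by_log_ratio`, the `min_count` threshold, the add-one smoothing, and the toy counts are all assumptions rather than part of the analysis above.

```python
from collections import Counter
from math import log


def rank_by_log_ratio(counts_a, counts_b, min_count=5):
    """Rank words by log relative frequency in corpus A minus that in corpus B.

    Positive scores mean the word is relatively more frequent in A,
    negative scores that it is more frequent in B. Words seen fewer than
    `min_count` times in total are skipped because their ratios are unstable.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for word in set(counts_a) | set(counts_b):
        if counts_a[word] + counts_b[word] < min_count:
            continue
        # Add-one smoothing keeps log() defined for words absent from a corpus.
        log_freq_a = log(counts_a[word] + 1) - log(total_a)
        log_freq_b = log(counts_b[word] + 1) - log(total_b)
        scores[word] = log_freq_a - log_freq_b
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy counts, not taken from either corpus.
toy_fin = Counter({"penni": 40, "och": 900})
toy_swe = Counter({"kronor": 50, "och": 950})
ranked = rank_by_log_ratio(toy_fin, toy_swe)
print(ranked)
```

Words at the top and bottom of such a ranking would be candidates for closer inspection, although in OCR-derived corpora many of them may turn out to be recognition errors rather than real dialectal differences.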

For every experiment detailed above, we look only at the word frequencies of newspaper publications. The next step in analyzing the text could be to look at syntactic differences between the publications. The original dataset contains Part-of-Speech and Named Entity tags, along with other annotations, that could be analyzed as well. However, the tagging systems used on the SpråkbankenText platform are rather outdated. With the right time and resources, one ambitious project could be to fine-tune present state-of-the-art tagging systems on the Finnosvenka and Swedish texts.

5. Conclusion

In conclusion, this project looked at the evolutionary trends of Finnosvenka and Swedish by performing both quantitative and qualitative analyses of word frequency within newspaper publications. We found that the Finnosvenka texts changed less than their Swedish counterparts, but also adopted more words from the majority Finnish-speaking population. Clearly, cultural and geographic contexts create differences between the two languages, with many more possible avenues of comparison yet to explore.