Scandinavian Languages Project¶

This project analyzes how languages can evolve differently depending on their demographic context. More specifically, it analyzes the differences between the Swedish spoken in Sweden and the Swedish spoken by the ethnic minority group in Finland, also called Finnosvenka. There has been a Swedish-speaking minority in Finland since roughly the 16th century, and because this community has been relatively isolated, its language has hypothetically changed less than the “mainland” version of Swedish spoken in Sweden.

To test the hypothesis that Finnosvenka has changed less over time than mainstream Swedish, we compare newspapers published in Sweden and Finland during the 18th and 19th centuries. While newspapers might not be the largest drivers of linguistic change, they are more linguistically stable than less formal data sources. The SpråkbankenText service hosted by the University of Gothenburg has an extensive collection of newspapers from both Sweden and Finland, which we use for this analysis.

The analysis is divided into two sections. First, we take a high-level look at the trends across decades for both Swedish and Finnosvenka newspapers. This part examines the overall patterns of word usage for each linguistic group to see whether they really have evolved in different ways. The second part takes a more granular approach by looking at specific words and seeing how they are used over time in both the Swedish and Finnish texts. By examining specific words, we can begin to hunt down potential cultural influences behind the differences between Swedish and Finnosvenka over time.

import json
import os
import pickle
from collections import Counter
from math import log

import matplotlib.pyplot as plt
import numpy as np
from scipy import spatial, stats
from sklearn.decomposition import PCA

from utils import LanguageCounter

# Decade labels used on the x-axis of later plots. fin and swe are the
# LanguageCounter objects built in Section 1, so run this after they exist.
finYears = list(fin.allCounters.keys())
sweYears = list(swe.allCounters.keys())

1. The Dataset¶

The Swedish dataset comes from the Kubhist 2 corpus hosted on SpråkbankenText. The newspapers span from 1750 to 1890 and come from various regions in southern Sweden, where the majority of the population is located. Due to size constraints and the much larger number of newspapers available for Sweden, we use only a sample of the available newspapers.

The Finnish dataset comes from the Nationalbibliotekets svenskspråkiga tidningar corpus. This is a comprehensive dataset of Finnosvenka newspapers published in Finland from 1770 to 1900.

We represent each corpus as an instance of the LanguageCounter class, which is defined in utils.py. Each instance stores the words of its corpus and their frequencies for every decade. This representation keeps only word counts, so information such as word order is discarded. Later analysis could look at the grammatical structure of the texts, but for now we look at the different trends in word usage across languages.

print(LanguageCounter.__doc__)

 Class used to represent each corpus. 
    ...
    Attributes
    ----------
    dataPath: str
        Path to the directory that contains the pickled raw text for each newspaper. 
    allCounters: Dict[np.datetime64, Counter]
        A mapping from decade to a counter of each word in the newspaper publishings for that decade. 
    commonWords: Dict[np.datetime64, list[str]]
        A mapping from decade to the top 100 most commonly used words in that decade. 
    topWordsTotal: Counter
        The top 250 words used across all time by the newspapers and their overall frequencies. 
    allFeatized: List[List[float]]
        A list of 250-dimensional vectors, one for each decade of the corpus. Vectors show the frequencies of each of the topWordsTotal within that decade.
    
    Methods
    -------
    buildCommonCounters(dataPath)
        For every pickle file in dataPath, adds a counter for the given decade to self.allCounters
    getOverlaps()
        Get the amount of word overlap across decades. 
    buildFeatures()
        Builds feature vectors that describe each decade by the frequency of each of the overall top 250 most commonly used words by the language. 
    
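To make this representation concrete, the following is a minimal sketch of per-decade word counting. It is not the code in utils.py, which is not shown here: the `count_by_decade` helper, the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def count_by_decade(texts_by_year):
    """Group raw texts by decade and count lowercase whitespace tokens.

    `texts_by_year` maps a publication year (int) to a list of raw strings.
    Hypothetical stand-in for illustration, not the LanguageCounter code.
    """
    counters = {}
    for year, texts in texts_by_year.items():
        decade = year - year % 10  # e.g. 1774 -> 1770
        counter = counters.setdefault(decade, Counter())
        for text in texts:
            counter.update(text.lower().split())
    return counters


# Toy input: three invented snippets spread over two decades
toy = {
    1771: ["och det är en tidning"],
    1774: ["det är och en stad"],
    1782: ["en ny tidning och en ny stad"],
}
counters = count_by_decade(toy)
print(sorted(counters))
print(counters[1770].most_common(2))
```

Keeping one Counter per decade means every later statistic (overlap, relative frequency, feature vectors) can be derived without re-reading the raw text.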

finCountSource = "data/Finnish/temp_txt"
sweCountSource = "data/Swedish/"

fin = LanguageCounter(finCountSource)
fin.buildCommonCounters(fin.dataPath)

swe = LanguageCounter(sweCountSource)
swe.buildCommonCounters(swe.dataPath)

2. Overall Trends¶

First we can take a look at the overall trends across decades for each language. This will help us get a better understanding of each language and confirm that there actually are differences in how Finnosvenka and Swedish have evolved.

2.1 Measuring Word Overlaps¶

The first graph shows the proportion of words in each decade that are shared with the first decade of publishing. The Swedish data starts at 1740 and the Finnish data at 1760 with 100% overlap, since we’re comparing the first decade with itself. From the very next decade, both languages start to deviate from their first decade of publishing. Swedish text changes more rapidly, quickly settling at about 30% of words overlapping with the original newspaper content. Finnosvenka text takes longer to diverge from the original text and settles at a higher overlap of about 40%. For a first experiment, this is a good indication that Swedish text evolves more rapidly than its Finnosvenka counterpart.

# Overlap with each corpus's first decade over time
_, finSimToBase = fin.getOverlaps()
plt.plot(finYears, finSimToBase, label="Finnosvenka")

_, sweSimToBase = swe.getOverlaps()
plt.plot(sweYears, sweSimToBase, label="Swedish")
plt.xlabel("Decade")
plt.ylabel("Proportion of shared words")
plt.legend()
plt.title("Proportion of overlap with first decade")
plt.show()

[Figure: Proportion of overlap with first decade, Finnosvenka vs. Swedish]
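The body of getOverlaps is not shown, so the sketch below is only one plausible reading of "proportion of words shared with the first decade": the fraction of a decade's top-N words that also appear among the first decade's top-N words. The function name `overlap_with_first`, the `top_n` parameter, and the toy counts are hypothetical.

```python
from collections import Counter


def overlap_with_first(counters, top_n=100):
    """Fraction of each decade's top_n words that are also in the first decade's top_n."""
    decades = sorted(counters)
    base = {word for word, _ in counters[decades[0]].most_common(top_n)}
    overlaps = []
    for decade in decades:
        top = {word for word, _ in counters[decade].most_common(top_n)}
        overlaps.append(len(top & base) / len(base))
    return decades, overlaps


# Toy counters: the vocabulary drifts a little more in each decade
toy = {
    1770: Counter({"och": 5, "det": 4, "en": 3, "af": 2}),
    1780: Counter({"och": 6, "det": 3, "en": 2, "av": 2}),
    1790: Counter({"och": 4, "av": 3, "att": 3, "som": 2}),
}
decades, overlaps = overlap_with_first(toy, top_n=4)
print(dict(zip(decades, overlaps)))
```

Under this definition the first decade always scores 1.0, which matches the 100% starting point described above.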

2.2 Building Decade Feature Vectors¶

This section takes a more detailed look at the similarity across decades of each language. We follow the same process as Two Pis in a Pod, where we represent decades through their use of high-frequency words. The buildFeatures method finds the top 250 words across all years for each language. Each decade is then characterized by the relative frequency of each of those 250 words in that decade. Top-word frequency has been used as a reliable metric of author signature in the past. First, it provides a sort of signature because an author’s characteristics determine the ratio of commonly used words. Second, by avoiding content words that would vary between Finnish and Swedish publications, we can make a more direct comparison between the stylistic choices of the two linguistic groups.

fin.buildFeatures()
swe.buildFeatures()
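The feature construction described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the buildFeatures method itself: the `build_features` function, its `top_k` parameter, and the toy counts are hypothetical, and the relative frequency is taken here as count divided by the total number of tokens in the decade.

```python
from collections import Counter

import numpy as np


def build_features(counters, top_k=250):
    """Describe each decade by the relative frequency of the overall top_k words."""
    # Pool the counts of all decades to find the overall most common words
    total = Counter()
    for counter in counters.values():
        total.update(counter)
    vocab = [word for word, _ in total.most_common(top_k)]

    features = {}
    for decade, counter in counters.items():
        n_tokens = sum(counter.values())
        # Counter returns 0 for words that never occur in this decade
        features[decade] = np.array([counter[word] / n_tokens for word in vocab])
    return vocab, features


toy = {
    1770: Counter({"och": 5, "det": 3, "kung": 2}),
    1780: Counter({"och": 4, "det": 4, "stad": 2}),
}
vocab, features = build_features(toy, top_k=2)
print(vocab)
print(features[1770], features[1780])
```

Because every decade is measured against the same word list, the resulting vectors share one coordinate system and can be compared directly.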

2.2.1 Similarity Across Time for Individual Languages¶

As a quick test to make sure these feature vectors make sense, we can make a plot similar to the one in section 2.1. Here we measure similarity through cosine similarity (one minus the cosine distance) instead of the overlap of words. The similarity values are much higher than with the previous metric. This is likely because every vector holds non-negative frequencies of the same very common words, so all decades point in broadly similar directions; the high dimensionality of the vectors may also compress the differences between them. However, the same general pattern is present: Swedish publications change more and at a greater pace than their Finnish counterparts. Now that we have some reassurance that these feature vectors make sense, we can continue with a more direct comparison of the evolutions of Finnish and Swedish texts.

# Cosine similarity (1 - cosine distance) between each decade and the first decade
finSimilarity = [1 - spatial.distance.cosine(fin.allFeatize[0], vec) for vec in fin.allFeatize]
plt.plot(finYears, finSimilarity, label="Finnish")
sweSimilarity = [1 - spatial.distance.cosine(swe.allFeatize[0], vec) for vec in swe.allFeatize]
plt.plot(sweYears, sweSimilarity, label="Swedish")
plt.xlabel("Decade")
plt.ylabel("Cosine similarity")
plt.title("Similarity to first decade over time")
plt.legend()
plt.show()

[Figure: Similarity to first decade over time, Finnish vs. Swedish]
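For reference, the similarity measure used in this plot can be written out directly. The sketch below computes cosine similarity by hand and checks that it agrees with one minus SciPy's cosine distance. The two vectors are invented frequency values, not data from either corpus, and the `cosine_similarity` helper is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cosine


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Two invented vectors of relative frequencies for the same four words
a = np.array([0.05, 0.03, 0.02, 0.01])
b = np.array([0.04, 0.03, 0.03, 0.00])

sim = cosine_similarity(a, b)
scipy_sim = 1 - cosine(a, b)  # SciPy returns a distance, so subtract from 1
print(sim, scipy_sim)
```

Even though the fourth word disappears entirely from `b`, the similarity stays close to 1 because the vectors are non-negative and dominated by the same frequent words. A set-based overlap would register the same change much more strongly.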

2.2.2 Comparing Finnosvenka and Swedish Evolution¶

In the previous section, we computed the top 250 words for each language individually. This means that the feature vectors lived in a different vector space for each language. Here, we build shared feature vectors by computing the top 250 words across both languages. Now, the feature vectors for each decade of the Finnosvenka and Swedish texts can be directly compared. To do so, we project the 250-dimensional vectors into 2D space using PCA. The two remaining dimensions capture the greatest variance in the vectors and allow us to visually inspect the relationship between decades and language, which we do below.

# Find the top 250 most frequent words across both languages. Counts are first
# converted to relative frequencies so the larger corpus does not dominate.
totalSwed = sum(count for _, count in swe.topWordsCounter)
freqsSwed = [(word, count / totalSwed) for word, count in swe.topWordsCounter]

totalFin = sum(count for _, count in fin.topWordsCounter)
freqsFin = [(word, count / totalFin) for word, count in fin.topWordsCounter]

# Build a Counter that combines the frequencies from both languages
combined = Counter()
for word, freq in freqsSwed + freqsFin:
    combined[word] += freq

# Keep only the 250 words with the highest combined frequency; iterating over
# all of `combined` would use the union of both lists (up to 500 words)
sharedWords = [word for word, _ in combined.most_common(250)]

# Based on the top 250 shared words, build the vector of frequencies for each decade
sweAllFeatize = []
for counter in swe.allCounters.values():
    lenDoc = sum(counter.values())
    sweAllFeatize.append(np.array([counter[w] for w in sharedWords]) / lenDoc)

finAllFeatize = []
for counter in fin.allCounters.values():
    lenDoc = sum(counter.values())
    finAllFeatize.append(np.array([counter[w] for w in sharedWords]) / lenDoc)

# Fit PCA on the decades of both languages together so they share one projection.
# Only the first two components are plotted.
pca = PCA(n_components=2)
pca_result = pca.fit_transform(sweAllFeatize + finAllFeatize)
numSweDecades = len(sweAllFeatize)

# Rows are ordered Swedish first, then Finnish
x_swe, y_swe = pca_result[:numSweDecades, 0], pca_result[:numSweDecades, 1]
x_fin, y_fin = pca_result[numSweDecades:, 0], pca_result[numSweDecades:, 1]

plt.figure(figsize=(15, 10))
plt.scatter(x_swe, y_swe, label="Swedish")
for decade, x, y in zip(swe.allCounters.keys(), x_swe, y_swe):
    plt.annotate(str(decade), (x, y))

plt.scatter(x_fin, y_fin, label="Finnish")
for decade, x, y in zip(fin.allCounters.keys(), x_fin, y_fin):
    plt.annotate(str(decade), (x, y))
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.legend()
plt.title("Projection of Feature Vectors")
plt.show()

[Figure: Projection of Feature Vectors, Swedish and Finnish decades on the first two principal components]
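When reading a 2D projection like this one, it helps to know how much of the original variance the two plotted components retain. The sketch below shows how that could be checked with scikit-learn's `explained_variance_ratio_` attribute. The data is synthetic (random vectors with a fixed seed, with one feature shifted for the second group), so the numbers say nothing about the actual corpora.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 6 "decades" per group, 20 word frequencies each
group_a = rng.normal(0.0, 0.01, size=(6, 20)) + 0.05
group_b = rng.normal(0.0, 0.01, size=(6, 20)) + 0.05
group_b[:, 0] += 0.1  # one word is used much more often in the second group

# Fit a single PCA on both groups so they share the same axes
stacked = np.vstack([group_a, group_b])
pca = PCA(n_components=2)
projected = pca.fit_transform(stacked)

# Split the rows back into the two groups, in the order they were stacked
proj_a = projected[: len(group_a)]
proj_b = projected[len(group_a):]

print(projected.shape)
print(pca.explained_variance_ratio_)
```

If the two ratios sum to only a small fraction of the total, distances in the scatter plot should be interpreted with caution, since most of the variation between decades lies outside the plotted plane.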

There are a few trends that we can analyze in the above graph. First, we see that in both Finnosvenka and Swedish, the 18th-century decades seem to be oddballs. They are both off to the left-hand side of the graph, with the Finnosvenka points clustered slightly more tightly than the Swedish decades. Many language corpora show this trend of the oldest publications being the most different, likely due to the modernization that would come in later years. It is also important to note that the OCR system used to create both corpora is nowhere near as effective for earlier years, meaning there could simply be more noise in the early decades that creates excessive differences in their contents. The effects of the limited OCR should be mitigated here, however, because we are using only the top 250 most common words, which would likely have a better OCR success rate. Curiously, the one exception to this trend is the 1760 decade in the Swedish texts, which is closer to later publications.

The next trend is the clear progression of both languages towards more negative principal component 1 values over time. This suggests some sort of common linguistic evolution, although once again the Finnosvenka texts seem to exhibit less rapid change. The Finnosvenka texts also exhibit less variance in the second principal component. A more detailed analysis of the word choices causing this evolution is the next step towards understanding this common trend, and as such we turn to analyzing specific words in the next section.

Cornell Digital Humanities Notebook Series
