The major ethnic group in Śrī Laṅkā
by Prajjval Pratap Singh, Sachin Kumar, Nagarjuna Pasupuleti, Niraj Rai, Gyaneshwer Chaubey, R. Ranasinghe
Published: August 31, 2023
• Higher West Eurasian genetic component in Śrī Laṅkā than South India
• A strong gene flow beyond the boundary of ethnicity and language in Śrī Laṅkā
• Traces of common roots of Sinhala with Maratha
The current census estimates Śrī Laṅkā to have 22 million inhabitants, of which the Sinhalese represent the major ethnic group, comprising 74.9% of the population. Other ethnic groups include Śrī Laṅkān Tamils at 11.1%, Muslims or “Moors” at 9.3%, Indian Tamils at 4.1%, and others at 0.6%, i.e., Burgher, Malay, Vedda (Adivasi).2 It has been conjectured that hunter-gatherers with paleolithic technology settled in Śrī Laṅkā perhaps as early as 125,000 years ago,3,4 but the earliest anatomically modern human fossil in Śrī Laṅkā dates from 28,500 years ago, found at the Upper Pleistocene site of Batadombalena, evidently inhabited by humans from 36,000 years ago.5,6,7 Śrī Laṅkā was inhabited by Mesolithic hunter-gatherers until ca. 800–600 BC when both cattle and agriculture were introduced by the bearers of an Iron Age culture with a Black and Red Ware ceramic culture who practiced megalithic burials. The bearers of this new agricultural civilization are held to have been the Siṃhala, Ceylonese or Sinhalese.8 The Dīpavaṃsa and Mahāvaṃsa record that Prince Vijaya led the ancestral Siṃhala from Siṃhapura or Sihapura in Lāḷa or Lāṭa in what today is southern Gujarat. Vijaya reigned at the newly established Tambapaṇṇi ca. 468–448 B.C.9,10
Wilhelm Ludwig Geiger11,12,13,14 established that Koṅkaṇī, spoken on the Koṅkaṇ coast of India, represented the closest linguistic relative of both Sinhalese, spoken in Śrī Laṅkā, and Divehi, spoken in the Maldives. Geiger inferred that this demonstrable linguistic relationship reflected the ancient maritime migration across the Arabian Sea and Indian Ocean that first brought Divehi- and Sinhalese-speaking populations to their insular habitats in the first millennium BC. Geiger grouped both these languages with Koṅkaṇī, Marāṭhī, and Gujarātī, which in Turner’s classification15 together constitute the Southwestern sub-branch of the Indo-Aryan branch of Indo-European. The Sinhalese chronicles record that for nine months, the newly arrived Sinhala settlers endeavored to exterminate the native populace of the island, whom they called the yakkhas (Skt. yakṣa), which scholars have identified with the Veddas.1,8,9,16
While the Sinhalese are associated with the earliest inscriptions on the island, dating from the time of Aśoka, it has been argued on linguistic grounds that the ancestors of the Tamils crossed the Palk Strait and settled in the North of Śrī Laṅkā at roughly the same time, viz. in the second half of the first millennium BC, during the cultural foment that yielded the dawn of the Cōḻa dynasty on the subcontinent.1 This linguistic dating is supported by the fact that ‘the thickest bundle of isoglosses runs—as one might expect—between the continental dialects of Tamil and the dialects of Ceylon’.17
After Sinhalese and Tamil colonization in the first millennium BC, Śrī Laṅkā’s geographical proximity to the Indian subcontinent was enhanced by close cultural ties. This same period saw the dawn of the great maritime Hindu and Buddhist expansion from the subcontinent into mainland and insular Southeast Asia, historically involving both gene flow and cultural transmission.8,18,19,20,21 Centuries later, Muslim traders arrived from Arabia, Malays from Malaya and, in the British colonial period, Indian Tamils from South India.
Only a few genetic studies, covering mtDNA and the Y and X chromosomes, have been performed, and these confirmed the Sinhalese connection with mainland India.22,23,24,25,26,27,28,29,30,31,32,33,34,35 Some studies have shown that the Sinhalese have a distinct origin, while a few suggested a connection with South Indian populations.26,36 Analysis based on classical markers indicated a closer affinity of the Sinhalese population with South and West Indian populations than with the Bengalis.37,38 The question remains as to how the Sinhalese relate to the other peoples of Śrī Laṅkā in view of ongoing debates on the origin of the Sinhalese and the Śrī Laṅkān Tamils (STU). Therefore, in the present study, we have evaluated various alternatives to establish a molecular genetic perspective on the origin of Sinhalese and preclude possible source populations and genetic admixing.
Śrī Laṅkā also represents an important staging area in any scenario involving the theory of a southern migration route, and so a better understanding of the population genetics of the Sinhalese may offer novel insights into the early peopling of South Asia. Genetic studies on the Śrī Laṅkān population are mainly limited to haploid DNA markers.22,39 The majority of Śrī Laṅkān individuals studied so far showed an overwhelming presence of South-Asian-specific haplogroups. However, a significant presence of West-Eurasian-specific haplogroups has also been detected. The most common West-Eurasian mtDNA haplogroups are U7 and U1.22,40 Thus, the West Eurasian connection of Śrī Laṅkā appears likely. The lack of autosomal studies is a gap that needs to be filled in order to understand the precise nature of the peopling of Śrī Laṅkā. Therefore, we have analyzed and evaluated the Śrī Laṅkān Sinhalese and Tamil groups for hundreds of thousands of genetic markers.
[Editor – from Wikipedia – “West Eurasia is a region that encompasses Europe, the Middle East, South-Central Asia, North Africa, and partly South Asia, Central Asia, and the Horn of Africa. The term “West-Eurasians” is often used in population genomics to refer to the populations of these regions. The genetic history of West Eurasians has been studied extensively, and it is known that they have a complex ancestry that includes multiple ancestral components.”]
Results and discussion
To gain a detailed understanding of the origin and migration of the Sinhala population, we first evaluated the maternal gene flow among the Śrī Laṅkān population. We collected data from public sources22,40 and compared them with the South Indian maternal population composition. The Śrī Laṅkān and South Indian maternal gene pools overwhelmingly showed a South Asian affinity (Figure 1). However, we see a striking difference in the prominence of West Eurasian ancestry. Assuming that the West Eurasian ancestry of Śrī Laṅkā arrived from mainland India, we should expect to see a significantly lower proportion of this ancestry in Śrī Laṅkā than in South Indian populations, but this was not the case. Instead, we observed a significantly higher frequency (two-tailed p < 0.0001) of West-Eurasian-specific maternal ancestry in Śrī Laṅkān populations (Figure 1). This high level of West Eurasian ancestry is consistent across all the major Śrī Laṅkān groups except Indian Tamils, who are known to represent a well-documented recent migration during the British colonial period,41 and the Moors, who overwhelmingly exhibit South Asian ancestry. This discrepancy can be explained by an independent West Eurasian contribution to Śrī Laṅkā, likely by a sea route and putative migration from Northwest India (Figure 1).
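The frequency comparison above rests on a two-tailed test of two proportions. The text does not say which test produced the reported p-value, so the following is only a minimal sketch of one common choice, a pooled two-proportion z-test. The function name and the haplogroup counts are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-tailed z-test for a difference between two proportions.

    hits_*: carriers of West-Eurasian-specific haplogroups in each group
    n_*:    total individuals sampled in each group
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of equal frequencies.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not the study's data).
z, p = two_proportion_z_test(60, 300, 40, 500)
print(f"z = {z:.2f}, two-tailed p = {p:.2e}")
```

With real data, the counts would come from the haplogroup assignments of each sampled individual; for small samples an exact test may be more appropriate than the normal approximation used here.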
In order to understand more about the West-Eurasian-related ancestry and the population history of the Śrī Laṅkān populations, we used hundreds of thousands of autosomal markers. We extracted a large dataset in addition to Indian samples for comparative autosomal analysis and merged these datasets with our newly generated genome-wide data. First, we performed principal component analysis (PCA) to understand population affinity. The scatterplot (Figure 2), using the obtained PC1 and PC2 eigenvectors, suggested that the Sinhalese, Śrī Laṅkān Tamils in Śrī Laṅkā (STS), and the Śrī Laṅkān Tamils in the United Kingdom (STU) are close to one another in a large cluster on the South Asian Indo-European to Dravidian cline. This finding suggests a closer genetic affinity of the Sinhalese population with the Śrī Laṅkān Tamil population (Figure 2). In order to investigate ancestral components, ADMIXTURE was performed, which also showed (Figure 3) that the Sinhalese are more similar to Śrī Laṅkān Tamils than to the Indian populations, and both possess a major South-Asian-related ancestral component. The light and dark green color components specific to South Asian populations were nearly equally distributed in Sinhalese and Śrī Laṅkān Tamils (Figure 3).
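To make the PCA step concrete, here is a small pure-Python sketch of how a first principal component can be obtained from a genotype matrix. It is not the smartpca implementation: it only centers each SNP (smartpca additionally scales by allele frequency), and it uses plain power iteration. The function name and the toy genotypes are hypothetical.

```python
import math

def first_principal_component(genotypes, iterations=200):
    """Return PC1 loadings, one per sample.

    genotypes: list of samples, each a list of 0/1/2 allele counts.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each SNP column on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample covariance matrix (n x n).
    cov = [[sum(a * b for a, b in zip(ri, rk)) / m for rk in centered]
           for ri in centered]
    # Power iteration converges towards the leading eigenvector.
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        nxt = [sum(cov[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical toy data: samples 0-1 and samples 2-3 form two groups.
toy = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
]
pc1 = first_principal_component(toy)
print([round(x, 3) for x in pc1])
```

In this toy case the two groups receive PC1 values of opposite sign, which is the kind of separation a PC1 versus PC2 scatterplot visualizes for real populations.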
To ascertain the fine-scale genetic similarity, we performed haplotype-based fineStructure analysis. Consistent with the PCA and ADMIXTURE results, both Śrī Laṅkān populations shared a close genetic affinity (Figure 4) and fell in the same cluster. Both populations also shared a common clade with Indian Indo-European and Dravidian populations. The chunk count comparison suggested that both ethnic groups of Śrī Laṅkā received major chunks from each other and from Indian Indo-European and Dravidian populations.
These results are intriguing, considering the distinct linguistic affiliation of Sinhalese and STU/STS. The results indicate a strong gene flow beyond the boundaries of the ethnicities in question, which is usually rare in South Asia.42,43
Runs of homozygosity (RoH) were calculated to understand the marriage pattern of Sinhalese and Śrī Laṅkān Tamils. The obtained mean values were plotted as the number of segments vs. the average length of segments (in kb). The STS populations clustered at the base of the scatterplot, followed by Sinhalese, while STU showed longer and more numerous homozygous segments (Figure 5). Results from the RoH suggest that the effective population size (Ne) of these populations varies. These disparities could be due to sampling bias, since the STU samples were collected outside South Asia (UK). More Tamil samples from Śrī Laṅkā could help to resolve the disparity.
In order to test the linguistic hypothesis that the Sinhalese language shows closer common ancestry with Koṅkaṇī, Marāṭhī, and Gujarātī, we performed identity by descent (IBD) analysis (Figure 6), comparing larger (2.0 to ∞ cM) and smaller (0–2 cM) chunks of DNA. When two populations admix, recombination events tend to break up the large DNA segments (chunks). Over time, these segments become smaller and smaller. Thus, comparing the large and small DNA segments can help us to distinguish recent from old admixture processes. Interestingly, we found an unexpected excess (>16%) of smaller-chunk sharing between Marāṭhā and Sinhala compared with that between Marāṭhā and STU, thus supporting the linguistic hypothesis of Geiger, Turner, and van Driem. To confirm the excess sharing, we looked for the population sharing maximum IBD with Sinhala and STU. We observed that the South Indian Piramalai Kallar shared the highest IBD with both Sinhala and STU, for short as well as long DNA segments. We then asked whether Sinhalese or STU shared more DNA segments with Marāṭhā. The Piramalai Kallar shared nearly equal amounts of large DNA segments with Sinhalese and STU (Figure 6), whereas Marāṭhā shared significantly more (>16%) smaller segments with Sinhalese (two-tailed p < 0.001). The same trend is visible in the D-statistics test, although there it was non-significant (Table S2). This excess sharing of smaller segments suggests a closer, deeply rooted common genetic ancestry of the Sinhalese with the Marāṭhā.
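The split into smaller (0–2 cM) and larger (2 cM and above) IBD segments can be illustrated with a short sketch. This is an assumption-laden simplification: the input tuple layout, the function name, and the toy segments are hypothetical, and no normalization by sample size is applied, which a real comparison between populations would need.

```python
from collections import defaultdict

def summarize_ibd(segments, threshold_cm=2.0):
    """Sum shared IBD length per population pair, split by segment size.

    segments: iterable of (pop_a, pop_b, length_cm) tuples.
    Segments shorter than threshold_cm count as "short", the rest as "long".
    """
    totals = defaultdict(lambda: {"short": 0.0, "long": 0.0})
    for pop_a, pop_b, length_cm in segments:
        # Sort the names so that (A, B) and (B, A) land in the same bucket.
        pair = tuple(sorted((pop_a, pop_b)))
        bin_name = "long" if length_cm >= threshold_cm else "short"
        totals[pair][bin_name] += length_cm
    return dict(totals)

# Hypothetical segments for illustration only (not the study's data).
toy_segments = [
    ("Sinhala", "Maratha", 0.8),
    ("Maratha", "Sinhala", 1.5),
    ("Sinhala", "Maratha", 3.0),
    ("STU", "Maratha", 1.0),
]
ibd = summarize_ibd(toy_segments)
for pair, bins in sorted(ibd.items()):
    print(pair, bins)
```

Comparing the "short" totals across population pairs mirrors the reasoning in the text: short shared segments reflect older common ancestry, long ones more recent contact.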
Limitation of the study
Key resources table
|REAGENT or RESOURCE||SOURCE||IDENTIFIER|
|Sinhala||Field Collection from Śrī Laṅkā||See Figures 2 and 3 for population names|
|Śrī Laṅkā Tamils (STS)||Field Collection from Śrī Laṅkā||See Figures 2 and 3 for population names|
|Śrī Laṅkā Tamils (UK)||1000 genomes (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/)||See Figures 2 and 3 for population names|
|World population data||Pathak et al.||See Figures 2 and 3 for population names|
|Chemicals, peptides, and recombinant proteins|
|Tris EDTA Buffer, Molecular Biology Grade||Fisher Scientific||Cat# AAJ75893AP|
|Critical commercial assays|
|DNA extraction Kit||Qiagen||Cat# 51104|
|MinElute PCR Purification Kit||Qiagen||Cat# 28006|
|Illumina-Infinium Global Screening Array||1.0 (GSA-24v1-0)||Cat# 20031669|
|The Genotype data of Sinhala||This study||https://doi.org/10.6084/m9.figshare.23975601|
|The Genotype data of Śrī Laṅkā Tamil||This study||https://doi.org/10.6084/m9.figshare.23975601|
|Software and algorithms|
|PLINK v1.9||Chang et al.||https://www.cog-genomics.org/plink/1.9/|
|EIGENSOFT v6.1.4||Patterson et al.||https://github.com/DReichLab/EIG|
|ADMIXTURE||Alexander et al.||https://dalexander.github.io/admixture/|
|ADMIXTOOLS||Patterson et al.||https://github.com/DReichLab/AdmixTools|
|BEAGLE 5.4||Browning et al.||http://www.gnu.org/licenses/|
|fineStructure||Lawson et al.||https://people.maths.bris.ac.uk/~madjl/finestructure/fs-2.1.3.tar.gz|
|MEGA-X||Kumar et al.||https://www.megasoftware.net/show_eua|
|ChromoPainter||Lawson et al.||https://people.maths.bris.ac.uk/~madjl/finestructure/fs-2.1.3.tar.gz|
|ALDER||Loh et al.||http://cb.csail.mit.edu/cb/alder/alder_v1.03.tar.gz|
|runs of homozygosity (RoH)||Chang et al.||https://www.cog-genomics.org/plink/1.9/|
|merged IBD||Browning & Browning||http://www.gnu.org/licenses/|
|Refined IBD||Browning & Browning||http://www.gnu.org/licenses|
|mt-DNA nomenclature||Van Oven & Kayser||Phylotree.org|
Experimental model and study participant details
DNA extraction and genotyping
Data processing and population genetic analyses
The variant call format (VCF) file was converted to binary files with PLINK v1.9, applying the quality filters --maf 0.03, --geno 0.03, and --mind 0.03. The filtered samples were merged with the HGDP panel. We found 255,063 SNPs common between the samples and the panel, with a genotyping rate of 0.9987. PLINK44 (v1.9) was used for data curation and management for the statistical analyses. The PC1 and PC2 eigenvectors in principal component analysis (PCA) were generated with smartpca45 (EIGENSOFT v6.1.4), and the plot was generated with an in-house R script. We used ADMIXTURE46 to further estimate shared ancestry (K=2 to K=15); the ancestry was defined at K=10, which gave the minimum cross-validation (CV) error of 0.5423.
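The three quality filters can be read as simple thresholds on a genotype matrix. The sketch below is not PLINK and only illustrates what each threshold means: samples with too many missing calls are dropped (mind), then SNPs with too many missing calls (geno), then SNPs with a low minor allele frequency (maf). The function name, the order of the steps, and the toy matrix are assumptions made for illustration.

```python
def qc_filter(genotypes, maf=0.03, geno=0.03, mind=0.03):
    """Return (kept_sample_indices, kept_snp_indices).

    genotypes: list of samples, each a list of 0/1/2 allele counts,
               with None marking a missing call.
    """
    n_snps = len(genotypes[0])
    # 1. Drop samples whose missing-call rate exceeds `mind`.
    kept_samples = [
        i for i, row in enumerate(genotypes)
        if sum(g is None for g in row) / n_snps <= mind
    ]
    kept_snps = []
    for j in range(n_snps):
        calls = [genotypes[i][j] for i in kept_samples]
        observed = [g for g in calls if g is not None]
        # 2. Drop SNPs whose missing-call rate exceeds `geno`.
        if not observed or (len(calls) - len(observed)) / len(calls) > geno:
            continue
        # 3. Drop SNPs whose minor allele frequency is below `maf`.
        freq = sum(observed) / (2 * len(observed))
        if min(freq, 1 - freq) < maf:
            continue
        kept_snps.append(j)
    return kept_samples, kept_snps

# Hypothetical toy matrix; thresholds are loosened because it is tiny.
toy = [
    [0, 1, 2, 0, 1],
    [1, 1, None, 0, 2],
    [2, 0, None, 0, 0],
    [1, 2, 2, 0, 1],
    [None, None, None, None, 1],
]
samples, snps = qc_filter(toy, maf=0.03, geno=0.25, mind=0.25)
print(samples, snps)
```

Here the last sample fails the sample-missingness filter, the third SNP fails the SNP-missingness filter, and the fourth SNP is monomorphic and fails the frequency filter.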
Quantification and statistical analysis
To understand population relationships, several f-statistics were computed with default settings using the Yoruba population as an outgroup. To assess shared drift and gene flow, we used f3 and f4 statistics, respectively, from the ADMIXTOOLS package.47 We phased the genotype data with BEAGLE48 (v5.4) using default settings. The haplotype-based analysis was then performed with fineStructure,49 an MCMC-based program that uses likelihood modelling to calculate coancestry matrices. The obtained output matrix was used for construction of the MCMC tree using MEGA-X.50 ChromoPainter49 was applied for the estimation of chunk counts donated by reference populations to our target population. ALDER51 was run with default settings to estimate the admixture time using multiple source populations. In order to understand the population dynamics, the runs of homozygosity (RoH) were determined for each population using PLINK44 (v1.9). The analysis was carried out with the ‘homozyg’ function and utilised 1000 kb windows for the calculations, allowing one heterozygous call and five missing calls per window, and a minimum of 100 SNPs per window. The window slides along each individual's genome, and for each SNP the proportion of overlapping windows that are homozygous is estimated. To assess identity by descent (IBD), we used Refined IBD and merged IBD analysis.52
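The RoH parameters above can be made more tangible with a deliberately simplified scan. The sketch below is not PLINK's algorithm: it reports only uninterrupted runs of homozygous calls and ignores both the tolerance for heterozygous or missing calls and the sliding-window scoring. It keeps just the two size criteria, a minimum of 100 SNPs and a minimum length of 1000 kb. The function name and the toy chromosome are hypothetical.

```python
def find_roh(positions_bp, genotypes, min_snps=100, min_kb=1000):
    """Return (start_bp, end_bp, n_snps) for each qualifying homozygous run.

    positions_bp: SNP positions in base pairs, in ascending order.
    genotypes:    0/1/2 allele counts for one individual (1 = heterozygous).
                  Missing calls are not modelled in this sketch.
    """
    runs = []
    start = None
    # A sentinel heterozygous call closes a run that reaches the last SNP.
    for i, g in enumerate(list(genotypes) + [1]):
        if g != 1 and start is None:
            start = i
        elif g == 1 and start is not None:
            n_snps = i - start
            length_kb = (positions_bp[i - 1] - positions_bp[start]) / 1000
            if n_snps >= min_snps and length_kb >= min_kb:
                runs.append((positions_bp[start], positions_bp[i - 1], n_snps))
            start = None
    return runs

# Hypothetical toy chromosome: 300 SNPs spaced 10 kb apart.
positions = list(range(0, 3_000_000, 10_000))
calls = [0] * 150 + [1] + [2] * 50 + [1] + [0] * 98
roh = find_roh(positions, calls)
print(roh)
```

Only the first run (150 SNPs spanning 1490 kb) passes both criteria; the other two runs are too short in SNP count. Counting such runs per individual and averaging their lengths per population gives the two quantities compared in an RoH scatterplot.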
Data and code availability
• Data reported in this paper are publicly available from the Figshare repository (https://doi.org/10.6084/m9.figshare.23975601).
• This paper does not report original code.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Declaration of interests
Document S1. Tables S1–S3