Reconstructing the Population History of the Sinhalese

The major ethnic group in Śrī Laṅkā

by Prajjval Pratap Singh, Sachin Kumar, Nagarjuna Pasupuleti, Niraj Rai, Gyaneshwer Chaubey, R. Ranasinghe

Open Access
Published: August 31, 2023
DOI: https://doi.org/10.1016/j.isci.2023.107797

Highlights

• Higher West Eurasian genetic component in Śrī Laṅkā than South India

• A strong gene flow beyond the boundary of ethnicity and language in Śrī Laṅkā

• Traces of common roots of Sinhala with Maratha

Summary

The Sinhalese are the major ethnic group in Śrī Laṅkā, inhabiting nearly the whole length and breadth of the island. They speak an Indo-European language of the Indo-Iranian branch, which is held to originate in northwestern India, going back to at least the fifth century BC. Previous genetic studies on low-resolution markers failed to infer the genomic history of the Sinhalese population. Therefore, we have performed a high-resolution fine-grained genetic study of the Sinhalese population and, in the broader context, we attempted to reconstruct the genetic history of Śrī Laṅkā. Our allele-frequency-based analysis showed a tight cluster of Sinhalese and Tamil populations, suggesting strong gene flow beyond the boundary of ethnicity and language. Interestingly, the haplotype-based analysis preserved a trace of the North Indian affiliation to the Sinhalese population. Overall, in the South Asian context, Śrī Laṅkān ethnic groups are genetically more homogeneous than others.

Introduction
Śrī Laṅkā is located at approximately 7° N and 81° E, where the Bay of Bengal meets the Indian Ocean. The surface area comprises 65,610 km² of land and water. The ancient Greeks referred to the island as Ταπροβάνη (Taprobánē), although they may also have applied this toponym to Sumatra. A now-lost indigenous name for the island, ultimately deriving from Sanskrit Siṃhaladvīpa, was transmogrified into Arabic Sarandīb and Persian Sarandīp and so entered European languages as Serendip. The Portuguese name, written Seylan, Ceylan, or Ceylon (cf. modern Portuguese Ceilão), preserves an old native Sinhala name for the island, which derived etymologically from Sanskrit Śrī Laṅkā. The name Ceylon was subsequently adopted by the Dutch, who governed the island from 1640 until 1796, and the British, who ruled Ceylon from 1796 to 1948. The Sanskritised tatsama loanword “Śrī Laṅkā” replaced the European rendition of the original old native Sinhala name in 1972.

The current census estimates Śrī Laṅkā to have 22 million inhabitants, of whom the Sinhalese represent the major ethnic group, comprising 74.9% of the population. Other ethnic groups include Śrī Laṅkān Tamils at 11.1%, Muslims or “Moors” at 9.3%, Indian Tamils at 4.1%, and others, such as Burgher, Malay, and Vedda (Adivasi), at 0.6%. It has been conjectured that hunter-gatherers with Paleolithic technology settled in Śrī Laṅkā perhaps as early as 125,000 years ago, but the earliest anatomically modern human fossil in Śrī Laṅkā dates from 28,500 years ago and was found at the Upper Pleistocene site of Batadombalena, evidently inhabited by humans from 36,000 years ago. Śrī Laṅkā was inhabited by Mesolithic hunter-gatherers until ca. 800–600 BC, when both cattle and agriculture were introduced by the bearers of an Iron Age culture with Black and Red Ware ceramics who practiced megalithic burials. The bearers of this new agricultural civilization are held to have been the Siṃhala, Ceylonese or Sinhalese. The Dīpavaṃsa and Mahāvaṃsa record that Prince Vijaya led the ancestral Siṃhala from Siṃhapura or Sihapura in Lāḷa or Lāṭa in what today is southern Gujarat. Vijaya reigned at the newly established Tambapaṇṇi ca. 468–448 BC.

Wilhelm Ludwig Geiger established that Koṅkaṇī, spoken on the Koṅkaṇ coast of India, represented the closest linguistic relative of both Sinhalese, spoken in Śrī Laṅkā, and Divehi, spoken in the Maldives. Geiger inferred that this demonstrable linguistic relationship reflected the ancient maritime migration across the Arabian Sea and Indian Ocean that first brought Divehi- and Sinhalese-speaking populations to their insular habitats in the first millennium BC. Geiger grouped both these languages with Koṅkaṇī, Marāṭhī, and Gujarātī, which in Turner’s classification together constitute the Southwestern sub-branch of the Indo-Aryan branch of Indo-European. The Sinhalese chronicles record that for nine months, the newly arrived Sinhala settlers endeavored to exterminate the native populace of the island, whom they called the yakkhas (Skt. yakṣa), which scholars have identified with the Veddas.

While the Sinhalese are associated with the earliest inscriptions on the island, dating from the time of Aśoka, it has been argued on linguistic grounds that the ancestors of the Tamils crossed the Palk Strait and settled in the North of Śrī Laṅkā at roughly the same time, viz. in the second half of the first millennium BC, during the cultural ferment that yielded the dawn of the Cōḻa dynasty on the subcontinent. This linguistic dating is supported by the fact that the thickest bundle of isoglosses runs, as one might expect, between the continental dialects of Tamil and the dialects of Ceylon.

After Sinhalese and Tamil colonization in the first millennium BC, Śrī Laṅkā’s geographical proximity to the Indian subcontinent was enhanced by close cultural ties. This same period saw the dawn of the great maritime Hindu and Buddhist expansion from the subcontinent into mainland and insular Southeast Asia, historically involving both gene flow and cultural transmission. Centuries later, Muslim traders arrived from Arabia, Malays from Malaya and, in the British colonial period, Indian Tamils from South India.

Only a few genetic studies, based on mtDNA and on Y- and X-chromosome markers, have been performed, and these confirmed the Sinhalese connection with mainland India. Some studies have shown that the Sinhalese have a distinct origin, while a few suggested a connection with South Indian populations. Analysis based on classical markers advocated a closer affinity of the Sinhalese population with South and West Indian populations than with the Bengalis. The question remains as to how the Sinhalese relate to the other peoples of Śrī Laṅkā in view of ongoing debates on the origin of the Sinhalese and the Śrī Laṅkān Tamils (STU). Therefore, in the present study, we have evaluated various alternatives to establish a molecular genetic perspective on the origin of the Sinhalese and to assess possible source populations and genetic admixture.

Śrī Laṅkā also represents an important staging area in any scenario involving the theory of a southern migration route, and so a better understanding of the population genetics of the Sinhalese may offer novel insights into the early peopling of South Asia. Genetic studies on the Śrī Laṅkān population are mainly limited to haploid DNA markers. The majority of Śrī Laṅkān individuals studied so far showed an overwhelming presence of South-Asian-specific haplogroups. However, a significant presence of West-Eurasian-specific haplogroups has also been detected. The most common West-Eurasian mtDNA haplogroups are U7 and U1. Thus, the West Eurasian connection of Śrī Laṅkā appears likely. The lack of autosomal studies is a gap that must be filled in order to understand the precise nature of the peopling of Śrī Laṅkā. Therefore, we have analyzed and evaluated the Śrī Laṅkān Sinhalese and Tamil groups for hundreds of thousands of genetic markers.

[Editor – from Wikipedia – “West Eurasia is a region that encompasses Europe, the Middle East, South-Central Asia, North Africa, and partly South Asia, Central Asia, and the Horn of Africa. The term “West-Eurasians” is often used in population genomics to refer to the populations of these regions. The genetic history of West Eurasians has been studied extensively, and it is known that they have a complex ancestry that includes multiple ancestral components.”]

Results and discussion

To gain a detailed understanding of the origin and migration of the Sinhala population, we first evaluated maternal gene flow in the Śrī Laṅkān population. We collected data from public sources and compared them with the maternal composition of South Indian populations. The Śrī Laṅkān and South Indian maternal gene pools overwhelmingly showed a South Asian affinity (Figure 1). However, we see a striking difference in the prominence of West Eurasian ancestry. If the West Eurasian ancestry of Śrī Laṅkā had arrived from mainland India, we would expect to see a significantly lower proportion of this ancestry in Śrī Laṅkā than in South Indian populations, but this was not the case. Instead, we observed a significantly higher frequency (two-tailed p < 0.0001) of West-Eurasian-specific maternal ancestry in Śrī Laṅkān populations (Figure 1). This high level of West Eurasian ancestry is consistent across all the major Śrī Laṅkān groups except the Indian Tamils, who represent a well-documented recent migration during the British colonial period, and the Moors, who overwhelmingly exhibit South Asian ancestry. This discrepancy can be explained by an independent West Eurasian contribution to Śrī Laṅkā, likely by a sea route and putative migration from Northwest India (Figure 1).

Figure 1. Comparison of maternal ancestry components between Śrī Laṅkān and South Indian populations
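
The significance of a difference in haplogroup frequencies between two regions can be checked with a test of two proportions. The article does not state which test produced the reported p value, so the sketch below assumes a two-tailed two-proportion z-test, and the counts are hypothetical placeholders rather than the study's data.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-tailed z-test for a difference between two proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of equal frequencies.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts of West Eurasian haplogroups (NOT the study's data):
# 60 of 300 Sri Lankan samples versus 50 of 600 South Indian samples.
z, p = two_proportion_z_test(60, 300, 50, 600)
print(f"z = {z:.2f}, two-tailed p = {p:.2e}")
```

With these made-up counts the West Eurasian share is 20% against roughly 8%, which the test reports as significant at p < 0.0001. For small counts an exact test would be the safer choice.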

To learn more about the West-Eurasian-related ancestry and the population history of the Śrī Laṅkān populations, we used hundreds of thousands of autosomal markers. We extracted a large comparative dataset, in addition to Indian samples, for autosomal analysis and merged these datasets with our newly generated genome-wide data. First, we performed principal component analysis (PCA) to assess population affinity. The scatterplot of the PC1 and PC2 eigenvectors (Figure 2) suggested that the Sinhalese, the Śrī Laṅkān Tamils in Śrī Laṅkā (STS), and the Śrī Laṅkān Tamils in the United Kingdom (STU) lie close to one another in a large cluster on the South Asian Indo-European to Dravidian cline. This finding suggests a close genetic affinity of the Sinhalese population with the Śrī Laṅkān Tamil population (Figure 2). To investigate ancestral components, we ran ADMIXTURE, which also showed (Figure 3) that the Sinhalese are more similar to Śrī Laṅkān Tamils than to the Indian populations, and that both possess a major South-Asian-related ancestral component. The light and dark green components specific to South Asian populations were nearly equally distributed in Sinhalese and Śrī Laṅkān Tamils (Figure 3).

Figure 2. The principal component analysis of studied populations with respect to the Eurasian populations
Figure 3. The bar plot of ADMIXTURE analysis showing the ancestral component sharing of studied populations. The Indian and Śrī Laṅkān ethnic groups are projected
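
As a rough illustration of how a first principal component separates genetically distinct groups, the sketch below normalizes a toy genotype matrix (centering each SNP and scaling by the square root of p(1 - p), the normalization commonly described for genotype PCA) and extracts PC1 by power iteration. It is a minimal standard-library sketch, not the smartpca implementation, and the genotype matrix is invented for the example.

```python
import math


def normalize_genotypes(genotypes):
    """Center each SNP by its mean and scale by sqrt(p * (1 - p))."""
    n_samples, n_snps = len(genotypes), len(genotypes[0])
    normalized = [[0.0] * n_snps for _ in range(n_samples)]
    for j in range(n_snps):
        mean = sum(row[j] for row in genotypes) / n_samples
        p = mean / 2.0  # estimated allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # a monomorphic SNP carries no information
        scale = math.sqrt(p * (1.0 - p))
        for i in range(n_samples):
            normalized[i][j] = (genotypes[i][j] - mean) / scale
    return normalized


def first_principal_component(genotypes, iterations=200):
    """Return the PC1 coordinate of every sample via power iteration."""
    x = normalize_genotypes(genotypes)
    n, m = len(x), len(x[0])
    # Sample-by-sample covariance matrix.
    cov = [[sum(x[i][k] * x[j][k] for k in range(m)) / m for j in range(n)]
           for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec


# Toy data: rows are samples, columns are SNPs coded as 0/1/2 allele counts.
# The first three samples and the last three form two invented groups.
toy = [
    [0, 0, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 0, 0, 0, 1],
    [2, 2, 1, 2, 0],
    [2, 1, 2, 2, 0],
    [1, 2, 2, 2, 1],
]
pc1 = first_principal_component(toy)
print([round(v, 3) for v in pc1])
```

In this toy example the two groups receive PC1 coordinates of opposite sign. Populations that fall close together on PC1 and PC2, as the Sinhalese, STS, and STU do in Figure 2, would instead receive similar coordinates.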

To ascertain the fine-scale genetic similarity, we performed haplotype-based fineStructure analysis. Consistent with the PCA and ADMIXTURE results, both Śrī Laṅkān populations shared a close genetic affinity (Figure 4) and fell in the same cluster. Both populations also shared a common clade with Indian Indo-European and Dravidian populations. The chunk count comparison suggested that both ethnic groups of Śrī Laṅkā received major chunks from each other and from Indian Indo-European and Dravidian populations.

Figure 4. The Maximum Likelihood (ML) tree of Eurasian populations shows the studied populations’ genetic affinity. The closest branches of our target populations are zoomed in and shown in a subset

To identify putative source populations for both ethnic groups, we first calculated f3-statistics against worldwide populations, testing pairs of candidate sources (pop1 and pop2) with the Sinhalese and Śrī Laṅkān Tamils (STU) as target populations. The admixture f3 results suggested that the Sinhalese and Śrī Laṅkān Tamils are admixed populations of Indian Indo-European and Dravidian ancestry (Table S1). Because the STU samples were collected in the UK, we compared them with native Śrī Laṅkān Tamils (STS) to check for any deviation in genetic composition. None of the analyses (PCA, ADMIXTURE, outgroup f3, and D-statistics) found a significant deviation of STU from STS (Figures 2, 3, and 4; Tables S1 and S2).

To measure gene flow, we computed D-statistics using the putative source populations identified above and Yoruba as the outgroup. The top ten D values for both populations suggested that strong gene flow occurred between the Sinhalese and Śrī Laṅkān Tamils (STU) in the past: in tests of the form (Yoruba; Sinhalese/STS; STU; X), they show negative D values when X is a North Indian population and positive D values when X is a South Indian population. We also calculated D-statistics to infer the direction of gene flow between North and South Indian populations (Yoruba; Sinhalese/STS/STU; X; Y); the results suggest that both populations received more gene flow from South Indian than from North Indian populations. However, we found slightly (but non-significantly) higher gene flow from some North and Northwest Indian populations than from South Indian populations (Table S2).
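
To make the two statistics concrete, the sketch below computes a naive f3 and a D-statistic from per-SNP allele frequencies, following the standard definitions f3(C; A, B) = mean[(c - a)(c - b)] and D(W, X; Y, Z) = Σ(w - x)(y - z) / Σ(w + x - 2wx)(y + z - 2yz). It omits the finite-sample bias correction and the block-jackknife standard errors that ADMIXTOOLS provides, and all frequencies are invented for the example.

```python
def f3_statistic(target, source1, source2):
    """Naive f3(target; source1, source2), averaged over SNPs.

    A clearly negative value is evidence that the target is admixed
    between populations related to the two sources.
    """
    terms = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(terms) / len(terms)


def d_statistic(w, x, y, z):
    """D(W, X; Y, Z) computed from per-SNP allele frequencies."""
    numerator = sum((wi - xi) * (yi - zi)
                    for wi, xi, yi, zi in zip(w, x, y, z))
    denominator = sum((wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
                      for wi, xi, yi, zi in zip(w, x, y, z))
    return numerator / denominator


# Hypothetical allele frequencies at four SNPs (NOT the study's data).
source_a = [0.1, 0.9, 0.2, 0.8]
source_b = [0.9, 0.1, 0.8, 0.2]
admixed = [0.5, 0.5, 0.5, 0.5]   # an even mixture of the two sources
outgroup = [0.0, 0.0, 0.0, 0.0]  # outgroup fixed for the reference allele
test_pop = [0.2, 0.8, 0.3, 0.7]  # frequencies resembling source_a

print(f3_statistic(admixed, source_a, source_b))             # negative
print(d_statistic(outgroup, test_pop, source_a, source_b))   # negative
```

In this toy setup f3 is negative because the target sits between the two sources, and D(outgroup, test_pop; source_a, source_b) is negative because test_pop shares more alleles with source_a than with source_b; swapping the last two populations flips the sign.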

These results are intriguing, considering the distinct linguistic affiliations of the Sinhalese and STU/STS. The results indicate strong gene flow beyond the boundaries of the ethnicities in question, which is rare in South Asia.

We also evaluated the admixture timing for both ethnic groups using ALDER. Sinhalese and Śrī Laṅkān Tamils were used as target populations, while other world populations were considered as source populations. After testing many combinations of sources, we obtained a few successful models for STU but only one for the Sinhalese. The inferred admixture dates were very recent for both populations. The low number of successful models might be due to extensive admixture, which may have prevented the software from using other populations as putative sources (Table S3).
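
Admixture dating of this kind rests on the exponential decay of admixture-induced linkage disequilibrium with genetic distance: the weighted LD at distance d (in Morgans) falls off roughly as exp(-n·d), where n is the number of generations since admixture. The sketch below recovers n from a synthetic decay curve by log-linear least squares. It is a deliberate simplification that ignores the affine offset and jackknife errors ALDER handles; the curve is synthetic, and the generation time used to convert to years is an assumption.

```python
import math


def fit_admixture_generations(distances_morgans, weighted_ld):
    """Estimate generations since admixture from an LD decay curve.

    Fits ln(LD) = ln(A) - n * d by ordinary least squares and returns n.
    Assumes strictly positive LD values and no constant offset.
    """
    xs = distances_morgans
    ys = [math.log(v) for v in weighted_ld]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic curve generated with n = 10 generations (NOT the study's data).
distances = [0.005 * k for k in range(1, 11)]          # 0.5 to 5 cM
curve = [0.002 * math.exp(-10 * d) for d in distances]

generations = fit_admixture_generations(distances, curve)
years_per_generation = 29.0  # assumed generation time, not from the article
print(round(generations, 3), round(generations * years_per_generation))
```

Because the decay rate is estimated from the curve's shape, a weak or noisy curve yields an unstable date, which is one way a model can fail to produce a usable estimate.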

Runs of homozygosity (RoH) were calculated to understand the marriage patterns of Sinhalese and Śrī Laṅkān Tamils. For each population, the mean number of segments was plotted against the mean segment length (in kb). The STS population clustered at the base of the scatterplot, followed by the Sinhalese, while STU showed longer and more numerous homozygous segments (Figure 5). The RoH results suggest that the effective population size (Ne) varies among these populations. These disparities could be due to sampling bias, as the STU samples were collected outside South Asia (UK). More Tamil samples from Śrī Laṅkā could help to resolve the disparity.

Figure 5. The Runs of Homozygosity (RoH) plot of target populations with respect to the other South Asian ethnic groups

To test the linguistic hypothesis that the Sinhalese language shares closer common ancestry with Koṅkaṇī, Marāṭhī, and Gujarātī, we performed identity by descent (IBD) analysis (Figure 6), comparing larger (2.0 to ∞ cM) and smaller (0–2 cM) chunks of DNA. When two populations admix, recombination events tend to break large DNA segments (chunks) apart. Over time, these segments become progressively smaller. Comparing large and small DNA segments can therefore help to distinguish recent from old admixture processes. Interestingly, we found an unexpected excess (>16%) of small-chunk sharing between Marāṭhā and Sinhala relative to that between Marāṭhā and STU, thus supporting the linguistic hypothesis of Geiger, Turner, and van Driem. To confirm the excess sharing, we looked for the population sharing the most IBD with Sinhala and STU. The South Indian Piramalai Kallar shared the highest IBD with both Sinhala and STU, for short as well as long DNA segments. We then asked whether Sinhalese or STU shared more DNA segments with Marāṭhā. The Piramalai Kallar shared nearly equal amounts of large DNA segments with Sinhalese and STU (Figure 6), whereas Marāṭhā shared significantly more (>16%) small segments with Sinhalese (two-tailed p < 0.001). This signal is also visible in the D-statistics, although there it was non-significant (Table S2). This excess sharing of smaller segments suggests a closer, deeply rooted common genetic ancestry of the Sinhalese with the Marāṭhā.

Figure 6. The scatterplot of IBD (Identity by descent) sharing for smaller (x axis) and larger (y axis) IBD segments

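
The comparison of small and large IBD chunks amounts to binning shared segments by length and summing them for each population pair. The sketch below shows one way to do that, using the 2 cM boundary given in the text; the segment list is hypothetical, and the function and variable names are illustrative rather than part of any IBD software.

```python
from collections import defaultdict


def summarize_ibd(segments, threshold_cm=2.0):
    """Sum shared IBD length per population pair in two size classes.

    segments: iterable of (population_a, population_b, length_in_cM).
    Segments shorter than threshold_cm count as "short", the rest as "long".
    """
    totals = defaultdict(lambda: {"short": 0.0, "long": 0.0})
    for pop_a, pop_b, length_cm in segments:
        pair = tuple(sorted((pop_a, pop_b)))  # make the pair order-independent
        size_class = "short" if length_cm < threshold_cm else "long"
        totals[pair][size_class] += length_cm
    return dict(totals)


# Hypothetical segments (NOT the study's data).
segments = [
    ("Sinhalese", "Maratha", 1.5),
    ("Maratha", "Sinhalese", 0.8),
    ("Sinhalese", "Maratha", 3.2),
    ("STU", "Maratha", 1.1),
    ("STU", "Maratha", 2.0),
]
for pair, classes in summarize_ibd(segments).items():
    print(pair, classes)
```

Plotting the "short" total on one axis and the "long" total on the other for every pair gives a scatterplot of the kind shown in Figure 6. In practice the totals would also be normalized by the number of individual pairs compared.
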
In conclusion, this is the first comprehensive analysis using high-throughput genome-wide autosomal data to compare two major, linguistically distinct ethnic groups of Śrī Laṅkā with ancient historical settlements. Our findings suggest a close genetic affinity of the Sinhalese with STU, irrespective of their linguistic affiliation. This phenomenon is rare in South Asia. The genetic homogeneity of Sinhalese and STU is probably due to long-term geographic proximity, which facilitated large amounts of gene flow. Furthermore, traces of the common roots of Sinhala with Marāṭhā can also be seen in the fine-grained genetic analysis. Thus, the genetic analysis of the Sinhalese adds another significant chapter to the history of the South Asian genetic landscape.

Limitation of the study

Although we corroborate the linguistic theory, our admixture time analysis failed to confirm the timeline, likely due to the absence of a true putative ancestral population. Further ancient DNA studies and Y-chromosomal sequencing would be useful to determine the migration timeline.

STAR★Methods

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Biological samples
Sinhala | Field collection from Śrī Laṅkā | See Figures 2 and 3 for population names
Śrī Laṅkā Tamils (STS) | Field collection from Śrī Laṅkā | See Figures 2 and 3 for population names
Śrī Laṅkā Tamils (UK) | 1000 Genomes (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) | See Figures 2 and 3 for population names
World population data | Pathak et al. | See Figures 2 and 3 for population names
Chemicals, peptides, and recombinant proteins
Agarose | MERCK | Cat# A9539
Tris EDTA Buffer, Molecular Biology Grade | Fisher Scientific | Cat# AAJ75893AP
Critical commercial assays
DNA extraction Kit | Qiagen | Cat# 51104
MinElute PCR Purification Kit | Qiagen | Cat# 28006
Illumina-Infinium Global Screening Array 1.0 (GSA-24v1-0) | | Cat# 20031669
Deposited data
The Genotype data of Sinhala | This study | https://doi.org/10.6084/m9.figshare.23975601
The Genotype data of Śrī Laṅkā Tamil | This study | https://doi.org/10.6084/m9.figshare.23975601
Software and algorithms
PLINK v1.9 | Chang et al. | https://www.cog-genomics.org/plink/1.9/
EIGENSOFT v6.1.4 | Patterson et al. | https://github.com/DReichLab/EIG
ADMIXTURE | Alexander et al. | https://dalexander.github.io/admixture/
ADMIXTOOLS | Patterson et al. | https://github.com/DReichLab/AdmixTools
BEAGLE 5.4 | Browning et al. | http://www.gnu.org/licenses/
fineStructure | Lawson et al. | https://people.maths.bris.ac.uk/~madjl/finestructure/fs-2.1.3.tar.gz
MEGA-X | Kumar et al. | https://www.megasoftware.net/show_eua
ChromoPainter | Lawson et al. | https://people.maths.bris.ac.uk/~madjl/finestructure/fs-2.1.3.tar.gz
ALDER | Loh et al. | http://cb.csail.mit.edu/cb/alder/alder_v1.03.tar.gz
Runs of homozygosity (RoH) | Chang et al. | https://www.cog-genomics.org/plink/1.9/
Merged IBD | Browning & Browning | http://www.gnu.org/licenses/
Refined IBD | Browning & Browning | http://www.gnu.org/licenses
mtDNA nomenclature | Van Oven & Kayser | Phylotree.org

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Gyaneshwer Chaubey.

Materials availability

This study did not generate new unique reagents.

Experimental model and study participant details

Study subjects

The study involved human participants from two major ethnic groups of Śrī Laṅkā. Blood samples (5 mL) were collected in EDTA tubes from thirteen individuals (aged 18–60 years), of whom nine belonged to the Sinhala and four to the Śrī Laṅkā Tamil population of Śrī Laṅkā. The three-generation rule was observed when collecting blood samples; no individuals were blood relatives of one another. Although the sample size is low for intra-population and genomic selection studies, it is sufficient for inter-population comparison and for understanding population history. We use the code STU for the Śrī Laṅkā Tamils sampled in the UK and the code STS for the Śrī Laṅkā Tamils sampled in Śrī Laṅkā.

Ethics statement

Ethical approval for the present study was obtained from the Ethics Review Committee, Faculty of Medicine, University of Colombo, Śrī Laṅkā, under approval number EC-17-147. Sampling was performed according to the standard guidance given by the ICMR (Indian Council of Medical Research), India, and each individual took part in interviews and questionnaires, which recorded information such as family, relations, and food habits.

Method details

DNA extraction and genotyping

DNA was extracted using the Puregene blood kit (Qiagen), according to the manufacturer’s instructions, at the Birbal Sahni Institute of Palaeosciences (BSIP), Lucknow, India.
All samples (n = 13) collected from Śrī Laṅkā were genotyped on the Illumina-Infinium Global Screening Array 1.0 (GSA-24v1-0) following Illumina’s recommended protocol, yielding 618,540 autosomal markers. Signal intensities detected by the GSA were converted to genotypes using Illumina AutoCall software with a GenCall threshold of 0.15. Primary quality control requirements included a per-sample log R standard deviation (SD) of less than 0.25 and call rates greater than 98.5% across the array (GSA-24 v1.0), or greater than 99.0% across the autosomes and chromosome X.
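
The primary quality control criteria above can be written as a small per-sample check. The sketch below is only an illustration of those thresholds: it reads the two call-rate criteria as alternatives (either one suffices), which is an interpretation of the sentence above, and the function name, argument names, and sample values are hypothetical.

```python
def passes_primary_qc(log_r_sd, array_call_rate, autosome_x_call_rate,
                      max_log_r_sd=0.25, min_array_call_rate=0.985,
                      min_autosome_x_call_rate=0.990):
    """Apply the per-sample thresholds stated in the text.

    Call rates are fractions between 0 and 1. The two call-rate criteria
    are treated as alternatives, which is an assumption of this sketch.
    """
    call_rate_ok = (array_call_rate > min_array_call_rate
                    or autosome_x_call_rate > min_autosome_x_call_rate)
    return log_r_sd < max_log_r_sd and call_rate_ok


# Hypothetical samples: (log R SD, array call rate, autosome + X call rate).
samples = {
    "sample_01": (0.12, 0.991, 0.993),
    "sample_02": (0.31, 0.999, 0.999),  # fails on log R SD
    "sample_03": (0.20, 0.980, 0.985),  # fails on both call rates
}
for name, metrics in samples.items():
    print(name, passes_primary_qc(*metrics))
```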

Data processing and population genetic analyses

The variant call format (VCF) file was converted to binary files with PLINK v1.9, applying the quality filters --maf 0.03, --geno 0.03, and --mind 0.03. The filtered samples were merged with the HGDP panel. We found 255,063 SNPs common to the samples and the panel, with a genotyping rate of 0.9987. PLINK v1.9 was used for data curation and management for the statistical analyses. The PC1 and PC2 eigenvectors in the principal component analysis (PCA) were generated with smartpca (EIGENSOFT v6.1.4), and the plot was generated with an in-house R script. We used ADMIXTURE to further estimate shared ancestry (K = 2 to K = 15); the ancestry was defined at K = 10, which gave the minimum cross-validation (CV) error of 0.5423.
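
For readers unfamiliar with these three filters, the sketch below mimics what they do on a small in-memory genotype matrix: samples with too many missing calls are dropped (--mind), then SNPs with too many missing calls (--geno), then SNPs whose minor allele frequency is too low (--maf). It is an illustration, not PLINK itself; applying the filters in this order is an assumption of the sketch, and the toy matrix uses looser thresholds than the study's 0.03 so that the effect is visible on four samples.

```python
def plink_style_filter(genotypes, maf=0.03, geno=0.03, mind=0.03):
    """Return indices of samples and SNPs that survive the three filters.

    genotypes: one list per sample holding alternate-allele counts
    (0, 1, 2) or None for a missing call.
    """
    n_snps = len(genotypes[0])

    # --mind: drop samples whose missing-call rate exceeds the threshold.
    kept_samples = [
        i for i, row in enumerate(genotypes)
        if sum(g is None for g in row) / n_snps <= mind
    ]

    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[i][j] for i in kept_samples]
        calls = [g for g in column if g is not None]
        # --geno: drop SNPs whose missing-call rate exceeds the threshold.
        if not calls or 1 - len(calls) / len(column) > geno:
            continue
        # --maf: drop SNPs whose minor allele frequency is below the threshold.
        freq = sum(calls) / (2 * len(calls))
        if min(freq, 1 - freq) < maf:
            continue
        kept_snps.append(j)
    return kept_samples, kept_snps


# Toy matrix (NOT the study's data): 4 samples by 4 SNPs.
toy = [
    [0, 1, 2, 0],
    [0, 1, None, 0],
    [None, None, None, 0],  # mostly missing, removed by the sample filter
    [1, 2, 2, 0],
]
print(plink_style_filter(toy, maf=0.1, geno=0.3, mind=0.3))
```

Here the third sample is removed for missingness, the third SNP is removed because one of the three remaining samples lacks a call, and the fourth SNP is removed because it is monomorphic.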

mtDNA analysis

We collected data from published sources for mtDNA comparison and manually reclassified them into their respective haplogroups following the latest nomenclature (Phylotree.org). The regional classification (East Eurasian, South Asian, and West Eurasian) of the mtDNA haplogroups was performed manually according to the presence of a particular haplogroup in that region.
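
The regional classification was done by hand, but its logic can be sketched as a lookup on haplogroup names. In the example below the mapping is a small illustrative subset and not the table used in the study: U7 and U1 are the West Eurasian haplogroups named in the article, while the remaining entries are commonly cited examples added as assumptions. Names are split into alternating letter and number tokens so that, for instance, M21 is not mistaken for a sub-branch of M2.

```python
import re

# Illustrative subset only; NOT the classification table used in the study.
REGION_BY_HAPLOGROUP = {
    "U7": "West Eurasian",  # named in the article
    "U1": "West Eurasian",  # named in the article
    "H": "West Eurasian",   # assumption for illustration
    "J": "West Eurasian",   # assumption for illustration
    "T": "West Eurasian",   # assumption for illustration
    "M2": "South Asian",    # assumption for illustration
    "R5": "South Asian",    # assumption for illustration
    "A": "East Eurasian",   # assumption for illustration
    "B": "East Eurasian",   # assumption for illustration
    "F": "East Eurasian",   # assumption for illustration
}


def classify_haplogroup(haplogroup):
    """Assign a region using the most specific matching ancestral clade."""
    # "M2a1" -> ["M", "2", "a", "1"], whereas "M21" -> ["M", "21"].
    tokens = re.findall(r"[A-Za-z]+|\d+", haplogroup)
    for end in range(len(tokens), 0, -1):
        region = REGION_BY_HAPLOGROUP.get("".join(tokens[:end]))
        if region is not None:
            return region
    return "Unclassified"


for name in ["U7a", "M2a1", "B4a", "M21", "U2b"]:
    print(name, classify_haplogroup(name))
```

Anything the table does not cover is returned as "Unclassified" so that it can be reviewed by hand rather than assigned silently.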

Quantification and statistical analysis

To understand population relationships, several f-statistics were computed with default settings, using the Yoruba population as the outgroup. To assess shared drift and gene flow patterns, we used f3 and f4 statistics, respectively, from the ADMIXTOOLS package. We phased the genotype data with BEAGLE 5.4 using default settings. The haplotype-based analysis was then performed with fineStructure, an MCMC-based program that uses likelihood modelling to calculate matrices. The resulting output matrix was used to construct the MCMC tree in MEGA-X. ChromoPainter was applied to estimate the chunk counts donated by reference populations to our target populations. ALDER was run with default settings to estimate admixture time using multiple source populations. To understand population dynamics, runs of homozygosity (RoH) were determined for each population using PLINK 1.9. The analysis was carried out with the --homozyg function, using 1000 kb windows, allowing one heterozygous call and five missing calls per window, and requiring a minimum of 100 SNPs per window. The window is moved successively along each individual’s genome, and for each SNP the proportion of overlapping windows that are homozygous is estimated. To assess identity by descent (IBD), we used Refined IBD and merged IBD analysis.
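
The sliding-window idea behind the RoH scan can be illustrated on a toy sequence of calls. The sketch below counts windows in SNPs rather than kilobases, scores each window as homozygous if it stays within the allowed numbers of heterozygous and missing calls, records for every SNP the proportion of overlapping windows that are homozygous, and then joins SNPs above a threshold into segments. It is a simplified stand-in for PLINK's procedure, and the window size, threshold, and minimum segment length used in the example are chosen for readability, not taken from the study.

```python
def homozygous_window_proportion(calls, window_snps, max_het=1, max_missing=5):
    """For each SNP, the share of overlapping windows scored homozygous.

    calls: list of "hom", "het", or "missing", one entry per SNP.
    """
    n = len(calls)
    hits = [0] * n
    totals = [0] * n
    for start in range(n - window_snps + 1):
        window = calls[start:start + window_snps]
        ok = (window.count("het") <= max_het
              and window.count("missing") <= max_missing)
        for k in range(start, start + window_snps):
            totals[k] += 1
            hits[k] += 1 if ok else 0
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]


def call_roh_segments(proportions, threshold, min_snps):
    """Join consecutive SNPs at or above the threshold into segments."""
    segments = []
    start = None
    for i, value in enumerate(proportions + [-1.0]):  # sentinel ends last run
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_snps:
                segments.append((start, i - 1))
            start = None
    return segments


# Toy individual: 10 homozygous calls, 4 heterozygous calls, 6 homozygous.
calls = ["hom"] * 10 + ["het"] * 4 + ["hom"] * 6
proportions = homozygous_window_proportion(calls, window_snps=4)
print(call_roh_segments(proportions, threshold=0.5, min_snps=5))
```

Counting the resulting segments per individual and averaging their lengths gives the two quantities plotted against each other in Figure 5.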

Data and code availability

Data reported in this paper are publicly available from the Figshare repository (https://doi.org/10.6084/m9.figshare.23975601).

This paper does not report original code.

Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Acknowledgments

We are grateful to the volunteers for donating their blood samples. Samples were collected under research supported by the National Research Council Sri Lanka, Grant No. 17-042. NR is supported by SERB-CRG/20-21/006762. GC is supported by ICMR ad hoc grants (2021-6389) and (2021-11289) and by the BHU IoE incentive grant (6031). The Open Access Article Processing Charge has been covered by the Institute of Eminence (IoE), Banaras Hindu University, India.

Author contributions

Conceptualization, R.R., N.R., and G.C.; sample collection, P.R.W., K.H.T., and R.R.; data generation, S.K., N.P., and N.R.; formal analysis, P.P.S., S.K., G.C., and G.vD.; writing – original draft, P.P.S., S.K., G.vD., and G.C.; writing – review & editing, K.H.T., R.R., and N.R.; supervision, R.R., N.R., and G.C. All authors approved the final draft of the manuscript and take responsibility for its content, including the accuracy of the data.

Declaration of interests

The authors declare no competing interests.

