A combination of genome‐wide and transcriptome‐wide association studies reveals genetic elements leading to male sterility during high temperature stress in cotton

Summary Global warming has reduced the productivity of many field‐grown crops, as the effects of high temperatures can lead to male sterility in such plants. Genetic regulation of the high temperature (HT) response in the major crop cotton is poorly understood. We determined the functionality and transcriptomes of the anthers of 218 cotton accessions grown under HT stress. By analyzing transcriptome divergence and implementing a genome‐wide association study (GWAS), we identified three thermal tolerance associated loci which contained 75 protein coding genes and 27 long noncoding RNAs, and provided expression quantitative trait loci (eQTLs) for 13 132 transcripts. A transcriptome‐wide association study (TWAS) confirmed six causal elements for the HT response (three genes overlapped with the GWAS results) which are involved in protein kinase activity. The most susceptible gene, GhHRK1, was confirmed to be a previously uncharacterized negative regulator of the HT response in both cotton and Arabidopsis. These functional variants provide a new understanding of the genetic basis for HT tolerance in male reproductive organs.


Introduction
Global warming has raised the average global temperature by c. 1.5°C since the beginning of the 20 th century and has led to unpredictable high temperatures (HTs) (Seneviratne et al., 2018). Although the yields of many crops have increased in recent decades, sensitivity to environmental stresses has also increased, which is problematic as the climate continues to change (Lobell et al., 2014). Major crops are reported to be vulnerable to the impacts of climate change (Schauberger et al., 2017;Zhao et al., 2017). Cotton is cultivated globally for the production of its natural fibers, and it is commonly affected by HTs. Several reports indicated that HT stress could lead to male sterility in cotton, which results in substantive yield losses (Pettigrew, 2008;Min et al., 2014;Zahid et al., 2016). Thus, breeding cotton germplasm with HT tolerance and understanding the mechanisms involved in the HT response in male reproductive organs is necessary.
In flowering plants, the male reproductive development process involves male meiosis (tetrad stage), microspore development (tapetum degradation stage) and anther dehiscence (anther dehiscence stage) (Ma, 2005). High temperature stress influences the male reproductive organs in particular, resulting in abnormal anther structure and impaired pollen viability. Recent research has uncovered the molecular basis of the HT response in male reproductive organs. Genes which regulate energy substances such as lipids and sugars have been found to be essential for the HT response in Arabidopsis, maize and cotton. The disordered accumulation of energy substances during the tetrad stage of anthers usually results in male sterility under HT stress (Begcy et al., 2019;Khan et al., 2020;Zhu et al., 2020). Accumulated auxin has been shown to mitigate pollen sterility in Arabidopsis and wheat under HT conditions (Sakata et al., 2010;Abbas et al., 2018). However, excessive auxin leads to male sterility in cotton anthers under HT stress, indicating a distinct regulatory role of plant hormones in various species (Min et al., 2014;Ding et al., 2017). Furthermore, disrupted epigenetic modifications involving small RNA, DNA methylation and tRNA modification could disorder gene expressions and lead to male sterility in maize, cotton and rice (Y. Chen et al., 2019;Teng et al., 2020;Xu et al., 2020). These findings imply that disturbed gene expression in the early developmental stages of male reproductive organs could lead to pollen sterility.
Several studies have elucidated heat-response pathways and related metabolic networks, but the genetic architecture of complex stress response traits in agriculturally important crops is much less well understood. By implementing genomic variation maps and advanced bioinformatic strategies, genome-wide association studies (GWASs) based on linkage disequilibrium (LD) have produced remarkable advances in the genetic dissection of complex traits through the use of efficient algorithms. An association study can analyze genetic correlations between single nucleotide polymorphism (SNP) markers and noted phenotypes to provide candidate regions in the genome. Subsequent LD analysis of candidate intervals allows high resolution QTL mapping and causal gene identification and cloning. Over the past decade, the genetic basis of a large number of traits has been determined in major crops (Li et al., 2013;Mao et al., 2015;Zhou et al., 2015;Wang et al., 2017;Z. Ma et al., 2018). A comprehensive GWAS has identified 4820 genes contributing to 13 fiber-related traits in cotton, providing novel genetic resources for fiber improvement (Z. . By combining the GWAS technique and pairwise LD analysis, it has been confirmed that an F-box protein (ZmFBL41) confers resistance to banded leaf and sheath blight in maize . Meanwhile, parallel transcriptome sequencing provides an expression profile of each gene as well as comprehensive variations in the gene coding regions. These genetic resources facilitate the identification of eQTLs which could regulate the expression of genes. Thus far, eQTL regulatory maps have been successfully constructed in rice (Wang et al., 2010;Kuroha et al., 2017), maize Liu et al., 2017Liu et al., , 2020Wang et al., 2018), lettuce , tomato (Ranjan et al., 2016) and cotton fiber . These eQTL maps uncover the critical roles of genomic variation in the regulation of gene expression. Furthermore, eQTLs can be used to estimate the effects on gene expression and then be combined with physical phenotypes to conduct transcriptome-wide association studies (TWASs) to identify pivotal expression-trait associations (Gusev et al., 2016). The TWAS algorithm has been successfully implemented in cotton fiber and rapeseed to identify causal genes for important agronomic traits Tang et al., 2020).
Cotton is mainly cultivated in summer, which means that cotton could be a useful object of study with which to research the mechanism of the HT response in male reproductive organs. Currently, the reasons for male sterility under HT stress in cotton require further elucidation. In order to uncover genetic regulatory mechanisms under HT stress in the anther, we performed massively parallel RNA sequencing (RNA-seq) of anthers for a collection of 218 cotton accessions, cultivated for cotton production, providing abundant SNPs directly from transcriptionally active regions of the genome and potential genes for HT tolerance.

Determination of pollen viability
The population was planted in Alear, Xinjiang province in 2015; Turpan, Xinjiang province in 2016; and Wuhan, Hubei province in 2016 and 2017. In Wuhan, the sample panel was planted in two blocks at a field density of 27 000 plants ha -1 (Dai et al., 2017). More than 12 individuals of each accession were planted in each row in each replicate. A thermometer was set in the field to show the highest and lowest daytime temperatures. High temperature (HT) stress was judged to occur when the daily maximum temperature was > 37°C for > 3 d. In both Alear and Turpan in Xinjiang province, two blocks were set in the field with a plant density of 195 000 ha -1 . More than 30 individuals of each accession were set in each row. The thresholds for HT stress were the same as those used in Wuhan. Detailed field images and meteorological data are included in Supporting Information Fig. S1 and Table S1.
Flowers subjected to HT stress were harvested and immersed in 2,3,5-Triphenyl tetrazolium chloride (TTC) solution (8 g TTC dissolved in 1 l phosphate buffer; pH = 7) to stain the pollen for viability. After staining for 1 h at room temperature, 2% (v/v) sulfuric acid solution was used to stop the staining reaction. Pollen grains were pipetted onto microscope slides and photographed using a Zeiss (Oberkochen, Germany) Axio Scope A1 microscope. At least three view fields were taken for each flower, and a total of > 50 flowers were harvested for TTC staining per sample. Pollen viability was measured as the ratio of normally stained pollen grains to total pollen counts. Phenotyping was accomplished following a multiple-environment experiment which aimed to eliminate environmental effects. The broad sense heritability of pollen viability was estimated to be 0.937 (H 2 ). The best linear unbiased predictions (BLUPs) (Poland et al., 2011) were used to evaluate pollen viability using the LME4 package in R.

Sample preparation and RNA sequencing
All 218 accessions were cultivated in flat plots in a randomized block design group with two replicates in Alear, Xinjiang, China, in the summer of 2015. Based on findings from our previous study, cotton anthers underwent the tetrad stage while bud lengths were from 5 to 8 mm . In order to determine the precise tetrad stage of each accession, the buds with lengths from 5 to 8 mm were harvested and divided into groups according to length, on a 0.5 mm scale. After petal removal, four anthers from each bud were selected, crushed and placed under the microscope. Only the anthers which contained typical tetrads were considered to be at tetrad stage. The anthers were harvested into tubes and stored at À70°C or immediately placed in liquid nitrogen for further analysis.
Extraction of RNA was carried out using a Spectrum Plant Total RNA Kit (cat. no. STRN250-1KT; Sigma). In brief, c. 100 mg anthers were collected and ground in liquid nitrogen, and the subsequent steps were completed according to the manufacturer's instructions. For the sequencing libraries, concentrated ploy-A mRNA was fragmented and reverse-transcribed using random primers. RNA samples were sent to Novogene (Tianjin, China) to construct standard Illumina transcriptome sequencing libraries.

Reads mapping and variation calling
Low quality reads were eliminated using TRIMMOMATIC software (Bolger et al., 2014). The remaining paired-end reads were subsequently mapped to the allotetraploid cotton TM-1 genome  using the BWA software package, with default parameters (Li & Durbin, 2009). Sequencing reads produced by polymerase chain reaction (PCR) duplicates for each accession were filtered using PICARD, and only the unique mapped reads were retained in BAM format files. SAMTOOLS and the Genome Analysis Toolkit (GATK) were used to perform SNP calling. In order to obtain high quality SNPs, the variations which were detected by both tools and covered by sequenced reads at least three times were retained for further analysis. Single nucleotide polymorphisms with a missing rate > 0.4 and a minor allele frequency (MAF) < 0.01 were discarded. Annotation of SNPs was performed using SNPEFF software (Cingolani et al., 2012).

Population genetic analysis
The SNPs with MAF > 0.05 were used for phylogenetic and population structure analysis. From this selection, randomly chosen SNPs were subjected to population structure analysis using the STRUCTURE software package (Falush et al., 2003) with the following parameter: K2 to K9. We repeated this process at least five times. All compressed results files were then uploaded to online analysis toolkit STRUCTURE HARVESTER (http://taylor0.biology.uc la.edu/structureHarvester/) to perform consecutive analyses. A neighbor-joining tree was constructed based on SNPs (MAF > 0.05) using PHYLIP software (Lim & Zhang, 1999), and concomitantly visualized using the online tool ITOL (https:// itol.embl.de/; Letunic & Bork, 2011). Principal component analysis (PCA) was performed using a combination of PLINK and GCAT software. Single nucleotide polymorphisms with MAF > 0.05 were pooled in to PLINK software (--make-bed) to generate GCAT identifiable files (--bfile --make-grm) (Purcell et al., 2007).

Linkage disequilibrium
Linkage disequilibrium was calculated using TASSEL software with customized parameters (-fork1 -ld -ldWinSize 1000000) (Bradbury et al., 2007). The LD decay was calculated based on the values of r 2 between two SNPs and averaged in 1 kb windows with a maximum distance of 1000 kb. Linkage disequilibrium was calculated for each subgenome of allotetraploid cotton (At and Dt).

Gene expression analysis and identification of long noncoding RNAs
Quality controlled sequencing reads for each accession were mapped to the allotetraploid cotton reference TM-1 genome using the HISAT2 aligner (--dta-cufflinks --min-intronlen 40 -max-intronlen 100000) (Kim et al., 2015a). Expression of protein coding genes (PCgenes) was predicted using CUFFLINKS software (parameter -G) with an annotated 'gff3' file. Data normalization was performed by transforming mapped transcript reads to fragments per kilobase per million mapped fragments (FPKM). The software package RMATS was used to identify intron retention (IR) events, and the genes which have IR in at least 20 accessions were harvested for analysis of overlapping SNPs and introns (Shen et al., 2014).
While identifying long noncoding RNAs (lncRNAs), CUFFLINKS analysis was carried out with parameter -g. With the 'gff3' file guiding identification, novel transcripts in each accession were merged by CUFFMERGE. CUFFCOMPARE processing was applied to compare assembled transcripts to the annotated genome information. After comparison, the transcript symbols (u: novel transcripts; x: antisense transcripts) in the 'gtf' file were extracted to identify lncRNAs as follows: (1) transcripts identified < 10 times were discarded (< 5% of 218); (2) transcripts originating from tRNA and rRNA were removed (BLAST E < 1 9 10 À5 ); (3) transcripts showing protein coding potential against Swiss-Prot and Pfam (BLAST E < 1 9 10 À5 ) were eliminated; (4) transcripts that were deemed to have low coding potential by the CODING-NON-CODING INDEX (CNCI) tool were also discarded; and (5) transcripts with length < 200 bp were removed. Finally, all the sequencing reads were mapped to lncRNA transcripts to verify validity and expression levels.

Genome-wide association studies
Single nucleotide polymorphisms with MAF > 0.05 were obtained to perform association studies. To avoid false positives, population structure was regarded as a random effect by using the kinship (K) matrix in a mixed linear model (MLM). Single nucleotide polymorphisms with MAF > 0.05 were acquired for the calculation of K. By computing the genomic inflation factor (k = 1.0020), false positive association signals arising as an effect of population structure could be controlled. We performed a GWAS using MLM embedded in the TASSEL software, and the GWAS threshold was set to Àlog 10 (1/n), according to the Bonferroni-adjusted significance threshold. The genomic inflation factor was calculated using the P-value of each SNP in the MAT-LAB software package.
Association analysis in the LD block was performed using the LDHEATMAP package in R. The 280 kb LD decay rate was used to analyze predicted genes within the 560 kb interval in which the peak SNP was centered in the chromosome.

Identification of expression quantitative trait loci (eQTLs)
Firstly, a quantile transformation was applied to normalize the expression level of each transcript using the 'qqnorm' function in R. The SNPs with MAF > 0.05 were used to assess the association between SNPs and normalized expression traits. The MATRIXEQTL software in R was used to identify positively associated SNPs (Shabalin, 2012). To increase the number of positive association results, the P-value cutoff was set to P < 1 9 10 À6 until a low Pvalue reached 1 9 10 À20 . While identifying eQTLs, the significant associated SNPs with distances < 20 kb were classified into the same cluster, and a cluster with at least four significant SNPs was regarded as a putative eQTL. Consecutive eQTLs located within 20 kb were regarded as the same effect sites and were merged (Li et al., 2013).
To identify the eQTL hotspots, the statistical algorithm described by Silva et al. (2014) was implemented. We applied different window sizes to scan the hotspots (1, 2, 5, 10, 20, 50 kb). Finally, 5 kb was set as the window size, as the average length of genomic sequences of genes was approximately 5 kb. Putative hotspots which affected > 5 genes and passed the false discovery rate (FDR) adjustment (0.05) were regarded as true hotspots.

Co-expression network construction and gene ontology analysis
The FPKM values after log transformation of PCgenes and lncRNAs of all accessions were integrated into one input file to construct a weighted gene co-expression network using the WGCNA package in R (Langfelder & Horvath, 2008). A median absolute deviation was calculated by using all log FPKM value. Genes which were expressed in > 10 accessions and whose absolute deviation ranged in 85% of median absolute deviation were obtained. The genes whose absolute deviation more than 85% of median absolute deviation were regarded as outliers and then were excluded. 73 989 genes were retained for construction of the network. Each module was set to analyze the correlation using the Pearson correlation coefficient, and gene ontology (GO) enrichment analysis was performed for enrichment in the test module compared to the reference group using Fisher's exact test.

Transcriptome-wide association study
The FUSION software package was used to perform the TWAS. Cis-SNPs (located within 500 kb of the physical position of the gene) of PCgenes and lncRNAs were harvested as described in a previous study (Gusev et al., 2016). Genes with no local eQTLs or Cis-SNPs were removed to avoid potential false heritability estimation using GCAT (embedded in FUSION). To have the best expression prediction, the model choice (--models top1, blup, bslmm, lasso, enet in FUSION) was applied (top1, single best eQTL; blup, best linear unbiased predictor; bslmm, Bayesian sparse linear model; lasso, LASSO regression; enet, elastic-net regression). After prediction, the genes with the best model performance (R 2 values of at least 0.1) were retained for further analysis Gusev et al., 2019). Single nucleotide polymorphisms used in the GWAS were captured as reference SNPs for LD value, and Z-scores were calculated using the 'qnorm' function in R, based on P-value and effect size of SNPs from the GWAS.

Tissue dissection
Tissue dissection was similar to the procedure we used previously (Y. . In brief, after HT stress treatment, anthers were fixed in 50% FAA (50 ml absolute ethanol, 10 ml 37% formaldehyde solution, 5 ml acetic acid and diluted with water to 100 ml) after removal of bracts and petals. Dehydration used a graded ethanol series (30%, 50%, 70%, 95% and 100%). Tissue was embedded in paraffin and sectioned at 10 lm thickness.

Full-length coding sequence cloning of GhHRK1
Two pairs of primers were designed to clone GhHRK1 using overlap-extension methods. Two fragments with respective sequence lengths of 1071 bp and 1832 bp were first cloned and then extended to amplify the full-length sequence of GhHRK1. The related primers are listed in Table S2.

Quantitative reverse-transcription polymerase chain reaction (qRT-PCR)
Using the M-MLV (cat. no. M1701; Promega) system, 3 lg total RNA was reverse-transcribed. An ABI 7500 Real Time PCR system was used to perform qRT-PCR. The 2 ÀDDCt method was used to calculate gene expression levels. The expression levels were normalized to internal control GhUB7 (Ghir_A11G011460), to standardize RNA content and integrity.

Research
New Phytologist confirmed by genotyping PCR. While all accessions were undergoing the flowering period, > 15 plants of each accession were subjected to HT stress (31°C for 2 d) treatment and subsequently returned to normal temperature (NT) conditions. After 10-15 d recovery, the NT controlled and HT treated siliques were sampled and immediately soaked in 75% ethanol solution to decolorize overnight. A Nikon (Tokyo, Japan) SMZ-25 stereomicroscope was used to capture images of the seed setting in the siliques. The seed setting rate was calculated as the ratio of the number of setting seeds to the total number of ovules in a whole silique. Each assay was repeated twice with at least six siliques for each technical replicate.
For the Arabidopsis pollen staining, HT treated flowers at stage 13-14 were sampled, then immersed in TTC solution to stain for 30 min. The anthers were placed on glass slides and photographed under the microscope.

In vivo and semi-in vivo pollen tube elongation assays
For in vivo pollen tube elongation assays, flowers at floral stage 12 (1 d before anthesis) were emasculated for cross pollination. After pollination, the pistils were placed on solid pollen germination medium (Boavida & McCormick, 2007) for incubation in a growth chamber under NT and HT conditions for 20 h. Subsequently, the pistils were sampled and fixed in Carnoy's Fluid (6 : 3 : 1, ethanol : chloroform : acetic acid) for at least 6 h. The fixed pistils were subsequently rehydrated using a graded ethanol series (75%, 50%, 30%, distilled water), followed by softening in 2 M NaOH overnight. For pollen tube staining, 0.1% (w/v) aniline blue solution (dissolved in 108 mM K 3 PO 4 , pH~11) was applied. Images of whole pistils were captured using a Leica (Bensheim, Germany) SP8 Lightning confocal microscope to show the elongation of pollen tubes. Each assay was repeated for twice with at least 10 siliques for each technical replicate.
For semi-in vivo pollen tube assay, styles of emasculated wildtype (WT) pistils were cut and placed on the solid pollen germination medium for 4 h after pollination. The NT control was set at 22°C, while the HT treatment was set at 31°C in both semiin vivo and in vivo pollen tube germination assays.

Transgenic complementation assays
The GhHRK1 sequence fragment was cloned into the pMDC83 vector using the Gateway cloning system (cat. no. 11789013; Invitrogen). The constructed vector (35S:GhHRK1-GFP) was introduced into WT, hrk1-1 and hrk1-2 using the floral dip method. A vector containing the 35S:GFP fragment was simultaneously introduced into WT and mutants as a transgenic control.

High temperature (HT) stress disrupts pollen activity differentially in different accessions
We performed phenotypic analysis of pollen viability among the 218 natural cotton accessions (Table S3). Where the accessions suffered high temperature (HT) stress (> 3 d on which the daily maximum temperature is > 37°C), the flowers were harvested for evaluation of pollen viability. 2,3,5-Triphenyl tetrazolium chloride (TTC) staining and microscopic analysis were carried out to determine pollen viability for each cultivar (Fig. 1a,b). Under HT stress, 42 accessions showed very high pollen viability (i.e. > 90% of the pollen was viable) and 21 accessions had low viability (< 25%); the others exhibited intermediate pollen viability.
RNA sequencing reveals genetic diversity among the accessions Microspore sterility has been observed at the tapetum degradation stage following heat stress (Y. , the aberrant gene expressions and metabolism are expected to occur at an early developmental stage (Grienenberger et al., 2010;Kim et al., 2015b). In order to uncover the initial response to HT, we sampled anthers at the tetrad stage for each accession that suffered HT stress, and we performed RNA-sequencing for both genotyping and transcriptome analysis (Tables S3, S4). After lowquality-read trimming and read mapping, a total of 3 352 750 SNPs were obtained. After filtering for missing rate < 0.4 and MAF > 0.01, 549 216 SNPs were identified for further analysis. The number of SNPs was much lower than that found for maize  and lettuce  in a similar population set, suggesting there are fewer variations across the cotton accessions. To validate the SNPs, we compared SNPs from 38 accessions in this study with the variation data for these same cotton lines described in Wang et al. (2017). Our SNP data showed a 95.1% identity with the SNP data from Wang et al. (2017), suggesting a high degree of consistency between the datasets (Table S5). Analysis of the physical position of these SNPs revealed that a large number were positioned in intergenic and intron regions (Fig. S2a). The SNPs located in genic regions showed a relatively high abundance in coding regions rather than in the 3 0 -UTR and 5 0 -UTR (Fig. S2b,c). The IR events could lead to introns being retained in mRNA. Due to the high fraction of intron-located SNPs, we analyzed IR events in the population (Table S6). We found that 54 293 genes had IR in at least 20 accessions, and 184 289 SNPs (33% percent of 549 216) were located in the retained introns. While only 2866 SNPs (0.5% percent of 549 216) were located in the spliced intron according to transcriptome data, the high IR frequency could explain the high fraction of intron located SNPs.
Following SNP annotation, the population genetic architecture was investigated. The SNPs were subjected to population structure analysis, phylogenetic tree construction, PCA and LD calculations. It was found that the 218 accessions could be organized into three genetic groups (Fig. S3), with 114 (red in Fig. 1c,d), 39 (green) and 65 (blue) samples in the three groups. The PCA showed that the accessions belonging to the green and blue groups were clustered separately, while the red group accessions were found in both clusters. The accessions belonging to the red group were all bred in China, while the samples classified into the green and blue groups were mainly derived from Russia, Kazakhstan and the USA (Figs 1e, S4). The LD decay physical distance was estimated at 286 kb, as pairwise r 2 from maximum (0.45) to half value (Figs 1f, S5). This value is slightly lower than the results reported by Wang et al. (2017) (296 kb) and Shen et al. (2019) (298 kb). Overall, our 218 accessions showed a natural LD and population structure, and were considered suitable for further analysis.

Transcriptome data integration and co-expression network analysis
Gene expression profiles under HT in anthers were investigated in our previous study using two different cotton lines (Min et al., 2014;, but the small sample size meant that the results were limited. In order to comprehensively understand the transcriptome divergence under HT stress in a large population, we first analyzed the expression of all transcripts, including protein coding genes (PCgenes) and long noncoding RNAs (lncRNAs), for each of the 218 accessions, and processed the samples by hierarchical clustering . In the clustering tree, accessions harboring parallel phenotypes were clustered into adjacent groups, relating transcriptome differences to phenotypic variations (Fig. S6).
LncRNAs perform pivotal functions in organ development and epigenetic modification of the genome in humans and other mammals (Iyer et al., 2015). One lncRNA transcript generated from the antisense transcript of a heat shock factor gene (HSFB2a) was found to participate in gametophyte development in Arabidopsis (Wunderlich et al., 2014). Comprehensive identification of lncRNAs during pollen development has been accomplished in Brassica . However, there are few studies reporting the functions of lncRNAs in relation to heat stress in cotton anther development. The abundance of SNPs located in intergenic regions implied that a substantial proportion of the transcripts were possible lncRNAs (Fig. S2a). The RNA-seq data was subsequently used to identify lncRNAs in all accessions. After merging loci and filtering for protein coding potential, a total of 26 158 lncRNA loci were identified (14 094 loci in the At subgenome and 12 064 loci in the Dt subgenome; Fig. 2a; Table S7). We overlapped the SNPs and the lncRNA loci, and found that c.16% (89 269 of 549 216) of the SNPs were located in the lncRNA regions. The lncRNAs in the cotton anther showed

Research
New Phytologist similar guanine-cytosine (GC) content to those identified in cotton fiber and root Zhang et al., 2018). In Arabidopsis, a large number of lncRNAs have been found to be derived from transposable elements (TEs), with unclear roles (Bohmdorfer et al., 2016). We studied the composition of lncRNAs in TEs in tetraploid cotton anther, and found that lncRNAs overlapped much more with TEs than with PCgenes, especially in the lncRNA body region (Fig. 2b). Approximately 30% of lncRNA loci overlapped with long terminal repeats (LTRs)the upland cotton genome comprises a large number of LTRs (Fig. S7). However, long interspersed nuclear elements (LINEs) and LTR/Copia were represented at relatively high proportions (15% and 20%) among the lncRNA loci, compared to their proportions across the whole genome (1.7% and 12% respectively). These results were similar to the observation that active lncRNAs are predominantly derived from LINEs in interspecific cotton hybrids . Interestingly, an unexpectedly high proportion of TEs overlapped with PCgenes' upstream regions (the promoters) and downstream regions (almost 50% of PCgenes) (Fig. 2b). Transposable element transcription is usually suppressed by DNA methylation in order to stabilize the genome (Autran et al., 2011), and the proportion of TEs overlapping with PCgenes suggested that gene expression in the cotton anther may be largely regulated via epigenetic mechanisms.
To understand the regulatory relationship between lncRNAs and PCgenes, we combined the expression data from PCgenes with lncRNA data, to construct a co-expression network (soft thresholding power b = 28; cut-off = 0.85) (Fig. S8). A total of The expression patterns of genes in the black module were investigated (Fig. 2c). 108 accessions with increased expression trends displayed lower pollen viability, indicating that these genes may perform negative regulatory functions in response to HT stress (Fig. S9d). The module includes 9 hub genes with 271 transcripts (209 PCgenes and 62 lncRNAs) which are involved in programmed cell death and hormone response (Fig. S11), indicating that a complex regulatory network is involved the HT stress response (Fig. 2d; Table S8).

Genome-wide association study identifies three loci associated with the HT stress response
Use of the GWAS technique has facilitated the identification of genes and QTLs linked to important agricultural traits (Si et al., 2016;Wu et al., 2019). In order to identify genes involved in the HT stress response in pollen grains, we collected 175 430 SNPs with MAF > 0.05 to perform a GWAS based on the pollen viability values. Association analysis conducted using these SNPs provided three loci associated with pollen viability at P < 6.3 9 10 À6 (Bonferroni-adjusted significance threshold 1/n, n = 175 430). The association results obtained using a mixed linear model are shown in the form of a Manhattan plot and a quantile-quantile plot in Fig. 3(a,b). The haplotypes based on the peak SNPs from three associated signals showed distinct association with the phenotype of pollen viability under HT stress (Fig. 3c-e). When comparing the three loci to the 160 QTLs of 16 traits reported previously , we found there were no QTLs overlapping with these three loci, suggesting that the tolerance of HT in cotton may have an independent regulatory mechanism. The LD in upland cotton was estimated to be 280 kb, and the three associated loci contain 75 PCgenes (Fig. S12), of which 20 had no annotations (Table S9). Among the 55 annotated genes, 16 were annotated as potential protein kinases, with 10 genes homologous to the Arabidopsis gene At4g27290, encoding a G-type lectin receptor kinase family protein. The other genes were annotated as being involved in the response to energy substances and stamen development (Fig. 3f).

Transcriptome wide association study (TWAS) identified
Ghir_A01G006180, a gene that confers male sterility under HT stress Critical functions of genetic variants which influence gene expression, known as eQTLs, have been found to affect agronomic traits. Robust eQTL mapping has been carried out in maize (Fu  (Wang et al., 2010;Kuroha et al., 2017), tomato (Ranjan et al., 2016), lettuce  and cotton fiber . Based on the expression and SNP data, we carried out transcriptome-wide eQTL identification. The SNPs with MAF > 0.05 were used to perform the association study, with the significance cutoff set at P < 1 9 10 À6 , until the lowest P-value for the peak SNPs reached 1 9 10 À20 .
A total of 32 396 eQTLs were identified, with effects on 8975 PCgenes and 4157 lncRNAs, covering 14% of the genes in the cotton genome (8975 of 70 198 PCgenes and 4157 of 26 158 lncRNAs) (Fig. 4a; Table S10). Among local eQTLs, the peak SNPs were mostly located within 10 kb of the affected genes (Fig. 4b). Individual eQTLs which could affect numerous (> 5) distant genes, known as eQTL hotspots, were also studied. We identified a total of 179 eQTL hotspots, which occupied 205 kb across the genome (Table S11). The intervals influencing pollen viability under heat stress on chromosomes A01, D01 and D05 were also found to be eQTL hotspots (Fig. 4c). The genes Ghir_A01G006180, Ghir_D01G006520 and Ghir_D01G006540, which were candidate genes in the three QTLs, were also found to be influenced by eQTLs, indicating a solid association between gene expression and pollen viability in the cotton anther (Fig. 4c,d).
We compared our eQTL-containing genes to the 125 genes contributing to fiber development that were published previously  and found 19 overlapping genes, suggesting dissimilar regulatory mechanisms for fiber and male reproductive organs.
Based on the gene expression levels and cis-SNP data, we further performed a TWAS to associate gene expressions with the different pollen viability phenotypes. According to our results ( Fig. S13; Table S12), six genes (four PCgenes and two lncRNAs) showed significant association with the pollen viability trait, supporting the idea that disruption of gene expression at early developmental stages can result in male sterility under HT. Among these four PCgenes, three genes were represented in the 75 candidate PCgenes associated by the GWAS, the most significant being Ghir_A01G006180, suggesting that it is very likely that this gene is a component of the HT response in anthers.
GhHRK1 is a negative regulator during the HT stress response in early anther development Thirteen significant SNPs (P < 6.3 9 10 À6 ) located in Ghir_A01G006180 showed an association with pollen viability (Figs 5a, S14). According to the existing genome data (Wang Ghir_A01G006180 encodes a putative protein tyrosine kinase, with protein sequence similarity to the Arabidopsis G-type lectin receptor kinase gene At4g27290. While analyzing the protein architecture of Ghir_A01G006180, we found that the protein domains seemed to be duplicated, while the homologous At4g27290 only contained half of the protein domain structure of Ghir_A01G006180 (Fig. S15). Likewise, only the first seven exons of Ghir_A01G006180 were found to have read coverage in low pollen viability accessions (Fig. S16), which led us to question the existing gene model annotation data. We referred to another TM-1 cotton genome dataset described by Hu et al. (2019), and found that the Ghir_A01G006180 locus contained two genes, referred to as GH_A01G0682 and GH_A01G0683. After sequence alignment, we found there was an incorrect annotation at the Ghir_A01G006180 locus -GH_A01G0682 was the causal gene at the Ghir_A01G006180 locus (Fig. S17). We finally designated GH_A01G0682 as Heat-related Receptor Kinase GhHRK1. The full-length coding sequence of GhHRK1 was cloned using overlap-extension methods, and the sequence is given in Table S13. Using the refined gene model, we found that there were nine SNPs in the 3 0 downstream of GhHRK1, one SNP located in the intron between the second and third exons, and three SNPs that could cause nonsynonymous mutations ( Fig. S18; Table S14). The most notable difference in terms of GhHRK1 between low viability and high viability accessions is the expression level. We therefore hypothesized that mis-regulated expression of GhHRK1 in an early developmental stage resulted in the male sterility in pollen grains. We first amplified a 2000 bp upstream fragment of GhHRK1 from each accession to analyze the variations in the promoter. No significant structural variations were found among the accessions. We further annotated the regulatory motifs in the promoter using the PLANTCARE online tool (Table S15). Aside from the conserved CAAT-box and TATAbox, there were four binding sites for MYB transcription factors, one motif for gibberellin response (GARE) and one motif

Research
New Phytologist for auxin response (AuxRR). In our previous study, AtMYB24 transcription factor was upregulated in an early developmental stage of the anthers in Arabidopsis under HT stress . Meanwhile, an MYB transcription factor was found to be a negative regulator of the heat response in Arabidopsis seedlings, indicating that the unexpected expression of GhHRK1 may be disordered by abnormal transcriptional regulation (Liao et al., 2017). We further analyzed the expression of transcription factors related to MYB, GARE and AuxRR domains in the accessions showing distinct pollen viability. There were no notable differences in expression levels between the GARE and AuxRR transcription factors. However, some MYB transcription factors showed higher expression levels in the low pollen viability accessions, which may explain the disordered expression of GhHRK1 at an early stage (Fig. S19).
To test the hypothesis that GhHRK1 contributes to the heat stress response in cotton, we first analyzed the expression of GhHRK1 in our HT-sensitive cotton genotype H05. We found that GhHRK1 was expressed in early meiosis (5-6 mm buds), at the tapetum degradation stage (9-14 mm buds) and at the anther dehiscent stage (> 24 mm buds) under NT conditions. High temperature stress disrupted the expression profile of GhHRK1, leading to an increased expression level at early developmental stages (5-10 mm buds) but a decreased level at the tapetum degradation stage (Fig. 5b). We determined the variations in expression of GhHRK1 across the population. The accessions which exhibited low pollen viability under HT stress showed relatively high accumulations of GhHRK1 mRNA transcripts (Fig. 5c). In situ hybridization analysis was carried out in eight accessions which exhibited distinct pollen viability under HT conditions (Fig. S20). Samples subjected to HT stress showed that GhHRK1 was abundant in the tapetum and tetrads in the low viability accessions at tetrad stage, suggesting that GhHRK1 expression was associated with a loss of pollen viability in response to HT stress in the early cotton anther, and may be part of a negative regulatory pathway of cell viability (Fig. 5d). We also investigated the expression of GhHRK1 at the tapetum degradation stage in the same accessions. Increased expression of GhHRK1 was found in the microspores and tapetum of high pollen viability accessions, while low expression of GhHRK1 and aberrant microspore structure were present in the low viability accessions. This suggests that GhHRK1 may normally be activated at the tapetum degradation stage (Fig. S21).

Arabidopsis homologous GhHRK1 mutant displayed enhanced HT tolerance
According to The Arabidopsis Information Resource website (TAIR, https://www.arabidopsis.org/), there are 23 G-type lectin receptor kinase genes in Arabidopsis. The homologous analysis and phylogenetic tree showed that At4g27290 is most closely adjacent to GhHRK1 (Fig. S22). We next searched Arabidopsis mutant pools to find the homologues of GhHRK1 (referred to as At4g27290) to elucidate the biological function of At4g27290 under HT stress in Arabidopsis. Two mutant lines, SALK_067606C (hrk1-1) and SALK_129987C (hrk1-2) were obtained from the SALK Confirmed T-DNA Project (Alonso et al., 2003). According to data from the gene model (Fig. 6a) and the flanking sequence on the TAIR website, two mutant lines were confirmed to be carrying target T-DNA (Fig. 6b). These two mutant lines showed reinforced vegetative development under NT conditions during the seedling period (Fig. S23). Higher pollen viability was also found in the mutant pollen grains after 3 d of HT treatment (Fig. S24). Few results have been reported regarding the function of At4g27290, but it has been shown to have potential functions relating to cell polarity (Bruex et al., 2012), implying that this gene may function in the pollen tube elongation process during HT stress. Therefore, the WT Col-0 line and mutant lines hrk1-1 and hrk1-2 were subjected to HT treatment (31°C for 2 d) at the flowering stage. A higher seed setting rate was found in the mutants after HT treatment, compared to WT, while there was no obvious difference in seed setting rate between the WT and mutants under NT conditions (Figs 6c,d, S25). Interestingly, no seeds were found at the bottom part of any of the examined siliques from the WT plants under HT conditions, indicating that almost no pollen tubes had reached the bottoms of the siliques (Fig. 6c,d).
To validate this inference, an in vivo pollen tube elongation assay was carried out under NT and HT conditions, in both WT and mutants. Three pollen tube elongation assays were conducted, referred as WT 9 WT, WT 9 hrk1-1 and WT 9 hrk1-2. The emasculated WT pistils were pollinated and subsequently subjected to a HT treatment, in which they were placed in chamber at 31°C for 20 h, while the pistils under 22°C were set as control. Pollen tubes of all three combinations reached the bottom of the pistils under NT conditions. Under HT conditions however, WT pollen tubes only reached the mid-point of the pistils. The pollen tubes of hrk1-1 and hrk1-2 elongated to the bottom of the pistils, as normal, under HT stress (Fig. 6e). To dissect the phenotype of shorter pollen tubes of WT under HT stress, a semi-in vivo pollen tube assay was carried out as per the in vivo pollen tube assay. The pollen tubes elongated normally for all three combinations under NT control conditions; however, HT stress repressed the elongation of WT pollen tubes and resulted in the exposure of pollen tubes in WT (Fig. 6f). These results indicate that At4g27290 can clearly affect pollen viability, explaining why the WT pollen tubes could not reach the bottom of the pistils and why there were no setting seeds in the bottom parts of the siliques.
To confirm the functional similarity of GhHRK1 and At4g27290, we introduced GhHRK1 into WT, hrk1-1 and hrk1-2 (Fig. S26). The transformation of 35S::GFP was used as a control. Under NT conditions lower pollen viability and seed setting rate were detected, both in transgenic complement WT and mutant lines (Figs 7a,c,e, S27). Meanwhile, we found much more severe male sterility and a markedly decreased seed setting rate, simultaneously, in GhHRK1 transgenic complement lines under HT stress (Figs 7b,d,f, S27). These results suggest that GhHRK1 may have a similar function to At4g27290, playing a critical role in the HT response in plants.

Discussion
Anther development is a genetically complex process (Zhao et al., 2016) that can be adversely influenced by environmental stresses (Stromme et al., 2015). The results described in this study show that RNA-sequencing can be used to associate important QTLs involved in male reproductive development with a stress response in a complex crop genome. The transcriptome-wide association study technique was also shown to be more efficient for the positioning of transcriptional regulators of male reproduction than the use of less rigorous correlation analyses (such as the Pearson correlation coefficient). Genes that respond to heat stress have been characterized in Arabidopsis, rice and tomato Shen et al., 2015), but the link between stress and male reproductive development is much less well understood. Transcriptome divergence could explain the molecular basis of phenotypic differences, via comparisons of the functions of differentially expressed genes. In our previous study, the altered auxin and sugar metabolism pathway was found to influence male fertility under HT stress through a comparison between two samples (Min et al., 2014). However, it is difficult to understand comprehensively the mechanisms underlying a phenotype using website. hrk1-1, SALK_067606C; hrk1-2, SALK_129987C. (b) Mutant genotyping using polymerase chain reaction (PCR). Our findings confirmed that hrk1-1 and hrk1-2 carried T-DNA in the target gene region. (c) Images of decolorized siliques of wild-type (WT) and mutants under normal temperature (NT) control conditions and high temperature (HT) treatment. The number of seeds decreased more in the WT than in the mutants (indicated by a red rectangle) under HT treatment, while there were no significant differences in seed setting rates among the three lines under NT conditions. (d) A histogram plot for seed setting rates of WT and mutants under NT and HT conditions, respectively. Asterisks indicate significant difference according to Student's ttest (**, P < 0.01). The values are mean AE SD. (e) In vivo pollen tube elongation assays of WT, hrk1-1 and hrk1-2. Both WT and mutant pollen tubes reached the bottom of the pistils under NT conditions, while HT stress notably repressed the pollen elongation in WT (indicated by a red rectangle). (f) Semi-in vivo pollen tube elongation assays of WT, hrk1-1 and hrk1-2. Pollen tubes of hrk1-1 and hrk1-2 elongated normally under HT stress, while WT pollen tubes elongated less than those in mutants. Exposure of pollen tubes could be was observed in WT pollen tubes under HT stress (indicated by a red rectangle).
such limited data. The current study benefited from the availability of a large number of accessions. By constructing a co-expression network, several lncRNAs and PCgenes were found to participate in the response to HT stress. This revealed for the first time a role for lncRNAs in male reproduction under HT (Table S7).
A GWAS provides association analysis, and eQTL maps have achieved single-gene resolution cloning in rice (Si et al., 2016) and maize . In our study, three loci were found by GWAS analysis to respond to HT, showing that GWASs can be used to dissect the genetic regulation of traits in allopolyploid cotton as well as in diploid plant species. To connect trait and candidate gene expression, some researchers directly test preliminary correlations (such as Pearson's correlation) to construct associations in the panels in which traits and expressions both exist, and this has generated reliable results (Wainberg et al., 2019). However, analyses for which no RNA-seq was available, but which relied on genotyping and phenotyping, were inefficient at identifying regulators because of their limited expression data. The TWAS technique aims to solve this problem and is becoming a powerful tool which provides abundant expression and association data aided by improved algorithms. The fact that TWASs have been successfully implemented in cotton fiber  and rapeseed (Tang et al., 2020) provided motivation to conduct a TWAS in cotton anther. By applying TWAS, GhHRK1 was associated with a pollen phenotype, and it is one of 75 PCgenes detected by the GWAS, suggesting that TWASs can be used to support the findings of GWAS analyses and so facilitate the identification of causal factors more precisely.
Mis-expression of GhHRK1 at the tetrad stage has been shown to result in male sterility in HT-sensitive cotton accessions, suggesting that GhHRK1 acts as a negative regulator in the HT stress response. It has also been demonstrated that 3 d HT treatment can result in pollen sterility at the anther dehiscent stage in WT Arabidopsis, indicating that At4g27290 performs similar functions in the HT stress response at the early anther development stage. Introduction of GhHRK1 into Arabidopsis mutants results in male sterility under HT conditions, possibly indicating that the mechanism of the HT stress response is conserved across the plant kingdom. In summary, we used RNA-sequencing to construct the first transcriptome-wide variation maps for the cotton anther, and linked genetic variation to HT tolerance by phenotyping pollen viability. These results provide potential targets for the improvement of cotton to satisfy the demand of high yield, multi-resistant breeding.

Supporting Information
Additional Supporting Information may be found online in the Supporting Information section at the end of the article.                                    Table S10 Summary of eQTLs in the genome.
Table S11 Detailed information of eQTL hotspots.
Table S12 Transcripts associated with pollen viability identified by transcriptome-wide association study.
Table S13 Full-length coding sequence for GhHRK1.
Table S14 Annotation of significantly associated SNPs in GhHRK1.
Table S15 Annotation of motifs in the promoter of GhHRK1.
Please note: Wiley Blackwell are not responsible for the content or functionality of any Supporting Information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.