Data source
This study uses data on funded projects and their associated research outputs, both retrieved from the NSFC Project Outcome Portal (footnote 2). The portal serves as the official platform where PIs are required to report the progress and outcomes of projects funded by NSFC upon project completion. Therefore, it is considered a highly comprehensive and accurate data source for understanding the landscape of NSFC funding. In August 2021, we collected the metadata for NSFC projects and corresponding outcomes, including the project title, project type, disciplinary field, year, funding amount, PI name, PI institution, and outcomes associated with each project. Project outcomes may include journal and conference publications, patents, reports, and books. This study analyzes journal and conference publications, which account for 90.6% of the total outcomes.
To assess the quality of the self-reported data, we examined the accuracy of English-language publications and author names in our dataset by cross-checking them against our in-house version of Web of Science. We extracted 374,294 English publications with available DOIs from our analytical sample and matched them with Web of Science records, finding that 295,871 (79%) were indexed. It is important to note that Web of Science applies selection criteria and does not index all journals. For the indexed publications, we compared the first, second, third, and last authors’ names (in Pinyin) between NSFC author bylines and Web of Science records. This comparison covered 1,177,940 author entries, including duplicates across papers. Our analysis showed a 97.7% match rate between the two sources, providing strong evidence that the NSFC database is highly consistent with Web of Science records. These findings validate the dataset’s accuracy in author attributions.
Disciplinary fields
NSFC organizes its programs based on disciplinary fields. NSFC has eight departments organized by disciplinary field: Mathematical and Physical Sciences (MPS), Chemical Sciences (Chem), Life Sciences (Life), Earth Sciences (Earth), Engineering & Materials Science (EMS), Information Sciences (Info), Management Sciences (MS), and Health Sciences (Health). Additionally, our data contain over 120 subdepartments that focus on smaller research fields within these disciplinary departments. We refer to the NSFC disciplinary departments as fields and to the subdepartments as subfields. Although most of these field names are self-evident from the English-language perspective, it is important to note a few ontological differences embodied in these names. One is the Information Sciences, which is devoted to the areas of “the generation of signals, acquisition, storage, transmission, processing, and utilization of information” (National Natural Science Foundation of China 2021). These areas are strongly situated in the information domain of computer science, despite the field’s name in China’s research enterprise. Another is Management Sciences. This department is centered on “research on improving the understanding of objective law in management and economic activities” (National Natural Science Foundation of China 2021). Consequently, this disciplinary field contains many research domains that are considered social sciences, such as economics, public administration, sociology, and library and information science. For consistency, the subsequent analysis by disciplinary field relies on the disciplinary department classification by NSFC.
Project types
Each of the disciplinary departments funds a range of project types. This study focuses on the following four types: Key Projects (Key), General Projects (General), Young Scientist Projects (Young), and Projects for Less Developed Regions (Region). These four project types account for about 95.8% of all projects we collected (see Supplement Table S1). The Region program typically funds projects (up to four years) proposed by PIs affiliated with institutions located in economically less developed regions in China, where funding at the province level is relatively limited and institutions are usually ranked lower than those in more economically developed regions. The Key, Young, and General projects have a broader set of eligible applicants than the Region program, yet differ in many ways, which leads to a somewhat “laddered” structure based on PI seniority. A Key project (1.4% of total projects) usually spans five years. It is usually considered the most prestigious among the four, with most of its PIs being established scholars in their fields with prior NSFC project experience. A General project usually spans four years and has the broadest set of eligible applicants among the four: all researchers in universities with permanent PI status are eligible to apply. General projects account for about 48.7% of all projects in our sample. The Young project (42.6%) typically lasts three years and is open to scholars under the biological age of 35. Therefore, Key projects are usually considered more prestigious than others and go to senior PIs more often, while Young projects are primarily for junior PIs, with General projects in the middle of the hierarchy.
Furthermore, this study focuses on projects funded by NSFC starting from 2010 (project funding year), given the more comprehensive coverage of projects and outcomes after 2010 in the dataset. We chose 2015 as the ending project funding year, as projects analyzed in this study may take up to five years to complete. Using 2015 as the ending year ensures that we have complete outcome information for all projects by the time we collected the data. The final analytical sample used in this study contains 185,465 projects funded by NSFC between 2010 and 2015 and 2,323,443 corresponding publications (in both Chinese and non-Chinese languages). The projects were awarded to PIs associated with 2,757 universities from all 31 provinces and regions in China.
Gender imputation
Gender imputation of names in Chinese characters
In our study, all PI names and author names of Chinese-language publications are in Chinese characters. We identified 880,311 unique names associated with these PIs and authors. We used a Python package, ngender (J. Hu 2015), to infer the binary genders associated with names in Chinese characters. ngender calculates the Bayesian probability of a gender category for an individual name by considering all characters and their combinations in the given name. It has been shown to be one of the best-performing gender-inference tools available for names in Chinese characters (Zhao and Kamareddine 2018).
In predicting the gender of names in Chinese characters, ngender allows for user-defined thresholds on the probability used to decide the gender of a given name. The default threshold in ngender is 50%: if the predicted Bayesian probability of a given name belonging to a certain gender is more than 50%, that gender is assigned to the name. Yet, the 50% threshold was found to have some limitations due to gender-neutral names in Chinese culture, where female names are more likely to be gender-neutral than male names (Huang and Wang 2022). For higher reliability of the gender imputation procedure, we tested gender prediction using a range of thresholds in 10% increments from 50% to 80%. We used the self-reported gender information for 10,000 doctoral recipients in China to evaluate the performance of ngender (C. Wang et al. 2021; J. Yang et al. 2022). We chose 60% as the cut-off threshold for our gender prediction based on the evaluation results, by F1 score, and the share of PIs with predicted genders (see Supplement Fig. S1).
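The threshold comparison described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the record layout, the function names, and the choice to compute F1 only over names that receive a label are assumptions, and the toy records are invented solely to show the mechanics. In practice the predicted label and probability would come from the gender-inference tool.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record layout:
# (self-reported gender, predicted gender, predicted probability)
Record = Tuple[str, str, float]

THRESHOLDS = (0.5, 0.6, 0.7, 0.8)  # 50% to 80% in 10% steps


def assign_gender(predicted: str, prob: float, threshold: float) -> Optional[str]:
    """Keep the predicted label only if its probability exceeds the threshold."""
    return predicted if prob > threshold else None


def evaluate(records: Sequence[Record], threshold: float,
             positive: str = "female") -> dict:
    """Return F1 (for the `positive` class, over labelled names only) and
    coverage (share of names that receive a label) at one threshold."""
    tp = fp = fn = assigned = 0
    for true_gender, predicted, prob in records:
        decided = assign_gender(predicted, prob, threshold)
        if decided is None:
            continue  # name left without a gender at this threshold
        assigned += 1
        if decided == positive and true_gender == positive:
            tp += 1
        elif decided == positive:
            fp += 1
        elif true_gender == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    coverage = assigned / len(records) if records else 0.0
    return {"threshold": threshold, "f1": f1, "coverage": coverage}


# Toy records, invented for illustration only (not data from the study).
TOY_RECORDS = [
    ("female", "female", 0.90),
    ("male", "male", 0.55),
    ("female", "male", 0.65),
    ("male", "female", 0.58),
]

if __name__ == "__main__":
    for t in THRESHOLDS:
        print(evaluate(TOY_RECORDS, t))
```

Raising the threshold generally trades coverage for precision, which is why both the F1 score and the share of names with a predicted gender are inspected together when choosing a cut-off.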
Gender imputation of names in Pinyin
NSFC-funded projects yield publications in both Chinese (34.2% of total publications) and non-Chinese languages (65.8% of total publications, mostly English). When publishing in languages other than Chinese, it is normal practice for authors to write their names in Pinyin in the author byline. To study the gender composition of teams comprehensively, we also need to predict the gender of authors of English publications where author names appear in Pinyin format. As a Pinyin name can be associated with multiple names in Chinese characters, predicting the gender based on names in Chinese characters should produce more accurate results. To better predict the gender of authors in non-Chinese publications, we matched their names in Pinyin back to the corresponding Chinese characters within the same team. We also accounted for variations of author names when expressed in Pinyin format. For example, Xueran Wang may be transliterated as XR Wang or Xue-ran Wang in different publications. The latter two forms are then also matched with the Chinese characters associated with Xueran Wang in the same NSFC project. We then predicted the gender of authors of non-Chinese publications based on their matched names in Chinese characters.
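A minimal sketch of this within-project matching is given below. It is illustrative only: the function names, the normalization rules, the assumption that bylines place the surname last, and the Chinese characters attached to the example names are all hypothetical, and the Pinyin syllables are supplied by hand rather than produced by a romanization library.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical member layout:
# (name in Chinese characters, surname in Pinyin, given-name syllables)
Member = Tuple[str, str, List[str]]


def normalize(byline_name: str) -> str:
    """Lowercase, drop hyphens and periods, and join the given-name tokens.
    Assumes the surname is the last token of the byline name."""
    tokens = re.sub(r"[.\-]", " ", byline_name.lower()).split()
    if len(tokens) < 2:
        return "".join(tokens)
    *given, surname = tokens
    return "".join(given) + " " + surname


def variant_keys(surname: str, syllables: List[str]) -> Set[str]:
    """Normalized keys for the full given name and for its initials."""
    full = "".join(s.lower() for s in syllables)
    initials = "".join(s[0].lower() for s in syllables)
    return {f"{full} {surname.lower()}", f"{initials} {surname.lower()}"}


def build_project_index(members: List[Member]) -> Dict[str, str]:
    """Map each Pinyin variant to a Chinese-character name within one project.
    Variants shared by different members are dropped as ambiguous."""
    index: Dict[str, str] = {}
    ambiguous: Set[str] = set()
    for chinese, surname, syllables in members:
        for key in variant_keys(surname, syllables):
            if key in index and index[key] != chinese:
                ambiguous.add(key)
            index[key] = chinese
    for key in ambiguous:
        index.pop(key, None)
    return index


def match_byline(byline_name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the matched Chinese-character name, or None if unmatched."""
    return index.get(normalize(byline_name))


# Hypothetical project team; the characters are illustrative assumptions.
TEAM = [
    ("王雪然", "Wang", ["xue", "ran"]),
    ("李明", "Li", ["ming"]),
]

if __name__ == "__main__":
    idx = build_project_index(TEAM)
    for name in ("Xueran Wang", "XR Wang", "Xue-ran Wang", "Y Zhang"):
        print(name, "->", match_byline(name, idx))
```

Restricting the lookup to a single project keeps the candidate set small, which is what makes abbreviated forms such as initials usable as keys; across the whole dataset the same initials would map to many different people.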
The above procedure allows us to infer the gender categories of authors who published in non-Chinese languages while also maintaining reasonable precision, as the gender imputation was based on Chinese characters. Author name disambiguation was performed at the project level: we treat authorships associated with the same name (in Chinese characters or Pinyin) within a single NSFC team as belonging to the same individual, on the grounds that it should be rare for multiple individuals within the same project to share a name. Our project-level disambiguation is key to correctly connecting individuals with publications, given the more significant ambiguities among names used in East Asian countries, including China (S. B. Xu and Hu, 2024).
Based on the procedures described above, our final analytical sample contains 2,049,337 (91.4% of the total) publications associated with 180,534 (98.1%) projects that have at least one author whose gender category was successfully predicted. It is worth mentioning that our process removes 17.9% of all non-Chinese publications. Overall, we were able to match 46.01% of author instances (authorships) in Pinyin to their corresponding Chinese characters. To ensure the accuracy of gender assignment, our method excludes authors who are neither PIs nor authors in any Chinese-language publications, which we acknowledge is a limitation of our study. For author names in Pinyin, we analyzed the relationship between an author’s position in the byline and the likelihood of their name matching an existing name in Chinese characters. Within publications with the same number of authors, this relationship shows a U-shape, with the first and last authors showing the highest match percentages (see Supplementary Fig. S2). This pattern indicates that our analysis covers a larger share of leading authors (first and last authors), who are often the major contributors to research (Ni et al. 2021). Supplement Table S2 shows the details of the final analytical sample used in this study.
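The byline-position analysis can be illustrated with a short sketch. The input layout and function name are assumptions made for this example, and the toy rows are invented to show the computation, not drawn from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical authorship layout:
# (number of authors on the paper, 1-based byline position, matched?)
Authorship = Tuple[int, int, bool]


def match_rate_by_position(
    authorships: Iterable[Authorship],
) -> Dict[Tuple[int, int], float]:
    """Share of Pinyin authorships matched to Chinese characters, grouped by
    (author count, byline position) so that positions are compared only
    among papers with the same number of authors."""
    counts = defaultdict(lambda: [0, 0])  # key -> [matched, total]
    for n_authors, position, matched in authorships:
        cell = counts[(n_authors, position)]
        cell[0] += int(matched)
        cell[1] += 1
    return {key: m / t for key, (m, t) in counts.items()}


# Toy rows, invented for illustration only.
TOY_AUTHORSHIPS = [
    (3, 1, True), (3, 1, True),
    (3, 2, False), (3, 2, True),
    (3, 3, True),
]

if __name__ == "__main__":
    rates = match_rate_by_position(TOY_AUTHORSHIPS)
    for key in sorted(rates):
        print(key, rates[key])
```

Grouping by author count before comparing positions matters because the last position on a three-author paper is not comparable to the third position on a ten-author paper.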