Standardized NEON organismal data for biodiversity research

Understanding patterns and drivers of species distribution and abundance, and thus biodiversity, is a core goal of ecology. Despite advances in recent decades, research into these patterns and processes is currently limited by a lack of standardized, high-quality, empirical data that span large spatial scales and long time periods. The National Ecological Observatory Network (NEON) fills this gap by providing freely available observational data that are generated during robust and consistent organismal sampling of several sentinel taxonomic groups within 81 sites distributed across the United States and will be collected for at least 30 years. The breadth and scope of these data provide a unique resource for advancing biodiversity research. To maximize the potential of this opportunity, however, it is critical that NEON data be maximally accessible and easily integrated into investigators' workflows and analyses. To facilitate their use for biodiversity research and synthesis, we created a workflow to process and format NEON organismal data into the ecocomDP (ecological community data design pattern) format, made available through the ecocomDP R package; we then provided the standardized data as an R data package (neonDivData). We briefly summarize sampling designs and data wrangling decisions for the major taxonomic groups included in this effort. Our workflows are open-source so the biodiversity community may: add additional taxonomic groups; modify the workflow to produce datasets appropriate for their own analytical needs; and regularly update the data packages as more observations become available. Finally, we provide two simple examples of how the standardized data may be used for biodiversity research. By providing a standardized data package, we hope to enhance the utility of NEON organismal data in advancing biodiversity research and encourage the use of the harmonized ecocomDP data design pattern for community ecology data from other ecological observatory networks.


INTRODUCTION (OR WHY STANDARDIZED NEON ORGANISMAL DATA)
A central goal of ecology is to understand the patterns and processes of biodiversity, and this is particularly important in an era of rapid global environmental change (Blowes et al., 2019; Midgley & Thuiller, 2005). Such understanding is only possible through studies that address questions such as: How is biodiversity distributed across large spatial scales, ranging from ecoregions to continents? What mechanisms drive spatial patterns of biodiversity? Are spatial patterns of biodiversity similar among different taxonomic groups, and if not, why do we see variation? How does community composition vary across spatial and environmental gradients? What are the local- and landscape-scale drivers of community structure? How and why do biodiversity patterns change over time? Answers to such questions will enable better management and conservation of biodiversity and ecosystem services.
Biodiversity research has a long history (Worm & Tittensor, 2018), beginning with major scientific expeditions (e.g., Alexander von Humboldt, Charles Darwin) aiming to document global species lists after the establishment of Linnaeus's Systema Naturae (Linnaeus, 1758). Beginning in the 1950s (Curtis, 1959; Hutchinson, 1959), researchers moved beyond documentation to focus on quantifying patterns of species diversity and describing mechanisms underlying their heterogeneity. Since the beginning of this line of research, major theoretical breakthroughs (Brown et al., 2004; Harte, 2011; Hubbell, 2001; MacArthur & Wilson, 1967) have advanced our understanding of potential mechanisms causing and maintaining biodiversity. Modern empirical studies, however, have been largely constrained to local or regional scales and focused on one or a few taxonomic groups, because of the considerable effort required to collect observational data. There are now unprecedented numbers of observations from independent small and short-term ecological studies. These data support research into generalities through syntheses and meta-analyses (Blowes et al., 2019; Li et al., 2020; Vellend et al., 2013), but this work is challenged by the difficulty of integrating data from different studies and with varying limitations. Such limitations include the following: differing collection methods (methodological uncertainties); varying levels of statistical robustness; inconsistent handling of missing data; spatial bias; publication bias; and design flaws (Koricheva & Gurevitch, 2014; Martin et al., 2012; Nakagawa & Santos, 2012; Welti et al., 2021). Additionally, it has historically been challenging for researchers to obtain and collate data from a diversity of sources for use in syntheses and/or meta-analyses (Gurevitch & Hedges, 1999).
Barriers to meta-analyses have been reduced in recent years to bring biodiversity research into the big data era (Farley et al., 2018; Hampton et al., 2013) by large efforts to digitize museum and herbarium specimens (e.g., iDigBio), successful community science programs (e.g., iNaturalist, eBird), technological advances (e.g., remote sensing, automated acoustic recorders), and long-running coordinated research networks. Yet, each of these remedies comes with its own limitations. For instance, museum/herbarium specimens and community science records are increasingly available, but are still incidental and unstructured in terms of the sampling design, and exhibit marked geographic and taxonomic biases (Beck et al., 2014; Geldmann et al., 2016; Martin et al., 2012). Remote sensing approaches may cover large spatial scales, but may also be of low spatial resolution and unable to reliably penetrate vegetation canopy (Palumbo et al., 2017; Pricope et al., 2019). The standardized observational sampling of woody trees by the US Forest Service's Forest Inventory and Analysis and of birds by the US Geological Survey's Breeding Bird Survey has been ongoing across the United States since 2001 and 1966, respectively (Bechtold & Patterson, 2005; Sauer et al., 2017), but covers few taxonomic groups. The Long Term Ecological Research Network (LTER) and Critical Zone Observatory (CZO) both are hypothesis-driven research efforts built on decades of previous work (Jones et al., 2021). While both provide considerable observational and experimental datasets for diverse ecosystems and taxa, their sampling and dataset design are tailored to their specific research questions and a priori standardization is not possible. Thus, despite recent advances, biodiversity research is still impeded by a lack of standardized, high-quality, and open-access data spanning large spatial scales and long time periods.
The recently established NEON provides continental-scale observational and instrumentation data for a wide variety of taxonomic groups and measurement streams. Data are collected using standardized methods, across 81 field sites in both terrestrial and freshwater ecosystems, and will be freely available for at least 30 years. These consistently collected, long-term, and spatially robust measurements are directly comparable throughout the observatory, and provide a unique opportunity for enabling a better understanding of ecosystem change and biodiversity patterns and processes across space and through time (Keller et al., 2008).
NEON data are designed to be maximally useful to ecologists by aligning with FAIR principles (findable, accessible, interoperable, and reusable; Wilkinson et al., 2016). Despite meeting these requirements, however, there are still challenges to integrating NEON organismal data (e.g., occurrence and abundance of species) for reproducible biodiversity research. For example, field names may vary across NEON data products, even for similar measurements; some measurements include sampling unit information, whereas units must be decided for others. These issues and inconsistencies may be overcome through data cleaning and formatting, but understanding how best to perform this task requires a significant investment in the comprehensive NEON documentation for each data product involved in an analysis. Thoroughly reading large amounts of NEON documentation is time-consuming, and the path to a standard data format, as is critical for reproducibility, may vary greatly between NEON organismal data products and users, even for similar analyses. Ultimately, this may result in subtle differences from study to study that hinder meta-analyses using NEON data. A simplified and standardized format for NEON organismal data would facilitate wider usage of these datasets for biodiversity research. Furthermore, if these data were formatted to interface well with datasets from other coordinated research networks, more comprehensive syntheses could be accomplished to advance macrosystems biology.
One attractive standardized formatting style for NEON organismal data is that of ecocomDP (ecological community data design pattern; O'Brien et al., 2021). EcocomDP is the brainchild of members of the LTER network, the Environmental Data Initiative (EDI), and NEON staff, and provides a model by which data from a variety of sources may be easily transformed into consistently formatted, analysis-ready community-level organismal data packages. This is done using reproducible code that maintains dataset "levels": L0 is incoming data, L1 represents an ecocomDP data format and includes tables representing observations, sampling locations, and taxonomic information (at a minimum), and L2 is an output format. Thus far, >70 LTER organismal datasets have been harmonized to the L1 ecocomDP format through the R package ecocomDP and more datasets are in the queue for processing into the ecocomDP format by EDI (O'Brien et al., 2021).
We standardized NEON organismal data into the ecocomDP format, and all R code to process NEON data products can be obtained through the R package ecocomDP. For the major taxonomic groups included in this initial effort, NEON sampling designs and major data wrangling decisions are summarized in the Materials and Methods section. We archived the standardized data in the EDI Data Repository (https://doi.org/10.6073/pasta/c28dd4f6e7989003505ea02e9a92afbf). To facilitate the usage of the standardized datasets, we also developed an R data package, neonDivData (https://github.com/daijiang/neonDivData). Throughout this paper, we refer to the input data streams provided by NEON as data products, and to the cleaned and standardized collection of data files provided here as objects within the R data package neonDivData. Standardized datasets will be maintained and updated as new data become available from the NEON portal. We hope this effort will substantially reduce data processing times for NEON data users and greatly facilitate the use of NEON organismal data to advance our understanding of Earth's biodiversity.

MATERIALS AND METHODS (OR HOW TO STANDARDIZE NEON ORGANISMAL DATA)
There are many details to consider when starting to use NEON organismal data products. Below, we outline key points relevant to community-level biodiversity analyses with regard to the NEON sampling design and decisions that were made as the data products presented in this paper were converted into the ecocomDP data model. While the methodological sections below are specific to particular taxonomic groups, there are some general points that apply to all NEON organismal data products. First, species occurrence and abundance measures as reported in NEON biodiversity data products are not standardized to sampling effort. Because there are often multiple approaches to cleaning (e.g., dealing with multiple levels of taxonomic resolution, interpretations of absences) and standardizing biodiversity survey data, NEON publishes raw observations along with sampling effort data to preserve as much information as possible so that data users can clean and standardize data as they see fit. The workflows described here for 12 taxonomic groups represented in 11 NEON data products produce standardized counts based on sampling effort, such as the count of individuals per area sampled or count standardized to the duration of trap deployment, as described in Table 1. The data wrangling workflows described below can be used to access, download, and clean data from the NEON Data Portal using the R ecocomDP package. To view a catalog of available NEON data products in the ecocomDP format, use ecocomDP::search_data("NEON"). To import data from a given NEON data product into your R environment, use ecocomDP::read_data(), and set the id argument to the selected NEON to ecocomDP mapping workflow (the "L0 to L1 ecocomDP workflow ID" in Table 1). This will return a list of ecocomDP formatted tables and accompanying metadata. To create a flat data table (similar to the R objects in the data package neonDivData described in Table 2), use the ecocomDP::flatten_data() function.
Second, because different taxonomic groups have different sampling designs (see below for details), there is no general data processing protocol that can be applied to all taxonomic groups. Nevertheless, we tried to be as consistent as possible during the data cleaning and standardization processes. All final data products have the minimal information of locations (e.g., location_id, site_id, plot_id), species names (e.g., taxon_id, taxon_name, taxon_rank), and presence/absence or abundance information (e.g., variable_name, value, unit).
Third, our processes assume that NEON ensured correct identifications of species. However, since records may be identified to any level of taxonomic resolution, and IDs above the genus level may not be useful for most biodiversity projects, we removed records with such IDs for groups that are relatively easy to identify (i.e., fish, plant, small mammals) or have very few taxon IDs that are above genus level (i.e., mosquito). However, for groups that are hard to identify (i.e., algae, beetle, bird, macroinvertebrate, tick, and tick pathogen), we decided to keep all records regardless of their taxon ID level. Users thus need to carefully consider which level of taxon IDs they need to address their research questions. Another note regarding species names is the term "sp." versus "spp." across NEON organismal data collections; the term "sp." refers to a single morphospecies, whereas the term "spp." refers to more than one morphospecies. This is an important point to consider for community ecology or biodiversity analyses because it may add uncertainty to estimates of biodiversity metrics such as species richness. It is also important to point out that NEON fuzzed taxonomic IDs to one higher taxonomic level to protect species of concern. For example, if a threatened Black-capped vireo (Vireo atricapilla) is recorded by a NEON technician, the taxonomic identification is fuzzed to Vireo in the data. Rare, threatened, and endangered species are those listed as such by federal and/or state agencies.
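The rank-based filtering described above can be sketched as follows. This is a minimal illustration in Python, not the R code distributed with the ecocomDP package: the helper name keep_record, the record layout, the example rows, and the set of ranks treated as "genus level or finer" are all assumptions made for the sketch.

```python
# Ranks counted as "genus level or finer" (an assumption made for this sketch).
GENUS_OR_FINER = {"genus", "species", "speciesGroup", "subspecies"}

# Groups for which records identified above genus level were removed.
STRICT_GROUPS = {"fish", "plant", "small mammal", "mosquito"}


def keep_record(group, taxon_rank):
    """Return True if a record of this group and rank is retained."""
    if group in STRICT_GROUPS:
        return taxon_rank in GENUS_OR_FINER
    # Hard-to-identify groups (algae, beetle, bird, ...) keep every record.
    return True


# Hypothetical records, for illustration only.
records = [
    {"group": "plant", "taxon_id": "TAXON1", "taxon_rank": "species"},
    {"group": "plant", "taxon_id": "TAXON2", "taxon_rank": "family"},
    {"group": "bird", "taxon_id": "UNBI", "taxon_rank": "class"},
]
kept = [r for r in records if keep_record(r["group"], r["taxon_rank"])]
print([r["taxon_id"] for r in kept])
```

Users who need a stricter rule for the hard-to-identify groups could apply the same rank test to those groups as well.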
TABLE 1 Mapping NEON data products to ecocomDP formatted data packages with abundance standardized to observation effort.
, and community composition (DP1.10081.001) were not considered here, though future work may utilize neonDivData to align these datasets. Users interested in further explorations of these data products may find more information on the NEON data portal (https://data.neonscience.org/). Additionally, concurrent work on a suggested bioinformatics pipeline and how to run sensitivity analyses on user-defined parameters for NEON soil microbial data, including code and vignettes, is described in Qin et al. (2021).
Finally, it should be noted that NEON data collection efforts will continue well after this paper is published, and data collection methods and/or processing may change over time. Such changes (e.g., change in the number of traps used for ground beetle collection) or interruptions (e.g., due to  to data collection are documented in the issue log for each data product on the NEON Data Portal and the Readme text file that is included with NEON data downloads. We will try our best to maintain and update our standardized data products as long as possible.

Terrestrial organisms
Breeding landbirds
NEON sampling design
NEON designates breeding landbirds as "smaller birds (usually exclusive of raptors and upland game birds) not usually associated with aquatic habitats" (Ralph, 1993; Thibault, 2018). Most species observed are diurnal and include both resident and migrant species. Landbirds are surveyed via point counts in each of the 47 terrestrial sites (Thibault, 2018). At most NEON sites, breeding landbird points are located in 5-10, 3 × 3 grids (Figure 1), which are themselves located in representative (dominant) vegetation. Whenever possible, grid centers are colocated with distributed base plot centers. When sites are too small to support a minimum of five grids, separated by at least 250 m from edge to edge, point counts are completed at single points instead of grids. In these cases, points are located at the southwest corners of distributed base plots within the site. Five to 25 points may be surveyed depending on the size and spatial layout of the site, with exact point locations dictated by a stratified-random spatial design that maintains a 250-m minimum separation between points.
Surveys occur during one or two sampling bouts per season, at large and small sites, respectively. Observers go to the specified points early in the morning and track birds observed during each minute of a 6-min period, following a 2-min acclimation period, at each point (Thibault, 2018). Each point count contains species, sex, and distance to each bird (measured with a laser rangefinder except in the case of flyovers) seen or heard.
TABLE 2 Summary of data products included in this study (as of 13 April 2022). Users can call the R objects in the R object column from the R data package neonDivData to get the standardized data for specific taxonomic groups.
FIGURE 1 Generalized sampling schematics for Terrestrial Observation System (TOS) (a) and Aquatic Observation System (b-d) plots. For TOS plots, distributed, tower, and gradient plots, and locations of various sampling regimes are presented via symbols. For Aquatic Observation System plots, wadeable streams, nonwadeable streams, and lake plots are shown in detail, with locations of sensors and different sampling regimes presented using symbols. Panel (a) was originally published in Thorpe et al. (2016).
Information relevant for subsequent modeling of detectability is also collected during the point counts (e.g., weather, detection method). The point count surveys for NEON were modified from the Integrated Monitoring in Bird Conservation Regions field protocol for spatially balanced sampling of landbird populations (Pavlacky Jr et al., 2017).

Data wrangling decisions
The bird point count NEON data product ("DP1.10003.001") consists of a list of two associated data frames: brd_countdata and brd_perpoint. The former data frame contains information such as locations, species identities, and their counts. The latter data frame contains additional location information such as latitude and longitude coordinates and environmental conditions during the time of the observations. The separate data frames are linked by "eventID," which refers to the location, date, and time of the observation. To prepare the bird point count data for the L1 ecocomDP model, we first merged both data frames into one and then removed columns that are likely not needed for most community-level biodiversity analyses (e.g., observer names). The field taxon_id in the R object data_bird within the neonDivData data package consists of the standard AOU four-letter species code, although taxon_rank refers to seven potential levels of identification (class, family, genus, species, speciesGroup, subfamily, and subspecies). Users can decide which level is appropriate; for example, one might choose to exclude all unidentified birds (taxon_id = UNBI), where no further details are available below the class level (Aves sp.). The NEON sampling protocol has evolved over time, so users are advised to check whether the "samplingProtocolVersion" associated with bird point count data ("DP1.10003.001") fits their data requirements and subset as necessary. Older versions of protocols can be found at the NEON document library.

Ground beetle and herp bycatch
Beetle pitfall trapping begins when the temperature has been >4°C for 10 days in the spring and ends when temperatures dip below this threshold in the fall. Sampling occurs biweekly throughout the sampling season with no single trap being sampled more frequently than every 12 days (LeVan, 2020a). After collection, the samples are separated into carabid species and bycatch.
Invertebrate bycatch is pooled to the plot level and archived. Vertebrate bycatch is sorted and identified by NEON technicians, then archived at the trap level. Carabid samples are sorted and identified by NEON technicians, after which a subset of carabid individuals are sent to be pinned and reidentified by an expert taxonomist. More details can be found in Hoekman et al. (2017) and LeVan, Robinson, et al. (2019).
Pitfall traps and sampling methods are designed by NEON to reduce vertebrate bycatch (LeVan, Robinson, et al., 2019). The pitfall cup is medium in size with a low clearance cover installed over the trap entrance to minimize large vertebrate bycatch. When a live vertebrate with the ability to move of its own volition is found in a trap, the animal is released. Live but moribund vertebrates are euthanized and collected along with deceased vertebrates. When ≥15 individuals of a vertebrate species are collected, cumulatively, within a single plot, NEON may initiate localized mitigation measures such as temporarily deactivating traps and removing all traps from the site for the remainder of the season. Thus, while herpetofaunal (herp) bycatch is present in many pitfall samples, it is unclear how well these pitfall traps capture herp community structure and diversity, given these active efforts to reduce vertebrate bycatch. Users of NEON herp bycatch data should be aware of these limitations.

Data wrangling decisions
The beetle and herp bycatch data product identifier is "DP1.10022.001." Carabid samples are recorded and identified in a multistep workflow wherein a subset of samples is passed on in each successive step. Individuals are first identified by the sorting technician, after which a subset is sent on to be pinned. Some especially difficult individuals are not identified by technicians during sorting and are instead labeled "other carabid." The identifications for those individuals are recorded with the pinning data. Any individuals for which identification is still uncertain are then verified by an expert taxonomist.
There are a few cases where an especially difficult identification was sent to multiple expert taxonomists, and they did not agree on a final taxon; these individuals were excluded from the dataset at the recommendation of NEON staff.
Preference is given to expert identification whenever available. These differences in taxonomic expertise do not seem to cause systematic biases in estimating species richness across sites, although nonexpert taxonomists are more likely to misidentify non-native carabid species (Egli et al., 2020). Beetle abundance is recorded for the sorted samples by NEON technicians. To account for individual samples that were later reidentified, the final abundance for a species is the original sorting sample abundance minus the number of individuals that were given a new ID.
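The abundance adjustment for reidentified individuals can be illustrated with a short sketch. It is written in Python for illustration and is not the package's R implementation; the function name final_abundance, the input layout, and the taxon labels are hypothetical. Crediting each reidentified individual to its new taxon is also an assumption of the sketch, since the text only states the subtraction.

```python
from collections import Counter


def final_abundance(sorting_counts, reidentified):
    """Adjust sorting-stage counts for individuals given a new ID.

    sorting_counts: {taxon: count recorded at sorting}
    reidentified: one (original_taxon, new_taxon) pair per individual
    """
    final = Counter(sorting_counts)
    for original, new in reidentified:
        final[original] -= 1  # subtract from the original sorting ID
        final[new] += 1       # assumption: credit the corrected ID
    # Drop taxa whose adjusted count is zero or less.
    return {taxon: n for taxon, n in final.items() if n > 0}


# Hypothetical example: two of ten "taxon_A" individuals are reidentified.
adjusted = final_abundance(
    {"taxon_A": 10, "taxon_B": 3},
    [("taxon_A", "taxon_B"), ("taxon_A", "taxon_C")],
)
print(adjusted)
```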
Prior to 2018, trappingDays values were not included for many sites. Missing entries were calculated as the range from setDate through collectDate for each trap. We also accounted for a few plots for which setDate was not updated based on a previous collection event in the trappingDays calculations. To facilitate easy manipulation of data within and across bouts, a new boutID field was created to identify all trap collection events at a site in a bout. The original EventID field is intended to identify a bout, but has a number of issues that necessitate the creation of a new ID. First, EventID does not correspond to a single collection date but rather all collections in a week. This is appropriate for the small number of instances when collections for a bout happen over multiple consecutive days (~5% of bouts), but prevents analysis of bout patterns at the temporal scale of a weekday. The data here were updated so all entries for a bout correspond to the date (i.e., collectDate) on which the majority of traps are collected, to maintain the weekday-level resolution with as high a fidelity as possible while allowing for easy aggregation within bouts and collectDates. Second, there were a few instances in which plots within a site were set and collected on the same day, but have different EventIDs. These instances were all considered a single bout by our new boutID, which is a unique combination of setDate, collectDate, and siteID.
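A minimal sketch of these two derived fields follows, written in Python for illustration rather than taken from the R workflow. The helper names, the example trap record, the site code "SITE", and the underscore-separated boutID string are hypothetical; counting trappingDays as the simple difference in days between setDate and collectDate is an assumption about how the range is measured.

```python
from datetime import date


def trapping_days(set_date, collect_date):
    """Days a trap was deployed, from setDate through collectDate."""
    return (collect_date - set_date).days


def bout_id(site_id, set_date, collect_date):
    """Combine siteID, setDate, and collectDate into one identifier."""
    return f"{site_id}_{set_date.isoformat()}_{collect_date.isoformat()}"


# Hypothetical trap record with a missing trappingDays value.
trap = {
    "siteID": "SITE",
    "setDate": date(2017, 6, 1),
    "collectDate": date(2017, 6, 15),
    "trappingDays": None,
}

# Fill the missing effort value only when it was not reported.
if trap["trappingDays"] is None:
    trap["trappingDays"] = trapping_days(trap["setDate"], trap["collectDate"])

trap["boutID"] = bout_id(trap["siteID"], trap["setDate"], trap["collectDate"])
print(trap["trappingDays"], trap["boutID"])
```

Because the identifier is built only from the site and the two dates, plots at the same site that were set and collected on the same days receive the same boutID, as described above.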
Herpetofaunal bycatch (amphibian and reptile) in pitfall traps was identified to species or the lowest taxonomic level possible within 24 h of recovery from the field. To process the herp bycatch NEON data, we cleaned trappingDays and the other variables and added boutID as described above for beetles. The variable sampleType in the bet_sorting table provides the type of animal caught in a pitfall trap as one of five types: "carabid," "vert bycatch herp," "other carabid," "invert bycatch," and "vert bycatch mam." We filtered the beetle data described above to only include the "carabid" and "other carabid" types. For herps, we only kept the sampleType of "vert bycatch herp." Abundance data of beetle and herp bycatch were standardized to be the number of individuals captured per trap day.
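The sampleType filter and the per-trap-day standardization can be sketched as below. This is an illustrative Python example, not the R code used to build the data package; the function name standardize, the row layout, the field name individualCount, and the example values are assumptions.

```python
CARABID_TYPES = {"carabid", "other carabid"}
HERP_TYPES = {"vert bycatch herp"}


def standardize(rows, sample_types):
    """Keep rows of the requested sampleTypes and add count per trap day."""
    result = []
    for row in rows:
        if row["sampleType"] not in sample_types or not row["trappingDays"]:
            continue  # skip other sample types and rows without usable effort
        result.append({
            **row,
            "value": row["individualCount"] / row["trappingDays"],
            "unit": "count per trap day",
        })
    return result


# Hypothetical sorting records, for illustration only.
rows = [
    {"sampleType": "carabid", "individualCount": 7, "trappingDays": 14},
    {"sampleType": "other carabid", "individualCount": 1, "trappingDays": 14},
    {"sampleType": "vert bycatch herp", "individualCount": 2, "trappingDays": 8},
    {"sampleType": "invert bycatch", "individualCount": 30, "trappingDays": 14},
]
beetles = standardize(rows, CARABID_TYPES)
herps = standardize(rows, HERP_TYPES)
print([r["value"] for r in beetles], [r["value"] for r in herps])
```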

Mosquitoes
NEON sampling design
Mosquito specimens are collected at 47 terrestrial sites across all NEON domains, and the data are reported in NEON data product DP1.10043.001. Traps are distributed throughout each site according to a stratified-random spatial design used for all Terrestrial Observation System sampling, maintaining stratification across dominant (>5% of total cover) vegetation types (LeVan, 2020b). The number of mosquito traps placed in each vegetation type is proportional to its percent cover, until 10 total mosquito traps have been placed in the site. Mosquito traps are typically located within 30 m of a road to facilitate expedient sampling and are placed at least 300 m apart to maintain independence.
Mosquito monitoring is divided into off-season and field season sampling (LeVan, Paull, et al., 2019). Off-season sampling begins after three consecutive zero-catch field sampling bouts have occurred, and represents a reduced sampling regime that is designed for the rapid detection of when the next field season should begin and to provide mosquito phenology data. Off-season sampling is conducted at three dedicated mosquito traps spread throughout each core site, while temperatures are >10°C. Once per week, technicians deploy traps at dusk and then collect them at dawn the following day.
Field season sampling begins when the first mosquito is detected during off-season sampling (LeVan, Paull, et al., 2019). Technicians deploy traps at all 10 dedicated mosquito trap locations per site. Traps remain out for a 24-h period or sampling bout, and bouts occur every 2 or 4 weeks at core and relocatable terrestrial sites, respectively. During the sampling bout, traps are serviced twice and yield one night-active sample, collected at dawn or about 8 h after the trap was set, and one day-active sample, collected at dusk or about 16 h after the trap was set. Thus, a 24-h sampling bout yields 20 samples from 10 traps.
NEON collects mosquito specimens using Centers for Disease Control and Prevention (CDC) CO2 light traps (LeVan, Paull, et al., 2019). These traps have been used by other public health and mosquito-control agencies for a half-century so that NEON mosquito data align across NEON field sites and with existing long-term datasets. A CDC CO2 light trap consists of a cylindrical insulated cooler that contains dry ice, a plastic rain cover attached to a battery-powered light/fan assembly, and a mesh collection cup. During deployment, the dry ice sublimates and releases CO2. Mosquitoes attracted to the CO2 bait are sucked into the mesh collection cup by the battery-powered fan, where they remain alive until trap collection.
Following field collection, NEON's field ecologists process, package, and ship the samples to an external laboratory where mosquitoes are identified to species and sex (when possible). A subset of identified mosquitoes is tested for infection by pathogens to quantify the presence/absence and prevalence of various arboviruses. Some mosquitoes are set aside for DNA barcode analysis and long-term archiving. Particularly rare or difficult-to-identify mosquito specimens are prioritized for DNA barcoding. More details can be found in LeVan, Paull, et al. (2019).

Data wrangling decisions
The mosquito data product (DP1.10043.001) consists of four data frames: trapping data (mos_trapping), sorting data (mos_sorting), archiving data (mos_archivepooling), and expert taxonomist processed data (mos_expertTaxonomistIDProcessed). We first removed rows (records) with missing information about location, collection date, and sample or subsample ID for all data frames. We then merged all four data frames into one, wherein we only kept records for target taxa (i.e., targetTaxaPresent = "Y") with no known compromised sampling condition (i.e., sampleCondition = "No known compromise"). We further removed a small number of records with species identified only to the family level; all remaining records were identified at least to the genus level. We estimated the total individual count per trap hour for each species within a trap as (individualCount/subsampleWeight) × totalWeight/trapHours. We then removed columns that were not likely to be used for calculating biodiversity values.
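The count-per-trap-hour estimate above can be worked through with a small example. The sketch is in Python for illustration and is not the R workflow itself; the function name and the numeric values are hypothetical, while the argument names mirror the fields in the formula.

```python
def count_per_trap_hour(individual_count, subsample_weight, total_weight, trap_hours):
    """Scale a subsample count to the whole sample, then divide by effort."""
    if subsample_weight <= 0 or trap_hours <= 0:
        raise ValueError("subsample_weight and trap_hours must be positive")
    # (individualCount / subsampleWeight) * totalWeight / trapHours
    return (individual_count / subsample_weight) * total_weight / trap_hours


# Hypothetical values: 12 individuals counted in a 0.5 g subsample
# of a 2.0 g sample from a trap deployed for 8 h.
rate = count_per_trap_hour(12, 0.5, 2.0, 8)
print(rate)  # 6.0 individuals per trap hour
```

The first factor is the density of individuals per unit weight in the subsample; multiplying by the total sample weight extrapolates to the full catch before dividing by trap hours.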

Small mammals
NEON sampling design
NEON defines small mammals based on taxonomic, behavioral, dietary, and size constraints, and includes any rodent that (1) is nonvolant; (2) is nocturnally active; (3) forages predominantly aboveground; and (4) has a mass >5 g, but <500-600 g. In North America, this includes cricetids, heteromyids, small sciurids, and introduced murids, but excludes shrews, large squirrels, rabbits, or weasels, although individuals of these species may be incidentally captured.
Small mammals are collected at NEON sites using Sherman traps, identified to species in the field, marked with a unique tag, and released. Multiple 90 × 90 m trapping grids are set up in each terrestrial field site within the dominant vegetation type. Each 90 × 90 m trapping grid contains 100 traps placed in a pattern with 10 rows and 10 columns set 10 m apart. Three of these 90 × 90 m grids per site are designated pathogen (as opposed to diversity) grids, and additional blood sampling is conducted here.
Small mammal sampling occurs in bouts, with a bout comprised of three consecutive (or nearly consecutive) nights of trapping at each pathogen grid and one night of trapping at each diversity grid. The timing of sampling occurs within 10 days before or after the new moon. The number of bouts per year is determined by site type: Core sites are typically trapped for six bouts per year (except for areas with shorter seasons due to cold weather), while relocatable sites are trapped for four bouts per year. More information can be found in Thibault et al. (2019).

Data wrangling decisions
In the small mammal NEON data product (DP1.10072.001), records are stratified by NEON site, year, month, and day and represent data from both the diversity and pathogen sampling grids. Capture records were removed if they were not identified to genus or species (e.g., if the species name was denoted as "either/or" or as family name), or if their trap status is not "5-capture" or "4-more than 1 capture in one trap." Abundance data for each plot and month combination were standardized to be the number of individuals captured per 100 trap nights.
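The standardization to individuals per 100 trap nights can be sketched as follows. This Python example is illustrative only and does not reproduce the R workflow: the function name, the record layout, the tagID and taxon fields, and the "2-trap empty" status label are hypothetical, and treating each trap record as one trap night and each unique tag as one individual are assumptions of the sketch.

```python
CAPTURE_STATUSES = {"5-capture", "4-more than 1 capture in one trap"}


def per_100_trap_nights(records):
    """Individuals per 100 trap nights, by taxon, for one plot and month.

    records: one dict per trap per night (assumed to equal one trap night).
    """
    trap_nights = len(records)
    if trap_nights == 0:
        return {}
    tags_by_taxon = {}
    for r in records:
        if r["trapStatus"] in CAPTURE_STATUSES:
            # A set counts each tagged individual once, even if recaptured.
            tags_by_taxon.setdefault(r["taxon"], set()).add(r["tagID"])
    return {
        taxon: 100 * len(tags) / trap_nights
        for taxon, tags in tags_by_taxon.items()
    }


# Hypothetical plot-month: 196 empty traps plus four capture records.
records = [
    {"trapStatus": "2-trap empty", "taxon": None, "tagID": None}
    for _ in range(196)
]
records += [
    {"trapStatus": "5-capture", "taxon": "taxon_A", "tagID": "A1"},
    {"trapStatus": "5-capture", "taxon": "taxon_A", "tagID": "A1"},  # recapture
    {"trapStatus": "5-capture", "taxon": "taxon_A", "tagID": "A2"},
    {"trapStatus": "4-more than 1 capture in one trap", "taxon": "taxon_B", "tagID": "B1"},
]
rates = per_100_trap_nights(records)
print(rates)
```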

Terrestrial plants
NEON sampling design
NEON plant diversity sampling is completed once or twice per year (one or two "bouts") in multiscale, 400-m² (20 × 20 m) plots (Barnett, 2019). Each multiscale plot is subdivided into four 100-m² (10 × 10 m) subplots that each encompasses one or two sets of 10-m² (3.16 × 3.16 m) subplots within which a 1-m² (1 × 1 m) subplot is nested. The percent cover of each plant species is estimated visually in the 1-m² subplots, while only species presences are documented in the 10- and 100-m² subplots.
To estimate plant percent cover by species, technicians record this value for all species in a 1-m² subplot (Barnett, 2019). Next, the remaining 9-m² area of the associated 10-m² subplot is searched for the presence of species. The process is repeated if there is a second 1- and 10-m² nested pair in the specific 100-m² subplot. Next, the remaining 80-m² area is searched for the presence of species; data can be aggregated for a complete list of species present at the 100-m² subplot scale. Data for all four 100-m² subplots represent indices of species at the 400-m² plot scale. In most cases, species encountered in a nested, finer scale, subplot are not rerecorded in any corresponding larger subplot, in order to avoid duplication. Plant species are occasionally recorded more than once, however, when data are aggregated across all nested subplots within each 400-m² plot, and these require removal from the dataset. More details about the sampling design can be found in Barnett et al. (2019).
NEON manages plant taxonomic entries with a master taxonomy list that is based on the community standard, where possible. Using this list, synonyms for a given species are converted to the currently used name.
The master taxonomy for plants is the USDA PLANTS Database (USDA, NRCS, 2022; https://plants.usda.gov), and the portions of this database included in the NEON plant master taxonomy list are those pertaining to native and naturalized plants present within the NEON sampling area. A sublist for each NEON domain includes those species with ranges that overlap the domain and nativity designations (introduced or native) in that part of the range. If a species is reported at a location outside of its known range, and the record proves reliable, the master taxonomy list is updated to reflect the distribution change. For more details on plant taxonomic handling, see Barnett et al. (2019). For more on the NEON plant master taxonomy list, see NEON.DOC.014042 (https://data.neonscience.org/api/v0/documents/NEON.DOC.014042vK).

Data wrangling decisions
In the plant presence and percent cover NEON data product (DP1.10058.001), sampling at the 1 × 1 m scale also includes observations of abiotic and nontarget species ground cover (i.e., soil, water, and downed wood), so we removed records with divDataType as "otherVariables." We also removed records whose targetTaxaPresent is N (i.e., a nontarget species). Additionally, for all spatial resolutions (i.e., 1-, 10-, and 100-m² data), any record lacking information critical for combining data within a plot and for a given sampling bout (i.e., plotID, subplotID, boutNumber, endDate, or taxonID) was dropped from the dataset. Furthermore, records without a definitive genus- or species-level taxonID (i.e., those representing unidentified morphospecies) were not included. To combine data from different spatial resolutions into one data frame, we created a pivot column entitled sample_area_m2 (with possible values of 1, 10, and 100). Because of the nested sampling design of the plant data, to capture all records within a subplot at the 100-m² scale, we incorporated all data from both the 1- and 10-m² scales for that subplot. Similarly, to obtain all records within a plot at the 400-m² scale, we included all data from that plot. Species abundance information was only recorded as area coverage within 1 × 1 m subplots; however, users may use the frequency of a species across subplots within a plot or plots within a site as a proxy of its abundance if needed.
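The nested aggregation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the workflow's own code: the field name subplot_100 (the 100-m² subplot that contains a record) and the example identifiers are hypothetical, while sample_area_m2 follows the column described in the text.

```python
from collections import defaultdict

def species_lists(records):
    """Build species lists at the 100-m2 subplot and 400-m2 plot scales.

    records: iterable of dicts with keys plot_id, subplot_100,
             sample_area_m2 (1, 10, or 100), and taxon_id.
    Records from the 1- and 10-m2 scales are pooled into their parent
    100-m2 subplot; all records of a plot are pooled at the 400-m2 scale.
    Using sets drops taxa recorded more than once across nested scales.
    """
    by_subplot = defaultdict(set)
    by_plot = defaultdict(set)
    for rec in records:
        by_subplot[(rec["plot_id"], rec["subplot_100"])].add(rec["taxon_id"])
        by_plot[rec["plot_id"]].add(rec["taxon_id"])
    return dict(by_subplot), dict(by_plot)

# Illustrative records only (identifiers are made up).
records = [
    {"plot_id": "P1", "subplot_100": "31", "sample_area_m2": 1, "taxon_id": "ACRU"},
    {"plot_id": "P1", "subplot_100": "31", "sample_area_m2": 10, "taxon_id": "QUAL"},
    {"plot_id": "P1", "subplot_100": "31", "sample_area_m2": 100, "taxon_id": "ACRU"},
    {"plot_id": "P1", "subplot_100": "40", "sample_area_m2": 100, "taxon_id": "PIST"},
]
subplot_species, plot_species = species_lists(records)
```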

Ticks and tick pathogens
NEON sampling design
Tick sampling occurs in six distributed plots at each site, which are randomly chosen in proportion to NLCD land cover class (LeVan, Thibault, et al., 2019). Ticks are sampled by walking the perimeter of a 40 × 40 m plot using a 1 × 1 m drag cloth. Ideally, 160 m is sampled (the shortest straight line distance between corners), but the cloth can be dragged around obstacles if a straight line is not possible. The acceptable total sampling area is between 80 and 180 m per plot. The cloth can also be flagged over vegetation when the cloth cannot be dragged across it. Ticks are collected from the cloth and technicians' clothing at appropriate intervals, depending on vegetation density, and at every corner of the plot. Specimens are immediately transferred to a vial containing 95% ethanol.
Onset and offset of tick sampling coincide with phenological milestones at each site, beginning within 2 weeks of the onset of green-up and ending within 2 weeks of vegetation senescence (LeVan, Thibault, et al., 2019). Sampling bouts are only initiated if the high temperature on the two consecutive days prior to planned sampling was >0°C. Early-season sampling is conducted on a low-intensity schedule, with one sampling bout every 6 weeks. When more than five ticks of any life stage have been collected within the last calendar year at a site, sampling switches to a high-intensity schedule at the site, with one bout every 3 weeks. A site remains on the high-intensity schedule until fewer than five ticks are collected within a calendar year; then, sampling reverts to the low-intensity schedule.
Ticks are sent to an external facility for identification to species, life stage, and sex (LeVan, Thibault, et al., 2019). A subset of nymphal ticks is additionally sent to a pathogen testing facility. Ixodes species are tested for Anaplasma phagocytophilum, Babesia microti, Borrelia burgdorferi sensu lato, Borrelia miyamotoi, Borrelia mayonii, other Borrelia species (Borrelia sp.), and an Ehrlichia muris-like agent (Pritt et al., 2017). Non-Ixodes species are tested for A. phagocytophilum, Borrelia lonestari (and other undefined Borrelia species), Ehrlichia chaffeensis, Ehrlichia ewingii, Francisella tularensis, and Rickettsia rickettsii. Additional information about tick pathogen testing can be found in the Tick Pathogen Testing SOP (https://data.neonscience.org/api/v0/documents/UMASS_LMZ_tickPathogens_SOP_20160829) for the NEON Tick-borne Pathogen Status data product.

Data wrangling decisions
The tick NEON data product (DP1.10093.001) consists of two dataframes: "tck_taxonomyProcessed," hereafter referred to as "taxonomy data"; and "tck_fielddata," hereafter referred to as "field data." Users should be aware of some issues related to taxonomic ID. Counts assigned to higher taxonomic levels (e.g., at the order-level Ixodida; IXOSP2) are not the sum of lower levels; rather, they represent the counts of individuals that could not reliably be assigned to a lower taxonomic unit. Samples that were not identified in the laboratory were assigned to the highest taxonomic level (order Ixodida; IXOSP2). However, users could make an informed decision to assign these ticks to the most probable group if a subset of individuals from the same sample were assigned to a lower taxonomy.
To clean the tick data, we first removed surveys and samples not meeting quality standards. In the taxonomy data, we removed samples where sample condition was not listed as "OK" (<1% of records). In the field data, we removed records where samples were not collected due to logistical concerns (10%). We then combined male and female counts in the taxonomy table into one "adult" class. The taxonomy table was reformatted so that every row contained a sampleID and the counts for each species and life stage were in separate columns (i.e., "wide format"). Next, we joined the field data to the taxonomy data, using the sample ID to link the two tables. When joining, we retained field records where no ticks were found in the field, and thus, there were no associated taxonomy data. In drags where ticks were not found, counts were set to zero. All counts were standardized by area sampled.
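The reshaping and joining steps above can be sketched in Python. The sketch is a simplified illustration, not the workflow's code: the dictionary keys (event_id, sample_id, area_m2, ticks_present, taxon, stage, count) are hypothetical stand-ins for the NEON column names, and records with ticks present but no taxonomy rows are left as missing rather than zero.

```python
def tick_density(field_records, taxonomy_records):
    """Join drag-level field data to wide-format taxonomy counts per m2."""
    # Pivot taxonomy rows to wide format: one dict of counts per sample,
    # pooling male and female counts into a single "adult" class.
    wide = {}
    for rec in taxonomy_records:
        stage = "adult" if rec["stage"] in ("male", "female") else rec["stage"]
        column = (rec["taxon"], stage)
        counts = wide.setdefault(rec["sample_id"], {})
        counts[column] = counts.get(column, 0) + rec["count"]
    columns = sorted({col for counts in wide.values() for col in counts})

    rows = []
    for rec in field_records:
        # Ticks seen in the field but not yet identified: counts are pending.
        pending = rec["ticks_present"] and rec["sample_id"] not in wide
        counts = wide.get(rec["sample_id"], {})
        row = {"event_id": rec["event_id"]}
        for col in columns:
            # Drags without ticks get zeros; counts are divided by area dragged.
            row[col] = None if pending else counts.get(col, 0) / rec["area_m2"]
        rows.append(row)
    return rows

# Illustrative values only.
field = [
    {"event_id": "E1", "sample_id": "S1", "area_m2": 160, "ticks_present": True},
    {"event_id": "E2", "sample_id": None, "area_m2": 120, "ticks_present": False},
    {"event_id": "E3", "sample_id": "S3", "area_m2": 150, "ticks_present": True},
]
taxonomy = [
    {"sample_id": "S1", "taxon": "IXOSCA", "stage": "male", "count": 2},
    {"sample_id": "S1", "taxon": "IXOSCA", "stage": "female", "count": 1},
    {"sample_id": "S1", "taxon": "IXOSCA", "stage": "nymph", "count": 8},
]
densities = tick_density(field, taxonomy)
```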
Prior to 2019, both field surveyors and laboratory taxonomists enumerated each tick life stage; consequently, in the joined dataset there were two sets of counts ("field counts" and "laboratory counts"). However, starting in 2019, counts were performed by taxonomists rather than field surveyors. Field surveys conducted after 2019 no longer have field counts. Users of tick abundance data should be aware that this change in protocol has several implications for data wrangling and for analysis. First, after 2019, tick counts are no longer published at the same time as field survey data. Consequently, some field records from the most recent years have tick presence recorded (targetTaxaPresent = "Y"), but do not yet have associated counts or taxonomic information, and so the counts are still listed as NA. Users should be aware that counts of zero are therefore published earlier than positive counts. We strongly urge users to filter data to those years where there are no counts pending.
The second major issue is that in years where both field counts and laboratory counts were available, they did not always agree (8% of records). In cases of disagreement, we generally used laboratory counts in the final abundance data, because this is the source of all tick count data after 2019 and because life stage identification was more accurate. However, there were a few exceptions where we used field count data. In some cases, only a subsample of a certain life stage was counted in the laboratory, which resulted in higher field counts than laboratory counts. In this case, we assigned the additional unidentified individuals (i.e., the difference between the field and laboratory counts) to the order level (IXOSP2). If quality notes from NEON described ticks being lost in transit, we also added the additional lost individuals to the order level. There were some cases (<1%) where the field counts were greater than laboratory counts by more than 20% and where the explanation was not obvious; we removed these records. We note that the majority of samples (~85%) had no discrepancies between the laboratory and field counts; therefore, this process could be ignored by users whose analyses are not sensitive to exact counts.
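The reconciliation rules above can be summarized as a small decision function. The following Python sketch is one possible reading of those rules, not the authors' implementation: it works on total counts per sample rather than per life stage, the `explained` flag (subsampling or loss in transit noted by NEON) is a hypothetical input, and the 20% threshold is taken relative to the laboratory count.

```python
def reconcile_counts(field_count, lab_counts, explained=False):
    """Return reconciled counts per taxon code, or None to drop the record.

    field_count: total ticks counted in the field (None if not recorded).
    lab_counts: dict mapping taxon code -> laboratory count.
    explained: True if the surplus of field over laboratory counts has a
               documented cause (laboratory subsampling or loss in transit).
    """
    lab_total = sum(lab_counts.values())
    if field_count is None or field_count <= lab_total:
        return dict(lab_counts)  # default: trust the laboratory counts

    surplus = field_count - lab_total
    if explained:
        # Assign the unidentified surplus to the order level (IXOSP2).
        merged = dict(lab_counts)
        merged["IXOSP2"] = merged.get("IXOSP2", 0) + surplus
        return merged
    if surplus > 0.20 * lab_total:
        return None  # large, unexplained discrepancy: remove the record
    return dict(lab_counts)

agree = reconcile_counts(10, {"IXOSCA": 10})
subsampled = reconcile_counts(15, {"IXOSCA": 10}, explained=True)
dropped = reconcile_counts(15, {"IXOSCA": 10})
small_gap = reconcile_counts(11, {"IXOSCA": 10})
```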
The tick pathogen NEON data product (DP1.10092.001) consists of two dataframes: tck_pathogen, hereafter referred to as "pathogen data"; and tck_pathogenqa, hereafter referred to as "quality data." First, we removed any samples that had flagged quality checks from the quality data and removed any samples that did not have a positive DNA quality check from the pathogen data. Although the original online protocol aimed to test 130 ticks per site per year from multiple tick species, the final sampling decision was to extensively sample IXOSCA, AMBAME, and AMBSP species only, because IXOPAC and Dermacentor nymphs were too rare to generate meaningful pathogen data. Borrelia burgdorferi and B. burgdorferi sensu lato tests were merged, since the former was an incomplete pathogen name and refers to B. burgdorferi sensu lato as opposed to sensu stricto (Rudenko et al., 2011). Tick pathogen data are presented as a positivity rate, calculated as the number of positive tests divided by the number of tests conducted for a given pathogen on ticks collected during a given sampling event.
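The positivity rate defined above can be computed with a few lines of Python. This is a minimal sketch under assumed inputs: each test is a hypothetical (event_id, pathogen, is_positive) tuple, and the name merge mirrors the pooling of the two B. burgdorferi labels described in the text.

```python
from collections import defaultdict

# Pathogen labels pooled before computing rates (per the merge described above).
NAME_MERGE = {"Borrelia burgdorferi": "Borrelia burgdorferi sensu lato"}

def positivity_rates(tests):
    """Positive tests divided by tests conducted, per event and pathogen."""
    tally = defaultdict(lambda: [0, 0])  # key -> [positives, total tests]
    for event_id, pathogen, is_positive in tests:
        key = (event_id, NAME_MERGE.get(pathogen, pathogen))
        tally[key][1] += 1
        if is_positive:
            tally[key][0] += 1
    return {key: positives / total for key, (positives, total) in tally.items()}

# Illustrative tests only.
tests = [
    ("E1", "Borrelia burgdorferi", True),
    ("E1", "Borrelia burgdorferi sensu lato", False),
    ("E1", "Borrelia burgdorferi sensu lato", False),
    ("E1", "Borrelia burgdorferi sensu lato", False),
    ("E1", "Babesia microti", False),
]
pathogen_rates = positivity_rates(tests)
```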

Aquatic organisms
Aquatic macroinvertebrates
NEON sampling design
Aquatic macroinvertebrate sampling occurs three times per year at wadeable stream, river, and lake sites from spring through fall. The timing of sampling is site-specific and based on historical hydrological, meteorological, and phenological data including dates of known ice cover, growing degree days, and green-up and brown-down (Cawley et al., 2016). Samplers vary by habitat and include Surber, Hess, hand corer, modified kicknet, D-frame sweep, and petite Ponar samplers (Parker, 2019). Stream sampling occurs throughout the 1-km permitted reach in wadeable areas of the two dominant habitat types. Lake sampling occurs with a petite Ponar near buoy, inlet and outlet sensors, and D-frame sweeps in wadeable littoral zones. Riverine sample collections in deep waters or near instrument buoys are made with a petite Ponar, and in littoral areas are made with a D-frame sweep or large woody debris sampler. In the field, samples are preserved in pure ethanol, and later in the domain support facility, glycerol is added to prevent the samples from becoming brittle. Samples are shipped from the domain facility to a taxonomy laboratory for sorting and identification to the lowest possible taxon (e.g., genus or species), and counts of each taxon are made per size class, measured to the nearest millimeter.

Data wrangling decisions
Aquatic macroinvertebrate samples in the NEON data product DP1.20120.001 are subsampled, identified by expert taxonomists to the lowest practical taxonomic level (typically genus), measured to the nearest millimeter size class, and counted; the results are reported in the inv_taxonomyProcessed table. Taxonomic naming has been standardized in the inv_taxonomyProcessed table according to NEON's master taxonomy (https://data.neonscience.org/taxonomic-lists), removing any synonyms. We calculated macroinvertebrate density by dividing estimatedTotalCount (which includes the corrections for subsampling in the taxonomy laboratory) by benthicArea from the inv_fieldData table to return count per square meter of stream, lake, or river bottom (Chesney et al., 2021).
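The density calculation can be sketched in Python as follows. The column names estimatedTotalCount and benthicArea come from the text; joining the two tables on a shared sampleID key and the example values are assumptions made for illustration.

```python
def macroinvertebrate_density(taxonomy_rows, field_rows):
    """Divide estimatedTotalCount by benthicArea to get individuals per m2."""
    # Look up the benthic area sampled (m2) for each sample.
    area_by_sample = {row["sampleID"]: row["benthicArea"] for row in field_rows}
    densities = []
    for row in taxonomy_rows:
        area = area_by_sample.get(row["sampleID"])
        if not area:
            continue  # skip samples with missing or zero benthic area
        densities.append({
            "sampleID": row["sampleID"],
            "taxon": row["taxon"],
            "density_per_m2": row["estimatedTotalCount"] / area,
        })
    return densities

# Illustrative values only.
taxonomy_rows = [
    {"sampleID": "S1", "taxon": "Baetis", "estimatedTotalCount": 45},
    {"sampleID": "S2", "taxon": "Baetis", "estimatedTotalCount": 10},
]
field_rows = [{"sampleID": "S1", "benthicArea": 0.09}]
invert_density = macroinvertebrate_density(taxonomy_rows, field_rows)
```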

Microalgae (periphyton and phytoplankton)
NEON sampling design
NEON collects periphyton samples from natural surface substrata (i.e., cobble, silt, woody debris) over a 1-km reach in streams and rivers, and in the littoral zone of lakes. Various collection methods and sampler types are used, depending on substrate (Parker, 2020). In lakes and rivers, periphyton are also collected from the most dominant substratum type in three areas within the littoral (i.e., shoreline) zone. Prior to 2019, littoral zone periphyton sampling occurred in five areas.
NEON collects three phytoplankton samples per sampling date using Kemmerer or Van Dorn samplers. In rivers, samples are collected near the sensor buoy and at two other deep-water points in the main channel. For lakes, phytoplankton are collected near the central sensor buoy and at two littoral sensors. Where lakes and rivers are stratified, each phytoplankton sample is a composite from one surface sample, one sample from the metalimnion (i.e., middle layer), and one sample from the bottom of the euphotic zone. For nonstratified lakes and nonwadeable streams, each phytoplankton sample is a composite from one surface sample, one sample just above the bottom of the euphotic zone, and one mideuphotic zone sample, if the euphotic zone is >5 m deep.
All microalgal sampling occurs three times per year (i.e., spring, summer, and fall bouts) in the same sampling bouts as aquatic macroinvertebrates and zooplankton. In wadeable streams, which have variable habitats (e.g., riffles, runs, pools, and step pools), three periphyton samples are collected per bout in the dominant habitat type (five samples collected prior to 2019) and three per bout in the second most dominant habitat type. No two samples are collected from the same habitat unit (i.e., the same riffle).
Samples are processed at the domain support facility and separated into subsamples for taxonomic analysis or for biomass measurements. Aliquots shipped to an external facility for taxonomic determination are preserved in glutaraldehyde or Lugol's iodine (before 2021). Aliquots for biomass measurements are filtered onto glass-fiber filters and processed for ash-free dry mass (AFDM).

Data wrangling decisions
The periphyton, seston, and phytoplankton NEON data product (DP1.20166.001) contains three dataframes for algae containing information on algae taxonomic identification, biomass, and related field data, which are hereafter referred to as alg_tax_long, alg_biomass, and alg_field_data. Algae within samples are identified to the lowest possible taxonomic resolution, usually species, by contracting laboratory taxonomists. Some specimens can only be identified to the genus or even class level, depending on the condition of the specimen. Ten percent of all samples are checked by a second taxonomist and are noted in the qcTaxonomyStatus. Taxonomic naming has been standardized in the alg_tax_long files, according to NEON's master taxonomy, removing nomenclatural synonyms. Abundance and cell/colony counts are determined for each taxon of each sample with counts of cells or colonies that are either corrected for sample volume or not (as indicated by algalParameterUnit = "cellsperBottle").
We corrected sample units of cellsperBottle to density (Parker & Vance, 2020). First, we summed the preservative volume and the laboratory's recorded sample volume for each sample (from the alg_biomass file) and combined that with the alg_tax_long file using sampleID as a common identifier. Where samples in the alg_tax_long file were missing data in the perBottleSampleVolume field (measured after receiving samples at the external laboratory), we estimated the sample volume using NEON domain laboratory sample volumes (measured prior to shipping samples to the external laboratory). We then combined this updated file with alg_field_data to obtain the related field conditions, including the benthic area sampled for each sample. Because alg_field_data only has parentSampleID, we joined it to the alg_biomass file on parentSampleID. We then calculated cells per milliliter for the uncorrected taxon of each sample, dividing algalParameterValue by the updated sample volume. Benthic sample results are expressed in terms of area (i.e., multiplied by the field sample volume and divided by benthic area sampled), in square meters. The final abundance units are either cells per milliliter (phytoplankton and seston samples) or cells per square meter for benthic samples.
The sampleIDs are child records of each parentSampleID that will be collected as long as sampling is not impeded (i.e., ice-covered or dry). In the alg_biomass file, there should be only a single entry for each parentSampleID, sampleID, and analysisType. Most often, there were two sampleIDs per parentSampleID, one each for the AFDM and taxonomy analysis types. For the creation of the observation table with standardized counts, we used only records from the alg_biomass file with the analysisType of taxonomy. In alg_tax_long, there are multiple entries for each sampleID for each taxon by scientificName and algalParameter.
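The unit conversion described above can be written as a short worked example in Python. It is a sketch under stated assumptions rather than the workflow's code: the argument names are hypothetical, volumes are assumed to be in milliliters, and the preservative volume is assumed to be added to the fallback domain volume in the same way as to the laboratory volume.

```python
def algal_density(cells_per_bottle, lab_volume_ml, preservative_ml,
                  domain_volume_ml=None, field_volume_ml=None,
                  benthic_area_m2=None):
    """Convert a per-bottle count to cells per mL or cells per m2.

    lab_volume_ml: sample volume recorded by the external laboratory
                   (may be None, in which case domain_volume_ml is used).
    field_volume_ml, benthic_area_m2: supply both for benthic samples.
    Returns a (value, unit) tuple.
    """
    volume = lab_volume_ml if lab_volume_ml is not None else domain_volume_ml
    if volume is None:
        raise ValueError("no sample volume available for this sample")
    total_volume_ml = volume + preservative_ml
    cells_per_ml = cells_per_bottle / total_volume_ml
    if benthic_area_m2 is None:
        return cells_per_ml, "cells/mL"  # phytoplankton and seston samples
    # Benthic samples: scale to the field sample, then express per unit area.
    return cells_per_ml * field_volume_ml / benthic_area_m2, "cells/m2"

# Illustrative values only.
pelagic = algal_density(5000, 40, 10)
benthic = algal_density(5000, None, 10, domain_volume_ml=40,
                        field_volume_ml=250, benthic_area_m2=0.25)
```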

Fish
NEON sampling design
Fish sampling is carried out across 19 of the NEON ecoclimatic domains, occurring in a total of 23 lotic (stream) and 5 lentic (lake) sites. In lotic sites, up to 10 nonoverlapping reaches, each 70-130 m long, are designated within a 1-km section of stream (Jensen et al., 2019a). These include three constantly sampled "fixed" reaches, which encompass all representative habitats found within the 1-km stretch, and seven "random" reaches that are sampled on a rotating schedule. In lentic sites, 10 pie-shaped segments are established, with each segment ranging from the riparian zone into the lake center, therefore effectively capturing both nearshore and offshore habitats (Jensen et al., 2019b). Three of the 10 segments are fixed and are surveyed twice a year, and the remaining segments are random and are sampled rotationally. The spatial layouts of these sites are designed to capture spatial and temporal heterogeneity in the aquatic habitats.
Lotic sampling occurs at three fixed and three random reaches per sampling bout, and there are two bouts per year, one in spring and one in fall. During each bout, the fixed reaches are sampled via a three-pass electrofishing depletion approach (Moulton II et al., 2002; Peck et al., 2006), while the random reaches are sampled with a single-pass depletion approach. Which random reaches are surveyed depends on the year, with three of the random reaches sampled every other year. All sampling occurs during daylight hours, with each sampling bout completed within 5 days and with a minimum 2-week gap in between two successive sampling bouts. The initial sampling date is determined using site-specific historical data on ice melting, water temperature (or accumulated degree days), and riparian peak greenness.
The lentic sampling design is similar to that discussed above, with fixed segments being sampled twice per year and random segments sampled twice per year on a rotational basis (i.e., each random segment is not sampled every year). Lentic sampling is conducted using three gear types, with backpack electrofishing and mini-fyke nets near the shoreline and gill nets in deeper waters. Backpack electrofishing is done on a 4 × 25 m reach near the shoreline via a three-pass (for fixed segments) or single-pass (for random segments) electrofishing depletion approach (Moulton II et al., 2002; Peck et al., 2006). All three passes in a fixed sampling segment are completed on the same night, with ≤30 min between successive passes. Electrofishing begins within 30 min of sunset and ceases within 30 min of sunrise, with a maximum of five passes per sampling bout. A single gill net is also deployed within all segments being sampled, both fixed and random, for 1-2 h in either the morning or early afternoon. Finally, a fyke (Baker et al., 1997) or mini-fyke net is deployed at each fixed or random segment, respectively. Fyke nets are positioned before sunset and recovered after sunrise on the following day. Precise start and end times for electrofishing and net deployments are documented by NEON technicians at the time of sampling.
In all surveys, captured fish are identified to the lowest practical taxonomic level, and morphometrics (i.e., body mass and body length) are recorded for 50 individuals of each taxon before release. Relative abundance for each fish taxon is also recorded by direct enumeration (up to the first 50 individuals) or estimated by bulk counts (>50 individuals), that is, by placing fish of a given taxon into a dip net (i.e., net scoop), counting the total number of specimens in the dip net, and then multiplying the total number of scoops of captured fish by the count from the first scoop.

Data wrangling decisions
Fish sampled via both electrofishing and trapping are identified at variable taxonomic resolutions (as fine as subspecies level) in the field. Most identifications are made to the species or genus level by a single field technician for a given bout per site. Sampled fish are identified, measured, weighed, and then released back to the site of capture. If field technicians are unable to identify to the species level, such specimens are identified to the finest possible taxonomic resolution or assigned a morphospecies with a coarse-resolution identification. The standard sources consulted for identification and a qualifier for identification validity are also documented in the fsh_perFish table. The column bulkFishCount of the fsh_bulkCount table records relative abundance for each species or the alternative next possible taxon level (specified in the column scientificName).
Fish data (taxonomic identification and relative abundance) are recorded for each sampling reach in streams or each segment in lakes in each bout and documented in the fsh_perFish table (Monahan et al., 2020). The column eventID uniquely identifies the sampling date of the year, the specific site within the domain, a reach/segment identifier, the pass number (i.e., number of electrofishing passes or number of net deployment efforts), and the survey method. The eventID column helps tie all fish data with stream reach/lake segment data or environmental data (i.e., water quality data) and sampling effort data (e.g., electrofishing and net set time). A reachID column provided in the fsh_perPass table uniquely identifies surveys done per stream reach or lake segment. The reachID is nested within the eventID as well. We used eventID as a nominal variable to uniquely identify different sampling events and to join different, stacked fish data files as described below.
The fish NEON data product (DP1.20107.001) consists of fsh_perPass, fsh_fieldData, fsh_bulkCount, fsh_perFish, and the complete taxon table for fish, for both stream and lake sites. To join all reach-scale data, we first joined the fsh_perPass with fsh_fieldData, and eliminated all bouts where sampling was untenable. Subsequently, we joined the reach-scale table with fsh_perFish to add individual fish counts and fish measurements. Then, to add bulk counts, we joined the reach-scale table with fsh_bulkCount datasets, and subsequently added taxonRank, which included the taxonomic resolution in the bulk-processed table. Afterward, both individual-level and bulk-processed datasets were appended into a single table. To include samples where no fish were captured, we filtered the fsh_perPass table retaining records where target taxa (fish) were absent, joined it with fsh_fieldData, and finally merged it with the table that contained both bulk-processed and individual-level data. For each finer-resolution taxon in the individual-level dataset, we considered the relative abundance as one since each row represented a single individual fish. Whenever possible, we substituted missing data by cross-referencing other data columns, omitted completely redundant data columns, and retained records with genus- and species-level taxonomic resolution. For the appended dataset, we also calculated the relative abundance for each species per sampling reach or segment at a given site. To calculate species-specific catch per unit effort (CPUE), we normalized the relative abundance by either average electrofishing time (i.e., efTime, efTime2) or trap deployment time (i.e., the difference between netEndTime and netSetTime). For trap data, we assumed size of the traps used, water depths, number of netters used, and the reach lengths (a significant proportion of bouts had reach lengths missing) to be comparable across different sampling reaches and segments.
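The CPUE normalization can be illustrated with a minimal Python sketch. The column names efTime, efTime2, netSetTime, and netEndTime come from the text; the record layout, the `method` key, and the choice to express net effort in hours are assumptions, and electrofishing effort is simply left in whatever unit efTime is recorded in.

```python
from datetime import datetime

def sampling_effort(record):
    """Effort for one pass: mean electrofishing time or net deployment hours."""
    if record["method"] == "electrofishing":
        times = [t for t in (record.get("efTime"), record.get("efTime2"))
                 if t is not None]
        return sum(times) / len(times) if times else None
    # Nets: deployment time is the difference between end and set times.
    deployed = record["netEndTime"] - record["netSetTime"]
    return deployed.total_seconds() / 3600.0

def catch_per_unit_effort(relative_abundance, record):
    """Relative abundance divided by effort; None when effort is unavailable."""
    effort = sampling_effort(record)
    if not effort:
        return None
    return relative_abundance / effort

# Illustrative values only.
ef_pass = {"method": "electrofishing", "efTime": 600, "efTime2": 400}
net_set = {
    "method": "fyke net",
    "netSetTime": datetime(2019, 6, 1, 18, 0),
    "netEndTime": datetime(2019, 6, 2, 6, 0),
}
cpue_ef = catch_per_unit_effort(25, ef_pass)
cpue_net = catch_per_unit_effort(6, net_set)
```

Because the two gear types use different effort units, CPUE values from electrofishing and nets should be compared only within a gear type.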

Zooplankton
NEON sampling design
Zooplankton samples are collected at seven NEON lake sites across four domains. Zooplankton samples are collected at the buoy sensor set (deepest location in the lake) and at the two nearshore sensor sets using a vertical tow net for locations deeper than 4 m and a Schindler trap for locations shallower than 4 m (Parker & Roehm, 2019). This results in three samples collected per sampling day. Samples are preserved with ethanol in the field and shipped from the domain facility to a taxonomy laboratory for sorting and identification to the lowest possible taxon (e.g., genus or species), and counts of each taxon are made per size class, measured to the nearest millimeter.

Data wrangling decisions
The NEON zooplankton data product (DP1.20219.001) consists of dataframes for taxonomic identification and related field data. Zooplankton in NEON samples are identified at contracting laboratories to the lowest possible taxonomic resolution, usually genus; however, some specimens can only be identified to the family (or even class) level, depending on the condition of the specimen. Ten percent of all samples are checked by two taxonomists and are noted in the qcTaxonomyStatus column. The taxonomic naming has been standardized in the zoo_taxonomyProcessed table, according to NEON's master taxonomy, removing any synonyms. Density was calculated using adjCountPerBottle and towsTrapsVolume to correct count data to "count per liter."

RESULTS (OR HOW TO GET AND USE STANDARDIZED NEON ORGANISMAL DATA)
All cleaned and standardized datasets can be obtained from the R package neonDivData and from the EDI data repository (https://doi.org/10.6073/pasta/c28dd4f6e7989003505ea02e9a92afbf). Note that neonDivData included both stable and provisional data released by NEON, while the data repository in EDI only included stable datasets. If users want to change some of the decisions to wrangle the data differently, they can find the code in the R package ecocomDP and modify it for their own purposes. If this standardized version of NEON data is used, users should cite this paper along with the citations provided by NEON for each taxonomic group. Such citations can be found in the URLs presented in Table 1.
The data package neonDivData can be installed from GitHub. Installation instructions can be found on the GitHub webpage (https://github.com/daijiang/neonDivData). Table 2 shows a brief summary of all data objects. To get data for a specific taxonomic group, we can just call the objects in the R object column in Table 2. Such data products include cleaned (and standardized if needed) occurrence data for the taxonomic groups covered and are equivalent to the "observation" table of the ecocomDP data format. If environmental information was provided by NEON for some taxonomic groups, it is also included in these data objects. Information such as latitude, longitude, and elevation for all taxonomic groups was saved in the neon_location object of the R package, which is equivalent to the "sampling_location" table of the ecocomDP data format. Information about species scientific names of all taxonomic groups was saved in the neon_taxa object, which is equivalent to the "taxon" table of the ecocomDP data format.
To demonstrate the use of data packages, we used data_plant to quickly visualize the distribution of species richness of plants across all NEON sites (Figure 2). To show how easy it is to get site-level species richness, we presented the code used to generate the data for Figure 2 as supporting information. Figure 2 shows the utility of the data package for exploring macroecological patterns. One of the most well-known and studied macroecological patterns is the latitudinal biodiversity gradient, wherein sites are more species-rich at lower latitudes relative to higher latitudes; temperature, biotic interactions, and historical biogeography are potential reasons underlying these patterns (Fischer, 1960; Hillebrand, 2004). Herbaceous plants of NEON generally follow this pattern. The latitudinal pattern for NEON small mammals is similar and is best explained by increased niche space and declining similarity in body size among species in lower latitudes, rather than a direct effect of temperature (Read et al., 2018).
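Site-level species richness is simply the number of distinct taxa observed at each site. The Python sketch below illustrates that calculation on an observation-style table; it is not the R code provided in the supporting information, and the keys siteID and taxon_id as well as the example rows are assumed for illustration.

```python
from collections import defaultdict

def site_richness(observations):
    """Count distinct taxa per site from observation-style records."""
    taxa_by_site = defaultdict(set)
    for obs in observations:
        taxa_by_site[obs["siteID"]].add(obs["taxon_id"])
    return {site: len(taxa) for site, taxa in taxa_by_site.items()}

# Illustrative rows only (not NEON data).
observations = [
    {"siteID": "SITE_A", "taxon_id": "ACRU"},
    {"siteID": "SITE_A", "taxon_id": "ACRU"},  # repeated observation of one taxon
    {"siteID": "SITE_A", "taxon_id": "QUAL"},
    {"siteID": "SITE_B", "taxon_id": "PIST"},
]
richness = site_richness(observations)
```

Joining the resulting richness values to site latitudes would then allow a plot similar in spirit to Figure 2.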
In addition to allowing for quick exploration of macroecological patterns of richness at NEON sites, the data packages presented in this paper enable investigation of the effects of taxonomic resolution on diversity indices since taxonomic information is preserved for observations under family level for all groups. The degree of taxonomic resolution varies for NEON taxa depending on the diversity of the group and the level of taxonomic expertise needed to identify an organism to the species level, with more diverse groups presenting a greater challenge. Beetles are one of the most diverse groups of organisms on Earth and wide-ranging geographically, making them ideal bioindicators of environmental change (Rainio & Niemelä, 2003). To illustrate how the use of the beetle data package presented in this paper enables NEON data users to easily explore the effects of taxonomic resolution on community-level taxonomic diversity metrics, we calculated Jost's diversity indices (Jost, 2006) for beetles at the Oak Ridge National Laboratory (ORNL) NEON site for data subset at the genus, species, and subspecies level. To quantify biodiversity, we used Jost's indices, which are essentially Hill numbers that vary in how abundance is weighted with a parameter q. Higher values of q give lower weights to low-abundance species, with q = 0 being equivalent to species richness and q = 1 representing the effective number of species given by the Shannon entropy. These indices are plotted as rarefaction curves, which assess the sampling efficacy. When rarefaction curves asymptote, they suggest that additional sampling will not capture additional taxa. Statistical methods presented by Chao et al. (2014) provide estimates of sampling efficacy beyond the observed data (i.e., extrapolated values shown by dashed lines in Figure 3).
For the ORNL beetle data, Jost's indices calculated with higher values of q (i.e., q > 0) indicated sampling has reached an asymptote in terms of capturing diversity regardless of taxonomic resolution (i.e., genus, species, and subspecies). However, rarefaction curves for q = 0, which is equivalent to species richness, do not asymptote, even with extrapolation. These plots suggest that if a researcher is interested in low-abundance, rare species, then the NEON beetle data stream at ORNL may need to mature with additional sample collections over time before confident inferences may be made, especially below the taxonomic resolution of the genus.

DISCUSSION (OR HOW TO MAINTAIN AND UPDATE STANDARDIZED NEON ORGANISMAL DATA)
NEON organismal data hold enormous potential for understanding biodiversity change across space and time (Balch et al., 2019; Jones et al., 2021). Multiple biodiversity research and education programs used NEON data even before NEON became fully operational in May 2019 (e.g., Farrell & Carey, 2018; Read et al., 2018). With the expected long-term investment to maintain NEON over the next 30 years, NEON organismal data will be an invaluable tool for understanding and tracking biodiversity change. NEON data are unique relative to data collected by other similar networks (e.g., LTER, CZO) because observation collection protocols are standardized across sites, enabling researchers to address macroscale questions in environmental science without having to synthesize disparate datasets that differ in collection methods (Jones et al., 2021). The data package presented in this paper holds great potential for making NEON data easier to use and more comparable across studies. Whereas the data collection protocols implemented by NEON staff are standardized, the decisions NEON data users make in wrangling their data after downloading NEON's open data will not necessarily be similar unless the user community adopts a community data standard, such as the ecocomDP data model. Adopting such a data model early in the life of the observatory will ensure that the results of studies using NEON data are comparable and thus easier to synthesize. By providing a standardized and easy-to-use data package of NEON organismal data, our effort will significantly lower the barriers to using NEON organismal data for biodiversity research for many current and future researchers. All code for the Data Wrangling Decisions is available within the R package ecocomDP (https://github.com/EDIorg/ecocomDP).
Users can modify the code if they need to make different decisions during the data wrangling process and update our workflows in our code by submitting a pull request to our GitHub repository. If researchers wish to generate their own derived organismal datasets from NEON data with slightly different decisions than the ones outlined in this paper, we recommend that they use the ecocomDP framework, contribute their workflow to the ecocomDP R package, upload the data to the EDI repository, and cite their data with the discoverable DOI given to them by EDI. Note that the ecocomDP data model was intended for community ecology analyses and may not be well suited for population-level analyses. In a similar vein, researchers should ensure that they have considered sample size issues before fitting any models with these data. See Barnett (2019) for a review of the NEON organismal sampling design that contains important insights related to sample size issues.
Because ecocomDP is an R package to access and format datasets following the ecocomDP format, we developed an R data package, neonDivData, to host and distribute the standardized NEON organismal data derived from ecocomDP. A separate dedicated data package has several advantages. First, it is ready to use and saves users the time otherwise needed to run the code in ecocomDP to download and standardize NEON data products. Second, the data package is easy to update when NEON uploads new raw data products to its data portal, and the updating process does not require any change in the ecocomDP package. This is ideal because ecocomDP provides harmonized data from other sources besides NEON. Third, the GitHub repository page of neonDivData can serve as a discussion forum for researchers regarding the NEON data products without competing for attention on the ecocomDP GitHub repository page. By opening issues on the GitHub repository, users can discuss and contribute to improving our workflow for standardizing NEON data products. Users can also discuss whether there are other data models that the NEON user community should adopt at the inception of the observatory. As the observatory moves forward, this is an important discussion for the NEON user community and NEON technical working groups to promote the synthesis of NEON data with data from other efforts (e.g., LTER, CZO, AmeriFlux, the International LTER, National Phenology Network, and Long Term Agricultural Research Network). Note that the standardized datasets that are stable (defined by NEON as a stable release) are archived at EDI, and some of the above advantages also apply to the data repository at EDI.
The derived data products presented here collectively represent hundreds of hours of work by members of our team, a group that met at the NEON Science Summit in 2019 in Boulder, Colorado, and consists of researchers and NEON science staff. Just as it is helpful when working with a dataset to either have collected the data or be in close correspondence with the person who collected the data, final processing decisions were greatly informed by conversations with NEON science staff and the NEON user community. Future opportunities that encourage collaborations between NEON science staff and the NEON user community will be essential to achieve the full potential of the observatory data.

CONCLUSION
Macrosystems ecology (sensu Heffernan et al., 2014) is at the start of an exciting new chapter, with the buildout of NEON, awaited for decades, now complete and standardized data streams from all sites in the observatory becoming publicly available online. As the research community embarks on discovering new scientific insights from NEON data, it is important that we make our analyses and all derived data as reproducible as possible to ensure that connections across studies are possible. Harmonized datasets will help in this endeavor because they naturally promote the collection of provenance as data are collated into derived products (O'Brien et al., 2021; Reichman et al., 2011). Harmonized data also make synthesis easier because efforts to clean and format data leading up to analyses do not have to be repeated by individual researchers (O'Brien et al., 2021). The data standardizing processes and derived data package presented here illustrate a potential path forward in achieving a reproducible framework for data derived from NEON organismal data for ecological analyses. This derived data package also highlights the value of collaboration between the NEON user community and NEON staff for advancing NEON-enabled science. Finally, the extension of the ecocomDP harmonized data design pattern to data from other ecological research and observatory networks (e.g., the Brazilian Network of Networks [de Oliveira Roque et al., 2018] and the South African Environmental Observation Network [Van Jaarsveld et al., 2007]) has the potential to enable community ecologists to better synthesize data from across the globe.

ACKNOWLEDGMENTS
Barnett, and Sam Simkin), Margaret O'Brien, and Tad Dallas greatly improved this work. NEON is a program sponsored by the NSF and operated under a cooperative agreement by Battelle Memorial Institute. This material is based in part upon work supported by the NSF through the NEON Program.