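Nature editorial: The era of data overload has arrived
The editorial says that every researcher must ensure the truth and accuracy of his or her research data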
 
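■ Compiled and translated by Sun Tao (孫滔)
Science News (《科學新聞》), 2009, issue 14, "Leading Journals" column
A report released last week by the US National Academies says that the era of coping with petabyte-scale scientific data has arrived.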
 
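Deciphering the nearly 3 billion base pairs of the human genome took more than 10 years, yet today's gene-sequencing machines can do the same work within a week. Meanwhile, astronomers working with the Sloan Digital Sky Survey in the United States have mapped about 25% of the sky since 2000, obtaining data on more than 200 million celestial objects. The Large Synoptic Survey Telescope (LSST) in Chile, expected to be completed in 2015, will be able to gather the same amount of data in a single night.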
 
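Statistics show a similar situation in many fields of research. This is good news for science, since a glut of data is always better than a shortage, but there is still cause for concern: data are being generated far faster than the capacity and strategies for handling them are evolving. Journal editors, for example, face issues such as image manipulation and the preservation of original data, and need to assure continued access to massive data sets as well as standards for sharing algorithms and code.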
 
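In 2006, journals including Nature asked the US National Academy of Sciences to look into the problem. The resulting study report was released on 22 July 2009.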
 
The report is built on three principles: the integrity principle, the access principle, and the stewardship principle.
 
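The integrity principle requires that researchers bear ultimate responsibility for the truth and accuracy of their data. They must follow the professional standards of their fields, and research institutions need to provide training that makes this possible.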
 
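The access principle means that others can check whether the data are accurate, verify the analyses, and build on them as a foundation for further research. Unless researchers have a specific reason to withhold their data, they should make them openly accessible.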
 
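The stewardship principle addresses the need for long-term data preservation. Scientific societies and communities should set standards for data preservation, and journals need to contribute both to preservation and to disseminating the guidelines. Data professionals need to take on a stewardship role, and researchers should give them more support.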
 
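The report acknowledges that, given the complexity of data, its authors offer only an overview rather than definitive solutions. Researchers, scientific communities, and scientific societies should find solutions suited to their own fields; funders need to increase their investment in data preservation; and research institutions need to ensure that the data they present to the public are accurate.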
 
Reference:
 
Nature 460, 551 (30 July 2009) | doi:10.1038/460551a; Published online 29 July 2009

Editorial
Abstract
A report released last week by the US National Academies makes recommendations for tackling the issues surrounding the era of petabyte science.

Geneticists spent more than a decade getting their first complete reading of the 3 billion base pairs of the human genome, which they finally published in 2003. But today's rapid sequencing machines can run through that much DNA in a week, and are busily churning out multiple sequences from an ever-expanding list of species. Meanwhile, astronomers working with the Sloan Digital Sky Survey telescope in New Mexico have mapped some 25% of the sky since 2000, obtaining data on more than 200 million objects. The Large Synoptic Survey Telescope, scheduled for completion atop Chile's Cerro Pachón in 2015, will gather that much data in one night.

Statistics tell a similar story in many scientific fields. This is great news for research: data glut is always better than data famine. But it is also cause for concern, because investigators' ability to amass huge quantities of data has accelerated much faster than have policies and practices for handling those data. Journal editors, in particular, have found themselves grappling with issues such as image manipulation, the preservation of original data, assuring continued access to large data sets, and standards for algorithm and code sharing.

    "Each researcher is ultimately responsible for ensuring the truth and accuracy of the data he or she produces."

In 2006, these concerns led a number of scientific societies and research journals, including Nature, to ask the US National Academy of Sciences to look at the problem. This resulted in the formation of a National Academies study committee, sponsors of which included Nature Publishing Group. The committee was headed by cancer researcher Phillip Sharp and physicist Daniel Kleppner, both of the Massachusetts Institute of Technology in Cambridge, and its report was published on 22 July (see http://tinyurl.com/datasteward).

The report makes 11 recommendations, organized around three major principles: integrity, access and stewardship. The integrity principle affirms that each researcher is ultimately responsible for ensuring the truth and accuracy of the data he or she produces. Individual investigators should adhere to the professional standards in their fields, and institutions should ensure that training is in place to make this possible.

The access principle asserts the value of openness: only if results are shared can other researchers check the data's accuracy, verify analyses and build on previous work. So unless there are very good reasons for researchers to withhold data — reasons that should be publicly posted and available for comment by other researchers — they should make provisions to supply public access in a timely manner, possibly as early as their grant proposals.

Finally, the stewardship principle addresses the need for long-term preservation. Scientific societies and communities need to provide guidelines on which data are worth retaining for future analysis; institutions and funding agencies need to address and support these needs. Journals can play a part in the preservation of the published record, and in the dissemination and enforcement of guidelines. And data professionals should be recognized for their crucial role in stewardship: certainly they deserve more respect and support than researchers sometimes give them.

The authors of the report readily admit that they have provided an overview, rather than a resolution, of the complexities that surround digital data. What is needed now is for institutions, consortia and scientific societies to find individual solutions that will work in their fields and physical settings. Funders must take up their responsibilities and increase investment in the upkeep of data, from the individual grant onwards. The scientific enterprise requires that the integrity of its data forms a bond of trust with the public. It is time to strengthen that bond with action.

 



Posted by chendaneyl on 痞客邦 (Pixnet)