1 00:00:02,540 --> 00:00:05,010 So welcome everyone to 2 00:00:05,010 --> 00:00:06,780 the System Science Friday 3 00:00:06,780 --> 00:00:08,009 noon seminar series. 4 00:00:08,009 --> 00:00:08,910 And today we're pleased 5 00:00:08,910 --> 00:00:09,869 to have Tristan Holmes, 6 00:00:09,869 --> 00:00:11,744 he's going to talk about 7 00:00:11,744 --> 00:00:13,529 machine learning algorithm. 8 00:00:13,529 --> 00:00:15,089 I think fundamentally you could call it 9 00:00:15,089 --> 00:00:17,835 that with a particular application in mind. 10 00:00:17,835 --> 00:00:19,139 And we're looking forward 11 00:00:19,139 --> 00:00:20,489 to that and please feel free to give 12 00:00:20,489 --> 00:00:22,215 us a little bit more on your background 13 00:00:22,215 --> 00:00:23,190 as you start your talk, 14 00:00:23,190 --> 00:00:24,744 if you'd like to do that. 15 00:00:24,744 --> 00:00:26,360 Absolutely. 16 00:00:26,360 --> 00:00:28,294 Thank you very much for having me today. 17 00:00:28,294 --> 00:00:30,259 I am Tristan Holmes. 18 00:00:30,259 --> 00:00:32,509 This is work that was done 19 00:00:32,509 --> 00:00:35,029 jointly with my doctoral advisor, 20 00:00:35,029 --> 00:00:36,229 JB Nation at 21 00:00:36,229 --> 00:00:38,675 the University of Hawaii at Manoa. 22 00:00:38,675 --> 00:00:40,280 Everything I'm going to go over 23 00:00:40,280 --> 00:00:42,019 today was done in 24 00:00:42,019 --> 00:00:46,444 2019 and the project hit a couple of snags. 25 00:00:46,444 --> 00:00:49,249 I'll go over some of that very briefly. 26 00:00:49,249 --> 00:00:51,260 We can talk afterwards if you're a little 27 00:00:51,260 --> 00:00:53,720 bit more interested in some of the backstory. 
28 00:00:53,720 --> 00:00:55,369 My backstory is that I got 29 00:00:55,369 --> 00:00:56,360 my bachelor's degree at 30 00:00:56,360 --> 00:00:57,439 the University of Oregon in 31 00:00:57,439 --> 00:01:02,075 2003 and then moved out to Honolulu 32 00:01:02,075 --> 00:01:04,460 to get my graduate degree in mathematics with 33 00:01:04,460 --> 00:01:07,009 a specialization in the study 34 00:01:07,009 --> 00:01:08,704 of partially ordered sets, 35 00:01:08,704 --> 00:01:10,640 working with JB Nation 36 00:01:10,640 --> 00:01:13,100 and a number of other folks out there who at 37 00:01:13,100 --> 00:01:15,949 the time had a thriving seminar 38 00:01:15,949 --> 00:01:17,690 in school in that discipline. 39 00:01:17,690 --> 00:01:19,790 It kinda fell apart after JB and 40 00:01:19,790 --> 00:01:21,304 the others retired and 41 00:01:21,304 --> 00:01:23,104 I wasn't hired to replace them. 42 00:01:23,104 --> 00:01:24,950 Since then I have moved back to 43 00:01:24,950 --> 00:01:26,599 Portland, where I'm originally 44 00:01:26,599 --> 00:01:29,359 from, and I've had the chance to teach as 45 00:01:29,359 --> 00:01:31,009 a part-time instructor at 46 00:01:31,009 --> 00:01:34,430 Portland State and this term at PCC. 47 00:01:34,430 --> 00:01:36,860 And I've been trying more lately to 48 00:01:36,860 --> 00:01:39,184 get back into the research side of things. 49 00:01:39,184 --> 00:01:41,450 And hopefully this will 50 00:01:41,450 --> 00:01:43,775 spark some discussion about 51 00:01:43,775 --> 00:01:46,220 possible future uses of 52 00:01:46,220 --> 00:01:48,229 the machine-learning algorithm I'm about 53 00:01:48,229 --> 00:01:50,614 to describe and 54 00:01:50,614 --> 00:01:53,149 other ways to compare it to other algorithms. 55 00:01:53,149 --> 00:01:55,710 So with that, I'll go ahead and get started.
56 00:01:57,490 --> 00:02:00,650 So in the abstract, 57 00:02:00,650 --> 00:02:02,749 the 2019 team project 58 00:02:02,749 --> 00:02:04,490 was a collaboration hosted by the 59 00:02:04,490 --> 00:02:06,695 UH Manoa Cancer Center. 60 00:02:06,695 --> 00:02:09,979 And they brought in my doctoral advisor. 61 00:02:09,979 --> 00:02:11,000 I didn't come in until the 62 00:02:11,000 --> 00:02:12,425 very end of this, but they 63 00:02:12,425 --> 00:02:14,734 brought in JB Nation of UH 64 00:02:14,734 --> 00:02:17,105 Manoa and Kira Adaricheva of, 65 00:02:17,105 --> 00:02:19,130 at that time, Yeshiva University 66 00:02:19,130 --> 00:02:20,570 to help them try to sort through 67 00:02:20,570 --> 00:02:22,760 genetic expression data that they 68 00:02:22,760 --> 00:02:25,939 were seeing in biopsied tumors. 69 00:02:25,939 --> 00:02:28,054 The result was the development 70 00:02:28,054 --> 00:02:30,380 of what was later termed the 71 00:02:30,380 --> 00:02:33,679 Lattice Upstream Targeting, or LUST, 72 00:02:33,679 --> 00:02:34,925 algorithm, to give it 73 00:02:34,925 --> 00:02:37,819 a very sexy name to put on the tin. 74 00:02:37,819 --> 00:02:39,110 And this was used to 75 00:02:39,110 --> 00:02:42,065 analyze messenger RNA expression. 76 00:02:42,065 --> 00:02:43,880 In this project, it was for 77 00:02:43,880 --> 00:02:46,054 33 different types of cancer. 78 00:02:46,054 --> 00:02:47,390 And all the data that I'm 79 00:02:47,390 --> 00:02:49,040 going to talk about today is 80 00:02:49,040 --> 00:02:50,330 publicly available from 81 00:02:50,330 --> 00:02:53,480 The Cancer Genome Atlas database. 82 00:02:53,480 --> 00:02:55,220 Full results can be found on 83 00:02:55,220 --> 00:02:56,690 GitHub. There's a link here. 84 00:02:56,690 --> 00:02:58,324 I'll be happy to provide the slides 85 00:02:58,324 --> 00:03:01,409 at the end of the talk.
86 00:03:01,570 --> 00:03:03,740 Again, we can talk a little bit 87 00:03:03,740 --> 00:03:05,780 more about this afterwards. 88 00:03:05,780 --> 00:03:08,869 There was a disconnect 89 00:03:08,869 --> 00:03:11,300 between the data analysis and 90 00:03:11,300 --> 00:03:14,060 the biology factions of 91 00:03:14,060 --> 00:03:15,770 this collaborative group toward 92 00:03:15,770 --> 00:03:17,135 the end of the project. 93 00:03:17,135 --> 00:03:19,610 And so I am not privy to what 94 00:03:19,610 --> 00:03:20,960 exactly was done on some 95 00:03:20,960 --> 00:03:22,495 of the proprietary data at 96 00:03:22,495 --> 00:03:24,289 the UH Cancer Center. 97 00:03:24,289 --> 00:03:25,490 But the last time I was 98 00:03:25,490 --> 00:03:27,215 able to speak with JB, 99 00:03:27,215 --> 00:03:29,120 they had used some 100 00:03:29,120 --> 00:03:30,770 of the techniques that were developed in 101 00:03:30,770 --> 00:03:33,259 this study to look for 102 00:03:33,259 --> 00:03:35,899 genes that would be effective drug targets 103 00:03:35,899 --> 00:03:37,520 for new chemotherapies. 104 00:03:37,520 --> 00:03:39,319 Whether or not those studies have produced 105 00:03:39,319 --> 00:03:41,104 any applicable clinical results, 106 00:03:41,104 --> 00:03:42,200 I do not have 107 00:03:42,200 --> 00:03:45,089 that information available at this time. 108 00:03:45,640 --> 00:03:48,709 So a brief overview of the procedure. 109 00:03:48,709 --> 00:03:50,375 The LUST algorithm, at 110 00:03:50,375 --> 00:03:52,024 least in this study, 111 00:03:52,024 --> 00:03:54,364 is a discrete mathematical method, 112 00:03:54,364 --> 00:03:56,809 but it's used on continuous data.
113 00:03:56,809 --> 00:03:58,715 Continuous data goes in, 114 00:03:58,715 --> 00:04:02,435 it gets binned into discrete categories, 115 00:04:02,435 --> 00:04:04,024 and the analysis is then run 116 00:04:04,024 --> 00:04:05,704 on the discretized data, 117 00:04:05,704 --> 00:04:07,009 in our case, specifically 118 00:04:07,009 --> 00:04:10,189 messenger RNA expression numbers. 119 00:04:10,189 --> 00:04:12,860 We formed the expression data into 120 00:04:12,860 --> 00:04:14,989 an array, specifically a matrix. 121 00:04:14,989 --> 00:04:16,460 And the algorithm is actually 122 00:04:16,460 --> 00:04:19,054 applied in two stages. 123 00:04:19,054 --> 00:04:22,730 The first stage is unsupervised. 124 00:04:22,730 --> 00:04:26,135 It collects genes into 125 00:04:26,135 --> 00:04:27,230 groups that are referred 126 00:04:27,230 --> 00:04:29,734 to later as metagenes. 127 00:04:29,734 --> 00:04:31,010 That is, groups of 128 00:04:31,010 --> 00:04:33,845 genetic factors which seem to be overexpressed 129 00:04:33,845 --> 00:04:35,420 or underexpressed in 130 00:04:35,420 --> 00:04:38,435 most patients simultaneously. 131 00:04:38,435 --> 00:04:40,579 For a visualization, going 132 00:04:40,579 --> 00:04:41,810 to the actual paper, 133 00:04:41,810 --> 00:04:45,769 we see here a heatmap of 134 00:04:45,769 --> 00:04:47,539 a metagene that was 135 00:04:47,539 --> 00:04:50,059 discovered during the study. 136 00:04:50,059 --> 00:04:52,549 The rows are specific genes 137 00:04:52,549 --> 00:04:54,724 that have been analyzed 138 00:04:54,724 --> 00:04:57,635 through standard biological methods. 139 00:04:57,635 --> 00:05:01,250 The columns are specific patients. 140 00:05:01,250 --> 00:05:03,110 So each column is 141 00:05:03,110 --> 00:05:05,210 one sample taken from one patient, 142 00:05:05,210 --> 00:05:07,505 one tumor sample that has been biopsied.
143 00:05:07,505 --> 00:05:09,425 Red is low expression, 144 00:05:09,425 --> 00:05:11,210 green is high expression. 145 00:05:11,210 --> 00:05:13,669 It's been sorted from low to high. 146 00:05:13,669 --> 00:05:14,600 And as you can see for 147 00:05:14,600 --> 00:05:16,265 this particular metagene, 148 00:05:16,265 --> 00:05:17,584 there's a tendency for 149 00:05:17,584 --> 00:05:20,059 all the genes under consideration 150 00:05:20,059 --> 00:05:21,770 to either underexpress or 151 00:05:21,770 --> 00:05:23,854 overexpress at the same time. 152 00:05:23,854 --> 00:05:25,129 So the assumption is that 153 00:05:25,129 --> 00:05:26,930 the data is telling us that all of 154 00:05:26,930 --> 00:05:28,070 these genes relate to 155 00:05:28,070 --> 00:05:31,379 a similar biological process. 156 00:05:31,780 --> 00:05:34,924 Once we have the metagenes in hand, 157 00:05:34,924 --> 00:05:39,004 there is a second run that is supervised on 158 00:05:39,004 --> 00:05:40,654 the expression matrices for 159 00:05:40,654 --> 00:05:43,444 each individual metagene 160 00:05:43,444 --> 00:05:45,830 that was identified in part one. 161 00:05:45,830 --> 00:05:48,709 This is supervised by survival time, 162 00:05:48,709 --> 00:05:50,599 although indirectly, as there are 163 00:05:50,599 --> 00:05:52,669 some other scores that are used which I'll 164 00:05:52,669 --> 00:05:54,649 have the opportunity to get a little bit 165 00:05:54,649 --> 00:05:57,034 into given the time constraints. 166 00:05:57,034 --> 00:05:59,809 This pass identifies subsets of 167 00:05:59,809 --> 00:06:03,125 the metagenes that we call signatures. 168 00:06:03,125 --> 00:06:06,515 The main goal of this study was to, 169 00:06:06,515 --> 00:06:08,240 as JB once put it, 170 00:06:08,240 --> 00:06:10,759 find the woods rather 171 00:06:10,759 --> 00:06:12,905 than the trees or the forest.
172 00:06:12,905 --> 00:06:14,450 If the trees are the genes, 173 00:06:14,450 --> 00:06:15,230 we don't want to look at 174 00:06:15,230 --> 00:06:16,759 the whole forest because if 175 00:06:16,759 --> 00:06:17,989 we tried to affect all 176 00:06:17,989 --> 00:06:19,399 of that with treatment, 177 00:06:19,399 --> 00:06:21,830 the treatment will probably be ineffective. 178 00:06:21,830 --> 00:06:24,380 If we can find the individual woods within 179 00:06:24,380 --> 00:06:25,460 the forest that are most 180 00:06:25,460 --> 00:06:27,664 significant to the process of disease, 181 00:06:27,664 --> 00:06:30,770 those could make good drug targets. 182 00:06:30,770 --> 00:06:33,679 In the end, the outcome of the study was that 183 00:06:33,679 --> 00:06:35,960 certain signatures, and to 184 00:06:35,960 --> 00:06:37,249 almost no one's surprise, 185 00:06:37,249 --> 00:06:38,779 particularly those related to 186 00:06:38,779 --> 00:06:41,990 processes that involve the immune system, 187 00:06:41,990 --> 00:06:44,510 seemed appropriate for further study 188 00:06:44,510 --> 00:06:47,759 and even to use as guidance for treatment. 189 00:06:48,310 --> 00:06:50,839 So if we're going to run a new algorithm, 190 00:06:50,839 --> 00:06:51,680 the first thing we're going to 191 00:06:51,680 --> 00:06:53,044 need is some data. 192 00:06:53,044 --> 00:06:54,739 So in this particular study, 193 00:06:54,739 --> 00:06:57,139 we used publicly available mRNA 194 00:06:57,139 --> 00:07:00,750 expression data that can be found here. 195 00:07:01,180 --> 00:07:04,624 The gene expression files were 196 00:07:04,624 --> 00:07:06,289 sequenced by a technique known 197 00:07:06,289 --> 00:07:08,044 as Illumina HiSeq.
198 00:07:08,044 --> 00:07:09,815 And I must confess, 199 00:07:09,815 --> 00:07:11,299 I am not familiar enough 200 00:07:11,299 --> 00:07:13,040 with the genetics to be able 201 00:07:13,040 --> 00:07:15,379 to describe that exact process 202 00:07:15,379 --> 00:07:16,849 or how it compares to others. 203 00:07:16,849 --> 00:07:18,019 But from my understanding, 204 00:07:18,019 --> 00:07:19,999 this is an industry standard. 205 00:07:19,999 --> 00:07:21,650 And we got expression levels for 206 00:07:21,650 --> 00:07:24,040 20,531 genes. 207 00:07:24,040 --> 00:07:25,280 So quite, quite a lot 208 00:07:25,280 --> 00:07:27,660 of data to sort through here. 209 00:07:28,510 --> 00:07:31,280 In the study, there were also samples from 210 00:07:31,280 --> 00:07:33,515 surrounding tissue; those were removed. 211 00:07:33,515 --> 00:07:35,209 We're only interested in 212 00:07:35,209 --> 00:07:38,210 readings from the actual tumors themselves. 213 00:07:38,210 --> 00:07:40,700 The data is normalized in a number of 214 00:07:40,700 --> 00:07:42,709 ways: it's log transformed, 215 00:07:42,709 --> 00:07:45,020 quantile normalized so that 216 00:07:45,020 --> 00:07:47,209 different gene expression levels 217 00:07:47,209 --> 00:07:49,639 can be more easily compared, and row centered, 218 00:07:49,639 --> 00:07:52,114 so we're going to get a median of zero. 219 00:07:52,114 --> 00:07:54,050 Finally, there is 220 00:07:54,050 --> 00:07:56,600 also anonymized clinical data 221 00:07:56,600 --> 00:07:58,280 available for all the patients who were 222 00:07:58,280 --> 00:08:00,815 biopsied in the same database. 223 00:08:00,815 --> 00:08:02,420 Although there's an, there's 224 00:08:02,420 --> 00:08:04,639 an array of clinical data available, 225 00:08:04,639 --> 00:08:07,970 we are only interested in survival times 226 00:08:07,970 --> 00:08:09,349 and whether or not patients were 227 00:08:09,349 --> 00:08:12,599 censored at any point in the study.
228 00:08:13,140 --> 00:08:15,685 The next step is, 229 00:08:15,685 --> 00:08:18,100 now that we have the continuous data, 230 00:08:18,100 --> 00:08:21,160 we want to discretize it into bins. 231 00:08:21,160 --> 00:08:23,739 So we take the expression data and we have 232 00:08:23,739 --> 00:08:26,980 a matrix with 20,531 rows, 233 00:08:26,980 --> 00:08:28,839 one row for each gene, 234 00:08:28,839 --> 00:08:32,530 and the columns, one for each patient, 235 00:08:32,530 --> 00:08:34,959 for the given cancer 236 00:08:34,959 --> 00:08:36,744 under consideration; 237 00:08:36,744 --> 00:08:38,019 33 different types of 238 00:08:38,019 --> 00:08:41,060 cancer are available in the database. 239 00:08:41,790 --> 00:08:48,819 We discretize this into values of -1, 0, and 1. Minus 240 00:08:48,819 --> 00:08:51,579 one represents under expression of 241 00:08:51,579 --> 00:08:53,290 a given gene, in the red 242 00:08:53,290 --> 00:08:55,424 on the heat map I showed earlier; 243 00:08:55,424 --> 00:09:00,199 zero is relatively medium expression. 244 00:09:00,199 --> 00:09:01,940 One might say normal, though 245 00:09:01,940 --> 00:09:04,100 since we're looking at a tumor sample, 246 00:09:04,100 --> 00:09:05,180 the word normal might 247 00:09:05,180 --> 00:09:07,279 not be entirely appropriate. 248 00:09:07,279 --> 00:09:11,030 Positive one represents overexpression, 249 00:09:11,030 --> 00:09:12,079 in the green on 250 00:09:12,079 --> 00:09:14,630 the heatmap that was shown earlier. 251 00:09:14,630 --> 00:09:17,015 Well, how do we decide what's 252 00:09:17,015 --> 00:09:19,474 overexpression and under expression? 253 00:09:19,474 --> 00:09:22,189 A little bit of tinkering is used.
254 00:09:22,189 --> 00:09:23,149 But in the end it was 255 00:09:23,149 --> 00:09:24,649 decided we want to control 256 00:09:24,649 --> 00:09:27,380 for the number of non-zero entries, 257 00:09:27,380 --> 00:09:28,849 or rather the ratio of 258 00:09:28,849 --> 00:09:31,010 non-zero entries in the matrix. 259 00:09:31,010 --> 00:09:33,065 We want to set a density. 260 00:09:33,065 --> 00:09:34,594 And we do this by 261 00:09:34,594 --> 00:09:37,384 deciding what our density is going to be. 262 00:09:37,384 --> 00:09:39,949 And using that, using some linear algebra, 263 00:09:39,949 --> 00:09:41,450 we can compute a threshold 264 00:09:41,450 --> 00:09:43,055 where if the gene expression level 265 00:09:43,055 --> 00:09:47,195 is under the negative phi threshold, 266 00:09:47,195 --> 00:09:48,755 we pop in a minus one. 267 00:09:48,755 --> 00:09:51,800 If it's over the positive phi threshold, 268 00:09:51,800 --> 00:09:53,494 we pop in a positive one, 269 00:09:53,494 --> 00:09:55,580 otherwise we pop in a zero. 270 00:09:55,580 --> 00:09:57,919 For this study, the density was 271 00:09:57,919 --> 00:10:00,379 considered 0.5 for all cancers. 272 00:10:00,379 --> 00:10:01,720 So half of the entries in 273 00:10:01,720 --> 00:10:03,260 every discretized data matrix 274 00:10:03,260 --> 00:10:04,264 are going to be a zero. 275 00:10:04,264 --> 00:10:05,420 We're considering the lower 276 00:10:05,420 --> 00:10:06,935 and upper quartiles 277 00:10:06,935 --> 00:10:08,540 of the expression data. 278 00:10:08,540 --> 00:10:10,490 For any particular study, 279 00:10:10,490 --> 00:10:12,815 one might actually want to tweak D 280 00:10:12,815 --> 00:10:15,470 depending on results that 281 00:10:15,470 --> 00:10:17,195 one wants to look for. 282 00:10:17,195 --> 00:10:19,235 So now that we have our data, 283 00:10:19,235 --> 00:10:21,889 what exactly does the algorithm do?
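[Editor's sketch] The discretization step described above can be illustrated with a short Python example. This is only a sketch under stated assumptions, not the study's code: the function name `discretize`, the use of NumPy, and the choice of one symmetric threshold computed over the whole matrix (rather than per gene) are all assumptions made for illustration.

```python
import numpy as np

def discretize(expr, density=0.5):
    """Map a row-centered expression matrix to values in {-1, 0, 1}.

    Assumption: a single symmetric threshold phi is chosen so that
    roughly `density` of all entries satisfy |value| > phi.
    """
    expr = np.asarray(expr, dtype=float)
    # phi is the (1 - density) quantile of the absolute values.
    phi = np.quantile(np.abs(expr), 1.0 - density)
    out = np.zeros(expr.shape, dtype=int)
    out[expr > phi] = 1      # overexpressed
    out[expr < -phi] = -1    # underexpressed
    return out, phi

# Two hypothetical genes (rows) measured in four hypothetical samples (columns).
toy = np.array([[-2.0, -1.0, 1.0, 2.0],
                [-4.0, -0.5, 0.5, 4.0]])
disc, phi = discretize(toy, density=0.5)
```

With a density of 0.5 and roughly symmetric data, this corresponds to keeping the lower and upper quartiles as -1 and +1, as mentioned in the talk.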
284 00:10:21,889 --> 00:10:25,489 The input is similar but not identical, 285 00:10:25,489 --> 00:10:27,529 but the output for part one, 286 00:10:27,529 --> 00:10:30,529 which is the unsupervised run, and part two, 287 00:10:30,529 --> 00:10:32,645 which is the supervised run, 288 00:10:32,645 --> 00:10:34,595 is quite different. 289 00:10:34,595 --> 00:10:36,290 Our input is we take 290 00:10:36,290 --> 00:10:38,749 our discretized expression matrix. 291 00:10:38,749 --> 00:10:40,474 We have some parameters. 292 00:10:40,474 --> 00:10:41,689 I'll go into these later. 293 00:10:41,689 --> 00:10:43,594 The most important is going to be 294 00:10:43,594 --> 00:10:46,339 what JB referred to as conftol. 295 00:10:46,339 --> 00:10:48,665 This sets the sensitivity 296 00:10:48,665 --> 00:10:51,725 when we're looking for correlations. 297 00:10:51,725 --> 00:10:53,960 For the supervised run, 298 00:10:53,960 --> 00:10:55,535 we also want clinical data, 299 00:10:55,535 --> 00:10:57,800 in our case, survival. 300 00:10:57,800 --> 00:11:00,470 In the output, eventually 301 00:11:00,470 --> 00:11:02,224 we're going to get metagenes. 302 00:11:02,224 --> 00:11:05,705 We're going to group the genes together into 303 00:11:05,705 --> 00:11:08,015 metagene groupings that we hope 304 00:11:08,015 --> 00:11:11,060 represent similar biological processes. 305 00:11:11,060 --> 00:11:13,040 In part two, we're looking 306 00:11:13,040 --> 00:11:15,845 for signatures, or subsets 307 00:11:15,845 --> 00:11:19,115 of the genetic expression data that 308 00:11:19,115 --> 00:11:21,859 regulate these metagenes to 309 00:11:21,859 --> 00:11:23,809 a greater or lesser extent. 310 00:11:23,809 --> 00:11:25,819 In part two, we're also going to get 311 00:11:25,819 --> 00:11:27,379 some survival models, 312 00:11:27,379 --> 00:11:29,330 particularly a Kaplan-Meier. 313 00:11:29,330 --> 00:11:30,770 And this is my first time 314 00:11:30,770 --> 00:11:32,000 giving this talk in public.
315 00:11:32,000 --> 00:11:33,200 So congratulations. 316 00:11:33,200 --> 00:11:34,910 I think we found our first typo there. 317 00:11:34,910 --> 00:11:37,625 I believe Meier is spelled with an I. 318 00:11:37,625 --> 00:11:40,805 And also a model for 319 00:11:40,805 --> 00:11:43,354 new patients based on 320 00:11:43,354 --> 00:11:47,849 how given genetic factors affect survival. 321 00:11:50,530 --> 00:11:53,180 Tristan, could I 322 00:11:53,180 --> 00:11:56,449 interrupt with a simple question, please? 323 00:11:56,449 --> 00:12:01,235 Are you doing this for a particular cancer, 324 00:12:01,235 --> 00:12:03,439 for individual cancers, 325 00:12:03,439 --> 00:12:04,939 or are you doing this for 326 00:12:04,939 --> 00:12:07,340 many cancers all at once? 327 00:12:07,340 --> 00:12:09,169 Each run of the model is on 328 00:12:09,169 --> 00:12:10,955 a single particular cancer. 329 00:12:10,955 --> 00:12:13,340 The studies are, there are ways to 330 00:12:13,340 --> 00:12:14,720 compare, but one cancer at 331 00:12:14,720 --> 00:12:16,534 a time, right? Okay, good. 332 00:12:16,534 --> 00:12:18,934 Thanks. Thank you. 333 00:12:18,934 --> 00:12:21,080 I'm going quite fast in 334 00:12:21,080 --> 00:12:22,579 order to respect everyone's time. 335 00:12:22,579 --> 00:12:24,050 So if at any point 336 00:12:24,050 --> 00:12:25,519 there is a clarifying question 337 00:12:25,519 --> 00:12:28,380 I can answer, please feel free to pipe in. 338 00:12:29,920 --> 00:12:32,824 So this is where the actual 339 00:12:32,824 --> 00:12:34,264 order theory comes in.
340 00:12:34,264 --> 00:12:36,425 We're going to define a relation 341 00:12:36,425 --> 00:12:38,780 on our genes that's going to be 342 00:12:38,780 --> 00:12:40,009 based on the density 343 00:12:40,009 --> 00:12:41,390 we're looking for that has already 344 00:12:41,390 --> 00:12:43,549 been set and our sensitivity 345 00:12:43,549 --> 00:12:44,660 metric known as 346 00:12:44,660 --> 00:12:46,925 conftol in the literature. 347 00:12:46,925 --> 00:12:48,469 It was decided after some 348 00:12:48,469 --> 00:12:50,090 experimentation that for finding 349 00:12:50,090 --> 00:12:52,820 metagenes, 0.5 works well; 350 00:12:52,820 --> 00:12:54,289 when you're looking for signatures, 351 00:12:54,289 --> 00:12:56,450 depending on the number of samples you have, 352 00:12:56,450 --> 00:12:57,949 the sensitivity is set at 353 00:12:57,949 --> 00:13:00,965 a number of different levels. 354 00:13:00,965 --> 00:13:03,529 So we're going to define a relation 355 00:13:03,529 --> 00:13:04,655 among the genes. 356 00:13:04,655 --> 00:13:06,515 So for any given gene X, 357 00:13:06,515 --> 00:13:08,569 we're going to count the number of 358 00:13:08,569 --> 00:13:11,795 samples where that gene is overexpressed. 359 00:13:11,795 --> 00:13:14,284 We're going to count the ones in that row. 360 00:13:14,284 --> 00:13:16,639 We're going to call that set X plus. 361 00:13:16,639 --> 00:13:17,960 Then we're also going to count 362 00:13:17,960 --> 00:13:18,980 the number of samples 363 00:13:18,980 --> 00:13:21,095 where that gene gets underexpressed. 364 00:13:21,095 --> 00:13:22,400 We're going to count the columns where 365 00:13:22,400 --> 00:13:24,274 we see a negative one. 366 00:13:24,274 --> 00:13:28,684 We're then going to say that x regulates y. 367 00:13:28,684 --> 00:13:31,445 It influences y in some way.
368 00:13:31,445 --> 00:13:33,470 That is the case if the overlap in 369 00:13:33,470 --> 00:13:36,410 those columns as a ratio to the regulator, 370 00:13:36,410 --> 00:13:39,484 both in terms of overregulation and 371 00:13:39,484 --> 00:13:42,020 under-regulation, is greater than or equal 372 00:13:42,020 --> 00:13:45,184 to our sensitivity condition. 373 00:13:45,184 --> 00:13:47,179 I should note at this point that 374 00:13:47,179 --> 00:13:49,249 the word regulates is a little 375 00:13:49,249 --> 00:13:51,170 problematic here because while 376 00:13:51,170 --> 00:13:54,200 mathematically it has good meaning, 377 00:13:54,200 --> 00:13:56,420 there may be no direct regulation 378 00:13:56,420 --> 00:13:58,759 as a biological process. 379 00:13:58,759 --> 00:14:00,860 We don't know that simply from 380 00:14:00,860 --> 00:14:02,255 the data analysis; that 381 00:14:02,255 --> 00:14:05,309 would require further investigation. 382 00:14:06,010 --> 00:14:08,719 Similarly, if two genes 383 00:14:08,719 --> 00:14:10,759 both regulate one another, 384 00:14:10,759 --> 00:14:12,649 that is, they are typically over 385 00:14:12,649 --> 00:14:14,915 or under expressed simultaneously, 386 00:14:14,915 --> 00:14:17,539 we're going to say they're equivalent and use 387 00:14:17,539 --> 00:14:18,950 the standard equivalence 388 00:14:18,950 --> 00:14:22,025 relation approximation notation. 389 00:14:22,025 --> 00:14:25,009 So we'll say x is equivalent to y if they 390 00:14:25,009 --> 00:14:27,590 are normally overexpressed or underexpressed 391 00:14:27,590 --> 00:14:28,954 at the same time. 392 00:14:28,954 --> 00:14:31,940 This is also somewhat problematic terminology 393 00:14:31,940 --> 00:14:34,355 which I'll get into currently. 394 00:14:34,355 --> 00:14:36,950 So the first step of the algorithm, 395 00:14:36,950 --> 00:14:38,989 we're going to form groups of 396 00:14:38,989 --> 00:14:41,374 genes for any given gene X.
397 00:14:41,374 --> 00:14:44,945 We're going to look for any gene that is 398 00:14:44,945 --> 00:14:47,239 equivalent to it, that seems to be over or 399 00:14:47,239 --> 00:14:50,359 under expressed at the same time. 400 00:14:50,359 --> 00:14:53,045 Now this isn't necessarily 401 00:14:53,045 --> 00:14:54,650 an equivalence relation 402 00:14:54,650 --> 00:14:56,840 because the relation, as defined on 403 00:14:56,840 --> 00:14:59,944 the previous slide, is not transitive. 404 00:14:59,944 --> 00:15:02,840 We can impose some transitivity 405 00:15:02,840 --> 00:15:04,280 by merging some of 406 00:15:04,280 --> 00:15:06,140 these groupings we're going to find, 407 00:15:06,140 --> 00:15:09,199 but we don't want to carry that too far or we 408 00:15:09,199 --> 00:15:10,520 might end up conflating 409 00:15:10,520 --> 00:15:13,025 different biological processes. 410 00:15:13,025 --> 00:15:15,409 In this particular analysis, 411 00:15:15,409 --> 00:15:20,270 there was a single merging step initially. 412 00:15:20,270 --> 00:15:22,790 So we'll merge two groups 413 00:15:22,790 --> 00:15:24,859 if there is significant enough overlap. 414 00:15:24,859 --> 00:15:27,900 And this is a condition that can be adjusted. 415 00:15:28,480 --> 00:15:33,199 Overlap was generally set at 0.5 for part 416 00:15:33,199 --> 00:15:34,999 one when we're looking at metagenes and 417 00:15:34,999 --> 00:15:37,949 0.6 for part two. 418 00:15:39,580 --> 00:15:42,919 In the future, this merging step, 419 00:15:42,919 --> 00:15:44,090 at least for part one, 420 00:15:44,090 --> 00:15:46,399 we'd probably want to automate this so 421 00:15:46,399 --> 00:15:49,490 that we get more than one round of merging.
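[Editor's sketch] The "regulates" relation, the equivalence built from it, and the initial grouping step described above can be sketched in Python as follows. The talk does not spell out the exact formula, so this is one reading of it and should be treated as an assumption: x regulates y when both |X+ ∩ Y+| / |X+| and |X- ∩ Y-| / |X-| are at least conftol. The function names, the handling of empty X+ or X- sets, and the use of NumPy are likewise illustrative choices. The merging step is not sketched because the overlap measure is not defined in the talk.

```python
import numpy as np

def regulates(m, x, y, conftol=0.5):
    """True if gene x 'regulates' gene y in the discretized matrix m.

    Assumed reading: the over- and under-expression overlaps, each taken
    as a ratio to the regulator's own set, must both reach conftol.
    """
    x_plus, x_minus = m[x] == 1, m[x] == -1
    y_plus, y_minus = m[y] == 1, m[y] == -1
    if not x_plus.any() or not x_minus.any():
        return False  # assumption: an empty X+ or X- regulates nothing
    over = np.sum(x_plus & y_plus) / np.sum(x_plus)
    under = np.sum(x_minus & y_minus) / np.sum(x_minus)
    return bool(over >= conftol and under >= conftol)

def equivalent(m, x, y, conftol=0.5):
    """Two genes are 'equivalent' when each regulates the other."""
    return regulates(m, x, y, conftol) and regulates(m, y, x, conftol)

def group_of(m, x, conftol=0.5):
    """Gene x together with every gene equivalent to it."""
    others = (y for y in range(m.shape[0]) if y != x)
    return {x} | {y for y in others if equivalent(m, x, y, conftol)}

# Three hypothetical genes (rows) across four hypothetical samples (columns).
toy = np.array([[ 1,  1, -1, -1],
                [ 1,  1, -1,  0],
                [-1, -1,  1,  1]])
```

On this toy matrix, genes 0 and 1 are equivalent at a conftol of 0.5, but at 0.75 gene 1 still regulates gene 0 while gene 0 no longer regulates gene 1, which shows that the relation is directional.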
422 00:15:49,490 --> 00:15:51,889 Because when the conditions were 423 00:15:51,889 --> 00:15:54,664 checked by hand after one merging, 424 00:15:54,664 --> 00:15:56,060 in order to limit the number of 425 00:15:56,060 --> 00:15:58,040 groups to something that could 426 00:15:58,040 --> 00:16:01,940 be more easily analyzed with a critical eye, 427 00:16:01,940 --> 00:16:04,129 it turned out that we could 428 00:16:04,129 --> 00:16:06,710 have taken transitivity a little further, 429 00:16:06,710 --> 00:16:07,519 and hopefully this will be a 430 00:16:07,519 --> 00:16:08,569 little more clear when I 431 00:16:08,569 --> 00:16:11,015 go to the next graphic here. 432 00:16:11,015 --> 00:16:13,610 So here is an example 433 00:16:13,610 --> 00:16:15,560 taken from the primary paper. 434 00:16:15,560 --> 00:16:17,104 I'll find the table that I 435 00:16:17,104 --> 00:16:19,805 had trouble reproducing in Beamer. 436 00:16:19,805 --> 00:16:24,110 We're looking at a specific type of cancer. 437 00:16:24,110 --> 00:16:26,419 After step one, we have 438 00:16:26,419 --> 00:16:29,494 found five major groupings. 439 00:16:29,494 --> 00:16:32,645 So one round of merging has been done. 440 00:16:32,645 --> 00:16:34,205 So in this table, 441 00:16:34,205 --> 00:16:36,289 G1 is a single group 442 00:16:36,289 --> 00:16:38,090 of genes that was found after 443 00:16:38,090 --> 00:16:40,339 the initial investigation and first 444 00:16:40,339 --> 00:16:44,520 merging, G2 another, and so on and so forth. 445 00:16:44,620 --> 00:16:46,624 F of G: 446 00:16:46,624 --> 00:16:47,840 this is a measure 447 00:16:47,840 --> 00:16:52,010 of how strongly the metagene overlaps. 448 00:16:52,010 --> 00:16:55,085 I'll define that on a future slide. 449 00:16:55,085 --> 00:16:57,874 As you can see just by eyeballing it, 450 00:16:57,874 --> 00:16:59,719 there is significant overlap 451 00:16:59,719 --> 00:17:02,060 between these first three groups.
452 00:17:02,060 --> 00:17:04,760 And so we would consider these 453 00:17:04,760 --> 00:17:07,699 to be related processes and pick the 454 00:17:07,699 --> 00:17:10,250 most strongly correlated one to 455 00:17:10,250 --> 00:17:11,870 represent the group and 456 00:17:11,870 --> 00:17:14,194 call that the metagene. 457 00:17:14,194 --> 00:17:16,549 So in this case, G2 458 00:17:16,549 --> 00:17:18,814 is gonna be one of our metagenes. 459 00:17:18,814 --> 00:17:21,139 We seem to have another distinct process 460 00:17:21,139 --> 00:17:23,389 happening with groups 4 and 5. 461 00:17:23,389 --> 00:17:26,599 So we'll pick the slightly higher score here. 462 00:17:26,599 --> 00:17:28,760 And we'll say G5 is 463 00:17:28,760 --> 00:17:30,109 the metagene that will carry 464 00:17:30,109 --> 00:17:32,219 over into part two. 465 00:17:32,980 --> 00:17:36,629 Well, how do we score these? 466 00:17:37,180 --> 00:17:41,990 Essentially, we want to see how strong are 467 00:17:41,990 --> 00:17:44,060 the relations in this group 468 00:17:44,060 --> 00:17:46,084 given its specific size. 469 00:17:46,084 --> 00:17:48,080 Obviously, the more genes we have, 470 00:17:48,080 --> 00:17:50,974 the more arrow relations we might expect. 471 00:17:50,974 --> 00:17:52,519 So it would be nice to have 472 00:17:52,519 --> 00:17:55,070 a measure that increases both with 473 00:17:55,070 --> 00:17:58,505 the number of genes and with the density of 474 00:17:58,505 --> 00:17:59,959 arrow relations when we 475 00:17:59,959 --> 00:18:02,405 consider them as a complete graph. 476 00:18:02,405 --> 00:18:04,535 So simplistically, 477 00:18:04,535 --> 00:18:07,025 the following function was chosen.
478 00:18:07,025 --> 00:18:08,300 If we have a group, 479 00:18:08,300 --> 00:18:10,444 we're simply going to score it as 480 00:18:10,444 --> 00:18:13,009 the number of genes in the group times 481 00:18:13,009 --> 00:18:15,680 the density of edges, the number of edges divided by 482 00:18:15,680 --> 00:18:16,940 the number of edges we would 483 00:18:16,940 --> 00:18:18,665 get in a complete graph. 484 00:18:18,665 --> 00:18:21,200 It simplifies rather nicely. 485 00:18:21,200 --> 00:18:23,975 It's sort of a keep-it-simple, 486 00:18:23,975 --> 00:18:25,939 silly method for finding 487 00:18:25,939 --> 00:18:28,129 a score for the different groupings 488 00:18:28,129 --> 00:18:30,360 of genes so far. 489 00:18:35,370 --> 00:18:37,750 Now the next step, which 490 00:18:37,750 --> 00:18:39,835 actually gives the algorithm its name: 491 00:18:39,835 --> 00:18:41,559 this is refinement using what are 492 00:18:41,559 --> 00:18:43,899 referred to as upstream regulators. 493 00:18:43,899 --> 00:18:45,924 And technically this is optional. 494 00:18:45,924 --> 00:18:48,640 In practice, it was decided to skip 495 00:18:48,640 --> 00:18:49,870 this step when simply 496 00:18:49,870 --> 00:18:51,219 looking for metagenes, 497 00:18:51,219 --> 00:18:52,510 but when trying to look for 498 00:18:52,510 --> 00:18:54,369 signatures that could help predict 499 00:18:54,369 --> 00:18:55,899 survival and eventually 500 00:18:55,899 --> 00:18:57,760 become good drug targets, 501 00:18:57,760 --> 00:19:01,750 this was the algorithm's primary advantage. 502 00:19:01,750 --> 00:19:04,779 The first idea is that for any given gene, 503 00:19:04,779 --> 00:19:07,360 we would like some numerical measure 504 00:19:07,360 --> 00:19:09,925 of how good of a regulator it is.
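[Editor's sketch] Returning to the group score F(G) described at the start of this passage, here is a small Python illustration. It assumes the edges are directed "regulates" arrows between members of the group, so a complete graph on n genes has n(n - 1) possible arrows; that counting convention, and the function name `group_score`, are assumptions rather than something stated in the talk.

```python
def group_score(n_genes, n_arrows):
    """Score a group as size times edge density.

    Assumption: density = arrows present / arrows possible, with
    n_genes * (n_genes - 1) possible directed arrows.
    """
    if n_genes < 2:
        return 0.0  # a single gene has no possible arrows
    possible = n_genes * (n_genes - 1)
    # Algebraically this simplifies to n_arrows / (n_genes - 1).
    return n_genes * n_arrows / possible
```

Under these assumptions a fully connected group of n genes scores exactly n, and a group with half of its possible arrows scores n / 2, so the score grows with both the size of the group and the density of its relations.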
505 00:19:09,925 --> 00:19:11,169 And once again, after 506 00:19:11,169 --> 00:19:12,639 a little experimentation, 507 00:19:12,639 --> 00:19:15,765 it was decided on a fairly simple metric 508 00:19:15,765 --> 00:19:17,510 that can be seen here. 509 00:19:17,510 --> 00:19:20,855 So for every, for any given X, we take, 510 00:19:20,855 --> 00:19:25,625 for every arrow relation where X is a source, 511 00:19:25,625 --> 00:19:28,369 if you think of it as a directed graph, 512 00:19:28,369 --> 00:19:29,479 we'll take the size of 513 00:19:29,479 --> 00:19:32,525 the overlaps, square it. 514 00:19:32,525 --> 00:19:35,585 And in the denominator we'll put 515 00:19:35,585 --> 00:19:39,139 the number of non-zero entries, 516 00:19:39,139 --> 00:19:41,405 both for positive one and negative one. 517 00:19:41,405 --> 00:19:45,410 This is just, it's a fairly ad hoc measure, 518 00:19:45,410 --> 00:19:46,459 but it's served quite 519 00:19:46,459 --> 00:19:48,930 well in this particular study. 520 00:19:49,330 --> 00:19:52,220 So for any given group that 521 00:19:52,220 --> 00:19:54,484 we had from the previous step, 522 00:19:54,484 --> 00:19:55,610 we're going to look for 523 00:19:55,610 --> 00:19:57,530 regulators of that group, 524 00:19:57,530 --> 00:20:01,489 that is genes outside of this grouping that 525 00:20:01,489 --> 00:20:03,799 may in fact be controlling 526 00:20:03,799 --> 00:20:06,845 how this group is being expressed. 527 00:20:06,845 --> 00:20:08,540 Another good metaphor that 528 00:20:08,540 --> 00:20:09,994 JB came up with is, 529 00:20:09,994 --> 00:20:11,090 we're not looking for 530 00:20:11,090 --> 00:20:12,769 the distributors on the streets. 531 00:20:12,769 --> 00:20:14,749 We're looking for the mob bosses who are 532 00:20:14,749 --> 00:20:17,990 controlling things behind the scenes.
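One possible reading of the regulator metric described above, offered as a sketch rather than the formula on the slide. It assumes the squared overlap sizes are summed over every arrow with X as its source and divided by the count of X's non-zero (+1 or -1) discretized entries; how an overlap is measured is left to the caller, and both parameter names are hypothetical.

```python
def regulator_score(overlap_sizes, nonzero_count):
    """Ad hoc strength of a gene X as a regulator.

    overlap_sizes: one overlap size per arrow relation X -> Y
                   (assumed to be precomputed elsewhere).
    nonzero_count: number of +1 or -1 entries in X's discretized row.
    Assumption: the squared overlaps are summed across arrows.
    """
    if nonzero_count <= 0:
        return 0.0  # X is never over- or under-expressed
    return sum(size * size for size in overlap_sizes) / nonzero_count


# A gene with two outgoing arrows (overlaps 3 and 4) and 10 non-zero entries.
example = regulator_score([3, 4], 10)
```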
533 00:20:17,990 --> 00:20:21,769 So for any X not in our given group, 534 00:20:21,769 --> 00:20:23,224 we're going to consider 535 00:20:23,224 --> 00:20:25,880 a new subset that consists of X and 536 00:20:25,880 --> 00:20:28,010 all the genes in our group 537 00:20:28,010 --> 00:20:30,965 that X appears to regulate. 538 00:20:30,965 --> 00:20:33,050 We're going to score each 539 00:20:33,050 --> 00:20:34,594 of these to see how well 540 00:20:34,594 --> 00:20:38,059 X regulates this particular group. 541 00:20:38,059 --> 00:20:41,959 Once again, it's simply a standard ratio. 542 00:20:41,959 --> 00:20:43,819 Alright: what proportion 543 00:20:43,819 --> 00:20:45,109 of this group is 544 00:20:45,109 --> 00:20:47,209 regulated and how strong 545 00:20:47,209 --> 00:20:49,474 a regulator overall is X? 546 00:20:49,474 --> 00:20:52,670 So we get a score denoted P sub X 547 00:20:52,670 --> 00:20:59,375 G. For the purposes of the analysis, 548 00:20:59,375 --> 00:21:00,770 we want to, we 549 00:21:00,770 --> 00:21:02,780 probably want to limit how many of 550 00:21:02,780 --> 00:21:04,819 these subgroups we're going 551 00:21:04,819 --> 00:21:06,440 to take into account and 552 00:21:06,440 --> 00:21:08,195 analyze further just 553 00:21:08,195 --> 00:21:10,865 to save computational resources. 554 00:21:10,865 --> 00:21:12,679 And in this particular study, 555 00:21:12,679 --> 00:21:14,479 five groups were kept 556 00:21:14,479 --> 00:21:16,564 for every different type of cancer, 557 00:21:16,564 --> 00:21:17,869 but this can be adjusted 558 00:21:17,869 --> 00:21:20,340 based on the circumstances. 559 00:21:23,380 --> 00:21:26,675 So to consider how well 560 00:21:26,675 --> 00:21:29,150 these signatures may actually 561 00:21:29,150 --> 00:21:30,994 indicate survival, and this part, 562 00:21:30,994 --> 00:21:32,269 unfortunately, I'm going to have to 563 00:21:32,269 --> 00:21:33,815 rush through a little bit.
564 00:21:33,815 --> 00:21:36,350 There's quite a bit of linear algebra and 565 00:21:36,350 --> 00:21:38,300 statistical correlation analysis 566 00:21:38,300 --> 00:21:39,814 that goes into this. 567 00:21:39,814 --> 00:21:42,199 Numerically, we're going to try 568 00:21:42,199 --> 00:21:47,585 and score how big an influence 569 00:21:47,585 --> 00:21:50,794 each gene in the signature, 570 00:21:50,794 --> 00:21:52,580 these G sub Xs, 571 00:21:52,580 --> 00:21:55,354 might actually have on patient survival. 572 00:21:55,354 --> 00:21:56,824 So we're going to take 573 00:21:56,824 --> 00:21:58,399 an expression matrix M, 574 00:21:58,399 --> 00:21:59,689 but we're going to limit it to 575 00:21:59,689 --> 00:22:01,130 one of the signatures. 576 00:22:01,130 --> 00:22:03,065 So we're not taking the full matrix anymore. 577 00:22:03,065 --> 00:22:04,160 We're only taking those 578 00:22:04,160 --> 00:22:05,509 rows that correspond to 579 00:22:05,509 --> 00:22:06,979 the signature, and applying 580 00:22:06,979 --> 00:22:10,099 a standard singular value decomposition. 581 00:22:10,099 --> 00:22:12,560 The next step is we're going 582 00:22:12,560 --> 00:22:15,860 to analyze each of 583 00:22:15,860 --> 00:22:20,734 the singular 584 00:22:20,734 --> 00:22:23,494 vectors at the end of the decomposition. 585 00:22:23,494 --> 00:22:25,325 We're going to test it using 586 00:22:25,325 --> 00:22:27,770 Kaplan-Meier and Cox regression for 587 00:22:27,770 --> 00:22:29,419 p less than or equal to 588 00:22:29,419 --> 00:22:32,749 zero point. I beg your pardon. 589 00:22:32,749 --> 00:22:33,694 I think I've just found 590 00:22:33,694 --> 00:22:34,850 another typo that should 591 00:22:34,850 --> 00:22:37,115 be 0.05. My mistake. 592 00:22:37,115 --> 00:22:39,554 I'll get that fixed before I 593 00:22:39,554 --> 00:22:40,369 send out a copy of 594 00:22:40,369 --> 00:22:43,100 the slides, after the talk.
595 00:22:43,100 --> 00:22:45,739 Just to see if that is a significant factor. 596 00:22:45,739 --> 00:22:48,890 So is the gene represented by this vector in 597 00:22:48,890 --> 00:22:51,305 the singular value decomposition 598 00:22:51,305 --> 00:22:53,089 significant when it comes 599 00:22:53,089 --> 00:22:55,865 to predicting patient survival? 600 00:22:55,865 --> 00:22:59,630 We then create a vector that includes 601 00:22:59,630 --> 00:23:03,275 scores for any given gene. 602 00:23:03,275 --> 00:23:05,869 There's a little bit of ad hockery here 603 00:23:05,869 --> 00:23:09,245 because in a singular value decomposition, 604 00:23:09,245 --> 00:23:11,539 the sign is only defined up to 605 00:23:11,539 --> 00:23:14,120 plus or minus one in any decomposition. 606 00:23:14,120 --> 00:23:16,295 So to keep the correlations positive, 607 00:23:16,295 --> 00:23:17,810 we may end up multiplying 608 00:23:17,810 --> 00:23:19,834 by negative one here. 609 00:23:19,834 --> 00:23:21,829 This can be found from 610 00:23:21,829 --> 00:23:24,664 the standard singular value decomposition 611 00:23:24,664 --> 00:23:27,179 through the following computation. 612 00:23:29,110 --> 00:23:33,320 This gives us, for any subgroup of regulators, 613 00:23:33,320 --> 00:23:34,714 we have a signature here, 614 00:23:34,714 --> 00:23:36,455 we'll form the submatrix. 615 00:23:36,455 --> 00:23:40,560 This is from the discretized expression data. 616 00:23:40,570 --> 00:23:42,620 We'll run 617 00:23:42,620 --> 00:23:44,839 the Eigen survival analysis to 618 00:23:44,839 --> 00:23:47,224 give us a predictive score for each patient. 619 00:23:47,224 --> 00:23:48,289 And that number is 620 00:23:48,289 --> 00:23:49,639 a linear combination 621 00:23:49,639 --> 00:23:51,290 of their expression values.
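To make the sign ambiguity and the "linear combination" concrete, here is a small pure-Python sketch. It is not the study's pipeline: it recovers only the leading left singular vector, by power iteration on M Mᵀ, and it assumes the sign is oriented so that patient scores correlate positively with survival time (the talk does not say what the correlation is taken against). All function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def leading_left_singular_vector(rows, iterations=200):
    """Power iteration on the Gram matrix M M^T (genes x genes).

    The start vector is deliberately uneven; a start orthogonal to the
    dominant direction would still fail, which is acceptable for a sketch.
    """
    g = len(rows)
    gram = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(g)]
            for i in range(g)]
    u = [float(i + 1) for i in range(g)]
    for _ in range(iterations):
        v = [sum(gram[i][j] * u[j] for j in range(g)) for i in range(g)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        u = [x / norm for x in v]
    return u


def eigen_scores(rows, survival_days):
    """Per-patient score = linear combination of that patient's expression
    values, with the sign flipped if needed so scores track survival."""
    u = leading_left_singular_vector(rows)
    n_patients = len(rows[0])
    scores = [sum(u[i] * rows[i][p] for i in range(len(rows)))
              for p in range(n_patients)]
    if pearson(scores, survival_days) < 0:
        u = [-x for x in u]
        scores = [-s for s in scores]
    return u, scores


# Two signature genes (rows) over four patients (columns), discretized values.
weights, scores = eigen_scores([[1, -1, 1, -1], [1, -1, 1, -1]],
                               [100, 900, 200, 800])
```

In this toy input, high expression goes with short survival, so the weights come out negative after the sign flip. Censoring is ignored here for brevity.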
622 00:23:51,290 --> 00:23:53,360 I realize I went over that quite quickly, 623 00:23:53,360 --> 00:23:54,379 so I'd be happy to try and 624 00:23:54,379 --> 00:23:55,520 answer any questions 625 00:23:55,520 --> 00:23:56,749 after I get to the end of 626 00:23:56,749 --> 00:23:59,130 the slides in a few minutes. 627 00:23:59,830 --> 00:24:02,059 We take the top and 628 00:24:02,059 --> 00:24:05,569 bottom quartiles of patients and we do 629 00:24:05,569 --> 00:24:08,165 the Kaplan-Meier survival curves 630 00:24:08,165 --> 00:24:09,949 over time for each of 631 00:24:09,949 --> 00:24:12,630 those top and bottom quartiles. 632 00:24:12,630 --> 00:24:15,609 There are a number of statistical tests 633 00:24:15,609 --> 00:24:17,110 that can be used to see 634 00:24:17,110 --> 00:24:21,055 if those survival curves are well-separated. 635 00:24:21,055 --> 00:24:22,360 It was decided to use 636 00:24:22,360 --> 00:24:25,990 the common log-rank and Cox tests. 637 00:24:25,990 --> 00:24:28,750 And I must admit this is not my specialty, 638 00:24:28,750 --> 00:24:29,920 but my understanding is 639 00:24:29,920 --> 00:24:31,540 that log-rank is 640 00:24:31,540 --> 00:24:33,550 often preferable if we feel we 641 00:24:33,550 --> 00:24:38,110 don't have many significant outside factors 642 00:24:38,110 --> 00:24:39,520 such as overall health, 643 00:24:39,520 --> 00:24:42,460 whereas Cox serves better if there may be 644 00:24:42,460 --> 00:24:44,170 hidden variables that we don't 645 00:24:44,170 --> 00:24:46,300 have direct access to the data for. 646 00:24:46,300 --> 00:24:49,060 So this Fisher score is 647 00:24:49,060 --> 00:24:50,709 a hybrid of the two 648 00:24:50,709 --> 00:24:53,065 of them trying to balance it out. 649 00:24:53,065 --> 00:24:57,530 This then scores each signature.
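The quartile split and Kaplan-Meier curves mentioned above can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard product-limit estimator, not the code used in the study; the log-rank and Cox tests that compare the two curves are not reproduced here, and the helper names are hypothetical.

```python
def quartile_groups(scores):
    """Return (bottom, top) lists of patient indices by score quartile."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = len(scores) // 4
    if k == 0:
        return [], []
    return order[:k], order[-k:]


def kaplan_meier(durations, events):
    """Product-limit estimate of survival.

    durations: follow-up time per patient.
    events:    True if the patient died at that time, False if censored.
    Returns [(time, survival_probability)] at each distinct death time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


bottom, top = quartile_groups([5, 1, 9, 3, 7, 2, 8, 4])
curve = kaplan_meier([1, 2, 3], [True, False, True])
```

Running `kaplan_meier` separately on the patients in `bottom` and `top` gives the two curves whose separation the statistical tests then assess.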
650 00:24:57,530 --> 00:24:59,645 So the higher the score, 651 00:24:59,645 --> 00:25:01,639 the hope is that the 652 00:25:01,639 --> 00:25:04,249 more significant this signature 653 00:25:04,249 --> 00:25:06,889 is at separating who is going to 654 00:25:06,889 --> 00:25:10,980 survive well and who won't survive well. 655 00:25:14,320 --> 00:25:20,164 Indeed, if we use the actual scores from our, 656 00:25:20,164 --> 00:25:22,759 our Eigen survival vector here, 657 00:25:22,759 --> 00:25:25,325 that in and of itself can actually 658 00:25:25,325 --> 00:25:28,459 break patients into high and low-risk groups. 659 00:25:28,459 --> 00:25:29,494 I'll have a chance to 660 00:25:29,494 --> 00:25:31,130 review a follow-up project 661 00:25:31,130 --> 00:25:33,019 that I assisted Dr. 662 00:25:33,019 --> 00:25:34,550 Nation with at the very end 663 00:25:34,550 --> 00:25:36,599 of the presentation. 664 00:25:37,270 --> 00:25:38,569 Well, 665 00:25:38,569 --> 00:25:40,610 if we've got a machine-learning algorithm 666 00:25:40,610 --> 00:25:42,050 that is looking for relations, 667 00:25:42,050 --> 00:25:43,009 we have to be a little 668 00:25:43,009 --> 00:25:44,299 worried about whether or not 669 00:25:44,299 --> 00:25:46,894 we're picking up on random variance. 670 00:25:46,894 --> 00:25:49,519 So there are more details in the paper, 671 00:25:49,519 --> 00:25:51,500 but the predicted number of 672 00:25:51,500 --> 00:25:53,420 random arrow relations is 673 00:25:53,420 --> 00:25:55,774 actually going to be quite low. 674 00:25:55,774 --> 00:25:58,280 In practice, testing was done on 675 00:25:58,280 --> 00:26:00,830 permuted data matrices and shows that you 676 00:26:00,830 --> 00:26:03,110 don't pick up on many false positives 677 00:26:03,110 --> 00:26:04,609 unless your sensitivity 678 00:26:04,609 --> 00:26:08,989 variable is set below a 0.4 threshold.
679 00:26:08,989 --> 00:26:10,009 And in this study, 680 00:26:10,009 --> 00:26:12,079 the lowest value used 681 00:26:12,079 --> 00:26:15,604 was above 0.6. So false, 682 00:26:15,604 --> 00:26:18,470 false discoveries hopefully will 683 00:26:18,470 --> 00:26:20,615 not be much of an issue. 684 00:26:20,615 --> 00:26:22,759 Indeed, the probability of 685 00:26:22,759 --> 00:26:24,830 random edges is somewhere 686 00:26:24,830 --> 00:26:26,270 on the order of ten to the negative 687 00:26:26,270 --> 00:26:28,594 five for an individual edge. 688 00:26:28,594 --> 00:26:31,609 In this particular study, there was, 689 00:26:31,609 --> 00:26:33,739 for, sorry, I'm gonna have to 690 00:26:33,739 --> 00:26:36,184 brace myself for the pronunciation here. 691 00:26:36,184 --> 00:26:38,795 For cholangiocarcinoma. 692 00:26:38,795 --> 00:26:40,009 I think I got that. 693 00:26:40,009 --> 00:26:42,019 There were only 36 patients. 694 00:26:42,019 --> 00:26:45,710 So we may have a lot of randomness, but here, 695 00:26:45,710 --> 00:26:47,089 the expected number of 696 00:26:47,089 --> 00:26:50,960 random edges is still only around 9,200, 697 00:26:50,960 --> 00:26:52,700 but the number of relations actually 698 00:26:52,700 --> 00:26:56,970 detected was well above 800,000. 699 00:26:57,340 --> 00:26:59,419 In part two, we'd expect 700 00:26:59,419 --> 00:27:01,579 even fewer random arrows 701 00:27:01,579 --> 00:27:02,824 because we're looking at 702 00:27:02,824 --> 00:27:04,505 tighter relationships. 703 00:27:04,505 --> 00:27:06,350 So on false discovery, 704 00:27:06,350 --> 00:27:09,449 the algorithm seems to hold up quite well. 705 00:27:09,520 --> 00:27:12,334 The other issue we may run into 706 00:27:12,334 --> 00:27:14,960 is how sensitive the algorithm is. 707 00:27:14,960 --> 00:27:15,800 We want to make sure that 708 00:27:15,800 --> 00:27:17,240 the signals we're picking up 709 00:27:17,240 --> 00:27:19,789 on are actually strong ones.
710 00:27:19,789 --> 00:27:21,695 So once again, 711 00:27:21,695 --> 00:27:25,009 fairly basic testing was done where 712 00:27:25,009 --> 00:27:28,189 a signal matrix using typical values that 713 00:27:28,189 --> 00:27:29,749 were found in the expression data 714 00:27:29,749 --> 00:27:30,980 was constructed, though 715 00:27:30,980 --> 00:27:32,404 notice that it was in 716 00:27:32,404 --> 00:27:34,069 different dimensions simply for 717 00:27:34,069 --> 00:27:36,049 the purposes of running this test. 718 00:27:36,049 --> 00:27:37,849 A step signal was added. 719 00:27:37,849 --> 00:27:39,559 So in the first 200 rows, 720 00:27:39,559 --> 00:27:41,989 30 entries had a one added to them. 721 00:27:41,989 --> 00:27:44,824 30 entries had a negative one added to them. 722 00:27:44,824 --> 00:27:47,135 And then there were 60 zeros 723 00:27:47,135 --> 00:27:48,545 added into each one. 724 00:27:48,545 --> 00:27:49,850 The goal then was to see 725 00:27:49,850 --> 00:27:52,055 how much noise could be added 726 00:27:52,055 --> 00:27:53,390 before the signal was 727 00:27:53,390 --> 00:27:56,765 lost when the algorithm was run. 728 00:27:56,765 --> 00:27:59,944 So we do a Gaussian noise matrix 729 00:27:59,944 --> 00:28:03,049 and multiply it by a constant to 730 00:28:03,049 --> 00:28:05,494 adjust the signal-to-noise ratio 731 00:28:05,494 --> 00:28:07,519 and run some repeated tests for 732 00:28:07,519 --> 00:28:09,470 different levels of sensitivity using 733 00:28:09,470 --> 00:28:12,725 the comp tol parameter. 734 00:28:12,725 --> 00:28:13,999 The conclusion was that 735 00:28:13,999 --> 00:28:15,139 the signals are really 736 00:28:15,139 --> 00:28:16,160 quite strong and I've 737 00:28:16,160 --> 00:28:17,674 included a few tables here. 738 00:28:17,674 --> 00:28:20,195 These are all measured in the decibel scale. 739 00:28:20,195 --> 00:28:22,744 So when comp tol is quite high, 740 00:28:22,744 --> 00:28:25,265 we have no false positives.
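The noise-scaling step in that sensitivity test can be illustrated as follows. This sketch assumes the usual power-ratio definition of signal-to-noise ratio in decibels, 10·log10(P_signal / P_noise), which the talk does not spell out, and it models only the step pattern rather than a matrix of typical expression values. The dimensions and function names are placeholders, not those of the study.

```python
import math
import random


def mean_power(matrix):
    """Mean squared value over all entries."""
    values = [x for row in matrix for x in row]
    return sum(x * x for x in values) / len(values)


def step_signal(n_rows, n_signal_rows, n_pos, n_neg, n_zero):
    """Rows carrying +1, -1 and 0 steps, followed by all-zero rows."""
    pattern = [1.0] * n_pos + [-1.0] * n_neg + [0.0] * n_zero
    blank = [0.0] * len(pattern)
    return [list(pattern) if r < n_signal_rows else list(blank)
            for r in range(n_rows)]


def add_noise(signal, target_snr_db, seed=0):
    """Add Gaussian noise scaled by a constant c chosen so that
    10*log10(P_signal / P_noise) equals target_snr_db."""
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in signal]
    c = math.sqrt(mean_power(signal)
                  / (mean_power(noise) * 10 ** (target_snr_db / 10)))
    noisy = [[s + c * n for s, n in zip(s_row, n_row)]
             for s_row, n_row in zip(signal, noise)]
    return noisy, c


def snr_db(signal, noisy):
    """Achieved signal-to-noise ratio in decibels."""
    residual = [[b - a for a, b in zip(s_row, n_row)]
                for s_row, n_row in zip(signal, noisy)]
    return 10 * math.log10(mean_power(signal) / mean_power(residual))


signal = step_signal(n_rows=20, n_signal_rows=10, n_pos=3, n_neg=3, n_zero=6)
noisy, scale = add_noise(signal, target_snr_db=3.0)
```

Sweeping `target_snr_db` downward and rerunning the detection step at each level would then show where the planted rows stop being recovered.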
741 00:28:25,265 --> 00:28:28,430 But the signal gets lost 742 00:28:28,430 --> 00:28:30,380 fairly quickly as we up 743 00:28:30,380 --> 00:28:32,600 the signal-to-noise ratio. 744 00:28:32,600 --> 00:28:35,960 If we lower the sensitivity a little, 745 00:28:35,960 --> 00:28:38,315 we end up finding more rows. 746 00:28:38,315 --> 00:28:39,560 Once again, we still don't have 747 00:28:39,560 --> 00:28:42,874 any false positives at the 0.6 threshold. 748 00:28:42,874 --> 00:28:45,200 Once we get to 0.5, 749 00:28:45,200 --> 00:28:46,460 we finally start worrying 750 00:28:46,460 --> 00:28:47,780 about false positives. 751 00:28:47,780 --> 00:28:50,165 But fortunately, not all that many. 752 00:28:50,165 --> 00:28:51,559 The conclusion then being 753 00:28:51,559 --> 00:28:52,549 that these signals that were 754 00:28:52,549 --> 00:28:53,600 picked up in this study 755 00:28:53,600 --> 00:28:56,099 are hopefully quite strong. 756 00:28:57,490 --> 00:29:00,050 Conclusions that were drawn from 757 00:29:00,050 --> 00:29:01,669 this study were that there were several 758 00:29:01,669 --> 00:29:02,809 metagenes that appear 759 00:29:02,809 --> 00:29:03,919 to be of interest across 760 00:29:03,919 --> 00:29:06,410 multiple types of tumors. 761 00:29:06,410 --> 00:29:08,854 Specifically, any metagenes 762 00:29:08,854 --> 00:29:11,224 that involve immune response 763 00:29:11,224 --> 00:29:13,040 are always going to be of 764 00:29:13,040 --> 00:29:16,384 interest no matter where the cancer appears. 765 00:29:16,384 --> 00:29:18,844 There were some other metagenes. 766 00:29:18,844 --> 00:29:21,769 Another typo I'll have to fix, my apologies. 767 00:29:21,769 --> 00:29:24,454 Other metagenes seemed to only show up for 768 00:29:24,454 --> 00:29:26,090 a single tumor or 769 00:29:26,090 --> 00:29:29,339 perhaps a small number of tumor types.
770 00:29:30,670 --> 00:29:33,485 We did find genes 771 00:29:33,485 --> 00:29:36,110 where the Kaplan-Meier survival curves 772 00:29:36,110 --> 00:29:37,760 seem to be well separated and 773 00:29:37,760 --> 00:29:39,410 those indicated biological 774 00:29:39,410 --> 00:29:41,480 processes of interest. 775 00:29:41,480 --> 00:29:43,789 Interestingly enough, a little bit of 776 00:29:43,789 --> 00:29:46,490 follow-on work was done 777 00:29:46,490 --> 00:29:48,725 where they looked at the tumor. 778 00:29:48,725 --> 00:29:50,810 They separated the tumors by 779 00:29:50,810 --> 00:29:54,005 what stage the cancer was determined to be in 780 00:29:54,005 --> 00:29:56,810 at diagnosis, and the metagenes 781 00:29:56,810 --> 00:29:59,690 of interest seemed to differ. 782 00:29:59,690 --> 00:30:01,265 So if you're looking at stage 783 00:30:01,265 --> 00:30:03,409 one melanoma, for instance, 784 00:30:03,409 --> 00:30:06,139 you might find some different metagenes 785 00:30:06,139 --> 00:30:08,149 that show up when the algorithm is run 786 00:30:08,149 --> 00:30:10,970 than if you're looking at stage four 787 00:30:10,970 --> 00:30:14,000 melanomas, which might indicate that 788 00:30:14,000 --> 00:30:16,460 different biological processes become 789 00:30:16,460 --> 00:30:19,890 more prominent as the disease progresses. 790 00:30:20,860 --> 00:30:23,629 For future investigations, I'll 791 00:30:23,629 --> 00:30:25,759 show you a visualization 792 00:30:25,759 --> 00:30:27,769 that was made when I was assisting 793 00:30:27,769 --> 00:30:30,605 Dr. Nation in 2019. 794 00:30:30,605 --> 00:30:32,929 Unfortunately, due to the lack 795 00:30:32,929 --> 00:30:34,879 of testing data available, 796 00:30:34,879 --> 00:30:36,829 due to the somewhat messy divorce 797 00:30:36,829 --> 00:30:38,480 with the UH Cancer Center,
798 00:30:38,480 --> 00:30:39,590 I'm only going to be able 799 00:30:39,590 --> 00:30:41,060 to show you results of 800 00:30:41,060 --> 00:30:42,980 the training data, 801 00:30:42,980 --> 00:30:46,020 specifically for melanoma. 802 00:30:46,840 --> 00:30:49,789 The following visualization 803 00:30:49,789 --> 00:30:52,500 is for melanoma patients. 804 00:30:52,960 --> 00:30:56,225 The patient numbers 805 00:30:56,225 --> 00:30:59,750 for our samples are on the horizontal axis. 806 00:30:59,750 --> 00:31:02,525 These are grouped so that we have 807 00:31:02,525 --> 00:31:05,779 low scores for a given signature. 808 00:31:05,779 --> 00:31:07,190 And the scores are coming from 809 00:31:07,190 --> 00:31:09,260 the Eigen survival analysis. 810 00:31:09,260 --> 00:31:11,089 So their entry in 811 00:31:11,089 --> 00:31:13,970 the w vector that was generated, 812 00:31:13,970 --> 00:31:17,645 they're plotted versus the patient number. 813 00:31:17,645 --> 00:31:21,874 This is visualized in the line graph in blue. 814 00:31:21,874 --> 00:31:23,975 So as we move from left to right, 815 00:31:23,975 --> 00:31:28,380 our score for that signature is rising. 816 00:31:29,260 --> 00:31:32,360 A threshold was set to maximize 817 00:31:32,360 --> 00:31:36,590 the accuracy of high-risk and low-risk groups. 818 00:31:36,590 --> 00:31:39,799 So this vertical green line is going to 819 00:31:39,799 --> 00:31:43,115 intersect the blue curve at that cutoff. 820 00:31:43,115 --> 00:31:48,229 In this case, it turned out to be -0.55. 821 00:31:48,229 --> 00:31:50,869 What that does is now 822 00:31:50,869 --> 00:31:53,614 split the graph into four quadrants. 823 00:31:53,614 --> 00:31:56,525 Red crosses indicate a fatality, 824 00:31:56,525 --> 00:31:59,060 a patient who 825 00:31:59,060 --> 00:32:01,714 lost their life before the end of the study.
826 00:32:01,714 --> 00:32:04,670 Blue circles indicate survival 827 00:32:04,670 --> 00:32:07,160 at last contact with the patient. 828 00:32:07,160 --> 00:32:08,810 Censored patients have been 829 00:32:08,810 --> 00:32:10,594 removed from this study. 830 00:32:10,594 --> 00:32:13,475 So in the lower-left quadrant, 831 00:32:13,475 --> 00:32:16,744 red crosses indicate true negatives. 832 00:32:16,744 --> 00:32:19,070 If the score on the test for 833 00:32:19,070 --> 00:32:22,205 this signature is sufficiently low, 834 00:32:22,205 --> 00:32:25,399 we consider these high-risk patients. 835 00:32:25,399 --> 00:32:26,629 As you can see in 836 00:32:26,629 --> 00:32:28,429 this relatively small sample 837 00:32:28,429 --> 00:32:30,605 for melanoma, there are, 838 00:32:30,605 --> 00:32:31,639 there do not appear to 839 00:32:31,639 --> 00:32:33,335 be any false negatives, 840 00:32:33,335 --> 00:32:35,149 we're getting only true negatives 841 00:32:35,149 --> 00:32:37,130 in the lower-left quadrant. 842 00:32:37,130 --> 00:32:39,380 In the upper-right quadrant, 843 00:32:39,380 --> 00:32:42,499 true positives would be blue circles. 844 00:32:42,499 --> 00:32:45,410 Now, the horizontal green line is at 845 00:32:45,410 --> 00:32:47,749 a pretty arbitrary threshold of 846 00:32:47,749 --> 00:32:50,405 550 days of survival. 847 00:32:50,405 --> 00:32:52,730 That's about a year and a half. 848 00:32:52,730 --> 00:32:57,515 Based on, on certain criteria, 849 00:32:57,515 --> 00:33:00,139 it's perfectly reasonable to set what one 850 00:33:00,139 --> 00:33:02,119 considers an interesting term 851 00:33:02,119 --> 00:33:04,339 of survival to be a different number of days, 852 00:33:04,339 --> 00:33:05,749 but a year and a half is pretty 853 00:33:05,749 --> 00:33:08,150 significant to spend with your loved ones.
854 00:33:08,150 --> 00:33:09,920 As you can see, there were 855 00:33:09,920 --> 00:33:12,169 a fair number of false positives, 856 00:33:12,169 --> 00:33:15,410 but overall the accuracy is pretty 857 00:33:15,410 --> 00:33:18,769 decent as we see most patients 858 00:33:18,769 --> 00:33:19,940 who score highly for 859 00:33:19,940 --> 00:33:22,145 this particular signature have 860 00:33:22,145 --> 00:33:26,749 a good chance of surviving past 550 days. 861 00:33:26,749 --> 00:33:29,045 Now, as I say, unfortunately, 862 00:33:29,045 --> 00:33:31,565 this is all training data 863 00:33:31,565 --> 00:33:32,914 that we're seeing right now. 864 00:33:32,914 --> 00:33:34,009 This is done using 865 00:33:34,009 --> 00:33:35,494 the publicly available data 866 00:33:35,494 --> 00:33:38,254 off the TCGA database. 867 00:33:38,254 --> 00:33:39,800 Ideally, 868 00:33:39,800 --> 00:33:41,060 we would have large enough 869 00:33:41,060 --> 00:33:42,200 samples where we could do 870 00:33:42,200 --> 00:33:46,955 a training test split and see how well this, 871 00:33:46,955 --> 00:33:49,610 this single score could be 872 00:33:49,610 --> 00:33:53,104 used to put patients into low-risk. 873 00:33:53,104 --> 00:33:53,645 Why? 874 00:33:53,645 --> 00:33:54,739 I beg your pardon, I 875 00:33:54,739 --> 00:33:57,260 was pointing to the wrong quadrant, 876 00:33:57,260 --> 00:34:00,260 low-risk and high-risk groups. 877 00:34:00,260 --> 00:34:02,299 Patients who are in the high-risk group 878 00:34:02,299 --> 00:34:03,919 might be considered for 879 00:34:03,919 --> 00:34:06,050 more aggressive treatment with 880 00:34:06,050 --> 00:34:10,229 more invasive chemo or radiation therapies. 881 00:34:11,680 --> 00:34:17,159 So further testing is needed here. 
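The four-quadrant reading of the plot can be written down as a small tally. This is a sketch under the convention used in the talk ("positive" means a high signature score, i.e. predicted low risk); it simplifies the figure by classifying outcomes from days survived alone and assumes censored patients have already been removed. The cutoffs -0.55 and 550 days come from the talk; the patient tuples are made-up illustration data.

```python
def quadrant_counts(patients, score_cutoff, days_cutoff):
    """Tally patients into the four quadrants of the score/survival plot.

    patients: iterable of (signature_score, days_survived).
    Positive = score at or above the cutoff (predicted low risk).
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, days in patients:
        predicted_low_risk = score >= score_cutoff
        survived_long = days >= days_cutoff
        if predicted_low_risk and survived_long:
            counts["TP"] += 1   # upper right
        elif predicted_low_risk:
            counts["FP"] += 1   # lower right
        elif survived_long:
            counts["FN"] += 1   # upper left
        else:
            counts["TN"] += 1   # lower left
    return counts


def accuracy(counts):
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


# Hypothetical patients: (score, days survived).
toy = [(-1.0, 200), (-0.8, 300), (0.2, 900), (0.5, 400), (1.1, 1200)]
counts = quadrant_counts(toy, score_cutoff=-0.55, days_cutoff=550)
```

With a held-out test set, the same tally at a cutoff chosen on the training data would give the out-of-sample check that the speaker says is still needed.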
882 00:34:17,170 --> 00:34:20,390 There is other continuous data 883 00:34:20,390 --> 00:34:22,219 available from the Broad Institute 884 00:34:22,219 --> 00:34:23,749 on the Cancer Genome Atlas, 885 00:34:23,749 --> 00:34:25,309 including methylation and 886 00:34:25,309 --> 00:34:27,095 microRNA expression. 887 00:34:27,095 --> 00:34:29,989 I need to confer with some biologists 888 00:34:29,989 --> 00:34:31,370 about the exact biological 889 00:34:31,370 --> 00:34:33,304 function of those numbers. 890 00:34:33,304 --> 00:34:34,310 But there is technically 891 00:34:34,310 --> 00:34:35,569 no reason that couldn't be 892 00:34:35,569 --> 00:34:38,495 included for any given gene. 893 00:34:38,495 --> 00:34:42,050 Another possibility is that when we defined 894 00:34:42,050 --> 00:34:43,280 the arrow relation, that 895 00:34:43,280 --> 00:34:45,275 was only positive correlation. 896 00:34:45,275 --> 00:34:47,750 It's possible that high expression in 897 00:34:47,750 --> 00:34:49,685 one gene may result 898 00:34:49,685 --> 00:34:51,755 in low expression in another. 899 00:34:51,755 --> 00:34:53,554 How that would affect the analysis 900 00:34:53,554 --> 00:34:56,100 is at this point unknown. 901 00:34:56,680 --> 00:34:59,284 Finally, of particular 902 00:34:59,284 --> 00:35:01,099 interest to myself is, it is, 903 00:35:01,099 --> 00:35:02,630 it may be possible to use 904 00:35:02,630 --> 00:35:04,999 the algorithm, when sufficiently 905 00:35:04,999 --> 00:35:08,630 generalized and we get a good pipeline 906 00:35:08,630 --> 00:35:11,764 going, to study continuous data 907 00:35:11,764 --> 00:35:13,925 related to other diseases. 908 00:35:13,925 --> 00:35:15,965 So for the last word, 909 00:35:15,965 --> 00:35:18,845 I want to give it to JB again 910 00:35:18,845 --> 00:35:21,919 as the father of this process, 911 00:35:21,919 --> 00:35:23,600 as he always would say at the conclusion 912 00:35:23,600 --> 00:35:25,939 of his talks: Lust is good.
913 00:35:25,939 --> 00:35:28,444 And so is the algorithm. 914 00:35:28,444 --> 00:35:30,199 I'll open it up for 915 00:35:30,199 --> 00:35:31,639 questions in just a moment, 916 00:35:31,639 --> 00:35:35,179 but I'd like to first give a few references. 917 00:35:35,179 --> 00:35:36,260 All of these are available 918 00:35:36,260 --> 00:35:37,370 on the GitHub page. 919 00:35:37,370 --> 00:35:39,455 And when I do a last cleanup of the slides, 920 00:35:39,455 --> 00:35:41,059 I will send those to 921 00:35:41,059 --> 00:35:44,494 the seminar organizers for dissemination. 922 00:35:44,494 --> 00:35:46,579 Lastly, I'd like to 923 00:35:46,579 --> 00:35:48,199 give a few acknowledgements. 924 00:35:48,199 --> 00:35:49,969 First of all, thank you once again 925 00:35:49,969 --> 00:35:52,309 for having me come speak today. 926 00:35:52,309 --> 00:35:53,419 And I especially want to 927 00:35:53,419 --> 00:35:55,550 thank Professors Wakeland and 928 00:35:55,550 --> 00:35:58,310 Zwick for corresponding 929 00:35:58,310 --> 00:35:59,689 with me to set this up. 930 00:35:59,689 --> 00:36:01,490 I also want to thank the 931 00:36:01,490 --> 00:36:02,930 University of Hawaii 932 00:36:02,930 --> 00:36:04,160 school of lattice theory, 933 00:36:04,160 --> 00:36:06,634 my primary mentors while I was there. 934 00:36:06,634 --> 00:36:08,750 And of course, in addition to 935 00:36:08,750 --> 00:36:10,730 moral support throughout all my studies 936 00:36:10,730 --> 00:36:12,394 from friends and family 937 00:36:12,394 --> 00:36:15,454 and mathematicians that I look up to, 938 00:36:15,454 --> 00:36:17,059 I would be remiss if I 939 00:36:17,059 --> 00:36:19,490 didn't address my special assistant, 940 00:36:19,490 --> 00:36:21,155 Dr. Frankenstein. 941 00:36:21,155 --> 00:36:22,459 I couldn't have done this work 942 00:36:22,459 --> 00:36:26,640 without him no matter how hard I tried. 943 00:36:27,820 --> 00:36:30,005 Thank you very much.
944 00:36:30,005 --> 00:36:31,459 With that, I will go ahead and 945 00:36:31,459 --> 00:36:32,869 open up the floor to 946 00:36:32,869 --> 00:36:34,250 questions and I'll do 947 00:36:34,250 --> 00:36:36,870 my best to keep the discussion moving. 948 00:36:37,090 --> 00:36:39,724 I'm just dying to know 949 00:36:39,724 --> 00:36:41,870 how general this algorithm is. 950 00:36:41,870 --> 00:36:43,939 Is it the type of the data or is it 951 00:36:43,939 --> 00:36:46,325 more the source of the data that you think 952 00:36:46,325 --> 00:36:49,220 makes it useful in 953 00:36:49,220 --> 00:36:50,840 looking at medical gene expression 954 00:36:50,840 --> 00:36:52,534 and related kinds of data? 955 00:36:52,534 --> 00:36:53,809 But is it 956 00:36:53,809 --> 00:36:56,375 a more general algorithm at its core? 957 00:36:56,375 --> 00:37:00,065 At its core, I would say it probably 958 00:37:00,065 --> 00:37:02,300 can be. How general, 959 00:37:02,300 --> 00:37:04,220 I hesitate to say at this point. 960 00:37:04,220 --> 00:37:05,764 Because, again, 961 00:37:05,764 --> 00:37:08,960 simply because of Professor Nation, 962 00:37:08,960 --> 00:37:11,210 Professor Nation's retirement and 963 00:37:11,210 --> 00:37:14,690 the unfortunately tense closing 964 00:37:14,690 --> 00:37:16,805 of the relationship with the Cancer Center, 965 00:37:16,805 --> 00:37:19,219 we didn't get a really great chance 966 00:37:19,219 --> 00:37:20,734 to follow through on 967 00:37:20,734 --> 00:37:22,789 this particular application and 968 00:37:22,789 --> 00:37:25,025 then see how applicable it is. 969 00:37:25,025 --> 00:37:27,470 In principle, if we're just looking to 970 00:37:27,470 --> 00:37:29,989 cluster regulations 971 00:37:29,989 --> 00:37:32,315 amongst the continuous datasets, 972 00:37:32,315 --> 00:37:33,770 I don't see why 973 00:37:33,770 --> 00:37:36,830 the algorithm couldn't be applied.
974 00:37:36,830 --> 00:37:38,765 Especially part one, 975 00:37:38,765 --> 00:37:39,860 it would seem that it would be 976 00:37:39,860 --> 00:37:42,184 a good candidate for comparing 977 00:37:42,184 --> 00:37:45,485 other clustering methods. In terms 978 00:37:45,485 --> 00:37:48,710 of, especially in part two, 979 00:37:48,710 --> 00:37:50,434 sort of predictive modeling 980 00:37:50,434 --> 00:37:54,300 outside of biomedical studies, 981 00:37:54,370 --> 00:37:57,725 nothing occurs off the top of my head, 982 00:37:57,725 --> 00:37:59,600 but I suspect there probably 983 00:37:59,600 --> 00:38:03,650 are some areas that are worth investigation. 984 00:38:03,650 --> 00:38:06,320 Just to be a little bit more specific and I 985 00:38:06,320 --> 00:38:08,180 don't mean to be a hog for time here, 986 00:38:08,180 --> 00:38:10,805 I'll definitely yield to other questioners. 987 00:38:10,805 --> 00:38:13,009 I'm very interested in how we can 988 00:38:13,009 --> 00:38:17,284 imply causation in datasets, 989 00:38:17,284 --> 00:38:20,180 which means more than just correlation. 990 00:38:20,180 --> 00:38:21,319 And I know it's tricky. 991 00:38:21,319 --> 00:38:23,269 It's a really interesting field. 992 00:38:23,269 --> 00:38:25,280 I'm not an expert in that, 993 00:38:25,280 --> 00:38:27,934 but it just feels like this algorithm is, 994 00:38:27,934 --> 00:38:29,795 when you call that regulation, basically, 995 00:38:29,795 --> 00:38:31,250 you're saying there seems to 996 00:38:31,250 --> 00:38:33,695 be kind of a precedence. 997 00:38:33,695 --> 00:38:35,315 A comes and then B comes. 998 00:38:35,315 --> 00:38:36,649 I mean, you didn't say it that way, 999 00:38:36,649 --> 00:38:38,270 but it feels like that. 1000 00:38:38,270 --> 00:38:40,639 And that's really interesting to try to find 1001 00:38:40,639 --> 00:38:42,830 evidence of causality in 1002 00:38:42,830 --> 00:38:45,319 a more general sense of the term. 1003 00:38:45,319 --> 00:38:49,129 Absolutely.
That part, that causality, 1004 00:38:49,129 --> 00:38:52,204 that's where the upstream targeting comes in. 1005 00:38:52,204 --> 00:38:54,440 So when we're looking for the signatures, 1006 00:38:54,440 --> 00:38:57,409 that's when we're looking for this gene 1007 00:38:57,409 --> 00:38:59,645 expressing a certain way implies 1008 00:38:59,645 --> 00:39:01,399 another gene is going 1009 00:39:01,399 --> 00:39:02,795 to express a certain way. 1010 00:39:02,795 --> 00:39:04,159 And this is why it's 1011 00:39:04,159 --> 00:39:06,319 an algorithm that was developed by 1012 00:39:06,319 --> 00:39:08,390 folks who are well-versed in 1013 00:39:08,390 --> 00:39:10,669 the theory of partially ordered sets. 1014 00:39:10,669 --> 00:39:12,439 I didn't go 1015 00:39:12,439 --> 00:39:15,349 into too much detail 1016 00:39:15,349 --> 00:39:17,479 on the mathematical background that 1017 00:39:17,479 --> 00:39:19,399 inspired the algorithm itself; 1018 00:39:19,399 --> 00:39:21,650 that's detailed in this other publication 1019 00:39:21,650 --> 00:39:23,974 by Adaricheva, Nation, et al. 1020 00:39:23,974 --> 00:39:25,609 That gets a little technical on 1021 00:39:25,609 --> 00:39:28,339 the side of the lattice theory. 1022 00:39:28,339 --> 00:39:32,210 But if I had the chance to correspond 1023 00:39:32,210 --> 00:39:33,889 and/or chat with anyone in 1024 00:39:33,889 --> 00:39:34,790 particular a little bit 1025 00:39:34,790 --> 00:39:35,945 more about that background, 1026 00:39:35,945 --> 00:39:37,010 I'd be more than happy to, 1027 00:39:37,010 --> 00:39:39,110 since that's where my training really 1028 00:39:39,110 --> 00:39:40,039 lies, more than in 1029 00:39:40,039 --> 00:39:42,450 the actual machine learning. 1030 00:39:44,260 --> 00:39:46,370 Maybe if you close 1031 00:39:46,370 --> 00:39:48,350 the presentation, we reopen it. 1032 00:39:48,350 --> 00:39:49,489 That way we can see each other.
1033 00:39:49,489 --> 00:39:52,709 It'll, it'll encourage the conversation. 1034 00:39:53,920 --> 00:39:59,015 Tristan, I'm just thinking 1035 00:39:59,015 --> 00:40:03,724 about your sample size, your n. 1036 00:40:03,724 --> 00:40:06,680 It seems like the numbers that I've seen, 1037 00:40:06,680 --> 00:40:08,840 you first mentioned 36 and then 1038 00:40:08,840 --> 00:40:11,870 on that last slide you showed 60. 1039 00:40:11,870 --> 00:40:15,545 So is, is 60 1040 00:40:15,545 --> 00:40:18,815 like the biggest number of patients 1041 00:40:18,815 --> 00:40:21,514 for any of the cancers that you've analyzed? 1042 00:40:21,514 --> 00:40:24,350 Fortunately, it isn't. The sample size 1043 00:40:24,350 --> 00:40:26,690 on some of the cancers is small, 1044 00:40:26,690 --> 00:40:29,240 but the sample sizes actually range. 1045 00:40:29,240 --> 00:40:29,720 Fortunately, 1046 00:40:29,720 --> 00:40:32,104 I have the numbers in front of me. 1047 00:40:32,104 --> 00:40:36,379 36 and 60 are actually definitely on the 1048 00:40:36,379 --> 00:40:38,660 small end, and the sample size 1049 00:40:38,660 --> 00:40:41,359 goes all the way up to over 500. 1050 00:40:41,359 --> 00:40:43,430 But generally there are at least around 1051 00:40:43,430 --> 00:40:45,589 200 to 300 samples on many of 1052 00:40:45,589 --> 00:40:46,819 these cancers in 1053 00:40:46,819 --> 00:40:48,934 the publicly available data. 1054 00:40:48,934 --> 00:40:50,360 Yeah. 1055 00:40:50,360 --> 00:40:53,539 Now, are you using 1056 00:40:53,539 --> 00:40:56,240 positive to mean no cancer 1057 00:40:56,240 --> 00:40:57,770 and negative to mean cancer? 1058 00:40:57,770 --> 00:40:58,820 I mean, like 1059 00:40:58,820 --> 00:41:00,454 false positives and false negatives. 1060 00:41:00,454 --> 00:41:04,309 Did I catch it, or did I mishear that?
1061 00:41:04,309 --> 00:41:07,264 When I was taking a look at 1062 00:41:07,264 --> 00:41:10,280 the data plot for 1063 00:41:10,280 --> 00:41:12,485 the survival curve 1064 00:41:12,485 --> 00:41:14,000 that Dr. Nation had made. 1065 00:41:14,000 --> 00:41:18,440 So a true negative means 1066 00:41:18,440 --> 00:41:21,230 you've got a low score on that test. 1067 00:41:21,230 --> 00:41:22,669 That was a linear combination of 1068 00:41:22,669 --> 00:41:24,439 certain genetic factors that place you 1069 00:41:24,439 --> 00:41:26,524 into the high-risk category. 1070 00:41:26,524 --> 00:41:28,640 So if you've got a low score on 1071 00:41:28,640 --> 00:41:29,959 that test and you 1072 00:41:29,959 --> 00:41:32,090 died before a year and a half, 1073 00:41:32,090 --> 00:41:33,915 that's a true negative. 1074 00:41:33,915 --> 00:41:36,984 Well, you can't get it. So negative. 1075 00:41:36,984 --> 00:41:40,599 Negative means bad, right? 1076 00:41:40,599 --> 00:41:41,260 Yeah. 1077 00:41:41,260 --> 00:41:42,369 It's being used in 1078 00:41:42,369 --> 00:41:44,109 the more pejorative sense of it, right? 1079 00:41:44,109 --> 00:41:46,150 Right, right. You know, I think in, 1080 00:41:46,150 --> 00:41:49,314 in, in some medical contexts, 1081 00:41:49,314 --> 00:41:50,694 when people talk about 1082 00:41:50,694 --> 00:41:52,629 positives and negatives, 1083 00:41:52,629 --> 00:41:57,219 positive means the test to detect something 1084 00:41:57,219 --> 00:41:58,674 said you have it. 1085 00:41:58,674 --> 00:42:00,325 And so I think, 1086 00:42:00,325 --> 00:42:03,010 and maybe Rich, Rick can. 1087 00:42:03,010 --> 00:42:06,520 I think that in some contexts, 1088 00:42:06,520 --> 00:42:08,230 positive is bad and 1089 00:42:08,230 --> 00:42:10,419 negative means the test 1090 00:42:10,419 --> 00:42:11,529 says you don't have it. 1091 00:42:11,529 --> 00:42:13,120 So I mean, that's 1092 00:42:13,120 --> 00:42:15,820 a potential source of confusion.
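Note on the positive/negative convention discussed above: one way to sidestep the ambiguity is to name the positive class explicitly whenever outcomes are tallied. The sketch below is only an illustration and is not taken from the talk; the labels "high" and "low" and the function name confusion_counts are hypothetical.

```python
def confusion_counts(predicted, actual, positive):
    """Tally TP/FP/TN/FN with the positive class named explicitly."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, a in zip(predicted, actual):
        if p == positive:
            # The test called this case positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # The test called this case negative.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical risk calls versus observed outcomes for four patients.
predicted = ["high", "high", "high", "low"]
actual = ["high", "high", "low", "low"]

# The same data gives different tallies depending on which class is "positive".
print(confusion_counts(predicted, actual, positive="high"))
print(confusion_counts(predicted, actual, positive="low"))
```

Stating the positive class next to every table of this kind is what removes the ambiguity raised in the question.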
1093 00:42:15,820 --> 00:42:16,455 So. 1094 00:42:16,455 --> 00:42:18,139 You should lay that, you know, 1095 00:42:18,139 --> 00:42:21,980 you should state that up front very clearly to 1096 00:42:21,980 --> 00:42:24,980 avoid that confusion before 1097 00:42:24,980 --> 00:42:26,569 you talk to epidemiologists. 1098 00:42:26,569 --> 00:42:30,815 I think that would be okay. 1099 00:42:30,815 --> 00:42:33,079 Very much appreciate that. Thank you. 1100 00:42:33,079 --> 00:42:33,739 Yeah. 1101 00:42:33,739 --> 00:42:37,220 I think I have to say that I 1102 00:42:37,220 --> 00:42:41,434 don't really follow most of what you said. 1103 00:42:41,434 --> 00:42:44,520 I wish I did, but I don't. 1104 00:42:45,080 --> 00:42:50,360 So are you forming these metagenes 1105 00:42:50,360 --> 00:42:52,280 so you don't have to 1106 00:42:52,280 --> 00:42:55,040 look at individual gene expression? 1107 00:42:55,040 --> 00:42:57,410 And these metagenes 1108 00:42:57,410 --> 00:42:59,255 have some representative, 1109 00:42:59,255 --> 00:43:01,430 you get a signature from them 1110 00:43:01,430 --> 00:43:04,790 and the signature is quantified. 1111 00:43:04,790 --> 00:43:07,729 And that is your test score 1112 00:43:07,729 --> 00:43:09,230 or something like that. 1113 00:43:09,230 --> 00:43:10,910 That's the score at the end 1114 00:43:10,910 --> 00:43:14,180 that you're using to predict. 1115 00:43:14,180 --> 00:43:17,554 The score is based on the signature. 1116 00:43:17,554 --> 00:43:21,364 And the signature is of a metagene. 1117 00:43:21,364 --> 00:43:24,035 The signature is a subset 1118 00:43:24,035 --> 00:43:26,569 of regulators of the metagene. 1119 00:43:26,569 --> 00:43:28,624 Yes. Okay. 1120 00:43:28,624 --> 00:43:29,869 Okay.
1121 00:43:29,869 --> 00:43:31,610 So, so you create 1122 00:43:31,610 --> 00:43:33,005 these metagenes, which are, 1123 00:43:33,005 --> 00:43:34,490 which are equivalence classes 1124 00:43:34,490 --> 00:43:36,665 or clusters of genes. 1125 00:43:36,665 --> 00:43:41,345 And then for a given gene you have a score. 1126 00:43:41,345 --> 00:43:43,670 And that score is 1127 00:43:43,670 --> 00:43:47,539 your predictor that you're 1128 00:43:47,539 --> 00:43:49,999 going to predict with, right? Yes. 1129 00:43:49,999 --> 00:43:54,859 Um, and so 1130 00:43:54,859 --> 00:43:56,765 essentially the score is like 1131 00:43:56,765 --> 00:43:58,789 a single IV that's gonna 1132 00:43:58,789 --> 00:44:00,800 predict the DV of survival. 1133 00:44:00,800 --> 00:44:01,819 Is that right? 1134 00:44:01,819 --> 00:44:03,485 Yes, you're getting it. 1135 00:44:03,485 --> 00:44:05,554 So that score is sort of like 1136 00:44:05,554 --> 00:44:06,950 a factor analysis or 1137 00:44:06,950 --> 00:44:08,854 something like that, where you are 1138 00:44:08,854 --> 00:44:11,689 taking all this information 1139 00:44:11,689 --> 00:44:13,069 on the gene and 1140 00:44:13,069 --> 00:44:14,330 collapsing it into 1141 00:44:14,330 --> 00:44:17,344 a single quantitative score. 1142 00:44:17,344 --> 00:44:18,920 Is that right? 1143 00:44:18,920 --> 00:44:21,290 That's a very eloquent 1144 00:44:21,290 --> 00:44:22,969 and concise way of framing it. 1145 00:44:22,969 --> 00:44:23,989 And if you don't mind, 1146 00:44:23,989 --> 00:44:25,399 I'm probably going to steal that when 1147 00:44:25,399 --> 00:44:26,990 I refine the presentation.
1148 00:44:26,990 --> 00:44:28,459 But yes, we're taking a lot 1149 00:44:28,459 --> 00:44:29,899 of this expression data, 1150 00:44:29,899 --> 00:44:31,249 reducing it to looking at 1151 00:44:31,249 --> 00:44:33,199 a smaller subset of genes that 1152 00:44:33,199 --> 00:44:34,970 have an outsized influence 1153 00:44:34,970 --> 00:44:36,725 on the overall process, 1154 00:44:36,725 --> 00:44:38,839 scoring based on those, 1155 00:44:38,839 --> 00:44:41,029 and getting a single number that 1156 00:44:41,029 --> 00:44:43,310 will hopefully give some kind 1157 00:44:43,310 --> 00:44:45,635 of useful information for clinical outcome. 1158 00:44:45,635 --> 00:44:47,539 Yeah. Maybe this is obvious to 1159 00:44:47,539 --> 00:44:50,360 everyone, but it isn't totally obvious. 1160 00:44:50,360 --> 00:44:54,619 And if I can make a suggestion. Please do. 1161 00:44:54,619 --> 00:44:57,620 I think it's always good to have 1162 00:44:57,620 --> 00:45:00,154 high-level summary slides that 1163 00:45:00,154 --> 00:45:01,654 sort of give you 1164 00:45:01,654 --> 00:45:05,524 a gestalt view of the overall thing. 1165 00:45:05,524 --> 00:45:07,819 I mean, people feel that that isn't 1166 00:45:07,819 --> 00:45:10,339 necessary and if they go step by step, 1167 00:45:10,339 --> 00:45:11,990 everybody will follow and 1168 00:45:11,990 --> 00:45:13,039 see the big picture. 1169 00:45:13,039 --> 00:45:17,734 But I think it's a mistake. You never, 1170 00:45:17,734 --> 00:45:19,669 you never go wrong by presenting 1171 00:45:19,669 --> 00:45:23,240 high-level summaries and redundant 1172 00:45:23,240 --> 00:45:25,069 summaries, and then 1173 00:45:25,069 --> 00:45:27,425 people can grasp what's going on.
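As a rough illustration of collapsing expression data into a single number: earlier in the discussion the score is described as a linear combination of certain genetic factors. A minimal sketch under that description, where the gene names, the weights, and the function name signature_score are all hypothetical and not from the talk:

```python
def signature_score(expression, weights):
    """Weighted sum of expression values over the genes in a signature.

    `weights` maps each signature gene to its (hypothetical) coefficient;
    genes outside the signature are ignored.
    """
    return sum(weight * expression[gene] for gene, weight in weights.items())


# Hypothetical continuous expression values for one patient.
patient = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 9.0}

# Hypothetical two-gene signature; GENE_C is not part of it.
weights = {"GENE_A": 1.0, "GENE_B": -2.0}

score = signature_score(patient, weights)
print(score)
```

How the weights would actually be chosen is not described in this part of the discussion, so they are treated here as given inputs.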
1174 00:45:27,425 --> 00:45:29,509 So I think you should do 1175 00:45:29,509 --> 00:45:32,120 that and leave out some of 1176 00:45:32,120 --> 00:45:34,550 the very fine details 1177 00:45:34,550 --> 00:45:37,175 that then, if people ask you, 1178 00:45:37,175 --> 00:45:40,954 yeah. So, but there's also, 1179 00:45:40,954 --> 00:45:42,560 some of what 1180 00:45:42,560 --> 00:45:45,350 you talked about was discretization. 1181 00:45:45,350 --> 00:45:47,810 And it also seems like 1182 00:45:47,810 --> 00:45:51,800 you're analyzing things continuously. 1183 00:45:51,800 --> 00:45:54,064 So it's like, I don't have a high level. 1184 00:45:54,064 --> 00:45:56,600 Are you using the discretization to get 1185 00:45:56,600 --> 00:45:58,069 the metagenes, and from 1186 00:45:58,069 --> 00:46:00,860 then on, it's continuous? 1187 00:46:00,860 --> 00:46:02,420 Is it analysis of 1188 00:46:02,420 --> 00:46:08,359 continuous metrics, or what's there? 1189 00:46:08,359 --> 00:46:10,759 Can you give me a high-level picture of 1190 00:46:10,759 --> 00:46:13,399 where the discretization is being 1191 00:46:13,399 --> 00:46:15,590 used and where you are 1192 00:46:15,590 --> 00:46:18,920 using kind of continuous analysis? 1193 00:46:18,920 --> 00:46:20,180 I'll do my best. 1194 00:46:20,180 --> 00:46:21,439 Thank you very much for the 1195 00:46:21,439 --> 00:46:22,654 input and the questioning. 1196 00:46:22,654 --> 00:46:23,899 I very much appreciate 1197 00:46:23,899 --> 00:46:26,344 any sort of constructive criticism. 1198 00:46:26,344 --> 00:46:30,530 The discretization comes into trying 1199 00:46:30,530 --> 00:46:33,559 to, that comes in 1200 00:46:33,559 --> 00:46:34,940 with the arrow relations, 1201 00:46:34,940 --> 00:46:36,020 with the regulations.
1202 00:46:36,020 --> 00:46:40,310 We want to think what is under-expression for 1203 00:46:40,310 --> 00:46:42,065 a given genetic marker and what is 1204 00:46:42,065 --> 00:46:44,929 over-expression for a given genetic marker. 1205 00:46:44,929 --> 00:46:46,190 So that's, that's where 1206 00:46:46,190 --> 00:46:48,230 the discrete comes in. 1207 00:46:48,230 --> 00:46:49,954 Once we have 1208 00:46:49,954 --> 00:46:52,295 those metagenes and signatures, 1209 00:46:52,295 --> 00:46:53,989 now we're gonna be looking at 1210 00:46:53,989 --> 00:46:56,240 continuous expression data when 1211 00:46:56,240 --> 00:46:58,849 we're trying to get those predictive scores. 1212 00:46:58,849 --> 00:47:00,470 You know, really, you 1213 00:47:00,470 --> 00:47:02,405 should say that up front. 1214 00:47:02,405 --> 00:47:07,085 That's very helpful. Yeah. 1215 00:47:07,085 --> 00:47:09,499 Okay, great. So I'm glad. 1216 00:47:09,499 --> 00:47:09,994 Thank you. 1217 00:47:09,994 --> 00:47:11,539 New slide going in there 1218 00:47:11,539 --> 00:47:13,234 before I send it out. 1219 00:47:13,234 --> 00:47:14,614 I'll, over the next week, 1220 00:47:14,614 --> 00:47:16,640 I will try and refine a few of 1221 00:47:16,640 --> 00:47:18,199 the slides with some 1222 00:47:18,199 --> 00:47:20,269 of the requests here before, right, 1223 00:47:20,269 --> 00:47:22,819 I send it to you and Wayne to 1224 00:47:22,819 --> 00:47:26,839 redistribute. Is a signature 1225 00:47:26,839 --> 00:47:30,514 a list, basically, or just a set of things? 1226 00:47:30,514 --> 00:47:33,499 It is, it's a set of 1227 00:47:33,499 --> 00:47:35,929 individual genes and often 1228 00:47:35,929 --> 00:47:37,519 they're actually quite short.
1229 00:47:37,519 --> 00:47:38,930 Sometimes there's as few as 1230 00:47:38,930 --> 00:47:42,004 two genes that end up strongly regulating 1231 00:47:42,004 --> 00:47:44,330 an entire group that seemed to 1232 00:47:44,330 --> 00:47:47,759 be related to a same to the same process. 1233 00:47:49,720 --> 00:47:54,440 Are you spelling signature that way to denote 1234 00:47:54,440 --> 00:47:56,660 that because it seemed like signature 1235 00:47:56,660 --> 00:47:57,770 was spelled incorrectly 1236 00:47:57,770 --> 00:47:59,704 throughout the entire presentation. 1237 00:47:59,704 --> 00:48:01,699 In that case, I apologize. 1238 00:48:01,699 --> 00:48:02,989 That is simply an oversight. 1239 00:48:02,989 --> 00:48:04,849 I'm a math major, not a spelling major, 1240 00:48:04,849 --> 00:48:05,659 and I wasn't using 1241 00:48:05,659 --> 00:48:07,040 the spell checker on latex. 1242 00:48:07,040 --> 00:48:09,050 So that's a great constructive criticism 1243 00:48:09,050 --> 00:48:10,969 and something else I want to fix. I forget. 1244 00:48:10,969 --> 00:48:11,389 Yeah. 1245 00:48:11,389 --> 00:48:14,119 Hey, so you studied lattice theory, right? 1246 00:48:14,119 --> 00:48:16,235 I did. Yeah, that's interesting. 1247 00:48:16,235 --> 00:48:18,515 So I'm a spatial data scientists. 1248 00:48:18,515 --> 00:48:21,379 And I know that the lattice theory is 1249 00:48:21,379 --> 00:48:24,260 used in the analysis of spatial data. 1250 00:48:24,260 --> 00:48:25,550 And I was just curious if 1251 00:48:25,550 --> 00:48:27,110 you had any thoughts. 1252 00:48:27,110 --> 00:48:29,359 So there's a lot of spatial data that can be 1253 00:48:29,359 --> 00:48:31,039 discretized in the similar way 1254 00:48:31,039 --> 00:48:33,049 as sort of over and under kind of stuff. 1255 00:48:33,049 --> 00:48:35,030 So have you thought about any way to 1256 00:48:35,030 --> 00:48:37,444 sort of generalize this to spatial data? 1257 00:48:37,444 --> 00:48:39,665 Not in particular. 
1258 00:48:39,665 --> 00:48:41,689 Because of this originated 1259 00:48:41,689 --> 00:48:43,010 in a biology study. 1260 00:48:43,010 --> 00:48:44,329 Most of my thoughts on 1261 00:48:44,329 --> 00:48:47,315 future studies have also been in that realm. 1262 00:48:47,315 --> 00:48:51,154 I must admit I'm not entirely familiar with 1263 00:48:51,154 --> 00:48:53,270 applications of lattice theory 1264 00:48:53,270 --> 00:48:55,609 in spatial data necessarily. 1265 00:48:55,609 --> 00:48:58,325 I've certainly seen it applied to geometries. 1266 00:48:58,325 --> 00:49:01,610 But anything where you can do high, 1267 00:49:01,610 --> 00:49:02,839 low, medium, it would 1268 00:49:02,839 --> 00:49:04,160 seem like that would be a candidate. 1269 00:49:04,160 --> 00:49:05,540 So I'd be very curious to see 1270 00:49:05,540 --> 00:49:06,559 some datasets and see 1271 00:49:06,559 --> 00:49:08,029 if they might be compatible. 1272 00:49:08,029 --> 00:49:09,725 Yeah. Okay. 1273 00:49:09,725 --> 00:49:12,079 Well, personally in your talk, basically, 1274 00:49:12,079 --> 00:49:13,789 you talked about that when you guys were 1275 00:49:13,789 --> 00:49:15,949 choosing what kind of terrain is it, 1276 00:49:15,949 --> 00:49:18,425 isn't it almost exactly that kind of data? 1277 00:49:18,425 --> 00:49:20,705 Yeah. Yeah, you're right. 1278 00:49:20,705 --> 00:49:22,670 I mean, that's probably 1279 00:49:22,670 --> 00:49:23,764 why I'm thinking this. 1280 00:49:23,764 --> 00:49:24,769 Yes, I think so. 1281 00:49:24,769 --> 00:49:25,939 Okay. 1282 00:49:25,939 --> 00:49:28,009 So they have they have all this 1283 00:49:28,009 --> 00:49:29,900 lidar data and they're trying to say, Well, 1284 00:49:29,900 --> 00:49:30,920 do we call this a tree or 1285 00:49:30,920 --> 00:49:32,060 do we call this a road, 1286 00:49:32,060 --> 00:49:34,205 or I'm oversimplifying it. 1287 00:49:34,205 --> 00:49:36,980 But that's kind of the same sort of data. 
1288 00:49:36,980 --> 00:49:38,239 It seems, that feels like to 1289 00:49:38,239 --> 00:49:41,299 me where you'd want to end up saying, 1290 00:49:41,299 --> 00:49:43,099 well, and maybe it's more complicated 1291 00:49:43,099 --> 00:49:44,359 because there's a lot of candidates. 1292 00:49:44,359 --> 00:49:46,145 And so it's not, it's not a, 1293 00:49:46,145 --> 00:49:48,259 it's not binary, so to speak, 1294 00:49:48,259 --> 00:49:50,510 or near binary; it's quite multi-valued. 1295 00:49:50,510 --> 00:49:52,040 And so you're looking for 1296 00:49:52,040 --> 00:49:55,410 more cut points than just one, kind of. 1297 00:49:55,480 --> 00:49:57,514 But I loved the idea 1298 00:49:57,514 --> 00:50:00,514 of emphasizing the extrema, 1299 00:50:00,514 --> 00:50:01,970 ignoring the stuff in the middle 1300 00:50:01,970 --> 00:50:03,469 because usually there's not much value there. 1301 00:50:03,469 --> 00:50:05,104 It seems like a very general 1302 00:50:05,104 --> 00:50:06,500 thing that can be done in 1303 00:50:06,500 --> 00:50:07,729 all kinds of different sorts of machine 1304 00:50:07,729 --> 00:50:10,739 learning applications in my mind. 1305 00:50:12,430 --> 00:50:14,675 Yeah, my, you know, 1306 00:50:14,675 --> 00:50:17,104 when I worked this. I need to unmute? 1307 00:50:17,104 --> 00:50:19,099 Or you are unmuted but your mic is 1308 00:50:19,099 --> 00:50:21,719 not working. We're not hearing you. 1309 00:50:21,820 --> 00:50:23,300 Oh, there you go. 1310 00:50:23,300 --> 00:50:23,839 Better. 1311 00:50:23,839 --> 00:50:27,469 Yeah. Go ahead, Dr. Zach. 1312 00:50:27,469 --> 00:50:31,229 I'll go after you. No, no. Go ahead. 1313 00:50:32,020 --> 00:50:34,910 Well, I was, thanks for the talk, Tristan, 1314 00:50:34,910 --> 00:50:37,955 it looks like really interesting work. 1315 00:50:37,955 --> 00:50:41,930 A question about the results 1316 00:50:41,930 --> 00:50:43,549 and how they might generalize well 1317 00:50:43,549 --> 00:50:45,930 to new data, unseen data.
1318 00:50:45,930 --> 00:50:49,609 It seems like there 1319 00:50:49,609 --> 00:50:53,135 was a lot of tuning throughout the process, 1320 00:50:53,135 --> 00:50:55,879 several steps of the process. 1321 00:50:55,879 --> 00:50:59,600 So I was just curious how you felt 1322 00:50:59,600 --> 00:51:02,990 this process might generalize 1323 00:51:02,990 --> 00:51:05,419 well to new unseen data and what you did 1324 00:51:05,419 --> 00:51:08,090 to make sure that the model wasn't 1325 00:51:08,090 --> 00:51:11,940 overfit to the data that you're analyzing. 1326 00:51:13,990 --> 00:51:17,299 It's curious that you mentioned overfitting 1327 00:51:17,299 --> 00:51:18,530 because, having only heard 1328 00:51:18,530 --> 00:51:19,955 one side of the story, 1329 00:51:19,955 --> 00:51:21,589 I don't necessarily feel 1330 00:51:21,589 --> 00:51:23,330 comfortable repeating some of 1331 00:51:23,330 --> 00:51:24,469 the details I heard, 1332 00:51:24,469 --> 00:51:26,569 but supposedly this same debate 1333 00:51:26,569 --> 00:51:28,684 was one of the reasons why there was 1334 00:51:28,684 --> 00:51:30,755 a divergence 1335 00:51:30,755 --> 00:51:35,190 among the research group at later stages. 1336 00:51:36,040 --> 00:51:39,815 In terms of new datasets, 1337 00:51:39,815 --> 00:51:42,965 I mentioned, I think, a couple of times, 1338 00:51:42,965 --> 00:51:45,169 it was decided when you're looking at 1339 00:51:45,169 --> 00:51:47,240 over- and under-expression, 1340 00:51:47,240 --> 00:51:50,525 quartiles seemed like a good measure. 1341 00:51:50,525 --> 00:51:53,390 That would probably generalize well. 1342 00:51:53,390 --> 00:51:54,319 I think the thing that would 1343 00:51:54,319 --> 00:51:55,504 probably need a lot 1344 00:51:55,504 --> 00:51:58,159 of fine-tuning would be the sensitivity: 1345 00:51:58,159 --> 00:52:03,379 when do you group things, 1346 00:52:03,379 --> 00:52:06,244 and when do you want to keep them separated?
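A minimal sketch of the quartile-based discretization mentioned above, assuming that values below the first quartile count as under-expressed and values above the third quartile count as over-expressed, with everything in between left neutral. The exact cut-offs used in the work are not stated here, so this rule, the -1/0/+1 coding, and the function name are assumptions.

```python
import statistics


def discretize_by_quartiles(values):
    """Map each value to -1 (under), 0 (middle), or +1 (over).

    Assumed rule: below the first quartile is under-expressed,
    above the third quartile is over-expressed.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return [-1 if v < q1 else 1 if v > q3 else 0 for v in values]


# Hypothetical expression values for one gene across eight samples.
expression = [1, 2, 3, 4, 5, 6, 7, 8]
levels = discretize_by_quartiles(expression)
print(levels)
```

Because the cut points are computed from each gene's own distribution, the same rule can be applied to a new dataset without choosing absolute thresholds.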
1347 00:52:06,244 --> 00:52:11,150 So I would be fairly confident if 1348 00:52:11,150 --> 00:52:13,700 this process was applied to 1349 00:52:13,700 --> 00:52:16,820 similar cancer tumor samples. 1350 00:52:16,820 --> 00:52:18,019 I feel like this would probably 1351 00:52:18,019 --> 00:52:20,119 generalize quite well. If we're 1352 00:52:20,119 --> 00:52:21,350 going to look at another health 1353 00:52:21,350 --> 00:52:25,579 condition, off the top of my head, 1354 00:52:25,579 --> 00:52:27,799 simply because I happen to have an OB-GYN in 1355 00:52:27,799 --> 00:52:29,449 the family, preeclampsia, that 1356 00:52:29,449 --> 00:52:31,894 affects a lot of pregnant women. If we had 1357 00:52:31,894 --> 00:52:35,190 tissue samples from the, 1358 00:52:36,490 --> 00:52:39,169 from the placenta and 1359 00:52:39,169 --> 00:52:42,109 then genetic screening was run on that, 1360 00:52:42,109 --> 00:52:43,625 I suspect that that would probably 1361 00:52:43,625 --> 00:52:45,784 be another complete round of 1362 00:52:45,784 --> 00:52:46,610 tuning a lot of 1363 00:52:46,610 --> 00:52:48,079 the different parameters, and there 1364 00:52:48,079 --> 00:52:50,330 may even need to be changes to the algorithm. 1365 00:52:50,330 --> 00:52:52,654 We simply don't know at this point. 1366 00:52:52,654 --> 00:52:54,889 Yeah, no, thanks for that. 1367 00:52:54,889 --> 00:52:56,359 And I was focusing 1368 00:52:56,359 --> 00:52:59,269 in on different types of cancer, really 1369 00:52:59,269 --> 00:53:02,359 just how well this 1370 00:53:02,359 --> 00:53:04,925 would predict on unseen data. 1371 00:53:04,925 --> 00:53:08,629 You trained on the current dataset 1372 00:53:08,629 --> 00:53:10,249 and then you held some out.
1373 00:53:10,249 --> 00:53:12,365 Did you do any testing to see how 1374 00:53:12,365 --> 00:53:15,940 well this process or 1375 00:53:15,940 --> 00:53:18,580 this method predicted on data 1376 00:53:18,580 --> 00:53:19,900 that was unseen even 1377 00:53:19,900 --> 00:53:22,764 from the same dataset, e.g. 1378 00:53:22,764 --> 00:53:24,655 as far as I know, 1379 00:53:24,655 --> 00:53:25,960 I came in at the tail end 1380 00:53:25,960 --> 00:53:27,429 of this and there was no 1381 00:53:27,429 --> 00:53:29,739 general training test split 1382 00:53:29,739 --> 00:53:31,975 that was done on the publicly available data. 1383 00:53:31,975 --> 00:53:33,909 There are certain cancer samples 1384 00:53:33,909 --> 00:53:35,725 where that would theirs. 1385 00:53:35,725 --> 00:53:38,274 If you have 500 entries that 1386 00:53:38,274 --> 00:53:39,370 a training test split 1387 00:53:39,370 --> 00:53:40,855 would be very applicable. 1388 00:53:40,855 --> 00:53:43,359 And that's something if if I can find 1389 00:53:43,359 --> 00:53:45,220 more time besides the jobs 1390 00:53:45,220 --> 00:53:46,870 that I'm doing to keep food on the table. 1391 00:53:46,870 --> 00:53:47,800 I'd love to, I'd love to 1392 00:53:47,800 --> 00:53:49,059 actually do that and see 1393 00:53:49,059 --> 00:53:50,320 how well the signature actually 1394 00:53:50,320 --> 00:53:52,520 works on the testing data. 1395 00:53:53,700 --> 00:53:56,770 Yeah, I think that'd be interesting for 1396 00:53:56,770 --> 00:53:58,954 this to see how that does. 1397 00:53:58,954 --> 00:54:01,099 Just I didn't I 1398 00:54:01,099 --> 00:54:03,049 don't have the full process in my mind, 1399 00:54:03,049 --> 00:54:04,624 but I thought you mentioned 1400 00:54:04,624 --> 00:54:07,699 the 50% here and 60% there. 1401 00:54:07,699 --> 00:54:10,130 There were several areas within the process 1402 00:54:10,130 --> 00:54:13,894 that it appeared there was tuning going on. 
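For reference, the train/test split discussed above can be sketched in a few lines. This is a generic illustration, not something that was run on the data in the talk; the function name, the 20% test fraction, and the fixed seed are assumptions chosen for the example.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Randomly partition samples into disjoint training and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    train = [samples[i] for i in train_idx]
    test = [samples[i] for i in test_idx]
    return train, test


# Hypothetical cohort of 500 patient identifiers.
cohort = list(range(500))
train, test = train_test_split(cohort, test_fraction=0.2)
print(len(train), len(test))
```

The point of the split is that any tuning (cut-offs, grouping sensitivity, signature selection) uses only the training portion, and the resulting score is then evaluated once on the held-out portion.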
1403 00:54:13,894 --> 00:54:16,609 And if you're doing that in a way 1404 00:54:16,609 --> 00:54:20,449 that's tuning to optimize your training data, 1405 00:54:20,449 --> 00:54:23,014 you would think that that would 1406 00:54:23,014 --> 00:54:25,370 have maybe a negative effect 1407 00:54:25,370 --> 00:54:28,079 once it's applied to test data. 1408 00:54:32,440 --> 00:54:37,610 In the kind of machine learning that I do, 1409 00:54:37,610 --> 00:54:39,770 this reconstructability 1410 00:54:39,770 --> 00:54:41,854 analysis and OCCAM. 1411 00:54:41,854 --> 00:54:45,725 Because generally, when you think of a model, 1412 00:54:45,725 --> 00:54:47,074 you think of a trade-off 1413 00:54:47,074 --> 00:54:52,115 between accuracy and complexity of the model. 1414 00:54:52,115 --> 00:54:54,875 And the overfitting problem 1415 00:54:54,875 --> 00:54:58,670 is the problem of having too complex a model. 1416 00:54:58,670 --> 00:55:02,870 Now in the, in the methodology that I use, 1417 00:55:02,870 --> 00:55:04,700 you can, you can 1418 00:55:04,700 --> 00:55:07,820 talk about how complex the model is. 1419 00:55:07,820 --> 00:55:09,679 You can quantify that and 1420 00:55:09,679 --> 00:55:12,229 trade it off against accuracy. 1421 00:55:12,229 --> 00:55:15,365 But the question is, in LUST, 1422 00:55:15,365 --> 00:55:17,359 could you talk about how 1423 00:55:17,359 --> 00:55:21,169 complex your model is? 1424 00:55:21,169 --> 00:55:23,605 I mean, could, could you have different LUST 1425 00:55:23,605 --> 00:55:27,754 models of varying complexity? 1426 00:55:27,754 --> 00:55:29,719 And the very complex will do 1427 00:55:29,719 --> 00:55:31,489 very well on the training data, 1428 00:55:31,489 --> 00:55:34,075 but not so well on the test data. 1429 00:55:34,075 --> 00:55:36,004 It seems that, it sounded like 1430 00:55:36,004 --> 00:55:38,240 you got one model.
1431 00:55:38,240 --> 00:55:42,514 But maybe if you change the tuning, 1432 00:55:42,514 --> 00:55:46,040 you could get multiple models 1433 00:55:46,040 --> 00:55:48,260 and they would be different, 1434 00:55:48,260 --> 00:55:51,410 perhaps in complexity, somehow, 1435 00:55:51,410 --> 00:55:54,949 if that word could be made tangible. 1436 00:55:54,949 --> 00:55:58,204 And then the very complex ones, 1437 00:55:58,204 --> 00:56:00,634 I would presume, would predict better, 1438 00:56:00,634 --> 00:56:02,690 would do better in 1439 00:56:02,690 --> 00:56:05,464 your confusion matrix there. 1440 00:56:05,464 --> 00:56:09,079 And then, so is there some notion in 1441 00:56:09,079 --> 00:56:10,850 the LUST algorithm of 1442 00:56:10,850 --> 00:56:13,505 the complexity of a model? 1443 00:56:13,505 --> 00:56:16,039 Not as of yet. 1444 00:56:16,039 --> 00:56:19,309 Those of us who were working on it, 1445 00:56:19,309 --> 00:56:21,439 specifically JB and I, were not 1446 00:56:21,439 --> 00:56:24,320 trained machine-learning analysts. 1447 00:56:24,320 --> 00:56:27,440 So we, at least up until now, 1448 00:56:27,440 --> 00:56:28,610 haven't tried to use any 1449 00:56:28,610 --> 00:56:29,690 of the standard measures of 1450 00:56:29,690 --> 00:56:32,254 complexity on this algorithm. 1451 00:56:32,254 --> 00:56:33,800 But that would certainly be 1452 00:56:33,800 --> 00:56:35,480 another great avenue to see 1453 00:56:35,480 --> 00:56:37,385 if this is generalizable: 1454 00:56:37,385 --> 00:56:40,115 how complex are we making it, 1455 00:56:40,115 --> 00:56:42,814 and how is that going to affect it 1456 00:56:42,814 --> 00:56:46,415 as we try to apply it to other areas?
1457 00:56:46,415 --> 00:56:48,214 You know, at first thought, 1458 00:56:48,214 --> 00:56:50,509 if you've had a whole bunch of things and had 1459 00:56:50,509 --> 00:56:52,939 many independent things combined, 1460 00:56:52,939 --> 00:56:54,500 that's a little bit more complex 1461 00:56:54,500 --> 00:56:55,774 than just two things. 1462 00:56:55,774 --> 00:56:57,019 I don't know if I'm 1463 00:56:57,019 --> 00:56:58,129 making any sense at all. 1464 00:56:58,129 --> 00:56:59,479 But it does seem like there should be 1465 00:56:59,479 --> 00:57:02,674 some way to interpret 1466 00:57:02,674 --> 00:57:05,630 the number of independent things 1467 00:57:05,630 --> 00:57:07,969 as an indicator of complexity. 1468 00:57:07,969 --> 00:57:09,390 Maybe. 1469 00:57:11,100 --> 00:57:13,449 That is an interesting notion. 1470 00:57:13,449 --> 00:57:14,319 Actually, I'll have to give that 1471 00:57:14,319 --> 00:57:15,504 some more thought if I can, 1472 00:57:15,504 --> 00:57:17,065 if I can spare the time. 1473 00:57:17,065 --> 00:57:17,950 Hey, Tristan. 1474 00:57:17,950 --> 00:57:19,630 One idea that might work 1475 00:57:19,630 --> 00:57:21,579 for your train/test splits, 1476 00:57:21,579 --> 00:57:23,230 since you have a small dataset, is you can 1477 00:57:23,230 --> 00:57:27,159 use the leave-one-out algorithm. 1478 00:57:27,159 --> 00:57:29,259 Do the entire, you'd 1479 00:57:29,259 --> 00:57:30,880 have to write an algorithm that basically 1480 00:57:30,880 --> 00:57:32,589 did your entire dataset 1481 00:57:32,589 --> 00:57:34,794 as many times as you have. 1482 00:57:34,794 --> 00:57:37,059 Like you've got 250 rows, 1483 00:57:37,059 --> 00:57:37,779 so you're going to run it 1484 00:57:37,779 --> 00:57:39,504 249 times or whatever. 1485 00:57:39,504 --> 00:57:41,349 And you just leave a different one out 1486 00:57:41,349 --> 00:57:42,129 each time and then you 1487 00:57:42,129 --> 00:57:43,405 can compare the results.
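The leave-one-out idea just described can be sketched as follows. With n rows the fit is repeated n times, each time holding out a different single row and predicting it from the rest. The threshold classifier used here is only a toy stand-in so the example runs on its own; it is not the algorithm from the talk, and all names and data are hypothetical.

```python
def leave_one_out_accuracy(rows, labels, fit, predict):
    """Fit on all rows but one, predict the held-out row, repeat for every row."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train_rows, train_labels)
        if predict(model, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


def fit_threshold(scores, labels):
    """Toy model: threshold halfway between the two class means."""
    high = [s for s, y in zip(scores, labels) if y == 1]
    low = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(high) / len(high) + sum(low) / len(low)) / 2


def predict_threshold(threshold, score):
    """Predict class 1 when the score exceeds the fitted threshold."""
    return 1 if score > threshold else 0


# Hypothetical scores and binary outcomes for six patients.
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
outcomes = [0, 0, 0, 1, 1, 1]

accuracy = leave_one_out_accuracy(scores, outcomes, fit_threshold, predict_threshold)
print(accuracy)
```

For an honest estimate, every tuned step (discretization cut-offs, grouping, signature selection) would need to be redone inside `fit` on each pass, which is why the cost grows with the number of rows.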
1488 00:57:43,405 --> 00:57:44,815 I've seen some software like 1489 00:57:44,815 --> 00:57:47,139 orange data mining, does that. 1490 00:57:47,139 --> 00:57:49,479 Fascinating? Yeah, with access 1491 00:57:49,479 --> 00:57:51,309 to greater computing resources, 1492 00:57:51,309 --> 00:57:53,420 even, even free accounts on the 1493 00:57:53,420 --> 00:57:54,649 Cloud these days give 1494 00:57:54,649 --> 00:57:56,329 you quite a bit of computing. 1495 00:57:56,329 --> 00:57:57,979 We might be able to run that 1496 00:57:57,979 --> 00:57:59,780 on a small dataset in a, 1497 00:57:59,780 --> 00:58:01,505 in a decent timeframe. Thank you. 1498 00:58:01,505 --> 00:58:02,314 Yeah. 1499 00:58:02,314 --> 00:58:03,740 Yeah, you can you can get 1500 00:58:03,740 --> 00:58:06,149 a free RStudio account. 1501 00:58:06,670 --> 00:58:08,569 I'm gonna go ahead and 1502 00:58:08,569 --> 00:58:09,740 turn off the recording and 1503 00:58:09,740 --> 00:58:12,154 feels like we're sort of the saturating here. 1504 00:58:12,154 --> 00:58:14,690 And then if people want to hang out with 1505 00:58:14,690 --> 00:58:16,009 those few parting thoughts 1506 00:58:16,009 --> 00:58:18,149 and questions, that'd be great.