Linux command line for a large data set?
The problem: I have a large Excel file of data with over 1000 columns and over 40,000 rows. I need to identify which rows have a value of >199 in any cell. If a row does not have a value of >199 in any cell, I want to delete that row, so that I am left with only the rows where at least one cell has a value of >199.
I also have the same data file as a text file, so I was thinking the best approach might be to use the Linux command line for this rather than the Excel file (which is bulky to work with given the number of rows and columns). But I am a novice at Linux and awk, so I am looking for general advice on how to approach this issue. Thanks so much for your help.
An example image of the data set is below. Here I would want only the rows that have highlighted cells (because those values are >200), but I can't just use the sort function or complicated if/then statements, because there are so many columns in my data set that would be too time-consuming...
[example image of the data set, with the cells >200 highlighted]
microsoft-excel-2010
edited Nov 27 at 1:30
asked Nov 27 at 0:45 by Anna
I don't understand. Is this a question about Excel? Have you exported your data from Excel to a text file? Details, please. Please do not respond in comments; edit your question to make it clearer and more complete.
– Scott
Nov 27 at 1:02
Anna did describe her/his case clearly. She/he has "the same data file as a text file" and wants advice on how to approach parsing this text file on the Linux command line, and which combination of tools to use for that: should piping with several commands be used, or some other approach? I am interested in this also. Thank you for the question.
– titus
Nov 27 at 1:28
@titus: I would expect somebody who has been on the site for eight years to know about edits and the revision history. If you look, you'll see that the "same data file as a text file" statement was added after I made my comment.
– Scott
Nov 27 at 3:09
2 Answers
Since you said you are "looking for general advice of how to approach this issue", here is one approach:
If you know how to use Python, you could save the file as a comma-separated (CSV) file and work through it with a small script, using Python's csv module to read and filter the data. This works on any operating system that supports Python.
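For example, here is a minimal sketch of that idea (assumptions: the export is a comma-separated file called data.csv with a header row on line 1, and blank or non-numeric cells should simply be ignored; the file names are placeholders, not anything from the question):

import csv

# Keep only the rows in which at least one cell holds a numeric value > 199.
# Assumption: input is "data.csv" with a header row; output goes to "filtered.csv".
with open("data.csv", newline="") as src, open("filtered.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)

    writer.writerow(next(reader))  # copy the header row unchanged

    for row in reader:
        numbers = []
        for cell in row:
            try:
                numbers.append(float(cell))
            except ValueError:
                pass  # skip blanks and non-numeric cells
        if any(value > 199 for value in numbers):
            writer.writerow(row)

The filtered rows end up in filtered.csv, so the original file is left untouched.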
answered Nov 27 at 1:29 by Elmo
I am just learning Python, so I haven't written a script before. What script would do this?
– Anna
Nov 27 at 1:34
You could, for example, use the "xlrd" module, in particular its sheet.nrows and sheet.row_values(n) methods. A small tutorial for using xlrd is described on geeksforgeeks.org/reading-excel-file-using-python
– Christoph Sommer
Nov 27 at 1:46
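For reference, a rough sketch of that xlrd approach (assumptions: the data is on the workbook's first sheet with a header row in row 0, and the file is named data.xls; note that recent xlrd releases only read the older .xls format, not .xlsx):

import xlrd  # third-party module: pip install xlrd

# Sketch only: print the rows in which at least one cell holds a numeric value > 199.
book = xlrd.open_workbook("data.xls")
sheet = book.sheet_by_index(0)

for n in range(1, sheet.nrows):          # skip the header row (row 0)
    row = sheet.row_values(n)
    numeric = [v for v in row if isinstance(v, (int, float))]
    if any(v > 199 for v in numeric):
        print(row)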
I don't have much general advice.
Specifically, I advise you to use this awk command:
awk '{
  over = 0                                 # assume no value in this row is over the threshold
  for (i = 1; i <= NF; i++)                # examine every field (column) in the row
      if ($i > 199) over = 1               # flag the row if any value exceeds 199
  if (over) print                          # print (i.e., keep) only flagged rows
}'
I created a small data file based on numbers from your file, plus a few I made up on my own. (In the examples below, ./myscript stands for the awk command above, saved in an executable script that passes its file arguments on to awk.)
$ cat input
81 23 40
31 0 416 12
2 2 1
157 41 80 201
417 42 17
$ ./myscript input
31 0 416 12
157 41 80 201
417 42 17
To delete rows from your file, do
$ ./myscript input > input.new
$ mv input.new input
Notes:
- For your own sake, you should decide whether your requirement is > 199, > 200, ≥ 200, or what.
- If you need to keep Row 1 (i.e., line 1, the header row), say so.
- I haven't tested this on a large file. awk shouldn't have any problem with a huge number of rows (lines). A thousand columns (fields) might be an issue, but I doubt it.
answered Nov 27 at 3:05 by Scott