unix - split a huge .gz file by line
I'm sure someone has had this need before: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into three 40-million-line files, i.e. calling it like:
bash splitter.sh hugefile.txt.gz 40000000 1
would get lines 1 to 40 million
bash splitter.sh hugefile.txt.gz 40000000 2
would get lines 40 million to 80 million
bash splitter.sh hugefile.txt.gz 40000000 3
would get lines 80 million to 120 million
Is doing a series of these a solution, or would the gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)?
gunzip -c hugefile.txt.gz | head -n 40000000
Note: I can't get extra disk.
Thanks!
linux perl bash shell unix
asked Jan 23 '12 at 11:21 – toop
migrated from stackoverflow.com Jan 23 '12 at 11:28
Do you want the resulting files to be gzipped again?
– user95605
Jan 23 '12 at 11:25
You can use gunzip in a pipe. The rest can be done with head and tail.
– Ingo
Jan 23 '12 at 11:25
@Tichodroma - no, I don't need them gzipped again. But I could not store all the split text files at once. So I would like to get the first split, do stuff with it, then delete the first split, then get the 2nd split, etc., finally removing the original gz.
– toop
Jan 23 '12 at 11:42
@toop: Thanks for the clarification. Note that it's generally better to edit your question if you want to clarify it, rather than put it into a comment; that way everyone will see it.
– sleske
Jan 23 '12 at 12:06
The accepted answer is good if you only want a fraction of the chunks, and do not know them in advance. If you want to generate all the chunks at once, the solutions based on split will be a lot faster (O(N) instead of O(N²)).
– b0fh
Aug 13 '14 at 15:27
7 Answers
How to do this best depends on what you want:
- Do you want to extract a single part of the large file?
- Or do you want to create all the parts in one go?
If you want a single part of the file, your idea to use gunzip and head is right. You can use:
gunzip -c hugefile.txt.gz | head -n 40000000
That would output the first 40,000,000 lines on standard output - you probably want to append another pipe to actually do something with the data.
To get the other parts, you'd use a combination of head and tail, like:
gunzip -c hugefile.txt.gz | head -n 80000000 | tail -n 40000000
to get the second block.
"Is perhaps doing a series of these a solution or would the gunzip -c require enough space for the entire file to be unzipped?"
No, gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.
If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
answered Jan 23 '12 at 11:29, edited Sep 7 '16 – sleske
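For reference, a minimal sketch of the splitter.sh interface described in the question, built on exactly this gunzip/head/tail approach; the script name and argument order are simply the ones the question proposed, not an existing tool:

#!/usr/bin/env bash
# splitter.sh FILE.gz LINES CHUNK
# Print chunk number CHUNK (1-based), LINES lines long, of the gzipped FILE.
# Example: bash splitter.sh hugefile.txt.gz 40000000 2
file=$1
lines=$2
chunk=$3
start=$(( (chunk - 1) * lines + 1 ))
# Stream-decompress and cut out the requested range; nothing is written to disk.
gunzip -c "$file" | tail -n "+$start" | head -n "$lines"

For chunk 1 this becomes tail -n +1, which simply passes the whole stream through to head.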
From a performance point of view: does gzip actually unzip the whole file? Or is it able to "magically" know that only 40 million lines are needed?
– Alois Mahdal
Mar 22 '12 at 12:57
@AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.
– sleske
Mar 22 '12 at 15:26
But if you are interested, you should really ask this as a separate question.
– sleske
Mar 22 '12 at 15:35
Pipe to split; use either gunzip -c or zcat to open the file:
gunzip -c bigfile.gz | split -l 400000
Add output specifications to the split command.
answered Jan 23 '12 at 16:41 – jim mcnamara
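As an illustration (the part_ prefix and numeric suffixes are just one possible choice, not something the question requires), with GNU split the output specification could look like:

# write 40,000,000-line chunks named part_00, part_01, ... in the current directory
gunzip -c hugefile.txt.gz | split -l 40000000 -d - part_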
This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.
– b0fh
Aug 13 '14 at 15:29
@b0fh: Yes, you are right. Upvoted, and referenced in my answer :-).
– sleske
Sep 7 '16 at 7:20
Best answer for sure.
– Stephen Blum
Mar 7 '18 at 21:27
what are the output specs so that the outputs are .gz files themselves?
– Quetzalcoatl
Oct 26 '18 at 5:53
As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.
zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
answered Jan 23 '12 at 11:33 – zgpmax
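To match the workflow described in the question's comments (extract one chunk, process it, delete it, move on), these commands could be driven from a small loop; do_something and the part$chunk.txt file names are placeholders for whatever processing is actually needed:

for chunk in 1 2 3; do
  start=$(( (chunk - 1) * 40000000 + 1 ))
  zcat hugefile.txt.gz | tail -n "+$start" | head -n 40000000 > part$chunk.txt
  do_something part$chunk.txt   # placeholder for the real processing step
  rm part$chunk.txt             # free the space before extracting the next chunk
done

Note that each pass re-reads and re-decompresses the stream from the beginning, which is the O(N²) behaviour b0fh's comment warns about; the split-based answers avoid it by reading the input only once.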
I'd consider using split.
split a file into pieces
answered Jan 23 '12 at 11:24 – Michael Krelin - hacker
Here's a Python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames and the current line, plus a little overhead.
#!/usr/bin/env python
import gzip, bz2
import os
import fnmatch
import argparse

def gen_find(filepat, top):
    # walk the directory tree and yield paths matching the glob pattern
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

def gen_open(filenames):
    # open each file with the right decompressor, based on its extension
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name)

def gen_cat(sources):
    # chain the open files into a single stream of lines
    for s in sources:
        for item in s:
            yield item

def main(regex, searchDir):
    fileNames = gen_find(regex, searchDir)
    fileHandles = gen_open(fileNames)
    fileLines = gen_cat(fileHandles)
    for line in fileLines:
        print line

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Search globbed files line by line')
    parser.add_argument('--version', action='version', version='%(prog)s 1.0')
    parser.add_argument('regex', type=str, default='*', help='glob pattern for the input files, e.g. "*.gz"')
    parser.add_argument('searchDir', type=str, default='.', help='directory to search')
    args = parser.parse_args()
    main(args.regex, args.searchDir)
The print line command will send every line to standard output, so you can redirect it to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the Python script and you won't need to leave chunks of the file lying around.
answered Jan 23 '12 at 13:39 – Spencer Rathbun
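Assuming the script is saved as, say, catlines.py and pointed at a hypothetical directory /data (both names are just for illustration), its output could be fed straight into split from the other answers:

# stream every matching .gz under /data and cut the combined stream into 40,000,000-line chunks
python catlines.py '*.gz' /data | split -l 40000000 -d - part_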
Here's a Perl program that can be used to read stdin and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with
zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'
Sorry, the command-line processing is a little kludgy, but you get the idea.
#!/usr/bin/perl -w
#####
# xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
# Usage: perl xsplit.pl LINES 'COMMAND'
# where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
#   'cat > tmp$SPLIT.txt'
# or:
#   'gzip > tmp$SPLIT.gz'
#####

use strict;

sub pipeHandler {
    my $sig = shift @_;
    print "Caught SIGPIPE: $sig\n";
    exit(1);
}
$SIG{PIPE} = \&pipeHandler;

my $LINES = shift;
die "LINES must be a positive number\n" if ($LINES <= 0);
my $COMMAND = shift || die "second argument should be COMMAND\n";

my $line_number = 0;

while (<STDIN>) {
    if ($line_number % $LINES == 0) {
        close OUTFILE;
        my $split = $ENV{SPLIT} = sprintf("%05d", $line_number / $LINES + 1);
        print "$split\n";
        my $command = $COMMAND;
        open (OUTFILE, "| $command") or die "failed to write to command '$command'\n";
    }
    print OUTFILE $_;
    $line_number++;
}

exit 0;

answered Jan 23 '12 at 20:54 – Liudvikas Bukys
Directly split a .gz file into .gz files:
zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'
I think this is what the OP wanted, because they don't have much space.
answered Feb 24 at 10:53 – siulkilulki